Stata Reference Manual: Release 7, Volume 2, H-P. Stata Press, College Station, Texas
Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

Copyright (c) 1985-2001 by Stata Corporation
All rights reserved
Version 7.0

Typeset in TeX
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 1-881228-47-9 (volumes 1-4)
ISBN 1-881228-48-7 (volume 1)
ISBN 1-881228-49-5 (volume 2)
ISBN 1-881228-50-9 (volume 3)
ISBN 1-881228-51-7 (volume 4)

This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of Stata Corporation (StataCorp).

StataCorp provides this manual "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without notice.

The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.

The automobile dataset appearing on the accompanying media is Copyright (c) 1979, 1993 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979, April 1993.

The Stata for Windows installation software was produced using Wise Installation System, which is Copyright (c) 1994-2000 Wise Solutions, Inc. Portions of the Macintosh installation software are Copyright (c) 1990-2000 Aladdin Systems, Inc., and Raymond Lau.

Stata is a registered trademark and NetCourse is a trademark of Stata Corporation. Alpha and DEC are trademarks of Compaq Computer Corporation. AT&T is a registered trademark of American Telephone and Telegraph Company. HP-UX and HP LaserJet are registered trademarks of Hewlett-Packard Company. IBM and OS/2 are registered trademarks and AIX, PowerPC, and RISC System/6000 are trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. Macintosh is a registered trademark and Power Macintosh is a trademark of Apple Computer, Inc. MS-DOS, Microsoft, and Windows are registered trademarks of Microsoft Corporation. Pentium is a trademark of Intel Corporation. PostScript and Display PostScript are registered trademarks of Adobe Systems, Inc. SPARC is a registered trademark of SPARC International, Inc. Stat/Transfer is a trademark of Circle Systems. Sun, SunOS, SunView, Solaris, and NFS are trademarks or registered trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. UNIX and OPEN LOOK are registered trademarks and X Window System is a trademark of The Open Group Limited. WordPerfect is a registered trademark of Corel Corporation.

The suggested citation for this software is StataCorp. 2001. Stata Statistical Software: Release 7.0. College Station, TX: Stata Corporation.
Title

hadimvo -- Identify multivariate outliers
Syntax

    hadimvo varlist [if exp] [in range], generate(newvar1 [newvar2]) [p(#)]
Description

hadimvo identifies multiple outliers in multivariate data using the method of Hadi (1992, 1994), creating newvar1 equal to 1 if an observation is an "outlier" and 0 otherwise. Optionally, newvar2 can also be created containing the distances from the basic subset.
Options

generate(newvar1 [newvar2]) is not optional; it identifies the new variable(s) to be created. Whether you specify two variables or one, however, is optional. newvar1, which is required, will contain 1 if the observation is an outlier in the Hadi sense and 0 otherwise. Specifying gen(odd) would call this variable odd. newvar2, if specified, will contain the distances (not the distances squared) from the basic subset. Specifying gen(odd dist) creates odd and also creates dist containing the Hadi distances.

p(#) specifies the significance level for the outlier cutoff; 0 < # < 1. The default is p(.05). Larger numbers identify a larger proportion of the sample as outliers. If # is greater than 1, it is interpreted as a percent. Thus, p(5) is the same as p(.05).
Remarks

Multivariate analysis techniques are commonly used to analyze data from many fields of study. The data often contain outliers. The search for subsets of the data which, if deleted, would change results markedly is known as the search for outliers. hadimvo provides one computer-intensive but practical method for identifying such observations.

Classical outlier detection methods (e.g., Mahalanobis distance and Wilks' test) are powerful when the data contain only one outlier, but the power of these methods decreases drastically when more than one outlying observation is present. The loss of power is usually due to what are known as masking and swamping problems (false negative and false positive decisions), but in addition, these methods often fail simply because they are affected by the very observations they are supposed to identify.

Solutions to these problems often involve an unreasonable amount of calculation and therefore computer time. (Solutions involving hundreds of millions of calculations for samples as small as 30 have been suggested.) The method developed by Hadi (1992, 1994) attempts to surmount these problems and produce an answer, albeit second best, in finite time.

A basic outline of the procedure is as follows: A measure of distance from an observation to a cluster of points is defined. A base cluster of r points is selected and then that cluster is continually redefined by taking the r + 1 points "closest" to the cluster as the new base cluster. This continues until some rule stops the redefinition of the cluster.
Ignoring many of the fine details, given k variables, the initial base cluster is defined as r = k + 1 points. The distance that is minimized in selecting these k + 1 points is a covariance-matrix distance on the variables with their medians removed. (We will use the language loosely; if we were being more precise, we would have said the distance is based on a matrix of second moments, but remember, the medians of the variables have been removed. We would also discuss how the k + 1 points must be of full column rank and how they would be expanded to include additional points if they are not.) Given the base cluster, a more standard mean-based center of the r-observation cluster is defined and the r + 1 observations closest in the covariance-matrix sense are chosen as a new base cluster. This is then repeated until the base cluster has r = int{(n + k + 1)/2} points. At this point, the method continues in much the same way, except a stopping rule based on the distance of the additional point and the user-specified p() is introduced. Simulation results are presented in Hadi (1994).
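The forward search just outlined can be sketched in a few lines of Python. This is an illustrative simplification only, not the hadimvo implementation: it omits the full-column-rank checks, the small-sample correction factor applied to the covariance matrix of the clean subset, and the exact p()-based stopping rule, and it returns squared distances for the caller to threshold.

```python
import numpy as np

def hadi_distances(X):
    """Sketch of Hadi's forward search: squared distances from the final
    basic subset.  Illustrative only -- omits hadimvo's rank checks,
    small-sample scaling, and p()-based stopping rule."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    # Step 1: distances with medians removed; the metric is a matrix of
    # second moments about the medians, not a true covariance matrix
    Z = X - np.median(X, axis=0)
    d = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Z.T @ Z / (n - 1)), Z)
    base = np.argsort(d)[:k + 1]              # initial base cluster, r = k+1
    # Step 2: grow the cluster until it holds int((n + k + 1)/2) points
    while len(base) < (n + k + 1) // 2:
        Z = X - X[base].mean(axis=0)          # mean-based center of the cluster
        S = np.cov(X[base], rowvar=False)
        d = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)
        base = np.argsort(d)[:len(base) + 1]  # keep the r+1 closest points
    # Final distances measured from the basic ("clean") subset
    Z = X - X[base].mean(axis=0)
    S = np.cov(X[base], rowvar=False)
    return np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)
```

An observation would then be declared an outlier when its distance exceeds a cutoff tied to the chosen p(); in hadimvo the clean-subset covariance matrix is first rescaled by the small-sample correction factor discussed below, which this sketch skips, so its raw distances are not on the chi-squared scale.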
Examples

    . hadimvo price weight, gen(odd)
    . list if odd                            /* list the outliers */
    . summ price weight if ~odd              /* summary stats for clean data */
    . drop odd
    . hadimvo price weight, gen(odd D)
    . gen id=_n                              /* make an index variable */
    . graph D id                             /* index plot of D */
    . graph price weight [w=D]               /* 2-way scatter, outliers big */
    . graph price weight [w=1/D]             /* same, outliers small */
    . summarize D, detail
    . sort D
    . list make price weight D odd
    . hadimvo price weight mpg, gen(odd2 D2) p(.01)
    . regress price weight ... if ~odd2
Identifying outliers

You have a theory about x1, x2, ..., xk, which we will write as F(x1, x2, ..., xk). Your theory might be that x1, x2, ..., xk are jointly distributed normally, perhaps with a particular mean and covariance matrix; or your theory might be that

    x1 = b1 + b2*x2 + ... + bk*xk + u,    where u ~ N(0, s^2)

or your theory might be

    x1 = b10 + b12*x2 + b14*x4 + u1
    x2 = b20 + b21*x1 + b23*x3 + u2

or your theory might be anything else; it does not matter. You have some data on x1, x2, ..., xk, which you will assume are generated by F(.), and from that data you plan to estimate the parameters (if any) of your theory and then test your theory in the sense of how well it explains the observed data.
What if, however, some of your data are generated not by F(.) but by G(.), a different process? For example, you have a theory on how wages are assigned to employees in a firm and, for the bulk of employees, that theory is correct. There are, however, six employees at the top of the hierarchy for whom wages are set by a completely different process. Or, you have a theory on how individuals select different health insurance options except that, for a handful of individuals already diagnosed with serious illness, a different process controls the selection. Or, you are testing a drug that reduces trauma after surgery except that, for a few patients with a high level of a particular protein, the drug has no effect. Or, in another drug experiment, some of the historical data are simply misrecorded.

The data values generated by G(.) rather than F(.) are called contaminant observations. Of course, the analysis should be based only on the observations generated by F(.), but in practice we do not know which observations those are. In addition, if it happened by chance that some of the contaminants are within a reasonable distance from the center of F(.), it becomes impossible to determine whether they are contaminants. Accordingly, we adopt the following operational definition: Outliers are observations that do not conform to the pattern suggested by the majority of the observations in a dataset. Under this definition, observations generated by F(.) but located at the tail of F(.) are considered outliers. On the other hand, contaminants that are within a statistically reasonable distance from the center of F(.) are not considered outliers.

It is well worth noting that outliership is strongly related to the completeness of the theory: a grand unified theory would have no outliers because it would explain all processes (including, one supposes, errors in recording the data). Grand unified theories, however, are difficult to come by and are most often developed by synthesizing the results of many special theories.

Theoretical work has tended to focus on one special case: data containing only one outlier. As mentioned above, the single-outlier techniques often fail to identify multiple outliers, even if applied recursively. One of the classic examples is the star cluster data (a.k.a. the Hertzsprung-Russell diagram) shown in the figure below (Rousseeuw and Leroy 1987, 27). For 47 stars, the data contain the (log) light intensity and the (log) effective temperature at the star's surface. (For the sake of illustration, we treat the data here as bivariate data, not as regression data; i.e., the two variables are treated similarly with no distinction between which variable is dependent and which is independent.)

This graph presents a scatter of the data along with two ellipses expected to contain 95% of the data. The larger ellipse is based on the mean and covariance matrix of the full data. All 47 stars are inside the larger ellipse, indicating that classical single-case analysis fails to identify any outliers. The smaller ellipse is based on the mean and covariance matrix of the data without the five stars identified by hadimvo as outliers. These observations are located outside the smaller ellipse. The dramatic effects of the outliers can be seen by comparing the two ellipses. The volume of the larger ellipse is much greater than that of the smaller one and the two ellipses have completely different orientations. In fact, their major axes are nearly orthogonal to each other; the larger ellipse indicates a negative correlation (r = -0.2) whereas the smaller ellipse indicates a positive correlation (r = 0.7). (Theory would suggest a positive correlation: hot things glow.)
(Figure: scatterplot of log light intensity against log temperature for the 47 stars, with the two 95% ellipses; the five stars identified by hadimvo lie outside the smaller ellipse.)
The single-outlier techniques make calculations for each observation under the assumption that it is the only outlier and the remaining n - 1 observations are generated by F(.), producing a statistic for each of the n observations. Thinking about multiple outliers is no more difficult. In the case of two outliers, consider all possible pairs of observations (there are n(n - 1)/2 of them) and, for each pair, make a calculation assuming the remaining n - 2 observations are generated by F(.). For the three-outlier case, consider all possible triples of observations (there are n(n - 1)(n - 2)/(3 x 2) of them) and, for each triple, make a calculation assuming the remaining n - 3 observations are generated by F(.). Conceptually, this is easy but, practically, it is difficult because of the rapidly increasing number of calculations required (there are also theoretical problems in determining how many outliers to test simultaneously). Techniques designed for detecting multiple outliers, therefore, make various simplifying assumptions to reduce the calculation burden and, along the way, lose some of the theoretical foundation. This loss, however, is no reason for ignoring the problem and the (admittedly second best) solutions available today.

It is unreasonable to assume that outliers do not occur in real data. If outliers exist in the data, they can distort parameter estimation, invalidate test statistics, and lead to incorrect statistical inference. The search for outliers is not merely to improve the estimates of the current model but also to provide valuable insight into the shortcomings of the current model. In addition, outliers themselves can sometimes provide valuable clues as to where more effort should be expended. In a drug experiment, for example, the patients excluded as outliers might well be further researched to understand why they do not fit the theory.
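The combinatorial burden described above is easy to tabulate; a few lines of Python (illustrative only) count the candidate subsets whose deletion would have to be examined, using the binomial coefficient:

```python
from math import comb

n = 30  # even a small sample
for m in (1, 2, 3):
    # number of size-m candidate outlier sets to consider:
    # comb(n, m) = n! / (m! (n-m)!)
    print(m, comb(n, m))
```

Already for triples the count is in the thousands (comb(30, 3) = 4060), and allowing subsets of every size makes the total grow exponentially in n, which is why the multiple-outlier methods discussed here trade exhaustive search for simplifying assumptions.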
Multivariate, multiple outliers

hadimvo is an example of a multivariate, multiple-outlier technique. The multivariate aspect deserves some attention. In the single-equation regression techniques for identifying outliers, such as residuals and leverage, an important distinction is drawn between the dependent and independent variables, the y and the x's in y = xb + u. The notion that y is a linear function of x can be exploited and, moreover, the fact that some point (y_i, x_i) is "far" from the bulk of the other points has different meanings if that "farness" is due to y_i or x_i. A point that is far due to x_i but, despite that, still close in the y_i-given-x_i metric adds precision to the measurements of the coefficients and may not indicate a problem at all. In fact, if we have the luxury of designing the experiment, which means choosing the values of x a priori, we attempt to maximize the distance between the x's (within
the bounds of x we believe are covered by our linear model) to maximize that precision. In that extreme case, the distance of x_i carries no information as we set it prior to running the experiment. More recently, Hadi and Simonoff (1993) exploit the structure of the linear model and suggest two methods for identifying multiple outliers when the model is fitted to the data (also see [R] regression diagnostics).

In the multivariate case, we do not know the structure of the model, so (y_i, x_i) is just a point and the y is treated no differently from any of the x's, a fact which we emphasize by writing the point as (x_1i, x_2i) or simply (X_i). The technique does assume, however, that the X's are multivariate normal or at least elliptically symmetric. This leads to a problem if some of the X's are functionally related to the other X's, such as the inclusion of x and x^2, interactions such as x1*x2, or even dummy variables for multiple categories (in which one of the dummies being 1 means the other dummies must be 0). There is no good solution to this problem. One idea, however, is to perform the analysis with and without the functionally related variables and to subject all observations identified to further study (see What to do with outliers below).
An implication of hadimvo being a multivariate technique is that it would be inappropriate to apply it to (y, x) when x is the result of experimental design. The technique would know nothing of our design of x and would inappropriately treat "distance" in the x-metric the same as distance in the y-metric. Even when x is multivariate normal, unless y and x are treated similarly, it may still be inappropriate to apply hadimvo to (y, x) because of the different roles that y and x play in regression. However, one may apply hadimvo to x to identify outliers which, in this case, are called leverage points. (We should also mention here that if hadimvo is applied to x when it contains constants or any collinear variables, those variables will be correctly ignored, allowing the analysis to continue.)

It is also inappropriate to use hadimvo (and other outlier detection techniques) when the sample size is too small. hadimvo uses a small-sample correction factor to adjust the covariance matrix of the "clean" subset. Because the quantity n - (3k + 1) appears in the denominator of the correction factor, the sample size must be larger than 3k + 1. Some authors would require the sample size to be at least 5k, i.e., at least five observations per variable.

With these warnings, it is difficult to misapply this tool, assuming that you do not take the results as more than suggestive. hadimvo has a p() option that is a "significance level" for the outliers that are chosen. We quote the term significance level because, although great effort has been expended to make it a significance level, approximations are involved and it will not really have that interpretation in all cases. It can be thought of as an index between 0 and 1, with increasing values resulting in the labeling of more observations as outliers, and with the suggestion that you select a number much as you would a significance level--it is roughly the probability of identifying any given point as an outlier if the data truly were multivariate normal. Nevertheless, the terms significance level and critical values should be taken with a grain of salt. It is suggested that one examine a graphical display (e.g., an index plot) of the distances with perhaps different values of p(). The graphs give more information than a simple yes/no answer. For example, the graph may indicate that some of the observations (inliers or outliers) are only marginally so.
What to do with outliers

After a reading of the literature on outlier detection, many people are left with the incorrect impression that once outliers are identified, they should be deleted from the data and analysis should be continued. Automatic deletion (or even automatic down-weighting) of outliers is not always correct because outliers are not necessarily bad observations. On the contrary, if they are correct, they may be the most informative points in the data. For example, they may indicate that the data do not come from a normally distributed population, as is commonly assumed by almost all multivariate techniques.
The proper use of this tool is to label the outliers and then subject them to further study, not simply to discard them and continue the analysis with the rest of the data. After further study, it may indeed turn out to be reasonable to discard the outliers, but some mention of the outliers must certainly be made in the presentation of the final results. Other corrective actions may include correction of errors in the data, deletion or down-weighting of outliers, redesigning the experiment or sample survey, collecting more data, etc.
Saved Results

hadimvo saves in r():

Scalars
    r(N)      number of outliers
Methods and Formulas

hadimvo is implemented as an ado-file.

Formulas are given in Hadi (1992, 1994).
Acknowledgment

We would like to thank Ali S. Hadi of Cornell University for his assistance in writing hadimvo.
References

Gould, W. W. and A. S. Hadi. 1993. smv6: Identifying multivariate outliers. Stata Technical Bulletin 11: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 163-168.

Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B 54: 761-771.

------. 1994. A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B 56: 393-396.

Hadi, A. S. and J. S. Simonoff. 1993. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association 88: 1264-1272.

Rousseeuw, P. J. and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: John Wiley & Sons.
Also See

Related:  [R] mvreg, [R] regression diagnostics, [R] sureg
Title

hausman -- Hausman specification test
Syntax

    hausman, save

    hausman [, {more | less} constant alleqs skipeqs(eqlist) sigmamore
        prior(string) current(string) equations(matchlist)]

    hausman, clear

where matchlist in equations() is

    #:#[, #:#[, ...]]

For instance, equations(1:1), equations(1:1, 2:2), or equations(1:2).
Description

hausman performs Hausman's (1978) specification test.
Options

save requests that Stata save the current estimation results. hausman will later compare these results with the estimation results from another model. A model must be saved in this fashion before a test against other models can be performed.

more specifies that the most recently estimated model is the more efficient estimate. This is the default.

less specifies that the most recently estimated model is the less efficient estimate.

constant specifies that the estimated intercept(s) are to be included in the model comparison; by default, they are excluded. The default behavior is appropriate for models where the constant does not have a common interpretation across the two models.

alleqs specifies that all the equations in the model be used to perform the Hausman test; by default, only the first equation is used.

skipeqs(eqlist) specifies in eqlist the names of equations to be excluded from the test. Equation numbers are not allowed in this context, as it is the equation names, along with the variable names, that are used to identify common coefficients.

sigmamore allows you to specify that the two covariance matrices used in the test be based on a common estimate of disturbance variance (sigma^2), the variance from the fully efficient estimator. This option provides a proper estimate of the contrast variance for so-called tests of exogeneity and over-identification in instrumental variables regression; see Baltagi (1998, 291). Note that this option can only be specified when both estimators save e(sigma) or e(rmse).

prior(string) and current(string) are formatting options that allow you to specify alternate wording for the "Prior" and "Current" default labels used to identify the columns of coefficients.
equations(matchlist) specifies, by number, the pairs of equations that are to be compared. If equations() is not specified, then equations are matched on equation names. equations() handles the situation where one estimator uses equation names and the other does not. For instance, equations(1:2) means equation 1 is to be tested against equation 2. equations(1:1, 2:2) means equation 1 is to be tested against equation 1 and equation 2 is to be tested against equation 2. If equations() is specified, options alleqs and skipeqs are ignored.
clear discards the previously saved estimation results and frees some memory; it is not necessary to specify hausman, clear before specifying hausman, save.
Remarks

hausman is a general implementation of Hausman's (1978) specification test that compares an estimator that is known to be consistent with an estimator that is efficient under the assumption being tested. The null hypothesis is that the efficient estimator is a consistent and efficient estimator of the true parameters. If it is, there should be no systematic difference between the coefficients of the efficient estimator and a comparison estimator that is known to be consistent for the true parameters. If the two models display a systematic difference in the estimated coefficients, then we have reason to doubt the assumptions on which the efficient estimator is based.

To use hausman, you

    (estimate the less efficient model)
    . hausman, save
    (estimate the fully efficient model)
    . hausman

Alternatively, you can turn this around:

    (estimate the fully efficient model)
    . hausman, save
    (estimate the less efficient model)
    . hausman, less
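The statistic behind this comparison is H = (b - B)'(V_b - V_B)^(-1)(b - B), where b and V_b come from the consistent (less efficient) estimator and B and V_B from the efficient one, referred to a chi-squared distribution. A minimal numerical illustration of that formula (in Python rather than Stata, with made-up toy numbers, and ignoring hausman's handling of rank-deficient difference matrices):

```python
import numpy as np

def hausman_stat(b, V_b, B, V_B):
    """H = (b-B)' (V_b - V_B)^(-1) (b-B), with degrees of freedom len(b)."""
    d = np.asarray(b, dtype=float) - np.asarray(B, dtype=float)
    V_diff = np.asarray(V_b, dtype=float) - np.asarray(V_B, dtype=float)
    H = float(d @ np.linalg.inv(V_diff) @ d)
    return H, len(d)

# Toy numbers (not from the examples below): two coefficients whose
# estimates differ slightly between a consistent and an efficient fit.
H, df = hausman_stat(b=[1.0, 2.0], V_b=np.diag([0.04, 0.04]),
                     B=[1.1, 1.9], V_B=np.diag([0.02, 0.02]))
print(H, df)   # H = 1.0 on 2 degrees of freedom for these numbers
```

A large H relative to the chi-squared distribution with df degrees of freedom is evidence against the assumptions underlying the efficient estimator, which is exactly how the Prob>chi2 lines in the output below are read.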
Example

We are studying the factors that affect the wages of young women in the United States between 1970 and 1988 and have a panel-data sample of individual women over that time span.
    . describe

    Contains data from nlswork.dta
      obs:        28,534                      National Longitudinal Survey.
                                                Young Women 14-26 years of age
                                                in 1968
     vars:             6                      1 Aug 2000 09:48
     size:       485,078 (88.4% of memory free)

                  storage  display     value
    variable name   type   format      label      variable label
    idcode          int    %8.0g                  NLS id
    year            byte   %8.0g                  interview year
    age             byte   %8.0g                  age in current year
    msp             byte   %8.0g                  1 if married, spouse present
    ttl_exp         float  %9.0g                  total work experience
    ln_wage         float  %9.0g                  ln(wage/GNP deflator)

    Sorted by:  idcode  year
    Note:  dataset has changed since last saved
We believe that a random-effects specification is appropriate for individual-level effects in our model. We estimate a fixed-effects model that will capture all temporally constant individual-level effects.

    . xtreg ln_wage age msp ttl_exp, fe

    Fixed-effects (within) regression               Number of obs      =     28494
    Group variable (i) : idcode                     Number of groups   =      4710

    R-sq:  within  = 0.1373                         Obs per group: min =         1
           between = 0.2571                                        avg =       6.0
           overall = 0.1800                                        max =        15

                                                    F(3,23781)         =   1262.01
    corr(u_i, Xb)  = 0.1476                         Prob > F           =    0.0000

         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   -.005485    .000837    -6.55   0.000    -.0071256   -.0038443
             msp |   .0033427   .0054868     0.61   0.542    -.0074118    .0140971
         ttl_exp |   .0383604   .0012416    30.90   0.000     .0359268    .0407941
           _cons |   1.593953   .0177538    89.78   0.000     1.559154    1.628752
    -------------+----------------------------------------------------------------
         sigma_u |  .37674223
         sigma_e |  .29751014
             rho |  .61591044   (fraction of variance due to u_i)
    -------------+----------------------------------------------------------------
    F test that all u_i=0:     F(4709,23781) =     7.76      Prob > F = 0.0000
We assume that this model is consistent for the true parameters and save the results by typing

    . hausman, save
Now, we estimate a random-effects model as a fully efficient specification of the individual effects under the assumption that they follow a random-normal distribution. These estimates are then compared to the previously saved results using the hausman command.
    . xtreg ln_wage age msp ttl_exp, re

    Random-effects GLS regression                   Number of obs      =     28494
    Group variable (i) : idcode                     Number of groups   =      4710

    R-sq:  within  = 0.1373                         Obs per group: min =         1
           between = 0.2552                                        avg =       6.0
           overall = 0.1797                                        max =        15

    Random effects u_i ~ Gaussian                   Wald chi2(3)       =   5100.33
    corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

         ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |  -.0069749   .0006882   -10.13   0.000    -.0083238   -.0056259
             msp |   .0046594   .0051012     0.91   0.361    -.0053387    .0146575
         ttl_exp |   .0429635   .0010169    42.25   0.000     .0409704    .0449567
           _cons |   1.609916   .0159176   101.14   0.000     1.578718    1.641114
    -------------+----------------------------------------------------------------
         sigma_u |  .32648519
         sigma_e |  .29751014
             rho |  .54633481   (fraction of variance due to u_i)
    . hausman

                     ---- Coefficients ----
                 |      (b)          (B)           (b-B)     sqrt(diag(V_b-V_B))
                 |     Prior        Current     Difference          S.E.
    -------------+--------------------------------------------------------------
             age |    -.005485    -.0069749       .0014899        .0004764
             msp |    .0033427     .0046594      -.0013167        .0020206
         ttl_exp |    .0383604     .0429635      -.0046031        .0007124

               b = less efficient estimates obtained previously from xtreg
               B = fully efficient estimates obtained from xtreg

        Test:  Ho:  difference in coefficients not systematic

                chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                        =   275.44
                Prob>chi2 =    0.0000
Using the current specification, our initial hypothesis that the individual-level effects are adequately modeled by a random-effects model is resoundingly rejected. We realize, of course, that this result is based on the rest of our model specification and it is entirely possible that random effects might be appropriate for some alternate model of wages.

hausman is a generic implementation of the Hausman test and assumes that the user knows exactly what they want tested. The test between random and fixed effects is so common that Stata provides a special command for use after xtreg. We could have obtained the above test in a slightly different format by typing

    . xthausman

    Hausman specification test

                     ---- Coefficients ----
         ln_wage |      Fixed        Random
                 |     Effects       Effects      Difference
    -------------+------------------------------------------
             age |    -.005485     -.0069749       .0014899
             msp |    .0033427      .0046594      -.0013167
         ttl_exp |    .0383604      .0429635      -.0046031

        Test:  Ho:  difference in coefficients not systematic

                chi2(3) = (b-B)'[S^(-1)](b-B),  S = (S_fe - S_re)
                        =   275.44
                Prob>chi2 =    0.0000
Example

A stringent assumption of multinomial and conditional logit models is that the outcome categories for the model have the property of independence of irrelevant alternatives (IIA). Stated simply, this assumption requires that the inclusion or exclusion of categories does not affect the relative risks associated with the regressors in the remaining categories.

One classic example of a situation where this assumption would be violated involves the choice of transportation mode; see McFadden (1974). For simplicity, postulate a transportation model with four possible outcomes: rides a train to work, takes a bus to work, drives the Ford to work, and drives the Chevrolet to work. Clearly, "drives the Ford" is a closer substitute to "drives the Chevrolet" than it is to "rides a train" (at least for most people). This means that excluding "drives the Ford" from the model could be expected to affect the relative risks of the remaining options and the model would not obey the IIA assumption.

Using the data presented in [R] mlogit, we will use a simplified model to test for IIA. Choice of insurance type among indemnity, prepaid, and uninsured is modeled as a function of age and gender. The indemnity category is allowed to be the base category and the model including all three outcomes is estimated.
. mlogit insure age male

Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -551.32973
Iteration 2:   log likelihood = -551.32802

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(4)      =       9.05
                                                  Prob > chi2     =     0.0598
Log likelihood = -551.32802                       Pseudo R2       =     0.0081

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |  -.0100251   .0060181    -1.67   0.096    -.0218204    .0017702
        male |   .5095747   .1977893     2.58   0.010     .1219148    .8972345
       _cons |   .2633838   .2787574     0.94   0.345    -.2829708    .8097383
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0051925   .0113821    -0.46   0.648     -.027501     .017116
        male |   .4748547   .3618446     1.31   0.189    -.2343477    1.184057
       _cons |  -1.756843   .5309591    -3.31   0.001    -2.797504   -.7161824
------------------------------------------------------------------------------
(Outcome insure==Indem is the comparison group)

. hausman, save
hausman -- Hausman specification test

Under the IIA assumption, we would expect no systematic change in the coefficients if we excluded one of the outcomes from the model. (For an extensive discussion, see Hausman and McFadden 1984.) We re-estimate the model, excluding the uninsured outcome, and perform a Hausman test against the fully efficient full model.

. mlogit insure age male if insure ~= "Uninsure":insure
Iteration 0:   log likelihood =  -394.8693
Iteration 1:   log likelihood =  -390.4871
Iteration 2:   log likelihood = -390.48643

Multinomial regression                            Number of obs   =        570
                                                  LR chi2(2)      =       8.77
                                                  Prob > chi2     =     0.0125
Log likelihood = -390.48643                       Pseudo R2       =     0.0111

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |  -.0101521   .0060049    -1.69   0.091    -.0219214    .0016173
        male |   .5144003   .1981735     2.60   0.009     .1259875    .9028132
       _cons |   .2678043   .2775562     0.96   0.335    -.2761959    .8118046
------------------------------------------------------------------------------
(Outcome insure==Indem is the comparison group)

. hausman, alleqs less constant

                 ---- Coefficients ----
             |      (b)          (B)         (b-B)     sqrt(diag(V_b-V_B))
             |    Current       Prior     Difference          S.E.
-------------+------------------------------------------------------------
         age |   -.0101521    -.0100251     -.0001269             .
        male |    .5144003     .5095747      .0048256             .
       _cons |    .2678043     .2633838      .0044205       .012334
---------------------------------------------------------------------------
           b = less efficient estimates obtained from mlogit
           B = fully efficient estimates obtained previously from mlogit

    Test:  Ho:  difference in coefficients not systematic

              chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      =     0.08
              Prob>chi2 =    0.9944
First, note that the somewhat subtle syntax of the if condition on the mlogit command was simply used to identify the "Uninsured" category using the insure value label; see [U] 15.6.3 Value labels.

Second, since the Hausman test is a standardized comparison of model coefficients, using it with mlogit requires that the base category be the same in both competing models. In particular, if the most frequent category (the default base category) is being removed to test for IIA, then you must use the basecategory() option in mlogit to manually set the base category to something else.

On examining the output from hausman, we see that there is no evidence that the IIA assumption has been violated.
The missing values for the square root of the diagonal of the covariance matrix of the differences are not comforting, but they are also not surprising. This covariance matrix is guaranteed to be positive definite only asymptotically, and no assurances are made about the diagonal elements. Negative values along the diagonal are possible, and the fourth column of the table is provided mainly for descriptive use.

We can also perform the Hausman IIA test against the remaining alternative in the model.

. mlogit insure age male if insure ~= "Prepaid":insure
Iteration 0:   log likelihood = -132.59915
Iteration 1:   log likelihood = -131.78009
Iteration 2:   log likelihood = -131.76808
Iteration 3:   log likelihood = -131.76807

Multinomial regression                            Number of obs   =        338
                                                  LR chi2(2)      =       1.66
                                                  Prob > chi2     =     0.4356
Log likelihood = -131.76807                       Pseudo R2       =     0.0063
------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0041055   .0115807    -0.35   0.723    -.0268033    .0185923
        male |   .4591072   .3595663     1.28   0.202    -.2456298    1.163844
       _cons |  -1.801774   .5474476    -3.29   0.001    -2.874752   -.7287968
------------------------------------------------------------------------------
(Outcome insure==Indem is the comparison group)
. hausman, alleqs less constant

                 ---- Coefficients ----
             |      (b)          (B)         (b-B)     sqrt(diag(V_b-V_B))
             |    Current       Prior     Difference          S.E.
-------------+------------------------------------------------------------
         age |   -.0041055    -.0051925       .001087             .
        male |    .4591072     .4748547     -.0157475      .0021357
       _cons |   -1.801774    -1.756843     -.0449311      .1333464
---------------------------------------------------------------------------
           b = less efficient estimates obtained from mlogit
           B = fully efficient estimates obtained previously from mlogit

    Test:  Ho:  difference in coefficients not systematic

              chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      =    -0.18    chi2<0 ==> model estimated on these
                                    data fails to meet the asymptotic
                                    assumptions of the Hausman test
In this case, the chi-squared statistic is actually negative. We might interpret this as strong evidence that we cannot reject the null hypothesis. Such a result is not an unusual outcome for the Hausman test, particularly when the sample is relatively small; there are only 45 uninsured individuals in this dataset.

Are we surprised by the results of the Hausman test in this example? Not really. Judging from the z statistics on the original multinomial logit model, we were struggling to identify any structure in the data with the current specification. Even when we were willing to make the assumption of IIA and estimate the most efficient model under this assumption, few of the effects could be identified as statistically different from those on the base category. Trying to base a Hausman test on a contrast (difference) between two poor estimates is just asking too much of the existing data.
heckman assumes that wage is the dependent variable and that the first variable list (educ and age) contains the determinants of wage. The variables specified in the select() option (married, children, educ, and age) are assumed to determine whether the dependent variable is observed (the selection equation). Thus, we estimated the model

    wage = β0 + β1 educ + β2 age + u1

and we assumed that wage is observed if

    γ0 + γ1 married + γ2 children + γ3 educ + γ4 age + u2 > 0

where u1 and u2 have correlation ρ.
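A data-generating process of exactly this form can be simulated directly. The sketch below is Python for illustration only; every coefficient value in it is invented for the sketch, not an estimate from this dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
educ     = rng.integers(10, 18, n).astype(float)
age      = rng.integers(20, 60, n).astype(float)
married  = rng.integers(0, 2, n).astype(float)
children = rng.integers(0, 4, n).astype(float)

rho, sigma = 0.7, 6.0
# build (u1, u2) so that corr(u1/sigma, u2) = rho
u2 = rng.standard_normal(n)
u1 = sigma * (rho * u2 + np.sqrt(1 - rho**2) * rng.standard_normal(n))

wage = 1.0 + 1.0 * educ + 0.2 * age + u1                    # outcome equation
select = (-2.5 + 0.45 * married + 0.45 * children
          + 0.06 * educ + 0.035 * age + u2) > 0             # selection rule
wage_obs = np.where(select, wage, np.nan)                   # wage seen only if selected
```

Because u1 and u2 are correlated, the observed wages are a nonrandom subsample, which is precisely the situation the selection equation is meant to model.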
The reported results for the wage equation are interpreted exactly as though we observed wage data for all women in the sample; the coefficients on age and education level represent the estimated marginal effects of the regressors in the underlying regression equation. The results for the two ancillary parameters require some explanation. heckman does not directly estimate ρ; to constrain ρ within its valid limits, and for numerical stability during optimization, it estimates the inverse hyperbolic tangent of ρ:

    atanh ρ = (1/2) ln{ (1 + ρ) / (1 − ρ) }

heckman -- Heckman selection model

This estimate is reported as /athrho. In the bottom panel of the output, heckman undoes this transformation for you: the estimated value of ρ is .7035061. The standard error for ρ is computed using the delta method, and its confidence interval is the transformed interval of /athrho.

Similarly, σ, the standard error of the residual in the wage equation, is not directly estimated; for numerical stability, heckman instead estimates ln σ. The untransformed sigma is reported at the end of the output: 6.004797.

Finally, some researchers, especially economists, are used to the selectivity effect summarized not by ρ but by λ = ρσ. heckman reports this, too, along with an estimate of the standard error and confidence interval.
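The back-transformations heckman performs can be reproduced by hand from the reported /athrho and /lnsigma point estimates. A quick check in Python (values are the point estimates discussed in the surrounding text):

```python
import math

athrho, lnsigma = 0.8742086, 1.792559       # as reported by heckman
rho   = math.tanh(athrho)                    # undo atanh: about .7035061
sigma = math.exp(lnsigma)                    # undo log:   about 6.004797
lam   = rho * sigma                          # lambda = rho*sigma: about 4.224412

# confidence limits transform the same way; e.g., for rho:
rho_lo = math.tanh(0.5991596)                # about .5364513
rho_hi = math.tanh(1.149258)                 # about .817508
```

Because tanh maps the whole real line into (−1, 1) and exp maps it into (0, ∞), the transformed intervals automatically respect the parameters' natural bounds.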
q
□ Technical Note

If each of the equations in the model had contained many regressors, the heckman command could become quite long. An alternate way of specifying our wage model would make use of Stata's global macros. The following lines are an equivalent way of estimating our model.

. global wageeq "wage educ age"
. global seleq "married children educ age"
. heckman $wageeq, select($seleq)

□
□ Technical Note

The reported model chi-squared test is a Wald test that all coefficients in the regression model (except the constant) are 0. heckman is an estimation command, so you can use test, testnl, or lrtest to perform tests against whatever nested alternate model you choose; see [R] test, [R] testnl, and [R] lrtest.

The estimation of ρ and σ in the form atanh ρ and ln σ extends the range of these parameters to infinity in both directions, thus avoiding boundary problems during the maximization. Tests of ρ must be made in the transformed units. However, since atanh(0) = 0, the reported test for atanh ρ = 0 is equivalent to the test for ρ = 0.

The likelihood-ratio test reported at the bottom of the output is an equivalent test for ρ = 0 and is computationally the comparison of the joint likelihood of an independent probit model for the selection equation and a regression model on the observed wage data against the Heckman model likelihood. The z = 8.619 and chi-squared of 61.20, both significantly different from zero, clearly justify the Heckman selection equation with these data.

□
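Computationally, that likelihood-ratio test is just a comparison of two log likelihoods. A sketch in Python (the three log-likelihood values below are illustrative placeholders chosen to reproduce a chi-squared of 61.20; they are not values printed in this entry):

```python
# hypothetical log likelihoods for the two competing specifications:
ll_probit  = -1027.06   # independent probit for the selection equation alone
ll_regress = -4181.84   # regression model on the observed wage data alone
ll_heckman = -5178.30   # joint Heckman model

# LR statistic: twice the gain from modeling the equations jointly (1 df)
lr = -2 * ((ll_probit + ll_regress) - ll_heckman)
```

Under the null of ρ = 0, the joint model cannot improve on the two independent pieces, so lr is compared against a chi-squared distribution with one degree of freedom.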
Example

heckman supports the Huber/White/sandwich estimator of variance under the robust and cluster() options, or when pweights are used for population-weighted data; see [U] 23.11 Obtaining robust variance estimates. We can obtain robust standard errors for our wage model by specifying clustering on county of residence (the county variable).
. heckman wage educ age, select(married children educ age) cluster(county)

Iteration 0:   log likelihood = -5178.7009
Iteration 1:   log likelihood = -5178.3049
Iteration 2:   log likelihood = -5178.3045

Heckman selection model                           Number of obs   =       2000
(regression model with sample selection)          Censored obs    =        657
                                                  Uncensored obs  =       1343
                                                  Wald chi2(1)    =     272.17
Log likelihood = -5178.304                        Prob > chi2     =     0.0000

                          (standard errors adjusted for clustering on county)
------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9899537   .0600061    16.50   0.000     .8723438    1.107564
         age |   .2131294    .020995    10.15   0.000       .17198    .2542789
       _cons |   .4857752   1.302103     0.37   0.709    -2.066299      3.03785
-------------+----------------------------------------------------------------
select       |
     married |   .4451721   .0731472     6.09   0.000     .3018062    .5885379
    children |   .4387068   .0312386    14.04   0.000     .3774802    .4999333
   education |   .0557318   .0110039     5.06   0.000     .0341645    .0772991
         age |   .0365098    .004038     9.04   0.000     .0285954    .0444242
       _cons |  -2.491015   .1153305   -21.60   0.000    -2.717059   -2.264972
-------------+----------------------------------------------------------------
     /athrho |   .8742086   .1403337     6.23   0.000     .5991596    1.149258
    /lnsigma |   1.792559   .0258458    69.36   0.000     1.741902    1.843216
-------------+----------------------------------------------------------------
         rho |   .7035061   .0708796                      .5364513     .817508
       sigma |   6.004797    .155199                      5.708189    6.316818
      lambda |   4.224412   .5186709                      3.207835    5.240988
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):  chi2(1) =    38.81   Prob > chi2 = 0.0000
The robust standard errors tend to be a bit larger, but we do not notice any systematic differences. This is not surprising, since the data were not constructed to have any county-specific correlations or other characteristics that would deviate from the assumptions of the Heckman model.

q
Example

The default statistic produced by predict after heckman is the expected value of the dependent variable from the underlying distribution of the regression model. In our wage model, this is the expected wage rate among all women, regardless of whether they were observed to participate in the labor force.

. predict heckwage
(option xb assumed; fitted values)

It is instructive to compare these predicted wage values from the Heckman model with those from an ordinary regression model, one fit without the selection adjustment:
. regress wage educ age

      Source |       SS       df       MS              Number of obs =    1343
-------------+------------------------------           F(  2,  1340) =  227.49
       Model |  13524.0337     2  6762.01687           Prob > F      =  0.0000
    Residual |  39830.8609  1340  29.7245231           R-squared     =  0.2535
-------------+------------------------------           Adj R-squared =  0.2524
       Total |  53354.8948  1342  39.7577456           Root MSE      =   5.452

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .8965829   .0498061    18.00   0.000     .7988765    .9942893
         age |   .1465739   .0187135     7.83   0.000      .109863    .1832848
       _cons |   6.084875   .8896182     6.84   0.000     4.339679    7.830071
------------------------------------------------------------------------------

. predict regwage
(option xb assumed; fitted values)

. summarize heckwage regwage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    heckwage |    2000    21.15532    3.83965     14.6479   32.85949
     regwage |    2000    23.12291   3.241911    17.98218   32.66439
Since this dataset was concocted, we know the true coefficients of the wage regression equation to be 1, 0.2, and 1, respectively. We can compute the true mean wage for our sample.

. gen truewage = 1 + .2*age + 1*educ
. sum truewage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    truewage |    2000     21.3256   3.797904          15       32.8
Whereas the mean of the predictions from heckman is within 18 cents of the true mean wage, ordinary regression yields predictions that are on average about $1.80 per hour too high due to the selection effect. The regression predictions also show somewhat less variation than the true wages.

The coefficients from heckman are so close to the true values that they are not worth testing. Conversely, the regression equation is significantly off, but seems to give the right sense. Would we be led far astray if we relied on the OLS coefficients? The effect of age is off by over 5 cents per year of age, and the coefficient on education level is off by about 10%. We can test the OLS coefficient on education level against the true value using test.

. test educ = 1

 ( 1)  education = 1.0

       F(  1,  1340) =    4.31
            Prob > F =    0.0380
Not only is the OL$ coefficient on education substantially lower than the true parameter, the difference from the true parameter is statistically significant beyond the 5% level. We can perform a similar test for the OLS age coefficient: • test (1)
age
=
.2
age
=
.2
F(
1, 1340) = Prob > F =
8.15 0.0044
We find even stronger evidence that the OLS regression results are biased away from the true parameters.

q

Example

Several other interesting aspects of the Heckman model can be explored with predict. Continuing with our wage model, the expected wages for women conditional on participating in the labor force can be obtained with the ycond option. Let's get these predictions and compare them with actual wages for women participating in the labor force.

. predict hcndwage, ycond
. summarize wage hcndwage if wage ~= .

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        wage |    1343    23.69217   6.3_5374    5.88497   45.80979
    hcndwage |    1343    23.68239   3.355087   16.18337    33.7567
We see that the average predictions from heckman are very close to the observed levels but do not have exactly the same mean. These conditional wage predictions are available for all observations in the dataset but can be directly compared only with observed wages where individuals are participating in the labor force.

What if we were interested in making predictions about mean wages for all women? In this case, the expected wage is 0 for those who are not expected to participate in the labor force, with expected participation determined by the selection equation. These values can be obtained with the yexpected option of predict. For comparison, a variable can be generated where the wage is set to 0 for nonparticipants.

. predict hexpwage, yexpected
. gen wage0 = wage
(657 missing values generated)
. replace wage0 = 0 if wage == .
(657 real changes made)
. summarize hexpwage wage0

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    hexpwage |    2000    15.92511   5.949336   2.492469   32.45858
       wage0 |    2000    15.90929   12._7081          0   45.80979
Again, we note that the predictions from heckman are very close to the observed mean hourly wage rate for all women. Why aren't the predictions using ycond and yexpected exactly equal to their observed sample equivalents? For the Heckman model, unlike linear regression, the sample moments implied by the optimal solution to the model likelihood do not require that these predictions exactly match observed data. Properly accounting for the additional variation from the selection equation requires that the model use more information than just the sample moments of the observed wages.

q
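The two options correspond to standard selection-model formulas: ycond adds the nonselection-hazard correction to the linear prediction, and yexpected weights the conditional mean by the selection probability. A sketch in Python under the model's assumptions (the inputs xb, zg, rho, and sigma are placeholders, not estimates from this dataset):

```python
from scipy.stats import norm

def heckman_predictions(xb, zg, rho, sigma):
    """xb: regression linear prediction; zg: selection linear prediction."""
    mills = norm.pdf(zg) / norm.cdf(zg)          # nonselection hazard
    ycond = xb + rho * sigma * mills             # E[y | selected]
    # Pr(selected)*E[y | selected] + Pr(not selected)*0
    yexpected = norm.cdf(zg) * xb + rho * sigma * norm.pdf(zg)
    return ycond, yexpected

ycond, yexp = heckman_predictions(xb=20.0, zg=0.5, rho=0.7, sigma=6.0)
```

With positive ρ, ycond exceeds the raw linear prediction (selected individuals have above-average disturbances), while yexpected shrinks toward zero as the selection probability falls.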
Example

Stata will also produce Heckman's (1979) two-step efficient estimator of the model with the twostep option. Maximum likelihood estimation of the parameters can be time-consuming with large datasets, and the two-step estimates may provide a good alternative in such cases. Continuing with the women's wage model, we can obtain the two-step estimates with Heckman's consistent covariance estimates by typing
. heckman wage educ age, select(married children educ age) twostep

Heckman selection model -- two-step estimates     Number of obs   =       2000
(regression model with sample selection)          Censored obs    =        657
                                                  Uncensored obs  =       1343
                                                  Wald chi2(4)    =     551.37
                                                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9825259   .0538821    18.23   0.000     .8789189    1.088133
         age |   .2118695   .0220511     9.61   0.000     .1686502    .2550888
       _cons |   .7340391   1.248331     0.59   0.557    -1.712645    3.180723
-------------+----------------------------------------------------------------
select       |
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
-------------+----------------------------------------------------------------
mills        |
      lambda |   4.001615   .6065388     6.60   0.000     2.812821     5.19041
-------------+----------------------------------------------------------------
         rho |    0.67284
       sigma |  5.9473529
      lambda |  4.0016155   .6065388
------------------------------------------------------------------------------

q
□ Technical Note

The Heckman selection model depends strongly on the model being correct; much more so than ordinary regression. Running a separate probit or logit for sample inclusion followed by a regression, referred to in the literature as the two-part model (Manning, Duan, and Rogers 1987), not to be confused with Heckman's two-step procedure, is an especially attractive alternative if the regression part of the model arose because of taking a logarithm of zero values. When the goal is to analyze an underlying regression model or predict the value of the dependent variable that would be observed in the absence of selection, however, the Heckman model is more appropriate. When the goal is to predict an actual response, the two-part model is usually the better choice.

The Heckman selection model can be unstable when the model is not properly specified, or if a specific dataset simply does not support the model's assumptions. For example, let's examine the solution to another simulated problem.
. heckman yt x1 x2 x3, select(z1 z2)

Iteration 0:   log likelihood = -111.94996
Iteration 1:   log likelihood = -110.82258
Iteration 2:   log likelihood = -110.17707
Iteration 3:   log likelihood = -107.70663  (not concave)
Iteration 4:   log likelihood = -107.07729  (not concave)
 (output omitted )
Iteration 31:  log likelihood = -104.08268
Iteration 32:  log likelihood = -104.08267  (backed up)

Heckman selection model                           Number of obs   =        150
(regression model with sample selection)          Censored obs    =         87
                                                  Uncensored obs  =         63
                                                  Wald chi2(3)    =   8.64e+07
Log likelihood = -104.0827                        Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yt           |
          x1 |   .8974192   .0006338  1415.93   0.000      .896177    .8986615
          x2 |  -2.525302   .0003934 -6418.57   0.000    -2.526074   -2.524531
          x3 |   2.855786   .0004417  6465.84   0.000      2.85492    2.856651
       _cons |   .6971255   .0851518     8.19   0.000     .5302311    .8640198
-------------+----------------------------------------------------------------
select       |
          z1 |  -.6830377   .0824049    -8.29   0.000    -.8445484    -.521527
          z2 |   1.004249   .1211501     8.29   0.000     .7667993    1.241699
       _cons |   -.361413   .1165081    -3.10   0.002    -.5897647   -.1330613
-------------+----------------------------------------------------------------
     /athrho |   15.12596   151.3627     0.10   0.920    -281.5395    311.7914
    /lnsigma |  -.5402571   .1206355    -4.48   0.000    -.7766984   -.3038158
-------------+----------------------------------------------------------------
         rho |          1   4.40e-11                            -1           1
       sigma |   .5825984   .0702821                       .459922    .7379968
      lambda |   .5825984   .0702821                      .4448481    .7203488
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):    chi2(1) =    25.67   Prob > chi2 = 0.0000
The model has converged to a value of ρ that is 1.0, within machine rounding tolerances. Given the form of the likelihood for the Heckman selection model, this implies a division by zero, and it is surprising that the model solution turns out as well as it does. Reparameterizing ρ has allowed the estimation to converge, but we clearly have problems with the estimates. Moreover, if this had occurred in a large dataset, waiting over 32 iterations for convergence might take considerable time.

This dataset was not intentionally developed to cause problems. It is actually generated by a "Heckman process" and, when generated starting from different random values, can be easily estimated. The luck of the draw in this case merely led to data that, despite its source, did not support the assumptions of the Heckman model.

The two-step model is generally more stable in cases where the data are problematic. It is even tolerant of estimates of ρ less than -1 and greater than 1. For these reasons, the two-step model may be preferred when exploring a large dataset. Still, if the maximum likelihood estimates cannot converge, or converge to a value of ρ that is at the boundary of acceptable values, there is scant support for estimating a Heckman selection model on the data. Heckman (1979) discusses the implications of ρ being exactly 1 or 0, together with the implications of other possible covariance relationships among the model's determinants.
Saved Results

heckman saves in e():
Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(N_cens)     number of censored observations
    e(df_m)       model degrees of freedom
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p_c)        p-value for comparison test
    e(p)          significance of comparison test
    e(rho)        rho
    e(ic)         number of iterations
    e(rank)       rank of e(V)
    e(rank0)      rank of e(V) for constant-only model
    e(lambda)     lambda
    e(selambda)   standard error of lambda
    e(sigma)      sigma

Macros
    e(cmd)        heckman
    e(depvar)     name(s) of dependent variable(s)
    e(title)      title in estimation output
    e(title2)     secondary title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(mills)      variable containing nonselection hazard (inverse of Mills')
    e(method)     requested estimation method
    e(vcetype)    covariance estimation method
    e(user)       name of likelihood-evaluator program
    e(opt)        type of optimization
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(offset#)    offset for equation #
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

heckman is implemented as an ado-file.

Greene (2000, 928-933) or Johnston and DiNardo (1997, 446-450) provide an introduction to the Heckman selection model. Regression estimates using the nonselection hazard (Heckman 1979) provide starting values for maximum likelihood estimation.

The regression equation is

    yj = xj β + u1j

The selection equation is

    zj γ + u2j > 0

where

    u1 ~ N(0, σ)
    u2 ~ N(0, 1)
    corr(u1, u2) = ρ
The log likelihood for observation j is

    lj = wj ln Φ{ [zjγ + (yj − xjβ)ρ/σ] / sqrt(1 − ρ²) }
         − (wj/2) {(yj − xjβ)/σ}² − wj ln{sqrt(2π) σ}         yj observed

    lj = wj ln Φ(−zjγ)                                         yj not observed

where Φ(·) is the standard cumulative normal and wj is an optional weight for observation j.
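As a concrete rendering of the per-observation likelihood, a Python sketch (this is an illustration of the formula, not Stata's internal evaluator):

```python
import numpy as np
from scipy.stats import norm

def heckman_ll_obs(y, xb, zg, rho, sigma, w=1.0, observed=True):
    """Log-likelihood contribution of one observation; w is the optional weight."""
    if not observed:
        return w * norm.logcdf(-zg)
    r = (y - xb) / sigma
    arg = (zg + r * rho) / np.sqrt(1.0 - rho**2)
    return w * (norm.logcdf(arg) - 0.5 * r**2 - np.log(np.sqrt(2 * np.pi) * sigma))

# sanity check: with rho = 0 the observed contribution separates into
# an independent probit term plus a normal-density term
ll = heckman_ll_obs(y=1.2, xb=1.0, zg=0.3, rho=0.0, sigma=2.0)
```

The separation at ρ = 0 is exactly why the likelihood-ratio test against the independent probit-plus-regression model is a valid test of ρ = 0.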
In the maximum likelihood estimation, σ and ρ are not directly estimated. Directly estimated are ln σ and

    atanh ρ = (1/2) ln{ (1 + ρ) / (1 − ρ) }

The standard error of λ = ρσ is approximated through the propagation of error (delta) method; that is,

    Var(λ) ≈ D Var{(atanh ρ, ln σ)} D′

where D is the Jacobian of λ with respect to atanh ρ and ln σ.
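That Jacobian has a closed form: with λ = tanh(a)·exp(s) for a = atanh ρ and s = ln σ, the partial derivatives are σ(1 − ρ²) and ρσ. A sketch (Python; the point estimates and variance entries are illustrative, and the off-diagonal covariance term, which heckman does use, is set to zero here):

```python
import numpy as np

a, s = 0.8742, 1.7926                      # athrho, lnsigma (illustrative values)
rho, sigma = np.tanh(a), np.exp(s)

# Jacobian of lambda = tanh(a)*exp(s) with respect to (a, s)
D = np.array([sigma * (1 - rho**2), rho * sigma])

V = np.diag([0.1403**2, 0.0258**2])        # Var{(atanh rho, ln sigma)}, covariance omitted
se_lambda = np.sqrt(D @ V @ D)
```

With the full covariance matrix in place of the diagonal V, this is the delta-method standard error heckman reports for lambda.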
where D is the Jacobian of )_ with respect to at_h p and In a. The two-step estimates are computed using H_ckman's (1979) procedure. Probit estimates of the selection equation Pr(yj
observed I zj)-
_(zj")')
are obtained. From these estimates the nonselection hazard, what Heckman (t979) referred to as the inverse of the Mills' ratio, m j, for each observa¢ion 3 is computed as
¢(zjS) mj where ¢ is the normal density. We also define
Following Heckman, the two-step parameter estimates of /3 are obtained by augmenting the regression equation with the nonselection hazard m. Thus, the regressors become [X m] and we obtain the additional parameter estimate/3,a on the variable containing the nonselection hazard. A consistent estimate of the regression disturbance variance is obtained using the residuals from the augmented regression and the parameter estimate on the nonselecfion hazard. e'e +/3_ _--]j=l N N
5j
The two-step estimate of p is then _ = /3r,L c3 Heckman derived consistent estimates of the coefficient covariance matrix based on the augmented regression.
Let W = [X m] and let D be a square diagonal matrix of rank N with (1 − ρ̂² δj) on the diagonal elements. Then

    Vtwostep = σ̂² (W′W)⁻¹ (W′DW + Q) (W′W)⁻¹

where Q = ρ̂² (W′DZ) Vp (Z′DW), Z is the matrix of selection-equation regressors, and Vp is the variance-covariance estimate from the probit estimation of the selection equation.
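The whole two-step procedure can be prototyped in a few lines. A hedged sketch in Python on simulated data (an illustration of the algebra above, not Stata's implementation; the simulated truth is β = (1, 1.5), ρ = 0.5, σ = 2):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 5000
x, z = rng.standard_normal(n), rng.standard_normal(n)
u2 = rng.standard_normal(n)
u1 = 2.0 * (0.5 * u2 + np.sqrt(0.75) * rng.standard_normal(n))  # sd 2, corr .5 with u2
y = 1.0 + 1.5 * x + u1
sel = (0.5 + 1.0 * z + u2) > 0

# step 1: probit of selection on z
Z = np.column_stack([np.ones(n), z])
negll = lambda g: -np.sum(norm.logcdf((2 * sel - 1) * (Z @ g)))
ghat = minimize(negll, np.zeros(2), method="BFGS").x

# step 2: OLS of y on [X m] over the selected sample, m = nonselection hazard
zg = Z[sel] @ ghat
m = norm.pdf(zg) / norm.cdf(zg)
W = np.column_stack([np.ones(sel.sum()), x[sel], m])
beta = np.linalg.lstsq(W, y[sel], rcond=None)[0]

# disturbance variance and rho from the formulas above
e = y[sel] - W @ beta
delta = m * (m + zg)
sigma2 = (e @ e + beta[-1] ** 2 * delta.sum()) / sel.sum()
rho_hat = beta[-1] / np.sqrt(sigma2)
```

Note that the coefficient on the hazard term, βm, estimates ρσ, which is exactly the lambda reported in the mills equation of the twostep output.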
References

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Gronau, R. 1974. Wage comparisons: A selectivity bias. Journal of Political Economy 82: 1119-1155.

Heckman, J. 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. The Annals of Economic and Social Measurement 5: 475-492.

------. 1979. Sample selection bias as a specification error. Econometrica 47: 153-161.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Lewis, H. 1974. Comments on selectivity biases in wage comparisons. Journal of Political Economy 82: 1119-1155.

Manning, W. G., N. Duan, and W. H. Rogers. 1987. Monte Carlo evidence on the choice between sample selection and two-part models. Journal of Econometrics 35: 59-82.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx,
                [R] predict, [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] heckprob, [R] regress, [R] tobit, [R] treatreg

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
Title

heckprob -- Maximum-likelihood probit estimation with selection

Syntax

heckprob depvar [varlist] [weight] [if exp] [in range],
    select([depvar_s =] varlist_s [, offset(varname) noconstant])
    [ robust cluster(varname) score(newvarlist) first noconstant noskip
      constraints(numlist) level(#) offset(varname) maximize_options ]

by ... : may be used with heckprob; see [R] by.

aweights, fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

heckprob shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    pmargin    Φ(xjb), success probability (the default)
    p11        Φ2(xjb, zjg, ρ), predicted probability Pr(yj_probit = 1, yj_select = 1)
    p10        Φ2(xjb, -zjg, -ρ), predicted probability Pr(yj_probit = 1, yj_select = 0)
    p01        Φ2(-xjb, zjg, -ρ), predicted probability Pr(yj_probit = 0, yj_select = 1)
    p00        Φ2(-xjb, -zjg, ρ), predicted probability Pr(yj_probit = 0, yj_select = 0)
    psel       Φ(zjg), selection probability
    pcond      Φ2(xjb, zjg, ρ)/Φ(zjg), probability of success conditional on selection
    xb         xjb, fitted values
    stdp       standard error of fitted values
    xbsel      linear prediction for selection equation
    stdpsel    standard error of the linear prediction for selection equation

Φ() is the standard normal distribution function and Φ2() is the bivariate normal distribution function.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

heckprob estimates maximum-likelihood probit models with sample selection.
Options

select(...) specifies the variables and options for the selection equation. It is an integral part of specifying a selection model and is not optional.

robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.11 Obtaining robust variance estimates.

cluster(varname) specifies that the observations are independent across groups (clusters) but are not necessarily independent within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() implies robust; that is, specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates a new variable, or a set of new variables, containing the contributions to the scores for each equation and the ancillary parameter in the model. The first new variable specified will contain u1j = ∂lnLj/∂(xjβ) for each observation j in the sample, where lnLj is the jth observation's contribution to the log likelihood. The second new variable: u2j = ∂lnLj/∂(zjγ). The third: u3j = ∂lnLj/∂(atanh ρ). If only one variable is specified, only the first score is computed; if two variables are specified, only the first two scores are computed; and so on. The jth observation's contribution to the score vector is

    { ∂lnLj/∂β   ∂lnLj/∂γ   ∂lnLj/∂(atanh ρ) } = ( u1j xj   u2j zj   u3j )

The score vector can be obtained by summing over j; see [U] 23.12 Obtaining scores.

first specifies that the first-step probit estimates of the selection equation be displayed prior to estimation.

noconstant omits the constant term from the equation. This option may be specified on the regression equation, the selection equation, or both.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

noskip specifies that a full maximum likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) is a rarely used option that specifies a variable to be added directly to Xb. This option may be specified on the regression equation, the selection equation, or both.
maximize_options control the maximization process; see [R] maximize. With the possible exception of iterate(0) and trace, you should never have to specify them.
Options for predict

pmargin, the default, calculates the univariate (marginal) predicted probability of success Pr(yj_probit = 1).

p11 calculates the bivariate predicted probability Pr(yj_probit = 1, yj_select = 1).

p10 calculates the bivariate predicted probability Pr(yj_probit = 1, yj_select = 0).

p01 calculates the bivariate predicted probability Pr(yj_probit = 0, yj_select = 1).

p00 calculates the bivariate predicted probability Pr(yj_probit = 0, yj_select = 0).

psel calculates the univariate (marginal) predicted probability of selection Pr(yj_select = 1).

pcond calculates the conditional (on selection) predicted probability of success Pr(yj_probit = 1, yj_select = 1)/Pr(yj_select = 1).

xb calculates the probit linear prediction xjb.

stdp calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

xbsel calculates the linear prediction for the selection equation.

stdpsel calculates the standard error of the linear prediction for the selection equation.

nooffset is relevant only if you specified offset(varname) for heckprob. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xjb rather than xjb + offsetj.
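The relationships among these statistics can be checked numerically: the four joint probabilities partition the sample space, p11 + p10 equals pmargin, and p11 + p01 equals psel. A sketch (Python; xjb, zjg, and ρ are arbitrary illustrative values):

```python
from scipy.stats import multivariate_normal, norm

def Phi2(a, b, rho):
    """Bivariate standard normal CDF with correlation rho."""
    return multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf([a, b])

xb, zg, rho = 0.3, 0.8, 0.4
pmargin = norm.cdf(xb)
p11 = Phi2(xb, zg, rho)       # success and selected
p10 = Phi2(xb, -zg, -rho)     # success and not selected
p01 = Phi2(-xb, zg, -rho)     # failure and selected
p00 = Phi2(-xb, -zg, rho)     # failure and not selected
psel = norm.cdf(zg)
pcond = p11 / psel            # success conditional on selection
```

The sign flips on the arguments and on ρ come from negating one component of a bivariate normal vector, which negates its correlation with the other component.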
Remarks

The probit model with sample selection (Van de Ven and Van Praag 1981) assumes that there exists an underlying relationship

    yj* = xjβ + u1j                       latent equation

such that we observe only the binary outcome

    yj_probit = (yj* > 0)                 probit equation

The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if

    yj_select = (zjγ + u2j > 0)           selection equation

where

    u1 ~ N(0, 1)
    u2 ~ N(0, 1)
    corr(u1, u2) = ρ

When ρ ≠ 0, standard probit techniques applied to the first equation yield biased results. heckprob provides consistent, asymptotically efficient estimates for all the parameters in such models.
Example

We use the data from Pindyck and Rubinfeld (1998). In this dataset, the variables are whether children attend private school (private), number of years the family has been at the present residence (years), log of property tax (logptax), log of income (loginc), and whether one voted for an increase in property taxes (vote).

In this example, we alter the meaning of the data. Here we assume that we observe whether children attend private school only if the family votes for increasing the property taxes. This is not true in the dataset, and we make this untrue assumption only to illustrate the use of this command.

We observe whether children attend private school only if the head of household voted for an increase in property taxes. We assume that the vote is affected by the number of years in residence, the current property taxes paid, and the household income. We wish to model whether children are sent to private school based on the number of years spent in the current residence and the current property taxes paid.

. heckprob private years logptax, sel(vote=years loginc logptax)

Fitting probit model:
Iteration 0:   log likelihood = -17.122381
Iteration 1:   log likelihood = -18.407192
Iteration 2:   log likelihood = -16.141254
Iteration 3:   log likelihood = -15.953354
Iteration 4:   log likelihood = -15.887269
Iteration 5:   log likelihood = -15.883886
Iteration 6:   log likelihood = -15.883655

Fitting selection model:
Iteration 0:   log likelihood = -63.036914
Iteration 1:   log likelihood = -58.581911
Iteration 2:   log likelihood = -58.497419
Iteration 3:   log likelihood = -58.497288

Comparison:    log likelihood = -74.380943

Fitting starting values:
Iteration 0:   log likelihood = -40.895884
Iteration 1:   log likelihood = -17.920826
Iteration 2:   log likelihood = -18.375362
Iteration 3:   log likelihood = -16.067451
Iteration 4:   log likelihood = -15.84787
Iteration 5:   log likelihood = -15.760354
Iteration 6:   log likelihood = -15.753805
Iteration 7:   log likelihood = -15.753785

Fitting full model:
Iteration 0:   log likelihood = -75.010619  (not concave)
Iteration 1:   log likelihood = -74.287753
Iteration 2:   log likelihood = -74.250148
Iteration 3:   log likelihood = -74.245088
Iteration 4:   log likelihood = -74.244973
Iteration 5:   log likelihood = -74.244973

Probit model with sample selection              Number of obs    =         95
                                                Censored obs     =         36
                                                Uncensored obs   =         59
                                                Wald chi2(2)     =       1.04
Log likelihood = -74.24497                      Prob > chi2      =     0.5935

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
private      |
       years |  -.1142597   .1461711    -0.78   0.434    -.4007498    .1722304
     logptax |   .3516095    1.01648     0.35   0.729    -1.640655    2.343874
       _cons |  -2.780664   6.905814    -0.40   0.687    -16.31581    10.75448
-------------+----------------------------------------------------------------
vote         |
       years |  -.0167511   .0147735    -1.13   0.257    -.0457067    .0122045
      loginc |   .9923025   .4430003     2.24   0.025     .1240378    1.860567
     logptax |  -1.278783   .5717544    -2.24   0.025    -2.399401   -.1581649
       _cons |  -.5458214   4.070415    -0.13   0.893    -8.523689    7.432046
-------------+----------------------------------------------------------------
     /athrho |  -.8663147   1.449995    -0.60   0.550    -3.708253    1.975623
-------------+----------------------------------------------------------------
         rho |  -.6994969   .7405184                     -.9987982    .9622642
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     0.27   Prob > chi2 = 0.6020

The output shows several iteration logs. The first iteration log corresponds to running the probit model for those observations in the sample where we have observed the outcome. The second iteration log corresponds to running the selection probit model, which models whether we observe our outcome of interest. If rho = 0, then the sum of the log likelihoods from these two models will equal the log likelihood of the probit model with sample selection; this sum is printed in the iteration log as the comparison log likelihood. The third iteration log shows starting values for the iterations. The final iteration log is for estimating the full probit model with sample selection. A likelihood-ratio test of the log likelihood for this model and the comparison log likelihood is presented at the end of the output. If we had specified the robust option, then this test would be presented as a Wald test instead of a likelihood-ratio test.
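The comparison log likelihood is just the sum of the two separately fit models' log likelihoods, and the LR statistic of rho = 0 is twice the gap between the full model and that sum. The printed values can be checked directly; the check is plain arithmetic, done here in Python rather than Stata:

```python
# Final log likelihoods as printed in the output above
ll_probit = -15.883655      # probit model, final iteration
ll_select = -58.497288      # selection model, final iteration
ll_full   = -74.244973      # full model, final iteration

ll_comparison = ll_probit + ll_select      # the "Comparison" line: -74.380943
lr_chi2 = 2 * (ll_full - ll_comparison)    # LR test of rho = 0

print(round(ll_comparison, 6))   # -74.380943
print(round(lr_chi2, 2))         # 0.27
```

Both numbers match the output: the comparison log likelihood and the chi2(1) statistic of 0.27 reported by the LR test.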
Example

In the previous example, we could obtain robust standard errors by specifying the robust option. We also eliminate the iteration logs using the nolog option.

. heckprob private years logptax, sel(vote=years loginc logptax) nolog robust

Probit model with sample selection              Number of obs    =         95
                                                Censored obs     =         36
                                                Uncensored obs   =         59
                                                Wald chi2(2)     =       2.55
Log likelihood = -74.24497                      Prob > chi2      =     0.2798

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
private      |
       years |  -.1142597   .1113949    -1.03   0.305    -.3325896    .1040702
     logptax |   .3516095    .735811     0.48   0.633    -1.090553    1.793773
       _cons |  -2.780664   4.786602    -0.58   0.561    -12.16223    6.600903
-------------+----------------------------------------------------------------
vote         |
       years |  -.0167511   .0173344    -0.97   0.334    -.0507258    .0172237
      loginc |   .9923025   .4228035     2.35   0.019     .1636229    1.820982
     logptax |  -1.278783   .5095157    -2.51   0.012    -2.277416   -.2801505
       _cons |  -.5458214   4.543884    -0.12   0.904     -9.45167    8.360027
-------------+----------------------------------------------------------------
     /athrho |  -.8663147   1.630569    -0.53   0.595    -4.062171    2.329541
-------------+----------------------------------------------------------------
         rho |  -.6994969   .8327381                     -.9994077    .9812276
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):   chi2(1) =   0.28   Prob > chi2 = 0.5952
Regardless of whether we specify the robust option, it is clear that the outcome is not significantly different from the outcome obtained by estimating the probit and selection models separately. This is not surprising since the selection mechanism estimated was invented for the example rather than born from any economic theory.

Example

It is instructive to compare the marginal predicted probabilities with the predicted probabilities we would obtain ignoring the selection mechanism. To compare the two approaches, we will synthesize data so that we know the "true" predicted probabilities.

First, we need to generate correlated error terms, which we can do using a standard Cholesky decomposition approach. For our example, we will clear any data from memory and then generate errors that have correlation of .5 using the following commands. Note that we set the seed so that interested readers might type in these same commands and obtain the same results.

. clear
. set seed 12309
. set obs 5000
. gen c1 = invnorm(uniform())
. gen c2 = invnorm(uniform())
. matrix P = (1,.5\.5,1)
. matrix A = cholesky(P)
. local fac1 = A[2,1]
. local fac2 = A[2,2]
. gen u1 = c1
. gen u2 = `fac1'*c1 + `fac2'*c2
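The same Cholesky construction can be sketched outside Stata. For P = (1,.5\.5,1), the Cholesky factor's second row is (.5, sqrt(.75)), so u2 = .5*c1 + sqrt(.75)*c2 has correlation .5 with u1 = c1. A standard-library Python version (the seed value merely echoes the example and has no special meaning):

```python
import random
import math

random.seed(12309)          # any fixed seed; 12309 echoes the example above
n = 5000

c1 = [random.gauss(0, 1) for _ in range(n)]
c2 = [random.gauss(0, 1) for _ in range(n)]

# Second row of the Cholesky factor of [[1, .5], [.5, 1]]
fac1 = 0.5
fac2 = math.sqrt(1 - 0.5 ** 2)

u1 = c1
u2 = [fac1 * a + fac2 * b for a, b in zip(c1, c2)]

def corr(x, y):
    """Sample correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = corr(u1, u2)    # close to .5 in a sample of this size
```

Because fac1**2 + fac2**2 = 1, u2 also has unit variance, which is why the normalization step that follows in the example changes the errors only slightly.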
We can check that the errors have the correct correlation using the corr command. We will also normalize the errors such that they have a standard deviation of one so that we can generate a bivariate probit model with known coefficients. We do that with the following commands.

. summarize u1
. replace u1 = u1/sqrt(r(Var))
. summarize u2
. replace u2 = u2/sqrt(r(Var))
. drop c1 c2
. gen x1 = uniform()-.5
. gen x2 = uniform()+1/3
. gen y1s = 0.5 + 4*x1 + u1
. gen y2s = 3 + .5*x1 - 3*x2 + u2
. gen y1 = (y1s>0)
. gen y2 = (y2s>0)
We have now created two dependent variables y1 and y2 that are defined by our specified coefficients. We also included error terms for each equation, and the error terms are correlated. We run heckprob to verify that the data have been correctly generated according to the model

    y1 = .5 + 4x1 + u1
    y2 = 3 + .5x1 - 3x2 + u2

where we assume that y1 is observed only if y2 = 1.

. heckprob y1 x1, sel(y2 = x1 x2) nolog

Probit model with sample selection              Number of obs    =       5000
                                                Censored obs     =       1790
                                                Uncensored obs   =       3210
                                                Wald chi2(1)     =     941.68
Log likelihood = -3600.854                      Prob > chi2      =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1           |
          x1 |   3.985923   .1298904    30.69   0.000     3.731342    4.240503
       _cons |   .4852946   .0464037    10.46   0.000     .3943449    .5762442
-------------+----------------------------------------------------------------
y2           |
          x1 |   .5998148   .0716655     8.37   0.000     .4593531    .7402765
          x2 |  -3.004937   .0829469   -36.23   0.000     -3.16751   -2.842364
       _cons |   3.011587   .0782817    38.47   0.000     2.858157    3.165016
-------------+----------------------------------------------------------------
     /athrho |    .574063   .0860559     6.67   0.000     .4053964    .7427295
-------------+----------------------------------------------------------------
         rho |   .5183369    .062935                      .3845569    .6307914
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =    46.58   Prob > chi2 = 0.0000
Now that we have verified that we generated data according to a known model, we can obtain and then compare predicted probabilities from the probit model with sample selection and a (usual) probit model.

. predict pmarg
(option pmargin assumed; Pr(y1=1))
. probit y1 x1 if y2==1
 (output omitted)
. predict phat
(option p assumed; Pr(y1))

Using the (marginal) predicted probabilities from the probit model with sample selection (pmarg) and the predicted probabilities from the (usual) probit model (phat), we can also generate the "true" predicted probabilities from the synthesized y1s variable and then compare the predicted probabilities:

. gen ptrue = norm(y1s)
. summarize pmarg ptrue phat

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------------
       pmarg |    5000     .619436   .3209414   .0658356   .9933942
       ptrue |    5000    .6090161   .3499035   1.02e-06          1
        phat |    5000    .6723897   .3063237    .096498   .9971064
Here we see that ignoring the selection mechanism (comparing the phat variable with the true ptrue variable) results in predicted probabilities that are much higher than the true values. Looking at the marginal predicted probabilities from the model with sample selection, however, results in more accurate predictions.
Methods and Formulas

The probit equation is

    y_j = (x_j B + u_1j > 0)

The selection equation is

    z_j G + u_2j > 0

where

    u_1 ~ N(0,1)
    u_2 ~ N(0,1)
    corr(u_1, u_2) = rho

The log likelihood is

    lnL =  sum over i in S with y_i != 0 of  w_i ln Phi2(x_i B + offset_i, z_i G + offset_i, rho)
         + sum over i in S with y_i  = 0 of  w_i ln Phi2(-(x_i B + offset_i), z_i G + offset_i, -rho)
         + sum over i not in S of            w_i ln{1 - Phi(z_i G + offset_i)}

where S is the set of observations for which y_i is observed, Phi2 is the cumulative bivariate normal distribution function (with mean [0 0]'), Phi is the standard cumulative normal, and w_i is an optional weight for observation i.

In the maximum likelihood estimation, rho is not directly estimated. Directly estimated is atanh rho:

    atanh rho = (1/2) ln{ (1 + rho)/(1 - rho) }

From the form of the likelihood, it is clear that if rho = 0, then the log likelihood for the probit model with sample selection is equal to the sum of the probit model for the outcome y and the selection model. A likelihood-ratio test may therefore be performed by comparing the likelihood of the full model with the sum of the log likelihoods for the probit and selection models.
References

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153-161.
Pindyck, R. and D. Rubinfeld. 1998. Econometric Models and Economic Forecasts. 4th ed. New York: McGraw-Hill.
Van de Ven, W. P. M. M. and B. M. S. Van Praag. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229-252.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict,
                [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] heckman, [R] probit, [R] treatreg

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
Title

help -- Obtain on-line help

Syntax

Windows, Macintosh, and Unix(GUI):

    help  [command or topic name]
    whelp [command or topic name]

Unix(console & GUI):

    {help|man} [command or topic name]

Description

The help command displays help information on the specified command or topic. If help is not followed by a command or a topic name, a description of how to use the help system is displayed. Stata for Unix users may type help or man; they mean the same thing. Stata for Windows, Stata for Macintosh, and Stata for Unix(GUI) users may click on the Help menu. They may also type whelp something to display the help topic for something in Stata's Viewer. whelp typed by itself displays the table of contents for the on-line help.
Remarks

See [U] 8 Stata's on-line help and search facilities for a complete description of how to use help.

Technical Note

When you type help something, Stata first looks along the S_ADO path for something.hlp; see [U] 20.5 Where does Stata look for ado-files?. If nothing is found, it then looks in stata.hlp for the topic.
Also See

Complementary:  [R] search

Related:        [R] net search

Background:     [GSM] 4 Help, [GSW] 4 Help, [GSU] 4 Help,
                [U] 8 Stata's on-line help and search facilities
Title

hetprob -- Maximum-likelihood heteroskedastic probit estimation

Syntax

hetprob depvar [varlist] [weight] [if exp] [in range] , het(varlist [, offset(varname)])
    [ noconstant level(#) asis robust cluster(varname) score(newvar1 [newvar2])
      noskip offset(varname) constraints(numlist) nolrtest nolog maximize_options ]

by ... : may be used with hetprob; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

This command shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { p | xb | sigma } nooffset]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

hetprob estimates a maximum-likelihood heteroskedastic probit model.

See [R] logistic for a list of related estimation commands.
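In the heteroskedastic probit model, the standard deviation of the latent error is modeled as exp(z_j g) using the variables given in het(), so the success probability is Phi(x_j b / exp(z_j g)). A minimal Python sketch of this probability (the linear predictions are illustrative values, not Stata output):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hetprobit_p(xb, zg):
    """Pr(y != 0) when the latent error has standard deviation exp(zg)."""
    return Phi(xb / math.exp(zg))

# With zg = 0 the variance function drops out and this is ordinary probit
p_homo = hetprobit_p(1.2, 0.0)       # equals Phi(1.2)
p_hetero = hetprobit_p(1.2, 0.7)     # a larger error sd pulls the probability toward .5
```

Scaling the error up shrinks the effective index toward zero, which is why ignoring heteroskedasticity in a probit attenuates (or otherwise distorts) the estimated effects.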
Options

het(varlist [, offset(varname)]) specifies the independent variables and the offset variable, if there is one, in the variance function. het() is not optional.

noconstant suppresses the constant term (intercept) in the model.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyprobit command in [R] svy estimators for a command designed for survey data.

hilite -- Highlight a subset of points in a two-way scatterplot

Example

You have data on 956 U.S. cities, including average January temperature, average July temperature, and region. The region variable is coded 1, 2, 3, and 4, with 4 standing for the West. You wish to make a graph showing the relationship between January and July temperatures, highlighting the fourth region:

. hilite tempjan tempjuly, hilite(region==4)
(figure omitted: two-way scatterplot of Average January Temperature versus Average July Temperature, with the points satisfying region==4 highlighted)
It is possible to use graph to produce graphs like this, but hilite is often more convenient.

Technical Note

By default, hilite uses '.' for the plotting symbol and additionally highlights using the 'o' symbol. Its default is equivalent to specifying symbol(.o) as one of the graph_options. You can vary the symbols used, but you must specify exactly two symbols. The first is used to plot all the data and the second is used for overplotting the highlighted subset.

Methods and Formulas

hilite is implemented as an ado-file.

References

Weesie, J. 1999. dr38: Enhancement to the hilite command. Stata Technical Bulletin 50: 17-20. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 98-101.

Also See

Related:        [R] separate

Background:     Stata Graphics Manual
Title

hist -- Categorical variable histogram

Syntax

hist varname [weight] [if exp] [in range] [, incr(#) graph_options]

fweights are allowed; see [U] 14.1.6 weight.

Description

hist graphs a histogram of varname, the result being quite similar to graph varname, histogram. hist, however, is intended for use with integer-coded categorical variables. hist determines the number of bins automatically, the x-axis is automatically labeled, and the labels are centered below the corresponding bar.

hist may only be used with categorical variables with a range of less than 50; i.e., maximum(varname) - minimum(varname) < 50.

Options

incr(#) specifies how the x-axis is to be labeled. incr(1), the default if varname reflects 25 or fewer categories, labels the minimum, minimum + 1, minimum + 2, ..., maximum. incr(2), the default if there are more than 25 categories, would label the minimum, minimum + 2, ..., etc.

graph_options refers to any of the options allowed with graph's histogram style excluding bin(), xlabel(), and xscale(). These do include, for instance, freq, ylabel(), by(), total, and saving(). See [G] histogram.
Remarks

Example

You have a categorical variable rep78 reflecting the repair records of automobiles. It is coded 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, and 5 = Excellent. You could type

. graph rep78, histogram bin(5)

to obtain a histogram. You should specify bin(5) because your categorical variable takes on 5 values and you want one bar per value. (You could omit the option in this case, but only because the default value of bin() is 5; if you had 4 or 6 bars, you would have to specify it; see [G] histogram.) In any case, the resulting graph, while technically correct, is aesthetically displeasing because the numeric code 1 is on the left edge of the first bar while the numeric code 5 is on the right edge of the last bar. Using hist is better:

. hist rep78

(figure omitted: histogram of rep78, x-axis titled "Repair Record 1978", with the numeric codes centered beneath the bars)

hist not only centers the numeric codes underneath the corresponding bar, it also automatically labels all the bars.
You are cautioned: hist is not a general replacement for graph, histogram. hist is intended for use with categorical data only, which is to say, noncontinuous data. If you wanted a histogram of automobile prices, for instance, you would still want to use the graph, histogram command.
Example

You may use any of the options you would with graph, histogram. Using data collected by Voter Research and Surveys on election day, based on questionnaires completed by 15,490 voters from 300 polling places, you draw the following graph. The data were originally printed in the New York Times, November 5, 1992, and reprinted in Lipset (1993).

. hist candi [freq=pop], by(inc) total ylab yline noaxis title(Exit Polling By Family Income)

(figure omitted: histograms of candidate voted for, 1992, by family income, with panels such as "$50-75k" and "$75k+" plus a total panel; overall title "Exit Polling by Family Income")
Technical Note

In both of these examples, each bar is labeled; if your categorical variable takes on many values, you may not want to label them all. Typing

. hist myvar, incr(2)

would label every other bar. Specifying incr(3) would label every third bar, and so on.

Methods and Formulas

hist is implemented as an ado-file.

References

Lipset, S. M. 1993. The significance of the 1992 election. Political Science and Politics 26(1): 7-16.

Also See

Related:        [R] spikeplot, [G] histogram

Background:     Stata Graphics Manual
Title

hotel -- Hotelling's T-squared generalized means test

Syntax

hotel varlist [weight] [if exp] [in range] [, by(varname) notable]

aweights and fweights are allowed; see [U] 14.1.6 weight.

Description

hotel performs Hotelling's T-squared test for testing whether a set of means is zero or, alternatively, equal between two groups.

Options

by(varname) specifies a variable identifying two groups; the test of equality of means between groups is performed. If by() is not specified, a test of the means being jointly zero is performed.

notable suppresses printing the table of the means being compared.

Remarks

hotel performs Hotelling's T-squared test of whether a set of means is zero, or two sets of means are equal. It is a multivariate test that reduces to a standard t test if only one variable is specified.
Example

You wish to test whether a new fuel additive improves gas mileage in both stop-and-go and highway situations. Taking twelve cars, you fill them with gas and run them on a highway-style track, recording their gas mileage. You then refill them and run them on a stop-and-go style track. Finally, you repeat the two runs, but this time use fuel with the additive. Your dataset is

. describe

Contains data from gasexp.dta
  obs:            12
 vars:             5                          13 Jul 2000 13:22
 size:           288 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  car id
bmpg1           float  %9.0g                  track1 before additive
ampg1           float  %9.0g                  track1 after additive
bmpg2           float  %9.0g                  track 2 before additive
ampg2           float  %9.0g                  track 2 after additive
-------------------------------------------------------------------------------
Sorted by:
To perform the statistical test, you jointly test whether the differences in before-and-after results are zero:

. gen diff1 = ampg1 - bmpg1
. gen diff2 = ampg2 - bmpg2
. hotel diff1 diff2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
       diff1 |      12        1.75    2.70101         -3          5
       diff2 |      12    2.083333   2.906367       -3.5        5.5

1-group Hotelling's T-squared = 9.6980676
F test statistic: ((12-2)/(12-1)(2)) x 9.6980676 = 4.4082126

H0: Vector of means is equal to a vector of zeros
    F(2,10) = 4.4082
    Prob > F(2,10) = 0.0424

The means in before-and-after results are different at the 4.24% significance level.
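The conversion from T-squared to F displayed above, F = ((n - k)/((n - 1)k)) T-squared with n = 12 observations and k = 2 variables, is plain arithmetic and can be checked directly (in Python, since the check is just arithmetic):

```python
n, k = 12, 2
T2 = 9.6980676                        # 1-group Hotelling's T-squared from the output
F = (n - k) / ((n - 1) * k) * T2      # distributed F(k, n-k) under H0

print(round(F, 4))   # 4.4082
```

This reproduces the F(2,10) statistic printed in the output.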
Technical Note

We used Hotelling's T-squared test because we were testing two differences jointly. Had there been only one difference, we could have used a standard t test, which would have yielded the same results as Hotelling's test.

We could have performed the test like this:

. ttest ampg1 = bmpg1

    Variable |     Obs        Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
-------------+-------------------------------------------------------------------
       ampg1 |      12       22.75    .9384465    3.250874    20.68449    24.81551
       bmpg1 |      12          21    .7881701    2.730301    19.26525    22.73475
-------------+-------------------------------------------------------------------
        diff |      12        1.75    .7797144     2.70101    .0338602     3.46614

Ho: mean(ampg1 - bmpg1) = mean(diff) = 0

 Ha: mean(diff) < 0        Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
      t =  2.2444               t =  2.2444                t =  2.2444
  P < t =  0.9768           P > |t| =  0.0463           P > t =  0.0232

Or like this:

. ttest diff1 = 0

    Variable |     Obs        Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
-------------+-------------------------------------------------------------------
       diff1 |      12        1.75    .7797144     2.70101    .0338602     3.46614

Degrees of freedom: 11

Ho: mean(diff1) = 0

 Ha: mean < 0              Ha: mean ~= 0              Ha: mean > 0
      t =  2.2444               t =  2.2444                t =  2.2444
  P < t =  0.9768           P > |t| =  0.0463           P > t =  0.0232

Or like this:

. hotel diff1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
       diff1 |      12        1.75    2.70101         -3          5

1-group Hotelling's T-squared = 5.0373832
F test statistic: ((12-1)/(12-1)(1)) x 5.0373832 = 5.0373832

H0: Vector of means is equal to a vector of zeros
    F(1,11) = 5.0374
    Prob > F(1,11) = 0.0463
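The equivalence demonstrated in the technical note (with one variable, the 1-group T-squared is the square of the one-sample t statistic, since T-squared = n xbar^2 / s^2, and the F conversion with k = 1 leaves it unchanged) can be sketched numerically. The sample below is hypothetical, not the manual's data:

```python
import math

# Hypothetical sample of 12 differences (illustrative only)
x = [2.0, -1.0, 3.5, 0.5, 4.0, 1.0, -2.0, 2.5, 3.0, 0.0, 1.5, 2.0]
n = len(x)

mean = sum(x) / n
var = sum((v - mean) ** 2 for v in x) / (n - 1)   # sample variance

t = mean / math.sqrt(var / n)       # one-sample t statistic for H0: mean = 0
T2 = n * mean ** 2 / var            # 1-group Hotelling's T-squared with k = 1
F = (n - 1) / ((n - 1) * 1) * T2    # the F conversion with k = 1 leaves T2 unchanged
```

Whatever the data, T2 equals t squared here, which is exactly the relationship between the t = 2.2444 and T-squared = 5.0373832 values printed above.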
Example

Now consider a variation of the experiment: rather than using 12 cars and running each car with and without the fuel additive, you run 24 cars, 12 with the additive and 12 without. You have the following dataset:

. describe

Contains data from gasexp2.dta
  obs:            24
 vars:             4                          8 Sep 2000 12:19
 size:           480 (97.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  car id
mpg1            float  %9.0g                  track 1
mpg2            float  %9.0g                  track 2
additive        float  %9.0g       yesno      additive?
-------------------------------------------------------------------------------
Sorted by:

. tab additive

  additive? |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |         12       50.00       50.00
        yes |         12       50.00      100.00
------------+-----------------------------------
      Total |         24      100.00

This is an unpaired experiment because there is no natural pairing of the cars; we want to test that the means of mpg1 are equal for the two groups specified by additive, as are the means of mpg2:

(Continued on next page)
icd9 -- ICD-9-CM diagnostic and procedure codes

icd9[p] clean verifies that the variable contains valid ICD-9 codes and, if it does, modifies the variable to contain the codes in either of two standard formats. Use of icd9[p] clean is optional; all icd9[p] commands work equally well with cleaned or uncleaned codes. There are numerous ways of writing the same ICD-9 code, and icd9[p] clean is designed (1) to ensure consistency and (2) to make subsequent output look better.

icd9[p] generate produces new variables based on existing variables containing (cleaned or uncleaned) ICD-9 codes. icd9[p] generate, main produces newvar containing the main code. icd9[p] generate, description produces newvar containing a textual description of the ICD-9 code. icd9[p] generate, range() produces numeric newvar containing 1 if varname records an ICD-9 code in the range listed and 0 otherwise.

icd9[p] lookup and icd9[p] search are utility routines that are useful interactively. icd9[p] lookup simply displays descriptions of the codes specified on the command line, so to find out what diagnostic e913.1 means, you can type icd9 lookup e913.1. The data you have in memory are irrelevant, and remain unchanged, when using icd9[p] lookup. icd9[p] search is like icd9[p] lookup except that it turns the problem around; icd9[p] search looks for relevant ICD-9 codes from the description given on the command line. For instance, you could type icd9 search liver or icd9p search liver to obtain a list of codes containing the word "liver".

icd9[p] query displays the identity of the source from which the ICD-9 codes were obtained and the textual description that icd9[p] uses.

Note that ICD-9 codes are commonly written in two ways, with and without periods. For instance, with diagnostic codes, you can write 001, 86221, E8008, and V822, or you can write 001., 862.21, E800.8, and V82.2. With procedure codes, you can write 01, 50, 502, and 5021, or 01., 50., 50.2, and 50.21. The icd9[p] commands do not care which syntax you use or even whether you are consistent. Case also is irrelevant: V822, V82.2, v822, and v82.2 are all equivalent. Codes may be recorded with or without leading and trailing blanks.
Options for use with icd9[p] check

any tells icd9[p] check to verify that the codes fit the format of ICD-9 codes but not to check whether the codes are actually valid. This makes icd9[p] check run faster. For instance, diagnostic code 230.52 (or 23052 if you prefer) looks valid, but there is no such ICD-9 code. Without the any option, 230.52 (or 23052) would be flagged as an error. With any, 230.52 (and 23052) is not considered an error.

list tells icd9[p] check that invalid codes found in the data (1, 1.1.1, and perhaps 230.52 if any is not specified) are to be individually listed.

generate(newvar) specifies that icd9[p] check is to create new variable newvar containing, for each observation, 0 if the code is valid and a number from 1 to 10 otherwise. The positive numbers indicate the kind of problem and correspond to the listing produced by icd9[p] check. For instance, 10 means the code could be valid, but it turns out not to be on the official list.
Options for use with icd9[p] clean

dots specifies whether periods are to be included in the final format. Do you want the diagnostic codes recorded, for instance, as 86221 or 862.21? Without the dots option, the 86221 format would be used. With the dots option, the 862.21 format would be used.

pad specifies that the codes are to be padded with spaces, front and back, to make the codes line up vertically in listings. Specifying pad makes the resulting codes look better when used with most other Stata commands.
Options for use with icd9[p] generate

main, description, and range(icd9rangelist) specify what icd9[p] generate is to calculate. In all cases, varname specifies a variable containing ICD-9 codes.

main specifies that the main code is to be extracted from the ICD-9 code. For procedure codes, the main code is the first two characters. For diagnostic codes, the main code is usually the first three or four characters (the characters before the dot if the code has dots). In any case, icd9[p] generate does not care whether the code is padded with blanks in front or how strangely it might be written; icd9[p] generate will find the main code and extract it. The resulting variable is itself an ICD-9 code and may be used with the other icd9[p] subcommands. This includes icd9[p] generate, main.

description creates newvar containing descriptions of the ICD-9 codes.

long is for use with description. It specifies that the new variable, in addition to containing the text describing the code, is to contain the code, too. Without long, newvar in an observation might contain "bronchus injury-closed". With long, it would contain "862.21 bronchus injury-closed".

end modifies long (specifying end implies long) and places the code at the end of the string: "bronchus injury-closed 862.21".

range(icd9rangelist) allows you to create indicator variables equal to 1 when the ICD-9 code is in the inclusive range specified.

Options for use with icd9[p] search

or specifies that ICD-9 codes are to be searched for entries that contain any of the words specified after icd9[p] search. The default is to list only entries that contain all the words specified.
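The main-code extraction rule that icd9[p] generate, main applies to diagnostic codes (take what precedes the dot; with no dot, take the first three characters, or four for E-codes) can be restated as a small function. The Python sketch below is our illustration only, not part of icd9; in Stata you would simply use icd9 generate, main:

```python
def main_code(code):
    """Extract the ICD-9 diagnostic main code, tolerating blanks and case."""
    c = code.strip()
    if '.' in c:
        return c.split('.', 1)[0].upper()
    # No dot: E-codes have a 4-character main code, all others 3 characters
    n = 4 if c[:1].lower() == 'e' else 3
    return c[:n].upper()

# The variously written forms of a code all yield the same main code
assert main_code(' 862.21 ') == '862'
assert main_code('86221') == '862'
assert main_code('e8008') == 'E800'
assert main_code('V82.2') == 'V82'
```

Note this is the usual rule only; as the Remarks below explain, a handful of main codes (176, 764, 765, V29, V69) are not themselves defined in the official list.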
Remarks

Let us begin with diagnostic codes, the codes icd9 processes. The format of an ICD-9 diagnostic code is

    [blanks]{0-9,V,v}{0-9}{0-9}[.][0-9[0-9]][blanks]

or

    [blanks]{E,e}{0-9}{0-9}{0-9}[.][0-9[0-9]][blanks]

icd9 can deal with ICD-9 diagnostic codes written in any of the ways the above allows. Items in square brackets are optional; the code might start with some number of blanks. Braces { } indicate required items. The code then either has a digit from 0 to 9 or the letter V (uppercase or lowercase) (first line), or the letter E (uppercase or lowercase) (second line). After that, it has two or more digits, perhaps followed by a period and then up to two more digits, perhaps followed by more blanks.

All of the following meet the above definition:

    001
    001.
    001.9
    0019
    862.2
    862.22
    E800.2
    e8092
    V82.2
Meeting the above definition does not make the code valid. There are 233,100 possible codes meeting the above definition, of which 15,186 are currently defined. Examples of currently defined diagnostic codes include
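The 233,100 figure follows directly from the format: the first line allows 11 choices for the leading character (a digit or V) followed by two digits, the second allows E plus three digits, and either form may carry zero, one, or two trailing digits. The Python snippet below checks that arithmetic and also restates the format as a regular expression; the regex is our restatement for illustration, not part of icd9:

```python
import re

# Trailing-digit possibilities after the main part: none, one digit, or two
suffixes = 1 + 10 + 100                        # 111

possible = (11 * 10 * 10) * suffixes \
         + (10 * 10 * 10) * suffixes           # {0-9,V}dd...  plus  Eddd...
print(possible)   # 233100

# The two format lines as a case-insensitive regex (period optional either way)
fmt = re.compile(r'^\s*(?:[0-9V][0-9]{2}|E[0-9]{3})\.?[0-9]{0,2}\s*$', re.I)

for good in ['001', '001.9', '0019', '86222', '862.22', 'E800.2', 'e8092', 'V82.2']:
    assert fmt.match(good)
for bad in ['1.1.1', 'W123', 'E80']:
    assert not fmt.match(bad)
```

Note that 230.52 also matches this format; format validity and being on the official list are separate checks, which is exactly the distinction the any option draws.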
    code      description
    -----------------------------------------
    001       cholera*
    001.0     cholera d/t vib cholerae
    001.1     cholera d/t vib el tor
    001.9     cholera nos
     ...
    999       complic medical care nec*
    V01       communicable dis contact*
    V01.0     cholera contact
    V01.1     tuberculosis contact
    V01.2     poliomyelitis contact
    V01.3     smallpox contact
    V01.4     rubella contact
    V01.5     rabies contact
    V01.6     venereal dis contact
    V01.7     viral dis contact nec
    V01.8     communic dis contact nec
    V01.9     communic dis contact nos
     ...
    E800      rr collision nos*
    E800.0    rr collision nos-employ
    E800.1    rr coll nos-passenger
    E800.2    rr coll nos-pedestrian
    E800.3    rr coll nos-ped cyclist
    E800.8    rr coll nos-person nec
    E800.9    rr coll nos-person nos
     ...

"Main codes" refer to the part of the code to the left of the period. 001, 002, ..., 999, V01, ..., V82, and E800, ..., E999 are main codes. There are 1,182 diagnostic main codes.

The main code corresponding to a detailed code can be obtained by taking the part of the code to the left of the period, except for codes beginning with 176, 764, 765, V29, and V69. Those main codes are not defined, yet there are more detailed codes under them:
    code       description
    ------------------------------------------------------------------
    176        CODE DOES NOT EXIST, but 8 codes starting with 176 do exist:
    176.0      skin - kpsi's sarcoma
    176.1      sft tisue - kpsi's srcma
    ...
    764        CODE DOES NOT EXIST, but 44 codes starting with 764 do exist:
    764.0      lt-for-dates w/o fet mal*
    764.00     light-for-dates wtnos
    ...
    765        CODE DOES NOT EXIST, but 22 codes starting with 765 do exist:
    765.0      extreme immaturity*
    765.00     extreme immatur wtnos
    ...
    V29        CODE DOES NOT EXIST, but 6 codes starting with V29 do exist:
    V29.0      nb obsrv suspct infect
    V29.1      nb obsrv suspct neurlgcl
    ...
    V69        CODE DOES NOT EXIST, but 6 codes starting with V69 do exist:
    V69.0      lack of physical exercse
    V69.1      inapprt diet eat habits
    ...
Our solution is to define five new codes:

    code       description
    -----------------------------------------
    176        kaposi's sarcoma (Stata)*
    764        light-for-dates (Stata)*
    765        immat & preterm (Stata)*
    V29        nb suspct cnd (Stata)*
    V69        lifestyle (Stata)*
Thus, there are 15,186 + 5 = 15,191 diagnostic codes, of which 1,181 + 5 = 1,186 are main codes.

Things are less confusing with respect to the procedure codes processed by icd9p. The format of ICD-9 procedure codes is

    [blanks]{0-9}{0-9}[.][0-9[0-9]][blanks]

Thus, there are 10,000 possible procedure codes, of which 4,275 are currently valid. The first two digits represent the main code, of which 100 are feasible and 98 are currently used (00 and 17 are not used).
Descriptions

The descriptions given for each of the codes are as found in the original source. The diagnostic codes contain the addition of the five new codes by us. An asterisk on the end of a description indicates that the corresponding ICD-9 code has subcategories.

icd9[p] query reports the original source of the information on the codes:
. icd9 query
 _dta:
   Dataset obtained 24aug1999 from http://www.hcfa.gov/stats/pufiles.htm
   file http://www.hcfa.gov/stats/icd9v16.exe
   Codes 176, 764, 765, V29, and V69 defined
     176  kaposi's sarcoma (Stata)*
     764  light-for-dates (Stata)*
     765  immat & preterm (Stata)*
     V29  nb suspct cnd (Stata)*
     V69  lifestyle (Stata)*

. icd9p query
 _dta:
   Dataset obtained 24aug1999 from http://www.hcfa.gov/stats/pufiles.htm
   file http://www.hcfa.gov/stats/icd9v16.exe
Example

You have a dataset containing up to three diagnostic codes and up to two procedure codes on a sample of 1,000 patients:

. use patients, clear
. list in 1/10
  (output omitted)
Do not try to make sense of these data because, in constructing this example, the diagnostic and procedure codes were randomly selected.

Begin by noting that variable diag1 is recorded sloppily--sometimes the dot notation is used and sometimes not, and sometimes there are leading blanks. That does not matter. We decide to begin by using icd9 clean to clean up this variable:

. icd9 clean diag1
diag1 contains invalid ICD-9 codes
r(459);

icd9 clean refused because there are invalid codes among the 1,000 observations. We can use icd9 check to find and flag the problem observations (or observation, as in this case):
                                                       Number of obs =     664
                                                       F(  6,   657) =  212.55
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.6600
                                                       Adj R-squared =  0.6569
                                                       Root MSE      =  .26228

------------------------------------------------------------------------------
      ln_eat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_rsales_pc |   .6611241    .026623    24.83   0.000     .6088476    .7134006
     jantemp |   .0019624   .0007601     2.58   0.010     .0004698     .003455
precipitat~n |  -.0014811   .0008433    -1.70   0.090    -.0030869    .0001247
   ln_income |   .1158486    .056352     2.06   0.040     .0051969    .2265003
  median_age |  -.0010863   .0002823    -3.85   0.000    -.0016407   -.0005319
      hhsize |  -.0050407   .0004243   -11.88   0.000    -.0058739   -.0042076
       _cons |  -1.377592   .4777641    -2.88   0.004     -2.31572    -.459463
------------------------------------------------------------------------------
Despite having data on 898 cities, your regression was estimated on only 664 cities--74% of the original 898. Some 234 observations were unused due to missing data. In this case, when you type summarize, you discover that each of the independent variables has missing values, so the problem is not that one variable is missing in 26% of the observations, but that each of the variables is missing in some observations. In fact, summarize reveals that each of the variables is missing in roughly 5% of the observations. We lost 26% of our data because, in aggregate, 26% of the observations have one or more missing variables. Thus, we impute each independent variable on the basis of the other independent variables:

. impute ln_rtl jantemp precip ln_inc medage hhsize, gen(i_ln_rtl)
4.90% (44) observations imputed
. impute jantemp ln_rtl precip ln_inc medage hhsize, gen(i_jantmp)
5.90% (53) observations imputed
impute -- Predict missing values
. impute precip ln_rtl jantemp ln_inc medage hhsize, gen(i_precip)
4.56% (41) observations imputed
. impute ln_inc ln_rtl jantemp precip medage hhsize, gen(i_ln_inc)
4.34% (39) observations imputed
. impute medage ln_rtl jantemp precip ln_inc hhsize, gen(i_medage)
4.45% (40) observations imputed
. impute hhsize ln_rtl jantemp precip ln_inc medage, gen(i_hhsize)
5.23% (47) observations imputed
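The idea behind these commands can be sketched in a few lines of Python. This is our illustration of regression-based single imputation with one predictor, not Stata's implementation, which regresses on whichever predictors are nonmissing observation by observation:

```python
# Our sketch of regression-based imputation (not Stata's impute):
# fit OLS of y on x over the complete cases, then replace each
# missing y (represented as None) with its fitted value.
def impute_simple(y, x):
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = sum(xi for xi, _ in pairs) / len(pairs)
    my = sum(yi for _, yi in pairs) / len(pairs)
    b = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    a = my - b * mx
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
```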
That done, we can now re-estimate the regression on the imputed variables:

. regress ln_eat i_ln_rtl i_jantmp i_precip i_ln_inc i_medage i_hhsize

      Source |       SS       df       MS              Number of obs =     898
-------------+------------------------------           F(  6,   891) =  253.41
       Model |  108.859231     6  18.1432051           Prob > F      =  0.0000
    Residual |  63.7929145   891  .071596986           R-squared     =  0.6305
-------------+------------------------------           Adj R-squared =  0.6280
       Total |  172.652145   897  .192477308           Root MSE      =  .26758

------------------------------------------------------------------------------
      ln_eat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    i_ln_rtl |   .6609061   .0245827    26.89   0.000     .6126593    .7091528
    i_jantmp |   .0021019   .0006932     3.03   0.002     .0007414    .0034625
    i_precip |  -.0013268   .0007646    -1.74   0.083    -.0028275    .0001739
    i_ln_inc |    .095863   .0510231     1.88   0.061    -.0042764    .1960024
    i_medage |  -.0011234   .0002584    -4.35   0.000    -.0016304   -.0006163
    i_hhsize |  -.0052508   .0003953   -13.28   0.000    -.0060267    -.004475
       _cons |  -1.143142   .4304284    -2.66   0.008    -1.987914   -.2983702
------------------------------------------------------------------------------
Note that the regression is now estimated on all 898 observations.

Example

impute can also be used with factor to extend factor score estimates to cases with missing data. For instance, we have a variant of the automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets) that contains a few additional variables. We will begin by factoring all but the price variable; see [R] factor.
. factor mpg-foreign, factors(4)
(obs=66)
             (principal factors; 4 factors retained)
      Factor   Eigenvalue   Difference   Proportion   Cumulative
 ----------------------------------------------------------------
        1       6.99066      5.59538       0.7596       0.7596
        2       1.39528      0.80576       0.1516       0.9112
        3       0.58952      0.29082       0.0641       0.9753
        4       0.29870      0.05618       0.0325       1.0077
        5       0.24252      0.11654       0.0264       1.0341
        6       0.12598      0.08970       0.0137       1.0478
        7       0.03628      0.05085       0.0039       1.0517
        8      -0.01457      0.01275      -0.0016       1.0501
        9      -0.02732      0.02860      -0.0030       1.0472
       10      -0.05591      0.05736      -0.0061       1.0411
       11      -0.11327      0.00564      -0.0123       1.0288
       12      -0.11891      0.02714      -0.0129       1.0159
       13      -0.14605          .        -0.0159       1.0000
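The Proportion column is each eigenvalue divided by the sum of all the eigenvalues, which is why the trailing proportions are negative and the cumulative column returns to 1.0000. A quick check of that arithmetic (our sketch, not factor's code):

```python
# Our check of the Proportion column above: each eigenvalue divided
# by the sum of all eigenvalues.
def proportions(eigenvalues):
    total = sum(eigenvalues)
    return [e / total for e in eigenvalues]
```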
                                Factor Loadings
        Variable |        1         2         3         4   Uniqueness
    -------------+----------------------------------------------------
             mpg | -0.78200  -0.02985  -0.06546   0.33951      0.26803
           rep78 | -0.51076   0.68322  -0.11181  -0.01428      0.25963
           rep77 | -0.27332   0.70653  -0.32005   0.04710      0.32145
        headroom |  0.56480   0.26549   0.29651   0.16485      0.49542
       rear_seat |  0.66135   0.20473   0.36471   0.02062      0.38727
           trunk |  0.72935   0.37095   0.28176   0.12140      0.23633
          weight |  0.95127   0.10135  -0.18056  -0.09179      0.04378
          length |  0.94621   0.19595  -0.05372  -0.10325      0.05274
            turn |  0.88264  -0.05607  -0.08502   0.01169      0.21043
    displacement |  0.92199   0.06333  -0.17349  -0.02554      0.11518
      gear_ratio | -0.82782   0.06672   0.24558  -0.10994      0.23787
           order | -0.25907   0.15344   0.01622   0.14668      0.88756
         foreign | -0.75728   0.30756   0.19130  -0.29188      0.21014
There appear to be two factors here. Let's pretend that we have given the first two factors an interpretation; we might interpret the first factor as size. We now obtain the factor scores:
. score f1 f2
(based on unrotated factors)
(2 scorings not used)

               Scoring Coefficients
        Variable |        1         2
    -------------+--------------------
             mpg | -0.02094   0.11107
           rep78 | -0.03224   0.44562
           rep77 | -0.11150   0.27942
        headroom |  0.05530   0.10017
       rear_seat |  0.03355   0.02812
           trunk |  0.04603   0.20622
          weight |  0.12250  -0.13040
          length |  0.39997   0.60223
            turn |  0.04562  -0.12825
    displacement |  0.19281   0.11611
      gear_ratio | -0.08534   0.03528
           order |  0.00638   0.06433
         foreign | -0.06469   0.28292

Although it is not revealed by this output, in 8 cases the scores could not be calculated because of missing values (we would see that if we typed summarize). To impute the factor scores to all the observations:
. impute f1 mpg-foreign, gen(i_f1)
10.81% (8) observations imputed
. impute f2 mpg-foreign, gen(i_f2)
10.81% (8) observations imputed
we _ ight
now
run
a regression
of price
(Continued
in terms
on next
of the
page)
two
thctors:
the
be calculated factor
scores
because
of
to all the
. regress price i_f1 i_f2

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   11.88
       Model |   159223103     2  79611551.5           Prob > F      =  0.0000
    Residual |   475842293    71  6702004.13           R-squared     =  0.2507
-------------+------------------------------           Adj R-squared =  0.2296
       Total |   635065396    73  8699525.97           Root MSE      =  2588.8

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        i_f1 |   1225.347   315.7177     3.88   0.000     595.8234     1854.87
        i_f2 |   911.2878   339.9821     2.68   0.009     233.3827    1589.193
       _cons |   6262.285   301.7093    20.76   0.000     5660.694    6863.877
------------------------------------------------------------------------------
Methods and Formulas

impute is implemented as an ado-file.

Consider the command

    . impute y x1 x2 ... xk, gen(yhat) varp(v)

When y_j is not missing, yhat_j = y_j and v_j = 0.

Let y_j be an observation for which y is missing. A regressor list is formed from x1, x2, ..., xk containing all x's for which x_ij is not missing. If the resulting list is empty, yhat_j and v_j are set to missing. Otherwise, a regression of y on the list is estimated (see [R] regress) and yhat_j is defined as the predicted value of y_j (see [R] predict). v_j is defined as the square of the standard error of the prediction, as calculated by predict, stdp; see [R] predict.
References

Goldstein, R. 1996. sed10: Patterns of missing data. Stata Technical Bulletin 32: 12-13. Reprinted in Stata Technical Bulletin Reprints, vol. 6, p. 15.

------. 1996. sed10.1: Patterns of missing data, update. Stata Technical Bulletin 33: 2. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 15-16.

Little, R. J. A. and D. B. Rubin. 1987. Statistical Analysis with Missing Data. New York: John Wiley & Sons.

Mander, A. and D. Clayton. 1999. sg116: Hotdeck imputation. Stata Technical Bulletin 51: 32-34. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 196-199.

------. 2000. sg116.1: Update to hotdeck imputation. Stata Technical Bulletin 54: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 199.
Also See

Complementary:    [R] predict

Related:          [R] ipolate, [R] regress
Title

    infile -- Quick reference for reading data into Stata

Description

This entry provides a quick reference for determining which method to use for reading non-Stata data into memory. See [U] 24 Commands to input data for more details.

Remarks

Summary of the different methods

insheet

o  insheet reads text (ASCII) files created by a spreadsheet or a database program.

o  The data must be tab-separated or comma-separated, but not both simultaneously, nor can it be space-separated.

o  A single observation must be on only one line.
o  The first line in the file can optionally contain the names of the variables.

infile (free format) -- infile without a dictionary

o  The data can be space-separated, tab-separated, or comma-separated.

o  Strings with embedded spaces or commas must be enclosed in quotes (even if tab- or comma-separated).

o  A single observation can be on more than one line or there can even be multiple observations per line.

infix (fixed format)

o  The data must be in fixed-column format.

o  A single observation can be on more than one line.

o  infix has simpler syntax than infile (fixed format).

infile (fixed format) -- infile with a dictionary

o  The data may be in fixed-column format.

o  A single observation can be on more than one line.

o  infile (fixed format) has the most capabilities for reading data.
Examples

Example

        ---- top of exampl.raw ----
1       0       1       John Smith      m
0       0       1       Paul Lin        m
0       1       0       Jan Doe         f
0       0       .       Julie McDonald  f
        ---- end of exampl.raw ----

contains tab-separated data. The type command with the showtabs option shows the tabs:

. type exampl.raw, showtabs
1<T>0<T>1<T>John Smith<T>m
0<T>0<T>1<T>Paul Lin<T>m
0<T>1<T>0<T>Jan Doe<T>f
0<T>0<T>.<T>Julie McDonald<T>f
It could be read in by

. insheet a b c name gender using exampl
Example

        ---- top of examp2.raw ----
a,b,c,name,gender
1,0,1,John Smith,m
0,0,1,Paul Lin,m
0,1,0,Jan Doe,f
0,0,,Julie McDonald,f
        ---- end of examp2.raw ----

could be read in by

. insheet using examp2
Example

        ---- top of examp3.raw ----
1       0       1       "John Smith"        m
0       0       1       "Paul Lin"          m
0       1       0       "Jan Doe"           f
0       0       .       "Julie McDonald"    f
        ---- end of examp3.raw ----

contains tab-separated data with strings in double quotes.

. type examp3.raw, showtabs
Example

The _line() and _lines() directives instruct Stata how to read your data when there are multiple records per observation. You have the following in mydata2.raw:

        ---- top of mydata2.raw ----
id income educ sex age
1024 25000 HS
Male
28
1025 27000 C
Female
24
1035 26000 HS
Male
32
1036 25000 C
Female
25
        ---- end of mydata2.raw ----

You can read this with dictionary mydata2.dct, which we will just let Stata list as it simultaneously reads the data:

. infile using mydata2, clear
infile (fixed format) -- Read ASCII (text) data in fixed format with a dictionary

infile dictionary using mydata2 {
    _first(2)          * Begin reading on line 2.
    _lines(3)          * Each observation takes 3 lines.
                       * Since _line is not specified, Stata
                       * assumes that it is 1.
    int    id      "Identification Number"
           income  "Annual income"
    str2   educ    "Highest educ level"
    _line(2)           * Go to line 2 of the observation.
    str6   sex         * (values for sex are located on line 2)
    _line(3)           * Go to line 3 of the observation.
    int    age         * (values for age are located on line 3)
}
(4 observations read)
. list

          id   income   educ      sex   age
 1.     1024    25000     HS     Male    28
 2.     1025    27000      C   Female    24
 3.     1035    26000     HS     Male    32
 4.     1036    25000      C   Female    25
Now, here is the really good part: we read these variables in order, but that was not necessary. We could just as well have used the dictionary

        ---- top of mydata2p.dct ----
infile dictionary using mydata2 {
    _first(2)
    _lines(3)
    _line(1)
    int    id      "Identification number"
           income  "Annual income"
    str2   educ    "Highest educ level"
    _line(3)
    int    age
    _line(2)
    str6   sex
}
        ---- end of mydata2p.dct ----

We would obtain the same results--and just as quickly--the only difference being that our variables in the final dataset would be in the order specified: id, income, educ, age, and sex.
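The bookkeeping that _first() and _lines() imply can be sketched outside Stata as well. This Python illustration (ours, not infile's code) groups raw lines into three-line observations after skipping a one-line header:

```python
# Our sketch of _first(2) and _lines(3): skip to the first data line,
# then take each run of three lines as one observation.
def group_records(lines, first=2, per_obs=3):
    body = lines[first - 1:]          # _first(2): data start on line 2
    return [body[i:i + per_obs] for i in range(0, len(body), per_obs)]
```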
Technical Note

You can use _newline to specify where breaks occur, if you prefer:

        ---- top of highway.dct, example 5 ----
infile dictionary {
    acc_rate  "Acc. Rate/Million Miles"
    spdlimit  "Speed Limit (mph)"
    _newline
    acc_pts   "Access Pts/Mile"
}
4.58 55
4.6
2.86 60
4.4
1.61 .
2.2
3.02 60
4.7
        ---- end of highway.dct, example 5 ----

The line that reads '1.61 .' could have been read 1.61 (without the period), and the results would have been unchanged. Since dictionaries do not go to new lines automatically, a missing value is assumed for all values not found in the record.
Reading fixed-format files

Values in formatted data are sometimes packed one against the other with no intervening blanks. For instance, the highway data might appear as

        ---- top of highway.raw, example 6 ----
4.58554.6
2.86604.4
1.61  2.2
3.02604.7
        ---- end of highway.raw, example 6 ----

The first four columns of each record represent the accident rate; the next two columns, the speed limit; and the last three columns, the number of access points per mile.

To read these data, you must specify the %infmt in the dictionary. Numeric %infmts are denoted by a leading percent sign (%) followed optionally by a string of the form w or w.d, where w and d stand for two integers. The first integer, w, specifies the width of the format. The second integer, d, specifies the number of digits that are to follow the decimal point. Logic requires that d be less than or equal to w. Finally, a character denoting the format type (f, g, or e) is appended. For example, %9.2f specifies an f format that is nine characters wide and has two digits following the decimal point.
Numeric formats

The f format indicates that infile is to attempt to read the data as a number. When you do not specify the %infmt in the dictionary, infile assumes the %f format. The missing width w means that infile is to attempt to read the data in free format.

At the start of each observation, infile reads a record into its buffer and sets a column pointer to 1, indicating that it is currently on the first column. When infile processes a %f format, it moves the column pointer forward through white space. It then collects the characters up to the next occurrence of white space and attempts to interpret those characters as a number. The column pointer is left at the first occurrence of white space following those characters. If the next variable is also free format, the logic repeats.

When you explicitly specify the field width w, as in %wf, infile does not skip leading white space. Instead, it collects the next w characters starting at the column pointer and attempts to interpret the result as a number. The column pointer is left at the old value of the column pointer plus w, that is, on the first character following the specified field.
Example

If the data above are stored in highway.raw, you could create the following dictionary to read the data:

        ---- top of highway.dct, example 6 ----
infile dictionary using highway {
    acc_rate  %4f  "Acc. Rate/Million Miles"
    spdlimit  %2f  "Speed Limit (mph)"
    acc_pts   %3f  "Access Pts/Mile"
}
        ---- end of highway.dct, example 6 ----

When you explicitly indicate the field width, infile does not skip intervening characters. The first four columns are used for the variable acc_rate, the next two for spdlimit, and the last three for acc_pts.
# lines states the number of lines per observation in the file. Simple datasets typically have 1 lines. Large datasets often have many lines (sometimes called records) per observation. # lines is optional even when there is more than one line per observation because infix can sometimes figure it out for itself. Still, if # lines is not right for your data, it is best to specify the directive.

# lines appears only once in the specifications.

#: tells infix to jump to line # of the observation. Consider a file with 4 lines, meaning four lines per observation. 2: says to go to the second line of the observation. 4: says to go to the fourth line of the observation. You may jump forward or backward: infix does not care, nor is there any inefficiency in going forward to 3:, reading a few variables, jumping back to 1:, reading another variable, and jumping back again to 3:.

It is not your responsibility to ensure that, at the end of your specification, you are on the last line of the observation. infix knows how to get to the next observation because it knows where you are and it knows # lines, the total number of lines per observation.

#: may appear, and typically does, many times in the specifications.

/ is an alternative to #:. / goes forward one line. // goes forward two lines. We do not recommend the use of / because #: is better. If you are currently on line 2 of an observation and want to get to line 6, you could type ////, but your meaning is clearer if you type 6:.

/ may appear many times in the specifications.

[byte|int|float|long|double|str] varlist [#:]#[-#] instructs infix to read a variable and, sometimes, more than one.

Begin by realizing that the simplest form of this is varname #, such as sex 20. That says that variable varname is to be read from column # of the current line: variable sex is to be read from column 20, and here, sex is a one-digit number.

varname #-#, such as age 21-23, says to read the variable from the column range specified; read age from columns 21 through 23, and here, age is a three-digit number.

You can prefix the variable with a storage type. str name 25-44 means to read the string variable name from columns 25 through 44. If you do not specify str, the variable is assumed to be numeric. You can specify the numeric subtype if you wish.
infix (fixed format) -- Read ASCII (text) data in fixed format

You can specify more than one variable, with or without a type. byte q1-q5 51-55 means read variables q1, q2, q3, q4, and q5 from columns 51 through 55 and store the five variables as bytes.

Finally, you can specify the line on which the variable(s) appear. age 2:21-23 says that age is to be obtained from the second line, columns 21 through 23. Another way to do this is to put together the #: directive with the input-variable directive: 2: age 21-23. There is a difference, but not with respect to reading the variable age. Let's consider two alternatives:

    1:  str name 25-44  age 2:21-23  q1-q5 51-55
    1:  str name 25-44  2:  age 21-23  q1-q5 51-55

The difference is that the first directive says variables q1 through q5 are on line 1 whereas the second says they are on line 2. When the colon is put out front, it says on which line variables are to be found when we do not explicitly say otherwise. When the colon is put inside, it applies only to the variable under consideration.
Remarks

There are two ways to use infix. One is to type the specifications that describe how to read the fixed-format data on the command line:

. infix acc_rate 1-4 spdlimit 6-7 acc_pts 9-11 using highway.raw

The other is to type the specifications into a file,

        ---- top of highway.dct, example 1 ----
infix dictionary using highway.raw {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
        ---- end of highway.dct, example 1 ----

and then, inside Stata, type

. infix using highway.dct

Which you use makes no difference to Stata. The first form is more convenient if there are only a few variables, and the second form is less prone to error if you are reading a big, complicated file.

The second form allows two variations: the one we just showed--where the data are in another file--and one where the data are in the same file as the dictionary:
        ---- top of highway.dct, example 2 ----
infix dictionary {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
4.58 55 .46
2.86 60 4.4
1.61    2.2
3.02 60 4.7
        ---- end of highway.dct, example 2 ----

Note that in the first example, the top line of the file read infix dictionary using highway.raw whereas in the second the top line reads simply infix dictionary. When you do not say where the data are, it is implied that the data follow the dictionary.
Example

So let's complete the example we started. You have a dataset on the accident rate per million vehicle miles along a stretch of highway, the speed limit on that highway, and the number of access points per mile. You have created the dictionary file highway.dct, which contains the dictionary and the data:

        ---- top of highway.dct, example 2 ----
infix dictionary {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
4.58 55 .46
2.86 60 4.4
1.61    2.2
3.02 60 4.7
        ---- end of highway.dct, example 2 ----

You created this file outside of Stata using an editor or word processor. Inside Stata, you now read the data. infix lists the dictionary so you will know the directives it follows:

. infix using highway
infix dictionary {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
(4 observations read)

. list

       acc_rate   spdlimit   acc_pts
  1.       4.58         55       .46
  2.       2.86         60       4.4
  3.       1.61          .       2.2
  4.       3.02         60       4.7
Note that we simply typed infix using highway rather than infix using highway.dct. When we do not specify the file extension, infix assumes we mean .dct.

Example

Consider the following raw data:
        ---- top of mydata.raw ----
id income educ / sex age / rcode, answers to questions 1-5
1024 25000 HS
     Male   28
     1 1 9 5 0 3
1025 27000 C
     Female 24
     0 2 2 1 1 3
1035 26000 HS
     Male   32
     1 1 0 3 2 1
1036 25000 C
     Female 25
     1 3 1 2 3 2
        ---- end of mydata.raw ----

This dataset has 3 lines per observation, and the first line is just a comment. One possible set of specifications to read these data is

        ---- top of mydata1.dct ----
infix dictionary using mydata {
    2 first
    3 lines
    1:  id        1-4
        income    6-10
        str educ  12-13
    2:  str sex   6-11
        int age   13-14
    3:  rcode     6
        q1-q5     7-16
}
        ---- end of mydata1.dct ----

although we prefer
        ---- top of mydata2.dct ----
infix dictionary using mydata {
    2 first
    3 lines
    id        1:  1-4
    income    1:  6-10
    str educ  1:  12-13
    str sex   2:  6-11
    int age   2:  13-14
    rcode     3:  6
    q1-q5     3:  7-16
}
        ---- end of mydata2.dct ----

Either will read these data, so we will use the first and then explain why we prefer the second.

. infix using mydata1
infix dictionary using mydata {
    2 first
    3 lines
    1:  id        1-4
        income    6-10
        str educ  12-13
    2:  str sex   6-11
        int age   13-14
    3:  rcode     6
        q1-q5     7-16
}
(4 observations read)

. list in 1/2

Observation 1

        id  1024        income  25000        educ  HS
       sex  Male           age  28          rcode  1
        q1  1               q2  9              q3  5
        q4  0               q5  3

Observation 2

        id  1025        income  27000        educ  C
       sex  Female         age  24          rcode  0
        q1  2               q2  2              q3  1
        q4  1               q5  3
Now, what is better about the second? What is better is that the location of each variable is completely documented on each line, in terms of both line number and column. Since infix does not care about the order in which we read the variables, we could take the dictionary, jumble the lines, and it would still work. For instance,

        ---- top of mydata3.dct ----
infix dictionary using mydata {
    2 first
    3 lines
    str sex   2:  6-11
    rcode     3:  6
    str educ  1:  12-13
    int age   2:  13-14
    id        1:  1-4
    q1-q5     3:  7-16
    income    1:  6-10
}
        ---- end of mydata3.dct ----
will also read these data even though, for each observation, we start on line 2, go forward to line 3, jump back to line 1, and end up on line 1. It is not even inefficient to do this, because infix does not really jump to record 2, then record 3, then record 1 again, etc. infix takes what we say and organizes it efficiently. The order in which we say it makes no difference.

Well, it does make one: the order of the variables in the resulting Stata dataset will be the order we specify. In this case the reordering is senseless but, in real datasets, reordering variables is often desirable. Moreover, we often construct dictionaries, realize that we omitted a variable, and then go back and modify them. By making each line complete in and of itself, we can add new variables anywhere in the dictionary and not worry that, because of our addition, something that occurs later will no longer read correctly.

For instance, to read only the first 100 observations, you could type

. infix 1: id 1-6 str name 7-36 2: age 1-2 str sex 4 using emp1.raw in 1/100

Or, if the specifications were instead recorded in a dictionary and you wanted observations 101 through 573, you could type

. infix using emp2.dct in 101/573
Also See

Complementary:    [R] outfile, [R] outsheet, [R] save

Related:          [R] infile (fixed format), [R] insheet

Background:       [U] 24 Commands to input data, [R] infile
Title

    input -- Enter data from keyboard

Syntax

    input [varlist] [, automatic label]

Description

input allows you to type data directly into the dataset in memory. Also see [R] edit for a windowed alternative to input.
Options

automatic causes Stata to create value labels from the nonnumeric data it encounters. It also automatically widens the display format to fit the longest label. Specifying automatic implies label, even if you do not explicitly type the label option.

label allows you to type the labels (strings) instead of the numeric values for variables associated with value labels. New value labels are not automatically created unless automatic is specified.
Remarks

If there are no data in memory when you type input, you must specify a varlist. Stata will then prompt you to enter the new observations until you type end.

Example

You have data on the accident rate per million vehicle miles along a stretch of highway, along with the speed limit on that highway. You wish to type these data directly into Stata:

. input
nothing to input
r(104);

Typing input by itself does not provide enough information about your intentions. Stata needs to know the names of the variables you wish to create.

. input acc_rate spdlimit
          acc_rate   spdlimit
  1. 4.58 55
  2. 2.86 60
  3. 1.61 .
  4. end
We typed input acc_rate spdlimit, and Stata responded by repeating the variable names and then prompting us for the first observation. We then typed 4.58 and 55 and pressed Return. Stata prompted us for the second observation. We entered it and pressed Return. Stata prompted us for the third observation. We knew that the accident rate is 1.61 per million vehicle miles, but we did not know the corresponding speed limit for the highway. We typed the number we knew, 1.61, followed by a period, the missing value indicator. When we pressed Return, Stata prompted us for the fourth observation. We were finished entering our data, so we typed end in lowercase letters.

We can now list the data to verify that we have entered them correctly:

. list

        acc_rate   spdlimit
  1.        4.58         55
  2.        2.86         60
  3.        1.61          .

If you have data in memory and type input without a varlist, you will be prompted to enter additional information on all the variables. This continues until you type end.
Example

You now have an additional observation you wish to add to the dataset. Typing input by itself tells Stata that you wish to add new observations:

. input
          acc_rate   spdlimit
  4. 3.02 60
  5. end

Stata reminded us of the names of our variables and prompted us for the fourth observation. We entered the numbers 3.02 and 60 and pressed Return. Stata then prompted us for the fifth observation. We could add as many new observations as we wish. Since we needed to add only one observation, we typed end. Our dataset now has four observations.
You may add new variables to the data in memory by typing input followed by the names of the new variables. Stata will begin by prompting you for the first observation, then the second, and so on, until you type end or enter the last observation.

Example

In addition to the accident rate and speed limit, we now obtain data on the number of access points (on-ramps and off-ramps) per mile along each stretch of highway. We wish to enter the new data:

. input acc_pts
          acc_pts
  1. 4.6
  2. 4.4
  3. 2.2
  4. 4.7
When we typed input acc_pts, Stata responded by prompting us for the first observation. There are 4.6 access points per mile for the first highway, so we entered 4.6 and pressed Return. Stata then prompted us for the second observation, and so on. We entered each of the numbers. When we entered the final observation, Stata automatically stopped prompting us--we did not have to type end. Stata knows that there are four observations in memory, and since we are adding a new variable, it stops automatically.

We can, however, type end anytime we wish. If we do so, Stata fills the remaining observations on the new variables with missing. To illustrate this, we enter one more variable to our data and then list the result:

. input junk
          junk
  1. 1
  2. 2
  3. end

. list

        acc_rate   spdlimit   acc_pts   junk
  1.        4.58         55       4.6      1
  2.        2.86         60       4.4      2
  3.        1.61          .       2.2      .
  4.        3.02         60       4.7      .
You can input string variables using input, but you must remember to explicitly indicate that the variables are strings by specifying the type of the variable before the variable's name.

Example

String variables are indicated by the types str#, where # represents the storage length, or maximum length, of the variable. For instance, a str4 variable has maximum length 4, meaning it can contain the strings a, ab, abc, and abcd, but not abcde. Strings shorter than the maximum length can be stored in the variable, but strings longer than the maximum length cannot. You can create variables up to str80 in Stata.

Since a str80 variable can store strings shorter than 80 characters, you might wonder why you should not make all your string variables str80. You do not want to do this because Stata allocates space for strings based on their maximum length. It would waste the computer's memory.

Let's assume that we have no data in memory and wish to enter the following data:

. input str16 name age str6 sex
                  name        age        sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. Guy Fawkes 48 male
'Fawkes' cannot be read as a number
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end

We first typed input str16 name age str6 sex, meaning that name is to be a str16 variable and sex a str6 variable. Since we did not specify anything about age, Stata made it a numeric variable.
Stata then prompted us to enter our data. On the first line, the name is Arthur Doyle, which we typed in double quotes. The double quotes are not really part of the string; they merely delimit the
beginning and end of the string. We followed that with Mr. Doyle's age, 22, and his sex, male. We did not bother to type double quotes around the word male because it contained no blanks or special characters. For the second observation, we did type the double quotes around female; it changed nothing. In the third observation we omitted the double quotes around the name, and Stata informed us that Fawkes could not be read as a number and reprompted us for the observation. When we omitted the double quotes, Stata interpreted Guy as the name, Fawkes as the age, and 48 as the sex. All of this would have been okay with Stata except for one problem: Fawkes looks nothing like a number, so Stata complained and gave us another chance. This time, we remembered to put the double quotes around the name.
Stata was satisfied, and we continued. We entered the fourth observation and then typed end. Here is our dataset:

. list

               name   age      sex
  1.   Arthur Doyle    22     male
  2.      Mary Hope    37   female
  3.     Guy Fawkes    48     male
  4.  Kriste Yeager    25   female

> Example

Just as we indicated which variables were strings by placing a storage type in front of the variable name, we can indicate the storage type of our numeric variables as well. Stata has five numeric storage types: byte, int, long, float, and double. When you do not specify the storage type, Stata assumes the variable is a float. You may want to review the definitions of numbers in [U] 15 Data.
There are two reasons you might want to explicitly specify the storage type: to induce additional precision or to conserve memory. The default type float has plenty of precision for most circumstances because Stata performs all calculations in double precision no matter how the data are stored. If you were storing 9-digit Social Security Numbers, however, you would want to coerce a different storage type or else the last digit would be rounded. long would be the best choice; double would work equally well, but it would waste memory.

Sometimes you do not need to store a variable as float. If the variable contains only integers between -32,768 and 32,766, it can be stored as an int and would take only half the space. If a variable contains only integers between -127 and 126, it can be stored as a byte, which would take only half again as much space. For instance, in the previous example we entered the data for age without explicitly specifying the storage type; hence, it was a float. It would have been better to store it as a byte. To do that, we would have typed

. input str16 name byte age str6 sex

          name       age    sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end
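The rounding that makes float a poor choice for 9-digit identifiers is easy to demonstrate outside of Stata. The following Python sketch (purely illustrative; Python's struct module stands in for Stata's 4-byte float and long types, and the variable names are ours) round-trips a 9-digit number through each representation:

```python
import struct

ssn = 123456789  # a 9-digit "Social Security Number"

# Round-tripped through a 4-byte IEEE float (like Stata's float type),
# the value is rounded because the mantissa holds only about 7 digits:
as_float = struct.unpack('<f', struct.pack('<f', float(ssn)))[0]
print(int(as_float))   # 123456792 -- the last digit is lost

# A 4-byte integer (like Stata's long type) holds it exactly:
as_long = struct.unpack('<i', struct.pack('<i', ssn))[0]
print(as_long)         # 123456789
```

At this magnitude, adjacent 4-byte floats are 8 apart, which is why the stored value lands on 123456792 rather than 123456789.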
Stata understands a number of shorthands. For instance,

. input int(a b) c

allows you to input three variables, a, b, and c, and makes both a and b ints and c a float. Remember,

. input int a b c

would make a an int but both b and c floats.

. input a long b double(c d) e

would make a a float, b a long, c and d doubles, and e a float.

Stata has a shorthand for variable names with numeric suffixes. Typing v1-v4 is equivalent to typing v1 v2 v3 v4. Thus,

. input int(v1-v4)

inputs four variables and stores them as ints.
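The numeric-suffix shorthand is simple to emulate. This Python sketch (an illustration only; the function name and regular expression are ours, not part of Stata) expands a v1-v4 style token the way input does:

```python
import re

def expand_varlist(token):
    """Expand a name range like 'v1-v4' into ['v1', 'v2', 'v3', 'v4']."""
    m = re.fullmatch(r'([a-zA-Z_]\w*?)(\d+)-\1(\d+)', token)
    if not m:
        return [token]          # not a numeric-suffix range; leave as-is
    stub, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f'{stub}{i}' for i in range(lo, hi + 1)]

print(expand_varlist('v1-v4'))  # ['v1', 'v2', 'v3', 'v4']
print(expand_varlist('price'))  # ['price']
```

The backreference \1 ensures that both endpoints share the same stub, so a token like price passes through unchanged.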
Technical Note

You may want to stop reading now. The rest of this section deals with using input with value labels. If you are not familiar with value labels, you should first read [U] 15.6.3 Value labels.

Remember that value labels map numbers into words and vice versa. There are two aspects to the process. First, we must define the association between numbers and words. We might tell Stata that 0 corresponds to male and 1 corresponds to female by typing label define sexlbl 0 "male" 1 "female". The correspondences are named, and in this case we have named the 0 = male, 1 = female correspondence sexlbl.

Next, we must associate this value label with a variable. If we had already entered the data and the variable were called sex, we would do this by typing label values sex sexlbl. We would have entered the data by typing 0's and 1's, but at least now when we list the data, we would see the words rather than the underlying numbers.

We can do better than that. After defining the value label, we can associate the value label with the variable at the time we input the data and tell Stata to use the value label to interpret what we type:

. label define sexlbl 0 "male" 1 "female"

. input str16 name byte(age sex:sexlbl), label

          name       age    sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end

After defining the value label, we typed our input command. Two things are noteworthy: We added the label option at the end of the command, and we typed sex:sexlbl for the name of the sex variable. The byte(...) around age and sex:sexlbl was not really necessary; it merely forced both age and sex to be stored as bytes.

Let's first decipher sex:sexlbl. sex is the name of the variable we want to input. The :sexlbl part tells Stata that the new variable is to be associated with the value label named sexlbl. The label option tells Stata that it is to look up any strings we type for labeled variables in their
corresponding value label and substitute the number when it stores the data. Thus, when we entered the first observation of our data, we typed male for Mr. Doyle's sex even though the corresponding variable is numeric. Rather than complaining that "male" could not be read as a number, Stata accepted what we typed, looked up the number corresponding to male, and stored that number in the data.
The fact that Stata has actually stored a number rather than the words male or female is almost irrelevant. Whenever we list the data or make a table, Stata will use the words male and female just as if those words were actually stored in the dataset rather than their numeric codings:

. list

               name   age      sex
  1.   Arthur Doyle    22     male
  2.      Mary Hope    37   female
  3.     Guy Fawkes    48     male
  4.  Kriste Yeager    25   female

. tabulate sex

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |          2       50.00       50.00
     female |          2       50.00      100.00
------------+-----------------------------------
      Total |          4      100.00

It is only almost irrelevant since we can make use of the underlying numbers in statistical analyses. For instance, if we were to ask Stata to calculate the mean of sex by typing summarize sex, Stata would report 0.5. We would interpret that to mean that one-half of our sample is female.
Value labels are permanently associated with variables. Thus, once we associate a value label with a variable, we never have to do so again. If we wanted to add another observation to these data, we could type

. input, label

          name       age    sex
  5. "Mark Esman" 26 male
  6. end
Technical Note

The automatic option automates the definition of the value label. In the previous example, we informed Stata that male corresponds to 0 and female corresponds to 1 by typing label define sexlbl 0 "male" 1 "female". With automatic, it is not necessary to explicitly specify the mapping. Specifying the automatic option tells Stata to interpret what we type as follows:

First, see if the value is a number. If so, store that number and be done with it. If it is not a number, check the value label associated with the variable in an attempt to interpret it. If an interpretation exists, store the corresponding numeric code. If one does not exist, add a new numeric code corresponding to what was typed. Store that new number and update the value label so that the new correspondence is never forgotten.

We can use these features to reenter our age and sex data. Before reentering the data, we drop _all and label drop _all to prove that we have nothing up our sleeve:

. drop _all

. label drop _all
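The lookup rule that automatic follows can be sketched in a few lines of Python (an illustration of the logic only; the function name and data structure are ours, not Stata's):

```python
def encode(value, label_map):
    """Mimic input's automatic option: return the numeric value to store.

    label_map maps label text -> numeric code and is extended in place
    whenever a new label is encountered.
    """
    try:
        return float(value)               # 1. a number? store it as typed
    except ValueError:
        pass
    if value not in label_map:            # 2. unknown label? assign a new code
        label_map[value] = max(label_map.values(), default=0) + 1
    return label_map[value]               # 3. store the label's code

sexlbl = {}
codes = [encode(v, sexlbl) for v in ["male", "female", "male", "female"]]
print(codes)    # [1, 2, 1, 2]
print(sexlbl)   # {'male': 1, 'female': 2}
```

Once a correspondence is added to the map it persists, which is the sense in which the new correspondence "is never forgotten".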
. input str16 name byte(age sex:sexlbl), automatic

          name       age    sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end
We previously defined the value label sexlbl so that male corresponded to 0 and female corresponded to 1. The label that Stata automatically created is slightly different but just as good:

. label list sexlbl
sexlbl:
           1 male
           2 female
Also See

Complementary:   [R] save

Related:         [R] edit, [R] infile

Background:      [U] 24 Commands to input data
insheet -- Read ASCII (text) data created by a spreadsheet

Syntax

    insheet [varlist] using filename [, double [no]names comma tab clear]

If filename is specified without an extension, .raw is assumed.
Description

insheet reads into memory a disk dataset that is not in Stata format. insheet is intended for reading files created by a spreadsheet or database program. Regardless of the creator, insheet reads text (ASCII) files in which there is one observation per line and the values are separated by tabs or commas. In addition, the first line of the file can contain the variable names or not. The best thing about insheet is that if you type

. insheet using filename

insheet will read your data; that's all there is to it.

Stata has other commands for reading data. If you are not sure that insheet is what you are looking for, see [R] infile and [U] 24 Commands to input data.

If you want to save your data in "spreadsheet-style" format, see [R] outsheet.

Options
double forces Stata to store variables as doubles rather than floats; see [U] 15.2.2 Numeric storage types.

[no]names informs Stata whether variable names are included on the first line of the file. Specifying this option will speed insheet's processing--assuming you are right--but that is all. insheet can determine for itself whether the file includes variable names.

comma tells Stata that the values are comma-separated. Specifying this option will speed insheet's processing--assuming you are right--but that is all. insheet can determine for itself whether the separation character is a comma or a tab.

tab tells Stata that the values are tab-separated. Specifying this option will speed insheet's processing--assuming you are right--but that is all. insheet can determine for itself whether the separation character is a tab or a comma.

clear specifies that it is okay for the new data to replace what is currently in memory. To ensure that you do not lose something important, insheet will refuse to read new data if data are already in memory. clear is one way you can tell insheet that it is okay. The other is to drop the data yourself by typing drop _all before reading new data.
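Determining the separation character automatically amounts to a simple heuristic. A sketch of one such heuristic in Python (our own illustration; Stata's actual internal rule is not documented here) counts candidate delimiters on the first line:

```python
def sniff_delimiter(first_line):
    """Guess whether a line of data is tab- or comma-separated."""
    tabs = first_line.count('\t')
    commas = first_line.count(',')
    if tabs == 0 and commas == 0:
        return None              # neither delimiter: not insheet-style data
    return '\t' if tabs >= commas else ','

print(repr(sniff_delimiter('make\tprice\tmpg')))      # '\t'
print(repr(sniff_delimiter('make,price,mpg')))        # ','
print(repr(sniff_delimiter('"AMC Concord" 4099 22'))) # None
```

The last case, space-separated values with neither tabs nor commas, corresponds to the auto4.raw example later in this entry, which insheet cannot parse into separate variables.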
Remarks

There is nothing to using insheet. You type insheet using filename and insheet will read your data. That is, it will read your data if

1. it can find the file, and

2. the file meets insheet's expectations as to the format in which it is written.

Assuring 1 is easy enough; just realize that if you type insheet using myfile, Stata interprets this as an instruction to read myfile.raw. If your file is called myfile.txt, type insheet using myfile.txt.

As for the file's format, most spreadsheets and some database programs write data in the form insheet expects. It is easy enough to look--as we will show you--and it is even easier simply to try it and see what happens. If typing

. insheet using filename

does not produce the desired result, you will have to try one of Stata's other infile commands; see [R] infile.
> Example

You have a raw data file on automobiles called auto.raw. This file was saved by a spreadsheet and can be read by typing

. insheet using auto
(5 vars, 10 obs)
That done, we can now look at what we just loaded:

. describe

Contains data
  obs:            10
 vars:             5
 size:           310 (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------
make            str13  %13s
price           int    %8.0g
mpg             byte   %8.0g
rep78           byte   %8.0g
foreign         str10  %10s
-------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
. list

               make   price   mpg   rep78    foreign
  1.    AMC Concord    4099    22       3   Domestic
  2.      AMC Pacer    4749    17       3   Domestic
  3.     AMC Spirit    3799    22       .   Domestic
  4.  Buick Century    4816    20       3   Domestic
  5.  Buick Electra    7827    15       4   Domestic
  6.  Buick LeSabre    5788    18       3   Domestic
  7.     Buick Opel    4453    26       .   Domestic
  8.    Buick Regal    5189    20       3   Domestic
  9.  Buick Riviera   10372    16       3   Domestic
 10.  Buick Skylark    4082    19       3   Domestic

Note that these data contain a combination of string and numeric variables. insheet figured all that out by itself.
Technical Note

Now let's back up and look at the auto.raw file. Stata's type command will display files to the screen:

. type auto.raw
make    price   mpg     rep78   foreign
AMC Concord     4099    22      3       Domestic
AMC Pacer       4749    17      3       Domestic
AMC Spirit      3799    22      .       Domestic
Buick Century   4816    20      3       Domestic
Buick Electra   7827    15      4       Domestic
Buick LeSabre   5788    18      3       Domestic
Buick Opel      4453    26      .       Domestic
Buick Regal     5189    20      3       Domestic
Buick Riviera   10372   16      3       Domestic
Buick Skylark   4082    19      3       Domestic

These data have tab characters between the values and hence are indistinguishable from data with blanks between the values. Tab characters are difficult to see since they are invisible; type's showtabs option makes the tabs visible:

. type auto.raw, showtabs
make<T>price<T>mpg<T>rep78<T>foreign
AMC Concord<T>4099<T>22<T>3<T>Domestic
AMC Pacer<T>4749<T>17<T>3<T>Domestic
AMC Spirit<T>3799<T>22<T>.<T>Domestic
Buick Century<T>4816<T>20<T>3<T>Domestic
Buick Electra<T>7827<T>15<T>4<T>Domestic
Buick LeSabre<T>5788<T>18<T>3<T>Domestic
Buick Opel<T>4453<T>26<T>.<T>Domestic
Buick Regal<T>5189<T>20<T>3<T>Domestic
Buick Riviera<T>10372<T>16<T>3<T>Domestic
Buick Skylark<T>4082<T>19<T>3<T>Domestic
This is an example of the kind of data insheet is willing to read. The first line contains the variable names, although that is not necessary. What is necessary is that the data values have tab characters between them.

insheet would be just as happy if the data values were separated by commas. Here is another variation on auto.raw that insheet can read:

. type auto2.raw
make,price,mpg,rep78,foreign
AMC Concord,4099,22,3,Domestic
AMC Pacer,4749,17,3,Domestic
AMC Spirit,3799,22,.,Domestic
Buick Century,4816,20,3,Domestic
Buick Electra,7827,15,4,Domestic
Buick LeSabre,5788,18,3,Domestic
Buick Opel,4453,26,.,Domestic
Buick Regal,5189,20,3,Domestic
Buick Riviera,10372,16,3,Domestic
Buick Skylark,4082,19,3,Domestic

It is easier for us human beings to see the commas rather than the tabs, but computers do not care one way or the other.
> Example

The file does not have to contain variable names. Here is another variation on auto.raw without the first line, this time with commas rather than tabs separating the values:

. type auto3.raw
AMC Concord,4099,22,3,Domestic
AMC Pacer,4749,17,3,Domestic
 (output omitted)
Buick Skylark,4082,19,3,Domestic

Here is what happens when we read it:

. insheet using auto3
you must start with an empty dataset
r(18);

Oops; we still have the data from the last example in memory.

. insheet using auto3, clear
(5 vars, 10 obs)

. describe

Contains data
  obs:            10
 vars:             5
 size:           310 (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------
v1              str13  %13s
v2              int    %8.0g
v3              byte   %8.0g
v4              byte   %8.0g
v5              str10  %10s
-------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list

                 v1     v2   v3   v4         v5
  1.    AMC Concord   4099   22    3   Domestic
  2.      AMC Pacer   4749   17    3   Domestic
 (output omitted)
 10.  Buick Skylark   4082   19    3   Domestic
The only difference is that rather than the variables being nicely named make, price, mpg, rep78, and foreign, they are named v1, v2, ..., v5. We could now give our variables nicer names:

. rename v1 make

. rename v2 price
Another alternative is to specify the variable names when we read the data:

. insheet make price mpg rep78 foreign using auto3, clear
(5 vars, 10 obs)

. list

               make   price   mpg   rep78    foreign
  1.    AMC Concord    4099    22       3   Domestic
  2.      AMC Pacer    4749    17       3   Domestic
 (output omitted)
 10.  Buick Skylark    4082    19       3   Domestic

If we use this approach, we must not specify too few variables

. insheet make price mpg rep78 using auto3, clear
too few variables specified
error in line 11 of file
r(102);

or too many.

. insheet make price mpg rep78 foreign weight using auto3, clear
too many variables specified
error in line 11 of file
r(103);

That is why we recommend

. insheet using filename

It is not difficult to rename your variables afterwards should that be necessary.
> Example

About the only other thing that can go wrong is that the data are not appropriate for reading by insheet. Here is yet another version of the automobile data:

. type auto4.raw, showtabs
"AMC Concord"   4099  22  3  Domestic
"AMC Pacer"     4749  17  3  Domestic
"AMC Spirit"    3799  22  .  Domestic
"Buick Century" 4816  20  3  Domestic
"Buick Electra" 7827  15  4  Domestic
"Buick LeSabre" 5788  18  3  Domestic
"Buick Opel"    4453  26  .  Domestic
"Buick Regal"   5189  20  3  Domestic
"Buick Riviera" 10372 16  3  Domestic
"Buick Skylark" 4082  19  3  Domestic

Note that we specified type's showtabs option and no tabs are shown. These data are not tab-delimited or comma-delimited and are not the kind of data insheet is designed to read. Let's try insheet anyway:
. insheet using auto4, clear
(1 var, 10 obs)

. describe

Contains data
  obs:            10
 vars:             1
 size:           430 (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------
v1              str39  %39s
-------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list

                                           v1
  1.     AMC Concord 4099 22 3 Domestic
  2.     AMC Pacer 4749 17 3 Domestic
 (output omitted)
 10.     Buick Skylark 4082 19 3 Domestic
When insheet tries to read data that have no tabs or commas, it is fooled into thinking the data contain just one variable. If you had these data, you would have to read the data with one of Stata's other commands, such as infile (free format).

ivreg -- Instrumental variables and two-stage least squares regression

      Source |       SS       df       MS               Number of obs =      50
-------------+------------------------------            F(  3,    46) =   47.05
       Model |  46189.152     3  15396.384              Prob > F      =  0.0000
    Residual |  15053.968    46  327.260173             R-squared     =  0.7542
-------------+------------------------------            Adj R-squared =  0.7382
       Total |   61243.12    49  1249.85959             Root MSE      =   18.09

------------------------------------------------------------------------------
        rent |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     hsngval |   .0006509   .0002947     2.21   0.032     .0000577    .0012442
    pcturban |   .0815159   .2438355     0.33   0.740    -.4092994    .5723311
    hsng_hat |   .0015889   .0003984     3.99   0.000      .000787    .0023908
       _cons |   120.7065   12.42856     9.71   0.000     95.68912    145.7239
------------------------------------------------------------------------------
Since we have only a single endogenous right-hand-side variable, our test statistic is just the t statistic for the hsng_hat variable. If there were more than one endogenous right-hand-side variable, we would need to perform a joint test of all their predicted-value regressors being zero. For this simple case, the test statement would be

. test hsng_hat

 ( 1)  hsng_hat = 0.0

       F(  1,    46) =   15.91
            Prob > F =    0.0002

While the p-value from the augmented regression test is somewhat lower than the p-value from the Hausman test, both tests clearly show that OLS is not indicated for the rent equation (under the assumption that the instrumental variables estimator is a consistent estimator for our rent model).
> Example

Robust standard errors are available with ivreg:

. ivreg rent pcturban (hsngval = faminc reg2-reg4), robust

IV (2SLS) regression with robust standard errors       Number of obs =      50
                                                       F(  2,    47) =   21.74
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.5989
                                                       Root MSE      =  22.882

------------------------------------------------------------------------------
             |               Robust
        rent |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     hsngval |   .0022398   .0006931     3.23   0.002     .0008455    .0036342
    pcturban |    .081516   .4585635     0.18   0.860    -.8409949    1.004027
       _cons |   120.7065    15.7348     7.67   0.000     89.05217    152.3609
------------------------------------------------------------------------------
Instrumented:  hsngval
Instruments:   pcturban faminc reg2 reg3 reg4
------------------------------------------------------------------------------

The robust standard error for the coefficient on housing value is double what was previously estimated.
Technical Note

You may perform weighted two-stage least squares or instrumental variables estimation by specifying the [weight] qualifier with ivreg. You may perform weighted or unweighted estimation, suppressing the constant, by specifying the noconstant option. In this case, the constant is excluded from both the structural equation and the instrument list.

Acknowledgments

The robust estimate of variance with instrumental variables was first implemented in Stata by Mead Over, Dean Jolliffe, and Andrew Foster (1996).
Saved Results

ivreg saves in e():

Scalars
    e(N)          number of observations         e(r2)         R-squared
    e(mss)        model sum of squares           e(r2_a)       adjusted R-squared
    e(df_m)       model degrees of freedom       e(F)          F statistic
    e(rss)        residual sum of squares        e(rmse)       root mean square error
    e(df_r)       residual degrees of freedom    e(N_clust)    number of clusters

Macros
    e(cmd)        ivreg                          e(clustvar)   name of cluster variable
    e(version)    version number of ivreg        e(vcetype)    covariance estimation method
    e(depvar)     name of dependent variable     e(instd)      instrumented variable
    e(model)      iv                             e(insts)      instruments
    e(wtype)      weight type                    e(predict)    program used to implement predict
    e(wexp)       weight expression

Matrices
    e(b)          coefficient vector             e(V)          variance-covariance matrix of
                                                               the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

ivreg is implemented as an ado-file.
Variables printed in lowercase and not boldfaced (e.g., x) are scalars. Variables printed in lowercase and boldfaced (e.g., x) are column vectors. Variables printed in uppercase and boldfaced (e.g., X) are matrices.

Let v be a column vector of weights specified by the user. If no weights are specified, then v = 1. Let w be a column vector of normalized weights. If no weights are specified or if the user specified fweights or iweights, w = v. Otherwise, w = {v/(1'v)}(1'1).
The number of observations, n, is defined as 1'w. In the case of iweights, this is truncated to an integer. The sum of the weights is 1'v. Define c = 1 if there is a constant in the regression and zero otherwise. Define k as the number of right-hand-side (rhs) variables (including the constant).

Let X denote the matrix of observations on the rhs variables, y the vector of observations on the left-hand-side (lhs) variable, and Z the matrix of observations on the instruments. In the following formulas, if the user specifies weights, then X'X, X'y, y'y, Z'Z, Z'X, and Z'y are replaced by X'DX, X'Dy, y'Dy, Z'DZ, Z'DX, and Z'Dy, respectively, where D is a diagonal matrix whose diagonal elements are the elements of w. We suppress the D below to simplify the notation.

Define A as X'Z(Z'Z)^{-1}(X'Z)' and a as X'Z(Z'Z)^{-1}Z'y. The coefficient vector b is defined as A^{-1}a. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies to all statistics listed below.

The total sum of squares, TSS, equals y'y if there is no intercept and y'y - {(1'y)^2/n} otherwise. The degrees of freedom are n - c.
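The definitions of A, a, and b can be checked numerically. The following NumPy sketch (our illustration, not StataCorp code; the simulated data and variable names are ours, following the notation above) verifies that b = A^{-1}a reproduces the familiar two-stage computation of regressing y on the fitted values of X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Instruments Z (with a constant) and an rhs matrix X correlated with Z
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X = np.column_stack([np.ones(n), Z[:, 1] + 0.5 * rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# b = A^{-1} a  with  A = X'Z (Z'Z)^{-1} Z'X  and  a = X'Z (Z'Z)^{-1} Z'y
ZZinv = np.linalg.inv(Z.T @ Z)
A = X.T @ Z @ ZZinv @ Z.T @ X
a = X.T @ Z @ ZZinv @ Z.T @ y
b = np.linalg.solve(A, a)

# Two-stage version: regress X on Z, then y on the fitted values Xhat
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
b_2sls = np.linalg.lstsq(Xhat, y, rcond=None)[0]

print(np.allclose(b, b_2sls))   # True
```

Both expressions equal (X'P_Z X)^{-1} X'P_Z y, where P_Z is the projection onto the column space of Z, which is why the two computations agree to floating-point precision.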
The error sum of squares, ESS, is defined as y'y - 2b'X'y + b'X'Xb. The degrees of freedom are n - k.

The model sum of squares, MSS, equals TSS - ESS. The degrees of freedom are k - c.

The mean square error, s^2, is defined as ESS/(n - k). The root mean square error is s, its square root.

If c = 1, then F is defined as

        F = {(b - c)'A(b - c)} / {(k - 1)s^2}

where c is a vector of k - 1 zeros and kth element 1'y/n. Otherwise, F is defined as missing. (In this case, you may use the test command to construct any F test you wish.)

The R-squared, R^2, is defined as R^2 = 1 - ESS/TSS.

The adjusted R-squared, R^2_a, is 1 - (1 - R^2)(n - c)/(n - k).

If robust is not specified, the conventional estimate of variance is s^2 A^{-1}.

For a discussion of robust variance estimates in the context of regression and regression with instrumental variables, see [R] regress, Methods and Formulas. See this same section for a discussion of the formulas for predict after ivreg.
References

Baltagi, B. H. 1998. Econometrics. New York: Springer-Verlag.

Basmann, R. L. 1957. A generalized classical method of linear estimation of coefficients in a structural equation. Econometrica 25: 77-83.

Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.

Koopmans, T. C. and W. C. Hood. 1953. Studies in Econometric Method. New York: John Wiley & Sons.

Koopmans, T. C. and J. Marschak. 1950. Statistical Inference in Dynamic Economic Models. New York: John Wiley & Sons.

Over, M., D. Jolliffe, and A. Foster. 1996. sg46: Huber correction for two-stage least squares estimates. Stata Technical Bulletin 29: 24-25. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 140-142.

Theil, H. 1953. Repeated Least Squares Applied to Complete Equation Systems. Mimeograph from the Central Planning Bureau, Hague.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.

Wooldridge, J. M. 2000. Introductory Econometrics: A Modern Approach. Cincinnati, OH: South-Western College Publishing.
Also See

Complementary:   [R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] predict, [R] test,
                 [R] testnl, [R] vce, [R] xi

Related:         [R] anova, [R] areg, [R] cnsreg, [R] mvreg, [R] qreg, [R] reg3,
                 [R] regress, [R] rreg, [R] sureg, [R] svy estimators, [R] xtreg,
                 [R] xtregar, [P] _robust

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.13 Weighted estimation
Title

jknife -- Jackknife estimation

Syntax

    jknife "cmd" exp_list [if exp] [in range] [, {rclass | eclass | n(exp)}
        level(#) keep]

exp_list contains    newvarname = (exp)
                     (exp)
                     eexp

eexp is              specname
                     [eqno]specname

specname is          _b
                     _b[]
                     _se
                     _se[]

eqno is              ##
                     name

Distinguish between [ ], which are to be typed, and [ ], which indicate optional arguments.
Description

jknife performs jackknife estimation.

cmd defines the statistical command to be executed. cmd must be bound in double quotes. Compound double quotes (`" and "') are needed if the command itself contains double quotes.

exp_list specifies the statistics to be retrieved after the execution of cmd and on which the jackknife statistics will be calculated.

Options

rclass, eclass, and n(exp) specify where cmd saves the number of observations on which it based the calculated results. You are strongly advised to specify one of these options.

rclass specifies that cmd saves the number of observations in r(N).

eclass specifies that cmd saves the number of observations in e(N).

n(exp) allows you to specify any other expression that evaluates to the number of observations used. Specifying n(r(N)) is equivalent to specifying option rclass. Specifying n(e(N)) is equivalent to specifying option eclass. If cmd saved the number of observations in r(N1), specify n(r(N1)).
If you specify none of these options, jknife assumes that all observations in the dataset contribute to the calculated result. If that assumption is incorrect, the reported standard errors will be incorrect. For instance, say you specify

. jknife "regress y x1 x2 x3" coef=_b[x2]

and pretend that observation 42 in the dataset has x3 equal to missing. The 42nd observation plays no role in obtaining the estimates, but jknife has no way of knowing that and will use the wrong N. If, on the other hand, you specify

. jknife "regress y x1 x2 x3" coef=_b[x2], e

jknife will correctly notice that observation 42 plays no role. Option e is specified because regress is an estimation command and saves the number of observations used in e(N). When jknife runs the regression omitting the 42nd observation, jknife will observe that e(N) has the same value as when jknife previously ran the regression using all the observations. Thus, jknife will know that regress did not use the observation.

In this example, it does not matter whether you specify option eclass or n(e(N)), but specifying eclass is easier.

level(#) specifies the confidence level, in percent, for the confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

keep specifies that new variables are to be added to the dataset containing the pseudo-values of the requested statistics. For instance, if you typed

. jknife "regress y x1 x2 x3" coef=_b[x2], e keep

a new variable coef would be added to the dataset containing the pseudo-values for _b[x2]. Let b be defined as the value of _b[x2] when all observations are used to estimate the model, and let b(j) be the value when the jth observation is omitted. The pseudo-values are defined as

        pseudovalue_j = N*{b - b(j)} + b(j)

where N is the number of observations used to produce b.
Remarks

While the jackknife--developed in the late 1940s and early 1950s--is of largely historical interest today, it is still useful in searching for overly influential observations. This feature is often forgotten. In any case, the jackknife is

1. an alternative, first-order unbiased estimator for a statistic;

2. a data-dependent way to calculate the standard error of the statistic and to obtain significance levels and confidence intervals; and

3. a way of producing measures called pseudo-values for each observation, reflecting the observation's influence on the overall statistic.

The idea behind the simplest form of the jackknife--the one implemented here--is to calculate the statistic in question N times, each time omitting just one of the dataset's observations. Write S for the statistic calculated on the overall sample and S(j) for the statistic calculated when the jth observation is removed. If the statistic in question were the mean, then

        S = {(N - 1)S(j) + sj} / N

where sj is the value of the data in the jth observation. Solving for sj, we obtain

        sj = N*S - (N - 1)*S(j)

These are the pseudo-values the jackknife calculates, even though the statistic in question is not the mean. The jackknife estimate is the average of the sj's, and its estimate of the standard error of the statistic is the corresponding standard error of the mean (Tukey 1958).

The jackknife estimate of variance has been largely replaced by the bootstrap (see [R] bstrap), which is widely viewed as more efficient and robust. The use of jackknife pseudo-values to detect outliers is too often forgotten and is something the bootstrap is unable to provide. See Mosteller and Tukey (1977, 133-163) and Mooney and Duval (1993, 22-27) for more information.
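The pseudo-value arithmetic is easy to reproduce outside of Stata. This Python sketch (illustrative only; the function name is ours) jackknifes the standard deviation of the eleven Mosteller-Tukey values used in the example below, reproducing the estimate of about 1.49 and standard error of about 0.62:

```python
from statistics import stdev, mean

def jackknife(data, stat):
    """Return (estimate, std. error, pseudo-values) for statistic stat."""
    n = len(data)
    s_all = stat(data)
    pseudo = []
    for j in range(n):
        s_j = stat(data[:j] + data[j + 1:])       # statistic with obs j omitted
        pseudo.append(n * s_all - (n - 1) * s_j)  # sj = N*S - (N-1)*S(j)
    est = mean(pseudo)                            # jackknife estimate
    se = stdev(pseudo) / n ** 0.5                 # standard error of the mean
    return est, se, pseudo

x = [0.1, 0.1, 0.1, 0.4, 0.5, 1.0, 1.1, 1.3, 1.9, 1.9, 4.7]
est, se, pseudo = jackknife(x, stdev)
print(round(est, 4), round(se, 4))   # 1.4894 0.6244
print(round(pseudo[-1], 4))          # 7.7039 -- the influential last value
```

The outsized pseudo-value for the last observation is exactly the outlier-detection feature discussed above.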
Jackknifed standard deviation

> Example

Mosteller and Tukey (1977, 139-140) request a 95% confidence interval for the standard deviation of the eleven values

    0.1, 0.1, 0.1, 0.4, 0.5, 1.0, 1.1, 1.3, 1.9, 1.9, 4.7

Stata's summarize command calculates the mean and standard deviation and saves them as r(mean) and r(sd). To obtain the jackknifed standard deviation of the eleven values and to save the pseudo-values as a new variable sd, type

    . input x
              x
      1. 0.1
      2. 0.1
      3. 0.1
      4. 0.4
      5. 0.5
      6. 1.0
      7. 1.1
      8. 1.3
      9. 1.9
     10. 1.9
     11. 4.7
     12. end

    . jknife "summarize x" sd=r(sd), r keep
    command:      summarize x
    statistic:    sd=r(sd)
    n():          r(N)
    Variable |      Obs    Statistic    Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------------
    sd       |
     overall |       11     1.343469
      jknife |              1.489364     .6244049    .0981028    2.880625

Interpreting the output, the standard deviation reported by summarize is 1.34. The jackknife estimate is 1.49 with standard error 0.62. The 95% confidence interval for the standard deviation is .10 to 2.88. By specifying the keep option, jknife creates a new variable in our dataset, sd, for the pseudo-values.
    . list

             x          sd
      1.    .1   1.1399778
      2.    .1   1.1399778
      3.    .1   1.1399778
      4.    .4   .88931508
      5.    .5    .8242672
      6.     1   .63248882
      7.   1.1   .62031914
      8.   1.3   .62188884
      9.   1.9    .8354195
     10.   1.9    .8354195
     11.   4.7   7.7039493
The jackknife estimate is the average of sd, so sd contains the individual "values" of our statistic. We can see that the last observation is substantially larger than the others. The last observation is certainly an outlier, but whether that reflects the considerable information it contains or indicates that it should be excluded from analysis is a decision that must be based on the context of the problem. In this case, Mosteller and Tukey created the dataset by sampling from an exponential distribution, so the observation is quite informative.
> Example

Let us repeat the above example using the automobile dataset, obtaining the standard error of the standard deviation of mpg.

    . use auto, clear
    (1978 Automobile Data)

    . jknife "summarize mpg" sd=r(sd), r keep
    command:      summarize mpg
    statistic:    sd=r(sd)
    n():          r(N)

    Variable |      Obs    Statistic    Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------------
    sd       |
     overall |       74     5.785503
      jknife |              5.817373      .607251    4.607124    7.027623
Looking at sd more carefully,

    . summarize sd, detail

                          r(sd) pseudovalues
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     2.870485      2.870485
     5%     2.870485      2.870485
    10%     2.906249      2.870485       Obs                  74
    25%     3.328494      2.870485       Sum of Wgt.          74

    50%     3.948327                     Mean           5.817373
                          Largest        Std. Dev.       5.22377
    75%     6.844408      17.34316
    90%     9.597005      19.76169       Variance       27.28778
    95%     17.34316      19.76169       Skewness        4.07202
    99%     38.60905      38.60905       Kurtosis       23.37822
    . list make mpg sd if sd > 30

               make   mpg         sd
    71.   VW Diesel    41   38.60905

In this case, the VW Diesel is the only diesel car in our dataset.
Collecting multiple statistics

> Example

jknife is not limited to collecting just one statistic. For instance, you can use summarize, detail and then obtain the jackknife estimate of the standard deviation and skewness. summarize, detail saves the standard deviation in r(sd) and the skewness in r(skewness), so you might type

    . use auto, clear
    (1978 Automobile Data)

    . jknife "summarize mpg, detail" sd=r(sd) skew=r(skewness), r
    command:      summarize mpg, detail
    statistic:    sd=r(sd)
                  skew=r(skewness)
    n():          r(N)

    Variable |      Obs    Statistic    Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------------
    sd       |
     overall |       74     5.785503
      jknife |              5.817373      .607251    4.607124    7.027623
    ---------+----------------------------------------------------------
    skew     |
     overall |       74     .9487176
      jknife |              1.023596     .3367242    .3525056    1.694686
Collecting coefficients and standard errors

> Example

jknife can also collect coefficients and standard errors from estimation commands. For instance, using auto.dta, we wish to obtain the jackknife estimate of the coefficients and standard errors from a regression in which we model the mileage of a car by its weight and trunk space. To do this, we could refer to the coefficients and standard errors as _b[weight], _b[trunk], _se[weight], and _se[trunk] in the exp_list, or we can simplify the typing by using the extended expressions _b and _se.

    . use auto, clear
    (1978 Automobile Data)

    . jknife "reg mpg weight trunk" _b _se, e
    command:      reg mpg weight trunk
    statistic:    b_weight=_b[weight]
                  se_weight=_se[weight]
                  b_trunk=_b[trunk]
                  se_trunk=_se[trunk]
                  b_cons=_b[_cons]
                  se_cons=_se[_cons]
    n():          e(N)
jknife -- Jackknife estimation Variable
1
Obs
Statistic
74
-.0056527
Std. Err.
[95X Conf. Interval]
b_weight overall jknife
-.0056325
.O010216
-.0076684
-.0035965
se_weight overall
74
.0007016
jkaife
.0003496
.000111
.0001284
.0005708
b_trunk overall
74
-.096229 -.0973012
jknife
.1486236
-.3935075
.1989052
b_cons overall
74
39.68913
jknife
39.65612
1.873323
35.92259
43.38965
.0218525
.0196476
.1067514
.2690423
.2907921
1.363193
se_trunk overall
74
.1274771
jknife
.0631995
se_cons overall
1.65207
74
jknife
.8269927
q
Saved Results

jknife saves in r():

    r(N#)       number of observations used in calculating statistic #
    r(stat#)    value of statistic # using all observations
    r(mean#)    jackknife estimate of statistic # (mean of pseudo-values)
    r(se#)      standard error of the mean of statistic #

Methods and Formulas

jknife is implemented as an ado-file.
References

Gould, W. 1995. sg34: Jackknife estimation. Stata Technical Bulletin 24: 25-29. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 165-170.

Mooney, C. Z. and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage Publications.

Mosteller, F. and J. W. Tukey. 1977. Data Analysis and Regression. Reading, MA: Addison-Wesley Publishing Company.

Tukey, J. W. 1958. Bias and confidence in not-quite large samples. Abstract in Annals of Mathematical Statistics 29: 614.

Also See

Related:      [R] bstrap, [R] statsby

Background:   [U] 16.5 Accessing coefficients and standard errors,
              [U] 23 Estimation and post-estimation commands
Title

joinby -- Form all pairwise combinations within groups

Syntax

    joinby [varlist] using filename [, unmatched({none | both | master | using})
        _merge(varname) nolabel update replace ]

Description

joinby joins, within groups formed by varlist, observations of the dataset in memory with filename, a Stata-format dataset. By join we mean "form all pairwise combinations". filename is required to be sorted by varlist. If filename is specified without an extension, '.dta' is assumed.

If varlist is not specified, joinby takes as varlist the set of variables common to the dataset in memory and in filename.

Observations unique to one or the other dataset are ignored unless unmatched() specifies differently. Whether you load one dataset and join the other or vice versa makes no difference in terms of the number of resulting observations. If there are common variables between the two datasets, however, the combined dataset will contain the values from the master data for those observations. This behavior can be modified with the update and replace options.
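As a sketch of what "form all pairwise combinations within groups" means, here is the operation in plain Python (not Stata; the function and dataset names are ours). Every master observation is paired with every using observation that shares a key value; unmatched keys are dropped, mirroring the default unmatched(none) behavior, and master values take precedence for shared variable names.

```python
def joinby(master, using, key):
    """All pairwise combinations, within groups of `key`, of master and
    using rows. Unmatched rows are dropped (like unmatched(none)); for
    shared variable names the master value takes precedence."""
    groups = {}
    for row in using:
        groups.setdefault(row[key], []).append(row)
    # {**u, **m}: master (m) overwrites shared names from using (u)
    return [{**u, **m} for m in master for u in groups.get(m[key], [])]

# toy data mirroring parent.dta (master) and child.dta (using)
parents = [{"family_id": 1025, "parent_id": 11, "x1": 20},
           {"family_id": 1025, "parent_id": 12, "x1": 27},
           {"family_id": 1026, "parent_id": 13, "x1": 30},
           {"family_id": 1026, "parent_id": 14, "x1": 26},
           {"family_id": 1030, "parent_id": 15, "x1": 32}]
children = [{"family_id": 1025, "child_id": 1, "x1": 12},
            {"family_id": 1025, "child_id": 3, "x1": 11},
            {"family_id": 1025, "child_id": 4, "x1": 10},
            {"family_id": 1026, "child_id": 2, "x1": 13},
            {"family_id": 1027, "child_id": 5, "x1": 15}]

result = joinby(parents, children, "family_id")
print(len(result))   # 2 parents x 3 children + 2 parents x 1 child = 8
```

Families 1027 and 1030, which appear in only one dataset, contribute nothing; every combined row carries the parents' value of the shared variable x1.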
Options

unmatched({none | both | master | using}) specifies whether observations unique to one of the datasets are to be kept, with the variables from the other dataset set to missing. Valid values are

    none      all unmatched observations are ignored (default)
    both      unmatched observations from the master and using data are included
    master    unmatched observations from the master data are included
    using     unmatched observations from the using data are included

nolabel prevents Stata from copying the value label definitions from the disk dataset into the dataset in memory. Even if you do not specify this option, in no event do label definitions from the disk dataset replace label definitions already in memory.

update varies the action that joinby takes when an observation is matched. By default, the master dataset is held inviolate--values from the master data are retained when the same variables are found in both datasets. If update is specified, however, the values from the using dataset are retained in cases where the master dataset contains missing.

replace, allowed with update only, specifies that even when the master dataset contains nonmissing values, they are to be replaced with corresponding values from the using dataset when the corresponding values are not equal. A nonmissing value, however, will never be replaced with a missing value.

_merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _merge(_merge). To preserve compatibility with earlier versions of joinby, _merge is only generated if unmatched is specified.
Remarks

The following, admittedly artificial, example illustrates joinby.
> Example

You have two datasets: child.dta and parent.dta. Both contain a family_id variable, which identifies the people who belong to the same family.

    . use child
    (Data on Children)

    . describe

    Contains data from child.dta
      obs:             5                          Data on Children
     vars:             4                          13 Jul 2000 15:29
     size:            50 (99.9% of memory free)

                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------
    family_id       int    %8.0g                  Family Id Number
    child_id        byte   %8.0g                  Child Id Number
    x1              byte   %8.0g
    x2              int    %8.0g
    -------------------------------------------------------------
    Sorted by:  family_id

    . list

         family~d   child_id   x1    x2
      1.     1025          3   11   320
      2.     1025          1   12   300
      3.     1025          4   10   275
      4.     1026          2   13   280
      5.     1027          5   15   210
    . use parent, clear
    (Data on Parents)

    . describe

    Contains data from parent.dta
      obs:             6                          Data on Parents
     vars:             4                          13 Jul 2000 15:31
     size:           108 (99.9% of memory free)

                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------
    family_id       int    %8.0g                  Family Id Number
    parent_id       float  %9.0g                  Parent Id Number
    x1              float  %9.0g
    x3              float  %9.0g
    -------------------------------------------------------------
    Sorted by:

    . list

         family~d   parent_id   x1    x3
      1.     1030          10   39   600
      2.     1025          11   20   643
      3.     1025          12   27   721
      4.     1026          13   30   760
      5.     1026          14   26   668
      6.     1030          15   32   684
variable which
joinby-- Form ail pairwisecombinationswithin groups
147
You want tO "join" the information for the parents and their children. The data on parents are in memory;the data on children are on disk. dhild.dta has been sorted by family_id, but parerit._ti has not, so first we sort the parent _data on famity_id: • Sort
i family_id
• joinby
family_id
using
child
• describe Co,tails
data
o.bs:I
8
vats:
6 168
Data (99.4_, of memory
free)
i
i
storage
on Parents
,
display
value
type
format
label
family¢id
int
Y,8.0g
Family
Id Number
paz_nt_id
float
Y,9.0g
Parent
Id Number
Xl
float
%9.0g
x3
float
Y.9.0g
child__d
byte
XS.0g
x2
int
Y,8.0g
variable
name
Sorted iby : Npte:
dataset
has changed
variable
Child
since
last
label
Id Number
saved
l_st 1.
family-d 1025
parent_id 12
xl 27
x3 721
child_id 3
2;.
1025
11
20
643
3
320
3. 4.
1025 1025
12 11
27 20
721 643
1 1
300 300
5,
1025
li
20
643
4
275
6. 7.
1025 1026
12 13
27 30
721 760
4 2
275 280
8.
1026
14
26
668
2
280
x2 320
Notice that
I , I
1. fami_y__d of I027, which appears only in child.dta, and family_id only in Narent. dta, are not in the combined dataset. Observations variable(s) are not in both datasets are omitted.
of 1030, which appears for which the matching
2. The x_ v_riable is in both datasets. Values for this variable in the joined dataset are the values from par_nt.dta--the dataset in memory when we issued the joinby command. If we had cMld.d_a in memory and parent.dta on di_k when we requested joinby, the values for xl wouldiha_'e been from child.dta. Values from the dataset in memory take precedence over the datasel o_i disk. q
Methods and Formulas

joinby is implemented as an ado-file.

Acknowledgment

joinby was written by Jeroen Weesie, Department of Sociology, Utrecht University, Netherlands.

Also See

Complementary:   [R] save

Related:         [R] append, [R] cross, [R] fillin, [R] merge

Background:      [U] 25 Commands for combining data
Title

kappa -- Interrater agreement

Syntax

    kap varname1 varname2 [varname3 ...] [weight] [if exp] [in range]
        [, tab wgt(wgtid) absolute ]

    kappa varlist [if exp] [in range]

fweights are allowed; see [U] 14.1.6 weight.

Description

kap (first syntax) calculates the kappa-statistic measure of interrater agreement when there are two unique raters and two or more ratings.

kapwgt defines weights for use by kap in measuring the importance of disagreements.

kap (second syntax) and kappa calculate the kappa-statistic measure in the case of two or more (nonunique) raters and two outcomes, more than two outcomes when the number of raters is fixed, and more than two outcomes when the number of raters varies. kap (second syntax) and kappa produce the same results; they merely differ in how they expect the data to be organized.

kap assumes that each observation is a subject. varname1 contains the ratings by the first rater, varname2 by the second rater, and so on.

kappa also assumes that each observation is a subject. The variables, however, record the frequencies with which ratings were assigned. The first variable records the number of times the first rating was assigned, the second variable records the number of times the second rating was assigned, and so on.
Options

tab displays a tabulation of the assessments by the two raters.

wgt(wgtid) specifies that wgtid is to be used to weight disagreements. User-defined weights can be created using kapwgt; in that case, wgt() specifies the name of the user-defined matrix. For instance, you might define

    . kapwgt mine 1 \ .8 1 \ 0 .8 1 \ 0 0 .8 1

and then

    . kap rata ratb, wgt(mine)
In addition, two prerecorded weights are available.
wgt(w) specifies weights 1 - |i-j|/(k-1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings.

wgt(w2) specifies weights 1 - {(i-j)/(k-1)}^2.
absolute is relevant only if wgt() is also specified; see wgt() above. Option absolute modifies how i, j, and k in the formulas below are defined and how corresponding entries are found in a user-defined weighting matrix. When absolute is not specified, i and j refer to the row and column index, not the ratings themselves. Say the ratings are recorded as {0, 1, 1.5, 2}. There are 4 ratings; k = 4 and i and j are still 1, 2, 3, and 4 in the formulas below. Index 3, for instance, corresponds to rating = 1.5. This is convenient but can, with some data, lead to difficulties. When absolute is specified, all ratings must be integers and they must be coded from the set {1, 2, 3, ...}. Not all values need be used; integer values that do not occur are simply assumed to be unobserved.
Remarks

The kappa-statistic measure of agreement is scaled to be 0 when the amount of agreement is what would be expected to be observed by chance and 1 when there is perfect agreement. For intermediate values, Landis and Koch (1977a, 165) suggest the following interpretations:

    below 0.0     Poor
    0.00-0.20     Slight
    0.21-0.40     Fair
    0.41-0.60     Moderate
    0.61-0.80     Substantial
    0.81-1.00     Almost Perfect
The case of 2 raters

> Example

Consider the classification by two radiologists of 85 xeromammograms as normal, benign disease, suspicion of cancer, or cancer (a subset of the data from Boyd et al. 1982 and discussed in the context of kappa in Altman 1991, 403-405).

    . tabulate rada radb

    Radiologist |       Radiologist B's assessment
    A's assess- |
    ment        |  normal   benign  suspect   cancer |    Total
    ------------+------------------------------------+---------
         normal |      21       12        0        0 |       33
         benign |       4       17        1        0 |       22
        suspect |       3        9       15        2 |       29
         cancer |       0        0        0        1 |        1
    ------------+------------------------------------+---------
          Total |      28       38       16        3 |       85

Our dataset contains two variables: rada, radiologist A's assessment, and radb, radiologist B's assessment. Each observation is a patient.

We can obtain the kappa measure of interrater agreement by typing
    . kap rada radb

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      63.53%      30.82%      0.4728     0.0694        6.81     0.0000

Had each radiologist made his determination randomly (but with probabilities equal to the overall proportions), we would expect the two radiologists to agree on 30.8% of the patients. In fact, they agreed on 63.5% of the patients, or 47.3% of the way between random agreement and perfect agreement. The amount of agreement indicates that we can reject the hypothesis that they are making their determinations randomly.
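The agreement computation can be replicated directly from the 4 x 4 table. The Python sketch below (Python rather than Stata, for illustration only) computes the observed agreement, the chance-expected agreement, and kappa; on the radiologists' table it reproduces the figures Stata reports (63.53%, 30.82%, 0.4728).

```python
def kappa_from_table(counts):
    """Unweighted kappa from a square contingency table of counts."""
    n = sum(sum(row) for row in counts)
    k = len(counts)
    row_tot = [sum(counts[i]) for i in range(k)]
    col_tot = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(counts[i][i] for i in range(k)) / n          # observed
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2  # chance
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

table = [[21, 12,  0, 0],
         [ 4, 17,  1, 0],
         [ 3,  9, 15, 2],
         [ 0,  0,  0, 1]]
p_o, p_e, kappa = kappa_from_table(table)
print(round(100 * p_o, 2), round(100 * p_e, 2), round(kappa, 4))
```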
> Example

There is a difference between two radiologists disagreeing about whether a xeromammogram indicates cancer or the suspicion of cancer and disagreeing about whether it indicates cancer or is normal. The weighted kappa attempts to deal with this. kap provides two "prerecorded" weights, w and w2:

    . kap rada radb, wgt(w)

    Ratings weighted by:
    1.0000  0.6667  0.3333  0.0000
    0.6667  1.0000  0.6667  0.3333
    0.3333  0.6667  1.0000  0.6667
    0.0000  0.3333  0.6667  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      86.67%      69.11%      0.5684     0.0788        7.22     0.0000

The w weights are given by 1 - |i-j|/(k-1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings. In our case, the rows and columns of the 4 x 4 matrix correspond to the ratings normal, benign, suspicious, and cancerous; the weighting matrix is printed above the table. A weight of 1 indicates that an observation should count as perfect agreement. The matrix has 1s down the diagonal--when both radiologists make the same assessment, they are in agreement. A weight of, say, 0.6667 means they are in two-thirds agreement. In our matrix, they get that score if they are "one apart"--one radiologist assesses cancer and the other is merely suspicious, or one is suspicious and the other says benign, and so on. An entry of 0.3333 means they are in one-third agreement or, if you prefer, two-thirds disagreement. That is the score attached when they are "two apart". Finally, they are in complete disagreement when the weight is zero, which happens only when they are three apart--one says cancer and the other says normal. The w2 weighting, which squares the term (i-j)/(k-1) and so counts large disagreements much more heavily, is probably inappropriate here.
> Example

In addition to prerecorded weights, you can define your own weights with the kapwgt command. For instance, you might feel that suspicious and cancerous are reasonably similar, benign and normal reasonably similar, but the suspicious/cancerous group is nothing like the benign/normal group:

    . kapwgt xm 1 \ .8 1 \ 0 0 1 \ 0 0 .8 1

    . kapwgt xm
    1.0000
    0.8000  1.0000
    0.0000  0.0000  1.0000
    0.0000  0.0000  0.8000  1.0000

You name the weights--we named ours xm--and after the weight name, you enter the lower triangle of the weighting matrix, using \ to separate rows. In our example, we have four outcomes, and so we continued entering numbers until we had defined the fourth row of the weighting matrix. If you type kapwgt followed by a name and nothing else, it shows you the weights recorded under that name. Satisfied that we have entered them correctly, we now use the weights to recalculate kappa:

    . kap rada radb, wgt(xm)

    Ratings weighted by:
    1.0000  0.8000  0.0000  0.0000
    0.8000  1.0000  0.0000  0.0000
    0.0000  0.0000  1.0000  0.8000
    0.0000  0.0000  0.8000  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      80.47%      52.67%      0.5874     0.0865        6.79     0.0000
Technical Note

In addition to weights for weighting the differences in categories, you can specify Stata's traditional weights for weighting the data. In the examples above, we have 85 observations in our dataset--one for each patient. If all we knew was the table of outcomes--that there were 21 patients rated normal by both radiologists, etc.--it would be easier to enter the table into Stata and work from it. The easiest way to enter the data is with tabi; see [R] tabulate.

    . tabi 21 12 0 0 \ 4 17 1 0 \ 3 9 15 2 \ 0 0 0 1, replace

           |                  col
       row |       1        2        3        4 |    Total
    -------+------------------------------------+---------
         1 |      21       12        0        0 |       33
         2 |       4       17        1        0 |       22
         3 |       3        9       15        2 |       29
         4 |       0        0        0        1 |        1
    -------+------------------------------------+---------
     Total |      28       38       16        3 |       85

          Pearson chi2(9) =  77.8111   Pr = 0.000
tabi felt obligated to tell us the Pearson chi-squared for this table, but we do not care about it. The important thing is that, with the replace option, tabi left the table in memory:

    . list in 1/5

         row   col   pop
      1.   1     1    21
      2.   1     2    12
      3.   1     3     0
      4.   1     4     0
      5.   2     1     4

The variable row is radiologist A's assessment, col radiologist B's assessment, and pop the number so assessed by both. Thus,

    . kap row col [freq=pop]

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      63.53%      30.82%      0.4728     0.0694        6.81     0.0000

If we are going to keep these data, the names row and col are not indicative of what the data reflect. We could fix that (see [U] 15.6 Dataset, variable, and value labels):

    . rename row rada
    . rename col radb
    . label var rada "Radiologist A's assessment"
    . label var radb "Radiologist B's assessment"
    . label define assess 1 normal 2 benign 3 suspect 4 cancer
    . label values rada assess
    . label values radb assess
    . label data "Altman p. 403"
kap's tab option, which can be used with or without weighted data, shows the table of assessments:

    . kap rada radb [freq=pop], tab

    Radiologist |       Radiologist B's assessment
    A's assess- |
    ment        |  normal   benign  suspect   cancer |    Total
    ------------+------------------------------------+---------
         normal |      21       12        0        0 |       33
         benign |       4       17        1        0 |       22
        suspect |       3        9       15        2 |       29
         cancer |       0        0        0        1 |        1
    ------------+------------------------------------+---------
          Total |      28       38       16        3 |       85

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      63.53%      30.82%      0.4728     0.0694        6.81     0.0000
Technical Note

You have data on individual patients. There are two raters, and the possible ratings are 1, 2, 3, and 4, but neither rater ever used rating 3:

    . tabulate ratera raterb

               |           raterb
        ratera |       1        2        4 |    Total
    -----------+---------------------------+---------
             1 |       6        4        3 |       13
             2 |       5        3        3 |       11
             4 |       1        1       26 |       28
    -----------+---------------------------+---------
         Total |      12        8       32 |       52

In this case, kap would determine that the ratings are from the set {1,2,4} because those were the only values observed. kap would expect a user-defined weighting matrix to be 3 x 3 and, were it not, kap would issue an error message. In the formula-based weights, the calculation would be based on i,j = 1, 2, 3 corresponding to the three observed ratings {1,2,4}.

Specifying the absolute option would make it clear that the ratings are 1, 2, 3, and 4; it just so happens that rating = 3 was never assigned. Were a user-defined weighting matrix also specified, kap would expect it to be 4 x 4 or larger (larger because one can think of the ratings being 1, 2, 3, 4, 5, ... and it just so happens that ratings 5, 6, ... were never observed, just as rating = 3 was not observed). In the formula-based weights, the calculation would be based on i,j = 1, 2, 4.

    . kap ratera raterb, wgt(w)

    Ratings weighted by:
    1.0000  0.5000  0.0000
    0.5000  1.0000  0.5000
    0.0000  0.5000  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      79.81%      57.17%      0.5285     0.1169        4.52     0.0000

    . kap ratera raterb, wgt(w) absolute

    Ratings weighted by:
    1.0000  0.6667  0.0000
    0.6667  1.0000  0.3333
    0.0000  0.3333  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      81.41%      55.08%      0.5862     0.1209        4.85     0.0000

If all conceivable ratings are observed in the data, then whether absolute is specified makes no difference. For instance, if rater A assigns ratings {1,2,4} and rater B assigns {1,2,3,4}, then the complete set of assigned ratings is {1,2,3,4}, the same as absolute would specify. Without absolute, it makes no difference whether the ratings are coded {1,2,3,4}, {0,1,2,3}, {1,7,9,100}, {0,1,1.5,2.0}, or otherwise.
The case of more than two raters

In the case of more than two raters, the mathematics are such that the raters are not considered unique. For instance, if there are three raters, there is no assumption that the three raters who rate the first subject are the same as the three raters who rate the second. Although we call this the more than two raters case, it can be used with two raters when the raters' identities vary.

The nonunique-rater case can be usefully broken down into three subcases: (a) there are two possible ratings, which we will call positive and negative; (b) there are more than two possible ratings, but the number of raters per subject is the same for all subjects; and (c) there are more than two possible ratings and the number of raters per subject varies. kappa handles all these cases.

To emphasize that there is no assumption of constant identity of raters across subjects, the variables specified contain counts of the number of raters rating the subject into a particular category.
> Example

(Two ratings.) Fleiss (1981, 227) offers the following hypothetical ratings by different sets of raters on 25 subjects:

               No. of   No. of                   No. of   No. of
    Subject    raters   pos. ratings   Subject   raters   pos. ratings
       1          2         2             14        4         3
       2          2         0             15        2         0
       3          3         2             16        2         2
       4          4         3             17        3         1
       5          3         3             18        2         1
       6          4         1             19        4         1
       7          3         0             20        5         4
       8          5         0             21        3         2
       9          2         0             22        4         0
      10          4         4             23        3         0
      11          5         5             24        3         3
      12          3         3             25        2         2
      13          4         4

We have entered these data into Stata; the variables are called subject, raters, and pos. kappa, however, requires that we specify variables containing the number of positive ratings and the number of negative ratings; that is, pos and raters-pos:

    . gen neg = raters-pos

    . kappa pos neg

    Two-outcomes, multiple raters:

         Kappa        Z     Prob>Z
        0.5415     5.28     0.0000

We would have obtained the same results if we had typed kappa neg pos.
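The two-outcome, multiple-rater kappa follows the between/within mean-square construction given in Methods and Formulas below. The following Python sketch (for illustration; not Stata code) applies it to Fleiss's 25 subjects and reproduces the 0.5415 reported above.

```python
def kappa_binary_multirater(m, x):
    """Kappa for two outcomes with multiple (nonunique) raters.

    m[i] = number of raters for subject i; x[i] = number of positive
    ratings. Uses the between/within mean-square form of the estimator."""
    n = len(m)
    mbar = sum(m) / n                      # average number of raters
    pbar = sum(x) / sum(m)                 # overall proportion positive
    B = sum((xi - mi * pbar) ** 2 / mi for mi, xi in zip(m, x)) / n
    W = sum(xi * (mi - xi) / mi for mi, xi in zip(m, x)) / (n * (mbar - 1))
    return (B - W) / (B + (mbar - 1) * W)

raters = [2, 2, 3, 4, 3, 4, 3, 5, 2, 4, 5, 3, 4,
          4, 2, 2, 3, 2, 4, 5, 3, 4, 3, 3, 2]
pos    = [2, 0, 2, 3, 3, 1, 0, 0, 0, 4, 5, 3, 4,
          3, 0, 2, 1, 1, 1, 4, 2, 0, 0, 3, 2]
print(round(kappa_binary_multirater(raters, pos), 4))
```

As a sanity check, perfect agreement (every subject rated all-positive or all-negative) makes the within mean square W zero, so kappa is 1.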
> Example

(More than two ratings, constant number of raters.) Each of ten subjects is rated into one of three categories by five raters (Fleiss 1981, 230):

    . list

         subject   cat1   cat2   cat3
      1.       1      1      4      0
      2.       2      2      0      3
      3.       3      0      0      5
      4.       4      4      0      1
      5.       5      3      0      2
      6.       6      1      4      0
      7.       7      5      0      0
      8.       8      0      4      1
      9.       9      1      0      4
     10.      10      3      0      2

We obtain the kappa statistic:

    . kappa cat1-cat3

         Outcome     Kappa        Z     Prob>Z
            cat1    0.2917     2.92     0.0018
            cat2    0.6711     6.71     0.0000
            cat3    0.3490     3.49     0.0002
        combined    0.4179     5.83     0.0000

The first part of the output shows the results of calculating kappa for each of the categories separately against an amalgam of the remaining categories. For instance, the cat1 line is the two-rating kappa where positive is cat1 and negative is cat2 or cat3. The test statistic, however, is calculated differently (see Methods and Formulas). The combined kappa is the appropriately weighted average of the individual kappas. Note that there is considerably less agreement about the rating of subjects into the first category than there is for the second.
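The category-by-category kappas and the combined kappa can be reproduced with a few lines of Python (again purely illustrative): each category is compared against the amalgam of the others using the two-outcome estimator, and the combined value is the average weighted by p_j*q_j, as described in Methods and Formulas.

```python
def binary_kappa(m, x):
    """Two-outcome multirater kappa (between/within mean-square form)."""
    n = len(m)
    mbar = sum(m) / n
    pbar = sum(x) / sum(m)
    B = sum((xi - mi * pbar) ** 2 / mi for mi, xi in zip(m, x)) / n
    W = sum(xi * (mi - xi) / mi for mi, xi in zip(m, x)) / (n * (mbar - 1))
    return (B - W) / (B + (mbar - 1) * W)

# counts[i][j]: number of the 5 raters placing subject i in category j
counts = [[1, 4, 0], [2, 0, 3], [0, 0, 5], [4, 0, 1], [3, 0, 2],
          [1, 4, 0], [5, 0, 0], [0, 4, 1], [1, 0, 4], [3, 0, 2]]
m = [sum(row) for row in counts]
kappas, weights = [], []
for j in range(3):                       # each category vs the rest
    xj = [row[j] for row in counts]
    pbar = sum(xj) / sum(m)
    kappas.append(binary_kappa(m, xj))
    weights.append(pbar * (1 - pbar))    # weight p_j * q_j
combined = sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
print([round(k, 4) for k in kappas], round(combined, 4))
```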
> Example

Now suppose that we have the same data as in the previous example, but that it is organized differently:

    . list

         subject   rater1   rater2   rater3   rater4   rater5
      1.       1        1        2        2        2        2
      2.       2        1        1        3        3        3
      3.       3        3        3        3        3        3
      4.       4        1        1        1        1        3
      5.       5        1        1        1        3        3
      6.       6        1        2        2        2        2
      7.       7        1        1        1        1        1
      8.       8        2        2        2        2        3
      9.       9        1        3        3        3        3
     10.      10        1        1        1        3        3

In this case, you would use kap rather than kappa:

    . kap rater1 rater2 rater3 rater4 rater5

    There are 5 raters per subject:

         Outcome     Kappa        Z     Prob>Z
               1    0.2917     2.92     0.0018
               2    0.6711     6.71     0.0000
               3    0.3490     3.49     0.0002
        combined    0.4179     5.83     0.0000

Note that the information about which rater is which is not exploited when there are more than two raters.
> Example

(More than two ratings, varying number of raters.) In this unfortunate case, kappa can be calculated, but there is no test statistic for testing against kappa > 0. You do nothing differently--kappa calculates the total number of raters for each subject and, if it is not a constant, it suppresses the calculation of test statistics.

    . list

         subject   cat1   cat2   cat3
      1.       1      1      3      0
      2.       2      2      0      3
      3.       3      0      0      5
      4.       4      4      0      1
      5.       5      3      0      2
      6.       6      1      4      0
      7.       7      5      0      0
      8.       8      0      4      1
      9.       9      1      0      2
     10.      10      3      0      2

    . kappa cat1-cat3

         Outcome     Kappa        Z     Prob>Z
            cat1    0.2685
            cat2    0.6457
            cat3    0.2938
        combined    0.3816

    note: Number of ratings per subject vary; cannot calculate test
          statistics.
> Example

This case is similar to the previous example, but the data are organized differently:

    . list

         subject   rater1   rater2   rater3   rater4   rater5
      1.       1        1        2        2        2        .
      2.       2        1        1        3        3        3
      3.       3        3        3        3        3        3
      4.       4        1        1        1        1        3
      5.       5        1        1        1        3        3
      6.       6        1        2        2        2        2
      7.       7        1        1        1        1        1
      8.       8        2        2        2        2        3
      9.       9        1        3        3        .        .
     10.      10        1        1        1        3        3

In this case, we specify kap instead of kappa:

    . kap rater1-rater5

    There are between 3 and 5 (median = 5.00) raters per subject:

         Outcome     Kappa        Z     Prob>Z
               1    0.2685
               2    0.6457
               3    0.2938
        combined    0.3816

    note: Number of ratings per subject vary; cannot calculate test
          statistics.
Saved Results

kap and kappa save in r():

Scalars
    r(N)         number of subjects (kap only)
    r(prop_o)    observed proportion of agreement (kap only)
    r(prop_e)    expected proportion of agreement (kap only)
    r(kappa)     kappa
    r(z)         z statistic
    r(se)        standard error for kappa statistic

Methods and Formulas

kap, kapwgt, and kappa are implemented as ado-files.
The kappa statistic was first proposed by Cohen (1960). The generalization for weights reflecting the relative seriousness of each possible disagreement is due to Cohen (1968). The analysis-of-variance approach for k = 2 and m >= 2 is due to Landis and Koch (1977b). See Altman (1991, 403-409) or Dunn (2000, chapter 2) for an introductory treatment and Fleiss (1981, 212-236) for a more detailed treatment. All formulas below are as presented in Fleiss (1981). Let m be the number of raters and let k be the number of rating outcomes.
kap: m = 2

Define w_{ij} (i = 1,...,k, j = 1,...,k) as the weights for agreement and disagreement (wgt()) or, if not weighted, define w_{ii} = 1 and w_{ij} = 0 for i != j. If wgt(w) is specified, w_{ij} = 1 - |i-j|/(k-1). If wgt(w2) is specified, w_{ij} = 1 - {(i-j)/(k-1)}^2.

The observed proportion of agreement is

    p_o = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_{ij}

where p_{ij} is the fraction of ratings i by the first rater and j by the second. The expected proportion of agreement is

    p_e = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_{i.} p_{.j}

where p_{i.} = \sum_j p_{ij} and p_{.j} = \sum_i p_{ij}.

Kappa is given by \hat{\kappa} = (p_o - p_e)/(1 - p_e).

The standard error of \hat{\kappa} for testing against 0 is

    s_0 = \frac{1}{(1-p_e)\sqrt{n}} \sqrt{ \sum_i \sum_j p_{i.} p_{.j}
          \{ w_{ij} - (\bar{w}_{i.} + \bar{w}_{.j}) \}^2 - p_e^2 }

where n is the number of subjects being rated, \bar{w}_{i.} = \sum_j p_{.j} w_{ij}, and \bar{w}_{.j} = \sum_i p_{i.} w_{ij}. The test statistic Z = \hat{\kappa}/s_0 is assumed to be distributed N(0,1).
kappa: m > 2, k = 2

Each subject i, i = 1,...,n, is found by x_i of m_i raters to be positive (the choice as to what is labeled positive being arbitrary).

The overall proportion of positive ratings is \bar{p} = \sum_i x_i / (n\bar{m}), where \bar{m} = \sum_i m_i / n. The between-subjects mean square is (approximately)

    B = \frac{1}{n} \sum_i \frac{(x_i - m_i \bar{p})^2}{m_i}

and the within-subject mean square is

    W = \frac{1}{n(\bar{m}-1)} \sum_i \frac{x_i (m_i - x_i)}{m_i}

Kappa is then defined

    \hat{\kappa} = \frac{B - W}{B + (\bar{m}-1)W}

The standard error for testing against 0 (Fleiss and Cuzick 1979) is approximately equal to

    s_0 = \frac{1}{(\bar{m}-1)\sqrt{n \bar{m}_H}}
          \sqrt{ 2(\bar{m}_H - 1) +
          \frac{(\bar{m} - \bar{m}_H)(1 - 4\bar{p}\bar{q})}{\bar{m}\bar{p}\bar{q}} }

where \bar{m}_H is the harmonic mean of m_i and \bar{q} = 1 - \bar{p}. The test statistic Z = \hat{\kappa}/s_0 is assumed to be distributed N(0,1).
kappa: m > 2, k > 2

Let x_{ij} be the number of ratings on subject i, i = 1,...,n, into category j, j = 1,...,k. Define \bar{p}_j as the overall proportion of ratings in category j, \bar{q}_j = 1 - \bar{p}_j, and let \hat{\kappa}_j be the kappa statistic given above for k = 2 when category j is compared with the amalgam of all other categories. Kappa is (Landis and Koch 1977b)

    \bar{\kappa} = \frac{ \sum_j \bar{p}_j \bar{q}_j \hat{\kappa}_j }
                        { \sum_j \bar{p}_j \bar{q}_j }

In the case where the number of raters per subject, \sum_j x_{ij}, is a constant m for all i, Fleiss, Nee, and Landis (1979) derived the following formulas for the approximate standard errors. The standard error for testing \hat{\kappa}_j against 0 is

    s_j = \sqrt{ \frac{2}{n m (m-1)} }

and the standard error for testing \bar{\kappa} is

    \bar{s} = \frac{\sqrt{2}}{ \sum_j \bar{p}_j \bar{q}_j \sqrt{n m (m-1)} }
              \sqrt{ \Bigl(\sum_j \bar{p}_j \bar{q}_j\Bigr)^2 -
                     \sum_j \bar{p}_j \bar{q}_j (\bar{q}_j - \bar{p}_j) }
References

Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.

Boyd, N. F., C. Wolfson, M. Moskowitz, T. Carlile, M. Petitclerc, H. A. Ferri, E. Fishell, A. Gregoire, M. Kieman, J. D. Longley, I. S. Simor, and A. B. Miller. 1982. Observer variation in the interpretation of xeromammograms. Journal of the National Cancer Institute 68: 357-363.

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37-46.

------. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70: 213-220.

Dunn, G. 2000. Statistics in Psychiatry. London: Arnold.

Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2d ed. New York: John Wiley & Sons.

Fleiss, J. L. and J. Cuzick. 1979. The reliability of dichotomous judgments: unequal numbers of judges per subject. Applied Psychological Measurement 3: 537-542.

Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin 86: 974-977.

Gould, W. 1997. stata49: Interrater agreement. Stata Technical Bulletin 40: 2-8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 20-28.

Landis, J. R. and G. G. Koch. 1977a. The measurement of observer agreement for categorical data. Biometrics 33: 159-174.

------. 1977b. A one-way components of variance model for categorical data. Biometrics 33: 671-679.

Steichen, T. J. and N. J. Cox. 1998a. sg84: Concordance correlation coefficient. Stata Technical Bulletin 43: 35-39. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 137-143.

------. 1998b. sg84.1: Concordance correlation coefficient, revisited. Stata Technical Bulletin 45: 21-23. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 143-145.

------. 2000. sg84.2: Concordance correlation coefficient: update for Stata 6. Stata Technical Bulletin 54: 25-26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 169-170.
Also See
Related:    [R] tabulate
Title
kdensity -- Univariate kernel density estimation
Syntax
kdensity varname [weight] [if exp] [in range] [, nograph generate(newvarx newvardensity)
    n(#) width(#) [ biweight | cosine | epan | gauss | parzen | rectangle | triangle ]
    normal stud(#) at(varx) symbol(...) connect(...) title(string) graph_options ]

fweights and aweights are allowed; see [U] 14.1.6 weight.
Description
kdensity produces kernel density estimates and graphs the result.
Options
nograph suppresses the graph. This option is often used in combination with the generate() option.
generate(newvarx newvardensity) stores the results of the estimation. newvardensity will contain the density estimate; newvarx will contain the points at which the density is estimated.
n(#) specifies the number of points at which the density estimate is to be evaluated. The default is min(N, 50), where N is the number of observations in memory.
width(#) specifies the halfwidth of the kernel, the width of the density window around each point. If w() is not specified, then the "optimal" width is calculated and used. The optimal width is the width that would minimize the mean integrated square error if the data were Gaussian and a Gaussian kernel were used, so it is not optimal in any global sense. In fact, for multimodal and highly skewed densities, this width is usually too wide and oversmooths the density (Silverman 1986).
biweight, cosine, epan, gauss, parzen, rectangle, and triangle specify the kernel. By default, epan, specifying the Epanechnikov kernel, is used.
normal requests that a normal density be overlaid on the density estimate for comparison.
stud(#) specifies that a Student's t distribution with # degrees of freedom be overlaid on the density estimate for comparison.
at(varx) specifies a variable that contains the values at which the density should be estimated. This option allows you to more easily obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison.
symbol(...) is graph, twoway's symbol() option for specifying the plotting symbol. The default is symbol(o); see [G] graph options.
connect(...) is graph, twoway's connect() option for how points are connected. The default is connect(l), meaning points are connected with straight lines; see [G] graph options.
title(string) is graph, twoway's title() option for specifying the title. The default title is "Kernel Density Estimate"; see [G] graph options.
graph_options are any of the other options allowed with graph, twoway; see [G] graph options.
Remarks
Kernel density estimators approximate the density f(x) from observations on x. Histograms do this too, and the histogram itself is a kind of kernel density estimate. The data are divided into nonoverlapping intervals, and counts are made of the number of data points within each interval. Histograms are bar graphs that depict these frequency counts; the bar is centered at the midpoint of each interval, and its height reflects the average number of data points in the interval.
In more general kernel density estimates, the range is still divided into intervals, and estimates of the density at the center of intervals are produced. One difference is that the intervals are allowed to overlap. One can think of sliding the interval, called a window, along the range of the data and collecting the center-point density estimates. The second difference is that, rather than merely counting the number of observations in a window, a weight between 0 and 1 is assigned, based on the distance from the center of the window, and the weighted values are summed. The function that determines these weights is called the kernel.
Kernel density estimates have the advantages of being smooth and of being independent of the choice of origin (corresponding to the location of the bins in a histogram).
See Salgado-Ugarte, Shimizu, and Taniuchi (1993) and Fox (1990) for discussions of kernel density estimators that stress their use as exploratory data analysis tools.
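The sliding-window computation described above can be sketched outside Stata in a few lines of Python. The Epanechnikov kernel is kdensity's default; the toy data and evaluation point below are our own illustration, not kdensity's internals:

```python
def epanechnikov(z):
    # Epanechnikov kernel: 0.75 * (1 - z^2) for |z| < 1, and 0 outside the window
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0

def kdensity(data, points, width):
    """Kernel density estimate of `data` at each point in `points`:
    f(x) = (1/(n*h)) * sum_i K((x - X_i)/h), with halfwidth h = `width`."""
    n = len(data)
    return [sum(epanechnikov((x - xi) / width) for xi in data) / (n * width)
            for x in points]

# each observation inside the window contributes a weight between 0 and 1
data = [1.0, 2.0, 2.5, 3.0, 4.0]
est = kdensity(data, [2.5], 2.0)   # density estimate at x = 2.5
```

Observations at the center of the window receive the full kernel weight, and the weight falls smoothly to zero at the window's edge, which is what makes the estimate smooth and independent of any bin origin.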
Example
Goeden (1978) reports data consisting of 316 length observations of coral trout. We wish to investigate the underlying density of the lengths. To begin on familiar ground, we might draw a histogram. In [G] histogram, we suggest setting the number of bins to min(sqrt(n), 10*log10(n)), which for n = 316 is roughly 18:

. graph length, xlab ylab bin(18)
(histogram of length omitted)
The kernel density estimate, on the other hand, is smooth.

. kdensity length, xlab ylab
(graph omitted: Kernel Density Estimate of length)
Kernel density estimators are, however, sensitive to an assumption, just as are histograms. In histograms, we specify a number of bins. For kernel density estimators, we specify a width. In the graph above, we used the default width. kdensity is smarter than graph, histogram in that its default width is not a fixed constant. Even so, the default width is not necessarily best.
kdensity saves the width in the return scalar width, so typing display r(width) reveals it. Doing this, we discover that the width is approximately 20.
Widths are similar to the inverse of the number of bins in a histogram in that smaller widths provide more detail. The units of the width are the units of the variable being analyzed. The width is specified as a halfwidth, meaning that the kernel density estimator with halfwidth 20 corresponds to sliding a window of size 40 across the data.
We can specify halfwidths for ourselves using the width() option. Smaller widths do not smooth the density as much.

. kdensity length, epan w(10) xlab ylab
(graph omitted: Kernel Density Estimate)
. kdensity length, epan xlab ylab w(15)
(graph omitted: Kernel Density Estimate)
Example
When widths are held constant, different kernels can produce surprisingly different results. This is really an attribute of the kernel and width combination; for a given width, some kernels are more sensitive than others at identifying peaks in the density estimate. We can see this when using a dataset with lots of peaks. In the automobile dataset, we characterize the density of weight, the weight of the vehicles. Below, we compare the Epanechnikov and Parzen kernels.

. kdensity weight, epan nogr g(x epan)
. kdensity weight, parzen nogr g(x2 parzen)
. label var epan "Epanechnikov Density Estimate"
. label var parzen "Parzen Density Estimate"
. gr epan parzen x, xlab ylab c(ll)
(graph omitted: Epanechnikov Density Estimate and Parzen Density Estimate against Weight (lbs.))
We did not specify a width and so obtained the default width. That width is not a function of the selected kernel, but of the data. See the Methods and Formulas section for the calculation of the optimal width.
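The "optimal" width discussed above is, under the Gaussian assumptions stated in the width() option, Silverman's rule of thumb. A sketch follows; the constant 0.9 and the crude quartile positions are the textbook formula, which may differ in detail from kdensity's exact computation:

```python
def silverman_width(x):
    """Rule-of-thumb halfwidth 0.9 * min(sd, IQR/1.349) * n^(-1/5),
    which minimizes mean integrated square error for Gaussian data."""
    n = len(x)
    mean = sum(x) / n
    sd = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    s = sorted(x)
    iqr = s[(3 * n) // 4] - s[n // 4]   # crude quartile positions
    scale = min(sd, iqr / 1.349)        # robust measure of spread
    return 0.9 * scale / n ** 0.2
```

Because the width shrinks only at the slow rate n^(-1/5), even large samples keep a fairly wide window, which is why multimodal densities tend to be oversmoothed by the default.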
Example
In examining the density estimates, we may wish to overlay a normal density or a Student's t density for comparison. Using automobile weights, we can get an idea of the distance from normality with the normal option.

. kdensity weight, epan normal xlab ylab
(graph omitted: kernel density estimate of Weight (lbs.) with normal density overlaid)
Example
Another common desire in examining density estimates is to compare two or more densities. In this example, we will compare the density estimates of the weights for the foreign and domestic cars.

. kdensity weight, nogr gen(x fx)
. kdensity weight if foreign==0, nogr gen(fx0) at(x)
. kdensity weight if foreign==1, nogr gen(fx1) at(x)
. label var fx0 "Domestic cars"
. label var fx1 "Foreign cars"
. gr fx0 fx1 x, c(ll) s(TS) xlab ylab
(graph omitted: density estimates of Weight (lbs.) for Domestic cars and Foreign cars)
Logit estimates                                   Number of obs   =        189
                                                  Prob > chi2     =     0.0000
Log likelihood = -98.777998                       Pseudo R2       =     0.1582

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0464796   .0373888    -1.24   0.214    -.1197603    .0268011
         lwd |   .8420615   .4055338     2.08   0.038     .0472299    1.636893
       black |   1.073456   .5150752     2.08   0.037     .0639273    2.082985
       other |    .815367   .4452979     1.83   0.067    -.0574008    1.688135
       smoke |   .8071996    .404446     2.00   0.046     .0145001    1.599899
         ptd |   1.281678   .4621157     2.77   0.006     .3759478    2.187408
          ht |   1.435227   .6482699     2.21   0.027     .1646415    2.705813
          ui |   .6576256   .4666192     1.41   0.159    -.2569313    1.572182
       _cons |  -1.216781   .9556797    -1.27   0.203    -3.089878      .656317
------------------------------------------------------------------------------
To get the odds ratio for black smokers relative to white nonsmokers (the reference group), type
. lincom black + smoke, or
 ( 1)  black + smoke = 0.0

(output omitted)

lincom computed exp(b[black] + b[smoke]) = 6.56. To see the odds ratio for white smokers relative to black nonsmokers, type

. lincom smoke - black, or
 ( 1)  - black + smoke = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .7662425   .4430176    -0.46   0.645     .2467334    2.379603
------------------------------------------------------------------------------
Now let's add the interaction terms to the model (Hosmer and Lemeshow 1989, Table 4.10). This time we will use logistic rather than logit. By default, logistic displays odds ratios.

. logistic low age black other smoke ht ui lwd ptd agelwd smokelwd

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(10)     =      42.66
                                                  Prob > chi2     =     0.0000
Log likelihood = -96.00616                        Pseudo R2       =     0.1818

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9194513    .041896    -1.84   0.065     .8408967    1.005344
       black |    2.95383   1.532788     2.09   0.037     1.068277    8.167462
       other |   2.137589   .9919132     1.64   0.102     .8608713    5.307749
       smoke |   3.168096   1.452377     2.52   0.012     1.289956    7.780755
          ht |   3.893141     2.5752     2.05   0.040     1.064768     14.2346
          ui |   2.071284   .9931385     1.52   0.129     .8092928    5.301191
         lwd |   .1772934   .3312383    -0.93   0.354     .0045539    6.902359
         ptd |   3.426633   1.615282     2.61   0.009     1.360252    8.632086
      agelwd |    1.15883     .09602     1.78   0.075     .9851216     1.36317
    smokelwd |   .2447849   .2003996    -1.72   0.086     .0491956    1.217988
------------------------------------------------------------------------------
Hosmer and Lemeshow (1989, Table 4.13) consider the effects of smoking (smoke = 1) and low maternal weight prior to pregnancy (lwd = 1). The effect of smoking among non-low-weight mothers (lwd = 0) is given by the odds ratio 3.17 for smoke in the logistic output. The effect of smoking among low-weight mothers is given by
. lincom smoke + smokelwd
 ( 1)  smoke + smokelwd = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .7755022   .5749508    -0.34   0.732     .1813465    3.316322
------------------------------------------------------------------------------

Note that we did not have to specify the or option. After logistic, lincom assumes or by default.
The effect of low weight (lwd = 1) is more complicated, since we fit an age x lwd interaction. We must specify the age of the mothers for the effect. The effect among 30-year-old nonsmokers is given by
. lincom lwd + 30*agelwd
 ( 1)  lwd + 30.0 agelwd = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    14.7669   13.56689     2.93   0.003     2.439266    89.39625
------------------------------------------------------------------------------

lincom computed exp(b[lwd] + 30*b[agelwd]) = 14.8. It seems odd that we entered it as lwd + 30*agelwd, but remember that lwd and agelwd are just lincom's (and test's) shorthand for _b[lwd] and _b[agelwd]. We could have typed

. lincom _b[lwd] + 30*_b[agelwd]
 ( 1)  lwd + 30.0 agelwd = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    14.7669   13.56689     2.93   0.003     2.439266    89.39625
------------------------------------------------------------------------------
Multiple-equation models
lincom also works with multiple-equation models. The only difference is how you refer to the coefficients. Recall that for multiple-equation models, coefficients are referenced using the syntax [eqno]varname, where eqno is the equation number or equation name and varname is the corresponding variable name for the coefficient; see [U] 16.5 Accessing coefficients and standard errors and [R] test for details.

Example
Consider the example from [R] mlogit (Tarlov et al. 1989; Wells et al. 1989).
. mlogit insure age male nonwhite site2 site3, nolog

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(10)     =      42.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -534.36165                       Pseudo R2       =     0.0387

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
        male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
    nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
       site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
       site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
       _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0077961   .0114416    -0.68   0.496    -.0302217    .0146294
        male |   .4518496   .3674867     1.23   0.219     -.268411      1.17211
    nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725      1.05129
       site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751    -.2893747
       site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327      .510108
       _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872    -.1260135
------------------------------------------------------------------------------
. lincom [Prepaid]male + [Prepaid]nonwhite
 ( 1)  [Prepaid]male + [Prepaid]nonwhite = 0.0

(coefficient-scale output: z = 4.70, P>|z| = 0.000, 95% CI [.8950741, 2.177866])

To view the estimate as a ratio of relative risks (see [R] mlogit for the definition and interpretation), specify the rrr option.

. lincom [Prepaid]male + [Prepaid]nonwhite, rrr
 ( 1)  [Prepaid]male + [Prepaid]nonwhite = 0.0

------------------------------------------------------------------------------
      insure |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   4.648154   1.521103     4.70   0.000     2.447517    8.827451
------------------------------------------------------------------------------
Saved Results
lincom saves in r():

Scalars
    r(estimate)    point estimate
    r(se)          estimate of standard error
    r(df)          degrees of freedom
Methods and Formulas
lincom is implemented as an ado-file.
References
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. Journal of the American Medical Association 262: 925-930.
Wells, K. E., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302.
Also See
Related:       [R] svylc, [R] svytest, [R] test, [R] testnl
Background:    [U] 16.5 Accessing coefficients and standard errors,
               [U] 23 Estimation and post-estimation commands
Title
linktest -- Specification link test for single-equation models
Syntax
linktest [if exp] [in range] [, estimation_options ]

When if exp and in range are not specified, the link test is performed on the same sample as the previous estimation.
Description
linktest performs a link test for model specification after any single-equation estimation command, such as logistic, regress, etc.; see [R] estimation commands.
Options
estimation_options must be the same options specified with the underlying estimation command.
Remarks
The form of the link test implemented here is based on an idea of Tukey (1949), which was further described by Pregibon (1980), elaborating on work in his unpublished thesis (Pregibon 1979). See Methods and Formulas below for more details.
Example
We attempt to explain the mileage ratings of cars in our automobile dataset using the weight, engine displacement, and whether the car is manufactured outside the U.S.:

. regress mpg weight displ foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  3,    70) =   45.88
       Model |  1619.71935     3  539.906448          Prob > F      =  0.0000
    Residual |  823.740114    70  11.7677159          R-squared     =  0.6629
-------------+------------------------------          Adj R-squared =  0.6484
       Total |  2443.45946    73  33.4720474          Root MSE      =  3.4304

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0067745   .0011665    -5.81   0.000    -.0091011   -.0044479
displacement |   .0019286   .0100701     0.19   0.849    -.0181556    .0220129
     foreign |  -1.600631   1.113648    -1.44   0.155    -3.821732    .6204699
       _cons |   41.84795   2.350704    17.80   0.000     37.15962    46.53628
------------------------------------------------------------------------------
Based on the R-squared, we are reasonably pleased with this model. If our model really is specified correctly, then were we to regress mpg on the prediction and the prediction squared, the prediction squared would have no explanatory power. This is what linktest does:

. linktest

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  2,    71) =   76.75
       Model |  1670.71514     2  835.357572          Prob > F      =  0.0000
    Residual |  772.744316    71  10.8837228          R-squared     =  0.6837
-------------+------------------------------          Adj R-squared =  0.6748
       Total |  2443.45946    73  33.4720474          Root MSE      =   3.299

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |  -.4127198   .6577736    -0.63   0.532    -1.724283    .8988434
      _hatsq |   .0338198    .015624     2.16   0.034     .0026664    .0649732
       _cons |   14.00705   6.713276     2.09   0.041     .6211541    27.39294
------------------------------------------------------------------------------

We find that the prediction squared does have explanatory power, so our specification is not as good as we thought.
Although linktest is formally a test of the specification of the dependent variable, it is often interpreted as a test that, conditional on the specification, the independent variables are specified incorrectly. We will follow that interpretation and now include weight-squared in our model:

. gen weight2 = weight*weight
. regress mpg weight weight2 displ foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  4,    69) =   39.37
       Model |  1699.02634     4  424.756584          Prob > F      =  0.0000
    Residual |  744.433124    69  10.7888859          R-squared     =  0.6953
-------------+------------------------------          Adj R-squared =  0.6777
       Total |  2443.45946    73  33.4720474          Root MSE      =  3.2846

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0173257   .0040488    -4.28   0.000    -.0254028   -.0092486
     weight2 |   1.87e-06   6.89e-07     2.71   0.008     4.93e-07    3.24e-06
displacement |  -.0101625   .0106236    -0.96   0.342     -.031356     .011031
     foreign |  -2.560016   1.123506    -2.28   0.026    -4.801349   -.3186832
       _cons |   58.23575   6.449882     9.03   0.000     45.36859    71.10291
------------------------------------------------------------------------------

And now we perform the link test on our new model:

. linktest
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  2,    71) =   81.08
       Model |  1699.39489     2  849.697445          Prob > F      =  0.0000
    Residual |   744.06457    71  10.4797827          R-squared     =  0.6955
-------------+------------------------------          Adj R-squared =  0.6869
       Total |  2443.45946    73  33.4720474          Root MSE      =  3.2372

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.141987   .7612218     1.50   0.138    -.3758456    2.659821
      _hatsq |  -.0031916   .0170194    -0.19   0.852    -.0371272    .0307441
       _cons |   -1.50305   8.196444    -0.18   0.855    -17.84629    14.84019
------------------------------------------------------------------------------

We now pass the link test.
Example
Above we followed a standard misinterpretation of the link test: when we discovered a problem, we focused on the explanatory variables of our model. It is at least worth considering varying exactly what the link test tests. The link test told us that our dependent variable was misspecified. For those with an engineering background, mpg is indeed a strange measure. It would make more sense to model energy consumption, gallons per mile, in terms of weight and displacement:

. regress gpm weight displ foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  3,    70) =   76.33
       Model |  .009157962     3  .003052654          Prob > F      =  0.0000
    Residual |  .002799666    70  .000039995          R-squared     =  0.7659
-------------+------------------------------          Adj R-squared =  0.7558
       Total |  .011957628    73  .000163803          Root MSE      =  .00632

------------------------------------------------------------------------------
         gpm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .0000144   2.15e-06     6.72   0.000     .0000102    .0000187
displacement |   .0000186   .0000186     1.00   0.319    -.0000184    .0000557
     foreign |   .0066981   .0020531     3.26   0.002     .0026034    .0107928
       _cons |   .0008917   .0043337     0.21   0.838    -.0077515     .009535
------------------------------------------------------------------------------

This model looks every bit as reasonable as our original model:

. linktest

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  2,    71) =  117.06
       Model |  .009175219     2  .004587609          Prob > F      =  0.0000
    Residual |  .002782409    71  .000039189          R-squared     =  0.7673
-------------+------------------------------          Adj R-squared =  0.7608
       Total |  .011957628    73  .000163803          Root MSE      =  .00626

------------------------------------------------------------------------------
         gpm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .6608413   .5152751     1.28   0.204    -.3665877     1.68827
      _hatsq |   3.275857   4.936655     0.66   0.509    -6.567553    13.11927
       _cons |    .008365   .0130468     0.64   0.523    -.0176496    .0343795
------------------------------------------------------------------------------

We pass the link test in this more parsimonious specification.
Example
The link test can be used with any single-equation estimation procedure, not solely regression. Let's turn our problem around and attempt to explain whether a car is manufactured outside the U.S. by its mileage rating and weight. To save paper, we will specify logit's nolog option, which suppresses the iteration log:

. logit foreign mpg weight, nolog

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      35.72
                                                  Prob > chi2     =     0.0000
Log likelihood = -27.175156                       Pseudo R2       =     0.3966

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.1685869   .0919174    -1.83   0.067    -.3487418      .011568
      weight |  -.0039067   .0010116    -3.86   0.000    -.0058894     -.001924
       _cons |   13.70837   4.518707     3.03   0.002     4.851864     22.56487
------------------------------------------------------------------------------

When you run linktest after logit, the result is another logit specification:

. linktest, nolog

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      36.83
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.615714                       Pseudo R2       =     0.4090

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .8438531   .2738759     3.08   0.002     .3070661      1.38064
      _hatsq |  -.1559115   .1568642    -0.99   0.320    -.4633596     .1515366
       _cons |   .2630557   .4299598     0.61   0.541      -.57965     1.105761
------------------------------------------------------------------------------
The link test reveals no problems with our specification. If there had been a problem, we would have been virtually forced to accept the misinterpretation of the link test; we would have reconsidered our specification of the independent variables. When using logit, we have no control over the specification of the dependent variable other than to change likelihood functions. We admit to seeing a dataset once where the link test rejected the logit specification. We did change the likelihood function, re-estimating the model using probit, and satisfied the link test. Probit has thinner tails than logit. In general, however, you will not be so lucky.
Technical Note
You should specify exactly the same options with linktest as you do with the estimation command, although you do not have to follow this advice as literally as we did in the preceding example. logit's nolog option merely suppresses a part of the output, not what is estimated. We specified nolog both times to save paper.
If you are testing a cox model with censored observations, however, you must specify the dead() option on linktest as well. If you are testing a tobit model, you must specify the censoring points just as you do with the tobit command.
If you are not sure which options are important, duplicate exactly what you specified on the estimation command.
If you do not specify if exp or in range with linktest, Stata will by default perform the link test on the same sample as the previous estimation. Suppose that you omitted some data when performing your estimation, but want to calculate the link test on all the data, which you might do if you believed the model is appropriate for all the data. To do this, you would type

. linktest if e(sample) | !e(sample)
Saved Results
linktest saves in r():

Scalars
    r(t)     t statistic on _hatsq
    r(df)    degrees of freedom

linktest is not an estimation command in the sense that it leaves previous estimation results unchanged. For instance, one runs a regression and then performs the link test. Typing regress without arguments still replays the original regression.
In terms of integrating an estimation command with linktest, linktest assumes that the name of the estimation command is stored in e(cmd) and that the name of the dependent variable is in e(depvar). After estimation, it assumes that the number of degrees of freedom for the t test is given by e(df_r) if the macro is defined.
If the estimation command reports Z statistics instead of t statistics, linktest will also report Z statistics. The Z statistic, however, is still returned in r(t), and r(df) is set to a missing value.
Methods and Formulas
linktest is implemented as an ado-file.
The link test is based on the idea that if a regression or regression-like equation is properly specified, one should not be able to find any additional independent variables that are significant except by chance. One kind of specification error is called a link error. In regression, this means that the dependent variable needs a transformation or "link" function to properly relate to the independent variables. The idea of a link test is to add an independent variable to the equation that is especially likely to be significant if there is a link error.
Let

    y = f(Xb)

be the model and b be the parameter estimates. linktest calculates

    _hat = Xb

and

    _hatsq = _hat^2

The model is then refit with these two variables, and the test is based on the significance of _hatsq. This is the form suggested by Pregibon (1979), based on an idea of Tukey (1949). Pregibon (1980) suggests a slightly different method that has come to be known as "Pregibon's goodness-of-link test". We preferred the older version because it is universally applicable, straightforward, and a good second-order approximation. It is universally applicable in the sense that it can be applied to any single-equation estimation technique, whereas Pregibon's more recent tests are estimation-technique specific.
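The refit described above can be sketched in plain Python; the tiny normal-equations solver and the exactly-linear toy data are our illustrations, not linktest's internals:

```python
def lstsq(X, y):
    """Least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination (no pivoting; the toy
    data below are well conditioned enough for that)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    v = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        p = A[i][i]
        A[i] = [a / p for a in A[i]]
        v[i] /= p
        for r in range(k):
            if r != i:
                f = A[r][i]
                A[r] = [a - f * ai for a, ai in zip(A[r], A[i])]
                v[r] -= f * v[i]
    return v

def linktest(X, y):
    """Fit y on X, form _hat = Xb and _hatsq = _hat^2, then refit y on
    [1, _hat, _hatsq]; a significant _hatsq signals a link error."""
    b = lstsq(X, y)
    hat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    Z = [[1.0, h, h * h] for h in hat]
    return lstsq(Z, y)   # coefficients [_cons, _hat, _hatsq]

# exactly linear toy data: the specification is correct, so _hatsq ~ 0
X = [[1.0, float(i)] for i in range(10)]
y = [2.0 + 3.0 * row[1] for row in X]
cons, bhat, bhatsq = linktest(X, y)
```

Because the toy model is exactly specified, the refit loads everything on _hat and the _hatsq coefficient is numerically zero; with a misspecified model, _hatsq picks up the curvature that the linear index missed.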
References
Pregibon, D. 1979. Data Analytic Methods for Generalized Linear Models. Ph.D. dissertation, University of Toronto.
------. 1980. Goodness of link tests for generalized linear models. Applied Statistics 29: 15-24.
Tukey, J. W. 1949. One degree of freedom for non-additivity. Biometrics 5: 232-242.

Also See
Related:    [R] estimation commands, [R] lrtest, [R] test, [R] testnl
Title
list -- List values of variables

Syntax
list [varlist] [if exp] [in range] [, [no]display nolabel noobs doublespace ]

by ... : may be used with list; see [R] by.
varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

Description
list displays the values of variables. If no varlist is specified, the values of all the variables are displayed. Also see browse in [R] edit.

Options
[no]display forces the format into display or tabular (nodisplay) format. If you do not specify one of these two options, then Stata chooses based on its judgment of which would be most readable.
nolabel displays the numeric codes rather than the label values (strings).
noobs suppresses printing of the observation numbers.
doublespace requests that a blank line be inserted between observations. This option is ignored in display format.
Remarks
list typed by itself lists all the observations and all the variables in the dataset. If you specify a varlist, only those variables are listed. Specifying one or both of in range and if exp limits the observations listed.

Example
list has two output formats, known as tabular and display. The tabular format is suitable for listing a few variables, whereas the display format is suitable for listing an unlimited number of variables. Stata chooses automatically between those two formats:

. list in 1/2
Observation 1

        make   AMC Concord      price       4,099        mpg          22
       rep78             3   headroom         2.5      trunk          11
      weight         2,930     length         186       turn          40
    displa~t           121   gear_r~o        3.58    foreign    Domestic

Observation 2

        make     AMC Pacer      price       4,749        mpg          17
       rep78             3   headroom         3.0      trunk          11
      weight         3,350     length         173       turn          40
    displa~t           258   gear_r~o        2.53    foreign    Domestic

. list make mpg weight displ rep78 in 1/5

              make   mpg   weight   displa~t   rep78
 1.    AMC Concord    22    2,930        121       3
 2.      AMC Pacer    17    3,350        258       3
 3.     AMC Spirit    22    2,640        121       .
 4.  Buick Century    20    3,250        196       3
 5.  Buick Electra    15    4,080        350       4

The first case is an example of display format; the second is an example of tabular format. The tabular format is more readable and takes less space, but it is effective only if the variables can fit on a single line across the screen. Stata chose to list all twelve variables in display format, but when the varlist was restricted to five variables, Stata chose tabular format.
If you are dissatisfied with Stata's choice, you can make the decision yourself. Specify the display option to force display format and the nodisplay option to force tabular format.
Example
You can make the list easier to read by specifying the doublespace option:

. list make mpg weight displ foreign in 51/55, noobs double

           make   mpg   weight   displa~t    foreign
  Pont. Phoenix    19    3,420        231   Domestic

  Pont. Sunbird    24    2,690        151   Domestic

      Audi 5000    17    2,830        131    Foreign

       Audi Fox    23    2,070         97    Foreign

       BMW 320i    25    2,650        121    Foreign
Technical Note
You can suppress the use of value labels by specifying the nolabel option. For instance, the variable foreign in the examples above really contains numeric codes, 0 meaning Domestic and 1 meaning Foreign. When you list the variable, however, you see the corresponding value labels rather than the underlying numeric codes:

. list foreign in 51/55

       foreign
 51.  Domestic
 52.  Domestic
 53.   Foreign
 54.   Foreign
 55.   Foreign

Specifying the nolabel option displays the underlying numeric codes:

. list foreign in 51/55, nolabel

      foreign
 51.        0
 52.        0
 53.        1
 54.        1
 55.        1
Also See
Related:    [R] edit, [P] display, [P] tabdisp
Title
lnskew0 -- Find zero-skewness log or Box-Cox transform

Syntax
lnskew0 newvar = exp [if exp] [in range] [, level(#) delta(#) zero(#) ]

bcskew0 newvar = exp [if exp] [in range] [, level(#) delta(#) zero(#) ]
Description
lnskew0 creates newvar = ln(±exp − k), choosing k and the sign of exp so that the skewness of newvar is zero.
bcskew0 creates newvar = (exp^λ − 1)/λ, the Box-Cox power transformation (Box and Cox 1964), choosing λ so that the skewness of newvar is zero. exp must be strictly positive. Also see [R] boxcox for maximum likelihood estimation of λ.

Options
level(#) specifies the confidence level for a confidence interval for k (lnskew0) or λ (bcskew0). Unlike usual, the confidence interval is calculated only if level() is specified. As usual, # is specified as an integer; 95 means 95% confidence intervals. The level() option is honored only if the number of observations exceeds 7.
delta(#) specifies the increment used for calculating the derivative of the skewness function with respect to k (lnskew0) or λ (bcskew0). The default values are 0.02 for lnskew0 and 0.01 for bcskew0.
zero(#) specifies a value for skewness to determine convergence that is small enough to be considered zero, and is by default 0.001.
Remarks

Example
Using our automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets), we want to generate a new variable equal to ln(mpg - k) that is approximately normally distributed. mpg records the miles per gallon for each of our cars. One feature of the normal distribution is that it has skewness 0:

. lnskew0 lnmpg = mpg

    Transform |          k    [95% Conf. Interval]       Skewness
--------------+---------------------------------------------------
    ln(mpg-k) |   5.383659        (not calculated)      -7.05e-06
This created the new variable lnmpg = ln(mpg - 5.384):

. describe lnmpg

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------
lnmpg           float  %9.0g                  ln(mpg-5.383659)

Since we did not specify the level() option, no confidence interval was calculated. At the outset, we could have typed

. lnskew0 lnmpg = mpg, level(95)

    Transform |          k    [95% Conf. Interval]       Skewness
--------------+---------------------------------------------------
    ln(mpg-k) |   5.383659    -17.12339    9.892416     -7.05e-06
The confidence interval is calculated under the assumption that ln(mpg − k) really does have a normal distribution. It would be perfectly reasonable to use lnskew0 even if we did not believe that the transformed variable would have a normal distribution (if we literally wanted the zero-skewness transform), although in that case the confidence interval would be an approximation of unknown quality to the true confidence interval. If we now wanted to test the believability of the confidence interval, we could test our new variable lnmpg using swilk with the lnnormal option.
Technical Note

lnskew0 (and bcskew0) reports the resulting skewness of the variable merely to reassure you of the accuracy of its results. In our example above, lnskew0 found k such that the resulting skewness was −7 × 10^−6, a number close enough to zero for all practical purposes. If you wanted to make it even smaller, you could specify the zero() option. Typing lnskew0 new=mpg, zero(1e-8) changes the estimated k to 5.383552 from 5.383659 and reduces the calculated skewness to −2 × 10^−11.

When you request a confidence interval, it is possible that lnskew0 will report the lower confidence limit as '.', which should be taken as indicating a lower confidence limit of k_L = −∞. (This cannot happen with bcskew0.)

As an example, consider a sample of size n on x, and assume the skewness of x is positive, but not significantly so at the desired significance level, say 5%. Then no matter how large and negative you make k_L, there is no value extreme enough to make the skewness of ln(x − k_L) equal the corresponding percentile (97.5 for a 95% confidence interval) of the distribution of skewness in a normal distribution of the same sample size. You cannot because the distribution of ln(x − k_L) tends to that of x (apart from location and scale shift) as k_L → −∞. This "problem" never applies to the upper confidence limit k_U, because the skewness of ln(x − k_U) tends to −∞ as k_U tends upward to the minimum value of x.
Example

In the example above, using lnskew0 with a variable like mpg is probably undesirable. mpg has a natural zero and we are shifting that zero arbitrarily. On the other hand, use of lnskew0 with a variable such as temperature measured in Fahrenheit or Celsius would be more appropriate, as the zero is indeed arbitrary.
For a variable like mpg, it makes more sense to use the Box-Cox power transform (Box and Cox 1964):

    y^(λ) = (y^λ − 1)/λ

λ is free to take on any value, but note that y^(1) = y − 1, y^(0) = ln(y), and y^(−1) = 1 − 1/y.

bcskew0 works like lnskew0:
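The same zero-skewness search applies to the power family. A hedged Python sketch follows (illustrative only; the function names are invented and this is not StataCorp's implementation): for positive data, shrinking λ compresses the right tail, so the skewness of the transform increases with λ and a bisection finds the zero crossing.

```python
import math

def skewness(xs):
    # Sample skewness m3 / m2^(3/2).
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((v - mean) ** 2 for v in xs) / n
    m3 = sum((v - mean) ** 3 for v in xs) / n
    return m3 / m2 ** 1.5

def boxcox(v, lam):
    # y^(lam) = (y^lam - 1)/lam; the lam -> 0 limit is ln(y).
    return math.log(v) if lam == 0 else (v ** lam - 1.0) / lam

def bcskew0(xs, lo=-5.0, hi=5.0, zero=1e-6):
    # Bisect for the lambda that zeroes the skewness of the transform.
    f = lambda lam: skewness([boxcox(v, lam) for v in xs])
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if abs(f(mid)) < zero:
            return mid
        if f(mid) > 0:
            hi = mid        # transform still right-skewed: lower lambda
        else:
            lo = mid
    return (lo + hi) / 2.0
```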
    . bcskew0 bcmpg = mpg, level(95)

                     Transform |      L     [95% Conf. Interval]    Skewness
    ---------------------------+------------------------------------------------
                   (mpg^L-1)/L | -.3673283  -1.212752   .4339645    .0001898
It is worth noting that the 95% confidence interval includes λ = −1 (λ is labeled L in the output), which has a rather more pleasing interpretation (gallons per mile) than (mpg^(−.3673) − 1)/(−.3673). The confidence interval, however, is calculated under the assumption that the power-transformed variable is normally distributed. It makes perfect sense to use bcskew0 even when one does not believe that the transformed variable will be normally distributed, but in that case the confidence interval is an approximation of unknown quality. If one believes that the transformed data are normally

Command logs are always straight ASCII text files, and this makes them easy to convert into do-files. (In this respect, it would make more sense if the default extension of a command log file was .do, because command logs are do-files. The default is .txt, not .do, however, to keep you from accidentally overwriting your important do-files.)

Full logs are recorded in one of two formats: SMCL (Stata Markup and Control Language) or text (meaning ASCII). The default is SMCL, but set logtype can change that, or you can specify an option to state the format you wish. We recommend SMCL because it preserves fonts and colors. SMCL logs can be converted to ASCII text or to other formats using the translate command; see [R] translate. translate can also be used to produce printable versions of SMCL logs, or you can print SMCL logs by pulling down File and choosing Log. SMCL logs can be viewed in the Viewer, as can any file; see [R] view.
log -- Echo copy of session to file or device
log or cmdlog, typed without arguments, reports the status of logging.

log using and cmdlog using open a log file. log close and cmdlog close close the file. Between times, log off and cmdlog off, and log on and cmdlog on, can temporarily suspend and resume logging.

set logtype specifies the default format in which full logs are to be recorded. Initially, full logs are set to be recorded in SMCL format.

set linesize specifies the width of the screen currently being used (and so really has nothing to do with logging). Few Stata commands currently respect linesize, but this will change in the future.
Options

Options for use with both log and cmdlog

append specifies that results are to be appended onto the end of an already existing file. If the file does not already exist, a new file is created.

replace specifies that filename, if it already exists, is to be overwritten, and so is an alternative to append. When you do not specify either replace or append, the file is assumed to be new; if it already exists, an error message is issued and logging is not started.
Options for use with log

text and smcl specify the format in which the log is to be recorded. The default is complicated to describe but is what you would expect:

If you specify the file as filename.smcl, then the default is to write the log in SMCL format (regardless of the value of set logtype).

If you specify the file as filename.log, then the default is to write the log in text format (regardless of the value of set logtype).

If you type filename without an extension and specify neither the smcl nor the text option, the default is to write the file according to the value of set logtype. If you have not reset set logtype, then that default is SMCL. In addition, the filename you specified will be fixed to read filename.smcl if a SMCL log is being created or filename.log if a text log is being created.

If you specify either of the options text or smcl, then what you specify determines how the log is written. If filename was specified without an extension, the appropriate extension is added for you.
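The extension defaulting just described is mechanical enough to sketch. Below is a hedged Python rendering of those rules; resolve_log is a made-up helper name, not a Stata command, and fmt stands in for the smcl/text options while logtype stands in for set logtype.

```python
def resolve_log(filename, logtype="smcl", fmt=None):
    # fmt mirrors the smcl/text options; logtype mirrors set logtype.
    if fmt is None:
        if filename.endswith(".smcl"):
            fmt = "smcl"             # filename.smcl -> SMCL regardless
        elif filename.endswith(".log"):
            fmt = "text"             # filename.log -> text regardless
        else:
            fmt = logtype            # no extension -> follow set logtype
    if not (filename.endswith(".smcl") or filename.endswith(".log")):
        filename += ".smcl" if fmt == "smcl" else ".log"
    return filename, fmt
```

So resolve_log("myfile") yields a .smcl file under the shipped default, while an explicit text format or a .log extension yields a text log.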
Remarks

For a detailed explanation of logs, see [U] 18 Printing and preserving output.
Note that when you open a full log, the default is to show the name of the file and a time and date stamp:

    . log using myfile

           log:  C:\data\proj1\myfile.smcl
      log type:  smcl
     opened on:  12 Dec 2000, 19:28:23
The above information will appear in the log. If you do not want this information to appear, precede the command by quietly:

    . quietly log using myfile

quietly will not suppress any error messages or anything else you need to know.
Similarly, when you close a full log, the default is to show the full information:

    . log close
           log:  C:\data\proj1\myfile.smcl
     closed on:  12 Dec 2000, 12:32:41
and that information will appear in the log. If you want to suppress that, type quietly log close.
Saved Results

log and cmdlog save in r():

Macros
    r(filename)    name of file
    r(status)      on or off
    r(type)        text or smcl
Also See

Complementary:   [R] translate; [R] more, [R] query

Background:      [GSM] 17 Logs: Printing and saving output,
                 [GSU] 17 Logs: Printing and saving output,
                 [GSW] 17 Logs: Printing and saving output,
                 [U] 10 --more-- conditions,
                 [U] 14.6 File-naming conventions,
                 [U] 18 Printing and preserving output
Title

logistic -- Logistic regression
Syntax

    logistic depvar varlist [weight] [if exp] [in range] [, level(#) robust
        cluster(varname) score(newvarname) asis offset(varname) coef
        maximize_options]

    lfit [depvar] [weight] [if exp] [in range] [, group(#) table outsample
        all beta(matname)]

    lstat [depvar] [weight] [if exp] [in range] [, cutoff(#) all
        beta(matname)]

    lroc [depvar] [weight] [if exp] [in range] [, nograph all beta(matname)
        graph_options]

    lsens [depvar] [weight] [if exp] [in range] [, nograph genprob(varname)
        gensens(varname) genspec(varname) replace all beta(matname)
        graph_options]

by ... : may be used with logistic; see [R] by.

logistic allows fweights and pweights; lfit, lstat, lroc, and lsens allow only fweights; see [U] 14.1.6 weight.

logistic shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

logistic may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic rules asif
        nooffset]

where statistic is

      p            probability of a positive outcome (the default)
      xb           x_j b, fitted values
      stdp         standard error of the prediction
    * dbeta        Pregibon (1981) Δβ̂ influence statistic
    * deviance     deviance residual
    * dx2          Hosmer and Lemeshow (1989) ΔX² influence statistic
    * ddeviance    Hosmer and Lemeshow (1989) ΔD influence statistic
    * hat          Pregibon (1981) leverage
    * number       sequential number of the covariate pattern
    * residuals    Pearson residuals; adjusted for number sharing covariate pattern
    * rstandard    standardized Pearson residuals; adjusted for number sharing covariate pattern

Unstarred statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample even when if e(sample) is not specified.
Description

logistic estimates a logistic regression of depvar on varlist, where depvar is a 0/1 variable (or, more precisely, a 0/non-0 variable). Without arguments, logistic redisplays the last logistic estimates. logistic displays estimates as odds ratios; to view coefficients, type logit after running logistic. To obtain odds ratios for any covariate pattern relative to another, see [R] lincom.

lfit displays either the Pearson goodness-of-fit test or the Hosmer-Lemeshow goodness-of-fit test.

lstat displays various summary statistics, including the classification table.

lroc graphs and calculates the area under the ROC curve.

lsens graphs sensitivity and specificity versus probability cutoff and optionally creates new variables containing these data.

lfit, lstat, lroc, and lsens can produce statistics and graphs either for the estimation sample or for any set of observations. However, they always use the estimation sample by default. When weights, if, or in are used with logistic, it is not necessary to repeat them with these commands when you want statistics computed for the estimation sample. Specify if, in, or the all option only when you want statistics computed for a set of observations other than the estimation sample. Specify weights (only fweights are allowed with these commands) only when you want to use a different set of fweights.

By default, lfit, lstat, lroc, and lsens use the last model estimated by logistic. Alternatively, the model can be specified by inputting a vector of coefficients with the beta() option and passing the name of the dependent variable depvar to the commands.

The lfit, lstat, lroc, and lsens commands may also be used after logit or probit.

Here is a list of other estimation commands that may be of interest. See [R] estimation commands for a complete list. See Gould (2000) for a discussion of the interpretation of logistic regression.
    blogit       [R] glogit          Maximum-likelihood logit regression for grouped data
    bprobit      [R] glogit          Maximum-likelihood probit regression for grouped data
    clogit       [R] clogit          Conditional (fixed-effects) logistic regression
    cloglog      [R] cloglog         Maximum-likelihood complementary log-log estimation
    glm          [R] glm             Generalized linear models
    glogit       [R] glogit          Weighted least-squares logit regression for grouped data
    gprobit      [R] glogit          Weighted least-squares probit regression for grouped data
    heckprob     [R] heckprob        Maximum-likelihood probit estimation with selection
    hetprob      [R] hetprob         Maximum-likelihood heteroskedastic probit estimation
    logit        [R] logit           Maximum-likelihood logit regression
    mlogit       [R] mlogit          Maximum-likelihood multinomial (polytomous) logistic regression
    nlogit       [R] nlogit          Maximum-likelihood nested logit estimation
    ologit       [R] ologit          Maximum-likelihood ordered logit regression
    oprobit      [R] oprobit         Maximum-likelihood ordered probit regression
    probit       [R] probit          Maximum-likelihood probit regression
    scobit       [R] scobit          Maximum-likelihood skewed logit estimation
    svylogit     [R] svy estimators  Survey version of logit
    svymlogit    [R] svy estimators  Survey version of mlogit
    svyologit    [R] svy estimators  Survey version of ologit
    svyoprobit   [R] svy estimators  Survey version of oprobit
    svyprobit    [R] svy estimators  Survey version of probit
    xtclog       [R] xtclog          Random-effects and population-averaged cloglog models
    xtlogit      [R] xtlogit         Fixed-effects, random-effects, and population-averaged logit models
    xtprobit     [R] xtprobit        Random-effects and population-averaged probit models
    xtgee        [R] xtgee           GEE population-averaged generalized linear models
Options

Options for logistic

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE) but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svylogit command in [R] svy estimators for a command designed especially for survey data.

cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.
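For intuition, here is a hedged Python sketch of the robust/cluster() calculation for a toy logit with an intercept and one covariate. It is illustrative only, not Stata's implementation: the sandwich is A⁻¹BA⁻¹, where A is the negative Hessian and B sums score outer products, summed within cluster first when cluster() is given; the finite-sample scale factors Stata applies are omitted, and the function names are invented.

```python
import math

def fit_logit(xs, ys, iters=50):
    # Newton-Raphson for a logit with intercept a and slope b.
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    return a, b, (h00, h01, h11)     # Hessian pieces at (near-)convergence

def cluster_robust_vcov(xs, ys, cl):
    a, b, (h00, h01, h11) = fit_logit(xs, ys)
    # Sum scores u_j = (y_j - p_j)(1, x_j) within each cluster first.
    sums = {}
    for x, y, c in zip(xs, ys, cl):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        s0, s1 = sums.get(c, (0.0, 0.0))
        sums[c] = (s0 + (y - p), s1 + (y - p) * x)
    b00 = sum(s0 * s0 for s0, s1 in sums.values())
    b01 = sum(s0 * s1 for s0, s1 in sums.values())
    b11 = sum(s1 * s1 for s0, s1 in sums.values())
    det = h00 * h11 - h01 * h01
    i00, i01, i11 = h11 / det, -h01 / det, h00 / det   # A^-1 (symmetric)
    # V = A^-1 B A^-1 (scale factors omitted)
    m00 = i00 * b00 + i01 * b01
    m01 = i00 * b01 + i01 * b11
    m10 = i01 * b00 + i11 * b01
    m11 = i01 * b01 + i11 * b11
    v00 = m00 * i00 + m01 * i01
    v11 = m10 * i01 + m11 * i11
    return a, b, v00, v11            # coefficients and robust variances
```

With one observation per "cluster" this reduces to the ordinary robust estimator; grouping correlated observations into shared clusters is what changes the standard errors, exactly as in the hospital example below.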
score(newvarname) creates newvar containing u_j = ∂lnL_j/∂(x_j b) for each observation j in the sample. The score vector is Σ ∂lnL_j/∂b = Σ u_j x_j; i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit.

offset(varname) specifies that varname is to be included in the model with coefficient constrained to be 1.

coef causes logistic to report the estimated coefficients rather than the odds ratios (exponentiated coefficients). coef may be specified when the model is estimated or used later to redisplay results. coef affects only how results are displayed and not how they are estimated.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.

Options for lfit, lstat, lroc, and lsens

group(#) (lfit only) specifies the number of quantiles to be used to group the data for the Hosmer-Lemeshow goodness-of-fit test. group(10) is typically specified. If this option is not given, the Pearson goodness-of-fit test is computed using the covariate patterns in the data as groups.

table (lfit only) displays a table of the groups used for the Hosmer-Lemeshow or Pearson goodness-of-fit test with predicted probabilities, observed and expected counts for both outcomes, and totals for each group.

outsample (lfit only) adjusts the degrees of freedom for the Pearson and Hosmer-Lemeshow goodness-of-fit tests for samples outside of the estimation sample. See the section Samples other than the estimation sample later in this entry.

all requests that the statistic be computed for all observations in the data, ignoring any if or in restrictions specified with logistic.

beta(matname) specifies a row vector containing coefficients for a logistic model. The columns of the row vector must be labeled with the corresponding names of the independent variables in the data. The dependent variable depvar must be specified immediately after the command name. See the section Models other than the last estimated model later in this entry.

cutoff(#) (lstat only) specifies the value for determining whether an observation has a predicted positive outcome. An observation is classified as positive if its predicted probability is >= #. The default is 0.5.

nograph (lroc and lsens) suppresses graphical output.

graph_options (lroc and lsens) are any of the options allowed with graph, twoway; see [G] graph options.

genprob(varname), gensens(varname), and genspec(varname) (lsens only) specify the names of new variables created to contain, respectively, the probability cutoffs and the corresponding sensitivity and specificity.

replace (lsens only) requests that if existing variables are specified for genprob(), gensens(), or genspec(), they should be overwritten.
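For the logit likelihood, the per-observation quantity stored by score() has a simple closed form: u_j = y_j − p_j, the response minus the fitted probability. A hedged Python sketch of that identity (illustrative, not Stata; logit_scores is a made-up name):

```python
import math

def logit_scores(xb, ys):
    # u_j = dlnL_j / d(x_j b) = y_j - p_j, with p_j = invlogit(x_j b)
    return [y - 1.0 / (1.0 + math.exp(-v)) for v, y in zip(xb, ys)]
```

At the maximum likelihood estimates, these scores sum to zero against every covariate, which is why they are the building block of the robust variance calculation above.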
Options for predict

p, the default, calculates the probability of a positive outcome.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

dbeta calculates the Pregibon (1981) Δβ̂ influence statistic, a standardized measure of the difference in the coefficient vector due to deletion of the observation along with all others that share the same covariate pattern. In Hosmer and Lemeshow (1989) jargon, this statistic is M-asymptotic; that is, adjusted for the number of observations that share the same covariate pattern.

deviance calculates the deviance residual.

dx2 calculates the Hosmer and Lemeshow (1989) ΔX² influence statistic, reflecting the decrease in the Pearson X² due to deletion of the observation and all others that share the same covariate pattern.

ddeviance calculates the Hosmer and Lemeshow (1989) ΔD influence statistic, which is the change in the deviance residual due to deletion of the observation and all others that share the same covariate pattern.

hat calculates the Pregibon (1981) leverage, or the diagonal elements of the hat matrix, adjusted for the number of observations that share the same covariate pattern.

number numbers the covariate patterns; observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The "first" covariate pattern is numbered 1, the second 2, and so on.

residuals calculates the Pearson residual as given by Hosmer and Lemeshow (1989) and adjusted for the number of observations that share the same covariate pattern.

rstandard calculates the standardized Pearson residual as given by Hosmer and Lemeshow (1989) and adjusted for the number of observations that share the same covariate pattern.

rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations. See [R] logit for an example.

asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model. See [R] logit for an example.

nooffset is relevant only if you specified offset(varname) for logistic. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
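The covariate-pattern bookkeeping behind the number statistic can be sketched in a few lines of Python (illustrative only; the helper name is invented): rows with identical covariate values share a pattern number, assigned in order of first appearance.

```python
def covariate_pattern_numbers(rows):
    # rows: list of covariate-value lists; returns a pattern id per row
    seen = {}
    out = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen[key] = len(seen) + 1   # "first" pattern is numbered 1
        out.append(seen[key])
    return out
```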
Remarks

Remarks are presented under the headings

    logistic and logit
    Robust estimate of variance
    lfit
    lstat
    lroc
    lsens
    Samples other than the estimation sample
    Models other than the last estimated model
    predict after logistic
logistic and logit

logistic provides an alternative and preferred way to estimate maximum-likelihood logit models, the other choice being logit described in [R] logit.

First, let us dispose of some confusing terminology. We use the words logit and logistic to mean the same thing: maximum likelihood estimation. To some, one or the other of these words connotes transforming the dependent variable and using weighted least squares to estimate the model, but that is not how we use either word here. Thus, the logit and logistic commands produce the same results.

The logistic command is generally preferred to logit because logistic presents the estimates in terms of odds ratios rather than coefficients. To a few, this may seem a disadvantage, but you can type logit without arguments after logistic to see the underlying coefficients.

Nevertheless, [R] logit is still worth reading because logistic shares the same features as logit, including omitting variables due to collinearity or one-way causation.

For an introduction to logistic regression, see Lemeshow and Hosmer (1998) or Pagano and Gauvreau (2000, 470-487); for a thorough discussion, see Hosmer and Lemeshow (1989; second edition forthcoming in 2001).
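The relationship between the two displays is purely presentational: a coefficient b with standard error se becomes the odds ratio exp(b), with the confidence interval obtained by exponentiating the coefficient's interval; the displayed standard error of an odds ratio is the delta-method value se × exp(b). A hedged Python sketch with made-up numbers (this is an illustration, not output from any real estimation):

```python
import math

def odds_ratio_row(b, se, z=1.959964):
    # (odds ratio, displayed std. err., lower 95% CI, upper 95% CI)
    return (math.exp(b), se * math.exp(b),
            math.exp(b - z * se), math.exp(b + z * se))

# hypothetical coefficient and standard error for illustration
or_, or_se, lo, hi = odds_ratio_row(0.9233, 0.3867)
```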
Example

Consider the following dataset from a study of risk factors associated with low birth weight described in Hosmer and Lemeshow (1989, appendix 1).

    . describe

    Contains data from lbw.dta
      obs:           189                          Hosmer & Lemeshow data
     vars:            11                          18 Jul 2000 16:27
     size:         3,402 (95.1% of memory free)

    variable name   storage   display    value
                    type      format     label      variable label
    ------------------------------------------------------------------------
    id              int       %8.0g                 identification code
    low             byte      %8.0g                 birth weight
    age             byte      %8.0g
    lwt             int       %8.0g
    race            byte      %8.0g      race       race
                 (remaining output omitted)
    . xi: logistic low age lwt i.race smoke ptl ht ui, robust
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  Wald chi2(8)    =      33.22
                                                  Prob > chi2     =     0.0003
                                                  Pseudo R2       =     0.1416

    ------------------------------------------------------------------------------
                 |               Robust
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0329376    -0.80   0.423     .9108015    1.040009
             lwt |   .9849634   .0070209    -2.13   0.034     .9712984    .9988206
        _Irace_2 |   3.534767   1.793616     2.49   0.013     1.307504    9.556051
        _Irace_3 |   2.368079   1.026563     1.99   0.047     1.012512    5.538501
           smoke |   2.517698   .9736416     2.39   0.017     1.179852    5.372537
             ptl |   1.719161   .7072902     1.32   0.188     .7675715    3.850476
              ht |   6.249602   4.102026     2.79   0.005     1.726445     22.6231
              ui |     2.1351   1.042775     1.55   0.120     .8197749    5.560858
    ------------------------------------------------------------------------------
Additionally, robust allows you to specify cluster() and is then able, within cluster, to relax the assumption of independence. To illustrate this, we have made some fictional additions to the low-birth-weight data.

Pretend that these data are not a random sample of mothers but instead are a random sample of mothers from a random sample of hospitals. In fact, that may be true; we do not know the history of these data, but we can pretend in any case.

Hospitals specialize, and it would not be too incorrect to say that some hospitals specialize in more difficult cases. We are going to show two extremes. In one, all hospitals are alike, but we are going to estimate under the possibility that they might differ. In the other, hospitals are strikingly different. In both cases, we assume patients are drawn from 20 hospitals.

In both examples, we will estimate the same model, and we will type the same command to estimate it. Below are the same data we have been using but with a new variable, hospid, which identifies from which of the 20 hospitals each patient was drawn (and which we have made up):
    . xi: logistic low age lwt i.race smoke ptl ht ui, robust cluster(hospid)
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  Wald chi2(8)    =      49.67
                                                  Prob > chi2     =     0.0000
    Log likelihood = -100.724                     Pseudo R2       =     0.1416

                        (standard errors adjusted for clustering on hospid)
    ------------------------------------------------------------------------------
                 |               Robust
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0397476    -0.66   0.507      .898396     1.05437
             lwt |   .9849634   .0057101    -2.61   0.009     .9738352    .9962187
        _Irace_2 |   3.534767   2.013285     2.22   0.027     1.157563    10.79386
        _Irace_3 |   2.368079   .8451325     2.42   0.016     1.176562    4.766257
           smoke |   2.517698   .8284259     2.81   0.005     1.321062     4.79826
             ptl |   1.719161   .6676221     1.40   0.163     .8030814    3.680219
              ht |   6.249602   4.066275     2.82   0.005     1.745911    22.37086
              ui |     2.1351   1.093144     1.48   0.138     .7827337    5.824014
    ------------------------------------------------------------------------------
The standard errors are quite similar to the standard errors we have previously obtained, whether we used the robust or the conventional estimators. In this example, we invented the hospital ids randomly.

Here are the results of the estimation with the same data but with a different set of hospital ids:

    . xi: logistic low age lwt i.race smoke ptl ht ui, robust cluster(hospid)
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  Wald chi2(8)    =       7.19
                                                  Prob > chi2     =     0.5167
    Log likelihood = -100.724                     Pseudo R2       =     0.1416

                        (standard errors adjusted for clustering on hospid)
    ------------------------------------------------------------------------------
                 |               Robust
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0293064    -0.90   0.368     .9174862    1.032432
             lwt |   .9849634   .0106123    -1.41   0.160     .9643817    1.005984
        _Irace_2 |   3.534767   3.120338     1.43   0.153     .6265521     19.9418
        _Irace_3 |   2.368079   1.297738     1.57   0.116     .8089594    6.932114
           smoke |   2.517698   1.570287     1.48   0.139     .7414969    8.548654
             ptl |   1.719161   .6799153     1.37   0.171     .7919046    3.732161
              ht |   6.249602   7.165454     1.60   0.110      .660558    59.12808
              ui |     2.1351   1.411977     1.15   0.251     .5841231    7.804266
    ------------------------------------------------------------------------------
Note the strikingly larger standard errors. What happened? In these data, women most likely to have low-birth-weight babies are sent to certain hospitals, and the decision on likeliness is based not just on age, smoking history, etc., but on other things that doctors can see but that are not recorded in our data. Thus, merely because a woman is at one of the centers identifies her as more likely to have a low-birth-weight baby.

So much for our fictional example. The rest of this section uses the real low-birth-weight data. To remind you, the last model we left off with was
    . xi: logistic low age lwt i.race smoke ptl ht ui
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  LR chi2(8)      =      33.22
                                                  Prob > chi2     =     0.0001
    Log likelihood = -100.724                     Pseudo R2       =     0.1416

    ------------------------------------------------------------------------------
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0354759    -0.74   0.457     .9061578    1.045339
             lwt |   .9849634   .0068217    -2.19   0.029     .9716834    .9984249
        _Irace_2 |   3.534767   1.860737     2.40   0.016     1.259736    9.918406
        _Irace_3 |   2.368079   1.039949     1.96   0.050     1.001356    5.600207
           smoke |   2.517698    1.00916     2.30   0.021     1.147676    5.523162
             ptl |   1.719161   .5952579     1.56   0.118     .8721455    3.388787
              ht |   6.249602   4.322408     2.65   0.008     1.611152    24.24199
              ui |     2.1351   .9808153     1.65   0.099     .8677528      5.2534
    ------------------------------------------------------------------------------
lfit

lfit computes goodness-of-fit tests: either the Pearson X² test or the Hosmer-Lemeshow test.

By default, lfit, lstat, lroc, and lsens compute statistics for the estimation sample using the last model estimated by logistic. However, samples other than the estimation sample can be specified; see the section Samples other than the estimation sample later in this entry. Models other than the last model estimated by logistic can also be specified; see the section Models other than the last estimated model later in this entry.
Example

lfit, typed without options, presents the Pearson X² goodness-of-fit test for the estimated model. The Pearson X² goodness-of-fit test is a test of the observed against expected number of responses using cells defined by the covariate patterns; see predict with the number option below for the definition of covariate patterns.

    . lfit
    Logistic model for low, goodness-of-fit test

           number of observations =       189
     number of covariate patterns =       182
            Pearson chi2(173)     =    179.24
                      Prob > chi2 =    0.3567

Our model fits reasonably well. We should note, however, that the number of covariate patterns is close to the number of observations, making the applicability of the Pearson X² test questionable, but not necessarily inappropriate. Hosmer and Lemeshow (1989) suggest regrouping the data by ordering on the predicted probabilities and then forming, say, 10 nearly equal-size groups. lfit with the group() option does this:

    . lfit, group(10)
    Logistic model for low, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

           number of observations =       189
                 number of groups =        10
          Hosmer-Lemeshow chi2(8) =      9.65
                      Prob > chi2 =    0.2904

Again, we cannot reject our model. If you specify the table option, lfit displays the groups along with the expected and observed number of positive responses (low-birth-weight babies):

    . lfit, group(10) table
    Logistic model for low, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

      _Group    _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
           1   0.0827        0      1.2       19     17.8       19
           2   0.1276        2      2.0       17     17.0       19
           3   0.2015        6      3.2       13     15.8       19
           4   0.2432        1      4.3       18     14.7       19
           5   0.2792        7      4.9       12     14.1       19
           6   0.3138        7      5.6       12     13.4       19
           7   0.3872        6      6.5       13     12.5       19
           8   0.4828        7      8.2       12     10.8       19
           9   0.5941       10     10.3        9      8.7       19
          10   0.8391       13     12.8        5      5.2       18

           number of observations =       189
                 number of groups =        10
          Hosmer-Lemeshow chi2(8) =      9.65
                      Prob > chi2 =    0.2904
Technical Note

lfit with the group() option puts all observations with the same predicted probabilities into the same group. If, as in the previous example, we request 10 groups, the groups that lfit makes are [p0, p10], (p10, p20], (p20, p30], ..., (p90, p100], where pk is the kth percentile of the predicted probabilities, with p0 the minimum and p100 the maximum.

If there are large numbers of ties at the quantile boundaries, as will frequently happen if all independent variables are categorical and there are only a few of them, the sizes of the groups will be uneven. If the totals in some of the groups are small, the X² statistic for the Hosmer-Lemeshow test may be unreliable. In this case, either fewer groups should be specified, or the Pearson goodness-of-fit test may be a better choice.
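In outline, the grouping and statistic can be sketched as follows (a hedged Python illustration, not Stata's exact tie-handling): sort on the predicted probabilities, cut at quantile boundaries while keeping ties together, and accumulate (observed − expected)²/expected over both outcomes in each group.

```python
def hosmer_lemeshow(probs, ys, groups=10):
    # probs: predicted probabilities; ys: 0/1 outcomes
    pairs = sorted(zip(probs, ys))
    n = len(pairs)
    chi2, i = 0.0, 0
    for g in range(1, groups + 1):
        j = round(n * g / groups)
        # push the boundary past ties so equal probabilities share a group
        while j < n and pairs[j][0] == pairs[j - 1][0]:
            j += 1
        if j <= i:
            continue                       # boundary swallowed by ties
        block = pairs[i:j]
        obs1 = sum(y for _, y in block)
        exp1 = sum(p for p, _ in block)
        obs0, exp0 = len(block) - obs1, len(block) - exp1
        chi2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
        i = j
    return chi2
```

When ties swallow a quantile boundary, neighboring groups merge, which is exactly how the uneven group sizes described above arise.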
Example

The table option can be used without the group() option. We would not want to specify this for our current model because there were 182 covariate patterns in the data, caused by the inclusion of the two continuous variables age and lwt in the model. As an aside, we estimate a simpler model and specify table with lfit:

    . logistic low _Irace_2 _Irace_3 smoke ui

    Logit estimates                               Number of obs   =        189
                                                  LR chi2(4)      =      18.80
                                                  Prob > chi2     =     0.0009
    Log likelihood = -107.93404                   Pseudo R2       =     0.0801

    ------------------------------------------------------------------------------
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        _Irace_2 |   3.052746   1.498084     2.27   0.023     1.166749    7.987368
        _Irace_3 |   2.922593   1.189226     2.64   0.008      1.31646    6.488269
           smoke |   2.945742   1.101835     2.89   0.004      1.41517    6.131701
              ui |   2.419131   1.047358     2.04   0.041      1.03546    5.651783
    ------------------------------------------------------------------------------
    . lfit, table
    Logistic model for low, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

      _Group    _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
           1   0.1230        3      4.9       37     35.1       40
           2   0.2533        1      1.0        3      3.0        4
           3   0.2907       16     13.7       31     33.3       47
           4   0.2923       15     12.6       28     30.4       43
           5   0.2997        3      3.9       10      9.1       13
           6   0.4978        4      4.0        4      4.0        8
           7   0.4998        4      4.5        5      4.5        9
           8   0.5087        2      1.5        1      1.5        3
           9   0.5469        2      4.4        6      3.6        8
          10   0.5577        6      5.6        4      4.4       10
          11   0.7449        3      3.0        1      1.0        4

      _Group    _Prob   _Irace_2   _Irace_3   smoke   ui
           1   0.1230          0          0       0    0
           2   0.2533          0          0       0    1
           3   0.2907          0          1       0    0
           4   0.2923          0          0       1    0
           5   0.2997          1          0       0    0
           6   0.4978          0          1       0    1
           7   0.4998          0          0       1    1
           8   0.5087          1          0       0    1
           9   0.5469          0          1       1    0
          10   0.5577          1          0       1    0
          11   0.7449          0          1       1    1

           number of observations =       189
     number of covariate patterns =        11
                  Pearson chi2(6) =      5.71
                      Prob > chi2 =    0.4569
chi2 =
6_67 0_5721
Note that we did not specify an if statement with lfit since we wanted to use the estimation sample. Since the test is nonsignificant, we are satisfied with the fit of our model.

Running lroc gives a measure of the discrimination:

. lroc, nograph

Logistic model for low

number of observations =       94
area under ROC curve   =   0.8158

Now we test the calibration of our model by performing a goodness-of-fit test on the validation sample. We specify the outsample option so that the degrees of freedom are 10 rather than 8.

. lfit if group==2, group(10) table outsample

Logistic model for low, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

  _Group |   _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
---------+------------------------------------------------------
       1 |  0.0725        1      0.4        9      9.6       10
       2 |  0.1202        4      0.8        5      8.2        9
       3 |  0.1549        3      1.3        7      8.7       10
       4 |  0.1888        1      1.5        8      7.5        9
       5 |  0.2609        3      2.2        7      7.8       10
       6 |  0.3258        4      2.7        5      6.3        9
       7 |  0.4217        2      3.7        8      6.3       10
       8 |  0.4915        3      4.1        6      4.9        9
       9 |  0.6265        4      5.5        6      4.5       10
      10 |  0.9737        4      7.1        5      1.9        9

  number of observations =        95
        number of groups =        10
Hosmer-Lemeshow chi2(10) =     28.03
             Prob > chi2 =    0.0018

We must acknowledge that our model does not fit well on the validation sample. The model's discrimination in the validation sample is appreciably lower as well.

. lroc if group==2, nograph

Logistic model for low

number of observations =       95
area under ROC curve   =   0.5839
Models other than the last estimated model

By default, lfit, lstat, lroc, and lsens use the last model estimated by logistic. One can specify other models using the beta() option.

> Example

Suppose that someone publishes the following logistic model of low birth weight:

    Pr(low = 1) = F(-0.02 age - 0.01 lwt + 1.3 black + 1.1 smoke + 0.5 ptl + 1.8 ht + 0.8 ui + 0.5)

where F is the cumulative logistic distribution. Note that these coefficients are not odds ratios; they are the equivalent of what logit produces.

We can see whether this model fits our data. First, we enter the coefficients as a row vector and label its columns with the names of the independent variables plus _cons for the constant (see [P] matrix define and [P] matrix rownames).

. matrix input b = (-.02 -.01 1.3 1.1 .5 1.8 .8 .5)
. matrix colnames b = age lwt black smoke ptl ht ui _cons

We run lfit using the beta() option to specify b. The dependent variable is entered right after the command name, and the outsample option gives the proper degrees of freedom.

. lfit low, beta(b) group(10) outsample

Logistic model for low, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

  number of observations =       189
        number of groups =        10
Hosmer-Lemeshow chi2(10) =     27.33
             Prob > chi2 =    0.0023

Although the fit of the model is poor, lroc shows that it does exhibit some predictive ability.

. lroc low, beta(b) nograph

Logistic model for low

number of observations =      189
area under ROC curve   =   0.7275
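The probabilities implied by the published coefficients can also be computed directly from b without re-estimating anything. A sketch, assuming b has been defined as above; the variable black is our construction from race (the coding is an assumption on our part):

```stata
* Sketch: predicted probabilities from the published vector b, by hand
gen black = (race==2)              // assumption: race code 2 means black
matrix score double xb = b         // xb = x_j b using b's column names
gen double p_pub = exp(xb)/(1 + exp(xb))
```

These are the same probabilities that lfit and lroc use internally when beta(b) is specified.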
predict after logistic

predict is used after logistic to obtain predicted probabilities, residuals, and influence statistics for the estimation sample. The suggested diagnostic graphs below are from Hosmer and Lemeshow (1989), where they are more elaborately explained. Also see Collett (1991, 120-160) for a thorough discussion of model checking.

predict without options

Typing predict p after estimation calculates the predicted probability of a positive outcome. We previously ran the model logistic low age lwt _Irace_2 _Irace_3 smoke ptl ht ui. We obtain the predicted probabilities of a positive outcome by typing

. predict p
(option p assumed; Pr(low))
. summarize p low

    Variable |     Obs        Mean    Std. Dev.       Min        Max
           p |     189    .3121693    .1918915   .0272559   .8391283
         low |     189    .3121693    .4646093          0          1

predict with the xb and stdp options

predict with the xb option calculates the linear combination xjb, where xj are the independent variables in the jth observation and b is the estimated parameter vector. This is sometimes known as the index function since the cumulative distribution function indexed at this value is the probability of a positive outcome.

With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. The influence statistics described below are adjusted for replicated covariate patterns in the data.
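The link between the index and the probability can be checked directly. A small sketch, continuing with the model above:

```stata
* Sketch: the probability is the logistic cdf evaluated at the index
predict xb, xb
gen double p_hand = exp(xb)/(1 + exp(xb))   // same values as predict's p
```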
predict with the residuals option

predict can calculate more than predicted probabilities. The Pearson residual is defined as the square root of the contribution of the covariate pattern to the Pearson X2 goodness-of-fit statistic, signed according to whether the observed number of positive responses within the covariate pattern is less than or greater than expected. For instance,

. predict r, residuals
. summarize r, detail

                        Pearson residual
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.750923      -2.283885
 5%    -1.129907      -1.750923
10%    -.9581174      -1.636279       Obs                 189
25%    -.6545911      -1.636279       Sum of Wgt.         189

50%    -.3806923                      Mean          -.0242299
                       Largest        Std. Dev.      .9970949
75%     .8162894        2.23879
90%     1.510355       2.317558       Variance       .9941981
95%     1.747948       3.002206       Skewness       .8618271
99%     3.002206       3.126763       Kurtosis       3.038448

We notice the prevalence of a few large positive residuals:
. sort r
. list id r low p age race in -5/l

            id          r   low          p   age    race
185.        33   2.224501     1    .166329    19   white
186.        57    2.23879     1   .1569594    15   white
187.        16   2.317558     1   .0998678    27   other
188.        77   3.002206     1   .0927932    26   white
189.        36   3.126763     1   .1681123    24   white
predict with the number option

Covariate patterns play an important role in logistic regression. Two observations are said to share the same covariate pattern if the independent variables for the two observations are identical. Although one thinks of having individual observations, the statistical information in the sample can be summarized by the covariate patterns, the number of observations with that covariate pattern, and the number of positive outcomes within the pattern. Depending on the model, the number of covariate patterns can approach or be equal to the number of observations, or it can be considerably less.

All the residual and diagnostic statistics calculated by Stata are in terms of covariate patterns, not observations. That is, all observations with the same covariate pattern are given the same residual and diagnostic statistics. Hosmer and Lemeshow (1989) argue that such "M-asymptotic" statistics are more useful than "N-asymptotic" statistics.

To understand the difference, think of an observed positive outcome with predicted probability of 0.8. Taking the observation in isolation, the "residual" must be positive: we expected 0.8 positive responses and observed 1. This may indeed be the "correct" residual, but not necessarily. Under the M-asymptotic definition, we ask how many successes we observed across all observations with this covariate pattern. If that number were, say, 6, and there were a total of 10 observations with this covariate pattern, then the residual is negative for the covariate pattern: we expected 8 positive outcomes but observed 6. predict makes this kind of calculation and then attaches the same residual to all observations in the covariate pattern.

Thus, there may be occasions when you want to find all observations sharing a covariate pattern. number allows you to do this:

. predict pattern, number
. summarize pattern

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     pattern |     189     89.2328    53.16573         1        182

We previously estimated the model logistic low age lwt _Irace_2 _Irace_3 smoke ptl ht ui over 189 observations. There are 182 covariate patterns in our data.
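As a quick arithmetic check of the M-asymptotic residual just described (6 successes observed where 8 were expected among 10 observations), the Pearson residual for that hypothetical covariate pattern can be computed by hand:

```stata
* Pearson residual for a pattern with m=10, p=0.8, and 6 observed successes
* (hypothetical numbers taken from the text above)
display (6 - 10*0.8)/sqrt(10*0.8*0.2)
```

The result is about -1.58: negative, as the M-asymptotic argument claims, even though each positive outcome viewed in isolation would have a positive residual.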
predict with the deviance option

The deviance residual is defined as the square root of the contribution to the likelihood-ratio test statistic of a saturated model versus the fitted model. It has slightly different properties from the Pearson residual (see Hosmer and Lemeshow 1989):

. predict d, deviance
. summarize d, detail

                       deviance residual
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.843472      -1.911621
 5%     -1.33477      -1.843472
10%    -1.148316      -1.843472       Obs                 189
25%    -.8445325      -1.674869       Sum of Wgt.         189

50%    -.5202702                      Mean          -.1228811
                       Largest        Std. Dev.      1.049237
75%     .9129041       1.894089
90%     1.541558       1.924457       Variance       1.100898
95%     1.673338       2.146583       Skewness       .6598857
99%     2.146583       2.180542       Kurtosis       2.036938
predict with the rstandard option

Pearson residuals do not have a standard deviation equal to 1. As a fine point, rstandard generates Pearson residuals normalized to have an expected standard deviation equal to 1.

. predict rs, rstandard
. summarize r rs

    Variable |     Obs        Mean    Std. Dev.        Min        Max
           r |     189   -.0242299    .9970949  -2.283885   3.126763
          rs |     189   -.0279135    1.026406    -2.4478   3.149081

. correlate r rs
(obs=189)

             |        r       rs
           r |   1.0000
          rs |   0.9998   1.0000

Remember that we previously created r containing the (unstandardized) Pearson residuals. In these data, whether you use standardized or unstandardized residuals does not much matter.
predict with the hat option

hat calculates the leverage of a covariate pattern, a scaled measure of distance in terms of the independent variables. Large values indicate covariate patterns "far" from the average covariate pattern, patterns that can have a large effect on the estimated model even if the corresponding residual is small. This suggests the following:

. predict h, hat
. graph h r, border yline(0) ylab xlab

[Graph omitted: scatterplot of leverage (hat) against the Pearson residual, with a vertical line at 0.]

The points to the left of the vertical line are observed negative outcomes; in this case, our data contain almost as many covariate patterns as observations, so most covariate patterns are unique. In such unique patterns, we observe either 0 or 1 success and expect p, thus forcing the sign of the residual. If we had fewer covariate patterns, which is to say, if we did not have continuous variables in our model, there would be no such interpretation, and we would not have drawn the vertical line at 0.

Points on the left and right edges of the graph represent large residuals: covariate patterns that are not fitted well by our model. Points at the top of the graph represent high-leverage patterns. When analyzing the influence of observations on the model, we are most interested in patterns with high leverage and small residuals, patterns that might otherwise escape our attention.

predict with the dx2 option

There are many ways to measure influence, of which hat is one example. dx2 measures the decrease in the Pearson X2 goodness-of-fit statistic that would be caused by deleting an observation (and all others sharing the covariate pattern):
. predict dx2, dx2
. graph dx2 p, border ylab xlab

[Graph omitted: scatterplot of dx2 against the predicted probability of a positive outcome.]

Paraphrasing Hosmer and Lemeshow (1989), the points going from the top left to the bottom right correspond to covariate patterns with the number of positive outcomes equal to the number in the group; the points on the other curve correspond to 0 positive outcomes. In our data, most of the covariate patterns are unique, so the points tend to lie along one or the other curve; the points that are off the curves correspond to the few repeated covariate patterns in our data in which all the outcomes are not the same.

We examine this graph for large values of dx2; there are two at the top left.
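dx2 corresponds to the squared standardized Pearson residual in Hosmer and Lemeshow's M-asymptotic formulation. A sketch of the relationship, assuming r and h from the earlier predict calls are still in memory:

```stata
* Sketch: dx2 recomputed as the squared standardized Pearson residual
gen double dx2_hand = r^2/(1 - h)
summarize dx2 dx2_hand
```

The two should agree up to rounding.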
predict with the ddeviance option

Another measure of influence is the change in the deviance residuals due to deletion of a covariate pattern:

. predict dd, ddeviance

As with dx2, one typically graphs ddeviance against the probability of a positive outcome. We direct you to Hosmer and Lemeshow (1989) for an example and the interpretation of this graph.

predict with the dbeta option

One of the more direct measures of influence of interest to model fitters is the Pregibon (1981) dbeta measure, a measure of the change in the coefficient vector that would be caused by deleting an observation (and all others sharing the covariate pattern):
. predict db, dbeta
. graph db p, border ylab xlab

[Graph omitted: scatterplot of dbeta against Pr(low).]
One observation has a large effect on the estimated coefficients. We can easily find this point:

. sort db
. list in l

Observation 189

     id          188      low            0      age           25
     lwt          95      race       white      smoke          1
     ptl           3      ht             0      ui             1
     ftv           0      bwt         3637      _Irace_2       0
     _Irace_3      0      pattern      117      p       .8391283
     d     -1.911621      r      -2.283885      rs       -2.4478
     dx2    5.991726      dd      4.197658      h       .1294439
     db     .8909163
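The influence statistics in this listing are mutually consistent and can be cross-checked by hand; the arithmetic below plugs the listed values of r, d, and h into the M-asymptotic relationships (our reading of the formulas):

```stata
* Cross-checking observation 189's influence statistics from r, d, and h
display 2.283885^2/(1 - .1294439)                // dx2, about 5.9917
display 1.911621^2/(1 - .1294439)                // dd,  about 4.1977
display 2.283885^2*.1294439/(1 - .1294439)^2     // db,  about .8909
```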
Hosmer and Lemeshow (1989) suggest a graph that combines two of the influence measures:

. graph dx2 p [w=db], border ylab xlab t1("Symbol size proportional to dBeta")

[Graph omitted: dx2 against Pr(low), with symbol size proportional to dbeta.]

We can easily spot the most influential points by the dbeta and dx2 measures.
Saved Results

logistic saves in e():

Scalars
    e(N)           number of observations
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         logistic
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(offset)      offset
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample

lfit saves in r():

Scalars
    r(N)           number of observations
    r(df)          degrees of freedom
    r(m)           number of covariate patterns or groups
    r(chi2)        chi-squared

lstat saves in r():

Scalars
    r(P_corr)      percent correctly classified
    r(P_p1)        sensitivity
    r(P_n0)        specificity
    r(P_p0)        false positive rate given true negative
    r(P_n1)        false negative rate given true positive
    r(P_1p)        positive predictive value
    r(P_0n)        negative predictive value
    r(P_0p)        false positive rate given classified positive
    r(P_1n)        false negative rate given classified negative

lroc saves in r():

Scalars
    r(N)           number of observations
    r(area)        area under the ROC curve

lsens saves in r():

Scalars
    r(N)           number of observations
Methods and Formulas

logistic, lfit, lstat, lroc, and lsens are implemented as ado-files.

Define xj as the (row) vector of independent variables, augmented by 1, and b as the corresponding estimated parameter (column) vector. The logistic regression model is estimated by logit; see [R] logit for details of estimation.

The odds ratio corresponding to the ith coefficient is ψi = exp(bi). The standard error of the odds ratio is si^ψ = ψi si, where si is the standard error of bi estimated by logit.
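A minimal sketch of these two formulas in action; the model here (foreign on mpg from the automobile data) is purely illustrative:

```stata
* Sketch: odds ratio and its standard error recovered from a logit fit
* psi = exp(b) and se(psi) = psi*se(b)
quietly logit foreign mpg
display exp(_b[mpg])             // the odds ratio for mpg
display exp(_b[mpg])*_se[mpg]    // its standard error
```

These reproduce the Odds Ratio and Std. Err. columns that logistic would display for the same model.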
Define Ij = xjb as the predicted index of the jth observation. The predicted probability of a positive outcome is

    pj = exp(Ij) / {1 + exp(Ij)}
lfit

Let M be the total number of covariate patterns among the N observations. View the data as collapsed on covariate patterns j = 1, 2, ..., M, and define mj as the total number of observations having covariate pattern j and yj as the total number of positive responses among observations with covariate pattern j. Define pj as the predicted probability of a positive outcome in covariate pattern j.

The Pearson X2 goodness-of-fit statistic is

             M
    X2  =   sum   (yj - mj pj)^2 / {mj pj (1 - pj)}
            j=1

This X2 statistic has approximately M - k degrees of freedom for the estimation sample, where k is the number of independent variables including the constant. For a sample outside of the estimation sample, the statistic has M degrees of freedom.

The Hosmer-Lemeshow goodness-of-fit X2 (Hosmer and Lemeshow 1980; Lemeshow and Hosmer 1982; Hosmer, Lemeshow, and Klar 1988) is calculated similarly, except rather than using the M covariate patterns as the group definition, the quantiles of the predicted probabilities are used to form groups. Let G = # be the number of quantiles requested with group(#).

"… predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model.
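The Pearson statistic above can be computed by hand by collapsing on covariate patterns. A minimal sketch, assuming a fitted logit of y on x1 and x2 is in memory (the variable names are placeholders):

```stata
* Sketch: the Pearson chi-squared computed over covariate patterns
predict double p
egen patt = group(x1 x2)
egen m  = count(y), by(patt)       // m_j: observations in pattern j
egen y1 = sum(y),   by(patt)       // y_j: positive responses in pattern j
bysort patt: gen double x2j = (y1 - m*p)^2/(m*p*(1-p)) if _n == 1
quietly summarize x2j
display r(sum)                     // the Pearson X2 statistic
```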
logit (and logistic and probit) will also occasionally display messages such as

    note:  4 failures and 0 successes completely determined.
There are two causes for a message like this. Let us deal with the most unlikely case first. This case occurs when a continuous variable (or a combination of a continuous variable with other continuous or dummy variables) is simply a great predictor of the dependent variable. Consider Stata's auto.dta dataset with 6 observations removed.

. use auto
(1978 Automobile Data)
. drop if foreign==0 & gear_ratio>3.1
(6 observations deleted)
. logit foreign mpg weight gear_ratio, nolog

Logit estimates                               Number of obs   =         68
                                              LR chi2(3)      =      72.64
                                              Prob > chi2     =     0.0000
Log likelihood = -6.4874814                   Pseudo R2       =     0.8484

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.4944907   .2655508    -1.86   0.063    -1.014961    .0259792
      weight |  -.0060919    .003101    -1.96   0.049    -.0121696    -.000014
  gear_ratio |   15.70509   8.166234     1.92   0.054    -.3004359    31.71061
       _cons |  -21.39527   25.41486    -0.84   0.400    -71.20747    28.41694
------------------------------------------------------------------------------
note: 4 failures and 0 successes completely determined.
Note that there are no missing standard errors in the output. If you receive the "completely determined" message and have one or more missing standard errors in your output, see the second case discussed below.

Note gear_ratio's large coefficient. logit thought that the 4 observations with the smallest predicted probabilities were essentially predicted perfectly.

. predict p
(option p assumed; Pr(foreign))
. sort p
. list p in 1/4

              p
  1.   1.34e-10
  2.   6.26e-09
  3.   7.84e-09
  4.   1.49e-08

If this happens to you, there is no need to do anything. Computationally, the model is sound. It is the second case discussed below that requires careful examination.
The second case occurs when the independent terms are all dummy variables or continuous ones with repeated values (e.g., age). In this case, one or more of the estimated coefficients will have missing standard errors. For example, consider this dataset consisting of 5 observations.

. list

        y   x1   x2
  1.    0    0    0
  2.    0    1    0
  3.    1    1    0
  4.    0    0    1
  5.    1    0    1

. logit y x1 x2, nolog

Logit estimates                               Number of obs   =          5
                                              LR chi2(2)      =       1.18
                                              Prob > chi2     =     0.5530
Log likelihood = -2.7725887                   Pseudo R2       =     0.1761

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   18.26157          2     9.13   0.000     14.34164     22.1815
          x2 |   18.26157          .        .       .            .           .
       _cons |  -18.26157   1.414214   -12.91   0.000    -21.03338   -15.48976
------------------------------------------------------------------------------
note: 1 failure and 0 successes completely determined.

. predict p
(option p assumed; Pr(y))
. list

        y   x1   x2          p
  1.    0    0    0   1.17e-08
  2.    0    1    0         .5
  3.    1    1    0         .5
  4.    0    0    1         .5
  5.    1    0    1         .5

Two things are happening here. The first is that logit is able to fit the outcome (y = 0) for the covariate pattern x1 = 0 and x2 = 0 (i.e., the first observation) perfectly. It is this observation that is the "1 failure ... completely determined". The second thing to note is that if this observation is dropped, x1, x2, and the constant are collinear.
This is the cause of the message "completely determined" and the missing standard errors. It happens when you have a covariate pattern (or patterns) with only one outcome, and there is collinearity when the observations corresponding to this covariate pattern are dropped.

If this happens to you, confirm the causes. First identify the covariate pattern with only one outcome. (For your data, replace x1 and x2 with the independent variables of your model.)

. egen pattern = group(x1 x2)
. quietly logit y x1 x2
. predict p
(option p assumed; Pr(y))
. summarize p

    Variable |     Obs        Mean    Std. Dev.        Min        Max
           p |       5          .4    .2236068   1.17e-08         .5

If successes were completely determined, that means there are predicted probabilities that are almost 1. If failures were completely determined, that means there are predicted probabilities that are almost 0. The latter is the case here, so we locate the corresponding value of pattern:

. tab pattern if p < 1e-7

  group(x1 |
       x2) |      Freq.     Percent        Cum.
-----------+-----------------------------------
         1 |          1      100.00      100.00
-----------+-----------------------------------
     Total |          1      100.00

Once we omit this covariate pattern from the estimation sample, logit can deal with the collinearity:

. logit y x1 x2 if pattern~=1, nolog
note: x2 dropped due to collinearity

Logit estimates                               Number of obs   =          4
                                              LR chi2(1)      =       0.00
                                              Prob > chi2     =     1.0000
Log likelihood = -2.7725887                   Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |          0          2     0.00   1.000    -3.919928    3.919928
       _cons |          0   1.414214     0.00   1.000    -2.771808    2.771808
------------------------------------------------------------------------------
We omit the collinear variable. Then we must decide whether to include or to omit the observations with pattern = 1. We could include them

. logit y x1, nolog

Logit estimates                               Number of obs   =          5
                                              LR chi2(1)      =       0.14
                                              Prob > chi2     =     0.7098
Log likelihood = -3.2958369                   Pseudo R2       =     0.0206

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .6931472   1.870827     0.37   0.711    -2.973605      4.3599
       _cons |  -.6931472   1.224742    -0.57   0.571    -3.093597    1.707302
------------------------------------------------------------------------------

or exclude them:

. logit y x1 if pattern~=1, nolog

Logit estimates                               Number of obs   =          4
                                              LR chi2(1)      =       0.00
                                              Prob > chi2     =     1.0000
Log likelihood = -2.7725887                   Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |          0          2     0.00   1.000    -3.919928    3.919928
       _cons |          0   1.414214     0.00   1.000    -2.771808    2.771808
------------------------------------------------------------------------------
If the covariate pattern that predicts outcome perfectly is meaningful, you may want to exclude these observations from the model. In this case, one would report that covariate pattern such and such predicted outcome perfectly and that the best model for the rest of the data is .... But, more likely, the perfect prediction was simply the result of having too many predictors in the model. In this case, one would omit the extraneous variable(s) from further consideration and report the best model for all the data.

Obtaining predicted values

Once you have estimated a logit model, you can obtain the predicted probabilities using the predict command for both the estimation sample and other samples; see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.

predict without arguments calculates the predicted probability of a positive outcome, i.e., Pr(yj = 1) = F(xjb). With the xb option, it calculates the linear combination xjb, where xj are the independent variables in the jth observation and b is the estimated parameter vector. This is sometimes known as the index function since the cumulative distribution function indexed at this value is the probability of a positive outcome.

In both cases, Stata remembers any "rules" used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example.

For information about the other statistics available after predict, see [R] logistic.
> Example

In the previous example, we estimated the logit model logit foreign rep_is_1 rep_is_2. To obtain predicted probabilities:

. predict p
(option p assumed; Pr(foreign))
(10 missing values generated)
. summarize foreign p

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     foreign |      58    .2068966    .4086186         0          1
           p |      48         .25    .1956984        .1         .5

Stata remembers any "rules" used to identify the model and sets predictions to missing for any excluded observations. In the previous example, logit dropped the variable rep_is_1 from our model and excluded 10 observations. Thus, when we typed predict p, those same 10 observations were again excluded and their predictions were set to missing.

predict's rules option will use the rules in the prediction. During estimation, we were told "rep_is_1~=0 predicts failure perfectly", so the rule is that when rep_is_1 is not zero, one should predict 0 probability of success or a positive outcome:

. predict p2, rules
. summarize foreign p p2

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     foreign |      58    .2068966    .4086186         0          1
           p |      48         .25    .1956984        .1         .5
          p2 |      58    .2068966    .2016268         0         .5

predict's asif option will ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model:

. predict p3, asif
. summarize foreign p p2 p3

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     foreign |      58    .2068966    .4086186         0          1
           p |      48         .25    .1956984        .1         .5
          p2 |      58    .2068966    .2016268         0         .5
          p3 |      58    .2931034    .2016268        .1         .5

Which is right? What predict does by default is the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is only correct if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case, however, you should re-estimate the model to include the excluded observations.
Saved Results

logit saves in e():

Scalars
    e(N)           number of observations
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         logit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(offset)      offset
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
Methods and Formulas

The word logit is due to Berkson (1944) and is by analogy with the word probit. For an introduction to probit and logit, see, for example, Aldrich and Nelson (1984), Hamilton (1992), Johnston and DiNardo (1997), Long (1997), or Powers and Xie (2000).

The likelihood function for logit is

    ln L  =  sum over j in S of  wj ln F(xjb)  +  sum over j not in S of  wj ln{1 - F(xjb)}

where S is the set of all observations j such that yj ~= 0, F(z) = e^z/(1 + e^z), and wj denotes the optional weights. ln L is maximized as described in [R] maximize.

If robust standard errors are requested, the calculation described in Methods and Formulas of [R] regress is carried forward with uj = {1 - F(xjb)}xj for the positive outcomes and -F(xjb)xj for the negative outcomes. qc is given by its asymptotic-like formula.
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.

Berkson, J. 1944. Application of the logistic function to bio-assay. Journal of the American Statistical Association 39: 357-365.

Cleves, M. and A. Tosetto. 2000. sg139: Logistic regression when binary outcome is measured with uncertainty. Stata Technical Bulletin 55: 20-23.

Hamilton, L. C. 1992. Regression with Graphics. Pacific Grove, CA: Brooks/Cole Publishing Company.

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Powers, D. A. and Y. Xie. 2000. Statistical Methods for Categorical Data Analysis. San Diego, CA: Academic Press.

Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705-724.
Also See

Complementary:  [R] clogit, [R] cloglog, [R] cusum, [R] glm, [R] glogit, [R] logistic,
                [R] nlogit, [R] probit, [R] scobit, [R] svy estimators, [R] xtclog,
                [R] xtgee, [R] xtlogit, [R] xtprobit

Related:        [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict,
                [R] roc, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize
loneway — Large one-way ANOVA, random effects, and reliability

Syntax

    loneway response_var group_var [weight] [if exp] [in range] [, mean median exact level(#) ]

by ... : may be used with loneway; see [R] by.

aweights are allowed; see [U] 14.1.6 weight.

Description

loneway estimates one-way analysis-of-variance (ANOVA) models on datasets with a large number of levels of group_var and presents different ancillary statistics from oneway (see [R] oneway):

    Feature                                        oneway    loneway
    Estimate one-way model
      on fewer than 376 levels                        x          x
      on more than 376 levels                                    x
    Bartlett's test for equal variance                x
    Multiple-comparison tests                         x
    Intragroup correlation and S.E.                              x
    Intragroup correlation confidence interval                   x
    Est. reliability of group-averaged score                     x
    Est. S.D. of group effect                                    x
    Est. S.D. within group                                       x

Options

mean specifies that the expected value of the F(k-1, N-k) distribution be used as the reference point Fm in the estimation of ρ instead of the default value of 1.

median specifies that the median of the F(k-1, N-k) distribution be used as the reference point Fm in the estimation of ρ instead of the default value of 1.

exact requests that exact confidence intervals be computed, as opposed to the default asymptotic confidence intervals. This option is allowed only if the groups are equal in size and weights are not used.

level(#) specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
_
Re.m
loneway-
i
Large one-wayANOVA,random effects,and reliability
261
> Example lonewa't's output looks like that of oneway except that, at the end, additional information is presented. Jsing our automobile dataset (see [U]'9 Stata's on-line tutorials and sample datasets), we have eated a (numeric) variable called ma_u:facturer_grp identifying the manufacturer of each car an within each manufacturer we have retained a maximum of four models, selecting those with',the h Jcest mpg. We can compute the intradlass correlation of mpg for all manufacturers with at least You models as follows: . 'loneway mpg manufacturer_grp if nummake == 4 One-way Analysis of VarianCe for mpg: Mileage (mpg)
S_urce
SS
df
Number of obs =
36
R-squared :
0. 5228
MS
F
Between|manufactu/~p Withi_ manufactur_p
621.88889 567.75
8 27
77,736111 21.027778
Total
1189. 6389
35
33.989683
Intraclass correlation 0.402T0
Asy. S.E O.18770
Prob > F
3.70
O.0049
[957 Conf. Interval] O.03481
0.77060
Estimatec_SD of manufactur_p effect 3.765247 Estimated SD within manufactur-p 4.585605 Est. reliability of a manufactur-p mean .7294979 (evau%uatedat 11=4.00)
q
In additi(,n to the standard one-way ANOVAoutput, lonewayproduces the R-squared, estimated standardde,,iation of the group effect, estimated standard deviation within group, the intragroup correlation he estimated reliability of the group-a_eraged mean, and, in the case of unweighted data. the asyrr/ptc :ic standard error and confidence interval for the intragroup correlation.
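The summary numbers follow from the ANOVA table by simple arithmetic; a sketch of the computation on the displayed values:

```stata
* R-squared and F recomputed from the ANOVA table above
display 621.88889/1189.6389     // R-squared, about 0.5228
display 77.736111/21.027778     // F statistic, about 3.70
```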
R-squared

The R-squared is, of course, simply the underlying R² for a regression of response_var on the levels of group_var, or mpg on the various manufacturers in this case.
The random effects ANOVA model loneway assumes that we observe a variable Yij measured for r_, elements within k groups or classes such that Yij
::=
,tZ+ Ct" i -I-
6ij,
i = 1,!2,...,
k.
3 = 1.2 .....
ni
and %. and _]ij are independent zero-mean randon3 variables with variance a,] and cr2, respectively. This is the random-effects ANOVAmodel, also kno_'n as the components of variance model, in which it is t}_picall31assumed thak the Yij are normally d_stributed.
The interpretation with respect to our example is that the observed value of our response variable, mpg, is created in two steps. First, the ith manufacturer is chosen and a value alpha_i is determined -- the typical mpg for that manufacturer less the overall mpg mu. Then a deviation, eps_ij, is chosen for the jth model within this manufacturer. This is how much that particular automobile differs from the typical mpg value for models from this manufacturer. For our sample of 36 car models, the estimated standard deviations are sigma_alpha = 3.8 and sigma_eps = 4.6. Thus, a little more than half of the variation in mpg between cars is attributable to the car model, with the rest attributable to differences between manufacturers. These standard deviations differ from those that would be produced by a (standard) fixed-effects regression in that the regression would require the sum within each manufacturer of the eps_ij, e_i. for the ith manufacturer, to be zero, while these estimates merely impose the constraint that the sum is expected to be zero.
Intraclass correlation

There are various estimators of the intraclass correlation, such as the pairwise estimator, which is defined as the Pearson product-moment correlation computed over all possible pairs of observations that can be constructed within groups. For a discussion of various estimators, see Donner (1986). loneway computes what is termed the analysis of variance, or ANOVA, estimator. This intraclass correlation is the theoretical upper bound on the variation in response_var that is explainable by group_var, of which R-squared is an overestimate because of the serendipity of fitting. Note that this correlation is comparable to an R-squared; you do not have to square it.

In our example, the intra-manufacturer correlation, the correlation of mpg within manufacturer, is 0.40. Since aweights were not used and the default correlation was computed, i.e., the mean and median options were not specified, loneway also provided the asymptotic confidence interval and standard error of the intraclass correlation estimate.
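For concreteness, the pairwise estimator mentioned above can be sketched outside Stata; this Python fragment (toy data invented for illustration) forms every ordered within-group pair and computes the Pearson correlation over the pairs:

```python
from itertools import permutations

# Hypothetical toy data to illustrate the *pairwise* estimator described
# above (loneway itself computes the ANOVA estimator, not this one).
groups = [[14, 17, 18], [22, 25, 24], [30, 28, 33]]

# All ordered within-group pairs (each unordered pair appears both ways)
pairs = [(a, b) for g in groups for a, b in permutations(g, 2)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
n = len(pairs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in pairs) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
r_pairwise = cov / (sx * sy)
print(round(r_pairwise, 4))   # high here because the toy groups are well separated
```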
Estimated reliability of the group-averaged score

The estimated reliability of the group-averaged score or mean has an interpretation similar to that of the intragroup correlation; it is a comparable number if we average response_var by group_var, or mpg by manufacturer in our example. It is the theoretical upper bound of a regression of manufacturer-averaged mpg on characteristics of manufacturers. Why would we want to collapse our 36-observation dataset into a 9-observation dataset of manufacturer averages? Because the 36 observations might be a mirage. When General Motors builds cars, do they sometimes put a Pontiac label and sometimes a Chevrolet label on them, so that it appears in our data as if we have two cars when we really have only one, replicated? If that were the case, and if it were the case for many other manufacturers, then we would be forced to admit that we do not have data on 36 cars; we instead have data on 9 manufacturer-averaged characteristics.
Saved Results

loneway saves in r():

Scalars
    r(N)        number of observations
    r(rho)      intraclass correlation
    r(lb)       lower bound of 95% CI for rho
    r(ub)       upper bound of 95% CI for rho
    r(rho_t)    estimated reliability
    r(se)       asymp. SE of intraclass correlation
    r(sd_w)     estimated SD within group
    r(sd_b)     estimated SD of group effect
Methods and Formulas

loneway is implemented as an ado-file.

The mean squares in loneway's ANOVA table are computed as follows:
    MS_alpha = sum_i w_i. (ybar_i. - ybar..)^2 / (k - 1)

and

    MS_eps = sum_i sum_j w_ij (y_ij - ybar_i.)^2 / (N - k)

in which

    w_i. = sum_j w_ij        w.. = sum_i w_i.
    ybar_i. = sum_j w_ij y_ij / w_i.        ybar.. = sum_i w_i. ybar_i. / w..

The corresponding expected values of these mean squares are

    E(MS_alpha) = sigma_eps^2 + g sigma_alpha^2        and        E(MS_eps) = sigma_eps^2

in which

    g = ( w.. - sum_i w_i.^2 / w.. ) / (k - 1)
Note that in the unweighted case, we get

    g = ( N - sum_i n_i^2 / N ) / (k - 1)
As expected, g = m for the case of no weights and equal group sizes in the data, i.e., n_i = m for all i. Replacing the expected values with the observed values and solving yields the ANOVA estimates of sigma_alpha^2 and sigma_eps^2. Substituting these into the definition of the intraclass correlation

    rho = sigma_alpha^2 / (sigma_alpha^2 + sigma_eps^2)

yields the ANOVA estimator of the intraclass correlation:

    rho_A = (F_obs - 1) / (F_obs - 1 + g)
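As a check of these formulas (an illustrative Python sketch, not Stata code), plugging in the sums of squares from the loneway output shown earlier reproduces the reported intraclass correlation:

```python
# Numbers taken from the loneway example output above:
# 9 manufacturers, 4 models each (N = 36, k = 9).
N, k, n_i = 36, 9, 4

MS_alpha = 621.88889 / (k - 1)      # between mean square, ~77.736111
MS_eps   = 567.75    / (N - k)      # within mean square,  ~21.027778
F_obs    = MS_alpha / MS_eps        # about 3.70, as in the table

# Unweighted case: g = (N - sum(n_i^2)/N) / (k - 1); equal groups give g = n_i
g = (N - k * n_i**2 / N) / (k - 1)

rho_A = (F_obs - 1) / (F_obs - 1 + g)
print(round(rho_A, 5))              # about 0.40270, matching the output
```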
Note that F_obs is the observed value of the F statistic from the ANOVA table. For the case of no weights and equal n_i, rho_A = roh, the intragroup correlation defined by Kish (1965). Two slightly different estimators are available through the mean and median options (Gleason 1997). If either of these options is specified, the estimate of rho becomes

    rho = (F_obs - F_m) / { F_obs + (g - 1) F_m }

For the mean option, F_m = E(F_{k-1,N-k}) = (N - k)/(N - k - 2), i.e., the expected value of the ANOVA table's F statistic. For the median option, F_m is simply the median of the F statistic. Note that setting F_m to 1 gives rho_A, so for large samples these different point estimators are essentially the same. Also, since the intraclass correlation of the random-effects model is by definition nonnegative, for any of the three possible point estimators, rho is truncated to zero if F_obs is less than F_m.
For the case of no weighting, interval estimators for rho_A are computed. If the groups are equal-sized (all n_i equal) and the exact option is specified, the following (under the assumption that the y_ij are normally distributed) 100(1 - a)% confidence interval is computed:

    [ (F_obs - F_m F_u) / { F_obs + (g - 1) F_m F_u } ,  (F_obs - F_m F_l) / { F_obs + (g - 1) F_m F_l } ]

with F_m = 1, F_l = F_{a/2, k-1, N-k}, and F_u = F_{1-a/2, k-1, N-k}, F_{., k-1, N-k} being the cumulative distribution function for the F distribution with k - 1 and N - k degrees of freedom. Note that if mean or median is specified, F_m is defined as above. If the groups are equal-sized and exact is not specified, the following asymptotic 100(1 - a)% confidence interval for rho_A is computed:

    [ rho_A - z_{a/2} sqrt{V(rho_A)} ,  rho_A + z_{a/2} sqrt{V(rho_A)} ]

where z_{a/2} is the 100(1 - a/2) percentile of the standard normal distribution and sqrt{V(rho_A)} is the asymptotic standard error of rho defined below. Note that this confidence interval is also available for the case of unequal groups. It is not applicable, and therefore not computed, for the estimates of rho provided by the mean and median options. Again, since the intraclass coefficient is nonnegative, if the lower bound is negative for either confidence interval, it is truncated to zero. As might be expected, the coverage probability of a truncated interval is higher than its nominal value.

The asymptotic variance of rho_A, assuming the y_ij are normally distributed, is also computed when appropriate, namely, for unweighted data and when rho_A is computed (neither the mean nor the median options are specified):
    V(rho_A) = [ 2 (1 - rho)^2 / g^2 ] (A + B + C)

with

    A = { 1 + rho (g - 1) }^2 / (N - k)

    B = (1 - rho) { 1 + rho (2g - 1) } / (k - 1)

    C = rho^2 { sum_i n_i^2 - 2 N^{-1} sum_i n_i^3 + N^{-2} (sum_i n_i^2)^2 } / (k - 1)^2
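Again as a numerical check (illustrative Python, not Stata), evaluating V(rho_A) at the example's values, nine groups of four with rho_A = 0.40270, reproduces the reported asymptotic standard error and confidence interval:

```python
# Values from the loneway example above: N = 36, k = 9, g = 4, rho_A = 0.40270
N, k, g = 36, 9, 4.0
n = [4] * k                    # nine equal-sized groups
rho = 0.40270                  # the ANOVA point estimate rho_A

s2 = sum(v**2 for v in n)
s3 = sum(v**3 for v in n)
A = (1 + rho * (g - 1)) ** 2 / (N - k)
B = (1 - rho) * (1 + rho * (2 * g - 1)) / (k - 1)
C = rho**2 * (s2 - 2 * s3 / N + (s2 / N) ** 2) / (k - 1) ** 2
V = 2 * (1 - rho) ** 2 / g**2 * (A + B + C)
se = V ** 0.5                  # about 0.18770, as reported

z = 1.959964                   # 97.5th percentile of the standard normal
lo, hi = max(rho - z * se, 0), rho + z * se
print(round(se, 5), round(lo, 5), round(hi, 5))
```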
and rho_A is substituted for rho (Donner 1986).

The estimated reliability of the group-averaged score, known as the Spearman-Brown prediction formula in the psychometric literature (Winer, Brown, and Michels 1991, 1014), is

    rho_t = t rho / { 1 + (t - 1) rho }

for group size t. loneway computes rho_t for t = g.

The estimated standard deviation of the group effect is sigma_alpha = sqrt{(MS_alpha - MS_eps)/g}. This comes from the assumption that an observation is derived by adding a group effect to a within-group effect. The estimated standard deviation within group is the square root of the mean square due to error, sqrt{MS_eps}.
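These last formulas can also be verified against the example output (an illustrative Python sketch; the MS values are read off the ANOVA table shown earlier):

```python
# Mean squares and g from the loneway example above; t = g = 4.
MS_alpha, MS_eps, g = 77.736111, 21.027778, 4.0

sd_between = ((MS_alpha - MS_eps) / g) ** 0.5   # SD of group effect, ~3.765247
sd_within  = MS_eps ** 0.5                      # SD within group,    ~4.585605

rho = 0.40270                                   # ANOVA intraclass estimate
t = g
rho_t = t * rho / (1 + (t - 1) * rho)           # Spearman-Brown, ~0.7295
print(round(sd_between, 6), round(sd_within, 6), round(rho_t, 4))
```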
Acknowledgment

We would like to thank John Gleason of Syracuse University for his contributions to this improved version of loneway.
References

Donner, A. 1986. A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. International Statistical Review 54: 67-82.

Gleason, J. R. 1997. sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25-31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 167-176.

Kish, L. 1965. Survey Sampling. New York: John Wiley & Sons.

Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3d ed. New York: McGraw-Hill.
Also See

Related:  [R] oneway
lorenz -- Inequality measures
Remarks

Stata should have commands for inequality measures but, at the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands, many of which have been published in the Stata Technical Bulletin (STB).
Issue    insert    author(s)                       command               description
------------------------------------------------------------------------------------------------
STB-48   gr35      N. J. Cox                       psm, qsm,             Diagnostic plots for assessing Singh-Maddala
                                                   pdagum, qdagum        and Dagum distributions fitted by MLE
STB-23   sg30      E. Whitehouse                   lorenz, inequal,      Measures of inequality in Stata
                                                   atkinson, relsgini
STB-23   sg31      R. Goldstein                    rspread               Measures of diversity: absolute and relative
STB-48   sg104     S. P. Jenkins                   sumdist, xfrac,       Analysis of income distributions
                                                   ineqdeco, geivars,
                                                   ineqfac, povdeco
STB-48   sg106     S. P. Jenkins                   smfit, dagumfit       Fitting Singh-Maddala and Dagum
                                                                         distributions by maximum likelihood
STB-48   sg107     S. P. Jenkins, P. Van Kerm      glcurve               Generalized Lorenz curves and related graphs
STB-49   sg107.1   S. P. Jenkins, P. Van Kerm      glcurve               update; install this version
STB-48   sg108     P. Van Kerm                     poverty               Computing poverty indices
STB-51   sg115     D. Jolliffe, B. Krushelnytskyy  ineqerr               Bootstrap standard errors for indices
                                                                         of inequality
STB-51   sg117     D. Jolliffe, A. Semykina        sepov                 Robust standard errors for the Foster-Greer-
                                                                         Thorbecke class of poverty indices

Additional commands may be available; enter Stata and type search inequality measures.
To download and install the Jenkins sumdist command from the Internet, for instance, you could

1. Pull down Help and select STB and User-written Programs.
2. Click on http://www.stata.com.
3. Click on stb.
4. Click on stb48.
5. Click on sg104.
6. Click on click here to install.

or you could instead do the following:
1. Navigate to the appropriate STB issue:
   a. Type net from http://www.stata.com
      Type net cd stb
      Type net cd stb48
   or
   b. Type net from http://www.stata.com/stb/stb48
2. Type net describe sg104
3. Type net install sg104
References

Cox, N. J. 1999. gr35: Diagnostic plots for assessing Singh-Maddala and Dagum distributions fitted by MLE. Stata Technical Bulletin 48: 2-4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 72-74.

Goldstein, R. 1995. sg31: Measures of diversity: absolute and relative. Stata Technical Bulletin 23: 23-26. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 150-154.

Jenkins, S. P. 1999a. sg104: Analysis of income distributions. Stata Technical Bulletin 48: 4-18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 243-260.

------. 1999b. sg106: Fitting Singh-Maddala and Dagum distributions by maximum likelihood. Stata Technical Bulletin 48: 19-25. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 261-268.

Jenkins, S. P. and P. Van Kerm. 1999a. sg107: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 48: 25-29. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 269-274.

------. 1999b. sg107.1: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 49: 23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 171.

Jolliffe, D. and B. Krushelnytskyy. 1999. sg115: Bootstrap standard errors for indices of inequality. Stata Technical Bulletin 51: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 191-196.

Jolliffe, D. and A. Semykina. 1999. sg117: Robust standard errors for the Foster-Greer-Thorbecke class of poverty indices. Stata Technical Bulletin 51: 34-36. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 200-203.

Van Kerm, P. 1999. sg108: Computing poverty indices. Stata Technical Bulletin 48: 29-33. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 274-278.

Whitehouse, E. 1995. sg30: Measures of inequality in Stata. Stata Technical Bulletin 23: 20-23. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 146-150.
lrtest -- Likelihood-ratio test after model estimation
Syntax

    lrtest [, saving(name) using(name) model(name) df(#) ]

where name may be a name or a number but may not exceed 4 characters.
Description

lrtest saves information about and performs likelihood-ratio tests between pairs of maximum likelihood models such as those estimated by cox, logit, logistic, poisson, etc. lrtest may be used with any estimation command that reports a log-likelihood value or, equivalently, displays output like that described in [R] maximize.

lrtest, typed without arguments, performs a likelihood-ratio test of the most recently estimated model against the model previously saved by lrtest, saving(0). It is your responsibility to ensure that the most recently estimated model is nested within the previously saved model.

lrtest provides an important alternative to test for maximum likelihood models.
Options

saving(name) specifies that the summary statistics associated with the most recently estimated model are to be saved as name. If no other options are specified, the statistics are saved and no test is performed. The larger model is typically saved by typing lrtest, saving(0).

using(name) specifies the name of the larger model against which a model is to be tested. If this option is not specified, using(0) is assumed.

model(name) specifies the name of the nested model (a constrained model) to be tested. If not specified, the most recently estimated model is used.

df(#) is seldom specified; it overrides the automatic degrees-of-freedom calculation.
Remarks

The standard use of lrtest is

1. Estimate the larger model using one of Stata's estimation commands and then type lrtest, saving(0).
2. Estimate an alternative, nested model (a constrained model) and then type lrtest.
> Example

You have data on infants born with low birth weights along with characteristics of the mother (Hosmer and Lemeshow 1989; more fully described in [R] logistic). You estimate the following model:
. logistic low age lwt race2 race3 smoke ptl ht ui

Logit estimates                               Number of obs =        189
                                              LR chi2(8)    =      33.22
                                              Prob > chi2   =     0.0001
Log likelihood = -100.724                     Pseudo R2     =     0.1416

     low   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
     age     .9732636    .0354759   -0.74   0.457    .9061578    1.045339
     lwt     .9849634    .0068217   -2.19   0.029    .9716834    .9984249
   race2     3.534767    1.860737    2.40   0.016    1.259736    9.918406
   race3     2.368079    1.039949    1.96   0.050    1.001356    5.600207
   smoke     2.517698     1.00916    2.30   0.021    1.147676    5.523162
     ptl     1.719161    .5952579    1.56   0.118    .8721455    3.388787
      ht     6.249602    4.322408    2.65   0.008    1.611152    24.24199
      ui       2.1351    .9808153    1.65   0.099    .8677528      5.2534
You now wish to test the constraint that the coefficients on age, lwt, ptl, and ht are all zero (or, equivalently in this case, that the odds ratios are all 1). One solution is

. test age lwt ptl ht
 ( 1)  age = 0.0
 ( 2)  lwt = 0.0
 ( 3)  ptl = 0.0
 ( 4)  ht = 0.0

           chi2(  4) =    12.38
         Prob > chi2 =   0.0147
This test is based on the inverse of the information matrix and is therefore based on a quadratic approximation to the likelihood function; see [R] test. A more precise test would be to re-estimate the model, applying the proposed constraints, and then calculate the likelihood-ratio test. lrtest assists you in doing this.

You first save the statistics associated with the current model:

. lrtest, saving(0)
The "name" 0 was not chosen arbitrarily, although we could have chosen any name. Why we chose 0 will become clear shortly. Having saved the information on the current model, we now estimate the constrained model, which in this case is the model omitting age, lwt, ptl, and ht:

. logistic low race2 race3 smoke ui
Logit estimates                               Number of obs =        189
                                              LR chi2(4)    =      18.80
                                              Prob > chi2   =     0.0009
Log likelihood = -107.93404                   Pseudo R2     =     0.0801

     low   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
   race2     3.052746    1.498084    2.27   0.023    1.166749    7.987368
   race3     2.922593    1.189226    2.64   0.008     1.31646    6.488269
   smoke     2.945742    1.101835    2.89   0.004     1.41517    6.131701
      ui     2.419131    1.047358    2.04   0.041     1.03546    5.651783
That done, typing lrtest will compare this model with the model we previously saved:

. lrtest
Logistic:  likelihood-ratio test               chi2(4)     =      14.42
                                               Prob > chi2 =     0.0061
The more precise syntax for the test is lrtest, using(0), meaning that the current model is to be compared with the model saved as 0. The name 0, as we previously said, is special: when you do not specify the name of the using() model, using(0) is assumed. Thus, saving the original model as 0 saved us some typing when we performed the test.
Comparing results, test reported that age, lwt, ptl, and ht were jointly significant at the 1.5% level; lrtest reports that they are significant at the 0.6% level. lrtest's results should be viewed as more accurate.
> Example

Typing lrtest, saving(0) and later lrtest by itself is the way lrtest is most commonly used, although here is how we might use the other options:

. logit chd age age2 sex          (estimate full model)
. lrtest, saving(0)               (save results)
. logit chd age sex               (estimate simpler model)
. lrtest                          (obtain test)
. lrtest, saving(1)               (save logit results as 1)
. logit chd sex                   (estimate simplest model)
. lrtest                          (compare with full model)
. lrtest, using(1)                (compare with model 1)
. lrtest, model(1)                (repeat test against full model)
> Example

Returning to the low birth weight data in the first example, you now wish to test that the coefficient on race2 is equal to that on race3. The base model is still stored in 0, so you need only estimate the constrained model and perform the test. Letting z be the index of the logit, the base model is

    z = beta_0 + beta_1 age + beta_2 lwt + beta_3 race2 + beta_4 race3 + ...

If beta_3 = beta_4, this can be written

    z = beta_0 + beta_1 age + beta_2 lwt + beta_3 (race2 + race3) + ...

To estimate the constrained model, we create a variable equal to the sum of race2 and race3 and estimate the model including that variable in their place:
. gen race23 = race2 + race3
. logistic low age lwt race23 smoke ptl ht ui

Logit estimates                               Number of obs =        189
                                              LR chi2(7)    =      32.67
                                              Prob > chi2   =     0.0000
Log likelihood = -100.9997                    Pseudo R2     =     0.1392

     low   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
     age     .9716799    .0352638   -0.79   0.429    .9049649    1.043313
     lwt     .9864971    .0064627   -2.08   0.038    .9739114    .9992453
  race23     2.728186    1.080206    2.53   0.011    1.255586    5.927907
   smoke     2.664498    1.052379    2.48   0.013    1.228633    5.778414
     ptl     1.709129    .5924775    1.55   0.122    .8663666    3.371691
      ht     6.116391    4.215585    2.63   0.009     1.58425    23.61385
      ui      2.09936    .9699702    1.61   0.108    .8487997    5.192407
Comparing this model with our original model, we obtain

. lrtest
Logistic:  likelihood-ratio test               chi2(1)     =       0.55
                                               Prob > chi2 =     0.4577

By comparison, typing test race2=race3 after estimating our base model results in a significance level of .4572.
Saved Results

lrtest saves in r():

Scalars
    r(p)       two-sided p-value
    r(df)      degrees of freedom
    r(chi2)    chi-squared

Programmers desiring that an estimation command be compatible with lrtest should note that lrtest requires the following macros to be defined:

    e(cmd)     name of estimation command
    e(ll)      log-likelihood value
    e(df_m)    model degrees of freedom
    e(N)       number of observations
Methods and Formulas

lrtest is implemented as an ado-file.

Let L0 and L1 be the log-likelihood values associated with the full and constrained models, respectively. Then chi2 = -2(L1 - L0) with d0 - d1 degrees of freedom, where d0 and d1 are the model degrees of freedom associated with the full and constrained models (Judge et al. 1985, 216-217).
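As an illustrative check (Python, not Stata), the chi-squared statistic from the first example can be recomputed from the two log likelihoods:

```python
import math

# Log likelihoods from the example above: full model -100.724,
# constrained model -107.93404, with 8 and 4 model degrees of freedom.
L0, L1 = -100.724, -107.93404
d0, d1 = 8, 4
chi2 = -2 * (L1 - L0)          # 14.42 to two decimals
df = d0 - d1                   # 4

# For df = 4 the chi-squared survival function has the closed form
# P(X > x) = exp(-x/2) * (1 + x/2), which avoids needing scipy here.
p = math.exp(-chi2 / 2) * (1 + chi2 / 2)
print(round(chi2, 2), df, round(p, 4))   # matches the lrtest output above
```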
References

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Pérez-Hoyos, S. and A. Tobias. 1999. sg111: A modified likelihood-ratio test command. Stata Technical Bulletin 49: 24-25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 171-173.

Wang, Z. 2000. sg133: Sequential and drop one term likelihood-ratio tests. Stata Technical Bulletin 54: 46-47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 332-334.
Also See

Related:  [R] estimation commands, [R] linktest, [R] test, [R] testnl
ltable -- Life tables for survival data
Syntax

    ltable timevar [deadvar] [weight] [if exp] [in range] [, by(groupvar) level(#)
        survival failure hazard intervals(interval) test tvid(varname) noadjust
        notab graph graph_options noconf ]

fweights are allowed; see [U] 14.1.6 weight.
Description

ltable displays and graphs life tables for individual-level or aggregate data and optionally presents the likelihood-ratio and log-rank tests for equivalence of groups. ltable also allows examining the empirical hazard function through aggregation. Also see [R] st sts for alternative commands.

timevar specifies the time of failure or censoring. If deadvar is not specified, all values of timevar are interpreted as failure times; otherwise, timevar is interpreted as a failure time where deadvar != 0 and as a censoring time otherwise. Observations with timevar or deadvar equal to missing are ignored.

Note carefully that deadvar does not specify the number of failures. An observation with deadvar equal to 1 or 50 has the same interpretation: the observation records one failure. Specify frequency weights for aggregated data (e.g., ltable time [freq=number]).
Options

by(groupvar) creates separate tables (or graphs within the same image) for each value of groupvar; groupvar may be string or numeric.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [R] level.

survival, failure, and hazard indicate the table to be displayed. If not specified, the default is the survival table. Specifying failure would display the cumulative failure table. Specifying survival failure would display both the survival and the cumulative failure table. If graph is specified, multiple tables may not be requested.

intervals(interval) specifies the time intervals into which the data are to be aggregated for tabular presentation. A single numeric argument is interpreted as the width of the interval. For instance, interval(2) aggregates data into the time intervals 0 <= t < 2, 2 <= t < 4, and so on. Not specifying interval() is equivalent to specifying interval(1). Since in most data, failure times are recorded as integers, this amounts to no aggregation except that implied by the recording of the time variable and so produces Kaplan-Meier product-limit estimates of the survival curve (with an actuarial adjustment; see the noadjust option below). Also see [R] st sts list. Although it is possible to examine survival and failure without aggregation, some form of aggregation is almost always required for examining the hazard.
When more than one argument is specified, time intervals are aggregated as specified. For instance, interval(0,2,8,16) aggregates data into the intervals 0 <= t < 2, 2 <= t < 8, 8 <= t < 16, and (if necessary) the open-ended interval t >= 16.

interval(w) is equivalent to interval(0,7,15,30,60,90,180,360,540,720), corresponding to one week, (roughly) two weeks, one month, two months, three months, six months, 1 year, 1.5 years, and 2 years when failure times are recorded in days. The w suggests widening intervals.
presents two X2 measures of the differences does nothing if by () is not specified.
between groups when by()
is specified, test
tvid(varname) is for use with longitudinal data with time-varying parameters as processed by cox; see [R] cox. Each subject appears in the data more than once and equal values of varname identify observations referring to the same subject. When tvid() is specified, only the last observation on each subject is used in making the table. The order of the data does not matter, and "last" here means the last observation chronologically.

noadjust suppresses the actuarial adjustment for deaths and censored observations. The default is to consider the adjusted number at risk at the start of the interval as the total at the start minus (the number of censored)/2. If noadjust is specified, the number at risk is simply the total at the start, corresponding to the standard Kaplan and Meier assumption. noadjust should be specified when using ltable to list results corresponding to those produced by sts list; see [R] st sts list.

notab suppresses displaying the table. This option is often used with graph.

graph requests that the table be presented graphically as well as in tabular form; when notab is also specified, only the graph is presented. When specifying graph, only one table can be calculated and graphed at a time; see survival, failure, and hazard above.

graph_options are any of the options allowed with graph, twoway; see [G] graph options. When noconf is specified, twoway's connect() and symbol() options may be specified with one argument; the default is connect(l) symbol(O). When noconf is not specified, the connect() and symbol() options may be specified with one or three arguments. The default is connect(l||) and symbol(Oii), drawing the confidence band as vertical lines at each point. When you specify one argument, you modify the first argument of the default. When you specify three, you completely control the graph. Thus, connect(lll) would draw the confidence band as separate curves around the survival, failure, or hazard.

noconf suppresses graphing the confidence intervals around survival, failure, or hazard.
Remarks

Life tables date back to the seventeenth century; Edmund Halley (1693) is often credited with their development. ltable is for use with "cohort" data and, although one often thinks of such tables as following a population from the "birth" of the first member to the "death" of the last, more generally, such tables can be thought of as a reasonable way to list any kind of survival data.

For an introductory discussion of life tables, see Pagano and Gauvreau (2000, 489-495); for an intermediate discussion, see, for example, Armitage and Berry (1994, 470-477) or Selvin (1996, 311-355); and for a more complete discussion, see Chiang (1984).
> Example

In Pike (1966), two groups of rats were exposed to a carcinogen and the number of days to death from vaginal cancer was recorded (reprinted in Kalbfleisch and Prentice 1980, 2):

Group 1   143  164  188  188  190  192  206  209  213  216  220  227  230  234
          246  265  304  216* 244*
Group 2   142  155  163  198  205  232  232  233  233  233  233  239  240  261
          280  280  296  296  323  204* 344*

The '*' on a few of the entries indicates that the observation was censored as of the recorded day; the rat had still not died due to vaginal cancer but was withdrawn from the experiment for other reasons.
Having entered these data into Stata, the first few observations are

. list in 1/5

          t   group   died
  1.    143       1      1
  2.    164       1      1
  3.    188       1      1
  4.    188       1      1
  5.    190       1      1
That is, the first observation records a rat from group 1 that died on the 143rd day. The variable died records whether that rat died or was withdrawn (censored):

. list if died==0

          t   group   died
 18.    216       1      0
 19.    244       1      0
 39.    204       2      0
 40.    344       2      0
Four rats, two from each group, did not die but were withdrawn. The survival table for group 1 is

. ltable t died if group==1

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   143   144      19       1       0     0.9474    0.0512    0.6812    0.9924
   164   165      18       1       0     0.8947    0.0704    0.6408    0.9726
   188   189      17       2       0     0.7895    0.0935    0.5319    0.9153
   190   191      15       1       0     0.7368    0.1010    0.4789    0.8810
   192   193      14       1       0     0.6842    0.1066    0.4279    0.8439
   206   207      13       1       0     0.6316    0.1107    0.3790    0.8044
   209   210      12       1       0     0.5789    0.1133    0.3321    0.7626
   213   214      11       1       0     0.5263    0.1145    0.2872    0.7188
   216   217      10       1       1     0.4709    0.1151    0.2410    0.6713
   220   221       8       1       0     0.4120    0.1148    0.1937    0.6194
   227   228       7       1       0     0.3532    0.1125    0.1502    0.5648
   230   231       6       1       0     0.2943    0.1080    0.1105    0.5070
   234   235       5       1       0     0.2355    0.1012    0.0751    0.4259
   244   245       4       0       1     0.2355    0.1012    0.0751    0.4459
   246   247       3       1       0     0.1570    0.0931    0.0312    0.3721
   265   266       2       1       0     0.0785    0.0724    0.0056    0.2864
   304   305       1       1       0     0.0000
The reported survival rates are the survival rates at the end of the interval. Thus, 94.7% of rats survived 144 days or more.

> Example

When you do not specify the intervals, ltable uses unit intervals. The only aggregation performed on the data was aggregation due to deaths or withdrawals occurring on the same "day". If we wanted to see the table aggregated into 30-day intervals, we would type

. ltable t died if group==1, interval(30)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   120   150      19       1       0     0.9474    0.0512    0.6812    0.9924
   150   180      18       1       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240   270       4       2       1     0.1063    0.0786    0.0139    0.3090
   300   330       1       1       0     0.0000

The interval printed 120 150 means the interval including 120 and up to, but not including, 150. The reported survival rate is the survival rate just after the close of the interval. When you specify more than one number as the argument to interval(), you specify not the widths but the cutoff points themselves.
. ltable t died if group==1, interval(120,180,210,240,330)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   120   180      19       2       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240   330       4       3       1     0.0354    0.0486    0.0006    0.2245
If any of the underlying failure or censoring times are larger than the last cutoff specified, they are treated as being in the open-ended interval:

. ltable t died if group==1, interval(120,180,210,240)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   120   180      19       2       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240    .        4       3       1     0.0354    0.0486    0.0006    0.2245

Whether the last interval is treated as open-ended or not makes no difference for survival and failure tables, but it does affect hazard tables. If the interval is open-ended, the hazard is not calculated for it.
> Example

The by(varname) option specifies that separate tables are to be presented for each value of varname. Remember that our rat dataset contains two groups:

. ltable t died, by(group) interval(30)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
group = 1
   120   150      19       1       0     0.9474    0.0512    0.6812    0.9924
   150   180      18       1       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240   270       4       2       1     0.1063    0.0786    0.0139    0.3090
   300   330       1       1       0     0.0000
group = 2
   120   150      21       1       0     0.9524    0.0465    0.7072    0.9932
   150   180      20       2       0     0.8571    0.0764    0.6197    0.9516
   180   210      18       2       1     0.7592    0.0939    0.5146    0.8920
   210   240      15       7       0     0.4049    0.1099    0.1963    0.6053
   240   270       8       2       0     0.3037    0.1031    0.1245    0.5057
   270   300       6       4       0     0.1012    0.0678    0.0172    0.2749
   300   330       2       1       0     0.0506    0.0493    0.0035    0.2073
   330   360       1       0       1     0.0506    0.0493    0.0035    0.2073
> Example

A failure table is simply a different way of looking at a survival table: failure is 1 - survival:

. ltable t died if group==1, interval(30) failure

                  Beg.                    Cum.      Std.
     Interval     Total   Deaths   Lost  Failure    Error    [95% Conf. Int.]

   120    150        19        1      0    0.0526  0.0512     0.0076   0.3188
   150    180        18        1      0    0.1053  0.0704     0.0274   0.3592
   180    210        17        6      0    0.4211  0.1133     0.2374   0.6679
   210    240        11        6      1    0.7519  0.1009     0.5448   0.9153
   240    270         4        2      1    0.8937  0.0786     0.6910   0.9861
   300    330         1        1      0    1.0000       .          .        .
q
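The failure column above can be reproduced by hand. The following Python sketch (not Stata code) assumes ltable's actuarial convention, in which patients censored during an interval count as at risk for half of it:

```python
# Sketch of the cumulative-failure calculation behind the table above.
# Assumption: within an interval, the effective number at risk is
# (beginning total) - (lost)/2, the standard actuarial adjustment.

def cum_failure(rows):
    """rows: (beg_total, deaths, lost) per interval.
    Returns the cumulative failure (1 - survival) after each interval."""
    survival, out = 1.0, []
    for beg, deaths, lost in rows:
        at_risk = beg - lost / 2.0          # actuarial censoring adjustment
        survival *= 1.0 - deaths / at_risk  # conditional survival this interval
        out.append(round(1.0 - survival, 4))
    return out

# group==1 rat data, interval(30), taken from the table above
rows = [(19, 1, 0), (18, 1, 0), (17, 6, 0), (11, 6, 1), (4, 2, 1), (1, 1, 0)]
print(cum_failure(rows))  # [0.0526, 0.1053, 0.4211, 0.7519, 0.8937, 1.0]
```

The printed values match the Cum. Failure column of the table.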
> Example

Selvin (1996, 332) presents follow-up data from Cutler and Ederer (1958) on six cohorts of kidney cancer patients. The goal is to estimate the 5-year survival probability.

                                       With-
  Year   Interval   Alive  Deaths  Lost  drawn
  1946   0-1            9       4     1
         1-2            4       0     0
         2-3            4       0     0
         3-4            4       0     0
         4-5            4       0     0
         5-6            4       0     0      4
  1947   0-1           18       7     0
         1-2           11       0     0
         2-3           11       1     0
         3-4           10       2     2
         4-5            6       0     0      6
  1948   0-1           21      11     0
         1-2           10       1     2
         2-3            7       0     0
         3-4            7       0     0      7
  1949   0-1           34      12     0
         1-2           22       3     3
         2-3           16       1     0     15
  1950   0-1           19       5     1
         1-2           13       1     1     11
  1951   0-1           25       8     2     15

The following is the Stata dataset corresponding to the table:

. list

        year     t   died   pop
   1.   1946    .5      1     4
   2.   1946    .5      0     1
   3.   1946   5.5      0     4
   4.   1947    .5      1     7
   5.   1947   2.5      1     1
  etc.
As summary data may often come in the form shown above, it is worth understanding exactly how the data were translated for use with ltable. t records the time of death or censoring (lost to follow-up or withdrawal), died contains 1 if the observation records a death and 0 if it instead records lost or withdrawn patients, and pop records the number of patients in the category. The first line of the table stated that, in the 1946 cohort, there were 9 patients at the start of the interval 0-1, and during the interval, 4 died and 1 was lost to follow-up. Thus, we entered in observation 1 that at t = .5, 4 patients died and, in observation 2, that at t = .5, 1 patient was censored. We ignored the information on the total population because ltable will figure that out for itself.
The second line of the table indicated that in the interval 1-2, 4 patients were still alive at the beginning of the interval and, during the interval, 0 died or were lost to follow-up. Since no patients died or were censored, we entered nothing into our data. Similarly, we entered nothing for lines 3, 4, and 5 of the table. The last line for 1946 stated that, in the interval 5-6, 4 patients were alive at the beginning of the interval and that those 4 patients were withdrawn. In observation 3, we entered that there were 4 censorings at t = 5.5.

The fact that we chose to record the times of deaths or censorings as midpoints of intervals does not matter; we could just as well have recorded the times as 0.8 and 5.8. By default, ltable will form intervals 0-1, 1-2, and so on, and place observations into the intervals to which they belong. We suggest using 0.5 and 5.5 because those numbers correspond to the underlying assumptions made by ltable in making its calculations. Using midpoints reminds you of the assumptions.
To obtain the survival rates, we type

. ltable t died [freq=pop]

                  Beg.                            Std.
     Interval     Total   Deaths   Lost  Survival   Error    [95% Conf. Int.]

     0      1       126       47     19    0.5966  0.0455     0.5017   0.6792
     1      2        60        5     17    0.5386  0.0479     0.4405   0.6269
     2      3        38        2     15    0.5033  0.0508     0.4002   0.5977
     3      4        21        2      9    0.4423  0.0602     0.3225   0.5554
     4      5        10        0      6    0.4423  0.0602     0.3225   0.5554
     5      6         4        0      4    0.4423  0.0602     0.3225   0.5554
We estimate the 5-year survival rate as .4423 and the 95% confidence interval as .3225 to .5554. Selvin (1996, 336), in presenting these results, lists the survival in the interval 0-1 as 1, in 1-2 as .597, in 2-3 as .539, and so on. That is, relative to us, he shifted the rates down one row and inserted a 1 in the first row. In his table, the survival rate is the survival rate at the start of the interval. In our table, the survival rate is the survival rate at the end of the interval (or, equivalently, at the start of the next interval). This is, of course, simply a difference in the way the numbers are presented and not in the numbers themselves.

q
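The cumulative survival column can be checked by hand. The following Python sketch (not Stata code) applies the actuarial (Cutler-Ederer) estimate, assuming the standard convention that patients censored during an interval are at risk for half of it:

```python
# A minimal sketch of the actuarial survival estimate behind the table above.
# Assumption: censored patients count as exposed for half the interval, so the
# effective denominator in each interval is (beg_total) - (censored)/2.

def life_table_survival(rows):
    """rows: (beg_total, deaths, censored) per interval.
    Returns cumulative survival at the end of each interval."""
    s, out = 1.0, []
    for beg, d, c in rows:
        s *= 1.0 - d / (beg - c / 2.0)
        out.append(round(s, 4))
    return out

# Aggregated kidney-cancer counts from the table above
rows = [(126, 47, 19), (60, 5, 17), (38, 2, 15), (21, 2, 9), (10, 0, 6), (4, 0, 4)]
print(life_table_survival(rows))
# [0.5966, 0.5386, 0.5033, 0.4423, 0.4423, 0.4423] -- 5-year survival .4423
```

The final entries reproduce the .4423 five-year survival estimate quoted above.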
> Example

The discrete hazard function is the rate of failure: the number of failures occurring within a time interval divided by the width of the interval (assuming there are no censored observations). While the survival and failure tables are meaningful at the "individual" level, with intervals so narrow that each contains only a single failure, that is not true for the discrete hazard. If all intervals contained one death and if all intervals were of equal width, the hazard function would be 1/Delta-t and so appear to be a constant!

The empirically determined discrete hazard function can only be revealed by aggregation. Gross and Clark (1975, 37) print data on malignant melanoma at the M. D. Anderson Tumor Clinic between 1944 and 1960. The interval is the time from initial diagnosis:
  Interval   Number   Number lost     Number
  (years)    dying    to follow-up    withdrawn alive
   0-1         312         19               77
   1-2          96          3               71
   2-3          45          4               58
   3-4          29          3               27
   4-5           7          5               35
   5-6           9          1               36
   6-7           3          0               17
   7-8           1          2               10
   8-9           3          0                8
   9+           32          0                0
For our statistical purposes, there is no difference between the number lost to follow-up (patients who disappeared) and the number withdrawn alive (patients dropped by the researchers); both are censored. We have entered the data into Stata; here is a small amount of it:

. list

         t   d   pop
   1.   .5   1   312
   2.   .5   0    19
   3.   .5   0    77
   4.  1.5   1    96
   5.  1.5   0     3
   6.  1.5   0    71
We entered each group's time of death or censoring as the midpoint of the intervals, recording d as 1 for deaths and 0 for censorings. The hazard table is

. ltable t d [freq=pop], hazard interval(0,1,2,3,4,5,6,7,8,9)
                  Beg.     Cum.      Std.             Std.
     Interval     Total   Failure    Error   Hazard   Error   [95% Conf. Int.]

     0      1       913    0.3607   0.0163   0.4401  0.0243    0.3924   0.4877
     1      2       505    0.4918   0.0176   0.2286  0.0232    0.1831   0.2740
     2      3       335    0.5671   0.0182   0.1599  0.0238    0.1133   0.2064
     3      4       228    0.6260   0.0188   0.1461  0.0271    0.0931   0.1991
     4      5       169    0.6436   0.0190   0.0481  0.0182    0.0125   0.0837
     5      6       122    0.6746   0.0200   0.0909  0.0303    0.0316   0.1502
     6      7        76    0.6890   0.0208   0.0455  0.0262    0.0000   0.0969
     7      8        56    0.6952   0.0213   0.0202  0.0202    0.0000   0.0598
     8      9        43    0.7187   0.0235   0.0800  0.0462    0.0000   0.1705
     9      .        32    1.0000        .        .       .         .        .
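The Hazard column can be checked by hand. The following Python sketch (not Stata code) assumes the within-interval hazard estimator commonly used with actuarial tables: deaths divided by person-intervals at risk, where both the censored observations and the deaths themselves are discounted by half an interval.

```python
# Hedged sketch of the interval hazard in the table above. Assumptions:
# censored observations count as at risk for half the interval, and deaths
# occur, on average, at the middle of the interval, so they too contribute
# half an interval of exposure.

def interval_hazard(beg, deaths, censored, width=1.0):
    at_risk = beg - censored / 2.0                     # actuarial adjustment
    return deaths / (width * (at_risk - deaths / 2.0))  # deaths per unit time

# First three intervals of the melanoma data (lost + withdrawn are pooled)
for beg, d, c in [(913, 312, 19 + 77), (505, 96, 3 + 71), (335, 45, 4 + 58)]:
    print(round(interval_hazard(beg, d, c), 4))  # 0.4401, 0.2286, 0.1599
```

The printed values reproduce the first three entries of the Hazard column.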
We specified the interval() option as we did, and not as interval(1) (or omitting the option altogether), to force the last interval to be open-ended. Had we not, and had we recorded t as 9.5 for observations in that interval (as we did), ltable would have calculated a hazard rate for the "interval". In this case, the result of that calculation would have been 2, but no matter what the result, it would have been meaningless since we do not know the width of the interval.

You are not limited to merely examining a column of numbers. With the graph option, you can see the result graphically:

. ltable t d [freq=pop], hazard i(0,1,2,3,4,5,6,7,8,9) graph notab
> xlab(0,2,4,6,8,10) border
(figure omitted: estimated hazard plotted against time (years), with 95% confidence intervals)
The vertical lines in the graph represent the 95% confidence intervals for the hazard; specifying noconf would have suppressed them. Among the options we did specify, although it is not required, notab suppressed printing the table, saving us some paper. xlab() and border were passed through to graph.

q
> Example

You can graph the survival function the same way you graph the hazard function: just omit the hazard option.
q
Methods and Formulas

ltable is implemented as an ado-file.

Let tau_i be the individual failure or censoring times. The data are aggregated into intervals given by t_j, j = 1, ..., J, and t_{J+1} = infinity, with each interval containing counts for t_j <= tau < t_{j+1}.

merge -- Merge datasets

> Example

You have two datasets stored on disk that you wish to merge into a single dataset. The first dataset, called odd.dta, contains the first five positive odd numbers. The second dataset, called even.dta, contains the fifth through eighth positive even numbers. (Our example is admittedly not realistic, but it does illustrate the concept.) The datasets are

. use odd
(First five odd numbers)
. list

        number   odd
   1.        1     1
   2.        2     3
   3.        3     5
   4.        4     7
   5.        5     9

. use even
(5th through 8th even numbers)
. list

        number   even
   1.        5    10
   2.        6    12
   3.        7    14
   4.        8    16
We will join these two datasets using a one-to-one merge. Since the even dataset is already in memory (we just used it above), we type merge using odd. The result is

. merge using odd
number was int now float
. list

        number   even   odd   _merge
   1.        5     10     1        3
   2.        6     12     3        3
   3.        7     14     5        3
   4.        8     16     7        3
   5.        5      .     9        2
The first thing you will notice is the new variable _merge. Every time Stata merges two datasets, it creates this variable and assigns a value of 1, 2, or 3 to each observation. The value 1 indicates that the resulting observation occurred only in the master dataset, 2 indicates the observation occurred only in the using dataset, and 3 indicates the observation occurred in both datasets and is thus the result of joining an observation from the master dataset with an observation from the using dataset. In this case, the first four observations are marked by _merge equal to 3, and the last observation by _merge equal to 2. The first four observations are the result of joining observations from the two datasets, and the last observation is the result of adding a new observation from the using dataset. These values reflect the fact that the original dataset in memory had four observations, and the odd dataset stored on disk had five observations. The new last observation is from the odd dataset exclusively: number is 5, odd is 9, and even has been filled in with missing.

Notice that number takes on the values 5 through 8 for the first four observations. Those are the values of number from the original dataset in memory (the even dataset) and conflict with the values of number stored in the first four observations of the odd dataset. number in that dataset took on the values 1 through 4, and those values were lost during the merge process. When Stata joins observations and there is a conflict between the value of a variable in memory and the value stored in the using dataset, Stata by default retains the value stored in memory.
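The one-to-one semantics just described can be sketched in Python (an illustration only, not Stata code): observations are joined by position, the master's values win on conflict, and _merge records where each observation came from.

```python
# Sketch of one-to-one merge semantics as described above. Assumption:
# observations are represented as dicts; None-padding and the _merge codes
# (1 = master only, 2 = using only, 3 = both) mimic the text's description.

def merge_one_to_one(master, using):
    """master, using: lists of dicts. Returns the merged list of dicts."""
    out = []
    for i in range(max(len(master), len(using))):
        m = master[i] if i < len(master) else None
        u = using[i] if i < len(using) else None
        row = dict(u or {})
        row.update(m or {})  # on conflict, master values take precedence
        row["_merge"] = 3 if m and u else (1 if m else 2)
        out.append(row)
    return out

even = [{"number": n, "even": 2 * n} for n in (5, 6, 7, 8)]
odd = [{"number": n, "odd": 2 * n - 1} for n in (1, 2, 3, 4, 5)]
result = merge_one_to_one(even, odd)
print(result)  # row 1 keeps number 5 from the master; row 5 has _merge 2
```

As in the listing above, the fifth row comes from the using data alone, and the master's conflicting number values survive in the first four rows.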
When the command merge using odd was issued, Stata responded with "number was int now float". Let's describe the datasets in this example:

. describe using odd
Contains data                    First five odd numbers
  obs:            5              5 Jul 2000 17:03
 vars:            2

              storage  display    value
variable name   type   format     label      variable label
number          float  %9.0g
odd             float  %9.0g                 Odd numbers

Sorted by:

. describe using even

Contains data                    5th through 8th even numbers
  obs:            4              11 Jul 2000 14:12
 vars:            2
 size:           40

              storage  display    value
variable name   type   format     label      variable label
number          int    %8.0g
even            float  %9.0g                 Even numbers

Sorted by:

Note that number is stored as a float in odd.dta but as an int in even.dta; see [U] 15.2.2 Numeric storage types. When you merge two datasets, Stata engages in automatic variable promotion; that is, if there are conflicts in numeric storage types, the more precise storage type is used. The resulting dataset, therefore, has number stored as a float, and Stata told you this when it said "number was int now float".
Match merge

In a match merge, observations are joined if the values of the variables in the varlist are the same. Since the values must be the same, obviously the variables in the varlist must appear in both the master and the using datasets.

A match merge proceeds by taking an observation from the master dataset and one from the using dataset and comparing the values of the variables in the varlist. If the varlist values match, the observations are joined. If the varlist values do not match, the observation from the earlier dataset (the dataset whose varlist value comes first in the sort order) is joined with a pseudo-observation from the later dataset (the other dataset). All the variables in the pseudo-observation contain missing values. The actual observation from the later dataset is retained and compared with the next observation in the earlier dataset, and the process repeats.
> Example

The result is not nearly so incomprehensible as the explanation. Let's return to the datasets used in the previous example and merge the two datasets on the variable number. We first use the even dataset and then type merge number using odd:

. use even
(5th through 8th even numbers)
. merge number using odd
master data not sorted
r(5);

Instead of merging the datasets, Stata reports the error message "master data not sorted". Match merges require that the data be sorted in the order of the varlist, which in this case means ascending order of number. If you look at the previous example, you will observe that the data are in such an order, so the message is more than a little confusing. Before Stata can merge two datasets, however, the data must not only be sorted but Stata must know that they are sorted. The basis of Stata's knowledge is the internal information it keeps on the sort order, and Stata reveals the extent of its knowledge whenever you describe the dataset:

. describe
Contains data from even.dta
  obs:            4              5th through 8th even numbers
 vars:            2              11 Jul 2000 14:12
 size:           40              (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
number          int    %8.0g
even            float  %9.0g                 Even numbers

Sorted by:
The last line of the description shows that the data are "Sorted by:" nothing. We tell Stata to sort the data (or to learn that it is already sorted) with the sort command:

. sort number
. describe

Contains data from even.dta
  obs:            4              5th through 8th even numbers
 vars:            2              11 Jul 2000 14:12
 size:           40              (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
number          int    %8.0g
even            float  %9.0g                 Even numbers

Sorted by:  number

Now when we describe the dataset, Stata informs us that the data are sorted by number. Now that Stata knows the data are sorted, let's try again:

. merge number using odd
using data not sorted
r(5);
Stata still refuses to carry out our request, this time complaining that the using data are not sorted. Both datasets, the master and the using, must be in ascending order of number before Stata can perform a merge.

As before, if you look at the previous example you will discover that odd.dta is in ascending order of number, but as before, Stata does not know this yet. We need to save the data we just sorted, use the odd data, sort it, and re-save it:

. save even, replace
file even.dta saved
. use odd
(First five odd numbers)
. sort number
. save odd, replace
file odd.dta saved
Now we should be able to merge the two datasets:

. use even
(5th through 8th even numbers)
. merge number using odd
number was int now float
. list

        number   even   odd   _merge
   1.        5     10     9        3
   2.        6     12     .        1
   3.        7     14     .        1
   4.        8     16     .        1
   5.        1      .     1        2
   6.        2      .     3        2
   7.        3      .     5        2
   8.        4      .     7        2
It worked! Let's understand what happened. Even though both datasets were sorted by number, we immediately discern that the result is no longer in ascending order of number. It will be easier to understand what happened if we re-sort the data and then list the data again:

. sort number
. list

        number   even   odd   _merge
   1.        1      .     1        2
   2.        2      .     3        2
   3.        3      .     5        2
   4.        4      .     7        2
   5.        5     10     9        3
   6.        6     12     .        1
   7.        7     14     .        1
   8.        8     16     .        1

Notice that number now goes from 1 to 8, with no repeated values and no values left out of the sequence. Recall that the odd dataset defined observations for number between 1 and 5, whereas the even dataset defined observations between 5 and 8. Thus, the variable odd is defined for number equal to 1 through 5, and even is defined for number equal to 5 through 8.

For instance, in the first observation, number is 1, even is missing, and odd is 1. The value of _merge, 2, indicates that this observation came from the using dataset, odd.dta. In the last observation, number is 8, even is 16, and odd is missing. The value of _merge, 1, indicates that this observation came from the master dataset, even.dta.
The fifth observation is worth comment. number is 5, even is 10, and odd is 9. Both even and odd are defined, since both the even and the odd datasets had information for number equal to 5. The value of _merge, 3, also tells us that both datasets contributed to the formation of the observation.

q
> Example

Although the previous example demonstrated, in glorious detail, how the match-merging process works, it was not a practical example of how you will ordinarily employ it. Here is a more realistic application. You have two datasets containing information on automobiles. The identifying variable in each dataset is make, a string variable containing the manufacturer and the model. By identifying variable, we mean a variable that is unique for every observation in the dataset. Values for make (for instance, "Honda Accord") are sufficient for identifying each observation. One dataset, autotech.dta, also contains mpg, weight, and length. The other dataset, autocost.dta, contains price and rep78, the 1978 repair record.

. describe using autotech

Contains data                    1978 Automobile Data
  obs:           74              11 Jul 2000 13:55
 vars:            4
 size:        2,072

              storage  display    value
variable name   type   format     label      variable label
make            str18  %18s                  Make and Model
mpg             int    %8.0g                 Mileage (mpg)
weight          int    %8.0g                 Weight (lbs.)
length          int    %8.0g                 Length (in.)

Sorted by:  make
. describe using autocost

Contains data                    1978 Automobile Data
  obs:           74              11 Jul 2000 13:55
 vars:            3
 size:        1,924

              storage  display    value
variable name   type   format     label      variable label
make            str18  %18s                  Make and Model
price           int    %8.0g                 Price
rep78           int    %8.0g                 Repair Record 1978

Sorted by:  make
We assume that you want to merge these two datasets into a single dataset:

. use autotech
(Automobile Models)
. merge make using autocost
Let's now examine the result:

. describe

Contains data from autotech.dta
  obs:           74              1978 Automobile Data
 vars:            7              11 Jul 2000 13:55
 size:        2,442              (99.6% of memory free)

              storage  display    value
variable name   type   format     label      variable label
make            str18  %18s                  Make and Model
mpg             int    %8.0g                 Mileage (mpg)
weight          int    %8.0g                 Weight (lbs.)
length          int    %8.0g                 Length (in.)
price           int    %8.0g                 Price
rep78           int    %8.0g                 Repair Record 1978
_merge          byte   %8.0g

Sorted by:
Note:  dataset has changed since last saved
We have a single dataset containing all the information from the two original datasets, or at least it appears that we do. Before accepting that conclusion, we need to verify the result. We think that we entered data for the same cars in each dataset, so every variable should be defined for every car. Although we know it is unlikely, we recognize the possibility that we made a mistake and accidentally left some cars out of one or the other dataset. We can reassure ourselves of our infallibility by tabulating _merge:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         74      100.00      100.00
------------+-----------------------------------
      Total |         74      100.00
We see that _merge is 3 for every observation in the dataset. We made no mistake; for every observation in autocost.dta, there was an observation in autotech.dta, and vice versa.
Now pretend that we have another dataset containing additional information on these automobiles, automore.dta, and we want to merge that dataset as well. Before we can do so, we must sort the data we have in memory by make, since after a merge the sort order may have changed:

. sort make
. merge make using automore
_merge already defined
r(110);

After sorting the data, Stata refused to merge the new dataset, complaining instead that _merge is already defined. Every time Stata merges datasets, it wants to create a variable called _merge (or varname if the _merge(varname) option was specified). In this case, there is an _merge variable left over from the last time we merged. We have three choices: we can rename the variable, we can drop it, or we can specify a different variable name with the _merge() option. In this case _merge contains no useful information (we already verified that the previous merge went as expected), so we drop it and try again:

. drop _merge
. merge make using automore

Stata performed our request; whatever new variables were contained in automore.dta are now contained in our single, master dataset. Perhaps. One should not jump to conclusions. After a match merge, you should always tabulate _merge to verify that the expected actually happened, as we do below:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        1.33        1.33
          2 |          1        1.33        2.67
          3 |         73       97.33      100.00
------------+-----------------------------------
      Total |         75      100.00

Surprise! In this case something strange did happen. Some 73 of the observations merged as we anticipated. However, the new dataset automore.dta added one new car to the dataset (identified by _merge equal to 2) and failed to define new variables for another car in our original dataset (identified by _merge equal to 1). Perhaps this is what should happen, but it is more likely that we have a mistake in automore.dta. We probably misidentified one car so that to Stata it appeared as data on a new car, resulting in one new observation and missing data on another. If this happened to us, we would figure out why it happened. We would type list make if _merge==1 to learn the identity of the car that did not appear in automore.dta, and we would type list make if _merge==2 to learn the identity of the car that automore.dta added to our data.

q
Technical Note

It is difficult to overemphasize the importance of tabulating _merge no matter how sure you are that you have no errors. It takes only a second and can save you hours of grief.

Along the same lines, one-to-one merges are a bad idea. In the example above, we could have performed all the merges as one-to-one merges and saved a small amount of typing. Let's examine what would have happened. We first merged autotech.dta with autocost.dta by typing merge make using autocost. We could have performed a one-to-one merge by typing merge using autocost. The result would be the same; the datasets line up and are in the same sort order, so sequentially matching the observations from the two datasets would have resulted in a perfectly matched dataset.

In the second case, we merged the data in memory with automore.dta by typing merge make using automore. A one-to-one merge would have led to disaster, and we would never have known it! If we typed merge using automore, Stata would sequentially, and blindly, join observations. Since there are the same number of observations in each dataset, everything would appear to merge perfectly.

We speculated in the previous example that we had an error in automore.dta. Remember that automore.dta included data on one new car and lacked data on an existing car. Even if there were no error, things would have gone awry. No matter what, the data in memory and automore.dta do not match. For instance, assume that this new car is the first observation of automore.dta and that it is some (perhaps mistaken) model of Ford. Assume that the first observation of the data in memory is on a Chevrolet. Stata could and would silently join data on the Chevrolet with data on the Ford, and thereafter data on a Volvo with data on a Saab, and even data on a Volkswagen with data on a Cadillac, and you would never know.

Every dataset should carry a variable or a set of variables that uniquely identifies each observation, and then you should always use those variables when merging data. Ignore this advice at your own peril.
Technical Note

Circumstances may arise when you will merge two datasets knowing there will be mismatches. Say you have an analysis dataset on patients from the cancer ward of a particular hospital, and you have just received another dataset containing their demographic information. Actually, this other dataset contains not just their demographic information but the demographic information on every patient in the hospital during the year. You could

. merge patid using demog
. drop if _merge==2

or

. merge patid using demog, nokeep

The nokeep option tells merge not to store observations from the using data that do not appear in the master. There is an advantage in this. When we merged and dropped, we stored the irrelevant observations and then discarded them, so the data in memory temporarily grew. When we merge with the nokeep option, the data never grow beyond what is absolutely necessary.

In our automobile example, we had a single identifying variable. Sometimes you will have multiple identifying variables: variables that, taken together, are unique for every observation. Let's imagine that, rather than having a single variable called make, we had two variables: manuf and model. manuf contains the manufacturer and model contains the model. Rather than having a single variable recording, say, "Honda Accord", we have two variables, one recording "Honda" and another recording "Accord". Stata can deal with this type of data. You can go back through our previous example and substitute manuf model everywhere you see make. For instance, rather than typing merge make using autocost, we would have typed merge manuf model using autocost.
Now let's make one more change in our assumptions. Let's assume that manuf and model are not string variables but are instead numerically coded variables. Perhaps the number 15 stands for Honda in the manuf variable and the number 2 stands for Accord in the model variable. We do not have to remember our numeric codes because we have smartly created value labels telling Stata what number stands for what string of characters. We now go back to the step where we merged autotech.dta with autocost.dta:

. use autotech
(Automobile models)
. merge manuf model using autocost
(label manuf already defined)
(label model already defined)

Stata makes two minor comments but otherwise carries out our request. It notes that the labels manuf and model are already defined. The messages refer to the value labels named manuf and model. Both datasets contain value label definitions that turn the numeric codes for manufacturer and model into words. When Stata merged the two datasets, it already had one set of definitions in memory (obtained when we typed use autotech) and thus ignored the second set of definitions contained in autocost.dta. Stata felt obliged to mention the second set of definitions while otherwise ignoring them, since they might contain different codings. In this case, we know they are the same since we created them. (Hint: You should never give the same name to value labels containing different codings.)
When performing a match merge, the master and/or using datasets may have multiple observations with the same varlist value. These multiple observations are joined sequentially, as in a one-to-one merge. If the datasets have an unequal number of observations with the same varlist value, the last such observation in the shorter dataset is replicated until the number of observations is equal.

> Example

The process of replicating the observation from the shorter dataset is known as spreading and can be put to practical use. Suppose you have two datasets. dollars.dta contains the dollar sales and costs of your firm, by region, for the last year:

. use dollars
(Regional Sales & Costs)
. list

        region      sales      cost
   1.   NE        360,523   138,097
   2.   N Cntrl   419,472   227,677
   3.   South     532,399   330,499
   4.   West      310,565   165,348

sforce.dta contains the names of the individuals in your sales force, along with the region in which they operate:

. use sforce
(Sales Force)
. list

        region    name
   1.   NE        Ecklund
   2.   NE        Franks
   3.   N Cntrl   Krantz
   4.   N Cntrl   Phipps
   5.   N Cntrl   Willis
   6.   South     Anderson
   7.   South     Dubnoff
   8.   South     Lee
   9.   South     McNiel
  10.   West      Charles
  11.   West      Grant
  12.   West      Cobb
You now wish to merge these two datasets by region, spreading the sales and cost information across all observations for which it is relevant; that is, you want to add the variables sales and cost to the sales force data. The variable sales will assume the value $360,523 for the first two observations, $419,472 for the next three observations, and so on.
. merge region using dollars
(label region already defined)
. list

        region    name        sales      cost   _merge
   1.   NE        Ecklund   360,523   138,097        3
   2.   NE        Franks    360,523   138,097        3
   3.   N Cntrl   Krantz    419,472   227,677        3
   4.   N Cntrl   Phipps    419,472   227,677        3
   5.   N Cntrl   Willis    419,472   227,677        3
   6.   South     Anderson  532,399   330,499        3
   7.   South     Dubnoff   532,399   330,499        3
   8.   South     Lee       532,399   330,499        3
   9.   South     McNiel    532,399   330,499        3
  10.   West      Charles   310,565   165,348        3
  11.   West      Grant     310,565   165,348        3
  12.   West      Cobb      310,565   165,348        3

Even though there are 12 observations in the sales force data and only 4 observations in the sales and cost data, all the records merged. dollars.dta contained one observation for the NE region; sforce.dta contained two observations for the same region. Thus, the single observation in dollars.dta was matched to both of the observations in sforce.dta. In technical jargon, the single record in dollars.dta was replicated, or spread, across the observations in sforce.dta.

q
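Spreading is, in effect, a many-to-one match merge. The following Python sketch (not Stata code) illustrates the idea with a dictionary lookup; it assumes the shorter dataset has unique keys, which is the situation in the example above:

```python
# Sketch of spreading: every master observation with a given key receives a
# copy of the single matching observation from the shorter (using) dataset.
# Assumption: keys are unique in `using`, as in the dollars.dta example.

def spread_merge(master, using, key):
    lookup = {u[key]: u for u in using}
    out = []
    for m in master:
        matched = m[key] in lookup
        row = {**lookup.get(m[key], {}), **m, "_merge": 3 if matched else 1}
        out.append(row)
    return out

sforce = [{"region": "NE", "name": "Ecklund"},
          {"region": "NE", "name": "Franks"},
          {"region": "West", "name": "Cobb"}]
dollars = [{"region": "NE", "sales": 360523},
           {"region": "West", "sales": 310565}]
for r in spread_merge(sforce, dollars, "region"):
    print(r["name"], r["sales"])  # both NE rows get the same sales figure
```

Both NE salespeople end up with the single NE sales figure, mirroring the spread shown in the listing above.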
Updating data

merge with the update option varies merge's actions when an observation in the master is matched with an observation in the using dataset. Without the update option, merge leaves the values in the master dataset alone and adds the data for the new variables. With the update option, merge adds the new variables, but it also replaces missing values in the master observation with corresponding values from the using. (Missing values mean numeric missing (.) and empty strings ("").) The values for _merge are extended:

  _merge   meaning
       1   obs. from master data
       2   obs. from using data
       3   obs. from both, master agrees with using
       4   obs. from both, missing in master updated
       5   obs. from both, master disagrees with using

In the case of _merge = 5, the master values are retained unless replace is specified, in which case the master values are updated just as if they had been missing.

Pretend dataset 1 contains variables id, a, and b; dataset 2 contains id, a, and z. You merge the two datasets by id, dataset 1 being the master dataset in memory and dataset 2 the using dataset on disk. Consider two observations that match, and call the values from the first dataset id1, etc., and those from the second id2, etc. The resulting dataset will have variables id, a, b, z, and _merge. merge's typical logic is

1. The fact that the observations match means id1 = id2. Set id = id1.
2. Variable a occurs in both datasets. Ignore a2, and set a = a1.
3. Variable b occurs in only dataset 1. Set b = b1.
4. Variable z occurs in only dataset 2. Set z = z2.
5. Set _merge = 3.

With update, the logic is modified:

1. (Unchanged.) Since the observations match, id1 = id2. Set id = id1.
2. Variable a occurs in both datasets:
   a. If a1 = a2, set a = a1 and set _merge = 3.
   b. If a1 contains missing and a2 is nonmissing, set a = a2 and set _merge = 4, indicating an update was made.
   c. If a2 contains missing, set a = a1 and set _merge = 3 (indicating no update).
   d. If a1 != a2 and both contain nonmissing, set a = a1 or, if replace was specified, a = a2; but, regardless, set _merge = 5, indicating a disagreement.

Rules 3 and 4 remain unchanged.
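The per-variable rules above can be sketched directly in Python (an illustration only, not Stata code), with None playing the role of Stata's missing value:

```python
# Sketch of the update/replace logic spelled out above, for a single variable
# that occurs in both datasets. Assumption: None stands in for Stata's
# missing value (. or "").

def update_value(master_val, using_val, replace=False):
    """Return (new_value, _merge) per rules 2a-2d above."""
    if master_val == using_val:
        return master_val, 3                       # 2a: agreement
    if master_val is None and using_val is not None:
        return using_val, 4                        # 2b: missing in master, updated
    if using_val is None:
        return master_val, 3                       # 2c: nothing to update
    # 2d: both nonmissing and different: disagreement
    return (using_val if replace else master_val), 5

# mpg pairs (original, updates) from the automobile example that follows
pairs = [(29, None), (None, 22), (24, 24), (None, 14), (19, 19), (26, 25), (23, 23)]
print([update_value(m, u) for m, u in pairs])
# [(29, 3), (22, 4), (24, 3), (14, 4), (19, 3), (26, 5), (23, 3)]
```

With replace=True, the disagreement case (26, 25) would yield (25, 5) instead, matching the manual's description of replace.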
> Example

In original.dta you have data on some cars, including the make, price, and mileage rating. In updates.dta you have some updated data on these cars, along with a new variable recording engine displacement. The data contain

. use original, clear
(original data)
. list

        make              price   mpg
   1.   Chev. Chevette    3,299    29
   2.   Chev. Malibu      4,504     .
   3.   Datsun 510        5,079    24
   4.   Merc. XR-7        6,303     .
   5.   Olds Cutlass      4,733    19
   6.   Renault Le Car    3,895    26
   7.   VW Dasher         7,140    23

. use updates, clear
(updates, mpg and displacement)
. list

        make              mpg   displac~t
   1.   Chev. Chevette      .         231
   2.   Chev. Malibu       22         200
   3.   Datsun 510         24         119
   4.   Merc. XR-7         14         302
   5.   Olds Cutlass       19         231
   6.   Renault Le Car     25          79
   7.   VW Dasher          23          97
By updating our data, we obtain

. use original, clear
(original data)

. merge make using updates, update

. list
            make      price    mpg    displac~t    _merge
  1. Chev. Chevette   3,299     29          231         3
  2. Chev. Malibu     4,504     22          200         4
  3. Datsun 510       5,079     24          119         3
  4. Merc. XR-7       6,303     14          302         4
  5. Olds Cutlass     4,733     19          231         3
  6. Renault Le Car   3,895     26           79         5
  7. VW Dasher        7,140     23           97         3
All observations merged, because all have _merge >= 3. The observations having _merge = 3 have mpg just as it was recorded in the original dataset. In observation 1, mpg is 29 because the updated dataset had mpg = .; in observation 3, mpg remains 24 because the updated dataset also stated that mpg is 24.

The observations having _merge = 4 have had their mpg data updated. The mpg variable was missing in observations 2 and 4, and new values were obtained from the update data.

The observation having _merge = 5 has its mpg just as it was recorded in the original dataset, just as do the _merge = 3 observations, but there is an important difference: there is a disagreement about the value of mpg; the original claims it is 26 and the update, 25. Had we specified the replace option, mpg would now contain the updated 25, but the observation would still be marked _merge = 5. replace affects only which value is retained in the case of disagreement.
References
Nash, J. D. 1994. dm19: Merging raw data and dictionary files. Stata Technical Bulletin 20: 3-5. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 22-25.
Weesie, J. 2000. dm75: Safe and easy matched merging. Stata Technical Bulletin 53: 6-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 62-77.

Also See
Complementary:  [R] save, [R] sort
Related:        [R] append, [R] cross, [R] joinby
Background:     [U] 25 Commands for combining data
Title
meta -- Meta-analysis

Remarks
Stata should have a meta-analysis command, but as of the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands for performing meta-analysis, many of which have been published in the Stata Technical Bulletin (STB).
Issue    insert    author(s)                  command    description
---------------------------------------------------------------------------------
STB-38   sbe16     S. Sharp, J. Sterne        meta       meta-analysis for an outcome of two
                                                         exposures or two treatment regimens
STB-42   sbe16.1   S. Sharp, J. Sterne        meta       update of sbe16
STB-43   sbe16.2   S. Sharp, J. Sterne        meta       update; install this version
STB-41   sbe19     T. J. Steichen             metabias   performs the Begg and Mazumdar (1994)
                                                         adjusted rank correlation test for
                                                         publication bias and the Egger et al.
                                                         (1997) regression asymmetry test for
                                                         publication bias
STB-44   sbe19.1   T. J. Steichen, M. Egger,  metabias   update of sbe19
                   J. Sterne
STB-57   sbe19.2   T. J. Steichen             metabias   update; install this version
STB-41   sbe20     A. Tobias                  galbr      performs the Galbraith plot (1988),
                                                         which is useful for investigating
                                                         heterogeneity in meta-analysis
STB-56   sbe20.1   A. Tobias                  galbr      update; install this version
STB-42   sbe22     J. Sterne                  metacum    performs cumulative meta-analysis,
                                                         using fixed- or random-effects
                                                         models, and graphs the result
STB-42   sbe23     S. Sharp                   metareg    extends a random-effects
                                                         meta-analysis to estimate the extent
                                                         to which one or more covariates,
                                                         with values defined for each study
                                                         in the analysis, explain
                                                         heterogeneity in the treatment
                                                         effects
STB-44   sbe24     M. J. Bradburn,            metan,     meta-analysis of studies with two
                   J. J. Deeks, D. G. Altman  funnel,    groups; funnel plot of precision
                                              labbe      versus treatment effect; L'Abbe plot
STB-45   sbe24.1   M. J. Bradburn,            funnel     update; install this version
                   J. J. Deeks, D. G. Altman
STB-47   sbe26     A. Tobias                  metainf    graphical technique to look for
                                                         influential studies in the
                                                         meta-analysis estimate
STB-56   sbe26.1   A. Tobias                  metainf    update; install this version
STB-49   sbe28     A. Tobias                  metap      combines p-values using either
                                                         Fisher's method or Edgington's
                                                         method
STB-56   sbe28.1   A. Tobias                  metap      update; install this version
STB-57   sbe39     T. J. Steichen             metatrim   performs the Duval and Tweedie
                                                         (2000) nonparametric "trim and fill"
                                                         method of accounting for publication
                                                         bias in meta-analysis

Additional commands may be available; enter Stata and type search meta analysis.
To download and install from the Internet the Sharp and Sterne meta command, for instance, you could

1. Pull down Help and select STB and User-written Programs.
2. Click on http://www.stata.com.
3. Click on stb.
4. Click on stb49.
5. Click on sbe28.
6. Click on click here to install.

or you could instead do the following:

1. Navigate to the appropriate STB issue:
   a. Type net from http://www.stata.com
      Type net cd stb
      Type net cd stb49
   or
   b. Type net from http://www.stata.com/stb/stb49
2. Type net describe sbe28
3. Type net install sbe28
References
Begg, C. B. and M. Mazumdar. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics 50: 1088-1101.
Bradburn, M. J., J. J. Deeks, and D. G. Altman. 1998a. sbe24: metan--an alternative meta-analysis command. Stata Technical Bulletin 44: 4-15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 86-100.
------. 1998b. sbe24.1: Correction to funnel plot. Stata Technical Bulletin 45: 21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 100.
Egger, M., G. D. Smith, M. Schneider, and C. Minder. 1997. Bias in meta-analysis detected by a simple, graphical test. British Medical Journal 315: 629-634.
Galbraith, R. F. 1988. A note on graphical display of estimated odds ratios from several clinical trials. Statistics in Medicine 7: 889-894.
L'Abbe, K. A., A. S. Detsky, and K. O'Rourke. 1987. Meta-analysis in clinical research. Annals of Internal Medicine 107: 224-233.
Sharp, S. 1998. sbe23: Meta-analysis regression. Stata Technical Bulletin 42: 16-22. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 148-155.
Sharp, S. and J. Sterne. 1997. sbe16: Meta-analysis. Stata Technical Bulletin 38: 9-14. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 100-106.
------. 1998a. sbe16.1: New syntax and output for the meta-analysis command. Stata Technical Bulletin 42: 6-8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 106-108.
------. 1998b. sbe16.2: Corrections to the meta-analysis command. Stata Technical Bulletin 43: 15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 84.
Steichen, T. J. 1998. sbe19: Tests for publication bias in meta-analysis. Stata Technical Bulletin 41: 9-15. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 125-133.
------. 2000a. sbe19.2: Update of tests for publication bias in meta-analysis. Stata Technical Bulletin 57: 4.
------. 2000b. sbe39: Nonparametric trim and fill analysis of publication bias in meta-analysis. Stata Technical Bulletin 57: 8-14.
Steichen, T. J., M. Egger, and J. Sterne. 1998. sbe19.1: Tests for publication bias in meta-analysis. Stata Technical Bulletin 44: 3-4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 84-85.
Sterne, J. 1998. sbe22: Cumulative meta-analysis. Stata Technical Bulletin 42: 13-16. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 143-147.
Tobias, A. 1998. sbe20: Assessing heterogeneity in meta-analysis: the Galbraith plot. Stata Technical Bulletin 41: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 133-136.
------. 1999a. sbe26: Assessing the influence of a single study in the meta-analysis estimate. Stata Technical Bulletin 47: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 108-110.
------. 1999b. sbe28: Meta-analysis of p-values. Stata Technical Bulletin 49: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 138-140.
------. 2000a. sbe20.1: Update of galbr. Stata Technical Bulletin 56: 14.
------. 2000b. sbe26.1: Update of metainf. Stata Technical Bulletin 56: 15.
------. 2000c. sbe28.1: Update of metap. Stata Technical Bulletin 56: 15.
Title
mfx -- Obtain marginal effects or elasticities after estimation

Syntax
mfx compute [if exp] [in range] [, dydx eyex dyex eydx at(atlist)
        eqlist(eqnames) predict(predict_option) nonlinear nodiscrete
        noesample nowght nose level(#) ]

mfx replay [, level(#)]

where atlist is

        { mean | median | zero [varname=# [, varname=#] [...]] }
        numlist     (for single-equation estimators only)
        matname     (for single-equation estimators only)

Description
mfx compute numerically calculates the marginal effects or the elasticities and their standard errors after estimation. Exactly what mfx can calculate is determined by the previous estimation command and the predict(predict_option) option. At which points the marginal effects or elasticities are to be evaluated is determined by the at(atlist) option. By default, mfx calculates the marginal effects or the elasticities at the means of the independent variables by using the default prediction option associated with the previous estimation command.

mfx replay replays the results of the previous mfx computation.
Options
dydx specifies that marginal effects are to be calculated. It is the default.

eyex specifies that elasticities are to be calculated in the form of ∂ln y/∂ln x.

dyex specifies that elasticities are to be calculated in the form of ∂y/∂ln x.

eydx specifies that elasticities are to be calculated in the form of ∂ln y/∂x.

at(atlist) specifies the points around which the marginal effects or the elasticities are to be estimated. The default is to estimate the effect around the means of the independent variables.

at(mean | median | zero [varname=# [, varname=#] [...]]) specifies that the marginal effects or the elasticities are to be evaluated at the means, at the medians of the independent variables, or at zeros. It also allows users to specify particular values for one or more independent variables, assuming the rest are means, medians, or zeros. For instance,

. probit foreign mpg weight price
. mfx compute, at(mean mpg=30)
at(numlist) specifies that the marginal effects or the elasticities are to be evaluated at the numlist. If there is a constant term in the model, add a 1 to the numlist. This option is for single-equation estimators only. For instance,

. probit foreign mpg weight price
. mfx compute, at(21 3000 6000 1)

at(matname) specifies the points in a matrix format. A 1 is also needed if there is a constant term in the model. This option is for single-equation estimators only. For instance,

. probit foreign mpg weight price
. mat A = (21, 3000, 6000, 1)
. mfx compute, at(A)
eqlist(eqnames) indirectly specifies the variables for which marginal effects (or elasticities) are to be calculated. Marginal effects (elasticities) will be calculated for all variables in the equations specified. The default is all equations, which is to say, all variables.

predict(predict_option) specifies which function is to be calculated for the marginal effects or the elasticities; i.e., the form of y. The default is the default predict option of the previous estimation command. For instance, since the default prediction for probit is the probability of a positive outcome, the predict() option is not required to calculate the marginal effects of the independent variables for the probability of a positive outcome.

. probit foreign mpg weight price
. mfx compute

To calculate the marginal effects for the linear prediction (xb), specify predict(xb).

. mfx compute, predict(xb)

To see which predict options are available, see help for the particular estimation command.

nonlinear specifies that y, the function to be calculated for the marginal effects or the elasticities, does not meet the linear-form restriction. For the definition of the linear-form restriction, please refer to the Methods and Formulas section. By default, mfx will assume that y meets the linear-form restriction, unless one or more independent variables are shared by multiple equations. For instance, predictions after

. heckman mpg price, sel(for=rep)

meet the linear-form restriction, but those after

. heckman mpg price, sel(for=rep price)

do not. If y meets the linear-form restriction, specifying nonlinear or not should produce the same results. However, the nonlinear method is generally more time-consuming. Most likely, users do not need to specify nonlinear after a Stata official estimation command. For user-written estimation commands, if you are not sure whether y is of linear form, specifying nonlinear is always a safe choice. Please refer to the Speed and accuracy section for further discussion.

nodiscrete treats dummy variables as continuous ones. If nodiscrete is not specified, the marginal effect of a dummy variable is calculated as the discrete change in the expected value of the dependent variable as the dummy variable changes from 0 to 1. This option is irrelevant to the computation of the elasticities, because all the dummy variables are treated as continuous in computing elasticities.

noesample only affects at(atlist). It specifies that when the means and medians are calculated, the whole dataset is to be considered instead of only the observations marked in the e(sample) defined by the previous estimation command.
nowght only affects at(atlist). It specifies that weights are to be ignored when calculating the means and medians for the atlist.

nose asks mfx to calculate the marginal effects or the elasticities without their standard errors. Calculating standard errors is very time-consuming; specifying nose will reduce the running time of mfx.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
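The default computation can be pictured as a numerical derivative of the prediction function evaluated at the means of the regressors. The following Python sketch (not Stata code; the logit probability and the coefficients are illustrative assumptions) shows the idea with a two-sided difference:

```python
import math

def logit_prob(xb):
    """A stand-in prediction function: Pr(positive outcome) after logit."""
    return 1.0 / (1.0 + math.exp(-xb))

def marginal_effects(beta, cons, xbar, h=1e-5):
    """Two-sided numerical derivative of the prediction with respect to
    each regressor, the other regressors held at their means xbar."""
    effects = []
    for k in range(len(xbar)):
        lo, hi = list(xbar), list(xbar)
        lo[k] -= h
        hi[k] += h
        f = lambda x: logit_prob(sum(b * v for b, v in zip(beta, x)) + cons)
        effects.append((f(hi) - f(lo)) / (2 * h))
    return effects

# Hypothetical two-regressor logit, evaluated at the means of x.
beta, cons, xbar = [0.23, 0.0003], -7.6, [21.3, 6165.0]
me = marginal_effects(beta, cons, xbar)
# For the logit, dP/dx_k = beta_k * P * (1 - P); the numerical values agree.
```

The analytic check in the last comment is what makes the logit a convenient test case; for predictions without a closed-form derivative, the numerical approach is all that is available.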
Remarks
Remarks are presented under the headings

    Obtaining marginal effects after single-equation (SE) estimation
    Obtaining marginal effects after multiple-equation (ME) estimation
    Obtaining three forms of elasticities
    Speed and accuracy

Obtaining marginal effects after single-equation (SE) estimation
Before running mfx, type help estimation_cmd to see what can be predicted after estimation and to see the default prediction.
> Example
We estimate a logit model using the auto dataset:

. logit foreign mpg price
Iteration 0:  log likelihood =  -45.03321
Iteration 1:  log likelihood = -36.694839
Iteration 2:  log likelihood = -36.463294
Iteration 3:  log likelihood =  -36.46219
Iteration 4:  log likelihood = -36.462189

Logit estimates                                   Number of obs   =        74
                                                  LR chi2(2)      =     17.14
                                                  Prob > chi2     =    0.0002
Log likelihood = -36.462189                       Pseudo R2       =    0.1903

     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         mpg |   .2338353   .0671449     3.48   0.000     .1022338    .3654368
       price |    .000266   .0001166     2.28   0.022     .0000375    .0004945
       _cons |  -7.648111   2.043673    -3.74   0.000    -11.65364   -3.642586

To determine the marginal effects of mpg and price for the probability of a positive outcome at their mean values, issue the mfx command without options, because the default prediction after logit is the probability of a positive outcome and, by default, the calculation is made at the mean values.
. mfx compute
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .26347633

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0453773      .0131     3.46   0.001   .019702  .071053  21.2973
   price |   .0000516      .00002    2.31   0.021   7.8e-06  .000095  6165.26

The first line of the output indicates that the marginal effects were calculated after a logit estimation. The second line of the output gives the form of y and the predict command that we would type to get y. The third line of the output gives the value of y given X; the values of X are displayed in the last column of the table.
To calculate the marginal effects at particular data points, say, mpg = 20 and price = 6000, specify the at() option:

. mfx compute, at(mpg=20, price=6000)
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .20176601

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0376607      .00961    3.92   0.000   .018834  .056488  20.0000
   price |   .0000428      .00002    2.47   0.014   8.8e-06  .000077  6000.00
To calculate the marginal effects for the linear prediction (xb) instead of the probability, specify predict(xb). Note that the marginal effects for the linear prediction are the coefficients themselves.

. mfx compute, predict(xb)
Marginal effects after logit
      y  = Linear prediction (predict, xb)
         = -1.0279779

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .2338353      .06714    3.48   0.000   .102234  .365437  21.2973
   price |    .000266      .00012    2.28   0.022   .000038  .000495  6165.26
If there is a dummy variable as an independent variable, mfx will calculate the discrete change as the dummy variable changes from 0 to 1.

. gen record = 0
. replace record = 1 if rep78 > 3
(34 real changes made)
. logit foreign mpg record, nolog

Logit estimates                                   Number of obs   =        74
                                                  LR chi2(2)      =     26.27
                                                  Prob > chi2     =    0.0000
Log likelihood = -31.898321                       Pseudo R2       =    0.2917

     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         mpg |   .1079219   .0565077     1.91   0.056    -.0028311    .2186749
      record |   2.435068   .7128444     3.42   0.001     1.037918    3.832217
       _cons |  -4.689347   1.326547    -3.54   0.000     -7.28933   -2.089363

. mfx compute
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381  21.2973
 record* |   .4272707      .10432    4.09   0.000   .222712   .63163  .459459

(*) dy/dx is for discrete change of dummy variable from 0 to 1
If nodiscrete is specified, mfx will treat the dummy variable as continuous.

. mfx compute, nodiscrete
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381  21.2973
  record |   .4163552      .10733    3.88   0.000   .205994  .626716  .459459
Obtaining marginal effects after multiple-equation (ME) estimation
If you have not read the discussion above on using mfx after SE estimations, please do so. Except for the ability to select specific equations for the calculation of marginal effects, the use of mfx after ME models follows almost exactly the same form as for SE models.

The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of mfx after ME commands should first read the documentation of predict for the estimation command. For a general introduction to the ME models, we will demonstrate mfx after heckman and mlogit.
> Example
. heckman mpg weight length, sel(foreign = displ) nolog

Heckman selection model                           Number of obs   =        74
(regression model with sample selection)          Censored obs    =        52
                                                  Uncensored obs  =        22
                                                  Wald chi2(2)    =      7.27
Log likelihood = -87.58426                        Prob > chi2     =    0.0264

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
mpg          |
      weight |  -.0039923   .0071948    -0.55   0.579    -.0180939    .0101092
      length |  -.1202545   .2093074    -0.57   0.566    -.5304895    .2899805
       _cons |   56.72567   21.68463     2.62   0.009     14.22458    99.22676
foreign      |
displacement |  -.0250297   .0067241    -3.72   0.000    -.0382088   -.0118506
       _cons |   3.223625   .8757406     3.68   0.000     1.507205    4.940045
     /athrho |  -.9840858   .8112212    -1.21   0.225     -2.57405    .6058785
    /lnsigma |   1.724306   .2794524     6.17   0.000     1.176589    2.272022
         rho |  -.7548292    .349014                     -.9884463    .5412193
       sigma |   5.608626   1.567344                      3.243293    9.698997
      lambda |  -4.233555   3.022645                     -10.15783    1.690721
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.37   Prob > chi2 = 0.2413

heckman estimated two equations, mpg and foreign; see [R] heckman. Two of the prediction statistics after heckman are the expected value of the dependent variable and the probability of being observed. To obtain the marginal effects of the independent variables of all the equations for the expected value of the dependent variable, specify predict(yexpected) with mfx.

. mfx compute, predict(yexpected)
Marginal effects after heckman
      y  = E(mpg*|Pr(foreign)) (predict, yexpected)
         =  .56522778

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
  weight |  -.0001725      .00041   -0.42   0.675  -.000979  .000634  3019.46
  length |  -.0051953      .01002   -0.52   0.604   -.02483   .01444  187.932
displa~t |  -.0340055      .02541   -1.34   0.181  -.083802  .015791  197.297

To calculate the marginal effects for the probability of being observed, since only the independent variables in equation foreign affect the probability of being observed, specify eqlist(foreign) to restrict the calculation.

. mfx compute, eqlist(foreign) predict(psel)
Marginal effects after heckman
      y  = Pr(foreign) (predict, psel)
         =  .04320292

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
displa~t |  -.0022958      .00153   -1.50   0.133  -.005287  .000696  197.297
> Example
predict after mlogit has a special feature that most other estimation commands do not: it can predict multiple new variables with a single predict command; see [R] mlogit. This feature cannot be adopted into mfx. To calculate the marginal effects for the probability of each outcome, run mfx separately for each outcome.
. mlogit rep78 mpg displ, nolog

Multinomial regression                            Number of obs   =        69
                                                  LR chi2(8)      =     22.83
                                                  Prob > chi2     =    0.0036
Log likelihood = -82.27874                        Pseudo R2       =    0.1218

       rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
1            |
         mpg |  -.0021573   .2104309    -0.01   0.992    -.4145942    .4102796
displacement |  -.0052312   .0126927    -0.41   0.680    -.0301085    .0196461
       _cons |  -1.566574   6.429681    -0.24   0.808    -14.16852    11.03537
2            |
         mpg |   .0150954   .1235325     0.12   0.903    -.2270239    .2572147
displacement |   .0020254   .0063719     0.32   0.751    -.0104634    .0145142
       _cons |   -2.09099   3.664348    -0.57   0.568    -9.272981    5.091001
4            |
         mpg |   .0070871   .0883698     0.08   0.936    -.1661146    .1802888
displacement |  -.0066993   .0053435    -1.25   0.210    -.0171723    .0037737
       _cons |   .7047881   2.704785     0.26   0.794    -4.596492    6.006069
5            |
         mpg |   .0808327   .0983973     0.82   0.411    -.1120224    .2736878
displacement |  -.0231922   .0119692    -1.94   0.053    -.0466514    .0002671
       _cons |    .652801   3.545048     0.18   0.854    -6.295365    7.600967
(Outcome rep78==3 is the comparison group)

. mfx compute, predict(outcome(1))
Marginal effects after mlogit
      y  = Pr(rep78==1) (predict, outcome(1))
         =  .03438017

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |  -.0003566      .00679   -0.05   0.958  -.013663   .01295  21.2899
displa~t |  -.0000703      .00041   -0.17   0.864  -.000873  .000732  198.000
. mfx compute, predict(outcome(2))
Marginal effects after mlogit
      y  = Pr(rep78==2) (predict, outcome(2))
         =  .12361544

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0008507      .01277    0.07   0.947  -.024183  .025885  21.2899
displa~t |   .0006444      .00067    0.96   0.336  -.000668  .001957  198.000
. mfx compute, predict(outcome(3))
Marginal effects after mlogit
      y  = Pr(rep78==3) (predict, outcome(3))
         =  .48578012

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |  -.0039901      .01922   -0.21   0.836  -.041682  .033682  21.2899
displa~t |   .0015484      .00108    1.43   0.151  -.000567  .003664  198.000
. mfx compute, predict(outcome(4))
Marginal effects after mlogit
      y  = Pr(rep78==4) (predict, outcome(4))
         =  .30337619

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |  -.0003418      .01707   -0.02   0.984  -.033805  .033122  21.2899
displa~t |  -.0010654      .00106   -1.01   0.313  -.003136  .001005  198.000
. mfx compute, predict(outcome(5))
Marginal effects after mlogit
      y  = Pr(rep78==5) (predict, outcome(5))
         =  .05284808

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0038378      .00561    0.68   0.494  -.007167  .014843  21.2899
displa~t |  -.0010572      .00047   -2.24   0.025  -.001984 -.000131  198.000
Obtaining three forms of elasticities
mfx can also be used to obtain all three forms of elasticities.

    option    elasticity
    eyex      ∂ln y/∂ln x
    dyex      ∂y/∂ln x
    eydx      ∂ln y/∂x
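All three forms are rescalings of the marginal effect dy/dx by the values of x and y at the point of evaluation. A small Python sketch of the relationships (illustrative numbers only, not Stata code):

```python
# The three elasticity forms as rescalings of the marginal effect dy/dx:
#   eyex = (dy/dx) * x / y   percent change in y for a 1 percent change in x
#   dyex = (dy/dx) * x       unit change in y for a 1 percent change in x
#   eydx = (dy/dx) / y       percent change in y for a unit change in x
def elasticities(dydx, x, y):
    return {"eyex": dydx * x / y,
            "dyex": dydx * x,
            "eydx": dydx / y}

# Hypothetical point of evaluation: dy/dx = -0.00385 at x = 3019.46,
# y = 21.2973 (made-up values of regression-like magnitude).
e = elasticities(-0.00385, 3019.46, 21.2973)
print(e)
```

The scaling by x converts a unit change in x into a proportional change, and the division by y converts a unit change in y into a proportional change; combining them gives eyex.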
> Example
We estimate a regression model using the auto dataset. The marginal effects for the predicted value y after regress are the same as the coefficients. To get the elasticities of the form ∂ln y/∂ln x, specify the eyex option:

. regress mpg weight length

      Source |       SS       df       MS               Number of obs =     74
-------------+------------------------------           F(  2,    71) =  69.34
       Model |  1616.08062     2  808.040312           Prob > F      = 0.0000
    Residual |  827.378835    71  11.6532230           R-squared     = 0.6614
-------------+------------------------------           Adj R-squared = 0.6519
       Total |  2443.45946    73  33.4720474           Root MSE      = 3.4137

         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      weight |  -.0038515    .001586    -2.43   0.018    -.0070138   -.0006891
      length |  -.0795935   .0553577    -1.44   0.155    -.1899736    .0307867
       _cons |   47.88487    6.08787     7.87   0.000       35.746    60.02374

. mfx compute, eyex
Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

variable |      ey/ex    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
  weight |  -.5460497      .22509   -2.43   0.015  -.987208 -.104891  3019.46
  length |  -.7023518      .48867   -1.44   0.151  -1.66012  .255414  187.932
The first line of the output indicates that the elasticities were calculated after a regress estimation. The title of the second column of the table gives the form of the elasticities, ∂ln y/∂ln x: the change in y in percent for a 1 percent change in x.

If the independent variables have been log-transformed already, then we will want the elasticities of the form ∂ln y/∂x instead.

. gen lnweight = ln(weight)
. gen lnlength = ln(length)

. regress mpg lnweight lnlength

      Source |       SS       df       MS               Number of obs =     74
-------------+------------------------------           F(  2,    71) =  74.00
       Model |  1651.28916     2  825.644581           Prob > F      = 0.0000
    Residual |  792.170298    71  11.1573281           R-squared     = 0.6758
-------------+------------------------------           Adj R-squared = 0.6667
       Total |  2443.45946    73  33.4720474           Root MSE      = 3.3403

         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    lnweight |   -13.5974   4.692504    -2.90   0.005    -22.95398   -4.240811
    lnlength |  -9.816726   10.40316    -0.94   0.349    -30.56004    10.92659
       _cons |   181.1196   22.18429     8.16   0.000     136.8853    225.3538

. mfx compute, eydx
Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

variable |      ey/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
lnlength |  -.4609376      .48855   -0.94   0.345  -1.41847  .496594  5.22904
lnweight |  -.6384565      .22064   -2.89   0.004   -1.0709 -.206009  7.97875
Note that although the interpretation is the same, the results for eyex and eydx differ since we are estimating different models. If the dependent variable were log-transformed, we would specify dyex instead.
ml plot [eqname:]name [# [# [#]]] [, saving(filename[, replace]) ]

ml init { [eqname:]name=# | /eqname=# } [...]
ml init # [# ...], copy
ml init matname [, skip copy]

ml report

ml trace { on | off }

ml count [clear | on | off]

ml maximize [, difficult nolog trace gradient hessian showstep
        iterate(#) ltolerance(#) tolerance(#) nowarning novce
        score(newvarnames) nooutput level(#) eform(string) noclear ]

ml graph [#] [, saving(filename[, replace]) ]

ml display [, noheader eform(string) first neq(#) plus level(#) ]
where method is { lf | d0 | d1 | d1debug | d2 | d2debug }

and eq is the equation to be estimated, enclosed in parentheses, and optionally with a name to be given to the equation, preceded by a colon:

        ( [eqname:] [varnames =] [varnames] [, eq_options] )

or eq is the name of a parameter, such as sigma, with a slash in front:

        /eqname        which is equivalent to        (eqname:)

and eq_options are

        noconstant
        offset(varname)
        exposure(varname)

fweights, pweights, aweights, and iweights are allowed; see [U] 14.1.6 weight. With all but method lf, you must write your likelihood-evaluation program a certain way if pweights are to be specified, and pweights may not be specified with method d0.

ml shares features of all estimation commands; see [U] 23 Estimation and post-estimation commands. To redisplay results, type ml display.
Syntax of ml model in noninteractive mode
ml model method progname eq [eq ...] [weight] [if exp] [in range], maximize
        [ robust cluster(varname) title(string) nopreserve collinear
        missing lf0(#k #ll) continue waldtest(#) constraints(numlist)
        obs(#) noscvars init(ml_init_args) search({on | quietly | off})
        repeat(#) bounds(ml_search_bounds) difficult nolog trace
        gradient hessian showstep iterate(#) ltolerance(#) tolerance(#)
        nowarning novce score(newvarlist) ]

Noninteractive mode is invoked by specifying option maximize. Use maximize when ml is to be used as a subroutine of another ado-file or program and you want to carry forth the problem, from definition to posting of final results, in one command.
Syntax of subroutines for use by method d0, d1, and d2 evaluators
mleval newvarname = vecname [, eq(#)]
mleval scalarname = vecname, scalar [eq(#)]
mlsum scalarname_lnf = exp [if exp] [, noweight]
mlvecsum scalarname_lnf rowvecname = exp [if exp] [, eq(#)]
mlmatsum scalarname_lnf matrixname = exp [if exp] [, eq(#[,#])]
Syntax of user-written evaluator

Summary of notation
The log-likelihood function is ln L(θ1j, θ2j, ..., θEj), where θij = xij bi, and j = 1, ..., N indexes observations and i = 1, ..., E indexes the linear equations defined by ml model. If the likelihood satisfies the linear-form restrictions, it can be decomposed as

        ln L = Σ(j=1 to N) ln ℓ(θ1j, θ2j, ..., θEj)
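The linear-form decomposition can be made concrete with a small numerical illustration (Python, not Stata ado-code; a single-equation probit likelihood and made-up data are assumptions for the sketch): each observation contributes ln ℓ(θj) with θj = xj·b, and the overall ln L is the sum of those contributions.

```python
import math

def Phi(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lnL_linear_form(y, X, b):
    """Probit log likelihood in linear form: observation j contributes
    ln l(theta_j) with theta_j = x_j . b, and ln L is the sum."""
    total = 0.0
    for yj, xj in zip(y, X):
        theta = sum(bi * xi for bi, xi in zip(b, xj))    # theta_j = x_j b
        pj = Phi(theta)
        total += math.log(pj if yj == 1 else 1.0 - pj)   # ln l(theta_j)
    return total

# Made-up data: a constant-only probit (x_j = [1]), so every theta_j = b0
# and, at b0 = 0, each observation contributes ln 0.5.
y = [1, 1, 0, 1]
print(lnL_linear_form(y, [[1.0]] * 4, [0.0]))
```

This is exactly the structure a method lf evaluator exploits: it only ever sees the θij, never the full coefficient vector.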
Method lf evaluators:

program define progname
        version 7
        args lnf theta1 [theta2 ...]
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        quietly gen double `tmp1' = ...
        ...
        quietly replace `lnf' = ...
end

where
        `lnf'      variable to be filled in with observation-by-observation
                   values of ln ℓj
        `theta1'   variable containing evaluation of 1st equation θ1j = x1j b1
        `theta2'   variable containing evaluation of 2nd equation θ2j = x2j b2
Method d0 evaluators:

program define progname
        version 7
        args todo b lnf
        tempvar theta1 theta2 ...
        mleval `theta1' = `b', eq(1)
        mleval `theta2' = `b', eq(2)    /* if there is a θ2 */
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        gen double `tmp1' = ...
        ...
        mlsum `lnf' = ...
end

where
        `todo'   always contains 0 (may be ignored)
        `b'      full parameter row vector b=(b1,b2,...,bE)
        `lnf'    scalar to be filled in with overall ln L
Method d1 evaluators:

program define progname
        version 7
        args todo b lnf g negH [g1 [g2 ...]]
        tempvar theta1 theta2 ...
        mleval `theta1' = `b', eq(1)
        mleval `theta2' = `b', eq(2)    /* if there is a θ2 */
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        gen double `tmp1' = ...
        ...
        mlsum `lnf' = ...
        if `todo'==0 | `lnf'==. { exit }
        tempname d1 d2 ...
        mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
        mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
        matrix `g' = (`d1',`d2', ...)
end

where
        `todo'   contains 0 or 1; 0 => `lnf' to be filled in;
                 1 => `lnf' and `g' to be filled in
        `b'      full parameter row vector b=(b1,b2,...,bE)
        `lnf'    scalar to be filled in with overall ln L
        `g'      row vector to be filled in with overall g=∂ln L/∂b
        `negH'   argument to be ignored
        `g1'     variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'     variable optionally to be filled in with ∂ln ℓj/∂b2
Method d2 evaluators:

program define progname
        version 7
        args todo b lnf g negH [g1 [g2 ...]]
        tempvar theta1 theta2 ...
        mleval `theta1' = `b', eq(1)
        mleval `theta2' = `b', eq(2)    /* if there is a θ2 */
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        gen double `tmp1' = ...
        ...
        mlsum `lnf' = ...
        if `todo'==0 | `lnf'==. { exit }
        tempname d1 d2 ...
        mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
        mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
        matrix `g' = (`d1',`d2', ...)
        if `todo'==1 | `lnf'==. { exit }
        tempname d11 d12 d22 ...
        mlmatsum `lnf' `d11' = formula for -∂²ln ℓj/∂θ1j², eq(1)
        mlmatsum `lnf' `d12' = formula for -∂²ln ℓj/∂θ1j∂θ2j, eq(1,2)
        mlmatsum `lnf' `d22' = formula for -∂²ln ℓj/∂θ2j², eq(2)
        matrix `negH' = (`d11',`d12', ... \ `d12'',`d22', ...)
end

where
        `todo'   contains 0, 1, or 2; 0 => `lnf' to be filled in;
                 1 => `lnf' and `g' to be filled in; 2 => `lnf', `g',
                 and `negH' to be filled in
        `b'      full parameter row vector b=(b1,b2,...,bE)
        `lnf'    scalar to be filled in with overall ln L
        `g'      row vector to be filled in with overall g=∂ln L/∂b
        `negH'   matrix to be filled in with overall negative Hessian
                 -H = -∂²ln L/∂b∂b'
        `g1'     variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'     variable optionally to be filled in with ∂ln ℓj/∂b2
ml -- Maximum likelihood estimation

Global macros for use by all evaluators

        $ML_y1      name of first dependent variable
        $ML_y2      name of second dependent variable, if any
        $ML_samp    variable containing 1 if observation to be used; 0 otherwise
        $ML_w       variable containing weight associated with observation or 1
                    if no weights specified

Method lf evaluators can ignore $ML_samp, but restricting calculations to the $ML_samp==1 subsample will speed execution. Method lf evaluators must ignore $ML_w; application of weights is handled by the method itself. Method d0, d1, and d2 evaluators can ignore $ML_samp as long as ml model's nopreserve option is not specified. Methods d0, d1, and d2 will run more quickly if nopreserve is specified. Method d0, d1, and d2 evaluators can ignore $ML_w only if they use mlsum, mlvecsum, and mlmatsum to produce final results.
Description

ml clear clears the current problem definition. This command is rarely, if ever, used because, when you type ml model, any previous problem is automatically cleared.

ml model defines the current problem.

ml query displays a description of the current problem.

ml check verifies that the log-likelihood evaluator you have written seems to work. We strongly recommend using this command.

ml search searches for (better) initial values. We recommend using this command.

ml plot provides a graphical way of searching for (better) initial values.

ml init provides a way of setting initial values to user-specified values.

ml report reports the values of ln L, its gradient, and its negative Hessian at the initial values or current parameter estimates b0.

ml trace traces the execution of the user-defined log-likelihood evaluation program.

ml count counts the number of times the user-defined log-likelihood evaluation program is called. It was intended as a debugging tool for those developing ml, and it now serves little use besides entertainment. ml count clear clears the counter. ml count on turns on the counter. ml count without arguments reports the current values of the counters. ml count off stops counting calls.

ml maximize maximizes the likelihood function and reports final results. Once ml maximize has successfully completed, the previously mentioned ml commands may no longer be used -- ml graph and ml display may be used.

ml graph graphs the log-likelihood values against the iteration number.

ml display redisplays final results.

progname is the name of a program you write to evaluate the log-likelihood function. In this documentation, it is referred to as the user-written evaluator or sometimes simply as the evaluator. The program you write is written in the style required by the method you choose. The methods are lf, d0, d1, and d2. Thus, if you choose to use method lf, your program is called a method lf evaluator.

Method lf evaluators are required to evaluate the observation-by-observation log likelihood ln ℓj, j = 1, ..., N.

Method d0 evaluators are required to evaluate the overall log likelihood ln L.

Method d1 evaluators are required to evaluate the overall log likelihood and its gradient vector g = ∂ln L/∂b.

Method d2 evaluators are required to evaluate the overall log likelihood, its gradient, and its negative Hessian matrix -H = -∂²ln L/∂b∂b'.
mleval is a subroutine used by method d0, d1, and d2 evaluators to evaluate the coefficient vector b that they are passed.

mlsum is a subroutine used by method d0, d1, and d2 evaluators to define the value ln L that is to be returned.

mlvecsum is a subroutine used by method d1 and d2 evaluators to define the gradient vector g that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.

mlmatsum is a subroutine used by method d2 evaluators to define the negative Hessian matrix -H that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.
Options for use with ml model in interactive or noninteractive mode

robust and cluster(varname) specify the robust variance estimator, as does specifying pweights.
        If you have written a method lf evaluator, robust, cluster(), and pweights will work. There is nothing to do except specify the options.
        If you have written a method d0 evaluator, robust, cluster(), and pweights will not work. Specifying these options will result in an error message.
        If you have written a method d1 or d2 evaluator and the likelihood function satisfies the linear-form restrictions, robust, cluster(), and pweights will work only if you fill in the equation scores; otherwise, specifying these options will result in an error message.

title(string) specifies the title to be placed on the estimation output when results are complete.
nopreserve specifies that it is not necessary for ml to ensure that only the estimation subsample is in memory when the user-written likelihood evaluator is called. nopreserve is irrelevant when using method lf.
        For the other methods, if nopreserve is not specified, ml saves the data in a file (preserves the original dataset) and drops the irrelevant observations before calling the user-written evaluator. This way, even if the evaluator does not restrict its attention to the $ML_samp==1 subsample, results will still be correct. Later, ml automatically restores the original dataset.
        ml need not go through these machinations in the case of method lf because the user-written evaluator calculates observation-by-observation values and it is ml itself that sums the components.
        ml goes through these machinations if and only if the estimation sample is a subsample of the data in memory. If the estimation sample includes every observation in memory, ml does not preserve the original dataset. Thus, programmers must not damage the original dataset unless they preserve the data themselves.
        We recommend that interactive users of ml not specify nopreserve; the speed gain is not worth the chances of incorrect results. We recommend that programmers do specify nopreserve, but only after verifying that their evaluator really does restrict its attention solely to the $ML_samp==1 subsample.

collinear specifies that ml is not to remove the collinear variables within equations. There is no reason one would want to leave collinear variables in place, but this option is of interest to programmers who, in their code, have already removed collinear variables and thus do not want ml to waste computer time checking again.
missing specifies that observations containing variables with missing values are not to be eliminated from the estimation sample. There are two reasons one might want to specify missing:
        Programmers may wish to specify missing because, in other parts of their code, they have already eliminated observations with missing values and thus do not want ml to waste computer time looking again.
        All users may wish to specify missing if their model explicitly deals with missing values. Stata's heckman command is a good example of this. In such cases, there will be observations where missing values are allowed and other observations where they are not -- where their presence should cause the observation to be eliminated. If you specify missing, it is your responsibility to specify an if exp that eliminates the irrelevant observations.

lf0(#k #ll) is typically used by programmers. It specifies the number of parameters and the log-likelihood value of the constant-only model so that ml can report a likelihood-ratio test rather than a Wald test. These values may have been analytically determined, or they may have been determined by a previous estimation of the constant-only model on the estimation sample. Also see the continue option directly below. If you specify lf0(), it must be safe for you to specify the missing option, too, else how did you calculate the log likelihood for the constant-only model on the same sample?

waldtest(#) works like the above except that it forces a Wald test to be reported even if the information to perform the likelihood-ratio test is available and even if none of robust, cluster(), or pweights was specified. waldtest(k), k > 1, may not be specified with lf0().

constraints(numlist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined using the constraint command and are numbered; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

obs(#) is used mostly by programmers. It specifies that the number of observations reported, and ultimately stored in e(N), is to be #. Ordinarily, ml works that out for itself, and correctly. Programmers may want to specify this option when, in order for the likelihood evaluator to work for N observations, they first had to modify the dataset so that it contained a different number of observations.

noscvars is used mostly by programmers. It specifies that method d0, d1, or d2 is being used but that the likelihood evaluation program neither calculates nor uses arguments `g1', `g2', etc., which are the score vectors. Thus, ml can save a little time by not generating and passing those arguments.
Options for use with ml model in noninteractive mode

In addition to the above options, the following options are for use with ml model in noninteractive mode. Noninteractive mode is for programmers who use ml as a subroutine and want to issue a single command that will carry forth the estimation from start to finish.

maximize is not optional. It specifies noninteractive mode.

init(ml_init_args) sets the initial values b0. ml_init_args are whatever you would type after the ml init command.

search({on|quietly|off}) specifies whether ml search is to be used to improve the initial values. search(on) is the default and is equivalent to running separately ml search, repeat(0). search(quietly) is the same as search(on) except that it suppresses ml search's output. search(off) prevents the calling of ml search altogether.

repeat(#) is ml search's repeat() option and is relevant only if search(off) is not specified. repeat(0) is the default.

bounds(ml_search_bounds) is relevant only if search(off) is not specified. The command ml model issues is 'ml search ml_search_bounds, repeat(#)'. Specifying search bounds is optional.

difficult, nolog, trace, gradient, hessian, showstep, iterate(), ltolerance(), tolerance(), nowarning, novce, and score() are ml maximize's equivalent options.
Options for use when specifying equations

noconstant specifies that the equation is not to include an intercept.

offset(varname) specifies that the equation is to be xb + varname; that is, the equation is to include varname with coefficient constrained to be 1.

exposure(varname) is an alternative to offset(varname); it specifies that the equation is to be xb + ln(varname); that is, the equation is to include ln(varname) with coefficient constrained to be 1.
Options for use with ml search

repeat(#) specifies the number of random attempts that are to be made to find a better initial-value vector. The default is repeat(10).
        repeat(0) specifies that no random attempts are to be made. More correctly, repeat(0) specifies that no random attempts are to be made if the initial initial-value vector is a feasible starting point. If it is not, ml search will make random attempts even if you specify repeat(0) because it has no alternative. The repeat() option refers to the number of random attempts to be made to improve the initial values. When the initial starting-value vector is not feasible, ml search will make up to 1,000 random attempts to find starting values. It stops the instant it finds one set of values that works and then moves into its improve-initial-values logic.
        repeat(k), k > 0, specifies the number of random attempts to be made to improve the initial values.

nolog specifies that no output is to appear while ml search looks for better starting values. If you specify nolog and the initial starting-value vector is not feasible, ml search will ignore the fact that you specified the nolog option. If ml search must take drastic action to find starting values, it feels you should know about this even if you attempted to suppress its usual output.

trace specifies that you want more detailed output about ml search's actions than it would usually provide. This is more entertaining than useful. ml search prints a period each time it evaluates the likelihood function without obtaining a better result and a plus when it does.

restart specifies that random actions are to be taken to obtain starting values and that the resulting starting values are not to be a deterministic function of the current values. Users should not specify this option, mainly because, with restart, ml search intentionally does not produce as good a set of starting values as it could. restart is included for use by the optimizer when it gets into serious trouble. The random actions are to ensure that the actions of the optimizer and ml search, working together, do not result in a long, endless loop.
        restart implies norescale, which is why we recommend you do not specify restart. In testing, cases were discovered where rescale worked so well that, even after randomization, the rescaler would bring the starting values right back to where they had been the first time and so defeated the intended randomization.

norescale specifies that ml search is not to engage in its rescaling actions to improve the parameter vector. We do not recommend specifying this option because rescaling tends to work so well.
Options for use with ml plot

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.

Options for use with ml init

skip specifies that any parameters found in the specified initialization vector that are not also found in the model are to be ignored. The default action is to issue an error message.

copy specifies that the list of numbers or the initialization vector is to be copied into the initial-value vector by position rather than by name.
Options for use with ml maximize

difficult specifies that the likelihood function is likely to be difficult to maximize. In particular, difficult states that there may be regions where -H is not invertible and that, in those regions, ml's standard fixup may not work well. difficult specifies that a different fixup requiring substantially more computer time is to be used. For the majority of likelihood functions, difficult is likely to increase execution times unnecessarily. For other likelihood functions, specifying difficult is of great importance.

nolog, trace, gradient, hessian, and showstep control the display of the iteration log.
        nolog suppresses reporting of the iteration log.
        trace adds to the iteration log a report on the current parameter vector.
        gradient adds to the iteration log a report on the current gradient vector.
        hessian adds to the iteration log a report on the current negative Hessian matrix.
        showstep adds to the iteration log a report on the steps within iteration.

iterate(#), ltolerance(#), and tolerance(#) specify the definition of convergence. iterate(16000) tolerance(1e-6) ltolerance(1e-7) is the default. Convergence is declared when

        mreldif(b_i+1, b_i) <= tolerance()    or
        reldif{ln L(b_i+1), ln L(b_i)} <= ltolerance()
In addition, iteration stops when i == iterate(); in that case, results along with the message "convergence not achieved" are presented. The return code is still set to 0.

nowarning is allowed only with iterate(0). nowarning suppresses the "convergence not achieved" message. Programmers might specify iterate(0) nowarning when they have a vector b already containing the final estimates and want ml to calculate the variance matrix and post final estimation results. In that case, specify 'init(b) search(off) iterate(0) nowarning nolog'.

novce is allowed only with iterate(0). novce substitutes the zero matrix for the variance matrix, which in effect posts estimation results as fixed constants.

score(newvarlist) specifies that the equation scores are to be stored in the specified new variables. Either specify one new variable name per equation or specify a short name suffixed with a *. E.g., score(sc*) would be taken as specifying sc1 if there were one equation and sc1 and sc2 if there were two equations. In order to specify score(), either you must be using method lf, or the estimation subsample must be the entire dataset in memory, or you must have specified the nopreserve option.

nooutput suppresses display of the final results. This is different from prefixing ml maximize with quietly in that the iteration log is still displayed (assuming nolog is not specified).

level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

eform(string) is ml display's eform() option.

noclear specifies that after the model has converged, the ml problem definition is not to be cleared. Perhaps you are having convergence problems and intend to run the model to convergence. If so, use ml search to see if those values can be improved, and then start the estimation again.
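As a sketch of the posting recipe mentioned under nowarning, the noninteractive form would look like the following; myprog and the matrix b of already-computed final estimates are hypothetical, and the recipe itself is the one quoted above:

```stata
. ml model lf myprog (price = mpg weight), maximize init(b) search(off) iterate(0) nowarning nolog
```

ml then evaluates the likelihood once at b, computes the variance matrix, and posts the results exactly as if the maximization had been carried out from scratch.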
Options for use with ml graph

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.

Options for use with ml display

noheader suppresses the display of the header above the coefficient table that displays the final log-likelihood value, the number of observations, and the model significance test.

eform(string) displays the coefficient table in exponentiated form: for each coefficient, exp(b) rather than b is displayed, and standard errors and confidence intervals are transformed. Display of the intercept, if any, is suppressed. string is the table header that will be displayed above the transformed coefficients and must be 11 characters or fewer in length, for example, eform("Odds ratio").

first displays a coefficient table reporting results for the first equation only, and the report makes it appear that the first equation is the only equation. This is used by programmers who estimate ancillary parameters in the second and subsequent equations and will report the values of such parameters themselves.

neq(#) is an alternative to first. neq(#) displays a coefficient table reporting results for the first # equations. This is used by programmers who estimate ancillary parameters in the #+1st and subsequent equations and will report the values of such parameters themselves.

plus displays the coefficient table just as it would be ordinarily, but then, rather than ending the table in a line of dashes, ends it in dashes-plus-sign-dashes. This is so that programmers can write additional display code to add more results to the table and make it appear as if the combined result is one table. Programmers typically specify plus with options first or neq().

level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for use with mleval

eq(#) specifies the equation number i for which θij = xij*bi is to be evaluated. eq(1) is assumed if eq() is not specified.

scalar asserts that the ith equation is known to evaluate to a constant; the equation was specified as (), (name:), or /name on the ml model statement. If you specify this option, the new "variable" created is created as a scalar. If the ith equation does not evaluate to a scalar, an error message is issued.

Options for use with mlsum

noweight specifies that weights ($ML_w) are to be ignored when summing the likelihood function.

Options for use with mlvecsum

eq(#) specifies the equation for which a gradient vector ∂ln L/∂bi is to be constructed. The default is eq(1).
Options for use with mlmatsum

eq(#[,#]) specifies the equations for which the negative Hessian matrix is to be constructed. The default is eq(1), which means the same as eq(1,1), which means -∂²ln L/∂b1∂b1'. Specifying eq(i,j) results in -∂²ln L/∂bi∂bj'.
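As an illustration of the eq() option: in the Weibull model developed in the examples below, differentiating the gradient ∂ln ℓj/∂θ1j = pj(Mj - dj) once more gives -∂²ln ℓj/∂θ1j² = pj²Mj (a derivation not shown in the text itself), so the (1,1) block of a d2 evaluator could be sketched as follows, assuming the temporary variables `p' and `M' have already been generated as in those examples:

```stata
tempname d11
mlmatsum `lnf' `d11' = `p'^2*`M', eq(1)
```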
Remarks

For a thorough discussion of ml, see Maximum Likelihood Estimation with Stata (Gould and Sribney 1999). The book provides a tutorial introduction to ml, notes on advanced programming issues, and a discourse on maximum likelihood estimation from both theoretical and practical standpoints.

ml requires that you write a program that evaluates the log-likelihood function and, possibly, its first and second derivatives. The style of the program you write depends upon the method chosen; methods lf and d0 require that your program evaluate the log likelihood only; method d1 requires that your program evaluate the log likelihood and gradient; method d2 requires that your program evaluate the log likelihood, gradient, and negative Hessian. Methods lf and d0 differ from each other in that, with method lf, your program is required to produce observation-by-observation log-likelihood values ln ℓj and it is assumed that ln L = Σj ln ℓj; with method d0, your program is required to produce the overall value ln L.

Once you have written the program -- called an evaluator -- you define a model to be estimated using ml model and obtain estimates using ml maximize. You might type

        . ml model ...
        . ml maximize

but we recommend that you type

        . ml model ...
        . ml check
        . ml search
        . ml maximize

ml check will verify that your evaluator has no obvious errors, and ml search will find better initial values.

You fill in the ml model statement with (1) the method you are using, (2) the name of your program, and (3) the "equations". You write your evaluator in terms of θ1, θ2, ..., each of which has a linear equation associated with it. That linear equation might be as simple as θi = b0, or it might be θi = b1*mpg + b2*weight + b3, or it might omit the intercept b3. The equations are specified in parentheses on the ml model line.

Suppose you are using method lf and the name of your evaluator program is myprog. The following statement

        . ml model lf myprog (mpg weight)

would specify a single equation with θi = b1*mpg + b2*weight + b3. If you wanted to omit b3, you would type

        . ml model lf myprog (mpg weight, nocons)

and if all you wanted was θi = b0, you would type

        . ml model lf myprog ()

With multiple equations, you list the equations one after the other; so if you typed

        . ml model lf myprog (mpg weight) ()
you would be specifying θ1 = b1*mpg + b2*weight + b3 and θ2 = b4. You would write your likelihood in terms of θ1 and θ2. If the model were linear regression, θ1 might be the xb part and θ2 the variance of the residuals.

When you specify the equations, you also specify any dependent variables. If you type

        . ml model lf myprog (price = mpg weight) ()

price would be the one and only dependent variable, and that would be passed to your program in $ML_y1. If your model had two dependent variables, you could type

        . ml model lf myprog (price displ = mpg weight) ()

and then $ML_y1 would be price and $ML_y2 would be displ. You can specify however many dependent variables are necessary and specify them on any equation. It does not matter on which equation you specify them; the first one specified is placed in $ML_y1, the second in $ML_y2, and so on.

Example

Using method lf, we are to produce observation-by-observation values of the log likelihood. The probit log-likelihood function is

        ln ℓj = ln Φ(θ1j)       if yj = 1
        ln ℓj = ln Φ(-θ1j)      if yj = 0
        θ1j = xj*b
The following is the method lf evaluator for this likelihood function:

        program define myprobit
                version 7
                args lnf theta1
                quietly replace `lnf' = ln(norm(`theta1'))  if $ML_y1==1
                quietly replace `lnf' = ln(norm(-`theta1')) if $ML_y1==0
        end

If we wanted to estimate a model of foreign on mpg and weight, we would type

        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize

The 'foreign =' part specifies that y is foreign. The 'mpg weight' part specifies that θ1j = b1*mpg_j + b2*weight_j + b3. The result of running this is
        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize

        initial:       log likelihood = -51.292891
        alternative:   log likelihood = -45.055272
        rescale:       log likelihood = -45.055272
        Iteration 0:   log likelihood = -45.055272
        Iteration 1:   log likelihood =   -27.9041
        Iteration 2:   log likelihood =   -26.8578
        Iteration 3:   log likelihood = -26.844191
        Iteration 4:   log likelihood = -26.844189
        Iteration 5:   log likelihood = -26.844189

                                                     Number of obs   =         74
                                                     Wald chi2(2)    =      20.75
        Log likelihood = -26.844189                  Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
             foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
              weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
               _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
        ------------------------------------------------------------------------------
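Since this is just the probit likelihood, the coefficients above can be checked against Stata's built-in command, which should reproduce them on the same data (assuming the automobile dataset is loaded):

```stata
. probit foreign mpg weight
```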
Example

A two-equation, two-dependent-variable model is little different. Rather than receiving one theta, our program will receive two. Rather than there being one dependent variable in $ML_y1, there will be dependent variables in $ML_y1 and $ML_y2. For instance, the Weibull regression log-likelihood function is

        ln ℓj = -(tj e^(-θ1j))^exp(θ2j) + dj{θ2j - θ1j + (e^θ2j - 1)(ln tj - θ1j)}
        θ1j = xj*b1
        θ2j = s

where tj is the time of failure or censoring and dj = 1 if failure and 0 if censored. We can make the log likelihood a little easier to program by introducing some extra variables:

        pj = exp(θ2j)
        Mj = {tj exp(-θ1j)}^pj
        Rj = ln tj - θ1j
        ln ℓj = -Mj + dj{θ2j - θ1j + (pj - 1)Rj}

The method lf evaluator for this is

        program define myweib
                version 7.0
                args lnf theta1 theta2
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = ($ML_y1*exp(-`theta1'))^`p'
                quietly gen double `R' = ln($ML_y1) - `theta1'
                quietly replace `lnf' = -`M' + $ML_y2*(`theta2'-`theta1' + (`p'-1)*`R')
        end
We can estimate a model by typing

        . ml model lf myweib (studytime died = drug2 drug3 age) ()
        . ml maximize

Note that we specified '()' for the second equation. The second equation corresponds to the Weibull shape parameter s, and the linear combination we want for s contains just an intercept. Alternatively, we could type

        . ml model lf myweib (studytime died = drug2 drug3 age) /s

Typing /s means the same thing as typing (s:), and both really mean the same thing as (). The s, either after a slash or in parentheses before a colon, labels the equation. It makes the output look prettier, and that is all:
        . ml model lf myweib (studytime died = drug2 drug3 age) /s
        . ml maximize

        initial:       log likelihood =       -744
        alternative:   log likelihood = -356.14276
        rescale:       log likelihood = -200.80201
        rescale eq:    log likelihood = -136.69234
        Iteration 0:   log likelihood = -136.69234  (not concave)
        Iteration 1:   log likelihood =   -124.117
        Iteration 2:   log likelihood = -113.88918
        Iteration 3:   log likelihood =   -110.303
        Iteration 4:   log likelihood = -110.26737
        Iteration 5:   log likelihood = -110.26736
        Iteration 6:   log likelihood = -110.26736

                                                     Number of obs   =         48
                                                     Wald chi2(3)    =      35.25
        Log likelihood = -110.26736                  Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
                     |      Coef.   Std. Err.       z     P>|z|    [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        eq1          |
               drug2 |   1.012966   .2903917     3.488    0.000    .4438086    1.582123
               drug3 |    1.45917   .2821195     5.172    0.000    .9062261    2.012114
                 age |  -.0671728   .0205688    -3.266    0.001   -.1074868   -.0268587
               _cons |   6.060723   1.152845     5.257    0.000    3.801188    8.320269
        -------------+----------------------------------------------------------------
        s            |
               _cons |   .5573333   .1402154     3.975    0.000    .2825162    .8321504
        ------------------------------------------------------------------------------

Example

Method d0 evaluators receive b = (b1, b2, ..., bE), the coefficient vector, rather than the already evaluated θ1, θ2, ..., and they are required to evaluate the overall log likelihood ln L rather than ln ℓj.

Use mleval to produce the thetas from the coefficient vector. Use mlsum to sum the components that enter into ln L. In the case of Weibull, ln L = Σj ln ℓj, and our method d0 evaluator is

        program define weib0
                version 7.0
                args todo b lnf
                tempvar theta1 theta2
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)
                local t "$ML_y1"        /* this is just for readability */
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = (`t'*exp(-`theta1'))^`p'
                quietly gen double `R' = ln(`t') - `theta1'
                mlsum `lnf' = -`M' + `d'*(`theta2'-`theta1' + (`p'-1)*`R')
        end

To estimate our model using this evaluator, we would type

        . ml model d0 weib0 (studytime died = drug2 drug3 age) /s
Technical Note

Method d0 does not require ln L = Σj ln ℓj, j = 1, ..., N, as method lf does. Your likelihood function might have independent components only for groups of observations. Panel data estimators have a log-likelihood value ln L = Σi ln Li, where i indexes the panels, each of which contains multiple observations. Conditional logistic regression has ln L = Σk ln Lk, where k indexes the risk pools. Cox regression has ln L = Σ(t) ln L(t), where (t) denotes the ordered failure times.

To evaluate such likelihood functions, first calculate the within-group log-likelihood contributions. This usually involves generate and replace statements prefixed with by, as in

        tempvar sumd
        by group: gen double `sumd' = sum($ML_y1)

Structure your code so that the log-likelihood contributions are recorded in the last observation of each group. Let's pretend that variable is named `cont'. To sum the contributions, code

        tempvar last
        quietly by group: gen byte `last' = (_n==_N)
        mlsum `lnf' = `cont' if `last'

It is of great importance that you inform mlsum as to the observations that contain log-likelihood values to be summed. First, you do not want to include intermediate results in the sum. Second, mlsum does not skip missing values. Rather, if mlsum sees a missing value among the contributions, it sets the overall result `lnf' to missing. That is how ml maximize is informed that the likelihood function could not be evaluated at the particular value of b. ml maximize will then take action to escape from what it thinks is an infeasible area of the likelihood function.

When the likelihood function violates the linear-form restriction ln L = Σj ln ℓj, j = 1, ..., N, with ln ℓj being a function solely of values within the jth observation, use method d0. In the following examples we will demonstrate methods d1 and d2 with likelihood functions that meet this linear-form restriction. The d1 and d2 methods themselves do not require the linear-form restriction, but the utility routines mlvecsum and mlmatsum do. Using method d1 or d2 when the restriction is violated is a difficult programming exercise.
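Pulling the pieces of this note together, a grouped-likelihood d0 evaluator might be laid out as follows. This is a schematic sketch only: group, the single-equation setup, and the `...' formula for the contributions stand in for model-specific code:

```stata
program define mygroup0
        version 7
        args todo b lnf
        tempvar theta1 cont last
        mleval `theta1' = `b', eq(1)
        sort group
        /* within-group contribution, recorded in the last observation */
        quietly by group: gen double `cont' = ...
        quietly by group: gen byte `last' = (_n==_N)
        mlsum `lnf' = `cont' if `last'
end
```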
Example

Method d1 evaluators are required to produce the gradient vector g = ∂ln L/∂b as well as the overall log-likelihood value. Using mlvecsum, we can obtain ∂ln L/∂b from ∂ln ℓj/∂θij, i = 1, ..., E. The derivatives of the Weibull log-likelihood function are

        ∂ln ℓj/∂θ1j = pj(Mj - dj)
        ∂ln ℓj/∂θ2j = dj - Rj pj(Mj - dj)

The method d1 evaluator for this is

        program define weib1
                version 7
                args todo b lnf g               /* g is new */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2
                mlvecsum `lnf' `d1' = `p'*(`M'-`d'), eq(1)
                mlvecsum `lnf' `d2' = `d' - `R'*`p'*(`M'-`d'), eq(2)
                matrix `g' = (`d1',`d2')
        end
                                                  Number of obs   =        616
                                                  LR chi2(2)      =       9.62
                                                  Prob > chi2     =     0.0081
                                                  Pseudo R2       =     0.0086

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Indemnity    |
    nonwhite |  -.6608212   .2157321    -3.06   0.002    -1.083648   -.2379942
       _cons |   .1879149   .0937644     2.00   0.045     .0041401    .3716896
-------------+----------------------------------------------------------------
Uninsure     |
    nonwhite |  -.2828628   .3877302    -0.71   0.477      -1.0624    .4966741
       _cons |  -1.754019   .1805145    -9.72   0.000    -2.107821   -1.400217
------------------------------------------------------------------------------
(Outcome insure==Prepaid is the comparison group)
The basecategory() option requires that we specify the numeric value of the category, so we could not type basecategory(Prepaid).

Although the coefficients now appear to be different, note that the summary statistics reported at the top are identical. With this parameterization the probability of prepaid insurance for whites is
        Pr(insure = Prepaid) = 1/(1 + e^.188 + e^-1.754) = 0.420

This is the same answer we obtained previously.
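A quick check of this arithmetic (a Python sketch, not Stata output): with Prepaid as the base category, its probability for whites depends only on the two constants, since nonwhite = 0:

```python
import math

# rounded _cons values from the coefficient table above
b_indemnity = 0.188     # _cons in the Indemnity equation
b_uninsure = -1.754     # _cons in the Uninsure equation

# base-category probability in a multinomial logit
p_prepaid = 1 / (1 + math.exp(b_indemnity) + math.exp(b_uninsure))
```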
> Example

By specifying rrr, which we can do at estimation time or when we redisplay results, we see the model in terms of relative risk ratios:

. mlogit, rrr

Multinomial regression                            Number of obs   =        616
                                                  LR chi2(2)      =       9.62
                                                  Prob > chi2     =     0.0081
Log likelihood = -551.78348                       Pseudo R2       =     0.0086

------------------------------------------------------------------------------
      insure |        RRR   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Indemnity    |
    nonwhite |    .516427   .1114099    -3.06   0.002     .3383588    .7882073
-------------+----------------------------------------------------------------
Uninsure     |
    nonwhite |   .7536232   .2997387    -0.71   0.477     .3456254    1.643247
------------------------------------------------------------------------------
(Outcome insure==Prepaid is the comparison group)
Looked at this way, the relative risk of choosing an indemnity over a prepaid plan is 0.52 for nonwhites relative to whites.
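The rrr figures are simply the exponentiated coefficients from the earlier, coefficient-metric output; a Python sketch of the conversion (values transcribed from the output above):

```python
import math

# coefficients on nonwhite from the earlier coefficient-metric table
coef_indemnity_nonwhite = -0.6608212
coef_uninsure_nonwhite = -0.2828628

# relative risk ratios, as reported by mlogit, rrr
rrr_indemnity = math.exp(coef_indemnity_nonwhite)
rrr_uninsure = math.exp(coef_uninsure_nonwhite)
```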
> Example

One of the advantages of mlogit over tabulate is that continuous variables can be included in the model, as can multiple categorical variables. In examining the data on insurance choice, you decide you want to control for age, gender, and site of study (the study was conducted in three sites):

. mlogit insure age male nonwhite site2 site3

Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -534.72983
Iteration 2:   log likelihood = -534.36536
Iteration 3:   log likelihood = -534.36165
Iteration 4:   log likelihood = -534.36165

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(10)     =      42.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -534.36165                       Pseudo R2       =     0.0387

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
        male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
    nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
       site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
       site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
       _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0077961   .0114418    -0.68   0.496    -.0302217    .0146294
        male |   .4518496   .3674867     1.23   0.219     -.268411     1.17211
    nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725     1.05129
       site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751   -.2893747
       site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327     .510108
       _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872   -.1260135
------------------------------------------------------------------------------
(Outcome insure==Indemnity is the comparison group)
These results suggest that the inclination of nonwhites to choose prepaid care is even stronger than it was without controlling. We also see that subjects in site 2 are less likely to be uninsured.

Prediction can be used to aid interpretation. Continuing with our previously estimated insurance-choice model, we wish to describe the model's predictions by race. For this purpose, we can use the "method of recycled predictions", in which we vary characteristics of interest across the whole dataset and average the predictions. That is, we have data on both whites and nonwhites, and our individuals have other characteristics as well. We will first pretend that all the people in our data are white but hold their other characteristics constant. We then calculate the probabilities of each outcome. Next we will pretend that all the people in our data are nonwhite, still holding their other characteristics constant. Again we calculate the probabilities of each outcome. The difference in those two sets of calculated probabilities, then, is the difference due to race, holding other characteristics constant.
. gen byte nonwhold = nonwhite              /* save real race */
. replace nonwhite = 0                      /* make everyone white */
(126 real changes made)
. predict wpind, outcome(Indemnity)         /* predict probabilities */
(option p assumed; predicted probability)
(1 missing value generated)
. predict wpp, outcome(Prepaid)
(option p assumed; predicted probability)
(1 missing value generated)
. predict wpnoi, outcome(Uninsure)
(option p assumed; predicted probability)
(1 missing value generated)
. replace nonwhite = 1                      /* make everyone nonwhite */
(644 real changes made)
. predict nwpind, outcome(Indemnity)
(option p assumed; predicted probability)
(1 missing value generated)
. predict nwpp, outcome(Prepaid)
(option p assumed; predicted probability)
(1 missing value generated)
. predict nwpnoi, outcome(Uninsure)
(option p assumed; predicted probability)
(1 missing value generated)
. replace nonwhite = nonwhold               /* restore real race */
(518 real changes made)
. summarize wp* nwp*

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
       wpind |     643    .5141673    .0872679   .3092903     .71939
         wpp |     643    .4082052    .0943286   .1964103   .6502247
       wpnoi |     643    .0776275    .0360283   .0273596   .1302816
      nwpind |     643    .3112809    .0817693   .1511329    .535021
        nwpp |     643     .630078    .0959976   .3871782   .8278881
      nwpnoi |     643    .0586411    .0287185   .0209648   .0933874
Earlier in this entry we presented a cross-tabulation of insurance type and race. Those values were unadjusted. The means reported above are the values adjusted for age, sex, and site. Combining the results gives

                   Unadjusted             Adjusted
                white   nonwhite      white   nonwhite
   Indemnity     .51       .36         .52       .31
   Prepaid       .42       .57         .41       .63
   Uninsured     .07       .07         .08       .06

We find, for instance, that while 57% of nonwhites in our data had prepaid plans, after adjusting for age, sex, and site, 63% of nonwhites choose prepaid plans.
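The recycled-predictions logic can be sketched outside Stata as well. Below is an illustrative Python fragment (not Stata code) that uses the estimated coefficients transcribed from the mlogit output above to compute outcome probabilities for one hypothetical covariate pattern — a 30-year-old female in site 1, an assumption made purely for illustration — under both race settings; the manual's example applies the same switch to every observation and then averages:

```python
import math

# coefficients transcribed from the mlogit output (base outcome: Indemnity)
prepaid = {"age": -0.011745, "male": 0.5616934, "nonwhite": 0.9747768,
           "site2": 0.1130359, "site3": -0.5879879, "_cons": 0.2697127}
uninsure = {"age": -0.0077961, "male": 0.4518496, "nonwhite": 0.2170589,
            "site2": -1.211563, "site3": -0.2078123, "_cons": -1.286943}

def probs(x):
    """Multinomial logit probabilities for one covariate pattern x."""
    xb = [0.0]  # Indemnity is the base outcome, so its linear index is 0
    for eq in (prepaid, uninsure):
        xb.append(sum(eq[k] * v for k, v in x.items()) + eq["_cons"])
    denom = sum(math.exp(v) for v in xb)
    return [math.exp(v) / denom for v in xb]  # [Indemnity, Prepaid, Uninsure]

# hypothetical person: age 30, female, site 1 (site2 = site3 = 0)
x = {"age": 30, "male": 0, "site2": 0, "site3": 0}
p_white = probs(dict(x, nonwhite=0))
p_nonwhite = probs(dict(x, nonwhite=1))
```

The difference p_nonwhite - p_white, averaged over the real covariate patterns in the data, is exactly the race effect the recycled-predictions exercise reports.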
Technical Note

Classification of predicted values, followed by comparison of the classifications with the observed outcomes, is a second way predicted values can help interpret a multinomial logit model. This is a variation on the notions of sensitivity and specificity for logistic regression. Here, we will adopt a three-part classification with respect to indemnity and prepaid: definitely predicting indemnity, definitely predicting prepaid, and ambiguous.
. predict indem, outcome(Indemnity) index            /* obtain indexes */
(1 missing value generated)
. predict prepaid, outcome(Prepaid) index
(1 missing value generated)
. gen diff = prepaid-indem                           /* obtain difference */
(1 missing value generated)
. predict sediff, outcome(Indemnity,Prepaid) stddp   /* & its standard error */
(1 missing value generated)
. gen type = 1 if diff/sediff < -1.96                /* definitely indemnity */
(504 missing values generated)
. replace type = 3 if diff/sediff > 1.96 & diff/sediff!=.   /* definitely prepaid */
(100 real changes made)
. replace type = 2 if type==. & diff/sediff!=.       /* ambiguous values */
(404 real changes made)
. label define type 1 "Def Ind" 2 "Ambiguous" 3 "Def Prep"  /* label results */
. label values type type
. tabulate insure type

              |               type
       insure |   Def Ind  Ambiguous   Def Prep |     Total
--------------+---------------------------------+----------
    Indemnity |        78        183         33 |       294
      Prepaid |        44        177         56 |       277
     Uninsure |        12         28          5 |        45
--------------+---------------------------------+----------
        Total |       134        388         94 |       616
One substantive point learned by this exercise is that the predictive power of this model is modest. There are a substantial number of misclassifications in both directions, though there are more correctly classified observations than misclassified observations. A second interesting point is that the uninsured look overwhelmingly as though they might have come from the indemnity system rather than the prepaid system.
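The three-way rule used above can be restated compactly. Here is an illustrative Python version of the classification (the 1.96 cutoff and the type labels mirror the Stata commands; diff/sediff plays the role of a z statistic):

```python
def classify(diff, sediff, cutoff=1.96):
    """Classify one observation from the index difference
    (prepaid - indemnity) and its standard error."""
    if diff is None or sediff is None or sediff == 0:
        return None               # cannot classify
    z = diff / sediff
    if z < -cutoff:
        return "Def Ind"          # definitely indemnity
    if z > cutoff:
        return "Def Prep"         # definitely prepaid
    return "Ambiguous"
```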
Testing hypotheses about coefficients

> Example

Hypotheses about the coefficients are tested with test just as they are after any estimation command; see [R] test. The only important point to note is test's syntax for dealing with multiple-equation models. You are warned that test bases its results on the estimated covariance matrix and that a likelihood-ratio test may be preferred; see Estimating constrained models below for an example of lrtest.
If one simply lists variables after the test command, one is testing that the corresponding coefficients are zero across all equations:

. test site2 site3

 ( 1)  [Prepaid]site2 = 0.0
 ( 2)  [Uninsure]site2 = 0.0
 ( 3)  [Prepaid]site3 = 0.0
 ( 4)  [Uninsure]site3 = 0.0

           chi2(  4) =    19.74
         Prob > chi2 =    0.0006

One can test that all the coefficients (except the constant) in a single equation are zero by simply typing the outcome in square brackets:
. test [Uninsure]

 ( 1)  [Uninsure]age = 0.0
 ( 2)  [Uninsure]male = 0.0
 ( 3)  [Uninsure]nonwhite = 0.0
 ( 4)  [Uninsure]site2 = 0.0
 ( 5)  [Uninsure]site3 = 0.0

           chi2(  5) =     9.31
         Prob > chi2 =    0.0973

Specification of the outcome is just as with predict; you can specify the label if the outcome variable is labeled, or you can specify the numeric value of the outcome. We would have obtained the same test as above had we typed test [3], since 3 is the value of insure for the outcome uninsured.
The two syntaxes can be combined. To test that the coefficients on the site variables are 0 in the equation corresponding to the outcome prepaid, we can type

. test [Prepaid]: site2 site3

 ( 1)  [Prepaid]site2 = 0.0
 ( 2)  [Prepaid]site3 = 0.0

           chi2(  2) =    10.78
         Prob > chi2 =    0.0046
We specified the outcome and then followed that with a colon and the variables we wanted to test.

We can also test that coefficients are equal across equations. To test that all coefficients except the constant are equal for the prepaid and uninsured outcomes:

. test [Prepaid=Uninsure]

 ( 1)  [Prepaid]age - [Uninsure]age = 0.0
 ( 2)  [Prepaid]male - [Uninsure]male = 0.0
 ( 3)  [Prepaid]nonwhite - [Uninsure]nonwhite = 0.0
 ( 4)  [Prepaid]site2 - [Uninsure]site2 = 0.0
 ( 5)  [Prepaid]site3 - [Uninsure]site3 = 0.0

           chi2(  5) =    13.80
         Prob > chi2 =    0.0169
To test that only the site variables are equal:

. test [Prepaid=Uninsure]: site2 site3

 ( 1)  [Prepaid]site2 - [Uninsure]site2 = 0.0
 ( 2)  [Prepaid]site3 - [Uninsure]site3 = 0.0

           chi2(  2) =    12.68
         Prob > chi2 =    0.0018
Finally, we can test any arbitrary constraint by simply entering the equation, specifying the coefficients as described in [U] 16.5 Accessing coefficients and standard errors. The following hypothesis is senseless but illustrates the point:

. test ([Prepaid]age+[Uninsure]site2)/2 = 2-[Uninsure]nonwhite

 ( 1)  .5 [Prepaid]age + [Uninsure]nonwhite + .5 [Uninsure]site2 = 2.0

           chi2(  1) =    22.45
         Prob > chi2 =    0.0000
Please see [R] test for more information on test. All that is said there about combining hypotheses across test commands (the accum option) is relevant after mlogit.
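For a single linear constraint, the Wald chi-squared that test reports is simply the square of the corresponding z statistic from the estimation output. A quick Python check using the [Prepaid]site3 estimate (values transcribed from the output, so agreement is to rounding):

```python
# coefficient and standard error on site3 in the Prepaid equation
b, se = -0.5879879, 0.2279351

z = b / se            # the z statistic mlogit reports (about -2.58)
wald_chi2 = z ** 2    # the chi2(1) statistic test would report for b = 0
```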
Estimating constrained models

mlogit can estimate models with subsets of coefficients constrained to be zero, with subsets of coefficients constrained to be equal both within and across equations, and with subsets of coefficients arbitrarily constrained to equal linear combinations of other estimated coefficients.

Before estimating a constrained model, you define the constraints using the constraint command; see [R] constraint. Constraints are numbered, and the syntax for specifying a constraint is exactly the same as the syntax for testing constraints; see Testing hypotheses about coefficients above. Once the constraints are defined, you estimate using mlogit, specifying the constraint() option. Typing constraint(4) would use the constraint you previously saved as 4. Typing constraint(1,4,6) would use the previously stored constraints 1, 4, and 6. Typing constraint(1-4,6) would use the previously stored constraints 1, 2, 3, 4, and 6.

Sometimes, you will not be able to specify the constraints without knowledge of the omitted group. In such cases, assume the omitted group is whatever group is convenient for you and include the basecategory() option when you type the mlogit command.
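The constraint() option accepts numeric lists like 1-4,6. The expansion rule can be sketched in Python (an illustrative helper, not part of Stata):

```python
def expand_constraint_list(spec):
    """Expand a Stata-style numeric list such as "1-4,6"
    into the constraint numbers it denotes."""
    numbers = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = (int(p) for p in part.split("-"))
            numbers.extend(range(lo, hi + 1))   # a-b means a, a+1, ..., b
        else:
            numbers.append(int(part))
    return numbers
```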
> Example

Among other things, constraints can be used as a means of hypothesis testing. In our insurance-choice model, we tested the hypothesis that there is no distinction between having indemnity insurance and being uninsured. We did this with the test command. Indemnity-style insurance was the omitted group, so we typed

. test [Uninsure]

 ( 1)  [Uninsure]age = 0.0
 ( 2)  [Uninsure]male = 0.0
 ( 3)  [Uninsure]nonwhite = 0.0
 ( 4)  [Uninsure]site2 = 0.0
 ( 5)  [Uninsure]site3 = 0.0

           chi2(  5) =     9.31
         Prob > chi2 =    0.0973
(Had indemnity not been the omitted group, we would have typed test [Uninsure=Indemnity].)

The results produced by test are based on the estimated covariance matrix of the coefficients, that is, an approximation. Since the probability of being uninsured is quite low, the log likelihood may be nonlinear for the uninsured. Conventional statistical wisdom is not to trust the asymptotic answer under these circumstances, but to perform a likelihood-ratio test instead. Stata has a likelihood-ratio test command, lrtest; to use it we must estimate both the unconstrained and the constrained models. The unconstrained model is what we have previously estimated. Following the instruction in [R] lrtest, we first save the unconstrained model results:

. lrtest, saving(0)
To estimate the constrained model, we must re-estimate our model with all the coefficients except the constant set to 0 in the Uninsure equation. We define the constraint and then re-estimate:

. constraint define 1 [Uninsure]
. mlogit insure age male nonwhite site2 site3, constr(1)
 ( 1)  [Uninsure]age = 0.0
 ( 2)  [Uninsure]male = 0.0
 ( 3)  [Uninsure]nonwhite = 0.0
 ( 4)  [Uninsure]site2 = 0.0
 ( 5)  [Uninsure]site3 = 0.0

Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -539.80523
Iteration 2:   log likelihood = -539.75644
Iteration 3:   log likelihood = -539.75643

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(5)      =      32.20
                                                  Prob > chi2     =     0.0000
Log likelihood = -539.75643                       Pseudo R2       =     0.0290

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |  -.0107025   .0060039    -1.78   0.075    -.0224699    .0010649
        male |   .4963616   .1939683     2.56   0.010     .1161908    .8765324
    nonwhite |    .942137   .2252094     4.18   0.000     .5007347    1.383539
       site2 |   .2530912   .2029465     1.25   0.212    -.1446767    .6508591
       site3 |  -.5521774   .2187237    -2.52   0.012    -.9808678   -.1234869
       _cons |   .1792752   .3171372     0.57   0.572    -.4423023    .8008527
-------------+----------------------------------------------------------------
Uninsure     |
         age |  (dropped)
        male |  (dropped)
    nonwhite |  (dropped)
       site2 |  (dropped)
       site3 |  (dropped)
       _cons |   -1.87351   .1601099   -11.70   0.000     -2.18732     -1.5597
------------------------------------------------------------------------------
(Outcome insure==Indemnity is the comparison group)
We can now perform the likelihood-ratio test:

. lrtest
Mlogit:  likelihood-ratio test                    chi2(5)      =      10.79
                                                  Prob > chi2  =     0.0557

The likelihood-ratio chi-squared is 10.79 with 5 degrees of freedom, just slightly greater than the magic .05 level. Thus, we should not call this difference significant.
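The likelihood-ratio statistic is twice the difference between the unconstrained and constrained log likelihoods reported above; the arithmetic is easily verified (a Python check on the transcribed values):

```python
ll_unconstrained = -534.36165   # from the unconstrained mlogit run
ll_constrained = -539.75643     # from the run with constraint 1 imposed

# LR chi-squared with 5 degrees of freedom (5 constrained coefficients)
lr_chi2 = 2 * (ll_unconstrained - ll_constrained)
```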
Technical Note

In certain circumstances, a multinomial logit model should be estimated with conditional logit; see [R] clogit. With substantial data manipulation, clogit is capable of handling the same class of models with some interesting additions. For example, if we had available the price and deductible of the most competitive insurance plan of each type, this information could not be used by mlogit but could be incorporated by clogit.
Saved Results

mlogit saves in e():

Scalars
    e(N)            number of observations
    e(k_cat)        number of categories
    e(df_m)         model degrees of freedom
    e(r2_p)         pseudo R-squared
    e(ll)           log likelihood
    e(ll_0)         log likelihood, constant-only model
    e(N_clust)      number of clusters
    e(chi2)         chi-squared
    e(ibasecat)     base category number
    e(basecat)      the value of depvar that is to be treated as the base category

Macros
    e(cmd)          mlogit
    e(depvar)       name of dependent variable
    e(wtype)        weight type
    e(wexp)         weight expression
    e(clustvar)     name of cluster variable
    e(vcetype)      covariance estimation method
    e(chi2type)     Wald or LR; type of model chi-squared test
    e(predict)      program used to implement predict

Matrices
    e(b)            coefficient vector
    e(V)            variance-covariance matrix of the estimators
    e(cat)          category values

Functions
    e(sample)       marks estimation sample
Methods and Formulas

The model for multinomial logit is

        Pr(Y_i = k) = exp(x_i b^(k)) / Σ_{j=1}^{m} exp(x_i b^(j))

where b^(j) is the coefficient vector for outcome j and b is set to 0 for the base category. This model is described in Greene (2000, chapter 19). Newton-Raphson maximum likelihood is used; see [R] maximize.
In the case of constrained equations, the set of constraints is orthogonalized and a subset of maximizable parameters is selected. For example, a parameter that is constrained to zero is not a maximizable parameter. If two parameters are constrained to be equal to each other, only one is a maximizable parameter.
Let r be the vector of maximizable parameters. Note that r is physically a subset of the solution parameters b. A matrix T and a vector m are defined

        b = T r + m

with the consequence that

        df/dr = (df/db) T,        d2f/dr2 = T' (d2f/db2) T

T consists of a block form in which one part is a permutation of the identity matrix and the other part describes how to calculate the constrained parameters from the maximizable parameters.
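As a small illustration of this reparameterization (a hypothetical three-parameter example, not taken from the manual): with the constraints b1 = 0 and b2 = b3, the only maximizable parameter is r1, and b = Tr + m with T = (0, 1, 1)' and m = 0:

```python
# b = T r + m for the hypothetical constraints b1 = 0 and b2 = b3
T = [[0.0],   # b1 does not depend on r1 (it is fixed at 0 via m)
     [1.0],   # b2 = r1
     [1.0]]   # b3 = r1
m = [0.0, 0.0, 0.0]

def to_b(r):
    """Map maximizable parameters r to solution parameters b."""
    return [sum(T[i][j] * r[j] for j in range(len(r))) + m[i]
            for i in range(len(T))]

b = to_b([2.5])   # one maximizable parameter determines all three b's
```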
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Hamilton, L. C. 1993. sqv8: Interpreting multinomial logistic regression. Stata Technical Bulletin 13: 24-28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 176-181.

Hendrickx, J. 2000. sbe37: Special restrictions in multinomial logistic regression. Stata Technical Bulletin 56: 18-26.

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. Journal of the American Medical Association 262: 925-930.

Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302.
Also See

Complementary:   [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict,
                 [R] test, [R] testnl, [R] xi

Related:         [R] clogit, [R] logistic, [R] nlogit, [R] ologit, [R] svy estimators

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.12 Obtaining scores,
                 [R] maximize
Title

more -- The --more-- message

Syntax

        set more { on | off }

        set pagesize #

Description

set more on, which is the default, tells Stata to wait until a key is pressed before continuing when a --more-- message is displayed.

set more off tells Stata not to pause or display the --more-- message.

set pagesize # sets the number of lines between --more-- messages.
Remarks

When you see --more-- at the bottom of the screen,

        Press                           and Stata ...
        ---------------------------------------------------------------
        letter l or Enter               displays the next line
        letter q                        acts as if you pressed Break
        space bar or any other key      displays the next screen

In addition, you can press the More button, or click on --more--, to display the next screen.

--more-- is Stata's way of telling you it has something more to show you, but that showing you that something more will cause the information on the screen to scroll off. If you type set more off, --more-- conditions will never arise and Stata's output will scroll by at full speed. If you type set more on, --more-- conditions will be restored at the appropriate places.

Programmers should see [P] more for information on the more programming command.

Also See

Complementary:   [R] query, [P] more

Background:      [U] 10 --more-- conditions
Title

mvencode -- Change missing to coded missing value and vice versa

Syntax

        mvencode varlist [if exp] [in range] , mv(#) [ override ]

        mvdecode varlist [if exp] [in range] , mv(#)

Description

mvencode changes all occurrences of missing to # in the specified varlist.

mvdecode changes all occurrences of # to missing in the specified varlist.

Options

mv(#) specifies the numeric value to which or from which missing is to be changed and is not optional.

override specifies that the protection provided by mvencode is to be overridden. Without this option, mvencode refuses to make the requested change if # is already used in the data.

Remarks

One occasionally reads data where missing (e.g., failed to answer a survey question, or the data were not collected, or whatever) is coded with a special numeric value. Popular codings are 9, 99, 999, -99, and the like. If missing were encoded as -99,

        . mvdecode _all, mv(-99)

would translate the special code to the Stata missing value. Use this command cautiously since, even if -99 were not a special code, all -99's in the data would be changed to missing.

Conversely, one occasionally needs to export data to software that does not understand that '.' is Stata's missing value, so one codes missing with a special numeric value. To change all missings to -99:

        . mvencode _all, mv(-99)

mvencode is smart: it will automatically recast variables upward if necessary, so even if a variable is stored as a byte, its missing values can be recoded to, say, 999. In addition, mvencode refuses to make the change if # (-99 in this case) is already used in the data, so you can be certain that your coding is unique. You can override this feature by including the override option.

> Example

Our automobile dataset (described in [U] 9 Stata's on-line tutorials and sample datasets) contains 74 observations and 12 variables. Let us first attempt to translate the missing values in the data to 1:
. mvencode _all, mv(1)
   make: string variable ignored
  rep78: already 1 in 2 observations
foreign: already 1 in 22 observations
no action taken
r(9);

Our attempt failed. mvencode first informed us that make is a string variable--this is not a problem but is reported merely for our information. String variables are ignored by mvencode. It next informed us that rep78 already was coded 1 in 2 observations and that foreign was already coded 1 in 22 observations. Thus, 1 would be a poor choice for encoding missing values because, after encoding, you could not tell a real 1 from a coded missing value 1. We could force mvencode to encode the data with 1 anyway by typing mvencode _all, mv(1) override, and that would be appropriate if the 1s in our data already represented missing data. They do not, however, and we will code missing as 999:

. mvencode _all, mv(999)
   make: string variable ignored
  rep78: 5 missing values

This worked, and we are informed that the only changes necessary were to 5 observations of rep78.
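The protection rule mvencode enforces — refuse when the requested code already occurs in the data, unless override is given — can be sketched in Python (an illustrative analogy, with None again standing in for Stata's missing value):

```python
def mvencode(values, mv, override=False):
    """Replace missing (None) with the code mv, refusing if mv
    already occurs in the data unless override is requested."""
    if not override and mv in values:
        raise ValueError(f"{mv} already used in the data; "
                         "specify override to encode anyway")
    return [mv if v is None else v for v in values]
```

The refusal is what guarantees that, after encoding, a coded missing value can always be distinguished from a real value.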
Example Let us now pretend that we just read in the automobile data from some raw dataset where all the missing values were coded 999. We can convert the 999's to real missings by typing • mvdecode
_all,
mv(999)
make : string variable ignored rep78:5 missing values
We are informed that make is a string variable and so was ignored and that rep78 observations with 999. Those observations have now been changed to contain missing.
contained
5 q
Methods and Formulas mvencode and mvdecode are implemented
Also See Related:
[R] generate, [R] recode
as ado-files.
Title

mvreg -- Multivariate regression

Syntax

        mvreg depvarlist = varlist [weight] [if exp] [in range] [, noconstant corr
                noheader notable level(#) ]

by ... : may be used with mvreg; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

mvreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

        predict [type] newvarname [if exp] [in range]
                [, { xb | stdp | residuals | difference | stddp } equation(eqno[,eqno]) ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

mvreg estimates multivariate regression models.

Options

noconstant omits the constant term from the estimation.

corr displays the correlation matrix of the residuals between the equations.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

Options for predict

equation(eqno[,eqno]) specifies to which equation you are referring. equation() is filled in with one eqno for options xb, stdp, and residuals. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. Alternatively, you could refer to the equations by their names: equation(income) would refer to the equation named income and equation(hours) to the equation named hours.

If you do not specify equation(), results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).

xb, the default, calculates the fitted values--the prediction of x_j b for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

residuals calculates the residuals.

difference calculates the difference between the linear predictions of two equations in the system.

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_1j b - x_2j b) between equations 1 and 2 is calculated.

For more information on using predict after multiple-equation estimation commands, see [R] predict.
Remarks

Multivariate regression differs from multiple regression in that several dependent variables are jointly regressed on the same independent variables. Multivariate regression is related to Zellner's seemingly unrelated regression (see [R] sureg) but, since the same set of independent variables is used for each dependent variable, the syntax is simpler and the calculations faster.

The individual coefficients and standard errors produced by mvreg are identical to those that would be produced by regress estimating each equation separately. The difference is that mvreg, being a joint estimator, also estimates the between-equation covariances, so you can test coefficients across equations and, in fact, test's syntax makes such tests more convenient.
> Example

Using the automobile data, we estimate a multivariate regression for "space" variables (headroom, trunk, and turn) in terms of a set of other variables including three "performance" variables (displacement, gear_ratio, and mpg):

. mvreg headroom trunk turn = price mpg displ gear_ratio length weight

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
headroom           74      7    .7390205    0.2996   4.777213   0.0004
trunk              74      7    3.052314    0.5328    12.7265   0.0000
turn               74      7    2.132377    0.7844   40.62042   0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
headroom     |
       price |  -.0000528    .000038    -1.39   0.168    -.0001286    .0000229
         mpg |  -.0093774   .0260463    -0.36   0.720     -.061366    .0426112
displacement |   .0031025   .0024999     1.24   0.219    -.0018873    .0080922
  gear_ratio |   .2108071   .3539588     0.60   0.553    -.4956976    .9173118
      length |    .015886    .012944     1.23   0.224    -.0099504    .0417223
      weight |  -.0000868   .0004724    -0.18   0.855    -.0010296    .0008561
       _cons |  -.4525117   2.170073    -0.21   0.835    -4.783995    3.878972
-------------+----------------------------------------------------------------
trunk        |
       price |   .0000445   .0001567     0.28   0.778    -.0002684    .0003573
         mpg |  -.0220919   .1075767    -0.21   0.838    -.2368159    .1926322
displacement |   .0032118   .0103251     0.31   0.757    -.0173971    .0238207
  gear_ratio |  -.2271321   1.461926    -0.16   0.877    -3.145149    2.690885
      length |    .170811   .0534615     3.20   0.002     .0641014    .2775206
      weight |  -.0015944    .001951    -0.82   0.417    -.0054885    .0022997
       _cons |  -13.28253   8.962868    -1.48   0.143    -31.17249    4.607429
-------------+----------------------------------------------------------------
turn         |
       price |  -.0002647   .0001095    -2.42   0.018    -.0004833   -.0000462
         mpg |  -.0492948   .0751542    -0.66   0.514    -.1993031    .1007136
displacement |   .0036977   .0072132     0.51   0.610    -.0106999    .0180953
  gear_ratio |  -.1048432   1.021316    -0.10   0.919    -2.143399    1.933712
      length |    .072128   .0373487     1.93   0.058    -.0024204    .1466764
       _cons |   20.19157   6.261549     3.22   0.002     7.693467    32.68967
------------------------------------------------------------------------------
We should have specified the corr option so that we would also see the correlations between the residuals of the equations. We can correct our omission because mvreg--like all estimation commands--typed without arguments redisplays results. The noheader and notable (read "no-table") options suppress redisplaying the output we have already seen:

. mvreg, notable noheader corr

Correlation matrix of residuals:

            headroom     trunk      turn
  headroom    1.0000
     trunk    0.4986    1.0000
      turn   -0.1090   -0.0628    1.0000

Breusch-Pagan test of independence: chi2(3) = 19.566, Pr = 0.0002

The Breusch-Pagan test is significant, so the residuals of these three space variables are not independent of each other.

The three performance variables among our independent variables are mpg, displacement, and gear_ratio. We can jointly test the significance of these three variables in all the equations by typing
. test mpg displacement gear_ratio

 ( 1)  [headroom]mpg = 0.0
 ( 2)  [trunk]mpg = 0.0
 ( 3)  [turn]mpg = 0.0
 ( 4)  [headroom]displacement = 0.0
 ( 5)  [trunk]displacement = 0.0
 ( 6)  [turn]displacement = 0.0
 ( 7)  [headroom]gear_ratio = 0.0
 ( 8)  [trunk]gear_ratio = 0.0
 ( 9)  [turn]gear_ratio = 0.0

       F(  9,    67) =     0.33
            Prob > F =   0.9622
These three variables are not, as a group, significant. We might have suspected this from their individual significance in the individual regressions, but this multivariate test provides an overall assessment with a single p-value.

We can also perform a test for the joint significance of all three equations:

    . test [headroom]
     (output omitted)
    . test [trunk], accum
     (output omitted)
    . test [turn], accum

     ( 1)  [headroom]price = 0.0
     ( 2)  [headroom]mpg = 0.0
     ( 3)  [headroom]displacement = 0.0
     ( 4)  [headroom]gear_ratio = 0.0
     ( 5)  [headroom]length = 0.0
     ( 6)  [headroom]weight = 0.0
     ( 7)  [trunk]price = 0.0
     ( 8)  [trunk]mpg = 0.0
     ( 9)  [trunk]displacement = 0.0
     (10)  [trunk]gear_ratio = 0.0
     (11)  [trunk]length = 0.0
     (12)  [trunk]weight = 0.0
     (13)  [turn]price = 0.0
     (14)  [turn]mpg = 0.0
     (15)  [turn]displacement = 0.0
     (16)  [turn]gear_ratio = 0.0
     (17)  [turn]length = 0.0
     (18)  [turn]weight = 0.0

           F( 18,    67) =   19.34
                Prob > F =    0.0000
The set of variables as a whole is strongly significant. We might have suspected this, too, from the individual equations.
Technical Note
The mvreg command provides a good way to deal with multiple comparisons. If we wanted to assess the effect of length, we might be dissuaded from interpreting any of its coefficients except that in the trunk equation. [trunk]length, the coefficient on length in the trunk equation, has a p-value of .002, but in the remaining two equations it has p-values of only .224 and .058. A conservative statistician might argue that there are 18 tests of significance in mvreg's output (not counting those for the intercepts), so p-values above .05/18 = .0028 should be declared insignificant
at the 5% level.

A more aggressive but, in our opinion, reasonable approach would be first to note that the three equations are jointly significant, so we are justified in making some interpretation. Then we would work through the individual variables using test, possibly using .05/6 = .0083 (6 because there are 6 independent variables) for the 5% significance level. For instance, examining length:
    . test length

     ( 1)  [headroom]length = 0.0
     ( 2)  [trunk]length = 0.0
     ( 3)  [turn]length = 0.0

           F(  3,    67) =    4.94
                Prob > F =    0.0037
The reported significance level of .0037 is less than .0083, so we will declare this variable significant. [trunk]length is certainly significant with its p-value of .002, but what about the remaining two equations with p-values .224 and .058? Performing a joint test:

    . test [headroom]length [turn]length

     ( 1)  [headroom]length = 0.0
     ( 2)  [turn]length = 0.0

           F(  2,    67) =    2.91
                Prob > F =    0.0613

At this point, reasonable statisticians could disagree. The .06 significance value suggests no interpretation, but these were the two least-significant values out of three, so one would expect the p-value to be a little high. Perhaps an equivocal statement is warranted: there seems to be an effect, but chance cannot be excluded.
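The arithmetic behind the two multiple-comparison cutoffs used above is easy to verify directly. The short Python sketch below (not Stata code; the variable names are ours) recomputes them:

```python
# Bonferroni-style thresholds from the discussion above.
n_coef_tests = 18        # 6 covariates x 3 equations, intercepts not counted
n_joint_tests = 6        # one joint test per independent variable

conservative = 0.05 / n_coef_tests   # per-coefficient cutoff
aggressive = 0.05 / n_joint_tests    # per-variable cutoff

print(round(conservative, 4))   # 0.0028
print(round(aggressive, 4))     # 0.0083

# The joint F(3,67) test of length reported p = .0037:
assert 0.0037 < aggressive      # significant under the aggressive rule
assert 0.002 < conservative     # [trunk]length survives even the strict rule
```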
Saved Results

mvreg saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of parameters (including constant)
    e(k_eq)       number of equations
    e(df_r)       residual degrees of freedom
    e(chi2)       Breusch-Pagan chi-squared (corr only)
    e(df_chi2)    degrees of freedom for Breusch-Pagan chi-squared (corr only)

Macros
    e(cmd)        mvreg
    e(eqnames)    names of equations
    e(r2)         R-squared for each equation
    e(rmse)       RMSE for each equation
    e(F)          F statistic for each equation
    e(p_F)        significance of F for each equation
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(Sigma)      Sigma-hat matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas

mvreg is implemented as an ado-file.

Given q equations and p independent variables (including the constant), the parameter estimates are given by the p x q matrix

    B = (X'WX)^(-1) X'WY

where Y is an n x q matrix of dependent variables and X is an n x p matrix of independent variables. W is a weighting matrix equal to I if no weights are specified. If weights are specified, let v: 1 x n be the specified weights. If fweight frequency weights are specified, W = diag(v). If aweight analytic weights are specified, W = diag{v(1'1)/(1'v)}, which is to say, the weights are normalized to sum to the number of observations.

The residual covariance matrix is

    R = {Y'WY - B'(X'WX)B} / (n - p)
The estimated covariance matrix of the estimates is R (Kronecker product) (X'WX)^(-1). These results are identical to those produced by sureg when the same list of independent variables is specified repeatedly; see [R] sureg.

The Breusch and Pagan (1980) chi-squared statistic, a Lagrange multiplier statistic, is given by

    lambda = n * sum_{i=2}^{q} sum_{j=1}^{i-1} r_ij^2

where r_ij is the estimated correlation between the residuals of the ith and jth equations and n is the number of observations. It is distributed as chi-squared with q(q - 1)/2 degrees of freedom.
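As a numerical check of this formula, the statistic reported earlier in this entry can be recomputed from the displayed residual correlations. This is a pure-Python sketch, not Stata code; the closed-form chi-squared tail for 3 degrees of freedom is our own addition, not part of mvreg:

```python
import math

r = [0.4986, -0.1090, -0.0628]   # r21, r31, r32 from the mvreg corr output
n, q = 74, 3                     # observations and equations (auto data)

lam = n * sum(rij ** 2 for rij in r)   # lambda = n * sum of squared corrs
df = q * (q - 1) // 2                  # q(q-1)/2 = 3 degrees of freedom

def chi2_sf_3df(x):
    # survival function of chi-squared with 3 df (closed form for odd df)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(round(lam, 2), df)           # roughly 19.57 on 3 df, cf. chi2(3) = 19.566
print(round(chi2_sf_3df(lam), 4))  # cf. the reported Pr = 0.0002
```

The small discrepancy from 19.566 comes from using the correlations as rounded in the display.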
References

Breusch, T. and A. Pagan. 1980. The LM test and its applications to model specification in econometrics. Review of Economic Studies 47: 239-254.
Also See

Complementary:   [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce

Related:         [R] reg3, [R] regress, [R] regression diagnostics, [R] sureg

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands
Title

    nbreg -- Negative binomial regression
Syntax

    nbreg depvar [indepvars] [weight] [if exp] [in range] [,
        dispersion({mean | constant}) level(#) irr exposure(varname) offset(varname)
        robust cluster(varname) score(newvars) noconstant constraints(numlist)
        nolrtest nolog maximize_options ]

    gnbreg depvar [indepvars] [weight] [if exp] [in range] [, lnalpha(varlist) level(#)
        irr exposure(varname) offset(varname) robust cluster(varname) score(newvars)
        noconstant constraints(numlist) nolog maximize_options ]

by ... : may be used with nbreg; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

nbreg may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    n         predicted number of events (the default)
    ir        incidence rate (equivalent to predict ..., n nooffset)
    xb        linear prediction
    stdp      standard error of the prediction

In addition, relevant only after gnbreg are

    alpha     predicted values of alpha
    lnalpha   predicted values of ln(alpha)
    stdplna   standard error of predicted ln(alpha)

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

nbreg estimates a negative binomial maximum-likelihood regression of depvar on varlist, where depvar is a nonnegative count variable. In this model, the count variable is believed to be generated by a Poisson-like process, except that the variation is greater than that of a true Poisson. This extra variation is referred to as overdispersion. See [R] poisson before reading this entry.

gnbreg estimates a generalized negative binomial regression; the shape parameter alpha may also be parameterized. Persons who have panel data should see [R] xtnbreg.
Options

dispersion({mean | constant}) specifies the parameterization of the model. dispersion(mean), the default, yields a model with dispersion equal to 1 + alpha*exp(x_i*b + offset_i); that is, the dispersion is a function of the expected mean exp(x_i*b + offset_i). dispersion(constant) has dispersion equal to 1 + delta; that is, it is constant for all observations.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

irr reports estimated coefficients transformed to incidence rate ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results.
exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with coefficient constrained to be 1, so the exposure is assumed to be e^varname.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.
score(newvars) creates newvar containing u_j = dlnL_j/d(x_j*b) for each observation j in the sample. The score vector is sum_j dlnL_j/db = sum_j u_j*x_j; i.e., the product of newvar with each covariate summed over observations. If two newvars are specified, then the score from the ancillary parameter equation is also saved. See [U] 23.12 Obtaining scores.

noconstant suppresses the constant term (intercept) in the regression.
constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

nolrtest suppresses fitting the Poisson model. Without this option, the Poisson model is fit, and its likelihood is used in a likelihood-ratio test of the alpha parameter. This option is valid only for nbreg; gnbreg has no likelihood-ratio comparison test (see the Technical Note in the section on gnbreg within this entry).

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them, although we often recommend specifying trace.

lnalpha(varlist) is allowed only with gnbreg. If this option is not specified, gnbreg and nbreg will produce the same results because the shape parameter will be parameterized as a constant. lnalpha() allows specifying a linear equation for ln(alpha). Specifying lnalpha(male old) means ln(alpha) = a0 + a1*male + a2*old, where a0, a1, and a2 are parameters to be fitted along with the other model coefficients.
Options for predict

n, the default, calculates the predicted number of events, which is exp(x_j*b) if neither offset(varname) nor exposure(varname) was specified when the model was estimated; exp(x_j*b + offset_j) if offset(varname) was specified; or exp(x_j*b)*exposure_j if exposure(varname) was specified.

ir calculates the incidence rate exp(x_j*b), which is the predicted number of events when exposure is 1. This is equivalent to n when neither offset(varname) nor exposure(varname) was specified when the model was estimated.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

alpha, lnalpha, and stdplna are relevant after gnbreg estimation only; they produce the predicted values of alpha or ln(alpha) and the standard error of the predicted ln(alpha), respectively.

nooffset is relevant only if you specified offset(varname) or exposure(varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as x_j*b rather than x_j*b + offset_j, and specifying predict ..., n nooffset is equivalent to specifying predict ..., ir.
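The relationship between the offset and exposure calculations can be seen with a few lines of arithmetic. This Python fragment (illustrative only, not Stata code; the numbers are invented) confirms that entering ln(exposure) into the log link with coefficient 1 is the same as multiplying the incidence rate by the exposure:

```python
import math

xb = 0.7           # a hypothetical linear prediction x_j*b
exposure = 12.5    # a hypothetical exposure for observation j

n_via_offset = math.exp(xb + math.log(exposure))   # offset(ln_exposure) route
n_via_exposure = math.exp(xb) * exposure           # exposure() route
ir = math.exp(xb)                                  # events per unit of exposure

assert math.isclose(n_via_offset, n_via_exposure)  # the two routes coincide
print(round(ir, 4))                                # ir equals n when exposure is 1
```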
Remarks

See Long (1997, chapter 8) for an introduction to the negative binomial regression model and for a discussion of other regression models for count data.

Negative binomial regression is used to estimate models of the number of occurrences (counts) of an event when the event has extra-Poisson variation; that is, it has overdispersion. The Poisson regression model is

    y_i ~ Poisson(mu_i)

where mu_i = exp(x_i*b + offset_i) for observed counts y_i with covariates x_i for the ith observation. One derivation of the negative binomial is that the individual units follow a Poisson regression model, but there is an omitted variable u_i such that e^(u_i) follows a gamma distribution with mean 1 and variance alpha:

    y_i ~ Poisson(mu_i*)

where mu_i* = exp(x_i*b + offset_i + u_i) and e^(u_i) ~ gamma(1/alpha, alpha). (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; see the Methods and Formulas section for the explicit definition of the distribution.)

We refer to alpha as the overdispersion parameter. The larger alpha is, the greater the overdispersion. The Poisson model corresponds to alpha = 0. nbreg parameterizes alpha as ln(alpha). gnbreg allows ln(alpha) to be modeled as ln(alpha_i) = z_i*gamma, a linear combination of covariates z_i.

nbreg will estimate two different parameterizations of the negative binomial model. The default, described above and also given by the option dispersion(mean), has dispersion for the ith observation equal to 1 + alpha*exp(x_i*b + offset_i). The alternative parameterization, given by the option dispersion(constant), has dispersion equal to 1 + delta; i.e., it is constant for all observations. The Poisson model corresponds to delta = 0.
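The mean-dispersion claim can be checked numerically from the closed-form negative binomial probabilities given later under Methods and Formulas. In this Python sketch (our own illustration, not Stata code), a distribution with mean mu and overdispersion alpha indeed has variance mu*(1 + alpha*mu), i.e., dispersion 1 + alpha*mu:

```python
from math import lgamma, log, exp

def nb_logpmf(y, mu, alpha):
    # ln f(y) with m = 1/alpha and p = 1/(1 + alpha*mu)
    m = 1.0 / alpha
    p = 1.0 / (1.0 + alpha * mu)
    return (lgamma(m + y) - lgamma(y + 1) - lgamma(m)
            + m * log(p) + y * log(1.0 - p))

mu, alpha = 5.0, 0.5
probs = [exp(nb_logpmf(y, mu, alpha)) for y in range(2000)]  # tail is negligible
mean = sum(y * f for y, f in enumerate(probs))
var = sum((y - mean) ** 2 * f for y, f in enumerate(probs))

print(round(mean, 4), round(var, 4))   # mean = mu; variance = mu*(1 + alpha*mu)
```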
nbreg

It is not uncommon to pose a Poisson regression model and observe a lack of model fit. The following data appeared in Rodriguez (1993):

    . list

            cohort   age_mos   deaths   exposure
      1.         1       0.5      168      278.4
      2.         1       2.0       48      538.8
      3.         1       4.5       63      794.4
      4.         1       9.0       89    1,550.8
      5.         1      18.0      102    3,006.0
      6.         1      42.0       81    8,743.5
      7.         1      90.0       40   14,270.0
      8.         2       0.5      197      403.2
      9.         2       2.0       48      786.0
     10.         2       4.5       62    1,165.3
     11.         2       9.0       81    2,294.8
     12.         2      18.0       97    4,500.5
     13.         2      42.0      103   13,201.5
     14.         2      90.0       39   19,525.0
     15.         3       0.5      195      495.3
     16.         3       2.0       55      956.7
     17.         3       4.5       58    1,381.4
     18.         3       9.0       85    2,604.5
     19.         3      18.0       87    4,618.5
     20.         3      42.0       70    9,814.5
     21.         3      90.0       10    5,802.5
    . gen logexp = ln(exposure)
    . quietly tab cohort, gen(coh)
    . poisson deaths coh2 coh3, offset(logexp)

    Iteration 0:   log likelihood = -2160.0544
    Iteration 1:   log likelihood = -2159.5182
    Iteration 2:   log likelihood = -2159.5159
    Iteration 3:   log likelihood = -2159.5159

    Poisson regression                                Number of obs   =        21
                                                      LR chi2(2)      =     49.16
                                                      Prob > chi2     =    0.0000
    Log likelihood = -2159.5159                       Pseudo R2       =    0.0113
T
_
nbreg-
I
deaths
Coef.
Std. Err.
z
Negative binomial regression
P>Izl
[95_, Conf.
387
Interval]
, J
I
coh2
-. 3020405
.0573319
coh3
.0742143
.0589726
-3.899488 (offset)
.0411345
_cons logexp
-5.27 1,26 -94.80
0.000
-. 4144089
-. 1896721
O. 208
-. 0413698
.1897983
O. 000
-3.
98011
-3.818866
. _oisgof Goodness-of-fit Prob
chi2
> chi2(18)
=
4190.689
=
0.0000
The extreme significance of the goodness-of-fit chi-squared indicates that the Poisson regression model is inappropriate, suggesting to us that we should try a negative binomial model:
    . nbreg deaths coh2 coh3, offset(logexp) nolog

    Negative binomial regression                      Number of obs   =        21
                                                      LR chi2(2)      =      0.40
                                                      Prob > chi2     =    0.8171
    Log likelihood = -131.3799                        Pseudo R2       =    0.0015

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            coh2 |  -.2676187   .7237203    -0.37   0.712    -1.686084    1.150847
            coh3 |  -.4575957   .7236651    -0.63   0.527    -1.875753    .9609618
           _cons |  -2.086731    .511856    -4.08   0.000     -3.08995   -1.083511
          logexp |   (offset)
    -------------+----------------------------------------------------------------
        /lnalpha |   .5939963   .2583615                      .0876171    1.100376
    -------------+----------------------------------------------------------------
           alpha |   1.811212   .4679475                       1.09157    3.005295
    ------------------------------------------------------------------------------
    Likelihood ratio test of alpha=0:  chibar2(01) = 4056.27  Prob>=chibar2 = 0.000
Our original Poisson model is a special case of the negative binomial; it corresponds to alpha = 0. nbreg, however, estimates alpha indirectly, estimating instead ln(alpha). In our model, ln(alpha) = 0.594, meaning that alpha = 1.81 (nbreg undoes the transformation for us at the bottom of the output).

In order to test alpha = 0, nbreg performs a likelihood-ratio test. The staggering chi-squared value of 4056.27 asserts that the probability that we would observe these data conditional on alpha = 0, i.e., conditional on the process being Poisson, is virtually zero. The data are not Poisson. It is not accidental that this chi-squared value is quite close to the goodness-of-fit statistic from the Poisson regression itself.
Technical Note

The usual Gaussian test of alpha = 0 is omitted since this test occurs on the boundary, invalidating the usual theory associated with such tests. However, the likelihood-ratio test of alpha = 0 has been modified to be valid on the boundary. In particular, the null distribution of the likelihood-ratio test statistic is not the usual chi-squared with 1 degree of freedom, but rather a 50:50 mixture of a chi-squared(0) (point mass at zero) and a chi-squared(1), denoted as chibar2(01). See Gutierrez et al. (2001) for more details.
[] Technical Note v,,,,
...,._
_ _egatwe olnomla! regression
The negative binomial model deals with cases where there is more variation than would be expected were the process Poisson. The negative binomial model is not helpful if there is less than Poisson
i
Poisson models arise because of independently generated events. Overdispersion comes about if some of the parameters (causes)of of Poisson areitsunknown. obtain underdispersiom the variation--if the variance the the count variableprocesses is less than mean. ButTounderdispersion is uncommon. sequence of events would have to somehow be regulated; that is, events would not be independent, but controlled based on past occurrences. []
gnbreg

gnbreg is a generalization of nbreg. Whereas in nbreg a single ln(alpha) is estimated, gnbreg allows ln(alpha) to vary observation by observation as a linear combination of another set of covariates: ln(alpha_i) = z_i*gamma.

We will assume that the number of deaths is a function of age, whereas the ln(alpha) parameter is a function of cohort. To estimate the model, we type
    . gnbreg deaths age_mos, lnalpha(coh2 coh3) offset(logexp)

    Fitting constant-only model:

    Iteration 0:   log likelihood =   -187.067
    Iteration 1:   log likelihood = -148.64462
    Iteration 2:   log likelihood = -132.49595
    Iteration 3:   log likelihood = -131.59338
    Iteration 4:   log likelihood = -131.57949
    Iteration 5:   log likelihood = -131.57948

    Fitting full model:

    Iteration 0:   log likelihood = -124.34327
    Iteration 1:   log likelihood = -117.72418
    Iteration 2:   log likelihood = -117.56349
    Iteration 3:   log likelihood = -117.56164
    Iteration 4:   log likelihood = -117.56164

    Generalized negative binomial regression          Number of obs   =        21
                                                      LR chi2(1)      =     28.04
                                                      Prob > chi2     =    0.0000
    Log likelihood = -117.56164                       Pseudo R2       =    0.1065

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    deaths       |
         age_mos |  -.0516657   .0051747    -9.98   0.000     -.061808   -.0415233
           _cons |  -1.867225   .2227944    -8.38   0.000    -2.303894   -1.430556
          logexp |   (offset)
    -------------+----------------------------------------------------------------
    lnalpha      |
            coh2 |   .0939546   .7187747     0.13   0.896    -1.314818    1.502727
            coh3 |   .0815279   .7365477     0.11   0.912    -1.362079    1.525135
           _cons |  -.4759581   .5156502    -0.92   0.356    -1.486614    .5346978
    ------------------------------------------------------------------------------
We find that age is a significant determinant of the number of deaths. The standard errors for the variables in the ln(alpha) equation suggest that the overdispersion parameter does not vary across cohorts. We can test this by typing

    . test coh2 coh3

     ( 1)  [lnalpha]coh2 = 0.0
     ( 2)  [lnalpha]coh3 = 0.0

               chi2(  2) =    0.02
             Prob > chi2 =    0.9904

There is no evidence of variation by cohort in these data.
Technical Note

Note the intentional absence of a likelihood-ratio test for alpha = 0 in gnbreg. The test is affected by the same boundary condition that affects the comparison test in nbreg; however, when alpha is parameterized by more than a constant term, the null distribution becomes intractable. For this reason, we recommend using nbreg to test for overdispersion and, if overdispersion exists, only then modeling the overdispersion using gnbreg.
Predicted values

After nbreg and gnbreg, predict returns the predicted number of events:

    . nbreg deaths coh2 coh3, nolog

    Negative binomial regression                      Number of obs   =        21
                                                      LR chi2(2)      =      0.14
                                                      Prob > chi2     =    0.9307
    Log likelihood = -108.48841                       Pseudo R2       =    0.0007

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            coh2 |   .0591305   .2978419     0.20   0.843    -.5246289      .64289
            coh3 |  -.0538792   .2981621    -0.18   0.857    -.6382662    .5305077
           _cons |   4.435906   .2107213    21.05   0.000       4.0229    4.848912
    -------------+----------------------------------------------------------------
        /lnalpha |  -1.207379   .3108622                     -1.816657   -.5980999
    -------------+----------------------------------------------------------------
           alpha |     .29898   .0929416                      .1625683    .5498555
    ------------------------------------------------------------------------------
    Likelihood ratio test of alpha=0:  chibar2(01) = 434.62  Prob>=chibar2 = 0.000

    . predict count
    (option n assumed; predicted number of events)

    . summarize deaths count

        Variable |     Obs        Mean    Std. Dev.       Min        Max
    -------------+------------------------------------------------------
          deaths |      21    84.66667    48.84192         10        197
           count |      21    84.66667     4.00773         80   89.57143
Saved Results

nbreg and gnbreg save in e():

Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(df_m)       model degrees of freedom
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, comparison model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p)          significance
    e(ic)         number of iterations
    e(rank)       rank of e(V)
    e(rank0)      rank of e(V) for constant-only model
    e(alpha)      the value of alpha

Macros
    e(cmd)        nbreg or gnbreg
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(offset#)    offset for equation #
    e(dispers)    mean or constant
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(opt)        type of optimization
    e(user)       name of likelihood-evaluator program
    e(vcetype)    covariance estimation method
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

nbreg and gnbreg are implemented as ado-files.

See [R] poisson and Feller (1968, 156-164) for an introduction to the Poisson distribution.

A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables. The number of times something occurs, y_i, is distributed as Poisson(nu_i*mu_i). That is, its conditional likelihood is

    f(y_i | nu_i) = (nu_i*mu_i)^(y_i) e^(-nu_i*mu_i) / Gamma(y_i + 1)

where mu_i = exp(x_i*b + offset_i) and nu_i is an unobserved parameter with a gamma(1/alpha, 1/alpha) density:

    g(nu) = nu^((1-alpha)/alpha) e^(-nu/alpha) / {alpha^(1/alpha) Gamma(1/alpha)}

This gamma distribution has mean 1 and variance alpha, where alpha is our ancillary parameter. (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; the above density defines how it has been parameterized here.)

The unconditional likelihood for the ith observation is therefore

    f(y_i) = integral from 0 to infinity of f(y_i | nu) g(nu) dnu
           = {Gamma(m + y_i) / (Gamma(y_i + 1) Gamma(m))} p_i^m (1 - p_i)^(y_i)

where p_i = 1/(1 + alpha*mu_i) and m = 1/alpha. Solutions for alpha are handled by searching for ln(alpha), since alpha is required to be greater than zero.

The scores and log likelihood (with weights w_i and offsets) are given by

    psi(z)  = digamma function evaluated at z
    psi'(z) = trigamma function evaluated at z
    alpha = exp(tau)     m = 1/alpha
    p_i = 1/(1 + alpha*mu_i)     mu_i = exp(x_i*b + offset_i)

    lnL = sum_{i=1}^{n} w_i { ln Gamma(m + y_i) - ln Gamma(y_i + 1) - ln Gamma(m)
                              + m*ln(p_i) + y_i*ln(1 - p_i) }

    score(b)_i   = p_i (y_i - mu_i)
    score(tau)_i = -m { alpha*(mu_i - y_i)/(1 + alpha*mu_i) - ln(1 + alpha*mu_i)
                        + psi(y_i + m) - psi(m) }

In the case of gnbreg, alpha is allowed to vary across the observations according to the parameterization ln(alpha_i) = z_i*gamma.

Maximization for gnbreg is via the linear-form method and for nbreg is via the d2 method described in [R] ml.
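The gamma-mixture integral above can be checked numerically. This Python sketch (ours, not part of Stata; the values of y, mu, and alpha are arbitrary test numbers) integrates f(y|nu)g(nu) on a grid and compares the result with the closed-form expression:

```python
from math import lgamma, log, exp

y, mu, alpha = 2, 3.0, 0.5
m = 1.0 / alpha                  # m = 1/alpha
p = 1.0 / (1.0 + alpha * mu)     # p = 1/(1 + alpha*mu)

def f_cond(nu):
    # Poisson(nu*mu) probability mass at y
    lam = nu * mu
    return exp(y * log(lam) - lam - lgamma(y + 1))

def g(nu):
    # gamma(1/alpha, 1/alpha) density from the text; mean 1, variance alpha
    return exp((m - 1) * log(nu) - nu / alpha - m * log(alpha) - lgamma(m))

# left-rectangle integration over nu (fine grid, truncated tail)
h, total, nu = 1e-4, 0.0, 1e-4
while nu < 40.0:
    total += f_cond(nu) * g(nu) * h
    nu += h

closed = exp(lgamma(m + y) - lgamma(y + 1) - lgamma(m)) * p ** m * (1 - p) ** y
print(round(total, 4), round(closed, 4))   # the two values agree
```

For these particular values the closed form works out exactly to 108/625 = 0.1728, which the numeric integral reproduces to grid accuracy.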
References

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications, vol. 1. 3d ed. New York: John Wiley & Sons.

Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. On boundary-value likelihood ratio tests. Stata Technical Bulletin, forthcoming.

Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.

Hilbe, J. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Rodriguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.

Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.

Rogers, W. H. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82-84.
Also See

Complementary:   [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx,
                 [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related:         [R] glm, [R] poisson, [R] xtnbreg, [R] zip

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.12 Obtaining scores,
                 [R] maximize
Title

    net -- Install and manage user-written additions from the net

Syntax

    net from directory_or_url
    net cd path_or_url
    net link linkname
    net search keywords                      (see [R] net search)
    net describe pkgname
    net set ado dirname
    net set other dirname
    net query
    net install pkgname [, all replace]
    net get pkgname [, all replace]

    ado [, find(string) from(dirname)]
    ado dir [, find(string) from(dirname)]
    ado describe [pkgid] [, find(string) from(dirname)]
    ado uninstall pkgid [, from(dirname)]

where
    pkgname is the name of a package
    pkgid   is the name of a package or a number in square brackets: [#]
    dirname is a directory name or PERSONAL or STBPLUS (default) or SITE
Description

net fetches and installs additions to Stata. The additions can be obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. For instance, the package named zz49 might add the xyz command to Stata. At a minimum, such a package would contain xyz.ado, the code to implement the new command, and xyz.hlp, the on-line help to describe it. That the package contains two files is a detail: you use net to fetch the package zz49 regardless of how many files there are.

ado manages the packages you have installed using net. The ado command allows you to list packages you have previously installed and to uninstall them.

Users can also access the net and ado features by pulling down Help and selecting STB and User-written Programs.
Options

all is used with net install and net get. Typing it with either one makes the command equivalent to typing net install followed by net get.

replace is for use with net install and net get. It specifies that the fetched files are to replace existing files if any of the files already exist.
find(string) is for use with ado, ado dir, and ado describe. It specifies that the descriptions of the packages installed on your computer are to be searched and that the package descriptions containing string are to be listed.

from(dirname) is for use with ado. It specifies where the packages are installed. The default is from(STBPLUS). STBPLUS is a codeword that Stata understands to correspond to a particular directory on your computer that was set at installation time. On Windows computers, STBPLUS probably means the directory c:\ado\stbplus, but it might mean something else. You can find out what it means by typing sysdir, but this is irrelevant if you use the defaults.
Remarks

For an introduction to using net and ado, see [U] 32 Using the Internet to keep up to date. The purpose of this documentation is

1. To briefly but accurately describe net and ado and all their features.

2. To provide documentation to those who wish to set up their own sites to distribute additions to Stata.

Remarks are presented under the headings

    Definition of a package
    The purpose of the net and ado commands
    Content pages
    Package-description pages
    Where packages are installed
    A summary of the net command
    A summary of the ado command
    Relationship of net and ado to the point-and-click interface
    Creating your own site
    Format of content and package-description files
    Example 1
    Example 2
    Metacharacters in content and package-description files
    Error-free file delivery
Definition of a package

A package is a collection of files, typically .ado and .hlp files, that provides a new feature in Stata. Packages contain additions that you wish were part of Stata at the outset. We write such additions, and so do other users.

One source of these additions is the Stata Technical Bulletin (STB). The STB is a printed and electronic journal with corresponding software. If you want the journal, you must subscribe, but the software is available for free from our web site. If you do not have Internet access, you may purchase the STB media from StataCorp.
The purpose of the net and ado commands

The purpose of the net command is to make distribution and installation of packages easy. The goal is to get you quickly to a package-description page that summarizes the addition:

    . net describe rte_stat

    package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer/
    ---------------------------------------------------------------------------
    TITLE
          rte_stat.  The robust-to-everything statistic; update.

    DESCRIPTION/AUTHOR(S)
          S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.
          Aleph-0 100% confidence intervals proved too conservative for some
          applications; Aleph-1 confidence intervals have been substituted.
          The new robust-to-everything supplants the previous robust-to-
          everything-conceivable statistic.  See "Inference in the absence
          of data" (forthcoming).  After installation, see help rte.

    INSTALLATION FILES                             (type net install rte_stat)
          rte.ado
          rte.hlp
          nullset.ado
          random.ado
    ---------------------------------------------------------------------------

Should you decide the rte_stat addition might prove useful, net makes the installation easy:

    . net install rte_stat
    checking rte_stat consistency and verifying not already installed...
    installing into c:\ado\stbplus\ ...
    installation complete.

The purpose of the ado command is to help you manage packages installed with net. Perhaps you remember that you installed a package that calculates the robust-to-everything statistic but cannot remember the command's name. You could use ado to search what you have previously installed for the rte command:

    . ado

    [1]  package sg146 from http://www.stata.com/stb/stb56
         STB-56 sg146.  Scalar measures of fit for regression models.
      (output omitted)
    [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
         rte_stat.  The robust-to-everything statistic; update.
      (output omitted)
    [21] package sg121 from http://www.stata.com/stb/stb52
         STB-52 sg121.  Seemingly unrelated est. & cluster-adjusted sandwich est.

or you might type

    . ado, find("robust-to-everything")
    [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
         rte_stat.  The robust-to-everything statistic; update.

Perhaps you decide that rte, despite the author's claims, is not worth the disk space it occupies. You can use ado to erase it:

    . ado uninstall rte_stat
    package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
          rte_stat.  The robust-to-everything statistic; update.
    (package uninstalled)
Example

newey, lag(0) is equivalent to regress, robust:

. regress price weight displ, robust

Regression with robust standard errors              Number of obs =        74
                                                    F(  2,    71) =     14.44
                                                    Prob > F      =    0.0000
                                                    R-squared     =    0.2909
                                                    Root MSE      =    2518.4

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
       _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
------------------------------------------------------------------------------
. newey price weight displ, lag(0)

Regression with Newey-West standard errors          Number of obs =        74
maximum lag : 0                                     F(  2,    71) =     14.44
                                                    Prob > F      =    0.0000

------------------------------------------------------------------------------
             |             Newey-West
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
       _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
------------------------------------------------------------------------------
Example

You have time-series measurements on variables usr and idle and now wish to estimate an OLS model, but obtain Newey-West standard errors allowing for a lag of up to 3:

. newey usr idle, lag(3) t(time)

Regression with Newey-West standard errors          Number of obs =        30
maximum lag : 3                                     F(  1,    28) =     10.90
                                                    Prob > F      =    0.0026

------------------------------------------------------------------------------
             |             Newey-West
         usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        idle |  -.2281501   .0690927    -3.30   0.003    -.3696801     -.08662
       _cons |   23.13483   6.327031     3.66   0.001     10.17449    36.09516
------------------------------------------------------------------------------
Saved Results

newey saves in e():

Scalars
    e(N)          number of observations
    e(df_m)       model degrees of freedom
    e(df_r)       residual degrees of freedom
    e(F)          F statistic
    e(lag)        maximum lag

Macros
    e(cmd)        newey
    e(depvar)     name of dependent variable
    e(wtype)      weight type
    e(wexp)       weight expression
    e(vcetype)    covariance estimation method
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

newey is implemented as an ado-file.

newey calculates the estimates

    β̂_OLS = (X'X)⁻¹X'y

    Var(β̂_OLS) = (X'X)⁻¹ X'Ω̂X (X'X)⁻¹

That is, the coefficient estimates are simply those of OLS linear regression. For the case of lag(0) (no autocorrelation), the variance estimates are calculated using the White formulation:

    X'Ω̂X = X'Ω̂₀X = n/(n−k) Σᵢ êᵢ² xᵢ'xᵢ

Here êᵢ = yᵢ − xᵢβ̂_OLS, where xᵢ is the ith row of the X matrix, n is the number of observations, and k is the number of predictors in the model, including the constant if there is one. Note that the above formula is exactly the same as that used by regress, robust with the regression-like formula (the default) for the multiplier q_c; see the Methods and Formulas section of [R] regress.
For the case of lag(#), # > 0, the variance estimates are calculated using the Newey-West (1987) formulation

    X'Ω̂X = X'Ω̂₀X + n/(n−k) Σ_{l=1}^{m} (1 − l/(m+1)) Σ_{t=l+1}^{n} ê_t ê_{t−l} (x_t'x_{t−l} + x_{t−l}'x_t)

where m = # is the maximum lag.
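These formulas can be checked numerically. The following Python sketch (an illustration with our own function names, not Stata's ado-code) builds the lag(m) Newey-West covariance matrix and reduces to the White/robust estimator when m = 0, mirroring the equivalence of newey, lag(0) and regress, robust shown in the first example:

```python
import numpy as np

def newey_west_vcov(X, y, m):
    """Newey-West VCE with maximum lag m; m = 0 gives the White (robust)
    estimator.  Includes the small-sample multiplier n/(n-k), as newey does."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)        # OLS coefficients
    e = y - X @ beta                                # residuals e_t
    M = (X * (e**2)[:, None]).T @ X                 # lag-0 (White) middle term
    for l in range(1, m + 1):                       # weighted autocovariances
        w = 1 - l / (m + 1)
        for t in range(l, n):
            xt = X[t][:, None]
            xtl = X[t - l][:, None]
            M += w * e[t] * e[t - l] * (xt @ xtl.T + xtl @ xt.T)
    M *= n / (n - k)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ M @ XtX_inv                    # sandwich estimator
```

With m = 0 the loop never runs and the result is exactly the White formula above.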
References

Hardin, J. W. 1997. sg72: Newey-West standard errors for probit, logit, and poisson models. Stata Technical Bulletin 39: 32-35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 182-186.

Newey, W. and K. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703-708.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary:  [R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] test, [R] testnl, [R] vce
Related:        [R] regress, [R] svy estimators, [R] xtgls, [R] xtpcse
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands
Title

news -- Report Stata news

Syntax

    news
Description

news displays a brief listing of recent news and information of interest to Stata users. It obtains this information directly from Stata's web site. The news command requires your computer to be connected to the Internet. An error message will be displayed if the connection to Stata's web site is unsuccessful. You may also execute the news command by selecting News from the Help menu.
Remarks

news provides an easy way of displaying a brief list of the latest Stata news. More details and other items of interest are available at Stata's web site; point your browser to http://www.stata.com. Here is an example of what news produces:

. news

---------------------------  StataCorp News  ---------------------------

* Intercooled Stata for Windows 2002 release
  (projected to be available Aug 1, 2002)

* STB-68 (July 23, 2001) is now available; use the net command to download

* NetCourse 151: "Introduction to Stata Programming" begins next month

* Proceedings of the 8th London Stata User Group Meeting will be
  available the first day of next month

For additional information on these topics point your web browser to:
    http://www.stata.com
-------------------------------------------------------------------------
In this case news indicates that there is a new STB available. Users can click on STB and User-written Programs from the Help menu to download STB files. Alternatively, the net command (see [R] net) can be used.
Also See

Related:     [R] net
Background:  [U] 32 Using the Internet to keep up to date
Title

nl -- Nonlinear least squares

Syntax

    nl fcn depvar [varlist] [weight] [if exp] [in range] [, level(#) init(...)
        lnlsq(#) leave eps(#) nolog trace iterate(#) delta(#) fcn_options ]

    nlinit # parameter_list

by ...: may be used with nl; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

nl shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, { yhat | residuals } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. You provide the function itself in a separate program with a name of your choosing, except that the first two letters of the name must be nl. fcn refers to the name of the function without the first two letters. For example, you type nl nexpgr ... to estimate with the function defined in the program nlnexpgr.

nlinit is useful when writing nlfcns.
Options

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

init(...) specifies initial values for parameters that are to be used to override the default initial values. Examples are provided below.

lnlsq(#) fits the model defined by nlfcn using "log least squares", defined as least squares with shifted lognormal errors. In other words, ln(depvar − #) is assumed normally distributed. Sums of squares and deviance are adjusted to the same scale as depvar.

leave leaves behind after estimation a set of new variables with the same names as the estimated parameters containing the derivative of E(y) with respect to the parameter.

eps(#) specifies the convergence criterion for successive parameter estimates and for the residual sum of squares. eps(1e-5) is the default.

nolog suppresses the iteration log.

trace expands the iteration log to provide more details, including values of the parameters at each step of the process.

iterate(#) specifies the maximum number of iterations before giving up and defaults to 100.

delta(#) specifies the relative change in a parameter to be used in computing the numeric derivatives. The derivative for parameter βᵢ is computed as {f(X, β₁, β₂, ..., βᵢ + d, βᵢ₊₁, ...) − f(X, β₁, β₂, ..., βᵢ, βᵢ₊₁, ...)}/d, where d = δ(βᵢ + δ). The default δ is 4e-7.

fcn_options refer to any options allowed by nlfcn.

Options for predict

yhat, the default, calculates the predicted value of depvar.

residuals calculates the residuals.
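For intuition, the forward-difference rule that delta() controls can be sketched in a few lines of Python (an illustration with made-up names, not nl's implementation):

```python
def numeric_deriv(f, beta, i, delta=4e-7):
    """Forward-difference derivative of f with respect to beta[i], using
    step d = delta*(beta[i] + delta), as described for the delta() option."""
    d = delta * (beta[i] + delta)
    bumped = list(beta)
    bumped[i] += d
    return (f(bumped) - f(beta)) / d
```

Because the step is proportional to the parameter's magnitude, parameters on very different scales each receive a sensibly sized perturbation.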
Remarks

Remarks are presented under the headings

    nlfcns
    Some common nlfcns
    Log-normal errors
    Weights
    Errors
    General comments on fitting nonlinear models
    More on nlfcns

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. The specific function is specified by writing an nlfcn, described below. The values to be fitted in the function are called the parameters. The fitting process is iterative (modified Gauss-Newton). It starts with a set of initial values for the parameters (guesses as to what the values will be and which you also supply) and finds another set of values that fit the function even better. Those are then used as a starting point and another improvement is found, and the process continues until no further improvement is possible.
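As an aside, the Gauss-Newton idea can be illustrated outside Stata. The toy Python sketch below (our own simplified code, without nl's step-halving and convergence refinements) iterates Gauss-Newton steps for the negative-exponential growth curve used in the examples that follow:

```python
import numpy as np

def gauss_newton(x, y, b0, b1, iters=25):
    """Toy Gauss-Newton for y = b0*(1 - exp(-b1*x)); returns refined (b0, b1)."""
    b = np.array([b0, b1], dtype=float)
    for _ in range(iters):
        f = b[0] * (1 - np.exp(-b[1] * x))                  # current predictions
        J = np.column_stack([1 - np.exp(-b[1] * x),         # df/db0
                             b[0] * x * np.exp(-b[1] * x)]) # df/db1
        step, *_ = np.linalg.lstsq(J, y - f, rcond=None)    # least-squares update
        b += step
        if np.max(np.abs(step)) < 1e-12:                    # no further improvement
            break
    return b
```

Each iteration solves a linear least-squares problem in the Jacobian, exactly the "find another set of values that fit even better" step described above.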
nlfcns

nl uses the function defined by nlfcn. nlfcn has two purposes: to identify the parameters of the problem and set default initial values, and to evaluate the function for a given set of parameter estimates.
Example

You have variables y and x in your data and wish to fit a negative-exponential growth curve with parameters B0 and B1:

    Y = B0 (1 − e^{−B1 x})

First, you write a program to calculate the predicted values:

program define nlnexpgr
        version 7.0
        if "`1'" == "?" {                       /* if query call          */
                global S_1 "B0 B1"              /* declare parameters     */
                global B0=1                     /*   and initialize them  */
                global B1=.1
                exit
        }
        replace `1'=$B0*(1-exp(-$B1*x))         /* otherwise, calculate function */
end
To estimate the model, you type nl nexpgr y. nl's first argument specifies the name of the function, although you do not type the nl prefix. You type nexpgr, meaning the function is nlnexpgr. nl's second argument specifies the name of the dependent variable. Replicating the example in the SAS manual (1985, 588-590):

. use sasxmpl1
. nl nexpgr y
(obs = 20)
Iteration 0:  residual SS = .1999027
Iteration 1:  residual SS = .0026064
Iteration 2:  residual SS = .0005769
Iteration 3:  residual SS = .0005768

      Source |       SS       df       MS          Number of obs =         20
-------------+------------------------------      F(  2,    18) =  275732.74
       Model |  17.6717234     2  8.83586172      Prob > F      =     0.0000
    Residual |  .000576801    18  .000032045      R-squared     =     1.0000
-------------+------------------------------      Adj R-squared =     1.0000
       Total |  17.6723003    20  .883615013      Root MSE      =   .0056608
                                                  Res. dev.     =   -152.317

(nexpgr)
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
          B0 |   .9961886   .0016138   617.29   0.000     .9927981     .9995791
          B1 |   .0419539   .0003982   105.35   0.000     .0411172     .0427905
-------------+-----------------------------------------------------------------
(SEs, P values, CIs, and correlations are asymptotic approximations)
Notice that the initial values of the parameters were provided in the nlnexpgr program. You can, however, override these initial values on the nl command line. To estimate the model using .5 for the initial value of B0 rather than 1, you can type nl nexpgr y, init(B0=.5). To also change the initial value of B1 from .1 to .2, you type nl nexpgr y, init(B0=.5, B1=.2).

The outline of all nlfcns is the same:

program define nlfcn
        version 7.0
        if "`1'" == "?" {
                global S_1 "parameter names"
                (initialize parameters)
                exit
        }
        replace `1' = ...
end
On a query call, indicated by `1' being "?", the nlfcn is to place the names of the parameters in the global macro S_1 and initialize the parameters. Parameters are stored as macros, so if nlfcn declares that the parameters are A, B, and C (via global S_1 "A B C"), it must then place initial values in the corresponding parameter macros A, B, and C (via global A=0, global B=1, etc.). After initializing the parameter macros, it is done.

On a calculation call, `1' does not contain "?"; it instead contains the name of a variable that is to be filled in with the predicted values. The current values of the parameters are stored in the macros previously declared on the query call (e.g., $A, $B, and $C).
Example

You wish to fit the CES production function defined by

    lnq = B0 + A ln{D l^R + (1 − D) k^R}

where the parameters to be estimated are B0, A, D, and R. q, l, and k refer to total output and labor and capital inputs. In your data, you have the variables lnq, labor, and capital. The nlfcn is

program define nlces
        version 7.0
        if "`1'" == "?" {
                global S_1 "B0 A D R"
                global B0 = 1
                global A = -1
                global D = .5
                global R = -1
                exit
        }
        replace `1'=$B0 + $A*ln($D*labor^$R + (1-$D)*capital^$R)
end
Again using data from the SAS manual (1985, 591-592):

. use sasxmpl2
. nl ces lnq
(obs = 30)
Iteration 0:   residual SS = 37.09651
Iteration 1:   residual SS = 35.48655
Iteration 2:   residual SS = 22.69058
Iteration 3:   residual SS = 1.845468
 (output omitted)
Iteration 20:  residual SS = 1.761039
Iteration 21:  residual SS = 1.761039

      Source |       SS       df       MS          Number of obs =         30
-------------+------------------------------      F(  3,    26) =     292.96
       Model |  59.5286148     3  19.8428718      Prob > F      =     0.0000
    Residual |  1.76103929    26   .06773228      R-squared     =     0.9713
-------------+------------------------------      Adj R-squared =     0.9680
       Total |  61.2896541    29  2.11343635      Root MSE      =   .2602543
                                                  Res. dev.     =   .0775147

(ces)
         lnq |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
         B0* |   .1244892   .0783443     1.59   0.124    -.0365497     .2855282
           A |  -.3362823   .2721671    -1.24   0.228    -.8957298     .2231652
           D |   .3366722   .1361172     2.47   0.020     .0568793     .6164652
           R |  -3.011121   2.323575    -1.30   0.206    -7.787297     1.765055
-------------+-----------------------------------------------------------------
* Parameter taken as constant term in model & ANOVA table
(SEs, P values, CIs, and correlations are asymptotic approximations)
If the nonlinear model contains a constant term, nl will find it and indicate its presence by placing an asterisk next to the parameter name when displaying results. In the output above, B0 is a constant. (nl determines that a parameter B0 is a constant term because the partial derivative f = ∂E(y)/∂B0 has a coefficient of variation (s.d./mean) less than eps(). Usually, f = 1 for a constant, as it does in this case.)

The output of nl closely mimics that of regress; see [R] regress. The model F test, R-squared, sums of squares, etc., are calculated as regress calculates them, which means that they are corrected for the mean. If no "constant" is present, as was the case in the negative-exponential growth example previously, the usual caveats apply to the interpretation of the F and R-squared statistics; see comments and references in Goldstein (1992).

When making its calculations, nl creates the partial derivative variables for all the parameters, giving each the same name as the corresponding parameter. Unless you specify leave, these are discarded when nl completes the estimation. Therefore, your data must not have data variables that have the same names as parameters. We recommend using uppercased names for parameters and lowercased names (as is common) for variables.

After estimating with nl, typing nl by itself will redisplay previous estimates. Typing correlate, _coef will show the asymptotic correlation matrix of the parameters, and typing predict myvar will create new variable myvar containing the predicted values. Typing predict res, resid will create res containing the residuals.

nlfcns have a number of additional features that are described in More on nlfcns below.
Some common nlfcns

An important feature of nl, in addition to estimating arbitrary nonlinear regressions, is the facility for adding prewritten common fcns.

Three fcns are provided for exponential regression with one asymptote:

    exp3     Y = b0 + b1 b2^X
    exp2     Y = b1 b2^X
    exp2a    Y = b1 (1 − b2^X)

For instance, typing nl exp3 ras dvl estimates the three-parameter exponential model (parameters b0, b1, and b2) using Y = ras and X = dvl.

Two fcns are provided for the logistic function (symmetric sigmoid shape; not to be confused with logistic regression):

    log4     Y = b0 + b1/[1 + exp{−b2(X − b3)}]
    log3     Y = b1/[1 + exp{−b2(X − b3)}]

Finally, two fcns are provided for the Gompertz function (asymmetric sigmoid shape):

    gom4     Y = b0 + b1 exp[−exp{−b2(X − b3)}]
    gom3     Y = b1 exp[−exp{−b2(X − b3)}]
Technical Note

You may find the functions above useful, but the important thing to note is that, if there is a nonlinear function you use often, you can package the function once and for all. Consider the function we packaged called exp2, which estimates the model Y = b1 b2^X. The code for the function is

program define nlexp2
        version 7.0
        if "`1'"=="?" {
                global S_2 "2-param. exp. growth curve, `e(depvar)'=b1*b2^`2'"
                global S_1 "b1 b2"
                * Approximate initial values by regression of log Y on X
                local exp "[`e(wtype)'`e(wexp)']"
                tempvar Y
                quietly {
                        gen `Y' = log(`e(depvar)') if e(sample)
                        regress `Y' `2' `exp' if e(sample)
                        global b1 = exp(_b[_cons])
                        global b2 = exp(_b[`2'])
                }
                exit
        }
        replace `1'=$b1*$b2^`2'
end
Because we were packaging this function for repeated use, we went to the trouble of obtaining good initial values, which in this case we could obtain by taking the log of both sides,

    Y = b1 b2^X
    ln(Y) = ln(b1 b2^X) = ln(b1) + ln(b2) X

and then using linear regression to estimate ln(b1) and ln(b2). If this had been a quick-and-dirty implementation, we probably would not have bothered (initializing b1 and b2 to 1, say) and so forced ourselves to specify better initial values with nl's init() option when they were not good enough.
The only other thing we did to complete the packaging was store nlexp2 as an ado-file called nlexp2.ado. The alternatives would have been to type the code into Stata interactively or to place the code in a do-file. Those approaches are adequate for occasional use, but we wanted to be able to type nl exp2 without having to worry whether the program nlexp2 was defined. When nl attempts to execute nlexp2, if the program is not in Stata's memory, Stata will search the disk(s) for an ado-file of the same name and, if found, automatically load it. All we had to do was name the file with the .ado suffix and then place it in a directory where Stata could find it. In our case, we put nlexp2.ado in Stata's system directory for StataCorp-written ado-files. In your case, you should put the file in the directory Stata reserves for user-written ado-files, which is to say, c:\ado\personal (Windows), ~/ado/personal (Unix), or ~:ado:personal (Macintosh). See [U] 20 Ado-files.
Log-normal errors

A nonlinear model with identically normally distributed errors may be written

    yᵢ = f(xᵢ, β) + uᵢ,        uᵢ ~ N(0, σ²)                               (1)

for i = 1, ..., n. If the yᵢ are thought to have a k-shifted lognormal instead of a normal distribution, that is, ln(yᵢ − k) ~ N(ζᵢ, τ²), and the systematic part f(xᵢ, β) of the original model is still thought appropriate, the model becomes

    ln(yᵢ − k) = ζᵢ + vᵢ = ln{f(xᵢ, β) − k} + vᵢ,        vᵢ ~ N(0, τ²)      (2)

This model is estimated if lnlsq(k) is specified.

If model (2) is correct, the variance of (yᵢ − k) is proportional to {f(xᵢ, β) − k}². Probably the most common case is k = 0, sometimes called "proportional errors" since the standard error of yᵢ is proportional to its expectation, f(xᵢ, β). Assuming the value of k is known, (2) is just another nonlinear model in β and it may be fitted as usual. However, we may wish to compare the fit of (1) with that of (2) using the residual sum of squares or the deviance D, D = −2 × log-likelihood, from each model. To do so, we must allow for the change in scale introduced by the log transformation.

Assuming, then, the yᵢ to be normally distributed, Atkinson (1985, 85-87, 184), by considering the Jacobian ∏|∂ln(yᵢ − k)/∂yᵢ|, showed that multiplying both sides of (2) by the geometric mean of yᵢ − k, ẏ, gives residuals on the same scale as those of yᵢ. The geometric mean is given by

    ẏ = exp{(1/n) Σᵢ ln(yᵢ − k)}

which is a constant for a given dataset. The residual deviance for (1) and for (2) may be expressed as

    D(β̂) = {1 + ln(2πσ̂²)} n                                               (3)

where β̂ is the maximum likelihood estimate (MLE) of β for each model and nσ̂² is the RSS from (1), or that from (2) multiplied by ẏ².

Since (1) and (2) are models with different error structures but the same functional form, the arithmetic difference in their RSS or deviances is not easily tested for statistical significance. However, if the deviance difference is large (> 4, say), one would naturally prefer the model with the smaller deviance. Of course, the residuals for each model should be examined for departures from assumptions (nonconstant variance, nonnormality, serial correlations, etc.) in the usual way.

Consider alternatively modeling

    E(yᵢ) = 1/(C + A e^{Bxᵢ})                                              (4)
    E(1/yᵢ) = E(yᵢ') = C + A e^{Bxᵢ}                                       (5)
where C, A, and B are parameters to be estimated. We will use the data (y, x) = (.04, 5), (.06, 12), (.08, 25), (.1, 35), (.15, 42), (.2, 48), (.25, 60), (.3, 75), and (.5, 120) (Danuso 1991).

    Model                      C        A          B        RSS    Deviance
    -----------------------------------------------------------------------
    (4)                     1.781    25.74    -.03926    .001640     -51.95
    (4) with lnlsq(0)       1.799    25.45    -.04051    .001431     -53.18
    (5)                     1.781    25.74    -.03926      8.197      24.70
    (5) with lnlsq(0)       1.799    27.45    -.04051      3.651      17.42

There is little to choose between the two versions of the logistic model (4), whereas for the exponential model (5) the fit using lnlsq(0) is much better (a deviance difference of 7.28). The reciprocal transformation has introduced heteroskedasticity into yᵢ', which is countered by the proportional errors property of the lognormal distribution implicit in lnlsq(0). The deviances are not comparable between the logistic and exponential models because the change of scale has not been allowed for, although in principle it could be.
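The deviance and geometric-mean computations used in these comparisons are easy to reproduce; the Python sketch below (illustrative only, with our own function names) implements D = {1 + ln(2πσ̂²)}n and the geometric mean of yᵢ − k:

```python
import math

def deviance(rss, n):
    """Residual deviance D = {1 + ln(2*pi*sigma2)}*n with sigma2 = RSS/n."""
    return (1 + math.log(2 * math.pi * rss / n)) * n

def geometric_mean(y, k=0.0):
    """Geometric mean of y_i - k: exp{(1/n) * sum ln(y_i - k)}."""
    return math.exp(sum(math.log(yi - k) for yi in y) / len(y))
```

To put the shifted-lognormal model on the scale of the untransformed model, multiply its RSS by the squared geometric mean before computing the deviance.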
Weights

Weights are specified the usual way--analytic and frequency weights are supported; see [U] 23.13 Weighted estimation. Use of analytic weights implies that the yᵢ have different variances. Therefore, model (1) may be rewritten as

    yᵢ = f(xᵢ, β) + uᵢ,        uᵢ ~ N(0, σ²/wᵢ)                            (1a)

where wᵢ are (positive) weights, assumed known and normalized such that their sum equals the number of observations. The residual deviance for (1a) is

    D(β̂) = {1 + ln(2πσ̂²)} n − Σᵢ ln(wᵢ)                                   (3a)

(compare with equation 3), where

    σ̂² = (1/n) Σᵢ wᵢ {yᵢ − f(xᵢ, β̂)}²

Defining and fitting a model equivalent to (2) when weights have been specified as in (1a) is not straightforward and has not been attempted. Thus, deviances using and not using the lnlsq() option may not be strictly comparable when analytic weights (other than 0 and 1) are used.
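As a numerical illustration of (3a) (a sketch with our own function name, not nl's code): with all weights equal to 1 the correction term vanishes and (3a) reduces to (3).

```python
import math

def weighted_deviance(rss, w):
    """Residual deviance (3a): {1 + ln(2*pi*sigma2)}*n - sum(ln w_i),
    with sigma2 = RSS/n and weights w normalized so that sum(w) = n."""
    n = len(w)
    return (1 + math.log(2 * math.pi * rss / n)) * n - sum(math.log(wi) for wi in w)
```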
Errors

nl will stop with error 196 if an error occurs in your nlfcn program, and it will report the error code raised by nlfcn.

nl is reasonably robust to the inability of nlfcn to calculate predicted values for certain parameter values. nl assumes that predicted values can be calculated at the initial value of the parameters. If this is not so, an error message is issued with return code 480. Thereafter, as nl changes the parameter values, it monitors nlfcn's returned predictions for unexpected missing values. If detected, nl backs up. That is, nl finds a linear combination of the previous, known-to-be-good parameter vector and the new, known-to-be-bad vector, a combination where the function can be evaluated, and continues its iterations from that point.

nl does require, however, that once a parameter vector is found where the predictions can be calculated, small changes to the parameter vector can be made in order to calculate numeric derivatives. If a boundary is encountered at this point, an error message is issued with return code 481.

When specifying lnlsq(), an attempt to take logarithms of yᵢ − k when yᵢ ≤ k results in an error message with return code 482.
    type        restaurant
    --------------------------------------------------
    type 1      Freebirds, MamasPizza
    type 2      CafeEccell, LosNortenos, WingsNmore
    type 3      Christophers, MadCows
Test of the independence of irrelevant alternatives (IIA)

The property of the multinomial logit model and conditional logit model where odds ratios are independent of the other alternatives is referred to as the independence of irrelevant alternatives (IIA). Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant with respect to the other alternatives, omitting it from the model will not lead to inconsistent estimates. Therefore, Hausman's (1978) specification test can be used to test for IIA.
Example

Suppose we want to run clogit on our choice of restaurants dataset. We also want to test IIA between the alternatives of family restaurants and the alternatives of fast food places and fancy restaurants. To do so, we need to use Stata's hausman command; see [R] hausman. We first run the estimation on the full bottom alternative set; save the results using hausman, save; and then run the estimation on the bottom alternative set, excluding the alternatives of family restaurants. We then run the hausman test, with the less option indicating the order in which our models were fit.
. gen incFast = (type == 1) * income
. gen incFancy = (type == 3) * income
. gen kidFast = (type == 1) * kids
. gen kidFancy = (type == 3) * kids
. clogit chosen cost rating distance incFast incFancy kidFast kidFancy,
> group(family_id)

Iteration 0:  log likelihood = -564.57856
Iteration 1:  log likelihood = -496.41546
Iteration 2:  log likelihood = -489.35097
Iteration 3:  log likelihood = -488.91205
Iteration 4:  log likelihood = -488.90834
Iteration 5:  log likelihood = -488.90834

Conditional (fixed-effects) logistic regression     Number of obs =      2100
                                                    LR chi2(7)    =    189.73
                                                    Prob > chi2   =    0.0000
Log likelihood = -488.90834                         Pseudo R2     =    0.1625
      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cost |  -.1367799   .0358479    -3.82   0.000    -.2070404   -.0665193
      rating |   .3066622   .1418291     2.16   0.031     .0286823     .584642
    distance |  -.1977505   .0471653    -4.19   0.000    -.2901927   -.1053082
     incFast |  -.0390183   .0094018    -4.15   0.000    -.0574455   -.0205911
    incFancy |   .0407053   .0080405     5.06   0.000     .0249462    .0564644
     kidFast |  -.2398757   .1063674    -2.26   0.024     -.448352   -.0313994
    kidFancy |  -.3893862   .1143797    -3.40   0.001    -.6135662   -.1652061
------------------------------------------------------------------------------
. hausman, save
. clogit chosen cost rating distance incFast incFancy kidFast kidFancy
> if type != 2, group(family_id)
note: 222 groups (888 obs) dropped due to all positive or all negative outcomes.

Iteration 0:  log likelihood = -104.85538
Iteration 1:  log likelihood = -88.077817
Iteration 2:  log likelihood = -86.094611
Iteration 3:  log likelihood = -85.956423
Iteration 4:  log likelihood = -85.955324
Iteration 5:  log likelihood = -85.955324

Conditional (fixed-effects) logistic regression     Number of obs =       312
                                                    LR chi2(5)    =     44.35
                                                    Prob > chi2   =    0.0000
Log likelihood = -85.955324                         Pseudo R2     =    0.2051

      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cost |  -.0616621    .067852    -0.91   0.363    -.1946496    .0713254
      rating |   .1659001   .2832041     0.59   0.558    -.3891698      .72097
    distance |   -.244396   .0995056    -2.46   0.014    -.4394234   -.0493687
     incFast |  -.0737506   .0177444    -4.16   0.000     -.108529   -.0389721
     kidFast |   .4105386   .2137051     1.92   0.055    -.0083157    .8293928
------------------------------------------------------------------------------
. hausman, less

                 ---- Coefficients ----
             |      (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |    Current       Prior       Difference          S.E.
-------------+-------------------------------------------------------------
        cost |  -.0616621    -.1367799       .0751178        .0576092
    distance |   -.244396    -.1977505      -.0466456        .0876173
     incFast |  -.0737506    -.0390183      -.0347323         .015049
     kidFast |   .4105386    -.2398757       .6504143        .1853533

               b = less efficient estimates obtained from clogit
               B = fully efficient estimates obtained previously from clogit

    Test:  Ho:  difference in coefficients not systematic

             chi2(5) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                     =   10.70
           Prob>chi2 =  0.0577
The small p-value indicates that the IIA assumption between the alternatives of family restaurants and the alternatives of other restaurants is weak, hinting that the more complex nested logit model should be utilized.
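The statistic hausman prints follows directly from the displayed formula; the following Python sketch (illustrative, with our own names, not Stata's implementation) evaluates the quadratic form (b-B)'[V_b-V_B]^(-1)(b-B):

```python
import numpy as np

def hausman_stat(b, B, Vb, VB):
    """Hausman chi-squared: (b-B)' [Vb - VB]^-1 (b-B), where b is the less
    efficient and B the fully efficient estimator."""
    d = np.asarray(b) - np.asarray(B)
    return float(d @ np.linalg.inv(np.asarray(Vb) - np.asarray(VB)) @ d)
```

The statistic is referred to a chi-squared distribution with degrees of freedom equal to the number of compared coefficients.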
nlogit -- Maximum-likelihood nested logit estimation
Model estimation

Example

In this example, we want to examine how alternative-specific attributes apply to the bottom alternative set (all seven of the specific restaurants), and how family-specific attributes apply to the alternative set at the first decision level (all three types of restaurants).

. nlogit chosen (restaurant = cost rating distance) (type = incFast incFancy
> kidFast kidFancy), group(family_id) nolog

tree structure specified for the nested logit model

    top --> bottom

    type        restaurant
    fast        Freebirds
                MamasPizza
    family      CafeEccell
                LosNortenos
                WingsNmore
    fancy       Christophers
                MadCows

Nested logit estimation                             Number of obs =      2100
Levels             =         2                      LR chi2(10)   =  199.6293
Dependent variable =    chosen                      Prob > chi2   =    0.0000
Log likelihood     = -483.9584
      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |  -.0944352     .03402    -2.78   0.006    -.1611131   -.0277572
      rating |   .1793759    .126895     1.41   0.157    -.0693338    .4280855
    distance |  -.1745797   .0433352    -4.03   0.000    -.2595152   -.0896443
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0287502   .0116242    -2.47   0.013    -.0515332   -.0059672
    incFancy |   .0458373   .0089109     5.14   0.000     .0283722    .0633024
     kidFast |  -.0704164   .1394359    -0.51   0.614    -.3437058    .2028729
    kidFancy |  -.3626381   .1171277    -3.10   0.002    -.5922041   -.1330721
-------------+----------------------------------------------------------------
(IV params)  |
type         |
       /fast |   5.715758   2.332871     2.45   0.014     1.143415     10.2881
     /family |   1.721222   1.152002     1.49   0.135    -.5366608    3.979105
      /fancy |   1.466588   .4169075     3.52   0.000     .6494642    2.283711
------------------------------------------------------------------------------
LR test of homoscedasticity (iv = 1): chi2(3) =  9.90   Prob > chi2 = 0.0194
In this model,

    Pr(restaurant | type) = Pr(β_cost cost + β_rating rating + β_dist distance)

    Pr(type) = Pr(α_incFast incFast + α_incFancy incFancy
                  + α_kidFast kidFast + α_kidFancy kidFancy
                  + τ_fast IV_fast + τ_family IV_family + τ_fancy IV_fancy)

The LR test against the constant-only model indicates that the model is significant (p-value = 0.0000). The inclusive value parameters for fast, family, and fancy are 5.715758, 1.721222, and 1.466588,
respectively.

The LR test reported at the bottom of the table is a test for the nesting (heteroskedasticity) against the null hypothesis of homoskedasticity. Computationally, it is the comparison of the likelihood of a nonnested clogit model against the nested logit model likelihood. The chi-squared value of 9.90 clearly supports the use of the nested logit model with these data.

Example

Continuing with the above example, we fix all three inclusive value parameters to be 1 to recover the model estimated by clogit.

. nlogit
chosen (restaurant = cost rating distance) (type = incFast incFancy
> kidFast kidFancy), group(family_id) ivc(fast=1, family=1, fancy=1) nolog
> notree

User defined constraint(s):
 1000:  [fast]_cons = 1
  999:  [family]_cons = 1
  998:  [fancy]_cons = 1

Nested logit estimation                             Number of obs =      2100
Levels             =         2                      LR chi2(7)    =  189.7294
Dependent variable =    chosen                      Prob > chi2   =    0.0000
Log likelihood     = -488.90834

      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |  -.1367799   .0358479    -3.82   0.000    -.2070404   -.0665193
      rating |   .3066622   .1418291     2.16   0.031     .0286823     .584642
    distance |  -.1977505   .0471653    -4.19   0.000    -.2901927   -.1053082
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0390183   .0094018    -4.15   0.000    -.0574455   -.0205911
    incFancy |   .0407053   .0080405     5.06   0.000     .0249462    .0564644
     kidFast |  -.2398757   .1063674    -2.26   0.024     -.448352   -.0313994
    kidFancy |  -.3893862   .1143797    -3.40   0.001    -.6135662   -.1652061
-------------+----------------------------------------------------------------
(IV params)  |
type         |
       /fast |          1
     /family |          1
      /fancy |          1
------------------------------------------------------------------------------
LR test of homoscedasticity (iv = 1): chi2(0) =  0.00   Prob > chi2 =      .

. clogit chosen cost rating distance incFast incFancy kidFast kidFancy,
> group(family_id)

Iteration 0:  log likelihood = -564.57856
Iteration 1:  log likelihood = -496.41546
Iteration 2:  log likelihood = -489.35097
Iteration 3:  log likelihood = -488.91205
Iteration 4:  log likelihood = -488.90834
Iteration 5:  log likelihood = -488.90834

Conditional (fixed-effects) logistic regression     Number of obs =      2100
                                                    LR chi2(7)    =    189.73
                                                    Prob > chi2   =    0.0000
Log likelihood = -488.90834                         Pseudo R2     =    0.1625

      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cost |  -.1367799   .0358479    -3.82   0.000    -.2070404   -.0665193
      rating |   .3066622   .1418291     2.16   0.031     .0286823     .584642
    distance |  -.1977505   .0471653    -4.19   0.000    -.2901927   -.1053082
     incFast |  -.0390183   .0094018    -4.15   0.000    -.0574455   -.0205911
    incFancy |   .0407053   .0080405     5.06   0.000     .0249462    .0564644
     kidFast |  -.2398757   .1063674    -2.26   0.024     -.448352   -.0313994
    kidFancy |  -.3893862   .1143797    -3.40   0.001    -.6135662   -.1652061
------------------------------------------------------------------------------
Obtaining predicted values

predict may be used after nlogit to obtain the predicted values of the probabilities, the conditional probabilities, the linear predictions, and the inclusive values for each level of the nested logit model. Predicted probabilities for nlogit must be interpreted carefully. Probabilities are estimated for each group as a whole and not for individual observations.
Example

Continuing with our model with no constraints, we can predict pb = Pr(restaurant); pl = Pr(type); condpb = Pr(restaurant | type); xbb, the linear prediction for the bottom-level alternatives; xbl, the linear prediction for the first-level alternatives; and the inclusive values ivb for the bottom-level alternatives.

. quietly nlogit chosen (restaurant = cost rating distance) (type = incFast
> incFancy kidFast kidFancy), group(family_id) nolog
. predict pb
(option pb assumed; Pr(mode))
. predict pl, pl
. predict condpb, condpb
. predict xbb, xbb
. predict xbl, xbl
. predict ivb, ivb
. list id chosen pb pl condpb in 1/14
      id   chosen         pb         pl     condpb
  1.   1        1   .0831245   .1534534   .5416919
  2.   1        0    .070329   .1534534   .4583081
  3.   1        0   .2763391   .7266538   .3802899
  4.   1        0    .284375   .7266538   .3913486
  5.   1        0   .1659397   .7266538   .2283615
  6.   1        0   .0399215   .1198928   .3329766
  7.   1        0   .0799713   .1198928   .6670234
  8.   2        0     .01176   .0286579   .4103599
  9.   2        0   .0168978   .0286579   .5896401
 10.   2        0   .2942401   .7521651   .3911909
 11.   2        1   .2975767   .7521651   .3956268
 12.   2        0   .1603483   .7521651   .2131824
 13.   2        0   .1277234    .219177    .582741
 14.   2        0   .0914536    .219177    .417259
. list id chosen xbb xbl ivb in 1/14

      id   chosen        xbb         xbl         ivb
  1.   1        1   -.731619   -1.191674   -.1185611
  2.   1        0  -.8987747   -1.191674   -.1185611
  3.   1        0  -1.149417           0   -.1825957
  4.   1        0  -1.120752           0   -.1825957
  5.   1        0  -1.659421           0   -.1825957
  6.   1        0  -3.514237    1.425016   -2.414554
  7.   1        0  -2.819484    1.425016   -2.414554
  8.   2        0   -1.22427   -1.878761   -.3335493
  9.   2        0  -.8617923   -1.878761   -.3335493
 10.   2        0  -1.239346           0   -.3007865
 11.   2        1   -1.22807           0   -.3007865
 12.   2        0  -1.846394           0   -.3007865
 13.   2        0  -2.804756    1.570648   -2.264743
 14.   2        0  -3.138791    1.570648   -2.264743
Saved Results

nlogit saves in e():

Scalars
    e(N)          number of observations
    e(k_eq)       number of equations
    e(levels)     depth of the model
    e(rc)         return code
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(df_mc)      model degrees of freedom for comparison test
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, clogit model
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p)          p-value for model chi-squared test
    e(p_c)        p-value for comparison test
    e(ic)         number of iterations
    e(rank)       rank of e(V)

Macros
    e(cmd)        nlogit
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(group)      name of group() variable
    e(wtype)      weight type
    e(wexp)       weight expression
    e(vcetype)    covariance estimation method
    e(level#)     alternative-set variable for level #
    e(user)       name of likelihood-evaluator program
    e(opt)        type of optimization
    e(chi2type)   LR; type of model chi-squared test
    e(cnslist)    constraint numbers
    e(iv_names)   names of parameters for inclusive values
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(ilog)       iteration log (up to 20 iterations)
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

nlogit is implemented as an ado-file.

Greene (2000, 865-871) and Maddala (1983, 67-70) provide introductions to the nested logit model.

We will present the methods and formulas for a three-level nested logit model. The extension of this model to cases involving more levels of a tree is apparent, but is more complicated.

Using the same notation as Greene (2000), we index the first-level alternative as i, the second-level alternative as j, and the bottom-level alternative as k. Let X_ijk, Y_ij, and Z_i refer to the vectors of explanatory variables specific to categories (i,j,k), (i,j), and (i), respectively. We write

    Pr_ijk = Pr_{k|ij} Pr_{j|i} Pr_i

The conditional probability Pr_{k|ij} will involve only the parameters beta:

    Pr_{k|ij} = exp(beta'X_ijk) / sum_n exp(beta'X_ijn)

We define the inclusive value for category (i,j) as

    I_ij = ln{ sum_n exp(beta'X_ijn) }

and then

    Pr_{j|i} = exp(alpha'Y_ij + tau_ij I_ij) / sum_m exp(alpha'Y_im + tau_im I_im)

Define the inclusive value for category (i) as

    J_i = ln{ sum_m exp(alpha'Y_im + tau_im I_im) }

and then

    Pr_i = exp(gamma'Z_i + delta_i J_i) / sum_l exp(gamma'Z_l + delta_l J_l)

If we restrict all the tau_ij and delta_i to be 1, we then recover the conditional logit model of the following form:

    Pr_ijk = exp(V_ijk) / sum_{l,m,n} exp(V_lmn)    where    V_ijk = beta'X_ijk + alpha'Y_ij + gamma'Z_i

There are two ways to estimate the nested logit model: sequential estimation and full information maximum likelihood estimation. nlogit estimates the model using the full information maximum likelihood method. If g = 1, 2, ..., G denotes the groups, and Pr^g_ijk is the probability of category (i,j,k) being the positive outcome in group g, the log likelihood of the nested logit model is

    ln L = sum_g ln(Pr^g_ijk) = sum_g ( ln Pr^g_{k|ij} + ln Pr^g_{j|i} + ln Pr^g_i )
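The decomposition above can be checked numerically. The sketch below is Python rather than Stata and is not part of the original entry; the function name and toy utilities are hypothetical. It builds a two-level version of the same product Pr(k|nest) * Pr(nest) from inclusive values, and with every inclusive-value parameter set to 1 it collapses to ordinary conditional logit, as the text states.

```python
import math

def nested_logit_probs(v_bottom, tau=1.0):
    """v_bottom maps each nest to a list of bottom-level utilities
    (the beta'X terms).  Returns Pr(nest, alternative) computed as
    Pr(k | nest) * Pr(nest), where Pr(nest) is driven by tau times
    the inclusive value I_nest = ln(sum of exp(utilities))."""
    incl, cond = {}, {}
    for nest, vs in v_bottom.items():
        denom = sum(math.exp(v) for v in vs)
        incl[nest] = math.log(denom)                      # inclusive value
        cond[nest] = [math.exp(v) / denom for v in vs]    # Pr(k | nest)
    top_denom = sum(math.exp(tau * incl[n]) for n in v_bottom)
    top = {n: math.exp(tau * incl[n]) / top_denom for n in v_bottom}
    return {(n, k): top[n] * cond[n][k]
            for n in v_bottom for k in range(len(v_bottom[n]))}
```

With tau = 1, the probability of each alternative equals the plain multinomial logit probability over the pooled utilities, which is exactly the "recover the conditional logit model" restriction described above.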
References
Amemiya, T. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Hausman, J. 1978. Specification tests in econometrics. Econometrica 46: 1251-1271.
Hausman, J. and D. McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52: 1219-1240.
Maddala, G. S. 1983. Limited-dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
McFadden, D. 1977. Quantitative methods for analyzing travel behavior of individuals: some recent developments. Cowles Foundation Discussion Paper no. 474.
McFadden, D. 1981. Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, pp. 198-272. Cambridge, MA: MIT Press.
Also See
Complementary:  [R] lincom, [R] lrtest, [R] predict, [R] test, [R] testnl, [R] xi
Related:        [R] clogit, [R] logistic, [R] logit, [R] mlogit
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [R] maximize
Title
notes -- Place notes in data

Syntax
    notes [evarname] : text
    notes [evarlist] [in #[/#]]
    notes drop evarlist [in #[/#]]

where evarlist is a varlist but may also contain the word _dta, and # is a number or the letter l.

If text includes the letters TS surrounded by blanks, the TS is removed and a time stamp is substituted in its place.
Description
notes attaches notes to the dataset in memory. These notes become a part of the dataset; they are saved when the dataset is saved and retrieved when the dataset is used; see [R] save. notes can be attached generically to the dataset or specifically to a variable within the dataset.

Remarks
A note is nothing formal; it is merely a string of text--probably words in your native language--reminding you to do something, cautioning you against something, or anything else you might feel like jotting down. People who work with real data invariably end up with paper notes plastered around their terminal saying things like "Send the new sales data to Bob" or "Check the income variable in salary95; I don't believe it" or "The gender dummy was significant!" It would be better if these notes were attached to the dataset. Attached to the terminal, they tend to fall off and get lost.

Adding a note to your dataset requires typing note or notes (they are synonyms), a colon (:), and whatever you want to remember. The note is added to the dataset currently in memory.
. note: Send copy to Bob once verified.

You can display your notes by typing notes (or note) by itself.

. notes
_dta:
  1.  Send copy to Bob once verified.

Once you resave your data, you can replay the note in the future, too. You add more notes just as you did the first:

. note: Mary wants a copy, too.
. notes
_dta:
  1.  Send copy to Bob once verified.
  2.  Mary wants a copy, too.
You can place time stamps on your notes by placing the word TS (in capitals) in the text of your note:

. note: TS merged updates from JJ&F
. notes
_dta:
  1.  Send copy to Bob once verified.
  2.  Mary wants a copy, too.
  3.  19 Jul 2000 15:38 merged updates from JJ&F

The notes we have added so far are attached to the dataset generically, which is why Stata prefixes them with _dta when it lists them. You can attach notes to variables:

. note mpg: is the 44 a mistake? Ask Bob.
. note mpg: what about the two missing values?
. notes
_dta:
  1.  Send copy to Bob once verified.
  2.  Mary wants a copy, too.
  3.  19 Jul 2000 15:38 merged updates from JJ&F
mpg:
  1.  is the 44 a mistake? Ask Bob.
  2.  what about the two missing values?

Up to 9,999 generic notes can be attached to _dta and another 9,999 notes can be attached to each variable.
Selectively listing notes

notes by itself lists all the notes. In full syntax, notes is equivalent to typing notes _all in 1/l. Here are some variations:

    notes _dta           list all generic notes
    notes mpg            list all notes for variable mpg
    notes _dta mpg       list all generic notes and mpg notes
    notes _dta in 3      list generic note 3
    notes _dta in 3/5    list generic notes 3 through 5
    notes mpg in 3/5     list mpg notes 3 through 5
    notes _dta in 3/l    list generic notes 3 through last
Deleting notes

notes drop works much like listing notes, except that typing notes drop by itself does not delete all notes; type notes drop _all. Some variations:

    notes drop _dta         delete all generic notes
    notes drop _dta in 3    delete generic note 3
    notes drop _dta in 3/5  delete generic notes 3 through 5
    notes drop _dta in 3/l  delete generic notes 3 through last
    notes drop mpg in 4     delete mpg note 4
"
_
T
.......
!
............................
._ .......................................................
_
-_ .i ¸
i
_:
notes -- Place notes in data
" 445
Warnings

1. Notes are stored with the data and, as with other updates you make to the data, the additions and deletions are not permanent until you save the data; see [R] save.

2. The maximum length of a single note is 1,000 characters with Small Stata and 67,784 characters with Intercooled Stata.
Methods and Formulas

notes is implemented as an ado-file.
References
Gleason, J. R. 1998. dm57: A notes editor for Windows and Macintosh. Stata Technical Bulletin 43: 6-9. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 10-13.

Also See
Complementary:  [R] describe, [R] save
Related:        [R] codebook
Background:     [U] 15.8 Characteristics
Title
nptrend -- Test for trend across ordered groups

Syntax
    nptrend varname [if exp] [in range], by(groupvar) [ nodetail score(scorevar) ]

Description
nptrend performs a nonparametric test for trend across ordered groups.

Options
by(groupvar) is not optional; it specifies the group on which the data are to be ordered.

nodetail suppresses the listing of group rank sums.

score(scorevar) defines scores for groups. When not specified, the values of groupvar are used for the scores.
Remarks
nptrend performs the nonparametric test for trend across ordered groups developed by Cuzick (1985), which is an extension of the Wilcoxon rank-sum test (ranksum; see [R] signrank). A correction for ties is incorporated into the test. nptrend is a useful adjunct to the Kruskal-Wallis test; see [R] kwallis.
In addition to nptrend, for nongrouped data the signtest and spearman commands can be useful; see [R] signrank and [R] spearman. The Cox and Stuart test, for instance, applies the sign test to differences between equally spaced observations of varname. The Daniels test calculates Spearman's rank correlation of varname with a time index. Under appropriate conditions, the Daniels test is more powerful than the Cox and Stuart test. See Conover (1999) for a discussion of these tests and their asymptotic relative efficiency.

Example
The following data (Altman 1991, 217) show ocular exposure to ultraviolet radiation for 32 pairs of sunglasses classified into 3 groups according to the amount of visible light transmitted.

    Group   Transmission of    Ocular exposure to ultraviolet radiation
            visible light
      1     < 25%              1.4  1.4  1.4  1.6  2.3  2.3
      2     25 to 35%          0.9  1.0  1.1  1.1  1.2  1.2  1.5  1.9  2.2
                               2.6  2.6  2.6  2.8  2.8  3.2  3.5  4.3  5.1
      3     > 35%              0.8  1.7  1.7  1.7  3.4  7.1  8.9  13.5

Entering these data into Stata, we have
. list exposure group

        exposure   group
   1.        1.4       1
   2.        1.4       1
   3.        1.4       1
   4.        2.3       1
   5.        2.3       1
       (output omitted)
  31.        8.9       3
  32.       13.5       3

We use nptrend to test for a trend of (increasing) exposure across the 3 groups by typing

. nptrend exposure, by(group)

      group   score    obs   sum of ranks
          1       1      6             76
          2       2     18            290
          3       3      8            162

           z  =  1.52
   Prob > |z| =  0.13

When the groups are given any equally spaced scores (such as -1, 0, 1), we will obtain the same answer as above. To illustrate the effect of changing scores, an analysis of these data with scores 1, 2, and 5 (admittedly not very sensible in this case) produces

. gen mysc = cond(group==3,5,group)
. nptrend exposure, by(group) score(mysc)

      group   score    obs   sum of ranks
          1       1      6             76
          2       2     18            290
          3       5      8            162

           z  =  1.46
   Prob > |z| =  0.14

This example suggests that the analysis is not all that sensitive to the scores chosen.
Technical Note
The grouping variable may be either a string variable or a numeric variable. If it is a string variable and no score variable is specified, the natural numbers 1, 2, 3, ... are assigned to the groups in the sort order of the string variable. This may not always be what you expect. For example, the sort order of the strings one, two, three is one, three, two.
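This gotcha is easy to see outside Stata as well. The toy Python helper below (hypothetical, merely mimicking the assignment rule described in this note) shows that the string value two receives score 3 while three receives score 2.

```python
def assigned_scores(names):
    """Natural numbers 1, 2, 3, ... assigned in the sort order of the strings."""
    order = sorted(set(names))
    return {name: order.index(name) + 1 for name in order}
```

For example, assigned_scores(["one", "two", "three"]) assigns one -> 1, three -> 2, two -> 3, which is probably not the ordering you intended.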
Saved Results

nptrend saves in r():

Scalars
    r(N)    number of observations        r(z)    z statistic
    r(p)    two-sided p-value             r(T)    test statistic
Methods and Formulas

nptrend is implemented as an ado-file.

nptrend is based on a method in Cuzick (1985). The following description of the statistic is from Altman (1991, 215-217). We have k groups of sample sizes n_i (i = 1, ..., k). The groups are given scores, l_i, which reflect their ordering, such as 1, 2, and 3. The scores do not have to be equally spaced, but they usually are. The total set of N = sum n_i observations are ranked from 1 to N, and the sums of the ranks in each group, R_i, are obtained. L, the weighted sum of all the group scores, is

    L = sum_{i=1}^{k} l_i n_i

The statistic T is calculated as

    T = sum_{i=1}^{k} l_i R_i

Under the null hypothesis, the expected value of T is E(T) = .5(N+1)L, and its standard error is

    se(T) = sqrt{ (N+1)/12 ( N sum_{i=1}^{k} l_i^2 n_i - L^2 ) }

so that the test statistic, z, is given by z = {T - E(T)}/se(T), which has an approximately standard normal distribution when the null hypothesis of no trend is true.

The correction for ties affects the standard error of T. Let N' be the number of unique values of the variable being tested (N' <= N).
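As a check on these formulas, the sketch below (Python rather than Stata, with a hypothetical function name) computes z from the group scores, sizes, and rank sums, ignoring the tie correction. Plugging in the sunglasses example from earlier in this entry reproduces the reported statistics to two decimal places.

```python
import math

def cuzick_z(groups, N):
    """z statistic for Cuzick's trend test, without the tie correction.
    groups: list of (score l_i, size n_i, rank sum R_i) tuples."""
    L = sum(l * n for l, n, _ in groups)
    T = sum(l * R for l, _, R in groups)
    expected = 0.5 * (N + 1) * L
    var = (N + 1) / 12.0 * (N * sum(l * l * n for l, n, _ in groups) - L * L)
    return (T - expected) / math.sqrt(var)
```

With scores 1, 2, 3 this gives z = 1.52, and with scores 1, 2, 5 it gives z = 1.46, matching the nptrend output shown above (the tie correction is negligible for these data).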
ologit -- Maximum-likelihood ordered logit estimation

------------------------------------------------------------------------------
       rep77 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   1.455878   .5308946     2.74   0.006     .4153436    2.496412
-------------+----------------------------------------------------------------
       _cut1 |  -2.765562   .5988207          (Ancillary parameters)
       _cut2 |  -.9963603   .3217704
       _cut3 |   .9426153   .3136396
       _cut4 |   3.123351   .5423237
------------------------------------------------------------------------------

The cut points partition the linear prediction into the observed categories:

    rep77         Probability
    Poor          Pr( xb+u < _cut1 )
    Fair          Pr( _cut1 < xb+u < _cut2 )
    Average       Pr( _cut2 < xb+u < _cut3 )
    Good          Pr( _cut3 < xb+u < _cut4 )
    Excellent     Pr( _cut4 < xb+u )
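These cut points can be turned into category probabilities directly. The following sketch (a hypothetical Python helper, not part of the entry) applies the logistic distribution of u to estimated cut points, computing Pr(cut_{i-1} < xb + u <= cut_i) for each category; the cut values in the test are those from the rep77 table above.

```python
import math

def logistic_cdf(t):
    """CDF of the standard logistic distribution; handles +/-inf safely."""
    return 1.0 / (1.0 + math.exp(-t))

def ordered_logit_probs(xb, cuts):
    """Category probabilities Pr(cut_{i-1} < xb + u <= cut_i)."""
    bounds = [-math.inf] + sorted(cuts) + [math.inf]
    cdf = [logistic_cdf(b - xb) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]
```

Evaluating at xb = 1.455878 (foreign cars) versus xb = 0 (domestic cars) shows the probability mass shifting toward the higher repair-record categories, as the positive foreign coefficient implies.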
If we define odds(k) = Pr(xb + u > k) / Pr(xb + u <= k), then odds(k1) and odds(k2) have the same ratio for all independent variable combinations. The model is based on the principle that the only effect of combining adjoining categories in ordered categorical regression problems should be a loss of efficiency in the estimation of the regression parameters (McCullagh 1980). This model was also described by Zavoina and McKelvey (1975), and previously by Aitchison and Silvey (1957) in a different algebraic form. Brant (1990) offers a set of diagnostics for the model. Peterson and Harrell (1990) suggest a model that allows nonproportional odds for a subset of the explanatory variables. ologit does not allow this, but a similar model was implemented by Fu (1998).

The stereotype model rejects the principle on which the ordered logit model is based. Anderson (1984) argues that there are two distinct types of ordered categorical variables: "grouped continuous", like income, where the "type a" model applies; and "assessed", like extent of pain relief, where the stereotype model applies. Greenland (1985) independently developed the same model. The stereotype model starts with a multinomial logistic regression model and imposes constraints on this model.

Goodness of fit for ologit can be evaluated by comparing the likelihood value with that obtained by estimating the model with mlogit. Let L1 be the log-likelihood value reported by ologit and let L0 be the log-likelihood value reported by mlogit. If there are p independent variables (excluding the constant) and c categories, mlogit will fit p(c-2) additional parameters. One can then perform a "likelihood-ratio test", i.e., calculate -2(L1 - L0), and compare it to chi-squared with p(c-2) degrees of freedom. This test is only suggestive because the ordered logit model is not nested within the multinomial logit model. A large value of -2(L1 - L0) should, however, be taken as evidence of poorness of fit. Marginally large values, on the other hand, should not be taken too seriously.

The coefficients and cut points are estimated using maximum likelihood as described in [R] maximize. In our parameterization, no constant appears, as the effect is absorbed into the cut points.

ologit and oprobit begin by tabulating the dependent variable. Category i = 1 is defined as the minimum value of the variable, i = 2 as the next ordered value, and so on, for the empirically determined c categories. The probability of observing an observation with outcome i is

    Pr(outcome = i) = Pr( k_{i-1} < xb + u <= k_i )

where k_0 = -infinity, k_c = +infinity, and k_1, ..., k_{c-1} are the estimated cut points.
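The degrees of freedom for the suggestive comparison with mlogit come from simple parameter counting. The short sketch below (hypothetical helper names) makes the arithmetic explicit.

```python
def ologit_nparams(p, c):
    """p slope coefficients plus c-1 cut points."""
    return p + (c - 1)

def mlogit_nparams(p, c):
    """c-1 equations, each with p slopes and a constant."""
    return (c - 1) * (p + 1)

def lr_df(p, c):
    """Extra parameters fitted by mlogit relative to ologit: p*(c-2)."""
    return mlogit_nparams(p, c) - ologit_nparams(p, c)
```

For example, with p = 3 regressors and c = 5 categories, mlogit fits 16 parameters against ologit's 7, so the comparison uses 9 = 3(5-2) degrees of freedom.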
Example
Assume that we have one subject and are interested in determining the drug profile for that subject. A reasonable experiment would be to give the subject the drug and then measure the concentration of the drug in the subject's blood over a time period. For example, here is a dataset from Chow and Liu (2000):

. list

         time   concentration
  1.        0               0
  2.        1             2.8
  3.      1.5             4.4
  4.        2             4.4
  5.        3             4.7
  6.        4             4.1
  7.        8             3.6
  8.       12               3
  9.       16             2.5
 10.       24            1.62

Examining these data, we notice that the concentration quickly increases, plateaus for a short period, and then slowly decreases over time. pkexamine is used to calculate the pharmacokinetic measures of interest; pkexamine is explained in detail in [R] pkexamine. The output is
Title
pk -- Pharmacokinetic (biopharmaceutical) data
Description
The term pk refers to pharmacokinetic data and the commands, all of which begin with the letters pk, designed to do some of the analyses commonly performed in the pharmaceutical industry. The system is intended for the analysis of pharmacokinetic data, although some of the commands are of general use.

The pk commands are

    pkexamine    [R] pkexamine    Calculate pharmacokinetic measures
    pksumm       [R] pksumm       Summarize pharmacokinetic data
    pkshape      [R] pkshape      Reshape (pharmacokinetic) Latin-square data
    pkcross      [R] pkcross      Analyze crossover experiments
    pkequiv      [R] pkequiv      Perform bioequivalence tests
    pkcollapse   [R] pkcollapse   Generate pharmacokinetic measurement dataset
Remarks
Several types of clinical trials are commonly performed in the pharmaceutical industry. Examples include combination trials, multicenter trials, equivalence trials, and active control trials. For each type of trial, there is an optimal study design for estimating the effects of interest. Currently, the pk system can be used to analyze equivalence trials. These trials are usually conducted using a crossover design; however, it is possible to use a parallel design and still draw conclusions about equivalence.

The goal of an equivalence trial is the assessment of bioequivalence between two drugs. While it is impossible to prove that two drugs behave exactly the same, the United States Food and Drug Administration believes that if the absorption properties of two drugs are similar, then the two drugs will produce similar effects and have similar safety profiles. Generally, the goal of an equivalence trial is to assess the equivalence of a generic drug with an existing drug. This is commonly accomplished by comparing a confidence interval about the difference between a pharmacokinetic measurement of two drugs with a confidence limit constructed from U.S. federal regulations. If the confidence interval is entirely within the confidence limit, the drugs are declared bioequivalent. An alternative approach to the assessment of bioequivalence is to use the method of interval hypothesis testing. pkequiv is used to conduct these tests of bioequivalence.

There are several pharmacokinetic measures that can be used to ascertain how available a drug is for cellular absorption. The most common measure is the area under the time-versus-concentration curve (AUC). Another common measure of drug availability is the maximum concentration (Cmax) achieved by the drug during the follow-up period. Stata reports these and other less common measures of drug availability, including the time at which the maximum drug concentration was observed and the duration of the period during which the subject was being measured. Stata also reports the elimination rate, that is, the rate at which the drug is metabolized, and the drug's half-life, that is, the time it takes for the drug concentration to fall to one-half of its maximum concentration.
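In its simplest form, the AUC mentioned above is just a trapezoidal sum over the time-concentration profile. The sketch below (Python, hypothetical function name; pkexamine itself computes more refined variants, including extrapolation to infinity) illustrates AUC from the first to the last observation, along with Cmax and the time at which it occurs.

```python
def pk_measures(times, conc):
    """Trapezoidal AUC over the observed follow-up period, plus the
    maximum concentration (Cmax) and the time it occurred (Tmax)."""
    auc = sum((t1 - t0) * (c0 + c1) / 2.0
              for t0, t1, c0, c1 in zip(times, times[1:], conc, conc[1:]))
    cmax = max(conc)
    tmax = times[conc.index(cmax)]
    return auc, cmax, tmax
```

For a triangular profile rising from 0 to 2 at t = 1 and back to 0 at t = 2, the trapezoidal AUC is 2.0, Cmax is 2, and Tmax is 1.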
pergram -- Periodogram

Also See
Complementary:  [R] tsset
Related:        [R] corrgram, [R] cumsp, [R] wntestb
Background:     Stata Graphics Manual
Methods and Formulas

Let x_(1) <= x_(2) <= ... <= x_(n) be the ordered values, let w_(j) be the weight of the jth ordered value, let W_(i) = sum_{j<=i} w_(j) be the running sum of weights, and let P = (p/100) W_(n). The pth percentile is then

    x[p] = { ( x_(i-1) + x_(i) ) / 2    if W_(i-1) = P
           {   x_(i)                    otherwise

When the option altdef is specified, the following alternative definition is used. In this case, weights are not allowed. Let i be the integer floor of (n+1)p/100; i.e., i is the largest integer i <= (n+1)p/100. Let h be the remainder h = (n+1)p/100 - i. The pth percentile is then

    x[p] = (1 - h) x_(i) + h x_(i+1)

where x_(0) is taken to be x_(1) and x_(n+1) is taken to be x_(n).
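The altdef interpolation can be written out in a few lines. This is an illustrative Python transcription of the formula just given (not the ado-code), with the end conditions x_(0) = x_(1) and x_(n+1) = x_(n) handled by clamping the indices.

```python
def percentile_altdef(values, p):
    """altdef rule: i = floor((n+1)p/100), h = (n+1)p/100 - i,
    x[p] = (1-h)*x(i) + h*x(i+1), clamped at the sample extremes."""
    x = sorted(values)
    n = len(x)
    a = (n + 1) * p / 100.0
    i = int(a)                      # integer floor of (n+1)p/100
    h = a - i                       # fractional remainder
    lo = x[min(max(i, 1), n) - 1]   # x(i), clamped into 1..n
    hi = x[min(i + 1, n) - 1]       # x(i+1), clamped at x(n)
    return (1 - h) * lo + h * hi
```

For the data 1, 2, 3, 4, the altdef median is (1 - .5)*2 + .5*3 = 2.5, since (n+1)p/100 = 2.5 gives i = 2 and h = .5.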
xtile produces the categories

    (-infinity, x[p_1]], (x[p_1], x[p_2]], ..., (x[p_{m-1}], +infinity)

numbered 1, 2, ..., m.
pctile -- Create variable containing percentiles

_pctile

_pctile is a programmer's command. It computes percentiles and stores them in r(); see [U] 21.8 Accessing results calculated by other programs. You can use _pctile to compute quantiles just as you can with pctile:

. _pctile weight, nq(10)
. return list
scalars:
             r(r1) =  2020
             r(r2) =  2160
             r(r3) =  2520
             r(r4) =  2730
             r(r5) =  3190
             r(r6) =  3310
             r(r7) =  3420
             r(r8) =  3700
             r(r9) =  4060

_pctile is, however, limited to computing 21 quantiles since there are only 20 r()s to hold the results. The percentiles() option (abbreviation p()) can be used to compute any odd percentile you wish:

. _pctile weight, p(10, 33.333, 45, 50, 55, 66.667, 90)
. return list
scalars:
             r(r1) =  2020
             r(r2) =  2640
             r(r3) =  2830
             r(r4) =  3190
             r(r5) =  3250
             r(r6) =  3400
             r(r7) =  4060

_pctile, pctile, and xtile each have an option that uses an alternative definition of percentiles, based on an interpolation scheme; see Methods and Formulas below.

. _pctile weight, p(10, 33.333, 45, 50, 55, 66.667, 90) altdef
. return list
scalars:
             r(r1) =  2005
             r(r2) =  2639.985
             r(r3) =  2830
             r(r4) =  3190
             r(r5) =  3252.5
             r(r6) =  3400.005
             r(r7) =  4060

The default formula inverts the empirical distribution function. We believe that the default formula is more commonly used, although some consider the "alternative" formula to be the standard definition. One drawback of the alternative formula is that it does not have an obvious generalization to noninteger weights.
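For comparison, here is a minimal Python rendering of the default rule for the unweighted case (hypothetical function name): invert the empirical distribution function, averaging adjacent order statistics where it is flat.

```python
import math

def percentile_default(values, p):
    """Unweighted default rule: with P = n*p/100, return the average
    (x(P) + x(P+1))/2 when P is an integer, else x(ceil(P))."""
    x = sorted(values)
    n = len(x)
    P = n * p / 100.0
    if P == int(P) and 1 <= P < n:
        i = int(P)
        return (x[i - 1] + x[i]) / 2.0
    return x[min(max(math.ceil(P), 1), n) - 1]
```

For 1, 2, 3, 4 the default median averages the two middle values to give 2.5; for 1, 2, 3 it returns the middle value 2. Note the contrast with the altdef interpolation shown in Methods and Formulas, which can return values between order statistics even when the distribution function is not flat.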
  6.  120      1          3
  7.  120      1          3
  8.  120      1          3
  9.  125      1          4
 10.  130      1          4
 11.  132      1          4
 12.   93      0          1
(output omitted)
110.  136      0          4
Technical Note
In the last example, if we wanted to categorize only cases, we could have issued the command

. xtile category = bp if case==1, cut(pct)

Most Stata commands follow the logic that using an if exp is equivalent to dropping the observations that do not satisfy the expression and running the command. This is not true of xtile when the cutpoints() option is used. (When the cutpoints() option is not used, the standard logic is true.) This is because xtile will use all nonmissing values of the cutpoints() variable, regardless of whether those values belong to observations that satisfy the if expression.

If you do not want to use all the values in the cutpoints() variable as cutpoints, simply set the ones that you do not need to missing. xtile does not care about the order of the values or whether they are separated by missing values.
Technical Note
Note that quantiles are not always unique. If we categorize our blood pressure data on the basis of quintiles rather than quartiles, we get

. pctile pct = bp, nq(5) genp(percent)
. xtile quint = bp, nq(5)
. list bp quint pct percent

       bp   quint   pct   percent
  1.   98       1   104        20
  2.  100       1   120        40
  3.  104       1   120        60
  4.  110       2   125        80
  5.  120       2
  6.  120       2
  7.  120       2
  8.  120       2
  9.  125       4
 10.  130       5
 11.  132       5

The 40th and 60th percentiles are the same; they are both 120. When two (or more) percentiles are the same, they are given the lower category number.
. xtile category = bp, cut(class)
. list bp class category

       bp   class   category
  1.   98     100          1
  2.  100     110          1
  3.  104     120          2
  4.  110     130          2
  5.  120       .          3
  6.  120       .          3
  7.  120       .          3
  8.  120       .          3
  9.  125       .          4
 10.  130       .          4
 11.  132       .          5

The cutpoints can, of course, come from anywhere. They can be the quantiles of another variable or the quantiles of a subgroup of the variable. For example, suppose we had a variable case that indicated whether an observation represented a case (case = 1) or a control (case = 0).

. list

       bp   case
  1.   98      1
  2.  100      1
  3.  104      1
  4.  110      1
  5.  120      1
  6.  120      1
  7.  120      1
  8.  120      1
  9.  125      1
 10.  130      1
 11.  132      1
 12.  116      0
 13.   93      0
 14.  115      0
(output omitted)
110.  113      0

We can categorize the cases based on the quantiles of the controls. To do this, we first generate a variable pct containing the percentiles of the controls' blood pressure data:

. pctile pct = bp if case==0, nq(4)
. list pct in 1/4

      pct
  1.  104
  2.  117
  3.  124
  4.    .

and then use these percentiles as cutpoints to classify bp for all subjects.

. xtile category = bp, cut(pct)
. gsort -case bp
. list bp case category

       bp   case   category
  1.   98      1          1
  2.  100      1          1
  3.  104      1          1
  4.  110      1          2
  5.  120      1          3
xtile can be used to create a variable quart that indicates the quartiles of bp.

. xtile quart = bp, nq(4)
. list bp quart

       bp   quart
  1.   98       1
  2.  100       1
  3.  104       1
  4.  110       2
  5.  120       2
  6.  120       2
  7.  120       2
  8.  120       2
  9.  125       3
 10.  130       4
 11.  132       4

The categories created are

    (-infinity, x[25]], (x[25], x[50]], (x[50], x[75]], (x[75], +infinity)

where x[25], x[50], and x[75] are, respectively, the 25th, 50th (median), and 75th percentiles of bp. We could use the pctile command to generate these percentiles:

. pctile pct = bp, nq(4) genp(percent)
. list bp quart percent pct

       bp   quart   percent   pct
  1.   98       1        25   104
  2.  100       1        50   120
  3.  104       1        75   125
  4.  110       2         .     .
  5.  120       2         .     .
  6.  120       2         .     .
  7.  120       2         .     .
  8.  120       2         .     .
  9.  125       3         .     .
 10.  130       4         .     .
 11.  132       4         .     .

xtile can categorize a variable based on any set of cutpoints, not just percentiles. Suppose that we wish to create the following categories for blood pressure:

    (-infinity, 100], (100, 110], (110, 120], (120, 130], (130, +infinity)

To do this, we simply create a variable containing the cutpoints

. input class
        class
  1. 100
  2. 110
  3. 120
  4. 130
  5. end

and then use xtile with the cutpoints() option.
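The cutpoint logic xtile applies is simple: an observation falls in category i when cut_{i-1} < value <= cut_i. A hypothetical Python sketch of that rule reproduces the category assignments of the class example shown in this entry.

```python
import bisect

def xtile_cut(values, cutpoints):
    """Category = 1 + number of cutpoints strictly below the value,
    i.e., the intervals (-inf, c1], (c1, c2], ..., (ck, +inf)."""
    cuts = sorted(cutpoints)
    return [bisect.bisect_left(cuts, v) + 1 for v in values]
```

bisect_left places a value equal to a cutpoint before it, which is exactly what makes each interval closed on the right: bp = 100 falls in category 1, while bp = 104 falls in category 2.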
Note that summarize, detail calculates standard percentiles.

. summarize mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs             74
25%           18             14       Sum of Wgt.     74

50%           20                      Mean       21.2973
                        Largest      Std. Dev.  5.785503
75%           25             34
90%           29             35       Variance   33.47205
95%           34             35       Skewness   .9487176
99%           41             41       Kurtosis   3.975005

But summarize, detail can only calculate these particular percentiles. The pctile, _pctile, and xtile commands let you compute any percentile.

Weights can be used with pctile, _pctile, and xtile:

. pctile pct = mpg [w=weight], nq(10) genp(percent)
(analytic weights assumed)
. list percent pct in 1/10

      percent   pct
  1.       10    14
  2.       20    16
  3.       30    17
  4.       40    18
  5.       50    19
  6.       60    20
  7.       70    22
  8.       80    24
  9.       90    28
 10.        .     .

The result is the same no matter which weight type you specify--aweight, fweight, or pweight.

xtile

xtile will create a categorical variable that contains categories corresponding to quantiles. We illustrate this with a simple example. Suppose we have a variable bp containing blood pressure measurements:

. list

       bp
  1.   98
  2.  100
  3.  104
  4.  110
  5.  120
  6.  120
  7.  120
  8.  120
  9.  125
 10.  130
 11.  132
cutpoints(varname) requests that xtile use the values of varname, rather than quantiles, as cutpoints for the categories. Note that all values of varname are used, regardless of any if or in restriction; see the technical note in the xtile section below.

percentiles(numlist) requests percentiles corresponding to the specified percentages. Percentiles are placed in r(r1), r(r2), ..., etc. For example, percentiles(10(20)90) requests that the 10th, 30th, 50th, 70th, and 90th percentiles be computed and placed into r(r1), r(r2), r(r3), r(r4), and r(r5). Up to 20 (inclusive) percentiles can be requested. See [U] 14.1.8 numlist for details on how to specify a numlist.
Remarks

pctile

pctile creates a new variable containing percentiles. You specify the number of quantiles that you want, and pctile computes the corresponding percentiles. Here we use Stata's auto dataset and compute the deciles of mpg:

. use auto
. pctile pct = mpg, nq(10)
. list pct in 1/10

      pct
  1.   14
  2.   17
  3.   18
  4.   19
  5.   20
  6.   22
  7.   24
  8.   25
  9.   29
 10.    .

The genp() option will generate another variable with the corresponding percentages, making it easier to distinguish between the percentiles:

. pctile pct = mpg, nq(10) genp(percent)
. list percent pct in 1/10

      percent   pct
  1.       10    14
  2.       20    17
  3.       30    18
  4.       40    19
  5.       50    20
  6.       60    22
  7.       70    24
  8.       80    25
  9.       90    29
 10.        .     .
Title
pctile -- Create variable containing percentiles

Syntax
    pctile [type] newvar = exp [weight] [if exp] [in range]
        [, nquantiles(#) genp(newvarp) altdef ]

    xtile newvar = exp [weight] [if exp] [in range]
        [, { nquantiles(#) | cutpoints(varname) } altdef ]

    _pctile varname [weight] [if exp] [in range]
        [, { nquantiles(#) | percentiles(numlist) } altdef ]

aweights, fweights, and pweights are allowed (see [U] 14.1.6 weight) except when the altdef option is specified, in which case no weights are allowed.

Description
pctile creates a new variable containing the percentiles of exp, where the expression exp is typically just another variable.

xtile creates a new variable that categorizes exp by its quantiles. If the cutpoints(varname) option is specified, it categorizes exp using the values of varname as category cutpoints. For example, varname might contain percentiles, generated by pctile, of another variable.

_pctile is a programmer's command. It computes up to 20 percentiles and places the results in r(); see [U] 21.8 Accessing results calculated by other programs.

Note that summarize, detail will compute some percentiles (1, 5, 10, 25, 50, 75, 90, 95, and 99th); see [R] summarize.

Options
nquantiles(#) specifies the number of quantiles. The command computes percentiles corresponding to percentages 100k/m for k = 1, 2, ..., m-1, where m = #. For example, nquantiles(10) requests that the 10th, 20th, ..., 90th percentiles be computed. The default is nquantiles(2); i.e., the median is computed.

genp(newvarp) specifies a new variable to be generated containing the percentages corresponding to the percentiles.

altdef uses an alternative formula for calculating percentiles. The default method is to invert the empirical distribution function using averages ((x_i + x_{i+1})/2) where the function is flat (the default is the same method used by summarize; see [R] summarize). The alternative formula uses an interpolation method. See Methods and Formulas at the end of this entry. Weights cannot be used when altdef is specified.
pcorr -- Partial correlation coefficients

Technical Note
Some caution is in order when interpreting the above results. As we said at the outset, the partial correlation coefficient is an attempt to estimate the correlation that would be observed if the other variables were held constant. The emphasis is on attempt. pcorr makes it too easy to ignore the fact that you are fitting a model. In the above example, the model is

    price = b0 + b1*mpg + b2*weight + b3*foreign + e

which is, in all honesty, a rather silly model. Even if we accept the implied economic assumptions of the model--that consumers value mpg, weight, and foreign--do we really believe that consumers place equal value on every extra 1,000 pounds of weight? That is, have we correctly parameterized the model? If we have not, then the estimated partial correlation coefficients may not represent what they claim to represent. Partial correlation coefficients are a reasonable way to summarize data after one is convinced that the underlying model is reasonable. One should not, however, pretend that there is no underlying model and that the partial correlation coefficients are unaffected by the assumptions and parameterization.
Methods and Formulas

pcorr is implemented as an ado-file.

Results are obtained by estimating a linear regression of varname1 on varlist; see [R] regress. The partial correlation coefficient between varname1 and each variable in varlist is then defined as

    t / sqrt(t^2 + n - k)

(Theil 1971, 174), where t is the t statistic, n the number of observations, and k the number of independent variables, including the constant but excluding any dropped variables. The significance is given by 2*ttail(n-k, abs(t)).
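The formula above can be verified by hand (a hypothetical sketch; the regression and stored results are standard Stata, but the particular variables are an illustrative choice):

```stata
* Hedged sketch: reproduce pcorr's partial correlation of price
* with mpg, holding weight and foreign constant.
use auto, clear
regress price mpg weight foreign

* t statistic for mpg, with n and k taken from the stored results
local t = _b[mpg]/_se[mpg]
local n = e(N)
local k = e(df_m) + 1      // independent variables including the constant

* Partial correlation and its significance, per the formula above
display `t'/sqrt(`t'^2 + `n' - `k')
display 2*ttail(`n' - `k', abs(`t'))
```

The two displayed values should match the Corr. and Sig. columns that pcorr reports for mpg.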
References

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.
Also See

Related: [R] correlate, [R] spearman
pcorr -- Partial correlation coefficients

Syntax

    pcorr varname1 varlist [weight] [if exp] [in range]

by ...: may be used with pcorr; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
Description

pcorr displays the partial correlation coefficient of varname1 with each variable in varlist, holding the other variables in varlist constant.

Remarks

Assume that y is determined by x1, x2, ..., xk. The partial correlation between y and x1 is an attempt to estimate the correlation that would be observed between y and x1 if the other x's did not vary.
> Example

Using our automobile dataset (described in [U] 9 Stata's on-line tutorials and sample datasets), the simple correlations between price, mpg, weight, and foreign can be obtained from correlate (see [R] correlate):

    . corr price mpg weight foreign
    (obs=74)

                 |    price      mpg   weight  foreign
    -------------+------------------------------------
           price |   1.0000
             mpg |  -0.4686   1.0000
          weight |   0.5386  -0.8072   1.0000
         foreign |   0.0487   0.3934  -0.5928   1.0000
Although correlate gave us the full correlation matrix, our interest is in just the first column. We find, for instance, that the higher the mpg, the lower the price. We obtain the partial correlation coefficients using pcorr:

    . pcorr price mpg weight foreign
    (obs=74)
    Partial correlation of price with

        Variable |    Corr.     Sig.
    -------------+------------------
             mpg |   0.0352    0.769
          weight |   0.5488    0.000
         foreign |   0.5402    0.000

We now find that, holding weight and foreign constant, the partial correlation of price with mpg is virtually zero. Similarly, in the simple correlations we found that price and foreign were virtually uncorrelated. In the partial correlations, holding mpg and weight constant, we find that price and foreign are positively correlated.
outsheet -- Write spreadsheet-style dataset
outsheet copies the data currently loaded in memory into the specified file. About all that can go wrong is that the file you specify already exists:

    . outsheet using tosend
    file tosend.out already exists
    r(602);

In that case, you can erase the file (see [R] erase), specify outsheet's replace option, or use a different filename. When all goes well, outsheet is silent:

    . outsheet using tosend, replace

If you are copying the data to a program other than a spreadsheet, remember to specify the nonames option:

    . outsheet using tosend, nonames replace
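A related sketch (hypothetical, not from the original entry): outsheet's comma option writes a comma-separated file, which most spreadsheet and statistics programs read directly. Assuming the auto dataset is loaded, and using auto.csv as an illustrative filename:

```stata
* Hedged sketch: write a comma-separated file with variable names
* in the first line (the default), then inspect it.
outsheet make mpg price using auto.csv, comma replace
type auto.csv
```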
Also See

Complementary: [R] insheet

Related: [R] outfile

> Example

Stata never writes over an existing file unless explicitly told to do so. For instance, if the file employee.raw already exists and you attempt to overwrite it by typing outfile using employee, here is what would happen:

    . outfile using employee
    file employee.raw already exists
    r(602);

You can tell Stata that it is okay to overwrite a file by specifying the replace option:

    . outfile using employee, replace
> Example

You have entered into Stata some data on seven employees in your firm. The data contain employee name, employee identification number, salary, and sex:

    . list

                name   empno   salary      sex
     1.   Carl Marks   57213   24,000     male
     2.  Irene Adler   47229   27,000   female
     3.   Adam Smith   57323   24,000     male
     4. David Wallis   57401   24,500     male
     5.  Mary Rogers   57802   27,000   female
     6. Carolyn Frank  57805   24,000   female
     7. Robert Lawson  57824   22,500     male

If you now wish to use a program other than Stata with these data, you must somehow get the data over to that other program. The standard Stata-format dataset created by save will not do the job--it is written in a special format that only Stata understands. Most programs, however, understand ASCII datasets--standard text datasets that are like those produced by a text editor. You can tell Stata to produce such a dataset using outfile. Typing outfile using employee creates a dataset called employee.raw that contains all the data. We can use the Stata type command to review the resulting file:

    . outfile using employee
    . type employee.raw
      "Carl Marks"      57213   24000     "male"
      "Irene Adler"     47229   27000   "female"
      "Adam Smith"      57323   24000     "male"
      "David Wallis"    57401   24500     "male"
      "Mary Rogers"     57802   27000   "female"
      "Carolyn Frank"   57805   24000   "female"
      "Robert Lawson"   57824   22500     "male"

We see that the file contains the four variables and that Stata has surrounded the string variables with double quotes.
Technical Note

outfile is careful to columnize the data in case you want to read it using formatted input. In the example above, the first string has a %-16s display format. Stata wrote two leading blanks and then placed the string in a 16-character field. outfile always right-justifies string variables, even when the display format requests left-justification.

The first number has a %9.0g format. The number is written as two blanks followed by the number, right-justified in a 9-character field. The second number has a %9.0gc format; outfile ignores the comma part of the format and also writes this number as two blanks followed by the number, right-justified in a 9-character field.

The last entry is really a numeric variable, but it has an associated value label. Its format is %-9.0g, so Stata wrote two blanks and then right-justified the value label in a 9-character field. Again, outfile right-justifies value labels even when the display format specifies left-justification.
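Because outfile columnizes the data, the file can be read back with free-format input. A hypothetical sketch (the infile variable list and string widths are illustrative assumptions that must match what outfile wrote):

```stata
* Hedged sketch: round-trip the employee data through an ASCII file.
* Assumes the employee data shown above are in memory.
outfile using employee, replace
infile str16 name empno salary str8 sex using employee.raw, clear
list
```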
outfile -- Write ASCII-format dataset

Syntax

    outfile [varlist] using filename [if exp] [in range] [, comma dictionary
        nolabel noquote replace wide ]

Description

outfile writes data to a disk file in ASCII format, a format that can be read by other programs. The new file is not in Stata format; see [R] save for instructions on saving data for subsequent use in Stata. The data saved by outfile can be read back by infile; see [R] infile.

If filename is specified without an extension, '.raw' is assumed unless the dictionary option is specified, in which case '.dct' is assumed.
Options

comma causes Stata to write the file in comma-separated-value format. In this format, values are separated by commas rather than blanks. Missing values are written as two consecutive commas.

dictionary writes the file in Stata's data dictionary format. See [R] infile (fixed format) for a description of dictionaries. Neither comma nor wide may be specified with dictionary.

nolabel causes Stata to write the numeric values of labeled variables. The default is to write the labels enclosed in double quotes.

noquote prevents Stata from placing double quotes around the contents of string variables.

replace permits outfile to overwrite an existing dataset. replace may not be abbreviated.

wide causes Stata to write the data with one observation per line. The default is to split observations into lines of 80 characters or fewer.
Remarks

outfile enables data to be sent to a disk file for processing by a non-Stata program. Each observation is written as one or more records that will not exceed 80 characters unless you specify the wide option. The values of the variables are written using their current display formats, and unless the comma option is specified, each is prefixed with two blanks. If you specify the dictionary option, the data are written in the same way, but in front of the data outfile writes a data dictionary describing the contents of the file.
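A hypothetical sketch of the dictionary option (filenames illustrative): the .dct file it produces describes the data and can be fed straight back to infile.

```stata
* Hedged sketch: write the data plus a data dictionary, then
* read the file back using the dictionary itself.
outfile using employee, dictionary replace
type employee.dct              // dictionary followed by the data
infile using employee, clear   // infile reads employee.dct
```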
orthpoly uses the Christoffel-Darboux recurrence formula (Abramowitz and Stegun 1968).

Both orthog and orthpoly normalize the orthogonal variables such that

    Q'WQ = MI

where W = diag(w1, w2, ..., wN) with w1, w2, ..., wN the weights (all 1 if weights are not specified), and M is the sum of the weights (the number of observations if weights are not specified).
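This normalization can be checked numerically (a hypothetical sketch; the variable and matrix names are illustrative). With no weights, Q'WQ reduces to Q'Q, and matrix accum, which appends a column of ones, computes exactly that cross product:

```stata
* Hedged sketch: with 74 unweighted observations, Q'Q should be
* (approximately) 74 times the identity matrix.
use auto, clear
orthog length weight, generate(q1 q2)
matrix accum QQ = q1 q2    // includes the constant column
matrix list QQ
```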
References

Abramowitz, M. and I. A. Stegun, ed. 1968. Handbook of Mathematical Functions, 7th printing. Washington, DC: National Bureau of Standards.

Golub, G. H. and C. F. Van Loan. 1996. Matrix Computations, 3d ed. Baltimore: Johns Hopkins University Press, pp. 218-219.

Sribney, W. M. 1995. sg37: Orthogonal polynomials. Stata Technical Bulletin 25: 17-18. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 96-98.

Also See

Related: [R] regress

Background: [U] 23 Estimation and post-estimation commands
Some of the correlations among the powers of weight are very large, but this does not create any problems for regress. However, we may wish to look at the quadratic trend with the constant removed, the cubic trend with the quadratic and constant removed, etc. orthpoly will generate polynomial terms with this property:

    . orthpoly weight, generate(pw*) deg(4) poly(P)
    . regress mpg pw1-pw4

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   36.06
           Model |  1652.73666     4  413.184164           Prob > F      =  0.0000
        Residual |  790.722803    69  11.4597508           R-squared     =  0.6764
    -------------+------------------------------           Adj R-squared =  0.6576
           Total |  2443.45946    73  33.4720474           Root MSE      =  3.3852

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             pw1 |  -4.638252   .3935245   -11.79   0.000    -5.423312   -3.853192
             pw2 |   .8263545   .3935245     2.10   0.039     .0412947    1.611414
             pw3 |  -.3068616   .3935245    -0.78   0.438    -1.091921    .4781982
             pw4 |   -.209457   .3935245    -0.53   0.596    -.9945168    .5756028
           _cons |    21.2973   .3935245    54.12   0.000     20.51224    22.08236
    ------------------------------------------------------------------------------

Compare the p-values of the terms in the natural-polynomial regression with those in the orthogonal-polynomial regression. With orthogonal polynomials, it is easy to see that the pure cubic and quartic trends are nonsignificant and that the constant, linear, and quadratic terms each have p < 0.05.

The matrix P obtained with the poly() option can be used to transform coefficients for orthogonal polynomials to coefficients for natural polynomials:

    . orthpoly weight, poly(P) deg(4)
    . matrix b = e(b)*P
    . matrix list b

    b[1,5]
              deg1        deg2        deg3        deg4       _cons
    y1   .02893016  -.00002291   5.745e-09  -4.862e-13   23.944212
> Example

Again consider the auto.dta dataset. Suppose we wish to fit the model

    mpg = b0 + b1 weight + b2 weight^2 + b3 weight^3 + b4 weight^4 + e

We will first compute the regression with natural polynomials:

    . gen double w1 = weight
    . gen double w2 = w1*w1
    . gen double w3 = w2*w1
    . gen double w4 = w3*w1
    . correlate w1-w4
    (obs=74)

                 |       w1       w2       w3       w4
    -------------+------------------------------------
              w1 |   1.0000
              w2 |   0.9915   1.0000
              w3 |   0.9665   0.9916   1.0000
              w4 |   0.9279   0.9679   0.9922   1.0000

    . regress mpg w1-w4

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   36.06
           Model |  1652.73666     4  413.184164           Prob > F      =  0.0000
        Residual |  790.722803    69  11.4597508           R-squared     =  0.6764
    -------------+------------------------------           Adj R-squared =  0.6576
           Total |  2443.45946    73  33.4720474           Root MSE      =  3.3852

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              w1 |   .0289302   .1161939     0.25   0.804    -.2028704    .2607307
              w2 |  -.0000229   .0000566    -0.40   0.687    -.0001359    .0000901
              w3 |   5.74e-09   1.19e-08     0.48   0.631    -1.80e-08    2.95e-08
              w4 |  -4.86e-13   9.14e-13    -0.53   0.596    -2.31e-12    1.34e-12
           _cons |   23.94421   86.60667     0.28   0.783    -148.8314    196.7198
    ------------------------------------------------------------------------------
    . regress price length weight headroom trunk

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   10.20
           Model |   236016580     4  59004145.0           Prob > F      =  0.0000
        Residual |   399048816    69  5783316.17           R-squared     =  0.3716
    -------------+------------------------------           Adj R-squared =  0.3352
           Total |   635065396    73  8699525.97           Root MSE      =  2404.9

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          length |  -101.7092   42.12534    -2.41   0.018     -185.747   -17.67148
          weight |   4.753066   1.120054     4.24   0.000     2.518619    6.987512
        headroom |  -711.5679   445.0204    -1.60   0.114    -1599.359    176.2236
           trunk |   114.0859   109.9488     1.04   0.303    -105.2559    333.4277
           _cons |   11488.47   4543.902     2.53   0.014     2423.638    20553.31
    ------------------------------------------------------------------------------
However, we may believe a priori that length is the most important predictor, followed by weight, followed by headroom, followed by trunk. Hence, we would like to remove the "effect" of length from all the other predictors; remove weight from headroom and trunk; and remove headroom from trunk. We can do this by running orthog, and then we estimate the model again using the orthogonal variables:

    . orthog length weight headroom trunk, gen(olength oweight oheadroom otrunk) matrix(R)
    . regress price olength oweight oheadroom otrunk

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   10.20
           Model |   236016580     4  59004145.0           Prob > F      =  0.0000
        Residual |   399048816    69  5783316.17           R-squared     =  0.3716
    -------------+------------------------------           Adj R-squared =  0.3352
           Total |   635065396    73  8699525.97           Root MSE      =  2404.9

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         olength |   1265.049   279.5584     4.53   0.000     707.3454    1822.753
         oweight |   1175.765   279.5584     4.21   0.000     618.0617    1733.469
       oheadroom |  -349.9916   279.5584    -1.25   0.215    -907.6955    207.7122
          otrunk |   290.0776   279.5584     1.04   0.303    -267.6262    847.7815
           _cons |   6165.257   279.5584    22.05   0.000     5607.553    6722.961
    ------------------------------------------------------------------------------

Using the matrix R, we can transform the results obtained using the orthogonal predictors back to the metric of the original predictors:

    . matrix b = e(b)*inv(R)'
    . matrix list b

    b[1,5]
             length      weight    headroom       trunk       _cons
    y1   -101.70924   4.7530659  -711.56789   114.08591   11488.475
Technical Note

The matrix R obtained using the matrix() option with orthog can also be used to recover X (the original varlist) from Q (the orthogonalized newvarlist), one variable at a time. Continuing with the previous example, we illustrate how to recover the trunk variable:

    . matrix C = R[1..., "trunk"]'
    . matrix score double rtrunk = C
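To confirm the recovery (a hypothetical follow-on sketch, not part of the original entry), the reconstructed variable can be compared with the original; up to rounding error the two should agree:

```stata
* Hedged sketch: rtrunk was built from the orthogonal variables
* and R, so it should match trunk to machine precision.
* Assumes rtrunk was created as in the Technical Note above.
gen double diff = abs(rtrunk - trunk)
summarize diff
```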
Note that the coefficients corresponding to the constant term are placed in the last column of the matrix. The last row of the matrix is all zero except for the last column, which corresponds to the constant term.

Remarks

Orthogonal variables are useful for two reasons. The first is numerical accuracy for highly collinear variables. Stata's regress and other estimation commands can face a large amount of collinearity and still produce accurate results. But, at some point, these commands will drop variables due to collinearity. If you know with certainty that the variables are not perfectly collinear, you may want to retain all of their effects in the model. By using orthog or orthpoly to produce a set of orthogonal variables, all variables will be present in the estimation results.

Users are more likely to find orthogonal variables useful for the second reason: ease of interpreting results. orthog and orthpoly create a set of variables such that the "effect" of all the preceding variables has been removed from each variable. For example, if one issues the command

    . orthog x1 x2 x3, generate(q1 q2 q3)

the constant is removed from x1 to produce q1; the constant and x1 are removed from x2 to produce q2; and finally the constant, x1, and x2 are removed from x3 to produce q3. Hence,

    q1 = r01 + r11 x1
    q2 = r02 + r12 x1 + r22 x2
    q3 = r03 + r13 x1 + r23 x2 + r33 x3

This can be generalized and written in matrix notation as

    X = QR

where X is the N x (d + 1) matrix representation of varlist plus a column of ones, and Q is the N x (d + 1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations). The (d + 1) x (d + 1) matrix R is a permuted upper triangular matrix; i.e., R would be upper triangular if the constant were first, but the constant is last, so the first row/column has been permuted with the last row/column. Since Stata's estimation commands list the constant term last, this allows R, obtained via the matrix() option, to be used to transform estimation results.
> Example

Consider Stata's auto.dta dataset. Suppose we postulate a model in which price depends on the car's length, weight, headroom, and trunk size (trunk). These predictors are collinear, but not extremely so--the correlations are not that close to 1:

    . correlate length weight headroom trunk
    (obs=74)

                 |   length   weight headroom    trunk
    -------------+------------------------------------
          length |   1.0000
          weight |   0.9460   1.0000
        headroom |   0.5163   0.4835   1.0000
           trunk |   0.7266   0.6722   0.6620   1.0000

regress certainly has no trouble estimating this model:
"itle I °rth°g
-- Orth°g°nal ,
variables and °rth°g°nal
p°ly , n°mials
]
Syntax orthog
[varlis,]
tweightl
[matrix(matname)
orthpoly
varname
[if
expl
[in
range],
g_enerate(newvarlist)
]
[weight]
{ generate(newvarlist)
Iif
exp]
[in range],
[p_oly(matname)
} [ degree(#)
]
orthpoly requires that either generate() or poly(), or both. be specified, iweights, fweights, pweights, and aweights are allowed, see [U] 14.1.6 weight.
Description orthog orthogonalizes a set of variables, creating a new set of orthogonal variables (all of type double), using a modified Gram-Schmidt procedure (Golub and Van Loan 1996). Note that the order of the variables determines the orthogonalization: hence, the "most important" variables should be listed first. orthpoly computes orthogonal polynomials
for a single variable
Options generate(newvarlist) is not optional; it creates new orthogonat variables of type double. For orthog, newvarlist will contain the orthogonalized varlist. If varlist contains d variables, then so will newvarlist. For orthpoly, newvarlist will contain orthogonal polynomials of degree 1, 2, .... d evaluated at varname, where d is as specified by degree (d). newvarlist can be specified by giving a list of exactly d new variable names, or it can be abbreviated using the styles newvar 1newvard or newvar,. For these two styles of abbreviation, new variables newvar 1, newvar2, .... newvar d are generated. matrix(mamame) (orthog by X = QR, where X is and Q is the N × (d + 1) of variables in varlist and
only) creates a (d+ 1) × (d + l) matrix containing the matrix R defined the N × (d+ 1) matrix representation of vartist plus a column of ones, matrix representation of newvarlist plus a column of ones (d = number N := number of observations).
degree(#) (orthpoly only) specifies the highest degree polynomial to include. Orthogonal nomials of degree 1, 2.... , d - # are computed. The default is d = 1.
poly-
poly(mamame) (orthpoly only) creates a (d + 1) × (d 4- 1) matrix called matname containing the coefficients of the orthogonal polynomials. The orthogonal polynomial of degree i < d is matname[ i, d + I ] + matname[ i, 1 ] *varname + matname[ + " • + matname [ i, i ]*varname" 477
i, 2 ] *varname 2
I_
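The poly() coefficients can be applied by hand (a hypothetical sketch; the names P and op2 are illustrative choices). For degree 2, the formula above reduces to three terms:

```stata
* Hedged sketch: evaluate the degree-2 orthogonal polynomial of
* weight directly from the coefficient matrix P, and compare it
* with the variable orthpoly generates.
use auto, clear
orthpoly weight, generate(pw1 pw2) poly(P) degree(2)
gen double op2 = P[2,3] + P[2,1]*weight + P[2,2]*weight^2
summarize pw2 op2
```

The two variables should have identical means and standard deviations, since op2 merely re-evaluates what orthpoly computed.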
Technical Note

If your data contain variables named year1, year2, ..., year19, year20, aorder will order them correctly, even though, to most computer programs, year10 is alphabetically between year1 and year2.
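A minimal sketch of this behavior (hypothetical; not from the original entry):

```stata
* Hedged sketch: aorder places year2 before year10 despite
* plain alphabetical order.
clear
set obs 1
gen year1  = 1
gen year10 = 1
gen year2  = 1
aorder
describe    // variables now ordered year1, year2, year10
```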
Methods and Formulas

aorder is implemented as an ado-file.

References

Gleason, J. R. 1997. dm51: Defining and recording variable orderings. Stata Technical Bulletin 40: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 49-52.

Weesie, J. 1999. Changing the order of variables in a dataset. Stata Technical Bulletin 52: 8-9. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 61-62.

Also See

Complementary: [R] describe

Related: [R] edit, [R] rename
Contains
data
from
obs: I
74
1978
6
vars: size:
7
2,368
(99.6%
storage I
order -- Reorder variables in dataset
auto.dta
variable
name
of memory
Automobile
Jul
2000
475
Data
13:51
free)
display
value
type
format
label
variable
label
!
i ;1
i
make
strl8
%-18s
Make
mpg
int
%8,0g
Mileage
price
int
%8.0gc
Price
weight
int
%8.0gc
Weight
length
int
%8.0g
Length
(in.)
rep78
int
%8.0g
Repair
Record
Sorted
and
Model (mpg)
(Ibs.) 1978
by:
Note:
dataset
has
changed
since
last
saved
[
If we now wanted length to be the last variable in our dataset, we could type order make mpg price weight rep78 length, but it would be easier to use move:

    . move length rep78
    . describe
Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
rep78           int    %8.0g                  Repair Record 1978
length          int    %8.0g                  Length (in.)
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
We now change our mind and decide that we would prefer that the variables be alphabetized:

    . aorder
    . describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
length          int    %8.0g                  Length (in.)
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
rep78           int    %8.0g                  Repair Record 1978
weight          int    %8.0gc                 Weight (lbs.)
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
Title

order -- Reorder variables in dataset

Syntax

    order varlist

    move varname1 varname2

    aorder [varlist]

Description

order changes the order of the variables in the current dataset. The variables specified in varlist are moved, in order, to the front of the dataset.

move also reorders variables. move relocates varname1 to the position of varname2 and shifts the remaining variables, including varname2, to make room.

aorder alphabetizes the variables specified in varlist and moves them to the front of the dataset. If no varlist is specified, _all is assumed.

Remarks

> Example

When using order, you must specify a varlist, but it is not necessary to specify all the variables in the dataset. For example,

    . describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
mpg             int    %8.0g                  Mileage (mpg)
make            str18  %-18s                  Make and Model
length          int    %8.0g                  Length (in.)
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

    . order make mpg
    . describe
Saved Results

oprobit saves in e():

Scalars
    e(N)          number of observations
    e(k_cat)      number of categories
    e(df_m)       model degrees of freedom
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(chi2)       chi-squared
    e(N_clust)    number of clusters

Macros
    e(cmd)        oprobit
    e(depvar)     name of dependent variable
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(vcetype)    covariance estimation method
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(offset)     offset
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(cat)        category values
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
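These saved results can be pulled into expressions directly after estimation (a hypothetical sketch; the model is the one used in this entry's examples):

```stata
* Hedged sketch: retrieve a few of oprobit's saved results.
oprobit rep77 foreign length mpg
display "N = " e(N) ", ll = " e(ll) ", pseudo-R2 = " e(r2_p)
matrix list e(cat)    // the category values of rep77
```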
Methods and Formulas

Please see the Methods and Formulas section of [R] ologit.

References

Aitchison, J. and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131-140.

Goldstein, R. 1997. sg59: Index of ordinal variation and Neyman-Barton GOF. Stata Technical Bulletin 33: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 145-147.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Wolfe, R. 1998. sg86: Continuation-ratio models for ordinal response data. Stata Technical Bulletin 44: 18-21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 149-153.

Wolfe, R. and W. W. Gould. 1998. sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24-27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199-204.

Also See

Complementary: [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related: [R] logistic, [R] mlogit, [R] ologit, [R] probit, [R] svy estimators

Background: [U] 16.5 Accessing coefficients and standard errors,
            [U] 23 Estimation and post-estimation commands,
            [U] 23.11 Obtaining robust variance estimates,
            [U] 23.12 Obtaining scores,
            [R] maximize
Hypothesis tests and predictions

See [U] 23 Estimation and post-estimation commands for instructions on obtaining the variance-covariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests.
> Example

In the above example, we estimated the model oprobit rep77 foreign length mpg. The predict command can be used to obtain the predicted probabilities. You type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is poor and the highest excellent. We have five categories and so must type five names following predict; the choice of names is up to us:

    . predict poor fair avg good exc
    (option p assumed; predicted probabilities)
    . list make model exc good if rep77==.

              make      model        exc       good
    13.        AMC     Spirit   .0006044   .0351813
    41.       Ford     Fiesta   .0002927   .0222789
    43.      Buick       Opel   .0043803   .1133763
    44.      Merc.    Monarch   .0093209   .1700846
    53.    Peugeot        604   .0734199   .4202766
    56.      Plym.    Horizon    .001413   .0590294
    57.      Plym.    Sapporo   .0197543   .2466034
    63.      Pont.    Phoenix   .0234156    .266771
For ordered probit, predict, xb produces Sj = x1j*b1 + x2j*b2 + ... + xkj*bk. Ordered probit is identical to ordered logit, except that one uses a different distribution function for calculating probabilities. The ordered-probit predictions are then the probability that Sj + uj lies between a pair of cut points kappa_(i-1) and kappa_i. The formulas in the case of ordered probit are

    Pr(Sj + u < kappa) = Phi(kappa - Sj)
    Pr(Sj + u > kappa) = 1 - Phi(kappa - Sj) = Phi(Sj - kappa)

Rather than using predict directly, we could calculate the predicted probabilities by hand:

    . predict pscore, xb
    . gen probexc  = norm(pscore-_b[_cut4])
    . gen probgood = norm(_b[_cut4]-pscore) - norm(_b[_cut3]-pscore)
Remarks

An ordered probit model is used to estimate relationships between an ordinal dependent variable and a set of independent variables. An ordinal variable is a variable that is categorical and ordered, for instance, "poor", "good", and "excellent", which might be the answer to one's current health status or the repair record of one's car. If there are only two outcomes, see [R] logistic, [R] logit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be ordered (e.g., residency in the north, east, south, and west), see [R] mlogit. This entry is concerned only with models in which the outcomes can be ordered.

In ordered probit, an underlying score is estimated as a linear function of the independent variables and a set of cut points. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cut points estimated for the outcome:

    Pr(outcome_j = i) = Pr(kappa_(i-1) < b1*x1j + b2*x2j + ... + bk*xkj + uj <= kappa_i)

uj is assumed to be normally distributed. In either case, one estimates the coefficients b1, b2, ..., bk together with the cut points kappa_1, kappa_2, ..., kappa_(I-1), where I is the number of possible outcomes. kappa_0 is taken as -infinity and kappa_I is taken as +infinity. All of this is a direct generalization of the ordinary two-outcome probit model.
> Example

In [R] ologit, we use a variation of the automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets) to analyze the 1977 repair records of 66 foreign and domestic cars. We use ordered logit to explore the relationship of rep77 in terms of foreign (origin of manufacture), length (a proxy for size), and mpg. Here we estimate the same model using ordered probit rather than ordered logit:

    . oprobit rep77 foreign length mpg
    Iteration 0:   log likelihood = -89.895098
    Iteration 1:   log likelihood = -78.141221
    Iteration 2:   log likelihood = -78.020314
    Iteration 3:   log likelihood = -78.020025

    Ordered probit estimates                        Number of obs   =         66
                                                    LR chi2(3)      =      23.75
                                                    Prob > chi2     =     0.0000
    Log likelihood = -78.020025                     Pseudo R2       =     0.1321

    ------------------------------------------------------------------------------
           rep77 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |   1.704861   .4246786     4.01   0.000     .8725057    2.537215
          length |   .0468675    .012648     3.71   0.000      .022078    .0716571
             mpg |   .1304559   .0378627     3.45   0.001     .0562464    .2046654
    -------------+----------------------------------------------------------------
           _cut1 |    10.1589   3.076749          (Ancillary parameters)
           _cut2 |   11.21003   3.107522
           _cut3 |   12.54561   3.155228
           _cut4 |   13.98059   3.218786
    ------------------------------------------------------------------------------

We find that foreign cars have better repair records, as do larger cars and cars with better mileage ratings.
clus_er(varnamt
specifies that the observations are independent
across groups (clusters) but
;
n__ necessarily vithin groups, varname:specifies to which group each observation belongs; e.g,, catuster(pers mid) in data with repeated observations on individuals, cluster() affects •the estimated stand_trd errors and variance-covariance matrix of the estlmators (VCE), but nol the es_mated coeffi :ients; see [t2] 23,11 Obtaining robust variance estimates, cluster() can be us#d with pwe: ghts to produce estinmtes for unstratified cluster-sampled data. but see the sWoprobit colnmand in [R] svy estimators for a command designed especially for survey data.
i
cl_aster()
imp ies robust;
specifying robust
cluster()
is equivalent to typing cluster()
by iitself, }
scor_(newvarlist) creates k new variables, where k is the number of observed outcomes. The firs_ variable cot tains OlnLj/O(xjb); the second variable contains OlnLj/O(_cutlj); the third conhins OlnLj/, _(_cut2j); and so on. Note that if you were to specify the option score(sc*), Sta!a would creale the appropriate number of new variables and they would be named seO. scl, level #) specifies le confidence level, in percent, for confidence intervals. The default is level or _ set by set level: see [U] 23.5 Specifying the width of confidence intervals.
(95)
_,
offse_ (varname) s_cifies that varname is to be included in the model with coefficient constrained to be 1.
i l
maximi_e..options control the maximization process; see [R] maximize. You should never have to spedfy them.
i
Optionsior predicl
I
p.:the d_ault, calculat _s the predicted probabilities. If you do not also specify the out come () option. you must specify new variables, where kis the number of categories of the dependent variable. Say vbu estimatec _ model by typing oprobit result xl x2. and result takes on three values.
i
Then i,ou could tyl:e predict pl p2 p3. to obtain all three predicted probabilities. If you specie' the ot_tcome() opt on, then you specify one new variable. Say that result takes on values 1.2. and 3i Then typing predict pl outcome(I) would produce the same pl. xb. calculates the line_ • prediction. You specify one new variable; for example, predict linear, xb. Tt_e linear prod ction is defined ignoring the contribution of the estimated cut points. i
xb calcult_tes the line prediction. You specify one new variable: for example, predict linear, xb. Ttje linear pred_ fion is defined ignoring the contribution of the estimated cut points.
stdp calculates the standard error of the linear prediction. You specify one new variable; for example, predict se, stdp.

outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either a single value of the dependent variable, or one of #1, #2, ..., with #1 meaning the first category of the dependent variable, #2 the second category, etc.
nooffset is relevant only if you specified offset(varname) for oprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
oprobit -- Maximum-likelihood ordered probit estimation
Syntax

oprobit depvar [varlist] [weight] [if exp] [in range] [, table robust
    cluster(varname) score(newvarlist) level(#) offset(varname)
    maximize_options ]
by ... : may be used with oprobit; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

oprobit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

oprobit may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

predict [type] newvarname(s) [if exp] [in range] [, { p | xb | stdp }
    outcome(outcome) nooffset ]
Note that with the p option, you specify either one or k new variables depending upon whether the outcome() option is also specified (where k is the number of categories of depvar). With xb and stdp, one new variable is specified. These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

oprobit estimates ordered probit models of ordinal variable depvar on the independent variables varlist. The actual values taken on by the dependent variable are irrelevant except that larger values are assumed to correspond to "higher" outcomes. Up to 50 outcomes are allowed in Intercooled Stata and up to 20 are allowed in Small Stata. See [R] logistic for a list of related estimation commands.
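The outcome probabilities in an ordered probit model are differences of standard normal CDF values evaluated at the estimated cut points. A minimal sketch of that arithmetic (illustrative only; the function names are ours, not Stata's):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def oprobit_probs(xb, cuts):
    """Outcome probabilities for an ordered probit model, given the
    linear prediction xb (cut points excluded, as with predict, xb)
    and the increasing estimated cut points _cut1 < _cut2 < ...:
    P(y = j) = Phi(_cutj - xb) - Phi(_cut{j-1} - xb)."""
    edges = [float("-inf")] + list(cuts) + [float("inf")]
    return [norm_cdf(hi - xb) - norm_cdf(lo - xb)
            for lo, hi in zip(edges, edges[1:])]
```

With two cut points this yields the three category probabilities that predict's p option would report; they are positive and sum to 1 by construction.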
Options

table requests a table showing how the probabilities for the categories are computed from the fitted equation.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
oneway -- One-way analysis of variance
The Scheffé test (Scheffé 1953, 1959; also see Winer, Brown, and Michels 1991, 191-195) differs in derivation, but it attacks the same problem. Let there be k means for which we want to make all the pairwise tests. Two means are declared significantly different if

    t >= sqrt{ (k - 1) F(alpha; k - 1, nu) }

where F(alpha; k - 1, nu) is the alpha-critical value of the F distribution with k - 1 numerator and nu denominator degrees of freedom. Scheffé's test has the nicety that it never declares a contrast significant if the overall F test is nonsignificant.

Turning the test around, Stata calculates a significance level
    e = F( t^2 / (k - 1); k - 1, nu )

For instance, you have a calculated t statistic of 4.0 with 50 degrees of freedom. The simple t test says the significance level is .00021. The F test equivalent, 16 with 1 and 50 degrees of freedom, says the same. If you are doing three comparisons, however, you calculate an F test of 8.0 with 2 and 50 degrees of freedom, which says the significance level is .0010.
References

Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.

Bartlett, M. S. 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society, Series A 160: 268-282.

Hochberg, Y. and A. C. Tamhane. 1987. Multiple Comparison Procedures. New York: John Wiley & Sons.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Miller, R. G., Jr. 1981. Simultaneous Statistical Inference. 2d ed. New York: Springer-Verlag.

Scheffé, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40: 87-104.

——. 1959. The Analysis of Variance. New York: John Wiley & Sons.

Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626-633.

Snedecor, G. W. and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.

Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3d ed. New York: McGraw-Hill.

Also See

Complementary:  [R] encode

Related:        [R] anova, [R] loneway, [R] table

Background:     [U] 21.8 Accessing results calculated by other programs
Multiple-comparison tests

Let's begin by reviewing the logic behind these adjustments. The "standard" t statistic for the comparison of two means is
    t = (ybar_i - ybar_j) / ( s * sqrt(1/n_i + 1/n_j) )

where s is the overall standard deviation, ybar_i is the measured average of y in group i, and n_i is the number of observations in the group. We perform hypothesis tests by calculating this t statistic. We simultaneously choose a critical level alpha and look up the t statistic corresponding to that level in a table. We reject the hypothesis if our calculated t exceeds the value we looked up. Alternatively, since we have a computer at our disposal, we calculate the significance level e corresponding to our calculated t statistic and, if e < alpha, we reject the hypothesis.

This logic works well when we are performing a single test. Now consider what happens when we perform a number of separate tests, say n of them. Let's assume, just for discussion, that we set alpha equal to 0.05 and that we will perform 6 tests. For each test we have a 0.05 probability of falsely rejecting the equality-of-means hypothesis. Overall, then, our chances of falsely rejecting at least one of the hypotheses is 1 - (1 - .05)^6 = .26 if the tests are independent.

The idea behind multiple-comparison tests is to control for the fact that we will perform multiple tests and to reduce our overall chances of falsely rejecting each hypothesis to alpha rather than letting it increase with each additional test. (See Miller 1981 and Hochberg and Tamhane 1987 for rather advanced texts on multiple-comparison procedures.)

The Bonferroni adjustment (see Miller 1981; also see Winer, Brown, and Michels 1991, 158-166) does this by (falsely but approximately) asserting that the critical level we should use, a, is the true critical level alpha divided by the number of tests n; that is, a = alpha/n. For instance, if we are going to perform 6 tests, each at the .05 significance level, we want to adopt a critical level of .05/6 = .00833.

We can just as easily apply this logic to e, the significance level associated with our t statistic, as to our critical level a. If a comparison has a calculated significance of e, then its "real" significance, adjusted for the fact of n comparisons, is n*e. If a comparison has a significance level of, say, .012, and we perform 6 tests, then its "real" significance is .072. If we adopt a critical level of .05, we cannot reject the hypothesis. If we adopt a critical level of .10, we can reject it.
Of course, this calculation can go above 1, but that just means that there is no alpha < 1 for which we could reject the hypothesis. (This situation arises due to the crude nature of the Bonferroni adjustment.) Stata handles this case by simply calling the significance level 1. Thus, the formula for the Bonferroni significance level is

    e_b = min(1, e*n)

where n = k(k - 1)/2 is the number of comparisons.

The Šidák adjustment (Šidák 1967; also see Winer, Brown, and Michels 1991, 165-166) is slightly different and provides a tighter bound. It starts with the assertion that

    a = 1 - (1 - alpha)^(1/n)

Turning this formula around and substituting calculated significance levels, we obtain

    e_s = min{ 1, 1 - (1 - e)^n }
For example, if the calculated significance is 0.012 and we perform 6 tests, the "'real" significance is approximately 0.07.
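The two adjustments can be sketched directly from the formulas above (hypothetical helper names, not Stata's internals):

```python
def bonferroni(e, n):
    """Bonferroni-adjusted significance: e_b = min(1, e*n)."""
    return min(1.0, e * n)

def sidak(e, n):
    """Sidak-adjusted significance: e_s = min(1, 1 - (1 - e)**n)."""
    return min(1.0, 1.0 - (1.0 - e) ** n)
```

For the running example, e = 0.012 with n = 6 tests gives a Bonferroni-adjusted 0.072 and a slightly smaller Šidák-adjusted value of about 0.070, illustrating the tighter bound.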
Methods and Formulas

The model of one-way analysis of variance is

    y_ij = mu + alpha_i + epsilon_ij

for levels i = 1, ..., k and observations j = 1, ..., n_i. Define ybar_i as the (weighted) mean of y_ij over j and ybar as the overall (weighted) mean of y_ij. Define w_ij as the weight associated with y_ij, which is 1 if the data are unweighted. w_ij is normalized to sum to n = sum_i n_i if aweights are used and is otherwise unnormalized. w_i refers to sum_j w_ij and w refers to sum_i w_i.

The between-group sum of squares is then

    S_1 = sum_i w_i (ybar_i - ybar)^2

The total sum of squares is

    S = sum_i sum_j w_ij (y_ij - ybar)^2

The within-group sum of squares is given by S_e = S - S_1.

The between-group mean square is s_1^2 = S_1/(k - 1) and the within-group mean square is s_e^2 = S_e/(w - k). The test statistic is F = s_1^2 / s_e^2. See, for instance, Snedecor and Cochran (1989).
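For unweighted data (all w_ij = 1, so w = n), the sums of squares above reduce to the following sketch (our own function, not Stata's implementation):

```python
def oneway_F(groups):
    """One-way ANOVA for unweighted data: returns (S1, Se, F), the
    between-group SS, within-group SS, and the F statistic."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n          # overall mean
    # between-group SS: group sizes times squared mean deviations
    S1 = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # total SS, from which the within-group SS follows by subtraction
    S = sum((y - grand) ** 2 for g in groups for y in g)
    Se = S - S1
    F = (S1 / (k - 1)) / (Se / (n - k))
    return S1, Se, F
```

On three made-up groups [1,2,3], [2,3,4], [5,6,7], the between- and within-group sums of squares are 26 and 6, giving F = 13 with 2 and 6 degrees of freedom.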
Bartlett's test

Bartlett's test assumes that you have m independent, normal random samples and tests the hypothesis sigma_1^2 = sigma_2^2 = ... = sigma_m^2. The test statistic, M, is defined as

    M = [ (T - m) ln(sigmahat^2) - sum_i (T_i - 1) ln(sigmahat_i^2) ]
        / [ 1 + (1/(3(m - 1))) { sum_i 1/(T_i - 1) - 1/(T - m) } ]

where there are T overall observations, T_i observations in the ith group, and

    sigmahat_i^2 = (1/(T_i - 1)) sum_j (y_ij - ybar_i)^2
    sigmahat^2   = (1/(T - m)) sum_i (T_i - 1) sigmahat_i^2

An approximate test of the homogeneity of variance is based on the statistic M with critical values obtained from the chi-squared distribution with m - 1 degrees of freedom. See Bartlett (1937) or Judge et al. (1985, 447-449).
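A direct transcription of the statistic for unweighted data, using only the standard library (a sketch, not Stata's code):

```python
import math

def bartlett_M(groups):
    """Bartlett's M statistic for homogeneity of variance, compared
    against chi-squared with m - 1 degrees of freedom."""
    m = len(groups)
    Ti = [len(g) for g in groups]
    T = sum(Ti)
    # per-group sample variances sigmahat_i^2
    var_i = [sum((y - sum(g) / len(g)) ** 2 for y in g) / (len(g) - 1)
             for g in groups]
    # pooled variance sigmahat^2
    pooled = sum((t - 1) * v for t, v in zip(Ti, var_i)) / (T - m)
    num = (T - m) * math.log(pooled) - sum(
        (t - 1) * math.log(v) for t, v in zip(Ti, var_i))
    den = 1.0 + (sum(1.0 / (t - 1) for t in Ti) - 1.0 / (T - m)) / (3 * (m - 1))
    return num / den
```

When every group has the same sample variance, the numerator vanishes and M = 0, as expected; unequal variances push M above zero.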
Weighted data

Example

oneway can work with both weighted and unweighted data. Let's assume that you wish to perform a one-way layout of the death rate on the four Census regions of the United States using state data. Your data contain three variables, drate (the death rate), region (the region), and pop (the population of the state).

To estimate the model, you type oneway drate region [weight=pop], although one typically abbreviates weight as w. We will also add the tabulate option to demonstrate how the table of summary statistics differs for weighted data:
. oneway drate region [w=pop], tabulate
(analytic weights assumed)

                      Summary of Death Rate
Census region        Mean   Std. Dev.       Freq.    Obs.
NE                  97.15        5.82    49135283       9
N Cntrl             88.10        5.58    58865670      12
South               87.05       10.40    74734029      16
West                75.65        8.23    43172490      13

Total               87.34       10.43   2.259e+08      50

                    Analysis of Variance
    Source             SS        df       MS         F     Prob > F
Between groups    2360.92281      3   786.974272   12.17     0.0000
Within groups     2974.09635     46    64.6542685

Total             5335.01916     49   108.877942

Bartlett's test for equal variances:  chi2(3) = 5.4971  Prob>chi2 = 0.139

When the data are weighted, the summary table has four rather than three columns. The column labeled "Freq." reports the sum of the weights. The overall frequency is 2.259e+08, meaning that there are approximately 226 million people in the U.S. The ANOVA table is appropriately weighted. Also see [U] 14.1.6 weight.
Saved Results

oneway saves in r():

Scalars
    r(N)           number of observations
    r(F)           F statistic
    r(df_m)        between-group degrees of freedom
    r(df_r)        within-group degrees of freedom
    r(mss)         between-group sum of squares
    r(rss)         within-group sum of squares
    r(chi2bart)    Bartlett's chi-squared
    r(df_bart)     Bartlett's degrees of freedom
Underneath that number is reported "0.001". This is the Bonferroni-adjusted significance of the difference. The difference is significant at the 0.1% level. Looking down the column, we see that concentration 3 is also worse than concentration 1 (4.2% level), as is concentration 4 (3.6% level).

Based on this evidence, we would use concentration 1 if we grew apple trees.
Example

We can just as easily obtain the Scheffé-adjusted significance levels. Rather than specifying the bonferroni option, we specify the scheffe option. We will also add the noanova option to prevent Stata from redisplaying the ANOVA table:

. oneway weight treatment, noanova scheffe

          Comparison of Average weight in grams by Fertilizer
                              (Scheffe)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
       3 |     -33.25    25.9167
         |      0.039      0.101
       4 |      -34.4    24.7667      -1.15
         |      0.034      0.118      0.999

The differences are the same as we obtained in the Bonferroni output, but the significance levels are not. According to the Bonferroni-adjusted numbers, the significance of the difference between fertilizer concentrations 1 and 3 is 4.2%. The Scheffé-adjusted significance level is 3.9%.

We will leave it to you to decide which results are more accurate.
Example

Let's conclude this example by obtaining the Šidák-adjusted multiple-comparison tests. We do this to illustrate Stata's capabilities to calculate these results. It is understood that searching across adjustment methods until you find the results you want is not a valid technique for obtaining significance levels.

. oneway weight treatment, noanova sidak

          Comparison of Average weight in grams by Fertilizer
                               (Sidak)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
       3 |     -33.25    25.9167
         |      0.041      0.116
       4 |      -34.4    24.7667      -1.15
         |      0.035      0.137      1.000

We find results that are similar to the Bonferroni-adjusted numbers.
  Sequence effect          285.82      3      95.27     1.01     0.4180
  Residuals               1221.49     13      93.98    59.96     0.0000
Intrasubjects
  Treatment effect          15.13      2       7.56     6.34     0.0048
  Period effect              8.48      1       8.48     8.86     0.0056
  Carryover effect           0.11      1       0.11     0.12     0.7366
  Residuals                 29.56     30       0.99

Total                    1560.59     50

Omnibus measure of separability of treatment and carryover = 64.6447%
In this example, the sequence specifier used dashes instead of zeros to indicate a baseline period during which no treatment was given. For pkcross to work, we need to encode the string sequence variable and then use the order option with pkshape. A word of caution: encode does not necessarily choose the first sequence to be sequence 1 as in this example. Always double-check the sequence numbering when using encode.
pkcross -- Analyze crossover experiments
To finish the analysis that was started in [R] pk, little additional work is needed. The data were reshaped with pkshape and are

       id   sequence    outcome   treat   carry   period
        1          1   150.9643       A       0        1
        2          1   146.7606       A       0        1
        3          1   160.6548       A       0        1
        4          1   157.8622       A       0        1
        5          1   133.6957       A       0        1
        7          1    160.639       A       0        1
        8          1   131.2604       A       0        1
        9          1   168.5186       A       0        1
       10          2   137.0627       B       0        1
       12          2   153.4038       B       0        1
       13          2   163.4593       B       0        1
       14          2   146.0462       B       0        1
       15          2   158.1457       B       0        1
       18          2   147.1977       B       0        1
       19          2   164.9988       B       0        1
       20          2   145.3823       B       0        1
        1          1   218.5551       B       A        2
        2          1   133.3201       B       A        2
        3          1   126.0635       B       A        2
        4          1   96.17461       B       A        2
        5          1   188.9038       B       A        2
        7          1   223.6922       B       A        2
        8          1   104.0139       B       A        2
        9          1   237.8962       B       A        2
       10          2   139.7382       A       B        2
       12          2   202.3942       A       B        2
       13          2   136.7848       A       B        2
       14          2   104.5191       A       B        2
       15          2   165.8654       A       B        2
       18          2    139.235       A       B        2
       19          2   166.2391       A       B        2
       20          2   158.5146       A       B        2
The model is fit using pkcross:

. pkcross outcome

  sequence variable = sequence
    period variable = period
 treatment variable = treat
 carryover variable = carry
        id variable = id

Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation          SS        df        MS        F    Prob > F
Intersubjects
  Sequence effect          378.04       1     378.04     0.29     0.5961
  Residuals              17991.26      14    1285.09     1.40     0.2691
Intrasubjects
  Treatment effect         455.04       1     455.04     0.50     0.4931
  Period effect            419.47       1     419.47     0.46     0.5102
  Residuals              12860.78      14     918.63

Total                    32104.59      31

Omnibus measure of separability of treatment and carryover = 29.2893%
Example

Consider the case of a six-treatment crossover trial where the squares are not variance balanced. The following dataset is from a partially balanced crossover trial published by Ratkowsky et al. (1993):

. list

      cow    seq    period1   period2   period3   period4   block
 1.     1   adbe       38.7      37.4      34.3      31.3       1
 2.     2   baed       48.9      46.9        42      39.6       1
 3.     3   ebda       34.6      32.3      28.5      27.1       1
 4.     4   deab       35.2      33.5      28.4      25.1       1
 5.     1   dafc       32.9      33.1      27.5      25.1       2
 6.     2   fdca       30.4      29.4      26.7      23.1       2
 7.     3   cfda       30.8      29.3      26.4      23.4       2
 8.     4   acdf       25.7      26.1      23.4      18.7       2
 9.     1   efbc       25.4        26      23.9      19.9       3
10.     2   beef       21.8      23.9      21.7      17.6       3
11.     3   fceb       21.4        22      19.4      16.6       3
12.     4   cbfe       22.8        21      18.6      16.1       3
In cases where there is no variance balance in the design, a square or blocking variable is needed to indicate in which treatment cell a sequence was observed, but the mechanical steps are the same.

. pkshape cow seq period1 period2 period3 period4
. pkcross outcome, model(block cow|block period|block treat carry) se

Number of obs =      48        R-squared     = 0.9968
Root MSE      = .730751        Adj R-squared = 0.9906

Source             Seq. SS     df          MS          F     Prob > F
Model             2650.0419    30    88.3347302    165.42      0.0000

block            1607.17045     2    803.585226   1504.85      0.0000
cow|block        628.621899     9    69.8468777    130.80      0.0000
period|block     407.531876     9    45.2813195     84.80      0.0000
treat            2.48979215     5    .497958429      0.93      0.4846
carry            4.22788534     5    .845577068      1.58      0.2179

Residual         9.07794631    17    .533996842

Total            2659.11985    47    56.5770181
When the model statement is used and the omnibus measure of separability is desired, specify the variables in the treatment(), carryover(), and sequence() options to pkcross.
Methods and Formulas

pkcross is implemented as an ado-file.

pkcross uses ANOVA to fit models for crossover experiments; see [R] anova.

The omnibus measure of separability is

    S = 100(1 - V)%

where V is Cramér's V and is defined as

    V = sqrt{ (chi^2 / N) / min(r - 1, c - 1) }

The chi^2 is calculated as

    chi^2 = sum_i sum_j (O_ij - E_ij)^2 / E_ij

where O and E are the observed and expected counts in a table of the number of times each treatment is followed by the other treatments.
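Under these definitions, the separability measure can be sketched as follows (a hypothetical helper, not pkcross's internals):

```python
def separability(table):
    """Omnibus separability S = 100*(1 - V) percent, where V is
    Cramer's V of the treatment-followed-by-treatment count table."""
    r, c = len(table), len(table[0])
    N = sum(sum(row) for row in table)
    rowtot = [sum(row) for row in table]
    coltot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    # Pearson chi-squared against independence expected counts
    chi2 = sum(
        (table[i][j] - rowtot[i] * coltot[j] / N) ** 2
        / (rowtot[i] * coltot[j] / N)
        for i in range(r) for j in range(c))
    V = ((chi2 / N) / min(r - 1, c - 1)) ** 0.5
    return 100.0 * (1.0 - V)
```

A perfectly balanced follow-on table gives chi^2 = 0, hence V = 0 and S = 100%; a table where each treatment is always followed by the same other treatment gives V = 1 and S = 0%.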
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.

Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.

Ratkowsky, D. A., M. A. Evans, and J. R. Alldredge. 1993. Cross-over Experiments: Design, Analysis and Application. New York: Marcel Dekker.
Also See

Related:        [R] pkcollapse, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm

Complementary:  [R] statsby

Background:     [R] pk
Title

pkequiv -- Perform bioequivalence tests

Syntax

pkequiv outcome treatment period sequence id [if exp] [in range]
    [, compare(string) limit(#) level(#) noboot fieller symmetric
    anderson tost ]
Description pkequiv this entry.
is one of the pk commands.
If you have not read [R] pk, please do so before reading
pkequiv performs bioequivalence testing for two treatments. By default, pkequiv calculates a standard confidence interval symmetric about the difference between the two treatment means. pkequiv also calculates confidence intervals symmetric about zero and intervals based on Fieller's theorem. Additionally, pkequiv can perform interval hypothesis tests for bioequivalence.
Options

compare(string) specifies the two treatments to be tested for equivalence. In some cases, there may be more than two treatments, but the equivalence can only be determined between any two treatments.

limit(#) specifies the equivalence limit. The default is 20%. The equivalence limit can only be changed symmetrically; that is, it is not possible to have a 15% lower limit and a 20% upper limit in the same test.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(90). Note that this is not controlled by the set level command.
noboot prevents the estimation of the probability that the confidence interval lies within the confidence limits. If this option is not specified, this probability is estimated by resampling the data.

fieller specifies that an equivalence interval based on Fieller's theorem is to be calculated.

symmetric specifies that a symmetric equivalence interval is to be calculated.

anderson specifies that the Anderson and Hauck hypothesis test for bioequivalence is to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.

tost specifies that the two one-sided hypothesis tests for bioequivalence are to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.
i
Remarks } :_
l
_
',
i
4
•
pkequiv i designed to +nduct tests for bioequivalence based on data from a crossover experiment. pkequiv requires that the User specify the outcome, uvatment, period, sequence, and id variables. The data mus I be in the sake format as produced lff pkshape;
see [R] pkshape.
Example

We will conduct equivalence testing on the data introduced in [R] pk. After shaping the data with pkshape, the data are

. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       A       0        1
  2.     1          1   218.5551       B       A        2
  3.     2          1   146.7606       A       0        1
  4.     2          1   133.3201       B       A        2
  5.     3          1   160.6548       A       0        1
  6.     3          1   126.0635       B       A        2
  7.     4          1   157.8622       A       0        1
  8.     4          1   96.17461       B       A        2
  9.     5          1   133.6957       A       0        1
 10.     5          1   188.9038       B       A        2
 11.     7          1    160.639       A       0        1
 12.     7          1   223.6922       B       A        2
 13.     8          1   131.2604       A       0        1
 14.     8          1   104.0139       B       A        2
 15.     9          1   168.5186       A       0        1
 16.     9          1   237.8962       B       A        2
 17.    10          2   137.0627       B       0        1
 18.    10          2   139.7382       A       B        2
 19.    12          2   153.4038       B       0        1
 20.    12          2   202.3942       A       B        2
 21.    13          2   163.4593       B       0        1
 22.    13          2   136.7848       A       B        2
 23.    14          2   146.0462       B       0        1
 24.    14          2   104.5191       A       B        2
 25.    15          2   158.1457       B       0        1
 26.    15          2   165.8654       A       B        2
 27.    18          2   147.1977       B       0        1
 28.    18          2    139.235       A       B        2
 29.    19          2   164.9988       B       0        1
 30.    19          2   166.2391       A       B        2
 31.    20          2   145.3823       B       0        1
 32.    20          2   158.5146       A       B        2
We can now conduct a bioequivalence test between treat = A and treat = B.

. pkequiv outcome treat period seq id
Classic confidence interval for bioequivalence

                 [equivalence limits]        [    test limits    ]
difference:      -30.296        30.296       -11.332       26.416
     ratio:          80%          120%       92.519%     117.439%

probability test limits are within equivalence limits =     0.6350

The default output for pkequiv shows a confidence interval for the difference of the means (test limits), the ratio of the means, and the federal equivalence limits. The classic confidence interval can
be constructed around the difference between the average measure of effect for the two drugs or around the ratio of the average measure of effect for the two drugs. pkequiv reports both the difference measure and the ratio measure. For these data, U.S. federal government regulations state that the confidence interval for the difference must be entirely contained within the range [-30.296, 30.296], and between 80% and 120% for the ratio. In this case, the test limits are within the equivalence limits. Although the test limits are inside the equivalence limits, there is only a 63% assurance that the observed confidence interval will be within the equivalence limits in the long run. This is an interesting case because although this sample shows bioequivalence, the evaluation of the long-run performance indicates that there may be problems. These fictitious data were generated with high intersubject variability, which causes poor long-run performance.
If we conduct a bioequivalence test with the data published in Chow and Liu (2000), which we introduced in [R] pk and fully describe in [R] pkshape, we observe that the probability that the test limits are within the equivalence limits is very high. The data from Chow and Liu (2000) can be seen in expanded form in [R] pkshape. The equivalence test is

. pkequiv outcome treat period seq id

Classic confidence interval for bioequivalence

                 [equivalence limits]        [    test limits    ]
difference:      -16.512        16.512        -8.698        4.123
     ratio:          80%          120%       89.464%     104.994%

probability test limits are within equivalence limits =     0.9970
For these data, the test limits are well within the equivalence limits, and the probability that the test limits are within the equivalence limits is 99.7%.

Example

Using the data published in Chow and Liu (2000), we compute a confidence interval that is symmetric about zero:

. pkequiv outcome treat period seq id, symmetric

Westlake's symmetric confidence interval for bioequivalence

                     [Equivalence limits]        [ Test mean ]
Test formulation:      75.145        89.974           80.272

The reported equivalence limit is constructed symmetrically about the reference mean, which is equivalent to constructing a confidence interval symmetric about zero for the difference in the two drugs. In the above output, we see that the test formulation mean of 80.272 is within the equivalence limits, indicating that the test drug is bioequivalent to the reference drug.

pkequiv will display interval hypothesis tests of bioequivalence if you specify the tost and/or the anderson options. For example,
. pkequiv outcome treat period seq id, tost anderson

Classic confidence interval for bioequivalence

                 [equivalence limits]        [    test limits    ]
difference:      -16.512        16.512        -8.698        4.123
     ratio:          80%          120%       89.464%     104.994%

probability test limits are within equivalence limits =     0.9980

Schuirmann's two one-sided tests
      upper test statistic =     -5.036      p-value =     0.000
      lower test statistic =      3.810      p-value =     0.001

Anderson and Hauck's test
   noncentrality parameter =      4.423
             test statistic =    -0.613      empirical p-value = 0.0005
Both of Schuirmann's one-sided tests are highly significant, suggesting that the two drugs are bioequivalent. A similar conclusion is drawn from the Anderson and Hauck test of bioequivalence.
Saved Results

pkexamine saves in r():

Scalars
    r(stddev)

. anova outcome pattern order id|pattern

Number of obs =      18        R-squared     = 0.9562
Root MSE      = 1.59426        Adj R-squared = 0.9069

Source           Partial SS    df           MS        F     Prob > F
Model            443.666667     9   49.2962963    19.40      0.0002

pattern          .333333333     2   .166666667     0.07      0.9370
order            233.333333     2   116.666667    45.90      0.0000
id|pattern            21.00     3         7.00     2.75      0.1120

Residual         20.3333333     8   2.54166667

Total                464.00    17   27.2941176

These are the same results reported by Neter et al. (1996).
pkshape -- Reshape (pharmacokinetic) Latin square data

Example

Returning to the example from the pk entry, the data are

       id   seq   auc_concA   auc_concB
  1.    1     1    150.9643    218.5551
  2.    2     1    146.7606    133.3201
  3.    3     1    160.6548    126.0635
  4.    4     1    157.8622    96.17461
  5.    5     1    133.6957    188.9038
  6.    7     1     160.639    223.6922
  7.    8     1    131.2604    104.0139
  8.    9     1    168.5186    237.8962
  9.   10     2    137.0627    139.7382
 10.   12     2    153.4038    202.3942
 11.   13     2    163.4593    136.7848
 12.   14     2    146.0462    104.5191
 13.   15     2    158.1457    165.8654
 14.   18     2    147.1977     139.235
 15.   19     2    164.9988    166.2391
 16.   20     2    145.3823    158.5146

. pkshape id seq auc_concA auc_concB, order(ab ba)
. sort id
. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       1       0        1
  2.     1          1   218.5551       2       1        2
  3.     2          1   146.7606       1       0        1
  4.     2          1   133.3201       2       1        2
  5.     3          1   126.0635       2       1        2
  6.     3          1   160.6548       1       0        1
  7.     4          1   96.17461       2       1        2
  8.     4          1   157.8622       1       0        1
  9.     5          1   188.9038       2       1        2
 10.     5          1   133.6957       1       0        1
 11.     7          1    160.639       1       0        1
 12.     7          1   223.6922       2       1        2
 13.     8          1   131.2604       1       0        1
 14.     8          1   104.0139       2       1        2
 15.     9          1   237.8962       2       1        2
 16.     9          1   168.5186       1       0        1
 17.    10          2   137.0627       2       0        1
 18.    10          2   139.7382       1       2        2
 19.    12          2   202.3942       1       2        2
 20.    12          2   153.4038       2       0        1
 21.    13          2   163.4593       2       0        1
 22.    13          2   136.7848       1       2        2
 23.    14          2   104.5191       1       2        2
 24.    14          2   146.0462       2       0        1
 25.    15          2   165.8654       1       2        2
 26.    15          2   158.1457       2       0        1
 27.    18          2    139.235       1       2        2
 28.    18          2   147.1977       2       0        1
 29.    19          2   164.9988       2       0        1
 30.    19          2   166.2391       1       2        2
 31.    20          2   158.5146       1       2        2
 32.    20          2   145.3823       2       0        1

These data can be analyzed with pkcross or anova.
Methods and Formulas

pkshape is implemented as an ado-file.
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.

Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.
Also See

Related:     [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pksumm; [R] anova

Background:  [R] pk
pksumm -- Summarize pharmacokinetic data

Syntax

pksumm id time concentration [if exp] [in range] [, fit(#) trapezoid
    stat(measure) nodots notimechk graph graph_options ]

where measure is one of

    auc       area under the concentration-time curve (AUC)
    aucline   area under the concentration-time curve from 0 to infinity
              using a linear extension
    aucexp    area under the concentration-time curve from 0 to infinity
              using an exponential extension
    auclog    area under the log-concentration-time curve extended with
              a linear fit
    half      half-life of the drug
    ke        elimination rate
    cmax      maximum concentration
    tomc      time of maximum concentration
    tmax      time at last concentration

Description

pksumm is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pksumm obtains the first four moments from the empirical distribution of each pharmacokinetic measurement and tests the null hypothesis that the distribution of that measurement is normally distributed.

Options

fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC from 0 to infinity. The default is fit(3), the last 3 points. This should be viewed as a minimum; the appropriate number of points will depend on the data.

trapezoid specifies that the trapezoidal rule should be used to calculate the AUC. The default is cubic splines, which give better results for most situations. In cases where the curve is very irregular, the trapezoidal rule may give better results.

stat(measure) specifies the statistic that pksumm should graph. The default is stat(auc). If the graph option is not specified, this option is ignored.

nodots suppresses the progress dots during calculation. By default, a period is displayed for every call to calculate the pharmacokinetic measures.

notimechk suppresses the check that the follow-up time for all subjects is the same. By default, pksumm expects the maximum follow-up time to be equal for all subjects.

graph requests a graph of the distribution of the statistic specified with stat().

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
_,,tt
o_z
pKsumm -- _ummanze pharmacokinetic data
-7
Remarks

pksumm will produce summary statistics for the distribution of nine common pharmacokinetic measurements. If there are more than eight subjects, pksumm will also compute a test for normality on each measurement. The nine measurements summarized by pksumm are listed above and are described in the Methods and Formulas sections of [R] pkexamine and [R] pk.
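The ke and half measures come from a straight-line fit to the tail of the log-concentration curve (the extension that fit(#) controls). A rough standard-library sketch under that assumption (not pk's exact algorithm):

```python
import math

def ke_from_tail(time, conc, points=3):
    """Slope of a least-squares line through ln(conc) vs. time over the
    last `points` measurements; -slope estimates the elimination rate."""
    t = time[-points:]
    y = [math.log(c) for c in conc[-points:]]
    n = len(t)
    tbar = sum(t) / n
    ybar = sum(y) / n
    slope = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
             / sum((ti - tbar) ** 2 for ti in t))
    return -slope

def half_life(ke):
    """Half-life implied by a first-order elimination rate ke."""
    return math.log(2) / ke
```

On an exactly exponential decay conc(t) = exp(-0.1 t), the fitted elimination rate is 0.1 and the half-life is ln(2)/0.1.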
Example

We demonstrate the use of pksumm with the data described in [R] pk. We have drug concentration data on 15 subjects, each measured at 13 time points over a 32-hour period. A few of the records are

. list

        id   time       conc
  1.     1      0          0
  2.     1     .5   3.073403
  3.     1      1   5.188444
  4.     1    1.5   5.898577
  5.     1      2   5.096378
  6.     1      3   6.094085
(output omitted)
183.    15      0          0
184.    15     .5    3.86493
185.    15      1   6.432444
186.    15    1.5   6.969195
187.    15      2   6.307024
188.    15      3   6.509584
189.    15      4   6.555091
190.    15      6   7.318319
191.    15      8   5.329813
192.    15     12   5.411624
193.    15     16   3.891397
194.    15     24   5.167516
195.    15     32   2.649686
We can use pksumm to view the summary statistics for all the pharmacokinetic parameters.

. pksumm id time conc

Summary statistics for the pharmacokinetic measures
                                      Number of observations =   15

 stat.       Mean     Median     Variance   Skewness   Kurtosis   p-value
 auc       150.74     150.96       123.07      -0.26       2.10      0.69
 aucline   408.30     214.17    188856.87       2.57       8.93      0.00
 aucexp    691.68     297.08    762679.94       2.56       8.87      0.00
 auclog    688.98     297.67    797237.24       2.59       9.02      0.00
 half       94.84      29.39     18722.13       2.26       7.37      0.00
 ke          0.02       0.02         0.00       0.89       3.70      0.09
 cmax        7.36       7.42         0.42      -0.60       2.56      0.44
 tomc        3.47       3.00         7.62       2.17       7.18      0.00
 tmax       32.00      32.00         0.00

For the 15 subjects, the mean AUC (from time 0 to the last concentration) is 150.74 and sigma^2 = 123.07. The skewness of -0.26 indicates that the distribution is slightly skewed left. The p-value of 0.69 for the chi-squared test of normality indicates that we cannot reject the null hypothesis that the distribution is normal.
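With the trapezoid option, the AUC reported in the stat. column is the trapezoidal-rule area under the measured concentration-time points. A minimal sketch of that rule (our own helper, not pk's implementation, which defaults to cubic splines):

```python
def auc_trapezoid(time, conc):
    """Trapezoidal-rule area under the concentration-time curve:
    sum over intervals of width * average of the two endpoints."""
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for t0, t1, c0, c1 in zip(time, time[1:], conc, conc[1:]))
```

For example, points (0, 0), (1, 2), (2, 2) give an area of 1 + 2 = 3.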
If we were to consider any of the three variants of the AUC_{0,∞}, we would see that there is huge variability and that the distribution is heavily skewed. A skewness different from 0 and a kurtosis different from 3 are expected because the distribution of the AUC_{0,∞} is not normal. We now graph the distribution of AUC_{0,tmax} and specify the graph option.
. pksumm id time conc, graph bin(20)

Summary statistics for the pharmacokinetic measures

                              Number of observations =   15

    stat.      Mean    Median    Variance  Skewness  Kurtosis   p-value
      auc    150.74    150.96      123.07     -0.26      2.10      0.69
  aucline    408.30    214.17   188856.87      2.57      8.93      0.00
   aucexp    691.68    297.08   762679.94      2.56      8.87      0.00
   auclog    688.98    297.67   797237.24      2.59      9.02      0.00
     half     94.84     29.39    18722.13      2.26      7.37      0.00
       ke      0.02      0.02        0.00      0.89      3.70      0.09
     cmax      7.36      7.42        0.42     -0.60      2.56      0.44
     tomc      3.47      3.00        7.62      2.17      7.18      0.00
     tmax     32.00     32.00        0.00

[histogram of the Area Under the Curve (AUC) omitted]

graph, by default, plots the distribution of AUC_{0,tmax}. To produce a graph of one of the other pharmacokinetic measurements, we need to specify the stat() option. For example, we can ask Stata to produce a plot of the AUC_{0,∞} using the log extension:
. pksumm id time conc, stat(auclog) graph bin(20)
Summary statistics for the pharmacokinetic measures

                              Number of observations =   15

    stat.      Mean    Median    Variance  Skewness  Kurtosis   p-value
      auc    150.74    150.96      123.07     -0.26      2.10      0.69
  aucline    408.30    214.17   188856.87      2.57      8.93      0.00
   aucexp    691.68    297.08   762679.94      2.56      8.87      0.00
   auclog    688.98    297.67   797237.24      2.59      9.02      0.00
     half     94.84     29.39    18722.13      2.26      7.37      0.00
       ke      0.02      0.02        0.00      0.89      3.70      0.09
     cmax      7.36      7.42        0.42     -0.60      2.56      0.44
     tomc      3.47      3.00        7.62      2.17      7.18      0.00
     tmax     32.00     32.00        0.00

[histogram of the linear fit to log concentration omitted]
years multiplied by 2,000 person-years means 40 events are expected; and so on.

3. Over very small exposures ε, the probability of finding more than one event is small compared with ε.

4. Nonoverlapping exposures are mutually independent.

With these assumptions, to find the probability of k events in an exposure of size E, divide E into n subintervals E₁, E₂, ..., Eₙ, and approximate the answer as the binomial probability of observing k successes in n trials. If you let n → ∞, you obtain the Poisson distribution.

In the Poisson regression model, the incidence rate for the jth observation is assumed to be given by
$$ r_j = e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}} $$

If $E_j$ is the exposure, the expected number of events $C_j$ will be

$$ C_j = E_j \, e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}}
       = e^{\ln(E_j) + \beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}} $$
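The limiting argument above is easy to check numerically: Binomial(n, λ/n) probabilities approach Poisson(λ) probabilities as n grows. A small illustration (not part of the manual's formulas):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4.0
# Dividing the exposure into many subintervals, the binomial probability
# of 3 events is already very close to the Poisson probability.
approx = binom_pmf(3, 100000, lam / 100000)
exact = poisson_pmf(3, lam)
```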
This model is estimated by poisson. Without the exposure() or offset() options, $E_j$ is assumed to be 1 (equivalent to assuming that exposure is unknown), and controlling for exposure, if necessary, is your responsibility.

One often wants to compare rates, and this is most easily done by calculating incidence rate ratios (IRR). For instance, what is the relative incidence rate of chromosome interchanges in cells as the intensity of radiation increases; the relative incidence rate of telephone connections to a wrong number as load increases; or the relative incidence rate of deaths due to cancer for females relative to males? That is, one wants to hold all the x's in the model constant except one, say the ith. The incidence rate ratio for a one-unit change in $x_i$ is

$$ \frac{e^{\ln(E) + \beta_0 + \beta_1 x_1 + \cdots + \beta_i (x_i + 1) + \cdots + \beta_k x_k}}
        {e^{\ln(E) + \beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i + \cdots + \beta_k x_k}} = e^{\beta_i} $$

More generally, the incidence rate ratio for a $\Delta x_i$ change in $x_i$ is $e^{\beta_i \Delta x_i}$. The lincom command can be used after poisson to display incidence rate ratios for any group relative to another; see [R] lincom.
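The cancellation that yields $e^{\beta_i}$ can be verified directly; a small sketch with made-up coefficient values (beta0, beta_i, and the exposure are all hypothetical):

```python
import math

def rate(x_i, beta0, beta_i, log_exposure):
    """Expected event count for a one-covariate model with an offset."""
    return math.exp(log_exposure + beta0 + beta_i * x_i)

beta0, beta_i, log_e = -1.5, 0.38, math.log(2000.0)  # hypothetical values
# The ratio of rates for a one-unit change in x_i equals exp(beta_i),
# regardless of the exposure or the other terms, which cancel.
irr = rate(3.0, beta0, beta_i, log_e) / rate(2.0, beta0, beta_i, log_e)
```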
> Example

Chatterjee, Hadi, and Price (2000, 164) give the number of injury incidents and the proportion of total flights from New York for nine major U.S. airlines in a single year:

. list

       airline   injuries        n   XYZowned
  1.         1         11   0.0950          1
  2.         2          7   0.1920          0
  3.         3          7   0.0750          0
  4.         4         19   0.2078          0
  5.         5          9   0.1382          0
  6.         6          4   0.0540          1
  7.         7          3   0.1292          0
  8.         8          1   0.0503          0
  9.         9          3   0.0629          1

To their data we have added a fictional variable, XYZowned. We will imagine that an accusation is made that the airlines owned by XYZ Company have a higher injury rate.

. poisson injuries XYZowned, exposure(n) irr
Iteration 0:   log likelihood = -23.027197
Iteration 1:   log likelihood = -23.027177
Iteration 2:   log likelihood = -23.027177

Poisson regression                                Number of obs   =          9
                                                  LR chi2(1)      =       1.77
                                                  Prob > chi2     =     0.1836
Log likelihood = -23.027177                       Pseudo R2       =     0.0370

    injuries         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
    XYZowned    1.463467     .406872    1.37    0.171     .8486578    2.523675
           n  (exposure)
We specified irr to see the incidence rate ratios rather than the underlying coefficients. We estimate that XYZ Airlines' injury rate is 1.46 times larger than that for other airlines, but the 95% confidence interval is .85 to 2.52; we cannot even reject the hypothesis that XYZ Airlines has a lower injury rate.
. gen lnN = ln(n)
. poisson injuries XYZowned lnN

Poisson regression                                Number of obs   =          9
                                                  LR chi2(2)      =      19.15
                                                  Prob > chi2     =     0.0001
Log likelihood = -22.332276                       Pseudo R2       =     0.3001

    injuries       Coef.   Std. Err.      z     P>|z|     [95% Conf. Interval]
    XYZowned    .6840667   .3895877     1.76    0.079    -.0795111    1.447645
         lnN    1.424169   .3725155     3.82    0.000     .6940517    2.154285
       _cons    4.863891   .7090501     6.86    0.000     3.474178    6.253603
In this case, rather than specifying the exposure() option, we explicitly included the variable that would normalize for exposure in the model. We did not specify the irr option, so we see coefficients rather than incidence rate ratios. We started with the model

$$ \text{rate} = e^{\beta_0 + \beta_1 \mathrm{XYZowned}} $$

The observed counts are therefore

$$ \text{count} = n \, e^{\beta_0 + \beta_1 \mathrm{XYZowned}} = e^{\ln(n) + \beta_0 + \beta_1 \mathrm{XYZowned}} $$

which amounts to constraining the coefficient on ln(n) to 1. This is what was estimated when we specified the exposure(n) option. In the model above, we included the normalizing exposure ourselves and, rather than constraining the coefficient to be 1, estimated the coefficient.

The estimated coefficient on ln(n) is 1.42, a respectable distance away from 1, and consistent with our speculation that larger airlines also use larger airplanes. With this small amount of data, however, we also have a wide confidence interval that includes 1.

Our estimated coefficient on XYZowned is now .684, and the implied incidence rate ratio is e^.684 ≈ 1.98 (which we could also see by typing poisson, irr). The 95% confidence interval for the coefficient still includes 0 (the interval for the incidence rate ratio includes 1), so while the point estimate is now larger, we still cannot be very certain of our results.

Our expert opinion would be that, while there is insufficient evidence to support the charge, there is enough evidence to justify collecting more data.
> Example

In a famous age-specific study of coronary disease deaths among male British doctors, Doll and Hill (1966) reported the following data (reprinted in Rothman and Greenland 1998, 259):

                        Smokers                  Nonsmokers
    Age         Deaths   Person-years      Deaths   Person-years
    35-44           32         52,407           2         18,790
    45-54          104         43,248          12         10,673
    55-64          206         28,612          28          5,710
    65-74          186         12,663          28          2,585
    75-84          102          5,317          31          1,462

The first step is to enter these data into Stata, which we have done:

. list

        agecat   smokes   deaths   pyears
   1.        1        1       32   52,407
   2.        2        1      104   43,248
   3.        3        1      206   28,612
   4.        4        1      186   12,663
   5.        5        1      102    5,317
   6.        1        0        2   18,790
   7.        2        0       12   10,673
   8.        3        0       28    5,710
   9.        4        0       28    2,585
  10.        5        0       31    1,462
agecat 1 corresponds to 35-44, agecat 2 to 45-54, and so on. The most "natural" analysis of these data would begin with introducing indicator variables for each age category and a single indicator for smoking:

. tab agecat, gen(a)

     agecat        Freq.     Percent        Cum.
          1            2       20.00       20.00
          2            2       20.00       40.00
          3            2       20.00       60.00
          4            2       20.00       80.00
          5            2       20.00      100.00
      Total           10      100.00

. poisson deaths smokes a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -33.823284
Iteration 1:   log likelihood = -33.600471
Iteration 2:   log likelihood = -33.600153
Iteration 3:   log likelihood = -33.600153
Poisson regression                                Number of obs   =         10
                                                  LR chi2(5)      =     922.93
                                                  Prob > chi2     =     0.0000
Log likelihood = -33.600153                       Pseudo R2       =     0.9321

      deaths         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
      smokes    1.425519   .1530838     3.30    0.001     1.154984    1.759421
          a2    4.410584   .8605197     7.61    0.000     3.009011    6.464997
          a3     13.8392   2.542638    14.30    0.000     9.654328    19.83809
          a4    28.51678   5.269878    18.13    0.000     19.85177    40.96395
          a5    40.45121   7.775511    19.25    0.000     27.75326    58.95885
      pyears  (exposure)

. poisgof

        Goodness-of-fit chi2  =  12.13244
        Prob > chi2(4)        =    0.0164

In the above, we began by using tabulate to create the indicator variables. tabulate created a1 equal to 1 when agecat = 1 and 0 otherwise; a2 equal to 1 when agecat = 2 and 0 otherwise; and so on. See [U] 28 Commands for dealing with categorical variables.

We then estimated our model, specifying irr to obtain incidence rate ratios. We estimate that smokers have 1.43 times the mortality rate of nonsmokers. We also notice, however, that the model does not appear to fit the data well; the goodness-of-fit χ² tells us that, given the model, we can reject the hypothesis that these data are Poisson distributed at the 1.64% significance level. So let us now back up and be more careful. We can most easily obtain the incidence rate ratios within age categories using ir; see [R] epitab:
. ir deaths smokes pyears, by(agecat) nocrude nohet

          agecat        IRR    [95% Conf. Interval]    M-H Weight
               1   5.736638   1.463519   49.39901        1.472169  (exact)
               2   2.138812   1.173668   4.272307        9.624747  (exact)
               3    1.46824   .9863626   2.264174        23.34176  (exact)
               4    1.35606   .9082155    2.09649        23.25315  (exact)
               5   .9047304   .6000946   1.399699        24.31435  (exact)
    M-H combined   1.424882   1.154704   1.757784

We find that the mortality incidence rate ratios are greatly different within age category, being highest for the youngest categories and actually dropping below 1 for the oldest. (In the last case, we might argue that those who smoke and who have not died by age 75 are self-selected to be particularly robust.) Seeing this, we will now parameterize the smoking effects separately for each age category, although we will begin by combining age categories 3 and 4:

. gen sa1 = smokes*(agecat==1)
. gen sa2 = smokes*(agecat==2)
. gen sa34 = smokes*(agecat==3 | agecat==4)
. gen sa5 = smokes*(agecat==5)

. poisson deaths sa1 sa2 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.635422
Iteration 1:   log likelihood = -27.788819
Iteration 2:   log likelihood = -27.573604
Iteration 3:   log likelihood = -27.572645
Iteration 4:   log likelihood = -27.572645

Poisson regression                                Number of obs   =         10
                                                  LR chi2(8)      =     934.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -27.572645                       Pseudo R2       =     0.9443

      deaths         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
         sa1    5.736638   4.181257     2.40    0.017     1.374811    23.93711
         sa2    2.138812   .6520701     2.49    0.013     1.176691    3.887609
        sa34    1.412229   .2017485     2.42    0.016     1.067343    1.868557
         sa5    .9047304   .1855513    -0.49    0.625     .6052658     1.35236
          a2     10.5631   8.067702     3.09    0.002     2.364153    47.19624
          a3      47.671    34.3741     5.36    0.000     11.60056    195.8978
          a4    98.22766   70.85013     6.36    0.000     23.89324    403.8245
          a5      199.21   145.3357     7.26    0.000     47.67694     832.365
      pyears  (exposure)

. poisgof

        Goodness-of-fit chi2  =  .0774185
        Prob > chi2(1)        =    0.7808
Note that the goodness-of-fit χ² is now small; we are no longer running roughshod over the data. Let us now consider simplifying the model. The point estimate of the incidence rate ratio for smoking in age category 1 is much larger than that for smoking in age category 2, but the confidence interval for sa1 is similarly wide. Is the difference real?

. test sa1=sa2

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0

           chi2(  1) =    1.56
         Prob > chi2 =    0.2117

The point estimates may be far apart, but there is insufficient data and we may be observing random differences. With that success, might we also combine the smokers in age categories 3 and 4 with those in 1 and 2?

. test sa34=sa2, accum

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0
 ( 2)  - [deaths]sa2 + [deaths]sa34 = 0.0

           chi2(  2) =    4.73
         Prob > chi2 =    0.0938

Combining age categories 1 through 4 may be overdoing it; the 9.38% significance level is enough to stop us, although others may disagree.
Thus, we now estimate our final model:

. gen sa12 = (sa1|sa2)
. poisson deaths sa12 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.967194
Iteration 1:   log likelihood = -28.524666
Iteration 2:   log likelihood = -28.514535
Iteration 3:   log likelihood = -28.514535

Poisson regression                                Number of obs   =         10
                                                  LR chi2(7)      =     933.11
                                                  Prob > chi2     =     0.0000
Log likelihood = -28.514535                       Pseudo R2       =     0.9424

      deaths         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
        sa12    2.636259   .7408403     3.45    0.001     1.519791    4.572907
        sa34    1.412229   .2017485     2.42    0.016     1.067343    1.868557
         sa5    .9047304   .1855513    -0.49    0.625     .6052658     1.35236
          a2    4.294559   .8385329     7.46    0.000     2.928987    6.296797
          a3    23.42263   7.787716     9.49    0.000     12.20738    44.94164
          a4    48.26309   16.06939    11.64    0.000     25.13068    92.68856
          a5    97.87965   34.30881    13.08    0.000     49.24123     194.561
      pyears  (exposure)

The above strikes us as a fair representation of the data.
Saved Results

poisson saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(ll_0)       log likelihood, constant-only model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(df_m)       model degrees of freedom
    e(p)          significance
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ic)         number of iterations
    e(rank)       rank of e(V)

Macros
    e(cmd)        poisson
    e(user)       name of likelihood-evaluator program
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(opt)        type of optimization
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(wtype)      weight type
    e(wexp)       weight expression
    e(offset)     offset
    e(predict)    program used to implement predict
    e(clustvar)   name of cluster variable
    e(vcetype)    covariance estimation method
    e(cnslist)    constraint numbers

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

poisson and poisgof are implemented as ado-files.

The probability of observing a count $y$ with mean $\lambda$ is

$$ \Pr(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!} $$

With linear predictor

$$ \xi_i = \mathbf{x}_i\boldsymbol\beta + \text{offset}_i $$

the log likelihood (with weights $w_i$ and offsets) and scores are given by

$$ \ln L = \sum_{i=1}^{n} w_i \left\{ -e^{\xi_i} + \xi_i y_i - \ln(y_i!) \right\} $$

$$ \text{score}(\boldsymbol\beta)_i = y_i - e^{\xi_i} $$
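The log likelihood above is straightforward to evaluate directly; a minimal sketch for a single-covariate model (illustrative only, not the ml evaluator poisson actually uses):

```python
import math

def poisson_loglik(beta, xs, ys, offsets=None, weights=None):
    """Weighted Poisson log likelihood for a one-covariate model.

    xi_i = beta[0] + beta[1]*x_i + offset_i, matching
    lnL = sum_i w_i { -exp(xi_i) + xi_i*y_i - ln(y_i!) }.
    """
    n = len(xs)
    offsets = offsets or [0.0] * n
    weights = weights or [1.0] * n
    ll = 0.0
    for x, y, off, w in zip(xs, ys, offsets, weights):
        xi = beta[0] + beta[1] * x + off
        # lgamma(y + 1) = ln(y!), which also works for large counts.
        ll += w * (-math.exp(xi) + xi * y - math.lgamma(y + 1))
    return ll

# At beta = (0, 0) with no offset, every rate is exp(0) = 1.
ll = poisson_loglik((0.0, 0.0), xs=[1.0, 2.0], ys=[0, 1])
```

Passing ln(exposure) as the offset reproduces the exposure() behavior described in the Remarks.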
References

Bortkewitsch, L. von. 1898. Das Gesetz der kleinen Zahlen. Leipzig: Teubner.

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Clarke, R. D. 1946. An application of the Poisson distribution. Journal of the Institute of Actuaries 72: 481.

Coleman, J. S. 1964. Introduction to Mathematical Sociology. New York: Free Press.

Doll, R. and A. B. Hill. 1966. Mortality of British doctors in relation to smoking: observations on coronary thrombosis. In Epidemiological Approaches to the Study of Cancer and Other Chronic Diseases, ed. W. Haenszel. National Cancer Institute Monograph 19: 204-268.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications. vol. 1. 3d ed. New York: John Wiley & Sons.

Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.

——. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.

Hilbe, J. and D. H. Judson. 1998. sg94: Right, left, and uncensored Poisson regression. Stata Technical Bulletin 46: 18-20. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 186-189.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

McNeil, D. 1996. Epidemiological Research Methods. Chichester, England: John Wiley & Sons.

Poisson, S. D. 1837. Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. Paris: Bachelier.

Rodríguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.

Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.

Rothman, K. J. and S. Greenland. 1998. Modern Epidemiology. 2d ed. Philadelphia: Lippincott-Raven.

Rutherford, E., J. Chadwick, and C. D. Ellis. 1930. Radiations from Radioactive Substances. Cambridge: Cambridge University Press.

Selvin, S. 1995. Practical Biostatistical Methods. Belmont, CA: Duxbury Press.

——. 1996. Statistical Analysis of Epidemiologic Data. 2d ed. New York: Oxford University Press.

Thorndike, F. 1926. Applications of Poisson's probability summation. Bell System Technical Journal 5: 604-624.

Tobias, A. and M. J. Campbell. 1998. sts13: Time-series regression for counts allowing for autocorrelation. Stata Technical Bulletin 46: 33-37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 291-296.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx,
                [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] epitab, [R] glm, [R] nbreg, [R] svy estimators, [R] xtpois

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
Title

pperron -- Phillips-Perron test for unit roots

Syntax

pperron varname [if exp] [in range] [, noconstant lags(#) trend regress]

pperron is for use with time-series data; see [R] tsset. You must tsset your data before using pperron.

varname may contain time-series operators; see [U] 14.4.3 Time-series varlists.

Description

pperron performs the Phillips-Perron test for unit roots on a variable. The user may optionally exclude the constant, include a trend term, and/or include lagged values of the difference of the variable in the regression.
noconstant
st_ppressesthe constant term (intercep0 in the model.
tags (#) specit_esthe number of Newey-West lags io use in the calculation of the standard error. •
i
/
trend speclfie.,tthat a trend!term should be included in *..heassociated regression. This option may not be speci_ed if nocon_tant is specified. regress speci_es that the ",lssociatedregression table should appear in the output. By default, the re_ression t_ble is not pr_luced,
|
I ! f_
i l I
I i
Remarks
,i
Hamilton (I_94) and Fuller (1976) give excellent overviews of this topic; see especially chapter 17 of the forme_r.Phillips (1_86) and Phillips and Pe_on (1988) present statistics for testing whether a time series h_d a unit-roottautoregressive,component.
Example
:
ii
Here, we use the international airline passengers dataset (Box, Jenkins, and Reinsel 1994, Series G). This dataset has 144 observations on the monthly number of international airline passengers from 1949 through 1960.

. pperron air

Phillips-Perron test for unit root            Number of obs   =       143
                                              Newey-West lags =         4

                              Interpolated Dickey-Fuller
               Test        1% Critical    5% Critical    10% Critical
             Statistic        Value          Value           Value
  Z(rho)       -6.564        -19.943        -13.786         -11.057
  Z(t)         -1.844         -3.496         -2.887          -2.577

* MacKinnon approximate p-value for Z(t) = 0.3588

Note that we fail to reject the hypothesis that there is a unit root in this time series by looking either at the MacKinnon approximate asymptotic p-value or at the interpolated Dickey-Fuller critical values.
> Example

In this example, we examine the Canadian lynx data from Newton (1988, 587). Here we include a time trend in the calculation of the statistic.

. pperron lynx, trend

Phillips-Perron test for unit root            Number of obs   =       113
                                              Newey-West lags =         4

                              Interpolated Dickey-Fuller
               Test        1% Critical    5% Critical    10% Critical
             Statistic        Value          Value           Value
  Z(rho)      -38.365        -27.487        -20.752         -17.543
  Z(t)         -4.585         -4.036         -3.448          -3.148

* MacKinnon approximate p-value for Z(t) = 0.0011

We reject the hypothesis that there is a unit root in this time series.
Saved Results

pperron saves in r():

Scalars
    r(N)       number of observations
    r(lags)    number of lagged differences used
    r(pval)    MacKinnon approximate p-value (not included if noconstant specified)
    r(Zt)      Phillips-Perron τ test statistic
    r(Zrho)    Phillips-Perron ρ test statistic

Methods and Formulas

pperron is implemented as an ado-file.
In the OLS estimation of an AR(1) process with Gaussian errors,

$$ y_i = \rho y_{i-1} + \epsilon_i $$

where the $\epsilon_i$ are independent and identically distributed as $N(0, \sigma^2)$ and $y_0 = 0$, the OLS estimate (based on an n-observation time series) of the autocorrelation parameter $\rho$ is given by

$$ \hat\rho_n = \frac{\sum_{i=1}^{n} y_{i-1} y_i}{\sum_{i=1}^{n} y_{i-1}^2} $$

We know that if $|\rho| < 1$, then $\sqrt{n}(\hat\rho_n - \rho) \to N(0, 1 - \rho^2)$. If this result were valid for the case $\rho = 1$, the resulting distribution would collapse to a point mass (the variance would be zero).
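The OLS autocorrelation estimate above reduces to a ratio of two sums; a minimal sketch (illustrative only):

```python
def rho_hat(y):
    """OLS estimate of rho in y_i = rho*y_{i-1} + eps_i.

    Implements rho_hat_n = sum(y_{i-1}*y_i) / sum(y_{i-1}^2).
    """
    num = sum(a * b for a, b in zip(y[:-1], y[1:]))
    den = sum(a * a for a in y[:-1])
    return num / den

# For an exact AR(1) path with no noise, the estimate recovers rho.
path = [1.0, 0.5, 0.25, 0.125, 0.0625]  # rho = 0.5, eps = 0
r = rho_hat(path)
```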
It is this motivation that drives one to check for the possibility of a unit root in an autoregressive process. In order to compute the test statistics, we compute the Phillips-Perron regression

$$ y_i = \alpha + \rho y_{i-1} + \epsilon_i $$

where we may exclude the constant or include a trend term $i$. There are two statistics, $Z_\rho$ and $Z_\tau$, calculated as

$$ Z_\rho = n(\hat\rho_n - 1) - \frac{n^2 \hat\sigma^2}{2 s_n^2}\left(\hat\lambda_n^2 - \hat\gamma_{0,n}\right) $$

$$ Z_\tau = \sqrt{\frac{\hat\gamma_{0,n}}{\hat\lambda_n^2}}\,\frac{\hat\rho_n - 1}{\hat\sigma} - \frac{1}{2}\left(\hat\lambda_n^2 - \hat\gamma_{0,n}\right)\frac{n\hat\sigma}{\hat\lambda_n s_n} $$

$$ \hat\gamma_{j,n} = \frac{1}{n}\sum_{i=j+1}^{n} \hat u_i \hat u_{i-j} $$

$$ \hat\lambda_n^2 = \hat\gamma_{0,n} + 2\sum_{j=1}^{q}\left(1 - \frac{j}{q+1}\right)\hat\gamma_{j,n} $$

$$ s_n^2 = \frac{1}{n-k}\sum_{i=1}^{n} \hat u_i^2 $$

where $\hat u_i$ is the OLS residual, $k$ is the number of covariates in the regression, $q$ is the number of Newey-West lags to use in the calculation of $\hat\lambda_n$, and $\hat\sigma$ is the OLS standard error of $\hat\rho$.

The critical values (which have the same distribution as the Dickey-Fuller statistic; see Dickey and Fuller 1979) included in the output are linearly interpolated from the table of values that appear in Fuller (1976), and the MacKinnon approximate p-values use the regression surface published in MacKinnon (1994).
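The autocovariance and Newey-West long-run variance pieces above can be sketched directly (illustrative only, not pperron's ado code):

```python
def gamma_hat(u, j):
    """Sample autocovariance: gamma_hat_{j,n} = (1/n) * sum u_i * u_{i-j}."""
    n = len(u)
    return sum(u[i] * u[i - j] for i in range(j, n)) / n

def lambda2_hat(u, q):
    """Newey-West long-run variance with Bartlett weights (1 - j/(q+1))."""
    total = gamma_hat(u, 0)
    for j in range(1, q + 1):
        total += 2.0 * (1.0 - j / (q + 1)) * gamma_hat(u, j)
    return total

# With q = 0 lags, the estimate is just the sample second moment.
u = [1.0, -1.0, 1.0, -1.0]
lam2 = lambda2_hat(u, 0)
```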
References

Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. 1994. Time Series Analysis: Forecasting and Control. 3d ed. Englewood Cliffs, NJ: Prentice-Hall.

Dickey, D. A. and W. A. Fuller. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74: 427-431.

Fuller, W. A. 1976. Introduction to Statistical Time Series. New York: John Wiley & Sons.

Hakkio, C. S. 1994. sts6: Approximate p-values for unit root and cointegration tests. Stata Technical Bulletin 17: 25-28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 219-224.

Hamilton, J. D. 1994. Time Series Analysis. Princeton: Princeton University Press.

MacKinnon, J. G. 1994. Approximate asymptotic distribution functions for unit-root and cointegration tests. Journal of Business and Economic Statistics 12: 167-176.

Newton, H. J. 1988. TIMESLAB: A Time Series Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.

Phillips, P. C. B. 1987. Time series regression with a unit root. Econometrica 55: 277-301.

Phillips, P. C. B. and P. Perron. 1988. Testing for a unit root in time series regression. Biometrika 75: 335-346.
Also See

Complementary:  [R] tsset

Related:        [R] dfuller

Title

prais -- Prais-Winsten regression and Cochrane-Orcutt regression
Syntax

prais depvar [varlist] [if exp] [in range] [, corc ssesearch rhotype(rhomethod)
    twostep robust cluster(varname) hc2 hc3 noconstant hascons savespace
    nodw level(#) nolog maximize_options]

prais is for use with time-series data; see [R] tsset. You must tsset your data before using prais.

depvar and varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

prais shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | residuals | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

prais estimates a linear regression of depvar on varlist that is corrected for first-order serially-correlated residuals using the Prais-Winsten (1954) transformed regression estimator, the Cochrane-Orcutt (1949) transformed regression estimator, or a version of the search method suggested by Hildreth and Lu (1960).

Options

corc specifies that the Cochrane-Orcutt transformation be used to estimate the equation. With this option, the Prais-Winsten transformation of the first observation is not performed, and the first observation is dropped when estimating the transformed equation; see Methods and Formulas below.

ssesearch specifies that a search be performed for the value of ρ that minimizes the sum of squared errors of the transformed equation (Cochrane-Orcutt or Prais-Winsten transformation). The search method employed is a combination of quadratic and modified bisection search using golden sections.
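The Cochrane-Orcutt and Prais-Winsten transformations differ only in how the first observation is handled. A minimal sketch of the transformed series, assuming ρ is already known (the actual prais command iterates to estimate it):

```python
import math

def transform(y, rho, prais_winsten=True):
    """Quasi-difference a series: y*_t = y_t - rho*y_{t-1} for t >= 2.

    Prais-Winsten keeps the first observation as sqrt(1 - rho^2)*y_1;
    Cochrane-Orcutt drops it, losing one observation.
    """
    out = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    if prais_winsten:
        out.insert(0, math.sqrt(1.0 - rho ** 2) * y[0])
    return out

y = [2.0, 3.0, 5.0]
pw = transform(y, 0.5)            # all 3 observations retained
co = transform(y, 0.5, False)     # first observation dropped
```

The retained first observation is what gives Prais-Winsten its small-sample advantage, as the Remarks below discuss.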
rhotype(rhomethod) selects a specific computation for the autocorrelation parameter ρ, where rhomethod can be

    regress    ρ_reg = β from the residual regression e_t = β e_{t-1}
    freg       ρ_freg = β from the residual regression e_t = β e_{t+1}
    tscorr     ρ_tscorr = e'e_{t-1}/e'e, where e is the vector of residuals
    dw         ρ_dw = 1 - dw/2, where dw is the Durbin-Watson d statistic
    theil      ρ_theil = ρ_tscorr (N - k)/N
    nagar      ρ_nagar = (ρ_dw N² + k²)/(N² - k²)

The prais estimator can use any consistent estimate of ρ to transform the equation, and each of these estimates meets that requirement. The default is regress, and it produces the minimum sum of squares solution (ssesearch option) for the Cochrane-Orcutt transformation; no computation will produce the minimum sum of squares solution for the full Prais-Winsten transformation. See Judge, Griffiths, Hill, Lütkepohl, and Lee (1985) for a discussion of each of the estimates of ρ.

twostep
specifies that prais will stop on the first iteration after the equation is transformed by ρ; this is the two-step efficient estimator. Although it is customary to iterate these estimators to convergence, they are efficient at each step.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). See [U] 23.11 Obtaining robust variance estimates.

Note that all estimates from prais are conditional on the estimated value of ρ. This means that robust variance estimates in this case are only robust to heteroskedasticity and are not generally robust to misspecification of the functional form or omitted variables. The estimation of the functional form is intertwined with the estimate of ρ, and all estimates are conditional on ρ. Thus, we cannot be robust to misspecification of functional form. For these reasons, it is probably best to interpret robust in the spirit of White's (1980) original paper on estimation of heteroskedastic-consistent covariance matrices.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. Specifying cluster() implies robust.

hc2 and hc3 specify an alternative bias correction for the robust variance calculation; for more information, see [R] regress. hc2 and hc3 may not be specified with cluster(). Specifying hc2 or hc3 implies robust.

hascons indicates that a user-defined constant, or set of variables that in linear combination form a constant, has been included in the regression. For some computational concerns, see the discussion in [R] regress.

savespace specifies that prais attempt to save as much space as possible by retaining only those variables required for estimation. The original data is restored after estimation. This option is rarely used and should generally be used only if there is insufficient space to estimate a model without the option.

nodw suppresses reporting of the Durbin-Watson statistic.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for predict

xb, the default, calculates the fitted values, that is, the prediction of x_j b for the specified equation. This is the linear predictor from the estimated regression model; it does not apply the estimate of ρ to prior residuals.

residuals calculates the residuals from the linear prediction.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value. As computed for prais, this is strictly the standard error from the variance in the estimates of the parameters of the linear model, under the assumption that ρ is estimated without error.
process. Under this
Yt --=xtfl + ut where the errors satisfy ?_t -'=- P U¢--I
and the et are independent and identically error term e may then be written as
if,
1 1
distributed
_
et
as N(O, 02). The covariance
1 p
p 1
f12 p
... . ..
pT-1 pT-2
p2
p
1
...
p T- 3
pT-3
...
1
matrix if" of the
p2 pT-1
pT-2
The Prais-Winsten estimator is a generalized least squares (GLS) estimator. ]'he Prais-Winsten method (as described in Judge et al. 1985) is derived from the AR(1) model Jbr the enor term described above. Whereas the Cochrane-Orcutt method uses a lag definition and loses the first observation in the iterative method, the Prais-Winsten method preserves that first observation. In small samples, this can be a significant advantage.
Q TechnicalNote To estimate a model with autocorrelated errors, you must specify your data as time series and have (or create) a variable denoting the time at which an observation was collected. The data for the regression should be equalty spaced in time. Q
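The banded structure of Ψ, with entries ρ^|i−j|, is easy to generate; a minimal sketch (illustrative only):

```python
def ar1_cov(rho, T):
    """AR(1) error covariance matrix, up to the sigma^2 factor:
    Psi[i][j] = rho**|i-j| / (1 - rho**2)."""
    scale = 1.0 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(i - j) for j in range(T)] for i in range(T)]

psi = ar1_cov(0.5, 3)
# The diagonal entries all equal 1/(1 - 0.5**2) = 4/3.
```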
> Example

You wish to estimate a time-series model of usr on idle but are concerned that the residuals may be serially correlated. We will declare the variable t to represent time by typing

. tsset t

We can obtain Cochrane-Orcutt estimates by specifying the corc option:

. prais usr idle, corc

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.3518
 (output omitted)
Iteration 13: rho = 0.5708

Cochrane-Orcutt AR(1) regression -- iterated estimates

      Source        SS       df        MS              Number of obs =      29
       Model   40.1309584     1   40.1309584           F(  1,    27) =    6.49
    Residual   166.898474    27   6.18142498           Prob > F      =  0.0168
                                                       R-squared     =  0.1938
       Total   207.029433    28   7.39390831           Adj R-squared =  0.1640
                                                       Root MSE      =  2.4862

         usr       Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
        idle   -.1254511   .0492356    -2.55    0.017    -.2264742     -.024428
       _cons    14.54641   4.272299     3.40    0.002      5.78036     23.31245

Durbin-Watson statistic (original)     1.295766
Durbin-Watson statistic (transformed)  1.466222

The estimated model is

    usr_t = -.1254 idle_t + 14.55 + u_t    and    u_t = .5708 u_{t-1} + e_t
We can also estimate the model with the Prais-Winsten method:

    . prais usr idle

    Iteration 0:   rho = 0.0000
    Iteration 1:   rho = 0.3518
     (output omitted)
    Iteration 14:  rho = 0.5535

    Prais-Winsten AR(1) regression -- iterated estimates

          Source |       SS       df       MS           Number of obs =      30
    -------------+------------------------------        F(  1,    28) =    7.12
           Model |  43.0076941     1  43.0076941        Prob > F      =  0.0125
        Residual |  169.165739    28  6.04163354        R-squared     =  0.2027
    -------------+------------------------------        Adj R-squared =  0.1742
           Total |  212.173433    29  7.31632528        Root MSE      =   2.458

    ------------------------------------------------------------------------------
             usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            idle |  -.1356522   .0472195    -2.87   0.008    -.2323769   -.0389275
           _cons |   15.20415   4.160391     3.65   0.001     6.681978    23.72633
    ------------------------------------------------------------------------------
    Durbin-Watson statistic (original)     1.295766
    Durbin-Watson statistic (transformed)  1.476004

where the Prais-Winsten estimated model is

    usr_t = -.1357 idle_t + 15.20 + u_t      and      u_t = .5535 u_{t-1} + e_t

As the results indicate, for these data there is little to choose between the Cochrane-Orcutt and Prais-Winsten estimators, whereas the OLS estimate of the slope parameter is substantially different.
Example

We have data on quarterly sales, in millions of dollars, for five years, and we would like to use this information to model sales for company X. First, we estimate a linear model by OLS and obtain the Durbin-Watson statistic using dwstat; see [R] regression diagnostics.

    . regress csales isales

          Source |       SS       df       MS           Number of obs =       20
    -------------+------------------------------        F(  1,    18) = 14888.15
           Model |  110.256901     1  110.256901        Prob > F      =   0.0000
        Residual |  .133302302    18  .007405683        R-squared     =   0.9988
    -------------+------------------------------        Adj R-squared =   0.9987
           Total |  110.390204    19  5.81001072        Root MSE      =   .08606

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1762828   .0014447   122.02   0.000     .1732475    .1793181
           _cons |  -1.454753   .2141461    -6.79   0.000    -1.904657   -1.004849
    ------------------------------------------------------------------------------

    . dwstat

    Durbin-Watson d-statistic(  2,    20) = .7347276

Noting that the Durbin-Watson statistic is far from 2 (the expected value under the null hypothesis of no serial correlation) and well below the 5% lower limit of 1.2, we conclude that the disturbances are serially correlated. (Upper and lower bounds for the d statistic can be found in most econometrics texts, e.g., Harvey 1993. The bounds have been derived for only a limited combination of regressors and observations.) We correct for the autocorrelation using the ssesearch option of prais to search for the value of ρ that minimizes the sum of squared residuals of the Cochrane-Orcutt transformed equation. Normally the default Prais-Winsten transformations would be used with such a small dataset, but the less efficient Cochrane-Orcutt transformation will allow us to demonstrate an aspect of the estimator's convergence.
    . prais csales isales, corc ssesearch

    Iteration 1:   rho = 0.8944 , criterion = -.07298558
    Iteration 2:   rho = 0.8944 , criterion = -.07298558
     (output omitted)
    Iteration 15:  rho = 0.9588 , criterion = -.07167037

    Cochrane-Orcutt AR(1) regression -- SSE search estimates

          Source |       SS       df       MS           Number of obs =       19
    -------------+------------------------------        F(  1,    17) =   553.14
           Model |  2.33199178     1  2.33199178        Prob > F      =   0.0000
        Residual |  .071670369    17  .004215904        R-squared     =   0.9702
    -------------+------------------------------        Adj R-squared =   0.9684
           Total |  2.40366215    18  .133536786        Root MSE      =   .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761624
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------
    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419
It was noted in the Options section that, with the default computation of ρ, the Cochrane-Orcutt method produces an estimate of ρ that minimizes the sum of squared residuals -- the same criterion as the ssesearch option. Given that the two methods produce the same results, why would the search method ever be preferred? It turns out that the back-and-forth iterations employed by Cochrane-Orcutt can often have difficulty converging if the value of ρ is large. Using the same data, the Cochrane-Orcutt iterative procedure requires over 350 iterations to converge, and a higher tolerance must be specified to prevent premature convergence:

    . prais csales isales, corc tol(1e-9) iterate(500)

    Iteration 0:    rho = 0.0000
    Iteration 1:    rho = 0.5312
    Iteration 2:    rho = 0.5866
    Iteration 3:    rho = 0.7161
    Iteration 4:    rho = 0.7373
    Iteration 5:    rho = 0.7550
     (output omitted)
    Iteration 377:  rho = 0.9588
    Iteration 378:  rho = 0.9588
    Iteration 379:  rho = 0.9588

    Cochrane-Orcutt AR(1) regression -- iterated estimates

          Source |       SS       df       MS           Number of obs =       19
    -------------+------------------------------        F(  1,    17) =   553.14
           Model |  2.33199178     1  2.33199178        Prob > F      =   0.0000
        Residual |  .071670369    17  .004215904        R-squared     =   0.9702
    -------------+------------------------------        Adj R-squared =   0.9684
           Total |  2.40366215    18  .133536786        Root MSE      =   .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761625
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------
    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419

Once convergence is achieved, the two methods produce identical results.
Saved Results

prais saves in e():

Scalars
    e(N)           number of observations
    e(mss)         model sum of squares
    e(df_m)        model degrees of freedom
    e(rss)         residual sum of squares
    e(df_r)        residual degrees of freedom
    e(r2)          R-squared
    e(r2_a)        adjusted R-squared
    e(F)           F statistic
    e(rmse)        root mean square error
    e(ll)          log likelihood
    e(N_clust)     number of clusters
    e(rho)         autocorrelation parameter ρ
    e(dw)          Durbin-Watson d statistic of transformed regression
    e(dw_0)        Durbin-Watson d statistic for untransformed regression
    e(tol)         target tolerance
    e(max_ic)      maximum number of iterations
    e(ic)          number of iterations
    e(N_gaps)      number of gaps

Macros
    e(cmd)         prais
    e(depvar)      name of dependent variable
    e(clustvar)    name of cluster variable
    e(rhotype)     method specified in rhotype option
    e(method)      twostep, iterated, or SSE search
    e(vcetype)     covariance estimation method
    e(tranmeth)    corc or prais
    e(cons)        noconstant or not reported
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
Methods and Formulas

prais is implemented as an ado-file.

Consider the command 'prais y x z'. The 0-th iteration is obtained by estimating a, b, and c from the standard linear regression:

    y_t = a x_t + b z_t + c + u_t

An estimate of the correlation in the residuals is then obtained. By default, prais uses the auxiliary regression:

    u_t = ρ u_{t-1} + e_t

This can be changed to any of the computations noted in the rhotype() option.

Next we apply a Cochrane-Orcutt transformation (1) for observations t = 2, ..., n

    y_t - ρ y_{t-1} = a(x_t - ρ x_{t-1}) + b(z_t - ρ z_{t-1}) + c(1 - ρ) + e_t        (1)

and the transformation (1') for t = 1

    sqrt(1 - ρ²) y_1 = a sqrt(1 - ρ²) x_1 + b sqrt(1 - ρ²) z_1 + c sqrt(1 - ρ²) + sqrt(1 - ρ²) u_1        (1')

Thus, the differences between the Cochrane-Orcutt and the Prais-Winsten methods are that the latter uses equation (1') in addition to equation (1), whereas the former uses only equation (1) and necessarily decreases the sample size by one.

Equations (1) and (1') are used to transform the data and obtain new estimates of a, b, and c. When the twostep option is specified, the estimation process is halted at this point, and these are the estimates reported. Under the default behavior of iterating to convergence, this process is repeated until the change in the estimate of ρ is within a specified tolerance.

The new estimates are used to produce fitted values

    yhat_t = a x_t + b z_t + c

and then ρ is re-estimated, by default using the regression defined by

    y_t - yhat_t = ρ(y_{t-1} - yhat_{t-1}) + u_t        (2)

We then re-estimate equation (1) using the new estimate of ρ, and continue to iterate between (1) and (2) until the estimate of ρ converges. Convergence is declared after iterate() iterations or when the absolute difference in the estimated correlation between two iterations is less than tol(); see [R] maximize. Sargan (1964) has shown that this process will always converge.

Under the ssesearch option, a combined quadratic and bisection search using golden sections is used to search for the value of ρ that minimizes the sum of squared residuals from the transformed equation. The transformation may be either the Cochrane-Orcutt (1 only) or the Prais-Winsten (1 and 1').

All reported statistics are based on the ρ-transformed variables, and there is an assumption that ρ is estimated without error. See Judge et al. (1985) for details.

The Durbin-Watson d statistic reported by prais and dwstat is

        Σ_{j=1}^{n-1} (u_{j+1} - u_j)²
    d = ------------------------------
        Σ_{j=1}^{n} u_j²

where u_j represents the residual of the jth observation.
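The transformations (1) and (1') and the d statistic are easy to state in code. The Python sketch below is illustrative only -- it is not Stata's ado-code, and the function names are ours. It applies one pass of the transformation for a given ρ and computes the Durbin-Watson d.

```python
import numpy as np

# Illustrative only -- not Stata's implementation.
def pw_transform(y, X, rho, corc=False):
    """Transform (y, X) for GLS under AR(1) errors.

    Rows 2..n use equation (1): z_t - rho*z_{t-1}.
    Row 1 uses equation (1'): sqrt(1 - rho^2)*z_1, and is dropped
    when corc=True, mimicking Cochrane-Orcutt."""
    y = np.asarray(y, float)
    X = np.asarray(X, float)
    ys = y[1:] - rho * y[:-1]
    Xs = X[1:] - rho * X[:-1]
    if corc:
        return ys, Xs                      # n - 1 transformed observations
    w = np.sqrt(1.0 - rho ** 2)
    return np.concatenate([[w * y[0]], ys]), np.vstack([w * X[0], Xs])

def durbin_watson(u):
    """d = sum of squared first differences of residuals over sum of squares."""
    u = np.asarray(u, float)
    return np.sum(np.diff(u) ** 2) / np.sum(u ** 2)
```

Iterating between this transformation and re-estimation of ρ reproduces the back-and-forth scheme described above; the only difference between the two methods is whether the first row is kept.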
Acknowledgment

We thank Richard Dickens of the Centre for Economic Performance at the London School of Economics and Political Science for testing and assistance with an early version of this command.
References

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Cochrane, D. and G. H. Orcutt. 1949. Application of least-squares regression to relationships containing autocorrelated error terms. Journal of the American Statistical Association 44: 32-61.

Durbin, J. and G. S. Watson. 1950 and 1951. Testing for serial correlation in least-squares regression. Biometrika 37: 409-428 and 38: 159-178.

Hardin, J. W. 1995. sts10: Prais-Winsten regression. Stata Technical Bulletin 25: 26-29. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 234-237.

Harvey, A. C. 1993. The Econometric Analysis of Time Series. Cambridge, MA: MIT Press.

Hildreth, C. and J. Y. Lu. 1960. Demand relations with autocorrelated disturbances. Agricultural Experiment Station Technical Bulletin 276. East Lansing, MI: Michigan State University.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.

Prais, S. J. and C. B. Winsten. 1954. Trend Estimators and Serial Correlation. Cowles Commission Discussion Paper No. 383. Chicago.

Sargan, J. D. 1964. Wages and prices in the United Kingdom: a study in econometric methodology. In Econometric Analysis for National Economic Planning, ed. P. E. Hart, G. Mills, and J. K. Whitaker, 25-64. London: Butterworths.

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary:   [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl,
                 [R] vce, [R] xi

Related:         [R] regress, [R] regression diagnostics

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates
Title

predict -- Obtain predictions, residuals, etc., after estimation

Syntax

After single-equation (SE) estimators

    predict [type] newvarname [if exp] [in range] [, xb stdp nooffset
        other_options]

After multiple-equation (ME) estimators

    predict [type] newvarname [if exp] [in range] [, equation(eqno[,eqno]) xb
        stdp stddp nooffset other_options]
Description

predict calculates predictions, residuals, influence statistics, and the like after estimation. Exactly what predict can do is determined by the previous estimation command; command-specific options are documented with each estimation command. Regardless of command-specific options, the actions of predict share certain similarities across estimation commands:

1) Typing predict newvarname creates newvarname containing "predicted values" -- numbers related to the E(y_j|x_j). For instance, after linear regression, predict newvarname creates x_j b and, after probit, creates the probability Φ(x_j b).

2) predict newvarname, xb creates newvarname containing x_j b. This may be the same result as (1) (e.g., linear regression) or different (e.g., probit), but regardless, option xb is allowed.

3) predict newvarname, stdp creates newvarname containing the standard error of the linear prediction x_j b.

4) predict newvarname, other_options may create newvarname containing other useful quantities; see help or the reference manual entry for the particular estimation command to find out about other available options.

5) Adding the nooffset option to any of the above requests that the calculation ignore any offset or exposure variable specified by including the offset(varname) or exposure(varname) options when you estimated the model.

predict can be used to make in-sample or out-of-sample predictions:

6) In general, predict calculates the requested statistic for all possible observations, whether they were used in estimating the model or not. predict does this for standard options (1) through (3), and generally does this for estimator-specific options (4).

7) To restrict the prediction to the estimation subsample, type

    . predict newvarname if e(sample) ...

8) Some statistics make sense only with respect to the estimation subsample. In such cases, the calculation is automatically restricted to the estimation subsample, and the documentation for the specific option states this. Even so, you can still specify if e(sample) if you are uncertain.
9) predict's ability to make out-of-sample predictions even extends to other datasets. In particular, you can

    . use ds1
    (estimate a model)
    . use two             /* another dataset */
    . predict yhat, ...   /* fill in the predictions */
Were we to type predict pmpg now, we would obtain the linear predictions for all 74 observations. To obtain predictions just for the sample on which we estimated the model, we could type

    . predict pmpg if e(sample)
    (option xb assumed; fitted values)
    (52 missing values generated)

In this example, e(sample) is true only for foreign cars because we typed if foreign when we estimated the model, and there are no missing values among the relevant variables. Had there been missing values, e(sample) would also account for those.
By the way, the if e(sample) restriction can be used with any Stata command, so to obtain summary statistics on the estimation sample, we could type

    . summarize if e(sample)
     (output omitted)
Example

Using the same auto dataset, assume that you wish to estimate the model

    mpg = β₁ weight + β₂ weight² + β₃ foreign + β₄

We first create the weight2 variable and then type the regress command:

    . use auto
    (1978 Automobile Data)
    . generate weight2=weight^2
    . regress mpg weight weight2 foreign

          Source |       SS       df       MS           Number of obs =       74
    -------------+------------------------------        F(  3,    70) =    52.25
           Model |  1689.15372     3   563.05124        Prob > F      =   0.0000
        Residual |   754.30574    70  10.7757963        R-squared     =   0.6913
    -------------+------------------------------        Adj R-squared =   0.6781
           Total |  2443.45946    73  33.4720474        Root MSE      =   3.2827

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0165729   .0039692    -4.18   0.000    -.0244892   -.0086567
         weight2 |   1.59e-06   6.25e-07     2.55   0.013     3.45e-07    2.84e-06
         foreign |    -2.2035   1.059246    -2.08   0.041      -4.3161   -.0909002
           _cons |   56.53884   6.197383     9.12   0.000     44.17855    68.89913
    ------------------------------------------------------------------------------

Were we to type predict pmpg now, we would obtain predictions for all 74 cars in the current data. Instead, we are going to use a new dataset.

The dataset newautos.dta contains the make, weight, and place of manufacture of two cars, the Pontiac Sunbird and the Volvo 260. Let's use the dataset and create the predictions:

    . use newautos
    (New Automobile Models)
    . list

             make            weight    foreign
      1.     Pont. Sunbird     2690    Domestic
      2.     Volvo 260         3170    Foreign

    . predict mpg
    (option xb assumed; fitted values)
    weight2 not found
    r(111);
Things did not work. We typed predict mpg, and Stata responded with the message "weight2 not found". predict can calculate predicted values on a different dataset only if that dataset contains the variables that went into the model. In this case, our data do not contain a variable called weight2. weight2 is just the square of weight, so we can create it and try again:

    . generate weight2=weight^2
    . predict mpg
    (option xb assumed; fitted values)
    . list

             make            weight    foreign     weight2        mpg
      1.     Pont. Sunbird     2690    Domestic    7236100   23.47137
      2.     Volvo 260         3170    Foreign    1.00e+07   17.78846

We obtained our predicted values. The Pontiac Sunbird has a predicted mileage rating of 23.5 mpg, whereas the Volvo 260 has a predicted rating of 17.8 mpg. By way of comparison, the actual mileage ratings are 24 for the Pontiac and 17 for the Volvo.
Residuals

Example

With many estimators, predict can calculate more than predicted values. With most regression-type estimators, we can, for instance, obtain residuals. Using our regression example, we return to our original data and obtain residuals by typing

    . use auto, clear
    (1978 Automobile Data)
    . predict double resid, residuals
    . summarize resid

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
       resid |      74   -1.78e-15    3.214491   -5.636126   13.85172

Notice that we did this without re-estimating the model. Stata always remembers the last set of estimates, even as we use new datasets.

It was not necessary to type the double in predict double resid, residuals; but we wanted to remind you that you can specify the type of a variable in front of the variable's name; see [U] 14.4.2 Lists of new variables. We made the new variable resid a double rather than the default float.

If you want your residuals to have a mean as close to zero as possible, remember to request the extra precision of double. If we had not specified double, the mean of resid would have been roughly 10^-8 rather than 10^-14. Although 10^-14 sounds more precise than 10^-8, the difference really does not matter.

For linear regression, predict can also calculate standardized residuals and studentized residuals with the options rstandard and rstudent; for examples, see [R] regression diagnostics.
Single-equation (SE) estimation

If you have not read the discussion above on using predict after linear regression, please do so. Also note that predict's default calculation almost always produces a statistic in the same metric as the dependent variable of the estimated model, e.g., predicted counts for Poisson regression. In any case, xb can always be specified to obtain the linear prediction.

predict is also willing to calculate the standard error of the prediction, which is obtained by using the inverse-matrix-of-second-derivatives estimate for the covariance matrix of the estimators.
Example

After most binary outcome models (e.g., logistic, logit, probit, cloglog, scobit), predict calculates the probability of a positive outcome if you do not tell it otherwise. You can specify the xb option if you want the linear prediction (also known as the logit or probit index). The odd abbreviation xb is meant to suggest Xβ. In logit and probit models, for example, the predicted probability is p = F(Xβ), where F() is the logistic or normal cumulative distribution function, respectively.

    . logistic foreign mpg weight
     (output omitted)
    . predict phat
    (option p assumed; Pr(foreign))
    . predict idxhat, xb
    . summarize foreign phat idxhat

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
     foreign |      74    .2972973    .4601885          0          1
        phat |      74    .2972973    .3052979    .000729   .8980594
      idxhat |      74   -1.678202    2.321509  -7.223107   2.175845

Since this is a logit model, we could obtain the predicted probabilities ourselves from the predicted index

    . gen phat2 = exp(idxhat)/(1+exp(idxhat))

but using predict without options is easier.
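The identity used by phat2 is worth seeing in isolation. A small Python sketch (illustrative only; not Stata code, and the function name is ours) of the inverse-logit transform maps the extremes of idxhat back to the extremes of phat shown above.

```python
import math

# Illustrative only: invert the logit index, p = exp(xb)/(1 + exp(xb)).
def invlogit(xb):
    return math.exp(xb) / (1.0 + math.exp(xb))

# The min and max of idxhat map to the min and max of phat.
print(invlogit(-7.223107), invlogit(2.175845))
```

For a probit model, the same role is played by the normal cumulative distribution function in place of the inverse logit.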
Example

For all models, predict attempts to produce a predicted value in the same metric as the dependent variable of the model. We have seen that for dichotomous outcome models, the default statistic produced by predict is the probability of a success. Similarly, for Poisson regression, the default statistic produced by predict is the predicted count for the dependent variable. You can always specify the xb option to obtain the linear combination of the coefficients with an observation's x values (the inner product of the coefficients and x values). For poisson (without an explicit exposure), this is the natural log of the count.

    . poisson injuries XYZowned
     (output omitted)
    . predict injhat
    (option n assumed; predicted number of events)
    . predict idx, xb
    . gen exp_idx = exp(idx)
    . summarize injuries injhat exp_idx idx

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
    injuries |       9    7.111111    5.487359          1         19
      injhat |       9    7.111111    .8333333          6   7.666667
     exp_idx |       9    7.111111    .8333333          6   7.666667
         idx |       9    1.955174     .122561   1.791759   2.036882
We note that our "hand-computed" prediction of the count (exp_idx) exactly matches what was produced by the default operation of predict.

If our model has an exposure-time variable, we can use predict to obtain the linear prediction with or without the exposure. Let's verify what we are getting by obtaining the linear prediction with and without exposure, transforming these predictions to count predictions, and comparing them with the default count prediction from predict. We must remember to multiply by the exposure time when using predict ..., nooffset.

    . poisson injuries XYZowned, exposure(n)
     (output omitted)
    . predict double injhat
    (option n assumed; predicted number of events)
    . predict double idx, xb
    . gen double exp_idx = exp(idx)
    . predict double idxn, xb nooffset
    . gen double exp_idxn = exp(idxn)*n
    . summarize injuries injhat exp_idx exp_idxn idx idxn

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
    injuries |       9    7.111111    5.487359          1         19
      injhat |       9    7.111111     3.10936   2.919621   12.06158
     exp_idx |       9    7.111111     3.10936   2.919621   12.06158
    exp_idxn |       9    7.111111     3.10936   2.919621   12.06158
         idx |       9    1.869722    .4671044   1.071454   2.490025
        idxn |       9     4.18814     .190404   4.061204   4.442013

Looking at the identical means and standard deviations for injhat, exp_idx, and exp_idxn, we see that it is possible to reproduce the default computations of predict for poisson estimations. We have also demonstrated the relationship between the count predictions and the linear predictions with and without exposure.
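The bookkeeping in this example reduces to a single identity: the exposure enters the linear prediction as ln(n). The Python sketch below is illustrative only (not Stata internals; function names are ours) and shows why exp(idx) and exp(idxn)*n must agree.

```python
import math

# Illustrative only.  With exposure n, the linear prediction is
# idx = xb + ln(n); with nooffset it is idxn = xb.
def count_with_offset(xb, n):
    idx = xb + math.log(n)     # linear prediction including the offset
    return math.exp(idx)       # predicted count

def count_rescaled(xb, n):
    return math.exp(xb) * n    # exp(idxn)*n, as computed in the example

print(count_with_offset(0.7, 4.0), count_rescaled(0.7, 4.0))
```

Because exp(xb + ln n) = exp(xb)·n, the two computations are algebraically identical, which is exactly what the matching summarize rows above demonstrate.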
Multiple-equation (ME) estimation

If you have not read the above discussion on using predict after SE estimation, please do so. With the exception of the ability to select specific equations to predict from, the use of predict after ME models follows almost exactly the same form as it does for SE models.

Example

The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of ME commands that do not have separate discussions on obtaining predictions would also be well-advised to read the predict section in [R] mlogit, even if their interest is not in multinomial logistic regression. As a general introduction to the ME models, we will demonstrate predict after sureg:
    . sureg (price foreign displ) (weight foreign length)

    Seemingly unrelated regression
    ----------------------------------------------------------------------
    Equation             Obs  Parms        RMSE    "R-sq"      chi2      P
    ----------------------------------------------------------------------
    price                 74      2    2202.447    0.4348  45.20554  0.0000
    weight                74      2    245.5238    0.8988  658.8548  0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    price        |
         foreign |   3137.894   697.3805     4.50   0.000     1771.054   4504.735
    displacement |   23.06938   3.443212     6.70   0.000     16.32081   29.81795
           _cons |   680.8438   859.8142     0.79   0.428    -1004.361   2366.049
    -------------+----------------------------------------------------------------
    weight       |
         foreign |   -154.883    75.3204    -2.06   0.040    -302.5082  -7.257674
          length |   30.67594   1.531981    20.02   0.000     27.67331   33.67856
           _cons |  -2699.498   302.3912    -8.93   0.000    -3292.173  -2106.822
    ------------------------------------------------------------------------------

sureg estimated two equations, one called price and the other weight; see [R] sureg.

    . predict pred_p, equation(price)
    (option xb assumed; fitted values)
    . predict pred_w, equation(weight)
    (option xb assumed; fitted values)
    . summarize price pred_p weight pred_w

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
       price |      74    6165.257    2949.496       3291      15906
      pred_p |      74    6165.257    1678.805    2664.81   10485.33
      weight |      74    3019.459    777.1936       1760       4840
      pred_w |      74    3019.459    726.0468   1501.602   4447.996

You may specify the equation by name, as we did above, or by number: equation(#1) means the same thing as equation(price) in this case.
Methods and Formulas

Denote the previously estimated coefficient vector by b and its estimated variance matrix by V. predict works by recalling various aspects of the model, such as b, and combining that information with the data currently in memory. Let us write x_j for the jth observation currently in memory.

The predicted value (xb option) is defined

    yhat_j = x_j b + offset_j

The standard error of the prediction (stdp) is defined

    s_{p_j} = sqrt( x_j V x_j' )

The standard error of the difference in linear predictions between equations 1 and 2 (stddp) is defined

    s_{dp_j} = sqrt( (x_{1j}, -x_{2j}, 0, ..., 0) V (x_{1j}, -x_{2j}, 0, ..., 0)' )

See the individual estimation commands for the computation of command-specific predict statistics.
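Given b and V, these formulas are a few lines of linear algebra. A Python sketch follows, illustrative only (this is not how predict is implemented internally, and the helper names are ours):

```python
import numpy as np

# Illustrative only: the xb and stdp formulas above.
def xb(x, b, offset=0.0):
    return float(np.dot(x, b)) + offset                # x_j b + offset_j

def stdp(x, V):
    x = np.asarray(x, dtype=float)
    V = np.asarray(V, dtype=float)
    return float(np.sqrt(x @ V @ x))                   # sqrt(x_j V x_j')

print(xb([1.0, 2.0], [3.0, 4.0], offset=1.0))          # 3 + 8 + 1
print(stdp([3.0, 4.0], np.eye(2)))                     # sqrt(9 + 16)
```

The stddp formula is the same quadratic form applied to the stacked vector (x_{1j}, -x_{2j}, 0, ..., 0) over the full multiple-equation V.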
Also See

Related:      [R] regress, [R] regression diagnostics, [P] predict

Background:   [U] 23 Estimation and post-estimation commands
Title

probit -- Maximum-likelihood probit estimation

Syntax

    probit depvar [indepvars] [weight] [if exp] [in range] [, level(#) nocoef
        noconstant robust cluster(varname) score(newvarname) asis
        offset(varname) maximize_options ]

    dprobit [depvar indepvars [weight] [if exp] [in range]] [, at(matname)
        classic probit_options ]

by ... : may be used with probit and dprobit; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

probit may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, { p | xb | stdp }
        rules asif nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

probit estimates a maximum-likelihood probit model.

dprobit estimates maximum-likelihood probit models and is an alternative to probit. Rather than reporting the coefficients, dprobit reports the change in the probability for an infinitesimal change in each independent, continuous variable and, by default, reports the discrete change in the probability for dummy variables. You may not specify the noconstant option with dprobit. probit may be typed without arguments after dprobit estimation to see the model in coefficient form.

If estimating on grouped data, see the bprobit command described in [R] glogit.

A number of auxiliary commands may be run after probit, logit, or logistic; see [R] logistic for a description of these commands. See [R] logistic for a list of related estimation commands.
Options

Options for probit

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nocoef specifies that the coefficient table is not to be displayed. This option is sometimes used by programmers but is of no use interactively.

noconstant suppresses the constant term (intercept) in the probit model. This option is not available for dprobit.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyprobit command in [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarname) creates newvar containing u_j = ∂lnL_j/∂(x_j b) for each observation j in the sample. The score vector is Σ ∂lnL_j/∂b = Σ u_j x_j; i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

asis requests that all specified variables and observations be retained in the maximization process. This option is typically not specified and may introduce numerical instability. Normally probit drops variables that perfectly predict success or failure in the dependent variable. The associated observations are also dropped. In those cases, the effective coefficient on the dropped variables is infinity (negative infinity) for variables that completely determine a success (failure). Dropping the variables and perfectly predicted observations has no effect on the likelihood or estimates of the remaining coefficients and increases the numerical stability of the optimization process. Specifying this option forces retention of perfect predictor variables and their associated perfectly predicted observations.

offset(varname) specifies that varname is to be included in the model with the coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for dprobit

at(matname) specifies the point around which the transformation of results is to be made. The default is to perform the transformation around x̄, the mean of the independent variables. If there are k independent variables, matname may be 1 × k or 1 × (k + 1); that is, it may optionally include a final element 1 reflecting the constant. at() may be specified when the model is estimated or when results are redisplayed.

classic requests that the mean effects be calculated using the formula f(x̄b)b_i in all cases. If classic is not specified, f(x̄b)b_i is used for continuous variables, but the mean effects for dummy variables are calculated as Φ(x̄₁b) - Φ(x̄₀b). Here x̄₁ = x̄ but with element i set to 1, x̄₀ = x̄ but with element i set to 0, and x̄ is the mean of the independent variables or the vector specified by at(). classic may be specified at estimation time or when the results are redisplayed. Results calculated without classic may be redisplayed with classic, and vice versa.

probit_options are any of the options allowed by probit; see Options for probit, above.
Options for predict

p, the default, calculates the probability of a positive outcome.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations.

asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model.

nooffset is relevant only if you specified offset(varname) for probit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
Remarks

Remarks are presented under the headings

    Robust standard errors
    dprobit
    Model identification
    Obtaining predicted values
    Performing hypothesis tests

probit performs maximum likelihood estimation of models with dichotomous dependent (left-hand-side) variables coded as 0/1 (or, more precisely, coded as 0 and not 0).
Example

You have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. You wish to estimate a probit model explaining whether a car is foreign based on its weight and mileage. Here is an overview of your data:

. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             4                          7 Jul 2000 13:51
 size:         1,998  (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0gc                 Weight (lbs.)
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign
     Note:  dataset has changed since last saved

. inspect foreign

foreign:  Car type                        Number of Observations
                                      Total      Integers   Nonintegers
                   Negative               -             -             -
                   Zero                  52            52             -
                   Positive              22            22             -
                                      -----         -----         -----
                   Total                 74            74             -
                   Missing                -
                                      -----
                                         74
   (2 unique values)

foreign is labeled and all values are documented in the label.

The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car.
The model you wish to estimate is

    Pr(foreign = 1) = Φ(β₀ + β₁ weight + β₂ mpg)

where Φ is the cumulative normal distribution. To estimate this model, you type

. probit foreign weight mpg
Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -29.244141
 (output omitted)
Iteration 5:   log likelihood = -26.844189

Probit estimates                                  Number of obs   =         74
                                                  LR chi2(2)      =      36.38
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.844189                       Pseudo R2       =     0.4039

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0023355   .0005254    -4.45   0.000    -.0033653   -.0013057
         mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
       _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
------------------------------------------------------------------------------
You find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least holding the weight of the car constant.
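As an arithmetic check, the fitted probability for any car can be computed directly from the reported coefficients as Φ(xb). The following is a Python sketch of that calculation, not Stata; the 3,000-lb., 20-mpg car is invented purely for illustration:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Estimated probit coefficients from the output above
b_weight, b_mpg, b_cons = -0.0023355, -0.1039503, 8.275464

# Hypothetical car: 3,000 lbs., 20 mpg (made up for illustration)
xb = b_weight * 3000 + b_mpg * 20 + b_cons   # the probit index
p_foreign = norm_cdf(xb)                     # predicted Pr(foreign)
print(round(xb, 3), round(p_foreign, 3))     # -> -0.81 0.209
```

That is, such a car would have roughly a 21% predicted probability of being foreign.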
See [R] maximize for an explanation of the output.

Robust standard errors

If you specify the robust option, probit reports robust standard errors instead:

. probit foreign weight mpg, robust

Probit estimates                                  Number of obs   =         74
                                                  Wald chi2(2)    =      30.26
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.844189                       Pseudo R2       =     0.4039

------------------------------------------------------------------------------
             |               Robust
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0023355   .0004934    -4.73   0.000    -.0033025   -.0013686
         mpg |  -.1039503   .0593548    -1.75   0.080    -.2202836    .0123829
       _cons |   8.275464   2.539176     3.26   0.001      3.29877    13.25216
------------------------------------------------------------------------------
Without robust, the standard error for the coefficient on mpg was reported to be .052 with a resulting confidence interval of [-.21, -.00].

robust with the cluster() option has the ability to relax the independence assumption required by the probit estimator to being just independence between clusters. To demonstrate this, we will switch to a different dataset. You are studying unionization of women in the United States and have a dataset with 26,200 observations on 4,434 women between 1970 and 1988. For our purposes, we will use the variables age (the women were 14-26 in 1968, and your data thus span the age range of 16-46), grade (years of schooling completed, ranging from 0 to 18), not_smsa (28% of the person-time was spent living outside an SMSA, a standard metropolitan statistical area), south (41% of the person-time was in the South), and southXt (south interacted with year, treating 1970 as year 0). You also have the variable union. Overall, 22% of the person-time is marked as time under union membership, and 44% of these women have belonged to a union.
You estimate the following model, ignoring that the women are observed an average of 5.9 times each in these data:

. probit union age grade not_smsa south southXt
Iteration 0:   log likelihood =  -13864.23
Iteration 1:   log likelihood = -13548.436
Iteration 2:   log likelihood = -13547.308
Iteration 3:   log likelihood = -13547.308

Probit estimates                                  Number of obs   =      26200
                                                  LR chi2(5)      =     633.84
                                                  Prob > chi2     =     0.0000
Log likelihood = -13547.308                       Pseudo R2       =     0.0229

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0059461   .0015798     3.76   0.000     .0028496    .0090425
       grade |   .0263901   .0036651     7.20   0.000     .0192066    .0335735
    not_smsa |  -.1303911   .0202523    -6.44   0.000    -.1700848   -.0906975
       south |  -.4027254    .033989   -11.85   0.000    -.4693426   -.3361081
     southXt |   .0033088   .0029253     1.13   0.258    -.0024247    .0090423
       _cons |  -1.113091   .0657808   -16.92   0.000    -1.242019   -.9841628
------------------------------------------------------------------------------
The reported standard errors in this model are probably meaningless. Women are observed repeatedly, and so the observations are not independent. Looking at the coefficients, you find a large southern effect against unionization and little time trend. The robust and cluster() options provide a way to estimate this model and obtain correct standard errors:

. probit union age grade not_smsa south southXt, robust cluster(id)
Iteration 0:   log likelihood =  -13864.23
Iteration 1:   log likelihood = -13548.436
Iteration 2:   log likelihood = -13547.308
Iteration 3:   log likelihood = -13547.308

Probit estimates                                  Number of obs   =      26200
                                                  Wald chi2(5)    =     165.75
                                                  Prob > chi2     =     0.0000
Log likelihood = -13547.308                       Pseudo R2       =     0.0229

                            (standard errors adjusted for clustering on idcode)
------------------------------------------------------------------------------
             |               Robust
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0059461   .0023567     2.52   0.012      .001327    .0105651
       grade |   .0263901   .0078378     3.37   0.001     .0110282    .0417518
    not_smsa |  -.1303911   .0404109    -3.23   0.001     -.209595   -.0511873
       south |  -.4027254   .0514458    -7.83   0.000    -.5035573   -.3018935
     southXt |   .0033088   .0039793     0.83   0.406    -.0044904    .0111081
       _cons |  -1.113091   .1188478    -9.37   0.000    -1.346028   -.8801534
------------------------------------------------------------------------------
These standard errors are roughly 50% larger than those reported by the inappropriate conventional calculation. By comparison, another model we could estimate is an equal-correlation population-averaged probit model:
. xtprobit union age grade not_smsa south southXt, i(id) pa
Iteration 1: tolerance = .04796083
Iteration 2: tolerance = .00352657
Iteration 3: tolerance = .00017886
Iteration 4: tolerance = 8.654e-06
Iteration 5: tolerance = 4.150e-07

GEE population-averaged model                     Number of obs      =   26200
Group variable:              idcode               Number of groups   =    4434
Link:                        probit               Obs per group: min =       1
Family:                    binomial                              avg =     5.9
Correlation:           exchangeable                              max =      12
                                                  Wald chi2(5)       =  241.66
Scale parameter:                  1               Prob > chi2        =  0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0031597   .0014678     2.15   0.031     .0002829    .0060366
       grade |   .0329992   .0062334     5.29   0.000      .020782    .0452163
    not_smsa |  -.0721799   .0275189    -2.62   0.009    -.1261159   -.0182439
       south |   -.409029   .0372213   -10.99   0.000    -.4819815   -.3360765
     southXt |   .0081828    .002545     3.22   0.001     .0031946    .0131709
       _cons |  -1.184799   .0890117   -13.31   0.000    -1.359259    -1.01034
------------------------------------------------------------------------------
The coefficient estimates are similar, but these standard errors are smaller than those produced by probit, robust cluster(). This is as we would expect. If the equal-correlation assumption is valid, the population-averaged probit estimator above should be more efficient.

Is the assumption valid? That is a difficult question to answer. The population-averaged estimates correspond to an assumption of exchangeable correlation within person. It would not be unreasonable to assume an AR(1) correlation within person or to assume that the observations are correlated, but that we do not wish to impose any structure. See [R] xtgee for full details.

What is important to understand is that probit, robust cluster() is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation.
dprobit

A probit model is defined

    Pr(yⱼ ≠ 0 | xⱼ) = Φ(xⱼb)

where Φ is the standard cumulative normal distribution and xⱼb is called the probit score or index.

Since xⱼb has a normal distribution, interpreting probit coefficients requires thinking in the Z (normal quantile) metric. For instance, pretend we estimated the probit equation

    Pr(yⱼ ≠ 0) = Φ(.08233x₁ + 1.529x₂ − 3.139)

The interpretation of the x₁ coefficient is that each one-unit increase in x₁ leads to increasing the probit index by .08233 standard deviations. Learning to think in the Z metric takes practice and, even if you do, communicating results to others who have not learned to think this way is difficult.

A transformation of the results helps some people think about them. The change in the probability somehow feels more natural, but how big that change is depends on where we start. Why not choose as a starting point the mean of the data? If x̄₁ = 21.29 and x̄₂ = .42, then we would report something like .0257, meaning the change in the probability calculated at the mean. We could make the calculation as follows. The mean normal index is .08233 × 21.29 + 1.529 × .42 − 3.139 = −.7440, and the corresponding probability is Φ(−.7440) = .2284. Adding our x₁ coefficient of .08233 to the index and recalculating the probability, we obtain Φ(−.7440 + .08233) = .2541. Thus, the change in the probability is .2541 − .2284 = .0257.
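Both the exact one-unit change (.0257) and the density-based slope (.0249) are easy to verify. This is a Python sketch of the arithmetic, not Stata:

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return exp(-z * z / 2.0) / sqrt(2.0 * pi)

# Coefficients and means from the illustration above
b1, b2, b0 = 0.08233, 1.529, -3.139
x1bar, x2bar = 21.29, 0.42

index = b1 * x1bar + b2 * x2bar + b0            # mean probit index, about -.7440
exact = norm_cdf(index + b1) - norm_cdf(index)  # exact one-unit change: about .0257
slope = norm_pdf(index) * b1                    # infinitesimal version: about .0249
```

The slope calculation is the one dprobit's classic option reports for all variables.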
In practice, people make this calculation somewhat differently and produce a slightly different number. Rather than making the calculation for a one-unit change in x₁, they calculate the slope of the probability function. Doing a little calculus, they derive that the change in the probability for a (small) change in x₁ is the height of the normal density multiplied by the x₁ coefficient; that is,

    ∂p/∂x₁ = f(x̄b)b₁

where f is the normal density. Going through this calculation, they obtain .0249. The difference between .0257 and .0249 is not much. They differ because .0257 is the exact answer for a one-unit increase in x₁, whereas .0249 is the answer for an infinitesimal change extrapolated out.

dprobit with the classic option transforms results as an infinitesimal change extrapolated out.

Example

Consider the automobile data again:
. use auto, clear
(1978 Automobile Data)

. gen goodplus = rep78>=4 if rep78~=.
(5 missing values generated)

. dprobit foreign mpg goodplus, classic
Iteration 0:   log likelihood = -42.400729
Iteration 1:   log likelihood = -27.643138
Iteration 2:   log likelihood = -26.953126
Iteration 3:   log likelihood = -26.942119
Iteration 4:   log likelihood = -26.942114

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
 foreign |      dF/dx   Std. Err.      z    P>|z|     x-bar   [    95% C.I.   ]
---------+--------------------------------------------------------------------
     mpg |   .0249187   .0110853     2.30   0.022   21.2899    .003192  .046646
goodplus |     .46276   .1187437     3.81   0.000    .42029    .230027  .695493
   _cons |  -.9499603   .2281006    -3.82   0.000         1   -1.39703 -.502891
---------+--------------------------------------------------------------------
  obs. P |   .3043478
 pred. P |   .2286624  (at x-bar)
------------------------------------------------------------------------------
z and P>|z| are the test of the underlying coefficient being 0
After estimation with dprobit, the untransformed coefficient results can be seen by typing probit without options:

. probit

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |     .08233   .0358292     2.30   0.022     .0121091     .152557
    goodplus |   1.528992   .4010866     3.81   0.000     .7428771    2.315108
       _cons |  -3.138737   .8209689    -3.82   0.000    -4.747807   -1.529668
------------------------------------------------------------------------------
There is one case in which one can argue that the classic, infinitesimal-change-based adjustment could be improved on, and that is in the case of a dummy variable. A dummy variable is a variable that takes on the values 0 and 1 only; 1 indicates that something is true and 0 that it is not. goodplus is such a variable. It is natural to summarize its effect by asking how much goodplus being true changes the outcome probability over that of goodplus being false. That is, "at the means", the predicted probability of foreign for a car with goodplus = 0 is Φ(.08233x̄₁ − 3.139) = .0829. For the same car with goodplus = 1, the probability is Φ(.08233x̄₁ + 1.529 − 3.139) = .5569. The difference is thus .5569 − .0829 = .4740. When we do not specify the classic option, dprobit makes the calculation for dummy variables in this way. Even though we estimated the model with the classic option, we can redisplay results with it omitted:
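The discrete-change calculation for the dummy variable can be verified the same way. This is a Python sketch of the arithmetic (not Stata output), using the rounded coefficients from the text:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Rounded coefficients and the sample mean of mpg from the text
b_mpg, b_good, b_cons = 0.08233, 1.529, -3.139
mpg_bar = 21.2899

p_good0 = norm_cdf(b_mpg * mpg_bar + b_cons)           # goodplus = 0: about .0829
p_good1 = norm_cdf(b_mpg * mpg_bar + b_good + b_cons)  # goodplus = 1: about .5569
effect = p_good1 - p_good0                             # discrete change: about .4740
```

This .4740 is exactly the goodplus entry dprobit reports when classic is omitted.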
. dprobit

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
  foreign |      dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
----------+--------------------------------------------------------------------
      mpg |   .0249187   .0110853     2.30   0.022   21.2899   .003192  .046646
goodplus* |   .4740077   .1114816     3.81   0.000    .42029   .255508  .692508
----------+--------------------------------------------------------------------
   obs. P |   .3043478
  pred. P |   .2286624  (at x-bar)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
Technical Note

at(matname) allows you to evaluate effects at points other than the means. Let's obtain the effects for the above model at mpg = 20 and goodplus = 1:

. matrix myx = (20,1)

. dprobit, at(myx)

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
  foreign |      dF/dx   Std. Err.      z    P>|z|       x    [    95% C.I.   ]
----------+--------------------------------------------------------------------
      mpg |   .0328237   .0144157     2.30   0.022        20   .004569  .061078
goodplus* |   .4468843   .1130835     3.81   0.000         1   .225245  .668524
----------+--------------------------------------------------------------------
   obs. P |   .3043478
  pred. P |   .2286624  (at x-bar)
  pred. P |   .5147238  (at x)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
Model identification

The probit command has one more feature, and it is probably the most useful. It will automatically check the model for identification and, if it is underidentified, drop whatever variables and observations are necessary for estimation to proceed.

Example

Have you ever estimated a probit model where one or more of your independent variables perfectly predicted one or the other outcome? For instance, consider the following small amount of data:

    Outcome y    Independent variable x
        0                  1
        0                  1
        0                  0
        1                  0

Let's imagine we wish to predict the outcome on the basis of the independent variable. Notice that the outcome is always zero whenever the independent variable is one. In our data, Pr(y = 0 | x = 1) = 1, which in turn means that the probit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect we have a problem.

Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization. The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that, at each step, making the coefficient a little bigger, or a little smaller, works wonders. It continues on its merry way until either (1) the whole thing comes crashing to the ground when a numerical overflow error occurs or (2) it reaches some predetermined cutoff that stops the process. Meanwhile, you have been waiting. In addition, the estimates that you finally receive, if any, may be nothing more than numerical roundoff.

Stata watches for these sorts of problems, alerts you, fixes them, and then properly estimates the model.
Let's return to our automobile data. Among the variables we have in the data is one called repair that takes on three values. A value of 1 indicates that the car has a poor repair record, 2 indicates an average record, and 3 indicates a better-than-average record. Here is a tabulation of our data:

     Car |            repair
    type |       1        2        3 |    Total
---------+----------------------------+--------
Domestic |      10       27        9 |      46
 Foreign |       0        3        9 |      12
---------+----------------------------+--------
   Total |      10       30       18 |      58

Notice that all the cars with poor repair records (repair==1) are domestic. If we were to attempt to predict foreign on the basis of the repair records, the predicted probability for the repair==1 category would have to be zero. This in turn means that the probit coefficient must be minus infinity, and that would set most computer programs buzzing.

Let's try Stata on this problem. First, we make up two new variables, rep_is_1 and rep_is_2, that indicate the repair category.
. generate rep_is_1 = repair==1

. generate rep_is_2 = repair==2

The statement generate rep_is_1 = repair==1 creates a new variable, rep_is_1, that takes on the value 1 when repair is 1 and zero otherwise. Similarly, the next generate statement creates rep_is_2, which takes on the value 1 when repair is 2 and zero otherwise. We are now ready to estimate our model:

. probit foreign rep_is_1 rep_is_2
note: rep_is_1~=0 predicts failure perfectly
      rep_is_1 dropped and 10 obs not used
Iteration 0:   log likelihood = -26.992087
Iteration 1:   log likelihood = -22.276479
Iteration 2:   log likelihood = -22.229184
Iteration 3:   log likelihood = -22.229138

Probit estimates                                  Number of obs   =         48
                                                  LR chi2(1)      =       9.53
                                                  Prob > chi2     =     0.0020
Log likelihood = -22.229138                       Pseudo R2       =     0.1765

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rep_is_2 |  -1.281552   .4297324    -2.98   0.003    -2.123812   -.4392916
       _cons |   1.21e-16    .295409     0.00   1.000     -.578991     .578991
------------------------------------------------------------------------------
Remember that all the cars with poor repair records (rep_is_1) are domestic, so the model cannot be estimated, or at least it cannot be estimated if we restrict ourselves to finite coefficients. Stata noted that fact. It said, "note: rep_is_1~=0 predicts failure perfectly". This is Stata's mathematically precise way of saying what we said in English. When rep_is_1 is not equal to 0, the car is domestic. Stata then went on to say, "rep_is_1 dropped and 10 obs not used". This is Stata eliminating the problem. First, the variable rep_is_1 had to be removed from the model because it would have an infinite coefficient. Then, the 10 observations that led to the problem had to be eliminated as well so as not to bias the remaining coefficients in the model. The 10 observations that are not used are the 10 domestic cars that have poor repair records. Finally, Stata estimated what was left of the model, which is all that can be estimated.
Technical Note

Stata is pretty smart about catching these problems. It will catch "one-way causation by a dummy variable", as we demonstrated above.

Stata also watches for "two-way causation"; that is, a variable that perfectly determines the outcome, both successes and failures. In this case Stata says, "so-and-so predicts outcome perfectly" and stops. Statistics dictates that no model can be estimated.

Stata also checks your data for collinear variables; it will say "so-and-so dropped due to collinearity". No observations need to be eliminated in this case, and model estimation will proceed without the offending variable.

It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after an operation, and if we included age in our model, and if all the persons over 65 died within the year, Stata will say, "age > 65 predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model.
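For the dummy-variable case, the idea behind the check can be mimicked in a few lines. The following Python sketch (the function name and toy data are invented for illustration, and Stata's actual algorithm is more general) flags a 0/1 regressor whose nonzero values all share one outcome:

```python
def find_perfect_predictors(y, X, names):
    """Flag dummy regressors where x != 0 perfectly predicts one outcome."""
    flagged = []
    for j, name in enumerate(names):
        # Outcomes observed whenever this regressor is nonzero
        outcomes = {y[i] for i in range(len(y)) if X[i][j] != 0}
        if outcomes == {0}:
            flagged.append((name, "predicts failure perfectly"))
        elif outcomes == {1}:
            flagged.append((name, "predicts success perfectly"))
    return flagged

# Toy data mirroring the repair-record example:
# every car with rep_is_1 == 1 is domestic (y == 0)
y = [0, 0, 0, 1, 1, 0]
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
print(find_perfect_predictors(y, X, ["rep_is_1", "rep_is_2"]))
# -> [('rep_is_1', 'predicts failure perfectly')]
```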
probit (and logit and logistic) will also occasionally display messages such as

note: 4 failures and 0 successes completely determined.

The cause of this message and what to do if you get it are described in [R] logit.
Obtaining predicted values

Once you have estimated a probit model, you can obtain the predicted probabilities using the predict command for both the estimation sample and other samples; see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.

predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the linear combination xⱼb, where xⱼ are the independent variables in the jth observation and b is the estimated parameter vector. This is known as the index function since the cumulative density indexed at this value is the probability of a positive outcome.

In both cases, Stata remembers any "rules" used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example.

With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. One can calculate the unadjusted-for-replicated-covariate-patterns diagonal elements of the hat matrix, or leverage, by typing

. predict pred

. predict stdp, stdp

. generate hat = stdp^2*pred*(1-pred)
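The leverage recipe is plain arithmetic on predict's outputs. This hypothetical Python sketch (the coefficient vector and covariance matrix are made up for illustration; stdp is computed as the square root of x V x') shows the same calculation for a single observation:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def quad_form(x, V):
    """x V x' for a row vector x and a square matrix V."""
    n = len(x)
    return sum(x[r] * V[r][c] * x[c] for r in range(n) for c in range(n))

# Hypothetical fitted model: coefficient vector b and its covariance matrix V
b = [0.8, -1.2]
V = [[0.04, -0.01],
     [-0.01, 0.09]]

def leverage(x):
    xb = sum(xi * bi for xi, bi in zip(x, b))  # linear prediction
    stdp = sqrt(quad_form(x, V))               # standard error of the prediction
    pred = norm_cdf(xb)                        # predicted probability
    return stdp ** 2 * pred * (1 - pred)       # hat = stdp^2 * pred * (1 - pred)

h = leverage([1.0, 1.0])
```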
Example

In the previous example, we estimated the probit model probit foreign rep_is_1 rep_is_2. To obtain predicted probabilities,

. predict p
(option p assumed; Pr(foreign))
(10 missing values generated)

. summarize foreign p

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
     foreign |      58    .2068966   .4086186          0          1
           p |      48         .25   .1956984         .1         .5

Stata remembers any "rules" used to identify the model and sets predictions to missing for any excluded observations. In the previous example, probit dropped the variable rep_is_1 from our model and excluded 10 observations. Thus, when we typed predict p, those same 10 observations were again excluded and their predictions set to missing.

predict's rules option will use the rules in the prediction. During estimation, we were told "rep_is_1~=0 predicts failure perfectly", so the rule is that when rep_is_1 is not zero, one should predict 0 probability of success or a positive outcome:

. predict p2, rules

. summarize foreign p p2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
     foreign |      58    .2068966   .4086186          0          1
           p |      48         .25   .1956984         .1         .5
          p2 |      58    .2068966   .2016268          0         .5
predict's asif option will ignore the rules and the exclusion criteria, and calculate predictions for all observations possible using the estimated parameters from the model:

. predict p3, asif

. summarize foreign p p2 p3

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
     foreign |      58    .2068966   .4086186          0          1
           p |      48         .25   .1956984         .1         .5
          p2 |      58    .2068966   .2016268          0         .5
          p3 |      58    .2931034   .2016268         .1         .5

Which is right? By default, predict uses the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is only correct if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case, however, you should re-estimate the model to include the excluded observations.
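The three policies differ only in how they treat observations excluded by a rule. This hypothetical Python function (an invented sketch, not Stata's implementation) mirrors the default, rules, and asif behaviors for a rule that predicts failure:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def predict_probs(xb, excluded, mode="default"):
    """Predicted probabilities under predict's three policies.

    excluded[j] is True where the identification rule applied
    (here: a regressor value that predicts failure perfectly).
    """
    out = []
    for xb_j, ex in zip(xb, excluded):
        if ex and mode == "default":
            out.append(None)          # Stata's missing
        elif ex and mode == "rules":
            out.append(0.0)           # the rule says Pr(success) = 0
        else:                         # asif, or observation not excluded
            out.append(norm_cdf(xb_j))
    return out

# Made-up indexes; the third observation triggered the rule
xb = [-0.5, 0.2, -1.3]
excluded = [False, False, True]
```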
Performing hypothesis tests

After estimation with probit, you can perform hypothesis tests using the test or testnl commands; see [U] 23 Estimation and post-estimation commands.
Saved Results

probit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         probit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
dprobit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared
    e(pbar)        fraction of successes observed in data
    e(xbar)        average probit score
    e(offbar)      average offset

Macros
    e(cmd)         dprobit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict
    e(dummy)       string of blank-separated 0s and 1s; 0 means the corresponding
                   independent variable is not a dummy, 1 means that it is

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators
    e(dfdx)        marginal effects
    e(se_dfdx)     standard errors of the marginal effects

Functions
    e(sample)      marks estimation sample
Methods and Formulas

Probit analysis originated in connection with bioassay, and the word probit, a contraction of "probability unit", was suggested by Bliss (1934). For an introduction to probit, see, for example, Aldrich and Nelson (1984), Hamilton (1992), Johnston and DiNardo (1997), or Powers and Xie (2000).

prtest -- One- and two-sample tests of proportions

Remarks

The prtest output follows the output of ttest in providing a lot of information. Each proportion is presented along with a confidence interval. The appropriate one- or two-sample test is performed, and the two-sided and both one-sided results are included at the bottom of the output. In the case of a two-sample test, the calculated difference is also presented with its confidence interval. This command may be used for both large-sample testing and large-sample interval estimation.
Example

In the first form, prtest tests whether the proportion in the sample is equal to a known constant. Assume you have a sample of 74 automobiles. You wish to test whether the proportion of automobiles that are foreign is different from 40 percent.

. prtest foreign=.4

One-sample test of proportion            foreign: Number of obs =          74
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   .2972973   .0531331    5.59533  0.0000    .1931583    .4014363
------------------------------------------------------------------------------
          Ho: proportion(foreign) = .4

 Ha: foreign < .4           Ha: foreign ~= .4           Ha: foreign > .4
     z = -1.803                 z = -1.803                  z = -1.803
     P < z = 0.0357             P > |z| = 0.0713            P > z = 0.9643

The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is .40 at the 5% significance level.
Izl
- proportion(cure2) Ha:
z = -2.060 P < z =
of obs of obs
z
proportion(curel)
diff
Izl
diff
~= 0
= -2.060 =
0.0394
are statistically
=diff
Interval]
= 0
Ha:
diff
> 0
z = -2.060 P > z =
0.9803
different from each other at any level greater than
_j
prtest -- One-;and two-sampletests of proportions
i
597
Immediate form

Example

prtesti is like prtest, except that you specify summary statistics rather than variables as arguments. For instance, you are reading an article which reports the proportion of registered voters among 50 randomly selected eligible voters as .52. You wish to test whether the proportion is .7:

. prtesti 50 .52 .70

One-sample test of proportion                  x: Number of obs =          50
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |        .52   .0706541     7.3598  0.0000    .3815205    .6584795
------------------------------------------------------------------------------
          Ho: proportion(x) = .7

 Ha: x < .7                 Ha: x ~= .7                 Ha: x > .7
     z = -2.777                 z = -2.777                  z = -2.777
     P < z = 0.0027             P > |z| = 0.0055            P > z = 0.9973
Example

In order to judge teacher effectiveness, we wish to test whether the same proportion of people from two classes will answer an advanced question correctly. In the first classroom of 30 students, 40% answered the question correctly, whereas in the second classroom of 45 students, 67% answered the question correctly.

. prtesti 30 .4 45 .67

Two-sample test of proportion                  x: Number of obs =          30
                                               y: Number of obs =          45
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |         .4   .0894427                       .2246955    .5753045
           y |        .67   .0700952                       .5326160    .8073840
-------------+----------------------------------------------------------------
        diff |       -.27   .1136368                      -.4927242   -.0472758
------------------------------------------------------------------------------
          Ho: proportion(x) - proportion(y) = diff = 0

 Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
     z = -2.309                 z = -2.309                  z = -2.309
     P < z = 0.0105             P > |z| = 0.0210            P > z = 0.9895
Saved Results

prtest saves in r():

Scalars
    r(z)       z statistic
    r(P_#)     proportion for variable #
    r(N_#)     number of observations for variable #
Methods and Formulas

prtest and prtesti are implemented as ado-files.

A large-sample (1 − α)100% confidence interval for a proportion p is

    p̂ ± z₁₋α/₂ √(p̂q̂/n)

and a (1 − α)100% confidence interval for the difference of two proportions is given by

    (p̂₁ − p̂₂) ± z₁₋α/₂ √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)

where q̂ = 1 − p̂ and z is calculated from the inverse normal distribution.

The one-tailed and two-tailed tests of a population proportion use a normally distributed test statistic calculated as

    z = (p̂ − p₀) / √(p₀q₀/n)

where p₀ is the hypothesized proportion. A test of the difference of two proportions also uses a normally distributed test statistic calculated as

    z = (p̂₁ − p̂₂) / √( p̂q̂ (1/n₁ + 1/n₂) )

where

    p̂ = (x₁ + x₂) / (n₁ + n₂)

and x₁ and x₂ are the total numbers of successes in the two samples.
Also See

Related:      [R] bitest, [R] ci, [R] hotel, [R] oneway, [R] sdtest, [R] signrank, [R] ttest

Background:   [U] 22 Immediate commands