Title summarize — Summary statistics
Syntax

    summarize [varlist] [weight] [if exp] [in range] [, { detail | meanonly } format ]

by ... : may be used with summarize; see [R] by.

aweights and fweights are allowed.

The varlist following summarize may contain time-series operators; see [U] 14.4.3 Time-series varlists.
Description
summarize calculates and displays a variety of univariate summary statistics. If no varlist is specified, then summary statistics are calculated for all the variables in the dataset. Also see [R] ci for calculating the standard error and confidence intervals of the mean.
Options

detail produces additional statistics, including skewness, kurtosis, the four smallest and largest values, and various percentiles.

meanonly, which is allowed only when detail is not specified, suppresses the display of results and calculation of the variance. Ado-file writers will find this useful for fast calls.

format requests that the summary statistics be displayed using the display formats associated with the variables, rather than the default g display format; see [U] 15.5 Formats: controlling how data are displayed.
Remarks

summarize can produce two different sets of summary statistics. Without the detail option, the number of nonmissing observations, the mean and standard deviation, and the minimum and maximum values are presented. With detail, the same information is presented along with the variance, skewness, and kurtosis; the four smallest and four largest values; and the 1st, 5th, 10th, 25th, 50th (median), 75th, 90th, 95th, and 99th percentiles.

Example

You have data containing information on various automobiles, among which is the variable mpg, the mileage rating. We can obtain a quick summary of the mpg variable by typing

    . summarize mpg

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
         mpg |      74     21.2973   5.785503         12         41
We see that we have 74 observations. The mean of mpg is 21.3 miles per gallon, and the standard deviation is 5.79. The minimum is 12 and the maximum is 41. If we had not specified the variable (or variables) we wanted to summarize, we would have obtained summary statistics on all the variables in the dataset:

    . summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
        make |       0
       price |      74    6165.257   2949.496       3291      15906
         mpg |      74     21.2973   5.785503         12         41
       rep78 |      69    3.405797   .9899323          1          5
      weight |      74    3019.459   777.1936       1760       4840
     foreign |      74    .2972973   .4601885          0          1
Notice that there are only 69 observations on rep78, so some of the observations are missing. There are no observations on make since it is a string variable.

Example

The detail option provides all the information of a normal summarize and more. The format of the output also differs:

    . summarize mpg, detail

                            Mileage (mpg)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           12             12
     5%           14             12
    10%           14             14       Obs                  74
    25%           18             14       Sum of Wgt.          74

    50%           20                      Mean            21.2973
                            Largest       Std. Dev.      5.785503
    75%           25             34
    90%           29             35       Variance       33.47205
    95%           34             35       Skewness       .9487176
    99%           41             41       Kurtosis       3.975005
As in the previous example, we see that the mean of mpg is 21.3 miles per gallon and that the standard deviation is 5.79. We also see the various percentiles. The median of mpg (the 50th percentile) is 20 miles per gallon. The 25th percentile is 18 and the 75th percentile is 25. When we performed summarize, we learned that the minimum and maximum were 12 and 41, respectively. We now see that the four smallest values in our dataset are 12, 12, 14, and 14. The four largest values are 34, 35, 35, and 41. The skewness of the distribution is 0.95, and the kurtosis is 3.98. (A normal distribution would have a skewness of 0 and a kurtosis of 3.) (Skewness is a measure of the lack of symmetry of a distribution. If the coefficient of skewness is 0, the distribution is symmetric. If the coefficient is negative, the median is usually greater than the mean and the distribution is said to be skewed left. If the coefficient is positive, the median is usually less than the mean and the distribution is said to be skewed right. Kurtosis (from the Greek kyrtosis meaning curvature) is a measure of peakedness of a distribution. The smaller the coefficient of kurtosis, the flatter the distribution. The normal distribution has a coefficient of kurtosis of 3 and provides a convenient benchmark.) (On a historical note, see Plackett (1958) for a history of the concept of the mean.)
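The interpretation of skewness and kurtosis given above can be checked on small samples. The following is a minimal Python sketch, not Stata code; it assumes the moment-based definitions (third and fourth moments about the mean, each scaled by a power of the second moment, with divisor n). The function names and the two small samples are hypothetical.

```python
def moment(values, r):
    """r-th moment about the mean, using divisor n (not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def skewness(values):
    """Coefficient of skewness: m3 / m2^(3/2)."""
    return moment(values, 3) / moment(values, 2) ** 1.5


def kurtosis(values):
    """Coefficient of kurtosis: m4 / m2^2 (a normal distribution gives 3)."""
    return moment(values, 4) / moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]           # hypothetical, symmetric about 3
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical, one long right tail

print(skewness(symmetric))       # symmetric data: skewness is zero
print(skewness(right_skewed))    # positive: skewed right, median below mean
print(kurtosis(symmetric))       # below 3: flatter than a normal distribution
```

In the right-skewed sample the median (2) is less than the mean (about 3.17), which agrees with the rule of thumb stated in the text.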
Example

summarize can usefully be combined with the by varlist: prefix. In our dataset we have a variable foreign that distinguishes foreign and domestic cars. We can obtain summaries of mpg and weight within each subgroup by typing

    . by foreign: summarize mpg weight

    -> foreign = Domestic

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
         mpg |      52    19.82692   4.743297         12         34
      weight |      52    3317.115   695.3637       1800       4840

    -> foreign = Foreign

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
         mpg |      22    24.77273   6.611187         14         41
      weight |      22    2315.909   433.0035       1760       3420
Domestic cars in our dataset average 19.8 miles per gallon, whereas foreign cars average 24.8. Since by varlist: can be combined with summarize, it can also be combined with summarize, detail:

    . by foreign: summarize mpg, detail

    -> foreign = Domestic

                            Mileage (mpg)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           12             12
     5%           14             12
    10%           14             14       Obs                  52
    25%         16.5             14       Sum of Wgt.          52

    50%           19                      Mean           19.82692
                            Largest       Std. Dev.      4.743297
    75%           22             28
    90%           26             29       Variance       22.49887
    95%           29             30       Skewness       .7712432
    99%           34             34       Kurtosis       3.441459

    -> foreign = Foreign

                            Mileage (mpg)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           14             14
     5%           17             17
    10%           17             17       Obs                  22
    25%           21             18       Sum of Wgt.          22

    50%         24.5                      Mean           24.77273
                            Largest       Std. Dev.      6.611187
    75%           28             31
    90%           35             35       Variance       43.70779
    95%           35             35       Skewness        .657329
    99%           41             41       Kurtosis        3.10734
Technical Note

summarize respects display formats if you specify the format option. When we type summarize price weight, we obtain

    . summarize price weight

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
       price |      74    6165.257   2949.496       3291      15906
      weight |      74    3019.459   777.1936       1760       4840

The display is accurate but is not as aesthetically pleasing as you may wish, particularly if you plan to use the output directly in published work. By placing formats on the variables, you can control how the table appears:

    . format price weight %9.2fc
    . summarize price weight, format

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
       price |      74    6,165.26   2,949.50   3,291.00  15,906.00
      weight |      74    3,019.46     777.19   1,760.00   4,840.00
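For readers who post-process such tables outside Stata, the effect of a fixed format with two decimals and comma separators in a field of width 9 can be imitated with Python's format mini-language. This is only an analogy under that assumption, not a reimplementation of Stata's display formats; `fmt_comma` is a hypothetical helper name.

```python
def fmt_comma(value, width=9, decimals=2):
    """Right-justify value in `width` columns with comma separators
    and a fixed number of decimals."""
    return f"{value:{width},.{decimals}f}"


# Values taken from the unformatted summarize table above.
for v in (6165.257, 2949.496, 3291, 15906, 777.1936):
    print(fmt_comma(v))
```

The printed strings match the entries in the formatted table, for example 6,165.26 for the mean of price and 15,906.00 for its maximum.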
If you specify a weight (see [U] 14.1.6 weight), each observation is multiplied by the value of the weighting expression before the summary statistics are calculated, so that the weighting expression is interpreted as the discrete density of each observation.
Example

You have 1980 Census data on each of the 50 states. Included in your variables is medage, the median age of the population of each state. If you type summarize medage, you obtain unweighted statistics:

    . summarize medage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
      medage |      50       29.54   1.693445       24.2       34.7
Also among your variables is pop, the population in each state. Typing summarize medage [w=pop] produces population-weighted statistics:

    . summarize medage [w=pop]
    (analytic weights assumed)

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------------------
      medage |      50   225907472    30.11047    1.66933       24.2       34.7
The number listed under Weight is the sum of the weighting variable, pop. It indicates that there are roughly 226 million people in the U.S. The pop-weighted mean of medage is 30.11 (as compared with 29.54 for the unweighted statistic), and the weighted standard deviation is 1.67 (as compared with 1.69).
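The weighted mean and standard deviation can be reproduced by hand on a small example. The sketch below is in Python rather than Stata and assumes the analytic-weight convention in which the weights are rescaled to sum to the number of observations and the variance uses divisor n - 1; the three "states" and their populations are hypothetical.

```python
import math


def weighted_mean_sd(values, weights):
    """Return (sum of weights, weighted mean, weighted std. dev.).

    Weights are rescaled to sum to n before use, so the result does not
    depend on the overall scale of the weights.
    """
    n = len(values)
    total = sum(weights)
    w = [v * n / total for v in weights]
    mean = sum(wi * xi for wi, xi in zip(w, values)) / n
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, values)) / (n - 1)
    return total, mean, math.sqrt(var)


medage = [28.0, 29.0, 32.0]              # hypothetical median ages
pop = [1_000_000, 1_000_000, 2_000_000]  # hypothetical populations

total, mean, sd = weighted_mean_sd(medage, pop)
print(total, mean, sd)
```

Here the unweighted mean is about 29.67, while the weighted mean is pulled toward the most populous state, in the same way that the weighted mean of medage (30.11) differs from the unweighted one (29.54).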
Example

You can obtain detailed summaries of weighted data as well. When you do this, all the statistics are weighted, including the percentiles.

    . summarize medage [w=pop], detail
    (analytic weights assumed)

                             Median age
    -------------------------------------------------------------
          Percentiles      Smallest
     1%         27.1           24.2
     5%         27.7           26.1
    10%         28.2           27.1       Obs                  50
    25%         29.2           27.4       Sum of Wgt.   225907472

    50%         29.9                      Mean           30.11047
                            Largest       Std. Dev.       1.66933
    75%         30.9             32
    90%         32.1           32.1       Variance       2.786661
    95%         32.2           32.2       Skewness       .5281972
    99%         34.7           34.7       Kurtosis       4.494223
Technical Note

You are writing a program and need to access the mean of a variable. The meanonly option provides for fast calls. For example, suppose your program reads as follows:

    program define mean
            summarize `1', meanonly
            display " mean = " r(mean)
    end

The result of executing this is

    . mean price
     mean = 6165.2568
Saved Results

summarize saves in r():

Scalars
    r(N)          number of observations
    r(mean)       mean
    r(skewness)   skewness (detail only)
    r(min)        minimum
    r(max)        maximum
    r(sum_w)      sum of the weights
    r(p1)         1st percentile (detail only)
    r(p5)         5th percentile (detail only)
    r(p10)        10th percentile (detail only)
    r(p25)        25th percentile (detail only)
    r(p50)        50th percentile (detail only)
    r(p75)        75th percentile (detail only)
    r(p90)        90th percentile (detail only)
    r(p95)        95th percentile (detail only)
    r(p99)        99th percentile (detail only)
    r(Var)        variance
    r(kurtosis)   kurtosis (detail only)
    r(sum)        sum of variable
    r(sd)         standard deviation
Methods and Formulas

Let $x$ denote the variable on which we want to calculate summary statistics, and let $x_i$, $i = 1, \ldots, n$, denote an individual observation on $x$. Let $v_i$ be the weight, and if no weight is specified, define $v_i = 1$ for all $i$.

Define $V$ as the sum of the weights:

$$V = \sum_{i=1}^{n} v_i$$

Define $w_i$ to be $v_i$ normalized to sum to $n$, $w_i = v_i (n/V)$.

The mean, $\bar{x}$, is defined as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} w_i x_i$$

The variance, $s^2$, is defined as

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} w_i (x_i - \bar{x})^2$$

The standard deviation, $s$, is defined as $\sqrt{s^2}$.

Define $m_r$ as the $r$th moment about the mean $\bar{x}$:

$$m_r = \frac{1}{n} \sum_{i=1}^{n} w_i (x_i - \bar{x})^r$$

The coefficient of skewness is then defined as $m_3 m_2^{-3/2}$. The coefficient of kurtosis is defined as $m_4 m_2^{-2}$.

Let $x_{(i)}$ refer to the $x$ in ascending order, and let $w_{(i)}$ refer to the corresponding weights of $x_{(i)}$. The four smallest values are $x_{(1)}$, $x_{(2)}$, $x_{(3)}$, and $x_{(4)}$. The four largest values are $x_{(n)}$, $x_{(n-1)}$, $x_{(n-2)}$, and $x_{(n-3)}$.

To obtain the $p$th percentile, which we will denote as $x_{[p]}$, let $P = np/100$. Let

$$W_{(i)} = \sum_{j=1}^{i} w_{(j)}$$

Find the first index $i$ such that $W_{(i)} > P$. The $p$th percentile is then

$$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\[1ex] x_{(i)} & \text{otherwise} \end{cases}$$
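The percentile rule is easier to follow in code. The Python sketch below assumes the rule is: sort the data, accumulate the weights (rescaled to sum to n), stop at the first observation whose running weight exceeds np/100, and average that observation with the previous one when the previous running weight equals np/100 exactly. The function name, the tolerance used for the equality check, and the sample data are assumptions for illustration.

```python
def percentile(values, p, weights=None, tol=1e-9):
    """p-th percentile using cumulative normalized weights."""
    n = len(values)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    pairs = sorted(zip(values, weights))   # ascending order of x
    target = n * p / 100                   # P = n p / 100
    running = 0.0
    for idx, (x, v) in enumerate(pairs):
        previous = running
        running += v * n / total           # add the normalized weight
        if running > target + tol:         # first index with W(i) > P
            if idx > 0 and abs(previous - target) <= tol:
                return (pairs[idx - 1][0] + x) / 2
            return x
    return pairs[-1][0]                    # p = 100 falls through to the maximum


print(percentile([1, 2, 3, 4], 50))              # even n: average of middle pair
print(percentile([1, 2, 3, 4, 5], 50))           # odd n: middle value
print(percentile([10, 20, 30], 50, [1, 1, 2]))   # weighted example
```

With unit weights this reduces to the familiar median rule: the middle value for odd n and the average of the two middle values for even n.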
References

Gleason, J. R. 1997. sg67: Univariate summaries with boxplots. Stata Technical Bulletin 36: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 179-183.

Gleason, J. R. 1999. sg67.1: Update to univar. Stata Technical Bulletin 51: 27-28. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 159-161.

Hamilton, L. C. 1996. Data Analysis for Social Scientists. Pacific Grove, CA: Brooks/Cole Publishing Company.

Plackett, R. L. 1958. The principle of the arithmetic mean. Biometrika 45: 130-135.

Stuart, A. and J. K. Ord. 1994. Kendall's Advanced Theory of Statistics, Vol. 1. 6th ed. London: Edward Arnold.

Weisberg, H. F. 1992. Central Tendency and Variability. Newbury Park, CA: Sage Publications.
Also See

Related: [R] centile, [R] cf, [R] ci, [R] codebook, [R] compare, [R] describe, [R] egen, [R] inspect, [R] lv, [R] means, [R] pctile, [R] st stsum, [R] svymean, [R] table, [R] tabstat, [R] tabsum, [R] xtsum
Title sureg — Zellner's seemingly unrelated regression
Syntax

Basic syntax

    sureg (depvar1 varlist1) (depvar2 varlist2) ... (depvarN varlistN) [weight] [if exp] [in range]

Full syntax

    sureg ( [eqname1:]depvar1a [depvar1b ... =] varlist1 [, noconstant] )
          ( [eqname2:]depvar2a [depvar2b ... =] varlist2 [, noconstant] )
          ...
          ( [eqnameN:]depvarNa [depvarNb ... =] varlistN [, noconstant] )
          [weight] [if exp] [in range] [, corr constraints(numlist) isure dfk dfk2 small
          noheader notable level(#) maximize_options ]

by ... : may be used with sureg; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

The depvars and the varlists may contain time-series operators; see [U] 14.4.3 Time-series varlists.

sureg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Explicit equation naming (eqname:) cannot be combined with multiple dependent variables in an equation specification.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, equation(eqno [,eqno]) xb stdp difference stddp residuals ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description sureg estimates seemingly unrelated regression models (Zellner 1962, Zellner and Huang 1962, Zellner 1963). The acronyms SURE and SUR are often used for the estimator.
Options
noconstant omits the constant term (intercept) from the equation on which the option is specified.

corr displays the correlation matrix of the residuals between equations and performs a Breusch-Pagan test for independent equations; i.e., a test that the disturbance covariance matrix is diagonal.

constraints(numlist) specifies by number the linear constraint(s) to be applied to the system. By default, sureg estimates an unconstrained system. See [R] reg3 for an example using constraints with a system estimator.

isure specifies that sureg should iterate over the estimated disturbance covariance matrix and parameter estimates until the parameter estimates converge. Under seemingly unrelated regression, this iteration converges to the maximum likelihood results. If this option is not specified, sureg produces two-step estimates.

dfk specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. As an asymptotically justified estimator, sureg by default uses the number of sample observations ($n$) as a divisor. When the dfk option is set, a small-sample adjustment is made and the divisor is taken to be $\sqrt{(n - k_i)(n - k_j)}$, where $k_i$ and $k_j$ are the numbers of parameters in equations $i$ and $j$, respectively.

dfk2 specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. When the dfk2 option is set, the divisor is taken to be the mean of the residual degrees of freedom from the individual equations. This was the default divisor for sureg before version 6.0.

small specifies that small-sample statistics are to be computed. It shifts the test statistics from chi-squared and Z statistics to F statistics and t statistics. While the standard errors from each equation are computed using the degrees of freedom for the equation, the degrees of freedom for the t statistics are all taken to be those for the first equation. Before version 6.0, sureg reported small-sample statistics.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
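The three divisors described under the default, dfk, and dfk2 can be compared numerically. The Python sketch below is illustrative only; the function name and argument layout are hypothetical, and dfk2 is read here as the mean of the residual degrees of freedom over all equations in the system, which is an assumption about the wording above.

```python
import math


def residual_cov_divisor(n, params, i, j, method="default"):
    """Divisor for element (i, j) of the residual covariance matrix.

    n      : number of observations
    params : number of parameters in each equation (constant included)
    method : "default" (n), "dfk", or "dfk2"
    """
    if method == "default":
        return float(n)
    if method == "dfk":
        return math.sqrt((n - params[i]) * (n - params[j]))
    if method == "dfk2":
        residual_df = [n - k for k in params]
        return sum(residual_df) / len(residual_df)
    raise ValueError(f"unknown method: {method}")


# Two equations with 74 observations; 4 and 3 parameters respectively.
for method in ("default", "dfk", "dfk2"):
    print(method, residual_cov_divisor(74, [4, 3], 0, 1, method))
```

When every equation has the same number of parameters, dfk and dfk2 coincide; they differ slightly when the parameter counts differ, as in the call above.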
Options for predict

equation(eqno [,eqno]) specifies the equation to which you are referring. Equations may be referred to by number, as in equation(#1), or by their names: equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), the results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional.
xb, the default, calculates the fitted values, that is, the prediction of $x_j b$ for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

difference calculates the difference between the linear predictions of two equations in the system. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions ($x_{1j} b - x_{2j} b$) between equations 1 and 2 is calculated.

residuals calculates the residuals.

For more information on using predict after multiple-equation estimation commands, see [R] predict.
Remarks

Seemingly unrelated regression models are so called because they appear to be joint estimates of several regression models, each with its own error term. The regressions are related because the (contemporaneous) errors associated with the dependent variables may be correlated.

Example

When you estimate models with the same set of right-hand-side variables, the seemingly unrelated regression results (in terms of coefficients and standard errors) are the same as estimating the models separately (using, say, regress). The same is true when the models are nested. Even in such cases, sureg is useful when you want to perform joint tests. For instance, let us assume that you think

$$\text{price} = \beta_0 + \beta_1\,\text{foreign} + \beta_2\,\text{length} + u_1$$
$$\text{weight} = \gamma_0 + \gamma_1\,\text{foreign} + \gamma_2\,\text{length} + u_2$$

Since the models have the same set of explanatory variables, you could estimate the two equations separately. Yet, you might still choose to estimate them with sureg because you want to perform the joint test $\beta_1 = \gamma_1 = 0$. We use the small and dfk options to obtain small-sample statistics comparable to regress or mvreg.

    . sureg (price foreign length) (weight foreign length), small dfk

    Seemingly unrelated regression

    Equation      Obs   Parms        RMSE    "R-sq"      F-Stat        P
    --------------------------------------------------------------------
    price          74       2    2474.593    0.3154    16.35382   0.0000
    weight         74       2    250.2515    0.8992    316.5447   0.0000
             |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
    ---------+-------------------------------------------------------------------
    price    |
     foreign |   2801.143    766.117     3.66    0.000      1286.674    4315.611
      length |   90.21239   15.83368     5.70    0.000      58.91219    121.5126
       _cons |  -11621.35   3124.436    -3.72    0.000     -17797.77    -5444.93
    ---------+-------------------------------------------------------------------
    weight   |
     foreign |  -133.6775   77.47615    -1.73    0.087     -286.8332     19.4782
      length |   31.44455   1.601234    19.64    0.000      28.27921    34.60989
       _cons |   -2850.25   315.9691    -9.02    0.000     -3474.861   -2225.639
These two equations have a common set of regressors, and we could have used a shorthand syntax to specify the equations:

    . sureg (price weight = foreign length), small dfk

In this case, the results presented by sureg are the same as if we had estimated the equations separately:

    . regress price foreign length
    (output omitted)
    . regress weight foreign length
    (output omitted)

There is, however, a difference. We have allowed $u_1$ and $u_2$ to be correlated and have estimated the full variance-covariance matrix of the coefficients. sureg has estimated the correlations, but it does not report them unless we specify the corr option. We did not remember to specify corr when we estimated the model, but we can redisplay the results:

    . sureg, notable noheader corr

    Correlation matrix of residuals:

              price   weight
     price   1.0000
    weight   0.5840   1.0000

    Breusch-Pagan test of independence: chi2(1) =    25.237, Pr = 0.0000

The notable and noheader options prevented sureg from redisplaying the header and coefficient tables. We find that, for the same cars, the correlation of the residuals in the price and weight equations is .5840 and that we can reject the hypothesis that this correlation is zero.

We can perform a test that the coefficients on foreign are jointly zero in both equations, as we set out to do, by typing test foreign; see [R] test. When we type a variable without specifying the equation, that variable is tested for zero in all equations in which it appears:

    . test foreign

     ( 1)  [price]foreign = 0.0
     ( 2)  [weight]foreign = 0.0

           F(  2,    71) =   17.99
                Prob > F =  0.0000
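The Breusch-Pagan statistic reported by the corr option can be checked by hand from the residual correlation. The Python sketch below assumes the usual Lagrange multiplier form, the number of observations times the sum of squared below-diagonal residual correlations, with M(M-1)/2 degrees of freedom for M equations. The function names are hypothetical, and the p-value helper covers only the one-degree-of-freedom case that arises with two equations.

```python
import math


def breusch_pagan(corr, n_obs):
    """LM statistic and degrees of freedom from a residual correlation
    matrix given as a list of rows (only the lower triangle is read)."""
    m = len(corr)
    stat = n_obs * sum(corr[i][j] ** 2 for i in range(m) for j in range(i))
    return stat, m * (m - 1) // 2


def chi2_pvalue_df1(x):
    """Upper-tail probability of a chi-squared variate with 1 df."""
    return math.erfc(math.sqrt(x / 2))


corr = [[1.0, 0.5840],
        [0.5840, 1.0]]            # residual correlation reported by sureg
stat, df = breusch_pagan(corr, 74)
print(stat, df, chi2_pvalue_df1(stat))
```

Because the correlation is entered to only four decimals, the statistic agrees with the reported 25.237 up to rounding.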
Example

When the models do not have the same set of explanatory variables and are not nested, sureg may lead to more efficient estimates than running the models separately as well as allowing joint tests. This time, let us assume you believe

$$\text{price} = \beta_0 + \beta_1\,\text{foreign} + \beta_2\,\text{mpg} + \beta_3\,\text{displ} + u_1$$
$$\text{weight} = \gamma_0 + \gamma_1\,\text{foreign} + \gamma_2\,\text{length} + u_2$$

To estimate this model, you type

    . sureg (price foreign mpg displ) (weight foreign length), corr

    Seemingly unrelated regression

    Equation      Obs   Parms        RMSE    "R-sq"        chi2        P
    --------------------------------------------------------------------
    price          74       3    2165.321    0.4537     49.6383   0.0000
    weight         74       2    245.2916    0.8990    661.8418   0.0000

                 |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
    -------------+-------------------------------------------------------------------
    price        |
         foreign |    3058.25   685.7357     4.46    0.000      1714.233    4402.267
             mpg |  -104.9591   58.47209    -1.80    0.073     -219.5623    9.644042
    displacement |   18.18098   4.286372     4.24    0.000      9.779842    26.58211
           _cons |   3904.336   1966.521     1.99    0.047       50.0263    7758.645
    -------------+-------------------------------------------------------------------
    weight       |
         foreign |  -147.3481   75.44314    -1.95    0.051     -295.2139     .517755
          length |   30.94905   1.539895    20.10    0.000      27.93091    33.96718
           _cons |  -2753.064   303.9336    -9.06    0.000     -3348.763   -2157.365

    Correlation matrix of residuals:

              price   weight
     price   1.0000
    weight   0.3285   1.0000

    Breusch-Pagan test of independence: chi2(1) =     7.984, Pr = 0.0047
By way of comparison, had we estimated the price model separately:

    . regress price foreign mpg displ

      Source |       SS       df       MS              Number of obs =      74
    ---------+------------------------------           F(  3,    70) =   20.13
       Model |  294104790      3  98034929.9           Prob > F      =  0.0000
    Residual |  340960606     70  4870865.81           R-squared     =  0.4631
    ---------+------------------------------           Adj R-squared =  0.4401
       Total |  635065396     73  8699525.97           Root MSE      =  2207.0

           price |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
    -------------+-------------------------------------------------------------------
         foreign |   3545.484   712.7763     4.97    0.000      2123.897    4967.072
             mpg |  -98.88559   63.17063    -1.57    0.122     -224.8754    27.10426
    displacement |   22.40416   4.634239     4.83    0.000      13.16146    31.64686
           _cons |    2796.91   2137.873     1.31    0.195     -1466.943    7060.763
The coefficients are slightly different, but the standard errors are uniformly larger. This would still be true if we specified the dfk option to make a small-sample adjustment to the estimated covariance of the disturbances.
Technical Note

Constraints can be applied to SURE models using Stata's standard syntax for constraints. For a general discussion of constraints, see [R] constraint; for examples similar to seemingly unrelated regression models, see [R] reg3.
Saved Results

sureg saves in e():

Scalars
    e(N)        number of observations
    e(k)        number of parameters in system
    e(k_eq)     number of equations
    e(mss_#)    model sum of squares for equation #
                model degrees
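                                              Number of obs      =     10351
                                              Number of strata   =        31
                                              Number of PSUs     =        62
                                              Population size    = 1.172e+08
                                              F(   6,     26)    =     87.70
                                              Prob > F           =    0.0000

      highbp |       t     P>|t|      [95% Conf. Interval]
    ---------+---------------------------------------------
      height |   -5.55    0.000     -.0445771   -.0206222
      weight |   15.35    0.000      .0425545    .0555936
         age |    7.38    0.000      .1115486    .1966815
        age2 |   -5.31    0.000     -.0014877   -.0006616
      female |   -4.03    0.000      -.537066    -.175928
       black |    2.43    0.021      .0555615    .6302986
       _cons |   -4.22    0.000     -7.259813   -2.531668

We can redisplay the results expressed as odds ratios.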
svy estimators — Estimation commands for complex survey data

    . svylogit, or

    Survey logistic regression

    pweight:  finalwgt                        Number of obs      =     10351
    Strata:   stratid                         Number of strata   =        31
    PSU:      psuid                           Number of PSUs     =        62
                                              Population size    = 1.172e+08
                                              F(   6,     26)    =     87.70
                                              Prob > F           =    0.0000

      highbp | Odds Ratio   Std. Err.       t     P>|t|      [95% Conf. Interval]
    ---------+-------------------------------------------------------------------
      height |    .967926   .0056843    -5.55    0.000      .9564019     .979589
      weight |   1.050298   .0033574    15.35    0.000      1.043473    1.057168
         age |   1.166625   .0243485     7.38    0.000      1.118008    1.217356
        age2 |    .998926   .0002023    -5.31    0.000      .9985135    .9993386
      female |   .7001246   .0619858    -4.03    0.000      .5844605    .8386784
       black |    1.40907   .1985388     2.43    0.021      1.057134    1.878171
svylc can be used to estimate the sum of the coefficients for female and black.

    . svylc female + black

     ( 1)  female + black = 0.0

      highbp |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
    ---------+-------------------------------------------------------------------
         (1) |  -.0135669   .1653936    -0.08    0.935     -.3508894    .3237555
This result is more easily interpreted as an odds ratio.

    . svylc female + black, or

     ( 1)  female + black = 0.0

      highbp | Odds Ratio   Std. Err.       t     P>|t|      [95% Conf. Interval]
    ---------+-------------------------------------------------------------------
         (1) |   .9865247   .1631648    -0.08    0.935      .7040616    1.382309
The odds ratio 0.987 is an estimate of the ratio of the odds of having high blood pressure for black females over the odds for our reference category of nonblack males (controlling for height, weight, and age). You now know enough to use svylc for odds ratios; see [R] svylc for its other uses; see [R] lincom for more examples of odds ratios.
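The relationship between the coefficient table and the odds-ratio table can be verified numerically. The Python sketch below assumes that the odds ratio is the exponentiated coefficient, that its standard error follows from the delta method (odds ratio times the coefficient's standard error), and that the confidence limits are the exponentiated limits of the coefficient's interval. The function name is hypothetical; the inputs are the numbers reported by svylc above.

```python
import math


def as_odds_ratio(coef, std_err, ci_low, ci_high):
    """Convert a logit-scale estimate to the odds-ratio scale."""
    odds_ratio = math.exp(coef)
    return {
        "odds_ratio": odds_ratio,
        "std_err": odds_ratio * std_err,          # delta-method approximation
        "ci": (math.exp(ci_low), math.exp(ci_high)),
    }


result = as_odds_ratio(-0.0135669, 0.1653936, -0.3508894, 0.3237555)
print(result)
```

Note that the t statistic and p-value are unchanged by the transformation, and that the interval on the odds-ratio scale is not symmetric about the point estimate.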
                                              Prob > F           =    0.0001

    highlead | Odds Ratio   Std. Err.       t     P>|t|      [95% Conf. Interval]
    ---------+-------------------------------------------------------------------
         age |   1.014808   .0079836     1.87    0.071      .9986331    1.031244
      female |   .0269989   .0195109    -5.00    0.000      .0061714     .118115
Note that this time we specified the or option when we first issued the command.

The subpop(varname) option takes a 0/1 variable; the subpopulation of interest is defined by varname = 1. All other members of the sample not in the subpopulation are indicated by varname = 0. If a person's subpopulation status is unknown, then varname should be set to missing ('.') and those observations will be omitted from the analysis, as they should be. For instance, in the preceding example, if a person's race is unknown, race should be coded as missing rather than as nonblack (race = 0).

Note that using 'if black==1' to model the subpopulation would not give the same result. All the discussion in the section Warning about the use of if and in in [R] svymean applies to the variance estimates for svyreg, svylogit, and svyprobit as well.
Technical Note

Actually, the subpop(varname) option takes a zero/nonzero variable; the subpopulation of interest is defined by varname $\neq$ 0 and not missing. All other members of the sample not in the subpopulation are indicated by varname = 0. But 0, 1, and missing are typically the only values used for the subpop() variable.
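The three-way treatment of the subpop() variable (nonzero and nonmissing, zero, missing) can be written out explicitly. The Python sketch below only illustrates the classification rule stated above and says nothing about how the variance is then computed; the function name is hypothetical, and None stands in for a missing value.

```python
def classify_subpop(indicator):
    """Label each observation by its subpopulation status.

    "in"      : nonzero and not missing (member of the subpopulation)
    "out"     : zero (in the sample, not in the subpopulation)
    "omitted" : missing (dropped from the analysis)
    """
    labels = []
    for value in indicator:
        if value is None:
            labels.append("omitted")
        elif value != 0:
            labels.append("in")
        else:
            labels.append("out")
    return labels


print(classify_subpop([1, 0, None, 2]))
```

Observations labeled "out" remain part of the sample design, which is why restricting the data with an if condition does not give the same variance estimates.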
Example

In the NHANES II dataset, we have a variable health containing self-reported health status, which takes on the values 1-5, with 1 being "poor" and 5 being "excellent". Since this is an ordered categorical variable, it makes sense to model it using svyologit or svyoprobit. As predictors, we use basic demographic variables: female (1 if female, 0 if male), black (1 if black, 0 otherwise), age, and age2 (= age^2):
Saved Results

The svy estimation commands save in e():

Scalars
    e(N)          number of observations m
    e(N_strata)   number of strata L
    e(N_psu)      number of sampled PSUs n
    e(N_pop)      estimate of population size $\hat{M}$
    e(N_subpop)   estimate of subpopulation size
    e(N_sub)      subpopulation number of observations
    e(df_m)       model degrees of freedom
    e(df_r)       variance degrees of freedom = n - L
    e(F)          model F statistic
    e(k_cat)      number of categories (svymlogit, svyologit, svyoprobit)
    e(basecat)    base category value of dependent variable (svymlogit)
    e(ibasecat)   base category number (svymlogit)

Macros
    e(cmd)        command name (e.g., svyreg)
    e(depvar)     dependent variable name
    e(wtype)      weight type
    e(wexp)       weight variable or expression
    e(strata)     strata() variable
    e(psu)        psu() variable
    e(fpc)        fpc() variable
    e(offset)     offset() variable
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers (svyintreg, svymlogit, svypois)

Matrices
    e(b)          vector of estimates $\hat\beta$
    e(V)          design-based (co)variance estimates $\hat{V}$
    e(V_srs)      simple-random-sampling-without-replacement (co)variance $\hat{V}_{\rm srswor}$
    e(V_srswr)    simple-random-sampling-with-replacement (co)variance $\hat{V}_{\rm srswr}$ (only created when fpc() option is specified)
    e(deff)       vector of deff estimates
    e(deft)       vector of deft estimates
    e(cat)        vector of category values (svymlogit, svyologit, svyoprobit)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

All of the svy estimators are implemented as ado-files that call _robust; see [P] _robust. These commands use a variant on the basic weighted-point-estimation methods used by svytotal. They use "linearization"-based variance estimators that are natural extensions of the variance estimator used in svytotal. For general methodological background on regression and generalized-linear-model analyses of complex survey data, see, for example, Binder (1983), Cochran (1977), Fuller (1975), Godambe (1991), Kish and Frankel (1974), Sarndal et al. (1992), Shao (1996), and Skinner (1989). The notation and development presented below is adapted from Binder (1983). We use here the same notation as in the Methods and Formulas section of [R] svymean; that section should be read first.
Linear regression

We let $(h,i,j)$ index the elements in the population, where $h = 1, \ldots, L$ are the strata, $i = 1, \ldots, N_h$ are the PSUs in stratum $h$, and $j = 1, \ldots, M_{hi}$ are the elements in PSU $(h,i)$. The regression coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ are viewed as fixed finite-population parameters that we wish to estimate. These parameters are defined with respect to an outcome variable $Y_{hij}$ and a $(k+1)$-dimensional row vector of explanatory variables $X_{hij} = (X_{hij0}, \ldots, X_{hijk})$. As in nonsurvey work, we often have $X_{hij0}$ identically equal to unity, so that $\beta_0$ is an intercept coefficient. Within a finite-population context, we can formally define the regression coefficient vector $\beta$ as the solution to the vector estimating equation

$$G(\beta) = X'Y - X'X\beta = 0 \qquad (1)$$

where $Y$ is the vector of outcomes for the full population and $X$ is the matrix of explanatory variables for the full population. Assuming $(X'X)^{-1}$ exists, the solution to (1) is $\beta = (X'X)^{-1}X'Y$.

Given observations $(y_{hij}, x_{hij})$ collected through a complex sample design, we need to estimate $\beta$ in a way that accounts for the sample design. To do this, note that the matrix factors $X'X$ and $X'Y$ can be viewed as matrix population totals. For example, $X'Y = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} X_{hij}' Y_{hij}$. Thus, we estimate $X'X$ and $X'Y$ with the weighted estimators

$$\widehat{X'X} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij}' x_{hij} = X_s' W X_s$$

and

$$\widehat{X'Y} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij}' y_{hij} = X_s' W Y_s$$

where $X_s$ is the matrix of explanatory variables for the sample, $Y_s$ is the outcome vector for the sample, and $W = \mathrm{diag}(w_{hij})$ is a diagonal matrix containing the sampling weights $w_{hij}$. The corresponding coefficient estimator is

$$\hat\beta = (\widehat{X'X})^{-1}\, \widehat{X'Y} = (X_s' W X_s)^{-1} X_s' W Y_s \qquad (2)$$
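Equation (2) is a weighted least-squares formula. The Python sketch below works it out for the simplest case, an intercept and one explanatory variable, by solving the 2 x 2 weighted normal equations directly; it computes point estimates only and does not attempt the design-based variance. The function name and the small datasets are hypothetical.

```python
def weighted_ls(x, y, w):
    """Solve (X'WX) b = X'WY for a model with intercept and one regressor.

    Returns (b0, b1), the intercept and slope.
    """
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2            # determinant of X'WX
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1


# Points exactly on y = 2 + 3x: the weights cannot change the fit.
print(weighted_ls([0, 1, 2, 3], [2, 5, 8, 11], [10, 20, 5, 40]))

# Two observations share x = 1; the heavier one pulls the fitted value at x = 1.
print(weighted_ls([0, 1, 1], [0, 0, 2], [1, 1, 3]))
```

In the second call the fitted value at x = 1 is the weighted mean of the two outcomes observed there, which shows how the sampling weights enter the point estimates.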
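Note that equation (2) is what the regress command with aweights or iweights computes for point estimates.

The coefficient estimator $\hat\beta$ can also be defined as the solution to the weighted sample estimating equation

$$\hat{G}(\beta) = \widehat{X'Y} - \widehat{X'X}\beta = X_s' W Y_s - X_s' W X_s \beta = 0$$

We can write $\hat{G}(\beta)$ as

$$\hat{G}(\beta) = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, d_{hij} \qquad (3)$$

where $d_{hij} = x_{hij}' e_{hij}$ and $e_{hij} = y_{hij} - x_{hij}\beta$ is the regression residual associated with sample unit $(h,i,j)$. Thus, $\hat{G}(\beta)$ can be viewed as a special case of a total estimator.

Our variance estimator for $\hat\beta$ is based on the following "linearization" argument. A first-order Taylor expansion shows that

$$\hat\beta - \beta \approx -\left\{ \frac{\partial \hat{G}(\beta)}{\partial \beta} \right\}^{-1} \hat{G}(\beta)$$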
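Thus, our variance estimator for $\hat\beta$ is

$$\hat{V}(\hat\beta) = \left[ \left\{ \frac{\partial \hat{G}(\beta)}{\partial \beta} \right\}^{-1} \hat{V}\{\hat{G}(\beta)\} \left\{ \frac{\partial \hat{G}(\beta)}{\partial \beta} \right\}^{-T} \right]_{\beta = \hat\beta} = (X_s' W X_s)^{-1}\, \hat{V}\{\hat{G}(\beta)\}\big|_{\beta = \hat\beta}\, (X_s' W X_s)^{-1}$$

Viewing $\hat{G}(\beta)$ as a total estimator according to equation (3), the variance estimator $\hat{V}\{\hat{G}(\beta)\}|_{\beta = \hat\beta}$ can be computed using equation (3) from [R] svymean with $y_{hij}$ replaced by $d_{hij}$ and with $\hat\beta$ used to estimate $e_{hij}$.

Pseudo-maximum-likelihood estimators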
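To develop notation for our pseudo-maximum-likelihood estimators, suppose that we observed $(Y_{hij}, X_{hij})$ for the entire population, and that $(Y_{hij}, X_{hij})$ arose from a certain likelihood model (e.g., a logistic distribution). Let $l(\beta; Y_{hij}, X_{hij})$ be the associated "log likelihood" under this model. Then, for our finite population, we define the parameter $\beta$ by the vector estimating equation

$$G(\beta) = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} S(\beta; Y_{hij}, X_{hij}) = 0$$

where $S = \partial l / \partial \beta$ is the score vector; i.e., the first derivative with respect to $\beta$ of $l(\beta; Y_{hij}, X_{hij})$. Then, the "pseudo-maximum-likelihood" estimator $\hat\beta$ is the solution to the weighted sample estimating equation

$$\hat{G}(\beta) = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, S(\beta; y_{hij}, x_{hij}) = 0 \qquad (4)$$

Note that the solution $\hat\beta$ of equation (4) is what the nonsurvey version of the command with iweights produces for point estimates. Again, we use a first-order matrix Taylor series expansion to produce the variance estimator for $\hat\beta$:

$$\hat{V}(\hat\beta) = \left[ \left\{ \frac{\partial \hat{G}(\beta)}{\partial \beta} \right\}^{-1} \hat{V}\{\hat{G}(\beta)\} \left\{ \frac{\partial \hat{G}(\beta)}{\partial \beta} \right\}^{-T} \right]_{\beta = \hat\beta} = H^{-1}\, \hat{V}\{\hat{G}(\beta)\}\big|_{\beta = \hat\beta}\, H^{-1}$$

where $H$ is the Hessian for the weighted sample log-likelihood. We can write $\hat{G}(\beta)$ as

$$\hat{G}(\beta) = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, d_{hij}$$

where $d_{hij} = s_{hij} x_{hij}$ and $s_{hij}$ is the score index for element $(h,i,j)$. The term $s_{hij}$ is computed by rewriting the sample log-likelihood $l(\beta; y_{hij}, x_{hij})$ as a function of $x_{hij}\beta$:

$$s_{hij} = \frac{\partial\, l(x_{hij}\beta;\, y_{hij})}{\partial (x_{hij}\beta)}$$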
svy iestimafors — Estimation commands for complex survey data
31
Thus, again, (?(/?) can be'' viewed as a special cast of a total estimator, and the variance estimator V{G(fl}}\Sa£ is computed using equation (3) frtiim [R] svymean with y^j replaced by d^ and with 0 used to estimate
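The claim that equation (2) is what regress with iweights computes for point estimates can be checked by hand with Stata's matrix commands. The following is only a sketch on simulated data: the variable names y, x1, x2, and w are hypothetical, and the random-number functions uniform() and invnorm() are the older function names, which newer releases may spell differently.

```stata
* Hypothetical toy data; y, x1, x2, and w are illustrative names only.
clear
set obs 200
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate w  = 1 + 4*uniform()
generate y  = 1 + 2*x1 - x2 + invnorm(uniform())

* X_s'WX_s, with the constant added as the last row and column
matrix accum XWX = x1 x2 [iweight=w]

* Y_s'WX_s as a 1 x 3 row vector (the first variable listed is the outcome)
matrix vecaccum yWX = y x1 x2 [iweight=w]

* Transpose of equation (2): beta-hat' = Y_s'WX_s (X_s'WX_s)^-1
matrix b = yWX*syminv(XWX)
matrix list b

* The coefficients reported here should agree with b
regress y x1 x2 [iweight=w]
```

Only the point estimates are expected to agree; the standard errors from regress are not the design-based variance estimates described above.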
Acknowledgments
The svyreg, svylogit, and svyprobit commands were developed in collaboration with John L. Eltinge, Department of Statistics, Texas A&M University. We thank him for his invaluable assistance. We thank Wayne Johnson of the National Center for Health Statistics for providing the NHANES II dataset.
References
Binder, D. A. 1983. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 51: 279-292.
Cochran, W. G. 1977. Sampling Techniques. 3d ed. New York: John Wiley & Sons.
Eltinge, J. L. and W. M. Sribney. 1996. svy4: Linear, logistic, and probit regressions for survey data. Stata Technical Bulletin 31: 26-31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 239-245.
Fuller, W. A. 1975. Regression analysis for sample survey. Sankhyā, Series C 37: 117-132.
Godambe, V. P., ed. 1991. Estimating Functions. Oxford: Clarendon Press.
Gonzalez, J. F., Jr., N. Krauss, and C. Scott. 1992. Estimation in the 1988 National Maternal and Infant Health Survey. In Proceedings of the Section on Statistics Education, American Statistical Association, 343-348.
Johnson, W. 1995. Variance estimation for the NMIHS. Technical document. National Center for Health Statistics, Hyattsville, MD.
Kish, L. and M. R. Frankel. 1974. Inference from complex samples. Journal of the Royal Statistical Society B 36: 1-37.
Korn, E. L. and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician 44: 270-276.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Särndal, C.-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. New York: Springer-Verlag.
Shao, J. 1996. Resampling methods for sample surveys (with discussion). Statistics 27: 203-254.
Skinner, C. J. 1989. Introduction to Part A. In Analysis of Complex Surveys, ed. C. J. Skinner, D. Holt, and T. M. F. Smith, 23-58. New York: John Wiley & Sons.
Also See
Complementary:
[R] adjust, [R] constraint, [R] mfx, [R] svydes, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest, [R] testnl
Related:
[P] _robust
Background:
[U] 30 Overview of survey estimation, [R] svy
Title
svydes — Describe survey data

Syntax
    svydes [varlist] [weight] [if exp] [in range] [, strata(varname) psu(varname) fpc(varname) bypsu ]

pweights and iweights are allowed; see [U] 14.1.6 weight.
Description svydes displays a table that describes the strata and the primary sampling units for sample survey data.
Options
strata(varname) specifies the name of a variable (numeric or string) that contains stratum identifiers. strata() can also be specified with the svyset command; see [R] svyset.

psu(varname) specifies the name of a variable (numeric or string) that contains identifiers for the primary sampling unit (i.e., the cluster). psu() can also be specified with the svyset command.

fpc(varname) can be set here or with the svyset command. If an fpc variable has been specified, svydes checks the fpc variable for missing values. Other than this, svydes does not use the fpc variable. See [R] svymean for details on fpc.

bypsu specifies that results be displayed for each PSU in the dataset; that is, a separate line of output is produced for every PSU. This option can only be used when a PSU variable has been specified using the psu() option or set with svyset.

Note: Weights are checked for missing values, but are not otherwise used by svydes.
Remarks
Sample-survey data are typically stratified. Within each stratum, there are primary sampling units (PSUs), which may be either clusters of observations or individual observations.

svydes displays a table that describes the strata and PSUs in the dataset. By default, one row of the table is produced for each stratum. Displayed for each stratum are the number of PSUs, the range and mean of the number of observations per PSU, and the total number of observations. If the bypsu option is specified, svydes will display the number of observations in each PSU for every PSU in the dataset.

If a varlist is specified, svydes will report the number of PSUs that contain at least one observation with complete data (i.e., no missing values) for all variables in the varlist. These are precisely the PSUs that would be used to compute estimates for the variables in varlist using the svy estimation commands: svymean, svytotal, svyratio, svyprop, svytab, or any of the commands described in [R] svy estimators.

The variance estimation formulas for the svy estimation commands require at least two PSUs per stratum. If there are some strata with only a single PSU, an error message is displayed:

    . svymean x
    stratum with only one PSU detected
    r(460);

The stratum (or strata) with only one PSU can be located from the table produced by svydes x. After locating this stratum, it can be "collapsed" into an adjacent stratum, and then variance estimates can be computed. See the following examples for an illustration of the procedure. For details on the svy estimation commands, see [R] svymean and [R] svy estimators.
Example
We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. First, we set the strata, psu, and pweight variables.

    . svyset strata stratid
    . svyset psu psuid
    . svyset pweight finalwgt
Typing svydes will show us the strata and PSU arrangement of the dataset.

    . svydes

    pweight:  finalwgt
    Strata:   stratid
    PSU:      psuid
                                      #Obs per PSU
     Strata                     ----------------------
     stratid   #PSUs    #Obs      min    mean     max
    --------  ------  ------    -----  ------  ------
           1       2     380      165   190.0     215
           2       2     185       67    92.5     118
           3       2     348      149   174.0     199
           4       2     460      229   230.0     231
           5       2     252      105   126.0     147
           6       2     298      131   149.0     167
           7       2     476      206   238.0     270
           8       2     338      158   169.0     180
           9       2     244      100   122.0     144
          10       2     262      119   131.0     143
          11       2     275      120   137.5     155
          12       2     314      144   157.0     170
          13       2     342      154   171.0     188
          14       2     405      200   202.5     205
          15       2     380      189   190.0     191
          16       2     336      159   168.0     177
          17       2     393      180   196.5     213
          18       2     359      144   179.5     215
          20       2     285      125   142.5     160
          21       2     214      102   107.0     112
          22       2     301      128   150.5     173
          23       2     341      159   170.5     182
          24       2     438      205   219.0     233
          25       2     256      116   128.0     140
          26       2     261      129   130.5     132
          27       2     283      139   141.5     144
          28       2     299      136   149.5     163
          29       2     503      215   251.5     288
          30       2     365      166   182.5     199
          31       2     308      143   154.0     165
          32       2     450      211   225.0     239
    --------  ------  ------    -----  ------  ------
          31      62   10351       67   167.0     288
Our NHANES II dataset has 31 strata (stratum 19 is missing) and 2 PSUs per stratum. The variable hdresult contains serum levels of high-density lipoproteins (HDL). If we try to estimate the mean of hdresult, we get an error.

    . svymean hdresult
    stratum with only one PSU detected
    r(460);
Running svydes with hdresult as its varlist will show us which stratum or strata have only one PSU.

    . svydes hdresult

    pweight:  finalwgt
    Strata:   stratid
    PSU:      psuid
                                  #Obs with  #Obs with   #Obs per included PSU
     Strata     #PSUs    #PSUs    complete   missing    ----------------------
     stratid  included  omitted     data       data       min    mean     max
    --------  --------  -------  ---------  ---------   -----  ------  ------
           1        1*       1        114        266      114   114.0     114
           2        1*       1         98         87       98    98.0      98
           3        2        0        277         71      116   138.5     161
           4        2        0        340        120      160   170.0     180
           5        2        0        173         79       81    86.5      92
           6        2        0        255         43      116   127.5     139
           7        2        0        409         67      191   204.5     218
           8        2        0        299         39      129   149.5     170
           9        2        0        218         26       85   109.0     133
          10        2        0        233         29      103   116.5     130
          11        2        0        238         37       97   119.0     141
          12        2        0        275         39      121   137.5     154
          13        2        0        297         45      123   148.5     174
          14        2        0        355         50      167   177.5     188
          15        2        0        329         51      151   164.5     178
          16        2        0        280         56      134   140.0     146
          17        2        0        352         41      155   176.0     197
          18        2        0        335         24      135   167.5     200
          20        2        0        240         45       95   120.0     145
          21        2        0        198         16       91    99.0     107
          22        2        0        263         38      116   131.5     147
          23        2        0        304         37      143   152.0     161
          24        2        0        388         50      182   194.0     206
          25        2        0        239         17      106   119.5     133
          26        2        0        240         21      119   120.0     121
          27        2        0        259         24      127   129.5     132
          28        2        0        284         15      131   142.0     153
          29        2        0        440         63      193   220.0     247
          30        2        0        326         39      147   163.0     179
          31        2        0        279         29      121   139.5     158
          32        2        0        383         67      180   191.5     203
    --------  --------  -------  ---------  ---------   -----  ------  ------
          31       60        2       8720       1631       81   145.3     247
                                  ---------
                                     10351
Both stratid = 1 and stratid = 2 have only one PSU with nonmissing values of hdresult. Since this dataset has only 62 PSUs, the bypsu option will give a manageable amount of output:

    . svydes hdresult, bypsu

    pweight:  finalwgt
    Strata:   stratid
    PSU:      psuid
                        #Obs with  #Obs with
     Strata     PSU     complete   missing
     stratid    psuid     data       data
    --------  -------  ---------  ---------
           1        1          0        215
           1        2        114         51
           2        1         98         20
           2        2          0         67
           3        1        161         38
           3        2        116         33
      (output omitted)
          32        1        180         59
          32        2        203          8
    --------  -------  ---------  ---------
          31       62       8720       1631
                        ---------
                           10351
It is rather striking that there are two PSUs without any values for hdresult. All other PSUs have only a moderate number of missing values. Obviously, in a case such as this, a data analyst should first try to ascertain the reason why these data are missing. The answer here (Johnson 1995) is that HDL measurements could not be collected until the third survey location. Thus, there are no hdresult data for the first two locations: stratid = 1, psuid = 1 and stratid = 2, psuid = 2.

Assuming that we wish to go ahead and analyze the hdresult data, we must "collapse" strata (that is, merge them together) so that every stratum has at least two PSUs with some nonmissing values. We can accomplish this by collapsing stratid = 1 into stratid = 2. To perform the stratum collapse, we create a new strata identifier newstr and a new PSU identifier newpsu. This is easy to do using basic commands in Stata.

    . gen newstr = stratid
    . gen newpsu = psuid
    . replace newpsu = psuid + 2 if stratid==1
    (380 real changes made)
    . replace newstr = 2 if stratid==1
    (380 real changes made)

We set the new strata and PSU variables.

    . svyset strata newstr
    . svyset psu newpsu

We use svydes to check what we have done.
    . svydes hdresult, bypsu

    pweight:  finalwgt
    Strata:   newstr
    PSU:      newpsu
                        #Obs with  #Obs with
     Strata     PSU     complete   missing
     newstr     newpsu    data       data
    --------  -------  ---------  ---------
           2        1         98         20
           2        2          0         67
           2        3          0        215
           2        4        114         51
           3        1        161         38
           3        2        116         33
      (output omitted)
          32        1        180         59
          32        2        203          8
    --------  -------  ---------  ---------
          30       62       8720       1631
                        ---------
                           10351
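The same check can be done without reading the svydes table by eye. The sketch below rebuilds the collapse on a small made-up dataset (the six observations are hypothetical, not NHANES II values) and then asserts that every new stratum has at least two PSUs with a nonmissing hdresult. The helper variables ok, psu_ok, and npsu_ok are names introduced here for illustration.

```stata
* Hypothetical miniature of the situation above: stratum 1 and stratum 2
* each have one PSU with no hdresult data.
clear
input stratid psuid hdresult
1 1 .
1 2 50
2 1 45
2 2 .
3 1 40
3 2 55
end

* Collapse stratum 1 into stratum 2, exactly as in the text
gen newstr = stratid
gen newpsu = psuid
replace newpsu = psuid + 2 if stratid==1
replace newstr = 2 if stratid==1

* ok = 1 for observations with a nonmissing hdresult
gen byte ok = hdresult < .

* One flag per PSU: 1 if the PSU has any nonmissing value, 0 otherwise
sort newstr newpsu ok
by newstr newpsu: gen byte psu_ok = ok[_N] if _n == _N

* Number of usable PSUs in each stratum, copied to every observation
by newstr: gen npsu_ok = sum(psu_ok)
by newstr: replace npsu_ok = npsu_ok[_N]

* Stops with an error if any stratum has fewer than two usable PSUs
assert npsu_ok >= 2
```

Run against the uncollapsed identifiers (stratid and psuid in place of newstr and newpsu), the final assert would fail for strata 1 and 2, which is the condition that triggers the "stratum with only one PSU detected" error.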
The new stratum, newstr = 2, has 4 PSUs, 2 of which contain some nonmissing values of hdresult. This is sufficient to allow us to estimate the mean of hdresult.

    . svymean hdresult

    Survey mean estimation

    pweight:  finalwgt                    Number of obs    =       8720
    Strata:   newstr                      Number of strata =         30
    PSU:      newpsu                      Number of PSUs   =         60
                                          Population size  =   98725345

        Mean |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------+---------------------------------------------------------
    hdresult |   49.67141   .3830147    48.88919    50.45364   6.257131
Methods and Formulas svydes is implemented as an ado-file.
References
Eltinge, J. L. and W. M. Sribney. 1996. svy3: Describing survey data: sampling design and missing data. Stata Technical Bulletin 31: 23-26. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 235-239.
Johnson, C. L. 1995. Personal communication.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Also See Complementary:
[R] svy estimators, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest
Background:
[U] 30 Overview of survey estimation, [R] svy
Title
svylc — Estimate linear combinations after survey estimation

Syntax
    svylc [exp] [, show or irr rrr eform level(#) deff deft meff meft ]
Description
svylc produces estimates for linear combinations of parameters after a svy estimation command; i.e., any of the commands svymean, svytotal, svyratio, or the commands described in [R] svy estimators. Estimating differences of subpopulation means, for example, can be done by running svymean with a by() option, and then running svylc.

The svylc command computes estimates of linear combinations of parameters, whether means, totals, ratios, proportions, or regression coefficients. Used after svylogit, it will compute odds ratios for any covariate group relative to another. After svymlogit, it can compute relative risk ratios, and after svypois, it can compute incidence rate ratios.

svylc is the equivalent of lincom for survey data. See [R] lincom for a thorough coverage of odds ratios.
Options
Example
We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. Suppose that we wish to estimate the difference of the means of systolic (variable bpsystol) and diastolic (variable bpdiast) blood pressures. First, we estimate the means, and then we use svylc.

    . svymean bpsystol bpdiast

    Survey mean estimation

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08

        Mean |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------+---------------------------------------------------------
    bpsystol |   126.9458    .603462     125.715    128.1766   8.230475
     bpdiast |   81.01726   .5090314    79.97909    82.05544   16.38656

    . svylc bpsystol - bpdiast

     ( 1)  bpsystol - bpdiast = 0.0

        Mean |   Estimate   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   45.92852   .2988395   153.69   0.000    45.31903    46.53801
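The t statistic and confidence limits in the svylc table follow directly from the estimate and its standard error. As a hedged sketch, the arithmetic can be reproduced with display, taking the design degrees of freedom to be the number of PSUs minus the number of strata (62 - 31 = 31), which is the usual convention for these commands rather than something stated in this entry. The sketch assumes a Stata release that provides invttail().

```stata
* Numbers copied from the svylc output above
scalar est = 45.92852
scalar se  = .2988395

* Design degrees of freedom: number of PSUs minus number of strata
scalar df  = 62 - 31

* t statistic and 95% confidence limits
display "t     = " est/se
display "lower = " est - invttail(df, 0.025)*se
display "upper = " est + invttail(df, 0.025)*se
```

The three displayed values should match the t, lower, and upper entries of the table up to rounding. After an actual svylc call, r(est) and r(se) could be used in place of the typed-in numbers.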
We can also specify any of the options deff, deft, meff, or meft, or change the confidence level (i.e., nominal coverage rate) of the confidence interval.

    . svylc bpsystol - bpdiast, level(90) deff meff

     ( 1)  bpsystol - bpdiast = 0.0

        Mean |   Estimate   Std. Err.       t    P>|t|   [90% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   45.92852   .2988395   153.69   0.000    45.42183    46.43521

        Mean |       Deff       Meff
    ---------+----------------------
         (1) |   5.835532   3.087148
svylc works in the same manner after using the subpop option.

    . svymean bpsystol bpdiast, subpop(female)

    Survey mean estimation

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
    Subpop.:  female==1                   Population size  =  1.172e+08

        Mean |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------+---------------------------------------------------------
    bpsystol |   124.2027   .7051858    122.7644    125.6409   5.162487
     bpdiast |   79.03227   .5207306    77.97023    80.09431   8.973799

    . svylc bpsystol - bpdiast

     ( 1)  bpsystol - bpdiast = 0.0

        Mean |   Estimate   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   45.17039   .4040852   111.78   0.000    44.34625    45.99453
Missing data: The complete and available options

The svymean, svytotal, and svyratio commands can handle missing data in two ways; see [R] svymean. The available option (which is the default when there are missing values and two or more variables) uses every available nonmissing value for each variable separately. The complete option (which is the default when there are no missing values or only one variable) uses only those observations with nonmissing values for all variables in the varlist. Here is an example where available is the default:
Example

    . svymean tcresult tgresult

    Survey mean estimation

    pweight:  finalwgt                    Number of obs(*) =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08

        Mean |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------+---------------------------------------------------------
    tcresult |   213.0977   1.127252    210.7986    215.3967   5.602499
    tgresult |    138.576   2.071934    134.3503    142.8018   2.356968

    (*) Some variables contain missing values.
We redisplay the results using the obs option to see how many observations were used for each estimate.

    . svymean, obs

    Survey mean estimation

    pweight:  finalwgt                    Number of obs(*) =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08

        Mean |   Estimate   Std. Err.       Obs
    ---------+---------------------------------
    tcresult |   213.0977   1.127252      10351
    tgresult |    138.576   2.071934       5050

    (*) Some variables contain missing values.
Because we estimated the mean of tgresult using a different set of observations than tcresult, we could not compute the covariance between the two, and hence, we cannot estimate the variance of the difference. So if we now ask svylc to estimate this difference, we get an error message.

    . svylc tcresult - tgresult
    must run svy command with "complete" option before using this command
    r(301);

                                        P>|t|   [95% Conf. Interval]
                                        0.000    68.66129    76.98163
The use of svylc after svy model estimators

svylc can be used after any of the commands described in [R] svy estimators to estimate a linear combination of the coefficients (i.e., the β's). Using svylc after svyreg is straightforward:

Example

    . svyreg tcresult bpsystol bpdiast age age2

    Survey linear regression

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08
                                          F(   4,     28)  =     307.00
                                          Prob > F         =     0.0000
                                          R-squared        =     0.1945

    tcresult |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
    bpsystol |   .1060743   .0346796     3.06   0.005    .0353449    .1768038
     bpdiast |   .2966662   .0569594     5.21   0.000    .1804969    .4128356
         age |    3.35711   .2099842    15.99   0.000    2.928844    3.785375
        age2 |  -.0247207   .0020795   -11.89   0.000   -.0289619   -.0204796
       _cons |    83.8242   5.649261    14.84   0.000    72.30246    95.34594

    . svylc bpsystol - bpdiast

     ( 1)  bpsystol - bpdiast = 0.0

    tcresult |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |  -.1905919   .0818056    -2.33   0.027   -.3574354   -.0237483
                                          Number of obs    =      10351
                                          Number of strata =         31
                                          Number of PSUs   =         62
                                          Population size  =  1.172e+08
                                          F(   6,     26)  =      87.70
                                          Prob > F         =     0.0000

      highbp |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
      height |  -.0325996   .0058727    -5.55   0.000   -.0445771   -.0206222
      weight |    .049074   .0031966    15.35   0.000    .0425545    .0555936
         age |   .1541151   .0208709     7.38   0.000    .1115486    .1966815
        age2 |  -.0010746   .0002025    -5.31   0.000   -.0014877   -.0006616
      female |   -.356497   .0885354    -4.03   0.000    -.537066    -.175928
       black |   .3429301   .1409005     2.43   0.021    .0555615    .6302986
       _cons |   -4.89574   1.159135    -4.22   0.000   -7.259813   -2.531668
We can redisplay the results expressed as odds ratios.

    . svylogit, or

    Survey logistic regression

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08
                                          F(   6,     26)  =      87.70
                                          Prob > F         =     0.0000

      highbp | Odds Ratio   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
      height |    .967926   .0056843    -5.55   0.000    .9564019     .979589
      weight |   1.050298   .0033574    15.35   0.000    1.043473    1.057168
         age |   1.166625   .0243485     7.38   0.000    1.118008    1.217356
        age2 |    .998926   .0002023    -5.31   0.000    .9985135    .9993386
      female |   .7001246   .0619858    -4.03   0.000    .5844605    .8386784
       black |    1.40907   .1985388     2.43   0.021    1.057134    1.878171
svylc can be used to estimate the sum of the coefficients for female and black.

    . svylc female + black

     ( 1)  female + black = 0.0

      highbp |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |  -.0135669   .1653936    -0.08   0.935   -.3508894    .3237555
This result is more easily interpreted as an odds ratio.

    . svylc female + black, or

     ( 1)  female + black = 0.0

      highbp | Odds Ratio   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   .9865247   .1631648    -0.08   0.935    .7040616    1.382309

The odds ratio 0.987 is an estimate of the ratio of the odds of having high blood pressure for black females over the odds for our reference category of nonblack males (controlling for height, weight, and age). See [R] lincom for more examples of odds ratios.
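The relationship between the two svylc tables can be verified with a few lines of arithmetic: the odds ratio and its confidence limits are the exponentiated coefficient and limits, and the reported standard error is consistent with the delta-method value exp(b) times the coefficient's standard error. The sketch below simply retypes numbers from the output above; it is an illustration, not a description of how svylc is implemented.

```stata
* Numbers copied from the two svylc outputs above
scalar b  = -.0135669
scalar se = .1653936

* Odds ratio and its delta-method standard error
display "odds ratio = " exp(b)
display "std. err.  = " exp(b)*se

* Confidence limits are the exponentiated limits for the coefficient
display "lower      = " exp(-.3508894)
display "upper      = " exp(.3237555)
```

Note that the t statistic and p-value are unchanged between the two tables, because the test is still of the coefficient sum being zero (an odds ratio of one).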
Using svylc after estimators that estimate multiple-equation models is a little trickier, but still straightforward. Users merely need to refer to the coefficients using the syntax for multiple equations; see [U] 16.5 Accessing coefficients and standard errors for a description and [R] test for examples of its use.

Example
In the NHANES II data, we have a variable health containing self-reported health status, which takes on the values 1-5, with 1 being "poor" and 5 being "excellent". Since this is an ordered categorical variable, it makes sense to model it using svyologit or svyoprobit. We will do so in the next example, but we will first use svymlogit since it is a good example of a multiple-equation estimator at its simplest. So, we estimate a multinomial logistic regression model:

    . svymlogit health female black age age2

    Survey multinomial logistic regression

    pweight:  finalwgt                    Number of obs    =      10335
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.170e+08
                                          F(  16,     16)  =      36.41
                                          Prob > F         =     0.0000

      health |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
    poor     |
      female |  -.1983735   .1072747    -1.85   0.074   -.4171617    .0204147
       black |   .8964694   .1797728     4.99   0.000    .5298203    1.263119
         age |   .0990246    .032111     3.08   0.004    .0335338    .1645155
        age2 |  -.0004749   .0003209    -1.48   0.149   -.0011294    .0001796
       _cons |  -5.475074   .7468576    -7.33   0.000     -6.9983   -3.951848
    ---------+---------------------------------------------------------------
    fair     |
      female |   .1782371   .0726556     2.45   0.020     .030055    .3264193
       black |   .4429445    .122667     3.61   0.001    .1927635    .6931256
         age |   .0024576   .0172236     0.14   0.887   -.0326702    .0375853
        age2 |   .0002875   .0001684     1.71   0.098   -.0000559     .000631
       _cons |  -1.819561   .4018153    -4.53   0.000   -2.639069   -1.000053
    ---------+---------------------------------------------------------------
    good     |
      female |  -.0458251    .074169    -0.62   0.541   -.1970938    .1054437
       black |  -.7532011   .1105444    -6.81   0.000   -.9786579   -.5277443
         age |   -.061369    .009794    -6.27   0.000    -.081344   -.0413939
        age2 |   .0004166   .0001077     3.87   0.001     .000197    .0006363
       _cons |   1.815323   .1996917     9.09   0.000    1.408049    2.222597
    ---------+---------------------------------------------------------------
    excellent|
      female |   -.222799   .0754205    -2.95   0.006   -.3766202   -.0689778
       black |   -.991647   .1238806    -8.00   0.000   -1.244303   -.7389909
         age |  -.0293573   .0137789    -2.13   0.041   -.0574595    -.001255
        age2 |  -.0000674   .0001505    -0.45   0.657   -.0003744    .0002396
       _cons |   1.499683    .286143     5.24   0.000    .9160909    2.083276
    (Outcome health==average is the comparison group)

One might want to calculate the estimate for black females for the "excellent" category:

    . svylc [excellent]female + [excellent]black

     ( 1)  [excellent]female + [excellent]black = 0.0

      health |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |  -1.214446   .1428188    -8.50   0.000   -1.505727   -.9231652
This result might be better interpreted as a relative risk ratio. Since the estimate was negative, one could reverse signs to get a relative risk ratio that is greater than one:

    . svylc -[excellent]female - [excellent]black, rrr

     ( 1)  - [excellent]female - [excellent]black = 0.0

      health |        RRR   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   3.368427   .4810747     8.50   0.000    2.517245    4.507429

RRR = 3.37 is the ratio of relative risk for nonblack males to black females, with "relative risk" being the probability of being in the "excellent" category divided by the probability of being in the "average" base category. Hence, this relative risk ratio is

$$\mathrm{RRR} = \frac{\Pr(\text{excellent} \mid \text{nonblack male})\,/\,\Pr(\text{average} \mid \text{nonblack male})}{\Pr(\text{excellent} \mid \text{black female})\,/\,\Pr(\text{average} \mid \text{black female})}$$
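Reversing the signs of the linear combination is the same as taking the reciprocal of the relative risk ratio, which is why the second svylc call returns a value above one. A short arithmetic sketch, using only numbers copied from the two outputs above:

```stata
* Linear combination from the first svylc call (black females, "excellent")
scalar b = -1.214446

* RRR for black females relative to nonblack males (less than one)
display "exp(b)   = " exp(b)

* Reversing signs gives the reciprocal, i.e., the RRR reported above
display "exp(-b)  = " exp(-b)
display "1/exp(b) = " 1/exp(b)

* The confidence limits swap roles when the signs are reversed
display "lower    = " exp(.9231652)
display "upper    = " exp(1.505727)
```

The last two lines exponentiate the negated upper and lower limits of the coefficient interval, which should reproduce the limits in the rrr table up to rounding.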
We now estimate the same model using svyologit:

    . svyologit health female black age age2

    Survey ordered logistic regression

    pweight:  finalwgt                    Number of obs    =      10335
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.170e+08
                                          F(   4,     28)  =     223.27
                                          Prob > F         =     0.0000

      health |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
      female |  -.1615219   .0523678    -3.08   0.004   -.2683266   -.0547171
       black |   -.986568   .0790276   -12.48   0.000   -1.147746   -.8253901
         age |  -.0119491   .0082974    -1.44   0.160   -.0288717    .0049736
        age2 |  -.0003234    .000091    -3.55   0.001    -.000509   -.0001377
    ---------+---------------------------------------------------------------
       /cut1 |  -4.566229   .1632559   -27.97   0.000   -4.899192   -4.233266
       /cut2 |  -3.057415   .1699943   -17.99   0.000   -3.404121   -2.710709
       /cut3 |  -1.520596   .1714341    -8.87   0.000   -1.870238   -1.170954
       /cut4 |   -.242785   .1703964    -1.42   0.164   -.5903107    .1047407
Although svyologit and svyoprobit are multiple-equation estimators, one can refer to the estimates in the first equation using single-equation syntax:

    . svylc female + black

     ( 1)  [health]female + [health]black = 0.0

      health |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   -1.14809   .1008367   -11.39   0.000   -1.353748    -.942432

The single-equation syntax does not work when referring to the cutpoints:

    . svylc cut1 - cut2
    cut1 not found
    r(111);

When in doubt, always use the show option. It will show you exactly how the equations are labeled.
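    . svylc, show

             |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
    health   |
      female |  -.1615219   .0523678    -3.08   0.004   -.2683266   -.0547171
       black |   -.986568   .0790276   -12.48   0.000   -1.147746   -.8253901
         age |  -.0119491   .0082974    -1.44   0.160   -.0288717    .0049736
        age2 |  -.0003234    .000091    -3.55   0.001    -.000509   -.0001377
    ---------+---------------------------------------------------------------
    cut1     |
       _cons |  -4.566229   .1632559   -27.97   0.000   -4.899192   -4.233266
    ---------+---------------------------------------------------------------
    cut2     |
       _cons |  -3.057415   .1699943   -17.99   0.000   -3.404121   -2.710709
    ---------+---------------------------------------------------------------
    cut3     |
       _cons |  -1.520596   .1714341    -8.87   0.000   -1.870238   -1.170954
    ---------+---------------------------------------------------------------
    cut4     |
       _cons |   -.242785   .1703964    -1.42   0.164   -.5903107    .1047407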
The output of svyologit and svyoprobit is actually quite deceptive. The first equation contains all the coefficient estimates, but then there is one equation for each cutpoint. To estimate differences of the cutpoints, use the multiple-equation syntax:

    . svylc [cut2]_cons - [cut1]_cons

     ( 1)  - [cut1]_cons + [cut2]_cons = 0.0

      health |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   1.508814   .0501686    30.07   0.000    1.406495    1.611134
Subpopulations with one by() variable

The svymean, svytotal, and svyratio commands allow a by() option which produces estimates for subpopulations; see [R] svymean. Frequently, one wishes to compute estimates for differences of subpopulation estimates. It is easy to use svylc to compute estimates for differences or any other linear combination of estimates. The only thing one must know is the proper syntax for referencing the subpopulation estimates. In this and the next two sections, we illustrate the syntax with a series of examples.

Example
Suppose that we wish to get an estimate of the difference in mean vitamin C levels (variable vitaminc) between males and females. First, we compute the means of vitaminc by sex.

    . svymean vitaminc, by(sex)

    Survey mean estimation

    pweight:  finalwgt                    Number of obs    =       9973
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.129e+08

    Mean   Subpop. |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------------+---------------------------------------------------------
    vitaminc       |
              Male |   .9312051   .0169297    .8966768    .9657333   4.926449
            Female |    1.12753   .0173704    1.092103    1.162957   5.028652
Then we use the svylc command.

    . svylc [vitaminc]Male - [vitaminc]Female

     ( 1)  [vitaminc]Male - [vitaminc]Female = 0.0

        Mean |   Estimate   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |  -.1963252    .015981   -12.28   0.000   -.2289186   -.1637318
When svymean or svytotal is used with a by() option, the syntax for referencing the subpopulation estimates is

    [varname]subpop_label

For example, we use [vitaminc]Male to refer to the subpopulation estimates. This is the same syntax that is used with the test command when there are multiple equations; see [R] test for full details. Be sure to type the variable names and subpopulation labels exactly as they are displayed in the output. Remember that Stata is case-sensitive.

    . svylc [vitamin
                                        P>|t|   [95% Conf. Interval]
                                        0.000   -.2289186   -.1637318
Subpopulations with two or more by() variables

If there are two or more by() variables, you must refer to the subpopulations by numbers (1, 2, 3, ...) when using svylc.

Example

    . svymean vitaminc, by(sex race)

    Survey mean estimation

    pweight:  finalwgt                    Number of obs    =       9973
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.129e+08

    Mean   Subpop.    |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ------------------+---------------------------------------------------------
    vitaminc          |
        Male    White |   .9475117   .0168982    .9130475    .9819758   4.646413
        Male    Black |   .7382045   .0477521    .6408135    .8355955   2.165885
        Male    Other |   1.021363   .0521427     .915017    1.127708   1.739788
      Female    White |   1.151125   .0168117    1.116838    1.185413   4.032603
      Female    Black |   .9222313   .0348224    .8512105     .993252   2.915009
      Female    Other |     1.0804   .0412742    .9962202    1.164579    1.00135
You can see the numbering scheme by running svylc with the show option.

    . svylc, show

        Mean |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
    vitaminc |
           1 |   .9475117   .0168982    56.07   0.000    .9130475    .9819758
           2 |   .7382045   .0477521    15.46   0.000    .6408135    .8355955
           3 |   1.021363   .0521427    19.59   0.000     .915017    1.127708
           4 |   1.151125   .0168117    68.47   0.000    1.116838    1.185413
           5 |   .9222313   .0348224    26.48   0.000    .8512105     .993252
           6 |     1.0804   .0412742    26.18   0.000    .9962202    1.164579
So if we want to test the hypothesis that vitamin C levels are the same in white females and black females, we need to test subpopulation 4 versus subpopulation 5.

    . svylc [vitaminc]4 - [vitaminc]5

     ( 1)  [vitaminc]4 - [vitaminc]5 = 0.0

        Mean |   Estimate   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   .2288941   .0337949     6.77   0.000    .1599688    .2978193
The use of svylc after svyratio

Using svylc after svyratio is a little more complicated. But, again, the show option on svylc will guide you.

Example

    . svyratio y1/x1 y2/x2

    Survey ratio estimation

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08

       Ratio |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------+---------------------------------------------------------
       y1/x1 |   .9918905   .0102386    .9710087    1.012772   1.647415
       y2/x2 |   .9962729   .0083088    .9793269    1.013219     1.0771
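    . svylc, show

       Ratio |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
    y1       |
          x1 |   .9918905   .0102386    96.88   0.000    .9710087    1.012772
    ---------+---------------------------------------------------------------
    y2       |
          x2 |   .9962729   .0083088   119.91   0.000    .9793269    1.013219

    . svylc [y1]x1 - [y2]x2

     ( 1)  [y1]x1 - [y2]x2 = 0.0

       Ratio |   Estimate   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |  -.0043824   .0125921    -0.35   0.730   -.0300641    .0212993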
The following examples illustrate the syntax when there are by() subpopulations.

Example

    . svyratio y1/x1, by(race)

    Survey ratio estimation

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08

    Ratio  Subpop. |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ---------------+---------------------------------------------------------
    y1/x1          |
             White |    .995116   .0116867    .9712807    1.018951   1.879759
             Black |   .9525558   .0381059    .8748384    1.030273   2.242268
             Other |   1.026876   .0447707    .9355659    1.118187   .8308877

    . svylc, show

       Ratio |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
    1        |
       White |    .995116   .0116867    85.15   0.000    .9712807    1.018951
       Black |   .9525558   .0381059    25.00   0.000    .8748384    1.030273
       Other |   1.026876   .0447707    22.94   0.000    .9355659    1.118187

    . svylc [1]White - [1]Black

     ( 1)  [1]White - [1]Black = 0.0

       Ratio |   Estimate   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+---------------------------------------------------------------
         (1) |   .0425602   .0439945     0.97   0.341   -.0471671    .1322875

    . svyratio y1/x1, by(sex race)

    Survey ratio estimation

    pweight:  finalwgt                    Number of obs    =      10351
    Strata:   strata                      Number of strata =         31
    PSU:      psu                         Number of PSUs   =         62
                                          Population size  =  1.172e+08

    Ratio  Subpop.    |   Estimate   Std. Err.   [95% Conf. Interval]       Deff
    ------------------+---------------------------------------------------------
    y1/x1             |
        Male    White |   1.000215   .0150805    .9694585    1.030972   1.460442
        Male    Black |   .9726418   .0486307    .8734589    1.071825   1.426839
        Male    Other |   1.000358   .0732775     .850907    1.149808   1.266913
      Female    White |   .9904237   .0169396    .9558752    1.024972   2.109029
      Female    Black |   .9362548   .0409748    .8526861    1.019823   1.619815
      Female    Other |   1.056553    .082305    .8886906    1.224415   1.228803
. svylc, show

   Ratio |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
---------+-----------------------------------------------------------------
1        |
       1 |   1.000215   .0150805    66.33   0.000     .9694585    1.030972
       2 |   .9726418   .0486307    20.00   0.000     .8734589    1.071825
       3 |   1.000358   .0732775    13.65   0.000      .850907    1.149808
       4 |   .9904237   .0169396    58.47   0.000     .9558752    1.024972
       5 |   .9362548   .0409748    22.85   0.000     .8526861    1.019823
       6 |   1.056553    .082305    12.84   0.000     .8886906    1.224415
. svylc [1]1 - [1]4

 ( 1)  [1]1 - [1]4 = 0.0

   Ratio |  Estimate   Std. Err.       t    P>|t|    [95% Conf. Interval]
---------+----------------------------------------------------------------
     (1) |  .0097916   .0221119     0.44   0.661    -.0353058     .054889
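With by(sex race), svylc, show labels the six subpopulations 1 through 6 in the same order in which the svyratio table lists them, so [1]1 - [1]4 compares Male White with Female White. The Python sketch below reproduces that numbering and the difference from the displayed coefficients. The dictionary names are illustrative, and the mapping is inferred from matching the estimates in the two tables, not from a stated rule.

```python
from itertools import product

sexes = ["Male", "Female"]
races = ["White", "Black", "Other"]

# Number the sex-by-race cells 1..6 in the order the svyratio table lists them.
labels = {
    i: f"{sex} {race}"
    for i, (sex, race) in enumerate(product(sexes, races), start=1)
}

# Coefficients copied from the svylc, show output.
coef = {
    1: 1.000215, 2: 0.9726418, 3: 1.000358,
    4: 0.9904237, 5: 0.9362548, 6: 1.056553,
}

# The linear combination tested by: svylc [1]1 - [1]4
diff = coef[1] - coef[4]
print(labels[1], "-", labels[4], "=", round(diff, 7))
```

The difference computed from the displayed coefficients agrees with the reported .0097916 to about six decimal places; the small discrepancy is consistent with the table showing rounded values.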
Saved Results
svylc saves in r():
Scalars
    r(est)         point estimate of linear combination
    r(se)          standard error (square root of design-based variance estimate)
    r(N_strata)    number of strata
    r(N_psu)       number of sampled PSUs
    r(deff)        deff
    r(deft)        deft
    r(meft)        meft
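Methods and Formulas
svylc is implemented as an ado-file.
svylc estimates $\eta = C\theta$, where $\theta$ is a $q \times 1$ vector of parameters (e.g., population means or population regression coefficients), and $C$ is any $1 \times q$ vector of constants. The estimate of $\eta$ is $\hat\eta = C\hat\theta$, and the estimate of its variance is

$$\widehat{V}(\hat\eta) = C\,\widehat{V}(\hat\theta)\,C'$$

Similarly, the simple-random-sampling variance estimator used in the computation of deff and deft is $\widehat{V}_{\mathrm{srs}}(\tilde\eta_{\mathrm{srs}}) = C\,\widehat{V}_{\mathrm{srs}}(\tilde\theta_{\mathrm{srs}})\,C'$. The variance estimator used in the computation of meff and meft is $\widehat{V}_{\mathrm{msp}}(\hat\eta_{\mathrm{msp}}) = C\,\widehat{V}_{\mathrm{msp}}(\hat\theta_{\mathrm{msp}})\,C'$. See the Methods and Formulas section of [R] svymean for details on the computation of deff, deft, meff, and meft.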
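To make the variance formula concrete, the Python sketch below applies it to the White minus Black contrast from the by(race) example, with C = (1, -1). The output does not print the covariance between the two subpopulation estimates, so the sketch backs it out from the reported standard error of the difference and then rebuilds C V C' from the resulting 2 x 2 matrix. This is an illustration of the algebra under that assumption, not a reproduction of how svylc computes the variance internally; all variable names are illustrative.

```python
import math

# Estimates and design-based standard errors for White and Black,
# copied from the by(race) example.
theta = [0.995116, 0.9525558]
se = [0.0116867, 0.0381059]
C = [1.0, -1.0]                      # contrast: White minus Black

# Reported by svylc for [1]White - [1]Black.
reported_est, reported_se = 0.0425602, 0.0439945

# Point estimate: eta_hat = C * theta_hat.
eta_hat = sum(c * t for c, t in zip(C, theta))

# For this contrast, C V C' = v11 + v22 - 2*v12, so the off-diagonal
# term can be backed out from the reported standard error (assumption:
# the covariance itself is not shown in the output).
v11, v22 = se[0] ** 2, se[1] ** 2
v12 = (v11 + v22 - reported_se ** 2) / 2

# Rebuild V and evaluate the quadratic form C V C' explicitly.
V = [[v11, v12], [v12, v22]]
var_eta = sum(C[i] * V[i][j] * C[j] for i in range(2) for j in range(2))

# Standard error one would get by wrongly treating the two estimates as independent.
naive_se = math.sqrt(v11 + v22)

print(round(eta_hat, 7), round(math.sqrt(var_eta), 7), round(naive_se, 7))
```

The implied covariance is negative, so ignoring it and adding the two variances would understate the standard error of the difference. That is the practical reason for using the full variance matrix, and not the individual standard errors, when testing linear combinations.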
References
Eltinge, J. L. and W. M. Sribney. 1996. svy5: Estimates of linear combinations and hypothesis tests for survey data. Stata Technical Bulletin 31: 31-42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246-259.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Also See
Complementary:  [R] svy estimators, [R] svydes, [R] svymean, [R] svyset, [R] svytest
Related:        [R] lincom
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 30 Overview of survey estimation,
                [R] svy
Title
svymean — Estimate means, totals, ratios, and proportions for survey data

Syntax
svymean varlist [weight] [if exp] [in range] [, common_options ]
svytotal varlist [weight] [if exp] [in range] [, common_options ]
svyratio varname [/] varname [varname [/] varname ...] [weight] [if exp] [in range] [, common_options ]
svyprop varlist [weight] [if exp] [in range] [, strata(varname) psu(varname) fpc(varname) by(varlist) subpop(varname) nolabel format(%fmt) ]

The common_options for svymean, svytotal, and svyratio are
    strata(varname) psu(varname) fpc(varname) by(varlist) subpop(varname) srssubpop nolabel { complete | available } level(#) ci deff deft meff meft obs size

svymean, svyratio, and svytotal typed without arguments redisplay previous results. Any of the following options can be used when redisplaying results:
    level(#) ci deff deft meff meft obs size

All these commands allow pweights and iweights; see [U] 14.1.6 weight.
Warning: Use of if