Stata Reference Su-Z Release 7


Author: Stata Press


Title

summarize — Summary statistics

Syntax

summarize [varlist] [weight] [if exp] [in range] [, [ detail | meanonly ] format ]

by ... : may be used with summarize; see [R] by.

aweights and fweights are allowed.

The varlist following summarize may contain time-series operators; see [U] 14.4.3 Time-series varlists.

Description

summarize calculates and displays a variety of univariate summary statistics. If no varlist is specified, then summary statistics are calculated for all the variables in the dataset. Also see [R] ci for calculating the standard error and confidence intervals of the mean.

Options

detail produces additional statistics, including skewness, kurtosis, the four smallest and largest values, and various percentiles.

meanonly, which is allowed only when detail is not specified, suppresses the display of results and calculation of the variance. Ado-file writers will find this useful for fast calls.

format requests that the summary statistics be displayed using the display formats associated with the variables, rather than the default g display format; see [U] 15.5 Formats: controlling how data are displayed.

Remarks

summarize can produce two different sets of summary statistics. Without the detail option, the number of nonmissing observations, the mean and standard deviation, and the minimum and maximum values are presented. With detail, the same information is presented along with the variance, skewness, and kurtosis; the four smallest and four largest values; and the 1st, 5th, 10th, 25th, 50th (median), 75th, 90th, 95th, and 99th percentiles.

Example

You have data containing information on various automobiles, among which is the variable mpg, the mileage rating. We can obtain a quick summary of the mpg variable by typing

. summarize mpg

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         mpg |      74     21.2973   5.785503         12         41


We see that we have 74 observations. The mean of mpg is 21.3 miles per gallon, and the standard deviation is 5.79. The minimum is 12 and the maximum is 41. If we had not specified the variable (or variables) we wanted to summarize, we would have obtained summary statistics on all the variables in the dataset:

. summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        make |       0
       price |      74    6165.257   2949.496       3291      15906
         mpg |      74     21.2973   5.785503         12         41
       rep78 |      69    3.405797   .9899323          1          5
      weight |      74    3019.459   777.1936       1760       4840
     foreign |      74    .2972973   .4601885          0          1

Notice that there are only 69 observations on rep78, so some of the observations are missing. There are no observations on make since it is a string variable.

Example

The detail option provides all the information of a normal summarize and more. The format of the output also differs:

. summarize mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of Wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. Dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005

As in the previous example, we see that the mean of mpg is 21.3 miles per gallon and that the standard deviation is 5.79. We also see the various percentiles. The median of mpg (the 50th percentile) is 20 miles per gallon. The 25th percentile is 18 and the 75th percentile is 25. When we performed summarize, we learned that the minimum and maximum were 12 and 41, respectively. We now see that the four smallest values in our dataset are 12, 12, 14, and 14. The four largest values are 34, 35, 35, and 41. The skewness of the distribution is 0.95, and the kurtosis is 3.98. (A normal distribution would have a skewness of 0 and a kurtosis of 3.) (Skewness is a measure of the lack of symmetry of a distribution. If the coefficient of skewness is 0, the distribution is symmetric. If the coefficient is negative, the median is usually greater than the mean and the distribution is said to be skewed left. If the coefficient is positive, the median is usually less than the mean and the distribution is said to be skewed right. Kurtosis (from the Greek kyrtosis meaning curvature) is a measure of peakedness of a distribution. The smaller the coefficient of kurtosis, the flatter the distribution. The normal distribution has a coefficient of kurtosis of 3 and provides a convenient benchmark.) (On a historical note, see Plackett (1958) for a history of the concept of the mean.)
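The skewness and kurtosis coefficients described above are ratios of moments about the mean. As a rough illustration outside Stata, the following Python sketch computes both for two small made-up samples; the function names and the data are assumptions chosen for this example and are not part of summarize.

```python
def central_moment(values, r):
    """r-th moment about the mean, using divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def skewness(values):
    """Coefficient of skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Coefficient of kurtosis: m4 / m2^2 (3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 2, 10]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tail), kurtosis(right_tail))
```

In the right-tailed sample the median (1) is below the mean (3) and the skewness coefficient is positive, which matches the rule of thumb given in the text.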


Example

summarize can usefully be combined with the by varlist: prefix. In our dataset we have a variable foreign that distinguishes foreign and domestic cars. We can obtain summaries of mpg and weight within each subgroup by typing

. by foreign: summarize mpg weight

-> foreign = Domestic

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         mpg |      52    19.82692   4.743297         12         34
      weight |      52    3317.115   695.3637       1800       4840

-> foreign = Foreign

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         mpg |      22    24.77273   6.611187         14         41
      weight |      22    2315.909   433.0035       1760       3420

Domestic cars in our dataset average 19.8 miles per gallon, whereas foreign cars average 24.8. Since by varlist: can be combined with summarize, it can also be combined with summarize, detail:

. by foreign: summarize mpg, detail

-> foreign = Domestic

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  52
25%         16.5             14       Sum of Wgt.          52

50%           19                      Mean           19.82692
                        Largest       Std. Dev.      4.743297
75%           22             28
90%           26             29       Variance       22.49887
95%           29             30       Skewness       .7712432
99%           34             34       Kurtosis       3.441459

-> foreign = Foreign

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           14             14
 5%           17             17
10%           17             17       Obs                  22
25%           21             18       Sum of Wgt.          22

50%         24.5                      Mean           24.77273
                        Largest       Std. Dev.      6.611187
75%           28             31
90%           35             35       Variance       43.70779
95%           35             35       Skewness        .657329
99%           41             41       Kurtosis        3.10734


Technical Note

summarize respects display formats if you specify the format option. When we type summarize price weight, we obtain

. summarize price weight

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       price |      74    6165.257   2949.496       3291      15906
      weight |      74    3019.459   777.1936       1760       4840

The display is accurate but is not as aesthetically pleasing as you may wish, particularly if you plan to use the output directly in published work. By placing formats on the variables, you can control how the table appears:

. format price weight %9.2fc
. summarize price weight, format

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       price |      74    6,165.26   2,949.50   3,291.00  15,906.00
      weight |      74    3,019.46     777.19   1,760.00   4,840.00

If you specify a weight (see [U] 14.1.6 weight), each observation is multiplied by the value of the weighting expression before the summary statistics are calculated, so that the weighting expression is interpreted as the discrete density of each observation.

Example

You have 1980 Census data on each of the 50 states. Included in your variables is medage, the median age of the population of each state. If you type summarize medage, you obtain unweighted statistics:

. summarize medage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      medage |      50       29.54   1.693445       24.2       34.7

Also among your variables is pop, the population in each state. Typing summarize medage [w=pop] produces population-weighted statistics:

. summarize medage [w=pop]
(analytic weights assumed)

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      medage |      50   225907472    30.11047    1.66933       24.2       34.7

The number listed under Weight is the sum of the weighting variable, pop. It indicates that there are roughly 226 million people in the U.S. The pop-weighted mean of medage is 30.11 (as compared with 29.54 for the unweighted statistic), and the weighted standard deviation is 1.67 (as compared with 1.69).
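To make the effect of weighting concrete, here is a minimal Python sketch (outside Stata) of a weighted mean and standard deviation in which the weights are first rescaled to sum to the number of observations, as the manual's Methods and Formulas section describes. The function name and the three-observation dataset are hypothetical and are not the Census data used above.

```python
import math


def weighted_summary(values, weights):
    """Return (sum of weights, weighted mean, weighted std. dev.).

    Weights are normalized to sum to n before use, so equal weights
    reproduce the ordinary unweighted statistics.
    """
    n = len(values)
    total_weight = sum(weights)
    w = [v * n / total_weight for v in weights]
    mean = sum(wi * xi for wi, xi in zip(w, values)) / n
    variance = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, values)) / (n - 1)
    return total_weight, mean, math.sqrt(variance)


# Hypothetical data: three regions with median ages and populations.
ages = [25.0, 30.0, 35.0]
populations = [1, 1, 2]

print(weighted_summary(ages, [1, 1, 1]))     # unweighted
print(weighted_summary(ages, populations))   # population-weighted
```

Because the third region carries half the total weight, the weighted mean moves from 30 toward 35, in the same way that the pop-weighted mean of medage moves away from the unweighted one.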


Example

You can obtain detailed summaries of weighted data as well. When you do this, all the statistics are weighted, including the percentiles.

. summarize medage [w=pop], detail
(analytic weights assumed)

                          Median age
-------------------------------------------------------------
      Percentiles      Smallest
 1%         27.1           24.2
 5%         27.7           26.1
10%         28.2           27.1       Obs                  50
25%         29.2           27.4       Sum of Wgt.   225907472

50%         29.9                      Mean           30.11047
                        Largest       Std. Dev.       1.66933
75%         30.9             32
90%         32.1           32.1       Variance       2.786661
95%         32.2           32.2       Skewness       .5281972
99%         34.7           34.7       Kurtosis       4.494223

Technical Note

You are writing a program and need to access the mean of a variable. The meanonly option provides for fast calls. For example, suppose your program reads as follows:

    program define mean
            summarize `1', meanonly
            display " mean = " r(mean)
    end

The result of executing this is

    . mean price
     mean = 6165.2568

Saved Results

summarize saves in r():

Scalars
    r(N)          number of observations
    r(mean)       mean
    r(skewness)   skewness (detail only)
    r(min)        minimum
    r(max)        maximum
    r(sum_w)      sum of the weights
    r(p1)         1st percentile (detail only)
    r(p5)         5th percentile (detail only)
    r(p10)        10th percentile (detail only)
    r(p25)        25th percentile (detail only)
    r(p50)        50th percentile (detail only)
    r(p75)        75th percentile (detail only)
    r(p90)        90th percentile (detail only)
    r(p95)        95th percentile (detail only)
    r(p99)        99th percentile (detail only)
    r(Var)        variance
    r(kurtosis)   kurtosis (detail only)
    r(sum)        sum of variable
    r(sd)         standard deviation


Methods and Formulas

Let $x$ denote the variable on which we want to calculate summary statistics, and let $x_i$, $i = 1, \ldots, n$, denote an individual observation on $x$. Let $v_i$ be the weight, and if no weight is specified, define $v_i = 1$ for all $i$.

Define $V$ as the sum of the weights:

$$V = \sum_{i=1}^{n} v_i$$

Define $w_i$ to be $v_i$ normalized to sum to $n$, $w_i = v_i(n/V)$.

The mean, $\bar{x}$, is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} w_i x_i$$

The variance, $s^2$, is defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} w_i (x_i - \bar{x})^2$$

The standard deviation, $s$, is defined as $\sqrt{s^2}$.

Define $m_r$ as the $r$th moment about the mean $\bar{x}$:

$$m_r = \frac{1}{n}\sum_{i=1}^{n} w_i (x_i - \bar{x})^r$$

The coefficient of skewness is then defined as $m_3 m_2^{-3/2}$. The coefficient of kurtosis is defined as $m_4 m_2^{-2}$.

Let $x_{(i)}$ refer to the $x$ in ascending order, and let $w_{(i)}$ refer to the corresponding weights of $x_{(i)}$. The four smallest values are $x_{(1)}$, $x_{(2)}$, $x_{(3)}$, and $x_{(4)}$. The four largest values are $x_{(n)}$, $x_{(n-1)}$, $x_{(n-2)}$, and $x_{(n-3)}$.

To obtain the $p$th percentile, which we will denote as $x_{[p]}$, let $P = np/100$. Let

$$W_{(i)} = \sum_{j=1}^{i} w_{(j)}$$

Find the first index $i$ such that $W_{(i)} > P$. The $p$th percentile is then

$$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\[1ex] x_{(i)} & \text{otherwise} \end{cases}$$
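The percentile rule above (accumulate the normalized weights in ascending order of x, stop at the first cumulative weight that exceeds P, and average two adjacent values on an exact tie) can be sketched in Python as follows. This is an illustrative reading of the rule, not Stata's own code; the function name, the floating-point tolerance, and the sample values are assumptions.

```python
import math


def percentile(values, p, weights=None):
    """p-th percentile using cumulative normalized weights."""
    n = len(values)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    pairs = sorted(zip(values, weights))       # ascending order of x
    target = n * p / 100.0                     # P = np/100
    running = 0.0                              # W_(i)
    for i, (x, v) in enumerate(pairs):
        previous = running                     # W_(i-1)
        running += v * n / total               # add normalized weight w_(i)
        # Treat values equal up to rounding as ties, not as "greater".
        if running > target and not math.isclose(running, target):
            if i > 0 and math.isclose(previous, target):
                return (pairs[i - 1][0] + x) / 2.0
            return x
    return pairs[-1][0]


print(percentile([1, 2, 3, 4], 50))                # tie: average of 2 and 3
print(percentile([12, 14, 14, 18, 20, 25, 29, 34, 41], 50))
print(percentile([10, 20, 30], 50, weights=[1, 1, 2]))
```

The tolerance check is a design choice for floating-point weights; with unit weights the comparison is exact.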

References

Gleason, J. R. 1997. sg67: Univariate summaries with boxplots. Stata Technical Bulletin 36: 23-25. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 179-183.

Gleason, J. R. 1999. sg67.1: Update to univar. Stata Technical Bulletin 51: 27-28. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 159-161.

Hamilton, L. C. 1996. Data Analysis for Social Scientists. Pacific Grove, CA: Brooks/Cole Publishing Company.

Plackett, R. L. 1958. The principle of the arithmetic mean. Biometrika 45: 130-135.

Stuart, A. and J. K. Ord. 1994. Kendall's Advanced Theory of Statistics, Vol. 1. 6th ed. London: Edward Arnold.

Weisberg, H. F. 1992. Central Tendency and Variability. Newbury Park, CA: Sage Publications.

Also See

Related: [R] centile, [R] cf, [R] ci, [R] codebook, [R] compare, [R] describe, [R] egen, [R] inspect, [R] lv, [R] means, [R] pctile, [R] st stsum, [R] svymean, [R] table, [R] tabstat, [R] tabsum, [R] xtsum

Title

sureg — Zellner's seemingly unrelated regression

Syntax

Basic syntax

sureg (depvar1 varlist1) (depvar2 varlist2) ... (depvarN varlistN) [weight] [if exp] [in range]

Full syntax

sureg ([eqname1:] depvar1a [depvar1b ... =] varlist1 [, noconstant ])
      ([eqname2:] depvar2a [depvar2b ... =] varlist2 [, noconstant ])
      ...
      ([eqnameN:] depvarNa [depvarNb ... =] varlistN [, noconstant ])
      [weight] [if exp] [in range] [, corr constraints(numlist) isure dfk dfk2 small noheader notable level(#) maximize_options ]

by ... : may be used with sureg; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

The depvars and the varlists may contain time-series operators; see [U] 14.4.3 Time-series varlists.

sureg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Explicit equation naming (eqname:) cannot be combined with multiple dependent variables in an equation specification.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, equation(eqno[,eqno]) xb stdp difference stddp residuals ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description sureg estimates seemingly unrelated regression models (Zellner 1962, Zellner and Huang 1962, Zellner 1963). The acronyms SURE and SUR are often used for the estimator.


Options

noconstant omits the constant term (intercept) from the equation on which the option is specified.

corr displays the correlation matrix of the residuals between equations and performs a Breusch-Pagan test for independent equations; i.e., a test that the disturbance covariance matrix is diagonal.

constraints(numlist) specifies by number the linear constraint(s) to be applied to the system. By default, sureg estimates an unconstrained system. See [R] reg3 for an example using constraints with a system estimator.

isure specifies that sureg should iterate over the estimated disturbance covariance matrix and parameter estimates until the parameter estimates converge. Under seemingly unrelated regression, this iteration converges to the maximum likelihood results. If this option is not specified, sureg produces two-step estimates.

dfk specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. As an asymptotically justified estimator, sureg by default uses the number of sample observations ($n$) as a divisor. When the dfk option is set, a small-sample adjustment is made and the divisor is taken to be $\sqrt{(n - k_i)(n - k_j)}$, where $k_i$ and $k_j$ are the numbers of parameters in equations $i$ and $j$, respectively.

dfk2 specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. When the dfk2 option is set, the divisor is taken to be the mean of the residual degrees of freedom from the individual equations. This was the default divisor for sureg before version 6.0.

small specifies that small-sample statistics are to be computed. It shifts the test statistics from chi-squared and Z statistics to F statistics and t statistics. While the standard errors from each equation are computed using the degrees of freedom for the equation, the degrees of freedom for the t statistics are all taken to be those for the first equation. Before version 6.0, sureg reported small-sample statistics.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
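The three divisors described for the residual covariance matrix (the default, dfk, and dfk2) can be compared with a short Python sketch. This is only an arithmetic illustration outside Stata; the function names are hypothetical, and counting the constant among each equation's parameters is an assumption.

```python
import math


def default_divisor(n):
    """Default: the number of sample observations."""
    return float(n)


def dfk_divisor(n, k_i, k_j):
    """dfk: sqrt((n - k_i)(n - k_j)) for the (i, j) element."""
    return math.sqrt((n - k_i) * (n - k_j))


def dfk2_divisor(n, params_per_equation):
    """dfk2: mean residual degrees of freedom across the equations."""
    dfs = [n - k for k in params_per_equation]
    return sum(dfs) / len(dfs)


# Hypothetical system: 74 observations, equations with 4 and 3 parameters.
n = 74
print(default_divisor(n))        # 74.0
print(dfk_divisor(n, 4, 3))      # between 70 and 71
print(dfk2_divisor(n, [4, 3]))   # 70.5
```

When every equation has the same number of parameters, dfk and dfk2 give the same divisor, n minus that number.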

Options for predict

equation(eqno[,eqno]) specifies the equation to which you are referring. You can refer to an equation by its number, as in equation(#1), or by its name: equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), the results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional.

xb, the default, calculates the fitted values: the prediction of $x_j b$ for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

difference calculates the difference between the linear predictions of two equations in the system. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions ($x_{1j}b - x_{2j}b$) between equations 1 and 2 is calculated.

residuals calculates the residuals.

For more information on using predict after multiple-equation estimation commands, see [R] predict.

Remarks Seemingly unrelated regression models are so called because they appear to be joint estimates of several regression models, each with its own error term. The regressions are related because the (contemporaneous) errors associated with the dependent variables may be correlated.

Example

When you estimate models with the same set of right-hand-side variables, the seemingly unrelated regression results (in terms of coefficients and standard errors) are the same as estimating the models separately (using, say, regress). The same is true when the models are nested. Even in such cases, sureg is useful when you want to perform joint tests. For instance, let us assume that you think

$$\text{price} = \beta_0 + \beta_1\,\text{foreign} + \beta_2\,\text{length} + u_1$$
$$\text{weight} = \gamma_0 + \gamma_1\,\text{foreign} + \gamma_2\,\text{length} + u_2$$

Since the models have the same set of explanatory variables, you could estimate the two equations separately. Yet, you might still choose to estimate them with sureg because you want to perform the joint test $\beta_1 = \gamma_1 = 0$.

We use the small and dfk options to obtain small-sample statistics comparable to regress or mvreg.

. sureg (price foreign length) (weight foreign length), small dfk

Seemingly unrelated regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"     F-Stat        P
----------------------------------------------------------------------
price              74      2    2474.593    0.3154   16.35382   0.0000
weight             74      2    250.2515    0.8992   316.5447   0.0000
----------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
     foreign |   2801.143    766.117     3.66   0.000     1286.674    4315.611
      length |   90.21239   15.83368     5.70   0.000     58.91219    121.5126
       _cons |  -11621.35   3124.436    -3.72   0.000    -17797.77    -5444.93
-------------+----------------------------------------------------------------
weight       |
     foreign |  -133.6775   77.47615    -1.73   0.087    -286.8332     19.4782
      length |   31.44455   1.601234    19.64   0.000     28.27921    34.60989
       _cons |   -2850.25   315.9691    -9.02   0.000    -3474.861   -2225.639
------------------------------------------------------------------------------

These two equations have a common set of regressors, and we could have used a shorthand syntax to specify the equations:

. sureg (price weight = foreign length), small dfk

In this case, the results presented by sureg are the same as if we had estimated the equations separately:

. regress price foreign length
(output omitted)
. regress weight foreign length
(output omitted)

There is, however, a difference. We have allowed $u_1$ and $u_2$ to be correlated and have estimated the full variance-covariance matrix of the coefficients. sureg has estimated the correlations, but it does not report them unless we specify the corr option. We did not remember to specify corr when we estimated the model, but we can redisplay the results:

. sureg, notable noheader corr

Correlation matrix of residuals:

          price  weight
 price   1.0000
weight   0.5840  1.0000

Breusch-Pagan test of independence: chi2(1) =    25.237, Pr = 0.0000

The notable and noheader options prevented sureg from redisplaying the header and coefficient tables. We find that, for the same cars, the correlation of the residuals in the price and weight equations is .5840 and that we can reject the hypothesis that this correlation is zero.

We can perform a test that the coefficients on foreign are jointly zero in both equations, as we set out to do, by typing test foreign; see [R] test. When we type a variable without specifying the equation, that variable is tested for zero in all equations in which it appears:

. test foreign

 ( 1)  [price]foreign = 0.0
 ( 2)  [weight]foreign = 0.0

       F(  2,    71) =   17.99
            Prob > F =    0.0000
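For a two-equation system, the Breusch-Pagan statistic reported by the corr option can be reproduced approximately from the residual correlation alone, assuming the usual form of the test: the number of observations times the squared residual correlation, referred to a chi-squared distribution with 1 degree of freedom. The Python sketch below (outside Stata; the function name is hypothetical) applies this to the values shown above.

```python
import math


def breusch_pagan_two_equations(n_obs, residual_corr):
    """Return (statistic, p-value) for a two-equation system.

    statistic = n * r^2; the p-value uses the chi-squared(1) upper
    tail, which equals erfc(sqrt(statistic / 2)).
    """
    statistic = n_obs * residual_corr ** 2
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# 74 cars and the residual correlation of .5840 reported above.
statistic, p_value = breusch_pagan_two_equations(74, 0.5840)
print(round(statistic, 3), p_value)
```

The result differs from the displayed 25.237 only in the third decimal place, because the correlation is shown rounded to four digits.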


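Example

When the models do not have the same set of explanatory variables and are not nested, sureg may lead to more efficient estimates than running the models separately as well as allowing joint tests. This time, let us assume you believe

$$\text{price} = \beta_0 + \beta_1\,\text{foreign} + \beta_2\,\text{mpg} + \beta_3\,\text{displ} + u_1$$
$$\text{weight} = \gamma_0 + \gamma_1\,\text{foreign} + \gamma_2\,\text{length} + u_2$$

To estimate this model, you type

. sureg (price foreign mpg displ) (weight foreign length), corr

Seemingly unrelated regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"       chi2        P
----------------------------------------------------------------------
price              74      3    2165.321    0.4537    49.6383   0.0000
weight             74      2    245.2916    0.8990   661.8418   0.0000
----------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
     foreign |    3058.25   685.7357     4.46   0.000     1714.233    4402.267
         mpg |  -104.9591   58.47209    -1.80   0.073    -219.5623    9.644042
displacement |   18.18098   4.286372     4.24   0.000     9.779842    26.58211
       _cons |   3904.336   1966.521     1.99   0.047      50.0263    7758.645
-------------+----------------------------------------------------------------
weight       |
     foreign |  -147.3481   75.44314    -1.95   0.051    -295.2139     .517755
      length |   30.94905   1.539895    20.10   0.000     27.93091    33.96718
       _cons |  -2753.064   303.9336    -9.06   0.000    -3348.763   -2157.365
------------------------------------------------------------------------------

Correlation matrix of residuals:

          price  weight
 price   1.0000
weight   0.3285  1.0000

Breusch-Pagan test of independence: chi2(1) =     7.984, Pr = 0.0047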

By way of comparison, had we estimated the price model separately:

. regress price foreign mpg displ

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   20.13
       Model |   294104790     3  98034929.9           Prob > F      =  0.0000
    Residual |   340960606    70  4870865.81           R-squared     =  0.4631
-------------+------------------------------           Adj R-squared =  0.4401
       Total |   635065396    73  8699525.97           Root MSE      =  2207.0

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   3545.484   712.7763     4.97   0.000     2123.897    4967.072
         mpg |  -98.88559   63.17063    -1.57   0.122    -224.8754    27.10426
displacement |   22.40416   4.634239     4.83   0.000     13.16146    31.64686
       _cons |    2796.91   2137.873     1.31   0.195    -1466.943    7060.763
------------------------------------------------------------------------------

The coefficients are slightly different but the standard errors are uniformly larger. This would still be true if we specified the dfk option to make a small-sample adjustment to the estimated covariance of the disturbances.


Technical Note

Constraints can be applied to SURE models using Stata's standard syntax for constraints. For a general discussion of constraints, see [R] constraint; for examples similar to seemingly unrelated regression models, see [R] reg3.

Saved Results

sureg saves in e():

Scalars
    e(N)        number of observations
    e(k)        number of parameters in system
    e(k_eq)     number of equations
    e(mss_#)    model sum of squares for equation #

svy estimators — Estimation commands for complex survey data

                                              Number of obs    =     10351
                                              Number of strata =        31
                                              Number of PSUs   =        62
                                              Population size  = 1.172e+08
                                              F(   6,     26)  =     87.70
                                              Prob > F         =    0.0000

      highbp |       t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------
      height |   -5.55    0.000     -.0445771   -.0206222
      weight |   15.35    0.000      .0425545    .0555936
         age |    7.38    0.000      .1115486    .1966815
        age2 |   -5.31    0.000     -.0014877   -.0006616
      female |   -4.03    0.000      -.537066    -.175928
       black |    2.43    0.021      .0555615    .6302986
       _cons |   -4.22    0.000     -7.259813   -2.531668

We can redisplay the results expressed as odds ratios.

. svylogit, or

Survey logistic regression

pweight:  finalwgt                            Number of obs    =     10351
Strata:   stratid                             Number of strata =        31
PSU:      psuid                               Number of PSUs   =        62
                                              Population size  = 1.172e+08
                                              F(   6,     26)  =     87.70
                                              Prob > F         =    0.0000

------------------------------------------------------------------------------
      highbp | Odds Ratio   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |    .967926   .0056843    -5.55   0.000     .9564019     .979589
      weight |   1.050298   .0033574    15.35   0.000     1.043473    1.057168
         age |   1.166625   .0243485     7.38   0.000     1.118008    1.217356
        age2 |    .998926   .0005023    -5.31   0.000     .9985135    .9993386
      female |   .7001246   .0619858    -4.03   0.000     .5844605    .8386784
       black |    1.40907   .1985388     2.43   0.021     1.057134    1.878171
------------------------------------------------------------------------------

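svylc can be used to estimate the sum of the coefficients for female and black.

. svylc female + black

 ( 1)  female + black = 0.0

------------------------------------------------------------------------------
      highbp |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.0135669   .1653936    -0.08   0.935    -.3508894    .3237555
------------------------------------------------------------------------------

This result is more easily interpreted as an odds ratio.

. svylc female + black, or

 ( 1)  female + black = 0.0

------------------------------------------------------------------------------
      highbp | Odds Ratio   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .9865247   .1631648    -0.08   0.935     .7040616    1.382309
------------------------------------------------------------------------------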

The odds ratio 0.987 is an estimate of the ratio of the odds of having high blood pressure for black females over the odds for our reference category of nonblack males (controlling for height, weight, and age). You now know enough to use svylc for odds ratios; see [R] svylc for its other uses; see [R] lincom for more examples of odds ratios.
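The odds-ratio display is a transformation of the coefficient display: the estimate and the confidence limits are exponentiated. The Python sketch below (outside Stata) reproduces the reported odds-ratio row from the coefficient row. Scaling the standard error by the odds ratio, a delta-method approximation, is an assumption of this sketch that happens to agree with the reported value; the function name is hypothetical.

```python
import math


def to_odds_ratio(coef, std_err, ci_lower, ci_upper):
    """Convert a logit-scale estimate to the odds-ratio scale."""
    odds_ratio = math.exp(coef)
    return (
        odds_ratio,
        odds_ratio * std_err,      # delta-method standard error
        math.exp(ci_lower),
        math.exp(ci_upper),
    )


# Coefficient row for female + black as reported by svylc.
odds_ratio, or_se, or_lower, or_upper = to_odds_ratio(
    -0.0135669, 0.1653936, -0.3508894, 0.3237555
)
print(odds_ratio, or_se, or_lower, or_upper)
```

The t statistic and its p-value are unchanged by the transformation, which is why both displays report -0.08 and 0.935.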

                                              Prob > F         =    0.0001

------------------------------------------------------------------------------
    highlead | Odds Ratio   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.014808   .0079836     1.87   0.071     .9986331    1.031244
      female |   .0269989   .0195109    -5.00   0.000     .0061714     .118115
------------------------------------------------------------------------------

Note that this time we specified the or option when we first issued the command.

The subpop(varname) option takes a 0/1 variable; the subpopulation of interest is defined by varname = 1. All other members of the sample not in the subpopulation are indicated by varname = 0. If a person's subpopulation status is unknown, then varname should be set to missing ('.'), and those observations will be omitted from the analysis, as they should be. For instance, in the preceding example, if a person's race is unknown, race should be coded as missing rather than as nonblack (race = 0).

Note that using 'if black==1' to model the subpopulation would not give the same result. All the discussion in the section Warning about the use of if and in in [R] svymean applies to the variance estimates for svyreg, svylogit, and svyprobit as well.

Technical Note

Actually, the subpop(varname) option takes a zero/nonzero variable; the subpopulation of interest is defined by varname ≠ 0 and not missing. All other members of the sample not in the subpopulation are indicated by varname = 0. But 0, 1, and missing are typically the only values used for the subpop() variable.

Example

In the NHANES II dataset, we have a variable health containing self-reported health status, which takes on the values 1-5, with 1 being "poor" and 5 being "excellent". Since this is an ordered categorical variable, it makes sense to model it using svyologit or svyoprobit. As predictors, we use basic demographic variables: female (1 if female, 0 if male), black (1 if black, 0 otherwise), age, and age2 (= age squared):


Saved Results

The svy estimation commands save in e():

Scalars
    e(N)          number of observations $m$
    e(N_strata)   number of strata $L$
    e(N_psu)      number of sampled PSUs $n$
    e(N_pop)      estimate of population size $\widehat{M}$
    e(N_subpop)   estimate of subpopulation size
    e(N_sub)      subpopulation number of observations
    e(df_m)       model degrees of freedom
    e(df_r)       variance degrees of freedom $= n - L$
    e(F)          model F statistic
    e(k_cat)      number of categories (svymlogit, svyologit, svyoprobit)
    e(basecat)    base category value of dependent variable (svymlogit)
    e(ibasecat)   base category number (svymlogit)

Macros
    e(cmd)        command name (e.g., svyreg)
    e(depvar)     dependent variable name
    e(wtype)      weight type
    e(wexp)       weight variable or expression
    e(strata)     strata() variable
    e(psu)        psu() variable
    e(fpc)        fpc() variable
    e(offset)     offset() variable
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers (svyintreg, svymlogit, svypois)

Matrices
    e(b)          vector of estimates $\widehat{\beta}$
    e(V)          design-based (co)variance estimates $\widehat{V}$
    e(V_srs)      simple-random-sampling-without-replacement (co)variance $\widehat{V}_{\text{srswor}}$
    e(V_srswr)    simple-random-sampling-with-replacement (co)variance $\widehat{V}_{\text{srswr}}$ (only created when fpc() option is specified)
    e(deff)       vector of deff estimates
    e(deft)       vector of deft estimates
    e(cat)        vector of category values (svymlogit, svyologit, svyoprobit)

Functions
    e(sample)     marks estimation sample

Methods and Formulas

All of the svy estimators are implemented as ado-files that call _robust; see [P] _robust.

These commands use a variant on the basic weighted-point-estimation methods used by svytotal. They use "linearization"-based variance estimators that are natural extensions of the variance estimator used in svytotal. For general methodological background on regression and generalized-linear-model analyses of complex survey data, see, for example, Binder (1983), Cochran (1977), Fuller (1975), Godambe (1991), Kish and Frankel (1974), Sarndal et al. (1992), Shao (1996), and Skinner (1989).

The notation and development presented below are adapted from Binder (1983). We use here the same notation as in the Methods and Formulas section of [R] svymean; that section should be read first.


Linear regression

We let $(h, i, j)$ index the elements in the population, where $h = 1, \ldots, L$ are the strata, $i = 1, \ldots, N_h$ are the PSUs in stratum $h$, and $j = 1, \ldots, M_{hi}$ are the elements in PSU $(h, i)$. The regression coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ are viewed as fixed finite-population parameters that we wish to estimate. These parameters are defined with respect to an outcome variable $Y_{hij}$ and a $(k+1)$-dimensional row vector of explanatory variables $X_{hij} = (X_{hij0}, \ldots, X_{hijk})$. As in nonsurvey work, we often have $X_{hij0}$ identically equal to unity, so that $\beta_0$ is an intercept coefficient. Within a finite-population context, we can formally define the regression coefficient vector $\beta$ as the solution to the vector estimating equation

$$G(\beta) = X'Y - X'X\beta = 0 \tag{1}$$

where $Y$ is the vector of outcomes for the full population and $X$ is the matrix of explanatory variables for the full population. Assuming $(X'X)^{-1}$ exists, the solution to (1) is $\beta = (X'X)^{-1}X'Y$.

Given observations $(y_{hij}, x_{hij})$ collected through a complex sample design, we need to estimate $\beta$ in a way that accounts for the sample design. To do this, note that the matrix factors $X'X$ and $X'Y$ can be viewed as matrix population totals. For example, $X'Y = \sum_{h=1}^{L}\sum_{i=1}^{N_h}\sum_{j=1}^{M_{hi}} X'_{hij} Y_{hij}$. Thus, we estimate $X'X$ and $X'Y$ with the weighted estimators

$$\widehat{X'X} = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, x'_{hij} x_{hij} = X_s' W X_s$$

and

$$\widehat{X'Y} = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, x'_{hij} y_{hij} = X_s' W Y_s$$

where $X_s$ is the matrix of explanatory variables for the sample, $Y_s$ is the outcome vector for the sample, and $W = \operatorname{diag}(w_{hij})$ is a diagonal matrix containing the sampling weights $w_{hij}$. The corresponding coefficient estimator is

$$\widehat{\beta} = (\widehat{X'X})^{-1}\,\widehat{X'Y} = (X_s' W X_s)^{-1} X_s' W Y_s \tag{2}$$

Note that equation (2) is what the regress command with aweights or iweights computes for point estimates.

The coefficient estimator $\widehat{\beta}$ can also be defined as the solution to the weighted sample estimating equation

$$\widehat{G}(\beta) = \widehat{X'Y} - \widehat{X'X}\beta = X_s' W Y_s - X_s' W X_s \beta = 0$$

We can write $\widehat{G}(\beta)$ as

$$\widehat{G}(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, d_{hij} \tag{3}$$

where $d_{hij} = x'_{hij} e_{hij}$ and $e_{hij} = y_{hij} - x_{hij}\beta$ is the regression residual associated with sample unit $(h, i, j)$. Thus, $\widehat{G}(\beta)$ can be viewed as a special case of a total estimator.

Our variance estimator for $\widehat{\beta}$ is based on the following "linearization" argument. A first-order Taylor expansion shows that

$$\widehat{\beta} - \beta \approx -\left\{\frac{\partial \widehat{G}(\beta)}{\partial \beta}\right\}^{-1} \widehat{G}(\beta)$$

Thus, our variance estimator for $\widehat{\beta}$ is

$$\widehat{V}(\widehat{\beta}) = \left[\left\{\frac{\partial \widehat{G}(\beta)}{\partial \beta}\right\}^{-1} \widehat{V}\{\widehat{G}(\beta)\} \left\{\frac{\partial \widehat{G}(\beta)}{\partial \beta}\right\}^{-T}\right]_{\beta=\widehat{\beta}} = (X_s' W X_s)^{-1}\, \widehat{V}\{\widehat{G}(\beta)\}\big|_{\beta=\widehat{\beta}}\, (X_s' W X_s)^{-1}$$

Viewing $\widehat{G}(\beta)$ as a total estimator according to equation (3), the variance estimator $\widehat{V}\{\widehat{G}(\beta)\}|_{\beta=\widehat{\beta}}$ can be computed using equation (3) from [R] svymean with $y_{hij}$ replaced by $d_{hij}$ and with $\widehat{\beta}$ used to estimate $e_{hij}$.

Pseudo-maximum-likelihood estimators

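To develop notation for our pseudo-maximum-likelihood estimators, suppose that we observed $(Y_{hij}, X_{hij})$ for the entire population, and that $(Y_{hij}, X_{hij})$ arose from a certain likelihood model (e.g., a logistic distribution). Let $l(\beta; Y_{hij}, X_{hij})$ be the associated "log likelihood" under this model. Then, for our finite population, we define the parameter $\beta$ by the vector estimating equation

$$G(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{N_h}\sum_{j=1}^{M_{hi}} S(\beta; Y_{hij}, X_{hij}) = 0$$

where $S = \partial l/\partial \beta$ is the score vector; i.e., the first derivative with respect to $\beta$ of $l(\beta; Y_{hij}, X_{hij})$. Then, the "pseudo-maximum-likelihood" estimator $\widehat{\beta}$ is the solution to the weighted sample estimating equation

$$\widehat{G}(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, S(\beta; y_{hij}, x_{hij}) = 0 \tag{4}$$

Note that the solution $\widehat{\beta}$ of equation (4) is what the nonsurvey version of the command with iweights produces for point estimates.

Again, we use a first-order matrix Taylor series expansion to produce the variance estimator for $\widehat{\beta}$:

$$\widehat{V}(\widehat{\beta}) = \left[\left\{\frac{\partial \widehat{G}(\beta)}{\partial \beta}\right\}^{-1} \widehat{V}\{\widehat{G}(\beta)\} \left\{\frac{\partial \widehat{G}(\beta)}{\partial \beta}\right\}^{-T}\right]_{\beta=\widehat{\beta}} = H^{-1}\, \widehat{V}\{\widehat{G}(\beta)\}\big|_{\beta=\widehat{\beta}}\, H^{-1}$$

where $H$ is the Hessian for the weighted sample log-likelihood. We can write $\widehat{G}(\beta)$ as

$$\widehat{G}(\beta) = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, d_{hij}$$

where $d_{hij} = s_{hij}\, x'_{hij}$ and $s_{hij}$ is the score index for element $(h, i, j)$. The term $s_{hij}$ is computed by rewriting the sample log-likelihood $l(\beta; y_{hij}, x_{hij})$ as a function of $x_{hij}\beta$:

$$s_{hij} = \frac{\partial\, l(x_{hij}\beta;\, y_{hij})}{\partial (x_{hij}\beta)}$$

Thus, again, $\widehat{G}(\beta)$ can be viewed as a special case of a total estimator, and the variance estimator $\widehat{V}\{\widehat{G}(\beta)\}|_{\beta=\widehat{\beta}}$ is computed using equation (3) from [R] svymean with $y_{hij}$ replaced by $d_{hij}$ and with $\widehat{\beta}$ used to estimate $s_{hij}$.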

Acknowledgments

The svyreg, svylogit, and svyprobit commands were developed in collaboration with John L. Eltinge, Department of Statistics, Texas A&M University. We thank him for his invaluable assistance. We thank Wayne Johnson of the National Center for Health Statistics for providing the NHANES II dataset.

References

Binder, D. A. 1983. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 51: 279-292.
Cochran, W. G. 1977. Sampling Techniques. 3d ed. New York: John Wiley & Sons.
Eltinge, J. L. and W. M. Sribney. 1996. svy4: Linear, logistic, and probit regressions for survey data. Stata Technical Bulletin 31: 26-31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 239-245.
Fuller, W. A. 1975. Regression analysis for sample survey. Sankhya, Series C 37: 117-132.
Godambe, V. P. ed. 1991. Estimating Functions. Oxford: Clarendon Press.
Gonzalez, J. F., Jr., N. Krauss, and C. Scott. 1992. Estimation in the 1988 National Maternal and Infant Health Survey. In Proceedings of the Section on Statistics Education, American Statistical Association, 343-348.
Johnson, W. 1995. Variance estimation for the NMIHS. Technical document. National Center for Health Statistics, Hyattsville, MD.
Kish, L. and M. R. Frankel. 1974. Inference from complex samples. Journal of the Royal Statistical Society B 36: 1-37.
Korn, E. L. and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician 44: 270-276.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.
Sarndal, C.-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. New York: Springer-Verlag.
Shao, J. 1996. Resampling methods for sample surveys (with discussion). Statistics 27: 203-254.
Skinner, C. J. 1989. Introduction to Part A. In Analysis of Complex Surveys, ed. C. J. Skinner, D. Holt, and T. M. F. Smith, 23-58. New York: John Wiley & Sons.

Also See

Complementary: [R] adjust, [R] constraint, [R] mfx, [R] svydes, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest, [R] testnl

Related: [P] _robust

Background: [U] 30 Overview of survey estimation, [R] svy

Title svydes — Describe survey data

Syntax

    svydes [varlist] [weight] [if exp] [in range] [, strata(varname) psu(varname) fpc(varname) bypsu ]

pweights and iweights are allowed; see [U] 14.1.6 weight.

Description

svydes displays a table that describes the strata and the primary sampling units for sample survey data.

Options

strata(varname) specifies the name of a variable (numeric or string) that contains stratum identifiers. strata() can also be specified with the svyset command; see [R] svyset.

psu(varname) specifies the name of a variable (numeric or string) that contains identifiers for the primary sampling unit (i.e., the cluster). psu() can also be specified with the svyset command.

fpc(varname) can be set here or with the svyset command. If an fpc variable has been specified, svydes checks the fpc variable for missing values. Other than this, svydes does not use the fpc variable. See [R] svymean for details on fpc.

bypsu specifies that results be displayed for each PSU in the dataset; that is, a separate line of output is produced for every PSU. This option can only be used when a PSU variable has been specified using the psu() option or set with svyset.

Note: Weights are checked for missing values, but are not otherwise used by svydes.

Remarks

Sample-survey data are typically stratified. Within each stratum, there are primary sampling units (PSUs), which may be either clusters of observations or individual observations.

svydes displays a table that describes the strata and PSUs in the dataset. By default, one row of the table is produced for each stratum. Displayed for each stratum are the number of PSUs, the range and mean of the number of observations per PSU, and the total number of observations. If the bypsu option is specified, svydes will display the number of observations in each PSU for every PSU in the dataset.

If a varlist is specified, svydes will report the number of PSUs that contain at least one observation with complete data (i.e., no missing values) for all variables in the varlist. These are precisely the PSUs that would be used to compute estimates for the variables in varlist using the svy estimation commands: svymean, svytotal, svyratio, svyprop, svytab, or any of the commands described in [R] svy estimators.

The variance estimation formulas for the svy estimation commands require at least two PSUs per stratum. If there are some strata with only a single PSU, an error message is displayed:

    . svymean x
    stratum with only one PSU detected
    r(460);
    . svydes x

The stratum (or strata) with only one PSU can be located from the table produced by svydes x. After locating this stratum, it can be "collapsed" into an adjacent stratum, and then variance estimates can be computed. See the following examples for an illustration of the procedure. For details on the svy estimation commands, see [R] svymean and [R] svy estimators.
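The check that svydes performs when given a varlist can be sketched outside Stata as follows. This is only an illustration of the counting logic described above, not svydes itself; the record layout (dictionaries with keys `stratum` and `psu`, and `None` for a missing value) is an assumption made for the example.

```python
from collections import defaultdict


def psus_per_stratum(rows, var):
    """Count, per stratum, the PSUs with at least one nonmissing value of var.

    rows is an iterable of dicts with hypothetical keys 'stratum' and 'psu'
    plus the analysis variable; None stands in for a missing value.
    """
    usable = defaultdict(set)
    strata = set()
    for row in rows:
        strata.add(row["stratum"])
        if row[var] is not None:
            usable[row["stratum"]].add(row["psu"])
    return {s: len(usable[s]) for s in sorted(strata)}


def singleton_strata(rows, var):
    """Strata that would trigger the 'only one PSU' error for var."""
    return [s for s, n in psus_per_stratum(rows, var).items() if n < 2]


toy = [
    {"stratum": 1, "psu": 1, "y": None},   # PSU with no usable data
    {"stratum": 1, "psu": 2, "y": 48.0},
    {"stratum": 2, "psu": 1, "y": 51.0},
    {"stratum": 2, "psu": 2, "y": 47.5},
    {"stratum": 2, "psu": 2, "y": None},
]
counts = psus_per_stratum(toy, "y")
problem = singleton_strata(toy, "y")
print(counts, problem)
```

Any stratum returned by `singleton_strata` is a candidate for collapsing into a neighboring stratum.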

Example

We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. First, we set the strata, psu, and pweight variables.

    . svyset strata stratid
    . svyset psu psuid
    . svyset pweight finalwgt

Typing svydes will show us the strata and PSU arrangement of the dataset.

    . svydes
    pweight:  finalwgt
    Strata:   stratid
    PSU:      psuid
                                      #Obs per PSU
    Strata                       ----------------------
    stratid    #PSUs     #Obs     min     mean      max
    -------   ------   ------   -----   ------   ------
          1        2      380     165    190.0      215
          2        2      185      67     92.5      118
          3        2      348     149    174.0      199
          4        2      460     229    230.0      231
          5        2      252     105    126.0      147
          6        2      298     131    149.0      167
          7        2      476     206    238.0      270
          8        2      338     158    169.0      180
          9        2      244     100    122.0      144
         10        2      262     119    131.0      143
         11        2      275     120    137.5      155
         12        2      314     144    157.0      170
         13        2      342     154    171.0      188
         14        2      405     200    202.5      205
         15        2      380     189    190.0      191
         16        2      336     159    168.0      177
         17        2      393     180    196.5      213
         18        2      359     144    179.5      215
         20        2      285     125    142.5      160
         21        2      214     102    107.0      112
         22        2      301     128    150.5      173
         23        2      341     159    170.5      182
         24        2      438     205    219.0      233
         25        2      256     116    128.0      140
         26        2      261     129    130.5      132
         27        2      283     139    141.5      144
         28        2      299     136    149.5      163
         29        2      503     215    251.5      288
         30        2      365     166    182.5      199
         31        2      308     143    154.0      165
         32        2      450     211    225.0      239
    -------   ------   ------   -----   ------   ------
         31       62    10351      67    167.0      288

Our NHANES II dataset has 31 strata (stratum 19 is missing) and 2 PSUs per stratum.

The variable hdresult contains serum levels of high-density lipoproteins (HDL). If we try to estimate the mean of hdresult, we get an error.

    . svymean hdresult
    stratum with only one PSU detected
    r(460);

Running svydes with hdresult as its varlist will show us which stratum or strata have only one PSU.

    . svydes hdresult
    pweight:  finalwgt
    Strata:   stratid
    PSU:      psuid
                                    #Obs with  #Obs with    #Obs per included PSU
    Strata     #PSUs      #PSUs     complete    missing    -----------------------
    stratid   included   omitted      data        data       min     mean      max
    -------   --------   -------   ---------  ---------    -----   ------   ------
          1       1*         1         114        266        114    114.0      114
          2       1*         1          98         87         98     98.0       98
          3       2          0         277         71        116    138.5      161
          4       2          0         340        120        160    170.0      180
          5       2          0         173         79         81     86.5       92
          6       2          0         255         43        116    127.5      139
          7       2          0         409         67        191    204.5      218
          8       2          0         299         39        129    149.5      170
          9       2          0         218         26         85    109.0      133
         10       2          0         233         29        103    116.5      130
         11       2          0         238         37         97    119.0      141
         12       2          0         275         39        121    137.5      154
         13       2          0         297         45        123    148.5      174
         14       2          0         355         50        167    177.5      188
         15       2          0         329         51        151    164.5      178
         16       2          0         280         56        134    140.0      146
         17       2          0         352         41        155    176.0      197
         18       2          0         335         24        135    167.5      200
         20       2          0         240         45         95    120.0      145
         21       2          0         198         16         91     99.0      107
         22       2          0         263         38        116    131.5      147
         23       2          0         304         37        143    152.0      161
         24       2          0         388         50        182    194.0      206
         25       2          0         239         17        106    119.5      133
         26       2          0         240         21        119    120.0      121
         27       2          0         259         24        127    129.5      132
         28       2          0         284         15        131    142.0      153
         29       2          0         440         63        193    220.0      247
         30       2          0         326         39        147    163.0      179
         31       2          0         279         29        121    139.5      158
         32       2          0         383         67        180    191.5      203
    -------   --------   -------   ---------  ---------    -----   ------   ------
         31      60          2        8720       1631         81    145.3      247
                                   --------------------
                                          10351

Both stratid = 1 and stratid = 2 have only one PSU with nonmissing values of hdresult. Since this dataset has only 62 PSUs, the bypsu option will give a manageable amount of output:

    . svydes hdresult, bypsu
    pweight:  finalwgt
    Strata:   stratid
    PSU:      psuid
                        #Obs with   #Obs with
    Strata      PSU     complete     missing
    stratid    psuid      data         data
    -------   ------   ---------   ---------
          1        1          0         215
          1        2        114          51
          2        1         98          20
          2        2          0          67
          3        1        161          38
          3        2        116          33
      (output omitted)
         32        1        180          59
         32        2        203           8
    -------   ------   ---------   ---------
         31       62       8720        1631
                       ---------------------
                               10351

It is rather striking that there are two PSUs without any values for hdresult. All other PSUs have only a moderate number of missing values. Obviously, in a case such as this, a data analyst should first try to ascertain the reason why these data are missing. The answer here (Johnson 1995) is that HDL measurements could not be collected until the third survey location. Thus, there are no hdresult data for the first two locations: stratid = 1, psuid = 1 and stratid = 2, psuid = 2.

Assuming that we wish to go ahead and analyze the hdresult data, we must "collapse" strata (that is, merge them together) so that every stratum has at least two PSUs with some nonmissing values. We can accomplish this by collapsing stratid = 1 into stratid = 2. To perform the stratum collapse, we create a new strata identifier newstr and a new PSU identifier newpsu. This is easy to do using basic commands in Stata.

    . gen newstr = stratid
    . gen newpsu = psuid
    . replace newpsu = psuid + 2 if stratid==1
    (380 real changes made)
    . replace newstr = 2 if stratid==1
    (380 real changes made)

We set the new strata and PSU variables.

    . svyset strata newstr
    . svyset psu newpsu

We use svydes to check what we have done.


    . svydes hdresult, bypsu
    pweight:  finalwgt
    Strata:   newstr
    PSU:      newpsu
                        #Obs with   #Obs with
    Strata      PSU     complete     missing
    newstr    newpsu      data         data
    -------   ------   ---------   ---------
          2        1         98          20
          2        2          0          67
          2        3          0         215
          2        4        114          51
          3        1        161          38
          3        2        116          33
      (output omitted)
         32        1        180          59
         32        2        203           8
    -------   ------   ---------   ---------
         30       62       8720        1631
                       ---------------------
                               10351

The new stratum, newstr = 2, has 4 PSUs, 2 of which contain some nonmissing values of hdresult. This is sufficient to allow us to estimate the mean of hdresult.

    . svymean hdresult
    Survey mean estimation
    pweight:  finalwgt                          Number of obs    =      8720
    Strata:   newstr                            Number of strata =        30
    PSU:      newpsu                            Number of PSUs   =        60
                                                Population size  =  98725345
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
    ---------+--------------------------------------------------------------
    hdresult |   49.67141    .3830147    48.88919    50.45364    6.257131
    ------------------------------------------------------------------------
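The relabeling done by the gen and replace commands in this example can be restated as a small function. The sketch below is in Python only to make the mapping explicit; the function name `collapse` and its default arguments are assumptions chosen to mirror the example (stratum 1 merged into stratum 2, with PSU identifiers shifted by 2 so they stay unique within the merged stratum).

```python
def collapse(stratum, psu, source=1, target=2, offset=2):
    """Return (newstr, newpsu) after merging stratum `source` into `target`.

    PSUs from the source stratum are renumbered by adding `offset` so that
    they do not collide with the PSU identifiers already used in `target`.
    """
    if stratum == source:
        return target, psu + offset
    return stratum, psu


# Every (stratid, psuid) pair of the first three strata in the example.
design = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]
relabeled = [collapse(s, p) for s, p in design]
print(relabeled)
```

After relabeling, the merged stratum holds four distinct PSUs, which agrees with the svydes output shown for newstr = 2.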

Methods and Formulas

svydes is implemented as an ado-file.

References

Eltinge, J. L. and W. M. Sribney. 1996. svy3: Describing survey data: sampling design and missing data. Stata Technical Bulletin 31: 23-26. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 235-239.
Johnson, C. L. 1995. Personal communication.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.

Also See

Complementary: [R] svy estimators, [R] svylc, [R] svymean, [R] svyset, [R] svytab, [R] svytest

Background:

[U] 30 Overview of survey estimation, [R] svy

Title svylc — Estimate linear combinations after survey estimation

Syntax

    svylc [exp] [, show or irr rrr eform level(#) deff deft meff meft ]

Description

svylc produces estimates for linear combinations of parameters after a svy estimation command; i.e., any of the commands svymean, svytotal, svyratio, or the commands described in [R] svy estimators. Estimating differences of subpopulation means, for example, can be done by running svymean with a by() option, and then running svylc.

The svylc command computes estimates of linear combinations of parameters, whether means, totals, ratios, proportions, or regression coefficients. Used after svylogit, it will compute odds ratios for any covariate group relative to another. After svymlogit, it can compute relative risk ratios, and after svypois, it can compute incidence rate ratios. svylc is the equivalent of lincom for survey data. See [R] lincom for a thorough coverage of odds ratios.

Options

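Example

We use data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981) as our example. Suppose that we wish to estimate the difference of the means of systolic (variable bpsystol) and diastolic (variable bpdiast) blood pressures. First, we estimate the means, and then we use svylc.

    . svymean bpsystol bpdiast
    Survey mean estimation
    pweight:  finalwgt                          Number of obs    =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
    ---------+--------------------------------------------------------------
    bpsystol |   126.9458     .603462     125.715    128.1766    8.230475
     bpdiast |   81.01726    .5090314    79.97909    82.05544    16.38656
    ------------------------------------------------------------------------

    . svylc bpsystol - bpdiast
     ( 1)  bpsystol - bpdiast = 0.0
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
         (1) |   45.92852    .2988395   153.69   0.000    45.31903    46.53801
    ------------------------------------------------------------------------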

We can also specify any of the options deff, deft, meff, or meft, or change the confidence level (i.e., nominal coverage rate) of the confidence interval.

    . svylc bpsystol - bpdiast, level(90) deff meff
     ( 1)  bpsystol - bpdiast = 0.0
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.       t    P>|t|   [90% Conf. Interval]
    ---------+--------------------------------------------------------------
         (1) |   45.92852    .2988395   153.69   0.000    45.42183    46.43521
    ------------------------------------------------------------------------
        Mean |       Deff        Meff
    ---------+------------------------
         (1) |   S.835532    3.087148
    ----------------------------------

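svylc works in the same manner after using the subpop option.

    . svymean bpsystol bpdiast, subpop(female)
    Survey mean estimation
    pweight:  finalwgt                          Number of obs    =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
    Subpop.:  female==1                         Population size  = 1.172e+08
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
    ---------+--------------------------------------------------------------
    bpsystol |   124.2027    .7051858    122.7644    125.6409    5.162487
     bpdiast |   79.03227    .5207306    77.97023    80.09431    8.973799
    ------------------------------------------------------------------------

    . svylc bpsystol - bpdiast
     ( 1)  bpsystol - bpdiast = 0.0
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
         (1) |   45.17039    .4040852   111.78   0.000    44.34625    45.99453
    ------------------------------------------------------------------------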

Missing data: The complete and available options

The svymean, svytotal, and svyratio commands can handle missing data in two ways; see [R] svymean. The available option (which is the default when there are missing values and two or more variables) uses every available nonmissing value for each variable separately. The complete option (which is the default when there are no missing values or only one variable) uses only those observations with nonmissing values for all variables in the varlist. Here is an example where available is the default:


Example

    . svymean tcresult tgresult
    Survey mean estimation
    pweight:  finalwgt                          Number of obs(*) =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
    ---------+--------------------------------------------------------------
    tcresult |   213.0977    1.127252    210.7986    215.3967    5.602499
    tgresult |    138.576    2.071934    134.3503    142.8018    2.356968
    ------------------------------------------------------------------------
    (*) Some variables contain missing values.

We redisplay the results using the obs option to see how many observations were used for each estimate.

    . svymean, obs
    Survey mean estimation
    pweight:  finalwgt                          Number of obs(*) =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
    ------------------------------------------------
        Mean |   Estimate    Std. Err.        Obs
    ---------+--------------------------------------
    tcresult |   213.0977    1.127252      10351
    tgresult |    138.576    2.071934       5050
    ------------------------------------------------
    (*) Some variables contain missing values.

Because we estimated the mean of tgresult using a different set of observations than tcresult, we could not compute the covariance between the two, and hence, we cannot estimate the variance of the difference. So if we now ask svylc to estimate this difference, we get an error message.

    . svylc tcresult - tgresult
    must run svy command with "complete" option before using this command
    r(301);

        P>|t|     [95% Conf. Interval]
        0.000     68.66129    76.98163
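The difference between the two missing-data rules can be made concrete with a toy calculation. The sketch below ignores sampling weights and the survey design entirely (an intentional simplification), and the helper names are assumptions for illustration; it only shows which observations each rule would use.

```python
def mean_available(values):
    """Mean of one variable using every nonmissing value (None = missing)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present), len(present)


def means_complete(rows):
    """Means of all variables using only rows with no missing values."""
    complete = [r for r in rows if all(v is not None for v in r)]
    n = len(complete)
    k = len(rows[0])
    return [sum(r[j] for r in complete) / n for j in range(k)], n


# Hypothetical (tcresult, tgresult) pairs; the second variable is often missing.
rows = [(200.0, 100.0), (220.0, None), (180.0, 140.0), (240.0, None)]

avail_tc = mean_available([r[0] for r in rows])
avail_tg = mean_available([r[1] for r in rows])
comp_means, comp_n = means_complete(rows)
print(avail_tc, avail_tg, comp_means, comp_n)
```

Under the available rule the two means rest on different sets of rows (4 and 2 here), so no joint covariance is defined; under the complete rule both means use the same 2 rows, which is what a linear combination of the two requires.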

The use of svylc after svy model estimators

svylc can be used after any of the commands described in [R] svy estimators to estimate a linear combination of the coefficients (i.e., the $\beta$'s). Using svylc after svyreg is straightforward:

Example

    . svyreg tcresult bpsystol bpdiast age age2
    Survey linear regression
    pweight:  finalwgt                          Number of obs    =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
                                                F(   4,     28)  =    307.00
                                                Prob > F         =    0.0000
                                                R-squared        =    0.1945
    ------------------------------------------------------------------------------
    tcresult |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    bpsystol |   .1060743   .0346796     3.06   0.005     .0353449    .1768038
     bpdiast |   .2966662   .0569594     5.21   0.000     .1804969    .4128356
         age |    3.35711   .2099842    15.99   0.000     2.928844    3.785375
        age2 |  -.0247207   .0020795   -11.89   0.000    -.0289619   -.0204796
       _cons |    83.8242   5.649261    14.84   0.000     72.30246    95.34594
    ------------------------------------------------------------------------------

    . svylc bpsystol - bpdiast
     ( 1)  bpsystol - bpdiast = 0.0
    ------------------------------------------------------------------------------
    tcresult |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |  -.1905919   .0818056    -2.33   0.027    -.3574354   -.0237483
    ------------------------------------------------------------------------------

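    . svylogit highbp height weight age age2 female black
    Survey logistic regression
    pweight:  finalwgt                          Number of obs    =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
                                                F(   6,     26)  =     87.70
                                                Prob > F         =    0.0000
    ------------------------------------------------------------------------------
      highbp |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      height |  -.0325996   .0058727    -5.55   0.000    -.0445771   -.0206222
      weight |    .049074   .0031966    15.35   0.000     .0425545    .0555936
         age |   .1541151   .0208709     7.38   0.000     .1115486    .1966815
        age2 |  -.0010746   .0002025    -5.31   0.000    -.0014877   -.0006616
      female |   -.356497   .0885354    -4.03   0.000     -.537066    -.175928
       black |   .3429301   .1409005     2.43   0.021     .0555615    .6302986
       _cons |   -4.89574   1.159135    -4.22   0.000    -7.259813   -2.531668
    ------------------------------------------------------------------------------

We can redisplay the results expressed as odds ratios.

    . svylogit, or
    Survey logistic regression
    pweight:  finalwgt                          Number of obs    =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
                                                F(   6,     26)  =     87.70
                                                Prob > F         =    0.0000
    ------------------------------------------------------------------------------
      highbp | Odds Ratio   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      height |    .967926   .0056843    -5.55   0.000     .9564019     .979589
      weight |   1.050298   .0033574    15.35   0.000     1.043473    1.057168
         age |   1.166625   .0243485     7.38   0.000     1.118008    1.217356
        age2 |    .998926   .0002023    -5.31   0.000     .9985135    .9993386
      female |   .7001246   .0619858    -4.03   0.000     .5844605    .8386784
       black |    1.40907   .1985388     2.43   0.021     1.057134    1.878171
    ------------------------------------------------------------------------------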

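svylc can be used to estimate the sum of the coefficients for female and black.

    . svylc female + black
     ( 1)  female + black = 0.0
    ------------------------------------------------------------------------------
      highbp |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |  -.0135669   .1653936    -0.08   0.935    -.3508894    .3237555
    ------------------------------------------------------------------------------

This result is more easily interpreted as an odds ratio.

    . svylc female + black, or
     ( 1)  female + black = 0.0
    ------------------------------------------------------------------------------
      highbp | Odds Ratio   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |   .9865247   .1631648    -0.08   0.935     .7040616    1.382309
    ------------------------------------------------------------------------------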

The odds ratio 0.987 is an estimate of the ratio of the odds of having high blood pressure for black females over the odds for our reference category of nonblack males (controlling for height, weight, and age). See [R] lincom for more examples of odds ratios.
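The odds-ratio line can be checked by hand from the coefficient line. The sketch below exponentiates the estimate and the interval endpoints and applies a delta-method standard error (odds ratio times the coefficient's standard error). Using the delta method is an assumption on our part, although it does reproduce the displayed value.

```python
import math

# Values copied from the "svylc female + black" output.
coef, se = -0.0135669, 0.1653936
lower, upper = -0.3508894, 0.3237555

odds_ratio = math.exp(coef)                    # point estimate on the odds scale
or_se = odds_ratio * se                        # delta-method standard error
or_ci = (math.exp(lower), math.exp(upper))     # endpoints are transformed directly

print(round(odds_ratio, 7), round(or_se, 7), or_ci)
```

Note that the t statistic and p-value are unchanged by the transformation, since they are computed on the coefficient scale.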

Using svylc after estimators that estimate multiple-equation models is a little trickier, but still straightforward. Users merely need to refer to the coefficients using the syntax for multiple equations; see [U] 16.5 Accessing coefficients and standard errors for a description and [R] test for examples of its use.

Example

In the NHANES II data, we have a variable health containing self-reported health status, which takes on the values 1-5, with 1 being "poor" and 5 being "excellent". Since this is an ordered categorical variable, it makes sense to model it using svyologit or svyoprobit. We will do so in the next example, but we will first use svymlogit since it is a good example of a multiple-equation estimator at its simplest. So, we estimate a multinomial logistic regression model:

    . svymlogit health female black age age2
    Survey multinomial logistic regression
    pweight:  finalwgt                          Number of obs    =     10335
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.170e+08
                                                F                =     36.41
                                                Prob > F         =    0.0000
    ------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    poor     |
      female |  -.1983735   .1072747    -1.85   0.074    -.4171617    .0204147
       black |   .8964694   .1797728     4.99   0.000     .5298203    1.263119
         age |   .0990246    .032111     3.08   0.004     .0335338    .1645155
        age2 |  -.0004749   .0003209    -1.48   0.149    -.0011294    .0001796
       _cons |  -5.475074   .7468576    -7.33   0.000      -6.9983   -3.951848
    ---------+--------------------------------------------------------------------
    fair     |
      female |   .1782371   .0726556     2.45   0.020      .030055    .3264193
       black |   .4429445    .122667     3.61   0.001     .1927635    .6931256
         age |   .0024576   .0172236     0.14   0.887    -.0326702    .0375853
        age2 |   .0002875   .0001684     1.71   0.098    -.0000559     .000631
       _cons |  -1.819561   .4018153    -4.53   0.000    -2.639069   -1.000053
    ---------+--------------------------------------------------------------------
    good     |
      female |  -.0458251    .074169    -0.62   0.541    -.1970938    .1054437
       black |  -.7532011   .1105444    -6.81   0.000    -.9786579   -.5277443
         age |   -.061369    .009794    -6.27   0.000     -.081344   -.0413939
        age2 |   .0004166   .0001077     3.87   0.001      .000197    .0006363
       _cons |   1.815323   .1996917     9.09   0.000     1.408049    2.222597
    ---------+--------------------------------------------------------------------
    excellent|
      female |   -.222799   .0754205    -2.95   0.006    -.3766202   -.0689778
       black |   -.991647   .1238806    -8.00   0.000    -1.244303   -.7389909
         age |  -.0293573   .0137789    -2.13   0.041    -.0574595    -.001255
        age2 |  -.0000674   .0001505    -0.45   0.657    -.0003744    .0002396
       _cons |   1.499683    .286143     5.24   0.000     .9160909    2.083276
    ------------------------------------------------------------------------------

    (Outcome health==average is the comparison group)

One might want to calculate the estimate for black females for the "excellent" category:

    . svylc [excellent]female + [excellent]black
     ( 1)  [excellent]female + [excellent]black = 0.0
    ------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |  -1.214446   .1428188    -8.50   0.000    -1.505727   -.9231652
    ------------------------------------------------------------------------------

This result might be better interpreted as a relative risk ratio. Since the estimate was negative, one could reverse signs to get a relative risk ratio that is greater than one:

    . svylc -[excellent]female - [excellent]black, rrr
     ( 1)  - [excellent]female - [excellent]black = 0.0
    ------------------------------------------------------------------------------
      health |        RRR   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |   3.368427   .4810747     8.50   0.000     2.517245    4.507429
    ------------------------------------------------------------------------------

RRR = 3.37 is the ratio of relative risk for nonblack males to black females, with "relative risk" being the probability of being in the "excellent" category divided by the probability of being in the "average" base category. Hence, this relative risk ratio is

$$\mathrm{RRR} = \frac{\Pr(\text{excellent} \mid \text{nonblack male}) / \Pr(\text{average} \mid \text{nonblack male})}{\Pr(\text{excellent} \mid \text{black female}) / \Pr(\text{average} \mid \text{black female})}$$
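The link between this ratio and the reported coefficients can be written out in one step. The lines below are a sketch that assumes the usual multinomial logit parameterization with "average" as the base outcome (this entry does not restate the model); $\beta^{(e)}$ is our shorthand for the coefficient vector of the "excellent" equation, and $x_0$ is a nonblack male's covariate row at a fixed age.

```latex
\frac{\Pr(\text{excellent} \mid x)}{\Pr(\text{average} \mid x)} = \exp\{x\beta^{(e)}\}
\quad\Longrightarrow\quad
\mathrm{RRR}
= \frac{\exp\{x_0\beta^{(e)}\}}
       {\exp\{x_0\beta^{(e)} + \beta^{(e)}_{\text{female}} + \beta^{(e)}_{\text{black}}\}}
= \exp\{-(\beta^{(e)}_{\text{female}} + \beta^{(e)}_{\text{black}})\}
= \exp(1.214446) \approx 3.368
```

This is why the signs of both coefficients are reversed in the svylc expression before the rrr option is applied.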

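We now estimate the same model using svyologit:

    . svyologit health female black age age2
    Survey ordered logistic regression
    pweight:  finalwgt                          Number of obs    =     10335
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.170e+08
                                                F                =    223.27
                                                Prob > F         =    0.0000
    ------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      female |  -.1615219   .0523678    -3.08   0.004    -.2683266   -.0547171
       black |   -.986568   .0790276   -12.48   0.000    -1.147746   -.8253901
         age |  -.0119491   .0082974    -1.44   0.160    -.0288717    .0049736
        age2 |  -.0003234    .000091    -3.55   0.001     -.000509   -.0001377
    ---------+--------------------------------------------------------------------
       /cut1 |  -4.566229   .1632559   -27.97   0.000    -4.899192   -4.233266
       /cut2 |  -3.057415   .1699943   -17.99   0.000    -3.404121   -2.710709
       /cut3 |  -1.520596   .1714341    -8.87   0.000    -1.870238   -1.170954
       /cut4 |   -.242785   .1703964    -1.42   0.164    -.5903107    .1047407
    ------------------------------------------------------------------------------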

Although svyologit and svyoprobit are multiple-equation estimators, one can refer to the estimates in the first equation using single-equation syntax:

    . svylc female + black
     ( 1)  [health]female + [health]black = 0.0
    ------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |   -1.14809   .1008367   -11.39   0.000    -1.353748    -.942432
    ------------------------------------------------------------------------------

The single-equation syntax does not work when referring to the cutpoints:

    . svylc cut1 - cut2
    cut1 not found
    r(111);

When in doubt, always use the show option. It will show you exactly how the equations are labeled.

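    . svylc, show
    ------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    health   |
      female |  -.1615219   .0523678    -3.08   0.004    -.2683266   -.0547171
       black |   -.986568   .0790276   -12.48   0.000    -1.147746   -.8253901
         age |  -.0119491   .0082974    -1.44   0.160    -.0288717    .0049736
        age2 |  -.0003234    .000091    -3.55   0.001     -.000509   -.0001377
    ---------+--------------------------------------------------------------------
    cut1     |
       _cons |  -4.566229   .1632559   -27.97   0.000    -4.899192   -4.233266
    ---------+--------------------------------------------------------------------
    cut2     |
       _cons |  -3.057415   .1699943   -17.99   0.000    -3.404121   -2.710709
    ---------+--------------------------------------------------------------------
    cut3     |
       _cons |  -1.520596   .1714341    -8.87   0.000    -1.870238   -1.170954
    ---------+--------------------------------------------------------------------
    cut4     |
       _cons |   -.242785   .1703964    -1.42   0.164    -.5903107    .1047407
    ------------------------------------------------------------------------------

The output of svyologit and svyoprobit is actually quite deceptive. The first equation contains all the coefficient estimates, but then there is one equation for each cutpoint. To estimate differences of the cutpoints, use the multiple-equation syntax:

    . svylc [cut2]_cons - [cut1]_cons
     ( 1)  - [cut1]_cons + [cut2]_cons = 0.0
    ------------------------------------------------------------------------------
      health |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |   1.508814   .0501686    30.07   0.000     1.406495    1.611134
    ------------------------------------------------------------------------------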

Subpopulations with one by() variable

The svymean, svytotal, and svyratio commands allow a by() option which produces estimates for subpopulations; see [R] svymean. Frequently, one wishes to compute estimates for differences of subpopulation estimates. It is easy to use svylc to compute estimates for differences or any other linear combination of estimates. The only thing one must know is the proper syntax for referencing the subpopulation estimates. In this and the next two sections, we illustrate the syntax with a series of examples.

Example

Suppose that we wish to get an estimate of the difference in mean vitamin C levels (variable vitaminc) between males and females. First, we compute the means of vitaminc by sex.

    . svymean vitaminc, by(sex)
    Survey mean estimation
    pweight:  finalwgt                          Number of obs    =      9973
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.129e+08
    ------------------------------------------------------------------------
    Mean  Subpop. |   Estimate    Std. Err.   [95% Conf. Interval]      Deff
    --------------+---------------------------------------------------------
    vitaminc      |
             Male |   .9312051    .0169297    .8966768    .9657333  4.926449
           Female |    1.12753    .0173704    1.092103    1.162957  5.028652
    ------------------------------------------------------------------------

Then we use the svylc command.

    . svylc [vitaminc]Male - [vitaminc]Female
     ( 1)  [vitaminc]Male - [vitaminc]Female = 0.0
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
         (1) |  -.1963252     .015981   -12.28   0.000   -.2289186   -.1637318
    ------------------------------------------------------------------------

When svymean or svytotal is used with a by() option, the syntax for referencing the subpopulation estimates is

    [varname]subpop_label

For example, we use [vitaminc]Male to refer to the subpopulation estimates. This is the same syntax that is used with the test command when there are multiple equations; see [R] test for full details. Be sure to type the variable names and subpopulation labels exactly as they are displayed in the output. Remember that Stata is case-sensitive.

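Subpopulations with two or more by() variables

If there are two or more by() variables, you must refer to the subpopulations by numbers (1, 2, 3, ...) when using svylc.

Example

    . svymean vitaminc, by(sex race)
    Survey mean estimation
    pweight:  finalwgt                          Number of obs    =      9973
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.129e+08
    ----------------------------------------------------------------------------
    Mean     Subpop. |   Estimate    Std. Err.   [95% Conf. Interval]       Deff
    -----------------+----------------------------------------------------------
    vitaminc         |
      Male     White |   .9475117    .0168982    .9130475    .9819758   4.646413
      Male     Black |   .7382045    .0477521    .6408135    .8355955   2.165885
      Male     Other |   1.021363    .0521427     .915017    1.127708   1.739788
      Female   White |   1.151125    .0168117    1.116838    1.185413   4.032603
      Female   Black |   .9222313    .0348224    .8512105     .993252   2.915009
      Female   Other |     1.0804    .0412742    .9962202    1.164579    1.00135
    ----------------------------------------------------------------------------

You can see the numbering scheme by running svylc with the show option.

    . svylc, show
    ------------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    vitaminc |
           1 |   .9475117   .0168982    56.07   0.000     .9130475    .9819758
           2 |   .7382045   .0477521    15.46   0.000     .6408135    .8355955
           3 |   1.021363   .0521427    19.59   0.000      .915017    1.127708
           4 |   1.151125   .0168117    68.47   0.000     1.116838    1.185413
           5 |   .9222313   .0348224    26.48   0.000     .8512108     .993252
           6 |     1.0804   .0412742    26.18   0.000     .9962202    1.164579
    ------------------------------------------------------------------------------

So if we want to test the hypothesis that vitamin C levels are the same in white females and black females, we need to test subpopulation 4 versus subpopulation 5.

    . svylc [vitaminc]4 - [vitaminc]5
     ( 1)  [vitaminc]4 - [vitaminc]5 = 0.0
    ------------------------------------------------------------------------
        Mean |   Estimate    Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
         (1) |   .2288941    .0337949     6.77   0.000    .1599688    .2978193
    ------------------------------------------------------------------------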
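The numbering shown by svylc, show follows the row order of the svymean table in this example. The sketch below reproduces that ordering under the assumption that subpopulations are numbered by crossing the by() variables in display order, with the first variable varying slowest. That matches the output above, but running svylc with show remains the authoritative check.

```python
from itertools import product

# Category labels in the order they appear in the svymean output.
sexes = ["Male", "Female"]
races = ["White", "Black", "Other"]

# product() varies the last factor fastest, giving Male/White, Male/Black, ...
numbering = {i: combo for i, combo in enumerate(product(sexes, races), start=1)}

# Reverse lookup: which number refers to a given subpopulation?
lookup = {combo: i for i, combo in numbering.items()}
print(lookup[("Female", "White")], lookup[("Female", "Black")])
```

With this mapping, the white-female versus black-female comparison corresponds to subpopulations 4 and 5, as used in the svylc command above.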


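The use of svylc after svyratio

Using svylc after svyratio is a little more complicated. But, again, the show option on svylc will guide you.

Example

    . svyratio y1/x1 y2/x2
    Survey ratio estimation
    pweight:  finalwgt                          Number of obs    =     10351
    Strata:   strata                            Number of strata =        31
    PSU:      psu                               Number of PSUs   =        62
                                                Population size  = 1.172e+08
    ------------------------------------------------------------------------
       Ratio |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
    ---------+--------------------------------------------------------------
       y1/x1 |   .9918905    .0102386    .9710087    1.012772    1.647415
       y2/x2 |   .9962729    .0083088    .9793269    1.013219      1.0771
    ------------------------------------------------------------------------

    . svylc, show
    ------------------------------------------------------------------------------
       Ratio |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    y1       |
          x1 |   .9918905   .0102386    96.88   0.000     .9710087    1.012772
    ---------+--------------------------------------------------------------------
    y2       |
          x2 |   .9962729   .0083088   119.91   0.000     .9793269    1.013219
    ------------------------------------------------------------------------------

    . svylc [y1]x1 - [y2]x2
     ( 1)  [y1]x1 - [y2]x2 = 0.0
    ------------------------------------------------------------------------------
       Ratio |   Estimate   Std. Err.       t    P>|t|     [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         (1) |  -.0043824   .0125921    -0.35   0.730    -.0300641    .0212993
    ------------------------------------------------------------------------------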

The following examples illistrate the syntax when there are by() subpopulations.

Example svyratio yi/xl, by race) Survey ratio estimation pwe ight: f inalwgt Strata: strata j PSU: psu

Ratio

Number of obs = 10351 Number of strata = 31 Number of PSUs = 62 Population size = 1.172e+08

Subpop .

Estimate

Std. Err.

White Black Other

.995116 .9525558 i . 026876

.0116867 .0381059 i

•


Deff

yl/xl .0447707

.9712807 . 8748384 .9355659

1.018951 1,030273

1.879759 2 . 242268

1.118187

. 8308877


50

. svylc, show Ratio

Coef.

White Black Other

.995116 .9525558 1.026876

Std. Err.

P>|t|

t


1

.0116867 .0381059 .0447707

85.15 25.00 22.94

0.000

.9712807

0 . 000 0 . 000

. 8748384 . 9355659

1.018951 1.030273 1.118187

. svylc [1] White - [1] Black ( 1) [1] White - [l3Black = 0.0 Ratio (1)

Estimate

Std. Err.

t

P>lt|

. 0425602

. 0439945

0 . 97

0.341

. svyratio yl/xl, by (sex race) Survey ratio estimation pweight: finalwgt Strata: strata PSU:

psu

Ratio

Subpop.

Estimate

Std. Err.

yl/xl Male Male Male

White Black Other

1.000215 .9726418 1.000358

.0150805 .0486307 .0732775

Female

White

.9904237

.0169396

Female Female

Black Other

.9362548 1.056553

.0409748 .082305

[957. Conf. Interval] -.0471671

.1322875


10351 = 31 = 62 = = 1 . 172e+08


Deff

.9694585 .8734589 . 850907 .9558752 .8526861 .8886906

1.030972 1.071825 1 . 149808 1.024972 1.019823 1.224415

1.460442

1.426839 1.266913 2.109029 1.619815 1.228803

. svylc, show

   Ratio        Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
   -----------------------------------------------------------------------
   1
   1         1.000215    .0150805   66.33    0.000    .9694585   1.030972
   2         .9726418    .0486307   20.00    0.000    .8734589   1.071825
   3         1.000358    .0732775   13.65    0.000     .850907   1.149808
   4         .9904237    .0169396   58.47    0.000    .9558752   1.024972
   5         .9362548    .0409748   22.85    0.000    .8526861   1.019823
   6         1.056553     .082305   12.84    0.000    .8886906   1.224415

. svylc [1]1 - [1]4

 ( 1)  [1]1 - [1]4 = 0.0

   Ratio     Estimate   Std. Err.       t    P>|t|    [95% Conf. Interval]
   -----------------------------------------------------------------------
     (1)     .0097916    .0221119    0.44    0.661    -.0353058    .054889
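The t statistic and confidence interval in the last table can be checked by hand. The sketch below is a minimal Python illustration, not part of Stata. It assumes that the design degrees of freedom equal the number of PSUs minus the number of strata (62 - 31 = 31) and that the 97.5th percentile of Student's t with 31 degrees of freedom is about 2.0395; both are assumptions stated here rather than values printed in the output. The variable names are illustrative only.

```python
# Values copied from the svylc output for [1]1 - [1]4
# (Male White minus Female White).
est, se = 0.0097916, 0.0221119
ci_lower, ci_upper = -0.0353058, 0.054889

# Assumption: design degrees of freedom = number of PSUs - number of strata.
n_psu, n_strata = 62, 31
df = n_psu - n_strata

# Assumption: tabulated 97.5th percentile of Student's t with 31 df.
T_CRIT_31 = 2.0395

# t statistic for the null hypothesis that the linear combination is zero.
t_stat = est / se

# 95% confidence interval: estimate plus or minus critical value times SE.
half_width = T_CRIT_31 * se
ci = (est - half_width, est + half_width)

# Critical value implied by the printed interval itself.
implied_crit = (ci_upper - ci_lower) / (2 * se)

print(f"df = {df}")
print(f"t = {t_stat:.2f}")
print(f"95% CI = ({ci[0]:.7f}, {ci[1]:.7f})")
print(f"implied critical value = {implied_crit:.4f}")
```

The recomputed interval agrees with the printed one to about six decimal places, and the critical value implied by the printed interval is consistent with a t distribution on 31 degrees of freedom.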


Saved Results

svylc saves in r():

Scalars
    r(est)         point estimate of linear combination
    r(se)          standard error (square root of design-based variance estimate)
    r(N_strata)    number of strata
    r(N_psu)       number of sampled PSUs
    r(deff)        deff
    r(deft)        deft
    r(meft)        meft

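Methods and Formulas

svylc is implemented as an ado-file.

svylc estimates $\eta = C\theta$, where $\theta$ is a $q \times 1$ vector of parameters (e.g., population means or population regression coefficients), and $C$ is any $1 \times q$ vector of constants. The estimate of $\eta$ is $\widehat{\eta} = C\widehat{\theta}$, and the estimate of its variance is

$$\widehat{V}(\widehat{\eta}) = C\,\widehat{V}(\widehat{\theta})\,C'$$

Similarly, the simple-random-sampling variance estimator used in the computation of deff and deft is $\widehat{V}_{\mathrm{srswor}}(\widetilde{\eta}_{\mathrm{srs}}) = C\,\widehat{V}_{\mathrm{srswor}}(\widetilde{\theta}_{\mathrm{srs}})\,C'$. The variance estimator used in the computation of meff and meft is $\widehat{V}_{\mathrm{msp}}(\widetilde{\eta}_{\mathrm{msp}}) = C\,\widehat{V}_{\mathrm{msp}}(\widetilde{\theta}_{\mathrm{msp}})\,C'$. See the Methods and Formulas section of [R] svymean for details on the computation of deff, deft, meff, and meft.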
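The variance formula for a linear combination can be illustrated numerically. The sketch below is a minimal Python illustration under stated assumptions, not Stata's implementation. The helper name `linear_combination` is hypothetical. The estimates and standard errors for White and Black are taken from the by(race) example. The covariance between the two estimates is not printed anywhere in the output, so it is back-calculated here from the reported standard error of the difference (.0439945) and should be read as an implied value only.

```python
def linear_combination(C, theta, V):
    """Return (estimate, standard error) of C*theta, where V is the
    covariance matrix of theta and the variance is C V C'."""
    q = len(C)
    est = sum(C[i] * theta[i] for i in range(q))
    var = sum(C[i] * V[i][j] * C[j] for i in range(q) for j in range(q))
    return est, var ** 0.5


# Ratio estimates and standard errors for White and Black (from the output).
theta = [0.995116, 0.9525558]
se_white, se_black = 0.0116867, 0.0381059

# Reported standard error of [1]White - [1]Black (from the output).
se_diff_reported = 0.0439945

# Assumption: covariance implied by Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b).
# This value is back-calculated, not printed by Stata.
cov_implied = (se_white ** 2 + se_black ** 2 - se_diff_reported ** 2) / 2

V = [
    [se_white ** 2, cov_implied],
    [cov_implied, se_black ** 2],
]

# C = (1, -1) selects the difference White - Black.
C = [1, -1]
est, se = linear_combination(C, theta, V)

print(f"estimate = {est:.7f}")
print(f"std. err. = {se:.7f}")
print(f"implied covariance = {cov_implied:.8f}")
```

Because the covariance is derived from the reported standard error, the standard error is reproduced by construction; the point of the sketch is to show how the off-diagonal term of the covariance matrix enters the variance of the combination. The implied covariance is negative, which is why the standard error of the difference exceeds the value obtained by treating the two estimates as uncorrelated.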

References

Eltinge, J. L. and W. M. Sribney. 1996. svy5: Estimates of linear combinations and hypothesis tests for survey data. Stata Technical Bulletin 31: 31-42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246-259.

McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 15(1). National Center for Health Statistics, Hyattsville, MD.


Also See

Complementary: [R] svy estimators, [R] svydes, [R] svymean, [R] svyset, [R] svytest

Related: [R] lincom

Background: [U] 16.5 Accessing coefficients and standard errors, [U] 30 Overview of survey estimation, [R] svy

Title svymean — Estimate means, totals, ratios, and proportions for survey data

Syntax

svymean varlist [weight] [if exp] [in range] [, common_options ]

svytotal varlist [weight] [if exp] [in range] [, common_options ]

svyratio varname [/] varname [varname [/] varname ...] [weight] [if exp] [in range] [, common_options ]

svyprop varlist [weight] [if exp] [in range] [, strata(varname) psu(varname) fpc(varname) by(varlist) subpop(varname) nolabel format(%fmt) ]

The common_options for svymean, svytotal, and svyratio are

    strata(varname) psu(varname) fpc(varname) by(varlist) subpop(varname) srssubpop nolabel { complete | available } level(#) ci deff deft meff meft obs size

svymean, svyratio, and svytotal typed without arguments redisplay previous results. Any of the following options can be used when redisplaying results:

    level(#) ci deff deft meff meft obs size

All these commands allow pweights and iweights; see [U] 14.1.6 weight.

Warning: Use of if
