Statistics Toolkit
Rafael Perera
Centre for Evidence-based Medicine, Department of Primary Health Care, University of Oxford, Old Road Campus, Headington, Oxford OX3 7LF

Carl Heneghan
Centre for Evidence-based Medicine, Department of Primary Health Care, University of Oxford, Old Road Campus, Headington, Oxford OX3 7LF

Douglas Badenoch
Minervation Ltd, Salter's Boat Yard, Folly Bridge, Abingdon Road, Oxford OX1 4LB
© 2008 Rafael Perera, Carl Heneghan and Douglas Badenoch Published by Blackwell Publishing BMJ Books is an imprint of the BMJ Publishing Group Limited, used under licence Blackwell Publishing, Inc., 350 Main Street, Malden, Massachusetts 02148-5020, USA Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK Blackwell Publishing Asia Pty Ltd, 550 Swanston Street, Carlton, Victoria 3053, Australia The right of the Author to be identified as the Author of this Work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. First published 2008 1 2008 ISBN: 978-1-4051-6142-8 A catalogue record for this title is available from the British Library and the Library of Congress. Set in Helvetica Medium 7.75/9.75 by Sparks, Oxford – www.sparks.co.uk Printed and bound in Singapore by Markono Print Media Pte Ltd Commissioning Editor: Mary Banks Development Editors: Lauren Brindley and Victoria Pittman Production Controller: Rachel Edwards For further information on Blackwell Publishing, visit our website: http://www.blackwellpublishing.com The publisher's policy is to use permanent paper from mills that operate a sustainable forestry policy, and which has been manufactured from pulp processed using acid-free and elementary chlorine-free practices. Furthermore, the publisher ensures that the text paper and cover board used have met acceptable environmental accreditation standards. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. 
The Publisher is not associated with any product or vendor mentioned in this book.
Contents

Introduction (p. 1)
Data: describing and displaying (p. 3)
Probability and confidence intervals (p. 15)
Hypothesis testing (p. 21)
Choosing which measure and test to use (p. 25)
Randomised controlled trials (p. 28)
Systematic reviews 1 (p. 36)
Case-control studies (p. 46)
Questionnaire studies 1 (p. 53)
Questionnaire studies 2 (p. 59)
Cohort studies (p. 66)
Systematic reviews 2 (p. 70)
Diagnostic tests (p. 73)
Scale validation (p. 81)
Statistical toolkit: glossary (p. 89)
Software for data management and statistical analysis (p. 103)
References (p. 108)
Index (p. 109)
Commonly used symbols (inside back cover)
This handbook was compiled by Rafael Perera, Carl Heneghan and Douglas Badenoch. We would like to thank all those people who have had input to our work over the years, particularly Paul Glasziou and Olive Goddard from the Centre of Evidence-Based Medicine. In addition, we thank the people we work with from the Department of Primary Health Care, University of Oxford, whose work we have used to illustrate the statistical principles in this book. We would also like to thank Lara and Katie for their drawings.
Introduction

This ‘toolkit’ is the second in our series and is intended as a summary of the key concepts needed to get started with statistics in healthcare. Often, people find statistical concepts hard to understand and apply. If this rings true with you, this book should allow you to start using such concepts with confidence for the first time. Once you have understood the principles in this book you should be at the point where you can understand and interpret statistics, and start to deploy them effectively in your own research projects.

The book is laid out in three main sections. The first deals with the basic nuts and bolts of describing, displaying and handling your data, considering which test to use and testing for statistical significance. The second section shows how statistics is used in a range of scientific papers. The final section contains the glossary, a key to the symbols used in statistics and a discussion of the software tools that can make your life using statistics easier.

Occasionally you will see the GO icon on the right. This means the difficult concept being discussed is beyond the scope of this textbook. If you need more information on this point you can either refer to the text cited or discuss the problem with a statistician.
Essentials you need to get started
• Types of data: categorical, numerical, nominal, ordinal, etc. Describing your data (see p. 3).
• The null (H0) and alternative (H1) hypothesis: for H0 and H1, see p. 20.
• What type of test? Choosing the right type of test to use will prove the most difficult concept to grasp. The book is set out with numerous examples of which test to use; for a quick reference refer to the flow chart on page 26.
• Is it significant? Once you have chosen the test, compute the value of the statistic and then compare it with the critical value (see p. 20).
• Software tools: start with Excel (learn how to produce simple graphs) and then move on to SPSS; when your skills are improving consider STATA, SAS or R (see p. 103).
Data: describing and displaying

The type of data we collect determines the methods we use. When we conduct research, data usually come in two forms:
• Categorical data, which give us percentages or proportions (e.g. ‘60% of patients suffered a relapse’).
• Numerical data, which give us averages or means (e.g. ‘the average age of participants was 57 years’).

So, the type of data we record influences what we can say, and how we work it out. This section looks at the different types of data collected and what they mean.

Variable: any measurable factor, characteristic or attribute.

A variable from our data can be one of two types: categorical or numerical.

Categorical: the variables studied are grouped into categories based on qualitative traits of the data. Thus the data are labelled or sorted into categories.
• Nominal: categories are not ordered (e.g. ethnic group).
• Ordinal: categories are ordered (e.g. tumour stage).

A special kind of categorical variable is the binary or dichotomous variable: a variable with only two possible values (zero and one) or categories (yes or no, present or absent, etc.; e.g. death, occurrence of myocardial infarction, whether or not symptoms have improved).

Numerical: the variables studied take some numerical value based on quantitative traits of the data. Thus the data are sets of numbers.
• Discrete: only certain values are possible, with gaps between these values (e.g. admissions to hospital).
• Continuous: all values are theoretically possible and there are no gaps between values (e.g. weight, height).
You can think of discrete data as counts and continuous data as measurements.

Censored data: sometimes we come across data that can only be measured within certain limits. For instance, troponin levels in myocardial infarction may only be detectable above a certain level and below a fixed upper limit (0.2–180 µg/L).
Summarizing your data

It’s impossible to look at all the raw data and instantly understand it. If you’re going to interpret what your data are telling you, and communicate it to others, you will need to summarize your data in a meaningful way. Typical mathematical summaries include percentages, risks and the mean. The benefit of mathematical summaries is that they can convey information with just a few numbers; these summaries are known as descriptive statistics. Summaries that capture the average are known as measures of central tendency, whereas summaries that indicate the spread of the data, usually around the average, are known as measures of dispersion.

The arithmetic mean (numeric data)

The arithmetic mean is the sum of the data divided by the number of measurements. It is the most common measure of central tendency and represents the average value in a sample.

$$\bar{x} = \frac{\sum x_i}{n}$$

where $\bar{x}$ = sample mean ($\mu$ = population mean), $\sum$ = the sum of, $x_i$ = the $i$th measurement of the variable $x$, and $n$ = number of measurements.

Consider the following test scores out of ten: 6, 4, 5, 6, 7, 4, 7, 2, 9, 7

(6 + 4 + 5 + 6 + 7 + 4 + 7 + 2 + 9 + 7) / 10 = 57 / 10 = 5.7

1. The sum of the measurements (57)
2. Divided by the number of measurements (10)
3. Gives you the mean (5.7)

To calculate the mean, add up all the measurements in a group and then divide by the total number of measurements.
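The same calculation can be sketched in a few lines of code. This is only an illustration: Python is our choice here rather than one of the packages discussed in this book, and the function name `arithmetic_mean` is a hypothetical helper.

```python
# Test scores out of ten, taken from the worked example above.
scores = [6, 4, 5, 6, 7, 4, 7, 2, 9, 7]


def arithmetic_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


mean_score = arithmetic_mean(scores)
print(f"sum = {sum(scores)}, n = {len(scores)}, mean = {mean_score:.1f}")
```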
The geometric mean

If the data we have sampled are skewed to the right (see p. 7) then we transform the data by taking the natural logarithm (base e ≈ 2.72) of each value in the sample. The arithmetic mean of these transformed values provides a more stable measure of location because the influence of extreme values is smaller. To obtain the average in the same units as the original data, called the geometric mean, we need to back transform the arithmetic mean of the transformed data:

$$\text{geometric mean of original values} = e^{\,\text{arithmetic mean of } \ln(\text{original values})}$$

The weighted mean

The weighted mean is used when certain values are more important than others: they supply more information. If all weights are equal then the weighted mean is the same as the arithmetic mean (see p. 54 for more). We attach a weight ($w_i$) to each of our observations ($x_i$):

$$\text{weighted mean} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum w_i x_i}{\sum w_i}$$
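A short sketch of both calculations follows. The data values and weights are invented purely for illustration, and the function names are hypothetical helpers rather than anything defined in this book.

```python
import math


def geometric_mean(values):
    """Back-transform the arithmetic mean of the natural logs of the values."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def weighted_mean(values, weights):
    """Sum of (weight x value) divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical right-skewed data: each value is double the previous one.
skewed = [1, 2, 4, 8, 16]
print("arithmetic mean:", sum(skewed) / len(skewed))
print("geometric mean: ", round(geometric_mean(skewed), 3))

# Hypothetical observations where the third supplies twice the information.
values = [2, 4, 6]
print("weighted mean:  ", weighted_mean(values, [1, 1, 2]))
# With equal weights the weighted mean equals the arithmetic mean.
print("equal weights:  ", weighted_mean(values, [1, 1, 1]))
```

In this made-up sample the geometric mean sits well below the arithmetic mean, because the largest values pull the arithmetic mean upwards.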
The median and mode

The easiest way to find the median and the mode is to sort the scores in order, from the smallest to the largest:

Test scores out of ten (sorted): 1) 2   2) 4   3) 4   4) 5   5) 6   6) 6   7) 7   8) 7   9) 7   10) 9

In a set of ten scores take the fifth and sixth values: (6 + 6) / 2 = 6

The median is equal to the mean of the two middle values, or to the middle value when the sample size is an odd number.
The median is the value at the midpoint, such that half the values are smaller than the median and half are greater than the median. The mode is the value that appears most frequently in the group. For these test scores the mode is 7. If all values occur with the same frequency then there is no mode. If more than one value occurs with the highest frequency then each of these values is the mode. Data with two modes are known as bimodal.
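As a sketch, the median and mode rules described above could be coded as follows. The helper `modes` is hypothetical; it is written so that it returns an empty list when every value occurs equally often, which follows the convention in the text rather than that of any particular statistics library.

```python
import statistics
from collections import Counter

# Test scores out of ten, as used above.
scores = [6, 4, 5, 6, 7, 4, 7, 2, 9, 7]


def modes(values):
    """Return the most frequent value(s); empty if all are equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # no mode: every value occurs with the same frequency
    return sorted(v for v, c in counts.items() if c == top)


# statistics.median sorts the data and averages the two middle values
# when the number of measurements is even.
median_score = statistics.median(scores)

print("sorted scores:", sorted(scores))
print("median:", median_score)
print("mode(s):", modes(scores))
print("bimodal example:", modes([1, 1, 2, 3, 3]))
print("no mode example:", modes([1, 2, 3]))
```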
Choosing which one to use: (arithmetic) mean, median or mode?

The following graph shows the mean, median and mode of the test scores. The x-axis shows the scores out of ten. The height of each bar (y-axis) shows the number of participants who achieved that score.

[Figure: bar chart of the test scores (x-axis: scores out of ten; y-axis: number of scores), with the mode, median and mean marked]

This graph illustrates why the mean, median and mode are all referred to as measures of central tendency. The data values are spread out across the horizontal axis, whilst the mean, median and mode are all clustered towards the centre of the graph.

Of the three measures the mean is the most sensitive, because its value always reflects the contribution of every data value in the group. The median and the mode are less sensitive to outlying data at the extremes of a group. Sometimes it is an advantage to have a measure of central tendency that is less sensitive to changes in the extremes of the data.

For this reason, it is important not to rely solely on the mean. By taking into account the frequency distribution and the median, we can obtain a better understanding of the data, and whether the mean actually depicts the average value. For instance, if there is a big difference between the mean and the median then we know there are some extreme measures (outliers) affecting the mean value.

Distributions
[Figure: histogram of test scores (x-axis: scores out of ten; y-axis: number of scores), symmetrical] The shape of the data is approximately the same on both the left-hand and right-hand side of the graph (symmetrical data). Therefore use the mean (5.9) as the measure of central tendency.

[Figure: histogram of test scores (x-axis: scores out of ten; y-axis: number of scores), peak to the right] The data are now nonsymmetrical, i.e. the peak is to the right. We call these negatively skewed data and the median (9) is a better measurement of central tendency.

[Figure: histogram of test scores (x-axis: scores out of ten; y-axis: number of scores), two peaks] The data are now bimodal, i.e. they have two peaks. In this case there may be two different populations, each with its own central tendency. One mean score is 2.2 and the other is 7.5.

[Figure: histogram of test scores (x-axis: scores out of ten; y-axis: number of scores), several peaks] Sometimes there is no central tendency to the data; there are a number of peaks. This could occur when the data have a ‘uniform distribution’, which means that all possible values are equally likely. In such cases a central tendency measure is not particularly useful.
Measures of dispersion: the range

To provide a meaningful summary of the data we need to describe the average or central tendency of our data as well as the spread of the data.

Meaningful summary = central tendency + spread

For instance, these two sets of class results have the same mean (51/10 = 5.1):

Class 1 test scores: 4, 5, 5, 6, 5, 7, 5, 5, 5, 4
Class 2 test scores: 1, 6, 3, 8, 3, 9, 2, 6, 4, 9

However, class 2 test scores are more scattered; the spread of the data tells us whether the data are close to the mean or far away.

[Figure: two bar charts of the scores in each class (x-axis: scores out of ten; y-axis: number of scores). The range for results in class 1 is 4–7. The range for results in class 2 is 1–9.]

The range is the difference between the largest and the smallest value in the data (3 for class 1 and 8 for class 2). We will look at four ways of understanding how much the individual values vary from one to another: variance, standard deviation, percentiles and standard error of the mean.

The variance

The variance is a measure of how far each value deviates from the arithmetic mean ($\sigma^2$ = population variance; $s^2$ = sample variance). We cannot simply use the mean of the deviations, as the negatives would cancel out the positives; to overcome this problem we square each deviation and then find the mean of these squared deviations.

To calculate the (sample) variance:
1. Subtract the mean from each value in the data.
2. Square each of these distances and add all of the squares together.
3. Divide the sum of the squares by the number of values in the data minus 1.

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Note that we have divided by n – 1 instead of n. This is because we nearly always rely on sample data, and it can be shown that a better estimate of the population variance is obtained if we divide by n – 1 instead of n.

The standard deviation

The standard deviation (s) is the square root of the variance:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

The standard deviation can be thought of as the typical deviation of the values from the mean, and it is expressed in the same units as the original data. Consider the class 1 results, with a mean of 5.1:

x_i | x_i – x̄ | (x_i – x̄)²
4 | –1.1 | 1.21
4 | –1.1 | 1.21
5 | –0.1 | 0.01
5 | –0.1 | 0.01
5 | –0.1 | 0.01
5 | –0.1 | 0.01
5 | –0.1 | 0.01
5 | –0.1 | 0.01
6 | 0.9 | 0.81
7 | 1.9 | 3.61

1. Subtract the mean from each value.
2. Square each of the distances.
3. Add the squares together: ∑ (x_i – x̄)² = 6.9
4. Divide by the number of values minus 1: ∑ (x_i – x̄)² / (n – 1) = 6.9/9 = 0.77

SD = square root of 0.77 = 0.88

Therefore, in class 1 the mean is 5.1 and the standard deviation is 0.88. This is often written as 5.1 ± 0.88, describing a range of values of one SD around the mean. Assuming the data are from a normal distribution, this range of values one SD either side of the mean includes 68.2% of the possible measures, two SDs includes 95.4% and three SDs includes 99.7%.

Dividing the standard deviation by the mean gives us the coefficient of variation. This can be used to express the degree to which a set of data points varies and can be used to compare variability between populations.

Percentiles

Percentiles provide an estimate of the proportion of data that lies above and below a given value. Thus the first percentile cuts off the lowest 1% of data, the second percentile cuts off the lowest 2% of data and so on. The 25th percentile is also called the first quartile and the 50th percentile is the median (or second quartile). Percentiles are helpful because we can obtain a measure of spread that is not influenced by outliers. Often data are presented with the interquartile range: between the 25th and 75th percentiles (first and third quartiles).

Standard error of the mean

The standard error of the mean (SEM) is the standard deviation of a hypothetical sample of means. The SEM quantifies how accurately the true population mean is known:

$$\text{SEM} = \frac{s}{\sqrt{n}}$$

where s is the standard deviation of the observations in the sample and n is the sample size. The smaller the variability (s) and/or the larger the sample, the smaller the SEM will be. A small SEM means that the estimate of the mean is more precise.
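The measures of dispersion described above can be sketched together in code for the two sets of class results. This is an illustration only: the helper `describe` is hypothetical, and the quartiles use the default convention of Python's `statistics.quantiles`, which may differ slightly from the convention used by other packages.

```python
import math
import statistics

# Class results as listed in the text.
class_1 = [4, 5, 5, 6, 5, 7, 5, 5, 5, 4]
class_2 = [1, 6, 3, 8, 3, 9, 2, 6, 4, 9]


def describe(values):
    """Return the central tendency and spread measures discussed above."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance: divides by n - 1
    sd = math.sqrt(variance)
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "mean": mean,
        "range": (min(values), max(values)),
        "variance": variance,
        "sd": sd,
        "coefficient_of_variation": sd / mean,
        "interquartile_range": (q1, q3),
        "median": q2,
        "sem": sd / math.sqrt(n),
    }


summary_1 = describe(class_1)
summary_2 = describe(class_2)

for name, summary in (("class 1", summary_1), ("class 2", summary_2)):
    print(name)
    for measure, value in summary.items():
        print(f"  {measure}: {value}")
```

Running the sketch on both classes makes the contrast concrete: the means agree, while every measure of spread is larger for class 2.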
Displaying data

Most of our graphical displays are about summarizing frequencies, making it easier to compare and/or contrast data. They also allow for the identification of outliers and assessment of any trends in the data.

Key elements for the construction of graphs are generally not well understood, which then leads to poor representations and misunderstandings. Commonly, the graphs generated depend on the statistical package used in the analysis (Excel, SPSS, STATA).

The three key principles of graph construction are:
a) visual detection of data symbols;
b) estimation of values and important relationships;
c) context.

Puhan et al. Three key principles of graph construction. J Clin Epidemiol 2006;59:1017–22.

There are various ways to display your data such as:
• Pie chart
• Bar or column chart
• Histogram
• Stem and leaf
• Box plot
• Dot plot
[Figure: column chart of test scores (x-axis: scores out of ten; y-axis: number of scores)]

Bar or column chart for test scores: a vertical column represents each category, with the length proportional to the frequency. The small gaps indicate that the data are discrete.

[Figure: column chart (y-axis: 0–100%) for the categories Statin Therapy (6), Verbal advice (3) and Review HT annually (5)]

We used Excel to generate the above bar or column chart.
Use the chart wizard in Excel to get started. Chart type allows you to change the appearance of the display.

Source data: to change data and labels, right click the graph and highlight source data and series.

By selecting data series and then selecting options we can reduce the gap width to zero. We now have a histogram: similar to the bar chart but with no gaps. This should be used when the data are continuous. The width of each bar would now relate to a range of values for the variable, which may be categorized. Careful labelling should be used to delineate the boundaries.
[Figure: pie chart of % passes by sex: 52% boys passing, 48% girls passing]

Pie charts: not often used in scientific papers but can be helpful as visual presentations. The area of the pie segment is proportional to the frequency of that category. Such charts should be used when the sum of all categories is meaningful, i.e. if they represent proportions. To construct one, right click on your chart, select chart type and choose the pie chart icon.
Box and whisker plot: a graphical display of the lowest (predicted) value, lower quartile, median, upper quartile and the highest (predicted) value.

[Figure: box and whisker plot labelled with the lowest value, lower quartile, median, upper quartile and highest value]

We used SPSS to generate the box and whisker plot:
• Select from the Graphs drop down menu; chart type allows you to change the appearance of the display.
• Select at least one numeric variable for which you want box plots.
• You can select a variable and move it into the Label Cases by box: numeric or string.

String: a variable which is not numeric and which is not used in calculations.

If the population sampled in your dataset is not normal, you may see many ‘outliers’: points outside the predicted lowest and/or highest value.
Dot plot or scatter-plot: used to display two quantitative variables and represent the association between these two variables (this is not the same as causation). Useful for continuous data and correlation or regression. There are at least four possible uses for scatter-plots:
1) Relationship between two variables (simple).
2) Relationship between two variables graphed in every combination (matrix).
3) Two scatter-plots on top of each other (overlay).
4) Relationship between three variables in three dimensions (3-D scatter-plot).
Probability and confidence intervals

Probability is a measure of how likely an event is to occur. Expressed as a number, probability can take on a value between 0 and 1. Thus the probability of a fair coin landing tails up is 0.5.

Probability: event cannot occur = zero; event must occur = one.

Probability rules

If two events are mutually exclusive then the probability that either occurs is equal to the sum of their probabilities.

Mutually exclusive: a set of events in which if one happens the other does not. Tossing a coin: it can be either heads or tails, it cannot be both.

P (heads or tails) = P (heads) + P (tails)

If two events are independent, the probability that both events occur is equal to the product of the probability of each event.

Independent event: event in which the outcome of one event does not affect the outcome of the other event.

E.g. the probability of two coin tosses both coming up heads:

P (heads and heads) = P (heads) × P (heads) = 0.5 × 0.5 = 0.25

Probability distributions

A probability distribution shows all the possible values of a random variable together with their probabilities. For example, the probability distribution for the possible number of heads from two tosses of a coin having both a head and a tail would be as follows:
• (head, head) = 0.25
• (head, tail) + (tail, head) = 0.50
• (tail, tail) = 0.25

Probability distributions are theoretical distributions that enable us to estimate a population parameter such as the mean and the variance.

Parameter: summary statistic for the entire population.
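The two-toss example can be checked by listing every equally likely outcome, as in the sketch below. It assumes a fair coin; the variable names are our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every equally likely outcome of two tosses of a fair coin:
# (H, H), (H, T), (T, H), (T, T).
outcomes = list(product("HT", repeat=2))
p_each = Fraction(1, len(outcomes))

# Probability distribution of the number of heads.
distribution = Counter()
for outcome in outcomes:
    distribution[outcome.count("H")] += p_each

for heads in sorted(distribution, reverse=True):
    print(f"P({heads} heads) = {float(distribution[heads]):.2f}")

# Multiplication rule for independent events: P(heads and heads).
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Addition rule for mutually exclusive events: P(heads or tails).
p_heads_or_tails = p_heads + (1 - p_heads)
```

Counting outcomes gives the same value for two heads as the multiplication rule, and the probabilities of all possible values add up to one.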
The normal distribution

The frequency of the data follows a bell-shaped curve that is symmetrical around the mean, with an equal chance of a data point being above or below the mean. For most types of data, sums and averages of repeated observations will approximately follow the normal distribution.

[Figure: normal distribution with mean µ and variance σ² (x-axis from −3 to 3)] The normal (Gaussian) distribution can be described by two parameters: the mean and the variance. The mean and the median are equal.

[Figure: two normal curves with the same mean µ and variances σa² and σb²] Effect on normal distribution of changing variance. Two normal distributions can have the same mean but different variances (here σa² < σb²).

[Figure: two normal curves with means µa and µb] Effect of changing mean: µa < µb without changing the variance.
The distribution function is symmetrical about the mean: 68% of its area is within one standard deviation (σ) of the mean (µ) and 95% of the area is within approximately two standard deviations of µ. Therefore the probability that a random variable x is between:

(µ – σ) and (µ + σ) = 0.68 (1 standard deviation)
(µ – 1.96σ) and (µ + 1.96σ) = 0.95 (approximately 2 standard deviations)

Some practical uses of probability distributions are:
• To calculate confidence intervals for parameters.
• To determine a reasonable distributional model for univariate analysis.
• To provide the distributional assumptions on which statistical intervals and hypothesis tests are often based.
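The 68% and 95% areas quoted above can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper `area_within` is a hypothetical name.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")

# The multiplier that leaves 2.5% in each tail, i.e. 95% in the middle.
z_95 = standard_normal.inv_cdf(0.975)
print(f"multiplier for 95%: {z_95:.3f}")
```

Because the areas depend only on the number of standard deviations from the mean, the same results hold for any mean and variance.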
Underlying the normal distribution is the central limit theorem: the sum of a large number of independent random variables has (approximately) a normal distribution. The mean is a weighted sum of the collected variables; therefore, as the size of the sample increases, the theoretical sampling distribution for the mean becomes increasingly close to the normal distribution.

Different distributions: their uses and parameters

Distribution | Common use | Parameters
Binomial distribution | Used to describe discrete variables or attributes that have two possible outcomes, e.g. heads or tails | Sample size and probability of success
The chi-squared distribution | Used with continuous data: inference on a single normal variance; tests for independence, homogeneity and ‘goodness of fit’ | Degrees of freedom
F distribution | Used for inference on two or more normal variances; ANOVA and regression | Numerator and denominator degrees of freedom
Geometric distribution | Used for modelling rates of occurrence | Probability of event
Log-normal distribution | Used when the data are highly skewed whereas the natural log values of the data are normally distributed | Location parameter and the scale parameter
Poisson distribution | Used for modelling rates of occurrence for discrete variables | The rate (mean)
Student’s t distribution | Used to estimate the mean of a normally distributed population when the sample size is small. Also for calculating confidence intervals and testing hypotheses | Degrees of freedom

[Figure: probability density of the t distribution (x-axis: t, from −4 to 4) for df = 1, df = 5 and df = 50]

Degrees of freedom (df): the number of observations that are free to vary to produce a given outcome. Basically, how many values would we need to know to deduce the remaining values? The larger the df, the smaller the variability of the estimate.
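The central limit theorem described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the individual observations are drawn from a uniform distribution on 0 to 1 (clearly not bell-shaped), and the sample size, number of samples and random seed are arbitrary choices of ours.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed so the simulation is repeatable

SAMPLE_SIZE = 30   # observations per sample (assumed for illustration)
N_SAMPLES = 5000   # number of repeated samples (assumed for illustration)

# Each observation is uniform on [0, 1]: mean 0.5, variance 1/12.
sample_means = [
    statistics.fmean([random.random() for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

# Theory: the sample means should be roughly normal, centred on 0.5,
# with standard deviation sqrt(1/12) / sqrt(SAMPLE_SIZE).
expected_sd = math.sqrt(1 / 12) / math.sqrt(SAMPLE_SIZE)
observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)

# Share of sample means within 1.96 expected SDs of 0.5 (about 95% if normal).
share_within = sum(
    abs(m - 0.5) <= 1.96 * expected_sd for m in sample_means
) / N_SAMPLES

print(f"mean of sample means: {observed_mean:.4f}")
print(f"SD of sample means:   {observed_sd:.4f} (theory {expected_sd:.4f})")
print(f"share within 1.96 SD: {share_within:.3f}")
```

Although no single observation is normally distributed, the means of repeated samples cluster around the true mean in the way the normal distribution predicts.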
Confidence intervals

Confidence intervals tell you about the likely size of an effect (parameter). A single guess (estimate) of the mean (our sample mean) does not tell us anything about how sure we are about how close this is to the true population mean. Instead we use a range of values, the confidence interval, within which we expect the true population mean to lie. To do this we need to estimate the precision using the standard error of the mean (SEM). The corresponding 95% CI is:

Sample mean – (1.96 × SEM) to Sample mean + (1.96 × SEM)

The 95% CI is the range of values in which we are 95% certain the mean lies. When the data are normally distributed but the population variance is unknown, the sample mean follows a t distribution:

Sample mean – (t0.05 × SEM) to Sample mean + (t0.05 × SEM)

This shortens to:

$$\text{Sample mean} \pm t_{0.05} \times \frac{s}{\sqrt{n}}$$

where $t_{0.05}$ is the percentage point of the t distribution with n – 1 degrees of freedom for a two-tailed probability of 0.05. At this point you need to use a t distribution table. The t distribution produces a wider CI to allow for the extra uncertainty of the sampling.

If we are interested in the proportion of individuals who have a particular characteristic or outcome then we need to calculate the standard error of the proportion, SE(p):

$$SE(p) = \sqrt{\frac{p(1 - p)}{n}}$$

Large standard error = imprecise estimate; small standard error = precise estimate.
The standard error of the proportion by itself is not a particularly useful measure. If the sampling distribution of the proportion follows a normal distribution we can estimate the 95% confidence interval (95% CI) by:

$$p - 1.96 \times \sqrt{\frac{p(1 - p)}{n}} \quad \text{to} \quad p + 1.96 \times \sqrt{\frac{p(1 - p)}{n}}$$
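As a minimal sketch of the formula above, suppose (hypothetically) that 60 of 100 patients suffered a relapse. The counts are invented for illustration, and `proportion_ci` is a helper name of our own.

```python
import math


def proportion_ci(events, n, z=1.96):
    """Proportion, its standard error and the normal-approximation 95% CI."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)


# Hypothetical data: 60 relapses among 100 patients.
p, se, (lower, upper) = proportion_ci(events=60, n=100)
print(f"p = {p:.2f}, SE(p) = {se:.3f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")

# The same proportion from a sample ten times larger gives a narrower CI.
_, se_large, (lower_large, upper_large) = proportion_ci(events=600, n=1000)
print(f"n = 1000: 95% CI {lower_large:.3f} to {upper_large:.3f}")
```

The normal approximation used here is the one given in the text; it becomes unreliable when the sample is small or the proportion is close to 0 or 1.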
The 95% CI is ±1.96 times the standard error of the proportion. This is the most common method used in clinical research that reports the proportion of patients experiencing a particular outcome.

Therefore we could describe the mean of our data and the 95% CI; that is, we are 95% confident that the mean lies within this range. Given that the SE depends on the sample size and the variability of our data, a wide confidence interval tells us the results are imprecise. Also, small samples will give wider CIs than larger samples.

For example, let’s say 10 men weigh the following: 95, 97, 98, 99, 94, 97, 95, 96, 92 and 100 kg. Consider the following table and results:

x_i (kg) | x_i – x̄ | (x_i – x̄)²
95 | –1.3 | 1.69
97 | 0.7 | 0.49
98 | 1.7 | 2.89
99 | 2.7 | 7.29
94 | –2.3 | 5.29
97 | 0.7 | 0.49
95 | –1.3 | 1.69
96 | –0.3 | 0.09
92 | –4.3 | 18.49
100 | 3.7 | 13.69

Mean (x̄) = 96.3 kg
∑ (x_i – x̄)² = 52.1
∑ (x_i – x̄)² / (n – 1) = 52.1/9 = 5.79
SD = square root of 5.79 = 2.41
Standard error of the mean: SEM = s/√n = 2.41/3.16 = 0.76 kg
If the estimated variance (SD²) were equal to the true population variance, then we could calculate the 95% CI using the normal distribution:

Sample mean – (1.96 × SEM) to Sample mean + (1.96 × SEM)
= 96.3 – (1.96 × 0.76) to 96.3 + (1.96 × 0.76)
= 94.81 kg to 97.79 kg

However, generally the true population variance is not known, and in the present example it is unknown. We would therefore need to use the t distribution to calculate the 95% CI for the true population mean:

Sample mean ± (t0.05 × SEM)
= 96.3 – (2.262 × 0.76) to 96.3 + (2.262 × 0.76)
= 94.58 kg to 98.02 kg

where 2.262 is the percentage point of the t distribution with nine df (n – 1), giving a two-tailed probability of 0.05.

You can obtain percentage points of the t distribution in Excel using the function (fx) feature. Select the Statistical category, select TINV, and then enter the probability (0.05) and the degrees of freedom.
Note now the 95% CI is slightly wider reflecting the extra uncertainty in the sampling and not knowing the population variance. The use of a t distribution in this calculation is valid because the sample mean will have approximately a normal distribution (unless the data are heavily skewed and/or the sample size is small).
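The worked example can be reproduced in a few lines, as sketched below. The multiplier 2.262 is taken directly from the text (t distribution, nine df, two-tailed probability 0.05) rather than computed, because Python's standard library does not provide t distribution percentage points; `interval` is a hypothetical helper.

```python
import math
import statistics

# Weights (kg) of the 10 men in the worked example.
weights_kg = [95, 97, 98, 99, 94, 97, 95, 96, 92, 100]

n = len(weights_kg)
mean = statistics.mean(weights_kg)
sd = statistics.stdev(weights_kg)  # sample SD: divides by n - 1
sem = sd / math.sqrt(n)

Z_95 = 1.96        # normal distribution multiplier
T_95_DF9 = 2.262   # t distribution multiplier for 9 df, quoted in the text


def interval(centre, multiplier, standard_error):
    """Return (lower, upper) limits: centre -/+ multiplier x standard error."""
    half_width = multiplier * standard_error
    return centre - half_width, centre + half_width


normal_ci = interval(mean, Z_95, sem)
t_ci = interval(mean, T_95_DF9, sem)

print(f"mean = {mean:.1f} kg, SD = {sd:.2f}, SEM = {sem:.2f}")
print(f"95% CI (normal): {normal_ci[0]:.2f} to {normal_ci[1]:.2f} kg")
print(f"95% CI (t, 9 df): {t_ci[0]:.2f} to {t_ci[1]:.2f} kg")
```

The lower limit subtracts the margin and the upper limit adds it, so both intervals are centred on the sample mean, with the t interval the wider of the two.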
Hypothesis testing

A hypothesis is an explanation for certain observations. We use hypothesis testing to tell if what we observe is consistent with the hypothesis. Hypothesis testing is fundamental to statistical methods. In statistics we use the probability of an event occurring to compare two competing hypotheses:
1. Null hypothesis, H0
2. Alternative hypothesis, H1

Commonly used test of hypotheses:
Null hypothesis = no effect
Alternative hypothesis = there is an effect in either direction

We can use statistical tests to compare these hypotheses. If the data we observe would be very unlikely were the null hypothesis true, we reject the null hypothesis in favour of the alternative hypothesis.

For instance, if we wanted to know how effective chloramphenicol eye drops are compared with placebo in the treatment of infective conjunctivitis (see p. 25), the hypotheses would be:

H0: no effect of chloramphenicol eye drops over placebo
H1: eye drops are more or less effective than placebo

If we can reject H0 then we conclude that chloramphenicol has an effect. Notice we have not specified a direction of the effect for the eye drops. The effect could be to make the conjunctivitis worse or better in the children. This is referred to as a two-tailed test. For some hypotheses we may state in advance that the treatment effect is specified in one direction: a one-tailed test.

The steps in testing a hypothesis
1. State the null hypothesis H0.
2. Choose the test statistic that summarizes the data.
3. Based on the H0, calculate the probability of getting the value of the test statistic.
4. Interpret the P value.
Step 1: It is essential to state clearly your primary null hypothesis and its alternative.
Step 2: Choosing the test statistic is a major step in statistics; subsequent chapters will focus on the most appropriate test statistic to use to answer a specific research question. In addition, we have provided flow charts to guide you through the minefield of choosing the appropriate test statistic for your data.
Step 3: Select the level of significance for your test statistic. The level of significance, when chosen before the statistical test is performed, is called the alpha (α) value. Conventional values used for α are 0.05 and 0.01. These values are small because we do not want to reject the null hypothesis when it is true.
α value: the probability of incorrectly rejecting the null hypothesis when it is actually true
Further to this we can use the P value. The P value is the probability of obtaining the result we got (or a more extreme result), if the null hypothesis is true. If you are having difficulty with this concept then it might be easier to think of the P value, loosely, as a measure of how easily chance alone could have produced the observed result. The interpretation is that if P is small, then it is more likely that the H0 is wrong. The P value is calculated after the test statistic is computed, and if it is lower than the α value the null hypothesis is rejected.
P value: the probability of obtaining the result if the null hypothesis is true
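[Figure: areas of acceptance and rejection under a normal distribution. Panel (a): two-tailed test; panel (b): one-tailed upper test. Horizontal axis: value of test statistic; shaded rejection area: P < 0.05]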
Step 4: Determine the critical value of the test statistic: at what value do we reject the null hypothesis? The figure defines areas of rejection and acceptance in a normal distribution. Figure (a) illustrates a two-tailed nondirectional probability distribution: in the blue shaded area the P value is less than 0.05, so we can reject the null hypothesis and say the results are significant at the 5% level. Figure (b) gives an illustration of a one-tailed upper test. If the P value is greater than 0.05 then there is insufficient evidence to reject the null hypothesis.
Step 5: State the appropriate conclusion from the statistical testing procedure.
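The five steps can be traced in a short Python sketch. It assumes a two-tailed z test (a large-sample approximation; with small samples and unknown variance a t test would be the usual choice), and the null value of 94.0 kg and the function name two_tailed_z_test are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def two_tailed_z_test(estimate, null_value, standard_error, alpha=0.05):
    """Two-tailed z test of H0: true value == null_value.

    Returns (z statistic, P value, critical value, reject H0?).
    """
    std_normal = NormalDist()
    # Step 2: the test statistic summarizes how far the data are from H0.
    z = (estimate - null_value) / standard_error
    # P value: probability of a result at least this extreme if H0 is true.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Step 4: critical value for the alpha chosen in step 3.
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Step 5: reject H0 when P < alpha (equivalently |z| > critical value).
    return z, p_value, critical, p_value < alpha


# Step 1 (hypothetical H0): the true mean weight is 94.0 kg.
z, p, crit, reject = two_tailed_z_test(96.3, 94.0, 0.76, alpha=0.05)
print(f"z = {z:.2f}, P = {p:.4f}, critical value = {crit:.2f}, reject H0: {reject}")

# A sample mean closer to the null value does not lead to rejection.
z2, p2, _, reject2 = two_tailed_z_test(95.0, 94.0, 0.76, alpha=0.05)
print(f"z = {z2:.2f}, P = {p2:.4f}, reject H0: {reject2}")
```

Comparing the P value with α and comparing the test statistic with the critical value are two views of the same decision, so the function returns both.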
Errors in hypothesis testing
Type I error: we reject the null hypothesis when it is true. For example, we say that a treatment is beneficial when in fact it isn’t. Thus to limit type I errors we set the significance level of the test, the α value, at a low level.
The α value is the probability of a type I error
Type II error: we do not reject the null hypothesis when it is false. For example, we think that a treatment is not beneficial when it is. The chance of making a type II error is generally denoted by β.
The β value is the probability of a type II error
The power of a test is the probability of rejecting the null hypothesis when it is false (1 – β). Thus high power is a good thing to have because ideally we would want to detect an effect if it is really present.
Statistical power
Power = 1 – β
In the Rose et al. study (see Chapter 5, p. 28) you will see the following in the methods: ‘The initial planned sample size (n = 500) cited in the original protocol was sufficient to detect this difference with a power of 80%, α = 0.05 using a two-tailed test based on a placebo cure rate of 72% and a prevalence of bacterial events of 60%.’
After the study is conducted, ‘post hoc’ power calculations should not be needed. Once the size of the effect is known, confidence intervals should be used to state the likely error of the study estimate.
Factors that affect the power:
1. Sample size: power increases with increasing sample size and thus a large sample has a greater ability to detect an important effect.
2. Effect size: the larger the effect, the easier it will be to detect it (higher power).
3. Significance level: power is greater if the significance level (the α value) is larger. So as the probability of a type I error increases, the probability of a type II error decreases (everything else staying equal).
So if the α value is changed from 0.01 to 0.05 at the outset then the power of the study increases, but so does the probability of rejecting the null hypothesis when it is true (type I error).
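The effect of the three factors on power can be illustrated numerically. The sketch below assumes a two-tailed comparison of two means with equal group sizes and uses a normal approximation; the function name approx_power and the example values are illustrative assumptions, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(std_diff, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed comparison of two means.

    Normal approximation: power ~ Phi(d * sqrt(n / 2) - z), where d is the
    standardized difference, n the size of each group and z the two-tailed
    critical value for alpha. The tiny chance of rejecting in the wrong
    direction is ignored.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(std_diff * sqrt(n_per_group / 2) - z_alpha)


baseline = approx_power(0.5, 64, alpha=0.05)
larger_sample = approx_power(0.5, 128, alpha=0.05)   # 1. sample size
larger_effect = approx_power(0.8, 64, alpha=0.05)    # 2. effect size
stricter_alpha = approx_power(0.5, 64, alpha=0.01)   # 3. significance level

print(f"baseline power:       {baseline:.2f}")
print(f"larger sample:        {larger_sample:.2f}")
print(f"larger effect:        {larger_effect:.2f}")
print(f"alpha 0.01, not 0.05: {stricter_alpha:.2f}")
```

Under these assumptions, power rises with sample size and effect size and falls when α is made smaller.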
Calculating sample sizes: Different equations are appropriate for different types of data and different types of questions, but sample size calculations are generally based on the following general equation:
n = (Two standard deviations / Size of effect)²
For unpaired t-tests and chi-squared tests we can use Lehr’s formula:
n = 16 / (Standardized difference)²
which gives the sample size needed in each group for a power of 80% and a two-sided significance of 0.05. We can try available programs online: for power calculations try the PS program, downloadable free from Vanderbilt University’s Department of Biostatistics (http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize).
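Lehr’s formula is simple enough to check by hand or in code. The sketch below is a minimal illustration: the function names and the example figures (a difference of 5 kg with an SD of 10 kg) are hypothetical, and the result is rounded up to a whole number of participants.

```python
from math import ceil


def standardized_difference(difference, sd):
    """Size of the effect expressed in standard deviation units."""
    return difference / sd


def lehr_sample_size(std_diff):
    """Lehr's formula: sample size per group for 80% power and a
    two-sided significance level of 0.05, rounded up."""
    return ceil(16 / std_diff ** 2)


# Hypothetical example: detect a 5 kg difference when the SD is 10 kg.
d = standardized_difference(5.0, 10.0)
n_per_group = lehr_sample_size(d)
print(f"standardized difference = {d}, n per group = {n_per_group}")
```

Because the standardized difference is squared, halving the effect to be detected quadruples the required sample size.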
We could use tables: see Machin D. et al. Sample Size Tables for Clinical Studies, 2nd edn. Oxford: Blackwell Publishing, 1997.
We could use Altman’s nomogram: a nomogram that links the power of a study to the sample size. It is designed to compare the means of two independent samples of equal size.
[Figure: Altman’s nomogram, with scales for standardized difference, N and power, read at significance levels of 0.05 and 0.01] Reproduced from Br Med J 1980;281:1336–38, with permission from the BMJ publishing group.
We could use general formulas, which are necessary in some situations.
Choosing which measure and test to use
Incidence: number of new cases in a given time interval
Incidence rate: incidence/amount of person-time (e.g. person-years) at risk
Cumulative incidence: incidence/number initially disease free
Hazard rate: number of expected events in an instant (time dependent)
Prevalence: number of cases at a given time
Point prevalence: number of cases at one time-point
Period prevalence: number of cases during a period of time
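The incidence and prevalence definitions above reduce to simple ratios. The following sketch uses a made-up cohort purely for illustration; the function names and all counts are hypothetical.

```python
def incidence_rate(new_cases, person_years_at_risk):
    """New cases per person-year at risk."""
    return new_cases / person_years_at_risk


def cumulative_incidence(new_cases, initially_disease_free):
    """Proportion of initially disease-free people who become cases."""
    return new_cases / initially_disease_free


def point_prevalence(existing_cases, population):
    """Proportion of the population who are cases at one time-point."""
    return existing_cases / population


# Hypothetical cohort: 1000 disease-free people contribute 2900
# person-years of follow-up, during which 30 new cases occur.
rate = incidence_rate(30, 2900)
cum_inc = cumulative_incidence(30, 1000)

# Hypothetical survey: 50 existing cases among 2000 people on one day.
prevalence = point_prevalence(50, 2000)

print(f"incidence rate:       {rate * 1000:.1f} per 1000 person-years")
print(f"cumulative incidence: {cum_inc:.1%}")
print(f"point prevalence:     {prevalence:.1%}")
```

The incidence rate has person-time in its denominator, so it is a rate with units; cumulative incidence and prevalence are proportions.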
Relative measures: for measuring the ratio of one event or one variable to another
Hazard ratio: the effect of an explanatory variable on the hazard (risk) of an event
Odds ratio: the ratio of the odds of an event occurring in one group to the odds of it occurring in another group
Relative risk: the ratio of the risk of an event occurring in one group to the risk of it occurring in another group
Relative risk reduction: the reduction in the risk made by the intervention compared with the risk of the event in the control group
Absolute measures: for measuring the absolute difference between one event or one variable and another
Absolute risk difference: the difference in the risk for an event between two groups (e.g. exposed vs unexposed populations)
Attributable risk: the proportion of an event in those exposed to a specific risk factor that can be attributed to exposure to that factor
Number needed to treat: the average number of patients who need to receive the intervention to prevent one bad outcome
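Several of these relative and absolute measures can be computed from the same two-group table of counts. The sketch below assumes a trial in which the event is a bad outcome; the function name risk_measures and the counts (10 events in 100 treated patients, 20 in 100 controls) are hypothetical.

```python
def risk_measures(events_treated, n_treated, events_control, n_control):
    """Relative and absolute measures from two-group event counts.

    Assumes neither risk is 0 or 1 and that the two risks differ,
    otherwise some of the ratios below are undefined.
    """
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    odds_treated = risk_treated / (1 - risk_treated)
    odds_control = risk_control / (1 - risk_control)
    risk_difference = risk_control - risk_treated
    return {
        "relative_risk": risk_treated / risk_control,
        "odds_ratio": odds_treated / odds_control,
        "relative_risk_reduction": risk_difference / risk_control,
        "absolute_risk_difference": risk_difference,
        "number_needed_to_treat": 1 / risk_difference,
    }


# Hypothetical trial: 10/100 bad outcomes on treatment, 20/100 on control.
measures = risk_measures(10, 100, 20, 100)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the same trial can be described as a halving of risk (relative) or as one bad outcome prevented for every ten patients treated (absolute).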
Hypothesis testing and assessing differences
Between means for continuous data
  > two samples
    Non parametric: Kruskal Wallis
    Parametric: ANOVA
  Two samples
    Non parametric: Sign test for related samples‡; Rank sum test for independent samples†
    Parametric: T test difference for related samples; T test for independent samples
Between distributions
  Independence between two or more categorical variables: McNemar’s test for related groups; χ² test for independence*
  Between one observed variable and a theoretical distribution: χ² test for goodness of fit
Two samples: z score equal proportions
z score
§ Generalized linear models encompass ANOVA, ANCOVA, linear regression, logistic regression and Poisson regression and should be considered when other variables in your data may affect the outcome.
‡ The Wilcoxon signed rank test can be used instead of the sign test to take into account the magnitude of the difference as well as the direction.
† Mann–Whitney U-statistic is the most commonly used method. However, the Wilcoxon rank sum test (unpaired) is equivalent.
*Fisher’s exact test can be used instead of the χ² test when assessing independence when the number of events per cell in the contingency table is very low (