Textbook in Psychiatric Epidemiology
Textbook in Psychiatric Epidemiology, Third Edition. Edited by Ming T. Tsuang, Mauricio Tohen and Peter B. Jones © 2011 John Wiley & Sons, Ltd. ISBN: 978-0-470-69467-1
Textbook in Psychiatric Epidemiology Edited by
Ming T. Tsuang Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, USA; Harvard Institute of Psychiatric Epidemiology & Genetics, Harvard School of Public Health, Boston, USA
Mauricio Tohen Department of Psychiatry, University of Texas Health Science Centre at San Antonio, USA
Peter B. Jones Department of Psychiatry, University of Cambridge, UK
THIRD EDITION
A John Wiley & Sons, Ltd., Publication
This edition first published 2011, © 2011 John Wiley & Sons, Ltd. Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical and Medical business with Blackwell Publishing. Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK Other Editorial Offices: 9600 Garsington Road, Oxford, OX4 2DQ, UK 111 River Street, Hoboken, NJ 07030-5774, USA For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought. 
The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting a specific method, diagnosis, or treatment by physicians for any particular patient. The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. Readers should consult with a specialist where appropriate. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom. Library of Congress Cataloguing-in-Publication Data Textbook of psychiatric epidemiology / [edited by] Ming T. Tsuang, Mauricio Tohen, Peter B. Jones. – 3rd ed. p. ; cm. Rev. ed. of: Textbook in psychiatric epidemiology / edited by Ming T. Tsuang, Mauricio Tohen. 2nd ed. c2002. Includes bibliographical references and index. 
ISBN 978-0-470-69467-1 (cloth) 1. Psychiatric epidemiology. I. Tsuang, Ming T., 1931- II. Tohen, Mauricio, 1951- III. Jones, Peter B. (Peter Brian), 1960- IV. Textbook in psychiatric epidemiology. [DNLM: 1. Epidemiologic Methods. 2. Mental Disorders–epidemiology. 3. Mental Disorders–diagnosis. WM 140] RC455.2.E64T49 2011 362.2'0422–dc22 2010046396 ISBN: 978-0-470-69467-1 A catalogue record for this book is available from the British Library. This book is published in the following electronic formats: ePDF: 978-0-470-97672-2; Wiley Online Library: 978-0-470-97673-9; ePub: 978-0-470-97740-8. Set in 9/12pt Sabon by Laserwords Private Limited, Chennai, India First
2011
Contents
List of Contributors, xi 1 Introduction to epidemiologic research methods, 1 Glyn Lewis 1.1 What is epidemiology? 1 1.2 Causation in medicine, 2 1.3 Causal inference, 6 1.4 The future for psychiatric epidemiology, 7 References, 7 2 Analysis of categorical data: The odds ratio as a measure of association and beyond, 9 Garrett M. Fitzmaurice and Caitlin Ravichandran 2.1 Introduction, 9 2.2 Inference for a single proportion, 10 2.3 Analysis of 2 × 2 contingency tables, 11 2.4 Analysis of sets of 2 × 2 contingency tables, 16 2.5 Logistic regression, 18 2.6 Advanced topics, 25 2.7 Concluding remarks, 29 2.8 Further reading, 29 References, 29 3 Genetic epidemiology, 31 Stephen V. Faraone, Stephen J. Glatt and Ming T. Tsuang 3.1 Introduction, 31 3.2 The chain of psychiatric genetic research, 31 3.3 Psychiatric genetics and psychiatric epidemiology, 44 Acknowledgements, 45
References, 45 Further reading, 47 4 Examining gene–environment interplay in psychiatric disorders, 53 Judith Allardyce and Jim van Os 4.1 Introduction, 53 4.2 The process of genetic epidemiology, 54 4.3 Gene–environment interplay takes different forms, 55 4.4 Gene–environment correlation, 55 4.5 Gene–environment interaction, 58 4.6 Measurement of genotype, environmental exposure and pathological phenotype, 58 4.7 Models of GxE, 60 4.8 Which scale should we use to measure GxE? 61 4.9 Study designs for the detection of GxE, 62 4.10 Threats to the validity of epidemiological GxE studies, 65 4.11 Epigenetic mechanisms, 67 References, 67 5 Reliability, 73 Patrick E. Shrout 5.1 Introduction, 73 5.2 The reliability coefficient, 73 5.3 Designs for estimating reliability, 74 5.4 Statistical remedies for low reliability, 76 5.5 Reliability theory and binary judgements, 77 5.6 Reliability statistics: General, 78 5.7 Other reliability statistics, 82 5.8 Summary and conclusions, 83 References, 83
6 Moderators and mediators: Towards the genetic and environmental bases of psychiatric disorders, 87 Helena Chmura Kraemer 6.1 Introduction, 87 6.2 Current methodological barriers, 89 6.3 Moderation, mediation and other ways in which risk factors ‘work together’, 92 6.4 Extensions, 94 6.5 Beyond moderators and mediators, 95 References, 96 7 Validity: Definitions and applications to psychiatric research, 99 Jill M. Goldstein, Sara Cherkerzian and John C. Simpson 7.1 Introduction, 99 7.2 Validity of a construct, 99 7.3 Validity of the relationships between variables, 110 7.4 Summary, 112 Acknowledgements, 113 References, 113 8 Use of register data for psychiatric epidemiology in the Nordic countries, 117 Jouko Miettunen, Jaana Suvisaari, Jari Haukka and Matti Isohanni 8.1 Introduction, 117 8.2 Registers for use in psychiatric research, 118 8.3 Register research in Denmark, 122 8.4 Register research in Finland, 123 8.5 Register research in Norway, 124 8.6 Register research in Sweden, 125 8.7 Discussion, 126 Acknowledgements, 127 References, 127 9 An introduction to mental health services research, 133 Anna Fernández, Alejandra Pinto-Meza, Antoni Serrano-Blanco, Jordi Alonso and Josep Maria Haro 9.1 Introduction, 133 9.2 What is mental health services research? 134 9.3 A framework for mental health services research, 135
9.4 Key concepts in mental health services research, 137 9.5 Examples of mental health services research studies, 141 9.6 Conclusion, 152 References, 152 10 The pharmacoepidemiology of psychiatric medications, 155 Philip S. Wang, Alan M. Brookhart, Christine Ulbricht and Sebastian Schneeweiss 10.1 Introduction, 155 10.2 Overview of psychopharmacoepidemiology, 156 10.3 Sources of data, 157 10.4 Examples of recent psychopharmacoepidemiologic studies, 159 10.5 Conclusions, 162 Acknowledgements, 163 References, 163 11 Peering into the future of psychiatric epidemiology, 167 Michaeline Bresnahan, Ezra Susser, Dana March and Bruce Link 11.1 Introduction, 167 11.2 Levels of causation: A historical overview, 167 11.3 Levels of causation, 169 11.4 Causation over (life) time, 172 11.5 Examples, 174 11.6 Framing the future, 178 References, 179 12 Studying the natural history of psychopathology, 183 William W. Eaton 12.1 Introduction, 183 12.2 Onset, 183 12.3 Course, 188 12.4 Outcome, 191 12.5 Methodological concepts for studying the natural history of psychopathology, 192 12.6 Conclusion, 195 Acknowledgements, 195 References, 195
13 Symptom scales and diagnostic schedules in adult psychiatry, 199 Jane M. Murphy 13.1 Introduction, 199 13.2 North American instruments for epidemiological research, 202 13.3 North American instruments for psychiatric services and primary care, 205 13.4 European instruments for psychiatric services and primary care, 206 13.5 European instruments for epidemiological research, 208 13.6 Summary, 210 Acknowledgements, 212 References, 212 14 The National Comorbidity Survey (NCS) and its extensions, 221 Ronald C. Kessler 14.1 Introduction, 221 14.2 The baseline NCS, 221 14.3 The NCS follow-up survey (NCS-2), 227 14.4 The NCS replication survey (NCS-R), 229 14.5 The NCS-R adolescent supplement (NCS-A), 233 14.6 The WHO WMH Surveys, 234 14.7 Overview, 236 Acknowledgements, 237 References, 237 15 Experimental epidemiology, 243 John R. Geddes 15.1 Introduction, 243 15.2 Limitations of non-randomised evidence, 243 15.3 RCTs: The translation of the experimental design into the real world, 245 15.4 Importance and control of systematic error or bias, 245 15.5 Importance and control of random error and noise, 248 15.6 Reporting the results of clinical trials—the CONSORT statement, 248 15.7 Different clinical questions will prioritise control of different threats to validity and confidence, 248
15.8 The classification of RCTs, 251 15.9 Effectiveness trials in schizophrenia, 255 15.10 Department of Veterans Affairs co-operative study on the cost-effectiveness of Olanzapine (Rosenheck), 255 15.11 The clinical antipsychotic trials of intervention effectiveness (CATIE) study, 257 15.12 Cost utility of the latest antipsychotic drugs in schizophrenia study (CUtLASS 1), 257 15.13 European first-episode schizophrenia trial (EUFEST), 258 15.14 The size and cost of experimental studies in psychiatry, 259 15.15 Clinical trials in the future, 259 References, 259 16 Epidemiology of Schizophrenia, 263 William W. Eaton, Chuan-Yu Chen and Evelyn J. Bromet 16.1 Introduction, 263 16.2 Methods, 263 16.3 The burden of schizophrenia, 264 16.4 Natural history, 265 16.5 Demographic correlates, 268 16.6 Social risk factors, 269 16.7 Biological risk factors, 272 16.8 Prevention, 279 16.9 Discussion, 280 References, 280 17 Epidemiology of depressive disorders, 289 Deborah S. Hasin, Miriam C. Fenton and Myrna M. Weissman 17.1 Introduction, 289 17.2 Major depression, 290 17.3 Dysthymia, 302 17.4 Summary, 304 Appendix 17.A Measurement of major depression in the NLAES and NESARC, 304 References, 305 18 Epidemiology of anxiety disorders, 311 Ewald Horwath, Felicia Gould and Myrna M. Weissman 18.1 Introduction, 311
18.2 Anxiety disorders, 311 18.3 Panic disorder, 313 18.4 Agoraphobia, 316 18.5 Social phobia, 317 18.6 Generalised anxiety disorder, 318 18.7 Obsessive–compulsive disorder, 319 18.8 Anxiety and affective disorders and mass disasters, 320 18.9 Future developments, 323 Acknowledgements, 323 References, 323 Further reading, 326 19 Epidemiology of bipolar disorder in adults and children, 329 Kathleen R. Merikangas and Mauricio Tohen 19.1 Introduction, 329 19.2 Epidemiology of bipolar disorder, 329 19.3 Patterns of comorbidity of bipolar disorder, 333 19.4 Risk Factors, 334 19.5 Future directions, 336 19.6 Summary, 338 References, 338 20 Epidemiology of eating disorders, 343 Tracey D. Wade, Anna Keski-Rahkonen and James I. Hudson 20.1 Introduction, 343 20.2 Case definition, 343 20.3 Major prevalence studies, 345 20.4 Incidence studies, 351 20.5 Comorbidity, 351 20.6 Mortality from eating disorders, 352 20.7 Risk factors, 352 20.8 Future directions, 355 References, 356 21 Epidemiology of alcohol use, abuse and dependence, 361 Deborah A. Dawson, Ralph W. Hingson and Bridget F. Grant 21.1 Introduction, 361 21.2 Population estimates of per capita consumption, 361 21.3 Survey-based estimates of the prevalence of drinking, 362 21.4 Alcohol-related mortality and morbidity, 365
21.5 Alcohol and injury, 365 21.6 Alcohol and chronic disease, 366 21.7 Diagnostic classification of alcohol use disorders, 367 21.8 Population estimates, prevalence, incidence and natural course of alcohol use disorders, 368 21.9 Comorbidity of DSM-IV alcohol use disorders and other psychiatric disorders, 371 21.10 Summary, 374 Acknowledgements, 375 References, 375 22 Epidemiology of illicit drug use disorders, 381 Wilson M. Compton, Marsha F. Lopez, Kevin P. Conway and Yonette F. Thomas 22.1 Introduction, 381 22.2 Drug consumption, 381 22.3 Definitions, 384 22.4 Rates of DSM-IV abuse and dependence, 384 22.5 Global rates of drug use disorders, 387 22.6 Comorbidities with psychiatric conditions, 388 22.7 Genetic epidemiology, 391 22.8 Future opportunities, 391 22.9 Conclusions, 394 22.10 Disclaimer, 394 References, 394 23 The epidemiology of personality disorders: Findings, methods and concepts, 401 Michael J. Lyons, Beth A. Jerskey and Margo R. Genderson 23.1 Introduction, 401 23.2 Substantive findings, 402 23.3 Course, prognosis and developmental issues, 404 23.4 Treated prevalence, 406 23.5 Prevalence of specific personality disorders, 407 23.6 Antisocial personality disorder, 410 23.7 Conceptual issues, 419 23.8 Models of personality disorder, 419 23.9 Methodological issues, 422 23.10 Future directions, 427 References, 428
24 The epidemiology of depression and anxiety in children and adolescents, 435 Kathleen Ries Merikangas and Erin F. Nakamura 24.1 Introduction, 435 24.2 Magnitude of depression and anxiety in children and adolescents, 435 24.3 Correlates and risk factors, 438 24.4 Service patterns and impact, 442 24.5 Summary, 443 References, 443 25 Epidemiology of attention deficit hyperactivity disorder, 449 Stephen V. Faraone 25.1 Introduction, 449 25.2 Prevalence of ADHD, 450 25.3 Pharmacoeconomics of ADHD, 451 25.4 Comorbid psychiatric disorders, 451 25.5 Demographic risk factors, 452 25.6 Genetic risk factors, 453 25.7 Environmental risk factors for ADHD, 457 25.8 Summary and conclusions, 460 25.9 Future directions, 460 Acknowledgements, 461 References, 461 26 The epidemiology of autism, 469 Gregory S. Liptak 26.1 Introduction, 469 26.2 Background, 469 26.3 Definition and diagnosis, 469 26.4 Natural history, 472 26.5 Prevalence, 473 26.6 Risk factors, 473 26.7 Genetic factors, 476 26.8 Public health impact, 476 26.9 Associations and causal factors, 477 26.10 Future directions, 477 26.11 Summary, 478 References, 478 27 Mental illness, women, mothers and their children, 483 Kathryn M. Abel and Vera A. Morgan 27.1 Introduction, 483
27.2 The epidemiology of mental illness in women of reproductive age, 484 27.3 Fertility and fecundity in women with mental illness, 487 27.4 Maternal mental illness at the time of conception and during pregnancy, 488 27.5 Gene–environment interactions and offspring outcomes, 493 27.6 Obstetric complications and risk of adult onset mental disorder in offspring, 493 27.7 Parental condition, 496 27.8 Motherhood and perinatal mental illness, 500 27.9 Designing studies examining the relationship between maternal mental illness and outcomes for their children, 504 27.10 Conclusions, 507 References, 507 Further reading, 515 28 Epidemiology of suicide and attempted suicide, 517 Dianne Currier and Maria A. Oquendo 28.1 Introduction, 517 28.2 Definitions, 517 28.3 Prevalence of suicide and attempted suicide, 518 28.4 Risk factors for suicide and attempted suicide, 520 28.5 Protective factors, 526 28.6 Conclusions, 526 Acknowledgements, 526 References, 526 29 Epidemiology and geriatric psychiatry, 535 Celia F. Hybels and Dan G. Blazer 29.1 Introduction, 535 29.2 Issues of case identification, 535 29.3 The distribution of cases, 536 29.4 Aetiological studies, 544 29.5 Outcome studies, 547 29.6 Historical trends in the epidemiology of psychiatric disorders in late life, 549 29.7 Use of health care services, 550 References, 550
30 Recent epidemiological studies of psychiatric disorders in Japan, 559 Masayoshi Kawai, Kenji J. Tsuchiya and Nori Takei 30.1 Introduction, 559 30.2 Schizophrenia, 560 30.3 Affective disorders, 566 30.4 Autism and autism spectrum disorder, 569 30.5 Summary, 572 References, 573 31 Epidemiology of migration and serious mental illness: The example of migrants to Europe, 579 Monica Charalambides, Craig Morgan and Robin M. Murray 31.1 Introduction, 579 31.2 Defining the constructs, 579 31.3 High rates of psychosis in migrants: A genuine finding or methodological artefact?, 581 31.4 Possible explanations, 584 31.5 Biological considerations, 585 31.6 Cannabis use, 586 31.7 Adverse social experiences, 586 31.8 Mechanisms, 589 31.9 Implications, 590 References, 591 32 Epidemiology of migration substance use disorder in Latin American populations and migration to the United States, 595 María Elena Medina-Mora, Guilherme Borges, Tania Real and Jorge Villatoro 32.1 Introduction, 595
32.2 Definitions: What do we understand by migration?, 595 32.3 Countries of origin: Social, political and other reasons that trigger migration, 597 32.4 Living conditions of migrants in the United States, 599 32.5 Alcohol and drug use in countries of origin and receiving communities, 600 32.6 Dependence and treatment rates, 604 32.7 The process of migrating, 606 32.8 Migration, substance use and access to services, 608 32.9 Returning migrants and families left behind, 611 32.10 Conclusions, 611 References, 611 33 Early detection and intervention as approaches for preventing schizophrenia, 617 Ming T. Tsuang, William S. Stone, Margo Genderson and Michael Lyons 33.1 Introduction, 617 33.2 Modelling genetic and phenotypic heterogeneity, 618 33.3 Defining a syndrome of liability using cognitive and clinical characteristics of relatives, 620 33.4 Gene-based vs. genome-based research, 624 33.5 Future directions, 626 33.6 Clinical implications, 627 Acknowledgements, 627 References, 628 Index, 633
List of Contributors
Kathryn M. Abel Centre for Women’s Mental Health, 3rd Floor East, Jean McFarlane Bdg, University of Manchester, Oxford Road, Manchester M13 9PL, UK
Judith Allardyce Department of Psychiatry and Neuropsychology, School for Mental Health and Neuroscience, South Limburg Mental Health Research and Teaching Network, EURON, SEARCH, Maastricht University Medical Centre, PO BOX 616 (VIJV), 6200 MD Maastricht, The Netherlands
Jordi Alonso Carrer del Doctor Aiguader, 88, Edifici PRBB, E-08003 Barcelona, Spain
Dan G. Blazer Department of Psychiatry and Behavioral Sciences, Center for the Study of Aging and Human Development, Box 3003, Duke University Medical Center, Durham NC 27710, USA
Guilherme Borges Ramon de la Fuente National Institute of Psychiatry, Calzada Mexico Xochimilco 101, DF 14370, Mexico
Michaeline Bresnahan Mailman School of Public Health, Columbia University, Department of Epidemiology, 600 West 168th Street, New York NY 10032, USA
Evelyn J. Bromet Dept of Psychiatry, SUNY Stony Brook University, Putnam Hall – South Campus, Stony Brook NY 11794-8790, USA
Alan M. Brookhart Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women’s Hospital, Harvard Medical School, Boston MA 02115, USA
Monica Charalambides Institute of Psychiatry, Psychological Medicine, King’s Institute, De Crespigny Park, Denmark Hill, London SE5 8AF, UK
Chuan-Yu Chen National Health Research Institutes, Institute of Population Health Sciences, Division of Mental Health and Addiction Medicine, Taiwan
Sara Cherkerzian Departments of Psychiatry and Medicine at Brigham and Women’s Hospital (BWH), Harvard Medical School, Boston, MA, USA
Wilson M. Compton Division of Epidemiology, Services, and Prevention Research, National Institute on Drug Abuse, 6001 Executive Blvd., MSC 9589, Bethesda MD 20892-9589, USA
Kevin P. Conway Division of Epidemiology, Services, and Prevention Research, National Institute on Drug Abuse, 6001 Executive Blvd., MSC 9589, Bethesda MD 20892-9589, USA
Dianne Currier Division of Molecular Imaging and Neuropathology, Department of Psychiatry, Columbia University, New York, NY 10032, USA
Deborah A. Dawson Laboratory of Epidemiology and Biometry, Division of Intramural Clinical and Biological Research, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Suite 514, Willco Building, 6000 Executive Boulevard, MSC 7003, Bethesda MD 20892-7003, USA
William W. Eaton Dept of Mental Health, Bloomberg School of Public Health, Johns Hopkins School of Hygiene and Public Health, Johns Hopkins University, 615 N. Wolfe Street, Baltimore MD 21205, USA
Stephen V. Faraone Center for NeuroPsychiatric Genetics, SUNY Upstate Medical University, Weiskotten Hall 3285, Syracuse NY 13210, USA
Miriam C. Fenton New York State Psychiatric Institute 1051 Riverside Drive, New York, NY 10032, USA
Anna Fernández Fundació Sant Joan de Déu Research and Development Unit, Dr. Antoni Pujadas, 42., 08830 Sant Boi de Llobregat, Barcelona, Spain
Garrett Fitzmaurice Laboratory for Psychiatric Biostatistics, McLean Hospital, Belmont, MA, USA
John R. Geddes Oxford University, Department of Psychiatry, Warneford Hospital, Oxford OX3 7JX, UK
Margo R. Genderson Department of Psychology, Boston University, Boston, MA 02215, USA
Stephen J. Glatt Departments of Psychiatry and Behavioral Sciences and Neuroscience and Physiology, Medical Genetics Research Center, SUNY Upstate Medical University, NY, USA
Jill M. Goldstein Departments of Psychiatry and Medicine at Brigham and Women’s Hospital (BWH), Harvard Medical School, Boston MA 02115, USA
Felicia Gould Department of Psychiatry & Behavioral Sciences, Miller School of Medicine, University of Miami, 3225 Aviation Ave., Suite 303, Miami FL 33133, USA
Bridget F. Grant Laboratory of Epidemiology and Biometry, Division of Intramural Clinical and Biological Research, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, NIAAA/LEB, 5635 Fishers Lane, Bethesda MD 20892-9304, USA
Josep Maria Haro Fundació Sant Joan de Déu Research and Development Unit, Dr. Antoni Pujadas, 42., 08830 Sant Boi de Llobregat, Barcelona, Spain
Deborah S. Hasin Department of Psychiatry, Columbia University College of Physicians and Surgeons, New York NY 10032, USA
Jari Haukka National Institute for Health and Welfare, Department of Mental Health and Substance Abuse Services, P.O. Box 30, FI 00271 Helsinki, Finland
Ralph W. Hingson Division of Epidemiology and Prevention Research, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, NIAAA/LEB, 5635 Fishers Lane, Bethesda MD 20892-9304, USA
Ewald Horwath Department of Psychiatry, Epidemiology and Public Health, Miller School of Medicine, University of Miami, MHHC, Suite 3100, 1695 NW 9th Ave, Miami FL 33136, USA
James I. Hudson Psychiatric Epidemiology Research Program, Harvard Medical School/McLean Hospital, 115 Mill Street, Belmont MA 02478, USA
Celia F. Hybels Department of Psychiatry and Behavioral Sciences, Center for the Study of Aging and Human Development, Duke University Medical Center, Box 3003, Durham NC 27710, USA
Matti Isohanni Department of Psychiatry, University of Oulu, Finland
Beth A. Jerskey Instructor of Psychiatry and Human Behavior (Research), Alpert Medical School of Brown University, Providence RI 02912, USA
Peter B. Jones Department of Psychiatry, University of Cambridge, Box 189, Addenbrooke’s Hospital, Cambridge CB2 2QQ, UK
Masayoshi Kawai Research Centre for Child Mental Development, Hamamatsu University School of Medicine 1-20-1, Handayama, Higashi-Ku, Hamamatsu 431–3192, Japan
Anna Keski-Rahkonen Academy of Finland, University of Helsinki, Helsinki, Finland
Ronald C. Kessler Department of Health Care Policy, Harvard Medical School, 180 Longwood Avenue, Boston MA 02115, USA
Helena Chmura Kraemer Department of Psychiatry, Stanford University, Palo Alto, Stanford CA 94305, USA
Glyn Lewis Academic Unit of Psychiatry, University of Bristol, Cotham House, Cotham Hill, Bristol BS6 6JL, UK
Bruce Link Mailman School of Public Health, Columbia University, Department of Epidemiology, 600 West 168th Street, New York NY 10032, USA
Greg S. Liptak Center for Development, Behavior and Genetics, SUNY Upstate Medical University, Center for Children’s Health Policy, Syracuse NY 13210, USA
Marsha F. Lopez Division of Epidemiology, Services, and Prevention Research, National Institute on Drug Abuse, 6001 Executive Blvd., MSC 9589, Bethesda, MD 20892-9589, USA
Michael J. Lyons Department of Psychology, Boston University, Harvard Institute of Psychiatric Epidemiology and Genetics, 64 Cummington Street, Boston MA 02215, USA
Dana March Mailman School of Public Health, Columbia University, Department of Epidemiology, 600 West 168th Street, New York NY 10032, USA
María Elena Medina-Mora Ramon de la Fuente National Institute of Psychiatry, Calzada Mexico Xochimilco 101, DF 14370, Mexico
Kathleen R. Merikangas Genetic Epidemiology Research Branch, Intramural Research Program, NIMH, Porter Neuroscience Research Center, 35 Convent Dr., MSC 3720, Bethesda MD 20892-3720, USA
Jouko Miettunen Department of Psychiatry, University of Oulu, P.O.Box 5000, FI-90014 Helsinki, Finland
Craig Morgan Institute of Psychiatry, Psychological Medicine, King’s Institute, De Crespigny Park, Denmark Hill, London SE5 8AF, UK
Vera A. Morgan Neuropsychiatric Epidemiology Research Unit, School of Psychiatry & Clinical Neurosciences, The University of Western Australia, Medical Research Foundation Building, 50 Murray Street, Perth, WA 6000, Australia
Jane M. Murphy Professor, Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Room 215, 5 Longfellow Place, Boston MA 02114, USA
Robin M. Murray Institute of Psychiatry, Psychological Medicine, King’s Institute, De Crespigny Park, Denmark Hill, London SE5 8AF, UK
Erin F. Nakamura Genetic Epidemiology Research Branch, Intramural Research Program, NIMH, Porter Neuroscience Research Center, 35 Convent Dr., MSC 3720, Bethesda MD 20892-3720, USA
Maria A. Oquendo Division of Molecular Imaging and Neuropathology, Department of Psychiatry, Columbia University, NY, USA
Alejandra Pinto-Meza Fundació Sant Joan de Déu Research and Development Unit, Dr. Antoni Pujadas, 42., 08830 Sant Boi de Llobregat, Barcelona, Spain
Antoni Pujadas Fundació Sant Joan de Déu Research and Development Unit, Dr. Antoni Pujadas, 42., 08830 Sant Boi de Llobregat, Barcelona, Spain
Caitlin Ravichandran Laboratory for Psychiatric Biostatistics, McLean Hospital, 115 Mill St, Belmont MA 02478, USA
Tania Real Ramon de la Fuente National Institute of Psychiatry, Calzada Mexico Xochimilco 101, DF 14370, Mexico
Rebeca Robles Ramon de la Fuente National Institute of Psychiatry, Calzada Mexico Xochimilco 101, DF 14370, Mexico
Sebastian Schneeweiss Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women’s Hospital, Harvard Medical School, Boston MA 02115, USA
Antoni Serrano-Blanco Fundació Sant Joan de Déu Research and Development Unit, Dr. Antoni Pujadas, 42., 08830 Sant Boi de Llobregat, Barcelona, Spain
Patrick E. Shrout Department of Psychology, New York University, 6 Washington Place, Rm 455, New York NY 1003, USA
John Simpson Department of Psychiatry at VA Boston Healthcare System, Harvard Medical School Department of Psychiatry, Boston MA 02215, USA
William S. Stone Harvard Institute of Psychiatric Epidemiology and Genetics, Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
Ezra Susser Mailman School of Public Health, Columbia University, Department of Epidemiology, 600 West 168th Street, New York NY 10032, USA
Jaana Suvisaari Department of Mental Health and Substance Abuse Services, National Institute for Health and Welfare, P.O.Box 30, FI 0027 Helsinki, Finland
Nori Takei Research Centre of Child Mental Development, Hamamatsu University School of Medicine 1-20-1, Handayama, Higashi-Ku, Hamamatsu 431–3192, Japan
Yonette F. Thomas Howard University, Office of the Vice President for Research and Compliance (OVPRC), C.B. Powell Building, Suite 137, 525 Bryant Street, N.W., Washington DC 20059, USA
Mauricio Tohen Division of Mood and Anxiety Disorders, University of Texas Health Science Center at San Antonio, 7526 Louis Pasteur Drive, San Antonio TX 78229-3900, USA
Ming T. Tsuang Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, 9500 Gilman Drive, La Jolla CA 92039, USA
Kenji J. Tsuchiya Research Centre for Child Mental Development, Hamamatsu University School of Medicine 1-20-1, Handayama, Higashi-Ku, Hamamatsu 431–3192, Japan
Christine Ulbricht Division of Services and Intervention Research, National Institute of Mental Health, 6001 Executive Blvd, Room 7151, MSC 9629, Bethesda MD 20892-9663, USA
Jim van Os Department of Psychiatry and Neuropsychology, School of Mental Health and Neuroscience, South Limburg Mental Health Research and Teaching Network, Maastricht University Medical Centre, PO Box 616, Vijverdal, 6200 MD, Maastricht, The Netherlands
Jorge Villatoro Ramon de la Fuente National Institute of Psychiatry, Calzada Mexico Xochimilco 101, DF 14370, Mexico
Tracey D. Wade School of Psychology, Flinders University, GPO Box 2100, Adelaide SA 5001, Australia
Philip S. Wang Division of Services and Intervention Research, National Institute of Mental Health, 6001 Executive Blvd, Room 7151, MSC 9629, Bethesda, MD 20892-9663, USA
Myrna M. Weissman Department of Psychiatry, Columbia University, College of Physicians and Surgeons, New York State Psychiatric Institute, 1051 Riverside Drive, New York NY 10032, USA
1 Introduction to epidemiologic research methods
Glyn Lewis
Academic Unit of Psychiatry, University of Bristol, Bristol, UK
1.1 What is epidemiology? Epidemiology, according to Last’s Dictionary of Epidemiology, is ‘The study of the distribution and determinants of health-related states or events in specified populations and the application of this study to control of health problems’ [1]. Wikipedia states ‘Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation of interventions made in the interest of public health and preventive medicine’. Rothman and Greenland [2] after observing ‘there seem to be more definitions of epidemiology than epidemiologists’ fulfil their own observation by creating a new definition: ‘the ultimate goal of most epidemiologic research is the elaboration of causes that can explain patterns of disease occurrence’ [2], thereby narrowing the focus of the subject on aetiology. John Snow is usually credited with creating epidemiology as a result of his work in the 1840s associating cholera with contaminated water from the River Thames in London [3]. It was only in the second half of the twentieth century that epidemiological methods began to be consistently applied to the whole range of health problems. Before that time, most of the focus was on infectious disease, though there were exceptions, such as pellagra [4]. Rothman coined the term ‘modern epidemiology’ [5] to reflect the increasing understanding of population based research after the second world war and the increase in its application. The Framingham Heart Study was started in 1949 and Bradford Hill,
amongst his other contributions, conducted the first randomised controlled trial (RCT) in medicine in 1948 [4]. This postwar era is the most important from the perspective of psychiatry. In this period the terms ‘chronic disease epidemiology’ or ‘risk factor epidemiology’ have been used to describe the extension of epidemiological methods to noninfectious disease. It is during this period that, in the main, psychiatric epidemiology has developed, often learning from epidemiologists studying heart disease and cancer. Epidemiologists get involved in studies with a variety of uses [6] including straightforward description, as well as the studies of aetiology that Rothman and Greenland mention in their definition. However, most definitions of epidemiology appear, at least at first sight, to leave out RCTs and systematic reviews yet many epidemiologists also carry out such studies. The use of the term clinical epidemiology [5] reflects this broadening of epidemiological methods into the care of patients, the validity of diagnostic tests and clinical decision making [7]. Epidemiologists have been at the heart of the evidence-based medicine movement [8] and thinking about how research findings are best transferred to clinical practice. And finally, ‘genetic epidemiology’ [6] is the creation of a marriage between epidemiology and genetics. It is designed to exploit molecular genetics and the technological advances that have enabled rapid characterisation of a person’s genetic makeup. Epidemiology has increased its scope and remit within medicine and psychiatric epidemiology is
a reflection of these imperialistic tendencies. At times, it is difficult to decide where epidemiology ends and ‘other’ clinical research begins; it is a matter of emphasis. Epidemiologists tend to be more oriented towards the study of common conditions of public health importance and are more interested in making inferences about whole populations. In epidemiology, there is more emphasis on establishing causal relationships than on understanding the mechanisms that might underpin those relationships, even though, when possible, epidemiological methods are also needed for investigating mechanisms. This concern with causation has led epidemiologists to emphasise the importance of RCTs to evaluate treatments and to summarise evidence using systematic reviews. So perhaps those definitions of epidemiology quoted above are sufficient and adequately cover the remit and scope of the discipline.
1.1.1 Psychiatric epidemiology
Psychiatric epidemiology is simply the epidemiology of psychiatric disorders – no more, no less. The principles and practice are the same when studying psychiatric disorder as they are when studying other medical conditions. Understanding the epidemiological principles and methods developed for physical disease will inform our epidemiological study of psychiatric disorder. Good psychiatric practice requires attention paid to biological, psychological and social factors. The same can be said for psychiatric epidemiology. When studying aetiology or evaluating treatments, epidemiological research is testing hypotheses about cause or treatment based upon a theory relating biological, psychological or social factors to illness or recovery. Understanding the mechanisms underlying disease and treatment is therefore critical in interpreting data from epidemiological and clinical studies. However, it is important to acknowledge that epidemiology is often limited in investigating mechanisms as epidemiological studies often involve measurements that are remote from the mechanisms that are likely to be important. This is an especial problem in psychiatry as it is difficult to carry out intensive biological and psychological assessments in the context of large scale epidemiological studies. A recent exception is the collection of DNA in epidemiological studies, which is now useful thanks to the massive technological progress in genetics. But for other areas of neuroscience, this is still mostly a challenge for the future. There are examples of population-based studies of brain imaging [9] and neuroendocrinology [10] but epidemiologists will need further assistance from scientists engaged in imaging, psychology and other areas of basic science in order to develop quicker and easier tests to use in population studies and improve understanding of the neuroscientific basis of psychiatric disorder. We should also not forget about social science. Social scientists will also help in understanding the social context of psychiatric disorder, for example the ideas about social capital have been influential [11] though the advances in social science appear less rapid than those in neuroscience.
1.2 Causation in medicine
One of the most important functions of epidemiology, as suggested by Rothman and Greenland [2], is to investigate factors that might cause disease and treatments or interventions that might cause recovery. Causal inference is the label for a process of reasoning that provides some structure to this difficult and often rather subjective task. The term ‘risk factor’ is often used by epidemiologists, in part, to show that there is always some doubt about causal relationships. However, we are only really interested in ‘risk factors’ if they are causal. The first issue to address, then, is what is meant by ‘cause’. Cause is a word that is used in everyday language, but in medicine it is important that the word is defined and understood in a way that distinguishes it from its everyday use. Rothman [5] has provided one of the most reasoned and influential approaches towards thinking about cause in medicine. He defines cause (of disease) as ‘an event, condition or characteristic that plays an essential role in producing an occurrence of a disease’. In other words, a particular occurrence of disease would not have occurred without that event, condition or characteristic having happened first. Rothman has also argued that causes have to occur before outcomes. This is a sine qua non of any causal
relationship and so any consideration of cause has to include this criterion. Rothman [5] emphasises that causation implies a comparison. Smoking one pack of cigarettes a day is not a cause if it is compared to two packs, but is a cause when compared to a person who smokes no cigarettes. This comparison is usually measured in epidemiology by calculating an index of association, such as an odds ratio, between a possible causal factor and the disease or outcome of interest. For example, smoking cannabis regularly doubles the risk of schizophrenia compared to people who do not smoke cannabis [12] (though whether this is a causal relationship is still uncertain). In everyday talk, people often think of causes as though they have a one-to-one relationship with an outcome. The smashed china was caused by the ball kicked by your son. This approach is also attuned to the deterministic model common in basic science, in which, for example, a neurotransmitter acts on a receptor, that is coupled with a G protein that in turn activates a signal transduction pathway. However, the model of causation in clinical medicine has increasingly regarded causes as neither necessary nor sufficient for the majority of non-infectious medical conditions. Smoking cigarettes increases the risk of lung cancer, but many people who smoke do not develop lung cancer and some people develop lung cancer without smoking. It is possible to think of some exceptions to this rule, but in the main these are single gene disorders with high penetrance such as Huntington’s disease. In infectious disease, the infectious agent is necessary, but not always sufficient for the clinical disease. Nevertheless, for most non-infectious disease, there has grown a consensus that causal factors are likely to be neither necessary nor sufficient. 
This has also encouraged use of the term ‘risk factor’ in epidemiology as the causal factors that are identified in human populations increase the risk of disease but do not confer any certainty about future events. At first sight there appears to be a conflict between the deterministic models used in biological science and the more probabilistic models that seem to apply to disease in human populations. There are two ways in which this apparent conflict has been resolved. First, that most diseases have multiple causes and this would seem particularly true for psychiatric
disorder. Research on heart disease and cancer provides ample evidence that this can be the case. The other suggestion, again made by Rothman [13], is the idea of multiple sufficient causes for a single disease, each of which is in turn multifactorial, with overlapping sets of causal factors. If we accept this model, it is possible to understand that in a circumstance of partial knowledge, each element of those sufficient causes will appear neither necessary nor sufficient. This is an important argument that enables us to link epidemiology to the mechanisms that underpin the associations that epidemiologists observe in human populations.
1.2.1 Alternative explanations
Epidemiological studies estimate the association between a possible causal factor and a disease, or between a treatment and recovery. In human populations, this is the only approach that is feasible. We also have to understand that the tight experimental controls that can occur in basic science and in experimental animals are impossible in epidemiological studies of human populations. Participants in epidemiological studies or clinical trials will change their behaviour, change their treatment, and may refuse to continue to take part in a study. On occasions, these changes will be influenced by the study itself, public health campaigns or changes to health policy. There will always be difficulties, therefore, in interpreting data from epidemiological studies. There are no perfect studies in epidemiology and this leads to more emphasis upon interpretation of any finding of association. It also implies that single studies are rarely sufficient, on their own, to draw conclusions. It is common for RCTs to be described as the ‘gold standard’ but this ignores the difficulties in interpreting even that most rigorous of the designs available to us. Patients drop out of RCTs, stop taking their medication, start taking non-trial medication or make other changes to their lifestyle and health care use, sometimes as a result of the randomised intervention. RCTs might reduce the controversy surrounding interventions but they do not eliminate it [14, 15]. If this is epidemiological gold, it has less lustre than its counterpart in government vaults.
One approach towards causal inference is therefore to consider the alternative explanations for an association, apart from causation and it is usual to consider at least these four alternatives: sampling variation and chance, confounding, bias and reverse causality.
1.2.1.1 Sampling variation and chance
Epidemiologists have been at the forefront of considering statistical issues in relation to medical research. There is marked variation within human populations and so sampling variation is usually important to consider. It is difficult to imagine the days when medical journals did not include any statistical tests, but at least in the United Kingdom, Bradford Hill’s series of articles in The Lancet in the 1930s were very influential in introducing statistical tests into medicine [16]. Many studies are completed and many statistical tests are carried out, even within a single study. Every article in a scientific journal will usually contain dozens of statistical tests. Type 1 errors, in which results are statistically significant by chance, are therefore common. Statistical tests can be very useful when an analysis was planned as part of a hypothesis-driven investigation. However, carrying out repeated tests during exploratory analyses or ‘data mining’ can lead to results that will often be due to chance. Results from exploratory analyses are best thought of as ‘hypothesis generating’ findings that require replication. It is particularly difficult when unscrupulous investigators report such analyses as though they were testing a priori hypotheses. In the light of these concerns, the conventional 5% threshold for statistical significance is almost certainly too high [17], and for most decisions, one needs much better statistical evidence. Type 2 errors, in which non-significant findings are interpreted as reflecting no association, are very common in the psychiatric literature given the relatively small size of many studies. Confidence intervals indicate the precision with which an association is estimated and help to decide whether the investigators have excluded an important effect. Studies too small to exclude an important effect are common in treatment research in psychiatry [18].
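The way repeated testing inflates type 1 errors can be illustrated with a short calculation. The sketch below is not taken from the chapter; it assumes that the tests are independent and that every null hypothesis is true, which real analyses rarely satisfy exactly, and the function name is chosen here for illustration only.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one 'significant' result among num_tests
    independent tests when every null hypothesis is true (an assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Illustrative numbers of tests; an article may easily contain dozens.
    for k in (1, 10, 20, 50):
        rate = familywise_error_rate(k)
        print(f"{k:>3} tests: P(at least one type 1 error) = {rate:.2f}")
```

Under these assumptions the chance of at least one spurious finding is roughly 40% with ten tests and 64% with twenty, which is one way of seeing why exploratory results are best treated as hypothesis generating.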
1.2.1.2 Confounding
Factors of aetiological importance are not randomly allocated in human populations. In RCTs of sufficient size the groups should be balanced, on average, with respect to confounding factors, including those that the investigator does not know about. In observational studies, however, confounding can occur. For example, cannabis users will differ in many ways from people who do not use cannabis. In the Swedish conscript study, cannabis users were more likely to live in cities, were more sociable and were more likely to get into trouble with the police [19]. It is possible that these other characteristics could alter risk of subsequent psychosis. These ‘other characteristics’ are potential confounding variables. A confounder is defined as an independent risk factor (or protective factor) for the outcome at each level of the exposure that is also associated with the exposure. A confounding factor can lead to a spurious association or can eliminate a real association between exposure and disease. In the case of cannabis and psychosis, there is good evidence that confounding occurs [12]. In other words, much of the increased risk of psychosis in cannabis users can be attributed to their other characteristics. Statistical adjustment for confounders accounted for about half, but only half, of the observed association.
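A small numerical sketch may help to show how a confounder can inflate a crude association. The counts below are entirely hypothetical (they are not from the Swedish conscript study or any other study), the stratifying variable is an invented urban/rural split, and the Mantel–Haenszel pooled odds ratio is used here simply as one common way of adjusting for a single categorical confounder.

```python
from typing import NamedTuple, Sequence


class Table(NamedTuple):
    """One 2 x 2 table of exposure by disease (hypothetical counts)."""
    exposed_cases: int
    exposed_noncases: int
    unexposed_cases: int
    unexposed_noncases: int


def odds_ratio(t: Table) -> float:
    """Cross-product odds ratio for a single 2 x 2 table."""
    return (t.exposed_cases * t.unexposed_noncases) / (
        t.exposed_noncases * t.unexposed_cases
    )


def collapse(tables: Sequence[Table]) -> Table:
    """Ignore the stratifying variable by adding the tables cell by cell."""
    return Table(*(sum(cells) for cells in zip(*tables)))


def mantel_haenszel_or(tables: Sequence[Table]) -> float:
    """Mantel-Haenszel odds ratio pooled across strata of the confounder."""
    num = sum(t.exposed_cases * t.unexposed_noncases / sum(t) for t in tables)
    den = sum(t.exposed_noncases * t.unexposed_cases / sum(t) for t in tables)
    return num / den


# Hypothetical strata: exposure is more common, and disease risk is higher,
# among urban residents, so residence confounds the crude association.
urban = Table(40, 360, 10, 190)
rural = Table(4, 96, 16, 784)

if __name__ == "__main__":
    strata = [urban, rural]
    print(f"Crude odds ratio:    {odds_ratio(collapse(strata)):.2f}")
    print(f"Urban odds ratio:    {odds_ratio(urban):.2f}")
    print(f"Rural odds ratio:    {odds_ratio(rural):.2f}")
    print(f"Adjusted (MH) ratio: {mantel_haenszel_or(strata):.2f}")
```

With these invented counts the crude odds ratio is about 3.6, whereas the stratum-specific and adjusted odds ratios are close to 2.1. Adjustment removes part, but not all, of the association, which is the pattern described above for cannabis and psychosis.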
1.2.1.3 Bias
Bias is another epidemiological term that is borrowed from everyday use. In epidemiology, bias refers to the possibility that the estimate of association that is obtained is not the ‘true’ association that would pertain if one could carry out a perfect study. It can be contrasted with confounding, which is a real explanation for an association that would be present even if your study had perfectly estimated the association in the population. In contrast, bias is introduced by the investigator or is a consequence of the investigation. The distinction between confounding and bias can be illustrated using the example given above of the link between cannabis and schizophrenia. Even if the measurement of cannabis and schizophrenia were
done perfectly and everyone in a study was followed up, confounding would still exist and have to be considered. Bias will only be introduced as one departs from this utopian state. There are two main types of bias: selection and measurement bias. Selection bias is to do with the selection of subjects for the study while measurement or information bias is concerned with bias in measurement, diagnosis and ascertainment of outcome and confounders. There are more comprehensive classifications of bias [7], but in the main these two types are the most important to consider. Selection bias is often described in relation to case–control studies that are very susceptible to this bias. It occurs when the cases and controls in a case–control study are drawn from different populations that differ with respect to the exposure variable. In case–control studies, controls estimate the frequency of exposure in the population from which the cases were drawn. If the control were to become diseased the control should be in the sampling frame for the cases. Case–control studies are therefore population based studies and it is this aspect of case–control design, that is often overlooked. For example, Mulvany and colleagues [20] carried out a case–control study in which people with schizophrenia (the cases) were selected from a hospital in Dublin who had birth records in the local maternity hospitals. The controls were the next birth in that hospital. There was no way of knowing whether the controls were still resident in Dublin when adult so might not have been in the population ‘at risk’ of being cases in the study. Some of the controls will have moved away from Dublin. This mismatch could lead to selection bias. This study reported that people of higher socioeconomic status were more likely to develop schizophrenia but this might have been because wealthier people were less likely to move away between birth and adulthood. 
This result is the opposite of the findings from a cohort study [21] and a case–control study with less risk of selection bias [22] that both found that people of lower socioeconomic status were at increased risk of schizophrenia. On balance, the Mulvany study does not support the idea that higher social classes
are at risk of developing schizophrenia; if anything, the reverse is the case. Selection bias can also be used to describe the bias introduced by partial follow-up in cohort studies and RCTs. Cohort studies are relatively insensitive to the selection of participants in the cohort, for example the British doctors’ cohort of Doll and Hill [23] has produced some robust and reproducible findings even though British doctors are a highly selected group. Likewise, Framingham is far from a representative town. However, bias is more likely to be introduced by differential drop out from the cohort than from the initial selection of the subjects in the cohort, at least in this kind of design. Many cohort studies have quite marked attrition, particularly for longer term follow-up and statistical methods for dealing with such missing data (see www.missingdata.org.uk) are designed to reduce this form of bias. Measurement or information bias occurs when measurement of exposure or ascertainment of disease is influenced by knowledge of the exposure (longitudinal designs or cross-sectional designs) or of the outcome (case–control and cross-sectional designs). Recall bias can be a problem if the presence of disease influences the measurement of exposure, as might occur in case-control studies and cross-sectional surveys. People with an illness, or their relatives, are likely to be more aware of past events that might be relevant to illness. The mental state of people with psychiatric disorder might increase or reduce the chance that past events are remembered. For example, people with depression have well-documented information processing biases that make it more likely that negative events are recalled [24]. There are many examples of studies that ask people with depression to record negative adverse experiences [25]. The strong associations that have been observed between depression and these measures may be partly as a result of such a recall bias. 
It is always difficult or impossible to estimate the likely influence of bias on results. The high chance of recall bias when measuring factors of potential aetiological importance in psychiatry is a powerful argument for using longitudinal designs to study causation. Using data sources gathered before the onset of disease will
also reduce measurement bias. Other strategies to reduce measurement bias include using structured questionnaires and restricting retrospective inquiry to events that are unlikely to be forgotten. Bias can also be introduced by the researcher who is interviewing the participant, so-called observer bias. If possible, this source of bias can be eliminated by using self-administered questionnaires. However, there are occasions when participants might find the questions in self-administered form difficult to understand or when they might be misinterpreted. This seems particularly likely when asking about psychotic symptoms [26]. Many assessments of psychiatric disorder are semistructured and rely upon ‘cross-examination’ of the participant. There has been a vigorous debate comparing the validity and reliability of self-reported and semistructured interviews in assessing psychiatric disorder [27]. One has to balance the danger that questions can be misinterpreted with the risk that the observer can influence findings according to preconceived views. The balance of these arguments differs according to the diagnoses that are being studied. For most depression and anxiety disorders where insight is retained, self-reported information would seem to be an advantage. In contrast, for psychotic disorders the cross-examination style of semistructured interviews would seem necessary.
1.2.1.4 Reverse causality
Finally, the disease may cause the exposure. This might occur in case–control studies and cross-sectional surveys because data on exposure are usually collected retrospectively. In contrast, longitudinal studies should ensure that exposures occur before the onset of disease. Many biological aspects of psychiatric disorder are studied using case–control methods. For example, in imaging studies the abnormalities described in people with schizophrenia could result from the illness rather than being a marker of possible causes. Studies of first episode psychosis [28] go some way to address this possibility, but longitudinal studies are required in order to establish abnormalities that are present before the onset of psychosis.
1.3 Causal inference
A number of criteria have been suggested that might encourage a conclusion that exposures have a causal role in disease [13, 29, 30]. These usually require evidence from a variety of sources and one would expect a number of different studies using different approaches all to produce consistent results before coming to a conclusion about causality. The criteria usually suggested include:
1 Timing. The cause has to occur before the disease.
2 Strength of relationship measured by relative risk. Large relative risks are more likely to be causal. A relative risk below about 1.5 should be treated with more caution.
3 Consistency of findings across studies. One would want a variety of different studies in different populations and with different strengths and weaknesses in the design all to produce the same results.
4 Dose–response relation. Does the evidence support a ‘dose–response’ relation in that the more exposure to a risk factor the more likely the disease?
5 Biological plausibility. Is the relationship biologically plausible and underpinned by a reasonable mechanism? One advantage of epidemiology is that it can work in isolation of knowledge of mechanisms. For example, John Snow argued that contaminated water led to cholera many decades before the cholera Vibrio was identified or the molecular basis of that disease was established. This should be especially useful for psychiatric epidemiology given the complexity of brain structure and function and the limits of our basic neuroscientific knowledge.
None of the criteria listed above are essential, except perhaps for the issue of timing – causes have to occur before the onset of disease. These criteria are a guide, but often the final conclusion relies upon a matter of judgement. One important principle to consider is whether the evidence is good enough to justify any policy decisions that might be taken. For example, if cannabis
had a causal relationship to schizophrenia then the main policy implication would be to carry out a public health campaign to alert young people to the possible dangers. The amount of evidence required to justify this would be less than that needed to justify a more expensive or risky intervention. For example, suggestions to recommend widespread use of cholesterol-lowering agents have to take account of the greater financial cost and potential for adverse effects. The strength of evidence required for such an intervention would be greater than that needed for a publicity campaign.
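The strength-of-relationship criterion in Section 1.3 refers to the relative risk. As a rough sketch of how a relative risk and its confidence interval might be computed from cohort data, the example below uses invented counts (30 cases among 1000 exposed and 20 among 1000 unexposed) and the usual large-sample standard error of the log relative risk. Neither the counts nor the choice of interval method comes from the chapter.

```python
import math
from typing import Tuple


def relative_risk_with_ci(
    cases_exposed: int,
    n_exposed: int,
    cases_unexposed: int,
    n_unexposed: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Relative risk with an approximate 95% confidence interval.

    Uses the large-sample standard error of log(RR); this approximation
    is an assumption of the sketch and is poor when counts are small.
    """
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se_log_rr = math.sqrt(
        1 / cases_exposed - 1 / n_exposed
        + 1 / cases_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical cohort counts, for illustration only.
    rr, lower, upper = relative_risk_with_ci(30, 1000, 20, 1000)
    print(f"Relative risk = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

In this invented example the relative risk is 1.5 but the interval runs from below 1 to above 2.5, so the data are compatible with no association as well as with a substantial one. That is one reason modest relative risks from single studies call for caution.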
1.4 The future for psychiatric epidemiology
Studying the causes of psychiatric disorder in human populations has to be carried out using epidemiological methods. Basic science experiments can often suggest likely causal mechanisms and generate hypotheses about the risk factors for psychiatric disorder but cannot establish that such mechanisms are operating in humans. Small-scale experimental studies in humans can show whether these mechanisms are occurring in humans with disease but they cannot establish whether they are causing the disease in human populations. For example, the work of Meaney and others [31] has suggested possible influences on stress reactivity based upon work on experimental animals. Small-scale experimental work on humans can investigate possible mechanisms further. However, it is only population-based studies of humans that allow us to infer whether the kind of stresses that exist in human life could lead to permanent changes in hypothalamo–pituitary–adrenal axis responsivity and thus lead to human disease. The future of psychiatric epidemiology will rest upon advances in neuroscience and will increasingly need to measure psychological and biological processes in population-based studies. Likewise, epidemiology can generate hypotheses that will need to be investigated by basic scientists and in smaller scale experimental studies in humans. This approach is often described as ‘translational medicine’ [32] and epidemiology will remain one of its key building blocks if this vision is to be realised and the benefits of medical research to human health are to be achieved.
References
[1] Last, J. (2001) A Dictionary of Epidemiology, 4th edn, Oxford University Press, New York.
[2] Rothman, K. and Greenland, S. (1998) Modern Epidemiology, 2nd edn, Lippincott, Williams & Wilkins, Philadelphia.
[3] Snow, J. (1936) On the Mode of Communication of Cholera, 2nd edn, The Commonwealth Fund, New York.
[4] Shepherd, M. (1978) Epidemiology and clinical psychiatry. Br. J. Psychiatry, 133, 289–298.
[5] Rothman, K. (1987) Modern Epidemiology, Little Brown, Boston.
[6] Morris, J.N. (1975) Uses of Epidemiology, 3rd edn, Churchill Livingstone, Edinburgh.
[7] Sackett, D.L. and Holland, W.W. (1975) Controversy in the detection of disease. Lancet, 2 (7930), 357–359.
[8] Evidence-Based Medicine Working Group (1992) Evidence-based medicine. A new approach to teaching the practice of medicine. J. Am. Med. Assoc., 268 (17), 2420–2425.
[9] Tanskanen, P., Ridler, K., Murray, G.K. et al. (2008) Morphometric brain abnormalities in schizophrenia in a population-based sample: relationship to duration of illness. Schizophr. Bull., 36, 766–777.
[10] Strickland, P.L., Deakin, W.J.F., Percival, C. et al. (2002) Bio-social origins of depression in the community. Br. J. Psychiatry, 180, 168–173.
[11] Putnam, R.D. (1993) The prosperous community: social capital and public life. Am. Prospect, 13, 35–42.
[12] Moore, T., Zammit, S., Lingford-Hughes, A. et al. (2007) Systematic review of cannabis use and risk of developing psychotic or affective outcomes. Lancet, 370, 319–328.
[13] Rothman, K.J. and Greenland, S. (2005) Causation and causal inference in epidemiology. Am. J. Public Health, 95 (Suppl. 1), S144–S150.
[14] Gøtzsche, P.C. and Olsen, O. (2000) Is screening for breast cancer with mammography justifiable? Lancet, 355, 129–134.
[15] Marshall, M. and Creed, F. (2000) Assertive community treatment – is it the future of community care in the UK? Int. Rev. Psychiatry, 12, 191–196.
[16] Doll, R. (1992) Sir Austin Bradford Hill and the progress of medical science. Br. Med. J., 305 (19–26), 1521–1526.
[17] Sterne, J.A. and Davey Smith, G. (2001) Sifting the evidence – what’s wrong with significance tests? Br. Med. J., 322 (7280), 226–231.
[18] Hotopf, M., Lewis, G. and Normand, C. (1997) Putting trials on trial: the costs and consequences of small trials in depression. J. Epidemiol. Community Health, 51, 354–358.
[19] Zammit, S., Allebeck, A., Dalman, C. et al. (2002) Self-reported cannabis use as a risk factor for schizophrenia: further analysis of the 1969 Swedish conscript cohort. Br. Med. J., 325, 1199–1201.
[20] Mulvany, F., O’Callaghan, E., Takei, N. et al. (2001) Effect of social class at birth on risk and presentation of schizophrenia: case–control study. Br. Med. J., 323 (7326), 1398–1401.
[21] Wicks, S., Hjern, A., Gunnell, D. et al. (2005) Social adversity in childhood and the risk of developing psychosis: a national cohort study. Am. J. Psychiatry, 162 (9), 1652–1657.
[22] Harrison, G., Gunnell, D., Glazebrook, C. et al. (2001) Association between schizophrenia and social inequality at birth: case–control study. Br. J. Psychiatry, 179, 346–350.
[23] Doll, R., Peto, R., Boreham, J. et al. (2000) Smoking and dementia in male British doctors: prospective study. Br. Med. J., 320 (7242), 1097–1102.
[24] Mathews, A. and MacLeod, C. (2005) Cognitive vulnerability to emotional disorders. Annu. Rev. Clin. Psychol., 1, 167–195.
[25] Paykel, E.S. (2001) The evolution of life events research in psychiatry. J. Affect. Disord., 62 (3), 141–149.
[26] Horwood, J., Salvi, G., Thomas, K. et al. (2008) IQ and non-clinical psychotic symptoms in 12-year-olds: results from the ALSPAC birth cohort. Br. J. Psychiatry, 193 (3), 185–191.
[27] Wittchen, H.U., Ustun, T.B. and Kessler, R.C. (1999) Diagnosing mental disorders in the community. A difference that matters? Psychol. Med., 29 (5), 1021–1027.
[28] Steen, R.G., Mull, C., McClure, R. et al. (2006) Brain volume in first-episode schizophrenia: systematic review and meta-analysis of magnetic resonance imaging studies. Br. J. Psychiatry, 188, 510–518.
[29] Hill, A.B. (1965) The environment and disease: association or causation? J. R. Soc. Med., 58, 295–300.
[30] Susser, M. (1991) What is a cause and how do we know one? A grammar for pragmatic epidemiology. Am. J. Epidemiol., 133 (7), 635–648.
[31] Meaney, M.J. (2001) Maternal care, gene expression, and the transmission of individual differences in stress reactivity across generations. Annu. Rev. Neurosci., 24, 1161–1192.
[32] Sung, N.S., Crowley, W.F.Jr., Genel, M. et al. (2003) Central challenges facing the national clinical research enterprise. J. Am. Med. Assoc., 289 (10), 1278–1287.
2
Analysis of categorical data: The odds ratio as a measure of association and beyond
Garrett M. Fitzmaurice1,2,3 and Caitlin Ravichandran1,2
1 Laboratory for Psychiatric Biostatistics, McLean Hospital, Belmont, MA, USA
2 Department of Psychiatry, Harvard Medical School, Boston, MA, USA
3 Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
2.1 Introduction

In this chapter we present an overview of many of the statistical methods commonly used for the analysis of categorical ‘outcome’ data in psychiatric studies. A categorical variable is defined as one that takes on a finite number of levels or categories (e.g. ‘success’ and ‘failure’ in the case of a dichotomous or binary variable). For example, consider the data in Table 2.1 which are from a study of rates and predictors of recovery in patients with first-episode major affective disorders with psychosis [1]. In this study investigators obtained information on candidate predictors of recovery at the time of first hospitalisation (e.g. Axis I comorbidity) and then followed patients for 2 years to determine which patients experienced syndromal and functional recovery. In this simple illustration of one comparison of interest, the categorical outcome has two levels, ‘recovered’ or ‘not recovered’. Table 2.1 is commonly referred to as a 2 × 2 contingency table.

Table 2.1 Illustrative data from a study of recovery in patients with first-episode major affective disorders with psychosis.

Comorbidity     Recovery                       Total
                Not recovered    Recovered
No Axis I       65               50            115
Axis I          48               18            66
Total           113              68            181

Much of the statistical theory underlying the analysis of categorical data is more easily formulated for 2 × 2 contingency tables. Indeed, methods for the analysis of 2 × 2 contingency tables provide the cornerstone for many of the advanced statistical methods required for more complicated problems. These include extensions for analysing outcomes with more than two levels (e.g. ‘not recovered’, ‘partially recovered’ and ‘recovered’), which may or may not be ordered; the former are referred to as ordinal variables, the latter are referred to as nominal variables. In addition, there can be more than two levels of the experimental treatment or exposure variable (e.g. no Axis I comorbidity, one Axis I comorbidity, two or more Axis I comorbidities) and other factors or covariates (e.g. age, gender, health status before treatment) that influence the outcome variable. Some of the most widely used probability distributions for categorical outcomes include the Bernoulli, binomial, hypergeometric and multinomial distributions. Throughout this chapter we assume the reader has very little prior knowledge of these probability distributions. The chapter is
organised as follows. We begin with a discussion of inference for a single probability or proportion. This is followed by a description of methods for analysing 2 × 2 contingency tables; the extensions to R × C contingency tables (i.e. contingency tables with R rows and C columns) are mentioned but not discussed in great detail. We discuss measures of association for 2 × 2 tables that quantify departures from independence. In particular, we focus on the odds ratio (OR) as a measure of association. We also discuss the analysis of sets of 2 × 2 tables, and describe the Cochran–Mantel–Haenszel test. Finally, we present an overview of regression models for categorical data, focusing extensively on logistic regression models for binary outcomes. The logistic regression model is first introduced for the simple case where there is only a single predictor or covariate. This model is compared and contrasted with the classical linear regression model. Later the generalisations to more than one predictor variable are considered. A major emphasis of this chapter is placed on how logistic regression is used in practice and how the logistic regression coefficients should be interpreted. An example, based on data from the first-episode major affective disorders with psychosis study, is used to illustrate and reinforce the main concepts. Finally, at the end of the chapter, we introduce some advanced topics, including extensions of logistic regression to matched study designs; exact logistic regression, which is appropriate for small sample sizes or sparse data; multinomial regression models for nominal and ordinal outcomes; and applications of logistic regression models to so-called ‘clustered’ categorical data, when the outcomes are not independent.
2.2 Inference for a single proportion

In this section we discuss inference for a single proportion or probability. In order to motivate the methods, we consider data from the first-episode major affective disorders with psychosis study. One of the goals of this study was to estimate the probability that patients with first-episode major affective disorders with psychosis achieve functional recovery after 2 years. The outcome for each patient can be denoted

$$Y_i = \begin{cases} 1 & \text{if the patient achieves functional recovery,} \\ 0 & \text{if the patient does not achieve functional recovery} \end{cases}$$

for $i = 1, \ldots, n$ patients. The binary outcomes for the n patients are assumed to be independent of each other. The probability of success (e.g. ‘recovered’) is denoted by $p = \mathrm{pr}(Y_i = 1)$ and the probability of failure (e.g. ‘not recovered’) by $1 - p = \mathrm{pr}(Y_i = 0)$. The distribution of the number of successes among the n patients, $Y = \sum_{i=1}^{n} Y_i$, can be used to form test statistics and a confidence interval for p. Counts of the number of successes, Y, have a binomial probability distribution

$$\mathrm{pr}(Y = y) = \binom{n}{y} p^{y} (1 - p)^{n-y},$$

where the binomial coefficient, $\binom{n}{y}$, is the number of ways y ‘successes’ can be obtained in n trials. The probability of success can be estimated using the sample proportion of successes, $\hat{p} = Y/n$. In large samples (say, n > 30, and with the expected number of successes np ≥ 5 and the expected number of failures n(1 − p) ≥ 5), $\hat{p}$ has an approximate normal distribution with mean p and variance $p(1 - p)/n$. A 95% confidence interval for p is given by

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$

The above confidence interval for p, known as a Wald confidence interval, is commonly used and easy to compute, but has been criticised for its poor performance; for example these confidence limits cover the true value of p less than 95% of the time on average. Two alternatives with more favourable properties are the Wilson confidence interval, which is based on the score test, and the Jeffreys interval, which can be derived using Bayesian statistical theory. Both can be calculated using popular statistical software, and the Wilson interval also has the closed-form expression:

$$\frac{\hat{p} + \dfrac{z_{1-\alpha/2}^{2}}{2n} \pm z_{1-\alpha/2} \sqrt{\dfrac{\hat{p}(1 - \hat{p}) + \dfrac{z_{1-\alpha/2}^{2}}{4n}}{n}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}}, \tag{2.1}$$

where $z_{1-\alpha/2} = 1.96$ for a 95% confidence interval. For more information on the performance of these and other intervals, see for example Brown, Cai and DasGupta [2]. When sample sizes are relatively small (say, n < 30), an exact confidence interval can be obtained that is based directly on the binomial distribution for Y. Finally, hypothesis tests for p equalling a specified value, say $p_0$, can be conducted using either large sample theory for the approximate normal distribution of $\hat{p}$ or via exact methods based on the binomial distribution for Y.
2.2.1 Example

Using data from the first-episode major affective disorders with psychosis study presented in Table 2.1 and the methods for inference for a single proportion, we can estimate the proportion of patients who achieve functional recovery 2 years after first hospitalisation. The estimated proportion is the total number of patients who recovered (Y = 68) divided by the total number of patients (n = 181), which equals 0.376. Ninety-five per cent confidence intervals for this estimate are (i) Wald: (0.305, 0.446), (ii) Wilson: (0.308, 0.448), (iii) Jeffreys: (0.308, 0.448) and (iv) exact: (0.305, 0.451). Note that there is close agreement among the four confidence intervals; this is to be expected when, as with these data, n is relatively large and both $n\hat{p} \geq 5$ and $n(1 - \hat{p}) \geq 5$.
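The Wald and Wilson intervals quoted in this example can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names `wald_interval` and `wilson_interval` are illustrative choices, not part of any statistical package, and the Jeffreys and exact intervals are omitted because they require beta or binomial quantiles.

```python
import math


def wald_interval(y, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = y / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def wilson_interval(y, n, z=1.96):
    """Wilson (score) interval written in the form of Equation 2.1."""
    p_hat = y / n
    centre = p_hat + z ** 2 / (2 * n)
    half_width = z * math.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * n)) / n)
    denominator = 1 + z ** 2 / n
    return (centre - half_width) / denominator, (centre + half_width) / denominator


y, n = 68, 181  # recovered patients and total patients, from Table 2.1
print("p_hat  = %.3f" % (y / n))
print("Wald   = (%.3f, %.3f)" % wald_interval(y, n))
print("Wilson = (%.3f, %.3f)" % wilson_interval(y, n))
```

With a sample of this size the two intervals differ only in the third decimal place, which is consistent with the close agreement noted above.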
2.3 Analysis of 2 × 2 contingency tables

In many settings we are interested in the effect of treatments or exposures on a binary outcome. When the treatment or exposure has only two levels the data can be summarised in a 2 × 2 contingency table. Data in the form of a 2 × 2 contingency table can arise from many different types of study designs [3]. For example, consider a clinical trial comparing the probability of remission between patients with depression assigned to a novel treatment or standard treatment. The question of scientific interest is: ‘How does treatment affect the probability of remission?’ Similarly, for the first-episode major affective disorders with psychosis study (Table 2.1), the presence of Axis I comorbidities (the exposure) was determined at baseline, and we are interested in the number of patients with and without Axis I comorbidities who recover. The question of scientific interest is: ‘How does the presence of comorbidities affect recovery?’ Both are examples of prospective study designs.

However, data in the form of 2 × 2 tables also arise from other types of study designs. Consider, for example, the data from a retrospective case–control study of psychiatric disorders and occurrence of elderly suicides [4] presented in Table 2.2. The number of suicide cases and controls (non-cases) are fixed by design (with 85 cases and 153 controls) and the prevalence of psychiatric disorders is then ascertained on each subject in the study. In this retrospective case–control study design, the prevalence of psychiatric disorder is considered a random variable. Case–control studies are commonly used when the outcome is rare and/or when it is not ethical to randomise patients to the ‘exposure’ in a prospective study. In this particular case–control study one question of scientific interest is: ‘Does prevalence of substance use disorders vary among the cases and controls?’

The third type of study design in which data in the form of a 2 × 2 table arise is the so-called double dichotomy, cross-sectional or prevalence study. In this study design a fixed number (n) of subjects are randomly selected and each subject is cross-classified on the basis of the two variables (the row and column variables) of scientific interest. Table 2.3 displays data from a prevalence study of neuropsychiatric symptoms and mild cognitive impairment (MCI) in the elderly [5], where only the total number of subjects, n = 1969, is fixed by design. Table 2.3 contains data on the presence of delusions for the 1909 subjects with neuropsychiatric data available. In this example, the question of scientific interest is: ‘Are delusions and cognitive status related?’
Table 2.2 Substance use disorders (SUDs) and occurrence of elderly suicides.

SUD        Status               Total
           Case      Control
Yes        23        1          24
No         62        152        214
Total      85        153        238

Table 2.3 Data from the study of neuropsychiatric symptoms and MCI.

Delusions  Cognitive status                      Total
           MCI       Normal cognitive ageing
Yes        11        6                           17
No         308       1584                        1892
Total      319       1590                        1909
Finally, although not shown here, data in the form of a 2 × 2 table can arise when the total sample size, n, is not fixed in advance. For example, an exit poll conducted at an election station might set out to record the political party preferences (e.g. Democrat or Republican) and opinions about mental health parity legislation (in favour or against) of all respondents who agree to participate in the poll; here, the total number of individuals who agree to participate, n, is random.

Suppose we let $X_i$ denote the row variable (e.g. treatment or exposure) and $Y_i$ denote the column variable (e.g. outcome variable) for each one of these types of study designs, where both $X_i$ and $Y_i$ are binary (taking values 0 or 1). Then, the data in a general 2 × 2 contingency table can be represented as in Table 2.4. In Table 2.4 $n_{jk}$ is the count of the number of subjects with X = j and Y = k; $n_{jk}$ is referred to as a cell count. For example, $n_{11}$ is the number of subjects with X = 1 and Y = 1. Also, in Table 2.4 the marginal row counts are $n_{j+} = n_{j0} + n_{j1}$ (the number of subjects with X = j), and the marginal column counts are $n_{+k} = n_{0k} + n_{1k}$ (number of subjects with Y = k).

Table 2.4 General representation of counts in a 2 × 2 contingency table.

           Y
X          0         1         Total
0          n00       n01       n0+
1          n10       n11       n1+
Total      n+0       n+1       n

In each study, different marginal totals are fixed by design. As a result, the counts in the tables have different distributions. For example, for the case–control study, $n_{+0}$ and $n_{+1}$ are fixed by design, and the numbers of exposed subjects in each column ($n_{10}$ and $n_{11}$) have independent binomial distributions. However, for all of these different types of study designs, the question of scientific interest can be formulated in a similar way: ‘Are X and Y associated or are they independent?’ For ease of explanation, we focus on data arising from a cross-sectional study design. For the cross-sectional design, we can write the probabilities for the 2 × 2 table as in Table 2.5.

Table 2.5 Probabilities in a 2 × 2 contingency table, with only n fixed.

           Y
X          0         1         Total
0          p00       p01       p0+
1          p10       p11       p1+
Total      p+0       p+1       1

The probabilities in Table 2.5 are $p_{jk} = \mathrm{pr}[(X_i = j), (Y_i = k)]$, and the marginal probabilities are $p_{j+} = p_{j0} + p_{j1} = \mathrm{pr}[X = j]$ and $p_{+k} = p_{0k} + p_{1k} = \mathrm{pr}[Y = k]$. For the cross-sectional design, all of these probabilities can be estimated from the data at hand. For prospective studies with the number of exposures fixed by design, and for case–control studies, they are not all estimable. For prospective studies with the number of exposures fixed by design, only the two conditional row probabilities $\mathrm{pr}(Y_i = 1|X_i = 0) = p_{01}/p_{0+}$ and $\mathrm{pr}(Y_i = 1|X_i = 1) = p_{11}/p_{1+}$ can be estimated. Similarly, for case–control studies, only the two conditional column probabilities $\mathrm{pr}(X_i = 1|Y_i = 0) = p_{10}/p_{+0}$ and $\mathrm{pr}(X_i = 1|Y_i = 1) = p_{11}/p_{+1}$ can be estimated. For example, using data from the study of psychiatric disorders and occurrence of elderly suicides, the probability of a substance use diagnosis can be estimated for suicides and non-suicides, but the probability of suicide cannot be estimated for elderly with and without substance use diagnoses.
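To make the point about estimability concrete, the short sketch below computes the two conditional column probabilities for the case–control data in Table 2.2. The dictionary layout and the function name `exposure_prob_given_outcome` are assumptions made purely for illustration.

```python
# Counts from Table 2.2, keyed by (X, Y):
# X = 1 for a substance use disorder, Y = 1 for a suicide case.
counts = {(1, 1): 23, (1, 0): 1, (0, 1): 62, (0, 0): 152}


def exposure_prob_given_outcome(counts, y):
    """Estimate pr(X = 1 | Y = y) from the column of the table with outcome y."""
    column_total = counts[(0, y)] + counts[(1, y)]
    return counts[(1, y)] / column_total


p_sud_cases = exposure_prob_given_outcome(counts, 1)     # 23 / 85
p_sud_controls = exposure_prob_given_outcome(counts, 0)  # 1 / 153
print(round(p_sud_cases, 3), round(p_sud_controls, 3))
```

The analogous row-wise ratio, such as 23/24, is not a valid estimate of pr(Y = 1 | X = 1), because the numbers of cases and controls (85 and 153) were chosen by the investigators rather than observed.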
2.3.1 The odds ratio as a measure of association

To determine whether $X_i$ and $Y_i$ are associated, it becomes necessary to formulate measures of association that quantify any departure from independence. The most commonly used measure of association is the odds ratio (OR), also known as the cross-product ratio (for reasons that will soon become apparent). The OR is a measure of association based on a comparison of ‘odds’. The odds is simply another metric for expressing risk or probability. Specifically, if p is the probability of success, then $p/(1 - p)$ is referred to as the odds of success. For example, if the probability of success is 0.8 then the odds of success is 4 (or 0.8/0.2) to 1. That is, the probability of success is four times as large as the probability of failure. In a prospective study with the number of exposures fixed by design, the OR measures association by comparing the odds of Y in the two exposure groups defined by X. Specifically, the OR for Y associated with X is

$$\mathrm{OR} = \frac{\mathrm{pr}(Y_i = 1|X_i = 1)\,/\,\mathrm{pr}(Y_i = 0|X_i = 1)}{\mathrm{pr}(Y_i = 1|X_i = 0)\,/\,\mathrm{pr}(Y_i = 0|X_i = 0)}.$$

The null value for the OR is 1 because it corresponds to $\mathrm{pr}(Y_i = 1|X_i = 1) = \mathrm{pr}(Y_i = 1|X_i = 0)$ and implies that Y and X are independent. Quite often, the log of the OR is used as a measure of association, since log(OR) = 0 under the assumption of no association between Y and X. When OR > 1, then $\mathrm{pr}(Y_i = 1|X_i = 1) > \mathrm{pr}(Y_i = 1|X_i = 0)$; similarly, when OR < 1, then $\mathrm{pr}(Y_i = 1|X_i = 1) < \mathrm{pr}(Y_i = 1|X_i = 0)$. Note that the OR expresses association in relative (or multiplicative) terms in the sense that the odds of success in one group (e.g. unexposed group) is multiplied by OR to obtain the corresponding odds in the other group (e.g. exposed group). An appealing property of the OR is that it is symmetric in the roles of Y and X in the sense that reversing the roles of Y and X yields the same OR. That is, the OR for Y associated with X is equal to the OR for X associated with Y,

$$\mathrm{OR} = \frac{\mathrm{pr}(Y_i = 1|X_i = 1)\,/\,\mathrm{pr}(Y_i = 0|X_i = 1)}{\mathrm{pr}(Y_i = 1|X_i = 0)\,/\,\mathrm{pr}(Y_i = 0|X_i = 0)} = \frac{\mathrm{pr}(X_i = 1|Y_i = 1)\,/\,\mathrm{pr}(X_i = 0|Y_i = 1)}{\mathrm{pr}(X_i = 1|Y_i = 0)\,/\,\mathrm{pr}(X_i = 0|Y_i = 0)}.$$

It is this property, unique to the OR, that accounts for its widespread use for assessing the association between exposure and disease in case–control studies. In addition, in ‘rare disease’ settings, the OR is a close approximation to another measure of association called the relative risk (RR). The RR is defined as

$$\mathrm{RR} = \frac{\mathrm{pr}(Y_i = 1|X_i = 1)}{\mathrm{pr}(Y_i = 1|X_i = 0)},$$

and also expresses association in relative or multiplicative terms. Unlike the OR, however, the RR is not symmetric in Y and X. The relationship between the OR and the RR will be discussed in greater detail in Section 2.5.1. Finally, we note that a simple computational formula for the OR arises from the following equivalent expression,

$$\mathrm{OR} = \frac{p_{00}\,p_{11}}{p_{10}\,p_{01}}.$$

This expression helps to explain why the OR is sometimes referred to as the ‘cross-product ratio’. It is usually of interest to obtain a point estimate and confidence interval for the OR, and to test the null hypothesis that the OR equals 1. For all four designs, the OR can be estimated as

$$\widehat{\mathrm{OR}} = \frac{n_{00}\,n_{11}}{n_{10}\,n_{01}}. \tag{2.2}$$

Because $\log(\widehat{\mathrm{OR}})$ is approximately normally distributed, and because it will always result in nonnegative estimates of the OR, it is preferable to obtain a confidence interval for log(OR), and then exponentiate the endpoints. That is, a 95% confidence interval for log(OR) is given by

$$\log(\widehat{\mathrm{OR}}) \pm 1.96 \sqrt{\widehat{\mathrm{Var}}[\log(\widehat{\mathrm{OR}})]}, \tag{2.3}$$

where $\widehat{\mathrm{Var}}[\log(\widehat{\mathrm{OR}})] = \frac{1}{n_{00}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{11}}$. Then, a 95% confidence interval for the OR is obtained by exponentiating the endpoints of this interval,

$$\exp\left\{\log(\widehat{\mathrm{OR}}) \pm 1.96 \sqrt{\widehat{\mathrm{Var}}[\log(\widehat{\mathrm{OR}})]}\right\}.$$
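The point estimate in Equation 2.2 and the interval obtained from Equation 2.3 are simple to compute directly. The sketch below is a hedged illustration using the counts in Table 2.1, with X coded 1 for Axis I comorbidity and Y coded 1 for recovery; the helper names `odds` and `odds_ratio_ci` are hypothetical.

```python
import math


def odds(p):
    """Convert a probability p into the odds p / (1 - p)."""
    return p / (1 - p)


def odds_ratio_ci(n00, n01, n10, n11, z=1.96):
    """Estimated OR (Equation 2.2) with a large-sample CI built on the log scale (Equation 2.3)."""
    or_hat = (n00 * n11) / (n10 * n01)
    se_log_or = math.sqrt(1 / n00 + 1 / n10 + 1 / n01 + 1 / n11)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper


print("odds when p = 0.8: %.1f" % odds(0.8))

# Table 2.1: rows X = 0 (no Axis I), X = 1 (Axis I); columns Y = 0 (not recovered), Y = 1 (recovered)
or_hat, lower, upper = odds_ratio_ci(n00=65, n01=50, n10=48, n11=18)
print("OR = %.2f, 95%% CI (%.2f, %.2f)" % (or_hat, lower, upper))
```

Working on the log scale and exponentiating afterwards guarantees that both confidence limits are positive, which a symmetric interval around the OR itself would not.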
Finally, suppose it is of interest to construct a test for no association (independence). There are three commonly used test statistics. The Wald test statistic for the null hypothesis, $H_0$: log(OR) = 0, is given by:

$$Z = \frac{\log(\widehat{\mathrm{OR}})}{\sqrt{\widehat{\mathrm{Var}}[\log(\widehat{\mathrm{OR}})]}}, \tag{2.4}$$

which, in large samples, has an approximate standard normal distribution, denoted by N(0, 1), under the null hypothesis of no association. Alternatively, the likelihood ratio test (LRT) statistic can be used. This is simply twice the difference in the log-likelihood under the alternative (association) and null (independence) hypotheses. Remarkably, for any of the four types of study designs considered, the LRT statistic reduces to

$$G^2 = 2 \sum_{j=0}^{1} \sum_{k=0}^{1} O_{jk} \log\left(\frac{O_{jk}}{E_{jk}}\right), \tag{2.5}$$

where $O_{jk} = n_{jk}$ is the ‘observed’ count in the 2 × 2 table and $E_{jk} = E(n_{jk}|H_0) = n_{j+} n_{+k}/n$ is the ‘estimated expected’ count (under the assumption of independence). In addition, for any of the four study designs, the score test statistic reduces to

$$X^2 = \sum_{j=0}^{1} \sum_{k=0}^{1} \frac{(O_{jk} - E_{jk})^2}{E_{jk}}, \tag{2.6}$$
which is also known as the Pearson chi-square test for a 2 × 2 table. In large samples, both the likelihood ratio and the Pearson chi-square statistics have approximate chi-square distributions with 1 degree of freedom. Similarly, in large samples, the Wald test statistic has an approximate standard normal distribution or, equivalently, the squared Wald test statistic has an approximate chi-square distribution with 1 degree of freedom.

If the sample size, n, is relatively small, these asymptotic (or very large sample) approximations cannot be relied upon. In particular, a rule-of-thumb in statistical folklore is that the asymptotic approximations cannot be relied upon if one (or say 25%) of the cells in the 2 × 2 table have estimated expected counts ($E_{jk}$) less than 5. When at least one $E_{jk}$ is less than 5, and it is of interest to make inferences about the OR, a common technique is to fix both margins of the 2 × 2 table and use so-called ‘exact’ tests and confidence intervals. That is, for a prospective study (where the row margins are fixed), we further condition on the column margins; for a case–control study (where the column margins are fixed), we further condition on the row margins; or for a cross-sectional design (where n is fixed), we condition on both row and column margins. In all of these cases it can be shown that the counts in the resulting table with fixed margins have a non-central hypergeometric distribution. Under the null hypothesis $H_0$: OR = 1, the non-central hypergeometric distribution becomes a central hypergeometric distribution, which forms the basis of Fisher’s exact test of no association in a 2 × 2 contingency table (see, for example [6]). This test is appropriate in small samples; the non-central hypergeometric can also be used to obtain an estimate of the OR that has better small sample properties than the usual OR estimate given in Equation 2.2. One potential drawback with exact methods, however, is that they can be ‘conservative’ in the sense that the true significance level of an exact test is often far smaller than the nominal level (e.g. 0.05), thereby making it more difficult to reject the null hypothesis of independence.
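The three large-sample statistics in Equations 2.4–2.6, together with a two-sided Fisher’s exact test, can be sketched with the Python standard library alone. This is an illustrative implementation under stated assumptions rather than a reference one: the function names are hypothetical, the two-sided Fisher p-value is defined here as the total probability of all tables that are no more likely than the observed one (one common convention among several), and `math.comb` requires Python 3.8 or later.

```python
import math


def expected_counts(table):
    """Estimated expected counts E_jk = n_j+ * n_+k / n under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def wald_z(table):
    """Wald statistic of Equation 2.4."""
    (n00, n01), (n10, n11) = table
    log_or = math.log((n00 * n11) / (n10 * n01))
    return log_or / math.sqrt(1 / n00 + 1 / n10 + 1 / n01 + 1 / n11)


def likelihood_ratio_g2(table):
    """Likelihood ratio statistic of Equation 2.5 (empty cells contribute zero)."""
    expected = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for o_row, e_row in zip(table, expected)
                   for o, e in zip(o_row, e_row) if o > 0)


def pearson_x2(table):
    """Pearson chi-square (score) statistic of Equation 2.6."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))


def fisher_exact_pvalue(table):
    """Two-sided Fisher p-value from the central hypergeometric distribution."""
    (n00, n01), (n10, n11) = table
    row0, col0, n = n00 + n01, n00 + n10, n00 + n01 + n10 + n11

    def prob(a):
        # Probability that the (0, 0) cell equals a when both margins are fixed.
        return math.comb(col0, a) * math.comb(n - col0, row0 - a) / math.comb(n, row0)

    smallest, largest = max(0, row0 + col0 - n), min(row0, col0)
    p_observed = prob(n00)
    return sum(prob(a) for a in range(smallest, largest + 1)
               if prob(a) <= p_observed * (1 + 1e-7))


table_2_1 = [[65, 50], [48, 18]]     # rows: no Axis I, Axis I; columns: not recovered, recovered
table_2_3 = [[11, 6], [308, 1584]]   # rows: delusions yes, no; columns: MCI, normal ageing

z = wald_z(table_2_1)
print("Wald Z = %.2f, p = %.2f" % (z, chi2_1df_pvalue(z ** 2)))
g2 = likelihood_ratio_g2(table_2_1)
print("G2 = %.2f, p = %.2f" % (g2, chi2_1df_pvalue(g2)))
x2 = pearson_x2(table_2_1)
print("X2 = %.2f, p = %.2f" % (x2, chi2_1df_pvalue(x2)))
print("Smallest expected count in Table 2.3: %.2f" % min(min(row) for row in expected_counts(table_2_3)))
print("Fisher exact p for Table 2.3: %.2g" % fisher_exact_pvalue(table_2_3))
```

For Table 2.3 the smallest estimated expected count falls below 5, which is the situation in which the exact test is preferred over the large-sample statistics.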
2.3.2 Examples

Returning to the data from the first-episode major affective disorders with psychosis study in Table 2.1, the scientific interest is in the association between Axis I comorbidity and 2-year functional recovery. Using formulas from this section (Equations 2.2–2.6), the estimated OR comparing odds of recovery between patients with and without Axis I comorbidity is 0.49, with 95% confidence interval (0.25, 0.94). We would estimate that patients with Axis I comorbidity have about one half the odds of 2-year functional recovery as patients without Axis I comorbidity. Performing a test of no association, we would obtain a Wald test statistic of Z = −2.15 with a p-value of 0.03, a LRT statistic of $G^2 = 4.81$ with a p-value of 0.03, or a score (Pearson chi-square) test statistic of $X^2 = 4.70$ with a p-value of 0.03. At the conventional α = 0.05 significance level, we would conclude from any of these large-sample test statistics that there is a significant association between Axis I comorbidity and 2-year functional recovery among patients with first-episode major affective disorders with psychosis.

For the data from the study of neuropsychiatric symptoms and MCI in Table 2.3, exact methods are more appropriate than large-sample methods due to small expected cell counts. For this example, the estimated OR comparing odds of MCI between elderly persons with and without delusions is 9.43, and the p-value for Fisher’s exact test is [...] 30, and/or the sample size in each table, $n_{j++}$, is large. An estimate of the adjusted OR is given by

$$\widehat{\mathrm{OR}}_{\text{(Adjusted)}} = \frac{\sum_{j=1}^{J} n_{j00}\,n_{j11}/n_{j++}}{\sum_{j=1}^{J} n_{j01}\,n_{j10}/n_{j++}}.$$
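The adjusted OR estimator above is a ratio of two sums over the J partial tables, and a short function makes the weighting explicit. In the sketch below the function name `mantel_haenszel_or` is an illustrative choice and the two strata are hypothetical numbers invented for the demonstration; they are not data from the chapter.

```python
def mantel_haenszel_or(strata):
    """Adjusted OR: sum_j(n_j00 * n_j11 / n_j++) divided by sum_j(n_j01 * n_j10 / n_j++)."""
    numerator = 0.0
    denominator = 0.0
    for (n00, n01), (n10, n11) in strata:
        total = n00 + n01 + n10 + n11
        numerator += n00 * n11 / total
        denominator += n01 * n10 / total
    return numerator / denominator


# Hypothetical partial tables, for illustration only (stratum-specific ORs are 3 and 4).
strata = [
    [[10, 5], [8, 12]],
    [[20, 10], [15, 30]],
]
print("Adjusted OR = %.3f" % mantel_haenszel_or(strata))
```

With a single partial table the estimator reduces to the ordinary cross-product ratio of Equation 2.2, and with several tables it lies between the smallest and largest stratum-specific ORs.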
Among the available tests for interaction (homogeneity of the ORs) is the Breslow–Day test [3], which has an approximate chi-square distribution with (J − 1) degrees of freedom. Like the Cochran–Mantel–Haenszel test, this test is based on conditioning on both margins. However, unlike the Cochran–Mantel–Haenszel test, the Breslow–Day test requires that the sample size in each partial table is large even if the number of tables is large. The calculation of the Breslow–Day test statistic is more complex than other calculations presented in the chapter, but this test is readily available within most popular statistical software.
2.4.1 Example

The Cochran–Mantel–Haenszel test statistic for the data in Table 2.7 for the test that the common OR equals one is Z = −2.12, with a p-value of 0.03; the estimate of the adjusted OR is 0.49 (with 95% confidence interval: 0.26, 0.95). From the Cochran–Mantel–Haenszel test and estimate of the adjusted OR, we conclude that, adjusting for sex, patients with Axis I comorbidity have significantly lower odds of recovery than patients without Axis I comorbidity. Finally, we note that the Breslow–Day test statistic for homogeneity of the ORs is $\chi^2_1 = 0.32$ with a p-value of 0.57. From the Breslow–Day test, we would conclude we have no evidence that the association between Axis I comorbidity and recovery differs between males and females (i.e. there is no interaction between comorbidity and sex).
2.4.2 Matched pair study design

A matched pair study design is an example of a case when the number of partial tables (J) is large, and the sample size for each partial table ($n_{j++}$) is small. The matched pair design has become increasingly popular in epidemiologic studies. In a matched case–control study, a case is selected, and then a control is matched to the case on factors that could be confounders of the association between the exposure and outcome variables. Then, as in the usual case–control study, investigators determine the exposure status (exposed, not exposed) of all subjects. For example, the data in Table 2.9, reported in Everitt [9], arose from a study designed to test the hypothesis that complications during pregnancy and birth, a known risk factor for the development of schizophrenia, are more prevalent in schizophrenics with a low age of onset (prior to age 16) compared to those with a later age of onset (after age 21). In this study, 36 subjects with low age of onset schizophrenia (cases) were matched one-to-one to 36 controls with later age of onset schizophrenia; the cases and controls were pair-matched on sex, race and socioeconomic status.

Table 2.9 Complications during pregnancy and birth for 36 matched pair cases and controls.

           Controls
Cases      Absent    Present
Absent     23        4
Present    9         0

Alternatively, in a matched prospective study, individuals are matched on exposure status. For example, individuals could be matched by sex, race and socioeconomic status, and then assigned to two different treatments and followed over time to determine whether the patients respond to the treatments. In these study designs, there are J 2 × 2 tables with one matched pair each. That is, the total sample size for each 2 × 2 table is 2. Even though each 2 × 2 table has only two subjects, assuming the number of matched pairs J is large, the Cochran–Mantel–Haenszel test can be used. In this case, the Cochran–Mantel–Haenszel test reduces to a test specific to matched pairs, McNemar’s test:

$$\chi^2 = \frac{[n_{10} - n_{01}]^2}{n_{10} + n_{01}},$$

which has an approximate chi-square distribution with 1 degree of freedom for large J, where $n_{10}$ is the number of matched pairs in which the case is exposed and the control is unexposed (or the exposed subject is a success and the unexposed subject is a failure) and $n_{01}$ is the number of matched pairs in which the case is unexposed and the control is exposed (or the exposed subject is a failure and the unexposed subject is a success). An exact test based on the binomial distribution can be used when J (and particularly $n_{10} + n_{01}$) is small.

Matching one case (or exposed individual) to one control (or unexposed individual) is desirable because it maximises the power of the study for a given total sample size. However, when the number of cases is limited but a greater number of controls are available (e.g. in a rare disease setting), study designs matching one case to multiple controls are common. Because the total number of subjects $n_{j++}$ can vary across partial tables, the Cochran–Mantel–Haenszel test can accommodate an arbitrary number of controls for each case. In addition, conditional logistic regression, which will be discussed in Section 2.6, also accommodates matched designs other than matched pairs and offers many of the advantages of logistic regression to matched studies.
2.4.3 Example

McNemar’s test can be used to test for an association between birth complications and age of onset of schizophrenia using the data from Table 2.9. For this example, $n_{10}$ is the number of pairs for which the case with earlier onset schizophrenia experienced birth complications but the control did not and equals 9, and $n_{01}$ is the number of pairs for which the control experienced birth complications but the case with earlier onset schizophrenia did not and equals 4. If the large-sample test were applied, the test statistic would be $\chi^2 = 1.92$, and the p-value would be 0.17. Because $n_{10} + n_{01}$ is small, exact methods are appropriate in this case, and would result in a p-value of 0.27. Therefore, we would conclude there is no evidence of an association between birth complications and age of onset of schizophrenia from this study.
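Both versions of McNemar’s test in this example depend only on the two discordant counts, so they are easy to reproduce. The sketch below assumes that the exact test is the two-sided binomial test of $n_{10}$ out of $n_{10} + n_{01}$ discordant pairs with success probability 0.5; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math


def mcnemar_large_sample(n10, n01):
    """McNemar chi-square statistic and its large-sample p-value (1 degree of freedom)."""
    stat = (n10 - n01) ** 2 / (n10 + n01)
    return stat, math.erfc(math.sqrt(stat / 2))


def mcnemar_exact(n10, n01):
    """Two-sided exact p-value: under H0, n10 is Binomial(n10 + n01, 0.5)."""
    m = n10 + n01
    k = min(n10, n01)
    tail = sum(math.comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)


stat, p_large = mcnemar_large_sample(9, 4)  # discordant pairs from Table 2.9
p_exact = mcnemar_exact(9, 4)
print("chi-square = %.2f, large-sample p = %.2f, exact p = %.2f" % (stat, p_large, p_exact))
```

The 23 concordant pairs in which neither subject was exposed do not enter either calculation, which is why only the 13 discordant pairs determine the result.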
2.5 Logistic regression In this section we consider how the relationships in multi-way contingency tables, and more complicated designs, can be explored using regression methods known as logistic regression. Logistic regression is one of the most widely used methods for the analysis of binary data. It is used to examine and describe the relationship between a binary response variable Yi (e.g. 1 = ‘success’ or 0 = ‘failure’) and one or more covariates for i = 1, . . . , n independent subjects. The covariates can be continuous or categorical (e.g. indicator variables). Denoting the two possible outcomes for Yi by 0 and 1, the probability distribution of the response variable is the Bernoulli distribution with probability of success pi . In common with linear regression, the primary objective of logistic regression is to model the mean of the response variable, given a set of covariates. Recall that with a binary response, the mean of Yi is simply the probability that Yi takes on the value 1, pi . However, what distinguishes logistic regression from linear regression is that the response variable is binary rather than continuous in nature. This has a number of consequences for modelling the mean of the response variable. For ease of exposition, we will first consider the simple case where there is only a single predictor variable, say xi . Generalisations to more than one predictor variable will be considered later. Since linear models play such an important and dominant role in applied statistics, it may at first seem natural to assume a linear model relating the mean of Yi to xi , pr[Yi = 1|xi ] = pi = β0 + β1 xi
(2.8)
However, expressing pi as a linear function is problematic since it violates the restriction that probabilities must lie within the range from 0 to 1. As a result, for sufficiently large or small values of xi , the linear model given by Equation 2.8 will yield probabilities outside of the permissible range. A further difficulty with the linear model for the probabilities is that we often expect a nonlinear relationship between pi and xi . For example, a 0.2 unit increase in pi might be considered more ‘extreme’ when pi = 0.1 than when pi = 0.5. In terms of ratios, the change from pi = 0.1 to pi = 0.3 represents a threefold or
ANALYSIS OF CATEGORICAL DATA: THE ODDS RATIO AS A MEASURE OF ASSOCIATION AND BEYOND
g(pi ) = β0 + β1 xi .
(2.9)
200% increase, whereas the change from pi = 0.5 to pi = 0.7 represents only a 40% increase. In a sense, the units of measurement for a probability or proportion are often not considered to be constant over the range from 0 to 1. The linear probability model given by Equation 2.8 simply does not take this into consideration when relating pi to xi. To circumvent these problems, a nonlinear transformation is usually applied to pi and the transformed probabilities are related linearly to xi. In particular, a transformation of pi, say g(pi), is chosen so that it maps the range of pi from (0, 1) to (−∞, ∞). Since there are many possible transformations, g(pi), that achieve this goal, this leads to an extensive choice of models that are all of the form

$$g(p_i) = \beta_0 + \beta_1 x_i.$$

However, the most commonly used in practice are
1 Logit or logistic function: g(pi) = log[pi/(1 − pi)]
2 Probit or inverse normal function: g(pi) = Φ⁻¹(pi), where Φ is the standardised normal cumulative distribution function
3 Complementary log–log function: g(pi) = log[−log(1 − pi)].

We note that all of these transformations are very closely related when 0.2 < pi < 0.8, and in a sense only differ in the degree of ‘tail-stretching’ outside of this range. Indeed, for most practical purposes it is not possible to discriminate between a data analysis that is based on, for example, the logit and probit functions. To discriminate empirically between probit and logistic regression would, in general, require very large numbers of observations. However, the logit function does have a number of distinct advantages over the probit and complementary log–log functions which probably account for its more widespread use in practice. Later in this chapter we will consider some of the advantages of the logit or logistic function. When the logit or logistic function is adopted, the resulting model

$$\text{logit}(p_i) = \log\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i, \qquad (2.10)$$

is known as the logistic regression model. Recall from Section 2.3.1 that if pi is the probability of success, then pi/(1 − pi) is the odds of success. Consequently, logistic regression assumes a linear relationship between the log odds of success and xi. Note that this simple model can be expressed equivalently in terms of pi,

$$p_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}. \qquad (2.11)$$

We must emphasise that Equations 2.10 and 2.11 are completely equivalent ways of expressing the logistic regression model. Expression 2.10 describes how the log odds, log[pi/(1 − pi)], has a linear relationship with xi, while expression 2.11 describes how pi has an S-shaped relationship with increasing values of β1 xi; although, in general, this relationship is approximately linear within the range 0.2 < pi < 0.8 (see Figure 2.1 for a plot of pi versus xi when β0 = 0.5 and β1 = 0.9). Observe that the expression on the right of Equation 2.11 cannot yield a value that is either negative or greater than 1. That is, the logistic transformation ensures that the predicted probabilities are restricted to the range from 0 to 1.

Fig 2.1 Plot of logistic response function (vertical axis: Probability of Success; horizontal axis: x).
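As a rough illustration of the three transformations and of the logistic response function in Equation 2.11, the following Python sketch evaluates them numerically. The function names are hypothetical helpers introduced here; the values β0 = 0.5 and β1 = 0.9 are those quoted for Figure 2.1, and the grid of x values is an assumption chosen only for display.

```python
import math
from statistics import NormalDist


def logit(p):
    """Logit link: log odds of success."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Probit link: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(p)


def cloglog(p):
    """Complementary log-log link."""
    return math.log(-math.log(1.0 - p))


def logistic_response(x, beta0, beta1):
    """Probability of success implied by Equation 2.11."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))


beta0, beta1 = 0.5, 0.9  # values quoted for Figure 2.1
for x in (-4, -2, 0, 2, 4):  # illustrative grid (assumption)
    p = logistic_response(x, beta0, beta1)
    print(f"x={x:+d}  p={p:.3f}  logit={logit(p):+.3f}  "
          f"probit={probit(p):+.3f}  cloglog={cloglog(p):+.3f}")
```

Every probability printed lies strictly between 0 and 1, and applying `logit` to the output of `logistic_response` recovers the linear predictor β0 + β1 x, which is the equivalence between Equations 2.10 and 2.11 in numerical form.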
Finally, note that

$$1 - p_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_i)},$$

so that the odds, pi/(1 − pi), is simply exp(β0 + β1 xi).

2.5.1 Interpretation of logistic regression coefficients

Next we consider the interpretation of the logistic regression coefficients, β0 and β1, in Equation 2.10. In simple linear regression, recall that the interpretation of the slope of the regression is in terms of changes in the mean of Yi for a single unit change in xi. Similarly, the logistic regression slope, β1, in Equation 2.10 has interpretation as the change in the log odds of success for a single unit change in xi. Equivalently, a single unit change in xi increases or decreases the odds of success multiplicatively by a factor of exp(β1). Also, recall that the intercept in simple linear regression has interpretation as the mean value of the response variable when xi is equal to 0. Similarly, the logistic regression intercept, β0, has interpretation as the log odds of success when xi = 0. Note that, for case–control studies, the intercept β0 cannot be validly estimated since it is determined by the proportions of ‘successes’ (Y = 1) and ‘failures’ (Y = 0) selected by the study design. However, in many studies, there is far less scientific interest in the intercept than in the slope.

For the special case where xi is dichotomous, taking values of 0 and 1, the logistic regression slope, β1, has a simple and very attractive interpretation. Consider the two possible values for pi when xi = 0 and xi = 1. Let pi(xi = j) denote the probability of success when xi = j, for j = 0, 1. Then,

$$\beta_1 = (\beta_0 + \beta_1) - \beta_0 = \text{logit}[p_i(x_i = 1)] - \text{logit}[p_i(x_i = 0)] = \log\left\{\frac{p_i(x_i = 1) \times [1 - p_i(x_i = 0)]}{p_i(x_i = 0) \times [1 - p_i(x_i = 1)]}\right\},$$
which is the log of the OR (or cross-product ratio) in the 2 × 2 table of the cross-classification of Yi and xi (see Table 2.10). Thus, exp(β1) has interpretation as the OR of the response for the two possible values of the covariate.

Table 2.10 Cross-classification probabilities for logistic regression of Y on x.

x     Y = 1                                          Y = 0                                     Total
1     p(x = 1) = exp(β0 + β1)/[1 + exp(β0 + β1)]     1 − p(x = 1) = 1/[1 + exp(β0 + β1)]       1.0
0     p(x = 0) = exp(β0)/[1 + exp(β0)]               1 − p(x = 0) = 1/[1 + exp(β0)]            1.0

The OR has many appealing properties that probably account for the widespread use of logistic regression in many areas of application. First, as was noted earlier, the OR does not change when rows and columns of the 2 × 2 table are interchanged. This implies that it is not necessary to distinguish which variable is the response and predictor variable in order to estimate the OR. Furthermore, as noted in the previous sections, a very appealing feature of the OR, exp(β1), is that it is equally valid regardless of whether the study design is prospective, cross-sectional or retrospective. That is, logistic regression provides an estimate of the same association between Yi and xi in all three study designs. Finally, in psychiatric studies where Yi typically denotes the presence or absence of a disease or disorder, the OR is often interpreted as an approximation to the RR of disease, p(xi = 1)/p(xi = 0). When the disease is rare, and pi is reasonably close to 0 in both of the risk groups (often known as the ‘rare disease’ assumption), the OR provides a close approximation to the RR. Retrospective designs are especially common in psychiatry where the possible outcomes of interest are very rare. Although the RR cannot be estimated from a retrospective study, the OR can be used to provide an approximation to the RR.

Extra care is necessary when interpreting the OR as an approximation to the RR in prospective studies. In many prospective studies the binary event is relatively common (say greater than 10%) and the ‘rare disease’ assumption no longer holds; in these settings, the OR can be a very poor and unreliable approximation to the RR and should not be given such an interpretation.
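The ‘rare disease’ argument can be checked with simple arithmetic. The sketch below compares the OR and the RR for two pairs of risks; the probabilities are hypothetical values chosen for illustration and are not taken from any study in this chapter.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


def odds_ratio(p1, p0):
    """OR comparing the group with risk p1 to the group with risk p0."""
    return odds(p1) / odds(p0)


def relative_risk(p1, p0):
    """RR comparing the group with risk p1 to the group with risk p0."""
    return p1 / p0


# Hypothetical risks: a rare outcome and a common outcome, both with RR = 2.
for label, p1, p0 in [("rare", 0.02, 0.01), ("common", 0.40, 0.20)]:
    print(f"{label:>6}: RR = {relative_risk(p1, p0):.2f}, "
          f"OR = {odds_ratio(p1, p0):.2f}")
```

With the rare outcome the two measures nearly coincide, whereas with the common outcome the OR is noticeably further from 1 than the RR, which is the situation the text cautions against.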
2.5.2 Hypothesis testing and confidence intervals for logistic regression parameters

Often, we are interested in testing for an association between the predictor in our logistic regression model and the outcome, or, equivalently, testing H0 : β1 = 0. As for 2 × 2 table methods, Wald, likelihood ratio and score statistics can be used for this test. A Wald test statistic can be obtained using the result that, when H0 is true, the estimate of β1 divided by its standard error (s.e.) approximately follows a N(0, 1) distribution in large samples. A LRT statistic can be obtained by comparing the log likelihood for the full model with the predictor included to the log likelihood for a reduced model including only the intercept β0; the former is at least as large as the latter. In large samples, when H0 is true, twice the difference between the maximised log likelihoods for the full and reduced models approximately follows a chi-square distribution with 1 degree of freedom.

Two-sided Wald confidence limits for β1 can be obtained using the result that β̂1 follows an approximate normal distribution; the confidence limits are given by the formula β̂1 ± zα/2 × s.e.(β̂1). Just as we can exponentiate β̂1 to get an estimate of the OR comparing the odds of disease for a unit change in x1, we can exponentiate the lower and upper limits of the confidence interval for β1 to get a confidence interval for the OR.

Estimates of β1 (or, alternatively, its associated OR), its standard error and the log likelihood for the model are available from the output from logistic regression routines from popular statistical software. Test statistics and p-values for tests that β1 = 0 and Wald 95% confidence intervals are often also included automatically. Likelihood ratio and score test statistics can sometimes be requested.
Although Wald tests and confidence intervals are standard output from software for fitting logistic regression, we caution the reader that in certain circumstances the performance of Wald tests (and confidence intervals) can be somewhat irregular and lead to misleading conclusions. As a result, we recommend that LRTs (and confidence intervals) be used whenever possible.
2.5.3 Example: Logistic regression with a single binary covariate

We now return to the Table 2.1 data from the first-episode major affective disorders with psychosis study and show that we can obtain identical results using large-sample methods for 2 × 2 contingency tables (as reported in Section 2.3.2) and logistic regression. Recall that our interest is in the association between Axis I comorbidity and 2-year functional recovery in this group of patients. Using logistic regression, we fit the model:

$$\text{logit}[\text{pr}(\text{Recovery}_i = 1)] = \beta_0 + \beta_1 \times \text{Comorbidity}_i, \qquad (2.12)$$

where Recoveryi is an indicator variable coded 1 if the ith subject recovered and 0 otherwise, and Comorbidityi is an indicator variable coded 1 if the ith subject had Axis I Comorbidity and 0 otherwise. The following are the results:

              β̂         s.e.(β̂)   Z        p > |Z|   95% CI
Intercept     −0.2624   0.1881    −1.39    0.163     −0.6310, 0.1063
Comorbidity   −0.7185   0.3343    −2.15    0.032     −1.3737, −0.0632

The estimate of the OR comparing the odds of recovery in patients with and without Axis I comorbidities is exp(−0.7185) = 0.49, and the 95% confidence interval for the OR is exp(−1.3737, −0.0632) = (0.25, 0.94). The Wald test statistic for no association (or, equivalently, for H0 : β1 = 0 or H0 : OR = 1), which appears in the table, is Z = −2.15, with an accompanying p-value of 0.03. We can obtain the LRT statistic for no association by fitting the model with the intercept as the only covariate, which has a log-likelihood of −119.8066, and comparing it to the log likelihood from the model with both the intercept and comorbidity as covariates, −117.4037. The LRT statistic is χ²₁ = 2 × [−117.4037 − (−119.8066)] = 4.81. The associated p-value, which can be obtained using statistical software or estimated from chi-square distribution tables, is 0.03. These results and their interpretation are identical to those obtained using methods for 2 × 2 contingency tables and reported in Section 2.3.2.
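The arithmetic in this example can be retraced from the reported estimate, standard error and log likelihoods. The following is a minimal sketch, not output from any particular statistical package; the helper names are hypothetical, and the chi-square tail probability for 1 degree of freedom is computed with the complementary error function, an identity assumed here rather than stated in the text.

```python
import math
from statistics import NormalDist


def wald_summary(beta_hat, se, level=0.95):
    """Wald Z statistic, two-sided p-value, and OR with confidence limits."""
    z = beta_hat / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    lower = beta_hat - z_crit * se
    upper = beta_hat + z_crit * se
    return {
        "z": z,
        "p": p_value,
        "or": math.exp(beta_hat),
        "or_ci": (math.exp(lower), math.exp(upper)),
    }


def lrt_1df(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and p-value on 1 degree of freedom."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    # For 1 d.f., P(chi-square > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Values reported in the text for the comorbidity coefficient.
wald = wald_summary(beta_hat=-0.7185, se=0.3343)
lrt_stat, lrt_p = lrt_1df(loglik_full=-117.4037, loglik_reduced=-119.8066)
print(wald)
print(lrt_stat, lrt_p)
```

Rounded to two decimal places, the quantities agree with those quoted above: Z = −2.15, OR = 0.49 with limits (0.25, 0.94), and an LRT statistic of 4.81.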
2.5.4 Multiple logistic regression

So far, we have only considered the simple case where there is a single covariate xi. Next, we consider the extensions of Equations 2.10 and 2.11 to the case where there are two or more covariates. Recall that, in Section 2.4, we applied methods for stratified contingency tables to the first-episode major affective disorders with psychosis study data to test that the OR comparing patients with and without comorbidities adjusted for sex equals 1. Methods for stratified contingency tables are useful when adjusting for a small number of categorical covariates. However, multiple logistic regression has important advantages over stratified contingency table methods when the number of categorical covariates is larger or when we want to adjust for quantitative covariates. For example, using the first-episode data, we may want to test that the OR adjusted for both sex and age equals 1 and to obtain an estimate of the adjusted OR without classifying age into arbitrary categories. When there are many covariates, the logistic regression model becomes

$$\log\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK}, \qquad (2.13)$$

where xi1, xi2, . . . , xiK are the K covariates. The logistic regression coefficients in Equation 2.13 have the following interpretations. The logistic regression intercept, β0, now has interpretation as the log odds of success when all covariates equal 0, that is when xi1 = xi2 = · · · = xiK = 0. Each of the logistic regression slopes, βk (for k = 1, . . . , K), has interpretation as the change in the log odds of success for a single unit change in xik given that all of the other covariates remain constant. Note that the appealing property of logistic regression that the same OR can be estimated from either a prospective or retrospective study design readily generalises when xik is quantitative rather than dichotomous, and also when there are two or more predictor variables.

Methods for hypothesis testing and constructing confidence intervals also generalise easily from the predictor in a simple logistic regression model (β1) to a predictor in a multiple logistic regression model (βk). Expressions for Wald test statistics and confidence intervals for βk can be obtained by substituting βk for β1 in the relevant portions of Section 2.5.2. LRTs of βk = 0 can be constructed by comparing the fit of the full model with xik included to the fit of a reduced model with all covariates except xik included. Twice the difference between the maximised log likelihood for the full model and the maximised log likelihood for the reduced model still approximately follows a chi-square distribution with one degree of freedom.

2.5.5 Example: Multiple logistic regression

To obtain an estimate of the OR for comorbidity adjusted for sex and age and to test that the adjusted OR equals one, we fit the following multiple logistic regression model to the first-episode major affective disorders with psychosis data:

$$\text{logit}[\text{pr}(\text{Recovery}_i = 1)] = \beta_0 + \beta_1 \times \text{Comorbidity}_i + \beta_2 \times \text{Male}_i + \beta_3 \times \text{Age}_i, \qquad (2.14)$$

where Malei is an indicator variable coded 1 if the ith subject is male and 0 if the ith subject is female and Agei is the age of the ith subject in decades. The following results are obtained:
              β̂         s.e.(β̂)   Z        p > |Z|   95% CI
Intercept     −1.4019   0.4955    −2.83    0.005     −2.3730, −0.4307
Comorbidity   −0.4845   0.3496    −1.39    0.166     −1.1697, 0.2008
Male           0.0049   0.3243     0.01    0.988     −0.6307, 0.6404
Age            0.3107   0.1094     2.84    0.004      0.0963, 0.5250
The estimate of the OR for comorbidity adjusted for sex and age is exp(−0.4845) = 0.62, and its 95% confidence interval is exp(−1.1697, 0.2008) = (0.31, 1.22). Holding sex and age constant, we estimate that the odds of two-year functional recovery is 38% lower for patients with Axis I comorbidity when compared to patients without Axis I comorbidity. However, note from the 95% confidence interval that our data are consistent with odds of recovery up to 22% higher for patients with Axis I comorbidity. In addition, the Wald test statistic for testing that the adjusted OR equals one is Z = −1.39 with an associated p-value of 0.17, and the LRT statistic is χ²₁ = 1.95 with an associated p-value of 0.16. Using either test, we find no significant evidence of an association between Axis I comorbidity and 2-year functional recovery after adjusting for sex and age.

We can also use the results from the multiple logistic regression to obtain estimates and test statistics for the other covariates in the model. The estimated OR comparing odds of recovery in males versus females is 1.00 (95% confidence interval: 0.53, 1.90), and we conclude from the Wald test that there is no evidence of an association between sex and recovery after adjusting for Axis I comorbidity and age (Z = 0.01, p = 0.99). On the other hand, the estimated OR comparing odds of recovery for a 10-year age increase is 1.36 (95% confidence interval: 1.10, 1.69). Adjusting for Axis I comorbidity and sex, the odds of two-year functional recovery increases with age (Z = 2.84, p = 0.004); for every decade age increase, we estimate that the odds of recovery is 36% higher.
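To make the phrase ‘holding sex and age constant’ concrete, the sketch below plugs the reported coefficients into Equation 2.14 and compares predicted probabilities for two patients who differ only in comorbidity. The patient profile (a 30-year-old woman) and the helper names are assumptions made for illustration only.

```python
import math

# Coefficient estimates reported in the text for Equation 2.14.
COEF = {
    "intercept": -1.4019,
    "comorbidity": -0.4845,
    "male": 0.0049,
    "age_decades": 0.3107,
}


def predicted_probability(comorbidity, male, age_decades, coef=COEF):
    """Predicted probability of recovery from the fitted linear predictor."""
    eta = (coef["intercept"]
           + coef["comorbidity"] * comorbidity
           + coef["male"] * male
           + coef["age_decades"] * age_decades)
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical profile: female, aged 30 years (3 decades).
p_without = predicted_probability(comorbidity=0, male=0, age_decades=3.0)
p_with = predicted_probability(comorbidity=1, male=0, age_decades=3.0)
adjusted_or = odds(p_with) / odds(p_without)
print(p_without, p_with, adjusted_or)
```

Whatever values of sex and age are chosen, the ratio of the two odds equals exp(−0.4845), which is why a single adjusted OR summarises the comorbidity effect in this model.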
2.5.6 Categorical predictors with more than two levels in logistic regression

Section 2.3.3 presented contingency table methods that could be used to test for independence with predictors or outcomes with more than two categories. This section describes how logistic regression accommodates predictors with more than two categories, either with or without adjustment for additional covariates. (A later section describes extensions of logistic regression that accommodate outcomes with more than two categories.)

For K unordered categories, a test for independence can be obtained by adding K − 1 indicator or ‘dummy’ variables as covariates in the regression, where the kth indicator variable is coded 1 for subjects in the kth category and 0 for all other subjects (so that subjects in the remaining ‘reference’ category are coded 0 for all K − 1 indicator variables). A LRT for no association can be conducted by comparing the log likelihood for the model containing the predictor to the log likelihood for the model with the K − 1 indicator variables corresponding to the predictor removed; the LRT statistic follows a chi-square distribution with K − 1 degrees of freedom. Wald and score hypothesis tests are also available. However, when a predictor has three or more categories, the Wald test of no association is sometimes not available from standard logistic regression output and must be requested.

For ordered categories, a test for independence can be conducted by assigning scores to each level of the predictor and then using the score as a covariate in the regression model. For example, the scores 1, 2 and 3 could be assigned to the categories mild, moderate and severe. The Z statistic for the covariate then corresponds to a test for no association, and interpretation of the corresponding regression parameter is similar to the interpretation of a regression parameter for a quantitative predictor. For example, the OR for the severity predictor would compare the odds of the outcome for a one category increase in severity, either moderate versus mild or severe versus moderate. This approach is most appropriate when the association between the score and outcome is approximately linear.
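The two coding schemes described above can be sketched in a few lines of Python. The function names and the small input lists are hypothetical; the category labels are borrowed from the examples in the text.

```python
def dummy_code(values, levels, reference):
    """Return K - 1 indicator columns, omitting the reference category."""
    non_reference = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in non_reference]
            for value in values]
    return rows, non_reference


def ordinal_scores(values, ordered_levels):
    """Assign scores 1, 2, ... to ordered categories."""
    score = {level: i + 1 for i, level in enumerate(ordered_levels)}
    return [score[value] for value in values]


# Unordered coding with 'chronic' as the reference category.
rows, names = dummy_code(
    ["chronic", "subacute", "acute"],
    levels=["chronic", "subacute", "acute"],
    reference="chronic",
)
print(names, rows)

# Ordered coding for a severity predictor.
print(ordinal_scores(["mild", "severe", "moderate"],
                     ["mild", "moderate", "severe"]))
```

Subjects in the reference category receive 0 on every indicator, so each coefficient compares one category with the reference; the ordinal version uses a single covariate and therefore a single coefficient.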
2.5.7 Example: Logistic regression with a three-level predictor

In Section 2.3.4, we performed tests for independence between type of onset of first-episode affective disorder with psychosis (categorised as chronic, subacute or acute) and 2-year functional recovery. Equivalent tests can be performed using logistic regression by fitting the model:

$$\text{logit}[\text{pr}(\text{Recovery}_i = 1)] = \beta_0 + \beta_1 \times \text{Subacute}_i + \beta_2 \times \text{Acute}_i, \qquad (2.15)$$

where Subacutei is an indicator variable coded 1 if the ith subject had subacute onset and 0 otherwise, Acutei is an indicator variable coded 1 if the ith subject had acute onset and 0 otherwise, and chronic onset is the reference category. The following results are obtained:

            β̂         s.e.(β̂)   Z        p > |Z|   95% CI
Intercept   −0.6061   0.2930    −2.07    0.039     −1.1804, −0.0318
Subacute     0.1071   0.4247     0.25    0.801     −0.7253, 0.9396
Acute        0.2007   0.3761     0.53    0.594     −0.5364, 0.9377

Exponentiating the parameters for subacute and acute onset provides estimates of the ORs comparing odds of recovery for subacute and acute onset respectively versus chronic onset, and the Z statistics for these two parameters are for separate tests that these ORs are equal to 1. However, our primary interest is in the overall test for independence between onset and recovery. The log likelihood for this model is −113.421, and the log likelihood for the model with an intercept only is −113.565. The resulting LRT statistic for independence is χ²₂ = 0.29 (i.e. twice the difference in log likelihoods), with an associated p-value of 0.87. These results are identical to the LRT results from Section 2.3.4, and our conclusions are the same; that is, there is no evidence of an association between type of onset and 2-year functional recovery. In this case, the Wald test statistic is also χ²₂ = 0.29 with a p-value of 0.87.

2.5.8 Interactions in logistic regression

In Section 2.4, we introduced a test for interaction using methods for contingency tables. Recall that an interaction between two predictor variables is present when the OR for one predictor differs according to the value of the other predictor. For example, for the data from Table 2.7, we would state that there is an interaction between comorbidity and sex if the OR comparing odds of recovery with and without comorbidity differs between males and females. We can allow for interaction in logistic regression models by multiplying the covariates for the predictors involved in the interaction and adding them as additional covariates to the regression model. For example, to test for an interaction between comorbidity and sex, we would use the model:

$$\text{logit}[\text{pr}(\text{Recovery}_i = 1)] = \beta_0 + \beta_1 \times \text{Comorbidity}_i + \beta_2 \times \text{Male}_i + \beta_3 \times \text{Comorbidity}_i \times \text{Male}_i. \qquad (2.16)$$

For this model, exp(β1) is the OR comparing odds of recovery in female patients with and without comorbidity, and exp(β1 + β3) is the OR comparing odds of recovery in male patients with and without comorbidity. These two ORs are equal if and only if β3 = 0; therefore, the test of H0 : β3 = 0 is a test of no interaction between comorbidity and sex. Using logistic regression, tests for interaction are also straightforward for quantitative predictors, categorical predictors with more than two levels, and interactions among more than two predictors.
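The interpretation of β1 and β1 + β3 follows from Equation 2.16 by differencing the log odds with and without comorbidity within each sex. The short derivation below is a worked step added for clarity, using the same symbols as Equation 2.16.

```latex
\begin{align*}
\text{Females } (\text{Male}_i = 0):\quad
  & \text{logit}[\text{pr}(\text{Recovery}_i = 1 \mid \text{Comorbidity}_i = 1)]
    - \text{logit}[\text{pr}(\text{Recovery}_i = 1 \mid \text{Comorbidity}_i = 0)] \\
  &\quad = (\beta_0 + \beta_1) - \beta_0 = \beta_1, \\
\text{Males } (\text{Male}_i = 1):\quad
  & \text{logit}[\text{pr}(\text{Recovery}_i = 1 \mid \text{Comorbidity}_i = 1)]
    - \text{logit}[\text{pr}(\text{Recovery}_i = 1 \mid \text{Comorbidity}_i = 0)] \\
  &\quad = (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3.
\end{align*}
```

Exponentiating each difference gives the sex-specific ORs, and the two coincide exactly when β3 = 0.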
2.5.9 Example: Logistic regression with interaction

Fitting the model from Equation 2.16 to the data from Table 2.7 results in the following output:

                     β̂         s.e.(β̂)   Z        p > |Z|   95% CI
Intercept            −0.2657   0.2771    −0.96    0.338     −0.8089, 0.2775
Comorbidity          −0.4881   0.5105    −0.96    0.339     −1.4887, 0.5125
Male                  0.0062   0.3774     0.02    0.987     −0.7335, 0.7459
Comorbidity × Male   −0.3838   0.6771    −0.57    0.571     −1.7110, 0.9433

The estimated OR comparing odds of 2-year functional recovery for patients with and without Axis I comorbidity is exp(−0.4881) = 0.61 for females and exp(−0.4881 − 0.3838) = 0.42 for males. Note that we can calculate a confidence interval for the OR for females but not for males from the information in the output; because the OR for males is calculated by summing two parameter estimates, the covariance between the two parameter estimates would be required in order to calculate the confidence interval. The Wald test statistic for no interaction is Z = −0.57, with an associated p-value of 0.57. The LRT statistic (obtained by comparing the log-likelihood for this model to the log-likelihood for the model with covariates for comorbidity and sex but not their interaction) is 0.32 with an associated p-value of 0.57. We conclude that there is no evidence of an interaction between Axis I comorbidity and sex; that is, the data are consistent with the OR comparing the odds of functional recovery for patients with and without Axis I comorbidity being the same for males and females. These results agree with the Breslow–Day test results from Section 2.4.1.
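The sex-specific ORs, and the reason a covariance is needed for the male confidence interval, can be sketched as follows. The coefficient values are those reported in the output above; the function `ci_for_sum` and its `cov13` argument are hypothetical, and the covariance itself is not reported in the text, so it would have to be taken from the fitted model's variance–covariance matrix.

```python
import math
from statistics import NormalDist

# Estimates and standard errors reported in the output above.
B_COMORBIDITY, SE_COMORBIDITY = -0.4881, 0.5105
B_INTERACTION, SE_INTERACTION = -0.3838, 0.6771

or_female = math.exp(B_COMORBIDITY)
or_male = math.exp(B_COMORBIDITY + B_INTERACTION)
print(or_female, or_male)


def ci_for_sum(b1, se1, b3, se3, cov13, level=0.95):
    """Wald limits for exp(b1 + b3); cov13 must come from the fitted model."""
    variance = se1 ** 2 + se3 ** 2 + 2.0 * cov13
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = z_crit * math.sqrt(variance)
    estimate = b1 + b3
    return math.exp(estimate - half_width), math.exp(estimate + half_width)
```

Setting the second coefficient, its standard error and the covariance to zero reduces `ci_for_sum` to the ordinary Wald interval for a single coefficient, which is why the female interval can be read directly from the output while the male interval cannot.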
2.5.10 Goodness-of-fit

When a multiple logistic regression model has been used to draw conclusions from a study, we should check the fit of the model to the study data. One way to check the fit of a model is to use statistical tests for goodness-of-fit; in the absence of significant evidence of poor fit from these test statistics, we conclude the fit of our model is adequate. The deviance (based on the likelihood ratio statistic) or the Pearson chi-square can be used as a goodness-of-fit statistic if, at each observed covariate pattern, the data can be grouped. That is, if there are ni subjects with the same covariate values (and hence the same Bernoulli distribution), they can be treated as a binomial sample and a test of goodness-of-fit can be based on the comparison of the observed and expected (or predicted) counts in each covariate pattern. Alternatively, if the covariates are quantitative rather than categorical, Hosmer and Lemeshow [10] proposed a goodness-of-fit statistic similar to the Pearson chi-square, which can be calculated after grouping individuals on the basis of having similar values of the predicted probability pi.

Evidence of poor fit can reflect a variety of problems with our model, such as an inappropriate choice of transformation function, failure to include important interaction terms, or inappropriate assumption of linearity for quantitative or ordered categorical covariates, and is an indication that we should revisit the assumptions made during the modelling process.
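A statistic of the kind proposed by Hosmer and Lemeshow can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' own code: subjects are sorted by predicted probability and split into roughly equal-sized groups, the default of 10 groups and the reference degrees of freedom (groups − 2) are conventional choices not given in the text, and the function name is hypothetical.

```python
def hosmer_lemeshow(y, p_hat, groups=10):
    """Grouped Pearson-type statistic comparing observed and expected counts.

    y      : list of 0/1 outcomes
    p_hat  : list of predicted probabilities from a fitted model
    Returns (statistic, reference degrees of freedom).
    """
    order = sorted(range(len(y)), key=lambda i: p_hat[i])
    n = len(order)
    statistic = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue  # skip empty groups when n < groups
        n_g = len(idx)
        observed = sum(y[i] for i in idx)
        expected = sum(p_hat[i] for i in idx)
        mean_p = expected / n_g
        # Assumes 0 < mean_p < 1; otherwise the denominator is zero.
        statistic += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return statistic, groups - 2
```

The statistic would then be compared with a chi-square distribution on the returned degrees of freedom; large values indicate that observed and predicted counts disagree within the risk groups.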
2.6 Advanced topics

In this section we briefly review a number of more advanced topics that can be considered extensions of the standard logistic regression model. Many of these methods have been somewhat slow to move into the mainstream of psychiatric research. However, with their recent implementation in widely available statistical software, these methods are starting to be more widely applied.
2.6.1 Conditional logistic regression

The previous section showed that logistic regression can be used to perform analyses similar to those using contingency table methods but with more complex extensions and applications. This section introduces a related technique known as conditional logistic regression, which extends many of the benefits of logistic regression to studies with matched designs. In matched study designs individuals are stratified on the basis of variables thought to be related to the outcome variable of interest. For example, age and years of education are two variables commonly used to define strata in many psychiatric studies. The conditional logistic regression model used to analyse matched data is

$$\log\left[\frac{p_{ij}}{1 - p_{ij}}\right] = \alpha_{i0} + \beta_1 x_{ij1} + \beta_2 x_{ij2} + \cdots + \beta_K x_{ijK}. \qquad (2.17)$$
Note that this is similar to the standard logistic regression model, except that the probability of success and the predictors are now indexed by i and j instead of i alone to indicate that they apply to the jth individual from the ith stratum (e.g. matched pair). Note also that the common intercept β0 in standard logistic regression has been replaced in Equation 2.17 by a stratum-specific intercept αi0. Parameter estimates from conditional logistic regression can be interpreted in a similar way as parameter estimates from standard (or unconditional) logistic regression, and conditional logistic regression offers the same capabilities as standard logistic regression with a few exceptions. One is that the stratum-specific intercepts cannot be estimated (and will not be included in conditional logistic regression output). This is because the method of estimation (discussed later) effectively eliminates them to ensure that the β’s are estimated without any bias. This is usually not a concern since, as for standard logistic regression, these intercepts are generally not of scientific interest. Second, because the model includes stratum-specific intercepts, the β’s now have
stratum-specific interpretations in terms of changes in the log odds of success for within-stratum changes in the covariates. For example, β1 has interpretation in terms of changes in the log odds of success for a single unit change in xij1 within the ith stratum (i.e. comparing two individuals within the same stratum that happen to differ by one unit in the covariate). Third, the associations between any variables used for matching (or any other covariates that are constant within strata) and the outcome cannot be quantified. This is because the method of estimation is based entirely on variation within a stratum; conditional logistic regression cannot be used to estimate the effect of a covariate that varies only between strata (but not within a stratum). Returning to the example from Section 2.4.2, a study examining the association between birth complications and age of onset of psychosis that matches on sex, conditional logistic regression cannot quantify the association between sex and age of onset of psychosis because, by study design, sex does not vary within each stratum. However, it is still possible to test for interactions between variables used for matching and other predictors. Next we consider estimation of the model parameters. One approach to fitting this model would be to attempt to estimate all of the parameters, including the stratum-specific intercepts. However, for matched designs, the number of strata grows as the sample size increases, which means that the number of parameters would be large relative to the sample size no matter how big a sample was collected. For example, in a matched-pair design with n pairs (i.e. two subjects in each stratum), such an analysis would require the estimation of n + K parameters from a sample of only 2n observations. 
It should not be surprising that this proliferation of stratum-specific intercepts causes problems for estimation; it also causes problems with the properties of standard maximum likelihood estimates of the model parameters. To avoid these problems, conditional logistic regression maximises a likelihood for the conditional distribution (and hence the term ‘conditional’ logistic regression) that eliminates these stratum-specific intercepts and bases estimation of the associations between the predictors and outcome entirely on information from within the strata.
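For the simplest matched design, one case and one control per stratum with a single covariate, the elimination of the stratum-specific intercepts can be seen directly. The sketch below uses the standard form of the conditional likelihood for 1:1 matched pairs, which is assumed here rather than derived in the text; the pair counts are hypothetical, and the grid search is only a transparent stand-in for the optimisers used by statistical software.

```python
import math


def pair_conditional_loglik(beta, pairs):
    """Conditional log likelihood for 1:1 matched pairs.

    pairs : list of (x_case, x_control) covariate values.
    Each pair contributes exp(beta*x_case) / (exp(beta*x_case) + exp(beta*x_control));
    the stratum intercept alpha_i0 cancels from numerator and denominator.
    """
    total = 0.0
    for x_case, x_control in pairs:
        total += -math.log(1.0 + math.exp(-beta * (x_case - x_control)))
    return total


# Hypothetical matched-pair data with a binary exposure.
pairs = [(1, 0)] * 15 + [(0, 1)] * 5 + [(1, 1)] * 10 + [(0, 0)] * 20

# Crude grid search over beta in [-3, 3] in steps of 0.001.
grid = [i / 1000.0 for i in range(-3000, 3001)]
beta_hat = max(grid, key=lambda b: pair_conditional_loglik(b, pairs))
print(beta_hat, math.exp(beta_hat))
```

Pairs in which case and control share the same covariate value contribute a constant and carry no information about β, so the estimate depends only on the discordant pairs; with these hypothetical counts it is close to log(15/5).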
2.6.2 Exact logistic regression

Like many methods for contingency tables, logistic regression as traditionally implemented (i.e. maximum likelihood logistic regression) relies on large sample theory for the validity of its results. Maximum likelihood logistic regression can perform poorly when the sample size is small, the probability of success is near one or zero, or we have an insufficient number of successes or failures for certain combinations of our covariates. Error messages from statistical software, very large or small parameter estimates, or very wide confidence intervals can sometimes alert us to these problems, though logistic regression can still have poor performance due to sparse data even when the problem is not evident from the distribution of individual covariates or from an examination of the results [11].

Exact logistic regression [12] is a method for fitting logistic regression models that produces valid estimates, test statistics and confidence intervals even for small datasets or sparse data. For example, exact logistic regression was used to study psychiatric and social predictors of attempted suicides in a sample of Indian women [13]. In this study, the total number of participants was fairly large (2494), but the number of suicide attempts was relatively small (19). As a result, very small numbers of successes (suicide attempts) were observed for some predictors, and exact logistic regression was an appropriate analysis strategy.

The relationship of exact logistic regression to maximum likelihood logistic regression is similar in some ways to the relationship of Fisher’s exact test to large-sample methods for R × C contingency tables. However, whereas Fisher’s exact test conditions on the row and column totals in order to derive the distribution of the test statistic, exact logistic regression conditions on the so-called sufficient statistics for the remaining parameters in the model when estimating each parameter.
The sufficient statistics for the parameters are determined by the number of successes for different values of the corresponding covariates. Like other exact methods, exact logistic regression guarantees that tests conducted at significance level α have a type I error rate less than or equal to α, and that 95% confidence intervals have at least 95% coverage even for small sample sizes and sparse data.
It can be implemented by several popular statistical software packages, and parameter estimates and confidence intervals have an interpretation identical to those for maximum likelihood logistic regression. Some disadvantages are that it may be overly conservative in settings when maximum likelihood logistic regression performs adequately and that it can be computationally intensive, especially when quantitative covariates or a large number of categorical covariates are included in the model. In principle, exact logistic regression can be applied in settings with multiple covariates; however, greater care is required when attempting to fit complex models. When feasible, exact logistic regression is an attractive alternative to maximum likelihood logistic regression in small sample and sparse data settings.
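The conditioning idea can be illustrated in the simplest possible case, a single binary covariate, where conditioning on the total number of successes leaves a hypergeometric distribution for the number of successes among subjects with x = 1 under H0 : β1 = 0. This is the same distribution that underlies Fisher’s exact test. The sketch below is an assumption-laden illustration of that special case only, with hypothetical counts and a hypothetical function name; it is not a general exact logistic regression routine.

```python
from math import comb


def exact_conditional_pvalue(successes_x1, n_x1, successes_x0, n_x0):
    """Two-sided exact p-value for H0: beta1 = 0 with one binary covariate.

    Conditions on the total number of successes and sums the probabilities
    of all outcomes no more likely than the one observed.
    """
    total_successes = successes_x1 + successes_x0
    n = n_x1 + n_x0

    def prob(t):
        # Hypergeometric probability of t successes among the x = 1 group.
        return comb(n_x1, t) * comb(n_x0, total_successes - t) / comb(n, total_successes)

    lowest = max(0, total_successes - n_x0)
    highest = min(n_x1, total_successes)
    p_observed = prob(successes_x1)
    return sum(prob(t) for t in range(lowest, highest + 1)
               if prob(t) <= p_observed * (1.0 + 1e-9))


# Hypothetical sparse data: 3 of 5 successes when x = 1, 0 of 5 when x = 0.
print(exact_conditional_pvalue(3, 5, 0, 5))
```

Because the p-value is computed from an exact discrete distribution rather than a large-sample approximation, it remains valid with counts this small, at the cost of some conservatism.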
2.6.3 Multinomial regression models

A major focus of this chapter has been on logistic regression modelling of a binary outcome. For some applications, however, the outcome variable of interest is categorical with more than two levels. For example, in a study of trauma in a high-risk African-American sample [14], response to trauma was categorised as currently ill (current psychiatric disorder), recovered (past history of one or more psychiatric disorders) or resilient (no history of psychiatric disorder). Predictors of response to severe trauma in this population were examined using polytomous logistic regression. In this section we introduce multinomial models for categorical outcomes with more than two levels by first considering the case of such a nominal categorical variable.

Suppose the outcome for individual i is categorical with J levels and let Yi equal 1 with probability pi1, equal 2 with probability pi2, and so on. In general, Yi equals j with probability pij, j = 1, . . . , J. We can introduce some additional notation that will make the extension of logistic regression to this setting more transparent. Suppose we let Yij equal 1 if Yi = j, and equal 0 otherwise, for j = 1, . . . , J. Then, pij = pr[Yi = j|xi1, . . . , xiK] = pr[Yij = 1|xi1, . . . , xiK]. When J > 2, the extension of the logistic regression model known as polytomous (or multinomial) logistic regression is appropriate. In polytomous logistic regression, we form J − 1
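non-redundant logits:
$$\log\left(\frac{\mathrm{pr}[Y_i = j \mid x_{i1}, \ldots, x_{iK}]}{\mathrm{pr}[Y_i = J \mid x_{i1}, \ldots, x_{iK}]}\right) = \log\left(\frac{\mathrm{pr}[Y_{ij} = 1 \mid x_{i1}, \ldots, x_{iK}]}{\mathrm{pr}[Y_{iJ} = 1 \mid x_{i1}, \ldots, x_{iK}]}\right) = \log\left(\frac{p_{ij}}{p_{iJ}}\right) = \beta_{j0} + \beta_{j1} x_{i1} + \cdots + \beta_{jK} x_{iK}, \quad j = 1, \ldots, J - 1,$$
where the regression parameters, $\beta_{j0}, \beta_{j1}, \ldots, \beta_{jK}$, can be different for each level $j$. In this notation, the last category $J$ is referred to as the ‘reference’ category. It can also be shown that, for $j = 1, \ldots, J - 1$,
$$p_{ij} = \frac{\exp[\beta_{j0} + \beta_{j1} x_{i1} + \cdots + \beta_{jK} x_{iK}]}{1 + \sum_{l=1}^{J-1} \exp[\beta_{l0} + \beta_{l1} x_{i1} + \cdots + \beta_{lK} x_{iK}]}.$$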
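As a rough illustration of the expression for the polytomous probabilities, the following minimal Python sketch computes all J category probabilities for one individual from the J − 1 sets of regression parameters, with category J as the reference. The function name and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from any study cited in this chapter.

```python
import math

def polytomous_probabilities(x, betas):
    """Return [p_1, ..., p_J] for one individual.

    x     : list of K covariate values (x_i1, ..., x_iK)
    betas : list of J-1 coefficient lists [b_j0, b_j1, ..., b_jK], one per
            non-reference category; category J is the reference.
    """
    # Linear predictor for each non-reference category j = 1, ..., J-1
    etas = [b[0] + sum(bk * xk for bk, xk in zip(b[1:], x)) for b in betas]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    probs = [math.exp(eta) / denom for eta in etas]
    probs.append(1.0 / denom)  # reference category J
    return probs

# Hypothetical coefficients for J = 3 categories and K = 1 covariate
example_betas = [[0.5, -1.0],   # logit of category 1 versus category 3
                 [-0.2, 0.3]]   # logit of category 2 versus category 3
p = polytomous_probabilities([1.0], example_betas)
print(p, sum(p))
```

The probabilities sum to one by construction, and the log of each ratio p[j] / p[-1] recovers the corresponding linear predictor, which is one way to check that the two displayed forms of the model agree.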
Note that this polytomous logistic regression model is more appropriate when the categorical variable is nominal. In other settings, the categorical outcome is ordinal. For example, in a study of predictors of remission in patients over age 60 treated for depression, the outcome was categorised as no remission, partial remission or full remission [15]. For ordinal outcomes a variety of regression models can be used, including mean score models and models of a logistic regression form. The logistic models for ordinal data include the continuation odds model, the adjacent category logit and the cumulative logit models [16]. Here, we briefly discuss the cumulative logistic proportional odds model, one of the most widely used regression models for ordinal data. To formulate an ordinal response model, we form logits of the cumulative probabilities. Recall, $p_{ij} = \mathrm{pr}[Y_{ij} = 1 \mid x_{i1}, \ldots, x_{iK}]$. We define the cumulative probabilities as $F_{ij} = \mathrm{pr}[Y_i \le j \mid x_{i1}, \ldots, x_{iK}] = p_{i1} + \cdots + p_{ij}$. In the previous example, $F_{i2}$ is the probability of a response of (i) no remission or (ii) partial remission. The logit of $F_{ij}$,
$$\mathrm{logit}(F_{ij}) = \log\left(\frac{F_{ij}}{1 - F_{ij}}\right) = \log\left(\frac{p_{i1} + \cdots + p_{ij}}{p_{i,j+1} + \cdots + p_{iJ}}\right),$$
is often referred to as the ‘cumulative’ logit, and these cumulative logits can be related to covariates in the following proportional odds model,
$$\mathrm{logit}(F_{ij}) = \alpha_j + \beta_1 x_{i1} + \cdots + \beta_K x_{iK}.$$
Note that the original multinomial probabilities can be expressed in terms of the cumulative probabilities via $p_{ij} = F_{ij} - F_{i,j-1}$. Inferences about the ‘cumulative logits’ or ‘cumulative’ log ORs can be made similarly to inferences for standard logistic regression. For example, with remission category as the outcome, if $x_{i1}$ is an indicator for comorbid dysthymia (an important predictor in the Hybels et al. [15] study), then, with the cumulative probabilities defined as above, $\exp(\beta_1)$ is the OR comparing the odds of no remission versus partial or full remission for patients with and without comorbid dysthymia. A property of the proportional odds model is that $\exp(\beta_1)$ is also the OR comparing the odds of no or partial remission versus full remission for the same two groups of patients. Equivalently, $\exp(-\beta_1)$ is the common OR for the more favourable outcome at each cut-point: full or partial remission versus no remission, and full remission versus partial or no remission.
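The proportional odds property can be checked numerically. The sketch below, in Python, builds the cumulative probabilities from a set of intercepts and a shared slope, recovers the category probabilities by differencing, and then confirms that the cumulative OR is the same at every cut-point. The intercepts, the slope and the coding (three ordered categories and a single binary covariate) are hypothetical values assumed purely for illustration.

```python
import math

def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def proportional_odds_probabilities(x, alphas, betas):
    """Category probabilities p_1, ..., p_J under logit(F_j) = alpha_j + beta'x.

    alphas : J-1 increasing intercepts, one per cut-point
    betas  : K slopes shared by all cut-points
    """
    eta = sum(b * xk for b, xk in zip(betas, x))
    cumulative = [expit(a + eta) for a in alphas] + [1.0]  # F_J = 1
    probs = [cumulative[0]]                                # p_1 = F_1
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])    # p_j = F_j - F_{j-1}
    return probs

def cumulative_odds(probs, j):
    """Odds of a response in categories 1..j versus categories j+1..J."""
    f = sum(probs[:j])
    return f / (1.0 - f)

# Hypothetical parameters: J = 3 ordered categories, one binary covariate
alphas, betas = [-1.0, 0.5], [0.8]
p_unexposed = proportional_odds_probabilities([0.0], alphas, betas)
p_exposed = proportional_odds_probabilities([1.0], alphas, betas)

or_cut1 = cumulative_odds(p_exposed, 1) / cumulative_odds(p_unexposed, 1)
or_cut2 = cumulative_odds(p_exposed, 2) / cumulative_odds(p_unexposed, 2)
print(or_cut1, or_cut2, math.exp(betas[0]))
```

Both cumulative ORs equal exp(0.8) up to rounding error, whichever cut-point is used to dichotomise the ordinal outcome.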
2.6.4 Clustered categorical data
In the previous sections we have considered regression models for a single categorical outcome. However, multivariate categorical response data commonly arise in a number of applications in psychiatry. That is, two or more measurements of the response are often obtained in a block or cluster and the categorical responses within a cluster are expected to be positively correlated. When this occurs, the responses from any pair of members of the same cluster are expected to be more closely related than the responses from a pair belonging to different clusters. Some common examples where data arise in clusters include repeated measures or longitudinal studies and studies on families, communities or other naturally occurring groups. For example, in a study of the familial association between rheumatic fever and obsessive–compulsive spectrum disorders, each cluster consisted of first-degree relatives of either a case with rheumatic fever or a control [17]. The important aspect of all of these studies is that the categorical responses within a cluster (e.g. the presence of obsessive–compulsive spectrum disorders in first-degree relatives) cannot be regarded as independent of one another. There may be many reasons for the correlation among cluster members. For example, when the cluster is comprised of all the siblings within a family the correlation among siblings may be due to shared (or at least similar) genetic, environmental and social conditions. In a longitudinal study, where
the responses within a cluster represent measurements taken at different occasions, the categorical responses are expected to be positively correlated simply because they have been obtained on the same individual (or cluster). Whatever the underlying reasons for the correlation, failure to account for it in the analysis can lead to invalid inferences. That is, the standard application of logistic regression (or any methods that assume independent observations) in this setting is no longer appropriate. For the remainder of this section we focus on the case of clustered binary data; however, the methods we discuss apply more broadly to clustered categorical data. There are two general approaches for handling the analysis of clustered binary data. The first is to consider models for the joint distribution of the cluster of binary responses that explicitly account for the within-cluster correlation. There is an extensive statistical literature on this topic and the interested reader is referred to a review article by Pendergast et al. [18]. For the most part, these models can be computationally demanding and have only recently been implemented in commercially available statistical software. An alternative approach is to simply ignore the correlation among members of a cluster. That is, the analysis proceeds naively as though the binary responses within a cluster can be regarded as independent observations, but later a correction is applied to ensure that valid standard errors are obtained. Note that in this ‘naive’ approach that ignores the within-cluster correlation the estimated logistic regression coefficients are valid, but their nominal standard errors are not. However, valid standard errors can be readily obtained using the well known empirical variance estimator, first proposed by Huber [19]. Thus, the analysis proceeds in two stages. 
In the first stage, the correlation among binary responses within a cluster is simply ignored and standard logistic regression is used to obtain estimates of the logistic regression coefficients. In the second stage, valid standard errors for the estimated logistic regression coefficients are obtained using an alternative, but widely implemented, variance estimator that properly accounts for the correlation among the binary responses. The chief advantage of this approach is that it can be readily implemented using standard statistical software for logistic regression.
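The two-stage logic can be sketched for the simplest possible case, a logistic model containing only an intercept, where both stages have closed forms. The Python code below is an illustration under stated assumptions rather than a general implementation: the function name and the family data are hypothetical, and the empirical (‘sandwich’) variance is computed without any small-sample correction, which some software applies by default.

```python
import math

def intercept_only_logit(clusters):
    """Fit logit(p) = beta0 to clustered binary data.

    clusters : list of lists of 0/1 responses, one inner list per cluster.
    Returns (beta0_hat, naive_se, robust_se).
    """
    n = sum(len(c) for c in clusters)
    p_hat = sum(sum(c) for c in clusters) / n
    # Stage 1: ordinary maximum likelihood, ignoring the clustering
    beta0 = math.log(p_hat / (1.0 - p_hat))
    information = n * p_hat * (1.0 - p_hat)
    naive_var = 1.0 / information
    # Stage 2: empirical variance built from cluster-level score totals
    meat = sum(sum(y - p_hat for y in c) ** 2 for c in clusters)
    robust_var = meat / information ** 2
    return beta0, math.sqrt(naive_var), math.sqrt(robust_var)

# Hypothetical families in which relatives tend to share the same status
families = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 0, 0]]
beta0, naive_se, robust_se = intercept_only_logit(families)
print(beta0, naive_se, robust_se)

# With one observation per cluster the two standard errors coincide
singletons = [[1], [0], [0], [1], [0]]
_, naive_single, robust_single = intercept_only_logit(singletons)
```

In this made-up example the within-family correlation is positive, so the corrected standard error is larger than the nominal one; the point estimate is unchanged, as described above.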
2.7 Concluding remarks
This chapter presents an overview of statistical methods for categorical (and primarily binary) outcomes, with an emphasis on the OR and its applications in psychiatry. The OR is a particularly useful measure of association because it can be estimated from a number of common study designs. Contingency table methods provide intuitive ways of examining the associations among a limited number of categorical variables from these study designs. Logistic regression offers many of the same advantages as contingency table methods but also provides an efficient method for incorporating an arbitrary number of categorical and quantitative covariates. Logistic regression also provides a framework for extensions to more complex study designs often encountered in psychiatry.
2.8 Further reading
A very comprehensive description of statistical methods for analysing categorical data can be found in the classic textbook by Agresti [20], and a thorough overview of logistic regression methods with an emphasis on applications can be found in the text Applied Logistic Regression by Hosmer and Lemeshow [21].
References
[1] Tohen, M., Hennen, J., Zarate, C.M. et al. (2000) Two-year syndromal and functional recovery in 219 cases of first-episode major affective disorder with psychotic features. Am. J. Psychiatry, 157, 220–228.
[2] Brown, L.D., Cai, T.T. and DasGupta, A. (2001) Interval estimation for a binomial proportion. Statist. Sci., 16, 101–133.
[3] Breslow, N.E. and Day, N.E. (1980) Statistical Methods in Cancer Research, The Analysis of Case–Control Studies, Vol. 1, World Health Organization, Lyon.
[4] Wærn, M., Runeson, B.S., Allebeck, P. et al. (2002) Mental disorder in elderly suicides: a case–control study. Am. J. Psychiatry, 159, 450–455.
[5] Geda, Y.E., Roberts, R.O., Knopman, D.S. et al. (2008) Prevalence of neuropsychiatric symptoms in mild cognitive impairment and normal cognitive aging: population-based study. Arch. Gen. Psychiatry, 65, 1193–1198.
[6] Fisher, R.A. (1934) Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh.
[7] Cochran, W.G. (1954) Some methods of strengthening the common χ2 tests. Biometrics, 10, 417–451.
[8] Mantel, N. and Haenszel, W. (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst., 22, 719–748.
[9] Everitt, B.S. (1992) Some aspects of the analysis of categorical data, in A Handbook for Data Analysis in the Behavioral Sciences, Statistical Issues, Vol. 2 (eds G. Keren and C. Lewis), Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 321–348.
[10] Hosmer, D.W. and Lemeshow, S. (1980) A goodness-of-fit test for the multiple logistic regression model. Commun. Stat., A10, 1043–1069.
[11] King, E.N. and Ryan, T.P. (2002) A preliminary investigation of maximum likelihood logistic regression versus exact logistic regression. Am. Stat., 56, 163–170.
[12] Mehta, C.R. and Patel, N.R. (1995) Exact logistic regression: theory and examples. Stat. Med., 14, 2143–2160.
[13] Maselko, J. and Patel, V. (2008) Why women attempt suicide: the role of mental illness and social disadvantage in a community cohort study in India. J. Epidemiol. Community Health, 62, 817–822.
[14] Alim, T.N., Feder, A., Graves, R.E. et al. (2008) Trauma, resilience, and recovery in a high-risk African-American population. Am. J. Psychiatry, 165, 1566–1575.
[15] Hybels, C.F., Blazer, D.G. and Steffens, D.C. (2005) Predictors of partial remission in older patients treated for major depression: the role of comorbid dysthymia. Am. J. Geriatr. Psychiatry, 13, 713–721.
[16] McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd edn, Chapman & Hall, London.
[17] Hounie, A.G., Pauls, D.L., do Rosario-Campos, M.C. et al. (2007) Obsessive-compulsive spectrum disorders and rheumatic fever: a family study. Biol. Psychiatry, 61, 266–272.
[18] Pendergast, J.F., Gange, S.J., Newton, M.A. et al. (1996) A survey of methods for analyzing clustered binary response data. Int. Stat. Rev., 64, 89–118.
[19] Huber, P.J. (1967) The behavior of maximum likelihood estimates under nonstandard conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, CA, pp. 221–233.
[20] Agresti, A. (2002) Categorical Data Analysis, 2nd edn, John Wiley & Sons, Inc., New York.
[21] Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd edn, John Wiley & Sons, Inc., New York.
3 Genetic epidemiology
Stephen V. Faraone(1), Stephen J. Glatt(1) and Ming T. Tsuang(2,3)
(1) Departments of Psychiatry and Behavioral Sciences and Neuroscience and Physiology, Medical Genetics Research Center, SUNY Upstate Medical University, NY, USA
(2) Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, CA, USA
(3) Veterans Affairs San Diego Healthcare System, San Diego, CA, USA
3.1 Introduction
Epidemiologists usually concern themselves with describing the distribution and determinants of disease as a function of exposure to some environmental variable. This leads naturally to the goal of finding environmental risk factors that cause illness. In contrast, geneticists focus on genetic mechanisms and, in experimental studies, may even seek to strictly control the environment and eliminate environmental variance. Thus, epidemiologic research often treats genetic determinants as noise and environmental agents as the signal; genetic studies reverse the roles of genes and environment. Psychiatric genetics adopts the position of genetic epidemiology, which has been defined as ‘a science that deals with aetiology, distribution and control of disease in groups of relatives and with inherited causes of disease in populations’ [1]. Genetic epidemiologists examine the distribution of illness within families with the goal of finding genetic and environmental causes of illness. Thus, psychiatric genetics considers both environmental and genetic risk factors – and their interaction – to be on an equal footing. This paradigm extends the epidemiologist’s concept of ‘exposure’ to genes and family relationships. Most psychiatric genetic research is predicated on an assumption that the pathway from genotype to phenotype
cannot be understood without reference to environmental agents that trigger illness in susceptible individuals. The debate ascribing the risk for psychiatric disorders to either nature or nurture has been laid to rest, as most of these conditions are now understood to arise from the combination of both. Current work seeks to define genetic and environmental risk factors, the magnitude of their contributions and how they interact. The methods of psychiatric genetic epidemiology have established the familial and heritable nature of various psychiatric disorders over the last several decades; they are now poised to unravel their underlying mechanisms in the years to come.
3.2 The chain of psychiatric genetic research
Work in psychiatric genetics follows a series of progressive questions (Table 3.1). This chain of genetic epidemiologic research [2] is as follows: First, we ask ‘Is the disorder familial? Does it run in families?’ Second, ‘What is the relative magnitude of genetic and environmental contributions to disease aetiology and expression?’
Third, ‘How is disease transmitted from generation to generation?’ Fourth, ‘If genes mediate transmission, where are they located?’ Fifth, ‘What are the risk-conferring variants of the genes and what is the mechanism of disease?’ In modern practice, some questions are pursued before those ‘earlier’ in the chain have been addressed. This is because ‘later’ questions are presumed to elucidate more specific risk factors than earlier ones, and because the technology for pursuing these later questions has advanced rapidly, making their wide-scale implementation highly feasible. It is a fallacy, however, that the newest methods can obviate earlier ones: no method devised to date has been found capable of explaining all sources of variation in the liability towards a complex phenotype such as those in the realm of psychiatry. Thus, we start our introduction to the field of psychiatric genetic epidemiology with a review of its oldest methods and conclude by reviewing the latest molecular genetic techniques, with the understanding that each method is but one essential instrument in the genetic epidemiologist’s toolbox.
Table 3.1 Chain of psychiatric genetic research.
Questions | Methods
Is the disorder familial? | Family study
What are the relative contributions of genes and environment? | Twin and adoption studies
What is the mode of transmission? | Segregation analysis
Where is the gene (or genes) located? | Linkage analysis
What are the risk-conferring variants? | Association analysis
3.2.1 Is the disorder familial?
This question can be addressed more easily than some others in genetic epidemiology, which is why it is often asked first; another reason for its primacy in the chain of research is that it provides the most fundamental direction for subsequent genetic epidemiologic studies. If a disorder shows familial transmission, follow-up with other methods is warranted, whereas if no familial resemblance for the trait is observed, investigations of the disorder would proceed in a different direction (e.g. environmental surveys).
Observations of disorders ‘running in families’ may come from clinicians who often treat patients from the same family. Once familiality is informally established in a clinical setting, it remains to be confirmed with a rigorous research design, known as the family study method.
3.2.1.1 Selection of probands
A family study should use the blind case–control paradigm. The cases and controls used in genetic studies are known as probands. We select probands with the disorder from a source that is ‘enriched’ with the diagnosis of interest. For example, patients in psychiatric clinics are more likely to have bipolar disorder than patients in a family practice clinic. Furthermore, patients in a bipolar speciality clinic are more likely to have bipolar disorder than patients in a general psychiatric clinic. Selection from clinics instead of the general population is useful because, to achieve an adequate number of cases from the general population, we would need to screen many individuals. This is costly and of dubious benefit. Also, multiple stages of ascertainment increase the probabilities of ill probands being ‘true cases’ and of unaffected or ‘control’ probands not having the disorder under study. The positive predictive power of a diagnosis (the proportion of those with the disorder among all patients receiving the diagnosis) increases as the base rate of the disorder being diagnosed increases [3]. Thus, multiple-stage ascertainment increases positive predictive power by using clinic status to increase the proportion of ‘true cases’ in the sample. This method of increasing positive predictive power will increase the false-negative rate. In this context, false negatives are those who have a disorder but are (i) not referred to a clinic or (ii) referred but do not
receive a clinical diagnosis. The generalisability of results will be limited to the degree that these false negatives differ from the probands. Treatment is the most notable factor that differentiates these groups. Multi-stage screening of controls decreases the probability of misclassifying someone with the disorder as a control. Since screened controls are selected for absence of the disorder of interest, they are not representative of the general population, but they are very effective for projects seeking to delineate factors that differentiate controls from cases [4]. Furthermore, unscreened controls frequently have rates of psychopathology and its correlates that are above the population expectation [5–7], thereby obscuring the effects of the variable of interest. Controls should be screened only for the disorder being studied, not for other psychiatric disorders or conditions. When controls are screened for additional disorders, the results can spuriously indicate a familial relationship between the disorder used to select cases and the disorders screened from controls [8]. The selection of controls should satisfy the comparability principles required for meaningful inferences in case–control epidemiological studies [9–12]. It is usually not possible to establish a primary study base with a geographically defined population because the clinics from which probands are selected may serve a broad geographic region that is difficult to delineate. The usual approach establishes a secondary study base defined by the ascertainment source. This limits generalisability and does not produce a representative sample from a geographical population. Nevertheless, it allows for meaningful case–control comparisons if the controls are individuals who could have been cases had they developed the disorder of interest [9–12]. 
When sampling from a clinic, this requires that, if the control subjects had needed treatment for the disorder, they would have been referred to the clinics that provided the case probands. Instead of establishing a secondary study base, one could match cases and controls on ‘relevant’ variables. One problem here is defining what is and is not ‘relevant’. Age, sex and socioeconomic status are usually considered. Matching should be used cautiously to avoid the ‘matching fallacy’ [13] and ‘overmatching’ [9, 14] because matching on specific variables often unmatches on others [13].
This creates unusual samples, reduces statistical efficiency and biases estimates [12]. These problems are worse when the matching variable is strongly associated with the disorder under study.
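The point made earlier in this section, that the positive predictive power of a diagnosis rises with the base rate of the disorder in the ascertainment source, can be illustrated with a few lines of Python. The sketch assumes the standard relationship between positive predictive power, sensitivity, specificity and base rate; the three base rates and the sensitivity and specificity of 0.90 are hypothetical values chosen only to show the pattern.

```python
def positive_predictive_power(base_rate, sensitivity, specificity):
    """Proportion of true cases among all individuals receiving the diagnosis."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)

# Hypothetical base rates of the disorder in three ascertainment sources
base_rates = {
    "general population": 0.01,
    "general psychiatric clinic": 0.10,
    "speciality clinic": 0.50,
}

ppp = {source: positive_predictive_power(rate, sensitivity=0.90, specificity=0.90)
       for source, rate in base_rates.items()}

for source, value in ppp.items():
    print(f"{source}: {value:.3f}")
```

With the diagnostic procedure held fixed, the same diagnosis is far more likely to identify a ‘true case’ in the enriched source than in the general population.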
3.2.1.2 Assessment of disorder among relatives
After selecting case and control probands, the family study compares rates of illness among relatives of cases to rates among relatives of controls. Care must be taken to assess as many relatives as possible. Because psychiatric disorders affect emotions, thinking and interpersonal relationships, nonparticipation may not be random with respect to illness status: ill family members may be more likely to refuse participation than others. If a disorder has a genetic aetiology, then relatives of ill probands should carry a greater risk for the illness than relatives of controls, and the risk to relatives of probands should increase with the proportion of genes they share with the proband. First-degree relatives – such as parents, siblings and children – share 50% of their genes, on average, with the proband. They should be at greater risk for the disorder than second-degree relatives (grandparents, uncles, aunts, nephews, nieces and half-siblings) because second-degree relatives share only 25% of their genes with the proband. Family studies rarely have the resources to diagnose second- or third-degree relatives. Table 3.2 displays the familial pattern of risk found in the families of schizophrenic probands, which is similar to that for other psychiatric disorders. These risk figures show that first-degree relatives are at highest risk, followed by second- and then third-degree relatives. In Table 3.2, it is clear that risk for disorder increases with the proportion of shared genes; however, the increase in risk is not linear with the increase in biological similarity. Rather, it is exponential, with the individuals most similar to an affected proband (monozygotic (MZ) twins, who are 100% genetically identical) at more than double the risk incurred by individuals with only half their genes in common with an affected proband.
These results underscore the complexity of the genetic bases for psychiatric disorders, and imply that gene–gene interactions (epistasis) as well as environmental factors must contribute to their aetiologies.
Table 3.2 Rates of schizophrenia among relatives of schizophrenic patients.
Type of relative | Per cent at risk
First-degree relatives |
Parents | 4.4
Children | 12.3
Both parents schizophrenic | 36.6
Brothers and sisters | 8.5
Neither parent schizophrenic | 8.2
One parent schizophrenic | 13.8
Fraternal twins of opposite sex | 5.6
Fraternal twins of same sex | 12.0
Identical twins | 57.7
Second-degree relatives |
Uncles and aunts | 2.0
Nephews and nieces | 2.2
Grandchildren | 2.8
Half brothers/sisters | 3.2
Third-degree relatives |
First cousins | 2.9
General population | 0.86
Based on Slater and Cowie [15] with the exception of twin data from Shields and Slater [16]. Adapted, with permission, from Tsuang et al. [17].
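One simple way to read Table 3.2 is to express each rate as a ratio to the general population rate. The Python sketch below does this for four rows of the table; the rates are copied from Table 3.2, whereas the expected proportions of shared genes are standard values (the figure of 12.5% for first cousins is not stated in the text and is included here as an assumption).

```python
# Per cent at risk, copied from Table 3.2
GENERAL_POPULATION_RISK = 0.86

# relative type: (expected proportion of genes shared, per cent at risk)
TABLE_3_2 = {
    "Identical twins": (1.0, 57.7),
    "Brothers and sisters": (0.5, 8.5),
    "Half brothers/sisters": (0.25, 3.2),
    "First cousins": (0.125, 2.9),   # 12.5% sharing is an assumed standard value
}

def risk_ratio(percent_at_risk, baseline=GENERAL_POPULATION_RISK):
    """Risk to a class of relatives divided by the general population risk."""
    return percent_at_risk / baseline

for relative, (shared, percent) in TABLE_3_2.items():
    print(f"{relative}: shared={shared}, risk={percent}%, "
          f"ratio={risk_ratio(percent):.1f}")

# Doubling the proportion of shared genes (0.5 to 1.0) more than doubles the risk
mz_vs_sibling = TABLE_3_2["Identical twins"][1] / TABLE_3_2["Brothers and sisters"][1]
print(mz_vs_sibling)
```

The ratio for identical twins relative to siblings is well above two, which is the non-linearity described in the text.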
3.2.1.3 Family study vs. family history methods
The family history method assesses diagnoses of family members by interviewing only one or several informants per family. In contrast, the family study method determines diagnoses by interviewing each family member [2]. Several excellent structured psychiatric interviews are available but only one was designed specifically for genetic studies: the Diagnostic Interview for Genetic Studies (DIGS) [18, 19]. The main advantage of the family history method is low cost: interviewing a few family members is less costly than interviewing all family members. However, family history data underestimate true rates of many psychiatric disorders. Ideally, diagnoses of subjects should use three sources of information: direct interviews with the subject, family history interviews with informants who are familiar with the subject and medical records when available. All sources of information about a given individual may then be combined into a consensus diagnosis [20, 21]. The direct interview and medical record usually provide more useful information than the family history
assessment. In fact, two studies find that diagnoses based on direct interviews alone closely approximated best-estimate diagnoses [20, 21]. However, a diagnosis based only on medical records is often a suitable proxy to the best-estimate diagnosis [20]. The choice between the family history and family study methods requires a tradeoff between data quality and the expense of data collection. The family history method is the method of choice when there are not sufficient data to justify the expense of a family study. It is a good choice for pilot studies. After the family history method demonstrates familiality, the family study is the tool of choice for examining the details of familial transmission and developing reliable estimates of familial risk. When using the family history method, the following should be considered:
1. use the Family Interview for Genetic Studies or some other semistructured method for eliciting the family history;
2. because the family history method has low sensitivity, use less stringent diagnostic criteria than used for direct interviews;
3. use multiple informants for each person to be diagnosed;
4. seek out informants who have had substantial contact with the person to be diagnosed;
5. the method is most valid when the person being diagnosed is ill at the time of interviewing the informant.
These ‘rules of thumb’ provide a rough guide for planning a family history study.
3.2.1.4 Caveats
We must be cautious in concluding a disorder is caused by genes after we observe that it is familial. Disorders can ‘run in families’ for non-genetic reasons such as shared environmental adversity, viral transmission and social learning. Because the culture and environment shared by family members tend to be greater for closer relatives, the pattern of risk due to environmental factors may mimic the pattern expected for genetic relationships. Thus, a finding of familial transmission cannot be
unambiguously interpreted. Although family studies are indispensable for establishing the familiality of disorders, they cannot establish whether genes or environment mediate that transmission.
3.2.2 What are the relative contributions of genes and environment?
Genes, environment and their interaction: these are the ingredients of the pathophysiological brew that engenders psychopathology. To assay these ingredients and determine their relative proportions, we turn to twin and adoption studies [2].
3.2.2.1 Twin studies
Identical or MZ twins inherit identical chromosomes and thereby have 100% of their genes in common. In contrast, like siblings, dizygotic (DZ) twins share 50% of their genes on average. MZ and DZ twins are markedly different with regard to their genetic similarity, but, if twin pairs are reared in the same household, then the degree of environmental similarity between MZ twins should be no different than that between DZ twins. The astute reader will note that our comments regarding genetic similarity are facts of inheritance, but our comments about the environment are assumptions. The correctness of these assumptions is key to the valid use of the twin method. Since MZ twins are genetic copies of one another, any differences between a pair of MZ twins are assumed to be due primarily to environmental influences. In contrast, differences between DZ twins could be due to either genetic or environmental influences. Thus, comparing the co-occurrence of a psychiatric disorder in the two types of twins provides information about the relative contributions of genes and environment to the disorder. The co-occurrence of a disorder in both twins is called concordance; if one twin has the disorder and the other does not, the twins are discordant for the disorder. Because we assume the same environmental similarity for both types of twins, a higher concordance rate for MZ compared with DZ twins indicates the influence of genes. We can use pairwise or probandwise concordance rates, depending on the method of sampling the twins. The pairwise concordance rate
is defined as the proportion of twin pairs in which both twins are ill. To compute this, count the number of twin pairs concordant for the disorder and divide the result by the total number of pairs. Use this method when the probability of sampling any specific ill individual is so low that two ill co-twins are never independently sampled as probands. When the sampling probability is higher, use the probandwise concordance rate. Probandwise concordance is the proportion of proband twins that have an ill co-twin. A concordant pair in which both twins are probands contributes two probands and is therefore counted twice. Thus, the numerator is the number of concordant pairs plus the number of concordant pairs in which both twins are probands, and the denominator is the total number of pairs plus the number of concordant pairs in which both twins are probands (i.e. the total number of probands). Twin data can estimate the ‘heritability’ of a disorder. Heritability measures the degree to which genes influence variability in the manifestation of the disorder (the phenotype). We divide phenotypic variability ($V_p$) into two parts: genetic variability ($V_g$) and environmental variability ($V_e$). This partitioning of phenotypic variability assumes that genetic and environmental factors are statistically independent (i.e. $V_p = V_g + V_e$). Heritability in the broad sense ($h^2$) is the ratio of genetic to phenotypic variance (i.e. $h^2 = V_g / V_p$). As these formulas show, a heritability of one indicates that all variability in the phenotype is due to genes. A heritability of zero attributes all phenotypic variation to environmental factors. When estimating heritability, diagnostic unreliability increases the estimate of environmental influence. Heritability estimates are context-dependent, and this is reflected by the fact that the heritability estimate accounts not only for main effects of genetic factors but also for gene–environment interactions. The details of methods for calculating heritability are beyond the scope of this chapter. Smith [22] and Plomin et al. [23] provide information about the calculation and interpretation of this measure.
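The two concordance rates can be made concrete with a short Python sketch. Here the probandwise rate is computed as the proportion of probands who have an ill co-twin, so a concordant pair in which both twins were independently ascertained contributes two probands. The function names and the pair counts are hypothetical and serve only to illustrate the arithmetic.

```python
def pairwise_concordance(concordant_pairs, discordant_pairs):
    """Proportion of twin pairs in which both twins are ill."""
    return concordant_pairs / (concordant_pairs + discordant_pairs)

def probandwise_concordance(concordant_pairs, doubly_ascertained_pairs,
                            discordant_pairs):
    """Proportion of probands with an ill co-twin.

    doubly_ascertained_pairs : concordant pairs in which both twins are
                               probands (a subset of concordant_pairs).
    """
    # Each doubly ascertained pair supplies two probands with an ill co-twin
    probands_with_ill_cotwin = concordant_pairs + doubly_ascertained_pairs
    total_probands = probands_with_ill_cotwin + discordant_pairs
    return probands_with_ill_cotwin / total_probands

# Hypothetical sample: 20 concordant pairs (8 doubly ascertained), 30 discordant
pairwise = pairwise_concordance(20, 30)
probandwise = probandwise_concordance(20, 8, 30)
print(pairwise, probandwise)
```

When no pair is doubly ascertained the two rates coincide, which matches the advice to use the pairwise rate when the sampling probability is very low.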
If we have data from parents and siblings of twins or indices of the environment, then specialised methods can provide information about gene–environment interaction and gene–environment correlation. An excellent reference for these methods is the book by Neale and Cardon [24]. A major assumption, and often-cited challenge to twin studies, is the ‘equal-environments assumption’. A basic tenet underlying the partitioning of genetic and environmental variance is the assumption that MZ twin-pairs reared together have the same degree
of exposure to similar environmental factors as reared-together DZ twin-pairs have. However, this will be wrong if many more MZ twin-pairs than DZ twin-pairs are treated identically and exposed to the same events. Serious violation of the equal-environments assumption could result in increased phenotypic similarity among MZ twin-pairs relative to DZ twin-pairs that is due to environmental – not genetic – similarities between MZ twin-pairs. Thus, a portion of the variance in a phenotype that should be attributed to environmental factors would inadvertently be ascribed to genes, and heritability estimates will be artificially inflated. When MZ twins are reared apart, we have a unique – but rare – opportunity to study the relative importance of genes and environment without having to assume environments are equal. Since MZ twins reared apart do not share a common environment, any phenotypic similarity must be due to genetic factors. However, MZ twins with psychiatric illness are rare, and cases of such twins reared apart are even rarer. Thus, this design cannot be routinely used. A second twin study design uses the children of discordant MZ twins. The logic of this design is straightforward. If a disorder is caused by a genotype combining with environmental factors, then the well member of a discordant MZ twin pair should carry the genotype. Presumably, they did not develop the disorder because they had not been exposed to a relevant environmental cause. If so, then the children of the well twin should have the same risk for the disorder as the children of the ill twin.
3.2.2.2 Adoption studies Children adopted at an early age have a genetic relationship to their biological parents and an environmental relationship to their adoptive parents. Thus, adoption studies can determine if biological or adoptive relationships account for the familial transmission of disorders. If genes are important, then the familial transmission of illness should occur in the biological, but not the adoptive family. If culture, social learning or other sources of environmental transmission cause illness, then familial transmission of illness should occur in the adoptive, but not the biological family.
There are three major adoption study designs. The parent-as-proband design compares rates of illness in the adopted offspring of parents with and without the disorder. If genetic factors mediate the disorder then rates of illness should be greater in the adopted away children of ill parents compared with the adopted children of well parents. The adoptee-as-proband design starts with ill and well adoptees and examines rates of illness in both biological and adoptive relatives. If the biological relatives of ill adoptees have higher rates of illness than the adoptive relatives of ill adoptees, then a genetic hypothesis is supported. In contrast, if the adoptive relatives show higher rates of illness then an environmental hypothesis gains support. The third design is the cross-fostering design. This approach compares rates of illness for two groups of adoptees: one has well biological parents and is raised by ill adoptive parents and the other has ill biological parents and is raised by well adoptive parents. Higher rates of illness in the former group of adoptees compared with the latter group imply a non-genetic mode of illness transmission. Although they are difficult to execute, adoption studies have provided extensive data for both mood disorders [25–29] and schizophrenia [30–32]. Taken as a group, these studies support the hypothesis that the familial transmission of these disorders is due to genetic, not environmental factors. Adoption studies must be viewed with caution due to potential methodological problems that cloud their interpretation. Adoptees and their families are not representative of the general population. This limits generalisability. Adoptees are at greater risk for psychiatric illness compared with non-adopted children [33, 34]. Although the reasons for this are not clear, this increased risk for psychiatric disorders requires use of an adoptee control group. 
For example, in the adoptee-as-proband design, the relatives of ill adoptees must be compared with the relatives of well adoptees. Some other control group cannot be used, even if it is matched to the ill adoptee group on demographic measures. It may be difficult to find a sample of adoptees who were all separated from their parents at birth. If the child has lived with a parent for even a short period of time prior to adoption, the biological relationship will have been ‘contaminated’ by environmental
factors. Some might even argue that the child’s contact with the mother immediately after birth creates a residue of environmental influence that affects subsequent psychopathology. Kety et al. [31, 32] presented a compelling design that deals with this problem. Their method requires a sample of biological paternal half-siblings of ill and well adoptees. Paternal half-siblings share a common father yet have different mothers, so they do not share prenatal, perinatal or neonatal environmental exposure to the same mother. This design rules out confounding by in utero influences, birth traumas and early mothering. In the work of Kety and colleagues, the biological paternal half-siblings of schizophrenic adoptees were at greater risk for schizophrenia than the biological paternal half-siblings of control adoptees, which bolsters the hypothesis that schizophrenia is caused, at least in part, by genetic factors. There are some environmental correlates of the biological parents that cannot be handled by the paternal half-sibling design. Children born to fathers of the lowest social class may share toxic environmental factors such as poor pre- and perinatal care, poor nutrition and an adverse social environment; these may confound the genetic parent–child relationship. Despite these potential confounds and the difficulty of ascertaining appropriate cases and controls, the adoption study remains a valuable tool for disentangling genetic and environmental contributions to the familial aggregation of psychiatric disorders. The problems we note serve to underscore a basic tenet of psychiatric genetic research: any assertion that a disorder is caused by genetic factors must refer not to a single study, but to a series of studies using different paradigms.
3.2.3 What is the mode of transmission? After demonstrating that a disorder is influenced by genetic factors, the next logical task is to determine the mechanism of transmission from parent to child [2]. This information is useful from two perspectives. Showing that the transmission of a disorder corresponds to a known mode of transmission provides clues for subsequent research. For example, if the transmission is clearly due to a single gene, the next step might be linkage analysis, which uses family
psychiatric data and samples of DNA to find mutant genes. If environmental factors are implicated then a search for such factors would be warranted. Moreover, the mode of transmission has implications for genetic counselling. Genetic counselling is the process whereby clinical professionals inform people about either their probability of developing a genetic disorder or that of children they are planning to conceive. Ideally, in the absence of data implicating a particular genetic polymorphism(s), such counselling should be based on risk figures from a known model of genetic transmission. This model can be applied to an individual’s pedigree to determine that individual’s risk for a disorder. Morton et al. [35] demonstrated that the degree of risk predicted depends on the model of transmission. They also found that clinically important errors in risk prediction were made when they used the wrong genetic model to make predictions. A model of familial transmission translates assumptions about genetic and environmental causes into mathematical equations. These equations predict the distribution of a disorder expected in pedigrees or twin pairs. If the pattern of disorder predicted by the model is close to what we observe we say that the model fits the data. In contrast, if the predicted pattern of disorder differs from what is observed we reject the model and seek another. The term segregation analysis is used to describe analyses that assess the mode of disease transmission. The methods we discuss in this section require a good deal of mathematical and statistical expertise to be understood and correctly implemented. In the short space of this chapter we cannot present these mathematical details but instead provide an overview of the different classes of methods used to test hypotheses about genetic and environmental transmission. Several excellent texts, review articles and computer program documentation provide a detailed guide to these methods [1, 36–38].
3.2.3.1 Mathematical modelling of genetic and environmental transmission A genetic model comprises two major components. First, we must describe how the disorder is transmitted. For example, if we believe the disorder is due to a single dominant gene, our model must include the
frequency of the gene in the population. It must also require that the transmission of the gene from parent to child follows the laws of genetic transmission. For example, if a mother carries one pathogenic gene the probability that she transmits this gene to a child must be 50%. Genetic models can specify environmental effects in several ways. In a single gene model we specify the penetrance of each genotype. Penetrance is the probability that each genotype causes disease. If we believe that disease occurs when an environmental event occurs in someone carrying the pathogenic gene, then our model should allow some gene carriers to be well. If other causes for the disease exist, then penetrance exceeds zero for those who do not carry the hypothetical disease gene. Genetic modelling requires a procedure for determining whether model predictions adequately describe the pattern of illness observed in families. One modelling approach attempts to predict rates of illness in various classes of relatives. The pedigree data is reduced to a table of numbers indicating the rate of illness in these classes (e.g. mothers, fathers, brothers, sisters, sons, daughters and more distant relatives). The mathematical model chooses values for the model parameters (e.g. gene frequency and penetrance) that most accurately reproduce the observed rates. The observed and predicted rates can then be compared with a chi-square test to determine if deviations between predicted and observed rates are large enough to warrant rejecting the model. Modelling rates of illness does not capitalise on all the information available in pedigree data. By lumping all families together within one data table, we cannot directly model the transmission of genes from one generation to the next. In contrast, pedigree analysis computes the likelihood of the pattern of illness in each family. 
For this approach, the analysis uses the status of each person and their relationship to others in the pedigree who are and are not affected. An algorithm then computes the likelihood of the model, that is, the probability of the observed pedigree data given the assumed model and the values of its parameters. Those parameter values that maximise the likelihood are used as final estimates. With this approach we establish the model’s goodness of fit with a likelihood ratio chi-square test.
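The rate-based approach described above can be illustrated with a minimal Python sketch. It is not taken from any published segregation analysis program. It assumes Hardy–Weinberg genotype proportions (our assumption, not stated in the text), and all function names, gene frequencies and penetrances are hypothetical values chosen for illustration.

```python
def genotype_frequencies(q):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of an allele with frequency q."""
    p = 1.0 - q
    return (p * p, 2.0 * p * q, q * q)


def predicted_prevalence(q, penetrances):
    """Predicted population rate of illness: sum of genotype frequency x penetrance."""
    return sum(f * pen for f, pen in zip(genotype_frequencies(q), penetrances))


def pearson_chi_square(observed_affected, class_sizes, predicted_rates):
    """Pearson chi-square comparing observed and predicted counts of ill and well
    relatives, summed over classes of relatives."""
    stat = 0.0
    for obs, n, rate in zip(observed_affected, class_sizes, predicted_rates):
        expected_ill = n * rate
        expected_well = n * (1.0 - rate)
        stat += (obs - expected_ill) ** 2 / expected_ill
        stat += ((n - obs) - expected_well) ** 2 / expected_well
    return stat


# Hypothetical dominant model: non-carriers keep a small risk from other causes
# (penetrance above zero), and carriers have incomplete penetrance.
prevalence = predicted_prevalence(q=0.01, penetrances=(0.001, 0.5, 0.5))
```

In a real analysis the predicted rates for each class of relative would themselves be derived from the model parameters, and the parameters would be varied until the chi-square statistic is as small as possible.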
3.2.3.2 Types of transmission models Single gene Mendelian transmission is only one of many mechanisms used to describe family and twin data. We find it useful to classify the genetic mechanisms into three types of models: single major gene, oligogenic and multifactorial polygenic (MFP). The word ‘major’ indicates that one gene can account for most of the genetic transmission of a disorder. Other genes and environmental conditions may play minor roles in modifying the expression of the disease or determining its age of onset. In contrast, an oligogenic model assumes that the combined actions of several genes cause illness. These genes may combine in an additive fashion such that the probability of illness is a function of the number of pathogenic genes. Alternatively, the mechanism may be interactive. For example, three abnormal alleles at different chromosomal locations may be needed for disease to occur. The MFP model proposes that a large, unspecified number of genes and environmental factors combine in an additive fashion to cause disease. The difference between oligogenic and polygenic models is one of degree. The former contain ‘several’ genes (e.g. less than 10) whereas the latter include a ‘large number’ of genes (e.g. [39]). Geneticists originally developed polygenic models to describe quantitative traits such as height and intelligence. By specifying a ‘large number’ of genes, these models could explain how discrete genes could cause traits that were continuously distributed in the population. Since many diseases are qualitative categories – not quantitative dimensions – geneticists developed the concept of liability [40]. Liability describes an unobservable trait: the predisposition to onset with disease. As liability increases so does the probability of disease onset. Alternatively, we might assume that disease occurs when one’s liability crosses a specific threshold. 
More than one threshold may be placed along the liability continuum, representing varying degrees of severity. Individuals above an upper threshold will develop a severe form of the disorder. Those below the lower threshold may have minor problems or be unaffected, while those whose liability falls between the two thresholds would have an intermediate form of the disorder. The mixed model posits that both MFP and single gene
components may be involved in disease aetiology. Statistical analysis of the mixed model can determine if either component alone can provide an adequate fit to the data, or if the null hypothesis of no single gene effect and no MFP effect fits best.
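The two-threshold liability idea can be made concrete with a short Python sketch. It assumes that liability follows a standard normal distribution in the population, which is a common convention but is our assumption here. The threshold values and function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def severity_proportions(lower, upper):
    """Population proportions below, between and above two liability thresholds.

    Liability is assumed (for illustration only) to be standard normal.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    below = normal_cdf(lower)
    above = 1.0 - normal_cdf(upper)
    return {
        "unaffected_or_minor": below,
        "intermediate": 1.0 - below - above,
        "severe": above,
    }


# Hypothetical thresholds, in standard deviation units of liability.
proportions = severity_proportions(lower=1.5, upper=2.5)
```

With these illustrative thresholds, well under 1% of the population would fall above the upper threshold and develop the severe form.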
3.2.4 Where is the gene (or genes) located? Eventually, psychiatric genetic research leads to questions such as, ‘Where is the gene located?’ This stage of inquiry requires molecular geneticists to provide the methods for tracking the inheritance of these disorders through families [2]. The search for disease genes faces formidable obstacles. Paramount among these is the number of potential disease genes. Each of us has over 20 000 genes. Moreover, only 10% of our chromosomal material (DNA) contains the coding sequences (i.e. instructions) for these genes. The average gene is made up of 3000 bp (the building blocks of DNA). But the entire set of chromosomes (the genome) contains 3 billion bp. Thus, searching for disease genes might seem as difficult as looking for a needle in a haystack. Fortunately, geneticists and statisticians have solved this genetic needle-in-a-haystack problem. Today, there is no question that molecular genetic and statistical technologies can find genes that cause genetic disorders. The list of diseases with known disease genes grows each year. Examples include Huntington’s disease, Alzheimer’s disease, cystic fibrosis, Duchenne’s muscular dystrophy, myotonic dystrophy and familial colon cancer.
3.2.4.1 Background for linkage analysis Linkage analysis is made possible by the ‘crossing over’, which takes place between two homologous chromosomes during meiosis, the process whereby gametes are created. Genetic transmission occurs because we inherit one member of each pair of chromosomes from our mother and one from our father. These inherited chromosomes are not identical to any of the original parental chromosomes. During meiosis, the original chromosomes in a pair cross over each other and exchange portions of their DNA. After multiple crossovers, the resulting
two chromosomes each consist of a new and unique combination of genes. When meiosis is complete, each gamete will contain one chromosome from each of the newly formed pairs. The probability that two genes on the same chromosome recombine during meiosis is a function of their physical distance from one another. This relationship is not linear because recombination events do not occur randomly across the genome, but instead occur at hotspots, leaving other tightly bound segments of the genome relatively intact and heritable as ‘blocks’. We say that two loci on the same chromosome are ‘linked’ when they are so close to one another that crossing over rarely occurs between them. Closely linked genes usually remain together on the same chromosome after meiosis is complete. The greater the distance between loci on the same chromosome, the more likely it is that they will recombine. This biological fact makes it possible to locate genes that are risk factors for disease. Genes on the same chromosome that are very far apart from one another are transmitted independently, as are genes on different chromosomes.
3.2.4.2 Statistical methods for linkage analysis The statistical methods of linkage analysis are beyond the scope of the present chapter, and so are only reviewed in brief. Linkage analysis capitalises on both the occurrence of crossing over and the availability of polymorphic genetic markers. It computes a statistic indicating the probability that the cosegregation of genetic markers and disease within pedigrees exceeds that expected from chance alone. Thus, linkage analysis assesses the association of disease and marker within families.
3.2.4.3 The affected relative pair method The affected relative pair method of linkage analysis evolved from the affected sib-pair method [41, 42]. The original ‘identity by descent’ affected sib-pair method worked with pairs of ill siblings having parents with four different alleles at the marker locus. That is, the father carried two versions of the gene and the mother carried two versions that differed from the father’s. For example, the father might
have alleles A and B and the mother might have C and D. Under the null hypothesis of no linkage, the distribution of alleles shared by siblings at the locus is well defined. For example, consider any of the alleles, say A. The probability that the father transmitted it to the first child is 0.50. The probability of transmitting it to the second child is also 0.50. Therefore, the probability that he transmits A to both children is 0.50 times 0.50 or 0.25. Now assume that the marker locus is close to a disease locus and that the two children have the disease. Because both have the disease gene, both share a segment of DNA that contains the gene and surrounding loci. The size of this segment is not fixed (it depends upon where crossovers occurred). However, the probability that the marker locus is on this segment increases with the proximity of the marker to the disease locus. If the marker and disease loci are contiguous, then children who inherited the disease gene should also have inherited the same allele at the marker locus. We say that the shared marker allele is ‘identical by descent’ to indicate that the alleles observed in the children are copies of the same parental allele. A statistical test was developed to determine whether the observed distribution of marker alleles differed from what was expected in the absence of linkage. The method was later generalised to the case in which the parental marker alleles were not all unique. In this situation we can determine if alleles are ‘identical by state’ but not if they are ‘identical by descent’. Identity by state means that the two alleles are the same but we cannot be certain if they are copies of the same parental gene. For example, suppose the father has alleles A and B at the marker locus and the mother has A and C.
If two of their children each have allele A at the marker locus then we cannot determine if both received it from the father, both from the mother or one from the mother and the other from the father. The loss of identity by descent information reduces statistical power [43]. The affected relative pair method is a general form of the sibling pair method. It allows all ill relative pairs to be included in the analysis. The major advantage of the affected relative pair method is that it can detect linkage with no a priori knowledge of the genetic and environmental parameters that mediate familial transmission. However, by eliminating
information about the mode of inheritance, the method sacrifices some statistical power.
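The null distribution of allele sharing in the identity-by-descent sib-pair setting can be derived by enumeration. The Python sketch below does this for parents with four distinct marker alleles and then compares hypothetical observed counts with that distribution using a Pearson chi-square statistic. The chi-square comparison is our illustrative choice and is not necessarily the test used in the cited studies. The function names and counts are hypothetical.

```python
from fractions import Fraction
from itertools import product


def ibd_distribution(father=("A", "B"), mother=("C", "D")):
    """Null distribution of the number of marker alleles (0, 1 or 2) that two
    siblings share identical by descent when all four parental alleles differ."""
    child_genotypes = list(product(father, mother))  # four equally likely genotypes
    counts = {0: 0, 1: 0, 2: 0}
    for sib1, sib2 in product(child_genotypes, repeat=2):
        # Same paternal allele contributes 1, same maternal allele contributes 1.
        shared = (sib1[0] == sib2[0]) + (sib1[1] == sib2[1])
        counts[shared] += 1
    total = len(child_genotypes) ** 2
    return {k: Fraction(v, total) for k, v in counts.items()}


def sharing_chi_square(observed_counts, null=None):
    """Pearson chi-square comparing observed sharing counts in affected sib pairs
    with the counts expected in the absence of linkage."""
    null = null or ibd_distribution()
    n = sum(observed_counts.values())
    return sum(
        (observed_counts[k] - n * float(null[k])) ** 2 / (n * float(null[k]))
        for k in null
    )


null = ibd_distribution()
# Hypothetical sample of 100 affected sib pairs showing excess sharing.
statistic = sharing_chi_square({0: 10, 1: 50, 2: 40})
```

Under no linkage, siblings share zero, one or two alleles identical by descent with probabilities 1/4, 1/2 and 1/4. An excess of pairs sharing two alleles, as in the hypothetical counts, is the pattern expected when the marker lies near a disease locus.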
3.2.4.4 The lod score method The lod score method requires knowledge of the mode of inheritance. Although it is possible to estimate the mode of inheritance and test for linkage simultaneously, the usual practice is to test for linkage under an assumed genetic model. We do so by estimating the recombination fraction: the probability that the disease and marker genes will recombine during meiosis. The lower the probability, the greater the likelihood of linkage. Most methods compute a maximum likelihood estimate of the recombination fraction. Then a likelihood ratio test compares the odds of the data occurring given that estimate with the odds of the data if the true recombination fraction is 0.5 (this is our null hypothesis because unlinked loci recombine with a probability of 0.5). This likelihood ratio is an odds ratio comparing the probability that linkage is present with the probability of no linkage. Since we usually examine the base-10 logarithm of the odds ratio, the test statistic is known as the lod score (log of the odds ratio). Lod scores greater than 3 are considered to be evidence in favour of linkage, while lod scores less than −2 constitute evidence against linkage. Thus, a linkage analysis will support the hypothesis of linkage if the odds favouring linkage are 1000 to 1 (i.e. log (1000/1) = 3). The main drawback of the lod score method is that we must specify parameters that describe the mode of genetic transmission. However, there is a way around this problem. Greenberg [44] showed that if we analyse our data several times under different modes of inheritance, the lod score will be greatest for the model closest to the true mode of inheritance. For example, we might choose to examine two dominant models and two recessive models. We might also vary the assumed frequency of the gene in the population. So far, we have been discussing linkage analyses that involve only two loci: one marker locus and one disease locus. 
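To make the lod score calculation concrete, here is a minimal Python sketch for the simplest possible case: a set of fully informative, phase-known meioses in which each one can be scored as recombinant or non-recombinant. This is a strong simplification that we adopt only for illustration. Real pedigree analyses require an assumed genetic model and specialised software. The function names are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two-point lod score at recombination fraction theta for phase-known meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    nonrecombinants = meioses - recombinants

    def log10_likelihood(t):
        # Likelihood is t**recombinants * (1 - t)**nonrecombinants.
        total = 0.0
        if recombinants:
            if t == 0.0:
                return float("-inf")
            total += recombinants * math.log10(t)
        if nonrecombinants:
            total += nonrecombinants * math.log10(1.0 - t)
        return total

    # Compare the likelihood at theta with the likelihood under no linkage (0.5).
    return log10_likelihood(theta) - log10_likelihood(0.5)


def max_lod(recombinants, meioses):
    """Maximum likelihood estimate of theta (capped at 0.5) and its lod score."""
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, lod_score(recombinants, meioses, theta_hat)


theta_hat, lod = max_lod(recombinants=0, meioses=10)
```

In this idealised setting, ten informative meioses with no recombinants give a lod score just above 3, corresponding to odds of 2 to the power 10 (1024) to 1 in favour of linkage.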
Since a disease gene will be surrounded by many potential marker loci, these ‘two point’ analyses will not have optimal power to detect linkage and locate the gene. Multipoint analyses use several markers simultaneously during the linkage
analysis. Multipoint mapping improves statistical power by using all available marker information in the area of the putative disease locus [45]. Lander [46] proposed ‘interval mapping’ which assesses linkage, not to a single marker, but to an interval flanked by a pair of markers. Xu et al. [47] evaluated interval mapping with statistical simulations. Compared with single point analyses, interval mapping was much more powerful, requiring 30% fewer families to detect linkage. Moreover, interval mapping was more robust to misclassification of penetrances, diagnoses and phenocopies. Although these considerations favour a multipoint approach, the method must be used with caution. Risch and Giuffra [48] showed that, if the mode of transmission is complex, multipoint analyses can spuriously reject linkage. However, they also show that this problem is mitigated when using high estimates of the disease allele frequency. The lod score method has been generalised for the detection of linkage heterogeneity. That is, we can test the null hypothesis that all families are linked against an alternative hypothesis that only a proportion are linked. The lod score criterion of three (LOD3) may not be appropriate for complex genetic diseases like schizophrenia. The LOD3 criterion was originally designed by Morton [49] for Mendelian diseases for which it was reasonable to compute a prior probability of linkage. For non-Mendelian diseases the prior probability is unknown [50, 51]. Morton also assumed that the test was carried out sequentially as pedigrees were collected. Thus, LOD3 does not apply to analyses of fixed sample sizes [50]. The LOD3 criterion must also be adjusted for the effects of testing multiple markers. This includes both the assessment of linkage at multiple loci and the use of multiple markers to assess linkage at a single locus [50, 52, 53].
Guidelines for interpreting linkage results for complex genetic disorders have adjusted the usual 5% probability of false positives since linkage analysis typically consists of multiple statistical tests. Lander and Kruglyak [54] proposed three levels of statistical significance for use in the interpretation of genome-wide linkage results. Suggestive linkage is a result that would be expected to occur by chance once during a genome-wide scan. For the lod score method, the p-value would be less than 0.0017 and for the sib-pair method, p < 0.0007.
Significant linkage is a result that would be expected to occur by chance 0.05 times during a genome-wide scan. For this level of significance, they specified p < 0.000049 for the lod score method and p < 0.00002 for the sib-pair method. Confirmed linkage refers to a significant linkage that is found in an initial scan and independently confirmed in another sample. The use of computer simulation methods to determine the appropriate lod score criterion was demonstrated by Weeks et al. [55]. Briefly, the procedure is as follows. First, the linkage analysis is performed on real data by maximising the lod score over genetic models and phenotype definitions. After a high lod score is found, a second analysis is performed on the same pedigrees. The only difference between the two analyses is that the first analysis uses the real marker data and the second uses simulated marker data. In the second analysis the markers are simulated under the assumption that the disease and marker are not linked. In the simulation, marker genotypes are assigned to subjects whose parents were not studied based on the marker gene frequencies used in the first analysis and the assumption of Hardy–Weinberg equilibrium. Marker genotypes are assigned to other pedigree members by simulating Mendelian laws of transmission on the pedigree. The simulation step is replicated many times and, for each replicate, we record the maximum lod score attained. This provides us with the distribution of maximum lod scores expected under the null hypothesis of no linkage. To set a Type-I error rate of α, we choose the lod score corresponding to the 1 – α point on the cumulative maximum lod score distribution computed by the simulation. To accurately estimate probabilities in the upper tail of the maximum lod score distribution, many replications are necessary.
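The logic of choosing a lod score criterion by simulation can be sketched in Python. The sketch is a toy stand-in for the pedigree simulation described above: instead of simulating marker genotypes on real pedigrees, it assumes a fixed number of phase-known meioses per marker and draws each meiosis as recombinant with probability 0.5, which is what no linkage implies. The marker count, sample size, seed and function names are all hypothetical.

```python
import math
import random


def maximised_lod(recombinants, meioses):
    """Lod score at the maximum likelihood recombination fraction (capped at 0.5)."""
    theta = min(recombinants / meioses, 0.5)
    lod = meioses * math.log10(2.0)  # minus log10 likelihood under theta = 0.5
    if recombinants:
        lod += recombinants * math.log10(theta)
    if meioses - recombinants:
        lod += (meioses - recombinants) * math.log10(1.0 - theta)
    return lod


def null_max_lods(n_markers, meioses, replicates, seed=0):
    """Largest lod score over all markers in each replicate simulated without linkage."""
    rng = random.Random(seed)
    results = []
    for _ in range(replicates):
        best = 0.0
        for _ in range(n_markers):
            # Under no linkage each meiosis is recombinant with probability 0.5.
            k = sum(rng.random() < 0.5 for _ in range(meioses))
            best = max(best, maximised_lod(k, meioses))
        results.append(best)
    return results


def empirical_threshold(null_scores, alpha):
    """Lod score at the (1 - alpha) point of the simulated null distribution."""
    ordered = sorted(null_scores)
    index = min(len(ordered) - 1, math.ceil((1.0 - alpha) * len(ordered)) - 1)
    return ordered[index]


scores = null_max_lods(n_markers=20, meioses=30, replicates=2000)
threshold = empirical_threshold(scores, alpha=0.05)
```

Because the criterion is read from the upper tail of the simulated distribution, its precision depends on the number of replicates. The 2000 replicates used here are only enough for a rough estimate at α = 0.05.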
3.2.5 What are the risk-conferring variants of the genes and what is the mechanism of disease? 3.2.5.1 Association studies Crossing over during meiosis shuffles the parental genes so that the chromosomes we receive from our fathers and mothers are not identical to any of their original chromosomes. Through the generations,
genes are constantly shifting from one chromosome to another. As a result, we should expect no association between alleles of genes on the same chromosome. For example, assume that locus 1 can have allele a or A and locus 2 can have allele b or B. If the two loci are on the same chromosome then the probability that any chromosome contains the pair Ab, P(Ab), should be equal to the probability of A, P(A), times the probability of b, P(b). That is, P(Ab) = P(A) × P(b). Similarly, P(AB) = P(A) × P(B), and so on. Put simply, if we know that a chromosome contains allele A at locus 1, this tells us nothing about the probability of locus 2 containing allele B or b. This random distribution of alleles at different loci on the same chromosome holds only partially. It is an empirical fact that some loci are associated with one another so that P(Ab) ≠ P(A) × P(b) [39]. For example, it may be that chromosomes with allele A at locus 1 are more likely to have allele b at locus 2 than we would expect by chance (i.e. than we would expect from the frequency of allele b in the population). Now, assume that locus 1 is a disease locus and that A is a dominant pathogenic allele. Also assume that locus 2 is a DNA marker locus. If the two loci are associated as indicated above, then people with the disease should be more likely to have marker allele b than people without the disease. This nonrandom association of alleles at different loci is called linkage disequilibrium. One cause of linkage disequilibrium is the fact that the reshuffling of genes on chromosomes depends on genetic distance. If two genes are very close to one another, then they will rarely be separated by crossing over and will usually be transmitted together. Thus, due to very close linkage, the alleles at two loci will tend to be transmitted together. We say ‘tend to’ because eventually crossing over will separate them. Fortunately, the reshuffling of linked genes can take many thousands of years.
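The departure from independence between alleles at two loci can be quantified as the difference between the observed haplotype frequency and the product of the allele frequencies. The Python sketch below computes this difference, conventionally written D, from a set of haplotype frequencies. The haplotype frequencies and the function name are hypothetical values chosen for illustration.

```python
def linkage_disequilibrium(haplotype_freqs, allele1="A", allele2="b"):
    """D = P(allele1 allele2) - P(allele1) * P(allele2), from haplotype frequencies.

    Haplotypes are two-character strings: locus 1 allele, then locus 2 allele.
    """
    total = sum(haplotype_freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p1 = sum(f for hap, f in haplotype_freqs.items() if hap[0] == allele1)
    p2 = sum(f for hap, f in haplotype_freqs.items() if hap[1] == allele2)
    joint = haplotype_freqs.get(allele1 + allele2, 0.0)
    return joint - p1 * p2


# Both hypothetical populations have P(A) = 0.3 and P(b) = 0.6.
equilibrium = {"AB": 0.12, "Ab": 0.18, "aB": 0.28, "ab": 0.42}
associated = {"AB": 0.05, "Ab": 0.25, "aB": 0.35, "ab": 0.35}

d_equilibrium = linkage_disequilibrium(equilibrium)  # P(Ab) equals P(A) x P(b)
d_associated = linkage_disequilibrium(associated)    # Ab is over-represented
```

A value of zero corresponds to the independence case P(Ab) = P(A) × P(b), while a positive value means that the Ab haplotype is more common than the allele frequencies alone would predict.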
This means that, theoretically, we should be able to detect associations between diseases and DNA markers if the marker locus is very close to the disease locus. Compared with a linkage study, the design and analysis of an association study is straightforward. We do not require pedigrees with multiple ill members. Samples of unrelated patients and controls will suffice (though family-based association study designs exist, and have their advantages). Instead of a complex
linkage analysis, all we need do is compare the rates of marker alleles (or genotypes) in patients and controls with standard statistical tests [39]. Genes within a linked region are candidates for involvement in the phenotype based on their chromosomal location or position (i.e. they are ‘positional candidate genes’). Within a linked region or even in the absence of linkage evidence, a gene may also be a candidate if there is some compelling reason to suspect that the gene influences risk for a given disorder. Association of candidate genes can be evaluated in an independent sample of cases with the disorder and matched control subjects (i.e. in a ‘case–control’ study), or in small family units, where the transmission of variant and normal forms of the gene from parents to offspring can be monitored. In a case–control association study, we simply count the number of each type of allele of a gene found in cases and compare these counts with the allele distribution seen in the control group; this process can also be performed for genotypes. A statistical test is then used to determine if the distribution of alleles observed among cases differs from that seen among controls. If it is different, then we have found evidence for a genetic association with the disorder, where the allele that is over-represented in the group of cases is considered the risk allele. The degree of over-representation of the risk allele in cases relative to controls can be used to derive an odds ratio, which compares the odds of an affected individual possessing the allele with the odds of an unaffected individual possessing the allele. Association studies can be performed for alleles or genotypes.
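The allele-counting comparison in a case–control study can be illustrated with a minimal Python sketch. It uses the standard Pearson chi-square test for a 2 × 2 table with one degree of freedom, which is one of several tests that could be applied, and the allele counts and function name are hypothetical.

```python
import math


def allelic_association(case_risk, case_other, control_risk, control_other):
    """Odds ratio, Pearson chi-square (1 df) and p-value for a 2 x 2 table of
    allele counts in cases and controls."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, chi_square, p_value


# Hypothetical counts: 200 cases and 200 controls, two alleles counted per person.
odds_ratio, chi_square, p_value = allelic_association(
    case_risk=120, case_other=280, control_risk=80, control_other=320
)
```

Here the risk allele makes up 30% of case alleles and 20% of control alleles, giving an odds ratio of about 1.7. The sketch makes no correction for multiple testing or for ancestry differences between the groups.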
In addition, a disorder can be tested for association with a haplotype, which is a pattern of alleles across several markers on the same chromosome (for a description of the International HapMap Project, which is dedicated to cataloguing the haplotype structure of the entire human genome, see http://hapmap.ncbi.nlm.nih.gov/). If linkage disequilibrium, or unusually tight linkage, occurs between the markers in a haplotype, they will typically be inherited together, as recombination will rarely occur between them. This concept is particularly useful for family-based association studies. In family-based studies, we can use analogous statistics to determine if any difference from the expected equal inheritance of
risk and normal alleles of a gene (or haplotypes within or across several genes) is detected in affected probands who could have received either allele from their parent. In a family-based study, the odds ratio estimates the haplotype relative risk, which represents the increase in the probability of the affected offspring receiving the risk allele (which is presumed to be on the same haplotype as the marker allele) relative to the normal allele. If the odds ratio, relative risk or other effect size attributed to a polymorphism is large enough to attain statistical significance, there are four possible explanations for the result: (i) there is a true association with a causative risk allele; (ii) the associated polymorphism is in linkage disequilibrium (i.e. is in close proximity and usually co-inherited) with the causal variant; (iii) there is some confounding factor that introduces a systematic bias (e.g. population stratification, or background genetic differences between case and control groups) or (iv) the result is due to chance or random error. A disadvantage of association studies is that the DNA marker must either be in the disease gene itself or very close to it. This is in contrast with the linkage method which can detect linkage over relatively large distances. However, unlike linkage analysis, association analysis can detect genes having only a small effect on the susceptibility to illness. Candidate gene association studies are limited by the method used to choose candidates. For a gene to become a positional candidate gene, it must map to a chromosomal region that has been observed to show linkage to the disorder. However, genes with a small but reliable effect on risk may not generate a linkage signal, and thus may never come under study. Selecting genes for association analysis based on their theoretical involvement in the disease process is risky as well. 
Since our understanding of the biological basis of most psychiatric disorders is far from comprehensive, the pursuit of candidate genes typically progresses incrementally through genes that are expressed within systems widely implicated in the disorder. This is clearly not an optimal process, as the prior probability of selecting the right candidate gene (out of ∼25 000 human genes) and the right polymorphism (out of more than 10 000 000 in the human genome) for analysis is remote. The recent advancement of laboratory and statistical methods
for genome-wide association analysis should allow for a more unbiased examination of association patterns throughout the genome and help resolve this dilemma in coming years [56]. Another limitation of association studies is that they are notoriously difficult to replicate, perhaps owing to their propensity towards false-positive results [57]. The problem of false positive results is exacerbated by the fact that close linkage is not the only cause of disease-marker associations. As discussed above, the frequencies of DNA marker alleles may vary among ethnic groups, cohorts of different ages or other isolated segments of a population. Thus, if case and control groups are not drawn from the same populations and carefully matched on all relevant factors, spurious differences in allele frequencies between groups will emerge due to the population admixture alone [58]. Because it may be difficult to find patient and control groups that are suitably matched for ancestry, genomic control methods have been advocated to account for any imprecision in matching. These methods genotype ancestry-informative markers (i.e. those that differ in frequency across ancestral groups) in addition to those of hypothesised importance in the study, and use the frequencies of these markers in cases and controls to derive an adjustment factor that can be applied to the results pertaining to the hypothesised risk locus. Several investigators have developed tests of linkage disequilibrium that use the parents of ill individuals as controls [59–63], which also circumvents the problems associated with ancestral mis-matching between cases and controls. The transmission disequilibrium test (TDT) uses families having at least one affected offspring and one parent who is heterozygous for the DNA marker to be tested [61]. The TDT compares the number of times heterozygous parents transmit the associated marker allele to affected offspring with the number of times they transmit the other marker allele. 
If these two transmission counts differ by more than is expected by chance, then we can conclude that linkage disequilibrium exists. Although the TDT solves the problem of ethnicity matching, it still faces the problem of false positives and must be cautiously interpreted in the absence of a credible candidate gene.
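The comparison made by the TDT can be sketched as a McNemar-type chi-square statistic with one degree of freedom, which is the usual form of the test for a biallelic marker. The function name and the counts below are hypothetical, and the sketch assumes that transmissions from heterozygous parents have already been tallied.

```python
import math


def tdt_statistic(transmitted, not_transmitted):
    """McNemar-type TDT chi-square (1 df) and its p-value.

    transmitted:     times heterozygous parents passed the marker allele
                     to an affected offspring
    not_transmitted: times they passed the other allele instead
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - not_transmitted) ** 2 / total
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical tallies: 60 transmissions of the marker allele, 40 of the other.
chi2, p = tdt_statistic(60, 40)
print(f"TDT chi-square = {chi2:.2f}, p = {p:.4f}")
```

Under the null hypothesis each heterozygous parent transmits either allele with probability one half, so the two counts are expected to be equal apart from sampling error.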
Family-based association tests (FBATs) have been developed as extensions to the TDT model, whereby parents or siblings of patients are used as controls. Since each parent transmits only one allele to a child, the allele that is not transmitted to the child is used as the control allele. The statistical test involves comparisons of the transmitted versus the non-transmitted allele. Because both alleles come from the same parent, there are no differences in ethnicity.
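To make the transmitted versus non-transmitted comparison concrete, here is a minimal sketch that tallies both counts from parent-offspring trios for a single biallelic marker. The genotype coding (number of copies of allele A: 0, 1 or 2), the function name and the example trios are assumptions made for illustration; published FBAT software handles many more situations, such as missing parents and siblings used as controls.

```python
def count_transmissions(trios):
    """Tally transmissions of allele A from heterozygous parents.

    trios: iterable of (father, mother, child) genotypes, each coded as the
           number of copies of allele A (0, 1 or 2).
    Returns (transmitted, not_transmitted) counts of allele A.
    """
    transmitted = 0
    not_transmitted = 0
    for father, mother, child in trios:
        parents = (father, mother)
        n_het = sum(1 for g in parents if g == 1)
        if n_het == 0:
            continue  # homozygous parents are uninformative
        # A homozygous AA parent always contributes one copy of A, an aa parent none.
        from_homozygous = sum(g // 2 for g in parents if g != 1)
        from_heterozygous = child - from_homozygous
        if not 0 <= from_heterozygous <= n_het:
            raise ValueError("genotypes are not consistent with Mendelian inheritance")
        transmitted += from_heterozygous
        not_transmitted += n_het - from_heterozygous
    return transmitted, not_transmitted


# Hypothetical trios (father, mother, affected child).
example_trios = [(1, 0, 1), (1, 2, 1), (1, 1, 2), (0, 0, 0), (1, 1, 1)]
print(count_transmissions(example_trios))  # (4, 2)
```

When both parents and the child are heterozygous, the sketch counts one transmission and one non-transmission, since exactly one parent must have passed on allele A even though it is not known which.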
3.3 Psychiatric genetics and psychiatric epidemiology
As this chapter shows, psychiatric genetics is a multidisciplinary endeavour. It combines the methodological talents of the epidemiologist, the mathematical proficiency of the statistician and the laboratory wizardry of the molecular geneticist. Although we now look towards molecular genetics and neuroscience to clarify the aetiological and pathophysiological details of psychiatric illness, these are unlikely to succeed without a continued partnership with epidemiology. Many complexities plague psychiatric genetics and many solutions have been proposed in the above referenced articles and others. We have summarised them in 10 key points as follows:
1 Use standardised diagnostic criteria.
2 Define diagnoses that will be included as affected cases before the data collection.
3 Use assessment and diagnostic procedures that minimise false positive diagnoses.
4 Ascertain pedigrees and collect data in a manner that can be reproduced by other investigators.
5 Collect detailed clinical and demographic data to allow comparisons with other samples and the derivation of quantitative traits.
6 Maintain complete blindness between the psychiatric diagnoses and marker statuses of all subjects.
7 Implement procedures to facilitate the follow-up of pedigree members.
8 Implement procedures to minimise laboratory errors.
9 Use a threshold of statistical significance that takes into account the data analytic issues unique to complex non-Mendelian disorders.
10 Allow other investigators access to complete pedigree and clinical information relevant to any publications of linkage results.
Psychiatric genetic researchers have a powerful toolbox of methods at their disposal for determining the genetic and environmental causes of mental illnesses. These methods span a wide spectrum, from clinical and behavioural genetic methods to molecular biological assays, reflecting the present status of psychiatric genetics as truly a multidisciplinary field. In addition, new methods such as transcriptomics (i.e. the analysis of gene transcription rates via mRNA quantification) and proteomics (i.e. the analysis of gene translation rates by protein quantification) are pushing the boundaries of what is considered ‘psychiatric genetics’. In a strict sense, these are not genetic techniques and thus may fall under the larger rubric of molecular psychiatry, or even biological psychiatry. However, these techniques examine gene products whose expression is influenced by both genetic and environmental factors, and in this sense, examining such molecules is entirely consistent with the approach of genetic epidemiology, which is to identify both genetic and environmental causes of disease. Despite this progress, the major contributions of psychiatric genetic research to the diagnosis, treatment, prediction and prevention of psychiatric disorders currently remain unrealised. As we have noted throughout the chapter, several limitations of genetic research and its impact on clinical practice must be acknowledged and overcome. Most importantly, the aetiological heterogeneity of psychiatric disorders must be embraced as the rule rather than the exception. Identifying phenotypic factors that differentiate genetic subtypes will allow future genetic research to derive more reasonable and reliable estimates of familial risk and heritability that are based on the particular features of the affected family and its members.
It is nevertheless exciting to visualise the contributions to clinical psychiatry and genetic counselling that await: reduced uncertainty in formulating primary and differential diagnoses; individually tailored pharmacotherapy and disease management; early identification and intervention, leading to better prognosis and ultimately, effective prevention programmes. As technologies improve,
experimental capacity increases, and computational methods become more efficient, it is expected that the rate of discovery of risk genes for psychiatric disorders will also accelerate. The identification of specific genetic risk factors for psychiatric disorders will then facilitate the identification and quantification of environmental risk factors that interact with these genes to produce illness. A thorough understanding of these determinants of mental illness will allow the considerable promise of the psychiatric genetic approach to be fulfilled.
Acknowledgements
Preparation of this article was supported in part by National Institutes of Health Grants R01DA012846, R01DA018662, R01MH065562, R01MH071912, R21MH075027 and R01MH081861 to Dr. M.T. Tsuang, P50MH081755 and R01MH085521 to Dr. S.J. Glatt, and R01DA018659, R01HD053586, R01MH066877, R01MH081803, R13MH059126 and U01MH085518 to Dr. S.V. Faraone, as well as a NARSAD Young Investigator Award to Dr. S.J. Glatt.
References
[1] Morton, N.E. (1982) Outline of Genetic Epidemiology, Karger, Basel.
[2] Faraone, S.V., Tsuang, D. and Tsuang, M.T. (1999) Genetics of Mental Disorders: A Guide for Students, Clinicians, and Researchers, Guilford, New York.
[3] Meehl, P. and Rosen, A. (1955) Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychol. Bull., 52, 194–216.
[4] Tsuang, M.T., Fleming, J.A., Kendler, K.S. and Gruenberg, A.M. (1988) Selection of controls for family studies: Biases and implications. Arch. Gen. Psychiatry, 45 (11), 1006–1008.
[5] Gibbons, R.D., Davis, J.M. and Hedeker, D.R. (1990) A comment on the selection of ‘healthy controls’ for psychiatric experiments. Arch. Gen. Psychiatry, 47, 785–786.
[6] Kruesi, M.J.P., Lenane, M.C., Hibbs, E.D. and Major, J. (1990) Normal controls and biological reference values in child psychiatry: defining normal. J. Am. Acad. Child Adolesc. Psychiatry, 29 (3), 449–452.
[7] Shtasel, D.L., Gur, R.E., Mozley, D. et al. (1991) Volunteers for biomedical research. Recruitment and screening of normal controls. Arch. Gen. Psychiatry, 48, 1022–1025.
[8] Kendler, K.S. (1990) The super-normal control group in psychiatric genetics: Possible artifactual evidence for coaggregation. Psychiatr. Genet., 1, 45–53.
[9] Miettinen, O.S. (1985) Theoretical Epidemiology, John Wiley & Sons, Inc., New York.
[10] Wacholder, S., McLaughlin, J.K., Silverman, D.T. and Mandel, J.S. (1992) Selection of controls in case-control studies. I. Principles. Am. J. Epidemiol., 135 (9), 1019–1028.
[11] Wacholder, S., Silverman, D.T., McLaughlin, J.K. and Mandel, J.S. (1992) Selection of controls in case-control studies. II. Types of controls. Am. J. Epidemiol., 135 (9), 1029–1041.
[12] Wacholder, S., Silverman, D.T., McLaughlin, J.K. and Mandel, J.S. (1992) Selection of controls in case-control studies. III. Design options. Am. J. Epidemiol., 135 (9), 1042–1050.
[13] Meehl, P.E. (1970) Nuisance variables and the ex post facto design, in Minnesota Studies in the Philosophy of Science (eds M. Radner and S. Winokur), University of Minnesota Press, Minneapolis, MN, pp. 373–402.
[14] Greenland, S. and Morgenstern, H. (1990) Matching and efficiency in cohort studies. Am. J. Epidemiol., 131 (1), 151–159.
[15] Slater, E. and Cowie, V. (1971) The Genetics of Mental Disorder, Oxford University Press, London.
[16] Shields, J. and Slater, E. (1975) Genetic aspects of schizophrenia. Br. J. Psychiatry, Special Publication 9, 32–40.
[17] Tsuang, M.T., Faraone, S.V. and Johnson, P. (1997) Schizophrenia: The Facts, Oxford University Press, Oxford.
[18] Faraone, S.V., Blehar, M., Pepple, J. et al. (1996) Diagnostic accuracy and confusability analyses: an application to the diagnostic interview for genetic studies. Psychol. Med., 26, 401–410.
[19] Nurnberger, J.I. Jr., Blehar, M.C., Kaufmann, C.A. et al. (1994) Diagnostic interview for genetic studies. Rationale, unique features, and training. Arch. Gen. Psychiatry, 51, 849–859.
[20] Gershon, E.S. and Guroff, J.J. (1984) Information from relatives. Diagnosis of affective disorders. Arch. Gen. Psychiatry, 41, 173–180.
[21] Leckman, J.F., Sholomska, D., Thompson, W.D. et al. (1982) Best estimate of lifetime diagnosis: a methodological study. Arch. Gen. Psychiatry, 39, 879–883.
[22] Smith, C. (1974) Concordance in twins: methods and interpretation. Am. J. Hum. Genet., 26, 454–466.
[23] Plomin, R., Defries, J.C. and McLearn, G.E. (1990) Behavioral Genetics: A Primer, Freeman, New York.
[24] Neale, M.C. and Cardon, L.R. (1992) Methodology for Genetic Studies of Twins and Families, Kluwer Academic Publishers, The Netherlands.
[25] Tsuang, M.T. and Faraone, S.V. (1990) The Genetics of Mood Disorders, Johns Hopkins, Baltimore.
[26] Cadoret, R.J. (1978) Evidence for genetic inheritance of primary affective disorder in adoptees. Am. J. Psychiatry, 135, 463–466.
[27] Cadoret, R.J., O’Gorman, T.W., Heywood, E. and Troughton, E. (1985) Genetic and environmental factors in major depression. J. Affect. Disord., 9, 155–164.
[28] Mendlewicz, J. and Rainer, J.D. (1977) Adoption study supporting genetic transmission in manic-depressive illness. Nature, 268, 327–329.
[29] Wender, P.H., Kety, S.S., Rosenthal, D. et al. (1986) Psychiatric disorders in the biological and adoptive families of adopted individuals with affective disorders. Arch. Gen. Psychiatry, 43, 923–929.
[30] Heston, L.L. (1966) Psychiatric disorders in foster home-reared children of schizophrenic mothers. Br. J. Psychiatry, 112, 819–825.
[31] Kety, S.S., Rosenthal, D., Wender, P.H. and Schulsinger, F. (1968) The types and prevalence of mental illness in the biological and adoptive families of adopted schizophrenics. J. Psychiatr. Res., 1, 345–362.
[32] Kety, S.S., Rosenthal, D. and Wender, P.H. (1978) The biologic and adoptive families of adopted individuals who became schizophrenic: prevalence of mental illness and other characteristics, in The Nature of Schizophrenia: New Approaches to Research and Treatment (eds L.C. Wynne, R.L. Cromwell and S. Matthysse), John Wiley & Sons, Inc., New York, pp. 25–37.
[33] Kotsopoulos, S., Côté, A., Joseph, L. et al. (1988) Psychiatric disorders in adopted children: a controlled study. Am. J. Orthopsychiatry, 58 (4), 608–612.
[34] Deutsch, C.K., Swanson, J.M., Bruell, J.H. et al. (1982) Short communication: overrepresentation of adoptees in children with attention deficit disorder. Behav. Genet., 12, 231–238.
[35] Morton, L.A., Kidd, K.K., Matthysse, S.W. and Richards, R.L. (1979) Recurrence risks in schizophrenia: are they model dependent? Behav. Genet., 9, 389–406.
[36] Lalouel, J.M., Rao, D.C., Morton, N.E. and Elston, R.C. (1983) A unified model for complex segregation analysis. Am. J. Hum. Genet., 35, 816–826.
[37] Elston, R.C. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered., 21, 523–542.
[38] Bailey-Wilson, J.E. and Elston, R.C. (1989) Statistical Analysis for Genetic Epidemiology, Department of Biometry and Genetics, LSU Medical Center, New Orleans.
[39] Vogel, F. and Motulsky, A.G. (1986) Human Genetics: Problems and Approaches, Springer-Verlag, Berlin.
[40] Falconer, D.S. (1965) The inheritance of liability to certain disease, estimated from the incidence among relatives. Ann. Hum. Genet., 29, 51–71.
[41] Weeks, D.E. and Lange, K. (1988) The affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet., 42, 315–326.
[42] Ward, P.J. (1993) Some developments on the affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet., 52, 1200–1215.
[43] Bishop, D.T. and Williamson, J.A. (1990) The power of identity-by-state methods for linkage analysis. Am. J. Hum. Genet., 46, 254–265.
[44] Greenberg, D.A. (1989) Inferring mode of inheritance by comparison of lod scores. Am. J. Med. Genet., 34, 480–486.
[45] Spence, M.A. (1987) Genetic linkage: sampling issues and multipoint mapping. J. Psychiatr. Res., 21 (4), 631–637.
[46] Lander, E.S. (1988) Splitting schizophrenia. Nature, 336, 105–106.
[47] Xu, J., Weisch, D.G. and Meyers, D.A. (1998) Genetics of complex human diseases: genome screening, association studies and fine mapping. Clin. Exp. Allergy, 28 (5), 1–5.
[48] Risch, N. and Giuffra, L. (1992) Model misspecification and multipoint linkage analysis. Hum. Hered., 42, 77–92.
[49] Morton, N.E. (1955) Sequential tests for the detection of linkage. Am. J. Hum. Genet., 7, 277–318.
[50] Clerget-Darpoux, F., Babron, M-C. and Bonaïti-Pellié, C. (1990) Assessing the effect of multiple linkage tests in complex diseases. Genet. Epidemiol., 7, 245–253.
[51] Green, P. (1990) Genetic linkage and complex diseases: a comment. Genet. Epidemiol., 7, 25–27.
[52] Edwards, J.H. (1990) The linkage detection problem. Ann. Hum. Genet., 54, 253–275.
[53] Edwards, J.H. and Watt, D.C. (1989) Caution in locating the gene(s) for affective disorder. Psychol. Med., 19, 273–275.
[54] Lander, E. and Kruglyak, L. (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat. Genet., 11, 241–247.
[55] Weeks, D.E., Lehner, T., Squires-Wheeler, E. et al. (1990) Measuring the inflation of the lod score due to its maximization over model parameter values in human linkage analysis. Genet. Epidemiol., 7, 237–243.
[56] Thomas, D.C., Haile, R.W. and Duggan, D. (2005) Recent developments in genomewide association scans: a workshop summary and review. Am. J. Hum. Genet., 77 (3), 337–345.
[57] Ioannidis, J.P., Ntzani, E.E., Trikalinos, T.A. and Contopoulos-Ioannidis, D.G. (2001) Replication validity of genetic association studies. Nat. Genet., 29 (3), 306–309.
[58] Freedman, M.L., Reich, D., Penney, K.L. et al. (2004) Assessing the impact of population stratification on genetic association studies. Nat. Genet., 36 (4), 388–393.
[59] Rubinstein, P., Walker, M., Carpenter, C. et al. (1981) Genetics of HLA disease associations: the use of the haplotype relative risk (HRR) and the ‘haplo-delta’ (Dh) estimates in juvenile diabetes from three racial groups. Hum. Immunol., 3, 384.
[60] Falk, C.T. and Rubinstein, P. (1987) Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Ann. Hum. Genet., 51, 227–233.
[61] Spielman, R.S., McGinnis, R.E. and Ewens, W.J. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet., 52, 506–516.
[62] Ott, J. (1989) Statistical properties of the haplotype relative risk. Genet. Epidemiol., 6, 127–130.
[63] Knapp, M., Seuchter, S.A. and Baur, M.P. (1993) The haplotype-relative-risk (HRR) method for analysis of association in nuclear families. Am. J. Hum. Genet., 52, 1085–1093.
Mednick, S.A., Mura, E., Schulsinger, F. and Mednick, B. (1971) Perinatal conditions and infant development in children with schizophrenic parents. Soc. Biol. (Suppl. 18), 103. Fish, B., Marcus, J., Hans, S.L. et al. (1992) Infants at risk for schizophrenia: sequelae of a genetic neurointegrative defect. A review and replication analysis of pandysmaturation in the Jerusalem infant development study. Arch. Gen. Psychiatry, 49, 221–235. Biederman, J., Rosenbaum, J.F., Bolduc, E.A. et al. (1991) A high risk study of young children of parents with panic disorder and agoraphobia with and without comorbid major depression. Psychiatry Res., 37, 333–348.
GENETIC EPIDEMIOLOGY Orvaschel, H. (1990) Early onset psychiatric disorder in high risk children and increased familial morbidity. J. Am. Acad. Child Adolesc. Psychiatry, 29 (2), 184–188. Tsuang, M.T., Faraone, S.V. and Lyons, M.J. (1993) Advances in psychiatric genetics, in International Review of Psychiatry, vol. 1 (eds J.A. Costae Silva, C.C. Nadelson, N.C. Andreasen and M. Sato), American Psychiatric Press, Washington, DC, pp. 395–440. Tsuang, M.T., Gilbertson, M.W. and Faraone, S.V. (1991) Genetic transmission of negative and positive symptoms in the biological relatives of schizophrenics, in Positive vs. Negative Schizophrenia (eds A. Marneros, M.T. Tsuang and N. Andreasen), Springer-Verlag, New York, pp. 265–291. Morton, N.E., Rao, D.C. and Lalouel, J.-M. (1983) Methods in Genetic Epidemiology, Karger, New York. Sorant, A.J.M. and Elston, R.C. (1989) Segregation analysis of a truncated (censored) trait with logistic P.D.F. (REGTL version 1.0), in Statistical Analysis for Genetic Epidemiology (eds J.E. Bailey-Wilson and R.C. Elston), Department of Biometry and Genetics, LSU Medical Center, New Orleans. Morton, N.E. and MacLean, C.J. (1974) Analysis of family resemblance. III. Complex segregation analysis of quantitative traits. Am. J. Hum. Genet., 26, 489–503. Iselius, L. and Morton, N.E. (1991) Transmission probabilities are not correctly implemented in the computer program POINTER. Am. J. Hum. Genet., 49 (459), 459. Sorant, A.J.M. and Elston, R.C. (1989) A subroutine package for function maximization (A users guide to MAXFUN version 5.0), in Statistical Analysis for Genetic Epidemiology (eds J.E. Bailey-Wilson and R.C. Elston), Department of Biometry and Genetics, LSU Medical Center, New Orleans. Akaike, H. (1974) A new look at statistical model identification. IEEE Trans. Autom. Control, AC-19 (6), 716–723. DeLisi, L.E., Dauphinais, I.D. and Hauser, P. (1989) Gender differences in the brain: are they relevant to the pathogenesis of schizophrenia? Comp. 
Psychiatry, 30 (3), 197–208. Goldstein, J.M., Tsuang, M.T. and Faraone, S.V. (1989) Gender and schizophrenia: implications for understanding the heterogeneity of the illness. Psychiatry Res., 28 (3), 243–253. Faraone, S.V., Biederman, J., Keenan, K. and Tsuang, M.T. (1991) A family-genetic study of girls with DSM-III attention deficit disorder. Am. J. Psychiatry, 148 (1), 112–117. Pauls, D.L. (1979) Sex effect on the risk of mental retardation. Behav. Genet., 9 (4), 289–295.
Harris, T., Surtees, P. and Bancroft, J. (1991) Is sex necessarily a risk factor to depression? Br. J. Psychiatry, 158, 708–712. Cloninger, C.R., Christiansen, K.O., Reich, T. and Gottesman, I.I. (1978) Implications of sex differences in the prevalences of antisocial personality, alcoholism, and criminality for familial transmission. Arch. Gen. Psychiatry, 35, 941–951. Berney, T.P. (1989) Fragile X syndrome and disorders of the sex chromosome. Curr. Opin. Psychiatry, 2, 593–598. Khoury, M.J., Beaty, T.H. and Cohen, B.H. (1993) Fundamentals of Genetic Epidemiology, Oxford University Press, New York. Ottman, R. (1990) An epidemiologic approach to gene–environment interaction. Genet. Epidemiol., 7, 177–185. Fischer, M. (1971) Psychosis in the offspring of schizophrenic monozygotic twins and their normal co-twins. Br. J. Psychiatry, 118, 43–52. Merikangas, K.R., Spence, A. and Kupfer, D.J. (1989) Linkage studies of bipolar disorder: methodologic and analytic issues. Report of MacArthur foundation workshop on linkage and clinical features in affective disorders. Arch. Gen. Psychiatry, 46, 1137–1141. Ott, J. (1990) Invited editorial: cutting a Gordian knot in the linkage analysis of complex human traits. Am. J. Hum. Genet., 46, 219–221. Risch, N. (1990) Genetic linkage and complex diseases, with special reference to psychiatric disorders. Genet. Epidemiol., 7, 3–7. Weeks, D.E., Brzustowicz, L., Squires-Wheeler, E. et al. (1990) Report of a workshop on genetic linkage studies in schizophrenia. Schizophr. Bull., 16 (4), 673–686. Pato, C.N., Lander, E.S. and Schulz, S.C. (1989) Prospects for the genetic analysis of schizophrenia. Schizophr. Bull., 15 (3), 365–372. Faraone, S.V. and Santangelo, S. (1992) Methods in genetic epidemiology, in Research Designs and Methods in Psychiatry (eds M. Fava and J.F. Rosenbaum), Elsevier, Amsterdam, pp. 87–105. Andreasen, N.C., Endicott, J., Spitzer, R.L. and Winokur, G. (1977) The family history method using diagnostic criteria. 
Reliability and validity. Arch. Gen. Psychiatry, 34, 1229–1235. NIMH Genetics Initiative (1992) Family Interview for Genetic Studies, National Institute of Mental Health, Rockville.
51
4 Examining gene–environment interplay in psychiatric disorders
Judith Allardyce (1) and Jim van Os (1, 2)
(1) Maastricht University Medical Centre, School of Mental Health and Neuroscience, Department of Psychiatry and Neuropsychology, South Limburg Mental Health Research and Teaching Network, Maastricht, The Netherlands
(2) King’s College, King’s Health Partners, Department of Psychosis Studies, Institute of Psychiatry, London, UK
4.1 Introduction
Epidemiologists have traditionally studied the distribution and determinants of health-related states as a function of environmental exposures, for example stressful life events, pregnancy and birth complications, minority status, urban environments and cannabis use. The aim has been to identify environmental risk factors that are causal and potentially modifiable. In contrast, classical psychiatric genetics has tended to focus on gene mapping to identify single susceptibility or candidate genes for a particular disorder, or on statistical modelling techniques that aim to quantify heritability. Genetic methodologies assume that genetic determinants are true signals while environmental effects are noise, which should be controlled for where possible. Studies examining only environmental risk factors, in turn, do not take account of the possibility that exposure and outcome may share a genetic liability. Over the past few decades, it has become increasingly clear that the identification of specific genes and environmental risk factors will be greatly aided by integrating the respective fields of epidemiology and genetics into the new discipline of genetic epidemiology, which considers the joint actions of genetic and environmental factors in causing disease within human populations, and
the pattern of inheritance in families. Family studies have been the basic approach used by genetic epidemiologists, but classical epidemiological designs such as case–control and cohort studies are considered useful in many situations [1–4]. Genetic epidemiology considers both environmental and genetic risk factors, and their interactions, with potential parity, by extending the conceptualisation of epidemiological exposures to include genetic factors and family relationships [5, 6]. This joint approach aims to unravel the pathway from genotype to phenotype by trying to understand how an individual’s genetic makeup modifies their susceptibility to causal environmental exposures or modifies their level (dose) of such exposures. It also explores ways in which the physical and social structure of the environment may exacerbate genetic risk. In fact, the best evidence that there are non-genetic causal factors in the expression of psychiatric disorders comes from the classical twin studies, which demonstrate concordance rates for monozygotic twins of between 50 and 70% for schizophrenia or mood disorders. It is also quite probable, of course, that (partial) genetic mechanisms such as epigenetic effects, stochastic factors or mitochondrial inheritance contribute to this discordance, inflating the estimated ‘environmental’ contribution [7–9]. Similarly, standard
heritability scores are confounded, in that they do not discriminate between ‘purely’ genetic determinants, gene–environment correlation (genetic mediation of the level of environmental exposures) and gene–environment interaction (genetic mediation of the susceptibility to a causal environmental exposure). The high proportion of monozygotic co-twins who remain unaffected certainly points towards an environmental influence in the aetiology of these disorders [10, 11].
4.2 The process of genetic epidemiology
The identification of causal factors in disease onset generally follows a series of logical steps or questions. This is the process or chain of genetic epidemiological research [2]. First, there is a systematic analysis of the rates of disease, comparing and contrasting rates in populations over time, between places and within different subgroups. For example, rates in people who have migrated, differences between socio-economic classes and variations by age or gender can all provide clues to putative genetic and environmental causal factors. The second question asks whether the disorder tends to run in families: do the relatives of patients with a disorder have rates of that disorder higher than would be expected by chance alone? Once a familial pattern is identified, the risk to relatives is examined for correlation with the closeness (degree) of the relationship. That is, first degree relatives (parents, siblings and children), who share 50% of their genes on average with the patient, should have a higher risk for the disorder than second degree relatives, who share on average only 25% of their genes with the affected person. The family study is a robust design, in many cases providing the initial clue that a disorder has a genetic component. However, we have to be cautious in our interpretation of such familiality, as there are always alternative non-genetic explanations for family clustering, such as shared environment, social learning and viral causes, to name but a few. The third question asks what type of genetic transmission is occurring: is the familial pattern compatible with one or more major genes, or more suggestive of an oligogenic or polygenetic transmission
or of shared environmental factors? Single major gene models propose that one gene accounts for most of the genetic transmission, while other genes and environmental exposures play only a minor role in modifying the expression of the disorder or determining its age of onset. Oligogenic models assume that a few genes (e.g., fewer than 10) combine to cause the disorder. This joint effect can be additive, that is the likelihood of developing the disorder is simply a linear function of the number of genes, or the mechanism may be interactive, or a combination of the two. The multifactorial–polygenetic model posits that an unspecified number of genes, perhaps running into the hundreds, together with environmental factors act additively or interactively to increase the risk of a disorder. Associated with the polygenetic model of disease is the concept of genetic liability, a latent trait that predisposes an individual to a particular disorder. Its functional form may be a continuous linear association, or a threshold form in which disease occurs once a liability threshold has been crossed. The polygenetic model assumes that liability is normally distributed in the population. The fourth question asks where the genes for the disorder are located. This uses linkage analyses, a methodology examining family pedigrees (multiple case families) or affected sib pairs, using polymorphic markers of known chromosomal location in blood samples taken from the patient and their family. Linkage analyses establish whether markers are transmitted through the pedigree in a manner that parallels disease transmission (cotransmission/cosegregation), so that the general chromosomal location of the susceptibility gene can be identified.
This process of gene mapping often begins by examining an array of widely spaced markers positioned across the whole genome, the search region narrowing down as information accrues. It is later augmented with fine tuning techniques such as linkage disequilibrium (LD) mapping, which exploits the fact that common variants which are located close together on a chromosome [...]

Model of joint action    Risk pattern                     Relative risk (RR) pattern
No interaction           r(GE) − r(G) = r(E) − r(00)      RR(GE) = RR(G) + RR(E) − 1
Synergistic action       r(GE) − r(G) > r(E) − r(00)      RR(GE) > RR(G) + RR(E) − 1
Risks can also be measured on a multiplicative scale, where the effect of a risk factor is measured as a ratio (relative risk) rather than as a risk difference. For example, if r(G) is 0.25 and r(00) is 0.10, then the effect of G is 0.25/0.10 = 2.5. In general, the effect of G can be expressed as r(G)/r(00), the effect associated with E as r(E)/r(00) and the effect associated with the joint GE exposure as r(GE)/r(00). Using this scale, the expected patterns of risk and relative risk with and without interaction would be:
Model of joint action    Risk pattern                     Relative risk (RR) pattern
No interaction           r(GE)/r(G) = r(E)/r(00)          RR(GE) = RR(G) × RR(E)
Synergistic action       r(GE)/r(G) > r(E)/r(00)          RR(GE) > RR(G) × RR(E)
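The two scales can give different answers for the same four risks. The following minimal Python sketch evaluates the additive and the multiplicative no-interaction patterns side by side. The function name is hypothetical, and only r(00) = 0.10 and r(G) = 0.25 are taken from the example in the text; the values used for r(E) and r(GE) are invented purely for illustration.

```python
def interaction_contrasts(r00, r_g, r_e, r_ge):
    """Compare the joint risk r(GE) with both no-interaction expectations."""
    rr_g, rr_e, rr_ge = r_g / r00, r_e / r00, r_ge / r00
    return {
        # Additive scale: r(GE) - r(G) versus r(E) - r(00); 0 means no interaction.
        "additive_risk_excess": (r_ge - r_g) - (r_e - r00),
        # The same contrast in relative risks: RR(GE) versus RR(G) + RR(E) - 1.
        "additive_rr_excess": rr_ge - (rr_g + rr_e - 1),
        # Multiplicative scale: RR(GE) versus RR(G) x RR(E); 1 means no interaction.
        "multiplicative_ratio": rr_ge / (rr_g * rr_e),
    }


# r(00) and r(G) are from the text; r(E) = 0.20 and r(GE) = 0.50 are invented.
contrasts = interaction_contrasts(r00=0.10, r_g=0.25, r_e=0.20, r_ge=0.50)
for name, value in contrasts.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative values the joint risk exceeds the additive expectation, while the relative risks combine exactly multiplicatively, so the same data would be read as synergistic on one scale and as showing no interaction on the other.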
4.8 Which scale should we use to measure GxE?
A given data set can be tested to see how well it conforms to an additive or a multiplicative model. The question of which scale should be adopted has been heavily contested, though the theoretical literature on sufficient cause models seems to be converging on a consensus advocating use of a fixed reference definition based on the additive scale. This argument integrates the idea that two causal factors from different causal pies (mechanisms) will generally have an additive relationship, whereas component causes from the same mechanism will have a relationship which is super-additive [34]. It is therefore possible that our previous emphasis on multiplicative models has meant missing biologically relevant interactions [50]. Further support for use of the additive scale when examining interactions comes from theoretical work regarding the concept of biological parallelism. It is possible that within the group of individuals exposed to both G and E there is a subgroup of individuals who would contract the disorder if they were exposed to just one of the risk factors G or E, that is the risk factors act in parallel (|G E|). This conceptualisation of potentially competing risk factors does not fit neatly into the sufficient causal framework. However, it has been proposed that it could be accommodated by modifying the model to allow one component of a sufficient cause to be non-definite, containing either G alone, E alone, synergistic GxE or parallel |G E|. Considering parallelism potentially allows the extent of interaction to be quantified, by estimating the proportional size of the subgroup with GxE synergism within the group of individuals who are exposed to both G and E [51]. The actual amount of interaction or parallelism cannot be directly measured in individuals exposed to both
G and E. However, it has been demonstrated that the amount of synergism exceeding parallelism equates to the statistical additive interaction [52]. In practice the amount of interaction has been approximated using the contingency table suggested by Darroch [51].

Approximation of synergy (true interaction effect)

|synergism|      |x1|             r(GE) − r(E)
|x2|             |parallelism|    r(E) − r(00)
r(GE) − r(G)     r(G) − r(00)     r(GE) − r(00)
Take, for example, the Finnish Adoption Study [53]. Diagnosis of maternal schizophrenia was used as a proxy marker for G, while E was the level of communication deviance and thought disorder in the adoptive family, and the disease outcome was broadly defined schizophrenia spectrum disorder. The risk of schizophrenia spectrum disorder was around 4% both in the group of individuals exposed to neither factor and in those exposed to G alone. The r(E) was 34% and the r(GE) was 62%. Filling in the risks for Darroch’s table gives the following (x1 and x2 are unmeasured parameters).

Approximation of synergy (true interaction effect) in the Finnish Adoption Study

|synergism|      |x1|             0.28
|x2|             |parallelism|    0.30
0.58             0                0.58
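The margins of this table are simple risk differences, so they can be recomputed from the four reported risks. The Python sketch below does this for the Finnish Adoption Study figures. The function name is hypothetical, and each margin is keyed by the risk difference it equals rather than by its position in the table, so the sketch makes no assumption about the table layout.

```python
def darroch_margins(r00, r_g, r_e, r_ge):
    """Risk differences that form the margins of Darroch's table."""
    return {
        "r(GE) - r(G)": r_ge - r_g,
        "r(G) - r(00)": r_g - r00,
        "r(GE) - r(E)": r_ge - r_e,
        "r(E) - r(00)": r_e - r00,
        "r(GE) - r(00)": r_ge - r00,
        # Synergism exceeding parallelism equals the additive interaction contrast.
        "synergism - parallelism": r_ge - r_g - r_e + r00,
    }


# Risks as reported in the text for the Finnish Adoption Study.
finnish = darroch_margins(r00=0.04, r_g=0.04, r_e=0.34, r_ge=0.62)
for label, value in finnish.items():
    print(f"{label}: {value:.2f}")
```

The last entry restates the result cited above, that the amount by which synergism exceeds parallelism equals the statistical additive interaction.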
It follows that |x2| must lie somewhere between 0 and 0.30. Therefore |synergism| must be between 0.28 and 0.58. That is, between 45% (0.28/0.62) and 94% (0.58/0.62) of the patients with schizophrenia spectrum disorder exposed to both communication deviance in the family (E) and an affected mother (G) seem to be attributable to GxE. It is possible that the optimum choice of measurement scale depends on the goal of the investigation. Additive models have certainly been shown to be of greater relevance when assessing the public health impact of interactions, as interventions at the population level need to be understood in the context of the prevalence of all other causal factors [49], while some argue that the multiplicative scale is more apposite for aetiological research. If the environmental risks have very low variance,
that is they are pervasive, multiplicative GxE models would not be helpful even if the biological reality is that the effect of the genotype is contingent on the environmental exposure. For example, genetically moderated susceptibility to malaria in regions where the infection is endemic would not be demonstrated by epidemiological GxE studies on the multiplicative scale [54]. However, in the absence of specific pathophysiological models of the disorder being studied, the scale chosen to measure interactions will at some level be arbitrary [48]. This concern notwithstanding, some researchers have postulated that the multiplicative scale may be more appropriate when certain causative pathways are being investigated. For example, a multistage model of disease progression may best be described on the multiplicative scale, as the onset of the disorder will only occur after a number of iterative stages are completed, that is the stages are independent of each other. Genetic factors may influence risk for the first stage of the disorder, and the first stage only, whereas an environmental factor may influence risk for a second stage, and the second stage only, with the disorder developing only after there has been a transition from stage 1 to stage 2 [55]. Under this hypothesis the combined effect of G and E will be equal to the product of the individual effects. Therefore, if the magnitude of the effect of G on the disorder is 10 and the effect size of E equals 2, then the effect of their combined exposure will be 20. Inherent in the use of the multiplicative scale is the idea that the causal factors are independent of each other. However, the current definition of the sufficient component cause framework is deliberately abstract and independent of any specific disease model.
It considers component causes acting at different times as biologically interacting (that is, they are not considered independent) if both are necessary components of a specific causal mechanism, so the additive scale still fits. It is important to note that the presence or absence of interaction on the additive scale does not indicate any specific causal model. When interactions are reported in scientific journals and the multiplicative scale has been used, it is recommended that sufficient information be provided to allow the interested reader to interpret the interaction on the additive scale, and therefore within a sufficient causal framework [56].
This can be done by presenting the direct effects of the genetic and environmental risk factor and their joint effect, relative to the group of individuals exposed to neither factor [57]. Another approach is to present the full multiplicative model, including the direct effects and product term so allowing the recalculation of the joint effect on the additive scale [58].
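As a rough illustration of the second approach, the Python sketch below recovers the joint effect and its departure from additivity from the three coefficients of a multiplicative (log-scale) model. It rests on several assumptions not stated in the text: the coefficients are taken to be log relative effects from a model with two binary exposures and a product term, the function name is hypothetical, and the additive departure is summarised as the relative excess risk due to interaction, RERI = RR(GE) − RR(G) − RR(E) + 1. If the coefficients come from logistic regression, the odds ratios stand in for relative risks only when the disorder is rare.

```python
import math


def additive_departure_from_multiplicative(beta_g, beta_e, beta_gxe):
    """Rebuild joint effects from log-scale coefficients and return the RERI."""
    effect_g = math.exp(beta_g)                       # G alone versus neither
    effect_e = math.exp(beta_e)                       # E alone versus neither
    effect_ge = math.exp(beta_g + beta_e + beta_gxe)  # both versus neither
    reri = effect_ge - effect_g - effect_e + 1        # 0 under exact additivity
    return effect_g, effect_e, effect_ge, reri


# Multistage example from the text: effects of 10 and 2, no product term.
effect_g, effect_e, effect_ge, reri = additive_departure_from_multiplicative(
    beta_g=math.log(10), beta_e=math.log(2), beta_gxe=0.0
)
print(f"joint effect = {effect_ge:.1f}, RERI = {reri:.1f}")
```

In the multistage example the product term is zero, so there is no interaction on the multiplicative scale, yet the joint effect of 20 exceeds the additive expectation of 11, which is why reporting enough detail to move between scales matters.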
4.9 Study designs for the detection of GxE
There are several different study designs that potentially allow GxE to be detected, each with its own strengths and potential problems [59, 60].
4.9.1 Cohort design
The cohort study has been the design of choice for common disorders; however, rarer disorders and those with a late or very wide range of age of onset may require sample sizes which are too large to be practically viable. In this design DNA samples and environmental exposure information can be obtained from an initially healthy sample that is followed up prospectively. As the assessment of the environmental exposure occurs prior to onset of the disorder, it is relatively free of information (recall) bias. High rates of follow-up are, however, necessary to reduce selection bias. This can be difficult if the disorder has a long incubation period requiring years or even decades of observation. It is also necessary to try to ensure that DNA samples come from a high proportion of both cases and controls, as differential take-up can lead to selection bias. Information about ethnic background or genomic control methods [61] should be considered if population stratification is a potential problem due to migration. A nested case–control approach can be used to compare cases with those individuals who did not develop the disorder. The analysis uses the group of individuals exposed to neither the environmental risk factor nor the high risk gene variant as the reference group, so estimating odds ratios. Measured confounding variables can be adjusted for by stratification procedures or by using multivariable modelling such as logistic regression, Poisson regression or Cox’s proportional hazard models [62, 63].
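The odds ratios described above, each taken relative to the group exposed to neither factor, can be sketched in Python as follows. The function name and all counts are hypothetical and chosen only so that the arithmetic is easy to follow; no adjustment for confounders is attempted.

```python
def odds_ratios_vs_unexposed(cases, controls):
    """Odds ratios for G only, E only and GE, relative to the doubly unexposed group.

    cases and controls are dicts of counts keyed by '00', 'G', 'E' and 'GE'.
    """
    reference_odds = cases["00"] / controls["00"]
    return {
        group: (cases[group] / controls[group]) / reference_odds
        for group in ("G", "E", "GE")
    }


# Invented counts from a hypothetical nested case-control sample.
cases = {"00": 20, "G": 30, "E": 40, "GE": 90}
controls = {"00": 200, "G": 150, "E": 200, "GE": 150}
odds_ratios = odds_ratios_vs_unexposed(cases, controls)
for group, value in odds_ratios.items():
    print(f"OR({group}) = {value:.2f}")
```

Presenting the three odds ratios against a single reference group, rather than only a product term, lets a reader assess the joint effect on either the additive or the multiplicative scale.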
4.9.2 Case–control design
Generally, case–control studies are more economical than cohort studies. Further, they are potentially powerful methods for the investigation of rarer disorders. Selection bias, particularly due to the selection of controls who may not be representative of the population at risk, is a major limitation. Further, if cases are collected from a clinical setting then the sample is enriched with individuals who are help-seeking. This is especially problematic when investigating less severe disorders, as many people will not engage with services for mild disability. Information on environmental exposure is often collected retrospectively, which may result in information (recall) bias. It is therefore preferable if the estimation of previous exposure comes from multiple sources or contemporaneous records. A high and non-differential take-up rate for DNA analyses is required if we wish to obtain unbiased estimates of genetic main effects and interactions; however, biased main effect estimates for the environmental factor [64] and biased genetic main effects [65] may still result in relatively unbiased interaction parameters. If there is an ethnic differential between cases and controls then population stratification could result in spurious gene variant associations. The controls can be unrelated individuals or relatives of the cases. Studies with unrelated controls can be analysed by using the group of individuals exposed to neither the environmental risk factor nor the high-risk gene variant as the reference comparison and estimating odds ratios, controlling for measured confounding variables by stratification procedures. Multivariable methods such as logistic regression have traditionally measured interaction terms on the multiplicative scale, but more recently extensions have been developed to assess interactions on the additive scale [66, 67].
When relatives are used as controls, detection of interactions may be more efficient, as we are enriching the sample with the high risk gene. However, if the risk variant has a high frequency in the family controls there will be a loss of contrast, which will reduce the study power, such that in the most extreme case, monozygotic twins, testing for GxE will more likely reflect main effects rather than the interaction effect. Each case is matched to one or more unaffected
relatives and conditional logistic regression models are used to estimate the GxE. The main threat to the validity of findings from such studies is that both genes and environment are generally shared by family members, so correlation on unmatched risk factors within the matched case–control pair is likely. Furthermore, gene–environment correlation is built into the design, reducing its power to detect GxE. Twin studies share the same disadvantages [68, 69]. Few candidate/susceptibility genes have been replicated in psychiatric disorders to date. When no candidate gene is known, GxE can be measured indirectly using family-based approaches:
(i) Case–control studies using both relatives and unrelated (population-based) controls. The analytical strategy is to compare the odds ratio for the effect of the environmental factor estimated from the cases and relative controls with the odds ratio estimated from the cases and unrelated controls. The premise is that if GxE is operating you would expect to find higher odds ratios when relatives are used as controls than in the analyses using population-based controls, while you would expect equivalence of the risk across control groups if there was no GxE interaction [59]. Of course, it is quite possible that the ‘family effect’ could result from a shared environmental factor which has not been measured.
(ii) The use of proxy (surrogate) measures of genetic liability, such as family history or confirmed intermediate (endo)phenotypes (heritable biomarkers lying along the causal pathway from gene to disorder, but at a more proximal position to the gene than the manifest symptoms).
4.9.3 Case only design
When a genotype is independent of an environmental exposure within the population and the disorder is rare, GxE can be tested in cases only [70]. In this case–case design, the prevalence of the exposure in the genotype-positive cases is expected to be the same as the prevalence of the exposure in the cases without the high risk genetic variant. Thus, statistically significant departures from equal prevalence are indicative of an interaction between genotype and environmental exposure. However, independence of genotype and environmental exposure is rare, and gene–environment correlation is generally the rule
rather than the exception. Violation of this assumption of independence has been shown to produce grossly inflated type 1 errors [71]. Furthermore, this method only allows an estimation of the interaction, not of the main effects of the genotype and environment. Simulation studies demonstrate that GxE can be subsumed into the main effect of the genotype; therefore this design fails to provide a comprehensive test of the causal mechanism and should only be used with great caution.
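The comparison of exposure prevalence between genotype-positive and genotype-negative cases is usually summarised as an odds ratio in the 2 × 2 table of cases. The Python sketch below computes this case-only odds ratio with a standard Wald confidence interval. The function name and the counts are hypothetical, and the estimate can be read as a multiplicative interaction only under the assumptions named in the text (genotype and exposure independent in the source population, rare disorder).

```python
import math


def case_only_interaction(ge, g_only, e_only, neither, z=1.96):
    """Case-only odds ratio for G by E among cases, with a Wald interval.

    ge, g_only, e_only, neither: numbers of cases in each exposure group.
    """
    odds_ratio = (ge * neither) / (g_only * e_only)
    se_log_or = math.sqrt(1 / ge + 1 / g_only + 1 / e_only + 1 / neither)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, (lower, upper)


# Invented case counts: 90 with both G and E, 30 G only, 40 E only, 20 neither.
case_only_or, (ci_lower, ci_upper) = case_only_interaction(90, 30, 40, 20)
print(f"case-only OR = {case_only_or:.2f} ({ci_lower:.2f} to {ci_upper:.2f})")
```

A value of 1 corresponds to equal exposure prevalence in the two groups of cases. Because no controls are involved, nothing in this calculation can reveal a violation of the independence assumption, which is the weakness the text describes.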
4.9.4 Family designs
Sib-pair analyses are linkage techniques based on the simple premise that pairs of phenotypically concordant siblings (the affected sib-pair design) will demonstrate excess sharing of commonly inherited genomic segments, while phenotypically discordant siblings (the unaffected sib-pair design) will tend to have lower proportions of shared variants. Estimating the degree of inter-pair genetic similarity (at the region of interest, or across the genome) should help us identify the chromosomal location of candidate genes. This is achieved by estimating the sharing pattern, that is the number of alleles at a given locus that are the same (identical by descent, IBD). The expected sharing pattern in siblings approximates to z0 = 25% for no identical allele, z1 = 50% for one identical allele and z2 = 25% for two identical alleles. Departure from this pattern suggests linkage, and statistical significance can be estimated within the likelihood framework [72, 73]. Sib-pair studies can be extended to include GxE by using stratification or extensions of common multivariable models [74]. The case–parent trio design has been used to test candidate gene associations, including testing GxE interactions. This model uses the genotypes of all three members of the trio but only the environmental exposure of the case (i.e., a partial case–control design). The basic premise of the design is to stratify the genetic relative risk estimates from the case–parent trio by the environmental exposure status of the case. If there is no GxE interaction the two genetic relative risks would be expected to be the same; if an interaction is present, their ratio will be an estimate of the interactive relative risk. Currently stratified analyses are used to control for known within-family variables which may influence the risk;
however, multi-level analytical models are currently being developed to deal with these factors [75]. Multi-generational pedigrees may be useful for indirectly testing the hypothesis that there has been a change in the penetrance of a known high risk variant over time due to changes in environmental factors. However, this approach is most useful when the high risk gene variant is highly penetrant, which is required to allow familial aggregation to be adequately detected.

4.9.5 Gene–environment wide interaction studies (GEWIS)
The candidate (susceptibility) gene approach to the identification of genetic determinants of common psychiatric disorders is impeded by: • Lack of a definitive allelic architecture model for the disorders: the polygenetic model is generally considered the best approximation; however, this has been strongly contested [76]. • Substantial gaps in pathophysiological understanding of the disorders. If the multifactorial (polygenetic) model is a good approximation to the allelic architecture of common psychiatric disorders, GWAS will provide a potentially unbiased method to search the genome for causative variants of small effect [77]. However, we should bear in mind that if substantial allelic heterogeneity is present, due to rare variants or epigenetic phenomena (i.e., low allelic identity) this method will be less successful as each genetic variant will arise from an independent haplotype (set of genetic markers in DL) background, so cancelling out each other’s signal [78, 79]. Experience from GWAS in non-psychiatric conditions suggests that for some disorders as many as 30 000 cases and similar numbers of controls will be required to robustly identify highrisk genetic variants [77]. Such large-scale studies have led to the formation of consortia to coordinate the development of such methodology and carry out the studies in psychiatry [80]. To date, GWAS methods have only been used to detect main (direct) effects of single or linked (haplotypes) markers. GWAS SNPs cover more than 4/5th of the SNPs known to HapMap (http://www
EXAMINING GENE–ENVIRONMENT INTERPLAY IN PSYCHIATRIC DISORDERS
.hapmap.org) CNVs are also detected but with less reliability using current technology. However, in complex multi-factorial diseases, scanning for main effects might miss important genetic variants, especially in subgroups of individuals with specific environmental exposure interactions. Furthermore, GxE with opposite effects in groups with different exposure profiles, that is crossing interaction will not be identified, as no direct main effect will be found. Therefore, to be clinically relevant GWAS will have to be placed in an epidemiological and public health context. One way of doing this is to enrich GWAS with environmental information – a technique known as GEWIS. No GEWIS study has yet been done, due to considerable methodological and logistic challenges, however a number of analytical approaches have been proposed which attempt to deal with the substantial problems of prior probability errors which will occur when estimating main effects on 1 000 000 or more markers, which is even more likely with concomitant estimation of E exposures and GxE. GEWIS studies will therefore require new statistical approaches as the current log linear regression methods do not effectively test the global null hypothesis of a genetic variant not being associated with the disorder in any of E strata. Extensions of the interaction methods beyond the currently employed simple departures from additive (or multiplicative) joint effects will be required, most likely based on multivariate latent variable modelling techniques that can deal with ‘mega-variate’ data [81–83].
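The point that a crossing interaction leaves no main effect to detect can be illustrated with a small numerical sketch. The risks and the exposure prevalence below are invented purely for illustration, and the function names are hypothetical; G and E are assumed to be independent.

```python
# Hypothetical illustration of a crossing GxE interaction.
# All risks are invented; they are not taken from any study.

# Disease risk by genotype within each exposure stratum (assumed values).
risk = {
    "exposed":   {"carrier": 0.20, "non_carrier": 0.10},
    "unexposed": {"carrier": 0.10, "non_carrier": 0.20},
}
prevalence_exposed = 0.5  # assumed; G and E are taken to be independent


def risk_difference(risk_by_genotype):
    """Risk in variant carriers minus risk in non-carriers."""
    return risk_by_genotype["carrier"] - risk_by_genotype["non_carrier"]


def marginal_risk(genotype):
    """Risk for a genotype averaged over the exposure distribution."""
    return (prevalence_exposed * risk["exposed"][genotype]
            + (1 - prevalence_exposed) * risk["unexposed"][genotype])


effect_exposed = risk_difference(risk["exposed"])      # variant raises risk
effect_unexposed = risk_difference(risk["unexposed"])  # variant lowers risk
marginal_effect = marginal_risk("carrier") - marginal_risk("non_carrier")

print(f"Genetic effect in exposed:   {effect_exposed:+.2f}")
print(f"Genetic effect in unexposed: {effect_unexposed:+.2f}")
print(f"Marginal (main) effect:      {marginal_effect:+.2f}")
```

With these assumed values the variant raises risk in the exposed and lowers it by the same amount in the unexposed, so a scan that ignores exposure sees essentially no genetic effect at all.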
4.10 Threats to the validity of epidemiological GxE studies
The study designs discussed in this chapter are observational (epidemiological) in nature: the genotype, environmental risk exposure and the pathology are studied as they occur within the population. Their findings therefore need to be interpreted in the light of potential bias (systematic errors) and confounding (mixing the effect of extraneous variables with the effects of the exposures under study). Such issues can be reviewed in any comprehensive textbook of epidemiology [84]. There are, however, specific threats to the validity of GxE epidemiological studies, including unmeasured GEr, population stratification effects, and sample size and power issues, which we will discuss here.
4.10.1 Confounding by gene–environment correlation
GxE analyses are based on the assumption that the environmental and genetic exposures are independent. However, it is clear that a number of the exposures we conceptualise as ‘environmental’ have a significant genetic contribution; for example, stressful life events and substance use both have demonstrable gene–family-environment covariance [85]. Case-only, partial case–control and family methods are sensitive to even statistically non-significant GEr, while case–control and cohort designs are slightly more robust [86, 87]. A number of strategies can be used, preferably in combination, to reduce GEr confounding.
1 Careful selection and measurement of environmental exposures. The retrospective measurement of environmental exposures using self-report questionnaires is prone to recall bias, as variation in recall is likely to be behaviourally mediated by an individual’s personality, potentially leading to non-causal GEr. Therefore, in case–control studies that rely on retrospective measures, multiple informant sources and contemporaneous records should be sought.
2 Careful selection and measurement of proxy (surrogate) measures of genetic risk. Proxy measures of genetic liability, such as family history or intermediate phenotypes, can be used with care in omnibus models for GxE (models that assume a single unified interaction between the unidentified genes). These are clearly related to genetic liability. However, while dimensions of personality and life events, such as adversity, are in part genetically determined, it would be invalid to substitute such surrogates as genetic risk markers, as they are influenced by both genetic and environmental factors.
3 Stratification analyses are undertaken to determine whether the relationship between environmental exposures and the disorder is modified by the genotype variants, that is, stratification by genotype subgroup. This can also be performed using genotype-specific disease–exposure odds ratios; inter-strata differences are compared using likelihood ratio tests of homogeneity. Theoretically, GEr can be controlled for by enforcing independence of G and E in log-linear models; however, such tests generally have low power to detect meaningful departures from independence [88].
4 Compare the prevalence of exposure in relatives with that in non-related controls. These analyses are based on the premise that rates will be higher in relative controls than in population controls if there is a genetic effect uncontrolled for shared environmental influences, suggesting GEr. However, if there is no difference in exposure prevalence across the two control groups, GEr is less likely.
Of course, the most effective way to control for GEr in GxE studies is to use an experimental paradigm. While it would be unethical to allocate individuals to adverse environmental exposures, it may be possible to develop ‘harmless’ analogues of environmental adversity. An elegant example of this approach used standard emotional stimuli as a surrogate for vulnerability to stress in a functional magnetic resonance imaging study [89]. GxE strategies have been shown to be effective in the assessment of pharmacological interventions (pharmacogenetics) [90].
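The genotype-stratified analysis described in strategy 3 above can be sketched as follows. The counts are invented for illustration and the function names are hypothetical. Note also that, to stay within the Python standard library, this sketch compares the two strata with Woolf's chi-square test of homogeneity of the log odds ratios rather than the likelihood ratio test mentioned in the text; with two strata the statistic has one degree of freedom.

```python
import math


def odds_ratio(a, b, c, d):
    """Disease-exposure odds ratio from a 2x2 table.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls.
    """
    return (a * d) / (b * c)


def log_or_variance(a, b, c, d):
    """Large-sample variance of the log odds ratio (Woolf)."""
    return 1 / a + 1 / b + 1 / c + 1 / d


def woolf_homogeneity(tables):
    """Woolf chi-square for homogeneity of odds ratios across strata."""
    log_ors = [math.log(odds_ratio(*t)) for t in tables]
    weights = [1 / log_or_variance(*t) for t in tables]
    pooled = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    chi2 = sum(w * (l - pooled) ** 2 for w, l in zip(weights, log_ors))
    return chi2, len(tables) - 1


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))


# Invented counts: (exposed cases, unexposed cases,
#                   exposed controls, unexposed controls)
carriers = (40, 60, 20, 80)
non_carriers = (30, 70, 30, 70)

or_carriers = odds_ratio(*carriers)
or_non_carriers = odds_ratio(*non_carriers)
chi2, df = woolf_homogeneity([carriers, non_carriers])
p = p_value_1df(chi2)  # valid here because df == 1

print(f"OR in carriers:     {or_carriers:.2f}")
print(f"OR in non-carriers: {or_non_carriers:.2f}")
print(f"Homogeneity chi-square = {chi2:.2f} on {df} df, p = {p:.3f}")
```

A small p-value here would suggest that the exposure odds ratio differs by genotype, that is, an interaction on the multiplicative scale.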
4.10.2 Population stratification
When unrelated controls are used in case–control and cohort GxE designs, the studies are potentially susceptible to confounding by ethnicity (migration), which in the genetic literature is termed population stratification confounding. Population stratification describes the gradients observed in gene frequencies within broad ethnic groupings; that is, there are potentially unmeasured genetic subpopulations. If these subgroups are unequally distributed between cases and controls, biased estimates of effect are possible if subpopulation genetic variations are correlated with unmeasured environmental risk factors. Potentially more problematic is the possibility of statistical over-dispersion (i.e., greater variability observed in the data than would be expected based on a given simple statistical model) when subpopulations are present. Simulation studies suggest that even small amounts of stratification have significant consequences in large samples [91], though some commentators have disputed this claim [92]. While family-based designs overcome the potential for confounding by population stratification, they are subject to other problems as discussed above. Population stratification can, however, be effectively handled using genomic control approaches [61, 93].
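As a rough sketch of the genomic control idea, the over-dispersion can be summarised by an inflation factor: the median of the observed 1-df chi-square association statistics divided by the median expected under the null (about 0.455). Each test statistic is then divided by that factor. The statistics below are invented, the function names are hypothetical, and flooring the factor at 1 is a common convention assumed here rather than something stated in the text.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to its null expectation."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Return the inflation factor and the deflated test statistics.

    The factor is floored at 1 so that statistics are never inflated
    (an assumed convention for this sketch).
    """
    lam = max(1.0, inflation_factor(chi2_stats))
    return lam, [x / lam for x in chi2_stats]


# Invented 1-df chi-square statistics for ten markers; the last one is
# the candidate association of interest.
observed = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9, 1.2, 1.8, 2.5, 12.0]

lam, adjusted = genomic_control(observed)
print(f"Inflation factor: {lam:.2f}")
print(f"Candidate statistic before: {observed[-1]:.2f}, after: {adjusted[-1]:.2f}")
```

In practice the factor would be estimated from a large panel of markers thought to be unrelated to the disorder; ten values are used here only to keep the arithmetic visible.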
4.10.3 Power, sample size and issues of multiple comparisons
Power and sample size calculations are critical in the design of GxE studies and will vary according to the design used. However, it should be stressed that sample size is only one factor in determining power. The statistical power of any GxE study will also depend on the strength of the interaction effect, the variability of the environmental exposure, the frequency of the genetic variant in the population and the accuracy (reliability and validity) of the measured gene, environment and disorder (or pathological outcome) [94]. Uher [32] has demonstrated the impact of these factors in simulation studies. Strong effects were evident in the presence of variation in the reliability of environmental and pathological measures. A decrease of just 0.2 in the reliability of both measures equated to losing half of the sample size [32]. Therefore, smaller studies with high-quality, accurate and discriminative measures of environmental and pathological outcomes may be advantageous in many circumstances [95]. Simulation studies clearly show how the frequency and variability of the environmental exposures influence study power. One potential approach to handling exposure variation effects is to sample subjects from both extremes of the exposure distribution (very low and very high doses). Simulation studies suggest this effectively increases the power for detecting GxE, especially when the exposure is measured on a quantitative scale, though this approach requires further simulation testing [96].
Sample size calculations estimate the minimum number of subjects necessary to provide sufficient power to detect a GxE, if it is truly present. GxE, as a higher-order effect, requires larger sample sizes than designs which only measure main effects. A number of different statistical methods have been described that estimate the sample size requirements to detect GxE [97–99].
How to handle multiple comparisons is a major problem facing genetic epidemiology generally and GxE investigators in particular. The problem of potential type 1 errors (false positives) has traditionally been handled by making corrections for multiple testing, based on the probability of making at least one spurious positive inference; however, such techniques are conservative, as they do not take account of linkage between multiple genetic markers. This has led to the development of procedures based on the false discovery rate (FDR). Many such procedures have recently been published, and they preserve greater power to detect true positive interactions. There is no consensus on how best to handle issues of multiple testing, with other techniques based on permutation methods and sequential multiple-decision procedures currently being developed [35]. While false positive findings in GxE will occur, perhaps a much commoner problem is actually type 2 errors (false negatives), that is, accepting the null hypothesis as a result of the inherently low power of our statistical models [99, 100].
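The contrast between a correction based on the probability of at least one false positive and an FDR-based procedure can be sketched with invented p-values. The Bonferroni correction and the Benjamini-Hochberg step-up procedure are used here as familiar representatives of the two families; the text does not single out either method, and the p-values and function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m, which controls the chance of any false positive."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value is <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Invented p-values for ten interaction tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.50, 0.62, 0.81, 0.95]

n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(f"Rejected by Bonferroni:         {n_bonferroni}")
print(f"Rejected by Benjamini-Hochberg: {n_bh}")
```

On these values the FDR procedure rejects more hypotheses than the Bonferroni correction, which is the sense in which FDR-based methods preserve power.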
4.11 Epigenetic mechanisms
As well as genes influencing the exposure and susceptibility to environmental exposures, through GEr and GxE, the reverse association is also possible, and has been postulated in a number of psychiatric disorders, including depression, schizophrenia, substance dependence and developmental disorders. Epigenetic mechanisms occur when environmental factors impact on the DNA sequence (causing de novo mutations) or act through changes in DNA methylation and chromatin structure (causing altered gene expression through epimutations), both globally and at the promoters of candidate gene sites. For example, epigenetic chromatin remodelling at the brain-derived neurotrophic factor (BDNF) promoter site is associated with neuronal activity, seizures, chronic stress, cocaine addiction and Rett’s syndrome; remodelling at the reelin promoter may play a role in mouse models of schizophrenia [101].
Although research on DNA methylation as an epigenetic mechanism underlying GxE is only in its early stages, the preliminary results are promising. Animal studies have shown that early maternal behaviour predicts the offspring’s stress sensitivity through altered DNA methylation in some key neuronal receptor genes that are involved in the stress response [102]. Environmentally induced epigenetic mechanisms may explain a range of epidemiological findings, including the age of onset curves, monozygotic twin discordance and the gender differences observed in psychiatric disorders. Methodologies designed to investigate such epigenetic processes are currently being developed and are likely to further elucidate the gene–environment interplay in psychiatric disorders [103].
References
[1] Thomas, D.C. (2000) Genetic epidemiology with a capital ‘E’. Genet. Epidemiol., 19, 289–300.
[2] Thomas, D.C. (2004) Statistical Methods in Genetic Epidemiology, Oxford University Press, New York.
[3] Morton, N.E., Rao, D. and Lalouel, J.-M. (1983) Methods in Genetic Epidemiology, S. Karger, Berlin.
[4] Khoury, M.J., Beaty, T.H. and Cohen, B.H. (1993) Genetic Epidemiology, Oxford University Press, Oxford.
[5] Susser, E. and Susser, M. (1989) Familial aggregation studies. A note on their epidemiological properties. Am. J. Epidemiol., 129, 23–30.
[6] Sham, P. (1996) Genetic epidemiology. Br. Med. Bull., 52, 408–433.
[7] Morgan, H.D., Sutherland, H.E., Martin, D.I.K. and Whitelaw, E. (1999) Epigenetic inheritance at the agouti locus in the mouse. Nat. Genet., 23, 314–318.
[8] Crow, T.J. (2007) How and why genetic linkage has not solved the problem of psychosis: review and hypothesis. Am. J. Psychiatr., 164 (1), 13–21.
[9] Crow, T. (2007) Genetic hypotheses for schizophrenia. Br. J. Psychiatr., 191 (2), 180.
[10] Faraone, S.V. and Tsuang, M.T. (1985) Quantitative models of the genetic transmission of schizophrenia. Psychol. Bull., 98, 41–66.
[11] Faraone, S.V., Kremen, W.S. and Tsuang, M.T. (1990) Genetic transmission of major affective disorders: qualitative models and linkage analyses. Psychol. Bull., 108, 109–127.
[12] Trikalinos, T.A., Karvouni, A., Zintzaras, E. et al. (2005) A heterogeneity-based genome search meta-analysis for autism-spectrum disorders. Mol. Psychiatr., 11 (1), 29–36.
[13] McQueen, M.B., Devlin, B., Faraone, S.V. et al. (2005) Combined analysis from eleven linkage studies of bipolar disorder provides strong evidence of susceptibility loci on chromosomes 6q and 8q. Am. J. Hum. Genet., 77 (4), 582–595.
[14] Hauser, E.R., Boehnke, M., Guo, S.W. and Risch, N. (1996) Affected sib-pair interval mapping and exclusion of complex genetic traits: sampling considerations. Genet. Epidemiol., 13, 117–137.
[15] Allen, N.C., Bagade, S., McQueen, M.B. et al. (2008) Systematic meta-analyses and field synopsis of genetic association studies in schizophrenia: the SzGene database. Nat. Genet., 40 (7), 827–834.
[16] Risch, N. and Teng, J. (1996) The relative power of family-based and case–control designs for linkage disequilibrium studies of complex human diseases. Genome Res., 8, 1273–1288.
[17] Hirschhorn, J.N. and Altshuler, D. (2002) Once and again – issues surrounding replication in genetic association studies. J. Clin. Endocrinol. Metab., 87, 4438–4441.
[18] Hardy, J. and Singleton, A. (2009) Genomewide association studies and human disease. N. Engl. J. Med., 360, 1759–1768.
[19] Jaffee, S.R. and Price, T.S. (2007) Gene–environment correlations: a review of the evidence and implications for prevention of mental illness. Mol. Psychiatr., 12, 432–442.
[20] Hardy, J. and Singleton, A. (2009) Genomewide association studies and human disease. N. Engl. J. Med., 360, 1759–1768.
[21] Jaffee, S.R. and Price, T.S. (2007) Gene–environment correlations: a review of the evidence and implications for prevention of mental illness. Mol. Psychiatr., 12, 432–442.
[22] Kendler, K.S. (1996) Parenting: a genetic epidemiological perspective. Am. J. Psychiatr., 153, 11–20.
[23] Dawkins, R. (1982) The Extended Phenotype. The Gene As the Unit of Selection, Oxford University Press, Oxford.
[24] Kendler, K.S. and Greenspan, R.J. (2006) The nature of genetic influences on behavior: lessons from ‘Simpler’ organisms. Am. J. Psychiatr., 163 (10), 1683–1694.
[25] Plomin, R., DeFries, J.C. and Loehlin, J.C. (1977) Genotype-environment interaction and correlation in the analysis of human behaviour. Psychol. Bull., 84, 309–322.
[26] Kendler, K.S., Gardner, C.O. and Prescott, C.A. (2003) Personality and the experience of environmental adversity. Psychol. Med., 33, 1193–1202.
[27] Spinath, F.M. and O’Connor, T. (2003) A behavioural genetic study of the overlap between personality and parenting. J. Pers., 71, 785–808.
[28] Kendler, K.S. and Baker, J.H. (2007) Genetic influences on measures of the environment: a systematic review. Psychol. Med., 37, 615–626.
[29] Rutter, M. (2006) Genes and Behaviour: Nature–Nurture Interplay Explained, Blackwell, Oxford.
[30] Rutter, M. (2008) Biological implications of gene–environment interaction. J. Abnorm. Child Psychol., 36, 969–975.
[31] Rutter, M. (2006) Implications of resilience concepts for scientific understanding. Ann. N. Y. Acad. Sci., 1094, 1–12.
[32] Uher, R. (2008) Gene–environment interaction: overcoming methodological challenges, in Genetic Effects on Environmental Vulnerability to Disease (ed. M. Rutter), John Wiley & Sons, Ltd, Chichester, pp. 13–30.
[33] Moolgavkar, S.H. (2004) Fifty years of the multistage model: remarks on a landmark paper. Int. J. Epidemiol., 33, 1182–1183.
[34] Rothman, K.J., Greenland, S., Poole, C. and Lash, T.L. (2008) Causation and causal inference, in Modern Epidemiology (eds K.J. Rothman and T.L. Lash), Lippincott Williams & Wilkins, Philadelphia, pp. 5–31.
[35] North, K.E. and Martin, L.J. (2008) The importance of gene–environment interaction: implications for social scientists. Sociol. Methods Res., 37 (2), 164–200.
[36] Neale, B.M. and Sham, P.C. (2004) The future of association studies. Gene-based analysis and replication. Am. J. Hum. Genet., 75, 353–362.
[37] Zhang, K., Qin, Z.S., Liu, J.S. et al. (2004) Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res., 14 (5), 908–916.
[38] Pompanon, F., Bonin, A., Bellemain, E. and Taberlet, P. (2006) Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet., 6, 847–859.
[39] Gottesman, I.I. and Gould, T.D. (2003) The endophenotype concept in psychiatry: etymology and strategic intentions. Am. J. Psychiatr., 160 (4), 636–645.
[40] Brown, G.W. and Harris, T.O. (1978) Social Origins of Depression. A Study of Psychiatric Disorder in Women, Routledge, London.
[41] Myin-Germeys, I., Oorschot, M., Collip, D. et al. (2009) Experience sampling research in psychopathology: opening the black box of daily life. Psychol. Med., 39, 1533–1547.
[42] Susser, M. and Susser, E. (1996) Choosing a future for epidemiology: I. Eras and paradigms. Am. J. Public Health, 86 (5), 668–673.
[43] Susser, M. and Susser, E. (1996) Choosing a future for epidemiology: II. From black box to Chinese boxes and eco-epidemiology. Am. J. Public Health, 86 (5), 674–677.
[44] Allardyce, J., Gaebel, W., Zielasek, J. and van Os, J. (2007) Deconstructing Psychosis Conference February 2006: the validity of schizophrenia and alternative approaches to the classification of psychosis. Schizophr. Bull., 33 (4), 863–867.
[45] Kraemer, H.C., Noda, A. and O’Hara, R. (2004) Categorical versus dimensional approaches to diagnosis: methodological challenges. J. Psychiatr. Res., 38, 17–25.
[46] Risch, N., Herrell, R., Lehner, T. et al. (2009) Interaction between the serotonin transporter gene (5-HTTLPR), stressful life events, and risk of depression: a meta-analysis. J. Am. Med. Assoc., 301 (23), 2462–2471.
[47] Munafò, M.R., Brown, S.M. and Hariri, A.R. (2008) Serotonin transporter (5-HTTLPR) genotype and amygdala activation: a meta-analysis. Biol. Psychiatr., 63 (9), 852–857.
[48] Ottman, R. (1996) Theoretical epidemiology. Gene–environment interaction: definitions and study designs. Prev. Med., 25, 764–770.
[49] Greenland, S., Lash, T.L. and Rothman, K.J. (2008) Concepts of interaction, in Modern Epidemiology, 2nd edn (eds K.J. Rothman, S. Greenland and T.L. Lash), Lippincott Williams & Wilkins, Philadelphia, pp. 71–86.
[50] Rutter, M. (2008) Whither gene–environment interactions? in Genetic Effects on Environmental Vulnerability to Disease (ed. M. Rutter), John Wiley & Sons, Ltd, Chichester, pp. 1–12.
[51] Darroch, J. (1997) Biological synergism and parallelism. Am. J. Epidemiol., 145, 661–668.
[52] Darroch, J.N. and Borkent, M. (1994) Synergism, attributable risk and interaction for two binary exposure factors. Biometrika, 81, 259–270.
[53] Tienari, P., Wynne, L.C. and Moring, J. (1994) The Finnish adoption family study of schizophrenia. Implications for family research. Br. J. Psychiatr., 23, 20–26.
[54] Moffitt, T.E., Caspi, A. and Rutter, M. (2006) Measured gene-environment interaction in psychopathology. Perspect. Psychol. Sci., 1 (1), 5–27.
[55] Siemiatycki, J. and Thomas, D.C. (1981) Biological models and statistical interactions; an example from multistage carcinogenesis. Int. J. Epidemiol., 10, 383–387.
[56] Knol, M.J., Egger, M., Scott, P. et al. (2009) When one depends on the other: reporting of interaction in case–control and cohort studies. Epidemiology, 20 (2), 161–166. doi: 10.1097/EDE.0b013e31818f6651
[57] Botto, L.D. and Khoury, M.J. (2001) Commentary: facing the challenge of gene–environment interaction: the two-by-four table and beyond. Am. J. Epidemiol., 153 (10), 1016–1020.
[58] Knol, M.J., van der Tweel, I., Grobbee, D.E. et al. (2007) Estimating interaction on an additive scale between continuous determinants in a logistic regression model. Int. J. Epidemiol., 36 (5), 1111–1118.
[59] Andrieu, N. and Goldstein, A.M. (1998) Epidemiologic and genetic approaches in the study of gene–environment interaction: an overview of available methods. Epidemiol. Rev., 20 (2), 137–147.
[60] Hunter, D.J. (2005) Gene–environment interactions in human diseases. Nat. Rev. Genet., 6 (4), 287–298.
[61] Devlin, B. (2001) Genomic control, a new approach to genetic-based association studies. Theor. Popul. Biol., 60, 155–165.
[62] Tung, L., Gordon, D. and Finch, S.J. (2007) The impact of genotype misclassification errors on the power to detect a gene–environment interaction using Cox proportional hazards modeling. Hum. Hered., 63, 101–110.
[63] Li, R. and Chambless, L. (2007) Test for additive interaction in proportional hazards models. Ann. Epidemiol., 17 (3), 227–236.
[64] Garcia-Closas, M., Thompson, W.D. and Robins, J.M. (1998) Differential misclassification and the assessment of gene–environment interactions in case–control studies. Am. J. Epidemiol., 147 (5), 426–433.
[65] Morimoto, L.M., White, E. and Newcomb, P.A. (2003) Selection bias in the assessment of gene–environment interaction in case–control studies. Am. J. Epidemiol., 158 (3), 259–263.
[66] Hosmer, D.W. and Lemeshow, S. (1992) Confidence interval estimation of interaction. Epidemiology, 3 (5), 452–456.
[67] Skrondal, A. (2003) Interaction as departure from additivity in case–control studies: a cautionary note. Am. J. Epidemiol., 158 (3), 251–258.
[68] Teng, J. and Risch, N. (1999) The relative power of family-based and case–control designs for linkage disequilibrium studies of complex diseases. II, individual genotyping. Genome Res., 9, 234–241.
[69] Gladen, B.C. (1996) Matched-pair case–control studies when risk factors are correlated within the pairs. Int. J. Epidemiol., 25 (2), 420–425.
[70] Khoury, M.J. and Flanders, W.D. (1996) Nontraditional epidemiologic approaches in the analysis of gene–environment interaction: case–control studies with no controls. Am. J. Epidemiol., 144 (3), 207–213.
[71] Albert, P.S., Ratnasinghe, D., Tangrea, J. and Wacholder, S. (2001) Limitations of the case-only design for identifying gene–environment interactions. Am. J. Epidemiol., 154 (8), 687–693.
[72] Kerber, R.A., Amos, C.I., Yeap, B.Y., Finkelstein, D.M. and Thomas, D.C. (2008) Design considerations in sib-pair study of linkage for susceptibility loci in cancer. BMC Med. Genet., 9, 64.
[73] Poznik, G.D., Adamska, K., Xu, X. et al. (2006) A novel framework for sib pair linkage analysis. Am. J. Hum. Genet., 78, 222–230.
[74] Gauderman, W.J., Morrison, J.L., Siegmund, K. and Thomas, D.C. (1999) A joint test of linkage and gene x environment interaction, with affected sib pairs. Genet. Epidemiol., 17 (Suppl. 1), s563–s568.
[75] Haines, J.L. and Pericak-Vance, M.A. (2006) Genetic Analysis of Complex Disease, 2nd edn, John Wiley & Sons, Inc., New York.
[76] Crow, T.J. (2008) The emperors of the schizophrenia polygene have no clothes. Psychol. Med., 38 (12), 1681–1685.
[77] PGCC Committee (2009) Genomewide association studies: history, rationale, and prospects for psychiatric disorders. Am. J. Psychiatr., 166 (5), 540–556.
[78] Lander, E.S. (1996) The new genomics: global views of biology. Science, 274 (5287), 536–539.
[79] Chakravarti, A. (1999) Population genetics: making sense out of sequence. Nat. Genet., 21 (Suppl. 1), 56–60.
[80] PGCC Committee. Available from https://pgc.unc.edu/index.php.
[81] Khoury, M.J. and Wacholder, S. (2009) Invited commentary: from genome-wide association studies to gene–environment-wide interaction studies: challenges and opportunities. Am. J. Epidemiol., 169 (2), 227–230.
[82] Murcray, C.E., Lewinger, J.P. and Gauderman, W.J. (2009) Gene–environment interaction in genomewide association studies. Am. J. Epidemiol., 169 (2), 219–226.
[83] Mukherjee, B. and Chatterjee, N. (2008) Exploiting gene–environment independence for analysis of case–control studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics, 64 (3), 685–694.
[84] Rothman, K.J., Greenland, S. and Lash, T.L. (2008) Modern Epidemiology, 3rd edn, Lippincott Williams & Wilkins, Philadelphia.
[85] Kendler, K.S. and Prescott, C.A. (2006) Genes, Environment and Psychopathology: Understanding the Causes of Psychiatric and Substance Use Disorders, Guilford Press, New York.
[86] Liu, X., Fallin, M.D. and Kao, W.H. (2004) Genetic dissection methods: designs used for tests of gene–environment interaction. Curr. Opin. Genet. Dev., 14, 241–245.
[87] Lindstrom, S., Yen, Y.-C., Spiegelman, D. and Kraft, P. (2009) The impact of gene–environment dependence and misclassification in genetic association studies incorporating gene–environment interaction. Hum. Hered., 68, 171–181.
[88] Etheredge, A.J., Christensen, K., del Junco, D. et al. (2005) Evaluation of two methods for assessing gene–environment interaction using data from the Danish case–control study of facial clefts. Birth Defects Res. (Part A), 73, 541–546.
[89] Canli, T. and Lesch, K.-P. (2007) Long story short: the serotonin transporter in emotion regulation and social cognition. Nat. Neurosci., 10 (9), 1103–1109.
[90] Costa, L.G. and Eaton, D.L. (2006) Gene–Environment Interactions, 1st edn, John Wiley & Sons, Inc., New York.
[91] Hinds, D.A., Stokowski, R.P., Patil, N. et al. (2004) Matching strategies for genetic association studies in structured populations. Am. J. Hum. Genet., 74 (2), 317–325.
[92] Wacholder, S., Rothman, N. and Caporaso, N. (2002) Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol. Biomar. Prev., 11 (6), 513–520.
[93] Devlin, B., Bacanu, S.A. and Roeder, K. (2004) Genomic control to the extreme. Nat. Genet., 36, 1129–1130.
[94] Luan, J.A., Wong, M.Y., Day, N.E. and Wareham, N.J. (2001) Sample size determination for studies of gene–environment interaction. Int. J. Epidemiol., 30 (5), 1035–1040.
[95] Wong, M.Y., Day, N.E., Luan, J.A. et al. (2003) The detection of gene–environment interaction for continuous traits: should we deal with measurement error by bigger studies or better measurement? Int. J. Epidemiol., 32 (1), 51–57.
[96] Boks, M.P.M., Schipper, M., Schubart, C.D. et al. (2007) Investigating gene–environment interaction in complex diseases: increasing power by selective sampling for environmental exposure. Int. J. Epidemiol., 36 (6), 1363–1369.
[97] Foppa, I. and Spiegelman, D. (1997) Power and sample size calculations for case–control studies of gene–environment interactions with a polytomous exposure variable. Am. J. Epidemiol., 146 (7), 596–604.
[98] Hwang, S.-J., Beaty, T.H., Liang, K.-Y. et al. (1994) Minimum sample size estimation to detect gene–environment interaction in case–control designs. Am. J. Epidemiol., 140 (11), 1029–1037.
[99] Lubin, J.H. and Gail, M.H. (1990) On power and sample size for studying features of the relative odds of disease. Am. J. Epidemiol., 131 (3), 552–566.
[100] Eaves, L.J. (2006) Genotype x environment interaction in psychopathology: fact or artifact? Twin Res. Hum. Genet., 9, 1–8.
[101] Tsankova, N., Renthal, W., Kumar, A. and Nestler, E.J. (2007) Epigenetic regulation in psychiatric disorders. Nat. Rev. Neurosci., 8 (5), 355–367.
[102] Weaver, I.C.G., Champagne, F.A., D’Alessio, A.C. et al. (2004) Epigenetic programming by maternal behavior. Nat. Neurosci., 7, 847–854.
[103] Oh, G. and Petronis, A. (2008) Environmental studies of schizophrenia through the prism of epigenetics. Schizophr. Bull., 34 (6), 1122–1129.
5 Reliability
Patrick E. Shrout
Department of Psychology, New York University, NY, USA
5.1 Introduction
In psychiatric epidemiology, assessment of mental conditions and of risks for psychiatric disorder relies heavily on information provided by patients (or survey respondents) and by informants who are close to the patient/respondent. How good is the information provided by these people, and how good are the assessment inferences that we make on the basis of this information? The quality of the assessment in psychiatry and epidemiology is typically characterised by the reliability and validity of the measure. Reliability is the degree to which a measurement is reproducible and not affected by transient assessment noise. Validity is the degree to which the measurement is useful. Although validity is the ultimate criterion by which to judge a measure, we know that a measure will not be useful if it is dominated by measurement noise. This means that reliability is a necessary condition for validity, but it is not sufficient to guarantee validity. Even though reliability is only an intermediate step towards quality measurement, it is often methodologically interesting because it is a problem that can usually be fixed. Reliability can be improved by structuring and standardising the assessment procedure, by improving the training of both the respondents and those carrying out the assessment and by averaging replicate measurements. If problems of unreliability are not addressed, then subsequent problems of validity are intractable. This is why reliability was given so much attention in developing Versions III and IV of the Diagnostic and Statistical Manual of the American Psychiatric Association [1, 2].
Epidemiologists must attend to the reliability of both diagnostic measures and risk measures. Two features of psychiatric epidemiology make reliability more of an enduring problem in this field than in others. One feature is the previously mentioned dependence on information provided by respondents or informants. Respondent reports present many opportunities for noise to enter the recorded data: the understanding of the question, the recall and reporting of the answer and the coding and entry of the data. The other feature is the epidemiologists’ search for novel populations and risk groups that might provide clues to the aetiology of mental disorders. New populations require new assessments of reliability, since populations vary in language, literacy and cultural expression of disorders. As we will see, the variability of the trait under study in the new population also affects the reliability of measures.
5.2 The reliability coefficient
Consider a single measurement procedure. Respondents are sampled from a specific population, measured in some way and assigned a numerical value that is represented by the variable X. If the characteristic being measured is qualitative, such as having a certain diagnosis, then the variable X might be defined to be binary, that is, X = 1 if the respondent has the characteristic, and X = 0 otherwise. If the characteristic is quantitative, such as severity of illness or exposure, then X might be defined to take some well-specified numerical score. The variance of X, $\sigma_X^2$, is a population parameter that describes how much the measurements differ from person to person in the population being studied. In some populations $\sigma_X^2$ might be relatively small, while in other populations the variance might be large. Small variance implies that the measurement distinction is subtle in the population, while large variation implies the opposite. In populations with small overall variation in X, any measurement error may be quite serious.
According to classic reliability theory, it is useful to decompose $\sigma_X^2$ into two components, $\sigma_X^2 = \sigma_E^2 + \sigma_T^2$, where $\sigma_E^2$ is variance due to measurement noise and $\sigma_T^2$ is variance due to systematic differences between persons being measured. We will discuss how these two components are estimated later. This decomposition implies that random measurement noise increases the total measurement variation. If measurement errors can be eliminated, then the error variance, $\sigma_E^2$, goes to zero and the total variance shrinks to $\sigma_T^2$. If errors dominate the measurement, then the majority of $\sigma_X^2$ may be attributable to $\sigma_E^2$, even if there is systematic variation between persons that is of interest.
In its purest form, the reliability coefficient, $R_X$, is a ratio of the population parameters $\sigma_T^2$ and $\sigma_X^2$:
\[ R_X = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \qquad (5.1) \]
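The decomposition behind Equation 5.1 can be illustrated with a small simulation. The sketch below is not part of the original text: it assumes normally distributed systematic scores and errors, and the function names and variance values are hypothetical choices made only for illustration. Under these assumptions, the correlation between two replicate measurements that share the same systematic component should be close to the reliability coefficient.

```python
import random


def reliability(var_true, var_error):
    """Equation 5.1: share of the total variance that is systematic."""
    return var_true / (var_true + var_error)


def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (ss_a * ss_b) ** 0.5


def simulate_test_retest(var_true, var_error, n_subjects=20000, seed=1):
    """Correlate two replicates X = T + E that share T but have independent E."""
    rng = random.Random(seed)
    sd_true, sd_error = var_true ** 0.5, var_error ** 0.5
    first, second = [], []
    for _ in range(n_subjects):
        t = rng.gauss(0.0, sd_true)                  # systematic component T
        first.append(t + rng.gauss(0.0, sd_error))   # first replicate
        second.append(t + rng.gauss(0.0, sd_error))  # second replicate
    return pearson(first, second)


# Hypothetical variance components: systematic variance 4, error variance 1.
print(reliability(4.0, 1.0))           # value implied by Equation 5.1
print(simulate_test_retest(4.0, 1.0))  # simulated estimate, close to the value above
```

Increasing the error variance while holding the systematic variance fixed lowers both numbers together, which is the behaviour the decomposition predicts.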
$R_X$ varies from zero (X is due entirely to unsystematic stochastic processes) to unity (X is due entirely to systematic individual differences). It can be thought of as the proportion of $\sigma_X^2$ that represents genuine, replicable differences in subjects. It turns out to be a useful quantity in statistical analyses as well. For example, it can be shown that the correlation between X and another variable Y will get smaller as the reliability coefficient of either variable gets smaller [3, 4]. Knowledge of reliability can also be used to adjust for bias [5] and to obtain more powerful tests of group differences [6].
How do we evaluate different values of $R_X$? If we know that a measure truly has a reliability of 0.50, then we know that only half its variance is systematic. That may not be what we hope for, but it might be good enough for some preliminary studies. For more definitive studies, we should aim to have reliability above 0.80. To provide some interpretive guidelines, Shrout [7] recommends the following characterisations of reliability values: (0.00, 0.10), virtually no reliability; (0.11, 0.40), slight; (0.41, 0.60), fair; (0.61, 0.80), moderate; (0.81, 1.0), substantial reliability.¹ For a complete development of $R_X$ and its implications, see Lord and Novick [8] or Dunn [9].
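The effect of unreliability on correlations can be made concrete with the classical attenuation relation, in which the expected observed correlation equals the correlation between the systematic components multiplied by the square root of the product of the two reliabilities. The sketch below assumes that the measurement errors in X and Y are uncorrelated with each other and with the systematic components; the function names and input values are hypothetical.

```python
import math


def attenuated_correlation(true_corr, rel_x, rel_y):
    """Expected observed correlation given the reliabilities of X and Y."""
    return true_corr * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(observed_corr, rel_x, rel_y):
    """Reverse the relation to adjust an observed correlation for unreliability."""
    return observed_corr / math.sqrt(rel_x * rel_y)


# Hypothetical example: the same underlying correlation seen through
# measures of decreasing reliability.
for rel in (1.0, 0.8, 0.5):
    print(rel, round(attenuated_correlation(0.5, rel, rel), 3))
```

The second function illustrates one way in which knowledge of reliability might be used to adjust for bias, under the same assumptions.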
5.3 Designs for estimating reliability
To estimate $R_X$ we need to define what is meant by systematic variation of X. Classical psychometric theory defines this hypothetically. Suppose that a subject is selected and is measured once to produce the score $X_1$. Now suppose that it is possible to make the measurement over and over again without affecting the subject, and without recall of the previous $X_j$ values (where j indexes each replicate measure). Classical measurement theory defines the systematic part of X to be the average of all of these hypothetically infinite measurements of the selected subject. This systematic component of the measurement is written as T = E(X), which is interpreted as the expected average of the many replications of X. Note that if the measurement were height or weight, then it would actually be possible to take many repeated measurements of this sort.
In psychiatric epidemiology, reliability is estimated by approaching the hypothetical ideal with approximately replicate measurements. If there is virtually no variation across replications of X, then we infer that $\sigma_E^2$ is small in magnitude, and that reliability is very good. If variation across replications is observed, then the magnitude of the within-subject variation is compared to that of the between-subject variation using the definition of $R_X$ in Equation 5.1 above.
The most common replication design calls for making the X measurement at two points in time (the test–retest design). Variation in the X values across replications and across respondents is used to estimate $\sigma_E^2$, $\sigma_T^2$ and $\sigma_X^2$, and these can be used to estimate $R_X$. The formal equations for these estimates are presented in a later section. Although theoretically and intuitively appealing, the test–retest design falls short of the hypothetical ideal in several ways.
¹ In setting standards for reliability, however, we must be aware that estimates of reliability may be smaller than the actual reliability because of systematic bias, which is discussed later.
On one hand, the second measurement is often affected by systematic
biological, psychological and social changes in the respondent. These systematic changes make the estimate of σE2 appear larger than it would have been at a single measurement instance. When legitimate change is included with error, the estimate of the reliability of the first assessment is too small. On the other hand, if the respondents remember their original responses, and then try to be ‘good’ by reporting the same thing, then the reliability estimate may be too large. Methodologists who address these opposing biases recommend that the second assessments be carried out after a long enough period to reduce memory artefacts, but promptly enough to reduce the probability of systematic changes. Recommendations of how long the period should be are more products of opinion than science, but 2 weeks often seems to work well. Test–retest designs can be used with the whole range of measures made in psychiatric epidemiology. Interviews, questionnaires, ratings and physical measurements can all be repeated after an appropriate time. It is not always necessary, however, to wait to obtain a replicate measurement. When the measurement is a judgement, such as the Global Assessment Scale [10], it is possible to have two independent ratings made at the same time. Moreover, time can be frozen by video-recording the structured interview so that ratings can be obtained from those viewing the recording. Although these alternatives to traditional test–retest designs overcome the confounding of unreliability with genuine growth or development, they bring with them their own problems. These have been discussed by several authors, including [11]. Insofar as the respondent’s idiosyncratic responses contribute to unreliability, then estimates based on a single recorded interview may underestimate the level of random variation in the actual ratings obtained in the field. 
For this reason, inter-rater reliability studies using recorded interviews are expected to overestimate true reliability. When the measurement procedure under study is a questionnaire that includes several items pertaining to a single underlying psychological trait or symptom dimension, it is also possible to obtain some information about reliability within a single assessment occasion. The items that relate to the same underlying concept are considered to be replications of each other. The degree to which the patterns of
responses suggest that they are empirically related is used as evidence of reliability. This inference is made on the basis of the internal consistency of the questionnaire responses. The most widely used measure of reliability based on internal consistency is coefficient alpha [12]. An alternative measure is McDonald’s omega [13].
Internal consistency measures of reliability are affected by some biases that make them underestimate actual reliability [14], and others that make them overestimate reliability [15]. They will underestimate reliability if the items within the set are not close replications of each other. For example, a scale of depression symptoms may contain some items on mood, others on psychophysiological complaints and yet others on cognitive beliefs. Although these are all expected to be related to depression, they are not exact replications of each other. To the degree that the imperfect correlations among the items are due to the different item content rather than error, the overall reliability estimate will be smaller than it should be.
Reliability may be overestimated by the internal consistency design if the whole interview is affected by irrelevant global response patterns, such as mood or response biases. For instance, some respondents may perceive that acknowledging symptoms is socially undesirable, and may systematically under-report more bizarre problems. Others may fall into a pattern of denying everything. These so-called response biases inflate internal consistency reliability estimates. They are often addressed by mixing the items across many conceptual domains and by editing the items so that half are keyed as a symptom when the respondent says ‘no’ and half are keyed the opposite way. Scales of Yea-saying and Need-for-approval are also sometimes constructed to identify those respondents who are susceptible to response biases. The validity of these scales, however, is a subject of open discussion.
Given the possibility of opposing biases, how should we evaluate internal consistency results? If the results appear to indicate high reliability, look for response artifacts that might have inflated the estimate. If provisions have been taken to address response biases, then the high level of reliability might be real. If the results indicate that there is low reliability, then look to see if the items included within the internal consistency analysis are heterogeneous
in content. It is possible that a set of items that are heterogeneous might have adequate test–retest reliability even though the internal consistency estimate is low. Because researchers often seem unaware of the ambiguity of reliability results based on calculations of Cronbach’s alpha from one administration of a measure, a number of psychometric experts have recommended that reliability be measured in other ways (e.g. [14, 16]).
It is always helpful to incorporate multiple designs into a reliability program. By systematically studying the kinds of replication, one can gain an insight into sources of measurement variation. This is what is recommended by Cronbach and his colleagues [17] in their comprehensive extension of classical test theory known as Generalisability Theory. This theory encompasses both reliability and validity by asking about the extent to which a measurement procedure works in different populations, at different times, with different raters, who may have different training. This broad perspective easily includes designs such as those on the Diagnostic Interview Schedule (DIS) [18] that compared results from interviews done by ‘lay’ interviewers to those done by mental health professionals. To the extent that the trained lay interviewers performed like the professionals, the results might be interpreted as test–retest reliability of the DIS. If the level of training actually made a difference, then the results might be interpreted as the validity of using lay interviewers (assuming that the professionals are the ideal interviewers for this structured measure). From the generalisability perspective, it is neither a reliability nor a validity study, but rather a study of the generalisability of DIS results across time and interviewer type (see [19]). The flexibility of Generalisability Theory was illustrated by Cranford et al. [20], who used this approach to estimate the reliability of changes in affect over days in a diary study.
These reliability procedures have great utility for epidemiological studies of the course and temporal correlates of pathology.
5.3.1 The effect of population variance on reliability
In all of the reliability designs reviewed above we assumed that respondents were sampled from the population that is to be studied. By randomly sampling from the population, we can obtain an unbiased estimate of $\sigma_X^2$. Note that any bias that is introduced in estimating $\sigma_X^2$ can have serious effects on the estimate of $R_X$. Epidemiologists should be especially sensitive to the fact that samples of patients should not be used in a reliability study if the ultimate survey is to be carried out in the general population. Relative to the variance in community surveys, the variance of most psychiatric measures will be too large in treated samples. The bias is usually concentrated in the $\sigma_T^2$ term of $\sigma_X^2 = \sigma_T^2 + \sigma_E^2$, and thus the reliability often appears to be better in the treated population than in a community population. When the reliability study sample has been constructed using stratified samples of cases and non-cases, then it is often possible to undo the bias through weighting (e.g. [21]).
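A short numerical sketch may help to show why the choice of sample matters. It holds the error variance constant and changes only the systematic variance, as the argument above describes. The variance values and the labels for the two samples are hypothetical and are chosen only to make the contrast visible.

```python
def reliability(var_true, var_error):
    """Equation 5.1 applied to assumed variance components."""
    return var_true / (var_true + var_error)


ERROR_VARIANCE = 4.0  # hypothetical, identical in both samples

# Hypothetical systematic variances: larger in a treated sample than in the community.
systematic_variance = {"treated sample": 16.0, "community sample": 4.0}

for label, var_true in systematic_variance.items():
    print(f"{label}: reliability = {reliability(var_true, ERROR_VARIANCE):.2f}")
```

With the same measurement error, the measure looks considerably more reliable in the sample with the larger systematic variance, so an estimate from the treated sample would be too optimistic for a community survey.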
5.4 Statistical remedies for low reliability
If an investigator discovers that a quantitative measure is not sufficiently reproducible, there are several remedies that have been mentioned briefly before. The measure itself can be changed, the training of those administering it can be improved, or perhaps some special instructions can be developed for the respondents that improve the purity of the measurement outcome. These are examples of procedural remedies that are often effective. There is also a statistical remedy: obtain several independent replicate measurements and average their results. The idea is simple: averages of replicate measures are by definition more systematic than the individual measures themselves, so the reliability of the sum or average of items or ratings will be consistently higher than that of the components. The degree to which reliability is expected to improve in the composites is described mathematically by Spearman [22] and Brown [23]. Let the sum of k ratings or items ($X_1, X_2, X_3, \ldots, X_k$) be called W(k). Then the expected reliability of W(k) can be written as a function of k and the reliability of the typical measurement, $R_X$, according to the Spearman–Brown formula:
\[ R_{W(k)} = \frac{k R_X}{1 + (k - 1) R_X} \qquad (5.2) \]
Equation 5.2 is based on assumptions about the comparability of the measurements that are averaged
or summed into W(k), not on the form or distribution of the individual measurements. Because the result is not limited by the distribution of the X measures, the formula is useful in calculating the expected reliability of a scale composed of k binary (0,1) items as well as scales composed of quantitative ratings or items. Note that averaging measures is a remedy for low reliability only if there is some evidence of replicability. It is clear that $R_{W(k)}$ will be zero if $R_X$ is zero, regardless of the magnitude of k.
The Spearman–Brown formula is especially useful for internal consistency reliability studies. When multiple items are available as replicate measures, it is usually the reliability of the scale score (the sum or average of items) that is of interest. While we could use the internal consistency design to calculate the average item reliability, and then use that result in Equation 5.2 to calculate the expected scale reliability, these steps are combined when one uses certain estimation formulas, such as the classic coefficient alpha of Cronbach [12].
The relationship described in the Spearman–Brown formula can also be used in studies of interrater reliability to determine how many independent ratings need to be averaged to obtain a desired level of reliability, say $C_R$. If the obtained level of reliability for a single rater is $R_X$, then the number of raters that are needed to produce an averaged-rater reliability of $C_R$ is
\[ k = \frac{C_R (1 - R_X)}{R_X (1 - C_R)} \qquad (5.3) \]
For example, if each rater only has a reliability of RX = 0.40 and one wants a reliability of CR = 0.75, then Equation 5.3 gives k = 4.5. This means that averages of four raters would be expected to have less than 0.75 reliability, while averages of five raters would exceed the target reliability of 0.75.
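The two formulas of this section can be checked with a few lines of code. The sketch below simply transcribes Equations 5.2 and 5.3 and reruns the worked example; the function names are hypothetical, and rounding the fractional result of Equation 5.3 up to the next whole rater is an assumption about how the answer would be used in practice.

```python
import math


def spearman_brown(k, r_single):
    """Equation 5.2: expected reliability of the sum or average of k replicates."""
    return k * r_single / (1 + (k - 1) * r_single)


def raters_needed(target, r_single):
    """Equation 5.3: (possibly fractional) k needed to reach the target reliability."""
    return target * (1 - r_single) / (r_single * (1 - target))


k_exact = raters_needed(0.75, 0.40)
k_whole = math.ceil(k_exact)  # raters come in whole numbers, so round up

print(f"k from Equation 5.3: {k_exact:.1f}")
print(f"reliability with {k_whole - 1} raters: {spearman_brown(k_whole - 1, 0.40):.3f}")
print(f"reliability with {k_whole} raters: {spearman_brown(k_whole, 0.40):.3f}")
```

Evaluating Equation 5.2 on either side of the fractional answer shows directly which whole number of raters first meets the target.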
5.5 Reliability theory and binary judgements The reliability theory just reviewed does not make strong assumptions about the kind of measurement embodied in X, and indeed many of the results hold for binary variables such as ones that might represent specific psychiatric diagnoses (e.g. X = 1 when the respondent is thought to have current
major depression, X = 0 otherwise). Kraemer [24] has shown explicitly how the results work with binary judgements. From her mathematical analysis of the problem it can be seen that the systematic component of X that I have called T = E(X), will end up as a proportion falling between the extremes of 0 and 1. It represents the expected proportion of diagnosticians who would give the diagnosis to the respondent being evaluated. If T is close to 1, then most diagnosticians would say that the respondent is a case, and if T is close to 0, then most would say that the respondent is not a case. Note that while X itself is binary, T is quantitative in the range (0,1). Because averages are quantitative (at least as n gets large), the psychometric results from the Spearman–Brown formula are applicable only when the composite of interest is quantitative. This is often the case when X represents binary items in a symptom scale. Of interest is the count of symptoms, which is closely related to the average of symptom items. However, if we really want a binary variable as the outcome, then the Spearman–Brown result does not apply. For example, diagnoses of several independent judges are sometimes combined into a ‘consensus diagnosis’, that is itself binary. If the consensus rule is one that requires that all judges make the diagnosis before the diagnosis is applied, the result might be less reliable than some of the individual diagnosticians (see [25]). The total consensus rule is as weak as the least reliable diagnostician, because each has veto power regarding whether the consensual diagnosis is made. Many of the classic psychometric results depend on the assumed symmetry of errors. Because T is defined as an average, by definition about half the errors go in one direction and half in the other. For diagnoses, however, the errors that attract attention are those that seem to cause clinically relevant discrepancies. 
For example, if we know that a certain set of presentation facts is viewed by 90% of trained clinicians as indicating schizophrenia, then the clinically relevant discrepancies are those diagnosticians who argue that the diagnosis of schizophrenia is inappropriate. Persons who insist that schizophrenia should be diagnosed with more than 90% certainty are not usually considered in practical terms to be outliers. The interest in the asymmetry of errors in diagnoses prompts some researchers to decompose
interrater discrepancies into ones that are consistent with problems of sensitivity and specificity. From this perspective it can be shown that the reliability coefficient of Equation 5.1 is a function of both kinds of errors. If we focus on one kind of error only, such as sensitivity, the classic relation between reliability and validity no longer necessarily holds. There are some examples in which different levels of reliability are consistent with the same level of sensitivity. (One usually finds that the assumed specificity or prevalence varies with the reliability in examples such as this; see [26].) When asymmetric errors are of central interest, the results reviewed in this chapter may not be totally applicable.
The role of asymmetric errors in binary ratings is only one special aspect of such data. Another is the relation between the expected mean of a binary variable and the expected variance of that variable. For variables that are normally distributed the mean contains no information about the variance of the variable, but for variables that are binomial (a very common distribution for binary variables), the variance is necessarily small for variables with means near 0 or 1. This fact has implications for the interpretation of Equation 5.1, the definition of the reliability coefficient. If the prevalence of a diagnosis is low in a population, then $\sigma_T^2$ will be small. If the level of error variance is held constant, but $\sigma_T^2$ is made smaller, then $R_X$ will be smaller. One way to interpret this result is that the level of error must be reduced to study disorders that have smaller base rates in the population. Any randomly false positive diagnosis makes the diagnostic system seem unreliable for rare disorders. In this case the diagnostic system is unreliable because the precious few true positives are swamped by the random false positives.
Nevertheless, the fact that reliability is empirically related to prevalence has caused some commentators to question the utility of reliability measures in binary variables [27–29]. Others of us have argued that dropping the statistic because of the challenge of measuring rare disorders is misguided [7, 30] because the reliability statistic is useful in describing the effects of measurement error on statistical analyses. Kraemer [31] lucidly reviewed the rationale of reliability studies and showed how the challenge of establishing reliability of categorical data is affected by various features of the measurement situation and the design of the reliability study.
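The dependence of binary reliability on prevalence can be sketched numerically. The example below is an illustration under stated assumptions rather than a method taken from the chapter: it supposes that each respondent has a latent true status, that all diagnosticians share the same sensitivity and specificity, and that their errors are independent given the true status. Under these assumptions T = E(X) equals the sensitivity for true cases and one minus the specificity for non-cases, so the variance components of Equation 5.1 can be written down directly. The function name and the numerical values are hypothetical.

```python
def binary_reliability(prevalence, sensitivity, specificity):
    """Equation 5.1 for a binary rating under the latent-status assumptions above."""
    # Probability that a randomly chosen respondent receives the diagnosis.
    p_positive = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    var_x = p_positive * (1 - p_positive)  # total variance of the binary rating X
    # T takes the value `sensitivity` for cases and `1 - specificity` for non-cases.
    var_t = prevalence * (1 - prevalence) * (sensitivity + specificity - 1) ** 2
    return var_t / var_x


# Same hypothetical error rates, applied to disorders of decreasing prevalence.
for prevalence in (0.50, 0.10, 0.01):
    value = binary_reliability(prevalence, sensitivity=0.90, specificity=0.95)
    print(f"prevalence {prevalence:.2f}: reliability = {value:.2f}")
```

With the error rates held fixed, the computed reliability falls as the disorder becomes rarer, because the few true positives are increasingly outnumbered by random false positives.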
In the next section I present a survey of reliability statistics that can be used to evaluate data from reliability studies. One of these statistics is Cohen’s kappa [32]. It is especially designed for categorical outcomes, but it shares with the quantitative statistics its interpretation as an estimator of the reliability coefficient in Equation 5.1. Although the special features of binary data require a careful consideration of the effects of errors in epidemiological analyses, the general concerns for the concept of reliability as reviewed in the preceding sections are usually relevant for multivariate analyses that treat binary distinctions as dummy variables.
5.6 Reliability statistics: General
As we have seen, the reliability coefficient of Equation 5.1 is defined in terms of variances: variances of systematic person characteristics, $\sigma_T^2$, and variances of measurements across replications for a single person, $\sigma_E^2$. There are several ways to estimate the variance ratio shown in Equation 5.1 [9], but one direct method is simply to estimate the separate variance components and then combine them in the form of Equation 5.1. Estimates of this sort are called intraclass correlations. Intraclass correlation is not a single statistic, but rather a family of statistics that can be used for estimating reliability. In this section we will review several versions that can be used with a wide variety of variables.
We focus here on the easiest part of reliability analysis, ‘point estimation’ of the statistic that summarises the reliability results. Although it is important, we do not have the space to present the methods that must be used to estimate 95% confidence intervals for the study results. The form of the interval estimators depends on the nature and distribution of the data, and new methods are being actively developed in the literature. For reviews of methods for confidence intervals see Shrout [7], Dunn [9] and Blackman and Koval [33]. It is important to note, however, that estimates of reliability are often less precise than we would like [34], and that this fact is made clear by the use of confidence intervals.
The intraclass correlation point estimates are derived from information summarised in the analysis of variance (ANOVA) of the data from the reliability
study. The ANOVA treats each subject as a level of the SUBJECTS factor. Usually subjects are considered to be a random factor, because they are selected to be representative of a population of interest. If the replicate measurements of the subjects are systematically obtained using a certain set of k raters or measuring devices, then the ANOVA might involve a two-way SUBJECTS by MEASURES design. If, on the other hand, the replicate measurements of each subject are obtained by randomly sampling k measures, then the analysis would use a one-way ANOVA.
5.6.1 One-way ANOVA analyses
Table 5.1 illustrates data that might be collected in a reliability study of relative informants. Each of N = 10 probands is rated by k = 3 distinct relatives. Between-subject variation can be estimated using all k ratings, and within-subject variation is used to estimate the magnitude of the error variation. When the relationships of the relatives vary from proband to proband (e.g. siblings for one proband, parents for another, cousins for a third), these data do not have a data analytic structure for informant. If there had been such a structure, we might have considered a proband-by-relationship two-way ANOVA. In our analysis we will assume that the informants are essentially a random sample of possible informants for a given respondent.
Table 5.1 Hypothetical data on functioning of 10 probands by three of their relatives.
Proband   Relative 1   Relative 2   Relative 3
1         29           32           17
2         23           33           28
3         19           17           18
4         6            10           5
5         13           20           20
6         0            0            2
7         10           11           15
8         5            1            15
9         31           26           19
10        15           17           18
Table 5.2 shows the layout of the one-way ANOVA, along with the numerical estimates obtained from the data in Table 5.1. The actual computation of the ANOVA results can be obtained from standard computer software, such as SPSS RELIABILITY [35]. The numerical example illustrates a pattern in which the between-subjects (probands) mean squares is substantially larger than the within-subjects mean squares. Consistent with an informal examination of the hypothetical data in Table 5.1, this pattern suggests that the differences between subjects’ mean ratings are larger than the disagreements among relatives regarding the subjects’ scores.
The reliability estimate for the one-way ANOVA is calculated using the first formula in Table 5.3A. This form of the intraclass correlation was called ICR(1,1) by Shrout and Fleiss [36], and we retain that designation. To illustrate the calculation with the numerical example from Table 5.1, we find,
ICR(1,1) = (251.0 − 22.2)/(251.0 + 2 × 22.2) = 0.77
This result describes the reliability of a single randomly selected informant. About 77% of the variance of a single informant’s ratings is attributable to systematic differences between subjects. Although the stability of the result might be questioned because of the limited sample size, the result is encouraging: this rating, in this population, appears to be made fairly reliably by a single informant.
Suppose that it is possible to obtain three informant ratings for each subject in the survey. How much more reliable would the average of the three ratings be than an individual informant? The answer can be calculated using the Spearman–Brown formula (Equation 5.2), with k = 3 and $R_X$ = 0.77. Alternatively, one can use the formula for ICR(1,k) shown in Table 5.3B. This formula is obtained by algebraically combining the expression for ICR(1,1) with the Spearman–Brown formula. In this case the answer is ICR(1,k) = 0.91. About 91% of the variance of the average of three randomly chosen informants is attributable to systematic differences between subjects.
Table 5.2 Analysis of variance when replications are nested within subjects: one-way ANOVA.
Source of variation   df         Sums of squares   Mean squares            Table 5.1 example: MS on df
Between subjects      n − 1      BSS               BMS = BSS/(n − 1)       BMS = 251.0 on 9 df
Within subjects       n(k − 1)   WSS               WMS = WSS/[n(k − 1)]    WMS = 22.2 on 20 df
Table 5.3 Versions of intraclass correlation statistics useful for various reliability designs.
Type of reliability study design                 Raters fixed or random?   Version of intraclass correlationᵃ
(A) Reliability of single rater
Nested: n subjects rated by k different raters   Random                    ICR(1,1) = (BMS − WMS)/[BMS + (k − 1)WMS]
Subject by rater crossed design                  Random                    ICR(2,1) = (TMS − EMS)/[TMS + (k − 1)EMS + k(JMS − EMS)/n]
Subject by rater crossed design                  Fixed                     ICR(3,1) = (TMS − EMS)/[TMS + (k − 1)EMS]
(B) Reliability of the average of k ratings
Nested: n subjects rated by k different raters   Random                    ICR(1,k) = (BMS − WMS)/BMS
Subject by rater crossed design                  Random                    ICR(2,k) = (TMS − EMS)/[TMS + (JMS − EMS)/n]
Subject by rater crossed design                  Fixed                     ICR(3,k) = (TMS − EMS)/TMS
ᵃ BMS and WMS refer to between-subject and within-subject mean squares from a one-way ANOVA. TMS, JMS and EMS refer to between-subjects (targets), between-measures (judges) and error mean squares from a two-way ANOVA based on n target-subjects and k raters.
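The one-way calculations can be reproduced from the mean squares alone. The sketch below takes the between-subjects and within-subjects mean squares as printed for the Table 5.1 example and applies the ICR(1,1) and ICR(1,k) expressions of Table 5.3; the function names are hypothetical. It also applies the Spearman–Brown formula to the unrounded ICR(1,1) to show that the two routes to the reliability of the average agree.

```python
def icc_1_1(bms, wms, k):
    """ICR(1,1): reliability of a single randomly chosen rater (one-way ANOVA)."""
    return (bms - wms) / (bms + (k - 1) * wms)


def icc_1_k(bms, wms):
    """ICR(1,k): reliability of the average of the k ratings (one-way ANOVA)."""
    return (bms - wms) / bms


def spearman_brown(k, r_single):
    """Equation 5.2."""
    return k * r_single / (1 + (k - 1) * r_single)


# Mean squares as printed for the Table 5.1 example; k = 3 relatives per proband.
BMS, WMS, K = 251.0, 22.2, 3

single = icc_1_1(BMS, WMS, K)
average = icc_1_k(BMS, WMS)

print(f"ICR(1,1) = {single:.2f}")
print(f"ICR(1,k) = {average:.2f}")
print(f"Spearman-Brown applied to ICR(1,1) = {spearman_brown(K, single):.2f}")
```

The agreement between the last two lines reflects the algebraic identity mentioned in the text, not a coincidence of these particular numbers.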
5.6.2 Two-way ANOVA analyses
Table 5.4 illustrates data that might be collected in a reliability study of two professional raters or interviewers. As a result of the interview by Interviewer 1 we have both a binary diagnosis (disorder present [X = 1] vs. disorder absent [X = 0]) and a quantitative score such as a total functioning score (called Z in the table). Replicate scores and diagnoses are obtained by a second interviewer, called Interviewer 2. The hypothetical data on X1, X2, Z1 and Z2 are shown for 17 respondents. The layout of the two-way ANOVA is shown in Table 5.5, along with numerical results from the Table 5.4 examples.
Table 5.4 Hypothetical data on assessment of depression and functioning. X1 and X2 represent test–retest diagnoses of major depression (X = 1, present; X = 0, not present), and Z1 and Z2 represent ratings of adaptive functioning.
Respondent   X1   X2   Z1   Z2
1            0    1    17   11
2            1    1    17   15
3            0    0    26   25
4            0    0    24   22
5            0    0    19   14
6            0    0    22   16
7            0    0    17   18
8            0    0    23   19
9            1    0    19   16
10           0    1    18   12
11           0    0    21   18
12           1    1    13   11
13           0    0    21   23
14           0    0    22   17
15           0    0    15   12
16           0    0    20   18
17           0    0    21   20
Only two interviewers were used in the reliability study illustrated in Table 5.4, but we might consider the two to be a random sample from all possible interviewers from the study. If so, then they must not be selected on the basis of their special skills as interviewers, but rather should be selected to be representative. When interviewers who are employed in the reliability study represent the population of potential interviewers, we say that they are random effects. In some cases we are interested in the ratings of specific interviewers rather than a population of
interviewers. Suppose Interviewer 1 is a doctoral candidate who carried out her own data collection, and that Interviewer 2 is a colleague who is hired to document that the ratings are systematic. In this case we simply wish to describe the quality of data collected by the doctoral candidate, and we say that the interviewers are fixed effects. Depending on whether the raters are considered to be random or fixed, we use different versions of the intraclass correlation to estimate reliability.
When we wish to estimate the reliability of a randomly sampled interviewer, we use the expression for ICR(2,1) shown in Table 5.3A. This intraclass correlation is a function not only of the between-subjects mean squares and the error mean squares, but also of the between-measures (judges) mean squares. If different raters are more or less liberal in assigning high scores, then the final variability of the ratings will be affected. ICR(2,1) takes this extra variation into account in estimating reliability.
In the two examples of Table 5.4, one reveals a large between-measure effect and the other does not. From the numbers in Table 5.4 it can be seen that the Z2 ratings are usually smaller than the Z1 ratings. Rater 2 seems to believe that most subjects are functioning somewhat worse than perceived by Rater 1. Even with this rater difference, the reliability of Z is higher than the reliability of X, according to the data in Table 5.4. The ICR(2,1) for Z is calculated as
(25.7 − 2.67)/(25.7 + (1) × 2.67 + 2 × (67.8 − 2.67)/17) = 0.64
The ICR(2,1) for X is calculated as
(0.254 − 0.092)/(0.254 + (1) × 0.092 + 2 × (0.029 − 0.092)/17) = 0.48
Table 5.5 Analysis of variance when replications have structure: two-way ANOVA.
Source of variation          df               Sums of squares   Mean squares                   Table 5.4 examples: mean square on df
Between subjects (targets)   n − 1            TSS               TMS = TSS/(n − 1)              Variable X: 0.254 on 16 df; Variable Z: 25.7 on 16 df
Between measures (judges)    k − 1            JSS               JMS = JSS/(k − 1)              Variable X: 0.029 on 1 df; Variable Z: 67.8 on 1 df
Residual (error)             (n − 1)(k − 1)   ESS               EMS = ESS/[(n − 1)(k − 1)]     Variable X: 0.092 on 16 df; Variable Z: 2.67 on 16 df
For the rating of adaptive functioning we could consider averaging both individual Z ratings to obtain a more reliable score. We can use either the Spearman–Brown formula, or the expression ICR(2,k) to calculate the reliability of the mean of two such ratings. In this case, the result is 0.78 rather than 0.64. Although the reliability of the binary X variable is worse than that of the quantitative Z variable, it would not usually be meaningful to rely on an average diagnosis instead of a truly binary rating. For this reason the ICR(2,k) form of the intraclass correlation would not be applied to X in Table 5.4.

The calculations carried out so far have assumed that the two sets of ratings in Table 5.4 are representative of a host of possible interviewers. Now we turn our attention to the situation in which the two raters can be considered to be fixed. In this case we can either ignore systematic rater differences in mean ratings, or we can adjust for them. The expression for ICR(3,1) in Table 5.3A is appropriate when we wish to describe the reliability of a single fixed rater. Unlike ICR(2,1), this version of the intraclass correlation is not affected by the between-rater mean squares. On average, ICR(3,1) will be larger in magnitude than ICR(2,1). By fixing the raters to certain persons, the extraneous variation due to sampling of raters is eliminated and the resulting reliability is usually higher. This effect is especially obvious for Z, which had a large between-rater effect. The ICR(3,1) for Z is calculated as

\[ \frac{25.7 - 2.67}{25.7 + (1)(2.67)} = 0.81 \]

ICR(3,1) for X is not much different from ICR(2,1), as the rater effects were small:

\[ \frac{0.254 - 0.092}{0.254 + (1)(0.092)} = 0.47 \]
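The worked calculations above can be reproduced with a short script. This is a minimal sketch: the function names are hypothetical, and the formulas are inferred from the numerical expressions printed in the text (Table 5.3A itself is not reproduced in this section), so it should be read as an illustration rather than as the chapter's definitive statement of the estimators.

```python
def icr_two_way(tms, jms, ems, n, k):
    """Intraclass correlations from two-way ANOVA mean squares.

    tms, jms, ems: between-subjects, between-judges and error mean squares.
    n: number of subjects; k: number of raters.
    """
    num = tms - ems
    return {
        # single randomly sampled rater
        "ICR(2,1)": num / (tms + (k - 1) * ems + k * (jms - ems) / n),
        # mean of k randomly sampled raters
        "ICR(2,k)": num / (tms + (jms - ems) / n),
        # single fixed rater (ignores the between-judges mean square)
        "ICR(3,1)": num / (tms + (k - 1) * ems),
    }


def spearman_brown(r, k):
    """Reliability of the mean of k replicate measures, each with reliability r."""
    return k * r / (1 + (k - 1) * r)


# Mean squares quoted from Table 5.4 (n = 17 subjects, k = 2 raters).
z = icr_two_way(tms=25.7, jms=67.8, ems=2.67, n=17, k=2)
x = icr_two_way(tms=0.254, jms=0.029, ems=0.092, n=17, k=2)

for name, result in (("Z", z), ("X", x)):
    print(name, {key: round(value, 2) for key, value in result.items()})
print("Spearman-Brown for Z:", round(spearman_brown(z["ICR(2,1)"], 2), 2))
```

The script computes ICR(2,k) for X only because the function returns all three forms; as noted above, an averaged binary diagnosis would not usually be meaningful.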
5.6.3 The reliability of the average of k fixed measures: Cronbach’s alpha

Just as ICR(1,1) and ICR(2,1) can be used in the Spearman–Brown formula to determine how much reliability improves by using an average score, so can ICR(3,1) be used when an average measurement is of interest. In this case the reliability of the averaged measurement can be computed directly using ICR(3,k) from Table 5.3. For the quantitative Z variable, the reliability of the average is expected to be 0.90.

One common application of ICR(3,k) is to internal consistency analyses of psychometric scales. Items in self-report questionnaires are usually fixed in that the same items are used with all respondents. Suppose that n subjects are administered k scale items, and the results are analysed using the two-way ANOVA layout of Table 5.5. The estimate of the reliability of the sum or average of the k fixed items can be computed using ICR(3,k). The result is identical to Cronbach’s alpha [12], which we discussed in the first section. Alpha is computed directly by computer programs such as SPSS RELIABILITY [35].
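The equivalence of ICR(3,k) and Cronbach’s alpha can be checked numerically. The sketch below assumes that ICR(3,k) takes the form (TMS − EMS)/TMS, which is consistent with the value of 0.90 quoted for Z, and compares it with the familiar item-variance form of alpha. The item scores are invented purely for illustration and all function names are hypothetical.

```python
from statistics import variance


def alpha_from_anova(scores):
    """ICR(3,k) from the two-way ANOVA layout: (TMS - EMS) / TMS.

    scores: list of n subjects, each a list of k item scores.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    tss = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    jss = n * sum((m - grand) ** 2 for m in col_means)   # between items
    total = sum((v - grand) ** 2 for row in scores for v in row)
    ess = total - tss - jss                               # residual
    tms = tss / (n - 1)
    ems = ess / ((n - 1) * (k - 1))
    return (tms - ems) / tms


def alpha_from_item_variances(scores):
    """Cronbach's alpha from item variances and the variance of the sum score."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents answering 3 items.
items = [[3, 4, 3], [2, 2, 1], [5, 4, 5], [1, 2, 2], [4, 5, 4]]
alpha_anova = alpha_from_anova(items)
alpha_items = alpha_from_item_variances(items)
print(round(alpha_anova, 3), round(alpha_items, 3))

# The same expression applied to the Z mean squares quoted from Table 5.4.
icr_3_k_z = (25.7 - 2.67) / 25.7
print(round(icr_3_k_z, 2))
```

The two functions agree because the sum of the item variances equals TMS + (k − 1)EMS and the variance of the sum score equals k times TMS.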
5.7 Other reliability statistics

5.7.1 Cohen’s kappa

When binary data such as that for variable X in Table 5.4 are collected, reliability can be estimated directly using Cohen’s kappa [32]. Fleiss and Cohen [37] showed that kappa is conceptually equivalent to ICR(2,1) in Table 5.3. It can be calculated simply using the entries of a 2 × 2 table showing the diagnostic agreement. In general, this agreement table might be laid out as follows:
             Rater 1: +   Rater 1: −   Total
Rater 2: +   a            b            a + b
Rater 2: −   c            d            c + d
Total        a + c        b + d        n
Cohen [32] pointed out that while cells a and d represent agreement, it is not sufficient to evaluate reliability by reporting the overall proportion of agreement, Po = (a + d)/n. This statistic may be large even if raters assigned diagnoses by flipping coins or rolling dice. His kappa statistic adjusts for simple chance mechanisms:

\[ \text{kappa} = \frac{P_o - P_c}{1 - P_c} \]

where Po is the observed agreement, (a + d)/n, and Pc is the expected agreement due to chance:

\[ P_c = \frac{(a + c)(a + b) + (b + d)(c + d)}{n^2}. \]

When computing kappa by hand, it is sometimes more convenient to use the following equivalent expression:

\[ \text{kappa} = \frac{ad - bc}{ad - bc + n(b + c)/2}. \]
When the X data in Table 5.4 are tabulated into a 2 × 2 table like that shown above, we get a = 2, b = 2, c = 1 and d = 12. The observed agreement is Po = 0.82, but the expected agreement by chance is Pc = 0.67. Using either of the expressions for kappa, we find the reliability to be 0.46. As expected, this is quite close to the value of 0.48 obtained using ICR(2,1).

One advantage of calculating the reliability of binary judgements using kappa instead of intraclass correlation methods is that the expressions for kappa’s standard error and confidence bounds are explicitly suited to binary data. Kappa can also be generalised to describe the overall reliability of classifications into multiple categories. Fleiss et al. [38] provide an overview of many forms of kappa, and Donner and his colleagues [39–44] have done much to describe the sampling variation of kappa statistics.
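As a check on the arithmetic, both expressions for kappa can be evaluated on the cell counts quoted above. This is a minimal sketch with hypothetical function names; the formulas are those given in the text.

```python
def cohen_kappa(a, b, c, d):
    """Return (Po, Pc, kappa) for a 2 x 2 agreement table with cells a, b, c, d."""
    n = a + b + c + d
    p_o = (a + d) / n                                        # observed agreement
    p_c = ((a + c) * (a + b) + (b + d) * (c + d)) / n ** 2   # chance agreement
    return p_o, p_c, (p_o - p_c) / (1 - p_c)


def cohen_kappa_by_hand(a, b, c, d):
    """Equivalent expression that avoids computing Po and Pc."""
    n = a + b + c + d
    return (a * d - b * c) / (a * d - b * c + n * (b + c) / 2)


# Cell counts for variable X as tabulated in the text.
p_o, p_c, kappa = cohen_kappa(2, 2, 1, 12)
kappa_alt = cohen_kappa_by_hand(2, 2, 1, 12)
print(round(p_o, 2), round(p_c, 2), round(kappa, 2), round(kappa_alt, 2))
```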
5.7.2 Product moment correlation

If the reliability study yields two measurements, and if the raters are considered to be fixed (rather than representative of a pool of raters), then reliability can be estimated by computing the product moment correlation between the two measures. This is the usual correlation statistic built into most computer programs and calculators. When the ratings are quantitative, the correlation is known as the Pearson correlation, and when the ratings are binary it is known as the phi coefficient. Regardless of what they are called, they are comparable to the ICR(3,1) version of the intraclass correlation described above. For the Z variables the Pearson correlation is rP = 0.83 and for the X variables in Table 5.4 the phi coefficient is rP = 0.47. These are very close to the ICR(3,1) values of 0.81 and 0.47 obtained on the same data.
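The statement that the phi coefficient is simply the product moment correlation applied to binary ratings can be illustrated with the X cell counts (a = 2, b = 2, c = 1, d = 12). The sketch below rebuilds 0/1 rating vectors from those counts; the individual ratings in Table 5.4 are not reproduced here, so the ordering of subjects is arbitrary, and the function names are hypothetical.

```python
from math import sqrt


def phi_from_table(a, b, c, d):
    """Phi coefficient computed directly from the 2 x 2 cell counts."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def pearson(u, v):
    """Ordinary product moment correlation between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    ss_u = sum((x - mean_u) ** 2 for x in u)
    ss_v = sum((y - mean_v) ** 2 for y in v)
    return cov / sqrt(ss_u * ss_v)


a, b, c, d = 2, 2, 1, 12
# Cell a: both raters positive; b: only Rater 2 positive;
# c: only Rater 1 positive; d: both negative.
rater1 = [1] * a + [0] * b + [1] * c + [0] * d
rater2 = [1] * a + [1] * b + [0] * c + [0] * d

phi = phi_from_table(a, b, c, d)
r = pearson(rater1, rater2)
print(round(phi, 2), round(r, 2))
```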
5.7.3 Item response theory statistics

Investigators who have a set of survey questions that are known to reflect an underlying dimension, such as severity of distress or impairment, often report estimates of Cronbach’s alpha as a summary of measurement quality. An alternative approach is to focus on the relation of each item response pattern to the underlying dimension. When items are clearly phrased and related to the underlying (latent) dimension, the probability of endorsing an item category will be systematically related to the latent dimension (see, for example [45]). Of special interest are the slope and location of each item, indicating the relevance and severity of each item with regard to the latent dimension. Item response theory (IRT) analyses are especially useful for comparing the measurement equivalence of items across different groups (e.g. [46]). Some argue that IRT analyses should supplant traditional reliability analyses (e.g. [45]).
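To make the roles of slope and location concrete, the sketch below uses a two-parameter logistic item response function for a binary item. The text does not commit to a particular IRT model, so the choice of this model, the function name and the parameter values are all assumptions made for illustration.

```python
from math import exp


def endorsement_probability(theta, slope, location):
    """Two-parameter logistic curve (an assumed model, not taken from the text).

    theta: respondent's position on the latent dimension.
    slope: how sharply the item discriminates (its relevance).
    location: latent level at which endorsement reaches 50% (its severity).
    """
    return 1.0 / (1.0 + exp(-slope * (theta - location)))


# Hypothetical items: a mild symptom and a severe symptom with equal slopes.
for theta in (-1.0, 0.0, 1.0, 2.0):
    mild = endorsement_probability(theta, slope=1.5, location=0.0)
    severe = endorsement_probability(theta, slope=1.5, location=2.0)
    print(theta, round(mild, 2), round(severe, 2))
```

Under this assumed model, an item with a larger slope separates respondents just above and just below its location more sharply, while a larger location means the item is endorsed only at higher levels of the latent dimension.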
5.8 Summary and conclusions

Unreliability is a measurement problem that can often be rectified by improving interview procedures, or by using statistical sums or averages of replicate measures. Determining the extent to which unreliability is a problem, however, can be challenging. There are various designs for estimating reliability, but virtually all have some biases and shortcomings. Studies of sampling variability of reliability statistics [9, 39, 47] have suggested that sample sizes in pilot studies are often not adequate to give stable estimates of the reliability of key measurement procedures.

It is important that reliability studies be considered critically in the search for ways to improve measurement procedures. Specifically, if the reliability of a measure appears to be very good, ask whether there are biases in the reliability design that might bias the results optimistically. Were the respondents sampled in the same way in the reliability study that they will be in the field study? Was the respondent given the chance to be inconsistent, or did the replication make use of archived information? If serious biases are not found, and the reliability study produced stable estimates, then you can put the issue of reliability behind you, at least for the population at hand.

If the reliability of a measure appears to be poor, one should also look for biases in the reliability design. How similar were the replications? Could the poor reliability results be an artifact of legitimate changes over time, heterogeneous items within a scale, or artificially different measurement conditions? Was the sample size large enough to be sure that reliability is in fact bad? Be especially suspicious if you have evidence of validity of a measure that is purported to be unreliable. Rather than dismissing a measure with apparently poor reliability, ask whether it can be improved to eliminate noise.
References

[1] American Psychiatric Association (1980) Diagnostic and Statistical Manual of Mental Disorders, 3rd edn, American Psychiatric Association, Washington, DC.
[2] American Psychiatric Association (1994) Diagnostic and Statistical Manual of Mental Disorders, 4th edn, American Psychiatric Association, Washington, DC.
[3] Cochran, W.G. (1968) Errors in measurement in statistics. Technometrics, 10, 637–666.
[4] Snedecor, G.W. and Cochran, W.G. (1967) Statistical Methods, 6th edn, Iowa State University Press, Ames.
[5] Bollen, K.A. (1989) Structural Equations with Latent Variables, John Wiley & Sons, Inc., New York.
[6] Borm, G.F., Munneke, M., Lemmers, O. et al. (2007) An efficient test for the analysis of dichotomized variables when the reliability is known. Stat. Med., 26, 3498–3510.
[7] Shrout, P.E. (1998) Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res., 7, 301–317.
[8] Lord, F.M. and Novick, M.R. (1968) Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[9] Dunn, G. (1989) Design and Analysis of Reliability Studies, Oxford University Press, New York.
[10] Endicott, J., Spitzer, R.L., Fleiss, J.L. et al. (1976) The global assessment scale: a procedure for measuring overall severity of psychiatric disturbance. Arch. Gen. Psychiatry, 33, 766–771.
[11] Spitzer, R.L. (1983) Psychiatric diagnosis: are clinicians still necessary? Compr. Psychiatry, 24, 399–411.
[12] Cronbach, L.J. (1951) Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
[13] McDonald, R.P. (1999) Test Theory: A Unified Treatment, Erlbaum, Mahwah.
[14] Sijtsma, K. (2009) On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74 (1), 107–120.
[15] Raykov, T. (1997) Scale reliability, Cronbach’s coefficient alpha, and violations of essential tau-equivalence with fixed congeneric components. Multivariate Behav. Res., 32, 329–353.
[16] Kraemer, H.C., Shrout, P.E. and Rubio-Stipec, M. (2007) Developing the diagnostic and statistical manual V: what will ‘statistical’ mean in DSM-5. Soc. Psychiatry Psychiatr. Epidemiol., 42 (4), 259–267.
[17] Cronbach, L.J., Gleser, G.C., Nanda, H. et al. (1972) The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, John Wiley & Sons, Inc., New York.
[18] Anthony, J.C., Folstein, M., Romanoski, A.J. et al. (1985) Comparison of the lay Diagnostic Interview Schedule and a standardized psychiatric diagnosis. Arch. Gen. Psychiatry, 42, 667–675.
[19] Brennan, R.L. (2001) Generalizability Theory, Springer, New York.
[20] Cranford, J.A., Shrout, P.E., Iida, M. et al. (2006) A procedure for evaluating sensitivity to within-person change: can mood measures in diary studies detect change reliably? Pers. Soc. Psychol. Bull., 32 (7), 917–929.
[21] Jannarone, R.J., Macera, C.A. and Garrison, C.Z. (1987) Evaluating interrater agreement through ‘case-control’ sampling. Biometrics, 43, 433–437.
[22] Spearman, C. (1910) Correlation calculated from faulty data. Br. J. Psychol., 3, 271–295.
[23] Brown, W. (1910) Some experimental results in the correlation of mental abilities. Br. J. Psychol., 3, 296–322.
[24] Kraemer, H.C. (1979) Ramifications of a population model for kappa as a coefficient of reliability. Psychometrika, 44, 461–472.
[25] Fleiss, J.L. and Shrout, P.E. (1989) Reliability considerations in planning diagnostic validity studies, in The Validity of Psychiatric Diagnoses (ed. L. Robbins), Guilford Press, New York, pp. 279–291.
[26] Carey, G. and Gottesman, I.I. (1978) Reliability and validity in binary ratings: areas of common misunderstanding in diagnosis and symptom ratings. Arch. Gen. Psychiatry, 35, 1454–1459.
[27] Grove, W.M., Andreasen, N.C., McDonald-Scott, P. et al. (1981) Reliability studies of psychiatric diagnosis: theory and practice. Arch. Gen. Psychiatry, 38, 408–413.
[28] Guggenmoos-Holzmann, I. (1993) How reliable are chance-corrected measures of agreement? Stat. Med., 12, 2191–2205.
[29] Spitznagel, E.L. and Helzer, J.E. (1985) A proposed solution to the base rate problem in the kappa statistic. Arch. Gen. Psychiatry, 42, 725–728.
[30] Shrout, P.E., Spitzer, R.L. and Fleiss, J.L. (1987) Quantification of agreement in psychiatric diagnosis revisited. Arch. Gen. Psychiatry, 44, 172–177.
[31] Kraemer, H.C. (1992) Measurement of reliability for categorical data in medical research. Stat. Methods Med. Res., 1, 183–199.
[32] Cohen, J. (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Meas., 20, 37–46.
[33] Blackman, N.J.-M. and Koval, J.J. (2000) Interval estimation for Cohen’s kappa as a measure of agreement. Stat. Med., 19, 723–741.
[34] Walter, S.D., Eliasziw, M. and Donner, A. (1998) Sample size and optimal designs for reliability studies. Stat. Med., 17, 101–110.
[35] SPSS Inc. (2009) SPSS for Windows (Version 16), SPSS Inc., Chicago.
[36] Shrout, P.E. and Fleiss, J.L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychol. Bull., 86, 420–428.
[37] Fleiss, J.L. and Cohen, J. (1973) The equivalence of weighted kappa and the intraclass coefficient as measures of reliability. Educ. Psychol. Meas., 33, 613–619.
[38] Fleiss, J.L., Levin, B. and Paik, M.C. (2003) Statistical Methods for Rates and Proportions, 3rd edn, John Wiley & Sons, Inc., New York.
[39] Donner, A. (1998) Sample size requirements for the comparison of two or more coefficients of interobserver agreement. Stat. Med., 17, 1157–1168.
[40] Donner, A. and Eliasziw, M. (1992) A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat. Med., 11, 1511–1519.
[41] Donner, A. and Eliasziw, M. (1994) Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics, 50, 550–555.
[42] Donner, A. and Eliasziw, M. (1997) A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Stat. Med., 16, 1097–1106.
[43] Donner, A., Eliasziw, M. and Klar, N. (1996) Testing the homogeneity of kappa statistics. Biometrics, 52, 176–183.
[44] Donner, A., Shoukri, M.M., Klar, N. et al. (2000) Testing the equality of two dependent kappa statistics. Stat. Med., 19, 373–387.
[45] Embretson, S.E. and Reise, S.P. (2000) Item Response Theory for Psychologists, Erlbaum, Mahwah.
[46] Gregorich, S.E. (2006) Do self-report instruments allow meaningful comparisons across diverse population groups? Med. Care, 44 (11), S78–S94.
[47] Cantor, A.B. (1996) Sample-size calculations for Cohen’s kappa. Psychol. Methods, 1, 150–155.
6

Moderators and mediators: Towards the genetic and environmental bases of psychiatric disorders

Helena Chmura Kraemer
Department of Psychiatry, Stanford University, CA, USA and University of Pittsburgh, Pittsburgh, PA, USA
6.1 Introduction

The terms ‘moderator’ and ‘mediator’ have been around for at least 50 years, but until 1986, the terms were used inconsistently and idiosyncratically. As a result, the constructs of moderation/mediation played little role in biomedical research. In 1986 [1] Baron and Kenny proposed conceptual definitions to distinguish moderators from mediators, giving each term a specific distinct meaning. According to those conceptual definitions, when M (moderator or mediator), T (target variable) and O (outcome variable) are three variables measured on the individual subjects in a population:

• M moderates the effect of T on O if M helps explain on whom or under what conditions T leads to O.
• M mediates the effect of T on O if M helps to explain why or how T leads to O.

Baron and Kenny also proposed a statistical method to apply those definitions, based on a linear model, assuming that the outcome is determined by a linear function of T and M:

Linear Model: \[ \beta_0 + \beta_1 T + \beta_2 M + \beta_3 TM \]
To show moderation Baron and Kenny required only that it be shown that β3 ≠ 0. That is problematic, for a demonstration that T moderates M also demonstrates that M moderates T, leaving the direction of moderation ambiguous. To show mediation, they required that the interaction effect be assumed to be zero. That too is problematic, for if interaction exists in the population but is assumed zero in the model, that effect is remapped partially into the other coefficients, biasing them, and partially into the error, reducing power. Moreover, users were encouraged to fit both models to one set of data, first proving a non-zero interaction to show moderation, and then illogically assuming a zero interaction to show mediation, leading to a conclusion that one variable can both moderate and mediate another in its effect on the outcome.

Added to such problems intrinsic to the approach, many users began to refer to any interactive relationship as ‘moderation’, to use the term ‘mediator’ as synonymous with ‘cause’, or to refer to a variable as a ‘moderator’ or a ‘mediator’ without specifying what the target and/or the outcome was. These are all misuses of the Baron and Kenny definitions.

Around 2000, a subgroup of the MacArthur Network on Psychopathology in Development modified
the Baron and Kenny criteria [2, 3], retaining their conceptual definitions, but attempting to set criteria to resolve and clarify the incongruities and ambiguities [4]. This approach, the so-called MacArthur approach, is discussed here. The MacArthur reformulation clarified the importance of the moderator/mediator concept to all biomedical research, particularly psychiatry. Moderators of treatment on outcome in randomised clinical trials (RCTs) identify subpopulations that respond differently to treatments, facilitating personalisation in the form of targeting of treatment choices to those most benefited [3, 5–7]. Mediators of treatments in RCTs suggest ways in which a treatment might be made more effective or more cost-effective, thus encouraging personalisation in the form of tailoring treatment to individual needs [5]. Similarly in risk research, there may be chains of mediators leading to onset of a disorder (some perhaps causal), and even multiple such chains in subpopulations defined by moderators of those risk factors on that disorder [8, 9]. In the present discussion the moderator/mediator approach specifically directed to understanding the genetic influences on disorders will be discussed. Consider the simple situation in which G represents the presence (G = 1) or absence (G = 0) of a certain genotype, and E represents some binary factor (E = 1 and 0) such as presence/absence of an environmental factor, a gene expression or its effect on the individual, or an event, all during the early lifetime of an individual, where both G and E are risk factors for the disorder D. D represents whether the individual experiences the onset of a disorder (D+) or not (D−) at a certain time point (prevalence) or during a certain time span (incidence) in later lifetime. 
The complete distribution of (G, E, D) in the population is described in Table 6.1. There is nothing in the moderator/mediator approach that requires that the three variables be binary, but the basic concepts are clearest in this simple situation and thereafter easily expanded to more general situations.

Table 6.1 G and E binary risk factors for the disorder D, with G (genotype) temporally preceding E (environmental risk factor, genetic expression, event), preceding D.

G   E   Probability of (G, E) in the population   Probability of D+
1   1   p q1                                      P11
1   0   p(1 − q1)                                 P10
0   1   (1 − p) q2                                P01
0   0   (1 − p)(1 − q2)                           P00

To motivate this discussion, consider first an extreme special case: Suppose that the only individuals who have D+ are those with both G = 1 and E = 1 (P11 = 1, Pij = 0 otherwise). If G and E were simultaneously studied in the population of interest, the risk difference (RD) between those with both G = 1 and E = 1 and others would be a perfect 1.00. Only a very small sample from the population would be needed to detect this situation. However, if one studied only the genotype, the probability of the disorder in the G = 1 subgroup is q1, and that in the G = 0 subgroup is 0, an RD now of q1 < 1. If, as often happens, E = 1 is rare (q1 near zero), that represents considerable attenuation of the effect size. Moreover, the disorder itself cannot be observed; only a diagnosis of the disorder, which has a certain sensitivity (SED) and specificity (SPD) for that disorder, and which is almost never perfect. Then when only G is considered, the RD is an even more attenuated q1(SED + SPD − 1). To detect such an attenuated association, the sample size may have to be very large, and even if such an association were found ‘statistically significant’, the effect size might well appear trivial.

A great deal of attention has been focused on improving the reliability of measurement of G, E and especially of D for psychiatric disorders, but until recently, little attention has been paid to the fact that if genes and environment ‘work together’ in their effect on a disorder, studying genes in absence of the environment (or environment in absence of genes) may conceal the crucial role that both genes and environment play in the aetiology of disorders.

In what follows, the sample is assumed to be a representative one from the population of interest. G, E and D are binary, as reliably measured as possible. Their errors of measurement are independent of each other, an assumption that can be guaranteed in design by ‘blinding’ each assessment to the others. Based on these assumptions, several methodological problems that are barriers to resolving such issues are discussed first, since addressing them is necessary to clarify the MacArthur moderator/mediator approach for considering the genetic/environmental bases of psychiatric disorders. Then moderation and mediation as well as other important ways in which multiple factors can ‘work together’ to explain an outcome are defined. The concept is then expanded from one binary variable at each of three time points, to one variable measured at any level at each of three time points, to multiple variables at each of three time points, to multiple variables at multiple time points. The goal is to suggest how, using these principles, one might ‘piece together’ a very complex picture of what might lead to a disorder, thus to suggest how complex disorders might be prevented or successfully treated.
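The attenuation of the risk difference in the extreme special case discussed earlier in this section can be verified numerically. The sketch below assumes, as in that special case, that D+ occurs only when G = 1 and E = 1, and that the diagnosis misclassifies D with fixed sensitivity and specificity regardless of genotype. The function name and the numerical values of q1, SED and SPD are hypothetical.

```python
def observed_rd_genotype_only(q1, sensitivity, specificity):
    """Risk difference in *diagnosed* disorder between G = 1 and G = 0.

    Assumes the extreme special case in the text: D+ requires G = 1 and E = 1,
    so P(D+ | G = 1) = q1 and P(D+ | G = 0) = 0.
    """
    def p_diagnosed(p_true):
        # true positives plus false positives
        return sensitivity * p_true + (1 - specificity) * (1 - p_true)

    return p_diagnosed(q1) - p_diagnosed(0.0)


q1, se_d, sp_d = 0.05, 0.80, 0.90          # illustrative values only
rd_gene_and_env = 1.0                      # both G and E studied, D observed perfectly
rd_gene_only = q1                          # G studied alone, D observed perfectly
rd_observed = observed_rd_genotype_only(q1, se_d, sp_d)
rd_formula = q1 * (se_d + sp_d - 1)        # expression given in the text

print(rd_gene_and_env, rd_gene_only, round(rd_observed, 4), round(rd_formula, 4))
```

With these illustrative values, a perfect risk difference of 1.00 shrinks to 0.05 when the environment is ignored and further when the diagnosis is imperfect, which is the pattern the text describes.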
6.2 Current methodological barriers

6.2.1 Case–control studies

Case–control studies are one of the most common approaches to examining the genetic bases of a disorder. It is very difficult in population-based studies to generate a sufficient number of cases of the disorder (D+) in a prospective study, when the disorder of interest is, as it often is, quite rare. A favoured alternative has long been to do a retrospective case–control study, in which N1 subjects are sampled from among those who have already had onset of the disorder of interest, and N2 subjects from among those without that disorder.

In a case–control study, clearly genotype can be as reliably and accurately measured on the total sample after onset of the disorder as it might have been prospectively before the onset of the disorder. Accurate measurement of environment, gene expression, or events prior to onset of the disorder, however, is very difficult. Memory is flawed, records are often incomplete, and, most important, recall is often coloured by the subject’s knowledge that they do or do not have the disorder in question at the time of assessment.

Moreover, Berkson’s Fallacy has been known since the 1950s [10, 11]: The samples of ‘cases’ and ‘controls’ may not be representative of the cases or the controls that would have resulted in prospective study of a population for a variety of reasons. What one sees in a case–control study may well misrepresent what would have been seen in a prospective study of the same risk factors and outcomes in the same population.

In short, while the first studies to explore associations between risk factors and outcomes may well be case–control studies, accepting inferences from such studies as ‘scientific truth’ is risky. Instead, such studies should be used to generate strong hypotheses to be tested in subsequent prospective studies, and to yield the information necessary to design those prospective studies to be powerful and cost-effective. The type of study design necessary to understanding gene moderation or mediation corresponds to the third type of study design discussed by Fitzmaurice and Ravichandran (Chapter 2), in which a random number of subjects are sampled from a population and both binary characteristics are measured on each subject.
6.2.2 Statistical significance necessary but not sufficient

Concepts related to statistical hypothesis testing, such as ‘significance level’, ‘p-values’ and ‘power’, came into common use in biomedical research about the mid-twentieth century. In recent years, however, there has been a growing realisation of the limitations of statistical hypothesis testing [12–18]. One prominent epidemiological journal actually banned the use of ‘p-values’ for a time, while psychology [14, 19] and medicine [20–22] tried to deal with the problem by urging that every p-value be accompanied by a clinically interpretable effect size and its confidence interval.

In the way such testing is commonly done (testing null hypotheses of randomness), ‘statistically significant’ generally means that the data are sufficient to demonstrate a non-random association, a comment on the data, not on the strength of association. Any non-null association, no matter how trivial, can be shown to be ‘statistically significant’ provided the sample size is large enough. The crucial issue is whether the strength of association is enough to warrant further interest in that association.

In dealing with how risk factors ‘work together’ in the MacArthur approach, effect sizes are decomposed to understand the contributions of various risk factors. In essence, this approach assumes that, when there is rationale and justification for suspecting an association between G and D, most, if not all, effect sizes are non-null, although many, perhaps most, may be trivial. Certainly in all research into the risk factors (genetic or otherwise) for a disorder, it is necessary to establish statistical significance to warrant drawing any conclusions. It is not sufficient to stop there: a clinically interpretable effect size is necessary.
6.2.3 Odds ratio is not a clinically interpretable effect size

The most common effect size used in epidemiological and genetic research is the odds ratio. If the probability of D+ in the high risk group is Q1 and that in the low risk group is Q0, the odds of D+ in the two groups are Q1/(1 − Q1) and Q0/(1 − Q0). The odds ratio (OR) is the ratio of those two odds. The odds ratio was originally introduced as the likelihood ratio test statistic to test the null hypothesis of random differences between the two groups (Q1 = Q0), and remains an excellent indicator of non-randomness [23]. OR = 1 means that Q1 = Q0; OR ≠ 1 indicates non-random association. Use of the odds ratio to test for non-random association, for example using logistic regression analysis, is both common and recommended.

However, there are a number of arguments against the odds ratio as an interpretable effect size [23–27], all converging on one conclusion: the odds ratio should not be used as an effect size, but only as an indicator of non-randomness. Consider three questions: If not the odds ratio, then what? Why the odds ratio? Why not the odds ratio?

If not the odds ratio, then what?

What has been shown is that, once one excludes the odds ratio from consideration, all the other common measures of 2 × 2 association correspond to one or another of the weighted kappa coefficients [25]. Which weighted kappa is appropriate in any given situation depends on the relative clinical importance of false positives to false negatives (which determines the weight in the weighted kappa). Thus among commonly used measures of 2 × 2 association, the odds ratio is an outlier.

For the purpose of the present discussion, one commonly used such weighted kappa will be used, the RD = Q1 − Q0, where Q1 and Q0 are the incidence/prevalence of the disorder in the two groups compared. This is not to say that RD is the only appropriate choice, but this effect size is a reasonable choice in many clinical situations. Moreover RD is easily translated to ‘number needed to take’ (NNT = 1/RD), an effect size easily interpretable for clinical or public policy decision-making [28–31].

Suppose one could magically transfer subjects in the high risk group (Q1) to the low risk group (Q0). How many high risk patients would have to be transferred to hope to prevent one case of D+? The answer to that question is NNT = 1/(Q1 − Q0). If NNT = 1, every high risk subject has the disorder and every low risk subject does not: As soon as one patient is transferred, one case may be prevented. If, on the other hand, one needs to transfer 3 or 30 or 3000 or even more high risk subjects to prevent one case, the clinical importance of the risk factor that defines ‘high risk’ becomes progressively weaker.

The choice between odds ratio and RD (or NNT) is of no concern when association is random. In that case OR = 1, RD = 0 and NNT is infinite. Also there is consistency when the probability of being in the high risk group equals the probability of having D+ in the population, for then

\[ \mathrm{RD} = \frac{\sqrt{\mathrm{OR}} - 1}{\sqrt{\mathrm{OR}} + 1} = \frac{1}{\mathrm{NNT}} \]

Thus, under this condition, OR = 4 corresponds to RD = 1/3 (NNT = 3). Otherwise, NNT is always greater than \( (\sqrt{\mathrm{OR}} + 1)/(\sqrt{\mathrm{OR}} - 1) \) [31]. Thus OR = 4 may correspond to NNT = 3, or to NNT = 30, 300, 3000, . . ., which makes interpretation for public health purposes impossible.

Why the odds ratio?

If the magnitude of the odds ratio is so difficult to interpret for public health purposes, what arguments have been given supporting its use (other than as an indication of non-randomness)?
Epidemiologists often suggest that this is the statistic recommended by biostatisticians, and biostatisticians suggest that this is the statistic demanded by epidemiologists. If either claim is true, such recommendations in absence of a sound scientific basis are questionable.

The most common reason given is ‘because this is what we’ve always used’ or ‘this is what everyone uses’, that is, that it is the most commonly used measure of 2 × 2 association (Section 6.3.1). This claim is true, but again, leaves the scientific basis for such common use unclear.

Another reason is that, unlike many measures of 2 × 2 association, the odds ratio is symmetric in the roles of Y and X (see Section 6.3.1), in that reversing the roles of Y and X yields the same odds ratio. However, the weighted kappa placing equal weight on false positives and false negatives, and the phi coefficient, have the same property, but generally yield very different conclusions. Similarly, it is argued that the odds ratio approximates another measure of association, the relative risk, but that claim is true only for a very low prevalence situation (a ‘rare disease’). In any 2 × 2 table there are four relative risks, and the odds ratio is always larger than the largest one.

Another, less often articulated reason is that the odds ratio is often big when most other effect sizes indicate a trivial effect, often stated as a claim that the odds ratio is more sensitive to deviations from randomness. As noted above, this is often true but leaves the question unanswered as to whether the odds ratio is conveying the right or wrong message.

Often it is pointed out how easy the odds ratio is to compute, most particularly that it can be estimated equally well with a prospective naturalistic or stratified sample or from an unbiased case–control sample [32]. However, if the message conveyed by the odds ratio is wrong, ease of computation is not a valid scientific reason for its use.

In short, scientific support for the use of the odds ratio as an interpretable effect size seems to be lacking, although it must again be emphasised that it remains the index of choice in testing null hypotheses of randomness, and is very convenient for use in multivariate modelling, for example in the logistic regression model (Chapter 2).

Why not the odds ratio?
The fundamental problem with the odds ratio lies in the fact that it is a ratio, very sensitive to the magnitude, and to the error of estimation, of a denominator that often approaches zero. For example, suppose that underlying the categorical diagnoses D+ and D−, there is a dimensional diagnosis D [33], which is normally distributed with equal variances in the two groups, where a categorical diagnosis is obtained by dichotomising D at some cut-point. The effect size differentiating the two groups is δ = (μ1 − μ0)/σ, with μ1 and μ0 the means of the dimensional diagnoses in the high and low risk groups, and σ their common standard deviation. Then, where Φ(·) is the standard normal distribution function, and the cut-point c is measured in σ-units from the point halfway between the two means, (μ1 + μ0)/2, the odds ratio would be:

$$\mathrm{OR} = \frac{[1 - \Phi(c - \delta)]\,\Phi(c)}{\Phi(c - \delta)\,[1 - \Phi(c)]}.$$
This odds ratio is shown in Figure 6.1 for various cut-points c and for various values of δ. Clearly if δ = 0, OR = 1, the null value, regardless of the cut-point c (and accordingly RD = 0 and NNT is infinite). If δ ≠ 0, OR takes on its minimal value halfway between the two means, at which point RD = (√OR − 1)/(√OR + 1). For cut-points above and below this midpoint, OR monotonically increases to infinity (and RD monotonically decreases to 0). The crucial fact is that when δ > 0, one can get an odds ratio as large as one can possibly desire, simply by dichotomising far enough in the tails of the distribution. From another perspective, one of the most important research uses of an effect size is power computation in planning a hypothesis-testing study. However, as is well known, power computations cannot be done using the odds ratio as the effect size. For example, in testing the simple hypothesis OR = 1 versus OR = 4 at the 5% level of significance, there is no sample size large enough to assure at least 80% power whenever OR = 4. This is because OR = 4 may mean that Q1 = 2/3 and Q0 = 1/3, in which case the necessary sample size per group would be 34, or it may mean that Q1 = 0.004 and Q0 = 0.001, in which case the necessary sample size would be 4294 per group (and even larger for smaller values of Q1 and Q0 having OR = 4). Generally, to do power computations, users switch to other effect sizes such as RD. For example, OR = 4 corresponds at best to RD = 1/3. With 34 subjects per group, one has at least 80% power to detect any Q1 and Q0 pair with RD = 1/3.
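Both points in this passage can be checked numerically. The following is a minimal Python sketch, not the authors' computation. It assumes the cut-point is measured from the midpoint between the two means, so that the two normal arguments become c − δ/2 and c + δ/2 (the printed formula corresponds to shifting c by δ/2). The function names, the choice δ = 0.5 and the unpooled normal-approximation sample-size formula are all assumptions made for illustration, so the sample sizes it prints need not reproduce the figures quoted above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def risks_at_cut(c, delta):
    """Return (Q1, Q0): proportions above the cut-point in the high and low risk groups.

    Assumption of this sketch: c is in SD units from the midpoint of the two
    means, so the high risk mean lies at +delta/2 and the low risk mean at -delta/2.
    """
    q1 = 1.0 - STD_NORMAL.cdf(c - delta / 2.0)
    q0 = 1.0 - STD_NORMAL.cdf(c + delta / 2.0)
    return q1, q0


def odds_ratio(q1, q0):
    """Odds of the outcome in the high risk group divided by odds in the low risk group."""
    return (q1 * (1.0 - q0)) / ((1.0 - q1) * q0)


def n_per_group(q1, q0, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison of two proportions.

    Uses the unpooled normal approximation; other approximations give
    somewhat different numbers.
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = STD_NORMAL.inv_cdf(power)
    variance = q1 * (1.0 - q1) + q0 * (1.0 - q0)
    return (z_alpha + z_beta) ** 2 * variance / (q1 - q0) ** 2


if __name__ == "__main__":
    delta = 0.5  # hypothetical standardised mean difference
    for c in (0.0, 1.0, 2.0, 3.0):
        q1, q0 = risks_at_cut(c, delta)
        print(f"c = {c:.1f}  OR = {odds_ratio(q1, q0):.2f}  RD = {q1 - q0:.4f}")

    # Two (Q1, Q0) pairs with an odds ratio of about 4 but very different RD
    for q1, q0 in ((2 / 3, 1 / 3), (0.004, 0.001)):
        print(f"Q1 = {q1:.4f}  Q0 = {q0:.4f}  OR = {odds_ratio(q1, q0):.2f}  "
              f"approx. n per group = {n_per_group(q1, q0):.0f}")
```

With a fixed δ, moving the cut-point away from the midpoint raises the odds ratio while the risk difference shrinks, and two pairs of risks sharing an odds ratio near 4 call for sample sizes that differ by orders of magnitude.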
[Figure 6.1: odds ratio (vertical axis, 0 to 40) plotted against cutpoint (horizontal axis, −4.0 to 4.0), with one curve for each of δ = 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0.]
Fig 6.1 Values of odds ratio obtained by dichotomising a hypothetical dimensional diagnosis having normal distributions with equal variance in the high risk and low risk subpopulations, where the effect size comparing those distributions is δ, the standardised mean difference between the two means. The cutpoints are measured in standard deviation units from the point halfway between the means of the two distributions.
There are many other such arguments that raise questions about the value of the odds ratio as an interpretable effect size, and few, other than those based on custom or convenience, supporting its use as such.
6.2.4 Correlation versus interaction
A common source of confusion is that between two risk factors being correlated and two risk factors interacting. In Table 6.1, G and E are correlated if q1 ≠ q2, and then G and E are correlated regardless of which outcome is being considered in that population. On the other hand, G and E interact in their effect on a specific outcome D if the effect size relating E to D is different for those with G = 1 than for those with G = 0, or equivalently if the effect size relating G to D is different for those with E = 1 than for those with E = 0, that is if P11 − P10 − P01 + P00 ≠ 0. G and E may interact with respect to one choice of D, but not with another. Thus ‘correlation’ refers to the relationship between two variables, ‘interaction’ to the association of two variables (or more) to a third. Interaction may exist with or without correlation; correlation may exist with or without interaction. That now sets the basis for consideration of moderation and mediation.
6.3 Moderation, mediation and other ways in which risk factors ‘work together’
6.3.1 G moderates E in its effect on D
In general, to show that M moderates the effect of T on O, one must show that in the population of interest:
1 M must precede T which must precede O.
2 M and T are uncorrelated.
3 The effect size relating T to O is different depending on M.
As already specified, G precedes E precedes D (satisfying criterion (1)). Note that here G means genotype, which is fixed at the fusion of the gametes, and which therefore precedes all E and D, as opposed to gene expression which can vary across the lifespan and which might be considered here as one possibility for E. To show that G moderates E on D, G and E must be shown to be uncorrelated (q1 = q2). Since the effect size relating E to D if G = 1 is P11 − P10 and that relating E to D if G = 0 is P01 − P00, criterion (3) is satisfied if P11 − P10 − P01 + P00 ≠ 0 (a non-zero ‘interaction effect’ between G and E on D). For example, several studies [34, 35] have shown that a certain set of genes moderates the effects of repeated rhythmic vestibular stimulation (tossing) of young EL mice on later occurrence of epileptic seizures. Indeed ‘susceptibility’ genes may often be those genes that moderate the effect of environmental insults on the subsequent onset of disorders.
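The moderation criteria for binary G and E can be written down directly in terms of the population quantities used above. The Python sketch below is only an illustration under stated assumptions: q1 and q2 are read as the probabilities of E = 1 when G = 1 and G = 0, P11, P10, P01 and P00 as the probabilities of D+ with the first index for G and the second for E, and the function names, tolerance and numerical values are hypothetical. It works on population values; with sample data each comparison would need estimation and testing rather than an equality check.

```python
def is_correlated(q1, q2, tol=1e-9):
    """G and E are correlated when P(E=1 | G=1) differs from P(E=1 | G=0)."""
    return abs(q1 - q2) > tol


def interaction_effect(p11, p10, p01, p00):
    """RD for E among G = 1 minus RD for E among G = 0."""
    return (p11 - p10) - (p01 - p00)


def g_moderates_e(q1, q2, p11, p10, p01, p00, tol=1e-9):
    """Check criteria (2) and (3); criterion (1), G precedes E precedes D, is taken as given."""
    uncorrelated = not is_correlated(q1, q2, tol)
    interacts = abs(interaction_effect(p11, p10, p01, p00)) > tol
    return uncorrelated and interacts


if __name__ == "__main__":
    # Hypothetical population values: E raises risk far more when G = 1 than when G = 0
    print(g_moderates_e(q1=0.3, q2=0.3, p11=0.40, p10=0.10, p01=0.12, p00=0.10))
```

Keeping the correlation check separate from the interaction check mirrors the distinction drawn in the text: either can hold without the other.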
6.3.2 Mediation: E mediates the effect of G on D
In general, to show that M mediates the effect of T on O, one must show that in the population of interest:
1 T must precede M which must precede O.
2 T and M are correlated.
3 The effect size of T on O can be explained in part by the effect of T on M.
Since E is temporally between G and D (criterion (1)), E mediates G on D only if G and E are correlated (q1 ≠ q2) (criterion (2)). The probability of D+ when G = 1 is q1P11 + (1 − q1)P10, and when G = 0 is q2P01 + (1 − q2)P00. The overall RD associated with G is then:

$$\mathrm{RD} = q_1 P_{11} + (1 - q_1) P_{10} - q_2 P_{01} - (1 - q_2) P_{00}.$$

Let q∗ = (q1 + q2)/2 and Δq = (q1 − q2)/2. Then:

$$\mathrm{RD} = (P_{10} - P_{00}) + \Delta q\,(P_{11} - P_{10} + P_{01} - P_{00}) + q^{*}\,(P_{11} - P_{01} - P_{10} + P_{00}).$$

The first bracket indicates the effect size of G on D in the absence of E (E = 0). The second term indicates how much is contributed to the effect size by the ‘main effect’ of E on D (P11 − P10 + P01 − P00), provided G and E are correlated (Δq ≠ 0), and the third term how much by the ‘interaction’ (P11 − P01 − P10 + P00) between G and E on D. Thus the latter two terms indicate how much of the effect of G on D may be explained by E. Only if P11 = P10 and P01 = P00, in which case there is neither a main nor an interactive effect of E on D, does E not ‘matter’ to the outcome. For example, the phenylketonuria (PKU) gene (G) is mediated in its effect on PKU-related retardation (D) by the PKU enzyme and its effects (E). This is important, since one can manipulate the PKU enzyme effects by dietary control to prevent PKU-related retardation. In general, all links in a causal chain leading to an outcome are mediators, but not all mediators are links in causal chains. Thus mediators suggest possible causal links; they do not prove they exist. There are other ways in which two factors can ‘work together’ in their effect on a third, that are also important.
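The three-term decomposition of RD is an algebraic identity and can be checked numerically. The sketch below is an illustration only: the function names and the numerical values are hypothetical, q∗ is the mean of q1 and q2, and the coefficient of the ‘main effect’ term is half their difference, (q1 − q2)/2.

```python
def overall_rd(q1, q2, p11, p10, p01, p00):
    """Risk of D+ when G = 1 minus risk of D+ when G = 0, averaging over E."""
    risk_g1 = q1 * p11 + (1 - q1) * p10
    risk_g0 = q2 * p01 + (1 - q2) * p00
    return risk_g1 - risk_g0


def rd_components(q1, q2, p11, p10, p01, p00):
    """Split the overall RD into the three terms described in the text."""
    q_star = (q1 + q2) / 2      # average probability of E = 1
    half_diff = (q1 - q2) / 2   # non-zero only when G and E are correlated
    effect_without_e = p10 - p00
    via_main_effect = half_diff * (p11 - p10 + p01 - p00)
    via_interaction = q_star * (p11 - p01 - p10 + p00)
    return effect_without_e, via_main_effect, via_interaction


if __name__ == "__main__":
    values = dict(q1=0.6, q2=0.2, p11=0.5, p10=0.2, p01=0.3, p00=0.1)  # hypothetical
    parts = rd_components(**values)
    print("components:", parts)
    print("sum of components:", sum(parts))
    print("overall RD:", overall_rd(**values))
```

When q1 = q2 the middle term vanishes, which is the numerical counterpart of criterion (2): without correlation between G and E, none of the effect of G can be routed through the main effect of E.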
6.3.3 Independent risk factors
If two factors (perhaps G and E as above, but perhaps two choices for G or two choices for E) are uncorrelated, but one does not moderate the other, either because of lack of time precedence or absence of interaction, such factors are called ‘independent risk factors’ for the outcome. Independent risk factors lie on separate paths leading to the outcome. For example, gender and age are independent risk factors for many disorders (e.g. major depression, eating disorders).
6.3.4 Proxies
If two factors, G and E, are correlated with G preceding E, but mediation does not exist because P10 = P11 and P01 = P00, then E is said to be ‘proxy to’ G. In such cases E, considered alone, may be a risk factor for D, but when G and E are considered together, only G matters. For example, G might be gender, E might be ‘excellent ball-throwing ability at age 10’ and D+ might be the onset of depression during the teenage years. Girls are more likely to develop depression during the teenage years, as well as more likely to be poor ball-throwers. If G were here ignored, it might well be that E would be identified as a risk factor for subsequent onset of depression. However, when G and E here are considered together, among boys as well as among girls, ball-throwing ability is unlikely to be correlated with D. If so, ball-throwing ability (E) is proxy to gender (G) for onset of depression (D), that is it is probably not worthwhile teaching girls how to throw a ball better in order to prevent depression! Here, we’ve taken a silly example, but often proxies are taken all too seriously. A similar situation occurs with two Gs or two Es, that is where there is no temporal precedence between the risk factors. If when considered together, only one of two such variables appears to be associated with D, the other is proxy to the variable that matters. Proxies are often found when there is one strong risk factor, but correlates of that risk factor are also simultaneously considered.

6.3.5 Overlapping risk factors
Finally, two risk factors that satisfy criteria (2) and (3) for mediation, but where there is no time precedence, are ‘overlapping risk factors’ for D. Such factors might be closely linked genes, or might be multiple measures of the same underlying construct. In such cases, it would be far better to combine such measures a priori, to choose the best among such measures, or to select one measure that best represents the common construct of interest, than to include multiple measures of the same underlying construct. Including multiple overlapping measures does not add useful information to the data; it only increases the impact of errors of measurement in the attenuating effects described above. Moreover, including many measures of the same construct places additional measurement burden on both the subjects and assessors, which often leads to decreased reliability [36] of these measures, and to drop-out from prospective studies, both of which decrease the scientific value of a study while increasing its costs and difficulties.

6.3.6 Summary
The principles above can be applied to any two risk factors (M, T) for an outcome (O) of interest in the population as summarised in Table 6.2.

Table 6.2 M and T are risk factors for an outcome O. In absence of temporal precedence between M and T, the labels M and T are arbitrarily assigned to the risk factors. Otherwise, if M and T are uncorrelated, M is assigned to the earlier risk factor, T to the later. If M and T are correlated, T is assigned to the earlier risk factor, M to the later.

| With reference to O: | Time precedence? | Correlation? | Analytic criterion | Possible action? |
| M moderates T | Yes | No | Interactive effect | Stratify on M |
| M and T independent | Yes | No | No interactive effect | – |
| M and T independent | No | No | Both M and T matter | – |
| M mediates T | Yes | Yes | Both M and T matter | – |
| M is proxy to T | Yes/no | Yes | Only T matters | Set aside M |
| T is proxy to M | No | Yes | Only M matters | Set aside T |
| M and T overlapping | No | Yes | Both M and T matter | Combine M, T |

6.4 Extensions
6.4.1 One risk factor at each of two time points measured at any level
The principles here discussed for one binary risk factor at each of two time points are easily extended. With binary G and E, but with an outcome that may be binary, categorical, interval, or even, in some cases, multivariate, the principles in Table 6.2 can be directly used replacing RD with the non-parametric effect size area under the curve [37]. When either or both of G and E are ordinal, a linear model might be considered, as suggested by Baron and Kenny, for example multiple linear regression. If the outcome is time to onset, the Cox proportional hazards model might be considered. In all such models, interaction is indicated by a non-zero β3. Both risk factors matter if any two of the three regression coefficients are non-zero. Only T matters when both β2 and β3 are zero; only M matters when both β1 and β3 are zero. The effect sizes in such cases are functions of the standardised regression coefficients [37].
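The rows of Table 6.2, read together with the coefficient rules just stated, amount to a small decision procedure. The following Python sketch is a literal transcription of those rows and should be taken as an illustration, not as the authors' algorithm. It assumes that β1 is the coefficient of T, β2 that of M and β3 that of their interaction (an inference from the rules above), that population coefficients are supplied (with sample data, ‘zero’ would have to be judged by estimation), and that combinations the table does not list return `None`. The function name, argument names and tolerance are hypothetical.

```python
# Possible actions as listed in Table 6.2 (rows with a dash have no entry here)
ACTIONS = {
    "M moderates T": "Stratify on M",
    "M is proxy to T": "Set aside M",
    "T is proxy to M": "Set aside T",
    "M and T overlapping": "Combine M, T",
}


def classify_pair(time_precedence, correlated, beta_t, beta_m, beta_int, tol=1e-9):
    """Label how risk factors M and T relate for an outcome O, row by row from Table 6.2.

    beta_t, beta_m and beta_int stand for the coefficients of T, M and the
    M-by-T interaction (beta1, beta2 and beta3 under the assumption above).
    """
    t_on = abs(beta_t) > tol
    m_on = abs(beta_m) > tol
    int_on = abs(beta_int) > tol
    both_matter = (t_on + m_on + int_on) >= 2   # any two of the three are non-zero
    only_t = t_on and not m_on and not int_on
    only_m = m_on and not t_on and not int_on

    if not correlated:
        if time_precedence:
            # M and T are both taken to be risk factors for O, as the table presupposes
            return "M moderates T" if int_on else "M and T independent"
        return "M and T independent" if both_matter else None

    if only_t:
        return "M is proxy to T"
    if only_m:
        # The table lists this row only when there is no time precedence
        return None if time_precedence else "T is proxy to M"
    if both_matter:
        return "M mediates T" if time_precedence else "M and T overlapping"
    return None


if __name__ == "__main__":
    label = classify_pair(time_precedence=True, correlated=True,
                          beta_t=0.2, beta_m=0.0, beta_int=0.0)
    print(label, "->", ACTIONS.get(label, "no action listed"))
```

Returning `None` for unlisted combinations keeps the sketch from asserting more than the table does.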
6.4.2 Multiple risk factors at two time points
Moreover, the process can be extended to the consideration of multiple measures at two time points. First, one would examine all the pairs of risk factors in the set of risk factors measured at each time point, identifying and removing proxies, combining or otherwise removing overlapping variables. By this process the set of variables measured at each time is reduced to a smaller set of independent risk factors for the outcome. Then one would examine whether any variables in the earlier set moderate any of the variables in the later set. If so, the sample would have to be stratified on those moderators, for identification of a moderator of treatment response often means that the mediation patterns will differ in strata defined by moderators. Finally, within each moderator-defined stratum, one would seek variables in the later set that mediate variables in the earlier set on the outcome. What may result are one or more short mediational chains leading to the outcome of interest.

6.4.3 Multiple risk factors at multiple time points
Finally the process can be extended even to include consideration of, not merely two time points, but multiple time points prior to the outcome by applying the above process to Times 1 and 2, then to Times 2 and 3, and so on. For example, in considering the risk factors for psychiatric symptoms for third graders, Essex et al. [9] considered variables measured (i) in infancy, (ii) in the preschool period and (iii) in kindergarten and first grade, all to predict a third grade outcome. This process has been compared to piecing together a jigsaw puzzle, by first discarding irrelevant or extraneous pieces (proxies and overlapping risk factors), sorting pieces that belong to different ‘pictures’ (moderators), then fitting pairs of pieces together systematically (mediators and independent risk factors) to begin to see the ‘whole picture(s)’.
6.5 Beyond moderators and mediators
Einstein is quoted as defining insanity as doing the same thing over and over again and expecting different results, a comment that applies as much to choice of methods as to other activities. While it is generally conceded that psychiatric disorders are ‘complex’, the methods commonly used to investigate such disorders have often been selected for their simplicity and ease of use, often despite evidence that they cannot resolve complex problems:
• Sampling: Case–control studies are notoriously subject to sampling and measurement biases. These, as well as cross-sectional studies, cannot be used to establish time precedence and thus have limited utility in identification of risk factors, or of how risk factors ‘work together’. Case–control and cross-sectional studies are easy; the prospective cohort studies necessary to risk research are difficult, but essential.
• Terminology: There are terms in common use (beyond ‘moderators’ and ‘mediators’) that are questionable as they are currently applied. A ‘confounder’ is defined [38] (p. 35) as ‘A variable that can cause or prevent the outcome of interest, is not an intermediate variable (mediator), and is associated with the factor under investigation’. If the causal paths to an outcome were known, there would be little point to the study of risk factors. The causal references in the definition aside, a ‘confounder’ may be proxy to the factor under investigation or that factor proxy to the confounder, or the two might be overlapping. It makes a difference whether the ‘confounder’ should be set aside and the risk factor under investigation retained or vice versa. Efforts to ‘control for’ or ‘adjust for’ certain ‘confounders’ are often motivated by the desire to estimate the specific causal effect of a selected risk factor of interest. However, causation cannot be inferred simply from correlation, and if one risk factor moderates or mediates another, even if causal, their effects cannot be separated. The phrase ‘independent risk factor’ is often applied to a risk factor that adds to the predictive value for the outcome after another risk factor is considered. Thus the term might be applied to overlapping risk factors, to a moderator, to a mediator, as well as to what in the MacArthur model is more narrowly defined as an independent risk factor. The usual use of the term seems vague and can be misleading. Thus the MacArthur approach proposes the definition in Table 6.2, which requires both that the factors be independent of each other, and that their effects on the outcome be independent.
• Analysis: Entering multiple risk factors into multiple regression models omitting interactions is easy; one need merely enter all the data into a computer program and interpret what results. Carefully examining every pair-wise association and taking the correct action for each pair, as suggested in the MacArthur methods, is challenging. Including interactions in such models requires appropriate centring [39], larger sample sizes, and careful and thoughtful interpretations, and is difficult. However, omitting interactions that exist in the population both biases results and reduces power to detect associations. Ignoring interactions that might signal moderation effects is particularly troublesome. Since the paths leading to the disorder may differ in the subgroups defined by a moderator of subsequent risk factors on the disorder, problems associated with Simpson’s paradox may be quite prevalent [40–43]. In brief, if the risk factors in subpopulations defined by a moderator differ, the correlation obtained by ‘muddling’ the subpopulations mixes within group associations (which differ and are meaningful) with between group associations (which may be irrelevant). The associations one observes may be misleading. Moreover, even in absence of interactions in the population, the inclusion of proxies or overlapping variables in regression analyses induces problems associated with multicollinearity, again introducing bias and reducing power.
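The way pooling can mix within group and between group associations is easy to see with a small numerical illustration. The counts in the Python sketch below are invented for the purpose and do not come from any study. The subgroup variable is deliberately associated with exposure, which is what lets the between group association dominate; that is an assumption of this illustration, not a claim about any particular moderator.

```python
def risk_difference(exposed, unexposed):
    """RD from two (cases, total) pairs: risk in the exposed minus risk in the unexposed."""
    exposed_cases, exposed_n = exposed
    unexposed_cases, unexposed_n = unexposed
    return exposed_cases / exposed_n - unexposed_cases / unexposed_n


def pooled(strata, group):
    """Add up cases and totals for one exposure group across all subgroups."""
    cases = sum(s[group][0] for s in strata.values())
    total = sum(s[group][1] for s in strata.values())
    return cases, total


# Hypothetical (cases, total) counts in two subgroups
STRATA = {
    "subgroup A": {"exposed": (9, 10), "unexposed": (70, 100)},
    "subgroup B": {"exposed": (30, 100), "unexposed": (2, 10)},
}

if __name__ == "__main__":
    for name, counts in STRATA.items():
        rd = risk_difference(counts["exposed"], counts["unexposed"])
        print(f"{name}: RD = {rd:+.3f}")
    rd_all = risk_difference(pooled(STRATA, "exposed"), pooled(STRATA, "unexposed"))
    print(f"pooled:     RD = {rd_all:+.3f}")
```

In both subgroups the exposed have the higher risk, yet the pooled comparison points the other way, because most exposed subjects come from the low risk subgroup.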
The motivation in developing the MacArthur approach was to examine whether the methods in common use may actually be slowing progress in risk research. Whether the MacArthur approach will lead to more rapid gains in understanding the aetiology of psychiatric disorders remains to be seen.
References
[1] Baron, R.M. and Kenny, D.A. (1986) The moderator–mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol., 51, 1173–1182.
[2] Kraemer, H.C., Stice, E., Kazdin, A. and Kupfer, D. (2001) How do risk factors work together to produce an outcome? Mediators, moderators, independent, overlapping and proxy risk factors. Am. J. Psychiatry, 158, 848–856.
[3] Kraemer, H.C., Wilson, G.T., Fairburn, C.G. et al. (2002) Mediators and moderators of treatment effects in randomized clinical trials. Arch. Gen. Psychiatry, 59, 877–883.
[4] Kraemer, H.C., Kiernan, M., Essex, M.J. et al. (2008) How and why criteria defining moderators and mediators differ between the Baron & Kenny and MacArthur approaches. Health Psychol., 27 (2), S101–S108.
[5] King, A.C., Ahn, D.F., Atienza, A.A. et al. (2008) Exploring refinements in targeted behavioral medical intervention to advance public health. Ann. Behav. Med., 35 (3), 251–260.
[6] Kraemer, H.C., Frank, E. and Kupfer, D.J. (2006) Moderators of treatment outcomes: clinical, research, and policy importance. J. Am. Med. Assoc., 296 (10), 1–4.
[7] Owens, E.B., Hinshaw, S.P., Kraemer, H.C. et al. (2003) What treatment for whom for ADHD: moderators of treatment response in the MTA. J. Consult. Clin. Psychol., 71 (3), 540–552.
[8] Boyce, W.T., Essex, M.J., Alkon, A. et al. (2006) Early father involvement moderates biobehavioral susceptibility to mental health problems in middle childhood. J. Am. Acad. Child Adolesc. Psychiatry, 45 (12), 1510–1520.
[9] Essex, M.J., Kraemer, H.C., Armstrong, J.M. et al. (2006) Exploring risk factors for the emergence of children’s mental health problems. Arch. Gen. Psychiatry, 63, 1246–1256.
[10] Berkson, J. (1946) Limitations of the application of fourfold table analysis to hospital data. Biometrics Bull., 2, 47–53.
[11] Berkson, J. (1955) The statistical study of association between smoking and lung cancer. Proc. Staff Meet. Mayo Clin., 30, 56–60.
[12] Abelson, R.P. (1997) On the surprising longevity of flogged horses: why there is a case for the significance test. Psychol. Sci., 8 (1), 12–15.
[13] Borenstein, M. (1997) Hypothesis testing and effect size estimation in clinical trials. Ann. Allergy Asthma Immunol., 78, 5–16.
[14] Borenstein, M. (1998) The shift from significance testing to effect size estimation, in Research and Methods, Comprehensive Clinical Psychology, vol. 3 (eds A.S. Bellack and M. Hersen), Elsevier Science Publishing Company, Burlington, MD, pp. 319–349.
[15] Cohen, J. (1995) The earth is round (p < .05). Am. Psychol., 49, 997–1003.
[16] Dar, R., Serlin, R.C. and Omer, H. (1994) Misuse of statistical tests in three decades of psychotherapy research. J. Consult. Clin. Res., 62, 75–82.
[17] Hunter, J.E. (1997) Needed: a ban on the significance test. Psychol. Sci., 8 (1), 3–7.
[18] Shrout, P.E. (1997) Should significance tests be banned? Introduction to a special section exploring the pros and cons. Psychol. Sci., 8 (1), 1–2.
[19] Wilkinson, L. and The Task Force on Statistical Inference (1999) Statistical methods in psychology journals: guidelines and explanations. Am. Psychol., 54, 594–604.
[20] Altman, D.G., Schulz, K.F., Moher, D. et al. (2001) The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann. Int. Med., 134 (8), 663–694.
[21] Begg, C., Cho, M., Eastwood, S. et al. (1999) Improving the quality of reporting of randomized controlled trials: the CONSORT statement. J. Am. Med. Assoc., 276, 637–639.
[22] Rennie, D. (1996) How to report randomized controlled trials: the CONSORT Statement. J. Am. Med. Assoc., 276 (8), 649.
[23] Kraemer, H.C. (2007) Correlation coefficients in medical research: from product moment correlation to the odds ratio. Stat. Methods Med. Res., 15 (6), 525–545.
[24] Kraemer, H.C. (2004) Reconsidering the odds ratio as a measure of 2X2 Association in a population. Stat. Med., 23 (2), 257–270.
[25] Kraemer, H.C., Kazdin, A.E., Offord, D.R. et al. (1999) Measuring the potency of a risk factor for clinical or policy significance. Psychol. Methods, 4 (3), 257–271.
[26] Newcombe, R.G. (2006) A deficiency of the odds ratio as a measure of effect size. Stat. Med., 25, 4235–4240.
[27] Sackett, D.L. (1996) Down with odds ratios! Evid. Based Med., 1, 164–166.
[28] Altman, D.G. (1998) Confidence intervals for the number needed to treat. Br. Med. J., 317 (7168), 1309–1312.
[29] Altman, D.G. and Andersen, K. (1999) Calculating the number needed to treat for trials where the outcome is time to an event. Br. Med. J., 319, 1492–1495.
[30] Cook, R.J. and Sackett, D.L. (1995) The number needed to treat: a clinically useful measure of treatment effect. Br. Med. J., 310, 452–454.
[31] Kraemer, H.C. and Kupfer, D.J. (2006) Size of treatment effects and their importance to clinical research and practice. Biol. Psychiatry, 59 (11), 990–996.
[32] Cornfield, J. (1956) A statistical problem arising from retrospective studies, in Proceedings of the Third Berkeley Symposium, vol. 4 (ed J. Neyman), University of California Press, Berkeley, CA, p. 135.
[33] Helzer, J.E., Kraemer, H.C. and Krueger, R.F. (eds) (2008) Dimensional Approaches in Diagnostic Classification: Refining the Research Agenda for DSM-5, American Psychiatric Association, Arlington, VA.
[34] Poderycki, M.J., Simoes, J.M., Todorova, M.A. et al. (1998) Environmental influences on epilepsy gene mapping in EL mice. J. Neurogenet., 12 (2), 67–85.
[35] Todorova, M.T., Mantis, J.G., Le, M. et al. (2006) Genetic and environmental interactions determine seizure susceptibility in epileptic EL mice. Genes Brain Behav., 5 (7), 518–527.
[36] Thiemann, S., Csernansky, J.G. and Berger, P.A. (1987) Rating scales in research: the case of negative symptoms. Psychiatry Res., 20, 47–55.
[37] Kraemer, H.C. (2008) Toward non-parametric and clinically meaningful moderators and mediators. Stat. Med., 27, 1679–1692.
[38] Last, J.M. (1995) A Dictionary of Epidemiology, Oxford University Press, New York.
[39] Kraemer, H.C. and Blasey, C. (2004) Centring in regression analysis: a strategy to prevent errors in statistical inference. Int. J. Methods Psychiatr. Res., 13 (3), 141–151.
[40] Hand, D.J. (1979) Psychiatric examples of Simpson’s paradox. Br. J. Psychiatry, 135, 90–96.
[41] Kraemer, H.C. (1978) Individual and ecological correlation in a general context: investigation of testosterone and orgasmic frequency in the human male. Behav. Sci., 23, 67–72.
[42] Samuels, M.L. (1951) Simpson’s Paradox and related phenomena. J. Am. Stat. Assoc., 88, 81–88.
[43] Wagner, C.H. (1982) Simpson’s paradox in real life. Am. Stat., 36, 46–48.
7 Validity: Definitions and applications to psychiatric research
Jill M. Goldstein,1,2 Sara Cherkerzian1,2 and John C. Simpson3
1 Departments of Psychiatry and Medicine at Brigham and Women’s Hospital (BWH), Harvard Medical School, Boston, MA, USA
2 Connors Center for Women’s Health and Gender Biology, Department of Medicine, Brigham & Women’s Hospital, Boston, MA, USA
3 Department of Psychiatry at VA Boston Healthcare System, Harvard Medical School, Boston, MA, USA
7.1 Introduction Measurement is a process of linking unobservable theoretical concepts to empirical indicators [1]. There are two basic properties of measurement that ensure the strength of this linkage: reliability and validity. In this chapter, we discuss the concept and usage of validity. Reliability was discussed fully in a previous chapter, but, for convenience, we define it here simply as the reproducibility of an empirical measure (e.g. internal consistency of the items in a scale, reproducibility of a measurement on different occasions or agreement between raters). For an empirical indicator to be valid it must first be reliable, but indicators can be reliable without also being valid. There are a number of ways to assess validity, not all of which are used for every measure of interest. In fact, validity has a number of meanings in different contexts and is perhaps one of the most overused words in the scientific literature. In this chapter, we discuss validity as it applies to the measurement of a construct, that is the process of ‘construct validity’. We also discuss validity as it applies to relationships between constructs, that is to the ‘internal
validity’ and ‘external validity’ of a presumed causal relationship. We provide examples of how validity is applied and statistically evaluated in psychiatric research. Finally, we discuss the future of the process of validating psychiatric disorders, given new genetic and brain imaging technologies that will allow for new aetiological discoveries to be incorporated into our concepts of how we define psychiatric disorders.
7.2 Validity of a construct An essential feature of scientific research is often the measurement of abstract concepts and relationships between abstract concepts. Validity can be defined as the extent to which an empirical indicator of a concept actually represents the concept of interest [2–4]. For example, if one used a particular symptom checklist to measure ‘major depressive disorder’ (MDD), validity asks the question, how accurate is this empirical indicator for diagnosing MDD? Thus, validity refers to the questions ‘for what purpose is the indicator being used?’ (e.g. to diagnose MDD) and ‘how accurate is it for that purpose?’ In fact, an indicator
(e.g. an instrument such as a test, a rating or an interview) can be valid for one purpose, but not for another [5]. Thus, one validates the instrument in relation to its intended purpose [1, 4, 5]. If an instrument is to be scientifically useful, it must be both reliable (i.e. result in consistent findings over repeated measurements) and valid (i.e. represent the concept it is intended to represent). Unlike reliability, validity is an unending process [4] in which one attempts to capture the essence of the concept of interest as accurately as possible. It therefore involves a theoretical understanding of the concept of interest in order to measure it accurately. It also involves an assessment of the empirical relationships between an instrument and criteria chosen to evaluate whether the instrument assesses what it is intended to assess. There are three basic ways in which validity is assessed: content validity, criterion validity and construct validity (see Table 7.1).
7.2.1 Content validity
For every abstract concept, there is a universe of items that one might sample in order to measure the concept operationally. Content validity involves the adequacy with which one samples the domain of items [4]. Content validity is ensured by the procedures used to construct items for a test [4]. One must first specify the universe of items that one hypothesises will accurately measure the concept of interest. Second, items are then sampled from this domain. If certain kinds of items are central to understanding the concept, one may decide to oversample these types. Finally, selected items are put into a testable form [1]. For example, if one were interested in measuring (diagnosing) ‘schizophrenia’, one would choose, among other things, items such as bizarre delusions or other types of delusions, various kinds of hallucinations, formal thought disorder and flat affect. An instrument would then be constructed in order to assess these items. Different types of diagnostic instruments have been constructed that are based on certain assumptions about how to acquire accurate assessments of the items. For example, the Diagnostic Interview Schedule (DIS) [6] was designed to allow lay interviewers to assess symptom items in a dichotomous form, that is
as present or absent, and was wholly dependent on the patient’s response to each item. That is, there was an assumption that clinical judgement was unnecessary to assess symptomatology. In contrast, the Schedule for Affective Disorders and Schizophrenia (SADS) [7] was designed to allow for clinical questioning to assess symptom items. Clinical/diagnostic knowledge was required in order to use the SADS instrument. In addition, ratings of SADS items consisted of a severity scale rather than present versus absent, as in the DIS. As one can see, these two instruments are based on different assumptions regarding how to assess a similar domain of symptom items. One can then assess the content validity of these two approaches, even though the evaluation of content validity alone would provide an incomplete assessment of the validity of these instruments. There are two standards by which content validity is assessed: the representativeness of the collection of items chosen and the type of test construction used to measure the concept. There are, however, no statistical means of assessing content validity. Essentially, content validity is dependent on appeals to reason regarding the accuracy of the content sampled, or a consensus among experts, and the adequacy with which the items are put into a testable form [2, 4].
Table 7.1 Three methods to assess validity: content, criterion and construct validity.

Method: Content
Description: Content validity: accuracy with which one samples the domain of items to measure the concept of interest operationally
Measurement: Three steps:
• Specify the universe of items
• Sample items from the domain
• Establish testable form
Two standards:
• Representativeness of the collection of chosen items
• Adequacy of the testable form or construction used to measure the concept
Key differences between measures: No statistical means of assessing

Method: Criterion
Description: Criterion, or predictive, validity measures that which is external to the measurement of the concept itself, the criterion
Four forms:
• Post-dictive validity: correlates events/behaviours that occurred in the past
• Concurrent validity: correlates a measure and some criterion at the same point in time
• Prospective validity: correlates a measure with future criteria
• Discriminant validity: assesses whether the measure of interest is uncorrelated with expected events or behaviours, that is, specificity
Measurement: For a categorical criterion, a qualitative rating is evaluated using methods such as sensitivity, specificity and receiver operating characteristic (ROC) analysis. The empirical relationship between the instrument under study and the criterion is statistically estimated by a correlation if continuous data are used. The strength of the correlation is often interpreted as the strength of the measure’s validity
Key differences between measures: Dependent on empirical results

Method: Construct
Description: Construct validity: extent to which one’s measure is related to other theoretically related and measured concepts
Measurement: Three steps:
• Understand theoretical relationships between related concepts
• Estimate empirical relationships between operational measures
• Interpret empirical evidence within theoretical context
In addition, relate findings from other studies for coherence and consistency
Key differences between measures: Content and criterion validity used alone are limited in contributing to understanding the relationship between the theoretical (unobserved) concept and the empirical measure used to indicate it. Content and criterion validity are part of the process of assessing construct validity

7.2.2 Examples of assessment of content validity
Streiner [8] recommended the use of a ‘content validity matrix’ as a means of ensuring that items in a scale are appropriately tapping the intended domains. In such a matrix, each column represents a distinct domain within the general domain of interest, and each row represents a single item. As a means of improving reliability, each domain is represented by several items (i.e. in terms of the content validity matrix, each column should have check marks in several rows). On the other hand, to minimise ambiguity of interpretation, each item should tap only one domain (i.e. each row should have only a single check mark). As an example of the relevance of domains and items to content validity, we can make use of a study by Schwartz et al. [9] who devised the Social Adjustment Interview Schedule to investigate outcome in schizophrenia. Although this study was from the mid
1970s, it illustrates an important point with regard to assessing content validity and is still relevant today with regard to domains of social adjustment. Within the general domain of social adjustment, the authors conceptually identified eight role areas (i.e. domains) and devised multiple questions (i.e. items) within each role area to address performance and subjective feelings. The different domains included work role (18 items), household role (15 items), marital role (nine items) and social and leisure roles (54 items). Typical items within the work domain included the questions ‘Are you employed now?’ and ‘Are you confident about your ability to do the job?’ Within the marital domain, typical items included ‘In general, how do you and your spouse get along?’ and ‘Have you been able to talk about feelings and problems with your spouse recently?’ There would probably be little disagreement about content validity in this example. In other words, most would agree that these four questions comprise two sets of items, that the first two items are related to work roles, whereas the latter two concern marital roles, and furthermore that there is little if any overlap between the content of these specific items. Not all applications of content validity will be as straightforward, particularly if the concepts being measured are abstract, that is not directly observable. For example, Cloninger [10] devised an 80-item self-report inventory called the Tridimensional Personality Questionnaire (TPQ) to investigate three hypothesised dimensions of personality: harm avoidance, novelty seeking and reward dependence (an instrument that is still used today). Cloninger’s approach to content validity is apparent in his description of how the items were devised (p. 580): ‘To quantify behavioural variation on each dimension separately, questions were specified that were theoretically expected to involve minimal interaction among the dimensions. 
In practice, this meant that questions were chosen to evaluate the behaviours that were thought to be characteristic of individuals deviant on one dimension and average on the others’. As evidence that this standard was achieved, Cloninger reported that the intercorrelations among the three major TPQ scales (calculated using the Pearson product-moment correlation coefficient) were ‘negligible or weak’ and low relative to the reported index of internal consistency (Cronbach’s
α coefficient; see Chapter 5 for a discussion in the context of reliability). However, the interpretation of these results is complicated because weak intercorrelations were expected in some cases for theoretical reasons (e.g. a weak negative correlation between novelty seeking and harm avoidance). A somewhat different perspective was presented by Takeuchi et al. [11], who translated the TPQ into Japanese and replicated Cloninger’s [10] study using a large sample of Japanese university students. Like Cloninger [10], Takeuchi et al. [11] reported negligible or weak intercorrelations between the three major scales. However, they also reported results from a factor analysis that were not completely consistent with the theoretical model. Factor analysis is a multivariate statistical procedure that is used to explain covariation among a set of observed variables in terms of a reduced number of unobserved, latent variables (see, for example, [12] for an introductory explanation). Within the framework of Streiner’s content validity matrix [8], for example, if each derived factor was considered to define a separate domain (i.e. column) in the matrix, then the harm avoidance, novelty seeking and reward dependence items should have loaded on different factors. While this was by and large the result for harm avoidance and reward dependence, ‘the novelty seeking scale showed a scattering factor structure, with several equivocal items loaded on two or more factors; reduction or reorganisation of items might be required here’ [11, p. 277]. On the other hand, all reported items had factor loadings above the cutoff of 0.4 on only one of the six factors, which is consistent with Streiner’s ideal content validity matrix [8] of only one check mark per item.
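The two rules attributed to Streiner [8] (several items per domain, one domain per item) can be checked mechanically once a matrix has been drawn up. The following is a minimal sketch rather than a published procedure: the function name, the dictionary representation and the threshold of two items per domain are assumptions made for illustration. The four questions are those quoted above from the Social Adjustment Interview Schedule, and the fifth item is hypothetical.

```python
from typing import Dict, List, Tuple


def check_content_validity_matrix(
    matrix: Dict[str, List[str]],
    domains: List[str],
    min_items_per_domain: int = 2,  # assumed reading of "several items"
) -> Tuple[List[str], List[str]]:
    """Return (ambiguous_items, thin_domains) for a content validity matrix.

    `matrix` maps each item (row) to the domains (columns) it is judged to tap.
    """
    # Rows with zero or several check marks break the one-domain-per-item rule.
    ambiguous_items = sorted(
        item for item, tapped in matrix.items() if len(set(tapped)) != 1
    )
    # Count the check marks in each column.
    counts = {domain: 0 for domain in domains}
    for tapped in matrix.values():
        for domain in set(tapped):
            if domain in counts:
                counts[domain] += 1
    thin_domains = sorted(d for d, n in counts.items() if n < min_items_per_domain)
    return ambiguous_items, thin_domains


# The four questions quoted in the text, assigned to the domains named there.
matrix = {
    "Are you employed now?": ["work"],
    "Are you confident about your ability to do the job?": ["work"],
    "In general, how do you and your spouse get along?": ["marital"],
    "Have you been able to talk about feelings and problems with your spouse recently?": ["marital"],
}
ambiguous, thin = check_content_validity_matrix(matrix, ["work", "marital"])
print(ambiguous, thin)  # [] []

# Hypothetical item (not from the source) that taps two domains at once.
matrix["Does your job cause arguments with your spouse?"] = ["work", "marital"]
ambiguous2, thin2 = check_content_validity_matrix(matrix, ["work", "marital"])
print(ambiguous2)  # the hypothetical item is flagged as ambiguous
```

Such a check only verifies the bookkeeping. Whether an item truly belongs to a domain remains a matter of expert judgement, as the chapter notes.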
7.2.3 Criterion validity
The second type of validity is referred to as criterion validity (or predictive validity). It is concerned with measuring something that is external to the measurement of the concept itself, called the criterion [2, 4]. For example, one dimension of predictive criterion validity for psychiatric diagnoses is to relate them to predictions of outcome. (Examples of this are discussed in detail later in this chapter.) Unlike content validity, which essentially depends on a consensus among experts, predictive validity is dependent on
empirical results. Predictive validity refers to the empirical relationships between the instrument under study and external events or behaviours that can occur at three points in time: before, during or after the instrument is used. In many studies, the empirical relationship is statistically estimated by a correlation if continuous data are used.
Post-dictive validity refers to correlating events/behaviours that have occurred in the past with the instrument one is presently using. These assessments are referred to as retrospective. For example, one might have a specific prediction about the early developmental history of patients with a particular diagnosis that is currently being assessed with an instrument. Post-dictive validity entails correlating early history information with the diagnostic assessment currently obtained using the instrument under study.
Concurrent validity refers to correlating a measure and some criterion at the same point in time. This involves what are known as cross-sectional assessments. Thus, for example, if there were a laboratory test for diagnosing MDD, one could correlate the instrument used to diagnose the disorder with a laboratory test taken when the patient was interviewed.
The form of predictive validity most commonly referred to correlates a measure with a criterion that is assessed at some future point in time. This form of validity entails prospective assessments. A common use of predictive validity in psychiatry is to assess outcomes of a specific diagnostic group under study, under the assumption that certain diagnostic groups have worse or better outcomes than others (see examples below).
A fourth form of criterion validity is referred to as discriminant validity. Discriminant validity assesses whether certain external criteria (i.e. events or behaviours) are uncorrelated with the measure of interest compared with other criteria that are hypothesised to be related to the measure of interest.
That is, is the measure of interest uncorrelated with events or behaviours of which one expects it to be independent? This has also been referred to as assessing the specificity of the relationship between the measure of the concept of interest and the external criteria chosen to relate to the concept.
It is important to mention here that criterion validity is often assessed using correlations (when continuous data are involved). The strength of the correlation is often interpreted as the strength of the validity of the measure. However, the strength of the correlation depends not only on the variability and other characteristics of the measure of interest, including its reliability, but also on the choice, measurement and reliability of the criterion.
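One way to see why reliability matters here is the classical attenuation relation from test theory, which is not stated in this chapter and is introduced only as an illustrative assumption: under that model, the observed correlation equals the correlation between error-free scores multiplied by the square root of the product of the two reliabilities. The function name and the numerical values below are hypothetical.

```python
import math


def attenuated_correlation(
    true_r: float, reliability_measure: float, reliability_criterion: float
) -> float:
    """Observed correlation expected under the classical attenuation relation.

    Assumes classical test theory: r_observed = r_true * sqrt(r_xx * r_yy).
    """
    for name, value in (
        ("reliability_measure", reliability_measure),
        ("reliability_criterion", reliability_criterion),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    return true_r * math.sqrt(reliability_measure * reliability_criterion)


# Hypothetical values: a strong underlying relationship (0.8) observed through
# a measure with reliability 0.9 and a criterion with reliability 0.7.
observed = attenuated_correlation(0.8, 0.9, 0.7)
print(round(observed, 3))  # 0.635
```

In this sketch an imperfectly measured criterion lowers the observed correlation even though the instrument itself is unchanged, which is one reason a modest validity coefficient need not reflect a weak instrument.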
7.2.4 Examples of criterion validity
For examples of applications of criterion validity, we turn to two studies in the psychiatric literature from the 1990s. The first study [13] provides a fairly typical example of the use of correlational techniques. At issue was whether a self-report instrument can be used in populations of patients with schizophrenia to obtain valid ratings of depression. To examine this question, the authors compared self-report ratings obtained using the Beck Depression Inventory (BDI) with ratings of the Calgary Depression Scale (CDS), a semistructured interview designed to assess depression in schizophrenia patients. In this study, the CDS is the criterion because it makes use of informed judgements by trained clinicians, which form the current ‘gold standard’ for identifying depression in clinical populations. BDI and CDS scores were compared by calculating the Pearson product-moment correlation coefficient (e.g. see [14]), after creating scatterplots to examine the joint distribution of BDI and CDS scores as well as identifying any outliers. The latter step was essential because the presence of even a single outlier (i.e. an extreme and atypical value) could easily distort the product-moment correlation (e.g. see [15]).
Another important methodologic step employed by Addington et al. [13] was to compare correlations between the BDI and CDS in clinically distinct subgroups of patients with schizophrenia: inpatients vs. outpatients, and (within these subgroups) patients who either did or did not require assistance in completing the self-report instrument. In this particular study, the correlation between the BDI and CDS was stronger among inpatients than outpatients, regardless of whether the patients required assistance (r = 0.84 vs. r = 0.96). However, the substantially greater percentage of inpatients requiring assistance (34% of inpatients vs.
12% of the outpatients) led the authors to conclude that ‘depressed affect can be assessed in patients with
schizophrenia by both self-report and structured interview, but the BDI poses difficulties with use with inpatients’ [13, p. 561]. For our purposes, however, the substantive findings of this study were less important than the fact that this study admirably illustrated the critical importance of selecting and describing validation samples that are clinically meaningful in the context of the measurement instrument of interest [8]. In particular, users of such instruments need to be aware that published validation studies might have used ‘samples of convenience’ (e.g. university students) that do not approximate the clinical population the user has in mind and that the results of such studies do not necessarily generalise to other samples. Our second example of criterion validity in psychiatric research [16] also illustrates the critical importance of the validation sample. In this study, the validity of using a questionnaire (the Center for Epidemiologic Studies Depression Scale or CES-D) [17] as a case identification tool in studies of mood disorders among Native Americans was investigated. CES-D scores were compared with DSM-III-R diagnoses [18] based on a structured psychiatric interview (the Lifetime Version of the SADS [7]). The authors had concerns about the cross-cultural applicability not only of the screening instrument but also of the criterion itself (e.g. DSM-III-R diagnoses of affective disorders). For purposes of the study, however, it was assumed that DSM-III-R diagnoses would be relevant among Native Americans. Although the CES-D, like the BDI in the above example, yields a numerical score, its proposed use as a screening instrument for depression was for the purpose of identifying not the degree of depression, but the presence of a particular clinical syndrome, namely, DSM-III-R major depression. The criterion was therefore a categorical (i.e. qualitative) rating rather than a numerical (i.e. 
quantitative) rating, making it inappropriate to use correlational procedures. Instead, to evaluate the validity of the instrument for case identification, the authors employed statistical methods that have been expressly developed for qualitative data, including sensitivity, specificity and receiver operating characteristic (ROC) analysis. Sensitivity and specificity are both calculated using data that have been summarised in a 2 × 2 table of frequencies (see Tables 7.2 and 7.3 for definitions and computational formulas).

Table 7.2 Schematic representation of the calculation of indices of criterion validity and predictive value (a).

                 Criterion (gold standard) (b)
Test result      Present      Absent      Total
Positive         a            b           a + b
Negative         c            d           c + d
Total            a + c        b + d       N

(a) a, b, c, d and N are frequencies (e.g. numbers of persons rated).
(b) The gold standard is assumed to represent the ‘true’ value and thus to be free of error.

In the example at hand, a 2 × 2 table was used to cross-classify the numbers of screened persons with and without the criterion (e.g. a DSM-III-R diagnosis of major depression) who either did or did not score above the cut-off for depression in the screening instrument, the CES-D. (ROC analysis was used to determine the optimal cut-off value for the CES-D.) As an illustrative finding, the sensitivity for DSM-III-R major depression was 100% (i.e. all three persons in the sample with a diagnosis of major depression scored above the cut-off on the CES-D). The corresponding value of specificity was 82% (i.e. 82% of those persons in the sample who did not have diagnoses of major depression scored below the CES-D cut-off for depression). It follows directly from the reported specificity value of 82% that 18% (100% − 82%) of the persons in the sample with no psychiatric diagnoses or with DSM-III-R diagnoses other than major depression scored above the CES-D cut-off and would have been classified as depressed by that screening instrument. Whether or not this degree of misclassification error (or invalidity) is considered to be an unacceptably high ‘false-positive rate’ depends on the proposed use of the instrument and on the comparable ‘operating characteristics’ of alternative instruments. For example, a higher CES-D cut-off value could be expected to decrease the false-positive rate (via increased specificity), but at the expense of sensitivity. In this particular study, a higher CES-D cut-off actually increased specificity without decreasing sensitivity, but this was probably attributable to the small number of cases with DSM-III-R diagnoses of
major depression. In most studies there is a systematic trade-off between sensitivity and specificity, and for that reason both of these indices of criterion validity must be considered together in determining whether a particular instrument is more valid than the available alternatives. ROC analysis provides a useful framework for making such comparisons (e.g. see [19]). In the present example, the non-negligible false-positive rate was consistent with the investigators’ concerns (based on previous research by a number of researchers using other samples) that the CES-D might be reflecting symptoms of not only major depression but also increased levels of anxiety, demoralisation or even physical ill health [16]. The study by Somervell et al. [16] also illustrates the difference between criterion validity and the related, but nevertheless distinct, concept of predictive value. Positive predictive value is literally the predictive value of a positive rating, that is the probability of having the criterion of interest given a positive rating on the instrument under investigation.

Table 7.3 Statistical indices for evaluating qualitative data in the assessment of validity.

Term: Sensitivity
Definition: Proportion of those a test correctly identifies as having the disease (or characteristic) of interest
Formula (a): a/(a + c)

Term: Specificity
Definition: Proportion of those a test correctly identifies as not having the disease (or characteristic) of interest
Formula (a): d/(b + d)

Term: False negative rate
Definition: Probability of misclassifying a true positive as a negative
Formula (a): 1 − (a/(a + c)) or 1 − sensitivity

Term: False positive rate
Definition: Probability of misclassifying a true negative as a positive
Formula (a): 1 − (d/(b + d)) or 1 − specificity

Term: Positive predictive value (PPV)
Definition: Proportion of true positives among individuals who test positive
Formula (a): a/(a + b)

Term: Negative predictive value (NPV)
Definition: Proportion of true negatives among individuals who test negative
Formula (a): d/(c + d)

Term: Prevalence
Definition: Proportion of true positives in the population
Formula (a): (a + c)/N

Concepts:
• Sensitivity and specificity of a test are theoretically independent of disease/exposure prevalence, as both are conditioned on the bottom, or ‘true’, totals of Table 7.2
• In addition to the dependence of the PPV and NPV on the sensitivity/specificity of a test, they are also a function of the disease/exposure prevalence

(a) Refer to Table 7.2.
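The formulas in Table 7.3 can be sketched in a few lines of code. The class and method names below are illustrative choices, and the cell counts are hypothetical values invented for the example (they are not taken from any of the studies discussed); the cell labels a, b, c and d follow Table 7.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell frequencies laid out as in Table 7.2."""

    a: int  # test positive, criterion present
    b: int  # test positive, criterion absent
    c: int  # test negative, criterion present
    d: int  # test negative, criterion absent

    @staticmethod
    def _ratio(numerator: int, denominator: int) -> float:
        if denominator == 0:
            raise ValueError("index undefined: the relevant margin is zero")
        return numerator / denominator

    def sensitivity(self) -> float:
        return self._ratio(self.a, self.a + self.c)

    def specificity(self) -> float:
        return self._ratio(self.d, self.b + self.d)

    def false_negative_rate(self) -> float:
        return 1.0 - self.sensitivity()

    def false_positive_rate(self) -> float:
        return 1.0 - self.specificity()

    def positive_predictive_value(self) -> float:
        return self._ratio(self.a, self.a + self.b)

    def negative_predictive_value(self) -> float:
        return self._ratio(self.d, self.c + self.d)

    def prevalence(self) -> float:
        return self._ratio(self.a + self.c, self.a + self.b + self.c + self.d)


# Hypothetical counts for 200 screened persons (illustration only).
table = TwoByTwo(a=40, b=30, c=10, d=120)
print(f"sensitivity = {table.sensitivity():.3f}")
print(f"specificity = {table.specificity():.3f}")
print(f"PPV = {table.positive_predictive_value():.3f}")
print(f"NPV = {table.negative_predictive_value():.3f}")
print(f"prevalence = {table.prevalence():.3f}")
```

Sensitivity and specificity are computed within the columns of the table (criterion present or absent), whereas the predictive values are computed within the rows (test positive or negative), which is why only the latter shift when the mix of cases and non-cases changes.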
(Formulas for calculating positive predictive value, and the related index, negative predictive value, are given in Table 7.3.) Since the criterion (e.g. DSM-III-R major depression) is frequently of more direct clinical importance than the rating (e.g. a particular CES-D score), positive and negative predictive values are often more clinically meaningful than sensitivity and specificity. For example, most clinicians would probably be more interested in the usefulness of the CES-D for predicting major depression than the other way around. However, positive predictive value is a joint function of sensitivity, specificity and prevalence, such that low prevalence values can severely constrain the values of positive predictive value that can be realistically attained, even with very high sensitivity and specificity values [20, 21]. (Negative predictive value is similarly constrained by high prevalence values.) In the study by Somervell et al. [16], the prevalence of major depression can be estimated from the rate of major depression in the sample as 3/120 = 0.025.
Using a cut-off value of 16 on the CES-D, the reported specificity value of 82.1% therefore corresponds to a positive predictive value of 0.125. In other words, even though sensitivity was perfect (100%) and specificity was very high, only one of every eight persons who scored above the CES-D cut-off of 16 would be expected actually to have major depression. Even increasing the CES-D cutoff to improve specificity would not dramatically change this result. Again, this is due to the constraint imposed by the low estimated prevalence of major depression in the study population. (With the CES-D cut-off set at 28, the reported specificity value of 96.6% corresponds to a positive predictive value of 0.429.) In conclusion, this example shows that even though an instrument may have excellent criterion validity as assessed using standard indices (namely, sensitivity and specificity), the actual predictive value of the instrument could be much more limited, depending on the prevalence of the disorder of interest, which in turn may vary with the composition of the validation sample.
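The arithmetic in this example can be retraced from the reported sensitivity, specificity and prevalence using Bayes' rule. The sketch below is an illustration with an assumed function name; the inputs for the two cut-offs are the figures quoted above, while the prevalence of 0.25 in the last step is a hypothetical value added only to show the effect of a more common disorder.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """PPV as a joint function of sensitivity, specificity and prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


prevalence = 3 / 120  # three cases of major depression among 120 persons

# CES-D cut-off of 16: sensitivity 100%, specificity 82.1%.
ppv_16 = positive_predictive_value(1.0, 0.821, prevalence)
# CES-D cut-off of 28: sensitivity 100%, specificity 96.6%.
ppv_28 = positive_predictive_value(1.0, 0.966, prevalence)
print(f"cut-off 16: PPV = {ppv_16:.3f}")
print(f"cut-off 28: PPV = {ppv_28:.3f}")

# Hypothetical prevalence of 0.25 with the cut-off 16 operating characteristics.
ppv_common = positive_predictive_value(1.0, 0.821, 0.25)
print(f"cut-off 16, prevalence 0.25: PPV = {ppv_common:.3f}")
```

The first value agrees with the reported 0.125, and the second comes out at about 0.43, close to the reported 0.429; the small gap is what one would expect from working with rounded percentages rather than the underlying counts. With the hypothetical higher prevalence, the same sensitivity and specificity yield a much larger positive predictive value.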
7.2.5 Construct validity
Of the three basic types of validity, construct validity involves the most complex process. Content validity and criterion validity used alone are limited in contributing to understanding the relationship between the theoretical (unobserved) concept and the empirical measure used to indicate it. In fact, content and criterion validity are considered part of the process of assessing construct validity. As first pointed out by Cronbach and Meehl [2], construct validity is essential for all abstract concepts, since there is no criterion or entire content of a domain that is wholly adequate to define the concept of interest. Construct validity is thus defined in a theoretical context. It is the extent to which one’s measure of interest is related to other theoretically related concepts that are also measured [4]. There are three steps to assessing construct validity [1]. First, one must have an understanding of the theoretical relationships between related concepts. Second, one must estimate the empirical relationships between operational measures of these concepts. Finally, the empirical evidence must be interpreted within the theoretical context in which the concept of interest is embedded. In addition, findings from other studies must be related to one’s current findings regarding the measure and the concept it is intended to indicate. The theoretical context allows one to make theoretical predictions that then lead to empirical tests using the operational measure of the concept of interest. One study cannot wholly validate a measure of a concept. Construct validity requires a pattern of consistent findings across studies involving different samples and different settings. Cronbach and Meehl [2] refer to the theoretical context as the nomologic network. The use of the nomologic network requires relating theoretical constructs to each other, theoretical constructs to empirical indicators, and empirical indicators to each other. The construct is not reduced to the empirical indicators; it is combined with other constructs in the nomological net that allow for predictions using the empirical indicators [2, p. 290]. An ideal example of Cronbach and Meehl’s 1955 [2] framework for assessing construct validity is how we measure and ultimately understand psychiatric diagnoses.
7.2.6 Application of construct validity to psychiatric diagnosis
In psychiatry, there are no known laboratory tests for wholly identifying a psychiatric case. Thus, in 1972, Robins and Guze established five criteria that became standards for validating a diagnosis. The first criterion of Robins and Guze [22] consisted of establishing the clinical description of the disorder. This involved specifying the phenomenology or symptomatology, premorbid history, age at onset, sociodemographic distribution and precipitating factors. The clinical description criterion thus involves issues of content validity. For example, what is the domain of symptoms chosen to represent the diagnosis? ‘On the face of it’, do these symptoms reasonably represent the domain of interest? Furthermore, how would one construct an instrument to assess these symptoms? The clinical description criterion also involves criterion validity. For example, post-dictive validity would be relating premorbid history, age at onset or precipitating factors to the empirical measure of the diagnosis.
The second Robins and Guze criterion referred to the relationship of the diagnostic measure to laboratory tests. As mentioned earlier, this is a form of concurrent validity. Laboratory tests could include chemical, physiological, neuropathological, genetic, brain imaging and/or psychological tests. In psychiatry, however, at present there are no laboratory ‘gold standards’ for validating diagnoses. The third criterion involved the use of family history to contribute to validation (in the era prior to the discovery of the genome). The assumption behind the use of family history was that many psychiatric disorders run in families. Thus, an increased prevalence of the same disorder in family members could be used as an indicator that the diagnosis was a valid entity. Family history can be thought of as a concurrent validator (in reference to ill relatives who are currently alive) or as a postdictive validator (in reference to relatives who were ill but who are now deceased). (Incorporation of genetic information in the molecular genetics era today is discussed in Section 7.2.7.) The fourth criterion, commonly thought of as predictive validity in psychiatric research, related the diagnosis of interest to outcomes, including treatment response. The assumption behind using this criterion was that individuals with the same diagnosis will have similar outcomes. Furthermore, it is sometimes assumed that certain diagnostic groups have particularly poor or good outcomes compared with other diagnostic groups. However, the use of outcome as a validating criterion is problematic because many psychiatric disorders have heterogeneous outcomes. This validating criterion will remain controversial unless more definitive knowledge regarding the specific outcomes of diagnostic groups can be elaborated. The final criterion for validating a diagnosis involved assessing the specificity of the other criteria for a particular diagnosis. This can be referred to as discriminant validity. 
Although different diagnoses may share, for example, certain symptoms, laboratory test results or outcomes, it is the role of discriminant validity to specify how a particular disorder is differentiated from other disorders. If it cannot be differentiated from other disorders, this becomes support for rejecting the validity of this particular diagnosis as a separate entity.
7.2.7 The future of validating psychiatric disorders: Towards DSM-5 and beyond
The current American classification schema (the Diagnostic and Statistical Manual of Mental Disorders, 4th edition, Text Revision (DSM-IV-TR), American Psychiatric Association [23]), as it was originally conceived, has an uncertain future given that it was derived from clinical consensus to address primarily the need for diagnostic reliability rather than validity [24], and has not yet been reoriented towards state-of-the-art investigations of the aetiology of psychiatric disorders. The shift in perspective to inclusion of aetiologic information is, in part, the result of a disappointing failure to identify unequivocal, consistently replicated susceptibility genes for these disorders (as currently defined in the DSM), despite high heritability estimates and the development of powerful research tools such as genome-wide association studies [25, 26]. This may not be surprising given that the organisation of the DSM classification is not based on the pathogenesis of disorders, but rather on operationalised sets of categorical criteria based on signs and symptoms from clinical observation and research [24, 27, 28]. As pointed out by Steven Hyman, ‘Without genotypes, objective tests, clues to pathogenesis and even adequate family and longitudinal studies, it was not possible to establish a true empirical base for valid diagnoses in DSM-III, DSM-III-R, DSM-IV and DSM-IV-TR’ [24, p. xiii].
Furthermore, Hyman [24] and other investigators have voiced concerns over how the DSM defines the thresholds for categorising disorders by delineating arbitrary cut-off points for normally distributed variables, such as behaviour traits, and for continuous measures such as severity and chronicity of the illness. Hyman [29] notes that many patients do not fit precisely into these categories and hence the DSM has relied extensively on the catch-all term ‘not otherwise specified (NOS)’.
In fact, the NOS diagnoses are more commonly used than a number of the specifically named disorders. The DSM, now
under revision, is expected to address the validity of the diagnostic system by including experimental criterion sets aimed at incorporating new genetic and neurobiological findings in its fifth edition [29]. This is a shift in approach from the use of categories based on clinical syndromes and levels of functioning [30] to biologically valid phenotypes that can potentially address questions concerning illness aetiology and clinical treatment. In general, psychiatric disorders are likely to be characterised by complex multifactorial and polygenic aetiologies marked by the interaction of numerous genes with each other and a wide range of environmental risk factors, resulting in phenotypic expression that varies from normal to clinically relevant [31]. Although we know there are high heritability estimates for many psychiatric disorders, single genes with sufficiently large effects are not likely to generate most disease phenotypes. Instead, the genetic contribution to psychiatric disorders can be viewed as the combined effect of a number of different genes, each with a small or moderate effect on disease liability [25]. Individually, each gene may have only a slight effect on the phenotype such that close relatives may share several susceptibility variants, although one relative may develop the disorder and another may not [25]. Environmental factors also play a significant role in the aetiology of psychiatric disorders, for example as epigenetic factors (i.e. exogenous exposures that influence the expression of genes). As illustrated by research findings on schizophrenia and MDD, environmental influences have included early foetal or neonatal events, such as exposure to obstetric complications [32–35], viruses [36–39], poor nutrition [40, 41], social conditions such as living in urban compared with rural regions [42, 43], and migration [44–47].
One example of an environmental factor that has influenced the expression of genetic polymorphisms is the increased risk for schizophrenia in individuals who both smoke cannabis and have a functional polymorphism in the catechol-O-methyltransferase (COMT) gene, a gene responsible for metabolism of dopamine [48]. Another example is the role of the serotonin transporter (5-hydroxytryptamine transporter, 5-HTT) gene-linked polymorphic region (5-HTTLPR), which appears to increase the risk
for MDD only among those carrying the short ‘s’ allele and in the context of stressful life events, such as early childhood trauma [49]. Given the complexity of finding genetic causes of psychiatric disorders per se, there has been a surge of research focused on using intermediate phenotypes or traits, called endophenotypes, in genetic modelling of disorders [50]. Endophenotypes are quantitative or continuous traits found more commonly in psychiatrically ill individuals and their unaffected family members (i.e. family members not meeting the same psychiatric diagnostic criteria) than in the healthy population. Endophenotypes are hypothesised to underlie or precede disease onset or the expression of a clinical phenotype (i.e. as measured on a continuum of the aetiologic pathway to the clinical phenotype), and are assumed to be strongly associated with the expression of genes that underlie the disorder [30, 50, 51]. The rationale behind using endophenotypes in molecular genetics research is that (i) traits represent more elementary phenomena of decreased complexity than the clinical phenotype and thus will likely have stronger associations with specific functions of genes and hence be more genetically informative (i.e. the phenotype will segregate with the susceptibility locus) [26, 30], and (ii) by including unaffected family members, endophenotypes may afford the investigator greater power to detect linkage than a categorical diagnostic approach [26]. While the endophenotype approach is now widely used, the identification of quantitative endophenotypic traits of the clinical phenotypes remains controversial [52], given that it is still unclear how informative they are in contrast to the DSM categories [26]. Thus, criteria for evaluating the validity and utility of endophenotypic markers for research in psychiatric genetics have been proposed by several investigative teams, including Gottesman and Gould [30], Skuse [53], Doyle et al. [54] and Waldman [55].
Based on the guidelines set forth in the psychiatric literature, Bearden and Freimer [51] have proposed a set of criteria viewed as both necessary and sufficient.
1 Endophenotypes should be familial, with at least moderate heritability, and should be detectable in those with the mental illness associated with the phenotype as well as in unaffected family members.
2 Endophenotypes should be part of the causal chain in the relationship between genes and the DSM diagnosis rather than an effect or sequela of the disorder.
3 Endophenotypes should be reliable (internal consistency), and have test–retest reliability (at least within a particular clinical state, and preferably across clinical states in illnesses with an episodic pattern), sound psychometric properties (e.g. can discriminate across a broad range of individual differences) and good concurrent validity (convergent and divergent validity) with respect to hypothesised endophenotypes.
4 Endophenotypic traits should exhibit a continuous distribution (ideally, normally distributed) within the general population.
5 The endophenotype should be associated with an increased risk for a particular DSM diagnosis. Illness-specificity is desirable but not required.
The authors also add that an ‘optimally informative candidate endophenotype should: (i) relate to reasonably well-characterised neural systems models, and (ii) involve homologies of expression across species (to enable development of animal models).’ [51, p. 309] The use of quantitative endophenotypic traits for understanding the genetic nature of psychiatric disorders fits well with the fact that psychiatric illnesses are highly comorbid, not only with one another but also with general medical disorders [56, 57]. In fact, only 10–20% of lifetime diagnoses are single disorders [58]. For example, Tsuang et al. [59] point out that many genetic studies in the late 1990s failed to show linkage to schizophrenia based on a DSM diagnosis of schizophrenia alone, but found stronger linkage when the phenotype was broadened to include additional psychotic disorders (e.g. [60, 61]). As another example, chromosomes 13q [62–64], 4p [65], 22q [62, 63] and 18p [62, 66] have been implicated as promising genomic regions for both schizophrenia and bipolar disorder.
Thus, for an understanding of the nature of schizophrenia and other psychotic disorders, these examples illustrate the relevance of traits that are shared across psychotic disorders and must be distinguished from the traits that may be specific for the disorders themselves.
In psychiatry, genetic studies using endophenotypes have met with moderate success, including the development of animal models based on these traits [30, 67]. Endophenotypes in psychiatry have been described for several disorders including schizophrenia, mood disorders, Alzheimer’s disease and personality disorders. Schizotaxia, for example, is a clinical condition that indicates a predisposition, or liability, to schizophrenia. First named by Meehl in 1962 [68] and subsequently reformulated to reflect current research [69], schizotaxia is a more subtle brain disorder than schizophrenia, marked by negative symptoms and neuropsychologic impairment. A number of non-psychotic, first-degree relatives of persons with schizophrenia exhibit clinical and neurobiological abnormalities that are also manifest in patients with schizophrenia [70]. Family studies indicate that schizotaxia is present in about 20–50% of non-psychotic adult relatives of persons with schizophrenia [71, 72], with about 10% of relatives developing psychosis and another 10% developing schizotypal personality disorder [73]. Schizotaxia may well express the aetiologic mechanisms that underpin schizophrenia more clearly than the clinical symptoms of the disorder [70]. For example, Tsuang et al. [69, 74] conducted a validation study of schizotaxia based on the treatment of non-psychotic, adult first-degree relatives of patients with schizophrenia using the antipsychotic medication risperidone, a drug which has been found to ameliorate negative symptoms and neuropsychologic abnormalities in persons with schizophrenia. The authors hypothesised that if schizotaxia is biologically related to schizophrenia, the negative symptoms and neuropsychologic abnormalities in the schizotaxic relatives would improve with risperidone treatment.
In this study, all subjects exhibited moderate levels of negative symptoms and neuropsychological deficits at baseline, and after a 6-week, open-label course of risperidone, these symptoms and cognitive deficits improved in five of the six relatives, supporting the authors’ hypothesis in terms of predictive validity. Though the findings were only preliminary, they suggested common aetiologic elements between the two disorders [69]. The authors later published a study of the concurrent validity of schizotaxia in a group of 27 adult first-degree
relatives of patients with schizophrenia [75]. Of these subjects, eight individuals met criteria for schizotaxia and were compared with 19 control subjects who were free of DSM-IV psychiatric diagnoses. The authors found that in contrast to those without schizotaxia, the schizotaxia group exhibited significantly lower levels of functioning and had a lifetime substance abuse diagnosis rate (50%) similar to that among persons with schizophrenia. Findings in this study provided further validation of schizotaxia as a psychiatrically relevant, familially-related condition closely associated with and aetiologically related to schizophrenia. In summary, new approaches to validating psychiatric diagnoses are being developed to incorporate aetiologic information with regard to the psychiatric disorder rather than relying on symptom and functioning information alone. The focus here on endophenotypes illustrates this trend and reflects the expressed need to identify characteristics that are likely related to specific genetic traits and other biomarkers for psychiatric illnesses. However, the search for traits associated with the underlying biomarkers for the illness is still in its infancy. Given the new genetic methodologies and biomedical imaging technologies, there is a realistic hope that future classification systems beyond DSM-5 will include specific biomarkers underlying these illnesses that may help tailor specific treatments to affected individuals in a reliable and valid manner.
7.3 Validity of the relationships between variables
We now turn to another use of the term validity, in psychiatric as well as other fields of research, that refers to the ‘internal and external validity’ of a study. Internal and external validity are essential properties of how we assess empirical research and are therefore important to discuss in this chapter. Internal and external validity are discussed thoroughly by Cook and Campbell [76] in relation to quasi-experimental design studies. They are also discussed in basic textbooks on epidemiology [77, 78]. Internal validity refers to the extent to which a relationship found to be statistically significant is a causal relationship. Internal validity is an empirical
issue. That is, do the empirical measures used to assess concepts of interest relate to each other in a causal way? It is also a theoretical issue in that the presumed causal association between variables must be coherent with other empirical evidence and theory. In epidemiology, there are five ‘criteria of judgement’ that are used to aid in establishing a causal relationship [78]: (i) the temporal (time) sequence of variables, (ii) the consistency of associations on replication, (iii) the strength of the association, (iv) the specificity of association and (v) the coherence of the explanation of the association. The time sequence refers to the temporal order of the variables of interest. The consistency of the association refers to its reliability. The strength of the association is measured empirically using relative risk, correlational or nonparametric statistics. Specificity refers to what we previously discussed as discriminant validity. Finally, the coherence criterion refers to a more theoretical question of whether the explanation of the association between the variables of interest ‘fits’ with pre-existing theory and evidence. These five criteria are then used to make judgements regarding whether the empirical association between variables has internal validity or causal plausibility. The causal plausibility of a relationship may in part be dependent on the type of study design used to assess one’s variables of interest. In a controlled experimental study, one may specifically manipulate the time order of variables and experimentally control for confounding factors that may be threats to internal validity. However, many epidemiologic studies are not experimental, but rather are observational and what has been called quasi-experimental [76]. In these types of studies, it may be more difficult to establish the internal validity of the relationship between variables.
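As a brief illustration of the third criterion, the strength of an association, the Python sketch below computes a relative risk from a 2 × 2 table together with an approximate 95% confidence interval based on the large-sample standard error of the log relative risk. The function name and the cohort counts are hypothetical and chosen only for illustration.

```python
import math


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total,
                  z=1.96):
    """Relative risk with an approximate 95% confidence interval (log method)."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) from the usual large-sample approximation.
    se_log_rr = math.sqrt(1 / exposed_cases - 1 / exposed_total
                          + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical cohort: 30 of 1000 exposed and 10 of 1000 unexposed
    # persons develop the disorder during follow-up.
    rr, lower, upper = relative_risk(30, 1000, 10, 1000)
    print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

A relative risk well above 1, with an interval that excludes 1, speaks only to the strength of the association. Temporal sequence, consistency, specificity and coherence must still be judged separately.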
There are a number of threats to internal validity that may arise in using non-experimental designs. They are discussed in detail by Cook and Campbell [76, pp. 51–59] and briefly described here. Suppose that in a treatment study one found that treatment ‘a’ was significantly better for a specific diagnostic group than treatment ‘b’, as measured by pre- and post-treatment measurements of symptomatology. However, suppose there was no random assignment to treatment; thus the study was not an experimental design. The following threats to the
‘internal validity’ of the effect of treatment ‘a’ may be operating and should be addressed. In general, threats to internal validity have to do with the possibility of differential effects of events on the treatment versus the control groups that are not due to the treatment of interest per se (see Table 7.4). History effects refer to the influence of events outside of the control of the study that may differentially affect the outcomes of the groups being studied but have little relationship to the treatment of interest. Maturation involves the differential development of participants in each group that is not due to treatment effects. Testing and instrumentation effects refer, respectively, to the number of times a test is given, resulting in differential learning effects, and to changes in instrumentation over time that differentially affect one’s groups, unrelated to treatment effects. Statistical regression artefacts are especially difficult to control for. They can occur if the groups at pretreatment time are not equivalent, that is, do not come from the same population. In a nonrandomised study, one attempts to match groups on
certain pretreatment variables. However, the matching variables may be unreliable themselves, resulting in unmatched groups at pretreatment assessment time. Respondents with high scores on unreliable pretreatment variables may score lower at post-treatment time, and the reverse may be true for respondents with low scores on unreliable pretreatment variables. The expected direction of the change in unreliable scores from pre- to post-treatment is always towards the population mean [76]. This is referred to as regression to the mean. Thus, the change in one’s treatment groups would not be due to treatment, but rather to these regression artefacts. One way to control for these artefacts is to ensure that pretreatment matching variables are as reliable as possible. It is often difficult to match one’s groups completely, and therefore using experimental designs in treatment studies is preferable, although not always possible. A classic example of how regression artefacts can adversely affect results was the Westinghouse–Ohio University study of Head Start (preschool

Table 7.4 External and internal validity: definition and threats to validity.

Internal validity
Definition: The validity of the inferences drawn as they pertain to the subjects in the study
Threats: Differential effects of events on the exposed and unexposed groups that are not due to exposure
• History: influence of events outside study control that may differentially affect outcomes and have little relationship to exposure
• Maturation: differential development of participants in each group not due to exposure
• Testing: number of times a test is given resulting in differential learning effects
• Instrumental: changes in instrumentation over time that differentially affects groups (unrelated to exposure)
Statistical regression artifacts
• Regression to the mean: tendency for high values of continuous variables to decrease to the mean and low values to increase to the mean with repeated measurement
• Selection: intra-individual variability at baseline
• Mortality: differential group drop-out or refusal

External validity
Definition: Validity of the inferences drawn as they pertain to persons outside the study population
• Generalisable to and across persons, time periods and settings
Threats: Interaction effects with exposure. Threats to external validity:
• Selection
• Setting
• History

education) [79]. In this study, the cases and controls were undermatched for socioeconomic status, which made Head Start look damaging to children. This occurred because controls were selected from a more able population than Head Starters. That is, the pretreatment or pretest matching variable, socioeconomic status, which includes educational status, was unreliable. When cognitive measures were assessed post-Head Start, the control group’s cognitive scores regressed to their population mean, which was higher than that of the Head Start group. The population means of the two groups were different, because the controls were originally selected from a population that was educationally and cognitively more advanced than the Head Start group [79]. When controls were appropriately selected for comparison with the Head Start children, the Head Start programme was shown to have a significant impact on the cognitive functioning of the children who experienced the programme. Selection effects are related to regression artefacts. Selection becomes a threat to internal validity when the characteristics of one’s groups are different, and this results in differential changes from pre- to post-treatment assessment between groups. For example, mortality can result in selection artefacts. Mortality effects refer to the differential drop-out or refusal rates between the groups that may affect the group’s post-treatment mean. For example, if the more severely ill patients dropped out of treatment ‘a’, then post-treatment assessment of symptoms among the treatment ‘a’ group may look better due to the differential drop-out of severely ill patients in that group rather than to effects of treatment ‘a’ on symptomatology. Other threats to internal validity discussed by Cook and Campbell [76, pp. 53–55] include differential social influences on the groups being compared.
For example, communication between patients in the treatment and control groups about the treatment of interest may result in rivalry between the groups, ‘resentful demoralisation’ of the group receiving a less desirable treatment, or imitation of one group by the other. The external validity of a significant result refers to the extent to which a finding is generalisable to and across persons, time periods and settings [76]. Random sampling of one’s groups from the population
of interest contributes to the ability to ‘generalise to’ the population of interest. Generalising across populations refers to the identification of those populations to which the findings can be applied. That is, it refers to the extent of the generalisation of findings to other populations aside from those that were directly studied or subpopulations among those studied. For example, most readers would be cautious about generalising across males and females from a study of health services utilisation based solely on a sample of males. The threats to external validity can be thought of as interaction effects with the treatment of interest (see Table 7.4) [76]. For example, differences in treatment response between the sexes or socioeconomic statuses will lower the generalisability across the population as a whole. There are three types of interaction effects that are threats to external validity: interactions of selection, setting and history with treatment [76, pp. 73–74]. For example, selection interactions, or systematic recruitment artefacts, may result in findings being attributable only to those recruited into the study. The same can be said for interactions of treatment with setting and history. For example, using a university setting may limit one’s generalisability across other settings. Conducting the treatment study during a particular historical period may not allow generalisability to future time periods. To minimise both of these threats, multiple studies would need to be implemented using different populations at different historical time periods.
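The regression artefact described earlier in this section, in which groups drawn from different populations are matched on an unreliable pretreatment variable, can be illustrated with a short simulation. The Python sketch below is a toy model under stated assumptions: the matching variable is a pretest score equal to a true score plus measurement error, the two populations differ in their true means, and no treatment effect is applied to either group. All names and numerical values are hypothetical and are not taken from the Head Start study.

```python
import random
from statistics import mean


def simulate_matched_groups(n_per_population=20000, programme_mean=90.0,
                            control_mean=110.0, true_sd=10.0, error_sd=10.0,
                            match_band=(98.0, 102.0), seed=7):
    """Match two populations on an unreliable pretest; apply no treatment effect."""
    rng = random.Random(seed)

    def matched_scores(population_mean):
        kept = []
        for _ in range(n_per_population):
            true_score = rng.gauss(population_mean, true_sd)
            pretest = true_score + rng.gauss(0.0, error_sd)  # unreliable measure
            if match_band[0] <= pretest <= match_band[1]:
                # Post-test reflects the same true score plus fresh error only.
                posttest = true_score + rng.gauss(0.0, error_sd)
                kept.append((pretest, posttest))
        return kept

    def summarise(rows):
        return {"pretest": mean(p for p, _ in rows),
                "posttest": mean(q for _, q in rows)}

    return {"programme": summarise(matched_scores(programme_mean)),
            "control": summarise(matched_scores(control_mean))}


if __name__ == "__main__":
    for group, scores in simulate_matched_groups().items():
        print(f"{group}: pretest {scores['pretest']:.1f}, "
              f"posttest {scores['posttest']:.1f}")
```

Although the two groups look matched at pretest, each drifts back towards its own population mean at post-test, so the programme group appears to have been harmed even though the simulated treatment effect is zero. Increasing the reliability of the matching variable (a smaller `error_sd`) shrinks the artefact.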
7.4 Summary
In summary, validity can have different meanings depending on the context in which it is used. It is applied to the measurement of concepts, called construct validity, as in the case of ‘validating psychiatric diagnoses’, and to the relationship between operational measures, called the internal and external validity of a presumed causal relationship. As applied to construct validity, it is an unending process in which one attempts to measure a concept of interest as accurately as possible. Validity involves a theoretical understanding of the concept as well as an empirical assessment of the criteria chosen to operationalise the concept. This chapter discusses
three basic ways in which validity is assessed: content validity, criterion validity and construct validity. Content and criterion validity can be thought of as part of the process of assessing construct validity. One study cannot wholly validate a measure of a concept. It requires a pattern of consistent findings across studies involving different samples and different settings. An ideal example of this is how the field has approached identifying psychiatric diagnoses from the inception of the Diagnostic and Statistical Manual of Mental Disorders to current notions of understanding psychiatric diagnoses by incorporating aetiological information in future versions of how we operationalise our diagnostic classifications. The other way in which validity has been discussed in this chapter refers to the ‘internal and external validity’ of empirical relationships between operational measures of the concepts of interest. Internal validity refers to the extent that a statistically significant relationship is a causal one. There are a number of ways in which causal plausibility is assessed, for example the five criteria of judgement used in epidemiological studies [78]. In addition, causal plausibility is dependent on the type of study design employed. As discussed, quasi-experimental designs are open to a number of threats to internal validity, including regression artefacts, history and selection effects. Experimental study designs, in which one manipulates the time order of variables and controls for confounding factors, are less vulnerable to threats to internal validity. Finally, external validity refers to the extent that one can generalise the study findings to and across persons, time periods and settings. To minimise threats to external validity, multiple studies are needed in which the study populations, the historical time periods and the settings are varied.
Acknowledgements
This chapter was written while Dr. Goldstein was supported by NIMH RO1 MH56956 and NIMH-ORWH P50 MH082679. Drs. Goldstein and Cherkerzian are also supported by the Connors Center for Women’s Health and Gender Biology at Brigham & Women’s Hospital. The authors would like to thank Lisa Cushman-Daly for help in manuscript preparation.
References
[1] Carmines, E.G. and Zeller, R.A. (1979) Reliability and Validity Assessment. Series Quantitative Applications in the Social Sciences, Sage University Press, Beverly Hills, CA.
[2] Cronbach, L.J. and Meehl, P.E. (1955) Construct validity in psychological tests. Psychol. Bull., 52 (4), 281–302.
[3] Anastasi, A. (1976) Psychological Testing, Macmillan, London.
[4] Nunnally, J.C. (1978) Psychometric Theory, McGraw-Hill, New York.
[5] Cronbach, L.J. (1971) Educational measurement, in Test Validation (ed. R.L. Thorndike), American Council on Education, Washington, D.C.
[6] Robins, L.N., Helzer, J.E., Croughan, J. and Ratcliff, K.S. (1981) The NIMH diagnostic interview schedule: its history, characteristics, and validity. Arch. Gen. Psychiatry, 38, 381–389.
[7] Endicott, J. and Spitzer, R.L. (1978) A diagnostic interview: the schedule for affective disorders and schizophrenia. Arch. Gen. Psychiatry, 35, 837–844.
[8] Streiner, D.L. (1993) A checklist for evaluating the usefulness of rating scales. Can. J. Psychiatry, 38, 140–148.
[9] Schwartz, C.C., Myers, J.K. and Astrachan, B.M. (1975) Concordance of multiple assessments of outcome in schizophrenia: on defining the dependent variables in outcome studies. Arch. Gen. Psychiatry, 32, 1221–1227.
[10] Cloninger, C.R. (1987) A systematic method for clinical description and classification of personality variants. A proposal. Arch. Gen. Psychiatry, 44, 573–588.
[11] Takeuchi, M., Yoshino, A., Kato, M., Ono, Y. and Kitamura, T. (1993) Reliability and validity of the Japanese version of the tridimensional personality questionnaire among university students. Compr. Psychiatry, 34, 273–279.
[12] Kim, J.O. and Mueller, C.W. (1978) Factor Analysis: Statistical Methods and Practical Issues, Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-014, Sage Publications, Inc., Beverly Hills, CA.
[13] Addington, D., Addington, J. and Maticka-Tyndale, E. (1993) Rating depression in schizophrenia: a comparison of a self-report and an observer scale. J. Nerv. Ment. Dis., 181, 561–565.
[14] Woolson, R.F. (1987) Statistical Methods for the Analysis of Biomedical Data, John Wiley & Sons, Inc., New York.
[15] Simpson, J.C. (1982) Amino acid levels in schizophrenia and celiac disease: another look. Biol. Psychiatry, 17, 1353–1357.
[16] Somervell, P.D., Beals, J., Kinzie, J.D., Boehnlein, J., Leung, P. and Manson, S.M. (1993) Criterion validity of the center for epidemiologic studies depression scale in a population sample from an American Indian village. Psychiatry Res., 47, 255–266.
[17] Radloff, L.S. (1977) The CES-D scale: a self-report depression scale for research in the general population. Appl. Psychol. Meas., 1, 385–401.
[18] American Psychiatric Association (1987) DSM-III-R: Diagnostic and Statistical Manual of Mental Disorders, 3rd edition revised. American Psychiatric Press, Washington, DC.
[19] Murphy, J.M., Berwick, D.M., Weinstein, M.C., Borus, J.F., Budman, S.H. and Klerman, G.L. (1987) Performance of screening and diagnostic tests: application of receiver operating characteristics analysis. Arch. Gen. Psychiatry, 44, 550–555.
[20] Baldessarini, R.J., Finkelstein, S. and Arana, G.W. (1983) The predictive power of diagnostic tests and the effect of prevalence of illness. Arch. Gen. Psychiatry, 40, 569–573.
[21] Glaros, A.G. and Kline, R.B. (1988) Understanding the accuracy of tests with cutting scores: the sensitivity, specificity, and predictive value model. J. Clin. Psychol., 44 (6), 1013–1023.
[22] Robins, E. and Guze, S.B. (1970) Establishment of diagnostic validity in psychiatric illness: its application to schizophrenia. Am. J. Psychiatry, 126, 983–987.
[23] American Psychiatric Association (2000) Diagnostic and Statistical Manual of Mental Disorders, 4th edition, Text Revision. American Psychiatric Association, Inc., Washington, DC.
[24] Hyman, S.E. (2003) Foreword, in Advancing DSM: Dilemmas in Psychiatric Diagnosis (eds K.A. Phillips, M.B. First and H.A. Pincus), American Psychiatric Association, Washington, DC, pp. xi–xxi.
[25] Kendler, K.S. (2006) Reflections on the relationship between psychiatric genetics and psychiatric nosology. Am. J. Psychiatry, 163 (7), 1138–1146.
[26] Szatmari, P., Maziade, M., Zwaigenbaum, L. et al. (2007) Informative phenotypes for genetic studies of psychiatric disorders. Am. J. Med. Genet. B Neuropsychiatr. Genet., 144B, 581–588.
[27] Charney, D.S. (2002) Foundation for the NIMH strategic plan for mood disorders research. Biol. Psychol., 52, 455–456.
[28] Lecrubier, Y. (2008) Refinement of diagnosis and disease classification in psychiatry. Eur. Arch. Psychiatry Clin. Neurosci., 258 (Suppl. 1), 6–11.
[29] Hyman, S.E. (2007) Can neuroscience be integrated into the DSM-5? Nat. Rev. Neurosci., 8, 725–732.
[30] Gottesman, I.I. and Gould, T.D. (2005) The endophenotype concept in psychiatry, in Research Advances in Genetics and Genomics: Implications for Psychiatry (ed. N.C. Andreasen), American Psychiatric Publishing, Inc., Washington, DC, pp. 63–84.
[31] Allardyce, J., Suppes, T. and van Os, J. (2007) Dimensions and the psychosis phenotype. Int. J. Methods Psychiatr. Res., 16 (Suppl. 1), S34–S40.
[32] Guth, C., Jones, P. and Murray, R.M. (1993) Familial psychiatric illness and obstetric complications in early-onset affective disorder. A case–control study. Br. J. Psychiatry, 163, 492–498.
[33] Cannon, M., Jones, P.B. and Murray, R.M. (2002) Obstetric complications and schizophrenia: historical and meta-analytic review. Am. J. Psychiatry, 159 (7), 1080–1092.
[34] Buka, S.L., Lipsitt, L.P. and Murray, R. (1993) Pregnancy/delivery complications and psychiatric diagnosis: a prospective study. Arch. Gen. Psychiatry, 50 (2), 151–156.
[35] Dalman, C., Allebeck, P., Cullberg, J., Grunewald, C. and Köster, M. (1999) Obstetric complications and the risk of schizophrenia: a longitudinal study of a national birth cohort. Arch. Gen. Psychiatry, 56 (3), 234–240.
[36] Buka, S.L., Cannon, T.D., Torrey, E.F., Yolken, R.H. and Collaborative Research Group (2008) Maternal exposure to herpes simplex virus and risk of psychosis among adult offspring. Biol. Psychol., 63 (8), 809–815.
[37] Machon, R.A., Mednick, S.A. and Huttunen, M.O. (1997) Adult major affective disorder after prenatal exposure to an influenza epidemic. Arch. Gen. Psychiatry, 54 (4), 322–328.
[38] Brown, A.S., Begg, M.D., Gravenstein, S. et al. (2004) Serologic evidence of prenatal influenza in the etiology of schizophrenia. Arch. Gen. Psychiatry, 61 (8), 774–780.
[39] Buka, S.L., Tsuang, M.T., Torrey, E.F., Klebanoff, M.A., Bernstein, D. and Yolken, R.H. (2001) Maternal infections and subsequent psychosis among offspring: a forty year prospective study. Arch. Gen. Psychiatry, 58, 1032–1037.
[40] Brown, A.S., van Os, J., Driessens, E., Hoek, H.W. and Susser, E.S. (2000) Further evidence of relation between prenatal famine and major affective disorder. Am. J. Psychiatry, 157 (2), 190–195.
[41] Susser, E.S., St. Clair, D. and He, L. (2008) Latent effects of prenatal malnutrition on adult health: the example of schizophrenia. Ann. N. Y. Acad. Sci., 1136, 185–192.
[42] Blue, I. and Harpham, T. (1996) Urbanization and mental health in developing countries. Curr. Issues Public Health, 2 (4), 181–185.
[43] Pedersen, C.B. and Mortensen, P.B. (2001) Evidence of a dose-response relationship between urbanicity during upbringing and schizophrenia risk. Arch. Gen. Psychiatry, 58 (11), 1039–1046.
[44] Selten, J.P., van Os, J. and Nolen, W.A. (2003) First admission for mood disorders in immigrants to the Netherlands. Soc. Psychiatry Psychiatr. Epidemiol., 38, 547–550.
[45] Cantor-Graae, E. (2007) The contribution of social factors to the development of schizophrenia: a review of recent findings. Can. J. Psychiatry, 53 (5), 277–286.
[46] Rwegellera, G.G. (1977) Psychiatric morbidity among West Africans and West Indians living in London. Psychol. Med., 7 (2), 317–329.
[47] Corcoran, C., Perrin, M., Harlap, S. et al. (2009) Incidence of schizophrenia among second-generation immigrants in the Jerusalem Perinatal Cohort. Schizophr. Bull., 35 (3), 596–602.
[48] Caspi, A., Moffitt, T.E., Cannon, M. et al. (2005) Moderation of the effect of adolescent-onset cannabis use on adult psychosis by a functional polymorphism in the catechol-O-methyltransferase gene: longitudinal evidence of a gene × environment interaction. Biol. Psychiatry, 57 (10), 1117–1127.
[49] Caspi, A., Sugden, K., Moffitt, T.E. et al. (2003) Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science, 301 (5631), 386–389.
[50] Gottesman, I.I. and Shields, J. (1972) Schizophrenia and Genetics: A Twin Study Vantage Point, Academic Press, New York.
[51] Bearden, C.E. and Freimer, N.B. (2006) Endophenotypes for psychiatric disorders: Ready for primetime? Trends Genet., 22 (6), 306–313.
[52] Bilder, R.M. (2008) Phenomics: Building scaffolds for biological hypotheses in the post-genomic era. Biol. Psychiatry, 63, 439–440.
[53] Skuse, D.H. (2001) Endophenotypes and child psychiatry. Br. J. Psychiatry, 178, 395–396.
[54] Doyle, A.E., Faraone, S.V., Seidman, L.J. et al. (2005) Are endophenotypes based on measures of executive functions useful for molecular genetic studies of ADHD? J. Child Psychol. Psychiatry Allied Discip., 46 (7), 774–803.
[55] Waldman, I.D. (2005) Statistical approaches to complex phenotypes: evaluating neuropsychological endophenotypes for attention-deficit/hyperactivity disorder. Biol. Psychiatry, 57, 1347–1356.
[56] Frances, A.J., First, M.B., Widiger, T.A. et al. (1991) An A to Z guide to DSM-IV conundrums. J. Abnorm. Psychol., 100 (3), 407–412.
[57] Sabb, F.W., Bearden, C.E., Glahn, D.C. et al. (2008) A collaborative knowledge base for cognitive phenomics. Mol. Psychiatry, 13 (4), 350–360.
[58] Wittchen, H.U., Beesdo, K., Bittner, A. et al. (2003) Depressive episodes – evidence for a causal role of primary anxiety disorders? Eur. Psychiatry, 18 (8), 384–393.
[59] Tsuang, M.T., Stone, W.S., Tarbox, S.I. et al. (2003) Insights from neuroscience for the concept of schizotaxia and the diagnosis of schizophrenia, in Advancing DSM: Dilemmas in Psychiatric Diagnosis (eds K.A. Phillips, M.B. First and H.A. Pincus), American Psychiatric Association, Washington, DC, pp. 105–127.
[60] Maziade, M., Bissonnette, L., Rouillard, E. et al. (1997) 6p24-22 region and major psychoses in the Eastern Quebec population. Le Groupe IREP. Am. J. Med. Gen., 74 (3), 1726–1733.
[61] Wildenauer, D.B., Hallmayer, J., Schwab, S.G. et al. (1996) Searching for susceptibility genes in schizophrenia by genetic linkage analysis. Cold Spring Harb. Symp. Quant. Biol., 61, 845–850.
[62] Berrettini, W. (2003) Bipolar disorder and schizophrenia: not so distant relatives? World Psychiatry, 2 (2), 68–72.
[63] Badner, J.A. and Gershon, E.S. (2002) Meta-analysis of whole-genome linkage scans of bipolar disorder and schizophrenia. Mol. Psychiatry, 7 (4), 405–411.
[64] Maziade, M., Chagnon, Y.C., Roy, M.A. et al. (2009) Chromosome 13q13-q14 locus overlaps mood and psychotic disorders: the relevance for redefining phenotype. Eur. J. Hum. Genet. [Epub ahead of print (doi: 10.1038/ejhg.2008.268)].
[65] Christoforou, A., Le Hellard, S., Thomson, P.A. et al. (2007) Association analysis of the chromosome 4p15-p16 candidate region for bipolar disorder and schizophrenia. Mol. Psychiatry, 12 (11), 1011–1025.
[66] Schwab, S.G., Hallmayer, J., Lerer, B. et al. (1998) Support for a chromosome 18p locus conferring susceptibility to functional psychoses in families with schizophrenia, by association and linkage analysis. Am. J. Hum. Genet., 63 (4), 1139–1152.
[67] Bearden, C.E., Jasinska, A.J. and Freimer, N.B. (2009) Methodological issues in molecular genetic studies of mental disorders. Annu. Rev. Clin. Psychol., 5, 49–69.
[68] Meehl, P. (1962) Schizotaxia, schizotypy, and schizophrenia. Am. Psychol., 17, 827–838.
[69] Tsuang, M.T., Stone, W.S., Seidman, L.J. et al. (1999) Treatment of nonpsychotic relatives of patients with schizophrenia: Four case studies. Biol. Psychiatry, 41, 1412–1418.
[70] Tsuang, M.T., Stone, W.S., Gamma, F. et al. (2003) Schizotaxia: current status and future directions. Curr. Psychiatry Rep., 5, 128–134.
[71] Faraone, S.V., Seidman, L.J., Kremen, W.S. et al. (1995) Neuropsychological functioning among the nonpsychotic relatives of schizophrenic patients: a diagnostic efficiency analysis. J. Abnorm. Psychol., 104, 286–304.
[72] Faraone, S.V., Kremen, W.S., Lyons, M.J. et al. (1995) Diagnostic accuracy and linkage analysis: how useful are schizophrenia spectrum phenotypes? Am. J. Psychiatry, 152, 1286–1290.
[73] Battaglia, M. and Torgersen, S. (1996) Schizotypal disorder: at the crossroads of genetics and nosology. Acta Psychiatr. Scand., 94, 303–310.
[74] Tsuang, M.T., Stone, W.S., Gamma, F. et al. (2000) Towards the prevention of schizophrenia. Biol. Psychiatry, 48, 349–356.
[75] Stone, W.S., Faraone, S.V., Seidman, L.J. et al. (2001) Concurrent validation of schizotaxia: a pilot study. Biol. Psychiatry, 50, 400–434.
[76] Cook, T.D. and Campbell, D.T. (1979) Quasi-Experimentation: Design and Analysis Issues for Field Settings, Rand McNally College Publishing Company, Chicago.
[77] MacMahon, B. and Pugh, T.F. (1970) Epidemiology: Principles and Methods, Little, Brown, and Company, Boston.
[78] Susser, M. (1973) Causal Thinking in the Health Sciences. Concepts and Strategies of Epidemiology, Oxford University Press, London.
[79] Campbell, D.T. and Erlebacher, A. (1970) How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful, in Compensatory Education: A National Debate, Disadvantaged Child, Vol. 3 (ed. J. Helmuth), Brunner/Mazel, New York, pp. 185–210.
8
Use of register data for psychiatric epidemiology in the Nordic countries
Jouko Miettunen,1 Jaana Suvisaari,2 Jari Haukka2,3 and Matti Isohanni1
1 Department of Psychiatry, University of Oulu, Finland
2 Department of Mental Health and Substance Abuse Services, National Institute for Health and Welfare, Helsinki, Finland
3 Department of Public Health, University of Helsinki, Finland
8.1 Introduction
The four largest Nordic countries, Denmark (population 5.5M), Finland (5.3M), Norway (4.8M) and Sweden (9.3M), have great similarities and interconnections in their history and social structures. Their populations have remained stable in spite of bouts of emigration and, more recently, immigration. The national and local levels of administration and the documentation associated with these have been well developed for hundreds of years, and relevant data have been computerised ever since this became possible.

Records have been kept in the Nordic countries for a long time. For instance, in Finland and Sweden, parishes kept records of births and deaths as early as the sixteenth century for purposes of recruitment and taxation. In the seventeenth century these records were used by central government for its own survey and planning purposes [1]. Nowadays, management, organisation, planning, evaluation, control and protection of individuals, as well as the identification, selection and enumeration of cases, have been listed as good reasons to collect administrative health and welfare data [2]. Similar registers are also available in non-Nordic countries, but they are not based on the whole population and it is not possible to link data between different registers. Initially such registers included only aggregate data, but individual-level register data have since become available as well. Since the 1960s it has been possible in all the Nordic countries to link the data in different registers within each country using personal identification codes, and since nearly all administrative registers operate on this basis, vast linkage possibilities exist.

The main registers used in psychiatric research are case registers [3] and administrative health and welfare registers [4]. Case registers are usually kept locally and include all referrals to psychiatric services in a particular community, as in the Stockholm County In-patient Register [5], for example. It is only in the Nordic countries that nationwide case registers are to be found, for example the Norwegian Case Register of Mental Disorders and the Danish Psychiatric Central Register. Case registers have been used in health system planning, for instance [6, 7]. Administrative registers (e.g. the Finnish Hospital Discharge Register, FHDR) are maintained nationally, mainly for administrative purposes,
although they are also employed for scientific purposes. Routinely collected administrative data can be used to study such matters as the incidence and prevalence of diseases, treatment outcomes and service utilisation. By linking different registers it is also possible to study risk factors for mental disorders. Register studies have notable strengths regarding statistical power and representativeness, and registers have enabled the examination of issues that would otherwise have been difficult to study, for example because of rare exposure events and/or disorders or high drop-out rates.

Register studies offer internationally unique possibilities for psychiatric research. Many epidemiological studies have high drop-out rates, and this may affect the results and their interpretation [8]. Attrition is also a major problem in epidemiological and cross-sectional research related to psychiatric disorders, especially severe mental disorders. However, unlike the situation in most countries, it is possible in the Nordic region to compare participants and non-participants using information collected from registers and to estimate the effect of attrition [9]. The registers concerned can be used for statistical and scientific purposes even without specifically asking the subjects for their consent, whereas obtaining informed consent for large study samples would be impossible for practical reasons. Thus registers of this kind provide an excellent basis for efforts to improve health, welfare and the health care and social welfare services [2, 7].

Where previous reviews of the use of Nordic health care registers have focused on specific topics or on one country [2, 10–13], the first and main aim of the current work is to describe the registers used in psychiatric research and to discuss issues related to register-based research. Second, we will also briefly review a selection of studies in psychiatric epidemiology produced in Denmark, Finland, Norway and Sweden that have made use of such registers.
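As a minimal illustration of the linkage principle described above, the following Python sketch joins two registers on a personal identification code and then compares study participants with non-participants using register information. The field names (pid, diagnosis), the toy records and the helper functions are all hypothetical and are not drawn from any actual Nordic register.

```python
# Hypothetical toy registers; real registers hold far richer records.
population_register = [{"pid": f"P{i}"} for i in range(1, 7)]
discharge_register = [
    {"pid": "P1", "diagnosis": "295"},
    {"pid": "P4", "diagnosis": "295"},
    {"pid": "P5", "diagnosis": "296"},
    {"pid": "P5", "diagnosis": "295"},
]


def link_by_pid(population, discharges):
    """Attach each discharge record to the person with the same pid."""
    linked = {person["pid"]: [] for person in population}
    for record in discharges:
        if record["pid"] in linked:  # ignore records with unknown codes
            linked[record["pid"]].append(record)
    return linked


def share_with_register_diagnosis(pids, linked):
    """Proportion of the given persons with at least one discharge record."""
    pids = list(pids)
    if not pids:
        return 0.0
    return sum(1 for pid in pids if linked[pid]) / len(pids)


linked = link_by_pid(population_register, discharge_register)

# Assumed split of an invited sample into participants and non-participants.
participants = ["P1", "P2", "P3"]
non_participants = ["P4", "P5", "P6"]

print(share_with_register_diagnosis(participants, linked))
print(share_with_register_diagnosis(non_participants, linked))
```

In this invented sample a register diagnosis is more common among the non-participants, which is the kind of difference that a register-based attrition analysis is designed to reveal.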
8.2 Registers for use in psychiatric research
The Nordic countries have quite similar sets of administrative registers of relevance to health care, although their availability and commencement dates vary somewhat between the countries. The starting years of selected health registers in the four countries are summarised in Table 8.1, while Table 8.2 gives links to the web sites of selected maintainers of registers in the Nordic countries. Most of the information on such pages is also in English. In addition, Denmark and Finland have register centres, which have collected information and links related to register-based research.

Table 8.1 Starting years of various nationwide health care registers in Nordic countries.
Register                      Denmark  Finland  Norway  Sweden
Hospital discharge register   1969a    1967b    1990c   1965d
Causes of death register      1970e    1969     1951    1952
Disability pension register   1996     1962     1967    1971
Prescription register         1994     1994     2004    2005
Medical birth register        1973     1987     1967    1973
Cancer register               1987     1953     1953    1958
a Attempted suicide since 1989.
b Full coverage since 1972.
c The data are not identifiable to person. Personally identifiable data are now being gathered and when finished will be available from 1 March 2007.
d Full coverage since 1987.
e Suicides.

Table 8.2 Selected organisations maintaining health registers in the Nordic countries, their internet addresses and examples of their registers.
Register maintainers by country (internet address): examples of registers
Denmark
National Centre for Register-based Research (www.ncrr.au.dk): descriptions and addresses of various health registers
Centre for Suicide Research (www.selvmordsforskning.dk): suicide attempts
National Board of Health (www.sst.dk): causes of deaths, cancer register, psychiatric case register
National Social Appeals Board (www.ast.dk): disability pension register
Central Office of Civil Registration (www.cpr.dk): Civil Registration System
Statistics Denmark (www.dst.dk): medical birth register
Finland
Finnish Information Centre for Register Research (www.rekisteritutkimus.fi): descriptions and addresses of various health registers
National Institute for Health and Welfare (www.thl.fi): medical birth register, hospital discharge register
Finnish Centre for Pensions (www.etk.fi): Finnish Employment Register and Pensions Register
Social Insurance Institution (www.kela.fi): disability pensions, social benefits (e.g. unemployment), medication reimbursement register
Statistics Finland (www.stat.fi): cause of death register
Norway
Norwegian Institute of Public Health (www.fhi.no): Norwegian prescription database, Medical Birth Registry
Norwegian Labour and Welfare Service (www.nav.no): register of employers and employees, disability pensions, unemployment
Cancer Registry of Norway (www.kreftregisteret.no)
Statistics Norway (www.ssb.no): Norwegian Patient Register (statistics)
Sweden
National Board of Health and Welfare, Centre for Epidemiology (www.socialstyrelsen.se): patient register, causes of death register, medical birth register, medicine register, cancer register
Social Insurance Agency (www.forsakringskassan.se): disability pensions
Quality registers (www.kvalitetsregister.se)
Statistics Sweden (www.scb.se): Multigeneration register

In the following paragraphs we will briefly describe the different types of registers and give more detailed information on their content, especially in Finland, although their content is quite similar in Denmark, Norway and Sweden.

Nationwide, representative samples for study purposes can be obtained in Finland from the Central Population Register, also called the Population Information System, which is maintained by the Population Register Centre and local registry offices throughout the country [1]. Similar agencies exist in the other Nordic countries. The data registered for individual persons include name and personal identity code, address, nationality and mother tongue, marital status, dates of birth and death and information on emigration and immigration. The information is mainly updated by the authorities, but change-of-address information has to be provided by the individuals themselves. Information on people who have emigrated or are deceased is also kept in the register, emigrants being moved to the category of people who are absent from the country, and deceased individuals being assigned a date of death.

The population information system, together with the availability of other registers, has allowed the Nordic countries to replace questionnaire-based population censuses with register-based censuses. Denmark pioneered this by completing its first register-based census in 1981. Linked, cross-sectional data files (such as those required in a census) can in principle be constructed from a continuously updated population register and a number of other registers as often as needed, on a weekly or annual basis as appropriate.
8.2.1 Hospital discharge registers
The most commonly used health care registers are the hospital discharge registers. The FHDR, maintained by the National Institute for Health and Welfare, for instance, covers periods of treatment received in all public and private hospitals in Finland since the early 1970s. Data on the beginning and end of each in-patient stay, together with the primary diagnosis and up to three subsidiary diagnoses and a hospital identification code, are listed. The number of erroneous personal IDs in the Finnish administrative registers is negligible and the quality of the FHDR data has been improving continuously [14]. The Hospital Discharge Register (and the Register of Causes of Death) use the ICD classification and include complete diagnostic codes. Similar hospital discharge registers exist in all the Nordic countries, and information on outpatient treatments has also been included since the 1990s, first in Denmark and then in Sweden, but the coverage of the outpatient treatment data varies greatly.

The validity and reliability of an FHDR diagnosis of schizophrenia or schizophrenia spectrum psychosis (ICD-9 295) have been investigated in several studies, revealing a good concordance in general between clinical and research diagnoses for any psychosis [15–18]. Clinical diagnoses have been found by Isohanni [15] and Moilanen et al. [19] to be conservative, however, with over 40% of cases with a research diagnosis of schizophrenia having a register diagnosis of non-schizophrenic psychosis. Taiminen et al. [20] found a poorer validity for schizophrenia diagnoses, the kappa value between clinical diagnoses and the best-estimate research diagnoses being only 0.44 for schizophrenic disorders. The reliability of hospital diagnoses of schizophrenia has also been investigated in the other Nordic countries and has been found acceptable [21–23]. In a twin study Kieseppä et al. [24] validated bipolar disorder diagnoses in the FHDR, finding 92% accuracy for both bipolar I disorder and the manic type of schizoaffective disorder. The reliability of other psychiatric diagnoses has not been investigated. Several studies have investigated the reliability of diagnoses of other medical conditions [25, 26]. When investigating less severe disorders it is important to remember that the registers cover only patients treated in hospital and will therefore underestimate the true incidence and prevalence figures.
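The kappa values quoted in validation studies such as these measure agreement beyond chance between two sets of diagnoses. A minimal sketch of Cohen's kappa for a 2 × 2 table is given below; the function name and the counts are invented for illustration and do not come from any of the cited studies.

```python
def cohen_kappa(both_yes, register_only, research_only, both_no):
    """Cohen's kappa for two binary ratings (e.g. register vs. research diagnosis)."""
    n = both_yes + register_only + research_only + both_no
    observed = (both_yes + both_no) / n
    # Chance agreement derived from the marginal totals of each rating.
    register_yes = both_yes + register_only
    research_yes = both_yes + research_only
    expected = (
        register_yes * research_yes
        + (n - register_yes) * (n - research_yes)
    ) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 40 agreed cases, 40 agreed non-cases, 20 disagreements.
kappa = cohen_kappa(both_yes=40, register_only=10, research_only=10, both_no=40)
print(round(kappa, 2))  # 0.6
```

Here the raw agreement is 80%, but half of that would be expected by chance alone given the marginal totals, which is why kappa is lower than the raw agreement figure.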
8.2.2 Medication data
Antidepressants and antipsychotics are among the most widely used drugs at the population level. Although clinical trials are the primary source of information about the efficacy and effectiveness of drug treatment, they suffer from certain flaws [27]. The characteristics of the samples studied are usually not the same as those of the population that will ultimately use the tested drug, and even in large-scale trials the samples are not large enough to detect rare adverse effects. While randomisation guarantees certain aspects of correct inference in clinical trials, selective and often massive drop-out and/or drop-in during a long-term trial (e.g. over 70% in the famous CATIE schizophrenia trial [28]) can considerably complicate the interpretation of the results and their application to ‘real-world’ situations. This means that large-scale observational register linkage studies could provide invaluable information on drug treatment. These phase IV or post-marketing surveillance studies nevertheless require high-quality register data on prescriptions and on community and hospital care. Such studies are especially urgently required for antipsychotic drugs, as these are usually taken continuously for a very long period of time, often decades. Prescription registers and other sources
of administrative data have become an important source of information for carrying out pharmacoepidemiological studies, yielding data that can be used to study the pattern of medication in large populations and to estimate individual exposure for assessments of the effectiveness and safety of drug treatment.

There are two types of medication registers: prescription registers and medication reimbursement registers. All the Nordic countries have prescription databases [29, 30], and these are fairly similar in content. We will introduce the Finnish prescription register in more detail here. This contains information on all medications purchased in accordance with a doctor’s prescription, but before 2006 there was a cost threshold of 10 euros for basic reimbursement, which means that the register information may be incomplete for very cheap medicines. The prescription data available from the state-controlled Finnish Social Insurance Institution (SII) include the generic name of the drug and its Anatomical Therapeutic Chemical (ATC) classification system code, the brand name that was bought, the formulation and package, the amount, the date when the drug was purchased, the prescribing practice (primary vs. secondary health care) and the prescribing physician’s area of specialisation. The validity of the prescription database by comparison with patient-reported medication data has been studied in Nordic countries by Glintborg et al. [31], Haukka et al. [32] and Haapea et al. [33], for instance, and has been found to be good for antipsychotics and antidepressants but slightly poorer for sedatives and hypnotics. It should be noted that prescription databases do not include data on drugs used in hospitals, nor all the drugs used in daily care at hospitals or nursing homes.

The other register that contains information on medication supplied in Finland, the Medication Reimbursement Register maintained by the Finnish Social Insurance Institution, contains data on the diagnoses of persons receiving special reimbursements for outpatient medication for chronic diseases. Persons having ‘severe psychotic and other severe mental disorders’ are entitled to free antipsychotic and antidepressant medication. Unfortunately, the registration of diagnostic codes in the Medication Reimbursement Register is not complete: it often contains only the first three digits of the ICD diagnostic code, and in older data sets the ICD code may be missing entirely.
8.2.3 Cause of death registers
Cause of death registers are among the oldest registers in all the Nordic countries. The Finnish Causes of Death Register (FCDR), maintained by Statistics Finland, provides data on dates and causes of death and also stores death certificates. Statistics Finland has stored death certificates since 1936, but the data have been available in a combined electronic file only since 1969. A large validation study concluded that none of the personal identification codes in the FCDR was incomplete [14]. The register includes the personal identification number of each deceased person, sex, age, place of residence and principal, underlying and contributory causes of death. The routine validation of death certificates means that the accuracy of Nordic cause of death registers is good by international standards [34, 35]. The autopsy rate is also very high: one Finnish study, for instance, found it to be 31% for all deceased persons aged 1 year or more [35]. Cause of death registers have been used in psychiatric research to study topics such as mortality due to various somatic disorders and, especially, suicide [36]. When studying suicidal behaviour it is also possible to include data on suicide attempts, which may be included in hospital discharge registers as external causes of hospitalisation [37].
8.2.4 Other registers
There are also several other nationwide registers available in the Nordic countries, some of which include data collected more for the purposes of research than administration. The Nordic countries have some unique biobanks. The Finnish Maternity Cohort, started in 1983, for example, currently contains approximately 1.5M serum samples from about 750 000 pregnant women (∼98% of all pregnancies during that period). These samples can be used for scientific research and linked to data from other sources, including personal identification numbers, numbers of pregnancies and deliveries and places of residence [38]. Denmark has been storing dried blood spot samples from all newborn infants since 1982 as a part of a neonatal screening programme, and this biobank has been regulated by specific legislation since 1993, granting it a unique position among biological specimen banks. Specimens from this source can also be used for research purposes, and have been used to investigate prenatal and neonatal infections and their association with schizophrenia, for example [39].

There are some specific registers in Sweden, such as the Multigeneration Register and the Quality Registers. The Multigeneration Register provides information on all the people who have been resident in Sweden since 1960 and who were born in 1932 or later, and on their biological parents. This makes it possible to trace all first-degree and second-degree relatives of these people who were alive in 1947 or later. The Quality Registers (www.kvalitetsregister.se) collect data on particular areas of health care, for example costs and outcomes, in order to motivate improvements. Data are being collected in 2009 on the treatment of eating disorders and substance dependence.

There is a strong history of twin studies in the Nordic countries [40, 41], and these too have utilised registers. For instance, in Finland multiple births since the 1950s can be identified through the use of family member links added in the early 1970s for all persons in the database of the Population Register Centre [41]. There also exist specific twin registers [40, 42, 43]. Registers have also played an important role in adoptive family studies [44].
Both the Danish [45] and Finnish [46] adoptive family studies of schizophrenia have utilised several registers in finding individuals with schizophrenia who have given a child up for adoption (the Finnish study) or individuals who have been adopted and have developed schizophrenia (the Danish study), and their biological and adoptive relatives as well as control adoptees.

There are also several social welfare registers which have been used in psychiatric research, for example for case finding purposes or for studying the outcomes of psychiatric disorders. These include registers of (disability) pensions, social benefits, sick leave, unemployment, working periods, incomes and housing. Crime registers have commonly been used in forensic psychiatry, and others such as medical birth registers, birth defect registers and cancer registers have been used in psychiatric research. The Finnish and Swedish conscript registers have also been used for research purposes [46, 47]. Conscripts undergo a statutory medical examination, but males with known severe handicaps or chronic diseases are generally not accepted for conscription. The examination usually takes place at 17–19 years of age and consists of a health examination and an assessment of intellectual performance. The Finnish and Swedish school and education registers have also been used in psychiatric research [48].
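The tracing of relatives that a register such as the Multigeneration Register makes possible can be sketched as follows. The data structure, a mapping from each person to a (mother, father) pair, and all the identifiers are hypothetical simplifications; the actual layout of the register is not described here.

```python
# Hypothetical parent links: person -> (mother, father).
parent_links = {
    "child_a": ("mother_1", "father_1"),
    "child_b": ("mother_1", "father_1"),  # full sibling of child_a
    "child_c": ("mother_1", "father_2"),  # maternal half-sibling of child_a
    "grandchild": ("child_a", "partner_x"),
}


def first_degree_relatives(pid, links):
    """Return the parents, full siblings and children of pid."""
    relatives = set()
    own_parents = links.get(pid)
    if own_parents is not None:
        relatives.update(own_parents)
    for other, other_parents in links.items():
        if other == pid:
            continue
        if pid in other_parents:
            relatives.add(other)  # pid is a parent of this person
        elif own_parents is not None and other_parents == own_parents:
            relatives.add(other)  # both parents shared: full sibling
    return relatives


print(sorted(first_degree_relatives("child_a", parent_links)))
```

Half-siblings, who share only one parent, are second-degree relatives and are therefore excluded here; they could be found in the same way by matching on a single parent.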
8.3 Register research in Denmark
Denmark has been a pioneer of register-based research, being the first country to abandon questionnaire-based censuses entirely, in 1981, and to base all its censuses on registers only. The use of registers for psychiatric research purposes in Denmark has mainly involved epidemiological studies of schizophrenia, mania, depression and suicide. In their milestone work on the effects of family history and place and season of birth on the risk of schizophrenia, Mortensen et al. [49] were able to show that although family history is a strong risk factor, other more common risk factors such as place and season of birth may play a more prominent role at the population level. They estimated that the population-attributable fraction (PAF) of having a parent or sibling with schizophrenia was 5.5%, whereas that of place of birth was 34.6%. This work involved linking the Danish Civil Registration System with the Danish Psychiatric Central Register. Later, the same group showed that the more urban the area of upbringing, the higher the risk of developing schizophrenia in adult life [50]. High-quality research from Denmark has also been published on other risk factors for schizophrenia, for example advanced paternal age [51], autoimmunity [52], prenatal infections [39, 53] and prenatal maternal stress [54].
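To see why a common but weak risk factor can outweigh a rare but strong one at the population level, consider Levin's classic formula for the population-attributable fraction of a single dichotomous exposure, PAF = p(RR − 1) / [1 + p(RR − 1)], where p is the proportion of the population exposed and RR is the relative risk. The sketch below applies it to two invented exposures; the figures are hypothetical, and the estimator used in the study cited above may well differ.

```python
def levin_paf(exposed_proportion, relative_risk):
    """Levin's population-attributable fraction for one binary exposure."""
    excess = exposed_proportion * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical exposures: rare with a high relative risk vs. common with a modest one.
rare_strong = levin_paf(exposed_proportion=0.01, relative_risk=9.0)
common_weak = levin_paf(exposed_proportion=0.50, relative_risk=2.0)

print(f"rare, strong exposure: {rare_strong:.3f}")
print(f"common, weak exposure: {common_weak:.3f}")
```

With these assumed values the common exposure accounts for roughly a third of cases and the rare one for well under a tenth, even though its relative risk is far higher.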
The Danes have also carried out intensive investigations into the outcomes of children of parents with severe mental disorders, finding, for example, that their mortality risk is elevated perinatally and during the first year of life [55, 56], and also in adolescence and young adulthood [57]. The power of large-scale register linkage was also shown in a study of the association of measles, mumps and rubella (MMR) vaccination with autism [58], the material for which was obtained by linking data from the Danish Civil Registration System, the Danish Psychiatric Central Register, vaccination data reported by general practitioners to the National Board of Health, the National Hospital Registry and the Danish Medical Birth Registry. The study covered 537 303 children (82.0% vaccinated) and over 2M person-years, and it provided strong evidence against the hypothesis that MMR vaccination causes autism.

One example of a Danish pharmacoepidemiological register linkage study used a population of 2.1M individuals aged 50 years and over to study the association between increased use of antidepressants and decreasing suicide rates [59]. The authors were able to show that only a small proportion of the individuals concerned were receiving treatment with antidepressants at the time of their death, and they concluded that active treatment with antidepressants seems to account for only 10% of the decline in the suicide rate. Another study in which the Danish Civil Registration System was linked with the Danish Psychiatric Central Register found no support for the hypothesis that depression independently increases the risk of cancer [60].

Because suicide mortality in Denmark was very high in the 1980s, this subject has been studied extensively. Nordentoft et al. [61] showed that natural and unnatural mortality remained high 10 years after an attempted suicide. A nationwide study showed that people with mental disorders also run a risk of death by homicide and other unnatural causes [62]. Risk factors for suicide are different for psychiatric patients, with high income and postgraduate employment emerging as risk factors, which was not the case in the general population [63]. Frequent changes of residence in childhood were associated with an increased risk of suicide in a study that used the Danish Civil Registration System combined with the Danish Psychiatric Central Register [64].
8.4 Register research in Finland
The versatile registers available in Finland have been used in many studies, including international scientific collaborations. Some analyses have led to substantial new findings of major clinical relevance. Several reports in the early 1990s [65] suggested that the incidence of schizophrenia was declining. A Finnish study combined information from the hospital discharge register, pension register and medication reimbursement register and carried out an age-period-cohort analysis of changes in the incidence of schizophrenia among birth cohorts born between 1954 and 1965 [66]. The incidence had declined, and the effects of period and cohort on the change were both significant. While the effect of period reflects the operation of related confounding factors such as changes in diagnostic criteria, the cohort effect suggests that the intensity or frequency of one or more risk factors for schizophrenia may have decreased in these birth cohorts.

A recent Finnish study found a high lifetime prevalence of psychotic disorders (DSM-IV) in Finland, 3.1%, which rose to 3.5% when the non-responder group and their register diagnoses were included. Registers were the most important and reliable screening method, the kappa value for psychotic disorders being 0.80 for the Hospital Discharge Register, while the CIDI interview section on psychotic symptoms was able to identify only 27% of the persons with psychotic disorders, due to considerable under-reporting of psychotic episodes and symptoms [67]. Researchers in Finland have found marked regional variation in the incidence and prevalence of schizophrenia, but negligible urban–rural variation [68, 69].

The Finnish Adoptive Family Study of Schizophrenia used registers to follow up adoptees. The main finding of the study was that adoptees at high genetic risk are significantly more sensitive to adverse vs. ‘healthy’ rearing patterns in adoptive families than are adoptees at low genetic risk [70]. Studies of Finnish twin cohorts have also utilised registers in investigating various psychiatric disorders [41]. Finnish twin studies based on national registers have found over 80% heritability for schizophrenia [71] and bipolar disorder [72].

Tiihonen et al. [73], who studied the relation between antidepressant treatment and the risk of suicide and overall mortality through a nationwide computerised database, observed a substantially lower mortality rate among subjects receiving a selective serotonin reuptake inhibitor. Current use of medication among the subjects who had used an antidepressant at some time was associated with a markedly decreased risk of completed suicide and mortality as compared with no current use of medication. The lower mortality was attributable to a decrease in cardiovascular and cerebrovascular deaths during selective serotonin reuptake inhibitor use. Tiihonen et al. [74] also studied the association between prescribed antipsychotic drugs and outcome in cases of schizophrenia or schizoaffective disorders in the community, using national central registers and a series of 2230 adults hospitalised in Finland. Initial use of clozapine, a perphenazine depot and olanzapine was associated with the lowest rates of discontinuation, while the rate for oral haloperidol was higher, and the first-mentioned drugs also carried the lowest risk of rehospitalisation. Mortality was markedly higher in patients not taking antipsychotics, and the risk of suicide was also high in these cases. In a recent study, Tiihonen et al. [75] found that among second-generation antipsychotic drugs clozapine was associated with a substantially lower mortality than any other antipsychotic.

In the scientifically valuable birth cohort setting it is possible to pool register data with clinical and observational data, whereas most large epidemiological, genetic or imaging studies are based on clinical case series, which are not representative. The aim of the Northern Finland 1966 Birth Cohort has been to analyse the developmental pathways of schizophrenic psychoses from the fetal period to adulthood, especially with respect to risk factors and outcomes, including genome-wide analyses and brain morphology [76]. The cases for the cohort have been obtained from the hospital discharge register.
One aim has been to determine whether adult-onset schizophrenia is associated with abnormalities during pregnancy, delivery or the neonatal period [77]. Both low and high birth weight were more common among the schizophrenic subjects. The same cohort data have also been used to study register-based outcomes, for example by Miettunen et al. [78], who studied work periods and disability pensions from registers and found that almost half of the patients
with schizophrenic psychoses had not been pensioned off after an average follow-up of 10 years. One example of a study analysing psychiatric comorbidity in somatic illness is the comparison of the incidence and severity of depression in stroke patients and their main caregivers in four districts of Finland, two with and two without after-discharge intervention programmes [79]. In this case a population-based stroke register was used. Fewer patients in the districts with active programmes were depressed than in the control districts. Another example of register linkage was the work by Gissler et al. [80] to determine rates of suicide associated with pregnancy by the type of pregnancy outcome. Information on suicides was linked with the Finnish birth, abortion and hospital discharge registers to find out how many women who committed suicide had had a completed pregnancy during their last year of life. Given a mean annual suicide rate of 11.3 per 100 000, the rate associated with birth was significantly lower (5.9) and those associated with miscarriage (18.1) and induced abortion (34.7) were significantly higher.
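The suicide rates reported by Gissler et al. [80] can be restated as rate ratios relative to the mean annual rate. The following is a minimal sketch in Python that uses only the four rates quoted above; the variable and function names are illustrative assumptions, and no confidence intervals are computed because the underlying event counts and person-years are not given here.

```python
# Rates per 100 000, as quoted in the text for Gissler et al. [80].
MEAN_ANNUAL_RATE = 11.3

rates_by_pregnancy_outcome = {
    "birth": 5.9,
    "miscarriage": 18.1,
    "induced abortion": 34.7,
}


def rate_ratio(rate, reference):
    """Return the ratio of a rate to a reference rate (same denominator units)."""
    if reference <= 0:
        raise ValueError("reference rate must be positive")
    return rate / reference


# Ratio of each pregnancy-associated rate to the mean annual rate.
rate_ratios = {
    outcome: rate_ratio(rate, MEAN_ANNUAL_RATE)
    for outcome, rate in rates_by_pregnancy_outcome.items()
}

for outcome, ratio in rate_ratios.items():
    print(f"{outcome}: {ratio:.2f} times the mean annual rate")
```

On these figures the rate associated with birth is roughly half the mean annual rate, while the rate associated with induced abortion is roughly three times as high.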
8.5 Register research in Norway
Early Norwegian studies using the Norwegian Psychiatric Case Register focused on topics such as admission rates for schizophrenia [81]. Hansen et al. [82] studied total mortality among people admitted to psychiatric hospitals and concluded that mortality among psychiatric patients is still unsatisfactorily high, and that men constitute a special high-risk group. Later they also studied cause-specific mortality among psychiatric patients after deinstitutionalisation and found more cardiovascular deaths and unnatural deaths among such patients in both genders, but especially among men. Strand and Kunst [83] studied suicide mortality using registry data on 613 807 Norwegians born in 1955–1965. Suicide mortality was higher among women with a high childhood socioeconomic position than among those with a low childhood socioeconomic position. They suggested downward mobility and failure to meet the high demands set by well-educated parents, psychological distress, mental disorder, gender differences and
social networks and norms as possible mechanisms for this finding. Tellnes et al. [84] analysed persons in Norway with long-term sickness certification at the end of 1990, based on data recorded by the National Insurance Administration. Among cases of long-term sickness certification, mental disorders had a prevalence of 3.1 per 1000 employed persons. The work demonstrates the possibility of using data from registers to provide information on the epidemiology of long-term sickness certification. The authors concluded, however, that it is necessary to further improve the validity of the data. Hagen et al. [85] linked the genetic data from the Nord-Trøndelag Health Study with antipsychotic medication data from the Norwegian Prescription Database. The Val158Met polymorphism in the COMT gene had no major impact on the number of individuals who had been prescribed antipsychotic medication, but the subjects with the Met/Met genotype were receiving the highest median daily doses of antipsychotics. Bramness et al. [86], who studied the muscle relaxant carisoprodol and its use and abuse on the basis of the Prescription Database, concluded that this drug was widely used and that the skewness in its use indicated a potential for abuse. The Prescription Database has also been used to study trends in the use of selective serotonin reuptake inhibitors [87]. The Norwegian Twin Registers have also commonly been used for research in psychiatry [40]. The first Norwegian twin study based on the Norwegian Twin Registers was by Kringlen [88], who studied genetic factors in psychoses. An important finding was that problems with sampling techniques in earlier studies of schizophrenia had resulted in overestimation of the genetic factors.
8.6 Register research in Sweden
Sweden took a step towards promoting the use of health care registers for research purposes by founding the Swedish Centre for Epidemiology in 1992, and outstanding epidemiological research based on Swedish registers has been published during the past 15 years. In particular, there is a strong tradition of risk factor research in Sweden, work which has included several landmark studies.
Although investigations carried out in many countries had suggested that the incidence and prevalence of schizophrenia are higher in large cities than in rural areas, it was long assumed that this was caused by geographical drift of persons with a higher risk of schizophrenia from rural to urban areas. Lewis et al. [89] nevertheless showed in a 1992 follow-up of the 1969–1970 conscripts that it was urban upbringing, not urban residence, that increased the risk of schizophrenia. Another Swedish study showed that the effect of urban place of birth was not related to obstetric complications or socioeconomic status in childhood [90]. Another landmark study was the longitudinal follow-up of a Swedish conscript cohort, which suggested that cannabis use in adolescence or young adulthood is a risk factor for schizophrenia [91]. A later follow-up of the same cohort showed a linear trend between the frequency of cannabis use and the risk of schizophrenia, with a 3.1-fold higher risk of schizophrenia among those who had used cannabis over 50 times compared to those who had never used it [92]. Although there have been several investigations into cannabis and psychosis, this Swedish study is still the only one which has been able to use a schizophrenia diagnosis as the outcome, owing to its large sample size and register-based outcome assessment. Swedish research groups have actively used the Swedish birth register to investigate prenatal and perinatal risk factors for severe mental disorders. Hultman et al. [93] and Dalman et al. [94] showed that several specific obstetric complications increase the risk of schizophrenia and to some extent also the risk of affective and reactive psychoses. Other Swedish studies on childhood risk factors for psychotic disorders have found that serious viral infections of the central nervous system [95] and poor school performance [96] both increase the risk of nonaffective psychosis.
A recent Swedish family study combined the multigeneration register and the hospital discharge register to investigate whether schizophrenia and bipolar disorder share a common genetic risk [97]. Heritability was estimated at 64% for schizophrenia and 59% for bipolar disorder, with a genetic correlation of 0.60 between the two disorders [97]. One key study concerning mortality in cases of
psychiatric disorders was that published by Allebeck and Wistedt [5], which showed that persons with schizophrenia have an increased risk of mortality from all causes of death and from suicide in particular. A recent Swedish follow-up study compared the risk of death by suicide after a suicide attempt in different psychiatric disorders, and found the highest risks of suicide among patients with schizophrenia and with unipolar depressive or bipolar disorders [98]. The Swedish Twin Registry has been used, for example, to investigate the heritability of major depression. Kendler et al. [99] found, for the first time, that the heritability of liability to major depression was significantly higher in women (42%) than in men (29%).
8.7 Discussion
8.7.1 Main findings
The Nordic registers are unique nationwide registers that have been used in numerous high-quality investigations. Nationwide hospital registers have been used to study admission rates and also, in combination with interview data, to assess the incidence and prevalence of certain disorders. Hospital registers have also commonly been used for case finding, especially in more severe disorders such as schizophrenia. Registers from early life, such as those of births, have been used as sources for exposure variables in connection with various register or interview outcomes (e.g. suicides), and registers have enabled hypotheses to be tested which otherwise would have been difficult to study reliably, for example due to the rarity of the events or disorders concerned.
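The basic pattern described above, an early-life register supplying the exposure and a hospital register supplying the cases, can be sketched as a deterministic linkage on a personal identifier. The example below is only an illustration: the records, the field names (person_id, birth_weight_g, diagnosis, admitted) and the helper functions are all hypothetical and do not reproduce the layout of any actual Nordic register.

```python
from datetime import date

# Hypothetical toy records; every identifier, date and field name is invented.
birth_register = [
    {"person_id": "A1", "birth_weight_g": 2300},
    {"person_id": "B2", "birth_weight_g": 3500},
    {"person_id": "C3", "birth_weight_g": 4700},
]
hospital_discharge_register = [
    {"person_id": "A1", "diagnosis": "schizophrenia", "admitted": date(1990, 5, 2)},
    {"person_id": "A1", "diagnosis": "schizophrenia", "admitted": date(1992, 1, 15)},
    {"person_id": "D4", "diagnosis": "schizophrenia", "admitted": date(1991, 3, 9)},
]


def first_admissions(discharges, diagnosis):
    """Map person_id to the earliest admission date for the given diagnosis."""
    first = {}
    for record in discharges:
        if record["diagnosis"] != diagnosis:
            continue
        pid = record["person_id"]
        if pid not in first or record["admitted"] < first[pid]:
            first[pid] = record["admitted"]
    return first


def link_exposure_to_outcome(births, discharges, diagnosis):
    """Attach case status to every birth record by matching on person_id."""
    onset = first_admissions(discharges, diagnosis)
    return [
        {
            **birth,
            "case": birth["person_id"] in onset,
            "first_admission": onset.get(birth["person_id"]),
        }
        for birth in births
    ]


linked = link_exposure_to_outcome(
    birth_register, hospital_discharge_register, "schizophrenia"
)
for row in linked:
    print(row)
```

Every birth record is kept, whether or not it matches a hospital record, so the linked file describes the whole birth population rather than the cases alone. Discharge records with no matching birth record (person D4 here) are dropped from this particular analysis file.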
8.7.2 Methodological and administrative challenges
In principle (but not always in practice) it is easy to obtain, analyse and publish data from registers. In a real-world setting, however, the use of register data involves both practical and methodological challenges [2]. Register data are usually collected for administrative or clinical purposes, and not for scientific study. The data are often superficial, and exposure and outcome definitions may be imprecise.
The variables used may reflect events that are easy to categorise, but not usually the complicated measures needed for analysing psychological and qualitative events as required for psychiatric research. Information on family and social environments is scarce, for example, and while we can obtain information on the medication prescribed for the patient, we have no information on why the doctor chose this particular medication. Thus unidentified confounding factors are one major limitation of register-based research. Most registers include only subjects with severe psychiatric disorders that require hospitalisation or medical treatment, which may cause bias. The usefulness of health care registers for investigating a disease depends on the extent to which medical care is provided at different levels and in different units [100]. Most people with chronic psychotic disorders receive hospital care at some point in their illness [67], and thus psychotic disorders can be studied using a hospital discharge register. By contrast, most persons with depressive disorders or a cluster A personality disorder [101], for example, receive outpatient treatment or are not treated at all, and the cases treated in hospital represent only the tip of the iceberg. This explains why a substantial proportion of studies using registers are concerned with schizophrenia and other psychotic disorders. Byrne et al. [4] have reviewed studies of the validity of administrative registers and have concluded that these most often concern hospital discharge registers and that relatively little high-quality work exists on the topic. At best we may have longitudinal data collected over a period of decades which cover the entire lifespan of certain individuals. This may allow the exact time of exposure to be defined, but analysing the complex trajectories and pathways between exposure (e.g. genetic disposition or childhood adversity) and outcome (e.g.
mental disorder) is challenging and full of potential mediating and confounding factors and effect modifications. The major advantages of register-based data are minimal attrition and the possibility of achieving high power when studying topics which would otherwise be unapproachable. In practice, many methodological problems remain in analysing possible causal relationships in observational studies, for instance those based on large register data sets. Modelling techniques have been developed further in recent years,
however, and, given certain assumptions, marginal structural models can partly resolve the problem of causal inference in observational studies [102].
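The idea behind the weighting used to fit marginal structural models can be sketched for the simplest case, a single binary treatment with one measured binary confounder. The example below is a hedged illustration on simulated data, not an analysis of any register: the data-generating model, the effect size and all names are assumptions chosen for the demonstration. It relies on exactly the kind of assumptions referred to above, namely that the confounder is measured, that both treatment levels occur at every confounder level, and that the treatment probabilities are estimated correctly.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of treatment A on outcome Y


def simulate(n, seed=0):
    """Simulate (L, A, Y) where L confounds the effect of A on Y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        l = 1 if rng.random() < 0.5 else 0
        # Treatment is much more likely when the confounder is present.
        a = 1 if rng.random() < (0.8 if l else 0.2) else 0
        y = TRUE_EFFECT * a + 2.0 * l + rng.gauss(0.0, 1.0)
        data.append((l, a, y))
    return data


def mean_difference(data, weights):
    """Weighted mean of Y among the treated minus that among the untreated."""
    means = {}
    for arm in (0, 1):
        num = sum(w * y for (l, a, y), w in zip(data, weights) if a == arm)
        den = sum(w for (l, a, y), w in zip(data, weights) if a == arm)
        means[arm] = num / den
    return means[1] - means[0]


def stabilised_weights(data):
    """Weight each subject by P(A = a) / P(A = a | L = l), estimated from the data."""
    n = len(data)
    p_treated = sum(a for _, a, _ in data) / n
    p_treated_given_l = {}
    for level in (0, 1):
        stratum = [a for l, a, _ in data if l == level]
        p_treated_given_l[level] = sum(stratum) / len(stratum)
    weights = []
    for l, a, _ in data:
        numerator = p_treated if a == 1 else 1 - p_treated
        denominator = p_treated_given_l[l] if a == 1 else 1 - p_treated_given_l[l]
        weights.append(numerator / denominator)
    return weights


data = simulate(20000)
naive_estimate = mean_difference(data, [1.0] * len(data))
ipw_estimate = mean_difference(data, stabilised_weights(data))
print(f"naive: {naive_estimate:.2f}, weighted: {ipw_estimate:.2f}, true: {TRUE_EFFECT:.2f}")
```

In this simulation the unweighted comparison overstates the effect because treated subjects more often carry the confounder, whereas the weighted comparison should land close to the assumed true effect. With time-varying treatments the same logic is applied at each time point and the weights are multiplied over time, which this sketch does not attempt.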
8.7.3 The Nordic countries: An epidemiologist’s paradise?
The metaphor contained in this title is sometimes presented by non-Nordic scientists without the question mark. In the real world, hard work, data processing, methodological skills and teamwork are needed for successful register-based studies. Nordic registers at best are population-based and achieve high levels of ascertainment. They are ideal for epidemiological purposes, as the biases common to many epidemiological surveys (such as information bias) are minimal, and loss to follow-up occurs only through emigration or death. Explanatory variables and outcome data are collected prospectively and the number of cases can be large, so that it is possible to investigate rare exposure events, for example specific birth complications, in relation to rare outcomes, for example schizophrenia. The authors are most familiar with the Finnish register system; we acknowledge that our aim to review all Nordic countries in a balanced way may not have been successful. The current review did not include Iceland (population of 0.3 million), which also has nationwide registers that have been used in psychiatric research. The current focus there is on the anonymous Icelandic Healthcare Database, which has been constructed by a private company, deCODE Genetics. This database has opened up unique possibilities for modelling disease risk as a function of genetic and environmental factors and has resulted in the identification of risk genes for several disorders, including schizophrenia [103, 104]. For administrative and also ethical reasons, some registers are not yet available for research purposes or for linkage to other register data. The introduction of new registers will nevertheless make it possible to study many topics more reliably. More extensive outpatient registers, for instance, will make it possible to identify more subjects with less severe psychiatric disorders.
Until recent years, outpatient registers have often been kept locally, or else nationwide registers have had limited coverage. The accumulating medication data should be used in the future to carry out observational, population-level, phase IV efficacy, effectiveness and safety studies on psychoactive drugs. Register data could also be used more commonly together with data collected from interviews. In the future, increasing international collaboration and the combining of different registers within and between countries will open up further possibilities for studying novel topics.
Acknowledgements
This work has been supported by the Academy of Finland (#125 853, J.M.; #129 434, J.S.; #110 143, M.I.), NARSAD: Brain and Behavior Research Fund (J.M., J.S., M.I.), and the Sigrid Jusélius Foundation (J.S., M.I.).
References
[1] Statistics Finland (2004) Use of Register and Administrative Data Sources for Statistical Purposes Best Practices of Statistics Finland. Statistics Finland Handbooks 45. Available in English www.stat.fi/tup/julkaisut/kasikirjoja_45_en.pdf (accessed 7 October 2010).
[2] Gissler, M. and Haukka, J. (2004) Finnish health and social welfare registers in epidemiological research. Nor. Epidemiol., 14, 113–120.
[3] Häfner, H. and der Heiden, W. (1986) The contribution of European case registers to research on schizophrenia. Schizophr. Bull., 12, 26–51.
[4] Byrne, N., Regan, C. and Howard, L. (2005) Administrative registers in psychiatric research: a systematic review of validity studies. Acta Psychiatr. Scand., 112, 409–414.
[5] Allebeck, P. and Wistedt, B. (1986) Mortality in schizophrenia. A ten-year follow-up based on the Stockholm County inpatient register. Arch. Gen. Psychiatry, 43, 650–653.
[6] Bloor, R.N. (1995) Setting up a psychiatric case register. Adv. Psychiatr. Treat., 1, 86–91.
[7] Wierdsma, A.I., Sytema, S., van Os, J.J. et al. (2008) Case registers in psychiatry: do they still have a role for research and service monitoring? Curr. Opin. Psychiatry, 21, 379–384.
[8] de Graaf, R., Bijl, R.V., Smit, F. et al. (2000) Psychiatric and sociodemographic predictors of attrition in a longitudinal study. Am. J. Epidemiol., 152, 1039–1047.
[9] Haapea, M., Miettunen, J., Veijola, J. et al. (2007) Non-participation may bias the results of a psychiatric survey. An analysis from the survey including magnetic resonance imaging within the Northern Finland 1966 Birth Cohort. Soc. Psychiatry Psychiatr. Epidemiol., 42, 403–409.
[10] Munk-Jørgensen, P., Kastrup, M. and Mortensen, P.B. (1993) The Danish psychiatric register as a tool in epidemiology. Acta Psychiatr. Scand., 370, 27–32.
[11] Cappelen, I. and Lyshol, H. (2004) An overview of the health registers in Norway (in Norwegian, with English abstract). Nor. Epidemiol., 14, 33–38.
[12] Mortensen, P.B. (2004) Register-based research in Denmark. Nor. Epidemiol., 14, 121–124 (in Danish, with English abstract).
[13] Otterblad Olausson, P., Spetz, C.L. and Rosén, M. (2004) A large use of register data in Swedish research – a Nordic competitive advantage. Nor. Epidemiol., 14, 125–128 (in Swedish).
[14] Pajunen, P., Koukkunen, H., Ketonen, M. et al. (2005) The validity of the Finnish Hospital discharge register and causes of death register data on coronary heart disease. Eur. J. Cardiovasc. Prev. Rehabil., 12, 132–137.
[15] Isohanni, M., Mäkikyrö, T., Moring, J. et al. (1997) A comparison of clinical and research DSM-III-R diagnoses of schizophrenia in a Finnish national birth cohort. Soc. Psychiatry Psychiatr. Epidemiol., 32, 303–308.
[16] Mäkikyrö, T., Isohanni, M., Moring, J. et al. (1998) Accuracy of register-based schizophrenia diagnoses in a genetic study. Eur. Psychiatry, 13, 57–62.
[17] Arajärvi, R., Suvisaari, J., Suokas, J. et al. (2005) Prevalence and diagnosis of schizophrenia based on register, case record and interview data in an isolated Finnish birth cohort born 1940–1969. Soc. Psychiatry Psychiatr. Epidemiol., 40, 808–816.
[18] Pihlajamaa, J., Suvisaari, J., Henriksson, M. et al. (2008) The validity of schizophrenia diagnosis in the Finnish hospital discharge register: findings from a 10-year birth cohort sample. Nord. J. Psychiatry, 62, 198–203.
[19] Moilanen, K., Veijola, J., Läksy, K. et al. (2003) Reasons for the diagnostic discordance between clinicians and researchers in the Northern Finland 1966 Birth Cohort. Soc. Psychiatry Psychiatr. Epidemiol., 38, 305–310.
[20] Taiminen, T., Ranta, K., Karlsson, H. et al. (2001) Comparison of clinical and best-estimate research DSM-IV diagnoses in a Finnish sample of first-admission psychosis and severe affective disorder. Nord. J. Psychiatry, 55, 107–111.
[21] Kristjansson, E., Allebeck, P. and Wistedt, B. (1987) Validity of the diagnosis schizophrenia in a psychiatric inpatient register. Nord. Psykiatr. Tidsskr., 43, 229–234.
[22] Löffler, W., Häfner, H., Fätkenheuer, B. et al. (1994) Validation of Danish case register diagnosis for schizophrenia. Acta Psychiatr. Scand., 90, 196–203.
[23] Dalman, C., Broms, J., Cullberg, J. et al. (2002) Young cases of schizophrenia identified in a national inpatient register. Soc. Psychiatry Psychiatr. Epidemiol., 37, 527–531.
[24] Kieseppä, T., Partonen, T., Kaprio, J. et al. (2000) Accuracy of register- and record-based bipolar I disorder diagnoses in Finland – a study of twins. Acta Neuropsychiatr., 12, 106–109.
[25] Ingelsson, E., Ärnlöv, J., Sundström, J. et al. (2005) The validity of a diagnosis of heart failure in a hospital discharge register. Eur. J. Heart Fail., 7, 787–791.
[26] Elo, S.L. and Karlberg, I.H. (2009) Validity and utilization of epidemiological data: a study of ischaemic heart disease and coronary risk factors in a local population. Public Health, 123, 52–57.
[27] Flay, B.R. (1986) Efficacy and effectiveness trials (and other phases of research) in the development of health promotion programs. Prev. Med., 15, 451–474.
[28] Lieberman, J.L., Stroup, T.S., McEvoy, J.P. et al. (2005) Effectiveness of antipsychotic drugs in patients with chronic schizophrenia. N. Engl. J. Med., 353, 1209–1223.
[29] Gaist, D., Sørensen, H.T. and Hallas, J. (1997) The Danish prescription registries. Dan. Med. Bull., 44, 445–448.
[30] Furu, K. (2008) Establishment of the nationwide Norwegian prescription database (NorPD) – new opportunities for research in pharmacoepidemiology in Norway. Nor. Epidemiol., 18, 129–136.
[31] Glintborg, B., Hillestrom, P.R., Olsen, L.H. et al. (2007) Are patients reliable when self-reporting medication use? Validation of structured drug interviews and home visits by drug analysis and prescription data in acutely hospitalized patients. J. Clin. Pharmacol., 47, 1440–1449.
[32] Haukka, J., Suvisaari, J., Tuulio-Henriksson, A. et al. (2007) High concordance between self-reported medication and official prescription database information. Eur. J. Clin. Pharmacol., 63, 1069–1074.
[33] Haapea, M., Miettunen, J., Lindeman, S. et al. (2010) Concordance between self-reported and pharmacy data on medication use in the Northern Finland 1966 Birth Cohort. Int. J. Methods Psychiatr. Res., 19, 88–96.
[34] Johansson, L.A. and Westerling, R. (2000) Comparing Swedish hospital discharge records with death certificates: implications for mortality statistics. Int. J. Epidemiol., 29, 495–502.
[35] Lahti, R.A. and Penttilä, A. (2001) The validity of death certificates: routine validation of death certification and its effects on mortality statistics. Forensic Sci. Int., 115, 15–32.
[36] Heilä, H., Haukka, J., Suvisaari, J. et al. (2005) Mortality among patients with schizophrenia and reduced psychiatric hospital care. Psychol. Med., 35, 725–732.
[37] Haukka, J., Suominen, K., Partonen, T. et al. (2008) Determinants and outcomes of serious attempted suicide: a nationwide study in Finland, 1996–2003. Am. J. Epidemiol., 167, 1155–1163.
[38] Holl, K., Lundin, E., Kaasila, M. et al. (2008) Effect of long-term storage on hormone measurements in samples from pregnant women: the experience of the Finnish maternity cohort. Acta Oncol., 47, 406–412.
[39] Mortensen, P.B., Nørgaard-Pedersen, B., Lindum Waltoft, B. et al. (2007) Toxoplasma gondii as a risk factor for early-onset schizophrenia: analysis of filter paper blood samples obtained at birth. Biol. Psychiatry, 61, 688–693.
[40] Bergem, A.L. (2002) Norwegian Twin Registers and Norwegian twin studies – an overview. Twin Res., 5, 407–414.
[41] Kaprio, J. (2006) Twin studies in Finland 2006. Twin Res. Hum. Genet., 9, 772–777.
[42] Lichtenstein, P., De Faire, U., Floderus, B. et al. (2002) The Swedish twin registry: a unique resource for clinical, epidemiological and genetic studies. J. Intern. Med., 252, 184–205.
[43] Skytthe, A., Kyvik, K., Bathum, L. et al. (2006) The Danish Twin Registry in the new millennium. Twin Res. Hum. Genet., 9, 763–771.
[44] Tienari, P., Wynne, L.C., Moring, J. et al. (2000) Finnish adoptive family study: sample selection and adoptee DSM-III-R diagnoses. Acta Psychiatr. Scand., 101, 433–443.
[45] Rosenthal, D., Wender, P.H., Kety, S.S. et al. (1971) The adopted-away offspring of schizophrenics. Am. J. Psychiatry, 128, 307–311.
[46] David, A.S., Malmberg, A., Brandt, L. et al. (1997) IQ and risk for schizophrenia: a population-based cohort study. Psychol. Med., 27, 1311–1323.
[47] Tiihonen, J., Haukka, J., Henriksson, M. et al. (2005) Premorbid intellectual functioning in bipolar disorder and schizophrenia: results from a cohort study of male conscripts. Am. J. Psychiatry, 162, 1904–1910.
[48] Isohanni, I., Järvelin, M.-R., Nieminen, P. et al. (1998) School performance as a predictor of psychiatric hospitalization in adult life. A 28-year follow-up in the Northern Finland 1966 birth cohort. Psychol. Med., 28, 967–974.
[49] Mortensen, P.B., Pedersen, C.B., Westergaard, T. et al. (1999) Effects of family history and place and season of birth on the risk of schizophrenia. N. Engl. J. Med., 340, 603–608.
[50] Pedersen, C.B. and Mortensen, P.B. (2001) Evidence of a dose-response relationship between urbanicity during upbringing and schizophrenia risk. Arch. Gen. Psychiatry, 58, 1039–1046.
[51] Byrne, M., Agerbo, E., Ewald, H. et al. (2003) Parental age and risk of schizophrenia. A case-control study. Arch. Gen. Psychiatry, 60, 673–678.
[52] Eaton, W.W., Byrne, M., Ewald, H. et al. (2006) Association of schizophrenia and autoimmune diseases: linkage of Danish national registers. Am. J. Psychiatry, 163, 521–528.
[53] Westergaard, T., Mortensen, P.B., Pedersen, C.B. et al. (1999) Exposure to prenatal and childhood infections and the risk of schizophrenia. Suggestions from a study of sibship characteristics and influenza prevalence. Arch. Gen. Psychiatry, 56, 993–998.
[54] Khashan, A.S., Abel, K.M., McNamee, R. et al. (2008) Higher risk of offspring schizophrenia following antenatal maternal exposure to severe adverse life events. Am. J. Psychiatry, 65, 146–152.
[55] Bennedsen, B.E., Mortensen, P.B., Olesen, A.V. et al. (2001) Congenital malformations, stillbirths, and infant deaths among children of women with schizophrenia. Arch. Gen. Psychiatry, 58, 674–679.
[56] King-Hele, S.A., Abel, K.M., Webb, R.T. et al. (2007) Risk of sudden infant death syndrome with parental mental illness. Am. J. Psychiatry, 64, 1323–1330.
[57] Webb, R.T., Pickles, A.R., Appleby, L. et al. (2007) Death by unnatural causes during childhood and early adulthood in offspring of psychiatric inpatients. Arch. Gen. Psychiatry, 64, 345–352.
[58] Madsen, K.M., Hviid, A., Vestergaard, M. et al. (2002) A population-based study of measles, mumps, and rubella vaccination and autism. N. Engl. J. Med., 347, 1477–1482.
[59] Erlangsen, A., Canudas-Romo, V. and Conwell, Y. (2008) Increased use of antidepressants and decreasing suicide rates: a population-based study using Danish register data. J. Epidemiol. Community Health, 62, 448–454.
[60] Oksbjerg Dalton, S., Mellemkjaer, L., Olsen, J.H. et al. (2002) Depression and cancer risk: a register-based study of patients hospitalized with affective disorders, Denmark, 1969–1993. Am. J. Epidemiol., 155, 1088–1095.
[61] Nordentoft, M., Breum, L., Munck, L.K. et al. (1993) High mortality by natural and unnatural causes: a 10 year follow up study of patients admitted to a poisoning treatment centre after suicide attempts. Br. Med. J., 306, 1637–1641.
[62] Hiroeh, U., Appleby, L., Mortensen, P.B. et al. (2001) Death by homicide, suicide, and other unnatural causes in people with mental illness: a population-based study. Lancet, 358, 2110–2112.
[63] Agerbo, E. (2007) High income, employment, postgraduate education, and marriage: a suicidal cocktail among psychiatric patients. Arch. Gen. Psychiatry, 64, 1377–1384.
[64] Qin, P., Mortensen, P.B. and Pedersen, C.B. (2009) Frequent change of residence and risk of attempted and completed suicide among children and adolescents. Arch. Gen. Psychiatry, 66, 628–632.
[65] Munk-Jørgensen, P. and Mortensen, P.B. (1992) Incidence and other aspects of the epidemiology of schizophrenia in Denmark, 1971–1987. Br. J. Psychiatry, 161, 489–495.
[66] Suvisaari, J.M., Haukka, J.K., Tanskanen, A.J. et al. (1999) Decline in the incidence of schizophrenia in Finnish cohorts born from 1954 to 1965. Arch. Gen. Psychiatry, 56, 733–740.
[67] Perälä, J., Suvisaari, J., Saarni, S.I. et al. (2007) Lifetime prevalence of psychotic and bipolar I disorders in a general population. Arch. Gen. Psychiatry, 64, 19–28.
[68] Haukka, J., Suvisaari, J., Varilo, T. et al. (2001) Regional variation in the incidence of schizophrenia in Finland: a study of birth cohorts born from 1950 to 1969. Psychol. Med., 31, 1045–1053.
[69] Perälä, J., Saarni, S., Ostamo, A. et al. (2008) Geographic variation and sociodemographic characteristics of psychotic disorders in Finland. Schizophr. Res., 106, 337–347.
[70] Tienari, P., Wynne, L.C., Sorri, A. et al. (2004) Genotype-environment interaction in schizophrenia-spectrum disorder. Long-term follow-up study of Finnish adoptees. Br. J. Psychiatry, 184, 216–222.
[71] Cannon, T.D., Kaprio, J., Lönnqvist, J. et al. (1998) The genetic epidemiology of schizophrenia in a Finnish twin cohort. A population-based modeling study. Arch. Gen. Psychiatry, 55, 67–74.
[72] Kieseppä, T., Partonen, T., Haukka, J. et al. (2004) High concordance of bipolar I disorder in a nationwide sample of twins. Am. J. Psychiatry, 161, 1814–1821.
[73] Tiihonen, J., Lönnqvist, J., Wahlbeck, K. et al. (2006) Antidepressants and the risk of suicide, attempted suicide, and overall mortality in a nationwide cohort. Arch. Gen. Psychiatry, 63, 1358–1367.
[74] Tiihonen, J., Wahlbeck, K., Lönnqvist, J. et al. (2006) Effectiveness of antipsychotic treatments in a nationwide cohort of patients in community care after first hospitalisation due to schizophrenia and schizoaffective disorder: observational follow-up study. Br. Med. J., 333, 224.
[75] Tiihonen, J., Lönnqvist, J., Wahlbeck, K. et al. (2009) 11-year follow-up of mortality in patients with schizophrenia: a population-based cohort study (FIN11 study). Lancet, 374, 620–627.
[76] Isohanni, M., Miettunen, J., Mäki, P. et al. (2006) Developmental pathways of schizophrenia from gestation to the course of illness. The Northern Finland 1966 Birth Cohort Study. World Psychiatry, 5, 168–171.
[77] Moilanen, K., Jokelainen, J., Jones, P.B. et al. (2010) Deviant intrauterine growth and risk of schizophrenia: A 34-year follow-up of the Northern Finland 1966 Birth Cohort. Schizophr. Res., 124, 223–230.
[78] Miettunen, J., Lauronen, E., Veijola, J. et al. (2007) Socio-demographic and clinical predictors of occupational status in schizophrenic psychoses – follow-up within the Northern Finland 1966 Birth Cohort. Psychiatry Res., 150, 217–225.
[79] Kotila, M., Numminen, H., Waltimo, O. et al. (1998) Depression after stroke: results of the FINNSTROKE Study. Stroke, 29, 368–372.
[80] Gissler, M., Hemminki, E. and Lönnqvist, J. (1996) Suicides after pregnancy in Finland, 1987–1994: register linkage study. Br. Med. J., 313, 1431–1434.
[81] Ødegard, Ø. (1971) Hospitalized psychoses in Norway: time trends 1926–1965. Soc. Psychiatry, 6, 53–58.
[82] Hansen, V., Arnesen, E. and Jacobsen, B.K. (1997) Total mortality in people admitted to a psychiatric hospital. Br. J. Psychiatry, 170, 186–190.
[83] Strand, B.H. and Kunst, A. (2006) Childhood socioeconomic status and suicide mortality in early adulthood among Norwegian men and women. A prospective study of Norwegians born between 1955 and 1965 followed for suicide from 1990 to 2001. Soc. Sci. Med., 63, 2825–2834.
[84] Tellnes, G., Mathisen, S., Skau, I. et al. (1992) Who is long-term sick-listed in Norway? From the project Evaluation of the follow-up of long-term sick-listed. Tidsskr. Nor. Laegeforen., 112, 2773–2778.
[85] Hagen, K., Stovner, L.J., Skorpen, F. et al. (2008) COMT genotypes and use of antipsychotic medication: linking population-based prescription database to the HUNT study. Pharmacoepidemiol. Drug Saf., 17, 372–377.
[86] Bramness, J.G., Furu, K., Engeland, A. et al. (2007) Carisoprodol use and abuse in Norway: a pharmacoepidemiological study. Br. J. Clin. Pharmacol., 64, 210–218.
[87] Bramness, J.G., Hausken, A.M., Sakshaug, S. et al. (2005) Prescription of selective serotonin reuptake inhibitors 1990–2004. Tidsskr. Nor. Laegeforen., 125, 2470–2473.
[88] Kringlen, E. (1968) An epidemiological-clinical twin study on schizophrenia. J. Psychiatr. Res., 6 (Suppl. 1), 49–63.
[89] Lewis, G., David, A., Andreasson, S. et al. (1992) Schizophrenia and city life. Lancet, 340, 137–140.
[90] Harrison, G., Fouskakis, D., Rasmussen, F. et al. (2003) Association between psychotic disorder and urban place of birth is not mediated by obstetric complications or childhood socio-economic position: a cohort study. Psychol. Med., 33, 723–731.
[91] Andreasson, S., Allebeck, P., Engström, A. et al. (1987) Cannabis use and schizophrenia: a longitudinal study of Swedish conscripts. Lancet, 8574, 1483–1486.
[92] Zammit, S., Allebeck, P., Andreasson, S. et al. (2002) Self reported cannabis use as a risk factor for schizophrenia in Swedish conscripts of 1969: historical cohort study. Br. Med. J., 325, 1199.
[93] Hultman, C.M., Sparen, P., Takei, N. et al. (1999) Prenatal and neonatal risk factors for schizophrenia, affective psychosis, and reactive psychosis of early onset: case-control study. Br. Med. J., 318, 421–426.
[94] Dalman, C., Allebeck, P., Cullberg, J. et al. (1999) Obstetric complications and the risk of schizophrenia. A longitudinal study of a national birth cohort. Arch. Gen. Psychiatry, 56, 234–240.
[95] Dalman, C., Allebeck, P., Gunnell, D. et al. (2008) Infections in the CNS during childhood and the risk of subsequent psychotic illness: a cohort study of more than one million Swedish subjects. Am. J. Psychiatry, 165, 59–65.
[96] MacCabe, J.H., Lambe, M.P., Cnattingius, S. et al. (2008) Scholastic achievement at age 16 and risk of schizophrenia and other psychoses: a national cohort study. Psychol. Med., 38, 1133–1140. ¨ [97] Lichtenstein, P., Yip, B.H., Bjork, C. et al. (2009) Common genetic determinants of schizophrenia and bipolar disorder in Swedish families: a populationbased study. Lancet, 373, 234–239. ˚ ¨ N., Lichtenstein, P. et al. [98] Tidemalm, D., Langstr om, (2008) Risk of suicide after suicide attempt according to coexisting psychiatric disorder: Swedish cohort study with long-term follow-up. Br. Med. J., 337, a2205. [99] Kendler, K.S., Gatz, M., Gardner, C.O. et al. (2006) A Swedish national twin study of lifetime major depression. Am. J. Psychiatry, 163, 109–114. [100] Wigertz, A. and Westerling, R. (2001) Measures of prevalence: which healthcare registers are applicable?. Scand. J. Public Health, 29, 55–62. [101] Isohanni, M. and Tienari, P. (2005) Cluster a personality disorders: unanswered questions about epidemiological, evolutionary and genetic aspects, in Personality Disorders (eds M. Maj, H.S. Akiskal, J.E. Mezzich and A. Okasha), John Wiley & Sons, Inc., New York, pp. 87–89. ´ [102] Hernan, M.A., Cole, S.R., Margolick, J. et al. (2005) Structural accelerated failure time models for survival analysis in studies with time-varying treatments. Pharmacoepidemiol. Drug Saf., 14, 477–491. [103] Stefansson, H., Sigurdsson, E., Steinthorsdottir, V. et al. (2002) Neuregulin 1 and susceptibility to schizophrenia. Am. J. Hum. Gen., 71, 877–892. [104] Stefansson, H., Rujescu, D., Cichon, S. et al. (2008) Large recurrent microdeletions associated with schizophrenia. Nature, 455, 232–236.
9
An introduction to mental health services research
Anna Fernández,1 Alejandra Pinto-Meza,2 Antoni Serrano-Blanco,3 Jordi Alonso4 and Josep Maria Haro5

1 Research and Development Unit, Sant Joan de Déu-SSM, Fundació Sant Joan de Déu, Barcelona, Spain, Red de Investigaciones en Actividades de Prevención y Promoción de la Salud (REDIAPP)
2 Research and Development Unit, Sant Joan de Déu-SSM, Fundació Sant Joan de Déu, Barcelona, Spain, Red de Investigaciones en Actividades de Prevención y Promoción de la Salud (REDIAPP)
3 Research and Development Unit, Sant Joan de Déu-SSM, Fundació Sant Joan de Déu, Barcelona, Spain, Red de Investigaciones en Actividades de Prevención y Promoción de la Salud (REDIAPP)
4 Head, Health Services Research Unit (IMIM-Hospital del Mar), CIBER Epidemiología y Salud Pública (CIBERESP), Spain, Master’s Program in Public Health (UPF-UAB), Carrer del Doctor Aiguader, Barcelona, Spain
5 Research and Development Unit, Sant Joan de Déu-SSM, Fundació Sant Joan de Déu, Barcelona, Spain, CIBER Salud Mental (CIBERSAM)
9.1 Introduction

Health Services Research (HSR) entered the public health arena in the 1970s. Two events, in particular, heralded its appearance. The first was the publication, in 1973, of an essay by Barbara Starfield entitled ‘Health Services Research: a working model’ [1]. The second was the publication of the widely cited Lalonde Report in 1974 [2] in which, for the first time, the term ‘determinants of health’ was used. In this report, four determinants of the health and well-being of populations were identified: human biology, environment, lifestyles and healthcare systems. In the Lalonde Report the influence of human biology, environment and lifestyles was emphasised, but this did not lead to a decrease in attention to health care. On the contrary, the report called for a change in the healthcare system from a limited focus on ‘cures’ to illness prevention and health promotion. The inclusion of health services as one of the determinants of health, and the development of a framework for its study, promoted the creation of institutes and departments to study the performance and the effectiveness of health services.

During the 1980s HSR received support mainly from three groups [3]: (i) payers, who thought that HSR could help them to contain increasing costs; (ii) clinicians, who felt that HSR would provide them with evidence against those who argued that health services had little effect on the health of individuals and (iii) the public and service users, who called for a more important role in the public health arena.

In the case of mental health, the transformation that mental healthcare was undergoing, that is the implementation of deinstitutionalisation in the 1960s and 1970s, and the steady rise of community support
system (CSS) programmes in the 1980s [4], delayed the development of a specific framework for mental health services research (Mental HSR) until the end of the 1990s. It was not until 1998 that Tansella and Thornicroft published their paper ‘A Conceptual Framework for Mental Health Services: the matrix model’, which aimed to review the major concepts, and the applications of HSR to mental health services [5]. Nevertheless, although this could be considered the first systematisation, it is important to mention the work of Stein and Test (1980) in the United States [6], and Hoult and Reynolds (1983) in Australia [7] who were pioneers in the evaluation of the effectiveness of the new forms of mental healthcare (specifically the CSS) for people suffering from mental illness. The chapter is organised into four sections. In the first section, various definitions of Mental HSR are presented and the general challenges faced by Mental HSR are discussed. In the second section, we describe and discuss the conceptual framework for Mental HSR developed by Tansella and Thornicroft. In the third section, we examine the key concepts in Mental HSR. Finally, we present some examples of Mental HSR.
9.2 What is mental health services research?

Different definitions have been proposed for HSR that, generally, could be applied to the specific area of mental health services. One of the first definitions was proposed in 1979 by the Institute of Medicine. In the revised and expanded version of 1995 they defined HSR as [8]:

(...) a multidisciplinary field of inquiry, both basic and applied, that examines the use, costs, quality, accessibility and delivery, organization, financing and outcomes of health care services to increase knowledge and understanding of the structure, processes and effects of health services for individuals and populations.

The Agency for Healthcare Research and Quality defined HSR as the field of scientific investigation that [9]:

(...) examines how people get access to health care, how much care costs, and what happens to patients as a result of this care. The main goals of health services research are to identify the most effective ways to organize, manage, finance, and deliver high quality care; reduce medical errors; and improve patient safety.

A further definition, such as the one proposed by the Academy of Health, states that HSR is [10]:

(...) the multidisciplinary field of scientific investigation that studies how social factors, financing systems, organizational structures and processes, health technologies, and personal behaviors affect access to health care, the quality and cost of health care, and ultimately our health and well-being. Its research domains are individuals, families, organizations, institutions, communities, and populations.

On the other hand, Mental HSR has been defined as [11]:

(...) the area of research that aims to maximize the quality of mental health care received by patients in their communities, as well as the quality of their lives. It examines treatment through the lenses of public health, public policy and the economics of mental health care.

Taking into account the above definitions, a major characteristic of Mental HSR is its multidisciplinary nature. Mental HSR uses concepts and methods from fields of knowledge such as medicine, epidemiology, sociology, economics and psychology. The use of multiple methods allows the use of different approaches depending on the problems being addressed. For example, qualitative methods are increasingly being applied, especially with the inclusion of service users and their relatives in the design and implementation of HSR studies [12]. The Mental HSR field of study is broad, covering the micro level (patient-based evaluation) to macro analysis (how health and social policies influence the outcomes of health services).
As such, Mental HSR operates in the continuum from the patient–physician encounter to the wider
community and environmental context in which these encounters happen. The approach to many problems is not local but systemic, taking into account the influence of medical and non-medical factors on populations’ health. Finally, the perspective adopted also matters, especially when economic aspects are taken into account. Sometimes the focus is on individual health service costs, but in mental health a broader view of the costs of care is needed, one that includes indirect costs. The societal perspective allows the analysis of the impact of specific services not only on patients’ wellbeing, but also on their overall quality of life and integration into society. Sustainability of services requires the evaluation of overall costs and care benefits.

The context in which HSR and Mental HSR have developed presents internal and external difficulties. General and mental health services are exposed to external pressures which some have called ‘environmental turbulence’ [3]. General and Mental HSR have to deal with the ‘turbulence’ originating from: (i) government, which can influence HSR through financial controls or political strategies; (ii) local opinions, local politicians and consumer organisations; (iii) healthcare organisations with their staff, internal politics and norms and conditions of services and (iv) the medical–industrial complex, which can promote new technologies primarily for commercial interest.

On the other hand, both general and Mental HSR have to deal with the specific internal characteristics of the healthcare arena. These are: (i) the complexity of healthcare, with different occupational groups involved in the provision of healthcare, often with competing interests; (ii) the continuous changes in healthcare and (iii) the influence of employees in healthcare organisations, especially medical doctors, who have considerable autonomy and influence over how resources are used. Finally, complexity comes from the fact that no two patients are identical, which complicates the implementation of standardised processes.

To some extent, this internal and external turbulence is common to all areas of HSR. Mental HSR has some additional difficulties, which arise from: (i) the still common use of non-standardised outcome measures; (ii) the complexity of mental health treatments, which often include social components and (iii) the incapacity, or possible incapacity, of some patients to provide consent [12].
9.3 A framework for mental health services research

In order to deal with this internal and external turbulence, Tansella and Thornicroft developed, in 1998, a conceptual framework for Mental HSR [5, 13]. This conceptual framework is a map which allows the organisation of the field and clearly states the objective of the studies to be conducted. One of their concerns was to avoid studies of limited use: for example, general descriptions of mental health services that were difficult to use in particular contexts or, conversely, specific descriptions of mental health services that were difficult to extrapolate.

Tansella and Thornicroft built a matrix with two dimensions, one geographical and one temporal. The geographical dimension is composed of three levels: (i) country; (ii) local and (iii) patient. The temporal dimension considers three phases: (i) inputs; (ii) processes and (iii) outcomes. Combining the two dimensions, they constructed a 3 × 3 matrix that reflects the crucial issues for Mental HSR (see Table 9.1).

The geographical dimension has three levels:
• Country/regional level: also known as the macro level. At the macro level, mental health laws are established and policies are formulated. The domains studied at this level are related to: (i) the social, political and legal forces that shape policies; (ii) economic issues, such as public expenditure on mental health services or the methods to allocate health expenditure which consider variations in psychiatric morbidity and (iii) professional education and development, such as professional training and accreditation or setting standards of care.
• The local level: by local level, Tansella and Thornicroft refer to the catchment area in which mental health services are set up. In their characterisation, they have in mind how most developed countries organise mental health services. Typically, areas with between 50 000 and 250 000 residents are defined and a given number of services are
assigned to cover the needs of the population of the area. The local level is usually seen as the best perspective from which to study the components of the mental health system and how they are organised and integrated with general healthcare and social services. Moreover, at this level, assessment of required services is carried out.
• The patient level: here the focus is on the individual patient or small groups of patients sharing some common traits, needs or problems. Traditionally, this level is considered the clinicians’ domain. Nevertheless, in the matrix model, the influence of higher levels (country/region and local levels) on clinical work is considered.

The temporal dimension has three phases:
• The input phase: inputs are defined as the resources devoted to the mental health system. According to the authors, inputs could be divided into ‘visible’ and ‘invisible’. At the local level, the ‘visible’ inputs are basically composed of staff and facilities. The ‘invisible’ inputs activate the visible inputs and potentiate their effective performance. For instance, coordination between primary care and mental health professionals is an ‘invisible’ input. Invisible inputs also include such elements as experience, qualification and staff training. In traditional HSR the input phase is also named ‘structure’.
• The process phase: in this phase the focus is on activities developed to provide mental health services. For instance, we could study the appropriateness of treatments provided for mental health problems.
• The outcome phase: outcomes are changes in functioning, morbidity, mortality and quality of life, both at the individual and country-aggregated level. The outcomes could be seen as the complex result of resources and treatment received, which, as we have seen, could be considered as input and process variables.

The model by Tansella and Thornicroft has many similarities to the model proposed by Starfield 25 years previously. She also divided the health services components into structure, processes and outcomes, which basically correspond to the three phases of the temporal dimension of the matrix. Moreover, Starfield also emphasised that the study of the interrelation between the patient and the health professional had to take into account the social context in which the encounter takes place. In recent revisions of her work, the individual–country dimension has also been incorporated [14]. The use of the matrix model may assist mental health services researchers in considering different factors that could help them answer complex questions. In the fourth section of this chapter we will provide some examples of the application of the matrix model.

Table 9.1 The mental health matrix, with some examples.

Geographical dimension  Temporal dimension
                        Input                             Process                       Outcome
Country                 Mental health policies            Compulsory treatment rates    National suicide rates
                        Expenditure on services           Bed occupancy rates           Burden of disease
                                                                                        Primary prevention
Local (catchment area)  Population needs assessment       Pathways to care              Better access to services
                        Coordination between sectors      Patterns of service use       Secondary and tertiary prevention
Patient                 Patient needs assessment          Treatment appropriateness     Symptom reduction
                        Patients’ and/or relatives’       Continuity of care            Increase in the quality of life
                        demands

Adapted from: Tansella M and Thornicroft G (1998) A conceptual framework for mental health services: the matrix model. Psychological Medicine, 28, 503–508 [5].
9.4 Key concepts in mental health services research

Having established the framework for Mental HSR, and discussed the two dimensions that must be taken into account when dealing with complex questions, we now briefly review some of the key concepts in Mental HSR.
9.4.1 Need

Need is one of the main drivers of health service use. In layman’s terms it may mean the existence of a health problem [15], but a definition must be more complex than this. According to The Dictionary of Epidemiology [16] the term ‘need’ has both ‘a precise and all-but-undefinable meaning in the public health context’. The fact is that the word ‘need’ carries implied value judgements about what counts as a health problem, and when. For instance, in the case of mental health, before psychiatric deinstitutionalisation, the needs for outpatient treatment of people affected by schizophrenia were not considered, whereas since deinstitutionalisation, and in the context of the subsequent progressive sensitisation of citizens, their needs for community treatment have been taken into account.

From an economist’s point of view, ‘need’ can be defined as ‘the minimum amount of resources needed to exhaust an individual’s capacity to benefit’. A relatively simple definition from a health economics point of view, provided by Davis, states that ‘need is a subjective feeling state that initiates the process of choosing among medical resources’ [15]. Other authors, from a sociological standpoint, have distinguished four approaches to defining need [15, 17, 18]:
• Normative need: those needs ‘objectively’ defined by professionals.
• Felt need: those needs ‘subjectively’ defined by individuals.
• Expressed need: defined by the actions carried out by individuals, for instance seeking care for a health problem.
• Comparative need: derived from examining the services provided in one area to one population and using this information as the basis to determine the sort of services required in another area with a similar population.

From a Mental HSR approach, a mental health need is defined as: ‘the requirement of individuals to enable them to achieve, maintain, or restore an acceptable level of social independence or quality of life, as defined by particular care agency or authority’ [17].
9.4.2 Want, demand and supply

Need is related to other key concepts: want, demand and supply. The four terms overlap to some extent, and they are sometimes used loosely. Simply put, want is understood to mean what individuals would like but may not act upon, demand refers to the expressed want (some authors would say the expressed need) and supply refers to the services, treatments and kinds of care that are available [15, 18]. As a goal, mental health systems try to increase the overlap between need, demand and supply. Additionally, some authors have argued for the importance of differentiating between unmet and met needs [15], according to whether or not people are receiving effective services or care. Moreover, others have pointed out that the existence of a treatment is important in determining whether something is a need. That is to say, if no treatment exists for an illness, one could argue that there is a want, rather than a need, for this treatment.
9.4.3 Efficacy, effectiveness and efficiency

The study of the efficacy, effectiveness and efficiency of mental health services is among the key issues in Mental HSR. The concepts refer to the effects of an intervention. Efficacy is assessed by answering the question ‘Can it work?’ That is, does a given intervention cause more good than harm to specifically diagnosed patients who are adequately treated and who fully comply with the treatment? In other words, efficacy tries to assess whether an intervention (be it a drug, a surgical procedure or an organisational arrangement) works in ideal conditions. Typically, randomised clinical trials are designed to evaluate the efficacy of interventions.

On the other hand, effectiveness is measured by answering the question ‘Does it work?’ That is, in everyday conditions, will the treatment work? Everyday conditions can depart from the ideal for a number of reasons, such as incomplete diagnostic efforts, comorbidity and insufficient compliance by the provider and/or the patient. Assessing effectiveness is important for more accurate planning and evaluation of service provision.

Lastly, efficiency takes into account the relationship between costs and effects. Two different types of efficiency are defined: (i) Production efficiency refers to achieving a given level of output at minimum cost; that is, if two interventions obtain the same results, the intervention with lower costs will be more efficient. (ii) Allocative efficiency refers to maximising the results, in this case on population health, with a given amount of resources. With a given health budget, maximum allocative efficiency will be achieved if resources are devoted to the interventions that produce the maximum improvement in health [19].

There are three ways of estimating efficiency, depending on the way the outcome is measured: cost-effectiveness, cost-utility and cost-benefit. All three take into account costs in monetary units (which could be direct, such as the costs of treatments, or indirect, for example productivity losses associated with illness) but differ in the unit of outcomes. In cost-effectiveness analysis the consequences of the intervention (outcomes) are measured in the most appropriate natural effects or physical units, such as ‘reduction in psychotic symptomatology’ or ‘cases adequately detected’. In cost-utility analyses, the outcomes are measured in health state preference scores or utilities. The most common measure used in cost-utility analysis is the quality-adjusted life-year (QALY). Finally, cost-benefit analysis measures the consequences, the outputs, in monetary terms, for instance by applying a monetary value to the illness status or to life [20].
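As an illustration of the cost-utility idea, the following minimal Python sketch compares two hypothetical interventions. All names and figures are invented for the example. The incremental ratio (extra cost per extra QALY) is a common convention that the chapter does not spell out, so it should be read as an assumption rather than as the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    name: str
    cost: float   # cost per patient in monetary units (hypothetical)
    qalys: float  # QALYs gained per patient (hypothetical)


def cost_per_qaly(i: Intervention) -> float:
    """Average cost-utility ratio: monetary units per QALY gained."""
    return i.cost / i.qalys


def incremental_ratio(new: Intervention, old: Intervention) -> float:
    """Extra cost per extra QALY when moving from `old` to `new`."""
    delta_qalys = new.qalys - old.qalys
    if delta_qalys == 0:
        raise ValueError("equal outcomes: compare costs directly")
    return (new.cost - old.cost) / delta_qalys


def more_production_efficient(a: Intervention, b: Intervention) -> Intervention:
    """With equal outcomes, the cheaper intervention is the more efficient."""
    if a.qalys != b.qalys:
        raise ValueError("production efficiency assumes equal outcomes")
    return a if a.cost <= b.cost else b


usual = Intervention("usual care", cost=2000.0, qalys=0.5)
new = Intervention("enhanced care", cost=3500.0, qalys=0.75)

print(cost_per_qaly(usual))           # average ratio for usual care
print(incremental_ratio(new, usual))  # extra cost per extra QALY
```

Whether the extra cost per QALY is acceptable is a policy judgement that lies outside the calculation itself.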
9.4.4 Appropriateness of care

Most of the concepts discussed above deal with the results, outputs or outcomes of health services. When interested in evaluating the process of providing services, adequacy or appropriateness of care is important. Appropriateness tries to assess whether the particular patient receives adequate treatment, in a timely manner, from the appropriate professional, in the right setting. According to Sharpe and Faden [21], the concept of appropriateness has to be considered from at least three different perspectives: (i) the clinical point of view, that is, is there enough evidence about a procedure in terms of potential benefits and harm?; (ii) the perspective of the individual patient, that is, when studying appropriateness, the values and ‘nonclinical’ benefits and harm to the patients and their interests have to be incorporated. In other words, from a patient’s point of view, an intervention will be considered adequate when the patient has participated in the decision-making process and has freely accepted it once informed; and, finally, (iii) the societal point of view, that is, in an era of escalating healthcare costs and contained financing, procedures should also be cost-effective.

The relationship of needs with effectiveness and adequacy can be understood through the following example. Imagine an epidemiological research study designed to assess whether the citizens of a region with mental health needs are receiving appropriate interventions. Following the steps suggested by Spasoff [15] and Muir Gray [22], we should proceed as follows:
1 We should estimate the number of people in need (as a proxy we can use the prevalence of people with mental disorders).
2 We should measure the actual level of health service utilisation by people with the problem (that is, how many people with mental disorders are using health services for their emotional problems?).
3 We should determine, from evidence-based literature, which interventions are beneficial (effective and/or cost-effective) for their problems.
4 We should try to assess whether recommendations from the literature are consistent with the kind of care that they are receiving.

This type of approach is illustrated in Table 9.2. Cell (a) contains the number of cases for whom the intervention is indicated and who are actually receiving it. These could be considered as patients whose needs have been met. Cell (b) represents the number of people for whom the intervention is needed but who are receiving an intervention which does not meet minimum quality standards. In cell (c) are those patients who are not receiving any treatment, despite their need for it. These are the cases with unmet need. Cells (d) and (e) indicate misuse of resources: people are receiving treatment that is not indicated for them. In other words, these are cases of inappropriate treatment. Lastly, cell (f) shows those cases without a need for intervention who are, appropriately, not treated.

Table 9.2 The relationship between needs for treatment and appropriateness of care.

                              Receiving            Not receiving        Not receiving
                              recommended          recommended          any
                              intervention         intervention         intervention         Total
Intervention is indicated     (a) Met need/        (b) Inappropriate    (c) Unmet need       Total need for
                              adequately treated   treatment                                 intervention
Intervention is not           (d) Inappropriate    (e) Inappropriately  (f) Appropriate      No need for
indicated                     treatment            treated              non-treatment        intervention
Total                         Total treated                             Not treated          Total cases of
                                                                                             problem

Adapted and modified from: Spasoff RA (1999) Epidemiologic Methods for Health Policy, Oxford University Press, New York, p. 111 [15].

Nevertheless, this approach has some limitations which should be acknowledged. If we use ‘normative’ needs, assuming that anyone with a psychiatric diagnosis is in need, we could be overestimating the number of people with unmet needs. Moreover, it is important to bear Tansella and Thornicroft’s matrix in mind, and try to describe the various factors that would explain why people are not expressing their needs or receiving the required treatment. Of course, unmet need could also simply arise because effective treatments are not supplied in a particular country or area, are not considered to be cost-effective, or because conclusive evidence about effective treatments for a problem is lacking. Another limitation of this approach is that it does not consider the patient’s perspective. A person could be diagnosed but not feel disabled enough to seek care or, conversely, a person who does not meet diagnostic criteria could still be in need of some kind of mental health care.
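The cell logic of Table 9.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coding of care as "recommended", "other" or "none" and all the counts are hypothetical, and the function name `classify` is ours, not Spasoff's.

```python
from collections import Counter

# Map (intervention indicated?, care received) to a Table 9.2 cell.
# The three care codes are a hypothetical coding scheme.
CELLS = {
    (True, "recommended"): "a",   # met need / adequately treated
    (True, "other"): "b",         # inappropriate treatment
    (True, "none"): "c",          # unmet need
    (False, "recommended"): "d",  # inappropriate treatment
    (False, "other"): "e",        # inappropriately treated
    (False, "none"): "f",         # appropriate non-treatment
}


def classify(indicated: bool, care: str) -> str:
    """Return the Table 9.2 cell letter for one case."""
    return CELLS[(indicated, care)]


# Invented survey of 1000 people: (indicated, care received).
cases = (
    [(True, "recommended")] * 30
    + [(True, "other")] * 10
    + [(True, "none")] * 60
    + [(False, "recommended")] * 5
    + [(False, "other")] * 5
    + [(False, "none")] * 890
)

counts = Counter(classify(indicated, care) for indicated, care in cases)
total_need = counts["a"] + counts["b"] + counts["c"]
unmet_share = counts["c"] / total_need
met_share = counts["a"] / total_need

print(dict(counts))
print("unmet need among those in need:", unmet_share)
```

In this invented example, 60 of the 100 people for whom the intervention is indicated receive no treatment at all, which is the quantity the text calls unmet need.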
9.4.5 Small area variations (SAV)

Related to the study of appropriateness, another important issue for Mental HSR is the study of small area variations (SAVs). This concept refers to large differences in the rates of use of medical services between geographical areas. Such variations can be detected between countries, provinces or regions [23]. The study of SAVs is important because it could indicate poor access to health services or underuse of resources in some areas. It could also show iatrogenic consequences of overuse. Briefly, the steps in analysing SAVs are:
1 Determination of the numerator. For instance, the number of emergency psychiatric consultations during a month.
2 Determination of the denominators. For instance, the populations of the health regions being compared.
3 Adjustment for age and gender.
4 Use of a statistical test to control for random fluctuations.

Different hypotheses have been put forward to explain what causes SAVs. Among the most commonly cited are the following:
1 The uncertainty hypothesis: according to this hypothesis, first formulated by Wennberg [24], variability is low when there is clinical consensus (and/or scientific evidence) about which is the best procedure. When there is uncertainty about the best therapeutic option, health professionals act for the best according to their own criteria. In these cases of high uncertainty, factors related to health-system provision play an important role in explaining SAV.
2 The enthusiasm hypothesis: this hypothesis suggests that the inappropriate use of a procedure is equal in areas with high and low use of services. Nevertheless, in areas with high use of services, a few clinicians who are enthusiastic about a procedure are responsible for the variability [25].
3 The patient practice variations hypothesis: this states that differences in morbidity explain SAV. Variables related to demand (i.e. the patient) such as socioeconomic level, educational level, ethnicity, health status and beliefs are the main source of variability [26].
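The four analysis steps listed earlier in this section can be sketched as follows. The two regions, their strata and all counts are invented. Direct standardisation against equal reference weights and a simple chi-square statistic on total counts are our assumptions about how steps 3 and 4 might be carried out, since the chapter does not prescribe particular methods.

```python
# Steps 1-2: numerators (consultations in one month) and denominators
# (resident population) per age-gender stratum. All figures are invented.
regions = {
    "North": {"F<40": (10, 5000), "F40+": (20, 5000),
              "M<40": (10, 5000), "M40+": (20, 5000)},
    "South": {"F<40": (16, 8000), "F40+": (8, 2000),
              "M<40": (16, 8000), "M40+": (8, 2000)},
}
# Share of each stratum in a hypothetical reference population (sums to 1).
standard = {"F<40": 0.25, "F40+": 0.25, "M<40": 0.25, "M40+": 0.25}


def crude_rate(strata):
    """Consultations per 1000 residents, ignoring age and gender."""
    events = sum(e for e, _ in strata.values())
    population = sum(p for _, p in strata.values())
    return 1000 * events / population


def standardised_rate(strata, weights):
    """Step 3: directly standardised rate per 1000 residents."""
    return sum(weights[s] * 1000 * e / p for s, (e, p) in strata.items())


def chi_square(data):
    """Step 4: observed regional counts against one pooled crude rate."""
    totals = [(sum(e for e, _ in s.values()), sum(p for _, p in s.values()))
              for s in data.values()]
    all_events = sum(e for e, _ in totals)
    all_pop = sum(p for _, p in totals)
    return sum((e - p * all_events / all_pop) ** 2 / (p * all_events / all_pop)
               for e, p in totals)


for name, strata in regions.items():
    print(name, crude_rate(strata), standardised_rate(strata, standard))
print("chi-square:", chi_square(regions))
```

In this invented example the crude rates differ (3.0 against 2.4 per 1000), but the standardised rates coincide because the stratum-specific rates are identical and only the age structure differs. The statistic is about 1.33, below the 5% critical value of 3.84 for one degree of freedom, so even the crude difference is compatible with random fluctuation.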
9.4.6 Factors associated with access to health care

Different models have been proposed to understand why people access health care. One of the most used is the Behavioral Model and Access to Medical Care by Ronald M. Andersen [27]. Figure 9.1 depicts the components and their interrelation. This model suggests that people’s use of health services is a result of a combination of factors related to the environment, their predisposition to use these services, along with factors that may enable or impede use, and their need for care. It also includes feedback loops. For instance, outcomes may, in turn, affect perception of need and health behaviour.

The first component of the model, the environment, refers explicitly to the national health policy, the resources devoted to health and their organisation. For instance, in a country with a national health system with universal coverage, higher access to healthcare than in a country with a private health system would be expected. With respect to the external environment, the influences of political and economic components are also taken into account.

The second component, population characteristics, covers three distinct factors:
• Predisposing characteristics include: demographic characteristics such as age and gender; social structure (education, occupation and ethnicity); social networks and interactions; and the health beliefs that comprise the attitudes, values and knowledge that people have about both health and health services. In the case of Mental HSR, the stigma associated with mental disorders is also one of the key elements that could explain lower use.
• Enabling resources refer to the community and personal resources that people have. For instance, income, health insurance, a regular source of care and perceived social support are just some of the enabling factors.
• The perceived need for care, as discussed above.

The third component of the model is the use of health services per se, traditionally the main outcome. Additionally, other personal health practices, such as diet, exercise or self-care, are recognised as interacting with the formal use of services.

The fourth component covers other outcomes, such as perceived health status, evaluated health status and consumer/user/patient satisfaction, which could be important to health policy. Thus, Andersen suggests some additional measures such as ‘effective access’, which is achieved when utilisation studies show that use improves health status or consumer satisfaction with services, and ‘efficient access’, which is established when the level of health status or satisfaction increases relative to the amount of health care services consumed.
9.4.7 Equity The International Society for Equity in Health (ISEqH) defines equity in health as: ‘the absence of potentially remediable, systematic differences on one or more aspects of health across socially, economically, demographically or geographically defined population groups or subgroups’ [28]. Investigations related to mental HSR and equity will explore, for instance, whether people with equivalent needs receive equal treatment (horizontal equity) or whether those with greater mental health needs receive preferential treatment (vertical equity).
AN INTRODUCTION TO MENTAL HEALTH SERVICES RESEARCH
[Figure 9.1: diagram of four linked components. ENVIRONMENT (health care system; external environment); POPULATION CHARACTERISTICS (predisposing characteristics; enabling resources; need); HEALTH BEHAVIOUR (personal health practices; use of health services); OUTCOMES (satisfaction with treatment, perceived and/or evaluated health status, quality of life…).]
Fig 9.1 Behavioral Model and Access to Medical Care by Ronald M. Andersen. Adapted from: Andersen RM (1995) Revisiting the behavioral model and access to medical care: does it matter? Journal of Health and Social Behavior, 36, 1–10 [27].
Using his model of access to care, Andersen defines equitable access as occurring when demographic and need variables account for most of the variance in utilisation, whereas inequitable access occurs when social structure (for instance ethnicity), health beliefs or enabling resources (income) determine who gets medical care [27].
9.5 Examples of mental health services research studies In this section we describe and discuss several studies in the area of Mental HSR to provide a more applied perspective of the concepts outlined in the first part of the chapter. Mental HSR is a multidisciplinary area of knowledge and, as such, it implies the use of different methodologies which depend on the main aim of the study. In this second part we present studies based both on administrative data and on primary data collection, including some examples of qualitative studies.
9.5.1 Administrative data Deinstitutionalisation radically changed how mental health care attempts to meet patients’ wants and needs. No longer does the state hospital try to meet these multiple wants and needs; a great number of alternative community-based settings and alternative inpatient settings have sprung up since deinstitutionalisation [4]. In fact, the principles of psychiatric reform emphasised the need to focus on community care, with no more admissions to state psychiatric hospitals and with in-patient care provided in small wards in general hospitals. To determine whether this objective was met, we can use available case registers and study patterns of service use. The paper by Tansella and colleagues [29] is an example of this. In this paper, they describe the development of a community-based mental health service in South Verona (Italy), the patterns of care provided by this new service, and its costs since its set-up in 1978. Using the South-Verona Psychiatric Case Register they were able to show that, between 1979 and 2003,
hospital care consistently decreased, whereas outpatient care, home visits and day-hospital care increased. Specifically, hospital rates decreased from almost 350 patients per 100 000 adult South Verona residents in 1979 to just 50 patients per 100 000 South Verona residents in 2003. On the other hand, outpatient/community care increased from nearly 25 patients per 100 000 residents in 1979 to more than 250 per 100 000 in 2003. Twenty-five years after the reform (from 1978 to 2003) there was a 29% decrease in inpatient admissions, with a 56% decrease in compulsory admissions. The mean number of occupied beds per day decreased over time, falling by 81% between 1977 and 2003. Figure 9.2 shows the patterns of inpatient admissions from 1977 to 2003. This study could be seen as evidence of the achievement of one of the main objectives of psychiatric reform. These kinds of studies could be useful in monitoring and evaluating the implementation of a programme or a new policy.
One of the main studies in Mental HSR is the World Health Organization (WHO) Mental Health Atlas. Following Thornicroft and Tansella’s matrix, this study is an example of a country-input study, as it compares resources devoted to mental health (inputs) in different countries that are grouped into wide regions. This project was initiated in 2000 with the objectives of collecting, compiling and disseminating global information on mental health resources and services in each country [30]. With this information, WHO aims to show both
public and professionals the inadequacies of existing resources and services devoted to mental health, and the large inequities in their distribution at national and global level. In 2005 this information was updated in a second edition of the Atlas. Information was obtained from the Ministry of Health of each country and triangulated with the results of an exhaustive literature search and with other kinds of documents submitted and collected by WHO Regional Offices staff. Information was also checked with experts and members of the World Psychiatric Association. The 192 WHO member states and the 11 associated members are represented in the Atlas. This represents nearly 99% of the world’s population. As an example, Table 9.3 shows a comparison of the median number of different mental health professionals per 100 000 inhabitants, according to WHO Regions. As can be observed, there is a large variation in the number of professionals from region to region. For instance, there are nearly 1800 psychiatrists for 702 million people in the African Region, compared with more than 89 000 psychiatrists for 879 million people in the European Region. This highlights not just the lack of resources but also the marked inequities in resource distribution. Such information has potential value for planning mental health services at both national and international level. Moreover, as information is updated, comparisons and changes in resources devoted to mental health can be monitored, indicating whether
[Figure 9.2: chart of in-patient admissions per 100 000 residents (vertical axis 0–600) by year (horizontal axis 1977–2003). Series: compulsory; to state mental hospital (voluntary); to public care; to private care; total.]
Fig 9.2 Patterns of in-patient admissions from 1977 to 2003 in South Verona (ratios per 100 000 residents). Own elaboration with data obtained from: Tansella M, Amaddeo F, Burti L, Lasalvia A and Ruggeri M (2006) Evaluating a community-based mental health service focusing on severe mental illness. The Verona experience. Acta Psychiatrica Scandinavica, 113, 90–94 [29].
Table 9.3 Median number of mental health professionals (per 100 000 inhabitants) by WHO regions.

Profession                               Africa  Americas  Eastern Mediterranean  Europe  South-East Asia  Western Pacific  World
Psychiatrists                            0.04    2.00      0.95                   9.80    0.20             0.32             1.20
Psychiatric nurses                       0.20    2.60      1.25                   24.8    0.10             0.50             2.00
Neurologists                             0.02    0.70      0.30                   4.00    0.05             0.00             0.30
Neurosurgeons                            0.01    0.40      0.20                   1.00    0.03             0.00             0.20
Psychologists working in mental health   0.05    2.80      0.60                   3.10    0.03             0.03             0.60
Social workers working in mental health  0.05    1.00      0.40                   1.50    0.04             0.05             0.40

Own elaboration with data obtained from: World Health Organization (2005) Mental Health Atlas, World Health Organization, Geneva [30].
specific policies aimed at improving resources have been effective. For instance, comparisons of data collected in 2001 and updated in 2004 show an increase in the quantity of mental health professionals in the world, the number of psychologists and social workers showing the greatest increases (with increases in median of 0.2 points and 0.1 points per 100 000 inhabitants respectively). There were no major changes in the median number of other professionals. Comparisons between large regions are interesting from a macro/international standpoint. Nevertheless, it would be interesting to complete and compare these results with data obtained at a meso-level, that is with data gathered in municipalities, health areas or districts, as it may diverge from data aggregated at higher levels (i.e. countries). The meso-level comparison of mental health service availability is related to the study of SAV in medical procedures. One of the major difficulties when comparing availability of services in different areas (even within the same countries) is the different names that services are given. Moreover, the name they receive may or may not describe its main activity, which can make comparisons difficult. To deal with this barrier, in 1994 a group of investigators named the European Psychiatric Care Assessment Team (EPCAT) group began to work towards the establishment of a standardised methodology for the description and assessment of the care received by people suffering from mental disorders. They developed the European Service Mapping Schedule (ESMS). The ESMS is an instrument that serves three purposes: (i) to compile the adult mental health services of a catchment
area; (ii) to describe and compare the structures and types of mental health services between catchment areas and (iii) to measure and compare the levels of provision of major types of mental health services between catchment areas. The ESMS uses atheoretical descriptors based on the main types of care: (i) residential care; (ii) day care and (iii) outpatient and community care. By choosing these terms, the ESMS avoids using culturally laden words (such as rehabilitation) or common names designating different types of care (day-centre). Moreover, each type of care is divided according to whether patients stay overnight at the service, receive care in a day-care facility or have face-to-face contact with the professional. Secondary and tertiary subdivisions are made on the basis of other characteristics such as intensity, time of stay and mobility [31]. Graphically, the ESMS can be seen as a ‘service tree’ (Figure 9.3).
Salvador-Carulla et al. [32] used the ESMS to make a meso-level comparison of mental health service availability and use in Chile and Spain. They selected small areas (catchment areas) with marked differences regarding organisation and provision of services. The areas selected in Spain were: Gavà (Catalonia, in the north-east), Granada-Norte (Andalusia, South) and Rochapea (Navarre, North). The three areas differed in socioeconomic characteristics and in the distribution and organisation models of their mental health services. It is also important to note that in Spain the responsibilities of the National Health System and Social Services have been gradually transferred to each of the 17 autonomous regions that comprise Spain. The three small Spanish areas selected are from different autonomous regions, with
[Figure 9.3: tree diagram. Root: Mental Health Services. Main branches: Residential; Day & structured activity; Out-patient & community; Self-help & non-professional. Lower-level labels in the diagram include: secure, generic acute, acute and non-acute; hospital and non-hospital; time limited and indefinite stay; 24-h support and daily support; high, moderate and low intensity; work, work related activity, other structured activity and social support; emergency care and continuing care; mobile and non-mobile; 24 h and limited hours.]
Fig 9.3 The ESMS service tree. Modified from: Johnson S, Kuhlmann R and the EPCAT group (2000) The European Service Mapping Schedule (ESMS): development of an instrument for the description and classification of mental health services. Acta Psychiatrica Scandinavica, 102, 14–23 [31].
different mental health services and objectives. The Chilean areas were Concepcion and Talcahuano. The organisation of services in Concepcion is more traditional (dating from the 1960s), whereas provision of mental health services in Talcahuano was reorganised during the 1990s. Briefly, the procedure for data collection for the ESMS began in each area with a face-to-face interview with the head of the community mental-health centre and the reference hospital setting. A map of the services and the main local administrative data source were identified. Figures 9.4–9.7 show the utilisation rates of the main types of care in the five small health areas per 100 000 inhabitants. This study showed that there were differences in the use of residential and day-care facilities between Spanish and Chilean areas. However, if we look at the data in detail, the rate of continuous outpatient care in the Chilean areas was closer to that of the Rochapea area than to that of the other two Spanish areas. This could be related to the greater availability of these kinds of services in these areas, which could have an impact on demand as well as on the clinical pattern. This study also showed the lack of availability of day-care services and acute care. It demonstrated that patterns of hospital residential care in Chile and Spain were more similar than expected. In fact, the poorest Spanish area studied (Granada) was very similar to the Chilean ones. Combining data from the WHO Mental Health Atlas with meso-level data offers a more accurate picture of the use of mental health services. Another example of the use of the ESMS can be found in the study by Pirkola et al. in Finland [33],
which aimed to investigate the relation between suicide risk and different ways of organising mental health services in the 428 municipalities that make up Finland. Each of these municipalities has nearly 5000 inhabitants. The provision of mental health care has been transferred to these municipalities, so management structure and procedures vary widely among them. Again, following the mental health matrix, this study could be seen as an example of meso-level comparison, but in this case the authors compare outcomes (suicide) rather than inputs. The authors obtained ESMS data by means of interviews with the 20 mainland Finnish hospital districts, and from health care and social-care officers. Data on suicide were obtained from Statistics Finland. Findings from this study suggested that, after controlling for socioeconomic factors, those municipalities with a predominance of outpatient services had a lower suicide rate (relative risk (RR) 0.94, 95% CI 0.90–0.98). Although the cross-sectional design of the study precluded causal inferences, the results were consistent with those of a meta-analysis suggesting that patients treated by community mental-health teams are less likely to kill themselves.
Studies made with administrative data have some advantages: the data are readily available, are normally inexpensive to acquire, are computer-based and typically provide a large sample size. Nevertheless, when compared with studies using primary data, some limitations have to be acknowledged. The main disadvantage is that, in most cases, sociodemographic information is scarce. Moreover, with administrative data, the study of unmet needs from the general population cannot
[Figure 9.4: bar chart, vertical axis 0–30. Areas: Rochapea, Gavà, Granada Norte, Concepcion, Talcahuano. Series: hospital acute; hospital non-acute: total; non-hospital: total.]
Fig 9.4 Comparison of mental health services in five small areas. Residential care (beds occupied per month per 100 000 population). Own elaboration with data obtained from Salvador-Carulla L, Saldivia S, Martínez-Leal R, Vicente B, García-Alonso C, Grandon P and Haro JM (2008) Meso-level comparison of mental health services availability and use in Chile and Spain. Psychiatric Services, 59, 421–428 [32].
[Figure 9.5: bar chart, vertical axis 0–120. Areas: Rochapea, Gavà, Granada Norte, Concepcion, Talcahuano. Measure: day-care (users per month per 100 000 population).]
Fig 9.5 Comparison of mental health services in five small areas. Day care (day and structured activities). Own elaboration with data obtained from Salvador-Carulla L, Saldivia S, Martínez-Leal R, Vicente B, García-Alonso C, Grandon P and Haro JM (2008) Meso-level comparison of mental health services availability and use in Chile and Spain. Psychiatric Services, 59, 421–428 [32].
[Figure 9.6: bar chart, vertical axis 0–250. Areas: Rochapea, Gavà, Granada Norte, Concepcion, Talcahuano. Measure: emergency care.]
Fig 9.6 Comparison of mental health services in five small areas. Outpatient and ambulatory care I (contacts per month per 100 000 population). Own elaboration with data obtained from Salvador-Carulla L, Saldivia S, Martínez-Leal R, Vicente B, García-Alonso C, Grandon P and Haro JM (2008) Meso-level comparison of mental health services availability and use in Chile and Spain. Psychiatric Services, 59, 421–428 [32].
[Figure 9.7: bar chart, vertical axis 0–3000. Areas: Rochapea, Gavà, Granada Norte, Concepcion, Talcahuano. Measure: continuing care.]
Fig 9.7 Comparison of mental health services in five small areas. Outpatient and ambulatory care II (service users per month per 100 000 population). Own elaboration with data obtained from Salvador-Carulla L, Saldivia S, Martínez-Leal R, Vicente B, García-Alonso C, Grandon P and Haro JM (2008) Meso-level comparison of mental health services availability and use in Chile and Spain. Psychiatric Services, 59, 421–428 [32].
be studied. Epidemiological studies could deal with these disadvantages, helping, with the information obtained, to document service use and unmet need for treatment.
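The regional contrast in psychiatrist numbers and the fall in South Verona hospital rates described earlier in this section can be checked with simple arithmetic. The following is a minimal sketch, assuming the rounded figures quoted in the text ('nearly 1800', 'more than 89 000', 'almost 350', 'just 50'); the helper names `per_100k` and `percent_change` are hypothetical and do not come from any of the cited studies.

```python
def per_100k(count: float, population: float) -> float:
    """Return a crude rate per 100 000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


def percent_change(start: float, end: float) -> float:
    """Return the relative change from start to end, as a percentage."""
    if start == 0:
        raise ValueError("start must be non-zero")
    return (end - start) / start * 100


# Rounded figures quoted in the text (assumed approximate).
africa = per_100k(1_800, 702_000_000)      # psychiatrists, African Region
europe = per_100k(89_000, 879_000_000)     # psychiatrists, European Region
verona_change = percent_change(350, 50)    # hospital rate per 100 000, 1979 vs 2003

print(f"African Region: {africa:.2f} psychiatrists per 100 000")
print(f"European Region: {europe:.2f} psychiatrists per 100 000")
print(f"Europe/Africa ratio: {europe / africa:.1f}")
print(f"South Verona hospital rate change: {verona_change:.1f}%")
```

These crude regional rates are not the same quantity as the medians in Table 9.3: a median of country-level values is not weighted by population, so the two measures can differ considerably for the same region.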
9.5.2 Studies using primary data collection One of the most important epidemiological initiatives for Mental HSR is the World Mental Health (WMH) Survey Initiative. This project, sponsored by WHO [34], aims to obtain cross-national information on the prevalence and correlates of mental, substance and behavioural disorders in all WHO Regions. To date, 28 countries are participating in this study. Using data from the WMH surveys, Wang et al. published a paper in 2007 examining the frequency, types and adequacy of mental health service use in the 17 countries in which surveys had been completed at the time of their study [35]. The main strength of this initiative is the use of a common methodology in all the countries. Briefly, face-to-face household interviews were carried out in population-representative samples in the participating countries, providing a total sample of nearly 85 000 respondents from low-income, middle-income and high-income countries. The presence of lifetime, 12-month and current mental disorders was assessed with the CIDI 3.0, a structured diagnostic interview which can be administered by trained lay interviewers. The CIDI 3.0 can provide diagnoses of mental disorders based on criteria from the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) or the International Classification of Diseases (ICD-10). The CIDI 3.0 has been shown to have generally good agreement with clinical diagnosis [36]. Disorders included in this study were agoraphobia, generalised anxiety disorder, panic disorder, post-traumatic stress disorder, social phobia, specific phobia, bipolar disorder type I and II, dysthymia, major depressive disorder and substance-use disorders. Mental disorders were classified as serious, moderate or mild depending on specific criteria regarding functioning, disability and clinical aspects.
Services received in the previous 12 months were assessed by asking respondents if they had ever seen any type of professional, either as an outpatient or inpatient, for problems with emotions, nerves,
mental health or use of alcohol or drugs. Included were mental health professionals (e.g. psychiatrist, psychologist), general medical professionals (e.g. family doctor, occupational therapist) and other non-health professionals such as religious counsellors or traditional healers. Examples of these types of providers were presented in a Respondent Booklet used as a visual recall aid and varied somewhat across countries depending on local circumstances. Follow-up questions asked about age at first and most recent contacts as well as number and duration of visits in the past 12 months. The authors also estimated the proportion of participants who potentially could have received minimally adequate treatment according to evidence-based guidelines. Treatment was considered adequate if the participant received medication for at least 1 month plus at least four visits to any type of medical doctor, or did not receive medication but had made at least eight visits to any type of professional.
The proportion of respondents having made any use of health services by severity of mental disorders is shown in Figure 9.8. As can be seen, there is a relation between disorder severity and the probability of any use of health services in all the countries except China. In general, the more severe the disorder, the greater the probability of health service use. On the other hand, the proportion of respondents using services in the previous 12 months for their emotional problems is lower in low-income countries than in developed countries. This may indicate a serious equity issue. Another problem highlighted in these results is the high number of people with unmet need for treatment, even among those with the most severe disorders. This problem is worse in developing countries, but even in developed countries only half of those with serious mental disorders receive any kind of treatment.
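The definition of minimally adequate treatment given above is a simple decision rule, and it can be written down explicitly. The following is a minimal sketch of the rule as worded in this chapter, not the code used by the WMH investigators; the class `TreatmentRecord`, its field names and the function `minimally_adequate` are hypothetical. It reads 'did not receive medication' literally, which is an assumption about how borderline cases are handled.

```python
from dataclasses import dataclass


@dataclass
class TreatmentRecord:
    """Hypothetical summary of one respondent's 12-month treatment."""
    months_on_medication: float    # 0 means no medication received
    medical_doctor_visits: int     # visits to any type of medical doctor
    any_professional_visits: int   # visits to any type of professional


def minimally_adequate(rec: TreatmentRecord) -> bool:
    """Apply the two routes to adequacy described in the text."""
    # Route 1: medication for at least 1 month plus at least four
    # visits to any type of medical doctor.
    medication_route = (
        rec.months_on_medication >= 1 and rec.medical_doctor_visits >= 4
    )
    # Route 2: no medication, but at least eight visits to any type
    # of professional (literal reading; an assumption for edge cases).
    no_medication_route = (
        rec.months_on_medication == 0 and rec.any_professional_visits >= 8
    )
    return medication_route or no_medication_route


# Illustrative, made-up respondents.
print(minimally_adequate(TreatmentRecord(2, 4, 4)))
print(minimally_adequate(TreatmentRecord(0, 0, 7)))
```

Writing the rule this way makes the thresholds easy to vary in a sensitivity analysis, for example to see how the proportion classified as adequately treated changes if the visit requirements are relaxed.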
It is also interesting to note that there is a proportion of participants without mental disorders who use services for emotional problems. In theory, this use could be considered as an inappropriate and inefficient use of services. However, they could be affected by disorders not assessed in the survey, or in maintenance treatment for a disorder which occurred in the past. A limitation of the WMH consortium’s approach is that they are using ‘normative needs’ (i.e. they considered that people affected by mental disorders
[Figure 9.8: horizontal bar chart, horizontal axis 0–100. Countries: Belgium, USA, Spain, New Zealand, Israel, Italy, Netherlands, France, Germany, Colombia, South Africa, Mexico, Ukraine, Japan, Nigeria, Lebanon, China. Series by disorder severity: severe, moderate, mild, none.]
Fig 9.8 Use of mental health services by severity of mental disorders and country. Own elaboration with data extracted from Wang PS et al. (2007) Use of mental health services for anxiety, mood and substance disorders in 17 countries in the WHO world mental health surveys. The Lancet, 370, 841–850 [35].
are objectively in need of treatment), and not a ‘subjective’ approach (based on what people feel they need). So the data should be interpreted with caution. The subjective approach in the assessment of needs is exemplified by the work analysing patient needs using the Camberwell Assessment of Need (CAN) inventory. For example, Ochoa et al. [37] conducted a study which evaluated 231 people with schizophrenia living in the city of Barcelona and its surroundings. The CAN instrument is useful in helping professionals to design treatment plans for their individual patients, but also in studying the performance of mental health services. The CAN evaluates the presence of need in 22 areas: accommodation, food, house upkeep, self-care, daytime activities, physical health, psychotic symptoms, information, psychological distress, risk to self, risk to others, alcohol, drugs, company, intimate relationships, sexual expression, child care, education, telephone, transport, money and benefits. For each of these areas, the CAN determines, if a need is detected, whether it is met, who provides the care (formal or informal care) and whether the help provided is appropriate. The questionnaire is completed independently by the staff and the patient. This double assessment allows the comparison of normative needs with felt needs. Briefly, this study pointed out that staff detected more needs than patients did (staff mean = 6.6 (SD = 3.17) vs. patients mean = 5.36 (SD = 2.71); p < 0.0001). The most frequent detected needs by patients were: psychotic symptoms, house upkeep, food and information. Staff detected needs in the areas of psychotic symptoms, company, daytime activities, house upkeep, food and information. With regard to who gave the required help, results showed that patients received more informal than formal help (75% of participants with met needs received informal help while, on the other hand, less than 50% received formal help). 
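The double assessment described above, in which staff and patient each rate the same 22 areas, lends itself to a small data structure. The following is a minimal sketch, assuming a three-level coding (no need, met need, unmet need) that is introduced here for illustration; the names `Need`, `summarise` and `disagreements`, and the example ratings, are hypothetical and are not data from Ochoa et al.

```python
from enum import IntEnum


class Need(IntEnum):
    """Assumed three-level rating for each CAN area."""
    NO_NEED = 0
    MET = 1
    UNMET = 2


# The 22 areas listed in the text.
CAN_AREAS = (
    "accommodation", "food", "house upkeep", "self-care",
    "daytime activities", "physical health", "psychotic symptoms",
    "information", "psychological distress", "risk to self",
    "risk to others", "alcohol", "drugs", "company",
    "intimate relationships", "sexual expression", "child care",
    "education", "telephone", "transport", "money", "benefits",
)


def summarise(ratings: dict[str, Need]) -> dict[str, int]:
    """Count total, met and unmet needs; unrated areas count as no need."""
    unknown = set(ratings) - set(CAN_AREAS)
    if unknown:
        raise ValueError(f"unknown CAN areas: {sorted(unknown)}")
    met = sum(1 for r in ratings.values() if r == Need.MET)
    unmet = sum(1 for r in ratings.values() if r == Need.UNMET)
    return {"total": met + unmet, "met": met, "unmet": unmet}


def disagreements(staff: dict[str, Need], patient: dict[str, Need]) -> list[str]:
    """List the areas where staff and patient ratings differ."""
    return [
        area for area in CAN_AREAS
        if staff.get(area, Need.NO_NEED) != patient.get(area, Need.NO_NEED)
    ]


# Made-up ratings for one person, for illustration only.
staff = {"psychotic symptoms": Need.MET, "company": Need.UNMET, "food": Need.MET}
patient = {"psychotic symptoms": Need.MET, "company": Need.NO_NEED}

print(summarise(staff))
print(summarise(patient))
print(disagreements(staff, patient))
```

Keeping the two sets of ratings separate, rather than merging them into a single score, preserves the distinction between normative and felt needs that motivates the double assessment.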
Regarding unmet needs, they also found that staff rated more areas as unmet needs than patients did (staff mean = 1.38 (SD = 1.75) vs. patients mean = 1.82 (SD = 1.98); p < 0.0001). The most frequent unmet needs expressed by patients were companionship, intimate relationships, sexual expression and daytime activities. The same areas were detected by staff. It is important to note that in most of the unmet areas, the participant
reported that they received help, although this was not considered sufficient to meet their need.
So far we have reviewed examples describing or comparing data. These kinds of studies are usual in Mental HSR and are useful for analysing and planning the needs of a given community. But Mental HSR is also interested in assessing the performance of programmes or interventions focused on mental or emotional problems. One example of such studies is the UK 700 case management trial [38]. This study was carried out in four centres, three in London and one in Manchester, which obtained, in 1993, funding from the National Health Service (NHS) for a randomised, controlled trial of intensive case management (ICM). Investigators aimed to investigate the cost-effectiveness of ICM (case-load size 10–15) compared with standard case management (SCM) (case-load size 30–35) for patients with severe psychosis. A total of 708 patients with psychosis and a history of repeated hospital admissions were randomly allocated to ICM or SCM and assessed at baseline, 12 and 24 months by researchers independent of those providing clinical care. They did not find any differences in terms of days in hospital for psychiatric problems over 24 months, or in scores on the Comprehensive Psychiatric Rating Scale, in quality of life, in the assessment of unmet needs, in the mean Disability Assessment Schedule total score or in patient satisfaction. Nor did they find differences between ICM and SCM in the total 2-year costs of care per patient. As neither form of case management was better than the other, the authors concluded that formal cost-effectiveness analyses were not required. This study had a clear policy implication: it contradicted the policy of advocating ICM for patients with severe psychosis, as it showed no beneficial effects of ICM on costs, clinical outcomes or cost-effectiveness. Another example is the paper by Bellon et al.
[39], carried out in a primary care setting, which aimed to assess the effectiveness of a general practitioner intervention to reduce frequent-attendee consultations. This study was carried out by a multidisciplinary team formed by general practitioners, statisticians and psychiatrists. The interest of this study from a Mental HSR standpoint is that, typically, frequent-attendee consultations are sought by people affected by emotional problems or mental disorders. The
authors designed a randomised, controlled trial with frequent attendees divided into an intervention group (N = 66) and two control groups (CG1, N = 71; and CG2, patients who consulted the same general practitioners (GPs) as the intervention group, N = 72). A total of six GPs participated in the study. GPs in the control groups were blind to which patients had been selected as controls. The authors used two different control groups: CG1, absolutely naïve to the intervention, and CG2, formed by patients of the GPs who also took part in the intervention group, in order to study whether the intervention was internalised. The setting was a primary health care centre in southern Spain. The authors identified the sample of frequent attendees with reference to mean annual consultation rates (before intervention) at the health centre, stratified by sex and age. Frequent attendees were considered to be those patients who had an annual rate of consultation at least twice as high as the sex-and-age-related mean for the health centre; that is, nearly the 90th percentile of the overall distribution.
The intervention aiming to reduce frequent-attendee visits was called by the investigators the ‘seven hypotheses + team’ intervention. The three GPs in the intervention group underwent an interactive workshop training session (15 hours). Briefly, this intervention encourages GPs to select, from a list of seven possible hypotheses, a reason why the patient is a frequent attendee: biological, psychological, social, family, cultural, administrative–organisational or related to the doctor–patient relationship. After this, GPs share their analysis with other GPs regarding the hypothesis and the plans derived from it (this is the team component of the intervention). The frequent attendees’ mean consultations by group at baseline and 1 year after intervention with GPs are detailed in Figure 9.9.
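The definition of a frequent attendee used in this trial (an annual consultation rate at least twice the sex-and-age-related mean) can be expressed as a short stratified calculation. The following is a minimal sketch under stated assumptions: the function name `flag_frequent_attenders`, the tuple layout, the age bands and the example patients are all hypothetical, and each stratum mean here includes the patient being assessed, which may differ from how the original authors computed it.

```python
from collections import defaultdict


def flag_frequent_attenders(
    patients: list[tuple[str, str, str, int]], factor: float = 2.0
) -> list[str]:
    """Return ids of patients whose annual visits are at least `factor`
    times the mean for their sex-and-age stratum.

    Each patient is (patient_id, sex, age_band, annual_visits).
    """
    totals = defaultdict(lambda: [0, 0])  # stratum -> [sum of visits, count]
    for _, sex, age_band, visits in patients:
        stratum = totals[(sex, age_band)]
        stratum[0] += visits
        stratum[1] += 1
    means = {key: total / n for key, (total, n) in totals.items()}

    return [
        pid
        for pid, sex, age_band, visits in patients
        if means[(sex, age_band)] > 0
        and visits >= factor * means[(sex, age_band)]
    ]


# Made-up example: two strata with different mean consultation rates.
example = [
    ("a", "F", "40-49", 2),
    ("b", "F", "40-49", 3),
    ("c", "F", "40-49", 13),
    ("d", "M", "40-49", 4),
    ("e", "M", "40-49", 4),
]
print(flag_frequent_attenders(example))
```

Stratifying before applying the threshold matters because expected consultation rates vary with age and sex; a single cut-off for the whole centre would over-select groups that consult more often for ordinary reasons.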
At the end of the follow-up it was observed that the intervention group had significantly fewer visits than control group 1 (p < 0.001) and control group 2 (p < 0.001). Moreover, CG2 (those patients whose GPs form part of the control and intervention groups) also showed a reduction between visits at baseline and 1 year later (p < 0.001). All the results were adjusted by covariates such as chronic diseases and self-reported health, provider-use interface variables (such as traveling time to the health centre
and satisfaction with the GP), sociodemographic and psychosocial variables. Pending further evidence, the intervention showed a significant and relevant reduction in frequent-attendee consultations. This study could be seen as an example of a patient-process study.
9.5.3 Qualitative studies The use of qualitative methodologies in the Mental HSR is relatively recent. According to the review made by Murphy et al. in 1998 [40], qualitative methods could be particularly useful in order to understand the findings of outcome studies in HSR. Qualitative research could provide the information that both policy makers and clinicians need to translate the findings of research into interventions and changes in policies and health services. In this sense, qualitative methods are very close to the field of implementation research, an emerging science that could be defined as ‘the systematic study of how a specific set of strategies are used to successfully integrate evidence-based public health interventions within specific settings’ [41, 42]. For instance, Hysong et al. conducted a qualitative study in which 102 employees involved in the implementation of clinical guidelines in different centres were interviewed. They were asked about specific strategies for its implementation. Results showed that in those centres where strategies were adapted to the local context, implementation was successful [43]. Additionally, other areas where qualitative methodology has shown particular strengths are: (i) in the identification of natural solutions to problems; (ii) in the studies about processes, focused on the functioning of a programme or a team and aiming to understand its internal organisation and, (iii) in comparative analyses, for instance, between different ways of coordinating services. Studies aiming to describe problems of coordination among professionals could be seen as examples of research at the local level, focusing on ‘invisible’ inputs. For instance, Calderon and colleagues [44] carried out a study to find out what family doctors and psychiatrists thought about their collaboration in the healthcare of patients with depression. 
Fig 9.9 Frequent attendees’ mean consultations by group at baseline and one year after intervention with GPs. Own elaboration with data obtained from Bellón JA, Rodríguez-Bayón A, de Dios Luna J and Torres-González F (2008) Successful GP intervention with frequent attenders in primary care: randomised controlled trial. British Journal of General Practice, 58, 324–330 [39]. (Chart of mean visits, on a scale of 0 to 25, at baseline and at 1 year for the Intervention Group (IG), Control Group 1 (CG1) and Control Group 2 (CG2).)
A total of 29 family doctors and 13 psychiatrists participated in four discussion groups (two for family doctors and
two for psychiatrists). In these groups, they related their experiences of treating patients with depression. Results showed that the perceptions and attitudes of the two types of professionals were different. They had diverse views on the patients, the health context and their own expectations. For instance, family doctors often found that patients with depression consulted about another type of health problem, which made the diagnosis of depression more difficult. For them, previous knowledge of the patient was a great facilitator of a correct diagnosis of depression. Family doctors did not feel skilled enough to deal with mental disorders. When asked about referral to a psychiatrist, family doctors explained that referral did not depend only on the severity of the problem or its course. Other factors, such as the relationship with the patient, previous experience with the psychiatrist and the family doctor’s knowledge of the functioning of the Outpatient Mental Health Center, were also critical in deciding which patients to refer. Psychiatrists, on the other hand, usually saw patients after they had been referred by a primary care doctor. Psychiatrists felt that a diagnosis of depression by a family doctor was sometimes inappropriate, since family doctors could give a diagnosis of depression to a person who was having social problems, or who suffered from a personality, or even psychotic, disorder. Psychiatrists did not know much about the primary health care context, and their relationship with the patient was
also conditioned by their relationship with family doctors. Moreover, their expectations were focused on treating serious mental disorders and not on mental problems of lesser severity. The ideas shared by both types of professionals included the lack of resources, the progressive psychiatrisation of sadness and the low tolerance of frustration among citizens. This study showed that, independently of the macro aspects which are out of clinicians’ control, in order to improve treatment of depression in public health, family doctors and psychiatrists needed to share the same knowledge and to adopt a patient-centred approach.
Finally, the role of users and their relatives in mental HSR is progressively increasing. Relatives have been involved in mental HSR since it was first set up. For instance, the National Alliance on Mental Illness (NAMI), a non-profit organisation formed by consumers, relatives and friends of people with mental problems, has been fighting against stigma since its inception in 1979 and has now included research among its aims. In Europe it is also worth mentioning the efforts of the European Federation of Associations of Families of People with Mental Disorders (EUFAMI), which has had, since 2004, a Research Advisory Group aiming to initiate different research projects in cooperation with investigators. Initiatives such as the Expert Patient Programme [45] or the Best Practice Guidelines for consumer-delivered services [46] could likewise be seen as examples of user involvement in mental HSR. Using a narrative approach, research performed in collaboration with patients’ organisations gives us information about how patients experience their illness and how they perceive stigma and their relationship with mental health services. An example of this kind of research is that carried out by ADEMM, the mental health users association of Catalonia (Spain), focused on the relation between users and professionals in the field of mental healthcare. This research is interesting because it was done by mental health patients with the collaboration of the Catalan Department of Health and the methodological supervision of a psychosocial research centre. This is an example of how patients could be empowered by means of research [47].
9.6 Conclusion
In this chapter we have presented a basic description of the major issues in the field of mental HSR. We have used the well-known Mental Health Matrix by Tansella and Thornicroft as a way of organising this field of knowledge. The major concepts identified include the need for care, appropriateness, effectiveness and equity. We have shown how these concepts are dealt with in some examples. Each of them deserves further attention and could easily provide enough information for its own handbook. In conclusion, mental HSR could be important for providers, administrators and the public, for at least the following four reasons: (i) to guarantee resource efficiency and avoid waste; (ii) to help to establish priorities in limited resource environments; (iii) to reduce mental health inequities; and (iv) to provide a base of evidence for mental health planning [48].
References
[1] Starfield, B. (1973) Health services research: a working model. N. Engl. J. Med., 289, 132–136.
[2] Lalonde, M. (1974) A New Perspective on the Health of Canadians. A Working Document, Government of Canada, Ottawa.
[3] Black, N. (1997) Health services research: saviour or chimera? Lancet, 349, 1834–1836.
[4] Anthony, W.A. (1993) Recovery from mental illness: the guiding vision of the mental health services system in the 1990s. Psychosoc. Rehabil. J., 16, 11–23.
[5] Tansella, M. and Thornicroft, G. (1998) A conceptual framework for mental health services: the matrix model. Psychol. Med., 28, 503–508.
[6] Stein, L.I. and Test, M.A. (1980) Alternative to mental hospital treatment I. Conceptual model, treatment program, and clinical evaluation. Arch. Gen. Psychiatry, 37, 392–397.
[7] Hoult, J. and Reynolds, I. (1983) Psychiatric Hospital Versus Community Treatment: A Controlled Study. New South Wales Department of Health, Canberra.
[8] Institute of Medicine (1995) Committee on Health Services Research: Training and Work Force Issues, Health Services Research: Workforce and Educational Issues, National Academy Press, Washington, DC.
[9] Helping the Nation With Health Services Research. Fact Sheet. AHRQ Publication No. 02-P014, March 2002. Agency for Healthcare Research and Quality, Rockville, MD. Available from http://www.ahrq.gov/news/focus/scenarios.htm (accessed 7 October 2010).
[10] Lohr, K.N. and Steinwachs, D.M. (2002) Health services research: an evolving definition of the field. Health Serv. Res., 37, 7–9.
[11] Busch, A.B. (2006) Recent advances in mental health services research: introduction. Harv. Rev. Psychiatry, 14, 183–184.
[12] Thornicroft, G. and Rose, D. (2005) Health services research: is there anything to learn from mental health? J. Health Serv. Res. Policy, 10, 1–2.
[13] Thornicroft, G. and Tansella, M. (1999) The Mental Health Matrix: A Manual to Improve Services, Cambridge University Press, Cambridge.
[14] Aday, L.A. (2001) Establishment of a conceptual base for health services research. J. Health Serv. Res. Policy, 6, 183–185.
[15] Spasoff, R.A. (1999) Epidemiologic Methods for Health Policy, Oxford University Press, New York.
[16] Last, J.M. (ed.) (1995) A Dictionary of Epidemiology, 3rd edn, Oxford University Press, New York.
[17] Thornicroft, G. (2001) Measuring Mental Health Needs, 2nd edn, Royal College of Psychiatrists, London.
[18] Asadi-Lari, M., Packham, C. and Gray, D. (2003) Need for redefining needs. Health Qual. Life Outcomes, 1, 34.
[19] Aday, L.A., Begley, A.C., Lairson, D.R. and Skater, C.H. (1993) Evaluating the Medical Care System, Health Administration Press, Ann Arbor, MI.
[20] Drummond, M.F., Sculpher, M.J., Torrance, G.W. and O’Brien, B.J. (2005) Methods for the Economic Evaluation of Health Care Programmes, Oxford University Press, Oxford.
[21] Shape, V.A. and Faden, A.I. (1996) Appropriateness in patient care: a new conceptual framework. Milbank Q., 74, 115–138.
[22] Muir, G.J.A. (1996) Evidence-Based Healthcare. How to Make Health Policy and Management Decisions, Churchill Livingstone, London.
[23] Health Services Research Group (1992) Small-area variation: what are they and what do they mean? Can. Med. Assoc. J., 146, 467–470.
[24] Wennberg, J.E., Barnes, B.A. and Zubkoff, M. (1982) Professional uncertainty and the problem of supplier-induced demand. Soc. Sci. Med., 16, 811–824.
[25] Chassin, M.R. (1993) Explaining geographic variations. The enthusiasm hypothesis. Med. Care, 31, 37–44.
[26] Longo, D.R. (1993) Patient practice variation: a call for research. Med. Care, 31, 81–85.
[27] Andersen, R.M. (1995) Revisiting the behavioral model and access to medical care: does it matter? J. Health Soc. Behav., 36, 1–10.
[28] Macinko, J.A. and Starfield, B. (2002) Annotated bibliography on equity in health, 1980–2001. Int. J. Equity Health, 1, 1.
[29] Tansella, M., Amaddeo, F., Burti, L. et al. (2006) Evaluating a community-based mental health service focusing on severe mental illness. The Verona experience. Acta Psychiatr. Scand., 113, 90–94.
[30] World Health Organization (2005) Mental Health Atlas, World Health Organization, Geneva.
[31] Johnson, S. and Kuhlmann, R., the EPCAT Group (2000) The European Service Mapping Schedule (ESMS): development of an instrument for the description and classification of mental health services. Acta Psychiatr. Scand., 102, 14–23.
[32] Salvador-Carulla, L., Saldivia, S., Martínez-Leal, R. et al. (2008) Meso-level comparison of mental health services availability and use in Chile and Spain. Psychiatr. Serv., 59, 421–428.
[33] Pirkola, S., Sund, R., Sailas, E. and Wahlbeck, K. (2009) Community mental-health services and suicide rate in Finland: a nationwide small-area analysis. Lancet, 373, 147–153.
[34] Kessler, R.C. and Üstün, T.B. (2004) The World Mental Health (WMH) Survey Initiative version of the World Health Organization (WHO) Composite International Diagnostic Interview (CIDI). Int. J. Methods Psychiatr. Res., 12, 93–121.
[35] Wang, P.S., Aguilar-Gaxiola, S., Alonso, J. et al. (2007) Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the WHO world mental health surveys. Lancet, 370, 841–850.
[36] Haro, J.M., Arbabzadeh-Bouchez, S., Brugha, T.S. et al. (2006) Concordance of the Composite International Diagnostic Interview Version 3.0 (CIDI 3.0) with standardized clinical assessments in the WHO World Mental Health surveys. Int. J. Methods Psychiatr. Res., 15, 167–180.
[37] Ochoa, S., Haro, J.M., Autonell, J. et al. (2003) Met and unmet needs of schizophrenia patients in a Spanish sample. Schizophr. Bull., 29, 201–210.
[38] UK 700 Group (2000) Cost-effectiveness of intensive v. standard case management for severe psychotic illness. UK 700 case management trial. Br. J. Psychiatry, 176, 537–543.
[39] Bellón, J.A., Rodríguez-Bayón, A., de Dios Luna, J. and Torres-González, F. (2008) Successful GP intervention with frequent attenders in primary care: randomised controlled trial. Br. J. Gen. Pract., 58, 324–330.
[40] Murphy, E., Dingwall, R., Greatbatch, D. et al. (1998) Qualitative research methods in health technology assessment: a review of the literature. Health Technol. Assess., 2 (16), 1–294.
[41] Proctor, E.K., Landsverk, J., Aarons, G. et al. (2009) Implementation research in mental health services: an emerging science with conceptual, methodological, and training challenges. Admin. Policy Ment. Health Ment. Health Serv. Res., 36, 24–34.
[42] Tansella, M. and Thornicroft, G. (2009) Implementation science: understanding the translation of evidence into practice. Br. J. Psychiatry, 195, 283–285.
[43] Hysong, S.J., Best, R.G. and Pugh, J.A. (2006) Clinical practice guideline implementation strategy patterns in veterans affairs primary care clinics. Health Serv. Res., 42, 84–103.
[44] Calderón-Gómez, C., Retolaza Balsategui, A., Bacigalupe de la Hera, A. et al. (2009) Family doctors and psychiatrists and the patient with depression: the need to re-adjust health care approaches and organizational dynamics. Aten. Primaria, 41, 33–40.
[45] Davidson, L. (2005) Recovery, self-management and the expert patient - changing the culture of mental health from a UK perspective. J. Ment. Health, 14, 25–35.
[46] Kloos, B. (2005) Creating new possibilities for promoting liberation, well-being and recovery: learning from experiences of psychiatric consumers/survivors, in Community Psychology. In Pursuit of Liberation and Well-Being (eds G. Nelson and I. Prilleltensky), Palgrave MacMillan, New York, pp. 426–447.
[47] ADEMM-Usuaris de Salut Mental de Catalunya (2007) The Relation Between Users and Professionals in the Scope of the Mental Health. Centre Especial de Treball Apunts, Barcelona. Available from http://www.ademm-usm.org/main_cas.html?opcio=3 (accessed 27 September 2010).
[48] Le, F.E. (2002) Health needs, health demand and health services utilisation. 37th Meeting of the Advisory Committee on Health Research, Washington, DC. Available from http://www.paho.org/English/HDP/HDR/ACHR-37%202002LeFranc-Abstract.pdf (accessed 27 September 2010).
10
The pharmacoepidemiology of psychiatric medications
Philip S. Wang,1 Alan M. Brookhart,2 Christine Ulbricht1 and Sebastian Schneeweiss2
1 Division of Services and Intervention Research, National Institute of Mental Health, Bethesda, MD, USA
2 Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
10.1 Introduction
The need for rigorous pharmacoepidemiologic studies of the use, risks and benefits of psychotropic medications has grown considerably in recent years. Medications representing the major classes of modern psychotherapeutic drugs first became available over half a century ago. Such agents (and classes) included imipramine (a tricyclic antidepressant), chlorpromazine (a neuroleptic antipsychotic), chlordiazepoxide (a benzodiazepine anxiolytic) and lithium (a mood stabiliser) [1]. Except for lithium, each of these agents led to the development of other medications that tended to have very similar mechanisms of action. Newer drugs within each class began to emerge by the 1980s, including fluoxetine (a selective serotonin reuptake inhibitor (SSRI) antidepressant), clozapine (an ‘atypical’ antipsychotic), buspirone (a non-sedating anxiolytic), as well as valproate and carbamazepine (antiepileptics with mood-stabilising properties). In spite of more than half a century of use of many psychotherapeutic drugs, empirical data concerning their utilisation, safety, effectiveness and cost-effectiveness in real-world patient populations are often lacking. Data from randomised controlled trials (RCTs) that were conducted to establish the basic efficacy and safety of medications as well as register them with regulatory bodies are
often the only information to guide treatment decisions. Unfortunately, such RCT data may not be generalisable to the way medications are used, and the benefits and risks that result from such use, under typical practice conditions. For example, earlier efficacy trials suggested that the newer second-generation antipsychotics emerging in the 1980s were potentially superior to older first-generation neuroleptics in treating negative symptoms of schizophrenia, in avoiding adverse effects such as extrapyramidal symptoms, and in terms of their economic value [2]. Such results led to heavy promotion of the second-generation antipsychotics and a rapid increase in their use. By the second half of the 1990s, second-generation antipsychotics comprised the majority of antipsychotic use in the United States, many years before results of large comparative effectiveness trials became available [3, 4]. Another example of rapid diffusion of practice in the absence of data on safety or benefits is the standing regimens of multiple concurrent antipsychotics that were being given to over one in six patients with schizophrenia spectrum disorders by the late 1990s [3]. In fact, results from a recent trial of clozapine plus risperidone suggest that this particular combination may not be superior to clozapine alone [5]. Such rapid adoption and diffusion of psychotropic regimens before data are available on how
medications are being used, their safety, and their benefits in typical practice can lead to substantial increases in health care expenditures. Spending on public programmes such as Medicare and Medicaid in the United States has tripled over the past 30 years, rising from 1.3% of gross domestic product in 1975 to 4% in 2007, and is projected to continue increasing to 12% by 2050 under current policies [6]. Spending on psychotropic medications can comprise a large proportion of these increases, with expenditures for just second-generation antipsychotics making up nearly a third of all drug costs for some Medicaid programmes [7]. Unfortunately, without clear data on the safety, benefits and cost-effectiveness of such regimens, purchasers of health care can be uncertain if such costs are justified. In a recent analysis of Medicaid prior authorisation policies used to control drug costs, there was no consistent relationship between the application of these policies and overall spending on atypical antipsychotic medications [7]. Such findings suggest Medicaid programmes do not have sufficient data on how these medications are being used in their patient populations and what outcomes their patients are experiencing to know if use should be increased or decreased. The lack of such data can also leave policy makers unable to respond to new challenges such as drug advisories from regulators. In 2005, the United States Food and Drug Administration (FDA) issued a warning of increased mortality among elderly patients with dementia taking second-generation antipsychotics. Over the next year no state Medicaid programme modified its prior authorisation policy to respond to this warning [7]. Likewise, an analysis of Medicaid prior authorisation policies for antidepressants prescribed to children found that states made few and variable changes after an FDA advisory in 2003 warning of increased risks of treatment-emergent suicidality [8].
Again, such results suggest that Medicaid benefit managers do not have sufficient information on the use, risks and benefits of psychotropic medications to inform their decisions.
Finally, psychopharmacoepidemiology is increasingly being called upon to help intervene and improve upon the poor quality of regimens and outcomes that patients currently experience on the basis of
their psychotropic medication use. As will be covered below, even while the use of and expenditures for psychotropic regimens have risen substantially, many people with mental disorders experience unmet needs for effective treatment as well as poor health and functioning. Although the United States spends the greatest percentage of gross domestic product on health care, recent data from the World Health Organization’s World Mental Health Survey indicate that the United States lags behind other developed nations in terms of the rate of receiving effective treatment [9]. Similarly, analyses of temporal trends in the United States have shown that even though use of mental health treatments increased 65% in the last decade, the prevalence of mental disorders and suicidality failed to decline [10–12]. For this reason, psychopharmacoepidemiologists have begun to focus on evaluating interventions, policies, delivery system redesigns and even means of financing mental health treatments, in order to improve the care and outcomes that patients experience. The remainder of this chapter provides a brief overview of potential data sources for investigations, examples of recent psychopharmacoepidemiologic studies, and some suggestions for future developments.
10.2 Overview of psychopharmacoepidemiology
10.2.1 A brief history
Both the parent field of pharmacoepidemiology and the narrower field of psychopharmacoepidemiology are relatively young disciplines [13]. Both arose out of needs revealed by the thalidomide catastrophe in the 1960s. Thalidomide, a psychotropic medication originally marketed for sedation and hypnosis, was found to be associated with deformed extremities among children who had been exposed to it in utero [14]. This public health crisis involving a marketed psychotropic drug led to important policy changes and the establishment of systems for reporting unexpected hazards from approved medications throughout the world [15]. The Kefauver–Harris Amendments passed in the United States established the FDA’s current regulatory requirements for drug approvals [16]. In a step that directly led to the
establishment of the field of pharmacoepidemiology, these amendments required that drugs that were approved and being marketed also receive review. From these origins in the 1960s, pharmacoepidemiologic studies have grown in frequency, scope and impact. Post-marketing pharmacoepidemiologic studies have sometimes been required by the FDA as a precondition at the time of approval [17]. Uncovering serious unanticipated adverse events from psychotropic drugs has continued to be a major focus of such studies in subsequent decades. During the 1980s, the antidepressant nomifensine was marketed in Europe and showed promise for treatment-resistant depression, but was found to cause fatal haemolytic anaemia [18]. The ultra-short-acting benzodiazepine triazolam was frequently used as a hypnotic when it was found to be associated with anterograde amnesia [19]. Results such as these have had a significant public health impact, in part by leading to the withdrawal of hazardous agents as well as new recommendations and regulations to ensure the safe and effective use of psychotropic drugs that remain. Important advances in the data and study designs available to psychopharmacoepidemiologists have allowed for an expansion of the discipline’s role. Large administrative databases with accurate drug exposure information on hundreds of thousands or even millions of patients have been developed. These large automated databases have in turn allowed psychopharmacoepidemiologists to efficiently study rare adverse effects, those occurring after long lag periods, and adverse effects with high background rates. Novel designs such as the case-crossover and case-time-control designs have allowed investigators to conduct new types of studies, such as those identifying transient effects from intermittent drug exposures [20–22].
New methodologies for conducting quasi-experimental and simulation studies have also become available and made it possible for psychopharmacoepidemiologists to evaluate the impact of psychotropic drug policies and even hypothetical psychotropic drug regimens [23–25]. Methodologic advances in the analysis of pharmacoepidemiologic data have allowed investigators to more effectively deal with or at least quantify threats to the validity of observational studies, such as the common problem of confounding by
indication (‘channelling bias’, as might occur if certain drug regimens are preferentially prescribed to patients with particular conditions) [26]. For example, investigators can minimise this type of bias by employing such analytic procedures as propensity score matching (which controls for differences in the characteristics of patients given different drug regimens), restriction, instrumental variable techniques and adjustment for unmeasured confounders using external sources of information [27–31]. For these reasons, psychopharmacoepidemiologic studies conducted after drug approval have become an indispensable complement to the clinical trials performed before drug approval for registration purposes. Pharmacoepidemiologic studies may be the only type of study that can detect certain outcomes, such as those that are rare or occur only after long delays. Because of their larger sample sizes, pharmacoepidemiologic studies can allow investigators to estimate drug effects with much greater precision or in particular subgroups. Psychopharmacoepidemiologists can examine how psychiatric drugs are used and their effects in populations that are often excluded from clinical trials, including patients with comorbid psychiatric and general medical disorders, the elderly, children or pregnant patients. Psychopharmacoepidemiology makes it possible to evaluate psychiatric drug regimens that are typically used in the real world but may not be studied in clinical trials for practical or ethical reasons, including long-term exposures, cotreatment with other medications, no treatment and even overdosages.
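To make the idea of propensity score matching mentioned above more concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions and not the procedure used in any of the cited studies: the cohort is entirely hypothetical, the only measured confounder is a binary severity indicator, and the propensity score is estimated simply as the proportion of treated patients in each severity stratum (real analyses typically fit a regression model on many covariates). The function names `stratum_propensity`, `greedy_match`, `risk` and `build_toy_cohort` are invented for this example.

```python
from collections import defaultdict


def stratum_propensity(covariates, treated):
    """Estimate P(treated | covariates) as the treated fraction in each stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [number treated, total]
    for x, t in zip(covariates, treated):
        counts[x][0] += int(t)
        counts[x][1] += 1
    return [counts[x][0] / counts[x][1] for x in covariates]


def greedy_match(scores, treated, caliper=0.05):
    """1:1 nearest-neighbour matching on the score, without replacement."""
    controls = [j for j, t in enumerate(treated) if not t]
    used, pairs = set(), []
    for i, t in enumerate(treated):
        if not t:
            continue
        candidates = [j for j in controls
                      if j not in used and abs(scores[i] - scores[j]) <= caliper]
        if candidates:
            best = min(candidates, key=lambda j: abs(scores[i] - scores[j]))
            used.add(best)
            pairs.append((i, best))
    return pairs


def risk(outcomes, indices):
    """Proportion of the listed patients who experienced the outcome."""
    return sum(outcomes[i] for i in indices) / len(indices)


def build_toy_cohort():
    """Hypothetical cohort: severity drives both prescribing and outcome;
    the drug has no effect on the outcome within a severity stratum."""
    severe, treated, outcome = [], [], []
    # (severe?, treated?, group size, one event in every k-th patient)
    groups = [(True, True, 80, 2), (True, False, 20, 2),
              (False, True, 20, 10), (False, False, 80, 10)]
    for is_severe, is_treated, size, every in groups:
        for k in range(size):
            severe.append(is_severe)
            treated.append(is_treated)
            outcome.append(k % every == 0)
    return severe, treated, outcome


severe, treated, outcome = build_toy_cohort()
treated_idx = [i for i, t in enumerate(treated) if t]
control_idx = [i for i, t in enumerate(treated) if not t]
crude_rd = risk(outcome, treated_idx) - risk(outcome, control_idx)

scores = stratum_propensity(severe, treated)
pairs = greedy_match(scores, treated)
matched_rd = (risk(outcome, [i for i, _ in pairs])
              - risk(outcome, [j for _, j in pairs]))

print(f"crude risk difference:   {crude_rd:.2f}")
print(f"matched risk difference: {matched_rd:.2f} ({len(pairs)} pairs)")
```

In this constructed cohort the crude comparison suggests an excess risk among treated patients only because sicker patients were channelled towards the drug; once treated patients are compared with untreated patients who had the same estimated probability of treatment, the apparent excess disappears. Matching can only balance the confounders that were measured, which is why the text also lists approaches for unmeasured confounding.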
10.3 Sources of data
The strengths as well as limitations of psychopharmacoepidemiologic studies for answering particular questions often depend critically upon the underlying data sources being employed. In general, answering psychopharmacoepidemiologic questions often requires information on a number of patients that is orders of magnitude greater than the hundreds usually studied in clinical trials prior to drug approval. This very large number of patients on whom information is needed has in turn made it essential to employ secondary data collected for other purposes whenever possible.
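As a rough illustration of why such large numbers are needed, the short Python sketch below computes how many exposed patients would have to be observed to have a 95% chance of seeing at least one case of an adverse event. The incidence values are illustrative assumptions, not figures from this chapter, and the calculation assumes that patients are independent and share the same risk; the function name `patients_needed` is invented for this example.

```python
import math


def patients_needed(incidence, probability=0.95):
    """Smallest n such that P(at least one event among n patients) >= probability.

    Assumes each patient independently experiences the event with the same
    risk `incidence` (a simplification chosen for illustration).
    """
    if not 0 < incidence < 1:
        raise ValueError("incidence must be strictly between 0 and 1")
    if not 0 < probability < 1:
        raise ValueError("probability must be strictly between 0 and 1")
    # P(no events in n patients) = (1 - incidence) ** n <= 1 - probability
    return math.ceil(math.log(1 - probability) / math.log(1 - incidence))


if __name__ == "__main__":
    for incidence in (1 / 100, 1 / 1_000, 1 / 10_000):
        print(f"incidence {incidence:.4%}: {patients_needed(incidence):,} patients")
```

For small incidences the result is close to three divided by the incidence, so an event affecting 1 in 10,000 patients calls for roughly 30,000 exposed patients merely to be observed once with reasonable confidence, far more than a pre-approval trial programme usually enrols.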
Beyond these general considerations, characteristics of individual data sources may make them more or less suitable for answering specific psychopharmacoepidemiologic questions [32]. Factors that may favour using one data source over another include: whether data on the drug exposure of interest are available, cover the time periods of interest and are common enough to adequately power analyses; the level of detail and accuracy of these exposure data; whether there are adequate numbers of, and accurate data on, outcomes of interest; whether there is the information needed to control for confounding and other biases; and the representativeness of the study population relative to other populations of interest. Below are brief descriptions of some data sources typically used in psychopharmacoepidemiologic studies, including some of their strengths and weaknesses for answering specific study questions.
10.3.1 Large governmental administrative databases
The establishment of governmental entitlement programmes, such as Medicaid in the mid-1960s, created an important source of data for psychopharmacoepidemiologists [33]. Databases from specific state Medicaid programmes (e.g. New Jersey and Tennessee) as well as collections of states (e.g. the Computerized On-line Medical Pharmaceutical Analysis and Surveillance System [COMPASS] consortium) have been employed successfully in pharmacoepidemiologic studies. Because Medicaid databases contain information on large numbers of psychiatric patients due to the poverty and disability associated with mental illness, they are often ideal data sources for studies of psychotropic medications. The indigent status of recipients also reduces out-of-pocket health care expenditures and contributes to the high level of completeness of Medicaid data for information on use of medications and other services [34]. Disadvantages of Medicaid data can include their lack of information on inpatient drug utilisation, limited generalisability for certain investigations and questions about the completeness and validity of recorded diagnoses [35]. Other large governmental administrative databases collected for insurance purposes also exist, including data collected by the
US Veterans Administration and provincial governments in Canada (e.g. the British Columbia Pharmacare programme). These data sources may offer specific advantages in studies due to their inclusion of subjects with a wider range of socioeconomic and other characteristics. However, as with all databases collected for administrative purposes, questions persist concerning the accuracy of their clinical information, especially on mental disorders.
10.3.2 Data from health maintenance organizations
The number of people in the United States receiving their pharmacy benefits through health maintenance organizations (HMOs) has increased substantially over the past decade. This has allowed HMO databases to become an important data source for psychopharmacoepidemiologic studies. Prescription claims databases from health plans such as Group Health Cooperative, the Kaiser Permanente Medical Care Program, United Health Care, Fallon Health Plan and Harvard Pilgrim Health Care have all been employed in psychopharmacoepidemiologic studies. Data from such plans have also been successfully used in concert, through consortia like the HMO Research Network and the Vaccine Safety Datalink programmes. An important advantage of data from HMOs is that the clinical information collected for billing purposes can often be supplemented with more complete or accurate information from review of patients’ primary medical records. HMO databases provide an ideal means to study psychotropic medication use in primary care, the setting in which the majority of mental health care is delivered in the United States. However, because HMO membership often requires employment, HMO databases may not include use of psychiatric medications by those with serious mental illness. In addition, patient turnover can be high, hampering longitudinal studies.
10.3.3 Large-scale surveys
Data for psychopharmacoepidemiologic studies can also be obtained from large surveys of medication and other health services use. Surveys administered in multiple years in the United States include the
annual National Ambulatory Medical Care Survey (NAMCS), which samples a nationally representative group of visits to physicians in office-based practices and records the prescriptions for medications given to patients. Advantages of data from such surveys include the ability to generate nationally representative estimates concerning psychiatric medication use. Disadvantages include the surveys’ high costs, the possibility that patients may not have filled or consumed prescribed medications, the lack of longitudinal follow-up and the lack of completeness and detail regarding clinical conditions. Psychiatric epidemiologic surveys such as the National Comorbidity Survey (NCS) in the early 1990s and its replication (NCS-R) a decade later contain detailed information on mental disorders and also assessed the use of psychotropic medications among respondents. Because similar survey methods and data collection instruments were employed in the NCS and NCS-R, analyses of temporal trends in use of medications as well as mental disorders are possible. The same methodology and instruments as the NCS/NCS-R have also been employed in population-based surveys being conducted throughout the world as part of the WHO World Mental Health Survey Initiative (www.hcp.med.harvard.edu/wmh/), making it now possible to conduct cross-national analyses as well. Potential limitations in all of these surveys include the frequent lack of detail concerning medication regimens and the fact that they are based on respondents’ recall and are therefore subject to information biases.
10.3.4 Practice-based networks
Practice-based networks are designed to provide information on the patterns and outcomes of health services use in typical practice settings. One such network, the General Practice Research Database (GPRD) in the United Kingdom, contains the computerised medical records from hundreds of general practices and has been used extensively in psychopharmacoepidemiologic studies. Practice-based networks in the United States include the family practice Ambulatory Sentinel Practice Network (ASPN), the Pediatric Research in Office Settings (PROS) and the American Psychiatric Association’s Practice Research Network (PRN) of psychiatrists. More recently, the National
Institute of Mental Health (NIMH) established practice-based research networks for the study of schizophrenia, depression and bipolar disorder. These NIMH networks have mainly been used to rapidly recruit large numbers of typical patients for large practical clinical trials such as the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) trial in schizophrenia, Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial in depression and Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) trial in bipolar disorder [4, 36–38]. However because they reflect real-world practice, the NIMH networks have also been used successfully in pharmacoepidemiologic studies [39, 40]. General strengths of data from practice-based networks include their more accurate clinical information and the potential to develop representative estimates; disadvantages include the high costs of maintaining networks and the uncertainty over whether patients actually consumed prescribed medications.
10.4 Examples of recent psychopharmacoepidemiologic studies From its origins in investigating adverse effects from psychotropic medications, the field of psychopharmacoepidemiology has broadened to include a wider variety of study types. The following section describes a few recent psychopharmacoepidemiologic studies. It is not intended to be a comprehensive review of the large body of work that has been conducted but instead a brief presentation of examples of the range of studies now possible.
10.4.1 Uncovering adverse effects and unanticipated benefits of psychiatric medications Identifying adverse effects from psychiatric medications that were not observed in RCTs conducted for registration purposes continues to be a major reason for conducting psychopharmacoepidemiologic studies. One question that has received considerable attention recently has been whether the antidepressant medications used to treat depression can paradoxically incite or exacerbate suicidal thoughts and behaviours. In March 2004, the FDA first issued a public health advisory that the use of 10 newer antidepressants may be associated with the development of suicidal thoughts and behaviours in children and adults [41]. In October 2004, the FDA extended this advisory to a ‘black box’ warning of these potential risks for suicidality that covered all antidepressant agents and all patient age groups [42]. At the heart of these warnings were data from meta-analyses that the FDA had conducted on available RCTs of antidepressant medications. The clinical implications of these findings and warnings were complicated by the fact that antidepressant medications remain a therapeutic mainstay for depression and are quite beneficial for some patients. For this reason, it has been imperative to identify particular vulnerable subgroups or regimens associated with risk so that they can be avoided. Some psychopharmacoepidemiologic analyses have examined whether particular classes (e.g. SSRIs, serotonin norepinephrine reuptake inhibitors (SNRIs), tricyclic amines) or individual agents are especially hazardous but have generally found only small or no differences in completed suicides and suicide attempts [43–45]. However, other psychopharmacoepidemiologic studies attempting to uncover whether particular age groups may be at greater or lower risk have reported that the hazards of treatment-emergent suicide attempts are elevated among children but not adults [46]. This apparent effect modification by age was also observed in re-analyses conducted by the FDA of available RCT data and led to the FDA’s 2007 modification of its earlier warnings to now cover only potential treatment-emergent suicidality among children and adolescents treated with antidepressants [47].
Another psychotropic drug safety issue receiving considerable attention recently concerns the frequent use of antipsychotic drugs to treat behavioural symptoms and agitation in dementia patients. In 2005, the FDA issued an advisory warning that atypical antipsychotics increased the risk of death compared to placebo in short-term RCTs conducted among elderly dementia patients [48]. ‘Black box’ warnings were added to the labels of all atypical antipsychotics describing these risks and advising that the atypical antipsychotics are not approved for use in elderly patients with dementia. There were insufficient trial data on the mortality associated with conventional antipsychotic use by elderly dementia patients, so the FDA did not include these agents in its advisories. However, recent pharmacoepidemiologic studies of elderly patients initiating antipsychotic treatments began to raise questions about this omission of conventional agents from the FDA’s warnings. One investigation found that patients prescribed conventional agents had a 37% greater, dose-dependent risk of short-term mortality than those prescribed atypical antipsychotics [49]. Subsequent psychopharmacoepidemiologic studies performed in other populations have also found elevated risks of short-term mortality among elderly patients initiating conventional as opposed to atypical antipsychotics and allowed the FDA in June 2008 to include the conventional agents in its earlier warnings of mortality risks among dementia patients [50–52]. Another role for psychopharmacoepidemiology, related to its focus on identifying adverse effects, is uncovering unanticipated benefits from psychiatric medications. While similar designs and analytic methods are employed in both types of investigations, results from studies of unanticipated benefits can be more difficult to interpret than those of adverse effects because of the often greater possibilities of residual confounding by indication. One topic that has received renewed attention has been whether the mood stabilising medication lithium may offer unique protection against suicidality among patients with bipolar disorders. A recent psychopharmacoepidemiologic study of this question found that lithium was superior to divalproex, gabapentin and carbamazepine in protecting patients against suicide attempts, although the latter two comparisons did not reach statistical significance [53].
However, as the authors point out, the observational nature of such studies and the possibility that choice of mood stabiliser may be related to clinical severity or suicidality risk complicate interpretation of such findings. Adjudicating between the possibilities of true benefits versus confounding by indication may ultimately require randomised data, such as that from the ongoing NIMH-supported Use of Moderate-Dose Lithium for the Treatment of Bipolar Disorder (LiTMUS) trial of moderate dose lithium added to usual care.
10.4.2 Descriptive analyses of the use and quality of psychiatric medication use Traditionally, pharmaceutical companies have examined how their products are used and by whom, largely for marketing purposes. More recently, psychopharmacoepidemiologists have been shedding light on the use and quality of psychotropic medication use for a variety of other stakeholders. Such studies have revealed the continuing rise in psychiatric medication use by children and adolescents, particularly the use of atypical antipsychotics [54]. Although these medications are given for a variety of clinical reasons, a rapid rise in the diagnosis of bipolar disorder among children and adolescents may be behind much of the increase in youth atypical antipsychotic use [55]. The accuracy of such diagnoses being made in community practice as well as the effectiveness and safety of these prescribed regimens are areas of active investigation. In addition to identifying how psychotropic medications are being used and by whom, psychopharmacoepidemiologists are engaged in studying the quality of mental health care. Recent nationally representative data have shown that the majority of people in the United States with mental disorders during the prior year receive either no treatment or treatment that fails to meet minimal standards for adequacy [56]. Although comparably large unmet needs for effective mental health treatment have been observed worldwide, it is notable that the United States fails to achieve better results despite spending by far the greatest proportion of its gross domestic product on health care [9]. More focused psychopharmacoepidemiologic studies have identified other problematic aspects of regimens that could render many of them ineffective and/or harmful to patients. For example, one analysis of antidepressant treatments given to elderly patients with depression found that nearly half were suboptimal because they involved either potentially hazardous (i.e. highly anticholinergic agents or excessively high dosages) or low-intensity regimens (i.e. low dosages, short durations or lack of follow-up) [57]. Descriptive psychopharmacoepidemiologic studies uncovering such patterns and potentially modifiable determinants of poor quality psychotropic medication use are often a necessary first step to design and target the types of interventions covered below.
10.4.3 Pharmacoeconomic analyses With the rising expenditures on pharmaceuticals, it has become increasingly important to shed light on not only the outcomes from their use but also their value. Weighing the relative costs and benefits from psychiatric medication use has required conducting formal economic studies. Although such economic evaluations have usually accompanied clinical trials, advances in the decision sciences and simulation modelling have allowed investigators to employ a wider range of data, including pharmacoepidemiologic data. This in turn has allowed the field to answer questions concerning the cost-effectiveness of a wider array of interventions and in a wider range of populations than just those involved in clinical trials. This expanding role and capacity to conduct pharmacoeconomic evaluations is illustrated by the recent body of work to enhance the treatment of depression, particularly the widespread poor quality pharmacotherapy in primary care settings. Economic analyses of trials of primary care quality improvement interventions have shown that they are a good value from a societal perspective, with cost-effectiveness ratios below the $50 000 per quality adjusted life year (QALY) benchmark used to judge whether interventions are worth investing in [58–60]. Unfortunately, widespread uptake of these interventions has not occurred, in part because the employers that purchase much of US health care do not know what their return-on-investment would be from enhanced depression care programs specifically for depressed workers. To shed light on this, the costs and benefits of enhanced depression care for workers, from both the societal and employer-purchaser perspectives, were estimated in a state-transition Markov model [61]. Results from this economic analysis indicated that improving the quality of depression treatment for workers was not only a good value from society’s point of view, but also potentially cost-saving to employer-purchasers due to the recovery of lost work productivity.
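To make the arithmetic behind such evaluations concrete, the following is a minimal sketch of a two-state Markov cohort model and the resulting incremental cost-effectiveness ratio (ICER), written in Python. It is not the model used in the analysis cited above [61]: the two states, the transition probabilities, the costs, the utility weights and all names in the code are hypothetical and chosen only for illustration. Only the $50 000 per QALY benchmark is taken from the text.

```python
from dataclasses import dataclass

# All numbers below are hypothetical and chosen only for illustration.
UTILITY_DEPRESSED = 0.60       # assumed quality-of-life weight while depressed
UTILITY_RECOVERED = 0.85       # assumed quality-of-life weight while recovered
BENCHMARK_PER_QALY = 50_000.0  # benchmark quoted in the text


@dataclass
class Strategy:
    name: str
    p_recover: float               # monthly probability: depressed -> recovered
    p_relapse: float               # monthly probability: recovered -> depressed
    monthly_cost_depressed: float  # cost of one month in the depressed state
    monthly_cost_recovered: float  # cost of one month in the recovered state


def run_cohort(strategy, n_months):
    """Return (expected cost, expected QALYs) per person over n_months.

    The whole cohort starts in the depressed state. Costs and QALYs are
    counted for the state occupied at the start of each monthly cycle,
    and no discounting is applied.
    """
    depressed, recovered = 1.0, 0.0
    total_cost = 0.0
    total_qalys = 0.0
    for _ in range(n_months):
        total_cost += (depressed * strategy.monthly_cost_depressed
                       + recovered * strategy.monthly_cost_recovered)
        total_qalys += (depressed * UTILITY_DEPRESSED
                        + recovered * UTILITY_RECOVERED) / 12.0
        # Move the cohort between states for the next cycle.
        depressed, recovered = (
            depressed * (1.0 - strategy.p_recover) + recovered * strategy.p_relapse,
            depressed * strategy.p_recover + recovered * (1.0 - strategy.p_relapse),
        )
    return total_cost, total_qalys


def icer(cost_new, qalys_new, cost_old, qalys_old):
    """Incremental cost per QALY gained by the new strategy over the old one."""
    delta_qalys = qalys_new - qalys_old
    if delta_qalys == 0:
        raise ValueError("strategies yield identical QALYs; ICER is undefined")
    return (cost_new - cost_old) / delta_qalys


# Hypothetical strategies: enhanced care costs more per depressed month
# but is assumed to raise the monthly probability of recovery.
usual_care = Strategy("usual care", 0.10, 0.05, 100.0, 40.0)
enhanced_care = Strategy("enhanced care", 0.15, 0.05, 130.0, 40.0)

if __name__ == "__main__":
    cost_u, qalys_u = run_cohort(usual_care, 24)
    cost_e, qalys_e = run_cohort(enhanced_care, 24)
    ratio = icer(cost_e, qalys_e, cost_u, qalys_u)
    print(f"ICER: {ratio:,.0f} per QALY; "
          f"below benchmark: {ratio < BENCHMARK_PER_QALY}")
```

A published model would typically include more states (for example, partial remission and death), discounting, and costs that differ by perspective, such as lost work productivity when the employer-purchaser's perspective is taken. A negative ratio with a positive QALY gain would indicate that the new strategy is both cheaper and more effective.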
10.4.4 Studying interventions to optimise psychiatric medication use In addition to conducting descriptive studies of psychotropic medication use and analytic studies of the outcomes from such use, psychopharmacoepidemiologists have increasingly become engaged in evaluating interventions to actually improve use and outcomes. For example, based upon the favourable results from the economic models described above, an actual randomised effectiveness trial of an enhanced depression treatment programme was conducted among workers to experimentally assess the intervention’s effects on clinical as well as work productivity outcomes. Results of the trial showed that by 12 months, the intervention had significantly improved both clinical and workplace outcomes compared to usual depression care. The financial value of the latter to employers in terms of recovered hiring-training and salary costs suggested that many employers would experience a positive return on investment from improved treatment of depressed workers. Other interventions could have potentially profound effects on the use, quality and outcomes from psychiatric medication use, but may not be amenable to study with experimental designs. Fortunately, methodological developments in quasi-experimental methods like the econometric technique known as instrumental variables analysis have helped investigators to produce unbiased estimates of the impacts of interventions on outcomes using epidemiological data [23, 31]. These methods in turn have opened the door for psychopharmacoepidemiologists to study the ‘natural’ experimentation that is occurring with mental health policies, delivery system redesigns and financing of mental health care. For example, one recent analysis evaluated whether increases in patient cost-sharing enacted by the Canadian province of British Columbia curtailed the use of antidepressant medications, which are already generally underused, by seniors with depression [62]. Introducing a $10–$25 copayment for prescriptions was associated with a significant drop in the frequency of antidepressant initiation. Subsequent replacement of these copayments with a more stringent income-based deductible policy then led to a significant reduction in the rate of increase in antidepressant initiation. While introducing these new forms of medication cost sharing did appear to have the potential to reduce use of antidepressant therapy by seniors, the clinical consequences of such reduced use still need to be clarified.
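As a rough illustration of the instrumental variables idea mentioned above, the sketch below computes the simplest IV estimator (the Wald estimator) for a binary instrument, alongside a naive comparison of treated and untreated patients. The instrument is a hypothetical indicator of whether a patient's prescriber generally prefers the drug, in the spirit of the between-physician prescribing differences examined in [31]. The records, field layout and function names are invented for illustration and do not reproduce any cited analysis. The result can be read as a treatment effect only under the usual IV assumptions (the instrument shifts treatment, is unrelated to unmeasured confounders, and affects the outcome only through treatment), none of which the code can check.

```python
def _mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def wald_iv_estimate(records):
    """Wald instrumental-variable estimate for a binary instrument.

    Each record is a tuple (z, x, y):
      z = 1 if the (hypothetical) prescriber prefers the drug, else 0
      x = 1 if the patient actually received the drug, else 0
      y = 1 if the outcome occurred, else 0
    The estimate is the difference in outcome risk between the two
    instrument groups divided by the difference in the probability of
    treatment between them.
    """
    x_z1 = [x for z, x, y in records if z == 1]
    x_z0 = [x for z, x, y in records if z == 0]
    y_z1 = [y for z, x, y in records if z == 1]
    y_z0 = [y for z, x, y in records if z == 0]
    if not x_z1 or not x_z0:
        raise ValueError("both instrument groups must contain patients")
    effect_on_treatment = _mean(x_z1) - _mean(x_z0)
    if effect_on_treatment == 0:
        raise ValueError("instrument does not shift treatment; estimate undefined")
    effect_on_outcome = _mean(y_z1) - _mean(y_z0)
    return effect_on_outcome / effect_on_treatment


def naive_risk_difference(records):
    """Outcome risk in treated minus untreated patients, ignoring the instrument."""
    y_treated = [y for z, x, y in records if x == 1]
    y_untreated = [y for z, x, y in records if x == 0]
    return _mean(y_treated) - _mean(y_untreated)


# Invented toy data (z, x, y), 10 patients per instrument group.
records = (
    # z = 1: 8 of 10 treated, 3 of 10 with the outcome
    [(1, 1, 1)] * 2 + [(1, 1, 0)] * 6 + [(1, 0, 1)] * 1 + [(1, 0, 0)] * 1
    # z = 0: 3 of 10 treated, 2 of 10 with the outcome
    + [(0, 1, 1)] * 1 + [(0, 1, 0)] * 2 + [(0, 0, 1)] * 1 + [(0, 0, 0)] * 6
)

if __name__ == "__main__":
    print("Wald IV estimate:", wald_iv_estimate(records))
    print("Naive risk difference:", naive_risk_difference(records))
```

With a single binary instrument and no covariates, this ratio coincides with what two-stage least squares would give. Real analyses usually adjust for measured covariates and report confidence intervals, which this sketch omits.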
10.5 Conclusions As this chapter has attempted to illustrate, the data sources, capacities and roles for psychopharmacoepidemiology have all expanded. Given the centrality of psychotropic medications in the current treatment of mental disorders, psychopharmacoepidemiologic studies remain essential to ensuring that such use is safe, effective and cost-beneficial. If history is any guide, investigators can anticipate a steady stream of new hypotheses concerning unanticipated effects of both established and new psychotropic medications. Likewise, tracking how psychiatric medications are used, by whom, and the outcomes from use is imperative to identify unmet needs for effective treatment and new intervention targets. And as constraints on health care resources and reliance on psychotropic medications increase, so too will the need for evaluations of the relative value obtained from psychopharmacologic regimens. Psychopharmacoepidemiologists should anticipate a growing need to evaluate a wide range of interventions that could have important impacts on patients’ use of psychotropic medications and their clinical outcomes. To meet all of these future demands, advances will also be needed in the data, methods and resources available for conducting psychopharmacoepidemiologic studies. Recognising the need for new data sources and analyses, the Food and Drug Administration Amendments Act (FDAAA) of 2007 calls for a marked expansion in the current system of monitoring the performance of approved medications [63]. Part of this new capacity for conducting active surveillance will include the Sentinel Initiative, a national electronic system of linked healthcare data sources for monitoring medical product safety. The FDAAA legislation sets as targets that data on 25 million patients and 100 million patients be accessible by 1 July 2010 and 2012, respectively.
Parallel methodologic and analytic developments will also be needed to ensure that queries of these expanded data sources can be implemented and yield valid answers. Another important potential future role that psychopharmacoepidemiology could play is in facilitating health care reform. Experts and opinion leaders have emphasised the need for rigorous data on the comparative effectiveness of medical treatments to both inform practice decisions and improve health care quality, outcomes and value [38]. Bodies such as the Congressional Budget Office and Institute of Medicine have joined in these calls for new research shedding light on the relative benefits, risks and costs of medical therapies [64, 65] and US Congressional legislation [66] has been introduced that would establish an independent, non-governmental Healthcare Comparative Effectiveness Research Institute. Generating some of these data will certainly involve conducting large comparative effectiveness trials. However, the costs, time required and other challenges of conducting large practical clinical trials make it clear that additional means will also be needed. As covered in this chapter, the range of data sources, study methods and analytic approaches now available to psychopharmacoepidemiologists leaves them well poised to answer questions regarding the comparative effectiveness, safety and value of psychotropic medications in the future.
Acknowledgements The views and opinions expressed are those of the authors and should not be construed to represent the views of any sponsoring organisation, agencies or the US Government. The views expressed do not necessarily represent the views of the National Institute of Mental Health, the National Institutes of Health, the Department of Health and Human Services or the United States government.
References [1] Schatzberg, A. and Nemeroff, C. (2006) Essentials of Clinical Psychopharmacology, 2nd edn, American Psychiatric Publishing, Inc. [2] Leucht, S., Pitschel-Walz, G., Abraham, D. et al. (1999) Efficacy and extrapyramidal side-effects of the new antipsychotics olanzapine, quetiapine, risperidone, and sertindole compared to conventional antipsychotics and placebo. A meta-analysis of randomized controlled trials. Schizophr. Res., 35, 51–68. [3] Wang, P.S., West, J.C., Tanielian, T. et al. (2000) Recent patterns and predictors of antipsychotic medication regimens used to treat schizophrenia and other psychotic disorders. Schizophr. Bull., 26, 451–457.
[4] Lieberman, J.A., Stroup, T.S., McEvoy, J.P. et al. (2005) Effectiveness of antipsychotic drugs in patients with chronic schizophrenia. N. Engl. J. Med., 353, 1209–1223. [5] Honer, W.G., Thornton, A.E., Chen, E.Y. et al. (2006) Clozapine alone versus clozapine and risperidone with refractory schizophrenia. N. Engl. J. Med., 354, 472–482. [6] U.S. Congressional Budget Office (2007) Research on the Comparative Effectiveness of Medical Treatments: Issues and Options for an Expanded Federal Role, Congressional Budget Office, Washington, DC. December. [7] Polinski, J.M., Wang, P.S. and Fischer, M.A. (2007) Medicaid’s prior authorization program and access to atypical antipsychotic medications. Health Aff., 26, 750–760. [8] Fischer, M.A., Servi, A.D., Polinski, J.M. et al. (2007) Restrictions on antidepressant medications for children: a review of medicaid policy. Psychiatr. Serv., 58, 135–138. [9] Wang, P.S., Aguilar-Gaxiola, S., Alonso, J., et al. The WHO World Mental Health Survey Consortium (2007) Worldwide use of mental health services for anxiety, mood, and substance disorders: results from 17 countries in the WHO World Mental Health (WMH) surveys. Lancet, 370, 841–850. [10] Wang, P.S., Demler, O., Olfson, M. et al. (2006) Changing profiles of service sectors used for mental health care in the United States. Am. J. Psychiatry, 163, 1187–1198. [11] Kessler, R.C., Demler, O., Frank, R.G. et al. (2005) Prevalence and treatment of mental disorders, 1990 to 2003. N. Engl. J. Med., 352, 2515–2523. [12] Kessler, R.C., Berglund, P., Borges, G. et al. (2005) Trends in suicide ideation, plans, gestures, and attempts in the United States, 1990–1992 to 2001– 2003. J. Am. Med. Assoc., 293, 2487–2495. [13] Strom, B.L. (1994) What is pharmacoepidemiology? in Pharmacoepidemiology, 2nd edn (ed B.L. Strom), John Wiley & Sons, Inc., New York, pp. 3–13. [14] McBride, W.G. (1961) Thalidomide and congenital abnormalities. Lancet, ii, 1358. 
[15] Wilholm, B.E., Onsson, S., Moore, N., et al. (1994) Spontaneous reporting system outside the United States, in Pharmacoepidemiology, 2nd edn (ed. B.L. Strom), John Wiley & Sons, Inc., New York, pp. 139–155. [16] Baum, C., Kweder, S.L. and Anello, C. (1994) The spontaneous reporting system in the United States, in Pharmacoepidemiology, 2nd edn (ed B.L. Strom), John Wiley & Sons, Inc., New York, pp. 125–137. [17] Mattison, N. and Richard, B.W. (1987) Postapproval research requested by the FDA at the time of NCE approval, 1970–1984. Drug Inf. J., 21, 309–329.
[18] Cole, J.O. (1988) Where are those new antidepressants we were promised? Arch. Gen. Psychiatry, 45, 193–197. [19] Morris, H.H. and Estes, M. (1987) Traveler’s amnesia: transient global amnesia secondary to triazolam. J. Am. Med. Assoc., 258, 945–946. [20] Maclure, M. (1991) The case-crossover design: a method for studying transient effects on the risk of acute events. Am. J. Epidemiol., 133, 144–153. [21] Suissa, S. (1995) The case-time-control design. Epidemiology, 6, 248–253. [22] Wang, P.S., Schneeweiss, S., Glynn, R.J. et al. (2004) Use of the case-crossover design to study prolonged drug exposures and insidious outcomes. Ann. Epidemiol., 14, 296–303. [23] Schneeweiss, S., Maclure, M., Walker, A.M. et al. (2001) On the evaluation of drug benefits policy changes with longitudinal claims data: the policy maker’s versus the clinician’s perspective. Health Policy, 55, 97–109. [24] Gold, M.R., Siegel, J.E. and Russell, L.B. (1996) Cost-Effectiveness in Health and Medicine, Oxford University Press, New York. [25] Eddy, D.M. (2007) Linking electronic medical records to large-scale simulation models: can we put rapid learning on turbo? Health Aff., 26, w125–w136. [26] Rosenbaum, P.R. and Rubin, D.B. (1983) Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. J. R. Stat. Soc. B, 45, 212–218. [27] Schneeweiss, S. (2007) Developments in postmarketing comparative effectiveness research. Clin. Pharmacol. Ther., 82, 143–156. [28] Seeger, J.D., Kurth, T. and Walker, A.M. (2007) Use of propensity score technique to account for exposure related covariates. Med. Care, 45, S143–S148. [29] Schneeweiss, S., Patrick, A.R., Stürmer, T. et al. (2007) Increasing levels of restriction in pharmacoepidemiologic database studies of elderly and comparison with randomized trial results. Med. Care, 45, S131–S142. [30] Stürmer, T., Glynn, R.J., Rothman, K.J. et al. (2007) Adjustments for unmeasured confounders in pharmacoepidemiologic database studies using external information. Med. Care, 45, S158–S165. [31] Brookhart, M.A., Rassen, J.A., Wang, P.S. et al. (2007) Evaluating the validity of an instrumental variable study of neuroleptics: can between-physician differences in prescribing patterns be used to estimate treatment effects? Med. Care, 45, S116–S122. [32] Strom, B.L. (1994b) How should one perform pharmacoepidemiology studies?: Choosing among the available alternatives, in Pharmacoepidemiology, 2nd edn (ed B.L. Strom), John Wiley & Sons, Inc., New York, pp. 337–350.
[33] Bright, R.A., Avorn, J. and Everitt, D.E. (1989) Medicaid data as a resource for epidemiologic studies: strengths and limitations. J. Clin. Epidemiol., 42, 937–945. [34] Lessler, J.T. and Harris, B.S.H. (1984) Medicaid Data as a Source for Postmarketing Surveillance Information, Final Report, Research Triangle Institute, Research Triangle Park, NC. [35] Roos, L.L., Sharp, S.M. and Cohen, M.M. (1991) Comparing clinical information with claims data: some similarities and differences. J. Clin. Epidemiol., 44, 881–888. [36] Trivedi, M.H., Fava, M., Wisniewski, S.R. et al. (2006) Medication augmentation after the failure of SSRIs for depression. N. Engl. J. Med., 354, 1243–1252. [37] Sachs, G.S., Nierenberg, A.A., Calabrese, J.R. et al. (2007) Effectiveness of adjunctive antidepressant treatment for bipolar depression. N. Engl. J. Med., 356, 1711–1722. [38] March, J.S., Silva, S.G., Compton, S. et al. (2005) The case for practical clinical trials in psychiatry. Am. J. Psychiatry, 162, 836–846. [39] McGrath, P.J., Kahn, A.Y., Trivedi, M.H. et al. (2008) Response to a selective serotonin reuptake inhibitor (citalopram) in major depressive disorder with melancholic features: a STAR*D report. J. Clin. Psychiatry, 69, 1847–1855. [40] Dennehy, E.B., Bauer, M.S., Perlis, R.H. et al. (2007) Concordance with treatment guidelines for bipolar disorder: data from the Systematic Treatment Enhancement Program for Bipolar Disorder. Psychopharmacol. Bull., 40, 72–84. [41] US Food and Drug Administration (2004) Center for Drug Evaluation and Research: FDA Public Health Advisory: Worsening Depression and Suicidality in Patients Being Treated with Antidepressant Medications, March 22, 2004. Available at: http://www.fda.gov/Drugs/DrugSafety/PostmarketDrugSafetyInformationforPatientsandProviders/DrugSafetyInformationforHeathcareProfessionals/PublicHealthAdvisories/ucm161696.htm (accessed 29 November 2010).
[42] US Food and Drug Administration (2004) Summary Minutes of the September 13–14, 2004 Center for Drug Evaluation and Research Psychopharmacologic Drugs Advisory Committee and the FDA Pediatric Advisory Committee. Available at: http://www.fda.gov/ohrms/dockets/ac/04/minutes/2004-4065M1_Final.htm (accessed 27 September 2010). [43] Jick, H., Kaye, J. and Jick, S. (2004) Antidepressants and the risk of suicidal behaviors. J. Am. Med. Assoc., 292, 338–343. [44] Martinez, C., Rietbrock, S., Wise, L. et al. (2005) Antidepressant treatment and the risk of fatal and non-fatal self-harm in the first episode of depression. Br. Med. J., 330, 389.
[45] Simon, G.E., Savarino, J., Operskalski, B. et al. (2006) Suicide risk during antidepressant treatment. Am. J. Psychiatry, 163, 41–47. [46] Olfson, M. and Marcus, S.C. (2008) A case-control study of antidepressants and attempted suicide during early phase treatment of major depressive episodes. J. Clin. Psychiatry, 69, 425–432. [47] US Food and Drug Administration (2006) Overview for December 13 Meeting of Psychopharmacologic Drugs Advisory Committee (PDAC). Available at: http://www.fda.gov/ohrms/dockets/ac/06/briefing/2006-4272b1-01-FDA.pdf (accessed 27 September 2010). [48] US Food and Drug Administration (2005) FDA Public Health Advisory: Deaths with Antipsychotics in Elderly Patients with Behavioral Disturbances. Available at: http://www.fda.gov/Drugs/DrugSafety/PostmarketDrugSafetyInformationforPatientsandProviders/DrugSafetyInformationforHeathcareProfessionals/PublicHealthAdvisories/ucm053171.htm (accessed 29 November 2010). [49] Wang, P.S., Schneeweiss, S., Avorn, J. et al. (2005) Risk of death in elderly users of conventional vs. atypical antipsychotic medications. N. Engl. J. Med., 353, 2335–2341. [50] Schneeweiss, S., Setoguchi, S., Brookhart, A. et al. (2007) Risk of death associated with the use of conventional versus atypical antipsychotic drugs among elderly patients. Can. Med. Assoc. J., 176, 627–632. [51] Gill, S.S., Bronskill, S.E., Normand, S.L. et al. (2007) Antipsychotic drug use and mortality in older adults with dementia. Ann. Intern. Med., 146, 775–786. [52] US Food and Drug Administration (2008) FDA alert. Cent. Drug Eval. Res., Available at: http://www.fda.gov/Safety/MedWatch/SafetyInformation/SafetyAlertsforHumanMedicalProducts/ucm110212.htm (accessed 29 November 2010). [53] Collins, J.C. and McFarland, B.H. (2008) Divalproex, lithium and suicide among Medicaid patients with bipolar disorders. J. Affect Disord., 107, 23–28.
[54] Olfson, M., Blanco, C., Liu, L. et al. (2006) National trends in outpatient treatment of children and adolescents with antipsychotic drugs. Arch. Gen. Psychiatry, 63, 679–685. [55] Moreno, C., Laje, G., Blanco, C. et al. (2007) National trends in the outpatient diagnosis and treatment of bipolar disorder in youth. Arch. Gen. Psychiatry, 64, 1032–1039. [56] Wang, P.S., Lane, M., Olfson, M. et al. (2005) Twelve-month use of mental health services in the United States. Arch. Gen. Psychiatry, 62, 629–640. [57] Wang, P.S., Schneeweiss, S., Brookhart, M.A. et al. (2005) Suboptimal antidepressant use in the elderly. J. Clin. Psychopharmacol., 25, 118–126. [58] Simon, G.E., Katon, W.J., VonKorff, M. et al. (2001a) Cost-effectiveness of a collaborative care program for primary care patients with persistent depression. Am. J. Psychiatry, 158, 1638–1644. [59] Simon, G.E., Manning, W.G., Katzelnick, D.J. et al. (2001) Cost-effectiveness of systematic depression treatment for high utilizers of general medical care. Arch. Gen. Psychiatry, 58, 181–187. [60] Schoenbaum, M., Unutzer, J., Sherbourne, C. et al. (2001) Cost-effectiveness of practice-initiated quality improvement for depression: results of a randomized controlled trial. J. Am. Med. Assoc., 286, 1325–1330. [61] Wang, P.S., Patrick, A.R., Avorn, J. et al. (2006) The costs and benefits of enhanced depression care to employers. Arch. Gen. Psychiatry, 63, 1345–1353. [62] Wang, P.S., Patrick, A.R., Dormuth, C. et al. (2008) The impact of cost-sharing on antidepressant use among older adults in British Columbia. Psychiatr. Serv., 59, 377–383. [63] US Food and Drug Administration Amendments Act of 2007. Public Law 110-85, September (2007). Title IX, Section 905. [64] Clancy, C.M. (2006) Getting to ‘smart’ health care. Health Aff., 25 (6), w589–w592. [65] Wilensky, G.R. (2006) Developing a center for comparative effectiveness information. Health Aff., 25, w572–w585.
[66] US Senate (2008) Comparative Effectiveness Research Act of 2008. United States Senate Bill S. 3408, July 31.
11 Peering into the future of psychiatric epidemiology
Michaeline Bresnahan,1,2 Ezra Susser,1,2 Dana March1,2 and Bruce Link1,2
1 Department of Epidemiology, Mailman School of Public Health, Columbia University, NY, USA
2 New York State Psychiatric Institute, NY, USA
11.1 Introduction Epidemiology has already contributed a great deal to psychiatric research. The discipline has been used extensively for studying the frequency of mental disorders in communities across the world, establishing the enormous burden of illness associated with these disorders and identifying their causes and consequences. In the past decade, the extension of epidemiologic risk factor methods to genetic studies has further opened a new and exciting realm for the use of epidemiology in psychiatric research [1]. Yet we have utilised only one small part of the potential contributions of epidemiology. In this chapter, we describe uses of epidemiology that are rapidly emerging but not fully established. Peering into the future, we anticipate that these applications will be increasingly adapted for psychiatric research in the coming decades. Among the salient developments are that epidemiologists increasingly focus on studying multiple levels of causation, the trajectory of health and illness over the life course and the interplay of genes and environment [2]. We discuss the first two, and interweave the third (interplay of genes and environment) into our examples insofar as possible.
11.2 Levels of causation: A historical overview The history of epidemiology has been marked by dramatic transitions in thinking, occurring in response to new public health challenges and/or scientific breakthroughs [3]. These can be used to demarcate historical eras, characterised by prevailing causal paradigms [4], which are tied inherently to culturally and historically bound styles of thinking, and shift through the exchange of ideas and debate [5, 6]. Below we trace the shifts in thinking about levels of causation over successive eras and paradigms in epidemiology. The crucible for the development of epidemiology was the Industrial Revolution in England in the early nineteenth century. In this early ‘sanitary era’, epidemiologists adopted a very broad view of causation, with the focus mainly on societal factors. A dominant view was that the social transformation associated with industrialisation had led to a concentration of human waste and other decaying organic matter in the new urban areas. At the societal level, the thinking of the Sanitarians was valid enough; the societal transformation that they witnessed was indeed the underlying force behind the change in the health of
England at that time. It motivated one of the most effective public health reforms ever enacted, the Public Health Act of 1848, an effort that culminated in the building of sewage and water systems throughout the newly industrialised towns and cities of England. Despite its evident successes, sanitary epidemiology had fatal flaws. While the Sanitarians focused on explaining patterns of disease in populations, they lacked good explanations as to how societal factors led to disease in individuals. The theory of ‘miasma’, a kind of polluting vapour that emerged from the accumulation of decaying waste, prevailed as the main explanation of disease causation. Often, the miasma theory was a plausible, albeit incorrect, explanation for patterns of disease. For example, William Farr used miasma theory to explain the relation between elevation and mortality from cholera in London during the 1848 epidemic (William Farr Cholera in England 1848–1849); while the miasma theory could explain the higher mortality rates at lower elevations, the real explanation lay in the contaminated water consumed by the population residing at lower elevations. As such, the miasma theory was contested. For instance, the epidemiologist John Snow inferred the presence of microorganisms causing cholera as early as the 1840s [7]. The epidemiology of the sanitary era was brought to a close by the development of a new science, microbiology, which provided an explanation for disease at the individual level, and quickly supplanted the miasma theory of disease causation [8], though not without debate (e.g. [9, 10]). Towards the end of the nineteenth century, Robert Koch and others made a series of stunning discoveries that demonstrated beyond any doubt that microbes played a crucial role in some of the most important diseases of the time. This ushered in the ‘infectious disease era’, in which epidemiology was actually redefined in some instances as the science of infectious diseases. 
In this period, epidemiologists primarily sought to identify microbial agents and their mode of transmission [11]. However, infectious disease transmission is an inherently social process. As such, the societal level of thinking remained important, but only within the narrow framework of the ways in which social factors influenced epidemic transmission [12]. With notable exceptions (see later), few continued to focus on the implications of societal
transformation for public health, and their ways of thinking were relegated to the periphery of mainstream epidemiology [6]. The next transition, to the ‘chronic disease era’, was largely motivated by the changing health profile of developed countries in the mid-twentieth century. Infectious diseases were declining rapidly, whereas apparently non-infectious ‘chronic diseases’ such as cardiovascular disease and cancer were increasing at an alarming rate. Infectious disease methods could not address the challenges presented by these frightening new causes of morbidity and mortality. Within a short period after World War II the discipline was again redefined and its methodology transformed. The signal event was the demonstration that smoking was a ‘cause’ – a major ‘risk factor’ – for lung cancer, using cohort and case–control designs developed for the purpose [13]. For cardiovascular disease, the notion of the risk factor was arguably even more important; many factors, such as serum cholesterol, hypertension, diet and exercise seemed to bear on the risk of disease, even though demonstrating causality per se presented an ongoing challenge [14]. Subsequently the notion of the risk factor became common parlance among epidemiologists, statisticians, clinicians and indeed the population at large. Cohort and case–control study designs became standard methods for investigating risk factors, especially individual exposures or lifestyles, in chronic diseases that likely had many causes. What is most important for the present argument is that these designs brought the discipline to focus still further on the individual as opposed to the societal level influences on disease. As we shall explain below, the risk factor designs are individual level studies par excellence, and their very strength lies in isolating the individual level risk factor from all others. Risk factor methods still predominate in teaching and research (e.g. [15]), but the field is rapidly changing once more. 
In the introduction, we noted several trends in epidemiology. Here we will take up two in greater depth: multilevel causation, and causation over the life course. Investigators have taken up the challenge of thinking about multiple levels of causation [16–19]. Epidemiologists are not dispensing with risk factor investigations (nor should they), but rather, are subsuming them under a broader framework, and it is a framework that we believe to
be especially well suited to psychiatric research [1]. Thus, we now think systematically not only about risk factors, but also about the impact of family, community, society, and of gene, cell and tissue. What is motivating the latest transition? Although the question is far beyond the scope of this chapter, it is worth noting that the acquired immune deficiency syndrome (AIDS) pandemic had an enormous impact [6, 20]; its multiple and often interrelated causes necessitate epidemiologic methods for dealing with sociopolitical, behavioural and molecular complexities. Epidemiology and public health were faced with the greatest challenge in their short history, a virtual holocaust, as human immunodeficiency virus swept through Africa and other developing regions. It is a challenge that simply could not be met using risk factor methods alone. The war on AIDS requires research and intervention on every level: political leadership, deep social change, individual behaviour change and molecular genetics.
11.3 Levels of causation We now turn to introducing the concept of levels of causation, which is coming to the fore in epidemiology. Before doing so, we should note that exceptional, forward-thinking individuals have systematically considered causation at multiple levels throughout the history of epidemiology [2]. Historical eras are demarcated by prevailing paradigms but these are not the exclusive method in any given era. Nonetheless, it is only recently that this kind of thinking has received sustained attention from the field and has been used as a foundation for training in epidemiology. The idea of expanding the scope of psychiatric epidemiology ‘up’ to social contexts and ‘down’ to biological mechanisms is immediately appealing for several reasons. It allows the possibility of integrating disparate orientations into an organic whole. A combined undertaking takes greater advantage of advances in understanding across levels of research and disciplines. In addition it releases us from prejudice that the ‘real causes’ reside at any one level, to conceive disease causation as occurring at many levels. Once we are able to specify the potential relevance of any particular level
of analysis the idea of excluding that level raises the spectre of incompleteness, missed opportunity, model misspecification and confounding. Conceptualising disease causation in this way does not mean that every study or even any study has to include many levels. Integrated understanding may be achieved through a series of studies with a much more limited purview. It does mean that every study has to begin by asking the question: what level/s of organisation are most relevant to the question at hand? Then the research is designed accordingly.
11.3.1 Individual level Why are some people within a population more likely to develop disease than others? This is the question posed by risk factor investigations, which are conducted at the individual level. An individual level observational study, whether it is cohort or case–control, is designed to see whether variation in the disorder among individuals within the population reflects variation in their exposure histories. It does not require venturing down to the level of the cell, where we might ask which cells are affected by the exposure and in what ways, nor up to the level of the society, where we might ask which societies are organised in such a way that their members are exposed. Imagine that you posit a relation between exposure to sunlight and the risk of seasonal affective disorder. This model is appropriately conceptualised and investigated at the individual level. Individuals with more exposure to sunlight are hypothesised to be less vulnerable to this disorder, within the population of interest. To examine this hypothesis, it is sufficient to collect data on sunlight exposure and seasonal affective disorder for individuals within the population. The effects of sunlight exposure on cells, and the effects of social organisation on sunlight exposure, are related topics but are not directly addressed by either the hypothesis or the study design. Thus, the risk factor investigation is at once important, useful and incomplete. We will mention three important limitations on what can be revealed about determinants of disease using individual level designs. This discussion will provide a link to the next section on the contextual level, where we will see that these limitations can be partially overcome by research on other levels of causation.
First, not all risk factors of interest will vary between individuals within the study population. A factor that is universal in the study population, even if it participates in causing disease, cannot be readily examined in this framework. This can arise for exposures that are effectively mandated by government policy (e.g. vaccines) or by cultural norms (e.g. circumcision) in a given society. A small number of people may not follow the mandate; however, they tend to differ from the rest of the population in important ways, making them unsuitable as an unexposed comparison group. A second limitation is that individual level risk factor designs are not well suited to discover the causes of an increase (or a decrease) in disease incidence in a population. A noticeable increase in the incidence of a disease is often what motivates an investigation. Generally the most parsimonious and useful explanation for a change in incidence is found at the societal level, albeit a societal change that brought about an increase (or decrease) in the population prevalence of an individual risk factor. An individual level study is ill equipped to identify the pivotal event, societal change. Consider the example of autism. Studies suggest that the prevalence of autism has increased markedly in developed societies over the last two decades. Hypothesised explanations include environmental exposures, potential toxins we encounter in our environment that are a byproduct of modern living (e.g. air and water pollution, plastics, food additives, products made from synthetic materials). This latent variable – an increasing multiplex exposure consisting of >80 000 synthetic chemicals in the environment and counting – is ubiquitous, and could contribute to the time trend. Beyond the measurement and identifiability challenge, isolating a subset of exposures causing an individual case may not explain the trend if each subset of component causes is individually rare.
A third limitation is that the effect of an individual level determinant on the risk of disease is context dependent – even at the purely individual level of analysis. Under the paradigm of risk factor epidemiology, disease causation requires the participation of multiple risk factors, and individual cases may result from different constellations of risk factors, so that many different constellations may be ‘sufficient’ to cause disease. For the risk factors comprising any one
sufficient constellation, the impact of each risk factor upon the disease risk will vary, depending upon the relative frequency of the other risk factors within the constellation, in the population being investigated. Generally, in studies within a given population, the common risk factors of a sufficient constellation tend to appear less ‘influential’ in disease causation than the rare risk factors of the same constellation [1]. This occurs in spite of their joint contribution to disease occurrence in a given case. Suppose that congenital neural tube defects (spina bifida, anencephaly) are caused by a combination of two risk factors: a genetic defect that increases the need for folate, and low folate in the maternal diet. (This causal model is realistic albeit simplified for exposition.) When the genetic defect is common and a low folate maternal diet is uncommon, in a crude analysis, the effect of the genetic defect on the risk of disease will appear to be much less than that of low folate diet. On the other hand, when the genetic defect is uncommon and a low folate maternal diet is common, the effect of the genetic defect will appear to be greater than that of low folate diet. The more common risk factor thus carries a lower relative risk, and will be more difficult to detect. Yet, it may be precisely the common risk factors that carry the most implications for disease prevention. Thus it is in part for this reason that some of the causes that would be important for prevention are common, and will be of small effect when evaluated in an individual level analysis. Effects of common risk factors tend to be among the most controversial of epidemiological findings. A corollary result is that the magnitude of effect attached to a given risk factor can be expected to vary across populations due to variation in the prevalence of causal cofactors. 
Hence, we should not expect identical findings when we conduct the same study in two populations with somewhat different constellations of risk factors. The findings may be similar in populations with similar risk factors, supporting the pursuit of ‘replication’ of findings, but there should be some variation.
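The dependence of a risk factor's apparent effect on the prevalence of its causal cofactor can be checked with a little arithmetic. The following Python sketch is only an illustration of the neural tube defect example under stated assumptions that are not part of the chapter: the two factors are independent, disease always occurs when both are present, and all other causes contribute a constant background risk. The function name, the prevalences and the background risk are hypothetical.

```python
def crude_relative_risks(p_gene, p_low_folate, background_risk):
    """Crude relative risk of each component cause in a two-factor model.

    Illustrative assumptions (not from the chapter): the factors are
    independent, disease always occurs when both are present, and every
    other cause is summarised as a constant background risk.
    """
    if not 0 < background_risk < 1:
        raise ValueError("background_risk must lie strictly between 0 and 1")
    # Among gene carriers, the added risk is the chance of also having low folate.
    risk_with_gene = background_risk + (1 - background_risk) * p_low_folate
    # Among those with low folate, the added risk is the chance of also carrying the gene.
    risk_with_low_folate = background_risk + (1 - background_risk) * p_gene
    # People lacking the factor in question face the background risk only.
    rr_gene = risk_with_gene / background_risk
    rr_low_folate = risk_with_low_folate / background_risk
    return rr_gene, rr_low_folate


# Scenario A: genetic defect common, low folate diet uncommon (hypothetical values).
print(crude_relative_risks(p_gene=0.50, p_low_folate=0.01, background_risk=0.001))
# Scenario B: genetic defect uncommon, low folate diet common (hypothetical values).
print(crude_relative_risks(p_gene=0.01, p_low_folate=0.50, background_risk=0.001))
```

In this simplified model the relative risk attached to each factor is driven by the prevalence of the other factor, so the more common factor of the pair shows the smaller relative risk, and swapping the prevalences swaps the apparent ranking.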
11.3.2 Contextual level Why do some populations have higher rates of disease than others? To identify determinants that
explain differences in rates between populations, or in the same population over time, we often turn to studies at the level of the social context. A social context may be any combination of individuals who are connected in some meaningful way, such as a family, a community or a society. Thus, we move ‘up’ from the individual level to higher levels, in order to gain access to causal determinants that may not be identifiable in individual level studies. As implied earlier, these include determinants that are invariant within a population and therefore obscured or even invisible at the individual level, as well as those determinants that are not defined in individuals but in the relationships and contexts that surround them. The core idea in reasoning about contexts is that properties emerge as we move up from the individual to these higher levels of organisation. For example, most of us are accustomed to thinking about the emergent properties of neighbourhoods, and intuitively understand their meaning. In New York City, Harlem, Greenwich Village and Chinatown are examples of neighbourhoods with particular attributes, although the individuals living in each of them are by no means homogeneous. Living in one or another of these neighbourhoods will have a large influence on many dimensions of life, for example the cost and quality of housing, the type of recreation available (e.g. parks, gymnasiums, cinemas, museums), the presence of noxious facilities (e.g. sewage treatment plants, power plants), the quality of schooling for children and the amount and type of police surveillance. Residents will also be affected by the perceptions of other people about these neighbourhoods. In these and other ways, the emergent properties of the three neighbourhoods will shape the experiences of people who live there. The same can be said of emergent properties of nations, regions of the country, cities, schools, work places, families and dyadic relationships. 
The critical issue for epidemiologists is to identify which are most central to health and then to measure those properties so as to test causal explanations that involve them. The societal determinants of health may appear remote from the occurrence of a specific disease in an individual, and yet be of great consequence as a causal determinant. Consider the hypothetical example of sunlight and seasonal affective disorder, which we previously used to illustrate the individual
level of investigation. We could now elaborate our causal model by positing a relation between rates of seasonal affective disorder among women and societal determinants of women’s work and leisure activities. Let us propose that societies which severely restrict women’s access to outdoor occupations and recreations will have higher rates of seasonal affective disorder among women. This model is appropriately conceptualised at the societal level because the crucial determinant of health is societal constraints on women, and the outcome is the rate of disorder in the population. To examine the hypothesis, we might choose to compare several populations with different societal constraints on women, but similar geographic and climatic conditions, with respect to both pattern of sunlight exposure and rate of seasonal affective disorder among women. Note that while the risk factor investigation would provide the more ‘proximal’ causal mechanism, the societal level investigation might be more likely to indicate an effective intervention. Unless the societal barriers can be reduced, individual women may find it difficult to change their work and leisure patterns. When we turn attention from the individual to the contextual level we encounter great opportunities and enormous challenges. The opportunities arise from the fact that the full scope of contextual level influence (family, neighbourhood, school, work group, country) has barely been explored. Our fundamental understanding of context is also constantly evolving. Social contexts entirely supported by virtual media mean that physical contact may be optional or entirely nonexistent, and geographic proximity is not always relevant. Both relational and physical distances and boundaries are important in defining the ‘level’ affecting health [21].
While there are exemplary studies that indicate the importance of contexts for health outcomes [22–26], we are still in the early stages of development in putting together social, physical, cultural and other contexts with health outcomes. These opportunities exist in part because the conceptual and measurement work needed to capture variation in contexts like these is still early in its development (e.g. [27, 28]). Current practice in collecting data for epidemiologic research has, perhaps, slowed our progress. The standard approach is to sample and collect data on individuals; data are
provided either through self reports or lab based measures. As useful as this approach is, it does not give us direct access to information about contexts. Often, we only learn about context indirectly through what people tell us about contexts, or what their biological measurements may reveal about contexts; however, a few fine examples of direct measurements of social context exist (e.g. [24]). Our attention is drawn towards individual level processes, and away from the potential importance of processes at the contextual level. Consequently concepts and measurements at the contextual level do not come into the purview of the scientist on a regular basis when this approach is used. The best way to think about contextual level causation is not yet entirely clear, and competing proposals have generated some excitement. Link and Phelan [29] propose thinking of contexts as units that vary in the power they possess to secure health enhancing living conditions – the capacity to secure good things for health and avoid bad things. The example of neighbourhood suggests some possibilities along these lines in that well-heeled neighbourhoods can resist noise, pollution and crime in ways that neighbourhoods that possess less social and political power cannot. Similarly, in a unionised workplace the union can negotiate for safe work conditions and better health care opportunities. Social capital (e.g. [30]), social stratification (e.g. [31]), social cohesion (e.g. [32]), social fragmentation (e.g. [33]), ethnic density (e.g. [34–37]) and inequality (e.g. [38]) may be the most commonly investigated contextual features in relation to health – the literature for the first two being the most extensive. We are reminded, however, that careful measurement of context is as crucial as careful measurement of disease outcomes [39].
11.3.3 Combining individual and contextual levels Thinking about both individual and contextual levels at the same time frees us to ask different questions than we would thinking at either level alone. Previously, we were limited to two essential questions: Why do some people in a population develop disease and not others? Why are the rates of disease higher/lower in some populations than others? We can now ask about the interplay between determinants at different levels.
Studies of neighbourhood social isolation and schizophrenia provide an example from contemporary research. Following on early findings from the landmark ecologic studies of Faris and Dunham [40], Hare [41] demonstrated that in the city of Bristol, the incidence rate of schizophrenia was associated with neighbourhood social isolation, measured by the proportion of people living alone. He proposed two explanations (not mutually exclusive): individuals might migrate to these neighbourhoods, or the social context of these neighbourhoods might foster the development of schizophrenia. van Os et al. [42] took up this line of enquiry, in a study in Holland, using a multilevel analysis that well reflects the emerging era of epidemiology. They too found an effect of neighbourhood social isolation, measured by the proportion single and the proportion divorced, on the risk of schizophrenia. They also found an effect of marital status at the individual level. The neighbourhood effects were not explained, however, by the individual effects of marital status, indicating that the measure of neighbourhood social isolation tapped some emergent property of the neighbourhood. Furthermore, in their study neighbourhood interacted with individual risk factors in the following manner: being single and living in a neighbourhood with a lower proportion of single persons more than doubled the risk of schizophrenia over being single and living in a neighbourhood with a higher proportion of single persons. A plausible interpretation is that one is more at risk – perhaps one feels more alone – as a single person when living in a neighbourhood comprised of married people.
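A cross-level interaction of this kind can be expressed as a comparison of rate ratios across neighbourhood types. The Python sketch below is a minimal illustration only: the incidence rates, the category labels and the function name are all hypothetical, invented to mirror the direction of the pattern described, and they are not the estimates reported in [42]. A real analysis would fit a multilevel regression model rather than compare four crude rates.

```python
# Hypothetical incidence rates per 100,000 person-years. These numbers are
# invented for illustration and are NOT the estimates reported in [42].
HYPOTHETICAL_RATES = {
    ("single", "few_singles"): 40.0,
    ("single", "many_singles"): 18.0,
    ("married", "few_singles"): 8.0,
    ("married", "many_singles"): 9.0,
}


def cross_level_contrasts(rates):
    """Return rate ratios for individual, contextual and joint effects."""
    # Individual level effect of being single, within each neighbourhood type.
    rr_single_in_few = rates[("single", "few_singles")] / rates[("married", "few_singles")]
    rr_single_in_many = rates[("single", "many_singles")] / rates[("married", "many_singles")]
    # Contextual effect among single people: neighbourhoods with few versus many singles.
    rr_context_for_single = rates[("single", "few_singles")] / rates[("single", "many_singles")]
    # Ratio of rate ratios: differs from 1 when the individual effect varies by context.
    interaction_ratio = rr_single_in_few / rr_single_in_many
    return {
        "rr_single_in_few": rr_single_in_few,
        "rr_single_in_many": rr_single_in_many,
        "rr_context_for_single": rr_context_for_single,
        "interaction_ratio": interaction_ratio,
    }


for name, value in cross_level_contrasts(HYPOTHETICAL_RATES).items():
    print(f"{name}: {value:.2f}")
```

With these invented rates, the individual level rate ratio for being single is larger in neighbourhoods with few single residents, which is the signature of an individual characteristic whose effect depends on the context in which the person lives.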
11.4 Causation over (life) time Increasingly epidemiologists are adopting a life course perspective on disease causation. The significance of gestational and early life experience with respect to adult health outcomes has come into sharper focus over the last few decades [43–46]. There has been a fundamental shift in thinking about the evolution of disease over the life course [47, 48]. Simultaneously, and perhaps encouraged by this fresh perspective, there has been an exponential development of new and existing resources explicitly designed for life course studies. The linkage of birth
and disease/death registry data, and the expansion of birth cohort research have provided the basis for these developments [49]. The impact of a life course perspective in psychiatry is reflected in the way we conceive the development of psychopathology, and conceive of the pathologies themselves. We are learning that adult mental disorders typically do not arise de novo in adulthood. Most often they are preceded by symptoms or frank disorders in childhood and/or adolescence. Oppositional defiant disorder, for example, has been found to predate multiple adult disorders [50, 51], possibly reflecting a liability to adult mental illness per se rather than a one-to-one liability for a specific disorder in adulthood. Just as a life course perspective has changed how we think about mental disorders as outcomes, it reframes our investigations of causes by lengthening the causal time frame to include possible causes all along the life course. Models of causation over long periods include accumulating risk, chains of risk and critical and/or sensitive periods [52]. The least intuitive causal sequence is based on latent effects of gestational exposures. Gestation is a privileged period of rapid growth; within the gestational period, timing of exposure measured in weeks or days may represent the difference between a life-changing effect and no effect whatsoever. And the consequence may manifest decades later. The classic example is diethylstilbestrol (DES) [53]: maternal exposure to DES during pregnancy resulted in diseases in offspring in adulthood. Other critical and susceptible phases most certainly exist. Adolescence may be another such period [54–56]. Research on the latent effects of gestational exposures on psychiatric disorders in adulthood exemplifies the informative potential of the lengthened time frame.
A relation of famine during gestation to risk of schizophrenia during adulthood emerged from studies of the Dutch Hunger Winter, and the Great Famine in China following the Great Leap Forward of 1958. In each of three studies, exposure to famine during early gestation was associated with a twofold increased risk of schizophrenia in adulthood [57]. One hypothesis is that micronutrient deficiency during gestation was responsible for the increased risk of schizophrenia in offspring. Some attention has focused specifically on folate deficiency because of
its crucial role in DNA repair and methylation. De novo mutations associated with folate deficiency are one possible explanation for the increased rate of schizophrenia. Changes in DNA methylation due to folate deficiency are another possible explanation. These explanations are not mutually exclusive; however, we will focus on DNA methylation in order to introduce epigenetics, which we envision as part of the future of epidemiology. Epigenetic effects change the potential for gene expression without changing the DNA coding sequence. They include DNA methylation, histone acetylation and other processes which alter the accessibility of DNA for transcription [58, 59]. Epigenetic effects are mitotically heritable. Animal studies have shown that in utero exposures to micronutrients can have epigenetic effects that alter the phenotype of the offspring. Among the best-known examples is an experiment in which micronutrients (including folate) in the one carbon pathway were administered prenatally to Agouti mice dams, resulting in altered DNA methylation as well as phenotype among offspring (e.g. [60]). Notably, epigenetic effects are probabilistic (e.g. shift the per cent of DNA methylated) and are thought to be potentially reversible in at least some instances. In one of the first human studies of epigenetic effects of prenatal nutrition, it was found that after early prenatal exposure to the Dutch Hunger Winter, there was an alteration of imprinting on insulin-like growth factor 2, an epigenetic effect, still evident 60 years after birth [61]. At the same time, studies of post-mortem brain tissue have implicated epigenetic effects as potentially related to schizophrenia [62, 63]. This is still a young field and these findings on prenatal famine and on schizophrenia are still very preliminary. We use them here to indicate that the future of psychiatric epidemiology will almost certainly include studies of epigenetic effects.
By providing a concrete mechanism by which early environmental exposures can modify gene expression and physiology, the study of epigenetic effects has the potential to explain latent effects of in utero exposures over the life course, and to bring together social and biological explanations for psychiatric outcomes (e.g. [64]). A further development in our thinking about the causal time frame emerging from the life course
framework is the widening perspective on transgenerational effects beyond transmission through genes (DNA sequence) and culture. Mechanisms for intergenerational epigenetic effects are now being articulated [59, 65, 66]. An example of behavioural transmission of an epigenetic effect can be found in studies in mice where it has been shown that the transmission of nurturing behaviour is achieved during the early postnatal period; this maternal care influences gene expression and development of the stress response [67]. A big challenge in epigenetics will be how to establish these mechanisms in human studies [68]. It is certain that psychiatric epidemiology will participate in these developments. When we test hypotheses in a life course framework, we are confronted with a series of challenges – some particular to life course epidemiology, and others simply exaggerated by the length and breadth of lifetime studies (e.g. confounding, multiple measurements, missing data). Methods for reducing the complexity in informative models and analyses are being developed as the opportunities to examine these hypotheses increase [69].
11.5 Examples Rethinking existing epidemiologic research and outstanding questions in the field of psychiatry within multilevel and longitudinal frameworks illustrates the relevance of these approaches when they are explicitly applied. Thinking about these issues in terms of levels of causation, and over time, often adds intellectual interest and rigour, and opens new perspectives on intervention. The examples which follow, some drawn from our own research, show how multilevel reasoning and a life course frame evolved from a research question or finding, and contributed to a new line of investigation.
11.5.1 Parental age The archetypal parental-age related disorder is Down’s syndrome. Increased risk of Down’s syndrome in offspring among older mothers was noted in 1933 [70]. Whether paternal age is related to increased risk is debated. Recent analyses and reanalyses hoping to resolve the issue have supported a small to negligible effect of paternal age [71, 72].
The father’s age has, however, been related to a broad range of other outcomes including fetal death [73], congenital syndromes (Apert’s) [74] and neurocognitive deficits in childhood [75], and most relevant here autism [76] and schizophrenia [77, 78]. While the mechanism has not been established, the layering of risk across levels is elegantly illustrated in this example, and the direction of future research following up these findings brings us to the edge of current research technologies. One hypothesis to explain the excess risk of schizophrenia associated with older fathers is mutagenesis. Mutations in the paternal germline increase with age [79, 80]. In genome-wide scans, copy number variation in networks controlling neurodevelopment have been associated with schizophrenia [81]. Investigators have not yet established that these or similar mutations are more common in individuals with schizophrenia whose fathers were relatively older when they were born. Even if the causal process involves genetic mutations, the determinants of age at parenting arise in part from societal, family and partner relationships. Contextual influences on the distribution of age at child bearing are channelled through social norms for transitions into adult roles, educational and economic participation of women and economic conditions. When interventions are warranted, they may consist of policies that reinforce the value and feasibility of ‘on-time’ parenthood for both men and women (family policies, work policies, health care policies).
11.5.2 Neighbourhood and ethnic density
A rapidly growing body of work has demonstrated markedly elevated rates of schizophrenia in migrant and ethnic minority populations [82–84]. Such findings do not appear attributable to selective migration, nor to elevated background rates in countries of origin. In particular, observed elevations in rates of schizophrenia in ethnic minority populations have catalysed a contemporary emphasis on the social patterning of schizophrenia, and of the risk and protective factors that influence it. One especially compelling example is neighbourhood ethnic density. In their classic analysis of schizophrenia in Chicago neighbourhoods, Faris and
Dunham found that rates of schizophrenia among blacks decreased as the percentage of black residents in a neighbourhood increased [40]. Recent studies in London [85] and The Hague [35], which measured ethnic density around the time of illness onset, have reported similar results. In both studies, an interaction between individuals and neighbourhoods was found, and the protective effect of (own) ethnic density persisted even in the most deprived neighbourhoods.
The mechanism(s) by which ethnic density might operate to attenuate rates of schizophrenia remain elusive. Neighbourhood ethnic density also seems to be protective against other outcomes, such as psychological distress [86], admissions to psychiatric hospitals [87] and suicide [88]. Some have posited that ethnic minorities living in neighbourhoods with higher percentages of other ethnic minorities are subjected to less discrimination, which has been associated with rates of schizophrenia in ecological studies (e.g. [35]). Others have suggested that neighbourhoods dense in ethnic minority residents are likely to have greater social cohesion than neighbourhoods in which the majority ethnic group constitutes the greatest proportion of residents.
Typically, ethnic density is measured using administrative (e.g. census) data. Recent work in the United Kingdom indicates that perceived ethnic density and measured ethnic density are moderately correlated, and that the impact differs by ethnic group [34]. Further investigation at the neighbourhood and individual levels across a range of contexts and ethnic groups is required to better understand the protective properties of this social phenomenon, and perhaps to harness its salutary effects.
11.5.3 Alcohol: Genes, culture and health
The association between a genotype and disease can be modified by context. Genetic susceptibility to alcohol dependence is associated with genes coding for enzymes involved in the metabolism of alcohol in the liver. In Asian populations, an allele coding for one of these enzymes, aldehyde dehydrogenase 2 (ALDH2*2), has repeatedly been shown to decrease alcohol consumption [89], and decrease the risk of alcohol dependence [90, 91]. The mechanism by which the allele reduces the risk of alcoholism
involves an aversive reaction to alcohol consumption caused by a high concentration of acetaldehyde in the blood following consumption. The aversive symptoms can be very unpleasant, including intense flushing, palpitations and headache. Individuals who are homozygous for ALDH2 protective alleles (ALDH2*2*2) have such a strong aversive reaction that they drink very little if at all [89]. This accounts for the fact that none were found in large samples of male alcoholics in Japan [90]. Individuals who are heterozygous for this allele have a weaker and more variable aversive reaction. Thus the biological effects of the homozygous ALDH2*2 genotype are so strong that they are little affected by cultural factors, whereas the weaker effects of the heterozygous genotype allow for an interaction of culture with the genotype. This was put forward as one possible explanation of observed changes in the proportion of ALDH2 heterozygotes in samples of male alcoholics in Japan [90]. The protective effect of the heterozygous genotype may have become weaker as the strength of the social pressures for heavy drinking increased.
The ALDH2 alleles have also been used to provide evidence on the health effects of alcohol consumption. A method referred to as ‘Mendelian randomisation’ is increasingly employed in epidemiology to provide complementary evidence as to whether an observed association between an environmental exposure and a disease is causal (see [92–94] for more detailed discussion). Often an exposure is associated with a cluster of potential confounders (e.g. high alcohol intake may be associated with cigarette smoking, poor diet and other unhealthy habits) and it is difficult to disentangle their effects. This problem can be overcome to some degree by examining a genetic variant that is related to the exposure but not to the confounders.
This condition appeared to be met in a Japanese study in which ALDH2*2*2 was strongly related to (reduced) alcohol intake but not to some potential confounders such as cigarette smoking [95]. ALDH2*2*2 was related to reduced levels of high-density lipoprotein cholesterol and increased risk of myocardial infarction [95], providing some supportive evidence for a protective effect of alcohol use that has been reported from observational studies (e.g. [96–98]). Again, these relationships will vary across different contexts
according to the frequency of both the genetic variant and of alcohol use. For instance, in many Asian societies, women consume much less alcohol than men and consequently the relation of ALDH2*2*2 to the health effects of alcohol is harder to detect among women.
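The logic of Mendelian randomisation described above can be illustrated with a small simulation. The sketch below is not an analysis of the Japanese data [95]: the carrier frequency, effect sizes, sample size and function names are all arbitrary assumptions chosen for illustration. It assumes a variant that lowers the exposure and is independent of an unmeasured confounder, then compares a naive regression of outcome on exposure with the simplest instrumental estimate (the ratio of the genotype-outcome difference to the genotype-exposure difference).

```python
import random


def simulate_cohort(n=40000, true_effect=0.5, seed=1):
    """Simulate (carrier, exposure, outcome) triples with an unmeasured confounder.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        carrier = 1 if rng.random() < 0.3 else 0   # assumed carrier frequency
        confounder = rng.gauss(0, 1)               # unmeasured; independent of genotype
        # Exposure is lowered by the variant and raised by the confounder.
        exposure = 2.0 - 1.0 * carrier + confounder + rng.gauss(0, 1)
        # Outcome depends on exposure (true_effect) and strongly on the confounder.
        outcome = true_effect * exposure + 2.0 * confounder + rng.gauss(0, 1)
        rows.append((carrier, exposure, outcome))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_slope(rows):
    """Least-squares slope of outcome on exposure, ignoring confounding."""
    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def wald_ratio(rows):
    """Genotype-outcome difference divided by genotype-exposure difference."""
    x1 = mean([x for g, x, _ in rows if g == 1])
    x0 = mean([x for g, x, _ in rows if g == 0])
    y1 = mean([y for g, _, y in rows if g == 1])
    y0 = mean([y for g, _, y in rows if g == 0])
    return (y1 - y0) / (x1 - x0)


if __name__ == "__main__":
    cohort = simulate_cohort()
    print("naive estimate:", round(naive_slope(cohort), 2))
    print("Mendelian randomisation estimate:", round(wald_ratio(cohort), 2))
```

Under these assumptions the naive slope overstates the effect of the exposure, because the confounder raises both exposure and outcome, whereas the genotype-based ratio should lie close to the assumed true effect. The estimate becomes unstable when the variant or the exposure is rare, which mirrors the point made above about contexts in which the relation is harder to detect.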
11.5.4 Course and outcome of schizophrenia in developing and developed countries
In studies of schizophrenia in the twentieth century, the course and outcome were found to be on average more benign in developing than developed countries [99, 100]. When one thinks only in terms of individual level influences on course and outcome, these findings are counterintuitive. It had been shown that within populations, modern treatments (e.g. medication, family interventions) reduce the risk of relapse in patients with schizophrenia [101]. And yet, in developed countries where individuals had greater access to those treatments, the mean outcome was comparatively worse. To explain this difference in mean outcome across settings, researchers had to consider societal level processes. Speculation concentrated on three dimensions of context: family relationships, informal economies and segregation of the mentally ill. The overarching theme of most theories was that developing country settings offered more opportunities for individuals with mental illness to maintain family, work and community roles.
Recently some investigators have challenged whether this difference in course and outcome is valid [102]. Our view is that the original findings represent the best work on this topic in the twentieth century and were valid in that historical context. Nonetheless, the world of today is dramatically different, and one should therefore not expect to see the same patterns of course and outcome today across these same countries. Massive urbanisation and the growth of megacities represent but one of the salient sociocultural changes that have taken hold in low- and middle-income countries. We do not know the implications of these sociocultural changes for either the incidence or the course of schizophrenia, but this is surely an important topic for the future of psychiatric epidemiology.
11.5.5 Birthweight and psychiatric outcomes
Relationships between birthweight and psychiatric outcomes have been postulated since the mid-twentieth century. Given the ready availability of birthweight data in many locales, this would seem to be one of the simplest relationships to establish (or refute), but in fact it has turned out to be among the most difficult. The question has still not been resolved, for example for schizophrenia and affective disorders, despite the availability of registries which link birthweight and psychiatric treatment outcomes for many millions of persons. The experience may therefore be instructive.
The central issues can be illustrated with the relation of birthweight to IQ. Reports have suggested that birthweight may be related to IQ well into the normal birthweight range [103, 104]. Studies of the relationship between birthweight and IQ are shadowed, however, by the powerful and potentially confounding influence of family social environment. Removing the influence of family social environment is extremely difficult in individual level studies: controlling for parental attributes and other measured family factors does not fully capture the complex influence of family environment. The aspects of family social environment that potentially confound these results are generally shared by siblings, and are therefore better conceptualised as family level rather than individual level variables. So we are dealing with a family level variable (social environment) as a potential confounder of an individual level association (of birthweight and IQ).
Once the cross-level nature of the confounding is recognised, it becomes possible to design studies so as to tightly control it. Sib-pair designs, examining individual level effects within families, offer a potential solution to this problem. Matte and colleagues used this strategy to examine the association of birthweight and IQ in a large cohort born 1959–1966 in the United States.
Comparing individuals within same-sex sibships, they demonstrated that for boys, the increase in childhood IQ with birthweight extends well into the normal birthweight range [105]. Under this design, the birthweight effect could not be confounded by family environment, as siblings within the same
family share this environment. Although the effect was modest, the ramifications on a population level were potentially important. But this was not the end of the story. Some large contemporary studies of the birthweight-IQ relationship within sibling pairs have found minimal or no association (e.g. [106–108]). The most recent large study (probably the best study so far) did find one, however, and there may be different causes of birthweight variation within versus between families [109, 110]. The question remains open as to whether these conflicting findings reflect historical change or geographic variation in the relationship.
This example indicates still another way in which explicit thinking about multiple levels can be useful, that is, in the control of confounding. Causal determinants at one level can be confounders of findings at another level. Consequently, a clear conceptual framework that includes multiple levels of causation makes it much easier to find ways to control confounding, which is especially important for relatively small effects.
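The way a sib-pair design removes family level confounding can be sketched with simulated data. The example below is a hypothetical illustration, not a reanalysis of any cohort cited here: the effect sizes, the number of families and the function names are assumptions. Each family has a shared environment that raises both birthweight and IQ; the pooled regression across individuals absorbs that shared component, while the regression of sibling differences in IQ on sibling differences in birthweight does not, because the shared component cancels in the difference. The sketch also assumes the birthweight effect is the same within and between families, which, as noted above, may not hold in practice.

```python
import random


def simulate_sib_pairs(n_families=20000, true_effect=0.3, seed=7):
    """Return a list of sibling pairs, each sibling a (birthweight, iq) tuple.

    Values are on arbitrary standardised scales; all parameters are hypothetical.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_families):
        family_env = rng.gauss(0, 1)   # shared by both siblings, unmeasured
        sibs = []
        for _ in range(2):
            birthweight = 0.8 * family_env + rng.gauss(0, 1)
            iq = true_effect * birthweight + 1.5 * family_env + rng.gauss(0, 1)
            sibs.append((birthweight, iq))
        pairs.append((sibs[0], sibs[1]))
    return pairs


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def pooled_slope(pairs):
    """Individual level analysis that ignores family membership."""
    xs = [bw for pair in pairs for bw, _ in pair]
    ys = [iq for pair in pairs for _, iq in pair]
    return ols_slope(xs, ys)


def within_pair_slope(pairs):
    """Regress the sibling difference in IQ on the sibling difference in birthweight."""
    dx = [a[0] - b[0] for a, b in pairs]
    dy = [a[1] - b[1] for a, b in pairs]
    return ols_slope(dx, dy)


if __name__ == "__main__":
    data = simulate_sib_pairs()
    print("pooled individual level slope:", round(pooled_slope(data), 2))
    print("within sib-pair slope:", round(within_pair_slope(data), 2))
```

With these assumed values the pooled slope is inflated well above the simulated effect, whereas the within-pair slope should sit close to it. The design controls only what siblings share; factors that differ between siblings remain potential confounders.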
11.5.6 Violence and mental illness
There are many individual level risk factors for violent behaviours, and severe mental illness is one of them [111]. At the same time, it is clear that the societal context exerts a powerful influence on violent behaviour. This was demonstrated, for example, in an innovative study of Chicago neighbourhoods, where collective efficacy (similar to social cohesion) of the neighbourhood was inversely related to the rate of violent crime [24]. Consistent with this are findings from two studies by Link and colleagues, one in New York City and the other in Israel, using similar measures of violence [112, 113]. They found modestly higher rates of violence among the mentally ill in both study populations; however, people with mental illness in Israel had rates of violence comparable to members of the public in New York City.
In light of these relationships, what do we do about higher rates of violence amongst people with schizophrenia and other severe mental illnesses? One answer is: we find out more about what predicts violence in samples of people who have been hospitalised for mental illnesses and we develop risk assessment tools to select out violent people for more thorough intervention and control. Individual risk
factors do seem to play a role in the increased rates of violence that people with mental illness exhibit. Some investigators emphasise comorbid substance abuse [114], while others emphasise the nature of psychotic symptoms [113]. Such an approach is a reasonable and important one. But let us see how it can be enhanced by reasoning at a contextual level.
Once we accept the possibility that context matters for violent behaviours, we can begin to reason about the connection between mental illnesses and violent behaviours with a different frame of reference. Our vision is then shifted to thinking about the policies we implement and the structural arrangements these impose on people who develop serious mental illnesses. Currently, the most striking feature of policy towards individuals with schizophrenia in the United States is the scarcity of evidence-based treatments and the insufficient provision of even the most basic care such as shelter. Due in large part to the scarcity of supported housing, a very large number of mentally ill persons are presently residing in jails, prisons and municipal shelters. In these facilities, violent norms are well documented, and in such environments, mentally ill men and women are likely to adopt more violent behaviours. Moreover, those who can obtain supported housing generally are located in neighbourhoods which have low social cohesion and high rates of violence; again these neighbourhood characteristics can affect the behaviours of all residents including those who are mentally ill. To a large degree, these issues also pertain to individuals with other severe mental illnesses.
It may very well be, then, that policies shaped by irrational stigmatisation and fear of people with schizophrenia and other severe mental illnesses have the ironic effect of contributing to high rates of violence in this group.
The stigmatisation of mental illness no doubt contributes a great deal to the policy of scarce services and supported housing, as it would be inconceivable for a developed society to impose the appalling conditions of prisons and shelters on individuals with less stigmatised illnesses (e.g. diabetes). In addition, the strong societal fear that people with mental illnesses will be dangerous, a fear that is entirely out of proportion to the real risk that people with these problems actually pose, breeds the ‘not in my back yard’ (NIMBY) syndrome,
ensuring that the available housing for people with mental illnesses will be mainly located in neighbourhoods that do not have the clout to exclude this feared group from their midst. Should these considerations change our viewpoint about policies to reduce violence among individuals with mental illness? Perhaps the most effective intervention of all would be to make adequate care available, including supported housing in safe neighbourhoods. This policy would, at the same time, tend to reduce substance abuse and psychotic symptoms, which are among the important risk factors for violence that have been identified among mentally ill individuals. In addition, it might behoove us to address the antecedents of current policy, and advocate for change in societal attitudes towards mental illness.
11.6 Framing the future
While epidemiologists wrestle with the application of the methods and concepts just described, methodologists are also working on developing the next frontier – dynamic modelling. One approach to modelling dynamic systems is agent-based modelling [115]. Agent-based modelling is a bottom-up approach: the models assign characteristics to individuals and environments, allow them to interact, and observe the emergence of higher level dynamics from these lower level interactions [116]. It is anticipated that these simulations will be of particular value in the development of health interventions.
The more immediate future relates to the developments in multilevel and life course frameworks – and our ability to manage these complexities. The possibilities for expansion both up and down are enormous, indeed endless. Take expansion up to contexts – there is the global context, the national context, the neighbourhood context, the peer group context, the work context, the family context, even the context of a relationship with just one other person. Moreover, there isn’t just one facet to each of these contexts but rather a multitude of facets, just as there are many, many characteristics of individuals. Similarly, biological determinants exist at many different levels of organisation – molecule, cell, tissue, organ, system. Over and above the
complexity brought about by considering multiple levels, we are in danger of being overwhelmed with information at several levels. We are now in the era of whole genome sequencing; we must manage this information. Health and disease are emergent properties of individuals, the result of a dynamic process. Placing biological determinants in the hierarchy of causation helps to remind us that individual and higher contextual level processes will influence biological phenomena. As a consequence, we are multiplying complexities.
This appealing expansion brings home two very critical points about epidemiological inquiry. First, we choose our focus. Because we cannot conceptualise, let alone accurately measure, all influences at all levels, we are forced to choose a focus whether we want to or not – whether we know it or not. Second, because we cannot include all variables at all levels, our statistical analyses are always mis-specified by leaving out variables that would be included in a fully comprehensive model. This principle would apply even if we narrowed our focus to include only the individual level of analysis – it certainly applies when we expand our focus to include the cell and the society. Whatever choice we make, much will be left out, and the gap cannot be filled by any statistical analysis of the data collected.
The practical significance of the foregoing considerations is that to approach epidemiological questions wisely, we need to have causal explanations that involve multiple levels and the interconnections between those levels. This will require theory and conceptualisation of what is salient for disease causation at the various levels. Thus, the era of multilevel inquiry will require the creative construction of rigorous causal explanations and the careful conceptualisation and measurement of the variables implied by those explanations.
We cannot hope to succeed by simply adding measures at other levels of analysis to the kinds of statistical manipulations used during the individually focused era of risk factor epidemiology. The data of census tracts may seem to offer a measure of the social context, but we cannot solely rely on what the census gathers, nor can we limit our assessment of contexts to the arbitrarily constructed boundaries of census tracts. We must also keep the longitudinal perspective in mind, even when we are not conducting life course
studies. Causation is rarely immediate; identifying the relevant causal factors will require a deeper understanding of how we develop liability to mental disorders, and mindfulness of a pathogenic trajectory over the life course, and perhaps over the life courses of parents, children and grandchildren.
Classical epidemiology, before the primacy of multivariate methods, is replete with examples of strategic inquiry focused on evaluating explanations for disease causation. Clever tests help us decide whether a causal explanation is consistent or inconsistent with observed facts. These examples from classical epidemiology tell us we need two things together: causal explanations and informative tests of those causal explanations. We need to bring this aspect of classical epidemiology to the new focus on multiple levels of inquiry.
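The bottom-up logic of agent-based modelling mentioned at the start of this section can be made concrete with a toy model. The sketch below is purely illustrative and is not drawn from [115] or [116]: the behaviour, the ‘cohesion’ attribute of neighbourhoods, the ‘susceptibility’ attribute of individuals and every rate are hypothetical assumptions. Individuals are assigned characteristics, neighbourhoods are assigned characteristics, and at each step an individual’s chance of adopting a behaviour depends on how common that behaviour currently is among neighbours. Neighbourhood prevalence is never specified directly; it emerges from the individual level rules.

```python
import random


def neighbourhood_prevalence(agents, n_nbhd):
    """Proportion of agents currently showing the behaviour, per neighbourhood."""
    counts = [0] * n_nbhd
    active = [0] * n_nbhd
    for agent in agents:
        counts[agent["nbhd"]] += 1
        active[agent["nbhd"]] += 1 if agent["active"] else 0
    return [a / c for a, c in zip(active, counts)]


def run_abm(cohesion=(0.2, 0.8), agents_per_nbhd=500, steps=200, seed=11):
    """Simulate behaviour spread; returns prevalence per neighbourhood at each step.

    cohesion: one hypothetical value in [0, 1] per neighbourhood; higher
    cohesion dampens adoption of the behaviour.
    """
    rng = random.Random(seed)
    agents = [
        {
            "nbhd": n,
            "susceptibility": rng.uniform(0.5, 1.5),  # individual level trait
            "active": rng.random() < 0.05,            # same starting prevalence everywhere
        }
        for n in range(len(cohesion))
        for _ in range(agents_per_nbhd)
    ]
    history = []
    for _ in range(steps):
        prevalence = neighbourhood_prevalence(agents, len(cohesion))
        history.append(prevalence)
        for agent in agents:
            local = prevalence[agent["nbhd"]]
            if agent["active"]:
                # Fixed chance of desisting at each step.
                if rng.random() < 0.1:
                    agent["active"] = False
            else:
                # Adoption rises with local prevalence and falls with cohesion.
                p_adopt = (agent["susceptibility"]
                           * (1 - cohesion[agent["nbhd"]])
                           * (0.02 + 0.3 * local))
                if rng.random() < min(1.0, p_adopt):
                    agent["active"] = True
    history.append(neighbourhood_prevalence(agents, len(cohesion)))
    return history


if __name__ == "__main__":
    trajectory = run_abm()
    print("final prevalence by neighbourhood:",
          [round(p, 2) for p in trajectory[-1]])
```

Both neighbourhoods start from the same prevalence and contain individuals drawn from the same distribution, yet under these assumed rules the low-cohesion neighbourhood settles at a markedly higher prevalence. A model of this kind could in principle be used to compare hypothetical interventions by changing a contextual parameter and re-running the simulation, although any real application would need rules and parameters grounded in data.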
References
[1] Susser, E., Bromet, E., Morabia, A. et al. (eds) (2007) Concepts and Methods of Psychiatric Epidemiology, Oxford University Press, New York. [2] Susser, E. (2004) Eco-epidemiology: thinking outside the black box. Epidemiology, 15 (5), 519–520. [3] Susser, M. and Stein, Z. (2009) Eras in Epidemiology: The Evolution of Ideas, Oxford University Press, New York. [4] Kuhn, T.S. (1962) The Structure of Scientific Revolutions, University of Chicago Press, Chicago, IL. [5] Fleck, L. (1939) Genesis and development of a scientific fact, in Genesis and Development of a Scientific Fact (eds T.J. Trenn and R.K. Merton), University of Chicago Press, Chicago, IL, 1981. [6] March, D. and Susser, E. (2006) The eco- in eco-epidemiology. Int. J. Epidemiol., 35 (6), 1379–1383. [Epub 2006 Nov 24]. [7] Snow, J. (1855) On the Mode of Communication of Cholera, 2nd edn, Churchill, London. (Reproduced in Snow on Cholera. Commonwealth Fund, 1936, New York). [8] Winslow, C.E.A. (1943) The Conquest of Epidemic Disease: A Chapter in the History of Ideas, University of Wisconsin Press, Madison, WI. [9] Morabia, A. (2007) Epidemiologic interactions, complexity, and the lonesome death of Max von Pettenkofer. Am. J. Epidemiol., 166 (11), 1233–1238. [10] Oppenheimer, G. and Susser, E. (2007) Invited commentary: the context and challenge of von Pettenkofer’s contributions to epidemiology. Am. J. Epidemiol., 166 (11), 1239–1241; discussion 1242–1243.
[11] Chapin, C. (1934) The Papers of Charles V. Chapin, MD: A Review of Public Health Realities, The Commonwealth Fund, Oxford University Press, New York. [12] Ross, R. (1911) The Prevention of Malaria, Oxford University Press, London. [13] Doll, R. and Hill, A.B. (1950) Smoking and carcinoma of the lung. Preliminary report. Br. Med. J., 2, 739–748. [14] Oppenheimer, G.M. (2006) Profiling risk: the emergence of coronary heart disease epidemiology in the United States (1947–1970). Int. J. Epidemiol., 35 (3), 720–730. [15] Rothman, K.J., Greenland, S. and Lash, T. (2008) Modern Epidemiology, 3rd edn, Lippincott Williams & Wilkins, Philadelphia, PA. [16] Susser, M. and Susser, E. (1996) Choosing a future for epidemiology: I. eras and paradigms. Am. J. Public Health, 86, 668–673. [17] Susser, M. and Susser, E. (1996) Choosing a future for epidemiology: II. from black box to Chinese boxes in eco-epidemiology. Am. J. Public Health, 86, 674–677. [18] McMichael, A.J. (1999) Prisoners of the proximate: loosening the constraints on epidemiology in an age of change. Am. J. Epidemiol., 149, 887–897. [19] Smith, G.D. and Ebrahim, S. (2001) Epidemiology – is it time to call it a day? Int. J. Epidemiol., 30, 1–11. [20] Myer, L., Morroni, C. and Susser, E.S. (2003) Commentary: the social pathology of the HIV/AIDS pandemic. Int. J. Epidemiol., 32 (2), 189–192. [21] Christakis, N.A. and Fowler, J.H. (2007) The spread of obesity in a large social network over 32 years. N. Engl. J. Med., 357 (4), 370–379. [22] Goldberger, J., Wheeler, G.A. and Sydenstricker, E. (1920) A study of the relation of family income and other economic factors to pellagra incidence in seven cotton mill villages of South Carolina in 1916. Public Health Rep., 35, 2673–2714. [23] Haan, M., Kaplan, G. and Camacho, T. (1987) Poverty and health: prospective evidence from the Alameda County study. Am. J. Epidemiol., 125, 989–998. [24] Sampson, R.J., Raudenbush, S.W. and Earls, F.
(1997) Neighborhoods and violent crime: a multilevel study of collective efficacy. Science, 277, 918–924. [25] Diez-Roux, A., Nieto, F., Muntaner, C. et al. (1997) Neighborhood environments and coronary heart disease: a multilevel analysis. Am. J. Epidemiol., 146, 48–63. [26] Entwisle, B., Mason, W.M. and Hermali, H.I. (1986) The multilevel dependence of contraceptive use on the socioeconomic development and family planning program strength. Demography, 23, 199–216.
[27] Cummins, S., Curtis, S., Diez-Roux, A.V. et al. (2007) Understanding and representing ‘place’ in health research: a relational approach. Soc. Sci. Med., 65, 1825–1838. [28] Kirkbride, J.B. and Jones, P.B. (2010) The Prevention of Schizophrenia–What Can We Learn From Eco-Epidemiology? Schizophr. Bull. [Epub 2010 Oct 25]. [29] Link, B.G. and Phelan, J. (1995) Social conditions as fundamental causes of disease. J. Health Soc. Behav., 80–94 (Extra Issue). [30] Carpiano, R.M. (2007) Neighborhood social capital and adult health: an empirical test of a Bourdieu-based model. Health Place, 13 (3), 639–655. [31] Drukker, M., Krabbendam, L., Driessen, G. et al. (2006) Social disadvantage and schizophrenia: a combined neighborhood and individual-level analysis. Soc. Psychiatry Psychiatr. Epidemiol., 20, 1–10. [32] Echeverria, S., Diez-Roux, A.V., Shea, S. et al. (2008) Associations of neighborhood problems and neighborhood social cohesion with mental health and health behaviors: the Multi-Ethnic Study of Atherosclerosis. Health Place, 14 (4), 853–865. [33] Rezaeian, M., Dunn, G., Selwyn, S.L. et al. (2007) Do hot spots of deprivation predict the rates of suicide within London boroughs? Health Place, 13, 886–893. [34] Stafford, M., Becares, L. and Nazroo, J. (2009) Objective and perceived ethnic density and health: findings from a United Kingdom general population survey. Am. J. Epidemiol., 170 (4), 484–493. [35] Veling, W., Susser, E., van Os, J. et al. (2008) Ethnic density of neighborhoods and incidence of psychotic disorders among immigrants. Am. J. Psychiatry, 165 (1), 66–73. [36] Morgan, C. and Fearon, P. (2007) Social experience and psychosis insights from studies of migrant and ethnic minority groups. Epidemiol. Psichiatr. Soc., 16 (2), 118–123. [37] Kirkbride, J.B., Morgan, C., Fearon, P., Dazzan, P., Murray, R.M. and Jones, P.B. (2007) Neighbourhood-level effects on psychoses: reexamining the role of context. Psychol. Med., 37 (10), 1413–1425.
[38] Wilkinson, R.G. and Pickett, K.E. (2007) The problems of relative deprivation: why some societies do better than others. Soc. Sci. Med., 65, 1965–1978. [39] McIntyre, S., Macdonald, L. and Ellaway, A. (2008) Do poorer people have poorer access to local resources and facilities? The distribution of local resources by area deprivation in Glasgow, Scotland. Soc. Sci. Med., 67 (6), 900–914. [40] Faris, R. and Dunham, H. (1939) Mental Disorders in Urban Areas, University of Chicago Press, Chicago, IL.
[41] Hare, E.H. (1956) Mental illness and social conditions in Bristol. J. Ment. Sci., 102, 349–357. [42] van Os, J., Driessen, G., Gunther, N. et al. (2000) Neighbourhood variation in incidence of schizophrenia: evidence for person-environment interaction. Br. J. Psychiatry, 176, 243–248. [43] Barker, D.J.P. (1992) The Fetal and Infant Origins of Adult Disease, BMJ Books, London. [44] Keating, D.P. and Hertzman, C. (eds) (1999) Developmental Health and the Wealth of Nations, Guilford Press, New York. [45] Gluckman, P.D., Hanson, M.A., Cooper, C. et al. (2008) Effect of in utero and early-life conditions on adult health and disease. N. Engl. J. Med., 359 (1), 61–73. [46] Susser, E. and Terry, M.B. (2003) A conception-to-death cohort. Lancet, 361 (9360), 797–798. [47] Kuh, E. and Ben-Shlomo, Y. (1997) A Life Course Approach to Chronic Disease Epidemiology, Oxford University Press, Oxford. [48] Kuh, D. and Ben-Shlomo, Y. (2004) A Life Course Approach to Chronic Disease Epidemiology, 2nd edn, Oxford University Press, Oxford. [49] Susser, E., Terry, M.B. and Matte, T. (2000) The birth cohorts grow up: new opportunities for epidemiology. Pediatr. Perinat. Epidemiol., 14, 98–100. [50] Copeland, W.E., Shanahan, L., Costello, E.J. et al. (2009) Childhood and adolescent disorders as predictors of young adult disorders. Arch. Gen. Psychiatry, 66 (7), 764–772. [51] Kim-Cohen, J., Caspi, A., Moffitt, T.E. et al. (2003) Prior juvenile diagnoses in adults with mental disorder: developmental follow-back of a prospective-longitudinal cohort. Arch. Gen. Psychiatry, 60 (7), 709–717. [52] Kuh, E. and Ben-Shlomo, Y. (2002) A life course approach to chronic disease epidemiology: conceptual models, empirical challenges and interdisciplinary perspectives. Int. J. Epidemiol., 31 (2), 285–293. [53] Herbst, A.L., Ulfelder, H. and Poskanzer, D.C. (1971) Adenocarcinoma of the vagina. Association of maternal stilboestrol therapy with tumor appearance in young women. N. Engl. J. 
Med., 284, 878–881. [54] Richter, L.M. (2006) Studying adolescence. Science, 312, 1902–1905. [55] Spear, L.P. (2009) Heightened stress responsivity and emotional reactivity during pubertal maturation: implications for psychopathology. Dev. Psychopathol., 21, 87–97. [56] Gunner, M.R., Wewerka, S., Frenn, K. et al. (2009) Developmental changes in hypothalamus–pituitary– adrenal activity over the transition to adolescence:
normative changes and associations with puberty. Dev. Psychopathol., 21, 69–85. [57] Susser, E., St Clair, D. and He, L. (2008) Latent effects of prenatal malnutrition on adult health: the example of schizophrenia. Ann. N. Y. Acad. Sci., 1136, 185–192. [58] Tsankova, N., Rethal, W., Kumar, A. et al. (2007) Epigenetic regulation in psychiatric disorders. Nat. Rev. Neurosci., 8, 355–367. [59] Jirtle, R.L. and Skinner, M.K. (2007) Environmental epigenomics and disease susceptibility. Nat. Rev. Genet., 8 (4), 253–262. [60] Waterland, R.A. and Jirtle, R.L. (2003) Transposable elements: targets for early nutritional effects on epigenetic gene regulation. Mol. Cell. Biol., 23, 5293–5300. [61] Heijmans, B.T., Tobi, E.W., Stein, A.D. et al. (2008) Persistent epigenetic differences associated with prenatal exposure to famine in humans. Proc. Natl. Acad. Sci. USA, 105 (44), 17046–17049. [62] Mill, J., Tang, T., Kaminsky, Z. et al. (2008) Epigenomic profiling reveals DNA-methylation changes associated with major psychosis. Am. J. Hum. Genet., 82, 696–711. [63] Grayson, D.R., Jia, X., Chen, Y. et al. (2005) Reelin promoter hypermethylation in schizophrenia. Proc. Natl. Acad. Sci. USA, 102, 9341–9346. [64] Meaney, M.J. (2001) Maternal care, gene expression, and the transmission of individual differences in stress reactivity across generations. Annu. Rev. Neurosci., 24, 1161–1192. [65] Youngson, N. and Whitelaw, E. (2008) Transgenerational epigenetic effects. Annu. Rev. Genomics. Hum. Genet., 9, 233–257. [66] Morgan, D.K. and Whitelaw, E. (2008) The case for transgenerational epigenetic inheritance in humans. Mamm. Genome, 19, 394–397. [67] Weaver, I.C.G., D’Alessio, A.C.D., Brown, S.E. et al. (2007) The transcription factor nerve growth factor-inducible protein A mediates epigenetic programming: altering epigenetic marks by immediate-early genes. J. Neurosci., 27 (7), 1756–1768. [68] Hyman, S.E. (2009) How adversity gets under the skin. Nat. Neurosci., 12 (3), 241–243. [69] Pickles, A., Maughan, B. and Wadsworth, M. (2007) Epidemiological Methods in Life Course Research, Oxford University Press, Oxford. [70] Penrose, L.S. (1933) The relative effects of parental and maternal age in mongolism. J. Genet., 27, 219–224. [71] De Souza, E., Alberman, E. and Morris, J.K. (2009) Down syndrome and paternal age, a new analysis of case-control data collected in the 1960s. Am. J. Med. Genet., 149A (6), 1205–1208.
[72] Dzurova, D. and Pikhart, H. (2005) Down syndrome, paternal age and education: comparison of California and the Czech Republic. BMC Public Health, 5, 69. [73] Nybo Andersen, A.M., Hansen, K.D., Andersen, P.K. et al. (2004) Advanced paternal age and risk of fetal death: a cohort study. Am. J. Epidemiol., 160 (12), 1214–1222. [74] Yoon, S.-R., Qin, J, Glaser, R.L. et al. (2009) The ups and downs of mutation frequencies during aging can account for the apert syndrome paternal age effect. PLoS Genet., 5 (7), e1000558. [75] Saha, S., Barnett, A.G., Foldi, C. et al. (2009) Advanced paternal age is associated with impaired neurocognitive outcomes during infancy and childhood. PLoS Med., 6 (3), 0303–0310. [76] Reichenberg, A., Gross, R., Weiser, M. et al. (2006) Advancing paternal age and autism. Arch. Gen. Psychiatry, 63 (9), 1026–1032. [77] Malaspina, D., Harlap, S., Fennig, S. et al. (2001) Advancing paternal age and the risk of schizophrenia. Arch. Gen. Psychiatry, 58, 361–367. [78] Brown, A.S., Schaefer, C., Wyatt, R.J. et al. (2002) Paternal age and risk of schizophrenia in adulthood. Am. J. Psychiatry, 159 (9), 1528–1533. [79] Penrose, L.S. (1955) Parental age and mutation. Lancet, 269, 312–313. [80] Crow, J.F. (1997) The high spontaneous mutation rate: is it a health risk? Proc. Natl. Acad. Sci. USA, 94, 8380–8386. [81] St Clair, D. (2009) Copy number variation and schizophrenia. Schizophr. Bull., 35 (1), 9–12. [82] Cantor Graae, E. and Selten, J.P. (2005) Schizophrenia and migration: a meta-analysis and review. Am. J. Psychiatry, 162 (1), 12–24. [83] Fearon, P., Kirkbride, J.B., Morgan, C. et al. (2006) Incidence of schizophrenia and other psychoses in ethnic minority groups: results from the MRC ÆSOP study. Psychol. Med., 36 (11), 1541–1550. [84] Veling, W., Selten, J.P., Veen, N. et al. (2006) Incidence of schizophrenia among ethnic minorities in the Netherlands: a four-year first-contact study. Schizophr. Res., 86 (1–3), 189–193. 
[85] Boydell, J., van Os, J., McKenzie, K. et al. (2001) Incidence of schizophrenia in ethnic minorities in London: ecological study into interactions with environment. Br. Med. J., 323 (7325), 1336–1338. [86] Fagg, J., Curtis, S., Stansfeld, S., et al. (2006) Psychological distress among adolescents, and its relationship to individual, family∼and area characteristics in East London. Soc. Sci. Med., 63 (3), 636–648. [87] Rabkin, J. (1979) Ethnic density and psychiatric hospitalization: hazards of minority status. Am. J. Psychiatry., 136 (12), 1562–1566.
181
CHAPTER 11 [88] Neeleman, J. and Wessely, S. (1999) Ethnic minority suicide: a small area geographical study in south London. Psychol. Med., 29 (2), 429–436. [89] Higuchi, S., Matsushita, S., Muramaysu, T. et al. (1996) Alcohol and aldehyde dehydrogenase genotypes and drinking behavior in Japanese. Alcohol. Clin. Exp. Res., 20, 493–497. [90] Higuchi, S., Matsushita, S., Imazeki, H. et al. (1994) Aldehyde dehydrogenase genotypes in Japanese alcoholics. Lancet, 343, 741–742. [91] Goedde, H.W., Agarwal, D.P., Fritze, G. et al. (1992) Distribution of ADH2 and ALDH2 genotypes in different populations. Hum. Genet., 88 (3), 344–346. [92] Smith, G.D. and Ebrahim, S. (2003) ’Mendelian randomisation’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int. J. Epidemiol., 32, 1–22. [93] Smith, G.D. and Ebrahim, S. (2004) Mendelian randomization: prospects, potentials, and limitations. Int. J. Epidemiol., 33, 30–42. [94] Ebrahim, S. and Smith, G.D. (2008) Mendelian randomization: can genetic epidemiology help redress the failures of observational epidemiology? Hum. Genet., 123 (1), 15–33. [95] Takagi, S., Iwai, N., Yamauchi, R. et al. (2002) Aldehyde dehydrogenase 2 gene is a risk factor for myocardial infarction in Japanese men. Hypertens. Res., 25 (5), 677–681. [96] Gaziano, J.M., Gaziano, T.A., Glynn, R.J. et al. (2000) Light-to-moderate alcohol consumption and mortality in the Physicians’ Health Study enrollment cohort. J. Am. Coll. Cardiol., 35 (1), 96–105. [97] Mukamal, K.J., Conigrave, K.M., Mittlemen, M.A. et al. (2003) Roles of drinking pattern and type of alcohol consumed in coronary heart disease in men. N. Engl. J. Med., 348, 109–118. [98] Djouss´e, L., Lee, I.M., Buring, J.E. et al. (2009) Alcohol consumption and risk of cardiovascular disease and death in women: potential mediating mechanisms. Circulation, 120 (3), 237–244. [99] Jablensky, A., Sartorious, N., Ernberg, G. et al. 
(1992) Schizophrenia: manifestations, incidence and course in different cultures. Psychol. Med. Monogr. Suppl., 20, 1–97. [100] Harrison, G., Hopper, K., Craig, T. et al. (2001) Recovery from psychotic illness: a 15- and 25-year international follow-up study. Br. J. Psychiatry., 178, 506–517. [101] Wyatt, R.J. and Henter, I. (2001) Rationale for the study of early intervention. Schizophr. Res., 51 (1), 69–76. [102] Cohen, A., Patel, V., Thara, R. et al. (2008) Questioning an Axiom: better prognosis for Schizophrenia in the developing world? Schizophr. Bull., 34 (2), 229–244.
182
[103] Breslau, N., Chilcoat, H., DelDotto, J. et al. (1996) Low birth weight and neurocognitive status at six years of age. Biol. Psychiatry., 40, 389–397. [104] Richards, M., Hardy, R., Kuh, D. et al. (2001) Birth weight and cognitive function in the British 1946 birth cohort: longitudinal population based study. Br. Med. J., 322, 199–203. [105] Matte, T.D., Bresnahan, M., Begg, M.D. et al. (2001) Influence of variation in birth weight within normal range and within sibships on IQ at age 7 years: cohort study. Br. Med. J., 323 (7308), 310–314. [106] Lawlor, D.A., Bor, W., O’Callaghan, M.J. et al. (2005) Intrauterine growth and intelligence within sibling pairs: findings from the Mater-University study of pregnancy and its outcomes. J. Epidemiol. Community Health, 59, 279–282. [107] Lawlor, D.A., Clark, H., Smith, G.D. et al. (2006) Intrauterine growth and intelligence within sibling pairs: findings from the aberdeen children of the 1950s cohort. Pediatrics, 117, e894–e902. [108] Yang, S., Lynch, J., Susser, E.S. et al. (2008) Birth weight and cognitive ability in childhood among siblings and nonsiblings. Pediatrics, 122, e350–e358. [109] Susser, E., Eide, M.G. and Begg, M. (2010) Invited commentary: The use of sibship studies to detect familial confounding. Am. J. Epidemiol., 172 (5), 537–539. [110] Tambs, K. (2010) Birth weight standardized to gestational age and intelligence in young adulthood: a register-based birth cohort study of male siblings. Am. J. Epidemiol., 172 (5), 530–536. [111] Stueve, A. and Link, B.G. (1997) Violence and psychiatric disorders: results from an epidemiological study of young adults in Israel. Psychiatr Q., 68 (4), 327–342. [112] Link, B.G., Andrews, H. and Cullen, F.T. (1992) The violent and illegal behavior of mental patients reconsidered. Am. Sociol. Rev., 57, 2750292. [113] Link, B.G., Monahan, J., Stueve, A. et al. 
(1999) Real in their consequences: a sociological approach to understanding the association between psychotic symptoms and violence. Am. Sociol. Rev., 64, 316–332. [114] Steadman, H.J., Mulvey, E.P., Monahan, J. et al. (1998) Violence by people discharged from acute psychiatric inpatient facilities and by others in the same neighborhoods. Arch. Gen. Psychiatry, 55, 1–9. [115] Bonabeau, E. (2002) Agent-based modeling: methods and techniques for simulating human systems. Proc. Natl. Acad. Sci. U.S.A., 99 (3), 7280–7287. [116] Auchicloss, A.H. and Diez Roux, A.V. (2008) A new tool for epidemiology: the usefulness of dynamicagent models in understanding place effects on health. Am. J. Epidemiol., 168, 1–8.
12
Studying the natural history of psychopathology
William W. Eaton
Department of Mental Health, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
12.1 Introduction
The natural history of psychopathology is a description, at the level of the population, of the ebbing and flowing of psychopathology from its earliest appearance to its final outcome. This chapter provides a conceptual framework for studies of the natural history of psychopathology and illustrates details of the framework with examples from research in the field of psychiatric epidemiology. Three major aspects of the natural history of psychopathology are onset, course and outcome. Onset of psychiatric disorders can occur very early in life, and the study of outcome ends with death. The ebbing and flowing sometimes occurs rapidly, as in the crescendo of fear involved in a panic attack; but for the most part the course is much more languid, operating over days, weeks, months, years and decades. Since a large proportion of individuals with mental disorders do not seek treatment, and those who do seek treatment presumably represent the most severe cases, the natural history of psychopathology is best studied with a population-based sample, in which individuals are selected from the entire general population without regard to whether they have received treatment or not. This avoids the well-known Berkson bias [1]. The combination of population-based samples and languid evolution of psychopathology favours the approach of life course epidemiology [2].
This chapter reviews concepts and methods for the study of the natural history of psychopathology. It is not a review of findings of studies on natural history. A comprehensive review would be cumbersome and uninformative because there is so much variation in methodologic quality of studies of natural course. If methodological standards are set high for such a review (for example by including only population-based studies with diagnostic information on an adequate number of subjects), there are very few studies that would be included. On the other hand, if methodological standards are set low for such a review (for example by including small studies of clinic samples and studies without diagnostic information), there would be a confusing morass of numerous studies with results so mixed and contradictory that the review would be of dubious value. This situation shows the state of the art in this area, indicating that we are still at the beginning stages of learning about the natural history of psychopathology.
12.2 Onset
Signs and symptoms which might be related to psychiatric disorders are widespread in the population, not always reflecting the presence of psychiatric disorder. This high frequency makes the evolution from a normal deviation to a pathologic process difficult
to define and discern. The absence of firm data on the validity of the classification system enjoins us to be careful about operationally defining disease onset. It is particularly difficult to establish the validity of a threshold for the presence versus the absence of disorder, because, from the clinical standpoint, subtle differences in a given clinician’s approach to treatment may suggest quite varied thresholds; from the epidemiologic standpoint, subtle differences in threshold may produce widely varying prevalences. A simple definition is that onset occurs when the individual first enters treatment. A related definition is that onset occurs when a symptom is noticeable by a clinician. Another definition is the point when the symptom is first noticed by the individual. With the operational criteria of the Diagnostic and Statistical Manual (DSM), it is possible to conceive of onset as the time when full criteria are met for the first time in life. This definition has been used in studies of incidence (e.g. [3, 4]). But it omits that part of the pathological process that takes place prior to meeting full criteria for disorder (the prodrome, as described below). Since the aetiological process may be extended in time, and the operation of aetiological factors distant, the definitions above, although capable of being operationalised, lack an explicit relationship to the pathological process. Pathology occurs when the sociobiologic dynamics have become abnormal and signifies a distinct change in the relationship among variables, the new influence of variables that were not important beforehand or a new metabolism of some sort. Onset is that point in time when the aetiological process becomes irretrievably pathological, that is, the point when it is certain that the full criteria for disorder will eventually be met. This point of irreversibility is difficult to observe.
Focus on population indicators for the force of morbidity leads to explicit consideration of the idea of a continuous line of development toward manifestation of disease with an as-yet-unknown point of irreversibility. At present we can only hypothesise where the disease begins, so that even the use of the word ‘symptom’ is problematic in the strict medical sense, since we cannot ascribe the complaint to the disease with perfect accuracy. Studying the natural history of psychopathology may, in the end, lead to the
conclusion that the disease concept is inappropriate or not useful, suggesting a shift to a more explicitly developmental framework [5, 6], with emphasis on normally distributed characteristics, and continuities in development, rather than rare dichotomies and discontinuities, which the disease model entails. One way of thinking about the development toward disease is to focus on the increase in severity or intensity of symptoms. An individual could have all the symptoms required for diagnosis but none of them in sufficient intensity or severity as to meet the threshold for case definition. The underlying logic of this concept is that the relatively high frequency of symptoms at a mild level of intensity in the general population makes it difficult to distinguish normal and subcriterial complaints from manifestations of disease. For many chronic disorders, including psychiatric disorders, it may be inappropriate to regard the symptom as ever having been absent (for example, deviant personality traits on axis II of the DSM). This type of progression toward disorder is termed intensification and leads the researcher to consider whether a crucial level of intensity exists at which the development toward disorder becomes irreversible. Figure 12.1 is an adaptation of a diagram used by Lilienfeld and Stolley [7, Figure 6.2] to visualise incidence as a time-oriented rate. The adaptation shows several distinct forms that onset can take when the disorder is defined by different levels of intensity or severity of symptoms. Compare cases No. 3 and No. 5, for example, which in the original diagram are situations of uncomplicated incidence. The bottom part of the figure shows how intensity, represented by the vertical width of the bars, might be different for these two new cases. It also shows how there might be intensifications occurring that are stronger in magnitude than that associated with incidence, which would not be recorded as new cases (bottom two ‘cases’ in grey).
Since the intensification of symptoms represents the force of morbidity in the population, use of a simple dichotomous measure of incidence will be misleading, unless the threshold of intensity is precisely where the pathologic process begins. A second conceptual approach toward disease development is the occurrence of new groups of symptoms where none existed. This involves the
Fig 12.1 Dichotomous view of onset (top) compared to symptom intensification (bottom). [Figure not reproduced. Top panel, cases observed between Wave 1 and Wave 2: existing chronic case; remitted case; new case; new case, not discovered. Bottom panel, drawn against a threshold of intensity for defining onset: new case with sudden onset; new case with gradual onset; sudden onset but not a new case; existing chronic, not new, case with intensification.]
gradual acquisition of symptoms so that clusters are formed that increasingly approach the constellation required to meet specified definitions for diagnosis. ‘Present’ can be defined as occurrence either at the non-severe or at the severe level: thus, decisions made about the process of symptom intensification complicate this idea which focuses on symptom acquisition. This leads the researcher to consider the order in which symptoms occur over the natural history of the disease and, in particular, whether one symptom is more important than others in accelerating the process. Conceptualising the force of morbidity as time to a single dichotomous event (i.e. traditional concepts of incidence) is not flexible enough to deal with dimensional constructs, as shown in Figure 12.1. It is also not flexible enough to deal with changes through time in the covariation of indicators, which can be an important aspect of the force of morbidity. Emergence is defined to be the development of new covariation of a group of symptoms to each other. Figure 12.2 shows a simplified view of this developmental phenomenon for the example of the depression syndrome. The vertical axis represents the intensity of mood disturbance, and the diagonal axis, slanting backwards from lower left to upper right, the intensity of somatic disturbance. Time is represented by the horizontal axis, passing from left to right. At some early stage of development, the correlation
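Fig 12.2 Acquisition of symptoms to covariation threshold of onset. [Figure not reproduced. Vertical axis: depressed mood; diagonal axis: somatic symptoms; horizontal axis: time in years (5 to 25), passing through the stages precursors, prodrome and disorder. The successive scatter plots are labelled with correlations R = 0.0, 0.2, 0.3, 0.4 and 0.5.]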
of mood to somatic disturbance is pictured as being 0.0 (a circle representing a cross-sectional scatter plot with correlation equal to 0.0). Gradually the mood comes to be associated with the somatic disturbance, shown by the evolution of the circle into an ellipse. At this point, the normal and the abnormal have not split, and the disorder is not inevitable. At this stage both mood and somatic disturbance are imperfect predictors of later onset of major depressive disorder. Later, a group begins to emerge for whom mood and somatic disturbance are highly correlated. Finally, there emerges a group with very high covariation of mood and somatic disturbance, and a second normal group where little covariation remains. An
increase in covariation can occur without an increase in mean levels of either mood or somatic disturbance. But presumably there is a sharp increase in impairment associated with some threshold of covariation. At some stage in the development of the covariation and impairment, a threshold for disorder might be set. These concepts allow the study of the progression of disease independently of case definition.
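The point that covariation can rise while mean levels stay fixed can be illustrated with a small numerical sketch. The Python fragment below uses invented scores for four hypothetical individuals at three stages (they are not data from any study); the stage names and the helper pearson_r are introduced here purely for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical mood scores for four individuals (held constant across stages).
mood = [1, 2, 3, 4]

# Hypothetical somatic scores: the same four values re-ordered at each stage,
# so the mean never changes while the association with mood strengthens.
somatic_by_stage = {
    "early": [2, 4, 1, 3],
    "middle": [2, 1, 3, 4],
    "late": [1, 2, 3, 4],
}

correlations = {stage: pearson_r(mood, s) for stage, s in somatic_by_stage.items()}
somatic_means = {stage: sum(s) / len(s) for stage, s in somatic_by_stage.items()}

for stage in somatic_by_stage:
    print(stage, round(correlations[stage], 2), somatic_means[stage])
```

In this toy example the correlation moves from 0.0 to 0.8 to 1.0 while the mean somatic score remains 2.5 throughout, which is the pattern described for emergence; where a threshold of covariation should be placed is not something the sketch addresses.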
12.2.1 Prodromes and precursors
The prodrome is the period prior to meeting full-blown criteria of disorder, when some signs or symptoms are nevertheless present. The prodrome is defined only for those who eventually are diagnosed as cases, and can only be observed with complete certainty in a retrospective fashion. The speed of onset is the length of the prodromal period and can be measured in simple units of time (e.g. months or years). The presence of signs or symptoms below the criterion level may help to identify individuals at heightened risk for developing the full-blown disorder, who might be considered targets of prevention. Given the widespread prevalence of individual signs and symptoms of mental disorders in the general population, it is likely that many individuals with signs and symptoms of disorder will not go on to develop the full-blown criteria. In this situation the signs and symptoms are not quite prodromal, in the strict sense of the word, but it is unacceptably imprecise to refer to them as risk factors. Signs and symptoms from a diagnostic cluster that precede
disorder, but do not predict the onset of disorder with certainty, are referred to here as precursor signs and symptoms. At the present state of our knowledge of the onset of mental disorders, there are few or no signs and symptoms that predict onset with certainty, but precursor signs and symptoms may be helpful in identifying groups at higher risk for onset than the general population. Converting what is known about precursors into true prodromes is an important topic of research for epidemiologists interested in longitudinal research and in prevention. An illustration of these issues is presented in Figure 12.3 and Table 12.1. Figure 12.3 shows two cumulative distributions for depressive disorder. The distribution on the right focuses on the age at which the individual first meets full criteria for DSM-III depressive disorder. For this distribution, onset must occur during the 1-year follow-up period of the Epidemiologic Catchment Area (ECA) Program, that is, a prospective design. The population at risk includes those who had never met criteria for the diagnosis at the beginning of the follow-up period. Thus, the at-risk group includes those with no symptoms, as well as those with some symptoms of disorder, but not meeting full DSM-III criteria. The distribution on the left focuses on the age at which the depression syndrome first occurred, as reported by the new cases. The dotted lines mark the quintiles. The area between the two curves gives a rough outline of the prodromal period. Depressive disorder has onset in young adulthood (Figure 12.3). Twenty per cent of cases meet criteria
Fig 12.3 DIS/DSM-III major depressive disorder: prodromal period for new cases, Epidemiologic Catchment Area Program. Eaton et al. [8], Am J Psychiatry. [Figure not reproduced. Vertical axis: cumulative percent with onset (0 to 100); horizontal axis: age in years (0 to 90); curves: onset of problem and onset of disorder.]

Table 12.1 Relative and attributable risk for depressive disorder due to selected precursors, Epidemiologic Catchment Area Program.

Precursor                 Precursor        Precursor         Precursor
                          relative risk    prevalence (%)    attributable risk (%)
Sad mood for two weeks    7.0              6.6               28
Weight problems           3.0              10.4              17
Sleep problems            7.6              13.6              47
Fatigue                   4.0              7.9               19
Thoughts of death         6.8              12.1              41
Depression syndrome       5.7              0.5               2

Adapted from [8].
for diagnosis for the first time before the age of 27 years and 50% before they are 40. Twenty per cent have their first depressive episode before the age of 17 and 50% before the age of 25. The prodromal period is about 15 years long. Symptoms associated with onset of depressive disorder, defined above as precursors, are associated with accelerated onset of the disorder. Table 12.1 shows the prevalence of the precursor, its relative risk in predicting onset of depressive disorder during the one year of follow-up in the ECA Program, and the attributable risk that can be estimated with the prevalence and the relative risk. The standard formula for attributable risk can be applied here (e.g. [9]) and is useful because it might prioritise precursors for screening or other prevention programmes, but this use of the term is conceptually different from other uses because of the limited duration of the follow-up. Therefore, the duration of the follow-up is used to qualify the attributable risk. Sleep problems have the highest relative risk (RR = 7.6), as well as high prevalence (13.6%): if there exists a single aetiologic pathway connecting sleep problems to depression, its elimination would reduce the occurrence of depressive disorder by 47%. The occurrence of depression syndrome (sad mood or anhedonia and two or more other symptoms) also has high relative risk (RR = 5.7), but the prevalence of depression syndrome is so low (0.5%) that the precursor attributable risk is only 2%. This formulation has been applied to depression previously [10, 11], and is applicable to most disorders. Many mental disorders have long prodromal periods, as shown in Figure 12.3 for depressive disorder.
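The attributable risks in Table 12.1 can be checked with a few lines of code. The sketch below assumes that the ‘standard formula’ referred to is Levin's population attributable risk, AR = p(RR − 1) / [1 + p(RR − 1)], where p is the prevalence of the precursor and RR its relative risk; the text does not spell the formula out, so this is an assumption, and the function name is hypothetical.

```python
def attributable_risk(prevalence, relative_risk):
    """Levin's population attributable risk, returned as a proportion.

    prevalence:    proportion of the population with the precursor (0 to 1)
    relative_risk: risk of onset with the precursor relative to without it
    """
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# (relative risk, prevalence in %) as reported in Table 12.1
table_12_1 = {
    "Sad mood for two weeks": (7.0, 6.6),
    "Weight problems": (3.0, 10.4),
    "Sleep problems": (7.6, 13.6),
    "Fatigue": (4.0, 7.9),
    "Thoughts of death": (6.8, 12.1),
    "Depression syndrome": (5.7, 0.5),
}

# Attributable risk in whole per cent, for comparison with the table.
attributable_pct = {
    name: round(100 * attributable_risk(prev / 100, rr))
    for name, (rr, prev) in table_12_1.items()
}

for name, pct in attributable_pct.items():
    print(f"{name}: {pct}%")
```

Under this assumption the rounded values agree with the third column of the table (for example, 47% for sleep problems and 2% for depression syndrome), which shows how a precursor with a high relative risk can still carry a small attributable risk when it is rare.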
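Fig 12.4 Duration of prodrome by symptom group, Baltimore ECA Follow-up. Adapted from Eaton et al. [12], Arch Gen Psychiatry. [Figure not reproduced. Horizontal box plots of the duration of prodrome in years (0 to 35) for the symptom groups dysphoria, anhedonia, appetite, sleep, slow/restless, tired, worthless, thinking problems and suicidal.]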
The symptomatic picture of the prodromal period is efficiently summarised with a horizontal box plot, as shown in Figure 12.4, in this case for depression [12]. As required for prodromes, only new cases, from the Baltimore ECA Follow-up, are included. The boxes show the durations of time that symptoms in the DSM-IV symptom groups have endured prior to onset. The median time is designated by the vertical line inside the box, and the quartiles are designated by the ends of the boxes. Most symptom groups have prodromes lasting 1 or 2 years, but for dysphoria and suicidal ideation, there is much heterogeneity, with over half the prodromes being more than 5 years long.
12.2.2 Population measures of onset
Incidence is the rate at which new cases develop in the population. It is essential to distinguish first incidence from total incidence. The numerator for first incidence is composed of those individuals who have had an occurrence of the disorder for the first time in their lives during a specified time period; the denominator excludes all persons who start the period with any prior history of the disorder. The numerator for total incidence includes all individuals who have a new occurrence of the disorder during the time period under investigation whether or not it is the initial episode of their lives or a recurrent episode. The denominator for total incidence excludes only persons who are active cases at the
beginning of the follow-up period. The distinction itself is commonly assumed by epidemiologists, but there does not appear to be consensus on the terminology. Most definitions of the incidence numerator include a concept such as new cases [13], illness commencing [14], cases that come into being [15] or persons who develop a disease [16] or have onset [17]. Sartwell and Last [18] imply total incidence when they state the necessity of allowing for an individual being counted more than once, if the condition is one for which this is possible (e.g. accidents or colds). Kleinbaum et al. [9] hint at the distinction between first and total incidence, but are not explicit on the issue. Morris [19] defines incidence as equivalent to our first incidence and attack rate as equivalent to our total incidence. Lilienfeld and Lilienfeld [13] also occasionally equate incidence with attack rate. Except for the latter text, in none of these definitions is it explicit whether or not an individual who is healthy now, but has had episodes of the disorder over the life course, qualifies for a new onset. First incidence corresponds to the most common use of the term ‘incidence’, but since the usage is by no means universal, the prefix is recommended. The preference for first or total incidence in aetiological studies depends on hypotheses and assumptions about the way causes and outcomes important to the disease ebb and flow. If the disease is recurrent and the causal factors vary in strength over time, then it might be important to study risk factors not only for first but for subsequent episodes (total incidence). For example, one might consider the effects of changing levels of stress on the occurrence of episodes of neurotic illness [20] or of schizophrenia [21]. For a disorder with a presumed fixed progression from some fixed starting point, such as dementia, the first occurrence might be the most important episode to focus on, and first incidence is the appropriate rate. 
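The difference between the two denominators can be made concrete with a small sketch. The Python fragment below uses an invented ten-person cohort; the record fields (prior_history, active_at_start, new_episode) and function names are hypothetical, introduced only for illustration, and the quantities computed are simple proportions over one follow-up period rather than person-time rates.

```python
from dataclasses import dataclass

@dataclass
class Person:
    prior_history: bool    # any episode before the follow-up period began
    active_at_start: bool  # an active case when the period begins
    new_episode: bool      # an episode begins during the period

def first_incidence(cohort):
    """New first-ever cases among those with no prior history."""
    at_risk = [p for p in cohort if not p.prior_history]
    return sum(p.new_episode for p in at_risk) / len(at_risk)

def total_incidence(cohort):
    """New episodes (first or recurrent) among those not active at the start."""
    at_risk = [p for p in cohort if not p.active_at_start]
    return sum(p.new_episode for p in at_risk) / len(at_risk)

# Invented cohort of ten people.
cohort = (
    [Person(False, False, False)] * 6    # never affected, remain well
    + [Person(False, False, True)]       # first lifetime onset
    + [Person(True, False, True)]        # remitted at start, recurrence
    + [Person(True, False, False)]       # remitted at start, remains well
    + [Person(True, True, False)]        # active case throughout
)

first = first_incidence(cohort)   # 1 new case among 7 never-affected people
total = total_incidence(cohort)   # 2 new episodes among 9 non-active people
print(round(first, 3), round(total, 3))
```

The recurrent episode counts toward total incidence but not first incidence, and the remitted individuals enter only the total-incidence denominator, which is why the two measures can diverge for recurrent disorders.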
In the field of psychiatric epidemiology, there is a range of disorders with both types of causal structures operating, which leads to discussion of the two distinct types of incidence. The two types of incidence are functionally related to different measures of prevalence. Kramer et al. [22] have shown that lifetime prevalence (i.e. the proportion of the population who have ever had an occurrence of a disorder) is a function of first incidence and mortality in affected and unaffected
populations. Point prevalence (i.e. the proportion of persons in a defined population at a given time who manifest the disorder) is linked to total incidence by the queuing formula P = I × D [9, 23]: that is, point prevalence is equal to the total incidence multiplied by the average duration of episodes. Incidence data on specific psychiatric disorders are expensive to gather. A minority of individuals, not necessarily representative of those with disorder, receive treatment, and therefore a field survey is required. Many of the disorders are rare, and many well individuals have to be evaluated at two distinct points in time to estimate the incidence rate. The number of prospective studies with sufficiently large samples to estimate rates of incidence is small. If 5000 person-years of observation is set as the minimum requirement, there are only a handful of studies that cover a range of disorders. These include the ECA study in the United States [3, 4], the Stirling County study in Canada [24], the Traunstein study in Germany [25], the Lundby Study in Sweden [26], the Baltimore ECA Follow-up [12], the Netherlands Mental Health Survey and Incidence Study (NEMESIS) [27], and, soon, the Follow-up of the National Comorbidity Survey [28]. Comparison of results between these studies is important because the numerators are so small that the findings from any one study are statistically volatile. Analysis of the onset of alcohol abuse or dependence in the ECA cohort [3, 4] shows sharply declining incidence after young adulthood and a slight rise at the beginning of the seventh decade. The rise in age-specific incidence rates in the elderly is caused by only five individuals who had onset in that age range. A similar curve from the Lundby study has the same shape [29], with the rise after age 60 based on only three individuals who had onset in that age range. These results suggest aetiological clues and have implications for prevention efforts.
The results of each study might not be convincing, but the replication of the identical pattern is credible.
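The queuing relation between point prevalence, total incidence and average episode duration, discussed earlier in this section, lends itself to a short worked example. The figures below are hypothetical and chosen only to show the arithmetic; the relation is usually presented as an approximation for a stable population, and the function names are introduced here for illustration.

```python
def point_prevalence(total_incidence, mean_duration):
    """Point prevalence from P = I * D.

    total_incidence: new episodes per person per year
    mean_duration:   average episode duration in years
    """
    return total_incidence * mean_duration

def implied_duration(point_prev, total_incidence):
    """Average episode duration implied by the same relation, D = P / I."""
    return point_prev / total_incidence

# Hypothetical values: 2 new episodes per 100 persons per year,
# with episodes lasting six months on average.
p = point_prevalence(total_incidence=0.02, mean_duration=0.5)

# Hypothetical values: a point prevalence of 3% with the same incidence
# would imply episodes lasting about a year and a half.
d = implied_duration(point_prev=0.03, total_incidence=0.02)

print(round(p, 4), round(d, 2))
```

Rearranging the relation in this way is one reason the pairing of point prevalence with total incidence is useful: if two of the three quantities can be estimated from field data, the third follows.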
12.3 Course
12.3.1 Remission
Careful definition of terms is essential for studying the course of psychopathology [30]. Conceptualising
and measuring the ebb and flow of psychopathology after onset necessitates focus on duration, measured by units of time, and on recurrence, which is measured in a manner similar to incidence. Remission is a point in time after onset when signs and symptoms diminish sharply. After the first onset has occurred, it is useful to have a measure of level of symptomatology that defines remission unambiguously. Only after setting a threshold for remission can the duration of the episode be studied [31]. The definition of remission has all the complexities of the definition of onset. But as well as a threshold for the presence and absence of signs and symptoms, defined by both intensity and breadth, the definition of remission requires that a threshold of a minimum time period be set, before which a remission does not occur. For example, a remission may be defined as a continuous period of three months or more during which the individual is not meeting full criteria for disorder; or, a stricter definition might be three months during which the individual has no symptoms of the disorder at all. The measure of remission will be most useful if it uses the diagnostic criteria as a comparison or standard value, because that will facilitate meaningful comparison of qualities of remission between disorders. As an example, an operational measure of completeness of remission is proposed to describe the point between episodes that is most free of signs and symptoms. It requires that thresholds be established for the intensity of signs and symptoms, as in, for example, the SCAN (rating scale one, value of 1 versus 2 or 3; [32]). The measure of completeness of remission can be used even if the threshold levels are set differently in different research studies.
The measure below takes advantage of the SCAN definitions to set thresholds of symptom intensity, and sets three months as the minimum time period during which the individual must fail to meet complete diagnostic criteria in order for a remission to be defined and measurable. The proposed levels of completeness of remission are the following:
• Level 1: No signs and symptoms present.
• Level 2: At least one sign or symptom present, but none above the threshold of intensity.
• Level 3: One and only one sign or symptom present above the threshold of intensity; other signs and symptoms may or may not be present below the threshold of intensity.
• Level 4: More than one sign or symptom present above the threshold.
• Level 5: Full criteria for disorder are present continuously, that is remission does not occur (‘continuously’ is defined as having no gaps greater than three months).
The speed of remission is defined similarly to the speed of onset and the prodromal period. It is the time from the point at which the disorder is at its symptom peak to the beginning of the remission. The symptom peak is best defined similarly to the concept of acquisition, discussed above: the point in time where the highest number of signs and symptoms are above the threshold of intensity. The speed of remission can be measured in standard units of time (e.g. weeks and months). A relapse occurs if the individual meets criteria for disorder after a remission. Relapse requires careful work on terminology and operational definition, as with remission [33]. The speed of relapse is the time required to move from the state of remission to the symptom peak. As with other duration measures, the metric for speed of relapse is standard units of time. Recurrence is the risk for relapse and is analogous to the incidence in expressing a dynamic or time-oriented risk for onset, as discussed above regarding attack rate. The rate of recurrence can be estimated similarly to incidence, with the risk set for recurrence being comprised of all those not currently meeting criteria for disorder.
The natural course is advantageously displayed in quasi-continuous fashion, as in Figure 12.5. Here the horizontal dimension is time, measured in yearly increments, and the vertical dimension is an ordinal measure of the frequency of panic attacks during the year. The graph shows every bit of data obtained on the course for the 33 new cases in the Baltimore ECA Follow-up [34]. The points in the graph are randomly jittered so that the course for each individual can be observed.
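Returning briefly to the five levels of completeness of remission proposed above, the classification can be written as a short decision rule. The following Python function is a minimal sketch: the argument names are hypothetical, and it assumes that the counts of signs and symptoms above and below the intensity threshold, and the judgement that full criteria have been met continuously, have already been made by whatever instrument a study uses.

```python
def remission_level(n_below_threshold, n_above_threshold, meets_full_criteria):
    """Return the level (1 to 5) of completeness of remission.

    n_below_threshold:   signs/symptoms present but below the intensity threshold
    n_above_threshold:   signs/symptoms present above the intensity threshold
    meets_full_criteria: True if full criteria for disorder are met continuously
                         (no gap greater than three months)
    """
    if meets_full_criteria:
        return 5  # remission does not occur
    if n_above_threshold > 1:
        return 4  # more than one sign or symptom above threshold
    if n_above_threshold == 1:
        return 3  # exactly one above threshold; others may be present below it
    if n_below_threshold >= 1:
        return 2  # something present, but nothing above threshold
    return 1      # no signs or symptoms present

# Hypothetical observations for illustration.
examples = [
    remission_level(0, 0, False),
    remission_level(2, 0, False),
    remission_level(3, 1, False),
    remission_level(0, 2, False),
    remission_level(1, 5, True),
]
print(examples)
```

Because the levels are defined relative to the diagnostic criteria rather than to a disorder-specific scale, the same rule could in principle be applied across disorders, which is the comparison the proposed measure is intended to support.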
Fig 12.5 Frequency of attacks after onset of panic disorder, Baltimore ECA Follow-up. [Figure: horizontal axis, year since onset (0 to 14); vertical axis, frequency of panic attacks; individual courses labelled A to D.]
The graph displays the great heterogeneity of the natural history, without reducing information, as would occur with the calculation of remission or recurrence rates. Individuals representing certain typical types of natural course can be identified: the quick and enduring recovery for case A; the stable
chronic case B; the gradual recovery in case C; and case D, who crosses the threshold of panic attacks back and forth repeatedly. These data can serve as the basis for random effects models, which estimate a slope for each individual. The average of all the individual slopes is shown in Figure 12.5 as a solid thick line declining from ordinal frequency Level 4 to Level 0 over the 14 years of follow-up.
Many of the concepts discussed above present a simplistic point of view by not taking the long-term course into account. For example, incidence, remission and relapse are all dichotomous outcomes that can be measured with only two waves of observations. One wave defines the sample at risk, which comprises the denominator, and the second wave estimates the numerator. These approaches involve severe reductions in the complexity of data, such as that displayed in Figure 12.6.
Attempts have been made to categorise or quantify the entire course of psychopathology for a given disorder – what might be termed the career of psychopathology. For example, Ciompi [35], after observing a first onset sample for an average of 35 years, proposed eight categories for the course of schizophrenia that combine the three dichotomies of onset (acute vs. insidious), course (stable vs. episodic) and outcome (good vs. bad). A visual description of these categories, adapted from Ciompi, is shown in Figure 12.6. These figures stimulate questions as to the nature of the course. For example, what is the ultimate outcome? Is the course steadily, progressively deteriorative or progressively ameliorative [36]? Is the rate of remission related to the speed of onset? Is the risk for recurrence related to the duration of the episode or to the speed of onset? Answers to these questions would be important for clinical treatment, but not much is known because of the difficulties of conducting research on the natural history of psychopathology.
Fig 12.6 Typologies of course from Ciompi follow-up. [Figure: columns for onset, course and outcome; ‘Good’ outcome 40%, ‘Poor’ outcome 60%.]
Risk factors may have differential effects on incidence, duration and recurrence, and it is informative to combine the study of all three indicators for any given risk factor. For example, prevalence studies uniformly show that both female gender and lower socioeconomic status (SES) are associated with higher prevalence of major depressive disorder. Analysis from the Baltimore ECA cohort showed that the gender difference existed only for incidence, not
for duration of episodes, nor for the risk for recurrence [12]. In contrast, the SES difference is very weak for incidence, but much stronger for persistence of disorder [37].
12.4 Outcome Outcome refers to the consequences of the psychopathology. These consequences can be immediate, such as impairment and disability resulting from the disorder. The focus here is on important and pernicious consequences of disorder that occur afterward and that are not included in the defining phenomena of the disorder, that is, future psychopathology of other types and physical illness (comorbidity), overall functioning and death.
12.4.1 Comorbidity Comorbidity is the occurrence of two or more disorders in one individual [38]. There has been increasing interest in narrowly defined disorders since the introduction of the DSM-III. Since psychopathology does not always fit into the DSM categories and is highly overlapping, the increased splitting of disorders has led to increasing interest in psychiatric comorbidity. The disorders can occur simultaneously in the same individual, or they can occur at different points in time – so-called lifetime comorbidity. Comorbidity over the lifetime presumably expresses a genetic diathesis, an early and enduring risk factor or a long-standing environmental cause. Observed patterns of differential comorbidity will contribute, eventually, to improved nosology. Cross-sectional study of comorbidity focuses on the increase in prevalence for one disorder, given the presence of another [39]. Studies of natural history focus on risk, either through retrospective recall of the timing of one disorder versus the other, or through a true prospective design. For example, in the ECA data, the risk for onset of DSM-III major depressive disorder is 3.4 times higher if the individual has had a panic attack than if the person has not suffered a panic attack [40].
Many mental disorders have their peak periods of onset in adolescence and young adulthood (e.g. depressive disorder, panic disorder, alcohol disorder, substance use disorder and schizophrenia), while many important chronic physical conditions have peak onset in middle age or later (e.g. heart diseases, cancers, type 2 diabetes and strokes). Therefore, physical illness is another type of comorbidity and a possible consequence of psychopathology. In follow-ups based on psychiatric case registers, the systems of registration are usually based on the structure of the treatment systems, which tend to separate psychiatry from other areas of medicine. Thus, only highly specialised registration systems, such as the Oxford Record Linkage Study [41], or the use of two or more illness-based registers, such as the Danish Psychiatric and Cancer registration systems [42], are effective. In population-based follow-ups, such as the Baltimore ECA Follow-up, a difficulty has been anticipating the range of potential consequences. Table 12.2 shows the range of consequences of depression for selected physical conditions and symptoms. Each relative risk in the table was the result of a separate analysis that compared depressive disorder to other forms of psychopathology, and which adjusted for other known risk factors for the physical condition. For many conditions, depressive disorder was the only nontrivial predictor from the range of psychopathology. Consistent with the developmental approach taken above, the effects of psychopathology below the threshold of diagnosis were also important for some physical conditions (not shown). The sizes of the relative risks are large enough to place depressive disorder on a par with other risk factors such as high cholesterol for heart attack, family history for breast cancer, hypertension for stroke and obesity for type 2 diabetes. Since depressive disorder is mostly not treated, in spite of the availability of effective treatments, and since it is relatively easy to screen for it, these data have implications for the practice of preventive medicine in the primary care setting. The table also shows the consequences of the medical conditions for later depression, which are mostly not important, except for stroke, which raises risk for depression significantly.
Table 12.2 Depressive disorder as outcome and predictor of medical conditions over a 13-year follow-up of the Baltimore ECA cohort.
                  Predict medical condition             Predict depressive disorder
Condition         At risk  New cases  Relative risk     At risk  New cases  Relative risk
Type 2 diabetes   1715     89         2.2               1633     71         1.1
Heart attack      1551     64         4.5               1633     71         1.7
Cancer            2017     203        1.0               1633     71         0.6
Stroke            1705     95         2.7               1633     71         8.4
Arthritis         1332     270        1.3               1633     71         1.0
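The 'At risk' and 'New cases' columns of Table 12.2 are enough to compute a crude 13-year cumulative incidence for each physical condition, as sketched below. The adjusted relative risks in the table cannot be recomputed from these columns, because they come from separate models and the split by depression status is not shown. The generic `relative_risk` helper is included only to recall the definition; its name and arguments are assumptions, not part of the original analysis.

```python
# (at risk, new cases) for each medical condition, copied from Table 12.2.
TABLE_12_2 = {
    "Type 2 diabetes": (1715, 89),
    "Heart attack": (1551, 64),
    "Cancer": (2017, 203),
    "Stroke": (1705, 95),
    "Arthritis": (1332, 270),
}


def cumulative_incidence_per_1000(at_risk: int, new_cases: int) -> float:
    """New cases per 1000 persons at risk over the whole follow-up."""
    return 1000.0 * new_cases / at_risk


def relative_risk(cases_exposed: int, n_exposed: int,
                  cases_unexposed: int, n_unexposed: int) -> float:
    """Crude relative risk: risk in the exposed divided by risk in the unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


crude = {name: round(cumulative_incidence_per_1000(n, d), 1)
         for name, (n, d) in TABLE_12_2.items()}
```

Under this reading of the table, the crude cumulative incidence of type 2 diabetes is about 52 per 1000 over the 13 years (89/1715), and that of arthritis about 203 per 1000 (270/1332).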
12.4.2 Functioning Functioning is the ability to deal with the normal demands of everyday life. Persons with psychopathology are often less able to function effectively than the general population. The term as used here includes the World Health Organization definition of disability [43]. Impairment and disability resulting from a given disorder such as schizophrenia are widely variable [44, 45], and most of the costs associated with psychiatric problems come from the reduced functioning, not from the signs and symptoms themselves. The conversion of psychopathology to impairment and disability is thus an important area of study. A growing number of longitudinal studies show that psychopathology has strong consequences for disability, comparable to, or greater than, consequences of chronic physical conditions [46–49].
12.4.3 Mortality Mortality, or the rate of death in the population, is usually higher in individuals with psychopathology than in the general population. Increased mortality is associated with schizophrenia (e.g. [50]), mood disorders (e.g. [24, 51–53]), anxiety disorders [54], cognitive impairment [55] and substance use disorders [56, 57]. Recent data from the public mental health sector estimate a life expectancy reduced by
25 years for those with severe mental disorders as compared to the general population [58]. For some disorders the increased mortality is associated with the signs and symptoms of the disorder itself, as is the situation for suicide with depression. But the risk for suicide is also high for disorders where the connection is less obvious, as in the controversy over panic and suicide [54, 59], and suicide in schizophrenia [60]. The rate of accidental death is also sometimes higher among persons with psychopathology. Other causes of death related to psychopathology are more subtle still. For example, it may be the case that individuals with psychopathology are less likely to engage in illness prevention and health promotion behaviours, such as curtailment of smoking or lowering of cholesterol intake, due to preoccupation with psychopathology or less effective functioning generally. Finally, the mortality rate is raised due to the association with physical conditions which raise risk for death, as discussed above and in Table 12.2.
12.5 Methodological concepts for studying the natural history of psychopathology Measuring onset, course and outcome in the context of a population benefits from a prospective approach. The traditional design for natural history is the cohort study, in which a population of individuals is observed prospectively over years, decades or even a lifetime [61–63]. The minimum design requirement is two waves of data collection. For example, to estimate incidence, the lifetime history of psychopathology is determined at the first wave in order to exclude individuals who have already met the criteria for caseness. At the second wave, those who have become new cases form the numerator of the incidence rate, and those who were never cases at wave 1 form the risk set, or denominator.
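The two-wave logic for incidence can be sketched in a few lines. The function name, the dictionary layout and the toy cohort are hypothetical, and restricting the risk set to persons re-interviewed at wave 2 is a simplifying assumption about how attrition is handled.

```python
from typing import Dict, Tuple


def two_wave_incidence(lifetime_case_w1: Dict[int, bool],
                       case_w2: Dict[int, bool]) -> Tuple[int, int, float]:
    """Return (new cases, size of risk set, cumulative incidence).

    lifetime_case_w1: person id -> True if criteria were ever met by wave 1.
    case_w2: person id -> True if criteria are met at wave 2; persons lost
             to follow-up are simply absent from this mapping.
    """
    # Risk set (denominator): never a case by wave 1, and seen again at wave 2.
    risk_set = [pid for pid, ever in lifetime_case_w1.items()
                if not ever and pid in case_w2]
    if not risk_set:
        raise ValueError("empty risk set")
    # Numerator: members of the risk set who are cases at wave 2.
    new_cases = [pid for pid in risk_set if case_w2[pid]]
    return len(new_cases), len(risk_set), len(new_cases) / len(risk_set)


# Hypothetical toy cohort: person 3 is a prevalent case, person 5 is lost.
wave1 = {1: False, 2: False, 3: True, 4: False, 5: False}
wave2 = {1: False, 2: True, 3: True, 4: False}
numerator, denominator, incidence = two_wave_incidence(wave1, wave2)
```

In the toy cohort the prevalent case is excluded from the denominator, so one new case among three persons at risk gives a cumulative incidence of one third over the interval.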
12.5.1 Attrition Major sources of error in cohort studies are due to attrition, censoring and recall. Attrition is the loss of subjects in longitudinal research usually due to one of three causes: individual mobility outside the study area or to an unknown residence, death and refusal to
participate after some threshold of response burden is reached. In field surveys such as the ECA, attrition after even so short a period as one year can be large enough to threaten the credibility of results. The ECA attrition in one year of follow-up was mostly due to refusal (about 15%) and partly to failure to locate individuals (about 5%). Since the time period was short, there was relatively little attrition due to mortality (less than 1%). In the Baltimore ECA Follow-up, in which the follow-up interview was 13 years after the baseline, the proportions shifted: nearly 25% had died, 12% could not be located and 8% refused [55].
Attrition can bias results. In the 1 year follow-up of the ECA, older white women and younger black males had about twice the rate of attrition of other respondents, and these differences in attrition were larger than differences related to baseline psychopathology [64]. In the Baltimore ECA Follow-up, older persons were more likely to die, but there were also biases connected to psychopathology, such as the tendency for those with cognitive impairment to die, and for those with antisocial characteristics not to be located [55, 65]. Attrition forestalls studying the effect of psychopathology during the interval between baseline and follow-up: for example, there may be a tendency for those with new episodes of disorder to move to another location (e.g. a young person might move to another city to live with parents during recovery). Since both the episode of psychopathology and the attrition occur between waves of interviews, attrition eliminates the possibility of studying this tendency.
In population-based psychiatric case registers, attrition is likely to have different causes and a different structure. In a survey study, persons with psychosis may be more likely to refuse to be interviewed, and more likely to change address and be lost to follow-up after the passage of time.
For a psychiatric case register, refusal is less likely to be important if the level of psychopathology is such as to need or even require treatment, such as might be argued is the case for psychosis. For disorders such as depression, where treatment is often not sought, register data may be severely biased by attrition. For registers of limited geographic spread, mobility will be important; for case registers that cover an entire country, such as in Denmark or Israel, mobility will be much
less important. The upshot of these comparisons is that population-based psychiatric case registers are a useful source of information on the natural history of severe mental disorders such as psychosis.
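One way to reduce bias from attrition is inverse probability weighting, which Section 12.5.5 mentions. The sketch below is a minimal stratified version, not the method of any study cited here: it assumes that, within each baseline stratum, those retained resemble those lost, and the record layout and toy data are hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

Record = Tuple[str, bool, Optional[int]]  # (stratum, retained, outcome)


def ipw_mean(records: List[Record]) -> float:
    """Weighted mean outcome among retained subjects.

    Each retained subject is weighted by the inverse of the retention
    proportion in his or her baseline stratum.
    """
    n_baseline = defaultdict(int)
    n_retained = defaultdict(int)
    for stratum, retained, _ in records:
        n_baseline[stratum] += 1
        if retained:
            n_retained[stratum] += 1
    numerator = denominator = 0.0
    for stratum, retained, outcome in records:
        if retained:
            weight = n_baseline[stratum] / n_retained[stratum]
            numerator += weight * outcome
            denominator += weight
    return numerator / denominator


# Hypothetical cohort: half of the "young" stratum is lost to follow-up.
cohort = [
    ("young", True, 1), ("young", True, 0), ("young", False, None), ("young", False, None),
    ("old", True, 0), ("old", True, 0), ("old", True, 0), ("old", True, 1),
]
naive = sum(o for _, r, o in cohort if r) / sum(1 for _, r, _ in cohort if r)
weighted = ipw_mean(cohort)
```

In this toy cohort the unweighted proportion among those retained is 2/6, while the weighted estimate is 3/8, because each retained young subject also stands in for one who was lost.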
12.5.2 Censoring Censoring is the bias that results from the fact that the period of observation is limited in time. The extreme version of censoring is the cross-sectional study. It is possible to approximate measures of incidence, remission and recurrence using data gathered at one point in time, but this requires assumptions that are not generally tenable. Age of onset can be determined in a cross-sectional sample, for example by asking each respondent who meets lifetime criteria for disorder when the symptoms began. Even if the recall is accurate (discussed below), episodes of individuals who have onsets after the data collection is complete will be omitted, and this will lead to a downward bias in the estimate of age of onset. The problems of censoring are less severe with a cohort study, but exist nevertheless in any study that begins after birth and ends before all members of the cohort have died. In estimating the duration of an episode of psychopathology, for example, there will always be a small portion of the cohort who are in an episode at the time the data collection concludes, making it impossible to estimate the average duration of the episode in the cohort. Since the mean is highly influenced by observations on the tail of the distribution, the bias in the mean can be strong.
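Survival analysis, mentioned in Section 12.5.5, handles episodes that are still in progress when observation ends instead of discarding them or treating them as finished. The sketch below uses the Kaplan–Meier product-limit estimator as one standard example; the chapter does not name a specific estimator, and the durations (in months) are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float],
                 observed: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, probability the episode lasts beyond time) pairs.

    durations: episode length, or time under observation if censored.
    observed: True if the end of the episode was observed,
              False if the episode was still ongoing (censored).
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        ended = sum(1 for d, o in pairs if d == t and o)
        leaving = sum(1 for d, _ in pairs if d == t)
        if ended:
            # Product-limit step: condition on those still under observation.
            survival *= 1.0 - ended / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored episodes leave the risk set silently
        i += leaving
    return curve


# Hypothetical episodes: two are still ongoing when data collection stops.
months = [2, 3, 3, 5, 8]
ended_observed = [True, True, False, True, False]
curve = kaplan_meier(months, ended_observed)
```

Censored episodes contribute to the denominator for as long as they are observed, so the estimated curve does not require knowing how long the unfinished episodes eventually last.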
12.5.3 Prevalence bias Many individuals with mental disorders do not experience a recurrence, and those who do have recurrent episodes represent more chronic and severe cases. For this reason the natural history is best studied by prospective follow-up of a sample of individuals with first lifetime onsets – that is, from the first episode forward. This approach avoids the well-known ‘clinician’s illusion’ [66]. The problems of attrition, censoring and prevalence bias are illustrated in Figure 16.2 (in Chapter 16 of this book) with data from the Danish Psychiatric Case Register on hospital admissions during the period 1973–1988. In contrast to the display
of course in Figure 12.6, this method requires a dichotomous indicator for presence or absence of disorder. The cohort begins with the first episode in the individual’s lifetime wherein the diagnosis of schizophrenia was given. The figure shows survival curves for the 1st, 5th, 10th and 15th episodes. Each curve shows the percentage of individuals who remain outside the hospital (vertical axis) according to time since discharge (horizontal axis). Relapse from the first episode tends to occur in the first few years after discharge; by the fifth year, almost three-quarters of the cohort have had a second episode of hospitalisation. In any given curve, the manner of presentation is immune from the censoring bias, since it correctly portrays the lack of information for the individuals who have not suffered a relapse by the end of the follow-up in 1988. But curves for those with more episodes reveal the effects of prevalence bias, since they are only computed for individuals suffering 4 or more, 9 or more and 14 or more relapses, respectively. Survival in the community is less likely for these cohorts because they represent an increasingly severe subsample of the first admission cohort. Imagine a clinician making an inference from his/her daily experience about the chronicity of a disorder – the clinician sees the most chronic cases 15 times as often as the cases with only one episode. These data show that prevalence bias can generate a falsely pessimistic view of the chronicity and severity of psychopathology.
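The arithmetic behind the clinician's illusion can be made concrete by counting the same cohort two ways, once per person and once per episode. The function and the toy cohort below are hypothetical and are not drawn from the Danish register data.

```python
from typing import Sequence, Tuple


def person_vs_episode_share(episode_counts: Sequence[int],
                            min_episodes: int) -> Tuple[float, float]:
    """Share of persons, and share of episodes, due to chronic cases.

    episode_counts: number of episodes for each person in a first-onset cohort.
    min_episodes: persons with at least this many episodes count as chronic.
    """
    chronic = [n for n in episode_counts if n >= min_episodes]
    share_of_persons = len(chronic) / len(episode_counts)
    share_of_episodes = sum(chronic) / sum(episode_counts)
    return share_of_persons, share_of_episodes


# Hypothetical cohort: 15 persons with one episode, 1 person with 15 episodes.
counts = [1] * 15 + [15]
persons_share, episodes_share = person_vs_episode_share(counts, min_episodes=15)
```

Here the single chronic case is about 6% of the persons but accounts for half of the episodes a clinician would see, which is how a treated sample can look far more chronic than the first-onset cohort it came from.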
12.5.4 Recall Recall bias is the error in measurement due to inaccuracies in the respondent’s memory of events. The cross-sectional approach is compromised because it relies on the respondent’s autobiographical memory to recall the time of the onset, which may be quite distant from the time of the data collection. It is likely that those with more recent onsets are less likely to forget the occurrence of the disorder, which biases the onset distribution toward later onset. If the disorder tends to occur early in life, as many mental disorders do, the tendency to forget distant episodes can generate findings with nonsensical data on lifetime prevalence [67, 68], and may also suggest an upward trend in the occurrence of the
disorder [69], as in the suggestion of an ‘age of melancholy’ [70]. Simulation models suggest it takes only a small difference in recall to produce the appearance of strong upward trends in occurrence [71]. It is likely that those with severe cases of disorder are less likely to forget the occurrence of disorder; if severity is associated with earlier onset, this bias would be toward earlier onset. The study of risk factors will be further complicated because individuals may not remember the order of occurrence of the risk factor and the onset. Thus, retrospective data from a cross-sectional approach include a mixture of biases that are sometimes undecipherable. The same mistakes in recall can occur in the cross-sectional or prospective design. But in the prospective design, the mistakes made by an individual are likely to be smaller than in the cross-sectional design, because the time of data collection is closer to the present for the individual, especially at the second or later waves where new onsets are determined. The effects of error are complex in the prospective design, because the biases can concatenate in so many different ways. For example, in the East Baltimore ECA panel cohort, there were 2622 individuals who had never in their lifetimes met criteria for diagnosis of panic disorder by the time of the interview at Wave 1; 20 of these met criteria at Wave 2, giving a cumulative annual incidence rate of about 7 per 1000 per year [4]. There were 40 individuals at Wave 1 who met criteria for past or present diagnosis; of these, 20 reported never having experienced a panic attack at Wave 2. These 20 might be labelled ‘reverse incidence’. They represent half (20/40) of those meeting criteria for diagnosis at Wave 1; they match exactly the number (20) of incident cases. This phenomenon is not unique to the ECA surveys. 
The existence of reverse incidence is due to forgetting and, while disquieting, does not negate the existence of the 20 cases in the numerator of the incidence rate. It does suggest that forgetting of episodes occurs, a tendency that would bias prevalence rates downward; and, probably, bias incidence rates upward. The upward bias in incidence would occur because cases that belong in the numerator of the attack rate would be mixed in with the numerator of the first incidence rate. Lack of blind measurement is an important problem in estimation as regards outcome. The dependence of outcome on initial state is a central focus
of research on natural history, but it may be difficult to measure outcome independently of initial state. If the respondent or the interviewer remembers the initial measurement session, the results of that session are likely to bias measurement of outcome. For example, an interviewer may probe more persistently for the occurrence of panic attacks if it is known that they have occurred in the recent, or even distant, past. Impairment and disability are likely to be rated downward if it is known that the individual once met the criteria for diagnosis of schizophrenia, even if no signs and symptoms are present at the time of the follow-up. Thus, bias due to lack of blindness is likely to overestimate the relationship of early indicators of psychopathology to later outcomes.
Random error has counterintuitive pernicious effects in prospective research on the natural history of disorder. Indeed, in the context of estimating incidence in field surveys, the concept of random error is not very useful. If by random error is meant an equiprobable response, then it is straightforward to show that, for a sample, the resulting bias is moderately upward for prevalence and strongly upward for incidence. The rates of false-positive and false-negative answers to a given question will depend on the question and will not be equiprobable, in general; but many other types of errors in the survey process – mistakes in data entry, for example – will have an equiprobable character to them. Thus, the tendency is for seemingly random errors to bias the incidence and recurrence rates upward.
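The upward bias from equiprobable error follows from simple arithmetic when the true rate is below one half. The sketch below assumes, as a simplification, that each yes/no answer is flipped with the same small probability regardless of the true state; the rates and the 1% error probability are hypothetical, and for incidence only misclassification at the second wave is considered.

```python
def observed_rate(true_rate: float, flip_prob: float) -> float:
    """Expected observed proportion when each answer is flipped with flip_prob."""
    true_positives_kept = true_rate * (1.0 - flip_prob)
    false_positives_added = (1.0 - true_rate) * flip_prob
    return true_positives_kept + false_positives_added


def inflation(true_rate: float, flip_prob: float) -> float:
    """Ratio of the observed to the true proportion."""
    return observed_rate(true_rate, flip_prob) / true_rate


FLIP = 0.01  # hypothetical probability that any single answer is wrong
prevalence_inflation = inflation(0.05, FLIP)    # hypothetical true prevalence of 5%
incidence_inflation = inflation(0.007, FLIP)    # hypothetical true incidence of 0.7%
```

Under these assumptions a 5% prevalence is observed as 5.9%, a ratio of about 1.2, while a 0.7% incidence is observed as about 1.7%, a ratio of about 2.4. The rarer the true event, the more the few false positives drawn from the large pool of non-cases dominate the numerator.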
12.5.5 Statistical Innovations There has been an explosion of statistical techniques over the last several decades which address many of the problems of prospective studies. Problems of censoring are addressed with the family of techniques called survival analysis (e.g. [72]). The development of covariation over time can be studied with second-order generalised estimating equations [73]. Risk factors at different stages of the disease may be differentially related to disease progression only above or below the threshold set by the diagnosis. In this situation, the diagnostic threshold might be reconsidered. Statistical techniques to locate a threshold have been developed [74]. Latent growth mixture models, which are statistical techniques suitable for four or more waves of analysis, with continuous and categorical constructs not directly observable, have been developed [75–77]. Inverse probability weighting techniques allow inference to the baseline sample in a cohort study, even in the presence of attrition [78].
12.6 Conclusion Studying the natural history of psychopathology in the general population requires large resources of effort and expense because of the combination of population-based sampling, long-term commitment and intensity of measurement. Most data on natural history are based on clinical samples, which are not representative of the population of persons with mental disorders. There are few benchmark estimates for the incidence of most major mental disorders that have been replicated and for which there is a consensus among investigators. The estimates for parameters of long-term course of disorders are widely varying. Thus, there is plenty of progress to be made!
Acknowledgements This work was supported by NIDA grant DA026652 and NIMH grant MH53188.
References
[1] Berkson, J. (1946) Limitations of the application of fourfold table analysis to hospital data. Biom. Bull., 2, 47–53.
[2] Kuh, D. and Ben-Shlomo, Y. (1997) A Life Course Approach to Chronic Disease Epidemiology, Oxford University Press, New York.
[3] Eaton, W.W., Kramer, M., Anthony, J.C. et al. (1989a) The incidence of specific DIS/DSM-III mental disorders: data from the NIMH Epidemiologic Catchment Area Program. Acta Psychiatr. Scand., 79, 163–178.
[4] Eaton, W.W., Kramer, M. and Anthony, J.C. (1989b) Conceptual and methodological problems in estimation of the incidence of mental disorders from field survey data, in Epidemiology and the Prevention of Mental Disorders (eds B. Cooper and T. Helgason), Routledge, London, pp. 108–127.
[5] Baltes, P.B., Reese, H.W. and Lipsitt, L.P. (1980) Lifespan developmental psychology. Annu. Rev. Psychol., 31, 35–110.
[6] McHugh, P.R. and Slavney, P.R. (1998) The Perspectives of Psychiatry, Johns Hopkins University Press, Baltimore.
[7] Lilienfeld, D.E. and Stolley, P.D. (1994) Foundations of Epidemiology, Oxford University Press, New York.
[8] Eaton, W.W., Badawi, M. and Melton, B. (1995) Prodromes and precursors. Epidemiologic data for primary prevention of disorders with slow onset. Am. J. Psychiatry, 152 (7), 967–972.
[9] Kleinbaum, D.G., Kupper, L.L. and Morgenstern, H. (1982) Epidemiologic Research: Principles and Quantitative Methods, Lifetime Learning, Belmont, CA.
[10] Dryman, A. and Eaton, W.W. (1991) Affective symptoms associated with the onset of major depression in the community: findings from the U.S. NIMH epidemiologic catchment area program. Acta Psychiatr. Scand., 84, 1–5.
[11] Horvath, E., Johnson, J., Klerman, G.L. et al. (1992) Depressive symptoms as relative and attributable risk for first-onset major depression. Arch. Gen. Psychiatry, 49, 817–823.
[12] Eaton, W.W., Anthony, J.C., Gallo, J. et al. (1997) Natural history of DIS/DSM major depression: the Baltimore epidemiologic catchment area follow-up. Arch. Gen. Psychiatry, 54, 993–999.
[13] Lilienfeld, A.M. and Lilienfeld, D.E. (1980) Foundations of Epidemiology, 2nd edn, Oxford University Press, New York.
[14] Expert Committee on Health Statistics (1959) Sixth report, World Health Organization, Geneva.
[15] MacMahon, B., Pugh, T.F. and Ipsen, J. (1960) Epidemiologic Methods, Little, Brown, Boston, MA.
[16] Mausner, J.S. and Kramer, S. (1985) Epidemiology: An Introductory Text, WB Saunders, Eastbourne.
[17] National Center for Health Statistics (1977) Health Interview Survey Procedures 1957–1974: Vital and Health Statistics, Series 1, No. 11, US Government Printing Office, Washington, DC.
[18] Sartwell, P.E. and Last, J.M. (1980) Epidemiology, in Maxcy-Rosenau Public Health and Preventive Medicine, 11th edn (ed. J.M. Last), Appleton-Century-Crofts, New York, p. 985.
[19] Morris, J.N. (1975) Uses of Epidemiology, 3rd edn, Churchill Livingstone, Edinburgh.
[20] Tyrer, P. (1985) Neurosis divisible? Lancet, 8430, 685–688.
[21] Brown, G.W. and Birley, J.L.T. (1968) Crises and life changes and the onset of schizophrenia. J. Health Soc. Behav., 9, 203–214.
[22] Kramer, M., Von, K.M. and Kessler, L. (1981) The lifetime prevalence of mental disorders: estimation, uses and limitations. Psychol. Med., 10, 429–436.
[23] Kramer, M. (1957) Discussion of the concepts of prevalence and incidence as related to epidemiologic studies of mental disorders. Am. J. Public Health, 47, 826–840.
[24] Murphy, J., Monson, R.R., Olivier, D.C. et al. (1987) Affective disorders and mortality: a general population study. Arch. Gen. Psychiatry, 44, 473–480.
[25] Fichter, M.M., Koch, H.J., Rehm, J. et al. (1987) Adversity and the risk of mental illness: preliminary results of the Upper Bavarian restudy, in From Social Class to Social Stress (ed. M.C. Angermeyer), Springer, Berlin.
[26] Hagnell, O., Essen-Moller, E., Lanke, J. et al. (1990) The Incidence of Mental Illness Over a Quarter of a Century, Almqvist and Wiksell International, Stockholm.
[27] Bijl, R.V., van Zessen, G., Ravelli, A. et al. (1998) The Netherlands mental health survey and incidence study (NEMESIS): objectives and design. Soc. Psychiatry Psychiatr. Epidemiol., 33, 581–586.
[28] Kessler, R.C. (1995) Epidemiology of psychiatric comorbidity, in Textbook in Psychiatric Epidemiology (eds M.T. Tsuang, M. Tohen and G.E.P. Zahner), John Wiley & Sons, Inc., New York, pp. 179–197.
[29] Ojesjo, L., Hagnell, O. and Lanke, J. (1982) Incidence of alcoholism among men in the Lundby community cohort. Sweden 1957–1972. J. Stud. Alcohol, 43, 1190–1198.
[30] Frank, E., Prien, R.F., Jarrett, R.B. et al. (1991) Conceptualization and rationale for consensus definitions of terms in major depressive disorder. Arch. Gen. Psychiatry, 48, 851–855.
[31] Philipp, M. and Fickinger, M.P. (1993) The definition of remission and its impact on the length of a depressive episode. Arch. Gen. Psychiatry, 50, 407–408.
[32] Wing, J.K., Babor, T., Brugha, T. et al. (1990) SCAN: schedules for clinical assessment in neuropsychiatry. Arch. Gen. Psychiatry, 47, 589–593.
[33] Falloon, R.H., Grant, N., Marshall, J.L.B. et al. (1983) Relapse in schizophrenia: a review of the concept and its definitions editorial. Psychol. Med., 13, 469–477.
[34] Eaton, W.W., Anthony, J., Romanoski, A. et al. (1998) Onset and recovery from panic disorder in the Baltimore epidemiologic catchment area follow-up. Br. J. Psychiatry, 173, 501–507. [35] Ciompi, L. (1980) Catamnestic long-term study on the course of life and aging of schizophrenics. Schizophr. Bull., 6, 606–618. [36] Eaton, W.W., Bilker, W., Haro, J.M. et al. (1992) The long-term course of hospitalization for schizophrenia: change in rate of hospitalization with passage of time. Schizophr. Bull., 18, 185–207. [37] Miech, R., Power, C. and Eaton, W. (2007) Disparities in psychological distress across education and sex: a longitudinal analysis of their persistence
STUDYING THE NATURAL HISTORY OF PSYCHOPATHOLOGY
[38] [39]
[40]
[41] [42]
[43]
[44]
[45] [46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
within a cohort over 19 years. Ann. Epidemiol., 17, 289–295. Feinstein, A. (1967) Clinical Judgement, Williams and Wilkins, Baltimore. Merikangas, K.R., Angst, J., Eaton, W. et al. (1996) Comorbidity and boundaries of affective disorders with anxiety disorders and substance misuse: results of an international task force. Br. J. Psychiatry Suppl., 168 (30), 58–67. Andrade, L., Eaton, W.W. and Chilcoat, H. (1996) Lifetime comorbidity of panic attacks and major depression in population-based study: age of onset. Psychol. Med., 26, 991–996. Acheson, E.D. (1967) Medical Record Linkage, Oxford University Press, London. Mortensen, P.B. and Juel, K. (1993) Mortality and causes of death in first-admitted schizophrenic patients. Br. J. Psychiatry, 163, 183–189. WHO (1980) International Classification of Impairments, Disabilities, and Handicaps. World Health Organization, Geneva. Jablenski, A., Schwartz, R. and Tomov, T. (1980) WHO collaborative study on impairments and disabilities associated with schizophrenic disorders. Arch. Gen. Psychiatry, 62 (Suppl. 285), 152–163. Eaton, W.W. (1991) Update on the epidemiology of schizophrenia. Epidemiol. Rev., 13, 320–328. Hays, R.D., Wells, K.B., Sherbourn, C.D. et al. (1995) Functioning and well-being outcomes of patients with depression compared with chronic general medical illnesses. Arch. Gen. Psychiatry, 52, 11–19. Kouzis, A.C. and Eaton, W.W. (1995) Disability days and psychopathology. Am. J. Public Health, 84, 1304–1307. Kouzis, A.C. and Eaton, W.W. (1997) Psychopathology and the development of disability. Soc. Psychiatry Psychiatr. Epidemiol. 32, 379–386. Armenian, H.K., Pratt, L.A., Gallo, J.J. et al. (1998) Psychopathology as a predictor of disability: a population-based follow-up study in Baltimore, Maryland. Am. J. Epidemiol., 148, 269–275. Babigian, H.M. and Odoroff, C.L. (1969) The mortality experience of a population with psychiatric illness. Am. J. Psychiatry, 126, 470–480. Black, D.W., Warrack, G. and Winokur, G. 
(1985) The Iowa record linkage study. I. Suicides and accidental deaths among psychiatric patients. Arch. Gen. Psychiatry, 42, 71–75. Harris, E.C. and Barraclough, B. (1998) Excess mortality of mental disorder. Br. J. Psychiatry, 173, 11–53. Wulsin, L.R., Vaillant, G.E. and Wells, V. (1999) A systematic review of the mortality of depression. Psychosom. Med., 61, 6–17.
[54] Weissman, M.M., Klerman, G.L., Markowitz, J.S. et al. (1989) Suicidal ideation and suicide attempts in panic disorder and attacks. N. Engl. J. Med., 321, 1209–1214. [55] Badawi, M.A., Eaton, W.W., Myllyluoma, J. et al. (1999) Psychopathology and attrition in the Baltimore ECA follow-up 1981–1996. Soc. Psychiatry Psychiatr. Epidemiol., 34, 91–98. [56] Kouzis, A., Eaton, W.W. and Leaf, P.J. (1995) Psychopathology and mortality in the general population. Soc. Psychiatry Psychiatr. Epidemiol., 30 (4), 165–170. [57] Neumark, Y.D., Van Etten, M.L. and Anthony, J.C. (2000) Drug dependence and death: survival analysis of the Baltimore ECA sample from 1981 to 1995. Subst. Use Misuse, 35, 49–63. [58] Colton, C.W. and Manderscheid, R.W. (2006) Congruencies in increased mortality rates, years of potential life lost, and causes of death among public mental health clients in eight states. Prev. Chronic Dis., 3(2), A42 [Epub 2006 Mar 15]. [59] Anthony, J.C. and Petronis, K.R. (1991) Panic attacks and suicide attempts. Arch. Gen. Psychiatry, 48, 11–14. [60] Herrman, H.E. (1987) Re-evaluation of the evidence on the prognostic importance of schizophrenic and affective symptoms. Aust. N.Z. J. Psychiatry, 21, 424–427. [61] Breslow, N.E. and Day, N.E. (1987) Statistical Methods in Cancer Research. II. The Design and Analysis of Cohort Studies, International Agency for Research on Cancer, Lyon. [62] Samet, J.M. and Munoz, A. (eds) (1998) Epidemiologic Reviews: Cohort Studies, The Johns Hopkins University School of Hygiene and Public Health, Baltimore, MD. [63] Eaton, W.W. (2002) The logic for a national conception-to-death cohort study. Ann. Epidemiol., 12, 445–451. [64] Eaton, W.W., Anthony, J.C., Tepper, S. et al. (1992) Psychopathology and attrition in the epidemiologic catchment area surveys. Am. J. Epidemiol., 134, 1041–1059. [65] Eaton, W.W., Kalaydjian, A., Scharfstein, D.O., Mezuk, B. and Ding, Y.
(2007) Prevalence and incidence of depressive disorder: the Baltimore ECA follow-up, 1981–2004. Acta Psychiatr. Scand., 116 (3), 182–188. [66] Cohen, P. and Cohen, J. (1984) The clinician’s illusion. Arch. Gen. Psychiatry, 41, 1178–1182. [67] Robins, L.N., Helzer, J.E., Weissman, M.M. et al. (1984) Lifetime prevalence of specific psychiatric disorders in three sites. Arch. Gen. Psychiatry, 41, 949–958.
[68] Parker, G. (1987) Are the lifetime prevalence estimates in the ECA Study accurate? Psychol. Med., 17, 275–282. [69] Klerman, G.L. and Weissman, M.M. (1989) Increasing rates of depression. J. Am. Med. Assoc., 261, 2229–2235. [70] Hagnell, O., Lanke, J., Rorsman, B. et al. (1982) Are we entering an age of melancholy? Depressive illnesses in a prospective epidemiological study over 25 years: the Lundby study, Sweden. Psychol. Med., 12, 279–289. [71] Giuffra, L. and Risch, N. (1994) Diminished recall and the cohort effect of major depression: a simulation study. Psychol. Med., 24, 375–383. [72] Lawless, J.F. (1982) Statistical Models and Methods for Lifetime Data, John Wiley & Sons, Inc., New York. [73] Zeger, S.L. and Liang, K.-Y. (1986) Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42 (1), 121–130. [74] Scharfstein, D., Liang, K., Eaton, W. and Chen, L.-S. (2001) The quadratic cumulative odds regression
model for scored ordinal outcomes: application to alcohol dependence. Biostatistics, 2, 2473–2483. [75] Muthen, B. (2001) Second-generation structural equation modeling with a combination of categorical and continuous latent variables: new opportunities for latent class/latent growth modeling, in New Methods for the Analysis of Change (eds L.L. Collins and A. Sayer), American Psychological Association, Washington, DC, pp. 291–322. [76] Bollen, K.A. (1989) Structural Equations with Latent Variables, John Wiley & Sons, Inc., New York. [77] McArdle, J.J. and Hamagami, F. (1992) Modeling incomplete longitudinal and cross-sectional data using latent growth structural models. Exp. Aging. Res., 18, 145–166. [78] Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994) Analysis of semiparametric regression models for repeated outcomes under the presence of missing data. J. Am. Stat. Assoc., 90, 106–121.
13 Symptom scales and diagnostic schedules in adult psychiatry
Jane M. Murphy
Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School; Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
13.1 Introduction
Psychiatric ‘scales’ concern dimensions of psychopathology while ‘schedules’ deal with categories of psychiatric disorders. A scale queries a set of inter-related symptoms that constitute a continuum from a few to many symptoms. It concerns a quantitative gradient based on symptoms representing a qualitative theme. Many scales reflect influence from psychometric theory and survey methodology. Psychologists and sociologists have been more prominent as their designers than psychiatrists.
Schedules are based on the syndromes that define diagnostic categories as described in the Diagnostic and Statistical Manuals (DSMs) of the American Psychiatric Association [1–3] and the recent versions of the International Classification of Diseases (ICD) of the World Health Organization [4, 5]. A syndrome is a pattern of symptoms made up of ‘essential features’, ‘associated symptoms’, ‘duration’ and frequently also ‘disability’. Depending on the completeness of the pattern, the syndrome is considered to be present or absent thereby reflecting dichotomous measurement. Psychiatrists have played active roles in the construction of schedules.
Most scales used thus far in psychiatric epidemiology deal with anxiety and/or depression. Each question is asked of each subject, and categories of response refer to the presence or absence of a symptom, its frequency of occurrence or the degree to which it is bothersome. Responses are given numerical values that are added together to form a score, the range of which has a ‘cutting-point’ that allows cases to be separated from non-cases. The distribution of scale scores in a general population shows marked skewness. The majority of people report that they do not have these symptoms, or only a few; and a minority of people report that they suffer from several to many. In other words, the score distributions for psychiatric scales are not normally distributed as are height, weight, IQ and some social attitudes.
Diagnostic schedules are more comprehensive in psychiatric coverage in that they deal with psychotic disorders and substance abuse as well as depression and anxiety. The schedules differ in terms of whether they focus on the clinical status at the time of the interview or on the subject’s history of psychiatric disorders. The schedules designed for epidemiological research use the ‘lifetime’ approach and are highly structured so that clinical judgement need not be applied during the course of the interview. Most of them have ‘modules’ for the separate diagnoses thereby allowing the researcher to be selective. Often, the module opens with one or two ‘stem’ questions about the ‘essential features’ of that diagnosis. If the subject responds negatively to the ‘stem’, the module for some diagnoses can be skipped. The schedules contain careful instructions to the interviewers about ‘skip-outs’. Even if a given module is started, skip-outs occur as it becomes clear that the subject will not meet the criteria for that diagnosis.
Most scales and schedules are known by an acronym standing for the full name of the instrument, and sometimes by the name of the designer.
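The scoring procedure for a symptom scale described above can be illustrated with a minimal sketch. It assumes a hypothetical five-item scale, item responses coded 0 to 3, an invented cutting point of 8, and the convention that a score at or above the cutting point counts as a case. None of these values comes from a published instrument, and real scales differ in their coding and thresholds.

```python
from typing import Dict, List, Tuple

# Hypothetical cutting point for a hypothetical five-item scale
# (illustrative only; not taken from any published instrument).
CUTTING_POINT = 8


def scale_score(responses: List[int]) -> int:
    """Add the numerical values of the item responses to form a score."""
    return sum(responses)


def is_case(score: int, cutting_point: int = CUTTING_POINT) -> bool:
    """Assumed convention: a score at or above the cutting point is a case."""
    return score >= cutting_point


# Invented respondents; each item is coded 0 (absent) to 3 (very bothersome).
respondents: Dict[str, List[int]] = {
    "A": [0, 0, 1, 0, 0],
    "B": [2, 3, 1, 2, 3],
    "C": [0, 0, 0, 0, 0],
}

# Map each respondent to (score, case status).
results: Dict[str, Tuple[int, bool]] = {
    rid: (scale_score(items), is_case(scale_score(items)))
    for rid, items in respondents.items()
}

for rid, (score, case) in results.items():
    print(rid, score, "case" if case else "non-case")
```

In this invented sample most respondents score at or near zero and only one exceeds the cutting point, which mirrors the skewed distribution of scale scores described in the text.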
Table 13.1 Scales and schedules (a).
BDI         Beck Depression Inventory [109]
CES-D       Center for Epidemiologic Studies Depression Scale [41]
CIDI        Composite International Diagnostic Interview [68]
CIS         Clinical Interview Schedule [139]
CIS-R       Clinical Interview Schedule Revised [191]
CMI         Cornell Medical Index [89]
CSI         Cornell Selectee Index [15]
DIS         Diagnostic Interview Schedule [58]
DPAX        Depression and Anxiety Schedule [37, 38]
Eysenck     Eysenck Personality Inventory [13, 14]
GHQ         General Health Questionnaire [138]
HOS         Health Opinion Survey [20]
HSCL        Hopkins Symptom Checklist [95, 96]
ISPI        Iowa Structured Psychiatric Interview [52]
MMPI        Minnesota Multiphasic Personality Inventory [12]
MSS         Mental Status Schedule [113]
NSA         Neuropsychiatric Screening Adjunct of the US Army [16]
PERI        Psychiatric Epidemiological Research Instrument [48]
PHQ-G       Personal Health Questionnaire (Goldberg and Simpson, in Rizzo et al. [149]) (b)
PHQ-S       Patient Health Questionnaire [132] (b)
PRIME-MD    Primary Care Evaluation of Mental Disorders [131]
PSE         Present State Examination [54, 150]
PSS         Psychiatric Status Schedule [114]
SADS        Schedule for Affective Disorders and Schizophrenia [55]
SCAN        Schedules for Clinical Assessment in Neuropsychiatry [79]
SCID        Structured Clinical Interview for DSM-III-R [81, 126]
SCL-90      Symptom Checklist 90 Items [99]
SF-36       Short-Form Health Survey 36 items [135]
UM-CIDI     University of Michigan CIDI [69]
WMH-CIDI    World Mental Health CIDI [78]
Zung        Zung Depression Scale [110]
22IS        Twenty-Two Item Scale [21]
(a) The references shown in Table 13.1 are those in which the instrument is most fully described.
(b) The instruments developed by Goldberg and Simpson and by Spitzer et al. have different names but the same acronym. For purposes here, the former is called PHQ-G and the latter PHQ-S.
Table 13.1 gives a list of acronyms and names of the instruments discussed in this chapter. The selection is intended to provide a historical overview since many of the instruments used today are outgrowths or revisions of earlier ones. The selection is not, however, exhaustive. Specifically excluded are rating scales for use by observers rather than for asking direct questions. The Hamilton Rating Scale for Depression [6] and the Brief Psychiatric Rating Scale [7] are examples of well-known instruments excluded on this basis.
Most of the symptom scales were developed as paper/pencil questionnaires since they were intended for use in the armed forces or in clinical settings. For epidemiological research, the face-to-face interview has been the typical mode of data-gathering although increasingly telephone interviews have been used. Experimentation is going on regarding the use of computers either as the mode of inquiry or as an interview aid [8].
Figure 13.1 gives a general overview of the field and introduces the instruments in the context of their development. The dates shown on the vertical axis refer to the first publication that describes how the instrument was constructed and gives evidence about reliability and validity. The separate parts of the figure refer to North America, the United Kingdom and the World Health Organization. Each of these parts is divided into the types of settings where an instrument was developed: population-based epidemiology, psychiatric facilities and primary care. The scales are shown in white letters and the schedules in black letters. Solid lines that end in an arrow show the evolution of an instrument as carried out by a given group held together by one or two leaders. Dotted lines that end in an arrow reflect that an instrument developed by one group of researchers influenced a subsequent instrument designed by a different group. Almost without exception, instruments designed for one setting have been used in other settings. Further, each new instrument has been influenced to some degree by prior instruments, and there are several lines of influence that flow across the geographic boundaries.
Fig. 13.1 Time, place, and purpose of instruments developed for adult psychiatry.
For epidemiological research, the instruments have been used in both ‘single-stage’ studies, where a given instrument is administered to each individual
of a population sample, and ‘two-stage’ designs that typically involve a scale at the first stage followed by a more complete psychiatric work-up by means of a schedule for a subset of subjects, a large portion of whom gave evidence of a psychiatric history at the first stage.
In addition to the instruments themselves, the studies in which they have been used will also be mentioned. This means that most of the major investigations in psychiatric epidemiology will be noted, except for one of the most long-lasting studies, the Lundby Study in Sweden [9–11]. The Lundby data derive from face-to-face interviews carried out by psychiatrists who followed an outline of questions, but a schedule based on the questions has not been published.
While it is not the intent of this chapter to cover personality assessment per se, two instruments designed for that purpose need to be mentioned because of their influence on psychiatric instruments. One is the Minnesota Multiphasic Personality Inventory (MMPI), which was designed in the United States during the 1930s [12], and the other is the Eysenck Personality Inventory [13, 14], which was created in the United Kingdom after World War II. Many of the psychiatric instruments have borrowed questions from one or another of these inventories. Recently, the broad dimensions of the Eysenck Inventory, ‘Introversion’ and ‘Extraversion’, have contributed to discussions about possible future directions for psychiatric measurement.
13.2 North American instruments for epidemiological research
During World War II two scales were designed to screen for psychopathology among Army recruits: the Cornell Selectee Index (CSI) [15] and the US Army’s Neuropsychiatric Screening Adjunct (NSA) [16], both of which, but especially the NSA, stood up well to extensive psychometric testing. After the War, the CSI and NSA were administered to the same individuals [17]. The correlation was very high, and recommendations were made for the development of a new psychiatric instrument that would incorporate the best features of each and would be appropriate for general use. Unfortunately such an instrument was not developed. The CSI was transformed into a clinical instrument while the NSA was never changed, updated or used again as an independent instrument. However, the NSA strongly influenced the subsequent instruments for population-based epidemiology.
The two World Wars indicated that psychiatric disorders were much more common than shown in mental hospital statistics. To investigate the question of ‘how much?’, two epidemiological studies of general populations were undertaken: the Stirling County Study [18] conducted in Atlantic Canada, and the Midtown Manhattan Study [19] in New York City. Each used a rather long structured interview administered by lay interviewers. In addition, each study produced a shorter screening instrument. From the Stirling Study came the Health Opinion Survey (HOS) [20] and from the Midtown Study
the Twenty-Two Item Scale (22IS) [21]. Through empirical testing, each came to include a number of questions from the Army’s NSA, which concentrated on a type of general anxiety centred on ‘nervousness’ and on the autonomic expressions of fearfulness, as indicated by ‘pounding heart’, ‘cold sweats’ and other signs that the ‘body’s alarm system’ had been activated. This type of anxiety will be called here ‘autonomic anxiety’. There was less coverage of depression, which was in line with the fact that, at that time, anxiety was thought of as the hallmark of neurotic disorders and depression was mainly considered a psychotic disorder.
During the 1960s and 1970s, the HOS and 22IS, despite being presented as screening instruments, were used as the main source of data-gathering in several other epidemiologic studies [22–28]. In addition, they were adapted for studies of US national samples [29–31]. These national sample studies were not, however, epidemiological in the usual sense because they did not estimate prevalence but rather focused on the proportions of the samples who answered individual questions in particular ways.
Both the Stirling and Midtown Studies reported that, counting all types of psychiatric disorders together, prevalence was much higher than expected, at approximately 20% in each study. The Stirling Study has continued with repeated cross-sectional surveys and cohort follow-up to provide a 40-year epidemiological perspective which is based on both face-to-face interviews with subjects and interviews by psychiatrists with the subjects’ general physicians [32–36]. Using a longer schedule that included the HOS, a computerised algorithm was designed for the longitudinal research. Both the longer schedule and the algorithm were given the acronym ‘DPAX’. This acronym was selected to emphasise that the procedures focused specifically on depression, represented by ‘DP’ and anxiety, represented by ‘AX’ [37].
The algorithm has steps for ‘essential features’, ‘associated symptoms’, ‘duration’ and ‘impairment’. Two versions of the schedule and algorithm (DPAX-1 and DPAX-2) were constructed to accommodate historical changes in the colloquial vernacular by which the mood of depression and the sensations of anxiety were recognized (for example, the idiom of ‘being in poor spirits’ became outmoded while ‘feeling low and hopeless’ was easily understood) [38]. Using these methods, it was found that these disorders exhibited quite steady prevalence and tended to be chronic with low incidence, and that depression carried a significant mortality risk [39, 40].
The next instrument to appear on the US scene was the Center for Epidemiologic Studies Depression Scale (CES-D) [41]. Designed at the National Institute of Mental Health (NIMH), it deals exclusively with depression and has considerable face validity for that syndrome in contrast to what were perceived as the ambiguities of autonomic anxiety. It was better accepted by clinical psychiatrists than its predecessors. In part this was based on the profession’s increasing appreciation of the importance of overt symptoms in contrast to unconscious anxiety and intrapsychic features. The CES-D was first used in an epidemiologic study in Missouri and Maryland [42, 43] and now has been used in many studies, both clinical and epidemiological, where assessment of the current level of depressed mood is needed [44–47].
The Psychiatric Epidemiologic Research Instrument (PERI) [48] was developed by Dohrenwend and colleagues to produce empirically distinct and reliable scales among different ethnic and racial groups. It is much broader than earlier instruments and has scales for ‘false beliefs and perceptions’, ‘manic characteristics’, ‘suicide ideation’ and so on. It also contains items from the NSA, HOS and 22IS which are described as ‘non-specific psychological distress’ [49] or ‘demoralisation’ [50]. The Dohrenwend group distinguish between ‘demoralisation’ as a dimension and diagnosable categories of psychiatric disorder. One of the main studies using the PERI is a large two-stage investigation named the Israeli Study of Psychiatric Disorder and Social Status [51].
At about the same time as the PERI, another broad-range instrument, the first of the diagnostic schedules developed in the United States, was presented as the Iowa Structured Psychiatric Interview (ISPI) [52] and used in the Iowa 500 Study [53], which involved follow-up of mental hospital patients and normal controls. It was designed to be acceptable to people who did not suffer from a psychiatric disorder. Influenced by the British Present State Examination (PSE) [54], it focused on psychiatric categories but begins with a core of 20 screening
questions for depression, mania, schizophrenia and neurosis. In the body of the interview, attention is given to the duration and history of the symptoms. Encouragement about the feasibility of using a diagnostic schedule in a full-scale general population study came from a group of researchers who trained lay interviewers to administer a clinical instrument, the Schedule for Affective Disorders and Schizophrenia (SADS) [55, 56]. The demonstration that a long instrument dealing with psychotic as well as milder disorders could be used successfully in a community study was the beginning of a new trend in which epidemiologic research would be focused on multiple and discrete categories of psychiatric disorders [57]. The first epidemiologic instrument designed with this new goal in mind was the Diagnostic Interview Schedule (DIS) which implemented the criteria outlined in DSM-III [58]. The DIS was used in the NIMH Epidemiologic Catchment Area (ECA) program which had been mandated by President Carter’s Commission on Mental Health [59] in order to provide an up-to-date and comprehensive overview of the prevalence of psychiatric disorders in the United States. Most of the earlier studies had drawn samples of 1000–1500 subjects, but the ECA’s sample consisted of more than 20,000 subjects from five mental health catchment areas in different parts of the United States [60, 61]. The DIS has gone through several revisions mainly based on changes in DSM criteria. The original DIS dealt with schizophrenia, mania, depression, panic, phobias, obsessive–compulsive, somatisation, alcohol and drug abuse as well as antisocial personality, thus forecasting that larger and larger numbers of diagnoses would be involved in instruments of this kind. Many reports on the different categories of disorder have been published and the overall annual prevalence was reported as 20%, a rate that continued to be higher than expected [62]. 
The Baltimore portion of the ECA has become the Baltimore ECA Follow-up Study with two periods of re-interviews that, combined, cover nearly a quarter of a century [63, 64]. The DIS has also been used in different countries [65, 66]. One of the largest of these was carried out in Canada, the Edmonton Psychiatric Epidemiology Study [67] which reported comparable current prevalence. The DIS module for depression illustrates the more complex approach of a diagnostic schedule.
The first question, which is a ‘stem’, concerns both the ‘essential features’ and ‘duration’: ‘Over your lifetime, have you ever had 2 weeks or more when you felt sad, blue or depressed or when you lost all interest and pleasure in things you usually enjoyed?’. This is followed by questions dealing with disturbances of appetite, sleep, energy, psychomotor activity, loss of interest in sex, disturbances of concentration and of self-worth, as well as preoccupation with death. A probing system is used to rule out instances in which the symptom might have been caused by physical illness or injury or due to taking drugs or alcohol. The module is terminated at this point if the subject did not report at least three associated symptoms. If such were reported, however, the remainder of the module deals with whether the symptoms clustered together in time and whether the episode occasioned seeing a doctor, taking medication or being impaired. The next instrument used in a North American study was an adaptation of the World Health Organization’s Composite International Diagnostic Interview (CIDI) [68]. The adaptation was named the University of Michigan Composite International Diagnostic Interview (UM-CIDI) [69] and was used in the National Comorbidity Survey (NCS) [70]. This study grew out of evidence given in the ECA that many people had more than one type of psychiatric disorder, with many of the comorbid disorders involving drugs and alcohol [71]. Unlike the earlier US national sample studies, the NCS was the first to use a diagnostic schedule. Many of the diagnostic-specific prevalence rates were somewhat higher than in the ECA and the overall annual rate was 29%. The UM-CIDI was then used in the Ontario Mental Health Survey, the first province-wide sample in Canada with 19% being its overall annual rate [72, 73]. 
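The flow of the DIS depression module described earlier in this passage (a stem question, a count of associated symptoms after probing, and termination when fewer than three are reported) can be sketched as a small decision procedure. This is an illustration only, not the DIS scoring program. The function name, the symptom labels and the status strings are hypothetical, and the skip-out on a negative stem is an assumption drawn from the general description of stem questions in the Introduction.

```python
from typing import Dict, Set

# Threshold stated in the text: at least three associated symptoms.
MIN_ASSOCIATED_SYMPTOMS = 3

# Hypothetical labels for the symptom areas listed in the text.
ASSOCIATED_SYMPTOMS = (
    "appetite",
    "sleep",
    "energy",
    "psychomotor",
    "sexual_interest",
    "concentration",
    "self_worth",
    "thoughts_of_death",
)


def depression_module_status(
    stem_positive: bool,
    reported: Dict[str, bool],
    excluded_by_probe: Set[str],
) -> str:
    """Return where the module stops for one subject.

    excluded_by_probe holds symptoms that probing attributed to physical
    illness, injury, drugs or alcohol; these do not count.
    """
    # Assumed skip-out: a negative stem ends the module immediately.
    if not stem_positive:
        return "skipped_at_stem"

    qualifying = [
        symptom
        for symptom in ASSOCIATED_SYMPTOMS
        if reported.get(symptom, False) and symptom not in excluded_by_probe
    ]

    # Terminate if fewer than three associated symptoms remain.
    if len(qualifying) < MIN_ASSOCIATED_SYMPTOMS:
        return "terminated_after_symptoms"

    # Otherwise the interview goes on to clustering, help-seeking,
    # medication and impairment questions (not modelled here).
    return "continue_to_episode_questions"


answers = {"appetite": True, "sleep": True, "energy": True}
print(depression_module_status(True, answers, excluded_by_probe=set()))
print(depression_module_status(True, answers, excluded_by_probe={"sleep"}))
print(depression_module_status(False, answers, excluded_by_probe=set()))
```

Keeping the probe exclusions separate from the raw answers makes it visible that a reported symptom can fail to count towards the threshold, which is the point of the probing system.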
The modifications introduced in the UM-CIDI focused on strategies for increasing comprehension of the questions and motivating accurate reporting using principles of survey methodology. For example, the DIS stem question for depression involved both dysphoria and anhedonia. In the UM-CIDI, these two features were presented in separate questions. The ‘stem’ questions from all modules were brought to the beginning of the interview rather than being scattered throughout as the first question in each separate module. This change was designed
to discourage subjects from giving a falsely negative response because they had learned by experience that a positive response led to further questions. The UM-CIDI also involved a ‘commitment’ question for motivating accurate retrieval of autobiographical memory. It is possible that some of these adjustments related to the higher prevalence compared to other North American studies. With some further modification based on comparison to validating clinical interviews, the UM-CIDI was used 10 years later in the National Comorbidity Follow-up Study which involved re-interviews with members of the original sample. It indicated that the risk of chronicity and recurrence was related in a graded way to the severity of the initial disorder [74]. The same instrument was also used in the National Comorbidity Survey Replication (NCS-R), which involved a new national sample also selected 10 years after the first NCS [75]. This study indicated that while overall prevalence remained steady, the proportion of people receiving treatment increased [76]. The most recent study in North America is the largest to date involving a national sample of Canada numbering over 30,000 subjects, the Canadian Community Health Survey [77]. This investigation used the modules for diagnosing a major depressive disorder as well as for diagnosing the anxiety disorders from the version of the CIDI known as World Mental Health Composite International Diagnostic Interview (WMH-CIDI) [78]. While this recent Canadian study did not provide an overall rate, the other North American studies taken together suggest annual rates that cluster around 20% with the NCS being somewhat higher. Taking all of such studies together, current rates for depression, as an example, cluster around 5%. Throughout the more than 50 years of accumulating such information, questions have been raised about validity because the resultant prevalence rates were perceived as unrealistically high. 
It has been suggested that these surveys must be identifying normal and transient reactions to stressful life events rather than clinical disorders. If clinicians were to examine the subjects, it was thought that they would be able to differentiate between normal and pathological reactions and would identify smaller numbers and therefore produce lower prevalence rates. However, two recent studies indicated the opposite. Both involved a design whereby community subjects were selected for a clinical interview based on the results of a lay-administered schedule. One came from the Baltimore ECA Project Follow-up Study which used the WHO’s Schedules for Clinical Assessment in Neuropsychiatry (SCAN) [79] to assess the DIS [80]. The other came from the last survey in the Stirling County Study where the American Structured Clinical Interview for DSM-III-R (SCID) [81] was used as the standard to assess both the DIS and DPAX-2 [82]. In each, the clinicians identified a much larger number of cases than did the lay-interview methods, whose specificity was high but whose sensitivity was low. The clinicians rarely negated a case identified in the lay interviews but they identified many additional cases. Such information raises new questions and may re-direct validation efforts towards greater scrutiny of how clinical skills and judgements are applied in a structured clinical interview.
Another methodologic issue relates to the use of a lifetime-orientation in diagnostic schedules such as the DIS and CIDI [83]. They have consistently indicated an association between increasing age and low prevalence. This finding has led to two different interpretations. One is that the higher rates among younger people indicate that depression is increasing over time [84, 85]. The other interpretation is that reliance on recollections over the whole lifetime has led to faulty recall among older people [86–88]. The use of retrospective reconstruction remains an active methodological issue.
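The validation comparisons described earlier in this section report that the lay-administered schedules had high specificity but low sensitivity when a clinical interview was taken as the standard. The sketch below shows how the two quantities are computed from a two-by-two table. The counts are invented for illustration and are not results from the Baltimore or Stirling County studies.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Compute sensitivity and specificity against a clinical standard.

    tp: cases by both the lay schedule and the clinician
    fp: cases by the lay schedule that the clinician negated
    fn: cases found by the clinician but missed by the lay schedule
    tn: non-cases by both
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one clinical case and one clinical non-case")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Invented counts: the clinician finds many cases the lay schedule missed
# (large fn) but rarely negates a lay-identified case (small fp).
sens, spec = sensitivity_specificity(tp=20, fp=5, fn=60, tn=415)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With counts of this shape the lay schedule misses most clinician-identified cases while producing few false positives, which is the pattern of low sensitivity and high specificity reported in the text.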
13.3 North American instruments for psychiatric services and primary care
The developers of the wartime Cornell Selectee Index (CSI) were interested in the relationship between emotional problems and medical conditions. Thus they adapted the original instrument for use in clinical settings. Questions about physical conditions were included, and it was re-named the Cornell Medical Index (CMI) [89]. Many of the psychiatric questions were cast as ‘Do you usually feel ...?’, a feature that probably contributed to the instrument’s performance in forecasting subsequent psychiatric and psychosomatic problems [90]. The
CMI was widely used in medical settings and several epidemiological investigations [91–94]. The first post-war scale specifically for use in psychiatric clinics was the Hopkins Symptom Checklist (HSCL) which was developed to monitor the effectiveness of psychotherapy [95, 96]. Although borrowing items from the CMI, the response pattern was changed from ‘Yes/No’ to four categories for the degree to which the symptom bothered the patient, and the time frame was specified as the recent week. The HSCL was improved over many years largely based on factor analytic studies but including tests of internal consistency, test-retest reliability and correspondence with psychiatrists’ assessments [97, 98]. The patients tested were often described as ‘anxious neurotics’ but the ultimate version, an instrument known as the Symptom Checklist 90-items (SCL-90), covers a much wider range of psychopathology with factors for depression, anxiety, obsessive–compulsive symptoms, hostility, paranoid ideation and ‘psychoticism’ [99]. Versions of the HSCL were used in the early drug trials when psychotropic medications were first being developed [100–102] as well as in epidemiological studies [103, 104]. The evolution of the HSCL led to a 25-item version consisting of the factors for anxiety and depression for use in primary care [105]. A diagnostic algorithm based on DSM-III was developed for it and applied in a national sample [106]. Algorithmic assessment did not, however, become the standard procedure, and most HSCL studies continue to use a simple score and a ‘cutting point’. Because of the simplicity of the HSCL language, it has been a good candidate for translation into other languages [107], and recently the HSCL-25 was used for the first stage of a two-stage Norwegian investigation [108]. Some years after the launching of the HSCL, the Beck Depression Inventory (BDI) [109] and the Zung Depression Scale [110] were constructed for psychiatric settings. 
These instruments reflect the growing interest in depression as antidepressant medications were developed and marketed. The BDI has been used extensively for monitoring of cases in treatment for depression. Along with the Hamilton Rating Scale, the BDI is the best known to psychiatric residents of any of the psychiatric scales. In recent years, the programme of national screening for depression in the United States has drawn heavily on both the BDI 205
and the Zung [111]. A version of the BDI has also been prepared for use in primary care [112]. In the mid-1960s, diagnostic schedules began to be developed for research in North American clinical settings under the leadership of Robert Spitzer. A step-by-step development of diagnostic schedules began with the Mental Status Schedule (MSS) [113]. Next was the Psychiatric Status Schedule (PSS) [114] which was used in an important study that came to be known as the US/UK Diagnostic Project that explored reasons for differences in diagnostic practices in the two countries [115]. In addition, Spitzer and Endicott [116, 117] created for the PSS a system of differential diagnosis performed by a computerised set of algorithms. The computer programs were named DIAGNO. Then followed the Schedule for Affective Disorders and Schizophrenia (SADS) [55] which played a special role in the developments leading to DSM-III. As Chair of the American Psychiatric Association Task Force that produced DSM-III, Spitzer’s experience in instrument development and in designing criteria for diagnosis contributed significantly to the work of the Task Force. The most important study in which the SADS has been used, however, is the Psychobiology of Depression Study [118–123]. This study has emphasised that depression is often chronic with its episodic features appearing as symptom florescences on top of a chronic base, as in ‘double depression’ [124]. This led Judd [125] to say that ‘the most recent and important paradigm shift is the acceptance of unipolar Major Depression as primarily a chronic rather than an acute illness’. Most of the diagnostic schedules, including the SADS, continue to inquire about depression as an ‘episodic’ illness, but awareness of its chronic nature will probably be reflected in future schedules. 
After the third DSM was revised, the Structured Clinical Interview for DSM-III-R (SCID) was produced and has been assessed through field trials, and a version for non-patients was created [81, 126]. Later, a version was designed to be congruent with DSM-IV [127]. The SCID has become the most commonly used schedule in US clinical studies [128–130]. Spitzer and colleagues have also prepared an instrument named Primary Care Evaluation of
Mental Disorders (PRIME-MD) as a guide for general physicians to evaluate psychiatric disorders often seen in their practices [131]. A revision named Patient Health Questionnaire (PHQ-S) was subsequently presented, which is entirely self-administered [132]. The Medical Outcomes Study (MOS) was designed to provide information about the functional impairment of patients treated in different types of clinical settings [133, 134]. The instrument developed for it was named the MOS 36-Item Short-Form Health Survey (SF-36) [135]. It assesses disability associated with both physical and emotional health and its use led to the finding that depression is comparable to or worse than eight major chronic medical conditions in terms of markers such as missing work, staying in bed and other features of poor functioning. The SF-36 was created mainly by factor analytic techniques. It is a multi-item scale concerned with eight health concepts such as limitations in physical and social activities, bodily pain, psychological distress, vitality and so on [136]. The appearance of the SF-36 was timely in that it reflects the growing recognition of the importance of impairment in psychiatric measurement.
13.4 European instruments for psychiatric services and primary care
After World War II, National Health Insurance was established in the United Kingdom. Because nearly complete population registration was involved, epidemiologic estimates could be provided through medical services as illustrated in the London General Practice Study [137]. For this study, physician diagnoses as well as patient responses to the American CMI were utilised. The CMI yielded a high test-retest coefficient over 1 year (0.87), a result which probably derives from its use of the word ‘usually’ in describing the frequency of symptoms and to its predictive capacity. The test results suggested to the London group that the CMI measures stable personality traits rather than the types of psychiatric episodes of concern in general practices. The General Health Questionnaire (GHQ) was designed by Goldberg [138] to overcome this feature of the CMI and to be a more appropriate instrument
in primary care. Thus the GHQ asks if the person has the symptom ‘more than usual’. The intent is to identify the kinds of changes from a person’s usual state that lead to consultation with a general physician. Excellent validity results were achieved when the GHQ was compared to a clinician-administered structured interview named the Clinical Interview Schedule (CIS) that is congruent with the intent of the GHQ [139]. The original publication of the GHQ emphasised that it measures ‘general’ psychopathology of a non-psychotic type. In light of growing interest in diagnosis, Goldberg and Hillier [140] developed a scaled version intended to distinguish between the syndromes of anxiety and depression. Factor analysis identified four domains: ‘anxiety and insomnia’ and ‘severe depression’ as well as ‘social dysfunction’ and ‘general illness’. The ‘anxiety and insomnia’ factor indicates that GHQ anxiety is more cognitive than pertains in any of the earlier scales. Typical GHQ questions deal with ‘being under strain’, ‘everything getting on top of me’, and ‘having difficulty sleeping’ in contrast to the autonomic expressions of fear. The ‘severe depression’ factor reveals that death and suicide are more extensively and explicitly covered than in earlier scales. Also distinctive is that the GHQ factor called ‘social dysfunction’ elicits impairment in everyday activities better than almost any other scale. The GHQ has been used in several studies in the United States, the first of which compared the GHQ and HSCL and found them to show a correlation coefficient of 0.78 [141]. The GHQ was also compared to DIS depression [142] and in one study the GHQ-28, HSCL-25 and CES-D were simultaneously compared to the DIS [143]. The three scales were indistinguishable (sensitivity from 0.65 to 0.69 and specificity from 0.78 to 0.84) indicating that each performed similarly, and none perfectly. 
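The sensitivity and specificity figures quoted above come from cross-classifying scale results against a criterion diagnosis. The following is a minimal sketch of that calculation, assuming a scale is scored as a simple total and dichotomised at a ‘cutting point’; the function name, the scores and the criterion diagnoses are all invented for illustration and are not taken from any of the studies cited.

```python
def screening_accuracy(scores, is_case, cutting_point):
    """Return (sensitivity, specificity) of a scale against a criterion.

    scores        -- total scale scores, one per respondent (hypothetical)
    is_case       -- criterion diagnosis per respondent (True = case)
    cutting_point -- scores at or above this value count as screen-positive
    """
    tp = fn = tn = fp = 0
    for score, case in zip(scores, is_case):
        positive = score >= cutting_point
        if case and positive:
            tp += 1          # case correctly identified by the scale
        elif case:
            fn += 1          # case missed by the scale
        elif positive:
            fp += 1          # non-case wrongly flagged
        else:
            tn += 1          # non-case correctly passed
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Invented data for ten respondents; not drawn from any real study.
scores = [2, 7, 9, 4, 6, 1, 8, 3, 5, 10]
is_case = [False, True, True, False, False, False, True, True, False, True]

sens, spec = screening_accuracy(scores, is_case, cutting_point=6)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the cutting point in such a sketch generally trades sensitivity for specificity, which is one reason different scales evaluated against the same criterion can appear to perform similarly.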
The use of the GHQ around the world in both epidemiologic and clinical studies far exceeds that of the other short scales [144–148]. Recently, Goldberg and Simpson developed the Personal Health Questionnaire (PHQ-G), a 10-item instrument designed to gather information specifically about depression according to ICD-10 [149]. For research in psychiatric clinics rather than primary care, the Present State Examination (PSE) was
developed over several years by Wing and co-workers [54, 150]. The first publication appeared about the same time as the first of the schedules designed by Spitzer’s group in the United States, thus suggesting that the need for such instrumentation was beginning to be widely recognised. The original purpose of the PSE was to provide a guide for ‘cross-examining’ a patient for evidence of schizophrenia. The word ‘Present’ in the name of the Examination refers to the fact that the inquiry focuses on the ‘current clinical state’ as exhibited in the recent month. Prior experimentation suggested that recall of subjective experiences over a longer period of time was often faulty. Unlike most of the instruments described thus far, the results of the PSE reflect the decisions of the interviewer rather than the report given by the subject. The schedule consists of pre-formulated questions, but the responses of the subject are not used in analysis. Rather, diagnosis is based on the interviewer’s evaluation of the subject’s responses as guided by a glossary of differential definitions. The interviewer decides if the symptoms are sufficiently severe to warrant contributing evidence to a syndrome. In the late 1960s, the seventh revision of the PSE was employed in the US/UK Diagnostic Project [115], which demonstrated that many of the differences in diagnosis disappeared when structured interviews were employed. In addition to the PSE, this project used Spitzer’s PSS. One of the nosological issues explored was the question whether anxiety and depression could be differentiated. The PSE definition of anxiety involves a syndrome in which autonomic hyperactivity and motor tension are well represented while the PSS definition was more cognitive, like that of the GHQ, with a focus on anxious mood, worry and feeling under strain. Zubin and Fleiss [151] found that the syndromes of anxiety and depression were better discriminated by the PSE than by the PSS. 
This suggests that the autonomic indicators play an important role in the distinction despite the fact that the two syndromes are often found to be comorbid. PSE-8 was used in the International Pilot Study of Schizophrenia [152] which contributed evidence that schizophrenia seems to be found in most parts of the world. Shortly thereafter, the ninth revision was published along with a description of a computer program named CATEGO that had been developed for standardised analysis [54, 150]. Then, in order
to use the PSE in population-based epidemiology, an Index of Definition was constructed to differentiate between cases and non-cases [153]. PSE-9 was translated into more than 40 languages, and it has been used extensively in clinical research and in several single and two-stage epidemiological studies [154–157]. Throughout this phase, it continued to focus on psychoses and neuroses and to exclude substance abuse and personality disorders. Not long after the US President’s Commission on Mental Illness that led to the DIS and ECA, the WHO Division of Mental Health and the US Alcohol, Drug Abuse, and Mental Health Administration (ADAMHA) joined forces in order to carry out a worldwide review of diagnoses and classification of psychiatric disorders. In 1982 a WHO-ADAMHA Task Force was formed to develop diagnostic interviews that would implement the definitions embodied in ICD-10 as well as the criteria employed in DSM-III and the principles of the PSE-CATEGO system. One goal was to develop a schedule for studying clinical samples, the product of which consisted of a series of schedules, the overall name being Schedules for Clinical Assessment in Neuropsychiatry (SCAN) [79]. SCAN provides a comprehensive procedure for clinical examination appropriate for use throughout the world. It incorporates the 10th version of the PSE, and it is suggested that other schedules for personality and disability assessment also be used [158, 159].
13.5 European instruments for epidemiological research
In addition to developing the SCAN, the WHO-ADAMHA Task Force was charged with preparing an epidemiological instrument that could be administered by lay interviewers and used throughout the world. The CIDI is the product of this work [68]. It was intended to bring together the best features of the DIS and the PSE. Like the DIS, the CIDI does not allow variation in order or changes in the way the questions are asked but it contains 35 PSE items that could be transformed into close-ended questions. Because it is highly structured and does not allow interviewers to interpret responses it is much more similar to the DIS than the PSE. In fact, PSE items dealing with delusion were not incorporated because they required clinical judgement.
The CIDI went through numerous field trials and a variety of special topics were investigated. These included analysis of comparability with the PSE [160]; issues of recall and dating symptoms [161]; appropriateness and feasibility for cross-cultural investigations [162–164], as well as reliability and validity [165–167]. A computerised version, CIDI-AUTO, was created and tested [168], as were also a short form (CIDI-SF) [169], and a screening version (CIDI-S) [170]. While both the DIS and CIDI were going through phases of change and improvement, the diagnostic criteria on which they were built were not static. For example, the definition of generalised anxiety disorder changed from involving the autonomic indicators, such as had been prominent in the early scales, to being more cognitive, as in the GHQ and PSS. Based on clinical studies, the definition came to focus on ‘feeling miserable’, worrying, being tense, high-strung and sleepless. The typical indicators of ‘bodily alarm’ came to reside only in panic and phobic disorders rather than in a generalised form of autonomic anxiety. Quite aside from UM-CIDI being used in North America, the standard CIDI was used in a primary care study conducted in 15 different sites around the world [171], a two-stage investigation in Norway [172], in the Australian National Survey of Mental Health and Well-Being [173] and in the Netherlands Mental Health Survey and Incident Study (NEMESIS) [174]. Another version of the CIDI was mentioned as having been used in the recent national sample study in Canada: the World Mental Health CIDI. It was mainly constructed, however, for the World Mental Health Initiative [78]. The questions about diagnoses were based on criteria represented in DSM-IV and ICD-10. In addition to diagnoses, there were sections for functional impairment, treatment, consequences, risk factors and sociodemographic variables. 
Several innovations were introduced, among them mechanisms for including dimensional as well as categorical assessment, for covering subthreshold disorders, and for maintaining standard wording of questions along with culturally suggested clarifications. The World Mental Health Initiative is the outgrowth of the Global Burden of Disease, which indicated that, by defining ‘burden’ as a combination
of reduced quality of life (disability) and reduced quantity of life (death), the toll taken by mental disorders was brought to the fore [175]. The goal of the new initiative was to provide empirical evidence about the prevalence of psychiatric disorders in many countries around the world. A report based on using the WMH-CIDI in 14 countries indicates that while the rates varied more than in North America, everywhere the more seriously ill had the greatest likelihood of receiving treatment [176]. In addition, the WMH-CIDI has been used in the European Study of the Epidemiology of Mental Disorders (ESEMeD) [177] and, as mentioned earlier, in Canada. Based on the amount of comorbidity seen in epidemiological studies that were using the CIDI, questions began to be asked about whether a small number of broad categories might have nosological advantages over many discrete categories. Reanalysis of CIDI data using factor analysis indicated that the diagnoses of social phobia, simple phobia, agoraphobia and panic disorder loaded on a factor with the suggested name of ‘fear’. On the other hand, generalised anxiety disorder, as defined by this time with an emphasis on cognitive worry, affiliated with major depressive episode and dysthymia in a factor named ‘distress’ [178–180]. If alcohol, drug and anti-social diagnoses were added, they loaded on a factor of ‘externalisation’, while ‘fear’ and ‘distress’ were sufficiently correlated to suggest an ‘internalisation’ factor. These factor analytic results have been interpreted as possibly dividing psychiatric disorders in a more meaningful way than multiple categories. This idea has brought considerable discussion about dimensional versus categorical measurement [181, 182]. Contributing to the view that these more comprehensive groupings relate to the core of psychopathology is evidence that generalised anxiety disorder and major depression share the same genetic liability [183, 184]. 
The CIDI has not, however, been used in the United Kingdom. Studying large samples of the population by means of a structured instrument does not have as long a history in the United Kingdom as in North America. Until recently only a few household surveys had been carried out and they tended to focus on segments of the population
such as women [154, 185] or residents of special housing areas [186, 187]. In the 1990s, however, a very large investigation was conducted, named the National Psychiatric Morbidity Surveys of Great Britain [188–190]. The decision not to use the CIDI was motivated primarily by its length and its reliance on complex questions about the subject’s whole life. Instead, the earlier instrument Clinical Interview Schedule (CIS) developed by Goldberg and co-workers [139] was improved and named the Clinical Interview Schedule – Revised (CIS-R) [191]. The original CIS was intended to be used by psychiatrists for identifying the common disorders that are found in primary care and community settings [192]. The schedule consisted of two halves, the first based on self-report about the frequency, duration and intensity of symptoms with the second based on the psychiatrist’s observations of ‘manifest abnormalities’. When used as a validating standard for the GHQ, the CIS did not give a diagnosis but rather a rating of severity along with a ‘cutting point’ to separate cases from non-cases. The CIS approach was thus congruent with the GHQ focus on identifying ‘general’ non-psychotic disorder. The CIS-R can be administered by lay interviewers but the emphasis on ‘general’ neurosis is maintained. It uses a ‘cutting point’ to identify cases but additional analytic routines allow the application of diagnostic designations for generalised anxiety disorder, depressive episode, phobias, obsessive–compulsive disorders, panic disorder and non-specific neurotic disorder according to ICD-10. To avoid long-term recall, the time frame is the previous week, but subjects are asked to give the date of onset of key symptoms. The CIS-R is described as a ‘bottom-up’ schedule that gathers information about the basic phenomena to which classification algorithms can be subsequently applied.
This contrasts with the ‘top-down’ instruments like the DIS and CIDI that build the classification rules into the questions. The rationale for the CIS-R approach relates to the objective of conducting subsequent surveys for comparison over time when the specific criteria may be modified. In addition, a screening instrument for psychosis was developed for the survey, and those who scored positively were later interviewed with the SCAN. Like most of the instruments designed before 1980, the CIS-R
itself does not include substance abuse. A separate schedule was therefore developed for that purpose. It is unknown whether researchers in other countries will use the CIS-R. There is growing evidence, however, that the broad dimensional approach it embodies is being given increased attention. In addition, numerous substantive reports have been produced from the data gathered in the United Kingdom [193–195].
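The ‘bottom-up’ idea described above for the CIS-R, in which symptom-level information is stored first and classification rules are applied afterwards, can be sketched roughly as follows. This is only an illustration of the general design: the symptom fields, the ratings, the cutting point and both rule sets are hypothetical and do not reproduce the actual CIS-R items or its ICD-10 algorithms.

```python
from dataclasses import dataclass


@dataclass
class SymptomRecord:
    """Symptom-level ratings stored for one respondent (hypothetical fields)."""
    worry: int
    depressed_mood: int
    sleep_problems: int


def total_score(record):
    # Simple sum of the stored symptom ratings.
    return record.worry + record.depressed_mood + record.sleep_problems


def is_case(record, cutting_point):
    # Case/non-case decision from a 'cutting point' on the total score.
    return total_score(record) >= cutting_point


def rule_a(record):
    # Hypothetical earlier criterion: depressed mood alone is sufficient.
    return record.depressed_mood >= 2


def rule_b(record):
    # Hypothetical revised criterion: depressed mood plus sleep problems.
    return record.depressed_mood >= 2 and record.sleep_problems >= 2


# Invented records; the raw ratings are kept so that rules can be re-applied.
records = [
    SymptomRecord(worry=3, depressed_mood=1, sleep_problems=2),
    SymptomRecord(worry=0, depressed_mood=4, sleep_problems=1),
    SymptomRecord(worry=1, depressed_mood=3, sleep_problems=3),
    SymptomRecord(worry=1, depressed_mood=1, sleep_problems=0),
]

print([is_case(r, cutting_point=5) for r in records])
print([rule_a(r) for r in records])
print([rule_b(r) for r in records])
```

Because the stored records do not depend on either rule set, the same data can be reclassified if criteria are later modified, which is the rationale the text gives for this design when surveys are to be compared over time.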
13.6 Summary
For the first half of the period reviewed, psychiatrists stood aside and viewed with skepticism the developments described here. The reasons for their distance were multiple and complex including the influence of psychodynamic psychiatry, doubt that asking questions was enough, and belief that non-psychiatrists were unable to interpret answers accurately or to perceive the nuances of facial expression and body movement that are necessary for an adequate psychiatric work-up. Undoubtedly visual information about appearance and comportment contributes to a psychiatric assessment. The time may come when the process of observation will achieve sufficient standardisation to be useful in epidemiological research. However, that approach is not yet on the horizon. The time may come when biological markers will have been identified and proven sufficiently accurate and efficient to be used in large-scale studies, but that approach is also not yet on the horizon. The question/answer interaction (by paper/pencil, face-to-face interview, telephone or computer) remains the most useful mode of gathering data for psychiatric epidemiology. Major advances in the question/answer approach were achieved when clinical criteria, such as now exist in the DSM and the ICD, became available. Because schedules like the DIS and CIDI were designed to implement the criteria, a foundation was laid for comparability across studies and for covering the range of diagnoses that involve psychotic, substance abuse and personality disorders as well as the earlier focus on neurotic disorders. There were also some losses. Psychometric principles tended to be ignored as did also the principles of
survey methodology. At the present time, one of the main questions is whether the categories of disorder embodied in the DSM have maximal utility or if dimensional measures may be superior. The concept of a ‘psychiatric syndrome’ is at the heart of the existing classification systems. By its nature categorical, the syndrome is either complete enough to say that it is present or sufficiently incomplete to warrant saying it is not. There are numerous aspects of syndrome recognition, however, that draw on dimensional models. This can be illustrated by reference to the existing scales, all of which are dimensional. All of the ones reviewed here refer to the ‘essential features’ and ‘associated symptoms’ – at least of depression or anxiety. Further, a dimension for what is ‘essential’ could be constructed separately from what is ‘associated’ so that the requirements for exhibiting key symptomatology could be met. ‘Duration’ and ‘disability’ are also by their nature dimensional. In addition to the questions about how a categorical approach might be improved by using dimensions for its component parts, questions have arisen about the value of using much broader dimensions. In so far as the main goal of epidemiology will continue to be the estimation of prevalence and incidence, much research would be required to find a single and adequate ‘cutting point’ on dimensions as broad as ‘introversion’ and ‘extraversion’. On the other hand, one can envisage the productive epidemiologic use of such ‘middle range’ factors as ‘fear’ and ‘distress’ which to a reasonable degree translate as anxiety and depression. It should not be ignored, however, that the focus on discrete categories as identified through the schedules has provided re-conceptualisations that appear to be useful. One is the change from viewing depression as an episodic disorder to seeing it as a chronic one which is subject to fluctuations in intensity. 
The other is that generalised anxiety may have a more cognitive manifestation than appeared to be the case in the early years although it may also be true that there are two forms of generalised anxiety, one more articulated through the autonomic system and the other more through mental processes. A feature of the diagnostic schedules that needs further thought concerns the use of lifetime recall. There is evidence of international tension on this point. The
schedules developed in the United Kingdom (PSE and CIS-R) focus on the current clinical state. The US and WHO schedules (DIS and CIDI) elicit information about lifetime experiences. The lifetime approach in psychiatric epidemiology appeared about the same time as an upsurge of genetic research, in which lifetime population norms were needed for family studies. With the changing face of genetic research towards molecular studies, the rationale for lifetime rates may recede. Important steps forward in reliability might thus be achieved if assessment of the current clinical state becomes the first order of inquiry. Reliability is fostered when the subject comprehends the interview situation and is well motivated to give accurate answers. Both psychometric theory and survey experience suggest that the best way to reduce misunderstanding on the part of the subject and variability on the part of interviewers is to provide clear instructions and use simple language [196, 197]. Scrutiny of diagnostic schedules to reduce complexity may also be a useful step towards increasing reliability. A major challenge that lies ahead for both lay and clinician interviews has to do with validity. New questions have been raised by the fact that the use of well-recognised clinical interviews in the Baltimore and Stirling Studies gave considerably higher rates of depression than did lay interviews. The clinical approach did not invalidate the lay results in the sense of denying them, as was expected, but rather the clinicians indicated that the lay-administered schedules missed numerous cases. Many of the questions asked by clinicians were the same as those asked by lay interviewers. This raises the problem of whether use of a question-oriented ‘gold standard’ provides an adequate test of validity.
Campbell and Fiske [198] emphasise that validity depends upon using independent and different information: ‘Reliability is the agreement between two efforts to measure the same trait through maximally similar methods. Validity is represented in the agreement between two attempts to measure the same trait by maximally different methods’. Given the fact that both the clinical and lay interviews involved similar questions, the differences must result from features other than the questions asked. What is ‘maximally different’ about them? Do clinicians ask the questions in a different manner? Do
subjects hear the questions of a clinician in a different way? Do clinicians interpret the same response to the same question in ways that are distinctively different? Evidence suggests that schedules like SCID and SCAN have achieved reliability when applied in clinical settings. In such settings, it is a matter of determining what kind of diagnosis is pertinent rather than whether the person is a case or not. In community settings, the situation may be sufficiently different to warrant a different approach to validity. Over and above investigation of interpretative differences when the same questions are asked, efforts at validation need to seek materials that are genuinely dissimilar from the question/answer format. Such material may reside in the ‘lead standard’ which Spitzer [199] defined as involving ‘Longitudinal assessment by a panel of clinical Experts who have access to All available Data’. Such a standard has been employed to assess certain CIDI diagnoses in a clinic-based study [168]. A group of psychiatrists (the ‘panel of experts’) who had known the patients over considerable time (an approximation of ‘all available data’) provided consensus diagnoses that agreed well with the CIDI results. Given the fact that many people in the community do not seek treatment for a psychiatric illness, another application of the standard would be to focus on the ‘Longitudinal assessment’ part of the definition. For example, it may be possible to use prospective evidence about the course and outcome of illness identified by structured lay interviews to confirm or reject the diagnosis. From the limited amount of longitudinal follow-up data available at the present time, the evidence about chronicity and risk for recurrence and other adversities does not support the view that the epidemiologic studies have identified transient and normal reactions to life stress. Rather, a degree of predictive validity seems already to have been achieved. 
Despite the fact that the schedules take much longer to administer than the scales, the samples used in epidemiologic research have become larger and larger over time, making it possible to achieve adequate numbers for the rarer disorders. At the same time, the cost of conducting such surveys has increased. Because of cost, two-stage designs may become more popular in the future. However, a note of caution is in order. It is now known that those
who are psychiatrically ill are more likely to refuse than are others and thus it becomes increasingly compelling to avoid subject attrition. Incompleteness of data is one of the serious problems faced by psychiatric epidemiology at the present time, and two-stage designs give two opportunities for refusal in contrast to one in single-stage investigations [70, 200]. This review of the use of scales and schedules to estimate prevalence among adults in the general population indicates that the ‘unknowns’ observed at the end of World War II have to some extent become ‘knowns’. It is now clear that prevalence is higher in many countries than originally estimated and that many of those who suffer from a psychiatric disorder do not receive treatment for it. Whether prevalence and incidence are increasing remains a question but the two studies (NCS-R and Stirling) that have thus far compared a sample drawn earlier with one drawn later suggest more stability than change. Research using the scales and schedules described here has shown that psychiatric disorders are common, diverse in character, often comorbid, widely distributed, probably more steady than fluctuating in rate, and heavily burdensome.
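The caution above about two-stage designs can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that a fixed proportion of those approached take part at each stage; the 80% figure and the function name are invented, not taken from any study cited here.

```python
import math


def overall_completion(stage_participation_rates):
    """Proportion of the original sample with complete data after all stages.

    Each rate is the fraction of those approached at that stage who take part
    (hypothetical values; refusal is assumed here not to depend on illness).
    """
    return math.prod(stage_participation_rates)


# Invented participation rate of 80% at every stage.
single_stage = overall_completion([0.80])
two_stage = overall_completion([0.80, 0.80])

print(f"single-stage: {single_stage:.2f}, two-stage: {two_stage:.2f}")
```

Under these assumed rates the two-stage design retains a smaller share of the original sample than the single-stage design. The sketch ignores the further point made in the text, that those who are ill are more likely to refuse, which would bias prevalence estimates in addition to reducing the sample.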
Acknowledgements
This chapter is based on a course taught at the Harvard School of Public Health. From 1987 to 1999, the course was titled ‘Psychiatric Screening and Diagnostic Tests’ after which the title was changed to ‘Psychiatric Diagnosis in Clinic and Community Populations’. The chapter also draws on a report prepared for the National Institute of Mental Health under contract 80M014280101D titled ‘Psychiatric Instrument Development for Primary Care Research: Patient Self-Report Questionnaire’, 1981. In addition, the chapter draws on materials from the Stirling County Study through NIMH Grant R01 MH39576-25.
References
[1] American Psychiatric Association (1980) Diagnostic and Statistical Manual of Mental Disorders, 3rd edn, American Psychiatric Association, Washington, DC.
[2] American Psychiatric Association (1987) Diagnostic and Statistical Manual of Mental Disorders, 3rd edn Revised, American Psychiatric Association, Washington, DC.
[3] American Psychiatric Association (1994) Diagnostic and Statistical Manual of Mental Disorders, 4th edn, American Psychiatric Association, Washington, DC.
[4] World Health Organization (1977) Manual of the International Statistical Classification of Diseases, Injuries, and Causes of Death, Ninth Revision, World Health Organization, Geneva.
[5] World Health Organization (1992) International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10), World Health Organization, Geneva.
[6] Hamilton, M. (1960) A rating scale for depression. J. Neurol. Neurosurg. Psychiatry, 23, 57–62.
[7] Overall, J.E. and Gorham, D.R. (1962) The brief psychiatric rating scale. Psychol. Rep., 10, 799–812.
[8] Blouin, A.G., Perez, E.L. and Blouin, J.M. (1987) Computerized administration of the diagnostic interview schedule. Psychiatry Res., 23, 335–344.
[9] Essen-Möller, E. (1956) Individual traits and morbidity in a Swedish rural population. Acta Psychiatr. Neurol. Scand., (Suppl. 100), 1–160.
[10] Hagnell, O., Lanke, J., Rorsman, B. et al. (1982) Are we entering an age of melancholy? Depressive illness in a prospective epidemiological study over 25 years: the Lundby study, Sweden. Psychol. Med., 12, 279–289.
[11] Mattisson, C., Bogren, M., Nettelbladt, P. et al. (2005) First incidence depression in the Lundby study: a comparison of the two time periods 1947–1972 and 1972–1997. J. Affect. Disord., 87, 151–160.
[12] Meehl, P.E. and Hathaway, S.R. (1946) The K factor as a suppressor variable in the MMPI. J. Appl. Psychol., 30, 525–564.
[13] Eysenck, H.J. (1947) Dimensions of Personality, Routledge & Kegan Paul, London.
[14] Eysenck, H.J. and Eysenck, S.B.G. (1975) Eysenck Personality Questionnaire, Educational and Industrial Testing Service, San Diego, CA.
[15] Weider, A., Mittelmann, B., Wechsler, D. et al. (1944) The Cornell Selectee Index: a method for quick testing of selectees for the armed forces. J. Am. Med. Assoc., 124, 224–228. [16] Star, S.A. (1950) The screening of psychoneurotics in the army: technical development of tests, in Measurement and Prediction (eds S.A. Stouffer, L. Guttman, E.A. Suchman and P.F. Lazarsfeld), Princeton University Press, Princeton, pp. 486–547.
SYMPTOM SCALES AND DIAGNOSTIC SCHEDULES IN ADULT PSYCHIATRY [17] Leavitt, H.C. (1946) A comparison between the Neuropsychiatric Screening Adjunct (NSA) and the Cornell Selectee Index (Form N). Am. J. Psychiatry, 103, 353–357. [18] Leighton, A.H. (1959) My Name Is Legion: The Stirling County Study of Psychiatric Disorder and Sociocultural Environment, vol. 1, Basic Books, New York. [19] Srole, L., Langner, T.S., Michael, S.T. et al. (1962) Mental Health in the Metropolis: The Midtown Manhattan Study, McGraw-Hill, New York. [20] Macmillan, A.M. (1957) The health opinion survey: technique for estimating prevalence of psychoneurotic and related types of disorders in communities. Psychol. Rep., 3, 325–339. [21] Langner, T.S. (1962) A twenty-two item screening score of psychiatric symptoms indicating impairment. J. Health Hum. Behav., 3, 269–276. [22] Manis, J.G., Brawer, M.J., Hunt, C.L. et al. (1964) Estimating the prevalence of mental illness. Am. Sociol. Rev., 29, 84–89. [23] Phillips, D.L. (1966) The ‘true prevalence’ of mental illness in a New England state. Community Ment. Health J., 2, 35–40. [24] Prince, R.H., Mombour, W., Shiner, E.V. et al. (1967) Abbreviated techniques for assessing mental health in interview surveys: an example from central Montreal. Laval Med., 38, 58–62. [25] Dohrenwend, B.P. and Crandell, D.L. (1970) Psychiatric symptoms in community, clinic, and mental hospital groups. Am. J. Psychiatry, 126, 87–97. [26] Shader, R.I., Ebert, M.H. and Harmatz, J.S. (1971) Langner’s psychiatric impairment scale: a short screening device. Am. J. Psychiatry, 128, 596–601. [27] Myers, J.K., Lindenthal, J.J. and Pepper, M.P. (1971) Life events and psychiatric impairment. J. Nerv. Ment. Dis., 152, 149–157. [28] Schwab, J.J., Bell, R.A., Warheit, G.J. et al. (1979) Social Order and Mental Health: The Florida Health Study, Brunner/Mazel, New York. [29] Gurin, G., Veroff, J. and Feld, S. 
(1960) Americans View Their Mental Health: A Nationwide Interview Survey, Basic Books, New York. [30] Veroff, J., Douvan, E. and Kulka, R.A. (1981) The Inner American: A Self-Portrait from 1957 to 1976, Basic Books, New York. [31] Veroff, J., Kulka, R.A. and Douvan, E. (1981) Mental Health in America: Patterns of Help-Seeking from 1957 to 1976, Basic Books, New York. [32] Leighton, D.C., Harding, J.S., Macklin, D.B. et al. (1963) The Character of Danger: The Stirling County Study of Psychiatric Disorder and Sociocultural Environment, vol. 3, Basic Books, New York.
[33] Murphy, J.M., Sobol, A.M., Neff, R.K. et al. (1984) Stability of prevalence: depression and anxiety disorders. Arch. Gen. Psychiatry, 41, 990–997. [34] Murphy, J.M., Olivier, D.C., Monson, R.R. et al. (1988) Incidence of depression and anxiety: the Stirling County Study. Am. J. Public Health., 78, 534–540. [35] Murphy, J.M., Monson, R.R., Laird, N.M. et al. (2000a) A forty-year perspective on the prevalence of depression from the Stirling County Study. Arch. Gen. Psychiatry, 57, 209–215. [36] Murphy, J.M., Laird, N.M., Monson, R.R. et al. (2000b) Incidence of depression in the Stirling County Study: historical and comparative perspectives. Psychol. Med, 30, 505–514. [37] Murphy, J.M., Neff, R.K., Sobol, A.M. et al. (1985) Computer diagnosis of depression and anxiety: the Stirling County Study. Psychol. Med., 15, 99–112. [38] Murphy, J.M., Monson, R.R., Laird, N.M. et al. (1998) Identifying depression in a forty-year epidemiologic investigation: the Stirling County study. Int. J. Methods Psychiatr. Res., 7, 89–109. [39] Murphy, J.M., Olivier, D.C., Sobol, A.M. et al. (1986) Diagnosis and outcome: depression and anxiety in a general population. Psychol. Med., 16, 117–126. [40] Murphy, J.M., Burke, J.D., Monson, R.R. et al. (2008) Mortality associated with depression: a fortyyear perspective from the Stirling County Study. Soc. Psychiatry Psychiatr. Epidemiol., 43, 594–601. [41] Radloff, L.S. (1977) The CES-D scale: a self-report depression scale for research in the general population. Appl. Psychol. Meas., 1, 385–401. [42] Markush, R.E. and Favero, R.V. (1974) Epidemiologic assessment of stressful life events, depressed mood, and psychophysiological symptoms – a preliminary report, in Stressful Life Events: Their Nature and Effects (eds B.S. Dohrenwend and B.P. Dohrenwend), John Wiley & Sons, Inc., New York, pp. 171–190. [43] Comstock, G.W. and Helsing, K.J. (1976) Symptoms of depression in two communities. Psychol. Med., 6, 551–563. 
[44] Weissman, M.M., Sholomskas, D., Pottenger, M. et al. (1977) Assessing depressive symptoms in five psychiatric populations: a validation study. Am. J. Epidemiol., 106, 203–214. [45] Berkman, L.F., Berkman, C.S., Kasl, S. et al. (1986) Depressive symptoms in relation to physical health and functioning in the elderly. Am. J. Epidemiol., 124, 372–388. [46] Lyketsos, C.G., Hoover, D.R., Guccione, M. et al. (1996) Depressive symptoms over the course of HIV infection before AIDS. Soc. Psychiatry Psychiatr. Epidemiol., 31, 212–219.
213
CHAPTER 13 [47] Li, C., Johnson, N.P. and Leopard, K. (2001) Risk factors for depression among adolescents living in group homes in South Carolina. J. Health Soc. Policy, 13, 41–59. [48] Dohrenwend, B.P., Levav, I. and Shrout, P.E. (1986) Screening scales from the Psychiatric Epidemiology Research Interview (PERI), in Community Surveys of Psychiatric Disorders (eds M.M. Weissman, J.K. Myers and C.E. Ross), Rutgers University Press, New Brunswick, NJ, pp. 349–375. [49] Dohrenwend, B.P., Shrout, P.E., Egri, G. et al. (1980) Nonspecific psychological distress and other dimensions of psychopathology. Arch. Gen. Psychiatry, 37, 1229–1236. [50] Link, B. and Dohrenwend, B.P. (1980) Formulation of hypotheses about the true prevalence of demoralization in the United States, in Mental Illness in the United States: Epidemiological Estimates (eds B.P. Dohrenwend, B.S. Dohrenwend, M.S. Gould et al.), Praeger Press, New York, pp. 114–132. [51] Dohrenwend, B.P., Levav, I., Shrout, P.E. et al. (1992) Socioeconomic status and psychiatric disorders: the causation-selection issue. Science, 255, 946–952. [52] Tsuang, M.T., Woolson, R.F. and Simpson, J.C. (1980) The Iowa Structured Psychiatric Interview: rationale, reliability and validity. Acta Psychiatr. Scand., 62 (Suppl. 283), 1–58. [53] Tsuang, M.T. and Winokur, G. (1975) The Iowa 500: field work in a 35-year follow-up of depression, mania, and schizophrenia. Can. Psychiatr. Assoc. J., 20, 359–365. [54] Wing, J.K., Cooper, J.E. and Sartorius, N. (1974) Measurement and Classification of Psychiatric Symptoms: An Instruction Manual for the PSE and Catego Program, Cambridge University Press, London. [55] Endicott, J. and Spitzer, R.L. (1978) A diagnostic interview: the schedule for affective disorders and schizophrenia. Arch. Gen. Psychiatry, 35, 837–844. [56] Weissman, M.M., Myers, J.K. and Harding, P.S. (1978) Psychiatric disorders in a US urban community: 1975–1976. Am. J. Psychiatry, 135, 459–462. [57] Weissman, M.M. 
and Klerman, G.L. (1978) Epidemiology of mental disorders: emerging trends in the United States. Arch. Gen. Psychiatry, 35, 705–712. [58] Robins, L.N., Helzer, J.E., Croughan, J. et al. (1981) National Institute of Mental Health Diagnostic Interview Schedule: its history, characteristics and validity. Arch. Gen. Psychiatry, 38, 381–389. [59] President’s Commission on Mental Health (1978) Report to the President, Report no. Pr39.8:M52/R29. United States Government Printing Office, Washington, DC.
214
[60] Regier, D.A., Myers, J.K., Kramer, M. et al. (1984) The NIMH Epidemiologic Catchment Area (ECA) Program: historical context, major objectives and study population characteristics. Arch. Gen. Psychiatry, 41, 934–941. [61] Eaton, W.W., Holzer, C.E., Von Korff, M. et al. (1984) The design of the Epidemiologic Catchment Area surveys: the control and measurement of error. Arch. Gen. Psychiatry, 41, 942–948. [62] Robins, L.N. and Regier, D.A. (eds) (1991) Psychiatric Disorders in America: The Epidemiologic Catchment Area Study, Free Press, New York. [63] Eaton, W.W., Anthony, J.C., Gallo, J. et al. (1997) Natural history of diagnostic interview schedule/DSM-IV major depression: the Baltimore epidemiologic catchment area follow-up. Arch. Gen. Psychiatry, 54, 993–999. [64] Eaton, W.W., Kalaydjian, A., Scharfstein, D.O. et al. (2007) Prevalence and incidence of depressive disorder: the Baltimore ECA Follow-up, 1981–2004. Acta Psychiatr. Scand., 116(3), 1–7. [65] Helzer, J.E. and Canino, G.J. (eds) (1992) Alcoholism in North America, Europe, and Asia, Oxford University Press, New York. [66] Weissman, M.M., Bland, R.C., Canino, G.J. et al. (1996) Cross-national epidemiology of major depression and bipolar disorder. J. Am. Med. Assoc., 276, 293–299. [67] Bland, R.C., Newman, S.C. and Orn, H. (eds) (1988) Epidemiology of psychiatric disorders in Edmonton. Acta Psychiatr. Scand., 77 (Suppl. 338), 1–80. [68] Robins, L.N., Wing, J., Wittchen, H.U. et al. (1988) The composite international diagnostic interview. Arch. Gen. Psychiatry, 45, 1069–1077. [69] Kessler, R.C., Wittchen, H.U., Abelson, J. et al. (1998) Methodological studies of the Composite International Diagnostic Interview (CIDI) in the National Comorbidity Survey (NCS). Int. J. Methods Psychiatr. Res., 7, 33–55. [70] Kessler, R.C., McGonagle, K.A., Zhao, S. et al. (1994) Lifetime and 12-month prevalence of DSMIII-R psychiatric disorders in the United States: results from the National Comorbidity Survey. Arch. Gen. 
Psychiatry, 51, 8–19. [71] Boyd, J.H., Burke, J.D., Gruenberg, E. et al. (1984) Exclusion criteria of DSM-III: a study of cooccurrence of hierarchy-free syndromes. Arch. Gen. Psychiatry, 41, 983–989. [72] Offord, D.R., Boyle, M.H., Campbell, D. et al. (1996) One-year prevalence of psychiatric disorder in Ontarians 15 to 64 years of age. Can. J. Psychiatry, 41, 559–563. [73] Goering, P., Lin, E., Campbell, D. et al. (1996) Psychiatric disability in Ontario. Can. J. Psychiatry, 41, 564–571.
SYMPTOM SCALES AND DIAGNOSTIC SCHEDULES IN ADULT PSYCHIATRY [74] Kessler, R.C., Merikangas, K.R., Berglund, P. et al. (2003) Mild disorders should not be eliminated from the DSM-5. Arch. Gen. Psychiatry, 60, 1117–1122. [75] Kessler, R.C., Berglund, P., Demler, O. et al. (2003b) The epidemiology of major depressive disorder: results from the National Comorbidity Survey Replication (NCS-R). J. Am. Med. Assoc., 289, 3095–3105. [76] Kessler, R.C., Dernier, O., Frank, R.G. et al. (2005) Prevalence and treatment of mental disorders, 1990–2003. N. Engl. J. Med., 352, 2515–2523. [77] Patten, S.B., Wang, J.L., Williams, J.V.A. et al. (2006) Descriptive epidemiology of major depression in Canada. Can. J. Psychiatry, 51, 84–90. ¨ un, ¨ T.B. (2004) The World [78] Kessler, R.C. and Ust Mental Health (WMH) survey initiative version of the World Health Organization (WHO) Composite International Diagnostic Interview (CIDI). Int. J. Methods Psychiatr. Res., 13, 83–121. [79] Wing, J.K., Babor, T., Brugha, T. et al. (1990) SCAN: Schedules for Clinical Assessment in Neuropsychiatry. Arch. Gen. Psychiatry, 47, 589–593. [80] Eaton, W.W., Neufeld, K., Chen, L.S. and Cai, G. (2000) Comparison of self-report and clinical diagnostic interviews for depression: diagnostic Interview Schedule and Schedules for Clinical Assessment in Neuropsychiatry in the Baltimore Epidemiologic Catchment Area Follow-up. Arch. Gen. Psychiatry, 57, 217–222. [81] Spitzer, R.L., Williams, J.B.W., Gibbon, M. et al. (1992) The Structured Clinical Interview for DSMIII-R (SCID), 1: history, rationale, and description. Arch. Gen. Psychiatry, 49, 624–629. [82] Murphy, J.M., Monson, R.R., Laird, N.M. et al. (2000) A comparison of diagnostic interviews for depression in the Stirling County study: challenges for psychiatric epidemiology. Arch. Gen. Psychiatry, 57, 230–236. [83] Bromet, E.J., Dunn, L.O., Connell, M.M. et al. (1986) Long-term reliability of diagnosing lifetime major depression in a community sample. Arch. 
Gen. Psychiatry, 43, 435–440. [84] Klerman, G.L. and Weissman, M.M. (1989) Increasing rates of depression. J. Am. Med. Assoc., 261, 2229–2235. [85] Cross-National Collaborative Group (1992) The changing rate of major depression: cross-national comparisons. J. Am. Med. Assoc., 268, 3098–3105. [86] Rogler, L.H., Malgady, R.G. and Tryon, W.W. (1992) Evaluation of mental health: issues of memory in the Diagnostic Interview Schedule. J. Nerv. Ment. Dis., 180, 215–222. [87] Giuffra, L.A. and Risch, N. (1994) Diminished recall and the cohort effect of major depression: a simulation study. Psychol. Med., 24, 375–383.
¨ un, ¨ [88] Simon, G.E., Von Korff, M., Ust T.B. et al. (1995) Is the lifetime risk of depression actually increasing? J. Clin. Epidemiol., 48, 1109–1118. [89] Brodman, K., Erdmann, A.J., Lorge, I. et al. (1949) The Cornell Medical Index: an adjunct to medical interview. J. Am. Med. Assoc., 140, 530–534. [90] Brodman, K., Erdmann, A.J., Lorge, I. et al. (1954) The Cornell Medical Index – Health Questionnaire. VII. The prediction of psychosomatic and psychiatric disabilities in army training. Am. J. Psychiatry, 111, 37–40. [91] Brodman, K., Erdmann, A.J., Lorge, I. et al. (1952) The Cornell Medical Index – Health Questionnaire. IV. The recognition of emotional disturbances in a general hospital. J. Clin. Psychol., 8, 289–293. [92] Arthur, R.J., Gunderson, E.K.E. and Richardson, J.W. (1966) The Cornell Medical Index as a mental health survey instrument in the naval population. Mil. Med., 131, 605–610. [93] Eastwood, M.R. and Trevelyan, M.H. (1972) Relationship between physical and psychiatric disorder. Psychol. Med., 2, 363–372. [94] Levav, I., Arnon, A. and Portnoy, A. (1977) Two shortened versions of the Cornell Medical Index – a new test of their validity. Int. J. Epidemiol., 6, 135–141. [95] Parloff, M.B., Kelman, H.C. and Frank, J.D. (1954) Comfort, effectiveness, and self-awareness as criteria of improvement in psychotherapy. Am. J. Psychiatry, 111, 343–351. [96] Derogatis, L.R., Lipman, R.S., Rickels, K. et al. (1974) The Hopkins Symptom Checklist (HSCL): a self-report symptom inventory. Behav. Sci., 19, 1–15. [97] Derogatis, L.R., Lipman, R.S., Covi, L. and Rickels, K. (1971) Neurotic symptom dimensions: as perceived by psychiatrists and patients of various social classes. Arch. Gen. Psychiatry, 24, 454–464. [98] Derogatis, L.R., Lipman, R.S., Covi, L. and Rickels, K. (1972) Factorial invariance of symptom dimensions in anxious and depressive neuroses. Arch. Gen. Psychiatry, 27, 659–665. [99] Derogatis, L.R., Lipman, R.S. and Covi, L. 
(1973) SCL-90: an outpatient psychiatric rating scale, preliminary report. Psychopharmacol. Bull., 9, 13–28. [100] Lipman, R.S., Cole, J.O., Park, L.C. and Rickels, K. (1965) Sensitivity of symptom and nonsymptomfocused criteria of outpatient drug efficacy. Am. J. Psychiatry, 122, 24–27. [101] Rickels, K., Lipman, R.S., Park, L.C. et al. (1971) Drug, doctor warmth, and clinic setting in the symptomatic response to minor tranquilizers. Psychopharmacologia, 20, 128–152. [102] Covi, L., Lipman, R.S., Pattison, J.H. et al. (1973) Length of treatment with anxiolytic sedatives and
215
CHAPTER 13
[103]
[104]
[105]
[106]
[107]
[108]
[109]
[110] [111]
[112]
[113]
[114]
[115]
[116]
[117]
216
response to their sudden withdrawal. Acta Psychiatr. Scand., 49, 51–64. Uhlenhuth, E.H., Lipman, R.S., Balter, M.B. et al. (1974) Symptom intensity and life stress in the city. Arch. Gen. Psychiatry, 31, 759–764. Mellinger, G.D., Balter, M.B., Manheimer, D.I. et al. (1978) Psychic distress, life crisis, and use of psychotherapeutic medications: national household survey data. Arch. Gen. Psychiatry, 35, 1045–1052. Hesbacher, P.T., Rickels, K., Morris, R.J. et al. (1980) Psychiatric illness in family practice. J. Clin. Psychiatry., 41, 6–10. Uhlenhuth, E.H., Balter, M.B., Mellinger, G.D. et al. (1983) Symptom Checklist syndromes in the general population: correlations with psychotherapeutic drug use. Arch. Gen. Psychiatry, 40, 1167–1173. Mollica, R.F., Wyshak, G., de Marneffe, D. et al. (1987) Indochinese versions of the Hopkins Symptom Checklist-25: a screening instrument for the psychiatric care of refugees. Am. J. Psychiatry, 144, 497–500. Sandanger, I., Moum, T., Ingebrigtsen, G. et al. (1999) The meaning and significance of caseness: the Hopkins Symptom Checklist-25 and the Composite International Diagnostic Interview II. Soc. Psychiatry Psychiatr. Epidemiol., 34, 53–59. Beck, A.T., Ward, C.H., Mendelsohn, M. et al. (1961) An inventory for measuring depression. Arch. Gen. Psychiatry, 4, 561–571. Zung, W.W.K. (1963) A self-rating depression scale. Arch. Gen. Psychiatry, 12, 63–70. Baer, L., Jacobs, D.G., Meszler-Reizes, J. et al. (2000) Development of a brief screening instrument: the HANDS. Psychother. Psychosom., 69, 35–41. Beck, A.T. and Beck, R.W. (1972) Screening depressed patients in family practice: a rapid technique. Postgrad. Med., 52, 81–85. Spitzer, R.L., Fleiss, J.L., Endicott, J. et al. (1967) Mental Status Schedule: properties of factoranalytically derived scales. Arch. Gen. Psychiatry, 16, 479–493. Spitzer, R.L., Endicott, J., Fleiss, J.L. et al. 
(1970) The Psychiatric Status Schedule: a technique for evaluating psychopathology and impairment in role functioning. Arch. Gen. Psychiatry, 23, 41–55. Cooper, J.E., Kendell, R.E., Gurland, B.J. et al. (1972) Psychiatric Diagnosis in New York and London, Oxford University Press, London. Spitzer, R.L. and Endicott, J. (1968) DIAGNO I: a computer program for psychiatric diagnosis utilizing the differential diagnostic procedure. Arch. Gen. Psychiatry, 18, 746–756. Spitzer, R.L. and Endicott, J. (1969) DIAGNO II: further developments in a computer program
[118]
[119]
[120]
[121]
[122]
[123]
[124]
[125]
[126]
[127]
[128]
[129]
[130]
[131]
for psychiatric diagnosis. Am. J. Psychiatry, 125, 12–21. Katz, M.M. and Klerman, G.L. (1979) The psychobiology of depression – NIMH clinical research branch collaborative program: introduction: overview of the clinical studies program. Am. J. Psychiatry, 136, 49–51. Keller, M.M., Shapiro, R.W., Lavori, P.W. et al. (1982) Relapse in major depressive disorder: analysis with the life table. Arch. Gen. Psychiatry, 39, 911–915. Coryell, W., Endicott, J., Reich, T. et al. (1984) A family study of bipolar II disorder. Br. J. Psychiatry, 145, 49–54. Hirschfeld, R.M.A., Klerman, G.L., Andreasen, N.C. et al. (1986) Psycho-social predictors of chronicity in depressed patients. Br. J. Psychiatry, 148, 648–654. Mueller, T.I., Lavori, P.W., Keller, M.B. et al. (1994) Prognostic effect of the variable course of alcoholism on the 10-year course of depression. Am. J. Psychiatry, 151, 701–706. Solomon, D.A., Keller, M.B., Leon, A.C. et al. (2000) Multiple recurrences of major depressive disorder. Am. J. Psychiatry, 157, 229–233. Keller, M.M. and Shapiro, R.W. (1982) Double depression: superimposition of acute depressive episodes on chronic depressive disorders. Am. J. Psychiatry, 139, 438–442. Judd, L.L. (1997) Commentary: the clinical course of unipolar major depressive disorders. Arch. Gen. Psychiatry, 54, 989–990. Williams, J.B.W., Gibbon, M., First, M.B. et al. (1992) The structured clinical interview for DSMIII-R (SCID): II. Multisite test-retest reliability. Arch. Gen. Psychiatry, 49, 630–636. First, M.B., Gibbon, M., Spitzer, R.L. et al. (1997) Structured Clinical Interview for DSM-IV Personality Disorders, American Psychiatric Press, Washington, DC. Swinson, R.P., Soulios, C., Cox, B.J. et al. (1992) Brief treatment of emergency room patients with panic attacks. Am. J. Psychiatry, 149, 944–946. Kendler, K.S. and Roy, M.A. (1995) Validity of a diagnosis of lifetime major depression obtained by personal interview versus family history. Am. J. Psychiatry, 152, 1608–1614. 
Zimmerman, M., McDermut, W. and Mattia, J.I. (2000) Frequency of anxiety disorders in psychiatric outpatients with major depressive disorder. Am. J. Psychiatry, 157, 1337–1340. Spitzer, R.L., Williams, J.B.W., Kroenke, K. et al. (1994) Utility of a new procedure for diagnosing mental disorders in primary care: the PRIME-MD 1000 Study. J. Am. Med. Assoc., 272, 1749–1756.
SYMPTOM SCALES AND DIAGNOSTIC SCHEDULES IN ADULT PSYCHIATRY [132] Spitzer, R.L., Kroenke, K. and Williams, J.B.W. (1999) Validation and utility of a self-report version of PRIME-MD: the PHQ Primary Care Study. J. Am. Med. Assoc., 282, 1737–1744. [133] Tarlov, A.R., Ware, J.E., Greenfield, S. et al. (1989) The Medical Outcomes Study: an application of methods for monitoring the results of medical care. J. Am. Med. Assoc., 262, 925–930. [134] Wells, K.B., Stewart, A., Hays, R.D. et al. (1989) The functioning and well-being of depressed patients: results from the Medical Outcomes Study. J. Am. Med. Assoc., 262, 914–919. [135] Ware, J.E. and Sherbourne, C.D. (1992) The MOS 36-item short-form health survey (SF-36): 1. Conceptual framework and item selection. Med. Care, 30, 473–483. [136] Stewart, A.L., Hays, R.D. and Ware, J.E. (1988) The MOS Short-form general health survey: reliability and validity in a patient population. Med. Care, 26, 724–732. [137] Shepherd, M., Cooper, B., Brown, A.C. et al. (1966) Psychiatric Illness in General Practice, Oxford University Press, Oxford. [138] Goldberg, D.P. (1972) The Detection of Psychiatric Illness by Questionnaire: A Technique for the Identification and Assessment of Non-Psychotic Psychiatric Illness, Oxford University Press, London. [139] Goldberg, D.P., Cooper, B., Eastwood, M.R. et al. (1970) A standardized psychiatric interview for use in community surveys. Br. J. Prev. Soc. Med., 24, 18–23. [140] Goldberg, D.P. and Hillier, V.F. (1979) A scaled version of the General Health Questionnaire. Psychol. Med., 9, 139–145. [141] Goldberg, D.P., Rickels, K., Downing, R. et al. (1976) A comparison of two psychiatric screening tests. Br. J. Psychiatry, 129, 61–67. [142] Von Korff, M., Shapiro, S., Burke, J.D. et al. (1987) Anxiety and depression in a primary care clinic: comparison of Diagnostic Interview Schedule, General Health Questionnaire, and practitioner assessments. Arch. Gen. Psychiatry, 44, 152–156. 
[143] Hough, R.L., Landsverk, J.A. and Jacobson, G.F. (1990) The use of psychiatric screening scales to detect depression in primary care patients, in Depression in Primary Care: Screening and Detection (eds C.C. Attkisson and J.M. Zich), Routledge, New York, pp. 139–154. [144] Prince, R. and Miranda, L. (1977) Monitoring life stress to prevent recurrence of coronary heart disease episodes. Can. J. Psychiatry, 22, 161–169. [145] Mari, J.J. and Williams, P. (1984) Minor psychiatric disorder in primary care in Brazil: a pilot study. Psychol. Med., 14, 223–237.
[146] Piccinelli, M., Bisoffi, G., Bon, M.G. et al. (1993) Validity and test-retest reliability of the Italian version of the 12-item General Health Questionnaire in general practice: a comparison between three scoring methods. Compr. Psychiatry., 34, 198–205. [147] Schmitz, N., Kruse, J. and Tress, W. (1999) Psychometric properties of the General Health Questionnaire (GHQ-12) in a German primary care sample. Acta Psychiatr. Scand., 100, 462–468. [148] Furukawa, T.A., Goldberg, D.P., Rabe-Hesketh, S. et al. (2001) Stratum-specific likelihood ratios of two versions of the General Health Questionnaire. Psychol. Med., 31, 519–529. [149] Rizzo, R., Piccinelli, M., Mazzi, M.A. et al. (2000) The personal health questionnaire: a new screening instrument for detection of ICD-10 depressive disorders in primary care. Psychol. Med., 30, 831–840. [150] Wing, J.K., Birley, J.L.T., Cooper, J.E. et al. (1967) Reliability of a procedure for measuring and classifying ‘present psychiatric state’. Br. J. Psychiatry, 113, 499–515. [151] Zubin, J. and Fleiss, J. (1971) Current biometric approaches to depression, in Depression in the 1970’s: Modern Theory and Research (ed. R. Fieve), Excerpta Medica, Princeton, NJ, pp. 7–19. [152] World Health Organization (1973) International Pilot Study of Schizophrenia. World Health Organization, Geneva. [153] Wing, J.K., Mann, S.A., Leff, J.P. et al. (1978) The concept of a ‘case’ in psychiatric population surveys. Psychol. Med., 8, 203–217. [154] Brown, G.W. and Harris, T. (1978) Social Origins of Depression: A Study of Psychiatric Disorder in Women, The Free Press, New York. [155] Henderson, S., Duncan-Jones, P., Byrne, D.G. et al. (1979) Psychiatric disorder in Canberra: a standardized study of prevalence. Acta Psychiatr. Scand., 60, 355–374. [156] Bebbington, P., Hurry, J., Tennant, C. et al. (1981) Epidemiology of mental disorders in Camberwell. Psychol. Med., 11, 561–579. [157] Rogers, B. and Mann, S.A. 
(1986) The reliability and validity of PSE assessments by lay interviews: a national population survey. Psychol. Med., 16, 689–700. [158] Loranger, A.W., Sartorius, N., Andreoli, A. et al. (1994) The international personality disorder examination: the World Health Organization/Alcohol Drug Abuse, and mental health administration pilot study of personality disorders. Arch. Gen. Psychiatry, 51, 215–224. [159] Jablenski, A., Schwarz, R. and Tomov, T. (1980) WHO collaborative study on impairments and disabilities associated with schizophrenic disorders. Acta Psychiatr. Scand., 62 (Suppl. 285), 152–163.
217
CHAPTER 13 [160] Farmer, A.E., Katz, R., McGuffin, P. et al. (1987) A comparison between the Present State Examination and the Composite International Diagnostic Interview. Arch. Gen. Psychiatry, 44, 1064–1068. [161] Wittchen, H.U., Burke, J.D., Semler, G. et al. (1989) Recall and dating of psychiatric symptoms: testretest reliability of time-related symptom questions in a standardized psychiatric interview. Arch. Gen. Psychiatry, 46, 437–443. [162] Cottler, L.B., Robins, L.N., Grant, B.F. et al. Participants in the WHO/ADAMHA Field Trial (1991) The CIDI-core substance abuse and dependence questions: cross-cultural and nosological issues. Br. J. Psychiatry, 159, 653–658. [163] Wittchen, H.U., Robins, L.N., Cottler, L.B. et al. Participants in the Multicentre WHO/ADAMHA Field Trials (1991) Cross-cultural feasibility, reliability and sources of variance of the Composite International Diagnostic Interview (CIDI). Br. J. Psychiatry, 159, 645–653. [164] Rubio-Stipec, M., Canino, G., Robins, L.N. et al. Participants in the WHO/ADAMHA Field Trials (1993) The somatization schedule of the Composite International Diagnostic Interview: the use of the probe flow chart in 17 different countries. Int. J. Methods Psychiatr. Res., 3, 129–136. [165] Wittchen, H.U. (1994) Reliability and validity studies of the WHO-Composite International Diagnostic Interview (CIDI): a critical review. J. Psychiatr. Res., 28, 57–84. [166] Andrews, G. and Peters, L. (1998) The psychometric properties of the Composite International Diagnostic Interview. Soc. Psychiatry Psychiatr. Epidemiol., 33, 80–88. [167] Breslau, N., Kessler, R.C. and Peterson, E.L. (1998) Post-traumatic stress disorder assessment with a structured interview: reliability and concordance with a standardized clinical interview. Int. J. Methods Psychiatr. Res., 7, 121–127. [168] Peters, L. and Andrews, G. 
(1995) Procedural validity of the computerized version of the Composite International Diagnostic Interview (CIDIAuto) in the anxiety disorders. Psychol. Med., 25, 1269–1280. [169] Kessler, R.C., Andrews, G., Mroczek, D. et al. (1998b) The World Health Organization composite international diagnostic interview short-form (CIDISF). Int. J. Methods Psychiatr. Res., 7, 171–185. ¨ [170] Wittchen, H.U., Hofler, M., Gander, F. et al. (1999) Screening for mental disorders: performance of the Composite International Diagnostic Screener (CID-S). Int. J. Methods Psychiatr. Res., 8, 59–70. ¨ un, ¨ T.B., Costa e Silva, J.A. et al. [171] Sartorius, N., Ust (1993) An international study of psychological
218
[172]
[173]
[174]
[175]
[176]
[177]
[178] [179]
[180]
[181]
[182]
[183]
[184]
14 The National Comorbidity Survey (NCS) and its extensions

Ronald C. Kessler
Department of Health Care Policy, Harvard Medical School, Boston, MA, USA
14.1 Introduction

This chapter presents an overview of the research programme associated with the US National Comorbidity Survey (NCS) and its extensions. The baseline NCS, which was fielded in the autumn of 1990 and completed in the spring of 1992, was the first nationally representative mental health survey in the United States to use a fully structured research diagnostic interview to assess the prevalence and correlates of Diagnostic and Statistical Manual of Mental Disorders, 3rd edition revised (DSM-III-R) disorders. The baseline NCS respondents were re-interviewed in 2001–2003 (NCS-2) in order to study patterns and predictors of the course of mental and substance use disorders and to evaluate the effects of primary mental disorders in predicting the onset and course of secondary substance disorders. A National Comorbidity Survey Replication (NCS-R) was also carried out in conjunction with NCS-2 in a new national sample of 9282 respondents. The goals of the NCS-R were to study trends in a wide range of variables assessed in the baseline NCS and to obtain more information about a number of topics not covered in the baseline NCS. A survey of over 10 000 adolescents (NCS-A) was carried out in parallel with the NCS-R and NCS-2 surveys. The goal of the NCS-A was to produce nationally representative data on the prevalence and correlates of mental disorders among youth. Finally, the NCS-R was replicated in a number of countries around the world as part of the World Health Organization (WHO) World Mental Health (WMH) Survey Initiative [1]. This chapter presents a brief overview of each of these phases in the evolution of the NCS research programme.
14.2 The baseline NCS

14.2.1 Background and design

The need for a national survey on patterns and predictors of psychiatric disorders was noted nearly two decades ago in the report of the President’s Commission on Mental Health and Illness [2]. Such a survey could not be undertaken at that time, though, due to the absence of a structured research diagnostic interview capable of generating reliable psychiatric diagnoses in general population samples. Recognising this need, the National Institute of Mental Health (NIMH) funded the development of the Diagnostic Interview Schedule (DIS) [3], a research diagnostic interview that can be used by trained interviewers who are not clinicians. The DIS was first used in the Epidemiologic Catchment Area (ECA) Study, a landmark series of surveys that interviewed over 20 000 respondents in five local community samples. The ECA was the main source of data in the United States on the prevalence of psychiatric disorders and utilisation of services for these disorders during the decade between the early 1980s and the early 1990s [4–6]. The baseline NCS was designed to take the next step beyond the ECA [7] by carrying out a nationally representative survey of mental disorders. This was done by administering a face-to-face structured diagnostic
interview to a widely dispersed sample that was representative of all people living in households in the continental United States. The 8098 NCS respondents were selected from over 1000 neighbourhoods in over 170 counties distributed over 34 states.

The NCS diagnostic interview was a modification of the Composite International Diagnostic Interview (CIDI) [8], a state-of-the-art structured diagnostic interview based on the DIS. We deleted diagnoses known to have low prevalences in the ECA (e.g. obsessive–compulsive disorder, somatisation disorder). We also modified the CIDI in several ways based on extensive pilot tests [9, 10]. The most important of these modifications involved the diagnostic stem questions. Almost all CIDI diagnostic sections begin with a small number of questions that assess core features of the disorder. If these questions are answered positively, the respondent is asked a detailed series of follow-up questions about the disorder. If the stem questions are answered negatively, by contrast, the respondent is skipped to the next section. Our pilot work showed clearly that respondents quickly catch on to this stem–branch logic and sometimes deny stem questions in order to get through the interview more quickly. We addressed this problem by moving the diagnostic stem questions for all disorders into a separate lifetime review section that was administered before any other sections of the CIDI. We prefaced the administration of the lifetime review section with a preamble designed to motivate serious and honest responding [11]. A field experiment that randomised pilot test respondents to receive the CIDI either with or without this lifetime review section showed that use of this section resulted in a statistically significant and substantively important increase in the estimated prevalences of most DSM-III-R disorders.
A separate clinical validity study showed that this increase was due to a decrease in false negative diagnostic evaluations rather than to an increase in false positives [10].

Another NCS innovation was the use of a two-phase clinical reinterview design for complex cases. WHO CIDI field trials showed that most CIDI diagnoses have good inter-rater reliability, test–retest reliability and validity in comparison to blind clinician reinterviews in non-patient samples [12]. An important exception to this general pattern, however, is non-affective psychosis, which is diagnosed
with low reliability and validity in structured interviews like the CIDI. Based on this fact, and given the great public health importance of non-affective psychosis, the NCS included clinical reinterviews with respondents who reported any evidence of schizophrenia or other non-affective psychoses. These reinterviews were administered by experienced clinicians using an adapted version of the Structured Clinical Interview for DSM-III-R (SCID) [13], an instrument with demonstrated reliability in the diagnosis of schizophrenia [14]. The NCS diagnoses of schizophrenia and other non-affective psychoses are based on these clinical reinterviews rather than on the CIDI interviews [15]. As described below, this reliance on clinical reinterviews for diagnosis of complex cases was expanded in NCS-R.

A final noteworthy NCS innovation was the systematic evaluation of the relationship between survey non-response and diagnosis. Based on a concern that non-respondents might have considerably higher rates of some mental disorders than respondents, we carried out a systematic non-respondent survey in which a random subsample of non-respondents was contacted by specially trained refusal conversion interviewers and asked to complete a 10-minute screening interview. The screening interview was completed either face-to-face or over the telephone by approximately one-third of the non-respondents who were selected into this special subsample. Propensity score weighting that made use of the information about diagnostic stem question profiles obtained in these screening interviews was then used to adjust the sample for the under-representation of these initial refusers [16]. Analysis of response bias showed, interestingly, that failure to adjust for differential non-response led most importantly to an underestimation of the prevalence of anxiety disorders [17].
This occurred because anxious people were more reluctant than other people to allow a stranger into their homes, while they were willing to complete the screening once the option of telephone administration was offered.
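The direction of this bias can be illustrated with a minimal sketch of inverse-propensity weighting. This is not the NCS weighting procedure itself [16]: the response propensities and counts below are invented for illustration, and the function names are hypothetical. The sketch assumes only that each respondent is weighted by the inverse of an estimated probability of responding, so that people who were harder to recruit count for more in the prevalence estimate.

```python
def unweighted_prevalence(records):
    """Share of respondents with the disorder, ignoring non-response."""
    return sum(1 for has_disorder, _ in records if has_disorder) / len(records)


def weighted_prevalence(records):
    """Prevalence with each respondent weighted by 1 / response propensity."""
    total_weight = 0.0
    case_weight = 0.0
    for has_disorder, propensity in records:
        weight = 1.0 / propensity
        total_weight += weight
        if has_disorder:
            case_weight += weight
    return case_weight / total_weight


# Hypothetical sample (NOT NCS data): 20 respondents with an anxiety
# disorder who were reluctant to take part (assumed propensity 0.5) and
# 80 respondents without one (assumed propensity 0.8).
sample = [(True, 0.5)] * 20 + [(False, 0.8)] * 80

print(unweighted_prevalence(sample))          # 0.2
print(round(weighted_prevalence(sample), 3))  # 0.286
```

Because the anxious group is assumed to respond less often, the unadjusted estimate (20%) understates the adjusted one (about 29%), which is the same direction of bias as that described above.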
14.2.2 Illustrative findings

The number of NCS analyses is much too large to summarise in a single chapter. As a result, we present just a sample of results here in order to give the reader a flavour of the kinds of analyses carried out. A complete list of NCS publications can be found at the NCS web site: www.hcp.med.harvard.edu/ncs.
14.2.2.1 Lifetime and recent prevalence of DSM disorders

As reported in more detail elsewhere [7], the NCS found that DSM-III-R disorders are more prevalent than previously thought. The results in Table 14.1 show prevalence estimates for the 14 lifetime and 12-month disorders assessed in the core NCS interview. Lifetime prevalence is the proportion of the sample who ever experienced a disorder, while 12-month prevalence is the proportion who experienced the disorder at some time in the 12 months prior to the interview. The prevalence estimates in Table 14.1 are presented without exclusions for DSM-III-R hierarchy rules. The most common disorders are major depression and alcohol dependence. The next most common are social and simple phobias. As a group, substance use disorders and anxiety disorders are somewhat more prevalent than affective disorders, with approximately one in every four respondents reporting a lifetime substance use disorder and a similar number a lifetime anxiety disorder. Approximately one in every five respondents reported a lifetime affective disorder. Anxiety disorders, as a group, were
Table 14.1 Lifetime and 12-month prevalence of DSM-III-R disorders. Entries are % (SE).

                                 Male                    Female                  Total
Disorder                         Lifetime    12-mo       Lifetime    12-mo       Lifetime    12-mo
Mood disorders
  Major depression               12.7 (0.9)  7.7 (0.8)   21.3 (0.9)  12.9 (0.8)  17.1 (0.7)  10.3 (0.6)
  Mania                          1.6 (0.3)   1.4 (0.3)   1.7 (0.3)   1.3 (0.3)   1.6 (0.3)   1.3 (0.2)
  Dysthymia                      4.8 (0.4)   2.1 (0.3)   8.0 (0.6)   3.0 (0.4)   6.4 (0.4)   2.5 (0.2)
  Any mood disorder              14.7 (0.8)  8.5 (0.8)   23.9 (0.9)  14.1 (0.9)  19.3 (0.7)  11.3 (0.7)
Anxiety disorders
  Generalised anxiety disorder   3.6 (0.5)   2.0 (0.3)   6.6 (0.5)   4.3 (0.4)   5.1 (0.3)   3.1 (0.3)
  Panic disorder                 2.0 (0.3)   1.3 (0.3)   5.0 (1.4)   3.2 (0.4)   3.5 (0.3)   2.3 (0.3)
  Social phobia                  11.1 (0.8)  6.6 (0.4)   15.5 (1.0)  9.1 (0.7)   13.3 (0.7)  7.9 (0.4)
  Simple phobia                  6.7 (0.5)   4.4 (0.5)   15.7 (1.1)  13.2 (0.9)  11.3 (0.6)  8.8 (0.5)
  Agoraphobia without panic      3.5 (0.4)   1.7 (0.3)   7.0 (0.6)   3.8 (0.4)   5.3 (0.4)   2.8 (0.3)
  Any anxiety disorder           19.2 (0.9)  11.8 (0.6)  30.5 (1.2)  22.6 (0.1)  24.9 (0.8)  17.2 (0.7)
Substance use disorders
  Alcohol abuse                  12.5 (0.8)  3.4 (0.4)   6.4 (0.6)   1.6 (0.2)   9.4 (0.5)   2.5 (0.2)
  Alcohol dependence             20.1 (1.0)  10.7 (0.9)  8.2 (0.7)   3.7 (0.4)   14.1 (0.7)  7.2 (0.5)
  Drug abuse                     5.4 (0.5)   1.3 (0.2)   3.5 (0.4)   0.3 (0.1)   4.4 (0.3)   0.8 (0.1)
  Drug dependence                9.2 (0.7)   3.8 (0.4)   5.9 (0.5)   1.9 (0.3)   7.5 (0.4)   2.8 (0.3)
  Any substance use disorder     35.4 (1.2)  16.1 (0.7)  17.9 (1.1)  6.6 (0.4)   26.6 (1.0)  11.3 (0.5)
Other disorders
  Antisocial personality (ASP)a  4.8 (0.5)   –           1.0 (0.2)   –           2.8 (0.2)   –
  Non-affective psychosisb       0.3 (0.1)   0.2 (0.1)   0.7 (0.2)   0.4 (0.1)   0.5 (0.1)   0.3 (0.1)
  Any NCS disorder               48.7 (0.2)  27.7 (0.9)  47.3 (1.5)  31.2 (1.3)  48.0 (1.1)  29.5 (1.0)

a ASP was only assessed on a lifetime basis.
b Non-affective psychosis = schizophrenia, schizophreniform disorder, schizoaffective disorder, delusional disorder and atypical psychosis.
Standard errors (SE) are reported in parentheses.
considerably more likely to occur in the 12 months prior to interview than either substance use disorders or affective disorders, suggesting that anxiety disorders are more chronic than affective or substance disorders. The prevalence of other NCS disorders is considerably lower.

As shown in the last row of Table 14.1, 48.0% of the sample reported at least one lifetime disorder and 29.5% at least one disorder in the 12 months prior to the interview. While there is no meaningful sex difference in these overall prevalences, there are sex differences in prevalences of specific disorders. Consistent with previous research, men were much more likely to have substance use disorders and antisocial personality (ASP) than women, while women were much more likely to have anxiety disorders and affective disorders than men (with the exception of mania, for which there is no sex difference). The data also show, consistent with a trend found in the ECA [18], that women in the household population are more likely than men to have non-affective psychosis.

There was a good deal of scepticism about these results when they were first published. The main criticism was that the NCS prevalence estimates were higher than those found in the ECA and other epidemiological surveys based on the ECA methodology. However, clinical reappraisal studies in which clinicians blindly reinterviewed a sample of NCS respondents subsequently showed that the NCS estimates are accurate [10], suggesting that the ECA estimates are biased downwards. A later reanalysis of the ECA data found that ECA estimates can be adjusted for reporting bias to approximate the NCS estimates [19]. Methodological studies suggest that the life review section, mentioned earlier, is largely responsible for the more accurate estimates in the NCS than the ECA [10].
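The chronicity argument made earlier in this section can be checked against the total-sample figures in Table 14.1. The short sketch below divides 12-month prevalence by lifetime prevalence for each disorder group. Treating this ratio as a rough index of persistence is an assumption made here for illustration, and the variable and function names are hypothetical.

```python
# Total-sample prevalences (%) taken from Table 14.1: (lifetime, 12-month).
totals = {
    "Any mood disorder": (19.3, 11.3),
    "Any anxiety disorder": (24.9, 17.2),
    "Any substance use disorder": (26.6, 11.3),
}


def recency_ratio(lifetime, twelve_month):
    """Share of lifetime cases that were also cases in the past 12 months."""
    return twelve_month / lifetime


ratios = {
    name: round(recency_ratio(lifetime, twelve_month), 2)
    for name, (lifetime, twelve_month) in totals.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

On these figures roughly 69% of lifetime anxiety cases were also 12-month cases, compared with about 59% for mood disorders and 42% for substance use disorders, which is in line with the interpretation that anxiety disorders are the most persistent of the three groups.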
14.2.2.2 Age at onset

The NCS collected retrospective data on the ages of first onset of each lifetime disorder. Consistent with previous evidence [20], simple and social phobia were found to have much earlier ages at onset than the other disorders [21], with simple phobia often beginning during middle or late childhood and social phobia during late childhood or early adolescence. Substance abuse was found to have a typical age of onset during the late teens or early 20s. A substantial proportion of people with lifetime major depression and dysthymia also reported that their first episode occurred during the late teens or 20s. Some other disorders had later ages at onset, but the most striking overall impression from the data as a whole was that most psychiatric disorders have first onsets quite early in life.
14.2.2.3 Comorbidity

The ECA Study was the first survey to document that comorbidity is widespread not only among patients but also in the general population [6, 22]. Over 54% of ECA respondents with a lifetime history of at least one DSM-III disorder were found to have a second diagnosis. Fifty-two per cent of persons with a lifetime history of alcohol abuse or dependence received a second diagnosis and 75% of persons with lifetime drug abuse or dependence had a second diagnosis. Respondents with a lifetime history of at least one mental disorder in the ECA had relative odds of 2.3 of having a lifetime history of alcohol abuse or dependence and relative odds of 4.5 of having some other drug use disorder compared to respondents with no lifetime mental disorder.

Very similar patterns were found in the NCS. Fifty-six per cent of NCS respondents with a lifetime history of at least one DSM-III-R disorder also had one or more other disorders [7]. Fifty-two per cent of respondents with lifetime alcohol abuse or dependence also had a lifetime mental disorder, while 36% had a lifetime drug use disorder. Fifty-nine per cent of the respondents with a lifetime history of drug abuse or dependence also had a lifetime mental disorder and 71% had a lifetime alcohol use disorder.

More detailed analyses showed that lifetime comorbidities of specific pairs of disorders are very similar in the ECA and NCS surveys [23]. In both surveys, virtually all of the odds ratios (ORs) between pairs of lifetime disorders are greater than 1.0. This means that there is a positive association between the lifetime occurrences of almost all ECA and NCS disorders, demonstrating that comorbidity of psychiatric disorders is truly pervasive in the general population. There is considerable variation in the sizes of the ORs. This variation is systematic and quite consistent across the two surveys.
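For readers less familiar with the measure, the following minimal sketch shows how an odds ratio between two lifetime disorders is computed from a 2 × 2 cross-classification. The counts are invented for illustration and are not taken from the ECA or the NCS; the function name is hypothetical.

```python
def odds_ratio(both, only_a, only_b, neither):
    """Odds ratio for disorders A and B from a 2 x 2 table of counts.

    both    : respondents with A and B
    only_a  : respondents with A but not B
    only_b  : respondents with B but not A
    neither : respondents with neither disorder
    """
    return (both * neither) / (only_a * only_b)


# Hypothetical counts in which the two disorders cluster together.
print(odds_ratio(both=100, only_a=400, only_b=300, neither=7200))  # 6.0

# Hypothetical counts in which the two disorders are independent.
print(odds_ratio(both=50, only_a=450, only_b=450, neither=4050))   # 1.0
```

An OR of 1.0 corresponds to no association, so the finding that virtually all pairwise ORs exceed 1.0 means that having any one disorder raises the odds of having almost any other.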
14.2.2.4 Pure and comorbid lifetime disorders

It is of interest to look beyond simple two-variable associations for broader patterns of comorbidity among multiple disorders. The 48% of persons in the NCS who had a lifetime history of at least one DSM-III-R disorder was found to be made up of 21% with exactly one, 13% with exactly two and 14% with three or more disorders. Thinking of disorders as the unit of analysis, we found that only 21% of all lifetime disorders occurred in the subsample of respondents who had no lifetime comorbidity. The other 79% occurred in respondents with lifetime comorbidity. The vast majority of lifetime disorders, then, were comorbid disorders [7]. Furthermore, we found that over 50% of all lifetime disorders occurred in the 14% of the population with a history of three or more disorders. This highly comorbid segment of the population also accounted for close to 60% of all 12-month disorders and close to 90% of severe 12-month disorders. These results show that while psychiatric disorders are widespread in the general population, the major burden of psychopathology is concentrated among people with high comorbidity.
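The shift from persons to disorders as the unit of analysis can be made concrete with a short piece of arithmetic based only on the rounded percentages quoted above. The sketch is approximate because the published figures are rounded, and the variable names are hypothetical.

```python
# Per 100 people in the population (figures quoted in the text).
pct_one = 21         # exactly one lifetime disorder
pct_two = 13         # exactly two lifetime disorders
pct_three_plus = 14  # three or more lifetime disorders

# The text states that 21% of all lifetime disorders fall in the group
# with exactly one disorder. That group carries 21 disorders per 100
# people, so the total number of disorders per 100 people follows.
pure_share_of_disorders_pct = 21
total_disorders = pct_one * 100 / pure_share_of_disorders_pct  # 100.0

disorders_in_two_group = 2 * pct_two                           # 26
disorders_in_three_plus = total_disorders - pct_one - disorders_in_two_group

share_three_plus = disorders_in_three_plus / total_disorders
mean_in_three_plus = disorders_in_three_plus / pct_three_plus

print(share_three_plus)              # 0.53
print(round(mean_in_three_plus, 1))  # 3.8
```

Under these rounded inputs, about 53% of all lifetime disorders fall in the 14% of people with three or more disorders, consistent with the "over 50%" reported above, and the figures imply an average of just under four disorders per person in that group.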
14.2.2.5 Primary and secondary disorders

Given the importance of comorbidity, a question arises as to which disorders in comorbid sets have the earliest ages at onset. The results in Table 14.2 show that there was considerable variation across disorders in the NCS in the probability of being the first lifetime disorder. Simple phobia, social phobia, alcohol abuse and conduct disorder were the only disorders considered in the NCS where the majority of lifetime cases were temporally primary. In general, anxiety disorders were most likely to be temporally primary, with 82.8% of NCS respondents having one or more anxiety disorders reporting that one of these was their first lifetime disorder compared to 71.1% of those with conduct disorder, 43.8% of those with an affective disorder and 48.1% of those with a substance use disorder. Results in the third column of Table 14.2 show the per cent of overall respondents who reported each disorder as temporally primary. Anxiety disorders, again, were more likely to be temporally primary (45.3% of all lifetime cases) than either affective disorders (16.4%), substance use disorders (24.5%) or other disorders (19.5%).

Information about age at onset was used to study the time-lagged effects of earlier disorders in predicting the subsequent onset of secondary disorders using a discrete-time survival analysis approach. This work showed clearly that early-onset anxiety disorders are the most important primary disorders in terms of predicting later disorders [24]. Interestingly, while most of these effects are only associated with active disorders, there are others that are also associated with remitted disorders. For example, respondents with a history of early-onset panic attacks have an elevated risk of secondary major depression throughout the majority of their adulthood even if their panic attacks occurred exclusively many years in the past [25]. Results such as these suggest that some early-onset anxiety disorders are risk markers rather than direct causes of secondary disorders.
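The discrete-time survival approach mentioned above can be sketched as follows. Retrospective age-at-onset reports are expanded into one record per person per year of life, each flagged for prior exposure to the primary disorder and for onset of the secondary disorder in that year. The four respondents below are invented, the function names are hypothetical, and the sketch simply compares crude yearly hazards; a full analysis would normally fit a logistic regression to the person-year file with controls for age and other covariates.

```python
def person_years(onset_primary, onset_secondary, age_at_interview):
    """Expand one respondent into yearly (age, exposed, event) records.

    Follow-up stops at the age of onset of the secondary disorder,
    or at the age at interview if it never occurred (censoring).
    """
    last_age = onset_secondary if onset_secondary is not None else age_at_interview
    rows = []
    for age in range(1, last_age + 1):
        # Exposed only in years AFTER the primary disorder began.
        exposed = onset_primary is not None and onset_primary < age
        event = onset_secondary == age
        rows.append((age, exposed, event))
    return rows


def yearly_hazards(respondents):
    """Crude hazard of secondary onset per person-year, by exposure."""
    counts = {True: [0, 0], False: [0, 0]}  # exposed -> [events, person-years]
    for respondent in respondents:
        for _, exposed, event in person_years(*respondent):
            counts[exposed][0] += int(event)
            counts[exposed][1] += 1
    return {key: events / years for key, (events, years) in counts.items() if years}


# Hypothetical respondents: (age at onset of primary disorder,
# age at onset of secondary disorder, age at interview); None = never.
respondents = [(10, 20, 30), (None, None, 30), (12, None, 25), (None, 25, 40)]

hazards = yearly_hazards(respondents)
print(hazards[True], hazards[False])  # 1/23 versus 1/77 per person-year
```

In this toy data set the yearly hazard of the secondary disorder is about three times higher in exposed than in unexposed person-years, which is the kind of time-lagged association the NCS analyses estimated.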
14.2.2.6 The societal costs of mental disorders

Epidemiologists have traditionally been much more concerned with the causes than with the consequences of the illnesses they study. However, the rise of cost-effectiveness analysis as a tool for allocating health care resources has led to a dramatic increase in research on the adverse consequences of untreated chronic conditions and the benefits of treatment [26]. The NCS analyses consequently included an investigation of the adverse consequences of mental disorders. Consistent with the Rand Medical Outcomes Study [27], we found that mental disorders have adverse effects on role functioning that equal or exceed the effects of most chronic physical conditions [28]. Data from clinical trials on the reversibility of these role impairments, when combined with NCS data on the costs of work-related role impairments to employers, suggest that the cost savings due to increased work productivity might well make it cost-effective for employers to develop aggressive screening, outreach and treatment programmes for employees with some mental disorders [11]. This is an issue that is being examined in much more detail in the NCS-R and the other WMH surveys [29, 30]. NCS analyses also found that the early age at onset of mental disorders led them to have much
Table 14.2 Percent and distribution of temporally primary NCS/DSM-III-R disorders.

                                 Percent temporally primary   Distribution of
                                 among those having the       temporally primary
                                 disorder, % (SE)             disorder, % (SE)
Mood disorders
  Major depression               41.1 (2.7)                   13.4 (0.9)
  Dysthymia                      37.7 (3.1)                   4.8 (0.5)
  Mania                          20.2 (6.0)                   0.7 (0.2)
  Any mood disorder              43.8 (2.4)                   16.4 (0.9)
Anxiety disorders
  Generalised anxiety disorder   37.0 (2.9)                   3.6 (0.4)
  Panic disorder                 23.3 (3.2)                   1.6 (0.2)
  Social phobia                  63.1 (2.0)                   16.0 (0.9)
  Simple phobia                  67.6 (2.7)                   14.5 (1.0)
  Agoraphobia                    45.2 (4.0)                   5.9 (0.7)
  Posttraumatic stress disorder  52.1 (3.0)                   7.5 (0.7)
  Any anxiety disorder           82.8 (1.3)                   45.3 (1.4)
Substance use disorders
  Alcohol abuse                  57.0 (2.3)                   10.2 (0.6)
  Alcohol dependence             36.8 (3.1)                   9.9 (0.6)
  Drug abuse                     39.7 (3.0)                   3.4 (0.3)
  Drug dependence                20.8 (2.5)                   3.0 (0.3)
  Any substance use disorder     48.1 (1.6)                   24.5 (1.0)
Other disorders
  Conduct disorder               71.1 (2.0)                   17.7 (1.0)
  Adult antisocial behaviour     14.0 (1.8)                   1.4 (0.2)
  Non-affective psychosis        28.8 (5.6)                   0.4 (0.1)

SE, standard error; NCS, National Comorbidity Survey. All disorders are operationalised using DSM-III-R criteria ignoring diagnostic hierarchy rules.
greater effects than physical disorders on critical life course transitions such as educational attainment, teen childbearing, the timing and stability of marriage and early career decisions [31–33]. These adverse effects typically occur as part of a cascade of events as a result of the onset of serious early-onset mental disorders. People with this complex pile-up of emotional and psychosocial difficulties typically do not seek professional mental health treatment until at least a decade after the onset of their first mental disorder. It is consequently of great importance to develop aggressive outreach and treatment programmes for young people with mental disorders. This is a topic of central importance in the NCS-A survey.
14.2.2.7 Treatment

Although only 4 out of every 10 NCS respondents with a lifetime history of at least one DSM-III-R disorder reported ever obtaining professional treatment, a survival analysis that compared age at onset with time to treatment suggested that the vast majority of people with persistent mental illness eventually seek treatment [34]. Delays in initial help-seeking, however, are pervasive, with the median time between first lifetime onset of a mental illness and first treatment contact greater than a decade. Importantly, delays in seeking treatment are inversely related to age at onset, with child and adolescent onsets being associated with the lowest probabilities of ever seeking treatment. This is a critical finding because early-onset disorders are the ones most likely to promote comorbidity and adverse life course consequences. On a more positive note, analysis of retrospective NCS data suggests that rates of treatment-seeking increased over the four decades of historical time retrospectively assessed in the NCS. This presumably reflects a combination of increases in access to care, in awareness that mental illness is treatable and in attitudes conducive to seeking care.
14.2.2.8 Primary prevention of secondary disorders

One question suggested by the NCS results is whether early treatment of pure child-onset or adolescent-onset mental disorders would result in a reduction in the percentage of people who go on to develop comorbid mental disorders and, if so, whether it would also lead to a reduction in the persistence and adverse social consequences of primary mental disorders. We do not know the answer because no large-scale controlled study has ever attempted to screen and treat a representative sample of children or adolescents with mental disorders and then follow them over time to document the long-term effects of treatment. Given the high prevalences and enormous personal and societal costs of mental disorders, such an investigation should be undertaken.

An issue of special interest in the current social policy arena is the prevention of adolescent substance disorder. Current federal policy on substance abuse prevention emphasises a combination of strategies that focus on reduction in access to drugs and unproven school-based primary prevention programmes, such as DARE, that ignore the fact that the majority of adolescent substance abusers have a primary mental disorder [35, 36]. Policy simulations based on the NCS data suggest that a more cost-effective strategy would be to develop outreach and treatment programmes for youngsters with early-onset mental disorders that predispose them to substance abuse. In addition to sharply reducing the proportion of youth who become substance abusers, such an effort could have a powerful preventive effect on subsequent adult serious mental disorder.
14.3 The NCS follow-up survey (NCS-2)

14.3.1 Design and rationale

The NCS-2 was designed with the explicit purpose of providing an epidemiological foundation for early intervention programmes of the sort just described. While the baseline NCS simulations suggested that early primary mental disorders are important predictors of the subsequent onset and course of secondary mental and substance disorders, these results are based on retrospective reports about age at onset. The 10-year follow-up data in the NCS-2 were designed to determine whether these retrospective results hold up prospectively. This was done using a life chart approach to assess onset and course of disorders during the decade between the baseline NCS and the NCS-2.

The life chart method, pioneered by Freedman and her colleagues [37], provides respondents with a paper calendar covering the recall period that includes notations of important historical events in an effort to create memory anchors. Respondents are also asked to include personal memory anchors in the calendar to further enhance the accuracy of dating. Life charting was facilitated in the NCS-2 by the use of laptop computerised interviews that included a customised preloaded data file for each respondent based on baseline NCS reports. Respondents with a history of a particular disorder as of the baseline NCS were asked to chart the course of that disorder during the decade since the baseline NCS, while respondents with no history of the disorder as of the baseline NCS were asked about subsequent onsets and, if onsets occurred, about the course of the disorder after the time of onset. A similar procedure was used by Eaton and his associates in a 13-year follow-up of the Baltimore ECA sample [38].

In addition to charting the course of mental disorders, the NCS-2 charted major role transitions in education, marriage, childbearing and work that might play a part in influencing the onset and course of mental and substance disorders.
Major stressor events and difficulties were also charted using a structured version of the Brown and Harris (1978) Life Events and Difficulties system [39]. Charting was done separately for each year across the decade between the two interviews and for each month in the 12 months prior to the NCS-2 interview.
14.3.2 Illustrative findings

14.3.2.1 Primary and secondary disorders

As noted above, the NCS analyses investigated the distinction between temporally primary and secondary disorders by using retrospective age of onset reports. We were able to replicate and extend these analyses in the NCS-2 panel data using prospective information about age of onset. A good example of this line of analysis concerns the relationship between major depressive episode (MDE) and generalised anxiety disorder (GAD). Although MDE and GAD are known to be highly comorbid and to share most, if not all, of their genetic determinants [40], little prospective research has examined whether these two disorders predict the subsequent first onset or persistence of the other or the extent to which other predictors explain the time-lagged associations between GAD and MDE. An analysis of these issues in the NCS-2 showed that while baseline MDE significantly predicted subsequent GAD onset but not persistence, baseline GAD significantly predicted not only subsequent MDE onset but also the persistence of MDE [41]. The associations of each disorder with the subsequent onset of the other attenuated with time since onset of the temporally primary disorder, but remained significant for over a decade after this onset. We also found that baseline risk factors for onset and persistence varied somewhat between the two disorders. These results argue against the view of some that the two disorders are merely different manifestations of a single underlying internalising syndrome or that GAD is merely a prodrome, residual or severity marker of MDE.
14.3.2.2 Targeted risk factors for disorder onset and progression

Another kind of prospective analysis carried out in the NCS-2 panel focused on baseline (NCS) predictors of the subsequent onset and progression of various other disorders. A good illustration of this work concerns substance disorders, where data were obtained in both the NCS and NCS-2 on use, abuse and dependence. It was possible to study patterns and prospective predictors of the first onset of substance use, of the transition from use to abuse, of the transition from abuse to dependence and of the predictors of persistence versus recovery from abuse and dependence in ways that replicated earlier analyses in the NCS that used retrospective age of onset reports to mimic prospective data [42]. These analyses showed clearly that many of the previously documented risk factors for substance dependence are, in fact, risk factors only for one or two transitions. For example, the well-known finding that women have lower rates of alcohol and drug dependence than men was shown to be largely due to lower rates of ever starting to use among women than men, with very little evidence that women differ from men in the probability of progressing from use to abuse or from abuse to dependence.
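The logic of separating these transitions can be illustrated with a minimal sketch that breaks the overall rate of dependence into three conditional steps. The counts are invented so that the two groups differ only in initiation; they are not NCS or NCS-2 estimates, and the function name is hypothetical.

```python
def transition_rates(n_total, n_use, n_abuse, n_dependence):
    """Conditional probability of each step from use to dependence."""
    return {
        "ever use": n_use / n_total,
        "use -> abuse": n_abuse / n_use,
        "abuse -> dependence": n_dependence / n_abuse,
        "dependence overall": n_dependence / n_total,
    }


# Hypothetical counts per 1000 respondents in each group.
men = transition_rates(n_total=1000, n_use=600, n_abuse=150, n_dependence=60)
women = transition_rates(n_total=1000, n_use=400, n_abuse=100, n_dependence=40)

for step in men:
    print(f"{step}: men {men[step]:.2f}, women {women[step]:.2f}")
```

In this constructed example the overall rate of dependence is lower among women (4% versus 6%), yet the probabilities of moving from use to abuse and from abuse to dependence are identical in the two groups. The whole difference arises at the first step, which is the pattern described above.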
14.3.2.3 Persistence of disorders and syndromes

The NCS-2 was also used to study patterns and correlates of disorder persistence. One of the most fascinating of these studies focused on suicidality, including suicidal ideation, plans and attempts. Substantial persistence of suicidality was found over the decade between the two interviews, with over one-third of respondents who had a baseline history of suicide ideation continuing to have suicide ideation at some time over the intervening decade [43]. Indeed, the strongest predictors of later suicidality were measures of baseline suicidality. Nonetheless, a number of additional baseline predictors were found both of new first onsets of suicidality and of persistence of suicidality that have important implications for targeting interventions. Importantly, we found that even though mental disorders are powerful predictors of suicidality and the vast majority of suicidal people have a pre-existing mental disorder, the main impact of mental disorders is in predicting the onset of suicide ideation, while other factors determine the transition from ideation to plans and attempts [44].
14.3.2.4 Disorder progression An important line of investigation in the NCS-2 panel has focused on disorder progression, with a special
emphasis on severity. This work was motivated by the fact that several restrictive definitions have been proposed to narrow the number of people qualifying for treatment of mental disorders. For example, a number of health plans restrict mental health coverage to the subset of DSM disorders that they consider to be ‘biologically-based’. A team of researchers from the American Psychiatric Association has argued that disorders currently classified as mild in the DSM-IV system should be excluded altogether from future diagnostic systems [45, 46]. This suggestion has important implications not only for the definition of current unmet need for treatment but also for current research and consideration of future treatment needs. Research shows that many syndromes currently defined as mental disorders are extremes on continua that appear not to have meaningful thresholds [47, 48]. These results are important in at least two ways. First, exploration of the full continua rather than the currently established diagnostic thresholds might yield greater power in studies of genetic and environmental risk factors [49]. Second, development of early interventions to prevent progression along a given severity continuum might reduce the prevalence of serious cases [50]. Removal of current mild cases from the DSM system would undercut both of these advantages as well as distort the reality that mental disorders, like physical disorders, vary widely in seriousness [51, 52]. In an effort to investigate this issue empirically, we examined the associations of baseline NCS 12-month illness severity with clinically significant outcomes assessed a decade later in the NCS-2. Twelve-month baseline NCS disorders were disaggregated into 3.2% severe, 3.2% serious, 8.7% moderate and 16.0% mild.
All four categories were associated with significantly elevated risk of the NCS-2 outcomes compared to baseline non-cases, with ORs of any outcome ranging monotonically from 2.4 (95% CI: 1.6–3.4) to 15.1 (95% CI: 10.0–22.9) for mild to severe cases (Table 14.3). ORs comparing mild to moderate cases were generally non-significant. The existence of a graded relationship between mental illness severity and later clinical outcomes has important implications for the decision whether or not to retain mild cases in the DSM. Retention of these cases would help represent the fact that mental disorders, like physical disorders, vary in severity and that
decisions about treating mild cases should include recognition that treatment of mild cases might prevent a substantial proportion of future serious cases.
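The odds-ratios quoted above are obtained by exponentiating logistic regression coefficients, so the reported intervals are roughly symmetric on the log scale. The following is a minimal sketch of that arithmetic using only the ‘Any’ outcome figures reported in Table 14.3; the function names are illustrative and the normal-approximation constant 1.96 is an assumption about how the 95% intervals were formed.

```python
import math

Z = 1.96  # assumed normal quantile for a two-sided 95% interval


def log_or_and_se(or_point, ci_low, ci_high):
    """Recover an approximate logistic coefficient and its SE from an OR and 95% CI."""
    beta = math.log(or_point)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z)
    return beta, se


def unadjusted_or(pct_group, pct_reference):
    """Crude odds-ratio computed directly from two prevalences given in percent."""
    p1, p0 = pct_group / 100, pct_reference / 100
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


# Severe cases, 'Any' outcome: OR 15.1 (95% CI 10.0-22.9)
beta, se = log_or_and_se(15.1, 10.0, 22.9)
low, high = math.exp(beta - Z * se), math.exp(beta + Z * se)

# Crude OR from the unadjusted prevalences: 42.4% (severe) vs 4.5% (non-cases)
crude = unadjusted_or(42.4, 4.5)

print(f"beta={beta:.2f}, se={se:.2f}, rebuilt CI=({low:.1f}, {high:.1f})")
print(f"crude OR={crude:.1f} vs adjusted OR=15.1")
```

The crude value is close to, but not identical with, the published OR because the published estimate also controls for age and sex.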
Table 14.3 The associations (odds-ratios) between baseline (1990–1992) NCS/DSM-III-R illness severity and NCS-2 (2000–2002) outcomes (n = 4375)a.

| Baseline severity | Hospitalisation % | Hospitalisation OR (95% CI) | Work disability % | Work disability OR (95% CI) | Suicide attempt % | Suicide attempt OR (95% CI) | SMI % | SMI OR (95% CI) | Any % | Any OR (95% CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Severe | 23.8 | 29.7∗ (16.9–52.1) | 6.1 | 5.6∗ (2.2–14.4) | 8.0 | 11.7∗ (4.5–30.4) | 28.9 | 15.4∗ (9.9–24.0) | 42.4 | 15.1∗ (10.0–22.9) |
| Serious | 9.7 | 10.1∗ (4.8–21.3) | 1.7 | 1.5 (0.5–4.3) | 5.0 | 6.1∗ (3.0–12.5) | 22.1 | 10.6∗ (6.0–18.5) | 30.8 | 8.8∗ (5.7–13.6) |
| Moderate | 3.0 | 3.0∗ (1.7–5.4) | 1.4 | 1.3 (0.4–3.6) | 2.2 | 2.9∗ (1.2–7.4) | 13.2 | 5.6∗ (3.7–8.4) | 16.4 | 3.8∗ (2.7–5.5) |
| Mild | 2.9 | 2.7∗ (1.5–4.9) | 1.5 | 1.3 (0.4–3.2) | 1.6 | 2.0 (0.8–4.9) | 6.1 | 2.6∗ (1.8–3.8) | 9.9 | 2.4∗ (1.6–3.4) |
| Non-cases | 1.0 | 1.0 (–) | 1.0 | 1.0 (–) | 0.7 | 1.0 (–) | 2.5 | 1.0 (–) | 4.5 | 1.0 (–) |
| χ2 (4 df) | | 152.1∗ | | 17.0∗ | | 40.4∗ | | 194.0∗ | | 202.8∗ |

a Entries in the % columns are unadjusted prevalences of the NCS-2 outcomes in sub-samples defined by baseline 12-month NCS/DSM-III-R disorder severity. Entries in the OR and (95% CI) columns are odds-ratios and design-corrected 95% confidence intervals obtained by exponentiating multiple logistic regression coefficients in equations that simultaneously included dummy variables for the baseline disorder severity categories and controls for age and sex to predict the NCS-2 outcomes. This table appeared previously in Kessler, R.C., Merikangas, K.R., Berglund, P., Eaton, W.W., Koretz, D., Walters, E.E. (2003). Mild disorders should not be eliminated from the DSM-5. Archives of General Psychiatry 60 (11), 1117–1122 [53]. © 2003 American Medical Association. All rights reserved. Used with permission.
∗ Significant at the 0.05 level, two-sided test.

14.4 The NCS replication survey (NCS-R) 14.4.1 Design and rationale As noted above, the NCS-R was carried out to study trends in a wide range of variables assessed in the baseline NCS and to obtain more information about a number of topics either not covered in the baseline NCS or covered in less depth than we currently desire. A new sample of 9282 adult respondents was interviewed in the same nationally representative sampling segments as the baseline NCS. There was also an update of new segments to adjust for population shifts over the decade between the two surveys. The NCS-R interview repeated many of the questions assessed in the baseline NCS for purposes of trending. New questions were also asked to expand old topics as well as to add new topics of investigation. The recruitment procedures and materials were identical to the baseline NCS. As in the baseline, interviews were carried out face-to-face in the homes of respondents. A complication in studying trends is that diagnoses in the baseline NCS were based on DSM-III-R [54] criteria, while diagnoses in the NCS-R were based on DSM-IV [55] criteria. The CIDI was used in both surveys, but it proved to be impossible to revise the version of CIDI used in the NCS-R to repeat all the DSM-III-R questions from the baseline survey as well as include the new questions needed to operationalise the new DSM-IV criteria. As in the baseline NCS, clinical reappraisal interviews in the NCS-R documented good concordance and conservative prevalence estimates compared with blinded clinician diagnoses [10, 56]. Because DSM-III-R and DSM-IV criteria differ too greatly to justify direct comparisons of prevalence, trend analysis was based on a re-calibration of both surveys to a common summary severity rating developed in the NCS-R and then imputed to the NCS. This severity rating is described in detail elsewhere [57]. In brief, a serious 12-month disorder
was defined as either: meeting 12-month criteria for schizophrenia, any other non-affective psychosis, bipolar I or II disorder or substance dependence with a physiological dependence syndrome; making a suicide attempt or having a suicide plan in conjunction with any NCS-R/DSM-IV disorder; reporting severe role functioning in two or more areas of life from among the four assessed (family, friends, work, household maintenance); or reporting functional impairment associated with a mental disorder at a level consistent with a Global Assessment of Functioning (GAF) [58] score of 50 or less. Respondents whose disorder did not meet criteria for being serious were classified moderate or mild based on responses to the disorder-specific Sheehan Disability Scales [59]. The imputation of severity scores to NCS cases was based on logistic regression equations estimated in the NCS-R that used symptom measures available in both surveys to predict: (i) serious disorder vs. all other respondents; (ii) serious–moderate disorder vs. all other respondents and (iii) any disorder vs. no disorder. Prediction accuracy was good in all three equations (AUC = 0.73 for serious, 0.84 for serious–moderate and 0.78 for any disorder). The coefficients in these equations were used to generate predicted probabilities for each NCS and NCS-R respondent for each nested outcome, which, in turn, were used to impute discrete scores on the severity scale.
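One way to picture the last step, turning three nested predicted probabilities into a single discrete severity score, is sketched below. This is an illustrative assumption and not the published procedure: the text does not say how the discrete scores were drawn, so the sketch simply differences the nested probabilities to get category probabilities and makes one random draw per respondent. All function names are hypothetical.

```python
import random

CATEGORIES = ("serious", "moderate", "mild", "none")


def severity_probabilities(p_serious, p_serious_moderate, p_any):
    """Convert nested probabilities (serious <= serious-moderate <= any) to category probabilities."""
    # Separately fitted equations can violate the nesting slightly, so enforce it.
    p1 = p_serious
    p2 = max(p_serious_moderate, p1)
    p3 = max(p_any, p2)
    return {"serious": p1, "moderate": p2 - p1, "mild": p3 - p2, "none": 1.0 - p3}


def impute_severity(probs, rng):
    """Draw one discrete severity category from the category probabilities."""
    u = rng.random()
    cumulative = 0.0
    for category in CATEGORIES:
        cumulative += probs[category]
        if u < cumulative:
            return category
    return "none"  # guard against floating-point shortfall in the cumulative sum


# Hypothetical respondent with predicted probabilities from the three equations
probs = severity_probabilities(0.05, 0.20, 0.30)
rng = random.Random(2003)
draws = [impute_severity(probs, rng) for _ in range(5)]
print(probs, draws)
```

Repeating the draw several times per respondent would give multiply imputed data sets of the kind referred to in the notes to Table 14.4.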
14.4.2 Illustrative findings 14.4.2.1 Trends in the prevalence of DSM disorders Twelve-month prevalence estimates of DSM-IV disorders did not differ significantly across surveys, with a 29.4% estimated prevalence of any 12-month disorder in the NCS (1990–1992) and a 30.5% estimate in the NCS-R (2001–2003; p = 0.52). No significant change was found either in serious (5.3 vs. 6.3%, p = 0.27), moderate (12.3 vs. 13.5%, p = 0.30) or mild (11.8 vs. 10.8%, p = 0.37) disorders [60]. No statistically significant interactions were found between time and sociodemographics, suggesting that the overall lack of significant trend is not due to opposite-sign trends in major subsets of the population. We also looked at trends in
12-month suicidality and found no significant changes either in suicidal ideation (2.8–3.3%), plans (0.7–1.0%), gestures (0.3−0.2%) or attempts (0.4–0.6%) [61].
14.4.2.2 Trends in treatment Prevalence of 12-month treatment for emotional problems, in comparison, was found to change dramatically in the decade between the two surveys, from 12.2% in the NCS to 20.1% in the NCS-R [60] (Table 14.4). The association between severity and treatment was positive and significant (p < 0.001), although substantively modest in the pooled data (with a Pearson’s Contingency Coefficient (C) of 0.14), and did not differ significantly over time. Only a minority of respondents with serious disorders received treatment (24.3% in the NCS and 40.5% in the NCS-R). Approximately half of patients who received treatment had none of the disorders considered here. Trends in sector-specific treatment (psychiatric treatment, other speciality mental health treatment, general medical treatment, human services treatment, complementary-alternative medical treatment) were similar to overall trends in two respects. First, severity was significantly related to treatment in each sector (p < 0.001). Second, these associations did not change over time (p = 0.40–0.98). A significant difference in treatment trends was found across sectors (p < 0.001). General medical treatment increased most dramatically (from 3.9 to 10.0%), psychiatrist treatment (from 2.4 to 5.2%) and other mental health treatment (from 5.3 to 8.4%) less dramatically, human services treatment only modestly (from 2.6 to 3.5%) and complementary–alternative medical treatment decreased (from 3.3 to 2.7%). A distributional shift in treatment occurred because of these within-sector differences. Most significantly, general medical treatment changed from 31.5 to 49.6% of all treatment. This distributional increase, importantly, did not vary by severity, which means that more and more people with mental disorders of all severity levels are seeing a general medical doctor for treatment. 
This trend has important implications for treatment quality, as the NCS-R showed clearly that treatment quality is lower for patients treated in general medical than speciality settings [62].
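Two of the summary figures in this subsection can be approximately reproduced from the entries in Table 14.4. The sketch below is a rough check under stated assumptions, not the authors' computation: it plugs the pooled severity χ2 for any treatment (194.6) and the combined sample size into the textbook formula for Pearson's contingency coefficient, and it takes the general medical share as the simple ratio of two published percentages.

```python
import math


def contingency_coefficient(chi2, n):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))


def share_of_all_treatment(sector_pct, any_pct):
    """Sector prevalence expressed as a percentage of any-treatment prevalence."""
    return 100.0 * sector_pct / any_pct


n_pooled = 5388 + 4319                      # NCS + NCS-R respondents in Table 14.4
c = contingency_coefficient(194.6, n_pooled)

gm_share_ncs = share_of_all_treatment(3.9, 12.2)     # NCS totals
gm_share_ncsr = share_of_all_treatment(10.0, 20.1)   # NCS-R totals

print(f"C = {c:.2f}")
print(f"GM share: {gm_share_ncs:.1f}% (NCS), {gm_share_ncsr:.1f}% (NCS-R)")
```

The shares computed this way (about 32% and 50%) are close to the 31.5% and 49.6% quoted in the text; small gaps of this kind are expected from rounding and because, as the notes to Table 14.4 explain, multiply imputed ratios are averaged within pseudo-samples.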
Table 14.4 Twelve-month treatment of DSM-IV disorders by severity and sector among NCS (n = 5388) and NCS-R (n = 4319) respondents ages 18–54.

| Severity / sector (a) | Any | PSY | OMH | GM | HS | CAM |
|---|---|---|---|---|---|---|
| NCS (1990–1992) (b), % (SE) | | | | | | |
| Serious | 24.3 (3.8) | 7.3 (2.2) | 11.4 (2.5) | 8.2 (3.0) | 4.5 (1.9) | 8.4 (1.9) |
| Moderate | 25.4 (2.4) | 5.8 (1.2) | 13.6 (1.6) | 8.6 (1.4) | 5.5 (1.1) | 7.1 (1.2) |
| Mild | 13.3 (2.4) | 2.5 (1.2) | 4.9 (1.3) | 4.3 (1.4) | 3.0 (1.2) | 3.0 (0.8) |
| Any | 20.3 (1.5) | 4.8 (0.8) | 9.7 (1.0) | 6.8 (1.0) | 4.3 (0.7) | 5.7 (0.7) |
| None | 8.8 (0.7) | 1.4 (0.3) | 3.5 (0.4) | 2.6 (0.4) | 1.9 (0.3) | 2.3 (0.3) |
| Total | 12.2 (0.6) | 2.4 (0.3) | 5.3 (0.3) | 3.9 (0.4) | 2.6 (0.3) | 3.3 (0.3) |
| NCS-R (2001–2003) (b), % (SE) | | | | | | |
| Serious | 40.5 (4.7) | 14.4 (3.3) | 19.4 (3.5) | 22.1 (3.5) | 6.5 (1.6) | 6.2 (1.5) |
| Moderate | 37.2 (3.0) | 13.0 (1.6) | 15.8 (1.8) | 19.5 (2.4) | 5.5 (1.2) | 4.6 (1.0) |
| Mild | 23.0 (3.8) | 5.1 (1.3) | 9.0 (2.2) | 11.8 (2.9) | 3.9 (1.5) | 2.9 (0.9) |
| Any | 32.9 (2.0) | 10.5 (1.0) | 14.1 (1.3) | 17.3 (1.3) | 5.1 (0.8) | 4.3 (0.6) |
| None | 14.5 (0.9) | 2.9 (0.4) | 5.9 (0.6) | 6.8 (0.6) | 2.7 (0.4) | 1.9 (0.3) |
| Total | 20.1 (0.8) | 5.2 (0.3) | 8.4 (0.5) | 10.0 (0.5) | 3.5 (0.3) | 2.7 (0.3) |
| NCS-R : NCS (c), RR (SE) | | | | | | |
| Serious | 1.7 (0.4) | 2.0 (0.8) | 1.7 (0.5) | 2.9 (1.3) | 1.5 (0.7) | 0.7 (0.2) |
| Moderate | 1.5∗ (0.2) | 2.3∗ (0.6) | 1.2 (0.2) | 2.3∗ (0.5) | 1.0 (0.3) | 0.6 (0.2) |
| Mild | 1.7∗ (0.4) | 2.2 (1.1) | 1.8 (0.6) | 2.8 (1.0) | 1.3 (0.6) | 1.0 (0.4) |
| Any | 1.6∗ (0.2) | 2.2∗ (0.4) | 1.5∗ (0.2) | 2.6∗ (0.4) | 1.2 (0.2) | 0.8 (0.1) |
| None | 1.6∗ (0.2) | 2.0∗ (0.5) | 1.7∗ (0.3) | 2.6∗ (0.5) | 1.4 (0.3) | 0.9 (0.2) |
| Total | 1.6∗ (0.1) | 2.2∗ (0.3) | 1.6∗ (0.2) | 2.6∗ (0.3) | 1.3 (0.2) | 0.8 (0.1) |
| Statistical significance (d), χ2 (p) | | | | | | |
| Severity (S) | 194.6 (0.000) | 112.2 (0.000) | 118.1 (0.000) | 105.3 (0.000) | 23.0 (0.000) | 82.9 (0.000) |
| Time (T) | 56.8 (0.000) | 34.5 (0.000) | 22.7 (0.000) | 72.4 (0.000) | 3.3 (0.069) | 3.3 (0.067) |
| T × S | 0.5 (0.928) | 0.2 (0.975) | 3.0 (0.399) | 0.3 (0.958) | 0.9 (0.825) | 1.2 (0.759) |

a Any, Any treatment; PSY, Psychiatrist; OMH, Other mental health specialist; GM, General medical; HS, Human services; CAM, Complementary-alternative medicine.
b %, Proportion of respondents in the total sample who received either any treatment or treatment in the treatment sector indicated in the column heading. SE, Design-based multiply imputed standard error of the % estimate.
c RR, Risk Ratio, the proportional increase in prevalence in NCS-R compared to NCS. For example, a RR of 1.5 corresponds to the NCS-R prevalence being 50% higher than the NCS prevalence. Note that RR does not always equal the ratio of the % estimates in Parts I and II. This is because the Multiple Imputation method calculates % and RR as means of these estimates in pseudo-samples. The mean of a within-pseudo-sample ratio does not necessarily equal the ratio of the within-pseudo-sample means of the % estimates.
d The significance tests for severity (S) evaluate the significance of differences in treatment proportions across the four categories of the severity variable pooled across the two surveys. Each severity χ2 test has 3 degrees of freedom (serious, moderate and mild vs. none). The significance tests for time (T) evaluate the significance of differences in treatment proportions in the two surveys controlling for differences in severity. Each time χ2 test has 1 degree of freedom (1990–1992 vs. 2001–2003). The significance tests for interactions between time and severity (T × S) evaluate the significance of differential change across the two surveys depending on severity. Each T × S χ2 test has 3 degrees of freedom.
Adapted from a table previously published in Kessler, R.C., Demler, O., Frank, R.G., Olfson, M., Pincus, H.A., Walters, E.E., Wang, P.S., Wells, K.B., Zaslavsky, A.M. (2005). Prevalence and treatment of mental disorders, 1990 to 2003. New England Journal of Medicine 352 (24), 2519 [60]. © 2005 Massachusetts Medical Society, All rights reserved. Used with permission.
∗ Significant at the 0.05 level, two-sided test.
14.5 The NCS-R adolescent supplement (NCS-A) 14.5.1 Design and rationale The NCS-A was designed to provide basic descriptive psychiatric epidemiological information on adolescents comparable to the information on adults obtained in the baseline NCS [63]. In addition, the NCS-A interview schedule included a detailed risk factor battery to study modifiable determinants of the onset and course of child and adolescent mental disorders. Furthermore, as a nationally representative sample of schools was selected to help recruit the NCS-A sample (described below), the survey included considerable detail on school and neighbourhood environmental factors that might be important determinants of early detection, outreach and treatment of child and adolescent mental disorders. A number of important design decisions arose in planning NCS-A that deviated from the model used in the adult NCS-2 and NCS-R surveys [64, 65]. One of these concerned sampling. Because adolescents only reside in a small proportion of all households, a critical design decision concerned the sampling scheme. The scheme we settled on used a dual-frame approach, in which a representative sample of all schools in the country and a representative sample of all households in the country were both used to select adolescents for interview. The school sample was a probability sample of the schools in the communities used in the NCS-2 and NCS-R samples. A probability sample of students in the eligible age range (12–17) was selected in each sample school. The household sample was based on a random selection of one adolescent in each household contacted for the NCS-2 and NCS-R adult surveys. Information was recorded for each household sample respondent regarding whether or not they still attend school and, if so, the name of their school. 
This information was used to weight the data to adjust for the under-sampling of school dropouts and of students who, along with their parents, agreed to participate in the survey as part of the household sample while the principal of the school they attend did not agree to include the school in the school sample. This dual frame approach was facilitated by the fact that the
adult NCS-2 and NCS-R surveys were carried out in parallel with the adolescent survey. Dual-frame sampling is much more efficient than other sample designs in a situation of this sort. Another critical design decision concerned instrumentation. A number of research diagnostic interviews exist to assess mental disorders among children and adolescents [66–68]. We were unable to achieve consensus among our advisors in selecting one of these instruments based on the simultaneous consideration of accuracy and ease of implementation. As a result, we elected to use a modified version of the NCS-2 and NCS-R diagnostic interview, the CIDI, in the adolescent survey. This decision was based, in part, on the fact that the CIDI was previously used successfully in a German adolescent sample [69] as well as among 15–17-year-olds in the baseline NCS. An additional consideration was that the same interview staff that administers the NCS-2 and NCS-R adult interviews also administered the adolescent interviews. We reasoned that it would be much easier for these interviewers if we maintained relative consistency in the instrument across samples rather than use a totally different instrument for the adolescents than the adults. The CIDI was expanded for the NCS-A to include new sections on child and adolescent disorders derived from the DIS. These include oppositional-defiant disorder, conduct disorder, attention-deficit/hyperactivity disorder and separation anxiety disorder. We also modified existing CIDI diagnostic sections that have different criteria for adolescents than adults. In addition, the risk factor battery was expanded to include a more detailed assessment of childhood adversity, while the interview questions on treatment for emotional disorders were revised to blend relevant questions from the NCS-R with questions in another instrument designed for use with children and adolescents [70]. 
Once all these modifications were complete, revisions in question wording were made to improve comprehension among adolescent respondents. This work made use of recently developed cognitive interviewing methods to gain insights into areas of confusion in the instrument and into ways that these confusions might be resolved with modified questions [9, 10]. Finally, a self-administered informant version of the instrument was developed to
obtain information from the parents of respondents. A clinical reappraisal study showed that these modifications resulted in the diagnoses based on the CIDI having very good concordance with independent diagnoses based on blinded clinical reappraisal interviews [71].
14.5.2 Illustrative findings As the NCS-A analyses are only now beginning, no substantive results can be reported here other than to note that preliminary analyses show that, consistent with previous epidemiological surveys, the estimated prevalence of mental disorders among youth is both quite high and quite widely distributed throughout the population. We anticipate that published reports of these analyses will begin to appear in 2009–2010. The NCS web site will post information about these reports as they become available (www.hcp.med.harvard.edu/ncs).
14.6 The WHO WMH Surveys 14.6.1 Design and rationale The WMH Survey Initiative is an outgrowth of the WHO Global Burden of Disease (GBD) study [72, 73], an investigation of the comparative prevalence and societal costs of diseases throughout the world. The first phase of the GBD study concluded that mental disorders are among the most burdensome of all diseases in the world today and that major depression will become the single most burdensome disease in the world within the next two decades. These striking conclusions are based on a unique combination of characteristics shared by depression and many other mental disorders: that they are very common diseases; that they typically have much earlier ages of onset than most chronic physical diseases; that they have high rates of chronicity in conjunction with high risks of impairment and disablement and that they have low rates of treatment. It might be hoped that these results would influence health policy planners throughout the world to move mental disorders up in their priority list for prevention and treatment initiatives. However, this has not happened as yet. At least one reason for this is that the
first phase of the GBD study relied entirely on panels of clinical experts to estimate comparative levels of disease-specific impairment and disablement. The validity of these ratings can be called into question, undercutting the persuasive power of the GBD results concerning the importance of mental disorders. The WMH Initiative was designed to address this limitation by carrying out a series of parallel community epidemiological surveys based on the interview schedule developed for the NCS-R in countries throughout the world in order to obtain objective estimates of the prevalences, impairments and patterns of treatment for mental disorders. Over two dozen countries from all regions of the world are participating in WMH, with a combined sample size in excess of 250 000 respondents [1]. To date, surveys have been conducted in Australia, Belgium, Brazil, Bulgaria, China, Colombia, France, Germany, India, Iraq, Israel, Italy, Japan, Lebanon, Mexico, The Netherlands, New Zealand, Nigeria, Northern Ireland, Peru, Portugal, Romania, South Africa, Spain, Turkey, Ukraine and the United States. Surveys are pending in Nepal, Saudi Arabia and Spain. Because of their emphasis on comparative disease burden, the WMH surveys, including the NCS-R, differ from previous CIDI surveys in a number of important respects that were developed in the NCSR. First, while the focus of almost all previous CIDI surveys was on lifetime disorders, the WMH surveys are equally interested in past year and current (at the time of interview) disorders. All previous versions of CIDI, including the version used in the baseline NCS, provided only superficial information on recent disorders by focusing on lifetime symptoms and asking only one question – ‘How recently have you had (the disorder)?’ – to learn about recency. 
This made it impossible to characterise the persistence of disorders over the recent past or to know whether respondents with a lifetime disorder meet full criteria during the recent past. The CIDI was modified to correct these problems for use in the NCS-R and WMH surveys by obtaining information about current symptoms and persistence of symptoms over the past year. Second, the WMH surveys were designed to focus on recent prevalence to address a question raised by critics of the baseline NCS concerning the clinical significance of community cases [74]. These critics hypothesised that a substantial proportion
of community cases of mental disorders are not clinically significant. We addressed this concern by administering structured versions of standard clinical severity measures to all NCS-R and WMH respondents with recent CIDI disorders. Included here are such measures as a structured version of the Inventory of Depressive Symptomatology to assess the severity of recent depression [75], a structured version of the Panic Disorder Severity Scale to assess the severity of panic [76] and a structured version of the Yale–Brown Obsessive–Compulsive Scale to assess the severity of OCD [77, 78]. Our goal was to use standard clinical severity scales such as these to provide a heretofore missing crosswalk between the findings in our epidemiological surveys and the findings in clinical studies. Third, related to the issue of clinical significance is the issue of impairment. The original version of the CIDI asked only one dichotomous disorder-specific role impairment question for all disorders: ‘Did (the disorder) ever interfere a lot with your life or activities?’ No questions about impairment were asked independent of disorders. This was inadequate for the purposes of the WMH surveys. We consequently expanded the assessment of impairment in the CIDI to include more detailed disorder-specific questions about both lifetime and 12-month role impairments. All WMH surveys, including the NCS-R, also include the WHO Disability Assessment Schedule [79] to assess overall role impairment and disablement independent of particular disorders. Importantly, in order to provide comparative information on the impairments of mental and physical disorders, a checklist of chronic physical disorders was included in the NCS-R and the other WMH surveys. 
The problem of under-reporting due to some people with chronic conditions not being aware of their disorders was dealt with for symptom-based conditions by using symptom screening questions for a random subsample of physical diseases for each WMH respondent. The random subsampling strategy is required because comprehensive screening for all possible physical disorders would be too time-consuming for a one-session survey devoted to mental disorders. However, by randomly selecting a separate representative subsample of physical disorders for each respondent, we guarantee that
data will be collected for a representative subsample of people with each chronic disorder for purposes of comparative assessment of within-disorder role impairments.
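The random subsampling idea can be illustrated with a short sketch. Everything specific here is a labelled assumption: the condition checklist, the number of conditions screened per respondent (k) and the function names are hypothetical, since the text does not give these details. The point is only that an independent random draw per respondent gives every condition the same known chance of being screened, which can then be offset with an inverse-probability weight.

```python
import random
from collections import Counter

# Hypothetical checklist; the actual WMH list is not given in the text.
CONDITIONS = ["arthritis", "asthma", "diabetes", "heart disease",
              "chronic back pain", "migraine", "ulcer", "hypertension"]


def assign_screens(respondent_ids, conditions, k, seed=0):
    """Give each respondent an independent random subset of k conditions to screen."""
    rng = random.Random(seed)
    return {rid: rng.sample(conditions, k) for rid in respondent_ids}


def screening_weight(n_conditions, k):
    """Inverse of the probability k / n_conditions that a given condition is screened."""
    return n_conditions / k


plan = assign_screens(range(1000), CONDITIONS, k=3, seed=42)
coverage = Counter(cond for screens in plan.values() for cond in screens)

# Each condition is screened in roughly k/n of the sample (3/8 here).
for cond in CONDITIONS:
    print(f"{cond:18s} screened for {coverage[cond]} of 1000 respondents")
print("weight per screened respondent:", screening_weight(len(CONDITIONS), 3))
```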
14.6.2 Illustrative results 14.6.2.1 Disorder prevalence Consistent with the results of the NCS-R, the WMH results found that mental disorders are commonly occurring in the vast majority of the countries studied [80]. Comparative prevalence estimates were also quite similar across countries, with phobias virtually always the most common anxiety disorder and major depression the most common overall disorder. Comorbidity among these disorders was also found to be high in most countries, with distinct clusters of internalising and externalising disorders.
14.6.2.2 Relative impairments of mental and physical disorders Comparative analyses in the WMH data found that the mean levels of self-reported impairment associated with mental disorders are significantly higher than those associated with the vast majority of the commonly occurring chronic physical disorders assessed in the surveys [81]. This general pattern was true in all regions of the world and in both developed and developing countries.
14.6.2.3 Treatment Despite these higher impairments, only a minority of people with even seriously impairing mental disorders were found in the WMH surveys to receive treatment [82]. This was true even in developed countries although the pattern was found to be more pronounced in developing countries. Treatment rates were considerably lower in every country for mental disorders than for physical disorders associated with comparable levels of impairment [81].
14.6.2.4 Diagnostic criteria for DSM and International Classification of Diseases disorders The WMH data have been used to investigate a number of issues raised in debates over appropriate
diagnostic criteria for DSM and International Classification of Diseases (ICD) disorders. For example, the WMH data were used to investigate the implications of the suggestion that the six-month minimum duration requirement for a diagnosis of GAD be reduced to 1 month [83]. Results showed that symptom severity, persistence, comorbidity and impairment of GAD were all quite similar for cases defined with a 1–5 month duration requirement compared to a 6-month minimum duration requirement. These results showed that the current DSM and ICD 6-month requirement excludes a large number of people with clinically significant short recurrent episodes of a GAD-like syndrome.
14.6.2.5 Cross-national correlates The WMH data have also been used to examine cross-national variation in correlates of mental illness. One of the most interesting of these investigations focused on gender differences. Epidemiological surveys consistently find significantly higher levels of anxiety and mood disorders among women than men [84, 85] and significantly higher levels of externalising and substance use disorders among men than women [86, 87]. Similar patterns are found in the WMH data. Although a number of biological, psychosocial and biopsychosocial hypotheses have been proposed to account for these findings [88–90], evidence that gender differences in both depression [91, 92] and substance use [93, 94] have been narrowing in recent years in a number of countries has led to a special interest in the ‘sex roles’ hypothesis. The latter hypothesis holds that gender differences in the prevalence of mental disorders are due to differences in the typical stressors, coping resources and opportunity structures for expressing psychological distress that are made available to women and men in different countries at different points in history [95, 96]. No rigorous test of the sex roles hypothesis has ever been carried out before the WMH surveys. We did this by using administrative data collected by the World Bank, the United States and WHO on time–space variation in diverse indicators of the positions of women relative to men in countries around the world to generate an index of sex
role inequality. Index scores were then merged with the WMH survey data. We showed that gender differences in both depression and substance disorders have become significantly smaller across successive cohorts in countries where sex role inequality has decreased over time [97]. We also found that point-in-time cross-national variation in the strength of association between sex and mental disorders is significantly related to traditionality of gender roles.
14.7 Overview Descriptive studies like the NCS and WMH surveys are of more importance in psychiatric epidemiology than in other branches of epidemiology due to the fact that psychiatric epidemiology has traditionally been hampered by difficulties in conceptualising and measuring disorders. The baseline NCS was important mainly because it helped resolve these difficulties by providing accurate descriptive data on the prevalence and correlates of mental disorders. However, we have to remember that the ultimate goals of epidemiology are to understand and control disease by empirically studying associations between variation in exposure to disease-causing agents external to the individual, variation in the resistance of individuals exposed to the disease-causing agents and variation in resistance resources in the environments of exposed individuals. Although these investigations are initially carried out by examining natural variations of the sort assessed in the NCS surveys, we have to move beyond this initial step to develop hypotheses that can be tested provisionally in naturalistic quasiexperimental situations with matching or statistical controls used to approximate the conditions of an experiment. If the hypotheses stand up to these preliminary tests, they then need to be evaluated in interventions aimed at preventing the onset or altering the course of the disorders. This perspective on the role of surveys like the NCS and WMH surveys suggests that they should be seen as a necessary step in the evolution of epidemiological research on mental disorders that provide a firm descriptive foundation for further analytic and experimental epidemiological research.
Acknowledgements
The baseline National Comorbidity Survey (NCS) was supported by NIMH grants MH46376, MH49098 and MH52861, with supplemental support from the W.T. Grant Foundation (Grant 90135190) and an NIMH Career Scientist award to R.C.K. (MH00507). The NCS-2 was funded by National Institute on Drug Abuse (NIDA) grant DA12058, with supplemental support from NIMH. The National Comorbidity Survey Replication (NCS-R) and National Comorbidity Survey Replication Adolescent Supplement (NCS-A) are supported by the National Institute of Mental Health (U01MH60220) with supplemental support from the National Institute on Drug Abuse, the Substance Abuse and Mental Health Services Administration (SAMHSA), the Robert Wood Johnson Foundation (RWJF; Grant 044780) and the John W. Alden Trust. Collaborating NCS-R investigators include Ronald C. Kessler (Principal Investigator, Harvard Medical School), Kathleen Merikangas (Co-Principal Investigator, NIMH), James Anthony (Michigan State University), William Eaton (The Johns Hopkins University), Meyer Glantz (NIDA), Doreen Koretz (Harvard University), Jane McLeod (Indiana University), Mark Olfson (New York State Psychiatric Institute, College of Physicians and Surgeons of Columbia University), Harold Pincus (University of Pittsburgh), Greg Simon (Group Health Cooperative), Michael Von Korff (Group Health Cooperative), Philip S. Wang (NIMH), Kenneth Wells (UCLA), Elaine Wethington (Cornell University) and Hans-Ulrich Wittchen (Max Planck Institute of Psychiatry; Technical University of Dresden). The views and opinions expressed in this report are those of the authors and should not be construed to represent the views of any of the sponsoring organisations, agencies or the US Government. The NCS-R is carried out in conjunction with the World Health Organization World Mental Health (WMH) Survey Initiative.
We thank the staff of the WMH Data Collection and Data Analysis Coordination Centres for assistance with instrumentation, fieldwork and consultation on data analysis. These activities were supported by the National Institute of Mental Health (R01 MH070884), the John D. and Catherine T. MacArthur Foundation, the Pfizer Foundation, the US Public Health Service (R13MH066849, R01-MH069864 and R01 DA016558), the Fogarty International Center (FIRCA R03-TW006481), the Pan American Health Organization, Eli Lilly and Company, Ortho-McNeil Pharmaceutical, Inc., GlaxoSmithKline and Bristol-Myers Squibb. A complete list of WMH publications can be found at http://www.hcp.med.harvard.edu/wmh/. A complete list of publications from the NCS, NCS-2 and NCS-R can be found at www.hcp.med.harvard.edu/ncs. Information about the WMH surveys can be found at www.hcp.med.harvard.edu/wmh/. Address comments to R.C. Kessler, Department of Health Care Policy, Harvard Medical School, 180 Longwood Avenue, Boston, MA 02115.
References
[1] Kessler, R.C. and Üstün, T.B. (2008) The WHO World Mental Health Surveys: Global Perspectives on the Epidemiology of Mental Disorders, Cambridge University Press, New York.
[2] President’s Commission on Mental Health (1978) Report to the President, Vol. 1, One Stock Number 040-000-00390-8, US Government Printing Office, Washington, DC.
[3] Robins, L.N., Helzer, J.E., Croughan, J.L. et al. (1981) National Institute of Mental Health Diagnostic Interview Schedule: its history, characteristics and validity. Arch. Gen. Psychiatry, 38, 381–389.
[4] Bourdon, K.H., Rae, D.S., Locke, B.Z. et al. (1992) Estimating the prevalence of mental disorders in U.S. adults from the Epidemiologic Catchment Area Survey. Public Health Rep., 107, 663–668.
[5] Regier, D.A., Narrow, W.E., Rae, D.S. et al. (1993) The de facto US mental and addictive disorders service system: Epidemiologic Catchment Area prospective 1-year prevalence rates of disorders and services. Arch. Gen. Psychiatry, 50, 85–94.
[6] Robins, L.N. and Regier, D.A. (eds) (1991) Psychiatric Disorders in America: The Epidemiologic Catchment Area Study, The Free Press, New York.
[7] Kessler, R.C., McGonagle, K.A., Zhao, S. et al. (1994) Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States: results from the National Comorbidity Survey. Arch. Gen. Psychiatry, 51, 8–19.
[8] World Health Organization (1990) Composite International Diagnostic Interview, World Health Organization, Geneva, Switzerland.
[9] Kessler, R.C., Mroczek, D.K. and Belli, R.F. (1999) Retrospective adult assessment of childhood psychopathology, in Diagnostic Assessment in Child and Adolescent Psychopathology (eds D. Shaffer, C.P. Lucas and J.E. Richters), Guilford Press, New York, pp. 256–284.
[10] Kessler, R.C., Wittchen, H.-U., Abelson, J.M. et al. (1998) Methodological studies of the Composite International Diagnostic Interview (CIDI) in the US National Comorbidity Survey. Int. J. Methods Psychiatr. Res., 7, 33–55.
[11] Kessler, R.C., Barber, C., Birnbaum, H.G. et al. (1999) Depression in the workplace: effects on short-term disability. Health Aff. (Millwood), 18, 163–171.
[12] Wittchen, H.U. (1994) Reliability and validity studies of the WHO-Composite International Diagnostic Interview (CIDI): a critical review. J. Psychiatr. Res., 28, 57–84.
[13] Spitzer, R.L., Williams, J.B., Gibbon, M. et al. (1992) The Structured Clinical Interview for DSM-III-R (SCID). I: history, rationale, and description. Arch. Gen. Psychiatry, 49, 624–629.
[14] Williams, J.B., Gibbon, M., First, M.B. et al. (1992) The Structured Clinical Interview for DSM-III-R (SCID). II. Multisite test–retest reliability. Arch. Gen. Psychiatry, 49, 630–636.
[15] Kendler, K.S., Gallagher, T.J., Abelson, J.M. et al. (1996) Lifetime prevalence, demographic risk factors, and diagnostic validity of nonaffective psychosis as assessed in a US community sample. The National Comorbidity Survey. Arch. Gen. Psychiatry, 53, 1022–1031.
[16] Rosenbaum, P.R. and Rubin, D.B. (1983) The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
[17] Kessler, R.C., Little, R.J. and Groves, R.M. (1995) Advances in strategies for minimizing and adjusting for survey nonresponse. Epidemiol. Rev., 17, 192–204.
[18] Keith, S.J., Regier, D.A. and Rae, D.S. (1991) Schizophrenic disorders, in Psychiatric Disorders in America: The Epidemiologic Catchment Area Study, Free Press, New York, pp. 33–52.
[19] Regier, D.A., Kaelber, C.T., Rae, D.S. et al. (1998) Limitations of diagnostic criteria and assessment instruments for mental disorders. Implications for research and policy. Arch. Gen. Psychiatry, 55, 109–115.
[20] Burke, K.C., Burke, J.D. Jr., Rae, D.S. et al. (1991) Comparing age at onset of major depression and other psychiatric disorders by birth cohorts in five US community populations. Arch. Gen. Psychiatry, 48, 789–795.
[21] Magee, W.J., Eaton, W.W., Wittchen, H.U. et al. (1996) Agoraphobia, simple phobia, and social phobia in the National Comorbidity Survey. Arch. Gen. Psychiatry, 53, 159–168.
[22] Regier, D.A., Farmer, M.E., Rae, D.S. et al. (1990) Comorbidity of mental disorders with alcohol and other drug abuse. Results from the Epidemiologic Catchment Area (ECA) Study. J. Am. Med. Assoc., 264, 2511–2518.
[23] Kessler, R.C. (1995) The epidemiology of psychiatric comorbidity, in Textbook in Psychiatric Epidemiology (eds M.T. Tsuang, M. Tohen and G.E.P. Zahner), John Wiley & Sons, Inc., New York, pp. 179–197.
[24] Kessler, R.C. (1997) The prevalence of psychiatric comorbidity, in Treatment Strategies for Patients with Psychiatric Comorbidity (eds S. Wetzler and W.C. Sanderson), John Wiley & Sons, Inc., New York, pp. 23–48.
[25] Kessler, R.C., Stang, P.E., Wittchen, H.U. et al. (1998) Lifetime panic-depression comorbidity in the National Comorbidity Survey. Arch. Gen. Psychiatry, 55, 801–808.
[26] Gold, M.R., Siegel, J.E., Russell, L.B. et al. (1996) Cost-Effectiveness in Health and Medicine, Oxford University Press, Oxford, England.
[27] Wells, K.B., Sturm, R., Sherbourne, C.D. et al. (1996) Caring for Depression, Harvard University Press, Cambridge.
[28] Kessler, R.C., Greenberg, P.E., Mickelson, K.D. et al. (2001) The effects of chronic medical conditions on work loss and work cutback. J. Occup. Environ. Med., 43, 218–225.
[29] de Graaf, R., Kessler, R.C., Fayyad, J. et al. (2008) The prevalence and effects of adult attention-deficit/hyperactivity disorder (ADHD) on the performance of workers: results from the WHO World Mental Health Survey Initiative. Occup. Environ. Med., 65, 835–842.
[30] Kessler, R.C., Akiskal, H.S., Ames, M. et al. (2006) Prevalence and effects of mood disorders on work performance in a nationally representative sample of U.S. workers. Am. J. Psychiatry, 163, 1561–1568.
[31] Kessler, R.C., Berglund, P.A., Foster, C.L. et al. (1997) Social consequences of psychiatric disorders, II: teenage parenthood. Am. J. Psychiatry, 154, 1405–1411.
[32] Kessler, R.C., Foster, C.L., Saunders, W.B. et al. (1995) Social consequences of psychiatric disorders, I: educational attainment. Am. J. Psychiatry, 152, 1026–1032.
[33] Kessler, R.C., Walters, E.E. and Forthofer, M.S. (1998) The social consequences of psychiatric disorders, III: probability of marital stability. Am. J. Psychiatry, 155, 1092–1096.
[34] Kessler, R.C., Olfson, M. and Berglund, P.A. (1998) Patterns and predictors of treatment contact after first onset of psychiatric disorders. Am. J. Psychiatry, 155, 62–69.
[35] Kessler, R.C., Crum, R.M., Warner, L.A. et al. (1997) Lifetime co-occurrence of DSM-III-R alcohol abuse and dependence with other psychiatric disorders in the National Comorbidity Survey. Arch. Gen. Psychiatry, 54, 313–321.
[36] Kessler, R.C., Nelson, C.B., McGonagle, K.A. et al. (1996) The epidemiology of co-occurring addictive and mental disorders: implications for prevention and service utilization. Am. J. Orthopsychiatry, 66, 17–31.
[37] Freedman, D., Thornton, A., Camburn, D. et al. (1988) The life history calendar: a technique for collecting retrospective data. Sociol. Methodol., 18, 37–68.
[38] Lyketsos, C.G. (1994) Application of clinical epidemiologic methods to the clinical practice of psychiatry. Am. J. Psychiatry, 151, 299–300.
[39] Wethington, E., Brown, G.W. and Kessler, R.C. (1995) Interview measurement of stressful life events, in Measuring Stress: A Guide for Health and Social Scientists (eds S. Cohen, R.C. Kessler and L. Gordon), Oxford University Press, New York, pp. 59–79.
[40] Kendler, K.S., Gardner, C.O., Gatz, M. et al. (2007) The sources of co-morbidity between major depression and generalized anxiety disorder in a Swedish national twin sample. Psychol. Med., 37, 453–462.
[41] Kessler, R.C., Gruber, M., Hettema, J.M. et al. (2008) Co-morbid major depression and generalized anxiety disorders in the National Comorbidity Survey followup. Psychol. Med., 38, 365–374.
[42] Swendsen, J., Anthony, J.C., Conway, K.P. et al. (2008) Improving targets for the prevention of drug use disorders: sociodemographic predictors of transitions across drug use stages in the National Comorbidity Survey replication. Prev. Med., 47, 629–634.
[43] Kessler, R.C., Borges, G., Sampson, N. et al. (2008) The association between smoking and subsequent suicide-related outcomes in the National Comorbidity Survey panel sample. Mol. Psychiatry, 14, 1132–1142.
[44] Nock, M.K., Borges, G., Bromet, E.J. et al. (2008) Cross-national prevalence and risk factors for suicidal ideation, plans and attempts. Br. J. Psychiatry, 192, 98–105.
[45] Narrow, W.E., Rae, D.S., Robins, L.N. et al. (2002) Revised prevalence estimates of mental disorders in the United States: using a clinical significance criterion to reconcile 2 surveys’ estimates. Arch. Gen. Psychiatry, 59, 115–123.
[46] Regier, D.A. and Narrow, W.E. (2002) Defining clinically significant psychopathology with epidemiologic data, in Defining Psychopathology in the 21st Century: DSM-5 and Beyond (eds J.E. Helzer and J.J. Hudziak), American Psychiatric Publishing, Washington, DC, pp. 19–30.
[47] Preisig, M., Merikangas, K.R. and Angst, J. (2001) Clinical significance and comorbidity of subthreshold depression and anxiety in the community. Acta Psychiatr. Scand., 104, 96–103.
[48] Sullivan, P.F., Kessler, R.C. and Kendler, K.S. (1998) Latent class analysis of lifetime depressive symptoms in the National Comorbidity Survey. Am. J. Psychiatry, 155, 1398–1406.
[49] Benjamin, J., Ebstein, R.P. and Lesch, K.P. (1998) Genes for personality traits: implications for psychopathology. Int. J. Neuropsychopharmacology, 1, 153–168.
[50] Eaton, W.W., Badawi, M. and Melton, B. (1995) Prodromes and precursors: epidemiologic data for primary prevention of disorders with slow onset. Am. J. Psychiatry, 152, 967–972.
[51] Kendell, R.E. (2002) Five criteria for an improved taxonomy of mental disorders, in Defining Psychopathology in the 21st Century: DSM-5 and Beyond (eds J.E. Helzer and J.J. Hudziak), American Psychiatric Publishing, Washington, DC, pp. 3–17.
[52] Spitzer, R.L. (1998) Diagnosis and need for treatment are not the same. Arch. Gen. Psychiatry, 55, 120.
[53] Kessler, R.C., Merikangas, K.R., Berglund, P. et al. (2003) Mild disorders should not be eliminated from the DSM-5. Arch. Gen. Psychiatry, 60 (11), 1117–1122.
[54] American Psychiatric Association (1987) Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R), 3rd edn Revised, American Psychiatric Association, Washington, DC.
[55] American Psychiatric Association (1994) Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), 4th edn, American Psychiatric Association, Washington, DC.
[56] Haro, J.M., Arbabzadeh-Bouchez, S., Brugha, T.S. et al. (2006) Concordance of the composite international diagnostic interview version 3.0 (CIDI 3.0) with standardized clinical assessments in the WHO World mental health surveys. Int. J. Methods Psychiatr. Res., 15, 167–180.
[57] Kessler, R.C., Chiu, W.T., Demler, O. et al. (2005) Prevalence, severity, and comorbidity of 12-month DSM-IV disorders in the National Comorbidity Survey replication. Arch. Gen. Psychiatry, 62, 617–627.
[58] Endicott, J., Spitzer, R.L., Fleiss, J.L. et al. (1976) The global assessment scale: a procedure for measuring overall severity of psychiatric disorders. Arch. Gen. Psychiatry, 33, 766–771.
[59] Leon, A.C., Olfson, M., Portera, L. et al. (1997) Assessing psychiatric impairment in primary care with the Sheehan disability scale. Int. J. Psychiatry Med., 27, 93–105.
[60] Kessler, R.C., Demler, O., Frank, R.G. et al. (2005) Prevalence and treatment of mental disorders, 1990 to 2003. N. Engl. J. Med., 352, 2515–2523.
[61] Kessler, R.C., Berglund, P., Borges, G. et al. (2005) Trends in suicide ideation, plans, gestures, and attempts in the United States, 1990–1992 to 2001–2003. J. Am. Med. Assoc., 293, 2487–2495.
[62] Wang, P.S., Lane, M., Olfson, M. et al. (2005) Twelve-month use of mental health services in the United States: results from the National Comorbidity Survey replication. Arch. Gen. Psychiatry, 62, 629–640.
[63] Merikangas, K.R., Avenevoli, S., Costello, E.J. et al. (2009) National Comorbidity Survey Replication Adolescent Supplement (NCS-A): I. Background and measures. J. Am. Acad. Child Adolesc. Psychiatry, 48, 367–369.
[64] Kessler, R.C., Avenevoli, S., Costello, E.J. et al. (2009) The National Comorbidity Survey Replication Adolescent Supplement (NCS-A): II. Overview and design. J. Am. Acad. Child Adolesc. Psychiatry, 48, 380–385.
[65] Kessler, R.C., Avenevoli, S., Costello, E.J. et al. (2009) Design and field procedures in the US National Comorbidity Survey Replication Adolescent Supplement (NCS-A). Int. J. Methods Psychiatr. Res., 18, 69–83.
[66] Angold, A. and Costello, E.J. (2000) The Child and Adolescent Psychiatric Assessment (CAPA). J. Am. Acad. Child Adolesc. Psychiatry, 39, 39–48.
[67] Reich, W. (2000) Diagnostic interview for children and adolescents (DICA). J. Am. Acad. Child Adolesc. Psychiatry, 39, 59–66.
[68] Shaffer, D., Fisher, P., Lucas, C.P. et al. (2000) NIMH Diagnostic Interview Schedule for Children Version IV (NIMH DISC-IV): description, differences from previous versions, and reliability of some common diagnoses. J. Am. Acad. Child Adolesc. Psychiatry, 39, 28–38.
[69] Wittchen, H.U., Perkonigg, A., Lachner, G. et al. (1998) Early developmental stages of psychopathology study (EDSP): objectives and design. Eur. Addict. Res., 4, 18–27.
[70] Stiffman, A.R., Horwitz, S.M., Hoagwood, K. et al. (2000) The Service Assessment for Children and Adolescents (SACA): adult and child reports. J. Am. Acad. Child Adolesc. Psychiatry, 39, 1032–1039.
[71] Kessler, R.C., Avenevoli, S., Greif Green, J. et al. (2009) The National Comorbidity Survey Adolescent Supplement (NCS-A): III. Concordance of DSM-IV/CIDI diagnoses with clinical reassessments. J. Am. Acad. Child Adolesc. Psychiatry, 48, 386–399.
[72] Murray, C.J., Lopez, A.D. and Jamison, D.T. (1994) The global burden of disease in 1990: summary results, sensitivity analysis and future directions. Bull. World Health Organ., 72, 495–509.
[73] Murray, C.J.L. and Lopez, A.D. (1996) The Global Burden of Disease: A Comprehensive Assessment of Mortality and Disability from Diseases, Injuries and Risk Factors in 1990 and Projected to 2020, Harvard University Press, Cambridge.
[74] Regier, D.A. (2000) Community diagnosis counts. Arch. Gen. Psychiatry, 57, 223–224 [Commentary].
[75] Rush, A.J., Gullion, C.M., Basco, M.R. et al. (1996) The Inventory of Depressive Symptomatology (IDS): psychometric properties. Psychol. Med., 26, 477–486.
[76] Shear, M.K., Brown, T.A., Barlow, D.H. et al. (1997) Multicenter collaborative panic disorder severity scale. Am. J. Psychiatry, 154, 1571–1575.
[77] Goodman, W.K., Price, L.H., Rasmussen, S.A. et al. (1989) The Yale-Brown obsessive compulsive scale. II. Validity. Arch. Gen. Psychiatry, 46, 1012–1016.
[78] Goodman, W.K., Price, L.H., Rasmussen, S.A. et al. (1989) The Yale-Brown obsessive compulsive scale. I. Development, use, and reliability. Arch. Gen. Psychiatry, 46, 1006–1011.
[79] World Health Organization (1998) The WHO Disability Assessment Schedule II (WHO-DAS II), World Health Organization, Geneva, Switzerland.
[80] Kessler, R.C., Aguilar-Gaxiola, S., Alonso, J. et al. (2008) Lifetime prevalence and age of onset distributions of mental disorders in the World Mental Health Survey Initiative, in The WHO World Mental Health Surveys: Global Perspectives on the Epidemiology of Mental Disorders (eds R.C. Kessler and T.B. Üstün), Cambridge University Press, New York, pp. 511–521.
[81] Ormel, J., Petukhova, M., Chatterji, S. et al. (2008) Disability and treatment of specific mental and physical disorders across the world. Br. J. Psychiatry, 192, 368–375.
[82] Wang, P.S., Aguilar-Gaxiola, S., Alonso, J. et al. (2007) Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the WHO world mental health surveys. Lancet, 370, 841–850.
[83] Lee, S., Tsang, A., Ruscio, A.M. et al. (2009) Implications of modifying the duration requirement of generalized anxiety disorder in developed and developing countries. Psychol. Med., 39, 1163–1176.
[84] Kuehner, C. (2003) Gender differences in unipolar depression: an update of epidemiological findings and possible explanations. Acta Psychiatr. Scand., 108, 163–174.
[85] Pigott, T.A. (1999) Gender differences in the epidemiology and treatment of anxiety disorders. J. Clin. Psychiatry, 60 (Suppl. 18), 4–15.
[86] Brady, K.T. and Randall, C.L. (1999) Gender differences in substance use disorders. Psychiatr. Clin. North Am., 22, 241–252.
[87] Keenan, K., Loeber, R. and Green, S. (1999) Conduct disorder in girls: a review of the literature. Clin. Child Fam. Psychol. Rev., 2, 3–19.
[88] Grigoriadis, S. and Robinson, G.E. (2007) Gender issues in depression. Ann. Clin. Psychiatry, 19, 247–255.
[89] Lynch, W.J., Roth, M.E. and Carroll, M.E. (2002) Biological basis of sex differences in drug abuse: preclinical and clinical studies. Psychopharmacology, 164, 121–137.
[90] Hilt, L. and Nolen-Hoeksema, S. (2006) Possible contributors to the gender differences in alcohol use and problems. J. Gen. Psychol., 133, 357–374.
[91] Joyce, P.R., Oakley-Browne, M.A., Wells, J.E. et al. (1990) Birth cohort trends in major depression: increasing rates and earlier onset in New Zealand. J. Affect. Disord., 18, 83–89.
[92] Wickramaratne, P.J., Weissman, M.M., Leaf, P.J. et al. (1989) Age, period and cohort effects on the risk of major depression: results from five United States communities. J. Clin. Epidemiol., 42, 333–343.
[93] McPherson, M., Casswell, S. and Pledger, M. (2004) Gender convergence in alcohol consumption and related problems: issues and outcomes from comparisons of New Zealand survey data. Addiction, 99, 738–748.
[94] Wilsnack, R.W., Vogeltanz, N.D., Wilsnack, S.C. et al. (2000) Gender differences in alcohol consumption and adverse drinking consequences: cross-cultural patterns. Addiction, 95, 251–265.
[95] Pape, H., Hammer, T. and Vaglum, P. (1994) Are ’traditional’ sex differences less conspicuous in young cannabis users than in other young people? J. Psychoactive Drugs, 26, 257–263.
[96] Thoits, P.A. (1986) Social support as coping assistance. J. Consult. Clin. Psychol., 54, 416–423.
[97] Seedat, S., Scott, K.M., Angermeyer, M.C. et al. (2009) Cross-national associations between gender and mental disorders in the WHO World Mental Health Surveys. Arch. Gen. Psychiatry, 66, 785–795.
15
Experimental epidemiology
John R. Geddes
Department of Psychiatry, Warneford Hospital, Oxford, UK
15.1 Introduction
The investigation of the relation between cause and effect in psychiatric research is the same as in any other area of clinical science. When possible, the experiment, in which the exposure is controlled, produces the most convincing evidence of causal association. The most commonly used experimental design for assessing the effects of treatments is the randomised controlled trial (RCT). This chapter will deal with some of the evolving trends in our understanding and classification of clinical trials. Before considering the design of RCTs in more detail, the limitations of non-randomised evidence will be considered because RCTs can also be vulnerable to similar problems and need careful design to preserve the advantages of randomisation. We will then consider the main threats to the validity and success of RCTs and the main strategies for dealing with them. Finally, we will examine some of the practical implications of the issues discussed in the chapter.
15.2 Limitations of non-randomised evidence
The main problem with non-randomised evidence is that it is unclear to what extent any observed association is causal. When the exposure is not under the control of the investigator, even in a prospective trial, it can be hard to establish the timing of the putative cause relative to the outcome. There are two main problems:
• Did the exposure occur before the outcome? In an observational study it is often unclear whether a putative risk factor predisposes to the outcome of interest, whether the outcome causes the putative risk factor (i.e. reverse causation), or whether the putative risk factor and outcome are both caused by a third factor. This is because measurement of both the exposure and the outcome is subject to imprecision and bias. An effective way of determining the temporal relationship between exposure and outcome is to control the exposure. By manipulating the exposure in a prospective study, it is possible to determine exactly when it is administered to the participant.
• Have alternative explanations been excluded? This is a particular problem in risk factor research in that there are usually alternative explanations for any observed association. It is usually unclear whether the observed association simply reflects a third variable that is related to both the exposure and the outcome. This is known as confounding and is a particular problem where there is an inter-relationship between numerous causal factors. When considering the effect of a particular medicine it may be unclear whether any difference in outcomes between patients who take the drug and those who do not is due to the drug or to other clinical factors which are related both to the clinical choice of the drug and the outcome (or prognosis) of the condition. For example, in an observational study comparing suicidal behaviour in patients who were prescribed a selective serotonin re-uptake inhibitor (SSRI) and those who were prescribed another drug, is an increased rate of suicide due to the SSRI or because SSRIs were
more likely to be prescribed to patients who were at clinically increased risk of suicide because they were safer in overdose than alternatives?
There are seven main ways of trying to deal with confounding.
1 Exclusion – participants with exposures to known confounders are excluded from the study.
2 Stratification of input – participants are stratified according to one or more known confounders.
3 Stratification of analysis – the potential effect of confounding is adjusted for in the analysis.
4 Matching – cases and controls are matched according to their exposure to one or more known confounders.
5 Standardisation – rates of outcomes are compared to standardised rates from a reference population.
6 Regression analysis – the potential effect of confounding is adjusted for in the analysis using regression techniques.
7 Randomisation – participants are randomly allocated to the exposure of interest or a control intervention. When done properly, randomisation has the key effect of preventing the allocation of exposure or control from being influenced by any other factors. All other factors, both confounders and non-confounders, will on average be equally distributed between groups.
Methods 1–6 can, to some extent, control for known confounders if (and it is a substantial if!) they can be both measured and quantified accurately. Of course, this is time consuming, expensive and sometimes impossible but, in principle, it can be done for many known confounders. In observational studies of drug therapies, propensity score matching is increasingly used. In this technique the conditional probability of being treated one way or another, given the person’s clinical characteristics, is used to balance the comparison groups using matching, stratification or regression [1]. However, despite the increasing sophistication of observational techniques, random allocation of the exposure is the only way of dealing with unknown confounders.
This unique ability to control the allocation of
exposures, dealing with confounding and subsequent measurement of specific, investigator-specified outcomes is the key strength of the randomised experiment, and the approach has become standard inside the laboratory. The convincing nature of evidence from a properly conducted randomised experiment is the reason why this design is seen as the gold standard for assessing causation in medicine. Clearly, there are many exposures where it is either impossible or unethical to allocate patients to exposures randomly. For example, people generally either smoke or do not smoke through choice (at least initially), but it is not usually regarded as ethical to allocate them to smoking at random. Within psychiatry, observational evidence has identified a possible causal association between events occurring perinatally and subsequent development of schizophrenia. However, one cannot allocate mothers to perinatal incidents. This means that uncertainty about the attribution of cause and effect in observational studies is inevitable. Even if a very clear estimate of the exposure dose can be obtained, it is rarely possible to adjust for confounders (even known confounders) to the same degree of exactness as one can determine the exposure dose. Further, it is of course impossible to control, measure or assess the effect of unknown confounders. Despite the limitations of observational evidence, large-scale, well-controlled non-randomised studies of routinely collected observational data will remain a necessary tool in the evaluation of comparative treatment effects because of the difficulties and expense of conducting large RCTs [2]. Non-randomised studies also have the advantages of being generally cheaper for a given sample size and of not suffering from the inevitable selection biases of RCTs. Given their specific strengths and weaknesses, it is reassuring when the treatment effects estimated in RCTs and observational studies are consistent.
For example, estimates of the magnitude of the reduction in suicidal risk associated with long-term lithium therapy are consistent across randomised and non-randomised evidence, with the non-randomised evidence playing a key role in view of the small and heterogeneous nature of the RCTs [3, 4]. Box 15.1 summarises the advantages and disadvantages of RCTs.
Box 15.1 The Advantages and Disadvantages of RCTs
Advantages
• Most efficient design for investigating causality because we can ensure that the ‘cause’ precedes the ‘effect’.
• We can ensure that possible confounding factors do not confuse the results.
• We can ensure that treatments are compared efficiently.
• Randomisation facilitates statistical analysis.
Disadvantages
• Can take a long time and be very expensive.
• Not suitable for very rare diseases or diseases with a long latency.
• Ethical problems.
• Generalisability – RCTs often screen out vulnerable groups such as the very young, very old and pregnant women (or at risk).
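The confounding-by-indication problem in the SSRI example of Section 15.2 can be made concrete with a small simulation. The sketch below is illustrative only: the baseline risks, prescribing probabilities and function names are assumptions chosen for the example, not data from any study. The simulated drug has no effect on the outcome, yet the crude comparison in the observational data suggests harm because high-risk patients are more likely to receive it. Stratifying on the known confounder, or randomising the allocation, removes the spurious difference.

```python
import random


def simulate(n, randomised, seed=0):
    """Return (high_risk, treated, event) tuples for n hypothetical patients.

    Assumed values: 30% of patients are high risk; the event risk is 10% in
    high-risk and 2% in low-risk patients; the drug itself has NO effect.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_risk = rng.random() < 0.3
        if randomised:
            # Allocation ignores every patient characteristic.
            treated = rng.random() < 0.5
        else:
            # Clinicians preferentially prescribe the drug to high-risk patients.
            treated = rng.random() < (0.8 if high_risk else 0.3)
        event = rng.random() < (0.10 if high_risk else 0.02)
        rows.append((high_risk, treated, event))
    return rows


def risk(rows, treated):
    """Proportion of patients with the event in one treatment group."""
    group = [event for (_, t, event) in rows if t == treated]
    return sum(group) / len(group)


def crude_rd(rows):
    """Unadjusted risk difference (treated minus untreated)."""
    return risk(rows, True) - risk(rows, False)


def stratified_rd(rows):
    """Risk difference averaged over confounder strata, weighted by stratum size."""
    total = len(rows)
    rd = 0.0
    for level in (True, False):
        stratum = [r for r in rows if r[0] == level]
        rd += len(stratum) / total * crude_rd(stratum)
    return rd


observational = simulate(200_000, randomised=False)
trial = simulate(200_000, randomised=True, seed=1)

print(f"observational, crude RD:      {crude_rd(observational):.3f}")
print(f"observational, stratified RD: {stratified_rd(observational):.3f}")
print(f"randomised, crude RD:         {crude_rd(trial):.3f}")
```

Stratification works here only because the confounder is known and recorded without error. Of the three comparisons, the randomised one is the only one that would also balance a confounder missing from the data.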
Even when it is both feasible and ethical to conduct a randomised trial, there are major difficulties in using experimental approaches in the real world where one has much less control over the environment or the precise allocation of a known amount of exposure and measurement of outcome than in the laboratory. The rationale for conducting expensive and difficult randomised trials in clinical populations is to produce solid and reliable evidence for clinical decision-making. Inevitably, however, compromises must be made in the design of all trials to achieve the best balance between the adherence to the optimal designs used in the laboratory and the degree to which the results can be applied in the real world of clinical practice.
15.3 RCTs: The translation of the experimental design into the real world
In the laboratory, it is possible to control most aspects of the participant (in animal studies), environment, exposure and outcome measurement. All known sources of bias can be controlled and therefore one can have great confidence in the observed results. Part of the development of the design of randomised trials has been the application of methods derived in highly controlled environments to situations where control is far less possible. The main threats to validity in this translation from lab to bedside are the introduction of random and systematic errors. The standard design of a clinical trial is to administer a known exposure and measure an outcome in a reliable manner in an environment that is standard and consistent across all participants. The aim is to reduce both unnecessary ‘noise’ (random error) and bias (systematic error). The optimal design of the randomised controlled trial must consider both systematic and random error and limit their effects as far as possible.
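The distinction between random and systematic error drawn above can be illustrated with a short simulation. This is a minimal sketch under assumed values (a true effect of 1.0, a fixed measurement bias of 0.5 and a noise standard deviation of 2.0); the function name `replicate` is hypothetical. Each simulated trial estimates the effect as the mean of noisy, biased observations, and the trial is repeated many times at two sample sizes.

```python
import random
import statistics


def replicate(n_per_trial, bias, n_trials=200, noise_sd=2.0,
              true_effect=1.0, seed=0):
    """Repeat a simulated trial and summarise its estimates.

    Returns (mean error of the estimates, standard deviation of the estimates).
    The mean error reflects systematic error; the spread reflects random error.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        observations = [true_effect + bias + rng.gauss(0.0, noise_sd)
                        for _ in range(n_per_trial)]
        estimates.append(statistics.fmean(observations))
    mean_error = statistics.fmean(estimates) - true_effect
    spread = statistics.stdev(estimates)
    return mean_error, spread


small_error, small_spread = replicate(n_per_trial=25, bias=0.5)
large_error, large_spread = replicate(n_per_trial=2500, bias=0.5)

print(f"n=25:   mean error {small_error:.2f}, spread {small_spread:.2f}")
print(f"n=2500: mean error {large_error:.2f}, spread {large_spread:.2f}")
```

In this sketch the spread of the estimates shrinks as the sample grows, while the mean error stays close to the assumed bias. A larger trial therefore gives a more precise estimate of the wrong quantity unless the bias is designed out.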
15.4 Importance and control of systematic error or bias
Systematic error, or bias, seriously undermines the internal validity of the trial and distorts the estimate of the treatment effect; put simply, it produces the wrong result. The methodological development of randomised trials has, to a large extent, been concerned with identifying those aspects of the design which are most important for reducing the impact of bias. As noted above, it can be challenging to protect against all forms of bias in a trial while at the same time making the trial both feasible and clinically applicable. Therefore, if we know which aspects of the design matter most for limiting bias, the design can prioritise them. Other aspects that are less critical to validity could be loosened to some extent, to make the design feasible in the real world without seriously invalidating or biasing the results of the study.
15.4.1 Randomisation with allocation concealment Randomisation with allocation concealment prevents the treatment being assigned by the clinician on
the basis of any baseline clinical characteristic, prognostic factor or preference. Without randomisation and allocation concealment, there is a serious potential for selection bias. Note that this does not simply refer to the process by which participants are allocated to treatments randomly. Perhaps more important is that the investigator and participant are completely unaware of the treatment allocation prior to randomisation. So, a pregenerated list of random allocations in which the next allocation is known will not suffice, because the investigator can select patients based on this knowledge. Most of the empirical studies show that randomisation is the single most important way of avoiding bias. In principle, randomisation is quite a simple procedure, but in practice maintaining adequate concealment of allocation is crucial and can be challenging. Why is allocation concealment so important? The main reason is that if a patient or clinician knows which treatment the patient will receive prior to participating or prior to randomisation, then they may choose not to participate in the trial. Further, if the investigator has control over the allocated treatment then they may choose to tamper with the allocation and select a treatment allocation that fits with their preference. These two mechanisms will lead to bias because the patients being treated with one treatment or another will differ systematically from each other. Therefore, the allocation has to be concealed from both patient and clinician until they have entered the trial and been allocated. Patient and clinician may remain blind to the treatment allocation following randomisation, but this is a separate matter from allocation concealment and will not always be possible. How might adequate allocation concealment be achieved? The best way is to ensure that the allocation schedule is held remotely and cannot be accessed by either the patient or clinician.
Modern trials tend to have remote, independent randomisation services. The patient consents to randomisation and the investigator then phones the randomisation service. The phone call is logged and the patient is registered as a participant in the trial. Then, and only then, is the patient randomly allocated to the treatment or comparator. If the allocation is pregenerated, it needs to be completely concealed: for example, it could be held by an independent pharmacist at a remote site. Preferably, the allocation is generated only when the patient has consented to the trial and has been entered into the trial database. The importance of allocation concealment cannot be overstated, because allowing treatment to be chosen by randomisation can be difficult for both patient and clinician. The lengths to which investigators will go in trying to identify the next treatment allocation have been well documented and include steaming open envelopes and using X-ray equipment [5].
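The sequence of consent, registration and only then allocation can be illustrated with a minimal sketch. The class `RandomisationService`, its method names and the arm labels are hypothetical and are not taken from any real trial system. The sketch uses simple (unrestricted) randomisation and ignores blocking, stratification and access control, all of which a real service would need.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class RandomisationService:
    """Hypothetical remote service: no allocation exists until a participant is registered."""
    arms: tuple = ("treatment", "comparator")
    _log: list = field(default_factory=list)          # audit trail of registrations
    _allocations: dict = field(default_factory=dict)  # participant_id -> allocated arm

    def register_and_allocate(self, participant_id: str, consented: bool) -> str:
        if not consented:
            raise ValueError("participant must consent before randomisation")
        if participant_id in self._allocations:
            # Contacting the service again must not produce a second draw.
            return self._allocations[participant_id]
        self._log.append(participant_id)       # log the call and register first
        arm = secrets.choice(self.arms)        # then, and only then, allocate
        self._allocations[participant_id] = arm
        return arm


service = RandomisationService()
print(service.register_and_allocate("P001", consented=True))
```

Because the allocation is drawn only at the moment of registration, there is no pregenerated next allocation for an investigator to discover, which is the property the text describes as preferable.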
15.4.2 Blinding (or masking) of treatment allocation Blinding refers to maintaining concealment of allocation following randomisation. The aim of blinding is (1) to prevent participants being treated in different ways depending on which treatment they have been allocated to (performance bias) and (2) to prevent the trial outcomes being measured differently depending on which treatment participants have been allocated to (ascertainment bias). The allocation may be concealed from the participant, investigator, outcome assessor, statistician, authors, journal reviewers, and so on, and there is sometimes uncertainty about who is masked when the trial is simply referred to as double-blind [6]. It is important to distinguish between allocation concealment and blinding: the former is always essential, whereas it is often impossible to mask the identity of the intervention from all who are involved in a trial, although it will often be possible to mask some parts of the trial (such as assessment of outcome). Particularly in RCTs with subjective outcomes, some method of blinding should always be considered because there is evidence that absence of blinding leads to biased estimates of treatment effect [7]. There is also some uncertainty about whether the success of blinding should be checked during the trial. Current consensus is that this should not be done routinely because it may focus attention on the issue, although there may be specific situations in which it is useful [8].
15.4.3 Maximising follow-up Any participant who does not complete the trial is a source of uncertainty in the results. How do we know what happened to them? If the numbers of trial withdrawals and the reasons for the withdrawals are similar in the two arms of the trial then this may not bias the trial (although it is likely to increase random error, see below). If, however, the total number of drop-outs, or the number dropping out for a specific reason, differs between the arms of the trial then attrition bias may be introduced, especially if drop-out from the trial is related to one or other of the trial treatments.
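A simple tabulation of withdrawals by arm and by reason makes the distinction between total and specific drop-out concrete. The sketch below uses invented records (the arm names, reasons and counts are hypothetical) in which the overall withdrawal proportion is identical in the two arms but the reasons differ.

```python
from collections import Counter


def withdrawal_summary(records):
    """Return {arm: (proportion withdrawn, Counter of withdrawal reasons)}."""
    n_per_arm = Counter(r["arm"] for r in records)
    reasons = {arm: Counter() for arm in n_per_arm}
    for r in records:
        if r["withdrew"]:
            reasons[r["arm"]][r["reason"]] += 1
    return {
        arm: (sum(reasons[arm].values()) / n_per_arm[arm], reasons[arm])
        for arm in n_per_arm
    }


# Hypothetical data: 10 participants per arm, 4 withdrawals in each.
records = (
    [{"arm": "drug", "withdrew": True, "reason": "side effects"}] * 3
    + [{"arm": "drug", "withdrew": True, "reason": "lack of efficacy"}] * 1
    + [{"arm": "drug", "withdrew": False, "reason": None}] * 6
    + [{"arm": "placebo", "withdrew": True, "reason": "side effects"}] * 1
    + [{"arm": "placebo", "withdrew": True, "reason": "lack of efficacy"}] * 3
    + [{"arm": "placebo", "withdrew": False, "reason": None}] * 6
)

for arm, (proportion, by_reason) in withdrawal_summary(records).items():
    print(arm, proportion, dict(by_reason))
```

In this invented example a comparison of totals alone would miss the fact that withdrawal is related to the treatment received, which is the situation in which attrition bias is most likely.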
15.4.4 Accounting for participants following randomisation and analysis by allocated treatment group The potential for introducing bias by excluding certain participants has long been recognised [9]. An analysis in which all randomised participants are included, irrespective of the duration of participation and in their randomly allocated groups, is least likely to introduce bias and is also most likely to produce clinically relevant results [10]. This has often been termed an intention-to-treat analysis, although this term has recently fallen from favour because it is frequently misused and, hence, its meaning has become uncertain [8]. For example, sometimes participants are removed from the analysis (often patients with no follow-up data following randomisation, or participants who did not receive the allocated treatment) and this is called a modified intention-to-treat analysis. The key issue is to decide how trial withdrawals will be dealt with before the analysis begins, and to set this out in a detailed analysis plan.
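The effect of excluding participants after randomisation can be seen in a small invented example. The records, arm names and the function `analyse` below are hypothetical. Counting a participant with no follow-up data as a non-responder (`missing_as=0`) is only one of several possible ways of handling missing outcomes and is used here purely as an assumption for illustration.

```python
# Hypothetical records: outcome 1 = responded, 0 = did not respond, None = no follow-up data.
participants = [
    {"id": 1, "allocated": "drug", "received": "drug", "outcome": 1},
    {"id": 2, "allocated": "drug", "received": "drug", "outcome": 1},
    {"id": 3, "allocated": "drug", "received": None, "outcome": 0},     # never took the drug
    {"id": 4, "allocated": "drug", "received": "drug", "outcome": None},  # lost to follow-up
    {"id": 5, "allocated": "placebo", "received": "placebo", "outcome": 1},
    {"id": 6, "allocated": "placebo", "received": "placebo", "outcome": 0},
    {"id": 7, "allocated": "placebo", "received": "placebo", "outcome": 0},
    {"id": 8, "allocated": "placebo", "received": "placebo", "outcome": 0},
]


def analyse(records, keep=lambda r: True, missing_as=0):
    """Return {allocated arm: (responders, number analysed)} for the records kept."""
    totals = {}
    for r in records:
        if not keep(r):
            continue
        outcome = missing_as if r["outcome"] is None else r["outcome"]
        responders, n = totals.get(r["allocated"], (0, 0))
        totals[r["allocated"]] = (responders + outcome, n + 1)
    return totals


# All randomised participants, analysed in their allocated groups.
all_randomised = analyse(participants)

# A restricted analysis that drops participants without follow-up data
# or who did not receive the allocated treatment.
restricted = analyse(
    participants,
    keep=lambda r: r["outcome"] is not None and r["received"] == r["allocated"],
)

print(all_randomised)
print(restricted)
```

In these invented data the drug arm shows 2 responders out of 4 when everyone randomised is analysed, but 2 out of 2 after exclusions, while the placebo arm is unchanged. The exclusions alone have inflated the apparent treatment effect.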
15.4.5 Empirical evidence of bias in RCTs A major advance in our understanding of RCTs over the past two decades has come from empirical study of the effect of specific design features on the estimated treatment effect (Figure 15.1). Much of this empirical analysis has compared the estimates of effect obtained from trials in which there is greater control of bias with those from trials where there is lesser control. The assumption here is that less well controlled trials will, on average, produce larger (and more biased) estimates of treatment effects than more tightly controlled trials. In a review of 250 RCTs from 33 meta-analyses, Schulz et al. compared the treatment effects from studies according to several design characteristics [11]. They found that the treatment effect was 30–41% larger in trials without adequate concealment of treatment allocation. In a further analysis, inadequate or unclear allocation concealment was again found to be associated with larger (and, presumably, biased) estimates of treatment effects, but lack of blinding was only significantly associated with larger treatment effects when subjective outcomes were used, which is often the case in psychiatry [7].

Fig 15.1 Specific sources of bias in a randomised controlled trial. (Flow diagram: patients are allocated to intervention or control and each arm is followed to outcome; the diagram is labelled with selection bias, performance bias, attrition bias and detection bias.)
15.5 Importance and control of random error and noise A fundamental issue in clinical trial design is the required sample size. The power of a trial is determined by the sample size, the anticipated difference between the interventions (or treatment effect) and the chance of detecting a significant difference when there is no real difference between the treatments (α, which represents the risk of a false positive result, the type 1 error). The aim of achieving a sufficient sample size is therefore to provide sufficient statistical power to be reasonably sure of detecting a real treatment effect while, at the same time, not producing a false positive result. There are standard methods of calculating sample size and these are not considered further here [12, 13]. Achieving a sufficient sample size is often considered to be the main way of reducing random error. Random error is often considered to be less of a problem than selection bias because it essentially produces clean noise that limits the power of the trial and the precision of the results, rather than producing biased results. However, it is essential to consider its impact because a trial needs to be able to detect the treatment signal above the inevitable noise of the uncontrolled environment. A useful way of considering the issues is in terms of the ratio of signal to noise. Failure to control the signal to noise ratio is an important potential reason for a trial to fail to demonstrate a treatment effect even for an intervention of known effectiveness [14]. David Sackett observed that the relationship between our confidence in the results of a trial, signal to noise ratio and sample size is:

$$\text{Confidence} = \frac{\text{Signal}}{\text{Noise}} \times \sqrt{\text{Sample size}}$$

It will immediately be seen that relying on increased sample size alone is inefficient: confidence increases as a function of the square root of sample size. In other words, to double confidence in the results, the sample size will need to increase by a factor of 4. A more efficient strategy is to maximise the signal
to noise ratio. Figure 15.2 shows some potential strategies for maximising signal and sample size, and minimising noise. There are certain conclusions that can be drawn from this relationship. First, a highly controlled trial is likely to maximise signal to noise ratio and will therefore minimise the required sample size for a given degree of confidence. Second, a trial conducted in a relatively uncontrolled clinical environment will tend to have a low signal to noise ratio and numbers will be crucially important. Lastly, a trial with a low signal:noise ratio and minimal sample size is at high risk of failing to pick up a treatment effect. The term assay sensitivity has also been used to denote similar issues.
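The arithmetic behind these conclusions can be checked with a few lines of code. The sketch below assumes the relationship described in the text, with confidence proportional to the signal to noise ratio multiplied by the square root of the sample size. The function names and the numerical values are illustrative only, the units are arbitrary, and the relationship is a heuristic rather than a formal power calculation.

```python
import math


def confidence(signal, noise, sample_size):
    """Heuristic confidence: (signal / noise) * sqrt(sample size), in arbitrary units."""
    return (signal / noise) * math.sqrt(sample_size)


def sample_size_multiplier(confidence_ratio):
    """Factor by which sample size must grow to multiply confidence by confidence_ratio."""
    return confidence_ratio ** 2


baseline = confidence(signal=1.0, noise=2.0, sample_size=100)
larger_trial = confidence(signal=1.0, noise=2.0, sample_size=400)   # four times the sample
quieter_trial = confidence(signal=1.0, noise=1.0, sample_size=100)  # half the noise

print(baseline, larger_trial, quieter_trial)
print(sample_size_multiplier(2))
```

With these illustrative numbers, halving the noise achieves the same gain in confidence as quadrupling the sample size, which is why maximising the signal to noise ratio is described as the more efficient strategy.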
15.6 Reporting the results of clinical trials: the CONSORT statement Bias may often occur in the reporting of randomised trials and this can lead to misinterpretation. The Consolidated Standards of Reporting Trials (CONSORT) Group have developed the CONSORT Statement, which is an evidence-based set of recommendations for the reporting of RCTs [15] (see Table 15.1 and the flow-diagram shown in Figure 15.3). The CONSORT statement offers a standard way for investigators to report trials (http://www.consort-statement.org/). CONSORT is soundly based on the results of empirical studies into the relationship between specific design features and bias. Journals increasingly require articles reporting RCTs to conform to the CONSORT criteria, and it is useful to use the checklist when designing a trial because it focuses specifically on areas which are susceptible to bias.
Fig 15.2 Relationship between confidence, signal to noise ratio and sample size in randomised controlled trials.
Signal:
• Baseline risk in patients
• Responsiveness of patients to treatment
• Potency of experimental intervention
• Completeness of ascertainment of outcome
Noise:
• Reduce variation between patients by inclusion criteria
• Reduce variation between arms by stratification or minimisation
• Increase compliance
• Reduce misclassification
• Improve precision of outcome assessment
Sample size:
• Use as main strategy only as last resort (to increase confidence by 2x, need to 4x sample size); however,
• Reduce risk of failing to achieve sample size by making procedures simple, providing support, reducing barriers,
• Increase compliance
• Reduce misclassification
• Improve precision of outcome assessment

15.7 Different clinical questions will prioritise control of different threats to validity and confidence The design features that protect against bias and the relationship between confidence, signal, noise and sample size begin to make explicit some of the decisions that need to be made in the design of any RCT. In some situations, it might be undesirable to have strict eligibility criteria because it will reduce the external validity, or clinical applicability, of the trial. It may be difficult and expensive to blind treatment assignment from participants. It may be impossible to recruit large numbers of patients. The appropriate design for each trial will depend on the primary clinical question that the trial is designed to answer. For example, in drug development, several phases are conventionally recognised:
• Phase I: Clinical pharmacology in healthy volunteers (first in man).
• Phase II: Clinical pharmacology in patients to establish preliminary efficacy and safety and dose. Frequently subdivided into Phase IIa (to estimate clinically effective dosage) and IIb (preliminary test of efficacy).
• Phase III: Formal therapeutic trials to provide pivotal evidence of efficacy and safety.
• Phase IV: Post licensing studies to establish broader efficacy and safety.
There is often a degree of overlap between these phases and it is increasingly common for a trial to be conducted to meet the needs of multiple phases of development, particularly Phase II and III. For example, a trial may randomise patients between placebo, multiple doses of the investigational drug and an active comparator. In general, however, the design of the study will vary between phases. RCTs will often be the primary research design in phases II and III and an important part of phase IV. Phase II trials will tend to be smaller and more tightly controlled, and the concern is less with external validity than with maximising the sensitivity of the design to pick up treatment effects. Phase III trials are sometimes called pivotal as they are aimed at the regulatory authorities, who will principally require a high level of internal validity and a reasonable degree of external validity. Phase IV trials are usually more concerned with external validity and, to achieve this, may relax some of the design features protecting against bias and may risk a lower signal to noise ratio.

Table 15.1 CONSORT 2010 checklist of information to include when reporting a randomised trial (a).

| Section/topic | Item no | Checklist item | Reported on page no |
|---|---|---|---|
| Title and abstract | 1a | Identification as a randomised trial in the title | |
| | 1b | Structured summary of trial design, methods, results and conclusions (for specific guidance see CONSORT for abstracts) | |
| Introduction | | | |
| Background and objectives | 2a | Scientific background and explanation of rationale | |
| | 2b | Specific objectives or hypotheses | |
| Methods | | | |
| Trial design | 3a | Description of trial design (such as parallel, factorial) including allocation ratio | |
| | 3b | Important changes to methods after trial commencement (such as eligibility criteria), with reasons | |
| Participants | 4a | Eligibility criteria for participants | |
| | 4b | Settings and locations where the data were collected | |
| Interventions | 5 | The interventions for each group with sufficient details to allow replication, including how and when they were actually administered | |
| Outcomes | 6a | Completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed | |
| | 6b | Any changes to trial outcomes after the trial commenced, with reasons | |
| Sample size | 7a | How sample size was determined | |
| | 7b | When applicable, explanation of any interim analyses and stopping guidelines | |
| Randomisation: | | | |
| Sequence generation | 8a | Method used to generate the random allocation sequence | |
| | 8b | Type of randomisation; details of any restriction (such as blocking and block size) | |
| Allocation concealment mechanism | 9 | Mechanism used to implement the random allocation sequence (such as sequentially numbered containers), describing any steps taken to conceal the sequence until interventions were assigned | |
| Implementation | 10 | Who generated the random allocation sequence, who enrolled participants and who assigned participants to interventions | |
| Blinding | 11a | If done, who was blinded after assignment to interventions (for example, participants, care providers, those assessing outcomes) and how | |
| | 11b | If relevant, description of the similarity of interventions | |
| Statistical methods | 12a | Statistical methods used to compare groups for primary and secondary outcomes | |
| | 12b | Methods for additional analyses, such as subgroup analyses and adjusted analyses | |
| Results | | | |
| Participant flow (a diagram is strongly recommended) | 13a | For each group, the numbers of participants who were randomly assigned, received intended treatment and were analysed for the primary outcome | |
| | 13b | For each group, losses and exclusions after randomisation, together with reasons | |
| Recruitment | 14a | Dates defining the periods of recruitment and follow-up | |
| | 14b | Why the trial ended or was stopped | |
| Baseline data | 15 | A table showing baseline demographic and clinical characteristics for each group | |
| Numbers analysed | 16 | For each group, number of participants (denominator) included in each analysis and whether the analysis was by original assigned groups | |
| Outcomes and estimation | 17a | For each primary and secondary outcome, results for each group and the estimated effect size and its precision (such as 95% confidence interval) | |
| | 17b | For binary outcomes, presentation of both absolute and relative effect sizes is recommended | |
| Ancillary analyses | 18 | Results of any other analyses performed, including subgroup analyses and adjusted analyses, distinguishing pre-specified from exploratory | |
| Harms | 19 | All important harms or unintended effects in each group (for specific guidance see CONSORT for harms) | |
| Discussion | | | |
| Limitations | 20 | Trial limitations, addressing sources of potential bias, imprecision and, if relevant, multiplicity of analyses | |
| Generalisability | 21 | Generalisability (external validity, applicability) of the trial findings | |
| Interpretation | 22 | Interpretation consistent with results, balancing benefits and harms and considering other relevant evidence | |
| Other information | | | |
| Registration | 23 | Registration number and name of trial registry | |
| Protocol | 24 | Where the full trial protocol can be accessed, if available | |
| Funding | 25 | Sources of funding and other support (such as supply of drugs), role of funders | |

(a) We strongly recommend reading this statement in conjunction with the CONSORT 2010 Explanation and Elaboration for important clarifications on all the items. If relevant, we also recommend reading CONSORT extensions for cluster randomised trials, non-inferiority and equivalence trials, non-pharmacological treatments, herbal interventions and pragmatic trials. Additional extensions are forthcoming: for those and for up to date references relevant to this checklist, see www.consort-statement.org.
15.8 The classification of RCTs RCTs are often classified according to the extent to which they maximise internal or external validity. Trials are usually optimally designed to answer specific research questions and a corollary of this is that different questions will require different designs. Variation in the primary design objectives of a trial should be borne in mind when considering
the results of apparently h