Genomic and Personalized Medicine
This page intentionally left blank
Genomic and Personalized Medicine Volume 1 Edited by Huntington F. Willard, Ph.D. Director Duke Institute for Genome Sciences & Policy Nanaline H. Duke Professor of Genome Sciences Howard Hughes Medical Institute Professor Duke University Durham, North Carolina 27708
and Geoffrey S. Ginsburg, M.D., Ph.D. Center Director, Center for Genomic Medicine Duke Institute for Genome Sciences & Policy Professor of Medicine Duke University Durham, North Carolina 27708
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS SAN DIEGO • SAN FRANCISCO • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
Academic Press is an imprint of Elsevier 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA 30 Corporate Drive, Suite 400, Burlington, MA 01803, USA 32 Jamestown Road, London NW1 7BY, UK First edition 2009 Copyright © 2009 Elsevier Inc. All rights reserved No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (44) (0) 1865 843830; fax (44) (0) 1865 853333; email:
[email protected]. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material Notice No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-369420-1 (set) ISBN: 978-0-12-370888-5 (vol. 1) ISBN: 978-0-12-370889-2 (vol. 2) For information on all Academic Press publications visit our web site at elsevierdirect.com Typeset by Charon Tec Ltd., A Macmillan Company. (www.macmillansolutions.com) Printed and bound in China 09 10 11 12 13 10 9 8 7 6 5 4 3 2 1
Contents in Brief Foreword
xxv
Preface
xxvii
Acknowledgements
xxix
Abbreviations
xxxi
Advisory Board
xxxix
Contributors
xli
PART I GENOMIC APPROACHES TO BIOLOGY AND MEDICINE Section 1 Principles of Human Genomics 1. Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine 2. Concepts of Population Genomics 3. Genomic Approaches to Complex Disease 4. Human Health and Disease: Interaction Between the Genome and the Environment 5. Epigenomics and its Implications for Medicine 6. Systems Biology and the Emergence of Systems Medicine Section 2 Technology Platforms for Genomic Medicine 7. DNA Sequencing for the Detection of Human Genome Variation and Polymorphism 8. Genome-Wide Association Studies and Genotyping Technologies 9. Copy Number Variation and Human Health 10. Inter-Species Comparative Sequence Analysis: A Tool for Genomic Medicine 11. DNA Methylation Analysis: Providing New Insight into Human Disease 12. Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine 13. DNA Microarrays in Biological Discovery and Patient Care 14. Proteomics: The Deciphering of the Functional Genome
3
4 22 33 47 60 74
87 88 101 108 120 131 143 157 173
15. Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms 16. Comprehensive Analysis of Gene Function: RNA interference and Chemical Genomics
180 193
Section 3 Informatic and Computational Platforms for Genomic Medicine 205 17. Bioinformatic and Computational Analysis for Genomic Medicine 206 18. Fundamentals and History of Informatics for Genomic and Personalized Medicine 226 19. Electronic Medical Records in Genomic Medicine Practice and Research 233 20. Clinical Decision Support in Genomic and Personalized Medicine 242 21. Online Health Information Retrieval by Consumers and the Challenge of Personal Genomics 252
PART II TRANSLATIONAL APPROACHES IN GENOMIC AND PERSONALIZED MEDICINE Section 4 Enabling Strategies in the Translation of Genomics into Medicine 22. Translational Genomics: From Discovery to Clinical Practice 23. Principles of Study Design 24. Biobanking in the Post-Genome Era 25. Application of Biomarkers in Human Population Studies 26. Validation of Candidate Protein Biomarkers 27. Pharmacogenetics and Pharmacogenomics 28. The Role of Genomics and Genetics in Drug Discovery and Development 29. Role of Pharmacogenomics in Drug Development 30. Clinical Implementation of Translational Genomics 31. Translating Innovation in Diagnostics: Challenges and Opportunities 32. The Role of Genomics in Enabling Prospective Health Care
261 262 275 284 299 308 321 335 343 357 367 378
v
vi
■
Contents in Brief
Section 5 Policy Challenges in Genomic and Personalized Medicine 33. From Sequence to Genomic Medicine: Genome Policy Considerations 34. Educational Strategies in Genomic Medicine 35. Federal Regulation of Genomic Medicine 36. Economic Issues and Genomic Medicine 37. Public–Private Interactions in Genomic Medicine: Research and Development
387 388 401 414 424 434
Section 6 Genomic Medicine and Public Health 445 38. What Is Public Health Genomics? 446 39. Why Do We Need Public Health in the Era of Genomic Medicine? 454 40. Principles of Human Genome Epidemiology 461 41. Genomics and Population Screening: Example of Newborn Screening 470 42. Family History: A Bridge Between Genomic Medicine and Disease Prevention 481 Section 7 Clinical Technologies Supporting Personalized Medicine 43. Molecular Imaging as a Paradigm for Genomic and Personalized Medicine 44. PET Imaging in Genomic Medicine 45. MRI for Molecular Imaging Applications: Overview, Perspectives, and Challenges 46. Fluorescence Imaging: Overview and Applications in Biomedical Research 47. Imaging Genetics: Integration of Neuroimaging and Genetics in the Search for Predictive Markers 48. Viral Chip Technology in Genomic Medicine 49. Vaccines Against Infectious Diseases: A BiotechnologyDriven Evolution 50. Cancer Vaccines: Some Basic Considerations 51. Biosensors for the Genomic Age 52. Stem Cells 53. Gene Therapy
493 494 500 512 524 532 538 562 573 590 599 610
PART III DISEASE-BASED GENOMIC AND PERSONALIZED MEDICINE: GENOME DISCOVERIES AND CLINICAL APPLICATIONS Section 8 Cardiovascular Genomic Medicine 623 54. The Genomics of Hypertension 624 55. Lipoprotein Disorders 634 56. Reactive Oxygen Species Signals Leading to Vascular Dysfunction and Atherosclerosis 652 57. Genomics of Myocardial Infarction 665 58. Acute Coronary Syndromes 680
59. Heart Failure in the Era of Genomic Medicine 60. Application of Genomic Methodology for Assessment of Cardiac Transplant Rejection 61. Hypertrophic Cardiomyopathy in the Era of Genomic Medicine 62. Genetics and Genomics of Arrhythmias 63. Hemostasis and Thrombosis 64. Peripheral Arterial Disease 65. Genomics of Congenital Heart Disease 66. Genomics of Perioperative and Procedural Medicine
692
Section 9 Oncology Genomic Medicine 67. Cancer Genes, Genomes, and the Environment 68. Immune Cells and the Tumor Microenvironment 69. Lymphomas 70. Genomics in Leukemias 71. Genomics of Lung Cancer 72. Breast Cancer and Genomic Medicine 73. Colorectal Cancer 74. Prostate Cancer 75. Molecular Biology of Ovarian Cancer 76. Pancreatic Neoplasms 77. The Multiple Endocrine Neoplasia Syndromes 78. Genomics of Head and Neck Cancer 79. Genomic Medicine, Brain Tumors and Gliomas 80. Molecular Therapeutics of Melanoma 81. Emerging Concepts in Metastasis 82. Diagnostic-Therapeutic Combinations in the Treatment of Cancer
807 808 818 830 844 856 869 879 898 913 921 931 945 956 967 977
705 716 729 755 773 781 794
990
Section 10 Inflammatory Disease Genomic Medicine 1009 83. Environmental Exposures and the Emerging Field of Environmental Genomics 1010 84. Molecular Basis of Rheumatoid Arthritis 1017 85. “Omics” in the Study of Multiple Sclerosis 1032 86. Inflammatory Bowel Disease 1040 87. Glomerular Disorders 1056 88. Spondyloarthropathies 1067 89. Asthma Genomics 1084 90. Genomic Aspects of Chronic Obstructive Pulmonary Disease 1098 91. Genomic Determinants of Interstitial Lung Disease 1110 92. Peptic Ulcer Disease 1122 93. Cirrhosis in the Era of Genomic Medicine 1138 94. Systemic Sclerosis 1155 Section 11 Metabolic Disease Genomic Medicine 95. Genomic Medicine of Obesity 96. Diabetes 97. Metabolic Syndrome 98. Nutrition and Diet in the Era of Genomics
1169 1170 1187 1194 1204
Contents in Brief
Section 12 Neuropsychiatric Disease Genomic Medicine 99. The Genetic Approach to Dementia 100. Parkinson’s Disease: Genomic Perspectives 101. Epilepsy Predisposition and Pharmacogenetics 102. Ophthalmology 103. Genomic Basis of Neuromuscular Disorders 104. Psychiatric Disorders 105. Genomics and Depression 106. Bipolar Disorder in the Era of Genomic Psychiatry Section 13 Infectious Disease Genomic Medicine 107. Genomic Approaches to the Host Response to Pathogens
1221 1222 1233 1243 1256 1265 1282 1289 1299
1313 1314
108. 109. 110. 111. 112.
Genomic Medicine and Aids Viral Genomics and Antiviral Drugs Host Genomics and Bacterial Infections Sepsis and the Genomic Revolution Diagnosis, Management and Disease Mechanisms of Hepatitis in the Era of Genomic Medicine
■
vii
1324 1340 1347 1362 1375
Glossary
1391
Index
1407
This page intentionally left blank
Table of Contents Foreword
xxv
Preface
xxvii
Acknowledgements
xxix
Abbreviations
xxxi
Advisory Board
xxxix
Contributors
xli
PART I GENOMIC APPROACHES TO BIOLOGY AND MEDICINE Section 1 Principles of Human Genomics 1. Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine Huntington F. Willard Introduction The Human Genome Variation in the Human Genome Expression of the Human Genome Genes, Genomes and Disease From Genome to Personalized Medicine Conclusion References Recommended Resources 2. Concepts of Population Genomics Mike E.Weale and David B. Goldstein Introduction Important Concepts in Population Genomics Human Population Genomics Application of Population Genomics to Genomic Medicine Conclusions References Recommended Resources 3. Genomic Approaches to Complex Disease Desmond J. Smith and Aldons J. Lusis Introduction Identifying Common and Rare Genomic Variations in the Population Relating DNA Variation to Phenotypes Integration of “Omic” Technologies with Genetics
3
4 4 6 9 11 13 15 18 18 21 22 22 22 26 28 29 30 32 33 33 33 36 40
Conclusions and Prospects Acknowledgements References Recommended Resources 4. Human Health and Disease: Interaction Between the Genome and the Environment Kenneth Olden Introduction Importance of the Environment The Environmental Genome Project Problematic Nature of Gene–environment Interaction Studies Polymorphism and Disease Susceptibility: Case–control Studies Epigenetics and the Environment Conclusion Acknowledgements References 5. Epigenomics and its Implications for Medicine Moshe Szyf Introduction DNA Methylation Patterns Chromatin Modification DNA Methylation and Chromatin States Co-operatively Determine the State of Activity of Genes Epigenetics and Human Disease Conclusions Acknowledgements References 6. Systems Biology and the Emergence of Systems Medicine Nathan D. Price, Lucas B. Edelman, Inyoul Lee, Hyuntae Yoo, Daehee Hwang, George Carlson, David J. Galas, James R. Heath and Leroy Hood Introduction Systems Science in Biology And Medicine Multi-parameter Blood-bourne Biomarkers Emerging in vivo and in vitro Technologies Computational and Mathematical Challenges in Systems Medicine Conclusions and Perspectives References Recommended Resources
43 43 43 46
47 47 48 50 52 53 56 57 57 57 60 60 61 63
65 67 69 70 70 74
74 75 76 78 81 81 82 85 ix
x
■
Table of Contents
Section 2 Technology Platforms for Genomic Medicine 7. DNA Sequencing for the Detection of Human Genome Variation and Polymorphism Samuel Levy and Yu-Hui Rogers Introduction DNA Sequencing Other Methodologies for Polymorphism Detection Future Directions Acknowledgements References 8. Genome-Wide Association Studies and Genotyping Technologies Kevin V. Shianna Introduction Principles of Genome-wide Association Studies Platform Overview Conclusion References 9. Copy Number Variation and Human Health Charles Lee, Courtney Hyland, Arthur S. Lee, Shona Hislop and Chunhwa Ihm Introduction Basic Principles of CNVs Detecting CNVs in a Genome-wide Manner Association of CNVs to Disease and Disease Susceptibility Implications of CNVs Conclusions Acknowledgements References Recommended Resources 10. Inter-Species Comparative Sequence Analysis: A Tool for Genomic Medicine Anthony Antonellis and Eric D. Green Introduction Performing Comparative Sequence Analysis: Resources and Methods Comparative Sequence Analysis and Human Genetic Disease CSA and the Future of Human Genetics and Genomic Medicine References Recommended Resources
87 88 88 89 95 96 97 97
101 101 101 103 106 106 108
108 108 112 114 116 118 118 118 119
120 120 121 124 128 128 130
11. DNA Methylation Analysis: Providing New Insight into Human Disease 131 Susan Cottrell,Theo deVos, Juergen Distler, Carolina Haefliger, Ralf Lesche, Achim Plum and Matthias Schuster Introduction Technology to Assess DNA Methylation
131 132
Clinical Impact of DNA Methylation Analysis Conclusion References Recommended Resources 12. Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine Michelle M. Kittleson, Rafael Irizarry, Bettina Heidecker and Joshua M. Hare Introduction Gene Expression Technology Gene Discovery Molecular Signature Analysis Gene Discovery Versus Molecular Signature Analysis Current Issues in Gene Expression Analysis Alternative Technologies for Analysis of the Transcriptome Conclusion Acknowledgements References Recommended Resources 13. DNA Microarrays in Biological Discovery and Patient Care Andrew J.Yee and Sridhar Ramaswamy Introduction Microarray Technology Data Analysis Applications Limitations and Challenges Future Directions Conclusions References Recommended Resources
136 139 139 142
143
143 143 146 148 151 151 153 153 154 154 156 157 157 157 160 161 165 165 168 168 172
14. Proteomics: The Deciphering of the Functional Genome 173 Li-Rong Yu, Nicolas A. Stewart and Timothy D.Veenstra Introduction 173 Gel-based and Solution-based Proteomics 174 Mass Spectrometry 175 Bioinformatics 176 Impact of Proteomics on Understanding Diseases 178 Conclusions 178 Acknowledgements 179 References 179 Recommended Resources 179 15. Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms 180 Christopher B. Newgard, Robert D. Stevens, Brett R.Wenner, Shawn C. Burgess, Olga Ilkayeva, Michael J. Muehlbauer, A. Dean Sherry and James R. Bain Introduction 180 Current Metabolomics Platforms: Basic Tools and General Features 181
Table of Contents
Comparison of NMR and MS Technologies for Unbiased Metabolic Profiling MS Methods for Targeted Metabolic Profiling Examples of NMR-based Metabolic Profiling in Disease Research Examples of Targeted MS-based Metabolic Profiling for Understanding of Disease Mechanisms Integration of Metabolic Profiling with Other “Omics” Technologies Future Directions References 16. Comprehensive Analysis of Gene Function: RNA interference and Chemical Genomics Bjorn T. Gjertsen and James B. Lorens Introduction RNA Interference Gene Function Analysis: An Overview Chemical Genomics Gene Function Studies Conclusions Acknowledgements References Recommended Resources Section 3 Informatic and Computational Platforms for Genomic Medicine 17. Bioinformatic and Computational Analysis for Genomic Medicine Atul J. Butte Introduction Vignettes: How Specific Bioinformatics Methods Can Change the Practice Ofmedicine Analytic Methods Where Data for Studies May be Found Bioinformatics Vocabularies and Ontologies Freely Available Bioinformatics Tools New Questions for Genomic Medicine Acknowledgements References Recommend Resources 18. Fundamentals and History of Informatics for Genomic and Personalized Medicine A. Jamie Cuticchia Introduction Databases for Genomic Medicine Conclusion References Recommended Resources 19. Electronic Medical Records in Genomic Medicine Practice and Research Glenn S. Gerhard, Robert D. Langer, David J. Carey and Walter F. Stewart
182 183 184 184 189 189 190
193 193 194 194 197 199 200 201 201 203
205 206 206 207 213 214 215 216 220 221 221 225 226 226 227 230 231 232
233
Introduction EMRs and Genomic Medicine Clinical Practice EMRs and Genomic Medicine Research Conclusion Acknowledgements References Recommended Resources 20. Clinical Decision Support in Genomic and Personalized Medicine Kensaku Kawamoto and David F. Lobach Introduction CDS Background: History, Examples, Evidence of Effectiveness, and Desirable Attributes Potential Uses of CDS to Support Genomic and Personalized Medicine Limited Deployability: The Potential Achilles’ Heel of CDS Systems for Genomic Medicine Challenges to Widespread Deployment of Effective CDS Systems Conclusions Disclosures References Recommended Resources 21. Online Health Information Retrieval by Consumers and the Challenge of Personal Genomics Mark S. Boguski Introduction Characteristics of Consumer Searches for Health Information What and Where Are Consumers Searching? Personalized Genomics for Consumers Summary and Conclusions References Wikipedia References
■
xi
233 234 236 240 240 240 241 242 242 243 245 246 247 249 249 249 251
252 252 252 253 255 256 256 257
PART II TRANSLATIONAL APPROACHES IN GENOMIC AND PERSONALIZED MEDICINE Section 4 Enabling Strategies in the Translation of Genomics into Medicine 22. Translational Genomics: From Discovery to Clinical Practice Geoffrey S. Ginsburg Introduction A Roadmap for Translation Where Can Genomics Have Impact in the Continuum of Health and Disease? The Genomics “Gold Rush”
261 262 262 263 263 264
xii
■
Table of Contents
The Personal Genome: Precious Code or Fool’s Gold? “Grand Challenges” in Translation of Genomics to Human Health Translational Genomics: Enabling Competencies How Are We Going to Do This? Developing Environments That Foster Translational Genomics to Health Applications References 23. Principles of Study Design Peter Grass Introduction Principles of Experimental Design Design Issues in Genomic Medicine References Recommended Resources 24. Biobanking in the Post-Genome Era Theresa Puifun Chow, Chia Kee Seng, Per Hall and Edison T. Liu Introduction The Biobanking Evolution The Past Imperfect Resources The Evolving Face of Biobanking Existing Models: Biobanking in Europe and the USA Singapore’s National Biobank and National Aspirations In Biomedical Research The Future of National Biobanks References Recommended Resources 25. Application of Biomarkers in Human Population Studies Stefano Bonassi and Monica Neri Introduction Biomarkers in Medicine Biomarkers of Exposure Biomarkers of Early Disease Risk Biomarkers of Genetic Susceptibility to Disease Conclusions References 26. Validation of Candidate Protein Biomarkers Ingibjörg Hilmarsdóttir and Nader Rifai Introduction Optimization of the Candidate Protein Research Assay Analytical Evaluation Reference Intervals Pre-analytical Variation Clinical Evaluation Indicators of Diagnostic Accuracy And Predictability Diagnostic Research Studies Design of Diagnostic Studies
266 266 267
270 272 275 275 276 281 283 283 284
284 285 285 285 286 288 294 296 297
299 299 299 302 303 304 306 306 308 308 308 309 311 311 312 313 316 316
Transferability Of Test Performance Assay Transfer to Diagnostic Company Regulatory Requirements References Recommended Resources 27. Pharmacogenetics and Pharmacogenomics Iris Grossman and David B. Goldstein Introduction Pharmacogenetic Studies: From Concept to Practice Marker Selection – Strategy and Application From Bench to Bedside: Integration of Pharmacogenetic Testing into Clinical Practice Examples of PGx Tests: Promising New Developments and Marketed Products Future Developments Required for the Field to Fully Meet its Expectations References Recommended Resources 28. The Role of Genomics and Genetics in Drug Discovery and Development Robert I.Tepper and Ronenn Roubenoff Introduction The Drug Discovery Process Genomics in Target Discovery Genomic Approaches to Drug Identification Pharmacogenomics and Drug Development Pharmacodynamic Markers and their Role in Drug Discovery and Development Toxicogenomics Genetics and Genomics in Clinical Trial Design Genomics in Drug Approval and Regulation Conclusion References Recommended Resources 29. Role of Pharmacogenomics in Drug Development Colin F. Spraggs, Beena T. Koshy, Mark R. Edbrooke and Allen D. Roses Introduction Drug Development Critical Path Drug Development Economics Methods for Identification of Genetic Classifiers Pharmacogenomics in the Drug Development Pipeline Efficacy Pharmacogenetics – here and Now Drug Exposure Pharmacogenetics to Tune Efficacy and Safety Profiles Investigation and Management of Safety in Clinical Trials Other Genomic Methods: RNA Interference to Direct Drug Usage “No Samples, No Science” Conclusions Acknowledgements References
317 317 318 318 320 321 321 323 325 328 329 331 332 334
335 335 335 336 337 337 338 340 340 340 341 341 342 343
343 344 344 344 346 348 349 350 352 352 354 355 355
Table of Contents
30. Clinical Implementation of Translational Genomics Wendy K. Chung Introduction Genetic Stratification Will Allow Medical Care to be Individualized After a Diagnosis is Made Population-based Germline Genomic Screening Newborn Screening Pharmacogenetics Somatic Genomic Variation Novel Sources of Genomic Variation Laboratory Standards to Ensure Analytic Validity Clinical Validation and Clinical Utility Cost Reimbursement Who Will Provide Genomic Medical Care? Genomic Literacy Ethical, Legal, and Social Issues Conclusions Acknowledgements References Recommended Resources 31. Translating Innovation in Diagnostics: Challenges and Opportunities Matthew P. Brown, Myla Lai-Goldman and Paul R. Billings Introduction Novel Diagnostics Conclusions: Translational Challenges for Innovative Diagnostics References 32. The Role of Genomics in Enabling Prospective Health Care Ralph Snyderman Introduction Predictive Models Predictive Factors Risk Assessment for Breast Cancer Pharmacogenomics Conclusion Acknowledgements References Recommended Resources Section 5 Policy Challenges in Genomic and Personalized Medicine 33. From Sequence to Genomic Medicine: Genome Policy Considerations Susanne B. Haga Introduction Genome Research after the Human Genome Project Policy Issues in Large-scale Genetics and Genomics Research
357 357 357 358 359 360 360 361 361 362 362 363 363 364 364 365 365 365 366
367
367 368 374 375
378 378 380 380 382 383 383 383 384 385
■
Integrating Genomic Medicine Applications in Healthcare Conclusion References
xiii
392 396 397
34. Educational Strategies in Genomic Medicine Charles J. Epstein Introduction Genetic and Genomic Literacy of the Public and Makers of Public Policy Education of the Providers of Health Care Conclusion References
401
35. Federal Regulation of Genomic Medicine Janet Woodcock Introduction Regulation of Genomic Tests Pharmacogenomics in Drug Development And Clinical Medicine: the Role of Regulation Fda Efforts to Advance Genomic Product Development Conclusions References Recommended Resources
414
36. Economic Issues and Genomic Medicine David L.Veenstra, Louis P. Garrison and Scott D. Ramsey Introduction Economic Evaluation and Cost-effectiveness Analysis Evaluating Genomic Technologies Economic Incentives and the Future of Genomic Medicine Establishing Value-based Reimbursement For Genomic Technologies Conclusions References 37. Public–Private Interactions in Genomic Medicine: Research and Development Subhashini Chandrasekharan, Noah C. Perin, Ilse R.Wiechers and Robert Cook-Deegan Introduction Landscape of Private Sector Genomics Future Trends Acknowledgements References
401 402 404 411 411
414 416 418 421 422 422 423 424 424 424 426 428 430 431 432 434
434 435 441 442 442
387 388 388 389 391
Section 6 Genomic Medicine and Public Health 445 38. What Is Public Health Genomics? 446 Alison Stewart and Ron Zimmern Introduction 446 The Emergence of Public Health Genomics 446 The Definition of Public Health Genomics 447 Key Concepts in Public Health Genomics 447
xiv
■
Table of Contents
The “Enterprise” of Public Health Genomics Core Activities in Public Health Genomics Moving Public Health Genomics Forward: Leadership And Networks Conclusion References Recommended Resources
448 449 451 452 452 453
39. Why Do We Need Public Health in the Era of Genomic Medicine? 454 Muin J. Khoury and Marta Gwinn Introduction 454 The Continuum from Genetics to Genomics in Health Practice 454 The Role of Public Health in the Translation of Human Genome Discoveries into Health Applications 455 The Focus on Disease Prevention and Health Promotion 457 The Population Perspective: Crucial Role of Public Health Sciences 457 The Role of Knowledge Integration Across Disciplines 458 The Role of Health Services Research and Population Health Assessment, Assurance, and Evaluation 458 Conclusion 458 References 459 40. Principles of Human Genome Epidemiology Marta Gwinn and Muin J. Khoury Introduction Human Genome Epidemiology Epidemiologic Study Designs Epidemiologic Measures Of Disease Frequency, Association, and Risk Measurement and Bias Gene–environment Interaction Probability and Personalized Medicine Building the Evidence Base Conclusion References Recommended Resources 41. Genomics and Population Screening: Example of Newborn Screening John D.Thompson and Michael Glass Introduction Components of the NBS System Screening Technology: Simple Ideas, Complex Realities Variability Among NBS Programs Influence of Genetics and -omic Technologies on NBS Acknowledgements
461 461 461 462 463 464 465 466 466 467 467 469
470 470 471 471 476 476 477
References Recommended Resources 42. Family History: A Bridge Between Genomic Medicine and Disease Prevention Maren T. Scheuner and Paula W.Yoon Introduction Clinical Approach Conclusion Acknowledgements References Recommended Resources Section 7 Clinical Technologies Supporting Personalized Medicine 43. Molecular Imaging as a Paradigm for Genomic and Personalized Medicine Ralph Weissleder Introduction Molecular Imaging and Cancer Detection Molecular Imaging to Determine Treatment Efficacy Molecular Imaging and Drug Development Near-term Needs and Opportunities Acknowledgement References 44. PET Imaging in Genomic Medicine Vikas Kundra and Osama Mawlawi Introduction Physics Imaging Agents and Methods in Analysis of Biological Samples Conclusion References Recommended Resources 45. MRI for Molecular Imaging Applications: Overview, Perspectives, and Challenges Dmitri Artemov Introduction Basics of MRI Contrast MR Contrast Agents for Molecular Imaging Applications Molecular Imaging Applications of MRI Conclusions Acknowledgements References 46. Fluorescence Imaging: Overview and Applications in Biomedical Research Vasilis Ntziachristos Introduction Imaging Technology
477 479
481 481 481 488 488 488 492
493 494 494 494 496 497 497 498 498 500 500 500 503 510 510 511
512 512 513 515 517 520 521 521
524 524 525
Table of Contents
Fluorescence Applications in Genomic Medicine References 47. Imaging Genetics: Integration of Neuroimaging and Genetics in the Search for Predictive Markers Ahmad R. Hariri Introduction Conceptual Basis of Imaging Genetics Basic Principles of Imaging Genetics Imaging Genetics and the Neurobiology of the 5-httlpr Future Directions References 48. Viral Chip Technology in Genomic Medicine Zeno Földes-Papp Introduction Role of Viruses in Human Infectious Disease Microfabrication Nanofabrication Are there Additional Alternatives to Diagnostic Microarrays? Conclusions Acknowledgements References Recommended Resources 49. Vaccines Against Infectious Diseases: A Biotechnology-Driven Evolution Vega Masignani, Hervé Tettelin and Rino Rappuoli Introduction The Genomic Era: From Microbial Genome to Vaccine Development Impact of Whole Genome Analyses In Vivo Gene Expression: Ivet And Stm Microarray Expression Technology Proteomics From Microbial to Human Genome Sequencing: Genomic Medicine Metagenomics: Deciphering Host–microbe Interactions Conclusions References 50. Cancer Vaccines: Some Basic Considerations Hans-Georg Rammensee, Harpreet Singh-Jasuja, Niels Emmerich and Steve Pascolo Introduction Immune Suppression by Tumors and by Regulatory T-cells The Ideal Therapeuticcancer Vaccine Molecularly Undefinedcancer Vaccines Peptides
530 530
532 532 533 534 535 536 536 538 538 538 541 552 553 554 554 554 560
562 562 563 564 566 567 568
■
Proteins and Carbohydrates Nucleic Acids: Plasmid DNA and Messenger RNA Viral and Bacterial Vectors Adjuvants, Formulations, and Route of Application Immunomonitoring Conclusions References 51. Biosensors for the Genomic Age Meghan B. O’Donoghue, Lin Wang,Yan Chen, Gang Yao and Weihong Tan Introduction Biosensors for Detection of Oligonucleotides for the Detection of Disease Nucleic Acid as Tools for Biosensing Outlook References 52. Stem Cells Rikkert L. Snoeckx, Kris Van Den Bogaert and Catherine M.Verfaillie Introduction Types of Stem Cells: Embryonic and Adult Stem Cells How to Define The Molecular Signature of Stem Cells Future Directions to Identify the Global Integrated Regulatory Network Future Directions in Stemcell Therapies Conclusion References 53. Gene Therapy James M.Wilson and Nelson A.Wivel Introduction Gene Delivery Vehicles Gene Therapy Clinical Trials Conclusion References
xv
578 579 582 582 583 584 584 590
590 591 592 596 597 599
599 599 600 601 604 606 606 610 610 611 614 617 617
568 569 570 570 573
573 574 576 576 577
PART III DISEASE-BASED GENOMIC AND PERSONALIZED MEDICINE: GENOME DISCOVERIES AND CLINICAL APPLICATIONS Section 8 Cardiovascular Genomic Medicine 54. The Genomics of Hypertension Chana Yagil and Yoram Yagil Introduction Predisposition Screening Diagnosis
623
624 625 627 627
xvi
■
Table of Contents
Prognosis Pharmacogenomics Monitoring Novel and Emerging Therapeutics Conclusion Acknowledgements References Recommended Resources 55. Lipoprotein Disorders Sekar Kathiresan and Daniel J. Rader Introduction Overview of Lipoprotein Metabolism Plasma Lipid and Lipoprotein Levels and Atherosclerotic Cardiovascular Disease Inherited Basis for Blood Lipid Traits Screening for Lipid Disorders Genetics of Ldl-c Genetics of Hdl-c Genetics of Triglycerides Genetic Lipid Disorders Without Current Proven Molecular Etiology Influence of Lipid-modulating Mutations on Risk of Atherosclerotic Cardiovascular Disease Future Directions in Genetics And Genomics of Lipoproteins Pharmacogenetics of Lipid-modulating Therapies Implications of Genomics of Lipoprotein Metabolism For The Development of Novel Therapies Clinical Recommendations for Genetic Testing for Lipid Disorders Acknowledgements References Recommended Resources 56. Reactive Oxygen Species Signals Leading to Vascular Dysfunction and Atherosclerosis Nageswara R. Madamanchi, Aleksandr E.Vendrov and Marschall S. Runge Introduction Sources of ROS in vascular cells Vascular Dysfunction and Atherosclerosis ROS-induced inflammatory gene expression in vascular cells Association of ROS modulators with atherosclerosis ROS signaling in atherosclerotic risk factors ROS-regulated signaling pathways Regulation of transcription factors by ROS ROS signaling in advanced atherosclerosis Polymorphisms in ROS production genes and atherosclerosis Inhibitors of ROS signaling and vascular disease Conclusion Acknowledgements References
628 630 631 631 631 632 632 633 634 634 634 636 636 637 637 639 642 643 643 644 646 646 647 648 648 651 652
652 652 653 653 654 655 655 656 658 658 658 659 660 660
57. Genomics of Myocardial Infarction Carlos A. Hubbard and Eric J.Topol Introduction Predisposition Screening Strategies Diagnosis of Acute MI Prognostic Implications of MI Pharmacogenomics of MI Novel and Emerging Therapies Conclusion References Recommended Resources
665
58. Acute Coronary Syndromes L. Kristin Newby Introduction Predisposition Screening Diagnosis Prognosis Pharmacogenomics Monitoring Novel and Emerging Therapeutics Conclusion Acknowledgements References Recommended Resources
680
59. Heart Failure in the Era of Genomic Medicine Ivor J. Benjamin and Jeetendra Patel Introduction Predisposition (Genetic and Non-genetic) Screening Pathophysiology Diagnosis Prognosis Pharmacogenomics Monitoring Novel Therapeutics and Future Directions Conclusions and Recommendations Acknowledgements References
692
60. Application of Genomic Methodology for Assessment of Cardiac Transplant Rejection Michael Pham, Mario C. Deng, Jay Wohlgemuth and Thomas Quertermous Introduction Cardiac Allotransplantation as a Definitive Therapy for End-stage Heart Failure The Problem of Allograft Rejection Immunosuppression Strategies to Prevent Rejection Current Strategies for Monitoring Transplant Rejection
665 665 672 673 673 673 674 675 676 679
680 680 681 682 685 686 687 687 687 687 688 690
692 692 693 694 695 696 697 698 701 702 703 703
705
705 705 706 706 707
Table of Contents
The Cargo Clinical Study Development of a Gene Expression Signature For Cardiac Transplant Rejection Pathways Monitored by the Gep (Allomap™) Test Variability of the Biopsy Gold Standard and Relationship to the Gep (Allomap™) Score Discordance Between Biopsy Grade and Molecular Score Effect of Time Post-transplantation on Performance of the Gep Test Relationship of Gep Score to Corticosteroid Dose Relationship of Gep Scores to Cytomegalovirus Infection Prediction of Future Acr by Molecular Score Clinical Use of the Allomap™ Test Future Directions and Ongoing Research With Gep Testing Further Application of Genomic Science to Transplant Rejection References 61. Hypertrophic Cardiomyopathy in the Era of Genomic Medicine J. Martijn Bos, Steve R. Ommen and Michael J. Ackerman Introduction Definitions, Clinical Presentation, and Diagnosis Molecular Genetics of HCM Screening And Treatment for HCM Conclusions References 62. Genetics and Genomics of Arrhythmias Jeffrey A.Towbin and Matte Vatta Introduction Specific Cardiac Arrhythmias Primary Abnormalities in Cardiac Rhythm: Ventricular Tachyarrhythmias Complex Forms of Lqts Short Qt Interval Syndrome Familial Vt/cpvt Primary Conduction Abnormalities References 63. Hemostasis and Thrombosis Richard C. Becker and Felicita Andreotti Introduction Genetics of Coagulation Human Hemostatic Variability Genotype–phenotype Influences Gene-environment Influences on Hemostasis Circulating Cellular and Protein Influences on Hemostasis And Thrombosis
709 709 710 710 710 711 711 711 712 712 713 713 714
716 716 716 718 722 725 725 729 729 729 729 736 741 741 742 745 755 755 755 756 757 758 759
■
Race-related Influences On Hemostasis and Thrombosis Linkage Studies in Thrombosis Association Studies in Thrombosis Heritability And Thrombosis: Existing Complexities A Personalized Approach to Hemostasis and Thrombosis Patient Screening: A Traditional Paradigm Patient Screening: A Comprehensive And Population-based Approach Prognostic Considerations Emerging Platform for Hemostasis and Thrombosis Research References
xvii
760 762 762 763 763 764 767 768 768 769
64. Peripheral Arterial Disease 773 Ayotunde O. Dokun and Brian H. Annex Introduction 773 Epidemiology and Risk Factors for PAD 773 Clinical Manifestations of PAD 773 Therapeutic Strategies for PAD 774 Ic And Cli Are Distinct Clinical Outcomes of PAD 775 Genetic Background as a Risk Factor for PAD 775 Gene Polymorphisms Contributing to Atherosclerosis and PAD 775 Polymorphisms in Pro-atherothrombotic Genes and PAD 776 Genetic Locus Conferring Susceptibility to PAD 776 Identification of Novel Gene Polymorphisms Involved in PAD 776 Identification of a Quantitative Trait in a Preclinical Model of PAD 777 Refining a QTL Using Haplotype Analysis 777 Identification of Candidate Genes 777 Future Potential Use of Genomic Methodologies in PAD 777 Acknowledgements 778 References 778 65. Genomics of Congenital Heart Disease Jessie H. Conta and Roger E. Breitbart Introduction CHD Gene Discovery by Conventional Genetics Genomic Strategies for CHD Gene Discovery Cytogenetic and Molecular Genetic testing Medical Evaluation and Counseling Recommendations Conclusion Acknowledgements References Recommended Resources
781 781 781 786 787 788 789 790 790 793
66. Genomics of Perioperative and Procedural Medicine 794 Simon C. Body, Mihai V. Podgoreanu and Debra A. Schwinn Introduction 794 Why Perioperative Insults are not Equivalent to Chronic Ambulatory Disease 795
xviii
■
Table of Contents
Perioperative Atrial Fibrillation Perioperative Venous and Arterial Thrombosis Perioperative Stroke and Neurocognitive Dysfunction Hemorrhage and Cardiac Surgery Dynamic Genomic Markers of Perioperative Outcomes Conclusion References
796 796 798 799 799 800 800
Section 9 Oncology Genomic Medicine 67. Cancer Genes, Genomes, and the Environment Robert L. Strausberg Introduction Acquired Functions of Cancer Cells Chromosomal Aberrations and Cancer Cancer Genes and their Functions Inherited Predisposition Cellular Progression Toward Cancer Through Somatic Changes From Genome to the Clinic Comprehensive Sequencing of the Kinome Expanding the Search Multiple Molecular Mechanisms for Oncogene Activation Microarrays and Cancer Genomics Environmental Cancer Genomics Cancer Genomic Databases Expedite Progress Mouse Models of Cancer Future Directions References
807 808
68. Immune Cells and the Tumor Microenvironment David S. Hsu, Michael Morse,Timothy Clay, Gayathri Devi and H. Kim Lyerly Introduction Immune Cells of the Tumor Microenvironment Examples of Tissue or Gene Microarrays used to Study Tumors Studies of Genomic Immune Stimulation within the Microenvironment Genomic Analysis of Tumor Microenvironment in Immunotherapy Studies Proteomics of Immune Cells and the Tumor Microenvironment Conclusion References
818
69. Lymphomas Lisa Rimsza Introduction Diffuse Large B-cell Lymphoma Primary Mediastinal Large B-cell Lymphoma Hodgkin Lymphoma Follicular Lymphoma Mantle Cell Lymphoma
808 808 808 809 809 810 811 811 812 812 813 813 813 814 814 815
818 819 822 823 824 825 825 826 830 830 833 835 835 836 838
Burkitt Lymphoma Miscellaneous Lymphomas Clinical Applications of Molecular Assays in Lymphoma References
839 839 840 840
70. Genomics in Leukemias Lars Bullinger, Hartmut Dohner and Jonathan R. Pollack Introduction Genomics in Leukemias-insights into Leukemia Biology Genomics in Leukemias – Evaluation of Drug Effects Genomics in Leukemias – Clinical Outcome Prediction Conclusions Acknowledgments References Recommended Resources
844
71. Genomics of Lung Cancer Hasmeena Kathuria, Avrum Spira and Jerome Brody Introduction Early Diagnosis/screening of Lung Cancer Classification And Prognosis Pathogenesis And Treatment of Lung Cancer Conclusion References Recommended Resource
856
72. Breast Cancer and Genomic Medicine Erich S. Huang and Andrew T. Huang Introduction The Promise Genetic Bases Molecular Bases Prognosis and Prediction Molecular Markers Genomic Insights Netherlands Cancer Institute Study Duke-taipei Study Nsabp Study Pathway Prediction The Reality of Clinical Genomics References 73. Colorectal Cancer G.L.Wiesner,T.P. Slavin and J.S. Barnholtz-Sloan Introduction Genomic Model of CRC Predisposition for CRC Risk Assessment, Evaluation, and Genetic Testing Screening and Surveillance Prognosis and Treatment
844 846 847 849 851 852 852 855
856 857 859 862 865 866 868 869 869 869 871 871 872 872 873 873 875 875 876 876 877 879 879 880 882 887 889 889
Table of Contents
Pharmacogenetics/genomics of Chemoprevention and Chemotherapy Novel and Emerging Therapeutics Conclusion References 74. Prostate Cancer Phillip G. Febbo and Philip W. Kantoff Introduction Genetic Predisposition and Alterations in Prostate Cancer Prostate Cancer Detection Genomic Changes Associated with Prostate Cancer Behavior Genomic Changes associated with HormoneRefractory Prostate Cancer Future Prospects of Genomics in Prostate Cancer Care References 75. Molecular Biology of Ovarian Cancer Tanja Pejovic, Matthew L. Anderson and Kunle Odunsi Introduction Inherited Ovarian Cancer Syndromes Options for Screening and Prevention Genomic Instability and Ovarian Cancer Fanconi/anemia Pathway Somatic Mutations in Ovarian Cancer Oncogenes and Growth Factors Tumor Suppressor Genes Epigenetics in Ovarian Carcinogenesis Ovarian Cancer Metastases Angiogenesis Summary References 76. Pancreatic Neoplasms Asif Khalid and Kevin McGrath Introduction Predisposition (Genetic and Non-Genetic) Screening Diagnosis Prognosis Pharmacogenomics Monitoring Novel and Emerging Therapies Conclusion References
890 892 892 893 898 898 898 901 904 906 907 908 913 913 913 914 914 914 915 916 917 917 917 918 919 919 921 921 922 922 923 926 927 927 927 927 928
77. The Multiple Endocrine Neoplasia Syndromes Y. Nancy You,Vipul Lakhani and Samuel A.Wells
931
Introduction The Multiple Endocrine Neoplasia Syndromes
931 932
Conclusion Acknowledgements References Recommended Resources
■
xix
941 941 942 944
78. Genomics of Head and Neck Cancer Giovana R.Thomas and Yelizaveta Shnayder Introduction Head and Neck Squamous Cell Carcinoma Conclusion References Recommended Resources
945
79. Genomic Medicine, Brain Tumors and Gliomas Sean E. Lawler and E. Antonio Chiocca Introduction Predisposition Screening Diagnosis and Prognosis Pharmacogenomics Monitoring Novel and Emerging Therapeutics Conclusions References
956
80. Molecular Therapeutics of Melanoma Jiaqi Shi,Yonmei Feng, Robert S. Krouse, Stanely Leong and Mark A. Nelson Introduction Diagnosis Genetics of Melanoma Pharmacogenomics Novel and Emerging Therapeutics Conclusions References 81. Emerging Concepts in Metastasis Nigel P.S. Crawford and Kent W. Hunter Introduction Tools to Investigate the Mechanisms of Metastasis Assessement of Prognosis and New Treatments for Metastasis: the Role of New Technologies Conclusion References Recommended Resources 82. Diagnostic-Therapeutic Combinations in the Treatment of Cancer Jeffrey S. Ross Introduction Targeted Therapies for Cancer The Ideal Target The First Diagnostic-therapeutic Combination in Cancer Therapy: Hormonal Therapy for Breast Cancer
945 945 952 953 955
956 957 958 958 961 962 962 963 964 967
967 967 968 969 969 973 973 977 977 979 986 987 988 989 990 990 990 990
992
xx
■
Table of Contents
Diagnostic-therapeutic Combinations for Leukemia and Lymphoma 994 Her-2 Positive Breast Cancer and Trastuzumab 994 (Herceptin®) Other Targeted Anticancer Therapies Using Antibodies 995 Selected Targeted Anticancer Therapies Using Small Molecules 999 Pharmacogenomics 1002 Conclusion 1003 References 1003 Section 10 Inflammatory Disease Genomic Medicine 83. Environmental Exposures and the Emerging Field of Environmental Genomics David A. Schwartz Introduction Importance of Environmental Exposures in Human Health Importance of Environmental Exposures in Studying Disease Processes Comparative Environmental Genomics Exposure Assessment in the Gene-environment Paradigm Challenges and Future of Environmental Genomics Conclusion References
1009 1010 1010 1010 1012 1012 1013 1015 1015 1015
84. Molecular Basis of Rheumatoid Arthritis Robert M. Plenge and Michael E.Weinblatt Introduction Clinical Features Predisposition Screening Diagnosis, Prognosis, and Monitoring Pharmacogenomics Conclusions References
1017
85. “Omics” in the Study of Multiple Sclerosis Francisco J. Quintana and Howard L.Weiner Introduction Genomics in MS Transcriptomics in MS Immunomics in MS Proteomics in MS Conclusion References
1032
86. Inflammatory Bowel Disease Ad A. van Bodegraven and Cisca Wijmenga
1040
Introduction Predisposition (Genetic and Non-genetic)
1017 1018 1018 1024 1024 1024 1026 1026
1032 1032 1034 1035 1036 1037 1037
1040 1041
Screening Diagnosis Prognosis Pharmacogenomics Monitoring Novel and Emerging Therapeutics Conclusions References 87. Glomerular Disorders Tadashi Yamamoto, Hidehiko Fujinaka and Visith Thongboonkerd Introduction Techniques for Detection, Quantification, and Profiling of mRNA Expression in the Kidney mRNA Expression Profiles of Glomerular Disorders Genome Variations in Glomerular Disorders Genetics of Congenital Glomerular Disorders Genomic Medicine for Glomerular Disorders Conclusions References 88. Spondyloarthropathies Dirk Elewaut, Filip De Keyser, Filip Van den Bosch, Dieter Deforce and Herman Mielants Introduction Characteristics of SPA Role of Bowel Inflammation Histopathology of Synovitis in SPA Gut and Synovium Transcriptomes Proteome Analysis Novel and Emerging Therapeutics and Biomarkers Conclusions References 89. Asthma Genomics Scott T. Weiss, Benjamin A. Raby and Juan C. Celedón Introduction Asthma: Basic Pathobiology Predisposition (Genetic and Non-genetic) to Asthma Genome-wide Linkage Analyses of Asthma and its Intermediate Phenotypes Candidate-gene Association Studies of Asthma Genome-wide Association Studies of Asthma Asthma Genomics Screening Diagnosis Prognosis Pharmacogenetics Monitoring Novel and Emerging Therapeutics Conclusions Acknowledgements References Recommended Resources
1043 1044 1045 1047 1047 1048 1049 1050 1056
1056 1056 1059 1060 1062 1062 1064 1064 1067
1067 1067 1070 1072 1074 1075 1075 1077 1078 1084 1084 1084 1085 1085 1087 1088 1088 1088 1091 1091 1091 1092 1093 1093 1093 1093 1097
Table of Contents
90. Genomic Aspects of Chronic Obstructive Pulmonary Disease Peter J. Barnes Introduction Predisposition Pathophysiology Cellular and Molecular Mechanisms Diagnosis and Screening Prognosis Management New Treatments Conclusions References 91. Genomic Determinants of Interstitial Lung Disease P.W. Noble and M.P. Steele Introduction Genetic Determinants of Dpld in Mouse Strains Genetic Determinants of Sarcoidosis Surfactant Proteins and Dpld Genetic Determinants of Pulmonary Fibrosis Identified in Rare Inherited Disorders Genetic Determinants of Fip Conclusion References 92. Peptic Ulcer Disease J. Holton Introduction Clinical and Physiological Aspects of Pud Pathophysiology of Ulcer Formation The Helicobacter Genome Human Polymorphism and Pud Genomics in the Management of Disease Future Developments in the use of Genomic Techniques in Relation to Pud Conclusions Acknowledgements References Recommended Resources 93. Cirrhosis in the Era of Genomic Medicine N.A. Shackel, K. Patel and J. McHutchison Introduction Liver Structure Fibrosis and Cirrhosis Diagnosis of Cirrhosis Treatment of Cirrhosis Genetics of Cirrhosis The Liver Transcriptome The Liver Proteome Development of Liver Fibrosis Transcriptome Analysis of Liver Disease Proteomic Studies of Liver Disease
1098 1098 1098 1099 1101 1104 1105 1105 1107 1107 1108 1110 1110 1111 1111 1114 1115 1115 1117 1118 1122 1122 1123 1124 1125 1130 1131 1132 1132 1133 1133 1137 1138 1138 1138 1140 1140 1141 1142 1143 1143 1144 1146 1148
Proteomics in Other Liver Disease Future Impact of Genomics Studies Conclusion References 94. Systemic Sclerosis Ulf Müller-Ladner Introduction Predisposition Screening Diagnosis Prognosis Pharmacogenomics Monitoring and Genomic Factors Therapeutic Strategies Novel and Emerging Therapeutics Conclusions Acknowledgements References
■
xxi
1150 1151 1151 1151 1155 1155 1155 1158 1159 1159 1159 1159 1162 1164 1165 1165 1165
Section 11 Metabolic Disease Genomic Medicine 95. Genomic Medicine of Obesity J. Alfredo Martínez Introduction Obesity: Causes and Genetic Predisposition Search of Genes Involved in Obesity Diagnosis and Characterization of Genes Associated with Obesity Screening and Diagnosis Prognosis and Gene Based-treatments Novel and Emerging Therapeutics Nutrigenomics, Pharmacogenomics and Gene Therapy Conclusions Acknowledgements References
1169 1170
96. Diabetes Maggie Ng and Nancy J. Cox Introduction GWAS in Type 2 Diabetes Future Research in Type 2 Diabetes Genetics GWAS in Type 1 Diabetes Future Studies in Type 1 Diabetes Clinical Utility of Genetic Research in Diabetes Conclusion References
1187
97. Metabolic Syndrome Rebecca L. Pollex and Robert A. Hegele Introduction Diagnosis: Definition of the MetS Phenotype Pathophysiology of MetS in Brief Genetics of MetS
1170 1170 1172 1175 1179 1180 1182 1183 1183 1184
1187 1188 1190 1191 1192 1192 1192 1192 1194 1194 1194 1196 1196
xxii
■
Table of Contents
Conclusion Acknowledgements References Recommended Resources 98. Nutrition and Diet in the Era of Genomics Jose M. Ordovas and Dolores Corella Introduction Methodological Issues Gene–Nutrient Interactions Path Forward Conclusions Acknowledgements References Section 12 Neuropsychiatric Disease Genomic Medicine 99. The Genetic Approach to Dementia Robert L. Nussbaum Introduction Incidence of Dementia Primary Dementias Clinical Approach to the Dementias Future Prospects for Genomic Medicine in the Dementias Conclusion References Recommended Resources 100. Parkinson’s Disease: Genomic Perspectives Shushant Jain and Andrew B. Singleton Introduction Clinical Characteristics of PD Genetics of PD Genetics of Sporadic PD Conclusion References Recommended Resources
1200 1200 1200 1202 1204 1204 1205 1206 1214 1214 1215 1215
1221 1222 1222 1223 1224 1229 1229 1230 1230 1232 1233 1233 1233 1235 1238 1241 1241 1242
101. Epilepsy Predisposition and Pharmacogenetics Nicole M.Walley and David B. Goldstein Introduction Mendelian Epilepsies Common Epilepsies Future Program of Work References
1243
102. Ophthalmology Janey L.Wiggs Introduction Extraocular Muscles Cornea Lens
1256
1243 1243 1248 1251 1252
1256 1257 1257 1259
Iris Trabecular Meshwork Optic Nerve Retina Genetic Testing for Ocular Disorders Summary References
1259 1259 1260 1260 1261 1261 1261
103. Genomic Basis of Neuromuscular Disorders Erynn S. Gordon and Eric P. Hoffman Introduction Motor Neuron Disease Disorders of the Neuromuscular Junction Disorders of the Muscle Predisposition Screening Diagnosis Prognosis Monitoring Current, Novel, and Emerging Therapies Advances in Genomics and Proteomics Conclusion References Recommended Resources
1265
104. Psychiatric Disorders Stephan Züchner and Ranga Krishnan Introduction Great Prospect, But are We There Yet? Classification Reconsidered How Complex Can it Be? The Value of Rare Genetic Variation Converging Methods Personalized Medicine Conclusion References 105. Genomics and Depression Brigitta Bondy Introduction Diagnosis, Prevalence and Course of Depression Pathophysiological Mechanisms Pharmacogenomics of Antidepressants Current Concepts Future Aspects References 106. Bipolar Disorder in the Era of Genomic Psychiatry Ayman H. Fanous, Frank Middleton, Carlos N. Pato and Michele T. Pato Introduction Diagnosis Predisposition Pharmacogenetics
1265 1265 1266 1268 1268 1273 1274 1274 1275 1276 1277 1279 1279 1281 1282 1282 1282 1283 1283 1284 1284 1285 1286 1287 1289 1289 1289 1290 1293 1294 1296 1296 1299
1299 1299 1300 1304
Table of Contents
The Role of New Technologies in Elucidating the Genetics of BPD 1305 Conclusion 1308 References 1308 Section 13 Infectious Disease Genomic Medicine 107. Genomic Approaches to the Host Response to Pathogens M. Frances Shannon Introduction Genetic Susceptibility to Pathogens Exploring the Host Response Through Expression Profiling Genetical Genomics and Systems Biology: the New Frontiers Application to Clinical Practice Acknowledgement References Recommended Resources 108. Genomic Medicine and AIDS Thomas Hirtzig,Yves Lévy and Jean-François Zagury Introduction Context of HIV and AIDS Predisposition: Susceptibility to HIV-1 Infection Diagnosis Prognosis Monitoring Pharmacogenomics Novel and Emerging Therapeutics Conclusion References 109. Viral Genomics and Antiviral Drug Roberto Patarca Introduction Viral Genomics and the Antiviral Drug Revolution Era Conclusion References
1313 1314 1314 1315 1317 1320 1320 1320 1320 1323 1324 1324 1324 1328 1328 1329 1331 1331 1333 1334 1334 1340 1340 1341 1344 1345
■
xxiii
110. Host Genomics and Bacterial Infections Melissa D. Johnson and Mihai Netea Introduction Genomics and the Study of Bacterial Infections Host Genomics and Gram-positive, Gram-negative and Mycobacterial Infections Future Directions Conclusions References
1347
111. Sepsis and the Genomic Revolution Christopher W.Woods, Robert J. Feezor and Stephen F. Kingsmore Introduction Genetic Polymorphisms Associated with Sepsis Molecular Signatures and Sepsis Therapeutics Conclusion References
1362
112. Diagnosis, Management and Disease Mechanisms of Hepatitis in the Era of Genomic Medicine N.A. Shackel, K. Patel and J. McHutchison Introduction Virology of Hepatitis Viruses Acquisition and Predisposition to Viral Hepatitis Screening and Diagnosis of Viral Hepatitis Pathogenesis of Viral Hepatitis Therapeutics and Pharmacogenomics Future Impact of Genomics Studies Conclusion References
1347 1347 1351 1357 1357 1357
1362 1363 1368 1371 1371 1372
1375 1375 1375 1377 1378 1380 1384 1386 1386 1386
Glossary
1391
Index
1407
This page intentionally left blank
Foreword In the health care field today, there are high expectations for a paradigm shift in care delivery over the coming years. According to this view, we have the opportunity to achieve fundamental change and improvement in the delivery of care, with better health outcomes across the board, led by advances in genomics and molecular medicine. The prospect is exciting: a continuum of health maintenance and medical care that is truly tailored to the individual, based on his or her individual biology. We could know our own genetic profiles from birth. Prevention could be more individually charted, based on individual genetic factors. Disease could be detected and treated earlier through molecular diagnostics. Health care dollars might be spent to much better effect. And our lives could be healthier, longer. The stepping stones to such a future are now widely known. Building on the work of the Human Genome Project and its offspring, medical care can be expected to acquire powerful new tools, especially in diagnosis, that could render care much more effective by making it more precise, more individually targeted, and more predictive. We should be able to prescribe drugs more safely because genetic or other characteristics would help clinicians identify which patients would respond well to a given therapy. As diseases come to be understood at a new level, we should be better able to achieve the right diagnosis and the right treatment for each person, without the trial-and-error process that has long characterized medical treatment. As biomarkers are identified, we should be able to intervene in disease at much earlier stages. And when we are able to know our personal genomic profiles, we should be able to better pinpoint our individual health susceptibilities. Our physicians could give more individualized prevention advice – and perhaps it will even come to pass that we will be more motivated to follow it. I am a believer in this view of the possible future – and I am confident it is not just wishful thinking. The pace of discovery in the genomic field today is unprecedented. Furthermore, enough successful applications of this paradigm already exist to give us reason to look toward a new era of effectiveness in medical care, with new information and new tools for both the clinician and the consumer. At the same time, we must be realistic in our assessment of this future, especially the extent of the efforts that will be needed to achieve it. However desirable the idea of this “paradigm shift” may be, the realization of such a shift rests on a foundation that is still very much under construction.
The shift will require development of new capacities that will enable us to differentiate among the needs of individual patients. In turn, these new capacities will depend on data to be derived, analyzed, and employed on a new scale. Such information demands will surely need to be supported by sophisticated electronic data networks that are yet to be created, informatics tools that are yet to be invented, and clinical decision-support systems that are yet to be devised or adopted. Finally, at the bedrock, the trust and understanding of the medical community and of society at large must be won even as the edifice is being created. This is the work of a generation. It is work that spans professions, economic sectors, and even nations. It is driven by science – but it makes new demands for the rapid translation of scientific discovery into clinical practice and improved health outcomes. One part of the work before us is continued discovery. Phenomenal achievements have been made in mapping the human genome, and a rush of findings is occurring today in understanding associations between genomic factors and health. Yet vast areas still remain to be explored. That work is underway on a global scale. Another part of our work might be called the engineering. This includes the development of interoperable health information technology, with all the implications of that goal: development of technical and clinical standards, adoption of health IT across the health sector, and security of personal health information. In the long term, it should also include the use of health IT to enable us to make faster progress in medical research – and then to feed back what we’ve learned into clinical practice, using IT decision-support tools that are physician-, nurse-, and consumer-friendly. Another element of our task is less defined but equally challenging: the collaboration and cooperation needed to bring this vision to reality. Personalized medicine means care that is information-based at a new level. Gathering that information and using it successfully will transcend many disciplines and make new demands on a health system that is often characterized as “fractured.” It may seem ironic that delivery of individualized care depends on standardization, partnerships, and networks. But these kinds of collaboration are among the most important element of the work that lies ahead. Finally, in today’s health care environment, the success of new products, services, and models of care will depend more than ever on the value that is delivered. At its core, personalized medicine is about care that can achieve new levels in predicting, preventing, and detecting disease. Medical effectiveness of this
xxv
xxvi
■
Foreword
kind should translate into cost effectiveness as well. Achieving and demonstrating high value expectations will be challenging, but this discipline will serve the success of personalized medicine over the long term. The paradigm shift of personalized medicine depends on an extensive foundation of scientific knowledge, professional leadership, health information technology, and a spirit of cooperation, collaboration, and dialogue. The many perspectives offered by the
contributors to this text help us appreciate the breadth of these issues, challenges, debates, and opportunities that lie ahead.
Michael O. Leavitt Secretary Department of Health and Human Services Washington, DC
Preface It seems just yesterday that we were first getting used to the notion of introducing the seemingly freshly uncovered concepts of genetics into the practice of medicine. And yet, with the completion of the Human Genome Project and the rapid development and application of new advances in our ability to understand and query the human genome and its gene set, it is time already to anticipate and outline the early stages of what must be called a transformation of medicine. We are beginning to see the first signs of a fundamental shift in how we behold human physiology and pathology, how we view the concept of what is “normal,” how we consider individuals and their prospects for lifelong health, and how we design healthcare systems that are equally adaptable to the demands of population-wide epidemics and the opportunities for personalized care that utilizes genome-based information to consider individual susceptibility to disease and therapeutic options. Genome-based data, information, knowledge, and eventually wisdom will make possible the kind of healthcare that has been dreamed of since the advent of disease-based medicine early in the 20th century. A system of healthcare that harnesses the might of the genome and its derivatives, along with imaging, clinical and environmental information, will empower physicians and other healthcare providers to do what they have always aspired to do – make medical care as individualized as possible. But this newfound information and knowledge will also allow each of us as consumers of healthcare to take more control of our futures and to develop a more strategic and a prospective approach to health. We stand at the dawn of a profound change in science and medicine’s predictive nature and in our understanding of the biological underpinnings of health and disease. Even in this early light, we can see the outlines of a coming ability to: ●
●
●
● ●
predict individual susceptibility to disease, based on genetic, genomic and other factors; provide more useful tools and individualized programs for disease prevention, based on knowledge of one’s susceptibility; detect the onset of disease earlier and before it is clinically evident, based on newly discovered biological markers that arise from changes at the molecular level; preempt disease progression, as a result of early detection; target medicines and their dose more precisely and safely to each patient, on the basis of a deep understanding of disease
mechanism and the role that genetic and genomic factors play in the individual response to drugs. This revolution in genomic and personalized medicine was anticipated nearly three decades ago by Nobel laureate Paul Berg, who stated so presciently: Just as our present knowledge and practice of medicine relies on a sophisticated knowledge of human anatomy, physiology, and biochemistry, so will dealing with disease in the future demand a detailed understanding of the molecular anatomy, physiology, and biochemistry of the human genome. . . . We shall need a more detailed knowledge of how human genes are organized and how they function and are regulated. We shall also have to have physicians who are as conversant with the molecular anatomy and physiology of chromosomes and genes as the cardiac surgeon is with the structure and workings of the heart.
That time has come. This book is intended to lay out the foundations of this new science, to outline the early opportunities for the practice of medicine to incorporate genome-based analysis into healthcare, and to anticipate the many conditions to which genomic and personalized medicine will apply in the years ahead. The chapters in these volumes are designed to be read either sequentially – introducing the scientific underpinnings of this revolution, exploring aspects of translational medicine and genomics that will be critical for bringing about this revolution, and presenting practical aspects of the first applications of genomic and personalized medicine in the context of specific medical conditions – or one-at-a-time for those interested in particular disorders or approaches. These volumes also describe a field in its infancy, with many challenges for society at large, in addition to those associated with healthcare systems strife with inefficiencies and heterogeneity in their ability to deliver the basics of healthcare. There are “grand challenges” for the visionary science and the clinical care highlighted in these pages. Such challenges include the potential for these innovations to exaggerate existing health disparities, information technology systems that have been described as a “tower of Babel,” an unprepared healthcare work force, and economic incentives that are inadequately aligned for the various stakeholders to fully embrace genomic and personalized medicine. Nonetheless, we are optimistic that the appropriate delivery models and economic incentives will be developed
xxvii
xxviii
■
Preface
in a trustworthy framework that will be embraced by societies around the globe. As an indicator of the importance of personalized medicine, international communities are working together on strategies to overcome these obstacles. In the United States, the Department of Health and Human Services has played a leadership role in this area and a plan was published in 2007 (Personalized Healthcare: Opportunities, Pathways, Resources; see http://www. hhs.gov/myhealthcare/). Other governments (in the United Kingdom, Iceland, Estonia, Luxembourg, and Singapore – to name just a few) have funded initiatives that will secure their place in developing genome-based knowledge and its translation into day-to-day patient care. A collective and global approach to what might be arguably one of the most complex scientific and clinical undertakings in the history of healthcare is undoubtedly what is required. Our international collective of contributors to this work reflects the early adopters and members of a global community of physicians, scientists, and policy makers who will make this happen.
Our intended audience is broad, ranging from medical students (and even the intrepid undergraduate eager to explore this new era of personalized and prospective medicine) to residents and fellows to practitioners in any of the healthcare professions – physicians in any of the medical specialties, surgeons, nurses, genetic (and genomic) counselors, and laboratory directors – and, finally, to members of the genomic and personalized medicine research communities who will, we trust, help write future editions of this text. In times of transformation, we are all students. We hope that this book will help usher in this new era of genomic and personalized medicine and will provide a useful and thorough introduction to the science and practice of this new approach to human health.
Huntington F. Willard, Ph.D. Geoffrey S. Ginsburg, M.D., Ph.D.
Terminology Throughout this book, the terms “genetics” and “genomics” are used repeatedly, both as nouns and in their adjectival forms. Although these terms seem similar, they in fact describe quite distinct (though frequently overlapping) approaches in biology and in medicine. Here, we provide operational definitions to distinguish the various terms and the subfields of medicine to which they contribute. The field of genetics is the scientific study of heredity and of the genes that provide the physical, biological, and conceptual bases for heredity and inheritance. To say that something – a trait, a disease, a code or information – is “genetic” refers to its basis in genes and in DNA. Heredity refers to the familial phenomenon whereby traits (including clinical traits) are transmitted from generation to generation, due to the transmission of genes from parent to child. Genomics is the scientific study of a genome, the complete DNA sequence, containing the entire genetic information of a gamete, an individual, a population or a species. The word “genome” was first used as an analogy with the earlier term “chromosome,” referring to the physical entities (visible under the microscope) that carry genes from one cell to its daughter cells or from one generation to the next. Over the past two decades, “genomics” has given birth to a series of other “-omics” that refer to the comprehensive study of the full complement of, for example, proteins (hence, proteomics), transcripts (transcriptomics), or metabolites (metabolomics). The essential feature of the “-omes” is that they refer to the complete collection of genes, proteins, transcripts, or metabolites, not just to the study of individual entities. Medical genetics is the application of genetics to medicine and is one of the 24 medical specialties recognized by The American Board of Medical Specialties, the preeminent medical organization overseeing physician certification in the United States. Genetic medicine is a term sometimes used to refer to the application of genetic principles to the practice of medicine and thus overlaps medical genetics. Both medical genetics and genetic
medicine approach clinical care largely through consideration of individual genes and their effects on patients and their families. Genomic medicine, by contrast, refers to the use of large-scale genomic information and to consideration of the full extent of an individual’s genome, proteome, transcriptome, or metabolome in the practice of medicine and medical decision-making. The principles and approaches of genomic medicine are relevant well beyond the traditional purview of individual medical specialties and include, as examples, gene expression profiling to characterize tumors or to define prognosis in cancer, genotyping variants in the set of genes involved in drug metabolism or action to determine an individual’s correct therapeutic dosage, scanning the entire genome for millions of variants that influence one’s susceptibility to disease, or analyzing multiple protein biomarkers to monitor therapy and to provide predictive information in presymptomatic individuals. Finally, personalized medicine refers to a rapidly advancing field of healthcare that is informed by each person’s unique clinical, genetic, genomic, and environmental information. The goals of personalized medicine are to take advantage of a molecular understanding of disease to optimize preventive healthcare strategies and drug therapies while people are still well or at the earliest stages of disease. Because these factors are different for every person, the nature of disease, its onset, its course, and how it might respond to drug or other interventions are as individual as the people who have them. For personalized medicine to be used by healthcare providers and their patients, these findings must be translated into precision diagnostic tests and targeted therapies. Since the overarching goal is to optimize medical care and outcomes for each individual, treatments, medication types and dosages, and/or prevention strategies may differ from person to person – resulting in unprecedented customization of patient care. The principles underlying genomic and personalized medicine and their applications to the practice of clinical medicine are presented throughout the chapters that comprise this volume.
Acknowledgements We wish to express our appreciation and gratitude to our many colleagues, especially in the Duke Institute for Genome Sciences & Policy, who have shared their knowledge and ideas about genomic and personalized medicine and who, by doing so, inspired this project. We particularly thank our first editor at Academic Press/Elsevier, Luna Han, who encouraged us to develop the concept of a text on genomic and personalized medicine. We are also grateful to Sally Cheney, Kirsten Funk, and Christine Minihane, our Senior Editors at Academic Press/ Elsevier; to Rogue Shindler, our Developmental Editor; and to Ganesan Murugesan, our Production Project Manager, for their patience, advice, and professionalism in all stages of the project.
We acknowledge our Advisory Board for their suggestions and support and especially thank the nearly 300 authors of the 112 chapters that comprise these volumes. Needless to say, without their efforts, this project could never have come to fruition. We also thank Secretary Mike Leavitt for providing a Foreword for this book, as well as for his enthusiastic support of the concept of genomic and personalized medicine. It gives us pleasure to give special thanks to Kathy Hay and to Lynne Skinner, whose tireless efforts kept us on track and saw this project through to completion. Lastly, we thank our families for their patience and understanding for the many hours we spent creating this, the inaugural edition of Genomic and Personalized Medicine.
xxix
This page intentionally left blank
Abbreviations 2D-DIGE 2D-PAGE
Positron Neutrino Two-dimensional difference gel electrophoresis Two-dimensional polyacrylamide gel electrophoresis 3-MT 3-Methoxytyramine 5-HT 5-Hydroxytriptamine or serotonin 5-HTT 5-HT transporter 5-HTTLPR 5-HTT gene-linked polymorphic region 6-MMP 6-Methylmercaptopurine 6-MP 6-Mercaptopurine 6-MTIMP 6-Methyl-thioinosinemonophosphate 6-TGN 6-Thioguanine nucleosides 6-TIMP 6-Thioinosinemonophosphate 18 18 F-FLT F-labeled 3-deoxy-3-fluorothymidine 18 18 F-FMAU F-1abeled-2-deoxy-2-fluoro--d-arabinofuranosyl thymine AAU Acute anterior uveitis ABCA1 Adenosine triphosphate-binding cassette protein A1 ABCB1 Adenosine triphosphate-binding cassette, sub-family B ABI Ankle brachial index ACC American College of Cardiology ACE Angiotensin converting enzyme aCGH Array comparative genomic hybridization ACR Acute cellular rejection AD Alzheimer’s disease AD Aortic dilation ADH Autosomal dominant hypercholesterolemia ADHD Attention deficit hyperactivity disorder ADIGE Autosomal dominant idiopathic generalized epilepsy
ADJME ADME ADNFLE ADP ADR AED Ago2 AHA AIDS AIMs AJCC ALL ALOX5AP AMACR AML Amol AMR AOO AOTF Apo(a) ApoE APOE AR AR ARACNe ARE ARH ARJP ARM
Autosomal dominant juvenile myoclonic epilepsy Absorption, distribution, metabolism, and elimination Autosomal dominant nocturnal frontal lobe epilepsy Adenosine 5-diphosphate Adverse drug reaction Antiepileptic drug Argonaute 2 mRNA endonuclease American Heart Association Acquired immune deficiency syndrome Ancestry informative markers American Joint Committee on Cancer Acute lymphoblastic leukemia Arachidonate 5-lipooxygenase-activating protein Alpha methylacyl CoA reductase Acute myeloid leukemia Attomole Antibody mediated rejection Age of onset Acousto-optic tunable filters Apolipoprotein(a) Apolipoprotein E Apolipoprotein E (gene name) Androgen receptor Autosomal recessive Algorithm for the Reconstruction of Accurate Cellular Networks Androgen response element Autosomal recessive hypercholesterolemia Autosomal recessive juvenile Parkinsonism Armadillo
xxxi
xxxii
■
ARVC AS ASCA ASCO ASCVD ASD ASO ASPE ATM ATP ATP ATR AZT B1-AR BAC BBB BCAA Bcl BCL-2 Bcl-Abl BDNF BER BEST BH BLAST BLAT BMC BMI BMP-7 BOLD BPDE BPH BRCA BRCA1 BRCA2 BRD2 BSML CA CAD CADASIL CAE CALGB CAP CARD CARGO CAV CBD CBP CBZ CCD
Abbreviations
Arrhythmogenic right ventricular cardiomyopathy Ankylosing spondylitis Anti-Saccharomyces cerevisiae antibodies American Society for Clinical Oncology Atherosclerotic cardiovascular disease Atrial septal defect Anti-sense oligonucleotides Allele-specific primer extension Ataxia telangiectasia mutated Adenosine triphosphate Adult Treatment Panel Ataxia-telangiectasia and RAD3-related Azidothymidine 1-adrenergic receptor Bacterial artificial chromosome Blood–brain barrier Branched-chain amino acids B-cell lymphoma protein B-cell lymphoma 2 Breakpoint cluster region – Abelson kinase fusion protein Brain-derived neurotrophic factor Base excision repair -Blocker Evaluation of Survival Trial Bcl-2 homology domain Basic local alignment and search tool BLAST-like alignment tool Bone marrow cell Body mass index Bone morphogenic protein-7 Blood oxygen level-dependent Benzo (a) pyrene diol epoxide Benign prostatic hypertrophy Breast cancer gene Breast cancer 1, early onset, protein Breast cancer type 2 susceptibility protein Bromodomain containing 2 (gene) Bioinformatic sequence markup language Capsid (viral protein) Coronary artery disease Cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy Childhood absence epilepsy Cancer and leukemia group B College of American Pathologists Caspase recruitment domain (gene) Cardiac Allograft Rejection Gene Expression Observational study Cardiac allograft vasculopathy Cortical basal degeneration CREB-binding protein Carbamazepine Charge coupled device
CCL CCND1 CCP CD CD CDAI CDC CDDP cDNA CDS CED-4 CETP CF CFTR CGAP CGH CGHa CGP CHD CHD CHF CHMP CHRNA4 CIBEX CIC CICR CKD CLI CLIA CML CMT CMV CNS CNV COPA COR COX COX-2 CPOE CRC CREB CRF CRP CS CSA CSF CSF1 CT CTC CTGF CTL CTLA
Chemokine (C-C motif) ligand Cyclin D1, a proto-oncogene Citrullinated cyclic peptide Cluster of differentiation Crohn’s disease Crohn’s disease activity index Centers for Disease Control Cisplatin (cis-diamminedichloroplatinum Complementary deoxyribonucleic acid Clinical decision support Cell death abnormality gene 4 Cholesteryl ester transfer protein Cystic fibrosis Cystic fibrosis transmembrane conductance regulator Cancer Genome Anatomy Project Comparative genomic hybridization Comparative genomic hybridization array Cancer Genome Project Congenital heart disease Coronary heart disease Congestive heart failure Committee for Medicinal Products for Human Use Cholinergic receptor, nicotinic, alpha 4 (gene) Center for Information Biology Gene Expression Citrate/isocitrate carrier Calcium-induced calcium release Chronic kidney disease Critical limb ischemia Clinical Laboratory Improvement Amendments Chronic myelogenous leukemia Charcot-Marie-Tooth disease Cytomegalovirus or human herpes virus 5 Central nervous system Copy number variant Cancer outlier profile analysis C-terminal of ROC domain Cyclooxygenase Cyclooxygenase-2 Computerized provider order entry Colorectal cancer Cyclic AMP response binding protein Circulating recombined form C-reactive protein Cell signaling factor Comparative sequence analysis Cerebrospinal fluid Colony stimulating factor 1 Computed tomography Circulating tumor cells Connective tissue growth factor Cytotoxic T lymphocytes Connective tissue late antigen
Abbreviations
CTSA CX32 CX37 CYP CYP2C19 CYP2D6
Clinical and Translational Science Awards Connexin-32 Connexin-37 Cytochrome P450 Cytochrome P450 2C19 Cytochrome P450 2D6
D2R DAI DBS dbSNP DCM DGGE DGI DHF DHLPC
Dopamine-2 receptor Disease Activity Score (for Ulcerative Colitis) Deep brain stimulation NCBI SNP database Dilated cardiomyopathy Denaturing gradient gel electrophoresis Diabetes Genetics/Initiative Diastolic heart failure Denaturing high performance liquid chromatography Diffuse Lewy body disease Diffuse Lewy body disease Diagonalized linear discriminant analysis Disc large homologue (gene) Drug-metabolizing enzymes Deoxyribonucleic acid Deoxyribonuclease DNA methyltransferases Dihydroxyphenylacetic acid Double stranded Double-stranded ribonucleic acid Direct-to-consumer Duodenal ulcer Dizygotic
DLB DLBD DLDA DLG DMEs DNA DNase DNMTs DOPAC ds dsRNA DTC DU DZ EAE EASS EBI EBV ECL ECM EDG-1 EGAPP EGF EGFR EGP EHR EIA EL ELISA ELMO1 EM EMB EMBL-EBI EMEA ENCODE EOAD
Experimental autoimmune encephalomyelitis East Asia specific sequence European Bioinformatics Institute Epstein–Barr virus Eenterochromaffin-like cells Extracellular matrix Endothelial differentiation gene-1 Evaluation of Genomic Applications in Practice and Prevention Epidermal growth factor Epidermal growth factor receptor Environmental Genome Project Electronic health record Enzyme immunoassay Endothelial lipase Enzyme-linked immunosorbent assay Engulfment and cell motility 1 Extracellular matrix protein Endomyocardial biopsy European Molecular Biology Laboratory– European Bioinformatics Institute European Medicines Evaluation Agency Encyclopedia of DNA Elements Early-onset Alzheimer’s disease
EPIYA eQTL ERK ERNA ESI ESI-MS/MS ESR ESRD EST FA FACS FANCA FANCD2 FAP FCCS FCHL fCJD FCS FCS FD FDA FDB FDG FFPE FGF FGF-2 FH FHTG FISH FKBP FL FLT-3 Fmol fMRI FMT FNA FOV FRET FSGS FTD FT-ICR-MS FUSION GABBR1 GABRG2 GAIN GAPDH GBM GBS GCKR GC-MS
■
xxxiii
Single letter codes for amino acids Expression quantitative trait loci Extracellular signal-regulated kinase Equilibrium radionucleotide angiography Electrospray ionization Electrospray ionization tandem mass spectrometry Erythrocyte sedimentation rate End-stage renal disease Expressed sequence tag Fanconi anemia Fluorescent-activated cell sorting Fanconi anemia proteins A Fanconi anemia protein D2 Familial adenomatous polyposis Two-color fluorescence cross-correlation spectroscopy Familial combined hyperlipidemia Familial Creutzfeldt–Jakob disease Familial hyperchylomicronemia syndrome Fluorescence correlation spectroscopy Familial dysbetalipoproteinemia Food and Drug Administration (USA) Familial defective apoB-100 18 F-labeled 2-deoxy-d-glucose Formalin-fixed, paraffin-embedded Fibroblast growth factor Fibroblast growth factor-2 Familial hypercholesterolemia Familial hypertriglyceridemia Fluorescence in situ hybridization FK506-binding protein Follicular lymphoma FMS-like tyrosine kinase-3 Femtomole Functional MRI Fluorescence molecular tomography Fine needle aspirate Field of view Fluorescence resonance energy transfer Focal segmental glomerulosclerosis Frontotemporal dementia Fourier transform ion cyclotron mass spectrometry Finland–United States Investigation of NIDDM genetics GABA B receptor 1 (gene) GABA A receptor, gamma 2 (gene) Genetic Association Information Network Glyceraldehyde-3-phosphate dehydrogenase Glomerular basement membrane Guillain–Barre Syndrome Glucokinase regulatory protein Gas chromatography–mass spectrometry
xxxiv
■
GCSF GDNF GEFS GEO GEP GFAP GFP GLEPP1 GLP GM-CSF GO GPI GR GSEA GSIS GSTT GU GWAS GZM H HAART HAV HBV HCC HCM HCV HDAC HDL HDM2 HDV HeLa HEPS HER2 HEV HF HFPEF HGF HGPRT HHS HHV-6 HITSP HIV HKMT HL HL7 HLA/HLA HLI HMT HNPCC HNSCC
Abbreviations
Granulocyte colony-stimulating factor Glial cell line-derived neurotrophic factor Generalized epilepsy with febrile seizures plus Gene Expression Omnibus Gene expression profiling Glial fibrillary acidic protein Green fluorescent protein Glomerular epithelial protein 1 Good laboratory practice Granulocye macrophage–colony stimulating factor Gene ontology Globus pallidus interna Glucocorticoid receptor gene Gene set enrichment analysis Glucose-stimulated insulin secretion Glutathione S-transferase Gastric ulcer Genome-wide association study Granzyme Heterozygosity Highly active anti-retroviral therapy Hepatitis A Hepatitis B Hepatocellular carcinoma Hypertrophic cardiomyopathy Hepatitis C Histone deacetylase High-density lipoproteins Human double minute 2 Hepatitis D cervical cancer cell line, also the first cancer cell line established Highly exposed persistently seronegative individuals Human epidermal growth factor receptor 2 Hepatitis E Heart failure Heart failure with preserved ejection fraction Hepatocyte growth factor Hypoxanthine phosphoribosyl transferase Health and Human Services (U.S.) Human herpes virus 6 Healthcare Information Technology Standards Panel Human immunodeficiency virus Histone lysine methyltransferases Hepatic lipase Health Level 7 Human leukocyte antigen/(gene) Hind limb ischemia Histone methyltransferase Hereditary nonpolyposis colorectal cancer Head and neck squamous cell carcinoma
HP-1 HPV HPV-16 HR HRPC HS HSC HSP HSV-1 HSV1-TK HSV-2 hTERT HTLV-I HTLV-III HTS HVJ
Heterochromatin-associated protein-1 Human papilloma virus Human papilloma virus-16 Homologous recombination Hormone refractory prostate cancer Hyper-spectral Hepatic stellate cell Heat shock protein Human herpes simplex virus type 1 Herpes simplex virus type 1-thymidine kinase Human herpes simplex virus type 2 Human telomerase reverse transcriptase Human T-cell leukemia/lymphoma virus type I Human T-cells lymphotropic virus type 3 High-throughput screening Hemagglutinating virus of Japan
IBD IBDQ IC IC ICAM ICAT ICD ICD ICDc
Inflammatory bowel disease Inflammatory bowel disease questionnaire Intermittent claudication Interstitial cystitis Intracellular adhesion molecules Isotope-coded affinity tags Implantable cardioverter defibrillator International Classification of Diseases Cytosolic, NADP-dependent isocitrate dehydrogenase Immature dendritic cells Idiopathic dilated cardiomyopathy Insulin-dependent diabetes mellitus Intermediate-density lipoproteins Indolemaine 2,3-dioxygenase Interferon Interferon-beta Interferon-gamma Immunoglobulin Immunoglobulin-A Insulin-like growth factor-I Immunohistochemistry Integrating the Healthcare Enterprise Inhibitor of B-like (gene) Interleukin Interleukin (4/13) Interleukin-1 beta (gene) Interleukin-1 receptor antagonist Interleukin-6 Interleukin-10 Invasive Monitoring Attenuation through Gene Expression study Immature myeloid cells Integrase (viral protein) Institute of Medicine (U.S.) Insertion sequence Institute for Systems Biology Intreferon stimulated gene
iDC IDCM IDDM IDL IDO IFN IFN- IFN- Ig IgA IGF-I IHC IHE IKBL IL IL (4/13) IL-1 IL-1ra IL-6 IL-10 IMAGE iMC IN IOM IS ISB ISG
Abbreviations
ISHLT IT IVDMIA IVVM JCA JHDM1
International Society for Heart and Lung Transplantation Information technology In vitro diagnostic multivariate index assay In vivo videomicroscopy
JME
Juvenile chronic arthritis JmjC domain-containing histone demethylase-1 Juvenile myoclonic epilepsy
Kb KEGG
Kilobase Kyoto Encyclopedia of Genes and Genomes
LARGO
Lung Allograft Rejection Gene expression Observational study Lymphadenopathy-associated virus Lewy body Lecithin cholesterol acyltransferase Laser capture microdissection Liquid chromatography–mass spectrometry Liquid chromatography/mass spectrometry/ mass spectrometry Liquid crystal tunable filters Linkage disequilibrium Low-density lipoproteins Levodopa Late-onset Alzheimer’s disease Logarithm of the odds Loss of heterozygosity Logical Observation Identifiers, Names, and Codes Line of response Lipoprotein(a) Lysophosphatidic acid Lipoprotein lipase Lipopolysaccharide Leucine-rich repeat Leucine rich repeat kinase 2 Lysine-specific demethylase-1 Lymphotoxin alpha Leukotriene A4 hydrolase Leukotriene B4 Long-term non-progression Left ventricle Left ventricular assist device Left ventricular end-diastolic pressure Left ventricular ejection fraction Left ventricular hypertrophy Left ventricular outflow tract (obstruction)
LAV LB LCAT LCM LC-MS LC-MS/MS LCTF LD LDL l-DOPA LOAD LOD LOH LOINC LOR Lp(a) LPA LPL LPS LRR LRRK2 LSD1 LTA LTA4H LTB4 LTNP LV LVAD LVEDP LVEF LVH LVOT(O) MA mAb MADH6 MAGE-ML MAGUK
Matrix (viral protein) Monoclonal antibodies Mothers against decapentaplegic homolog 6 MicroArray and Gene Expression Markup Language Membrane-associated guanylate kinase
MALDI MALDI-TOF MAM MAO MAOA MAP MAPK MAPK MAQC Mb MBD MCD MCD MCP MCP-1 MCS M-CSF MDM2 MDP MDR MedDRA MEF2A Megsin MEN MeSH MeV MGED MGMT MHC MI MIAME MIM miR-192 miRNA MITF MLEM MLPA MMP MMP-3 MMR MPSS MPTP mQTL MR MRI mRNA mRSS
■
xxxv
Matrix-assisted laser desorption and ionization Matrix-assisted laser desorption ionization time-of-flight Metastasis-averse microenvironment Monoamine oxidase Monoamine oxidate A Mitogen-activated protein Microtubule-associated protein kinase Mitogen-activated protein kinase Microarray quality control Megabase Methylated DNA binding domain protein Malonyl CoA decarboxylase Minimal change disease Monocyte/macrophage chemoattractant protein Monocyte chemoattractant protein 1 Multi-species conserved sequence Macrophage colony-stimulating factor Mouse double minute 2 Muramyl dipeptide Multidrug resistance (gene) Medical Dictionary for Regulatory Activities Terminology Myocyte-enhancing factor 2A Mesangial cell-specific gene with homology to serpin Multiple endocrine neoplasm Medical Subject Headings Multiexperiment viewer Microarray Gene Expression Data society O6-Methylguanine-DNA methyltransferase Major histocompatability complex Myocardial infarction Minimal Information About a Microarray Experiment Metastasis-inclined microenvironment MicroRNA-192 Micro-ribonucleic acid Microphthalmia transcription factor Maximum likelihood expectation maximization Multiplex ligation-dependent probe amplification Matrix metalloproteinase Matrix metalloproteinase-3 Mismatch repair Massively parallel signature sequencing 1-Methyl-4-phenyl-1, 2, 3, 6-tetrahydropyridine Metabolite quantitative trait loci Magnetic resonance Magnetic resonance imaging Messenger ribonucleic acid Modified Rodnan Skin Score
xxxvi
■
MS MS MS MS/MS MSA MTDNA MTHFR mTOR MTP MTX MudPIT MVP MZ NAAT NAD(P)H
Abbreviations
Mass spectrometry Multiple sclerosis Multispectral Tandem mass spectrometry Multiple system atrophy Mitochondrial DNA Methylenetetrahydrofolate reductase (gene) Mammalian target of rapamycin Microsomal triglyceride transfer protein Methotrexate Multidimensional protein identification technology Mitral valve prolapse Monozygotic
NRTI NSAID nt NUD NYHA
Nucleic acid amplification testing Nicotinamide adenine dinucleotide phosphate reduced form Non-alcoholic fatty liver disease Non-alcoholic steatohepatitis N-Acetyltransferase 2 Normal appearing white matter Nucleocapsid (viral protein) Neural cell adhesion molecule National Center for Biotechnology Information New chemical entity National Cholesterol Education Program Refers to a set of 60 cancer cell lines curated at the NCI (National Cancer Institute) Non-coding RNA NEgative regulatory Factor (viral protein) Nucleotide excision repair Neurofibromatosis1/2 National Human Genome Research Institute Non-homologous recombination National Institute of Environmental Health Sciences (NIH) National Institutes of Health Natural killer Nuclear magnetic resonance Non-nucleosidic reverse transcriptase inhibitor Nucleotide oligomerization domain (gene) Negative predictive value Nicotinamide adenine dinucleotide phosphate oxidoreductase Nucleosidic reverse transcriptase inhibitor Non-steroidal anti-inflammatory drug Nucleotides Non-ulcer dyspepsia New York Heart Association
OBO OCTN OHCA OMIM OMP
Open Biomedical Ontologies Organic cation transporter (gene) Out of hospital cardiac arrest Online Mendelian Inheritance in Man Outer membrane protein
NAFLD NASH NAT2 NAWM NC N-CAM NCBI NCE NCEP NCI60 ncRNA NEF NER NF1/2 NHGRI NHR NIEHS NIH NK NMR NNRTI NOD NPV NQO1
OmpC opn OPN OR ORESTES ORF OSEM
Outer membrane porin C of E. coli Osteopontin (gene) Osteopontin (protein) odds ratio Open reading frame expressed sequence tags Open reading frame Ordered subset expectation maximization
p53BP1 PAD PAGE PAI PAM pANCA
p53 Binding protein 1 Peripheral arterial disease Polyacrylamide gel electrophoresis Pathogenicity island Prediction analysis of microarrays Perinuclear antineutrophil cytoplasmatic antibodies Peripheral arterial occlusive disease Papanicolaou Human parvovirus B-19 Primary biliary cirrhosis Peripheral blood lymphocytes Peripheral blood mononuclear cell Pyruvate carboxylase Principal components analysis Proliferating cell nuclear antigen Polymerase chain reaction Proprotein convertase subtilisin/kexin type 9 Procarbazine/CCNU/vincristine Parkinson disease Pharmacodynamic Patent ductus arteriosus Personal digital assistant Parkinson disease with dementia Platelet-derived growth factor Platelet-derived growth factor receptor Pyruvate dehydrogenase Pyrrolidine dithiocarbamate prodynorphin (gene) Positron emission tomography Pulse field gel electrophoresis Prostaglandin E2 Pharmacogenetics and pharmacogenomics Phenytoin Protease inhibitor Phosphotidylinositol triphosphate kinase Prostatic intraepithelial neoplasia Pharmacokinetics Phospholipid transfer protein Polony Multiplex Analysis of Gene Expression Peripheral blood mononuclear cells Pro-myelocytic leukemia retinoic acid receptor oncoprotein Photomultiplier tube Paraoxonase Percutaneous peripheral intervention Positive predictive value
PAOD1 Pap Parvo B19 PBC PBL PBMC PC PCA PCNA PCR PCSK9 PCV PD PD PDA PDA PDD PDGF PDGFR PDH PDTC PDYN PET PFGE PGE2 PGx PHT PI PI3K PIN PK PLTP PMAGE PMBC PML-RAR PMT PON1 PPI PPV
Abbreviations
PR PrEC PRIDE PRNP PS PSA PsA PSC PSP PTEN PTM PUD PZ
Protease (viral) Prostate epithelial cells Proteomics Identifications Database Prion protein (gene) Pulmonary stenosis Prostate-specific antigen Psoriatic arthritis Primary sclerosing cholangitis Progressive supranuclear palsy Phosphatase and tensin homolog Pre-trans-splicing molecule Peptic ulcer disease Plasticity zone
QPCR qRT qRT-PCR
Quantitative PCR Quantitative real-time PCR Quantitative reverse-transcriptase polymerase chain reaction Quantitative trait locus Quadrupole time-of-flight-mass spectrometry Quantitative real-time polymerase chain reaction Quality value
QTL Q-TOF-MS QT-PCR QV R&D RA RAAS RAPD-PCR RB RB RCM REPAIR-AMI
RET REV RFLP RISC RNA RNAi ROC ROCO ROS RP RPA RPLC RRMS RSV RT RT-PCR RT-PCR
Research and Development Rheumatoid arthritis Renin-angiotensin-aldosterone system Random amplified polymorphic deoxyribonucleic acid–polymerase chain reaction Retinoblastoma Retinoblastoma (gene) Restrictive cardiomyopathy Reinfusion of enriched progenitor cells and infarct remodeling in acute myocardial infarction Ret proto-oncogene REgulator of Virion protein expression (viral protein) Restriction fragment length polymorphism RNA interference silencing complex Ribonucleic acid Ribonucleic acid interference RAS in complex proteins ROC (RAS of complex proteins)/COR (C-terminal of ROC) Reactive oxygen species Rapid progression Ribonuclease protection assay Reversed-phase liquid chromatography Relapsing-remitting MS Respiratory syncytial virus Reverse transcriptase (viral protein) Real-time polymerase chain reaction Reverse transcription – polymerase chain reaction
■
xxxvii
RV RVHF
Right ventricle Right ventricular heart failure
SACGHS
Secretary’s Advisory Committee on Genetics, Health and Society (U.S.) Serial Analysis of Gene Expression Scoring algorithm for spectral analysis S-Adenosyl-l-methionine (AdoMet) Significance analysis of microarrays Systolic anterior motion of mitral valve Shrimp alkaline phosphatase/exonuclease I Severe acute respiratory syndrome Single-base extension Sudden cardiac death Severe combined immunodeficiency Voltage-gated sodium channel, type 1, alpha subunit (gene) Selective capture of transcribed sequences Strong cation exchange Standard deviation Standards development organization Shared epitope of HLA (human leukocyte antigen) genes Spongiform encephalopathies System for Evidence-Based Advice through Simultaneous Transaction with an Intelligent Agent across a Network Sinusoidal endothelial cell Surface-enhanced laser desorption and ionization Surface-enhanced laser desorption ionization time-of-flight Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry Serological proteome analysis Systolic heart failure Short hairpin ribonucleic acid Smad-interacting protein 1 Small interfering ribonucleic acid Simian immunodeficiency virus Stevens–Johnson syndrome/toxic epidermal necrolysis Solute carrier family 12 member 3 Systemic lupus erythematosus Sentinel lymph node Stanford Microarray Database Substantia nigra Systematized Nomenclature of Medicine, Clinical Terms Single nucleotide polymorphism Substantia nigra pars compacta Substantia nigra pars reticulata SRY-box containing gene 10 Spondyloarthropathy
SAGE SALSA SAM SAM SAM SAP/Exo I SARS SBE SCD SCID SCN1A SCOTS SCX SD SDO SE SE SEBASTIAN
SEC SELDI SELDI-TOF SELDI-TOF-MS SERPA SHF shRNA SIP1 siRNA SIV SJS/TEN SLC12A3 SLE SLN SMD SN SNOMED CT SNP SNpc SNR SOX10 SpA
xxxviii
■
SPARC SPECT SPMS SPR SPSM-FCS
SR-BI ss SSc SSCP SSRP SSTR2 STAT STEMI STK11 gene STN SU SVAS SVR SWOG TAM TAT TBB TCF7L2 TCR TDM TF TFAP2B TGF TGF- TH Th1 Th2 TIGR TIMP TIMP-1 TK TLE TLR TLS TM TNF
Abbreviations
Secreted protein, acidic, and rich in cysteine, osteonectin Single photon emission computed tomography Secondary progressive MS Surface plasmon resonance Solution (single)-Phase Single-Molecule Fluorescence auto- and two color crossCorrelation Spectroscopy Scavenger receptor, class BI Single-stranded Systemic sclerosis Single strand conformation polymorphism Simple sequence repeat polymorphism Somatostatin receptor type 2 Signal transducers and activators of transcription ST segment Elevation Myocardial Infarction Serine/threonine kinase 11 Subthalmic nucleus Surface (viral protein) Supravalvar aortic stenosis Sustained viral response Southwest Oncology Group Tumor associated macrophages TransActivator of Transcription (viral protein) Transcriptomic-based biomarkers Transcription factor 7-like 2 T-cell receptor Therapeutic drug monitoring Transcription factor Transcription factor activating protein 2 Transforming growth factor Transforming growth factor beta Tyrosine hydroxylase T-helper cell, type 1 T-helper cell, type 2 The Institute for Genomic Research Tissue inhibitor of metalloproteinase Tissue inhibitor of metalloproteinase-1 Tyrosine kinase Temporal lobe epilepsy Toll-like receptor Trans-lesion DNA synthesis Trans-Membrane (viral protein) Tumor necrosis factor
TNF- TNF- TNM TOF TOF TPMT/TPMT Treg TSA TSG TSP
Tumor necrosis factor-alpha Tumor necrosis factor-beta Tumor-node-metastasis stage Tetralogy of Fallot Time-of-flight Thiopurine methyl transferase/(gene) Regulatory T-cells Trichostatin A Tumor suppressor gene Thrombospondin
UADT UC UCSC UMLS UNAIDS US US USF1 USpA UTR
Upper aerodigestive tract Ulcerative colitis University of California at Santa Cruz Unified Medical Language System United Nation mission on AIDS Ultrasound United States of America Upstream transcription factor 1 Undifferentiated spondyloarthropathy Untranslated region (three prime or five prime)
V-ATPase V-CAM VEGF VF VIF VLDL VPR VPU VS VT VZV
Vacuolar adenosine triphosphatase Vascular cell adhesion molecule Vascular endothelial growth factor Ventricular fibrillation Virion infectivity factor Very low-density lipoproteins Viral protein R Viral protein U Ventricular septal defect Ventricular tachycardia Varicella zoster virus
WGA WGAS WGS WHO AIDS
Whole-genome amplification Whole Genome Association Studies Whole genome shotgun World Health Organization mission on AIDS World Health Organization West Nile virus Western-specific sequence Welcome Trust Case Control Consortium
WHO WNV WSS WTCCC XMRV
Xenotropic murine leukemia virus-related virus
ZDV ZFP
Zidovudine Zinc finger protein
Advisory Board Paul R. Billings President and Chief Executive Officer, CELLPOINT DIAGNOSTICS, 265 N. Whisman Road, Mountain View, CA 94043
Raju Kucherlapati Center for Genetics and Genomics, Harvard Medical School, 77 Avenue Louis Pasteur, Ste 250, Boston, MA 02115
Robert Cook-Deegan Center for Genome Ethics, Law & Policy, Duke Institute for Genome Sciences and Policy, Box 90141, Durham, NC 27708
Elizabeth G. Nabel National Heart, Lung and Blood Institute, 31 Center Drive MSC 2486, Building 31, Room 5A52, Bethesda, MD 20892
Kay E. Davies Department of Human Anatomy and Genetics, Oxford University, South Parks Road, Oxford, QX1 3QX, UK
Robert L. Strausberg Human Genomic Medicine, J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850
Brian Druker Department of Medicine, Oregon Health & Sciences University, 3181 S.W. Sam Jackson Park Road, Portland, Oregon 97239
Robert I. Tepper Third Rock Ventures, LLC, 29 Newbury St., Boston, MA 02116
Victor Dzau Office of the Chancellor, Duke University Health System, Box 3701, Durham, NC 27710
Janet A. Warrington External RNA Controls Consortium, Affymetrix, 3380 Central Expressway, Santa Clara, CA 95051
Eric Green National Institutes of Health, National Human Genome Research Institute, 50 South Drive, Room 5222, Bethesda, MD 20892-8002
Ralph Weissleder Molecular Imaging Research Center, Massachusetts General Hospital, 149 13th Street, Room 5406, Charlestown MA 02129
Muin J. Khoury Office of Genomics & Disease Prevention, Centers for Disease Control and Prevention, 1600 Clifton Road, Atlanta, GA 30333
Janet Woodcock Food and Drug Administration, 5600 Fishers Lane, Parklawn Buildling, Rm 14-71, Rockville, MD 20857
Mary-Claire King University of Washington, 1705 Northeast Pacific Street, Box 357720, Seattle, WA 98195-7720
xxxix
This page intentionally left blank
Contributors Michael J. Ackerman Mayo Clinic, Windland Smith Rice Sudden Death Genomics Laboratory, Rochester, MN 55905, USA. Matthew L. Anderson Departments of Obstetrics and Gynecology and Pathology, Baylor College of Medicine, Houston, Texas, USA. Felicita Andreotti Institute of Cardiology, Catholic University Medical School, Rome-Italy. Brian H. Annex Division of Cardiovascular Medicine and Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA. Anthony Antonellis Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. Dmitri Artemov Department of Radiology – MR Research, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA. James R. Bain Departments of Radiology and Internal Medicine, Advanced Imaging Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75235, USA. Peter J. Barnes National Heart and Lung Institute, Section of Airway Disease, London, SW3 6LY, UK. J.S. Barnholtz-Sloan Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine; Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA. Richard C. Becker Department of Medicine – Cardiovascular Thrombosis Center, Duke University Medical Center, Durham, NC 27705, USA.
Mark Boguski Harvard Medical School, Center for Biomedical Informatics, 10 Shattuck St., Boston, MA 02115 Stefano Bonassi Unit of Molecular Epidemiology, National Cancer Research Institute, Genova, Italy. Brigitta Bondy Section Psychiatric Genetics and Neurochemistry, Psychiatric Hospital of University Munich, Munich, Germany. J. Martijn Bos Mayo Clinic, Windland Smith Rice Sudden Death Genomics Laboratory, Rochester, MN 55905, USA. Roger E. Breitbart Department of Cardiology, Children’s Hospital Boston, Boston, MA 02115, USA. Jerome Brody Department of Medicine, Boston University School of Medicine, Boston, MA 02118, USA. Matthew P. Brown USA.
Omics Consulting LLC, Clayton, CA,
Lars Bullinger Department of Internal Medicine III, University of Ulm, Ulm, Germany. Shawn C. Burgess Sarah W. Stedman Nutrition and Metabolism Center, Department of Pharmacology and Cancer Biology, Duke University Medical Center, Durham, NC 27704, USA. Atul J. Butte Department of Medicine and Department of Pediatrics, Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA 94305, USA. David J. Carey Weis Center for Research/Geisinger Clinic, Danville, PA 17822-2607, USA. George Carlson McLaughlin Research Institute, Great Falls, MT, USA
Ivor J. Benjamin Division of Cardiology, Center for Cardiovascular Translational Biomedicine, University of Utah Health Sciences Center, Salt Lake City, UT, USA.
Juan C. Celedón Channing Laboratory, Brigham and Women’s Hospital, Boston, MA; Division of Pulmonary and Critical Care, Brigham & Women’s Hospital, Boston, MA; Harvard Medical School, Boston, MA; Center for Genomic Medicine, Brigham and Women’s Hospital, Boston, MA.
Paul R. Billings President and Chief Executive Officer, CELLPOINT DIAGNOSTICS, 265 N. Whisman Road, Mountain View, CA 94043
Subhashini Chandrasekharan Center for Genome Ethics, Law and Policy, Duke Institute for Genome Sciences and Policy, Durham, NC 27708, USA.
Simon C. Body Department of Anesthesiology, Perioperative & Pain Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
Wing C. (John) Chang Department of Pathology, Center for Lymphoma and Leukemia Research, University of Nebraska Medical Center, Omaha, NE, USA. xli
xlii
■
Contributors
Yan Chen Department of Chemistry, Shands Cancer Center and UF Genetics Institute, University of Florida, Gainesville, FL 32611, USA. Antonio Chiocca Department of Neurological Surgery, Dardinger Laboratory for Neuro-oncology and Neurosciences, The Ohio State University Medical Center and Comprehensive Cancer Center, Columbus, OH 43210, USA. Theresa Puifun Chow Agency for Science, Technology and Research, Singapore Tissue Network, Singapore. Wendy K. Chung Department of Pediatrics, Division of Molecular Genetics, Columbia University, New York, NY 10032, USA. Robert Cook-Deegan Center for Genome Ethics, Law and Policy, Duke Institute for Genome Sciences and Policy, Durham, NC 27708, USA. Dolores Corella Nutrition and Genomics Laboratory, Jean Mayer–U.S. Department of Agriculture, Human Nutrition Research Center on Aging at Tufts University, Boston, MA, USA; Genetic and Molecular Epidemiology Unit, School of Medicine, University of Valencia, Valencia, Spain. Susan Cottrell Amgen, Inc., Seattle, WA 98119, USA. Nancy J. Cox Departments of Medicine and Human Genetics, University of Chicago, Chicago, IL 60637, USA. Nigel P.S. Crawford CCR/NCI/NIH, Laboratory of Cancer Biology and Genetics, National Cancer Institute, Bethesda, MD 20892-4264, USA. A. Jamie Cuticchia Bioinformatics Group, Duke Comprehensive Cancer Center, Durham, NC 27708, USA. Dieter Deforce Laboratory of Pharmaceutical Biotechnology, Faculty of Pharmaceutical Sciences, Ghent University, Ghent, Belgium. Mario C. Deng
Columbia University, New York, NY, USA.
Gayathri Devi Department of Surgery, Duke University Medical Center, Durham, NC 27710, USA.
Dirk Elewaut Department of University Hospital, Ghent, Belgium.
Rheumatology, Ghent
Charles J. Epstein Department of Pediatrics and Institute for Human Genetics, University of California, San Francisco, CA 94143, USA. Ayman H. Fanous Washington VA Medical Center, Washington, DC, 20422, USA. Phillip G. Febbo Institute for Genome Sciences and Policy, Department of Medicine – Oncology, Duke University Medical Center, Durham, NC 27710, USA. Robert J. Feezor Division of Vascular Surgery and Endovascular Therapy, University of Florida College of Medicine, Gainesville, FL. Yonmei Feng Department of Pathology and the Arizona Cancer Center, University of Arizona, Tucson, AZ, USA. Zeno Földes-Papp ISS, National Center of Fluorescence, Champaign, Illinois 61822, USA. Hidehiko Fujinaka Institute for Clinical Research, Niigata National Hospital, Niigata 951-8585, Japan. David J. Galas Institute for Systems Biology, Seattle, WA, USA; Battelle Memorial Institute, Columbus, OH, USA. Louis, P. Garrison Department of Pharmacy, University of Washington, Seattle, WA 98195, USA. Glenn S. Gerhard Weis Center for Research/Geisinger Clinic, Danville, PA 17822-2607, USA. Geoffrey S. Ginsburg Center for Genomic Medicine, Duke Institute for Genome Sciences and Policy, Durham, NC 27710, USA. Bjorn T. Gjertsen Institute of Medicine, Hematology, Haukeland University Hospital, University of Bergen, Bergen, Norway. Michael Glass Newborn Screening Program, Washington State Department of Health, Shoreline,WA 98155, USA.
Juergen Distler Epigenomics, AG, Berlin, Germany.
David B. Goldstein Center for Population Genomics and Pharmacogenetics, Duke Institute for Genome Sciences and Policy, Durham, NC 27710, USA.
Harmut Dohner Department of Internal Medicine III, University of Ulm, Germany.
Erynn S. Gordon Division of Pediatric Genetics, University of Maryland School of Medicine, Baltimore, MD 21201, USA.
Ayotunde O. Dokun Metabolism and Nutrition.
Endocrinology
Tucker Gosnell Massachussetts General Hospital Cancer Center, Boston, MA, USA.
Mark R. Edbrooke Pharmacogenetics, GlaxoSmithKline, Durham, NC 27709, USA.
Peter Grass Biomarker Development, Novartis Pharma AG, Basel, Switzerland.
Lucas B. Edelman Department of Bioengineering and Institute for Genomic Biology, University of Illinois, Urbana-Champaign.
Eric D. Green Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesa, MD 20892, USA.
Theo deVos
Epigenomics, Inc, Seattle, WA, USA.
Division
of
Contributors
■
xliii
Iris Grossman Pharmacogenetics, Research and Development, GlaxoSmithKline, Research Triangle Park, NC 27709, USA.
Chunhwa Ihm Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
Marta Gwinn National Office of Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, GA 30341, USA.
Olga Ilkayeva Sarah W. Stedman Nutrition and Metabolism Center, Department of Pharmacology and Cancer Biology, Duke University Medical Center, Durham, NC 27704, USA.
Carolina Haefliger
Rafael Irizarry Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health.
Epigenomics, AG, Berlin, Germany.
Susanne B. Haga Duke Institute for Genome Sciences and Policy, Durham, NC 27710, USA. Per Hall Department of Medical Epidemiology Biostatistics, Karolinska Institute, Stockholm, Sweden.
and
Joshua M. Hare Division of Cardiology and Interdisciplinary Stem Cell Institute, University of Miami, Miller School of Medicine, Miami FL 33136, USA. Ahmad Hariri Developmental Imaging Genetics Program, University of Pittsburgh, Pittsburgh, PA 15213, USA. James R. Heath Department of Chemistry, California Institute of Technology, Los Angels, CA, USA.
Shushant Jain Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA; Reta Lila Weston Institute of Neurological Studies, University College London W1T 4JF, UK; Department of Molecular Neuroscience, Institute of Neurology, University College London, London, UK. Melissa D. Johnson Department of Medicine – Infectious Diseases, Duke University Medical Center, Durham, NC 27710, USA. Philip W. Kantoff Dana-Farber Cancer Institute, Department of Medical Oncology, Boston, MA, USA.
Robert A. Hegele Robarts Research Institute, Blackburn Genetics Laboratory, London, ONT, Canada.
Sekar Kathiresan Massachusetts General Hospital, Cardiovascular Disease Prevention Center, Boston, MA, 02114, USA.
Bettina Heidecker Division of Cardiology and Interdisciplinary Stem Cell Institute, University of Miami, Miller School of Medicine, Miami, FL, USA.
Hasmeena Kathuria Department of Medicine, Boston University School of Medicine, Boston, MA 02118, USA.
Shona Hislop Department of Pathology, Brigham and Women’s Hospital, Boston, MA 02115, USA. Eric P. Hoffman Research Center for Genomic Medicine, Children’s National Medical Center, Washington, DC 20010, USA. John Holton Centre for Infectious Diseases and International Health, Windeyer Institute of Medical Sciences, Royal Free and University College London Medical School, London, W1T 4JF, UK. Leroy Hood The Institute for Systems Biology, Seattle, WA 98103-8904, USA. Andrew T. Huang Koo Foundation Sun Yat-Sen Cancer Center, Taiwan. Erich S. Huang Duke University Medical Center, Durham, NC, USA. Carlos A. Hubbard Southern Medical Group, Tallahassee, FL 32308, USA. Kent Hunter CCR/NCI/NIH, Laboratory of Cancer Biology and Genetics, National Cancer Institute, Bethesda, MD 20892-4264, USA. Courtney Hyland Department of Pathology, Brigham and Women’s Hospital, Boston, MA 02115, USA.
Kensaku Kawamoto Division of Clinical Informatics, Duke University Medical Center, Durham, NC 27710, USA. Filip De Keyser Department of Rheumatology, Ghent University Hospital, Ghent, Belgium. Asif Khalid Division of Gastroenterology, Hepatology and Nutrition, University of Pittsburgh, Pittsburgh, PA 15213, USA. Muin Khoury National Office of Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, GA 30341, USA. Stephen F. Kingsmore Resources.
National Center for Genome
Michelle M. Kittleson Division of Cardiology, UCLA School of Medicine, Los Angeles, CA, USA. Beena T. Koshy Pharmacogenetics, GlaxoSmithKline, Durham, NC 27709, USA. Ranga Krishnan Department of Psychiatry and Behavioral Science, Duke University Medical Center, Durham, NC 27710, USA. Robert S. Krouse Southern Arizona Veterans Affairs Health Care System, Tucson, AZ, USA. Vikas Kundra Diagnostic Radiology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
xliv
■
Contributors
Myla Lai-Goldman Laboratory Corporation of America, Holdings, Burlington, NC, USA.
Herman Mielants Department of Rheumatology, Ghent University Hospital, Ghent, Belgium.
Vipul Lakhani Division of Endocrinology, Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA.
Michael J. Muehlbauer Sarah W. Stedman Nutrition and Metabolism Center, Department of Pharmacology and Cancer Biology, Duke University Medical Center, Durham, NC 27704, USA.
Robert Langer Weis Center for Research/Geisinger Clinic, Danville, PA 17822-2607, USA. Sean E. Lawler Department of Neurological Surgery, The Ohio State University Medical Center, Wiseman Hall Columbus, OH 43210, USA. Inyoul Lee
Institute for Systems Biology, Seattle, WA, USA.
Ulf Müller-Ladner Department of Internal Medicine and Rheumatology, Justus-Liebig University Giessen, Bad Nauheim, Germany. Mark A. Nelson Department of Pathology, University of Airzona, Tucson, AZ 85724, USA.
Charles Lee Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
Monica Neri Unit of Molecular Epidemiology, National Cancer Research Institute, Genoa, Italy.
Arthur S. Lee Department of Pathology, Brigham and Women’s Hospital, Boston, MA 02115, USA.
Mihai Netea Department of Medicine, University Medical Center St. Radboud, Nijmegen, The Netherlands.
Stanely Leong Department of Surgery, UCSF Medical Center at Mount Zion, University of California, San Francisco, CA, USA.
L. Kristin Newby Department of Medicine – Cardiology, Duke University Medical Center, Durham, NC 27710, USA.
Ralf Lesche Epigenomics, AG, Berlin, Germany. Samuel Levy J. Craig Venter Institute, Rockville, MD 20850, USA. Edison T. Liu Agency for Science, Technology and Research, Singapore Tissue Network, Genome Institute of Singapore, Singapore. David F. Lobach Division of Endocrinology and Metabolism, Duke University Medical Center, Durham, NC 27710, USA. James B. Lorens Department of Biomedicine, University of Bergen, Bergen, Norway. Aldons J. Lusis Department of Medicine, Division of Cardiology, University of California, Los Angeles, CA 900951679, USA. H. Kim Lyerly Duke Comprehensive Cancer Center, Duke University Medical Center, Durham, NC 27705, USA. Nageswara R. Madamanchi Department of Medicine, Carolina Cardiovascular Biology Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7126, USA. J. Alfredo Martinez Department of Physiology and Nutrition, University of Navarra, Pamplona, Spain.
Christopher B. Newgard Duke Independence Park Facility, Duke University Medical Center, Durham, NC 27704, USA. Maggie Ng Departments of Medicine and Human Genetics, Chicago, IL 60637, USA. Paul W. Noble Division of Pulmonary, Allergy and Critical Care Medicine, Duke University Medical Center, Box 3171, Durham, NC 27710, USA. Vasilis Ntziachristos Technical University of Munich, Helmholtz Center Munich, Munich, Germany. Robert L. Nussbaum Division of Medical Genetics, Department of Medicine & Institute for Human Genetics, University of California, San Francisco, 513 Parnassus Ave, San Francisco, CA 94143 Meghan B. O’Donoghue Department of Chemistry, Shands Cancer Center and UF Genetics Institute, University of Florida, Gainesville, FL 32611, USA. Kunle Odunsi Departments of Gynecologic Oncology and Immunology, Roswell Park Cancer Institute, Buffalo, NY, USA.
Kevin McGrath Division of Gastroenterology, Hepatology and Nutrition, University of Pittsburgh, Pittsburgh, PA 15213, USA.
Kenneth Olden Laboratory of Molecular Carcinogenesis, NIEHS, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, NC 27709-2233, USA.
John McHutchison Department of Gastroenterology, Duke University Medical Center, Duke University Clinic Research Institute, Durham, NC, USA.
Steve R. Ommen Mayo Clinic, Windland Smith Rice Sudden Death Genomics Laboratory, Rochester, MN 55905, USA.
Frank Middleton Upstate Medical University, Washington VA Medical Center, Washington, DC, USA.
Jose M. Ordovas Nutrition and Genomics Laboratory, Tufts University, Boston, MA 02111-1524, USA.
Vega Masignani
Novartis Vaccines, Siena, Italy.
Contributors
Roberto Patarca E. M. Papper Lab of Clinical Immunology and Molecular Biology, University of Miami School of Medicine, Sunny Isles Beach, FL 33160, USA. Jeetendra Patel Division of Cardiology, Department of Medicine, University of Utah School of Medicine, Salt Lake City, UT, USA. K. Patel Department of Gastroenterology, Duke University Medical Center, Duke University Clinic Research Institute, Durham, NC, USA. Carlos N. Pato Center for Neuropsychiatric Genetics and Department of Psychiatry, Department of Neuroscience and Physiology, Upstate Medical University, Syracuse, NY.
■
xlv
Sridhar Ramaswamy Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02115, USA. Hans-Georg Rammensee Interfakultares Institut fur Zellbiologie, Abt. Immunologie, D-72076 Tubingen, Germany. Scott D. Ramsey Department of Pharmacy, University of Washington, Seattle, WA 98195, USA. Rino Rappuoli Novartis Vaccines, Siena, Italy. Nader Rifai Laboratory Medicine, Children’s Hospital Boston, Boston, MA 02115, USA. Lisa Rimsza Department of Pathology, University of Arizona, Tucson, AZ 85724-5043, USA.
Michele T. Pato Center for Neuropsychiatric Genetics and Department of Psychiatry, Department of Neuroscience and Physiology, Upstate Medical University, Syracuse, NY.
Yu-Hui Rogers J. Craig Venter Institute, Rockville, MD 20851, USA.
Tanja Pejovic Department of Obstetrics and Gynecology, Oregon Health Sciences University, Portland, OR 97239, USA.
Allen D. Roses Deane Drug Discovery Institute, Duke University Medical Center, One Science Drive, Suite 342, Durham, NC 27708.
Noah C. Perin Tularik, Inc., South San Francisco, CA 94080, USA. Michael Pham Division of Cardiovascular Stanford University, Stanford, CA 94305, USA.
Medicine,
Robert M. Plenge Division of Rheumatology, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA. Achim Plum Epigenomics, AG, Berlin, Germany. Mihai V. Podgoreanu Department of Anesthesiology, Duke University Medical Center, Durham, NC 27710, USA. Jonathan R. Pollack University, CA, USA.
Department of Pathology, Stanford
Rebecca L. Pollex Robarts Research Institute, London, ONT, Canada. Nathan D. Price Department of Chemical and Biomolecular Engineering, Department of Bioengineering and Institute for Genomic Biology, University of Illinois, UrbanaChampaign, IL USA. Thomas Quertermous Division of Cardiovascular Medicine, Stanford University, Stanford, CA 94305, USA. Francisco J. Quintana Center for Neurologic Diseases, Harvard Medical School, Boston, USA. Benjamin A. Raby Channing Laboratory, Brigham and Women’s Hospital, Boston, MA; Division of Pulmonary and Critical Care, Brigham & Women’s Hospital, Boston, MA; Harvard Medical School, Boston, MA; Center for Genomic Medicine, Brigham and Women’s Hospital, Boston, MA. Daniel J. Rader University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6160, USA.
Jeffrey S. Ross Department of Pathology and Laboratory Medicine, Albany Medical College MC-80, Albany, NY 12208, USA. Ronenn Roubenoff Biogen Idec, Inc.
Immunology
Medical
Research,
Marschall S. Runge Department of Medicine, University of North Carolina-Chapel Hill, Chapel Hill, NC 27514, USA. Maren T. Scheuner 90407-2138, USA.
Rand Corporation, Santa Monica, CA
Matthias Schuster Epigenomics, AG, Berlin, Germany. David A. Schwartz Center for Genetics and Therapeutics, National Jewish Medical and Research Center, 1400 Jackson Street, Denver, CO 80206 Debra A. Schwinn Department University of Washington, Seattle, USA.
of
Anesthesiology,
Chia Kee Seng Centre for Molecular Epidemilogy, Yong Loo Lin School of Medicine, National University of Singapore, Singapore. Nicholas A. Shackel Department of Gastroenterology, Royal Prince Alfred Hospital, Sydney, Australia. M. Frances Shannon Division of Molecular Bioscience, John Curtin School of Medical Research, Australian National University, Canberra ACT 2601 Australia. A. Dean Sherry Departments of Radiology and Internal Medicine, Advanced Imaging Research Center, University of Texas Southwestern Medical Center, Dallas,TX 75235, USA. Jiaqi Shi Department of Pathology and the Arizona Cancer Center, University of Arizona, Tucson, Arizona, USA.
xlvi
■
Contributors
Kevin Shianna Duke Institute for Genome Sciences and Policy, Duke University, Durham, NC 27710, USA. Yelizaveta Shnayder Department of Otolaryngology – Head and Neck Surgery, University of Kansas School of Medicine, Kansas City, KS, USA. Andrew B. Singleton Laboratory of Neurogenetics, Molecular Genetics Unit, NIH, National Institute on Aging, Bethesda, MD 20892, USA. T.P. Slavin Center for Human Genetics, University Hospitals Case Medical Center; Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine; Department of Pediatrics, University Hospitals Case Medical Center, Cleveland, Ohio, USA. Desmond J. Smith Department of Molecular and Medical Pharmacology, Geffen School of Medicine, UCLA, Los Angeles, CA 90095-1735, USA. Rikkert L. Snoeckx Stem Cell Institute Leuven, Catholic University Leuven, Leuven, Belgium.
Hervé Tettelin
Novartis Vaccines, Siena, Italy.
Giovana R. Thomas Department of Otolaryngology – Head & Neck Surgery, University of Miami School of Medicine, Miami, FL 33136, USA. John D. Thompson Newborn Screening Program,Washington State Department of Health, Shoreline,WA 98155, USA. Visith Thongboonkerd Medical Molecular Biology Unit, Office for Research and Development, Siriraj Hospital, Mahidol University, Bangkoknoi, Bangkok 10700, Thailand. Eric J. Topol Scripps Clinic Division of Cardiovascular Disease, La Jolla, CA 92037, USA. Jeffrey A. Towbin Pediatric Cardiology, Baylor College of Medicine, Houston, TX 77030, USA. Ad A. van Bodegraven Department of Gastroenterology, VU University Medical Centre, Amsterdam, The Netherlands. Kris Van Den Bogaert Stem Cell Institute Leuven, Catholic University Leuven, Leuven, Belgium.
3059,
Filip Van den Bosch Department of Rheumatology, Ghent University Hospital, Ghent, Belgium.
Avrum Spira Department of Medicine, Boston University School of Medicine, Boston, MA 02118, USA.
Matte Vatta Departments of Pediatrics (Cardiology), Baylor College of Medicine, Texas Children’s Hospital, Houston, TX, USA.
Colin F. Spraggs Pharmacogenetics, Durham, NC 27709, USA.
GlaxoSmithKline,
Timothy D. Veenstra SAIC Frederick, Inc., National Cancer Institute, Frederick, MD 21702-1201, USA.
Robert D. Stevens Sarah W. Stedman Nutrition and Metabolism Center, Department of Pharmacology and Cancer Biology, Duke University Medical Center, Durham, NC 27704, USA.
David L. Veenstra Department of Pharmacy, University of Washington, Seattle, WA 98195, USA.
Ralph Snyderman Duke Durham, NC 27710, USA.
University,
DUMC
Nicolas A. Stewart SAIC Frederick, Inc., National Cancer Institute, Frederick, MD 21702-1201, USA. Alison Stewart Strangeways Research Laboratory, Public Health Genetics Unit, Worts Causeway, Cambridge, CB1 8RN, UK. F. Stewart Weis Center for Research/Geisinger Clinic, Danville, PA 17822-2607, USA. Robert L. Strausberg J. Craig Venter Institute, Rockville, MD 20850, USA.
Aleksandr E. Vendrov Department of Medicine, Carolina Cardiovascular Biology Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7126, USA. Catherine M. Verfaillie Stem Cell Institute Leuven, Catholic University Leuven, Leuven, Belgium. Nicole M. Walley Center for Population Genomics and Pharmacogenetics, Duke University Medical Center, Durham, NC 27710, USA. Ling Wang Department of Chemistry, Shands Cancer Center and UF Genetics Institute, University of Florida, Gainesville, FL 32611, USA.
Moshe Szyf Department of Pharmacology and Therapeutics, McGill University, Montreal, Quebec H3G 1Y6, Canada.
Mike Weale Statistical Genetics Unit, Department of Medical and Molecular Genetics, King’s College London, Guy’s Hospital, London SE1 9RT, UK.
Weihong Tan Department of Chemistry, Shands Cancer Center and UF Genetics Institute, University of Florida, Gainesville, FL 32611, USA.
Michael E. Weinblatt Division of Rheumatology, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA.
Robert I. Tepper Third Rock Ventures, LLC, Boston, MA 02116, USA.
Howard L. Weiner Center for Neurologic Diseases, Harvard Medical School, Boston, MA 02115, USA.
Contributors
Scott T. Weiss Department of Medicine, Harvard Medical School, Boston, MA 02115, USA. Ralph Weissleder Center for Systems Massachusetts General Hospital, Boston, MA.
Biology,
Samuel A. Wells Department of General Surgery, Duke University Medical Center, Durham, NC 27705, USA. Brett R. Wenner Sarah W. Stedman Nutrition and Metabolism Center, Department of Pharmacology and Cancer Biology, Duke University Medical Center, Durham, NC 27704, USA. Ilse R. Wiechers Massachusetts General/McLean Hospital, Adult Psychiatry Residency Program, Boston, MA 02114, USA. Georgia L. Wiesner School of Medicine – Department of Genetics, Case Western Reserve University, Cleveland, Ohio 44106-4955, USA. Janey L. Wiggs Department of Opthalmology, Harvard Medical School and the Massachusetts Eye and Ear Infirmary, Boston, MA 02114, USA. Cisca Wijmenga Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands.
■
xlvii
Christopher W. Woods Department of Medicine – Infectious Diseases, Duke University Medical Center, Durham, NC 27705, USA. Paula W. Woon National Center for Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, GA 30341, USA. Chana Yagil Department of Nephrology and Hypertension and Laboratory for Molecular Medicine, Ben-Gurion University, Barzilai Medical Center Campus, Ashkelon, Israel. Yoram Yagil Department of Nephrology and Hypertension, Ben-Gurion University, Barzilai Medical Center Campus, Ashkelon, Israel. Tadashi Yamamoto Department of Structural Pathology, Institute of Nephrology, Niigata University Graduate School of Medical and Dental Sciences, Niigata 951-8510, Japan. Gang Yao Department of Chemistry, Shands Cancer Center and UF Genetics Institute, University of Florida, Gainesville, FL 32611, USA. Andrew J. Yee Massachusetts General Hospital Cancer Center, Harvard Medical School, Boston, MA 02115, USA. Hyuntae Yoo Institute for Systems Biology, Seattle, WA, USA.
Huntington F. Willard Duke Institute for Genome Sciences and Policy, Durham, NC 27710, USA.
Y. Nancy You Department of Surgery, Mayo Clinic, Rochester, MN 55905, USA.
James M. Wilson Gene Therapy Program, Translational Research Lab, University of Pennsylvania, Philadelphia, PA 19104-3403, USA.
Li-Rong Yu SAIC Frederick, Inc., National Cancer Institute, Frederick, MD 21702-1201, USA.
Nelson A. Wivel Gene Therapy Program, Translational Research Lab, University of Pennsylvania, Philadelphia, PA 19104-3403, USA.
Jean-François Zagury Conservatoire National des Arts et Metiers, Chaire de Bioinformatique, 292, rue Saint-Martin, 75003 Paris.
Jay Wohlgemuth Via Pharmaceuticals, San Francisco, CA, USA.
Ron Zimmern Strangeways Research Laboratory, Public Health Genetics Unit, Worts Causeway, Cambridge, CB1 8RN, UK.
Janet Woodcock MD 20857, USA.
Stephan Züchner University of Miami School of Medicine, Miami, FL 33136.
Food and Drug Administration, Rockville,
This page intentionally left blank
PART
One
Genomic Approaches to Biology and Medicine
1
This page intentionally left blank
Section
Principles of Human Genomics
1. 2. 3. 4. 5. 6.
1
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine Concepts of Population Genomics Genomic Approaches to Complex Disease Human Health and Disease: Interaction Between the Genome and the Environment Epigenomics and Its Implications for Medicine Systems Biology and the Emergence of Systems Medicine
CHAPTER
1 Organization, Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine Huntington F. Willard
INTRODUCTION The genetic variation that can influence health and disease has been a central, if not widely practiced, principle of medicine for over a hundred years, since the prescient observations of the British physician Sir Archibald Garrod established the concept of “chemical individuality” over a century ago (Garrod, 1902). What has limited broad application of this principle until now has been the generally presumed rarity or special nature of clinical circumstances or conditions to which genetic variation was relevant – rare disorders such as Garrod’s alkaptonuria, inherited conditions limited to specific populations such as sickle cell anemia, or specialized situations such as the role of ABO incompatibility in blood transfusion. Now, however, with the availability of a “reference sequence” of the human genome, with emerging appreciation of the extent of genome variation among different individuals, and with a growing understanding of the role of common, not just rare, variation in disease, we are poised to begin to exploit the impact of that variation on human health on a broad scale, in the context of genomic and personalized medicine. Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 4
Variation in the human genome has long been the cornerstone of the field of human genetics (see Box 1.1), and its study led to the establishment of the medical specialty of medical genetics (Nussbaum et al., 2007). A crucial set of connections joining Mendel’s principles of heredity, Garrod’s concept of chemical individuality, the practice of medicine and the sequence of the genome came with Pauling’s discovery of the molecular basis of sickle cell anemia and the direct correspondence between an individual’s genetic make-up and the type of hemoglobins present in that individual’s red cells (Pauling et al., 1949). The general nature and frequency of gene variants in the human genome became apparent with the classic work in the 1960s on the incidence of polymorphic protein variants in populations of healthy individuals (Harris and Hopkinson, 1972; Lewontin, 1967; reviewed in Harris, 1980). Calculations based on those protein polymorphism data – now extended in a robust and comprehensive manner with the analysis of variation on a genome scale – lead to the inescapable conclusion that virtually every individual should be found to have his or her own unique constitution of gene products, the implications Copyright © 2009, Elsevier Inc. All rights reserved.
Introduction
BOX 1.1
■
5
Genetics and Genomics
Throughout this and the many other chapters in this volume, the terms “genetics” and “genomics” are used repeatedly, both as nouns and in their adjectival forms. While these terms seem similar, they in fact describe quite distinct (though frequently overlapping) approaches in biology and in medicine. Having said that, there are inconsistencies in the way the terms are used, even by those who work in the field. To some, genetics is a subfield of genomics; to others, genomics is a subfield of genetics. Arguably, depending on the perspective one has in mind, both may be right! Here, we provide operational definitions to distinguish the various terms and the subfields of medicine to which they contribute. The field of genetics is the scientific study of heredity and of the genes that provide the physical, biological and conceptual bases for heredity and inheritance. To say that something – a trait, a disease, a code or information – is “genetic” refers to its basis in genes and in DNA. Heredity refers to the familial phenomenon whereby traits (including clinical traits) are transmitted from generation to generation, due to the transmission of genes from parent to child. A disease that is said to be inherited or hereditary is certainly genetic; however, not all genetic diseases are hereditary (witness cancer, which is always a genetic disease, but is only occasionally an inherited disease). Genomics is the scientific study of a genome or genomes. A genome is the complete DNA sequence, containing the entire genetic information of a gamete, an individual, a population or a species. As such, it is a subfield of genetics when describing an approach taken to study genes. The word “genome” originated as an analogy with the earlier term “chromosome,” referring to the physical entities (visible under the microscope) that carry genes from one cell to its daughter cells or from one generation to the next. “Genomics” gave birth to a series of other “-omics” that refer to the comprehensive study of the full complement of genome products – for example, proteins (hence, proteomics), transcripts (transcriptomics) or metabolites (metabolomics). The essential feature of the “omes” is that they refer to the complete collection of genes or their derivative proteins, transcripts or metabolites, not just to the study of individual entities. While formally the field of genomics refers to the study of genomes (and hence, DNA) only, it sometimes takes on the broader meaning of referring to any large-scale approach; the less specific term “genome sciences” is also sometimes used to refer to all of the “-omics” to connote global and comprehensive approaches to the study of biology and medicine. By analogy with genetics and genomics, epigenetics and epigenomics refer to the study of factors that affect gene (or, more globally, genome) function, but without an accompanying change in genes or the genome. As presented later in this chapter and others, some typical epigenetic factors involve changes in DNA methylation or modifications to chromatin that change genome structure and hence influence gene expression even in the absence of changes in the DNA sequence. The epigenome is the comprehensive set of epigenetic changes in a given individual, tissue, tumor or population.
of which provide a conceptual foundation for what today we call “personalized medicine” as a new-age rediscovery of Garrod’s “chemical individuality.” Thus, with the availability of the human genome sequence (International Human Genome Sequencing Consortium, 2001, 2004; Venter et al., 2001) and determination of
It is the paired combination of the genome and the epigenome that best characterize and determine one’s phenotype. Medical genetics is the application of genetics to medicine with a particular emphasis on inherited disease. Medical genetics is a broad and varied field, encompassing many different subfields, including clinical genetics, biochemical genetics, cytogenetics, molecular genetics, the genetics of common diseases and genetic counseling. Medical genetics is one of 24 medical specialties recognized by The American Board of Medical Specialties, the preeminent medical organization overseeing physician certification in the United States. As of 2007, there were approximately 2300 board-certified medical geneticists in the United States. Genetic medicine is a term sometimes used to refer to the application of genetic principles to the practice of medicine and thus overlaps medical genetics. However, genetic medicine is somewhat broader, as it is not limited to the specialty of Medical Genetics but is relevant to health professionals in many, if not all, specialties and subspecialties. Both medical genetics and genetic medicine approach clinical care largely through consideration of individual genes and their effects on patients and their families. By contrast, genomic medicine refers to the use of large-scale genomic information and to consideration of the full extent of an individual’s genome, proteome, transcriptome, metabolome and/or epigenome in the practice of medicine and medical decision-making. The principles and approaches of genomic medicine are relevant well beyond the traditional purview of medical genetics and include, as examples, gene expression profiling to characterize tumors or to define prognosis in cancer, genotyping variants in the set of genes involved in drug metabolism or action to determine an individual’s correct therapeutic dosage, scanning the entire genome for millions of variants that influence one’s susceptibility to disease, or analyzing multiple protein biomarkers to monitor therapy and to provide predictive information in presymptomatic individuals. Finally, personalized medicine refers to a rapidly advancing field of health care that is informed by each person’s unique clinical, genetic, genomic and environmental information. The goals of personalized medicine are to take advantage of a molecular understanding of disease to optimize preventive health care strategies and drug therapies while people are still well or at the earliest stages of disease. Because these factors are different for every person, the nature of disease, its onset, its course, and how it might respond to drug or other interventions are as individual as the people who have them. In order for personalized medicine to be used by health care providers and their patients, these findings must be translated into precision diagnostic tests and targeted therapies. Since the overarching goal is to optimize medical care and outcomes for each individual, treatments, medication types and dosages, and/or prevention strategies may differ from person to person – resulting in unprecedented customization of patient care. The principles underlying genomic and personalized medicine and their applications to the practice of clinical medicine are presented throughout the chapters that comprise this volume.
the extent of human genome variation, both within and among populations (International HapMap Consortium, 2003, 2007) and within individual genomes (Levy et al., 2007; Wheeler et al., 2008), awareness of widespread human variation can begin to be applied generally to the exploration of common human disease.
6
CHAPTER 1
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
In this chapter, the organization, variation and expression of the human genome is presented as a foundation for the many chapters to follow on human genomics, on genome technology and informatics, on approaches in translational genomics and, finally, on the principles of genomic and personalized medicine as applied to specific diseases.
THE HUMAN GENOME The typical human genome consists of approximately 3 billion (3 109) bp of DNA, divided among the 24 types of nuclear chromosomes (22 autosomes, plus the sex chromosomes, X and Y) and the much smaller mitochondrial chromosome (Tables 1.1, 1.2). The genome can be represented and evaluated in different ways, with different levels of resolution and degrees of sensitivity, depending on the clinical need (Figure 1.1). Individual chromosomes can best be studied at metaphase in dividing cells, and karyotyping of patient chromosomes has been a valuable and routine clinical laboratory procedure for decades (Trask, 2002); various staining or hybridizationbased analytical techniques have the ability to detect chromosome abnormalities ranging from an extra or missing whole chromosome (aneuploidy), to translocations or rearrangements involving just a portion of a chromosome(s), to deletions or duplications involving as little as perhaps a megabase (106 bp; Mb) of DNA. More recent technologies involving overlapping sets (called “tiling paths”) of isolated segments of the genome arrayed on microscope slides have provided vastly improved resolution and precision capable of evaluating in a rapid and comprehensive way the proper dosage (and in some cases the organization) of the corresponding DNA segments within an individual’s genome (Figure 1.1b) (see Chapter 9). The ultimate resolution, of course, comes from direct sequence analysis, and a number of new technologies have reduced the cost and improved the throughput of sequencing individual genomes, facilitating comparisons with the reference human genome sequence and enabling medical resequencing of patient samples (see later section in this chapter) to search for novel variants or mutations that might be of clinical importance (Bentley, 2006) (Figure 1.1c, d) (see Chapter 7). Genes in the Human Genome While the human genome contains a currently estimated 20,000–25,000 genes (Clamp et al., 2007; International Human Genome Sequencing Consortium, 2004), the coding segments of those genes comprise less than 2% of the genome; as represented in Figure 1.1c, most of the genome, therefore, consists of DNA that lies between genes, far from genes or in vast areas spanning several Mb that appear to contain no genes at all (“gene deserts”). A caveat for this statement is that the process of gene identification and genome annotation remains very much a work-in-progress; despite the apparent robustness of recent estimates (Clamp et al., 2007), it is conceivable that
TABLE 1.1
Characteristics of the human genomea
Length of the human genome (basepairs)
3,253,037,807
Number of known protein-coding genes
21,541
b
Average gene density (genes/Mb)
6.6
Number of non-coding RNA genes
4421
Number of SNPsb
13,022,900
a b
From Ensembl v. 48 (accessed February 2008) Mb megabasepairs; SNP single nucleotide polymorphism
TABLE 1.2 Chromosome
Variation among human chromosomesa Mb
Genes/ Mb
miRNA genes
1
247.25
2153
8.7
68
2
242.95
1315
5.4
60
3
199.50
1105
5.5
57
4
191.27
786
4.1
42
5
180.86
894
4.9
46
6
170.90
1109
6.5
36
7
158.82
1008
6.3
43
8
146.27
743
5.1
38
9
140.27
904
6.4
40
10
135.37
819
6.1
35
11
134.45
1368
10.2
37
12
132.35
1069
8.1
43
13
114.14
356
3.1
23
14
106.37
662
6.2
62
15
100.34
634
6.3
21
16
88.83
902
10.2
20
17
78.77
1217
15.5
40
18
76.12
289
3.8
15
19
63.81
1427
22.4
82
20
62.44
603
9.7
28
21
46.94
283
6.0
10
22
49.69
508
10.2
18
X
154.91
874
5.6
97
Y
57.77
80
1.4
3
22
–
–
Mitochondrial a
Proteincoding genes
From Ensembl v. 48
0.016
The Human Genome
(a)
(d)
1
2
3
6
7
8
13
14
15
19
20
4
9
21
22
5
10
11
12
16
17
18
X
Y
(b)
0Mb
1Mb
2Mb
3Mb
■
7
>chr1:2040588,2043588 actgcaacctccacctcctgggttcaagtgattctgctgcctcagcctcctgagtagctg ggattacaggtgcccaccaccatgcccaactattttttgtatttttagtagaggcagggt ttcaccatattgaccaggctggtatcgaattcctggcctcaagtgatctgtctgccttgg cctcccaaagtgctggg[t/a]ttacaggcatgagccactgtgcctggcctaattattct tctttccttattgttagtttgtgctattattttatcagtctttgtgctgttattatcatg cctgtaaattctacgtgtatttcagacccacaaaccaagtgttgtcttagacagtggtcc ttcagatttacccccaggttacccttctagtcttcctgcaggacggcgcttacatggaga ccagcttccttctgcctgaagtagtccctttagtattcctttcagcacagacttgtaatt aattctttttatttcttttcttttcttttttttttttttgagatggatttttgctcttgt tgcccaggctggagtgcagtggtgtgattttggctcactgcagcctccacctcccaggtt caagcgattctcctggctcagcctcctgaggagctaggattgcaggtgtgcgccaccacg cccagttgttttttgtttgtgtgggaaatgtctttggcattctttctggagggtgttctc cactctgtgtggagttctaggcaggtagggggtttcccccaacaggtttttgtgttggct tggagtgtt[t/g]gtcatttctgtggtgagggcgccttccagcctcactgccacccctg gaaggcaacatctcttttctctgactcctgttaaaagtgttttcatcacaacagcagcct tgtgaaggacagaggaatcgagaatttctcctaattgagattggtagagcttcttgaatc agggacatgatagcttttgtctcttttggaaaatatcagcccttgacttttcgttttttt ttttttttttttttttttttttgagtctcgctcttgttgcccaggctggagtgcaatggc gcgatctcgactcactgcaatctccacctccccggttcaagtgattctcctgcctcagcg tcccgagtagctgggattacaggcacttgccaccatgaccggctaattttttttgcattt ataggagagacagggtttcaccatgttgaccaggctggtctggaactcctgatcatacat ccaccttggcctcccaaagtgctgggattacaggtgtgagccaccgtgcccggccagccc ttggcttttcaaatagcatcctgttctctctcccctgggacccccacacttcacacctgt gtgtctaatgtgctcttttttctgggtttcttctgcgttggttttttcccgctttgtgct tcaatgtggatttttttctactgttatctcttatttcacccaatctactcttaaatctac cctttaaattattaatttcagtcacttcattttttacttttagaatttccatttgattct ttttttttttttttttgcccaggatggcaatggcacgctctcggctcactgcaacctccg cctcccaggttcaagcaatattcctgccccagcctcccaagcagctgggattacagggtc acactaccacgccccactaatttttatgtttttattagagacggggttttgccatgttgg ccaggctggtctcgaactcctgaccttgggtgatccg[c/t]ttgcctcagcctcccaaa gtgttgggattacaggcgtgagccactgcgcctggcatcgtagttctctcttctggggtg ggaatgtctattctgtgtccttctcacgtgcaaaatactgtcattacatcccaatggccc cagaacccttaactcctcccagtgtggcgggggcagtcttgtctgaacaaggcatggggg agcctggaggcccattcctcctgaggccaagt[t/a]cctccctggctgtgggccagcat aagcgaacaaggcgtgtacttccggaatgctatggactgagtgtgtgtctccccagaatc catatgttgaagccctaaccctccagtgtgatggtgtttggagacgaagcctttgacagg tagttagagtcatggcggtagttagttagagtcatggcggtagttagttagggtcacggt ggtagttaggatcatggtggtacttaaggtcatggcagtagttagggttatatcagtagt tagggctatggctgtagttagggtgatggtggtagttaaggtcacagcagtaattagggt catggtggtggttagggtcacagtggtagttagggtcacggtggtggttagggtcgtggt ggtggttagggtcacggtggtggttagggtcacggtggtagttagggtcacggcggtact tagggtcacggcggtggttagggtcacggcggtggttagggtcacggtggtggttagggt cacggcggtggttagggtcacggtggtggttagggtcgtggtagttaggttcatggtggt ggttagggtcgtggtggttagggtcacggtggtggttagggtcacggtggtagttagggt cacggctgtagttagcgtcatggtggtggttagggtcacggcggtggttagggtcacggt ggtggttagggtcacggcggtggttagggtcacggtggtggttagggtcgtggtagttag gttcatggtggtggttagggtcgtggtggttagggtcacggtggtagttagggtcgtggt ggttagggtcatggtggtggttagggtcacggtggtggttagggtcgtggtggttagggt cgtggtggttagggtcgtggtggttagggttgtggtggttagggtggtggtggttagggt cgtggcggtggttagggtcgtggcggtggttagggttgtggtggttagggtcacggtggt ggttagggtcacggtgg…
4Mb
5Mb
6Mb
Chromosome 6
(c)
p25.2
Figure 1.1 Four views of the human genome. (a) Karyotype of a normal male donor, HuRef, whose genome was the first individual diploid genome to be sequenced (Levy et al., 2007). The 24 types of human chromosome are shown after conventional G-banding – 22 pairs of autosomes and the two sex chromosomes, X and Y. (b) An array of genomic segments, showing 244,000 genomic elements hybridized to DNA from HuRef. (c) Schematic representation of the content of 6 Mb from the short arm of chromosome 6, including the location of various genes and other features from the HuRef genome. (d) DNA sequence from the genome of James D. Watson, showing 3000 bp from chromosome 1. Watson’s sequence is heterozygous at four positions, three (in yellow) that are known polymorphisms in various populations and one (in red) that is a novel variant. Figures in (a) and (b) were provided courtesy of Steve Scherer, Hospital for Sick Children, Toronto, Canada. (c) is part of a large poster representing the complete diploid HuRef genome (Levy et al., 2007).
there are some genes, including clinically relevant genes, that are currently undetected or that display characteristics that we do not currently recognize as being associated with genes. A maximum of 5% of the genome consists of DNA that has been quite
well conserved through evolution, one indication of an important function. These and other considerations have led to the estimate that at most 20% of the genome is of functional importance (Pheasant and Mattick, 2007). Nonetheless, the statement
8
CHAPTER 1
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
that the vast majority of the genome consists of spans of DNA that are non-genic, of no obvious function, and of uncertain clinical relevance remains true. In addition to being relatively sparse in the genome, genes are distributed quite non-randomly along the different human chromosomes. Some chromosomes are relatively gene-rich, while others are quite gene-poor, ranging from a high of 22 genes/Mb to a low of 3 genes/Mb (excluding the Y chromosome and the mitochrondrial chromosome) (Table 1.2). And even within a chromosome, genes tend to cluster in certain regions or in particular bands, a point of clear clinical significance when evaluating genome integrity, dosage or arrangement in different patient samples. Coding and Non-Coding Genes There are a number of different types of gene in the human genome. Most genes are protein-coding and are transcribed into messenger RNAs (mRNAs) that are ultimately translated into their respective proteins; their products comprise the list of enzymes, structural proteins, receptors and regulatory proteins that are found in various human tissues and cell types. However, there are additional genes whose functional product appears to be the RNA itself. These so-called non-coding RNAs (ncRNAs) have a range of functions in the cell, and some do not as yet have any identified function. But the genes whose transcripts make up the collection of ncRNAs represent about a sixth of all identified human genes (Table 1.1). Some of the types of ncRNA play largely generic roles in cellular infrastructure, including transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) involved in translation of mRNAs on ribosomes, spliceosomal RNAs involved in control of RNA splicing, and small nucleolar RNAs (snoRNAs) involved in modifying rRNAs (Griffiths-Jones, 2007; Mattick and Makunini, 2006). Other ncRNAs play roles in gene regulation, for example in epigenetic gene silencing (Ogawa and Lee, 2002). A class of small RNAs of growing importance are the microRNAs, ncRNAs of only 22 bases in length that suppress translation of target genes by binding to their respective mRNAs and thus regulate protein production from the target transcript(s) (Filipowicz et al., 2008). Some 255 microRNA genes were identified in the human genome initially (Lim et al., 2003), although the total number of such genes is now thought to be closer to 1000 (Bentwich, 2005; Griffiths-Jones, 2007) (Table 1.2). Some are evolutionarily conserved, while others appear to be of quite recent origin during primate evolution, thus underscoring the difficulty of determining the precise number and identity of human genes (Clamp et al., 2007). MicroRNAs have been shown to downregulate hundreds of mRNAs each, with different combinations of target RNAs in different tissues (Lim et al., 2005); combined, the microRNAs are thus predicted to control the activity of as many as 30% of all protein-coding genes in the genome (Filipowicz et al., 2008). While this is a fast-moving area of genome biology, several microRNAs have already been implicated in various human diseases, including cancer,
developmental disorders and heart disease (Chang and Mendell, 2007; van Rooij et al., 2008). Genome Composition and Landscape As observed earlier, the distribution of genes in the genome is non-random, both within and between chromosomes. This in part is a reflection of the distribution of different types of DNA sequence, as the genome is partitioned into domains spanning hundreds of kilobasepairs to megabases, reflecting large-scale variation in the GC content of DNA. These so-called “isochores” have been known for decades and, at a very gross level, mimic the pattern of light- and dark-staining bands that one observes in metaphase chromosomes (e.g., Figure 1.1a, c) (Eyre-Walker and Hurst, 2001). While the driving force behind the evolution of isochores is not clear, they influence the GC content of genes contained within them (and, by virtue of the genetic code, therefore, the amino acid composition of the encoded proteins), the patterns of mutation and polymorphism detected, and the nature of various families of repeated DNA that reside there. Further – and most strikingly – different isochore domains contain clusters of genes that are highly or weakly expressed in a coordinated manner in different tissues (Caron et al., 2001; Gierman et al., 2007; Hurst et al., 2004). Thus, isochores reflect both the functional as well as structural organization of the genome. (See later section on “Expression of the Human Genome” for further discussion.) Repetitive DNA Overall, only about half of the total linear length of the genome consists of so-called single-copy or unique DNA, whose sequence is represented only once or at most a few times (International Human Genome Sequencing Consortium, 2001, 2004; Venter et al., 2001). The rest of the genome consists of several classes of repetitive DNA and includes DNA whose sequence is repeated, either perfectly or with some variation, hundreds to millions of times in the genome. Several different categories of repetitive DNA are recognized. Clustered repeated sequences constitute an estimated 10–15% of the genome and consist of arrays of various short repeats organized tandemly in a head-to-tail fashion. Such arrays can stretch several Mb or more in length and constitute up to several percent of the DNA content of individual human chromosomes; a notable outlier in this respect is the male-specific Y chromosome, of which more than half consists of such repeated DNA families (Skaletsky et al., 2003). Other tandem repeat families are based on somewhat longer basic repeats. For example, the alpha-satellite family of DNA is composed of tandem arrays of different copies of an 171 bp unit, found at the centromere of each human chromosome, which is critical for proper segregation of chromosomes during cell division (Rudd and Willard, 2004; Schueler and Sullivan, 2006). Another highly significant family of repeats is found at the very ends of chromosomes, the telomeres. While the repeats at the functional telomeres consist of relatively short stretches of perfect (TTAGGG)n repeats, different subtelomeric regions (just proximal to the telomere repeats) share patterns of homology with other subtelomeres around the genome that
Variation in the Human Genome
create clinically relevant hotspots of interchromosomal recombination (Linardopoulou et al., 2005; Riethman et al., 2005). Other major types of repetitive DNA in the genome consist of related sequences that are dispersed throughout the genome rather than localized. Among the best-studied dispersed repetitive elements are short, interspersed nuclear elements (SINEs). The most prominent family of these contains repeats that are about 300 bp in length and are recognizably related to each other although not identical in DNA sequence. In total, members of this family make up at least 10% of human DNA, although they make up a much higher percentage of the DNA in some isochores. A second major dispersed, repetitive DNA family is called the LINE (where the L stands for long) family, whose members range in size up to 6 kp in length and account for about 20% of the genome. Families of repeats dispersed throughout the genome are clearly of medical importance. Both SINE and LINE sequences have been implicated as the cause of mutations in genetic disease. At least a few copies of these families generate copies of themselves that can integrate elsewhere in the genome, occasionally causing insertional inactivation of a medically important gene. The frequency of such events causing genetic disease in humans is largely unknown, but they have been suggested to account for as many as 1 in 500 mutations (Deininger et al., 2003; Kazazian and Moran, 1998). In addition, aberrant recombination events between different LINE or SINE repeats can also be a cause of mutation in some genetic diseases. Segmental Duplications An important subclass of repetitive DNA, distinct from the large families just mentioned, includes blocks of different sequences (hence, not defining a particular family of sequences) that are present in multiple copies, often with extraordinarily high sequence conservation, in many different locations around the genome. Duplications involving substantial segments of a chromosome, called segmental duplications, account for at least 5% of the genome (Bailey and Eichler, 2006). When the duplicated regions contain genes, genomic rearrangements can result in the deletion of the region (and the genes) between the copies and thus give rise to disease (Conrad and Antonarakis, 2007). In addition, rearrangements between duplicated segments are a source of significant variation between individuals in the number of copies of these DNA sequences (Sharp et al., 2005), as will be discussed in the next section.
With completion of the reference human genome sequence, attention turned to the discovery and cataloging of variation in that sequence among different individuals (including both healthy individuals and those with various diseases) and among different populations. It has been estimated that there are some 10–15 million common sequence variants that are of sufficient frequency (minor allele frequency 5%) in one or more
9
populations to be considered polymorphic in our species. In addition, there are countless very rare variants, many of which probably exist in only a single or a few individuals. In fact, given the number of individuals in our species, essentially each and every base pair in the human genome is expected to vary in someone somewhere around the globe. It is for this reason that the original genome sequence is considered a “reference” sequence, derived as a consensus of the limited number of individual genomes whose sequencing was part of the Human Genome Project, but actually identical to no individual’s genome. Types of Variation Early estimates were that any two randomly selected individuals have sequences that are 99.9% identical or, put another way, that an individual genome would be heterozygous at approximately 3–5 million positions, with different bases (i.e., a T or a G) at the maternally and paternally inherited copies of that particular sequence position. The majority of these differences involve simply a single unit in the DNA code and are referred to as single nucleotide polymorphisms (SNPs) (Table 1.1) (see Chapter 7). The remaining variation consists of insertions or deletions (in/dels) of (usually) short sequence stretches, variation in the number of copies of repeated elements or inversions in the order of sequences at a particular locus in the genome (Figure 1.2). The total amount of in/del variation is more than originally anticipated and approaches 0.5%, not 0.1%, between any two randomly selected individuals (Levy et al., 2007). Any and all of these types of variation can influence disease and thus must be accounted for in any attempt to understand the contribution of genetics to human health (Table 1.3).
A
B
C
D
Reference A
B
C
C
D
Segmental duplication – Biallelic CNV (C)2 A
B
C
C
C
D
Multiallelic copy number variant (C)0-n A
B
C
D
D
D
D
C
D
C
D
C
D
Complex CNV (D)4(CD)3 A
VARIATION IN THE HUMAN GENOME
■
C
B
D
Inversion (CB) Chromosome
Figure 1.2 Schematic representation of different types of structural polymorphism in the human genome, leading to deletions, duplications, inversions and CNV changes relative to the reference arrangement. From Estivill and Armengol (2007), with permission.
10
CHAPTER 1
TABLE 1.3
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
Common variation in the human genome
Type of variation
Size range (approx.)a
Effect(s) in biology and medicine
Single nucleotide polymorphisms (SNPs)
1 bp
Non-synonymous → functional change in encoded protein? Others → potential regulatory variants? Most → no effect? (“neutral”)
Copy number variants (CNVs)
10 kb to 1 Mb
Insertion/deletion polymorphisms (in/dels)
1 bp to 1 Mb
Inversions
Few bp to 100 kb
Segmental duplications
10 kb to 1 Mb
a
Gene dosage variation → functional consequences? Most → no effect or uncertain effect
In coding sequence: frameshift mutation? → functional change Most → uncertain effects ? break in gene sequence ? long-range effect on gene expression ? indirect effects on reproductive fitness Most → no effect? (“neutral”)
Hotspots for recombination → polymorphism (CNVs)
Abbreviations: bp basepair; kb kilobasepair; Mb megabasepair
While the overall estimate of SNP heterozygosity is approximately 1 in 1500 bp, there is much more variation in non-coding sequences than in the coding segments of genes, reflecting strong selective pressure during evolution against certain types of change in gene sequences. The combination of particular alleles along chromosomes is also non-random, with particular combinations (haplotypes) being more prevalent over short distances, due to the relative inefficiency of meiotic recombination to separate alleles at sites that are physically close together (International HapMap Consortium, 2007; Nussbaum et al., 2007). The resulting patterns of linkage disequilibrium are relevant for designing strategies to examine genetic variation genome-wide, both as a practical matter (i.e., reducing the number of SNPs that need to be tested to reveal the underlying patterns of variation) and for evaluating the potential functional importance of any particular SNP allele (see Chapters 2, 8 and 27). Copy Number Variation Over the past few years, a number of important studies have identified a previously unanticipated prevalence of structural variants in the genome, which collectively account for more variation in genome sequence than do SNPs (e.g., Levy et al., 2007; Redon et al., 2006; Sebat et al., 2004; Tuzun et al., 2005). The most common type of structural variation involves changes in the local copy number of sequences (including genes) in the genome, and these are generally referred to as copy number variants (CNVs) (Figure 1.2) (see Chapter 9). A number of different technology platforms are now being used to detect CNVs, including arrays and direct genome sequencing (Korbel et al., 2007; Levy et al., 2007; Wong et al., 2007). As many such CNVs encompass genes (including microRNA genes; Wong et al., 2007) and as a significant number of
novel CNVs are uncovered with every new population studied, a dedicated effort is underway to cataloged CNVs in the human genome worldwide and to associate these with clinical phenotype (Feuk et al., 2006; Scherer et al., 2007; Sharp et al., 2006). While most variation of this type is inherited, some CNVs occur de novo or even in somatic cells; in these cases, an individual will have different repeat lengths than do either of his or her parents. Array-based methods (see Chapter 9) have rapidly gained acceptance for evaluating the association of both inherited and de novo CNVs with mental retardation and other developmental disorders (de Vries et al., 2005; Friedman et al., 2006; Lee et al., 2007;Weiss et al., 2008). It is of considerable ongoing interest to evaluate the role of CNVs and other structural variants including deletions (Conrad et al., 2006) and inversions (Korbel et al., 2007; Stefansson et al., 2005; Tuzun et al., 2005) in the etiology of more common, complex diseases or traits of adulthood, including neurological and psychiatric conditions as well as pharmacogenetic traits (Beckmann et al., 2007; Buckland, 2003). Variation in a Single Genome The most extensive current inventory of the amount and type of variation to be expected in any given genome comes from the direct analysis of the diploid genome sequence of a single male individual, HuRef (Levy et al., 2007). Over 4 million variants were described, spanning some 12.3 Mb of DNA. About 20 Mb of “new” sequence was determined that was not previously available as part of the human reference sequence, reflecting in part the still unfinished nature of the human genome sequence and in part the particular patterns of inserted or deleted sequences that distinguish different genomes. Several hundred thousand in/dels were also found in this single genome. In addition, several hundred CNVs were detected, which overlapped at
Expression of the Human Genome
least 95 well-annotated genes. While most of these variants are identical to those found in other individuals in the population, others are likely to be what are termed “private” mutations, specific to HuRef and his family. In the HuRef genome, at least 850 genes known to be involved in inherited disease contained at least one heterozygous variant, and over 300 of them contained at least one nonsynonymous SNP (i.e., a SNP that, by virtue of the genetic code, is predicted to change the encoded amino acid). Of course, additional genes may also impact disease, and, overall, more than 4000 genes in the HuRef genome contained one or more non-synonymous SNP. Thus, at least 17% and perhaps as many as 44% of the genes in the HuRef genome were heterozygous and could encode proteins that differ in their amino acid sequence and/or are produced in different amounts (Levy et al., 2007). These estimates underscore the impact of gene and genome variation on human biology and on medicine. They also provide remarkable validation of the original estimates of Harris and Lewontin decades ago of the proportion of genes that are heterozygous in any given individual (Harris, 1980; Lewontin, 1967). Table 1.3 and Figure 1.2 capture the general types of and characteristics of the most common variation in the human genome and in human genes. However, it is clear that we are still in a mode of discovery, as relatively few genomes or populations have been assessed to date; no doubt millions of additional SNPs remain to be uncovered, as well as many additional in/ dels, inversions and CNVs, a portion of which will be expected to involve genes and other sequences of direct relevance to medicine. The issue of “what is normal?” – an essential concept in clinical medicine – remains very much an open question when it comes to the human genome (Shianna and Willard, 2006). Variation in Populations Most of the heterozygosity in the human genome is believed to be due to variants with a minor allele frequency of at least 1%. Taking advantage of major technological developments that have greatly increased the throughput of genotyping on a genome-wide scale, several large-scale projects have validated these estimates by gathering genotypic information on millions of SNPs worldwide (Hinds et al., 2005; International HapMap Consortium, 2003, 2007). Most of the studies to date, however, have been restricted to a small number of populations of Northern European, African and Asian origin used for SNP detection. From these and a large number of earlier studies that examined more populations but for many fewer variants, it has been concluded that some 85–90% of the variation found in our species is shared among different population groups; a relative minority of variants, therefore, are specific to or highly enriched/depleted in genomes from a particular population. It is possible to use population-specific variants to obtain information on the geographic origin of a genome or of particular segments within a genome. Given the many millions of SNPs now available, there are at least hundreds of thousands of
■
11
SNPs that are informative for such studies (so-called “ancestry informative markers,” AIMs) (Kittles and Weiss, 2003; Paschou et al., 2007; Tian et al., 2008). This had led to two related, but distinct applications of such markers. First is the use of admixture mapping, tracing the location of particular SNPs associated with disease in populations of patients whose genomes are a mixture from at least two original populations, for example, African-Americans or Latinos (Price et al., 2007; Smith et al., 2004). Such an approach has already been used to map genes associated with several phenotypes whose frequency differs markedly between different population groups, including prostate cancer (Freedman et al., 2006), hypertension (Deo et al, 2007), skin pigmentation (McEvoy et al., 2006) and white blood cell count (Nalls et al., 2008). The second use of AIMs is for ancestry testing unrelated to disease studies (Kittles and Weiss, 2003; Shriver and Kittles, 2004). While the motivations behind such testing and the potential uses (and, some fear, abuses) of biogeographic information are varied, the commercial availability and interpretation of genetic ancestry testing is controversial (Bolnick et al., 2007). Nonetheless, the availability of such information as an intentional or unwitting by-product of wide-scale genome analysis is inevitable, and both consumers/patients and health professionals need to be aware of this as genetic variation is explored in the context of individual genomes (see Chapter 33).
EXPRESSION OF THE HUMAN GENOME A key question in exploring the origins, structure and function of the human genome is to understand how proper expression of our 20,000–25,000 genes is determined, how it can be influenced by either genetic variation or by environmental exposures or inputs, and by what mechanisms such alterations in gene expression can lead to pathology evident in the practice of clinical medicine. The control of gene activity – in development, in different tissues, during the cell cycle, and during the lifetime of an individual both in sickness and in health – is determined by a complex interplay of genetic and epigenetic features. By “genetic” features, we here refer to those found in the genome sequence (see Box 1.1), which plays a role, of course, in determining the identity of each gene, its particular form (alleles), its level of expression (regulatory elements such as promoters, enhances, splice sites, etc.), and its particular genomic landscape (domains, isochores). By “epigenetic” features, here we mean packaging of the DNA into chromatin, in which it is complexed with a variety of histones as well as innumerable non-histone proteins that influence the accessibility and activity of genes and other genomic sequences. The structure of chromatin – unlike the genome sequence itself – is highly dynamic and underlies the control of gene expression that shapes in a profound way both cellular and organismal function (Felsenfeld and Groudine, 2003).
12
CHAPTER 1
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
Genomic and Epigenomic Aspects of Gene Expression Identifying the genomic sequences and features that direct spatial and temporal aspects of gene expression remains a formidable challenge in genome annotation. While several decades of work in molecular biology have defined critical regulatory elements such as promoters and enhancers for many individual genes, it is a tall order to perform such studies on a genomewide scale. At the genetic level, such studies have depended on identifying highly conserved elements, whose demonstrated sequence conservation over, in some cases, hundreds of millions of years provides prima facie evidence of their importance (Bejerano et al., 2004; Margulies et al., 2003) (see Chapter 10). Several such ultraconserved elements have been validated in vitro as gene enhancers, providing some confidence that others identified in this way will also play a role in regulating gene expression (Pennacchio et al., 2006). However, by definition, such approaches will overlook regulatory elements that are not well conserved at the sequence level and/or that are newly evolved on the human lineage. A complementary epigenetic approach has been to explore the characteristics of chromatin that are associated with active or repressed genes as a step towards identifying the transcriptional regulatory code for the human genome (Barrera and Ren, 2006; Bernstein et al., 2007). Such studies, largely employing the method of chromatin immunoprecipitation followed by array or sequence analysis (Hawkins and Ren, 2006), have uncovered predictive chromatin “signatures” for promoters and enhancers in the human genome (Heintzman et al., 2007; Kim et al., 2005). These analyses are part of a broad effort to explore epigenetic patterns in chromatin genome-wide to better understand control of gene expression in different tissues or disease states (ENCODE Project Consortium, 2007; Brena et al., 2006). Increasing evidence points to a role for epigenetic changes in human disease in response to environmental influences (Feinberg, 2007). The dynamic nature of epigenetic regulation – changes in DNA methylation or in histone modification over time, between different tissues, or in various disease states – allows for what has been called ”phenotypic plasticity,” relevant both to the origins and potential treatment of disease (Feinberg, 2007) (see Chapters 5 and 11). That monozygotic twins (who share identical genomes, but are frequently discordant with respect to clinical phenotypes) show epigenetic differences in DNA methylation demonstrates the potential for epigenetic effects to extend or modify information contained in the genome and thus to underlie at least some phenotypic differences (Fraga et al., 2005).
Genetic Control of Gene Expression Levels It has been appreciated for decades that there is high variability in gene expression levels among individuals. A number of groups have begun to examine this variation as a complex quantitative trait, under genetic control (Cheung and Spielman, 2002). Several thousand loci exhibiting such variation have now been studied, and the factor(s) controlling the variation mapped around the genome (Cheung et al., 2005; Morley et al., 2004; Stranger et al., 2007a, b). A significant proportion of such effects map to the genes themselves, a result consistent with local sequence variation influencing the expression of such genes; many such local regulatory variants map to the promoter region of the gene or to the 3 untranslated region (Cheung et al., 2005). Most of the mapped variation could be accounted for by SNP variation, but nearly 20% was due to CNVs (Stranger et al., 2007a, b), underscoring the functional importance of this type of genome variation. It is likely (though unproven in specific instances) that the discovered regulatory variants will correlate with the patterns of epigenetic modifications described earlier (Heintzman et al., 2007; Kim et al., 2005) and with the finding of widespread allelic imbalance around the genome (Serre et al., 2008). Other determinants of gene expression variation map not to the variable locus itself, but to another locus elsewhere in the genome. This implies a network of regulatory interactions among different gene products in the cell, a finding that lends itself to a systems biology approach to begin to better define such networks (see Chapter 6). Some of the differences in gene expression also vary among populations. Where examined, the phenotypes are attributable to common genetic variants that are more common in particular populations than in others (Spielman et al., 2007; Stranger et al., 2007a, b). These findings, therefore, may be relevant to complex genetic diseases whose prevalence differs among populations.
Allelic Imbalance Other than well-known examples of monoallelic gene expression where only one of the two copies of a gene in a diploid cell is expressed (such as genomic imprinting and X chromosome inactivation; Nussbaum et al., 2007), it is commonly assumed that the vast majority of autosomal genes in the genome are expressed
The Genome in Three-Dimensional Space In contrast to the impression one gets when viewing the genome as a linear string of sequence (e.g., Figure 1.1c,d), the genome adopts a highly ordered arrangement within the three-dimensional space of the nucleus (Figure 1.3) (Bolzer et al., 2005; Lanctot et al., 2007). This three-dimensional structure is highly predictive of the transcriptome map (Goetze et al., 2007) and
from both homologs at comparable levels. Recent evidence, however, using expressed SNPs to distinguish transcripts from the two alleles, has demonstrated widespread differential allelic expression for up to 20% of genes in the genome (Serre et al., 2008). This phenomenon, whose epigenetic or genetic basis is unknown, may be related to similar variability observed among X-linked genes in heterozygous females (Carrel and Willard, 2005). Further complexity is suggested by the finding of monoallelic gene expression, random with respect to parental origin, for approximately 5% of genes tested (Gimelbrant et al., 2007). This is an area of ongoing study and further implicates interactions between the genome and epigenetic control mechanisms as a determinant of gene expression in the human genome.
Genes, Genomes and Disease
(a)
■
13
(b)
1
2
6
13
19
3
7
14
20
8
4
9
10
15
5
11
16
21
22
12
17
18
X
Y
Figure 1.3 Organization of the human genome in metaphase and interphase. (a) Individual metaphase chromosomes from a normal male donor, HuRef. Chromosomes are colored differently after hybridization to chromosome-specific DNA. From Levy et al. (2007), with permission. (b) Arrangement of chromosomes in an interphase nucleus. Each chromosome maintains its own territory, with minimal mixing. From Bolzer et al. (2005), with permission.
reflects the megabase-sized domains evident from examining coordinated patterns of gene expression at the chromosome level (Gierman et al., 2007). The biophysical and/or genomic properties that facilitate or specify the orderly and dynamic packaging of each chromosome during each cell cycle without reducing the genome to a tangled mess within the nucleus remain unknown and are the subject of much investigation in genome biology. Overall, then, one should view the gene expression phenotype as the sum of several different (but interrelated) effects, including gene sequence, regulatory sequences and their epigenetic packaging, organization of the genome into domains and isochores, programmed interactions between different parts of the genome, and three-dimensional packaging in the nucleus (Fraser and Bickmore, 2007). All must coordinate in an efficient and hierarchical fashion, and disruption of any one – due either to genetic change or to disease-related processes – would be expected to alter the overall cellular phenotype.
GENES, GENOMES AND DISEASE In the context of genomic and personalized medicine, a key question is to what extent variation in the sequence and/or expression of one’s genome influences the likelihood of disease onset, determines or signals the natural history of disease, and/or provides clues relevant to the management of disease. Variation in one’s constitutional genome can have a number of different direct or indirect effects on gene expression (Table 1.3), thus contributing to the likelihood of disease. It is not, however, just the human genome whose variation is relevant to an individual’s state of health; there are thousands of microorganisms,
both symbiotic and pathogenic, whose genomes are also relevant to human phenotypes, and sequence determination of their genomes is providing new insights and approaches for the diagnosis, study and treatment of infectious disease (see Box 1.2). Contemporary frameworks for considering the impact of variation in the human genome on disease build on decades of success in establishing the role of individual, typically rare, mutations as causal determinants of now more than 2000 simple Mendelian diseases (Online Mendelian Inheritance in Man, 2008). As a result of that success, much of which paralleled the development of technologies in the early stages of the Human Genome Project (Peltonen and McKusick, 2001), attention has now turned to the genes presumed to underlie susceptibility to common complex diseases, which are the subject of many of the subsequent chapters in this book. There are two distinct, but non-exclusive models for thinking about human genetic variation and disease (Altshuler and Clark, 2005; Fearnhead et al., 2005; Florez et al., 2003; Pritchard, 2001). One – the “common allele, common disease” hypothesis – posits that variation common in the population accounts for the relatively higher or lower risk that some individuals (and their families) have for a particular condition. Under this model, the collection of 10–15 million common SNPs, CNVs and other variants in the genome underlies the range of susceptibility that one finds in the general population, modulated by the particular and often variable environmental inputs and factors that are present in that population and that may, in fact, shift or even obscure the relative impact of inherited factors. An alternative model – the rare variant hypothesis – argues that genetic susceptibility to disease is due to the accumulated risk conferred by multiple rare variants in an individual’s genome,
14
CHAPTER 1
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
BOX 1.2 The Genomes Within The human genome is not the only genome relevant to the practice of medicine. Both in states of health and disease, our own genome is vastly outnumbered by the genomes of a host of microorganisms, many living peacefully and continuously on various body surfaces, especially throughout the gastrointestinal tract, others wreaking havoc as adventitious viral, bacterial or fungal pathogens. The genomes of thousands of microorganisms have been determined and are being utilized to provide rapid diagnostic tests in clinical settings, to predict antibiotic or antifungal efficacy, to identify the source of airborne, water or soil contaminants, to monitor hospital or community environments, and to better understand the contribution of microbial ecosystems and various environmental exposures to diverse human phenotypes. The human colon contains more than 400 bacterial species comprising some 1013 to 1014 microorganisms. Each adult’s gut provides a unique environment – the microbiome – whose origins and impact on human disease are just being explored. The microbiotic gene set is significantly different from that of the human genome and thus has the capacity to alter the metabolic profile of different individuals or different populations, with clinically meaningful effects on drug metabolism, toxicity and efficacy (Gill et al., 2006; Li et al., 2008; Palmer et al., 2007). The applications to microbiomes of approaches in genomics (as well as proteomics and transcriptomics)
variants therefore not likely to be captured by study of the common variants identified by studies to date. These two hypotheses suggest different approaches that will likely be informative for delineating the genetic contribution to disease, both for designing research studies and for eventual clinical surveillance. It is worth emphasizing that these two hypotheses are not mutually exclusive and are each likely to be correct in some cases; indeed, there is evidence supporting each for different diseases. Genome-wide Association Studies The “common allele, common disease” hypothesis has been explored with notable success in a number of conditions, utilizing large cohorts of well-phenotyped patients and high throughput methods to genotype up to 500,000 or a million SNPs in the genome (Topol et al., 2007) (see Chapter 8). These genome-wide association studies report the statistical association of one or more variants in a narrow genomic region (which may or may not contain an annotated gene) with the presence or absence of the clinical condition. The reported SNPs define immediately accessible risk factors for that condition, at least in the population(s) under study, and can provide novel insights into the biology of the disease. It should be stressed, however, that in most instances, causality of the reported SNP(s) and the increased risk has not been proved; it may be that the actual causal variant is not the SNP itself, but is a currently undetected variant that lies in linkage disequilibrium with the SNP. In the most favorable cases, the associated SNP may be a non-synonymous variant, leading to a pathological amino acid change in the relevant gene (e.g., Thorleifsson et al., 2007), or
are revolutionizing clinical diagnostics, for example to identify unknown viral infections (Delwart, 2007; Long et al., 2004; Wang et al., 2003) or to diagnose antibiotic-resistance infections such as methicillin-resistant Staphylococcus aureus (MRSA) (Francois et al., 2007). The field of metagenomics explores this heterogeneous ecosystem by comprehensive sequence analysis of the collected genomes from biological specimens (such as stool, urine, sputum, water sources and air), followed by both taxonomic and bioinformatic analysis to deconvolute the many genomes contained in such specimens and to define the different organisms, their genes and genome variants. This approach is particularly informative for characterizing organisms that cannot be cultured in standard microbiology labs. A number of diseases have been associated with large-scale imbalances in the gut microbiome, including Crohn’s disease, ulcerative colitis, antibioticresistant diarrhea and obesity (Frank and Pace, 2008; Ley et al., 2006; Turnbaugh et al., 2006). Undoubtedly, the states of health and disease are determined in part by the balance of genomes both within us and external to us. The full complement of genomic information from both of these sources of genomes will provide insights into defining the states of health and disease and the basis for unsurpassed precision in both the prevention of disease and its treatment.
may be a variant in an RNA splice site, leading to a clinically meaningful change in the production of the gene’s transcript(s) (e.g., Heinzen et al., 2007). But in most instances, the functional impact of the associated SNP is obscure, notwithstanding very clear genetic evidence of a role of genome variation in susceptibility to the particular condition. As emphasized throughout this chapter, SNPs are but one of several types of genome variation that can influence gene expression and/or disease (Table 1.3). CNVs have also been associated with some disorders (reviewed in Estivill and Armengol, 2007), and it will require integrated genomic, genetic and functional studies to elucidate the precise basis for the role(s) of genome variation in different diseases. Medical Resequencing in Search of Rare Variants While genome-wide association studies certainly establish that common alleles do indeed underlie susceptibility to common disease in some instances, they do not allow one to conclude that this is always the case. Indeed, in most (but not all) cases to date, the common SNPs found to be associated with disease only explain a small fraction of the total genetic variation, implicating an as yet undiscovered (and presumably rarer) basis for most genetic variation underlying a given condition (Iles, 2008). In other words, while a positive statistical association is sufficient to conclude that a particular variant does indeed contribute to disease in the population under study, such a finding is insufficient to say anything declarative about the cause of or likelihood of disease in a particular case. This conclusion is, of course, highly relevant to the prospects of genomic and personalized medicine.
From Genome to Personalized Medicine
An alternative or complementary approach to genomewide genotyping of common variants is to resequence specific genes in a cohort of affected individuals, in an effort to uncover rare variants responsible for (or at least statistically associated with) the disease in question. To date, most efforts have focused on one or several genes that were believed to be strong candidates for the phenotype under study; the notable exception thus far is the whole-genome resequencing of the HuRef genome and correlation of novel variants detected in that study with his family and personal medical histories (Levy et al., 2007). Rare variants, including non-synonymous variants, in relevant candidate genes have been detected at a statistically significant higher frequency in the genomes of patients with colorectal adenomas (Fearnhead et al., 2004), with low plasma high-density lipoprotein cholesterol (HDL-C) (Cohen et al., 2004), and with triglyceride levels in the lowest quartile (Romeo et al., 2007). These successes point to a strategy of resequencing relevant genes in individuals at the extremes of the population distribution for measurable traits (Topol and Frazer, 2007), in which the only limiting parameters are the cost of sequencing and the quality of the phenotypic or quantitative data. In a proof-of-principle for this case-control approach, a recent study sequenced coding exons and splice junctions of 58 genes in nearly 400 obese and lean individuals, at the 95th or 10th percentile of body mass index (Ahituv et al., 2007). Of the 1000 variants detected, most were rare variants, including over 270 non-synonymous mutations, many of which were found only in the obese cohort and thus become strong functional candidates for a role in obesity. Searching for Somatic Mutations While genome-wide association studies are restricted to inherited variation, medical resequencing studies can target either inherited or somatic variants. In cancer especially, it is of interest to use medical resequencing to search for somatic mutations in tumor tissue in order to identify genes potentially relevant to cancer progression (Greenman et al., 2007; Sjoblom et al., 2006). Two important points emerge from such studies. First, the genes implicated by virtue of discovering rare somatic variants in multiple cases of a particular cancer tend to be different from those identified in previous genetic studies as inherited risk factors. This provides novel insights into the biology of human cancer and suggests candidates for further exploring mechanisms of tumorigenesis or metastasis or for developing therapeutic approaches. Second, however, the large number of mutations uncovered by this approach introduces the need for caution, as many will be “bystander” or “passenger” mutations around the genome that reflect only the increased mutation rate associated with cancer, not genes involved directly in cancer.
FROM GENOME TO PERSONALIZED MEDICINE Of all the promises of the current scientific and social revolution stemming from advances in our understanding of the human
■
15
genome and its variation, genomic and personalized medicine may be the most eagerly awaited. The prospect of examining an individual’s entire genome (or at least a significant fraction of it) in order to make individualized risk predictions and treatment decisions is an attractive, albeit challenging, one (Bentley, 2004; Willard et al., 2005). Having access to the reference human genome sequence has been transformational for the fields of human genetics and genome biology, but by itself is an insufficient prerequisite for genomic medicine. As discussed in subsequent chapters, equally important are the various complementary technologies to reliably capture information on individual genomes, their epigenetic modification, and their derivatives the transcriptome, proteome and metabolome for health and disease status (Table 1.4). Each of these technologies provides information that, in combination with clinical data and evaluation of environmental triggers, will in time contribute to assessment of individual risks and guide clinical management and decision-making (Figure 1.4). Critical enablers of this new approach have been innovations in laboratory technology (to address biologically and medically relevant questions on a scale and with a throughput hardly imaginable just a few decades ago), paired with equally transforming developments in informatics and information systems to handle the onslaught of genomic data (West et al., 2006) (see Chapter 17). Personal Genomics At the heart of the genomic approach to personalized medicine will be information from individual genomes, a fast-moving area of technological development that is spawning a social and information revolution among consumers. Dramatic improvements in sequencing technology (Bentley, 2006) have reduced the cost and time of resequencing projects to a level that invites conjecture about the long awaited “$1000 genome” (Dalton, 2006; Service, 2006; Wolinsky, 2007). The much-ballyhooed release of the genome sequences of Craig Venter (“HuRef ”) (Levy et al., 2007) and James Watson has stirred up concerns about “celebrity genomics” (Check, 2007); however, while the Venter and Watson sequences may have been the first, they will not be alone for long, as numerous additional genomes are already in various sequencing pipelines (Table 1.5). What remains unsettled for now is what degree of genome surveillance will be most useful, either for research or for clinical practice, a topic that is returned to frequently in subsequent chapters in this volume. While whole-genome sequencing is increasingly possible, it is unlikely to provide more information about established disease associations than would highdensity, genome-wide SNP genotyping. Targeted resequencing of, for example, exons and known regulatory regions would allow detection of rare variants in portions of the genome most relevant to disease at a fraction of the cost of whole-genome sequencing (Hodges et al., 2007). The availability of associated clinical data is variable among the studies announced to date, but there exists, at least among some participants, a strong sense of “health-information altruism” to contribute to the much needed large-scale correlation
16
CHAPTER 1
TABLE 1.4
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
Personalized signatures of health or disease
Dataset
“omic” approach
Technology platform or approach
Chapter number(s)
Human genome sequence
Genomics
Single nucleotide polymorphisms (SNPs); Other genome variants
Chapters 7–9
Gene expression profiles
Transcriptomics
Microarrays of 20,000 gene transcripts
Chapters 12, 13
Protein abundance
Proteomics
Protein arrays of specific protein products
Chapter 14
Metabolites
Metabolomics
Analysis of hundreds to thousands of metabolites
Chapter 15
Chromatin
Epigenomics
Array- or sequence-based assessment of chromatin modification
Chapters 5, 11
Gene networks, interactions
Systems biology
Large-scale interactions among genes or proteins
Chapter 6
Microbiome
Metagenomics
Analysis of viral, fungal and bacterial populations in human specimens
Chapter 48
Genomic and clinical data
Informatics
Integrated databases of “omic” data and electronic health records
Chapters 17–21
Individual A
Individual B
• Predisposition to disease
Genome sequence Somatic mutation Epigenetic changes Personal environment
Transcriptome
Proteome
• • • •
Diagnosis Natural history Treatment options Prognosis
Phenotype
Figure 1.4
The promise of personal genomics in the era of genomic and personalized medicine.
of genotype and phenotype (Kohane and Altman, 2005). Notwithstanding individual’s willingness to make genome sequence data (much less medical information) available moreor-less publicly, substantial concerns have been raised about privacy, since a surprisingly (to some) small number of SNPs or other genome variants are sufficient to allow identification of individuals (Lin et al., 2004; McGuire and Gibbs, 2006).
A Consumer Revolution While the genome revolution has without doubt been driven by technological improvements and by an explosion in the availability of genome data, the push for incorporation of genome information into clinical practice may come as much or more from consumers as from professionals. A half dozen or more companies are already offering genome-wide SNP profiles to
From Genome to Personalized Medicine
TABLE 1.5
■
17
Individual sequenced human genomes (as of March 2008)
Individuals
Nature of genome
Method
Year completed/ anticipated
References
“RP11”a
Partial haploid
Sanger sequencing
2003
International Human Genome Sequencing Consortium, (2001); Osoegawa et al. (2001)
J. Craig Venter
Complete diploid
Sanger sequencing
2007
Levy et al. (2007)
James D. Watson
Complete diploid
454 pyrosequencing
2007
Wheeler et al. (2008)
YanHuang researcher
Complete diploid
Illumina sequencing by synthesis
2007
Qiu and Hayden (2008); Xin (2007)
YanHuang paying customer
Complete diploid
Illumina sequencing by synthesis
2008
Qiu and Hayden (2008)
Yoruba HapMap sample
Complete diploid
Illumina sequencing by synthesis
2008
Knome customers
Complete diploid
Various next-generation technologies
2008 and beyond
Personal Genome Project; 10 individuals
“Exomes” (coding exons only)
Polony sequencing
2008
Venter institute; 10–50 genomes
Complete diploid
Various next-generation technologies
2008
ClinSeq 1000
Exons of 200–400 cardiovascular genes
To be determined
2009 or beyond
http://www.genome. gov/25521304
YanHuang 99
Complete diploid
Various next-generation technologies
2010
Xin (2007)
GATC 100
Complete diploid
Various next-generation technologies
2010
http://www.gatc-biotech.com/ en/discover_gatc/research/ Human_Genome_Sequencing_ Service.php
1000 Genomes Project
Variousb
Various next-generation technologies
2010
www.1000genomes.org; http:// www.genome.gov/26524516; Hayden (2008); Kaiser (2008)
Personal Genome – Project; 500–100,000
Exomes
Polony sequencing
?
http://www. personalgeneomes.org
http://www.personalgenomes. org; Church (2006)
Source: Courtesy of Misha Angrist, Duke Institute for Genome Sciences & Policy a An anonymous male from Buffalo, New York, whose genome was overrepresented in the publicly funded Human Genome Project b 6 people fully sequenced at 20x coverage; 180 at 2x; and exons of 1000 genes in 1000 people
the public, some with associated risk estimates for relevant clinical conditions (see Chapter 21). At least a few companies will also sequence individual genomes for a cost that, while high currently, is not out of reach for some individuals. Some in the medical community have called such genome scans premature and ill advised (Hunter et al., 2008), and similar concerns have been raised for more traditional and less comprehensive commercial tests targeted to specific genes or to specific conditions (Janssens et al., 2008). Hesitation on the part of health
professionals aside, however, it seems clear that at least some consumers, whatever their motivation, will choose to test their genomes, regardless of whether any of the findings are (yet) considered actionable by their physicians. This creates both the need and an opportunity for health professionals to stay abreast of the rapidly developing scientific foundations of genomic medicine, to place in context the dynamic and still poorly understood interplays of the genome with the environment and to communicate effectively the complexities of overall genetic risk to their patients.
18
CHAPTER 1
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
CONCLUSION In this chapter, we have described the organization, variation and expression of the human genome as a foundation for the many chapters to follow on principles of human genomics, on new technologies in genomics and informatics, on approaches in translational genomics and, finally, on the first applications of genomic and personalized medicine to specific diseases. What is now referred to as the “Genome Revolution” has its roots in a technological revolution over the past two decades that has transformed biology and human genetics as they transition from discovery-based analog sciences to digital systems based on comprehensive information from the genome and its derivatives. Advances in genome technology and the resulting
explosion in knowledge and information stemming from the Human Genome Project are now playing a major and increasingly transformational role in our understanding of human health and in the delivery of health care. While the challenges both for the health care profession and for society are significant, the development of comprehensive, cost-efficient and high-throughput technologies, combined with powerful tools in biomedical and clinical informatics for analyzing and storing vast amounts of data, will enable the growing field of genomic and personalized medicine. As the chapters to follow illustrate, the genome sciences – encompassing large-scale analyses of individual genomes, epigenomes, transcriptomes, proteomes and metabolomes – will drive profound changes in clinical medicine.
REFERENCES Ahituv, N., Kavaslar, N., Schackwitz, W., Ustaszewska, A., Martin, J., Hebert, S., Doelle, H., Ersoy, B., Kryukov, G., Schmidt, S. et al. (2007). Medical sequencing at the extremes of human body mass. Am J Hum Genet 80, 779–791. Altshuler, D. and Clark, A. (2005). Harvesting medical information from the human family tree. Science 307, 1052–1053. Bailey, J. and Eichler, E. (2006). Primate segmental duplications: Crucibles of evolution, diversity and disease. Nat Rev Genet 7, 552–564. Barrera, L. and Ren, B. (2006). The transcriptional regulatory code of eukaryotic cells – insights from genome-wide analysis of chromatin organization and transcription factor binding. Curr Opin Cell Biol 18, 291–298. Beckmann, J., Estivill, X. and Antonarakis, S. (2007). Copy number variants and genetic traits: Closer to the resolution of phenotypic variability. Nat Rev Genet 8, 639–646. Bejerano, G., Pheasant, M., Makunini, I., Stephen, S., Kent,W., Mattick, J. and Haussler, D. (2004). Ultraconserved elements in the human genome. Science 304, 1321–1325. Bentley, D.R. (2004). Genomes for medicine. Nature 429, 440–445. Bentley, D.R. (2006). Whole-genome resequencing. Curr Opin Genet Deve 16, 545–552. Bentwich, I. (2005). Identification of hundreds of conserved and nonconserved human microRNAs. Nat Genet 37, 766–770. Bernstein, B., Meissner, A. and Lander, E. (2007). The mammalian epigenome. Cell 128, 669–681. Bolnick, D., Fullwiley, D., Duster, T., Cooper, R., Fujimura, J., Kahn, J., Kaufman, J., Marks, J., Morning, A., Nelson, A. et al. (2007). The science and business of genetic ancestry training. Science 318, 399–400. Bolzer, A., Kreth, G., Solovei, I., Koehler, D., Saracoglu, K., Fauth, C., Muller, S., Eils, R., Cremer, C., Speicher, M. et al. (2005). Threedimensional maps of all chromosomes in human male fibroblast nuclei and prometaphase rosettes. PLoS Biol 3, e157. Brena, R., Huang, T. and Plass, C. (2006). Toward a human epigenome. Nat Genet 38, 1359–1360. Buckland, P. (2003). Polymorphically duplicated genes: Their relevance to phenotypic variation in humans. Ann Med 35, 308–315. Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M.C., van Asperen, R., Boon, K., Voute, P.A. et al. (2001). The human transcriptome map: Clustering of highly expressed genes in chromosomal domains. Science 291, 1289–1292.
Carrel, L. and Willard, H.F. (2005). X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature 434, 400–404. Chang, T. and Mendell, J. (2007). microRNAs in vertebrate physiology and human disease. Annu Rev Genomics Hum Genet 8, 215–239. Check, E. (2007). Celebrity genomes alarm researchers. Nature 447, 358–359. Cheung, V. and Spielman, R. (2002). The genetics of variation in gene expression. Nat Genet (Suppl 32), 522–525. Cheung, V., Spielman, R., Ewens, K., Weber, T., Morley, M. and Burdick, J. (2005). Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369. Church, G. (2006). Genomes for all. Sci Am 294, 46–54. Clamp, M., Fry, B., Kamal, M., Xie, X., Cuff, J., Lin, M., Kellis, M., Lindblad-Toh, K. and Lander, E. (2007). Distinguishing proteincoding and noncoding genes in the human genome. Proc Natl Acad Sci USA 104, 19428–19433. Cohen, J., Kiss, R., Pertsemlidis, A., Marcel, Y., McPherson, R. and Hobbs, H. (2004). Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869–871. Conrad, B. and Antonarakis, S. (2007). Gene duplication: A drive for phenotypic diversity and cause of human disease. Annu Rev Genomics Hum Genet 8, 17–35. Conrad, D., Andrews, T., Carter, N., Hurles, M. and Pritchard, J. (2006). A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38, 75–81. Dalton, R. (2006). Sequencers step up to the speed challenge. Nature 443, 258–259. Deininger, P., Moran, J., Batzer, M. and Kazazian, H., Jr. (2003). Mobile elements and mammalian genome evolution. Curr Opin Genet Dev 13, 651–658. Delwart, E.L. (2007).Viral metagenomics. Rev Med Virol 17, 115–131. Deo, R., Patterson, N.,Tandon, A., McDonald, G., Haiman, C., Ardlie, K., Henderson, B., Henderson, S. and Reich, D. (2007). A high-density admixture scan in 1,670 African Americans with hypertension. PLoS Genet 3, e196. De Vries, B., Pfundt, R., Leisink, M., Koolen, D., Vissers, L., Janssen, I., van Reijmersdal, S., Nillesen, W., Huys, E., de Leeuw, N. et al. (2005). Diagnostic genome profiling in mental retardation. Am J Hum Genet 77, 606–616.
References
ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816. Estivill, X. and Armengol, L. (2007). Copy number variants and common disorders: Filling the gaps and exploring complexity in genome-wide association studies. PLoS Genet 3, e190. Eyre-Walker, A. and Hurst, L. (2001). The evolution of isochores. Nat Rev Genet 2, 549–555. Fearnhead, N.S.,Wilding, J.,Winney, B.,Tonks, S., Bartlett, S., Bicknell, D., Tomlinson, I., Montensen, N.J. and Bodmer, W.F. (2004). Multiple rare variants in different genes account for multifactorial inherited susceptibility to colorectal adenomas. Proc Natl Acad Sci USA 101, 15992–15997. Fearnhead, N.S., Winney, B. and Bodmer, W.F. (2005). Rare variant hypothesis for multifactorial inheritance: Susceptibility to colorectal adenomas as a model. Cell Cycle 4, 521–525. Feinberg, A. (2007). Phenotype plasticity and the epigenetics of human disease. Nature 447, 433–440. Felsenfeld, G. and Groudine, M. (2003). Controlling the double helix. Nature 421, 448–453. Feuk, L., Marshall, C., Wintle, R. and Scherer, S. (2006). Structural variants: Changing the landscape of chromosomes and design of disease studies. Hum Mol Genet 15(Review Issue 1), 15–66. Filipowicz, W., Bhattacharyya, S. and Sonenberg, N. (2008). Mechanisms of post-transcriptional regulation by microRNAs: Are the answers in sight?. Nat Rev Genet 9, 102–114. Florez, J., Hirschhorn, J. and Altshuler, D. (2003). The inherited basis of diabetes mellitus: Implications for the genetic analysis of complex traits. Annu Rev Genomics Hum Genet 4, 257–291. Fraga, M., Ballestar, E., Paz, M., Ropero, S., Setien, F., Ballestar, M., Heine-Suner, D., Cigudosa, J., Urioste, M., Benitez, J. et al. (2005). Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci USA 102, 10604–10609. Francois, P., Scherl, A, Hochstrasser, D., and Schrenzel, J. (2007). Proteomic approach to investigate MRSA. Methods Mol Biol 391, 179–199. Frank, D.N. and Pace, N.R. (2008). Gastrointestinal microbiology enters the metagenomics era. Curr Opin Gastroenterol 24, 4–10. Fraser, P. and Bickmore, W. (2007). Nuclear organization of the genome and the potential for gene regulation. Nature 447, 413–417. Freedman, M., Haiman, C., Patterson, N., McDonald, G., Tandon, A., Waliszewska, A., Penney, K., Steen, R., Ardlie, K., John, E. et al. (2006). Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad Sci USA 103, 14068–14073. Friedman, J., Baross, A., Delaney, A., Ally, A., Arbour, L., Asano, J., Bailey, D., Barber, S., Birch, P., Brown-John, M. et al. (2006). Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 79, 500–513. Garrod, A. (1902). The incidence of alkaptonuria: A study in chemical individuality. Mol Med 2, 274–282. Reprint (1996) Gierman, H., Indemans, M., Koster, J., Goetze, S., Seppen, J., Geerts, D., van Driel, R. and Versteeg, R. (2007). Domain-wide regulation of gene expression in the human genome. Genome Res 17, 1286–1295. Gill, S.R., Pop, M., DeBoy, R., Eckburg, P.B.,Turnbaugh, P., Samuel, B.S., Gordon, J.I., Relman, D.A., Fraser-Liggett, C.M. and Nelson, K.E. (2006). Metagenomic analysis of the human distal gut microbiome. Science 312, 1355–1359.
■
19
Gimelbrant, A., Hutchinson, J., Thompson, B. and Chess, A. (2007). Widespread monoallelic expression on human autosomes. Science 318, 1136–1140. Goetze, S., Mateos-Langerak, J., Gierman, H., de Leeuw,W., Giromus, O., Indemans, M., Koster, J., Ondrej, V., Versteeg, R. and van Driel, R. (2007). The three-dimensional structure of human interphase chromosomes is related to the transcriptome map. Mol Cell Biol 27, 4475–4487. Greenman, C., Stephens, P., Smith, R., Dalgliesh, G., Hunter, C., Bignell, G., Davies, H., Teague, J., Stevens, C., Edkins, S. et al. (2007). Patterns of somatic mutation in human cancer genomes. Nature 446, 153–158. Griffiths-Jones, S. (2007). Annotating noncoding RNA genes. Annu Rev Genomics Hum Genet 8, 279–298. Harris, H. (1980). The Principles of Human Biochemical Genetics, 3rd edition. Elsevier,Amsterdam. Harris, H. and Hopkinson, D. (1972). Average heterozygosity per locus in man: An estimate based on the incidence of enzyme polymorphisms. Ann Hum Genet 36, 9–20. Hawkins, D. and Ren, B. (2006). Genome-wide location analysis: Insights on transcriptional regulation. Hum Mol Genet 15(Review Issue 1), 1–7. Hayden, E. (2008). International genome project launched. Nature 451, 378–379. Heintzman, N., Stuart, R., Hon, G., Fu, Y., Ching, C., Hawkins, R., Barrera, L., Calcar, S., Qu, C., Ching, K. et al. (2007). Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 39, 311–318. Heinzen, E., Yoon, W., Tate, S., Sen, A., Wood, N., Sisodiya, S. and Goldstein, D. (2007). Nova2 interacts with a cis-acting polymorphism to influence the proportions of drug-responsive slice variants of SCN1A. Am J Hum Genet 80, 876–883. Hinds, D., Stuve, L., Nilsen, G., Halperin, E., Eskin, E., Ballinger, D., Frazer, K. and Cox, D. (2005). Whole-genome patterns of common DNA variation in three human populations. Science 307, 1072–1079. Hodges, E., Xuan, Z., Balija, V., Kramer, M., Molla, M., Smith, S., Middle, C., Rosesch, M., Albert, T., Hannon, G. et al. (2007). Genome-wide in situ exon capture for selective resequencing. Nat Genet 39, 1522–1527. Hunter, D., Khoury, M. and Drazen, J. (2008). Letting the genome out of the bottle – will we get our wish?. New Engl J Med 358, 105–107. Hurst, L., Pal, C. and Lercher, M. (2004). The evolutionary dynamics of eukaryotic gene order. Nat Rev Genet 5, 299–310. Iles, M. (2008). What can genome-wide association studies tell us about the genetics of common disease?. PLoS Genet 4, e33. International HapMap Consortium (2003). The International HapMap Project. Nature 426, 789–796. International HapMap Consortium (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–862. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. Janssens, A., Gwinn, M., Bradley, L., Oostra, B., van Duijn, C. and Khoury, M. (2008). A critical appraisal of the scientific basis of commercial genomic profiles used to assess health risks and personalize health interventions. Am J Hum Genet 82, 593–599.
20
CHAPTER 1
■
Organization,Variation and Expression of the Human Genome as a Foundation of Genomic and Personalized Medicine
Kaiser, J. (2008). A plan to capture human diversity in 1000 genomes. Science 319, 395. Kazazian, H.H. and Moran, J.V. (1998). The impact of L1 retrotransposons on the human genome. Nat Genet 19, 19–24. Kim,T., Barrera, L., Zheng, M., Qu, C., Singer, M., Richmond,T.,Wu,Y., Green, R. and Ren, B. (2005). A high-resolution map of active promoters in the human genome. Nature 436, 876–880. Kittles, R. and Weiss, K. (2003). Race, ancestry, and genes: Implications for defining disease risk. Annu Rev Genomics Hum Genet 4, 33–67. Kohane, I. and Altman, R. (2005). Health-information altruists – a potentially critical resource. New Engl J Med 353, 2074–2077. Korbel, J., Urban, A., Affourtit, J., Godwin, B., Grubert, F., Simons, J., Kim, P., Palejev, D., Carriero, N., Du, L. et al. (2007). Pairedend mapping reveals extensive structural variation in the human genome. Science 318, 420–426. Lanctot, C., Cheutin, T., Cremer, M., Cavalli, G. and Cremer, T. (2007). Dynamic genomic architecture in the nuclear space: Regulation of gene expression in three dimensions. Nat Rev Genet 8, 104–115. Lee, C., Iafrate, A. and Brothman, A. (2007). Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat Genet Suppl. 39, 48–54. Levy, S., Sutton, G., Ng, P., Feuk, L., Halpern, A., Walenz, B., Axelrod, N., Huang, J., Kirkness, E., Denisov, G. et al. (2007). The diploid genome sequence of an individual human. PLoS Biol 5(10), e254. Lewontin, R. (1967). An estimate of average heterozygosity in man. Am J Hum Genet 19, 681–685. Ley, R.E., Turnbaugh, P., Klein, S. and Gordon, J.I. (2006). Human gut microbes associated with obesity. Nature 444, 1022–1023. Li, M., Wang, B., Zhang, M., Rantalainen, M., Wang, S., Zhou, H., Zhang, Y., Shen, J., Pang, X., Zhang, M. et al. (2008). Symbiotic gut microbes modulate human metabolic phenotypes. Proc Natl Acad Sci USA 105, 2117–2122. Lim, L., Glasner, M.,Yekta, S., Burge, C. and Bartel, D. (2003).Vertebrate microRNA genes. Science 299, 1540. Lim, L., Lau, N., Garrett-Engele, P., Grimson, A., Schelter, J., Castle, J., Bartel, D., Linsley, P. and Johnson, J. (2005). Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433, 769–773. Lin, Z., Owen, A. and Altman, R. (2004). Genomic research and human subject privacy. Science 305, 183. Linardopoulou, E.,Williams, E., Fan, Y., Friedman, C.,Young, J. and Trask, B. (2005). Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature 437, 94–100. Long,W.H., Xiao, H., Gu, X., Zhang, Q.,Yang, H., Zhao, G. and Liu, J.H. (2004). A universal microarray for detection of SARS coronavirus. J Virol Meth 121, 57–63. Margulies, E., Blanchette, M., Haussler, D. and Green, E.NISC Comparative Sequencing Program (2003). Identification and characterization of multi-species conserved sequences. Genome Res 13, 2507–2518. Mattick, J. and Makunini, I. (2006). Non-coding RNA. Hum Mol Genet 15(Review Issue 1), 17–29. McEvoy, B., Beleza, S. and Shriver, M. (2006). The genetic architecture of normal variation in human pigmentation: An evolutionary perspective and model. Hum Mol Genet Suppl. 15, 176–181. McGuire, A. and Gibbs, R. (2006). Genetics. No longer de-identified. Science 312, 370–371. Morley, M., Molony, C., Weber, T., Devlin, J., Ewins, K., Spielman, R. and Cheung, V. (2004). Genetic analysis of genome-wide variation in human gene expression. Nature 230, 743–747.
Nalls, M., Wilson, J., Patterson, N., Tandon, A., Zmuda, J., Huntsman, S., Garcia, M., Hu, D., Li, R., Beamer, B. et al. (2008). Admixture mapping of white cell count: Genetic locus responsible for lower white blood cell count in the health ABC and Jackson Heart Studies. Am J Hum Genet 82, 81–87. Nussbaum, R., McInnes, R. and Willard, H.F. (2007). Genetics in Medicine, 7th edition.WB Saunders Co, Philadelphia. Ogawa,Y. and Lee, J. (2002). Antisense regulation in X inactiviation and autosomal imprinting. Cytogenet genome res 99, 59–65. Online Mendelian Inheritance in Man (2008). http://www.ncbi.nlm. nih.gov/sites/entrez?dbOMIM&itooltoolbar Osoegawa, K., Mammoser, A.,Wu, C., Frengen, E., Zeng, C., Catanese, J. and de Jong, P. (2001). A bacterial artificial chromosome library for sequencing the complete human genome. Genome Res 11, 483–496. Palmer, C., Bik, E.M., DiGiulio, D., Rleman, D.A. and Brown, P.O. (2007). Development of the human infant intestinal microbiota. PLoS Biol 5, e177. Paschou, P., Ziv, E., Burchard, E., Choudhry, S., Rodriguez-Cintron, W., Mahoney, M. and Drineas, P. (2007). PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3, e160. Pauling, L., Itano, H., Singer, S.J. and Wells, I.G. (1949). Sickle cell anemia a molecular disease. Science 110(2865), 543–548. Peltonen, L. and McKusick, V. (2001). Genomics and medicine. Dissecting human disease in the postgenomic era. Science 291, 1224–1229. Pennacchio, L., Ahituv, N., Moses, A., Prabhakar, S., Nobrega, M., Shoukry, M., Minovitsky, S., Dubchak, I., Holt, A., Lewis, K. et al. (2006). In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499–502. Pheasant, M. and Mattick, J. (2007). Raising the estimate of functional human sequences. Genome Res 17, 1245–1253. Price, A., Patterson, N.,Yu, F., Cox, D., Waliszewska, A., McDonald, G., Tandon, A., Schirmer, C., Neubauer, J., Bedoya, et al. (2007). A genomewide admixture map for Latino populations. Am J Hum Genet 80, 1024–1036. Pritchard, J. (2001). Are rare variants responsible for susceptibility to complex diseases?. Amer J Hum Genet 69, 124–137. Qiu, J. and Hayden, E. (2008). Genomics sizes up. Nature 451, 234. Redon, R., Ishikawa, S., Fitch, K., Feuk, L., Perry, G., Andrews, T., Fiegler, H., Shapero, M., Carson, A., Chen, W. et al. (2006). Global variation in copy number in the human genome. Nature 444, 444–454. Riethman, H., Ambrosini, A. and Paul, S. (2005). Human subtelomere structure and variation. Chromosome Res 13, 505–515. Romeo, S., Pennacchio, L., Fu, Y., Boerwinkle, E., Tybjaerg-Hansen, A., Hobbs, H. and Cohen, J. (2007). Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet 39, 513–516. Rudd, M.K. and Willard, H.F. (2004). Analysis of the centromeric regions of the human genome assembly. Trends Genet 20, 529–533. Scherer, S., Lee, C., Birney, E., Altshuler, D., Eichler, E., Carter, N., Hurles, M. and Feuk, L. (2007). Challenges and standards in integrating surveys of structural variation. Nat Genet Suppl. 39, 7–15. Schueler, M. and Sullivan, B. (2006). Structural and functional dynamics of human centromeric chromatin. Annu Rev Genomics Hum Genet 7, 301–313. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. (2004). Large-scale copy number polymorphism in the human genome. Science 305, 525–528.
Recommended Resources
Serre, D., Gurd, S., Ge, B., Sladek, R., Sinnett, D., Harmsen, E., Bibikova, M., Chudin, E., Barker, D., Dickinson, T. et al. (2008). Differential allelic expression in the human genome: A robust approach to identify genetic and epigenetic cis-acting mechanisms regulating gene expression. PLoS Genet 4, e1000006. Service, R. (2006).The race for the $1000 genome. Science 311, 1544–1546. Sharp, A., Cheng, Z. and Eichler, E. (2006). Structural variation of the human genome. Annu Rev Genomics Hum Genet 7, 407–442. Sharp, A., Locke, D., McGrath, S., Cheng, Z., Bailey, J., Vallente, R., Pertz, L., Clark, R., Schwartz, S., Segraves, R. et al. (2005). Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 77, 78–88. Shianna, K. and Willard, H.F. (2006). Human genomics: In search of normalcy. Nature 444, 428–429. Shriver, M. and Kittles, R. (2004). Genetic ancestry and the search for personalized genetic histories. Nat Rev Genet 5, 611–618. Sjoblom, T., Jones, S., Wood, L., Parsons, D., Lin, J., Barber, T., Mandelker, D., Leary, R., Ptak, J., Silliman, N. et al. (2006). The consensus coding sequences of human breast and colorectal cancers. Science 314, 268–274. Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P., Cordum, H., Hillier, L., Brown, L., Repping, S., Pyntikova, T., Ali, J., Bieri, T., et al. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature. 423, 825–837. Smith, M., Patterson, N., Lautenberger, J., Truelove, A., McDonald, G., Waliszewska, A., Kessing, B., Malasky, M., Scafe, C., Le, E. et al. (2004). A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 74, 1001–1013. Spielman, R., Bastone, L., Burdick, J., Morley, M., Ewens,W. and Cheung, V. (2007). Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet 39, 226–231. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdottir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Ingason, A., Gudnadottir, V. et al. (2005). A common inversion under selection in Europeans. Nat Genet 37, 129–137. Stranger, B., Forrest, M., Dunning, M., Ingle, C., Beazley, C., Thorne, N., Redon, R., Bird, C., de Grassi, A.., Lee, C. et al. (2007a). Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853. Stranger, B., Nica, A., Forrest, M., Dimas, A., Bird, C., Beazley, C., Ingle, C., Dunning, M., Flicek, P., Koller, D. et al. (2007b). Population genomics of human gene expression. Nature Genet 39, 1217–1224. Thorleifsson, G., Magnusson, K., Sulem, P., Bragi Walters, G., Gudbjartsson, D., Stefansson, H., Jonsson, T., Jonasdottir, A., Jonasdottir, A. and Stefansdottir, G. (2007). Common sequence variants in the LOXL1 gene confer susceptibility to exfoliation glaucoma. Science 317, 1397–1400.
21
Tian, C., Plenge, R., Ransom, M., Lee, A., Villoslada, P., Selmi, C., Klareskog, L., Pulver, A., Qi, L., Gregersen, P. et al. (2008). Analysis and application of European genetic substructure using 300K SNP information. PLoS Genet 4, e4. Topol, E. and Frazer, K. (2007). The resequencing imperative. Nat Genet 39, 439–440. Topol, E., Murray, S. and Frazer, K. (2007). The genomics gold rush. JAMA 298, 218–221. Trask, B. (2002). Human cytogenetics: 46 chromosomes, 46 years and counting. Nat Rev Genet 3, 769–778. Turnbaugh, P.J., Ley, R.E., Mahowald, M.A., Magrini, V., Mardis, E. and Gordon, J.I. (2006). An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444, 1027–1031. Tuzun, E., Sharp, AJ., Bailey, JA., Kaul, R., Morrison, VA., Pertz, LM., Haugen, E., Hayden, H., Albertson, D., Pinkel, D. et al. (2005). Fine-scale structural variation of the human genome. Nature Genet 37(7), 727–732. van Rooij, E., Liu, N. and Olson, E.N. (2008). MicroRNAs flex their muscles. Trends Genet, 24, 159–166. Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., Yandell, M., Evans, C., Holt, R. et al. (2001). The sequence of the human genome. Science 291, 1304–1351. Wang, D., Urisman, A., Liu,Y.T., Springer, M., Ksiazek, T.G., Erdman, D., Mardis, E., Hickenbotham, M., Magrini, V., Eldred, J. et al. (2003). Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 1, e2. Weiss, L.A., Shen, Y., Korn, J., Arking, D., Miller, D., Fossdal, R., Saemundsen, E., Stefansson, H., Ferreira, M., Green, T. et al. (2008). Association between microdeletion and microduplication at 16p11.2 and autism. New Engl J Med 358, 667–675. West, M., Ginsburg, G., Huang, A. and Nevins, J. (2006). Embracing the complexity of genomic data for personalized medicine. Genome Res 16, 559–566. Wheeler, D.A., Srinivasan M., Egholm, M., Shen, Y.., Chen, L., McGuire, A., He, W., Chen, Y.-J., Makhijani, V., Roth, G.T. et al. (2008). The complete genome of an individual by massively parallel DNA Sequencing. Nature 452, 872–876. Willard, H.F., Angrist, M. and Ginsburg, G. (2005). Genomic medicine: Genetic variation and its impact on the future of health care. Philos Trans Roy Soc Lond B Biol Sci 360, 1543–1550. Wolinsky, H. (2007). The thousand-dollar genome: Genetic brinksmanship or personalized medicine?. EMBO Rep 8, 900–903. Wong, K., deLeeuw, R., Dosanjh, N., Kim, L., Cheng, Z., Horsman, D., MacAulay, C., Ng, R., Brown, C., Eichler, E. et al. (2007). A comprehensive analysis of common copy-number variataions in the human genome. Am J Hum Genet 80, 91–104. Xin, H. (2007). The million-dollar genome. Science 318, 550.
RECOMMENDED RESOURCES Brown, T.A. (2006). Genomes 3, 3rd edition. Garland Science Publ., New York. Nussbaum, R., McInnes, R. and Willard, H.F. (2007). Genetics in Medicine, 7th edition. WB Saunders Co., Philadelphia. Strachan, T. and Read, A.P. (2004). Human Molecular Genetics 3, 3rd edition. Garland Science Publ., New York. NCBI Human Genome Resources http://www.ncbi.nlm.nih.gov/Genome/guide/human/ Online Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/sites/entrez?dbOMIM& itooltoolbar
■
UCSC Genome Bioinformatics http://genome.ucsc.edu/ Ensembl Human Genome Browser http://www.ensembl.org/Homo_sapiens/index.html Genome Sequence of J. Craig Venter (“HuRef ”) http://www.jcvi.org/cms/research/projects/huref/overview/ Genome Sequence of James D. Watson http://jimwatsonsequence.cshl.edu/about.html
CHAPTER
2 Concepts of Population Genomics Mike E. Weale and David B. Goldstein
INTRODUCTION Over the past decade, population genetics – the study of processes affecting the evolution, distribution and correlation of genetic variants within populations – has truly “gone genomic” (Goldstein and Weale 2001; Jorde et al., 2001). The Human Genome Project provided a blueprint for the “average” human genome. Subsequent projects, and in particular the arrival of affordable high-density genome-wide genotyping, has expanded this picture to provide detailed information on the genomes of many thousands of individuals from different parts of the world, focusing on the genetic differences that distinguish us all. Population genetics has thus been transformed into population genomics. This revolution has resulted in fundamental advances in our understanding of the evolution of the human genome and the history of human populations. But this information is not just of academic interest: it has had a direct bearing on how medical genomic research is conducted today. Indeed, this has been the primary motivation for collecting much of this information in the first place. As medical genomics has moved from the description of simple Mendelian traits to the investigation of complex multifactorial traits, so population genomics has assumed pivotal importance. We need population genomic theory in order to know how to look for variants that affect these traits, how to understand their genealogical history and their geographic distribution, and how to use this information effectively in diagnosis Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 22
and treatment (Box 2.1). Understanding Mendelian traits generally requires only an understanding of variation within families; understanding complex traits requires an understanding of the population within which the complex trait occurs. The first section of this chapter reviews basic concepts in population genomics. The second section reviews what we know about human population genomics, while the third section describes the application of this to medical genomics.
IMPORTANT CONCEPTS IN POPULATION GENOMICS Population genomics is the study, on a genomic scale, of the processes affecting the evolution, distribution and correlation of genetic variants within populations. Population genomics has little interest in the parts of the genome that are the same among all individuals and focuses instead on the genomic differences between us. Roughly 0.1% of our DNA varies appreciably among individuals at the single nucleotide level, and considerably more, we have recently learned (Redon et al., 2006), varies at the larger-scale structural level. The raw engine for variation is mutation (see previous Chapter 1). Once created by a mutation event, two forces – chance (modified by population structure) and selection (only relevant if the variant has an effect on viability or reproductive success) – act on the new variant as it passes from one generation to the next. Many are soon lost from Copyright © 2009, Elsevier Inc. All rights reserved.
Important Concepts in Population Genomics
BOX 2.1
Population Genomics in Action
■
23
Generation t-6 Generation t-5
While many aspects of population genomics can appear technical and somewhat distant from both application and real-world problems and interests in biology and medicine, in fact population genomics is transforming fields as diverse as genetic anthropology and medical genomics. Among the more surprising insights won with our increasing knowledge of the pattern of human genetic variation are a string of observations about relationships among certain populations. Patterns of Y chromosome variation have been used to support the claims of Jewish ancestry of a Bantu-speaking population in South Africa and for a long-term patrilineal pattern of inheritance among Jewish men that consider themselves to be “priests” or Cohanim in Hebrew (see Goldstein, 2008). More recently, Joel Hirschhorn, David Reich and their colleagues have identified a genome-wide signature of Jewish ancestry (Price et al., 2007), using PCA. Similar methodologies have also been developed by Reich and others to study the detailed genomic history of populations that have mixed ancestry. For example, looking at the whole genomes of African Americans with prostate cancer, Reich and his colleagues were able to determine which part of each person’s genome came originally from Africa and which from Europe with a high degree of statistical accuracy (Haiman et al., 2007). Most of the genome reflected the overall proportion of European ancestry in African Americans, but a section of chromosome 8 (within band 8q24) stood out strikingly. African Americans with prostate cancer had a much higher chance of having that region of the genome be of African ancestry compared to the rest of the genome. This result provides overwhelming evidence that there is a major risk factor for prostate cancer in this part of the genome and that this risk factor is likely to contribute to the observed differences in risk of prostate cancer between African Americans and Americans of European ancestry.
the gene pool, but some rise in frequency and may, eventually, come to replace the original ancestral allele. This is how evolution works, and population genomics provides the theory that describes it (Graur and Li, 1999; Hartl and Clark, 2007). Here we briefly review the important population genomic concepts of genealogy, the standard neutral model, population structure and linkage disequilibrium. Our aim is to provide enough information to understand the medically related issues discussed later. For a more thorough coverage of population genetic theory, the reader is referred to one of many standard texts (Gillespie, 2004; Halliburton, 2003; Hartl and Clark, 2007; Hedrick, 2004). Genealogy Imagine you have taken a sample of n individuals and typed them for one single nucleotide polymorphism (SNP) – perhaps a coding polymorphism in a medically relevant gene. The 2n allele copies (two for each diploid individual) are connected by an underlying genealogy, a tree that describes the ancestor– descendent relationships linking all the copies in that population. A simple illustration of how a genealogy might emerge in a population of only four individuals is shown in Figure 2.1. Here the sample size is just n 2 (thus four allele copies, sampled in
Generation t-4
!
Generation t-3 Generation t-2 Generation t-1 Generation t Indiv 1
Indiv 2
Indiv 3
Indiv 4
Figure 2.1 A simple example of a genealogy. Here the population in any one generation consists of just 4 individuals (represented by boxes) and 8 allele copies (represented as circles within the boxes). The genealogy connecting a sample of 2 individuals (and 4 allele copies) is shown. A mutation event in Generation t-3 (indicated by a red exclamation mark) has led to two allele types, coded green and blue.
Generation t) and the population from which they are sampled is fixed in each generation at N 4 individuals. For simplicity, we assume discrete non-overlapping generations, so that the set of parents for individuals in Generation t comes exclusively from the previous generation, Generation t-1. The total set of 2N allele copies in any one generation makes up the gene pool. Consider the two allele copies possessed by Individual 1 in Generation t, one inherited from her mother and the other from her father. A separate lineage of ancestor–descendent transmissions can be traced back in time for both copies (note the same thing can be done for any two allele copies sampled from the gene pool, regardless of whether they happen to belong to the same individual or not). Provided that N is finite, eventually one will reach a previous generation where the two lineages coalesce to a common ancestral copy (in Generation t-6 in the example in Figure 2.1). Note that coalescence is inevitable, given enough time, since in each generation there is a non-zero probability that coalescence will occur. However, the timing of coalescence is typically subject to much statistical uncertainty and, in human populations, is usually much further back in time than the six generations depicted here. Superimposed on the genealogical tree are mutation events, here exemplified by a single event in Generation t-3 that results in two distinct allele types (coded as different colors). This overtly genealogical approach to population genomics, focused on an observed sample of allele copies in the present day and tracing a tree backwards in time from these, has given rise to much new work in population genomics over the past 25 years, collectively known as coalescent theory (Donnelly and Tavaré, 1995; Hein et al., 2005; Kingman, 1982; Nordborg, 2001; Rosenberg and Nordborg, 2002; Wakeley, 2007). This contrasts with a more traditional approach that focuses on the entire population and typically arrives at solutions for tracing changes in allele frequencies forward in time. Perhaps the key computational and
24
CHAPTER 2
■
Concepts of Population Genomics
conceptual simplification afforded by coalescent approaches is that they are able to ignore alleles (and the individuals carrying them) that do not contribute in any way to the genetic diversity that is sampled in the current generation. Hence in this example three individuals are irrelevant in Generation t-6, two in Generation t-5, on so on. Note that while these individuals have not contributed to this sample for this locus, they may have contributed to other loci elsewhere in the genome. Thus, for example, the genealogy for a sample of men typed for their mitochondrial DNA is entirely separate from the genealogy for the same sample of men typed for their Y chromosome, and indeed in this case the individuals lying at the root of these genealogies (corresponding to so-called “mitochondrial Eve” and “Y-chromosome Adam”, provided a large enough sample is taken) are thought to have lived at entirely separate times. Both coalescent theory and the more traditional “forward” approach that models all members of a population have their uses, but we have concentrated here on the coalescent approach because it often helps one’s conceptual understanding of population genomics problems if one thinks in terms of the underlying genealogy. The Standard Neutral Model The standard neutral model (or Fisher-Wright model) provides a convenient null case against which other more complex models can be compared. The two basic assumptions are that (1) mating is random, so any allele copy in Generation t has the same probability of descending from any individual in Generation t-1; and (2) reproductive success is random, so a second allele copy picked from Generation t has the same probabilities of descent as the first one. Under these conditions, the probability that two allele copies in Generation t coalesce in Generation t-1 is 1/(2N). This is simply the probability of picking the same parent again (and the same allele copy within this parent) for the second copy, having picked a parent at random for the first allele copy. Much elegant theory flows from this simple rule, and we give two examples here to give a flavor of this. The first is the derivation of the famous Hardy-Weinberg proportions for genotype frequencies based on allele frequencies. Since the two allele copies possessed by an individual represent random samples from the gene pool under the standard neutral model, if Allele Type i has a population frequency of pi then the probability of seeing two Type i copies in the same individual (i.e., a homozygous ii genotype) is pi2 (due to the multiplicative law governing independent statistical events). Expected frequencies for other homozygous genotypes are similarly derived, and the expected frequency of heterozygotes H, known as the heterozygosity, is what is left over after taking away all the other homozym gotes. Thus H 1 ∑ i1 pi2 , where m is the total number of allele types. Note that in the simple case of just two allele types, these relationships reduce to the more familiar p2, q2 and 2pq for the frequencies of the two homozygotes and the heterozygote respectively, where q 1 p. The second example involves the expected value of H under mutation-drift equilibrium, in a model in which mutations
occur with probability to any one allele copy in any one generation, and in which all mutations lead to a new distinguishable allele type (a reasonable assumption for longer DNA sequences where each new point mutation is likely to occur in a different nucleotide). Consider two allele copies chosen at random in Generation t. Three things can happen to their respective lineages in Generation t-1: (1) they coalesce, with probability 1/(2N); (2) a mutation occurs in one or other ancestor, with probability 2 since there are two ancestors; or (3) nothing interesting happens – just two separate ancestors with no mutation events. If Event (3) happens, one simply asks the same question for Generation t-2, with the probabilities of Events (1), (2) and (3) unchanged. Thus the question of heterozygosity boils down to whether (2) happens before (1) in the genealogy connecting the two allele copies, which in turn depends only on the relative probabilities of (2) versus (1). Thus H 2/(2 1/(2N)) 4N/(1 4N). The same relationship can also be derived, somewhat more laboriously, using more traditional forward-in-time population genetics modeling (see, e.g., Crow, 1986). The relationship predicts that heterozygosity (and hence the amount of genetic variation) at neutral loci should be greater in larger populations and with high-mutation rates, and this pattern is indeed generally seen in both human and non-human populations (Graur and Li, 1999). Population Structure The concept of a completely randomly mating population is patently false for humans. Instead, individuals are more likely to have children with some individuals rather than others. One obvious constraint is geographic distance – you are more likely to marry (or otherwise pair with) someone next door than someone on the other side of the world. National boundaries are important barriers to gene flow (the movement of allele copies from one subpopulation to another), but even within national boundaries marriages are more likely to take place within socioeconomic, cultural and ethnic groups than between them (a phenomenon known as endogamy). For humans, however, none of these barriers is absolute; instead there are simply varying probabilities of having children with anyone else. In population genomics, this complexity is often simplified by compartmentalizing people into distinct groups and then by thinking about the different rates of exchange of allele copies between and within these groups. A useful metric that is often used to describe the degree of population differentiation between these groups is the FST statistic. This can be derived in many ways, and, when these result in slightly different answers, there are some differences of opinion on which derivation is best (Balding, 2003; Excoffier, 2001; Nei, 1987). However, a commonly applied derivation of FST is based on contrasting the amounts of genetic variance attributable to within- and between-population differences. For various reasons (Excoffier, 2001; Nei, 1987), the heterozygosity H that we defined earlier is an excellent measure of the genetic variance of a population. Intuitively, one can see that H approaches zero as genetic variability is reduced (i.e., the population becomes dominated by a single allele type) and conversely approaches unity as genetic
Important Concepts in Population Genomics
variability is increased (i.e., the population becomes dominated by many different allele types, such that the probability of picking two of the same type from the gene pool becomes vanishingly small). One interpretation of FST, then, is that it represents the proportion of total heterozygosity, HT (obtained by pooling all populations together), that is attributable to between-population differences. Since it’s easier to calculate the average withinpopulation contribution (simply the mean heterozygosity H calculated over all within-population H values), and since the between- and within-population contributions together sum to HT, the definition of FST is then (H T H )/H T ). For most of the genome, FST values range from a maximum of 15% for intercontinental differences, to typically 1–3% for international differences within a continent, to less than 0.1% for differences within countries in Europe (Cavalli-Sforza et al., 1994; Steffens et al., 2006). Genetic markers with high FST are used for a number of purposes in genomic medicine, including ancestry testing and tracing the likely origin(s) of all or portions of one’s genome (Box 2.1). Linkage Disequilibrium Linkage Disequilibrium, or LD, is a measure of the degree of nonrandom association between allelic forms at two different sites (or loci) in the genome. Two factors influence this: mutation and recombination. Since the chances of recombination increase the further apart two loci are, LD tends to be highest between loci that lie very close to each other in the genome. However, there is a lot of variability around this general rule, partly because the rate of recombination varies throughout the human genome (Myers et al., 2005) and partly because chance factors in the placement of mutations in the underlying genealogy also play a role. These points are illustrated in Figure 2.2. In the absence of recombination (i.e., if the loci are very close to each other), one still sees variation in the degree of LD between two loci. If the polymorphism-generating mutations happen to occur along the same branch of the genealogy (Figure 2.2a), then only two out of the possible four 2-locus haplotypes are formed, and in this case knowledge of the allele type at one locus allows us to perfectly predict the allele type at the other. But if instead the mutations occur at widely separated points on the genealogical tree (Figure 2.2b), then knowledge of the allele type at one locus is only partially predictive of the allele type at the other. This breakdown of association, due simply to mutation, may or may not be of interest to the population geneticist depending on the application. If one is interested in the prediction of allele type at one locus depending on the allele type at the other, then mutation-led breakdown of LD is of interest and a useful LD metric in this case is r 2, which is the amount of variation at locus A that is explained by knowledge of allele states at locus B. If locus A and B are binary SNPs, then r 2 can be found simply by converting the allelic states into 0/1 variables, then calculating the square of the standard Pearson correlation coefficient between them. If instead one is interested only in the supposed amount of recombination between locus A and B, then a metric that
■
25
attempts to factor out the mutation-led breakdown of association is preferable. One such metric is D, which for SNP loci takes its maximum value of 1 if and only if one observes just three out of the possible four haplotypes in one’s sample (for (a) Generation t-6 Generation t-5 Generation t-4
!!
Generation t-3 Generation t-2 Generation t-1 Generation t Indiv 1
Indiv 2
Indiv 3
Indiv 4
Indiv 3
Indiv 4
Indiv 3
Indiv 4
(b) Generation t-6 Generation t-5 Generation t-4
! !
Generation t-3 Generation t-2 Generation t-1 Generation t Indiv 1
Indiv 2
(c) Generation t-6 Generation t-5 Generation t-4
! !
Generation t-3 Generation t-2
X
Generation t-1 Generation t Indiv 1
Indiv 2
Figure 2.2 Example genealogies involving 2 linked loci. (a) Both mutations happen on the same genealogical branch, leading to just two haplotypes – green red and blue yellow. Thus knowledge of the allele type at one locus provides perfect prediction of the allele type at the other. LD metrics: r2 1, D 1. (b) The two mutations occur separately on the genealogy tree, leading to three haplotypes and imperfect prediction. LD metrics: r2 0.33, D 1. (c) A recombination event in Generation t-2 (indicated by a red X) has created a recombinant haplotype with contributions from two separate lineages, leading to four different haplotypes and further reducing LD (to zero in the toy example, but this is not a general rule). LD metrics: r2 0, D 0.
26
CHAPTER 2
■
Concepts of Population Genomics
reviews of LD metrics see Devlin and Risch, 1995; Hudson, 2001; Nordborg and Tavaré, 2002; Mueller, 2004). Looking at Figure 2.2b, one should note that wherever one places the two mutations one can never obtain more than three haplotypes in the eventual sample, so D is always 1. Assuming no further mutations, the only way to obtain all four haplotypes in the eventual sample is via one or more recombination events between the two loci, and this is illustrated in Figure 2.2c. Note that recombination events alter the underlying genealogy, which is no longer a bifurcating tree but rather a more complicated web (technically, a “graph”). Linkage disequilibrium is both a boon and a bane for medical genomics, and this is discussed in more detail later.
HUMAN POPULATION GENOMICS We start by describing important sources of genomic data that have emerged recently. Then, we describe what these have told us about the history and evolution of human populations and about how human populations are structured today. Important Sources of Data The HapMap dataset (http://www.hapmap.org/) today constitutes the most important source of information for population genomics (The International HapMap Consortium, 2005, 2007). Instigated in 2002 with funding from the US NIH and other international sources, the project set out to describe and type common SNP variation, evenly spread throughout the human genome, in a collection of 270 individuals sampled from four populations spanning three major continents of origin – Africa, Asia and Europe. This was accomplished in two main phases – an initial Phase I completed in early 2005, featuring around 1.3 million common SNPs at an average density of 0.4 SNPs per kilobase (The International HapMap Consortium, 2005), and a subsequent Phase II completed in late 2005, featuring around 3.1 million common SNPs at an average density of 1 SNP per kilobase (The International HapMap Consortium, 2007). In addition, 10 separate 500 kb regions of the genome (the HapMap-ENCODE regions) were targeted for exhaustive discovery and typing of all but the rarest SNPs, in order to provide a valuable reference source to compare the projected performance of the rest of the HapMap data in representing all common SNP variation. It is estimated that there are some 10 million common SNPs in the human genome (Carlson et al., 2005). HapMap Phase II typed only one third of these; however, comparisons using the HapMap-ENCODE regions show that, thanks to linkage disequilibrium, these 3.1 million SNPs represent or “tag” the other 7 million to a high degree of accuracy (The International HapMap Consortium, 2007). The HapMap project, along with earlier SNP discovery projects such as The SNP Consortium (The International SNP Map Working Group, 2001), played an important role in the development of high-density genome-wide SNP panels by greatly increasing our knowledge of the location of SNPs in the human genome. In addition, the HapMap project provided the
LD data that allowed tag-based SNP panels to be designed (see Chapter 8). Genome-wide SNP panels are now being applied to population samples from around the world, providing valuable extensions of the information available from HapMap in two important ways. Firstly, new datasets have increased the sample sizes for certain populations to thousands of individuals; see, for example, the iControl database (US individuals, http:// www.illumina.com/pages.ilmn?ID231), the WTCCC database (UK individuals, http://www.wtccc.org.uk/) and the dbGaP database (http://www.ncbi.nlm.nih.gov/sites/entrez?dbgap). Secondly, new datasets have expanded the set of populations beyond the four studied for HapMap; see, for example, the Stanford University HGDP-CEPH database (51 distinct worldwide population samples, http://www.cephb.fr/hgdp-cephdb/). A key element in all these datasets is their free availability to the scientific community, reflecting a sea-change in attitudes to data access that has beneficially affected almost all genomic research since the Human Genome Project (see also The GAIN Collaborative Research Group, 2007). Currently, most databases characterize just the SNPs in the human genome. However, much interest is now also focused on Copy Number Variation (CNVs), and this is likely to widen soon to other types of structural variation such as inversions (see previous Chapter 1). The indications are that SNPs have a fair, but not perfect, ability to tag CNVs (Locke, 2006; Redon et al., 2006), and thus the development of genomic databases specifically targeting structural variation remains a priority. A Brief History of Homo sapiens The new genomic data have broadly confirmed the picture of our genetic history that had emerged from previous genetic studies and are also consistent with what we know from paleoanthropological data. This picture is of a relatively small population of ancestral humans living in East Africa between 60–100 thousand years ago, spreading out-of-Africa to central Asia and thence to Europe in one direction and South and East Asia and the Americas in the other, replacing other extant human populations as they went (Cavalli-Sforza et al., 1994; Garrigan and Hammer, 2006; Goldstein and Chikhi, 2002; Jobling et al., 2003). This picture explains why (1) compared to those of other primate species, our genomes are remarkably low in genetic diversity within populations and in differentiation between populations; (2) this diversity steadily decreases in populations further removed from Africa (Prugnolle et al., 2005; Ramachandran et al., 2005); (3) non-African diversity tends to be a subset of African diversity or, to put it in genealogical terms, most gene trees have their root in Africa; and (4) LD in the human genome is high in non-African populations. Not enough time has passed since our recent, small origins to have allowed mutations and recombinations to break down the haplotypic patterns present in the genomes that emerged from Africa. There is still room, and indeed eagerness, for debate regarding this picture. Debate continues over the degree to which there was interbreeding with extant populations either before or after the out-of-Africa expansion (Garrigan and Hammer, 2006;
Human Population Genomics
AA C EA M SA
0.05
0.10
2007; Rosenberg et al., 2005; Serre and Pääbo, 2004). Thus, the genome of any one single individual is not composed of a contribution from one single discrete ancestral group, nor even of a series of contributions from different discrete ancestral groups. Just as there are no discrete groups today, so there were none in the past. For these reasons, a more useful paradigm for human population genomic structure is to see it as comprised of a series of quantitative axes, each one reflecting one aspect of a multidimensional structure. We illustrate some of these points in Figure 2.3. The figure displays the principal aspects of the population genomic structure of a sample of approximately 500 individuals collected in North Carolina, USA. Each individual was typed using the Illumina HumanHap 550 k genome-wide SNP panel (see Chapter 8), and these data were then subjected to Principal Components Analysis (PCA), a statistical method that ranks and dissects the correlation structure of multivariate datasets (Patterson et al., 2006; Price et al., 2006). Among the population groups contained within the cohort, some of the allele frequencies change in correlated ways that distinguish between population groups. Thus, in a simple and extreme case, you might have 10,000 SNPs that are all high frequency in group A and low frequency in group B; these would have high FST values, as described earlier.These SNPs would contribute to a single axis (e.g., PC1 in Figure 2.3), and the position of an individual along that axis would be dependent on which of
PC 2 0.00
Population Genomic Structure and Race The new genomics has allowed us, at last, to separate the population genetic concept of biogeographical shared ancestry from the socio-cultural concept of race. This is because the quantity of genetic data that one can routinely collect on a single individual is enough to provide all one needs to draw up a remarkably accurate picture of the population structure of a sample of individuals. In the past, with less genetic data available, other information such as the individual’s own self-declared racial identity was needed to shore up uncertainties in the underlying population genomic structure, and much debate ensued over the effectiveness of using one to predict the other (Barnholtz-Sloan et al., 2005; Burnett et al., 2006; Liu et al., 2006b; Tang et al., 2005; Wilson et al., 2001). Now that debate is largely irrelevant – we can get all the information we need regarding patterns of ancestral relatedness from the genetic data alone. The fact that confusion still exists over conflating population genomic structure with sociocultural race stems from many reasons, but the research community is at least partly to blame. Recent surveys suggest that the term “race” continues to be used ambiguously both in the scientific literature (Sankar et al., 2007; Shanawani et al., 2006) and in dealings with the press (Condit, 2007; Lynch and Condit, 2006). Genetically, there are no such things as human “races”, in the original strict sense of discrete unequivocal groupings. Because of our recent shared roots and our considerable mobility, we are only marginally more similar to a person sharing common biogeographic ancestry and living locally than we are to a person with different roots living on the other side of the world. Leaving aside sex-linked loci, which one must treat separately, average FST never rises much above 15% even for intercontinental comparisons (Cavalli-Sforza et al., 1994), so at least 85% of all genetic differences are accounted for by “within population” variance. There are no absolute barriers to gene flow between any two parts of the world, and, while partial barriers exist, the location and extent of these are difficult to define unequivocally (Barbujani and Belle, 2006; Handley et al.,
27
0.05
Harding and McVean, 2004; Liu et al., 2006a; Ray et al., 2005), in particular regarding whether there was interbreeding with Neanderthals in Europe (Currat and Excoffier, 2004; Trinkaus, 2007). This debate revolves not only around the interpretation of genetic data, but also around morphological data such as cranial features, again with contrasting conclusions (Manica et al., 2007; Trinkaus, 2005). Likewise there is debate over the mode of the out-of-Africa expansion, for example over the use of coastal routes as preferred routes for expansion (Amos and Manica, 2006; Macaulay et al., 2005) and over the number and extent of population bottlenecks that may have occurred along the way (Liu et al., 2006a; Ramachandran et al., 2005). While our knowledge of genetic variation in the present day is excellent, the use of this to make inferences about the past is not an exact science. It may be that ancient DNA techniques will help to resolve these issues, by providing more direct information on the population genetics of these ancient populations (Caramelli et al., 2007; Noonan et al., 2006;Vernesi et al., 2004).
■
0.00
0.05 PC 1
0.10
Figure 2.3 Population structure in a sample collected in North Carolina, USA, revealed by Principal Components Analysis applied to genome-wide SNP data (Duke Genetics of Memory Cohort typed on Illumina HumanHap 550 panel). Scores for the first two principal component axes are plotted here. Self-declared racial identity labels are: AA African/AfricanAmerican; C Caucasian; EA East Asian; M Hispanic/ Mixed; SA South Asian. See text for further details. Figure based on data provided by D. Goldstein and A. Need.
28
CHAPTER 2
■
Concepts of Population Genomics
the two groups that individual came from. Different axes would reflect other correlations in the data, involving, for example, a different subset of SNPs that also distinguish different population groups (e.g., PC2 in Figure 2.3). In this way, PCA uses the correlation patterns in the data to reduce the dimensionality of the dataset. Formal methods exist to determine how much information is reflected in each of the axes. In real datasets, the correlation patterns are much less extreme and simple than the group A and group B case described above, but the idea is the same. The first point the figure illustrates is that, even though PCA is based solely on genotype information and is thus blind to non-genetic information, there is a remarkably good correspondence with self-declared racial identity. Most individuals belonging to three population subgroups (Caucasian in red, South Asian in turquoise, and East Asian in black) are clearly identified and separately from each other by analysis of their genotypes (Figure 2.3). However, a second point, and just as important as the first, is that there is additional population structure that is not picked up by the information on racial identity. For example, there is a large spread in the African American group, reflecting individual differences in the degree of European admixture in their genomes. Thirdly, there is more than one axis along which co-ancestry relationships can be defined, and in fact the PCA reveals many more significant axes for this dataset than the two used to plot Figure 2.3. This underlines the multidimensional nature of human population genomic structure. Finally, the degree of resolution in elucidating such structure, as evidenced by the sharpness in the clustering of some of the groups in Figure 2.3, is something altogether new, and is a largely unforeseen benefit of the step-increase in genetic data afforded by the latest genome-wide SNP chips.
APPLICATION OF POPULATION GENOMICS TO GENOMIC MEDICINE Here we show how the concepts we have discussed so far can be applied to genomic medicine. The first two sections deal with the important role that LD and population structure have in the search for medically relevant causal variants, while the last deals with the insights beginning to emerge on the common population genomic patterns describing these variants (Box 2.1). LD in the Search for Medically Relevant Variants Today the hunt for medically relevant variants is taking place in large population-based samples, using genotyping panels that scan the entire genome. This creates both challenges and benefits, and both require an understanding of population genomics. In essence, the reason why association studies are so much more powerful than linkage studies at localizing a causal genetic variant is that linkage studies only make use of recombination events taking place over one or two generations in the pedigree under study, whereas association studies make use of the much larger number of recombination events that have taken place over the entire genealogy linking the individuals in a sample at a given
locus. Thus the breakdown in association with a nearby marker due to LD happens over a much shorter distance than the breakdown in linkage due to observed recombination events within pedigrees. The cost is that you need to cover the genome with a much greater density of markers. Using a simple theoretical model, Kruglyak estimated that this breakdown might happen over a 3 kb distance (Kruglyak, 1999). We now know that breakdown of LD is both region-specific (i.e., varies between different parts of the genome) and population-specific. The former is due at least in part to variation in the rate of recombination throughout the genome as a reflection of genome organization (Myers et al., 2005), while the latter is due to differences in the demographic histories of different populations. For example, levels of LD in European populations are higher than those in African populations and higher than the level estimated by Kruglyak (1999). This is presumed to be due to one or more population bottlenecks that have occurred since the out-of-Africa expansion (bottlenecks generate higher LD by shortening the average genealogical tree linking a sample of individuals). In a practical sense, this means that fewer SNPs are needed to cover the genome in Europeans than in Africans. Still greater efficiency can be achieved, however, by leveraging what we know about the specific patterns of LD to pick out a special set of SNPs that still largely represents all the common variation in a given genomic region – a procedure known as tagging (de Bakker et al., 2005; Weale et al., 2003). This idea is now applicable to the entire genome, with the availability of genomewide genotyping arrays containing hundreds of thousands of SNPs. Thus either through simply the sheer density of SNPs (e.g., Affymetrix platforms) or through explicit use of tagging methods to guide the choice of SNPs in the first place (e.g., Illumina platforms), these modern genome-wide platforms are able to represent all common variation in the human genome to an extremely high level of accuracy (see Chapter 8). Recent studies have shown that the ability to port tagging SNP sets inferred from one population to another related one (e.g., from the same continent) is generally good (Conrad et al., 2006; de Bakker et al., 2006). Since LD varies in extent among different populations, this presents the possibility of conducting different studies at different stages in different populations in order to maximize efficiency. Two types of special populations have been used to capitalize on larger levels of LD in the initial scan for causal variants. One of these is the admixed population, one in which individuals are descendents of a recent period of admixture between two previously separated ancestral populations. The best-studied example of this is the African American population, as on average their genome is comprised of an 80:20 mixture of West African and European DNA, although there is quite a high degree of variation among different individuals who self-identify as African American in terms of the proportion of the genome that is of West African or European origin (Parra et al., 1998). The other type is the isolated population, one in which the individuals descend from a small number of founders and have, through endogamy, remained relatively free from subsequent admixture (Service et al., 2006). One well-studied example is the Dutch “GRIP” population (Liu et al., 2007). Despite their differences
Conclusions
in origin, both admixed and isolated populations have genomes with elevated levels of LD, and this can be used to positive effect in the early stages of the search for new causal variants (Angius et al., 2008; Seldin, 2007; Smith and O’Brien, 2005). Ultimately, though, the very same LD that allows us to economize in the number of SNPs needed to perform effective genome-wide association studies also acts as a barrier, limiting one’s ability to localize a causal variant based on association alone. Instead, a signal pointing to causal involvement of locus A could equally be due to the actual influence of locus B, if locus B is in high LD with locus A. In such cases, other types of information (e.g., based on the location of the respective polymorphisms and their predicted functional influence) are needed to tease apart the causal relationships. Correcting for Population Structure We have already noted that human populations have a complex structure (also known as stratification) that neither conforms to the notion of distinct races nor to that of one single randomly mating gene pool but rather falls somewhere in between. In practice, what this means is that even within a carefully designed study that attempts to match cases and controls for biogeographical shared ancestry, there may be hidden structure within the study. This creates a problem if cases and controls are nonrandomly distributed with respect to this structure. To illustrate this, consider again the structure revealed in Figure 2.3. Imagine a case-control study conducted within the African American group, in which cases are more likely to come from the bottom right and controls from the top left of the cluster. This sets up a situation in which SNPs that differentiate strongly between these two sides would also differentiate cases and controls, resulting in many false signals of association. Luckily, we can use the fact that population structure affects all SNPs in the genome to affect a correction. Older genomic control methods applied a uniform correction to all SNPs (Devlin et al., 2004; Devlin and Roeder, 1999). Newer methods allow a SNP-specific correction to be applied, tailored to the observed degree to which the SNP in question is seen to be affected by population structure (some SNPs vary more simply by evolutionary chance (Balding and Nichols, 1995), others because of presumed selective gradients (Campbell et al., 2005)). Of these newer methods, the EIGENSTRAT method of Price et al. (2006) is particularly convenient and flexible because it (a) makes use of PCA to efficiently dissect the population structure; (b) uses as many co-ancestry related principal component axes as are deemed necessary to effect the correction (guided by theory developed in Patterson et al., 2006); and (c) automatically recognizes, since principal component axes are quantitative, that human population structure is continuous rather than discrete. Understanding the History and Population Distribution of Variants In this final section, we anticipate the forthcoming flood of data on causal variants that will be discovered as a result of genome-wide association studies on common complex diseases
■
29
(Frayling, 2007). These will soon allow the resolution of questions that have been debated for decades regarding the population genomic properties of causal variation relative to other variation in the human genome (Mitchell-Olds et al., 2007). An early example of this is a meta-analysis by Ioannidis et al. (2004) on the question of whether causal variants tend to exert consistent effects in different parts of the world – in other words, does possession of an AA genotype for locus Q always increase your risk of a given disease relative to genotype aa, or is it different for people living in Africa versus people living in Asia? Their study suggested that the general rule was the former and that differences in population risk were more due to differences in allele frequencies among populations than to differences in the effect of possessing a given genotype. The take-home message is that a new causal variant discovered in Population A because it is at high frequency can still lead to new therapies of benefit to people living in Population B, where the causal variant is rare, because the underlying metabolic pathway being affected by that causal variant is the same. As one example of this, a new class of antiretroviral drugs – the HIV entry inhibitors – was developed following the discovery of the importance of the CCR5-32 polymorphism in progression to HIV-AIDS, and these drugs appear to be of universal benefit despite the fact that the CCR5-32 mutation is almost wholly absent in subSaharan Africa (Esté and Telenti, 2007). Looking ahead, large-scale resequencing technologies will help us to build up a picture of the “allelic architecture” of causal variants for common complex diseases (Wang et al., 2005). A null hypothesis would be that the allele frequency distributions of causal variants are no different from those observed for neutral genetic variants in the human genome. Much opinion has been offered on the number of variants open to detection by association studies (which are only powered to detect common variants) (Blangero, 2004; Pritchard, 2001; Pritchard and Cox, 2002; Reich and Lander, 2001; Smith and Lusis, 2002). One way in which this debate has been couched is over the degree to which causal variants have been under selection, and whether this has been large enough to alter their allele frequency and LD properties.Various methods have been proposed recently to look for signals of selection on a genome-wide scale (Kelley et al., 2006; McVean and Spencer, 2006; Sabeti et al., 2006; Sabeti et al., 2007; Voight et al., 2006; Thornton et al., 2007), but so far, while individual examples exist of what appear to be strong signals of selection around genes affecting Mendelian traits (e.g., the lactase gene and adult lactose intolerance (Bersaglieri et al., 2004; Burger et al., 2007)), a systematic treatment of this question for variants affecting common complex disorders awaits further data.
CONCLUSIONS In this chapter we have given what we hope is a concise and accessible account of the main concepts of population genomics and their importance for medical genomics. The field has been
30
CHAPTER 2
■
Concepts of Population Genomics
transformed over the past 5 or so years by the move away from a single “Human Genome” to the diverse human genomes of many thousands of individuals. The new data have allowed us to quantify with unprecedented accuracy the patterns in different human populations of allele frequency and of linkage disequilibrium throughout the genome. This has allowed us to estimate the variable rates of recombination and mutation that apply to
different part of the genome and to reaffirm the remarkably recent and unified origins of our species in Africa. Most importantly, it has provided us with the necessary tools to carry out association-based genome-wide scans to look for novel causal variant of common complex disease, and the necessary framework for interpreting these in the translation of this knowledge into new effective therapies.
REFERENCES Amos, W. and Manica, A. (2006). Global genetic positioning: Evidence for early human population centers in coastal habitats. Proc Nat Acad Sci USA 103, 820–824. Angius, A., Hyland, F.C., Persico, I., Pirastu, N., Woodage, T., Pirastu, M. and de la Vega, F.M. (2008). Patterns of linkage disequilibrium between SNPs in a Sardinian population isolate and the selection of markers for association studies. Hum Hered 65, 9–22. Balding, D.J. (2003). Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol 63, 221–230. Balding, D.J. and Nichols, R.A. (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12. Barbujani, G. and Belle, E.M. (2006). Genomic boundaries between human populations. Hum Hered 61, 15–21. Barnholtz-Sloan, J.S., Chakraborty, R., Sellers, T.A. and Schwartz, A.G. (2005). Examining population stratification via individual ancestry estimates versus self-reported race. Cancer Epidemiol Biomarkers Prev 14, 1545–1551. Bersaglieri, T., Sabeti, P.C., Patterson, N.,Vanderploeg, T., Schaffner, S.F., Drake, J.A., Rhodes, M., Reich, D.E. and Hirschhorn, J.N. (2004). Genetic signatures of strong recent positive selection at the lactase gene. Am J Hum Genet 74, 1111–1120. Blangero, J. (2004). Localization and identification of human quantitative trait loci: King harvest has surely come. Curr Opin Genet Dev 14, 233–240. Burger, J., Kirchner, M., Bramanti, B., Haak, W. and Thomas, M.G. (2007). Absence of the lactase-persistence-associated allele in early Neolithic Europeans. Proc Nat Acad Sci USA 104, 3736–3741. Burnett, M.S., Strain, K.J., Lesnick, T.G., de Andrade, M., Rocca, W.A. and Maraganore, D.M. (2006). Reliability of self-reported ancestry among siblings: Implications for genetic association studies. Am J Epidemiol 163, 486–492. Campbell, C.D., Ogburn, E.L., Lunetta, K.L., Lyon, H.N., Freedman, M.L., Groop, L.C., Altshuler, D., Ardlie, K.G. and Hirschhorn, J.N. (2005). Demonstrating stratification in a European American population. Nat Genet 37, 868–872. Caramelli, D., Vernesi, C., Sanna, S., Sampietro, L., Lari, M., Castrì, L., Vona, G., Floris, R., Francalacci, P., Tykot, R. et al. (2007). Genetic variation in prehistoric Sardinia. Hum Genet 122, 327–336. Carlson, C.S., Eberle, M.A., Rieder, M.J., Smith, J.D., Kruglyak, L. and Nickerson, D.A. (2005). Additional SNPs and linkagedisequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet 33, 518–521. Cavalli-Sforza, L.L., Menozzi, P. and Piazza, A. (1994). The History and Geography of Human Genes. Princeton University Press, Princeton. Condit, C.M. (2007). How geneticists can help reporters to get their story right. Nat Rev Genet 8, 815–820.
Conrad, D.F., Jakobsson, M., Coop, G., Wen, X., Wall, J.D., Rosenberg, N.A. and Pritchard, J.K. (2006). A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet 38, 1251–1260. Crow, J. (1986). Basic Concepts in Population, Quantitative and Evolutionary Genetics.W.H.Freeman & Co Ltd. Currat, M. and Excoffier, L. (2004). Modern humans did not admix with Neanderthals during their range expansion into Europe. PLoS Biol 2, e421. de Bakker, P.I., Burtt, N.P., Graham, R.R., Guiducci, C., Yelensky, R., Drake, J.A., Bersaglieri, T., Penney, K.L., Butler, J., Young, S. et al. (2006). Transferability of tag SNPs in genetic association studies in multiple populations. Nat Genet 38, 1298–1303. de Bakker, P.I., Yelensky, R., Pe’er, I., Gabriel, S.B., Daly, M.J. and Altshuler, D. (2005). Efficiency and power in genetic association studies. Nat Genet 37, 1217–1223. Devlin, B., Bacanu, S.A. and Roeder, K. (2004). Genomic control to the extreme. Nat Genet 36, 1129–1130. Devlin, B. and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322. Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics 55, 997–1004. Donnelly, P. and Tavaré, S. (1995). Coalescents and genealogical structure under neutrality. Annu Rev Genet 29, 401–421. Esté, J.A. and Telenti, A. (2007). HIV entry inhibitors. Lancet 370, 81–88. Excoffier, L. (2001). Analysis of population subdivision. In Handbook of Statistical Genetics (D.J. Balding, M. Bishop and C. Cannings, eds), Wiley, Chichester, UK, p. 309. Frayling, T.M. (2007). Genome-wide association studies provide new insights into type 2 diabetes aetiology. Nat Rev Genet 8, 657–662. Garrigan, D. and Hammer, M.F. (2006). Reconstructing human origins in the genomic era. Nat Rev Genet 7, 669–680. Gillespie, J. (2004). Population Genetics: A Concise Guide. John Hopkins University Press. Goldstein, D.B. (2008). Jacob’s Legacy: A Genetic View of Jewish History. Yale University Press, New Haven & London. Goldstein, D.B. and Chikhi, L. (2002). Human migrations and population structure: What we know and why it matters. Annu Rev Genom Hum Genet 3, 129–152. Goldstein, D.B. and Weale, M.E. (2001). Population genomics: Linkage disequilibrium holds the key. Curr Biol 11, R576–R579. Graur, D. and Li, W.-H. (1999). Fundamentals of Molecular Evolution. Sinauer Associates Inc. Haiman, C.A., Patterson, N., Freedman, M.L., Myers, S.R., Pike, M.C., Waliszewska, A., Neubauer, J., Tandon, A., Schirmer, C., McDonald, G.J. et al. (2007). Multiple regions within 8q24 independently affect risk for prostate cancer. Nat Genet 39, 638–644. Halliburton, R. (2003). Introduction to Population Genetics. Addison Wesley.
References
Handley, L.J., Manica, A., Goudet, J. and Balloux, F. (2007). Going the distance: Human population genetics in a clinal world. Trends Genet 23, 432–439. Harding, R.M. and McVean, G. (2004). A structured ancestral population for the evolution of modern humans. Curr Opin Genet Dev 14, 667–674. Hartl, D.L. and Clark, A.G. (2007). Principles of Population Genetics. Sinauer Associates Inc. Hedrick, P.W. (2004). Genetics of Populations. Jones & Bartlett Publishing Co. Hein, J., Schierup, M.H. and Wiuf, C. (2005). Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press. Hudson, R.R. (2001). Linkage disequilibrium and recombination. In Handbook of Statistical Genetics (D.J. Balding, M. Bishop and C. Cannings, eds),Wiley, Chichester, UK, p. 309. Ioannidis, J.P., Ntzani, E.E. and Trikalinos, T.A. (2004). “Racial” differences in genetic effects for complex diseases. Nat Genet 36, 1312–1318. Jobling, M.A., Tyler-Smith, C. and Hurles, M. (2003). Human Evolutionary Genetics: Origins, Peoples and Disease. Garland Publishing Inc. Jorde, L.B., Watkins,W.S. and Bamshad, M.J. (2001). Population genomics: A bridge from evolutionary history to genetic medicine. Hum Mol Genet 10, 2199–2207. Kelley, J.L., Madeoy, J., Calhoun, J.C., Swanson, W. and Akey, J.M. (2006). Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res 16, 980–989. Kingman, J.F.C. (1982). On the genealogy of large populations. J Appl Probab 19A, 27–43. Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 22, 139–144. Liu, F., Arias-Vásquez, A., Sleegers, K., Aulchenko, Y.S., Kayser, M., Sanchez-Juan, P., Feng, B.J., Bertoli-Avella, A.M., van Swieten, J., Axenovich, T.I. et al. (2007). A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. Am J Hum Genet 81, 17–31. Liu, H., Prugnolle, F., Manica, A. and Balloux, F. (2006a). A geographically explicit genetic model of worldwide human-settlement history. Am J Hum Genet 79, 230–237. Liu, X.Q., Paterson, A.D., John, E.M. and Knight, J.A. (2006b). The role of self-defined race/ethnicity in population structure control. Ann Hum Genet 70, 496–505. Locke, D.P. (2006). Linkage disequilibrium and heritability of copynumber polymorphisms within duplicated regions of the human genome. Am J Hum Genet 79, 275–290. Lynch, J. and Condit, C.M. (2006). Genes and race in the news: A test of competing theories of news coverage. Am J Health Behav 30, 125–135. Macaulay, V., Hill, C., Achilli, A., Rengo, C., Clarke, D., Meehan, W., Blackburn, J., Semino, O., Scozzari, R., Cruciani, F. et al. (2005). Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes. Science 308, 1034–1036. Manica, A., Amos, W., Balloux, F. and Hanihara, T. (2007). The effect of ancient population bottlenecks on human phenotypic variation. Nature 448, 346–349. McVean, G. and Spencer, C.C. (2006). Scanning the human genome for signals of selection. Curr Opin Genet Dev 16, 624–629. Mitchell-Olds, T., Willis, J.H. and Goldstein, D.B. (2007). Which evolutionary processes influence natural genetic variation for phenotypic traits?. Nat Rev Genet 8, 845–856.
■
31
Mueller, J.C. (2004). Linkage disequilibrium for different scales and applications. Brief Bioinform 5, 355–364. Myers, S., Bottolo, L., Freeman, C., McVean, G. and Donnelly, P. (2005). A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324. Nei, M. (1987). Molecular Evolutionary Genetics. Columbia University Press, New York. Noonan, J.P., Coop, G., Kudaravalli, S., Smith, D., Krause, J., Alessi, J., Chen, F., Platt, D., Pääbo, S., Pritchard, J.K. et al. (2006). Sequencing and analysis of Neanderthal genomic DNA. Science 314, 1113–1118. Nordborg, M. (2001). Coalescent theory. In Handbook of Statistical Genetics (D.J. Balding, M. Bishop and C. Cannings, eds), Wiley, Chichester, UK, p. 309. Nordborg, M. and Tavaré, S. (2002). Linkage disequilibrium: What history has to tell us. Trends Genet 18, 83–90. Parra, E.J., Marcini, A., Akey, J., Martinson, J., Batzer, M.A., Cooper, R., Forrester, T., Allison, D.B., Deka, R., Ferrell, R.E. et al. (1998). Estimating African American admixture proportions by use of population-specific alleles. Am J. Hum Genet 63, 1839–1851. Patterson, N., Price, A.L. and Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet 2, e190. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N. A. and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–909. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L., Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P. et al. (2007). Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. doi:10.1371/journal.pgen.0030236.eor. Pritchard, J.K. (2001). Are rare variants responsible for susceptibility to complex diseases?. Am J Hum Genet 69, 124–137. Pritchard, J.K. and Cox, N.J. (2002). The allelic architecture of human disease genes: Common disease-common variant ... or not?. Hum Mol Genet 11, 2417–2423. Prugnolle, F., Manica, A. and Balloux, F. (2005). Geography predicts neutral genetic diversity of human populations. Curr Biol 15, R159–R160. Ramachandran, S., Deshpande, O., Roseman, C.C., Rosenberg, N.A., Feldman, M.W. and Cavalli-Sforza, L.L. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Nat Acad Sci USA 102, 15942–15947. Ray, N., Currat, M., Berthier, P. and Excoffier, L. (2005). Recovering the geographic origin of early modern humans by realistic and spatially explicit simulations. Genome Res 15, 1161–1167. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W. et al. (2006). Global variation in copy number in the human genome. Nature 444, 445–454. Reich, D.E. and Lander, E.S. (2001). On the allelic spectrum of human disease. Trends Genet 17, 502–510. Rosenberg, N.A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard, J.K. and Feldman, M.W. (2005). Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet 6, e70. Rosenberg, N.A. and Nordborg, M. (2002). Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet 3, 380–390.
32
CHAPTER 2
■
Concepts of Population Genomics
Sabeti, P.C., Schaffner, S.F., Fry, B., Lohmueller, J., Varilly, P., Shamovsky, O., Palma, A., Mikkelsen, T.S., Altshuler, D. and Lander, E.S. (2006). Positive natural selection in the human lineage. Science 312, 1614–1620. Sabeti, P.C., Varilly, P., Fry, B., Lohmueller, J., Hostetter, E., Cotsapas, C., Xie, X., Byrne, E.H., McCarroll, S.A., Gaudet, R. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918. Sankar, P., Cho, M.K. and Mountain, J. (2007). Race and ethnicity in genetic research. Am J Med Genet A 143A, 961–970. Seldin, M.F. (2007). Admixture mapping as a tool in gene discovery. Curr Opin Genet Dev 17, 177–181. Serre, D. and Pääbo, S. (2004). Evidence for gradients of human genetic diversity within and among continents. Genome Res 14, 1679–1685. Service, S., DeYoung, J., Karayiorgou, M., Roos, J.L., Pretorious, H., Bedoya, G., Ospina, J., Ruiz-Linares, A., Macedo, A., Palha, J.A. et al. (2006). Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat Genet 38, 556–560. Shanawani, H., Dame, L., Schwartz, D.A. and Cook-Deegan, R. (2006). Non-reporting and inconsistent reporting of race and ethnicity in articles that claim associations among genotype, outcome, and race or ethnicity. J Med Ethics 32, 724–728. Smith, D.J. and Lusis, A.J. (2002). The allelic structure of common disease. Hum Mol Genet 11, 2455–2461. Smith, M.W. and O’Brien, S.J. (2005). Mapping by admixture linkage disequilibrium: Advances, limitations and guidelines. Nat Rev Genet 6, 623–632. Steffens, M., Lamina, C., Illig, T., Bettecken, T., Vogler, R., Entz, P., Suk, E.-K., Toliat, M.R., Klopp, N., Caliebe, A. et al. (2006). SNPbased analysis of genetic substructure in the German population. Hum Hered 62, 20–29. Tang, H., Quertermous,T., Rodriguez, B., Kardia, S.L., Zhu, X., Brown, A., Pankow, J.S., Province, M.A., Hunt, S.C., Boerwinkle, E. et al. (2005). Genetic structure, self-identified race/ethnicity, and confounding in case-control association studies. Am J Hum Genet 76, 268–275.
The GAIN Collaborative Research Group (2007). New models of collaboration in genomewide association studies: The Genetic Association Information Network. Nat Genet 39, 1045–1051. The International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–1320. The International HapMap Consortium (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–862. The International SNP Map Working Group (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933. Thornton, K.R., Jensen, J.D., Becquet, C. and Andolfatto, P. (2007). Progress and prospects in mapping recent selection in the genome. Heredity 98, 340–348. Trinkaus, E. (2005). Early modern humans. Annu Rev Anthropol 34, 207–230. Trinkaus, E. (2007). European early modern humans and the fate of the Neandertals. Proc Nat Acad Sci USA 104, 7367–7372. Vernesi, C., Caramelli, D., Dupanloup, I., Bertorelle, G., Lari, M., Cappellini, E., Moggi-Cecchi, J., Chiarelli, B., Castrì, L., Casoli, A. et al. (2004). The Etruscans: A population-genetic study. Am J Hum Genet 74, 694–704. Voight, B.F., Kudaravalli, S., Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. PLoS Biol 4, e72. Wakeley, J. (2007). Coalescent Theory: An Introduction. Roberts & Co. Wang, W.Y., Barratt, B.J., Clayton, D.G. and Todd, J.A. (2005). Genomewide association studies: Theoretical and practical concerns. Nat Rev Genet 6, 109–118. Weale, M.E., Depondt, C., Macdonald, S.J., Smith, A., Lai, P.S., Shorvon, S.D., Wood, N.W. and Goldstein, D.B. (2003). Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: Implications for linkage-disequilibrium gene mapping. Am J Hum Genet 73, 551–565. Wilson, J.F., Weale, M.E., Smith, A.C., Gratrix, F., Fletcher, B., Thomas, M.G., Bradman, N. and Goldstein, D.B. (2001). Population genetic structure of variable drug response. Nat Genet 29, 265–269.
RECOMMENDED RESOURCES The seminal work by Cavalli-Sforza et al. (1994) is still a standard reference on the use of genetics to infer aspects of human history and demography. A useful, more up-to-date work with an emphasis on human evolution is the textbook by Jobling et al. (2003).
The HapMap website (http://www.hapmap.org/tutorials.html.en) has some nice tutorial material on the rationale and use of this important population genomic resource.
CHAPTER
3 Genomic Approaches to Complex Disease Desmond J. Smith and Aldons J. Lusis
INTRODUCTION Most of the socioeconomic burden of disease in the industrialized nations is due to complex disorders, in which multiple genes interact with the environment to produce the final disease phenotype (Smith and Lusis, 2002). Typical examples include atherosclerosis, diabetes, cancer, multiple sclerosis, autism, alcoholism, and drug abuse (Diabetes Genetics Initiative et al., 2007; Easton et al., 2007; Moffatt et al., 2007; Samani et al., 2007; Scott et al., 2007; Steinthorsdottir et al., 2007; Wellcome Trust Case Control Consortium, 2007). An understanding of complex disease involves: (1) identifying the genetic variations, whether common or rare, that contribute to the disease; (2) determining how the variations influence function and which pathways they perturb; and (3) determining how the variations interact with each other and with the environment to contribute to the disease. Even if the total genetic contribution to a complex trait is substantial, typically each gene contributes only a small amount to the final phenotype, making detection of the responsible genes an arduous process (Wellcome Trust Case Control Consortium, 2007). Further complicating such analyses are allelic heterogeneity and locus heterogeneity, in which a disease can be due to different alleles of the same gene or to alleles in different genes, respectively. Genetic interactions can also occur, in which multiple genes enhance each others’ effects on a disease (synergy) or
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
block each others’ effects (epistasis). In addition, there is much variation of the environment in human studies, further contributing to “noise” in an analysis (Smith and Lusis, 2002). The advent of genomics, in which the whole genome is the unit of study rather than individual genes, is revolutionizing the study of complex disease (Olson, 2003) and is rapidly helping establish genomic medicine as a new and exciting field. This chapter focuses on the use of genomic technologies to understand complex disease and is organized as follows: First, we discuss methods for the identification of DNA variations, both common and rare (Figure 3.1). Second, we discuss linkage and association methods for relating DNA variation to phenotypes. And lastly, we discuss how the “omic” technologies can help identify the underlying genes and organize them into pathways and networks (Figure 3.2).
IDENTIFYING COMMON AND RARE GENOMIC VARIATIONS IN THE POPULATION Varieties of Polymorphism Mapping genes became practicable in humans with the realization that polymorphic DNA sequences could be used directly as genetic markers (Botstein et al., 1980). Initially, single nucleotide polymorphisms (SNPs), assayed as restriction fragment length polymorphisms (RFLPs) on gels, were employed as markers in
Copyright © 2009, Elsevier Inc. All rights reserved. 33
34
CHAPTER 3
■
Genomic Approaches to Complex Disease
ELUCIDATING COMPLEX DISEASE Phenotype of interest Identify variants Linkage or association Mixed or isolated populations Common Allele Detection Technology (e.g., TaqMan, MS, Affymetrix, Illumina) Rare Allele Discovery Technology (e.g., 454 Life science, polony sequencing, Solexa, MPSS) Gene/allele discovery interaction variants
Proteomics
Systems biology
Genetical genomics
Fxns/pathways Animal models
Molecular subtyping
Transcript profiling
Figure 3.1 Elucidating complex disease. This flow chart parallels the discussion in the text. Genetic dissection of a disease phenotype of interest requires identification of appropriate variants for mapping. These variants will usually be common and chosen from publicly available databases. The strategy for gene identification will require decisions as to whether linkage or association should be used, which type of population should be employed, and the nature of the allele detection technology. Some phenotypes may require the use of recent approaches to rare allele identification. Once a gene responsible for a phenotype has been identified, the methods of systems biology can be used to understand the relation of the causative allele to disease.
combination with Southern blotting. However, this approach is laborious. The strategy became much more feasible with the discovery of microsatellite repeats (Weber and May, 1989). These densely spaced markers (roughly one per every 2 kb) are polymorphic because of differences in lengths of short oligonucleotide repeats. They have the advantage of being easily scored using the polymerase chain reaction (PCR) and gel electrophoresis and also have multiple alleles, which greatly improves their informativeness compared to biallelic markers such as SNPs. The use of microsatellites dominated mammalian genetics until the last several years, at which time the utility of SNPs reemerged with the invention of medium- and high-throughput approaches for analyzing these digital allelotypes. The SNPs also have the advantage that their average spacing is about one every 300 bp in the human genome, denser than microsatellites. SNP Genotyping Technologies A large number of SNP genotyping technologies have emerged over the last few years (Table 3.1). One of the first rapid methods for SNP genotyping, TaqMan technology, uses real-time PCR (Gibson et al., 1996; Heid et al., 1996; Livak, 1999). The SNPs are distinguished by reporter oligonucleotides containing a fluorescent nucleotide at the 5 end and a quenching nucleotide at the 3 end. If the oligonucleotide is a perfect match to a PCR product flanking
the SNP, the Taq polymerase used for the PCR liberates the 5 fluor through the 5 to 3 exonuclease activity of the polymerase. The liberated fluor-substituted nucleotide is no longer inhibited by proximity to the quenching fluor and fluoresces brightly. This fluorescence increases with each cycle of the PCR and is measured in each cycle that is in real time. However, if the oligonucleotide does not bind because of the presence of an SNP, there is no increase in fluorescence. Because a PCR assay has to be set up for each SNP, the method is only of moderate throughput. Another moderate- to high-throughput technology for genotyping biallelic markers employs matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) (Cantor, 2005; Griffin and Smith, 2000). Primer extension reactions are performed resulting in differing product lengths depending on the allele present for a particular SNP. The mass of the reaction products are then quantitated using the MALDI-TOF MS, permitting evaluation of which SNP is present. Throughput can be enhanced by multiplexing different assays in one sample. A dramatic increase in throughput came with the invention of photolithographically synthesized arrays of oligonucleotides for genotyping (Kreiner and Buck, 2005; Lipshutz et al., 1995). Affymetrix arrays consist of very high densities of 25-mers matching major or minor alleles of known SNPs in the human genome. Photolithographic activation of chemical groups is used to synthesize the oligonucleotides in situ. The photolithography permits a dramatic degree of spatial miniaturization with very dense packing of oligonucleotides and high-throughput analysis of SNPs. The presence or absence of a particular SNP in the target genome of interest is revealed by the presence or absence of a corresponding hybridization signal on the array. The arrays are spatially encoded, meaning that the position of an oligonucleotide on the array maps to the SNP it assays. The current feature size on these arrays permits up to 500,000 SNPs to be analyzed. Perlegen, a company that employs Affymetrix technology, has used custom-designed photolithographic arrays to query 1.5 106 SNPs (Hinds et al., 2005). Illumina arrays use sequence-encoding rather than spatialencoding for high-throughput genotyping. Beads with assay oligonucleotides are used to interrogate individual SNPs. A fluorescent extension reaction requiring correct hybridization of the assay oligonucleotide to genomic DNA detects which SNP is present (Shen et al., 2005).The small size of the beads means that hundreds of thousands of SNPs can be analyzed in parallel. Fluorescence readouts from each bead are captured using bundles of optical fibers. In addition to the assay oligonucleotide, the beads also contain decoding oligonucleotides that reveal which SNP the bead is interrogating. The decoding oligonucleotides are revealed using a series of fluorescently labeled probes. This bead-based method has been used to date to analyze up to a million SNPs in one assay. Sequencing Technologies The above genotyping technologies are high throughput but only detect already known SNPs. Most of these SNPs have been identified as a result of resequencing projects in the public domain (Wheeler et al., 2006) and represent reasonably common
Identifying Common and Rare Genomic Variations in the Population
■
35
OMIC SPACE
PC 2 Phenome PC 1
Metabolome
Proteome
Transcriptome
Genome
Figure 3.2 Omic space. Multiple partially orthogonal domains, where various sorts of biological networks, experimental findings, and model-based predictions are represented and integrated (Left). Tools for collection of genome-wide data relevant to omic space are being developed. The phenome (global phenotyping) represents much of classical biology and medicine, supplemented with exciting advances in fields such as molecular imaging (MRS, PET). Repositories for such data are in the early stages. One example is the Mouse Phenome database (http://www.jax.org/phenome). Mass spectrometry (MS) and NMR have been used for high-throughput analysis of the metabolome and for non-invasive diagnosis and screening using non-supervised algorithms such as PCA. An example of PCA used to differentiate disease and non-diseased samples is shown to the right, where PC1 and PC2 represent the first and second principal components, and red and blue points represent metabolites differentiating diseased and non-diseased samples, respectively. The proteome is studied using a number of technologies, including MS. The interactome (study of interactions between proteins) has been investigated using whole-genome yeast two-hybrid screens, literature-based analysis of pathways (e.g., gene ontology, http:// www.geneontology.org/), and co-affinity immunopurification. Protein products of genes mutated in inherited genetic disorders tend to interact with proteins involved in similar disorders (Right). Green, proteins involved in developmental disorders; blue, proteins involved in endocrine disorders. Whole-genome expression arrays have been used for transcriptome analysis, while high-density SNP chips, high-throughput sequencing, and HapMap sequences have been used at the genome level. Some databases, such as the “GenomePhenome Superbrain” (http://www.omicspace.riken.jp/), are designed to integrate data from the various omic domains. The elements (nodes) of each stage appear to have a scale-free topology (right, shown linked to transcriptome). Certain elements (hubs) have many connections while most have relatively few (see also Chapters 1 and 6).
alleles. However, the extent to which complex diseases are due to common or uncommon polymorphisms is unclear (see Common Disease/Common Variant Hypothesis below) (Smith and Lusis, 2002).
In the long term, the evaluation of rare SNPs in common complex disorders will require new sequencing technologies to affordably resequence multiple human genomes. This quest is known as “the $1000 genome” (Shendure et al., 2004). While
36
CHAPTER 3
TABLE 3.1
■
Genomic Approaches to Complex Disease
Genotyping and sequencing technologies
Technology
Applications
Organization
TaqMan
Custom SNPs, 102–103 assays
Applied Biosystems (https://www2. appliedbiosystems. com/)
Mass spectrometry Custom SNPs, 102–103 assays
Sequenom (http: //www.sequenom. com)
Photolithographic arrays
Predesigned SNPs, Affymetrix (http:// 105–106 assays www.affymetrix.com) Perlegen (http:// www.perlegen.com/)
Illumina arrays
Predesigned SNPs, Illumina (http:// 105–106 assays www.illumina.com)
Pyrosequencing
SNP discovery
454 Life Sciences (http://www.454. com)
Solexa sequencing arrays
SNP discovery
Illumina (http:// www.illumina.com)
Massively parallel signature sequencing
SNP discovery
Illumina (http:// www.illumina.com)
Single molecule DNA sequencing
SNP discovery
Helicos Biosciences (http://helicosbio. com)
certainly inspiring, it will likely be many years before this goal is achieved. Nevertheless, recent advances in rapid sequencing may improve our discovery of rare and private alleles (Table 3.1) (see Chapter 7). All of the developments increase speed by avoiding laborious molecular cloning steps in Escherichia coli. In one method developed by 454 Life Sciences, individual genomic DNA molecules are cloned on beads (Margulies et al., 2005). The beads are isolated from one another using oil/water emulsions to create spherical lipid bilayer vesicles about the size of a microorganism. Polymerase-mediated amplification of the DNA molecules on the beads can then occur without cross-contamination (emulsion PCR). Sequence information from each bead is then obtained using pyrosequencing. In this method, release of a pyrophosphate during incorporation of a nucleotide in an extension reaction is detected by light emission from a luciferase reaction. The small size of the beads allows large numbers of sequences to be obtained in parallel in picoliter reactors. Acquisition of the pyrosequencing output uses fiber optics. The throughput is about 25 Mb in an afternoon, although the average sequence reads are less than 100 bp. Polony sequencing is another approach using emulsion PCR, in which the individual beads are immobilized in a thin film of polyacrylamide after DNA amplification (Shendure et al.,
2005). The term “polony” is an abbreviation of “polymerase colony”. The DNA molecules represent self-ligated molecules which permit the acquisition of mate-paired sequence tags, that is, sequences from the two ends of the ligated molecule. Fluorescent sequencing by ligation in combination with a light microscope allows sequence tags of about 13 bp to be obtained from each end of the self-ligated molecule, a total of 26 bp. This method appears to be particularly useful for detecting translocations and inversions. Another approach, used by Solexa, employs PCR amplification of individual DNA molecules (Bennett et al., 2005). The individual PCR products remain localized to a small region of a glass slide because of the presence of capture oligonucleotides. The amplification results in “colonies” of different individual DNA molecule at a high spatial density on the glass slide. Fluorescent dideoxy sequencing reactions are then employed to interrogate each of the colonies in parallel, giving high-throughput sequences. Another bead-based sequencing method is massively parallel signature sequencing, MPSS (Brenner et al., 2000). A population of DNA molecules is ligated en masse to a highly complex mixture of identifying oligonucleotides. Amplification by PCR follows. Purification of multiple copies of a single class of DNA occurs when individual identifying oligonucleotides are captured by their complement on a bead. Sequence from each bead is then obtained by sequential hybridization reactions. The method gives sequence ladders of about 20 bp. A fourth method does not employ any type of DNA amplification, but rather uses fluorescent dideoxy sequencing of single DNA molecules (Braslavsky et al., 2003). The data capture uses a light microscope and the potential degree of parallelism is extremely high. Sequence reads of up to 5 bp have been obtained. At present, there is tremendous activity in developing cheaper, faster, and more accurate sequencing methods. Whether the next generation of sequencing tools uses one of the strategies described above or some completely unexpected technology, it seems clear that the contribution of rare and private polymorphisms to common genetic disease will become better understood in the coming years.
RELATING DNA VARIATION TO PHENOTYPES Candidate Gene Approach In the early days of complex trait genetics, the mid-1980s, the candidate gene approach was widely used. This approach evaluates whether common variations of biochemically defined candidate genes are associated with a particular disease by evaluating the frequency of various alleles in case versus control populations (Table 3.2). One of the first important examples was the gene for apolipoprotein E (APOE) which was found to exhibit three common alleles (E2, E3, and E4). The E2 form was originally found to be strongly associated with a relatively rare
Relating DNA Variation to Phenotypes
TABLE 3.2
■
37
Examples of sequence variants implicated in complex disease
Disease phenotype
Gene/sequence varianta
Identification strategy
Technology
References
Hyperlipidemia (type III)
ApoE2
Association/candidate gene
Isoelectric focusing
Wardell et al., 1982
Alzheimer’s disease
ApoE4
Association/candidate gene
PCR/restriction isotyping
Strittmatter et al., 1993
Diabetes mellitus (type II)
Calpain 10
Linkage/association
Sequencing, PCR-RFLP
Horikawa et al., 2000
Crohn’s disease
NOD2
Linkage/LD, Linkage, TDT
Microsatellites/ sequencing, allelespecific PCR
Hugot et al., 2001; Ogura et al., 2001
Familial combined hyperlipidemia
USF1
Linkage/isolated population/association
Sequencing
Pajukanta et al., 2004
Myocardial infarct risk
Lymphotoxin alpha
Association/LD
Sequencing/PCR isotyping
Ozaki et al., 2002
Macular degeneration
Complement factor H
Association
Affymetrix arrays
Klein et al., 2005
Low HDL cholesterol
ABCA1, APOA1, LCAT
Association
Sequencing
Cohen et al., 2004
Low LDL cholesterol
NPC1L1
Association
Sequencing/TaqMan
Cohen et al., 2006
Obesity
SLC6A14
Linkage/LD/isolated population/association
Microsatellites/ sequencing/MS
Suviolahti et al., 2003
Diabetes mellitus (type II)
TCF7L2
Linkage/isolated population/association
Microsatellites
Grant et al., 2006
Myocardial infarct risk
LTA4H
Isolated population/association
Sequencing
Helgadottir et al., 2006
Schizophrenia
NRG1
Linkage/isolated population/TDT
Fluorescence polarization
Stefansson et al., 2002
Crohn’s disease
5q31 cytokine gene cluster
Linkage/LD
Microsatellites/ sequencing
Rioux et al., 2001
Breast cancer
FGFR2, TNRC9/ LOC643714, MAP3K1, rs13281615 (8q), LSP1
Association
Affymetrix arrays/TaqMan/MS
Easton et al., 2007
Diabetes mellitus (type II)
TCF7L2, CDKAL1
Association
Illumina
Steinthorsdottir et al., 2007
Diabetes mellitus (type II)
9p21.3 (CDKN2A/ CDKN2B), IGF2BP2, CDKAL1, HHEX, SLC30A8
Association
Affymetrix
Diabetes Genetics Initiative et al., 2007; Zeggini et al., 2007
Diabetes mellitus (type II)
9p21.3 (CDKN2A/ CDKN2B), IGF2BP2, CDKAL1, HHEX, SLC30A8, TCF7L2, FTO PPARG, KCNJ11
Association/isolated population
Illumina
Scott et al., 2007
Myocardial infarct
9p21.3 (CDKN2A/ CDKN2B), MTHFD1L, PSRC1, MIA3, SMAD3, LPL
Association
Affymetrix
Samani et al., 2007
(continued )
38
CHAPTER 3
TABLE 3.2
■
Genomic Approaches to Complex Disease
(Continued)
Disease phenotype
Gene/sequence varianta
Identification strategy
Technology
References
Asthma
ORMDL3
Association
Illumina
Moffatt et al., 2007
Bipolar disease
16p12, KCNC2, GABRB1, GRM7, SYN3
Association
Affymetrix
Wellcome Trust Case Control Consortium, 2007
Myocardial infarct
9p21.3 (CDKN2A/ CDKN2B), MTHFD1L, ADAMTS17
Association
Affymetrix
Wellcome Trust Case Control Consortium, 2007
Crohn’s disease
NOD2, 5q31 cytokine gene cluster, IL23R, ATG16L1, 10q21, 5p13.1, IRGM, NKX2-3, PTPN2
Association
Affymetrix
Wellcome Trust Case Control Consortium, 2007
Rheumatoid arthritis
HLA-DRB1, PTPN22, IL2RA, IL2RB, TNFAIP2, GZMB, PRKCQ, KAZALD1, 12q24
Association
Affymetrix
Wellcome Trust Case Control Consortium, 2007
Diabetes mellitus (type I)
MHC, CTLA-4, PTPN22, IL2RA, IFIH1, 12q13, 12q24, 16p13, PTPN2, IL2 receptor
Association
Affymetrix
Wellcome Trust Case Control Consortium, 2007
Diabetes mellitus (type II)
PPARG, KCNJ11, TCF7L2, FTO, CDKAL1, HHEX,
Association
Affymetrix
Wellcome Trust Case Control Consortium, 2007
a
Where the responsible gene is unclear, the relevant chromosome region is given.
hyperlipidemia (Type III) and with low cholesterol levels in the population (Wardell et al., 1982). Later, the E4 allele was found to be strikingly associated with Alzheimer’s disease (Strittmatter et al., 1993). The candidate gene approach continues to be employed extensively, as thousands of association studies are reported each year. However, the strategy has some important problems. First, of course, the gene has to first be linked to the disease through biochemical studies or mechanistic hypotheses. Second, association tests are prone to artifacts due to population stratification and ethnic admixture. One way to overcome this latter problem is to perform family-based association testing such as the transmission disequilibrium test (TDT) (Spielman and Ewens, 1998). This test evaluates the role of a polymorphism in a disorder by assessing its frequency in affected versus non-affected siblings. The idea is that, since the siblings are raised in the same family, most of the differences in susceptibility will be due to genetics rather than to the environment. Linkage In the early 1990s, linkage analysis employing microsatellite markers became a common approach to map loci for complex diseases. In this approach, families with multiple affected members are phenotyped and genotyped. Regions of the genome relevant to the disease can then be detected using allele association or parametric approaches. One advantage of this approach is that a simplified
set of genes may be responsible for the disorder within a particular family compared to the population as a whole. However, this advantage melts away when multiple families are required, as is nearly always the case to obtain adequate statistical power. Another substantial disadvantage of linkage analysis in humans is the cost and difficulty of identifying multiple multigenerational families with the disorder of interest. In addition, mapping power in this design is limited and the responsible genes are frequently poorly localized to regions covering many megabases. A common paradigm since the late 1990s has been to carry out linkage analysis followed by targeted association testing of candidate genes in promising regions or to test for linkage disequilibrium (LD) with genetic markers in the region. Linkage disequilibrium is a situation in which neighboring alleles are linked more often than expected by chance, either because of low recombination rates between the markers or because of expansion of a founder population in which the alleles were originally linked. The first successful example of the twostep strategy was the identification of the calpain 10 gene as a causal gene in type 2 diabetes in a large set of Mexican families, although this finding remains somewhat controversial (Horikawa et al., 2000). A very clear subsequent success was the identification of the NOD2 gene as an important contributor to Crohn’s disease (Hugot et al., 2001; Ogura et al., 2001). Another example is the identification of the USF1 gene as a cause of familial combined hyperlipidemia (Pajukanta et al., 2004).
Relating DNA Variation to Phenotypes
Genome-wide Association Although the combination of linkage and association has been successful in several instances, only a handful of genes have been identified to date. However, with the advent of cheaper methods for the detection of polymorphisms, genome-wide association studies have become feasible (Hirschhorn, 2005; Hirschhorn and Daly, 2005) (see Chapter 8). Thus, association, the strategy first used for genetic dissection of human traits, has come back to center stage. The first example was a study of heart disease by Ozaki et al. (2002), who typed thousands of SNPs in thousands of individuals in Japan (Ozaki et al., 2002). The several genes exhibiting strong evidence of association with heart disease were then typed in a second set of individuals, and one gene, encoding lymphotoxin-, was found to exhibit a consistent association. Another very striking success was the identification of complement factor H in age-related macular degeneration (Klein et al., 2005). In that study, a relatively small group of individuals (96 cases, 50 controls) were typed for 116,204 SNPs, and a common non-synonymous SNP was found to be strongly associated with the disease even after a conservative Bonferroni correction for multiple comparisons. Such whole-genome association studies are rapidly becoming the major source of information on complex disorders in humans as the available technologies become more widely used and are allowing interrogation of 105–106 SNPs. Very recent successful examples include the identification of loci for breast cancer (Easton et al., 2007), type II diabetes (Diabetes Genetics Initiative et al., 2007; Scott et al., 2007; Steinthorsdottir et al., 2007; Zeggini et al., 2007), coronary artery disease (Samani et al., 2007), asthma (Moffatt et al., 2007), and a large study which identified loci contributing to bipolar disorder, coronary artery disease, Crohn’s disease, rheumatoid arthritis, and type 1 and type 2 diabetes (Wellcome Trust Case Control Consortium, 2007). Most gratifying is the high degree of agreement between different studies for identified loci for the same phenotype. Also interesting was the fact that large majority of the loci exerted small effects on the final disease risk (see Chapter 8). Copy Number Variation Another recent realization emerging from the use of microarray genotyping is the prevalence of copy number variants (CNVs) which constitute about 10% of the human genome (e.g., Sebat et al., 2004) (see Chapter 9). The CNVs are type of polymorphism in which segments of DNA with an average length of 500 kb show variation in copy number either due to deletion or duplication. This source of variation is beginning to be explored with respect to common complex disorders, one recent example being autism (Sebat et al., 2007). Common Disease/Common Variant Hypothesis The success of modern genotyping technologies in analyzing common complex disorders depends partly upon the correctness of the common disease/common variant hypothesis. This hypothesis states that common genetic diseases are due
■
39
to common polymorphisms. On the face of it, this hypothesis seems plausible, and there is evidence that it is at least partially true for many common complex diseases (Hirschhorn, 2005). However, it now appears that a significant fraction of common disease will be due to many different rare mutations. In fact, this was observed in a subset of the NOD2 mutations underlying Crohn’s disease (Hugot et al., 2001). More recently, studies of common variations in plasma lipoprotein levels, involving sequencing of candidate genes in many individuals, have clearly demonstrated the importance of rare alleles (Cohen et al., 2004, 2006). Thus, efficient sequencing approaches are likely to be important in understanding at least some forms of complex traits. Although there have been tremendous recent advances in genotyping speed, the throughput is still insufficient to interrogate all possible polymorphisms over the entire genome. In addition, even modern genotyping technologies will not easily identify rearrangements such as inversions, which may be more common than previously thought (Tuzun et al., 2005). Estimates based on mutation rates suggest that nearly every nucleotide in the genome is altered in at least one individual on the planet (see Chapter 1). Isolated Populations As described in a previous section, one of the major stumbling blocks in human genetics is allelic heterogeneity. One potentially helpful approach in analyzing complex human genetic traits is to use isolated populations. These populations are thought to have an extremely small number of founder individuals, often estimated as being in the hundreds. This bottleneck means that a much smaller number of genes and alleles cause the disease phenotype, simplifying the genetic heterogeneity of many complex diseases. Useful isolated populations have been identified in the Hutterites of rural North America, as well as populations in Finland and Iceland. Some theoretical studies suggest that for common complex disorders the advantages of using an isolated population may be blunted compared to that for rare alleles (Hirschhorn and Daly, 2005). This is because common disease alleles will be proportionally represented even in a relatively small founding population. Nevertheless, the promise of isolated populations has been fulfilled in at least a few examples. These include the identification of genes involved in familial combined hyperlipidemia (Pajukanta et al., 2004) and obesity (Suviolahti et al., 2003) in Finland and genes involved in type 2 diabetes (Grant et al., 2006), myocardial infarction (Helgadottir et al., 2006), and schizophrenia in Iceland (Stefansson et al., 2002). HapMap Project Human genetic analysis is currently limited by both genotyping power and ascertainment of sufficiently large study populations for analysis. Strategies are therefore needed to maximize the information obtained from these precious datasets. It was observed that contiguous regions of the mammalian genome tend to remain linked together on the chromosome in which
40
CHAPTER 3
■
Genomic Approaches to Complex Disease
they reside (Rioux et al., 2001). These regions in LD are called haplotype blocks and their existence led to the realization that one SNP, a so-called “tag SNP”, could in principle effectively “stand in” for a larger number of adjacent SNPs. It was conjectured that using carefully chosen tag SNPs could extend the power of present genotyping technologies, thus improving coverage of the genome (see Chapter 2). These observations led to the international HapMap project, an endeavor to comprehensively identify haplotype blocks in the human genome. The main problem in decoding haplotypes is that genotype information represents the admixture of two different linked haplotypes from the parental chromosomes. The admixed “messages” have to be deconvoluted to provide haplotype information. Two distinct approaches have been taken. In the publicly funded approach, genotyping of 269 individuals from four different ethnic human groups was used to identify blocks of SNP alleles that tend appear together more often that would be expected by chance (Altshuler et al., 2005). The other approach taken by the company Perlegen used purification of individual chromosomes by somatic cell genetics, so that the haplotypes could be read out directly (Hinds et al., 2005; Patil et al., 2001). In both approaches, blocks were found to extend for varying lengths in the genome, ranging from 1 kb to 100 kb. Despite this variation, the data should allow more efficient design of genotyping assays. In addition, the uncovered haplotype data will likely be useful in understanding the history and structure of human populations throughout the world. Quantitative Trait Locus Mapping in Model Organisms Animal models, particularly strains of mice and rats, exhibit common variations in traits that are relevant to most common diseases. The mapping of loci contributing to such traits using quantitative trait locus (QTL) analysis is straightforward (Flint and Mott, 2001). At present, thousands of loci for hundreds of complex traits have been mapped, but, apart from positional candidates, only a handful of genes have been identified. The QTL studies in mice suffer from some of the same problems as linkage in humans: the loci identified are usually very large and the underlying genes usually explain only a small fraction of the variance of the trait. Flint et al. (2005) provide an excellent summary of the status of QTL mapping and of new strategies. As discussed below, genome-wide “omics” technologies are expected to greatly facilitate disease gene identification in studies with animal models.
INTEGRATION OF “OMIC” TECHNOLOGIES WITH GENETICS Systems Biology High-throughput assays for DNA sequence variation, transcript levels, protein levels, protein interactions, and metabolites, in combination with data integration methods, are beginning to
transform studies of complex disease and of biology in general. This area of research has become broadly known as systems biology, the scientific discipline that seeks to quantify all of the elements of a biological system and assess their interactions (Hood et al., 2004). An important concept is that the interaction of the parts of a system creates properties or functions that would not be expected from looking at the individual components on their own (Table 3.3) (see Chapter 6). Transcriptomic Analysis The “omic” technology that has had the largest impact on complex disease is the measurement of transcript levels using highdensity microarrays.The original arrays consisted of spotted cDNA capture probes, but the more recent technologies, including Affymetrix and Illumina, utilize synthetic oligonucleotides as capture probes (discussed in Pravenec et al., 2003) (see Chapter 13). In any segregating population, there will be hundreds or thousands of genetic variations that affect gene expression. It seems clear that these variations in expression will make important contributions to complex disease. With the development of microarray technology, it is now possible to quantify transcript levels globally and to integrate such data with information relating to disease. One approach has been to map a critical genomic region for a disease using affected and unaffected individuals and then pinpoint the responsible gene by identifying differentially expressed genes in the critical region. For example, in the rat animal model, a combination of congenic mapping and microarray analysis identified the fatty acid translocase Cd36 gene as a cause of insulin resistance and hypertriglyceridemia (Aitman et al., 1999). In the case of complex, heterogeneous disorders, no one gene would necessarily be expected to exhibit altered expression in cases versus controls, but certain pathways might be perturbed. For example, Mootha and colleagues developed an analytic strategy, termed “Gene Set Enrichment Analysis,” to detect modest but coordinate changes in the expression of functionally related genes (Mootha et al., 2003). They applied the method to muscle biopsies from diabetic versus control individuals. Although no single gene was significantly associated with diabetes, the mitochondrial and oxidative phosphorylation pathways were significantly enriched in genes that were modestly perturbed in diseased individuals. Genetical Genomics A particularly powerful way of integrating expression and clinical data is to apply genetics, an approach that has been called “genetical genomics” (Jansen and Nap, 2001). Thus, if transcript levels are measured in segregating populations, the loci controlling transcript levels can be mapped using the same approaches that have been used to map loci for complex clinical traits. The loci controlling transcript abundance have been termed “expression QTLs,” or eQTLs (Schadt et al., 2003). When an eQTL encompasses the physical location of the gene encoding that transcript, it is likely that the causative genetic variation resides
Integration of “Omic” Technologies with Genetics
TABLE 3.3
■
41
Systems biology and genomic medicine
Disease phenotype
Systems biology approach
Insights
Technology
References
Insulin resistance/ hypertriglyceridemia
Transcript analysis
Cd3 gene causative
Microarrays
Aitman et al., 1999
Diabetes mellitus (type II)
Gene set enrichment analysis
Mitochondrial/oxidative phosphorylation plays a role
Microarrays
Mootha et al., 2003
Obesity
Genetical genomics
Molecular subtyping can better define complex phenotypes
Microarrays
Schadt et al., 2003
Hypertension
Genetical genomics
Correlation of biological and expression phenotypes
Microarrays
Hubner et al., 2005
Obesity, diabetes
Genetical genomics
Correlation of biological and expression phenotypes
Microarrays
Lan et al., 2006
Embryonic viability
Interactome mapping
Hub genes more likely to be essential for survival
Yeast 2 hybrid, co-affinity purification, literature curation
Gandhi et al., 2006
Inherited disease
Interactome mapping
Gene products from similar disorders tend to interact
Yeast 2 hybrid, co-affinity purification, literature curation
Gandhi et al., 2006
Cancer
Interactome mapping
Oncogenes can be related to known signal transduction pathways
Yeast 2 hybrid, co-affinity purification, genetic crosses in C. elegans
Tewari et al., 2004
within the gene itself (i.e., the transcript is likely to be regulated in cis). Conversely, when an eQTL does not encompass the physical location of the gene, the transcript must be regulated in trans (i.e., under the control of a different gene). Thousands of such eQTL have been mapped in various crosses in mice, as well as other experimental organisms, and less detailed maps have been produced in studies of cells from human pedigrees (Cheung et al., 2005; Doss et al., 2005; Ghazalpour et al., 2005; Hubner et al., 2005; Schadt et al., 2003). One of the initial surprises in these studies has been the extent of variation at the level of transcript levels. In recent studies in mice, where experiments are greatly simplified compared to human studies, a large F2 intercross between two common laboratory strains of mice revealed 6676 significant eQTLs (p 5 105) in liver, over a thousand of which were cis-acting eQTLs (Wang et al., 2006). Similarly, large numbers of common variations are likely to exist in human populations. In the near future, one can envision a genome-wide database relating common DNA variation with common transcript level variations (in other words, a map, such as the HapMap of all common variations in transcript levels). Obviously, such a map would help in prioritizing candidate gene for loci contributing to complex diseases (i.e., genes with cis-eQTLs at the
locus of interest would constitute a restricted set of candidates). Also, it might be expected that transcript levels of the causative genes will be significantly correlated with clinical trait values, so determining Pearson or Spearman correlation coefficients between these may be helpful. Genetical genomic data could also be used in a converse manner to exclude potential candidate genes (Lan et al., 2006). There are two important caveats in considering the relationship of an eQTL and a clinical trait. First, gene expression patterns are cell- and tissue-specific, and certain inaccessible tissues may be required to understand a complex trait (although it has been observed that cis-eQTLs usually tend to affect gene expression in all tissues in which that gene is expressed). The second is that the regulation of biologic processes occurs via control of gene expression in many, but certainly not all, situations. Molecular Subtyping Another use of genomic and other “omic” technologies is to define distinct molecular subtypes for a clinical phenotype. This has been particularly useful in the case of cancers (Bucca et al., 2004), but it should be applicable to most complex disorders. For example, in one study examining liver expression array data
42
CHAPTER 3
■
Genomic Approaches to Complex Disease
in an intercross, distinct subgroups of mice were identified with respect to body fat (Schadt et al., 2003). The array data were consistent with the idea that the classically defined trait of fat pad mass actually consists of at least two molecularly distinct subgroups. Analyzing these two subgroups separately gave much greater sensitivity in detecting loci for obesity. While such an approach has yet to be employed in human complex traits, there is a strong interest in using high-dimensional data methods, such as microarrays and mass spectrometry, to identify biomarkers for disease (Bailey and Ulrich, 2004). It is conceivable that some of these investigations may detect different subclasses of disease that would usually be lumped into one group using classical methods. Separate genetic analysis of these subclasses may well give greater sensitivity to detect the genes responsible for complex human conditions. A particularly powerful application of genetical genomics is to identify genes that are causally related to clinical traits. For example, in a recent study of a cross between two strains of mice differing in adiposity, Schadt and colleagues used liver expression array data to identify genes likely to be causally involved in visceral fat levels (Schadt et al., 2005). It was reasoned that genes controlling adiposity would likely also control expression of genes determining fat levels. However, correlation of gene expression levels with adiposity could represent an effect rather than a cause of obesity. Using mathematical conditioning, it was possible to distinguish between these possibilities. Some of the top candidates were then examined using transgenic mice either overexpressing or underexpressing the causal genes, and it was observed that a large fraction of those genes did indeed influence adiposity. Proteomics The above approaches were made possible by the high-throughput analysis of the transcriptome. The analysis of the full proteome has not yet been achieved, but rapid progress is being made (Domon and Aebersold, 2006). It is clear that the number of distinct proteins in mammals will certainly greatly exceed the number of genes, and proteomes will exhibit large and dynamic complexity. While proteomic and systems approaches have had only a limited role in understanding complex disease to date, this situation is bound to change as the depth and reliability of the databases continues to improve. Interactome The connections between gene products can be mapped using genome-scale surveys of protein-binding partners. The universe of protein–protein interactions is known as the interactome. Two widely used technologies for charting this space are yeast twohybrid screening and co-affinity purification combined with detection by MS (Cusick et al., 2005). Yeast (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000), worms (Li et al., 2004), flies (Formstecher et al., 2005; Giot et al., 2003; Stanyon et al., 2004), and mammalian cells (Rual et al., 2005) have been investigated using these methods. For example, a recent study used yeast two-hybrid screening to examine all pairwise interactions
among 8100 human genes (38% of all genes) and detected 2800 interactions (Rual et al., 2005). A major disadvantage of high-throughput protein interaction surveys is their limited reliability. False-positive and false-negative results are currently estimated at 40–50% for both yeast two-hybrid and MS methods (de Silva and Stumpf, 2005; Rual et al., 2005). Another strategy to map genetic networks employs curation of protein interactions from the literature (Gandhi et al., 2006; Reguly et al., 2006). These datasets are thought to have higher reliability since many interactions have been verified by independent methods from more than one group. The Human Protein Reference Database (HPRD) contains 8800 proteins (42% of all genes) and 25,000 interactions (Gandhi et al., 2006). Nevertheless, curation databases suffer from selection bias and do not randomly sample all protein interactions. Furthermore, comparison of yeast two-hybrid data and literature-derived interactions suggests that the literature-curated data can also suffer from variable quality (Rual et al., 2005). The networks derived from interaction mapping are often scale-free. Such networks show a power law distribution relating the probability of interactions, P(k), of a node on the network with k other nodes: P(k) k, where = 2 (Carter et al., 2004). A feature of scale-free networks is the presence of a few highly connected nodes or hubs, representing genes that are highly connected to many other genes. This network class is robust to random disconnection of individual nodes, but is vulnerable to disruption of highly connected hubs. Correspondingly, genes in hubs are found significantly more likely to be essential for survival (Gandhi et al., 2006). Both scale-free and classical random graphs display “small world” behavior, in which moving between any two nodes on the network takes far fewer steps (edges) than naively expected. Interactome Insights Some interesting insights related to disease have emerged from analysis of these networks. For example, the HPRD revealed that protein products of genes mutated in inherited genetic disorders tend to interact with proteins involved in similar disorders. This observation suggested that certain gene subnetworks are involved in particular diseases and can help guide the search for new genetic culprits. In another example, it was recently shown that combining yeast two-hybrid data, co-affinity purification data, and genetic experiments in worms gave a much sharper outline of the TGF- signaling pathway than using any one dataset alone (Tewari et al., 2004). A total of nine new TGF- signaling related genes were found, of which seven were conserved in humans. One of these had functional homology to the human SNO/SKI oncoprotein. Metabolomics No less promising than proteomics in its relation to early diagnosis and treatment of disease is metabolomics (Claudino et al., 2007) (see Chapter 15). It is estimated that there are approximately 3,000 metabolites, compared with 20,000 genes. The
References
main analytic tool for the metabolome is proton nuclear magnetic resonance spectroscopy (1H NMR). Another useful approach is MS, which has greater processing cost per sample but is thought to have more sensitivity. Metabolomic approaches have been used successfully for non-invasive diagnosis of a number of disorders, including breast cancer, hepatitis, ovarian cysts, and ovarian cancer (Claudino et al., 2007). Many of these studies used unsupervised analytic approaches such as principal components analysis (PCA) to distinguish diseased and nondiseased samples (Griffin, 2006) (Figure 3.2).
■
43
to fruition with the first complete genome sequences of individual members of many species. With the goal of the $1000 genome on the horizon, it is probable that relatively soon we will have complete genome sequences for many individuals within one species. Recent genomic advances in transcript profiling, proteomics, metabolomics, and molecular imaging suggest that comprehensive knowledge in orthogonal descriptive directions is also possible (Figure 3.2). This is systems biology. The challenge now is to integrate these diverse modes of knowledge to a better understanding of complex genetics in both health and disease and make the field of genomic medicine a practical reality.
CONCLUSIONS AND PROSPECTS In his famous textbook Molecular Biology of the Gene (Watson, 1965), James Watson pointed out that the genome is not of infinite size (Olson, 2003). He therefore argued (within physical limits) that in principle we could know everything about the molecular description of a cell or organism. This vision has come
ACKNOWLEDGEMENTS Supported by NIH R01 NS050148, MH71779, NS043562 (D.J. Smith) and NIH HL30568, HL30568, HL28481, DK071673 (A.J. Lusis).
REFERENCES Aitman, T.J., Glazier, A.M., Wallace, C.A., Cooper, L.D., Norsworthy, P.J., Wahid, F.N., Al-Majali, K.M., Trembling, P.M., Mann, C.J., Shoulders, C.C. et al. (1999). Identification of Cd36 (Fat) as an insulin-resistance gene causing defective fatty acid and glucose metabolism in hypertensive rats. Nat Genet 21, 76–83. Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S., Daly, M.J. and Donnelly, P. (2005). A haplotype map of the human genome. Nature 437, 1299–1320. Bailey, W.J. and Ulrich, R. (2004). Molecular profiling approaches for identifying novel biomarkers. Expert Opin Drug Saf 3, 137–151. Bennett, S.T., Barnes, C., Cox, A., Davies, L. and Brown, C. (2005). Toward the $1000 human genome. Pharmacogenomics 6, 373–382. Botstein, D., White, R.L., Skolnick, M. and Davis, R.W. (1980). Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32, 314–331. Braslavsky, I., Hebert, B., Kartalov, E. and Quake, S.R. (2003). Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci USA 100, 3960–3964. Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M. et al. (2000). Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18, 630–634. Bucca, G., Carruba, G., Saetta, A., Muti, P., Castagnetta, L. and Smith, C.P. (2004). Gene expression profiling of human cancers. Ann N Y Acad Sci 1028, 28–37. Cantor, C.R. (2005). The use of genetic SNPs as new diagnostic markers in preventive medicine. Ann N Y Acad Sci 1055, 48–57. Carter, S.L., Brechbuhler, C.M., Griffin, M. and Bond, A.T. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20, 2242–2250. Cheung, V.G., Spielman, R.S., Ewens, K.G., Weber, T.M., Morley, M. and Burdick, J.T. (2005). Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369.
Claudino, W.M., Quattrone, A., Biganzoli, L., Pestrin, M., Bertini, I. and Di Leo, A. (2007). Metabolomics: Available results, current research projects in breast cancer, and future applications. J Clin Oncol 25, 2840–2846. Cohen, J.C., Kiss, R.S., Pertsemlidis, A., Marcel, Y.L., McPherson, R. and Hobbs, H.H. (2004). Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869–872. Cohen, J.C., Pertsemlidis, A., Fahmi, S., Esmail, S., Vega, G.L., Grundy, S.M. and Hobbs, H.H. (2006). Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. Proc Natl Acad Sci USA 103, 1810–1815. Cusick, M.E., Klitgord, N.,Vidal, M. and Hill, D.E. (2005). Interactome: Gateway into systems biology. Hum Mol Genet 14 (Spec No. 2), R171–R181. de Silva, E. and Stumpf, M.P. (2005). Complex networks and simple models in biology. J R Soc Interface 2, 419–430. Diabetes Genetics Initiative et al. (2007). Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316, 1331–1336. Domon, B. and Abersold, R. (2006). Mass spectrometry and protein analysis. Science 312, 212–217. Doss, S., Schadt, E.E., Drake, T.A. and Lusis, A.J. (2005). Cis-acting expression quantitative trait loci in mice. Genome Res 15, 681–691. Easton, D.F., Pooley, K.A., Dunning, A.M., Pharoah, P.D., Thompson, D., Ballinger, D.G., Struewing, J.P., Morrison, J., Field, H., Luben, R. et al. (2007). Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447, 1087–1093. Flint, J. and Mott, R. (2001). Finding the molecular basis of quantitative traits: Successes and pitfalls. Nat Rev Genet 2, 437–445. Flint, J., Valdar, W., Shifman, S. and Mott, R. (2005). Strategies for mapping and cloning quantitative trait genes in rodents. Nat Rev Genet 6, 271–286. Formstecher, E., Aresta, S., Collura, V., Hamburger, A., Meil, A., Trehin, A., Reverdy, C., Betin, V., Maire, S., Brun, C. et al. (2005).
44
CHAPTER 3
■
Genomic Approaches to Complex Disease
Protein interaction mapping: A Drosophila case study. Genome Res 15, 376–384. Gandhi, T.K., Zhong, J., Mathivanan, S., Karthick, L., Chandrika, K.N., Mohan, S.S., Sharma, S., Pinkert, S., Nagaraju, S., Periaswamy, B. et al. (2006). Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 38, 285–293. Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147. Ghazalpour, A., Doss, S., Sheth, S.S., Ingram-Drake, L.A., Schadt, E.E., Lusis, A.J. and Drake, T.A. (2005). Genomic analysis of metabolic pathway gene expression in mice. Genome Biol 6, R59. Gibson, U.E., Heid, C.A. and Williams, P.M. (1996). A novel method for real time quantitative RT-PCR. Genome Res 6, 995–1001. Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y.L., Ooi, C.E., Godwin, B., Vitols, E. et al. (2003). A protein interaction map of Drosophila melanogaster. Science 302, 1727–1736. Grant, S.F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., Helgason, A., Stefansson, H., Emilsson, V., Helgadottir, A. et al. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet 38, 320–323. Griffin, J.L. (2006). The Cinderella story of metabolic profiling: Does metabolomics get to go to the functional genomics ball? Philos Trans R Soc Lond B Biol Sci 361, 147–161. Griffin, T.J. and Smith, L.M. (2000). Single-nucleotide polymorphism analysis by MALDI-TOF mass spectrometry. Trends Biotechnol 18, 77–84. Heid, C.A., Stevens, J., Livak, K.J. and Williams, P.M. (1996). Real time quantitative PCR. Genome Res 6, 986–994. Helgadottir, A., Manolescu, A., Helgason, A., Thorleifsson, G., Thorsteinsdottir, U., Gudbjartsson, D.F., Gretarsdottir, S., Magnusson, K.P., Gudmundsson, G., Hicks, A. et al. (2006). A variant of the gene encoding leukotriene A4 hydrolase confers ethnicity-specific risk of myocardial infarction. Nat Genet 38, 68–74. Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A. and Cox, D.R. (2005). Whole-genome patterns of common DNA variation in three human populations. Science 307, 1072–1079. Hirschhorn, J.N. (2005). Genetic approaches to studying common diseases and complex traits. Pediatr Res 57, 74R–77R. Hirschhorn, J.N. and Daly, M.J. (2005). Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6, 95–108. Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K. et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183. Hood, L., Heath, J.R., Phelps, M.E. and Lin, B. (2004). Systems biology and new technologies enable predictive and preventative medicine. Science 306, 640–643. Horikawa, Y., Oda, N., Cox, N.J., Li, X., Orho-Melander, M., Hara, M., Hinokio, Y., Lindner, T.H., Mashima, H., Schwarz, P.E. et al. (2000). Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus. Nat Genet 26, 163–175. Hubner, N., Wallace, C.A., Zimdahl, H., Petretto, E., Schulz, H., Maciver, F., Mueller, M., Hummel, O., Monti, J., Zidek, V. et al.
(2005). Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat Genet 37, 243–253. Hugot, J.P., Chamaillard, M., Zouali, H., Lesage, S., Cezard, J.P., Belaiche, J., Almer, S., Tysk, C., O’Morain, C.A., Gassull, M. et al. (2001). Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature 411, 599–603. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98, 4569–4574. Jansen, R.C. and Nap, J.P. (2001). Genetical genomics: The added value from segregation. Trends Genet 17, 388–391. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M., Mayne, S.T. et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389. Kreiner, T. and Buck, K.T. (2005). Moving toward whole-genome analysis: a technology perspective. Am J Health Syst Pharm 62, 296–305. Lan, H., Chen, M., Flowers, J.B., Yandell, B.S., Stapleton, D.S., Mata, C.M., Mui, E.T., Flowers, M.T., Schueler, K.L., Manly, K.F. et al. (2006). Combined expression trait correlations and expression quantitative trait locus mapping. PLoS Genet 2, e6. Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.O., Han, J.D., Chesneau, A., Hao, T. et al. (2004). A map of the interactome network of the metazoan C. elegans. Science 303, 540–543. Lipshutz, R.J., Morris, D., Chee, M., Hubbell, E., Kozal, M.J., Shah, N., Shen, N., Yang, R. and Fodor, S.P. (1995). Using oligonucleotide probe arrays to access genetic diversity. Biotechniques 19, 442–447. Livak, K.J. (1999). Allelic discrimination using fluorogenic probes and the 5 nuclease assay. Genet Anal 14, 143–149. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z. et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380. Moffatt, M.F., Kabesch, M., Liang, L., Dixon, A.L., Strachan, D., Heath, S., Depner, M., von Berg, A., Bufe, A., Rietschel, E. et al. (2007). Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448, 470–473. Mootha,V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E. et al. (2003). PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34, 267–273. Ogura, Y., Bonen, D.K., Inohara, N., Nicolae, D.L., Chen, F.F., Ramos, R., Britton, H., Moran,T., Karaliuskas, R., Duerr, R.H. et al. (2001). A frameshift mutation in NOD2 associated with susceptibility to Crohn’s disease. Nature 411, 603–606. Olson, M.V. (2003). The new quantitative biology. Cold Spring Harb Symp Quant Biol 68, 495–502. Ozaki, K., Ohnishi, Y., Iida, A., Sekine, A., Yamada, R., Tsunoda, T., Sato, H., Sato, H., Hori, M., Nakamura, Y. et al. (2002). Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet 32, 650–654. Pajukanta, P., Lilja, H.E., Sinsheimer, J.S., Cantor, R.M., Lusis, A.J., Gentile, M., Duan, X.J., Soro-Paavonen, A., Naukkarinen, J., Saarela, J. et al. (2004). Familial combined hyperlipidemia is associated with upstream transcription factor 1 (USF1). Nat Genet 36, 371–376.
References
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P. et al. (2001). Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 1719–1723. Pravenec, M., Wallace, C., Aitman, T.J. and Kurtz, T.W. (2003). Gene expression profiling in hypertension research: A critical perspective. Hypertension 41, 3–8. Reguly, T., Breitkreutz, A., Boucher, L., Breitkreutz, B.J., Hon, G.C., Myers, C.L., Parsons, A., Friesen, H., Oughtred, R., Tong, A. et al. (2006). Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 5, 11. Rioux, J.D., Daly, M.J., Silverberg, M.S., Lindblad, K., Steinhart, H., Cohen, Z., Delmonte, T., Kocher, K., Miller, K., Guschwan, S. et al. (2001). Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nat Genet 29, 223–228. Rual, J.F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G.F., Gibbons, F.D., Dreze, M., AyiviGuedehoussou, N. et al. (2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173–1178. Samani, N.J., Erdmann, J., Hall, A.S., Hengstenberg, C., Mangino, M., Mayer, B., Dixon, R.J., Meitinger, T., Braund, P., Wichmann, H.E. et al. (2007). Genomewide association analysis of coronary artery disease. N Engl J Med 357, 443–453. Schadt, E.E., Monks, S.A., Drake, T.A., Lusis, A.J., Che, N., Colinayo, V., Ruff, T.G., Milligan, S.B., Lamb, J.R., Cavet, G. et al. (2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302. Schadt, E.E., Lamb, J., Yang, X., Zhu, J., Edwards, S., Guhathakurta, D., Sieberts, S.K., Monks, S., Reitman, M., Zhang, C. et al. (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37, 710–717. Scott, L.J., Mohlke, K.L., Bonnycastle, L.L., Willer, C.J., Li, Y., Duren, W.L., Erdos, M.R., Stringham, H.M., Chines, P.S., Jackson, A.U. et al. (2007). A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. (2004). Large-scale copy number polymorphism in the human genome. Science 305, 525–528. Sebat, J., Lakshmi, B., Malhotra, D., Troge, J., Lese-Martin, C., Walsh, T., Yamrom, B., Yoon, S., Krasnitz, A., Kendall, J. et al. (2007). Strong association of de novo copy number mutations with autism. Science 316, 445–449. Shen, R., Fan, J.B., Campbell, D., Chang, W., Chen, J., Doucet, D., Yeakley, J., Bibikova, M., Wickham Garcia, E., McBride, C. et al. (2005). High-throughput SNP genotyping on universal bead arrays. Mutat Res 573, 70–82. Shendure, J., Mitra, R.D.,Varma, C. and Church, G.M. (2004). Advanced sequencing technologies: Methods and goals. Nat Rev Genet 5, 335–344. Shendure, J., Porreca, G.J., Reppas, N.B., Lin, X., McCutcheon, J.P., Rosenbaum, A.M., Wang, M.D., Zhang, K., Mitra, R.D. and Church, G.M. (2005). Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732. Smith, D.J. and Lusis, A.J. (2002). The allelic structure of common disease. Hum Mol Genet 11, 2455–2461.
■
45
Spielman, R.S. and Ewens, W.J. (1998). A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet 62, 450–458. Stanyon, C.A., Liu, G., Mangiola, B.A., Patel, N., Giot, L., Kuang, B., Zhang, H., Zhong, J. and Finley, R.L, Jr. (2004). A Drosophila protein-interaction map centered on cell-cycle regulators. Genome Biol 5, R96. Stefansson, H., Sigurdsson, E., Steinthorsdottir, V., Bjornsdottir, S., Sigmundsson, T., Ghosh, S., Brynjolfsson, J., Gunnarsdottir, S., Ivarsson, O., Chou, T.T. et al. (2002). Neuregulin 1 and susceptibility to schizophrenia. Am J Hum Genet 71, 877–892. Steinthorsdottir, V., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Jonsdottir, T., Walters, G.B., Styrkarsdottir, U., Gretarsdottir, S., Emilsson, V., Ghosh, S. et al. (2007). A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet 39, 770–775. Strittmatter, W.J., Saunders, A.M., Schmechel, D., Pericak-Vance, M., Enghild, J., Salvesen, G.S. and Roses, A.D. (1993). Apolipoprotein E: High-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. Proc Natl Acad Sci USA 90, 1977–1981. Suviolahti, E., Oksanen, L.J., Ohman, M., Cantor, R.M., Ridderstrale, M., Tuomi, T., Kaprio, J., Rissanen, A., Mustajoki, P., Jousilahti, P. et al. (2003). The SLC6A14 gene shows evidence of association with obesity. J Clin Invest 112, 1762–1772. Tewari, M., Hu, P.J., Ahn, J.S., Ayivi-Guedehoussou, N., Vidalain, P.O., Li, S., Milstein, S., Armstrong, C.M., Boxem, M., Butler, M.D. et al. (2004). Systematic interactome mapping and genetic perturbation analysis of a C. elegans TGF-beta signaling network. Mol Cell 13, 469–482. Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A., Pertz, L.M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D. et al. (2005). Fine-scale structural variation of the human genome. Nat Genet 37, 727–732. Uetz, P., Giot, L., Cagney, G., Mansfield,T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P. et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627. Wang, S.,Yehya, N., Schadt, E.E., Wang, H., Drake, T.A. and Lusis, A.J. (2006). Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity. PLOS Genet 2, e15. Wardell, M.R., Suckling, P.A. and Janus, E.D. (1982). Genetic variation in human apolipoprotein E. J Lipid Res 23, 1174–1182. Watson, J.D. (1965). Molecular Biology of the Gene, 1st edition. W. A. Benjamin, New York. Weber, J.L. and May, P.E. (1989). Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet 44, 388–396. Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin,V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S. et al. (2006). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 34, D173–D180. Zeggini, E., Weedon, M.N., Lindgren, C.M., Frayling, T.M., Elliott, K.S., Lango, H., Timpson, N.J., Perry, J.R., Rayner, N.W., Freathy, R.M. et al. (2007). Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 316, 1336–1341.
46
CHAPTER 3
■
Genomic Approaches to Complex Disease
RECOMMENDED RESOURCES 454 Life Sciences www.454.com Affymetrix www.affymetrix.com Applied Biosystems https://www2.appliedbiosystems.com/ Gene ontology – the gene ontology classification, http://www.geneontology.org Genome-phenome superbrain http://omicspace.riken.jp HapMap http://www.hapmap.org/ Helicos Biosciences http://helicosbio.com Illumina www.illumina.com
Jackson laboratory – a repository for genetic and genomic information on the mouse, http://www.jax.org National Center for Biotechnology Information – repository for SNP information, comprehensive literature database and much more, www.ncbi.nlm.nih.gov Perlegen http://www.perlegen.com/ Sequenom www.sequenom.com UCSC Genome Browser – available genome sequences, http:// genome.ucsc.edu/
CHAPTER
4 Human Health and Disease: Interaction Between the Genome and the Environment Kenneth Olden
INTRODUCTION The importance of inheritance versus the environment in the origin of various diseases has been debated for more than a century. The phrase “nature versus nurture” was coined by Francis Galton (a 19th-century British scientist) to distinguish between characteristics one is born with and those that are acquired as a result of one’s surroundings. Early on, humans looked to the external environment to explain morbidity and mortality as they were cognizant of their exposure to natural agents that were poisonous. Because of investments in environmental health research, we now know that many of the chemical, physical, and biological agents in the environment not only cause acute illness and death, but also increase risk for various chronic diseases such as cancer, asthma, hypertension, stroke, Alzheimer’s, Parkinson’s, infections, and various developmental disorders. However, the era of dominance of external factors as the primary contributor to human disease gradually faded with the development and growth of Mendelian genetics (Dobzhansky, 1941) and the more recent field of genomics (Trask, 1999). In fact, the geocentric view of disease has dominated our concept of the origin of disease for the past 50 years. However, with the completion of the sequencing of the human genome (Lander et al., 2001; Venter et al., 2001), it now seems apparent that complex human Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
phenotypes cannot be explained based solely on the action of genes or their protein products. Genes account for only a small part of the phenotype, and the environment has a spectacular impact. That is, the onset and clinical course of most chronic diseases are influenced by complex interactions between genes and the environment (Olden, 2007). The concept of gene–gene and gene–environment interactions is not new. Garrod (1902) was one of the first to suggest that the effects of genes could be modified by the environment. He suggested that individual differences in genetics could play a role in variation in response to drugs, and that the effect of genotype could be modified by the diet. Wright (1932) also emphasized the existence of a functional relationship between various biological endpoints and networks of genes and environmental factors in his studies of mutation, selection, and breeding, and Dobzhansky and Levene (1955) reported the existence of genotype–environment interaction with respect to viability of some chromosomes in Drosophila. Whereas concerns about gene–environment interactions have had a long history (reviewed by Haldane, 1946), the importance of understanding these interactions as a means to prevent complex diseases emerged only over the past 15 years (Olden, 1993; Olden and Guthrie, 2000; Olden et al., 2001; Schwartz and Collins, 2007). The geocentric view of disease was virtually unchallenged until the leadership of Copyright © 2009, Elsevier Inc. All rights reserved. 47
48
CHAPTER 4
■
Genome and the Environment
the environmental health community began to emphasize gene– environment interaction as a top research priority in 1992 (Olden, 1993; Washington FAX, January 21, 1992). A few years later, the initiation of the Environmental Genome Project (EGP) (discussed later in this chapter) represented the first large-scale effort to discover susceptibility alleles likely to be important in gene–gene or gene–environment interactions (Brown and Hartwell, 1998; Guengerich, 1998; Kaiser, 1997; Olden and Wilson, 2000). Prior to the initiation of the EGP, most of the momentum for research on gene–environment interactions was occurring in the field of pharmacogenetics to identify genetic variants that influence drug efficacy. As a consequence, many published studies have linked specific polymorphisms with specific drug responses (Frye, 2004; Hulla et al., 1999; Nebert et al., 1996). Studies have shown that interindividual pharmacokinetic variation in rates of drug elimination can vary dramatically (Vessel, 1991). Multinational twin studies were conducted on the kinetics of several drugs to compare relative contribution of genetics and environmental factors with respect to interindividual variations. Results were remarkably similar for the various populations, and pharmacokinetic variation virtually disappeared within monozygotic twins, but was preserved within most dizygotic twins; meaning that variation in drug metabolism is primarily under genetic control. Altering the environment of the monozygotic twin had no effect on kinetics. If reported pharmacogenetic differences of 10- to 200-fold can be extrapolated directly to the risk of human diseases, one could conclude that an individual can be 10- to 200-fold more sensitive to a given drug or environmental chemical because of differences in expression and activity of metabolic enzymes. Recent studies have shown a similar association between specific diseases and various genetic polymorphisms. Some examples of both types of genotype–phenotype association studies will be described later. However, even when all the highly penetrant alleles have been discovered and all the gene–gene and gene–environment interactions are understood, it may still be difficult to relate genotype to phenotype because of the additional layers of complexity associated with post-translational processing and protein–protein interactions. Kouhry and Wagner (1995) suggested that there are at least six types of gene– environment interactions. In the first situation, neither the environment nor the genotype has any effect alone, but can interact to increase risk. In the second situation, the genotype has no effect in the absence of exposure, but can exacerbate the effects of the latter. The third situation is the converse of the second; that is, the environment is without effect alone, but can enhance the effect of the genotype. In the fourth situation, both the environmental exposure and the genotype increase risk for disease, but the combination is interactive or synergistic. The fifth and sixth situations represent cases in which gene–environment interactions are protective. Phenylketonuria is an example of disorder that is caused by both the susceptibility genotype and exposure to phenylalanine in the diet. In contrast, lung cancer can be caused by cigarette smoking even if the susceptibility genotype is not present, and acute hemolytic anemia can be caused by the susceptibility
genotype alone even in the absence of fava bean digestion. Individuals with the susceptibility genotype for a1-antitrypsin deficiency have a high risk for development of emphysema, but the risk is greatly increased by cigarette smoking. The model of gene–environment interaction that has been most studied relates to metabolic susceptibility genes and chemical carcinogenesis. Polymorphisms in carcinogen-metabolizing enzymes (to be discussed later) are relatively common in the population and can play a major role in the development of the common cancers. The phrase gene–environment interaction means that the direction and magnitude of the effect that a genetic variant has on the phenotype can vary as the environment changes; that is, genetic risk is modifiable in an environment-specific manner or that the phenotypic effect of a mutation or environmental exposure depends on the genetic background. Whether a particular variant allele is expressed, the degree to which it is expressed, and when it is expressed can be influenced by the environment or by other genes. However, since the genotype evolves or changes very slowly over hundreds of years, one can assume that most of the increase in disease burden in industrialized nations is the result of a variable environment interacting with a relatively constant genetic substrate. Similar to gene–environment interactions, variants in multiple genes can contribute to the expression of phenotype. For example, multiple genetic loci influence the rate of progression of HIV infection to full blown AIDS (Dean, 2003). The nonsense allele, V641, in the CCR2 gene and heterozygosity for the cytokine receptor/HIV-1 co-receptor CCR5- 32 allele is reported to be associated with the slower progression of AIDS (Dean et al., 1996; Smith et al., 1997); in contrast, a haplotype in the promoter region of CCR5 is associated with the more rapid progression of HIV infection to AIDS (Martin et al., 1998). Also, progression of HIV-infected patients with more heterologous HLA Class I genes is considerably slower (Carrington et al., 1999).
IMPORTANCE OF THE ENVIRONMENT Over the past 50 years, the prevalence of chronic diseases has reached epidemic proportions in industrialized countries, and various governments have expended enormous resources to identify genetic causes. Nevertheless, the expectation that simple direct paths would be found from genes to disease has not been realized. Several explanations have been put forward to explain failure, but gene–gene and gene environment interactions are among the most likely. The merging view is that exposure to environmental toxicants has conspired with genetics to promote the development of these disorders. It is now well established that alterations in high penetrant genes explain only a small fraction of complex diseases, and such genes represent only a small fraction of variations relative to the more common polymorphisms that have a less disruptive effect on protein function. However, it is much less obvious what portion of the disease burden is attributable to interactions among low penetrant
Importance of the Environment
genes versus interactions of low penetrant genes with the environment. Nevertheless, a prominent role for the environment is supported by findings of geographic differences in incidence of disease, by variations in trends over time and by studies of disease patterns in immigrant populations (Doll and Peto, 1981). Also, population-based, twin-cohort studies, the “gold standard” for distinguishing between the contribution of genetics and the environment, suggest that the environment plays a prominent role in the development of human disease. In many cases, concordance among monozygotic twins is very low. For example, such recent studies show that no more than one-third of the cancer burden can be attributed to the action of genes alone (Lichtenstein et al., 2000; Verkasala et al., 1999), only 15% of Parkinson’s disease (Tanner et al.,1999), and about a third of autoimmune diseases (Powell et al., 1999). In a much earlier landmark study, Doll and Peto (1981) estimated that about one-third of all cancers are caused by genes alone based on demographic distribution of the disease in the United States. And, a recent study (Khot et al., 2003) reported that 90% of individuals with severe heart disease have at least one or more of the four classic risk factors captured in the current definition of the environment (Olden et al., 2001). Because of these and other studies, it is now generally accepted that a single risk factor is rarely responsible for the development of most chronic diseases (see Chapter 3). Rather, diseases commonly arise from complex interactions involving genes, environmental factors, behavior, and stochastic events associated with DNA replication and endogenous metabolism, as depicted in the schematic model shown in Figure 4.1. Age and stage of development are also important factors in disease development, as they influence the timing of gene expression and susceptibility to environmental toxicants. Many people with susceptibility genes never actually develop the disease, even for diseases caused by highly penetrant alleles (e.g., BRCA1 and BRCA2 breast cancer genes) (Futreal et al., 1994). There is always a subpopulation of carriers of susceptibility genes who do not progress to the ultimate disease state. Whether one can account for the discordance between genotype and phenotype by lack of exposure to unknown environmental risk factors or to yet undiscovered genomic variations or gene–environment interactions
Intrinsic Genetic Susceptibility
Environmental Exposures
Human Health/Disease
Behavior/Age/Stage of Development
Figure 4.1 Schematic model of interactive factors involved in human health and disease.
■
49
is not known. However, existing evidence for a prominent role for the environment in the development of human disease is so compelling that Rothman et al. (2001) concluded that “the epidemiologic evidence accumulated to date indicates that environmental exposures, broadly defined to include lifestyle factors, are responsible for most cancers.” Others have concluded that the new data now being reported “blow away” the myth that “bad genes” are responsible for the majority of human morbidity and mortality (Olden, 2004a, b). That is, individual life history may be just as important as genetic background in predicting and understanding the development of complex traits. Genetics Loads the Gun, but the Environment Pulls the Trigger Is the relationship between genes and the environment analogous to that of a loaded gun and its trigger, as previously proposed? Olden and Wilson (2000) argued that the inheritance of disease susceptibility alleles creates the potential for adverse health outcomes, but that the adverse health outcome may never be realized provided the carrier can avoid environmental exposures capable of interacting with the susceptibility gene(s) to trigger disease development. If this analogy fully captures the relationship between environmental susceptibility alleles and specific environmental exposures, this can be good news because amelioration of environmental hazards could lead to a dramatic reduction in the incidence of chronic diseases in human populations. From an evolutionary point of view, one could argue that genes and gene variants that were adaptively selected in the past because they endowed the human species with survival or reproductive advantage, now have the potential to increase disease risk because of their incompatibility with the environment. The change in the environment, resulting from human activity and technology innovation, has outpaced evolution of the human genome to the extent that there is now a nature–nurture mismatch. For years, environmental health advocates have had to deal with the question of why more bodies are not piling up on the streets if industrial chemicals, such as benzene, vinyl chloride, and dioxins, are so harmful. Why can some people withstand exposure to nerve gases such as sarin, and organophosphate pesticides such as malathion and parathion? Why do fewer than 25% of smokers develop lung or bladder cancer? How do some individuals who work in the petrochemical industry survive years of exposure to benzene and never develop leukemia? Human population studies to assess gene–environment interactions are beginning to provide answers to these questions. For example, about 25% of Asians and 10% of Caucasians have a protective form of the gene that codes for paraoxonase (PON1, to be discussed later), an enzyme that detoxifies organophosphates, such as sarin and the above-mentioned insecticides, ten times faster than the more common form of the enzyme (Costa and Furlong, 2002). Also, common polymorphisms in genes for the cytochrome P450 enzymes determine how one metabolizes carcinogens (i.e., arylamine) in cigarette smoke or benzene in the environment of petrochemical workers (Lower et al., 1979; Smith et al., 2002). Such studies are beginning to
50
CHAPTER 4
■
Genome and the Environment
explain why a given polymorphism can be highly beneficial in one environment and virtually lethal in another (Boguski, 1999). Furthermore, they suggest that risk reduction can be achieved by re-engineering of genes or by modification of the environment. Now that the Human Genome Project (Lander et al., 2001; Venter et al., 2001) and the haplotype mapping (HapMap) project (Altshuler et al., 2005) have been completed, the major challenge facing biomedical researchers is to elucidate the interactive relationship between genes and between genes and the environment. While descriptive studies by toxicologists and epidemiologists have shed some light on the possible genetic and environmental risk factors, the actual causes and mechanisms of disease development remain poorly understood. Investments in the aforementioned sequencing and mapping projects led to the development of powerful new tools to conduct large-scale, population-based studies necessary to untangle complex interactions between genes and environmental factors. However, the design and conduct of such studies will require the development of even better tools and fundamental restructuring of the biomedical research enterprise to put greater, although not exclusive, emphasis on interdisciplinary research teams. Promise of New Technologies for Unraveling Complex Interactions Between the Human Genome and the Environment The Nobel Prize laureate Sydney Brenner has written that “progress in science depends on new technologies, new discoveries and new ideas, probably in that order” (Brenner, 2002). In no discipline is this more evident than in environmental health research. Environmental health regulation has long been constrained by limitations in our ability to determine the risk presented by specific xenobiotic agents. Integration of genomics and toxicology to develop a new field of toxicogenomics may revolutionize risk assessment and environmental health regulation (Lovett, 2000). Over the past 50 years, advances in molecular biology have dramatically improved our understanding of human biology. Such innovations include the discovery of the DNA double helix (Watson and Crick, 1953), development of recombinant DNA technology (Watson et al., 1983), the polymerase chain reaction (Chamberlin, 1982), siRNA technology (Baulcombe, 2002), gene and protein microarray technology (Brown and Botstein, 1999; Schena et al., 1995), and DNA sequencing methodology (Lander et al., 2001). For example, an important insight gained from gene sequencing studies is that the human genome consists of a large number of polymorphic sites, some of which are associated with altered biological function or phenotype (see Chapter 1). The finding of a large number of variant forms of genes suggested that genetic polymorphisms may play a significant role in susceptibility to disease or toxic exposure. While descriptive studies by toxicologists, molecular geneticists, and epidemiologists, working independently, have shed some light on this topic, the actual role of gene variation in susceptibility to disease development remains poorly understood.
THE ENVIRONMENTAL GENOME PROJECT Although we are beginning to understand how mutations in single genes effect phenotype, we have very limited knowledge of how variations in different genes influence each other and the sensitivity of biological mechanisms to environmental perturbation. But, with the completion of the reference sequence of the human genome, the field of genomics shifted from gene discovery to the study of gene function and elucidation of the genetic basis for differences in susceptibility to common diseases. Although reference is made to the human genome, the concept of a single genome is misleading. Whereas variation in the human genome is not greater than 0.1% among individuals in the population, this represents very significant variation considering that the human genome is composed of approximately three billion nucleotide base pairs. Genetic linkage analysis has been a powerful approach in the discovery of high penetrance allelic variants (e.g., BRCA1/2). However, this approach has not yet been particularly useful in the identification of low penetrance susceptibility alleles associated with the development of complex diseases such as cancer, cardiovascular disease, asthma, diabetes, Alzheimer’s and Parkinson’s disease because the relative low risk associated with a specific allele may be influenced by unknown environmental factors. Of the many types of variations in the human genome, single nucleotide polymorphisms (SNPs) are the most common (Collins et al., 1999). Estimates are that there is on average one SNP per 1300 base pairs, four to five per gene-coding region, or approximately 11 million in total in the human genome (Kruglyak and Nickerson, 2001). The EGP was formally initiated by the US National Institute of Environmental Health Sciences (NIEHS) in 1997 to investigate the role of genetic polymorphisms in susceptibility to environmentally induced disease (Brown and Hartwell, 1998; Kaiser, 1997, 2003). In this regard, NIEHS was way out front in recognizing the importance of genetic variation for understanding human disease. Following their lead, other large-scale public and private initiatives have been developed to identify the millions of SNPs in the human genome. Most SNPs can be expected to be neutral with respect to alteration of environmental sensitivity. To examine the extent of diversity in the human genome, the EGP was organized into five research areas: (1) SNP discovery; (2) functional analysis; (3) disease association studies; (4) development and refinement of technologies to improve sequencing, functional analysis, and population-based studies; and (5) development of policies to address possible ethical, social, and legal issues associated with genetic susceptibility studies. The 544 candidate genes included in the initial survey (see http://www.genome. utah.edu/genesnps and http://www.niehs.nih.gov/research/ supported/programs/egp/) had already been characterized with respect to function and were known to have a significant probability of playing a role in susceptibility to specific environmental exposures. The resequencing analysis of individual DNA samples was based on samples from 90 individuals, representative of the
The Environmental Genome Project
diversity in the US population. This design would allow for the detection of variation that occurs at a frequency of one percent or greater. As of September 2007, approximately 619 environmental response genes have been resequenced with the discovery of 88,016 SNPs. All of the SNPs entered into the database have been validated by multiple sequence analysis and include genotype frequency information. This highly useful SNP database is publicly available (http://www.egp.gs.washington.edu/welcome.html). Decades of studies by environmental health researchers, demonstrating tremendous variation in individual response to environmental exposures, provided the intellectual foundation for the massive resequencing effort undertaking in the EGP. Presumably, natural selection, resulting from exposures to various environmental insults, has operated over the course of human history to promote the evolution of sophisticated metabolic pathways to suppress phenotypic variation or buffer against toxic injury. Buffering is a fundamental aspect of biology, and there are likely to be multiple mechanisms involved, including genetic redundancy, feedback regulation, and cooperative biochemical interactions (Wagner et al., 1997; Gibson and Wagner, 2000; Hartman et al., 2001). Buffering was first inferred from the observation that organisms can acquire a wide range of phenotypes when subjected to varying genetic or environmental stressors (Scheiner et al., 1991). Collectively, the various buffering pathways involved in protecting the organism against environmental stressors have been referred to as the “environmental response machinery” (Olden and Wilson, 2000). According to this model, all human genes, including those that encode protein components of the environmental response machinery, are subject to genetic variability (mutations or polymorphisms) that can alter the metabolic efficiency of protective pathways. Some alleles can strengthen protection and others can reduce it. Another result emerging from the SNP discovery effort is that the variants tend to associate with one another in a nonrandom manner known as linkage disequilibrium (Carlson et al., 2004); this pattern of association is called a haplotype. On average, the EGP candidate genes are organized into 25 haplotypes (Wilson and Olden, 2004). The current thinking is that the generation of haplotype and linkage disequilibrium data will be extremely useful in large-scale genotyping studies to investigate the association between specific genes and disease. Such studies are now underway in several laboratories (see Chapter 8). However, the goal of assigning specific genotypes to phenotypes with respect to environmental sensitivity has proven to be a daunting assignment. For functional analysis of the SNPs discovered in the EGP, both in vitro biochemical and animal model studies have been initiated. For example, Furlong and his colleagues (Costa and Furlong, 2002) investigated the functional consequences of polymorphisms in paraoxonase (PON1) genes. PON1 is a highdensity lipoprotein associated with the plasma membrane that metabolizes oxidized lipids, environmental xenobiotics (e.g., organophosphate insecticides mentioned above) and some pharmaceutical agents (e.g., statins). Studies by Furlong and colleagues suggest that PON1 may play a significant role in the development of cardiovascular disease independent of environmental exposures (Costa and Furlong, 2002). However, their
■
51
studies have shown that the activity of PON1 is significantly altered by genetic variation and that these alterations influence sensitivity to the organophosphate insecticides chlorpyrifos and diazinon. For example, one of the two SNPs in the coding region significantly increased the substrate-specific activity of the enzyme and provides much greater protection against chlorpyrifos oxon than the wild-type PON1. However, the Q192R allele did not correlate with carotid artery disease in a case– control study (Jarvik et al., 2003). Whereas individual SNPs in the promoter region had no effect on enzyme activity, combinations of three of the common haplotypes of the five SNPs in this region of the gene did alter the plasma activity of PON1. While these studies do suggest that polymorphisms in PON1 may play a role in susceptibility to organophosphate toxicity, the exact involvement and possible interactions of the various SNPs in the PON1 gene are yet to be determined. In another project in the EGP, Martyn Smith and colleagues (Smith et al., 2001) identified a gene that codes for NAD(P)H: quinone receptor oxidoreductase1 (NQO1) and demonstrated its potential involvement in the etiology of leukemia. This enzyme plays an important role in protection against oxidative damage from exposure to quinones. And they found that the common C609T polymorphism resulted in an amino acid substitution with a complete loss of enzyme activity in homozygotes. The C609T variant was associated with a 2.5-fold increase in risk for leukemia in case–control studies (Krajinovic et al., 2002; Smith et al., 2001, 2002). In contrast, several polymorphisms in genes involved in folate metabolism reduced the risk of leukemia by two- to threefold in heterozygotes and 3- to 10-fold in homozygotes (Skibola et al., 1999, 2002). Folate is a source of precursors for DNA synthesis, repair and methylation; and dietary folate protects against leukemia (Thompson et al., 2001). Also, there is evidence for a role of gene–environment interactions between polymorphic variants of thymidylate synthetase and methylenetetrahydrofolate reductase (Skibola et al., 2002). Finally, for functional analysis of the various polymorphisms discovered in the EGP, NIEHS developed mouse genome centers in 2000. The mouse is the most widely used mammalian model system for the study of the impact of the environment on human health. The mouse is particularly useful in studies of gene–environment interactions because most human genes have counterparts in the mouse genome. Furthermore, extensive toxicological and pathological databases and molecular genetic techniques for construction of gene-knockout and knock-in strains have been developed using the mouse as a model. The mouse model also offers the advantage of using strain-specific differences in response to environmental exposure to investigate how the genetic background influences susceptibility. First, the mouse genome center resource was used to create genetic variation map databases of laboratory mouse strains for use in studies of gene–environment interactions. This was achieved by using knockout and knock-in technologies to generate polymorphic variations in candidate genes similar to those discovered in the EGP. Exposing such animals to putative environmental risk factors is the most obvious and direct way
52
CHAPTER 4
■
Genome and the Environment
to investigate the relationship between genetic background and susceptibility to environmental exposures. Second, since various mouse strains are known to exhibit significant variation in susceptibility to carcinogens and other environmental toxicants, efforts were undertaken to discover the haplotype maps of the 15 most commonly used mouse strains to identify genetic sequences that underlie differences in response (Olden 2004a; Pearson 2004). Once susceptibility haplotypes have been identified, one can track down the genes involved by crossing a susceptible strain with a resistant strain and identifying the chunks of the genome that are associated with the susceptibility. As the biological and genetic databases expand, the need for such mouse crosses might no longer be necessary.
PROBLEMATIC NATURE OF GENE– ENVIRONMENT INTERACTION STUDIES While there is considerable interest and enthusiasm for research on gene–environment interaction, there is also apprehension and uncertainty about the feasibility of such studies and practical questions about how to conduct them. Current studies underway in different parts of the world are problematic because they do not employ a standard definition of the environment nor standard metrics to assess exposure. But, even more problematic is the fact that methods to accurately measure exposure and to statistically model multiple interactive components that make modest contributions to the phenotype are not yet available. To alter the current state of affairs with respect to technology and database needs will require investments in the following areas of research: (i) identification of disease risk factors; (ii) elucidation of genetic differences and similarities between human and mouse; (iii) development of better tools and statistical models to assess exposure and interactions of multiple components; (iv) development of high-throughput, low-cost and more informative strategies to assess toxicity of drugs and environmental xenobiotics, and (v) elucidation of mechanisms and metabolic pathways influenced by gene–environment interactions. Furthermore, it has become clear that addressing the role of gene–environment interactions in the etiology of complex diseases will require the development of a robust framework to account for “the environment” (see Chapter 83). In spite of current understanding of the multifactorial nature of the etiologic mechanisms of chronic diseases, biomedical researchers still tend to focus on limited or circumscribed components of disease. Epidemiologists tend to focus on exposures; geneticists target susceptibility genes; cell/molecular biologists tend to explore mechanisms; and social scientists study behavior. While the various disciplinary approaches have led to new insights into disease causation, they are unlikely to be able to provide a coherent explanation for disease causation. Given this reality,
the environmental health community moved swiftly to take advantage of new databases and tools, derived from the human genome project, to identify and characterize all the environmental susceptibility alleles in the human genome and all the functional molecules (e.g., RNA, protein, carbohydrates, lipids, and metabolites) encoded by human and mouse genomes (Kaiser, 1997; Lovett, 2000; Olden, 2004a, b; Waters et al., 2003). With approximately 25,000 genes, hundreds of thousands of protein species, and numerous metabolites, the task of identifying and characterizing them with respect to function is an ambitious undertaking, one requiring improvement, standardization, and validation of existing methods to achieve reproducibility, increased sensitivity and specificity, and high throughput. Such studies will also require large-scale, multi-institutional collaboration and interdisciplinary expertise to amass and analyze such large databases. Until the latter part of the 19th century, infectious diseases were the chief causes of death; and their development was characterized by relatively short latency periods between exposure and the onset of illness. The relative ease in identifying the causative agents led to the development of effective prevention measures in the United States and other industrialized nations. Unfortunately, our capacity to prevent infectious diseases tracked with the emergence of chronic diseases, which have latency periods of 10–20 years. The increased prevalence of chronic diseases, the long latency period, and the more insidious nature of their causes necessitated development of new approaches to study their etiology. Epidemiologists responded by developing association studies to analyze the statistical relationship between two or more variables (e.g., genotype, exposure, and disease), based on two fundamental assumptions: first, that human disease does not occur at random; and second, that human disease has causal factors that can be identified by studying different populations. This disciplinary approach has been remarkably successful in the discovery of major risk factors for many common diseases, but has been less successful in discovering risks resulting from gene–gene or gene–environment interactions. Gene–environment interactions have been identified by taking advantage of existing models that take into account various ways that genetic effects can be modified by environmental exposures (Hunter, 2005). Epidemiologists have typically used one of three study designs: retrospective case–control, prospective cohort, and family-based. Family-based association design estimates association between genes and phenotypes based on the assumed Mendelian transmission of alleles from parent to offspring (Kraft and Hunter, 2005). In this study design, it is often difficult to obtain genetic data on parents for late-onset diseases, and exposure data is subject to recall bias as it is collected retrospectively. A further limitation of this approach is that the penetrance of the gene has to be sufficiently strong to cause familial aggregation. However, family-based studies allow one to assess the influence of genes over several generations; if the penetrance of the gene changes over time, this might represent the effect of the environment and lifestyle. In the case–control design, data on exposure is obtained after diagnosis of the disease, so errors in
Polymorphism and Disease Susceptibility: Case–Control Studies
estimating risk from gene–environment interactions can result from exposure recall or from bias in selection of the controls. Even if appropriate controls are found, incentive for participation in the study is often low; so, obtaining high participation is challenging. Also, it is difficult to estimate the effect of the disease on biomarkers used in the study. However, the case–control design is popular among epidemiologists because cases and controls can be enrolled in a short period of time, which leads to reduced cost and delay in reporting out the results. Finally, in the populationbased cohort design, information on exposure is collected prospectively, and tissue for DNA and baseline biomarker analysis is collected before the disease develops. The advantage of this study design is that assessment of exposure is more reliable and it does not suffer from selection bias. Unfortunately, such studies are very costly and time consuming and lack sufficient number of subjects to evaluate risks associated with gene–gene or gene–environment interactions. However, one can increase the power of populationbased cohort studies by pooling data from several studies. Because of the cost and time involved, most of the previous studies of gene–environment interactions have used the case– control design, which relied on family history to infer genetic predisposition and questionnaires to assess environmental and lifestyle exposures. In future studies of gene–environment interactions, it is likely that data from case–control and population-based cohort studies will be combined, and that genotype and exposure will be analyzed with greater precision using more direct and quantitative technologies. For example, the human genome and the various polymorphism discovery projects have opened up a vast new terrain to search for gene–gene or gene–environment interactions. Once these discovery efforts have evolved to the point that genetically susceptible populations have been identified, targeting such populations would increase the magnitude of relative risk, the significance of the observed associations, and the likelihood of discovering important gene–gene or gene– environment interactions. But, because of cost, rigorously designed case–control studies will remain the only option for assessing gene–environment interactions for rare diseases. Exploration of this new vast genomic landscape presents major challenges with respect to design and analysis of meaningful studies, and translation of the findings into public health and clinical practice. In addition to the challenges associated with population studies, the study of gene–environment interactions requires accurate information on exposure to eliminate a major source of uncertainty and the most serious problem in environmental health risk assessment. Not only is it difficult to measure exposure accurately, it is often prohibitively expensive for use in studies of large populations needed to test for association between genes and disease to correlate genetic variation with risk for disease.The current practice of estimating exposure using indirect surrogates, such as toxic release and production inventories and environmental monitoring, is inadequate and limits our understanding of dose–response relationships. Current methods available for measuring exposure include the use of environmental sensors, geographical information systems, biological sensors, direct analysis of body fluids, and more recently, toxicogenomics. Unfortunately,
■
53
all of these exposure assessment tools, with the possible exception of toxicogenomics, have serious limitations. The biological response to environmental exposure is so complex and involves so many interactive factors that the utilization of a combination of exposure assessment tools, capable of integrating all the interactive components, will be required in prospective epidemiologic studies. In my opinion, the development of the field of toxicogenomics will provide powerful and relatively inexpensive tools to identify biomarkers and to relate exposure and biological events during disease progression. In addition to impediments imposed by sample size and exposure assessment tools, population-based studies are fraught with other challenges related to the need for better statistical models. To define complex exposure-disease relationships and interactions between genes and environmental factors in the development of human disease, a comprehensive view of exposure is needed. The achievement of this goal will require improved sophisticated modeling capabilities to explain how components of the disease process respond when perturbed by environmental factors. Even in the simplest case of four combinations of genotype and exposure, where one is comparing risk of carrier versus noncarrier in exposed and nonexposed situations, there are several models for describing interactions between genetic susceptibility and environmental exposures. Of course, modeling of interactions gets more complex as the categories of environmental exposures and the number of variant alleles involved increase (Hunter, 2005). Also, new statistical methods may be needed to extract meaning from large datasets and to integrate information from many sources. Finally, the most critical technological challenge is how to relate exposure-disease association studies to pathways and mechanisms. To address this issue, NIEHS developed the National Center for Toxicogenomics (Lovett, 2000) to promote the development of technologies to monitor gene, protein, and metabolite expression in cells and tissues under various environmental conditions. To understand how genes and environmental factors interact to perturb biological pathways to cause injury or disease, scientists will need tools with the capacity to monitor the global expression of thousands of genes, proteins, and metabolites simultaneously. The generation of such data in multiple species can be used to identify conserved and functionally significant genes and pathways involved in gene–environment interactions.
POLYMORPHISM AND DISEASE SUSCEPTIBILITY: CASE–CONTROL STUDIES An article by Lower et al. (1979) transformed population health research and provided the foundation for the massive sequencing efforts to identify genetic variations involved in modulating human response to drugs and other environmental xenobiotics. This landmark investigation also demonstrated the power of hypothesis-driven, population-based studies for assessing disease risk associated with specific interactions between genes and the
54
CHAPTER 4
■
Genome and the Environment
environment. Furthermore, this pioneering study validated the use of epidemiological tools to investigate the biological significance of human genetic variants in disease risk. Such studies indicate that functional polymorphisms in genes that code for DNA repair or metabolism of environmental xenobiotics can modify the effects of environmental exposures. Gene variants can influence the balance of metabolic intermediates produced by altering the structure, function, and level of enzymes or the catalytic efficiency of the biotransformation process. Several large-scale projects are underway to build databases and to develop tools and infrastructure to investigate gene–gene and gene–environment interactions, and some results are already being reported. An encyclopedic review of the various studies is beyond scope of the present chapter, but a few examples will be presented as illustrative. Bladder Cancer and N-acetyltransferase 2 (NAT2) Polymorphism The seminal study by Lower and colleagues (1979) was prompted by their desire to elucidate the relationship between N-acetyltransferase phenotype and susceptibility to the development of bladder cancer from exposure to arylamines, as a result of exposure to cigarette smoke or chemical dyes. Earlier studies had suggested that (i) a significant portion of bladder cancer can be attributed to cigarette smoking and occupation exposures; (ii) the distribution of N-acetyltransferase activity in the liver is highly variable among individuals, and (iii) individuals with low enzyme activity (the so-called “slow acetylator phenotype”) were most susceptible to the development of bladder cancer (Case and Hasker, 1954; Cole, 1974; Hueper, 1969; Hueper et al., 1938). These observations led Lower et al. (1979) to propose that the slow acetylator phenotype would be over-represented in a population of bladder cancer patients occupationally exposed to arylamine dyes. Their hypothesis was confirmed in a cohort of bladder cancer patients who worked in the dye industry in Denmark. Bladder cancer has been described as a classic illustration of the principle of dose–effect modification of an environmental exposure by genetic polymorphisms (Kelada et al., 2003). Exposure to aromatic amines can cause bladder cancer depending on one’s NAT2 genotype/phenotype. Smokers who are slow acetylators have a higher risk of developing bladder cancer compared to smokers who are rapid acetylators. Marcus et al. (2000) calculated an odds ratio of 1.3–1.7 for smokers who are rapid acetylators, confirming that there is an interaction between NAT2 and smoking. Using nonsmoking rapid acetylators as the reference with estimates of the prevalence of smoking, these investigators were able to demonstrate that the populationattributable risk of the gene–environment interaction was 35% for slow acetylators who had ever smoked and 13% for rapid acetylators who had ever smoked. In laboratory studies designed to investigate plausible mechanisms to explain the results, Hein et al. (1993) and Mattano et al. (1989) demonstrated that NAT2 enzyme can bioactivate procarcinogenic aromatic amines in the bladder. Following N-oxidation of aromatic amines by cytochrome P4501A2 in the liver, O-acetylation of the resulting hydroxylamine by NAT2 can produce unstable acetoxy esters or
proximate carcinogens. The acetoxy ester intermediates can be N-acetylated or N-oxidized to yield N-hydroxy-N-acetyl aromatic amines that can bind to DNA to create point mutations or adducts. Myeloperoxidase Polymorphism and Lung Cancer Myeloperoxidase is a phase I enzyme found in the lysosomes of polymorphonuclear leukocytes and monocytes. It converts lipophilic carcinogens into hydrophilic forms. Exposure to tobacco smoke stimulates the recruitment of neutrophils into lung tissue with local release of myeloperoxidase. This enzyme converts 7,8-diol benzo(a)pyrene to the highly reactive and carcinogenic form called benzo(a)pyrene 7,8-diol epoxide, which can form DNA adducts in lung tissue. Myeloperoxidase can also convert other procarcinogens (e.g., polycyclic aromatic hydrocarbons and aromatic amines) in cigarette smoke to free radicals or reactive oxygen species. A common G to A transition at position 463 of the promoter region of the myeloperoxidase gene, which leads to loss of a SP1 transcription factor binding site in an Alu hormone responsive element has been shown to reduce myeloperoxidase mRNA expression (Kiyohara et al., 2005). Because transcriptional activity is decreased in individuals with the variant A allele, less enzyme is available to convert benzo(a)pyrene intermediate to the highly carcinogenic benzo(a)pyrene 7,8-diol epoxide species. The frequency of the A allele varies between Caucasian (23.4%) and Asian (14.4%) populations. As the myeloperoxidase G-463A variant allele is related to low metabolic activation, individuals with at least one variant allele may be associated with decreased risk of lung cancer. London et al. (1997) reported that Caucasian individuals with the A/A genotype were associated with a significant reduction in risk for lung cancer. Although the biological effects of each genotype (i.e., G/G, G/A and A/A) have not been clarified, lung cancer among the three genotypes decreases from G to A. Similar results have been obtained for Japanese and Hawaiian populations (LeMarchand et al., 2000). The association between myeloperoxidase genotype and lung cancer risk was modified by duration of smoking. In heavy smokers versus light smokers, the combined G/A and A/A genotypes had an odds ratio for squamous cell carcinoma of 6.22 for heavy smokers and 1.72 for light smokers, and for never smokers 1.14. Other studies showed that myeloperoxidase G/A genotype interacted with the glutathione S-transferase T1 (GSTT1) locus to significantly decrease lung cancer among males but not females (Cajas-Salizar et al., 2003; Feyler et al., 2002). NAD(P)H Quinine Oxidoreductose 1 Polymorphism and Lung Cancer Polymorphism in the gene that codes for NAD(P) quinine oxidoreductase 1 (NQO1) is also a lung cancer susceptibility gene that interacts with carcinogens in cigarette smoke (Kiyohara et al., 2005). NQO1 has a dual function in that it can both activate and detoxify carcinogenic quinoid compounds, such as benzo(a)pyrene. The catalytic reduction of benzo(a)pyrene prevents the formation of cancer-causing quinone-DNA adducts. The combined variant genotypes, Pro-Ser and Ser-Ser were
Polymorphism and Disease Susceptibility: Case–Control Studies
significantly associated with decreased risk of lung cancer in Mexican Americans (Wiencke et al., 1997). Similar results were obtained in three Japanese studies (cited in review article by Kiyohara et al., 2005); however, these genotypes were only weakly associated with lung cancer in Chinese, Hawaiians, Caucasians, and African-Americans. In terms of gene–environment interactions, there was an excess small cell lung cancer risk associated with the presence of the Ser allele in heavy smokers, where the odds ratio for the Ser-Ser and Pro-Ser genotypes combined was 12.5; in light smokers the odds ratio was 0.90. In contrast, the frequency of the NQO1 genotypes did not differ significantly between smokers and nonsmokers. The NQO1 Pro187Ser polymorphism was also involved in gene–gene interactions. For example, the combined NQO1 Pro-Pro and GSTT1 null genotype showed a significant association with lung adenocarcinoma risk, with an odds ratio of 4.61 (Sunaga et al., 2002). The odds ratio was calculated using the NQO1 Ser-Ser and GSTT1 nonnull genotypes combined as the reference. Although the risk of developing lung cancer conferred by the various susceptibility alleles is small, lung cancer is such a common malignancy that this small increase in risk translates to a large number of lung cancer cases in the population and is clearly important from a public health point of view. These results also show that lung cancer cannot be explained by variation in a single allele, but rather by multiple polymorphisms within genes involved in the same or different pathways. Discovery of other functional variants may shed some light on the development of this important malignancy, as the burden of this disease surely develops from complex interactions between multiple genetic and environmental factors over time. That is, the effect of polymorphisms may be best represented by their haplotypes. Fibrinogen Polymorphism and Coronary Heart Disease Fibrinogen is an acute phase protein and its plasma level is responsive to trauma, serum cholesterol, hormones, and infections. Increase in genetically determined levels of this glycoprotein is associated with risk of cardiovascular disease. Several epidemiologic studies have shown that gene–environment interactions involving fibrinogen polymorphisms play a role in coronary heart disease. The glycoprotein consists of three different polypeptide chains (a, b, and c) linked by disulfide bonds (Vischetti et al., 2002). Studies show that b-chain synthesis limits the rate of production of mature fibrinogen (Yu et al., 1984). Many polymorphisms have been discovered in the b-chain, two of which show a strong correlation with plasma concentration of the mature protein. The BclI and the G/A-455 polymorphisms have been selected for analysis of gene–environment interaction. Such studies have shown that moderate physical activity decreases the level of fibrinogen; however, these effects are evident only in carriers of the B2 allele of the BclI b-chain polymorphism. Montgomery et al. (1996) have shown that there is an acute rise in fibrinogen levels in response to exercise, and that this response is strongly influenced by the G/A-455 polymorphism in the b-chain gene; in fact, a gradient of increasing
■
55
acute fibrinogen response was seen across the G/G, G/A, and A/A genotypes, respectively. Zito et al. (1999) reported that infection with Helicobacter pylori, the pathogen responsible for most peptic ulcers and gastritis, resulted in a fourfold increase in risk of myocardial infarction; the response to infection was higher in carriers of the BclI polymorphism, with carriers of the B2 allele showing higher fibrinogen levels compared to the B1 homogygous subjects. Carriers of the B2 allele have increased risk for myocardial infarction and have the capacity to amplify the effect of seropositivity for Helicobacter. So patients with both seropositivity for Helicobacter and carriers of the B2 genotype have an additional increased risk. In another study, some subjects with high dietary fat intake developed abnormal high-density lipoprotein cholesterol concentrations, and some did not, depending on their genotype or polymorphisms in the hepatic lipase gene promoter (Ordovas and Corella, 2002; Ordovas 2004). Also, tobacco smokers developed coronary heart disease, or did not, depending on their lipoprotein lipase genotype (Talmud et al., 2000) and their apolipoprotein E4 genotype (Humphries et al., 2001). These results show that genetic polymorphisms in the b-chain gene do not cause myocardial infarction, but influence risk for the disease by rendering individuals more susceptible to environmental risk factors. Mental Disorders and Monoamine Oxidase A Polymorphism A functional polymorphism in the promoter region of the MAOA gene encoding the neurotransmitter metabolizing enzyme monoamine oxidase moderates the effects of child maltreatment on the cycle of violence (Moffitt et al., 2005). Their results show that maltreated children, with the polymorphism in the promoter region, produced low levels of monoamine oxidase A and were more prone to develop conduct disorders, to demonstrate antisocial personality, and to commit crimes as an adult than maltreated children with high-activity levels of monoamine oxidase A. In a related study, Caspi et al. (2003) demonstrated that the serotonin transporter (5-HTT ) gene moderated the effects of stressful life events on depression. Individuals with one or two copies of the 5-HTT polymorphic allele exhibited more depressive symptoms, diagnosable depression, and suicidal tendencies following stressful life events than individuals who were homozygous for the wild-type allele. Asthma and Gene Polymorphisms Asthma is a complex multifactorial disorder with variable phenotypes attributed to interaction of multiple genes with multiple environmental factors (see Chapter 89). The interplay between genes and the environment is best illustrated by the hygiene hypothesis to explain the increase in asthma prevalence. The fundamental tenet of this hypothesis is that the emergence and severity of the disease depends on the interaction between genetic background and environmental exposure. According to the hygiene hypothesis, children in industrialized nations have decreased incidence of bacterial infections, hence reduced exposure to endotoxins. In fact, this is supported by several studies
56
CHAPTER 4
■
Genome and the Environment
which show that exposure to increased endotoxin levels in the domestic environment is associated with decreased occurrence of asthma (for recent review see Kleeberger and Peden, 2005). The risk of developing asthma is influenced by the Western lifestyle, including diet, air pollution, reduced risk of infections, indoor air, and stress (reviewed by Arruda et al., 2005). The length and type of exposure, as well as the time when it occurs, can either promote or suppress the inflammatory process and the subsequent course of the disease. CD14 is an innate immunity gene that has been studied extensively with respect to its genetics, function, and distribution in various populations because of its important role in the development of asthma and other allergic reactions. The CD14 gene product is a component of the endotoxin-recognition receptor complex and is critical for the development of the optimal responses to low concentrations of the environmental stimulant via the Toll-like receptors (TLR4 and TLR2). By some unknown mechanism, both the membrane-bound and soluble forms of CD14 enhance the affinity of the interaction between environmental allergens and the TLRs. In 1999, Baldini et al., as part of the Tucson Children Respiratory Study, demonstrated an association between a SNP in the CD14 promoter (CD14*C-159T), increased serum levels of soluble CD14, and decreased total serum IgE. The discovery of an association between this polymorphism in the CD14 promoter and a reduction in serum IgE levels provided important support for the view that innate immunity genes are key determinants in the development of asthma. Several studies have shown that pro-oxidant environmental exposures are associated with increased incidence and exacerbation of asthma and other lung diseases. For example, a large study conducted in Southern California showed that in utero exposure to tobacco smoke was associated with several indicators of lung morbidity, including increased incidence of asthma diagnosis, increased need for emergency room use for airways symptoms, and increased occurrence of bronchitis (reviewed by Kleeberger and Peden, 2005). Numerous candidate genes have been shown to be associated with asthma and asthma-associated phenotypes, such as elevated IgE levels, bronchial hyperactivity, and SNPs within specific cytokines, IgE regulating enzymes, and phase I and phase II xenobiotic metabolism enzymes. However, only a few genes conferring significant risk have been discovered, and the functional significance of the various gene variants is largely unknown. A glutathione S-transferase M1 (GSTM1) null allele (M1*0), which results in no protein expression and no glutathione antioxidant protection, was shown to be strongly associated with decreased lung growth and function in a cohort of school children in Southern California due to increased oxidant stress (Gilliland et al., 2002a, b, c). These investigators also examined the relationship between the GSTM1 null genotype, the prevalence of asthma, and the in utero exposure to tobacco smoke (Gilliland et al., 2002a, b, c). Children with the GSTM1 null genotype and who were born to mothers who smoked during pregnancy showed increased prevalence of early onset asthma, persistent asthma, wheezing with exercise, and increased need for medical intervention for asthma. Children with the functional GSTM1 genotype did not have asthmatic symptoms; nor did the children
with GSTM1 null genotype if they were born to nonsmoking mothers. Studies have also shown that children with the GSTP1 Val 105/Val 105 genotype had fewer school absences related to asthma, which suggests that this antioxidant protective gene may influence asthma severity (Gilliland et al., 2002a, b, c). Both GSTM1 and GSTP1 appear to modulate the effect of exposure to particulates in diesel exhaust as measured by IgE response following a nasal challenge; both IgE and histamine levels were increased. In a study of the effects of diesel exhaust on response to ragweed challenge, subjects were stratified on the basis of GSTM1 and GSTP1 genotypes and were allowed to inhale diesel exhaust particles followed by ragweed allergen. Subjects with either the GSTM1 null or GSTP1 Ile 105 genotype exhibited a stronger response than did subjects with other GSTM1 and GSTP1 genotypes (Bastain et al., 2003; Gilliland et al., 2004). Studies of the risk for asthma from lifetime exposure to ozone show that GST proteins are important determinants of asthma severity (David et al., 2003). In studies of the combined genotypes of NQO1 Pro187Ser and GSTM1 null or functional sequence, the presence of the Pro-Pro allele of NQO1 significantly reduced asthma risk in subjects homozygous for the GST1 deletion. The studies on lung function and gene variation cited above are representative examples of published and ongoing studies; for a more comprehensive survey, the reader is referred to the excellent review by Kleeberger and Peden (2005) and to other chapters in this book.
EPIGENETICS AND THE ENVIRONMENT A discussion of gene–environment interactions could not be concluded without at least a brief mention of epigenetics, as it represents yet another mechanism by which the environment can influence phenotype or disease development (see Chapter 5). Cells of multicellular organisms are genetically the same, yet structurally and functionally diverse owing to the differential expression of genes. Many of these differences in gene expression arise during development, and most are stable throughout the life of the organism. However, it is important to note that epigenetic changes can occur in the adult. Such stable alterations are referred to as “epigenetic” in that they do not involve mutations in the DNA itself. Most of the research on epigenetic mechanisms has focused on DNA methylation during early development and gene silencing. But studies are now focusing on epigenetic regulation of gene expression response to environmental exposures as a possible pathway for disease development. For example, one could imagine that inactivation of proteins that buffer the cell against environmental stress could lead to disease by allowing the expression of previously suppressed genetic variants. Much attention has been devoted to diet as it represents an abundant source of methyl group donors. Also, diet is known to be a particularly significant determinant in the timing of disease expression. For example, dietary supplements such as folate or vitamins that influence the activity of methyl-transfer enzymes can modulate the rate of disease development and can exert
References
marked effects on the incidence of colon cancer (for review, see Van den Veyver, 2002). Reduced amounts of folate have been associated with genomic instability, neural tube defects, and genomic hypomethylation. In addition, a diet deficient in methylcontaining components has been shown to induce liver cancer. The possibility that classical genetic mechanisms may not be adequate to explain the development of complex diseases looms large on the horizon. Normal physiological responses to certain environmental stimuli may be mediated by epigenetic mechanisms, so progress is likely to require a more intense research effort to fully understand how humans and other organisms respond to environmental cues. For example, epigenetic changes might explain the phenotypic discordance often observed among monozygotic twins, differences in age of onset of diseases, fluctuation in diseases and sporadic cases of disease. Also, since epigenetic-mediated changes in gene expression can occur at a much faster rate than DNA mutation, it represents a more adaptable mechanism for organisms to respond to rapidly changing environmental conditions.
CONCLUSION Although most of the visible environmental problems of the 1950s and 1960s have been ameliorated, massive quantities of toxic agents are still polluting our environment. Whether current levels of exposure to these agents are contributing to the high or increasing incidence of cancer, Parkinson’s and Alzheimer’s disease, asthma, autism, learning disabilities, diabetes, or other complex disorders is a matter of considerable concern. Finding answers to these questions has been a slow and difficult process. The traditional methodologies available to
■
57
medical researchers have not been adequate to elucidate the intricate gene–environment interactions involved in the development of complex diseases. But, thanks to the rare confluence of technology breakthroughs and the rethinking and redirection of the environmental health sciences over the past decade, the link between the environment and human health and disease can now be investigated with more rigor and specificity. For example, the sequencing of the human genome and the development and application of high throughput technologies to monitor the expression of genes and proteins in response to specific environmental exposures has created an unparalleled opportunity to study gene–environment interactions. By using a combination of new “-omic” technologies (genomics, proteomics, and metabolomics), in combination with population studies and better exposure data, one can achieve an integrative view of gene–environment interaction at the level of the whole organism. So, perhaps the most important development in healthcare in the 21st century will be the incorporation of knowledge of gene–environment interactions into public health and the practice of medicine. Many of the expected breakthroughs from use of these technologies will almost definitely lead to new opportunities to prevent, diagnose, and treat human diseases.
ACKNOWLEDGEMENTS I am grateful to my colleague, Dr Samuel Wilson, for his intellectual input and leadership in developing many of the concepts and research initiatives on gene–environment interactions discussed in this chapter. I am also grateful to Ms Betty Mills for editorial and technical assistance.
REFERENCES Altshuler, D., Brooks, L.D. et al. (2005). A haplotype map of the human genome. Nature 437, 1299–1320. Arruda, L.K., Sole, D., Carlos, E. et al. (2005). Risk factors for asthma and atopy. Curr Opin Allergy Clin Immunol 5, 153–159. Baldini, M., Lohman, I.C., Halonen, M. et al. (1999). A polymorphism in the 5 flanking region of the CD14 gene is associated with circulating soluble CD14 levels and with total serum IgE. Am J Respir Cell Mol Biol 20, 976–983. Bastain, T.M., Gilliland, F.D., Li,Y.F. et al. (2003). Intra-individual reproducibility of nasal allergic responses to diesel exhaust particles indicates a susceptible phenotype. Clin Immunol 109, 130–136. Baulcombe, D. (2002). RNA silencing. Curr Biol 12, R82–R84. Boguski, M.S. (1999). Biosequence exegesis. Science 286, 453–455. Brenner, S. (2002). Detective Rummage Investigates. Scientist 16(16), 15. Brown, P.O. and Hartwell, L. (1998). Genomics and human diseases – variations on variation. Nat Genet 18, 91–93. Brown, P.O. and Botstein, D. (1999). Exploring the New World of the genome with DNA microarrays. Nat Genet 21, 33–37. Carlson, C.S., Eberle, M.A. et al. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Human Genet 74, 106–120.
Carrington, M., Nelson, G.W., Martin, M.P. et al. (1999). Heterozygote advantage and B*35-Cw*04 disadvantage. Science 283, 1748–1752. Case, R.A.M. and Hasker, M.E. (1954). Tumors of the urinary in workmen engaged in the manufacture and use of certain dye stuff intermediates in the British chemical industry. Brit J Ind Med 11, 75. Cajas-Salizar, N., Sierra-Terres, C.H., Salama, S.A. et al. (2003). Combined effect of MPO, GSTM1 and GSTT1 polymorphisms on chromosome aberrations and lung cancer risk. Int J Hyg Environ Health 206, 473–483. Caspi, A., Sugden, K., Moffitt, T.E. et al. (2003). Influence of life stress on depression: Moderation by a polymorphism in the 5-HTT gene. Science 301, 386–389. Chamberlin, M. (1982). Bacterial DNA-dependent RNA polymerases. In The Enzymes, 3rd edition (P. Boyer, ed.) Vol. 15B. Academic Press, NY, pp. 61–108. Cole, P. (1974). Lower urinary tract, In Cancer Epidemiology and Prevention: Current Concepts (D. Schottenfield, ed.), Charles C. Thomas, Springfield, IL, pp. 233–262. Collins, F.S., Lonjou, C. and Morton, N.E. (1999). Genetic epidemiology of single-nucleotide polymorphisms. Proc Natl Acad Sci USA 96(26), 15173–15177.
58
CHAPTER 4
■
Genome and the Environment
Costa, L.G. and Furlong, C.E. (2002). Paraoxanase (PON1) in Health and Disease: Basic and Clinical Aspects. Kluwer Academic Publishers, Norwell, MA. David, G.L., Romieu, I., Sienra-Monge, J.J. et al. (2003). Nicotinamide adenine dinucleotide (phosphate) reduced: Quinine oxidoreductase and glutathione S-transferase M1 polymorphisms and childhood asthma. Am J Respir Crit Care Med 168(10), 1199–1204. Dean, M. (2003). Approaches to identify genes for complex human diseases: Lessons from Mendelian disorders. Hum Mutat 22, 261–274. Dean, M., Carrington, M., Winkler, C. et al. (1996). Genetic restriction of HIV-1 infection and progression of AIDS by a deletion allele of the CKR5 structural gene. Science 273(5283), 1856–1862. Dobzhansky, T. (1941). Genetics and the Origin of Species, 2nd edition. Columbia University Press, NY. Dobzhansky, T. and Levene, H. (1955). Genetics of natural populations. Genetics 40, 779–808. Doll, R. and Peto, R. (1981). The causes of cancer: Quantitative estimates of available risks of cancer in the United States today. J Natl Cancer Inst 66, 1191–1308. Feyler, A., Voho, A., Bouchardy, C. et al. (2002). Myeloperoxidase463G- a polymorphism and lung cancer risk. Cancer Epidemiol Biomarkers Prev 11, 1550–1554. Frye, R.F. (2004). Probing the world of cytochrome P450 enzymes. Mol Interv 4, 157–162. Futreal, P.A., Liu, Qingyun, et al. (1994). BRCA1 mutations in primary breast and ovarian carcinomas. Science 266, 120–122. Garrod, A. (1902). The incidence of alkatonuria: A study in chemical individuality. Lancet 2, 1616–1620. Gibson, G. and Wagner, G. (2000). Canalization in evolutionary genetics: a stabilizing theory? Bio Essays 22, 376–380. Gilliland, F.D., Li, Y.F., Dubeau, L. et al. (2002a). Effects of glutathione S-transferase M1, maternal smoking during pregnancy, and environmental tobacco smoke on asthma and wheezing in children. Am J Respir Crit Care Med 166, 457–463. Gilliland, F.D., Gouderman, W.J., Vora, H. et al. (2002b). Effects of glutathione S-transferase M1, T1 and P1 on childhood lung function growth. Am J Respir Crit Care Med 166, 710–716. Gilliland, F.D., Rappaport, E.B., Berhane, K. et al. (2002c). Effects of glutathione S-transferase P1, M1, and T1 on acute respiratory illness in school children. Am J Respir Crit Care Med 166, 346–351. Gilliland, F.D., Li, Y.F., Saxon, A. et al. (2004). Effect of glutathione S-transferase M1 and P1 genotypes on xenobiotic enhancement of allergic responses: Randomized, placebo-controlled crossover study. Lancet 363, 119–125. Guengerich, F.P. (1998). The environmental genome project: Functional analysis of polymorphisms. Environ Health Perspect 106, 365–368. Haldane, J.B.S. (1946). The interaction of nature and nurture. Ann Eugenics 13, 197–205. Hartman, J.L., Garvik, B. et al. (2001). Principles for the buffering of genetic variation. Science 291, 1001–1004. Hein, D.W., Doll, M.A., Rustan, T.D. et al. (1993). Metabolic activation and deactivation of arylamine carcinogens by recombinant human NAT1 and polymorphic NAT2 acetyltransferases. Carcinogenesis 14, 1633–1638. Hueper, W.C. (1969). Occupational and Environmental Cancer of the Urinary System.Yale University Press, New Haven, CT. p. 465. Hueper, W.C., Wiley, F.H. et al. (1938). Experimental production of bladder tumors in dogs by administration of beta-naphthylamine. J Ind Hyg Toxicol 20, 46.
Hulla, J.E., Miller, M.S. and Taylor, J.A. (1999). Symposium overview: The role of genetic polymorphisms and repair deficiencies in environmental disease. Toxicol Sci 47, 135–143. Humphries, S.E., Talmud, P.J., Hawe, E. et al. (2001). Apolipoprotein E4 and coronary heart disease in middle aged men who smoke: A prospective study. Lancet 358, 115–119. Hunter, D. (2005). Gene-environment interactions in human diseases. Nat Rev Genet 6, 287–298. Jarvik, G.P., Jampsa, R. et al. (2003). Novel paraoxonase: Nonsense and missense mutations predicted by functional genomic assay of status. Pharmacogenetics 13, 291–295. Kaiser, J. (1997). Environmental Institute lays plans for gene hunt. Science 298, 569–570. Kaiser, J. (2003). Tying genetic to the risk of environmental disease. Science 300, 563. Kelada, S.N., Eaton, D.L., Wang, S.S. et al. (2003). The role of genetic polymorphisms in environmental health. Environ Health Perspect 111, 1055–1064. Khot, U.N., Khot, M.B., Bajzer, C.T. et al. (2003). Prevalence of conventional risk factors in patients with coronary heart disease. JAMA 290, 899–904. Kiyohara, C., Yoshimasu, K., Takayama, K. et al. (2005). NQO1, MPO, and the risk of lung cancer: A huge review. Genet Med 7, 463–474. Kleeberger, S.R. and Peden, D. (2005). Gene-environment interactions in asthma and other respiratory diseases. Annu Rev Med 56, 383–400. Kouhry, M.J. and Wagner, D.K. (1995). Epidemiological evasuation of the use of genetics to improve the predictive value of disease risk factors. Am J Hum Genet 56, 835–844. Kraft, P. and Hunter, D. (2005). Integrating epidemiology and genetic association: The challenge of gene-environment interaction. Philos Trans R Soc Lond B Biol Sci 360, 1609–1616. Krajinovic, M., Sinnett, H. et al. (2002). Role of NQO1, MPO, and CYP2El genetic polymorphisms in the susceptibility of childhood acute lymphoblastic leukemia. Int J Cancer 97, 230–236. Kruglyak, L. and Nickerson, D.A. (2001). Variation is the spice of life. Nat Genet 27, 234–236. Lander, E.S., Linton, L.M., Birren, B. et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. LeMarchand, L., Seifriend, A., Lum, A. et al. (2000). Association of the myeloperoxidase-463G- a polymorphism with lung cancer risk. Cancer Epidemiol Biomarkers Prev 9, 181–184. Lichtenstein, P., Niels, V.H., Verkasala, P.K. et al. (2000). Environmental and heritable factors in the causation of cancer: Analysis of cohorts of twins from Sweden. N Eng J Med 343, 78–85. London, S.J., Lehman,T.A. and Taylor, J.A. (1997). Myeloperoxidase genetic polymorphism and lung cancer risk. Cancer Res 57, 5001–5003. Lovett, R.A. (2000). Toxicologists brace for genomics revolution. Science 289, 536–573. Lower, G.M., Nilsson, T. et al. (1979). N-acetyltransferase phenotype and risk of urinary bladder cancer: Approaches to molecular epidemiology. Preliminary results in Sweden and Denmark. Environ Health Perspect 29, 71–79. Marcus, P.M., Hayes, R.B., Vineis, P. et al. (2000). Cigarette smoking, N-acetyltransferase 2 acetylation status and bladder cancer risk: a case series meta-analysis of a gene-environment interaction. Cancer Epidemiol Biomarkers Prev 9, 461–467. Martin, M.P., Dean, M., Smith, M.W. et al. (1998). Genetic acceleration of AIDS progression by a promoter variant of CCR5. Science 282, 1907–1911. Mattano, S.S., Land, S., King, C.M. et al. (1989). Purification and biochemical characterization of hepatic arylamine-N-acetyltransferase
References
from rapid and slow acetylator mice: Identify with arylhydroxamic acid N,O-acetyltransferase. Mol Pharmacol 35, 599–609. Moffitt, T.E., Caspi, A. and Rutter, M. (2005). Strategy for investigating interactions between measured genes and measured environments. Arch Gen Psychiatry 62, 473–481. Montgomery, H.E., Clarkson, P., Nwose, O.M. et al. (1996). The acute rise in serum fibrinogen concentration with exercise is influenced by the G453-A polymorphism in the beta-fibrinogen gene. Arterioscler Thromb Vasc Biol 16, 386–391. Nebert, D.W., McKinnon, R.A. and Puga, A. (1996). Human drug metabolizing enzyme polymorphisms: Effects on toxicity and cancer. DNA Cell Biol 15, 273–280. Olden, K. (1993). Opportunities in environmental health research. Environ Health Perspect 101, 6–7. Olden, K. (2004a). Use of omic approaches in unraveling mechanisms of gene-environment interactions. Curr Genomics 5, 1–6. Olden, K. (2004b). Relevance of the National Institutes of Health Roadmap Initiative to the field of environmental health. Environ Health Perspect 112, 3. Olden, K. (2007). Commentary: From phenotype to genotype, to gene-environment interaction and risk for complex diseases. Int J Epidemiol 36, 11–18. Olden, K. and Guthrie, J. (2000). Perspective on environmental toxicant issues: Information gaps in basic environmental health sciences. Environ Epidemiol Toxicol 2, 172–178. Olden, K. and Wilson, S.H. (2000). Environmental health and genomics: Visions and implications. Nat Rev Genet 1, 149–153. Olden, K., Guthrie, J. and Newton, S.A. (2001). A bold new direction for environmental health research. Am J Public Health 91, 1964–1967. Ordovas, J.M. (2004). Pharmacogenetics of lipid diseases. Hum Genomics 2, 111–125. Ordovas, J.M. and Corella, D. (2002). Genetic variation and lipid metabolism: Modulation by dietary factors. Circulation 106, 2315–2321. Pearson, H. (2004). Mouse sequencing plan aims to boost models. Nature 432, 5. Powell, J.J., Vande Mater, J. and Gerschwin, M.E. (1999). Evidence for the role of environmental agents in the initiation and progression of autoimmune conditions. Environ Health Perspect 107, 667–672. Rothman, N.,Wacholder, S., Caporoso, M. et al. (2001). The use of common genetic polymorphisms to enhance the epidemiologic study of environmental carcinogens. Biochim Biophys Acta 1471, C1–C10. Scheiner, S.M., Caplan, R.L. et al. (1991). The genetics of phenotypic plasticity III: Genetic correlations and fluctuating asymmetries. J Evol Biol 4, 51–68. Schena, M., Shalon, D. et al. (1995). Protein quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470. Schwartz, D. and Collins, F. (2007). Environmental biology and human disease. Science 316, 695–696. Skibola, C.F., Smith, M.T. et al. (1999). Polymorphisms in the methylenetetrahydrofolate reductase gene are associated with susceptibility to acute leukemia in adults. Proc Natl Acad Sci USA 96, 12810–12815. Skibola, C.F., Smith, M.T. et al. (2002). Polymorphisms in the thymidylate synthetase and serine hydroxymethyltransferase genes and risk of adult acute lymphocytic leukemia. Blood 99, 3786–3791. Smith, M.W., Dean, M., Carrington, M. et al. (1997). Contrasting genetic influence of CCR2 and CCR5 variants on HIV-1 infection and disease progression. Science 277, 959–965. Smith, M.T., Wang, Y. et al. (2001). Low NAD(P): Quinone oxidoreductase activity is associated with increased risk of leukemia
■
59
with MLL translocations in infants and children. Blood 100, 4590–4593. Smith, M.T., Wang, Y., Skibola, C.F. et al. (2002). Low NAD(P)H: quinine oxidoreductase activity is associated with increased risk of leukemia with MLL translocations in infants and children. Blood 100, 4590–4593. Sunaga, N., Kohno, T., Yanagitami, N. et al. (2002). Contribution of the NQO1 and GSTT1 polymorphisms in lung adenocarcinoma susceptibility. Cancer Epidemiol Biomarkers Prev 11, 730–738. Talmud, P.J., Bujac, S., Hall, S. et al. (2000). Substitution of asparagine for aspartic acid at residue 9(D9N) of lipoprotein lipase markedly augments risk of coronary heart disease in male smokers. Atherosclerosis 149, 75–81. Tanner, C.M., Ohtman, R., Goldman, S.M. et al. (1999). Parkinson disease in twins: A etiologic study. JAMA 281, 341–346. Thompson, J.R., Gerald, P.F. et al. (2001). Maternal folate supplementation in pregnancy and protection against acute lymphoblastic leukemia in childhood: A case-control study. Lancet 358, 1935–1940. Trask, B.J. (1999). Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 303–313. Van den Veyver, I.B. (2002). Genetic effects of methylation diets. Annu Rev Nutr 22, 255–282. Venter, J.C., Adams, M.D., Myers, E.W. et al. (2001). The sequence of the human genome. Science 291, 1304–1351. Verkasala, P.K., Kopria, J., Kaskenuvua, M. et al. (1999). Genetic predisposition, environmental and cancer incidence: National Twin Study in Finland. Int J Cancer 83, 743–749. Vessel, E.S. (1991). Genetic and environmental actors causing variation in drug response. Mutat Res 247, 241–257. Vischetti, M., Zito, F., Donati, M.B. et al. (2002). Analysis of geneenvironment interaction in coronary heart disease: Fibrinogen polymorphisms as an example. Ital Heart J 3, 18–23. Wagner, G.P., Booth, G. et al. (1997). A population genetic theory of canalization. Evolution 51, 329–347. Waters, M., Boorman, G., Bushel, P. et al. (2003). Systems toxicology and the chemical effects of biological systems (CEBS) knowledge based. Environ Health Perspect Toxicogenomics 111, 15–28. Watson, J.D. and Crick, F.H.C. (1953). Genetic implications of the structure of deoxyribonucleic acid. Nature 171, 964–967. Watson, J.D., Tooze, J. and Kurtz, D.T. (1983). Recombinant DNA: a short course. W.H. Freeman, N.Y. Wiencke, J.K., Spitz, M.R., McMillan, A. et al. (1997). Lung cancer in Mexican-Americans and African Americans is associated with the wildtype genotype of the NAD(P) H-quinone oxidoreductase polymorphism. Cancer Epidemiol Biomarkers Prev 6, 87–92. Wilson, S.H. and Olden, K. (2004). The environmental genome project; phase I and beyond. Molecular Intervention 4, 147–156. Wright, S. (1932). The roles of nutrition, inbreeding, crossbreeding and selection in evolution. Proc. of the 6th International Congress of Genetics 11, 356–366. Yu, S., Sher, B. and Kudryk, B. (1984). Fibrinogen precursors. Order of assembly of fibrinogen chains. J Biol Chem 259, 10574–10581. Zito, F., Di Castelnuevo, A., D’Orazio, A. et al. (1999). Helicobacter pylori infection and the risk of myocardial infarction: Role of fibrinogen and its genetic control. Thromb Haemost 82, 14–18. http://egp.gs.washington.edu/welcome.html http://www.genome.utah.edu/genesnps http://www.niehs.nih.gov/research/supported/programs/egp/ http://ntp.nieh.nih.gov/
CHAPTER
5 Epigenomics and Its Implications for Medicine Moshe Szyf
INTRODUCTION The genome holds the entire information required to encode all the proteins expressed in an organism. However, expression of this information is programmed so that only a fraction of the potential repertoire of proteins is expressed in a given cell type. This programming is defined by the epigenome. The epigenome controls gene expression programs by defining the accessibility of transcription machineries, as well as other nuclear machineries, to genes (Razin, 1998). Programming of the epigenome occurs during embryonic development, whereby it establishes the diversity of gene expression patterns amongst organs, tissues, and cell types (Razin, 1998). These programs are stable and maintained through life. The epigenetic machinery is responsive to signaling pathways, which could be activated by different triggers. It has become clear recently that epigenetic programming could potentially change through life, and not just during embryogenesis, in response to environmental exposures such as nutrition, drugs, and social interactions (Meaney and Szyf, 2005). The epigenome thus mediates the effect of the environment on our genes. These effects could result in novel gene expression programs triggering new physiological, behavioral, or pathological states. Improper epigenetic programming might lead to consequences that could be similar to the effects of genetic mutations or deletions. For example, aberrant epigenetic programming could lead to activation of a gene that is normally silenced or
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 60
to inactivation of a gene that is normally active. An example is the aberrant silencing of tumor-suppressor genes in cancer, which could be mediated by either an inactivating mutation or by epigenetic silencing through DNA methylation (Baylin et al., 2001). Epigenetic differences amongst individuals can result in interindividual phenotypic differences with respect to behavior, physiology, and pathology and response to drugs similar to genetic polymorphisms. However, the extent to which epigenetic interindividual differences play a role in generating interindividual phenotypic differences is still unknown. In contrast to the genome, which is identical in different cell types and through life, the epigenome is dynamic and varies from cell type to cell type and from time point to time point in life. This diversity allows the epigenome to regulate the assortment of gene expression programs in a multicellular organism. The epigenome consists of chromatin and its modifications, as well as a covalent modification of cytosine rings found at the dinucleotide sequence CpG (CG) (Razin, 1998). The basic building block of chromatin is the nucleosome, which is formed of an octamer of histone proteins containing an H3-H4 tetramer flanked on either side with a H2A-H2B dimer (Finch et al., 1977). The N-terminal tails of these histones are extensively modified by methylation (Jenuwein, 2001), phosphorylation, acetylation (Wade et al., 1997), and ubiquitination (Shilatifard, 2006). Different histone variants also play a regulatory role (Henikoff et al., 2004). The pattern of histone modifications and
Copyright © 2009, Elsevier Inc. All rights reserved.
DNA Methylation Patterns
variants creates a “histone code” that delineates the parts of the genome expressed at a given point in time in a given cell type (Jenuwein and Allis, 2001). Specific transcription factors and transcription repressors recruit histone-modifying enzymes to specific genes and thus define the profile of histone modification around genes (Jenuwein and Allis, 2001). The best-investigated examples are histone acetyltransferases (HAT), which acetylate histone H3 and H4 tails at the lysine-9 (K9) residue, and histone deacetylases (HDAC), which deacetylate histone tails (Kuo and Allis, 1998). Histone acetylation is believed to be a predominant signal for an active chromatin configuration. Deacetylated histones signal inactive chromatin, chromatin associated with inactive genes. Many repressors and repressor complexes recruit HDACs to genes, thus causing their inactivation. Histone tail acetylation is believed to enhance the accessibility of a gene to the transcription machinery whereas deacetylated tails are tightly interacting with DNA and limit its accessibility to transcription factors (Kuo and Allis, 1998). Histone methylation at K9 of H3-histone signals inactivity and is determined by the recruitment of histone methyltransferases (HMTs) such as SUV3–9 to genes (Lachner et al., 2001). The heterochromatin-associated protein HP-1, which binds methylated histones and precipitates an inactive chromatin structure (Lachner et al., 2001), recognizes methylated histones. Recently described histone demethylases remove the methylation mark, causing either activation or repression of gene expression (Shi et al., 2004;Tsukada et al., 2006). Chromatin remodeling complexes, which are ATP dependent, alter the position of nucleosomes around the transcription initiation site and define its accessibility to the transcription machinery (Varga-Weisz and Becker, 2006). It is becoming clear now that there is an interrelationship between chromatin modification and chromatin remodeling. For example, the presence of BRG1 (the catalytic subunit of SWI/ SNF-related chromatin remodeling complexes) is required for histone acetylation and regulation of -globin expression during development (Bultman et al., 2005). The targeting of chromatin modifications and remodeling to discrete genetic loci is established by specific trans-acting factors, which might be responsive to cellular as well as extracellular signaling. Chromatin is dynamic and plastic and could respond by altered recruitment of the different histone modification enzymes in response to a variety of signals. In addition to chromatin, which is associated with DNA, the DNA molecule itself is chemically modified by methyl residues at the 5ⴕ position of the cytosine rings in the sequence CG in vertebrates (Razin, 1998) (see Chapter 11). What distinguishes DNA methylation in vertebrate genomes is the fact that not all CGs are methylated in any given cell type (Razin, 1998). Distinct CGs are methylated in different cell types, generating cell type– specific patterns of methylation. Thus, the DNA methylation pattern confers upon a genome itself its cell type identity (Razin, 1998). Since DNA methylation is part of the chemical structure of the DNA itself, it is more stable than other epigenetic marks and thus has extremely important diagnostic potential (Beck et al., 1999). It was originally believed that the DNA methylation
■
61
pattern is established during development and is then maintained faithfully through life by the maintenance DNA methyltransferase (DNMT) (Razin and Riggs, 1980). The DNA methylation reaction was believed to be irreversible; thus the only way methyl residues could be lost was through replication in the absence of DNMT by passive demethylation (Razin and Riggs, 1980). Recent data support the idea that similar to chromatin modification DNA methylation is also potentially reversible even in post-mitotic tissues (Weaver et al., 2004a, b). A hallmark of DNA methylation patterns is the correlation between chromatin and the DNA methylation pattern. Active chromatin is usually associated with unmethylated DNA while inactive chromatin is associated with methylated DNA (Razin, 1998). Current data suggest a bilateral relation between chromatin structure and DNA methylation. This interrelationship of chromatin modification state and DNA methylation has important implications for understanding of how the epigenome is generated and of its utility as a diagnostic and as a therapeutic target. The gene carries the basic unit of information, but this information has an impact only once it is properly programmed by the epigenome. Thus, the epigenome is a prime target for intervention in human diseases and other conditions that involve altered gene function. Genetic mutations and polymorphisms are irreversible. In difference from the genome, the epigenome is reversible. One could potentially alter epigenetic patterns by either targeting pharmacologically the different chromatin modifications and DNA methylation enzymes or affecting the environmental exposures and the signaling pathways mediating these effects. Epigenetic profiles have therefore important practical implications in prophylaxis, diagnosis, and therapeutics. A true realization of the potential of epigenetics will require a thorough understanding of the mechanisms leading to the formation of specific DNA methylation patterns and a comprehensive mapping of dynamic epigenomes with relation to different environmental exposures. This chapter will review the basic machineries involved in shaping the epigenome and the potential implications of epigenetics on understanding, diagnosing, treating, and preventing human pathologies.
DNA METHYLATION PATTERNS The DNA Methylation Pattern Is Cell Type Specific and Correlates with Chromatin Structure Modification of nucleotides in DNA by addition of a methyl group is widespread but not ubiquitous in biology. Both 5-methylcytosine and 6-methyladenine methylation are found in bacteria and are encoded by sequence-specific DNMTs. The methylated sequence distinguishes one strain from the other and serves for self versus non-self demarcation, protecting bacteria from invading DNA arriving from a different host. For this restriction modification system to function properly, it requires complete methylation of the host DNA in a unique 4–8 base sequence (Bestor, 1990; McClelland et al., 1994).
62
CHAPTER 5
■
Epigenomics and Genomic Medicine
DNA methyltransferase (DNMT)
HO HO
S-Adenosylmethionine (SAM)
—
CH N
N
N
CH2SCH2CH2CH(NH2)
—
O— —C
N
—
CH N
COOH
CH — CH3
COOH
—
—
O— —C
C
—
—
CH2SCH2CH2CH(NH2)
—
— N
—
N
CH
N
—
C
CH3
NH2
NH2
—
N
—
N
NH2
—
—
NH2
HO HO
S-Adenosylhomocystein (SAH)
Figure 5.1 The DNA methylation reaction. The DNMTs catalyze the transfer of a methyl group from the methyl donor S-adenosyl methionine onto the 5 in the cytosine ring in DNA.
DNA methylation is not found in all organisms. Certain yeast strains, including Saccharomyces cerevisiae, are devoid of DNA methylation. Since these organisms are eukaryotes with extensive epigenetic control in spite of lack of methylation, it suggests that the primary component of epigenetic control is chromatin, which is highly developed in yeast, and not DNA methylation, which is absent in yeast. Similarly, it was long believed that nematodes and insects are devoid of DNA methylation. However, recent data suggest that Drosophila flies do contain minute amounts of 5-methylcytosine, which is relatively abundant during early development (Lyko et al., 2000). In contrast to bacterial methylation systems, the primary methylated sequence in vertebrates is composed of only two bases, the dinucleotide sequence CG. In plants, CNG (where N is either A, G, or T) sequences are also methylated in addition to CGs (Razin and Riggs, 1980). Another feature that distinguishes bacterial from vertebrate methylation is the fact that only a fraction (80%) of the methylatable CG population is methylated in vertebrate DNA. Different CG sites are methylated in different tissues, creating a pattern of methylation that is gene and tissue specific (Razin and Riggs, 1980). In some cases, such as parentally imprinted genes (Swain et al., 1987) or genes residing on the inactive X chromosome (Mohandas et al., 1981), the two alleles of the same gene are differentially methylated in the same cell. DNA methylation is the only component of the covalent DNA structure that shows cell-, parent of origin- and allelespecific identity. An identical sequence could be either methylated or nonmethylated in different cell types or in different alleles in the same cell type. The DNA methylation pattern is not copied by the DNA replication machinery, but by independent enzymatic machinery the DNMTs (Razin and Cedar, 1977) (Figure 5.1). DNA methylation patterns in vertebrates are distinguished by their tight correlation with chromatin structure. Active regions of the chromatin, which enable gene expression, are associated with hypomethylated DNA whereas hypermethylated DNA is packaged in inactive chromatin (Razin and Cedar, 1977). It is generally accepted that DNA methylation plays an important role in regulating gene expression. Establishing and Maintaining DNA Methylation Patterns Vertebrates establish their DNA methylation patterns during development and these patterns are then maintained in each
cell type during life. Nevertheless, differences in environmental exposure result in differences in DNA methylation patterns that emerge later in life (Weaver et al., 2004a, b), even in monozygotic twins (Fraga et al., 2005). DNA methylation patterns might be affected by changes in the methyl content of diets, maternal behavior (Weaver et al., 2005), and exposures to certain drugs. Understanding the epigenetic drift during life and its environmental causes is now attracting considerable attention and could potentially develop into an important field in medicine. DNA Methyltransferases The methylation of DNA occurs immediately after replication by a transfer of a methyl moiety from the donor S-adenosyl-lmethionine (SAM) (AdoMet) in a reaction catalyzed by DNMTs (Figure 5.1). DNA methylation is only one of the methylation reactions occurring in the cell that uses AdoMet as a cofactor. The N-terminus of most cytosine 5-methyltransferases defines their unique features and regulatory functions which are responsible for specific targeting of DNMTs to their site of action. In mammalian DNMT1, the N-terminus contains a number of functional domains that integrate DNMT activity with chromatin-modifying enzymes such as histone deacetylase 1 (HDAC1) (Fuks et al., 2000) and HDAC2 (Robertson et al., 2000) and coordinate DNMT activity with the replication fork and DNA replication (Milutinovic et al., 2003). This obviously has therapeutic implications, since it might be possible to target these functions independently of DNA methylation and steer clear of the adverse effects associated with loss of DNA methylation (for a review see Szyf [2001]). De Novo and Maintenance Methyltransferase Activities It is critical to accurately replicate the methylation pattern during cell division. Razin and Riggs proposed more than two decades ago that the maintenance DNMT prefers a hemimethylated substrate (Razin and Riggs, 1980). Since hemimethylated sites are generated during DNA replication when a nascent unmethylated C is synthesized across a methylated C in the template parental strand, the DNMT accurately copies the methylation pattern of the template strand (Figure 5.2). Some DNMTs distinguish hemimethylated from nonmethylated sites. In support of the Razin and Riggs model, three distinct phylogenic DNMTs have been identified in mammals.
Chromatin Modification
De novo DNMT (DNMT3a, 3b, and DNMT1?)
Demethylase
DNA replication
Maintenance DNMT (DNMT1)
Figure 5.2 DNA DNMT and demethylases sculpt the methylation pattern. De novo DNMTs introduce new methyl groups into CG sites, which are unmethylated on both strands of DNA. Demethylases remove methyl groups from DNA. Maintenance DNMT copies the DNA methylation pattern from the parental strand into the nascent strand of DNA during DNA replication. Open circles indicate an unmethylated CG site in DNA; red circles indicate a methylated CG site.
DNMT1 shows preference for hemimethylated DNA in vitro, which is consistent with its role as a maintenance DNMT, whereas DNMT3a and DNMT3b methylate unmethylated and methylated DNA at an equal rate, which is consistent with a de novo DNMT role (Okano et al., 1998). Two additional DNMT homologs have been found: DNMT2, whose substrate and methylation activity is unclear (Vilain et al., 1998), and DNMT3L, which belongs to the DNMT3 family of DNMTs by virtue of its sequence and is essential for the establishment of maternal genomic imprints but lacks key methyltransferase motifs, and is possibly a regulator of methylation rather than an enzyme that methylates DNA (Bourc’his et al., 2001). Knockout mouse data indicate that DNMT1 is responsible for a majority of DNA methylation marks in the mouse genome (Li et al., 1992), whereas DNMT3a and DNMT3b are responsible for some but not all de novo methylation during development (Okano et al., 1999). It has recently been demonstrated that an additional factor is required for targeting DNMT1 to newly replicating hemimethylated DNA, the protein UHRF1 (ubiquitin-like, containing PHD and RING finger domains 1), also known as NP95 in mouse and ICBP90 in human (Bostick et al., 2007). Several lines of evidence indicate that DNMTs are targeted to specific sequences by sequence-specific factors, which recognize specific sequences of DNA. For example, the HMT EZH2 or the oncoprotein PML-RAR target DNMTs to specific sequences in DNA (Di Croce et al., 2002;Vire et al., 2006). Thus, in addition to template-directed DNA methylation in replicating DNA, specific factors are responsible for specific methylation in specific genes. Is DNA Methylation a Reversible Reaction? The most controversial issue in the DNA methylation field is the question of whether the DNA methylation reaction is reversible. A passive process through replication in the absence of DNMT
■
63
would result in loss of methylation. This could serve as a mechanism for both site specific and global demethylation (see Figure 5.2 for a scheme of the different DNA methylation reactions). There is a long list of data from both cell culture and early mouse development supporting the hypothesis that active methylation occurs in embryonal and somatic cells. There are now convincing examples of active, replication-independent DNA demethylation during development as well as in somatic tissues. Active demethylation was reported for the myosin gene in differentiating myoblast cells (Lucarelli et al., 2001), the INTERLEUKIN-2 gene upon T-cell activation (Bruniquel and Schwartz, 2003), the INTERFERON- gene upon antigen exposure of memory CD8 T cells (Kersh et al., 2006), and the glucocorticoid receptor gene promoter in adult rat brains upon treatment with the HDAC inhibitor trichostatin A (TSA) (Weaver et al., 2004a, b). Enzymatic Activities Responsible for Demethylation The characteristics of the enzymes responsible for active demethylation are controversial. One proposal has been that a G/T mismatch repair glycosylase also functions as a 5-methylcytosine DNA glycosylase, recognizing methylcytosines and cleaving the bond between the sugar and the base. The abasic site is then repaired and replaced with a nonmethylated cytosine resulting in demethylation (Jost, 1993). An additional protein with a similar activity was recently identified, the methylated DNA binding domain protein 4 (MBD4) (Zhu et al., 2000). Another report has proposed that MBD2 has demethylase activity. MBD2b (a shorter isoform of MBD2) was shown to directly remove the methyl group from methylated cytosine in methylated CGs (Bhattacharya et al., 1999). This enzyme was therefore proposed to reverse the DNA methylation reaction. However, other groups disputed this finding (Ng et al., 1999). Understanding the mechanisms responsible for demethylation is one of the most important questions in the field.
CHROMATIN MODIFICATION Modifications of Histones with Implications for Gene Expression Programming: The Histone Code The basic unit of the chromatin is the nucleosome, which is a dimer of two identical complexes, each consisting of four histone proteins: H2A, H2B, H3, and H4 (Pruss et al., 1995). The N-terminal tails of histones H3 and H4 are highly charged and are tightly associated with DNA. Chromatin had been viewed in the past as a static entity, which packaged DNA in a condensed form and maintained its integrity. However, a vast body of literature has established that histones undergo diverse covalent modifications, which include acetylation, methylation, phosphorylation, ubiquitination, and sumoylation (Strahl and Allis, 2000). Histone acetylation of H3 and H4 histone N-terminal tails is believed to be a predominant signal for active chromatin by enhancing the accessibility of the transcription machinery. Histone methylation at the K9 of histone H3 is repressive and results in exclusion of acetylation at the same residue, which is
64
CHAPTER 5
■
Epigenomics and Genomic Medicine
required for activation of gene expression. H3-histone K4 methylation is associated, on the other hand, with gene activity and transcription elongation (Santos-Rosa et al., 2002). These modifications are believed to form a “histone code” that regulates chromatin function by affecting the structural dynamics of the nucleosome (Jenuwein and Allis, 2001). Proteins That Interpret the Histone Code Histone modifications regulate chromatin function by either altering the accessibility of DNA to different trans-acting factors or recruiting specific proteins that recognize a single or conformational set of modifications (Strahl and Allis, 2000). Examples of proteins whose interaction with DNA is defined by different histone modifications include bromodomain-containing proteins, which interact with acetylated histones (Filetici et al., 2001), and chromodomain-containing proteins, which interact with methylated lysines (Lachner and Jenuwein, 2002). A classic example is the HP-1, which recognizes histone H3 methylated at K9 and is believed to be involved in gene silencing in heterochromatin (Lachner and Jenuwein, 2002). Different histone variants also play regulatory roles (Henikoff et al., 2004; Pusarla and Bhargava, 2005; Sarma and Reinberg, 2005; Zhang et al., 2005). Remodeling of chromatin is also critical for gene function and involves ATPasecontaining complexes such as SWI/SNF, which propels energydependent nucleosomal movement required for opening up of chromatin around transcription initiation regions (Gibbons, 2005; Muchardt and Yaniv, 1999;Wolffe and Hayes, 1999). The State of Modification of Histones Is a Dynamic Balance of Modifying and Demodifying Enzymatic Activities Histone acetylation is catalyzed by HATs, which transfer an acetyl group from the cofactor acetyl CoA onto the -position on lysine, and reversed by HDAC (Kuo and Allis, 1998). Two phylogenetic superfamilies of HAT are known to date: the GCN5 N-acetyltransferases related family GNAT and the MYST family (Verdone et al., 2005). PCAF is a mammalian transcriptional coactivator, which was discovered based on its homology to GCN5-HAT. PCAF acts as a transcriptional coactivator in many processes and is known to interact with the CREB binding protein (CBP) and its homolog p300. P300 and CBP are ubiquitously expressed transcription coactivators. The MYST superfamily includes the human acetyltransferases MOZ and Tip60 (Borrow et al., 1996). Tip60 is involved in histone acetylation mostly on histone H4 in response to DNA damage and transcriptional activation of genes involved in repair. MOZ is involved in acute myeloid leukemia as suggested by the fusion of MOZ and CBP found in these leukemias, which results in mistargeting of the fused protein causing aberrant activation of gene expression (Borrow et al., 1996). There are four distinct phylogenetic classes of HDACs known to date (see Holbert and Marmorstein [2005] for a review). Class 1 includes HDACs 1–3, 8; class 2 includes the human HDACs 4–7, 9, 10; class 3 includes the sirtuins; and class
4 includes recently identified HDAC 11. Class 1 and 2 HDACs were shown to be involved in malignancies in humans and are the target of the HDAC inhibitors TSA and SAHA. Alterations in class 3 HDACs or sirtuins were shown to be associated with age-associated diseases such as type II diabetes, obesity, and neurodegenerative disorders. The mechanism of action of class 1 and 2 HDACs seems to involve a mode of catalysis in which the bound zinc ion mediates the nucleophilic attack of a water molecule on the acetylated lysine substrate whereas the class 3 HDACs require the presence of the oxidized form of the cofactor nicotinamide adenine dinucleotide (NAD+) (Holbert and Marmorstein, 2005). Methylation marks on histones could either activate or silence gene expression as discussed above. Arginine and lysine HMTs catalyze the transfer of a methyl group from the methyl donor SAM to either arginines or lysines. In some cases, a single lysine can be mono-, di-, or tri-methylated, with different functional consequences for each of the three forms. Mammalian homologs of Suv3–9 that methylate lysine-9 on the tail of H3-histones were the first HKMT (histone lysine methyltransferases) to be identified (Rea et al., 2000). SET-containing HKMT methylates Lys-4, -9, -27, or -36 of histone H3 and lys-20 of histone H4. There are two distinct families of lysine methyltransferases SET which contain a SET domain of approximately 130 amino acids and Dot1p (Peters and Schubeler, 2005). All arginine methyltransferases are members of the PRMT family (for a review, see Cheng et al. [2005]). PRMT1 and PRMT4/CARM1 methylate histones H3, H4, and H2B. Although it was long believed that histone methylation is irreversible, histone demethylases have been identified that can remove either repressive or activating histone methylation marks (Metzger et al., 2005; Shi et al., 2004). Two classes of histone demethylases were recently discovered: FAD-dependent amine monooxygenases, which demethylate histones via an oxidation reaction releasing formaldehyde as the leaving group; and JmjC domain demethylases, which also release formadelhyde but require -ketoglutarate and Fe(II). Lysine-specific demethylase-1 (LSD1) is a FAD-dependent enzyme and demethylates monoand di-methyl H3 K4 (for a review see Holbert and Marmorstein [2005]). Since K4 methylation marks histones associated with active genes, it was anticipated that demethylation of this mark would have a repressive effect on gene expression. In agreement with this prediction, LSD1 was found in a complex with the CoREST complex, which represses neuronal genes in nonneuronal cells (Lee et al., 2005). A representative of the second class of histone demethylases, JHDM1 (JmjC domain-containing histone demethylase-1), which specifically demethylates histone H3 at lysine 36 (H3-K36), was recently discovered (Tsukadaet al., 2006). H3-K36 is probably involved in promoting transcription elongation and thus maintenance of a transcriptional active state (Peters and Schubeler, 2005). Since other methylation sites are known to exist in histones, it is anticipated that more demethylases with different specificities will be discovered in the coming years. In summary, the histone modification state is defined by a dynamic equilibrium of modifying and demodifying enzymes.
DNA Methylation and Chromatin States Cooperatively Determine the State of Activity of Genes
■
65
Transcription factor
X CH3
Ac
Transcription factor
Ac
CH3
Ac
Ac Sin3A
HDCA
MECP2
CH3
H3
CH3
CH3
SUV39 HP1
Figure 5.3 DNA methylation silences genes by two mechanisms. The first mechanism is direct (upper right). A methyl group in the recognition element for a transcription factor inhibits the binding of the transcription factor to the promoter resulting in inhibition of transcription. The second mechanism (bottom right) involves recruitment of MBDs which in turn recruit co-repressors (Sin3A), HDAC, HMTs (Suv39), and HP-1, a protein that binds K9 methylated H3-histones. This results in formation of an inactive chromatin structure and silencing of transcription.
The direction of the equilibrium is responsive to cellular signaling pathways which activate transcriptional activators or repressors. Transcription factors recruit HATs and K4-HMTs, while transcriptional repressors recruit HDACs, K9-HMTs, and K4-histone demethylases to suppress gene expression. The relative balance of activators and repressors thus defines the state of modification of a gene. This dependence on targeting of histone-modifying enzymes by sequence-specific factors explains the gene selectivity of the histone modification state (Figures 5.3, 5.4 and 5.5).
A second mechanism involves the recruitment of methyl-domain binding proteins such as MeCP2 to methylated genes (Nan et al., 1997). All these proteins share a methyl-binding domain, which recognizes methylated CG dinucleotides in regulatory regions of genes. MBD3 does not bind methylated DNA, but is probably recruited to methylated genes through its interactions with MBD2 (Zhang et al., 1999). MeCP2 recruits a repressive multiprotein complex, which contains the co-repressor Sin3A and histonemodifying enzymes such as HDAC1 and the HMT Suv3–9 (Fuks et al., 2003). Thus, methylation of DNA can lead to chromatin modification, which creates conditions for gene silencing.
DNA METHYLATION AND CHROMATIN STATES COOPERATIVELY DETERMINE THE STATE OF ACTIVITY OF GENES
DNMTs Form Protein–Protein Interactions with Chromatin-Modifying Enzymes There is a bilateral relation between chromatin and DNA methylation. As much as DNA methylation targets chromatinmodifying enzymes to genes, chromatin-modifying enzymes recruit DNA methylating enzymes (Figures 5.4 and 5.5). The mechanisms by which chromatin directs DNA methylation is via protein–protein interaction, through recruitment of DNMTs to specific genes by chromatin-modifying enzymes. A growing list of histone-modifying enzymes has been shown to interact with DNMT1, such as HDAC1 and HDAC2, the HMTs SUV3–9 and EZH2, a member of the multiprotein Polycomb complex PRC2, which methylates H3-histone at the
DNA Methylation Leads to Chromatin Inactivation DNA methylation can mark gene for silencing by two mechanisms (Figure 5.3). First, DNA methylation in a recognition element of a transcription factor such as MYC (Prendergast and Ziff, 1991) interferes with its binding. By blocking the binding of these transcription factors, recruitment of acetyltransferases is inhibited, probably resulting in a hypoacetylated chromatin structure, which is inaccessible to the transcription machinery.
66
CHAPTER 5
■
Epigenomics and Genomic Medicine
K27 residue (Fuks et al., 2000, 2003; Rountree et al., 2000;Vire et al., 2006). DNMT3a was recently also shown to interact with EZH2, which targets the DNA methylation-histone modification multiprotein complexes to specific sequences in DNA (Vire et al., 2006). The targeting of DNMTs is required not merely for initiating de novo methylation but also for the safeguarding of this pattern. Trans-acting repressors also target both histonemodifying enzymes and DNMTs to specific cis-acting signals. For example, the promyelocytic leukemia fusion protein PMLRAR engages HDACs and DNMTs to its target binding sequences and produces de novo DNA methylation of adjacent genes (Di Croce et al., 2002). There are also documented interactions between proteins, which read the DNA methylation and histone methylation marks and either histone or DNA modifying enzymes. The MBD MeCP2 interacts with the HMT SUV3–9 (Fuks et al., 2003), and in plants it was shown that Chromomethylase3 (CMT3), a plant CNG-specific DNMT, interacts with an Arabidopsis homolog of HP1, a protein that binds histone H3 methylated at Lysine-9 (Jackson et al., 2002). In summary, DNA methylation and histone modification are dynamically interrelated through protein–protein interactions between DNMT and MBDs, on the one hand, and histone-modifying enzymes and modified-histone binding proteins, on the other (see Figures 5.4 and 5.5 for a scheme). Chromatin Activation Leads to DNA Demethylation The relationship between DNA methylation and chromatin is bilateral in both directions of the DNA methylation equilibrium. Pharmacological histone acetylation using HDAC inhibitors leads to replication-independent demethylation in human cell lines (Cervoni and Szyf, 2001; Szyf et al., 1985) as well as in Neurospora crassa (Selker, 1998). TSA induces demethylation in non-replicating cells in the hippocampus of adult rats, illustrating the reversibility of the DNA methylation pattern even in somatic tissues (Weaver et al., 2004a, b). The relationship between histone acetylation and DNA demethylation could explain gene-targeted
demethylation (Figure 5.4). Transcription factors target HATs to specific genes, causing gene-specific acetylation, and thus facilitate their demethylation. For example, the intronic kappa chain enhancer and the transcription factor NF-kappaB are required for B-cell–specific demethylation of the kappa gene (Lichtenstein et al., 1994). The maize Suppressor-mutator (Spm) transposon demethylation is mediated by the transposon-encoded transcriptional activator TnpA protein (Bruniquel and Schwartz, 2003). The mechanism by which histone acetylation induces demethylation is still unknown. Increase in histone acetylation through triggering of cellular signaling pathways by environmental signals is a possible route for environmental signals to induce changes in the DNA methylation pattern. The partial overlap between HDAC inhibitors and DNA methylation inhibitors has to be considered when one uses these agents either therapeutically or for research purposes (see Figure 5.5 for a summary of the relationship between DNA and chromatin modifications and gene expression).
DNMTs Are Involved in Cellular Functions Independent of Their DNMT Activity Recent data suggest that DNMT proteins encode functional domains that enable them to interact with other cellular machineries, and thus DNMTs might exert their biological effects by methylation-independent mechanisms. DNMT1 was shown to have a number of distinct functional domains: a fork-targeting domain that targets it to the replication fork (Leonhardt et al., 1992), a stretch of amino acids that bind the replication fork protein PCNA (proliferating cell nuclear antigen) (Chuang et al., 1997), and domains that mediate the binding of DNMT1 to HDAC1 and 2 as well as the tumor-suppressor gene retinoblastoma (Rb) protein (Robertson et al., 2000). DNMT1 inhibitors were shown to induce expression of tumor-suppressor genes in a methylation-independent manner (Milutinovic et al., 2004). This possibility must also be considered when interpreting data from DNA methylation inhibition experiments.
DNMT Transcription repressor
HAT
dem
eth
CH3 CH3
CH3
Ac
TSA
Ac
Ac Ac Ac
HDAC
ylas
e
Transcription factor
HAT binding
CH3
CH3
Ac
Ac
X
demethylase
Figure 5.4 Gene activation by histone acetylation and DNA demethylation. An inactive gene is methylated and is associated with HDAC, DNMT, and hypoacetylated histones. Upon either pharmacological inhibition of HDACs with TSA or recruitment of HATs to the gene, the histones are acetylated. The “open” chromatin facilitates interaction of demethylases with the gene and DNA demethylation.
Epigenetics and Human Disease
MBD2AS SAM
demethylase
Ac
Ac
K9
K9 M
TR+ HDAC
■
67
INHAT M
DNMT
HAT
TSA
K9-
K9M
M
X
X
TA+ Histone dMTase?
Histone MTase K9-M
demethylase
K9-M M
X
DNMT
MG98 5-azaC
M
X
Figure 5.5 Epigenetic cycle. A gene could be found in different epigenetic states that are a balance of modification and demodification reactions. Methylation of DNA and chromatin modifications are interrelated. Ac, histone acetylation; horizontal arrow indicates transcription; empty circle, CG site in DNA; M in circle indicates methylated CG site; K9-M, methyl-K9-H3-histone; TR, transcriptional repressor; histone MTase, histone methyltransferase; HAT, histone acetyltransferase; INHAT, inhibitors of histone acetyltransferase; histone dMTase, histone demethylase; TA, transcription activator; MBD2 AS, antisense oligonucleotide inhibitor of MBD2 a putative demethylase; SAM, S-adenosylmethionine; MG98, antisense oligonucleotide inhibitor of DNMT; 5-azaC, a nucleoside inhibitor of DNA methylation.
EPIGENETICS AND HUMAN DISEASE Aberrant DNA Methylation and Chromatin Modification States in Cancer Aberrant epigenetic markings could have the same effect as genetic mutations, and thus they are candidates to be involved in any human disease caused by aberrant gene expression (for recent reviews see Feinberg [2007] and Gosden and Feinberg [2007]). Since epigenetic marks are potentially responsive to extracellular signals, they are candidates to mediate gene–environment interactions. Epigenetic changes are especially attractive candidates to mediate age-related disease, such as cancer, schizophrenia, type 2 diabetes, and asthma. Although the involvement of epigenetic changes in most disease is at this point in time still speculative, there are extensive data implicating DNA methylation and chromatin modification changes in cancer. The alterations in DNA methylation in cancer include a global loss in methylation concurrently with regional hypermethylation of specific genes (Feinberg and Vogelstein, 1983; Feinberg et al., 1988; Lu et al., 1983). Cancer progression involves concurrent silencing of genes such
as tumor-suppressor genes that block cancer growth, as well as activation of genes that promote metastasis such as certain proteases; the changes in DNA methylation appear to correspond with these changes in gene expression programming. Loss of methylation was observed in repetitive elements such as LINE1 (Chalitchagorn et al., 2004), in genes that promote the cancer state such as oncogenes (Feinberg and Vogelstein, 1983), as well as in pro-metastatic genes such as S100P (Sugimoto et al., 2004), S100A4 (Rosty et al., 2002), and uPA (Guo et al., 2002). Loss of methylation is associated with increased invasiveness in vitro (Figure 5.6). The regional hypermethylation observed in cancer is associated with changes in chromatin state and, as a consequence, with gene silencing. There is an increasing list of genes affected by hypermethylation that includes genes encoding tumor suppressors such as p16, RB, and BRCA1; adhesion molecules such as E CADHERIN; repair enzymes such as MGMT; and inhibitors of metastasis such as the different TIMPs (for review on tumorsuppressor gene methylation see Baylin et al. [2001]). The assortment of hypermethylated genes varies in different cancer types and at different cancer stages, creating a methylation profile that is unique for each cancer cell type. These methylation
68
CHAPTER 5
■
Epigenomics and Genomic Medicine
DNMT
DNA methylation
Demethylation
M M Tumor-suppressor genes
Metastasis genes
X Figure 5.6 DNA methylation changes in cancer. In cancer, tumor-suppressor genes are methylated and silenced, and metastatic genes are unmethylated and activated.
profiles are gaining increased importance in cancer diagnosis (Novik et al., 2002). In addition to changes in DNA methylation, the abundance of components of the DNA methylation machinery is altered in cancer. DNMT1, for example, was shown to be upregulated in many cancers by different oncogenic signaling pathways (Szyf, 2001). It is unclear whether the upregulation of DNMT1 is responsible for the hypermethylation of tumor-suppressor genes or whether aberrant targeting of DNNMT1 to specific genes by specific trans-acting factors such as the fusion protein PML-RAR causes gene-specific hypermethylation (Di Croce et al., 2002). DNMT1 might be promoting tumorigenesis by DNA methylationindependent mechanisms through protein–protein interactions with critical cellular regulatory proteins such as Rb, HDAC1, and PCNA as discussed above (Chuang et al., 1997; Milutinovic et al., 2004; Robertson et al., 2000). DNA Methylation and Late-Onset Human Disease Although cancer is the most well-established case for epigenetically driven human disease, there is some evidence suggesting involvement of epigenetic processes in other diseases. It stands to reason that epigenetic drift would lead to altered gene expression programming, which would lead to a pathological state. Good candidates are autoimmune diseases and inflammatory states such as asthma. It is by now well established that alteration in DNA methylation plays a role in the autoimmune disease Systemic Lupus Erythematosus (SLE) (Richardson, 2002). Drugs known to induce lupus such as procainamide and hydralazine induce also demethylation (Cornacchia et al., 1988; Deng et al., 2003; Scheinbart et al., 1991) and the demethylating agent 5-azaC induces lupus (Quddus et al., 1993). The genome of T cells from SLE patients is demethylated in comparison with T cells from control subjects (Balada et al., 2007). It is proposed that demethylation activates genes whose expression confers upon T cells self-reactivity. Interestingly, MBD2, which was implicated
in demethylation, is upregulated in T cells from Lupus patients (Balada et al., 2007). The involvement of DNA methylation in asthma was proposed (Kaminsky et al., 2006), and promoters of genes that encode cytokines involved in asthma were shown to be regulated by DNA methylation (Santangelo et al., 2002), but there is no evidence yet for aberrant DNA methylation in asthma. Similarly, it has been speculated that DNA methylation disregulation is involved in cardiovascular diseases (Corwin, 2004). There are some anecdotal data indicating involvement of DNA methylation in cardiovascular pathologies. Global hypermethylation was noted in leukocytes from chronic kidney disease, and hypermethylation was shown to associate with inflammation and increased mortality (Stenvinkel et al., 2007). There is some evidence for epigenetic dysregulation of estrogen receptor beta in atherosclerosis and vascular aging (Kim et al., 2007), and global hypomethylation was noted in atherosclerotic lesions (Hiltunen et al., 2002). These data justify further studies into the possible epigenetic basis of cardiovascular disease. If indeed epigenetic aberrations are found to be involved in cardiovascular diseases, this might result in new diagnostic, prophylactic, and therapeutic approaches. DNA methylation has also been proposed to be involved in insulin resistance and type 2 diabetes (Devaskar and Thamotharan, 2007). It is known that nutritional restriction during gestation is associated with development of insulin resistance and type 2 diabetes later in life (Devaskar and Thamotharan, 2007; Simmons, 2007). Uteroplacental insufficiency also causes global changes in methylation and histone acetylation in brain and liver in rodents (Ke et al., 2006; MacLennan et al., 2004). Although these studies did not establish a causal relationship between hypomethylation and type 2 diabetes, they are suggestive. Epigenetic mechanisms could provide a molecular link between early exposures and late onset of type 2 diabetes. The advent of techniques to perform whole-epigenome mapping will allow us to directly test the attractive hypothesis that epigenetic aberrations driven by different environmental exposures cause epigenetic aberrations that can lead to lateonset diseases such as type 2 diabetes or cardiovascular disease. There is emerging evidence that aberrations in DNA methylation are involved in mental health. Mutations in the MBD MECP2 lead to Rett syndrome (Amir et al., 1999; Wan et al., 1999), one of the most common causes of mental retardation in females. There are some data indicating aberrant methylation in late-onset mental pathologies, although it is unclear whether these changes in DNA methylation originated during embryogenesis or later in life as a response to an environmental exposure. The gene encoding REELIN, a protein involved in neuronal development and synaptogenesis, which is implicated in long-term memory, was found to be methylated in brains of schizophrenia patients. The methylation of REELIN was correlated with its reduced expression and increased DNMT1 expression in GABAergic neurons in the prefrontal cortex (Chen et al., 2002; Costa et al., 2002, 2003; Grayson et al., 2005; Veldic et al., 2007). Our preliminary data demonstrate hypermethylation of the GLUCOCORTICOID RECEPTOR exon 1f promoter and
Conclusions
its reduced expression in hippocampi of subjects who committed suicide, relative to age-matched control subjects (McGowan et al., unpublished data). The analysis of epigenetic aberration in mental disease is an active area of research, which is bound to produce exciting data in the coming years. Therapeutic Implications of Aberrant DNA Methylation Patterns in Cancer Epigenetic states, in contrast to genetic mutations, are potentially reversible by pharmacological agents (Figure 5.5). Thus, this field carries much promise for future novel therapeutics and prophylactics. A number of clinical trials are underway with catalytic inhibitors of DNMT1, antisense inhibitors of DNMT1, and HDAC inhibitors in different cancers, with the goal of activating silenced tumor-suppressor genes (Goffin and Eisenhauer, 2002; Szyf, 2003). It is anticipated that, as our understanding of other histone modifications is improved, inhibitors of HKMT and DNA demethylases will be used in cancer therapy as well. An important objective is to develop isotypic specific inhibitors, which would target the isotypes relevant to cancer and limit adverse effects. A critical issue here is to activate tumor-suppressor genes without activating pro-metastatic genes (Figure 5.6) and to identify the specific chromatin modification and DNA modification enzymes critical for tumor-suppressor genes. DNMT1 might promote cancer through DNA methylationindependent mechanisms in addition to a methylationdependent process (Szyf, 2001). Demethylation of DNA could trigger the induction of pro-metastatic genes as discussed above; it might be therefore ill advised to use catalytic inhibitors of DNMT1. A preferred approach is to target the methylationindependent activities of DNMT1 specifically. While the main focus in the field is on activating tumor-suppressor genes using DNA methylation and HDAC inhibitors, we suggest that it might be worthwhile to target the opposite process hypomethylation. Inhibitors of the DNA demethylation machinery might be utilized to silence pro-metastatic genes (Pakneshan et al., 2004). An important issue that needs to be determined in future studies is whether demethylation inhibitors would cause silencing of tumor-suppressor genes through hypermethylation, which could contraindicate their use. DNA methylation profiles are promising to turn into central diagnostic tools. We now know that not all tumors are methylated at the same sites. Each tumor type bears a “methylation profile,” a unique combination of methylated CGs, which has important predictive and diagnostic potential (Adorjan et al., 2002). The development of whole-genome high-density oligonucleotide arrays will allow in the future to map in detail the DNA methylation and chromatin changes associated with disease and behavioral abnormalities. These methylation profiles might offer an explanation for interindividual phenotypic diversity. Identifying target genes whose methylation and chromatin is changed in correlation with altered phenotypes will facilitate unraveling of the biological mechanisms behind these phenotypic differences.
■
69
The second area that carries great promise is the utilization of DNA methylation and epigenetic profiling in prophylactics. We need to map the response of DNA methylation to different environmental exposures such as social environment, food, and drugs. Epigenetics is a candidate to serve as the main mechanism used by the environment to stably reprogram genes. Understanding the relationship between the environment and disease-state–specific methylation profiles will allow charting prophylactic strategies to avoid deleterious exposure and to promote positive environmental interventions. One of the interesting pending questions in this respect is the issue of methyl-supplemented diets. Folate-fortified diets have been in place for some time. Folate-enriched diets might be helpful in protection from the global hypomethylation which is associated with cancer (Poirier, 2002). However, it is also possible that they might cause hypermethylation and silence critical genes such as tumor-suppressor genes. It is important to map the relationship between methyl supplementation and changes in DNA methylation profiles. Understanding how the environment utilizes epigenetics to sculpt the genome is perhaps one of the greatest challenges in biology, which would have a broad impact in medicine.
CONCLUSIONS Epigenetic mechanisms control gene expression programs and serve as an interface between the dynamic environment and the genome. Many of the proteins, which lay down the DNA methylation pattern as well as histone remodeling complexes and modifying enzymes, have been identified. Moreover, recent discoveries of chromatin demodification enzymes suggest that epigenetic processes are reversible and that the state of chromatin, as well as DNA methylation, is an equilibrium between modification and demodification processes. There is a bilateral relation between chromatin structure and demethylation, which is in part determined by protein–protein interactions between DNA methylation enzymes and chromatin modification enzymes. The establishment of new epigenetic patterns requires the targeting of DNA methylation and chromatin modification enzymes to specific genes, while maintenance of this pattern requires their continuous presence on these genes. Epigenetic proteins are targeted to specific sequences through interaction with specific trans-acting activators or repressors, which recognize specific cis-acting signals next to genes. Future experiments are required to understand the cellular signaling pathways leading to activation of these targeting factors and to determine whether they link physiological, environmental, or pathological signals with chromatin modification states and the DNA methylation pattern. One of the great challenges in the field is to define the mechanisms linking environment and the sculpting of the genes by epigenetic modifications. Epigenetic aberrations act similar to genetic mutations to precipitate pathological states. In cancer this is now well established. It stands to reason that similar processes are involved in other diseases such as mental disorders, cardiovascular disorders, and metabolic disorders. The main challenge now is to unravel
70
CHAPTER 5
■
Epigenomics and Genomic Medicine
the epigenetic code and identify the epigenetic profiles associated with human disease. Recent advances in whole-genome mapping using high-density oligonucleotide arrays and the use of 5-methylcytosine–specific antibodies, as well as chromatin immunoprecipitation (ChIP) with histone-modification–specific antibodies, will allow for mapping in unbiased way the state of methylation and chromatin modification of entire genomes and for identifying methylation profiles that are associated with disease. Since epigenomes vary from cell type to cell type and at different time points in life, this mapping effort is bound to be an enormous task that requires development of high-throughput methodologies. The whole-epigenome mapping efforts would comprehensively test for the first time the hypothesis that interindividual epigenetic differences are the basis for the well-known interindividual differences in appearance, behavior, and physiology. One provocative hypothesis that needs to be addressed by whole-epigenome mapping is that some of the environmentally-driven epigenetic changes are passed through the germline, thus explaining welldocumented trans-generational environmental effects. Since the epigenome is dynamic, it is potentially reversible. A further understanding of the DNA methylation and histone
modification machineries will allow us to determine whether we could intercede either pharmacologically or by behavioral and nutritional interventions to reverse deleterious epigenetic programming. We ought to link the main cellular signaling pathways, specific DNA methylation and chromatin modification enzymes, and DNA methylation profiles in order to design intelligent interventions that will have main positive effects in the absence of adverse outcomes. Epigenetics has made important headways and has advanced dramatically in the last few years; however, many methodological and conceptual challenges remain. If the promise of this field would be realized, it would revolutionize our understanding of biology, medical diagnosis, and therapeutics.
ACKNOWLEDGEMENTS The work from the author’s laboratory discussed in this chapter was funded by the National Cancer Institute of Canada, the Canadian Institute of Health Research, and the Canadian Institute of Advanced Research.
REFERENCES Adorjan, P. et al. (2002). Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Res 30, e21. Amir, R.E.,Van den Veyver, I.B., Wan, M., Tran, C.Q., Francke, U. and Zoghbi, H.Y. (1999). Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat Genet 23, 185–188. Balada, E., Ordi-Ros, J., Serrano-Acedo, S., Martinez-Lostao, L. and Vilardell-Tarres, M. (2007). Transcript overexpression of the MBD2 and MBD4 genes in CD4+ T cells from systemic lupus erythematosus patients. J Leukoc Biol 81, 1609–1616. Baylin, S.B., Esteller, M., Rountree, M.R., Bachman, K.E., Schuebel, K. and Herman, J.G. (2001). Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer. Hum Mol Genet 10, 687–692. Beck, S., Olek, A. and Walter, J. (1999). From genomics to epigenomics: A loftier view of life. Nat Biotechnol 17, 1144. Bestor, T.H. (1990). DNA methylation: Evolution of a bacterial immune function into a regulator of gene expression and genome structure in higher eukaryotes. Philos Trans R Soc Lond B Biol Sci 326, 179–187. Bhattacharya, S.K., Ramchandani, S., Cervoni, N. and Szyf , M. (1999). A mammalian protein with specific demethylase activity for mCpG DNA. Nature 397, 579–583. Borrow, J. et al. (1996). The translocation t(8;16)(p11;p13) of acute myeloid leukaemia fuses a putative acetyltransferase to the CREBbinding protein. Nat Genet 14, 33–41. Bostick, M., Kim, J.K., Esteve, P.O., Clark, A., Pradhan, S. and Jacobsen, S.E. (2007). UHRF1 plays a role in maintaining DNA methylation in mammalian cells. Science 317, 1760–1764.
Bourc’his, D., Xu, G.L., Lin, C.S., Bollman, B. and Bestor, T.H. (2001). Dnmt3L and the establishment of maternal genomic imprints. Science 294, 2536–2539. Bruniquel, D. and Schwartz, R.H. (2003). Selective, stable demethylation of the interleukin-2 gene enhances transcription by an active process. Nat Immunol 4, 235–240. Bultman, S.J., Gebuhr, T.C. and Magnuson, T. (2005). A Brg1 mutation that uncouples ATPase activity from chromatin remodeling reveals an essential role for SWI/SNF-related complexes in beta-globin expression and erythroid development. Genes Dev 19, 2849–2861. Cervoni, N. and Szyf, M. (2001). Demethylase activity is directed by histone acetylation. J Biol Chem 276, 40778–40787. Chalitchagorn, K., Shuangshoti, S., Hourpai, N., Kongruttanachok, N., Tangkijvanich, P., Thong-ngam, D., Voravud, N., Sriuranpong, V. and Mutirangura, A. (2004). Distinctive pattern of LINE-1 methylation level in normal tissues and the association with carcinogenesis. Oncogene 23, 8841–8846. Chen, Y., Sharma, R.P., Costa, R.H., Costa, E. and Grayson, D.R. (2002). On the epigenetic regulation of the human reelin promoter. Nucleic Acids Res 30, 2930–2939. Cheng, X., Collins, R.E. and Zhang, X. (2005). Structural and sequence motifs of protein (histone) methylation enzymes. Annu Rev Biophys Biomol Struct 34, 267–294. Chuang, L.S., Ian, H.I., Koh, T.W., Ng, H.H., Xu, G. and Li, B.F. (1997). Human DNA-(cytosine-5) methyltransferase-PCNA complex as a target for p21WAF1. Science 277, 1996–2000. Cornacchia, E., Golbus, J., Maybaum, J., Strahler, J., Hanash, S. and Richardson, B. (1988). Hydralazine and procainamide inhibit T cell DNA methylation and induce autoreactivity. J Immunol 140, 2197–2200.
References
Corwin, E.J. (2004). The concept of epigenetics and its role in the development of cardiovascular disease: Commentary on “new and emerging theories of cardiovascular disease”. Biol Res Nurs 6, 11–16. Discussion 21–23. Costa, E., Chen,Y., Davis, J., Dong, E., Noh, J.S.,Tremolizzo, L.,Veldic, M., Grayson, D.R. and Guidotti, A. (2002). REELIN and schizophrenia: A disease at the interface of the genome and the epigenome. Mol Interv 2, 47–57. Costa, E., Grayson, D.R., Mitchell, C.P., Tremolizzo, L., Veldic, M. and Guidotti, A. (2003). GABAergic cortical neuron chromatin as a putative target to treat schizophrenia vulnerability. Crit Rev Neurobiol 15, 121–142. Deng, C., Lu, Q., Zhang, Z., Rao, T., Attwood, J., Yung, R. and Richardson, B. (2003). Hydralazine may induce autoimmunity by inhibiting extracellular signal-regulated kinase pathway signaling. Arthritis Rheum 48, 746–756. Devaskar, S.U. and Thamotharan, M. (2007). Metabolic programming in the pathogenesis of insulin resistance. Rev Endocr Metab Disord 8, 105–113. Di Croce, L. et al. (2002). Methyltransferase recruitment and DNA hypermethylation of target promoters by an oncogenic transcription factor. Science 295, 1079–1082. Feinberg, A.P. (2007). Phenotypic plasticity and the epigenetics of human disease. Nature 447, 433–440. Feinberg, A.P. and Vogelstein, B. (1983). Hypomethylation of ras oncogenes in primary human cancers. Biochem Biophys Res Commun 111, 47–54. Feinberg, A.P., Gehrke, C.W., Kuo, K.C. and Ehrlich, M. (1988). Reduced genomic 5-methylcytosine content in human colonic neoplasia. Cancer Res 48, 1159–1161. Filetici, P., P, O. and Ballario, P. (2001). The bromodomain: A chromatin browser?. Front Biosci 6, D866–D876. Finch, J.T., Lutter, L.C., Rhodes, D., Brown, R.S., Rushton, B., Levitt, M. and Klug, A. (1977). Structure of nucleosome core particles of chromatin. Nature 269, 29–36. Fraga, M.F. et al. (2005). Epigenetic differences arise during the life-time of monozygotic twins. Proc Natl Acad Sci USA 102, 10604–10609. Fuks, F., Burgers, W.A., Brehm, A., Hughes-Davies, L. and Kouzarides, T. (2000). DNA methyltransferase Dnmt1 associates with histone deacetylase activity. Nat Genet 24, 88–91. Fuks, F., Hurd, P.J., Wolf , D., Nan, X., Bird, A.P. and Kouzarides, T. (2003). The methyl-CpG-binding protein MeCP2 links DNA methylation to histone methylation. J Biol Chem 278, 4035–4040. Gibbons, R.J. (2005). Histone modifying and chromatin remodelling enzymes in cancer and dysplastic syndromes. Hum Mol Genet 14(Spec No 1), R85–R92. Goffin, J. and Eisenhauer, E. (2002). DNA methyltransferase inhibitorsstate of the art. Ann Oncol 13, 1699–1716. Gosden, R.G. and Feinberg, A.P. (2007). Genetics and epigenetics– nature’s pen-and-pencil set. N Engl J Med 356, 731–733. Grayson, D.R., Jia, X., Chen,Y., Sharma, R.P., Mitchell, C.P., Guidotti, A. and Costa, E. (2005). Reelin promoter hypermethylation in schizophrenia. Proc Natl Acad Sci USA 102, 9341–9346. Guo, Y., Pakneshan, P., Gladu, J., Slack, A., Szyf, M. and Rabbani, S.A. (2002). Regulation of DNA methylation in human breast cancer. Effect on the urokinase-type plasminogen activator gene production and tumor invasion. J Biol Chem 277, 41571–41579. Henikoff , S., McKittrick, E. and Ahmad, K. (2004). Epigenetics, histone H3 variants, and the inheritance of chromatin states. Cold Spring Harb Symp Quant Biol 69, 235–243.
■
71
Hiltunen, M.O.,Turunen, M.P., Hakkinen,T.P., Rutanen, J., Hedman, M., Makinen, K., Turunen, A.M., Aalto-Setala, K. and Yla-Herttuala, S. (2002). DNA hypomethylation and methyltransferase expression in atherosclerotic lesions. Vasc Med 7, 5–11. Holbert, M.A. and Marmorstein, R. (2005). Structure and activity of enzymes that remove histone modifications. Curr Opin Struct Biol 15, 673–680. Jenuwein, T. (2001). Re-SET-ting heterochromatin by histone methyltransferases. Trends Cell Biol 11, 266–273. Jenuwein, T. and Allis, C.D. (2001). Translating the histone code. Science 293, 1074–1080. Jost, J.P. (1993). Nuclear extracts of chicken embryos promote an active demethylation of DNA by excision repair of 5-methyldeoxycytidine. Proc Natl Acad Sci USA 90, 4684–4688. Kaminsky, Z., Wang, S.C. and Petronis, A. (2006). Complex disease, gender and epigenetics. Ann Med 38, 530–544. Ke, X. et al. (2006). Uteroplacental insufficiency affects epigenetic determinants of chromatin structure in brains of neonatal and juvenile IUGR rats. Physiol Genomics 25, 16–28. Kersh, E.N., Fitzpatrick, D.R., Murali-Krishna, K., Shires, J., Speck, S.H., Boss, J.M. and Ahmed, R. (2006). Rapid demethylation of the IFN-gamma gene occurs in memory but not naive CD8 T cells. J Immunol 176, 4083–4093. Kim, J., Kim, J.Y., Song, K.S., Lee, Y.H., Seo, J.S., Jelinek, J., Goldschmidt-Clermont, P.J. and Issa, J.P. (2007). Epigenetic changes in estrogen receptor beta gene in atherosclerotic cardiovascular tissues and in-vitro vascular senescence. Biochim Biophys Acta 1772, 72–80. Kuo, M.H. and Allis, C.D. (1998). Roles of histone acetyltransferases and deacetylases in gene regulation. Bioessays 20, 615–626. Lachner, M. and Jenuwein, T. (2002). The many faces of histone lysine methylation. Curr Opin Cell Biol 14, 286–298. Lachner, M., O’Carroll, D., Rea, S., Mechtler, K. and Jenuwein, T. (2001). Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature 410, 116–120. Lee, M.G., Wynder, C., Cooch, N. and Shiekhattar, R. (2005). An essential role for CoREST in nucleosomal histone 3 lysine 4 demethylation. Nature 437, 432–435. Leonhardt, H., Page, A.W., Weier, H.U. and Bestor, T.H. (1992). A targeting sequence directs DNA methyltransferase to sites of DNA replication in mammalian nuclei. Cell 71, 865–873. Li, E., Bestor, T.H. and Jaenisch, R. (1992). Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell 69, 915–926. Lichtenstein, M., Keini, G., Cedar, H. and Bergman, Y. (1994). B cellspecific demethylation: A novel role for the intronic kappa chain enhancer sequence. Cell 76, 913–923. Lu, L.J., Randerath, E. and Randerath, K. (1983). DNA hypomethylation in Morris hepatomas. Cancer Lett 19, 231–239. Lucarelli, M., Fuso, A., Strom, R. and Scarpa, S. (2001). The dynamics of myogenin site-specific demethylation is strongly correlated with its expression and with muscle differentiation. J Biol Chem 276, 7500–7506. Lyko, F., Ramsahoye, B.H. and Jaenisch, R. (2000). DNA methylation in Drosophila melanogaster. Nature 408, 538–540. MacLennan, N.K., James, S.J., Melnyk, S., Piroozi, A., Jernigan, S., Hsu, J.L., Janke, S.M., Pham, T.D. and Lane, R.H. (2004). Uteroplacental insufficiency alters DNA methylation, one-carbon metabolism, and histone acetylation in IUGR rats. Physiol Genomics 18, 43–50.
72
CHAPTER 5
■
Epigenomics and Genomic Medicine
McClelland, M., Nelson, M. and Raschke, E. (1994). Effect of sitespecific modification on restriction endonucleases and DNA modification methyltransferases. Nucleic Acids Res 22, 3640–3659. Meaney, M.J. and Szyf , M. (2005). Maternal care as a model for experience-dependent chromatin plasticity?. Trends Neurosci 28, 456–463. Metzger, E., Wissmann, M.,Yin, N., Muller, J.M., Schneider, R., Peters, A.H., Gunther, T., Buettner, R. and Schule, R. (2005). LSD1 demethylates repressive histone marks to promote androgenreceptor-dependent transcription. Nature 437, 436–439. Milutinovic, S., Zhuang, Q., Niveleau, A. and Szyf, M. (2003). Epigenomic stress response. Knockdown of DNA methyltransferase 1 triggers an intra-S-phase arrest of DNA replication and induction of stress response genes. J Biol Chem 278, 14985–14995. Milutinovic, S., Brown, S.E., Zhuang, Q. and Szyf , M. (2004). DNA methyltransferase 1 knock down induces gene expression by a mechanism independent of DNA methylation and histone deacetylation. J Biol Chem 279, 27915–27927. Mohandas, T., Sparkes, R.S. and Shapiro, L.J. (1981). Reactivation of an inactive human X chromosome: Evidence for X inactivation by DNA methylation. Science 211, 393–396. Muchardt, C. and Yaniv, M. (1999). ATP-dependent chromatin remodelling: SWI/SNF and Co. are on the job. J Mol Biol 293, 187–198. Nan, X., Campoy, F.J. and Bird, A. (1997). MeCP2 is a transcriptional repressor with abundant binding sites in genomic chromatin. Cell 88, 471–481. Ng, H.H., Zhang, Y., Hendrich, B., Johnson, C.A., Turner, B.M., Erdjument-Bromage, H., Tempst, P., Reinberg, D. and Bird, A. (1999). MBD2 is a transcriptional repressor belonging to the MeCP1 histone deacetylase complex. Nat Genet 23, 58–61. Novik, K.L., Nimmrich, I., Genc, B., Maier, S., Piepenbrock, C., Olek, A. and Beck, S. (2002). Epigenomics: genome-wide study of methylation phenomena. Curr Issues Mol Biol 4, 111–128. Okano, M., Xie, S. and Li, E. (1998). Cloning and characterization of a family of novel mammalian DNA (cytosine-5) methyltransferases [letter]. Nat Genet 19, 219–220. Okano, M., Bell, D.W., Haber, D.A. and Li, E. (1999). DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247–257. Pakneshan, P., Szyf , M., Farias-Eisner, R. and Rabbani, S.A. (2004). Reversal of the hypomethylation status of urokinase (uPA) promoter blocks breast cancer growth and metastasis. J Biol Chem 279, 31735–31744. Peters, A.H. and Schubeler, D. (2005). Methylation of histones: playing memory with DNA. Curr Opin Cell Biol 17, 230–238. Poirier, L.A. (2002). The effects of diet, genetics and chemicals on toxicity and aberrant DNA methylation: an introduction. J Nutr 132, 2336S–2339S. Prendergast, G.C. and Ziff , E.B. (1991). Methylation-sensitive sequencespecific DNA binding by the c-Myc basic region. Science 251, 186–189. Pruss, D., Hayes, J.J. and Wolffe, A.P. (1995). Nucleosomal anatomy– where are the histones?. Bioessays 17, 161–170. Pusarla, R.H. and Bhargava, P. (2005). Histones in functional diversification. Core histone variants. Febs J 272, 5149–5168. Quddus, J., Johnson, K.J., Gavalchin, J., Amento, E.P., Chrisp, C.E., Yung, R.L. and Richardson, B.C. (1993). Treating activated CD4+ T cells with either of two distinct DNA methyltransferase inhibitors, 5-azacytidine or procainamide, is sufficient to cause a lupuslike disease in syngeneic mice. J Clin Invest 92, 38–53.
Razin, A. (1998). CpG methylation, chromatin structure and gene silencing – A three-way connection. Embo J 17, 4905–4908. Razin, A. and Cedar, H. (1977). Distribution of 5-methylcytosine in chromatin. Proc Natl Acad Sci USA 74, 2725–2728. Razin, A. and Riggs, A.D. (1980). DNA methylation and gene function. Science 210, 604–610. Rea, S. et al. (2000). Regulation of chromatin structure by site-specific histone H3 methyltransferases. Nature 406, 593–599. Richardson, B.C. (2002). Role of DNA methylation in the regulation of cell function: autoimmunity, aging and cancer. J Nutr 132, 2401S–2405S. Robertson, K.D., Ait-Si-Ali, S., Yokochi, T., Wade, P.A., Jones, P.L. and Wolffe, A.P. (2000). DNMT1 forms a complex with Rb, E2F1 and HDAC1 and represses transcription from E2F-responsive promoters. Nat Genet 25, 338–342. Rosty, C., Ueki, T., Argani, P., Jansen, M., Yeo, C.J., Cameron, J.L., Hruban, R.H. and Goggins, M. (2002). Overexpression of S100A4 in pancreatic ductal adenocarcinomas is associated with poor differentiation and DNA hypomethylation. Am J Pathol 160, 45–50. Rountree, M.R., Bachman, K.E. and Baylin, S.B. (2000). DNMT1 binds HDAC2 and a new co-repressor, DMAP1, to form a complex at replication foci. Nat Genet 25, 269–277. Santangelo, S., Cousins, D.J., Winkelmann, N.E. and Staynov, D.Z. (2002). DNA methylation changes at human Th2 cytokine genes coincide with DNase I hypersensitive site formation during CD4(+) T cell differentiation. J Immunol 169, 1893–1903. Santos-Rosa, H., Schneider, R., Bannister, A.J., Sherriff, J., Bernstein, B.E., Emre, N.C., Schreiber, S.L., Mellor, J. and Kouzarides,T. (2002). Active genes are tri-methylated at K4 of histone H3. Nature 419, 407–411. Sarma, K. and Reinberg, D. (2005). Histone variants meet their match. Nat Rev Mol Cell Biol 6, 139–149. Scheinbart, L.S., Johnson, M.A., Gross, L.A., Edelstein, S.R. and Richardson, B.C. (1991). Procainamide inhibits DNA methyltransferase in a human T cell line. J Rheumatol 18, 530–534. Selker, E.U. (1998). Trichostatin A causes selective loss of DNA methylation in Neurospora. Proc Natl Acad Sci USA 95, 9430–9435. Shi, Y., Lan, F., Matson, C., Mulligan, P., Whetstine, J.R., Cole, P.A., Casero, R.A. and Shi, Y. (2004). Histone demethylation mediated by the nuclear amine oxidase homolog LSD1. Cell 119, 941–953. Shilatifard, A. (2006). Chromatin modifications by methylation and ubiquitination: Implications in the regulation of gene expression. Annu Rev Biochem 75, 243–269. Simmons, R.A. (2007). Developmental origins of beta-cell failure in type 2 diabetes: The role of epigenetic mechanisms. Pediatr Res 61, 64R–67R. Stenvinkel, P. et al. (2007). Impact of inflammation on epigenetic DNA methylation – A novel risk factor for cardiovascular disease?. J Intern Med 261, 488–499. Strahl, B.D. and Allis, C.D. (2000). The language of covalent histone modifications. Nature 403, 41–45. Sugimoto, S. et al. (2004). Expression and regulation of tumor suppressor gene maspin in human bladder cancer. Cancer Lett 203, 209–215. Swain, J.L., Stewart, T.A. and Leder, P. (1987). Parental legacy determines methylation and expression of an autosomal transgene: A molecular mechanism for parental imprinting. Cell 50, 719–727. Szyf , M. (2001). Towards a pharmacology of DNA methylation. Trends Pharmacol Sci 22, 350–354. Szyf, M. (2003). Targeting DNA methylation in cancer. Ageing Res Rev 2, 299–328.
References
Szyf , M., Eliasson, L., Mann, V., Klein, G. and Razin, A. (1985). Cellular and viral DNA hypomethylation associated with induction of Epstein-Barr virus lytic cycle. Proc Natl Acad Sci USA 82, 8090–8094. Tsukada, Y., Fang, J., Erdjument-Bromage, H., Warren, M.E., Borchers, C.H., Tempst, P. and Zhang, Y. (2006). Histone demethylation by a family of JmjC domain-containing proteins. Nature 439, 811–816. Varga-Weisz, P.D. and Becker, P.B. (2006). Regulation of higher-order chromatin structures by nucleosome-remodelling factors. Curr Opin Genet Dev 16, 151–156. Veldic, M., Kadriu, B., Maloku, E., Agis-Balboa, R.C., Guidotti, A., Davis, J.M. and Costa, E. (2007). Epigenetic mechanisms expressed in basal ganglia GABAergic neurons differentiate schizophrenia from bipolar disorder. Schizophr Res 93, 33–41. Verdone, L., Caserta, M. and Di Mauro, E. (2005). Role of histone acetylation in the control of gene expression. Biochem Cell Biol 83, 344–353. Vilain, A., Apiou, F., Dutrillaux, B. and Malfoy, B. (1998). Assignment of candidate DNA methyltransferase gene (DNMT2) to human chromosome band 10p15.1 by in situ hybridization. Cytogenet Cell Genet 82, 120. Vire, E. et al. (2006). The Polycomb group protein EZH2 directly controls DNA methylation. Nature 439, 871–874. Wade, P.A., Pruss, D. and Wolffe, A.P. (1997). Histone acetylation: Chromatin in action. Trends Biochem Sci 22, 128–132. Wan, M. et al. (1999). Rett syndrome and beyond: recurrent spontaneous and familial MECP2 mutations at CpG hotspots. Am J Hum Genet 65, 1520–1529.
■
73
Weaver, I.C., Cervoni, N., Champagne, F.A., D’Alessio, A.C., Sharma, S., Seckl, J.R., Dymov, S., Szyf, M. and Meaney, M.J. (2004a). Epigenetic programming by maternal behavior. Nat Neurosci 7, 847–854. Weaver, I.C., Diorio, J., Seckl, J.R., Szyf, M. and Meaney, M.J. (2004b). Early environmental regulation of hippocampal glucocorticoid receptor gene expression: Characterization of intracellular mediators and potential genomic target sites. Ann N Y Acad Sci 1024, 182–212. Weaver, I.C., Champagne, F.A., Brown, S.E., Dymov, S., Sharma, S., Meaney, M.J. and Szyf, M. (2005). Reversal of maternal programming of stress responses in adult offspring through methyl supplementation: Altering epigenetic marking later in life. J Neurosci 25, 11045–11054. Wolffe, A.P. and Hayes, J.J. (1999). Chromatin disruption and modification. Nucleic Acids Res 27, 711–720. Zhang, H., Roberts, D.N. and Cairns, B.R. (2005). Genome-wide dynamics of Htz1, a histone H2A variant that poises repressed/ basal promoters for activation through histone loss. Cell 123, 219–231. Zhang, Y., Ng, H.H., Erdjument-Bromage, H., Tempst, P., Bird, A. and Reinberg, D. (1999). Analysis of the NuRD subunits reveals a histone deacetylase core complex and a connection with DNA methylation. Genes Dev 13, 1924–1935. Zhu, B., Zheng, Y., Hess, D., Angliker, H., Schwarz, S., Siegmann, M., Thiry, S. and Jost, J.P. (2000). 5-methylcytosine-DNA glycosylase activity is present in a cloned G/T mismatch DNA glycosylase associated with the chicken embryo DNA demethylation complex. Proc Natl Acad Sci USA 97, 5135–5139.
CHAPTER
6 Systems Biology and the Emergence of Systems Medicine Nathan D. Price, Lucas B. Edelman, Inyoul Lee, Hyuntae Yoo, Daehee Hwang, George Carlson, David J. Galas, James R. Heath and Leroy Hood
INTRODUCTION A new approach to biology, which we call systems biology, has emerged over the past 10 years or so – an approach that looks at biology as an informational science, studies biological systems as a whole, and recognizes that biological information in living systems is captured, transmitted, modulated, and integrated by biological networks that pass this information to molecular machines for execution. This approach differs from early “systems approaches to biology” in that it attempts both a bottomup approach (from large molecular datasets) and a top-down approach (from computational modeling and simulations) where there is an attempt to trace complex observations of phenotype back to the digital core encoded in the genome. New measurement and visualization technologies, together with powerful computational and modeling tools, have transformed systems biology by making it possible for the first time to execute the five uniquely defining features of contemporary systems biology: 1. measurements of the various types of biological information that are global or comprehensive to the greatest extent possible (e.g., measure the digital code of all genes or, for example, the concentrations of all mRNAs, all proteins, all metabolites, etc.); Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 74
2. the different levels of information (DNA, RNA, protein, etc.) must be integrated to understand or capture how the environment has modified the basic digital information of the genome at each level of the biological information hierarchy (DNA, RNA, proteins, interactions, biological networks, cells, individuals and ecologies) and thereby to induce biological responses; 3. all biological systems (e.g., networks) must be studied dynamically as they capture, transmit, integrate and utilize the biological information necessary for the execution the two most fundamental responses of living organisms – development or physiological responses to the organism’s environment; 4. all measurements must be quantitatively determined to the greatest extent possible; and 5. the global and dynamic data from the variety of information hierarchies must be integrated and modeled. These models essentially create hypotheses about biological functions and mechanisms of disease that can then be tested experimentally by systems perturbations – the enumeration of new datasets, their integration, modifications of models, in an iterative manner, until the working models reflect the reality of the experimental data. Once the models accurately fit the biological systems measurements, they can then be used to predict the results of perturbing the system in new ways (e.g., for Copyright © 2009, Elsevier Inc. All rights reserved.
Systems Science in Biology and Medicine
designing treatment strategies or for identifying likely underlying causes for disease). The information revolution in biology will enable systems medicine to emerge in the next 10 years. Key to this transformation will be harnessing computationally the vast amount of biological information becoming available through rapid advancement in measurement, visualization and computational technologies. Two fundamental types of biological information, the digital information of the genome and interacting information from the environment, are integrated together to specify the five fundamental mechanisms of life – evolution, development, physiological responses, aging and the onset and progression of disease (Hood et al., 2004). Advances in high-throughput DNA sequencing will enable each person’s genome to be sequenced rapidly and at a reasonable cost, providing digital information for each person and making possible the prediction of increasingly accurate probabilistic health futures. Also, the advance of highthroughput measurement technologies will enable the assessment of dynamic environmental information emerging from the integration of genome and environmental information from each individual, as reflected, for example, by the changing levels of proteins in the blood – thus providing a real time (current) health assessment of the individual. These technologies will generate tremendously large, dynamical datasets about health states and hence about the states of relevant biological networks of the individual patients. This detailed information will arise through the use of perturbation experiments coupled with global experimental technologies, and progress in systems biology research is increasingly elucidating the functions and structures of these networks. A systems approach to medicine argues that disease arises from disease-perturbed biological networks and that the dynamically changing, altered patterns of gene expression that are controlled by these perturbed networks give rise to the disease manifestations. Here, we present a systems view of biology and disease together with a discussion of some recent advances in state-of-the-art in vitro and in vivo diagnostics technologies, and we suggest how, as these technologies mature, they will move us towards a future of predictive, personalized, preventive, and participatory medicine (P4 medicine) (Hood et al., 2004).
SYSTEMS SCIENCE IN BIOLOGY AND MEDICINE Systems Biology How do systems approaches to biology and medicine lead to a revolutionary view of medicine? We present this systems view in some detail, because an understanding of P4 medicine and its implications for society are predicated upon understanding these systems principles. Two primary domains of biological information lend themselves readily to such systems-level analysis: the static, digital information of the genome, and the dynamic information arising from environmental interactions with the subcellular, cellular, and tissue levels of organization (Hood et al.,
■
75
2004). Digital genome information encodes two types of biological networks – protein interactions and gene regulatory networks. Protein networks transmit and use biological information for development, physiology and metabolism. Gene regulatory networks – transcription factors, their regulatory binding sites and the small RNAs that regulate networks of other transcription factors and other RNAs interacting with one another – receive information from signal-transduction networks, integrate and modulate it, and convey the processed information to networks of genes or molecular machines that execute developmental and physiological functions. In biological systems, these two types of networks are closely integrated. The organization of these networks can be inferred from various types of measurements including, for example, global measurements of dynamically changing levels of mRNAs and proteins during developmental and physiological responses, as well as large-scale measurements of protein–protein and protein–DNA interactions. There are multiple hierarchical levels of organization and information (for example, DNA, RNA and protein networks, cellular and metabolic networks, and organization and responses of organ systems). To understand biological systems, information must be gathered from as many information levels as possible and integrated into models that generate hypotheses about how biology works. Let us now consider the logic of systems approaches to medicine. Systems Medicine The central premise of systems medicine is that diseaseperturbed networks alter patterns of expression in genes and proteins – and that these altered patterns encode the dynamic pathophysiology of the disease and necessarily results in altered molecular fingerprints that can be detected clinically. Through advances in measurement technologies and computational analysis tools for tissue and blood analyses, we will be able to read these signals to make a multitude of diagnoses to distinguish health from disease and to determine the nature of any pathology. Multi-parameter analyses of biological information will be key to tracking these altered patterns in disease-perturbed networks. Diagnostic methods of the past have been pauciparameter in nature – usually measuring just a single parameter relevant to a specific disease state (e.g., PSA levels to assess prostate health), and so our ability to track health and disease has been limited. Current and emerging technologies are creating a transition into a new era of predictive, preventative, and personalized medicine. A causal disease perturbation could be the result of specific, disease-causing DNA mutations, pathogenic organisms, or other pathological environmental factors such as toxins. Molecular fingerprints of pathological processes can take on many molecular forms, including the analyses of proteins (Wang et al., 2005a, b), DNA (Papadopoulou et al., 2006), RNA (Scherzer et al., 2007), and metabolites (Solanky et al., 2003), as well as informative, post-translational modifications to these molecules such as protein phosphorylation. Signals related to health and disease can be found in multiple sites. For instance, many bodily fluids such as the blood, urine, saliva, cerebral spinal fluid and so forth can be sampled to
76
CHAPTER 6
■
Systems Biology and the Emergence of Systems Medicine
identify evidence of perturbed molecular fingerprints reflecting the altered expression patterns of genes and proteins in diseaseperturbed biological networks. Of these, the blood is likely the most information rich organ (or fluid) in that it bathes all of the tissues in the body, and it is easily accessible for diagnostic procedures. In addition to the biomolecules secreted into the blood from cells and tissues throughout the body, the transcriptomes and proteomes of cells circulating in the blood (e.g., white blood cells) are also potentially an abundant source of biomedically important information (Buttner et al., 2007). Thus, the amount of information available in the blood about health and disease is enormous if we learn to read and interpret the molecular signals. As an example of an environmental disease perturbation, we have studied the dynamic onset of the infectious prion disease in several strains of mice at the level of mRNA in the brain (the affected organ) and showed that a series of interlocking protein networks that surround the prion protein are significantly perturbed across the 150-day span from disease initiation to death. Figure 6.1 shows differential networks that were derived from comparisons of mRNA-expression patterns in normal and diseased brains at each of three time points after infection. They involve 67 proteins in the prion replication and accumulation network. The initial network changes occur well before the clinical signs of the disease can be detected and predict later widespread histopathological events. These dynamically changing, disease-perturbed networks lead to two important conclusions. First, some significant network nodal points change before the related clinical or histological changes are evident. Therefore, labeled biomarkers that are specific for the changing nodes or the biological processes they regulate could be used for in vivo imaging diagnostics even before symptoms arise, as has already been shown in patients (Golub et al., 1999; Quackenbush, 2006; Ramaswamy et al., 2001;Wang et al., 2005a, b). Alternatively, if some of these altered nodes encode secreted proteins, they could provide readily accessible in vitro diagnostic blood markers for early disease detection. Second, many of the sub-networks of proteins that change during the onset of disease affect changes in phenotypic traits that are consistent with the pathology of the disease. About 900 perturbed mRNAs appear to encode the core prion-disease response, and, during the progression of the priondisease process, several hundred undergo significant gene expression changes well before the clinical signs of prion disease. Many of these potential early disease “sentinels” are predicted to be secreted into the blood and therefore represent potential protein biomarkers for early disease diagnosis through blood protein analysis. One interesting new diagnostic method evaluates relative expression reversals in protein concentrations or gene expression levels. This procedure eliminates the need for data normalization or the establishment of population-wide thresholds. This approach has been successfully used to identify robust and accurate classifiers for prostate cancer (Xu et al., 2005), sarcomas (Price et al., 2007), and a variety of other cancers (Tan et al., 2005), as well as to predict treatment outcomes in breast
cancer (Ma et al., 2004), demonstrating the efficacy of even relatively low-order systems analysis in medicine. Another key issue that needs to be addressed is how to best decouple two primary dimensions of disease information: namely, disease stratification (e.g., which type of prostate cancer is present) and disease progression (e.g., stage of development of a particular prostate cancer stratification in time). Molecular signature changes can indicate the presence of different diseases (stratification) as well as their stage (progression). Thus, a key challenge in future will be to develop analysis tools that will enable us to differentiate the location of the physiologic state in regards to both of these critical clinical dimensions.
MULTI-PARAMETER BLOOD-BOURNE BIOMARKERS In principle, it is clear how blood samples containing secreted molecules from all of the tissues in the body can be used as a window into health or disease states (Anderson and Anderson, 2002; Fujii et al., 2004; Hood et al., 2004; Lathrop et al., 2003; Lee et al., 2006). In practice, however, the task of identifying markers of disease states in the vast array of secreted proteins can seem daunting. There are millions of different proteins present in the blood and they are expressed at levels that probably differ by 12 orders of magnitude (1012). Thus, there are indeed very significant challenges associated with defining relevant biomarkers in the plasma proteome, as is evidenced by the paucity of blood protein markers found thus far, despite significant efforts (Wilson, 2006). This fact highlights the need to develop wellfounded systems approaches to diagnostics in order to hasten the identification of such markers and to harness the tremendous information potential of the blood. While the potential information available to diagnose heath and disease is enormous, it is also true that the challenges of separating signal from background noise (biological or measurement noise) are also very significant. Sources of noise include error in measurements, polymorphisms in the population, environmental variations, stochastic variations, as well as signal dilution through mixing and other processes of molecule transport. Thus, enhancing signal while reducing noise will be a primary theme of research in predictive medicine for the foreseeable future. An important strategy for dealing with noise arising from genetic polymorphisms and health histories in the population are through dynamic, subtractive analyses carried out in the individual patient. In other words, measurements taken from individual patients at different time points (longitudinal data) can be used to perform subtractive analyses where only the differences observed are considered. Thus, each patient becomes their own control, which eliminates many of the sources of noise. Eventually, it will be important for each patient to have biannual blood analyses taken so that changes can be observed relative to the background of what is normal in each individual rather than relative to the population at large. Such databases of measurements will be essential to enabling personalized and predictive
22 wks
18 wks Clinical Signs
No Clinical Signs
PRND
LAMAL LAMRI
STIP
PRND
FCGRL
CSNK2A1
LRP2 A2M
LRP1 LRPAP1
C3
FYN
GM2A
LYN
SHC1
SHC1
GFAP
CTSB
YWHAZ VCL
ANXA2 CTSH
CST3 S100ALO
MEN
PPL
CTSD STAT3
PDGFRA
ESR1 NAPK1
STAT1
CEBPB
MEN
PPL
CTSD
MAPK1
STAT1
LAG3
CEBPB
CREBBP HDAC1
TRP53
ITGAV
2 wks
JUN
GSN PPL
CTSD STAT3
PDGFRA
ESR1
TGAV
MAPK1
STATL
CEBPB
CREBBP HDAC1
YWHAZ
VIM
MEN
LAG3 HDAC1
TRP53 RELA
RELA NDTCH1
GFAP
VCL
CTS3 SL00AL0
LAG3
TRP53
RELA JUN
PDGFRA
ESR1
STAT3
PSEN1 ANXA2
CTSH
GSN
VIM
CTSB
NOTCH1
12 wks
JUN
NOTCH1
20 wks
Figure 6.1 A central sub-network involved in neuronal pathology in prion disease in mice, shown at three times after inoculation. Red circles indicate increased levels of gene expression relative to controls.
Multi-Parameter Blood-Bourne Biomarkers
CREBBP
TTGAV
VCL
CTS3 SL00AL0
EZH2
BCL2L11 CTSB
YWHAZ
ANXA2 CTSH
GSN
VIM
GFAP
PSEN
DNM2
GRB2 BCL2
EZH2
CTSB
VAV1
CASP3
DNM2
BCL2L11
BCL2L11 SEN
SHC1
VAV1 GRB2
BCL2
EZH2
SYK
APLPL APP
CASP3
DNM2
GRB2 BCL2
PTK2B
SYK
APP
VAV1
CASP3
LYN
HEXA PTK2B
SYK APLP
FYN
GM2A
LYN
HEXA PTK2B
APP
CSTB
LRP1 LRPAP1
C3
FYN
GM2A
HEXA
CSTB
CAV1 PLD2
CAV1 PLD2
CAV1 PLD2
APLP1
PCGRL
CSNK2A1
APOB
LRP2
LRAP1
C1QA C1QR1 PCGR3
HSPAJA
A2M LRP1
STIP1
PRNP
PCGRL
CSNK2A1
APOB
LRP2
APCS
PLAUR
C1QA C1QR1 PCGR3
HSPAPA
A2M C3
STIP1
PRNP
PRND LAMR1
PLG
PLAU
APCS
PLAUR
C1QA C1QR1 FCGR3
HSPA1A APOB
LAMA1
LAMR1 PLG
PLAU
APCS
PLAUR
PRND
LAMAL
PLG
PLAL
■
77
78
CHAPTER 6
■
Systems Biology and the Emergence of Systems Medicine
medicine. Using each patient as their own control is one of the essential features of the emerging personalized medicine. Organ-Specific Blood Protein Fingerprints We discuss the organ-specific blood protein fingerprint approach to diagnostics because it lies at the very heart of the predictive medicine that will emerge over the next 10 years. Our feeling is that this will be one of the central data gathering strategies for predictive and personalized medicine and, accordingly, we will discuss it in detail. The idea is that disease arises from dynamically changing disease-perturbed networks, that the diseased organs will secrete proteins into the blood, and that if these proteins are encoded by disease-perturbed networks their levels of expression in the blood will be altered in a manner that reflects the specific nature of the disease. Indeed, this idea is the basis for the very broad range of blood biomarker studies that are being carried out by many scientific centers (Anderson and Anderson, 2002; Fujii et al., 2004; Hood et al., 2004; Lathrop et al., 2003; Lee et al., 2006). The major difficulty with this simple view is that if you identify multiple blood proteins whose changes are specific for a particular disease state (as compared against normal controls) and then examine the same blood markers for that disease in bloods drawn from individuals, with say 10 other pathologies, these marker proteins can change in unpredictable ways since multiple organs control the expression of most blood proteins and these organs respond differently to various environmental signals. The important point is that if a marker protein synthesized in five organs changes in the blood, we cannot be certain which organ(s) is the origin of the change. While that marker may sample a biological network relevant to disease diagnosis, since its origin is not clear its use as a disease biomarker may raise more questions than it answers. The solution to this dilemma is clear – employ organspecific blood protein biomarkers whose changes must therefore reflect changes only in the organ itself. If enough of these organspecific blood proteins are sampled, they will represent a survey of many different biological networks in the organ of interest and will provide sufficient diagnostic information for any disease. Eventually, correlation of these organ-specific biomarkers with more general biomarkers may prove to be the best long-term strategy for developing diagnostic fingerprints. We have several lines of preliminary evidence that suggest this organ-specific blood protein approach will be effective. In prostate cancer, for example, there are disease-mediated altered patterns of mRNA and protein expression in the prostate. Some of these genes are expressed primarily in the prostate (organspecific products) and some of these organ-specific proteins are secreted into the blood, where they collectively constitute a protein molecular fingerprint comprised of say 100 or more proteins whose relative concentration levels probably report the status of the biological networks in the prostate gland. We have demonstrated that changes in the blood concentrations of several of these prostate-specific blood proteins reflect the various stages of prostate cancer and, as discussed above, various brainspecific blood proteins also reflect the progression of prion disease in mice. We propose that the distinct expression levels of the
individual proteins in each fingerprint represent a multiparameter (and therefore potentially information-rich) diagnostic indicator reflecting the dynamic behavior of, for example, the disease-perturbed networks from which they arise. The analysis of 50 or so organ-specific proteins should allow us to both stratify diseases in the organ as well as determine their stage of progression. We have identified tens to hundreds of organ-specific transcripts in each of the 40 or so organs that we have examined in mouse and human. We can envision a time over the next 5–10 years when 50 or so organ-specific blood proteins will be identified for each of the 50 or so major organs and tissues in humans – so that computational analyses of the relative concentrations of the protein components in these organ-specific fingerprints will enable blood to become the primary window into health and disease. When we analyze data from these blood indicators, we also may be in a position to identify the dynamically changing disease-perturbed network(s). The analysis of these dynamic networks will allow us to study in detail the pathophysiology of the disease response and hence be in a position to think of new approaches to therapy and prevention. To generate the ability to assess simultaneously all 50 organ blood protein fingerprints in patients, we ultimately need to develop the microfluidic or nanotechnology tools for making perhaps 2500 rapid protein measurements (e.g., 50 proteins from each of 50 human organs) from a droplet of blood. Nanotechnology is necessary because only this severe miniaturization can allow the necessary thousands of measurements from a single drop of blood. In order to reach this stage we will also have to create the appropriate computational tools to capture, store, analyze, integrate, model, and visualize the information arising from this approach.
EMERGING IN VIVO AND IN VITRO TECHNOLOGIES In vitro Measurement Technologies P4 medicine will require that in vitro measurements of thousands of blood proteins be executed rapidly, automatically and inexpensively on small blood samples. This will require miniaturization, parallelization, integration and automation of tissue and blood purification and measurement technologies. In short, it will require making in vitro measurements inexpensively – perhaps for a penny or less per protein measured. This need for inexpensive in vitro measurements is driving the development of integrated microfluidics and nanotechnologies for in vitro diagnostics. We further argue that the complex changes in these protein levels must be analyzed computationally to search for the patterns that correlate with particular diseases – and indeed to stratify each disease as well as determine its stage of progression. To meet the expanding requirements of systems medicine, in vitro measurements of cells, proteins, mRNAs, etc., whether for fundamental biological studies or for pathological analysis,
Emerging In vivo and In vitro Technologies
must not just be inexpensive, but must also be sensitive, quantitative, rapid, and executed on very small quantities of tissues, cells, serum, etc. We are working towards developing chip platforms that can take a few microliters (a finger prick) of blood, separate the plasma from the serum, and then measure on the order of 100 or more proteins and/or mRNAs quantitatively, with high sensitivity and specificity, and in a few minutes time. We can envision a time 5–10 years hence when small hand-held devices will be able to make these 2500 measurements from a fraction of a droplet of blood, send them via wireless to a server for analysis and consequently inform the patient and physician as to the status of the patient. While many fundamental scientific challenges remain to be solved before this goal is achieved, none of those challenges appears insurmountable at this point. One will also be able to use the blood cells as powerful diagnostic markers – either of infectious diseases or of genetic diseases. Microfluidic cell-sorting technologies for being able to sort blood cells into their 10 or so individual types for analysis are now available. Even more important is the emergence of single-cell analytic tools where DNA, mRNAs or small RNAs, proteins and even metabolites can be analyzed rapidly from individual cells. The cells can also be perturbed with appropriate environmental stimuli to identify defects. It is possible that an appropriate analysis of immune cells (both innate and adaptive) from the blood will reveal important information about past antigenic history of the individual as well as current state of immunological responsiveness. Similarly, analysis of rare blood cells such as circulating cancer cells may also be utilized to guide therapies. Another type of in vitro measurement will be a determination of the complete genome sequence of individuals with a nanotechnology approach to sequencing single strands of DNA on a massively parallel basis (billions of DNA strands simultaneously). A device like this will emerge over the next 5–10 years and will allow millions, if not billions, of individual human genome sequences to be determined rapidly, inexpensively and in a massively parallel manner. The 2500 organ-specific blood marker measurements and the complete individual human genome sequences will be the heart of the predictive and personalized medicine that will emerge over the next 10 years or so. While transcriptomic approaches have proven useful for identifying informative molecular signatures for a number of diseases (Quackenbush, 2006), high-throughput proteomic characterizations have lagged behind. The reason for this difference is clearly the more difficult challenge of measuring proteins compared to mRNA transcripts. Emerging proteomics technologies hold the promise of greatly improving our ability to make detailed disease assessments using protein-based molecular signatures. One key advantage of protein signatures relative to gene expression is that the proteins found in the blood and other accessible bodily fluids exhibit slower degradation rates than their mRNA counterparts – and of course proteins are the direct agents mediating the disease process itself. Thus, protein concentration signatures represent an important class of molecular signature for disease diagnosis.
■
79
The Imaging of Proteomics and Blood Protein Biomarkers The blood is an ideal organ for identifying biomarkers because it interacts with virtually every organ and major tissue type in the body, and each secretes proteins into the blood. Moreover, the blood is easily accessible for diagnostic studies – and presumably the analysis of changes in protein levels or protein modifications that could serve as surrogates or reflections of health and disease. As noted earlier, the blood is an enormously complex organ; it contains hundreds of thousands to millions of proteins whose concentrations range over 12 orders of magnitude. Since there is for proteins no equivalent to the polymerase chain reaction (PCR) amplification procedure, only the more abundant proteins can be seen by conventional protein analysis studies. It should also be pointed out that many different features of proteins ultimately must be characterized or quantified: identification, quantification, chemical modifications, alternative mRNA splicing products, cellular localizations, three-dimensional structures (and their dynamics), as well as their functions. We here are only concerned with the quantification and identification of proteins. There are two general approaches to the analysis of protein mixtures. In one case protein-capture agents such as antibodies are used – and with appropriate controls these approaches can quantify the levels of proteins. These techniques include Western blot analyses, ELISA assays, surface plasmon resonance and protein chips. A second approach is to use mass spectrometry. Since proteins have large masses that are difficult to analyze accurately by mass spectrometry, most proteins are analyzed after converting them into tryptic peptides (or other proteolytic fragments) whose masses are far smaller and hence more easily analyzed. Because blood is such a complex mixture, proteins of interest are often enriched by some type of fractional procedure (charge, hydrophobility, size and/or the presence of modifications such as glycosylation or phosphorylation), before (or after) their conversion into peptides and analysis by mass spectrometry. Isotopically labeled and synthesized peptides may be used to identify and quantify (relative or absolute) the same peptides from unknown samples (and hence one obtains a proxy for the quantification of their corresponding proteins) very effectively by mass spectrometry. We discuss below examples of some of these techniques. Antibody Microarrays Using Surface Plasmon Resonance Imaging (SPRI) Protein chip methods hold great potential for broad quantitative screens of proteins, and a variety of techniques have been developed based on antibody binding (Haab et al., 2001; Olle et al., 2005; Song et al., 2007).Various types of antibody arrays have been used for biomarker discovery and protein profiling of serum from patients with prostate, lung, pancreas, and bladder cancer (Gao et al., 2005;Miller et al., 2003;Orchekowski et al., 2005;Sanchez-Carbayo et al., 2006). One emerging approach with tremendous promise is SPRI (Hu et al., 2007; Huber and Mueller, 2006; Koga et al., 2006), which enables real-time, label-free measurement of protein expression by large numbers of different antibodies. SPR-based chips have a detection sensitivity of 10–100 times less than ELISA
80
CHAPTER 6
■
Systems Biology and the Emergence of Systems Medicine
(Hu et al., 2007), but have a spatial resolution down to approximately 4 m (Lyon et al., 1998). It is thus possible to print up to 800 unique antibodies on Lumera Nanocapture Gold™ microarray slides and monitor the abundance of the target proteins in real time (Hu et al., 2007), even in complex samples such as blood, because the same slide can be regenerated for reuse many times in 10 minutes or less (Z. Hu, C. Lausted, unpublished observations). We are in the process of automating the sample introduction procedures using microfluidic chips, which means that SPRI has the capacity necessary to screen rapidly through tens to hundreds of patient samples. Thus, this approach holds tremendous promise to be able to screen through large numbers of proteins, including secreted proteins and cell-surface markers and not only measure their presence, but also abundance and the dynamics of their binding. The limitation of this technique is its dependence on the affinity and specificity of the antibodies it employs for detection – cross-reactivities in complex protein mixtures (such as blood) can pose significant problems. DNA-Encoded Antibody Libraries (DEAL) One recently developed technique that offers great potential for detailed analyses is DEAL. The primary advantage of DEAL is that it uses a single, robust chemistry – that of DNA hybridization – to spatially localize and detect proteins, mRNAs, and cells, all in a multiplexed fashion. Antibodies are typically too fragile to survive the fabrication procedures associated with assembling robust microfluidics chips, but DNA oligomers are significantly more robust. DEAL thus enables the detection of panels of protein biomarkers within a microfluidics environment and from very small quantities of biological material (100 nanoliters or so) (Yang et al., 2006). This amount of plasma can be readily separated from whole blood on-chip, thus allowing for the measurement of serum biomarkers from a fingerprick of whole blood. In addition, within the environment of flowing microfluidics, the rate-limiting step in performing a surface immunoassay is the kinetics of the binding of the analyte to the surface-bound capture agent (Zimmermann et al., 2005). Thus, DEAL-based immunoassays can be executed very rapidly. The versatility of DEAL also enables multiplexed cell sorting and localization, followed by few- or single-cell measurements of protein, RNA, and other biomolecules (Bailey et al., 2007). DEAL can be engineered into a highly sensitive and very rapid measurement technique, with a reported detection limit of 10 f M for the protein IL-2, 150 times more sensitive than the corresponding commercial ELISA assay. This sensitivity can be applied to the isolation of rare cells based on combinations of cell-surface markers, enabling the isolation and addressing of individual cancer stem cells. DEAL can also be used to make single-cell measurements of secreted proteins from each of these isolated single cells. Thus, DEAL offers superb sensitivity and the ability to perform spatially multiplexed detection for characterization of rare cell types, such as circulating cancer cells or cancer stem cells. These advantages still face the inherent limitations of antibodies, so the development of new approaches to generating
protein-capture agents is a critical part of future development of comprehensive blood diagnostics. Mass Spectrometry-based Techniques Isotopic Tagging for Relative and Absolute Quantification (iTRAQ) of Proteins Analyzed by Mass Spectrometry Stable isotope labeling enables the quantitative analysis of protein concentrations through mass spectrometry (MS). One stateof-the-art technique for quantitative MS is iTRAQ (Ross et al., 2004), which uses stable isotope labeling of proteolytic peptides. This technique modifies primary amino acid groups of peptides by linking a mass balance group, and a reporter group by forming an amide bond. When MS/MS is used for analysis with iTRAQ-tagged peptides, the mass balancing carbonyl moiety is released as a neutral fragment generating 4–8 distinct sets of peptides whose relative abundances can be determined quantitatively. Because eight different iTRAQ reagents are currently available, comparative analysis of a set of two to eight samples is feasible within a single MS/MS run (Hu et al., 2007). The, iTRAQ technology represents the state-of-the-art in quantitative proteomics and represents a promising technology for using proteomics to differentiate key differences in protein networks. Glycopeptide Capture – Front-End Enrichment of Blood Proteins Containing Sugar Residues Mass spectrometry-based methods will allow for the identification of proteins spanning approximately three orders of magnitude in concentration from a given sample. Therefore, methods that can select specified fractions of the proteome are important for simplifying the sample sufficiently to identify the proteins of interest. One recently developed approach is the shotgun glycopeptides capture approach (Sun et al., 2007). This approach selects for N-linked or O-linked glycosylated peptides by selectively coupling these peptides to beads – allowing the uncoupled peptides to be washed away – then the glycosylated peptides are be released and analyzed by mass spectrometry. Both secreted and cell-surface proteins are enriched for glycopeptides compared with their nuclear and cytoplasmic counterparts. Thus, this approach can be used to, for example, identify candidates for unique cell-surface markers to make identifications of clinically relevant cellular subpopulations. In Vivo Molecular Diagnostics As it relates to personalized and predictive medicine, in vivo molecular diagnostics will also require the development of a diverse library of molecular imaging probes. These modular tools can be used to: 1. identify the specific location of disease-perturbed networks in patients; 2 link in vivo molecular measurements in diseased tissue in patients to in vitro measurements; 3. rapidly assess the efficacy of personalized therapeutics; and 4. validate that a drug is hitting its target and inducing the desired pharmacological outcome.
Conclusions and Perspectives
Although there are many in vivo imaging modalities, perhaps the best current method from the point of view of personalized and predictive medicine in patients is positron emission tomography (PET) molecular imaging (Czernin and Phelps, 2002). For PET, trace quantities of radiolabeled molecular probes are injected into the patient. As the probes circulate through the body and its various organ systems, they interact with target proteins to provide imaging assays for, as examples, the rate of metabolic processes, the concentration of receptors in signal transduction, enzyme activity, DNA-replication rates, hormone status, pharmacokinetics, and pharmacodynamics.
COMPUTATIONAL AND MATHEMATICAL CHALLENGES IN SYSTEMS MEDICINE Molecular Signature Classifiers Computational methods are needed to reduce the high degree of data dimensionality associated with global datasets to identify molecular signatures that can be used for disease diagnosis. Despite notable and significant challenges that remain (Dupuy and Simon, 2007; Simon, 2005), computational analyses to identify molecular signatures from global gene expression datasets that can be used for diagnosis and treatment selection (Quackenbush, 2006) is an area of research that has shown significant promise. These studies typically involve the collection of samples from two or more classes (e.g., cancer versus normal, or responsive versus non-responsive to treatment) and the use of a set of data on which to train the classifier and another set on which to test. In the absence of a true test set, re-sampling methods (such as cross-validation) are generally used to estimate likely performance of the classifier on future data. The ability to generate an accurate classifier is a function of factors such as: (1) the size of the training set relative to the number of features, (2) the computational method used, and (3) the inherent distinctness of the selected phenotypes. Typically, the number of samples is far less than the number of transcripts, and consequently over-fitting is a significant problem. This leads to the need for computational methods that avoid over-fitting when selecting a classifier. A variety of methods have been applied to disease diagnoses, including approaches based on support vector machines (Furey et al., 2000; Ramaswamy et al., 2001) and relative expression reversals (Geman et al., 2004; Price et al., 2007; Tan et al., 2005; Xu et al., 2005), among many others. Application of these methods has perhaps been applied most extensively to the study of cancer, leading to the discovery of molecular classifiers of varying degrees of accuracy to identify prognostic signatures for breast cancer (Adler and Chang, 2006; Buyse et al., 2006; Dai et al., 2005; Foekens et al., 2006; Glinsky et al., 2004a, b; Goncalves et al., 2006; Ivshina et al., 2006; Liu et al., 2007; Ma et al., 2004; Nuyten et al., 2006; Park et al., 2007; Pawitan et al., 2005; Thomassen et al., 2007; van ‘t Veer et al., 2002; van de Vijver et al.,
■
81
2002; Wang et al., 2005a, b; Weigelt et al., 2005), ovarian cancer (De Cecco et al., 2004; Smith, 2002; Spentzos et al., 2004), colon cancer (Barrier et al., 2006; Giacomini et al., 2005), prostate cancer (Dhanasekaran et al., 2001; Glinsky et al., 2004a, b; Halvorsen et al., 2005; Lin et al., 2005; Luo et al., 2002; Singh et al., 2002; Xu et al., 2005), and brain cancer (Fuller et al., 2005; Kim et al., 2002), among others. Biological Networks and the Interactome Among the most promising opportunities presented by systems biology is the integrated assessment of multiple biological parameters to magnify our understanding of complex developmental or pathological states. The interaction among all individual biomolecules in complex regulatory, signal transduction, and feedback networks present in eukaryotic organisms forms the theoretical framework for this higher-order analysis and is referred to as the interactome. These dynamic associations include protein–protein interactions, transcriptional regulation, and post-transcriptional gene silencing by short-interfering RNA and microRNA (Ruvkun, 2001). Multiple high-throughput methods have been developed to characterize the interaction between proteins, including the yeast-two-hybrid system (Ito et al., 2001) and surface plasmon resonance (Usui-Aoki et al., 2005). These methods are to determining interactome interactions what microarray studies are to determining relative mRNA concentrations.
CONCLUSIONS AND PERSPECTIVES Nanotechnology and microfluidic sensing chips with integrated microfluidic delivery devices will be developed over the next 5–10 years so that we reach a point at which thousands of proteins and/or RNAs can be analyzed quantitatively from only a fraction of a droplet of blood. This will allow physicians to assess the health/disease status of virtually every organ in the body at 6-month intervals and will become the first critical element of predictive medicine. We also predict that nanotechnology approaches to the sequencing of single DNA strands will over the next 5–10 years lead to instrumentation that can rapidly, cheaply and simply sequence individual human genomes. This will be a second foundation of predictive medicine – assessing the variability in individual human genomes to generate probabilistic future health histories for each individual human, and integrating it with diagnostics. Similarly, in vivo imaging agents will permit the functional visualization of informational molecules and drugs in humans, which will provide powerful and informative in vivo diagnostics approaches. These measurement technologies, together with both new computational approaches to extract information from these data and the systems view of medicine, will lead to three important revolutions in medicine. First, over the next 10–20 years, a medicine – P4 medicine – that is predictive, personalized, preventive and participatory will emerge. Obtaining individual human genome sequences will lead to individual predictive health histories, whereas biannual blood measurements of thousands of proteins from the molecular
82
CHAPTER 6
■
Systems Biology and the Emergence of Systems Medicine
fingerprints of human organs will give us a real-time assessment of individual health. Predictive medicine will result in a personalized medicine that focuses on the illnesses of an individual and eventually their wellness, preventing disease rather than treating it. The systems approach to disease will in time permit the identification of the key disease-perturbed subnetworks and the identification of important nodal points, which, if perturbed by drugs, could make the network behave in a more normal fashion or at least delete the more deleterious effects of the disease-perturbed network. This capability will provide a new approach for drug-target discovery, and a powerful and rapid new approach for developing drugs. Predictive, personalized and preventive medicine, if appropriately orchestrated with patient-oriented interpretations, will also enable patients to understand more deeply and actively participate in personal choices about illness and wellness. Participatory medicine will necessitate the development of powerful new approaches to handling enormous amounts of personal information in a secure manner, and to a new form of medical education of individual patients as well as their physicians. Thus, over the next 5–20 years, medicine will become predictive, personalized, preventive and participatory (P4 medicine). Another very exciting idea is that, in the long-term, systems analysis of blood offers the potential to open up a new avenue for studying human biology – when we learn to read and interpret the information inherent in secreted organ-specific protein patterns. The key patterns in these secreted proteins represent different network perturbations, and hence, different diseases. When secreted proteins enter the blood, they can provide a novel means to study biology in higher organisms and to identify drug targets through linking blood measurements to perturbations in underlying biological networks. Importantly, learning to study perturbations in underlying networks through secreted protein patterns (to whatever resolution is possible) will provide access to studying biological networks in vivo that are not amenable to direct experimentation. Developing the capacity to identify in vivo network perturbations through secreted protein measurements in the blood will open up a new avenue to drug
target identification and will provide a novel means to discover the perturbed sub-networks. Since taking blood is relatively non-invasive, it has less potential than biopsies or other invasive techniques of greatly distorting the system being studied. Thus, this approach has the potential to strongly complement existing approaches to study human biology. The effect of drugs, toxicity, human development and even aging may all be amenable to study through the blood if we can learn to read and interpret the signals in the proteins, which has the potential to be very significant in human biology. Second, as the sensitivity of measurements increases (both in vitro and in vivo), we will achieve a digitalization of biology and medicine – that is, the ability to extract relevant information content from single individuals, single cells and ultimately single molecules – with its attendant economies of scale. Just as Moore’s law led in time to the widespread digitalization of information technologies and communications, the exponentially increasing ability to extract quantized biological information from individual cells and molecules will transform biology and medicine in ways that we can only begin to imagine. Finally, all of these changes – a systems approach to medicine with its focus on disease prevention and more efficient drug discovery, the introduction of increasingly inexpensive nanotechnology based diagnostics and in vivo measurement technologies, the highly accurate and specific molecular characterization of the systems biology of disease, and the digitalization of medicine – will start to reduce the inexorably increasing costs of health care. This can, in principle, enable us to afford to provide for the more than 45 million people in the United States who currently lack health insurance, to reduce the crushing costs of healthcare to society, and to export our digital predictive and preventive medical approach to the developing world. Just as the mobile phone has become a fundamental communication mode in developing countries and has changed the lives of much of the world’s population, so digital medicine of the 21st century can bring to the world’s citizens a global and strongly improved state of human health and healthcare. In our view P4 medicine will become the foundational framework for global medicine.
REFERENCES Adler, A.S. and Chang, H.Y. (2006). From description to causality: Mechanisms of gene expression signatures in cancer. Cell Cycle 5, 1148–1151. Anderson, N.L. and Anderson, N.G. (2002). The human plasma proteome: History, character, and diagnostic prospects. Mol Cell Proteomics 1, 845–867. Bailey, R.C., Kwong, G.A., Radu, C.G., Witte, O.N. and Heath, J.R. (2007). DNA-encoded antibody libraries: A unified platform for multiplexed cell sorting and detection of genes and proteins. J Am Chem Soc 129, 1959–1967. Barrier, A., Boelle, P.Y., Roser, F., Gregg, J., Tse, C., Brault, D., Lacaine, F., Houry, S., Huguier, M., Franc, B. et al. (2006). Stage II colon cancer prognosis prediction by tumor gene expression profiling. J Clin Oncol 24, 4685–4691.
Buttner, P., Mosig, S., Lechtermann, A., Funke, H. and Mooren, F.C. (2007). Exercise affects the gene expression profiles of human white blood cells. J Appl Physiol 102, 26–36. Buyse, M., Loi, S., van’t Veer, L., Viale, G., Delorenzi, M., Glas, A.M., d’Assignies, M.S., Bergh, J., Lidereau, R., Ellis, P. et al. (2006). Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst 98, 1183–1192. Czernin, J. and Phelps, M.E. (2002). Positron emission tomography scanning: Current and future applications. Annu Rev Med 53, 89–112. Dai, H., van’tVeer, L., Lamb, J., He,Y.D., Mao, M., Fine, B.M., Bernards, R., van de Vijver, M., Deutsch, P., Sachs, A. et al. (2005). A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res 65, 4059–4066.
References
De Cecco, L., Marchionni, L., Gariboldi, M., Reid, J.F., Lagonigro, M.S., Caramuta, S., Ferrario, C., Bussani, E., Mezzanzanica, D., Turatti, F. et al. (2004). Gene expression profiling of advanced ovarian cancer: Characterization of a molecular signature involving fibroblast growth factor 2. Oncogene 23, 8171–8183. Dhanasekaran, S.M., Barrette, T.R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K.J., Rubin, M.A. and Chinnaiyan, A.M. (2001). Delineation of prognostic biomarkers in prostate cancer. Nature 412, 822–826. Duarte, N.C., Becker, S.A., Jamshidi, N., Thiele, I., Mo, M.L.,Vo, T.D., Srivas, R. and Palsson, B.O. (2007). Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc Natl Acad Sci USA 104, 1777–1782. Dupuy, A. and Simon, R.M. (2007). Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 99, 147–157. Foekens, J.A., Atkins, D., Zhang,Y., Sweep, F.C., Harbeck, N., Paradiso, A., Cufer, T., Sieuwerts, A.M., Talantov, D., Span, P.N. et al. (2006). Multicenter validation of a gene expression-based prognostic signature in lymph node-negative primary breast cancer. J Clin Oncol 24, 1665–1671. Fujii, K., Nakano, T., Kawamura, T., Usui, F., Bando, Y., Wang, R. and Nishimura, T. (2004). Multidimensional protein profiling technology and its application to human plasma proteome. J Proteome Res 3, 712–718. Fuller, G.N., Mircean, C., Tabus, I., Taylor, E., Sawaya, R., Bruner, J.M., Shmulevich, I. and Zhang, W. (2005). Molecular voting for glioma classification reflecting heterogeneity in the continuum of cancer progression. Oncol Rep 14, 651–656. Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M. and Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–914. Gao,W.M., Kuick, R., Orchekowski, R.P., Misek, D.E.,Qiu, J., Greenberg, A.K., Rom, W.N., Brenner, D.E., Omenn, G.S., Haab, B.B. et al. (2005). Distinctive serum protein profiles involving abundant proteins in lung cancer patients based upon antibody microarray analysis. BMC Cancer 5, 110. Geman, D., d’Avignon, C., Naiman, D.Q. and Winslow, R.L. (2004). Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 3(Article19). Giacomini, C.P., Leung, S.Y., Chen, X., Yuen, S.T., Kim, Y.H., Bair, E. and Pollack, J.R. (2005). A gene expression signature of genetic instability in colon cancer. Cancer Res 65, 9200–9205. Glinsky, G.V., Glinskii, A.B., Stephenson, A.J., Hoffman, R.M. and Gerald, W.L. (2004). Gene expression profiling predicts clinical outcome of prostate cancer. J Clin Invest 113, 913–923. Glinsky, G.V., Higashiyama, T. and Glinskii, A.B. (2004). Classification of human breast cancer using gene expression profiling as a component of the survival predictor algorithm. Clin Cancer Res 10, 2272–2283. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537. Goncalves, A., Esterni, B., Bertucci, F., Sauvan, R., Chabannon, C., Cubizolles, M., Bardou, V.J., Houvenaegel, G., Jacquemier, J., Granjeaud, S. et al. (2006). Postoperative serum proteomic profiles
■
83
may predict metastatic relapse in high-risk primary breast cancer patients receiving adjuvant chemotherapy. Oncogene 25, 981–989. Haab, B.B., Dunham, M.J. and Brown, P.O. (2001). Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions. Genome Biol 2, RESEARCH0004. Halvorsen, O.J., Oyan, A.M., Bo, T.H., Olsen, S., Rostad, K., Haukaas, S.A., Bakke, A.M., Marzolf, B., Dimitrov, K., Stordrange, L. et al. (2005). Gene expression profiles in prostate cancer: Association with patient subgroups and tumour differentiation. Int J Oncol 26, 329–336. Hood, L., Heath, J.R., Phelps, M.E. and Lin, B. (2004). Systems biology and new technologies enable predictive and preventative medicine. Science 306, 640–643. Hu, Z., Hood, L. and Tian, Q. (2007). Quantitative proteomic approaches for biomarker discovery. Proteomics. In review. Huber, W. and Mueller, F. (2006). Biomolecular interaction analysis in drug discovery using surface plasmon resonance technology. Curr Pharm Des 12, 3999–4021. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001). A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98, 4569–4574. Ivshina, A.V., George, J., Senko, O., Mow, B., Putti, T.C., Smeds, J., Lindahl, T., Pawitan,Y., Hall, P., Nordgren, H. et al. (2006). Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res 66, 10292–10301. Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G.R., Wu, G.R., Matthews, L. et al. (2005). Reactome: A knowledgebase of biological pathways. Nucleic Acids Res 33, D428–D432. Kim, S., Dougherty, E.R., Shmulevich, I., Hess, K.R., Hamilton, S.R., Trent, J.M., Fuller, G.N. and Zhang, W. (2002). Identification of combination gene sets for glioma classification. Mol Cancer Ther 1, 1229–1236. Koga, H., Kyo, M., Usui-Aoki, K. and Inamori, K. (2006). A chip-based miniaturized format for protein-expression profiling: the exploitation of comprehensively produced antibodies. Electrophoresis 27, 3676–3683. Lathrop, J.T., Anderson, N.L., Anderson, N.G. and Hammond, D.J. (2003). Therapeutic potential of the plasma proteome. Curr Opin Mol Ther 5, 250–257. Lee, H.J., Lee, E.Y., Kwon, M.S. and Paik, Y.K. (2006). Biomarker discovery from the plasma proteome using multidimensional fractionation proteomics. Curr Opin Chem Biol 10, 42–49. Lin, B., White, J.T., Lu, W., Xie, T., Utleg, A.G., Yan, X., Yi, E.C., Shannon, P., Khrebtukova, I., Lange, P.H. et al. (2005). Evidence for the presence of disease-perturbed networks in prostate cancer cells by genomic and proteomic analyses: A systems approach to disease. Cancer Res 65, 3081–3091. Liu, R., Wang, X., Chen, G.Y., Dalerba, P., Gurney, A., Hoey, T., Sherlock, G., Lewicki, J., Shedden, K. and Clarke, M.F. (2007). The prognostic role of a gene signature from tumorigenic breastcancer cells. N Engl J Med 356, 217–226. Luo, J., Dunn,T., Ewing, C., Sauvageot, J., Chen,Y.,Trent, J. and Isaacs,W. (2002). Gene expression signature of benign prostatic hyperplasia revealed by cDNA microarray analysis. Prostate 51, 189–200. Lyon, L.A., Musick, M.D. and Natan, M.J. (1998). Colloidal Au-enhanced surface plasmon resonance immunosensing. Anal Chem 70, 5177–5183.
84
CHAPTER 6
■
Systems Biology and the Emergence of Systems Medicine
Ma, X.J., Wang, Z., Ryan, P.D., Isakoff, S.J., Barmettler, A., Fuller, A., Muir, B., Mohapatra, G., Salunga, R., Tuggle, J.T. et al. (2004). A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell 5, 607–616. Miller, J.C., Zhou, H., Kwekel, J., Cavallo, R., Burke, J., Butler, E.B., Teh, B.S. and Haab, B.B. (2003). Antibody microarray profiling of human prostate cancer sera: Antibody screening and identification of potential biomarkers. Proteomics 3, 56–63. Mulquiney, P.J. and Kuchel, P.W. (2003). Modelling Metabolism with Mathematica Detailed Examples Including Erythrocyte Metabolism. CRC Press, Boca Raton, Fla. Nuyten, D.S., Kreike, B., Hart, A.A., Chi, J.T., Sneddon, J.B.,Wessels, L.F., Peterse, H.J., Bartelink, H., Brown, P.O., Chang, H.Y. et al. (2006). Predicting a local recurrence after breast-conserving therapy by gene expression profiling. Breast Cancer Res 8, R62. Olle, E.W., Sreekumar, A., Warner, R.L., McClintock, S.D., Chinnaiyan, A.M., Bleavins, M.R., Anderson, T.D. and Johnson, K. J. (2005). Development of an internally controlled antibody microarray. Mol Cell Proteomics 4, 1664–1672. Orchekowski, R., Hamelinck, D., Li, L., Gliwa, E., vanBrocklin, M., Marrero, J.A.,Vande Woude, G.F., Feng, Z., Brand, R. and Haab, B.B. (2005). Antibody microarray profiling reveals individual and combined serum proteins associated with pancreatic cancer. Cancer Res 65, 11193–11202. Palsson, B. (2004). Two-dimensional annotation of genomes. Nat Biotechnol 22, 1218–1219. Papadopoulou, E., Davilas, E., Sotiriou, V., Georgakopoulos, E., Georgakopoulou, S., Koliopanos, A., Aggelakis, F., Dardoufas, K., Agnanti, N.J., Karydas, I. et al. (2006). Cell-free DNA and RNA in plasma as a new molecular marker for prostate and breast cancer. Ann N Y Acad Sci 1075, 235–243. Park, E.S., Lee, J.S., Woo, H.G., Zhan, F., Shih, J.H., Shaughnessy, J.D. and Frederic Mushinski, J. (2007). Heterologous tissue culture expression signature predicts human breast cancer prognosis. PLoS ONE 2, e145. Pawitan,Y., Bjohle, J., Amler, L., Borg, A.L., Egyhazi, S., Hall, P., Han, X., Holmberg, L., Huang, F., Klaar, S. et al. (2005). Gene expression profiling spares early breast cancer patients from adjuvant therapy: Derived and validated in two population-based cohorts. Breast Cancer Res 7, R953–R964. Price, N.D. and Shmulevich, I. (2007). Biochemical and statistical network models for systems biology. Curr Opin Biotechnol 18, 365–370. Price, N.D., Reed, J.L. and Palsson, B.O. (2004). Genome-scale models of microbial cells: Evaluating the consequences of constraints. Nat Rev Microbiol 2, 886–897. Price, N.D., Schellenberger, J. and Palsson, B.O. (2004). Uniform sampling of steady-state flux spaces: Means to design experiments and to interpret enzymopathies. Biophys J 87, 2172–2186. Price, N.D.,Trent, J., El-Naggar, A.K., Cogdell, D.,Taylor, E., Hunt, K.K., Pollock, R.E., Hood, L., Shmulevich, I. and Zhang, W. (2007). Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. Proc Natl Acad Sci USA 104, 3414–3419. Quackenbush, J. (2006). Microarray analysis and tumor classification. N Engl J Med 354, 2463–2472. Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E. and Mesirov, J.P. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98, 15149–15154.
Ross, P.L., Huang, Y.N., Marchese, J.N., Williamson, B., Parker, K., Hattan, S., Khainovski, N., Pillai, S., Dey, S., Daniels, S. et al. (2004). Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 3, 1154–1169. Ruvkun, G. (2001). Molecular biology. Glimpses of a tiny RNA world. Science 294, 797–799. Sanchez-Carbayo, M., Socci, N.D., Lozano, J.J., Haab, B.B. and CordonCardo, C. (2006). Profiling bladder cancer using targeted antibody arrays. Am J Pathol 168, 93–103. Scherzer, C.R., Eklund, A.C., Morse, L.J., Liao, Z., Locascio, J.J., Fefer, D., Schwarzschild, M.A., Schlossmacher, M.G., Hauser, M.A., Vance, J.M. et al. (2007). Molecular markers of early Parkinson’s disease based on gene expression in blood. Proc Natl Acad Sci USA 104, 955–960. Simon, R. (2005). Roadmap for developing and validating therapeutically relevant genomic classifiers. J Clin Oncol 23, 7332–7341. Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209. Smith, D.I. (2002). Transcriptional profiling develops molecular signatures for ovarian tumors. Cytometry 47, 60–62. Solanky, K.S., Bailey, N.J., Beckwith-Hall, B.M., Davis, A., Bingham, S., Holmes, E., Nicholson, J.K. and Cassidy, A. (2003). Application of biofluid 1 H nuclear magnetic resonance-based metabonomic techniques for the analysis of the biochemical effects of dietary isoflavones on human plasma profile. Anal Biochem 323, 197–204. Song, S., Li, B., Wang, L., Wu, H., Hu, J., Li, M. and Fan, C. (2007). A cancer protein microarray platform using antibody fragments and its clinical applications. Mol Biosyst 3, 151–158. Spentzos, D., Levine, D.A., Ramoni, M.F., Joseph, M., Gu, X., Boyd, J., Libermann, T.A. and Cannistra, S.A. (2004). Gene expression signature with independent prognostic significance in epithelial ovarian cancer. J Clin Oncol 22, 4700–4710. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A. and Tyers, M. (2006). BioGRID: A general repository for interaction datasets. Nucleic Acids Res 34, D535–D539. Sun, B., Ranish, J.A., Utleg, A.G.,White, J.T.,Yan, X., Lin, B. and Hood, L. (2007). Shotgun glycopeptide capture approach coupled with mass spectrometry for comprehensive glycoproteomics. Mol Cell Proteomics 6, 141–149. Tan, A.C., Naiman, D.Q., Xu, L., Winslow, R.L. and Geman, D. (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21, 3896–3904. Thiele, I., Price, N.D.,Vo, T.D. and Palsson, B.O. (2005). Candidate metabolic network states in human mitochondria. Impact of diabetes, ischemia, and diet. J Biol Chem 280, 11683–11695. Thomassen, M.,Tan, Q., Eiriksdottir, F., Bak, M., Cold, S. and Kruse,T.A. (2007). Prediction of metastasis from low-malignant breast cancer by gene expression profiling. Int J Cancer 120, 1070–1075. Usui-Aoki, K., Shimada, K., Nagano, M., Kawai, M. and Koga, H. (2005). A novel approach to protein expression profiling using antibody microarrays combined with surface plasmon resonance technology. Proteomics 5, 2396–2401. van ‘t Veer, L.J., Dai, H., van de Vijver, M.J., He,Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T. et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
Recommended Resources
van de Vijver, M.J., He, Y.D., van’t Veer, L.J., Dai, H., Hart, A.A., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J. et al. (2002). A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347, 1999–2009. Wang, X.,Yu, J., Sreekumar, A., Varambally, S., Shen, R., Giacherio, D., Mehra, R., Montie, J.E., Pienta, K.J., Sanda, M.G. et al. (2005). Autoantibody signatures in prostate cancer. N Engl J Med 353, 1224–1235. Wang, Y., Klijn, J.G., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M.E., Yu, J. et al. (2005). Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365, 671–679. Weigelt, B., Hu, Z., He, X., Livasy, C., Carey, L.A., Ewend, M.G., Glas, A.M., Perou, C.M. and Van’t Veer, L.J. (2005). Molecular
■
85
portraits and 70-gene prognosis signature are preserved throughout the metastatic process of breast cancer. Cancer Res 65, 9155–9158. Wilson, J.F. (2006). The rocky road to useful cancer biomarkers. Ann Intern Med 144, 945–948. Xu, L., Tan, A.C., Naiman, D.Q., Geman, D. and Winslow, R.L. (2005). Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics 21, 3905–3911. Yang, S., Undar, A. and Zahn, J.D. (2006). A microfluidic device for continuous, real time blood plasma separation. Lab Chip 6, 871–880. Zimmermann, M., Delamarche, E., Wolf, M. and Hunziker, P. (2005). Modeling and optimization of high-sensitivity, low-volume microfluidic-based surface immunoassays. Biomed Microdevices 7, 99–110.
RECOMMENDED RESOURCES Website
Article
www.systemsbiology.org Contains overviews of systems biology discipline, particularly in “Systems Biology in Depth” section of the website.
Hood, L., Heath, J.R., Phelps, M.E. and Lin, B. (2004). Systems biology and new technologies enable predictive and preventative medicine. Science 306, 640–643.
This page intentionally left blank
Section
Technology Platforms for Genomic Medicine
2
7. DNA Sequencing for the Detection of Human Genome Variation and Polymorphism 8. Genome-Wide Association Studies and Genotyping Technologies 9. Copy Number Variation and Human Health 10. Inter-Species Comparative Sequence Analysis: A Tool for Genomic Medicine 11. DNA Methylation Analysis: Providing New Insight into Human Disease 12. Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine 13. DNA Microarrays in Biological Discovery and Patient Care 14. Proteomics: The Deciphering of the Functional Genome 15. Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms 16. Comprehensive Analysis of Gene Function: RNA interference and Chemical Genomics
CHAPTER
7 DNA Sequencing for the Detection of Human Genome Variation and Polymorphism Samuel Levy and Yu-Hui Rogers
INTRODUCTION The discovery and characterization of DNA polymorphisms in human populations is an important step toward understanding the contribution of genome sequence variants to predisposition, onset and progression of disease phenotypes. By way of example, molecular epidemiology studies have been able to establish the importance of the wide range of DNA variants that cause monogenic disorders such as cystic fibrosis, where over 1000 mutations in the CFTR gene have been identified and implicated in the disease phenotype (Rowntree and Harris, 2003). The impact of the mutational landscape of the CFTR gene on biological function is complex, and emerging evidence implicates additional modifier genes contributing to the range of disease phenotypes observed (Davies et al., 2005). The situation is likely to be compounded further when identifying and characterizing the contribution of multiple genetic loci to phenotypes found in polygenic diseases such as coronary heart disease, diabetes and hypertension (see Chapter 3). While our understanding of how polymorphisms can contribute either singly or collectively to either disease onset or progression is still actively being developed, so too are the technologies that enable their detection and ascertainment in human populations.
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 88
Human DNA polymorphism detection has been greatly empowered in the last decade by the production of a high quality, mostly finished, DNA sequence of the human genome. The production of this sequence, and the fact that several distinct DNA sources were employed, enabled the direct identification of 2.1 million single nucleotide polymorphisms (SNPs) (Lander et al., 2001; Venter et al., 2001) to supplement those already found in the public variant databases (dbSNP). The discovery of these SNPs in a small number of humans set the stage for an ambitious experiment, the HapMap project, to define common haplotypes through the use of 4 million SNP types in 269 distinct humans (Consortium, 2005). It is anticipated that these datasets will enable the identification of a set of informative SNPs with which disease association studies can be accomplished with improved success. Indeed, early successes highlighting the utility of the HapMap data in identifying alleles implicated in complex diseases such as asthma (Laitinen et al., 2004), age related macular degeneration (Klein et al., 2005) and type II diabetes (Sladek et al., 2007), amongst others, have clearly set the stage for more studies of this kind (see Chapter 8). Essentially, one potential study design would likely involve the use of SNPs as markers for a disease in an affected DNA population compared to a control DNA population. Subsequently, any marker statistically associated with
Copyright © 2009, Elsevier Inc. All rights reserved.
DNA Sequencing
the disease group would inform the genomic loci requiring further sequence analysis to detect the causative variant. There are currently over 10 million SNPs and other polymorphisms such as insertion/deletion variants (indels) in public databases that potentially provide a marker set for efficient disease-gene association studies. It is important to note that even this large variant set still might not represent the variants causative of diseases. This is likely due to the fact that significant polymorphism discovery efforts to date have been performed in a random fashion across the genomes of only a limited number of individuals. Therefore, we still need to discover genetic variation in regions of functional DNA sequence in the genomes of individuals with disease, in order to provide an accurate understanding of disease etiology. In this chapter, we outline the sequencing protocols and the attendant computational approaches employed for the discovery and initial characterization of DNA variants, with an emphasis on the most commonly used Sanger-based sequencing (Sanger et al., 1977) methodologies established in our laboratories. These polymorphisms range from the substitution of single bases (i.e., SNPs) to the insertion or deletion of nucleotides in size ranges from 1 basepair (bp) to greater than 20,000 bp. Recent reports on the unexpected number and scope of larger scale copy number variants or structural variants ranging from 1–3 million bases (Mb) in size (Redon et al., 2006; Sebat et al., 2004) will not be addressed here since current methods used for their analysis involve either genotyping or microarray-based comparative genome hybridization, as presented elsewhere (see Chapter 9). We also highlight, where possible, how the application of these methods and advances has impacted our understanding of human biology. Several recent reviews addressed the methodologies and development of advanced sequencing methods and recent commercial platforms that exploit their use (Fan et al., 2006; Shendure et al., 2004). These advances in genome sequencing for variant detection are of particular importance as we have begun to see genome sequences and complete variants sets of individual humans, including as of this writing James D. Watson by 454 Life Sciences (Wheeler et al., 2008) and J. Craig Venter (Sanger-based sequencing (Levy et al., 2007)). The discussion in this chapter will focus primarily on mature approaches for use in high-throughput experimental environments, as well as on the types of experimental results that can and have been generated. It will be evident that comprehensive experimental designs and economies of scale are possible when applying these techniques in a production environment.
■
89
and genomic variants therein (Osoegawa et al., 2001). Other sequencing projects have employed expression sequence tag (EST) or cDNA sequencing of mRNA libraries to characterize variants primarily at the transcript, and by inference, at the protein level (Strausberg et al., 2003). All these approaches benefit from the random sampling of DNA regions by selecting bacterial colonies containing DNA inserts, DNA sequence determination using capillary-based electrophoresis of incorporated dye terminator nucleotide bases and subsequent sequence alignment of reads to construct reliable assemblies of genomic or mRNA molecules. Other approaches to DNA sequence elucidation that will be covered in this section involve the targeted amplification of defined genomic loci typically for study in large populations, the use of microfabricated high-throughput bead or surfacebased sequencing devices and high-density oligonucleotide arrays. A comparison of these distinct methodologies is provided in Table 7.1. Whole Genome Shotgun Sequencing Polymorphism discovery on a genome-wide level can be accomplished by applying the Sanger sequencing technique (Sanger et al., 1977) and the whole genome shotgun (WGS) strategy (Fleischmann et al., 1995; Venter et al., 2001; Venter et al., 1996) on a small pool of DNA samples. Owing to its high accuracy, versatility and ability to generate complete DNA sequence information, the Sanger sequencing method has been considered the definitive approach for DNA variant discovery. The WGS methodology relies upon generating a random sampling of genomic regions at a sufficiently deep coverage level to produce enough sequence data that can then be assembled by sequence alignment methods. First, multiple copies of the genome are randomly shredded into pieces of approximately 2–40 kilobasepairs (Kb) in size and subsequently cloned into a vector or plasmid. These constructs are then replicated in bacteria and sequenced from both ends to produce pairs of linked sequences, termed “mate pairs”, representing 500–800 bp at the end of each insert. These sequence reads are then assembled computationally to generate a set of contiguous high-quality regions with potential information regarding any associated sequence gaps of a known size. Currently, the Sanger-based WGS sequencing process, excluding the library construction step, is typically fully supported by equipment automation, as outlined in Figure 7.1, and managed as a production “pipeline” with varying degrees of integration between different steps employing a laboratory information management system.
DNA SEQUENCING DNA sequencing methods applied to either single or multiple DNA samples provide the definitive approach to discovering polymorphisms. These polymorphisms can be detected at the genomic and/or the mRNA level depending on the source of the material employed. For example, genomic DNA subcloned into BACs or in small to medium insert size bacterial libraries were created to produce a finished human genome sequence
Library Construction Genomic DNA is randomly sheared via nebulization to produce fragments with a distribution of approximately 2–40 Kb. The DNA fragments are ligated to adapters after end-polishing and size selection. After several rounds of purification, the fragments are inserted into plasmid vectors. The resulting library is electroporated into E. coli cells.
90
CHAPTER 7
TABLE 7.1
■
DNA Sequencing for the Detection of Human Genome Variation and Polymorphism
Comparison of distinct approaches in DNA sequence acquisition and the detection of variants Sanger based WGS
Targeted PCR
Sequence by synthesis
Sequence by hybridization
Current platforms
ABI 3730 xl
ABI 3730 xl
454, Illumina/Solexa, ABI SOLiD (strictly hybridization-ligation chemistry)
Affymetrix, Perlegen, Nimblegen
Read length (bp)
500–800
500–800
25–200
60–100
Detection SNP/Indel
Y/Y
Y/Y (up to amplicon size)
Y/Y
Y/Y (primarily deletions in sample)
Unbiased polymorphism detection
Y
Limited to targeted regions
Y
Limited to targeted regions
Biased polymorphism detection
N
Y
Y
Y
Construction of longrange haplotypes
Y
Limited to targeted regions
Uncertain
Limited to targeted regions
Current cost per sample $2 million
N
Uncertain
Y
Y
WGS Library
Figure 7.1
Clone picking and growth
Template preparation
Sequencing reaction set up and clean-up
Electrophoresis and sequence detection
Whole Genome Shotgun sequencing process flow.
Colony Plating and Picking The E. coli cells containing the plasmid vectors with inserted DNA fragments are then spread onto agar plates and incubated to form E. coli colonies. The colonies are picked into liquid media and incubated for further growth to generate sufficient numbers of plasmid template copies. Subsequently, these template copies are isolated and purified from the host cells. Template Preparation The alkaline lysis plasmid isolation approach (Sambrook et al., 1989) is the most widely utilized method for plasmid template isolation due to its robustness. Bacterial cells are lysed, cell debris removed by centrifugation and plasmid DNA recovered from the cleared lysate by isopropanol precipitation. Sequencing Reactions For DNA sequence determination, the most widely used technique is the Sanger dideoxy sequencing method (Sanger et al., 1977). It mimics the natural process of DNA replication in vivo with the presence of target DNA molecules, universal M13 sequencing primers, polymerase, deoxynucleotides (dNTPs) as well as chain terminating dideoxynucleotides which interrupt chain elongation upon incorporation generating DNA
fragments of different lengths. Several improvements have been made to the original method to greatly enhance its throughput, robustness and detection sensitivity. These include the development of multicolor fluorescent detection (Smith et al., 1986), fluorescently tagged chain-terminators (Prober et al., 1987), a cycle sequencing protocol (McCombie et al., 1992) and fluorescence resonance energy transfer (FRET) dyes (Ju et al., 1995; Lee et al., 1997). Currently, most sequencing reactions are carried out using FRET dye terminators, modified Taq polymerase and a cycle sequencing protocol. Sequence Detection The technology for the electrophoresis-based separation of Sanger sequenced DNA fragments and the detection of individual bases have advanced substantially in the past two decades. As a result, the throughput, detection sensitivity and costs of sequencing, as well as the accuracy and quality of the data generated by these sequencers, have improved significantly. The invention of multicolor fluorescent sequencing in 1986 (Smith et al., 1986) replaced laborious and hazardous radioisotope labeling techniques (Sanger et al., 1982) with automated signal detection, high-speed computational recording and signal processing. In the mid-1990s, the successful integration of the capillary-based
DNA Sequencing
T
I
R
■
91
O Consensus
reads
Figure 7.2 Display and detection of a single nucleotide polymorphism in a human genome assembly of a single DNA donor. Sequence reads overlap and can be assembled to form the consensus sequence of the contig where 5 of the 12 reads show a T variation at a C consensus base (variant position is highlighted with an arrow). T type of sequence read, either assembled contig or individual read, I unique identifier, R co-ordinates of alignment, O strand orientation that aligns.
electrophoresis and multicolor fluorescent detection (Bashkin et al., 1996a; Bashkin et al., 1996b; Behr et al., 1999) further enabled the creation of the true high-throughput sequencing machines. Today, state-of-the-art capillary-based sequencers have the ability to read over 1 million bp of sequence per 24-hour period with long read lengths (an average of 800 bp per read) and an average quality value (QV) of greater than 30 (99.9% accuracy). Whole Genome Assembly The whole genome assembly process utilizes several aspects of experimental design to ensure that unambiguous construction of long contiguous sequence can be generated. These include the generation of sequence reads from each end of clone inserts using universal primers pairs to ensure that a majority of inserts will have their ends sequenced. This enables both the generation of contiguous sequence and the potential for ordering and orienting larger sequence segments essentially created by sequence alignment of the individual reads whose mate pair relationship is known. A variety of genome assembler software packages have been designed with this basic rationale at their core (Celera Assembly (Myers et al., 2000), Phusion (Mullikin and Ning, 2003)). The Atlas assembler uses as input both BAC-based clone sequences and reads generated via the WGS strategy (Havlak et al., 2004), an approach that was used to assemble the rat genome (Gibbs et al., 2004). There are two basic steps to assembly: (i) the creation of unique regions of contiguous (contigs) assemblies from the sequence overlaps between reads and (ii) the sequential end-toend organization of contigs by employing the mate pair information contained in the reads constituting each contig.The successful application of this strategy was a significant challenge for the sequencing and assembly of human DNA since greater than 45%
of the genome sequence is repetitive in nature (Lander et al., 2001; Venter et al., 2001). The repetitive nature of the human and other mammalian genomes thus confounds the ability to determine accurate read sequence overlaps and the placement of contigs into the correct order and orientation. Our solution is to employ clone insert libraries of a size larger than the corresponding repeat regions, typically either fosmid or BAC libraries (40 kb and 100 kb respectively), thus enabling the spanning of repeats by assembling adjacent unique sequence (Venter et al., 2001). The selection of a range of insert sizes libraries is typically built into the experimental design when sequencing a new genome. Variant Detection Variants can be reliably detected from assembled sequence reads from either a single or multiple DNA donors in the region of unique genome sequence assembly. The example in Figure 7.2 show the reads assembled from a single human identifying a known SNP found in a coding region of the APOE gene and, as expected, displays a close to equiprobable occurrence of both allelic forms (five reads with the T allele and seven reads with C allele). This approach has been employed with success to identify 974,400 heterozygous variants in a single canine individual (Kirkness et al., 2003). The inclusion of random shotgun sequence reads from distinct DNA donors is another approach to employing genome assembly and read placement to identify polymorphic site. This approach was employed using five DNA donors in the initial shotgun sequence assembly of the human genome (Venter et al., 2001). An important factor in using a WGS assembly approach to detecting polymorphic sites is that enough sequence coverage from a single donor should be available. In this manner it is possible to provide a good quality assembly with associated consensus sequence and associated allelic variants. This means that
92
CHAPTER 7
■
DNA Sequencing for the Detection of Human Genome Variation and Polymorphism
additional donors can be included either post-assembly or as part of the assembly process if the species in question does not possess extensive interindividual polymorphism. Recent work attempting to detect both SNP and indels in an individual human sample suggests that detection of heterozygous variants, with 99% probability using a Sanger chemistry-based WGS approach, necessitates a variant being reported by 20 reads or more (Levy et al., 2007). Another challenge to polymorphism detection during genome assembly is the occurrence of many small heterozygous indel polymorphisms in human DNA, typically between 1–20 bp length (Ball et al., 2005). These indel events result in the occurrence of long and short alleles, requiring a methodology to accurately separate the reads from each allele, thereby providing an accurate representation of the indel. Subsequently, a practical decision needs to be made regarding which form will be represented in the consensus sequence. Typically this can be the “major” allele even though in a single donor this will be present effectively at equal probability. Lastly, a strategy needs to be implemented to best catalog the occurrence and type of the other allele or alleles, in the case of sequencing multiple DNA donor samples. An example of the detection and display of an indel from a genome assembly of a single DNA donor can be seen in Figure 7.3. The figure displays the consensus sequence (the contig sequence) containing the short, major, allele since gaps are included in the alignment, whilst the long form (inclusion of ATTCT) is found in two of the eight assembled reads. We recently applied a modified sequence assembly algorithm to detect heterozygous indels variants from the diploid genome sequence of an individual human (Levy et al., 2007). Using this approach we detected over 345,000 heterozygous
indels events of lengths 1–571 bp with at least 200 indels in protein-coding regions of genes. PCR Amplicon Re-Sequencing The goal of directed sequencing or re-sequencing is to use the existing sequence data from a particular genomic locus or loci to determine sequence variations of these same loci in different DNA samples. This provides the nature and the frequency of DNA polymorphisms and can be used to assess whether these variants could be implicated in, for example, either disease onset or progression. PCR re-sequencing methodology is widely accepted as the gold standard (Kwok and Chen, 2003) for the discovery of polymorphisms (both SNPs and indels) in targeted regions since it provides the most complete information including the genotype, the location and sequence context.This approach has been applied to understand sequence variation in populations (Crawford et al., 2004) and for the identification of variants and haplotypes that could potentially explain disease phenotypes in cancer (Cox et al., 2005). It combines the powerful PCR technique (Mullis et al., 1986) with the informative Sanger sequencing method (Sanger et al., 1977). Rapid advancement of the technologies and dramatic reduction in costs for direct sequencing of PCR products, as well as the completion of the reference human genomes in 2001 (Lander et al., 2001;Venter et al., 2001) has enabled the systematic large-scale discovery of polymorphisms in targeted regions. The approach generally involves PCR primer design, the amplification of the targeted region from whole genomic DNA using PCR, followed by Sanger sequencing and the analysis of the targeted sequences (Figure 7.4). T
I
R
O Consensus
reads
2 reads with 5 bp insertion ATTCT
Figure 7.3 Display and detection of a 5 bp insertion/deletion polymorphism in a human genome assembly of a single DNA donor. The consensus sequence displays the “short” allelic form whilst the “long” form (ATTCT) is found in 2 of the 8 sequence assembled reads. T type of sequence read, either assemblyed contig or individual read, I unique identifer, R co-ordinates of alignment, O strand orientation that aligns. “” gap inserted into read to permit correct alignment.
PCR primer design
Figure 7.4
PCR reaction and clean-up
PCR directed sequencing process flow.
Sequencing reaction and clean-up
Electrophoresis and trace analysis
DNA Sequencing
Primer Design The goal of directed sequencing is to provide reliable amplification of a single genomic region, and this is readily achieved via the design of unique pairs of oligomers for priming of the PCR. The resulting amplicon encompassing the targeted region can vary in size; however reproducible and robust assays range from 350 bp to 1000 bp in length per amplicon. In order to provide coverage of a larger genomic locus with this amplicon size, it becomes necessary to tile PCR products with some overlap (typically 100 bp) between amplicons. The primer design process attempts to generate oligomers for products with similar melting temperatures whilst avoiding primer-dimerization forming interfering secondary structures. Other factors to permit specific amplification would be the GC content of the region, the proximity of repeat regions, the uniqueness of the oligomer sequences in the genome and the absence of known DNA variants in the primer binding sites. These considerations can be embodied in a computational pipeline as illustrated in Figure 7.5. The sequence context, in this case percent GC quantity, are considered before identifying a candidate genomic region for primer design. In a high-throughput directed sequencing pipeline, it is frequently desirable to employ few, or preferable one, PCR protocol, and in our experience GC content allows a distinction between at least two desired PCR conditions (high and low GC content). Once the PRIMER3 software (Rozen and Skaletsky, 2000) generates candidate primer pairs for a desired target region, these are further
Gene/region of interest
Pre-Compute/Load GC content
Generate 1st tile
Decrease tile size change 5 tile position Fail Generate next tile position
Fail
Primer3
Critique primer pair for genome specificity. Check for: Homopolymers Known SNP/indels Repeat regions Pass Order primers
Figure 7.5 Primer design pipeline for PCR based directed sequencing.
■
93
assessed for their product uniqueness (low and high copy repeat regions) and to ensure that resulting amplicons contains only one type of complex sequence, i.e. homopolymer or known indel polymorphism. The last requirement is important since sequence detection is performed using capillary-based detection methods and chromatograms potentially contain a mixture of alternate alleles at any one mobility position. This renders the process of allelic definition potentially difficult in the case of multiple indel events in a single amplicon. PCR Set-Up and Clean-Up As described in the primer design section, due to the variable GC content within the genome, application of different PCR primer design criteria, as well as different PCR protocols might be necessary when targeting different regions. In our laboratory, we have two fully validated high-throughput amplification protocols (Rand et al., 2005), one for regions of normal GC content and another for amplicons with melting temperatures (Tm) 82°C. AmpliTaq Gold® DNA polymerase (Applied Biosystems) is used to perform hot-start PCR to minimize the formation of primer–dimers and spurious amplicons and ensure efficient, consistent and specific amplification results in a highthroughput environment. A Shrimp Alkaline Phosphatase/Exonuclease I (SAP/Exo I) mix is used to digest excess dNTPs and amplification primers. (While SAP is commonly used to remove 5 phosphate groups from DNA, it also degrades dNTPs.) The SAP/Exo I method provides an effective, single tube approach to PCR clean-up that is easily automated and scalable. Sequencing Reaction Set-Up and Clean-Up In order to simplify the downstream sequencing process, the PCR primers are frequently tagged with M13 forward and reverse sequences on their 5 ends. The resulting amplicons can then be sequenced using universal M13 sequencing primer sets and a standard sequencing protocol (see previous section). Reaction products can be purified with sodium acetate and ethanol prior to analysis on the sequencers. Electrophoresis and Trace Analysis The state-of-the-art technology for the electrophoresis-based separation of Sanger sequenced DNA fragments and the detection of individual bases can be accomplished by a combination of trace filtering, trace separation and mixed base calling. Electropherogram peaks are described as “mixed” in that a primary peak is detected with a secondary but significantly present sub-peak due to two distinct and different allelic forms contributing to the locus, We have developed a trace analysis pipeline that applies digital signal filters to reduce the occurrence of secondary, typically noisy, sequence peaks (Figure 7.6), thus enabling the accurate distinction of a single position with mixed peaks. This approach has enabled a substantial improvement of common artifacts observed in diploid genomic regions sequenced after PCR amplification that include dye-blob, PCR slippage
94
CHAPTER 7
■
DNA Sequencing for the Detection of Human Genome Variation and Polymorphism
Unfiltered trace
Filtered trace
Mixed base Figure 7.6 Sequence chromatograms of a locus revealing a SNP before and after digital signal filtering were applied to remove PCR slippage artifacts due to the presence of proximal (not seen) homopolymeric DNA.
and primer synthesis length errors, all of which confound mixed peak detection. After generating either a raw (unfiltered) or digital signal filtered chromatogram, it is then possible to identify mixed peaks that vary significantly in height, area and width to characterize departures from the reference sequence. This general approach can be performed with many different software tools in the public domain (Polyphred, Mutation Surveyor, novoSNP and SNPdetector, InSNP), the most extensively tested of which is Polyphred. In the most recent version of this software, detailed parameterization was performed using sequence reads from PCR-amplified diploid samples and a statistical framework was developed to report evidence of genotypes being called (Stephens et al., 2006). Heterozygous indel polymorphisms result in significant changes in the chromatograms from diploid DNA samples where typically traces contain mixed peaks from the point of the insertion/deletion onwards. One can develop computational methods that essentially separate the traces to enable the definition of the long and short alleles (Bhangale et al., 2005; Bhangale et al., 2006). Polyphred, InSNP (Manaster et al., 2005) and Mutation Surveyor are all capable of detecting heterozygous indels, but as of this writing an unbiased detailed comparison of performance has not yet been attempted. Sequencing by Synthesis Recently implemented sequencing by synthesis technologies can employ PCR (Mullis et al., 1986) to achieve clonal amplification of single DNA molecules. In addition, the sequencing by synthesis principle in a massively parallel format has been developed by Rothberg and colleagues (Margulies et al., 2005) and by Church and colleagues (Shendure et al., 2005). In each of these new technologies, different chemistries are applied to interrogate, tag and detect the sequence bases. For example, utilizing nanotechnology and pyrosequencing chemistry (Nyren et al., 1993), coupled with an enzymatic luminometric inorganic pyrophosphate detection technique, Rothberg and colleagues at 454 Life Sciences recently developed a novel and highly parallel system capable of sequencing approximately 100 million bp in a 7-hour period by sampling over 400,000 sequence clones. They were able to illustrate this technological
advance by sequencing and assembling the Mycoplasma genitalium and Streptococcus pneumoniae genomes (0.6 and 2.1 Mb respectively in size). Compared to the current state-of-the-art Sanger DNA sequencing and capillary-based electrophoresis platform, their system generates “raw” data with approximately 100 times higher throughput. However, compared to the Sanger/electrophoresis-based sequencing approach, it still has several limitations, such as the short sequence read lengths which, averaging 200–250 bases per read, are only a quarter of Sanger read lengths. Another problem is that the accuracy of base calling on individual reads is low especially for genomic regions containing homopolymers. This leads to the increased likelihood of spurious polymorphism detection. In addition, the 454 sequencing methodology currently does not incorporate the use of mate pairs that come from large insert libraries (10 kb), which for the successful assembly of repeat rich genomes, like those of most eukaryotes, are important in order to span large repeat regions. Nonetheless, this technology has been employed to generate some sequence data from DNA extracted from fossil remains (Green et al., 2006) and has been employed most recently to generate sixfold coverage of an individual genome (Wheeler et al., 2008). Some of these problems can be circumvented by adopting a hybrid sequencing approach, that is one that employs, in addition to the sequencing-by-synthesis approach (454), a small number of large insert clones (typically fosmid libraries) sequenced by Sanger technology to provide order and orientation in the final genome assembly.While this approach has been successfully employed for a number of bacterial genomes (Goldberg et al., 2006), establishing the most efficient mix of the sequence-by-synthesis contribution and fosmid libraries will be the challenge for the completion of larger mammalian genomes. Each of the sequencing-by-synthesis approaches mentioned above share similar advantages as well as limitations. However, they do show great promise for the discovery of polymorphisms in a rapid, cost effective manner. The massively parallel sequencing-by-synthesis approach enabled by the 454 picoliter plate technology can also be utilized for targeted polymorphism discovery. In this manner, PCR amplicons that contain the targeted genomic regions are extensively sampled since it is possible to generate currently up to 400,000 clonal amplifications of the PCR product per experiment. This
Other Methodologies for Polymorphism Detection
approach has potentially great value for the discovery and detection of DNA variations present in low frequency in the samples. This is especially applicable for mutation detection in tumor tissue, in which potentially only a few cells carry activating mutations and the surrounding cell population contributes to the “contamination” of the tumor DNA. The lowest detection level thus far obtained has been for alleles present in the sample as low as 9% for indel mutations in the EGFR gene in lung adenocarcinoma specimens (Thomas et al., 2006) and at a similar frequency for non-synonymous point mutations in the FGFR1 gene from glioblastoma samples (Strausberg et al., personal communication). In both of these cases the corresponding PCRamplified, Sanger sequence chromatograms lack the necessary secondary peak signal to detect a variant above the background noise level, where the limit of detection of a secondary allele is thought to be between 15–25%. Sequencing by Hybridization The sequencing by hybridization on microarray approach has also become a widely utilized tool for SNP detection and analysis (Kukita et al., 2005; Patil et al., 2001; Maraganore et al., 2005; Kozal, 1996; Cronin et al., 1996; Chee et al., 1996; Hacia et al., 1996; Kapranov et al., 2002) due to the technological advances that make the manufacturing of high-density DNA arrays possible The genomic DNA is generally prepared by PCR to amplify the regions of interest or by using a whole genome amplification methodology (Dean et al., 2002) and tagged with fluorescent dyes for detection prior to hybridizing to the oligonucleotide array. This approach allows the detection of sequence present in a test sample via direct hybridization and was able to detect the unique (non-repeat) contiguous sequence from human DNA chromosome 21 conserved with mouse and dog (Frazer et al., 2001) and the haplotype block structure of chromosome 21 in 24 human DNA samples (Patil et al., 2001). In both cases, this was accomplished by creating a tiling array where each position was interrogated for all possible bases using a set of eight distinct oligonucleotide probes comprising the position in question and 12 bp flanking to provide for binding specificity. The interspecies hybridization experiments revealed additional putative exonic region of current genes and identified other regions of significant sequence conservation that could represent other forms of functional sequence. The human population study revealed that 80% of the DNA samples could be described by three common haplotypes. Such knowledge of chromosomal block structure and inheritance patterns enables the choice of a smaller set of variants with which one can perform disease-gene association studies (see Chapter 8). The sequencing-by-hybrization approach clearly necessitates that all the sequence being tested is present on the microarray. While this may be a limitation for discovering novel sequence in some genomes, even for the multiple, yet to be sequenced, human genomes, the methodology is well suited and has been extensively employed for the genotyping of known SNPs in disease conditions and deletions in arrayed sequence. Recent whole genome association studies
■
95
employing large-scale SNP assays performed on microarrays have identified genes as either biomarkers or therapeutic targets in Parkinson’s disease (200,000 loci) (Maraganore et al., 2005) and copy number variations and deletions in mental retardation (100,000 loci) (Friedman et al., 2006).
OTHER METHODOLOGIES FOR POLYMORPHISM DETECTION Many other methodologies have been developed for genomewide discovery and characterization of DNA polymorphisms. Although each of the methodologies has their own unique capability in detecting various types of polymorphisms, they tend to have limited utility and are often employed at a lower scale than the approaches described above. Restriction Fragment Length Polymorphism One of the most widely utilized techniques is Restriction Fragment Length Polymorphisms (RFLP). RFLPs are polymorphisms in the lengths of particular restriction fragments. It was one of the first marker systems developed and has been used in creating genetic maps of chromosomes (Botstein et al., 1980; Rothschild et al., 1993), tracking changes occurring in the cancer genome (Jenkins et al., 2002), as well as serving as a starting points for chromosome walks (Bakker et al., 1985). The technology is based on using restriction endonucleases that cleave DNA at highly specific sites; therefore mutations or polymorphisms change the enzymatic activity at precise genomic loci. Thus by measuring the differences in length of the cleaved DNA strands (accomplished by electrophoresis separation and probing with cloned fragments of the genome), it is possible to monitor the presence of genetic variations. While reliable, this method’s detection power is limited to the small number of the bases around the restriction sites and can only detect the presence and not the identities of the sequences variations. In the simplest and most common type of RFLP, the polymorphism results from a single nucleotide difference that provides a recognition site for a restriction enzyme in one allelic form and not the other. A polymorphism of this type, once identified and characterized by sequencing, can be detected readily among a population of individuals in a targeted fashion by first amplifying the region around the polymorphic site from each sample, digesting the amplified material with the appropriate restriction enzyme and distinguishing the undigested PCR product from the smaller digested fragments by gel electrophoresis (Saiki et al., 1985). This PCR-based protocol provides results rapidly and is a significant improvement over the Southern blot alternative which requires blotting, probe labeling, hybridization and autoradiography. Simple Sequence Repeat Polymorphisms Simple Sequence Repeat Polymorphisms (SSRPs), generally referred to as “microsatellite” markers, are tandem repeats units of short sequences of nucleotides (generally 2–6 units in
96
CHAPTER 7
■
DNA Sequencing for the Detection of Human Genome Variation and Polymorphism
length). The detection of these markers can be readily accomplished by probing the genomic DNA libraries with synthetic oligonucleotides followed by confirmation sequencing of the positive clones with the Sanger technique. PCR primers can be designed to amplify the markers once the sequences flanking the SSRP markers are known. The variations in the length of PCR products can then be separated by electrophoresis and detected (Cornall et al., 1991; Love et al., 1990). This method is relatively easy to automate at a low cost for large scale genotyping. It can be very informative for fingerprinting applications; however, its utility is obviously limited to the detection of length polymorphisms. Chromatography or Electrophoresis Separation Methods for Conformation Polymorphism Detection The natural intra- or inter-molecular structural distortions (conformational changes) of the DNA double helix induced by mispairing of the bases could result in detectable or measurable physical properties changes that can be used to detect the presence and the approximate locations of polymorphisms. These detectable or measurable physical property changes include melting temperature and mobility shift. These methods generally utilize separation techniques such as electrophoresis or chromatography to resolve conformational differences among DNA samples. For example, the Denaturing Gradient Gel Electrophoresis (Myers et al., 1987; Burmeister et al., 1991) method takes advantage of the fact that the denaturing of double stranded DNA is highly dependent on its sequence. One base mispairing could induce conformational changes and alter the melting temperature of a DNA molecule, thereby suggesting the presence of an allelic variant in a particular locus. The DNA molecules with different melting temperatures show distinguishable motility shifts under denaturing conditions for electrophoresis. Other methods utilizing these properties include Denaturing High-Performance Liquid Chromatography (Huber et al., 2001; Underhill et al., 1997), Single Strand Conformation Polymorphism (Nataraj et al., 1999; Sheffield et al., 1993) and Heteroduplex Analysis (Nataraj et al., 1999; Thomas et al., 2001). The main drawback of these methods is that they can only be used to detect the presence and not the identities of the polymorphisms. Other Enzymatic and Chemical Cleavage Methods Cleavage techniques also exploit the conformational changes of the DNA double helix caused by mispairing of the bases. Several approaches utilizing the different chemical and enzymatic properties of the perfectly matched homoduplexes and the mismatched heteroduplexes for targeted polymorphism detection in specific regions have been developed. These methods include chemical cleavage of mismatched DNA (Curiel et al., 1990; Lambrinakos et al., 1999; Tabone et al., 2006), ribonuclease cleavage of mismatched DNA (Myers et al., 1985), cleavase fragment length polymorphism analysis (Nomura et al., 2001; Uehara et al., 1999), T4 Endonuclease VII cleavage of heteroduplex DNA (Babon et al., 1999), mismatch repair detection (Faham et al., 2005),
UNG-Mediated T-Sequencing (Hawkins and Hoffman, 1997), and RNA-mediated fingerprinting with MALDI MS Detection (Krebs et al., 2003; Stanssens et al., 2004). The detection sensitivity and accuracy of these techniques highly depend on the specificity and efficiency of the particular enzyme activity or chemical used. Ideally, little or no cleavage would be seen in a perfectly matched DNA fragment and a near complete digestion or modification of all distortions of the helix generated by base mismatches would be accomplished. However, none of the techniques mentioned above are able to detect both the matched and mismatched DNA states equally well. The utility and the choice of a technique becomes a balance between ease of use, sensitivity and specificity. Some of these methods can be applied directly to extracted genomic DNA if the quantity of the DNA is not an issue. However, PCR is commonly used to select the targeted genomic region and generate enough DNA material for the assays. The separation and detection of the digested or modified DNA fragments is generally accomplished through electrophoretic techniques.
FUTURE DIRECTIONS The generation of a first draft of the human genome sequence permitted the global identification of small scale variation such as SNPs and heterozygous indels. However, it also set the stage for detection of large scale genome-to-genome comparisons (Istrail et al., 2004). These and other such studies (Khaja et al., 2006) have identified large-scale difference (1–3 Mb ranges) between human individuals that have been confirmed in larger populations (Redon et al., 2006). Recent studies have also implicated such large scale polymorphism (copy number variants) in developmental verbal dyspraxia (Feuk et al., 2006) and Crohn’s disease (Fellermann et al., 2006), to name but a few. The complete human genome sequence has also fueled the development of newer sequencing technologies that employ short sequence reads but generate large quantities of sequence data at increasingly lower cost without the need of elaborate production facilities. These approaches that involved the interrogation of single DNA molecules via clonal amplication (454, Solexa, ABI-Agencourt) will permit de novo sequencing of distinct human individuals. However de novo human genome assembly is currently not attainable due to the short read length, the current absence of well developed methodologies for generating large insert clone mate sequence and computational approaches that account for both the error profile and the input characteristics of reads generated. These new sequencing platforms do permit the generation of sequence data from defined genomic regions of multiple individuals via PCR and other methods, to allow the targeted amplification or separation of genomic loci (Thomas et al., 2005;Thomas et al., 2006). The goal for sequencing in the next decade will be to achieve de novo assembly of multiple human genomes, since the mapping of newly generated reads to an existing human genome reference constrain new sequence data to an existing, likely unique, genome structure. Therefore establishing de novo assembly approaches in the analysis of future genomes will create
References
an unbiased understanding of human chromosomal structure and the underlying individual genotypes that define each person uniquely in the context of growing phenotypic information. Lastly, any human DNA variation will need to be interpreted in the context of the functional sequence that it potentially modulates. This appears to be an easier problem to address when considering a SNP in a protein-coding domain (Livingston et al., 2004; Ng and Henikoff, 2006), and computational prediction methods can rank which putative functional changes are to be tested experimentally. These prediction methods are less defined for polymorphisms larger than 1 bp (2–100 bp) that occur distal to known protein-coding domains, and clearly much of the impact on human disease may be modulated by these variants (Ball et al., 2005; Chuzhanova et al., 2003). More detailed recent analyses of 1% of the human genome sequence revealed that many transcriptional signals and cues reside in non-coding DNA and a more significant fraction of the human genome is functional than thought previously
■
97
(Birney et al., 2007). This indicates that a large amount of DNA variation that has been identified in non-coding DNA (Levy et al., 2007) may elicit a variety of phenotypic effects, whose elucidation will be a rich source for future research. Much of the human genetic and genomic research since the completion of the human genome sequence has led us to conclude that much still remains to be discovered in primary sequence space as a predecessor to our understanding of basic functional changes in cells and tissues. Such knowledge is crucial so that we can bring genomic data to bear in a useful manner to the medical context.
ACKNOWLEDGEMENTS The authors would like to thank Drs Robert Strausberg, Jiaqi Huang and Pauline Ng for their suggestions and critical review of the manuscript.
REFERENCES Babon, J.J., McKenzie, M. and Cotton, R.G. (1999). Mutation detection using fluorescent enzyme mismatch cleavage with T4 endonuclease VII. Electrophoresis 20, 1162–1170. Bakker, E., Hofker, M.H., Goor, N., Mandel, J.L., Wrogemann, K., Davies, K.E., L.M., K., Willard, H.F., Fenton, W.A., Sandkuyl, L. et al. (1985). Prenatal diagnosis and carrier detection of Duchenne muscular dystrophy with closely linked RFLPs. Lancet 1, 655–658. Ball, E.V., Stenson, P.D., Abeysinghe, S.S., Krawczak, M., Cooper, D.N. and Chuzhanova, N.A. (2005). Microdeletions and microinsertions causing human genetic disease: Common mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat 26, 205–213. Bashkin, J., Marsh, M., Barker, D. and Johnston, R. (1996a). DNA sequencing by capillary electrophoresis with a hydroxyethylcellulose sieving buffer. Appl Theor Electrophor 6, 23–28. Bashkin, J.S., Bartosiewicz, M., Roach, D., Leong, J., Barker, D. and Johnston, R. (1996b). Implementation of a capillary array electrophoresis instrument. J Capillary Electrophor 3, 61–68. Behr, S., Matzig, M., Levin, A., Eickhoff , H. and Heller, C. (1999). A fully automated multicapillary electrophoresis device for DNA analysis. Electrophoresis 20, 1492–1507. Bhangale, T.R., Rieder, M.J., Livingston, R.J. and Nickerson, D.A. (2005). Comprehensive identification and characterization of diallelic insertion-deletion polymorphisms in 330 human candidate genes. Hum Mol Genet 14, 59–69. Bhangale, T.R., Stephens, M. and Nickerson, D.A. (2006). Automating resequencing-based detection of insertion-deletion polymorphisms. Nat Genet 38, 1457–1462. Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E. et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816. Botstein, D., White, R., Skolnick, M. and Davis, R.W. (1980). Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32, 314–331.
Burmeister, M., diSibio, G., Cox, D.R. and Myers, R.M. (1991). Identification of polymorphisms by genomic denaturing gradient gel electrophoresis: Application to the proximal region of human chromosome 21. Nucleic Acid Res 19, 1475–1481. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S. and Fodor, S.P. (1996). Accessing genetic information with high-density DNA arrays. Science 274, 610–614. Chuzhanova, N.A., Anassis, E.J., Ball, E.V., Krawczak, M. and Cooper, D.N. (2003). Meta-analysis of indels causing human genetic disease: Mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat 21, 28–44. Consortium, The International HapMap (2005). A haplotype map of the human genome. Nature 437, 1299–1320. Cornall, R.J., Aitman, T.J., Hearne, C.M. and Todd, J.A. (1991). The generation of a library of PCR-analyzed microsatellite variants for genetic mapping of the mouse genome. Genomics 10, 874–881. Cox, D.G., Kraft, P., Hankinson, S.E. and Hunter, D.J. (2005). Haplotype analysis of common variants in the BRCA1 gene and risk of sporadic breast cancer. Breast Cancer Res 7, R171–R175. Crawford, D.C., Carlson, C.S., Rieder, M.J., Carrington, D.P., Yi, Q., Smith, J.D., Eberle, M.A., Kruglyak, L. and Nickerson, D.A. (2004). Haplotype diversity across 100 candidate genes for inflammation, lipid metabolism, and blood pressure regulation in two populations. Am J Hum Genet 74, 610–622. Cronin, M.T., Fucini, R.V., Kim, S.M., Masino, R.S., Wespi, R.M. and Miyada, C.G. (1996). Cystic fibrosis mutation detection by hybridization to light-generated DNA probe arrays. Hum Mutat 7, 244–255. Curiel, D.T., Buchhagen, D.L., Chiba, I., D’Amico, D., Takahashi, T. and Minna, J.D. (1990). A chemical mismatch cleavage method useful for the detection of point mutations in the p53 gene in lung cancer. Am J Respir Cell Mol Biol 3, 405–411. Davies, J., Alton, E. and Griesenbach, U. (2005). Cystic fibrosis modifier genes. J R Soc Med 98 Suppl 45, 47–54.
98
CHAPTER 7
■
DNA Sequencing for the Detection of Human Genome Variation and Polymorphism
Dean, F.B., Hosono, S., Fang, L., Wu, X., Faruqi, A.F., Bray-Ward, P., Sun, Z., Zong, Q., Du, Y., Du, J. et al. (2002). Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci USA 99, 5261–5266. Faham, M., Zheng, J., Moorhead, M., Fakhrai-Rad, H., Namsaraev, E., Wong, K., Wang, Z., Chow, S.G., Lee, L., Suyenaga, K. et al. (2005). Multiplexed variation scanning for 1,000 amplicons in hundreds of patients using mismatch repair detection (MRD) on tag arrays. Proc Natl Acad Sci USA 102, 14717–14722. Fan, J.B., Chee, M.S. and Gunderson, K.L. (2006). Highly parallel genomic assays. Nat Rev Genet 7, 632–644. Fellermann, K., Stange, D.E., Schaeffeler, E., Schmalzl, H., Wehkamp, J., Bevins, C.L., Reinisch, W., Teml, A., Schwab, M., Lichter, P. et al. (2006). A chromosome 8 gene-cluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon. Am J Hum Genet 79, 439–448. Feuk, L., Kalervo, A., Lipsanen-Nyman, M., Skaug, J., Nakabayashi, K., Finucane, B., Hartung, D., Innes, M., Kerem, B., Nowaczyk, M.J. et al. (2006). Absence of a paternally inherited FOXP2 gene in developmental verbal dyspraxia. Am J Hum Genet 79, 965–972. Fleischmann, R.D.,Adams, M.D.,White, O., Clayton, R.A., Kirkness, E.F., Kerlavage,A.R., Bult, C.J.,Tomb, J.F., Dougherty, B.A., Merrick, J.M. et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512. Frazer, K.A., Sheehan, J.B., Stokowski, R.P., Chen, X., Hosseini, R., Cheng, J.F., Fodor, S.P., Cox, D.R. and Patil, N. (2001). Evolutionarily conserved sequences on human chromosome 21. Genome Res 11, 1651–1659. Friedman, J.M., Baross, A., Delaney, A.D., Ally, A., Arbour, L., Asano, J., Bailey, D.K., Barber, S., Birch, P., Brown-John, M. et al. (2006). Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 79, 500–513. Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E. et al. (2004). Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521. Goldberg, S.M., Johnson, J., Busam, D., Feldblyum, T., Ferriera, S., Friedman, R., Halpern, A., Khouri, H., Kravitz, S.A., Lauro, F.M. et al. (2006). A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc Natl Acad Sci USA 103, 11240–11245. Green, R.E., Krause, J., Ptak, S.E., Briggs, A.W., Ronan, M.T., Simons, J.F., Du, L., Egholm, M., Rothberg, J.M., Paunovic, M. et al. (2006). Analysis of one million base pairs of Neanderthal DNA. Nature 444, 330–336. Hacia, J.G., Brody, L.C., Chee, M.S., Fodor, S.P. and Collins, F.S. (1996). Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-colour fluorescence analysis. Nat Genet 14, 441–447. Havlak, P., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Weinstock, G.M. and Gibbs, R.A. (2004). The Atlas genome assembly system. Genome Res 14, 721–732. Hawkins, G.A. and Hoffman, L.M. (1997). Base excision sequence scanning. Nat Biotech 15, 803–804. Huber, C.G., Premstaller, A., Xiao, W., Oberacher, H., Bonn, G.K. and Oefner, P.J. (2001). Mutation detection by capillary denaturing high-performance liquid chromatography using monolithic columns. J Biochem Biophys Meth 47, 5–19.
Istrail, S., Sutton, G.G., Florea, L., Halpern, A.L., Mobarry, C.M., Lippert, R., Walenz, B., Shatkay, H., Dew, I., Miller, J.R. et al. (2004). Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA 101, 1916–1921. Jenkins, G.J.,Williams, G.L., Beynon, J.,Ye, Z., Baxter, J.N. and Parry, J.M. (2002). Restriction enzymes in the analysis of genetic alterations responsible for cancer progression. Br J Surg 89, 8–20. Ju, J., Ruan, C., Fuller, C.W., Glazer, A.N. and Mathies, R.A. (1995). Fluorescence energy transfer dye-labeled primers for DNA sequencing and analysis. Proc Natl Acad Sci USA 92, 4347–4351. Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P. and Gingeras, T.R. (2002). Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–919. Khaja, R., Zhang, J., Macdonald, J.R., He, Y., Joseph-George, A.M., Wei, J., Rafiq, M.A., Qian, C., Shago, M., Pantano, L. et al. (2006). Genome assembly comparison identifies structural variants in the human genome. Nat Genet 38, 1413–1418. Kirkness, E.F., Bafna, V., Halpern, A.L., Levy, S., Remington, K., Rusch, D.B., Delcher, A.L., Pop, M., Wang, W., Fraser, C.M. et al. (2003). The dog genome: Survey sequencing and comparative analysis. Science 301, 1898–1903. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M., Mayne, S.T. et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389. Kozal, M.J. et al. (1996). Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nat Med 2, 753–759. Krebs, S., Medugorac, I., Seichter, D. and Förster, M. (2003). RNaseCut: A MALDI mass spectrometry-based method for SNP discovery. Nucleic Acids Res 31, e37. Kukita, Y., Miyatake, K., Stokowski, R., Hinds, D., Higasa, K.,Wake, N., Hirakawa, T., Kato, H., Matsuda, T., Pant, K. et al. (2005). Genomewide definitive haplotypes determined using a collection of complete hydatidiform moles. Genome Res 15, 1511–1518. Kwok, P.Y. and Chen, X. (2003). Detection of single nucleotide polymorphisms. Curr Issues Mol Biol 5, 43–60. Laitinen, T., Polvi, A., Rydman, P., Vendelin, J., Pulkkinen, V., Salmikangas, P., Makela, S., Rehn, M., Pirskanen, A., Rautanen, A. et al. (2004). Characterization of a common susceptibility locus for asthma-related traits. Science 304, 300–304. Lambrinakos, A., Humphrey, K.E., Babon, J.J., Ellis, T.P. and Cotton, R.G. (1999). Reactivity of potassium permanganate and tetraethylammonium chloride with mismatched bases and a simple mutation detection protocol. Nucleic Acids Res 27, 1866–1874. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. Lee, L.G., Spurgeon, S.L., Heiner, C.R., Benson, S.C., Rosenblum, B.B., Menchen, S.M., Graham, R.J., Constantinescu, A., Upadhya, K.G. and Cassel, J.M. (1997). New energy transfer dyes for DNA sequencing. Nucleic Acids Res 25, 2816–2822. Levy, S., Sutton, G., Ng, P.C., Feuk, L., Halpern, A.L., Walenz, B.P., Axelrod, N., Huang, J., Kirkness, E.F., Denisov, G. et al. (2007). The Diploid Genome Sequence of an Individual Human. PLoS Biol 5, e254.
References
Livingston, R.J., von Niederhausern, A., Jegga, A.G., Crawford, D.C., Carlson, C.S., Rieder, M.J., Gowrisankar, S., Aronow, B.J., Weiss, R.B. and Nickerson, D.A. (2004). Pattern of sequence variation across 213 environmental response genes. Genome Res 14, 1821–1831. Love, J.M., Knight, A.M., McAleer, M.A. and Todd, J.A. (1990). Towards construction of a high resolution map of the mouse genome using PCR-analysed microsatellites. Nucleic Acids Res 18, 4123–4130. Manaster, C., Zheng, W.,Teuber, M., Wachter, S., Doring, F., Schreiber, S. and Hampe, J. (2005). InSNP: A tool for automated detection and visualization of SNPs and InDels. Hum Mutat 26, 11–19. Maraganore, D.M., de Andrade, M., Lesnick, T.G., Strain, K.J., Farrer, M.J., Rocca, W.A., Pant, P.V., Frazer, K.A., Cox, D.R. and Ballinger, D.G. (2005). High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet 77, 685–693. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z. et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380. McCombie, W.R., Heiner, C., Kelley, J.M., Fitzgerald, M.G. and Gocayne, J.D. (1992). Rapid and reliable fluorescent cycle sequencing of double-stranded templates. DNA Seq 2, 289–296. Mullikin, J.C. and Ning, Z. (2003). The phusion assembler. Genome Res 13, 81–90. Mullis, K.B., Faloona, F.A., Scharf, S.J., Saiki, R.K., Horn, G.T. and Erlich, H.A. (1986). Specific enzymatic amplification of DNA in vitro: The polymerase chain reaction. Cold Spring Harbor Symp Quant Biol 51, 263–273. Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A. et al. (2000). A whole-genome assembly of Drosophila. Science 287, 2196–2204. Myers, R.M., Larin, Z. and Maniatis, T. (1985). Detection of single base substitutions by ribonuclease cleavage at mismatches in RNA: DNA duplexes. Science 230, 1242–1246. Myers, R.M., Maniatis, T. and Lerman, L.S. (1987). Detection and localization of single base pair changes by denaturing gradient gel electrophoresis. Meth Enzymol 155, 501. Nataraj, A.J., Olivos-Glander, I., Kusukawa, N. and Highsmith, W.E.J. (1999). Single-strand conformational polymorphism and heteroduplex analysis for gel-based mutation detection. Electrophoresis 20, 1177–1185. Ng, P.C. and Henikoff , S. (2006). Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7, 61–80. Nomura, Y., Kamimura, J., Yoshinaga, M., Nishi, J., Kono, Y. and Miyata, K. (2001). Cleavase fragment length polymorphism analysis in a family with long QT syndrome. Pediatr Cardiol Card Surg 17, 20–25. Nyren, P., Pettersson, B. and Uhlen, M. (1993). Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal Biochem 208, 171–175. Osoegawa, K., Mammoser, A.G., Wu, C., Frengen, E., Zeng, C., Catanese, J.J. and de Jong, P.J. (2001). A bacterial artificial chromosome library for sequencing the complete human genome. Genome Res 11, 483–496. Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P. et al. (2001). Blocks of limited haplotype diversity
■
99
revealed by high-resolution scanning of human chromosome 21. Science 294, 1719–1723. Prober, J.M., Trainor, G.L., Dam, R.J., Hobbs, F.W., Robertson, C.W., Zagursky, R.J., Cocuzza, A.J., Jensen, M.A. and Baumeister, K. (1987). A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238, 336–341. Rand, V., Huang, J., Stockwell, T., Ferriera, S., Buzko, O., Levy, S., Busam, D., Li, K., Edwards, J.B., Eberhart, C. et al. (2005). Sequence survey of receptor tyrosine kinases reveals mutations in glioblastomas. Proc Natl Acad Sci USA 102, 14344–14349. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W. et al. (2006). Global variation in copy number in the human genome. Nature 444, 444–454. Rothschild, C.B.,Akots, G., Hayworth, R., Pettenati, M.J.,Nagesh Rao, P., Wood, P., Stolz, F.-M., Hansmann, I., Serino, K.K.T.P., Fajans, S.S. et al. (1993). A genetic map of chromosome 20q12–13.1: Multiple highly polymorphic microsatellite and RFLP markers linked to the maturity-onset diabetes of the young (MODY) locus. Am J Hum Genet 52, 110–123. Rowntree, R.K. and Harris, A. (2003). The phenotypic consequences of CFTR mutations. Ann Hum Genet 67, 471–485. Rozen, S. and Skaletsky, H. (2000). Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132, 365–386. Saiki, R.K., Scharf, S., Faloona, F., Mullis, K.B., Horn, G.T., Erlich, H.A. and Arnheim, N. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230, 1350–1354. Sambrook, J., Fritsch, E.F. and Maniatis, T. (1989). Molecular Cloning: A Laboratory Manual, 2nd ed. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74, 5463–5467. Sanger, F., Coulson, A.R., Hong, G.F., Hill, D.F. and Petersen, G.B. (1982). Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162, 729–773. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. (2004). Large-scale copy number polymorphism in the human genome. Science 305, 525–528. Sheffield,V.C., Beck, J.S., Kwitek, A.E., Sandstrom, D.W. and Stone, E.M. (1993). The sensitivity of single-strand conformation polymorphism analysis for the detection of single base substitutions. Genomics 16, 325–332. Shendure, J., Mitra, R.D.,Varma, C. and Church, G.M. (2004). Advanced sequencing technologies: Methods and goals. Nat Rev Genet 5, 335–344. Shendure, J., Porreca, G.J., Reppas, N.B., Lin, X., McCutcheon, J.P., Rosenbaum, A.M., Wang, M.D., Zhang, K., Mitra, R.D. and Church, G.M. (2005). Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732. Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., Boutin, P., Vincent, D., Belisle, A., Hadjadj, S. et al. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, 881–885. Smith, L.M., Sanders, J.Z., Kaiser, R.J., Hughes, P., Dodd, C., Connell, C.R., Heiner, C., Kent, S.B. and Hood, L.E. (1986). Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–679.
100
CHAPTER 7
■
DNA Sequencing for the Detection of Human Genome Variation and Polymorphism
Stanssens, P., Zabeau, M., Meersseman, G., Remes, G., Gansemans, Y., Storm, N., Hartmer, R., Honisch, C., Rodi, C.P., Böcker, S. et al. (2004). High-throughput MALDI-TOF discovery of genomic sequence polymorphisms. Genome Res 14, 126–133. Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P. and Nickerson, D.A. (2006). Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat Genet 38, 375–381. Strausberg, R.L., Simpson, A.J. and Wooster, R. (2003). Sequence-based cancer genomics: Progress, lessons and opportunities. Nat Rev Genet 4, 409–418. Tabone, T., Sallmann, G., Webb, E. and Cotton, R.G. (2006). Detection of 100% of mutations in 124 individuals using a standard UV/Vis microplate reader: A novel concept for mutation scanning. Nucleic Acids Res 34, e45. Thomas, G.A., Williams, D.L. and Soper, S.A. (2001). Capillary electrophoresis-based heteroduplex analysis with a universal heteroduplex generator for detection of point mutations associated with Rifampin resistance in tuberculosis. Clin Chem 47, 1195–1203. Thomas, R.K., Greulich, H., Yuza, Y., Lee, J.C., Tengs, T., Feng, W., Chen, T.H., Nickerson, E., Simons, J., Egholm, M. et al. (2005). Detection of oncogenic mutations in the EGFR gene in lung adenocarcinoma with differential sensitivity to EGFR tyrosine kinase inhibitors. Cold Spring Harb Symp Quant Biol 70, 73–81. Thomas, R.K., Nickerson, E., Simons, J.F., Janne, P.A., Tengs, T., Yuza, Y., Garraway, L.A., LaFramboise, T., Lee, J.C., Shah, K.
et al. (2006). Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nat Med 12, 852–855. Uehara, H., Kawano, H., Kataoka, H., Sameshima, T., Moriyama, T., Nakano, S. and Wakisaka, S. (1999). Mutational analysis of human p53 gene in human gliomas by Cleavase® fragment length polymorphism. Neuropathology 19, 267. Underhill, P.A., Jin, L., Lin, A.A., Mehdi, S.Q., Jenkins, T., Vollrath, D., Davis, R.W., Cavalli-Sforza, L.L. and Pefner, P.J. (1997). Detection of numerous Y chromosome biallelic polymorphisms by denaturing high-performance liquid chromatography. Genome Res 7, 996–1005. Venter, J.C., Smith, H.O. and Hood, L. (1996). A new strategy for genome sequencing. Nature 381, 364–366. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001). The sequence of the human genome. Science 291, 1304–1351. Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y. -J., Makhijani, V., Roth, G.T. et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876.
CHAPTER
8 Genome-Wide Association Studies and Genotyping Technologies Kevin V. Shianna
INTRODUCTION Since the completion of the International HapMap Project (International HapMap Consortium, 2005; The International HapMap Consortium, 2003), a wealth of genetic information within the human genome has been uncovered. Over 12 million single nucleotide polymorphisms (SNPs) have been identified (Wheeler et al., 2007). This dense map of common genetic variation has revolutionized the field of human genetics. Genomewide linkage scans can now be performed using efficient and cost-effective high-throughput SNP genotyping technologies, while at the same time achieving a fine map at a density not previously possible when performing earlier-generation genome scans based on microsatellite polymorphisms. The SNP map, along with recent technological gains, has also made it possible to comprehensively study complex diseases and traits by allowing for the genotyping of hundreds of thousands of SNPs for many samples. Also, with the identification of a specific type of SNP, called tagging SNPs, almost complete genomic coverage of common variation can be attained. This chapter will provide an introduction to genome-wide association studies, followed by an overview of two commonly used genome-wide genotyping platforms, the Illumina Infinium Assay and the Affymetrix GeneChip. The methods and concepts discussed here complement those described elsewhere in this book.
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
PRINCIPLES OF GENOME-WIDE ASSOCIATION STUDIES Before performing a genome-wide association study, it is essential to begin with a fully developed plan.The key issues to consider are the kind of study to perform, the required sample size, the importance of a well-defined phenotype, the need to correct for population stratification and the technologies available for genome-wide genotyping. This section will provide an introduction to the basic experimental design for a genome-wide association study. Type of Study First, the study should be focused on a key question based on a specific phenotype or disease. As an example, a study could be established to identify genetic variants associated with epilepsy. The key would be to collect a large cohort of individuals that have a definitive and well-defined diagnosis of epilepsy (cases) and compare these individuals to a group that does not have epilepsy (controls). In the past it was always necessary to collect matched controls, but with an abundance of control samples containing data from the various genome-wide array-based studies, it is now possible to use control genotyping data from the available databases. One example is the database made available by Illumina (iControlDB) where individual users have deposited genotyping data for control samples from previously processed samples. It has been demonstrated that the use of general control
Copyright © 2009, Elsevier Inc. All rights reserved. 101
102
CHAPTER 8
TABLE 8.1
■
Genome-Wide Association Studies and Genotyping Technologies
Examples of successful genome-wide association studies
Disease
Chip type
Gene/locus
HIV
Illumina HumanHap 550
HLA-C, HCP5/HLA-B*5701
486
(Fellay et al., 2007)
Type 2 Diabetes
Illumina HumanHap 300
TCF7L2, SLC30A8, HHEX, FTO, PPARG and KCNJ11
1161/1174
(Scott et al., 2007)
Affymetrix Human Mapping 500
CDKN2A, CDKN2B, IGF2BP2, CDKAL1
1464/1467
(Saxena et al., 2007)
Illumina HumanHap 300
TCF7L2, AHI1-LOC441171 region
500/497
(Salonen et al., 2007)
Illumina HumanHap 300
TCF7L2, IDE–KIF11–HHEX and EXT2–ALX4
1363
(Sladek et al., 2007)
Illumina HumanHap 550
KIAA0350
563/1146
(Hakonarson et al., 2007)
Affymetrix Human Mapping 500
Chromosome regions 18q22, 12q24, 12q13, 16p13 and 18p11
4000/5000
( Todd et al., 2007)
Affymetrix Human Mapping 100
IFIH1
2029/1775
(Smyth et al., 2006)
Restless leg syndrome
Affymetrix Human Mapping 500
MEIS1, BTBD9
401/1644
( Winkelmann et al., 2007)
Colorectal cancer
Illumina HumanHap 550
8q24.21 locus
930/960
( Tomlinson et al., 2007)
Affymetrix Human Mapping 100
8q24 locus
1257/1336
(Zanke et al., 2007)
Illumina HumanHap 300
TCF2
1501/11290
(Gudmundsson et al., 2007)
Illumina HumanHap 300
8q24
1453/3064
(Gudmundsson et al., 2007)
Illumina HumanHap 300
8q24
1172/1157
( Yeager et al., 2007)
Illumina HumanHap 300
Chromosome region 4q25 near PITX2
550/4476
(Gudbjartsson et al., 2007)
Type 1 Diabetes
Prostate cancer
Atrial fibrillation
Sample size (cases/controls)
Reference
Celiac disease
Illumina HumanHap 300
IL2, IL21
778/1422
(van Heel et al., 2007)
Crohn’s disease
Illumina HumanHap 300
IL23R
2877/1345
( Tremelling et al., 2007)
Illumina HumanHap 300
IL23R
567/571
(Duerr et al., 2006)
Macular degeneration
Affymetrix Human Mapping 100
CFH
96/50
(Klein et al., 2005)
Asthma
Illumina HumanHap 300
ORMDL3
994/1243
(Moffatt et al., 2007)
samples across studies is feasible (Wellcome Trust Case Control Consortium, 2007).This greatly decreases the cost to run a project and potentially allows for the inclusion of more cases (if available). A study could also be arranged to look at a specific continuous phenotype or quantitative trait within a group of individuals. Using the previous example, one could study how individuals with epilepsy responded to a specific drug or series of drugs. Sample Size The sample sizes required for a genome-wide association study are dependent on several factors. The first is the effect size of an unknown genetic variable. Of course, this is impossible to know in advance; however, power calculations can be performed to determine how many samples would be required to identify a variant with a specific effect size. These calculations can serve as a guide when deciding how many samples to include in the study. Purcell and colleagues (2003) have created a web-based genetic power calculator (http://pngu.mgh.harvard.edu/~purcell/gpc/) that can be used to determine power of a study. Another factor
that is important when determining sample size is the type of study: case/control or a continuous phenotype. The current view for case/control studies is that large sample sizes are needed (1,000 of each), but this is not always true. A few studies have identified associated variants with only 150–200 samples (Table 8.1). Again, it is impossible to know the effect size of an unknown variant so it is best to include a large number of samples when designing a study. For a continuous phenotype, studies have had success while using fewer samples (Table 8.1). Importance of Phenotype The most important aspect when designing a genome-wide association study is the quality and discrete nature of the phenotype. It is essential that the studied phenotype is evaluated carefully for accuracy and is well defined across all samples. The inclusion of incorrectly phenotyped samples in a study could lead to a decrease in the statistical power to identify associated variants. This could consist of control samples that are in fact cases, but have yet to be identified with the specific phenotype/diagnosis or a misdiagnosis
Platform Overview
of an individual within the cases. For some complex diseases, the phenotypic characteristics overlap with other closely related diseases. This can make it difficult when choosing which samples should be included in the study and, because of genetic heterogeneity, will decrease the power of the study. Population Stratification When analyzing the data from genome-wide association studies, it is essential to evaluate and account for population stratification between cases and controls. Cryptic population structure can generate false positive and/or negative associations. This can happen if allele frequencies are different in any unknown subgroups. One solution to correct for this difference in ancestry is described by Price and colleagues (2006). Left uncorrected, population stratification can lead directly to false associations.
Address
■
103
Probe
Bead
Figure 8.1 An example of a bead on the Illumina Infinium BeadChip. A schematic of a single bead, containing a 75nucleotide oligomer consisting of an address and a specific probe sequence, as described in the text.
Genomic DNA A T Extended nucleotides emit signal
Bead 1
Genomic DNA
Genotyping Technologies, Tagging SNPs and Genomic Coverage Another necessity for genome-wide association studies is to have a method to genotype hundreds of thousands of genetic variants around the genome. Fortunately, there are now standard products that allow for screening of very large numbers of SNPs. Depending on the chosen platform, the SNPs on the arrays have been selected based either on even spacing or on the basis of certain SNPs being able to represent multiple SNPs.The SNPs in the second approach are termed “tagging SNPs” and greatly increase the genomic coverage that can be obtained compared to SNPs that are evenly spaced across the genome (Barrett and Cardon, 2006; Pe’er et al., 2006). Reflecting now the well-established patterns of linkage disequilibrium (Hinds et al., 2005; International HapMap Consortium, 2005), these tagging SNPs work by acting as proxies for other nearby associated SNPs, making it possible to genotype only the tagging SNP and yet capture the content of other associated SNPs in the region. There are currently commercially available arrays (SNP chips) that contain close to 1 million tagging SNPs, representing most of the common variation in the human genome. Products where the SNPs have not been specifically chosen based on the close association with other SNPs will have much less genomic coverage, because the SNPs have not been optimally chosen for the ability to tag. For example, Illumina’s HumanHap300 array, containing around 300,000 tagging SNPs, has essentially the same statistical genomic coverage as Affymetrix’s Mapping 500 K Array Set, which contains 500,000 evenly spaced SNPs (Barrett and Cardon, 2006; Pe’er et al., 2006). Further background information on genome-wide association studies and tagging SNPs has been elegantly presented elsewhere (Hirschhorn and Daly, 2005).
PLATFORM OVERVIEW Illumina Infinium Assay The Illumina genotyping technology is based on 3 silicon beads that have hundreds of thousands of DNA oligomers attached to
A
(a)
G
Bead 2
Genomic DNA A T*
(b)
Bead
G
Figure 8.2 Infinium I and II assay. (a) A graphical representation of the Infinium I assay. This assay uses an ASPE and requires two bead types to detect a single SNP. (b) A depiction of an SBE reaction as used for the Infinium II assay. Only one bead type is required to detect a single SNP. The diagram shows a single dideoxy base (T) being added to the 50-base oligomer on the bead.
the surface (Shen et al., 2005). For Illumina’s Infinium wholegenome genotyping BeadChip, the attached oligomers consist of a stretch of 75 nucleotides (Gunderson et al., 2005; Steemers et al., 2006) (Figure 8.1). The first 25 nucleotides (the address) serve as a barcode to identify the location of specific beads on the array, and the next 50 bases (the probe) are complementary to a genomic region containing a specific SNP (or copy number variant). On average, there is an 18-fold redundancy of each bead type on the BeadChip. The placement of the beads on the BeadChip is a random process, thus requiring a decoding step. The decoding involves multiple hybridizations to the 25 oligomer address. This not only identifies the location of each bead, but performs a functional quality control for each bead on the BeadChip. There are two Infinium chemistries, type I and II (Figure 8.2). The Infinium I assay uses an allele-specific primer extension (ASPE) reaction in which fluorescently labeled nucleotides are inserted at the 3 end of the primer. With this assay it takes two bead types to determine the genotype at a single SNP. The Infinium II assay involves a single-base extension (SBE) reaction that occurs on a single bead type for each SNP. Most of Illumina’s off-the-shelf BeadChips use the Infinium II assay.
104
CHAPTER 8
■
Genome-Wide Association Studies and Genotyping Technologies
rs2381194
Restriction site
Restriction site
6000
Intensity (B)
Enzyme digestion 4000
2000
Adapter ligation 0
(a)
0
2000
4000 6000 Intensity (A)
8000
10000
PCR
rs2381194 1.20 1
Fragment and label
Norm R
0.80 0.60 0.40 0.20 0 0.20 (b)
26 0
227 0.20
0.40 0.60 Norm Theta
696 0.80
1
Figure 8.3 Raw and normalized data for SNP genotyping, using the Illumina platform.(a) Raw data from genotype analysis of one SNP (rs2381194) from 949 individuals. Each dot corresponds to a single individual. Blue, homozygote for one allele (n 696); Purple, heterozygous (n 227); Red, homozygote for other allele (n 26). (b) Same SNP and individuals as in (a), but after normalization using Illumina’s BeadStudio software.
The steps for the Infinium II assay are as follows: (1) 750 ng of intact genomic DNA is whole-genome amplified (WGA); (2) the amplified product is fragmented; (3) the fragmented DNA is hybridized to a BeadChip; (4) an SBE reaction is performed directly on the chip with the insertion of a labeled (2,4-dinitrophenol or biotin) dideoxy base; (5) the inserted base is stained with a fluorescently labeled antibody and (6) fluorescence on the beads on the BeadChip is detected by scanning with the Illumina BeadStation. An average of the raw intensity per bead type is then imported into the Illumina BeadStudio software, where the data can be efficiently managed and individual SNPs can be automatically or manually called. The quality of the raw and normalized genotype calls can be visually confirmed within this software (Figure 8.3).
Figure 8.4 Overview of DNA processing steps for the AffymetrixGeneChip. The diagram demonstrates the steps implemented to reduce the genomic complexity allowing for efficient allele-specific hybridization.
Affymetrix GeneChip The Affymetrix genotyping GeneChip is manufactured using the same principles as for the expression GeneChip, using a technology called photolithography. This has been described in great detail (Dalma-Weiszhausz et al., 2006; Matsuzaki et al., 2004), including elsewhere in this book. However, the genotyping GeneChip contains additional probes for each interrogated SNP, to allow for differentiation of true signal over noise (DalmaWeiszhausz et al., 2006). The steps required to process a DNA sample in preparation for hybridization to the GeneChip lead to a reduction in genome complexity allowing for allele-specific hybridization (Figure 8.4). The following is the procedure for processing a sample: (1) perform separate genomic DNA (250 ng) digestions using the NspI and StyI restriction enzymes; (2) ligate specific adapters containing a single universal primer binding site to the digested DNA; (3) pool the DNA fragments from the separate digestions and amplify using a single universal PCR primer specific to the primer site within the adaptor; (4) fragment the amplified products; (5) end label the DNA; (6) hybridize to a GeneChip and (7) scan the chip. The raw data can then be converted into genotype calls by using the appropriate algorithms (from Affymetrix).
Platform Overview
Platform Comparison Numerous laboratories worldwide have successfully used both the Illumina and Affymetrix platforms to identify genetic variants involved with complex disease (Table 8.1). This section will describe potential benefits of each platform, paying particular attention to (1) differences in the basic design of the chips; (2) statistical tagging power (genomic coverage); (3) ability to detect copy number variation (CNV); (4) processing issues and (5) overall quality of the final raw genotyping data. The length of the probes to detect an SNP during hybridization constitutes one of the major differences between the platforms. The probe length on the Affymetrix GeneChip is 25 nucleotides, while on the Illumina BeadChip the probes contain 50 nucleotides. This difference in length for hybridization provides for higher specificity for probes on the Illumina BeadChip. The other main difference is the content of the probe for each SNP. For Affymetrix, each SNP is represented by a series of probes consisting of mismatches and perfect homology, allowing for differentiation between signal and noise. The Illumina BeadChips do not require specific mismatch probes for every SNP because of the increased length of the probes for hybridization, but contain control beads that detect the overall level of background noise. The probe length constitutes the major difference between the two platforms and is most likely directly linked to the differences in quality between the platforms. Most of the SNPs on the Illumina Infinium BeadChips were selected as tagging SNPs, resulting in excellent genomic coverage for most populations (based on HapMap populations) (International HapMap Consortium, 2005). On the other hand, the Affymetrix GeneChips are based on SNPs that are evenly spaced throughout the genome. The positive aspect of even spacing is better physical coverage for detecting CNVs (Redon et al., 2006). One disadvantage of this approach is less statistical power to capture the known common variation when compared to arrays that contain tagging SNPs. A multimarker or haplotype-tagging approach may gain increased power/coverage, but the downstream statistical analyses become more difficult. Multiple groups (Barrett and Cardon, 2006; Magi et al., 2007; Pe’er et al., 2006) have performed statistical comparisons of the Illumina HumanHap300 and the HumanHap550 BeadChips with the Affymetrix Mapping 500 K Set. From these analyses it is evident that the Illumina array with only around 300,000 tagging SNPs has essentially the same statistical genomic coverage as the Affymetrix Mapping 500 K Set, while the Illumina HumanHap550 array is statistically superior for genome-wide association studies. Overall, both platforms provide good quality results (Table 8.1). However, a comparison of the SNP conversion rate (percentage of SNPs that work) between the two technologies reveals a significant difference. In our hands, for example, when using the Illumina pre-defined cluster, an average sample call rate of 99.6% is achieved, with an average reproducibility of over 99.99%. With the Affymetrix platform, the average sample call rate is variable and depends greatly on the chosen stringency within the calling algorithm, but a typical call rate would be in the range of 95–98% (personal communications). As discussed above, the increased SNP conversion rate with the Illumina BeadChip is most likely related
■
105
to the size of the probe (50 nucleotides) which increases the specificity during the hybridization step. Recently, there has been a push to detect CNVs throughout the human genome by using genome-wide genotyping chips (Komura et al., 2006; Redon et al., 2006) (see Chapter 9). Using these SNP-based arrays is an efficient method to detect large deletions or duplications. The Affymetrix GeneChip with its even spacing of SNPs has great genomic coverage to detect large CNVs. Most of the BeadChips from Illumina have specific SNPs or non-polymorphic probes added in regions where there are known CNVs. In general, both arrays are capable of detecting large CNVs. However, with the current SNP densities, the detection of small CNVs is difficult with the use of SNP-based arrays. Raw Data Quality Control Genetic association studies using genome-wide genotyping chips produce very large amounts of data. A typical project with 2,000 samples, each containing over 500,000 SNPs per sample, will result in over one billion data points. With this much data, the probability for false positives becomes much greater. A few reasons that can attribute to this are samples being prepared using different purification methods and the processing of the chips in different labs. These slight variations can lead to minor differences in intensity, thus making it more difficult to accurately call the genotypes. To prevent false positives from clouding the results from the statistical analyses, it is of utmost importance to begin with a clean or curated data set. This is accomplished by establishing a high threshold of sample and SNP quality. The number of SNPs that are obtained for each sample (the “call rate”) and the raw fluorescent intensity are the main variables that can be analyzed to determine whether a sample should be included or excluded. If any sample’s call rate is much lower (1–2%) than the average call rate for the project, the sample should be excluded. A 1% difference can represent as many as 5,000 fewer SNPs being called in a sample when compared with the average number of SNP calls in a population of samples. There are many variables that can be used to assess SNP quality. Metrics such as call frequency (number of samples that are being called for an SNP), cluster separation (degree of split between the heterozygote grouping and homozygotes, e.g. Figure 8.3), extreme deviation from Hardy–Weinberg proportions and overall raw signal intensity should be evaluated for all SNPs. A very aggressive approach to maintaining a clean data set is to adhere to a rigorous level of acceptable SNP call frequency. We have established a high threshold for SNP inclusion and have coined the phrase the “1% rule”; we delete any SNP that has greater than 1% of the samples not being called or called ambiguously. This rule was established from personal experiences when attempting to extract as much genotype data as possible. After performing statistical analyses, we noticed a large number of false positives due entirely to poor genotyping calls. For the 1% rule to work effectively, each new experiment needs to be reclustered using its own raw data. With the 1% rule, the use of predetermined cluster files will not work, because slight variations from experiment to experiment can result in slight shifts of intensity and cluster separation (i.e., separation of heterozygote
106
CHAPTER 8
■
Genome-Wide Association Studies and Genotyping Technologies
and homozygote clusters), resulting in incomplete calling of truly callable samples for a large number of SNPs. When applying the 1% rule, it is a balance between including a greater number of samples or a greater number of SNPs. Attempting to keep good but lower quality samples can result in the deletion of many SNPs that fail to meet the 1% rule standard. By deleting a few extra samples across the entire data set, it is possible to keep many of the SNPs that would have violated the 1% rule. In practice, in our experience, a typical project will have between 1% and 3% samples deleted (not all because of 1% rule) across all SNPs and between 1% and 2% of SNPs deleted because of violating the 1% rule. We have successfully used this approach for multiple studies (Fellay et al., 2007). After completing the first pass of statistical tests, the SNPs with the lowest 100–200 p-values should be manually checked for genotyping data quality. Visualization of the raw/normalized genotype call can increase one’s confidence in any potentially associated SNP with a low p-value. Inspecting the data is especially important if a data set has not been stringently curated under the 1% rule. For a given SNP, if only 95% of samples are called, over 50 samples from a sample set of 1,000 samples will not have a genotype call. Across a very large data set (over 500,000 SNPs), this loosened threshold can result in many false positives solely because of the increased opportunity for an imbalance of the case/control ratio in the uncalled samples. Sample Collection, Processing and Throughput It is now feasible for institutional core facilities or small- to medium-sized labs to run hundreds of samples per week using genome-wide genotyping chips. For each platform described here, there are off-the-shelf kits allowing for easy processing of many chips/samples at once with minimal variation across experiments. Also, with the implementation of automation developed specifically for each assay, sample throughput and tracking have all been simplified.
Genome-wide genotyping chips require less than 1 g of DNA, a very low amount considering the amount of data generated from this sample. In general, high-quality genomic DNA should be used. DNA of lesser quality or WGA DNA can be used, but will result in a lower sample call rate. The Affymetrix platform is more amenable to the use of WGA DNA, because of the lack of an initial WGA step as in the Illumina Infinium assay. However, WGA DNA will work for the Infinium platform, although lower call rates will be achieved (personal communication). Formalin-fixed, paraffin-embedded (FFPE) tissue samples represent a rich and abundant source of well-annotated material, but DNA isolated from FFPE samples using currently available methods is not amenable to genome-wide genotyping assays. The main reason for this is that the interaction of the formalin with the DNA prevents efficient enzymatic amplification of the DNA. Creating methods to successfully perform whole-genome genotyping on FFPE samples will be of great importance. This will benefit current projects by allowing for the potential increase in sample sizes. Also, precious samples obtained only via FFPE tissues could then be included in genome-wide association studies.
CONCLUSIONS This chapter has provided a basic overview of genome-wide association studies and the current technologies used for genome-wide genotyping. Since the completion of the International HapMap Project in October 2005 (International HapMap Consortium, 2005), there have been great technological advances that now allow for the genotyping of over one million SNPs within many individual samples in a matter of days. The ability to achieve genomewide SNP coverage has lead to the identification of multiple variants associated with disease (Table 8.1). These sorts of studies will continue in the coming years until it becomes feasible to simply and cost-effectively sequence the entire genome of individuals.
REFERENCES Barrett, J.C. and Cardon, L.R. (2006). Evaluating coverage of genomewide association studies. Nat Genet 38, 659–662. Dalma-Weiszhausz, D.D.,Warrington, J.,Tanimoto, E.Y. and Miyada, C.G. (2006). The affymetrix GeneChip platform: an overview. Methods Enzymol 410, 3–28. Duerr, R.H., Taylor, K.D., Brant, S.R., Rioux, J.D., Silverberg, M.S., Daly, M.J., Steinhart, A.H., Abraham, C., Regueiro, M., Griffiths, A., Dassopoulos, T., Bitton, A., Yang, H., Targan, S., Datta, L.W., Kistner, E.O., Schumm, L.P., Lee, A.T., Gregersen, P.K., Barmada, M.M., Rotter, J.I., Nicolae, D.L. and Cho, J.H. (2006). A genomewide association study identifies IL23R as an inflammatory bowel disease gene. Science 314, 1461–1463. Fellay, J., Shianna, K.V., Ge, D., Colombo, S., Ledergerber, B., Weale, M., Zhang, K., Gumbs, C., Castagna, A., Cossarizza, A. et al. (2007). A Whole-Genome Association Study of Major Determinants for Host Control of HIV-1. Science 317, 944–947. Gudbjartsson, D.F., Arnar, D.O., Helgadottir, A., Gretarsdottir, S., Holm, H., Sigurdsson, A., Jonasdottir, A., Baker, A., Thorleifsson, G., Kristjansson, K. et al. (2007). Variants conferring risk of atrial fibrillation on chromosome 4q25. Nature 448, 353–357.
Gudmundsson, J., Sulem, P., Manolescu, A., Amundadottir, L.T., Gudbjartsson, D., Helgason, A., Rafnar, T., Bergthorsson, J.T., Agnarsson, B.A., Baker, A. et al. (2007). Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 39, 631–637. Gudmundsson, J., Sulem, P., Steinthorsdottir, V., Bergthorsson, J.T., Thorleifsson, G., Manolescu, A., Rafnar, T., Gudbjartsson, D., Agnarsson, B.A., Baker, A. et al. (2007). Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat Genet 39, 977–983. Gunderson, K.L., Steemers, F.J., Lee, G., Mendoza, L.G. and Chee, M.S. (2005). A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 37, 549–554. Hakonarson, H., Grant, S.F., Bradfield, J.P., Marchand, L., Kim, C.E., Glessner, J.T., Grabs, R., Casalunovo, T., Taback, S.P., Frackelton, E.C. et al. (2007). A genome-wide association study identifies KIAA0350 as a type 1 diabetes gene. Nature 448, 591–594. Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A. and Cox, D.R. (2005). Whole-genome
References
patterns of common DNA variation in three human populations. Science 307, 1072–1079. Hirschhorn, J.N. and Daly, M.J. (2005). Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6, 95–108. International HapMap Consortium (2003). The International HapMap Project. Nature 426, 789–796. International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–1320. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M., Mayne, S.T. et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389. Komura, D., Shen, F., Ishikawa, S., Fitch, K.R., Chen, W., Zhang, J., Liu, G., Ihara, S., Nakamura, H., Hurles, M.E. et al. (2006). Genomewide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res 16, 1575–1584. Magi, R., Pfeufer, A., Nelis, M., Montpetit, A., Metspalu, A. and Remm, M. (2007). Evaluating the performance of commercial whole-genome marker sets for capturing common genetic variation. BMC Genomics 8, 159. Matsuzaki, H., Loi, H., Dong, S., Tsai, Y.Y., Fang, J., Law, J., Di, X. et al. (2004). Parallel genotyping of over 10,000 SNPs using a oneprimer assay on a high-density oligonucleotide array. Genome Res 14, 414–425. Moffatt, M.F., Kabesch, M., Liang, L., Dixon, A.L., Strachan, D., Heath, S., Depner, M., von Berg, A., Bufe, A., Rietschel, E. et al. (2007). Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448, 470–473. Pe’er, I., de Bakker, P.I., Maller, J., Yelensky, R., Altshuler, D. and Daly, M.J. (2006). Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet 38, 663–667. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A. and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–909. Purcell, S., Cherny, S.S. and Sham, P.C. (2003). Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19, 149–150. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H. et al. (2006). Global variation in copy number in the human genome. Nature 444, 444–454. Salonen, J.T., Uimari, P., Aalto, J.M., Pirskanen, M., Kaikkonen, J., Todorova, B., Hypponen, J., Korhonen, V.P., Asikainen, J., Devine, C. et al. (2007). Type 2 Diabetes Whole-Genome Association Study in Four Populations: The DiaGen Consortium. Am J Hum Genet 81, 338–345. Saxena, R., Voight, B.F., Lyssenko, V., Burtt, N.P., de Bakker, P.I., Chen, H., Roix, J.J., Kathiresan, S., Hirschhorn, J.N., Daly, M.J. et al. (2007). Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316, 1331–1336. Scott, L.J., Mohlke, K.L., Bonnycastle, L.L., Willer, C.J., Li, Y., Duren, W.L., Erdos, M.R., Stringham, H.M., Chines, P.S., Jackson, A.U. et al. (2007). A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345.
■
107
Shen, R., Fan, J.B., Campbell, D., Chang, W., Chen, J., Doucet, D., Yeakley, J. et al. (2005). High-throughput SNP genotyping on universal bead arrays. Mutat Res 573, 70–82. Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., Boutin, P., Vincent, D., Belisle, A., Hadjadj, S. et al. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, 881–885. Smyth, D.J., Cooper, J.D., Bailey, R., Field, S., Burren, O., Smink, L.J., Guja, C., Ionescu-Tirgoviste, C., Widmer, B., Dunger, D.B. et al. (2006). A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region. Nat Genet 38, 617–619. Steemers, F.J., Chang, W., Lee, G., Barker, D.L., Shen, R. and Gunderson, K.L. (2006). Whole-genome genotyping with the single-base extension assay. Nat Methods 3, 31–33. Todd, J.A., Walker, N.M., Cooper, J.D., Smyth, D.J., Downes, K., Plagnol, V., Bailey, R., Nejentsev, S., Field, S.F., Payne, F. et al. (2007). Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat Genet 39, 857–864. Tomlinson, I., Webb, E., Carvajal-Carmona, L., Broderick, P., Kemp, Z., Spain, S., Penegar, S., Chandler, I., Gorman, M., Wood, W. et al. (2007). A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet 39, 984–988. Tremelling, M., Cummings, F., Fisher, S.A., Mansfield, J., Gwilliam, R., Keniry, A., Nimmo, E.R., Drummond, H., Onnie, C.M., Prescott, N.J. et al. (2007). IL23R variation determines susceptibility but not disease phenotype in inflammatory bowel disease. Gastroenterology 132, 1657–1664. van Heel, D.A., Franke, L., Hunt, K.A., Gwilliam, R., Zhernakova, A., Inouye, M., Wapenaar, M.C., Barnardo, M.C., Bethel, G., Holmes, G.K. et al. (2007). A genome-wide association study for celiac disease identifies risk variants in the region harboring IL2 and IL21. Nat Genet 39, 827–829. Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M. et al. (2007). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35, D5–D12. Winkelmann, J., Schormair, B., Lichtner, P., Ripke, S., Xiong, L., Jalilzadeh, S., Fulda, S., Putz, B., Eckstein, G., Hauk, S. et al. (2007). Genome-wide association study of restless legs syndrome identifies common variants in three genomic regions. Nat Genet 39, 1000–1006. Yeager, M., Orr, N., Hayes, R.B., Jacobs, K.B., Kraft, P., Wacholder, S., Minichiello, M.J., Fearnhead, P.,Yu, K., Chatterjee, N. et al. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet 39, 645–649. Zanke, B.W., Greenwood, C.M., Rangrej, J., Kustra, R., Tenesa, A., Farrington, S.M., Prendergast, J., Olschwang, S., Chiang, T., Crowdy, E. et al. (2007). Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nat Genet 39, 989–994.
CHAPTER
9 Copy Number Variation and Human Health Charles Lee, Courtney Hyland, Arthur S. Lee, Shona Hislop and Chunhwa Ihm
INTRODUCTION Early cytogenetic studies recognized that microscopically visible aberrations such as duplications or deletions of entire chromosomes (aneuploidy) were associated with specific congenital developmental disorders. For example, an extra copy of chromosome 21 (i.e., trisomy 21) has long been established as being correlative with the mongoloid phenotype that was first recognized in 1866 by John Langdon Down (Down 1866; Lejeune et al., 1959). However, recent development of high-resolution assays, capable of detecting small segmental genetic alterations in a genome-wide fashion, have led to the detection of widespread copy number variation (CNV) among the genomes of healthy individuals (Iafrate et al., 2004; Sebat et al., 2004). This finding has now sparked intense efforts to identify and characterize the extent of this type of genetic variation in human populations and to understand its impact on human health.
BASIC PRINCIPLES OF CNVS CNVs have been operationally defined as genomic gains and losses of 1 kb or larger (Freeman et al., 2006; Feuk et al., 2006). This definition helps to differentiate it from other forms of
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 108
polymorphism and/or repeated DNAs in the human genome (see Chapter 1), which include indels, microsatellites, minisatellites, simple repeats (e.g., dinucleotide repeats, trinucleotide repeats, etc.), telomeric and centromeric repetitive DNAs, and most interspersed repetitive elements (although LINES and other long interspersed repeats have repeat elements greater than 1 kb in size). In addition, since CNVs are considered subchromosomal imbalances, they are differentiated from whole chromosomal aneuploidies, such as trisomy 21 or monosomy X. When CNV data are available for many individuals within or across different human populations, CNVs can be categorized into biallelic or multiallelic states (Figure 9.1). Biallelic CNVs have only two alleles and thus produce three different genotypes. For example, a CNV that exists solely in the form of a 1-copy allele or a 2-copy allele (Figure 9.1a, right) can produce diploid copy numbers of 2 (1 allele1 allele), 3 (1 allele2 allele), or 4 (2 allele2 allele) (Figure 9.1a, left). CNVs with greater than two alleles are considered multiallelic and result in more than three different genotypes (Figure 9.1b and Figure 9.2). Heritable CNVs are thought to arise from germline genomic rearrangements (or in some cases, possibly very early somatic events).The genomic rearrangements (mutational events) that are thought to cause CNVs can be broadly categorized as arising via one of two mechanisms: (i) Non-allelic homologous recombination (NAHR) – where homologous recombination
Copyright © 2009, Elsevier Inc. All rights reserved.
Basic Principles of CNVs
■
109
Alleles
Reference = 2 copies (40% frequency)
Duplication CNV = 3 copies (50% frequency)
Frequency (%)
100 Duplication CNV = 4 copies (10% frequency)
80 60 40 20 0
1
2 3 4 Copy Number
5
6 Hypothetical Gene
(a) Bi-allelic CNV Alleles
Deletion CNV = 3 copies (10% frequency)
Duplication CNV = 5 copies (20% frequency)
Frequency (%)
100 80 60 Reference = 4 copies (60% frequency) 40 20 0
1
3 4 2 Copy Number
5
6
SULT1A1 Gene SULT1A1 Gene Deletion
(b) Multi-allelic CNV
Figure 9.1 Examples of bi-allelic and multi-allelic CNVs. a) An example of a bi-allelic CNV that has a 1-copy allele and a 2-copy allele. The reference individual has two 1-copy alleles but 50% of individuals in this population have a total of three copies of this gene per cell. All bi-allelic CNVs have three genotypes per diploid cell, and in this case, copy numbers of 2, 3, and 4 per diploid cell. b) An example of a multi-allelic CNV that has a 0, 1, 2 and 3 copy alleles, resulting in six genotypes in this population. Only the allelic combinations for the three most common genotypes are shown.
Figure 9.2 The amylase gene is considered a multiallelic CNV with diploid copy number estimates ranging from 2 to 15 among humans (Perry et al., 2007). Here, a two-color fiber-FISH experiment of the amylase gene (each copy of the gene is represented by a redgreen combination) shows three copies on a DNA fiber from a single chromosome. It is interesting to note here that the second amylase gene copy is inverted relative to the others.
occurs between highly identical sequences in the genome, such as segmental duplications or related interspersed repetitive elements (Figure 9.3) and (ii) non-homologous end joining (NHEJ) – a repair mechanism whereby double-strand breaks
that occur in the genome are ligated together with the assistance of specific protein complexes that form at the sites of the double-strand breaks (Figure 9.4). The rate for NHEJ is likely influenced significantly by environmental factors and localized DNA
110
CHAPTER 9
■
Copy Number Variation and Human Health
(gene a and b deleted)
(gene a and b duplicated)
Segmental duplication – allele 1 Segmental duplication – allele 2 Gene a Gene b
Figure 9.3 Non-allelic homologous recombination (NAHR) is a mechanism for generating CNVs where recombination between non-allelic repeats with 90% sequence homology (indicated by black- and gray-colored DNA segments). Intervening DNA sequences are deleted and duplicated on different chromatids.
conformations, but in general has been estimated to occur at a rate of less than 107 per generation, similar to the 108 per generation estimated mutation rate of single nucleotide polymorphisms (SNPs) (Conrad and Hurles, 2007). NAHR events are believed to occur more frequently, with estimates of up to 104 per locus per gamete per generation (Shaffer and Lupski 2000). Since NAHR events lead to duplication and deletion of DNA sequences that lie between highly identical sequences in the genome, this mutational event tends to be associated with larger CNVs (Redon et al., 2006). CNVs are scattered throughout the human genome (Figure 9.5) with at least 6% of a chromosome’s total DNA content being potentially copy number variable (Redon et al., 2006). Taken together, current estimates are that more than 500 Mb (or 18.8%) of the reference human genome is copy number variable (Scherer et al., 2007). In fact, one recent study suggested that structural
genomic variation (i.e., non-SNP variation that is predominantly in the form of CNVs) account for as much as 22% of all genetic variable events found in a given individual and total 74% of all variant bases (Levy et al., 2007). Many of the earlier CNV discovery projects relied on array comparative genomic hybridization (aCGH) technologies that utilized large-insert clones (on the order of 120kb–150kb each) as individuals probles and therefore yielded CNV data with ill-defined boundaries. This has likely led to a size over-estimation for many currently identified CNVs. Documented CNVs are being cataloged and collated in several databases. For example, CNV data from healthy individuals can be found in the Database of Genomic Variants (http:// projects.tcag.ca/variation) and the human paralogy database (http://humanparalogy.gs.washington.edu/structuralvariation), while CNVs from patients affected with a neurodevelopmental disorder are being collated in databases such as DECIPHER
Basic Principles of CNVs
b
c
b
■
111
c
(a)
(b)
Acentric fragment Gene a
Gene c
Gene b
Gene d
Double-strand DNA break
Figure 9.4 Non-homologous end joining (NHEJ) can result in a genomic rearrangement that immediately or eventually results in the gain or loss of genetic material after multiple double-strand DNA breaks occur and are repaired. (a) A balanced rearrangement has occurred and there is no immediate net gain or loss of genomic material within the cell, but it will result in gains or losses of DNA in subsequent generations depending on which chromosome is inherited. (b) In an unbalanced rearrangement, genomic material can be lost immediately (e.g., in this case, the acentric fragment containing one copy of genes b and c will be lost in subsequent cell divisions).
112
CHAPTER 9
■
Copy Number Variation and Human Health
1
2 3
4
5
6
7
8
9
10
11 12 Length scaled log(length)
Presence of CNV freq
X 13 14
15 16 17 18 19 20 21 22
Y Length log (heterozygosity)
Figure 9.5 Genomic distribution of CNVs identified in one study (Redon et al., 2006). No large stretches of the genome appear to be exempt from CNVs and the proportion of any given chromosome that is susceptible to CNVs conservatively varies from 6% to 19%.
(http://www.sanger.ac.uk/PostGenomic/decipher/), the Chromosome Abnormality Database (http://www.ukcad. org.uk/cocoon/ukcad/) and the European Cytogenetics Association Register of Unbalanced Chromosome Aberrations (http://www.ecaruca.net). Most CNVs found in healthy individuals are biased away from genes and reside within intergenic regions (i.e., DNA sequences between genes) (Conrad et al., 2006; Nguyen et al., 2006). Nevertheless, the remaining CNVs have been shown to overlap some 3,000 RefSeq genes and 300 genes implicated in genetic disease (as listed in the Online Mendelian Inheritance in Man database, http://www.ncbi.nlm.nih.gov/sites/ entrez?dbOMIM). Ontology analyses of the genes thought to be copy number variable demonstrate a substantial number that can be classified as “environmental sensor/interaction” genes (Tuzun et al., 2005). These are genes that are involved in sensory perception, neurophysiological processes, drug detoxification, immunity and inflammation, as well as cell surface integrity and cell surface antigens. Clearly, such genes are not critical for early development but rather are involved in our perception and interaction with external stimuli, helping us to adapt to our ever-changing environment. The functional impact of certain CNVs can be relatively straightforward. For example, reduced copy number of a given gene can often be correlated with reduced expression levels, while additional copies of a gene could lead to increased expression levels of the CNV gene (McCarroll et al., 2006). CNVs that involve parts of a gene could result in fusion gene products or aberrant proteins with addition or loss of specific protein domains. Some CNVs in intergenic regions could overlap regulatory elements that affect the expression of genes as far away as 4 Mb (Stranger et al., 2007), and correlations of CNVs with transcriptional levels do not necessarily have to be positive. For example, deletion of a repressor element may cause upregulation of an associated gene.
DETECTING CNVS IN A GENOME-WIDE MANNER Array-Based Comparative Genomic Hybridization There are different genome-wide methods for detecting CNVs. By far, the most widely used method has been array-based comparative genomic hybridization (aCGH). This technology was first introduced as “matrix-CGH” (Solinas-Toldo et al., 1997) and later referred to as “array CGH” (Pinkel et al., 1998). In aCGH, the “test” genome being interrogated is labeled with one type of fluorescent molecule (e.g., Cy5), and a “normal” or “reference” control genome is labeled with another type of fluorescent molecule (e.g., Cy3) (Figure 9.6). The labeled DNAs are combined, denatured and hybridized to an array of DNA fragments or oligonucleotides on a microscope slide, with each DNA fragment or oligonucleotide representing a unique part of the human genome. The labeled DNAs are then allowed to hybridize to their complementary DNA sequences on the array, in a stoichiometric fashion, such that – by measuring the fluorescence ratio of the two fluorescent dyes at each spot on the array – one can infer the relative copy number of that particular DNA sequence in the genome being tested, with respect to the reference genome (Figure 9.6). There have been many advances in aCGH technology over the past 5 years. Originally, aCGH platforms contained several hundred or even up to a thousand, large-insert DNA clones (e.g., BAC clones that have an average insert size of about 120–150 kb), which recapitulated the human genome with a clone per 3 Mb. More recently, the trend has been to manufacture arrays that use smaller DNA sequences as hybridization targets but with increasing number of targets on an array. On such arrays, targets can be oligonucleotides of 45–75 bases in length that have been designed to have similar annealing temperatures (isothermic), based primarily on the length of the oligonucleotide
Detecting CNVs in a Genome-Wide Manner
■
113
Gain
Microarray with DNA segments No change
Loss Reference DNA
Test DNA (a)
Log2 ratio
1.2 1.6 0
0.6 1.2
Position along the genome
(b)
Figure 9.6 (a) A schematic of an aCGH assay where a test genome (labeled with Cy5, denoted in green) is cohybridized with a reference genome (labeled with Cy3, denoted in red). The DNA probes are mixed and allowed to hybridize to its complementary sequences on the array in a stoichiometric fashion. Fluorescence intensities of the spots on the microarray (each containing a specific DNA sequence) are measured and DNA sequences occurring in greater copy number in the test than in the reference will result in more green fluorescence for those spots on the microarray. A lower copy number of the same DNA sequences will result in more red fluorescence. (b) Typically the log2 of the fluorescence ratios for each DNA segment on the array is then plotted from one end of a chromosome to the other. A gain is indicated in green and a loss in red. DNA segments having no significant change in DNA copy number in the test sample (with respect to the reference DNA) is indicated in yellow.
and its GC base pair content. Two companies that produce such arrays are NimbleGen Systems, Inc. (www.nimblegen.com) and Agilent, Inc. (www.agilent.com/chem/goCGH). NimbleGen uses a programmable mirror array to synthesize 385,000 oligonucleotides (more recently 2.1 million targets) directly on a glass surface using photolithography. Agilent, on the other hand, uses ink-jet technology to synthesize 244,000 oligonucleotides (more recently 1.1 million targets), on a spot-by-spot basis. When using oligonucleotide-based arrays, it is important to note that their primary disadvantage is that each oligonucleotide probe tends to have lower signal-to-noise ratios than a large-insert
genomic clone target. This results in more experimental “noise” for oligonucleotide-based aCGH assays. Indeed, the typical standard deviation of log2 ratios for an oligonucleotide-based array is approximately 0.25–0.3, five times the standard deviation of log2 ratios obtained for BAC-based arrays. However, the inclusion of hundreds of thousands to millions of targets on a given oligonucleotide-based platform provides an assay with an increased resolution. The “effective resolution” of one of these platforms depends largely on the minimum number of consecutive probes needed to confidently call a CNV, which in turn is a function of how well the target sequences were chosen to accurately and consistently
114
CHAPTER 9
■
Copy Number Variation and Human Health
respond to copy number changes. Hence, a particular platform that has 500,000 targets and requires only three consecutive probes to make a confident CNV call actually has a higher effective resolution than an array platform with one million targets, but requiring 10 consecutive probes to make a CNV call, assuming that both platforms distribute targets evenly throughout the genome. Genotyping Arrays High-throughput array technologies for identifying SNPs can also be used to identify CNVs (see Chapter 8). In general, these arrays contain short targets (20–30 base oligonucleotides) that make them ideal for detecting single base alterations, but less ideal for identifying CNVs (especially when compared to long oligonucleotide-based arrays). For genotyping arrays, only a single labeled DNA source (“test sample”) is hybridized, and the signal intensities obtained at each hybridized target appear to have a linear relationship with respect to copy numbers of that particular DNA sequence in the test genome. For example, if a given DNA sequence has four copies in test genome “X” and only two copies in test genome “Y,” the signal intensity obtained for that DNA sequence when test genome “X” is hybridized would be twice that of when test genome “Y” is hybridized. In general, genotyping platforms can detect larger CNVs (especially when there is a higher level of copy number change) or smaller CNVs (that are detectable by numerous targets on the array platform). Affymetrix (www.affymetrix.com) SNP arrays contain targets that are ~25 bases long and the predecessor to the Affymetrix 500K array was recently used to identify CNVs in 270 HapMap individuals (Redon et al., 2006). Illumina Inc. (www.illumina.com) has designed a genotyping platform that uses 50 base oligonucleotides attached to indexed beads on a glass slide. Labeled test DNAs are hybridized to this slide, followed by primer extension and then immunofluorescence detection. Peiffer et al. (2006) recently showed that the ability of Illumina genotyping platforms have also been used to detect CNVs in both constitutional and tumor samples. Clearly, there would be significant benefits to having platforms that are capable of accurately determining both SNP and CNV genotypes, including reduced reagent costs and expenditure of minimal amounts of DNA. However, the fluorescence intensity data obtained from these genotyping platforms typically have more noise when trying to obtain copy number information than do long oligonucleotide-based arrays. Hence, both Affymetrix and Illumina are now designing “next generation” arrays that incorporate thousands of non-polymorphic probes (i.e., probes that do not target known SNPs) that fall within known CNV regions. It is thought that a CNV can be confidently detected if enough targets are strategically chosen for a given CNV region and included in the array. In other words, one probe may not be able to reliably detect a single copy loss in a CNV region but cumulative data from one hundred targets in the same CNV region may result in a consistent and confident CNV call. Whole-Genome Sequence Comparisons CNVs can also be detected via whole-genome sequence comparison analyses. The main advantage of this method for
identifying CNVs is that the acquisition of fine-scale genomic architecture of CNVs (i.e., accurate CNV sizes and breakpoint information). A major disadvantage of this approach is the limited number of individuals for which whole-genome sequence data are currently available. The first human whole-genome sequence was made available from the Human Genome Project. This reference human genome is actually a compilation of DNA sequences from over ten different individuals with ~67% of the DNA sequences originating from the RPCI-11 DNA library, derived from a single individual male. Tuzun et al. (2005) aligned end-sequence data from thousands of fosmid clones from the G248 DNA library (derived from a single North American female) and compared these with the human reference genome sequence. Taking advantage of the tight size restriction of fosmid clones, they were able to identify genomic gains/losses in the reference genome, when pairs of end-sequences aligned with intervening spacing significantly greater or less than the expected 40 kb. In this manner, 241 CNVs were identified in one of these two healthy and presumably normal individuals. More recently, sequencing technologies have advanced to a point where complete human genome sequences can be obtained more efficiently and cost-effectively than previously possible. For example, using Solexa or 454 DNA sequencing approaches, an individual’s complete genome sequence can now be obtained within months and at costs of under $250,000. Undoubt-edly, these advances will be used to comprehensively identify and define CNVs at a DNA sequence resolution, provided that a minimal fold-coverage of the person’s genome is achieved. Recent studies that have obtained complete genome sequences for different individuals have reported thousands of CNVs in a given individual, encompassing hundreds of millions of bases of DNA (Korbel et al., 2007; Levy et al., 2007). A major advantage of utilizing DNA sequence comparison strategies is the ability to identify balanced chromosomal rearrangements that cannot be detected by aCGH-based methods. For example, Tuzun et al. (2005) found evidence for 56 inversion breakpoints in their comparative analysis of the two genomes of two individuals, and Korbel et al. (2007) found 132 inversion breakpoints when comparing the genomes of two different individuals.
ASSOCIATION OF CNVS WITH DISEASE AND DISEASE SUSCEPTIBILITY Genomic imbalances, including CNVs, can contribute to human diseases in at least two ways. First, certain genomic imbalances appear to directly cause neurodevelopmental diseases, that can occur at birth or even later on in life. These genomic imbalances (referred by some as “pathogenic” CNVs) are usually de novo in nature and recent estimates have associated specific genomic imbalances with as many as 50 such genetic syndromes (http://www.sanger.ac.uk/PostGenomics/decipher/). Most of the remaining known genomic imbalances – sometimes referred
Association of CNVs to Disease and Disease Susceptibility
to as “benign” CNVs because of their identification in healthy individuals – actually have more subtle consequences on human health. Fcgr3 Variation in Glomerulonephritis Glomerulonephritis is a major contributor to human kidney failure. Fcgr3 is a gene that encodes for a receptor found on the surfaces of macrophages and that has low-affinity binding properties to immunoglobulin G. The copy number of Fcgr3 can vary in humans and among rat strains from 0 to 4 per diploid cell. Individuals with fewer copies of this gene, due to deletions of paralogous Fcgr3 genes (which appear to have a negative regulatory effect on the “full length” Fcgr3 gene/gene products), demonstrate increased macrophage activity and an autoimmune response (Aitman et al., 2006; Fanciulli et al., 2007). DEFB4 CNV in Crohn Disease Human -defensins are a family of genes predominantly secreted from leukocytes and epithelial tissues. -defensins are small proteins (15–20 residues) that function in antimicrobial defense by penetrating a microbe’s cell membrane and cause microbial death in a manner similar to that of antibiotics. In the presence
■
115
of interleukin 1-alpha (IL-1), which is secreted by macrophages and other immunologically relevant cell types at the site of tissue inflammation, the expression levels of the -defensin gene, DEFB4, increases (O’Neil et al., 1999), to protect the tissue from further microbial invasion. Therefore, individuals with a lower copy number of this -defensin have decreased immunity against microbes and increased susceptibility to Crohn disease (Inflammatory bowel disease 1) (Fellermann et al., 2006; Naser et al., 2004). UGT2B17 Variation in Prostate Cancer Park and colleagues (2006) found that CNV of the UGT2B17 gene (Murata et al., 2003; Redon et al., 2006; Wilson et al., 2004) confers differential susceptibility to prostate cancer (Figure 9.7). Testosterone is normally processed into dihydrotestosterone (DHT) and other androgens within human cells. In the presence of functioning UGT2B17 gene product, excess DHT molecules are converted into water-soluble glucuronic acid which can be subsequently eliminated from the cell. When UGT2B17 is deleted, this process becomes less efficient and results in greater amounts of DHT within the cells. Increased intracellular DHT concentrations lead to increased interactions with androgen
Dissolution / expulsion
DHT Testosterone Glucuronic acid Androgen receptor (a) Normal copy number of UGT2B17 gene
DHT Testosterone Androgen receptor Deleted UGT2B17 (b) Two copy deletion of UGT2B17 gene
Figure 9.7 Potential mechanism of the UGT2B17 CNV in prostate cancer pathogenesis. (a) Testosterone is normally processed into DHT and other androgens. When there are two copies of UGT2B17, the gene product converts excess molecules of DHT into the water-soluble glucuronic acid and is subsequently eliminated from the cell. (b) When there is a homozygous deletion of UGT2B17, the DHT-sequestration pathway no longer exists, increasing endogenous interaction with androgen receptors (AR) and leading to elevated cell proliferation, which in some cases can result in the development of prostate cancer (Park et al., 2006).
116
CHAPTER 9
■
Copy Number Variation and Human Health
receptors that cause elevated cell proliferation. When this occurs in the prostate cells, it can result in prostatic tumorigenesis. Interestingly, while UGT2B17 null deletions significantly increases prostate cancer susceptibility as a whole, stratified analysis revealed that this correlation to prostate cancer susceptibility is only significant in the Caucasian populations studied, but not among African-Americans. The CNV of Complement Component C4 in Systemic Lupus Erythematosus (SLE) Although a link between the complement component C4 (and its isotypes, C4A and C4B) and SLE has been previously reported (Hauptmann et al., 1974a,b), a recent study suggests that the gene’s variable copy number actually serves as a significant risk factor for the disease (Yang et al., 2007). The complement system is comprised of over 20 proteins/protein fragments that normally circulate in the blood but, when activated, cause a biochemical cascade that clears pathogens from the human body, often by forming new transmembrane channels in the pathogen and causing osmotic lysis of the target cell. The median copy number of complement component C4 is 4 but can range from 0 to 5 copies among humans (Yang et al., 2007). Low copy numbers of this gene are correlated with increased risk for SLE (Fanciulli et al., 2007;Yang et al., 2007). CCL3L1 CNVs in HIV/AIDS A number of studies on HIV/AIDS have focused on the gene copy number and protein levels of CCL3L1 and CCR5 (Gonzalez et al., 2001, 2005) in diseased and healthy individuals. CCR5 is a major co-receptor for the HIV-1 virus as well as the CCL3L1 chemokine. It was recently found that the combination of low CCL3L1 gene copy number, which is populationspecific, and high protein levels of CCR5 showed the greatest increase in susceptibility to AIDS (Gonzalez et al., 2005). It is speculated that increased CCR5 protein levels lead to greater number of CD4CCR5 receptor complexes on the surfaces of T cells and decreased copy number of CCL3L1 leads to decreased expression (and protein levels) of this chemokine, leaving more CD4CCR5 receptor complexes available for binding with the HIV-1 virus.
IMPLICATIONS OF CNVS Disease Association Studies SNPs have become powerful markers for identifying important disease loci in genetic association studies. However, recent analysis of the complete DNA sequence from a single human individual has shown that CNVs (and other structural genomic variants including small balanced chromosomal rearrangements) account for some 22% of all genetic variation events in the individual and 74% of the total DNA sequence variation, when compared to the human reference genome (Levy et al., 2007). Although it is premature to predict the relative contribution of
CNVs to the etiology of common, complex diseases (compared to SNPs), it is clear that CNVs represent a substantial component of human genetic variation that should not be ignored in future disease association studies. Indeed, the presence of CNVs in the human genome has actually reduced the power of certain SNPs. For example, SNPs that lie within CNV regions are difficult to genotype, result in Hardy–Weinberg equilibrium distortions, and are often erroneously calculated as having reduced linkage disequilibrium to nearby causative genomic regions. Moreover, CNVs are far more complex in nature than SNPs and demand appreciation of several factors in order to use this form of genetic variation appropriately in disease association studies. First, the absolute copy number (rather than relative copy number obtained from aCGH-based experiments) needs to be established for each CNV. Second, the exact genomic location of duplicated CNVs should also be considered. For example, the presence of a third copy of a gene may have dramatically different phenotypic effects depending on where the third copy occurs in the genome. Finally, the precise boundaries (at the DNA sequence level) of each CNV should also be known, as should the specific allelic state of a CNV (e.g., two copies of a gene could be distributed as one copy per chromosome or both copies on a single chromosome). Without this level of information, the power of CNVs in genetic association studies diminishes. Indeed, of the several thousand CNVs that have been identified to date, only a small percent have actually been annotated and genotyped to this precision. Some efforts have been made to determine if CNVs are in linkage disequilibrium with nearby SNPs. If so, this would allow for specific CNV alleles to be assayed indirectly with a subset of well-characterized SNPs (i.e., “tagging” SNPs) (see Chapter 2). Initial observations suggested that a subset of CNVs did appear to be ancestral in nature and therefore taggable by specific SNPs (McCarroll et al., 2006). Larger CNVs, especially those that are in segmental, duplication-rich areas of the genome, appear to be less taggable by known SNPs (Locke et al., 2006). Part of the reason for this may be that the mutation rate for such CNVs is higher than the mutation rates for nearby SNPs (recall that NAHR mutations rates are estimated to be substantially higher than that of SNPs). Overall, this suggests that, for association studies, a substantial number of CNVs may need to be genotyped directly. Pharmacogenetics Since many CNV genes are involved in metabolism and drug detoxification, it has been speculated that CNVs may also make significant contributions to future pharmacogenomic studies (Ouahchi et al., 2006). For example, the CYP2D6 genes code for enzymes that are involved in the metabolism of more than 30 medications that include antiarrhythmics, antihypertensives, beta-blockers, etc. CYP2D6 CNVs have been identified and shown to result in gene products with differential metabolism efficiency (Rotger et al., 2007). Similarly, CNVs of genes involved in metabolism may explain some cases of interindividual variation in drug toxicity. McCarroll et al. (2005) and Conrad et al. (2006) found more than 120 CNV genes that were
Implications of CNVs
■
117
14p13 14p12
RP11-98N22 14p11.2-11.1 14q11.1 14q11.2 14q12
RP11-557015
13.1 14q13.3 14q21.2 14q21.3 14q22.1 14q22.2 14q23.1 14q23.2 14q24.1 14q24.2 14q24.3 14q31.1 14q31.3 14q32.12 14q32.2 14q32.32 14q32.33
4
5
5
3.
3
5
2.
2
Human 14
1.
08 1. 3 9 0.
67
0.
5
0.
4 0. 33 0. 29 0. 25
(a)
(b)
RP11-98N22 RP11-557015
Figure 9.8 Example of a de novo genomic imbalance that is likely causative of the clinical phenotype of a patient seen in the clinical genetic clinic. (a) A chromosome 14 profile from a 1 Mb resolution aCGH assay that has been run in dye swap. Deviation of the red line to the right of the expected 1:1 ratio and the blue line to the left of the expected 1:1 ratio is indicative of a genomic loss in the patient DNA. Here, nine BAC clones (each clone represented by a dot on the array profile) are within the deleted region, suggesting a 9–10 Mb sized deletion. The position of BAC clones used in FISH confirmation studies are indicated in red and green. (b) A metaphase spread from the patient after a two-color FISH confirmation assay using a clone within the deletion interval (RP11-557O15) and a control clone outside of the deletion interval (RP11-98N22). The absence of the green signal on one of the chromosome 14s (indicated with a white arrowhead) confirms the aCGH findings. Parental FISH studies were normal (not shown), indicating that this chromosomal aberration was a de novo event.
homozygously deleted. Individuals harboring such homozygous deletions presumably have low toxicity tolerance to medications that depend heavily on the homozygously deleted CNV gene product for proper metabolism. Rapid and accurate identification of these individuals should be made a priority in pharmacogenomic research (see also Chapter 27). Clinical Cytogenetic Diagnostics Array CGH-based techniques are now being more widely used in the clinical cytogenetic diagnostic arena to identify smaller genomic imbalances that may be associated with neurodevelopmental disorders. Indeed, it has been estimated that aCGHbased assays are now detecting apparently pathogenic genomic imbalances in as much as 20% of cases that have had normal results from chromosome-banded karyotyping tests (Figure 9.8). However, the recent recognition of widespread CNVs among the genomes of healthy individuals has made interpretation of aCGH-based assays more difficult [for a recent review on this, see Lee et al. (2007)]. Specifically, all CNVs detected in an
aCGH clinical test need to be assessed for their potential to contribute to the clinical presentation. One of the primary means for assessing pathogenicity of a CNV is by determining whether or not the CNV is de novo. Indeed, 2–3% of children are born with a major birth defect that is sporadic in nature, and the genetic alteration causing these birth defects is similarly assumed to be sporadic in nature. This strategy for determining the pathogenicity of a genomic imbalance in “constitutional” genetic testing is not new and has been applied by cytogeneticists for decades (albeit at the chromosomal banding analysis level). Therefore, for genomic imbalances detected by aCGH assays, it must be determined if the same genomic imbalance is observed in either of the biological parents, and close attention must be paid to the absolute copy number of the CNV (i.e., not just the relative copy number that is obtained from aCGH experiments), breakpoint information when available and genomic distribution of the CNV DNA segment. Similarly, determining whether a given CNV is present in affected (or unaffected) relatives can also be highly informative.
118
CHAPTER 9
■
Copy Number Variation and Human Health
Another factor to consider is whether the CNV has been found in other unrelated but affected individuals with similar clinical phenotypes (using CNV databases such as DECIPHER) or among healthy individuals (using CNV databases such as the Database of Genomic Variants). This information is only as useful as the accuracy of the CNV entries in these databases (i.e., minimal false positives and precise definition of the architecture of each CNV entry). Moreover, the gene content within a given CNV will also dictate the clinical interpretation of that CNV. Finally, the overall size of the CNV can be a minor indicator of its pathogenicity, with the caveat that “benign” CNVs, as large as 10 Mb in size in gene-poor regions of the genome, have been documented (Hansson et al., 2007). Accurate clinical interpretation of CNVs will improve as our knowledge of the structure and function of the human genome increases. Clearly, CNVs cannot be considered as isolated genomic aberrations and eventually should be interpreted in the context of all genomic variants, including other CNVs as well as SNPs. For example, the same deletion CNV in a healthy individual may uncover a recessive SNP mutation on the other allele in an unrelated individual – leading to a drastically different phenotype. Since we are at a stage in clinical cytogenetic diagnostics where the technology and the data being obtained are beyond our understanding of human genome function, caution continues to be advised in the application of genome-wide aCGH assays in prenatal diagnostics. Instead, targeted aCGH platforms that primarily utilize probes with well-defined associations with genetic disorders have been used for prenatal diagnoses.
CONCLUSIONS It was previously touted that the genomes of healthy individuals were 99.9% similar (Lander et al., 2001). Over the last 3 years, however, it has become obvious that there is more genetic variation than had been appreciated and that a substantial portion is in the form of structural genomic variation, including genomic imbalances of DNA segments greater than 1 kb in length (CNVs). As a result, CNVs are now requiring special consideration in many aspects of genomic medicine, including disease association studies, pharmacogenomics and clinical cytogenetic diagnostics. While our understanding of the locations, structures and frequencies of CNVs remains rudimentary, it is anticipated that there will be rapid growth in CNV knowledge over the next few years. This growth will be essential for accurately integrating this type of genetic variation with SNPs (and other forms of genetic alterations) for each human being. New findings will lead researchers and clinicians toward a greater appreciation of the genetic factors that make each person so unique, which is essential to the practice of genomic and personalized medicine.
ACKNOWLEDGEMENTS The authors would like to express their gratitude to George Perry for providing the image for Figure 9.2 and Marc Listewnik for his assistance in generating Figure 9.8.
REFERENCES Aitman, T.J., Dong, R., Vyse, T.J., Norsworthy, P.J., Johnson, M.D., Smith, J. et al. (2006). Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature 439, 851–855. Conrad, D.F., Andrews, T.D., Carter, N.P., Hurles, M.E. and Pritchard, J.K. (2006). A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38, 75–81. Conrad, D.F. and Hurles, M.E. (2007). The population genetics of structural variation. Nat Genet 39, 530–536. Down, J.L.H. (1866). Observations on an ethnic classification of idiots. London Hosp Clin Lect Rep 3, 259. Fanciulli, M., Norsworthy, P.J., Petretto, E., Dong, R., Harper, L., Kamesh, L., Heward, J.M. et al. (2007). FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nat Genet 39, 721–723. Fellermann, K., Stange, D.E., Schaeffeler, E., Schmalzl, H., Wehkamp, J., Bevins, C.L., Reinisch, W. et al. (2006). A chromosome 8 genecluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon. Am J Hum Genet 79, 439–448. Feuk, L., Marshall, C.R., Wintle, R.F. and Scherer, S.W. (2006). Structural variants: Changing the landscape of chromosomes and design of disease studies. Hum Mol Genet 15, R57–R66. Freeman, J.L., Perry, G.H., Feuk, L., Redon, R., McCarroll, S.A., Altshuler, D.M. et al. (2006). Copy number variation: new insights in genome diversity. Genome Res 16, 949–961.
Gonzalez, E., Dhanda, R., Bamshad, M., Mummidi, S., Geevarghese, R., Catano, G., Anderson, S.A. et al. (2001). Global survey of genetic variation in CCR5, RANTES, and MIP-1alpha: Impact on the epidemiology of the HIV-1 pandemic. Proc Natl Acad Sci USA, 5199–5204. Gonzalez, E., Kulkarni, H., Bolivar, H., Mangano, A., Sanchez, R., Catano, G. et al. (2005).The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 307, 1434–1440. Hansson, K., Szuhai, K., Knijnenburg, J., van Haeringen, A. and de Pater, J. (2007). Interstitial deletion of 6q without phenotypic effect. Am J Med Genet A 143, 1354–1357. Hauptmann, G., Grosshans, E. and Heid, E. (1974a). Lupus erythematosus syndrome and complete deficiency of the fourth component of complement. Boll Ist Sieroter Milan 53(Suppl), 228. Hauptmann, G., Grosshans, E., Heid, E., Mayer, S. and Basset, A. (1974b). Acute lupus erythematosus with total absence of the C4 fraction of complement. Nouv Presse Med 3, 881–882. Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi,Y. et al. (2004). Detection of large-scale variation in the human genome. Nat Genet 36, 949–951. Korbel, J.O., Urban, A.E., Affourtit, J.P., Godwin, B., Grubert, F., Simons, J.F. et al. (2007). Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426.
Recommended Resources
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J. et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. Lee, C., Iafrate, A.J. and Brothman, A.R. (2007). Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat Genet 39, S48–S54. Lejeune, J., Gautier, M. and Turpin, R. (1959). Etude des chromosomes somatiques de neuf enfants mongoliens. C R Hebd Seances Acad Sci 248, 1721–1722. Levy, S., Sutton, G., Ng, P.C., Feuk, L., Halpern, A.L., Walenz, B.P. et al. (2007). The diploid genome sequence of an individual human. PLoS Biol 5, e254. Locke, D.P., Sharp, A.J., McCarroll, S.A., McGrath, S.D., Newman, T.L., Cheng, Z. et al. (2006). Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 79, 275–290. McCarroll, S.A., Hadnott, T.N., Perry, G.H., Sabeti, P.C., Zody, M.C., Barrett, J.C. et al. (2006). Common deletion polymorphisms in the human genome. Nat Genet 38, 86–92. Murata, M., Warren, E.H. and Riddell, S.R. (2003). A human minor histocompatibility antigen resulting from differential expression due to a gene deletion. J Exp Med 197, 1279–1289. Naser, S.A., Ghobrial, G., Romero, C. and Valentine, J.F. (2004). Culture of Mycobacterium avium subspecies paratuberculosis from the blood of patients with Crohn’s disease. Lancet 364, 1039–1044. Nguyen, D.Q., Webber, C. and Ponting, C. (2006). Bias of selection on human copy-number variants. PLoS Genet 2, e20. O’Neil, D.A., Porter, E.M., Elewaut, D., Anderson, G.M., Eckmann, L., Ganz, T. and Kagnoff , M.F. (1999). Expression and regulation of the human beta-defensins hBD-1 and hBD-2 in intestinal epithelium. J Immunol 163, 6718–6724. Ouahchi, K., Lindeman, N. and Lee, C. (2006). Copy number variants and pharmacogenomics. Pharmacogenomics 7, 25–29. Park, J., Chen, L., Ratnashinge, L., Sellers, T.A., Tanner, J.P., Lee, J.H. et al. (2006). Deletion polymorphism of UDP-glucuronosyltransferase 2B17 and risk of prostate cancer in African American and Caucasian men. Cancer Epidemiol Biomarkers Prev 15, 1473–1478. Perry, G.H., Dominy, N.J., Claw, K.G., Lee, A.S., Fiegler, H., Redon, R., Werner, J.,Villanea, F.A., Mountain, J.L., Misra, R., Carter, N.P., Lee, C. and Stone, A.C. (2007). Diet and the evolution of human amylase gene copy number variation. Nat Genet 39, 1256–1260. Peiffer, D.A., Le, J.M., Steemers, F.J., Chang, W., Jenniges, T. and Garcia, F.K. (2006). High-resolution genomic profiling of
■
119
chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 16, 1136–1148. Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D. et al. (1998). High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20, 207–211. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews,T.D. et al. (2006). Global variation in copy number in the human genome. Nature 444, 444–454. Rotger, M., Saumoy, M., Zhang, K., Flepp, M., Sahli, R., Decosterd, L. et al. (2007). The Swiss HIV Cohort Study. Partial deletion of CYP2B6 owing to unequal crossover with CYP2B7. Pharmacogenet Genomics 17, 885–890. Scherer, S.W., Lee, C., Birney, E., Altshuler, D.M., Eichler, E.E., Carter, N.P. et al. (2007). Challenges and standards in integrating surveys of structural variation. Nat Genet 39, S7–S15. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S. et al. (2004). Large-scale copy number polymorphism in the human genome. Science 305, 525–528. Shaffer, L.G. and Lupski, J.R. (2000). Molecular mechanisms for constitutional chromosomal rearrangements in humans. Annu Rev Genet 34, 297–329. Solinas-Toldo, S., Lampel, S., Stilgenbauer, S., Nickolenko, J., Benner, A., Dohner, H. et al. (1997). Matrix-based comparative genomic hybridization: Biochips to screen for genomic imbalances. Genes Chromosomes Cancer 20, 399–407. Stranger, B.E., Forrest, M.S., Dunning, M., Ingle, C.E., Beazley, C., Thorne, N. et al. (2007). Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853. Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A., Pertz, L.M. et al. (2005). Fine-scale structural variation of the human genome. Nat Genet 37, 727–732. Wilson, W., III, Pardo-Manuel de Villena, F., Lyn-Cook, B.D., Chatterjee, P.K., Bell, T.A., Detwiler, D.A. et al. (2004). Characterization of a common deletion polymorphism of the UGT2B17 gene linked to UGT2B15. Genomics 84, 707–714. Yang, Y., Chung, E.K., Wu, Y.L., Savelli, S.L., Nagaraja, H.N., Zhou, B. et al. (2007). Gene copy-number variation and associated polymorphisms of complement component c4 in human systemic lupus erythematosus (SLE): Low copy number is a risk factor for and high copy number is a protective factor against SLE susceptibility in European Americans. Am J Hum Genet 80, 1037–1054.
RECOMMENDED RESOURCES Affymetrix, Inc. (http://www.affymetrix.com) Agilent, Inc. (www.agilent.com/chem/goCGH) Chromosome Abnormality Database (http://www.ukcad.org. uk/cocoon/ukcad/) Database of Genomic Variants (http://projects.tcag.ca/variation) DECIPHER (http://www.sanger.ac.uk/PostGenomic/decipher/) Ensembl Genome Browser (http://www.ensembl.org/index.html) European Cytogenetics Association Register of Unbalanced Chromosome Aberrations (http://www.ecaruca.net)
Human Paralogy Database (http://humanparalogy.gs.washington.edu/ structuralvariation) Human Segmental Duplication Database (http://projects.tcag.ca/ humandup/) Illumina Inc. (www.illumina.com) NimbleGen Systems, Inc. (www.nimblegen.com) UCSC Genome Browser (http://www.genome.ucsc.edu/)
CHAPTER
10 Inter-Species Comparative Sequence Analysis: A Tool for Genomic Medicine Anthony Antonellis and Eric D. Green
INTRODUCTION In 2001, the Human Genome Project completed a draft sequence of the human genome (International Human Genome Sequencing Consortium, 2001). Since then, the human genome sequence has been refined (International Human Genome Sequencing Consortium, 2004) and attention has turned to generating the genome sequences of a number of biomedically relevant vertebrate species, including mouse (Mouse Genome Sequencing Consortium, 2002), rat (Rat Genome Sequencing Project Consortium, 2004), dog (Lindblad-Toh et al., 2005), chicken (International Chicken Genome Sequencing Consortium, 2004), chimpanzee (Chimpanzee Sequencing and Analysis Consortium, 2005), and others (www.intlgenome.org). A major rationale for elucidating these additional sequences is that their comparison with the human genome sequence would reveal important details about the functional landscape of all human chromosomes, which in turn would empower studies aiming to understand the genetic basis of human disease (Collins et al., 2003). In reality, our current understanding of the functional sequence elements residing within the 3 billion base pairs that constitute the human genome is relatively nascent. Needed are approaches that can identify such functional elements – including both sequences that encode protein (coding sequences) and those that function in other ways (non-coding Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 120
functional sequences). Importantly these functional sequences may harbor disease-associated mutations. Comparative sequence analysis (CSA) involves crossanalyzing related nucleotide or protein sequences, so as to identify similarities and differences among them. Here, we limit our discussion to one type of CSA – that involving the comparison of genomic sequences from different species. Such interspecies CSA can be used to identify genomic sequences that are highly conserved among species. When applied to the analysis of sequences from distantly related species, conserved sequences can be identified that reflect genomic regions that have not changed substantially over millions of years of evolutionary time; such sequences represent strong candidates for being functionally important. For example, protein-coding sequences are typically highly conserved among vertebrates, even between organisms as distantly related as human and fish (Aparicio et al., 2002). Similarly, many non-coding functional sequences, such as those that regulate the expression of genes, also tend to be highly conserved (e.g., Loots et al., 2000; Prabhakar et al., 2006a). In addition to identifying similarities, CSA can be used to identify differences between species. Indeed, comparing genomic sequences among closely related species can reveal sequence changes that are specific to certain lineages (e.g., human-specific sequences). Importantly, such analyses can provide insight into the evolution
Performing Comparative Sequence Analysis: Resources and Methods
of human-specific traits (e.g., Pollard et al., 2006; Prabhakar et al., 2006b). Because of its ability to provide clues about functional sequences in the genome, CSA can facilitate the identification and characterization of genes associated with human disease. Specifically, CSA can benefit such studies in several ways. First, CSA can be used to identify and prioritize the genomic regions that are most likely to harbor disease-causing mutations. For example, it is estimated that roughly 5% of the human genome sequence is functionally important (Lindblad-Toh et al., 2005; Mouse Genome Sequencing Consortium, 2002; Rat Genome Sequencing Project Consortium, 2004), but the precise nucleotides constituting that 5% are, at present, not known; CSA can be used to help hone in on that small fraction of functional sequence, prioritizing it for more detailed study in patients with genetic disease. Second, CSA can be used to characterize disease-associated sequence variants once they are identified, helping to classify them as disease-causing mutations versus innocent polymorphisms. For example, demonstrating that a variant occurs at a highly conserved nucleotide (or amino-acid residue) provides support for its relevance in the disease. Finally, CSA can aid in the design of functional analyses aiming to characterize a gene of interest in another species. For example, elucidating the function of a newly discovered disease gene often requires studying the functional consequences of mutating that gene in a model organism, such as yeast, worm, fruit fly, zebrafish, or mouse. Such studies allow rapid, in vivo analyses of gene function and mutational consequences that are not possible in patient populations. In this setting, CSA can help to identify the presence or absence of the corresponding gene (i.e., the orthologous gene) in another species as well as the conservation of affected nucleotides or amino-acid residues; if present, that gene can then be studied in greater detail using that species as a model system. These applications of CSA have implications for the practice of genomic medicine. Specifically, CSA can directly aid the identification and characterization of sequence variants, both common and rare, that are responsible for disease onset (or severity) and differential response to therapeutics. In the long run, such information will be valuable for providing patients with accurate diagnoses and prognoses, and for directing a course for proper care. In this chapter, we describe how to access and utilize resources for performing CSA, demonstrate how such studies can help identify and characterize disease-associated mutations, and discuss the role of this approach in genomic medicine.
PERFORMING COMPARATIVE SEQUENCE ANALYSIS: RESOURCES AND METHODS Step 1: Species Selection and Sequence Acquisition CSA generally involves a series of sequential steps, with the typical ones illustrated in Figure 10.1. Different resources and methodologies are utilized in each of these steps. The first step
■
121
involves acquiring genomic sequences from species of interest (Figure 10.1, Step 1), with the choice of species often dictated by the biological question(s) being asked. For example, the study of the basic genomic elements that are relevant for vertebrate development requires the acquisition of sequences from distantly related vertebrate species (Plessy et al., 2005). In contrast, examination of mammalian- or primate-specific genomic features requires sequences from discrete lineages (Prabhakar et al., 2006a). Often, the initial questions are quite broad and not associated with a priori expectations, in which case it is prudent to collect all available (i.e., already generated) genomic sequence data for subsequent CSA. Once the species have been selected, the relevant sequences for a genomic region of interest must be collected. Genomic sequences can be identified and obtained from several public databases and Web sites. These include the National Center for Biotechnology Information (NCBI; this is the home of GenBank), the University of California at Santa Cruz (UCSC) Genome Browser (Kent et al., 2002), and Ensembl (Stalker et al., 2004) (Table 10.1). Each of these sites has different routines for identifying sequences of interest. For example, genomic sequences can be obtained from the UCSC Genome Browser by selecting a species and performing a query using a gene name, marker, or chromosome band; this site is particularly useful because its graphical interface is tailored for CSA. In a similar fashion, protein sequences can be collected (e.g., for determining the conservation of an amino-acid residue implicated in a human genetic disease). In this case, a text query using the name or symbol of the human protein may be performed using either the HomoloGene or protein databases at NCBI (Table 10.1); this will yield a list of related sequences from various species that can be easily downloaded. A critical aspect of sequence acquisition is ensuring that appropriate sequences are actually obtained. Genomic regions that originate from a common ancestor are considered homologous, and these subdivide based on their origins. Homologous regions generated by genome-duplication events in a given species are considered paralogs. Homologous regions generated by speciation are considered orthologs. For the applications described here, CSA involves the study of orthologous sequences – that is, those from corresponding genomic regions among a set of species. There are two general approaches to ensure that orthologous sequences are obtained. First, alignment tools such as BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990) and BLAT (BLAST-Like Alignment Tool; Kent, 2002) (Table 10.1) can be used to search nucleotide databases, requiring that a “reciprocal best hit” is obtained for a pair of sequences. For example, a human sequence is used to query the dog genome sequence; the most closely related dog sequence is then used to query the human genome sequence. If the most closely related sequence identified in this second query is the same as the starting human sequence, then it can be assumed that those human and dog sequences are orthologous. Second, the content and order of genes within related sequences from two species can be compared, with strong similarity providing support that the sequences are orthologous.
122
CHAPTER 10
■
Comparative Sequence Analysis
Step 1. Acquire Sequence Non-Primate Mammals Non-Mammals Primates
Pig
Human
Cat
Rat
Dog
Mouse
Chicken
Chimp Fugu Cow
Zebrafish
Step 2. Align Sequences ~30 kb
Human Conservation Chimp Mouse Rat Dog Cow Chicken Zebrafish Fugu
Step 3. Identify and Localize Conserved Regions 1
2
3
4
Exons Conserved Regions Coding
Step 4a. Computational Analysis Predict Transcription Factor-Binding Sites
Non-Coding
Step 4b. Experimental Analysis In Vitro
In Vivo
Expression
LacZ Expression
Conserved Predicted Binding Sites Search for Common Motifs
Patient Screening Mutation in Conserved Region
T G C Y C A C Common Motifs
Figure 10.1
Figure Caption on Page 123.
Performing Comparative Sequence Analysis: Resources and Methods
Repetitive elements are a broad class of DNA sequences that occur more than once in a genome. Two types of these sequences that form a substantial fraction of most vertebrate genomes are tandemly repeated repetitive elements and interspersed repetitive
TABLE 10.1 Bioinformatic resources for comparative sequence analysis Sequence Databases and Sequence-Analysis Tools National Center for Biotechnology Information
www.ncbi.nlm.gov
Ensembl Genome Browser
www.ensembl.org
University of California at Santa Cruz (UCSC) Genome Browser
genome.ucsc.edu
RepeatMasker
www.repeatmasker.org
Sequence-Alignment Tools BLAST
www.ncbi.nlm.nih. gov/BLAST
BLAT
genome.ucsc. edu/cgi-bin/hgBlat
ClustalW
www.ebi.ac.uk/clustalw
PipMaker/MultiPipMaker
pipmaker.bx.psu. edu/pipmaker
VISTA
genome.lbl.gov/vista
Programs for Conservation and Motif Detection WebMCS
zoo.nhgri.nih.gov/mcs
ExactPlus
research.nhgri.nih. gov/projects/exactplus
TRANSFAC
www.gene-regulation. com/pub/databases.html
MEME
meme.sdsc.edu/meme
■
123
elements (see Chapter 1). When analyzing and comparing genomic sequences, it is important to consider whether or not to include repetitive elements. Including repetitive sequences can lead to the spurious alignment of non-orthologous sequences; excluding such sequences can result in completely missing functional elements residing therein, and, indeed, repetitive elements can harbor functional sequences (Kamal et al., 2006; Nishihara et al., 2006). For most initial CSA studies, it is reasonable to “mask” (or remove) repetitive sequences, and this can be accomplished in one of two ways. First, masking repetitive elements can be performed while downloading the sequence using tools available at the UCSC Genome Browser and Ensembl (Table 10.1). Alternatively, this step can be accomplished by subjecting already obtained sequence to processing by the program RepeatMasker (Table 10.1). Step 2: Generating Multi-Sequence Alignments The next step in CSA is to generate multi-sequence alignments (Figure 10.1, Step 2), a process in which the orthologous genomic sequences from different species are mapped (or organized) relative to one another. Various computational tools for aligning sequences have been developed and refined in recent years (Table 10.1). There are two general types of alignment methods: those that produce local alignments and those that produce global alignments. Local-alignment methods [e.g., PipMaker and MultiPipMaker (Schwartz et al., 2000; Schwartz et al., 2003)] generate alignments based on similarity of subregions within the sequence, placing more importance on local sequence identity and less on the long-range pattern of alignment across the entire sequence. As such, these methods can produce alignments of orthologous regions that are inverted or otherwise shuffled between species. In contrast, global-alignment methods [e.g., ClustalW (Thompson et al., 1994) and VISTA (Frazer et al., 2004)] generate alignments based on the similarity over the entire length of the sequence, placing more importance on maintaining the relative long-range organization of the genomic region among species and less on local sequence identity. It is generally accepted that local alignments are more appropriate for analyzing genomic regions where the long-range organization may not be well-conserved among species, while
Figure 10.1 Utilization of comparative sequence data. A generic set of steps for using comparative sequence data in human genetic studies are represented. In Step 1, sequences of a genomic region of interest are acquired from relevant (often highly diverged) species. Next (Step 2), these sequences are aligned relative to one another, revealing regions that are identical between a given species and the human reference sequence (indicated by vertical lines within the “track” for each species at the UCSC Genome Browser). Analysis of these sequences allows the overall level of conservation to be established across the region. Levels of conservation above a defined threshold can be used to designate “conserved regions” (Step 3), and these typically reflect both coding and non-coding sequences. Note the hypothetical model for a four-exon gene (in Step 3 and at the top of Step 2). In Step 4, various analyses can be performed to characterize the conserved regions in greater detail. Computational analyses (Step 4a) can include searching for predicted sites of transcription factor binding (based on the sequences of known transcription factor-binding sites) within the conserved regions as well as searching for common motifs (arrow in the upper panel indicates the transcription start site of a gene of interest). Experimental analyses (Step 4b) can include in vitro assays to assess if any or all of a conserved region influences expression of a reporter gene, in vivo testing to see if a conserved region can drive the expression of a marker gene (e.g., LacZ), and screening patients of interest for mutations in a conserved region.
124
CHAPTER 10
■
Comparative Sequence Analysis
global alignments are more appropriate for analyzing genomic regions where the long-range organization is likely to be wellconserved (Frazer et al., 2003). For aligning nucleotide sequences, PipMaker generates alignments between two orthologous sequences, while MultiPipMaker and VISTA generate alignments of up to 20 and 100 sequences, respectively (Table 10.1). For example, to identify sequences conserved in human, mouse, and rat across a genomic region of interest, an investigator would typically submit the human sequence (referred to as the “reference sequence”) to the MultiPipMaker web server followed by the orthologous sequences from mouse and rat. The program then produces a text-based multi-sequence alignment as well as a graphical representation of conservation levels across the region. Importantly, all of the output is indexed relative to the reference sequence, thereby organizing and facilitating downstream analyses. As an aside, multi-sequence protein alignments are typically generated using CLUSTALW (Table 10.1). Briefly, CLUSTALW analyzes a set of text-formatted protein sequences and produces an alignment of the amino-acid sequences. Individual regions of interest can then be examined for conservation (e.g., Figure 10.2A, left panel). Step 3: Identifying Conserved Sequences In contrast to examining relatively simple protein-sequence alignments, analyzing a multi-sequence alignment of a genomic region of interest typically involves scanning across many kilobase pairs of sequence with highly variable levels of conservation. As a result, identifying the most highly conserved regions cannot be accomplished by manual inspection of the alignment. Rather, additional computational analysis is required; indeed, some of the available alignment programs perform this step as well. For example, PipMaker and VISTA will identify the subset of regions with a defined level of conservation between two sequences in a pair-wise fashion (e.g., 70% identity over at least 100 bp). Other stand-alone programs [e.g., WebMCS (Margulies et al., 2003) and ExactPlus (Antonellis et al., 2006); Table 10.1] analyze multi-sequence alignments (e.g., generated by tools such as MultiPipMaker), and identify regions that are highly conserved across sequences from multiple species. It is important to then localize such conserved regions relative to other annotated genomic landmarks (e.g., genes); for example, classifying conserved regions into coding versus non-coding is particularly important (see Figure 10.1, Step 3). The UCSC Genome Browser and Ensembl (Table 10.1) both allow the results of such conservation analyses to be uploaded, thereby facilitating the direct assimilation of user-produced data with all other available annotations. Above, we have emphasized performing CSA in a customized, step-wise fashion. An alternative is to use the results of genome-wide CSA studies that are available on most of the genome browsers. For example, the UCSC Genome Browser provides various graphical representations of sequence conservation across the entire human genome (e.g., Figure 10.1, Step 2), often computed using multiple different methods; further,
such analyses are frequently updated and refined. The data about conserved regions of interest can be readily downloaded from such sites and studied in greater detail. One caution is that while such pre-computed analyses can be rapidly and easily accessed, the genome-wide nature of these CSA studies is associated with inherent limitations, and thus more focused and customized studies of a region of interest typically yield more informative results. Step 4: Computational and Experimental Analyses of Conserved Sequences Once sequence conservation is assessed for a genomic region of interest, computational and experimental analyses are typically performed to further investigate the role of specific sequences. In the case of conserved coding regions, the amino-acid sequence of the encoded protein can be deduced in various species. Mutation screening might reveal a human disease-associated sequence variant that changes an amino acid, and such a change can then be analyzed within the context of knowing the conservation level of that particular residue (e.g., Figure 10.2A, left panel). In addition, computational predictions can be made about mutation-associated structural changes in the encoded protein, and assays can be developed to test the function of the mutant protein. The analysis of conserved non-coding sequences is less straightforward (Figure 10.1, Steps 4a and 4b). This is because we know very little about the “language” of functional noncoding sequences, especially compared to our understanding of how coding sequences encode information about proteins. As a result, the biological relevance of newly encountered disease-associated variants in non-coding sequence is almost never immediately obvious (e.g., compared to a variant that clearly changes an amino-acid residue in the case of coding sequence). As a first step, conserved non-coding sequences can be analyzed for the presence of transcription factor-binding sites using the TRANSFAC database (Matys et al., 2006) and/or commonly occurring sequence motifs using programs such as MEME (Figure 10.1, Step 4a and Table 10.1) (Bailey and Elkan, 1994). Such analyses can help identify sequences involved in transcriptional regulation. In parallel to computational studies, conserved non-coding sequences can also be tested for regulatory activity using relevant in vitro or in vivo assays. Such studies can provide evidence for function that is directly associated with sequence conservation, thereby helping to prioritize further analyses. Finally, conserved regions can be screened for disease-associated mutations in relevant patients; it is then important to assess the extent of conservation at the position of any disease-associated variant that is encountered (e.g., Figure 10.3B, middle panel).
COMPARATIVE SEQUENCE ANALYSIS AND HUMAN GENETIC DISEASE CSA can play a valuable role in establishing the relevance of disease-associated sequence variants. Specifically, human genetic studies often lead to the identification of multiple nucleotide
Comparative Sequence Analysis and Human Genetic Disease
A. CFTR Protein-Coding Mutations Conservation of CFTR G551 Human Chimp Cow Pig Dog Mouse Rat Chicken Frog Salmon Dogfish
TLSGGQRAR TLSGGQRAR TLSGGQRAR TLSGGQRAR TLSGGQRAR TLSGGQRAR TLSGGQRAR ILSGGQRAR TLSGGQRAR TLSGGQRAR TLSGGQRAR
Function of CFTR G551
Cl TMD
*
NBD
* G551
B. CFTR Splice-Site Mutations 6211 G to T 6212 T to G 6213 A to G
6213 A to G
CFTR Intron 4
6211 G to T
AAG GTA
AAGAAG GTAATACTT... AAGAAG GTAATACTT... AAGAAG GTAATACTT... AAAAAG GTAATACTT... AAAAAG GTAATACTT... AAGAAG GTAATGCTT... AAGAAG GTAATACTT... AAGAAG GTAATACTT... AAGAAG GTAATACTT... AAGAAG GTAACTGTT... AAGAAG GTAAACAGA... AAGAAG GTAAGGGGA...
Wt CFTR mRNA
CFTR Exon 4 Human Chimp Cow Pig Sheep Dog Hedgehog Fruit Bat Mouse Chicken Zebrafish Tetraodon
Figure 10.2 Conserved sequences are important for protein function and mRNA processing. Mutations in coding and non-coding regions of the CFTR gene can cause cystic fibrosis. (A) The position of a coding mutation (G551D) is indicated in a multi-species protein-sequence alignment (left panel, arrow). The importance of this amino-acid residue in CFTR function is underscored by its conservation in species ranging from human to fish. The G551D change is shown relative to a schematic model of the CFTR protein within a plasma membrane (right panel). The G551 residue resides within a nucleotide-binding domain (NBD) that is critical for activation of the chloride channel, and the G551D mutation severely impairs this process [(Anderson and Welsh, 1992; Logan et al., 1994); note that the two green areas represent transmembrane domains (TMD)]. (B) The positions of three non-coding CFTR mutations are shown, with at least two (621+1 G to T and 621+3 A to G) resulting in improper splicing and deletion of exon 4 in the mature CFTR transcript (Hull et al., 1993; Tzetis et al., 2001). Inset: illustration of improper CFTR splicing caused by two mutations, as detected by RT-PCR. The importance of these nucleotides in CFTR mRNA processing is again underscored by the striking conservation of this noncoding sequence in species ranging from human to fish (indicated in red).
■
125
differences that correlate with the inheritance of a disease. Sorting out which of these, if any, is responsible for the disease can be challenging. CSA can establish the extent of evolutionary conservation at the genomic position of each disease-associated variant, with evidence of high conservation providing support that the variant is functionally relevant to the disease. Also note that this process can occur in a slightly reversed order, in which CSA is first used to identify conserved regions of interest, and these in turn are examined closely in patients to search for diseaseassociated sequence variants (see below). In this section, we provide illustrative examples where CSA has been used to implicate specific mutations in human genetic diseases. Conserved Sequences Important for Protein Function CSA is often used as a validation step to help implicate a protein-coding sequence variant with a genetic disease. Specifically, the presence of a disease-associated amino-acid variant at a residue that is highly conserved among widely diverged species is supportive evidence that the mutation plays a role in the disease. Cystic fibrosis (CF) is an autosomal recessive disease characterized by the production of abnormally viscous mucus, which frequently obstructs the pulmonary airways and leads to chronic, life-threatening respiratory infections. Loss-of-function mutations in the cystic fibrosis transmembrane conductance regulator gene (CFTR), which encodes a chloride channel, cause CF (Kerem et al., 1989; Riordan et al., 1989; Rommens et al., 1989). One of the most common missense mutations found in patients with CF is G551D (genet.sickkids.on.ca/cftr). Examination of the affected amino-acid residue by CSA reveals that G551, as well as its adjacent amino acids, are highly conserved among species ranging from human to fish (Figure 10.2A, left panel). This indicates that G551 is likely to be important for CFTR function. Indeed, G551 resides within the nucleotide-binding domain (NBD; Figure 10.2A, right panel), which is critical for proper function of the encoded chloride channel (Anderson and Welsh, 1992). Furthermore, the G551D mutation appears to cause abnormal channel activity (Logan et al., 1994). Conserved Sequences Important for mRNA Processing Identifying disease-associated mutations in protein-coding sequences is relatively straightforward. However, screening a gene or genomic region implicated in a disease does not always reveal mutations in coding sequence, making it necessary to search for mutations in the surrounding non-coding sequence. A common strategy is to then analyze conserved non-coding nucleotides – especially those at positions with a known or expected role in gene function. Proper splicing of heteronuclear RNA to form mature messenger RNA (mRNA) is essential for proper protein translation. The splicing process is carried out by a complex set of interactions between intron sequences and the nuclear spliceosome
126
CHAPTER 10
■
Comparative Sequence Analysis
A. CX32 Promoter Mutations
B. RET Enhancer Mutation Conserved Disease-Associated SNP in RET Intron 1
SOX10-Binding Sites
RET Intron 1 CX32 Exon 1 SNPs
CATTGTGCA CAGTGTGCA CATTCTGCA
Mutation Analysis 528 T to G 526 G to C
Experimental Analysis Wt Promoter
526 G to C
35% of Wt
528 T to G
50% of Wt Expression
RET3 C to T
Conserved Regions
MCS9.7
Sequence Conservation of RET+3 C Allele RET+3 C to T Human Chimp Baboon Cow Pig Cat Dog Rat Mouse
GGGGGCC-AGTGACCCTTACACGGTCATCCACAGGCCACTTGG GGGGGCC-AGTGACCCTTACACGGTCATCCACAGGCCACTTGG GGGGGCC-AGTGACCCTTACACGGTCATCCATAGGCCACTTGG GGGAGCCTGGTGACCTGCACACAGTCATCAGCAGGCCACTTGG GCGGACTTTGTGACCCTCACACGGTCATCATCAGGCCACTTGG GTGGGCCTGGTGACCCCCACACAGTCATCAGCAGGCCACTTGG GCGGGCCTGGTGACCCACACACAGTCATCAGCAGGCCACTTGG AGGGGCCTGGTCACTCACACGCACTCATCCACAGGCCACTTGG AGAGGTCTGGTCACTCACACACGCTCATCCCCAGGCCACTTGG Predicted Retinoic Acid Response Elements
Conservation Analysis 528 T to G 526 G to C Human Mouse Rat Rabbit Dog Opossum Fugu
CATTGTGCA CATTGTACA CATTGTACA TATTGTGCA CATTGTGCA TATTGTGTA CATTGTATG
Experimental Analysis of RET+3 Alleles Promoter Only MCS9.7 / RET3 C Allele 6.3-fold change MCS9.7 / RET3 T Allele 0
10 20 30 Expression
40
Figure 10.3 Conserved sequences are important for gene expression.Two examples of non-coding, disease-causing mutations in regulatory elements are schematically illustrated. (A) Mutations in the CX32 promoter (arrow indicates transcription start site) are associated with one form of Charcot-Marie-Tooth disease. Specifically, mutations in a SOX10-binding site (indicated in red) have been identified in patients with Charcot-Marie-Tooth disease. Experimental analysis of the wild-type (Wt) and mutant (526 G to C and 528 T to G) promoters, involving in vitro expression of a reporter gene, demonstrates mutation-associated decreases in promoter activity (based on Houlden et al. 2004). Conservation analysis reveals that the mutated nucleotides are conserved in vertebrates (highlighted in red). (B) Within intron 1 of the RET gene, a specific single-nucleotide polymorphism (SNP) – RET+3 C to T – is highly associated with Hirschsprung disease in a large subset of patients. That SNP resides within a conserved region – MCS+9.7 (grey box in upper panel). The RET+3 C allele is conserved among mammals (middle panel; highlighted in red), with that nucleotide position residing between two predicted transcription factor-binding sites involved in the response to retinoic acid (grey shading). Experimental analysis, which involved testing the MCS+9.7 segment for enhancer activity, revealed that the disease-associated RET+3 T allele shows greatly reduced expression compared to the wild-type RET+3 C allele. These studies demonstrate that a mutation in a RET-associated enhancer is responsible for Hirschsprung disease in some patients (based on Emison et al. 2005).
machinery. Specifically, three intron sequences are required for splicing to occur: (1) a donor site at the 5-end of the intron beginning with the sequence GT(A/G)ATG; (2) an acceptor site at the 3-end of the intron ending with the sequence CAG; and (3) an internal branch site with the sequence CT(A/G)A(C/T). Indeed, the identification of disease-causing mutations in these sequences underscores their importance in gene function. In a similar fashion to the analysis of amino-acid residues, the functional importance of nucleotides involved in mRNA processing can be established by CSA. In addition to missense mutations, CF can be caused by splice-site mutations in the CFTR gene. For example, three
mutations in the CFTR intron 4 donor splice site (621+1 G to T, 621+2 T to G, and 621+3 A to G) have been identified in patients with CF (genet.sickkids.on.ca/cftr), and these likely lead to the deletion of exon 4 in mature CFTR transcripts (Figure 10.2B, inset) (Hull et al., 1993; Tzetis et al., 2001). Comparing the genomic sequence of this region in highly diverged species reveals complete conservation of the 621+1 G, 621+2 T, and 621+3 A nucleotides (Figure 10.2B). Interestingly, while the donor-site consensus sequence GT(A/G)ATG appears to allow for an A to G nucleotide change at position 621+3, the G allele appears to cause CF and the A allele is conserved to fish (Figure 10.2B). Thus, CSA supports a pathogenic role for all three of
Comparative Sequence Analysis and Human Genetic Disease
these mutations and illustrates the importance of donor-site nucleotides for proper mRNA processing. Conserved Sequences Important for Promoter Function Transcriptional regulation is mainly controlled by trans-acting proteins (e.g., transcription factors) and cis-acting sequences (e.g., promoters, enhancers, and repressors). Sequence variants affecting either of these components can alter levels of transcription and lead to disease. Other non-coding sequences typically screened for diseaseassociated mutations are those corresponding to putative cis-acting transcriptional regulatory elements. This frequently begins by analyzing the region thought to harbor a gene’s promoter; specifically, the segment immediately upstream (roughly 500–2000 bases) of a gene’s known or predicted transcription start site (Figure 10.3A, arrow). Charcot-Marie-Tooth (CMT) disease consists of a heterogeneous group of inherited peripheral neuropathies that are generally characterized by muscle weakness and wasting in the distal extremities. One form of CMT, CMTX1, is associated with peripheral nerve demyelination and is caused by mutations in the connexin 32 gene (CX32) (Ionasescu et al., 1994). CX32 encodes a gap-junction protein expressed in Schwann cells that facilitates the radial transport of material through the concentric layers of the myelin sheath. One transcription factor involved in CX32 expression is SRY-related HMG-box 10 (SOX10), which binds to two sequences in the CX32 promoter (Bondurand et al., 2001). Interestingly, two mutations (526 G to C and 528 T to G) have been identified in these SOX10binding sites in patients with CMTX1 (Figure 10.3A, top) (Houlden et al., 2004). To establish the functional consequences of these disease-associated mutations, the promoter activities of fragments bearing each allele were tested. Specifically, CX32promoter fragments bearing the wild-type sequence, 526 G to C, or 528 T to G were cloned upstream of a luciferase reporter gene; in turn, each construct was transfected into cultured cells. There is a notable decrease in reporter-gene expression with 526 G to C and 528 T to G compared to the wild-type promoter sequence (Figure 10.3A, middle) (Houlden et al., 2004). Finally, CSA studies reveal that the wild-type nucleotides (526 G and 528 T) are conserved across a diverse array of vertebrates. Thus, CSA adds to a body of evidence supporting a pathogenic role for two CX32-promoter mutations in CMT disease. Conserved Sequences Important for the Function of Distal Regulatory Elements In the above examples, CSA was used to help implicate identified amino-acid and nucleotide variants with a human genetic disease. These cases all involved regions within genes that are relatively “obvious” to search for such mutations (i.e., coding, splice-site, and promoter sequences). A current obstacle in human genetics is finding disease-causing variants in other types of non-coding functional sequences, which are more difficult to
■
127
identify (i.e., less “obvious”). For example, distal transcriptional regulatory elements can reside within large introns or many kilobase pairs away from their associated gene. Yet, such sequences are believed to harbor mutations that cause human genetic disease. CSA can play a major role in identifying such elements in the first place, and these in turn can be screened for mutations (Figure 10.1, Step 4b). Hirschsprung disease is a genetically heterogeneous developmental disorder of the enteric nervous system, which is characterized by the absence of enteric ganglia in the distal colon. To date, eight loci have been implicated in the disease, with the majority of mutations identified residing in the RET gene, which encodes the ret proto-oncogene (Edery et al., 1994). However, mutations in these genes can only be identified in less than 30% of patients with Hirschsprung disease, indicating that there are other disease-associated genes or novel alleles of already identified genes. A common Hirschsprung disease-associated haplotype was identified involving RET intron 1, with no associated RET coding mutation (Carrasquillo et al., 2002). This haplotype contains multiple single-nucleotide polymorphisms (SNPs; red triangles in Figure 10.3B, top panel) that are associated with the disease, rendering it difficult to determine which one, if any, is pathogenic. CSA was used to identify highly conserved regions in and around RET intron 1 (Emison et al., 2005). One such region (MCS+9.7) overlapped one of the disease-associated SNPs (RET+3; Figure 10.3B, top panel). Importantly, the T allele of this SNP is over-transmitted to affected individuals, and the C allele is highly conserved among mammals (Figure 10.3B, middle panel). Analysis of the MCS+9.7 sequence for transcription factor-binding sites revealed two predicted, highly conserved retinoic acid-binding sites that flank the RET+3 SNP (Figure 10.3B, middle panel). Retinoic acid is known to regulate RET expression in certain tissues (Batourina et al., 2001; Shoba et al., 2002), suggesting that these sites are biologically relevant. Finally, experimental analysis of each RET+3 allele revealed enhancer activity with the C allele and significantly lower activity, with the T allele (Figure 10.3B, bottom panel), suggesting that this genomic region may function as an enhancer to regulate RET expression (Emison et al., 2005). In this example, CSA played a central role in identifying candidate functional non-coding sequences that were then screened for disease-associated variants. Without the evidence of sequence conservation, it would have been far more difficult to ultimately establish that this region functions as an enhancer or that one of the variants residing therein is disease-causing. The Utility of CSA in the Study of Complex Genetic Diseases A key promise of the Human Genome Project was to enhance the study of complex genetic diseases. This could involve using the generated human genomic sequence for identifying genetic variants in human populations. In turn, these variants can be studied in genome-wide association studies to identify risk alleles for
128
CHAPTER 10
■
Comparative Sequence Analysis
common genetic diseases (see Chapter 8), including, for example, type II diabetes (e.g., Scott et al., 2007 and Zeggini et al., 2007). A major challenge with such studies is that it typically is unclear whether a given identified variant is conferring disease risk, or if it is just closely linked to the true risk variant. CSA can be used to evaluate the degree of evolutionary conservation of a given region harboring identified variants. Those variants of particular interest (e.g., perhaps because they reside in a highly conserved region) can then be tested in experimental assays (see above for examples) to study their functional consequences. Another important consideration for interpreting the results of CSA in studying complex genetic diseases is the expected molecular pathology of the disease in question. If a loss of gene function is expected in disease pathogenesis, then CSA can be quite informative since conservation will often imply function. Multiple examples of this are provided above. Alternatively, if a gain of gene function is expected in disease pathogenesis (e.g., certain neurodegenerative diseases and some forms of cancer), then CSA may be less informative. On the one hand, a gainof-function mutation may mutate a highly conserved nucleotide or amino acid to generate higher levels of a gene product (e.g., an increase in affinity between a promoter and a transcription factor) or a gene product with a higher activity (e.g., an increase in affinity for a binding partner); here, CSA can provide information for implicating a disease-causing variant. On the other hand, a gain-of-function mutation may change a lessconserved nucleotide or amino acid to one that creates a novel or toxic function of the encoded protein; here, CSA would not help to support this variant as a probable disease-causing mutation.
CSA AND THE FUTURE OF HUMAN GENETICS AND GENOMIC MEDICINE The availability of sequences of the human genome and that of multiple other evolutionarily diverse species is greatly advancing the study of human genetic diseases. In this chapter, we have focused on how CSA can facilitate the identification and study of disease-causing mutations. The identification of genes defective in diseases that are inherited in a Mendelian fashion is now (almost) a routine process. Aided by CSA, investigators can now screen a more-complete catalog of conserved protein-coding and non-coding functional elements for mutations in such disease-associated genes. In a similar fashion, these same conserved sequences can be carefully scrutinized for the presence of mutations
causing more-common, but genetically complex, diseases (see section “The Utility of CSA in the Study of Complex Genetic Diseases” for a discussion of this issue). A related utility of CSA will be in more-completely characterizing the functional landscape of the human genome. Specifically, by establishing which sequences have been conserved over various evolutionary time frames (e.g., between human and mouse or between human and fish) and by linking those sequences to specific biological functions, an enhanced view of human genome function should emerge. Such a comprehensive examination of sequence conservation and sequence function is the primary goal of the ENCODE (ENCyclopedia Of DNA Elements) Project (www.genome.gov/ENCODE) (ENCODE Project Consortium, 2007). For this Project, a consortium of investigators is performing myriad computational (including CSA) and experimental studies (some similar to the ones described above) to catalog all functional sequences in the human genome, initially focusing on a targeted set of regions that comprise 1% of the genome. This last application of CSA has important implications for genomic medicine. It is predicted that the human genome harbors 3 to 5 million sites of SNPs that differ between any two individuals (Altshuler and Clark, 2005). While many of these SNPs have no functional consequence, a subset of them reside in functional regions of the genome and are functionally important. These latter SNPs may contribute to disease onset (as described above), modify the phenotypes caused by mutations at other loci (Nadeau, 2001), or modulate a patient’s response to medication (Goldstein et al., 2003). CSA can aid the identification of SNPs that are most likely to have functional consequences. For example, a subset of patients may experience adverse side effects when treated with a given drug. These side effects may be caused by an allele of a SNP within an enhancer that increases the expression of a gene in the pathway that metabolizes that drug. Through the use of CSA and other studies, this SNP can be identified and a test developed that determines which patients are most likely to experience the adverse side effects upon treatment with that drug (see Chapter 27). In summary, CSA continues to hold enormous potential as a tool for research in genome function, human genetics, and genomic medicine. One of the remaining challenges is to establish which species are most relevant for each biological question (Figure 10.1, Step 1). In the coming years, the continued collection and analysis of genomic sequences from multiple species should make this less of an issue, thereby enhancing the utility of CSA and its applications.
REFERENCES Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403–410. Altshuler, D. and Clark, A.G. (2005). Harvesting medical information from the human family tree. Science 307, 1052–1053.
Anderson, M.P. and Welsh, M.J. (1992). Regulation by ATP and ADP of CFTR chloride channels that contain mutant nucleotide-binding domains. Science 257, 1701–1704. Antonellis, A., Bennett, W.R., Menheniott, T.R., Prasad, A.B., Lee-Lin, S.Q., Green, E.D., Paisley, D., Kelsh, R.N., Pavan, W.J.
References
and Ward, A. (2006). Deletion of long-range sequences at Sox10 compromises developmental expression in a mouse model of Waardenburg-Shah (WS4) syndrome. Hum Mol Genet 15, 259–271. Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A. et al. (2002). Wholegenome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301–1310. Bailey, T.L. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36. Batourina, E., Gim, S., Bello, N., Shy, M., Clagett-Dame, M., Srinivas, S., Costantini, F. and Mendelsohn, C. (2001). Vitamin A controls epithelial/mesenchymal interactions through Ret expression. Nat Genet 27, 74–78. Bondurand, N., Girard, M., Pingault, V., Lemort, N., Dubourg, O. and Goossens, M. (2001). Human Connexin 32, a gap junction protein altered in the X-linked form of Charcot-Marie-Tooth disease, is directly regulated by the transcription factor SOX10. Hum Mol Genet 10, 2783–2795. Carrasquillo, M.M., McCallion, A.S., Puffenberger, E.G., Kashuk, C. S., Nouri, N. and Chakravarti, A. (2002). Genome-wide association study and mouse model identify interaction between RET and EDNRB pathways in Hirschsprung disease. Nat Genet 32, 237–244. Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87. Collins, F.S., Green, E.D., Guttmacher, A.E. and Guyer, M.S. (2003). A vision for the future of genomics research. Nature 422, 835–847. Edery, P., Lyonnet, S., Mulligan, L.M., Pelet, A., Dow, E., Abel, L., Holder, S., Nihoul-Fekete, C., Ponder, B.A. and Munnich, A. (1994). Mutations of the RET proto-oncogene in Hirschsprung’s disease. Nature 367, 378–380. Emison, E.S., McCallion, A.S., Kashuk, C.S., Bush, R.T., Grice, E., Lin, S., Portnoy, M.E., Cutler, D.J., Green, E.D. and Chakravarti, A. (2005). A common sex-dependent mutation in a RET enhancer underlies Hirschsprung disease risk. Nature 434, 857–863. ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816. Frazer, K.A., Elnitski, L., Church, D.M., Dubchak, I. and Hardison, R.C. (2003). Cross-species sequence comparisons: A review of methods and available resources. Genome Res 13, 1–12. Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M. and Dubchak, I. (2004). VISTA: Computational tools for comparative genomics. Nucleic Acids Res 32,W273–W279. Goldstein, D.B., Tate, S.K. and Sisodiya, S.M. (2003). Pharmacogenetics goes genomic. Nat. Rev. Genet. 4, 937–947. Houlden, H., Girard, M., Cockerell, C., Ingram, D., Wood, N.W., Goossens, M., Walker, R.W. and Reilly, M.M. (2004). Connexin 32 promoter P2 mutations: A mechanism of peripheral nerve dysfunction. Ann Neurol 56, 730–734. Hull, J., Shackleton, S. and Harris, A. (1993). Abnormal mRNA splicing resulting from three different mutations in the CFTR gene. Hum Mol Genet 2, 689–692. International Chicken Genome Sequencing Consortium (2004). Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921.
■
129
International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. Ionasescu,V., Searby, C. and Ionasescu, R. (1994). Point mutations of the connexin32 (GJB1) gene in X-linked dominant Charcot-MarieTooth neuropathy. Hum Mol Genet 3, 355–358. Kamal, M., Xie, X. and Lander, E.S. (2006). A large family of ancient repeat elements in the human genome is under strong selection. Proc Natl Acad Sci U S A 103, 2740–2745. Kent, W.J. (2002). BLAT--the BLAST-like alignment tool. Genome Res 12, 656–664. Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M. and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996–1006. Kerem, B., Rommens, J.M., Buchanan, J.A., Markiewicz, D., Cox, T.K., Chakravarti, A., Buchwald, M. and Tsui, L.C. (1989). Identifica-tion of the cystic fibrosis gene: Genetic analysis. Science 245, 1073–1080. Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe, D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas, E.J., Zody, M.C. et al. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438, 803–819. Logan, J., Hiestand, D., Daram, P., Huang, Z., Muccio, D.D., Hartman, J., Haley, B., Cook, W.J. and Sorscher, E.J. (1994). Cystic fibrosis transmembrane conductance regulator mutations that disrupt nucleotide binding. J Clin Invest 94, 228–236. Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M. and Frazer, K.A. (2000). Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136–140. Margulies, E.H., Blanchette, M., Haussler, D. and Green, E.D. (2003). Identification and characterization of multi-species conserved sequences. Genome Res 13, 2507–2518. Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K. et al. (2006). TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34, D108–D110. Mouse Genome Sequencing Consortium (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562. Nadeau, J.H. (2001). Modifier genes in mice and humans. Nat Rev Genet 2, 165–174. Nishihara, H., Smit, A.F. and Okada, N. (2006). Functional noncoding sequences derived from SINEs in the mammalian genome. Genome Res 16, 864–874. Plessy, C., Dickmeis, T., Chalmel, F. and Strahle, U. (2005). Enhancer sequence conservation between vertebrates is favoured in developmental regulator genes. Trends Genet 21, 207–210. Pollard, K.S., Salama, S.R., Lambert, N., Lambot, M.A., Coppens, S., Pedersen, J.S., Katzman, S., King, B., Onodera, C., Siepel, A. et al. (2006). An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 149–150. Prabhakar, S., Poulin, F., Shoukry, M.,Afzal,V., Rubin, E.M., Couronne, O. and Pennacchio, L.A. (2006a). Close sequence comparisons are sufficient to identify human cis-regulatory elements. Genome Res 16, 855–862. Prabhakar, S., Noonan, J.P., Paabo, S. and Rubin, E.M. (2006b). Accelerated evolution of conserved noncoding sequences in humans. Science 314, 786. Rat Genome Sequencing Project Consortium (2004). Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521.
130
CHAPTER 10
■
Comparative Sequence Analysis
Riordan, J.R., Rommens, J.M., Kerem, B., Alon, N., Rozmahel, R., Grzelczak, Z., Zielenski, J., Lok, S., Plavsic, N., Chou, J.L. et al. (1989). Identification of the cystic fibrosis gene: Cloning and characterization of complementary DNA. Science 245, 1066–1073. Rommens, J.M., Iannuzzi, M.C., Kerem, B., Drumm, M.L., Melmer, G., Dean, M., Rozmahel, R., Cole, J.L., Kennedy, D., Hidaka, N. et al. (1989). Identification of the cystic fibrosis gene: Chromosome walking and jumping. Science 245, 1059–1065. Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R. and Miller, W. (2000). PipMaker – a web server for aligning two genomic DNA sequences. Genome Res 10, 577–586. Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Green, E.D., Hardison, R.C. and Miller, W. (2003). MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 31, 3518–3524. Scott, L.J., Mohlke, K.L., Bonnycastle, L.L., Willer, C.J., Li, Y., Duren, W.L., Erdos, M.R., Stringham, H.M., Chines, P.S., Jackson, A.U., Prokunina-Olsson, L. et al. (2007). A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345.
Shoba, T., Dheen, S.T. and Tay, S.S. (2002). Retinoic acid influences Phox2 expression of cardiac ganglionic cells in the developing rat heart. Neurosci Lett 321, 41–44. Stalker, J., Gibbins, B., Meidl, P., Smith, J., Spooner, W., Hotz, H.R. and Cox, A.V. (2004). The Ensembl Web site: Mechanics of a genome browser. Genome Res 14, 951–955. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680. Tzetis, M., Efthymiadou, A., Doudounakis, S. and Kanavakis, E. (2001). Qualitative and quantitative analysis of mRNA associated with four putative splicing mutations (621+3A→G, 2751+2T→A, 296+1G→C, 1717-9T→C-D565G) and one nonsense mutation (E822X) in the CFTR gene. Hum Genet 109, 592–601. Zeggini, E., Weedon, M.N., Lindgren, C.M., Frayling, T.M., Elliott, K.S., Lango, H., Timpson, N.J., Perry, J.R., Rayner, N.W., Freathy, R.M. et al. (2007). Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 316, 1336–1341.
RECOMMENDED RESOURCES Websites http://www.intlgenome.org Provides an overview of major genomesequencing projects as well as links to species-specific databases and other useful web resources. http://www.genome.gov/ENCODE Describes the rationale and specific components of the ENCODE (ENCyclopedia Of DNA Elements) Project. Also provides direct links to the generated data sets.
Articles Dermitzakis, E.T., Reymond, A. and Antonarakis, S.E. (2005). Conserved non-genic sequences – an unexpected feature of mammalian genomes. Nat Rev Genet 6, 151–157. This review gives an introduction to conserved non-coding sequences, and discusses their potential roles in genome biology.
Kleinjan, D.A. and van Heyningen, V. (2005). Long-range control of gene expression: Emerging mechanisms and disruption in disease. Am J Hum Genet 76, 8–32. This review discusses long-range transcriptional regulatory elements, and describes examples where disruption of such elements cause human disease. Miller, W., Makova, K.D., Nekrutenko, A. and Hardison, R.C. (2004). Comparative genomics. Annu Rev Genomics Hum Genet 5, 15–56. Nardone, J., Lee, D.U., Ansel, K.M. and Rao A. (2004). Bioinformatics for the “bench biologist”: How to find regulatory regions in genomic DNA. Nat Immunol 5, 768–774. These two reviews provide an excellent overview of comparative sequence analysis, and discuss a number of issues to consider when pursuing such studies.
CHAPTER
11 DNA Methylation Analysis: Providing New Insight into Human Disease Susan Cottrell, Theo deVos, Juergen Distler, Carolina Haefliger, Ralf Lesche, Achim Plum and Matthias Schuster
INTRODUCTION The human genome contains four bases – guanine, adenine, thymine, and cytosine; but the cytosines can be either methylated or unmethylated at the fifth carbon position in the pyrimidine ring (Figure 11.1a). In general, cytosines can only be methylated when they are in the context of a CpG dinucleotide, or in other words, a cytosine immediately followed by a guanine. These CpG dinucleotides are under-represented in the genome, occurring at only about 20% of the frequency expected assuming random distribution. In some regions, called CpG islands, the frequency of CpG dinucleotides is only 65% of the expected frequency (Takai and Jones, 2002). These stretches of 500–1000 base pairs typically overlap the promoter or exon 1 region of genes, and there are estimated to be 30,000 CpG islands in the human genome (Lander et al., 2001). CpGs in islands are predominantly unmethylated, while CpGs outside of islands are typically methylated. The methylation status of a CpG island is normally correlated with the chromatin structure and expression levels of nearby genes (Stirzaker et al., 2004). CpG islands associated with actively transcribed genes are typically unmethylated. When a CpG island Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
is methylated, methyl-CpG-binding domain (MBD) proteins recognize the methylated CpG and recruit the necessary factors for chromatin condensation and gene inactivation. This DNA methylation state is maintained during cell division by a family of enzymes called DNA methyltransferases (DNMTs). DNMT1 is thought to be responsible for maintaining methylation patterns through cell divisions, and DNMT3b is the primary enzyme involved in de novo methylation (Lopatina et al., 2002). Mutations in one of the DNMT genes, DNMT3B, are responsible for the rare genetic disorder ICF syndrome, which is characterized by severe immunodeficiency and DNA hypomethylation (Hansen et al., 1999). Mice lacking any of the DNMTs die either in utero or soon after birth (Okano et al., 1999). Although most CpG islands in the human genome are typically unmethylated in healthy cells, methylation of some islands plays a role in several normal physiological processes. In females, CpG islands on one of the X chromosomes become hypermethylated in the process of chromosomal inactivation (Hellman and Chess, 2007; Mohandas et al., 1981). On autosomes, methylation is also important during development for inactivation of maternally- or paternally-derived alleles of certain genes in a process called genomic imprinting. For instance, a region 2 kb upstream Copyright © 2009, Elsevier Inc. All rights reserved. 131
(a)
■
DNA Methylation Analysis
NH2
—
—
NH2
C
C
N C — O—
N Cytosine
C
N
C
C — O—
CH3
(c)
C
CH3 —
—
CHAPTER 11
genomic DNA
5-TCGCTGG-3 3-AGCGACC-5 —
132
CH3
C bisulfite treatment
N
5-methyl-cytosine
(b)
bisulfiteconverted DNA Methylation-specific restriction digestion
5-TCGUTGG-3
bis I strand
3-AGCGAUU-5
bis II strand
PCR
bisulfite-converted PCR amplified DNA Amplification
5-TCGTTGG-3 3-AGCAACC-5
“G-rich” strand “C-rich” strand
3-TCGCTAA-5 5-AGCGATT-3
Figure 11.1 Detection of methylated cytosines. (a) Cytosine and 5-methyl-cytosine. (b) Cleavage by some restriction enzymes is blocked by methylation of cytosines in the recognition site. After digestion, various methods can be used to detect the intact methylated DNA, including locus-specific PCR amplification. (c) Genomic DNA is shown at the top with methylated (red) and unmethylated (green) cytosines. The bisulfite treatment converts unmethylated cytosines into uracils. The resulting strands are no longer complementary, and each strand can be amplified by conversion-specific PCR primers.
of the H19 gene for a non-coding RNA is paternally methylated, leading to repressed IGF2 expression (Bartolomei et al., 1993). During the process of carcinogenesis, two important changes in the methylation pattern of the genome occur. First, sporadic CpGs (outside of CpG islands) become hypomethylated (Ehrlich et al., 1982). This global hypomethylation has been linked to genomic instability, possibly due to mobilization of repetitive elements (Chen et al., 1998; Feinberg, 2004). Meanwhile, some CpG islands, particularly near tumor-suppressor genes, can become hypermethylated during carcinogenesis (Herman and Baylin, 2003; Jones and Baylin, 2002). The hypermethylation is functionally similar to a deletion or mutation and can serve as one method of gene inactivation in Knudson’s classic two-hit hypothesis (Jones and Laird, 1999). Different combinations of aberrantly methylated genes are found in nearly all types of cancer (Costello et al., 2000).
TECHNOLOGY TO ASSESS DNA METHYLATION Cytosine methylation has minimal impact on both thermodynamics and kinetics of common DNA polymerases and
can, therefore, not be directly detected by any amplification or hybridization-based methods. Instead, there are three basic principles that can be employed to distinguish differentially methylated DNA sequences: (i) methylation-specific restriction enzymes (Figure 11.1b), (ii) methylation-specific antibodies and proteins, and (iii) methylation-specific bisulfite modification (Figure 11.1c). All existing methods for discovering and evaluating DNA methylation markers are based on one or more of these principles. Methylation-sensitive restriction enzymes, which typically digest unmethylated but not methylated DNA, have been in use for many years (Bird and Southern, 1978). The recognition sites contain one or more CpG sites. Sometimes cleavage is blocked completely by methylation of the CpG site, but more often the rate of cleavage is affected. Therefore, these methods must be optimized for complete and specific digestion. More recently, restriction enzymes have been introduced that specifically digest methylated DNA (Krueger et al., 1995). Methods based on restriction enzymes are limited to CpGs that are present within a restriction site. The second major category of DNA methylation methodology relies on naturally occurring proteins whose activity requires methylation detection. Methylation-specific antibodies
Technology to Assess DNA Methylation
have been generated that specifically bind to DNA containing methylated cytosines (Gebhard et al., 2006). In addition, methylbinding proteins, which play an integral part in bridging DNA methylation information with cellular function, have been successfully used to separate differentially methylated DNA fractions (Cross et al., 1994). Methylation analysis methods based on bisulfite conversion of sample DNA rely on the unique property of inorganic bisulfite to transform unmethylated cytosine to uracil (Figure 11.1) without affecting methylated cytosines (Frommer et al., 1992). Thus, the bisulfite reaction converts the epigenetic methylation information into sequence information, which can be analyzed by common molecular biology methods, most notably PCR. By converting most of the cytosines to uracils, bisulfite conversion of DNA has the effect of reducing the complexity of the genome, thereby making primer design and multiplexing more challenging. In addition, the chemical reaction can cause DNA damage if sub-optimal conditions are used (Grunau et al., 2001). Despite these caveats, bisulfite-based methods are extremely powerful for sensitive and quantitative analysis of DNA methylation at one or many CpG sites. Methods for Methylation Marker Discovery The human genome contains tens of thousands of CpG islands associated with individual genes and their activity. The number of individual methylation positions that could be aberrantly methylated in diseased states is several orders of magnitude higher. Several genome-wide discovery methods to identify differentially methylated regions associated with the specific diagnostic question have been developed. Many of these methods are based on digestion of sample DNA with a methylationsensitive restriction enzyme. The individual discovery techniques differ in their methods for detection of these restriction cuts, but they include techniques such as linker ligation combined with subtractive hybridization (MCA; Toyota et al., 1999), arbitrarily primed PCR (AP-PCR; Liang et al., 2002), array hybridization (Differential methylation hybridization, DMH; Huang et al., 1999), and two-dimensional gel electrophoresis (RLGS; Costello et al., 2002; Hatada et al., 1991). These methylation marker discovery methods use methylation-sensitive restriction enzymes to differentiate two classes of samples, such as a DNA from samples of responders to a drug and DNA from non-responders. For a thorough analysis, the entire marker discovery process often involves multiple comparisons. Furthermore, the throughput is low enough that only a few samples can reasonably be processed. Using pooled DNA from multiple patients reduces the likelihood of identifying polymorphic differences associated with individual patients, and it allows for the enrichment of methylation differences associated with disease, which is essential for identification of markers with high prevalence for screening applications. One approach that is becoming more commonly used is array-based discovery of aberrant methylation (Fukasawa et al., 2006; Hayashi et al., 2007;Taylor et al., 2007). In DMH, originally used for the identification of epigenetic alterations in breast and
■
133
ovarian cancer (Huang et al., 1999; Yan et al., 2001), genomic DNA is digested by a methylation-insensitive restriction enzyme and then linkers are ligated to the fragments. These products are then digested with a methylation-sensitive restriction enzyme and finally amplified by PCR via the linker sequence. Amplificates are only generated from methylated templates that are not digested by the methylation-sensitive restriction enzymes. The PCR products can be labeled with fluorescent dyes and hybridized to a CpG-island microarray. A recent paper describes the combination of a modified and optimized pre-chip DMH workflow and a custom-made oligonucleotide array representing approximately 50,000 regions of the human genome (Lewin et al., 2007). The advantages of DMH are its reproducibility, speed, and throughput (allowing the option of analyzing individual samples instead of pools). The readout of the microarray not only identifies the differentially methylated sequence, but also provides semi-quantitative methylation values. Recently, the Human Epigenome Project, a large-scale effort to identify, catalog, and interpret genome-wide DNA methylation patterns of all human genes in all major tissues, has been initiated by The Wellcome Trust Sanger Institute (Cambridge, UK), the Centre National de Génotypage (Paris, France), and Epigenomics AG (Berlin, Germany) (Table 11.1). Based on public versions of the sequence of the human genome established by the Human Genome Project, the Human Epigenome Project will systematically uncover the epigenetic information layer that is still hidden to a large extent. A second Human Epigenome Project was recently proposed by an international group of 40 scientists (Esteller, 2006; Jones and Martienssen, 2005) (Table 11.1). Methods for Analyzing Markers in Tissue Samples Methylation analysis in tissues is often not limited by the quantity of sample material available, but the quality of the material, particularly formalin-fixed and paraffin-embedded samples, can be an issue (Schuster, 2004). The isolated DNA is usually damaged to some extent depending on the fixation procedure. In addition, the proportion of the tissue type with aberrant methylation (i.e., the percent tumor) can vary from sample to sample. Laser capture microdissection (LCM), which is used to isolate the tissue type of interest, helps overcome this issue. There are very different levels at which DNA methylation can be assessed (Figure 11.2) (Laird, 2003), ranging from genomewide to a particular locus or CpG. The “methylation content” of a cell population corresponds to the overall proportion of methylated cytosines within the entire genome. In contrast, the “methylation level” designates the percentage of methylated DNA strands at one genomic locus, while the “methylation pattern” describes the individual methylation status of a specific set of CpG positions on one DNA strand within a region of interest. Tissue testing is best performed using marker information contained in the “methylation level”. In principle, the methylation level of a DNA sample can be measured in a straightforward manner after bisulfite treatment because the methylation ratio at the CpG site of interest is
134
CHAPTER 11
TABLE 11.1
■
DNA Methylation Analysis
Multicenter efforts for large-scale epigenetic analysis Goals
Human Epigenome Project
●
●
International Alliance for the Human Epigenome and Disease (AHEAD)
●
Encyclopedia of DNA Elements (ENCODE)
●
●
Approach
Institutions
References
“To identify, catalogue and interpret genomewide DNA methylation patterns of all human genes in all major tissues” Identify “Methylation Variable Positions”
Bisulfite DNA sequencing
Wellcome Trust Sanger Institute; Epigenomics, AG; The Centre National de Genotypage
Eckhardt et al., 2006 http://www.epigenome. org/
A comprehensive catalog of potential chromatin alterations Describe “reference epigenomes”
To be determined; likely to include a variety of methods to assess DNA methylation and histone modifications
To be determined
Jones and Martienssen 2005
Identify all functional elements in the human genome
Chip-on-chip assays for DNA methylation and chromatin modifications; comparative genomics; DNase I hypersensitivity site mapping
Welcome Trust Sanger Institute; Stanford University; University of Virginia; Affymetrix; Municipal Institute of Medical Research; Ludwig Institute for Cancer Research; Yale University; University of Washington; others
ENCODE project consortium, 2007 http://www.genome. gov/10005107
(a) Methylation content
(b) Methylation level
(c) Methylation pattern
Figure 11.2 Methods for assessment of DNA methylation. The “methylation content” of a cell population corresponds to the overall proportion of methylated cytosines within the entire genome (a), the “methylation level” designates the percentage of methylated DNA strands at a one genomic locus (b), and the “methylation pattern” describes the individual methylation status of a specific set of CpG positions on one DNA strand within a region of interest (c).Reproduced with permission of Macmillan Magazines Ltd from Nature Reviews Cancer 3(4), April 2003, 253–266. Copyright (2003) Macmillan Magazines Ltd.
translated into an allele ratio. However, in most cases not enough DNA will be available for direct analysis. The bisulfite-treated DNA can be amplified without disturbing this allele ratio using conventional PCR with the primers designed to avoid CpG
sites. After amplification, the product can be analyzed using any of a number of analysis methods typically used for the quantification of allele ratios. In MS-SNuPE (Methylation-Specific Single Nucleotide Primer Extension), extension primers
Technology to Assess DNA Methylation
designed to hybridize just upstream of a CpG site are extended by one nucleotide in a methylation-specific manner. Detection and quantification based on HPLC (El-Maarri et al., 2002) and incorporation of radioactivity have been described (Gonzalgo et al., 1997). Another example is pyrosequencing, in which the sequential incorporation of additional bases into the extension primer is translated into a light flash of the same intensity (Colella et al., 2003). Alternatively, hybridization of methylationspecific probes to amplified bisulfite-treated DNA has been combined with different readout systems, including microarrays (Adorjan et al., 2003) and real-time PCR (Quantitative MethyLight; Eads et al., 2000). All of the methods mentioned measure relative levels of methylation, but can be used to quantify DNA methylation levels when they are appropriately calibrated on mixtures of methylated and unmethylated DNA templates. The methods described in the previous paragraph use an unbiased amplification of bisulfite-treated DNA for measuring methylation levels. Methylation in tissues can also be estimated using various forms of methylation-specific amplification. The most commonly used method in methylation research is methylation-specific PCR (MSP) or the real-time version, MethyLight (Eads et al., 2000). These methods use primers designed for a certain methylation pattern of bisulfite-converted DNA. MethyLight technology is a combination of methylation-specific primers and a methylation-specific real-time PCR probe (Figure 11.3). Since these methods only amplify and detect the methylated portion of the DNA sample, they can only provide qualitative methylation pattern information. In combination with an estimation of total DNA amount, which can be measured for instance by a control real-time PCR assay, methylation-specific assays can be used to estimate relative methylation pattern levels (Ogino et al., 2006a). For tissue-based analyses, a quantitative methylation assay is likely to be important, because in many cases the required information will depend not on the presence or absence of methylation but on the level of methylation. Responders to a certain drug, for instance, might have fewer inactivated copies of a gene than non-responders. In some instances, the methylation difference could be quite subtle, and it is not clear which of the methods will provide the level of precision required. Methods for Analyzing Markers in Disease Screening Samples The samples of choice for early detection or screening are body fluids that can be obtained by non-invasive procedures. Serum and plasma are optimal targets for methylation marker analyses, but investigators have also analyzed methylation in urine (Hoque et al., 2005), sputum, bronchial lavage, ductal lavage, ejaculate, and other “remote samples” (reviewed in Laird, 2003). In healthy individuals, only small amounts of circulating DNA are observed in plasma and serum, whereas high amounts have been described in patients with various conditions. The main sources of circulating DNA are thought to be apoptotic or necrotic cell death (Holdenrieder et al., 2001). DNA is released from degrading cells after nuclease cleavage of the chromatin between the nucleosomes. Therefore, the tumor- or disease-derived DNA fragments
(a)
■
135
light F
R
+ ++ + + + + + +
(b)
light
(c) F
R
+++ + ++ +++ ++++ +
(d)
Figure 11.3 Sensitive real-time PCR assays for methylation detection MethyLight assays (a, b) have two methylationspecific primers that bind only to the methylated (a) but not the unmethylated (b) versions of their binding sites. If both primer binding sites are methylated, PCR products are formed and will be detected by methylation-specific real-time PCR probes (a). In contrast, HM assays (c, d) use primers that amplify all methylation states. The methylation specificity of the priming is provided by the blocker oligonucleotide, which prevents the primer from binding to unmethylated DNA (d). Therefore, only methylated DNA is amplified and will be detected by a methylation-specific fluorescent probe (c). Filled circles indicate methylated CpGs, and open circles indicate unmethylated CpGs.
can be as short as the smallest nucleosomal fragment, approximately 180 base pairs (Cottrell and Laird, 2003; Jahr et al., 2001). In contrast to tissue-based applications, methylation markers for screening must fulfill additional biological requirements. Even if the majority of free DNA in blood is derived from cancer cells, a significant amount of DNA from other cell types, particularly leukocytes, will be present. Therefore, the marker must be methylated in a high percentage of tumor cells and relatively unmethylated in leukocytes and other tissues contributing background DNA. Qualitative markers, which are methylated only in tumor-derived DNA, are well suited for screening applications. The clinical sensitivity (proportion of diseased people testing positive) and specificity (proportion of healthy individuals testing negative) of the marker is not only dependent on its biological properties, but also on the analytical performance of the assay. The challenge for the final diagnostic assay is a high analytical sensitivity (detecting small amounts of methylated tumor DNA) at a maximum specificity (no detection of unmethylated DNA). In practice, methylation assays suitable for screening must be capable of detecting a few copies of the methylated marker DNA in a background of 1–10 thousand-fold excess unmethylated DNA. In addition, the DNA extraction method must be optimized for high recovery yield, particularly of short DNA fragments found in plasma or serum. Bisulfite treatment is known to cause damage to the DNA, but milder treatment
136
CHAPTER 11
■
DNA Methylation Analysis
conditions can be used without risk of incomplete conversion of the DNA (Berlin et al., 2005). MethyLight may be an assay that is suitable for detection of disease in remote samples. In contrast to MSP, MethyLight minimize false-positive results caused by possible mis-priming of the methylation-specific primer, as the methylation status of the amplificates are analyzed on a second level by the methylation-specific fluorescent probe. MethyLight has been successfully used for various applications, but primers and probes must be carefully designed in order to ensure sensitive and specific detection. HeavyMethyl (HM) technology is another way to achieve methylation-specific priming, and, like MSP, can be combined with a probe for real-time detection (Cottrell et al., 2004). HM assays combine methylation-non-specific primers and a nonextendable 3-modified blocker oligonucleotide. The blocker binds to unmethylated DNA, preventing the primers from accessing the priming site (Figure 11.3). The blocker does not bind to the methylated version, allowing the primers to access the priming sites. A methylation-specific fluorescent probe is used to ensure specificity and allow for methylation quantification. HM assays typically have a limit of detection of 25–50 pg of methylated DNA and can detect this methylated DNA in a background of 50 ng of unmethylated DNA. Although HM design requires more effort compared to a MethyLight assay, the HM assay has advantages. Unmethylated template that is amplified unintentionally will be targeted by the blocker in the next cycle. Therefore, the analytical specificity of HM assays is very high.
CLINICAL IMPACT OF DNA METHYLATION ANALYSIS Cancer has been viewed as an accumulation of chromosomal aberrations and therefore been called a “genetic disease”. However, it has become clear over the last two decades that epigenetic changes play a crucial role in carcinogenesis. Research on DNA methylation in cancer has been expanding rapidly, and for this reason, in the following sections, we will rely on examples from oncology to demonstrate how methylation profiles can be applied for basic disease research, early detection, disease stratification, treatment response prediction, and drug target identification. While attention has focused on methylation in carcinogenesis, a similar groundswell of research is emerging on methylation in other diseases, especially autoimmune and cardiovascular conditions (see Table 11.2). For example, aberrantly hypomethylated DNA in circulation in systemic lupus erythematosus might induce an immune response due to its similarity to unmethylated microbial DNAs (Januchowski et al., 2004). Further contributions of defective methylation to lupus pathogenesis could be due to transcription of endogenous retroviruses (Okada et al., 2002) or the increase in the expression of certain genes related to autoreactivity (Richardson, 2002). In cardiovascular disease, DNA methylation is associated with atherosclerosis pathogenesis in two ways: First, the decrease in essential factors (e.g., folate and vitamin D)
for the synthesis of S-adenosyl methionine (the main methyl group donor in the methylation reactions) leads to global DNA hypomethylation and coexisting hyperhomocystinemia (Zaina et al., 2005). Second, Lund et al. (2004) provided evidence that methylation profiles are early markers of the disease in a mouse model. Taken together, these data reinforce the notion that methylation profiles can be highly valuable to study the pathogenesis of a variety of conditions (Table 11.2). Impact of Methylation Research on Basic Disease Understanding Researchers have spent many years studying the role of permanent alterations in DNA sequence, such as mutations, deletions, and insertions, in carcinogenesis. Cancer has long been thought of as a disease that arises after accumulation of mutational events in growth control genes. In recent years, RNA expression profiling has been used to examine the functional consequences of these sequence alterations, as outlined in other chapters in this book. There is now overwhelming evidence that epigenetic modification can also control RNA and protein levels. As the methylation analysis toolbox expands, we are gaining more insight into how alterations in these patterns of methylation influence disease. As an example, methylation analysis has provided a new molecular understanding of colon cancer. Colon cancer develops in a well-described pathway from normal epithelium, to aberrant crypts, to adenomas, then to adenocarcinoma, each with specific histological features and molecular alterations. The first alterations to be studied were sequences changes: for instance, APC mutations have been implicated in the formation of adenomas (Powell et al., 1992) and TP53 mutations have been implicated in the transition from adenoma to adenocarcinoma (Baker et al., 1990). More recently, the role of aberrant methylation events of genes in these same pathways has been elucidated. For example, a hereditary form of colon cancer is commonly caused by mutations in the MLH1 gene, but this same gene can induce sporadic forms of the disease when it is aberrantly methylated (Cunningham et al., 1998). Some colon cancers are characterized by microsatellite instability (MSI), and the majority of these tumors have methylated MLH1 alleles. Furthermore, research on methylation in colon cancer has established that tumors can not only be divided based on microsatellite stability status, but also on their CpG island methylator phenotype (CIMP) status (Toyota et al., 1999). Colon tumors, particularly those with MSI, have a bimodal distribution based on the methylation status of multiple genes. Those tumors with multiple methylated genes (CIMP+) are more likely to have BRAF mutations and wildtype KRAS (Ogino et al., 2006a, b). Clearly, a complete understanding of the pathogenesis of colon cancer requires knowledge of both genetic and epigenetic events. Methylation Markers for Early Detection and Diagnosis Early detection is an effective success factor in the fight against any type of cancer. Aberrant DNA methylation patterns show
TABLE 11.2
Examples of DNA methylation in human disease
Disease/Disorder
Phenotype
Mechanism
Genes involved
Reference
Cancer
Tumorigenesis; Increased cell division; metastasis
Hypermethylation
Tumor-suppressor genes
Hypomethylation Loss of imprinting
Global Insulin-like grow factor 2 (IGF2), GTP-binding RAS-like 3 (DIRAS3), mesoderm-specific transcript homolog (MEST) Mut L Homolog 1 (MLH1)
Jones and Baylin, 2002; Herman and Baylin, 2003 Ehrlich et al., 1982 Feinberg, 2007
Heritable germline hypermethylation
Hitchins et al., 2007
Embryonal tumors; omphalocele; macroglossia (large tongue); gigantism
Disruption of an imprinting locus
11p15 (IGF2, H19 ncRNA, cyclindependent kinase inhibitor 1C (CDKN1C), pleckstrin homologylike domain, family A, member 2 (PHLDA2), solute carrier family 22, member 1 (SLC22A1), potassium voltage-gated channel (KCNQ1))
Weksberg et al., 2003
Prader–Willi syndrome
Developmental delay; obesity; genital hypoplasia; distinct facial features
Disruption of an imprinting locus
15q11-13 (small nuclear ribonucleoprotein polypeptide N (SNRPN), ncRNAs)
Nicholls and Knepper, 2001
Angelman syndrome
Developmental delay; ataxia; microcephaly; distinct facial features
Disruption of an imprinting locus
15q11-13 (Ubiquitin E3 ligase (UBE3A))
Nicholls and Knepper, 2001
Pseudohypoparathyroidism
Hypoparathyroidism without sensitivity to parathyroid hormone
Disruption of an imprinting locus
20q13 (GNAS complex locus (GNAS1))
Bastepe et al., 2001
Immunodeficiency/centromeric instability/facial anomalies (ICF syndrome)
Immunodeficiency; chromosomal breakage; facial defects
Mutations
DNA Methyltransferase 3b (DNMT3b)
Hansen et al., 1999
Rett syndrome
Motor abnormalities; ataxia; handwringing; poor verbal skills
Mutations
Methyl-CpG-binding protein 2 (MeCP2)
Amir et al., 1999
Systemic Lupus Erythematosus
Autoimmune disorder; inflammation; tissue damage
Hypomethylation
global, perforin (PRF1), CD70, integrin, alpha L (ITGAL)
Januchowski et al., 2004
Atherosclerosis
Thickening and hardening of arterial walls
Hypermethylation
Estrogen receptor, Monocarboxylate Transporter 3 (SLC16A3)
Kim et al., 2007; Zhu et al., 2005
Schizophrenia
Hallucinations; delusions
Hypomethylation
Global
Shimabukuro et al., 2006
Hypermethylation
Reelin (RELN)
Grayson et al., 2005
Hypermethylation
Homeobox A10 (HOXA10)
Wu et al., 2005
Fragile X syndrome
Mental retardation; autism; large ears; macroorchidism
Hypermethylation of an expanded trinucleotide repeat
Fragile X mental retardation 1 (FMR1)
Oberle et al., 1991
137
Growth of endometrial cells outside of the uterus
■
Endometriosis
Clinical Impact of DNA Methylation Analysis
Beckwith–Wiedemann syndrome
138
CHAPTER 11
■
DNA Methylation Analysis
promise as biomarkers for early detection of cancer for several reasons: Tumor-specific methylation changes occur early in tumorigenesis, appear to be stable, provide an amplifiable signal, and can be assayed with high analytical sensitivity. DNA methylation patterns characteristic of tumor cells have been found for many types of cancer in several types of body fluids (Laird, 2003). DNA methylation changes occur frequently in colon cancer and can be present early in malignant transformation (Grady, 2005). Feasibility studies have shown that tumor-derived methylated DNA markers can be detected in serum, plasma, and stool (reviewed in Laird, 2003). These studies analyzed a limited number of patient samples, and larger clinical studies will be necessary to assess the quality of these markers for a screening approach. Prostate-specific antigen (PSA) based screening has resulted in detection of many cases of prostate cancer, but, due to the low specificity of this assay, its usefulness for prostate cancer screening is no longer certain. DNA methylation markers may improve screening and diagnosis of prostate cancer considerably in the future. GSTP1, encoding glutathione S-transferase P1, is the bestcharacterized methylation marker for prostate cancer. Methylation of GSTP1 in prostate cancer has been consistently reported with a frequency of about 70–90% (reviewed in Henrique and Jeronimo, 2004). Interestingly, GSTP1 methylation is also reported in some high-grade prostatic intraepithelial neoplasia (PIN) lesions, a precursor to prostate cancer, indicating that GSTP1 hypermethylation is an early event in prostate cancer development (Henrique et al., 2006). Several groups have demonstrated the feasibility of using GSTP1 methylation and other methylation markers to detect prostate cancer in bodily fluids. For example, Hoque et al. (2005) analyzed a panel of nine genes in urine sediments from 52 cancer patients and 91 controls using MethyLight assays. A panel containing GSTP1 and three other genes (CDKN2A, ARF, and MGMT) detected the prostate cancer cases with 87% sensitivity and 100% specificity. Recently, Lofton-Day et al. (2006) conducted a study to identify and validate DNA methylation-based markers for colorectal cancer using plasma samples. At a set specificity of 95%, the lead methylation marker (Septin 9) showed in several studies sensitivity values of 51, 65, and 50%, respectively. In total, approximately 2000 samples were tested, including over 600 plasma samples from colorectal cancer patients, 600 with related non-malignant diseases, and over 600 normal controls from an age-matched, colonoscopy-verified healthy population. Importantly, early-stage cancers were identified with the same sensitivity as later stage cancers (Lofton-Day et al., 2006). The test was able to detect colorectal cancers regardless of their location, addressing a critical medical need and shortfall of the existing fecal occult blood tests. Once a methylation-based screening test is developed, its use can potentially be extended to monitoring for disease recurrence after treatment, similar to the role of PSA for prostate cancer.
aggressiveness of tumors and the responsiveness to certain drug regimens, respectively. For example, prognostic markers for breast cancer are required to determine which patients can be spared from aggressive therapy due to very good prognosis. Recently, several reports have been published exploring the prognostic relevance of DNA methylation at the promoter of the PLAU gene, which codes for the matrix remodeling enzyme urokinase-type plasminogen activator (uPA) (Guo et al., 2002; Pakneshan et al., 2004; Xing and Rabbani, 1999). Interestingly, PLAU-promoter DNA methylation was negatively associated with tumor grade; grade 1 tumors were frequently methylated and grade 3 tumors were unmethylated. These findings are consistent with results from large studies that have demonstrated that protein expression in serum of uPA and its cell surface receptor uPAR are important determinants of distant spread of breast cancer (Look et al., 2002). Shinozaki et al. (2005) found that in a panel of 151 primary breast tumors, hypermethylation of the CDH1 gene was significantly associated with lymphovascular invasion, infiltrating ductal histology, and lack of estrogen receptor expression. On the other hand, RASSF1A and RARB hypermethylation were significantly more common in estrogen receptor-positive and human epidermal growth factor receptor 2-positive tumors, respectively. Other genes that have been associated with prognosis in breast cancer and which appear to be regulated via DNA methylation are CDH3 (Paredes et al., 2005), the tuberous sclerosis (TSC) genes TSC1 and TSC2 (Jiang et al., 2005), and LATS1/LATS2, tumor-suppressor genes that have been implicated in the regulation of the cell cycle (Takahashi et al., 2005). We have recently identified a methylation marker, PITX2, which predicts outcome in hormone receptor-positive, node-negative breast cancer patients treated with tamoxifen monotherapy (Maier et al., 2004, 2007). The marker was originally identified and validated in two independent cohorts of patients treated with adjuvant tamoxifen monotherapy. The findings were then validated in an independent study analyzing paraffin-embedded tumors of 422 node-negative patients treated with tamoxifen only after surgery. In the group with low PITX2 methylation, 94% of the patients were metastasis-free after 10 years, compared to only 84% in the group with high PITX2 methylation. In a multivariate model, PITX2 methylation added significant information to conventional factors such as tumor size, grade, and age (Harbeck et al., 2005). Analysis of PITX2 methylation in a large cohort of patients not receiving adjuvant therapy confirmed that PITX2 methylation is a prognostic marker associated with tumor aggressiveness (Martens et al., 2005). These results provide substantial evidence that PITX2 DNA methylation is suitable for routine clinical use in order to predict outcome in node-negative, tamoxifen-treated patients, and to identify low-risk patients who can be spared the burden of additional cytotoxic therapy.
Methylation Markers for Disease Prognosis Molecular classification holds great promise for assessment of optimal treatment options. DNA methylation markers have been reported as both prognostic and predictive markers, indicating the
Methylation Markers for Treatment Response Prediction A number of studies have provided evidence that specific methylation changes are associated with responses to a variety of cancer
References
therapeutic agents currently used in the clinics (reviewed in Maier et al., 2005). These alterations in methylation patterns could serve as predictive markers of drug response and be used by clinicians and patients to support treatment selection. Furthermore, methylation markers could be developed together with a novel drug during clinical trials in order to better select the patients who are likely to respond. The inclusion of predictive markers in clinical trials is becoming increasingly important as more targeted therapies are being tested. Many anti-cancer therapies disturb DNA integrity and thus interfere with DNA synthesis and successful cellular replication, ultimately inducing cell death. Tumor response to DNA damaging agents is closely linked to expression of several DNA-repair enzymes, many of which are regulated by methylation. Resistance can result from either a gain of methylation event or loss of methylation event. Methylation of MGMT, which codes for a DNArepair enzyme, is associated with response to alkylating agents. In this case, tumor cells rely on the excess MGMT product to specifically repair the damage that occurred during therapy with the alkylating agent (Ludlum, 1990). Esteller et al. (2000) investigated the methylation status of the MGMT gene in 47 glioma samples from patients treated with the chloroethylating agent carmustine. The MGMT promoter region was methylated in 19 (40%) of the samples, 12 (63%) of which were from patients who had a response to carmustine. Of the 28 tumors with unmethylated MGMT, only 1 patient (4%) responded to the drug (p 0.001, univariate analysis). Paz and coworkers (2004) found similar results in glioma patients. Balana and colleagues (2003) demonstrated that detection of these treatment response markers might be possible in serum for disease where affected tissue is unavailable. MLH1 is a mismatch repair enzyme that is activated in response to DNA damage. However, MLH1 expression is not only required to repair the damage, but also seems to be linked to apoptotic signaling: Its activation induces processes leading to programmed cell death. Therefore, in contrast to MGMT, methylation of the MLH1 promoter is associated with resistance to DNA damaging agents, such as temozolomide, dacarbazine, and
■
139
cisplatin. Expression of MLH1 increases the effect of the drugs, presumably by acting as a sensor detecting DNA damage caused by the drug, and by activating processes that eventually lead to apoptosis of the cell (Agarwal and Kaye, 2003; Brown et al., 1997; Karran and Bignami, 1994). Other drug classes for which methylation markers were shown to be associated with tumor response in patients or sensitivity of preclinical models are taxanes (CHFR, TRAG3, RASSF1, BRCA1), platinum compounds (ABCB1, FANCF, RASSF1), retinoids (RARB, RBP1), and anti-hormonal therapies (ESR1, ESR2, AR) (Maier et al., 2005).
CONCLUSIONS With methylation-specific restriction enzymes, methylationbinding proteins, and bisulfite conversion, researchers have many technical options for detecting and measuring methylation information in the human genome. While MSP and bisulfite sequencing have been the workhorses in basic research on methylation, new developments in real-time PCR and arraybased analysis are expanding the repertoire. Techniques such as MethyLight have the level of analytical performance required for clinical application, allowing basic research on aberrant methylation in disease to be translated into sensitive and specific diagnostic tests that will be extremely valuable to clinicians and patients. The growing knowledge of aberrant methylation has already been integrated into models of carcinogenesis, and a complete understanding of disease initiation and progression must incorporate epigenetic analysis as well as traditional genetic analysis. With the development and validation of biomarkers based on aberrant methylation events, the emergence of methylation markers for clinical oncology is anticipated in short order. The wealth of information provided by methylation analysis in oncology and the fundamental importance of methylation in gene expression are certain to drive the expansion of methylation analysis into autoimmune, cardiovascular, and other diseases as well.
REFERENCES Adorjan, P., Distler, J., Lipscher, E., Model, F., Muller, J., Pelet, C., Braun, A., Florl, A., Gutig, D., Grabs, G. et al. (2003). Tumor class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Res 30, e21. Agarwal, R. and Kaye, S.B. (2003). Ovarian cancer: Strategies for overcoming resistance to chemotherapy. Nat Rev Cancer 7, 502–516. Amir, R.E.,Van den Veyver, I.B., Wan, M., Tran, C.Q., Francke, U. and Zoqhbi, H.Y. (1999). Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat Genet 23, 185–188. Baker, S.J., Preisinger, A.C., Jessup, J.M., Paraskeva, C., Markowitz, S., Willson, J.K., Hamilton, S. and Vogelstein, B. (1990). p53 gene mutations occur in combination with 17p allelic deletions as late events in colorectal tumorigenesis. Cancer Res 50, 7717–7722.
Balana, C., Ramirez, J., Taron, M., Roussos, Y., Ariza, A., Ballester, R., Sarries, C., Mendez, P., Sanchez, J. and Rosell, R. (2003). O6-methyl-guanine-DNA methyltransferase methylation in serum and tumor DNA predicts response to 1,3-bis(2-chloroethyl)-1nitrosourea but not to temozolamide plus cisplatin in glioblastoma multiforme. Clin Cancer Res 9, 1461–1468. Bartolomei, M.S., Webber, A.L., Brunkow, M.E. and Tilghman, S.M. (1993). Epigenetic mechanisms underlying the imprinting of the mouse H19 gene. Genes Dev 7, 1663–1673. Bastepe, M., Lane, A.H. and Jueppner, H. (2001). Paternal uniparental isodisomy of chromosome 20q – and the resulting changes in GNAS1 methylation – as a plausible cause of pseudohypoparathyroidism. Am J Hum Genet 68, 1283–1289. Berlin, K., Cardon, K. and Ballhause, M. (2005). Improved bisulfite conversion of DNA. Patent WO 2005/038051.
140
CHAPTER 11
■
DNA Methylation Analysis
Bird, A.P. and Southern, E.M. (1978). Use of restriction enzymes to study eukaryotic DNA methlation: I, The methylation pattern in ribosomal DNA from Xenopus laevis. J Mol Biol 118, 27–47. Brown, R., Hirst, G., Gallagher, W., McIlwrath, A., Margison, G., van der Zee, A. and Anthoney, D. (1997). hMLH1 expression and cellular responses of ovarian tumour cells to treatment with cytotoxic anticancer agents. Oncogene 15, 45–52. Chen, R., Pettersson, U., Beard, C., Jackson-Grusby, L. and Jaenisch, R. (1998). DNA hypomethylation leads to elevated mutation rates. Nature 395, 89–93. Colella, S., Shen, L., Baggerly, K.A., Issa, J.-P.J. and Krahe, R. (2003). Sensitive and quantitative universal Pyrosequencing methylation analysis of CpG sites. BioTechniques 35, 146–150. Costello, J.F., Fruhwald, M.C., Smiraglia, D.J., Rush, L.J., Robertson, G.P., Gao, X.,Wright, F.A., Feramisco, J.D., Peltomaki, P., Lang, J.C. et al. (2000). Aberrant CpG-island methylation has nonrandom and tumour-type-specific patterns. Nat Genet 24, 132–138. Costello, J.F., Smiraglia, D.J. and Plass, C. (2002). Restriction landmark genome scanning. Methods 27, 144–149. Cottrell, S. and Laird, P. (2003). Sensitive detection of DNA methylation. Ann N Y Acad Sci 983, 120–130. Cottrell, S., Distler, J., Goodman, N., Mooney, S., Kluth, A., Olek, A., Schwope, I., Tetzner, R., Ziebarth, H. and Berlin, K. (2004). A realtime PCR assay for DNA-methylation using methylation-specific blockers. Nucleic Acids Res 32, e10. Cross, S.H., Charlton, J.A., Nan, X. and Bird, A.P. (1994). Purification of CpG islands using a methylated DNA binding column. Nat Genet 6, 236–244. Cunningham, J.M., Christensen, E.R.,Tester, D.J., Kim, C.Y., Roche, P.C., Burgart, L.J. and Thibodeau, S.N. (1998). Hypermethylation of the hMLH1 promoter in colon cancer with microsatellite instability. Cancer Res 58, 3455–3460. Eads, C., Danenberg, K., Kawakami, K., Saltz, L., Blake, C., Shibata, D., Danenberg, P. and Laird, P. (2000). MethyLight: A high-throughput assay to measure DNA methylation. Nucleic Acids Res 28, E32. Eckhardt, F., Lewin, J., Cortese, R., Rakyan,V.K., Attwood, J., Burger, M., Burton, J., Cox, T.V., Davies, R., Down, T.A. et al. (2006). DNA methylation profiling of human chromosomes 6, 20, and 22. Nat Genet 38, 1378–1385. Ehrlich, M., Gama-Sosa, M.A., Huang, L.H., Midgett, R.M., Kuo, K.C., McCune, R.A. and Gehrke, C. (1982). Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res 10, 2709–2721. El-Maarri, O., Herbiniaux, U., Walter, J. and Oldenburg, J. (2002). A rapid, quantitative, non-radioactive bisulfite-SNuPE- IP RP HPLC assay for methylation analysis at specific CpG sites. Nucleic Acids Res 30, e25. ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816. Esteller, M. (2006). The necessity of a human epigenome project. Carcinogenesis 27, 1121–1125. Esteller, M., Garcia-Foncillas, J., Andion, E., Goodman, S., Hidalgo, O., Vanaclocha,V., Baylin, S. and Herman, J. (2000). Inactivation of the DNA-repair gene MGMT and the clinical response of gliomas to alkylating agents. N Engl J Med 343, 1350–1354. Feinberg, A.P. (2004). The epigenetics of cancer etiology. Semin Cancer Biol 14, 427–432. Feinberg, A.P. (2007). Phenotypic plasticity and the epigenetics of human disease. Nature 447, 433–440.
Frommer, M., McDonald, L.E., Millar, D.S., Collis, C.M., Watt, F., Griff, G.W., Molloy, P.L. and Paul, C.L. (1992). A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 89, 1827–1831. Fukasawa, M., Kimura, M., Morita, S., Matsubara, K., Yamanaka, S., Endo, C., Sakurada, A., Sato, M., Kondo, T., Horii, A. et al. (2006). Microarray analysis of promoter methylation in lung cancers. J Hum Genet 51, 368–374. Gebhard, C., Schwarzfischer, L., Pham, T.H., Schilling, E., Klug, M., Andreesen, R. and Rehli, M. (2006). Genome-wide profiling of CpG methylation identifies novel targets of aberrant hypermethylation in myeloid leukaemia. Cancer Res 66, 6118–6128. Gonzalgo, M.L. and Jones, P.A. (1997). Rapid quantitation of methylation differences at specific sites using methylation-sensitive single nucleotide primer extension (Ms-SNuPE). Nucleic Acids Res 25, 2529–2531. Grady, W.M. (2005). Epigenetic events in the colorectum and in colon cancer. Biochem Soc Trans 33, 684–688. Grayson, D.R., Jia, X., Chen,Y., Sharma, R.P., Mitchell, C.P., Guidotti, A. and Costa, E. (2005). Reelin promoter hypermethylation in schizophrenia. Proc Natl Acad Sci USA 102, 9341–9346. Grunau, C., Clark, S.J. and Rosenthal, A. (2001). Bisulfite genomic sequencing: Systematic investigation of critical experimental parameters. Nucleic Acids Res 29, e65–e65. Hansen, R.S., Wijmenga, C., Luo, P., Stanek, A.M., Canfield, T.K., Weemaes, C.M. and Gartler, S.M. (1999). The DNMT3B DNA methyltransferase gene is mutated in the ICF immunodeficiency syndrome. Proc Natl Acad Sci USA 96, 14412–14417. Harbeck, N., Bohlmann, I., Ross, J.S., Gruetzmann, R., Kristiansen, G., Margossian, A., Hartmann, A., Cufer, T., Paradiso, A. and Maier, S. EpiBreast Group (2005). Multicenter study validates PITX2 DNA methylation for risk prediction in tamoxifen-treated, node-negative breast cancer using paraffin-embedded tumor tissue. J Clin Oncol (ASCO Annual Meeting Proceedings) 23, 505. Hatada, I., Hayashizaki, Y., Hirotsune, S., Komatsubara, H. and Mukai, T. (1991). A genomic scanning method for higher organisms using restriction sites as landmarks. Proc Natl Acad Sci USA 88, 9523–9527. Hayashi, H., Nagae, G.,Tsutsumi, S., Kaneshiro, K., Kozaki,T., Kaneda, A., Sugisaki, H. and Aburatani, H. (2007). High-resolution mapping of DNA methylation in human genome using oligonucleotide tiling array. Hum Genet 120, 701–711. Hellman, A. and Chess, A. (2007). Gene body-specific methylation on the active X chromosome. Science 315, 1141–1143. Henrique, R. and Jeronimo, C. (2004). Molecular detection of prostate cancer: A role for GSTP1 hypermethylation. Eur Urol 46, 660–669. Henrique, R., Jeronimo, C.,Teixeira, M.R., Hoque, M.O., Carvalho, A.L., Pais, I., Ribeiro, F.R., Oliveira, J., Lopes, C. and Sidransky, D. (2006). Epigenetic heterogeneity of high-grade prostatic intraepithelial neoplasia: Clues for clonal progression in prostate carcinogenesis. Mol Cancer Res 4, 1–8. Herman, J.G. and Baylin, S.B. (2003). Gene silencing in cancer in association with promoter hypermethylation. N Engl J Med 349, 2042–2054. Herman, J.G., Graff, J.R., Myohanen, S., Nelkin, B.D. and Baylin, S.B. (1996). Methylation-specific PCR: A novel PCR assay for methylation status of CpG islands. Proc Natl Acad Sci USA 93, 9821–9826.
References
Hitchins, M.P., Wong, J.J., Suthers, G., Suter, C.M., Martin, D.I., Hawkins, N.J. and Ward, R.L. (2007). Inheritance of a cancerassociated MLH1 germ-line epimutation. N Engl J Med 356, 731–733. Holdenrieder, S., Stieber, P., Bodenmuller, H., Busch, M., Fertig, G., Furst, H., Schalhorn, A., Schmeller, N., Untch, M. and Seidel, D. (2001). Nucleosomes in serum of patients with benign and malignant diseases. Int J Cancer 95, 114–120. Hoque, M.O., Topaloglu, O., Begum, S., Henrique, R., Rosenbaum, E., Van Criekinge, W., Westra, W.H. and Sidransky, D. (2005). Quantitative methylation-specific polymerase chain reaction gene patterns in urine sediment distinguish prostate cancer patients from control subjects. J Clin Oncol 23, 6569–6575. Huang, T.H., Perry, M.R. and Laux, D.E. (1999). Methylation profiling of CpG islands in human breast cancer cells. Hum Mol Genet 8, 459–470. Jahr, S., Hentze, H., Englisch, S., Hardt, D., Fackelmayer, F.O., Hesch, R.D. and Knippers, R. (2001). DNA fragments in the blood plasma of cancer patients: Quantitations and evidence for their origin from apoptotic and necrotic cells. Cancer Res 61, 1659–1665. Januchowski, R., Prokop, J. and Jagodzinski, P.P. (2004). Role of epigenetic DNA alterations in the pathogenesis of systemic lupus erythematosus. J Appl Genet 45, 237–248. Jiang, W.G., Sampson, J., Martin, T.A., Lee-Jones, L., Watkins, G., Douglas-Jones, A., Mokbel, K. and Mansel, R.E. (2005). Tuberin and hamartin are aberrantly expressed and linked to clinical outcome in human breast cancer: The role of promoter methylation of TSC genes. Eur J Cancer 41, 1628–1636. Jones, P.A. and Laird, P.W. (1999). Cancer epigenetics comes of age. Nat Genet 21, 163–167. Jones, P.A. and Baylin, S.B. (2002). The fundamental role of epigenetic events in cancer. Nat Rev Genet 3, 415–428. Jones, P.A. and Martienssen, R. (2005). A blueprint for a Human Epigenome Project: The AACR Human Epigenome Workshop. Cancer Res 65, 11241–11246. Karran, P. and Bignami, M. (1994). DNA damage tolerance, mismatch repair and genome instability. Bioessays 16, 833–839. Kim, J., Kim, J.Y., Song, K.S., Lee, Y.H., Seo, J.S., Jelinek, J., Goldschmidt-Clermont, P.J. and Issa, J.P. (2007). Epigenetic changes in estrogen receptor beta gene in atherosclerotic cardiovascular tissues and in-vitro vascular senescence. Biochim Biophys Acta 1772, 72–80. Krueger, T., Wild, C. and Noyer-Weidner, M. (1995). McrB: A prokaryotic protein specifically recognizing DNA containing modified cytosine residues. EMBO J 14, 2661–2669. Laird, P.W. (2003). The power and promise of DNA methylation markers. Nat Rev Cancer 3, 253–266. Lander, E., Linton, L., Birren, B., Nusbaum, C., Zody, M., Baldwin, J., Devon, K., Dewar, K., Dewar, I., Doyle, M. et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. Lewin, J., Plum, A., Hildmann, T., Rujan, T., Eckhardt, F., Liebenberg,V., Lofton-Day, C. and Wasserkort, R. (2007). Comparative DNA methylation analysis in normal and tumour tissues and in cancer cell lines using differential methylation hybridisation. Int J Biochem Cell Biol 39, 1539–1550. Liang, G., Gonzalgo, M.L., Salem, C. and Jones, P.A. (2002). Identification of DNA methylation differences during tumorigenesis by methylation-sensitive arbitrarily primed polymerase chain reaction. Methods 27, 150–155.
■
141
Lofton-Day, C., Model, F.,Tetzner, R., deVos,T., Schuster, M., Lesche, R., Sledziewski, A. and Day, R. (2006). A real-time PCR test for Septin 9 gene methylation identifies early stage colorectal cancer in plasma. 97th Annual Meeting of the AACR. Abstract LB-224. Look, M.P., van Putten, W.L., Duffy, M.J., Harbeck, N., Christensen, I.J., Thomssen, C., Kates, R., Spyratos, F., Ferno, M. and EppenbergerCastori, S. (2002). Pooled analysis of prognostic impact of urokinase-type plasminogen activator and its inhibitor PAI-1 in 8377 breast cancer patients. J Natl Cancer Inst 94, 116–128. Lopatina, N., Haskell, J.F., Andrews, L.G., Poole, J.C., Saldanha, S. and Tollefsbol, T. (2002). Differential maintenance and de novo methylating activity by three DNA methyltransferases in aging and immortalized fibroblasts. J Cell Biochem 84, 324–334. Ludlum, D.B. (1990). DNA alkylation by the haloethylnitrosoureas: Nature of modifications produced and their enzymatic repair or removal. Mutat Res 233, 117–126. Lund, G., Andersson, L., Lauria, M., Lindholm, M., Fraga, M.F., VillarGarea, A., Ballestar, E., Esteller, M. and Zaina, S. (2004). DNA methylation polymorphisms precede any histological sign of atherosclerosis in mice lacking apolipoprotein E. J Biol Chem 279, 29147–29154. Maier, S., Nimmrich, I., Marx, A., Eppenberger-Castori, S., Jaenicke, F., Paradiso, A., Spyratos, F., Foekens, J., Schmitt, M. and Harbeck, N. (2004). DNA methylation profile predicts risk of recurrence in tamoxifen-treated, node-negative breast cancer patients. J Clin Oncol (ASCO Annual Meeting Proceedings) 22, 525. Maier, S., Dahlstroem, C., Haefliger, C., Plum, A. and Piepenbrock, C. (2005). Identifying DNA methylation biomarkers of cancer drug response. Am J Pharmacogenomics 5, 223–232. Maier, S., Nimmrich, I., Koenig,T., Eppenberger-Castori, S., Bohlmann, I., Paradiso, A., Spyratos, F., Thomssen, C., Mueller, V., Naehrig, J. et al. (2007). DNA-methylation of the homeodomain transcription factor PITX2 reliably predicts risk of distant disease recurrence in tamoxifen-treated, node-negative breast cancer patients – Technical and clinical validation in a multi-center setting in collaboration with the European Organisation for Research and Treatment of Cancer (EORTC) PathoBiology group. Eur J Cancer 43, 1679–1686. Martens, J.W., Maier, S., Schwope, I., Koenig,T., Hartmann, O., Kluth, A., Look, M., Harbeck, N., Bolt-de Vries, J., Sieuwerts, A. et al. (2005). DNA-hypermethylation of PITX2 and risk of cancer recurrence in lymph node-negative hormone receptor-positive breast cancer patients. 96th Annual Meeting of the AACR. Abstract 2762. Mohandas, T., Sparkes, R.S. and Shapiro, L.J. (1981). Reactivation of an inactive human X chromosome: Evidence for X inactivation by DNA methylation. Science 211, 393–396. Nicholls, R.D. and Knepper, J.L. (2001). Genome organization, function, and imprinting in Prader-Willi and Angelman syndromes. Annu Rev Genomics Hum Genet 2, 153–175. Oberle, I., Rousseau, F., Heitz, D., Kretz, C., Devys, D., Hanauer, A., Boue, J., Bertheas, M.F. and Mandel, J.L. (1991). Instability of a 550-base pair DNA segment and abnormal methylation in fragile X syndrome. Science 252, 1097–1102. Ogino, S., Kawasaki, T., Brahmandam, M., Cantor, M., Kirkner, G.J., Spiegelman, D., Makrigiorgos, G.M.,Weisenberger, D.J., Laird, P.W., Loda, M. et al. (2006a). Precision and performance characteristics of bisulfite conversion and real-time PCR (MethyLight) for quantitative DNA methylation analysis. J Mol Diagn 8, 209–217. Ogino, S., Cantor, M., Kawasaki, T., Brahmandam, M., Kirkner, G., Weisenberger, D.J., Campan, M., Laird, P.W., Loda, M. and
142
CHAPTER 11
■
DNA Methylation Analysis
Fuchs, C.S. (2006b). CpG island methylator phenotype (CIMP) of colorectal cancer is best characterized by quantitative DNA methylation analysis and prospective cohort studies. Gut 55, 1000–1006. Okada, M., Ogasawara, H., Kaneko, H., Hishikawa, T., Sekigawa, I., Hashimoto, H., Maruyama, N., Kaneko, Y. and Yamamoto, N. (2002). Role of DNA methylation in transcription of human endogenous retrovirus in the pathogenesis of systemic lupus erythematosus. J Rheumatol 29, 1978–1982. Okano, M., Bell, D.W., Haber, D.A. and Li, E. (1999). DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247–257. Paredes, J., Albergaria, A., Oliveira, J.T., Jeronimo, C., Milanezi, F. and Schmitt, F.C. (2005). P-cadherin overexpression is an indicator of clinical outcome in invasive breast carcinomas and is associated with CDH3 promoter hypomethylation. Clin Cancer Res 11, 5869–5877. Paz, M.F., Yaya-Tur, R., Rojas-Marcos, I., Reynes, G., Pollan, M., Aguirre-Cruz, L., Garcia-Lopez, J.L., Piquer, J., Safont, M.J. and Balana, C. (2004). CpG island hypermethylation of the DNA repair enzyme methyltransferase predicts response to temozolomide in primary gliomas. Clin Cancer Res 10, 4933–4938. Powell, S.M., Zilz, N., Beazer-Barclay, Y., Bryan, T.M., Hamilton, S.R., Thibodeau, S.N., Vogelstein, B. and Kinzler, K.W. (1992). APC mutations occur early during colorectal tumorigenesis. Nature 359, 235–237. Richardson, B.C. (2002). Role of DNA methylation in the regulation of cell function: Autoimmunity, aging and cancer. J Nutr 132, 2401S–2405S. Schuster, M. (2004). DNA methylation analysis: A powerful diagnostic tool. J Lab Med 28, 233–238. Shimabukuro, M., Sasaki, T., Imamura, A., Tsujita, T., Fuke, C., Umekage, T., Tochigi, M., Hiramatsu, K., Miyazaki, T., Oda, T. et al. (2006). Global hypomethylation of peripheral leukocyte DNA in male patients with schizophrenia: A potential link between epigenetics and schizophrenia. J Psychatr Res e-publication. Shinozaki, M., Hoon, D.S., Giuliano, A.E., Hansen, N.M., Wang, H.J., Turner, R. and Taback, B. (2005). Distinct hypermethylation profile of primary breast cancer is associated with sentinel lymph node metastasis. Clin Cancer Res 11, 2156–2162.
Stirzaker, C., Song, J.Z., Davidson, B. and Clark, S.J. (2004). Transcriptional gene silencing promotes DNA hypermethylation through a sequential change in chromatin modifications in cancer cells. Cancer Res 64, 3871–3877. Takahashi,Y., Miyoshi,Y., Takahata, C., Irahara, N., Taguchi, T., Tamaki,Y. and Noguchi, S. (2005). Down-regulation of LATS1 and LATS2 mRNA expression by promoter hypermethylation and its association with biologically aggressive phenotype in human breast cancers. Clin Cancer Res 11, 1380–1385. Takai, D. and Jones, P.A. (2002). Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci USA 99, 3740–3745. Taylor, K.H., Pena-Hernandez, K.E., Davis, J.W., Arthur, G.L., Duff , D.J., Shi, H., Rahmatpanash, F.B., Sjahputera, O. and Caldwell, C.W. (2007). Large-scale CpG methylation analysis identifies novel candidate genes and reveals methylation hotspots in acute lymphoblastic leukemia. Cancer Res 67, 2617–2625. Toyota, M., Ho, C., Ahuja, N., Jair, K.W., Li, Q., Ohe-Toyota, M., Baylin, S.B. and Issa, J.P. (1999). Identification of differentially methylated sequences in colorectal cancer by methylated CpG island amplification. Cancer Res 59, 2307–2312. Weksberg, R., Smith, A.C., Squire, J. and Sadowski, P. (2003). BeckwithWiedemann syndrome demonstrates a role for epigenetic control of normal development. Hum Mol Genet 12, R61–R68. Wu,Y., Halverson, G., Basir, Z., Strawn, E.,Yan, P. and Guo, S.W. (2005). Aberrant methylation at HOXA10 may be responsible for its aberrant expression in the endometrium of patients with endometriosis. Am J Obstet Gynecol 193, 371–380. Yan, P.S., Chen, C.M., Shi, H., Rahmatpanah, F.,Wei, S.H., Caldwell, C.W. and Huang, T.H. (2001). Dissecting complex epigenetic alterations in breast cancer using CpG island microarrays. Cancer Res 61, 8375–8380. Zaina, S., Lindholm, M.W. and Lund, G. (2005). Nutrition and aberrant DNA methylation patterns in atherosclerosis: More than just hyperhomocystinemia?. J Nutr 135, 5–8. Zhu, S., Goldschmidt-Clermont, P.J. and Dong, C. (2005). Inactivation of monocarboxylate transporter MCT3 by DNA methylation in atherosclerosis. Circulation 112, 1353–1361.
RECOMMENDED RESOURCES Websites
Journal
http://www.epigenome.org – The website for the Human Epigenome Project. http://www.microarrays.ca – This is the website for the microarray center at the University Health Network in Toronto. They describe a spotted array containing over 12,000 CpG island clones. http://epigenome-noe.net/ – The website for the Epigenome Network of Excellence, a consortium of 25 European research groups. http://www.dnamethsoc.com/ – Website for the Epigenetics Society.
Laird, P.W. (2003). The power and promise of DNA methylation markers. Nat Rev Cancer 3(4), 253–266. This paper provides a comprehensive review of methylation detection and measurement methods as well as a review of the literature applying these methods to clinical samples.
CHAPTER
12 Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine Michelle M. Kittleson, Rafael Irizarry, Bettina Heidecker and Joshua M. Hare
INTRODUCTION Transcriptomics is defined as the genome-wide study of mRNA expression levels. With this promising technology, it is now possible to assess the expression of tens of thousands of gene transcripts simultaneously, providing a resolution and precision of phenotypic characterization of the transcriptome not previously possible. Because the state of the transcriptome in a given diseased tissue may contain a highly accurate representation of key biological phenomena, patterns of gene expression have potential to provide insights into disease mechanisms and also to identify markers useful for diagnostic, prognostic, and therapeutic purposes. This chapter will provide an overview of this powerful new technology, with a discussion of the statistical methods and applications of transcriptomics as well as current issues in this field. We discuss microarrays that are based on genomic information as well as specialized microarrays designed to detect micro-RNAs.
GENE EXPRESSION TECHNOLOGY Types of Microarrays The study of the transcriptome can be accomplished with the use of microarrays. Microarrays rely on the principle of Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
complementary hybridization of nucleotide sequences and have taken advantage of robotic technology to create platforms as small as a few inches wide, containing tens of thousands of DNA sequences for analysis. There are two commonly used types of gene expression microarrays. The first type, traditionally known as cDNA microarrays, utilizes relatively long multimers of probe cDNAs (500–5000 bases). Such microarrays are made primarily in individual institutions and are often organ- or disease-specific, examples include the CardioChip from Brigham and Women’s Hospital (Barrans et al., 2001) and the LymphoChip from Stanford University (Alizadeh et al., 1999). Usually, one or a few long probes are used to assay the expression of a given gene. They often use two-colors to measure the identity and intensity of gene expression (i.e., red and green), so that on an individual microarray, the gene expression of two samples is compared or individual samples are compared to a reference sample. Thus, the two samples are tagged with distinct labels that fluoresce at a distinct and non-overlapping wavelength and the relative, as opposed to absolute, level of expression is obtained (Figure 12.1) (Albelda and Sheppard, 2000). The second type of microarray is known as an oligonucleotide microarray (Figure 12.2). These microarrays use smaller DNA probes (20–80 bases) and have a number of advantages: (i) they are commercially produced by companies such as Affymetrix, Copyright © 2009, Elsevier Inc. All rights reserved. 143
144
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
(a) RNA isolation Sample A
Sample B
(e) Imaging Sample A B Sample B A
Sample A B
(b) cDNA generation (c) Labeling of probe Reverse Transcriptase Fluorescent tags
(d) Hybridization to array
Figure 12.1 The use of two-color cDNA microarrays. First, the messenger RNA (mRNA) is isolated from two separate samples (a). The mRNA from each sample is treated with reverse transcriptase (b) and labeled with a distinct fluorescent tag (c). The two pools of labeled RNA are hybridized to the microarray, containing a full set of thousands or tens of thousands of DNA sequences, (d). The microarray is scanned and the color of each spot is determined (e). In this example, genes expressed only in Sample A would be red, genes expressed only in Sample B would be green and those genes expressed equally in both samples would be yellow. This allows genes that are specifically expressed in response to the specific treatment or disease to be determined. (Reproduced with permission from Albelda, S.M. and Sheppard, D. Functional genomics and expression profiling: Be there to be square. Am J Respir Cell Mol Biol 2000: 23, 265–269).
Agilent, Amersham, or AME Bioscience, and thus the quality control is standardized and the cost of production is minimized; (ii) they typically contain an unbiased genome-wide rather than organ system-specific set of probes; and (iii) they contain more gene transcripts (almost 50,000 on the newest Affymetrix U133 Plus 2.0 GeneChip Array, for example). These are one-color microarrays, so that individual chips are used to study individual samples. In addition, multiple short probes are used to assay the expression of a given gene. Thus, a given gene may be represented by multiple probes spanning the length of the gene, such that the absolute gene expression is determined by incorporating the level of hybridization intensity of all the probes that make up a given gene. Both cDNA and oligonucleotide microarrays consist of a solid support, usually a glass slide or a nylon membrane, onto which DNA sequences are attached. For a microarray experiment, RNA from a sample is extracted, labeled, and, after
hybridization, complementary sequences remain bound to the array. Expressed genes are identified by the position of the bound probes on the microarray, and their abundance is determined by the intensity of the measured signal. Thus, an experiment with a single microarray can provide information on the expression of thousands of genes simultaneously: the transcriptome of a given tissue sample can thus be determined. The two most popular microarrays can be divided into Affymetrix GeneChips, a one channel, multi-probe, high-density oligonucleotide microarray, and two-color, one-probe per gene cDNA microarrays.There is no company that dominates the market for the latter so we will refer to these here as two-color arrays. Quality Control After microarray hybridization, the next essential step in microarray analysis is assurance of adequate array quality. All arrays
Gene Expression Technology
■
145
Hybridized probe cell Single-stranded, labeled RNA target
GeneChip® Probe Array
Oligonucleotide probe 24 μm 1.28 cm Millions of copies of a specific oligonucleotide probe
20,000 different complementary probes
Image of hybridized probe array Total RNA
cDNA AAAA
Reverse Transcription
Biotin-labeled cRNA B In Vitro Transcription
AAAA AAAA
Fragmentation
GeneChip® Expression Array
Fragmented, Biotin-labeled cRNA
Hybridization
Wash and stain
Scan and quantitate
Figure 12.2 (a) Schematic relating the oligonucleotide microarray platform, probe sequences, and scanned, hybridized image. The microarray is a solid support upon which nucleic acid probes corresponding to known gene transcripts are attached to specific locations. (b) The many steps involved in microarray analysis. Total RNA is isolated from tissue samples and used to synthesize doublestranded cDNA, which is then used as a template to make biotin-labeled cRNA. Fragmented, biotin-labeled cRNA is hybridized to a microarray, containing nucleic acid probes attached to the solid support. After washing, the microarray is scanned. By monitoring the amount of label associated with each probe location, it is possible to infer the abundance of each mRNA species represented. (Reproduced with permission from www.affymetrix.com).
should be subject to quality control. Array manufacturers typically provide software for quality control assessments. Typically these include tools for visual inspection of the array images for the presence of artifacts, and assessment of background levels (Figure 12.3). The academic community has produced quality control methodology as well for the Affymetrix (Bolstad et al., 2005) and two-color platforms (Ritchie et al., 2006). Only
arrays in which all parameters are within acceptable levels should be subject to further analysis. Normalization The final step in microarray preparation is conversion of feature-level data into gene expression levels (see also Chapter 13). Because genes on a microarray are represented by multiple
146
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
(a)
TABLE 12.1 Two major applications of microarray studies – Gene discovery and molecular signature analysis Gene discovery
Molecular signature analysis
Goal of analysis
Elucidate novel genetic pathways and therapeutic targets
Refine diagnosis and treatment
What is identified
Differentially expressed genes between disease states
A pattern of genes that characterizes a clinical parameter: diagnosis, prognosis, or response to therapy
Goal of validation
Confirm levels of gene expression
Assess the predictive accuracy of the molecular signature
Validation strategy
Quantitative PCR, Northern blotting, RNase protection assays
Apply the molecular signature to independent samples
(b)
Figure 12.3 Scanned images of hybridized Affymetrix microarrays. (a) A proper scanned image, with uniform intensity and no image artifacts. (b) An image with artifacts. The four, irregular dark spots at the center of the scanned image suggest irregularities of hybridization caused by variation in technique. This could affect the evaluation of gene expression and a new sample should be processed.
features, the hybridization intensities must be summarized to determine the overall expression or relative expression levels. In the case of two-color platforms, red and green foregrounds and background measurements are summarized into one value that quantifies relative expression between the two samples. In the Affymetrix platforms, the various features representing a gene need to be summarized into one value that quantifies expression. Background correction and normalization are issues that are common to all platforms. Background and signal adjustment corrects for background noise, adjusts for cross hybridization, and ensures that all values fall on the proper scale. Normalization eliminates systematic differences between arrays, either using a reference array or the combined information from all the arrays. The background-corrected, normalized, and converted probe summaries of gene expression can then be used for gene-level microarray analysis.
Major Applications of Microarray Analysis The challenge in microarray experiments is in the experimental design and statistical analysis, where the number of variables assessed is orders of magnitude higher than the number of individuals studied. There is concern of the dangers of data-mining, where lists of thousands of genes are generated without an understanding of the technical and statistical pitfalls of microarray analysis. To avoid this, it is essential to understand the distinct applications of this technology. Because the state of the transcriptome in a given diseased tissue may contain a highly accurate representation of key biological phenomena, patterns of gene expression have potential to provide insights into disease mechanisms and also to identify markers useful for diagnostic, prognostic, and therapeutic purposes. Thus, the two distinct major applications of this powerful technology are gene discovery and molecular signature analysis, and these two applications have different goals, statistical methods, and validation strategies (Table 12.1).
GENE DISCOVERY Overview Gene discovery involves the identification of differentially expressed genes between two or more states, for example disease
Gene Discovery
and normal tissue. The goal of gene discovery is to elucidate novel genetic pathways or cause-specific therapies. Experimental Design Given that gene discovery involves the comparison of gene expression between two or more disease states, samples must be chosen that belong to different groups: for example, normal versus disease, with and without treatment, or before and after an intervention. However, three major issues in experimental design bear further consideration: the utility of randomization, replication, and pooling. 1. Randomization is not widely used in biomedical laboratory experiments, but offers significant advantages here. One of the most striking aspects of oligonucleotide microarray experiments is the extent to which the experimental conditions can affect the resulting expression data. While normalization methods can reduce the impact of some of these differences, randomization is the best way to deal with them. In this manner, the samples can be grouped so that any unique chance features are shared equally by the two groups and thus are not potentially confounded with group differences. For example, all normal samples should not be analyzed on one day and all diseased samples on the next day. Similar considerations can apply whenever differences will occur across other factors known to contribute extraneous variability, such as chip batch, operator, reagents, scanners, and so on. Arrange those varying factors that can be controlled in a manner similar to that just described, and randomize across the remainder (Bolstad et al., 2004). 2. Another major issue in experimental design is whether replication should be performed. Here, it is essential to make the distinction between technical replicates and biological replicates. Technical replicates involve multiple measurements from the same tissue sample, whereas biological replicates involve multiple measurements from the same disease state. Clearly biological replication leads to data that are better for reaching conclusions that might apply more generally to the disease state, and technical replication leads to data better for reaching conclusions about that particular sample. In most cases, biological replicates are more relevant to the aims of an experiment than technical replicates. Chips are expensive, and biological replication will allow better use of limited resources getting more data at the level exhibiting higher variation, across subjects, say, than at the level exhibiting lower variation, within subjects (Bolstad et al., 2004). 3. A third major experimental design issue in gene discovery analyses is whether or not to pool samples. One perspective is that pooling provides a form of “biological averaging” and should make it possible to get more precise results with fewer chips than hybridizing RNA from individual subjects to separate chips. While this may be true, there is no compelling evidence for this in the literature, and pooling could mislead without being apparent. Suppose that one subject contributing to the pool is very different from the
■
147
remainder. Depending on the tissue under study, pooling this subject’s RNA with that of the others gives it the ability to exert a large influence on the measured expression values for many genes, whereas running it on a separate chip leaves open the possibility of identifying this subject as an outlier. Furthermore, averaging can be performed at the statistical instead of biological level. This provides the same variance reduction property without losing the ability to detect outliers and assess variance. Thus, if pooling is envisaged in an experiment that could be carried out at the same cost without pooling, it is better not to pool (Kendziorski et al., 2005). Frequently, pooling is seen to be necessary to get sufficient mRNA from the tissue in question, in which case the possible drawbacks must inevitably be accepted or at least weighed against the possible drawbacks of the alternative, which is usually amplification (Bolstad et al., 2004). Statistical Methods Many statistical methods have been used to identify differentially expressed genes. The first attempts simply looked at fold-changes, and fold-changes 2 were initially considered meaningful. However, investigators soon realized that the significance of twofold change depends on both gene and sample. For this reason, it is now considered essential to compare expression between two or more groups by taking into account not only the magnitude of the difference between groups, but also the variability of expression between groups. The naïve solution to summarizing the relationship within and across-group variances is the well-known t-test/F-test and their respective p-values. However, there are two major problems with this approach. First, for small samples (fewer than 5 subjects), the t-and F-statistics tend to favor genes that, by chance, have small within-group variances. The second problem is that of multiple comparisons. When performing thousands of tests at once, the customary definition of a p-value no longer holds. For example, if we are measuring expression for 10,000 genes, all of which have no true across-group difference, we expect the 100 genes to reach p-values equal or smaller than 0.01. Various solutions have been given to both these problems. Empirical Bayes methods and Stein estimators have been proposed to improve estimates of within-group variation (Cui et al., 2005; Kendziorski et al., 2003; Lonnstedt and Speed, 2002; Smyth, 2004), and these methods greatly improve the results obtained with the t-test in small samples. Solutions to the multiple comparison problem have also been proposed (Dudoit et al., 2003; Storey, 2002). A popular piece of software that combines all these ideas is the Significance Analysis of Microarrays (SAM) (Tusher et al., 2001). SAM identifies genes with statistically significant changes in expression by identifying a set of gene-specific statistics (similar to the t-test, thus taking into account both magnitude of change and variability of expression) and a corresponding false discovery rate (similar to a p-value adjusted for multiple comparisons; see Figure 12.4). These ad hoc procedures provide similar results to the more rigorous procedures cited above.
148
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
(a)
d(i )
Observed relative difference d(i )
(b)
efficiency, as well as posttranslational protein modification and interactions, will all affect the measured protein abundance.
xI (i ) xU (i ) s(i ) S0
MOLECULAR SIGNATURE ANALYSIS
10
Overview In contrast to transcriptome analysis for gene discovery, the goal of molecular signature analysis is to identify a pattern of gene expression that is associated with a clinical parameter, such as etiology, prognosis, or response to therapy, thus potentially providing diagnostic or prognostic precision greater than that available from standard clinical information (Table 12.1).
5
0
5
10 10
5 5 0 Expected relative difference dE(i )
10
Figure 12.4 Depiction of Significance Analysis of Microarrays method of identifying differentially expressed genes. (a) SAM uses a modified t-test statistic to determine the observed and expected relative differences in gene expression. To calculate the observed relative difference in gene expression, d(i), the difference in average gene expression between the two groups is divided by the standard deviation of repeated measures of gene expression, s(i) and a constant, so to minimize the impact of genes with low levels of expression. The expected relative difference is the average of the relative difference of all possible permutations of the samples. (b) The observed relative difference in gene expression is then plotted against the expected relative difference, and this scatterplot is used to determine which genes are differentially expressed. Differentially expressed genes are those displaced from the line of unity by a distance greater than the chosen threshold, and the threshold is chosen to minimize the false discovery rate. (Reproduced with permission from Tusher, V.G., Tibshirani, R. and Chu,G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc Nat Acad Sci USA 98, 5116–5121).
Validation Once differentially expressed genes are identified, the transcript abundance of selected genes of interest are confirmed by a complementary method, such as quantitative PCR, Northern blotting, or RNase protection assays (Cook and Rosenzweig, 2002b). Less commonly, levels of the corresponding protein have also been measured, with less agreement between transcript and protein abundance. However, this is not surprising, since differences in mRNA localization, processing, stability, translation
Experimental Design Many features already discussed in the section on “Gene Discovery” apply as well to the design of experiments focused on molecular signature analysis, including the importance of randomization of experimental conditions and the utility of biological, rather than technical, replicates. While pooling samples is not ideal in microarray analyses focused on gene discovery, it is clearly impossible in studies focused on molecular signature analysis since the identity of individual samples is the goal. Thus, pooling of samples should never be performed in a study focused on molecular signature analysis. In molecular signature analysis, samples are first divided into groups based on a clinically relevant parameter, such as disease etiology, prognosis, or response to therapy. Then a molecular signature is created by choosing genes whose expression is solidly associated with the parameter in question, and by weighting genes based on their individual predictive strengths. Statistical Methods From a mathematical standpoint, molecular signature analysis is nothing other than what the computer science community denotes “machine learning” and the statistics community denotes “classification.” There are dozens, if not hundreds, of methods that have been used successfully in real-world applications such as voice recognition, zip code readers, and fraud detection. Relatively simple methods that are successful in other fields appear to work well in the context of microarrays. However, all these methods have been developed for cases where the number of predictors is smaller than the number of outcomes. With microarrays, this is rarely the case, given that there are more than 10,000 predictors. Nevertheless, if the number of genes included as predictors can be reduced to, say, less than 50 and chosen according to evidence in favor of differential expression, then existing methods work rather well (Dudoit et al., 2002). In fact, the simplest methods, including K-nearest methods and diagonalized linear discriminant analysis (DLDA), work better than more complicated ones. Prediction Analysis of Microarrays (PAM) is one approach, similar to DLDA, that is designed specifically for molecular signature analysis (Tibshirani et al., 2002).
Molecular Signature Analysis
EWS
NB
149
RMS
0
500
1000
Gene
1500
2000
BL
■
0.5
0.0
0.5
0.5
0.0
0.5
0.5
0.0
0.5
0.5
0.0
0.5
Average expression
Figure 12.5 Depiction of Prediction Analysis of Microarray’s method of nearest shrunken centroids. Samples are first divided into classes based on a pre-defined parameter. In this case, there are four groups from a dataset containing samples of four different types of small round blue cell tumors of childhood: Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB), and rhabdomyosarcoma (RMS). The gray bars represent the standardized class centroid, the average expression of each gene in a given class divided by the within-class standard deviation (thus genes with stable expression have a greater contribution to the class centroid). The red bars represent the shrunken centroid, “de-noised” versions of centroids after subtraction of a given threshold determined by crossvalidation. After shrinkage, a small number of genes remain and these act as prototypes for each class. Independent samples are classified based on their squared distance from each prototypic class shrunken centroid. (Reproduced with permission from Tibshirani et al. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Nat Acad Sci USA 99, 6567–6572).
PAM uses the method of nearest shrunken centroids to identify and validate the smallest set of genes whose expression is associated with a predefined class (Figure 12.5). A standardized class centroid is a composition of mean values of expression of all individual genes in a given class divided by the within-class standard deviation (SD). Standardization allows greater weight to be given to genes whose expression is stable within a class.
PAM creates shrunken centroids by shrinking the class centroids towards the overall centroid by a threshold amount. This overall centroid is calculated by dividing the average expression of each gene in all classes by the pooled within-class SD. Thus, the shrunken centroids are “de-noised” versions of centroids that act as prototypes for each class. The threshold is chosen by 10-fold cross-validation. In this process, gene expression prediction profiles
150
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
Overall CV plot 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
22283 17755 13602 9778 6777 4590 2918 1917 1224 803 545 357 236 161 116 78 52 38 28 20 15 11 7 4 3 2 1 1 1 0
Misclassification error
(a)
0
1
2
3
4
5
Threshold (b) Actual, predicted classes and predicted posterior probabilities LC1003.CEL LC1008.CEL LC1022.CEL Sample label 1 1 1 Actual class 1 1 1 Predicted class Probability: class 1 0.94986124 0.983986836 0.894571 Probability: class 2 0.05013876 0.016013164 0.105429
Actual class Predicted class Probability: class 1 Probability: class 2
H1a.CEL 1 1 0.92565835 0.07434165
H1b.CEL 2 2 0.386107146 0.613892854
H2a.CEL 1 1 0.935907 0.064093
LC1024.CEL 1 1 0.9658909 0.0341091
LC1025.CEL 1 1 0.980457 0.019543
LC1026.CEL 1 1 0.970757 0.029243
LT1001.CEL 2 2 0.042903 0.957097
LT1002.CEL 2 2 0.163733 0.836267
H2b.CEL 2 1 0.9809665 0.0190335
H3a.CEL 1 2 0.181081 0.818919
H3b.CEL 2 1 0.998226 0.001774
H4a.CEL 1 1 0.936414 0.063586
H4b.CEL 2 1 0.932558 0.067442
Figure 12.6 Sample output from Prediction Analysis of Microarrays (PAM) software. (a) Cross-validation plot listing the error rates in classification from 10-fold cross-validation of the training set. The x-axis is the threshold and number of genes in the molecular signatures, and the y-axis is the error rate. Here, the error rate in the training set is minimized by between 52 and 545 genes. Thus, a profile with 52 genes will be chosen as the molecular signature in order to minimize both the error rate and the number of genes in the profile. (b) The output using this 52-gene signature on the test set to determine if samples belong to class 1 or class 2. PAM provides the predicted class as well as the probability of belonging to class 1 or class 2. The sample is assigned to the class with the greater probability.
with different numbers of genes (i.e., from different thresholds) are fit on the basis of 90% of the samples and then tested on the remaining 10%. This process is executed multiple times, and the output is the error rate using different thresholds. The prediction profile is chosen as the smallest list of genes that can be used to predict the classification of an individual sample with the minimal error rate (Figure 12.6). To then classify independent samples, PAM computes each test sample’s squared distance from each of the class centroids.The predicted class is the one whose centroid is closest to the expression profile of the test sample. In contrast to SAM and other methods of identifying differentially expressed genes, PAM focuses mainly on the stability of gene expression and on the smallest number of genes required to create a molecular signature. Validation While the goal of gene discovery is to identify differentially expressed genes that offer insight into novel genetic pathways or cause-specific therapies, the goal of molecular signature analysis is to identify a pattern of genes that differentiates between clinical entities with a precision not possible based on standard clinical information. Thus, the identity of the mRNA transcripts in the signature or whether they are translated into protein may or may not have immediately discernable bearing on the utility of
the pattern. Therefore, the validation strategy is also unique: testing the accuracy of the identified molecular signature in samples distinct from those used to create the signature: an independent test set of samples (Table 12.1). However, because most laboratories have access to only a limited number of samples, withholding a substantial proportion of the samples from the training set for the sake of creating a validation set may considerably reduce the performance of the molecular signature. Cross-validation procedures use the data more efficiently. A small number of specimens are withheld, and most of the specimens are used to build a signature. The signature is used to predict class membership for the withheld specimens. This process is iterated, leaving out a new set of specimens at each step, until all specimens have been classified. In leave-one-out cross-validation, for example, each specimen is excluded from the training set one at a time and then classified on the basis of the predictor built from the data for all of the other specimens. The leave-one-out crossvalidation procedure provides a nearly unbiased estimate of the true error rate of the classification procedure (Simon et al., 2003a). However, when cross-validation (instead of an independent dataset) is used, all three steps of creating a molecular signature must undergo cross-validation, including: (i) selection of informative genes, (ii) computation of weights for selected informative
Current Issues in Gene Expression Analysis
genes, and (iii) creation of a prediction rule. Failure to crossvalidate all steps may lead to substantial bias in the estimated error rate and is a major pitfall in studies focused on molecular signature analysis (Simon et al., 2003b). Applications in Cancer Research In neoplastic disease, molecular signature analysis can determine prognosis and response to therapy. In breast cancer, a molecular signature was identified that predicted disease outcome in young patients with breast cancer better than standard clinical and histological criteria: a poor prognosis signature was associated with a fivefold increased risk of distant metastases in 5 years, a difference that would justify early intensive adjuvant chemotherapy (van de Vijver et al., 2002). In large B-cell lymphoma, a molecular signature predicted survival better than standard clinical methods (Rosenwald et al., 2002). Similar results have been obtained for acute myeloid leukemia (Bullinger et al., 2004; Valk et al., 2004), chronic lymphocytic leukemia (Calin et al., 2005), prostate cancer (Dhanasekaran et al., 2001; Lapointe et al., 2004), CNS tumors (Pomeroy et al., 2002), and non-small cell lung cancer (Chen et al., 2007). After its successful application in cancer, molecular signature analysis has been expanded to other fields in medicine, for example, congenital diseases (Eshaque and Dixon, 2006) enabling even preimplantation diagnosis (Salvado et al., 2004), and cardiovascular medicine. Applications in Cardiovascular Disease In cardiovascular disease, molecular signature analysis is in its earliest stages, but shows promise. In a proof-of-concept study, a diagnostic TBB was developed, which is able to distinguish between the two major forms of cardiomyopathy, ischemic and nonischemic (Kittleson et al., 2004). Several studies, investigating transcriptomic changes during left ventricular assist device (LVAD) placement followed, suggesting possible TBBs of recovery, which might provide mechanistic insight and suggest potential therapeutic targets (Hall et al., 2004; Margulies et al., 2005c). However, those studies also raised an important issue in microarray analysis, namely the difficulty to distinguish transcriptomic changes that were causative for a certain event from those that could result from a bystander effect (Depre et al., 1998; Margulies et al., 2005b). A way to reduce such confounding variables, is to obtain tissue at very early disease-stages, for example, by obtaining biopsies or blood samples (Heidecker and Hare, 2007b). Clinical TBBs derived from peripheral blood mononuclear cells (PBMC) are commercially available as diagnostic microarray chips to detect early rejection in cardiac transplant recipients (Deng et al., 2006b). Further, the use of TBBs was suggested as a screening tool for cardiovascular patients, who might require cardiac catheterization and stent implantation (Seo et al., 2004). Function of MicroRNAs in Cardiology Recent transcriptomic studies have discovered important regulatory functions of microRNAs (miRNAs) in cardiac remodeling
■
151
and heart failure (van et al., 2007b). MiRNAs are non-translated small single-stranded RNA molecules of 21–23 nucleotides in length. These small RNA species (Jackson and Standart, 2007b) either degrade bound mRNA or directly inhibit translation of mRNA by pairing with their target mRNAs in a sequence specific manner. Thus, miRNAs conduct negative regulatory function on gene expression (Jackson and Standart, 2007a). MiRNA of an intron of the MHC gene has been shown to be responsible for stress-induced cardiomyocyte growth and gene expression (van et al., 2007a). These findings point out the significant impact of gene–environment interactions on the phenotype of a cell or tissue and emphasize the value of comprehensive highthroughput technologies such as transcriptome profiling (Birney et al., 2007).
GENE DISCOVERY VERSUS MOLECULAR SIGNATURE ANALYSIS One important distinction in molecular signature analysis is the need first to identify a set of genes whose expression characterizes a pre-defined group of patients (with a clinically relevant distinction such as etiology, prognosis, or response to therapy), and then to test the predictive accuracy of this profile prospectively in an independent set of patients with varying phenotypes (Cook and Rosenzweig, 2002a). Another important distinction is that molecular signature analysis is based upon a pattern of gene expression rather than the identity of specific genes (Simon et al., 2003c) (Table 12.1). The prediction algorithm is able to compare an unknown sample and determine how closely it resembles one pattern versus the other; the absolute expression of an individual gene carries relatively small weight compared to the overall signature. Thus, validation cannot solely involve the confirmation of gene expression or gene product levels via a complementary technique such as quantitative PCR or immunofluorescence. Nevertheless, such validation can prove useful to address a different issue: whether the molecular signature offers utility independent of the microarray platform used to create it. This is important if diseasespecific platforms are developed.
CURRENT ISSUES IN GENE EXPRESSION ANALYSIS There are a number of issues in transcriptomic analysis that are being addressed by workers in the field (Table 12.2). Information Management and Reproducibility in Microarray Experiments Given the rapid growth of microarray research and the vast amount of information that can be gleaned from a single experiment, there are many challenges in designing studies and interpreting the results. To this end, the Microarray Gene Expression Data society (MGED) has developed the Minimum Information
152
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
TABLE 12.2 Current unresolved issues in transcriptomic analysis Current issues
Solutions
Study design, interpretation of results, reproducibility ●
●
Guidelines for experimental design, sample preparation, hybridization, normalization of array data, array design
MIAME
Data sharing
Standards Public Databases: National center for biotechnology information (NCBI); Gene expression omnibus (GEO); European bioinformatics institute (EMBLEBI) arrayexpress repository; Center for information biology gene expression (CIBEX); Cardiogenomics
Sample source ●
Relevance of investigated sample for later clinical application
Sample collection at early stage of disease, preferred tissue source: biopsies
●
Possible surrogates for diseased tissue or organ
PBMC
Power and sample size ●
Conventional power analysis not applicable (magnitude of change and error standard deviation not available)
Large sample size (in particular in clinical research), integrate biological knowledge with statistical analysis
About a Microarray Experiment (MIAME) standards that are needed to enable the interpretation of the results of the experiment unambiguously and potentially to reproduce the experiment (Brazma et al., 2001a). MIAME includes details of the experimental design, sample preparation, hybridization procedures, normalization algorithms, and array design. The ultimate goal is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools. Such public repositories exist, including the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO); the European Bioinformatics Institute (EMBL-EBI) ArrayExpress repository; and the Center for Information Biology Gene Expression CIBEX) database. There is also a cardiology-specific repository, the Cardiogenomics Web site (www.cardiogenomics.com), a National Heart Lung and Blood Institute-sponsored Program for Genomic Applications. Many
journals currently require that submissions comply with the MIAME standard, but not all require that a complete dataset be submitted to one of three databases prior to manuscript submission. Furthermore, the public repositories do not require data being submitted in raw as well as normalized fashion, which would be essential for future studies. We believe that mandatory submission of raw microarray data files to public repositories, by promoting collaboration among scientists, could be essential for maximizing the utility of microarray research. The Source of Tissue for Analysis To date, microarray analyses in oncology have used mainly tumor tissue from biopsies. However, other pathological conditions cannot rely on such a ready source of tissue access. For example, microarray studies in cardiovascular disease have mainly utilized discarded myocardial tissue obtained at the time of cardiac transplantation, and there are limitations to this approach. The tissue is obtained from patients late in the disease course, and the conclusions, therefore, may not be applicable to patients at an earlier stage of disease. Thus, in the future, microarray analyses in cardiomyopathy will ideally focus on endomyocardial biopsy tissue obtained from patients at earlier stages of disease. Although endomyocardial biopsy is a safe procedure, future research will focus on PBMC as possible surrogate of diseased tissue, given easy accessibility by venipuncture (Heidecker and Hare, 2007a; Liew, 2005). In the cancer literature, molecular signatures derived from peripheral blood leukocytes offer comparable predictive accuracy to those from solid tumor samples in classifying subjects by cancer type and type of therapy (DePrimo et al., 2003; Twine et al., 2003). This may also be feasible in cardiovascular disease, as peripheral blood molecular signatures correlated with biopsyproven allograft rejection in cardiac transplant recipients (Deng et al., 2006a; Horwitz et al., 2004), cardiac genes in circulating blood were differentially expressed in patients with coronary artery disease relative to controls (Ma and Liew, 2003), and TBBs for familial combined hyperlipidemia were discovered in lymphocytes (Morello et al., 2004). Power and Sample Size in Microarray Experiments Clearly, molecular signature analysis is still in its earliest stages in cardiomyopathy, and there are still many unanswered questions. For example, there is limited knowledge on the sample sizes required in microarray experiments. The largest microarray study in cardiomyopathy to date has involved 199 patient samples (Margulies et al., 2005a), and in the oncology literature, each study has employed fewer than 300 patients. However, this should be considered a strength of these analyses: a succession of smaller studies, performed quickly and with the use of improving technology, will outperform larger studies locked into outdated approaches (Liu and Karuturi, 2004). The greater question is, however, whether traditional power calculations have any role in microarray experiments. A conventional power analysis is of limited value because the necessary variables are usually all unknown: we will only rarely be able
Conclusions
to nominate in advance the gene, the magnitude of change () of interest, or the magnitude of the measurement error SD. We also need to know the SD for all genes, and these are not the same. Nevertheless, it might still be argued that we should be able to determine the power we have with a given sample to detect all genes whose log fold change is some value or greater, and whose measurement SD is no more than a given (conservative) value, even if we will not know in advance which genes these are. However, there is a much greater obstacle to making use of conventional power analyses. In microarray studies, or other similar “-omic” studies, we are not measuring just one gene’s expression; in a typical experiment, we measure gene expression in 20,000 genes. Not only do we have to specify the number of genes we expect to be differentially expressed, but we need to decide on loss functions for false positives and negatives. In the typical situation where power calculations are used, we either get it right, commit one Type I (false positive) error, or one type II (false negative) error. Keep in mind that power is the probability of not committing a type II error when the null hypothesis is false. This definition does not apply to microarray applications where we report lists of genes hoping that a small percentage of the list are false positives and the list is long enough so that the number of false negatives is small. There is no easy solution to these problems, but current experiments rely on a few principles. First, one should aim to get as many case and control samples as possible, bearing in mind the important requirement of homogeneity. Second, the quality of the statistical analysis should improve, since the sample size cannot be increased beyond a certain point. A “smarter analysis” is needed to overcome the limitations of modest sample size. In addition, we need to depart from the conventional “context-free” search for differentially expressed genes that get embodied in power calculations and their multiple testing analogs, and more fully integrate biological knowledge with statistical analysis. However, even with these approaches, multiple testing issues will remain, so we have to extend that theory to apply to our stronger analysis. In brief, we are still in the early days of the statistical analysis of microarray data. Much traditional thinking must be extended and strengthened (Bolstad et al., 2004).
ALTERNATIVE TECHNOLOGIES FOR ANALYSIS OF THE TRANSCRIPTOME While, as presented here, microarray analysis is the most widely used approach for evaluating the transcriptome, there are several additional technologies that have been developed and have been used in particular settings. Serial Analysis of Gene Expression (SAGE) SAGE is a method of quantification of “tags”, rather than direct measurement of gene expression (Lash et al., 2000a). After isolation of mRNA, a small part of sequence is extracted from
■
153
each transcript at a defined position. All extracted pieces get linked together for subsequent cloning in a bacterial vector. Tags contain a nucleotide sequence of 9–11 bases in length and are representative for a unique gene transcript (Lash et al., 2000b). Tags are directly 3-adjacent to the 3-most restriction site for a particular restriction enzyme. Whereas the most widely used restriction enzyme is NlaIII, others such as Sau3A have also been used (Lash et al., 2000c). Several pitfalls with this technique warrant consideration. Because of the short length of tags, genes might share common base sequences, resulting in ambiguous tag-to-gene assignment. On the other hand, one gene can have more than one tag, because of polymorphism in a population or alternate termination in an individual (Lash et al., 2000d). Finally, by providing a digital output, fidelity of this method is reduced (Lash et al., 2000e). However, results from SAGE analysis have been shown to correlate highly with microarray data (Ishii et al., 2000), and results can be shared in public databases (http://www.ncbi.nlm.nih.gov/projects/SAGE). An important advantage of SAGE compared to microarrays is that the mRNA sequence does not have to be known in advance, so new polymorphisms can be discovered. Polony Multiplex Analysis of Gene Expression (PMAGE) PMAGE has been recently introduced as an extremely sensitive technique, which detects mRNA even in low abundance as small as one transcript per three cells (Kim et al., 2007a). Each mRNA is converted to cDNA, which gets cleaved and amplified onto beads carrying primers (Kim et al., 2007b). Each polony (for “PCR colony”) bead gets cross-linked to an aminosylated glass slide, from which fluorescent signal intensity can be measured via microscopy-based imaging (Kim et al., 2007c). Massively Parallel Signature Signaling (MPSS) MPSS was developed primarily to reduce time and cost of DNA sequencing as compared to conventional techniques (i.e., Sanger Sequencing (Sanger et al., 1977) and fluorescence-based electrophoresis technologies (Prober et al., 1987)). MPSS enables sequencing of as much as 25 million bases in a four-hour period, a rate that is about 100-times faster than current stateof-the art conventional sequencing. DNA is fragmented and bound to beads that are encased in droplets containing all reactants necessary to amplify the DNA. After amplification to about 10 million copies, DNA carrying beads are loaded into picolitre reactor wells and sequenced via luciferin reaction (Rogers and Venter, 2005). The same principles have been used for quantification of gene transcripts (Jongeneel et al., 2003).
CONCLUSIONS In conclusion, transcriptomics with microarray or other technologies offers a powerful means of phenotypic resolution not possible with standard clinical or histological criteria. The
154
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
promising applications of gene expression analysis include gene discovery and molecular signature analysis.These applications could elucidate novel genetic pathways, provide insight into causespecific therapies, and provide information on prognosis and response to therapy to individualize patient management in a variety of diseases. Although there are unresolved issues in this field, including power and sample size requirements, there is still
much promise, and we can expect further advances in refining the design and statistical methods of this technology in the future.
ACKNOWLEDGEMENTS This work was supported by NIH grant HL-65455 to JHM.
REFERENCES Albelda, S.M. and Sheppard, D. (2000). Functional genomics and expression profiling: Be there or be square. Am J Respir Cell Mol Biol 23, 265–269. Alizadeh, A., Eisen, M., Davis, R.E., Ma, C., Sabet, H.,Tran,T., Powell, J.I., Yang, L., Marti, G.E., Moore, D.T. et al. (1999). The lymphochip: A specialized cDNA microarray for the genomic-scale analysis of gene expression in normal and malignant lymphocytes. Cold Spring Harb Symp Quant Biol 64, 71–78. Barrans, J.D., Stamatiou, D. and Liew, C.C. (2001). Construction of a human cardiovascular cDNA microarray: Portrait of the failing heart. Biochem Biophys Res Comm 280, 964–969. Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z.P., Snyder, M., Dermitzakis, E.T., Stamatoyannopoulos, J.A. et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816. Bolstad, B.M., Collin, F., Simpson, K.M., Irizarry, R.A. and Speed, T.P. (2004). Experimental design and low-level analysis of microarray data. Int Rev Neurobiol 60, 25–58. Bolstad, B.M., Irizarry, R.A., Gautier, L. and Wu, Z. (2005). Preprocessing High-density Oligonucleotide Arrays. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. Gentleman,V.J. Carey,W. Huber, R.A. Irizarry and S. Dudoit, eds), Springer, New York, pp. 13–32. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C. et al. (2001a). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29, 365–371. Bullinger, L., Dohner, K., Bair, E., Frohling, S., Schlenk, R.F., Tibshirani, R., Dohner, H. and Pollack, J.R. (2004). Use of geneexpression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med 350, 1605–1616. Butte, A. (2002). The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–960. Calin, G.A., Ferracin, M., Cimmino, A., Di Leva, G., Shimizu, M., Wojcik, S.E.,Iorio, M.V.,Visone, R.,Sever, N.I.,Fabbri, M. et al. (2005). A MicroRNA signature associated with prognosis and progression in chronic lymphocytic leukemia. N Engl J Med 353, 1793–1801. Chen, H.Y., Yu, S.L., Chen, C.H., Chang, G.C., Chen, C.Y., Yuan, A., Cheng, C.L., Wang, C.H., Terng, H.J., Kao, S.F. et al. (2007). A fivegene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med 356, 11–20. Cook, S.A. and Rosenzweig, A. (2002b). DNA microarrays: Implications for cardiovascular medicine. Circ Res 91, 559–564. Cui, X., Hwang, J.T., Qiu, J., Blades, N.J. and Churchill, G.A. (2005). Improved statistical tests for differential gene expression
by shrinking variance components estimates. Biostatistics 6, 59–75. Deng, M.C., Eisen, H.J., Mehra, M.R., Billingham, M., Marboe, C.C., Berry, G., Kobashigawa, J., Johnson, F.L., Starling, R.C., Murali, S. et al. (2006a). Noninvasive discrimination of rejection in cardiac allograft recipients using gene expression profiling. Am J Transplant 6, 150–160. Depre, C., Shipley, G.L., Chen, W., Han, Q., Doenst, T., Moore, M.L., Stepkowski, S., Davies, P.J. and Taegtmeyer, H. (1998). Unloaded heart in vivo replicates fetal gene expression of cardiac hypertrophy. Nat Med 4, 1269–1275. DePrimo, S.E.,Wong, L.M., Khatry, D.B., Nicholas, S.L., Manning,W.C., Smolich, B.D., O’Farrell, A.M. and Cherrington, J.M. (2003). Expression profiling of blood samples from an SU5416 Phase III metastatic colorectal cancer clinical trial: A novel strategy for biomarker identification. BMC Cancer 3, 3. Dhanasekaran, S.M., Barrette, T.R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K.J., Rubin, M.A. and Chinnaiyan, A.M. (2001). Delineation of prognostic biomarkers in prostate cancer. Nature 412, 822–826. Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat Assoc 97, 77–87. Dudoit, S., Schaffer, J.P. and Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiments. Stat Sci 18, 71–103. Eshaque, B. and Dixon, B. (2006). Technology platforms for molecular diagnosis of cystic fibrosis. Biotechnol Adv 24, 86–93. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537. Hall, J.L., Grindle, S., Han, X., Fermin, D., Park, S., Chen,Y., Bache, R.J., Mariash, A., Guan, Z. and Ormaza, S. (2004). Genomic profiling of the human heart before and after mechanical support with a ventricular assist device reveals alterations in vascular signaling networks. Physiol Genomics 17, 283–291. Heidecker, B. and Hare, J.M. (2007a). The use of transcriptomic biomarkers for personalized medicine. Heart Fail Rev 12, 1–11. Horwitz, P.A., Tsai, E.J., Putt, M.E., Gilmore, J.M., Lepore, J.J., Parmacek, M.S., Kao, A.C., Desai, S.S., Goldberg, L.R., Brozena, S.C. et al. (2004). Detection of cardiac allograft rejection and response to immunosuppressive therapy with peripheral blood gene expression. Circulation 110, 3815–3821. Ishii, M., Hashimoto, S.,Tsutsumi, S.,Wada,Y., Matsushima, K., Kodama,T. and Aburatani, H. (2000). Direct comparison of GeneChip and SAGE on the quantitative accuracy in transcript profiling analysis. Genomics 68, 136–143.
References
Jackson, R.J. and Standart, N. (2007). . How do microRNAs regulate gene expression? Sci STKE. re1 Jongeneel, C.V., Iseli, C., Stevenson, B.J., Riggins, G.J., Lal, A., Mackay, A., Harris, R.A., O’Hare, M.J., Neville, A.M., Simpson, A.J. et al. (2003). Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc Natl Acad Sci USA 100, 4702–4705. Kendziorski, C., Irizarry, R.A., Chen, K.S., Haag, J.D. and Gould, M.N. (2005). On the utility of pooling biological samples in microarray experiments. Proc Natl Acad Sci USA 102, 4252–4257. Kendziorski, C.M., Newton, M.A., Lan, H. and Gould, M.N. (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med 22, 3899–3914. Kim, J.B., Porreca, G.J., Song, L., Greenway, S.C., Gorham, J.M., Church, G.M., Seidman, C.E. and Seidman, J.G. (2007a). Polony multiplex analysis of gene expression (PMAGE) in mouse hypertrophic cardiomyopathy. Science 316, 1481–1484. Kittleson, M.M. and Hare, J.M. (2005). Molecular signature analysis: The potential of gene expression analysis in cardiomyopathy. Future Cardiol. Kittleson, M.M., Ye, S.Q., Irizarry, R.A., Minhas, K.M., Edness, G., Conte, J.V., Parmigiani, G., Miller, L.W., Chen, Y., Hall, J.L. et al. (2004). Identification of a gene expression profile that differentiates between ischemic and nonischemic cardiomyopathy. Circulation 110, 3444–3451. Lapointe, J., Li, C., Higgins, J.P., van de, R.M., Bair, E., Montgomery, K., Ferrari, M., Egevad, L., Rayford, W., Bergerheim, U. et al. (2004). Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci USA 101, 811–816. Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J. and Altschul, S.F. (2000a). SAGEmap: A public gene expression resource. Genome Res 10, 1051–1060. Liew, C.C. (2005). Expressed genome molecular signatures of heart failure. Clin Chem Lab Med 43, 462–469. Liu, E.T. and Karuturi, K.R. (2004). Microarrays and clinical investigations. N Engl J Med 350, 1595–1597. Lonnstedt, I. and Speed, T.P. (2002). Replicated microarray data. Statistical Sinica 12, 31–46. Lossos, I.S., Czerwinski, D.K., Alizadeh, A.A., Wechser, M.A., Tibshirani, R., Botstein, D. and Levy, R. (2004). Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med 350, 1828–1837. Ma, J. and Liew, C.C. (2003). Gene profiling identifies secreted protein transcripts from peripheral blood cells in coronary artery disease. J Mol Cell Cardiol 35, 993–998. Margulies, K.B., Matiwala, S., Cornejo, C., Olsen, H., Craven, W.A. and Bednarik, D. (2005a). Mixed messages. Transcription patterns in failing and recovering human myocardium. Circ Res, 1–8. Morello, F., de Bruin, T.W., Rotter, J.I., Pratt, R.E., van der Kallen, C.J., Hladik, G.A., Dzau, V.J., Liew, C.C. and Chen, Y.D. (2004). Differential gene expression of blood-derived cell lines in familial combined hyperlipidemia. Arterioscler Thromb Vasc Biol 24, 2149–2154. Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C. et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442.
■
155
Prober, J.M., Trainor, G.L., Dam, R.J., Hobbs, F.W., Robertson, C.W., Zagursky, R.J., Cocuzza, A.J., Jensen, M.A. and Baumeister, K. (1987). A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238, 336–341. Ritchie, M.E., Diyagama, D., Neilson, J., van Laar, R., Dobrovic, A., Holloway, A. and Smyth, G.K. (2006). Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics 7, 261. Rogers, Y.H. and Venter, J.C. (2005). Genomics: Massively parallel sequencing. Nature 437, 326–327. Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Campo, E., Fisher, R.I., Gascoyne, R.D., Muller-Hermelink, H.K., Smeland, E.B., Giltnane, J.M. et al. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse largeB-cell lymphoma. N Engl J Med 346, 1937–1947. Salvado, C.S., Trounson, A.O. and Cram, D.S. (2004). Towards preimplantation diagnosis of cystic fibrosis using microarrays. Reprod Biomed Online 8, 107–114. Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74, 5463–5467. Seo, D., Wang, T., Dressman, H., Herderick, E.E., Iversen, E.S., Dong, C., Vata, K., Milano, C.A., Rigat, F., Pittman, J. et al. (2004). Gene expression phenotypes of atherosclerosis. Arterioscler Thromb Vasc Biol 24, 1922–1927. Simon, R., Radmacher, M.D., Dobbin, K. and McShane, L.M. (2003b). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95, 14–18. Smyth, G.K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3, 1–27. Storey, J.D. (2002). A direct approach to false discovery rates. J Roy Stat Soc B, 479–498. Tan, P.K., Downey, T.J., Spitznagel, E.L., Jr., Xu, P., Fu, D., Dimitrov, D.S., Lempicki, R.A., Raaka, B.M. and Cam, M.C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 31, 5676–5684. Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99, 6567–6572. Tusher, V.G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98, 5116–5121. Twine, N.C., Stover, J.A., Marshall, B., Dukart, G., Hidalgo, M., Stadler, W., Logan, T., Dutcher, J., Hudes, G., Dorner, A.J. et al. (2003). Disease-associated expression profiles in peripheral blood mononuclear cells from patients with advanced renal cell carcinoma. Cancer Res 63, 6069–6075. Valk, P.J., Verhaak, R.G., Beijen, M.A., Erpelinck, C.A., Barjesteh van Waalwijk van Doorn-Khosrovani, , Boer, J.M., Beverloo, H.B., Moorhouse, M.J., van der Spek, P.J., Lowenberg, B. and Delwel, R. (2004). Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J Med 350, 1617–1628. van de Vijver, M.J., He, Y.D., van’t Veer, L.J., Dai, H., Hart, A.A., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J. et al. (2002). A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347, 1999–2009. van, Rooij E., Sutherland, L.B., Qi, X., Richardson, J.A., Hill, J. and Olson, E.N. (2007b). Control of stress-dependent cardiac growth and gene expression by a microRNA. Science 316, 575–579.
156
CHAPTER 12
■
Transcriptomics: Translation of Global Expression Analysis to Genomic Medicine
RECOMMENDED RESOURCES Books Pevsner J. (2003). Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. Gentleman, R., Carey, V.J., Huber, W., Irizarry, R.A., Dudoit, S. (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer ScienceBusiness Media, Inc.
Public Online Databases Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/ MGED society: http://www.mged.org/Workgroups/MIAME/miame.html European Bioinformatics Institute: http://www.ebi.ac.uk/ Swiss Institute of Bioinformatics: http://www.isb-sib.ch/ DBGET Database: http://www.genome.jp/dbget/dbget.links.html Weizmann Institute of Science, Gene Cards: http://www.genecards.org/ index.shtml
Kyoto Encyclopedia of Genes and Genomes: http://www.genome. jp/kegg/
Download Sites for Free Software Bioconductor: open source software: http://www.bioconductor.org/ Significance Analysis of Microarrays: http://www-stat.stanford.edu/tibs/ SAM/ Prediction Analysis of Microarrays: http://www-stat.stanford.edu/tibs/ PAM/ The R Project for Statistical Computing: http://www.r-project.org/ DNA chip Analyzer (dChip): http://biosun1.harvard.edu/complab/dchip/ install.htm Gene Pattern: http://www.broad.mit.edu/cancer/software/genepattern/
CHAPTER
13 DNA Microarrays in Biological Discovery and Patient Care Andrew J. Yee and Sridhar Ramaswamy
INTRODUCTION While “expression profiling” is now synonymous with DNA microarray technology, expression profiling dates back to the 1970s with techniques such as northern blotting for measuring RNA expression (Alwine et al., 1977). However, only one gene at a time could be queried with these modalities. Later developments enabled analysis of multiple genes (Liang and Pardee, 2003), including differential hybridization (Sargent, 1987), subtractive hybridization (Zimmermann et al., 1980), differential display (Liang and Pardee, 1992), and serial analysis of gene expression (SAGE) (Velculescu et al., 1995). However, these approaches required a significant amount of starting RNA and investment in labor and technical expertise, and were difficult to scale up. DNA microarrays revolutionized the approach to gene expression profiling. Compared to prior methods, DNA microarrays were both dramatically high throughput and less cumbersome. It should be noted that the concept behind the microarray format was not new, as microarrays were originally developed as a technique for large-scale DNA mapping and sequencing (Hoheisel, 2006). However, changing the support surface from a porous membrane to a solid surface, such as glass, afforded significant improvements (Schena et al., 1998) by increasing reaction kinetics (Southern et al., 1999) and reducing background noise (Cheung et al., 1999).
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
As an illustration of approaches to define and develop applications of the transcriptome measured in biological samples (as reviewed in Chapter 12), here we present specific issues related to the use of DNA microarrays, using applications especially in the context of cancer to illustrate the uses and challenges of gene expression profiling in genomic and personalized medicine.
MICROARRAY TECHNOLOGY Microarray Platforms Although the technology for DNA microarrays continues to evolve, the commonly used arrays can be divided into two main categories, as described in detail in Chapter 12: (i) oligonucleotide and (ii) spotted or complementary DNA (cDNA) microarrays (see Figure 13.1). Over the years, commercial oligonucleotide platforms have gained in popularity over “homemade” spotted cDNA microarrays as costs have fallen considerably, to the point where commercial microarrays are the de facto platform. At the time of this writing, the dominant platforms are produced by Affymetrix and Agilent. For additional details, the reader is referred to a review by Hardiman (2004) and also to reviews collected in a series of Nature Genetics supplements (Bowtell, 1999; Cheung
Copyright © 2009, Elsevier Inc. All rights reserved. 157
158
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
(a) M1 X X X X
M2 Shine light
M1 X X
X X Couple
X X A A
Repeat
Microarray (b)
Touch surface
Move pins
Repeat
Microarray
Figure 13.1 Comparison of microarray technologies. (a) Photolithography (as used by Affymetrix): a glass wafer modified with photolabile protecting groups (X) is selectively activated for DNA synthesis by shining light through a photomask (M1). The wafer is then flooded with a photoprotected DNA base (A–X), resulting in spatially defined coupling on the chip surface. A second photomask (M2) is used to deprotect defined regions of the wafer. Repeated deprotection and coupling cycles result in a highly dense microarray. (b) Spotted arrays: the gene of interest is loaded into a spotting pin by capillary action, and a small volume is transferred to a solid surface by physical contact between the pin and the solid substrate. After the first spotting cycle, the pin is washed and a second sample is loaded and deposited to an adjacent location. Robotic control systems and multiplexed printheads allow automated microarray fabrication. Adapted from Schena et al. (1998). (Reprinted with permission from Elsevier).
et al., 1999; Churchill, 2002; Duggan et al., 1999; Holloway et al., 2002; Lipshutz et al., 1999; Southern et al., 1999). Oligonucleotide Arrays Oligonucleotide microarrays are created either by in situ synthesis or deposition of presynthesized oligonucleotides ranging in size from 25- to 60-mers. Oligonucleotides can be synthesized directly in situ using photolithography techniques adapted from the microelectronics industry. The Affymetrix GeneChip platform is the primary example of this method, and because of its popularity, merits further discussion. Emerging from research carried out in the early 1990s (Fodor et al., 1991), this technology combines combinatorial, solid phase DNA synthetic chemistry with the benefits of photolithography. Successive photolithographic masks are used to select regions of the chip surface for exposure to light. Exposure to light deprotects photolabile groups on the nascent oligonucleotides undergoing synthesis. The light-mediated deprotection then allows these specified regions to couple with activated nucleoside monomers using standard DNA phosphoramidite synthetic chemistry. Each cycle thus extends the oligonucleotide by one base (see Figure 13.1). Over the past 15 years, the feature size of each oligonucleotide probe or feature has been reduced 10-fold, resulting in a 1.6 cm2 chip with 1.3 million unique features (multiple features are used to measure the expression of a particular gene). The detection sensitivity of these microarrays has been estimated at
1 in 300,000 RNAs (Lockhart et al., 1996). A practical limitation is the efficiency of each synthesis step – about 90–95%. This places a ceiling on in situ synthesis of an oligonucleotide to 25-mers. To control for non-specific binding, the probes on the Affymetrix chip are designed in pairs. One sequence is the exact complement of the target transcript, and the other paired sequence differs from the complement typically by one base pair near the middle of the probe. Difference in signal between the two probes can be used to control for non-specific binding and background contributions (Pease et al., 1994); 22 probes are routinely used for each expression measurement. Compared to cDNA arrays (see below), oligonucleotide arrays offer greater specificity and can distinguish single nucleotide polymorphisms (SNPs) and splice variants (Guo et al., 1994); the same technology is used in arrays for measuring DNA copy number and SNP profiling. The current generation of Affymetrix microarrays can measure the expression of over 47,000 transcripts. Hybridization is detected by using a confocal laser scanning microscope to image the intensity of one color, fluorescently labeled samples. First, experimental mRNA is enzymatically amplified (see below for details of RNA amplification) and then labeled with biotin through the partial substitution of UTP and CTP with biotin-11-UTP and biotin-11-CTP (see Chapter 12). The labeled mRNA is then hybridized to the microarray and detected by the binding of a fluorescent compound (streptavidin– phycoerythrin) to the biotin-labeled samples. Because of the
Microarray Technology
robustness and consistency of the manufacturing process, the single-color readout provides an absolute quantitation of mRNA populations (Ishii et al., 2000). In addition, experiments can be compared with samples hybridized to different Affymetrix arrays. Oligonucleotide microarrays may also be fabricated by using ink-jet printing technology. Nucleotide monomers are printed onto the chip and coupled using phosphoramidite chemistry (Hughes et al., 2001). Agilent has adopted this technology to create a 60-mer oligonucleotide array. Compared to the 25-mer photolithographic chip, a 60-mer array might offer increased sensitivity per probe. In contrast to the Affymetrix platform, there is only one probe per gene in this platform. The longer probes are also more tolerant of mismatches, improving analysis of polymorphic regions. Spotted, Complementary DNA Microarrays The first DNA microarrays were developed at Stanford University and based on cDNA samples spotted by a robotic arrayer at defined locations onto glass slides (Schena et al., 1995). These cDNA samples were generally PCR products generated from cDNA libraries or clone collections. The major benefit of spotted arrays is that they can be constructed in-house and tailored to the user’s needs. The features are generally 100–300 m in size and spaced at about the same distance. About 30,000 cDNAs can be spotted onto the surface of a microscope slide (Schulze and Downward, 2001). In the initial report, the sensitivity was estimated at 1 in 50,000 RNAs starting with 5 g of total RNA (Schena et al., 1995). While spotted arrays offer customizability, they are better suited for smaller-scale experiments, since from a practical perspective managing large clone libraries and creating high-quality arrays consistently can be arduous. In the past, an advantage of spotted arrays was that a priori knowledge of the sequence being spotted was not necessary. However, this is less relevant now that complete sequence information for many experimental organisms and humans is readily accessible. Gene expression using spotted microarrays is determined as a ratio of two samples rather than as an absolute measurement, owing to technical limitations and variability in spot size and probe concentration. Generally, cDNAs derived from experimental and reference RNA samples are labeled with deoxyribonucleotides coupled to different fluorophores (e.g., with Cy3 or Cy5) during first-strand cDNA synthesis. The two different populations are then competitively hybridized to the same array and imaged. While this method accurately compares expression levels between samples, comparing data from different platforms can be difficult, often requiring complex normalization algorithms (Dudley et al., 2002). Sample Preparation Tissue Sampling For biological applications, tissue sampling is an important variable. For example, tumors are heterogeneous mixtures of different cell types, including malignant cells with varying degrees of
■
159
differentiation, stromal elements, blood vessels, and inflammatory cells. Profiling of this whole sample may dilute out the “signal” profile of cancer cells with extraneous expression “noise” from surrounding tissues. Newer techniques such as laser capture microdissection (LCM) address the challenges of tissue heterogeneity by offering an approach for isolating homogeneous, individual cell populations (Emmert-Buck et al., 1996). While there are other approaches, such as fluorescence-activated cell sorting (FACS) and manual microdissection using micromanipulators (Emmert-Buck et al., 1994), LCM is faster, more versatile, and can be applied to solid tumors directly without the need for disaggregation into a cell suspension. Sgroi et al. (1999) first showed that it was feasible to use LCM to isolate subpopulations of breast cancer cells for expression profiling. Coupled with LCM has been the evolution of methodologies for purifying adequate amounts of high-quality RNA for expression profiling (see below). However, a theoretical limitation of focusing only on individual tissue components such as cancer cells relates to the growing appreciation that interactions with the tumor microenvironment (e.g., with stroma, endothelial, and immune cells) play a critical role in tumor progression. Sampling of “non” cancer cell components thus may be necessary for a profile to adequately reflect in vivo biology. An additional consideration is that the actual surgical manipulation itself may affect the expression profiling. Lin et al. (2006) compared expression profiles of prostate cancer from biopsies obtained in situ to biopsies obtained from prostates that were surgically removed. They found that 1.5% of measurable genes were altered as a result of the surgery, including transcripts for acute phase response proteins and regulators of cell proliferation. RNA Isolation Traditionally, microarray experiments have required roughly 50–200 μg of total RNA (Duggan et al., 1999). However, these quantities are not always easily obtainable, especially with techniques such as LCM. Protocols therefore have been developed for amplifying RNA from much smaller quantities, perhaps even down to the level of a single cell. For example, a single cell contains an estimated 0.1 pg of mRNA or 10 pg of total RNA, requiring an expansion of 108–109 fold in order to yield adequate amounts for profiling. Two commonly used amplification techniques include in vivo transcription and PCR amplification. In vivo transcription is a linear amplification approach, first described by Van Gelder et al. (1990). Messenger RNA is converted to double stranded cDNA using an oligo(dT) primer attached to a T7 promoter. T7 polymerase is then used to transcribe complementary copies of the cDNA population. One round of amplification can amplify the starting amount of mRNA by 1000-fold. A second or a third round of amplification can further increase the yield (Baugh et al., 2001). A second approach takes advantage of PCR amplification. PCRbased approaches have several advantages over linear amplification, including speed, cost, and theoretically unlimited degree of amplification. While it has been assumed that exponential amplification may lead to a biased representation of the starting
160
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
sample, results from one group suggest that this may not be as problematic as previously suggested (Iscove et al., 2002). Standardization of Data Critical to the success of data analysis is standardization of microarray data and annotation of the genes that comprise the microarrays. In 1999, the Microarray Gene Expression Database (MGED) group was created to establish guidelines for the format of microarray data. This led to the development of the Minimal Information About a Microarray Experiment (MIAME) standard (Brazma et al., 2001; Stoeckert et al., 2002). Most major scientific publications require compliance with MIAME standards and submission of data to public repositories. An additional key factor is the functional annotation of the genes being studied. Biological “knowledge” is intricate, hierarchical, and interrelated. Ontologies are a formalized, standardized means of coding biological knowledge in which concepts are systematically described by both their meaning and their relationship to each other (Bard and Rhee, 2004). Adequate ontologies can thus serve as a common anchor point for systematic integration of data from various microarray sources. An example of commonly used ontology is one devised by the Gene Ontology Consortium (Harris et al., 2004).
DATA ANALYSIS Gene expression studies pose many challenges for data organization, storage, and analysis (see Chapter 17). In concert with the development of microarray technology has been the evolution of sophisticated statistical methods for analyzing and interpreting highly dense microarray information (Ermolaeva et al., 1998; Quackenbush, 2001). These data can be one’s own experimental data or existing data from other investigators’ experiments, made available through public databases. Microarray data may be so rich in information that important findings may not be recognized initially, and only better appreciated in retrospect when re-analyzed or combined with newer data. Data analysis methodology is discussed further in the section “Informatic and Computational Platforms for Genomic Medicine” in Part I of this book and has been reviewed also by Allison et al. (2006). Unsupervised Learning The computational analysis of gene expression data has largely centered on two approaches: unsupervised and supervised learning (see Figure 13.2). Unsupervised methods include hierarchical and k-means clustering (Gaasterland and Bekiranov, 2000). The goal of clustering is to organize genes and samples clusters based on relative similarity (D’Haeseleer, 2005; Eisen et al., 1998). A hierarchy of clusters can be created, which may be visualized as tree diagrams or dendrograms. Unsupervised approaches have the advantage of being unbiased and allow for the identification of structure in a complex dataset without making any a priori assumptions. In fact, many of the initial microarray studies on cancer samples utilized unsupervised classification, for example,
lymphoma (Alizadeh et al., 2000) and breast cancer (Perou et al., 2000). However, because many different relationships are possible in a complex dataset, the predominant structure uncovered by clustering may not be clearly related to the most clinically or biologically interesting aspect of a dataset. Supervised Learning In contrast, supervised learning starts off with a known class distinction and uses this information upfront to select genes whose expression is best associated with this class distinction in a “training dataset” (Golub et al., 1999). This model is then applied to an independent “test” dataset to validate the accuracy of selected gene expression features in classification. Numerous supervised learning algorithms have been applied to gene expression datasets, such as k-nearest neighbors, neural networks, or support vector machines (Brown et al., 2000). The accuracy of supervised learning approaches heavily depends on the quality of the initial training data. Mining Several tools have been developed for distilling higher-level information from microarray data in order to better appreciate the “forest” among the innumerable “trees” of expression data. Two notable examples include modular analysis and Cancer Outlier Profile Analysis (COPA). Modular analysis examines the coordinated behavior of a set of genes in order to detect significant changes in modules of genes, rather than the individual genes themselves. This offers the ability to extract more “signal” from a given microarray experiment. Indeed, groups of gene expression may be more biologically interpretable and statistically robust than deciphering lists of individual genes. Mootha et al. (2003) used this approach to study changes in gene expression in the muscle of patients with type 2 diabetes. Their group introduced the gene set enrichment analysis (GSEA) and used the Kolmogorov– Smirnov test statistic, a type of non-parametric test statistic that does not make assumptions about population distribution. While the decrease in expression for individual genes was modest, around 20%, it was consistently decreased across a set of genes involved in oxidative phosphorylation, accounting for 89% of 106 genes. Thus, significant changes could be seen when looking at a group of genes, even when the expression of individual genes was not significantly different. GSEA has emerged as a useful tool for identifying biological processes from microarray data and has been refined into a robust method (Subramanian et al., 2005). Similarly, Segal et al. (2004) created a cancer-focused compendium of expression profiles from 26 studies. They were able to extract 456 modules and found that some modules were unique to a particular type of tumor whereas other modules were shared across a range of conditions, raising the possibility that these modules reflect common tumor progression mechanisms. For example, in acute lymphoblastic leukemia (ALL), a growthinhibitor module was specifically repressed. On the other hand, a bone osteoblastic module was found across a range of tumor types, suggesting a common mechanism for bone metastasis.
Applications
■
161
Supervised learning
Unsupervised learning Dataset Known classes Class A Class B
Unknown classes
Training set
Train model Cluster samples
Independent test set (“Unknowns”)
Assign class labels Apply model
Class A
Class B
Class A
Class discovery
Class B
Class prediction
Figure 13.2 Unsupervised versus supervised learning. Unsupervised learning: multiple tumor samples are clustered into groups based on overall similarity of their gene expression profiles. This approach is useful for discovering previously unappreciated relationships. Supervised learning: multiple tumor samples from different known classes are used to train a model capable of classifying unknown samples. This model is then applied to a test set for class label assignment. Adapted from Ramaswamy and Golub (2002). (Reprinted with permission from the American Society of Clinical Oncology).
The COPA method focuses on genes characterized by marked overexpression (Tomlins et al., 2005). This strategy is used in order to efficiently process large amounts of data. COPA was applied to the Oncomine database, a compilation of 132 gene expression datasets representing 10,486 microarray experiments (Rhodes et al., 2004). This approach was instrumental for discovering two transcription factors with outlier profiles, ERG and ETV1, and subsequently shed light on recurrent gene fusions of the 5 untranslated region of TMPRS22 to ERG or ETV1 in prostate cancer tissues. This study was notable for being the first study to identify recurrent gene re-arrangements in an epithelial cancer and suggested a novel mechanism for prostate cancer pathogenesis.
APPLICATIONS DNA microarrays provide a panoramic, quantitative overview of a sample’s expression output. The power of this is obvious, as biological processes generally result from the coordinated interaction of multiple genes.
Research with microarrays in humans can be divided into four categories: 1. 2. 3. 4.
normal tissue taxonomy; disease diagnosis and classification; disease prognostication; and dissection of biological mechanisms.
Since the lion’s share of research to date has been in cancer biology, in part due to the ready availability of tissue samples procured during routine diagnostic procedures, the following examples will draw largely from this field’s literature. Normal Tissue Taxonomy Microarrays provide a new way of approaching the same fundamental question posed by Linnaeus in the 18th century: how is A different from and similar to B? An interesting example of previously unknown molecular heterogeneity was illustrated with human fibroblasts (Chang et al., 2002). Here, the authors demonstrated that the major factor responsible for differences in gene expression was the site of origin of the fibroblast. This
162
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
suggests that a detailed positional memory is encoded in a fibroblast’s expression profile, and that different fibroblast subsets might be functionally distinct. Prior to the advent of microarray technology, this geographic variation was not appreciated. In addition to topographic location, the effect of aging on fibroblasts can also be measured and classified by expression profiling (Ly et al., 2000). In this study, using an Affymetrix oligonucleotide microarray, the expression profiles of fibroblasts isolated from young, middle-age, and old-age individuals and individuals with progeria were compared. The authors found consistent differences in the expression of genes between these groups in two functional classes: genes involved in cell cycle progression and genes involved in maintenance and remodeling of the extracellular matrix. Disease Diagnosis and Classification There have been a multitude of studies using expression profiling for disease diagnosis and classification, most notably in cancer. One seminal study used oligonucleotide microarrays to examine the expression of 6817 human genes in 72 patients with acute leukemia (Golub et al., 1999). Using unsupervised learning, gene expression was able to independently cluster leukemia samples into the known subsets of acute myelogenous leukemia (AML) and ALL (see Figure 13.3). Then, using supervised learning, gene sets that were differentially expressed in AML and ALL were used to correctly classify a group of unknown samples into the correct categories. Significantly, many markers that were both known, such as myeloperoxidase and terminal transferase, and unknown were useful for making this distinction. Although the difference between AML and ALL is often not difficult using modern histopathology, this study provided the first evidence that tumor expression profiles can be used for cancer classification. Tumor cell lines can also be classified molecularly through expression profiling (Ross et al., 2000). The National Cancer Institute’s Developmental Therapeutics Program bank of 60 cancer cell lines (NCI60) was profiled using a spotted cDNA array of 9703 cDNAs. The gene expression patterns observed in these cancer cell lines corresponded to the tissue of origin, which is remarkable since these cell lines have been selected for survival in the host and later in tissue culture. This illustrates a theme that the tissue of origin plays a major role in determining the expression profile. Similarly, a molecular taxonomy of tumor samples from 14 different tumor classes was created using a supervised learning algorithm (Ramaswamy et al., 2001). Some of these samples were from metastatic sites, and interestingly, the majority of them could be categorized from the classification information derived from the primary site, again demonstrating the importance of the tissue of origin in disease. Breast cancer has been intensively profiled. Perou et al. (2000) reported a molecular classification system based on the expression patterns of 65 breast adenocarcinoma specimens from 42 individuals using a spotted cDNA array. Using an unsupervised method, these breast tumors were categorized into four different subtypes based on their patterns of gene expression.
One clinically important subtype was already known (Erb-B2 or Her2/neu overexpressing cancers), and three others were previously unknown: estrogen receptor-positive/luminal-like cancers, basal-like cancers, and normal breast. In addition to this classification system, the authors were able to show the importance of cellular lineage in determining the expression profile. This was illustrated by examining the variation in expression between primary tumors that were biopsied before and after a course of chemotherapy. They also looked at the profiles of two primary lymph node metastasis pairs. The authors found that the paired samples in the same patient were significantly more related to each other than to tumors from other patients, despite intervening chemotherapy or metastatic evolution. These initial observations were subsequently validated in a larger dataset (Sorlie et al., 2001, 2003). Importantly, these molecularly defined subtypes were clinically relevant, as patients with basal-like tumors had a significantly worse prognosis compared to patients with other breast cancer subtypes. Outside of the realm of cancer, cDNA microarray technology has proved useful for exploring a broad range of diseases such as inflammatory diseases like rheumatoid arthritis and Crohn’s disease (Heller et al., 1997), schizophrenia (Mirnics et al., 2000), multiple sclerosis (Lock et al., 2002), and differentiating between ischemic and non-ischemic cardiomyopathy (Kittleson et al., 2004). Microarrays have also been helpful for re-examining basic questions such as aging, as discussed previously (Ly et al., 2000). Disease Prognosis A challenge in medicine and especially in oncology is prediction– forecasting which patients will and will not benefit from therapy, particularly when the therapy, such as chemotherapy, has significant risks. DNA microarrays offer the possibility of radically transforming medicine from a traditional one-size-fits-all approach to individualized care. Intuitively, more information, when rich in content and appropriately analyzed, should lead to better prediction models. However, this has not always been borne out with various histopathologic tumor markers (e.g., breast and colon cancer), despite much initial excitement when these markers first became available (Bast et al., 2001). In contrast, there have been several microarray experiments involving different types of cancers that have found that the expression signature can serve as a robust biomarker, better than currently used criteria, for predicting prognosis. A common approach is to identify an expression profile that can subclassify a tumor type and then find correlations between this subclassification and prognosis. The first such application was described by Alizadeh et al. (2000) in diffuse large B cell lymphoma (DLBCL). These investigators constructed the Lymphochip, a specialized spotted cDNA microarray enriched for genes preferentially expressed in lymphoid cells and for genes with known roles in immunology or cancer. Using this microarray, hierarchical clustering revealed two molecularly distinct forms of DLBCL: germinal center B-like DLBCL and activated B-like DLBCL. These subtypes were not
Applications
ALL
■
163
AML C-myb (U22376) Proteasome iota (X59417) MB-1 (U05259) Cyclin D3 (M92287) Myosin light chain (M31211) RbAp48 (X74262) SNF2 (D26156) HkrT-1 (S50223) E2A (M31523) Inducible protein (L47738) Dynein light chain (U32944) Topoisomerase II β (Z15115) IRF2 (X15949) TFIIEβ (X63469) Acyl-Coenzyme A dehydrogenase (M91432) SNF2 (U29175) (Ca2)-ATPase (Z69881) SRP9 (U20998) MCM3 (D38073) Deoxyhypusine synthase (U26266) Op 18 (M31303) Rabaptin-5 (Y08612) Heterochromatin protein p25 (U35451) IL-7 receptor (M29696) Adenosine deaminase (M13792) Fumarylacetoacetate (M55150) Zyxin (X95735) LTC4 synthase (U50136) LYN (M16038) HoxA9 (U82759) CD33 (M23197) Adipsin (M84526) Leptin receptor (Y12670) Cystatin C (M27891) Proteoglycan 1 (X17042) IL-8 precursor (Y00787) Azurocidin (M96326) p62 (U46751) CyP3 (M80254) MCL1 (L08246) ATPase (M62762) IL-8 (M28130) Cathepsin D (M63138) Lectin (M57710) MAD-3 (M69043) CD11c (M81695) Ebp72 (X85116) Lysozyme (M19045) Properdin (M83652) Catalase (X04085)
3 Low
2.5
2
1.5
1
0.5
0
0.5
Normalized expression
1
1.5
2
2.5
3 High
Figure 13.3 Gene expression profiling in acute leukemia. The 50 genes most highly correlated with the distinction between ALL and AML are shown. Each row corresponds to a gene, and each column corresponds to expression levels in different samples. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates SDs above or below the mean. The top panel shows genes highly expressed in ALL; the bottom panel shows genes more highly expressed in AML. Adapted from Golub et al. (1999). (Reprinted with permission from AAAS).
164
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
(a)
Gene-expression profiling
(b)
1.0
1.0 Good signature
0.8
0.6
0.4
Poor signature p 0.001
0.2
Probability of remaining metastasis-free
Probability of remaining metastasis-free
St Gallen criteria
0.8 Low risk 0.6 High risk 0.4
p 0.05
0.2
0.0
0.0 0
2
4
6 8 Years
10
0
12
2
4
22
8 6 Years
10
12
No. at risk
No. at risk Good signature
60 57 54 45 31 22 12
Low risk
22
Poor signature
91 72 55 41 26 17 9
High risk
129 107 88 69 48 34 19
21 17
9
5
2
Figure 13.4 Kaplan–Meier analysis of the probability that patients would remain free of distant metastases among 151 patients with lymph node-negative breast cancer by gene expression profiling (a) and by St Gallen clinical criteria (Goldhirsch et al., 2001) (b). Adapted from Van de Vijver et al. (2002). (Copyright © 2002, Massachusetts Medical Society. All rights reserved).
appreciated using traditional histopathology and immunohistochemical techniques, yet nevertheless have clinical significance as patients with germinal center B-like DLBCL had significantly better overall survival when treated with standard chemotherapy. Indeed, as observed with other initial studies in cancer classification, cell lineage was an important determinant of the molecular phenotype. Ultimately, it remains to be seen whether or not molecular phenotyping will supplant traditional diagnostic tools, and this remains an ongoing area of research. Similar predictive power has been exemplified in microarray studies of breast cancer. Traditional factors for predicting a patient’s outcome, such as tumor size, lymph node status, and estrogen receptor status, are imperfect. Van’t Veer et al. (2002) used an oligonucleotide microarray to analyze primary tumors from 117 patients with early-stage breast cancer and no lymph node involvement, who received minimal treatment after surgery, to identify a 70-gene expression signature that strongly predicted metastasis. In a subsequent validation study on 295 patients (which included the original cohort of 117 patients), this signature outperformed all currently used clinical predictors and more accurately identified patients at risk for distant metastasis (van de Vijver et al., 2002) (see Figure 13.4). For example, the profile could assign more patients into a low-risk group for metastasis (40%) compared to traditional clinical criteria (15%). These initial findings were subsequently confirmed in an independent group of patients from five European centers (Buyse et al., 2006). Taken together, these results suggested that primary tumor gene expression profiles can be used to molecularly stratify cancer patients according to risk of progression and treatment outcome.
Molecular Biology Expression profiling has also been a new approach for discovering novel roles for both individual genes and biological processes. In addition, expression profiling also reveals the power of performing experiments in silico and mining data from previous experiments to integrate existing information with new insights. For example, Clark et al. (2000) used expression profiling to identify RhoC, a small GTPase, as a potential mediator of melanoma metastasis. Using a mouse model to select for metastatic variants of the human melanoma cell line A375, the investigators were able to compare the expression profiles of the parental cell line and the metastasis. RhoC was one of several genes, which was overexpressed in the metastatic sublines. Functional studies with RhoC confirmed its role in this model system. Lamb et al. (2003) also used integrative methods to describe a novel mechanism for the action of cyclin D1, a cell cycle regulator implicated in cancer progression. After first developing an expression signature for cyclin D1 action through retroviral transduction and gene expression profiling, they then used this profile to mine a compendium of human tumor gene expression data to demonstrate the in vivo relevance of this signature, subsequently leading to the identification of C/EBPb as novel downstream effector of cyclin D1. Cellular Biology Expression profiling has also been able to uncover intriguing similarities between cancer and certain physiological processes such as wound repair and the response to hypoxia. Leveraging on the similarities between wound healing and cancer (Dvorak,
Future Directions
1986), Chang et al. (2004) derived an expression signature of the fibroblast response to serum in tissue culture, noting that serum is encountered in vivo at sites of tissue injury. Using this signature, they identified corresponding expression signatures in a variety of cancers that were similar to active wounds. Furthermore, the molecular features of wound healing also predicted increased risk of metastasis and death in several cancers such as breast, lung, and stomach cancer. The authors validated their findings with microarray data from the Dutch cohort of breast cancer patients (Chang et al., 2005; van de Vijver et al., 2002). Overall survival and distant metastasis-free survival were markedly diminished in patients whose tumors expressed the wound-response signature compared to tumors that did not express this signature. In addition to the serum response, using a similar strategy, the gene expression response to hypoxia in cell culture was also identified in several human cancers, particularly breast and ovarian cancer (Chi et al., 2006). Interestingly, this phenomenon was independent of the wound-response signature. Signatures identified in vitro, that are associated with activation of specific oncogenic pathways, have also been used to identify similar patterns of pathway dysregulation in human tumors (Bild et al., 2006). Interestingly, these signatures might be clinically relevant, as they appear to correlate with survival and treatment outcome in patients with solid tumors. Expression profiling has also yielded provocative insights into the biology of cancer metastasis (Ramaswamy et al., 2003). By comparing expression profiles of various primary and metastatic tumors from a wide variety of sites, the authors discovered a 17-gene-expression signature that identified primary cancers with a greater likelihood of metastasis. Interestingly, many of these “metastasis signature” genes were also genes commonly expressed by cancer-associated fibroblasts and other stromal cells. These findings suggested for the first time that generic molecular determinants of metastasis might exist, and that these determinants might arise very early in tumorigenesis. This observation suggests a model for the biology of metastasis that contrasts with traditional dogma that molecular changes leading to metastasis arise only later in the natural history of a tumor.
LIMITATIONS AND CHALLENGES While microarray technology continues to mature, as with any technology, expression profiles must be interpreted in the context of broader biological knowledge. For example, one assumption often made is that gene expression correlates well with protein quantity and therefore protein activity. This view does not take into account additional variables such as mRNA stability, protein degradation, or post-translational modifications that ultimately determine protein activity. Another supposition is that mRNA expression levels correlate well with biological activity. In contrast, small differences in expression may have a dramatic biological effect. For example, the expression level of transcription factors is often quite low even in tissues where these factors play active roles. Finally, expression profiles are not fixed; they
■
165
may vary dramatically in different physiological contexts (King and Sinha, 2001). Validation with Non-microarray Technologies Gene expression data generated from microarray experiments can be credentialed using other methods of expression measurement such as northern blotting (Taniguchi et al., 2001), real-time PCR (Dallas et al., 2005), or other transcriptomic approaches described in Chapter 12. These studies and others demonstrate that genes found to be expressed with current generation microarrays can be consistently verified. Microarray data can also be verified at the protein expression level. A particularly useful high-throughput technology for validation using immunohistochemistry involves tissue microarrays (Hans et al., 2004). Reproducibility of Microarray Data An ongoing concern since the inception of the DNA microarray has been its technical reproducibility (Marshall, 2004). For a general review of these initial concerns, the interested reader is referred to a review by Chuaqui et al. (2002). While initial studies raised concerns about the reproducibility of microarray data (e.g., Tan et al. [2003]), subsequent studies have found that reproducibility is better than initially described (Bammler et al., 2005; Irizarry et al., 2005; Larkin et al., 2005). Recently, the MicroArray Quality Control (MAQC) project involved 137 participants from 51 academic, government, and commercial institutions to assess the performance of seven microarray platforms of two commercially available RNA samples (Shi et al., 2006). Reassuringly, the MAQC project showed high intra-platform consistency across test sites as well as a high level of interplatform concordance in terms of identifying differentially expressed genes in a common set of biological samples.
FUTURE DIRECTIONS Newer Platform Technologies Alternatives to conventional microarray platforms are emerging. One notable example is a bead-based platform developed by Illumina, the BeadChip. This uses randomly assembled arrays of 50-mer coated 3-m beads in order to create an extremely dense platform that is denser than spotted arrays or photolithographically made arrays (Gunderson et al., 2004). Whereas in conventional arrays, the sequence and location of each probe is known beforehand, the identity (and the corresponding probe sequence) of the beads is decoded after deposition. A DNA-based decoding algorithm is used to identify each bead. Compared to conventional platforms, the main advantage of this platform is lower per-sample cost. A different type of platform departing from conventional technology is the CombiMatrix platform. This platform synthesizes oligonucleotides in situ on a semiconductor chip using routine phosphoramidite chemistry under electrochemical control
166
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
(Liu et al., 2006). Each feature on the chip is digitally addressed and controlled. When activated, an electrochemical reaction generating acid occurs, ultimately leading to oligonucleotide synthesis in situ. The main advantage of this technology is flexibility. Custom microarrays can be created on demand with less labor, whereas technologies with “fixed” content such as the Affymetrix chip require a significant investment of resources to create a new chip (e.g., creation of a new photolithographic mask). Formalin-Fixed, Paraffin-Embedded Samples Traditionally, obtaining RNA from human tissues for expression profiling has required that specimens be snap-frozen in liquid nitrogen within a half hour of resection and stored at 80°C or colder to minimize RNA degradation. Samples need to be processed quickly because changes in some mRNA species have been noted even a few minutes after manipulation and devascularization of a tissue (Huang et al., 2001). However, clinical specimens are generally fixed in formalin and embedded in paraffin for routine histopathologic analysis, which causes significant RNA degradation. Only a fraction of the RNA from fixed specimens, roughly 3%, is intact enough for cDNA synthesis, posing a major barrier to obtaining expression profiles from clinical specimens (Godfrey et al., 2000; Masuda et al., 1999). A method to maneuver around this degradation obstacle is the cDNA-mediated, annealing, selection, extension, and ligation (DASL) assay. This assay is a technology that combines the advantages of microarray-based gene expression profiling with multiplexed quantitative PCR (Fan et al., 2004). An upstream and downstream pair of oligonucleotides is designed to amplify a cDNA of interest. Each pair of oligonucleotides incorporates a universal primer sequence for PCR as well as a specific address sequence that hybridizes to a capture sequence on an array. With this methodology, cancer-specific expression profiles were generated with as little as 50 ng of total RNA from formalin-fixed tissues stored over a decade (Bibikova et al., 2004). Approaches such as the DASL assay thus have the potential to capitalize on the vast supply of archived, paraffin-embedded samples from large clinical trials or population-based cohort studies to advance expression profiling efforts. Functional Genomics A goal of functional genomics is to assign biological meaning to genes on a genome-wide level (Steinmetz and Davis, 2004) (see Chapter 16). As illustrated by the examples described earlier, expression profiling has proved useful for assigning functional groups to genes in model organisms such as yeast (Hughes et al., 2000) and for uncovering a role for metastasis with RhoC (Clark et al., 2000). Expression profiling has also been used to decipher regulatory networks of transcription by discovering common sequence motifs in upstream regions of genes that have similar expression profiles (Pilpel et al., 2001). These studies, though, have been mainly been done in yeast and await application in more complex systems such as humans. A new opportunity in functional genomics is to integrate expression profiling with RNA interference (RNAi) technologies
now that genome-wide RNA interference libraries are coming to fruition (Moffat et al., 2006). One can imagine using RNAi libraries to systematically knock down individual genes and the comparing the resulting expression phenotypes. One can then look for similarities and differences between expression profiles as a starting point for inferring the function of the genes that have been ablated by RNAi. Profiling the Transcriptome Expression profiling has also been used to explore the transcriptome (see also Chapter 12). The transcriptome represents the complete set of all the transcribed elements of the genome in contrast to the collection of transcripts that are only translated into proteins.There is a burgeoning interest in the function of the noncoding part of the transcriptome as the number of actual genes transcribed into proteins has been substantially revised downward and increasing numbers of non-coding transcripts have been discovered. Key to their detection has been tiling microarrays, specifically those developed by Affymetrix with photolithographic technology. Rather than only using probes for known exons, tiling arrays probe across the entire genome in an unbiased fashion. Cheng et al. (2005) used a 25-mer oligonucleotide-based array spaced every 5 bp to examine ten human chromosomes. They discovered that interestingly, of all the transcribed sequences, 19% were polyadenylated, 44% were not polyadenylated, and 37% were both polyadenylated and not polyadenylated. Additionally, about 10% of the genome was transcribed as polyadenylated transcripts, compared to 1–2% of the genome corresponding to exons. These findings suggest that the expressed genome may be orders of magnitude larger than previously appreciated. Part of the transcriptome also includes small (roughly 22 nucleotide) non-coding microRNAs (miRNAs). These miRNAs play an important role in regulating gene expression by either degrading mRNA or inhibiting its translation (Bartel, 2004). In cancer, for example, Lu et al. (2005) used a new bead-based profiling system to quantify miRNA expression because of the short size of the miRNAs.This used oligonucleotide capture probes that were coupled to beads, which in turn were differentially impregnated with variable degrees of a fluorescent mixture. Flow cytometry was then used to quantitate the beads and therefore the associated miRNA. Expression profiling of the miRNA population of various malignant and normal tissues revealed an unusually large degree of diversity of miRNA expression across cancers. Furthermore, miRNA signatures could be used to classify poorly differentiated cancers, which have traditionally been difficult to classify using standard mRNA-based profiling strategies. The interested reader is referred to a recent review by Calin and Croce (2006).
Clinical Translation Expression Profiling as a Biomarker The ability of expression signatures of tumor samples to predict prognosis has immediate potential clinical translation. A key lesson from these early clinical microarray studies is that appropriate sample size and careful experimental design are necessary
Future Directions
for identifying signatures that can serve as robust biomarkers (Michiels et al., 2005; Ntzani and Ioannidis, 2003; Simon et al., 2003). However, reassuringly, a recent study of prognostic breast cancer signatures showed that four out of five reported gene signatures had significant concordance in predicting breast cancer outcome, indicating that these signatures were tracking similar biological behaviors (Fan et al., 2006), pointing to the evolving sophistication of clinically oriented profiling efforts. The clinical application of expression profiles is most mature in breast cancer, owing to the extensive microarray data already available, and has recently been reviewed (Sotiriou and Piccart, 2007). The 70-gene-expression signature originally formulated by Van’t Veer et al. (2002) has been commercialized into the MammaPrint assay for clinical use. This assay serves as the basis of the MINDACT (Microarray In Node negative Disease may Avoid ChemoTherapy) clinical trial, which aims to determine whether the expression signature can be used for making clinical decisions (Novak, 2006). It will test the question of whether or not patients with tumors that have high risk features by clinical criteria – individuals who are routinely recommended adjuvant chemotherapy – but with a low-risk molecular prognosis as determined by MammaPrint, can defer chemotherapy without affecting metastasis-free survival. In other words, it will test if the molecular prognosis generated by MammaPrint can outperform clinical criteria. An alternative to using existing signatures is to adapt the expression profile into a RT-PCR-based assay. Compared to DNA microarray technology, RT-PCR has the potential to be less technically challenging, more cost-effective, and more easily integrated into routine clinical practice. Genomic Health International has developed the Oncotype assay which is currently available for clinical use (Paik et al., 2004). The Oncotype assay is an RTPCR assay based on a 21-gene signature (including five control genes) derived from 250 candidate genes selected from the published microarray literature. Rather than using fresh frozen tissues, the researchers were able to exploit archived, paraffinembedded tissues using RT-PCR as their profiling method (Cronin et al., 2004). They systematically quantified the expression of these 250 candidate genes across three historical groups of breast cancer patients. From this data, they derived a gene expression-based recurrence score for predicting breast cancer recurrence that was more powerful than standard histopathologic criteria. The reproducibility of the score was robust and was subsequently validated in a large, population-based study (Habel et al., 2006). Furthermore, Oncotype had the potential to identify patients who may be able to defer adjuvant chemotherapy treatment. Patients with low and intermediate recurrence score tumors did not appear to derive a significant benefit from chemotherapy (Paik et al., 2006). An advantage of using this RT-PCR-based approach is that it can be applied to tumor samples that have been collected and processed in routine fashion; no special handling (e.g., freezing tissues in liquid nitrogen) is required. Other examples of adapting expression profiling to an RT-PCR assay include a two-gene breast cancer signature (Ma et al., 2003) and a six-gene signature predicting survival in DLBCL (Lossos et al., 2004).
■
167
In the future, one of the challenges for translational research will be the integration of expression profiling with routine histopathology in clinical practice. As can be seen with the 21gene breast cancer signature, expression profiling in this application offers the advantage of reproducibility and minimizes human error and variability with pathological interpretation. However, cost continues to be a major limitation to widespread adoption of profiling technologies. An ongoing prospective study, similar in concept to MINDACT, is the TAILORx (Trial Assigning Individualized Options for Treatment) trial in North America, which will help answer the question of how to integrate microarray technology (i.e., Oncotype DX) into clinical practice for patients with breast cancer. TAILORx is examining whether chemotherapy is required for the intermediate-risk group defined by Oncotype’s recurrence score (Paik, 2007). Drug Development DNA microarrays have also facilitated drug target discovery and prediction of drug response. For general reviews, the reader is referred to a recent one by Stoughton and Friend (2005). Using the NCI60 cancer cell line panel, Scherf et al. (2000) constructed a database that correlated gene expression and drug activity patterns. They were able to find relationships between gene expression and drugs such as 5-fluorouracil and l-asparaginase. Gunther et al. (2003) employed a similar strategy with primary human neuronal precursor cells and various psychotropic drugs. Potti et al. (2006) have expanded on this approach by using the expression profiles of sensitive and resistant cancer cell lines for six different chemotherapy drugs to predict response to neoadjuvant chemotherapy for breast cancer. Stegmaier et al. (2004) have developed a high-throughput approach to drug discovery using expression profiling: gene expression-based high-throughput screening (GE-HTS). These authors were interested in compounds that could terminally differentiate AML cells. To discover these compounds, they used 5-gene-expression signature of the differentiated neutrophil and leukemic states as a surrogate. With an RT-PCR-based readout of these five genes, they then screened a small molecule library to identify drugs that could induce differentiation of leukemic blasts and discovered an epidermal growth factor receptor kinase inhibitor with this property (Stegmaier et al., 2005). Remarkably, GE-HTS discovered this potential therapeutic without prior detailed biological knowledge. A related approach to screening small molecules has been implemented through The Connectivity Map, an in silico tool recently developed at the Broad Institute that uses expression profiling to systematically discover functional connections among diseases, genetic perturbations, and drug actions (Lamb et al., 2006). Lamb et al. first created a reference collection of gene expression profiles from cultured human cancer cells treated with small molecules. They then developed pattern-matching software to mine these data. The Connectivity Map has been used to perform in silico discoveries of rapamycin as a new approach for treating glucocorticoid-resistant ALL (Wei et al., 2006) and the inhibition of HSP90 as a novel mechanism for inhibition of androgen receptor signaling in prostate cancer (Hieronymus et al., 2006).
168
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
CONCLUSIONS Over the past decade, expression profiling has matured from a revolutionary new technology to a routine, fundamental component of scientific inquiry. Its ability to globally quantitate the expression of thousands of genes at once has provided a new lens for classifying diseases, predicting outcomes, and discovering new drug targets. Commensurate with the technical achievements
have been advances in statistical analysis and interpretation of the microarray data being generated. In the future, one can imagine additional insights through integration of expression profiling with other “–omic” approaches. Ultimately, the real benefits of DNA microarrays will be realized only through translation into better treatment strategies and therapeutics, and the results of these efforts are eagerly anticipated.
REFERENCES Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511. Allison, D.B., Cui, X., Page, G.P. and Sabripour, M. (2006). Microarray data analysis: From disarray to consolidation and consensus. Nat Rev Genet 7, 55–65. Alwine, J.C., Kemp, D.J. and Stark, G.R. (1977). Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc Natl Acad Sci USA 74, 5350–5354. Bammler, T., Beyer, R.P., Bhattacharya, S., Boorman, G.A., Boyles, A., Bradford, B.U., Bumgarner, R.E., Bushel, P.R., Chaturvedi, K., Choi, D. et al. (2005). Standardizing global gene expression analysis between laboratories and across platforms. Nat Methods 2, 351–356. Bard, J.B. and Rhee, S.Y. (2004). Ontologies in biology: Design, applications and future challenges. Nat Rev Genet 5, 213–222. Bartel, D.P. (2004). MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell 116, 281–297. Bast, R.C., Jr., Ravdin, P., Hayes, D.F., Bates, S., Fritsche, H., Jr., Jessup, J.M., Kemeny, N., Locker, G.Y., Mennel, R.G. and Somerfield, M.R. (2001). 2000 update of recommendations for the use of tumor markers in breast and colorectal cancer: Clinical practice guidelines of the American Society of Clinical Oncology. J Clin Oncol 19, 1865–1878. Baugh, L.R., Hill, A.A., Brown, E.L. and Hunter, C.P. (2001). Quantitative analysis of mRNA amplification by in vitro transcription. Nucleic Acids Res 29, E29. Bibikova, M.,Talantov, D., Chudin, E.,Yeakley, J.M., Chen, J., Doucet, D., Wickham, E., Atkins, D., Barker, D., Chee, M. et al. (2004). Quantitative gene expression profiling in formalin-fixed, paraffin-embedded tissues using universal bead arrays. Am J Pathol 165, 1799–1807. Bild, A.H., Yao, G., Chang, J.T., Wang, Q., Potti, A., Chasse, D., Joshi, M.B., Harpole, D., Lancaster, J.M., Berchuck, A. et al. (2006). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439, 353–357. Bowtell, D.D. (1999). Options available–from start to finish–for obtaining expression data by microarray. Nat Genet 21, 25–32. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C. et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29, 365–371. Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M., Jr. and Haussler, D. (2000). Knowledge-based
analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97, 262–267. Buyse, M., Loi, S., van’t Veer, L., Viale, G., Delorenzi, M., Glas, A.M., d’Assignies, M.S., Bergh, J., Lidereau, R., Ellis, P. et al. (2006). Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst 98, 1183–1192. Calin, G.A. and Croce, C.M. (2006). MicroRNA signatures in human cancers. Nat Rev Cancer 6, 857–866. Chang, H.Y., Chi, J.T., Dudoit, S., Bondre, C.,van de Rijn, M., Botstein, D. and Brown, P.O. (2002). Diversity, topographic differentiation, and positional memory in human fibroblasts. Proc Natl Acad Sci USA 99, 12877–12882. Chang, H.Y., Nuyten, D.S., Sneddon, J.B., Hastie, T., Tibshirani, R., Sorlie, T., Dai, H., He, Y.D., van’t Veer, L.J., Bartelink, H. et al. (2005). Robustness, scalability, and integration of a woundresponse gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA 102, 3738–3743. Chang, H.Y., Sneddon, J.B., Alizadeh, A.A., Sood, R., West, R.B., Montgomery, K., Chi, J.T., van de Rijn, M., Botstein, D. and Brown, P.O. (2004). Gene expression signature of fibroblast serum response predicts human cancer progression: Similarities between tumors and wounds. PLoS Biol 2, E7. Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G. et al. (2005). Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154. Cheung, V.G., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R. and Childs, G. (1999). Making and reading microarrays. Nat Genet 21, 15–19. Chi, J.T., Wang, Z., Nuyten, D.S., Rodriguez, E.H., Schaner, M.E., Salim,A.,Wang,Y., Kristensen, G.B., Helland,A., Borresen-Dale,A.L. et al. (2006). Gene expression programs in response to hypoxia: Cell type specificity and prognostic significance in human cancers. PLoS Med 3, e47. Chuaqui, R.F., Bonner, R.F., Best, C.J., Gillespie, J.W., Flaig, M.J., Hewitt, S.M., Phillips, J.L., Krizman, D.B., Tangrea, M.A., Ahram, M. et al. (2002). Post-analysis follow-up and validation of microarray experiments. Nat Genet 32(Suppl), 509–514. Churchill, G.A. (2002). Fundamentals of experimental design for cDNA microarrays. Nat Genet 32(Suppl), 490–495. Clark, E.A., Golub, T.R., Lander, E.S. and Hynes, R.O. (2000). Genomic analysis of metastasis reveals an essential role for RhoC. Nature 406, 532–535. Cronin, M., Pho, M., Dutta, D., Stephans, J.C., Shak, S., Kiefer, M.C., Esteban, J.M. and Baker, J.B. (2004). Measurement of gene expression in archival paraffin-embedded tissues: Development and
References
performance of a 92-gene reverse transcriptase-polymerase chain reaction assay. Am J Pathol 164, 35–42. D’Haeseleer, P. (2005). How does gene expression clustering work?. Nat Biotechnol 23, 1499–1501. Dallas, P.B., Gottardo, N.G., Firth, M.J., Beesley, A.H., Hoffmann, K., Terry, P.A., Freitas, J.R., Boag, J.M., Cummings, A.J. and Kees, U.R. (2005). Gene expression levels assessed by oligonucleotide microarray analysis and quantitative real-time RT-PCR – how well do they correlate?. BMC Genomics 6, 59. Dudley, A.M., Aach, J., Steffen, M.A. and Church, G.M. (2002). Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci USA 99, 7554–7559. Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P. and Trent, J.M. (1999). Expression profiling using cDNA microarrays. Nat Genet 21, 10–14. Dvorak, H.F. (1986). Tumors: Wounds that do not heal. Similarities between tumor stroma generation and wound healing. N Engl J Med 315, 1650–1659. Ebert, B.L. and Golub, T.R. (2004). Genomic approaches to hematologic malignancies. Blood 104, 923–932. Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–14868. Emmert-Buck, M.R., Bonner, R.F., Smith, P.D., Chuaqui, R.F., Zhuang, Z., Goldstein, S.R., Weiss, R.A. and Liotta, L.A. (1996). Laser capture microdissection. Science 274, 998–1001. Emmert-Buck, M.R., Roth, M.J., Zhuang, Z., Campo, E., Rozhin, J., Sloane, B.F., Liotta, L.A. and Stetler-Stevenson, W.G. (1994). Increased gelatinase A (MMP-2) and cathepsin B activity in invasive tumor regions of human colon cancer samples. Am J Pathol 145, 1285–1290. Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L., Chen, Y., Simon, R., Meltzer, P., Trent, J.M. and Boguski, M.S. (1998). Data management and analysis for gene expression arrays. Nat Genet 20, 19–23. Fan, C., Oh, D.S., Wessels, L., Weigelt, B., Nuyten, D.S., Nobel, A.B., van’t Veer, L.J. and Perou, C.M. (2006). Concordance among geneexpression-based predictors for breast cancer. N Engl J Med 355, 560–569. Fan, J.B.,Yeakley, J.M., Bibikova, M., Chudin, E., Wickham, E., Chen, J., Doucet, D., Rigault, P., Zhang, B., Shen, R. et al. (2004). A versatile assay for high-throughput gene expression profiling on universal array matrices. Genome Res 14, 878–885. Fodor, S.P., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T. and Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773. Gaasterland, T. and Bekiranov, S. (2000). Making the most of microarray data. Nat Genet 24, 204–206. Godfrey, T.E., Kim, S.H., Chavira, M., Ruff, D.W., Warren, R.S., Gray, J.W. and Jensen, R.H. (2000). Quantitative mRNA expression analysis from formalin-fixed, paraffin-embedded tissues using 5 nuclease quantitative reverse transcription-polymerase chain reaction. J Mol Diagn 2, 84–91. Goldhirsch, A., Glick, J.H., Gelber, R.D., Coates, A.S. and Senn, H.J. (2001). Meeting highlights: International Consensus Panel on the Treatment of Primary Breast Cancer. Seventh International Conference on Adjuvant Therapy of Primary Breast Cancer. J Clin Oncol 19, 3817–3827.
■
169
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537. Gunderson, K.L., Kruglyak, S., Graige, M.S., Garcia, F., Kermani, B.G., Zhao, C., Che, D., Dickinson, T., Wickham, E., Bierle, J. et al. (2004). Decoding randomly ordered DNA arrays. Genome Res 14, 870–877. Gunther, E.C., Stone, D.J., Gerwien, R.W., Bento, P. and Heyes, M.P. (2003). Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc Natl Acad Sci USA 100, 9608–9613. Guo, Z., Guilfoyle, R.A., Thiel, A.J., Wang, R. and Smith, L.M. (1994). Direct fluorescence analysis of genetic polymorphisms by hybridization with oligonucleotide arrays on glass supports. Nucleic Acids Res 22, 5456–5465. Habel, L.A., Shak, S., Jacobs, M.K., Capra, A., Alexander, C., Pho, M., Baker, J., Walker, M., Watson, D., Hackett, J. et al. (2006). A population-based study of tumor gene expression and risk of breast cancer death among lymph node-negative patients. Breast Cancer Res 8, R25. Hans, C.P., Weisenburger, D.D., Greiner, T.C., Gascoyne, R.D., Delabie, J., Ott, G., Muller-Hermelink, H.K., Campo, E., Braziel, R.M., Jaffe, E.S. et al. (2004). Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood 103, 275–282. Hardiman, G. (2004). Microarray platforms–comparisons and contrasts. Pharmacogenomics 5, 487–502. Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C. et al. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32, D258–D261. Heller, R.A., Schena, M., Chai, A., Shalon, D., Bedilion, T., Gilmore, J., Woolley, D.E. and Davis, R.W. (1997). Discovery and analysis of inflammatory disease-related genes using cDNA microarrays. Proc Natl Acad Sci USA 94, 2150–2155. Hieronymus, H., Lamb, J., Ross, K.N., Peng, X.P., Clement, C., Rodina, A., Nieto, M., Du, J., Stegmaier, K., Raj, S.M. et al. (2006). Gene expression signature-based chemical genomic prediction identifies a novel class of HSP90 pathway modulators. Cancer Cell 10, 321–330. Hoheisel, J.D. (2006). Microarray technology: Beyond transcript profiling and genotype analysis. Nat Rev Genet 7, 200–210. Holloway, A.J., van Laar, R.K., Tothill, R.W. and Bowtell, D.D. (2002). Options available–from start to finish–for obtaining data from DNA microarrays II. Nat Genet 32(Suppl), 481–489. Huang, J., Qi, R., Quackenbush, J., Dauway, E., Lazaridis, E. and Yeatman, T. (2001). Effects of ischemia on gene expression. J Surg Res 99, 222–227. Hughes, T.R., Mao, M., Jones, A.R., Burchard, J., Marton, M.J., Shannon, K.W., Lefkowitz, S.M., Ziman, M., Schelter, J.M., Meyer, M.R. et al. (2001). Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol 19, 342–347. Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D. et al. (2000). Functional discovery via a compendium of expression profiles. Cell 102, 109–126.
170
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
Irizarry, R.A., Warren, D., Spencer, F., Kim, I.F., Biswal, S., Frank, B.C., Gabrielson, E., Garcia, J.G., Geoghegan, J., Germino, G. et al. (2005). Multiple-laboratory comparison of microarray platforms. Nat Methods 2, 345–350. Iscove, N.N., Barbara, M., Gu, M., Gibson, M., Modi, C. and Winegarden, N. (2002). Representation is faithfully preserved in global cDNA amplified exponentially from sub-picogram quantities of mRNA. Nat Biotechnol 20, 940–943. Ishii, M., Hashimoto, S.,Tsutsumi, S.,Wada,Y., Matsushima, K., Kodama,T. and Aburatani, H. (2000). Direct comparison of GeneChip and SAGE on the quantitative accuracy in transcript profiling analysis. Genomics 68, 136–143. King, H.C. and Sinha, A.A. (2001). Gene expression profile analysis by DNA microarrays: Promise and pitfalls. JAMA 286, 2280–2288. Kittleson, M.M., Ye, S.Q., Irizarry, R.A., Minhas, K.M., Edness, G., Conte, J.V., Parmigiani, G., Miller, L.W., Chen, Y., Hall, J.L. et al. (2004). Identification of a gene expression profile that differentiates between ischemic and nonischemic cardiomyopathy. Circulation 110, 3444–3451. Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P., Subramanian, A., Ross, K.N. et al. (2006). The Connectivity Map: Using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935. Lamb, J., Ramaswamy, S., Ford, H.L., Contreras, B., Martinez, R.V., Kittrell, F.S., Zahnow, C.A., Patterson, N., Golub, T.R. and Ewen, M.E. (2003).A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell 114, 323–334. Larkin, J.E., Frank, B.C., Gavras, H., Sultana, R. and Quackenbush, J. (2005). Independence and reproducibility across microarray platforms. Nat Methods 2, 337–344. Liang, P. and Pardee, A.B. (1992). Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257, 967–971. Liang, P. and Pardee, A.B. (2003). Analysing differential gene expression in cancer. Nat Rev Cancer 3, 869–876. Lin, D.W., Coleman, I.M., Hawley, S., Dumpit, R., Gifford, D., Kezele, P., Hung, H., Knudsen, B.S., Kristal, A.R. and Nelson, P.S. (2006). Influence of surgical manipulation on prostate gene expression: Implications for molecular correlates of treatment effects and disease prognosis. J Clin Oncol 24, 3763–3770. Lipshutz, R.J., Fodor, S.P., Gingeras, T.R. and Lockhart, D.J. (1999). High density synthetic oligonucleotide arrays. Nat Genet 21, 20–24. Liu, R.H., Nguyen, T., Schwarzkopf , K., Fuji, H.S., Petrova, A., Siuda, T., Peyvan, K., Bizak, M., Danley, D. and McShea, A. (2006). Fully integrated miniature device for automated gene expression DNA microarray processing. Anal Chem 78, 1980–1986. Lock, C., Hermans, G., Pedotti, R., Brendolan, A., Schadt, E., Garren, H., Langer-Gould, A., Strober, S., Cannella, B., Allard, J. et al. (2002). Gene-microarray analysis of multiple sclerosis lesions yields new targets validated in autoimmune encephalomyelitis. Nat Med 8, 500–508. Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. et al. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14, 1675–1680. Lossos, I.S., Czerwinski, D.K., Alizadeh, A.A., Wechser, M.A., Tibshirani, R., Botstein, D. and Levy, R. (2004). Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med 350, 1828–1837.
Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero, A., Ebert, B.L., Mak, R.H., Ferrando, A.A. et al. (2005). MicroRNA expression profiles classify human cancers. Nature 435, 834–838. Ly, D.H., Lockhart, D.J., Lerner, R.A. and Schultz, P.G. (2000). Mitotic misregulation and human aging. Science 287, 2486–2492. Ma, X.J., Salunga, R., Tuggle, J.T., Gaudet, J., Enright, E., McQuary, P., Payette, T., Pistone, M., Stecker, K., Zhang, B.M. et al. (2003). Gene expression profiles of human breast cancer progression. Proc Natl Acad Sci USA 100, 5974–5979. Marshall, E. (2004). Getting the noise out of gene arrays. Science 306, 630–631. Masuda, N., Ohnishi, T., Kawamoto, S., Monden, M. and Okubo, K. (1999). Analysis of chemical modification of RNA from formalinfixed samples and optimization of molecular biology applications for such samples. Nucleic Acids Res 27, 4436–4443. Michiels, S., Koscielny, S. and Hill, C. (2005). Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 365, 488–492. Mirnics, K., Middleton, F.A., Marquez, A., Lewis, D.A. and Levitt, P. (2000). Molecular characterization of schizophrenia viewed by microarray analysis of gene expression in prefrontal cortex. Neuron 28, 53–67. Moffat, J., Grueneberg, D.A., Yang, X., Kim, S.Y., Kloepfer, A.M., Hinkle, G., Piqani, B., Eisenhaure, T.M., Luo, B., Grenier, J.K. et al. (2006). A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell 124, 1283–1298. Mootha,V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E. et al. (2003). PGC-1-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34, 267–273. Novak, K. (2006). Where the chips fall. Nat Med 12, 158–159. Ntzani, E.E. and Ioannidis, J.P. (2003). Predictive ability of DNA microarrays for cancer outcomes and correlates: An empirical assessment. Lancet 362, 1439–1444. Paik, S. (2007). Development and clinical utility of a 21-gene recurrence score prognostic assay in patients with early breast cancer treated with tamoxifen. Oncologist 12, 631–635. Paik, S., Shak, S., Tang, G., Kim, C., Baker, J., Cronin, M., Baehner, F.L., Walker, M.G., Watson, D., Park, T. et al. (2004). A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 351, 2817–2826. Paik, S., Tang, G., Shak, S., Kim, C., Baker, J., Kim, W., Cronin, M., Baehner, F.L., Watson, D., Bryant, J. et al. (2006). Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol 24, 3726–3734. Pease, A.C., Solas, D., Sullivan, E.J., Cronin, M.T., Holmes, C.P. and Fodor, S.P. (1994). Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci USA 91, 5022–5026. Perou, C.M., Sorlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A. et al. (2000). Molecular portraits of human breast tumours. Nature 406, 747–752. Pilpel, Y., Sudarsanam, P. and Church, G.M. (2001). Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29, 153–159. Potti, A., Dressman, H.K., Bild, A., Riedel, R.F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M.J., Petersen, R. et al. (2006).
References
Genomic signatures to guide the use of chemotherapeutics. Nat Med 12, 1294–1300. Quackenbush, J. (2001). Computational analysis of microarray data. Nat Rev Genet 2, 418–427. Ramaswamy, S. and Golub, T.R. (2002). DNA microarrays in clinical oncology. J Clin Oncol 20, 1932–1941. Ramaswamy, S., Ross, K.N., Lander, E.S. and Golub, T.R. (2003). A molecular signature of metastasis in primary solid tumors. Nat Genet 33, 49–54. Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P. et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98, 15149–15154. Rhodes, D.R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., Barrette, T., Pandey, A. and Chinnaiyan, A.M. (2004). ONCOMINE: A cancer microarray database and integrated datamining platform. Neoplasia 6, 1–6. Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M. et al. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24, 227–235. Sargent, T.D. (1987). Isolation of differentially expressed genes. Methods Enzymol 152, 423–432. Schena, M., Heller, R.A., Theriault, T.P., Konrad, K., Lachenmeier, E. and Davis, R.W. (1998). Microarrays: Biotechnology’s discovery platform for functional genomics. Trends Biotechnol 16, 301–306. Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470. Scherf , U., Ross, D.T., Waltham, M., Smith, L.H., Lee, J.K., Tanabe, L., Kohn, K.W., Reinhold, W.C., Myers, T.G., Andrews, D.T. et al. (2000). A gene expression database for the molecular pharmacology of cancer. Nat Genet 24, 236–244. Schulze, A. and Downward, J. (2001). Navigating gene expression using microarrays – A technology review. Nat Cell Biol 3, E190–E195. Segal, E., Friedman, N., Koller, D. and Regev, A. (2004). A module map showing conditional activity of expression modules in cancer. Nat Genet 36, 1090–1098. Sgroi, D.C., Teng, S., Robinson, G., LeVangie, R., Hudson, J.R., Jr and Elkahloun, A.G. (1999). In vivo gene expression profile analysis of human breast cancer progression. Cancer Res 59, 5656–5661. Shi, L., Reid, L.H., Jones, W.D., Shippy, R., Warrington, J.A., Baker, S.C., Collins, P.J., de Longueville, F., Kawasaki, E.S., Lee, K.Y. et al. (2006). The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24, 1151–1161. Simon, R., Radmacher, M.D., Dobbin, K. and McShane, L.M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95, 14–18. Sorlie, T., Perou, C.M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S. et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 98, 10869–10874. Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J.S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S. et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100, 8418–8423.
■
171
Sotiriou, C. and Piccart, M.J. (2007). Taking gene-expression profiling to the clinic: When will molecular signatures become relevant to patient care?. Nat Rev Cancer 7, 545–553. Southern, E., Mir, K. and Shchepinov, M. (1999). Molecular interactions on microarrays. Nat Genet 21, 5–9. Stegmaier, K., Corsello, S.M., Ross, K.N., Wong, J.S., Deangelo, D.J. and Golub, T.R. (2005). Gefitinib induces myeloid differentiation of acute myeloid leukemia. Blood 106, 2841–2848. Stegmaier, K., Ross, K.N., Colavito, S.A., O’Malley, S., Stockwell, B.R. and Golub, T.R. (2004). Gene expression-based high-throughput screening (GE-HTS) and application to leukemia differentiation. Nat Genet 36, 257–263. Steinmetz, L.M. and Davis, R.W. (2004). Maximizing the potential of functional genomics. Nat Rev Genet 5, 190–201. Stoeckert, C.J., Causton, H.C. and Ball, C.A. (2002). Microarray databases: Standards and ontologies. Nat Genet 32(Suppl), 469–473. Stoughton, R.B. and Friend, S.H. (2005). How molecular profiling could revolutionize drug discovery. Nat Rev Drug Discov 4, 345–350. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich,A., Pomeroy, S.L., Golub,T.R., Lander, E.S. et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102, 15545–15550. Tan, P.K., Downey,T.J., Spitznagel, E.L., Jr., Xu, P., Fu, D., Dimitrov, D.S., Lempicki, R.A., Raaka, B.M. and Cam, M.C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 31, 5676–5684. Taniguchi, M., Miura, K., Iwao, H. and Yamanaka, S. (2001). Quantitative assessment of DNA microarrays–comparison with Northern blot analyses. Genomics 71, 34–39. Tomlins, S.A., Rhodes, D.R., Perner, S., Dhanasekaran, S.M., Mehra, R., Sun, X.W., Varambally, S., Cao, X., Tchinda, J., Kuefer, R. et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, 644–648. van’t Veer, L.J., Dai, H., van de Vijver, M.J., He,Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T. et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536. van de Vijver, M.J., He, Y.D., van’t Veer, L.J., Dai, H., Hart, A.A., Voskuil, D.W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J. et al. (2002). A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347, 1999–2009. Van Gelder, R.N., von Zastrow, M.E., Yool, A., Dement, W.C., Barchas, J.D. and Eberwine, J.H. (1990). Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proc Natl Acad Sci USA 87, 1663–1667. Velculescu, V.E., Zhang, L., Vogelstein, B. and Kinzler, K.W. (1995). Serial analysis of gene expression. Science 270, 484–487. Wei, G., Twomey, D., Lamb, J., Schlis, K., Agarwal, J., Stam, R.W., Opferman, J.T., Sallan, S.E., den Boer, M.L., Pieters, R. et al. (2006). Gene expression-based chemical genomics identifies rapamycin as a modulator of MCL1 and glucocorticoid resistance. Cancer Cell 10, 331–342. Zimmermann, C.R., Orr, W.C., Leclerc, R.F., Barnard, E.C. and Timberlake, W.E. (1980). Molecular cloning and selection of genes regulated in Aspergillus development. Cell 21, 709–715.
172
CHAPTER 13
■
DNA Microarrays in Biological Discovery and Patient Care
RECOMMENDED RESOURCES Resources for expression profiling experiments Web site
URL
Resources
National Human Genome Research Institute
http://research.nhgri.nih.gov/ microarray/main.html
Protocols, web links
Broad Institute, Cancer Genomics Group
http://www.broad.mit.edu/cancer/
Gene expression analysis software, web links
Stanford Genomics
http://genome-www.stanford.edu/
Software, web links, cDNA microarray protocols
Jackson Laboratory, Statistical Genomics Group
http://www.jax. org/staff/Churchill/labsite/
Advice about the design of microarray experiments
Microarray Gene Expression Data Society (MGED)
http://www.mged.org/ Workgroups/MIAME/miame_ checklist.html
Minimal requirements for the publication of microarray data
Gene Expression Omnibus database
http://www.ncbi.nlm.nih.gov/geo/
Repository of gene expression data
Array Express database
http://www.ebi.ac.uk/arrayexpress/
Repository of gene expression data
Affymetrix
http://www.affymetrix.com
Commercial website with useful resources
This listing is adapted from Ebert and Golub (2004).
CHAPTER
14 Proteomics: The Deciphering of the Functional Genome Li-Rong Yu, Nicolas A. Stewart and Timothy D. Veenstra
INTRODUCTION The continuing development of novel analytical technologies and instrumentation has brought a dramatic new dimension to how biological research can be pursued. This new dimension has enabled discovery-driven research to be conducted at par with hypothesis-driven research. In a hypothesis-driven approach, the understanding of a particular biological entity – whether it be a cell, gene, or protein – is used to develop studies that are designed to answer a specific question about a single gene, transcript, or protein, for example. Hypothesis-driven studies utilize technologies that focus on one specific entity per experiment. Discoverydriven studies, however, are designed around broad questions and focus on global characteristics of a cell or organism. These studies utilize advanced technologies that are able to gather information on thousands of different genes, transcripts, proteins, and metabolites in a relatively rapid and comprehensive fashion. While these globally directed studies have seemingly been thrust upon the scientific community overnight, in reality it has taken decades of technological and instrumental development to bring this field of science to fruition. The first giant step in discovery-driven science is the completion of the Human Genome Project. This monumental effort could only have been accomplished with the many developments in cloning, amplification, and high-throughput gene sequencing (Yager et al., 1991). The analysis of thousands of gene transcripts using mRNA arrays has been made possible by the ability to synthesize mRNA probes
and coupling them to solid surfaces (Gilham, 1970). Shortly after the development of these genome – and transcriptome-scanning technologies, many scientists shifted their focus onto proteins. Proteomics – the analysis of the entire protein complement of a cell, tissue, or organism under a specific, defined set of conditions – in its present state has also been dependent on decades of technological and instrumental developments. These developments have included advances in mass spectrometry (MS) technology, protein fractionation techniques, and bioinformatics, to name a few. One of the difficult challenges in proteomics is the uncertainty of its complexity. Let us assume that as many as 15,000 genes are expressed within a given human cell at any single time. Let us then assume that these genes give rise, on average, to four different mRNA transcripts each, or 60,000 different transcripts in total. This estimate is very conservative given the presence of splice variants, single nucleotide polymorphisms, etc. that can arise from expressed genes. Now consider all of the events that can happen to a single transcript during and after translation into a mature protein. The protein can be processed from a preprotein into a mature protein.The most common event that happens to change a protein is post-translational modification. There are over 300 known modifications, however; some of the more common are (in no particular order) phosphorylation, glycosylation, methylation, and acetylation. A single protein that may have four potential phosphorylation sites will, in itself, give rise to 16 different isoforms ranging from all the sites phosphorylated to none of them being phosphorylated. If each of the 60,000
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 173
■
Proteomics: The Deciphering of the Functional Genome
possible protein products contained only four possible sites of modification, these would give rise to as many as 960,000 different proteins. Obviously, these numbers are extremely conservative and suggest that the human proteome is likely to contain well over 1,000,000 distinct species. This number is well beyond the current capacity of state-of-the-art technologies, resulting in a consistent undersampling of the proteome regardless of the analysis method chosen. It is this uncertainty in the overall measurement that results in the question that hovers around many global proteomic studies: “Was the protein absent in the sample or did we simply fail to detect it?”
Isoelectric point
Isoelectric point
(a)
(b) Molecular mass
CHAPTER 14
Molecular mass
174
In-gel enzymatic digestion
GEL-BASED AND SOLUTION-BASED PROTEOMICS Proteomics relies on three basic technological cornerstones: a method to fractionate complex protein or peptide mixtures, MS to acquire the data necessary to identify individual proteins, and bioinformatics to analyze and assemble the MS data. While the MS and bioinformatic components are somewhat similar in most applications, there are two very distinct methods to separate out complex protein samples in proteomics. The mere mention of the word “proteomics” conjures up an image of stained protein spots that have been fractionated by two-dimensional polyacrylamide gel electrophoresis (2D-PAGE). This separation technology, which was developed over 30 years ago, is still the most common separation technique used in proteomics today (O’Farrell, 1975). Advances have been made in this technology that enables thousands of proteins to be resolved in a single gel. While the primary role of 2D-PAGE is to separate complex mixtures of proteins, it also allows a means to compare the relative abundances of proteins from different proteomes. In a typical 2D-PAGE-based study, the proteomes from two distinct cell populations (e.g., control versus treated cells) are extracted and fractionated on separate gels (Figure 14.1) (Van den Bergh and Arckens, 2005). Since most proteins are colorless, the separated proteins need to be visualized (Westermeier and Marouga, 2005). This visualization is routinely accomplished using colorimetric stains, such as coomassie blue or silver stain. The resulting images are typically quite complex; therefore the spots observed on both gels are aligned so that the relative staining intensity of individual proteins can be compared between gels. Protein spots that are stained more intensely on one gel compared to the other are excised from the gel. The protein within the gel is then enzymatically digested by introducing trypsin into the gel piece. The resultant tryptic peptides are extracted from the gel and analyzed by MS or tandem MS (MS/MS) to acquire the raw data necessary to identify the protein(s) within the gel spot. This identification is accomplished using software designed to compare this type of data against databases containing large amounts of genomic or protein sequence information. While 2D-PAGE fractionation has been the subject of many different criticisms over the years, it still remains a cornerstone technology in proteomics (Van den Bergh and Arckens, 2005). Many of
Protein identification by peptide mapping or tandem MS m/z
Search database
Generate list of differentially abundant proteins
Figure 14.1 Quantitative proteomics using two-dimensional polyacrylamide gel electrophoresis (2D-PAGE). In this method, comparative proteome samples are separated on distinct 2DPAGE gels. After staining, protein spots that are more abundant on one gel compared to the other are excised from the gel. The protein(s) within the gel is then subjected to in-gel tryptic digestion and the resultant peptides are extracted and analyzed by mass spectrometry (MS). The MS data are then searched against the appropriate database to identify the protein(s) with the gel spot.
the criticisms, such as its inability to resolve membrane proteins and gel-to-gel irreproducibility have been addressed through the development of better reagents, equipment, and gel alignment software. For instance, over the past several of years, 2D-DIGE reagents have been developed that allow comparative proteomes to be labeled with different fluorophores (i.e., Cy2, Cy3, and Cy5) (Hoorn et al., 2006). This differential labeling allows for the relative abundance of proteins from different proteomes to be compared by their combination prior to their run on a single gel, obviating the separation irreproducibility that can be observed between gels. No single chromatographic or electrophoretic procedure, including 2D-PAGE, can completely resolve a mixture as complex as that of a proteome. As a complement to 2D-PAGE methods,
Mass Spectrometry
Proteome A (ICAT-13C0)
175
Proteome B (ICAT-13C9)
Pool proteins
Digest with trypsin
Enrich ICAT-peptides
Cleave biotin tags
light
heavy 9.03 Da
% Intensity
solution-based chromatographic techniques have been developed that permit direct on-line MS analysis of proteomes (Issaq et al., 2005). Analogous to 2D-PAGE, these solution-based methods typically use at least two different fractionation schemes to separate peptides/proteins prior to their entering the mass spectrometer. While many different combinations have been proposed, the most effective and popular has been the use of strong cation exchange (SCX) followed by reversed-phase high-performance liquid chromatography (RPLC). John Yates III first demonstrated the effectiveness of combining this multidimensional separation with MS (Washburn et al., 2001).This method, termed “MudPIT” (multidimensional protein identification technology), has shown the capability of identifying thousands of proteins per proteome study. In this scheme, the proteins extracted from a cell lysate are first digested into peptides. There are two major advantages in working with peptides instead of proteins. Peptides are more soluble and easier to separate than intact proteins, especially hydrophobic and membrane proteins, and peptides are more amenable to identification by MS than are proteins (as described in more detail later). The challenge with this strategy is that the number of species that have to be resolved has dramatically increased. The digestate is initially separated by SCX and fractions collected off this column are then loaded onto a RP column that is coupled directly on-line with the mass spectrometer. The mass spectrometer can then dynamically select and acquire the data necessary to identify peptides as they elute from the RP column. A limitation of MudPIT when compared to 2D-PAGE is in its inability to directly measure the relative abundance of proteins in different proteome samples. To meet this need, many different approaches have been developed, all of which use the data provided by the mass spectrometer for quantitation (Ong and Mann, 2005). Many of the methods rely on the use of differential stable-isotope labeling (Julka and Regnier, 2005). Analogous to DIGE, which uses differently fluorescent molecules with identical electrophoretic properties, stable-isotope labeling of proteomes uses chemically identical reagents that differ in their isotope content (i.e., mass). For example, the most popular labeling method termed isotope-coded affinity tags (ICAT) utilizes thiol-modifying reagents that differ in their carbon-13 (13C) content (Gygi et al., 1999). In this procedure, one proteome sample is labeled with the “light” ICAT reagent while a comparative sample is labeled with the “heavy” ICAT reagent that contains nine 13C atoms in place of nine 12C atoms found in the “light” version (Figure 14.2). The label is attached to the proteins after they have been separately extracted from the samples to be compared. Once the two samples are differentially modified, they can be combined. The proteins are digested into peptides and the modified peptides are extracted by virtue of a biotin tag that is present at the termini of the ICAT reagents. This enriched peptide mixture is finally analyzed using MudPIT and the relative abundance and identity of the peptides are determined from measurements provided by the mass spectrometer. While ICAT illustrates a post-extraction/chemical modification method, stable-isotope labeling can also be performed at the metabolic level by culturing cells in growth media that is highly enriched in a
■
NCSETQYESK ICAT
m/z MS quantitation
m/z MS/MS identification
Figure 14.2 Quantitative proteomics using isotope-coded affinity tags (ICAT). In this method, comparative proteome samples are labeled with chemically identical reagents that differ only by their carbon isotope content (i.e., 12C9 for the light reagent and 13C9 for the heavy reagent). After the proteins are modified, the proteome samples are combined and digested into tryptic peptides. The ICAT-modified peptides are extracted using avidin chromatography by virtue of the biotin moiety on the terminus of the ICAT reagents. After removal of the biotin portion, these ICAT-peptides are analyzed by reversed-phase liquid chromatography coupled directly on-line with a mass spectrometer. The mass spectrometer is operated in a data-dependent tandem mass spectrometry (MS/MS) mode, enabling the relative quantitation of the peptide in the two samples to be measured in the MS mode as well as its identity be discerned from the data acquired by MS/MS.
particular heavy isotope (e.g., 15N in place of 14N) or a “heavy” amino acid (e.g., 13C6-lysine) (Ong et al., 2003). A list of various sample preparation used in proteomics today, along with their advantages and disadvantages is provided in Table 14.1.
MASS SPECTROMETRY While high-resolution separations are critical for complex protein mixture analysis, MS is inarguably the technology that
176
CHAPTER 14
■
Proteomics: The Deciphering of the Functional Genome
T A B L E 1 4 . 1 List of advantages and disadvantages of popular sample preparation methods used in proteomics Method
Advantages
2D-PAGE
●
●
2D-DIGE
●
●
MudPIT
● ● ●
ICAT
●
Disadvantages
Staining provides relative quantitation Only sequence proteins of interest
●
Can compare multiple proteomes on single gel Reproducible separation
●
Highly sensitive Broad proteome coverage High throughput
●
Reduces complexity prior to MS analysis
●
● ●
●
●
● ●
iTRAQ
●
Allows 4 samples to be simultaneously compared
● ●
Laborious Limited sensitivity Irreproducibility a factor Laborious Requires fluorescent scanner
Limited individual protein coverage Unable to target specific proteins Lacks inherent ability for quantitative comparison Low throughput Limited to a binary comparison Unable to target specific proteins Low throughput Unable to target specific proteins
has driven the proteomics revolution. Although, it has been around for approximately a century, MS has gained increased prominence with the development of techniques to rapidly identify proteins. There are many different reasons that have propelled the MS to its prominent position within the field of proteomics. Among these are the sensitivity afforded using MS, allowing proteins and peptides present in the low femtomole (fmol, 1015 mol) to high attomole (amol, 1018 mol) range to be successfully identified. The mass measurement accuracy available using current MS technology, which can routinely be in the range of 1–5 ppm, also increases the confidence in the identification provided by the bioinformatic search of the raw data. While these and other technology specifications have been important, the development of MS/MS (McLafferty, 1981) and the coupling of on-line protein/peptide separations with MS (Henion, 1978) have been the key determinants that have enabled proteomics. Tandem MS allows tryptic peptides obtained from an enzymatic digestion of a complex proteome mixture, to be fragmented in such a way that sufficient sequence information can be obtained for its unambiguously identification. While proteins are typically identified through MS/MS identification of
their peptide surrogates, methods to characterize intact proteins by MS/MS are becoming more and more popular. The coupling of on-line separations, most commonly RPLC, has enabled complex mixtures to be fractionation prior to MS analysis so that thousands of peptides can be identified within a few hours, as shown in Figure 14.3 (Washburn et al., 2001). The throughput by which proteins in complex mixtures are identified by LC-MS is unparalleled using any other technology: a critical parameter for discovery science in which large amounts of data are collected in order to find the important factors. There are many different types of commercially available mass spectrometers, and it can be confusing as to which type is most suitable for a specific application. For instance, ion-trap mass spectrometers equipped with electrospray ionization (ESI) sources are the “work-horse” instruments for the characterization of complex proteomes when solution-based separations are used. Time of flight (TOF) instruments are popular for identifying proteins separated using 2D-PAGE, primarily because of their speed and mass accuracy. The choice of what instrument to be utilized needs to be weighed against the type of research that is being focused on. How is MS able to identify so many peptides in a complex mixture? A schematic of the process is shown in Figure 14.3. The mass spectrometer is operated in what is referred to as a “data-dependent” mode of operation. In this mode, the instrument takes a series of “snapshots” of peptides eluting from a LC column into the mass analyzer. This scan is referred to as the MS scan. In between these snapshots, a series of MS/MS scans of peptides that are isolated within the instrument for CID are acquired. The instrument selects peptides for MS/MS analysis based on their intensity in the preceding MS scan. A datadependent experiment can perform 10 MS/MS scans between every MS scan. The mass spectrometer continues this cycle of MS scan followed by ten MS/MS scans throughout the entire LC/MS/MS experiment. Depending on the length of the LC separation, upwards of 7000 MS/MS sequencing attempts are made on peptides eluting from the column into the mass spectrometer. Bioinformatic analysis will result in 10–25% (i.e., 700–1750) of these sequencing attempts resulting in successful peptide identifications. A list of popular mass analyzers used in proteomics and their performance characteristics is given in Table 14.2.
BIOINFORMATICS Probably no area of proteomics evolves faster than bioinformatics. Proteomics is critically dependent on bioinformatics to process the raw mass spectral data into protein data. While routinely used by every proteomic laboratory, the most critical software programs are those that take peptide mapping and/or tandem MS results and determine the protein or peptide sequence that most closely matches the experimental data. The two most popular software packages for matching experimental MS data to peptide/protein sequences are MASCOT
Bioinformatics
■
177
1789.8007 Automatically selected for tandem MS 7000/hr
1789.8007
100
100 1750
1821.8724
1790
Relative abundance
Relative abundance
80 1810
60 40 1404.6632 20
1203.6808
0 1000
1400
1912.8071 2442.0809 2829.2381 2328.9829 1800
2200
2600
60 40 20 0
3000
MS/MS spectrum
80
400
800
1200 m/z
1600
2000
m/z
Relative abundance
100 80 60 40 Peptide identifications
20 0 20
40
60
80 Time (min)
100
120
140
Protein identifications
Figure 14.3 Principles of data-dependent tandem mass spectrometry (MS/MS). In data-dependent MS/MS, the mass spectrometer conducts an MS scan to record peptide masses that can be detected during a reversed-phase liquid chromatography (RPLC)-MS analysis of a complex proteome sample. The mass spectrometer isolates the peptide producing the most intense signal (i.e., m/z 1789.8007) and subjects it to collisional induced dissociation (CID). The resultant fragment ions are detected by the mass spectrometer and are subsequently used to identify the peptide in a bioinformatic analysis of the data. The mass spectrometer then isolates the next most intense peptide and collects its MS/MS spectrum. Linear ion-trap mass spectrometers may sequentially select on the order of 10 peptides for CID prior to recording the next MS scan. In the following MS scan, the instrument looks for new peptides that may have eluted from the LC column and selects these for CID and possible identification.
TABLE 14.2 Performance characteristics of various mass analyzers commonly used in proteomic research Analyzer
Sensitivity
Resolution
Mass accuracy
Ion-trap
Good
Low
Low
Linear ion-trap
Excellent
Low
Low
Triplequadrupole
Good
Good
Good
TOF
Good
High
High
FTICR
Excellent
High
High
(Lubec et al., 2005) and SEQUEST (Yates et al., 1995). As with most software algorithms, they both do what can also be done manually but at a much faster speed. For instance, SEQUEST is capable of transforming thousands of raw tandem MS files into potential peptide identifications within minutes, whereas manually assigning a single raw file would take on the order of hours. These software algorithms are also capable of identifying modified peptides when the mass added to the peptide as a result of some covalent modification (e.g., phosphorylation) is known. While not in widespread distribution, software packages, such as Scoring ALgorithm for Spectral Analysis (SALSA) developed by Daniel Leibler, provides a means of identifying unanticipated modifications, by identifying spectra that display characteristic
178
CHAPTER 14
■
Proteomics: The Deciphering of the Functional Genome
product ions, neutral or charged losses of groups of signals that are indicative of a particular modification (Hansen et al., 2001). Taken together, these varieties of software packages have been critical to the enabling of data analysis on the scale needed for discovery-driven proteomic studies. Throughout this chapter, the theme has been to use proteomic technologies to measure the relative abundances of proteins in different samples. Software has been developed to do this both at the gel-staining level and the mass spectral data level (i.e., when stable-isotope labeling quantitation is employed). Where the major lack has been is how this relative abundance data is presented. It is still typically displayed as lists of proteins with their relative abundance between two samples being given as a fold-difference. Unfortunately, with the ability to generate abundance data for thousands of proteins, these lists become an “unsolvable puzzle” when trying to identify major effects on a cell’s proteome as a result of a specific treatment. Recently, software packages such as Ingenuity™ have been developed to distill abundance data into annotated protein pathways enabling the probability that any specific signaling pathway has been affected can be statistically measured (www.ingenuity.com/products/ pathways_analysis.html). Software packages such as these also allow transcriptomic and proteomic data to be directly compared at both the single biomolecule and pathway levels.
IMPACT OF PROTEOMICS ON UNDERSTANDING DISEASES It is impossible to find a common characteristic that threads every disease together. One prevalent characteristic, primarily in the area of cancer, is that the earlier it is detected the better the chance of survival. To detect cancer at earlier stages is going to require novel biomarkers that possess high sensitivity and specificity. This need has driven much of the proteomic research over the past 3–5 years. Much of the search for novel biomarkers falls within discovery-driven research, and therefore MS has played a major role. The aim, in most studies, is to leverage the ability of MS to identify proteins in biological samples from patients with specific cancers and compare that to lists of proteins identified in the same sample types obtained from healthy-matched controls. Proteins that are found to be more prevalent in samples from disease-afflicted patients are considered as potential biomarkers. While on the surface the premise sounds simple enough, in reality there are several daunting challenges, both technological and physiological. Even with the speed at which MS is capable of identifying proteins, it still undersamples any complex proteome mixture. When such entities as post-translational modifications and splice variants are taken into account, it is impossible to know exactly the number of distinct proteins within a cell or biofluid proteome. Therefore, when a protein is detected in a diseased sample but not in a matched control, there is always that uncertainty as to whether the protein is not really present or it just was not detected in the present experiment. As is
commonly quoted in science, “the absence of evidence is not the evidence of absence”. On the physiological side, the fact that the proteome of any cell or biofluid is dynamic results in a constantly changing background that occurs both between different patients and within the same patient. For instance, the serum proteome of an individual is different before and after lunch. Compound this with the genetic and lifestyle variability that is observed between different patients and it is easy to grasp how difficult it is to confidently identify a disease-specific change in a protein in this shifting matrix. Regardless, the promise of technologies such as MS that can scan proteomes at levels never before possible propels investigators to search for new diagnostic markers. While the success in discovering new markers has been limited, they do however exist. For instance, a biomarker for the bladder disorder interstitial cystitis (IC) was found using a combination of protein fractionation and MS techniques (Keay et al., 2004). Previous to this report, no useful diagnostic marker for this disorder existed and patients were typically diagnosed by the exclusion of other maladies. This study resulted in the discovery of a novel nineresidue sialoglycopeptide that was shown to be present only in the supernatant of cultured bladder epithelial cells obtained from patients with IC and not in healthy controls or patients with bacterial cystitis. One of the more intriguing MS-based proteomic techniques to identify signatures of diseases in blood involves the detection of peptides that signify exo- and endo-protease activities (Villanueva et al., 2006). In this study, low molecular weight peptides were extracted from serum and analyzed using matrixassisted laser desorption ionization (MALDI) TOF MS. The study showed that a subset of the serum peptides were able to provide a signature allowing patients with three types of solid tumors to be classified separately from controls without cancer. Sixty-one of the signature peptides were identified and shown to fall into several tight clusters (Villanueva et al., 2006). Most of the identified peptides were generated by cancertype specific exopeptidase activities that enabled highly accurate classification of a set of prostate cancer samples used to validate the markers. While this method does not identify a novel diagnostic or therapeutic protein marker, it does use MS to reveal a cancerspecific activity found within serum of cancer-affected patients.
CONCLUSIONS While proteomics technology has advanced greatly in the past decade and allows scientists to characterize proteomes to an extent never before possible or imagined, this capability in itself brings a whole new set of challenges. The initial challenge is how to glean important clinical information from massive datasets that are amounts of data “contaminated” with a vast background of other less important proteins. While MS provides the ability to conduct comparative analyses of biological samples for the discovery of novel diagnostic and therapeutic biomarkers, it is unable to indicate which of the differentially abundant proteins may be
Recommended Resources
disease-specific. Many times these experiments represent proverbial “needle-in-the-haystack” searches with the additional complication that no information about the “needle” is known. While many potential “needles” can be found, the next area of development is to determine those that should graduate to validation phase. Since validation of any prospective biomarkers is expensive both in terms of time and money, continuing developments in optimal methods to both acquire and analyze proteomics data will be critical.
■
179
ACKNOWLEDGEMENTS This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organization imply endorsement by the United States Government.
REFERENCES Gilham, P.T. (1970). RNA sequence analysis. Annu Rev Biochem 39, 227–250. Gygi, S.P., Rist, B., Gerber, S.A.,Turecek, F., Gelb, M.H. and Aebersold, R. (1999). Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17, 994–999. Hansen, B.T., Jones, J.A., Mason, D.E. and Liebler, D.C. (2001). SALSA: A pattern recognition algorithm to detect electrophile-adducted peptides by automated evaluation of CID spectra in LC-MS-MS analyses. Anal Chem 73, 1676–1683. Henion, J.D. (1978). Drug analysis by continuously monitored liquid chromatography/mass spectrometry with a quadrupole mass spectrometer. Anal Chem 50, 1687–1693. Hoorn, E.J., Hoffert, J.D. and Knepper, M.A. (2006). The application of DIGE-based proteomics to renal physiology. Nephron Physiol 104, 61–72. Issaq, H.J., Chan, K.C., Janini, G.M., Conrads,T.P. and Veenstra,T.D. (2005). Multidimensional separation of peptides for effective proteomic analysis. J Chromatogr B Analyt Technol Biomed Life Sci 817, 35–47. Julka, S. and Regnier, F.E. (2005). Recent advancements in differential proteomics based on stable isotope coding. Brief Funct Genomic Proteomic 4, 158–177. Keay, S.K., Szekely, Z., Conrads, T.P., Veenstra, T.D., Barchi, J.J., Jr., Zhang, C.O., Koch, K.R. and Michejda, C.J. (2004). An antiproliferative factor from interstitial cystitis patients is a frizzled 8 proteinrelated sialoglycopeptide. Proc Natl Acad Sci USA 101, 11803–11808. Lubec, G., Afjehi-Sadat, L., Yang, J.W. and John, J.P. (2005). Searching for hypothetical proteins: Theory and practice based upon original data and literature. Prog Neurobiol 77, 90–127.
McLafferty, F.W. (1981). Tandem mass spectrometry. Science 214, 280–287. O’Farrell, P.H. (1975). High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250, 4007–4021. Ong, S.E. and Mann, M. (2005). Mass spectrometry-based proteomics turns quantitative. Nat Chem Biol 1, 252–262. Ong, S.E., Foster, L.J. and Mann, M. (2003). Mass spectrometric-based approaches in quantitative proteomics. Methods 29, 124–130. Van den Bergh, G. and Arckens, L. (2005). Recent advances in 2D electrophoresis: An array of possibilities. Expet Rev Proteomics 2, 243–252. Villanueva, J., Shaffer, D.R., Philip, J., Chaparro, C.A., ErdjumentBromage, H., Olshen, A.B., Fleisher, M., Lilja, H., Brogi, E., Boyd, J. et al. (2006). Differential exoprotease activities confer tumorspecific serum peptidome patterns. J Clin Invest 116, 271–284. Washburn, M.P., Wolters, D. and Yates, J.R. III (2001). Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 19, 242–247. Westermeier, R. and Marouga, R. (2005). Protein detection methods in proteomics research. Biosci Rep 25, 19–32. Yager, T.D., Nickerson, D.A. and Hood, L.E. (1991). The Human Genome Project: Creating an infrastructure for biology and medicine. Trends Biochem Sci 16, 454–458. Yates, J.R., Eng, J.K., McCormack, A.L. and Schieltz, D. (1995). Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal Chem 67, 1426–1436.
RECOMMENDED RESOURCES Books Proteomics for Biological Discovery by Timothy D. Veenstra and John R. Yates Introduction to Proteomics:Tools for the New Biology by Daniel C. Liebler The Proteomics Protocols Handbook by John M. Walker Proteomics in Cancer Research by Daniel C. Liebler Proteins and Proteomics: A Laboratory Manual by Richard Simpson
Journals Molecular and Cellular Proteomics (mcponline.org) Journal of Proteome Research (jpr.org)
Proteomics (proteomics.org) Proteomics – Clinical Applications
Websites http://expasy.org http://www.hupo.org http://www.eupa.org http://www.asms.org
CHAPTER
15 Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms Christopher B. Newgard, Robert D. Stevens, Brett R. Wenner, Shawn C. Burgess, Olga Ilkayeva, Michael J. Muehlbauer, A. Dean Sherry and James R. Bain
INTRODUC TION Despite decades of research, the molecular etiology of chronic and indolent diseases and conditions such as diabetes, obesity, cancer, and cardiovascular disease remains obscure, in part because they represent slowly developing phenotypes caused by the interplay of a complex mixture of genetic and environmental effectors. For similar reasons, it has been very difficult to detect and diagnose these diseases at a stage that allows sufficient time for effective intervention. Comprehensive measurement of intermediary metabolites and changes in metabolic activity (metabolic flux) could lead to improved detection and understanding of these diseases. As will be discussed in this chapter, instrumentation for measurement of large groups of intermediary metabolites as well as computational methods for analyzing such data are emerging. This surge in development of “metabolomics” technologies confers a number of potential advantages to scientists interested in disease research. First, it is estimated that human beings contain approximately 5000 discrete small molecule metabolites, far
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 180
less than the estimate of 25,000 genes and 100,000 proteins. Thus, metabolomics may ultimately be the most tractable of the “omics” sciences. Second, metabolomics measures changes in metabolic or chemical milieu that are downstream of genomic and proteomic alterations, therefore potentially providing the most integrated picture of biological status. Third, identification of metabolic fingerprints for specific diseases may have particular practical utility for development of therapies, because metabolic changes immediately suggest enzymatic drug targets. And fourth, metabolomics is likely to be a powerful and precise tool for discerning mechanisms of action and possible toxicological effects of drug therapies. The purpose of this chapter is to describe current methodologies for metabolic profiling and flux analysis and to provide selected examples of insights into biological and disease mechanisms gained from these approaches. Studies involving integration of metabolomics with genomic and transcriptomic profiling methods are just beginning to emerge, and early examples of this exciting area will also be highlighted. It is important
Copyright © 2009, Elsevier Inc. All rights reserved.
Current Metabolomics Platforms: Basic Tools and General Features
to emphasize from the outset that metabolic profiling tools are in a rapid phase of development, such that our description of currently useful methods will likely represent a “snapshot in time” that will evolve rapidly over the next several years.
■
181
100 P
L A %
CURRENT METABOLOMICS PLATFORMS: BASIC TOOLS AND GENERAL FEATURES As described in other chapters in this volume, genome-wide analysis of genetic variation (“genomics”) and surveys of changes in mRNA abundance by microarray technologies (“transcriptomics”) are examples of robust and relatively mature technologies for investigation of mechanisms underlying biological variability. For example, mRNA profiling has developed to a point that microarray cores have become common in both the academic and private sectors, with many thousands of RNA profiling experiments performed on a yearly basis. This is not yet the case for laboratories that focus on comprehensive metabolic analysis. One reason for this is the complexity inherent in attempting to measure small molecule metabolites (intermediary metabolites as discussed in this chapter are defined as nonpeptidic and with a molecular weight of 3000) in a quantitatively rigorous fashion. Underlying issues include the variable lability of metabolites (e.g., amino acids are often quite stable, whereas organic acid or lipid species can be quite labile), the chemical diversity of different metabolite classes, problems encountered in efficient extraction of metabolites from different biological matrices (e.g., tissues, blood, urine, etc.), and the wide dynamic concentration range of metabolite species (ranging from sub-nanomolar to milimolar). Thus, it is not surprising that there is currently no single technology for measurement of all of the metabolites in the “metabolome”. This means that the small number of existing laboratories that engage in comprehensive metabolic analysis have usually been constructed on an “ad hoc” basis, often driven by the direct technical experience and/or scientific goals and philosophies of the founding scientists. This leads to varying choices in instrument platforms, and places an even greater burden on individual laboratories for rigor in quantitative analysis, since standardized methods do not yet exist for the majority of analytes. The two major instrument platforms for measuring metabolite levels in biological samples are nuclear magnetic resonance (NMR) and mass spectrometry (MS) (Dettmer et al., 2007; German et al., 2005; Hollywood et al., 2006; Lindon et al., 2007; Watson, 2006). Both technologies can also be used for metabolic flux analysis following provision of stable isotope-labeled metabolic fuels (e.g., U-13C glucose) to cells and organisms, as discussed later. In general, when applying these technologies to assays of metabolite levels, research groups in the field tend to fall into two camps that adopt either “unbiased”/“top-down” approaches or “targeted”/”bottom-up” methods as their core technologies. Unbiased profiling involves use of either NMR
F
A
F
L VP GG
V
S S
M
Y Y
M H
E D D
E
0 140
160
180
200 m/z
220
240
260
Figure 15.1 Tandem mass spectrum of butyl esters of acidic and neutral amino acids in serum. Data were obtained using a neutral loss of 102 Da scan function. The figure illustrates the use of internal standards for “targeted” quantitative profiling of metabolites. Each amino acid in serum (shown by black letters) is measured against its own stable-isotope-labeled internal standard (shown in red letters). G, glycine, A, alanine, S, serine, P, proline, V, valine, L, leucine, M, methionine, H, histidine, F, phenylalanine, Y, tyrosine, D, aspartate, E, glutamate.
or MS for measurement of as many metabolites as possible in a biological specimen simultaneously, regardless of the chemical class of the metabolites. When applied to unbiased profiling, both NMR and MS have advantages and limitations, as discussed below, and neither technology can presently be used for surveying all of the metabolites in a sample in a quantitative fashion. In contrast, targeted profiling focuses on quantification of discrete clusters of chemically related metabolites (“modules”) using various combinations of chromatographic separations technologies and MS instruments that are most compatible with the analyte class. For example, electrospray ionization (ESI) and tandem mass spectrometry (MS/MS) are used for quantitative measurements of multiple amino acids and urea cycle intermediates or acylcarnitines (byproducts of mitochondrial oxidation of fatty acids, amino acids and glucose) of varying chain length and degree of saturation, whereas gas chromatography-MS (GCMS) is used routinely for measurement of multiple species of free fatty acids and organic acids. For each separate module, accurate quantification of the analytes is facilitated by addition of a cocktail of stable isotope labeled standards to the biological sample prior to the extraction and derivatization steps that may be necessary for the particular MS approach (see below and example shown in Figure 15.1). Ideally, each unknown analyte will be paired with its labeled cognate (usually an M+3 heavy isotope, e.g., methionine
182
CHAPTER 15
TABLE 15.1
■
Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms
Summary of mass spectrometry (MS) methods used for metabolic profiling
Technique
Mass Analyzer
Reference
Metabolites
GC/MS
Quadrupole
Fiehn et al. (2000) Roessner et al. (2001) Gullberg et al. (2004) Pope et al. (2007) Denkert et al. (2006)
Fatty acids, alcohols, sterols, aliphatic and aromatic acids, amino acids, sugars, sugar alcohols, phosphates, polyamines
An et al. (2004) Haqq et al. (2005) Allen et al. (2003) Scholz et al. (2004) Oikawa et al. (2006)
Acylcarnitines, amino acids, coenzyme-As,
Bajad et al. (2006) Sabatine et al. (2005) Han and Gross (2003) Murphy et al. (2005) Lafaye et al. (2003) Lafaye et al. (2004) Plumb et al. (2005) Lenz et al. (2007) Rainville et al. (2007)
Phoshpholipids, sphingolipids, eicosanoids, amino acids, sulfoconjugates, carboxylic acids, amines, glucuronides, glycosides, glutathione conjugates
Sato et al. (2004) Williams et al. (2007) McNally et al. (2006) McNally et al. (2007) Soga et al. (2006) Soga et al. (2006)
Amino acids, sugar nucleotides, carboxylic acids, sugar phosphates, amines
TOF
FIA or Infusion
QqQ TOF Q-TOF FT-ICR
LC/MS
QqQ
QIT TOF Q-TOF CE/MS
Quadrupole QqQ-IT TOF Q-TOF
Direct Ionization
MALDI-TOF MALDI-TOF/TOF DESI-QIT
Yu et al. (2006) Fraser et al. (2007) Sun et al. (2007) Cooks et al. (2006) Pan et al. (2007)
Amino acids, fatty acids, carotenoids, nucleosides, carboxylic acids
Abbreviations: GC, Gas Chromatography; MS, Mass Spectrometry; LC, Liquid Chromatography; CE, Capillary Electrophoresis; TOF, Time of Flight; FT-ICR, Fourier Transform Ion Cyclotron Resonance;Q, Quadrupole; QqQ Triple Quadrupole; QIT Quadrupole Ion Trap; MALDI, Matrix Assisted Laser Desorption Ionization; DESI, Desorption Electrospray Ionization.
with methyl-D3 methionine, pyruvate with 13C3-pyruvate, etc.) to control for differences in analyte loss during chemical extraction and/or derivatization of individual analyte species and to compensate for ionization suppression effects. In practice, the limited range of stable isotope-labeled metabolites available from commercial suppliers and their relatively high cost constitute a current impediment to more broadscale development and deployment of these technologies. To overcome this problem in the future, it may be necessary for metabolomics core laboratories to import chemical synthesis capabilities for production of comprehensive libraries of internal standards.
Table 15.1 provides a summary of various MS-based metabolic profiling technologies that are currently used for measurement of different metabolite classes.
COMPARISON OF NMR AND MS TECHNOLOGIES FOR UNBIASED METABOLIC PROFILING Nuclear magnetic resonance spectroscopy is theoretically an excellent tool for unbiased metabolic profiling of all small
MS Methods for Targeted Metabolic Profiling
molecule metabolites, since the method is based on detection of any molecules that contain carbon or hydrogen (German et al., 2005; Lindon et al., 2007). Moreover, the method is non-destructive, in that analyses can be conducted directly in bodily fluids, cells, and even in intact tissues without the need for chemical extraction or derivatization of the individual analytes. These advantages are off-set by significant technical challenges, including poor sensitivity (it is generally not possible to detect metabolites at sub-micromolar levels with NMR) and the difficulty of deconvoluting spectra of complex metabolite mixtures such as typically found in biological matrices like plasma, urine, or tissue extracts. When NMR is applied to such samples, all metabolites within the detection limits of the technology resolve as hundreds to thousands of proton magnetic resonance peaks. The complexity of these data and the absence of comprehensive libraries of NMR spectra make it very difficult to assign the different resonances to specific metabolites. Thus, although NMR spectra are rich with information, the full harvesting of these datasets is not yet possible. Instead, peaks within NMR datasets are often treated as independent statistical variables, and the data analyzed by methods such as principal components analysis (PCA) to identify clusters of resonances that characterize different biological or disease states. Mass spectrometry has the immediate advantage of much higher sensitivity compared to NMR, and the most advanced MS platforms such as Fourier transform ion cyclotron mass spectrometers (FT-ICR-MS) have the capacity to detect metabolites at concentrations as low as the femtomole range (Dettmer et al., 2007). Moreover, modern MS platforms have very high-mass resolution (100,000), can resolve literally thousands of individual small molecules without the need for chromatography, and can achieve mass accuracies of 1 ppm. Seemingly, current MS platforms could therefore provide a means of circumventing problems encountered when using NMR methods for unbiased metabolic profiling of complex biological matrices, as described earlier. In practice, however, MS techniques usually require a sample extraction step that can cause metabolite losses. Moreover, different kinds of MS instruments use varying methods for the generation and detection of ions, and in concert, different kinds of metabolites have discrete ionization properties and are therefore variably resolved and detected in instruments that rely on mass and charge for analyte separation (Dettmer et al., 2007). Thus, while several current MS instrument platforms can be used for resolution and detection of large numbers of metabolites (e.g., high-performance liquid chromatography-MS, Q-TOF-MS, and FT-ICR-MS), the variable lability, solubility, recovery, and ionization of the different analyte species, coupled with the lack of broad-coverage spectral libraries makes these unbiased methods inherently nonquantitative and useful mainly for detecting gross changes in spectral peaks in samples of different biological origin. Similar to NMR, masses in these complex MS spectra are often not assigned to a specific analyte, and the data are instead analyzed by PCA or other multi-factorial statistical tools. However, an
■
183
advantage of MS-based unbiased profiling methods is that when specific peaks or clusters of peaks consistently differentiate biological states (e.g., a metastatic tumor from a benign tumor), their detection by high-resolution MS platforms provides an immediate clue for analyte identification by providing an accurate mass estimate of the analyte. These mass estimates can be used to query databases such as KEGG, HMP, and ChemSpider for chemical formulae that match the estimated mass, thereby usually providing a strong starting point for identifying the analyte, and ultimately facilitating a return to more mechanistically oriented studies. The number of possible formulae is also significantly reduced by high-mass accuracy; for example, the number of possibilities for a mass 500 at a mass accuracy 5 ppm with an elemental palate CHNOS is 23. A mass accuracy of 2 ppm would reduce the number to 10 and an accuracy of 1 ppm would yield only 4 possibilities.
MS METHODS FOR TARGETED METABOLIC PROFILING For scientists with core interests in biological mechanisms rather than biomarkers, knowledge about the identity of metabolites being surveyed and their exact concentration greatly enhances the value of the datasets. This means that some metabolic profiling laboratories are built in a bottom-up fashion, with emphasis on targeted and quantitative MS-based analyses. Although these approaches tend to utilize simpler tools such as ESI-MS/MS and GC-MS, significant challenges must still be overcome in building rigorous and fully vetted analysis modules for groups of metabolites. For example, GC-MS, when coupled with the use of stable isotope standards, can provide sufficient chromatographic resolution and analyte-specific detection to allow quantification of groups of metabolites in a class (e.g., fatty acids of different change length and degree of saturation or organic acids/TCA cycle intermediates). However, because the chromatography is performed in the gas phase at high-temperature, analytes must be volatile and have sufficient thermal stability to survive the analysis. To help stabilize the analytes under study, reactive carboxyl, sulfhydryl, amine, or hydroxyl groups are derivatized by akylation, oximation, acylation, or silylation (Dettmer et al., 2007; German et al., 2005). These methods, while effective, add complexity beyond that already introduced by sample extraction to the analytical procedures, and can result in batchto-batch variation if not properly controlled. Similarly, analysis of acylcarnitines or amino acids/urea cycle intermediates by ESI-MS/MS requires different derivatization steps for each module – for example, treatment with acidic methanol for acylcarnitines and n-butanol for amino acids/urea cycle intermediates (Haqq et al., 2005). Nevertheless, with care, these modules can provide quantitative data for highly informative clusters of analytes (e.g., fatty acids, fatty acylCoAs, and acylcarnitines of different chain length and degree of saturation, amino acids/urea
184
CHAPTER 15
■
Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms
cycle intermediates, and organic acids), with coefficients of variation in replicate assays in the range of 10%.
EXAMPLES OF NMR-BASED METABOLIC PROFILING IN DISEASE RESEARCH The most common application of 1H-NMR-based metabolic profiling in mammals and humans to date has been for evaluation of toxic effects of xenobiotic drugs. Work in this area has been reviewed by others (Lindon et al., 2007), so we focus instead on newer applications in the realm of disease research. Coronary artery disease (CAD) is the leading cause of death in the United States and, in concert with the epidemic of obesity and diabetes, is rapidly becoming the leading cause of death in many developing countries. The genetic predilection of CAD is well established; family history has repeatedly been shown to be a robust, independent risk factor for CAD. Many of the commonly accepted risk factors for CAD are metabolic, such as lipid abnormalities and diabetes, making this disease a particularly logical target for comprehensive metabolic profiling technologies. Indeed, 1H NMR was recently reported to detect metabolic profiles that predicted the presence and severity of CAD (Brindle et al., 2002). The study involved comparison of a relatively small group of patients with triple vessel coronary heart disease (n 36) to angiographically normal subjects (n 30). Partial least squares-discriminant analysis was used to identify peaks in the major lipid regions of the spectra that appeared to provide separation between the groups. The specific lipid species involved were not identified by this analysis, although it was suggested that choline-containing metabolites were particularly diagnostic. However, a subsequent study using very similar techniques demonstrated that the predictive value of the NMR-based metabolic profiles was weak when other factors such as gender and use of medical interventions such as statins were taken into account (Kirschenlohr et al., 2006). This second group of authors demonstrated that the 1H NMR technique could identify male versus female subjects with 100% accuracy, but was much less able to identify statin users or subjects with CVD, despite expectations of substantial changes in lipid profile in the former group. This study makes it clear that 1H NMR is currently not a substitute for the more invasive procedure of angiography in detection of CVD. An intriguing and more promising recent application of 1 H NMR-based metabolic analysis has been to study the influence of intestinal bacteria (microbiota) on pathophysiologic outcomes of high-fat (HF) feeding (Dumas et al., 2006). This involved feeding of a HF diet to a mouse strain (129S6) known to be susceptible to HF-induced hepatic steatosis and insulin resistance, and for comparison, to a strain (BALBc) with relative resistance to these effects of diet. Metabolic profiling of plasma and urine samples revealed low circulating levels of plasma phosphatidylcholine (PChol) and high levels of methylamines
in urine in the 129S6 strain. Since the conversion of choline to methylamines is accomplished by gut microbiota rather than mammalian enzymes (Zeisel et al., 1983), and choline deficient diets are associated with hepatic steatosis (Buchman et al., 1995), the authors propose that the increased propensity of the 129S6 strain for hepatic steatosis and insulin resistance could be due to a reduced PC pool necessary for the assembly of VLDL particles, leading to deposition of triglycerides in liver. As with its application to diagnosis of CVD, the value of NMR for detecting the contributions of bacterial metabolism to whole animal metabolic profiles and disease development remains to be confirmed or refuted by an independent study. However, in this instance, other studies suggest that this could indeed be an important pathway. Thus, treatment of germ-free mice with microbiota from the cecum of conventionally raised animals produced an increase in body fat content and appearance of insulin resistance within 14 days of transfer (Backhed et al. 2004). In addition, more recent studies have revealed that genetically obese rodents or obese humans have changes in relative abundance of two main intestinal bacterial groups, the Bacteroidetes and the Firmicutes, relative to lean controls, and that the “obese microbiome” has an increased capacity to harvest energy from the diet (Turnbaugh et al., 2006). The mechanism by which the obese microbiome influences peripheral metabolism of the host seems to involve changes in circulating levels of fasting-induced adipose factor (FIAF), a circulating lipoprotein lipase inhibitor, as well as changes in activity of the important metabolic regulatory proteins PGC-1 and 5 AMPactivated kinase (AMPK) in skeletal muscle and liver (Backhed et al., 2007). An intriguing area for future study is to attempt to define metabolic profiles for specific gastrointestinal bacteria such that their relative contributions to changes in overall metabolic profile can be monitored and fully understood in the context of fuel homeostasis in the host.
EXAMPLES OF TARGETED MSBASED METABOLIC PROFILING FOR UNDERSTANDING OF DISEASE MECHANISMS Development of Tools The past decade has seen a rapid increase in the use of MS-based analysis platforms for metabolic profiling. Many of the early applications of MS technologies were in the area of plant biology (Fiehn et al., 2000) or for detecting inborn errors of metabolism in newborn children (Frazier et al., 2006; Shekhawat et al., 2005). Application of tandem mass spectrometry (MS/MS) to newborn screening has allowed detection of more than 40 different genetic diseases of lipid and amino acid metabolism (see Chapter 41). In newborn screening, less emphasis is placed on absolute quantification of multiple analytes in a module, since the screen only requires detection of changes
Examples of Targeted MS-Based Metabolic Profling for Understanding of Disease Mechanisms
in a single or discrete cluster of analytes with respect to established laboratory norms. For example, a defect in HMG-CoA lyase results in large and specific increases in C5-OH-carnitine and methylglutaryl-carnitine species detected by MS/MS, whereas defects in long-chain L-3-hydroxyacyl CoA dehydrogenase (LCHAD) or mitochondrial trifunctional protein (MTP) are associated with increases in C16-OH and C18OH-carnitine metabolites (Shekhawat et al., 2005). Similarly, defects in amino acid catabolizing enzymes such as phenylalaine hydroxylase to cause phenylketonuria, or the branchedchain -ketoacid dehydrogenase complex to cause Maple Syrup urine disease are readily detected by dramatic increases in phenylalanine or branched-chain amino acid (BCCA) levels, respectively. In more recent years, many of the MS-based targeted metabolic profiling techniques developed for diagnosing inborn errors of metabolism have been adopted and refined for studies of metabolic regulatory mechanisms, disease detection, and mechanisms of disease pathogenesis. Selective examples, meant to be illustrative rather than comprehensive, are provided. One approach has been to combine several targeted MSbased assay modules as an assemblage of tools that can report on several critical metabolic pathways (An et al., 2004; Haqq et al., 2005; Koves et al., 2008). Using a combination of GC/MS and MS/MS, we are currently able to perform quantitative analysis of approximately 150 metabolites in six groups: 1. free fatty acids of varying chain length and degrees of saturation; 2. total fatty acids (free+esterified); 3. acylcarnitines (representing products of mitochondrial fatty acid, amino acid, and glucose oxidation); 4. acyl CoAs of varying chain length and degree of saturation; 5. organic acids (TCA cycle intermediates and related metabolites), and 6. amino acids, including urea cycle intermediates. Although the total number of analytes measured with these tools is small relative to estimates of 5000 total metabolites in the “metabolome”, they are nevertheless highly useful for understanding changes in metabolic fuel selection under different physiologic and pathophysiologic circumstances. Moreover, expansion of the platform to include a broader range of analytes of interest in disease pathogenesis would appear possible in the near term, via adoption of published methods for analysis of sphingolipids (including ceramides), phospholipids, and prostaglandins and their metabolites (eicosanoids) (Dettmer et al., 2007; Han and Gross, 2005; Merrill et al., 2005; Watson, 2006). However, the reader should appreciate that development of new modules is not a trivial or inexpensive undertaking, as it requires acquisition or synthesis of stable isotope-labeled internal standards to cover most if not all of the individual analytes in the module, development of extraction procedures that are efficient for multiple analytes in a class, and demonstration of quantitative reproducibility of the method. The advantage gained is that such tools can be
■
185
applied to understanding of metabolic regulatory mechanisms in isolated cells, animal models of disease, and human disease states, as summarized in the following specific examples. Metabolic Profiling for Studies of Metabolic Regulatory and Signaling Mechanisms in Cultured Cells An example of integration of NMR- and MS-based tools for understanding of metabolic regulation and signaling mechanisms comes from recent work on the process of glucose-stimulated insulin secretion (GSIS) in pancreatic islet -cells (Boucher et al., 2004; Jensen et al., 2006; Joseph et al., 2006; Joseph et al., 2007; Lu et al., 2002; Ronnebaum et al., 2006; Ronnebaum et al., 2008). Stimulation of islet -cells with glucose causes increases in insulin secretion within seconds to minutes, and this response is mediated by signals that are generated by -cell glucose metabolism. A commonly accepted idea is that increases in the rate of glucose metabolism in -cells leads to increases in ATP:ADP ratio, which causes inhibition of ATP-sensitive K channels (KATP channels), membrane depolarization, and activation of voltage-gated calcium channels. The resulting influx of extracellular Ca2 is then thought to stimulate exocytosis of insulin containing secretory granules. However, this model clearly does not provide a complete description of signals that regulate GSIS, since pharmacologic or molecular inhibition of KATP-channel activity does not abrograte glucose regulation of insulin secretion (Henquin et al., 2003; Nenquin et al., 2004). To investigate non-KATP-channel related signals for GSIS, a set of insulinoma (INS-1)-derived cell lines with varying capacities for GSIS were developed (Hohmeier et al., 2000). Application of 13C NMR-based isotopomer analysis and MS-based profiling of intermediary metabolites led to the discovery of a critical link between pyruvate carboxylase (PC)mediated pyruvate exchange with TCA cycle intermediates (“pyruvate cycling”) and GSIS, and demonstration that these pathways are dysregulated in lipid-cultured and dysfunctional -cells (Boucher et al., 2004; Jensen et al., 2006; Joseph et al., 2006; Joseph et al., 2007; Lu et al., 2002; Ronnebaum et al., 2006). More recent studies have focused on identification of the specific pyruvate cycling pathways that may be involved in generation of signals for insulin secretion. One important pathway appears to involve export of citrate and/or isocitrate from the mitochondria via the citrate/isocitrate carrier (CIC), and subsequent conversion of isocitrate to -ketoglutarate (-KG) by the cytosolic NAPD-dependent isoform of isocitrate dehydrogenase (ICDc) (Joseph et al., 2006; Ronnebaum et al., 2006). siRNA-mediated suppression of CIC or ICDc cause substantial impairment of GSIS in both insulinoma cell lines and primary rat islets. These studies suggest that a metabolic byproduct of pyruvate/isocitrate cycling may be an important amplifying signal for control of GSIS. Possible mediators that are under investigation include NADPH, -KG or its metabolites, or GTP generated by the succinyl CoA dehydrogenase reaction, all of which are direct or downstream products of the ICDc reaction.
186
CHAPTER 15
■
Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms
[U-13C] glucose
pyruvate PDH PC acetyl-CoA
PC ME 1 TCA
TCA αKG
glutamate C2
Figure 15.2 Use of 13C NMR for measurement of metabolic flux in cells. To measure flux of pyruvate into the TCA cycle via the oxidative enzyme pyruvate dehydrogenase (PDH) versus the anaplerotic enzyme pyruvate carboxylase (PC), [U-13C] glucose is administered to cells in culture. After 3 hours of incubation, cells are extracted and the NMR spectra of different carbons of glutamate are obtained. If all of the 13C-labeled glucose is converted to pyruvate to enter the TCA cycle through PDH, the spectrum of carbon 2 of glutamate will appear as shown at the bottom right of the figure. In contrast, if the labeled pyruvate enters the TCA cycle both through PDH and PC in equal proportions, the glutamate C2 spectrum will appear as shown at the top right of the figure. Arrows show significant difference in peak height in the two spectra. The change in the spectra under these two scenarios occurs because pyruvate that enters the TCA cycle via PC can engage in pyruvate recycling pathways that results in dilution of 13C (black circles) with natural abundance 12 C (gray circles), thus changing the pool of glutatmate mass isotopomers. Spectra of other glutamate carbons are obtained in the same fashion, and the aggregate data are deconvoluted to derive estimates of flux through PDH versus PC (Boucher et al., 2004; Lu et al., 2002; Sherry et al., 2004).
In the foregoing studies, measurement of pyruvate cycling flux was accomplished by incubation of -cells in low (3 mM) or high (12 mM) concentrations of U-13C glucose for several hours, followed by extraction of cells and analysis of the glutamate spectrum by NMR (see e.g., Figure 15.2). Specific resonances for each of the carbons of glutamate are affected by the population of mass isotopomers (glutamate with varying mixtures of 13C and 12C at each of the 5 carbons of the molecule), and this information can be used to calculate flux through the oxidative (PDH) and anaplerotic (PC) entry points of the tricarboxylic acid (TCA) cycle (Boucher et al., 2004; Lu et al., 2002; Sherry et al., 2004). When coupled to measurements of oxygen consumption (respiration), these rates can be expressed in
absolute rather than relative terms (Boucher et al., 2004; Sherry et al., 2004). These methods revealed that the capacity for GSIS in variously glucose responsive INS-1-derived cell lines was tightly correlated with PC-catalyzed pyruvate cycling activity, but not PDH-catalyzed glucose oxidation. To gain deeper understanding of the specific pyruvate cycling pathway involved in GSIS, these NMR-based methods of flux anlaysis were combined with MS/MS and GC/MS methods for profiling of acylcarnitines and organic acids, respectively. These experiments revealed that siRNA-mediated suppression of PC activity has little influence on pyruvate cycling rates or GSIS due to a compensatory rise in C2-acylcarnitine, which reflects the acetyl CoA pool; acetyl CoA is an allosteric activator of PC, which offsets the decrease in PC protein. These studies therefore uncovered a compensatory mechanism for preservation of pyruvate cycling activity and GSIS even when PC levels are reduced (Jensen et al., 2006). In another study, organic acid profiling was used to show that suppression of CIC caused a fall in cytosolic citrate levels, as would be expected if a major pathway for export of citrate from the mitochondria to the cytosol is blocked (Joseph et al., 2006). Finally, suppression of ICDc caused a decrease in pyruvate cycling flux, as measured by NMR, coupled with a rise in lactate levels measured by GC/MS, suggesting that in the face of suppressed pyruvate cycling activity, pyruvate is converted to lactate by default (Ronnebaum et al., 2006). These examples illustrate two important application of MS-based metabolic profiling in cellular research: (1) use in validation of specific genetic engineering or pharmacologic manipulations of cellular systems; (2) integration with flux analysis to provide a complete picture of changes in metabolic pathways under varying experimental conditions. Targeted MS-Based Metabolic Profiling Applied to Animal Models of Disease Targeted MS-based methods have also been used to gain understanding of metabolic control mechanisms in animal models of disease. Lipid infusion or the ingestion of a HF diet results in insulin resistance and eventual development of diabetes, but the mechanism by which this occurs has not been elucidated. In rats fed a HF diet, whole-animal, as well as muscle and liver insulin resistance are ameliorated in response to hepatic overexpression of a gene that affects lipid partitioning, malonyl-CoA decarboxylase (MCD) (An et al., 2004). Metabolic profiling of 37 acylcarnitine species by tandem mass spectrometry (MS/MS) revealed a decrease in the concentration of a lipid-derived metabolite, -OH-butyrylcarnitine (HB) in muscle of MCD-overexpressing animals, that was likely due to a change in intramuscular ketone metabolism. These findings spurred further investigation that led to a new model of diet-induced insulin resistance in muscle, involving accumulation of mitochondrially derived byproducts of lipid oxidation rather than diversion of lipid metabolites into cytosolic biosynthetic pathways (Muoio and Newgard, 2008). Supporting this idea are studies showing that chronic exposure
Examples of Targeted MS-Based Metabolic Profling for Understanding of Disease Mechanisms
of muscle to elevated lipids results in an increase rather than a decrease in expression of genes in the fatty acid -oxidative pathway (Koves et al., 2005). Importantly, this lipid-induced upregulation of the enzymatic machinery for -oxidation of fatty acids in muscle is not coordinated with upregulation of downstream metabolic pathways such as the TCA cycle and electron transport chain. This results in incomplete metabolism of fatty acids in the -oxidation pathway, accumulation of lipid-derived metabolites in the mitochondria, and a decrease in the levels of TCA cycle intermediates, as revealed by quantitative MS/MS and GC/MS analysis (Koves et al., 2005; Koves et al., 2008) (Figure 15.3). These abnormalities are reversed by exercise intervention in mice fed a HF diet, in association with increased TCA cycle activity and restoration of insulin sensitivity and glucose tolerance (Koves et al., 2005). Also supporting the model, knockout mice lacking MCD, which promotes fatty acid -oxidation, had markedly lower acylcarnitine levels in muscle and were protected against diet-induced insulin resistance, despite high levels of long-chain acyl CoAs (Koves et al., 2008). Conversely, transgenic mice with muscle-specific overexpression of PPAR, a nuclear receptor that activates -oxidative genes, developed both local and systemic glucose intolerance (Finck et al., 2005, Figure 15.3). The point to emphasize in the context of the current chapter is that the initial insights that led to this new model of muscle insulin resistance came mainly from metabolic profiling of liver and muscle tissue samples. These initial findings were then used to generate hypotheses that were tested by genetic engineering, nutritional modulation, and physical activity studies, illustrating that in addition to its traditional role as a disease screening tool, metabolic profiling can be used as a discovery tool for metabolic control mechanisms and disease pathogenesis. Application of MS-Based Metabolic Profiling for Understanding of Human Disease Pathogenesis Targeted MS-based metabolic profiling has also been increasingly applied to studies of human diseases and conditions. Specific examples of applications to the areas of obesity and cardiovascular disease (CVD) are discussed here, but these tools are also being applied to diverse areas such as cancer (Fan et al., 2006) and mental disorders (Kaddurah-Daouk et al., 2007). It has been appreciated for many years that obesity is associated with impaired metabolic regulation, including development of insulin resistance in muscle, liver, and adipose tissue, as well as loss of mass and function of pancreatic islet -cells (Muoio and Newgard, 2008). However, the specific mechanisms that link overnutrition and obesity to global metabolic dysfunction remain incompletely defined. We recently completed a metabolic analysis of 74 obese (median BMI of 36.6 kg/m2) and 67 lean (median BMI of 23.2 kg/m2) human subjects using the MS-based metabolic, hormonal and physiologic profiling tools described earlier (Haqq et al., 2005; Newgard et al., 2008). PCA was used to detect a metabolic signature comprised of changes
■
187
in amino acid, acylcarnitine, and organic acid metabolites that suggests that pathways of BCAA catabolism are overloaded in obese humans. To test the significance of this for development of obesity-related disease, we fed normal rats on HF, high-fat with supplemented BCAA (HF/BCAA) or standard chow (SC) diets. Despite a low rate of body weight gain equivalent to the SC group, HF/BCAA rats were equally insulin resistant as HF rats, demonstrating a contribution of BCAA to development of insulin resistance independent of body weight. However, the effects of BCAA on body weight and insulin sensitivity seem to require the presence of fat, since animals fed on SC diet with supplemented BCAA gain weight normally and have normal insulin sensitivity. These findings demonstrate a synergistic interaction of dietary protein and fat in the development of insulin resistance, and provide a demonstration of how broad-based metabolic profiling tools can identify unanticipated contributions of dietary and metabolic factors to disease development. MS-based metabolic profiling has also been applied to cardiovascular disease. In one study, liquid chromatography/triple quadrupole high-sensitivity MS analysis was applied to plasma samples of 18 subjects with exercise-inducible ischemia compared to 18 normal controls (Sabatine et al., 2005). Blood samples were taken immediately before and immediately after exercise, and levels of 173 known and several more minor intermediary metabolites were measured. A number of metabolites were found to be discordant between the two groups, including lactate, byproducts of AMP metabolism, metabolites of the citric acid cycle, and several of the unknowns. Using the six most discordantly regulated metabolites, a metabolic ischemia risk score was created and found to be statistically correlated to the probability of ischemia. However, it should be noted that the number of subjects was quite small in this study, necessitating follow up studies that either confirm or refute these interesting initial findings. In another recent study, MS/MS and GC/MS-based profiling of fatty acids, acylcarnitines, and amino acids was performed on plasma samples from 117 subjects from eight multiplex families with familial early-onset CVD (Shah et al., 2008). Using SOLAR and adjusting for variables such as diabetes, hypertension, dyslipidemia, body-mass-index (BMI), age and sex, high heritabilities were found for several amino acids (arginine, glutamate, alanine, ornithine, valine, leucine/isoleucine, h2 0.50–0.85, P 0.00004 7.2 1015), free fatty acids (arachidonic, linoleic, h2 0.57–0.59, P 0.0004 0.00005), and acylcarnitines (h2 0.54–0.86, P 0.002 0.0000002). Moreover, PCA was used to identify several metabolite clusters, with several of these factors found to be highly heritable. Interestingly, GENECARD families showed two distinct metabolite profiles, tracking with their clinical characteristics and reflective of factors identified from PCA, suggesting different genetic backgrounds and consequent variation in control of key metabolic pathways that converge on CVD. However, once again, the utility of these profiles for predicting CVD in other high-risk families or in the general population will require further study.
188
CHAPTER 15
■
Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms
(a)
C20, C18:2, C18:1, C16 Fatty Acids
37 acylcarnitine metabolites
Intermediates (C3-C14) CATs CATs C2 (acetylcarnitine)
βOX
CATs
TCA
CoA
AJD11-3 dm031303a24 1 (1.694) Sm (Mn, 20.75) 100
IS
6
218
417
C2
Mitochondria
Acylcarnitines %
C4OH
305
C18:1
262 249
440 418
235
276
414
290
221 246 260 274
Tissue or serum extracts
438 441 456 462
306
430
0 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 m/z
Muscle acylcarnitines (nmol/mg protein)
(b)
75
3.0
2.5 MCD (Active) 2.0
MCDmut (Inactive)
1.5
* 1.0
0.5
C2 C3 C4 C5:1 C5 *C4-OH C6 C5-OH C4-DC C8:1 C8 C5-DC C6-DC C10:3 C10:2 C10:1 C10 C8-DC C12 C12-OH C10-DC C14:2 C14:1 C14 C14-OH C12-DC C16 C16-OH C18:2 C18:1 C18 C18:1-OH C18-OH C20
0
Figure 15.3 Use of tandem mass spectrometry to profile acylcarnitines in an animal model of insulin resistance. (a). Acylcarnitines provide a profile of products of mitochondrial fatty acid oxidation. As long-chain fatty acyl CoAs enter the mitochondria through carnitine palmitoyltransferase I (CPT1), they are oxidized by successive removal of two carbon acetyl CoA units to generate CoA esters ranging from 2 to 20 carbons in length and with varying degrees of sauration. These CoA intermediates are converted to carnitine esters, which are readily extracted from tissues or measured in plasma. The extracts are processed and analyzed by ESI-MS/MS to produce spectra such as that shown at the right of the panel. (b). Acylcarnitines in muscle samples taken from high-fat fed, insulinresistant rats (black bars) or high-fat fed rats rendered insulin sensitive by hepatic expression of malonyl CoA decarboxylase (MCD) (white bars). Note the dramatic decrease in concentration of one particular acylcarnitine, -hydroxybutyryl carnitine (labeled C4-OH). Data are from An et al., 2004.
References
INTEGRATION OF METABOLIC PROFILING WITH OTHER “OMICS” TECHNOLOGIES Genetic linkage and association studies have the power to establish a causal link between gene loci and physiological traits (see Chapter 8). These studies can make novel connections between biological processes that would not otherwise be predictable based on current knowledge. Numerous quantitative trait loci (QTL) influencing disease-related phenotypes have been identified through gene mapping and positional cloning. While it has become relatively straightforward to map a phenotype to a broad-genomic region, identification of the individual gene(s) and variants responsible for the phenotype remains difficult. Moreover, traditional QTL mapping is based on association with physiological phenotype, but often does not reveal the molecular pathways leading to that phenotype. One way to overcome these limitations is to broadly expand the types of phenotypes analyzed in genetic screens. For example, with microarray technology, one can measure the abundance of virtually all mRNAs in a segregating sample. Importantly, mRNA abundance shows sufficient heritability in outbred populations or experimental crosses to allow mapping of gene loci that control gene expression, termed expression QTL (eQTL) (Schadt et al., 2005). When eQTL co-localize with a physiological QTL, one can hypothesize a shared regulator and offer a potential pathway leading to the physiological trait. The pathway between a QTL and a physiological trait often involves changes in the steady-state levels of metabolic intermediates, in addition to changes in mRNA abundance. Recent studies in plants (Wentzell et al., 2007) and in animal models of disease (Ferrara et al., 2008) have demonstrated that metabolite profiles show sufficient heritability to allow mapping of metabolic quantitative trait loci (mQTL), thereby facilitating integration of metabolic profiles with genomic and transcriptomic analyses. For example, one recent study mapped expression QTLs controlling glucosinolate content in Arabidopsis Bay–Sha recombinant lines, and found a strong correlation between loci controlling enzymes involved in glucosinolate synthesis and the levels of metabolites within this particular pathway (Wentzell et al., 2007). Similarly, analysis of mRNA expression and metabolic profiles in liver samples obtained in an F2 intercross between the diabetes-resistant C57BL/6 leptinob/ob and the diabetessusceptible BTBR leptinob/ob mouse strains was used to show that
■
189
liver metabolites map to distinct genetic regions, thereby demonstrating that tissue metabolite profiles are heritable (Ferrara et al., 2008). This study also integrated liver microarray and metabolic profiling data to construct putative regulatory networks for control of specific metabolic processes in liver. As a proof of principle of the practical significance of this multi-disciplinary approach, a regulatory network that links gene expression and metabolic changes in the context of glutamate metabolism was identified, and the validity of the approach was validated by experiments showing that genes in the network respond to changes in glutamine and glutamate availability. Although further development of these methods will be required, they already appear to have the potential to unearth regulatory networks that are altered to contribute to chronic, complex, and highly prevalent diseases and conditions such as obesity, diabetes, and cardiovascular disease.
FUTURE DIRECTIONS In the post-sequence era, biologists and translational investigators alike have gained a new appreciation for metabolic analysis as a critical tool for assessing the physiologic and pathophysiologic impact of genetic variation. The current growth in instrument and methods development in the field of metabolomics is built on the foundation of decades of analytical biochemistry. The major difference between then and now is that the current emphasis is on methods that allow simultaneous measurement of multiple analytes in a biological sample, whereas earlier work was often focused on one or a small number of metabolites per assay. Despite these advances, no single profiling method currently allows simultaneous analysis of all of the metabolites in the metabolome. Ultimate achievement of this goal will require continued intensive development of deeper libraries of chemical standards, instrument platforms with broad-sensitivity range and high-mass accuracy, and likely, integration of MS and NMR methods to gain full analyte coverage. These advances must be coupled to continued development of computational methods for analysis of complex datasets and their integration with equally complex genomic, transcriptomic, and proteomic profiles. The examples provided in this chapter of new scientific insights gained by application of current tools suggest a broad horizon and provide strong encouragement for further technology development in this area.
REFERENCES Allen, J., Davey, H.M., Broadhurst, D., Rowland, J.J., Oliver, S.G. and Kell, D.B. (Jun 2003). High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nat Biotechnol 21(6), 692–696.
An, J., Muoio, D.M., Shiota, M., Fujimoto, Y., Cline, G.W., Shulman, G.I., Stevens, R., Millington, D. and Newgard, C.B. (2004). Hepatic expression of malonyl CoA decarboxylase reverses muscle, liver, and whole-animal insulin resistance. Nat Med 10, 268–274.
190
CHAPTER 15
■
Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms
Backhed, F., Ding, H., Wang, T., Hooper, L.V., Koh, G.Y., Nagy, A., Semenkovich, C.F. and Gordon, G.I. (2004). The gut microbiota as an environmental factor that regulates fat storage. Proc Natl Acad Sci USA 101, 15718–15723. Backhed, F., Manchester, J.K., Semenkovich, C.F. and Gordon, J.I. (2007). Mechanisms underlying the resistance to diet-induced obesity in germ-free mice. Proc Natl Acad Sci USA 104, 979–984. Bajad, S.U., Lu, W., Kimball, E.H., Yuan, J., Peterson, C. and Rabinowitz, J.D. (2006). Separation and quantitation of water soluble cellular metabolites by hydrophilic interaction chromatography-tandem mass spectrometry. J Chromatogr 1125, 76–88. Boucher, A., Lu, D., Burgess, S., Telemaque-Potts, S., Jensen, M., Mulder, H., Wang, M-Y., Unger, R.H., Sherry, A.D. and Newgard, C.B. (2004). Mechanism of lipid-induced impairment of glucose-stimulated insulin secretion and its reversal by an analogue of malate. J Biol Chem 279, 27263–27271. Brindle, J.T., Antti, H., Holmes, E., Tranter, G., Nicholson, J.K., Bethell, H.W.L., Clarke, S., Schofield, P.M., McKilligin, E., Mosedale, D.E. et al. (2002). Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1HNMR-based metabonomics. Nat Med 8, 1439–1444. Buchman, A.L., Dubin, M.D., Moukarzel, A.A., Jenden, D.J., Roch, M., Rice, K.M., Gornbein, J. and Ament, M.E. (1995). Hepatology 22, 1399–1403. Cooks, R.G., Ouyang, Z.,Takats, Z. and Wiseman, J.M. (2006). Detection Technologies. Ambient mass spectrometry. Science 311, 1566–1570. Denkert, C., Budczies, J., Kind, T., Weichert, W., Tablack, P., Schouli, J., Niesporek, S., K¯onsgen, D., Dietel, M. and Fiehn, O. (Nov 15 2006). Mass spectrometry-based metabolic profiling reveals different metabolite patterns in invasive ovarian carcinomas and ovarian borderline tumors. Cancer Res 66(22), 10795–10804. Dettmer, K.,Aronov, P.A. and Hammock, B.D. (2007). Mass spectrometrybased metabolomics. Mass Spectrom Rev 26, 51–78. Dumas, M.-E., Barton, R.H., Toye, A., Cloarec, O., Blancher, C., Rothwell, A., Fearnside, J., Tatoud, R., Blanc, V., Lindon, J.C. et al. (2006). Metabolic profiling reveals a contribution of gut microbiota to fatty liver phenotype in insulin-resistant mice. Proc Natl Acad Sci USA 103, 12511–12516. Fan, T.W., Higashi, R.M. and Lane, A.N. (2006). Integrating metabolomics and transcriptomics for probing SE anticancer mechanisms. Drug Metab Rev 38, 707–732. Ferrara, C.T., Wang, P., Neto, E.C., Stevens, R.D., Bain, J.R., Wenner, B.R., Ilkayeva, O.R., Keller, M.P., Blasiole, D.A., Kendziorski, C.M., Yandell, B.S., Newgard, C.B. and Attie, A.D. (Mar 2008). Genetic networks of liver metabolism revealed by integration of metabolic and transcriptomic profiling. PLoS Genet 4(3), e1000034. Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R.N. and Willmitzer, L. (2000). Metabolite profiling for plant functional genomics. Nat Biotechnol 18, 1157–1161. Finck, B.N. et al. (2005). A potential link between muscle peroxisome proliferator-activated receptor-a signaling and obesity-related diabetes. Cell Metabol 1, 133–144. Fraser, P.D., Enfissi, E.M., Goodfellow, M., Eguchi, T. and Bramley, P.M. (2007). Metabolite profiling of plant carotenoids using the matrixassisted laser desorption ionization time-of-flight mass spectrometry. Plant J 49, 552–564. Frazier, D.M., Millington, D.S., McCandless, S.E., Koeberl, D.D., Weavil, S.D., Chaing, S.H. and Muenzer, J. (2006). The tandem mass
spectrometry newborn screening experience in North Carolina: 1997–2005. J Inherit Metab Dis 29, 76–85. German, J.B., Hammock, B.D. and Watkins, S.M. (2005). Metabolomics: Building on a century of biochemistry to guide human health. Metabolomics 1, 3–9. Gullberg, J., Jonsson, P., Nordstrom, A., Sjostrom, M. and Moritz, T. (2004). Design of experiments: An efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal Biochem 331, 283–295. Han, X. and Gross, R.W. (2003). Global analyses of cellular lipidomes directly from crude extracts of biological samples by ESI mass spectrometry: A bridge to lipidomics. J Lipid Res 44, 1071–1079. Han, X. and Gross, R.W. (2005). Shotgun lipidomics: Electrospray ionization mass spectrometric analysis and quantitation of cellular lipodomes directly from crude extracts of biological samples. Mass Spectrom Rev 24, 367–412. Haqq, A., Lien, L.F., Boan, J., Arlotto, M., Slentz, C.A., Muehlbauer, M.J., Rochon, J., Gallup, D., McMahon, R.L., Bain, J.R. et al. (2005). The Study of the Effects of Diet on Metabolism and Nutrition (STEDMAN) Project: Rationale and Design. Contem Clin Trials 26, 616–625. Henquin, J.C., Ravier, M.A., Nenquin, M., Jonas, J.C. and Gilon, P. (2003). Hierarchy of the beta-cell signals controlling insulin secretion. Eur J Clin Invest 33, 742–750. Hohmeier, H.E., Mulder, H., Chen, G., Henkel-Rieger, R., Prentki, M. and Newgard, C.B. (2000). Isolation of INS-1-derived cell lines with robust ATP-sensitive K+ channel-dependent and -independent glucose-stimulated insulin secretion. Diabetes 49, 424–430. Hollywood, K., Brison, D.R. and Goodacre, R. (2006). Metabolomics: Current technologies and future trends. Proteomics 6, 4716–4723. Jensen, M.V., Joseph, J., Ilkayeva, O., Burgess, S., Lu, D., Ronnebaum, S., Odegaard, M., Becker, T.C., Sherry, A.D. and Newgard, C.B. (2006). Compensatory responses to pyruvate carboxylase suppression in islet beta-cells: Preservation of glucose-stimulated insulin secretion. J Biol Chem 281, 22342–22351. Joseph, J.W., Jensen, M.V., Ilkayeva, O., Palmieri, F., Alarcon, C., Rhodes, C.J. and Newgard, C.B. (2006). The mitochondrial citrate carrier plays a regulatory role in glucose-stimulated insulin secretion. J Biol Chem 281, 35624–35632. Joseph, J.W., Odegaard, M.L., Ronnebaum, S.M., Burgess, S.C., Muehlbauer, J., Sherry, A.D. and Newgard, C.B. (2007). Normal flux through ATP-citrate lyase or fatty acid synthase is not required for glucose-stimulated insulin secretion. J Biol Chem. 282, 31592–31600. Kaddurah-Daouk, R., McEvoy, J., Baillie, R.A., Lee, D., Yao, J.K., Doraiswamy, P.M. and Krishnan, K.R. (2007). Metabolomic mapping of atypical antipsychotic effects in schizophrenia. Mol Psychiatry 12, 934–945. Kirschenlohr, H.L., Griffin, J.L., Clarke, S.C., Rhydwen, R., Grace, A.A., Schofield, P.M., Brindle, K.M. and Metcalfe, J.C. (2006). Proton NMR analysis of plasma is a weak predictor of coronary artery disease. Nat Med 12, 705–710. Koves, T.R., Li, P., An, J., Slentz, D., Ilkayeva, O., Akimoto, T., Dohm, G.L., Yan, Z., Newgard, C.B. and Muoio, D.M. (2005). PGC1-mediated metabolic remodeling of skeletal muscle mimics exercise training and reverses lipid-induced mitochondrial inefficiency. J Biol Chem 280, 33588–33598.
References
Koves, T.R., Ussher, J.R., Noland, R.C., Slentz, D.S., Mosedale, M., Ikayeva, O.J., Bain, J., Stevens, R., Dyck, J.R., Newgard, C.B., Lopaschuk, G. and Muoio, D.M. (2008). Mitochondrial overload and incomplete fatty acid oxidation contribute to skeletal muscle insulin resistance. Cell Metab 7, 45–56. Lafaye, A., Junot, C., Ramounet-Le Gall, B., Fritsch, P., Tabet, J.C. and Ezan, E. (2003). Metabolite profiling in rat urine by liquid chromatography/electrospray ion trap mass spectrometry. Application to the study of heavy metal toxicity. Rapid Commun Mass Spectrom 17, 2541–2549. Lafaye, A., Junot, C., Ramounet-Le Gall, B., Fritsch, P., Ezan, E. and Tabet, J.C. (2004). Profiling of sulfoconjugates in urine by using precursor ion and neutral loss scans in tandem mass spectrometry. Application to the investigation of heavy metal toxicity in rats. J Mass Spectrom 39, 655–664. Lenz, E.M., Williams, R.E., Sidaway, J. et al. (2007). The application of microbore UPLC/oa-TOF-MS and 1H NMR spectroscopy to the metabonomic analysis of rat urine following the intravenous administration of pravastatin. J Pharm Biomed Anal 44, 845–852. Lindon, J.C., Holmes, E. and Nicholson, J.K. (2007). Metabonomics in pharmaceutical R & D. FEBS J 274, 1140–1151. Lu, D., Mulder, H., Zhao, P., Burgess, S.C., Jensen, M.V., Kamzolova, S., Newgard, C.B. and Sherry, A.D. (2002). 13C NMR isotopomer analysis reveals a connection between pyruvate cycling and glucose-stimulated insulin secretion (GSIS). Proc Natl Acad Sci USA 99, 2708–2713. McNally, D.J., Hui, J.P., Aubry, A.J. et al. (2006). Functional characterization of the flagellar glycosylation locus in Campylobacter jejuni 81–176 using a focused metabolomics approach. J Biol Chem 281, 18489–18498. McNally, D.J., Aubry, A.J., Hui, J.P. et al. (2007). Targeted metabolomics analysis of Campylobacter coli VC167 reveals legionaminic acid derivatives as novel flagellar glycans. J Biol Chem 282, 14463–14475. Merrill, A.H., Jr., Sullards, M.C., Allegood, J.C., Kelly, S. and Wang, E. (2005). Sphingolipidomics: high-throughput, structure-specific, and quantitative analysis of sphingolipids by liquid chromatography tandem mass spectrometry. Methods 36, 207–224. Muoio, D.M. and Newgard, C.B. (2008). Mechanisms of disease: Molecular and metabolic mechanisms of insulin resistance and cell failure in type 2 diabetes. Nat Rev Mollar Cell Biol 9, 193–205. Murphy, R.C., Barkley, R.M., Zemski, Berry, K., Hankin, J., Harrison, K., Johnson, C., Krank, J., McAnoy, A., Uhlson, C. and Zarini, S. (2005). Electrospray ionization and tandom mass spectrometry of eicosanoids. Anal Biochem 346, 1–42. Nenquin, M., Szollosi, A., Aguilar-Bryan, L., Bryan, J. and Henquin, J.C. (2004). Both triggering and amplifying pathways contribute to fuel-induced insulin secretion in the absence of sulfonylurea receptor-1 in pancreatic beta-cells. J Biol Chem 279, 32316–32324. Newgard, C.B., Bain, J.R., Muehlbauer, M.J., McMahon, R.L., Rochon, J., Gallup, D., Stevens, R., Ilkayeva, O., Millington, D., Butler, M.D. et al. (2007). Comprehensive metabolic profiling of obese versus lean humans reveals a signature of overloaded branch chain amino acid catabolism. Submitted for publication. Oikawa, A., Nakamura, Y., Ogura, T. et al. (2006). Clarification of pathway-specific inhibition by Fourier transform ion cyclotron resonance/mass spectrometry-based metabolic phenotyping studies. Plant Physiol 142, 398–413.
■
191
Pan, Z., Gu, H., Talaty, N. et al. (2007). Principal component analysis of urine metabolites detected by NMR and DESI-MS in patients with inborn errors of metabolism. Anal Bioanal Chem 387, 539–549. Plumb, R.S., Granger, J.H., Stumpf, C.L. et al. (2005). A rapid screening approach to metabonomics using UPLC and oa-TOF mass spectrometry: Application to age, gender and diurnal variation in normal/Zucker obese rats and black, white and nude mice. Analyst 130, 844–849. Pope, G.A., MacKenzie, D.A., Defernez, M. et al. (2007). Metabolic footprinting as a tool for discriminating between brewing yeasts. Yeast 24, 667–679. Rainville, P.D., Stumpf, C.L., Shockcor, J.P., Plumb, R.S. and Nicholson, J.K. (2007). Novel application of reversed-phase UPLC-oaTOF-MS for lipid analysis in complex biological mixtures: A new tool for lipidomics. J Proteome Res 6, 552–558. Ronnebaum, S., Joseph, J.W., Burgess, S.C., Lu, D., Ilkayeva, O., Stevens, R., Becker, T.C., Muehlbauer, J., Sherry, A.D., Newgard, C.B. et al. (2006). A pyruvate cycling pathway involving cytosolic NAPD-dependent isocitrate dehydrogenase regulates glucosestimulated insulin secretion. J Biol Chem 281, 31041–31049. Roessner, U., Luedemann, A., Brust, D. et al. (2001). Metabolic profiling allows comprehensive phenotyping of genetically or environmentally modified plant systems. Plant Cell 13, 11–29. Sabatine, M.S., Liu, E., Morrow, D.A., Heller, E., McCarroll, R., Wiegand, R., Berriz, G.F., Roth, F.P. and Gerszten, R.E. (2005). Metabolomic identification of novel biomarkers of myocardial ischemia. Circulation 112, 3868–3875. Sato, S., Soga, T., Nishioka, T. and Tomita, M. (Oct 2004). Simultaneous determination of the main metabolites in rice leaves using capillary electrophoresis mass spectrometry and capillary electrophoresis diode array detection. Plant J 40(1), 151–163. Schadt, E.E. et al. (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37, 710–717. Shah, S.H., Hauser, E.R., Bain, J.R., Muehlbauer, M.J., Stevens, R.D., Wenner, B.R., Sketch, M., Dowdy, E., Granger, C.B., Ginsburg, G. S. et al. (2007). High heritability of metabolomic profiles in families burdened with early-onset cardiovascular disease. Submitted for publication. Shekhawat, P.S., Matern, D. and Strauss, A.W. (2005). Fetal fatty acid oxidation disorders, their effect on maternal health and neonatal outcome: Impact of expanded newborn screening on their diagnosis and management. Ped Res 57, 78R–86R. Scholz, M., Gatzek, S., Sterling, A., Fiehn, O. and Selbig, J. (2004). Metabolite fingerprinting: detecting biological features by independent component analysis. Bioinformatics 20, 2447–2454. Sherry, A.D., Jeffrey, F.M. and Malloy, C.R. (2004). Analytical solutions for (13)C isotopomer analysis of complex metabolic conditions: Substrate oxidation, multiple pyruvate cycles, and gluconeogenesis. Metab Eng 6, 12–24. Soga, T., Baran, R., Suematsu, M. et al. (2006). Differential metabolomics reveals ophthalmic acid as an oxidative stress biomarker indicating hepatic glutathione consumption. J Biol Chem 281, 16768–16776. Sun, G., Yang, K., Zhao, Z., Guan, S., Han, X. and Gross, R.W. (2007). Shotgun metabolomics approach for the analysis of negatively charged water-soluble cellular metabolites from mouse heart tissue. Anal Chem 79, 6629–6640.
192
CHAPTER 15
■
Comprehensive Metabolic Analysis for Understanding of Disease Mechanisms
Turnbaugh, P.J., Ley, R.E., Mahowald, M.A., Magrini, V., Mardis, E.R. and Gordon, J.I. (2006). An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444, 1027–1031. Vaidyanathan, S., Kell, D.B. and Goodacre, R. (2002). Flow-injection electrospray ionization mass spectrometry of crude cell extracts for high-throughput bacterial identification. J Am Soc Mass Spectrom 13, 118–128. Vaidyanathan, S., Gaskell, S. and Goodacre, R. (2006). Matrixsuppressed laser desorption/ionisation mass spectrometry and its suitability for metabolome analyses. Rapid Commun Mass Spectrom 20, 1192–1198. Watson, A.D. (2006). Lipidomics: A global approach to lipid analysis in biological systems. J. Lipid Res 47, 2101–2111. Wentzell, A.M., Rowe, H.C., Hansen, B.G., Ticconi, C., Halkier, B. A. and Kliebenstein, D.J. (2007). Linking metabolic QTLs with
network and cis-eQTLs controlling biosynthetic pathways. PLoS Genet 3, 1687–1701. Williams, B.J., Cameron, C.J., Workman, R., Broeckling, C.D., Sumner, L.W. and Smith, J.T. (2007). Amino acid profiling in plant cell cultures: An inter-laboratory comparison of CE-MS and GC-MS. Electrophoresis 28, 1371–1379. Yu, H., Lopez, E., Young, S.W., Luo, J., Tian, H. and Cao, P. (2000). Quantitative analysis of free fatty acids in rat plasma using matrixassisted laser desorption/ionization time-of-flight mass spectrometry with meso-tetrakis porphyrin as matrix. Anal Biochem 354, 182–191. Zeisel, S.H., Wishnok, J.S. and Blusztajn, J.K. (1983). J Pharmacol Exp Ther 225, 320–324.
CHAPTER
16 Comprehensive Analysis of Gene Function: RNA interference and Chemical Genomics Bjorn T. Gjertsen and James B. Lorens
INTRODUCTION Geneticists have for decades mutagenized cells and model organisms to induce phenotypic changes essential for our understanding of gene function (Bier 2005; Peters et al., 2003). Mutations engendering these cell behavioral changes have been mapped and the cognate gene products identified. Pharmaceutical scientists have for decades conducted diversity screens to identify chemical entities that affect biological function (Landro et al., 2000). Large libraries of small molecule compounds that cover a diverse chemical space have been interrogated in high-throughput screening schemes. Compounds that inhibit critical protein functions have been revealed. In recent years molecular biologists have combined these principles, developing different types of dominant genetic effectors (“perturbagens”) that affect gene function at multiple levels (Bredel and Jacoby, 2004; Kawasumi and Nghiem, 2007; Knight and Shokat, 2007; Rognan, 2007). This conflation of high-throughput chemical diversity screening and trans-dominant genetics is particularly evident in the recent advent of RNA interference (RNAi) and chemical genomic approaches with promising clinical translation. For example, the Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
identification of tumor-specific gene functional-dependence for oncogenic growth led to the development of imatinib mesylate, a small molecule inhibitor of the Bcr-Abl fusion protein in chronic myelogeous leukemia (Hunter, 2007; Sherbenou and Druker, 2007). This type of synthetic lethality via small molecule or modulation of gene function through RNAi is a promising strategy for future therapeutic development (Kaelin, 2005;Westbrook et al., 2005). Gene function analysis needs to be considered in conjunction with currently available technology. Recent technological development allows us to elucidate gene function more extensively, particularly in mammalian systems.Two recent approaches – RNAi and chemical genomics – provide new opportunities to elucidate the role of specific genes in complex biological systems. RNAi has rapidly evolved from a curious genetic phenomenon to the most widely used gene function analysis tool in less than a decade (Fire et al., 1998; Mello and Conte, 2004). Chemical genomics has established the use of small molecule libraries and high-throughput screening, a domain previously unique to the pharmaceutical industry, to delineate gene functions in diverse biological systems (Knight and Shokat, 2007). Copyright © 2009, Elsevier Inc. All rights reserved. 193
194
CHAPTER 16
TABLE 16.1
■
Comprehensive Analysis of Gene Function
Genetic effectors*
Effector type
Delivery
Library
Effects
Protein Full-length Truncated
Vector
cDNA fragmented cDNA
Gain Gain/Loss
Fusion
cDNA
Gain/Loss
Protein domains
cDNA
Gain/Loss
Random oligo
Loss (Gain)
Random oligo
Loss (Gain)
Oligo/split intein
Loss (Gain)
cDNA
Loss
Ribozyme
random oligo
Loss
shRNA
fragmented cDNA/oligo
Loss Loss
siRNA
oligo
Peptide Free
Vector, internalization motif
Scaffolded (GFP, protease, scFv, singlechain antibody) Cyclic peptide RNA Antisense
Vector, transfection
* Dominant genetic effectors or “perturbagens” are trans-acting molecules that alter gene function by interfering with a specific gene product the different stages of gene function (e.g., gene expression, protein-product function). Genetic effectors can be derived from protein, peptide or RNA and exert either gain- or loss-of-function effects on gene function. Libraries of genetic effectors are generally based on expressed (mRNA/cDNA), specific or random oligonucleotide sequences. These can be delivered directly to cells by transfection or expressed by gene vectors.
GENE FUNCTION ANALYSIS: AN OVERVIEW The assignment of function to a gene product entails an interruption of the genotype to phenotype flow. This can be accomplished at every stage: DNA mutation, mRNA destruction or inhibition of protein function. Classical genetic approaches involve cis-changes in DNA sequence or structure (insertion, deletions) that are generally technically challenging in mammalian cells (e.g., specific gene knockout). Trans-dominant genetics offers an alternative approach (Gudkov et al., 1994; Xu et al., 2001). Dominant genetic effectors or “perturbagens” are trans-acting molecules that alter gene function by interfering with a specific gene product at different levels (gene expression, protein-product function). Perturbagens exhibit different modes of action and include dominant negative proteins, siRNA, antisense RNA, ribozymes, inhibitory linear or cyclic
peptides, single chain antibodies or chemical inhibitors (Table 16.1) (Lorens et al., 2001). These dominant inhibitory molecules can be taken up by target cells or expressed by various vectors. Each effector type has distinct advantages and optimal uses. Hence, the assignment of gene function invariably enlists a combination of approaches. Recent efforts in gene functional analysis have focused on chemical genetic approaches based on the availability of many new classes of small chemical inhibitors of protein function, and RNAi. RNAi is a sequence-specific post-transcriptional gene-silencing mechanism used by endogenous small regulatory RNAs. Experimentally induced gene silencing has been used extensively for gene functionalization experiments.
RNA INTERFERENCE RNAi is initiated by the presence of double-stranded RNA (dsRNA) in the cytosol (Figure 16.1). The dsRNA is recognized by the RNAse III endonuclease Dicer complex and cleaved into characteristic short, interfering RNAs (siRNA), each 21–23 nucleotides with a core 19 base-paired region, 2 nucleotide 3OH-overhangs and 5-P-ends. Subsequent gene-silencing effects are triggered by siRNA. One of the siRNA strands (guide strand), identified by a less stable 5-duplex region, is preferentially assimilated by an Argonaute (Ago) family protein into the RNA Interference Silencing Complex (RISC). The other (passenger strand) is rejected and destroyed. The guide strand endows the RISC complex with sequence specificity, by hybridizing to cognate mRNA. The RISC:siRNA:mRNA complex localizes to processing bodies (P-bodies), cytoplasmic sites of cellular RNA degradation and storage. The RISC interaction with mRNA via guide strand hybridization leads to one of two outcomes, dictated by the stability of the hybridization and the presence of specific proteins in the RISC complex. Degradation of the mRNA requires a perfect match between the short RNA and the mRNA. Perfect base pairing promotes endonucleolytic cleavage in middle of guide:mRNA duplex (10 nucleotides from the guide strand 5end) by Ago2/RISC complexes. The products of RISC cleavage are degraded by bulk RNA degradation enzymes. In contrast, imperfect hybridization, characteristic of microRNA, promotes translational repression by RISC complexes. The suppression of translation is tolerant of significant mismatches outside of a short “seed” region (6 bp) at the 5 end of the guide RNA. The relative rates of cleavage versus translation inhibition may promote this difference, with more stable complexes kinetically favorable for mRNA degradation. Design of siRNA The design of effective siRNA is based on empirical criteria determined by statistical surveys of siRNA collections and improved understanding of the biochemistry of the RNAi process (Elbashir et al., 2001; Hsieh et al., 2004). A key consideration is that the antisense sequence be utilized as the guide strand. Sequence motifs in the siRNA, mirrors of the cognate mRNA
RNA interference
Dicer
■
195
siRNA
PAZ
Exportin shRNA
PAZ
m7 Target mRNA
Assembly of RISC
Argonaute
Guide strand AAAA
PIWI
Cleavage of passenger strand
Cleavage of mRNA m7
mRNA degradation AAAA
Figure 16.1 The basic approach to RNA interference. RNAi pathways are guided by small RNAs that include siRNA and miRNAs. This figure indicates how retroviral RNAi works through formation of a shRNA exported from the nucleus to the Dicer complex in the cytoplasm. The double-stranded hairpin RNA is cleaved by the Dicer enzyme complex into siRNA. These siRNAs are incorporated into Ago and the RISC. After cleavage of the RNA passenger strand the siRNA guide strand recognizes target sites to direct mRNA cleavage, carried out by the catalytic domain of Ago. RNAi therapeutics developed to harness the siRNA pathway typically involves the delivery of synthetic siRNA into the cell cytoplasm. The miRNA pathway begins with endogenously encoded primary microRNA transcripts (pri-miRNAs), or retrovirally transduced that encode for similar transcripts as indicted in this figure. For more details see de Fougerolles et al., 2007.
secondary structure, influence overall silencing efficiency and may reflect accessibility constraints. Important determinant siRNA features include overall hybridization energy and H-bonding patterns, asymmetric stability at the 5 ends (e.g., AU rich 5 end of guide strand), low duplex stability adjacent to the cleavage site (positions 9–14) and preferred nucleotides at specific positions (e.g., U at position 1 and 17; A at the cleavage site, position 10). RNAi specificity is achieved by limiting siRNA complementarity to non-target transcripts. The rule of thumb is that a given siRNA sequence should have at least three mismatches to non-target transcripts to abrogate silencing, with the central duplex region surrounding the cleavage site (position 9–12) (Birmingham et al., 2006) being particularly sensitive to mismatches. However, recent studies demonstrate that much of the off-target effects are engendered by the seed region, a hexamer stretch at position 2–8 of the guide strand that mediates translational repression by microRNA (miRNA) in 3UTR of mRNA. The presence of more than one seed region sequence in an mRNA 3UTR significantly increases the probability of gene silencing. An in silico analysis of published siRNA sequences concluded that the majority has a significant risk of non-specific gene knock-down (Snove and Holen, 2004). Chemical modification of seed region bases in (e.g., 2-O-methyl
ribosyl substitution at position 2) synthetic siRNA oligonucleotides can reduce this miRNA-like silencing (Jackson et al., 2006). Hence, as off-target silencing is largely sequence specific, RNAi-based results must be validated by at least two independent siRNA sequences derived from the target transcript, phenotypic “rescue” by co-expression of a siRNA-resistant cDNA version of the target gene or by corroborating evidence from other techniques (Cullen, 2006; Echeverri et al., 2006). Delivery of RNAi Effectors Several different methods can be used to deliver the active siRNA into the cell, depending on the organism of interest. In the nematode Caenorhabditis elegans, simply feeding the worm with bacteria that are expressing dsRNA encoding a target gene is sufficient to trigger long-lasting knock-down, as the long dsRNA is converted into active siRNAs intracellularly by Dicer (Fire et al., 1998). In mammalian cells, long dsRNA activates the antiviral interferon pathway leading to non-specific suppression of gene expression (de Veer et al., 2005). siRNAs or short (less than 30 bp) dsRNA precursors of siRNAs can be produced in vitro by chemical synthesis or enzymatic cleavage of longer RNAs and then introduced into cells by transfection. The transfection of
196
CHAPTER 16
■
Comprehensive Analysis of Gene Function
siRNA pools appears to enhance silencing specificity as the concentration of each individual siRNA is low, reducing the magnitude of off-target effects for any particular sequence. Alternatively, the siRNA can be produced within the cell by expression from a vector. These vector-based systems generally use RNA polymerase III promoter cassettes to produce a short primary transcript that folds into a hairpin structure (shRNA) and utilize the endogenous miRNA processing pathway: shRNA is cleaved within the nucleus by the microprocessor complex and exported to the cytoplasm where Dicer converts these into siRNA (Brummelkamp et al., 2002, Chang et al., 2006). Recently, miRNA-based expression scaffolds were developed to allow shRNA expression from more widely used RNA polymerase II promoter systems. These RNAi triggers are designed to be perfectly matched to the target mRNA and can result in potent and sustained transcript degradation. A number of conditional expression systems, mainly based on tetracyclinemediated gene regulation principles, have also been developed, allowing regulated gene silencing. RNAi Loss-of-Function Library Screens Forward genetic screens are conducted in mammalian cells using libraries of dominant genetic effectors introduced into cells to produce a population of individually “mutagenized” cells where
RNAi library
RNAi Classical genetics
Forward (Hypothesis-generating) Phenotype of interest Reverse (Hypothesis-based)
Gene (or protein) of interest
Chemical genetics Library of small molecules
Targeted small molecules
Figure 16.2 Chemical genomics in medicine. Random DNA mutagenesis is a classical genetics approach to create phenotypes that mimic disease in a cell or organism. Random mutagenesis may be replaced by library of interfering RNA, and targeted mutagenesis may be replaced by specific interfering RNA sequences that target the mRNA of interest. In chemical genomics a small molecule compound is used to achieve the same phenotype, or to correct a defined disease-related phenotype. After Kawasumi and Nghiem, 2007.
specific cellular functions are interrupted by the dominantly acting protein, peptide, RNA or organic chemical (Figure 16.2). Individual cells in which the genetic effector has caused a particular phenotypic response can then be selected and the genetic effector responsible identified. By defining the interactions of these effectors, potential targets for the development of therapeutic agents can be identified. Several different groups have produced collections of chemically synthesized siRNAs targeting all predicted human genes. Typically 3–5 different siRNAs are designed for each gene, with the expectation that at least half will knock-down gene expression more than 70%. For use in screens, these siRNAs are transfected into cells, usually as siRNA-lipid complexes, or by pre-spotting the siRNAs onto cell culture surfaces (“reverse transfection”) (Wheeler et al., 2005). These libraries have certain limitations: the approach is limited to transfectable cell types, and the transfected reagents produce only transient RNAi. Alternatively, retroviral vector shRNA libraries can be used. The distinct advantage of this system is the stable integration of viral vectors into host cell chromosomes and subsequent long-term expression of the shRNA. In a recent example, MacKeigan and colleagues conducted synthetic lethal screens to identify kinases and phosphatases that regulate apoptosis and chemoresistance (MacKeigan et al., 2005). HeLa cells were transfected with siRNAs derived from 650 human kinase and 222 phosphatase genes. Remarkably, siRNAs for 73 kinases (11%) and 70 phosphatases (32%) induced greater than twofold increases in apoptosis level relative to controls. These 143 “survival kinases and phosphatases” hence all contribute to HeLa cell survival, emphasizing the potential for novel synthetic interactions in transformed cells. Moffat and colleagues recently reported the development of a large arrayed lentiviral RNAi library (TRC1) and highcontent screen for mitosis (Moffat et al., 2006). A selected subset of the TRC1 library (4903 shRNA constructs) was used to target 1028 human genes (mostly kinases/phosphatases) with single, distinct shRNA-expressing lentiviruses in each well. Automated fluorescence microscopy and image analysis was used to screen for effects on mitosis. Over 100 genes met the screening criteria and were verified to affect mitotic progression using standard microscopy. Brummelkamp and colleagues used a combined chemical genetic and RNAi library screen to define genes required for p53-dependent cell cycle arrest (Brummelkamp et al., 2006). Treatment of breast carcinoma cells with nutlin-3, a small molecule inhibitor of the p53-negative regulator MDM2 (Vassilev et al., 2004), results in upregulation of p53 protein levels and cell cycle arrest. Cells were transduced with a retroviral shRNA library targeting 8000 genes (3 shRNA per gene) and divided into two populations, one that was treated with nutlin-3 and one receiving vehicle. shRNA sequences were isolated by PCR from colonies arising in the respective cell populations and compared by microarray (barcode) analysis. A small number of sequences were specifically enriched in the treated cell population, including the positive control p53 targeting sequence, and
Chemical Genomics
the p53 binding protein, p53BP1, that are required for p53 activation in nutlin-3 treated cells (Brummelkamp et al., 2006). In Vivo Use of siRNA The notion of using siRNAs as small molecule inhibitors of any gene is a particularly exciting prospect for in vivo gene analysis, drug target validation and novel therapeutic modalities. Significant obstacles remain, however, regarding efficient in vivo delivery of highly charged dsRNA. Unmodified siRNAs are rapidly degraded by serum RNAses and do not easily traverse cellular membranes. Chemical modification of specific bases can provide resistance to RNAse but can also compromise base pairing. Hence these modification strategies must balance stability and potency, such as distributed 2-O-methyl modifications. Systemically administered siRNAs are conjugated to targeting groups (e.g., proteins, cholesterol), encapsulated by lipids or carried by nanoparticles to improve cellular uptake. Alternatively various viral vectors can be used to express shRNA in target cells. Indeed, siRNA-based drugs are already in early clinical development. Off-Target Effects of siRNA Unintended gene regulation by siRNA can be induced by the delivery method and activation of innate cellular immunity. Offtarget effects may also be caused by base pair complementarities that alter dozens to hundreds of genes up to fourfold. Off-target effects are present in both in synthetic produced RNA duplexes and in siRNA generated in intact cells by vector-based expression of hairpin RNAs (Kettner-Buhrow et al., 2006). An in silico analysis of published siRNA concluded that the majority of reported siRNA have a risk of non-specific gene knock-down (Snove and Holen, 2004), but suggested that most of these redundancies can be removed in future siRNA design. Birmingham and colleagues have analyzed off-target genes by expression profiling in human cells, and have isolated the off-targeting to the 3 untranslated region (Birmingham et al., 2006). Modification of the siRNA, a 2-O-methyl ribosyl substitution at position 2 in the guide strand, is suggested to decrease the silencing of partially complementary transcripts (Jackson et al., 2006). Successful use of siRNA in the study of gene function should therefore carefully consider both the bioinformatic rules used for target gene complimentary analysis, as well as the chemical synthesis design.
CHEMICAL GENOMICS In recent years the increasing availability of large sets of bioactive small molecule compounds and the ability to systematically interrogate their effects on specific protein targets or cellular functions has engendered the field of chemical genomics (Kawasumi and Nghiem, 2007; Knight and Shokat, 2007). Based on the tenet of chemical genetics, that small molecules can be used to alter gene function and thus mimic mutant alleles, chemical genomics endeavors to discover and characterize small
■
197
molecules that modulate every gene. This is based either on the ability to design de novo a specific inhibitor of a given protein or the ability to survey a large diverse set of compounds for this activity. This latter approach is analogous to creating random mutations in DNA and necessarily requires significant followup to identify the compound target (“mutation”) (Hoyne and Goodnow, 2006; Inamdar, 2001). Hence most chemical genomic screens have focused on a specific target protein. While the generation of mutations in DNA is straightforward, the chemical synthesis of diverse structures is much more difficult. The advent of combinatorial chemistry has significantly improved our ability to generate very large collections of molecules (Dolle et al., 2005; Huwe, 2006). Natural compounds and other “privileged” bioactive scaffolds that affect known proteins or biological functions can form the basis for more targeted libraries (Ortholand and Ganesan, 2004). Hence, a practical description of chemical genomics must include tools that allow scientists to rapidly characterize a large pool of small molecules against target proteins. This implies an inversion of the drug discovery process, where the conventional industry strategy up to now has been to find drugs for targets, instead of finding targets for drugs. Chemical genomics is thus a modern conflation of well establish pharmacology principles, advances in genomics technology and effective chemical synthesis methodologies (Figure 16.2). Chemical Genomic Examples Combinatorial chemistry has made it possible to generate large libraries of small molecules in a short time, while automated high-throughput screening technologies facilitate the rapid interrogation of these compounds. Much of this work, conducted by the pharmacological industry, has been targeted to the discovery of novel chemical entities that modulate important signaling enzymes and cell death regulators (Roland and Dolle, 2004). Synthesis on beads allows transfer of minute amounts of a library of compounds to cell-based screens (MacBeath, 2001). We describe three different examples of chemical biology that illustrate the potential of this approach. FK506 The first example is FK506 (rapamycin, sirolimus), a natural macrolide antibiotic from the bacterium Streptomyces hygroscopicus, discovered as an antifungal agent from natural compound screening in the 1970s. During the 1990s, it was approved as an immunosuppressant for use in kidney transplant. Through study of the FK506 binding protein, a new class of proteins, the immunophilins, were discovered (Schreiber, 1991). The immunophilins appear to be essential in regulation of T-cell functions, predominantly through inhibition of the mTOR (mammalian target of rapamycin) signaling pathway. The result is attenuated T-cell related immunoreactions and avoidance of graft-versushost reactions and rejection of transplanted organs like kidneys. The mTOR pathway is also involved in cell growth of various cancers, and FK506 has been such a successful inhibitor of
198
CHAPTER 16
■
Comprehensive Analysis of Gene Function
proliferation that has lead to clinical cancer trials of this and similar compounds (Sabatini, 2006). This exemplifies the strength of screening natural sources for bioactive compounds, and illustrates how these compounds may result in discovery of new protein classes and cellular pathways. Alzheimer’s Disease New diagnostic imaging modalities, in particular the positron emission tomography combined with computer tomography, allow functional imaging of tissues or organs based on highly specific molecular probes (see Chapter 44). Alzheimer’s disease is a devastating neurological degenerative disease characterized by progressive loss of cognitive function. Only clinical examination together with imprecise imaging examinations have been used to determine the diagnosis, while post-mortem histopathological examination has revealed its disease-related protein structural features with beta-amyloid senile plaques and neurofibrillary tangles. These pathological protein structures are related to neural cell death, but therapy development is seriously hampered by lack of precise in vivo disease monitoring in humans. Small molecules have been developed that bind to senile plaques and neurofibrillary tangles (Kepe et al., 2006). The development of these molecular imaging probes has been relatively slow, based on decades of empirical use of dyes binding to these structures. More extensive use of chemical libraries, in combination with animal models, may lead to even more precise diagnostics. Bcl-2 Binding Compounds The Bcl-2 (B-cell lymphoma protein) was discovered in nonHodgkin lymphoma, where a recurrent chromosomal translocation had placed the Bcl-2 gene under control of the strong immunoglobulin heavy chain promoter (Adams and Cory, 2001). The actual mechanism of action of Bcl-2 first was elucidated when it was realized that Bcl-2 is homologous to the CED-4 gene in the worm Caenorhabditis elegans, a gene that is central in the worm’s genetic programs for cell death regulation. The function of Bcl-2 was explored by worm and mouse genetics, and the structure of this novel class of proteins mapped with high resolution. Through structure-guided design, small molecules have been designed that bind to a deep hydrophobic binding pocket in Bcl-2 (Bruncko et al., 2007). One such compound was developed based on a high-throughput NMR-screen of a chemical library in search for a lead compound that bound to the hydrophobic BH3-binding groove of Bcl-XL. Further structure analysis by NMR was used to avoid binding to serum albumin, as well as enhance the binding to Bcl-2, Bcl-XL and Bcl-w. One of the final drug compounds, ABT-737, bound to Bcl-2 in the sub-nanomolar range, illustrating the power of NMR-based high-throughput screening with parallel synthesis and structure-based design. Preclinical studies are promising for this new class of drugs that block the anticell death action of Bcl-2 (Fesik, 2005).
TABLE 16.2
Phenotypic screens*
Cellular function
Target phenotype
Read-out
Cell death
Survival
Growth/colony formation
Proliferation
Cell cycle arrest
Various dyes (e.g., tetrazolium salt) DNA staining Cell tracker fluorescence (flow cytometry)
Protein localization
Mislocalization
GFP-fusion Immunofluorescence
Protein modification
Proteasome activity Kinase activity
GFP-PEST
Promoter activation
Reporter genes (luciferase)
Transcription
Phospho-specific antibodies (microscopy, flow cytometry)
Flow cytometry surface marker *Phenotypic screens are designed to address a specific cellular function. General cellular functions (cell proliferation, cell death) involve multiple gene products, while more specific functions (protein localization, enzyme activity) involve fewer genes. Phenotypic screens can be focused on specific signaling events using gene expression reporters. Many detection methods can be used including enzymatic assays, microscopy and flow cytometry.
Phenotypic Screening Phenotypic screening can take advantage of forward and reverse genetics to identify a tentative disease target. Forward genetics is based on random modulation of the phenotype and then identification of the gene. Reverse genetics is based on a specific gene modification and then examination for a phenotype (Figure 16.2). An in vitro phenotype screen can comprise cells that are treated with a library of small molecules or siRNA and examined for activation or inhibition of the process of interest. High-throughput screening of defined compound libraries and cell responses that is detected by automated microscopy is a powerful approach when combined with a carefully designed strategy for verification. Such assays can be used to screen for compounds that alter simple intracellular translocations of labeled proteins or more complex cellular mechanisms like phagocytosis, phagosome maturation and bacterial invasion into human cells (Steinberg et al., 2007). Importantly, cell-based phenotypic screens present the target proteins in their native configuration, often a confounding challenge for screens based on purified proteins (Table 16.2). Combination of drug target gene expression signatures and high-throughput gene expression analysis has been used to determine new compounds that modulate cancer phenotypes,
Gene Function Studies
here exemplified by inhibition of androgen-receptor signaling (Hieronymus et al., 2006). The cellular mRNA from a defined set of genes were bar coded and amplified through PCR, and quantified by two color flow cytometer after hybridization to uniquely colored polystyrene bead. This gene expression analysis classified the two androgen-receptor signaling inhibiting compounds, celastrol and gedunin, as HSP90-inhibitors (Hieronymus et al., 2006). Most high-throughput screening approaches utilize relatively straightforward readouts. In contrast, high-content screening seeks to capture multiple parameters simultaneously. This “high resolution” examination of cell responses can reveal unique characteristics of drug action, though at the expense of throughput. Hence, smaller focused libraries are generally used. In a search for novel cancer cell apoptosis-inducing agents from cyanobacteria, Herfindal and colleagues carefully compared nuclear morphologies in treated normal human hepatocytes (Herfindal et al., 2005). The result was identification of specific hepatototoxic strains of algae not involving the known hepatotoxins microcystins and nodularin. A similar approach using primary blood platelets and neural derived cell lines has identified specific neural and platelet targeting toxins (Selheim et al., 2005). Structure elucidation of the active natural compound is necessary to fully understand the mechanisms involved (Antonopoulou et al., 2002) and provide a basis for future drug development. Parallel Chemical and Genome-wide RNAi Screens Parallel screening for siRNA and small molecule inhibitors is a particularly fruitful combination to map constituents of a cellular function (Figure 16.2). Further siRNA hits provide an effective starting point to reveal the targets of small molecule hits. In a study focused on cytokinesis, the final step of cell division, by searching for siRNAs and small molecules that generate binuclear cells (Eggert et al., 2004). Comparative analysis of phenotypes revealed distinct categories of drug action, providing a basis for further target validation.
GENE FUNCTION STUDIES Therapeutic development invariably involves experiments in animal models. Gene function analysis in the intact organism is a critical component of validation. Genetically tractable organisms like fly, worm and zebrafish allow the researcher to study gene function within short periods of time. Mice have been used for decades, more recently in large institutional set ups, for example to dissect the genetic basis for human neurological disorders (Oliver and Davies, 2005). But due to obvious differences, many gene functions of importance in animals may be irrelevant or absent in humans and vice versa. This is exemplified by cancer genetics where the phenotype of certain oncogenes in mouse does not mimic the disease spectra in humans with a similar genetic defect (Rangarajan and Weinberg, 2003).
■
199
Clinical Trials and Epidemiology Experimental early phase human clinical trials allow a chemical genetic elucidation of gene function. Many patients are treated with one specific drug during the disease course. These compounds have defined target and off-target effects, and gene function can be studied in sentinel cells or tissues, like peripheral leukocytes, or the diseased tissues. There is increasing use of gene expression studies in patients who undergo therapy. One example is examination of gene expression analysis within hours after intensive chemotherapy against acute myeloid leukemia (Anensen et al., 2006). In this study peripheral blood cells, in these samples representing more than 95% pure cancer cells, were harvested from 2 h to 6 h after start of chemotherapy. As early as 2 h after start of chemotherapy, the p53 protein was activated, and the 100 most expressed genes were dominated by known p53 target genes. The tumor suppressor p53 is mutated in more than 40% of all cancers, but in leukemia the gene mutation frequency is below 10%. However, mutated p53 in AML is associated with highly chemoresistant disease. Together, this indicates that normal p53 activation is of paramount importance in the first intensive chemotherapy in acute myeloid leukemia, and that other therapeutic strategies should probably be searched for in cases were p53 is mutated. In some countries or regions there are extensive health registries, including general health status and careful registration of diseases like cancer, cardiovascular disease and multiple sclerosis. These comprehensive registries take advantage of unique personal identifiers and limited population mobility. As a result of governmental financed insurance plans, corresponding detailed information of prescribed drugs is available in populationbased registries (Andersen, 2006). Impacting most fields of modern medicine, epidemiological studies have been performed by linking different registries. Increased risk for miscarriage in pregnant users of non-steroidal anti-inflammatory drugs have been identified by combining data from the prescription registry, birth registry and hospital discharge registry (Nielsen et al., 2001). Protective association of non-steroidal anti-inflammatory drugs against risk for colon, rectal, stomach and ovarian cancer has been described (Sørensen et al., 2003). Finally, registries were used to identify increased mortality of myocardial infarction in patients on cyclooxygenase-2 inhibitors (Gislason et al., 2006). The molecular cause of these associations will in the future be addressed by linkage of genetics and proteomics data from population-based biobanks with the health registries (Pukkala et al., 2007; Stoltenberg, 2005). Combination of drug prescription registries, health registries and biobanks could facilitate retrospective genomic screens for function of drugs and drug combination. A European registry and biobank with more than 600,000 twin pairs and a corresponding biobank may be such a unique source for retrospective chemical genomics in humans (Muilu et al., 2007). Genomics combined with epidemiological data may be an important future tool to determine relationships between genes and drugs that are inaccessible in animal models or require large number of individuals
200
CHAPTER 16
Comprehensive Analysis of Gene Function
■
(a)
Diseases
Genes
Drugs
Genome-wide mRNA-expression profiles N
N
(b)
N
N O P HO OH
Na P OH
O
OH
Unannotated chemical
cmap hypothesis
Disease state
cmap hypothesis
Experimental verification
N3
O
O
O
O
N
N
N
N
N
N
O O
OH
New inhibitors
(c)
Experimental verification
Clinical evaluation
Figure 16.3 A universal functional bioassay forms the Connectivity Map tool. (a) Based on the assumption that all transcripts are known, and simultaneous measurement of these transcript are possible through standardized gene expression array data collection, it may be possible to portrait all induced and organic biological conditions in a common analytic space. Similarities can thereby be determined between these induced and biological conditions, and allow the foundation for Connectivity Maps. (b) An uncharacterized molecule is identified as an active compound in a particular cellular pathway through the Connectivity Map. (c) A particular disease state is coupled with perturbations in certain molecular pathways, and suggests a small molecule as experimental therapeutic. Both (b) and (c) illustrates that the Connectivity Map is a tool for the experimental researcher, where in vitro and in vivo experimental verification of the proposed in silico molecular connection is needed. Modified from Lamb (2007).
for detection. Hypotheses generated in genetic epidemiology may then be tested experimentally by chemical genomics.
CONCLUSIONS New developments in the understanding of RNAi will allow design of siRNA molecules with more potent and specific mRNA targeting characteristics for all genes. The growing access to small
compound collections, liquid handling robotics, miniaturization of assay formats and streamlined readout detection methods will undoubtedly lead to both less time-consuming, labor-intensive and expensive siRNA and chemical genomic screening. As recently proposed, connectivity mapping based on geneexpression signatures may be able to successfully connect disease-related pathways and bioactive small molecules (Lamb et al., 2006). Lamb and colleagues propose a large-scale Connectivity Map project similar to the Human Genome Project, where an
References
open resource can be used for investigators to find connections between small molecule action, physiological processes and disease (Figure 16.3). Several proof-of-principle reports have emerged, for example, identifying novel compounds for established pathways or suggest known drugs and pathways to resolve known drugresistance causing mechanism (Hieronymus et al., 2006; Wei et al., 2006). Connectivity Map technology, even with its limitations (Lamb et al., 2006, for review see Lamb, 2007), may be a useful tool for the experimentalist to focus his or her interest. We have entered the post-genome sequence era. But without having the key to the cipher of gene function, the outstanding effort in genome sequencing will remain just an impressive
■
201
code, without helping us to resolve the enigmas of our most common diseases and making genomic medicine a reality for the patients.
ACKNOWLEDGEMENTS This work was supported by the Norwegian Research Council Functional Genomics Program (FUGE) grant numbers 151859 and 183675. We thank Lars Helgeland Emmet McCormack, David Micklem, Maja Mujic and Line Wergeland for fruitful discussions and for providing illustrations for Figure 16.3.
REFERENCES Adams, J.M., Cory, S. (2001). Life-or-death decisions by the Bcl-2 protein family. Trends Biochem Sci 26, 61–66. Andersen, M. (2006). Is it possible to measure prescribing quality using only prescription data?. Basic Clin Pharmacol Toxicol 1, 314–319. Anensen, N., Oyan, A.M., Bourdon, J.C., Kalland, K.H., Bruserud, O. and Gjertsen, B.T. (2006). A distinct p53 protein isoform signature reflects the onset of induction chemotherapy for acute myeloid leukemia. Clin Cancer Res 12, 3985–3992. Antonopoulou, S., Oikonomou, A., Karantonis, H.C., Fragopoulou, E. and Pantazidou, A. (2002). Isolation and structural elucidation of biologically active phospholipids from Scytonema julianum (cyanobacteria). Biochem J 367, 287–293. Bier, E. (2005). Drosophila, the golden bug, emerges as a tool for human genetics. Nat Rev Genet 6, 9–23. Birmingham, A., Anderson, E.M., Reynolds, A., Ilsley-Tyree, D., Leake, D., Fedorov, Y., Baskerville, S., Maksimova, E., Robinson, K., Karpilow, J. et al. (2006). 3 UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat Method 3, 199–204. Bredel, M. and Jacoby, E. (2004). Chemogenomics: An emerging strategy for rapid target and drug discovery. Nat Rev Genet 5(4), 262–275. Brummelkamp, T.R., Bernards, R. and Agami, R. (2002). A system for stable expression of short interfering RNAs in mammalian cells. Science 296, 550–553. Brummelkamp, T.R., Fabius, A.W., Mullenders, J., Madiredjo, M., Velds, A., Kerkhoven, R.M., Bernards, R. and Beijersbergen, R.L. (2006). An shRNA barcode screen provides insight into cancer cell vulnerability to MDM2 inhibitors. Nat Chem Biol 2, 202–206. Bruncko, M., Oost, T.K., Belli, B.A., Ding, H., Joseph, M.K., Kunzer, A., Martineau, D., McClellan, W.J., Mitten, M., Ng, S.C. et al. (2007). Studies leading to potent, dual inhibitors of Bcl-2 and Bcl-xL. J Med Chem 50, 641–662. Chang, K., Elledge, S.J. and Hannon, G.J. (2006). Lessons from Nature: microRNA-based shRNA libraries. Nat Method 3, 707–714. Cullen, B.R. (2006). Enhancing and confirming the specificity of RNAi experiments. Nat Method 3, 677–681. de Fougerolles, A., Vornlocher, H.P., Maraganore, J. and Lieberman, J. (2007). Interfering with disease: a progress report on siRNA-based therapeutics. Nat Rev Drug Discov 6, 443–453.
Dolle, R.E., Le Bourdonnec, B., Morales, G.A., Moriarty, K.J. and Salvino, J.M. (2005). Comprehensive survey of combinatorial library synthesis: 2005. J Comb Chem 8(5), 597–635. Echeverri, C.J., Beachy, P.A., Baum, B., Boutros, M., Buchholz, F., Chanda, S.K., Downward, J., Ellenberg, J., Fraser, A.G., Hacohen, N. et al. (2006). Minimizing the risk of reporting false positives in large-scale RNAi screens. Nat Method 3, 777–779. Eggert, U.S., Kiger, A.A., Richter, C., Perlman, Z.E., Perrimon, N., Mitchison, T.J. and Field, C.M. (2004). Parallel chemical genetic and genome-wide RNAi screens identify cytokinesis inhibitors and targets. PLoS Biol 2, 2135–2143. Elbashir, S.M., Harborth, J., Lendeckel, W., Yalcin, A., Weber, K. and Tuschl, T. (2001). Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411, 494–498. Fesik, S.W. (2005). Promoting apoptosis as a strategy for cancer drug discovery. Nat Rev Cancer 5, 876–885. Fire, A., Xu, S., Montgomery, M.K., Kostas, S.A., Driver, S.E. and Mello, C.C. (1998). Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806–811. Fuchs, F. and Boutros, M. (2006). Cellular phenotyping by RNAi. Brief Funct Genomic Proteomic 5, 52–56. Gislason, G.H., Jacobsen, S., Rasmussen, J.N., Rasmussen, S., Buch, P., Friberg, J., Schramm, T.K., Abildstrom, S.Z., Køber, L., Madsen, M. et al. (2006). Risk of death or reinfarction associated with the use of selective cyclooxygenase-2 inhibitors and nonselective nonsteroidal antiinflammatory drugs after acute myocardial infarction. Circulation 113(25), 2906–2913. Gudkov,A.V., Kazarov,A.R.,Thimmapaya, R.,Axenovich, S.A., Mazo, I.A. and Roninson, I.B. (1994). Cloning mammalian genes by expression selection of genetic suppressor elements: Association of kinesin with drug resistance and cell immortalization. Proc Natl Acad Sci USA 91, 3744–3748. Herfindal, L., Oftedal, L., Selheim, F., Wahlsten, M., Sivonen, K. and Doskeland, S.O. (2005). A high proportion of Baltic Sea benthic cyanobacterial isolates contain apoptogens able to induce rapid death of isolated rat hepatocytes. Toxicon 46, 252–260. Hieronymus, H., Lamb, J., Ross, K.N., Peng, X.P., Clement, C., Rodina,A., Nieto, M., Du, J., Stegmaier, K., Raj, S.M. et al. (2006). Gene expression signature-based chemical genomic prediction identifies a novel class of HSP90 pathway modulators. Cancer Cell 10, 321–330.
202
CHAPTER 16
■
Comprehensive Analysis of Gene Function
Hsieh, A.C., Bo, R., Manola, J., Vazquez, F., Bare, O., Khvorova, A., Scaringe, S. and Sellers, W.R. (2004). A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: Determinants of gene silencing for use in cell-based screens. Nucleic Acids Res 32, 893–901. Huwe, C.M. (2006). Synthetic library design. Drug Discov Today 11(15-16), 763–767. Hoyne, G.F. and Goodnow, C.C. (2006). The use of genomewide ENU mutagenesis screens to unravel complex mammalian traits: Identifying genes that regulate organ-specific and systemic autoimmunity. Immunol Rev 210, 27–39. Hunter, T. (2007). Treatment for chronic myelogenous leukemia: The long road to imatinib. J Clin Invest 117, 2036–2043. Inamdar, M.S. (2001). Functional genomics the old-fashioned way: Chemical mutagenesis in mice. Bioessays 23, 116–120. Jackson, A.L., Burchard, J., Schelter, J., Chau, B.N., Cleary, M., Lim, L. and Linsley, P.S. (2006). Widespread siRNA “off-target” transcript silencing mediated by seed region sequence complementarity. RNA 12, 1179–1187. Kaelin, W.G., Jr (2005). The concept of synthetic lethality in the context of anticancer therapy. Nat Rev Cancer 5, 689–698. Kawasumi, M. and Nghiem, P. (2007). Chemical genetics: Elucidating biological systems with small-molecule compounds. J Invest Dermatol 127(7), 1577–1584. Kepe, V., Huang, S.C., Small, G.W., Satyamurthy, N. and Barrio, J.R. (2006). Visualizing pathology deposits in the living brain of patients with Alzheimer’s disease. Methods Enzymol 412, 144–160. Kettner-Buhrow, D., Dittrich-Breiholz, O., Schneider, H., Wolter, S., Resch, K. and Kracht, M. (2006). Small interfering RNAs generated by recombinant dicer induce inflammatory gene expression independent from the TAK1-NFkappaB-MAPK signaling pathways. Biochem Biophys Res Commun 347, 566–573. Knight, Z.A. and Shokat, K.M. (2007). Chemical genetics: Where genetics and pharmacology meet. Cell 128(3), 425–430. Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P., Subramanian, A., Ross, K.N. et al. (2006). The Connectivity Map: Using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935. Lamb, J. (2007). The connectivity map: a new tool for biomedical research. Nat Rev Cancer 7, 54–60. Landro, J.A., Taylor, I.C., Stirtan, W.G., Osterman, D.G., Kristie, J., Hunnicutt, E.J., Rae, P.M. and Sweetnam, P.M. (2000). HTS in the new millennium: The role of pharmacology and flexibility. J Pharmacol Toxicol Method 44(1), 273–289. Lorens, J.B., Sousa, C., Bennett, M.K., Molineaux, S.M. and Payan, D.G. (2001). The use of retroviruses as pharmaceutical tools for target discovery and validation in the field of functional genomics. Curr Opin Biotechnol 12, 613–621. MacBeath, G. (2001). Chemical genomics: What will it take and who gets to play? Genome Biol, 2, comment2005.1–comment2005.6. MacKeigan, J.P., Murphy, L.O. and Blenis, J. (2005). Sensitized RNAi screen of human kinases and phosphatases identifies new regulators of apoptosis and chemoresistance. Nat Cell Biol 7, 591–600. Mello, C.C. and Conte, D. (2004). Revealing the world of RNA interference. Nature 431, 338–342. Moffat, J., Grueneberg, D.A.,Yang, X., Kim, S.Y., Kloepfer, A.M., Hinkle, G., Piqani, B., Eisenhaure, T.M., Luo, B., Grenier, J.K. et al. (2006).
A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell 124, 1283–1298. Muilu, J., Peltonen, L. and Litton, J.E. (2007). The federated database – a basis for biobank-based post-genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur J Hum Genet 15(7), 718–723. Nielsen, G.L., Sørensen, H.T., Larsen, H. and Pedersen, L. (2001). Risk of adverse birth outcome and miscarriage in pregnant users of non-steroidal anti-inflammatory drugs: Population based observational study and case-control study. BMJ 322(7281), 266–270. Oliver, P.L., Davies, K.E. (2005). Analysis of human neurological disorders using mutagenesis in the mouse. Clin Sci 108, 385–397. London. Ortholand, J.Y. and Ganesan, A. (2004). Natural products and combinatorial chemistry: Back to the future. Curr Opin Chem Biol 8(3), 271–280. Peters, J.L., Cnudde, F. and Gerats, T. (2003). Forward genetics and mapbased cloning approaches. Trends Plant Sci 8(10), 484–491. Pukkala, E., Andersen, A., Berglund, G., Gislefoss, R., Gudnason, V., Hallmans, G., Jellum, E., Jousilahti, P., Knekt, P., Koskela, P. et al. (2007). Nordic biological specimen banks as basis for studies of cancer causes and control – more than 2 million sample donors, 25 million person years and 100,000 prospective cancers. Acta Oncol 46(3), 286–307. Rangarajan, A. and Weinberg, R.A. (2003). Opinion: Comparative biology of mouse versus human cells: Modelling human cancer in mice. Nat Rev Cancer 3, 952–959. Rognan, D. (2007). Chemogenomic approaches to rational drug design. Br J Pharmacol 152, 38–52. Roland, E. and Dolle, J. (2004). Comprehensive Survey of Combinatorial Library Synthesis: 2003. Comb Chem 6, 623–679. Sabatini, D.M. (2006). mTOR and cancer: insights into a complex relationship. Nat Rev Cancer 6, 729–734. Schreiber, S.L. (1991). Chemistry and biology of the immunophilins and their immunosuppressive ligands. Science 251, 283–287. Selheim, F., Herfindal, L., Martins, R., Vasconcelos, V. and Doskeland, S. O. (2005). Neuro-apoptogenic and blood platelet targeting toxins in benthic marine cyanobacteria from the Portuguese coast. Aquat Toxicol 74, 294–306. Sherbenou, D.W. and Druker, B.J. (2007). Applying the discovery of the Philadelphia chromosome. J Clin Invest 117, 2067–2074. Snove, O, Jr and Holen, T. (2004). Many commonly used siRNAs risk off-target activity. Biochem Biophys Res Commun 319, 256–263. Steinberg, B.E., Scott, C.C. and Grinstein, S. (2007). High-throughput assays of phagocytosis, phagosome maturation, and bacterial invasion. Am J Physiol Cell Physiol 292, C945–C952. Stoltenberg, C. (2005). Merging genetics and epidemiology: what is in it for public health?. Scand J Public Health 33, 1–3. Sørensen, H.T., Friis, S., Nørgård, B., Mellemkjaer, L., Blot, W.J., McLaughlin, J.K., Ekbom, A. and Baron, J.A. (2003). Risk of cancer in a large cohort of nonaspirin NSAID users: A populationbased study. Br J Cancer 88(11), 1687–1692. Vassilev, L.T., Vu, B.T., Graves, B., Carvajal, D., Podlaski, F., Filipovic, Z., Kong, N., Kammlott, U., Lukacs, C., and Klein, C., et al. (2004). In vivo activation of the p53 pathway by small-molecule antagonists of MDM2. Science 303, 844–848. de Veer, M.J., Sledz, C.A. and Williams, B.R. (2005). Detection of foreign RNA: Implications for RNAi. Immunol Cell Biol 83, 224–228. Westbrook, T.F., Stegmeier, F. and Elledge, S.J. (2005). Dissecting cancer pathways and vulnerabilities with RNAi. Cold Spring Harb Symp Quant Biol 70, 435–444.
Recommended Resources
Wheeler, D.B., Carpenter, A.E. and Sabatini, D.M. (2005). Cell microarrays and RNA interference chip away at gene function. Nat Genet 37(Suppl), S25–S30. Wei, G., Twomey, D., Lamb, J., Schlis, K., Agarwal, J., Stam, R.W., Opferman, J.T., Sallan, S.E., den Boer, M.L., Pieters, R. et al. (2006). Gene expression-based chemical genomics identifies
203
rapamycin as a modulator of MCL1 and glucocorticoid resistance. Cancer Cell 10, 331–342. Xu, X., Leo, C., Jang, Y., Chan, E., Padilla, D., Huang, B.C., Lin, T., Gururaja, T., Hitoshi, Y., Lorens, J.B. et al. (2001). Dominant effector genetics in mammalian cells. Nat Genet 27, 23–29.
RECOMMENDED RESOURCES http://www.ncgc.nih.gov/ http://www.broad.harvard.edu/chembio/icg/overview/index.htm http://pubchem.ncbi.nlm.nih.gov/ http://nmf.jax.org/index.html http://www.gsc.riken.go.jp
■
http://www.ncbi.nlm.nih.gov/sites/entrez?dbomim/ http://www.genomeutwin.org/ http://www.ukbiobank.ac.uk/ Database of Genotypes and Phenotypes (dbGaP): (http://view.ncbi.nlm.nih.gov/sites/entrez?dbgap/
This page intentionally left blank
Section
Informatic and Computational Platforms for Genomic Medicine
17. 18. 19. 20. 21.
3
Bioinformatic and Computational Analysis for Genomic Medicine Fundamentals and History of Informatics for Genomic and Personalized Medicine Electronic Medical Records in Genomic Medicine Practice and Research Clinical Decision Support in Genomic and Personalized Medicine Online Health Information Retrieval by Consumers and the Challenge of Personal Genomics
CHAPTER
17 Bioinformatic and Computational Analysis for Genomic Medicine Atul J. Butte
INTRODUCTION The past 10 years have seen development and deployment of a variety of measurement tools in molecular science that enable the large-scale parallel quantitative assessment of molecular state. The premier example of such measurement tools is the RNA expression detection microarray, which provides quantitative measurements of expression of over 40,000 unique RNAs within cells (Chee et al., 1996; DeRisi et al., 1996). Yet, as illustrated in previous chapters in this volume, microarrays represent just one of at least 30 or so measurement or experimental modalities available to investigators in molecular biology and genomics, including metabolite quantification (Edwards et al., 2001), DNA polymorphism measurements (Johnson et al., 2001), protein quantification (Espina et al., 2003; Ghaemmaghami et al., 2003; Gygi et al., 1999; Liotta et al., 2003), protein activity (Hestvik et al., 2003), protein interactions with small molecules (Jessani et al., 2002) and DNA (Lee et al., 2002; Odom et al., 2004; Ren et al., 2000;Wang et al., 2001; Weinmann et al., 2002), and many others. Though these molecular technologies are commonly described as being “high-throughput” or “genome-wide” in nature, we will use the term high-bandwidth to describe them in this chapter. An analogy explains our reasoning: high-bandwidth is to a supermarket, which serves a large variety of different foods irrespective of speed, while high-throughput is to a fast-food restaurant, serving a few types of
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 206
foods quickly. Genome-wide technologies are biased towards genes, while we are covering measurements of many other types besides genes. The service role for bioinformatics in research using these new tools is undeniable. Each of these high-bandwidth modalities yields a sizeable amount of raw data, so distilling these raw data and filtering out measurement noise through the proper use of bioinformatics methods is crucial. Bioinformatics clearly plays a role in the storage, retrieval, and sharing of these measurements from local and international repositories and in relating these measurements to clinical outcomes. However, what we will argue in this chapter is that the role for bioinformatics in genomic and personalized medicine is well beyond that of providing a service and lies more broadly in enabling new and interesting questions to be asked in biomedical, translational, and clinical research. In this chapter, we will present several examples that illustrate how the use of specific bioinformatics methods has enabled a revolution of discovery in medical diagnostics, prognostics, therapeutics, anatomy, and nosology. We will then review a few specific analytic methods and standardized vocabularies useful in studying and identifying high-bandwidth measurements for genomic and personalized medicine. After covering a few freely available software tools valuable for conducting analyzes, we will end with some thoughts about where the future discoveries in genomic medicine, enabled by bioinformatics, may come from.
Copyright © 2009, Elsevier Inc. All rights reserved.
Vignettes: How Specific Bioinformatics Methods can Change the Practice of Medicine
VIGNETTES: HOW SPECIFIC BIOINFORMATICS METHODS CAN CHANGE THE PRACTICE OF MEDICINE While many high-bandwidth measurement technologies have been used to enable basic discoveries in the life sciences, certainly the most intriguing and provocative use of these measurement modalities has been in enabling novel types of medical discoveries. Here, we will present case examples of how these discoveries were enabled by bioinformatics techniques and have demonstrably altered our abilities in four distinct areas of medicine: diagnosis, therapeutics, histopathology, and nosology (Table 17.1). For these vignettes, we will particularly focus on the use of microarrays, as this is currently the most commonly used highbandwidth measurement modality. Diagnosis Diagnosis is a process in which a physician, when presented with a patient holding a chief complaint, gathers information through questioning, history taking, physical examination, and testing, compares the results of those queries with patterns established for known diseases, and assigns the disease with the highest likelihood. High-bandwidth testing in molecular biology has impacted every step of the diagnostic process. For example, those being evaluated for a disease may not yet even have a complaint, as DNA
TABLE 17.1
■
207
polymorphisms are discovered and used to predict future disease (Kohane et al., 2006). Medical history taking becomes problematic when considering the genetic context of a patient, which is increasingly known; for example, many infants now have one or more genes sequenced at birth as part of state-mandated newborn screening (Khoury et al., 2003). History taking has potential to become even more convoluted as more adults have genes (and even whole genomes) sequenced. The bioinformatics challenge in storing, providing, and teaching patients and parents how to use this type of medical information, while protecting against others finding this information, has only just started to be addressed (Adida and Kohane, 2006; 2004; Centers for Disease Control and Prevention, 2004; Henneman et al., 2006; Roche and Annas, 2001, 2006). The nature of the physical examination itself has changed. Positive and negative physical findings (e.g., acanthosis nigricans, or darkening of the skin of the neck) are traditionally considered as probes into the process of disease (e.g., type 2 diabetes mellitus). With the democratization of genomics throughout all fields of medicine, physical findings can now be viewed collectively as phenotype and as a probe into the inner workings of the genome (Freimer and Sabatti, 2003). Two different diseases with similar sets of physical findings may indeed appear quite similar when viewed from the perspective of genes and gene products (Brunner and van Driel, 2004). But certainly the greatest impact on diagnosis from highbandwidth molecular testing has been felt on our ability to redefine our patterns of disease using these new measurements. One
Bioinformatics techniques have demonstrably altered our abilities in four distinct areas of medicine
Area of the practice of medicine Diagnosis Chief complaint Medical history Medical record keeping Physical examination Differential diagnosis Disease sub-type Therapeutics Finding novel effects for existing drugs Side effects Finding novel therapies for untreatable conditions Histopathology Metastasis of unknown origin Pathological mechanism of action of disease Nosology Finding differences between diseases Finding similarities between diseases Linking cellular models to human diseases
Example of bioinformatics altering practice Predicting future disease from polymorphisms before complaint Genetic screening at birth should become part of a patient’s life-long electronic medical record Patient-owned Internet-based genomic records Findings collectively viewed as phenotype and matched against patterns Molecular measurements can be compared against case-examples Hierarchical clustering to determine novel subtypes from molecular measurements Linking drug effects to molecular measurements Predicting drug adverse effect from pre-clinical molecular measurements Matching molecular profiles of disease with molecular profiles of drug effect Matching molecular measurements from metastasis with profiles of known primary cancers Identifying differentially expressed genes and linking back to known biological pathways Principal components of variance of disease sample microarrays Common signatures across microarray measurements of disease Comparing cellular and disease microarray measurements
208
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
of the earliest examples of this impact was in the work of Golub and colleagues published in 1999 (Golub et al., 1999). Acute lymphomas were first divided into acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) in the 1960s based on histochemical testing, and more recently by antibody-based testing against cell-surface molecules, but there was no single established test to make this diagnosis. Golub was the first to show that gene expression microarrays could be used to measure gene
expression levels from samples from acute lymphomas, and using a bioinformatics technique known as supervised machine learning, trained a prediction system to find and use those genes that uniquely distinguished ALL from AML (Figure 17.1). He then used this predictor to accurately classify additional new leukemia cases. Many others have used alternative computational and bioinformatics techniques to build predictors since this landmark study, while many additional diagnostic dilemmas have similarly
ALL
AML C-myb (U22376) Proteasome iota (X59417) MB-1 (U05259) Cyclin D3 (M92287) Myosin light chain (M31211) RbAp48 (X74262) SNF2 (D26156) HkrT-1 (S50223) E2A (M31523) Inducible protein (L47738) Dynein light chain (U32944) Topoisomerase II β (Z15115) IRF2 (X15949) TFIIEβ (X63469) Acyl-Coenzyme A dehydrogenase (M91432) SNF2 (U29175) (Ca2)-ATPase (Z69881) SRP9 (U20998) MCM3 (D38073) Deoxyhypusine synthase (U26266) Op 18 (M31303) Rabaptin-5 (Y08612) Heterochromatin protein p25 (U35451) IL-7 receptor (M29696) Adenosine deaminase (M13792) Fumarylacetoacetate (M55150) Zyxin (X95735) LTC4 synthase (U50136) LYN (M16038) HoxA9 (U82759) CD33 (M23197) Adipsin (M84526) Leptin receptor (Y12670) Cystatin C (M27891) Proteoglycan 1 (X17042) IL-8 precursor (Y00787) Azurocidin (M96326) p62 (U46751) CyP3 (M80254) MCL1 (L08246) ATPase (M62762) IL-8 (M28130) Cathepsin D (M63138) Lectin (M57710) MAD-3 (M69043) CD11c (M81695) Ebp72 (X85116) Lysozyme (M19045) Properdin (M83652) Catalase (X04085)
3
2.5
low
2
1.5
1
0.5
0
0.5
Normalized Expression
1
1.5
2
2.5
3 high
Figure 17.1 Todd Golub and colleagues measured genome-wide expression levels in samples from patients with ALL and AML. The top 50 genes distinguishing ALL and AML are displayed, with genes elevated in expression shown in pink, and genes decreased in expression shown in blue. From Golub et al., 1999 Science 286:531. Reprinted by permission from AAAS.
Vignettes: How Specific Bioinformatics Methods can Change tjhe Practice of Medicine
Therapeutics Bioinformatics, coupled with the exponential growth in data collected using high-bandwidth measurement modalities, has the potential to revolutionize the way patients are treated with pharmaceuticals. Much has already been written on how highbandwidth measurements have generally altered the process of drug discovery (Debouck and Goodfellow, 1999). Here, we will cover specific case examples on how the conceptualization of pharmacology can be radically changed through development and use of biomedical informatics methods. In 2000, Butte and colleagues focused on generating hypotheses of functional relationships between pairs of genes and pharmaceuticals, with the hope that specific genes could be found that altered a cancer’s ability to respond to specific drugs. To do this, they started with two databases, both related to the NCI60, a standardized set of 60 human cancer cell lines used by the National Cancer Institute Developmental Therapeutics Program (Weinstein et al., 1997). One database contained the baseline microarray measurements of the expression level of 6701 genes measured in these cell lines. The other database contained a set of drug susceptibility measurements for the same cell lines, across
209
(a)
spi-1PU.1 CD86B7-2 RAD50 CD21 Gertminal center kinase
Casein kinase l, 2 Diacylglycerol kinase delta Arachidonate 5-lipoxygenase CD22 JNK3 Myosin-IC KCNN3 Ca activated K channel PI3-kinase p110 catalytic, isoform WIPWASP interacting protein JAW1 APS adapter protein Protocadherin 43 Terminal deoxynucleotide transferase Focal adhesion kinase BCL-7A BCL-6 FMR2 A-myb CD10 OGG18-oxyguanine DNA glycosylase LMO2 CD38 CD27 Ick IRS-1 RDC-1 ABR OP-1 RGS13 PKC delta MEK1 SIAH-2 IL-4 receptor alpha chain APRPMA-responsive peptide GADD34 IL-10 receptor beta chain c-myc NIK ser/thr kinase BCL-2 MAPKK5 kinase PBEFpre-B enhancing factor TNF alpha receptor II Cyclin D2 Deoxycytidylate deaminase IRF-4 CD44 FLIPFLICE-like inhibitory protein SLAPsrc-like adapter protein DRIL1Dead ringer-like 1 Trk3Neurotrophic tyr kinase receptor IL-16 SP100 nuclear body protein LYSP100 K channel, shaker-related, member 3 ID2 NET tyrosine kinase IL-2 receptor beta chain
(b)
All patients
1.0
GC B-like Probability
been addressed using gene expression microarrays, including melanoma (Bittner et al., 2000) and prostate cancer (Singh et al., 2002). Other types of high-bandwidth modalities have been used to assist in other diagnostic questions, including proteomics (Petricoin et al., 2002; Robinson et al., 2002) and metabolomics (Clayton et al., 2006) (see also Chapters 14 and 15). While enabling physicians to make more efficient and accurate decisions between diseases itself is remarkable, the potential for high-bandwidth molecular testing in diagnosis is even greater. Beyond just improving on older tests for established diagnostic dilemmas, high-bandwidth studies have also been used to redefine and even divide diagnoses into novel sub-types. The earliest example of this was in the work of Alizadeh and colleagues in 2000 (Alizadeh et al., 2000). Alizadeh used gene expression microarrays to measure gene expression in samples from patients diagnosed with diffuse large B-cell lymphoma, a cancer of B-lymphocytes. This type of cancer was previously divided into low-, intermediate-, and high-grade categories based on growth pattern and immunohistochemistry. After applying a bioinformatics technique called hierarchical clustering (described later in this chapter) to the microarray data, Alizadeh discovered that the patient samples of B-cell lymphoma could be divided into an equal split of two subtypes (Figure 17.2). Most importantly, patients with these two different sub-types of lymphoma retrospectively demonstrated significant differences in survival by Kaplan–Meier analysis. This was a landmark study because it showed how high-bandwidth molecular measurements could provide a next level of resolution to resolve sub-types of disease not otherwise discernable by physicians. This type of study has been replicated for other types of cancers, including lung cancer (Bhattacharjee et al., 2001) and breast cancer (Sorlie et al., 2001), has been replicated using other types of measurements, including proteomics (Welsh et al., 2003), and is starting to enter clinical trials (Potti et al., 2006).
■
19 patients, 6 deaths 0.5 Activated B-like 21 patients, 16 deaths P 0.01
0.0 0
2
4 6 8 10 Overall survival (years)
12
Figure 17.2 (a) Ash Alizadeh and colleagues measured genome-wide expression levels in samples from patients with diffuse large B-cell lymphoma and found samples clustered into two sub-types based on expression patterns. (b) Patients in these two sub-types have a significant difference in survival (Alizadeh et al., 2000). Reprinted by permission from Macmillan Publishers Ltd: Nature 403:503, copyright 2000.
210
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
nearly 5000 anti-cancer agents. They then used a database technique known as join to combine these two databases through the cell lines in common. After joining the two tables, they searched for baseline RNA expression levels in the cell lines that correlated with the inhibition of growth of these same cell lines by thousands of anti-cancer agents. They found only one network containing an association between a gene expression and a measure of anti-cancer agent susceptibility (Figure 17.3). The association suggested that increased expression of lymphocyte cytosolic protein-1 (LCP1) is associated with increased susceptibility to the anti-cancer agent NSC 624044, a thiazolidine carboxylic acid derivative. Though a specific role for LCP1 in tumorogenicity had been postulated and though other thiazolidine carboxylic acid derivatives were known to inhibit tumor cell growth, there was no known relationship between this specific anti-cancer agent and gene in the biomedical literature. The significance of this work was that drug efficacy measurements could be joined to microarray measurements and potentially other types of
genomic measurements, enabling the discovery of hidden associations between drugs and genes. Fliri and colleagues joined databases in a similar manner in 2005, but instead of joining using cell lines, they used the drugs themselves (Fliri et al., 2005). Specifically, they took a database of known side-effects of drugs and joined it to a database of ligand binding assay results as measured in the presence of those same drugs. With this method, they were able to show that drugs with similar side-effect profiles actually had similar ligandbinding assays. The significance of this finding is that pre-clinical studies, such as ligand-binding assays and others, could be used to predict clinical side-effects, which, when severe enough, can lead to withdrawal of a drug from the market. Lamb and colleagues in 2006 built a database from microarrays measured across the genome after the application of 164 different small molecules to a breast cancer cell line (Lamb et al., 2006). They then queried this database of drug responses using responses measured from other experiments for which a detailed
BioDn-3 AFFX-M27830_5 D32129_f AFFX-M27830_5 BioDn-3 AFFX-M27830_5 HG3597-HT3800_f M14199 P624044 J02923
U43901_rna1
P169517
HG2743-HT2845
P685105 P637436
M83216 L18877_f HG2743-HT2846
L18920_f
U03735_f
Figure 17.3 Relevance networks calculated by joining drug activity data with gene microarray measurements. White rectangles indicate genes, and green rectangles indicate anti-cancer agents. The only discovered network involving both genes and drugs is shown in the middle. The activity of four anti-cancer agents against multiple cancer cell lines correlates with the expression pattern of a single gene across those same cancer cell lines. J02923 indicates the gene LCP1, while drug identifiers correspond to National Cancer Institute NSC identifiers (Butte et al., 2000).
Vignettes: How Specific Bioinformatics Methods can Change the Practice of Medicine
mechanism had not been known, using a measure of similarity known as a Kolmogorov-Smirnov statistic (Figure 17.4). For example, when queried with the gene expression differences between dexamethasone-treatable and dexamethasone-resistant ALL, they found that the genome-wide difference between these two types of leukemia most closely matched that of the genome-wide response to the drug sirolimus. The significance of this approach is that existing drugs could be associated with unexplained diseaserelated processes, and might even be used to treat those diseases. But most importantly, this work highlights the importance of building structured searchable repositories of experimental data. Histopathology Histopathology is the study of diseased tissue by sectioning, staining, and multi-resolution microscopy. Given high-bandwidth measurements from samples of potentially diseased tissue, bioinformatics methods can be applied to determine whether such samples do quantitatively resemble a disease (van de Rijn and Gilks, 2004). For example, Ramaswamy and colleagues tackled the problem of finding the original source of a cancer, given just a metastatic sample (Ramaswamy et al., 2001). They used a method called support vector machines, each of which was designed to recognize one specific type of cancer (Figure 17.5). This work was significant in that it was one of the first to show how multiple types of cancers could be distinguished using high-bandwidth measurements, and that unknown samples could be accurately assigned a primary diagnosis. Similar work has been published using alternative bioinformatics methods (Su et al., 2001). Fibroblasts are cells in the connective tissue that create much of the extracellular matrix between tissues and proliferate during wound healing. Though fibroblasts from different parts of the body may have similar properties and resemble each
Biological state of interest (signature)
■
other under a microscope, Chang and colleagues demonstrated that when fibroblasts are considered by high-bandwidth measurements of gene expression, significant differences between fibroblasts are seen depending on where in the body they were originally obtained. In other words, cells that appear exactly the same by microscope appear quite differently by microarray, consistent with their original source (Chang et al., 2002). Microarrays have also been used to uncover histopathological findings that might have otherwise been missed. Sarwal and colleagues used microarrays to study the process of acute rejection after kidney transplantation (Sarwal et al., 2003). After hierarchically clustering their samples, they noticed a set of B-cell specific genes that corresponded to one sub-type of acute rejection. This was particularly interesting because B-cells had not previously been thought to be involved in acute rejection. Staining for a B-cell specific cell surface protein showed large aggregations of B-cells in samples of acute rejection. The significance of this finding was the realization that while histochemistry can offer a broad survey of the cellular makeup in a sample, these cells will still remain invisible unless illuminated with the right stain. The proper stain can be found using gene expression microarrays coupled with the right bioinformatics techniques, especially the prior annotation of genes as being specific for a certain cell type. Nosology In the mid 1700s, Carl Linnaeus promoted the binomial nomenclature to classify living things into a hierarchy, or taxonomy. Though we still organize species into taxonomies based on anatomic and physiological similarities, our modern-day use of DNA sequencing and genome sequence has enabled the reorganization of the position of species within established taxonomical trees.
Reference database (profiles)
Connections
positive up
query
output
down negative strong positive
weak positive
211
null
strong negative
Figure 17.4 A reference database of high-bandwidth molecular measurements can be made after the application of various drugs, then queried with an unknown biological response, yielding those drugs sharing a similar response. From Lamb et al., 2006, Science 313:1929. Reprinted by permission from AAAS.
212
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
This is because DNA sequence similarity analysis provides a quantitative measurement of how different two species are (Baldauf et al., 2000). Linnaeus was also a co-founder of systematic nosology, or the classification of disease, for which he is not as well known (Linnâe and Schrèoder, 1763). Genomic data and bioinformatics methods have now advanced to the point that we can begin to modernize the classification of disease itself, similar to how DNA sequencing has modernized taxonomy. Crescenzi and Giuliani were one of the first to demonstrate this principle in 2001 (Crescenzi and Giuliani, 2001). They gathered samples representative of 60 cancer cell lines and used a statistical method known as principal components analysis to find combinations of genes that capture the significance variance across cancers. They then showed how analyzes using just the top five principal components could reproduce the findings from the entire dataset. In other words, Crescenzi and Giuliani were able to show that when measured using high-bandwidth measurements,
the 60 varied types of cancers differed in only a few ways that could be modeled. Instead of modeling the significant differences between cancers, Rhodes and colleagues searched for commonalities in 2004 (Rhodes et al., 2004). After collecting 40 published microarray datasets totaling more than 3700 samples of cancer, Rhodes determined a genome-wide signature representative of neoplastic transformation. As expected, most of the genes in this cancer signature were known to participate in the cell cycle, the biological process involved in cell replication. A similar study by Bild and colleagues found commonly deregulated biological pathways across cancers that correspond with worsening survival (Bild et al., 2006). This type of work is significant because it suggests that a quantitative classification of cancers, or one based on actual measurements, might be more useful than the traditional anatomic or even histopathological classification of cancer (Figure 17.6).
SAMPLES BR BL CNS CO
LE
LU
LY
ME ML OV PA PR RE UT
HIERARCHICAL CLUSTERING CEA
DATASET
SOM
MOLECULAR MARKERS
GENES
(16,063 Genes, 218 Human Tumor Samples)
BR PR LU CO LY BL ML UT LE RE PA OV ME CNS
PSA
ER 3
2 1 0 1 2 3 standard deviation from mean
Figure 17.5 Over 200 samples of 14 different types of tumors were clustered to indicate similarities in tumor types and then classified to find genes uniquely implicating each type of tumor (Ramaswamy et al., 2001).
Control
Myc
Control
Ras
50
50
150
150
200 250
200 250 2 4 6 8 10 12 14 16 18 20
E2F3
30
150
40
200
50
250
350
300
Src
60 70 2 4 6 8 10 12 14 16 18
Control
β-catenin
10 20 30 40 50 60 70 80 90 100
20
100
300 2 4 6 8 10 12 14 16 18 20
Control 10
50
100 100
Control
2 4 6 8 10 12 14 16 18
2 1.5 1 0.5 0 0.5 1 1.5 2
2 4 6 8 10 12 14 16 18
Figure 17.6 Cells made to over-express particular oncogenes were measured using gene expression microarrays, and a “signature” or pattern of gene expression changes for each oncogene was then computed and used to probe samples of cancer (Bild et al., 2006). Reprinted by permission from Macmillan Publishers Ltd: Nature 439:353, copyright 2006.
Analytic Methods
ANALYTIC METHODS The field of computer science involved in the search for patterns and the induction of rules from data is called machine learning. Current methodologies to analyze high-bandwidth measurements can broadly be divided into two categories: supervised machine learning approaches, or analysis to determine genes or proteins that fit a predetermined pattern; and unsupervised machine learning approaches, or analysis to characterize of the components of a dataset, without the a priori input or knowledge of a right answer (Butte, 2002). Supervised machine learning methods are commonly used to find those individual or sets of measured elements with measurements that (1) are significantly different between defined groups of samples, and (2) accurately predict a characteristic of the sample. An example of an application of a supervised machine learning method is the vignette described above in which Golub and colleagues found genes that distinguish ALL and AML. There are many published supervised methods that find measured elements that accurately predict characteristics of samples, such as distinguishing one type of disease from another, or a malignant disease from a benign one (Dudoit et al., 2002). There are methods that find individual measured elements, such as the nearest-neighbor approach (Golub et al., 1999), and those that find sets of multiple elements, such as decision trees (Quinlan, 1992), neural networks (Rumelhart et al., 1986), and support vector machines (Brown et al., 2000; Chow et al., 2001; Furey et al., 2000). Unsupervised machine learning methods are used to find internal structure or relationships within a dataset, instead of trying to determine how best to predict a “correct” answer. Two of the example vignettes provided above – in which Crescenzi determined those genes that best explained the differences between cancers and in which Alizadeh found sub-types of Bcell lymphoma – are good examples of the use of unsupervised machine learning methods. Within unsupervised learning, there are three classes of techniques, including (1) measured element discovery, or finding elements (e.g., genes) with interesting properties without specifically searching for a specific predefined pattern, such as by principal component analysis (Alter et al., 2000; Fiehn et al., 2000; Hilsenbeck et al., 1999; Raychaudhuri et al., 2000; Wen et al., 1998); (2) cluster discovery, or finding groups of measured features or samples with similar patterns, such as by nearestneighbor clustering (Ben-Dor et al., 2000; Golub et al., 1999), self-organizing maps (Tamayo et al., 1999;Toronen et al., 1999), k-means clustering, and one and two-dimensional hierarchical clustering (Eisen et al., 1998; Ross et al., 2000); and (3) network discovery, or finding graphs representing associations between measured elements, such as by using Boolean networks (Liang et al., 1998; Szallasi and Liang, 1998; Wuensche, 1998), Bayesian networks (Friedman et al., 2000), and relevance networks (Butte and Kohane, 1999; Butte and Kohane, 2000; Butte et al., 2000). Several recent articles have reviewed many of the available methods (Butte, 2002; Quackenbush, 2004, 2006). Here we will
■
213
consider two of the most commonly used supervised and unsupervised methods used in those publications with the greatest impact for medicine. Case-Control Studies The majority of experiments performed using high-bandwidth measurements still typically use only a handful of cases in two or three conditions, and the experimenter typically has the goal of finding those measured elements (e.g., genes) that are significantly different between these groups. Significance has and continues to be evaluated in a multitude of different methods, including parametric (Tusher et al., 2001), non-parametric (Butte et al., 2001), Bayesian (Baldi and Long, 2001), and many others. Analysis of variance has been used for measurements from samples with multiple characteristics (Pavlidis and Noble, 2001). Signal processing (Butte et al., 2002), trend detection (Tseng et al., 2005), Bayesian (Ramoni et al., 2002), non-parametric (Reis et al., 2001), and many other methods have been used for measurements from samples across time (Simon et al., 2005). The challenge in using any of these methods across the set of measured elements (e.g., genes) is compensating for the high number of measurements made. For example, even a standard t-test could be used on every gene measured in two sets of microarrays from two conditions. The null hypothesis for a particular gene would be that the gene expression level is not significantly different across the two conditions, and a p-value can be calculated to assess the ability to reject that null hypothesis for each gene. But in such a case, it would be incorrect to use the “standard” threshold of p 0.05 to determine those genes that are significantly different. Traditionally, such a threshold indicates the likelihood of calling a truly negative finding positive, otherwise known as the false positive rate or a type 1 error in statistics. For a single test, a 5% false positive rate may not be unreasonable, but when multiplied across 40,000 transcripts, such a false positive rate would be expected to yield 2000 falsely positive genes, too many to validate. With high-bandwidth measurements, one is typically more interested in the false discovery rate, rather than the false positive rate. The false discovery rate is the expected proportion of accepted type 1 errors within a list of tested hypotheses, and controls for the number of tests being made. A false discovery rate can actually be calculated for every measured element. In essence, such a rate would indicate the likelihood of false discoveries if that element were considered positive. In practice, this is called the q-value for that element (Storey and Tibshirani, 2003). Those elements with low p-values also have low q-values. Though q-values and false discovery rates are gaining in popularity, it is important to note that there have been many other proposed methods to compensate for multiple-hypothesis testing (Dudoit et al., 2003b). Sub-Type Discovery The discovery of disease sub-types using high-bandwidth measurements has been the prime motivator for personalized disease
214
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
therapeutics, where a sample of disease from an individual could be potentially measured and the most appropriate set of therapies potentially prescribed. Typically, the discovery of sub-types involves a two-step process. First, an unsupervised machine learning method is applied yielding subsets of samples (e.g., patients) with similar measurements. Many choices are available in this step, including the specific clustering method used (e.g., hierarchical clustering (Eisen et al., 1998), self-organizing maps (Tamayo et al., 1999), or others), the mathematical definition of similarity used (e.g., Pearson correlation coefficient, Spearman rank correlation, Euclidean distance, or others), and the threshold similarity at which clusters are demarcated. While this first step will yield potential sub-types of disease, the utility of these sub-types still needs to be determined. Typically, utility is defined using a marker of clinical outcome, such as survival, disease-free survival, or response duration. The two or more sub-types of disease identified in the first step are then tested statistically against the marker, using a Kaplan–Meier estimator or Cox Proportional Hazards Model. The most important point to note with these types of analyzes is the censored nature of the samples. A particular patient that is lost to follow-up can only “contribute” his or her survival up to the length of follow-up. Recent methodological work has provided numerous new methods that combine both of these steps, enabling the single-step discovery of sets of measured elements (e.g., genes) that best discriminate for censored survival or
other outcomes (Bair and Tibshirani, 2004; Nguyen and Rocke, 2002; Park et al., 2002).
WHERE DATA FOR STUDIES MAY BE FOUND Corresponding with the successful application of highbandwidth molecular measurement modalities, the amount of data in international repositories has grown exponentially, because top-tier journals require the public availability of such data and because of increased calls for transparency of raw data in publications (Nature, 2002; Ball et al., 2004; Perou, 2001). For example, the Gene Expression Omnibus (GEO) is an international repository for gene expression data, developed and maintained by the National Library of Medicine (Wheeler et al., 2004). As of this writing, GEO holds 103,000 samples (i.e., microarrays) from over 4300 experiments involving 189 species, across over 2100 types of microarrays, with 2.6 billion individual gene measurements (Figure 17.7). More impressively, GEO has been gaining data at 300% per year. GEO is not the only international repository for microarray data. ArrayExpress is a similar database of gene expression measurements supported by the European Bioinformatics Institute (Parkinson et al., 2005). As of this writing, ArrayExpress contains 42,424 samples from 1437 experiments. The Stanford Microarray Database (SMD) was the first international repository
Figure 17.7 Screen shot of the home page for the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO).
Bioinformatics Vocabularies and Ontologies
for microarray data (Ball et al., 2005). As of this writing, SMD holds 61,573 samples involving 37 species, with a total of 1.84 billion individual gene measurements. It is not clear how much overlap there is in experiments with GEO, but it is likely that over 150,000 unique samples are available for study. Recently, international repositories for proteomic experimental datasets have been instituted (Prince et al., 2004). PeptideAtlas, supported by the Institute for Systems Biology (ISB), is an example of one such repository (Desiere et al., 2006; Deutsch et al., 2005). As of this writing, PeptideAtlas holds the raw and processed mass spectra for 76 experiments. The Proteomics Identifications Database (PRIDE), supported by EBI, currently holds data from over 1655 experiments (Jones et al., 2006). Though GEO, ArrayExpress, SMD, and PeptideAtlas are already incredible resources for gene expression measurements, these measurements are poorly indexed. Even for microarray data stored in standardized formats like MIAME and MAGE-ML (Brazma et al., 2001; Spellman et al., 2002), the contextual annotations needed to determine relevance of a dataset are unfortunately still represented by unstructured narrative text. Determining the phenotypes, diseases, and environmental contexts studied by these experiments is no longer a tractable manual process. Most repository websites provide only a free-text based search facility for experimental annotations. A case example demonstrates the significance of the problem. At the time of this writing, searching on the NCBI GEO website yields 27 datasets annotated with the term “breast cancer.” However, a search for “breast tumor” yields only 8 datasets. A search for the plural, “breast tumors,” yields 9 datasets, while a search for “breast carcinoma” yields only 6 datasets. Breast cancer is one of the most important cancers affecting women, but searching for existing microarray datasets relevant to this important disease yields arbitrary results and is fraught with false negatives. The basic problem is that data in these repositories are not consistently annotated with terms from a standardized vocabulary. While some initial work has been done in re-annotating these datasets using a standardized vocabulary (Butte and Kohane, 2006), investigators interested in finding relevant data will have to try multiple terms to comprehensively find data. Another caveat is that much of the most relevant clinical data associated with high-bandwidth measurements may not be stored in international repositories. For example, a dataset of microarray measurements on patients with varying lengths of survival after cancer diagnosis may be available in GEO, but the survival times may not be. Such clinical data may be available on an investigator’s laboratory website or in an associated publication, but is often not publicly available at all.
BIOINFORMATICS VOCABULARIES AND ONTOLOGIES A cell biologist interested in finding clinical samples relevant to a disease being studied in the lab will not find those samples unless she can “speak the same language” as the pathologists
■
215
storing and indexing the samples. A physician interested in learning more about genes that are significantly different in his experiments will not be able to learn more about the pathways implicated by those genes, unless he can “speak the same language” as those geneticists providing annotative information for those genes. In the application of bioinformatics to genomic and personalized medicine, it is often the case that shifting focus like this crucially depends on the use of a standardized vocabulary. The formal modeling and representation of concepts of knowledge, the terms and other attributes used to describe these concepts, and the relationships of these concepts, is called a ontology. The importance of using proper ontologies in biomedical research has been described in recent publications (Blake, 2004; Soldatova and King, 2005). Here, we will describe some of the more commonly used ontologies used in the research of genomic medicine (Table 17.2). Clinical vocabularies The largest disease ontology (or nosology) in use today is the Systemized Nomenclature of Medicine-Clinical Term (SNOMED-CT) ontology. SNOMED-CT has a lineage spanning 75 years and is used by pathologists world-wide (Chute, 2000). SNOMED-CT has over 340,000 biomedical concepts in 18 hierarchies with 1.3 million relationships that classify diseases based on syndromic and pathophysiological mechanisms (Cho et al., 1998). For example, relations in SNOMED-CT assert that clear cell carcinoma of the kidney is a malignant tumor of the kidney (i.e., a type of tumor of a particular organ) and separately is also
TABLE 17.2 Some of the ontologies and vocabularies in use in bioinformatics Ontology and vocabulary
Use
Systemized Nomenclature of Medicine-Clinical Term (SNOMED-CT)
Pathological based classification of disease
International Classification of Diseases (ICD) with Clinical Modification
Diagnosis codes for billing, public health, and epidemiology
Gene Ontology (GO)
Molecular functions, biological processes, and cellular components of proteins
Open Biomedical Ontologies (OBO)
Includes Gene Ontology and many other ontologies describing cell types and anatomy
InterDom
Protein–protein interactions
InterPro
Protein domains
Unified Medical Language System (UMLS)
Unified compendium of 140 biomedical vocabularies
216
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
a malignant neoplasm of the retroperitoneum (i.e., a type of tumor in a particular anatomical location). Another disease classification more familiar to physicians is the International Classification of Diseases (ICD) with Clinical Modification, a nosology with origins in the 1850s. ICD-9-CM has nearly 7000 codes, while ICD-10-CM has over 14,000 codes to describe causes of morbidity and mortality (World Health Organization, 2005). Both ICD-9-CM and ICD-10-CM are in use by various health care providers, the US Federal Government, and the World Health Organization. Though most health care institutions in the United States use ICD-9, ICD-10 was recently approved for use by the US Congress. To continue the example above, relations in ICD-9-CM assert that a malignant neoplasm of the kidney is a malignant neoplasm of the genitourinary organs. Importantly, however, the limited resolution in ICD-9-CM does not permit the specific diagnosis of clear cell carcinoma of the kidney. There are several important differences between the SNOMED-CT and the ICD nosologies. As shown above, SNOMED-CT has greater expressiveness than ICD to represent diseases. SNOMED provides a rich set of relationships between medical concepts. ICD provides only a hierarchical relationship between concepts. Despite their differences, it is important to note that both SNOMED-CT and ICD are viewed as administrative nosologies, typically used for billing, reporting, electronic medical records, and decision support systems. However, important sources of stored tissues in medical institutions, such as frozen tissue banks, may only be accessible through SNOMED or ICD codes, and these coding systems commonly serve as an index into these repositories. Genomic Ontologies The Gene Ontology (GO) is a taxonomy that is used to describe the normal molecular function of proteins, the cellular components in which proteins operate, and the larger biological processes in which they participate (Ashburner et al., 2000). At the time of this writing, GO currently holds 7470 molecular functions, 1823 cellular components, and 12,250 biological processes. Over 186,000 genes across a variety of species have been assigned to one or more GO categories, but this essentially translates to one-third of human genes having some coded function. The easiest way to find the GO classification for a gene is to execute a query using Entrez Gene. For example, a query for the gene coding for the insulin receptor (INSR) yields 13 molecular functions (e.g., phosphoinositide 3-kinase binding), 7 biological processes (e.g., carbohydrate metabolism), and 2 cellular components. It is important to note that an actual publication detailing the evidence for each annotation is available for many GO annotations, and these publications can be easily referenced. It is also important to note that the GO annotations for genes can change quickly, as both the annotations for genes and GO itself continue to be “works in progress.” The GO is actually one of 61 ontologies in the set known as Open Biomedical Ontologies (OBO) (Mungall, 2004). These ontologies are open, in the sense that they can be freely downloaded and used without constraints. Other ontologies in OBO include ontologies to describe the gross anatomy of flies,
worms, and mice, cell types, and functional portions of proteins, as well as ontologies to describe cancers and other concepts related to humans. Other vocabularies used for gene and protein annotation include InterDom (Ng et al., 2003), InterPro (Mulder et al., 2003), and PRINTS (Attwood et al., 2003) for protein domains, and other gene identifiers include the stable NCBI Gene identifier (Wheeler et al., 2006) and the unstable NCBI UniGene identifier (Wheeler et al., 2000). Unified Medical Language System The Unified Medical Language System (UMLS) is the largest available compendium of biomedical vocabularies, containing 140 biomedical vocabularies with over 1.2 million concepts and 41 million relations between concepts (Bodenreider, 2004; Butte and Kohane, 2006). UMLS already unifies vocabularies used extensively in molecular biology and genomics, such as the Medical Subject Headings (MeSH), NCBI Taxonomy, and the GO, with medical vocabularies including the International Classification of Diseases and SNOMED Clinical Terms (Ashburner et al., 2000; ICD, 2003; Wheeler et al., 2004). The UMLS is a unified vocabulary, which means that concepts listed in multiple vocabularies are brought together. For example, the concept of fever, which holds the UMLS concept unique identifier of C0015967, is represented in 78 component vocabularies, such as D005334 in MeSH, 386661006 and 50177009 in SNOMED-CT, 780.6 in ICD-9-CM, 10016558 in the Medical Dictionary for Regulatory Activities Terminology (MedDRA), X25 in Perioperative Nursing DataSet, GO:0001660 in GO, and U001776 in the Library of Congress Subject Headings. Each of these individual codes represents the same concept, and thus have been unified to a single biomedical concept in UMLS. Because of its ability to span nearly every other relevant vocabulary, UMLS best serves as a bridging vocabulary, providing terms commonly used by both physicians as well as molecular biologists. Given the support for UMLS and its component vocabularies shown by the National Library of Medicine and the US Department of Health and Human Services, it is likely that terms and concepts from UMLS will constitute important portions of the future electronic health records of patients in the United States (see also Chapter 19). As samples from patients continue to be valuable for study using high-bandwidth measurements, tapping into the medical records associated with these samples will only grow in importance and UMLS can help enable these secondary research uses of medical records.
FREELY AVAILABLE BIOINFORMATICS TOOLS There is an incredible preponderance of software tools freely available for research in bioinformatics. As an example, as of this writing, over 300 articles are published each year specifically describing bioinformatics software tools in the journals BMC Bioinformatics, Bioinformatics, Nucleic Acids Research, and Genome Biology. The vast majority of these are freely available for
Freely Available Bioinformatics Tools
■
217
Figure 17.8 Screen shot of the MeV (Saeed et al., 2006). Reprinted from Methods in Enzymology, Volume 411, Saeed, A.I., Bhagabati, N.K., Braisted, J.C., Liang, W., Sharov, V., Howe, E.A., Li, J., “TM4 microarray software suite”, 134–93, copyright 2006, with permission from Elsevier.
researchers. Describing all of the available free tools is clearly intractable, though a recent book and paper serve as nice collections of most of the commonly used tools (Dudoit et al., 2003a; Parmigiani, 2003). Here instead, we will arbitrarily focus on a few important tools useful in the study of genomic and personalized medicine, dividing our set into two groups: those useful in the analysis of high-bandwidth data and those useful in the interpretation of results. Analytic Tools The Multiexperiment Viewer (MeV) is an easy-to-use analytic tool initially developed at the Institute for Genomic Research and currently maintained at the Dana Farber Cancer Institute (Saeed et al., 2006). It specifically supports 17 tools for the analysis of gene expression microarrays, and presents these tools using easy-touse on screen buttons (Figure 17.8). It runs under the Macintosh, Windows, and Linux operating systems, and the programming source code is available and can be modified by bioinformatics programmers. At the time of this writing, MeV has been cited by nearly 200 publications and has been extensively documented. GenePattern is a sophisticated analytic tool developed and maintained by the Broad Institute (Reich et al., 2006). GenePattern supports the analysis of multiple types of high-bandwidth measurements, including gene expression microarrays, proteomics, and single nucleotide polymorphisms. Each analytic and visualization tool is contained within a module, which can be chained together to represent a reproducible pipeline for analysis (Figure 17.9).
GenePattern runs under the Macintosh, Windows, and Linux operating systems. GenePattern is well-documented and supported, and as of this writing, approximately 30 papers have cited its use. GenePattern clearly excels in its advanced features. One particularly useful feature of GenePattern is that it offers a modular architecture allowing users to plug in new analytic tools or visualization methods as they become available. These tools can be created to interface with other bioinformatics analytic tools, including Matlab and R and Bioconductor (described below). GenePattern can distribute the work required for an analysis onto a parallel computer cluster, which, if available, can speed the time required to complete an analysis. In addition, GenePattern supports the output of measurements into the standardized formats necessary for submission to international repositories, a step that is increasingly required before a top-tier journal accepts a manuscript. At the opposite extreme of user-friendliness lies the statistical computing and graphics environment known as R (R Development Core Team, 2004). Based on the S statistical computing environment developed by Bell Labs in the late 1970s, R is the most flexible of the tools listed here. R is almost essentially “command line” driven, in that commands to execute statistical and visualization tools are typed and interpreted serially, but most users write scripts and sophisticated programs that can be run to generate the appropriate output and figures. There is essentially no user-interface (i.e., menus and windows) that can be used to perform analyzes; getting up-to-speed with R involves climbing a steep learning curve.
218
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
Analytical methodology
Pipeline representation and results
1 Heat map
Gene expression data set
Gene list significance
Pre-processing
2 Gene selection
Marker genes
1
Build predictor
Cross-validation results
2
3
Model
Cross-validation results
Test set prediction results
Test data set
Test set prediction results
3
Figure 17.9 The conceptual analytic “pipeline” needed to analyze high-bandwidth molecular data is represented on the left, while its computational equivalent is represented on the right. Using GenePattern, these steps can be coded and documented, yielding reproducible research results (Reich et al., 2006). Reprinted by permission from Macmillan Publishers Ltd, Nature Genetics 38, 500, copyright 2006.
Despite the severe user-interface handicaps, the most important advantage of R is its community of users. Because of its designed similarity to S, users of R can benefit from decades of developed analytic tools (some statistical tool plug-ins for R and S available on the Internet date back to the late 1980s). The most important set of plug-in tools for research in genomic medicine are those contained in the Bioconductor package, which offers tools for reading microarray data, processing raw data files, common supervised and unsupervised machine learning methods, and visualization methods (Gentleman et al., 2004). More than 50 books have been written that explain both commonly-used and cutting-edge statistical techniques using R (and S), including a few specifically for bioinformatics and Bioconductor (Gentleman et al., 2004; Parmigiani, 2003). R runs
under the Macintosh, Windows, and Linux operating systems and its source code is available. Intepretation After several analyses, it quickly becomes obvious that the rate-limiting step in experiments involving high-bandwidth measurements is neither the handling of the biological samples nor the actual statistical or numerical work, but instead the post-analytic work in determining what the results actually mean. First, detailed names and knowledge might not yet exist for genes and proteins that have been found to be significantly involved in an experiment, even though these elements may have already been measured for many years. This complicates the interpretation of results. The official gene name, predicted
Freely Available Bioinformatics Tools
■
219
TCA cycle glycolysis fatty-acid degradation triacylglycerol
view same data set on hundreds of MAPPs
lipoprotein lipase glycerol
fatty acid
glycerol kinase
fatty acid CoA ligase
L-glycerol-3-phosphate
acyl-CoA
glycerol-3-PO4 dehydrogenase
carnitine acetyltransferase carnitine palmitoyltransferase
dihydroxyacetone phosphate
acyl-CoA
triosephosphate isomerase
acyl-CoA dehydrogenases: short chain medium chain long chain very long chain
glyceraldehyde-3-phosphate
see glycolysis MAPP
legend upregulated, P 0.05 downregulated, P 0.05 no criteria met not found
see TCA cycle MAPP
acyl-CoA (n2)
acetyl-CoA
thiolase
trans-Δ2-enoyl-CoA
3-ketoacyl-CoA
enoyl-CoA hydratase
3-OH-acyl-CoA dehydrogenase
click on genes for annotation, data and hyperlinks
3-OH-acyl-CoA
Figure 17.10 GenMAPP has the ability to color-code proteins participating in biological pathways based on measurements (Dahlquist et al., 2002). Reprinted by permission from Macmillan Publishers Ltd, Nature Genetics 31, 19, copyright 2002.
protein domains, or GO classification may become available for a gene or protein as early as tomorrow, or as late as years from now. Operationally, this means that one is never done analyzing a set of microarray data. One has to develop the infrastructure to constantly reinvestigate genes and proteins from analyzes performed in the past. It may be next week, for example, that new knowledge about a protein that was significantly implicated in an analysis performed three months ago gets published, finally leading to a novel and important finding. The challenges in determining the proper analytic methods to use are usually only a short-term difficulty, and typically after the “functional genomics pipeline” has been established, the rate-limiting step shifts to the post-analytic challenges (Kohane et al., 2002). In the future, truly demonstrating a “return on investment” from high-bandwidth molecular measurements depends on taking findings past the measurement stage and integrating them with the rest of the research pipeline, including interpretation and validation. The list of elements resulting from an analysis should not be viewed as an end in itself; its real value only increases as that list moves through biological validation, ranging from the numerical verification of measurement levels with alternative techniques, to the ascertainment of the meaning
of the results, such as finding common mechanisms or biological pathways involving the genes or proteins. While tools that link measurement elements back to known biological pathways are still in their infancy, some have been shown to be quite useful. GenMAPP is a software tool that can visualize of highbandwidth molecular measurements using pre-drawn diagrams illustrating biological pathways (Dahlquist et al., 2002). Though GenMAPP comes with nearly a thousand useful biological pathways across numerous species, GenMAPP also comes with an editor enabling a biologist or bioinformatician to draw any new pathway (Figure 17.10). Because of this, the GenMAPP user base has developed into a large community of users who share pathways. GenMAPP runs only under the Windows operating system. At the time of this writing, GenMAPP has been cited by nearly 600 publications and has been extensively documented. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an extensive web-based database of biological pathways and functions under development for more than a decade (Kanehisa, 1997). Similar to GenMAPP, KEGG offers tools that can color-code diagrams of biological pathways based on quantitative measurements, instead using a web-based interface (Figure 17.11). KEGG offers more than 42,000 biological pathways for nearly 500 species.
220
CHAPTER 17
Bioinformatic and Computational Analysis for Genomic Medicine
■
INSULIN SIGNALING PATHWAY
Flotillin APS
CAP
CrkII
Cb1
GRF2
Exo70
GLUT4 vesicle
TC10
Glucose homeostasis
Translocation to PM
Glucose uptake
GLUT4
CIP4/2
Glucose p
p p AMPK aPKC
SKIP
GK PGC-1 p
p
py
DNA
ps
IRS
PI3K
PDK1/2
p
p
Akt
GSK-3β
PIP3
py
p Phosphatidyl inositol signaling system
Glycolysis/ Gluconeogenesis
FBP
FOXO1
IKK
p py
Glycolysis G6PC
JNK
INSR
PYK
DNA
PTP1B
INS
Lipogenesis Fatty acid biosynthesis (path 1)
PFK
SREBP-1c
p
SOCS
ACC FAS
SHIP
p
PEPCK
Glycogen GYS
Glycogenesis
p p
PP1
PHK
p
Starch and sucrose metabolism
PYG
TSC2 TSC1
p
PKA
PDE3
p
Antilipolysis
HSL
cAMP Rheb
LAR
p
p
Antiapoptosis
BAD
p mTOR
MAPK signaling pathway
p
p p GRB2
SOS
S6
Raptor
p SHC
p
p70S6K
Lipid homeostasis
Apoptosis
Ras
Raf
4EBP1 p
MEK1/2
Protein synthesis
eIF4E p
p MNK
ERK1/2 p
ElK1 DNA
Proliferation, differentiation
Figure 17.11 The insulin signaling pathway, as represented in KEGG as of this writing. A click on any protein yields additional data about that protein.
Beyond biological pathways, KEGG also provides knowledge on ligands, metabolites, small-molecule drugs, and diseases. At the time of this writing, KEGG has been cited by nearly 1000 publications. Other free web-based resources available to study genes by known function and protein domains include Entrez Gene and GeneCards (Safran et al., 2002;Wheeler et al., 2006).
NEW QUESTIONS FOR GENOMIC MEDICINE While this chapter has demonstrated the utility of bioinformatics and computational biology for answering specific questions for genomic medicine, it is clear that future revolutionary discoveries are going to come from those who can handle, manage, and exploit the exponential growth in measurement methods and data. Even as newer high-bandwidth systems become available for measuring proteins, lipids, and protein-interactions, it is clear that significant discoveries can be enabled by putting measurement-systems and data together, an area of research informally termed integrative biology. Here, we will describe some significant studies that involved the integration of large-scale measurements across modalities. In one integration study, Schadt and colleagues started with gene expression differences in liver between two inbred strains of mice. They used the most significantly different genes as traits
which were then mapped as quantitative trait loci using genetics. They first highlighted genes with known polymorphisms that affected their transcript levels, then showed loci associated with fat pad mass (and thus relevant to obesity) that would otherwise have been insignificant had the expression data not been considered (Schadt et al., 2003a, b). In other words, their association of genes with obesity required both microarray and genetic tools. Mootha and colleagues integrated four publicly available expression datasets with linkage data and proteins identified from mitochondria to ascertain the gene and mutation responsible for Leigh syndrome, French-Canadian type (Mootha et al., 2003). Stoll and colleagues integrated 125 phenotypes with linkage data from rats to determine candidate genes potentially involved in cardiovascular function (Stoll et al., 2001). Chiang and colleagues took genes known to be associated with Bardet-Biedl syndrome and found a set of species in which these genes were conserved and other species in which these genes did not exist. Using this comparative genomic information, Chiang was able to predict and discover a novel gene with a mutation for Bardet-Biedl syndrome (Chiang et al., 2004). English collected 49 genome-scale experiments all related to the study of obesity, but of varying types, including gene expression microarrays, proteomics, RNA interference, and genetic scans. She evaluated the sensitivity and specificity of each against a gold-standard list of known obesity-associated genes. While the sensitivity and specificity of each independent experiment was
References
poor, she found that a simple integrative model could statistically significantly outperform each of the independent experiments in rediscovering known obesity-associated genes. Moreover, she found that the likelihood of discovering a novel obesity-associated gene increased when pairs of genome-scale experiments were considered, instead of individual experiments. This work showed that integrating experiments performed better than single experiments at finding complex disease-associated gene variants, and that it is now possible to predict how likely a gene is to have a variant associated with a complex condition (English and Butte, 2007). These examples are given to demonstrate that exploration into the integration of multiple high-bandwidth modalities has started. Instead of focusing on the cell, or the genotype ranging from one locus to the entire genome, or on any single measurement modality, integrative biology allows one to think holistically and horizontally. A disease such as diabetes can lead to myocardial infarction, nephropathy, and neuropathy; to study diabetes in the era of genomic and personalized medicine would require reasoning from a disease to all its various complications to the genome and
■
221
back, across many vantage points and cellular models that bridge the genome sequence, gene expression, proteomics, metabolomics, and phenotype. As we have shown in this chapter, there is an enormous supply of prior data that can be quite potent, if it can be brought to bear on the right problems, with the right tools. Though it has yet to be formally proven, the future study of complex diseases is highly likely to benefit from integrative biology and bioinformatics.
ACKNOWLEDGEMENTS The work was supported by grants from the Lucile Packard Foundation for Children’s Health, National Library of Medicine (K22 LM008261), National Institute of General Medical Sciences (R01 GM079719), Howard Hughes Medical Institute, and the Pharmaceutical Research and Manufacturers of America Foundation. The author is a scientific advisor to Genstruct, Inc., a computational systems biology company.
REFERENCES Adida, B. and Kohane, I.S. (2006). GenePING: secure, scalable management of personal genomic data. BMC Genomics 7, 93. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511. Alter, O., Brown, P.O. and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97, 10101–10106. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. (2000). Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–29. Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., Nordle, A., Paine, K., Taylor, P. et al. (2003). PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res 31, 400–402. Bair, E. and Tibshirani, R. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2, E108. Baldauf , S.L., Roger, A.J., Wenk-Siefert, I. and Doolittle, W.F. (2000). A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290, 972–977. Baldi, P. and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509–519. Ball, C.A., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J.C., Parkinson, H., Quackenbush, J., Ringwald, M. et al. (2004). Submission of microarray data to public repositories. PLoS Biol 2, E317. Ball, C.A., Awad, I.A., Demeter, J., Gollub, J., Hebert, J.M., HernandezBoussard, T., Jin, H., Matese, J.C., Nitzberg, M., Wymore, F. et al. (2005). The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 33, D580–D582.
Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z. (2000). Tissue classification with gene expression profiles. J Comput Biol 7, 559–583. Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M. et al. (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98, 13790–13795. Bild, A.H., Yao, G., Chang, J.T., Wang, Q., Potti, A., Chasse, D., Joshi, M.B., Harpole, D., Lancaster, J.M., Berchuck, A. et al. (2006). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439, 353–357. Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A. et al. (2000). Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536–540. Blake, J. (2004). Bio-ontologies-fast and furious. Nat Biotechnol 22, 773–774. Bodenreider, O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Res 32(Database issue), D267–D270. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C. et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29, 365–371. Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M, Jr. and Haussler, D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97, 262–267. Brunner, H.G. and van Driel, M.A. (2004). From syndrome families to functional genomics. Nat Rev Genet 5, 545–551. Butte, A. (2002). The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–960.
222
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
Butte, A. and Kohane, I. (1999). Unsupervised knowledge discovery in medical databases using relevance networks. In Fall Symposium, American Medical Informatics Association (N. Lorenzi, ed.), Hanley and Belfus,Washington, DC, pp. 711–715. Butte, A.J. and Kohane, I.S. (2000). Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 418–429. Butte, A.J. and Kohane, I.S. (2006). Creation and implications of a phenome-genome network. Nat Biotechnol 24, 55–62. Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R. and Kohane, I.S. (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 97, 12182–12186. Butte, A.J.,Ye, J., Niederfellner, G., Rett, K., Häring, H., White, M.F. and Kohane, I.S. (2001). Determining significant fold differences in gene expression analysis. Pac Symp Biocomput, 6–17. Butte, A.J., Bao, L., Reis, B.Y., Watkins, T.W. and Kohane, I.S. (2002). Comparing the similarity of time-series gene expression using signal processing metrics. J Biomed Inform 34, 396–405. Centers for Disease Control and Prevention (2004). Genetic testing for breast and ovarian cancer susceptibility: Evaluating direct-toconsumer marketing – Atlanta, Denver, Raleigh-Durham, and Seattle, 2003. MMWR Morb Mortal Wkly Rep 53, 603–606. Chang, H.Y., Chi, J.T., Dudoit, S., Bondre, C., van de Rijn, M., Botstein, D. and Brown, P.O. (2002). Diversity, topographic differentiation, and positional memory in human fibroblasts. Proc Natl Acad Sci USA 99, 12877–12882. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S. and Fodor, S.P. (1996). Accessing genetic information with high-density DNA arrays. Science 274, 610–614. Chiang, A.P., Nishimura, D., Searby, C., Elbedour, K., Carmi, R., Ferguson, A.L., Secrist, J., Braun, T., Casavant, T., Stone, E.M. et al. (2004). Comparative genomic analysis identifies an ADPribosylation factor-like gene as the cause of Bardet-Biedl syndrome (BBS3). Am J Hum Genet 75, 475–484. Cho, R.J., Fromont-Racine, M., Wodicka, L., Feierbach, B., Stearns, T., Legrain, P., Lockhart, D.J. and Davis, R.W. (1998). Parallel analysis of genetic selections using whole genome oligonucleotide arrays. Proc Natl Acad Sci USA 95, 3752–3757. Chow, M.L., Moler, E.J. and Mian, I.S. (2001). Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol Genomics 5, 99–111. Chute, C.G. (2000). Clinical classification and terminology: Some history and current observations. J Am Med Inform Assoc 7, 298–303. Clayton, T.A., Lindon, J.C., Cloarec, O., Antti, H., Charuel, C., Hanton, G., Provost, J.P., Le Net, J.L., Baker, D., Walley, R.J. et al. (2006). Pharmaco-metabonomic phenotyping and personalized drug treatment. Nature 440, 1073–1077. Crescenzi, M. and Giuliani, A. (2001). The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data. FEBS Lett 507, 114–118. Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S.C. and Conklin, B.R. (2002). GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31, 19–20. Debouck, C. and Goodfellow, P.N. (1999). DNA microarrays in drug discovery and development. Nat Genet 21, 48–50. DeRisi, J., Penland, L., Brown, P.O., Bittner, M.L., Meltzer, P.S., Ray, M., Chen, Y., Su, Y.A. and Trent, J.M. (1996). Use of a cDNA
microarray to analyse gene expression patterns in human cancer. Nat Genet 14, 457–460. Desiere, F., Deutsch, E.W., King, N.L., Nesvizhskii, A.I., Mallick, P., Eng, J., Chen, S., Eddes, J., Loevenich, S.N. and Aebersold, R. (2006). The PeptideAtlas project. Nucleic Acids Res 34, D655–D658. Deutsch, E.W., Eng, J.K., Zhang, H., King, N.L., Nesvizhskii, A.I., Lin, B., Lee, H., Yi, E.C., Ossola, R. and Aebersold, R. (2005). Human Plasma PeptideAtlas. Proteomics 5, 3497–3500. Dudoit, S., Yang, Y.H., Callow, M.J. and Speed, T.P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA Microarray experiments. Statistica Sinica 12, 111–139. Dudoit, S., Gentleman, R.C. and Quackenbush, J. (2003a). Open source software for the analysis of microarray data. Biotechniques(Suppl), 45–51. Dudoit, S., Shaffer, J.P. and Boldrick, J.C. (2003b). Multiple hypothesis testing in microarray experiments. Statis. Sci. 18, 71–103. Edwards, J.S., Ibarra, R.U. and Palsson, B.O. (2001). In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data. Nat Biotechnol 19, 125–130. Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–14868. English, S.B. and Butte, A.J. (2007). Evaluation and integration of 49 genome-wide experiments and the prediction of previously unknown obesity-related genes. Bioinformatics 23, 2910–2917. Espina, V., Mehta, A.I., Winters, M.E., Calvert, V., Wulfkuhle, J., Petricoin, E.F., 3rd, and Liotta, L.A. (2003). Protein microarrays: Molecular profiling technologies for clinical specimens. Proteomics 3, 2091–2100. Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R.N. and Willmitzer, L. (2000). Metabolite profiling for plant functional genomics. Nat Biotechnol 18, 1157–1161. Fliri, A.F., Loging, W.T., Thadeio, P.F. and Volkmann, R.A. (2005). Analysis of drug-induced effect patterns to link structure and side effects of medicines. Nat Chem Biol 1, 389–397. Freimer, N. and Sabatti, C. (2003). The human phenome project. Nat Genet 34, 15–21. Friedman, N., Linial, M., Nachman, I. and Pe’er, D. (2000). Using Bayesian networks to analyze expression data. J Comput Biol 7, 601–620. Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M. and Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–914. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 5, R80. Ghaemmaghami, S., Huh, W.K., Bower, K., Howson, R.W., Belle, A., Dephoure, N., O’Shea, E.K. and Weissman, J.S. (2003). Global analysis of protein expression in yeast. Nature 425, 737–741. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537. Gygi, S.P., Rist, B., Gerber, S.A., Turecek, F., Gelb, M.H. and Aebersold, R. (1999). Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17, 994–999.
References
Henneman, L., Timmermans, D.R. and Van Der Wal, G. (2006). Public attitudes toward genetic testing: Perceived benefits and objections. Genet Test 10, 139–145. Hestvik, A.L., Hmama, Z. and Av-Gay, Y. (2003). Kinome analysis of host response to mycobacterial infection: A novel technique in proteomics. Infect Immun 71, 5514–5522. Hilsenbeck, S.G., Friedrichs, W.E., Schiff, R., O’Connell, P., Hansen, R.K., Osborne, C.K. and Fuqua, S.A. (1999). Statistical analysis of array expression data as applied to the problem of tamoxifen resistance. J Natl Cancer Inst 91, 453–459. ICD (2003). International Classification of Diseases: 9th revision, Clinical Modification (ICD-9-CM), 6th edition. Centers for Medicare & Medicaid Services,Washington DC. Jessani, N., Liu, Y., Humphrey, M. and Cravatt, B.F. (2002). Enzyme activity profiles of the secreted and membrane proteome that depict cancer cell invasiveness. Proc Natl Acad Sci USA 99, 10335–10340. Johnson, G.C., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova, G., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F. et al. (2001). Haplotype tagging for the identification of common disease genes. Nat Genet 29, 233–237. Jones, P., Cote, R.G., Martens, L., Quinn, A.F., Taylor, C.F., Derache, W., Hermjakob, H. and Apweiler, R. (2006). PRIDE: A public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res 34, D659–D663. Kanehisa, M. (1997). A database for post-genome analysis. Trends Genet 13, 375–376. Khoury, M.J., McCabe, L.L. and McCabe, E.R. (2003). Population screening in the age of genomic medicine. N Engl J Med 348, 50–58. Kohane, I.S., Kho, A.T. and Butte, A.J. (2002). Microarrays for an integrative genomics. MIT Press, Cambridge, Massachusetts. Kohane, I.S., Masys, D.R. and Altman, R.B. (2006). The incidentalome: A threat to genomic medicine. JAMA 296, 212–215. Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P., Subramanian, A., Ross, K.N. et al. (2006). The Connectivity Map: Using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935. Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I. et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804. Liang, S., Fuhrman, S. and Somogyi, R. (1998). Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput, 18–29. Linnâe, C.V. and Schrèoder, J. (1763). Genera morborum, apud C. E. Steinert, Upsaliμ. Liotta, L.A., Espina, V., Mehta, A.I., Calvert, V., Rosenblatt, K., Geho, D., Munson, P.J., Young, L., Wulfkuhle, J. and Petricoin, E.F., 3rd (2003). Protein microarrays: Meeting analytical challenges for clinical applications. Cancer Cell 3, 317–325. Mootha, V.K., Lepage, P., Miller, K., Bunkenborg, J., Reich, M., Hjerrild, M., Delmonte, T., Villeneuve, A., Sladek, R., Xu, F. et al. (2003). Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci USA 100, 605–610. Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P. et al. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 31, 315–318.
■
223
Mungall, C.J. (2004). Obol: Integrating language and Meaning in BioOntologies. Comp Funct Genom 5, 509–520. Nature (2002). Microarray standards at last. Nature 419, 323. Ng, S.K., Zhang, Z., Tan, S.H. and Lin, K. (2003). InterDom: A database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res 31, 251–254. Nguyen, D.V. and Rocke, D.M. (2002). Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 18, 1625–1632. Odom, D.T., Zizlsperger, N., Gordon, D.B., Bell, G.W., Rinaldi, N.J., Murray, H.L., Volkert, T.L., Schreiber, J., Rolfe, P.A., Gifford, D.K. et al. (2004). Control of pancreas and liver gene expression by HNF transcription factors. Science 303, 1378–1381. Park, P.J., Tian, L. and Kohane, I.S. (2002). Linking gene expression data with patient survival times using partial least squares. Bioinformatics 18(Suppl 1), S120–S127. Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Coulson, R., Farne, A., Lara, G.G., Holloway, E., Kapushesky, M. et al. (2005). ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 33, D553–D555. Parmigiani, G. (2003). The analysis of gene expression data : Methods and software. Springer, New York. Pavlidis, P. and Noble, W.S. (2001). Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2. Perou, C.M. (2001). Show me the data! Nat Genet 29, 373. Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C. et al. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577. Potti, A., Mukherjee, S., Petersen, R., Dressman, H.K., Bild, A., Koontz, J., Kratzke, R., Watson, M.A., Kelley, M., Ginsburg, G.S. et al. (2006). A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med 355, 570–580. Prince, J.T., Carlson, M.W., Wang, R., Lu, P. and Marcotte, E.M. (2004). The need for a public proteomics repository. Nat Biotechnol 22, 471–472. Quackenbush, J. (2004). Data standards for “omic” science. Nat Biotechnol 22, 613–614. Quackenbush, J. (2006). Microarray analysis and tumor classification. N Engl J Med 354, 2463–2472. Quinlan, J. (1992). C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, California. R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P. et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98, 15149–15154. Ramoni, M.F., Sebastiani, P. and Kohane, I.S. (2002). Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 99, 9121–9126. Raychaudhuri, S., Stuart, J.M. and Altman, R.B. (2000). Principal components analysis to summarize microarray experiments: Application to sporulation time series. Pac Symp Biocomput, 455–466. Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo, P. and Mesirov, J.P. (2006). GenePattern 2.0. Nat Genet 38, 500–501. Reis, B.Y., Butte, A.J. and Kohane, I.S. (2001). Extracting knowledge from dynamics in gene expression. J Biomed Inform 34, 15–27.
224
CHAPTER 17
■
Bioinformatic and Computational Analysis for Genomic Medicine
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E. et al. (2000). Genome-wide location and function of DNA binding proteins. Science 290, 2306–2309. Rhodes, D.R.,Yu, J., Shanker, K., Deshpande, N.,Varambally, R., Ghosh, D., Barrette, T., Pandey, A. and Chinnaiyan, A.M. (2004). Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 101, 9309–9314. Robinson, W.H., DiGennaro, C., Hueber, W., Haab, B.B., Kamachi, M., Dean, E.J., Fournel, S., Fong, D., Genovese, M.C., de Vegvar, H.E. et al. (2002). Autoantigen microarrays for multiplex characterization of autoantibody responses. Nat Med 8, 295–301. Roche, P.A. and Annas, G.J. (2001). Protecting genetic privacy. Nat Rev Genet 2, 392–396. Roche, P.A. and Annas, G.J. (2006). DNA testing, banking, and genetic privacy. N Engl J Med 355, 545–546. Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M. et al. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24, 227–235. Rumelhart, D. and McClelland, J. University of California SDPRG (1986). Parallel distributed processing: Explorations in the microstructure of cognition. MIT Press, Cambridge, Mass. Saeed,A.I., Bhagabati, N.K., Braisted, J.C., Liang,W., Sharov,V., Howe, E.A., Li, J.,Thiagarajan, M.,White, J.A. and Quackenbush, J. (2006). TM4 microarray software suite. Methods Enzymol 411, 134–193. Safran, M., Solomon, I., Shmueli, O., Lapidot, M., Shen-Orr, S., Adato, A., Ben-Dor, U., Esterman, N., Rosen, N., Peter, I. et al. (2002). GeneCards(TM) 2002: Towards a complete, objectoriented, human gene compendium. Bioinformatics 18, 1542–1543. Sarwal, M., Chua, M.S., Kambham, N., Hsieh, S.C., Satterwhite, T., Masek, M. and Salvatierra, O, Jr. (2003). Molecular heterogeneity in acute renal allograft rejection identified by DNA microarray profiling. N Engl J Med 349, 125–138. Schadt, E.E., Monks, S.A., Drake, T.A., Lusis, A.J., Che, N., Colinayo, V., Ruff , T.G., Milligan, S.B., Lamb, J.R., Cavet, G. et al. (2003a). Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302. Schadt, E.E., Monks, S.A. and Friend, S.H. (2003b). A new paradigm for drug discovery: Integrating clinical, genetic, genomic and molecular phenotype data to identify drug targets. Biochem Soc Trans 31, 437–443. Simon, I., Siegfried, Z., Ernst, J. and Bar-Joseph, Z. (2005). Combined static and dynamic analysis for determining the quality of timeseries expression profiles. Nat Biotechnol 23, 1503–1508. Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209. Soldatova, L.N. and King, R.D. (2005). Are the current ontologies in biology good ontologies? Nat Biotechnol 23, 1095–1098. Sorlie, T., Perou, C.M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S. et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 98, 10869–10874. Spellman, P.T., Miller, M., Stewart, J.,Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M. et al.
(2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 3, RESEARCH0046. Stoll, M., Cowley, A.W, Jr., Tonellato, P.J., Greene, A.S., Kaldunski, M.L., Roman, R.J., Dumas, P., Schork, N.J., Wang, Z. and Jacob, H.J. (2001). A genomic-systems biology map for cardiovascular function. Science 294, 1723–1726. Storey, J.D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100, 9440–9445. Su, A.I., Welsh, J.B., Sapinoso, L.M., Kern, S.G., Dimitrov, P., Lapp, H., Schultz, P.G., Powell, S.M., Moskaluk, C.A., Frierson, H.F, Jr. et al. (2001). Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 61, 7388–7393. Szallasi, Z. and Liang, S. (1998). Modeling the normal and neoplastic cell cycle with “realistic Boolean genetic networks”: Their application for understanding carcinogenesis and assessing therapeutic strategies. Pac Symp Biocomput, 66–76. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96, 2907–2912. Toronen, P., Kolehmainen, M., Wong, G. and Castren, E. (1999). Analysis of gene expression data using self-organizing maps. FEBS Lett 451, 142–146. Tseng, Y.H., Butte, A.J., Kokkotou, E., Yechoor, V.K., Taniguchi, C.M., Kriauciunas, K.M., Cypess, A.M., Niinobe, M., Yoshikawa, K., Patti, M.E. et al. (2005). Prediction of preadipocyte differentiation by gene expression reveals role of insulin receptor substrates and necdin. Nat Cell Biol 7, 601–611. Tusher, V.G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98, 5116–5121. van de Rijn, M. and Gilks, C.B. (2004). Applications of microarrays to histopathology. Histopathology 44, 97–108. Wang, L., Wu, Q., Qiu, P., Mirza, A., McGuirk, M., Kirschmeier, P., Greene, J.R., Wang, Y., Pickett, C.B. and Liu, S. (2001). Analyses of p53 target genes in the human genome by bioinformatic and microarray approaches. J Biol Chem 276, 43604–43610. Weinmann, A.S., Yan, P.S., Oberley, M.J., Huang, T.H. and Farnham, P.J. (2002). Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis. Genes Dev 16, 235–244. Weinstein, J.N., Myers, T.G., O’Connor, P.M., Friend, S.H., Fornace, A.J, Jr., Kohn, K.W., Fojo,T., Bates, S.E., Rubinstein, L.V.,Anderson, N.L. et al. (1997). An information-intensive approach to the molecular pharmacology of cancer. Science 275, 343–349. Welsh, J.B., Sapinoso, L.M., Kern, S.G., Brown, D.A., Liu, T., Bauskin, A.R.,Ward, R.L., Hawkins, N.J., Quinn, D.I., Russell, P.J. et al. (2003). Large-scale delineation of secreted protein biomarkers overexpressed in cancer tissue and serum. Proc Natl Acad Sci USA 100, 3410–3415. Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L. and Somogyi, R. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA 95, 334–339. Wheeler, D.L., Chappey, C., Lash, A.E., Leipe, D.D., Madden, T.L., Schuler, G.D., Tatusova, T.A. and Rapp, B.A. (2000). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 28, 10–14.
Recommended Resources
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E. et al. (2004). Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 32(Database issue), D35–D40. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R.,
Websites http://www.ebi.ac.uk/arrayexpress/ http://www.bioconductor.org/ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?dbgene http://www.ncbi.nlm.nih.gov/geo/ http://www.broad.mit.edu/cancer/software/genepattern/ http://www.genmapp.org/ http://www.who.int/classifications/icd/en/ http://www.genome.jp/kegg/ http://www.tm4.org/mev.html http://obo.sourceforge.net/ http://www.peptideatlas.org/ http://www.ebi.ac.uk/pride/ http://www.r-project.org/ http://genome-www5.stanford.edu/ http://umlsks.nlm.nih.gov/
Books Brown, T.A. (2006). Genomes 3, 3rd edition. Garland Science Publisher, New York.
Review articles Butte, A. (2002). The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–960. Quackenbush, J. (2006). Microarray analysis and tumor classification. N Engl J Med 354, 2463–2472. Dudoit, S., Gentleman, R.C. and Quackenbush, J. (2003). Open source software for the analysis of microarray data. Biotechniques Suppl, 45–51.
225
Federhen, S. et al. (2006). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 34, D173–D180. World Health Organization (2005). International Statistical Classification of Diseases and Health Related Problems.Tenth Revision/Ed, Geneva. Wuensche, A. (1998). Genomic regulation modeled as a network with basins of attraction. Pac Symp Biocomput, 89–102.
RECOMMENDED RESOURCES ArrayExpress Bioconductor Entrez Gene Gene Expression Omnibus GenePattern GenMAPP International Classification of Diseases Kyoto Encyclopedia of Genes and Genomes MultiExperiment Viewer Open Biomedical Ontologies PeptideAtlas Proteomics Identifications Database R Stanford Microarray Database Unified Medical Language System
■
CHAPTER
18 Fundamentals and History of Informatics for Genomic and Personalized Medicine A. Jamie Cuticchia
INTRODUCTION Bioinformatics is the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics. Bioinformatics arose out of necessity as part of the infrastructure and analytic needs of genome research (see below). Medical informatics refers to computer applications in medicine and health care and historically has focused on clinical data and patient records and on medical terminology (Shortliffe et al., 2003). This information is increasingly being stored and manipulated through electronic health records (see Chapter 19). Genomic and personalized medicine is leading to a unification of these disciplines, and an intersection of the two is the main topic of the current chapter. The success of the Human Genome Project (1990–2003) rested largely on the ability to collect, manage and share gene mapping and sequencing information (Collins and Galas, 1993; Watson, 1990; Watson and Cook-Deegan, 1991). As early as 1988, researchers recognized the need for the collection and sharing of molecular data. In that year, the US Congress created the National Center for Biotechnology Information (NCBI) as a division of the National Library of Medicine (NLM) (Benson et al., 1990;Wheeler et al., 2007). NCBI’s mission includes “creating automated systems for storing and analyzing knowledge Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg 226
about molecular biology, biochemistry, and genetics.” The history of modern biological databases began with the creation of GenBank in 1985 (Burks et al., 1985). The goal of GenBank was to collect and distribute nucleotide sequence information and provide accession numbers, which are unique identifiers for each sequence record. Prior to GenBank, journals published articles describing nucleotide sequence data, as well as the data itself in printed form, but the latter became infeasible as sequencing strategies and technologies advanced and the data began to grow exponentially (www.ncbi.nlm.nih.gov/Genbank/index.html). Similarly, physical and genetic mapping data were described and reported through printed journal articles. Periodically, compendia of such data were published as a product of scientific conferences where geneticists gathered together to integrate the data and build increasingly comprehensive genetic maps. The publications that resulted from the Human Gene Mapping Conferences lead to the development of an online resource for genetic maps, and online access began to be provided by the Genome Database (GDB) (Pearson, 1991) in 1990. Genetic and phenotypic data for genes involved in human disease were also originally compiled in printed form in a series of books by Victor McKusick entitled Mendelian Inheritance in Man (MIM) spanning 12 editions, the final of which was published a decade ago (McKusick, 1998). MIM was also provided Copyright © 2009, Elsevier Inc. All rights reserved.
Databases for Genomic Medicine
(and still is) as Online Mendelian Inheritance in Man or OMIM (www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM). The development and evolution of various physical markers for genes is beyond the scope of this chapter. Single nucleotide polymorphisms (SNPs) are variations that occur in DNA sequences from different members of a population (see Chapter 7). Most of these variants are “silent”, that is, have no known phenotypic effects but are useful as signposts to regions of the genome in which they occur. Currently, there are more than 31,000,000 SNPs in the human population with the information stored in the dbSNP repository at NCBI (Wheeler et al., 2006). The database records in dbSNP are relatively simple: they contain the SNP itself in the context of its flanking sequence as well as chromosome localization, frequency, and association with genes. The large number of SNPs and the ease of detection make it feasible to survey human populations for genetic variation (Germer et al., 2000; Lavebratt and Sengul, 2006; Norton et al., 2002; Tahira et al., 2003). Thus, the HapMap project was created whereby large geographically diverse populations were surveyed across millions of SNPs (International HapMap Consortium, 2003; Couzin, 2002; Thorisson et al., 2005). Sets of nearby SNPs on the same chromosome that are inherited as a unit are called haplotypes, and a haplotype map or HapMap of the human genome now contains in excess of 3.3 million genotypes derived from 270 individuals and serves as an indispensable tool for researchers (Thorisson et al., 2005). One goal of the HapMap project was to provide a blueprint of human variation with associated genetic markers in order to facilitate efficient genotyping of patient populations, ultimately for the use in gene-disease association studies. By providing data that can be used for study design, researchers can chose to use commercially available genotyping chips or employ technologies that focus on specific regions of the genome. Two major companies providing assays are Affymetrix and Illumina. Gene chips containing 500,000 SNP markers are commercially available from Affymetrix, while Illumina allows researchers to chose which markers are to be used in a study from their database of over 1,000,000 validated markers (see Chapter 8). The discovery of millions of SNPs, a survey of human variation across those SNPs, and the low cost of SNP detection in patient populations is a major step forward for personalized medicine. For example, the HapMap is used to discover and study the relationships between different genotypes and the way patients respond to particular drugs, a field of study known as pharmacogenomics (Deloukas and Bentley, 2004) (see Chapter 27). Thus, the foundation has been laid for examining patient genetic and phenotypic variation as a first step toward personalized medicine (Andrawiss, 2005).
DATABASES FOR GENOMIC MEDICINE In the past few years there has been increasing interest in the scientific community for the creation of personalized medicine databases. The US National Institutes of Health (NIH) established
■
227
ClinicalTrials.gov as a resource providing information on federally and privately supported clinical trials (Gillen et al., 2004; Zarin and Keselman, 2007). This is the first major effort to attempt to provide a central resource cataloging ongoing clinical research. The National Institute of Diabetes, Digestive, and Kidney Diseases (NIDDK) established the Central Repositories Program in 2003 with the goal of providing a catalog for data available from completed clinical studies, as well as reporting the availability of DNA and biospecimens from both completed and ongoing research. The NIDDK Data Repository provides web access to general information on studies and acts as a catalog of data (Cuticchia et al., 2006). Since the data in this repository include patient data, it is not possible to provide such data freely on the web. Rather, researchers interested in using data from a single study (or multiple studies for a meta-analysis) work with the NIDDK Data Repository staff to determine the availability and suitability of the data. If the researchers wish to use the data, they must submit a formal request along with the approval of their institutional review board (IRB) for the handling of patient data. The documents are then reviewed by the NIDDK, and, if data access is approved, the data are provided on CD-ROM. In 2001, PharmGKB: the Pharmacogenetics Knowledge Base was publicly released.The PharmGKB mission includes the development, implementation, dissemination of a public genotype– phenotype resource focused on pharmacogenomics (Hewett et al., 2002). PharmGKB is publicly available and supported by the NIH and is part of the NIH Pharmacogenetics Research Network (PGRN). Although it has been supported for nearly 6 years, the amount of data available through the resource is small. As of September 2007, PharmGKB has information on 608 variant genes, 523 drugs, and 123 phenotypes. In order to create an extensive resource, the issues of appropriate levels of funding and incentives to submit data must be addressed. It has been estimated that an initial investment of $50–100 million over 5–10 years would be needed to meet this goal (Gurwitz et al., 2006). The Genome Medicine Database of Japan (GeMDBJ) represents an initial attempt to warehouse data for the Millennium Genome Project (MGP) (Tsujimoto, 2001). In particular, the database contains polymorphism information, expression profiles, and limited clinical and demographic information for Alzheimer’s disease, diabetes, hypertension, and gastric cancer. While rudimentary in its search capabilities and presenting little more than a collection of files, it is nevertheless a result of a growing interest for the development of molecular medicine databases. Moreover, the Marshfield Clinic has undertaken database development as part of its Personalized Medicine Research Project (McCarty et al., 2005). They have created a database of genotypic and phenotypic information of over 18,000 participants and have produced analytical tools to support the project. This database is an example of the population-based study databases that must be created under the NIH Genome-Wide Association Studies (GWAS) Project. NIH has designated NCBI as the group for the development of databases to support GWAS. The Database of Genotype
228
CHAPTER 18
■
Fundamentals and History of Informatics for Genomic and Personalized Medicine
and Phenotype (dbGAP), already has data from studies on macular degeneration and Parkinson’s diseases. The database collects and disseminates data from medical sequencing, and molecular diagnostic assays in addition to genotype–phenotype correlations. Database funding, unfortunately, is one of the areas most easily decreased or eliminated during times of limited funding. However, even if funding could be readily and steadily obtained, the issue of incentives to submit data is a much more difficult challenge to meet. Even though the success of the HGP was largely based on the ability of numerous research groups to work together in the sharing of data, the ability to get groups to submit data publicly during the initial stages was difficult (Cuticchia et al., 1993). This difficulty arose because the laboratories participating in the HGP were hesitant to divert resources away from the research process in order to perform data submission. Even when HGP funding required proof of data release, some research groups interpreted that as being met by merely placing files (sometimes with no documentation) on the Internet. It was only when “peer-pressure,” combined with a recognition of the overall value to all research groups, that data submission became commonplace. For pharmacogenomics data, a great deal of which is obtained through privately funded research, the incentives for public database submission are not immediately apparent. Some have argued that financial incentives such as tax exemptions for the cost of depositing data or increasing the US patent lives for drugs when their clinical trial data are publicly accessible are needed (Gurwitz et al., 2006). It remains to be seen the manner in which public databases receive both data and funding; however, it is likely that such a resource will eventually be developed. Knowledge and Data Management As an example relevant to genomic and personalized medicine, the key to pharmacogenomic research is the ability to accurately collect and interrelate data. While there are numerous components used to determine the genotype at a particular locus, the data that would be used for a gene association study at its simplest form is shown in Table 18.1. The data can be broken down into just four columns corresponding to: Run_ID, Patient_ID, Marker_ID, and Genotype (A, B, or AB) The fields Patient_ID and Marker_ID are references to other datasets, which might include patient records or associated genes or SNPs. From a database perspective, those fields are “foreign keys,” meaning that the information contained in those fields can be used to directly link to other data. The data could be stored together along with additional data in the same database or could be used as links to external databases. In the case of linkage to other databases, diligence must be given to ensure that the identifier used to link the data remains static. As an example, if the Marker_ID field were used to link to dbSNP, then the field must be populated with RefSNP identifier (i.e., rs#), as in Table 18.1, of each marker in order to unambiguously link the two data records. Nearly all publicly accessible biological databases provide
TABLE 18.1
Sample patient genotypes
Run ID
Patient ID
Marker ID
Genotypea
101
1
rs28771969
A
102
2
rs28771969
A
103
3
rs28771969
B
104
4
rs28771969
AB
105
5
rs28771969
A
201
6
rs28771969
AB
202
7
rs28771969
B
203
8
rs28771969
A
204
9
rs28771969
B
205
10
rs28771969
AB
a
These labels represent the presence of alleles at the particular locus. The patient may be homozygous for either allele A or B, or may be heterozygous having one copy of each (AB).
unchanging (and usually purposely ambiguous) identifiers, accession numbers, for their records. If it is unclear which field the database is using for its accession number, then it is important to contact the database host. The field Patient_ID would similarly be linked to data records relevant to that patient. However, the implementation of this link would depend on the manner in which clinical data are made available. Most hospitals and many private medical practices have continued to migrate from the collection of medical information on paper toward its collection electronically (Ford et al., 2006; Geiger et al., 1995; Miller and West, 2007; Welch et al., 2007). Even where paper is still used for the initial recording of medical information, it is likely that the data will later be input into a computer. For clinical trials, the data that are to be collected are predetermined before the start of the trial. Case report forms are created that are used by the health care providers for the collection of necessary data. In many cases, due to the environment in which the data is to be collected, the data collection form is completed on paper. However, with the increase in computer access within clinics, it is becoming easier for the health care provider to complete electronic versions of the form using laptops, tablet computers, or PDA’s. Regardless of the mode of collection, ultimately the medical information needed for the clinical trial is stored electronically. It is incumbent upon both the database creator and user to make certain that the data collected is stored in a manner compatible with IRB and HIPAA standards (Erlen, 2005; Fain, 2003; Harrelson and Falletta, 2007). In some cases, the data may need to be imported directly from electronic medical records. In those cases it will be necessary to work with the administrators of the medical records in order to determine the manner of access. It is quite unlikely, nor is it recommended, that links will be made
Databases for Genomic Medicine
229
ensures that fields on different patient report forms that collect the same data type are related, but also facilitates the linking of clinical trial data across multiple trials (Murphy et al., 2003). There are two problems that must be avoided in relating data: (1) linking data that should not be linked and (2) failing to link identical data. As an example:
Electronic patient records EPR Honest broker EPR_ID RD_ID
■
Research database RD
Figure 18.1 Honest broker association between two databases. The “honest broker” is an independent database with the sole purpose of storing the association of identifiers for linked data stored in two or more databases. Thus, related data from two or more databases can only be retrieved when sufficient access privilege is granted.
directly to the electronic health records. Rather, it is preferable to have an extract of the data prepared by the medical records administrator, which may be directly loaded into the research database or stored as its own database. Those data are then linked by Patient_ID to the research database. The advantage to storing this information separately is that the medical records administrators can create their own database using database tools and protocols compatible to the electronic medical records. They also would not be burdened with the need to understand the research database. Moreover, if the data are stored in a separate database, changes can be made to the structure (or even a full replacement) of the research database without interfering with the medical records. Accessing medical records through an intermediate database also allows for the use of an “honest broker” approach to the data management (Boyd et al., 2007). In some cases patient data that are stored in the intermediate database may need to be accessed by users with different data privileges (Figure 18.1). Additionally, there may be a requirement that no medical records can be directly linked to a research database. By using an honest broker, the patient record identifier from both databases are linked by another database. Thus, there is no way (other than through the honest broker) for someone to link the patient information with the genotype data. Data Modeling While the storage and modeling of genotype data are simple, the modeling of patient data is not! For example, in a recent investigation of a very well thought out clinical trial, there were no less than 10 different places, names, and values in which the data value referred to patient death. The problem of relating identical fields across studies becomes even more difficult if a common naming scheme is not employed (see Chapter 17). The use of a common naming convention, a “controlled vocabulary” in the creation of clinical trial databases, not only
Data Record A: IV
Data Record B: 4
If the records referred to Drug Number in A, and Drug Number in B, then the records should be linked since Record A is the Roman Numeral representation of Record B. Additionally, if the records referred to manner of drug administration in A (intravenous), and manner of drug administration (where 4 intravenous) they should be linked. However, if Record A referred to manner of drug administration (intravenous) and Record B referred to Drug Number (Drug 4) then the records should not be linked. Thus, two pieces of data, which to a computer would be a character data type and a numeric data type, can in fact represent the same values. In databases that are properly designed with meaningful field names, it should be relatively simple to make the appropriate connections. However, enough ambiguity will exist in field names that it is always recommended to go back to that patient report forms and their explanation in order to make associations. A controlled vocabulary can be defined as a collection of consistent terms used to refer to identical objects, that is to say that they have been enumerated explicitly. There are numerous examples such as the Systemized Nomenclature of Medicine (SNOMED) (Kudla and Rallins, 1998) and the Clinical Data Interchange Standards Consortium (CDISC) (Kuchinke et al., 2006) for clinical data, and the Gene Ontology (also known as the GO Ontology, 2006; Camon et al., 2003) and NCI thesaurus (Sioutos et al., 2007) for more basic biological data. Both SNOMED and CDISC curate a list of medical terminology in such a way as to unambiguously connect terms which refer to the same “entity.” As an example, the entity is defined as “a person who is participating in a clinical study.” In one instance “Patient” is used to refer to that entity, and in another instance “Subject” is used. The GO Ontology is an attempt to create a hierarchical system for describing biology by breaking genes into functions and biological process (as an example). Together, these tools provide different means and terminology to map common concepts to a unique identifier. To a computer: ● ● ● ● ●
Myocardial infarction MI Infarction of heart Myocardial infarct Heart attack
all appear as different objects. By using a tool such as those available for SNOMED, one could either use a single definitive term replacing all synonyms or create a linking table that would
230
CHAPTER 18
TABLE 18.2 Patient ID 1
■
Fundamentals and History of Informatics for Genomic and Personalized Medicine
Simplified sample clinical report forms Patient name (Last)a Smith
2
Jones
3
Treatment
Response
External Link to Patient Records Run_ID
X
Yes
X
Yes
Williams
Placebo
No
4
Douglas
X
Yes
5
Baker
Placebo
No
6
Clark
X
No
7
Carey
Placebo
No
8
Sanders
Placebo
No
9
Hawkins
X
Yes
10
Thomas
Placebo
No
Patient_ID
Patient_ID
Patient_Name Table Link
Genotype
Treatment
Marker_ID
Response
External Link to dbSNP
a
Figure 18.2 Data representation. Run_ID and Patient_ID are the primary keys for each of the two tables. Here Patient_ID is also a foreign key for the first table which is used to link those records with matching records in the second table. Patient_ID and Marker_ID also serve as foreign keys to external databases.
Fictional
associate all the terms with the same object. By doing the latter, one could create a query such as “Show me all patients who had a MI since 2001” and the query would retrieve all records with the synonyms matched to MI. The use of controlled vocabulary can also be related to ontologies. Ontologies are ‘specifications of a relational vocabulary’ (Bard, 2007; Bittner and Donnelly, 2007; Pinciroli and Pisanelli, 2006; Soldatova and King, 2005). In other words, once one has created a set of common terms, the terms can be networked together. The use of ontologies would allow researchers to perform queries such as “Show me all patients having adverse events involving the heart since 2001.” Through the ontological network, this would retrieve MI and other adverse events (e.g., cardiac arrest). Software Development As an example, a database to support a very simple clinical trial is outlined. The trial will test the effect of Drug X on headaches. Five patients with headaches are given Drug X, while five patients receive a placebo. Additionally the patients are genotyped at a locus rs28771969 to determine whether variation in the gene has an effect on response. The information received from the Genotyping Center is shown the Table 18.1, while the data collected from the clinical report forms are listed in Table 18.2. The database would look something like the representation shown in Figure 18.2. Points to note are: 1. The Marker_ID field is used as an external identifier to link to other genetic databases. 2. The Patient_ID field is an internal identifier set by the research study and can be used to link to medical records. This information would be covered as Private Health Information (PHI) and will require appropriate IRB and HIPAA treatment.
3. X and Placebo are both terms that refer to same class of treatment. While this in an overly simplistic example, it demonstrates the key points regarding database construction for genomic and personalized medicine: 1. It provides for unambiguous linking of genotypes and patients. 2. It provides for linking to electronic medical records without the need to extract information directly from the medical records database. 3. It uses the term “Treatment” to refer to drug given; where treatment is a general term that can be mapped to other studies. 4. It uses keys to link together database tables and external databases.
CONCLUSION As the field of genomic and personalized medicine evolves, the need for the effective collection and dissemination of data will become increasingly important. While data initially are being collected for genotype–phenotype association studies, ultimately such research will lead to the use of genomic information in the health care process. For this reason, good database standards must be created in the research phase and used as models for the establishment of databases used for diagnostics. No new information technologies need to be invented in order to meet the needs of genomic medicine. Rather, researchers and informaticians need to concentrate on the use of “best-of-breed” information technology in the building of databases. By using good information technology practices, the needed databases should easily be constructed and once completed, issues regarding data collection and the incentives needed to populate the databases can then be addressed by the community.
References
■
231
REFERENCES Andrawiss, M. (2005). First phase of HapMap project already helping drug discovery. Nat Rev Drug Discov 4, 947. Bard, J. (2007). Systems developmental biology: The use of ontologies in annotating models and in identifying gene function within and across species. Mamm Genome. Benson, D., Boguski, M., Lipman, D. and Ostell, J. (1990). The National Center for Biotechnology Information. Genomics 6, 389–391. Bittner, T. and Donnelly, M. (2007). Logical properties of foundational relations in bio-ontologies. Artif Intell Med 39, 197–216. Boyd, A.D., Hosner, C., Hunscher, D.A., Athey, B.D., Clauw, D.J. and Green, L.A. (2007). An ‘Honest Broker’ mechanism to maintain privacy for patient care and academic medical research. Int J Med Inform 76, 407–411. Burks, C., Fickett, J.W., Goad, W.B., Kanehisa, M., Lewitter, F.I., Rindone, W.P., Swindell, C.D., Tung, C.S. and Bilofsky, H.S. (1985). The GenBank nucleic acid sequence database. Comput Appl Biosci 1, 225–233. Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann,W.,Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A. et al. (2003). The Gene Ontology Annotation (GOA) project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res 13, 662–672. Collins, F. and Galas, D. (1993). A new five-year plan for the US Human Genome Project. Science 262, 43–46. Couzin, J. (2002). Human genome. HapMap launched with pledges of $100 million. Science 298, 941–942. Cuticchia, A.J., Chipperfield, M.A., Porter, C.J., Kearns, W. and Pearson, P.L. (1993). Managing all those bytes: The Human Genome Project. Science 262, 47–48. Cuticchia, A.J., Cooley, P.C., Hall, R.D. and Qin, Y. (2006). NIDDK data repository: A central collection of clinical trial data. BMC Med Inform Decis Mak 6, 19. Deloukas, P. and Bentley, D. (2004). The HapMap project and its application to genetic studies of drug response. Pharmacogenomics J 4, 88–90. Erlen, J.A. (2005). HIPAA–Implications for research. Orthop Nurs 24, 139–142. Fain, J.A. (2003). Research and the new HIPAA regulations. Diabetes Educ 29, 348–350. Ford, E.W., Menachemi, N. and Phillips, M.T. (2006). Predicting the adoption of electronic health records by physicians: When will health care be paperless?. J Am Med Inform Assoc 13, 106–112. Geiger, G., Merrilees, K., Walo, R., Gordon, D. and Kunov, H. (1995). An analysis of the paper-based health record: Information content and its implications for electronic patient records. Medinfo 8 (Pt 1), 295. The Gene Ontology (GO) project in (2006). . Nucleic Acids Res 34, D322–D326. Germer, S., Holland, M.J. and Higuchi, R. (2000). High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res 10, 258–266. Gillen, J.E., Tse, T., Ide, N.C. and McCray, A.T. (2004). Design, implementation and management of a web-based data entry system for ClinicalTrials.gov. Medinfo 11, 1466–1470. Gurwitz, D., Lunshof, J.E. and Altman, R.B. (2006). A call for the creation of personalized medicine databases. Nat Rev Drug Discov 5, 23–26. Harrelson, J.M. and Falletta, J.M. (2007). The privacy rule (HIPAA) as it relates to clinical research. Cancer Treat Res 132, 199–207.
Hewett, M., Oliver, D.E., Rubin, D.L., Easton, K.L., Stuart, J.M., Altman, R.B. and Klein, T.E. (2002). PharmGKB: The Pharmacogenetics Knowledge Base. Nucleic Acids Res 30, 163–165. International HapMap Consortium (2003). The International HapMap Project. Nature 426, 789–796. Kuchinke, W., Wiegelmann, S., Verplancke, P. and Ohmann, C. (2006). Extended cooperation in clinical studies through exchange of CDISC metadata between different study software solutions. Methods Inf Med 45, 441–446. Kudla, K.M. and Rallins, M.C. (1998). SNOMED: A controlled vocabulary for computer-based patient records. J AHIMA 69, 40–44. quiz 45–46. Lavebratt, C. and Sengul, S. (2006). Single nucleotide polymorphism (SNP) allele frequency estimation in DNA pools using Pyrosequencing. Nat Protoc 1, 2573–2582. McCarty, C., Wilke, R., Giampietro, P., Wesbrook, S. and Caldwell, M. (2005). Marshfield Clinic Personalized Medicine Research Project (PMRP): Design, methods and recruitment for a large populationbased biobank. Personalized Medicine 2, 49–79. McKusick,V.A. (1998). Mendelian Inheritance in Man; a Catalog of Human Genes and Genetic Disorders. The Johns Hopkins University Press, Baltimore, MD. ISBN 0-8018-5742-2 Miller, R.H. and West, C.E. (2007). The value of electronic health records in community health centers: Policy implications. Health Aff (Millwood) 26, 206–214. Murphy, L.S., Reinsch, S., Najm, W.I., Dickerson, V.M., Seffinger, M.A., Adams, A. and Mishra, S.I. (2003). Searching biomedical databases on complementary medicine: The use of controlled vocabulary among authors, indexers and investigators. BMC Complement Altern Med 3, 3. Norton, N., Williams, N.M., Williams, H.J., Spurlock, G., Kirov, G., Morris, D.W., Hoogendoorn, B., Owen, M.J. and O’Donovan, M.C. (2002). Universal, robust, highly quantitative SNP allele frequency measurement in DNA pools. Hum Genet 110, 471–478. Pearson, P.L. (1991). The genome data base (GDB) – a human gene mapping repository. Nucleic Acids Res 19(Suppl), 2237–2239. Pinciroli, F. and Pisanelli, D.M. (2006). The unexpected high practical value of medical ontologies. Comput Biol Med 36, 669–673. Shortliffe, E.H., Perreault, L.E., Wiederhold, G. and Fagan, L.M. (2003). Medical Informatics: Computer Applications in Health Care and Biomedicine. Springer, New York. Sioutos, N., de Coronado, S., Haber, M.W., Hartel, F.W., Shaiu, W.L. and Wright, L.W. (2007). NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. J Biomed Inform 40, 30–43. Soldatova, L.N. and King, R.D. (2005). Are the current ontologies in biology good ontologies?. Nat Biotechnol 23, 1095–1098. Tahira, T., Suzuki, A., Kukita, Y. and Hayashi, K. (2003). SNP detection and allele frequency determination by SSCP. Methods Mol Biol 212, 37–46. Thorisson, G.A., Smith, A.V., Krishnan, L. and Stein, L.D. (2005). The International HapMap Project Web site. Genome Res 15, 1592–1593. Tsujimoto, G. (2001). [“Millennium Project” of MHLW]. Nippon Rinsho 59, 1884–1888. Watson, J.D. (1990). The human genome project: Past, present, and future. Science 248, 44–49.
232
CHAPTER 18
■
Fundamentals and History of Informatics for Genomic and Personalized Medicine
Watson, J.D. and Cook-Deegan, R.M. (1991). Origins of the Human Genome Project. FASEB J 5, 8–11. Welch, W.P., Bazarko, D., Ritten, K., Burgess, Y., Harmon, R. and Sandy, L.G. (2007). Electronic health records in four community physician practices: Impact on quality and cost of care. J Am Med Inform Assoc 14, 320–328. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S. et al. (2006). Database resources of the National
Center for Biotechnology Information. Nucleic Acids Res 34, D173–D180. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S. et al. (2007). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35, D5–D12. Zarin, D.A. and Keselman, A. (2007). Registering a clinical trial in ClinicalTrials.gov. Chest 131, 909–912.
RECOMMENDED RESOURCES Human Genome Project www.genome.gov NCBI www.ncbi.nlm.nih.gov GenBank www.ncbi.nlm.nih.gov/Genbank/ Human Genome Organisation (HUGO) www.hugo-international.org GDB Human Genome Database www.gdb.org Online Mendelian Inheritance in Man www.ncbi.nlm.nih.gov/omim/ dbSNP www.ncbi.nlm.nih.gov/projects/SNP/
HAPMAP Project www.hapmap.org Clinical Trials Registry www.clinicaltrials.gov NIDDK Repository www.niddkrepository.org PharmGKB www.pharmgkb.org Genomic Medicine Database of Japan gemdbj.nibio.go.jp Marshfield Clinic Personalized Medicine Research Project www.marshfieldclinic.org/chg/pages/default.aspx?page=chg_pers_med_res_prj
CHAPTER
19 Electronic Medical Records in Genomic Medicine Practice and Research Glenn S. Gerhard, Robert D. Langer, David J. Carey and Walter F. Stewart
INTRODUCTION The implementation of electronic medical records (EMRs) is expected to substantially improve the quality and efficiency of health care and provide an important vehicle to advance patientcentered personalized care, the sine quo non of genomic medicine. In the United States, the use of EMRs in care delivery is expanding rapidly, especially among large integrated health delivery systems. The amount of clinically relevant genomic data and the number of resources devoted to research on genomic medicine are increasing in parallel. However, relatively few publications have yet addressed the use of EMRs in genomic medicine, although the need for such integration has been clearly established (Martin-Sanchez et al., 2004). In this chapter, the role of EMRs in the practice of genomic medicine and their use for genomic research will be considered. The need for digital approaches to storing, processing, and using information is driven by the growing density of genomic data, whether derived from gene expression profiling, single nucleotide polymorphism (SNP) genotyping, or DNA sequencing, and from the pace of discovery. Moreover, the complexity of clinical decision-making will change commensurately as genetic data relevant to diagnosis and therapeutic interventions increases. Historically, the disciplines of medical (or clinical) informatics and biological informatics (or bioinformatics) have both evolved to address the needs of genomic medicine
Genomic and Personalized Medicine, 2-vol set by Willard & Ginsburg
practice and research (Maojo and Kulikowski, 2003). Differences between the two communities have resulted in an attempt to integrate the approaches (Altman, 2000), with some proposing that health care organizations should be more proactive in collaborating with other public and private sector organizations in the development of an EMR containing both clinical and genomic data (Groen et al., 2005). These are laudable goals, but most of the work published thus far has come from those who have focused on the manipulation and analysis of genomic data for the clinical setting. However, the major factors driving the development of EMRs, such as quality measures, efficiency, and reimbursement, are not directly related to genomic medicine. This chapter is written from the perspective of those involved in the use of EMRs in large integrated delivery systems that make use of data and populations for research. While the US health system is fragmented, the large health systems that are adopting EMRs are becoming increasingly integrated, especially in adopting and implementing practice standards (Orlova et al., 2005). A significant window of opportunity thus exists to incorporate various aspects of genomic medicine into the development of these emerging systems. The data relevant to genomic medicine are highly diverse and applicable to essentially all medical specialties. As a result, multiple approaches and solutions will likely need to be developed for the myriad of needs in genomic medicine research and practice (Stein, 2003; Sujansky, 2001).
Copyright © 2009, Elsevier Inc. All rights reserved. 233
234
CHAPTER 19
■
Electronic Medical Records in Genomic Medicine Practice and Research
EMRs AND GENOMIC MEDICINE CLINICAL PRACTICE The EMR is a longitudinal electronic record of patient health information generated by encounters in any care delivery setting, encompassing patient demographics, progress notes, problem lists, medications, vital signs, past medical history, immunizations, laboratory data, and radiology reports. The EMR substitutes for traditional paper-based “charts”. The EMR has the ability to generate a complete record of a clinical patient encounter, as well as supporting other care-related activities directly or indirectly, such as evidence-based decision support, quality management, and outcomes reporting. It may also be part of a solution to addressing a decades old problem of decreasing the long lag time that exists before evidence-based medical knowledge is routinely used in clinical care (Stewart et al., 2007). This may be of tremendous importance for the rapid progress being made in genomic medicine. In this section, the integration and use of genomics data in EMRs for the clinical practice of genomic medicine is discussed. Example of an EMR The Geisinger Clinic, a large integrated health system in northeastern and central Pennsylvania, initiated installation of an EMR (i.e., EpicCare) in 1996, which was completed (i.e., completely paperless operations) in all community practice sites and specialty clinics by 2001. The EMR contains patient information from a variety of sources that is routinely integrated into a common database and includes age, sex, height, weight, and other demographics; lifestyle data (e.g., smoking, alcohol, etc.); clinical measures (e.g., BP, pulmonary function, cardiac information); digital imaging (MRI, CT, X-ray); all orders (i.e., labs, prescriptions, imaging, procedures) that require at least one clinical indication (i.e., ICD-9 code); clinical notes, which are increasingly created using smart-sets, or structured protocols; and laboratory measures (including results of genetic and molecular tests). Patients have on-line access to their EMR in the EpiCare system through the My Geisinger Internet portal. With My Geisinger patients can securely access their EMR, schedule visits, and email their doctor. Such web-based communications represent future opportunities for genomic medicine research. To date, the Geisinger clinic EMR database contains information on more than 2.2 million patients. Integration of Genomic Medicine Data into the EMR EMRs make use of relational database structures and utilities to access and display data to facilitate medical care and clinical decision-making. Although most observers and users consider the EMR to be a single application, in reality there are multiple data feeds from disparate sources that comprise the EMR. Some data sources, for example laboratory information systems, are usually integrated with the main EMR application, while others, such as radiology, histopathology, and molecular data are incompletely integrated or are resident in entirely separate databases. Genomic data currently used for clinical practice are
often generated by hospital, reference, or specialty laboratories that produce a wide variety of reporting formats, data types, and nomenclature (Ogino et al., 2007). Integration of these data into the EMR thus requires the ability to assemble the different data sources into an accessible format. Customized systems are sometimes developed in-house using on-site information technology expertise that are motivated by institution- and/or laboratoryspecific needs and allow for flexibility and control over how the software evolves. However, the tendency is to invest significant resources into these in-house solutions, which sometimes underestimate the effort needed and fail to meet the intended goals. The choice of integration architecture and format for data representation are thus significant decisions and have a number of important issues associated with them (Haas et al., 2001). The disparate sources that generate genomic data often report text-based data that are distributed through laboratory information systems into the EMR. These text results are commonly discrete, that is genotype A or genotype B, rather than continuous, although quantitative results are also produced, such as the number of trinucleotide repeats in the gene (FMR1) responsible for many cases of fragile X syndrome (Murray et al., 1997). Genomics results may also be reported in unique formats developed by diagnostic companies in niche markets. The disparate nature of the data and data sources further complicates the growing trend of integrating genomics data with other laboratory and/or clinically relevant data to generate a comprehensive clinical report. For example, cytogenetic and molecular remissions are recommended in reporting standards for acute myelogenous leukemia (Cheson et al., 2003). With the tendency of health care organizations towards the delivery of care in a service line structure and/or disease-specific care model (e.g., heart hospitals, neuroscience centers, etc.), more comprehensive and integrated reporting of genomic medicine results will likely be needed. This will apply to both genomic and other types of data-intensive results, such as proteomic and metabalomic data (Table 19.1). Currently most genomic data generated for clinical care are in the form of a discrete result. However, as the cost of full DNA sequencing continues to decline (Bennett et al., 2005), it is likely that discrete and highly relevant-specific results will be provided along with a large amount of DNA sequence data that has no known clinical significance (Mitchell and Mitchell, 2007). Ultimately, complete genome sequencing may become the most cost effective means of identifying sequence variants (Hutchison, 2007). It would not be unexpected in the future if DNA sequence determination became the standard of care for every patient, even though most of the sequence data may not be relevant to clinical decisions. Prior to that, many smaller iterations of DNA sequence data will be determined, for example, comprehensive mutation detection for BRCA1 and BRCA2 (Myriad Genetics, Salt Lake City, Utah, “BRAC Analysis for BRCA1 and BRCA2, Breast and Ovarian Cancer, Comprehensive Analysis”). In this analysis, full sequence determination in both forward and reverse directions of approximately 5400 base pairs comprising 22 coding exons and approximately 750 adjacent base pairs in the non-coding intervening sequences (introns) of the BRCA1
EMRs and Genomic Medicine Clinical Practice
T A B L E 1 9 . 1 Examples of genomic and other related data types potentially relevant for inclusion in the EMR ● ● ● ● ● ● ● ●
DNA sequence data Single nucleotide polymorphism (i.e., SNP) genotypes Multiple nucleotide polymorphism (e.g., insertion, deletion, repeat) genotypes Cell and/or tissue-specific microarray gene expression profiles Proteomic (e.g., mass spectroscopy) profiles Glycomic (i.e., carbohydrate) structural data Lipomic (i.e., lipid) profiles Metabolic phenotypes
gene is made. For BRCA2, full sequence determination in both forward and reverse directions of approximately 10,200 base pairs comprising 26 coding exons and approximately 900 adjacent base pairs in the non-coding intervening sequence (intron) is performed. Similar sequence analysis is offered for several other gene or genes that are associated with specific inherited cancer syndromes. Such testing generates an enormous amount of extra information that is not currently clinically relevant. However, as new discoveries are eventually translated into practice, ready access to relevant DNA data on patients may become necessary. How to maintain such dense sequencing data is thus an important decision (Mitchell and Mitchell, 2007). For example, placing every base sequenced into the patient’s medical record may clutter the record with predominantly clinically insignificant data. Alternatively, only bases that differ from reference data may need to be tracked. Reference data may change, which may alter the interpretation of patient data. The American College of Medical Genetics (ACMG) has previously recommended a minimum amount of information for inclusion in DNA sequencing reports (ACMG, 2000). Important considerations include reporting of all of the bases sequenced, as well as linked sequence reference data. Such recommendations are likely to evolve as more clinical experience is acquired with sequence data. Family History Data An integral component of genomic medicine is the application of family history to clinical care. Interest in collecting family history data as a routine part of care delivery is growing, as knowledge advances in linking family history of disease to patient risk. Many common diseases, such as coronary heart disease, diabetes, several types of cancer, osteoporosis, and asthma, are known to occur more frequently in some families than in others; thus the identification of at-risk individuals is important for high quality medical care. Family history data are also necessary for research aimed at identifying environmental and/or genetic factors that may underlie these common diseases (Hopper et al., (2005)). However, the acquisition of such data by most physicians is inadequate, in part because tools do not exist for efficient, reliable, and valid collection and analysis of such data (Guttmacher et al., 2004). Structured family history data are almost universally absent in current EMRs (see Chapter 42).
■
235
The need for the development of better family history tools has been highlighted by projects at the Centers for Disease Control (CDC) and by the US Surgeon General’s Family History Initiative (Wolpert, 2005). However, these efforts have not directly addressed the integration of tools into the real-world scenario of busy physicians and a multiplicity of health record systems, and do not provide an adequate breadth of data capture necessary for research. Various paper family history tools are available, as are various computer-based tools for drawing pedigrees (Trager et al., 2007). Several EMRs incorporate free text documentation of family history, while others allow for the development of family history templates and support checklists for taking and documenting family history. However, currently available tools are inadequate for many situations, especially primary care (Rich et al., 2004). The need for new tools is therefore apparent, and several characteristics of the ideal family history tool have been suggested, including patient-completed (e.g., paper, desktop, telephone, or web input), adapted to patient age, gender, ethnicity, common conditions, elicits-patient specific concerns, brief, understandable, easy to use, compatible with multiple clinical applications (e.g., paper, EMR, personal digital assistant), contains clinical decision support, and prioritizes based on clinical significance (Rich et al., 2004). No such electronic family history tools have yet been developed, despite the availability of suitable technologies. EMRs, Genomic Medicine, and Clinical Decision Support Genomic medicine is accelerating the growth of medical knowledge. In contrast, the means by which knowledge is translated into clinical practice have not evolved to keep up with the accelerating growth. Translation of knowledge into practice relies on a century-old tradition, the direct education of clinicians, an approach that was useful when knowledge was relatively limited. Today, it is not possible for most physicians to stay abreast of medical knowledge, especially genomics, and provide state-of-the-art care. The nuances in clinical decision-making in genomic medicine already render many care scenarios complex. Access to an EMR and the ongoing codification of medical knowledge (i.e., Clinical Practice Guidelines or CPGs) will be essential to addressing this growing translation gap. CPGs greatly facilitate, but are not sufficient for translating knowledge to practice. Almost 2000 active CPGs exist in the US National Guideline Clearinghouse (Kozma, 2006) and an individual CPG may encompass dozens to hundreds of clinical recommendations. These recommendations will rapidly expand in the era of genomic medicine, and thus codification of knowledge will be essential to increasing its access. EMRs offer a platform to translate codified knowledge into real-time actionable processes. Genomic data will need to be accessed with other patient data located in disparate locations within the EMR and evaluated in relation to a rule set. Real-time actionable recommendations will need to be created and be supported by an integrated and intuitive visual display of information, such as a set of orders or recommendations for the physician or other care provider.
236
CHAPTER 19
■
Electronic Medical Records in Genomic Medicine Practice and Research
An example of clinical decision-making using an EMR potentially relevant to genomic medicine is a process with a specific focus on cardiovascular risk management in primary care being developed at the Geisinger Health System. The process involves a decision support rules “engine”, external to the EMR, in which patient data such as labs, questionnaire data for phenotyping, and other clinical measures (Figure 19.1) are extracted in real time and evaluated in relation to clinical rules. The process can be used to generate a span of outputs including orders, clinical notes, and medication lists, etc. (Figure 19.2). The current model
involves rules for smoking cessation, control of LDL, control of blood pressure, and management of other cardiovascular risk factors. Together, the combination of rules for different risk factors translates into more than 100,000 possible ordersets. The process itself demonstrates that it is technically feasible to develop and implement a comprehensive set of rules for state-of-the-art management of a clinical condition. Genetic variants reported to predispose to coronary heart disease (Samani et al., 2007) may impact this process and could greatly increase the number of ordersets.
EMRs AND GENOMIC MEDICINE RESEARCH
Figure 19.1 Screenshot of EMR (Epic Systems, Corporation, Verona, WI). The software uses a tab-based design for access to medical record information including encounters, clinic notes, lab and imaging results, orders, medications, and procedures. These data can be extracted and used to automatically generate orders or recommendations, such as laboratory tests or imaging procedures for health maintenance or disease management.
Figure 19.2 Screenshot of EMR (Epic Systems, Corporation, Verona, WI). Examples of output from problem list, immunizations, allergies, medications, and health maintenance fields.
The EMR offers an exciting opportunity for genomic medicine research, both as a tool and as a data resource. It also represents an economically efficient means of obtaining phenotypic data and biosamples for generating genotypic data. Initial large-scale efforts to use genomic and clinical information for research, for example, deCode Genetics in Iceland, required the conversion of medical information from paper records into an electronic format (Hoffman, 2007). Even current large-scale efforts, such as the UK BioBank (Ollier et al., 2005), are based upon a mix of paper and electronic sources. The EMR thus represents a potentially large increase in efficiency for obtaining phenotypic data, and as described below, can also be an extremely efficient tool for patient recruitment and biosample acquisition. EMRs and Genome-Wide Association Studies The continued acceleration in genotyping technologies has enabled a dramatic increase in the use of genome-wide association studies (GWAS) to identify genetic variants underlying a variety of clinical phenotypes. However, the very large sample sizes and the logistics of collecting phenotype data and biological samples often require multi-center studies with coordinating centers and are very costly and time consuming to complete. Moreover, multiple replication studies are essential to verify initial GWAS findings. The long time period and substantial costs in initiating population-based studies limits the scope of such approaches. As an alternative, biorepositories are being developed to provide the resources necessary for GWAS. In Iceland (Hakonarson et al., 2003), the United Kingdom (Ollier et al., 2005), and elsewhere (Austin et al., 2003a), efforts are being made to link biobanking activities to national health records. In the United States, however, there are challenges with devising a national biobanking strategy (Austin et al., 2003b; Hsieh, 2004; McCarty et al., 2007; Sanner and Frazier, 2007). The US population is very heterogeneous, highly mobile, and health care is highly fragmented. Information on patients is highly dispersed and not in an easily accessible form. Hence, GWAS and related validation studies are largely conducted in conventional single project formats (i.e., de novo data collection), where there is a long lead-time between data acquisition and analysis.
EMRs and Genomic Medicine Research
In the United States, integrated delivery systems that use EMRs may be critical to addressing logistical challenges to conducting large-scale GWAS studies. While health care in the United States is highly fragmented, approximately 10% of care is provided by integrated delivery systems. Nationwide, these systems are undergoing a quiet revolution through the introduction of EMRs. Efforts to standardize data elements for research purposes and data sharing are being pursued by the larger systems (Hornbrook et al., 2005). There may be substantial opportunities to accelerate genomics research in the United States by leveraging the growing number of systems that are both adopting EMRs and creating biorepositories. However, there are many unanswered questions regarding whether such resources can be effectively used for genomics research. At the individual system level, there are questions regarding the quality, specificity, and completeness of data for phenotyping. In addition, little is known about system-level challenges that influence the quality and completeness of data captured during clinical encounters (e.g., impacts upon workflow, technical, and business priorities). Moreover, in some systems, even if quality data are available, there may be substantial technical, ethical, or policy barriers that limit access or use. Finally, GWAS and follow-up replication studies will require collaboration of multiple systems to achieve sample sizes that are sufficiently large, but there are enormous data quality, technical, and policy challenges to pooling data across institutions. Using the EMR to Access Populations for GWAS Case-control studies and family based studies have been the two primary designs for GWAS, but the economic efficiencies of the case-control format conducted within health systems are significant. Patient populations and inherent selection bias also impact GWAS, which may vary substantially depending on the source (e.g., primary care, specialty care, outpatient, inpatient). Primary care populations tend to be similar to general population samples with the exception that younger individuals (especially males) and individuals from lower middle-income strata (i.e., not eligible for Medicaid and no access to health care benefits through work) are under-represented. Selection bias inherent to specialty care samples depends on the extent to which there are competing providers in the same market and whether a provider has local, regional, national, or international reach. Selection bias increases directly in relation to market reach. These potential biases will be reflected in clinical data available in EMRs, depending upon where and when the system was installed. Genetic heterogeneity in a population may also arise due to population structure and recent admixture and may confound the results of genetic association studies in unrelated individuals, leading to a potential excess of both false positive and false negative results (Ziv and Burchard, 2003). The degree of confounding is dependent in part on how genetically different the subgroups are within the population. For example, among many contemporary populations, recent admixture between two or more genetically distinct subgroups has added to the genetic complexity. In the United States, this complexity is particularly
■
237
pertinent for admixed populations such as Latino- and AfricanAmericans (Parra et al., 1998). Within indigenous continental population groups, European subpopulations appear to be the most similar to each other (Bowcock et al., 1994; Mountain and Cavalli-Sforza, 1997). The ability of an EMR to distinguish among ethnicities within its patient base may impact the usability of the data for GWAS. The Geisinger MyCode Biorepository The creation of biorepositories that are closely linked with EMRs may be an extremely efficient approach to genomic medicine research, especially for GWAS that require DNA samples from large populations with corresponding robust phenotypic data. The challenges to establishing such biorepositories involve complex institutional, legal, and social issues that also must be addressed. However, the EMR opens new opportunities to address these concerns, such as patients’ safety, rights, informed consent, and privacy. The Geisinger Clinic MyCode biorepository project is perhaps the first large-scale biobanking initiative built around an EMR. In this project, the EMR is used as a tool for identifying and recruiting patients, obtaining blood samples, and retrieving phenotype data. The primary goal of the MyCode project is to bank blood/serum/DNA samples from patients on a large-scale coupled with access to their EMR data. MyCode was initiated in 2007 to leverage the resources of the Geisinger Health System, an integrated delivery system offering healthcare services to residents of 31 of Pennsylvania’s 67 counties, with a significant presence in central and northeastern Pennsylvania between Pittsburgh and Philadelphia. Census data indicate that in most counties, the out-migration rate is less than 1% per year, thus providing an extremely attractive population for longitudinal studies. The population within the Geisinger Clinic catchment area is relatively homogeneous. Based on available historical, demographic, and self-reported ethnicity data, the Geisinger population is almost entirely Caucasian of mixed European ancestry, with only small contributions from other continental population groups, such as African, Asian, or Latino. The relatively lower degree of genetic variation among European subpopulations, and the length of time for intermingling of European subgroups in Pennsylvania, also serves to minimize population stratification. Such a population straddles genetically homogeneous populations such as Iceland or Finland with strong founder effects, and admixed populations, now present in many urban areas. The MyCode patient recruitment process (Figure 19.3) begins with a search of the EMR for patients scheduled to visit an outpatient or designated specialty clinic in which a consenter will be stationed. Eligible patients are identified through a daily EMR report, while consent status (e.g., consent/did not consent/ not interested) is verified in the patient’s EMR to prevent unnecessary requests for participation at future appointments. Patients are invited to participate in MyCode through direct interaction with a project consenter. After consent, a standing order for the additional MyCode blood draw is placed in
238
CHAPTER 19
■
Electronic Medical Records in Genomic Medicine Practice and Research
Use of EMR in MyCode Biorepository Project
Obtain Clinic Schedule from EMR
Generate Eligibility List from EMR
Patient Consent
Questionnaire Data into EMR
Blood Draw Order Placed in EMR
Frozen Storage
Blood Samples Processed
Blood Drawn
EMR Blood Draw Order Read by Phlebotomy
Patient to Phlebotomy
Figure 19.3 Use of the EMR in the MyCode Biorepository Project. The EMR functions as a tool at multiple points in the patient recruitment and consenting processes, and for acquisition of blood samples. The EMR also serves as the repository for phenotype data.
the EMR such that it is appended to the new order for blood work ordered by a physician at a future visit. With this strategy, blood is not specifically drawn for research which has resulted in high patient participation rates, as has been found in other studies (Sanner and Frazier, 2007). The sample collection process is highly efficient because it leverages the EMR (i.e., maintaining a standing order with an automatic protocol to activate the order), the laboratory infrastructure (e.g., standard protocols for phlebotomy, data systems, transportation of blood samples to a central location, etc.), and a passive low cost means of obtaining samples. Blood samples are transported from clinics across the Geisinger coverage region using the existing Geisinger Lab courier service to a central research laboratory that processes the samples for the MyCode biorepository. The EMR serves as the electronic enabler of the entire MyCode process. The approach to creating the MyCode biorepository mitigates several methodological challenges with GWAS, including possible genotype errors, phenotype misclassification, and selection bias. While genotyping platforms are robust, DNA quality can affect genotyping results. Frozen blood and/or DNA are collected using the same protocol and processed using the same methodology. Errors in labeling patient samples are minimal, since an integrated clinical lab quality protocol is used for tracking samples from the point of phlebotomy to freezer storage. Phenotype misclassification is minimized through the use of careful case and control definition. Selection bias is minimized through the selection of cases and controls from the same population base. Example of EMR Data Extraction The EMR is a powerful and convenient source of phenotypic data for GWAS and genomic medicine research. A reference
dataset has been created with a record for each patient in the Geisinger EMR population that includes demographics, physical measurements, personal medical history, health behaviors, laboratory tests, surgical and medical procedures, and medications prescribed. The dataset was developed to reflect the top 80% of elements most likely to be used for standard research purposes. Core data were extracted from the EMR using imbedded routines (known as Clarity), provided by the software vendor. Clarity tables can be manipulated using standard queries in SQL applications. The Clarity tables containing the specific elements in the data dictionary for a particular domain (e.g., physical measurements, or endocrine-related lab) are identified, and those fields are extracted into a text file. Typically, all instances of an element are pulled (e.g., the dataset contains all blood pressure measurements, or all glucose lab for a patient), not just the most recent or the first. The text file is read into SAS/STAT software (SAS Institute Inc., Cary, NC) and mapped to pre-defined fields. The resulting files are documented, cleaned and stored on a dedicated directory controlled by the biostatistics team. Analysis files are created by merging the appropriate data and applying any algorithms necessary for derived values. As an example, this extraction of core data from the EMR can be used to obtain body mass index (BMI) data on over 270,000 adults (Table 19.2). Analyses were limited to adult patients 20 years of age and older. The mean and median BMIs were 28.4 and 27.2, respectively. Slightly over 30% of patients had a BMI 25 while 7.8% had a BMI of 40+. Patients with a BMI of 40+ were three times more likely to be female and more likely to be between 40 and 59 years of age (i.e., with 46.2% versus 29.6%) compared to those with a BMI 25. The relatively smaller proportion above age 60 and the gender bias may suggest a survival effect for this subgroup.
EMRs and Genomic Medicine Research
TABLE 19.2
■
239
Percent of adult patients by BMI and specific phenotypes
Phenotype
BMI Category ≤25 n ⴝ 80,861
>25, 30,