GENE DISCOVERY FOR DISEASE MODELS
Edited by
Weikuan Gu and Yongjun Wang
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. 
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Gene discovery for disease models / edited by Weikuan Gu and Yongjun Wang.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-0-470-49946-7 (cloth)
1. Medical genetics. 2. Mutation (Biology) 3. Genomics. 4. Genetic disorders. I. Gu, Weikuan and Yongjun Wang.
[DNLM: 1. Genetic Association Studies–methods. 2. Models, Genetic. 3. Mutation. QU 450]
RB155.G3584 2011
616′.042–dc22
2010028355

Printed in Singapore.

10 9 8 7 6 5 4 3 2 1
CONTENTS

Preface
Acknowledgments
Contributors

1. Gene Discovery: From Positional Cloning to Genomic Cloning
Weikuan Gu and Daniel Goldowitz

2. High-Throughput Gene Expression Analysis and the Identification of Expression QTLs
Rudi Alberts and Klaus Schughart

3. DNA Methylation in the Pathogenesis of Autoimmunity
Xueqing Xu, Ping Yang, Zhang Shu, Yun Bai, and Cong-Yi Wang

4. Cell-Based Analysis with Microfluidic Chip
Wang Qi and Zhao Long

5. Missing Dimension: Protein Turnover Rate Measurement in Gene Discovery
Gary Guishan Xiao

6. Bioinformatics Tools for Gene Function Prediction
Yan Cui

7. Determination of Genomic Locations of Target Genetic Loci
Bo Chang

8. Mutation Discovery Using High-Throughput Mutation Screening Technology
Kai Li, Hanlin Gao, Hong-Guang Xie, Wanping Sun, and Jia Zhang

9. Candidate Screening through Gene Expression Profile
Michal Korostynski

10. Candidate Screening through High-Density SNP Array
Ching-Wan Lam and Kin-Chong Lau

11. Gene Discovery by Direct Genome Sequencing
Kunal Ray, Arijit Mukhopadhyay, and Mainak Sengupta

12. Candidate Screening through Bioinformatics Tools
Song Wu and Wei Zhao

13. Using an Integrative Strategy to Identify Mutations
Yan Jiao and Weikuan Gu

14. Determination of the Function of a Mutation
Bouchra Edderkaoui

15. Confirmation of a Mutation by Multiple Molecular Approaches
Hector Martinez-Valdez and Blanca Ortiz-Quintero

16. Confirmation of a Mutation by MicroRNA
Hongwei Zheng and Yongjun Wang

17. Confirmation of Gene Function Using Translational Approaches
Caroline J. Zeiss

18. Confirmation of Single Nucleotide Mutations
Jochen Graw

19. Initial Identification and Confirmation of a QTL Gene
David C. Airey and Chun Li

20. Gene Discovery of Crop Disease in the Postgenome Era
Yulin Jia

21. Impact of Genomewide Structural Variation on Gene Discovery
Lisenka E.L.M. Vissers and Joris A. Veltman

22. Impact of Whole Genome Protein Analysis on Gene Discovery of Disease Models
Sheng Zhang, Yong Yang, and Theodore W. Thannhauser

Index
PREFACE
The availability of an annotated mouse genome sequence now provides the most efficient tool yet in the gene hunter’s toolkit. One can move directly from genetic mapping to identification of candidate genes, and the experimental process is reduced to PCR amplification and sequencing of exons and other conserved elements in the candidate interval. With this streamlined protocol, it is anticipated that many decades-old mouse mutants will be understood precisely at the DNA level in the near future.
This paragraph, from Initial Sequencing and Comparative Analysis of the Mouse Genome (2002) by the Mouse Genome Sequencing Consortium, summarizes the historical transition in both the strategy and the direction of gene discovery. It describes the tremendous changes in our protocols for positional cloning and announces the evolution of positional gene cloning. Positional gene cloning no longer requires years of teamwork. In fact, it is now possible for a single laboratory to identify several genes within one year. This is not only an extraordinary milestone but also a new starting point for better comprehending the biological function(s) of genes throughout the whole genome. The same should hold for the identification of mutated genes in human diseases and in other animal models once their genomes have been sequenced as well.

More than half a decade has passed since that paragraph’s publication. We are pleased to prepare a book detailing a new set of concepts and protocols representing the current progress in gene discovery. One goal of this book is to provide readers with a comprehensive understanding of the new concepts and protocols implemented in gene discovery in the present post-genome era.

The dramatic progress of gene discovery is built on tremendous resources, such as genome sequences, literature, and databases, as well as technologies such as high-throughput gene expression platforms and mutation-screening systems. With the availability of whole genome sequences, molecular markers can be precisely located on the genome and within a particular region of interest. In addition, every genomic element can be obtained easily through genome databases. Online literature and databases for gene information allow a quick search for a specific gene’s function.
Large databases of gene expression profiles built from microarray data provide information on both the expression level and the specificity of almost every gene in humans and other major organisms. High-throughput mutation screening systems are capable of identifying mutations in hundreds of genes within a minimal time frame. This book not only provides a systematic introduction to the available resources and technologies for gene discovery but, most important, teaches readers how to use all the available tools and data to find new mutated genes.

Another goal of this book is to predict the future of gene discovery. We intend readers not only to understand the current concepts and technologies but also to learn how to take advantage of new resources and technologies in the future, that is, to adapt to new discoveries in the genetic sciences. In the post-genome era, a good geneticist should be able to use the rapidly accumulating genome resources and the rapidly developing biotechnologies effectively and efficiently. Benchwork now produces tens of thousands of times more data than was possible decades ago. Gene discovery is, to a certain degree, the collection and analysis of data from large resources. We do not expect that every reader will be able to predict the future of genetic research, but we do hope that, by reading this book, readers can sharpen their minds in preparation for expected or unexpected challenges in this booming era of genomic research.

This book can be used as a handbook for gene cloning and discovery as well as a reference book for teachers and students in the fields of genetics and biology.

Weikuan Gu and Yongjun Wang
ACKNOWLEDGMENTS
We thank everyone who contributed to this book for their dedicated work in making it available. This is a unique, international team with a strong scientific background and broad experience. We appreciated the discussions and the exchange of ideas during the preparation of each chapter. We would like to thank the following people for kindly reviewing the chapters for this book: Beth Bennett, Cong-Yi Wang, Daniel Goldowitz, David C. Airey, Griffin Gibson, Junming Yue, Qing Xiong, and Yan Cui. Special thanks to Drs. Bruce Roe, Hongwen Deng, Xinmin Li, Xingen Lei, and Wesley Beamer for their suggestions and kind support during the preparation of this book. We appreciate the contributions of David L. Armbruster and Griffin Gibson in editing the chapters, and, finally, we would also like to thank Griffin Gibson, XiaoYue Liu, Lishi Wang, and Yue Huang for their assistance in formatting the chapters.
CONTRIBUTORS
David C. Airey, Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN, United States

Rudi Alberts, Department of Infection Genetics, Helmholtz Center for Infection Research & University of Veterinary Medicine, Hannover, Germany

Yun Bai, Department of Medical Genetics, Third Military Medical University, Chongqing, China

Bo Chang, Jackson Laboratory, Bar Harbor, ME, United States

Yan Cui, Department of Molecular Sciences, University of Tennessee Health Science Center, Memphis, TN, United States

Bouchra Edderkaoui, School of Medicine, Loma Linda University, Loma Linda, CA, and Research Scientist, Musculoskeletal Disease Center, JLP Memorial VA Medical Center, Loma Linda, CA, United States

Hanlin Gao, DNA core, City of Hope National Medical Center, Duarte, CA, United States

Daniel Goldowitz, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Jochen Graw, Helmholtz Center Munich, German Research Center for Environmental Health, Institute of Developmental Genetics, Neuherberg, Germany

Weikuan Gu, Department of Orthopedic Surgery-Campbell Clinic, University of Tennessee Health Science Center, Memphis, TN, United States

Yulin Jia, USDA-ARS Dale Bumpers National Rice Research Center, University of Arkansas, Stuttgart, AR, United States

Yan Jiao, Department of Orthopedic Surgery-Campbell Clinic, University of Tennessee Health Science Center, Memphis, TN, United States

Michal Korostynski, Department of Molecular Neuropharmacology, Institute of Pharmacology, Polish Academy of Sciences, Krakow, Poland

Ching-Wan Lam, Department of Pathology, the University of Hong Kong, Queen Mary Hospital, Hong Kong, China

Kin-Chong Lau, Department of Pathology, the University of Hong Kong, Queen Mary Hospital, Hong Kong, China

Chun Li, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, United States

Kai Li, Department of Pharmacology, Suzhou University, Suzhou, Jiangsu, China

Zhao Long, Respiratory Department, The Second Hospital Affiliated with Dalian Medical University, Dalian, China

Hector Martinez-Valdez, Department of Immunology, The University of Texas M. D. Anderson Cancer Center, Houston, TX, United States

Arijit Mukhopadhyay, Genomics & Molecular Medicine, Institute of Genomics & Integrative Biology (CSIR), Delhi, India

Blanca Ortiz-Quintero, Department of Immunology, The University of Texas M. D. Anderson Cancer Center, Houston, TX, United States

Wang Qi, Respiratory Department, The Second Hospital Affiliated with Dalian Medical University, Dalian, China

Kunal Ray, Molecular & Human Genetics Division, Indian Institute of Chemical Biology (CSIR), Kolkata, India

Klaus Schughart, Department of Infection Genetics, Helmholtz Center for Infection Research & University of Veterinary Medicine, Hannover, Germany

Mainak Sengupta, Molecular & Human Genetics Division, Indian Institute of Chemical Biology (CSIR), Kolkata, India

Zhang Shu, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China

Wanping Sun, Department of Pharmacology, College of Pharmacy, Suzhou University, Suzhou, Jiangsu, China

Theodore W. Thannhauser, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States

Joris A. Veltman, Radboud University Nijmegen Medical Centre, Nijmegen Centre of Molecular Life Sciences, Department of Human Genetics, Nijmegen, The Netherlands

Lisenka E.L.M. Vissers, Radboud University Nijmegen Medical Centre, Nijmegen Centre of Molecular Life Sciences, Department of Human Genetics, Nijmegen, The Netherlands

Cong-Yi Wang, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, GA, United States

Yongjun Wang, Beijing Tiantan Hospital, Capital Medical University, Beijing, China

Song Wu, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States

Gary Guishan Xiao, Hospital Central Laboratory, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu, China

Hong-Guang Xie, Hospital Central Laboratory, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu, China

Xueqing Xu, Department of Medical Genetics, Third Military Medical University, Chongqing, China

Ping Yang, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China

Yong Yang, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States

Caroline J. Zeiss, Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, United States

Jia Zhang, DNA core, GNF Institute, San Diego, CA, United States

Sheng Zhang, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States

Wei Zhao, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States

Hongwei Zheng, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
Figure 2.2. A microarray produced by a scanner. Each of the spots on the microarray represents a gene, and the color represents the amount of fluorescence that is measured, hence the amount of cDNA that was present in the original sample. Reprinted from Reinke (2006).
Figure 2.8. (a) The genomewide genotypes of eight recombinant inbred lines generated from a cross between two homozygous parents (A and B). Each row indicates the genome of a single RIL. The light or dark gray color in each of the RILs indicates whether that part of its genome was inherited from parent A or B. (b) Gene expression values are determined by microarrays. Four values are shown for each parent and one value for each of the RILs. (c) For three molecular markers, the gene expression values of the RILs are dissected into two groups, according to the allele they carry for that molecular marker (light or dark gray). A statistical test of each marker location calculates whether the means of both groups differ. The significances of the tests are plotted in a genomewide plot as a QTL plot. Here, a QTL peak is found for the second marker. Triangles in the QTL plot indicate the position of the gene, whose expression was used. If the gene coincides with the QTL peak, the QTL is referred to as a cis-QTL, otherwise, it is called a trans-QTL. Adapted from Alberts et al. (2005) by permission of Oxford University Press.
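The marker-by-marker test described in panel (c) can be sketched in a few lines. This is a minimal illustration, not the authors' method: Welch's t statistic stands in for whatever test was actually used, and the genotypes, expression values, marker positions, gene position, and the 5 Mb cis window are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def scan_markers(genotypes, expression):
    """Split RIL expression values by the allele carried at each marker; return {marker: |t|}."""
    scores = {}
    for marker, alleles in genotypes.items():
        group_a = [x for x, g in zip(expression, alleles) if g == "A"]
        group_b = [x for x, g in zip(expression, alleles) if g == "B"]
        scores[marker] = abs(welch_t(group_a, group_b))
    return scores


def classify_qtl(peak_pos_mb, gene_pos_mb, window_mb=5.0):
    """Call the QTL cis if the peak lies within window_mb of the gene (assumed cutoff)."""
    return "cis" if abs(peak_pos_mb - gene_pos_mb) <= window_mb else "trans"


# Hypothetical data: eight RILs, three markers, one transcript.
genotypes = {
    "marker1": "AABBABAB",
    "marker2": "AAAABBBB",
    "marker3": "ABABABAB",
}
expression = [9.1, 8.8, 9.3, 9.0, 6.2, 6.0, 6.4, 6.1]
marker_pos_mb = {"marker1": 10.0, "marker2": 55.0, "marker3": 120.0}
gene_pos_mb = 57.0  # position of the gene whose expression was measured

scores = scan_markers(genotypes, expression)
peak = max(scores, key=scores.get)
qtl_type = classify_qtl(marker_pos_mb[peak], gene_pos_mb)
print(peak, qtl_type)
```

In this toy data set only the second marker separates high from low expression, and because the gene sits close to that marker the peak is classified as a cis-QTL. A real analysis would use many more markers and a proper significance threshold.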
Unmethylated DNA: GTGTGTATGGTTGGGTGTTTTTGGGGTGGGTAGGGAGGTGT
Methylated DNA: GCGCGTACGGTCGGGCGTTTTTGGGGTGGGTAGGGAGGCGC
Figure 3.2. Bisulfite-specific PCR and direct sequencing. Genomic DNA underwent bisulfite conversion followed by PCR amplification. The resulting PCR products were then purified and directly sequenced. The results at the left are from the unmethylated DNA; the results at the right are for the methylated DNA.
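Bisulfite treatment converts unmethylated cytosines so that they read as T after PCR, while methylated cytosines remain C. Comparing the two reads shown for this figure therefore locates the protected cytosines. The sketch below assumes that the first sequence is the unmethylated result (following the caption's left/right order) and that the two reads are aligned without gaps; the function name is illustrative.

```python
def call_methylated_sites(unmethylated_read, methylated_read):
    """Return 0-based positions where the methylated read keeps C and the unmethylated read shows T."""
    if len(unmethylated_read) != len(methylated_read):
        raise ValueError("reads must be the same length (gap-free alignment assumed)")
    return [
        i
        for i, (u, m) in enumerate(zip(unmethylated_read, methylated_read))
        if u == "T" and m == "C"
    ]


# Sequences as printed with Figure 3.2.
unmethylated = "GTGTGTATGGTTGGGTGTTTTTGGGGTGGGTAGGGAGGTGT"
methylated = "GCGCGTACGGTCGGGCGTTTTTGGGGTGGGTAGGGAGGCGC"

sites = call_methylated_sites(unmethylated, methylated)
# Keep only cytosines followed by G within the read (CpG context).
cpg_sites = [i for i in sites if methylated[i:i + 2] == "CG"]
print(sites)
print(cpg_sites)
```

The last differing position falls on the final base of the read, so its CpG context cannot be confirmed from the read alone; the other six fall in CpG dinucleotides.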
[Figure 4.4 labels: (a) staurosporine, microsyringe pump, TLM detection (532 nm), drain, X-Y scanning stage, cell culturing flask; (b) signal scale from 0–1 to 9–10, + staurosporine; scale bars 30 µm.]
Figure 4.4. A single-cell analysis system in a glass microchip using a thermal lens microscope (TLM). (a) Cell culture chip design and TLM scanning method. A microflask (1 mm × 10 mm × 0.1 mm) was fabricated in a glass microchip, and a cell suspension was introduced into it. After cultivation, the microchip with capillaries connected to syringe pumps was mounted on the TLM stage, and TLM signals were measured while scanning the stage to obtain a 2D-image. (b) Direct imaging of cytochrome-c in a cell and its distribution change during apoptosis (Tamaki et al., 2002).
[Figure 4.5 labels: (a) eight cell culture chambers (a to h); A23187 concentrations 0.0, 0.86, 1.71, 2.57, 3.42, 4.28, 5.13, and 6.00 µM (low to high). (b) y-axis: normalized fluorescence intensity per cell; x-axis: A23187 concentration (µM).]
Figure 4.5. The expression of GRP78 at the protein level by immunofluorescence in SK-MES-1 cells. (a) Cells were treated with A23187 for 24 h and observed by fluorescence microscope (IX-71; Olympus Optical Co., Tokyo, Japan) (×400). The fluorescence indicates GRP78 in the cytoplasm. (b) The average expression of GRP78 per cell, reflected by normalized fluorescence intensities, increased with the concentration of A23187. The normalized fluorescence intensity per cell was determined by dividing the fluorescence intensity in a region by the number of cells in that region (Wang et al., 2009).
Figure 6.3. Function annotation with Blast2GO. In this example, 10 sequences are annotated. The upper panel shows the annotations transferred from homologous sequences. The lower panel shows the part of GO DAG containing the annotation terms assigned to the query sequences. Node color represents the annotation intensity.
[Figure 8.1 labels: (a) PCR, conformation, electrophoresis; (b) variant (2677T, Ser893) and MDR1*1 (2677G, Ala893).]
Figure 8.1. (a) Band patterns of single-stranded PCR products as visualized on a gel differ with the change in their conformations. (b) An example of how to identify two variants in the MDR1 gene using SSCP. (Kim et al., 2001.)
[Figure 9.1 elements: model system; clinical samples; gene expression profiling; list of transcripts; co-expression of genes; association with phenotype; mechanism of regulation; gene expression signature; gene function; validation; candidate drug targets; candidate biomarkers; discovery of new drug; discovery of new diagnostic marker.]
Figure 9.1. Major concepts in gene transcription profiling.
[Figure 9.4 panels (a) to (f). Axis labels: log intensity, intensity ratio, average intensity, sample quantiles, theoretical quantiles, log-odds ratio, log2 fold change, PCA1, PCA2. Groups: Class 1, Class 2.]
Figure 9.4. Various methods of presentation of microarray quality control and data analysis. (See text for full caption.)
[Figure 10.1 labels: columns −4, −1, 0, +1, +4, each listing probes MMA, PMA, PMB, and MMB; hybridization patterns for genotypes AA, AB, and BB.]
Target sequence (250-2000 bp): ...CAGACAGAGTCTTG[A/C]AATCTATTTCTCATA...
Probe sequences (25 bp):
PMA: TGTCTTCAGAACTTTAGATAAAGAG
MMA: TGTCTTCAGAACATTAGATAAAGAG
PMB: TGTCTTCAGAACGTTAGATAAAGAG
MMB: TGTCTTCAGAACCTTAGATAAAGAG
Figure 10.1. Probe array tiling and hybridization patterns (from Affymetrix).
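The four 25-mer probes printed with this figure differ only at the central (13th) base, and each mismatch (MM) probe carries the complement of the corresponding perfect-match (PM) base at that position. The sketch below derives MM probes under that reading of the figure; it is an illustration of the pattern in the printed sequences, not a description of the vendor's design software.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(perfect_match):
    """Return the probe with its central base complemented (odd-length probes only)."""
    if len(perfect_match) % 2 == 0:
        raise ValueError("probe length must be odd to have a single central base")
    mid = len(perfect_match) // 2
    return perfect_match[:mid] + COMPLEMENT[perfect_match[mid]] + perfect_match[mid + 1:]


# Perfect-match probes for alleles A and B, as printed with Figure 10.1.
pm_a = "TGTCTTCAGAACTTTAGATAAAGAG"
pm_b = "TGTCTTCAGAACGTTAGATAAAGAG"

mm_a = mismatch_probe(pm_a)
mm_b = mismatch_probe(pm_b)
print(mm_a)
print(mm_b)
```

Both derived sequences match the MMA and MMB probes listed with the figure.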
Standard PCR protocol (48 or 96 per batch)
1. Pool 700 µL PCR into deep-well plate
2. Add 1 mL magnetic beads
3. Pipette up and down 5×; incubate 10 min at RT
4. Transfer PCR + beads to filter plate
5. Apply vacuum until all wells are dry (60–90 min)
6. Add 1.8 mL 75% ethanol wash
7. Apply vacuum until all wells are dry (10–20 min)
8. Dry beads for a further 10 min under vacuum
9. Tap off excess ethanol and attach catch plate
10. Add 55 µL elution buffer
11. Incubate on vortexer for 10 min
12. Apply vacuum until all wells are dry (15–30 min)
13. Centrifuge for 5 min at 1400 RCF at RT
14. Remove catch plate with eluate (~50 µL)

Modified PCR protocol (no batch size limitation)
1. Pool 700 µL PCR into 2-mL microcentrifuge tube
2. Add 1 mL magnetic beads
3. Pipette up and down 5×; incubate 10 min at RT
4. Place on magnetic stand for 10 min
5. Pipette out the supernatant
6. Add 1.8 mL 75% ethanol wash
7. Vortex at 75% power for 2.5 min; incubate 7.5 min
8. Place on magnetic stand for 10 min
9. Pipette out the supernatant; air-dry for 15 min
10. Add 55 µL elution buffer
11. Vortex at 75% power for 2.5 min; incubate 7.5 min
12. Place on magnetic stand for 10 min
13. Collect the eluate (~55 µL)
Figure 10.3. Comparing PCR purification workflow between the Affymetrix standard and our modified protocols.
Figure 10.10. Identification of homozygous DYFS mutations in the homozygous region detected by SNP Array 6.0 (Lau et al., 2009).
[Figure 11.2 labels: (a) shearing of genomic DNA; end repairing of sheared DNA; addition of dATP at the 3′ ends; adapter ligation; purification; adapter-mediated PCR enrichment of fragments. (b) prepared library + hybridization buffer + biotinylated probes; hybridization with optimum temperature regulation; streptavidin-coated magnetic beads; bead capture; unbound fraction discarded; wash beads and remove probes; amplify; sequencing.]
Figure 11.2. The hybridization-based sequencing method. (a) Genomic DNA is sheared and end repaired or modified. A poly-A tail is added to the fragments, adapters are ligated to the 3′-end of the fragments, and excess adapters or unligated primers are removed. The amplicons are purified, and adapter-specific PCR amplification is done to enrich the product pool to prepare a library. (b) The prepared library is hybridized with relevant biotinylated probes (specific sequences, whole exome, etc.) in solution in a hybridization array. The probes bind to the relevant sequences from the library. Then streptavidin-coated magnetic beads are released in the array, and a magnet is used to capture biotinylated probes bound to their complementary sequences. Those specific sequences can then be sequenced in appropriate platform. [Panel (b) of the illustration has been adapted from Protocol version 1.0.1, October 2009; SureSelect Human All Exon Kit from Agilent Technologies.]
[Figure 11.3 labels: (a) primer library; (b) genomic DNA template; (c) microfluidic chip; steps 1 to 9; chromosomes 1 to 22, X, and Y; genomic DNA; droplet PCR; break emulsion; gDNA removal; fragmentation and nick translation; sequence.]
Figure 11.3. Microdroplet PCR workflow. (a) Primer library generation: 1, Identify targeted sequences of interest in the genome. 2, Design and synthesize forward and reverse primer pairs for each targeted sequence (library element). 3, Generate primer pair droplets for each library element. A microfluidic chip is used to encapsulate the aqueous PCR primers in inert fluorinated carrier oil with a block-copolymer surfactant to generate the equivalent of a picoliter-scale test tube compatible with standard molecular biology. 4, Mix primer pair droplets of library elements together so that each library element has an equal representation. (b) Genomic DNA template mix preparation: 5, Biotinylate (red dots), fragment into 2- to 4-kb fragments, and purify genomic DNA. 6, Mix purified genomic DNA together with all of the components of the PCR reaction (DNA polymerase, dNTPs and buffer) except for the PCR primers. (c) Droplet merge and PCR: 7, Dispense primer library droplets to the microfluidic chip. 8, Deliver the genomic DNA template as an aqueous solution; template droplets are formed within the microfluidic chip. Then pair the primer pair droplets and template droplets in a 1:1 ratio. 9, Allow the paired droplets to flow through the channel of the microfluidic chip and pass through a merge area, where an electric field induces the two discrete droplets to coalesce into a single PCR droplet. Collect ∼1.5 million PCR droplets in a single 0.2-mL PCR tube. Process the PCR droplets (PCR library) in a standard thermal cycler for targeted amplification; break the emulsion of PCR droplets to release the PCR amplicons into solution for genomic DNA (gDNA) removal, purification, and sequencing. (Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology, Tewhey et al., Microdroplet-based PCR enrichment for largescale targeted sequencing, 27, 1025–1031, 2009.)
[Figure 15.1 labels: panels A and B; gel lanes A, C, G, T; chromatogram base calls ATTCGATATCAAGCTTATCGATACCGTCGACCT.]
Figure 15.1. Manual Versus Automated DNA Sequencing. (A) Shows acrylamide gel electrophoresis results resolving typical chain termination reactions (Ho et al. unpublished data). Each lane corresponds to a designated reaction terminated with ddATP (A), ddCTP (C), ddGTP (G) and ddTTP (T) analogs (Ho et al. unpublished), which identifies respective nucleotides on target DNA template. (B) Depicts a color-coded chromatogram of typical automated DNA sequencing data (Albrechtson et al. unpublished).
Figure 15.2. Chromosome locus assignment of a newly discovered gene. Fluorescent in situ hybridization analysis, using a biotin-labeled genomic probe, which reveals the new gene at the 9q32 locus (Sims-Mourtada et al., 2005) upon binding of fluorescent Streptavidin (shown herein as pseudo yellow fluorescence) against DAPI (blue fluorescence) background. Arrows emphasize gene locus assignment on respective chromosome 9 alleles.
Figure 15.3. Gene Expression by Histological Methods. (See text for full caption.)
[Figure 17.2 labels: WT and rd-1 panels; ONL (outer nuclear layer) marked in each.]
Figure 17.2. WT and rd-1 mouse retina at postnatal day 12. In rd-1 mice, a mutation in the cyclic phosphodiesterase beta gene results in rapid loss of the outer nuclear layer in the first 3 weeks of life.
CHAPTER 1
Gene Discovery: From Positional Cloning to Genomic Cloning
WEIKUAN GU and DANIEL GOLDOWITZ

Contents
1.1 Concept of Classic Positional Cloning
1.2 Concept of Gene Discovery in the Post-Genome Era
1.3 Strategies for Gene Discovery in the Post-Genome Era
1.4 Future Direction
1.5 References
Despite the highly significant advances in studying the genetics and genomics of human populations, there are still large gaps in our understanding of the molecular genetic mechanisms involved in the pathogenesis of many human diseases. The mutated genes in many human diseases remain unknown. Identification of these mutations is crucial for correlating disease pathology and biology to the molecular basis of the disease. Discovery of new gene functions depends on the identification of the mutated genes responsible for disease in humans and other species. The techniques of positional cloning have oftentimes discovered new functions of known genes or new genes for known diseases. The goal of this book is to provide illustrations of the strategy in the post-genomic era for the identification and initial characterization of mutated genes in inherited human diseases and animal models.
1.1 CONCEPT OF CLASSIC POSITIONAL CLONING
Positional cloning, also called reverse genetics, is the identification and cloning of a specific gene, with its chromosomal location being the only available information about that gene (Collins, 1990). The identification of the X-linked gene for chronic granulomatous disease in 1986 was the first report employing such a strategy (Baehner et al., 1986; Royer-Pokora et al., 1986). For the past several decades, positional cloning has been widely used in humans, animals, and plants to isolate genes known only by their phenotypic effects. Underlying positional cloning is the assumption that a gene’s location can be pinpointed with sufficient precision to narrow it down to a DNA segment that is small enough to be sequenced and/or subjected to transformation/complementation experiments.

The classic procedure for positional cloning usually includes several steps, as shown in Figure 1.1. It starts with phenotype collection from a genetically mappable population. The population genetics necessary for creating the mappable population is beyond the scope of this chapter (Holsinger and Weir, 2009; Zou, 2009). Briefly, however, a mutant phenotype can be genetically mapped when (1) the phenotype shows Mendelian inheritance, (2) the phenotype is differentially distributed among individuals within the population, and (3) the population is large enough to reach statistical significance when the phenotype is analyzed using mapping software.

Figure 1.1. Procedure for identifying a mutated gene using the classic positional cloning strategy. Steps shown: collection of phenotype information; collection of genotype information; initial mapping of trait locus; fine mapping of trait locus; genomic contig construction; analysis of genomic elements; selection of candidate genes; confirmation of candidate genes.

Parallel to the phenotype collection, genotype information of the same individuals in the same population is collected. Usually, molecular markers that segregate in the population along each and every chromosome are analyzed. The collected phenotype and genotype data from the population are used in conducting linkage analysis with one of a variety of software packages to define the chromosomal regions that the locus is likely to occupy. If a trait is controlled by a single gene or locus, the linkage analysis should point to a single chromosomal region. For traits regulated by multiple genes, multiple loci, or quantitative trait loci, multiple chromosomal regions are identified. To actually identify the gene underlying the trait of interest, fine mapping has to be conducted to narrow down the chromosomal regions so that genomic searching is practical. The next step, then, is to construct a genomic contiguous region (contig), which is defined as a set of overlapping segments of DNA, to connect and cover all the genomic elements in the targeted area. After a precise contig is constructed, it is sequenced and analyzed by a technique termed chromosome walking. This is a lengthy procedure that involves the recognition of potential genes, noncoding genes, and/or coding and noncoding regions. Finally, potential candidate genes should be confirmed using a variety of genetic and biochemical methods.

Because all of these procedures require a large amount of work, positional cloning typically requires a team effort, and positional cloning projects have been known to take many years. First, the genetic region needs to be narrowed down as precisely as possible by means of initial linkage analysis and fine mapping. Second, linkage analysis requires both the availability of a large pedigree and PCR-based analysis of microsatellite markers in that pedigree to allow a whole-genome search for linkage. Fine mapping is a particularly difficult task consisting of breaking the linkage and identifying useful markers in the targeted region.
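The linkage analysis step is commonly summarized by a LOD score, which compares the likelihood of the observed recombinants at a recombination fraction theta against free recombination (theta = 0.5). The following is a minimal sketch for the simplest case, assuming phase-known, fully informative meioses; it is not a description of any particular mapping package, and the counts are hypothetical.

```python
from math import log10


def lod_score(recombinants, meioses, theta):
    """LOD for linkage at recombination fraction theta versus no linkage (theta = 0.5)."""
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    nonrecombinants = meioses - recombinants
    log_linked = recombinants * log10(theta) + nonrecombinants * log10(1 - theta)
    log_unlinked = meioses * log10(0.5)
    return log_linked - log_unlinked


def max_lod(recombinants, meioses):
    """Grid-search theta = 0.01 .. 0.50 for the highest LOD."""
    thetas = [i / 100 for i in range(1, 51)]
    best = max(thetas, key=lambda t: lod_score(recombinants, meioses, t))
    return best, lod_score(recombinants, meioses, best)


# Hypothetical counts: 2 recombinants among 20 informative meioses.
theta_hat, lod = max_lod(recombinants=2, meioses=20)
print(theta_hat, round(lod, 2))
```

With these counts the LOD peaks at theta = 0.1 and exceeds 3, the threshold conventionally taken as evidence for linkage. Real pedigrees with unknown phase or missing genotypes require the fuller likelihood calculations implemented in dedicated software.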
Contig construction entails identification of a large-insert genomic library, either BAC (bacterial artificial chromosomes) or YAC (yeast artificial chromosomes), with known markers. Analysis of genetic elements within a contig can be very difficult because of the lack of knowledge of both genes and gene organization. However, the recent completion of the human and mouse genome projects (e.g., Mouse Genome Sequencing Consortium, 2002), along with other new technology, such as mutation analysis and microarrays, allows unprecedented progress in positional cloning of mutant genes. There are five major changes in the technique of positional cloning (Hinkes et al., 2006): (1) Contig construction is no longer needed because whole genomes have been sequenced. (2) Sequencing of an entire region, usually about 10 Mbp of the genome, is no longer necessary, as those sequences are now readily available through public (Ensembl) and private (Celera) databases. (3) Sequence analysis requires much less time and effort since whole genomes have been annotated (e.g., we now know that a large fraction of the mouse genome is made up of repetitive sequences, such as transposons, that are easy to identify and, therefore, can be eliminated from further analysis). (4) Because of the availability of whole genome sequences and high-throughput
technologies, we can now work on much larger genomic regions, which eliminates fine mapping. (5) Annotation of genomes and bioinformatic algorithms have kept pace with the rapid acquisition of genomic data and permit an in silico assessment of candidate genes. This is the major theme of this book. As a result of new high-throughput technologies and whole-genome libraries, a genome-based integrative strategy is the most practical method for gene discovery in our current post-genome era (Gu et al., 2002; Jiao et al., 2005a, 2005b, 2007, 2008). Consequently, pure positional cloning in humans, animals, or plants is no longer necessary. Positional cloning is defined as cloning or identifying a gene with a specific function purely according to its position. In humans, mice, and rats, it is now rare for a mutation to be localized to a gene whose expression is entirely unknown. For example, microarray technology now places probes for every gene on a chip. As a result, microarray analysis of gene expression profiles has become routine in many laboratories. We may therefore soon find that expression data for every gene in every tissue are available to the public. Thus, for any gene, even if nothing else is known about it, its expression level in a tissue can be assessed. As such, the classic positional cloning method is of little utility in the rapidly evolving arena of functional genomics. A new procedure that integrates both genomic and high-throughput technology has been created and will be, and should be, the next generation’s tool of choice.
1.2 CONCEPT OF GENE DISCOVERY IN THE POST-GENOME ERA
The strategy for gene discovery using positional cloning depends on the availability of genetic-based data and technology. The new approach for gene discovery is highly integrative and is based on the availability of genome resources and biotechnology (Rintisch et al., 2008). There are three distinct and significant differences between new gene discovery strategies and classical positional cloning. The first is the elimination of fine mapping. Rather than narrowing down the genomic regions using several approaches, a large number of genomic regions can be searched to discover the genes of interest all at once. The second is the direct investigation of genetic elements within the targeted region, without contig construction or sequencing, because genomic sequences and annotations of genomic elements are available. The third is the high-throughput screening of candidates within the targeted region. The high-speed analytical methods include mutation screening, resequencing, gene expression profiling, and functional prediction (Jiao et al., 2008). The following chapters provide detailed information on each of those aspects. The first part of this book introduces the technologies and resources used in gene discovery in our post-genome era. The second part provides experimental procedures and methodologies for gene discovery using both genome resources and high-throughput technologies. The third and final part predicts the future direction of gene discovery based on the elucidation of genomes and developing technologies. We are living in an era of both technology explosion and unparalleled expansion of biological resources. The advances in gene discovery, however, are rooted in the technology of genome sequencing. Without the completion of whole genome sequences for humans and other species, gene discovery would still be stuck in the classic positional cloning approach. Therefore, gene discovery in every chapter is based on the fact that genomic sequences are available for the subjects of interest. Parallel with the necessity of completed genomes is the demand for, and rapid development of, high-throughput technologies necessary for mutation screening, genome analysis, and bioinformatics. Without these tools, there would be no effective way to capitalize on the completion of whole genomes, and our current rapid methods for gene discovery would not be possible. Because of their significance, Chapters 2–4 introduce these technologies. Chapters 2–6 illustrate a variety of approaches, including SNP analysis, DNA methylation, protein turnover rate measurement, microarray analysis, and bioinformatic tools. Finally, the integrative analysis of data from a variety of procedures provides clues to potential candidate genes for follow-up experiments, such as RT-PCR, DNA sequencing of the potential mutation(s), and/or Northern or Western blot analysis to determine the significance of the mutated gene. An important reminder to readers is that although this book mainly focuses on coding sequences known as genes, mutations in many other genetic elements could be identified using the same or similar technologies or procedures. 
Those non-gene elements of the genome include not only introns and the 5′ and 3′ ends of genes, but also many others (Chen et al., 2008), such as transcription factor binding sites, microRNAs, cis-acting elements, palindromic motifs, and/or conserved k-tuples (phylogenetic footprints) (Hui and Bindereif, 2010). Readers should keep in mind that gene regulation is a complicated process and that regulators are not necessarily near the genes they influence. Regulators located at long distances are called distant regulatory elements (REs) (Gotea and Ovcharenko, 2008) and include enhancers, repressors, and silencers. In addition, repetitive sequences sometimes play unexpected roles in gene regulation (Hui and Bindereif, 2005).
1.3 STRATEGIES FOR GENE DISCOVERY IN THE POST-GENOME ERA
Current experimental strategies for mutation screening have been summarized (Jiao et al., 2008) and are shown in Figure 1.2. Individual chapters in this book focus on one or more steps or different approaches of this strategy. We briefly touch on screening for mutations in DNA in this introduction using
Figure 1.2. Strategy of gene discovery through mutation models. [Flowchart; box labels: Identify mutation models and chromosomes of their disease loci; Bioinformatic search; Screen of coding regions; Whole genome sequencing; Determination of mutated gene; Knockdown; Gene network; Knockin; Function elucidation of mutated genes.]
the mouse as the model. Detailed procedures and methodologies are presented in Chapters 7–13. The first step is to determine the total number of genes/transcripts within the targeted region. Chapter 7 describes the genetic markers and methods for determining the genomic location of target genetic loci. Any of the many recently developed software programs (see, for example, www.genediscovery.org/pgmapper/index.jsp; Xiong et al., 2008a) can be used to identify every candidate gene from a defined genomic region. The next step is to evaluate candidate genes to reduce the list to a more workable size (Chapters 8–13). At this step, obvious candidate genes are evaluated first. We believe that a large number of differences exist between the gene of interest (GOI) in the mutant and in the wild type (control). Our current knowledge of gene function and bioinformatics should allow us to eliminate most of the unlikely candidate genes. A series of comparisons and functional analyses should be made to rule out variations in intron sequences as candidates if those sequences do not affect the phenotype (Chapters 11–13). In the end, a short list of candidate genes is expected or, in the best-case scenario, only one gene will remain. Finally, mutation evaluation or testing is carried out (Chapters 14–20). This evaluation considers differences between the GOI and control, sequence differences in these genes, potential gene function changes due to these differences, and whether other strains or populations have similar differences. Information on differences is combined with gene expression profiling and possible gene function to determine a list of candidate genes. Finally, selected candidate genes are tested and confirmed using a variety of experimental approaches, such as gene knockout and/or knockin.
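The first two steps described above, listing every gene in the mapped interval and then trimming that list, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Gene` record, the gene symbols, the coordinates, and the tissue-based filter are all hypothetical, and the sketch does not reproduce PGMapper or any real annotation database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical annotation record (not a real database schema)."""
    symbol: str
    chrom: str
    start: int           # 1-based, inclusive
    end: int             # inclusive
    tissues: frozenset   # tissues with reported expression


def genes_in_region(genes, chrom, start, end):
    """Step 1: every gene that overlaps [start, end] on the chromosome."""
    hits = [g for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]
    return sorted(hits, key=lambda g: g.start)


def expressed_in(genes, tissue):
    """Step 2 (one possible filter): keep genes expressed in the affected tissue."""
    return [g for g in genes if tissue in g.tissues]


# Made-up annotation used purely for illustration.
ANNOTATION = [
    Gene("GeneA", "chr14", 1_000_000, 1_050_000, frozenset({"bone", "liver"})),
    Gene("GeneB", "chr14", 4_990_000, 5_010_000, frozenset({"brain"})),
    Gene("GeneC", "chr14", 9_000_000, 9_200_000, frozenset({"bone"})),
    Gene("GeneD", "chr2", 2_000_000, 2_100_000, frozenset({"bone"})),
]

# Locus mapped to chr14:500,000-5,000,000; phenotype affects bone.
candidates = genes_in_region(ANNOTATION, "chr14", 500_000, 5_000_000)
shortlist = expressed_in(candidates, "bone")

print([g.symbol for g in candidates])
print([g.symbol for g in shortlist])
```

A gene that only partly overlaps the interval (GeneB here) is deliberately kept in the first list, since a mapping boundary is rarely exact. In practice the second step would combine several lines of evidence rather than the single expression filter used in this sketch.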
1.4 FUTURE DIRECTION
Gene discovery or mutation identification has gone through two stages, as we have discussed: the classical and the post-genome era. The next stage of gene discovery will depend on development of high-throughput technology and
Figure 1.3. Different stages of positional cloning (from left to right): classic, post-genome era, and future (dashed blue line). [Flowchart; box labels: Disease phenotype/trait of interest; Mapping information (which chromosome); Fine mapping (where on chromosome); Contig construction (DNA assembly); Search based on genome sequences (small region) (large region).]
bioinformatic tools. As shown in Figure 1.3, in the first stage (the classical approach), positional cloning of a GOI has to go through every step, including initial mapping, fine mapping, contig construction, and candidate searching based on genome sequences. At the current, second stage, fine mapping and contig construction are in most cases unnecessary because the genomic sequences and genetic elements within the targeted region are already known. The next stage of genomic cloning will allow researchers to conduct a search of candidate genes without mapping information (shown as dashed lines in Figure 1.3). At that stage, once a phenotype is found in an animal model or an individual, a search of candidate genes can be done based on the annotation of every gene or regulatory element in the genome. To reach the next stage, two critical improvements in our genomic research are needed. The first is the complete evaluation of the potential function of every gene and regulatory element in the whole genome. This seemingly large amount of work is most likely to be done within a decade or even sooner, as technologies for the analysis of gene function, SNP analysis, and proteomics are rapidly developing. The second is the availability of software for rapid, automatic, high-throughput searching. Currently, some programs such as PGMapper (Xiong et al., 2008a) provide the capability to search genome regions of several megabases. The capability of searching whole chromosomes and whole genomes within a reasonable time (under an hour) will follow the development of computational tools in coordination with genome and literature databases.
1.5 REFERENCES
Baehner RL, Kunkel LM, Monaco AP, Haines JL, Conneally PM, Palmer C, Heerema N, Orkin SH. (1986). DNA linkage analysis of X chromosome-linked chronic granulomatous disease. Proc Natl Acad Sci U S A 83(10):3398–401.
Chen HP, Lin A, Bloom JS, Khan AH, Park CC, Smith DJ. (2008). Screening reveals conserved and nonconserved transcriptional regulatory elements including an E3/E4 allele-dependent APOE coding region enhancer. Genomics 92(5):292–300. Epub Sept. 3.
Collins FS. (1990). Identifying human disease genes by positional cloning. Harvey Lect 86:149–64.
Gotea V, Ovcharenko I. (2008). DiRE: identifying distant regulatory elements of coexpressed genes. Nucleic Acids Res 36:W133–39. Epub May 17.
Gu W, Li X, Lau KH, Edderkaoui B, Donahue LR, Rosen CJ, Beamer WG, Shultz KL, Srivastava A, Mohan S, Baylink DJ. (2002). Gene expression between a congenic strain that contains a quantitative trait locus of high bone density from CAST/EiJ and its wild-type strain C57BL/6J. Funct Integr Genomics 1(6):375–86.
Hinkes B, Wiggins RC, Gbadegesin R, Vlangos CN, Seelow D, Nürnberg G, Garg P, Verma R, Chaib H, Hoskins BE, Ashraf S, Becker C, Hennies HC, Goyal M, Wharram BL, Schachter AD, Mudumana S, Drummond I, Kerjaschki D, Waldherr R, Dietrich A, Ozaltin F, Bakkaloglu A, Cleper R, Basel-Vanagaite L, Pohl M, Griebel M, Tsygin AN, Soylu A, Müller D, Sorli CS, Bunney TD, Katan M, Liu J, Attanasio M, O’Toole JF, Hasselbacher K, Mucha B, Otto EA, Airik R, Kispert A, Kelley GG, Smrcka AV, Gudermann T, Holzman LB, Nürnberg P, Hildebrandt F. (2006). Positional cloning uncovers mutations in PLCE1 responsible for a nephrotic syndrome variant that may be reversible. Nat Genet 38(12):1397–405. Epub Nov. 5.
Holsinger KE, Weir BS. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet 10(9):639–50.
Hui J, Bindereif A. (2005). Alternative pre-mRNA splicing in the human system: unexpected role of repetitive sequences as regulatory elements. Biol Chem 386(12):1265–71.
Jiao Y, Li X, Beamer WG, Yan J, Tong Y, Goldowitz D, Roe B, Gu W. (2005a). Identification of a deletion causing spontaneous fracture by screening a candidate region of mouse chromosome 14. Mamm Genome 16(1):20–31.
Jiao Y, Yan J, Zhao Y, Donahue LR, Beamer WG, Li X, Roe BA, Ledoux MS, Gu W. (2005b). Carbonic anhydrase-related protein VIII deficiency is associated with a distinctive lifelong gait disorder in waddles mice. Genetics Epub Aug. 22.
Jiao Y, Yan J, Jiao F, Yang H, Donahue LR, Li X, Roe BA, Stuart J, Gu W. (2007). A single nucleotide mutation in Nppc is associated with a long bone abnormality in lbab mice. BMC Genet 8:16.
Jiao Y, Jin X, Yan J, Zhang C, Jiao F, Li X, Roe BA, Mount DB, Gu W. (2008). A deletion mutation in Slc12a6 is associated with neuromuscular disease in gaxp mice. Genomics 91(5):407–14.
Koppel I, Aid-Pavlidis T, Jaanson K, Sepp M, Palm K, Timmusk T. (2010). BAC transgenic mice reveal distal cis-regulatory elements governing BDNF gene expression. Genesis 48(4):214–19.
Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–62.
Rintisch C, Ameri J, Olofsson P, Luthman H, Holmdahl R. (2008). Positional cloning of the Igl genes controlling rheumatoid factor production and allergic bronchitis in rats. Proc Natl Acad Sci U S A 105(37):14005–10. Epub Sept. 8.
Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, Cole FS, Curnutte JT, Orkin SH. (1986). Cloning the gene for an inherited human disorder—chronic granulomatous disease—on the basis of its chromosomal location. Nature 322(6074):32–38.
Xiong Q, Qiu Y, Gu W. (2008a). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13. Epub Jan. 18.
Xiong Q, Jiao Y, Hasty KA, Stuart JM, Postlethwaite A, Kang AH, Gu W. (2008b). Genetic and molecular basis of QTL of rheumatoid arthritis in rat: genes and polymorphisms. J Immunol 181(2):859–64.
Xiong Q, Jiao Y, Hasty KA, Canale ST, Stuart JM, Beamer WG, Deng HW, Baylink D, Gu W. (2009). Quantitative trait loci, genes, and polymorphisms that regulate bone mineral density in mouse. Genomics 93(5):401–14.
Zou F. (2009). QTL mapping in intercross and backcross populations. Methods Mol Biol 573:157–73.
CHAPTER 2
High-Throughput Gene Expression Analysis and the Identification of Expression QTLs
RUDI ALBERTS and KLAUS SCHUGHART
Contents
2.1 Concepts in High-Throughput Gene Expression Analysis
2.2 Technologies of High-Throughput Gene Expression Analysis
2.2.1 Gene Expression Microarrays
2.2.2 One-Channel Versus Two-Channel Microarrays
2.2.3 Oligonucleotide Versus Spotted Microarrays
2.2.4 Whole-Transcript Arrays
2.2.5 Genome Tiling Arrays
2.2.6 MicroRNA Arrays
2.3 Protocols
2.3.1 Image Analysis
2.3.2 Normalization
2.3.3 Quality Control
2.4 Applications and Limitations
2.4.1 Identification of Expression QTL and Gene Regulatory Networks
2.4.2 Identification of Differentially Expressed Genes
2.4.3 Identification of Cell-Type-Specific Genes
2.4.4 Determination of the Downstream Effects of a Mutation
2.4.5 Determination of the Downstream Effects of a Signaling Molecule
2.4.6 Predicting Vaccine Efficacy
2.4.7 Determination of Host Responses after Infection
2.4.8 Limitations
2.5 Questions and Answers
2.6 Acknowledgments
2.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
2.1 CONCEPTS IN HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS
Many diseases have a genetic basis. Together with influences from the environment, these genetic factors determine whether a certain disease will develop and how severe it will be. In some cases, a disease is determined by only one gene. Sickle-cell disease, for example, is caused by a mutation in the β-globin gene, which encodes a subunit of hemoglobin. This mutation causes red blood cells to adopt an abnormal sickle shape, which results in a risk of various complications and a shortened life expectancy. Another example of a single-gene disease is cystic fibrosis. This disease affects the exocrine glands of the lungs, liver, pancreas, and intestines and results in progressive disability and a severely shortened life expectancy. It is caused by a mutation in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. However, in most human diseases, multiple genes play a role in the development of the pathological symptoms. Examples of these so-called complex genetic diseases are cancer, obesity, diabetes, hypertension, asthma, and heart disease. Here, each gene contributes to a certain degree to the establishment of the phenotype, and we can assume that the contributing genes and their products operate in regulatory networks. They may enhance or inhibit each other. If multiple genes contribute to the development of a disease and the individual contribution of each gene is small, it is a major challenge to identify the causal disease genes and their interactions. The advent of new high-throughput analyses now makes it possible to study such complex genetic interactions and thus unravel the molecular basis of complex genetic diseases in humans. For example, high-throughput gene expression analysis allows one to measure the expression of tens of thousands of genes at the same time. Researchers can now compare complete gene expression profiles for diseased and healthy samples and obtain a direct insight into global gene expression changes. 
Thus these new technologies allow them to unravel the interplay between genes and to reconstruct gene regulatory networks for biological processes. The analysis of gene expression is based on the following basic biological principles: The genetic information of a cell is stored in genes, which are part of the DNA in the nucleus. DNA is transcribed into RNA and then processed to messenger RNA (mRNA), which transfers the information to the cytoplasm. Here the mRNA is translated into protein. Many proteins are enzymes that catalyze biochemical reactions. Other proteins have mechanical or structural functions. But proteins are also important in biological signaling processes, such as growth factor responses, immune responses, cell adhesion, and the cell cycle. Since proteins are major players in living organisms, they are also involved in the development of diseases. Therefore, to gain an understanding of processes that lead to disease, it is of great value to have a global picture of the amount of mRNA of all genes that are expressed in diseased and in healthy subjects. High-throughput gene expression microarrays measure these global changes and differences of mRNA and, therefore, give a very good indication of the processes that are abnormal in disease tissues.
2.2 TECHNOLOGIES OF HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS
2.2.1 Gene Expression Microarrays
Gene expression microarray technology enables the measurement of mRNA abundances in a high-throughput manner. Instead of directly using mRNA, more stable cDNA molecules are used, which are complementary copies of the RNA. These copies are created by a viral enzyme, reverse transcriptase, in a process called reverse transcription. Microarrays are small glass plates that are subdivided into thousands of spots. Short sequences of the nucleotides A, C, T, and G, commonly referred to as probes, are bound as spots to the glass surface (Fig. 2.1). All probes in one spot have a sequence that is reverse complementary to part of the sequence of the cDNA of a specific gene. The idea is that the cDNA generated from the mRNA that is expressed from this gene will hybridize (bind) to the probes on the specific spot. To make it possible to measure the amount of cDNA hybridized to the microarray, the cDNA is labeled with a fluorescent dye. After hybridization and removal of the cDNA that did not bind, the microarray is inserted into a scanner that reads the amount of fluorescence for each of the spots. These measurements represent the level of gene expression for all genes on the microarray and are generally represented in the manner shown in Figure 2.2. Using specialized software, the intensity of each spot in the image is quantified, providing a quantitative value of the mRNA expression level for each of the genes on the microarray.
2.2.2 One-Channel Versus Two-Channel Microarrays
There are different kinds of microarrays. A general distinction can be made between one-channel and two-channel microarrays. On a one-channel microarray, one fluorescently labeled sample is hybridized and the resulting expression values are read as absolute expression values for that sample. To compare the expression values between multiple samples, it is necessary to use multiple microarrays. The most widely known provider of one-channel microarrays is Affymetrix (www.affymetrix.com). On two-channel microarrays, one can directly compare the gene expression values of two different samples. Each of the samples is fluorescently labeled
Figure 2.1. Hybridization of labeled cDNAs to a gene expression microarray. The small glass plate contains millions of probes. Fluorescently labeled (spheres) cDNA binds to the probes on the microarray. Image courtesy of Affymetrix. [Figure labels: RNA fragments with fluorescent tags from sample to be tested; RNA fragment hybridizes with DNA on GeneChip array.]
using a different dye. In most cases a Cy5 (red) dye is used for one sample and Cy3 (green) for the other. This produces images like Figure 2.2 with black, yellow, red, and green spots. A red spot indicates that the sample labeled red has the higher expression value (and vice versa for green), and a yellow spot indicates that both samples have similar expression values. If the spot remains black, the gene is not expressed in either sample. Well-known providers of two-channel microarrays are Agilent (www.agilent.com) and Illumina (www.illumina.com).
2.2.3 Oligonucleotide Versus Spotted Microarrays
A second important distinction between microarray setups is that between oligonucleotide arrays and spotted arrays. On oligonucleotide arrays, the probes are attached to the microarray by the manufacturer. For example, Affymetrix uses chemical synthesis and photolithographic masks to build up the probes on the microarray. Here, all probes on the microarray are synthesized simultaneously, nucleotide by nucleotide. This results in high-density prefabricated microarrays.
Figure 2.2. (See color insert.) A microarray produced by a scanner. Each of the spots on the microarray represents a gene, and the color represents the amount of fluorescence that is measured, hence the amount of cDNA that was present in the original sample. Reprinted from Reinke (2006).
On the other hand, probes on spotted microarrays are synthesized before they are added (spotted) onto the glass. Such microarrays are sold without probes, and laboratories have to design and fabricate their own probes and fix them onto the microarray. This is often a cheaper solution since the gene density can be much lower. Also, the researcher can customize the microarray to each experiment.
2.2.4 Whole-Transcript Arrays
The classical microarrays mentioned earlier interrogate the mRNA at only one specific location. Agilent, for example, uses one 60-mer probe per gene to measure its expression. Affymetrix uses multiple 25-mer probes per gene to measure mRNA abundances, all of them located at the end of the gene, either in the 3′ untranslated region (3′ UTR) or in the last exon or exons (Fig. 2.3). Recent studies, however, indicate that alternative splicing plays a major role in the generation of proteins and thereby functional diversity in metazoan organisms (Blencowe, 2006). Alternative splicing means that different transcript isoforms are produced from the same gene, by variations in pre-mRNA
Figure 2.3. Probe coverage along the transcript. Gray regions represent exons, and black regions are introns that are removed during splicing. The short dashes underneath the exon regions indicate probes of the exon array and the classical 3′ array setup. [Figure labels: Genome; mRNA transcripts; Exon array probes; 3′ array probes.]
splicing. It is estimated that 40–60% of human genes have multiple splice forms (Modrek and Lee, 2002). These findings led to the development of a new type of microarray, the whole-transcript array, which is able to measure mRNA levels over the whole length of the gene. As depicted in Figure 2.3, Affymetrix exon arrays cover every exon of a gene with, on average, four probes. By using these microarrays, one can study global gene expression profiles as before but also detect different isoforms of a gene, such as transcripts with alternative 5′ start sites or an undefined 3′ end, nonpolyadenylated messages, or truncated or alternatively spliced transcripts.
2.2.5 Genome Tiling Arrays
The design of gene expression arrays and whole-transcript arrays is based on sequence information and annotation of known transcripts. Genome tiling arrays contain probes that are tiled over the whole genome at regular intervals, including both annotated regions of the genome and regions considered to be noncoding. Tiling arrays can thus be used to discover novel transcripts. The Affymetrix Human Tiling 1.0 R array set is a set of 14 microarrays that contain 45 million oligonucleotide probes covering the whole human genome. Probes have a length of 25 nucleotides and are tiled at an average resolution of 35 bp, leaving an average gap of 10 bp between probes.
2.2.6 MicroRNA Arrays
MicroRNAs (miRNAs) are single-stranded RNAs of very short size, 21–23 nucleotides in length. They do not code for proteins but are complementary to certain mRNA sequences. Binding to their target mRNA causes its degradation. In this way, miRNAs can regulate gene expression. It has been shown that miRNAs have an effect on various biological processes—for example, the development of cancer (He et al., 2005) and heart disease (Thum et al., 2007). Several commercial products are available for large
scale identification of miRNA. Example vendors are Affymetrix, Agilent, Invitrogen, Applied Biosystems, and Exiqon.
2.3 PROTOCOLS
2.3.1 Image Analysis
After the microarrays have been scanned, one obtains an image with thousands of individual spots (Fig. 2.2) representing the mRNA levels for each gene. Image analysis is then needed to quantify the intensity of each spot. Most microarray vendors provide software that performs image analysis and outputs quantitative intensity values per gene. Several steps are performed in such an image analysis. First, the image is filtered. This is a cleaning procedure by which small contamination artifacts such as dust particles are removed. Next, the location of the center of each spot is identified. This is called gridding. The next step is segmentation: for each pixel in the spot area, it is decided whether it belongs to the signal or to the background (signal detected by the scanner in an area where no hybridization has taken place). Finally, in the quantification step, the pixel values of each spot are summarized into one gene expression value and one background value.
2.3.2 Normalization
During a microarray experiment, multiple factors can introduce unwanted variation into the data. For example, if the experiment includes many samples that cannot all be labeled and hybridized in one day, the quality of the labeling may differ from day to day. This will lead to global expression differences between microarrays that are not due to biological differences. The aim of normalization is to remove such unwanted variation from the data so that different samples can be properly compared and real biological differences detected. Several normalization techniques exist; in the following sections, we describe the ones that are most commonly used. As a rule of thumb in microarray normalization, Wit and McClure (2004) suggest first normalizing all local features and then gradually progressing to normalizations that deal with several or all microarrays. This procedure involves the following steps.
2.3.2.1 Spatial Correction
Since probes are randomly distributed over the microarray, one expects a similar distribution of signals at each location on the array. After performing microarray experiments, however, there might be microarrays where this is not the case. For example, there might be an array where all signals tend to be systematically lower in one corner of the array. Yang et al. (2002) observed that the variation in signal can also be different at different locations on the array.
The spatial effect can be removed by robust smoothing of the expression data across the array in each channel separately. Here, a smooth surface is fitted to the data and subsequently subtracted from the data. To also correct for differences in variation at different locations on the array, one can divide by a location-dependent scale parameter. This parameter is obtained by smoothing the absolute differences between the expression values and the first smoothed surface (Wit and McClure, 2004). In cases of very strong abnormal local effects, it may even be best to exclude the array and repeat the experiment.
2.3.2.2 Background Correction
Microarray scanners always detect a background signal, even in places where no true signal is present. To obtain more accurate quantifications of gene expression values, several methods have been proposed to adjust for this background. Some methods work with local background values per spot. These background values are measured directly near the spot. Eisen (1999) simply subtracts the background value from the observed value to obtain a signal value. Kooperberg et al. (2002) apply a Bayesian approach, assuming that the mean of the observed pixel values is the sum of the mean true signal and the mean background signal. Because the background measurements are taken so close to the signal measurements, the background values may be contaminated with true signal. Therefore, several global background correction methods have been proposed that do not use per-spot background values. Wit and McClure (2004) suggest calculating the mean value of all empty spots on the array, subtracting that mean from all measurements, and setting any resulting negative values to zero. Irizarry et al. (2003) propose a probabilistic model that determines the conditional expectation of the true signal given the observed signal, assuming that the observed signal is the sum of the true signal and a background signal, that the spot intensities are drawn from an exponential distribution, and that the background intensities are drawn from a normal distribution. Both methods give similar results.
2.3.2.3 Dye-Effect Correction
The most commonly used dyes in two-channel microarray experiments are Cy5 (red) and Cy3 (green). Slight differences in the characteristics of these dyes, such as in the size of the molecules, lead to unwanted effects in the observed intensity signals. For Cy5 and Cy3 it was observed that the dyes often have an intensity-dependent effect. That is, for large expression values, one of the dyes tends to give higher expression values, while for small expression values it may give lower expression values (Fig. 2.4a). Yang and Speed (2003) suggest transforming the Cy3 vs. Cy5 scatter plot into an MA plot, which is basically a 45° rotation of the Cy3 vs. Cy5 scatter plot (Fig. 2.4b). The values of M and A are calculated as follows:
$$M = \log(\mathrm{Cy5}) - \log(\mathrm{Cy3}), \tag{2.1}$$
Figure 2.4. (a) Scatterplot of Cy3 versus Cy5 signal. (b) MA plot. Gray curve fitted by loess. (c) Normalized MA plot. (d) Normalized Cy3 versus Cy5 plot. [Axes: log2(Cy3) and log2(Cy5) in panels (a) and (d); A and M in panels (b) and (c).]
$$A = \frac{1}{2}\left(\log(\mathrm{Cy5}) + \log(\mathrm{Cy3})\right). \tag{2.2}$$
Then, using a function such as loess, a smooth curve is fitted through all data points, and a normalized MA plot is created by subtracting the fitted curve value from each M value (Fig. 2.4c). The normalized MA plot is transformed back into a normalized Cy5 vs. Cy3 scatter plot by applying the inverse of equations 2.1 and 2.2 to the normalized data (Fig. 2.4d):

$$\log(\mathrm{Cy5}) = A + M/2, \tag{2.3}$$

$$\log(\mathrm{Cy3}) = A - M/2. \tag{2.4}$$
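The dye-effect correction of equations 2.1 to 2.4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Yang and Speed (2003): logarithms are taken to base 2 (as on the axes of Figure 2.4), all function names are hypothetical, and a simple tricube-weighted local linear fit stands in for a full loess routine, which is not available in the Python standard library.

```python
import math

def to_ma(cy5, cy3):
    """Equations 2.1 and 2.2 (base-2 logs assumed): returns (M, A)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(cy5, cy3)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(cy5, cy3)]
    return m, a

def from_ma(m, a):
    """Equations 2.3 and 2.4: returns (log Cy5, log Cy3)."""
    log_cy5 = [ai + mi / 2 for mi, ai in zip(m, a)]
    log_cy3 = [ai - mi / 2 for mi, ai in zip(m, a)]
    return log_cy5, log_cy3

def local_linear_fit(a, m, span=0.5):
    """Tricube-weighted local linear regression of M on A.

    A simplified stand-in for loess (no robustness iterations).
    `span` is the fraction of points used in each local fit.
    """
    n = len(a)
    k = min(n, max(3, math.ceil(span * n)))
    fitted = []
    for a0 in a:
        dists = sorted(abs(ai - a0) for ai in a)
        h = dists[k - 1] or 1e-12          # bandwidth: k-th nearest distance
        w = [(1 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(w)
        mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mean_m = sum(wi * mi for wi, mi in zip(w, m)) / sw
        var_a = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
        cov = sum(wi * (ai - mean_a) * (mi - mean_m)
                  for wi, ai, mi in zip(w, a, m))
        slope = cov / var_a if var_a > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def normalize_dye_effect(cy5, cy3, span=0.5):
    """Subtract the fitted trend from M, then map back to log intensities."""
    m, a = to_ma(cy5, cy3)
    trend = local_linear_fit(a, m, span)
    m_norm = [mi - ti for mi, ti in zip(m, trend)]
    return from_ma(m_norm, a)
```

Because only M is adjusted and A is left unchanged, the average log intensity of each spot is preserved; only the intensity-dependent difference between the two channels is removed.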
2.3.2.4 Normalization between Arrays

The global range of gene expression values can differ between arrays from one experiment to the next. These global changes are often the result of slight variations during sample preparation, labeling, microarray hybridization, and washes. The aim of normalization between arrays is to remove global expression differences so that multiple samples can be properly compared and real biological differences detected. The most straightforward way to normalize between microarrays is to equalize the median or mean value of each array and to adjust the scale to some fixed value. A disadvantage of this method is that it performs a linear scaling, which is not optimal if the distributions of the expression values differ. Therefore, another method, called quantile normalization, was proposed by several authors (e.g., Bolstad et al., 2003). This method equalizes the distributions of the expression values of all microarrays. The procedure is as follows:

1. Given n arrays of length g, form matrix A of dimension g × n, where each array is a column and each gene is a row.
2. Sort each column of A to give A_sort.
3. Take the means across rows of A_sort and assign this mean to each element in the row to get A_sortmean.
4. Get A_normalized by rearranging each column of A_sortmean to have the same ordering as the original A.

Figure 2.5 shows the distribution of all gene expression values before and after the quantile normalization procedure has been applied. After normalization, the distributions of values are all equal.

2.3.3 Quality Control

Performing a microarray experiment involves many steps, and there are many stages where things can go wrong. Here we describe the most common procedures for quality control and explain how they can be used to inspect the quality of the data.

2.3.3.1 Inspection of Signal Plots

As a first quality control measure, one can make a signal plot of all measured signals for all microarrays before and after normalization. If probes are randomly distributed over the microarrays, there should not be any patterns visible in these plots.
Visual inspection of these images may reveal cases in which, for example, a hair or pieces of dust disturb the signals. One can also check whether spatial effects have been properly corrected by the normalization methods.

2.3.3.2 Dissimilarity Measures

To detect deviating microarrays, Wit and McClure (2004) suggest calculating similarity measures between all pairs of microarrays. Two types of measures can be investigated: absolute similarities, indicating whether genes have similar levels over
Figure 2.5. Quantile normalization. (a) Box plots of the log2 signals for four microarrays before normalization, showing differences in the distributions. (b) The distributions of the log2 signals for the same microarrays after quantile normalization.
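The quantile normalization procedure illustrated in Figure 2.5 (steps 1 to 4 of Section 2.3.2.4) can be sketched as follows. This is a minimal illustration, not the implementation of Bolstad et al. (2003): the function name is hypothetical, each array is passed as a plain list of values, and tied values within an array are ranked in their original order, which is a simplifying assumption.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (the columns of matrix A).

    arrays: list of n lists, each holding g expression values.
    Returns a new list of n lists with identical value distributions.
    """
    n = len(arrays)
    g = len(arrays[0])
    # Step 2: sort each column to obtain A_sort.
    sorted_cols = [sorted(col) for col in arrays]
    # Step 3: mean across each row of A_sort (one mean per rank).
    row_means = [sum(col[i] for col in sorted_cols) / n for i in range(g)]
    # Step 4: give every value the mean of its rank, in the original order.
    normalized = []
    for col in arrays:
        order = sorted(range(g), key=lambda i: col[i])  # indices by rank
        out = [0.0] * g
        for rank, idx in enumerate(order):
            out[idx] = row_means[rank]
        normalized.append(out)
    return normalized

example = quantile_normalize([[5, 2, 3], [4, 1, 6]])
# example == [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization, every array contains exactly the same set of values, only in a different order, which is why the box plots in Figure 2.5b coincide.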
different arrays, and correlations, indicating coordinated changes of genes between arrays. As absolute similarity measures they use power distances

$$d_p(x, y) = \sqrt[p]{\sum_{i=1}^{n_{\mathrm{genes}}} |x_i - y_i|^p} \tag{2.5}$$
and use both the Manhattan distance d_1 and the Euclidean distance d_2. Inspection of the dissimilarity matrices directly identifies microarrays in which processing problems may have occurred.

2.3.3.3 Dimensionality Reduction

Another way to check for possible problems in the data is to perform a dimensionality reduction. A popular method is principal component analysis (PCA), which transforms a number of variables (gene expression values in this case) into a number of uncorrelated variables called principal components. The first component accounts for as much of the variability in the data as possible, and each following component accounts for as much of the remaining variability as possible. A quicker method for dimensionality reduction is Sammon mapping (Sammon, 1969). Instead of using the whole gene expression data matrix, it uses the distance matrix between arrays. It aims to find a representation of the arrays in a lower-dimensional space in such a way that the distances between the arrays are as close as possible to the distances in the original matrix. Inspection of a two-dimensional (2D) Sammon mapping can indicate whether samples have been swapped or whether specific arrays show deviating behavior. For example, the
Figure 2.6. Sammon mapping of microarray data of two strains of mice (A and B) infected with a virus, measured 3 days after the infection. Each measurement has been performed in three replicates: A.1.3 means mouse A, day 1 postinfection, replicate 3.
Sammon mapping in Figure 2.6 indicates that samples A.2.3 and B.2.3 have probably been swapped.

2.3.3.4 Pairwise Scatter Plots

Another good way to detect deviating microarrays is to inspect scatter plots of all expression values for all possible pairs of microarrays. Normally, the number of differentially expressed genes between two experimental conditions is small relative to the total number of genes on the microarray. A plot of the expression values of all genes in one condition against those in the other should therefore reveal a cloud of points on the diagonal with relatively few points off the diagonal. Comparing the scatter plots of all possible pairs of microarrays might reveal a single microarray that shows deviating scatter plots with all other microarrays, for example, scatter plots in which the cloud on the diagonal is broader than in the other pairwise plots. This would indicate that this microarray shows many more and larger changes in gene expression relative to the other samples than are seen in other comparisons. If these changes are not expected from a biological point of view, a technical problem might have caused them, and it will be better to repeat the microarray experiment for this sample.
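The pairwise comparisons of Sections 2.3.3.2 and 2.3.3.4 can also be summarized numerically. The sketch below computes the power distance of equation 2.5 between all pairs of arrays and reports the array with the largest mean distance to the others. The function names and the mean-distance ranking are illustrative assumptions of ours, not a procedure prescribed by Wit and McClure (2004); in practice the full dissimilarity matrix should still be inspected.

```python
def power_distance(x, y, p):
    """Power distance d_p of equation 2.5 (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def distance_matrix(arrays, p=2):
    """Symmetric matrix of power distances between all pairs of arrays."""
    n = len(arrays)
    return [[power_distance(arrays[i], arrays[j], p) for j in range(n)]
            for i in range(n)]

def most_deviating(arrays, p=2):
    """Index of the array with the largest mean distance to the others.

    A hypothetical screening heuristic: it only ranks arrays and does
    not decide whether the deviation is technical or biological.
    """
    d = distance_matrix(arrays, p)
    n = len(arrays)
    mean_dist = [sum(row) / (n - 1) for row in d]  # self-distance is 0
    return max(range(n), key=lambda i: mean_dist[i])

# Toy log2 signals for four arrays; the last one deviates from the rest.
arrays = [
    [8.0, 10.0, 12.0],
    [8.1, 10.0, 11.9],
    [7.9, 10.1, 12.0],
    [10.0, 8.0, 14.0],
]
suspect = most_deviating(arrays)  # index of the deviating array
```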
2.3.3.5 Sex-Specific Gene Expression

If the experiment involves both male and female samples, mislabeling of the microarrays can be detected by comparing the expression values of sex-specific genes. Xist is one example of a female-specific gene; it should be expressed only in female samples. In this way, samples that have been mixed up can be easily identified.

2.4 APPLICATIONS AND LIMITATIONS

2.4.1 Identification of Expression QTL and Gene Regulatory Networks

Combining gene expression profiles with genetic information represents a new, powerful approach for identifying genes in disease models. This approach is generally referred to as the identification of expression quantitative trait loci (eQTL) (Rockman and Kruglyak, 2006) or genetical genomics (Jansen and Nap, 2001). A quantitative trait locus is a specific region on the genome where one or more genes are located that most likely regulate a phenotypic trait. By making use of specific populations of genetically related organisms, measuring the trait values for many individuals of the population, and combining them with the genetic information of the individual organisms, one can identify the locations on the genome regulating the trait.

Recombinant inbred lines (RILs) are often used for QTL analysis. In mice, these lines are obtained by breeding two genetically different inbred parental lines and by performing brother–sister mating from a large number of F1 hybrid pairs for about 20 generations. As can be seen from Figure 2.7, the parental strains produce F1 offspring that are heterozygous. The F1 offspring are mated to produce F2 animals. After many generations of brother–sister matings that start from a given F2 pair, recombinant inbred lines are obtained whose genomes represent a fixed mixture of the parental genomes and in which all individuals are again homozygous at every location.
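Why about 20 generations suffice can be illustrated with the classical recurrence for brother–sister mating, in which the expected heterozygosity in one generation equals half that of the previous generation plus a quarter of that of the generation before. The sketch below applies this recurrence; it is an idealized single-locus calculation that we add for illustration (no selection, a locus at which the two parental strains differ), and the function name is hypothetical.

```python
def expected_heterozygosity(generations):
    """Expected heterozygosity for F1, F2, ..., F<generations>.

    Assumes continued brother-sister mating and a locus at which the
    two inbred parents carry different alleles:
        H(F1) = 1, H(F2) = 1/2, H(t) = H(t-1)/2 + H(t-2)/4.
    """
    h = [1.0, 0.5]
    while len(h) < generations:
        h.append(0.5 * h[-1] + 0.25 * h[-2])
    return h[:generations]

h = expected_heterozygosity(20)
# h[0] is F1 (fully heterozygous); h[19] is F20.
for generation in (1, 2, 5, 10, 20):
    print(f"F{generation}: {h[generation - 1]:.4f}")
```

Under these assumptions, the expected heterozygosity falls to roughly 1% by F20, which is consistent with treating the resulting lines as homozygous at practically every location.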
The fixed parental genome mixture is, however, different from one line (starting with a given F2 pair) to another line (starting with a different F2 pair). The genetic makeup of each RIL is then determined using molecular markers. RILs thus allow phenotypes to be measured and subsequently related to the genotypes.

For the identification of eQTL, global gene expression profiles are determined by gene expression microarrays for each RIL. Subsequently, the expression profile of each gene is taken as a quantitative trait and is compared in a genome scan with the distribution of molecular markers (Fig. 2.8). For each molecular marker, the expression trait values are divided into two groups, according to the alleles that the individuals carry for that marker (Fig. 2.8c). Then a statistical test is performed that determines whether the means of both groups differ significantly. If they do, an eQTL is detected at that marker, as shown for the second marker in Figure 2.8. This result indicates that, with a high
Figure 2.7. Generation of recombinant inbred lines. [Diagram labels: Parents, F1, F2, F20.]
probability, there are one or several factors (genes) at that genomic location that regulate the expression of the target gene, since all individuals carrying the one allele have a low expression value and all individuals carrying the other allele have a high expression value. Once a QTL is identified, one can compare the location of the QTL with the location of the gene. If they coincide, the QTL is referred to as a cis-QTL or local QTL; otherwise it is called a trans-QTL or distant QTL.

Several methods exist for the identification of quantitative trait loci. The most straightforward method is called single-marker analysis. Here, a genomewide scan is performed, and at each molecular marker a regression test determines whether a QTL is present. In a second method, called interval mapping, the QTL likelihood is also determined at locations in between markers: at fixed genomic intervals, and by making use of the information for the surrounding markers, this method calculates QTL scores at the markers themselves and at places in between. Based on the idea that multiple QTL can regulate a quantitative trait, Jansen (1993) and Zeng (1993) proposed the method of composite interval mapping (multiple QTL mapping). Here, the existence of multiple QTLs regulating the expression of one trait is modeled. This allows the identification of epistatic QTL, that is, multiple QTL regions that regulate the trait by interacting with each other. It also allows the identification of multiple linked QTL.

The GeneNetwork (www.genenetwork.org) has been established as a rich resource for systems genetics. It contains a large collection of genotypes, phenotypes, and gene expression profiles for multiple organisms and genetic
Figure 2.8. (See color insert.) (a) The genomewide genotypes of eight recombinant inbred lines generated from a cross between two homozygous parents (A and B). Each row indicates the genome of a single RIL. The light or dark gray color in each of the RILs indicates whether that part of its genome was inherited from parent A or B. (b) Gene expression values are determined by microarrays. Four values are shown for each parent and one value for each of the RILs. (c) For three molecular markers, the gene expression values of the RILs are dissected into two groups, according to the allele they carry for that molecular marker (light or dark gray). A statistical test of each marker location calculates whether the means of both groups differ. The significances of the tests are plotted in a genomewide plot as a QTL plot. Here, a QTL peak is found for the second marker. Triangles in the QTL plot indicate the position of the gene, whose expression was used. If the gene coincides with the QTL peak, the QTL is referred to as a cis-QTL, otherwise, it is called a trans-QTL. Adapted from Alberts et al. (2005) by permission of Oxford University Press.
reference populations. It offers good tools for QTL and correlation analysis and for the identification of QTL genes and gene networks.

The identification of a trans-QTL means that the location probably regulates the expression of another gene, the target gene. Furthermore, several genes may map to the same trans-QTL, which indicates that all these genes appear to have a common regulator. By following links between genes, revealed by trans-QTLs, one can build up gene regulatory networks. These regulatory networks have the potential to explain the complex interplay of genes and their products affecting complex traits and diseases.

Ferrara et al. (2008) demonstrated how eQTL can be used to reconstruct networks. They used an F2 intercross between a diabetes-resistant and a diabetes-susceptible mouse strain and identified expression QTLs (eQTLs) as well as metabolite QTLs (mQTLs). mQTLs were determined by taking metabolite abundances as quantitative traits. For one metabolite, glutamate, they identified an mQTL interval that also contains eQTLs and transcripts with eQTLs elsewhere. Using this information, they reconstructed a regulatory network, demonstrating the validity of the network by showing that the genes respond to changes in glutamate. Crawford et al. (2008) described how eQTLs can be used to derive a transcriptional network that predicts breast cancer survival. In previous work, it was shown that extracellular matrix (ECM) gene dysregulation predicts both mouse mammary tumorigenesis and human breast cancer. They identified three reproducible eQTLs that regulate ECM gene expression. By correlation analyses and known association with metastasis, they identified seven candidate genes. Six out of the seven candidates appeared to suppress metastasis.

2.4.2 Identification of Differentially Expressed Genes
Microarrays are most often used for the identification of differentially expressed genes. In disease gene discovery, healthy samples and disease samples are compared and genes that are differentially expressed are identified. Thuong et al. (2008), for example, compared gene expression profiles of macrophages from individuals with different clinical manifestations of Mycobacterium tuberculosis infection. For three clinical phenotypes (latent, pulmonary, and meningeal tuberculosis) they identified lists of differentially expressed genes. Comparing the three phenotypes, they identified 261 genes having a greater than fivefold change in expression between any of the three conditions. Pennings et al. (2008) compared multiple microarray studies on acute lung inflammation models. The models included air pollutants; bacterial, viral, and parasitic infections; and allergic asthma models. They identified a cluster of 383 genes with an expression response that was common to all pulmonary diseases.

2.4.3 Identification of Cell-Type-Specific Genes
Another application of microarrays is to identify genes that are expressed in specific cell types. Sugimoto et al. (2006), for example, compared gene expression profiles in CD25+CD4+ regulatory T cells and CD25−CD4+ naive T cells. They found multiple genes that were expressed in a pattern that is specific for regulatory T cells. These genes are thought to be involved in differentiation and homeostasis of regulatory T cells.

2.4.4 Determination of the Downstream Effects of a Mutation
Von Bernuth et al. (2008) used microarrays to determine the downstream effects of a MyD88 mutation in humans. Nine patients with MyD88 deficiency suffered from life-threatening, often recurrent pyogenic bacterial infections. The authors identified the functional pathways in healthy fibroblasts that were regulated after treatment with interleukin 1β (IL-1β), tumor necrosis factor α (TNF-α), or Poly(IC) and compared them to the expression levels obtained from cells derived from patients. They identified a complete, specific lack of response to IL-1β as a defining characteristic of MyD88 deficiency.

2.4.5 Determination of the Downstream Effects of a Signaling Molecule

Type 1 interferon (IFN) contributes significantly to innate immune responses. Malakhova et al. (2006) reported that UBP43 is highly expressed in macrophages and inhibits type 1 IFN signaling. To understand the effect of UBP43 and type 1 IFN signaling, Zou et al. (2007) analyzed the genomewide gene expression profiles of IFN-β-stimulated genes in wild-type and UBP43−/− bone marrow–derived macrophages (BMMs). They identified 749 genes that were uniquely upregulated in UBP43−/− BMMs, including a large number of previously unidentified IFN-stimulated genes.

2.4.6 Predicting Vaccine Efficacy

Another application of microarrays is the identification of gene signatures that have a predictive value for a biological response. For example, Querec et al. (2009) used this approach to predict vaccine efficacy. They vaccinated humans with the yellow fever vaccine YF-17D and performed microarray experiments at 0, 1, 3, 7, and 21 days after vaccination in two independent trials. Using the DAMIP classification model (Lee, 2007; Brooks and Lee, 2008), they identified innate immune signatures that could predict subsequent adaptive immune responses. One signature predicted YF-17D CD8+ T cell responses with up to 90% accuracy and another signature predicted the neutralizing antibody response with up to 100% accuracy.

2.4.7 Determination of Host Responses after Infection

Microarrays have been successfully used in the characterization of host responses after infection.
For example, Kash et al. (2006) infected mice with a contemporary human influenza A/Texas/36/91 H1N1 virus (Tx91) and a reconstructed 1918 (H1N1) recombinant virus (r1918); the 1918 virus caused about 50 million deaths worldwide. They found that mice infected with the r1918 virus showed a much stronger inflammatory response. As another example, Ding et al. (2008) found differences between mouse strains after infection with influenza A.

2.4.8 Limitations

A limitation of using microarrays is that they measure mRNA abundances and not protein levels. Posttranscriptional modifications or mRNA degradation might cause actual protein levels to be different from gene expression levels measured with microarrays. In these situations, the transcriptional profiles obtained by microarrays do not fully correspond to the proteome within the cell.
2.5 QUESTIONS AND ANSWERS

Q1. Why are multiple microarrays in an experiment normalized?
Q2. What is an expression QTL (eQTL)? And how can it be used to discover gene-interaction networks?

A1. Multiple microarrays are normalized to remove nonbiological variation, such as technical variation, so that only biological variation remains.
A2. An eQTL is an expression quantitative trait locus, a genomic region that very likely contains one or multiple genes regulating the expression of another gene. Trans-eQTLs represent genomic regions that influence the expression of another gene located distantly. The trans-QTL region will very likely contain genes that directly influence the expression of the target gene(s). By relating multiple trans-QTLs with multiple target genes, one may obtain valuable hypotheses for gene–gene regulatory interactions.
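To make answer A2 concrete, the single-marker scan of Section 2.4.1 can be sketched as follows. This is an illustrative toy, not the method of any cited study: the function names, the genotype encoding ("A" or "B" per RIL), the use of a Welch t statistic as the test of group means, and the distance window used to call a QTL cis or trans are all assumptions of ours.

```python
import math

def welch_t(group_a, group_b):
    """Welch t statistic for the difference in means of two groups."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / na, sum(group_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    se = math.sqrt(var_a / na + var_b / nb)
    if se == 0:
        return 0.0 if mean_a == mean_b else math.copysign(math.inf, mean_b - mean_a)
    return (mean_b - mean_a) / se

def single_marker_scan(genotypes, expression):
    """Absolute t statistic per marker for one expression trait.

    genotypes: one string per marker, one allele letter ('A' or 'B') per RIL.
    expression: one expression value per RIL, in the same RIL order.
    """
    scores = []
    for marker in genotypes:
        a = [e for g, e in zip(marker, expression) if g == "A"]
        b = [e for g, e in zip(marker, expression) if g == "B"]
        ok = len(a) > 1 and len(b) > 1   # need two values per group
        scores.append(abs(welch_t(a, b)) if ok else 0.0)
    return scores

def classify_qtl(qtl_chr, qtl_pos, gene_chr, gene_pos, window):
    """Call a QTL 'cis' if it lies within `window` of the gene, else 'trans'."""
    same_place = qtl_chr == gene_chr and abs(qtl_pos - gene_pos) <= window
    return "cis" if same_place else "trans"

# Eight RILs and three markers, loosely following the layout of Figure 2.8.
genotypes = ["ABABABAB", "AAAABBBB", "AABBAABB"]
expression = [2.1, 1.9, 2.0, 2.2, 5.0, 5.2, 4.9, 5.1]
scores = single_marker_scan(genotypes, expression)
peak = scores.index(max(scores))  # index of the marker with the strongest signal
```

Repeating such a scan for every gene and collecting the trans-QTLs that several target genes share gives the candidate regulator-to-target links from which a network hypothesis can be built. In a real analysis, the test statistics would also need to be converted to significance values and corrected for multiple testing.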
2.6 ACKNOWLEDGMENTS

We would like to thank Dr. Robert Geffers for fruitful discussions. This work was supported by intramural grants from the Helmholtz Association (Program Infection and Immunity) and a research grant for the virtual institute GeNeSys (German Network for Systems Genetics, No VHVI-242) from the Helmholtz Association.
2.7 REFERENCES

Alberts R, Fu J, Swertz MA, Lubbers LA, Albers CJ, Jansen RC. (2005). Combining microarrays and genetic analysis. Briefings Bioinformatics 6(2):135–45.
Blencowe BJ. (2006). Alternative splicing: new insights from global analyses. Cell 126:37–47.
Bolstad BM, Irizarry RA, Astrand M, Speed TP. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–93.
Brooks JP, Lee EK. (2008). Analysis of the consistency of a mixed integer programming based multi-category constrained discriminant model. Ann Oper Res 164:1–20.
Crawford NP, Walker RC, Lukes L, Officewala JS, Williams RW, Hunter KW. (2008). The Diasporin pathway: a tumor progression-related transcriptional network that predicts breast cancer survival. Clin Exp Metastasis 25(4):357–69.
Ding M, Lu L, Toth LA. (2008). Gene expression in lung and basal forebrain during influenza infection in mice. Genes Brain Behav 7(2):173–83.
Eisen M. (1999). ScanAlyze User Manual. Available at http://rana.lbl.gov/manuals/ScanAlyzeDoc.pdf.
Ferrara CT, Wang P, Neto EC, Stevens RD, Bain JR, Wenner BR, Ilkayeva OR, Keller MP, Blasiole DA, Kendziorski C, Yandell BS, Newgard CB, Attie AD. (2008). Genetic networks of liver metabolism revealed by integration of metabolic and transcriptional profiling. PLoS Genet 4(3):e1000032.
He L, Thomson JM, Hemann MT, Hernando-Monge E, Mu D, Goodson S, Powers S, Cordon-Cardo C, Lowe SW, Hannon GJ, Hammond SM. (2005). A microRNA polycistron as a potential human oncogene. Nature 435(7043):828–33.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249–62.
Jansen RC. (1993). Interval mapping of multiple quantitative trait loci. Genetics 135:205–11.
Jansen RC, Nap JP. (2001). Genetical genomics: the added value from segregation. Trends Genet 17(7):388–91.
Kash JC, Tumpey TM, Proll SC, Carter V, Perwitasari O, Thomas MJ, Basler CF, Palese P, Taubenberger JK, García-Sastre A, Swayne DE, Katze MG. (2006). Genomic analysis of increased host immune and cell death responses induced by 1918 influenza virus. Nature 443(7111):578–81.
Kooperberg C, Fazzio TG, Tsukiyama T. (2002). Improved background correction for spotted DNA microarrays. J Computat Biol 9:57–68.
Lee EK. (2007). Large-scale optimization-based classification models in medicine and biology. Ann Biomed Eng 35:1095–109.
Malakhova OA, Kim KI, Luo JK, Zou W, Kumar KG, Fuchs SY, Shuai K, Zhang DE. (2006). UBP43 is a novel regulator of interferon signaling independent of its ISG15 isopeptidase activity. EMBO J 25(11):2358–67.
Modrek B, Lee C. (2002). A genomic view of alternative splicing. Nat Genet 30:13–19.
Pennings JLA, Kimman TG, Janssen R. (2008). Identification of a common gene expression response in different lung inflammatory diseases in rodents and macaques. PLoS ONE 3(7):e2596.
Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, Teuwen D, Pirani A, Gernert K, Deng J, Marzolf B, Kennedy K, Wu H, Bennouna S, Oluoch H, Miller J, Vencio RZ, Mulligan M, Aderem A, Ahmed R, Pulendran B. (2009). Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans. Nat Immunol 10(1):116–25.
Reinke V. (2006). Germline genomics. Available at www.wormbook.org.
Rockman MV, Kruglyak L. (2006). Genetics of global gene expression. Nat Rev Genet 7:862–72.
Sammon JW. (1969). A non-linear mapping for data structure analysis. IEEE Trans Comput 18:401–09.
Sugimoto N, Oida T, Hirota K, Nakamura K, Nomura T, Uchiyama T, Sakaguchi S. (2006). Foxp3-dependent and -independent molecules specific for CD25+CD4+ natural regulatory T cells revealed by DNA microarray analysis. Int Immunol 18(8):1197–209.
Thum T, Galuppo P, Wolf C, Fiedler J, Kneitz S, van Laake LW, Doevendans PA, Mummery CL, Borlak J, Haverich A, Gross C, Engelhardt S, Ertl G, Bauersachs J. (2007). MicroRNAs in the human heart: a clue to fetal gene reprogramming in heart failure. Circulation 116(3):258–67.
Thuong NT, Dunstan SJ, Chau TT, Thorsson V, Simmons CP, Quyen NT, Thwaites GE, Thi Ngoc Lan N, Hibberd M, Teo YY, Seielstad M, Aderem A, Farrar JJ, Hawn TR. (2008). Identification of tuberculosis susceptibility genes with human macrophage gene expression profiles. PLoS Pathog 4(12):e1000229.
von Bernuth H, Picard C, Jin Z, Pankla R, Xiao H, Ku CL, Chrabieh M, Mustapha IB, Ghandil P, Camcioglu Y, Vasconcelos J, Sirvent N, Guedes M, Vitor AB, Herrero-Mata MJ, Aróstegui JI, Rodrigo C, Alsina L, Ruiz-Ortiz E, Juan M, Fortuny C, Yagüe J, Antón J, Pascal M, Chang HH, Janniere L, Rose Y, Garty BZ, Chapel H, Issekutz A, Maródi L, Rodriguez-Gallego C, Banchereau J, Abel L, Li X, Chaussabel D, Puel A, Casanova JL. (2008). Pyogenic bacterial infections in humans with MyD88 deficiency. Science 321(5889):691–96.
Wit E, McClure J. (2004). Statistics for Microarrays: Design, Analysis and Inference. New York, John Wiley and Sons.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30(4):e15.
Yang YH, Speed T. (2003). Design and analysis of comparative microarray experiments. In Statistical Analysis of Gene Expression Microarray Data (ed. Speed T). Chapman & Hall/CRC, Boca Raton, FL, pp. 35–92.
Zeng ZB. (1993). Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc Natl Acad Sci U S A 90:10972–76.
Zou W, Kim JH, Handidu A, Li X, Kim KI, Yan M, Li J, Zhang DE. (2007). Microarray analysis reveals that type I interferon strongly increases the expression of immune-response related genes in UBP43 (USP18) deficient macrophages. Biochem Biophys Res Commun 356(1):193–99.
CHAPTER 3
DNA Methylation in the Pathogenesis of Autoimmunity

XUEQING XU*, PING YANG*, ZHANG SHU*, YUN BAI*, and CONG-YI WANG*

Contents
3.1 Introduction
3.2 General Information for DNA Methylation in Mammals
3.3 DNA Methyltransferases and Methyl-CpG-Binding Domain (MBD) Proteins
  3.3.1 DNA Methyltransferases
  3.3.2 MBD Proteins
3.4 DNA Methylation in T and B Cell Development
  3.4.1 DNA Methylation of IFN-γ Locus in Th1 Cell Development
  3.4.2 DNA Methylation of Th2 Cytokine Locus in Th2 Cell Development
  3.4.3 DNA Methylation in Regulatory T Cell and Th17 Development
  3.4.4 DNA Methylation in B Cell Maturation and Functionality
3.5 The Implication of DNA Methylation in Autoimmune Diseases
  3.5.1 DNA Methylation in Systemic Lupus Erythematosus
  3.5.2 DNA Methylation in Rheumatoid Arthritis
  3.5.3 DNA Methylation in Type 1 Diabetes
3.6 Common Technological Approaches for Assay of DNA Methylation
  3.6.1 Methylation-Specific PCR
  3.6.2 Bisulfite PCR
  3.6.3 Arbitrary Primed PCR
  3.6.4 Methylated DNA Immunoprecipitation Chip
3.7 Summary
3.8 Acknowledgments
3.9 References
* These authors contributed equally to this work. Correspondence should be addressed to Dr. Cong-Yi Wang,
[email protected]. Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
3.1 INTRODUCTION
Despite the characterization of our genome at DNA basepair level, we are still far from understanding the molecular events underlying the phenotypic variations such as disease susceptibility. It is now realized that epigenetic factors are also a significant contributor for a particular phenotype, indicating a dual inheritance for mammalian cells. Moreover, the epigenome-encoded information is superimposed on DNA sequences, consisting of three interconnected molecular mechanisms: DNA methylation (methylome) (Jeltsch et al., 2006; Vanden Berghe et al., 2006; Wilson et al., 2006), histone modification (chromatin remodeling) (Kouskouti and Talianidis, 2005; Trouche et al., 2003) and RNA interference (SiRNAs and microRNAs) (Andersen and Panning, 2003; Cheng et al., 2005; Mattick, 2001), among which, DNA methylation is the most profound epigenetic mechanism because DNA methylation changes are also linked with the presence of an aberrant pattern of histone modification and microRNA expressions (Callinan and Feinberg, 2006; Esteller, 2006; Jones and Martienssen, 2005; Rauscher, 2005; Wilson et al., 2006). DNA methylation is a type of postsynthesis modification after every cycle of DNA replication. It involves the addition of a methyl group to DNA to the number 5 carbon of the cytosine. It is the oldest epigenetic mechanism known to correlate with gene repression and is also the most thoroughly studied epigenetic modification so far. DNA methylation influences, most notably, gene expression, in that hypermethylation of promoter regions is usually associated with transcriptional repression, while hypomethylation of control regions is generally associated with active gene transcription. However, as indicated, the presence of methylcytosine in the promoter of specific genes also has profound consequences on local chromatin structure in addition to the regulation of gene expression. 
Therefore, the end results of DNA methylation go far beyond the control of gene expression with much broader implications in genetic regulation such as genomic imprinting, X chromosome inactivation, and chromatin structure modifications (Tollefsbol, 2004). DNA methylation has been found in every vertebrate examined. In adult somatic tissues, DNA methylation typically occurs in a CpG dinucleotide context, while non-CpG methylation is prevalent in embryonic stem cells. In plants, cytosines are methylated both symmetrically (CpG or CpNpG) and asymmetrically (CpNpNp, N can be any nucleotide but guanine). In mammals, CpG dinucleotides are usually located at the promoters of genes, and the methylation of DNA in such promoter regions plays an important role in the gene activation and expression. Similar as nuclear genome, DNA methylome consists all inheritable information encoded by DNA methyaltion, which is established during development in a tissue-specific fashion (Yung et al., 2001). Methylation of CpG dinucleotides (also called CpG islands, regions of DNA enriched with CpG sites) is a unique epigenomic mechanism for suppressing the expression of genes not essentially or potentially detrimental to cellular function. Abnormal
demethylation of these CpG islands would lead to active transcription of the suppressed genes, which is associated with the development of diseases such as cancer. In this chapter we provide an overview of the role of DNA methylation in the development of autoimmunity. We particularly discuss its possible implication in the pathogenesis of systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), and type 1 diabetes (T1D).
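CpG islands are described above only as regions of DNA enriched with CpG sites. As an illustration of how that enrichment is often quantified, the following minimal Python sketch computes the GC fraction and the observed/expected CpG ratio of a sequence. The thresholds (length of at least 200 bp, GC fraction above 0.5, observed/expected ratio above 0.6) are commonly cited heuristic values and are an assumption here, not criteria taken from this chapter. The function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")          # number of CpG dinucleotides on this strand
    gc_fraction = (c + g) / n
    expected = (c * g) / n         # CpG count expected from base composition alone
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply assumed heuristic thresholds; real tools scan with sliding windows."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

A whole-sequence check like this is only a sketch. In practice the same statistics would be evaluated over sliding windows along a chromosome, and the thresholds would be tuned to the question at hand.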
3.2 GENERAL INFORMATION FOR DNA METHYLATION IN MAMMALS
DNA methylation is one of the most important epigenetic alterations in mammals. This modification can be inherited through cell division. DNA methylation is typically removed during zygote formation and reestablished through successive cell divisions during development. DNA methylation is a crucial part of normal development and cellular differentiation in mammals, for example in the regulation of imprinted genes and X chromosome inactivation. DNA methylation also serves as a protective mechanism for pathogen DNA; for example, many bacteria use DNA methylation to shield their own DNA from the endonuclease activity that destroys foreign DNA. DNA methylation stably alters the expression pattern of a particular gene in cells such that the cells can “remember where they have been.” In this way, cells programmed to be pancreatic islet cells during embryonic development would remain within the pancreatic islets throughout the life of the organism, with no need for additional signals directing them to remain there. In general, 70–80% of all CpGs in the human genome are methylated, and the majority of unmethylated CpG islands are located within or near gene promoters or first exons of housekeeping genes (Heller et al., 2010). In contrast, the promoters of noncoding RNAs and the regulatory regions of transposable elements are methylated, which prevents parasitic transposable and repetitive elements from replicating (Dolinoy et al., 2006). Studies in animal models indicated that DNA methylome-encoded information is both mitotically and meiotically heritable (Morgan et al., 1999; Rakyan et al., 2003). Alterations of the DNA methylome can be induced by a plethora of environmental insults (discussed later). 
Studies in a large collection of human monozygotic (MZ) twins indicated that these changes in the DNA methylome accumulate during one’s lifetime and establish a quantitative threshold for the induction of gene expression. This quantitative effect contributes to phenotypic discordance among MZ twins, including differences in susceptibility to disease and in a wide range of anthropomorphic features (Fraga et al., 2005). DNA methylation may affect the transcription of genes in two ways. First, the methylation of DNA may itself physically impede the binding of transcription factors to the gene. Second, and more important, methylated DNA may be bound by proteins known as methyl-CpG-binding domain (MBD) proteins. MBD proteins
then recruit additional proteins to the locus, such as histone deacetylases and other chromatin remodeling proteins that can modify histones, thereby forming compact, inactive chromatin termed silent chromatin (Fatemi and Wade, 2006; Fraga et al., 2003).
3.3 DNA METHYLTRANSFERASES AND METHYL-CPG-BINDING DOMAIN (MBD) PROTEINS
3.3.1 DNA Methyltransferases
DNA methylation is carried out by DNA methyltransferases (DNMTs), and at least five DNA methyltransferases (DNMT1, DNMT3a, DNMT3b, DNMT3L, and DNMT2) have been identified in eukaryotes. DNMT1 preferentially methylates hemimethylated substrates, such as DNA in the S phase, and is primarily involved in the maintenance of methylation patterns with each round of cell replication (Bestor and Ingram, 1983; Leonhardt et al., 1992). In contrast, DNMT3a and DNMT3b, which have significantly higher de novo methylation activity than DNMT1, contribute to de novo methylation during embryogenesis (Fatemi et al., 2002). The other two DNMTs, DNMT2 and DNMT3L, lack significant DNA-methylating activity. DNMT3L binds to DNMT3a and DNMT3b and regulates their function. The main function of DNMT2 is to methylate aspartyl-tRNA (Goll et al., 2006). According to their roles in maintenance methylation and de novo methylation, DNMTs can be divided into two general classes. Maintenance methylation is necessary to preserve DNA methylation after every cellular DNA replication cycle. In the absence of this activity, the replication machinery itself would generate daughter strands that are unmethylated, which over time would result in passive demethylation. DNMT1 is proposed to be a maintenance methyltransferase responsible for copying DNA methylation patterns to the daughter strands during DNA replication (Goyal et al., 2006). Mice with both copies of DNMT1 deleted die as embryos at approximately day 9, reflecting the requirement of DNMT1 activity for mammalian development (Chappell et al., 2006). DNMT3a and DNMT3b are thought to be the de novo methyltransferases that set up DNA methylation patterns early in development. DNMT3L is homologous to the other DNMT3 proteins but has no catalytic activity. 
Instead, DNMT3L assists the de novo methyltransferases by increasing their binding capacity to DNA and stimulating their enzymatic activity (Brenner and Fuks, 2006). Recently, DNMT2 (TRDMT1) has been identified as a DNA methyltransferase homolog containing all 10 sequence motifs common to DNA methyltransferases. However, DNMT2 (TRDMT1) does not methylate DNA; instead, it methylates cytosine-38 in the anticodon loop of aspartic acid transfer RNA (Goll et al., 2006).
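The passive demethylation described in this section can be illustrated with a small numerical sketch. It assumes a deliberately simplified model that is not taken from the cited studies: after each replication, the parental strand keeps its methylation level, the newly synthesized strand starts unmethylated, and a maintenance activity copies each parental mark onto the new strand with probability `maintenance_efficiency`. The function name and parameters are hypothetical.

```python
def methylation_after_divisions(initial_level, maintenance_efficiency, divisions):
    """Average per-strand CpG methylation level after a number of replications.

    initial_level          -- starting fraction of methylated CpGs (0 to 1)
    maintenance_efficiency -- assumed probability that a parental mark is
                              copied to the daughter strand (0 to 1)
    divisions              -- number of replication cycles
    """
    if not 0.0 <= initial_level <= 1.0:
        raise ValueError("initial_level must be between 0 and 1")
    if not 0.0 <= maintenance_efficiency <= 1.0:
        raise ValueError("maintenance_efficiency must be between 0 and 1")
    level = initial_level
    for _ in range(divisions):
        new_strand = maintenance_efficiency * level  # marks copied to the new strand
        level = (level + new_strand) / 2             # average of old and new strands
    return level


if __name__ == "__main__":
    for eff in (1.0, 0.9, 0.0):
        print(eff, round(methylation_after_divisions(0.8, eff, 5), 4))
```

Under this model the level is multiplied by (1 + efficiency) / 2 at each division. Perfect maintenance preserves the pattern indefinitely, whereas no maintenance halves the methylation level with every replication cycle.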
3.3.2 MBD Proteins
An evolutionarily conserved family of DNA-binding proteins characterized by a common sequence motif called the methyl-CpG-binding domain (MBD) is generally believed to convert the information represented by methylation patterns into the appropriate functional state (Fatemi and Wade, 2006; Hendrich et al., 1999). The MBD forms a wedge-shaped structure composed of a β-sheet superimposed over an α-helix and loop. Amino acid side chains in two of the β-strands, along with residues immediately N-terminal to the α-helix, interact with the cytosine methyl groups within the major groove, which provides the structural basis for the selective recognition of methylated CpG dinucleotides (Ballestar et al., 2001; Ohki et al., 2001). Thus far, five MBD proteins (MeCP2 and MBD1–MBD4) have been identified in mammals (Fatemi and Wade, 2006). These proteins share little primary-sequence homology with each other outside the MBD motif, except for MBD2 and MBD3, which share substantial sequence similarity. Furthermore, unlike its amphibian counterpart, mammalian MBD3 cannot selectively recognize methylated DNA because of a tyrosine-to-phenylalanine substitution within the MBD motif (Fraga et al., 2003). The other four MBD proteins are believed to function, at least in part, in transcriptional repression (Hendrich and Tweedie, 2003; Wade, 2001). Interestingly, MBD4 possesses DNA N-glycosylase activity and may function in DNA repair (Bird and Wolffe, 1999). In most cases, MBD proteins are ubiquitously expressed (Hendrich and Bird, 1998; Meehan et al., 1992). They represent an important class of chromosomal proteins that associate with protein partners to play active roles in transcriptional repression and/or heterochromatin formation. Therefore, it is believed that the DNA methylation pattern is “read” by the MBD proteins. 
Although their functions may partially overlap, genetic analysis suggests that each MBD protein has a distinct function. Most MBD-deficient animals survive to adulthood, although with varying abnormalities. However, mice deficient for MBD3 fail to survive embryogenesis (Hendrich et al., 2001). Lack of MeCP2 is associated with specific neurological defects in the mouse that mimic the symptoms observed in the human neurological disorder Rett syndrome, which is caused by mutation of the human MECP2 gene (Amir et al., 1999). Loss of MBD1 function is associated with neuronal defects, potentially related to the subtle upregulation of a specific class of endogenous retroelements (Zhao et al., 2003). Mice deficient for MBD4 showed an increased frequency of C → T transitions at CpG sites and accelerated tumor formation with CpG → TpG mutations in the Apc gene (Millar et al., 2002), indicating that MBD4 may be important for suppressing CpG mutability and tumorigenesis. In contrast, MBD2 deficiency is associated with a decreased incidence of colon tumors promoted by mutation of the adenomatous polyposis coli gene (Sansom et al., 2003). More interestingly, MBD2 may also be important in the regulation of the immune
response as loss of MBD2 function leads to changes in the abundance of transcripts for certain cytokines essential for T cell development (Hutchins et al., 2002; Hutchins et al., 2005).
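The MBD4 findings in this section concern C → T transitions at CpG sites. As a rough illustration of how such events could be tallied, the following Python sketch compares a reference sequence with a sample sequence and counts C → T changes inside and outside a CpG context. It assumes the two sequences are already aligned, of equal length, and read on the same strand. The function name is hypothetical, and the sketch ignores the complementary G → A change that the same event produces on the opposite strand.

```python
def count_c_to_t(reference, sample):
    """Count C -> T substitutions at CpG sites and elsewhere.

    Both inputs are assumed to be pre-aligned sequences of equal length.
    Returns a tuple (at_cpg, elsewhere).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    ref = reference.upper()
    smp = sample.upper()
    at_cpg = 0
    elsewhere = 0
    for i, (r, s) in enumerate(zip(ref, smp)):
        if r == "C" and s == "T":
            # CpG context: the reference C is immediately followed by G
            if i + 1 < len(ref) and ref[i + 1] == "G":
                at_cpg += 1
            else:
                elsewhere += 1
    return at_cpg, elsewhere
```

Comparing the two counts against the number of CpG and non-CpG cytosines in the reference would give a per-site rate, which is the quantity of interest when asking whether CpG sites are preferentially mutated.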
3.4 DNA METHYLATION IN T AND B CELL DEVELOPMENT
There is increasing evidence that the DNA methylome organizes the ability of signal transduction pathways to generate a restricted set of progeny from a multipotent progenitor. In addition, epigenomic effects seem to allow dividing cells to memorize, or imprint, signaling events that occurred earlier in their development (Ballestar et al., 2006; Reiner, 2005; Richardson, 2003; Sekigawa et al., 2003, 2006; Teitell and Richardson, 2003). DNA methylation is therefore emerging as a common strategy in the development and function of the mammalian immune system. Recent studies have not only demonstrated the importance of DNA methylation in T cell development and differentiation but also established its pivotal role in T cell polarization (Eivazova and Aune, 2004; Fields et al., 2004; Fitzpatrick and Wilson, 2003; Lee et al., 2002; Reiner, 2005). T cells and B cells are two of the most important components of the immune system, and the differentiation of T helper cells (i.e., Th1, Th2, or Th17) and the maturation of B cells play an important role in specific immune responses and antibody production. Although their genomic DNA is the same, different T cell subsets have different functions, most likely because they express different proteins. Th1 cells are characterized by the production of IFN-γ, Th2 cells secrete the cytokines IL-4 and IL-13, and Th17 cells preferentially produce IL-17A, IL-17F, IL-21, and IL-22. Another important T cell subset, regulatory T cells (Tregs), is characterized by expression of Foxp3 in the nucleus. There is mounting evidence that the production of cytokines and the expression of transcription factors necessary for the differentiation of T helper cells into different subsets are regulated at the level of DNA methylation.
3.4.1 DNA Methylation of IFN-γ Locus in Th1 Cell Development
IFN-γ is a signature cytokine for Th1 cells. Unlike the well-defined Th2 cytokine loci, little is known about the regulatory elements that govern the expression of IFN-γ. The methylation status of CpG sites in the regulatory elements around the IFN-γ locus is complicated: different CpG sites show different methylation patterns during Th1 development. Multiple regulatory elements and conserved noncoding sequences (CNS) have been identified in a region that extends 60–70 kb upstream and downstream of the mouse Ifn-γ locus, which include enhancers at CNS-34, CNS-22, CNS-6, CNS+18–20 and CNS+29, as well as a putative insulator at CNS+46 (Hatton et al., 2006; Wilson et al., 2009). In naive murine T cells, the IFN-γ promoter +29 and +46 are unmethylated, while CNS-54, CNS-6, intron 3, CNS-18, CNS-20, and CNS-55 are methylated (Bowen et al., 2008). In Th1 cells, methyl groups were removed
from CpG dinucleotides at CNS-54, CNS-6, some CpGs in IFN-γ intron 3, CNS-18 and CNS-20, indicating that demethylation of these elements is a prerequisite for IFN-γ expression during Th1 polarization. The involvement of DNA methylation in IFN-γ expression during Th1 development is further supported by observations in Th2 cells. The Th2 CpG methylation pattern resembles that in naive cells, except that more CpGs within the IFN-γ promoter are methylated. Studies have shown that during Th2 polarization, the DNA methyltransferase DNMT3a is recruited to the IFN-γ promoter; as a consequence, the promoter undergoes progressive de novo methylation (Jones and Chen, 2006). CpGs located at the −53 position become methylated rapidly, and this methylation inhibits the binding of the ATF2/c-Jun and CREB transcription factors to the IFN-γ promoter, which in turn suppresses IFN-γ expression and polarizes naive T cells toward the Th2 condition.
3.4.2 DNA Methylation of Th2 Cytokine Locus in Th2 Cell Development
The IL-4, IL-5, and IL-13 genes are closely linked in an evolutionarily conserved cytokine gene cluster. The region, called the Th2 cytokine locus, contains the IL-4, IL-5, and IL-13 genes, which encode well-known cytokines essential for Th2 development, together with the constitutively expressed Rad50 gene. Sustained TCR stimulation in the presence of IL-4 polarizes naive T cells to the Th2 condition, which silences the IFN-γ gene while activating the IL-4, IL-5, and IL-13 genes. Of note, the Th2 cytokine locus is conserved among mammalian genomes in terms of gene composition and the linear order of genes within the locus (Wilson et al., 2009). The transcription of these Th2 cytokines is tightly regulated through their promoters and by several additional regulatory elements, implicating epigenetic regulatory mechanisms such as DNA methylation. 
For example, the transcription of murine IL-4 is tightly controlled by regulatory elements mapped to DNase I hypersensitive site I (HSI) and HSII in the second intron of IL-4, DNase I hypersensitive site VA (HSVA) and HSV in the 3′ region of IL-4, and DNase I hypersensitive site s1 (HSs1) and HSs2 between IL-13 and IL-4 (Wilson et al., 2009). During T helper cell differentiation the IL-4 locus undergoes a complex series of methylation and demethylation steps. The 5′ region of the IL-4 locus is hypermethylated in naive T cells and becomes specifically demethylated in Th2 cells, whereas the highly conserved HSVA and HSV regions at the 3′ end show the converse behavior, being hypomethylated in naive T cells and becoming methylated during Th1 differentiation. The 5′ demethylation is not required for primary transcription of the IL-4 gene but is strongly associated with efficient, high-level induction of IL-4 transcripts by differentiated Th2 cells (Lee et al., 2002). In humans, CpG methylation is predominantly present at the IL-4 and IL-13 genes in naive and Th1 cells, while CpG demethylation occurs only in Th2 cells around the Th2-specific DNase I hypersensitive sites (Santangelo et al., 2002). This wave of CpG demethylation during Th2 development coincides with the consensus binding sites for the Th2-specific transcription factor GATA-3. Similarly, Makar and colleagues found that as naive T cells differentiated into Th1 populations, the decreased capacity for IL-4 expression was accompanied by increased recruitment of DNMTs to the IL-4 and IL-13 promoters along with increased CpG methylation at these regions (Makar et al., 2003; Makar and Wilson, 2004). In contrast, these promoters/regions were significantly demethylated once the cells were placed in the Th2 condition (Bowen et al., 2008; Makar et al., 2003). Together, these data establish that the CpG sites in the IL-4/IL-13 locus are mostly methylated in naive T cells and Th1 cells but are demethylated during Th2 differentiation.
3.4.3 DNA Methylation in Regulatory T Cell and Th17 Development
Regulatory T cells (Tregs) and Th17 cells are two recently described lymphocyte subsets with opposing actions. Tregs, also called suppressor T cells, are a specialized subpopulation of T cells that suppress activation of the immune system and thereby maintain immune homeostasis and tolerance to self-antigens (Bettini and Vignali, 2009; Wing and Sakaguchi, 2010). This subset of T cells is defined by the expression of the forkhead family transcription factor Foxp3. Because Foxp3 is required for Treg development, it has been used as a marker for Tregs in most current studies. CD4+Foxp3+ Tregs are a very heterogeneous population in both mice and humans, and different subsets of Tregs possess different levels of CpG methylation at the Foxp3 locus. Increased methylation of CpG dinucleotides at the Foxp3 locus has been linked with lower Foxp3 expression, decreased Treg stability, and reduced suppressive Treg function (Lal et al., 2009; Miyara et al., 2009). The regulation of the Foxp3 locus by methylation is perhaps the most thoroughly studied. Demethylation induced in human NK cells by 5-aza-cytidine, an inhibitor of DNA methylation, leads to Foxp3 expression (Zorn et al., 2006). Approximately 70% of CpGs in the human Foxp3 promoter are methylated in CD4+CD25lo cells, in contrast to about 5% in CD4+CD25hi T cells (Janson et al., 2008). Similarly, the methylation status of the CpG residues in the proximal promoter region has an essential role in murine Foxp3 expression. CpGs in the promoter region are all demethylated in mouse natural Tregs (nTregs). In contrast, 10–45% of these CpG sites are methylated in naive CD4+CD25– T cells (Kim and Leonard, 2007). It has been known for some time that TGF-β induces Foxp3 expression in peripheral naive CD4+CD25– T cells (Fu et al., 2004). It appears that TGF-β may increase Foxp3 expression by inducing demethylation of CpG islands at the promoter region in CD4+CD25– T cells (Luo et al., 2008). 
TGF-β was found to inhibit DNMT expression by suppressing the phosphorylation of ERK, and, consistent with this, inhibition of DNMT with either siRNA or chemicals leads to Foxp3 expression in CD4+ T cells (Lal et al., 2009).
In addition to promoter elements, there are regulatory cis-elements between noncoding exons that act as intronic enhancers. The intronic region (+4201 to +4500) of Foxp3 is highly conserved, and CpG islands within this region are completely methylated in naive CD4+CD25– T cells but fully demethylated in nTregs in mice (Floess et al., 2007; Kim and Leonard, 2007) and in humans (Baron et al., 2007). Consistent with these observations, TGF-β stimulation results in different levels of demethylation of CpG islands in this region in both mice and humans (Baron et al., 2007; Floess et al., 2007), leading to increased Foxp3 expression. Stable Foxp3 expression is vital for maintaining Treg function (Zhou et al., 2009a, 2009b). It has been noted that a fraction of CD4+Foxp3+ nTregs adoptively transferred into lymphopenic mice converted into Foxp3– T cells (Komatsu et al., 2009). Furthermore, under certain inflammatory conditions, Foxp3+ Tregs lose Foxp3 expression and suppressive function in an IL-6-dependent manner (Lal et al., 2009; Pasare and Medzhitov, 2003). Studies now provide evidence that DNA methylation may also play a role in the maintenance of Treg stability. For example, stable Foxp3 expression in nTregs is associated with demethylated CpG islands at the Foxp3 locus. In contrast, TGF-β-induced Tregs show methylated CpGs, which is associated with loss of their capability to maintain constitutive Foxp3 expression after restimulation in the absence of TGF-β (Lal et al., 2009). This emerging role is further supported by the observation that manipulation of DNA methylation can be used to induce Foxp3 expression and promote the conversion of naive T cells to Tregs (Moon et al., 2009). In contrast to Tregs, much less is known about the regulatory mechanisms and epigenetic processes that control Th17 development. 
Th17 cells are derived from naive CD4+ precursor cells and preferentially secrete a characteristic profile of cytokines, including IL-17A, IL-17F, IL-21, and IL-22 (Eisenstein and Williams, 2009). As a novel subset of effector cells, they have been implicated in the pathogenesis of allograft rejection and autoimmune diseases such as arthritis and experimental autoimmune encephalomyelitis (EAE) (Hofstetter et al., 2009; Yuan et al., 2008). Interestingly, both Th17 cells and Tregs can develop from naive CD4+ T cell precursors in the presence of the same cytokine, transforming growth factor β1 (TGF-β1), suggesting that they may be coordinately regulated by shared regulatory elements. Exposure of a naive CD4+ T cell to TGF-β1 and IL-6 results in the induction of RORγt, an orphan retinoic acid nuclear receptor that directs Th17-specific differentiation (Ivanov et al., 2006). It is interesting to note that IL-6 suppresses the development and function of Tregs (Samanta et al., 2008). Studies have shown that IL-6 induces DNMT1 expression and enhances its activity (Hodge et al., 2001), which then leads to STAT3-dependent methylation of the upstream Foxp3 enhancer and, as a consequence, represses Foxp3 expression. Therefore, IL-6 likely polarizes naive T cells to the Th17 condition, at least in part, by inducing remethylation of CpG DNA at the upstream enhancers of the Foxp3 promoter.
3.4.4 DNA Methylation in B Cell Maturation and Functionality
Many transcription factors act in a hierarchical and combinatorial manner to induce the specific gene expression necessary for each stage of differentiation of B-lineage cells. Interestingly, the DNA methylation profile changes dynamically during B-cell differentiation (Renaudineau et al., 2009). At the early B-cell differentiation stage, regulatory regions for the Pax5, Pu-1, and Igα/mb1 genes are demethylated (Amaravadi and Klemsz, 1999; Danbara et al., 2002; Maier et al., 2003). At the pre-pro-B stage, demethylation occurs at the CD19 promoter (Walter et al., 2008), while the CD21 promoter is demethylated in mature B cells (Schwab and Illges, 2001). In unstimulated B cells, DNMT1 and DNMT3a are expressed at low levels, and histones are mainly acetylated. Once the BCR is engaged, DNMT1 is induced and overexpressed and then methylates many CpG islands, leading to B cell maturation. The involvement of DNA methylation in B cell development is further demonstrated by the detection of DNMT3b missense mutations in patients with immunodeficiency, centromeric region instability, and facial anomalies (ICF) syndrome (Hansen et al., 1999; Shirohzu et al., 2002), a rare recessive disease characterized by B-cell differentiation abnormalities. Studies have shown that DNMT3b missense mutations account for 40% of patients reported worldwide (Ehrlich et al., 2006). More recently, studies have demonstrated that altered DNA methylation is associated with autoantibody production. For example, hydralazine and procainamide were found to interact with DNA (Dubroff and Reid, 1980; Lee et al., 2005) and to reverse the methylation of cytosines present in CpG islands. Hydralazine suppresses the induction of DNMT1 and DNMT3 transcription by inhibiting the extracellular signal-regulated kinase pathway, while procainamide, like 5-azacytidine, inhibits DNMT1 enzymatic activity (Deng et al., 2003; Lee et al., 2005). 
Both drugs affect the methylation status of CpG islands within regulatory sequences, leading to phenotypic changes in peripheral blood lymphocytes (PBL). As a result, treatment of animals with hydralazine induced autoantibodies against self-antigens associated with the development of lupus erythematosus (Deng et al., 2003; Dubroff and Reid, 1980; Mazari et al., 2007).
3.5 THE IMPLICATION OF DNA METHYLATION IN AUTOIMMUNE DISEASES
It is well accepted that susceptibility to autoimmune diseases such as rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and type 1 diabetes (T1D) is influenced by both genetic and environmental (epigenetic) factors. DNA methylation is now recognized as the major epigenetic mechanism that initiates and maintains heritable patterns of gene expression and gene function without changing the sequence of the genome (Callinan and Feinberg, 2006; Dennis, 2003; Holliday, 2006;
Jones and Martienssen, 2005; Sutherland and Costa, 2003; Wilson et al., 2006). Therefore, it acts as a “footprint” of gene–environment interactions or accumulated environmental exposures (Ushijima et al., 2006). Alterations in DNA methylation can be induced by a plethora of environmental factors such as diet (Aggarwal and Shishodia, 2006; Junien et al., 2005; Liu et al., 2003), lifestyle (Fraga et al., 2005), stress (Dennis, 2003; Meaney and Szyf, 2005), chronic inflammation (Tao and Robertson, 2003; Vanden Berghe et al., 2006), bacterial and viral infections (Li et al., 2005a; Maekita et al., 2006; Waterland and Jirtle, 2004), irradiation (Koturbash et al., 2006a, 2006b), and toxicants (Bombail et al., 2004). More importantly, these changes are not only heritable but also accumulate stably during an individual’s lifetime (Ballestar et al., 2006; Egger et al., 2004; Fraga et al., 2005).
3.5.1 DNA Methylation in Systemic Lupus Erythematosus
SLE is a chronic inflammatory connective tissue disorder that can involve joints, kidneys, mucous membranes, and blood vessel walls (Wardle, 2009). The presence of anti-DNA autoantibodies (Ab) and high levels of circulating free DNA are hallmarks of the onset of SLE. Sera from SLE patients show high levels of low-molecular-weight DNA (e.g., 100–250 bp) enriched with Alu sequences (55% versus 13% in the whole genome) (Li and Steinman, 1989; Sano et al., 1983). These Alu sequences were potentially derived from Z-DNA (Van Helden, 1985) and contain large amounts of demethylated CpG motifs associated with increased anti-DNA recognition in SLE patients and in mouse lupus models. In line with this observation, hydralazine was found to induce the Z-DNA conformation in a polynucleotide and to elicit anti-Z-DNA antibodies in treated patients (Thomas et al., 1993). It is believed that this circulating free DNA is derived from apoptotic cells, as suggested by the observation that hypomethylated apoptotic DNA is immunogenic (Wen et al., 2007). Studies from Richardson and colleagues have systematically demonstrated the role of DNA methylation in the occurrence of SLE (Kaplan et al., 2004; Quddus et al., 1993; Richardson, 1986; Richardson et al., 1990, 1992; Yung et al., 1996, 1995; Yung and Richardson, 1994). The group first noted global DNA hypomethylation in T cells derived from SLE patients (Richardson et al., 1990). They next reported that antigen-specific CD4 T cells develop self-reactivity to major histocompatibility complex determinants (HLA-D molecules) upon treatment with 5-aza-deoxycytidine (5-aza C), a DNA methylation inhibitor (Richardson, 1986), associated with the expression of leukocyte function-associated antigen 1 (LFA-1), an essential molecule implicated in T cell activation. 
The abnormal activation of LFA-1 is likely caused by 5-aza C-mediated demethylation of regulatory elements, and overexpression of LFA-1 alone seems to be sufficient to cause autoreactivity in SLE patients (Richardson et al., 1992). As a result, T cells with exogenous LFA-1 expression can induce a disease similar to SLE (Yung et al., 1996). They further
reported that altered perforin expression in SLE T cells is probably a result of DNA hypomethylation, which could in part account for the increase in T cell–mediated apoptosis in SLE patients (Kaplan et al., 2004). Studies in animals have shown that adoptive transfer of 5-aza C-treated cloned or polyclonal T cells induces diverse autoimmune manifestations such as anti-DNA autoantibody production, immune complex-mediated glomerulonephritis, central nervous system lesions resembling those seen in human SLE, and pulmonary alveolitis (Quddus et al., 1993; Yung et al., 1995; Yung and Richardson, 1994). Together, the results suggest that local hypomethylation of regulatory elements predisposes individuals to an increased risk of developing SLE. Given the effect of DNA demethylation or hypomethylation on gene transcription, these data support the hypothesis that hypomethylation of SLE-associated genes results in their expression or overexpression and subsequent development of the disease.
3.5.2 DNA Methylation in Rheumatoid Arthritis
RA is a chronic inflammatory autoimmune disease that may affect many tissues and organs but principally attacks the joints, producing an inflammatory synovitis that often progresses to the destruction of articular cartilage and ankylosis of the joints (Korb et al., 2009). As in other complex disorders, genetic predisposition is involved in RA pathogenesis. However, studies have demonstrated that the influence of epigenetic processes (environmental triggers) on the development of rheumatic diseases is probably as strong as that of genetic factors in terms of disease predisposition. It has become increasingly evident that epigenetic factors increase the risk of RA development in genetically predisposed individuals. Studies have shown that RA synovial fibroblasts (RASFs) have decreased levels of global DNA methylation (Karouzakis et al., 2009), which is associated with altered expression of cell-activating genes and could stimulate the innate immune response through TLR9 signaling. In line with this observation, the retrotransposable element LINE-1 is reactivated and transcribed in RASFs because of hypomethylation of CpG islands in its promoter (Neidhart et al., 2000). There is substantial evidence that IL-6 is implicated in RA pathogenesis in both animal models and patients (Alonzi et al., 1998; Nishimoto et al., 2004). Interestingly, IL-6 expression is tightly regulated by both transcriptional and posttranscriptional mechanisms. Studies in peripheral blood mononuclear cells (PBMCs) from RA patients and healthy controls found that a specific CpG site in the IL-6 promoter showed a lower level of methylation in cells from RA patients, which rendered PBMCs from RA patients much more sensitive to the induction of IL-6 expression upon LPS stimulation (Nile et al., 2008). 
In sharp contrast, CpG islands within the death receptor 3 (DR3) promoter are highly methylated in first- or second-passage RA synovial cells, along with significant downregulation of DR3 expression (Takami et al., 2006). Since DR3 is a member
of the apoptosis-inducing Fas gene family, the downregulation of DR3 could be responsible for the resistance to apoptosis in these cells. Together, the data reviewed here suggest a strong interplay between DNA methylation and RA development.
3.5.3 DNA Methylation in Type 1 Diabetes
T1D is an autoimmune disorder resulting from the breakdown of peripheral tolerance (Wang et al., 2006; Wang and She, 2008). As in other autoimmune diseases, a characteristic feature of T1D is the selective targeting of a specific cell type, the insulin-secreting β cells of the islets of Langerhans in the pancreas, by a certain population of autoreactive immune cells (Li et al., 2005b; Wang et al., 2006, 2008; Wang and She, 2008). It is well accepted that exogenous (epigenetic) factors modulate T1D susceptibility in genetically predisposed individuals (Dahlquist, 1998; Knip et al., 2005; Metcalfe et al., 2001). As a result, a great deal of research in the past few years has focused on dissecting environmental triggers for autoimmunity. There is an ever-increasing body of evidence that T1D development and progression are associated with diverse environmental triggers such as viral infection. The most popular hypothesis circulating within and beyond the scientific community is that viral infections enhance or elicit autoimmune disorders such as T1D (Filippi and von Herrath, 2008; van der Werf et al., 2007). Indeed, environmental triggers are often invoked to explain differences in disease frequency across populations and the rapid rise in disease frequency over the last few decades (Atkinson, 2005). As mentioned, studies in the last several years have not only demonstrated the importance of DNA methylation in T cell development and differentiation but also established its pivotal role in T cell polarization (Eivazova and Aune, 2004; Fields et al., 2004; Fitzpatrick and Wilson, 2003; Lee et al., 2002; Reiner, 2005). To generate an appropriate response to an infectious condition, the type of cytokine as well as the cell type, dose range, and kinetics of its expression are of critical importance. 
The NFκB transcription factors are, therefore, crucial in rapid responses to stress and pathogens (innate immunity) as well as in the development and differentiation of immune cells (acquired immunity). Interestingly, recent studies confirmed that DNA methylome settings are the ultimate integration sites of both environmental and differentiative inputs, determining proper expression of each NFκB-dependent gene (Egger et al., 2004; Fischle et al., 2003; Fuks, 2005; Henikoff et al., 2004; Natoli et al., 2005; Teferedegne et al., 2006; Vanden Berghe et al., 2006). Accordingly, the expression of many cytokines implicated in T1D development, such as IL-2, IFNγ, and IL-4, is also regulated by DNA methylation (Bix and Locksley, 1998; Bruniquel and Schwartz, 2003; Fitzpatrick et al., 1999; Grogan et al., 2001; Lee et al., 2002). Tissue-specific DNA hypomethylation was noted in male rats with type 1 diabetes (Williams et al., 2008). A recent study revealed that CpG islands in both the mouse INS2 and human INS promoters
are uniquely demethylated in insulin-producing pancreatic β cells (Williams et al., 2008). Methylation of these CpG sites could suppress insulin promoter-driven reporter gene activity by almost 90%; notably, specific methylation of the CpG sites in the cAMP responsive element (CRE) within the promoter was alone sufficient to reduce promoter activity by 50%. In NOD mice, a T1D-prone model of the human disease, we found that altered DNA methylation not only dysregulates T cell function but also results in abnormal DC activation (unpublished data). DNA methylation also affects all aspects of apoptosis, from its initiation to its execution (Fulda et al., 2001; Gopisetty et al., 2006). The DNA methylome in NOD pancreatic islets undergoes a rapid switch to hypomethylation upon the development of insulitis, suggesting that the autoimmune insult activates genes responsible for β cell apoptosis. In line with this notion, cytokine-induced apoptosis in NIT-1 cells, a NOD-derived β cell line, was significantly exaggerated upon the addition of a DNA methylation inhibitor, 5′-azaC (unpublished data). Taken together, these observations point to an important role for DNA methylation in regulating the gene expression that controls autoimmunity and autoimmune-mediated β cell destruction during T1D development.
3.6 COMMON TECHNOLOGICAL APPROACHES FOR ASSAY OF DNA METHYLATION
The analysis of DNA methylation patterns was once considered to be a formidable technical challenge. Methylation information is not retained during the amplification steps that form the basis of most standard molecular biology techniques, such as PCR, biological amplification by cloning in Escherichia coli, and signal amplification by probe hybridization. Therefore, methods for DNA methylation analysis generally rely on a methylation-dependent modification of the original genomic DNA before any amplification step (Eads et al., 2000). These methods can be roughly divided into two types: global and gene-specific methylation analysis.
For global methylation analysis, the typical conventional approach exploits the inability of some restriction enzymes to cut methylated DNA; HpaII–MspI (CCGG) and SmaI–XmaI (CCCGGG) are the two classical enzyme pairs widely used for this purpose (Sadri and Hornsby, 1996). In contrast, microarray-based high-throughput technology is a typical example of the modern technologies used for global DNA methylation analysis (Hayashi et al., 2007).
For gene-specific methylation analysis, a large number of techniques have been developed. In the past, the study of DNA methylation at particular sequences was almost entirely based on enzymes that can distinguish between methylated and unmethylated recognition sites in genes of interest. The digestion products can be detected by either PCR or Southern blotting, which generates different signal patterns depending on whether the sites are methylated or unmethylated (Ariel,
2001). Recently, bisulfite conversion-based techniques have become quite popular and are referred to as the second generation of methylation assays (Frommer et al., 1992). In this section, we briefly introduce several commonly used techniques for analysis of both global and gene-specific DNA methylation patterns.
3.6.1 Methylation-Specific PCR
Methylation-specific PCR (MSP) is a technique developed to rapidly assess the methylation status of practically any group of CpG sites within a CpG island, independent of cloning or methylation-sensitive restriction enzymes (Herman et al., 1996). The technique begins with modification of DNA by sodium bisulfite, which converts all unmethylated, but not methylated, cytosines to uracil, followed by PCR amplification using primers specific for methylated versus unmethylated DNA (Fig. 3.1). MSP requires only a very small amount of DNA and can detect as little as 0.1% methylated alleles at a given CpG island locus. MSP can also examine almost all CpG sites, not just those within sequences recognized by methylation-sensitive restriction enzymes, which markedly increases the number of sites that can be assessed. Another advantage of this approach is that it can be applied to DNA extracted from paraffin-embedded samples (Herman et al., 1996).
Figure 3.1. Methylation-specific PCR. After bisulfite conversion, genomic DNA was amplified with primers specific for the methylated DNA or unmethylated DNA. The resulting products were analyzed on an agarose gel. Sample 1 shows the products resulting from unmethylated DNA only; samples 2 and 3 are PCR products yielded from both methylated and unmethylated DNA.
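The conversion and primer logic described above can be sketched in a few lines of Python. This is a minimal in-silico illustration, not the protocol of Herman et al. (1996): the function names and the toy sequence are hypothetical, only the forward (top-strand) primer is modeled, and practical concerns such as incomplete conversion, strand choice, and melting temperature are ignored.

```python
def cpg_positions(seq):
    """Return the 0-based index of the C in every CpG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def bisulfite_convert(seq, methylated=()):
    """Simulate bisulfite treatment followed by PCR on the top strand.

    An unmethylated C reads as T (C -> U -> T); a C whose index is listed
    in `methylated` is protected and still reads as C.
    """
    protected = set(methylated)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq.upper())
    )


def msp_forward_primers(site):
    """Return (M, U) forward primer sequences for a primer binding site.

    The M primer assumes every CpG in the site is methylated; the U primer
    assumes none is. Non-CpG cytosines are converted in both.
    """
    m_primer = bisulfite_convert(site, methylated=cpg_positions(site))
    u_primer = bisulfite_convert(site)
    return m_primer, u_primer


# Hypothetical 18-nt primer binding site containing three CpG dinucleotides.
site = "GACGTCCAGCGTTACCGA"
m_primer, u_primer = msp_forward_primers(site)
# Positions at which the two primers differ, i.e., where discrimination occurs.
mismatches = [i for i, (m, u) in enumerate(zip(m_primer, u_primer)) if m != u]

print("M primer:", m_primer)
print("U primer:", u_primer)
print("discriminating positions:", mismatches)
```

In this toy site the two primers differ only at the CpG cytosines, which is what allows each primer pair to amplify one methylation state preferentially.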
Furthermore, MSP eliminates the frequent false-positive results caused by incomplete digestion with methylation-sensitive enzymes, and the amplified products can be easily validated by differential restriction patterns. Although MSP is a quick way to detect DNA methylation patterns, it has some drawbacks. It relies on PCR of complex CG-rich DNA and on the differential signal patterns of the resulting PCR products. Quantitative assessment is therefore quite difficult because of PCR bias between different samples and different primers. The key issue for this approach is to design primers that achieve good PCR efficiency. Primer sequences are usually chosen in regions containing frequent cytosines to distinguish unmodified from modified DNA. The CpG pairs covered by a primer should lie near its 3′ end to provide maximal discrimination between methylated and unmethylated DNA. Because the amplification primers have to cover CpG sites, this approach can analyze only a limited number of CpG residues in one assay, which renders the technique inapplicable to genomewide methylation analysis.
3.6.2 Bisulfite PCR
Similar to MSP, bisulfite PCR (BSP) also requires an initial bisulfite conversion. Unlike MSP, this approach is suitable for quantitative assessment and is applicable to genomewide analysis. The most important difference between the two assays is the primer design. In contrast to MSP, primers for BSP do not cover any CpG site. Therefore, the primers bind equally to methylated and unmethylated templates, and no discrimination takes place during the PCR step. This strategy leads to the simultaneous amplification of all sequence variants resulting from the various patterns of DNA methylation in the region between the two primers. The relative abundance of each of those sequence variants in the PCR mixture can then be easily assessed by a variety of standard methods such as DNA sequencing (Fig. 3.2). Because the PCR reaction is the same for methylated and unmethylated templates, the results produced by this approach are more comparable.
Figure 3.2. (See color insert.) Bisulfite-specific PCR and direct sequencing. Genomic DNA underwent bisulfite conversion followed by PCR amplification. The resulting PCR products were then purified and directly sequenced. The results at the left (GTGTGTATGGTTGGGTGTTTTTGGGGTGGGTAGGGAGGTGT) are from the unmethylated DNA; the results at the right (GCGCGTACGGTCGGGCGTTTTTGGGGTGGGTAGGGAGGCGC) are from the methylated DNA.
3.6.3 Arbitrary Primed PCR
Arbitrary primed PCR is a simple and reproducible way to generate fingerprints of complex genomes without any prior sequence information (Welsh and McClelland, 1990). This approach can be combined with methylation-specific assays such as methylation-sensitive restriction digestion for analysis of whole genome methylation patterns. Several methods based on this approach have been developed, such as methylation-sensitive arbitrary primed PCR (Gonzalgo et al., 1997), methylated CpG-island amplification (MCA) (Toyota et al., 1999), and amplification of intermethylated sites (AIMS) (Frigola et al., 2002). These methods are particularly useful because arbitrary primed PCR is carried out using DNA templates that have been enriched for methylated sequences, which leads to preferential amplification of CpG islands and gene-rich regions. However, all of these techniques require further validation by bisulfite genomic sequencing, and the background of PCR “noise” resulting from repetitive sequences must be taken into account.
3.6.4 Methylated DNA Immunoprecipitation Chip
The completion of the Human Genome Project has brought significant advances in high-throughput technologies. One typical example is the development of the high-density oligonucleotide-based whole-genome microarray (tiling array), which has now emerged as a new platform for genomic analysis far beyond simple gene expression profiling. Tiling arrays differ from conventional arrays in the nature of their probes: short probes (around 70-mer) are designed to cover the entire genome. These probes can be synthesized directly on the surface of the arrays by photolithography using light-sensitive synthetic chemistry and photolithographic masks (Fodor et al., 1991; Pease et al., 1994) or programmable optical mirrors (Nuwaysir et al., 2002). Tiling arrays can be made with >6,000,000 discrete features per chip, with each feature comprising millions of copies of a distinct probe sequence. Techniques for enriching genomewide methylated DNA sequences have also been developed, and one such technique is methylated DNA immunoprecipitation (MeDIP), which is carried out by chromatin immunoprecipitation of
methylated DNA using either monoclonal antibodies specific to 5-methylcytidine (5mC) or methyl-binding domain (MBD) proteins specific to methylated CpGs. One emerging technique that is beginning to gain popularity is the MeDIP chip, which combines chromatin immunoprecipitation with microarray analysis and helps screen unknown methylation sites on a genomewide scale. For MeDIP, DNA-protein complexes are first cross-linked in cells with formaldehyde, followed by immunoprecipitation with specific antibodies against the protein of interest (e.g., MBD2). DNAs bound by MBD protein are sheared to 0.2- to 2-kb fragments by sonication. The immunoprecipitated DNA and appropriate controls are fluorescently labeled and subsequently applied to chips for microarray analysis. Using input DNA as background, the profiles of methylated DNA in the genome can then be characterized by comparing the immunoprecipitated DNA with the background control. Unlike the conventional PCR amplification of specific target sequences from immunoprecipitated materials, the MeDIP chip is a genomewide “reverse-genetic” approach (Wu et al., 2006). Another advantage of the MeDIP chip is that it targets genes directly bound by protein factors, whereas classic expression arrays cannot distinguish between directly regulated genes and those changed secondarily.
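The comparison of immunoprecipitated DNA against the input background can be illustrated with a short Python sketch that computes a per-probe log2 enrichment ratio. This is only an assumed, simplified scoring scheme: the probe names, signal values, pseudocount, and the twofold threshold are all hypothetical, and real analyses also normalize between channels and rely on replicate-based statistics.

```python
import math


def log2_enrichment(ip_signal, input_signal, pseudocount=1.0):
    """Return log2(IP / input) for every probe.

    The pseudocount (an assumption of this sketch) avoids division by
    zero and damps ratios computed from very low intensities.
    """
    return {
        probe: math.log2(
            (ip_signal[probe] + pseudocount) / (input_signal[probe] + pseudocount)
        )
        for probe in ip_signal
    }


def enriched_probes(ratios, threshold=1.0):
    """Return probes whose log2 ratio meets the threshold (1.0 = twofold)."""
    return sorted(probe for probe, ratio in ratios.items() if ratio >= threshold)


# Hypothetical fluorescence intensities for three tiling-array probes.
ip = {"probe_a": 799.0, "probe_b": 99.0, "probe_c": 49.0}
background = {"probe_a": 99.0, "probe_b": 99.0, "probe_c": 199.0}

ratios = log2_enrichment(ip, background)
candidates = enriched_probes(ratios)

for probe in sorted(ratios):
    print(probe, round(ratios[probe], 2))
print("candidate methylated regions:", candidates)
```

Here only the first probe exceeds the threshold, so it alone would be carried forward as a candidate methylated region for validation.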
3.7
SUMMARY
Along with the characterization of our genome at the DNA base-pair level, it is now recognized that DNA methylome changes and the interactions between cis-acting elements and protein factors may play a central role in gene regulation, which could have significant implications for genome function in both health and disease. Altered DNA methylation is therefore thought to be an important risk factor contributing to the development of autoimmunity in genetically predisposed subjects. The novel high-throughput array-based method enables analysis of the DNA methylome on a genomewide scale. The results obtained with this method demonstrate its effectiveness for reliably profiling many CpG sites in parallel, by which informative methylation markers could be identified for disease prediction and prognosis. The MeDIP chip approach is therefore, thus far, the preferred strategy for advancing the genomewide analysis of the DNA methylome.
3.8
ACKNOWLEDGMENTS
This work was supported by grants from the Juvenile Diabetes Research Foundation International (JDRFI) and the EFSD/CDC/Lilly Program for Collaborative Diabetes Research between China and Europe to CYW. The authors declare that they have no competing financial interests.
3.9
REFERENCES
Aggarwal BB, Shishodia S. (2006). Molecular targets of dietary agents for prevention and therapy of cancer. Biochem Pharmacol 71:1397–421. Alonzi T, Fattori E, Lazzaro D, Costa P, Probert L, Kollias G, De Benedetti F, Poli V, Ciliberto G. (1998). Interleukin 6 is required for the development of collagen-induced arthritis. J Exp Med 187:461–68. Amaravadi L, Klemsz MJ. (1999). DNA methylation and chromatin structure regulate PU.1 expression. DNA Cell Biol 18:875–84. Amir RE, Van den Veyver IB, Wan M, Tran CQ, Francke U, Zoghbi HY. (1999). Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat Genet 23:185–88. Andersen AA, Panning B. (2003). Epigenetic gene regulation by noncoding RNAs. Curr Opin Cell Biol 15:281–89. Ariel M. (2001). A PCR-based method for studying DNA methylation. Methods Mol Biol 181:205–16. Atkinson MA. (2005). ADA Outstanding Scientific Achievement Lecture 2004. Thirty years of investigating the autoimmune basis for type 1 diabetes: why can’t we prevent or reverse this disease? Diabetes 54:1253–63. Ballestar E, Esteller M, Richardson BC. (2006). The epigenetic face of systemic lupus erythematosus. J Immunol 176:7143–47. Ballestar E, Pile LA, Wassarman DA, Wolffe AP, Wade PA. (2001). A Drosophila MBD family member is a transcriptional corepressor associated with specific genes. Eur J Biochem 268:5397–406. Baron U, Floess S, Wieczorek G, Baumann K, Grutzkau A, Dong J, Thiel A, Boeld TJ, Hoffmann P, Edinger M, Turbachova I, Hamann A, Olek S, Huehn J. (2007). DNA demethylation in the human FOXP3 locus discriminates regulatory T cells from activated Foxp3(+) conventional T cells. Eur J Immunol 37:2378–89. Bestor TH, Ingram VM. (1983). Two DNA methyltransferases from murine erythroleukemia cells: purification, sequence specificity, and mode of interaction with DNA. Proc Natl Acad Sci U S A 80:5559–63. Bettini M, Vignali DA. (2009). Regulatory T cells and inhibitory cytokines in autoimmunity. Curr Opin Immunol 21:612–18.
Bird AP, Wolffe AP. (1999). Methylation-induced repression—belts, braces, and chromatin. Cell 99:451–54. Bix M, Locksley RM. (1998). Independent and epigenetic regulation of the interleukin-4 alleles in CD4+ T cells. Science 281:1352–54. Bombail V, Moggs JG, Orphanides G. (2004). Perturbation of epigenetic status by toxicants. Toxicol Lett 149:51–58. Bowen H, Kelly A, Lee T, Lavender P. (2008). Control of cytokine gene transcription in Th1 and Th2 cells. Clin Exp Allergy 38:1422–31. Brenner C, Fuks F. (2006). DNA methyltransferases: facts, clues, mysteries. Curr Top Microbiol Immunol 301:45–66. Bruniquel D, Schwartz RH. (2003). Selective, stable demethylation of the interleukin-2 gene enhances transcription by an active process. Nat Immunol 4:235–40.
Callinan PA, Feinberg AP. (2006). The emerging science of epigenomics. Hum Mol Genet 15(Spec. no 1):R95–101. Chappell C, Beard C, Altman J, Jaenisch R, Jacob J. (2006). DNA methylation by DNA methyltransferase 1 is critical for effector CD8 T cell expansion. J Immunol 176: 4562–72. Cheng LC, Tavazoie M, Doetsch F. (2005). Stem cells: from epigenetics to microRNAs. Neuron 46:363–67. Dahlquist G. (1998). The aetiology of type 1 diabetes: an epidemiological perspective. Acta Paediatr Suppl 425:5–10. Danbara M, Kameyama K, Higashihara M, Takagaki Y. (2002). DNA methylation dominates transcriptional silencing of PAX5 in terminally differentiated B cell lines. Mol Immunol 38:1161–66. Deng C, Lu Q, Zhang Z, Rao T, Attwood J, Yung R, Richardson B. (2003). Hydralazine may induce autoimmunity by inhibiting extracellular signal-regulated kinase pathway signaling. Arthritis Rheum 48:746–56. Dennis C. (2003). Epigenetics and disease: altered states. Nature 421:686–88. Dolinoy DC, Weidman JR, Jirtle RL. (2006). Epigenetic gene regulation: linking early developmental environment to adult disease. Reprod Toxicol 23:297–307. Dubroff LM, Reid RJ Jr. (1980). Hydralazine-pyrimidine interactions may explain hydralazine-induced lupus erythematosus. Science 208:404–06. Eads CA, Danenberg KD, Kawakami K, Saltz LB, Blake C, Shibata D, Danenberg PV, Laird PW. (2000). MethyLight: a high-throughput assay to measure DNA methylation. Nucleic Acids Res 28:E32. Egger G, Liang G, Aparicio A, Jones PA. (2004). Epigenetics in human disease and prospects for epigenetic therapy. Nature 429:457–63. Ehrlich M, Jackson K, Weemaes C. (2006). Immunodeficiency, centromeric region instability, facial anomalies syndrome (ICF). Orphanet J Rare Dis 1:2. Eisenstein EM, Williams CB. (2009). The T(reg)/Th17 cell balance: a new paradigm for autoimmunity. Pediatr Res 65:26R–31R. Eivazova ER, Aune TM. (2004). Dynamic alterations in the conformation of the IFNγ gene region during T helper cell differentiation. 
Proc Natl Acad Sci U S A 101: 251–56. Esteller M. (2006). The necessity of a human epigenome project. Carcinogenesis 27: 1121–25. Fatemi M, Hermann A, Gowher H, Jeltsch A. (2002). DNMT3a and Dnmt1 functionally cooperate during de novo methylation of DNA. Eur J Biochem 269:4981–84. Fatemi M, Wade PA. (2006). MBD family proteins: reading the epigenetic code. J Cell Sci 119:3033–37. Fields PE, Lee GR, Kim ST, Bartsevich VV, Flavell RA. (2004). Th2-specific chromatin remodeling and enhancer activity in the Th2 cytokine locus control region. Immunity 21:865–76. Filippi CM, von Herrath MG. (2008). Viral trigger for type 1 diabetes: pros and cons. Diabetes 57:2863–71. Fischle W, Wang Y, Allis CD. (2003). Histone and chromatin cross-talk. Curr Opin Cell Biol 15:172–83.
Fitzpatrick DR, Shirley KM, Kelso A. (1999). Cutting edge: stable epigenetic inheritance of regional IFN-gamma promoter demethylation in CD44highCD8+ T lymphocytes. J Immunol 162:5053–57. Fitzpatrick DR, Wilson CB. (2003). Methylation and demethylation in the regulation of genes, cells, and responses in the immune system. Clin Immunol 109:37–45. Floess S, Freyer J, Siewert C, Baron U, Olek S, Polansky J, Schlawe K, Chang HD, Bopp T, Schmitt E, Klein-Hessling S, Serfling E, Hamann A, Huehn J. (2007). Epigenetic control of the Foxp3 locus in regulatory T cells. PLoS Biol 5:e38. Fodor SP, Read JL, Pirrung MC, Stryer L, Lu AT, Solas D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251:767–73. Fraga MF, Ballestar E, Montoya G, Taysavang P, Wade PA, Esteller M. (2003). The affinity of different MBD proteins for a specific methylated locus depends on their intrinsic binding properties. Nucleic Acids Res 31:1765–74. Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J, Boix-Chornet M, Sanchez-Aguilera A, Ling C, Carlsson E, Poulsen P, Vaag A, Stephan Z, Spector TD, Wu YZ, Plass C, Esteller M. (2005). Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci U S A 102:10604–09. Frigola J, Ribas M, Risques RA, Peinado MA. (2002). Methylome profiling of cancer cells by amplification of inter-methylated sites (AIMS). Nucleic Acids Res 30:e28. Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL. (1992). A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A 89:1827–31. Fu S, Zhang N, Yopp AC, Chen D, Mao M, Chen D, Zhang H, Ding Y, Bromberg JS. (2004). TGF-beta induces Foxp3 + T-regulatory cells from CD4 + C. Am J Transplant 4:1614–27. Fuks F. (2005). DNA methylation and histone modifications: teaming up to silence genes. 
Curr Opin Genet Dev 15:490–95. Fulda S, Kufer MU, Meyer E, van Valen F, Dockhorn-Dworniczak B, Debatin KM. (2001). Sensitization for death receptor- or drug-induced apoptosis by re-expression of caspase-8 through demethylation or gene transfer. Oncogene 20:5865–77. Goll MG, Kirpekar F, Maggert KA, Yoder JA, Hsieh CL, Zhang X, Golic KG, Jacobsen SE, Bestor TH. (2006). Methylation of tRNAAsp by the DNA methyltransferase homolog DNMT2. Science 311:395–98. Gonzalgo ML, Liang G, Spruck CH III, Zingg JM, Rideout WM III, Jones PA. (1997). Identification and characterization of differentially methylated regions of genomic DNA by methylation-sensitive arbitrarily primed PCR. Cancer Res 57:594–99. Gopisetty G, Ramachandran K, Singal R. (2006). DNA methylation and apoptosis. Mol Immunol 43:1729–40. Goyal R, Reinhardt R, Jeltsch A. (2006). Accuracy of DNA methylation pattern preservation by the DNMT1 methyltransferase. Nucleic Acids Res 34:1182–88. Grogan JL, Mohrs M, Harmon B, Lacy DA, Sedat JW, Locksley RM. (2001). Early transcription and silencing of cytokine genes underlie polarization of T helper cell subsets. Immunity 14:205–15.
Hansen RS, Wijmenga C, Luo P, Stanek AM, Canfield TK, Weemaes CM, Gartler SM. (1999). The DNMT3B DNA methyltransferase gene is mutated in the ICF immunodeficiency syndrome. Proc Natl Acad Sci U S A 96:14412–17. Hatton RD, Harrington LE, Luther RJ, Wakefield T, Janowski KM, Oliver JR, Lallone RL, Murphy KM, Weaver CT. (2006). A distal conserved sequence element controls Ifng gene expression by T cells and NK cells. Immunity 25:717–29. Hayashi H, Nagae G, Tsutsumi S, Kaneshiro K, Kozaki T, Kaneda A, Sugisaki H, Aburatani H. (2007). High-resolution mapping of DNA methylation in human genome using oligonucleotide tiling array. Hum Genet 120:701–11. Heller G, Zielinski CC, Zochbauer-Muller S. (2010). Lung cancer: from single-gene methylation to methylome profiling. Cancer Metastasis Rev 29(1):95–107. Hendrich B, Abbott C, McQueen H, Chambers D, Cross S, Bird A. (1999). Genomic structure and chromosomal mapping of the murine and human MBD1, MBD2, MBD3, and MBD4 genes. Mamm Genome 10:906–12. Hendrich B, Bird A. (1998). Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 18:6538–47. Hendrich B, Guy J, Ramsahoye B, Wilson VA, Bird A. (2001). Closely related proteins MBD2 and MBD3 play distinctive but interacting roles in mouse development. Genes Dev 15:710–23. Hendrich B, Tweedie S. (2003). The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends Genet 19:269–77. Henikoff S, Furuyama T, Ahmad K. (2004). Histone variants, nucleosome assembly and epigenetic inheritance. Trends Genet 20:320–26. Herman JG, Graff JR, Myohanen S, Nelkin BD, Baylin SB. (1996). Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc Natl Acad Sci U S A 93:9821–26. Hodge DR, Xiao W, Clausen PA, Heidecker G, Szyf M, Farrar WL. (2001). Interleukin-6 regulation of the human DNA methyltransferase (HDNMT) gene in human erythroleukemia cells. J Biol Chem 276:39508–11. 
Hofstetter H, Gold R, Hartung HP. (2009). Th17 cells in MS and experimental autoimmune encephalomyelitis. Int MS J 16:12–18. Holliday R. (2006). Dual inheritance. Curr Top Microbiol Immunol 301:243–56. Hutchins AS, Artis D, Hendrich BD, Bird AP, Scott P, Reiner SL. (2005). Cutting edge: a critical role for gene silencing in preventing excessive type 1 immunity. J Immunol 175:5606–10. Hutchins AS, Mullen AC, Lee HW, Sykes KJ, High FA, Hendrich BD, Bird AP, Reiner SL. (2002). Gene silencing quantitatively controls the function of a developmental trans-activator. Mol Cell 10:81–91. Ivanov II, McKenzie BS, Zhou L, Tadokoro CE, Lepelley A, Lafaille JJ, Cua DJ, Littman DR. (2006). The orphan nuclear receptor RORgammat directs the differentiation program of proinflammatory IL-17+ T helper cells. Cell 126:1121–33. Janson PC, Winerdal ME, Marits P, Thorn M, Ohlsson R, Winqvist O. (2008). Foxp3 promoter demethylation reveals the committed Treg population in humans. PLoS One 3:e1612. Jeltsch A, Walter J, Reinhardt R, Platzer M. (2006). German human methylome project started. Cancer Res 66:73–78.
Jones B, Chen J. (2006). Inhibition of IFN-gamma transcription by site-specific methylation during T helper cell development. EMBO J 25:2443–52. Jones PA, Martienssen R. (2005). A blueprint for a Human Epigenome Project: the AACR Human Epigenome Workshop. Cancer Res 65:11241–46. Junien C, Gallou-Kabani C, Vige A, Gross MS. (2005). [Nutritionnal epigenomics: consequences of unbalanced diets on epigenetics processes of programming during lifespan and between generations]. Ann Endocrinol (Paris) 66:2S19–28. Kaplan MJ, Lu Q, Wu A, Attwood J, Richardson B. (2004). Demethylation of promoter regulatory elements contributes to perforin overexpression in CD4+ lupus T cells. J Immunol 172:3652–61. Karouzakis E, Gay RE, Michel BA, Gay S, Neidhart M. (2009). DNA hypomethylation in rheumatoid arthritis synovial fibroblasts. Arthritis Rheum 60:3613–22. Kim HP, Leonard WJ. (2007). CREB/ATF-dependent T cell receptor-induced FoxP3 gene expression: a role for DNA methylation. J Exp Med 204:1543–51. Knip M, Veijola R, Virtanen SM, Hyoty H, Vaarala O, Akerblom HK. (2005). Environmental triggers and determinants of type 1 diabetes. Diabetes 54(Suppl 2):S125–36. Komatsu N, Mariotti-Ferrandiz ME, Wang Y, Malissen B, Waldmann H, Hori S. (2009). Heterogeneity of natural Foxp3+ T cells: a committed regulatory T-cell lineage and an uncommitted minor population retaining plasticity. Proc Natl Acad Sci U S A 106:1903–08. Korb A, Pavenstadt H, Pap T. (2009). Cell death in rheumatoid arthritis. Apoptosis 14:447–54. Koturbash I, Baker M, Loree J, Kutanzi K, Hudson D, Pogribny I, Sedelnikova O, Bonner W, Kovalchuk O. (2006a). Epigenetic dysregulation underlies radiation-induced transgenerational genome instability in vivo. Int J Radiat Oncol Biol Phys 66:327–30. Koturbash I, Rugo RE, Hendricks CA, Loree J, Thibault B, Kutanzi K, Pogribny I, Yanch JC, Engelward BP, Kovalchuk O. (2006b). Irradiation induces DNA damage and modulates epigenetic effectors in distant bystander tissue in vivo.
Oncogene 25:4267–75. Kouskouti A, Talianidis I. (2005). Histone modifications defining active genes persist after transcriptional and mitotic inactivation. EMBO J 24:347–57. Lal G, Zhang N, van der Touw W, Ding Y, Ju W, Bottinger EP, Reid SP, Levy DE, Bromberg JS. (2009). Epigenetic regulation of Foxp3 expression in regulatory T cells by DNA methylation. J Immunol 182:259–73. Lee BH, Yegnasubramanian S, Lin X, Nelson WG. (2005). Procainamide is a specific inhibitor of DNA methyltransferase 1. J Biol Chem 280:40749–56. Lee DU, Agarwal S, Rao A. (2002). Th2 lineage commitment and efficient IL-4 production involves extended demethylation of the IL-4 gene. Immunity 16: 649–60. Leonhardt H, Page AW, Weier HU, Bestor TH. (1992). A targeting sequence directs DNA methyltransferase to sites of DNA replication in mammalian nuclei. Cell 71:865–73. Li HP, Leu YW, Chang YS. (2005a). Epigenetic changes in virus-associated human cancers. Cell Res 15:262–71.
Li JZ, Steinman CR. (1989). Plasma DNA in systemic lupus erythematosus. Characterization of cloned base sequences. Arthritis Rheum 32:726–33. Li M, Guo D, Isales CM, Eizirik DL, Atkinson M, She JX, Wang CY. (2005b). Sumo wrestling with type 1 diabetes. J Mol Med 83:504–13. Liu L, Wylie RC, Andrews LG, Tollefsbol TO. (2003). Aging, cancer and nutrition: the DNA methylation connection. Mech Ageing Dev 124:989–98. Luo X, Zhang Q, Liu V, Xia Z, Pothoven KL, Lee C. (2008). Cutting edge: TGF-beta-induced expression of Foxp3 in T cells is mediated through inactivation of ERK. J Immunol 180:2757–61. Maekita T, Nakazawa K, Mihara M, Nakajima T, Yanaoka K, Iguchi M, Arii K, Kaneda A, Tsukamoto T, Tatematsu M, Tamura G, Saito D, Sugimura T, Ichinose M, Ushijima T. (2006). High levels of aberrant DNA methylation in Helicobacter pylori-infected gastric mucosae and its possible association with gastric cancer risk. Clin Cancer Res 12:989–95. Maier H, Colbert J, Fitzsimmons D, Clark DR, Hagman J. (2003). Activation of the early B-cell-specific MB-1 (IG-alpha) gene by Pax-5 is dependent on an unmethylated ETS binding site. Mol Cell Biol 23:1946–60. Makar KW, Perez-Melgosa M, Shnyreva M, Weaver WM, Fitzpatrick DR, Wilson CB. (2003). Active recruitment of DNA methyltransferases regulates interleukin 4 in thymocytes and T cells. Nat Immunol 4:1183–90. Makar KW, Wilson CB. (2004). DNA methylation is a nonredundant repressor of the Th2 effector program. J Immunol 173:4402–06. Mattick JS. (2001). Non-coding RNAs: the architects of eukaryotic complexity. EMBO Rep 2:986–91. Mazari L, Ouarzane M, Zouali M. (2007). Subversion of B lymphocyte tolerance by hydralazine, a potential mechanism for drug-induced lupus. Proc Natl Acad Sci U S A 104:6317–22. Meaney MJ, Szyf M. (2005). Environmental programming of stress responses through DNA methylation: life at the interface between a dynamic environment and a fixed genome. Dialogues Clin Neurosci 7:103–23. Meehan RR, Lewis JD, Bird AP. (1992).
Characterization of MeCP2, a vertebrate DNA binding protein with affinity for methylated DNA. Nucleic Acids Res 20:5085–92. Metcalfe KA, Hitman GA, Rowe RE, Hawa M, Huang X, Stewart T, Leslie RD. (2001). Concordance for type 1 diabetes in identical twins is affected by insulin genotype. Diabetes Care 24:838–42. Millar CB, Guy J, Sansom OJ, Selfridge J, MacDougall E, Hendrich B, Keightley PD, Bishop SM, Clarke AR, Bird A. (2002). Enhanced CpG mutability and tumorigenesis in MBD4-deficient mice. Science 297:403–05. Miyara M, Yoshioka Y, Kitoh A, Shima T, Wing K, Niwa A, Parizot C, Taflin C, Heike T, Valeyre D, Mathian A, Nakahata T, Yamaguchi T, Nomura T, Ono M, Amoura Z, Gorochov G, Sakaguchi S. (2009). Functional delineation and differentiation dynamics of human CD4+ T cells expressing the FoxP3 transcription factor. Immunity 30:899–911. Moon C, Kim SH, Park KS, Choi BK, Lee HS, Park JB, Choi GS, Kwan JH, Joh JW, Kim SJ. (2009). Use of epigenetic modification to induce Foxp3 expression in naive T cells. Transplant Proc 41:1848–54.
Morgan HD, Sutherland HG, Martin DI, Whitelaw E. (1999). Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23:314–18. Natoli G, Saccani S, Bosisio D, Marazzi I. (2005). Interactions of NF-kappaB with chromatin: the art of being at the right place at the right time. Nat Immunol 6:439–45. Neidhart M, Rethage J, Kuchen S, Kunzler P, Crowl RM, Billingham ME, Gay RE, Gay S. (2000). Retrotransposable L1 elements expressed in rheumatoid arthritis synovial tissue: association with genomic DNA hypomethylation and influence on gene expression. Arthritis Rheum 43:2634–47. Nile CJ, Read RC, Akil M, Duff GW, Wilson AG. (2008). Methylation status of a single CpG site in the IL6 promoter is related to IL6 messenger RNA levels and rheumatoid arthritis. Arthritis Rheum 58:2686–93. Nishimoto N, Yoshizaki K, Miyasaka N, Yamamoto K, Kawai S, Takeuchi T, Hashimoto J, Azuma J, Kishimoto T. (2004). Treatment of rheumatoid arthritis with humanized anti-interleukin-6 receptor antibody: a multicenter, double-blind, placebo-controlled trial. Arthritis Rheum 50:1761–69. Nuwaysir EF, Huang W, Albert TJ, Singh J, Nuwaysir K, Pitas A, Richmond T, Gorski T, Berg JP, Ballin J, McCormick M, Norton J, Pollock T, Sumwalt T, Butcher L, Porter D, Molla M, Hall C, Blattner F, Sussman MR, Wallace RL, Cerrina F, Green RD. (2002). Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res 12:1749–55. Ohki I, Shimotake N, Fujita N, Jee J, Ikegami T, Nakao M, Shirakawa M. (2001). Solution structure of the methyl-CpG binding domain of human MBD1 in complex with methylated DNA. Cell 105:487–97. Pasare C, Medzhitov R. (2003). Toll pathway-dependent blockade of CD4+CD25+ T cell-mediated suppression by dendritic cells. Science 299:1033–36. Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP. (1994). Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci U S A 91:5022–26.
Quddus J, Johnson KJ, Gavalchin J, Amento EP, Chrisp CE, Yung RL, Richardson BC. (1993). Treating activated CD4+ T cells with either of two distinct DNA methyltransferase inhibitors, 5-azacytidine or procainamide, is sufficient to cause a lupus-like disease in syngeneic mice. J Clin Invest 92:38–53. Rakyan VK, Chong S, Champ ME, Cuthbert PC, Morgan HD, Luu KV, Whitelaw E. (2003). Transgenerational inheritance of epigenetic states at the murine Axin(Fu) allele occurs after maternal and paternal transmission. Proc Natl Acad Sci U S A 100:2538–43. Rauscher FJ III. (2005). It is time for a Human Epigenome Project. Cancer Res 65:11229. Reiner SL. (2005). Epigenetic control in the immune response. Hum Mol Genet 14(Special issue 1):R41–46. Renaudineau Y, Garaud S, Le Dantec C, Alonso-Ramirez R, Daridon C, Youinou P. (2009). Autoreactive B cells and epigenetics. Clin Rev Allergy Immunol 39:85–94. Richardson B. (1986). Effect of an inhibitor of DNA methylation on T cells. II. 5-Azacytidine induces self-reactivity in antigen-specific T4+ cells. Hum Immunol 17:456–70.
Richardson B. (2003). DNA methylation and autoimmune disease. Clin Immunol 109:72–79. Richardson B, Scheinbart L, Strahler J, Gross L, Hanash S, Johnson M. (1990). Evidence for impaired T cell DNA methylation in systemic lupus erythematosus and rheumatoid arthritis. Arthritis Rheum 33:1665–73. Richardson BC, Strahler JR, Pivirotto TS, Quddus J, Bayliss GE, Gross LA, O’Rourke KS, Powers D, Hanash SM, Johnson MA. (1992). Phenotypic and functional similarities between 5-azacytidine-treated T cells and a T cell subset in patients with active systemic lupus erythematosus. Arthritis Rheum 35:647–62. Sadri R, Hornsby PJ. (1996). Rapid analysis of DNA methylation using new restriction enzyme sites created by bisulfite modification. Nucleic Acids Res 24:5058–59. Samanta A, Li B, Song X, Bembas K, Zhang G, Katsumata M, Saouaf SJ, Wang Q, Hancock WW, Shen Y, Greene MI. (2008). TGF-beta and IL-6 signals modulate chromatin binding and promoter occupancy by acetylated FOXP3. Proc Natl Acad Sci U S A 105:14023–27. Sano H, Imokawa M, Steinberg AD, Morimoto C. (1983). Accumulation of guanine-cytosine-enriched low M.W. DNA fragments in lymphocytes of patients with systemic lupus erythematosus. J Immunol 130:187–90. Sansom OJ, Berger J, Bishop SM, Hendrich B, Bird A, Clarke AR. (2003). Deficiency of MBD2 suppresses intestinal tumorigenesis. Nat Genet 34:145–47. Santangelo S, Cousins DJ, Winkelmann NE, Staynov DZ. (2002). DNA methylation changes at human Th2 cytokine genes coincide with DNase I hypersensitive site formation during CD4(+) T Cell differentiation. J Immunol 169:1893–903. Schwab J, Illges H. (2001). Silencing of CD21 expression in synovial lymphocytes is independent of methylation of the CD21 promoter CpG island. Rheumatol Int 20:133–37. Sekigawa I, Kawasaki M, Ogasawara H, Kaneda K, Kaneko H, Takasaki Y, Ogawa H. (2006). DNA methylation: its contribution to systemic lupus erythematosus. Clin Exp Med 6:99–106.
Sekigawa I, Okada M, Ogasawara H, Kaneko H, Hishikawa T, Hashimoto H. (2003). DNA methylation in systemic lupus erythematosus. Lupus 12:79–85. Shirohzu H, Kubota T, Kumazawa A, Sado T, Chijiwa T, Inagaki K, Suetake I, Tajima S, Wakui K, Miki Y, Hayashi M, Fukushima Y, Sasaki H. (2002). Three novel DNMT3B mutations in Japanese patients with ICF syndrome. Am J Med Genet 112:31–37. Sutherland JE, Costa M. (2003). Epigenetics and the environment. Ann N Y Acad Sci 983:151–60. Takami N, Osawa K, Miura Y, Komai K, Taniguchi M, Shiraishi M, Sato K, Iguchi T, Shiozawa K, Hashiramoto A, Shiozawa S. (2006). Hypermethylated promoter region of DR3, the death receptor 3 gene, in rheumatoid arthritis synovial cells. Arthritis Rheum 54:779–87. Tao Q, Robertson KD. (2003). Stealth technology: how Epstein-Barr virus utilizes DNA methylation to cloak itself from immune detection. Clin Immunol 109: 53–63. Teferedegne B, Green MR, Guo Z, Boss JM. (2006). Mechanism of action of a distal NF-kappaB-dependent enhancer. Mol Cell Biol 26:5759–70.
Teitell M, Richardson B. (2003). DNA methylation in the immune system. Clin Immunol 109:2–5. Thomas TJ, Seibold JR, Adams LE, Hess EV. (1993). Hydralazine induces Z-DNA conformation in a polynucleotide and elicits anti(Z-DNA) antibodies in treated patients. Biochem J 294(Pt 2):419–25. Tollefsbol TO. (2004). Methods of epigenetic analysis. Methods Mol Biol 287:1–8. Toyota M, Ho C, Ahuja N, Jair KW, Li Q, Ohe-Toyota M, Baylin SB, Issa JP. (1999). Identification of differentially methylated sequences in colorectal cancer by methylated CpG island amplification. Cancer Res 59:2307–12. Trouche D, Khochbin S, Dimitrov S. (2003). Chromatin and epigenetics: dynamic organization meets regulated function. Mol Cell 12:281–86. Ushijima T, Nakajima T, Maekita T. (2006). DNA methylation as a marker for the past and future. J Gastroenterol 41:401–07. van der Werf N, Kroese FG, Rozing J, Hillebrands JL. (2007). Viral infections as potential triggers of type 1 diabetes. Diabetes Metab Res Rev 23:169–83. Van Helden PD. (1985). Potential Z-DNA-forming elements in serum DNA from human systemic lupus erythematosus. J Immunol 134:177–79. Vanden Berghe W, Ndlovu MN, Hoya-Arias R, Dijsselbloem N, Gerlo S, Haegeman G. (2006). Keeping up NF-kappaB appearances: epigenetic control of immunity or inflammation-triggered epigenetics. Biochem Pharmacol 72:1114–31. Wade PA. (2001). Methyl CpG-binding proteins and transcriptional repression. Bioessays 23:1131–37. Walter K, Bonifer C, Tagoh H. (2008). Stem cell-specific epigenetic priming and B cell-specific transcriptional activation at the mouse CD19 locus. Blood 112:1673–82. Wang CY, Han J, She JX. (2008). Genetic factors for human type 1 diabetes. In Current Topics in Human Genetics (Studies in Complex Diseases). Vol. 24. Ed. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ 07601, pp. 693–727. Wang CY, Podolsky R, She JX. (2006). Genetic and functional evidence supporting SUMO4 as a type 1 diabetes susceptibility gene.
Ann N Y Acad Sci 1079:257–67. Wang CY, She JX. (2008). SUMO4 and its role in type 1 diabetes pathogenesis. Diabetes Metab Res Rev 24:93–102. Wardle EN. (2009). Systemic lupus erythematosus conundrums. Saudi J Kidney Dis Transpl 20:731–36. Waterland RA, Jirtle RL. (2004). Early nutrition, epigenetic changes at transposons and imprinted genes, and enhanced susceptibility to adult chronic diseases. Nutrition 20:63–68. Welsh J, McClelland M. (1990). Fingerprinting genomes using PCR with arbitrary primers. Nucleic Acids Res 18:7213–18. Wen ZK, Xu W, Xu L, Cao QH, Wang Y, Chu YW, Xiong SD. (2007). DNA hypomethylation is crucial for apoptotic DNA to induce systemic lupus erythematosus-like autoimmune disease in SLE-non-susceptible mice. Rheumatology (Oxford) 46: 1796–803. Williams KT, Garrow TA, Schalinske KL. (2008). Type I diabetes leads to tissue-specific DNA hypomethylation in male rats. J Nutr 138:2064–69. Wilson CB, Rowell E, Sekimata M. (2009). Epigenetic control of T-helper-cell differentiation. Nat Rev Immunol 9:91–105.
Wilson IM, Davies JJ, Weber M, Brown CJ, Alvarez CE, MacAulay C, Schubeler D, Lam WL. (2006). Epigenomics: mapping the methylome. Cell Cycle 5:155–58. Wing K, Sakaguchi S. (2010). Regulatory T cells exert checks and balances on self tolerance and autoimmunity. Nat Immunol 11:7–13. Wu J, Smith LT, Plass C, Huang TH. (2006). ChIP-chip comes of age for genome-wide functional analysis. Cancer Res 66:6899–902. Yuan X, Paez-Cortez J, Schmitt-Knosalla I, D’Addio F, Mfarrej B, Donnarumma M, Habicht A, Clarkson MR, Iacomini J, Glimcher LH, Sayegh MH, Ansari MJ. (2008). A novel role of CD4 Th17 cells in mediating cardiac allograft rejection and vasculopathy. J Exp Med 205:3133–44. Yung R, Powers D, Johnson K, Amento E, Carr D, Laing T, Yang J, Chang S, Hemati N, Richardson B. (1996). Mechanisms of drug-induced lupus. II. T cells overexpressing lymphocyte function-associated antigen 1 become autoreactive and cause a lupus-like disease in syngeneic mice. J Clin Invest 97:2866–71. Yung R, Ray D, Eisenbraun JK, Deng C, Attwood J, Eisenbraun MD, Johnson K, Miller RA, Hanash S, Richardson B. (2001). Unexpected effects of a heterozygous dnmt1 null mutation on age-dependent DNA hypomethylation and autoimmunity. J Gerontol A Biol Sci Med Sci 56:B268–76. Yung RL, Quddus J, Chrisp CE, Johnson KJ, Richardson BC. (1995). Mechanism of drug-induced lupus. I. Cloned Th2 cells modified with DNA methylation inhibitors in vitro cause autoimmunity in vivo. J Immunol 154:3025–35. Yung RL, Richardson BC. (1994). Role of T cell DNA methylation in lupus syndromes. Lupus 3:487–91. Zhao X, Ueba T, Christie BR, Barkho B, McConnell MJ, Nakashima K, Lein ES, Eadie BD, Willhoite AR, Muotri AR, Summers RG, Chun J, Lee KF, Gage FH. (2003). Mice lacking methyl-CpG binding protein 1 have deficits in adult neurogenesis and hippocampal function. Proc Natl Acad Sci U S A 100:6777–82. Zhou L, Chong MM, Littman DR. (2009a). Plasticity of CD4+ T cell lineage differentiation. Immunity 30:646–55.
Zhou X, Bailey-Bucktrout S, Jeker LT, Bluestone JA. (2009b). Plasticity of CD4(+) FoxP3(+) T cells. Curr Opin Immunol 21:281–85. Zorn E, Nelson EA, Mohseni M, Porcheray F, Kim H, Litsa D, Bellucci R, Raderschall E, Canning C, Soiffer RJ, Frank DA, Ritz J. (2006). IL-2 regulates Foxp3 expression in human CD4+CD25+ regulatory T cells through a STAT-dependent mechanism and induces the expansion of these cells in vivo. Blood 108:1571–79.
CHAPTER 4
Cell-Based Analysis with Microfluidic Chip
WANG QI and ZHAO LONG

Contents
4.1 Introduction
4.2 Fabrication of the Microfluidic Chip and Cell Culture
4.2.1 Fabrication of the Microfluidic Chip
4.2.2 Cell Culture and Analysis
4.3 Application of the Cell-Based Microfluidic Chip
4.3.1 Genomic Analysis on Chip
4.3.2 Protein Analysis on Chip
4.3.3 Analysis of Chemotherapy Resistance in Tumor Cells
4.4 Conclusions and Future Prospects
4.5 Acknowledgments
4.6 References

4.1 INTRODUCTION
Micro total analysis systems (μTAS), also called labs on a chip, integrate analytical processes for sequential operations such as sampling, sample pretreatment, analytical separation, chemical reaction, analytical detection, and data analysis in a single microfluidic device. Microfluidic-based research has advanced significantly over the last few years and has become a very active field. Because of their advantages, including low reagent and power consumption, short reaction time, portability for in situ use, low cost, versatility in design, and potential for parallel operation and integration with other miniaturized devices, microfluidic chip-based systems for biological cell studies have attracted significant attention. The microfluidic technique has started to play an increasingly important role in discoveries in cell biology, neurobiology, pharmacology, and tissue engineering. Because cell-based assays are considered essential for the functional characterization and detection of drugs, pathogens, toxicants, and odorants, a number of articles on microfluidics for cell analysis have been published during the past few years. Andersson and Berg et al. (2003) presented work on microfluidics for cellomics, covering microfluidic devices for cell sampling, cell trapping and sorting, cell treatment, and cell analysis. Erickson and Li (2004) focused on integrated microfluidic devices for cell handling and cytometry, dielectrophoretic cellular manipulation and sorting, and general cellular analysis. A large number of reports have appeared on microfluidic-based biological applications such as cell culture, PCR, DNA separation, DNA sequencing, and clinical diagnostics (Auroux et al., 2002; Vilkner et al., 2004). More recently, microfluidics for cell culture, flow cytometers, and other microscale flow-based cell analysis systems were presented (Tai and Shuler, 2003; Huh et al., 2005), including cell detection and enumeration systems and microfluidic fluorescence-activated cell sorting systems (Huh et al., 2005).

This chapter provides an in-depth look at the applications of cell-based analysis with microfluidic devices. We summarize reports on the use of cell-based microfluidic chips to carry out specific functions. The chapter consists mainly of three sections, covering fabrication of the microfluidic devices; cell culture and manipulation; and cell-based analysis, including protein analysis, genomic analysis, and tumor cell chemotherapy analysis. Selected examples of manipulation methods are described in detail to reveal their advantages and disadvantages.
An outlook of this promising and rapidly expanding area is presented in the concluding remarks. Some typical microchips are shown in Figure 4.1.
4.2 FABRICATION OF THE MICROFLUIDIC CHIP AND CELL CULTURE

Fabrication of a miniaturized cell manipulator is the first step toward microfluidics for cell-based assays. The commonly used methods for cell manipulation on chip can be categorized by the manipulating force employed. In this section, the fabrication of a few selected microfluidic devices and their use for cell culture are described.

4.2.1 Fabrication of the Microfluidic Chip

Microfluidic devices can be fabricated from a variety of materials using different techniques. The main materials are silicon, glass, and polydimethylsiloxane (PDMS), each with different characteristics; the main fabrication techniques are lithography, UV laser ablation, and hot embossing.
Figure 4.1. Typical microchips: glass four-channel array electrophoresis chip; glass PCR chip; quartz electrophoresis chip; PDMS-glass drug-screening chip; gradient chip for cell culture and drug screening.
4.2.1.1 Different Materials

Silicon and glass are traditional materials and have been used to create microfluidic devices (Li et al., 1999; McClain et al., 2003). The particularly attractive advantages of glass and quartz include their well-defined surfaces and excellent optical properties, which are highly desirable for fluorescence readout of microarrays. Because silicon and glass also have drawbacks, such as being hard and expensive, polymers are rapidly evolving as alternative substrate materials for many microfluidic applications. Their diverse properties can be selected to suit the particular application, and structures can be microfabricated in a high-production mode and at low cost. Recently, PDMS, a kind of polymer, has become one of the materials extensively used in microfluidic devices because of its biocompatibility, low toxicity, high oxidative and thermal stability, optical transparency, low permeability to water, and low electrical conductivity; furthermore, it can be easily fabricated into microstructures using soft lithography.

4.2.1.2 Fabrication Technique

The fabrication of the prerequisite fluidic networks on polymer substrates can be achieved by lithography, UV laser ablation, hot embossing, injection molding, or direct micromilling techniques (Ford et al., 1999; Friedrich et al., 1997; Grass et al., 2001; Ihlemann and Rubahn, 2000; Martynova et al., 1997; Qi et al., 2002; Robert et al., 1997; Rossier et al., 1999; Soper et al., 2000). Lithography is the earlier technique and has been carried out on glass and silicon substrates. However, the fabrication processes involved are fairly time-consuming, labor-intensive, and expensive. Using UV laser ablation or micromilling, a polymer substrate is exposed to laser radiation or a milling bit, which writes the microstructures directly into the polymer part. In the case of laser ablation, the laser photons, typically from a pulsed excimer laser, are focused directly on the polymer. Fluidic channels are produced either by moving the polymer substrate or by moving a focused laser beam across the surface. This method can form complex features with various geometries, even in three dimensions, because the patterning beam can be moved both horizontally and vertically on the substrate. In either case, fluidic prototypes can be rapidly produced using direct writing techniques to optimize device performance before mass production by microreplication.

Hot embossing and injection molding involve the use of mold masters containing the desired microstructures to stamp patterns into the required substrate. The patterns on the mold masters can be made either by micromachining or by a lithography-based approach, such as LiGA (Lithographie, Galvanoformung, Abformung; lithography, electroplating, molding). The LiGA processing steps, which use x-ray lithography to build the microstructures on the mold master, are shown in Figure 4.2. The important aspect of this technology is that the difficult fabrication steps, lithography and electroplating, are performed only once; parts are then microreplicated from the master mold using either injection molding or hot embossing.

Soft lithography, also named PDMS fabrication technology, was developed by Whitesides and co-workers and has been widely adopted for the fabrication of different microfluidic networks (Anderson et al., 2000a; Duffy et al., 1998, 1999; McDonald and Whitesides, 2002; Sia and Whitesides, 2003). Fabrication of microfluidic structures using PDMS technology involves pouring a solution containing the PDMS prepolymer and a curing agent over a relief (mold) containing the prerequisite microstructures, followed by curing to cross-link the polymer.
This technology first requires creation of positive reliefs using a variety of techniques (Becker and Gartner, 2000; Love et al., 2001), such as wet etching of silicon or glass following photolithography, micromachining of metals, or reactive ion etching. A solution containing the prepolymer and curing agent is then cast against the relief, and the cross-linked polymer conforms to the shape of the relief. After casting, the polymer is simply removed from the relief, resulting in a replica that contains the network of microfluidic channels. Finally, the PDMS replica can be plasma oxidized and sealed to other surfaces by conformal contact to enclose the fluidic network. Plasma oxidation also renders the PDMS channels hydrophilic so that they can easily be filled with aqueous solutions. Besides plasma oxidation, Quake and co-workers (2000) have described a method for sealing fluidic networks created in PDMS that involves two slabs: one slab contains an excess of base, while the other contains the curing agent. The two are then brought into conformal contact, followed by curing.

Figure 4.2. Processing steps required to prepare a mold master using x-ray LiGA and molding finished parts. Panel labels: lithography (synchrotron irradiation, absorber structure, mask membrane, resist, base plate, resist structure); electroplating (metal, resist structure, electrically conductive base plate); molding (mould cavity, plastic structure). The process starts with a lithography step using a mask that transfers the desired pattern into a photoresist. Following development of the resist, the underlying metal layer is exposed, which serves as a plating base for another metal that fills the voids left by the resist that was removed due to exposure to the patterning radiation. The deposited metal forms the desired microstructures. The unexposed resist is then removed, forming the finished master mold. This master mold is then available for microreplicating parts in polymers or other materials using hot embossing or injection molding. At the right are the master mold and parts that have been prepared from it (Situma et al., 2006; Ford et al., 1999).

4.2.2 Cell Culture and Analysis

Microfluidic devices are especially suitable for biological applications, particularly at the cellular level, because the scale of the channels is commensurate with that of the cells (Walker et al., 2004), and the scale of the device allows important factors (e.g., growth factors) to accumulate locally, forming a stable microenvironment for cell culture (Mahoney et al., 2005). Compared to traditional culture tools, microfluidic platforms provide much greater control over cell microenvironments and rapid optimization of media composition using relatively small numbers of cells.

4.2.2.1 Static Cell Culture

In recent years, there have been many reports on cell culture in microdevices; various cells, such as epithelial cells, interstitial cells, cancer cells, and even stem cells, grow well in the devices. Most of the culture processes involve static medium. For instance, our research group developed a single-channel PDMS chip for the culture of the lung cancer cell line H446 to explore the function of glucose-regulated protein 78 (GRP78), an endoplasmic reticulum chaperone, in chemotherapy-resistant lung cancer (Ying-Yan et al., 2008). Also, Wlodkowic et al. (2009) described an inexpensive method for production of reusable, optical-grade PDMS microculture chips that provide a static and self-contained microwell system analogous to conventional polystyrene multiwell plates. Shao and co-workers (2009) developed an integrated microfluidic cell culture platform in which endothelial cells (ECs) are kept under static conditions or exposed to a pulsatile and oscillatory shear stress. Static cell culture remains a key method for studying the biological function of cells. However, static environments are far from representative of cells in vivo and rarely support long-term cell culture. It is therefore essential to establish a microfluidic platform that supplies fresh medium with oxygen and nutrients at a continuous, controlled flow rate, in order to mimic cellular microenvironments and to carry out cell culture and biomedical assays in an environment closer to the in vivo situation.
4.2.2.2 Dynamic Cell Culture

To achieve long-term cell culture at higher density and in larger numbers in a microenvironment closer to in vivo conditions, continuous nutrient and oxygen supply and waste removal through the culture medium have to be ensured. Takayama and co-workers (2004) described the use of horizontally oriented mini-reservoir arrays as a gravity-driven pumping system to generate multiple fluid streams inside microfluidic cell culture channels at a constant flow rate for prolonged periods. Maharbiz et al. (2003) presented a microfabricated electrolytic oxygen generator for high-density miniature cell culture arrays. Long-term (>2 weeks) culture of muscle cells spanning the whole process of differentiation from myoblasts to myotubes was reported by Folch and co-workers. Prokop et al. (2004) developed nanoliter bioreactors (NBRs) for long-term culture and maintenance of up to several hundred cultured mammalian cells in volumes three orders of magnitude smaller than those in standard multiwell screening plates. Refreshable Braille display–based microfluidic bioreactors, which are more densely packed and not limited to linear and unidirectional perfusion, were developed for cell culture up to 3 weeks under perfusion (Gu et al., 2004). Yasuda and co-workers developed a type of on-chip microcultivation chamber, in which the valve opening/closing could be measured directly by optical microscope, for long-term cultivation of swimming cells. Fujii and co-workers achieved large-scale cell culture (up to 10⁷ cells/cm³) in a microfluidic device with a multilayer bioreactor containing an oxygen supply system. Lee and co-workers presented a microfluidic cell culture array that could run 100 different cell-based experiments in parallel for long-term cellular monitoring. Repeated cell growth/passage cycles, reagent introduction, and real-time optical analysis could all be achieved in this microdevice. Recently, our research group developed an integrated microfluidic system consisting of an MS26 syringe pump and a single-channel PDMS chip, which was used to culture a lung cancer cell line (SPCA1) for at least 5 days (Zhao et al., 2010) (Fig. 4.3).

Figure 4.3. Long-term cell culture of SPCA cells in the microfluidic system. An LM photo for each time period (from 1 day to 4 days) indicated that SPCA cells grew well and propagated gradually. Magnification is ×100 (Zhao et al., 2010). Labels in the figure: 3 cm, 5 cm; 24 h, 48 h, 72 h, 96 h.

4.2.2.3 3D Cell Culture

To improve the efficiency of cell culture and bring it closer to the three-dimensional arrangement of cells in vivo, several approaches have been presented, such as fabricating three-dimensional (3D) microstructures (Kojima et al., 2003; Moriguchi et al., 2002; Tan and Desai, 2004) and trying other biocompatible materials (Leclerc et al., 2004). Yasuda and co-workers developed a method that
made use of noncontact 3D photothermal etching with a 1480-nm or 1064-nm infrared focused laser beam to form agar microstructures for cultivating cells. Desai and co-workers fabricated a 3D heterogeneous multilayer tissue-like structure inside microchannels for cell culture. Single and multiple stepwise microstructures were fabricated in the photosensitive biodegradable polymer poly-(caprolactone (CL)-dl-lactide (LA)) tetraacrylate for static cell culture of Hep G2 cells and fetal human hepatocyte (FHH) cells (Leclerc et al., 2004).

Other approaches for microfluidic cell culture were also studied. Cell cultivation was performed in a highly parallelized manner in fluid segments that were formed as droplets at a channel junction. Organic and cell-containing aqueous phases were merged at channel junctions (Martin et al., 2003). A multichannel microelectrode array that could influence and record electrical cellular activity was integrated into the microfluidics for cell culture (Pearce et al., 2005). A gradient-generating microfluidic platform for optimizing proliferation and differentiation of neural stem cells in culture was described by Jeon and co-workers.

4.2.2.4 Single-Cell Analysis Systems

Our understanding of many biological processes would greatly benefit from the ability to analyze the content of single cells. Today, only a few conventional systems enable direct intrinsic studies of single cells, including capillary electrophoresis (CE) and flow cytometry. These systems, however, are based on conventional technologies and instrumentation; they give only limited information about the cell content and do not provide a general method for single-cell analysis. Recent rapid developments in microfabrication and nanofabrication technologies have already led to the successful so-called laboratory on a chip (LOC) concept, and these developments offer great opportunities for the analysis of single cells. Takayama et al. (2001) reported that multiple laminar streams, which constitute one of the most important characteristics of a microfluidic channel, make it possible to partially stimulate selected domains inside cells. These investigators used fluorescent tags to detect subcellular positioning of small molecules. Microchips with complicated microstructures can also be used to screen for gap-junction formation between two adjacent cardiomyocytes.

Tanaka and co-workers (2002) developed a novel single-cell analysis system consisting of a scanning thermal lens microscope (TLM) detection system and a cell culture microchip (Figure 4.4a). TLM is a sensitive technique for detecting nonfluorescent molecules in a microspace. The detection process is based on absorption of visible or UV light followed by a photothermal process (Kitamori et al., 2004). Generally, cells are labeled with fluorescent probes to detect subcellular positioning of small molecules, and it is difficult to detect subcellular molecules directly without fluorescent tags by this approach. The TLM system, on the other hand, could detect nonfluorescent biological substances with extremely high sensitivity without any labeling materials, and it had a high spatial resolution of ∼1 μm. The microchip system was good for liquid control and simplified troublesome procedures. This system was applied to monitoring cytochrome-c distribution in a neuroblastoma-glioma hybrid cell cultured in the microflask (Fig. 4.4b). It also seems to be applicable to monitoring compounds released in cells when used in combination with analytical microchips.

Figure 4.4. (See color insert.) A single-cell analysis system in a glass microchip using a thermal lens microscope (TLM). (a) Cell culture chip design and TLM scanning method. A microflask (1 mm × 10 mm × 0.1 mm) was fabricated in a glass microchip, and a cell suspension was introduced into it. After cultivation, the microchip with capillaries connected to syringe pumps was mounted on the TLM stage, and TLM signals were measured while scanning the stage to obtain a 2D image. (b) Direct imaging of cytochrome-c in a cell and its distribution change during apoptosis (Tamaki et al., 2002). Labels in panel (a): staurosporine, microsyringe pump, TLM detection (532 nm), drain, X-Y scanning stage, cell culturing flask; scale bars, 30 µm. Labels in panel (b): signal scale from 0–1 to 9–10; + staurosporine.

4.3 APPLICATION OF THE CELL-BASED MICROFLUIDIC CHIP
Microfluidic chips are used for transporting and manipulating minute amounts of fluids and/or biological entities through microchannel manifolds, allowing integration of various chemical and biochemical processes into fast and automated monolithic microflow systems. It has become evident that there is a tremendous market potential for microdevices aiding in diagnostics, drug discovery, and evaluation of new pharmaceuticals, since these devices are expected to satisfy the urgent demand for high-throughput and large-scale applications. In this section, applications of microfluidic devices to genomic and proteomic analysis and to chemotherapy resistance in cancer are described.

4.3.1 Genomic Analysis on Chip
Separation of nucleic acids is one of the leading applications of microchipbased analysis today. Due to the negligible deteriorating effects of Joule heating during electric field–mediated separations in microchannels and the ability to inject very small, well-defined sample plugs, the resolving power of microfluidic bioanalytical devices are mainly diffusion limited, resulting in superior performance compared to slab gel and capillary electrophoresis. Rapid, high-quality separations on chip have been demonstrated for the analysis of oligonucleotides and RNA and DNA fragments. 4.3.1.1 Sample Preparation and Separation The shorter separation distances, compared to those used in conventional CE, represent new challenges to the optimization of separation conditions, such as electrokinetic manipulations, channel geometry, and sieving media. The geometrical effects of folded microchannel structures on band broadening have been extensively studied and developed for CE instruments, based on viscous solutions of entangled water-soluble polymers, and successfully applied to microchip electrophoresis. Linear polyacryamide and its derivatives—polydimethy-acryamide, polyethylene oxide, polyethylene glycol with fluorocarbon tails, hydroxyethylcellulose and various cellulose derivatives, and other polysaccharides—have been used for size separation of nucleic acid molecules in capillaries, and some of these matrices have also been adapted for microchip electrophoresis. Recently introduced novel thermoresponsive co-polymers, making up hydrophobic and hydrophilic blocks have also shown promising results in terms of efficiency and the possibility of theoretical modeling of acrylamide grafted with polyethylene oxide chains (Liang et al., 1999). These matrices have a pronounced temperature-dependent viscosity transition point, which suggests promising implementations. 
In particular, thermoresponsive polymers can offer some practical advantages for microchannel electrophoresis, enabling easier handling and loading of the viscous polymer solutions without the requirement of a high-pressure manifold. Buchholz et al. (2001) have constructed interesting “viscosity switch” materials, which respond to changes of temperature, pH, or ionic strength. These matrices are based on co-polymers of acrylamide derivatives with variable hydrophobicity and possess a reversible temperature-controlled viscosity
switch from high-viscosity solutions at room temperature to low-viscosity colloid dispersions at elevated temperatures. High resolving power and good DNA sequencing performance were also achieved with these sieving media.
4.3.1.2 DNA Analysis and Genotyping
Major applications of electrophoresis microchips include sizing of double-stranded DNA fragments (Effenhauser et al., 1997; Woolley and Mathies, 1994; McCormick et al., 1997; Duffy et al., 1999; Ronai et al., 2001), short single-stranded oligonucleotides (Effenhauser et al., 1994), and ribosomal RNA fragments (Ogura et al., 1998). High separation performance and speed have been achieved. DNA genotyping on microchips enables quick identification of genes and can substantially enhance the capabilities of genomic, diagnostic, pharmacogenetic, and forensic tests. The identification of genes related to hereditary diseases, such as hemochromatosis (Woolley et al., 1997), has been successfully performed on chip. An ultra-fast allelic profiling assay for the analysis of short tandem repeats (STRs) has also been demonstrated (Schmalzing et al., 1999). Separation of a CTTv quadruplex system was accomplished with 98% accuracy.
The HuSNP array is manufactured using technology that
CANDIDATE SCREENING THROUGH HIGH-DENSITY SNP ARRAY
[Figure 10.1 content: probes are tiled at offsets –4, –1, 0, +1, and +4 relative to the SNP; each offset carries four probes (MMA, PMA, PMB, MMB). Hybridization patterns are shown for genotypes AA, BB, and AB.]
Target sequence (250-2000 bp): ...CAGACAGAGTCTTG[A/C]AATCTATTTCTCATA...
Probe sequences (25 bp):
PMA: TGTCTTCAGAACTTTAGATAAAGAG
MMA: TGTCTTCAGAACATTAGATAAAGAG
PMB: TGTCTTCAGAACGTTAGATAAAGAG
MMB: TGTCTTCAGAACCTTAGATAAAGAG
Figure 10.1. (See color insert.) Probe array tiling and hybridization patterns (from Affymetrix).
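The relation between the perfect-match (PM) and mismatch (MM) probes listed for Figure 10.1 can be illustrated with a short sketch. It assumes, as the four listed sequences suggest, that each MM probe differs from its PM partner only at the central (13th) base, which is replaced by its complement. The function name `mismatch_probe` is ours and is not part of any Affymetrix software.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(perfect_match: str) -> str:
    """Return the mismatch (MM) probe for a perfect-match (PM) probe.

    Assumption (inferred from the sequences in Figure 10.1): the MM probe
    is identical to the PM probe except at the central base, which is
    replaced by its complement.
    """
    if len(perfect_match) % 2 == 0:
        raise ValueError("probe length must be odd to have a central base")
    mid = len(perfect_match) // 2  # index 12 for a 25-mer
    return (
        perfect_match[:mid]
        + COMPLEMENT[perfect_match[mid]]
        + perfect_match[mid + 1:]
    )


# PM probes for allele A and allele B, copied from Figure 10.1.
pm_a = "TGTCTTCAGAACTTTAGATAAAGAG"
pm_b = "TGTCTTCAGAACGTTAGATAAAGAG"

print("MMA:", mismatch_probe(pm_a))
print("MMB:", mismatch_probe(pm_b))
```

Applied to the PMA and PMB sequences of the figure, the sketch reproduces the MMA and MMB sequences listed there.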
TABLE 10.2. Timeline of the Clinical Applications of Different Affymetrix Microarrays
Year (timeline axis): 2002, 2003, 2004, 2005, 2006, 2007, 2008–Present
Array type                                  Number of SNP markers    Number of CNV markers
GeneChip HuSNP Mapping Assay                1,494                    0
GeneChip Human Mapping 10K Array Xba 131    11,500                   0
Genome-Wide Human SNP Array 6.0             906,600                  945,826
PLATFORMS AND PROTOCOLS OF SNP MICROARRAY
combines photolithographic methods and combinatorial chemistry. Tens to hundreds of thousands of different oligonucleotide probes are synthesized in a 0.81- by 0.81-cm area on a glass substrate of each array. The fully mapped markers are evenly distributed across the human genome with a median marker gap size of 1.2 cM. We were able to identify the gene xeroderma pigmentosum, complementation group C (XPC), as the disease-causing gene by detecting LOH in a xeroderma pigmentosum (XP) patient with consanguineous parents (Lam et al., 2005). We mapped the chloride channel 7 gene (CLCN7) for malignant osteopetrosis or autosomal recessive osteopetrosis (ARO) in a consanguineous family (Lam et al., 2007). Using HuSNP probe arrays also allowed us to detect the allelic imbalance with patterns of LOH in the paraffin-embedded tissues of renal cell carcinoma (RCC) cells (Lam et al., 2006). We showed that a high-density SNP array can detect previously described and new LOH sites in cancer genomic studies. According to the manufacturer’s protocol, starting with 120 ng of genomic DNA, a set of 24 simultaneously run multiplex PCRs will amplify the human SNPs represented in the GeneChip HuSNP genetic mapping assay. The amplified SNPs are further amplified and concomitantly labeled using biotinylated primers in a second set of 24 simultaneously run labeling PCRs. The biotinylated PCR products are then pooled, concentrated, and prepared for hybridization. The biotinylated amplification products, which reflect the biallelic genotype in the sample DNA, are hybridized to the GeneChip HuSNP probe arrays during an overnight incubation at 44°C in the GeneChip hybridization oven. On the following day, the probe arrays are thoroughly washed and stained with streptavidin and antistreptavidin antibody. 
The automated wash and stain procedures are run on the GeneChip Fluidics Station 400 (Affymetrix), under the control of Affymetrix Microarray Suite software running on a workstation with a Windows NT operating system. The stained probe arrays are scanned twice to capture the light emitted at wavelengths of 530 nm and 570 nm, generating two scan image files. Affymetrix Microarray Suite processes the two scan images to calculate all of the signal intensities on the probe array.
10.2.2 The Second Generation of High-Density SNP Array Platform and Protocol
Further technical advancement of chip development and high-resolution scanning allows efficient genotyping of >10,000 SNPs in one array. The Affymetrix GeneChip Human Mapping 10K Array Xba 131 contains approximately 11,500 SNP markers with higher genomic coverage (median intermarker distance of 105 kb). Each array has 18- by 18-μm features, each consisting of more than 1 million copies of a 25-bp oligonucleotide probe of defined sequence, synthesized in parallel by photolithographic manufacturing. This platform gives a significant increase in genetic power and ensures more informative markers for molecular investigation in consanguineous families.
We repeated the molecular investigations on the same samples of ARO and XP cases using this improved platform. The results are consistent with the diagnosis made from HuSNP probe arrays. Then we standardized the use of the 10K mapping array for DNA-based diagnosis of genetic diseases due to consanguinity in our laboratory. We successfully identified homozygous mutations in sulfite oxidase deficiency and hypophosphatasia after mapping the homozygous regions in the consanguineous families (unpublished data). We also detected a small homozygous region around the disease-causing gene, solute carrier family 25 (carnitine/acylcarnitine translocase), member 20 (SLC25A20) in a sudden neonatal death case of nonconsanguineous marriage and confirmed carnitine-acylcarnitine translocase deficiency by further sequencing the SLC25A20 gene (Lam et al., 2003). Total genomic DNA of 250 ng is digested with 10 U of XbaI restriction enzyme and ligated to adapters that recognize the cohesive overhangs. Each SNP that lies within the XbaI fragments is amplified by a generic primer that recognizes the adapter sequence. PCR conditions have been optimized to preferentially amplify fragments in the 250 to 1000 bp range. The amplified DNA is purified over MinElute 96 UF PCR purification plates by vacuum pumping. The PCR amplicons are then fragmented and biotin labeled. Hybridization is carried out at 48°C overnight in a rotisserie rotating at 60 rpm. The array runs on the standard GeneChip instrument system, including fluidics station 450 and GeneChip Scanner 3000, for automated washing, staining, and scanning. The image will be processed to get hybridization signal intensity values using GCOS software (Affymetrix) while the genotype-calling is performed by GDAS analysis software (Affymetrix). The LOH and copy number (in Log2 ratio) are calculated through CNAT analysis embedded in GTYPE software (Affymetrix). 
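The copy number reported as a Log2 ratio can be made concrete with a small sketch. It shows only the definition of a log2 ratio between a sample signal and a reference signal; it is not the CNAT algorithm, whose internals are not described here. The function names, and the assumption that the reference represents two copies, are ours.

```python
import math


def log2_ratio(sample_signal: float, reference_signal: float) -> float:
    """Log2 ratio of a sample signal to a reference (normal) signal."""
    if sample_signal <= 0 or reference_signal <= 0:
        raise ValueError("signals must be positive")
    return math.log2(sample_signal / reference_signal)


def idealized_copy_number(ratio: float, reference_copies: int = 2) -> float:
    """Copy number implied by a log2 ratio.

    Assumption: signal is strictly proportional to copy number and the
    reference carries `reference_copies` copies (2 for an autosome).
    Real array data are noisier and compressed relative to this ideal.
    """
    return reference_copies * 2 ** ratio


# A heterozygous deletion halves the signal; a single-copy gain gives 3/2.
print(log2_ratio(1.0, 2.0))
print(log2_ratio(3.0, 2.0))
print(idealized_copy_number(-1.0))
```

Under this idealization a one-copy loss appears as a log2 ratio of -1 and a one-copy gain as roughly +0.58, which is why losses are usually easier to see than gains.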
10.2.3 A Much Advanced Platform and Protocol of an Ultra-High-Density SNP Array
Affymetrix Genome-Wide Human SNP Array 6.0 is an advanced microarray platform containing 906,600 SNP probes and 945,826 copy number probes on a single array for studying LOH and CNV simultaneously (Table 10.3). The median intermarker distance taken over all SNP and CNV markers is
[Table 10.3: title, column headings, and row labels not recoverable. Surviving data rows:]
1.8 million     ∼906 K     ∼946 K    680 bases     500 ng
∼1.0 million    ∼1050 K    ∼22 K     1700 bases    750 ng
* Data from specification sheets on company websites.
[Figure 10.2 plot: y-axis, number of markers per Mb (0 to 800); x-axis, chromosomes chr1 to chrX; series #SNPs/Mb, #CNs/Mb, and #SNP+CN.]
Figure 10.2. Affymetrix SNP Array 6.0: SNP and CNV markers across multiple chromosomes (from Affymetrix).
genome allows detection of the smallest structural changes and regions of autozygosity. We performed genomewide scanning in a case of ring chromosome and in a consanguineous family with three siblings affected by limb-girdle muscular dystrophy (LGMD) using the ultra-high-density SNP array (Lau et al., 2009). The workflow of the GeneChip Mapping Assay has some similarity to that of the previous generations of Affymetrix microarrays. Instead of the fixed batch size of 48 or 96 samples described in the standard Affymetrix protocol, every two samples were processed simultaneously in 0.2-mL centrifuge tubes following the manufacturer’s instructions, with some modifications specialized for a random-access approach in personalized genomic medicine (Lau et al., 2009).
1. Prepare two sets of DNA samples for each individual assay; use 250 ng of genomic DNA by adding 5 μL of sample at a concentration of 50 ng/μL.
2. Digest the genomic DNA with 10 U Sty I and 10 U Nsp I, respectively, at 37°C for 2 h. Inactivate the enzymes at 65°C for 20 min.
3. Ligate 50 μM restriction enzyme–specific Adaptor (Affymetrix) to the digested DNA with 800 U T4 DNA Ligase at 16°C for 3 h and inactivate the enzymes at 70°C for 20 min.
4. Dilute the ligated DNA fourfold and take 10 μL for 30 cycles of PCR amplification using 2 μL TITANIUM Taq DNA polymerase. Use the correct PCR program for the model of thermocycler (for details, see the Affymetrix manual). Perform a set of 3 simultaneously run multiplex PCRs for Sty I-digested DNA and a set of 4 simultaneously run multiplex PCRs for Nsp I-digested DNA.
5. Pool 700 μL PCR products from the two sets (total 7 simultaneously run multiplex PCRs) in a 2.0-mL round-bottom tube before purification.
6. Mix 1 mL magnetic beads with the pooled PCR products by pipetting up and down 5 times generously. Leave the DNA-bead mixture for binding at room temperature (RT) for 10 min. PCR products ≥100 bp bind to the beads based on the solid-phase reversible immobilization (SPRI) technology. The beads are paramagnetic microparticles of small bead size (1.0 μm ± 8%), with a carboxylate-modified polymer coating that gives them a high nucleic acid binding capacity.
7. To separate the magnetic beads, place the 2.0-mL tube on the magnetic stand for 10 min until the solution becomes clear before proceeding to the next step.*
8. Leaving the tube on the stand, pipette off the supernatant without disturbing the bead pellet on the tube wall.
9. Wash the beads with 1.8 mL 75% ethanol, vortex at 75% power for 2.5 min, and incubate at RT for 7.5 min. Repeat steps 7 and 8.
10. Allow the beads to air dry on the stand for 15 min.
11. To elute the DNA from the beads, add 55 μL Buffer EB. Vortex the tube for 2.5 min, and incubate at RT for 7.5 min. Repeat steps 7 and 8.
12. Collect the purified DNA, and concentrate to 47 μL by speed-vac concentrator. Apply 2 μL purified DNA for quantitation using a spectrophotometer (NanoDrop ND-1000, NanoDrop Technologies).
A high concentration of 4–6 μg/μL purified DNA would be sufficient for subsequent assay procedures.
13. Fragment the remaining 45 μL of purified DNA at 37°C for 35 min using 2.5 U DNase I-containing Fragmentation reagent (Affymetrix).
* The PCR purification steps in the standard Affymetrix protocol using vacuum pumping are sample size dependent, with a fixed batch size of 48 or 96 samples per run, which is designed specifically for high-throughput genomewide association studies. For personalized genomic medicine, we use a magnetic stand device (six-tube holder) as a magnetic particle concentrator instead of a vacuum pump (Fig. 10.3). The 40% iron content gives the beads a very quick magnetic response time, so that they are separated rapidly and completely from suspension on application of the magnetic force. This modification changes the genotyping to a random access assay, which is practical for clinical application.
Standard PCR protocol (48 or 96 per batch)
1. Pool 700 µL PCR into deep well plate
2. Add 1 mL magnetic beads
3. Pipetting up & down 5×; incubate 10 min @ RT
4. Transfer PCR + beads to filter plate
5. Apply vacuum until all wells are dry (60–90 min)
6. Add 1.8 mL 75% ethanol wash
7. Apply vacuum until all wells are dry (10–20 min)
8. Dry beads for further 10 min under vacuum
9. Tap off excess ethanol & attach catch plate
10. Add 55 µL elution buffer
11. Incubate on vortexer for 10 min
12. Apply vacuum until all wells are dry (15–30 min)
13. Centrifuge for 5 min at 1400 RCF @ RT
14. Remove catch plate with eluate (~50 µL)

Modified PCR protocol (no batch size limitation)
1. Pool 700 µL PCR into 2-mL microcentrifuge tube
2. Add 1 mL magnetic beads
3. Pipetting up & down 5×; incubate 10 min @ RT
4. Place on magnetic stand for 10 min
5. Pipette out the supernatant
6. Add 1.8 mL 75% ethanol wash
7. Vortex at 75% power for 2.5 min; incubate 7.5 min
8. Place on magnetic stand for 10 min
9. Pipette out the supernatant; air-dry for 15 min
10. Add 55 µL elution buffer
11. Vortex at 75% power for 2.5 min; incubate 7.5 min
12. Place on magnetic stand for 10 min
13. Collect the eluate (~55 µL)
Figure 10.3. (See color insert.) Comparing PCR purification workflow between the Affymetrix standard and our modified protocols.
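To put the two purification workflows of Figure 10.3 side by side, the sketch below sums only the steps for which the figure gives a duration. Hands-on steps without a stated time (pooling, pipetting, plate transfers) are left out, so the totals are lower bounds under that assumption rather than measured run times. The list and function names are ours.

```python
from typing import List, Tuple

# (description, minimum minutes, maximum minutes), taken from Figure 10.3.
Step = Tuple[str, float, float]

STANDARD: List[Step] = [
    ("incubate beads with PCR products", 10, 10),
    ("vacuum until wells are dry", 60, 90),
    ("vacuum after ethanol wash", 10, 20),
    ("dry beads under vacuum", 10, 10),
    ("incubate on vortexer", 10, 10),
    ("vacuum after adding elution buffer", 15, 30),
    ("centrifuge", 5, 5),
]

MODIFIED: List[Step] = [
    ("incubate beads with PCR products", 10, 10),
    ("magnetic stand", 10, 10),
    ("vortex after ethanol wash", 2.5, 2.5),
    ("incubate after ethanol wash", 7.5, 7.5),
    ("magnetic stand", 10, 10),
    ("air-dry", 15, 15),
    ("vortex after adding elution buffer", 2.5, 2.5),
    ("incubate after adding elution buffer", 7.5, 7.5),
    ("magnetic stand", 10, 10),
]


def total_minutes(steps: List[Step]) -> Tuple[float, float]:
    """Return the (minimum, maximum) total duration of the timed steps."""
    return (sum(s[1] for s in steps), sum(s[2] for s in steps))


print("standard:", total_minutes(STANDARD))
print("modified:", total_minutes(MODIFIED))
```

Counting only these timed steps, the standard workflow takes 120 to 175 min and the modified workflow 75 min, independent of how many samples are processed.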
Inactivate the enzymes at 95°C for 15 min. Remove 1.5 μL fragmented DNA for running on a 4% QC gel.
14. Biotin-label the fragmented DNA with 100 U TdT enzyme and 30 mM DNA labeling reagent at 37°C for 4 h. Inactivate the enzymes at 95°C for 15 min.
15. Combine all the components of the hybridization cocktail, and mix with the labeled DNA. Hybridize to a single SNP Array 6.0 chip in a hybridization oven at 50°C overnight (about 16 h) with 60 rpm rotation on the rotisserie.
16. Proceed to the automated washing and staining in a new model of fluidics station 450.
17. Scan the chip in an advanced scanner, the GeneChip Scanner 3000 7G. The 7G refers to the seventh generation of the GeneChip technology platform, with higher resolution scanning at pixelations from 2.5 μm down to 0.51 μm; this spot size is 50% smaller than that of the previous scanner. Scanning a typical 49-format array at 2.5-μm pixelation takes only 5 min.
The SNP Array 6.0 platform has several checkpoints before the GeneChip hybridization to exclude experimental errors: intact genomic DNA, PCR amplicon size, and DNase I-digested fragment size are checked by electropherograms. The quantities of starting DNA and purified PCR products are measured using a spectrophotometer. Accurate genotype calls of each sample are determined by the Birdseed version 2 genotype calling algorithm embedded in the software Affymetrix Genotyping Console 3.0 (Nishida et al., 2008). The platform includes quality control (QC) probes for 3022 SNPs to assess the overall quality of a sample based on the dynamic model (DM) algorithm.
10.3 HOW A HIGH-DENSITY SNP ARRAY CAN BE USED TO LOCALIZE A POSSIBLE DISEASE LOCI
The polymorphic nucleotide allele difference of SNPs provides confirmation of genomic imbalance by identifying regions of LOH associated with deletions, allele-specific dosage gain associated with duplications, and long contiguous stretches of homozygosity (LCSH) associated with UPD and consanguinity. Therefore, SNP genotyping aids in the identification of candidate genes in both complex and Mendelian disorders for clinical practice.
10.3.1 LOH in Cancer
Most human cancers are characterized by chromosomal aberrations in which allelic imbalances can be identified by LOH. These LOH events sometimes affect known genes, and mutations may suggest regions of novel somatic events contributing to tumorigenesis by activating potential oncogenes or unmasking mutated tumor suppressor genes. The LOH regions are specific for tumor type or subtype (Tuna et al., 2009). Copy-neutral LOH represents one example of a cancer genomic abnormality and is important in cancer clonal evolution. In the HuSNP assay, allelic imbalance usually indicates true loss of heterozygosity, whereas amplifications are rarely detected by the HuSNP array. The software calculates the difference in RAS values between two samples and reports these differences as delta RAS values. We compare the quantitative representation of alleles in samples obtained from normal tissue with that in samples obtained from tumor tissue to determine the location and extent of chromosomal loss in tumor cells. Significant shifts (P < 0.05) in delta RAS between tumor and germline DNA indicate the presence of LOH. To allow for possible genotyping errors in the analysis, we declare a chromosomal region as having LOH only when there are more than two SNPs in the LOH region.
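The decision rule described above (a significant shift in delta RAS, and more than two SNPs per region) can be sketched as follows. This is a minimal illustration and not the Affymetrix software: it assumes that the input is restricted to SNPs that are heterozygous in the normal sample, that delta RAS is standardized by a single standard deviation observed in heterozygotes, and that P < 0.05 corresponds to a two-sided normal cutoff of 1.96. All names and the example values are hypothetical.

```python
from typing import List, Tuple

Z_THRESHOLD = 1.96  # assumed two-sided normal cutoff for P < 0.05


def standardized_delta_ras(
    normal: List[float], tumor: List[float], sd_het: float
) -> List[float]:
    """Absolute tumor-minus-normal RAS shift in units of the heterozygote SD."""
    return [abs(t - n) / sd_het for n, t in zip(normal, tumor)]


def loh_regions(
    z_scores: List[float], threshold: float = Z_THRESHOLD, min_snps: int = 3
) -> List[Tuple[int, int]]:
    """Return (first, last) SNP indices of runs of consecutive significant SNPs.

    A run is reported only if it holds more than two SNPs (min_snps = 3),
    which guards against isolated genotyping errors.
    """
    regions: List[Tuple[int, int]] = []
    start = None
    for i, z in enumerate(z_scores):
        if z > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_snps:
                regions.append((start, i - 1))
            start = None
    if start is not None and len(z_scores) - start >= min_snps:
        regions.append((start, len(z_scores) - 1))
    return regions


# Hypothetical RAS values for eight informative SNPs, ordered by map position.
normal_ras = [0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50]
tumor_ras = [0.52, 0.90, 0.95, 0.10, 0.50, 0.92, 0.49, 0.51]
z = standardized_delta_ras(normal_ras, tumor_ras, sd_het=0.1)
print(loh_regions(z))
```

In this made-up example, SNPs 1 to 3 form one LOH region, while the isolated shift at SNP 5 is ignored because it is not supported by neighboring markers.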
[Figure 10.4 plot: y-axis, Std Units (0 to 16); x-axis, markers ordered by chromosome (1 to 22).]
Figure 10.4. LOH regions in RCC detected by HuSNP probe array (Lam et al., 2006). On the x-axis, markers are arranged by chromosome number and mapped positions. On the y-axis, Std Units are the normalized delta RAS, using the observed RAS standard deviations in heterozygotes.
Renal-cell carcinoma is the most common malignancy in the kidney. Figure 10.4 shows the whole-genome view of LOH regions in one of our RCC samples (Lam et al., 2006). We identified a common 14q LOH that has been shown to be significantly associated with tumor aggressiveness and disease-specific mortality (hazard ratio 1.22; 95% CI = 1.02–1.45; p = 0.039). The deletion of 3p, as a simple deletion or by translocation, strongly suggests loss of function of the tumor suppressor gene von Hippel-Lindau (VHL). Mutations of VHL are found in about 60% of those RCC that exhibit chromosome 3p loss (Gnarra et al., 1994).
10.3.2 Copy-Neutral LOH in Genetic Diseases due to Consanguinity
Because the parents are consanguineous, the two copies of the disease-causing locus (one from each parent) should be located in an autozygous chromosomal region that is IBD. Consequently, the disease-causing locus of such a family should fall in a chromosomal region marked by homozygous SNPs and can be identified by LOH. As these regions segregate and are cut by additional generations of recombination, they become fewer and smaller, in proportion to the degree of inbreeding. After whole-genome scanning, we examine the homozygosity of SNPs flanking all the possible disease-causing loci. Then, we rank the quality of all the LOH regions for prioritization of mutational analysis according to three criteria: (1) the size of the homozygous chromosomal region (normalized to the size of the respective chromosome), (2) the number of SNPs in the homozygous chromosomal region, and (3) the number of SNPs on the centromeric and telomeric sides of the disease-causing locus in the homozygous chromosomal region. For each criterion, the best region is scored as 3 and the worst as 1. A high-quality
homozygous chromosomal region should have the largest size, the highest number of SNPs, and an equal number of SNPs on both sides of the possible causal locus. The possible disease-causing locus within the LOH region with the highest total score is selected for mutation detection by direct sequencing. Consanguinity of parents is common in patients with rare autosomal recessive diseases and, for example, has been reported in about 30% of XP cases (Kraemer et al., 1987). We successfully applied 10K mapping assays for homozygosity mapping in different autosomal recessive diseases (Fig. 10.5). We then identified the homozygous mutations by direct sequencing of the candidate genes. Table 10.4 summarizes some of the homozygous mutations that are associated with hypophosphatasia (Patient A), Wilson disease (Patient B) (Mak et al., 2008), sulfite oxidase deficiency (Patients C, D, and E) (Lam et al., 2002), Pompe disease (Patient H), and molybdenum cofactor deficiency (Patient I) in our clinical samples. This approach is useful not only for prenatal diagnosis in the next pregnancy in the same family but also for identifying novel mutations or additional disease-causing genes (Lam et al., 2002; Chiang et al., 2006). We found the novel homozygous mutation p.M219V of the ALPL (alkaline phosphatase) gene for hypophosphatasia, but found no mutation in any of the coding exons of the collagen type 1 alpha 2 gene (COL1A2), in an abortus (Patient A) suspected of osteogenesis imperfecta. Using the same approach in a patient with sulfite oxidase deficiency (Patient E), the novel mutation p.D512Y of the SUOX (sulfite oxidase) gene was identified. Molybdenum cofactor deficiency results in pleiotropic loss of the activity of all molybdoenzymes and displays the symptoms of a combined deficiency of sulfite oxidase and xanthine dehydrogenase (XDH) (Johnson et al., 1989).
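The three-criterion ranking of homozygous regions described earlier in this section can be sketched as follows. Several details are our assumptions rather than part of the published scheme: regions that are neither best nor worst on a criterion receive a score of 2, the balance of flanking SNPs is measured as the ratio of the smaller to the larger flank count, and ties are not treated specially. The candidate names and numbers are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """A homozygous region around one candidate locus (illustrative fields)."""
    locus: str
    size_mb: float          # size of the homozygous region
    chrom_size_mb: float    # size of the chromosome carrying it
    snps_centromeric: int   # homozygous SNPs on the centromeric side
    snps_telomeric: int     # homozygous SNPs on the telomeric side

    @property
    def normalized_size(self) -> float:
        return self.size_mb / self.chrom_size_mb

    @property
    def n_snps(self) -> int:
        return self.snps_centromeric + self.snps_telomeric

    @property
    def balance(self) -> float:
        # 1.0 means equal SNP counts on both sides (assumed measure).
        larger = max(self.snps_centromeric, self.snps_telomeric)
        smaller = min(self.snps_centromeric, self.snps_telomeric)
        return smaller / larger if larger else 0.0


def criterion_scores(values: List[float]) -> List[int]:
    """Score the best value 3, the worst 1, and anything between 2."""
    best, worst = max(values), min(values)
    return [3 if v == best else 1 if v == worst else 2 for v in values]


def total_scores(regions: List[Region]) -> Dict[str, int]:
    """Sum the three criterion scores for every candidate region."""
    size = criterion_scores([r.normalized_size for r in regions])
    count = criterion_scores([float(r.n_snps) for r in regions])
    balance = criterion_scores([r.balance for r in regions])
    return {
        r.locus: s + c + b
        for r, s, c, b in zip(regions, size, count, balance)
    }


# Hypothetical candidates; none of these numbers come from the chapter.
candidates = [
    Region("candidate_1", 12.0, 200.0, 20, 18),
    Region("candidate_2", 4.0, 90.0, 6, 1),
    Region("candidate_3", 6.0, 150.0, 9, 3),
]
scores = total_scores(candidates)
best_locus = max(scores, key=scores.get)
print(scores, best_locus)
```

The locus with the highest total would be sequenced first; in the invented data this is `candidate_1`, which is largest, richest in SNPs, and most evenly flanked.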
In addition to a candidate gene, molybdenum cofactor synthesis 2 (MOCS2), Figure 10.6 shows the homozygous regions that harbor the SUOX and XDH genes. The long stretch of 10-Mb LOH on chromosome 5 and the identification of a small deletion in the MOCS2 gene confirmed molybdenum cofactor deficiency in Patient I. A high-density SNP array decreases the burden of completely sequencing all possible loci for genetic diseases with extensive genetic heterogeneity such as ARO, XP, Bardet-Biedl syndrome (BBS), and LGMD, which have 7, 8, 9, and 13 disease-causing genes, respectively (Lam et al., 2005, 2007; Chiang et al., 2006; Lau et al., 2009). Based on the history of consanguinity in our XP case (Patient F), the XPC locus was prioritized for mutational analysis after LOH detection by genome-wide SNP genotyping (Lam et al., 2005). Figure 10.7 shows two of the eight candidate genes of XP that were found in homozygous regions detected in both the HuSNP and 10K mapping assays. The degrees of LOH shown in the 10K mapping assay are consistent with the scoring system used in the HuSNP mapping assay, in which XPC got the highest score for subsequent mutation analysis. A homozygous nonsense mutation, c.445G>T or p.E149X, was identified in the patient and was heterozygous in the parents. Most ARO cases have been ascribed to a mutation in the T-cell immune regulator 1, ATPase, H+ transporting, lysosomal V0 subunit A3 (TCIRG1)
Figure 10.5. The 10K mapping assays for different autosomal recessive diseases. The y-axis shows the degree of LOH. ATP7B (NM_000053) for Wilson disease is located at chromosome 13q14.3; SUOX (NM_000456) for sulfite oxidase deficiency is located at chromosome 12q13.2; ALPL (NM_000478) for hypophosphatasia is located at chromosome 1p36.12; GAA (NM_000152) for Pompe disease is located at chromosome 17q25.2–25.3.
TABLE 10.4. Mutation Screening of Candidate Genes Identified in LOH Sites
Patient Number    Genetic Disease                                   Gene        Mutation (All Are Homozygous)
A                 Hypophosphatasia                                  ALPL        p.M219V
B                 Wilson disease                                    ATP7B       p.R778L
C                 Sulfite oxidase deficiency                        SUOX        c.1521_1524delTTGT
D                 Sulfite oxidase deficiency                        SUOX        p.R217Q
E                 Sulfite oxidase deficiency                        SUOX        p.D512Y
F                 Xeroderma pigmentosum (type C)                    XPC         p.E149X
G                 Malignant osteopetrosis                           CLCN7       p.I261F
H                 Pompe disease                                     GAA         p.R224W
I                 Molybdenum cofactor deficiency                    MOCS2       c.346_349delGTCA
J, K, and L       Limb-girdle muscular dystrophy type IIB           DYSF        p.D1837N
M                 Carnitine-acylcarnitine translocase deficiency    SLC25A20    c.199-10T>G
[Figure 10.6 panels: Patient I, LOH (y-axis 0.00 to 4.00). Chromosome 5, 42.77–61.29 Mb (p12 to q12.1); chromosome 12, 47.49–66.03 Mb (q13.12 to q14.3); chromosome 2, 20.10–40.31 Mb (p24.1 to p22.1).]
Figure 10.6. Comparison of the LOH regions from 10K mapping assays among three candidate genes of molybdenum cofactor deficiency. The y-axis shows the degree of LOH. MOCS2 (NM_176806.2) is located at chromosome 5q11; SUOX (NM_000456) is located at chromosome 12q13.2; XDH (NM_000379) is located at chromosome 2p23.1.
[Figure 10.7 panels: Patient F, LOH. Chromosome 3, 2.77–26.69 Mb (p26.3 to p24.2), y-axis 0 to 25; chromosome 16, 9.51–17.77 Mb (p13.2 to p12.3), y-axis 0.00 to 4.00.]
Figure 10.7. Comparison of the LOH regions from 10K mapping assays between two candidate genes of XP in a patient whose parents are blood relatives (fifth degree). The y-axis shows the degree of LOH. XPC is located at chromosome 3p25.1; XPF (ERCC4) is located at chromosome 16p13.12.
gene, with only a few cases attributed to a mutation in the CLCN7 gene (Cleiren et al., 2001). We were able to prioritize the CLCN7 locus for mutation screening in a Chinese patient (Patient G) and identify a novel homozygous missense mutation, c.781A>T or p.I261F, which was heterozygous in the parents (Lam et al., 2007, 2010). In the results of the 10K mapping arrays, with their increased number of SNP markers, among the three candidate genes CLCN7, TCIRG1, and osteopetrosis associated transmembrane protein 1 (OSTM1), only the region on chromosome 16p harboring the CLCN7 gene shows a high degree of LOH (>6.0), which is consistent with the HuSNP assay (Fig. 10.8). In a consanguineous family with LGMD (Fig. 10.9), we identified a long 25-Mb homozygous region on the short arm of chromosome 2 in all three affected siblings (Patients J, K, L) (Lau et al., 2009). The LCSH is greater than that shown in Figure 10.7, as there are multigeneration recombinations and dilutions of consanguineous chromosomes in Patient F. The SNP genotyping revealed a homozygous candidate region for further mutation analysis. The ultra-high-density SNP Array 6.0 mapped the gene dysferlin (DYSF), located on 2p13.3, in this LOH region (Fig. 10.10). A homozygous missense mutation, c.5509G>A or p.D1837N, of DYSF was identified in all the affected siblings.
10.3.3 LOH in Other Clinical Cytogenetics Analysis
Using 10K mapping arrays, we identified in a Chinese neonate presenting with sudden unexpected death (Patient M) a 7-Mb LOH region at 3p21.1–21.31
[Figure 10.8 panels: Patient G, LOH (y-axis 0.00 to 4.00). Chromosome 16, 0.00–21.54 Mb (p13.3 to p12.2); chromosome 11, 66.53–82.13 Mb (q13.1 to q14.1); chromosome 6, 103.09–115.53 Mb (q16.3 to q22.1).]
Figure 10.8. Comparison of the LOH regions from 10K mapping assays among the three candidate genes of ARO in a patient whose parents are first cousins. CLCN7 (NM_001287) is located at chromosome 16p13.3; TCIRG1 (NM_006019) is located at chromosome 11q13.2; OSTM1 (NM_014028) is located at chromosome 6q21.
[Figure 10.9 pedigree: generations I to III, with the affected siblings labeled J, K, and L.]
Figure 10.9. Pedigree of a consanguineous family with limb-girdle muscular dystrophy.
(Fig. 10.11). The SLC25A20 gene, which is located at chromosome 3p21.31, encodes the protein carnitine-acylcarnitine translocase (CACT). Despite the lack of parental consanguinity, we found homozygous IVS2-10T>G, a known mutation for CACT deficiency (Lam et al., 2003). Since the parents are nonconsanguineous, the homozygous region is likely due to linkage disequilibrium (Wong et al., 1998; Okubo et al., 1999; Lam, 1999). We suggest that this mutation within the IBD region may be a founder mutation in the Chinese population.
In a case of ring chromosome, a karyotype report from a clinical laboratory
Figure 10.10. (See color insert.) Identification of homozygous DYSF mutations in the homozygous region detected by SNP Array 6.0 (Lau et al., 2009).
suggested that the ring chromosome was derived from chromosome 21 (46, XY, r21) (Lau et al., 2009). According to the genotyping results of SNP Array 6.0, a segment of approximately 7 Mb was lost at the end of the long arm of chromosome 21. Figure 10.12 shows the breakpoint (with changes in both LOH and copy number) located at chromosome 21q22.2, indicating the deletion of the gene Down syndrome cell adhesion molecule (DSCAM).
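Reading LOH and copy number together, as in the ring chromosome case, separates deletions from copy-neutral events. The sketch below illustrates that logic for one chromosomal segment. The log2-ratio cutoffs of -0.3 and +0.3 are arbitrary placeholders chosen for illustration, not values from the Affymetrix software, and the category labels are ours.

```python
def classify_segment(
    loh: bool,
    log2_ratio: float,
    loss_cutoff: float = -0.3,   # placeholder threshold, an assumption
    gain_cutoff: float = 0.3,    # placeholder threshold, an assumption
) -> str:
    """Classify a segment from its LOH call and copy-number log2 ratio."""
    if log2_ratio <= loss_cutoff:
        # Copy loss with LOH is the pattern expected for a deletion.
        return "deletion" if loh else "copy loss without LOH call"
    if log2_ratio >= gain_cutoff:
        # Dosage gain, as seen with duplications.
        return "duplication"
    # Normal copy number: LOH here is copy-neutral, as in long contiguous
    # stretches of homozygosity due to consanguinity or UPD.
    return "copy-neutral LOH" if loh else "normal"


print(classify_segment(True, -1.0))
print(classify_segment(True, 0.0))
print(classify_segment(False, 0.58))
print(classify_segment(False, 0.0))
```

A terminal segment with LOH and a reduced log2 ratio, as reported for distal 21q in this case, falls into the deletion category, whereas the homozygous regions of the consanguineous families fall into the copy-neutral category.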
[Figure 10.11 panel: Patient M, LOH (y-axis 0.00 to 4.00). Chromosome 3, 43.88–54.65 Mb (p21.33 to p14.3).]
Figure 10.11. Detection of a homozygous region in the short arm of chromosome 3 in a case of CACT deficiency by 10K mapping array. SLC25A20 (NM_000387) is located at chromosome 3p21.31.
Figure 10.12. Positional mapping of ring chromosome 21 by the SNP Array 6.0 platform (Lau et al., 2009).
10.4 DISCUSSION
A comprehensive genomewide scan using an automated high-density SNP array offers significant cost and time benefits in sample preparation, processing, and data analysis. The cost savings are greater when the disease has marked locus heterogeneity, because the scan prioritizes the loci for mutational analysis. Accurate mapping of disease-causing genes by detecting LOH with a high-density SNP array provides a better method for making a reliable DNA-based prenatal diagnosis, since prenatal ultrasound scans are sometimes normal for an affected fetus. SNP microarrays are of particular advantage in finding the disease loci in consanguineous families. The results of the mapping study and the mutation study proved to be consistent, validating this approach. LOH sites may also have prognostic significance and may be ethnic-specific in different populations. The patterns of LOH obtained by SNP microarray are in excellent agreement with those obtained by both microsatellite genotyping and comparative genomic hybridization in some cancer studies (Lam et al., 2006). Unlike microsatellites, SNPs are not susceptible to the repeat expansion that is so often observed in cancer. We strongly suggest that the use of high-density SNP arrays should be standardized for molecular investigations of genetic diseases due to consanguinity. The high-density SNP array is useful not only for personalized genomic medicine but also for building disease-specific databases (Lau et al., 2009; Seelow et al., 2009). With precise medical informatics, we can correlate genotypes with phenotypes, such as early warning signs or improved drug response, for better disease management. With further technical and software advancements, high-density SNP genotyping will continue to help explore the pathophysiology of more disorders and create targeted drugs in the near future.
10.5 REFERENCES
Chiang AP, Beck JS, Yen HJ, et al. (2006). Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proc Natl Acad Sci USA 103:6287–92.
Cleiren E, Benichou O, Van Hul E, et al. (2001). Albers-Schönberg disease (autosomal dominant osteopetrosis, type II) results from mutations in the ClCN7 chloride channel gene. Nature 415:287–94.
Gnarra JR, Tory K, Weng Y, et al. (1994). Mutations of the VHL tumor suppressor gene in renal carcinoma. Nat Genet 7:85–90.
Johnson JL, Wuebbens MM, Mandell R, Shih VE. (1989). Molybdenum cofactor biosynthesis in humans. Identification of two complementation groups of cofactor-deficient patients and preliminary characterization of a diffusible molybdopterin precursor. J Clin Invest 83:897–903.
Kraemer KH, Lee MM, Scotto J. (1987). Xeroderma pigmentosum. Cutaneous, ocular, and neurologic abnormalities in 830 published cases. Arch Dermatol 123:241–50.
c10.indd 213
1/12/2011 9:44:16 AM
214
CANDIDATE SCREENING THROUGH HIGH-DENSITY SNP ARRAY
Lam CW. (1999). Origin of the Japanese population. Science 284:1125. Lam CW. (2010). Genome-based diagnosis of genetic disease. Indian J Med Res 131: 484–85. Lam CW, Cheung KKT, Luk NM, Chan SW, Lo KK, Tong SF. (2005). DNA-based diagnosis of xeroderma pigmentosum group C by whole-genome scan using singlenucleotide polymorphism microarray. J Invest Dermatol 124:97–91. Lam CW, Lai CK, Chow CB, Tong SF, Yuen YP, Mak YF, Chan YW. (2003). Ethnicspecific splicing mutation of the carnitine-acylcarnitine translocase gene in a Chinese neonate presenting with sudden unexpected death. Chin Med J (Engl) 116:1110–12. Lam CW, Li CK, Lai CK, Tong SF, Chan KY, Ng GSF, Yuen YP, Cheng AWF, Chan YW. (2002). DNA-based diagnosis of isolated sulfite oxidase deficiency by denaturing high-performance liquid chromatography. Mol Genet Metab 75:91–95. Lau KC, Mak CM, Leung KY, Tsoi TH, Tang HY, Lee P, Lam CW. (2009). A fast modified protocol for random-access ultra-high density whole-genome scan: A tool for personalized genomic medicine, positional mapping, and cytogenetic analysis. Clin Chim Acta 406:31–35. Lam CW, To KF, Tong SF. (2006). Genome-wide detection of allelic imbalance in renal cell carcinoma using high-density single-nucleotide polymorphism microarrays. Clin Biochem 39:187–90. Lam CW, Tong SF, Wong K, et al. (2007). DNA-based diagnosis of malignant osteopetrosis by whole-genome scan using a single-nucleotide polymorphism microarray: standardization of molecular investigations of genetic diseases due to consanguinity. J Hum Genet 52:98–101. Mak CM, Lam CW, Tam S, et al. (2008). Mutational analysis of 65 Wilson disease patients in Hong Kong Chinese: identification of 17 novel mutations and its genetic heterogeneity. J Hum Genet 53:55–63. Nishida N, Koike A, Tajima A, et al. (2008). Evaluating the performance of Affymetrix SNP Array 6.0 platform with 400 Japanese individuals. BMC Genom 9:431. Okubo M, Horinishi A, Murase T, Hamada K. (1999). 
1176C polymorphism in Japanese patients with glycogen storage disease type 1a. Hum Genet 104:193. Redon R, Ishikawa S, Fitch KR, et al. (2006). Global variation in copy number in the human genome. Nature 444(7118):444–54. Seelow D, Schuelke M, Hildebrandt F, Nürnberg P. (2009). HomozygosityMapper—an interactive approach to homozygosity mapping. Nucl Acids Res 37:W593–10. The International HapMap Consortium. (2005). A haplotype map of the human genome. Nature 437(7063):1299–13110. The International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449(7164):851–61. Tuna M, Knuutila S, Mills GB. (2009). Uniparental disomy in cancer. Trends Mol Med 15:120–28. Wong LJ, Liang MH, Hwu WL, Lam CW. (1998). Linkage disequilibrium and linkage analysis of the glucose-6-phosphatase gene. Hum Genet 103:199–203.
CHAPTER 11
Gene Discovery by Direct Genome Sequencing
KUNAL RAY, ARIJIT MUKHOPADHYAY, and MAINAK SENGUPTA
Contents
11.1 Introduction
11.2 Gene Discovery by Direct Genome Sequencing
11.2.1 Discovery of Mutations in Mendelian Diseases
11.2.2 Discovery of QTL or Single Nucleotide Mutations
11.3 Applications and Protocols
11.3.1 Identification and Capturing of the Targeted Genomic Region
11.3.2 Selection of Suitable Platform
11.4 The Limitations of Direct Genome Sequencing
11.5 References
11.1 INTRODUCTION
The last few decades saw unprecedented growth in our understanding of the genetic basis of disease as the underlying molecular defects were unraveled, especially for Mendelian disorders. Most of these discoveries were made possible by the Sanger method of direct DNA sequencing (Sanger and Coulson, 1978; Sanger et al., 1977, 1992). The development of Sanger sequencing and its widespread use made forward genetics more powerful than reverse genetics: causal variants could be identified even before the molecular basis of their relationship with the trait under study was known. However, the OMIM database lists only 344 entries (out of 19,864) for which gene mutations have been definitively linked with phenotypes, indicating that most diseases are not purely monogenic and are probably controlled by one or more modifier loci. Even the most commonly used textbook example of a Mendelian disease, hemophilia, has recently been shown to have a quasi-quantitative character with variable penetrance (Chavali et al., 2008, 2009). To add to the complexity, Gottlieb et al. (2009), while studying abdominal aortic aneurysm (AAA), found that SNPs in the BAK1 gene differed between aortic tissue and blood samples, even when both were taken from the same individual. Based on these findings, the authors advised caution in interpreting genetic associations based on DNA from blood samples alone. Hence, to achieve a proper genotype–phenotype correlation, we need to probe deeper into the genome, matching it with the epigenome and transcriptome to get a glimpse of the complex networks that exist at the proteome level. This realization has changed the way research in human genetics is carried out. Today researchers use high-throughput genotyping tools to first map a disease locus and then sequence the entire candidate region to detect all nucleotide variations that might contribute to the biology, either singly or in synergy. A large number of discoveries are being reported in which classical exon or ORF sequencing fails to identify the causal variant, which is found only by direct genome sequencing. In this chapter we describe these recent approaches and emphasize that researchers need to align themselves with the changing paradigm to discover novel genes.
11.2 GENE DISCOVERY BY DIRECT GENOME SEQUENCING
As described in the previous chapters, recent technological developments in high-throughput genotyping allow us to scan the human genome at very high resolution (the latest arrays have a marker every 1.5 kb on average). These arrays allow us to look for association of phenotypes with single nucleotide polymorphisms and copy number variations, both independently and in combination. The platforms are becoming cost effective and are gradually replacing classical microsatellite-based linkage or association analysis for hereditary traits. This has revolutionized the field, especially for complex multifactorial diseases, where the lack of large pedigrees prohibits the use of the more powerful linkage analysis and leaves association studies as the only option. In the last few years there has been an explosion of data from genomewide association studies (GWAS), which has led to the discovery of many unexpected disease-causing genes. As we know from experience, discovering these genes in the context of specific diseases by the traditional candidate gene approach would have taken a very long time. These high-throughput screening technologies have enabled an unbiased approach to the genetic basis of diseases. An exemplary success of GWAS is the discovery of the association of the p.Y402H variant of the complement factor H gene with age-related macular degeneration. This variant helped explain a large proportion of the disease burden and opened a new field of research in the understanding and cure of this blinding disorder of the elderly (Haines et al., 2005).
However, these high-throughput screening technologies have some inherent limitations as well. Recent studies suggest that a complex array of disease phenotypes usually results from the effect of a few rare variants with higher penetrance in combination with many common variants with lower penetrance. The higher the penetrance of the rare variant, the greater the Mendelian character of the phenotype. By design, the markers on high-throughput genotyping arrays are in most cases variants that have been validated in different population groups, that is, common variants. As a result, GWAS are very poor at detecting the rare variants that would be expected to be highly penetrant. This means that GWAS can detect only less penetrant common variants, and a large sample size is needed to give the study enough power. Another limitation lies in the present level of marker saturation, where the average spacing between markers ranges from 1.5 to 5 kb. It is debatable whether one would gain significantly in power to identify causal variants by increasing the number of markers in any region. Recombination is distributed nonrandomly across the human genome (giving rise to linkage disequilibrium blocks), so increasing the number of markers within a block would not necessarily increase the power to detect a causal variant. In addition to these conceptual limitations, there are some technical limitations that are dealt with in other chapters of this book.
Direct genome sequencing, on the other hand, can circumvent the limitations described above. Technically speaking, direct genome sequencing is powerful enough to detect the causal variant even from a single individual's sample. One can probe the entire genomic region of interest without having to hypothesize where the causal variant might lie.
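The marker spacings quoted above can be put in perspective with a short calculation. The sketch below is only illustrative: it assumes a haploid genome of roughly 3 Gb and evenly spaced markers, and the 30-kb linkage disequilibrium block is a hypothetical value chosen for the example, not a figure from this chapter.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: ~3 Gb haploid human genome


def markers_for_spacing(spacing_bp, genome_bp=GENOME_SIZE_BP):
    """Number of evenly spaced markers implied by an average spacing."""
    if spacing_bp <= 0:
        raise ValueError("spacing must be positive")
    return genome_bp // spacing_bp


def markers_in_block(block_bp, spacing_bp):
    """Markers falling inside one LD block of the given (hypothetical) size."""
    if spacing_bp <= 0:
        raise ValueError("spacing must be positive")
    return block_bp // spacing_bp


if __name__ == "__main__":
    hypothetical_block_bp = 30_000  # illustrative LD block size only
    for spacing in (1_500, 5_000):
        print(
            f"{spacing} bp spacing: {markers_for_spacing(spacing):,} markers genomewide, "
            f"{markers_in_block(hypothetical_block_bp, spacing)} per {hypothetical_block_bp} bp block"
        )
```

Under these assumptions a block of this size already holds several markers at either spacing, and markers within one block are strongly correlated, which is why adding more of them there may contribute little extra power.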
At present, the cost of direct genome sequencing for large regions or for a large number of samples is prohibitive, but the rapid technological breakthroughs under way will almost certainly reduce the cost substantially. Different strategies for direct genome sequencing are described later in this chapter.
11.2.1 Discovery of Mutations in Mendelian Diseases
For a Mendelian disorder, where one mutation is often penetrant enough to precipitate a disease phenotype, one would collect samples from one or more large pedigrees with multiple affected members, carry out linkage analysis, and then sequence the exons of all the genes in the linkage interval. An ORF or a gene is often excluded on the basis that no causal variant is found in its exons. Classically, these sequencing approaches involve bidirectional Sanger sequencing. Recently, expression analysis (den Hollander et al., 2007; Mukhopadhyay et al., 2006), microRNA sequencing (Mencía et al., 2009), and direct genome sequencing (Ng et al., 2009; Nikopoulos et al., 2010) have been used successfully to find the causal gene or mutation for Mendelian disorders.
In spite of the enormous advances in disease genetics and genomics, there are almost 1800 identified genetic loci or phenotypes for which the causal gene is not known and another 2000 suspected Mendelian traits with an unknown molecular defect (according to OMIM). These statistics suggest that identifying the molecular defects in genetic diseases remains a challenging task, even after the underlying loci have been pinned down with the traditional tools and strategies of molecular genetic analysis. The major impediment is hidden mutations in regions of the gene that are not intuitively obvious and immediately testable, for example, promoter regions, locus control regions, and miRNA binding sites in untranslated regions. Such hurdles could be surmounted if multiple unrelated patient samples were available with defects in the same genetic locus but harboring different mutations. In that case, some mutations may elude limited sequencing efforts or lie in an uncharacterized region of the gene (promoter, UTR), but others may be readily related to a biological aberration. However, for rare genetic diseases, finding multiple unrelated families or patients is not easy. In such circumstances, direct genome sequencing plays a very important role in the discovery of novel genes or mutations in Mendelian disorders. For example, a recent report describes the identification of a causal mutation in a novel gene for familial exudative vitreoretinopathy (FEVR), a Mendelian disease of the eye (Nikopoulos et al., 2010). The authors used high-throughput genotyping technology to find a common linkage interval spanning 40 Mb in two pedigrees. They then used next-generation sequencing (NGS) technology to sequence the entire 40-Mb interval directly and identified a causal variant in TSPAN12, a novel gene for this phenotype (Nikopoulos et al., 2010).
It is likely that the classical approach to characterizing the causal gene for the disease and the underlying defect in the patient(s) would have taken much longer.
11.2.2 Discovery of QTL or Single Nucleotide Mutations
As discussed at the beginning of this chapter, most diseases are influenced by the modifying effects of other loci and the environment. To understand the etiology of a disease, one needs to quantify the contribution of each trait or parameter responsible for the disease phenotype under study. These parameters are not binary in nature and form a continuum in the population. A genetic locus controlling such a quantitative trait is called a quantitative trait locus (QTL). For example, one can study type 2 diabetes and decide on the disease status in a binary manner (i.e., either affected with the disease or not). However, the most important parameter for making the binary decision is the level of blood glucose, which is not binary and takes a range of values in cases and controls. Hence blood glucose level is a quantitative trait underlying type 2 diabetes, and understanding the genetic loci (QTLs) that control it can be more useful in elucidating the disease etiology than studying the disease status itself.
In contrast to what has been discussed for Mendelian diseases, identifying genetic signatures for QTLs is much more challenging. Typically, the variants contributing to a QTL have low penetrance and are frequent in both cases and controls, so larger samples are needed to detect them with enough statistical power. In addition, one may not have a correct estimate of the number of QTLs contributing to a phenotype and would therefore have to carry out very robust and deep phenotyping, so that during genetic analysis the contribution of each phenotypic trait can be assessed using regression-based approaches.
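As a minimal illustration of a regression-based assessment, the sketch below fits a simple additive model in which a quantitative trait is regressed on the number of minor alleles (0, 1, or 2) carried at one SNP. The model choice, the function name, and the six data points are assumptions made for the example; the trait values are invented and stand in for something like fasting blood glucose.

```python
from statistics import mean


def additive_effect(genotypes, trait):
    """Least-squares slope and intercept of a trait on minor-allele count.

    genotypes: minor-allele counts per individual (0, 1, or 2)
    trait:     quantitative trait values in the same order
    """
    if len(genotypes) != len(trait) or len(genotypes) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    g_bar, t_bar = mean(genotypes), mean(trait)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("all individuals share one genotype; effect is not estimable")
    sxy = sum((g - g_bar) * (t - t_bar) for g, t in zip(genotypes, trait))
    slope = sxy / sxx  # estimated trait change per additional minor allele
    intercept = t_bar - slope * g_bar
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical toy data, not taken from any study.
    toy_genotypes = [0, 0, 1, 1, 2, 2]
    toy_trait = [5.0, 5.2, 5.5, 5.7, 6.0, 6.2]
    slope, intercept = additive_effect(toy_genotypes, toy_trait)
    print(f"per-allele effect: {slope:.2f}, baseline: {intercept:.2f}")
```

A real analysis would add covariates, test the slope for significance, and correct for the number of markers tested; the point here is only that the per-allele effect on a continuous trait is the quantity being estimated.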
11.3 APPLICATIONS AND PROTOCOLS
In the previous sections we discussed the need for direct genome sequencing to discover novel genes and variations causing diseases, both Mendelian and multifactorial. In this section we briefly describe the technical approaches used in direct genome sequencing for gene discovery.
11.3.1 Identification and Capturing of the Targeted Genomic Region
11.3.1.1 PCR-Based Method for Targeted Deep Sequencing
The most obvious and commonly used method for generating fragments from targeted genomic regions is traditional single-plex PCR. This can take the form of long-range PCR (LR-PCR), which captures a few kilobases of DNA in one amplicon, or of individual amplification of multiple exons from multiple genes, with the products pooled for each sample before parallel sequencing. Specialized reagents for long-range PCR, available from multiple sources, can amplify fragments up to 25 kb. Yeager et al. (2008) used the LR-PCR approach to resequence a 136-kb region from 8q24 to detect variations associated with prostate and colon cancer. They designed primers to amplify fragments ranging from 2.0 to 5.5 kb and kept an overlap of more than 500 bp between adjacent fragments. After confirming successful amplification of each reaction, they pooled equimolar amounts of each fragment to represent the entire 136-kb region and then carried out parallel sequencing with the 454 technology from Roche. This approach is useful when the region under study is one contiguous stretch of the genome. For a multifactorial disease, however, one would typically want to sequence multiple small regions from one individual, for which LR-PCR is not effective. With the PCR-based approach one has to run many independent PCR reactions and then follow the usual pipeline of quantitation, pooling, library preparation, and parallel sequencing. Both approaches are susceptible to variability between PCR reactions and need a large amount of input DNA (10–20 ng per PCR reaction) compared with other available options. In addition, for degraded samples, as is often the case in cancer, long-range PCR does not work well. These traditional PCR-based methods are highly specific and sensitive but difficult to scale up, which leads to underutilization of the massive throughput provided by the NGS platforms. Several other methods have recently evolved to circumvent these problems.
11.3.1.2 Multiplex Amplification by Padlock Molecular Inversion Probe
Padlock probes, circularizing oligonucleotides for localized DNA detection, were described as early as 1994 (Nilsson et al., 1994). This technology was later used for multiplex SNP genotyping and termed the molecular inversion probe (MIP) assay (Hardenbol et al., 2003, 2005). With this approach, "inverted" probes are generated in which the SNP information is reformatted into tag sequences, enabling large-scale screening on a tag DNA microarray. MIPs are single-stranded DNA molecules with sequences complementary to the regions flanking the SNP under study. Each probe also contains two universal primer sites separated by an endonuclease recognition site. During the assay, the probes undergo a unimolecular rearrangement: they are (1) circularized by filling the gap with the nucleotide corresponding to the SNP in separate allele-specific polymerization (A, C, G, and T) and ligation reactions and (2) linearized in enzymatic reactions. As a result, they become inverted. This step is followed by PCR amplification and sequencing (Fig. 11.1). Porreca et al. (2007) have shown the utility of the MIP assay for multiplex amplification. In evaluating multiplex targeting methods, key performance parameters include multiplexity, specificity, and uniformity. Multiplexity refers to the number of independent capture reactions performed simultaneously in a single reaction. Specificity is measured as the fraction of captured nucleic acids derived from the targeted regions. Uniformity is defined as the relative abundance of targeted sequences after selective capture.
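The three measures just defined can be computed from simple read counts. The sketch below is one possible way to do so, under stated assumptions: specificity is taken as on-target reads divided by all reads, and uniformity is summarized as the fold range (highest over lowest depth) among targets that were captured at all. The function name and the example numbers are hypothetical.

```python
def capture_metrics(on_target_reads, total_reads, per_target_depth):
    """Summarize a multiplex capture experiment.

    on_target_reads:  reads mapping to the targeted regions
    total_reads:      all reads sequenced
    per_target_depth: mean read depth for each designed target
    """
    if total_reads <= 0 or not per_target_depth:
        raise ValueError("need a positive read count and at least one target")
    captured = [d for d in per_target_depth if d > 0]
    return {
        # number of independent captures attempted in the single reaction
        "multiplexity": len(per_target_depth),
        # fraction of sequenced material that came from the targets
        "specificity": on_target_reads / total_reads,
        # fraction of designed targets that yielded any reads
        "fraction_captured": len(captured) / len(per_target_depth),
        # spread in relative abundance among captured targets (1.0 = perfectly even)
        "fold_range": max(captured) / min(captured) if captured else None,
    }


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(capture_metrics(600, 1000, [10, 200, 0, 50]))
```

Other summaries of uniformity are equally reasonable, such as the fraction of targets within a fixed fold of the median depth; the fold range is used here only because it is easy to read.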
Ideally, a multiplex targeting method will perform adequately by all three measures. An additional concern is cost: targeted capture requires one or more oligos to specify each target, which is potentially very expensive at high degrees of multiplexing (Porreca et al., 2007). To overcome this problem, the authors synthesized 100-mer oligos on a programmable microarray and released them as a complex pool. The pool is PCR amplified and then restriction digested to release a mixture of single-stranded 70-mer capture probes. Each probe consists of a universal 30-nucleotide motif flanked by unique 20-nt segments (targeting arms). Each linked pair of targeting arms is designed to hybridize immediately upstream and downstream of a specific genomic target, for example, an exon. The capture event itself, a modification of the molecular inversion probe strategy developed for multiplex genotyping, is achieved by polymerase-driven extension from the 3′ end of the capture probe to copy the target, followed by ligation to the 5′ end to complete the circle. Subsequent steps enrich and amplify these circles or generate products amenable to shotgun sequencing library production (Porreca et al., 2007). This method circumvents the low scalability of a PCR-based approach, is highly specific, and captures >90% of the targets.
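The probe layout described above (two 20-nt targeting arms around a 30-nt universal motif, 70 nt in total) can be sketched in a few lines. This is an illustration under assumptions rather than the published design procedure: the motif is a placeholder of N's because its sequence is not given here, the probe is written in the sense of the reference strand so that it anneals to the complementary strand, and the function name is hypothetical.

```python
ARM_LEN = 20
UNIVERSAL_MOTIF = "N" * 30  # placeholder; the real motif sequence is not given in the text


def design_capture_probe(reference, target_start, target_end, motif=UNIVERSAL_MOTIF):
    """Return a 5'->3' capture probe for reference[target_start:target_end].

    Coordinates are 0-based and half-open. The upstream flank becomes the
    3' arm (extended by polymerase across the target) and the downstream
    flank becomes the 5' arm (ligated to the extended 3' end).
    """
    if not 0 <= target_start < target_end:
        raise ValueError("invalid target interval")
    if target_start < ARM_LEN or target_end + ARM_LEN > len(reference):
        raise ValueError("not enough flanking sequence for both targeting arms")
    upstream_arm = reference[target_start - ARM_LEN:target_start]
    downstream_arm = reference[target_end:target_end + ARM_LEN]
    return downstream_arm + motif + upstream_arm


if __name__ == "__main__":
    # Artificial reference: 20 A's, a 50-nt "exon" of C's, then 20 G's.
    toy_reference = "A" * 20 + "C" * 50 + "G" * 20
    probe = design_capture_probe(toy_reference, 20, 70)
    print(len(probe), probe)
```

A practical design tool would also screen the arms for melting temperature, repeats, and uniqueness in the genome, none of which is attempted here.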
Figure 11.1. Molecular inversion probe assay. 1, The probe and the genomic DNA with the polymorphic base to be genotyped. The bases in black are complementary to the portions of the probe (bolded) flanking the polymorphic site. The white and textured arrows are regions complementary to universal PCR primers, and the black region is the cleavage site containing a restriction endonuclease recognition sequence. 2, The terminal portions of the probe complementary to the genomic DNA sequence hybridize with it, while the remaining part is looped out and a gap is left at the polymorphic site. 3, The gap in the probe is filled by incorporation of dNTPs, using polymerase and ligase. Unreacted probes are digested by exonuclease. 4, The probe is linearized by restriction enzyme digestion and 5, released from the genomic DNA. The probe now has an inverted orientation compared with its original conformation. 6, The probe is enriched by PCR amplification with the universal PCR primer pair (arrows). 7, The polymorphic base is identified by sequencing or hybridization.
However, published results show that coverage of the targeted sequences varies over a >100-fold range, and 34–42% of the sequencing capacity is consumed by primer sequences or the molecular inversion probes' linker backbone (Tewhey et al., 2009a). This unnecessary sequencing of unwanted regions means less coverage for the sequence of interest.
11.3.1.3 Hybridization-Based Method for Genome Sequencing
The third approach, based on hybridization with long oligonucleotides that are either matrix bound or in solution, captures and pulls down the target sequences (Albert et al., 2007; Gnirke et al., 2009; Hodges et al., 2007; Okou et al., 2007). The solid-phase hybridization approach has been used to capture the entire human exome, as reported in several published studies (Hodges et al., 2007; Ng et al., 2009). However, the process is difficult to scale up for large population studies. A proof-of-principle study of solution-phase hybridization with long 170-bp capture probes has been published (Gnirke et al., 2009). Although this study clearly demonstrated the utility of the approach at a depth of 84× coverage, the variant-detection sensitivity was only 64–80% within the exonic sequences, likely because of insufficient coverage uniformity (Tewhey et al., 2009a). A modification of the solution-based hybridization method has since been published for enrichment of sequencing targets (Tewhey et al., 2009a). The authors noted that the tiling frequency of the 120-bp capture probes is important for obtaining highly uniform coverage across the targeted sequences. Their results suggest that sequence coverage improves if each targeted base pair is contained within two different capture probes but is not improved further by a greater tiling density. They also demonstrated that, for optimal coverage of human exons shorter than 180 bp, at least three 120-mer capture probes per exon should be used.
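The tiling rule described above can be sketched as follows. The placement scheme is an assumption made for illustration: probes of 120 nt are stepped by half their length so that every base of the exon lies within two probes, and exons shorter than 180 bp are topped up to three probes. How the original authors positioned their probes is not specified here, and the function names are hypothetical.

```python
PROBE_LEN = 120
SHORT_EXON_BP = 180
MIN_PROBES_SHORT_EXON = 3


def tile_probes(exon_start, exon_end, probe_len=PROBE_LEN):
    """Return (start, end) probe intervals giving 2x tiling over an exon.

    Coordinates are 0-based and half-open. Probes extend into the flanking
    sequence so that the first and last exon bases are also covered twice.
    """
    if exon_end <= exon_start:
        raise ValueError("exon_end must be greater than exon_start")
    step = probe_len // 2
    starts = list(range(exon_start - step, exon_end, step))
    if exon_end - exon_start < SHORT_EXON_BP:
        # Short exons: ensure at least three probes, as recommended in the text.
        while len(starts) < MIN_PROBES_SHORT_EXON:
            starts.append(starts[-1] + step)
    return [(s, s + probe_len) for s in starts]


def depth_at(position, probes):
    """Number of probes containing a given base."""
    return sum(1 for start, end in probes if start <= position < end)


if __name__ == "__main__":
    probes = tile_probes(1000, 1100)  # hypothetical 100-bp exon
    print(probes)
    print({depth_at(p, probes) for p in range(1000, 1100)})
```

The sketch ignores chromosome boundaries and probe sequence quality; near the start of a contig the first probe start could fall below zero and would need clipping.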
The hybridization-based methods have good capture rates, uniform coverage of target sequences, and good reproducibility. However, the methods are known to be biased toward repetitive elements, which can result in a high proportion of reads that are not uniform. In addition, sequences that are highly homologous to other sequences in the genome cannot be individually targeted. The solution-based method is presented in Figure 11.2.
11.3.1.4 Microdroplet-Based PCR Enrichment for Large-Scale Sequencing
To combine the strengths of the approaches described above, a microdroplet-based PCR enrichment for large-scale targeted sequencing has been developed (Tewhey et al., 2009b). In short, the method runs singleplex amplifications in massively parallel fashion while retaining their specificity and sensitivity. This is achieved by encapsulating each reaction in a discrete microdroplet, which prevents interaction between primer pairs; up to 4000 targets have been amplified successfully in this way. The authors describe the preparation of 1.5 million separate PCR reactions from 20 μL of template solution containing 7.5 μg of genomic DNA. This technology is well suited for processing DNA for massively parallel amplification of sequencing targets. Microdroplet PCR consists of the following steps: merging picoliter droplets of fragmented genomic DNA template (2–4 kb, by DNase I digestion) with primer pair droplets, pooled thermal cycling of the PCR reactions, and destabilizing the droplets to release the PCR products for purification and sequencing. Figure 11.3 shows the methodology.
Figure 11.2. (See color insert.) The hybridization-based sequencing method. (a) Genomic DNA is sheared and end repaired or modified. dATP is added at the 3′ ends of the fragments, adapters are ligated, and excess adapters or unligated primers are removed. The products are purified, and adapter-specific PCR amplification is done to enrich the product pool to prepare a library. (b) The prepared library is hybridized with relevant biotinylated probes (specific sequences, whole exome, etc.) in solution in a hybridization array. The probes bind to the relevant sequences from the library. Then streptavidin-coated magnetic beads are released in the array, and a magnet is used to capture biotinylated probes bound to their complementary sequences. Those specific sequences can then be sequenced on an appropriate platform. [Panel (b) of the illustration has been adapted from Protocol version 1.0.1, October 2009; SureSelect Human All Exon Kit from Agilent Technologies.]
Figure 11.3. (See color insert.) Microdroplet PCR workflow. (a) Primer library generation: 1, Identify targeted sequences of interest in the genome. 2, Design and synthesize forward and reverse primer pairs for each targeted sequence (library element). 3, Generate primer pair droplets for each library element. A microfluidic chip is used to encapsulate the aqueous PCR primers in inert fluorinated carrier oil with a block-copolymer surfactant to generate the equivalent of a picoliter-scale test tube compatible with standard molecular biology. 4, Mix primer pair droplets of library elements together so that each library element has an equal representation. (b) Genomic DNA template mix preparation: 5, Biotinylate (red dots), fragment into 2- to 4-kb fragments, and purify genomic DNA. 6, Mix purified genomic DNA together with all of the components of the PCR reaction (DNA polymerase, dNTPs, and buffer) except for the PCR primers. (c) Droplet merge and PCR: 7, Dispense primer library droplets to the microfluidic chip. 8, Deliver the genomic DNA template as an aqueous solution; template droplets are formed within the microfluidic chip. Then pair the primer pair droplets and template droplets in a 1:1 ratio. 9, Allow the paired droplets to flow through the channel of the microfluidic chip and pass through a merge area, where an electric field induces the two discrete droplets to coalesce into a single PCR droplet. Collect ∼1.5 million PCR droplets in a single 0.2-mL PCR tube. Process the PCR droplets (PCR library) in a standard thermal cycler for targeted amplification; break the emulsion of PCR droplets to release the PCR amplicons into solution for genomic DNA (gDNA) removal, purification, and sequencing. (Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology, Tewhey et al., Microdroplet-based PCR enrichment for large-scale targeted sequencing, 27, 1025–1031, 2009.)
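The quantities quoted for microdroplet PCR (1.5 million reactions from 20 μL of template solution containing 7.5 μg of genomic DNA) imply a per-droplet scale that is easy to work out. The sketch below assumes that the whole template solution is divided evenly among the droplets; the figure of roughly 3.3 pg of DNA per haploid human genome is an outside approximation, not a value from this chapter.

```python
PG_PER_UG = 1_000_000  # picograms per microgram
PL_PER_UL = 1_000_000  # picoliters per microliter
HAPLOID_GENOME_PG = 3.3  # assumption: approximate mass of one haploid human genome


def per_droplet(template_volume_ul, genomic_dna_ug, n_droplets):
    """Template volume (pL), DNA mass (pg), and haploid genome copies per droplet."""
    if n_droplets <= 0:
        raise ValueError("n_droplets must be positive")
    volume_pl = template_volume_ul * PL_PER_UL / n_droplets
    dna_pg = genomic_dna_ug * PG_PER_UG / n_droplets
    return volume_pl, dna_pg, dna_pg / HAPLOID_GENOME_PG


if __name__ == "__main__":
    volume_pl, dna_pg, genome_copies = per_droplet(20.0, 7.5, 1_500_000)
    print(f"{volume_pl:.1f} pL template, {dna_pg:.1f} pg DNA, "
          f"~{genome_copies:.1f} haploid genome equivalents per droplet")
```

Under these assumptions each droplet receives template on the picoliter scale, consistent with the description of picoliter droplets, and on the order of one genome equivalent of fragmented DNA.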
The microdroplet PCR technology and the improved solution-based hybridization technology were developed by the same group of researchers (Tewhey et al., 2009a, 2009b). They set out the important parameters to consider when choosing an enrichment method for targeted sequencing: (1) uniformity of coverage of targeted sequences, (2) the detection rate and calling accuracy of sequence variants, (3) the efficiency of the enrichment over background sequences, (4) universality of the capture method (fraction of the genome that can be uniquely captured), and (5) the multiplicity of the reaction (amount of sequence that can be targeted). Compared with other enrichment methods, microdroplet PCR generates substantially more uniform coverage of targeted sequences, resulting in a higher variant detection rate: 94.5% for microdroplet PCR, 64–89% for solution-based hybridization, and 75% for molecular inversion probes. Microdroplet PCR is a universal method that allows unique capture of most sequences, including those highly similar to other regions of the genome. By anchoring a primer in the divergent portion of a homologous sequence or in an adjacent unique region, almost any interval can be specifically targeted. In contrast, hybridization-based methods cannot capture individual repetitive elements or homologous exons. The authors further commented that they have already used microdroplet PCR to enrich ∼4,000 targeted sequences in a single tube per sample and are working on scaling it up to 20,000 targets (∼7.5 Mb, ∼1/10th the exome) using an expanded content format with five sets of primers in each droplet and no other changes to the workflow. The requirement for 7.5 μg of starting DNA in this study limits the applicability of microdroplet PCR to samples available in limited quantities. The authors commented that optimization has reduced the current requirement to 2 μg, and that it is being further reduced to nanogram quantities of DNA (Tewhey et al., 2009b).
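The planned scale-up can be checked with simple arithmetic. The sketch below only restates the numbers quoted above (20,000 targets, five primer sets per droplet, about 7.5 Mb of sequence); the function names are hypothetical.

```python
import math


def droplet_library_elements(n_targets, primer_sets_per_droplet):
    """Distinct primer droplets needed when several primer sets share a droplet."""
    if n_targets < 0 or primer_sets_per_droplet <= 0:
        raise ValueError("counts must be non-negative and primer sets positive")
    return math.ceil(n_targets / primer_sets_per_droplet)


def mean_target_length_bp(total_target_bp, n_targets):
    """Average amount of sequence covered by each target."""
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    return total_target_bp / n_targets


if __name__ == "__main__":
    print(droplet_library_elements(20_000, 5))       # distinct droplet types
    print(mean_target_length_bp(7_500_000, 20_000))  # bp per target
```

With five primer sets per droplet, the number of distinct droplet types stays at the 4,000 already demonstrated, which is consistent with the statement that no other change to the workflow is needed.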
As discussed, a researcher today can choose among several methods of sample preparation, depending on the research question. Table 11.1 compares the methods.

TABLE 11.1. Sample Preparation Methods for Direct Genome Sequencing

Approach: PCR and LR-PCR
Merits:
• Highest specificity and sensitivity
Limitations:
• Difficult to scale up
• Not suited for sequencing multiple genomic regions

Approach: Multiplex amplification by padlock molecular inversion probe
Merits:
• High specificity
• Can multiplex up to 10,000 different regions
Limitations:
• High cost of oligo synthesis
• ∼40% of total data are from known, unwanted oligo or linker sequences

Approach: Hybridization-based methods
Merits:
• High specificity
• High depth of coverage
• Can capture targeted regions from the entire genome
Limitations:
• Expensive to use in large population studies
• Biased for repetitive sequences
• Nonuniform coverage
• Depends on a high-quality oligo library

Approach: Microdroplet-based PCR enrichment
Merits:
• Very high sensitivity and specificity
• Highest variant detection sensitivity (94.5%)
• Can amplify up to 4000 different targets simultaneously and can perform 1.5 million PCR reactions in a small volume
• Suitable for multiple regions and many samples
Limitations:
• Currently requires a large amount (7.5 μg) of genomic DNA

11.3.2 Selection of Suitable Platform
In terms of ease of sample preparation and accuracy of data, the traditional Sanger method of DNA sequencing is still the best. However, the highest achievable throughput of the method is 1 kb of sequence per hour. In contrast, some of the NGS platforms can generate the equivalent of 10 human genomes (30 Gb) in about 1 week. To achieve higher throughput and to address pangenomic questions, research strategies are being formulated to match the requirements of NGS platforms. As described earlier, choosing the right approach for sample preparation is perhaps the most crucial step in determining the effectiveness of NGS technology. Once the samples have been prepared by the method of choice, they are ready to go onto the NGS platform for massively parallel sequencing. At present there are three major players providing appropriate platforms for this purpose: 454 by Roche, Genome Analyzer by Illumina, and SOLiD by Applied Biosystems. The
details about these platforms are available in other chapter(s) of this book; here we discuss their performance in regard to direct genome sequencing for gene discovery.

The NGS technologies generate a large amount of sequence in each run. For example, both the Illumina Genome Analyzer and the ABI SOLiD can produce 30–70 Gb of raw sequence per week, whereas one run on the 454 currently produces only 500 Mb of sequence. However, for the platforms that produce short reads, more than half of this sequence is not usable. On average, 55% of the Illumina Genome Analyzer reads pass quality filters, of which approximately 77% align to the reference sequence. For ABI SOLiD, approximately 35% of the reads pass quality filters, and 96% of the filtered reads align to the reference sequence. Thus only 43% and 34% of the Illumina Genome Analyzer and ABI SOLiD raw reads, respectively, are usable. In contrast to the short-read platforms, approximately 95% of the Roche 454 reads uniquely align to the target sequence. When designing experiments and calculating the target coverage for a region, one must consider the fraction of alignable sequence (Harismendy et al., 2009). It has been reported that on the Genome Analyzer platform 50% of the generated sequence represents the first 50 bp of the amplicon initially generated by LR-PCR (Harismendy and Frazer, 2009), making half of the data unusable. The authors showed that blocking the 5′ end of the PCR primer reduces this overrepresentation, resulting in more uniform coverage.

11.3.2.1 The New Range of Possibilities
All the established NGS platforms use amplification-based methods. However, the latest technologies in the field have bypassed that requirement. The true Single Molecule Sequencing (tSMS) technique from HELICOS is the first of its kind available commercially (Gupta, 2008; Milos, 2008). It first fragments the genomic DNA to a length between 100 and 200 nucleotides. A universal poly-A stretch is then added at one end of every fragment, and the fragments are hybridized onto a lawn of immobilized poly-T oligos. Subsequently, DNA polymerase and one of the four fluorescently labeled nucleotides are passed over the array (flow cell) and the images are captured. The residual reagents are washed away, and the step is repeated until the desired length is sequenced (Fig. 11.4). This platform has the potential to revolutionize the field, as it allows a single cell to be probed without any amplification. The recent finding that genomic DNA content and DNA variation differ from cell to cell (Gottlieb et al., 2009) can be probed further using this platform, which was not possible earlier. The removal of the amplification step also drastically reduces the time required for each experiment; currently the platform can sequence an entire human genome in one day, compared to one week for other NGS platforms. The product has only recently been launched, and its possibilities and limitations will become clearer as the technology is used.
However, as the technology depends on fragmentation and uses a single primer for sequencing, the computational issues are likely to be more challenging for this platform.

Nanopore-based sequencing is another novel approach for PCR-free massively parallel sequencing (Branton et al., 2008). This technology exploits the electrical charge of the DNA strand: when a potential difference is applied across a nanometer-scale tube (nanopore), the DNA strand is drawn toward the pore. Each type of nucleotide changes the current flowing through the nanopore in a different way, enabling one to read the order of the nucleotides, that is, the DNA sequence (Fig. 11.5). Although this technology is still under development, once available it should be more powerful than the HELICOS tSMS platform, as it will not require any fluorescent labeling. The technical design of the platform is also likely to provide the longest read lengths in the field. Currently, the biggest challenge is to ensure that every base enters the nanopore as it is cleaved by the exonuclease.

As discussed, multiple platforms for massively parallel sequencing are now available to the research community. Each comes with its pros and cons. A comparison is given in Table 11.2.
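The usable-read fractions quoted earlier in this section can be turned into a rough planning calculation. The sketch below is only an illustration: the 136 kb target and the 30-fold depth are hypothetical inputs, the function name is ours, and the calculation assumes that usable bases spread evenly over the target, which is optimistic for the short-read platforms. Multiplying the rounded percentages from the text gives roughly 42% and 34% usable reads, in line with the 43% and 34% quoted.

```python
# Usable fraction of raw reads, built from the percentages quoted in the text
# (reads passing quality filters x filtered reads that align to the reference).
USABLE_FRACTION = {
    "Illumina Genome Analyzer": 0.55 * 0.77,
    "ABI SOLiD": 0.35 * 0.96,
    "Roche 454": 0.95,  # text gives only the uniquely aligning fraction
}


def raw_bases_needed(target_bp, mean_coverage, usable_fraction):
    """Raw bases to generate so that the usable bases reach the desired mean depth.

    Assumes usable bases are spread evenly over the target (a simplification).
    """
    return target_bp * mean_coverage / usable_fraction


target_bp = 136_000   # hypothetical target region size in base pairs
mean_coverage = 30    # hypothetical desired mean depth

for platform, fraction in USABLE_FRACTION.items():
    needed = raw_bases_needed(target_bp, mean_coverage, fraction)
    print(f"{platform}: usable {fraction:.1%}, raw bases needed {needed:,.0f}")
```

In practice the raw yield would be padded further to allow for the nonuniform coverage discussed above.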
Figure 11.4. Single molecule sequencing. a, Genomic DNA is sheared and a poly-A tail is added to the DNA fragments. The fragments are then hybridized to an array of poly-T probes attached to the chip. After hybridization, fluorescently labeled dNTPs are added (different fluorescence for different types of bases; one type at a time) and are incorporated at the relevant positions on the poly-T probe, depending on the sequence of the poly-A-tailed fragment hybridized to that probe. Excess dNTPs are washed away; the fluorescence emitted by each incorporated nucleotide is captured by the detectors, and the fluorescent label is then cleaved from the nucleotide.
11.4 THE LIMITATIONS OF DIRECT GENOME SEQUENCING
At present, the biggest limitation of direct genome sequencing is its cost, which is prohibitive for most investigators. Although technological innovations are rapidly bringing the cost down toward the "magic number" of $1000 per genome, it is still beyond the reach of most individual investigators. One way to overcome this problem is to pool samples (Ingman and Gyllensten, 2009) or to index them. For example, if one needs to sequence a common genomic region in many individuals to detect common at-risk variants for a common phenotype or QTL, multiple samples can be pooled together. Many samples from cases and controls can thus be combined into two pooled samples, one for cases and one for controls. This ensures higher overall coverage of the sequence data and amplifies the signal of variants that are more frequent in a particular pool. However, pooling has several limitations: (1) multiple PCR or LR-PCR reactions from every sample have to work with similar efficiency; (2) unequal pooling of samples will lead to nonuniform representation of the individuals in a pool; and (3) upon detection of variants, one cannot identify the individuals who harbor them, so another round of sequencing or genotyping of the targeted loci is required for each sample.

Another method of sequencing multiple samples on NGS is indexing. Currently, almost all the NGS platforms provide an indexing kit. The kit attaches unique tags to the ends of the universal primers, and the more unique tags are available, the more samples can be multiplexed together. Pooling and indexing are applicable only to targeted resequencing, not to whole-genome sequencing.

The next challenge is to handle the data computationally once the machine churns out a few gigabytes of sequence data in a matter of days. One run on the Genome Analyzer machine from Illumina currently generates more than 5 terabytes of image data, followed by almost 1 terabyte of raw sequence data once the images are analyzed. It is challenging to keep up with the growing need for storage space and for fast processors to analyze such a large amount of data. Finally, identifying the genomic variants of specific interest among a very large pool of background changes is a remarkably daunting task.

Figure 11.5. Nanopore-based sequencing technology. Genomic DNA is made to pass through a nanopore immersed in a conducting fluid, and a potential (voltage) is applied across it. The electric current generated by conduction of ions through the nanopore is assayed. As individual nucleotides pass through the nanopore, each nucleotide obstructs the nanopore to a different, characteristic degree, thereby varying the amount of current that passes through the nanopore at any given moment. The change in the current through the nanopore represents a direct reading of the DNA sequence. Labels in the figure: DNA strand, nanopore, electrodes; measurement of the alteration in the magnitude of current with the passage of each nucleotide through the pore.

TABLE 11.2. Sequencing Methods for Direct Genome Sequencing

Approach: Sanger sequencing
Merits:
• Highest specificity and sensitivity; high coverage not needed
• Maximum read length (∼1 kb)
Limitations:
• Lowest throughput
• Sequence quality depends on the nucleotide composition

Approach: Emulsion PCR based pyrosequencing
Merits:
• Liquid-phase emulsion PCR for high throughput
• Larger sequence read length (400 bp)
• Most suitable for de novo sequencing of small genomes (metagenomics)
• Highest fraction (∼95%) of usable data
Limitations:
• High cost of consumables
• Lower quantity of sequence per run (∼500 Mb) compared to other next-generation sequencing platforms

Approach: Polymerase-based sequencing by synthesis
Merits:
• High throughput (>50 Gb of sequence per experiment)
• Comparatively low consumable cost
• Robust workflow ensuring fewer failures
Limitations:
• Small read length (75 nt)
• Nonuniform coverage
• Only 43% of raw data is usable

Approach: Ligation-based parallel sequencing
Merits:
• High throughput (∼100 Gb of sequence per experiment)
• Uses proprietary two-base recognition method for high accuracy
Limitations:
• Only 34% of raw data is usable
• Comparatively smaller read length

Approach: Single molecule sequencing
Merits:
• Sequence from single molecule, amplification free
• Enables researchers to analyze cell-to-cell differences in genomic composition
• Should provide high read lengths
Limitations:
• Latest entry in the field, needs more data to compare
• Computationally challenging to analyze

In short, we have come a long way since the discovery of DNA sequencing. Powered by technological innovations and computational capacity building,
sequencing technology has accelerated the pace of learning about the information embedded in the genome sequence. Direct rapid sequencing of large regions of the genome has largely replaced the painstaking fine physical mapping once needed to narrow down the critical region for a trait of interest until it was amenable to sequencing. Sequencing the genome of a single cell is now a reality. It is likely that a large number of complete genomes will soon be available in the public domain, which will make the reference genome more complete and improve coverage in resequencing for gene discovery. It is tempting to speculate that within a few years we will be beyond the era of $1000-per-genome sequencing and global 1000-genome projects, with exponential growth in our knowledge of biology. Of course, the key challenge remains cost, which will determine the extent to which the scientific community can benefit from innovations in direct genome sequencing.

11.5 REFERENCES
ABI SOLiD: solid.appliedbiosystems.com
HELICOS: www.helicosbio.com
Illumina genome analyzer: www.illumina.com/systems/genome_analyzer.ilmn
Nanopore: www.nanoporetech.com
OMIM: www.ncbi.nlm.nih.gov/Omim
Roche 454: www.454.com
Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA. (2007). Direct selection of human genomic loci by microarray hybridization. Nat Meth 4:903–05.
Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, Jovanovich SB, Krstic PS, Lindsay S, Ling XS, Mastrangelo CH, Meller A, Oliver JS, Pershin YV, Ramsey JM, Riehn R, Soni GV, Tabard-Cossa V, Wanunu M, Wiggin M, Schloss JA. (2008). The potential and challenges of nanopore sequencing. Nat Biotechnol 26:1146–53.
Chavali S, Ghosh S, Bharadwaj D. (2009). Hemophilia B is a quasi-quantitative condition with certain mutations showing phenotypic plasticity. Genomics 94:433–37.
Chavali S, Sharma A, Tabassum R, Bharadwaj D. (2008). Sequence and structural properties of identical mutations with varying phenotypes in human coagulation factor IX. Proteins 73:63–71.
Den Hollander AI, Koenekoop RK, Mohamed MD, Arts HH, Boldt K, Towns KV, Sedmak T, Beer M, Nagel-Wolfrum K, McKibbin M, Dharmaraj S, Lopez I, Ivings L, Williams GA, Springell K, Woods CG, Jafri H, Rashid Y, Strom TM, van der Zwaag B, Gosens I, Kersten FF, van Wijk E, Veltman JA, Zonneveld MN, van Beersum SE, Maumenee IH, Wolfrum U, Cheetham ME, Ueffing M, Cremers FP, Inglehearn CF, Roepman R. (2007). Mutations in LCA5, encoding the ciliary protein lebercilin, cause Leber congenital amaurosis. Nat Genet 39:889–95.
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27:182–89.
Gottlieb B, Chalifour LE, Mitmaker B, Sheiner N, Obrand D, Abraham C, Meilleur M, Sugahara T, Bkaily G, Schweitzer M. (2009). BAK1 gene variation and abdominal aortic aneurysms. Hum Mutat 30:1043–47.
Gupta PK. (2008). Single-molecule DNA sequencing technologies for future genomics research. Trends Biotechnol 26:602–11.
Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, Gallins P, Spencer KL, Kwan SY, Noureddine M, Gilbert JR, Schnetz-Boutaud N, Agarwal A, Postel EA, Pericak-Vance MA. (2005). Complement factor H variant increases the risk of age-related macular degeneration. Science 308:419–21.
Hardenbol P, Banér J, Jain M, Nilsson M, Namsaraev EA, Karlin-Neumann GA, Fakhrai-Rad H, Ronaghi M, Willis TD, Landegren U, Davis RW. (2003). Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 21:673–78.
Hardenbol P, Yu F, Belmont J, Mackenzie J, Bruckner C, Brundage T, Boudreau A, Chow S, Eberle J, Erbilgin A, Falkowski M, Fitzgerald R, Ghose S, Iartchouk O, Jain M, Karlin-Neumann G, Lu X, Miao X, Moore B, Moorhead M, Namsaraev E, Pasternak S, Prakash E, Tran K, Wang Z, Jones HB, Davis RW, Willis TD, Gibbs RA. (2005). Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 15:269–75.
Harismendy O, Frazer K. (2009). Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology. Biotechniques 46:229–31.
Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, Frazer KA. (2009). Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10:R32.
Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR. (2007). Genome-wide in situ exon capture for selective resequencing. Nat Genet 39:1522–27.
Ingman M, Gyllensten U. (2009). SNP frequency estimation using massively parallel sequencing of pooled DNA. Eur J Hum Genet 17:383–86.
Mencía A, Modamio-Høybjør S, Redshaw N, Morín M, Mayo-Merino F, Olavarrieta L, Aguirre LA, del Castillo I, Steel KP, Dalmay T, Moreno F, Moreno-Pelayo MA. (2009). Mutations in the seed region of human miR-96 are responsible for nonsyndromic progressive hearing loss. Nat Genet 41:609–13.
Milos P. (2008). Helicos BioSciences. Pharmacogenomics 9:477–80.
Mukhopadhyay A, Nikopoulos K, Maugeri A, de Brouwer AP, van Nouhuys CE, Boon CJ, Perveen R, Zegers HA, Wittebol-Post D, van den Biesen PR, van der Velde-Visser SD, Brunner HG, Black GC, Hoyng CB, Cremers FP. (2006). Erosive vitreoretinopathy and Wagner disease are caused by intronic mutations in CSPG2/Versican that result in an imbalance of splice variants. Invest Ophthalmol Vis Sci 47:3565–72.
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J. (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272–76.
Nikopoulos K, Gilissen C, Hoischen A, van Nouhuys CE, Boonstra FN, Blokland EA, Arts P, Wieskamp N, Strom TM, Ayuso C, Tilanus MA, Bouwhuis S, Mukhopadhyay A, Scheffer H, Hoefsloot LH, Veltman JA, Cremers FP, Collin RW. (2010). Next-generation sequencing of a 40 Mb linkage interval reveals TSPAN12 mutations in patients with familial exudative vitreoretinopathy. Am J Hum Genet 86:240–47.
Nilsson M, Malmgren H, Samiotaki M, Kwiatkowski M, Chowdhary BP, Landegren U. (1994). Padlock probes: circularizing oligonucleotides for localized DNA detection. Science 265:2085–88.
Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. (2007). Microarray-based genomic selection for high-throughput resequencing. Nat Meth 4:907–09.
Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J. (2007). Multiplex amplification of large sets of human exons. Nat Meth 4:931–36.
Sanger F, Coulson AR. (1978). The use of thin acrylamide gels for DNA sequencing. FEBS Lett 87:107–10.
Sanger F, Nicklen S, Coulson AR. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463–67.
Sanger F, Nicklen S, Coulson AR. (1992). DNA sequencing with chain-terminating inhibitors. Biotechnology 24:104–08.
Tewhey R, Nakano M, Wang X, Pabón-Peña C, Novak B, Giuffre A, Lin E, Happe S, Roberts DN, LeProust EM, Topol EJ, Harismendy O, Frazer KA. (2009a). Enrichment of sequencing targets from the human genome by solution hybridization. Genome Biol 10:R116.
Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, Kotsopoulos SK, Samuels ML, Hutchison JB, Larson JW, Topol EJ, Weiner MP, Harismendy O, Olson J, Link DR, Frazer KA. (2009b). Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 27:1025–31.
Yeager M, Xiao N, Hayes RB, Bouffard P, Desany B, Burdett L, Orr N, Matthews C, Qi L, Crenshaw A, Markovic Z, Fredrikson KM, Jacobs KB, Amundadottir L, Jarvie TP, Hunter DJ, Hoover R, Thomas G, Harkins TT, Chanock SJ. (2008). Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum Genet 124:161–70.
CHAPTER 12
Candidate Screening through Bioinformatics Tools SONG WU and WEI ZHAO
Contents
12.1 Introduction
12.2 Computing Environment: R and Bioconductor
12.3 Bioinformatic Databases
12.3.1 Literature Database: PubMed
12.3.2 Biological Ontology Databases
12.3.3 Protein–Protein Interaction Databases
12.4 Bayesian Network to Analyze Expression Data: NATbox
12.5 Weighted Gene Co-Expression Network Analysis
12.5.1 Generation of Weighted Gene Co-Expression Network
12.5.2 Detection of Modules
12.5.3 Define Measures of Gene Significance and Module Relevance
12.5.4 Functional Enrichment Studies of Gene Modules
12.5.5 Relating Intramodular Connectivity to Gene Significance
12.5.6 Network-Based Screening Strategy
12.5.7 Brain Tumor Example
12.6 In Silico Screening of Candidate Genes
12.6.1 Input Gene List Preparation
12.6.2 Gene Set Enrichment Analysis
12.6.3 Protein–Protein Interaction Network Analysis
12.6.4 PID Example
12.6.5 Other Bioinformatics Tools
12.7 Future Directions
12.8 Questions
12.9 Acknowledgments
12.10 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
12.1 INTRODUCTION
With the rapid development of DNA sequencing technologies, whole genome sequences have become available for many species. Accompanying this, genomewide high-throughput experiments, such as gene expression assays and single-nucleotide polymorphism (SNP) arrays, have been developed and industrialized. The ability to do large-scale screening with these assays makes them popular among researchers who are searching for disease genes. In most cases, the result of a genomewide array experiment is a list of hundreds or even thousands of significant genes. Besides the assay experiments, traditional marker-based linkage analysis is another disease gene hunting method, in which quantitative trait loci associated with disease traits can be identified. A quantitative trait locus usually corresponds to a large genomic region that contains several hundred genes. Thus, in both high-throughput assays and linkage analyses, the causal genes underlying a disease frequently hide within a large set of genes, and it is a daunting exercise to validate all of them experimentally. Typically, researchers pick a few top genes or cherry-pick a handful of interesting genes for further experiments. However, by retrieving and integrating information from multiple bioinformatic databases, better strategies may be applied to prioritize the resulting genes. In this chapter, we review several bioinformatics tools that explore the gene structures within a long list of significant genes to generate a short list of candidate genes. Because candidate gene screening by bioinformatics tools is essentially a gene prioritization process, the terms candidate gene screening and gene prioritization are used interchangeably for ease of presentation.

Generally speaking, two types of gene prioritization analyses can be done. One is data-driven network analysis, which aims to infer the structure of the gene regulation process from assay data (Friedman, 2000; Zhao, 2006); the other is information-driven analysis, which aims to retrieve and integrate biological knowledge from multiple databases to reveal the gene relationships (Sun et al., 2009; Ortutay et al., 2009). For the data-driven analysis, we focus on gene expression experiments, in which the data reveal not only differentially expressed genes but also their coexpression patterns, that is, the gene correlations. Based on these, it is possible to query the interactions between genes and form a gene network from their interconnectedness. We dedicate two sections to demonstrating some bioinformatics tools for network analysis. For the information-driven analysis, we focus on how to generate a small list of justifiable disease candidate genes solely from bioinformatic resources. The main idea is that perturbation of genes involved in the same pathway or biological process important for a disease will produce the same or very similar disease phenotypes. We describe in detail how this can be done.
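The idea of forming a gene network from coexpression can be illustrated with a minimal sketch. It is not one of the methods reviewed in this chapter: it simply links two genes when the absolute Pearson correlation of their expression profiles reaches a hard threshold. The gene names, expression values, threshold, and function names are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold):
    """Link every gene pair whose absolute correlation reaches the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical expression values for four made-up genes across five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],
    "geneC": [5.0, 4.1, 2.9, 2.2, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}

for edge in coexpression_edges(profiles, threshold=0.9):
    print(edge)
```

With these toy values, geneA, geneB, and geneC end up connected to one another (geneC through strong negative correlation), while geneD stays unconnected.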
12.2 COMPUTING ENVIRONMENT: R AND BIOCONDUCTOR
R is a language and environment that provides a wide variety of statistical computing and graphics techniques. It is free software and can be downloaded from http://cran.r-project.org. The R environment has many notable features, one of which is its extensibility through add-on packages that can be easily installed. Packages contributed by developers from all areas of statistics research have greatly enriched the choices available and have benefited biological and medical researchers by providing good-quality analyses. In this sense, R can be viewed as an integrated suite of software facilities. Owing to its flexibility and data manipulation capacity, R is becoming one of the most widely used tools for bioinformatics. Bioconductor is an R-based open source and open development software project that specializes in tools for the analysis and comprehension of genomic data (http://www.bioconductor.org). The functional scope of Bioconductor includes the analysis of almost all types of genomic data, such as DNA microarray, serial analysis of gene expression, sequence, and SNP data. All analysis packages in Bioconductor are distributed as R packages and are compatible with the R environment. Bioconductor also includes many up-to-date data packages for easy annotation. Most analysis software tools discussed in this chapter are implemented in R/Bioconductor and are freely available. Anyone who is not familiar with R or is interested in learning more about its applications in bioinformatics can read Gentleman (2008).
12.3 BIOINFORMATIC DATABASES
Bioinformatic databases serve as the arsenal for bioinformatics analyses, so before going further we introduce the relevant resources. Because a huge number of databases exist and it is impossible to introduce them all, only those closely related to the material discussed in this chapter are reviewed.

12.3.1 Literature Database: PubMed
PubMed is a service of the U.S. National Library of Medicine (USNLM) that currently includes more than 19 million citations from MEDLINE and other life science journals for biomedical articles (www.ncbi.nlm.nih.gov/pubmed). It is the single largest literature resource online. For the average researcher, hands-on PubMed usage might just mean searching key words in PubMed and trying to read all the related abstracts. This exercise was feasible a decade ago, when the body of literature was relatively small. However, as the number of articles grows exponentially each year, it is no longer practical to read every detail within all the relevant literature. More efficient methods of retrieving information from articles are needed. Recognizing this, USNLM has developed tools and data files that facilitate fast information extraction by annotating published articles; these can be found on the NCBI repositories site (ftp://ftp.ncbi.nlm.nih.gov). For example, the file gene2pubmed under the gene directory and DATA subdirectory links genes to PubMed IDs, which provides fast conversion between genes and PubMed articles.

12.3.2 Biological Ontology Databases
Ontology is the science of what is (Smith, 2001). An ontology is a rigorous and exhaustive organization of a knowledge domain that is usually hierarchical and contains all the relevant entities and their relations. From a practical view, biological ontologies provide deeper and more robust representations of the biological domains about which we wish to reason and solve problems. Here we discuss two relevant ontology databases.

12.3.2.1 Gene Ontology
Gene ontology (GO) provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data (www.geneontology.org). It consists of three independent areas: biological process, molecular function, and cellular component. The terms in GO can be structured as a graph, with terms as nodes and the relations between the terms as arcs (Fig. 12.1). The relations between GO terms are also categorized and defined; they include is_a (is a subtype of), part_of, has_part, regulates, negatively_regulates, and positively_regulates. The properties of each relation are specified in the OBO format and can be viewed graphically with an OBO ontology editor (http://oboedit.org) or browsed on the web (http://amigo.geneontology.org). More important, each GO term is functionally annotated with a set of genes, which can be used for functional enrichment analysis.
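The graph view of GO can be made concrete with a small sketch. The edge list below is a hand-written, simplified excerpt loosely modeled on the terms around regulation of DNA repair; it is an assumption for illustration, not the authoritative GO graph, and the function name is ours. The sketch stores each term with its typed relations to parent terms and collects the ancestors reachable through a chosen set of relations.

```python
# Toy GO fragment: child term -> list of (relation, parent term).
# Simplified, hand-written edges for illustration only; the GO release itself
# is the authoritative source for terms and relations.
GO_EDGES = {
    "nucleotide-excision repair": [("is_a", "DNA repair")],
    "DNA repair": [("is_a", "DNA metabolic process")],
    "regulation of DNA repair": [
        ("regulates", "DNA repair"),
        ("is_a", "regulation of cellular response to stress"),
    ],
    "regulation of cellular response to stress": [
        ("is_a", "regulation of cellular process"),
    ],
}


def ancestors(term, edges, relations=("is_a",)):
    """Return every term reachable from `term` by following the given relations."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in edges.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nucleotide-excision repair", GO_EDGES)))
print(sorted(ancestors("regulation of DNA repair", GO_EDGES,
                       relations=("is_a", "regulates"))))
```

Restricting the traversal to is_a keeps regulation of DNA repair separate from DNA repair itself; adding regulates to the allowed relations connects the two branches.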
The gene annotations to GO terms can be found on the GO website or obtained in a cleaner format from BioMart (http://www.biomart.org), a generic query-oriented data management system developed jointly by the Ontario Institute for Cancer Research and the European Bioinformatics Institute. The GO database has become so important that most candidate gene screening algorithms rely on this information in some way.

Figure 12.1. A graphical example for the GO term regulation of DNA repair, showing the hierarchical structure of the terms. I, is_a relationship; R, regulates relationship. Terms shown: cellular component, macromolecular complex, protein complex, PCNA complex; biological process, cellular process, cellular metabolic process, nucleobase, nucleoside, nucleotide, and nucleic acid metabolic process, DNA metabolic process, DNA repair, nucleotide-excision repair; biological regulation, regulation of biological process, regulation of cellular process, regulation of cellular response to stress, regulation of DNA repair; molecular function.

12.3.2.2 Medical Subject Headings
Medical Subject Headings (MeSH) is the USNLM's controlled vocabulary thesaurus used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts (www.ncbi.nlm.nih.gov/mesh). Like GO, MeSH comprises more than one ontology; it has 16 areas. However, for disease candidate gene screening, MeSH C (Diseases) and MeSH D (Chemicals and Drugs) contain the more relevant terms. These MeSH terms are very useful when combined with literature searches to generate candidate gene lists. Some software, such as G2D (Genes to Diseases, www.ogic.ca/projects/g2d_2), uses MeSH terms for candidate gene prioritization.

12.3.3 Protein–Protein Interaction Databases
Protein–protein interactions (PPIs) are essential to all biological processes. Over the past few years, the number of known PPIs has grown at a substantial pace, owing either to direct experimental evidence or to in silico evidence derived from a deeper understanding of PPI mechanisms. Many protein interaction repositories have been built to store PPI knowledge and are widely used for investigating molecular networks or pathways. There are six major PPI databases: the Human Protein Reference Database (HPRD), the Biomolecular Interaction Network Database (BIND), the Biological General Repository for Interaction Datasets (BioGRID), the Molecular INTeraction database (MINT), the Database of Interacting Proteins (DIP), and the IntAct molecular interaction database (IntAct). Each differs in scope and content. If possible, it is better to combine the interactions from all of them for PPI analysis. In practice, however, it is also acceptable to use only HPRD (http://www.hprd.org), since it alone contains about 80% of interactions (Mathivanan et al., 2006).
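Combining interactions from several repositories amounts to taking the union of their edge sets after each interaction has been made order-independent. The sketch below assumes that protein identifiers have already been mapped to a common naming scheme, which is usually the harder step in practice; the protein labels and the two toy lists are hypothetical and do not reflect the real content or export format of any database.

```python
def normalize(pairs):
    """Treat each interaction as unordered, so (A, B) and (B, A) are one edge."""
    return {frozenset(pair) for pair in pairs}


# Hypothetical interaction lists using made-up protein labels.
source_one = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
source_two = [("P2", "P1"), ("P4", "P5"), ("P5", "P6")]

edges_one = normalize(source_one)
edges_two = normalize(source_two)

combined = edges_one | edges_two          # union of both sources
shared = edges_one & edges_two            # interactions reported by both
share_of_one = len(edges_one) / len(combined)

print(f"combined: {len(combined)} interactions, shared: {len(shared)}")
print(f"fraction of combined set covered by the first source: {share_of_one:.0%}")
```

The same coverage fraction, computed on real data, is the kind of figure that would justify relying on a single database when combining all of them is impractical.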
12.4 BAYESIAN NETWORK TO ANALYZE EXPRESSION DATA: NATBOX
A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional independencies. Each gene is a variable in the network, which is essentially a directed acyclic graph (DAG) based on the causal relationships among genes (e.g., upregulation of a transcription factor promotes the expression of its downstream regulatory genes). Bayesian networks have many advantages in modeling gene expression networks: (1) they explicitly relate the DAG model of the causal relations among the gene expression levels to a statistical analysis; (2) they have broad applications and include linear models, nonlinear models, Boolean networks, and hidden Markov models as special cases; (3) there are already well-developed algorithms for searching for Bayesian networks from observational data; (4) they allow for the introduction of a stochastic element and hidden variables; and (5) they allow explicit modeling of the process by which the data are collected (Spirtes et al., 2000). Bayesian networks assume that a variable is independent of its nondescendants, given its parents in the network. This conditional independence assumption allows the joint distribution of the network to be decomposed. Figure 12.2 is a simple example given by Friedman (2000) that demonstrates this. The joint probability distribution of A, B, C, D, and E can be decomposed as $P(A, B, C, D, E) = P(A)\,P(B \mid A, E)\,P(C \mid B)\,P(D \mid A)\,P(E)$. Learning the network structure from observed data is an NP-hard problem (Chickering, 1996). Many search algorithms have been proposed (Margaritis, 2003; Tsamardinos et al., 2003, 2006; Yaramakala and Margaritis, 2005). Currently, two R packages perform Bayesian network analysis, BNArray (Chen et al., 2006) and NATbox (Chavan et al., 2009), and we found NATbox to be superior.
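The decomposition of the joint distribution can be checked numerically on the five-variable example. In the sketch below every variable is binary and all probability values are made up for illustration; only the factorization structure (A and E without parents, B depending on A and E, C on B, and D on A) comes from the text.

```python
from itertools import product

# Hypothetical probability tables for binary variables (1 = "on", 0 = "off").
P_A = {1: 0.3, 0: 0.7}
P_E = {1: 0.6, 0: 0.4}
P_B1_GIVEN_AE = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(B=1 | A, E)
P_C1_GIVEN_B = {0: 0.2, 1: 0.8}                                       # P(C=1 | B)
P_D1_GIVEN_A = {0: 0.3, 1: 0.7}                                       # P(D=1 | A)


def bernoulli(p_one, value):
    """Probability of `value` for a binary variable with P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A, B, C, D, E) = P(A) P(B | A, E) P(C | B) P(D | A) P(E)."""
    return (P_A[a]
            * bernoulli(P_B1_GIVEN_AE[(a, e)], b)
            * bernoulli(P_C1_GIVEN_B[b], c)
            * bernoulli(P_D1_GIVEN_A[a], d)
            * P_E[e])


# The factorized joint sums to 1 over all 2**5 states.
total = sum(joint(*state) for state in product((0, 1), repeat=5))

# A and E are marginally independent: P(A=1, E=1) equals P(A=1) * P(E=1).
p_a1_e1 = sum(joint(1, b, c, d, 1) for b, c, d in product((0, 1), repeat=3))

print(round(total, 10), round(p_a1_e1, 10), P_A[1] * P_E[1])
```

For binary variables the factorized form needs 10 free parameters (1 + 1 + 4 + 2 + 2), compared with 31 for an unrestricted joint distribution over five binary variables, which is one reason the decomposition makes learning from limited expression data more tractable.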
Figure 12.2. A simple example of a Bayesian network. The network can be decomposed as follows: (1) A and E are independent; (2) B and D are independent, given A and E; (3) C is independent of A, D, and E, given B; (4) D is independent of B, C, and E, given A; (5) E is independent of A and D.

NATbox is a menu-driven graphical user interface (GUI) implemented in R for modeling and analyzing functional relationships in gene expression data (Chavan et al., 2009). The input data should be saved as a tab-delimited *.txt file in which each column represents a gene; row names are not allowed. All functions are accessible with a simple click; no command needs to be entered once the software is running. This gives NATbox an advantage over BNArray, because researchers with little programming experience can use it easily. NATbox also provides more search algorithms for optimizing Bayesian networks than BNArray, which offers only two. The backbone of the software is bnlearn, an R package developed by Marco Scutari. NATbox calls bnlearn functions through its GUI to conduct network searches and draw network plots. Given an adjacency matrix, NATbox can also draw a network plot and perform network analysis through its Social Network Analysis tool. However, the tool cannot label genes by their names and does not allow users to interact with the network graphs. Stand-alone software, such as VisANT, or more sophisticated R packages, such as graph, RBGL, and Rgraphviz, can be used to draw network plots for those who are interested in a particular network structure.

As an example, we constructed a new Bayesian network of 17 DNA repair genes using NATbox's GS (Grow-Shrink) algorithm (Figures 12.3 and 12.4). This list of genes originally came from a yeast experiment that studied genes involved in cell cycle regulation (Spellman et al., 1998). The data include 77 yeast gene expression microarrays and around 6200 genes. The 17 DNA repair genes were selected from among 799 differentially expressed genes by the authors of BNArray as a tutorial example for their software. The input data can be derived using the code in Box 12.1. The network is markedly different from the one constructed using BNArray: the BNArray network has far more edges than the NATbox network, and the directions of many edges also differ. We also compared the networks constructed by the two programs using the example data found at www.bnlearn.com. Both programs work well with simple examples, but BNArray is less useful for complicated networks. Not only does it generate more edges, but some of the edges are also in the wrong direction. For that reason, we recommend NATbox for Bayesian network analysis.
Figure 12.3. A screen shot of the NATbox GUI.
Despite all their advantages, Bayesian networks have pitfalls as well. First, they require a larger sample size than most microarray studies can afford. Simulation studies inferring DAG structure from sample data indicate that even for relatively sparse graphs, sample sizes of several hundred are required for high accuracy. Second, learning a Bayesian network becomes intractable for large networks. The relationship between two genes A and B has four possibilities: A causes B, B causes A, A and B mutually cause each other, and A and B have no causal relationship. For $n$ genes, the number of possible structures therefore grows at least as fast as $4^n$, which is astronomical even for a small network of 100 genes. Although search algorithms, such as genetic algorithms and the PC algorithm, have been developed to make global searches possible, identifying all causal relationships among genes even in a small network remains an ordeal (Aten et al., 2008). Third, because of the nature of the Bayesian network, two networks constructed from the same data may differ if the program is run at different times. It is important to be aware of these pitfalls and to interpret the results cautiously.
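To give a sense of scale for the second pitfall, the sketch below works on the log10 scale in base R. It rests on a simplifying assumption of ours: each unordered pair of genes is assigned one of the four relationships independently, and the acyclicity constraint is ignored. Under that assumption, n genes have n(n-1)/2 pairs and hence 4^(n(n-1)/2) configurations, of which 4^n is a lower bound for n of 3 or more.

```r
n <- 100                               # number of genes in a "small" network
pairs <- n * (n - 1) / 2               # number of unordered gene pairs

# Work with log10 because the counts themselves overflow double precision
log10_4n       <- n * log10(4)         # log10 of 4^n
log10_pairwise <- pairs * log10(4)     # log10 of 4^(n(n-1)/2)

cat("4^n has about", round(log10_4n), "decimal digits\n")
cat("4^(n(n-1)/2) has about", floor(log10_pairwise), "decimal digits\n")
```

Either count rules out exhaustive enumeration, which is why heuristic search algorithms are used in practice.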
Figure 12.4. A screen shot of the NATbox network analysis of the DNA repair gene network. Node labels in the plot: YOR033C, YDL101C, YOL090W, YNL312W, YDR097C, YER095W, YNL082W, YGL021W, YML061C, YGL163C, YML060W, YML021C, YIL066C, YLR288C, YKL113C, YLR032W, YLR383W.

BOX 12.1. Code to Retrieve DNA Repair Gene Expression Data
> library(BNArray)
> data(total.data)
> attach(total.data)
> ori.compact = LLSimpute(total.data$df.all, total.data$df.ori, total.data$n.changed)
> ori.compact = FinalImpute(ori.compact)
> bn.data = PrepareCompData(ori.compact)
> # the names of the DNA repair genes can be found at http://www.cls.zju.edu.cn/binfo/BNArray/
> dnarepair = read.csv(file = "DNA repair.csv", header = FALSE)
> bn.data = bn.data[, as.character(unlist(dnarepair))]

12.5 WEIGHTED GENE CO-EXPRESSION NETWORK ANALYSIS

Weighted gene co-expression network analysis (WGCNA) is based on the concept of a scale-free network. Metabolic networks in all organisms have been suggested to be scale-free networks, and scale-free network phenomena have been observed in many empirical studies (Zhang and Horvath, 2005; Dong and Horvath, 2007; Horvath et al., 2006; Carlson et al., 2006; Gargalovic et al., 2006). In a scale-free network, the connectivity distribution $p(k)$ follows a power law, $p(k) \sim k^{-\gamma}$ (Ravasz et al., 2002; Barabasi and Albert, 1999). One key feature of a scale-free network is the existence of a few highly connected hub nodes that participate in a very large number of metabolic reactions. With their large number of links, these hubs integrate all substrates into a single web. Scale-free networks have been shown to be robust against accidental failures but vulnerable to coordinated attacks (Albert et al., 2000). Identifying the hub molecules involved in certain diseases could lead to new drugs that target those hubs (Barabasi and Bonabeau, 2003). Thus knowing the connectivity of genes helps prioritize the candidate genes (Zhao et al., 2006).

12.5.1 Generation of Weighted Gene Co-Expression Network

Weighted network construction was performed using R as described by Zhang and Horvath (2005). Briefly, the absolute value of the Pearson correlation coefficient was calculated for all pairwise comparisons of gene expression values across all microarray samples. The Pearson correlation matrix was then transformed into an adjacency matrix A, that is, a matrix of connection strengths, using a power function. Thus the connection strength $a_{ij}$ between gene expression profiles $x_i$ and $x_j$ is defined as $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$. The power $\beta$ is chosen large enough that the resulting network exhibits approximate scale-free topology. The network connectivity $k_i$ of the $i$th gene expression profile $x_i$ is the sum of the connection strengths with all other genes in the network:

$$k_i = \sum_{j=1}^{N} a_{ij}.$$
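The two definitions above translate directly into a few lines of base R. This is a minimal sketch on simulated data, not the WGCNA package itself; the expression matrix, the gene names, and the value of the power (beta = 6) are assumptions made for illustration. The diagonal is set to zero so that each connectivity sums over all other genes only.

```r
set.seed(1)
# Hypothetical expression matrix: 20 samples (rows) by 6 genes (columns)
expr <- matrix(rnorm(20 * 6), nrow = 20,
               dimnames = list(NULL, paste0("gene", 1:6)))

beta <- 6                        # soft-thresholding power (assumed value)
adj  <- abs(cor(expr))^beta      # a_ij = |cor(x_i, x_j)|^beta
diag(adj) <- 0                   # exclude self-connections
k <- rowSums(adj)                # connectivity k_i of each gene

print(round(k, 3))
```

Raising the absolute correlation to a power keeps every pair connected but shrinks weak correlations toward zero, so the connectivity is dominated by strong co-expression partners.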
12.5.2 Detection of Modules

The next step in network construction is to identify groups of genes with similar patterns of connection strengths with all other genes in the network. The topological overlap matrix (Ravasz et al., 2002; Yip et al., 2007; Zhang and Horvath, 2005) is used as a measure of gene similarity. This amounts to defining a module as a set of highly co-expressed genes. A pair of genes is said to have high topological overlap if they are both strongly connected to the same group of genes. The use of topological overlap thus serves as a filter to exclude spurious or isolated connections during network construction. After the topological overlap has been calculated for all pairs of genes in the network, this information is used in conjunction with a hierarchical clustering algorithm to identify groups, or modules, of densely interconnected genes. In the resulting dendrogram, discrete branches of the tree correspond to modules of co-expressed genes (Fig. 12.5a). After modules of co-expressed genes have been identified, each module in effect becomes a subnetwork, and a new measure of connectivity, intramodular connectivity, is defined as the sum of a gene's connection strengths with all other genes in its module.

Figure 12.5. Brain cancer network results. a, Average linkage hierarchical clustering tree colored by modules (first color band) and high/low gene significance (white/black) in the second color band. Note that the brown module is enriched with significant genes. b, Scatterplot between intramodular connectivity (x-axis) and gene significance (y-axis) in the brown module. c, Proportion of significant genes in the test set data as a function of different sizes of gene lists. Green and red bars correspond to network screening and gene significance screening, respectively.
[Panel labels: (a) vertical axis "Standard TOM Measure", color bands "Colored by GTOM1 modules" and "Colored by GeneSignificance"; (b) title "brown, cor = 0.58", x-axis "Connectivity", y-axis "GeneSignificance"; (c) x-axis "Size", y-axis "mean (P < 0.05)".]

12.5.3 Define Measures of Gene Significance and Module Relevance

Based on the clinical outcome y, the gene significance of the $i$th gene expression profile $x_i$ is defined as the absolute value of the Student t-test statistic for testing differential expression between cases and controls,

$$GS_i = \mathrm{GeneSignificance}(x_i) = |t\text{-test}_i|.$$

Depending on the type of clinical outcome, the gene significance can also be defined as an F-test statistic, a Pearson correlation, $-\log_{10}$ of a Cox regression p value, or another reasonable statistic. An important step in gene network analysis is to study the biological or clinical relevance of network modules. To relate gene modules to a clinical trait y, it is natural to make use of the gene significance measure. Specifically, we define a measure of module relevance as the mean gene significance in the $q$th module,

$$\mathrm{ModuleRelevance}_q = \frac{\sum_{i=1}^{n_q} GS_i}{n_q},$$

where $i$ indexes the genes in the $q$th module and $n_q$ equals the $q$th module size. By considering the module relevance measure in our applications, we find that certain modules can be singled out as being enriched with genes that are differentially expressed between cases and controls.

12.5.4 Functional Enrichment Studies of Gene Modules

Identification of biologically plausible modules that are relevant for the clinical outcome y is an important step toward our main goal: finding the important genes within these relevant modules. In real data applications, one would certainly want to study the functional enrichment (gene ontology information) of gene modules and study the expression profiles of the module genes in other tissues to further elucidate the meaning of the identified modules. Available software for functional enrichment analysis includes topGO (discussed later), EASE (http://david.niaid.nih.gov/david/ease/ease.jsp), Ingenuity, GeneGo, and others. In practice, important complementary information may help with selection of the biologically most plausible module.

12.5.5 Relating Intramodular Connectivity to Gene Significance

Highly connected hub genes are far more likely than nonhub genes to be essential for survival (Giaever et al., 2002; Han et al., 2004; Winzeler et al., 1999). Therefore, we hypothesize that hub genes may also be more significant according to the gene significance measure. Empirically, we find that this intuition holds in relevant modules and usually does not hold in nonrelevant modules.

12.5.6 Network-Based Screening Strategy
The fact that intramodular connectivity is significantly correlated with gene significance in a relevant module suggests that intramodular connectivity can be used to obtain complementary information for finding prognostic genes. To select genes based on a gene significance measure and connectivity, we propose the following network-based gene screening strategy:
• Input S, the number of genes that should be selected.
• Define a gene significance measure based on the clinical outcome of interest—for example, the absolute value of the t-test statistic for testing differential expression.
• Construct a weighted gene co-expression network.
• Identify modules of highly co-expressed (correlated) genes.
• Identify relevant modules based on the module relevance measure (see equation in section 12.5.3).
• Within the relevant modules, select S genes with high gene significance and high intramodular connectivity.
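The six steps of the screening strategy can be sketched end to end in base R on simulated data. This is an illustrative sketch under several assumptions of ours, not the published WGCNA implementation: the data are simulated with one outcome-related and one unrelated module; modules are found by average-linkage clustering on 1 minus the adjacency rather than on the topological overlap matrix; the tree is cut into k = 2 modules; and "high gene significance and high intramodular connectivity" is implemented as the smallest sum of the two ranks.

```r
set.seed(42)
n  <- 40
y  <- rep(c(0, 1), each = n / 2)                # hypothetical control/case labels
f1 <- 2 * y + rnorm(n)                          # latent factor related to the outcome
f2 <- rnorm(n)                                  # latent factor unrelated to the outcome
make_module <- function(f, size) sapply(seq_len(size), function(i) f + rnorm(n, sd = 0.5))
expr <- cbind(make_module(f1, 10), make_module(f2, 10))
colnames(expr) <- paste0("g", 1:20)

S  <- 3                                                        # step 1: number of genes to select
GS <- apply(expr, 2, function(x)                               # step 2: |t statistic| per gene
  abs(unname(t.test(x[y == 1], x[y == 0])$statistic)))
adj <- abs(cor(expr))^6; diag(adj) <- 0                        # step 3: weighted network
tree   <- hclust(as.dist(1 - adj), method = "average")         # step 4: modules (simplified)
module <- cutree(tree, k = 2)
relevance <- tapply(GS, module, mean)                          # step 5: module relevance
best    <- as.integer(names(which.max(relevance)))
in_best <- module == best
k_in <- rowSums(adj[in_best, in_best])                         # intramodular connectivity
score <- rank(-GS[in_best]) + rank(-k_in)                      # step 6: combine the two ranks
selected <- names(sort(score))[1:S]
print(selected)
```

With real data, the clustering step would normally use the topological overlap dissimilarity described in section 12.5.2, and the number of modules would be determined from the dendrogram rather than fixed in advance.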
In general, since the number of genes selected for Bayesian network analysis is much smaller than that for WGCNA, the genes for Bayesian network analysis can be treated as a single gene module. Once the network is constructed, the connectivity (or degree) of each gene can also be calculated. Thus the network-based screening strategy applies to Bayesian networks as well.

12.5.7 Brain Tumor Example

The network-based gene screening method was applied to a brain cancer study. Dataset 1 consisted of 55 glioblastomas and served as the training set, and dataset 2 consisted of 65 independent glioblastoma samples and served as the validation set. Expression of 22,215 probe sets (15,005 unique transcripts) was measured using Affymetrix HG-U133A microarrays. The absolute value of the Pearson correlation between expression profiles of all pairs of genes was determined for the 8,000 most varying nonredundant transcripts. Since module identification is computationally intensive, only the 3,600 most connected genes were considered for module detection. Since module genes tend to have high connectivity, this step does not lead to a big loss of information (www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ASPMgene). The gene significance of a gene is defined as $-\log_{10}$ of its univariate Cox regression p value. Thus the gene significance is roughly proportional to the number of leading zeros in the Cox p value. The network-based gene screening method, which incorporates connectivity information, is hypothesized to be more likely to identify genes that validate in an independent dataset than traditional screening methods that ignore network connectivity. The validation success rate of a candidate set of prognostic genes is defined as the proportion of those genes for which the Cox regression p value in an independent dataset is smaller than 0.05. Network-based screening methods have significantly higher validation success rates than the traditional method (Fig. 12.5c).

The code to perform WGCNA has been developed into publicly accessible R packages. For examples and tutorials, visit www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork.
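The gene significance and the validation success rate used in the brain tumor example are simple to compute once the Cox p values are available. The sketch below uses made-up p values for four hypothetical genes (geneA to geneD); no Cox model is fitted here, and the helper name validation_rate is ours.

```r
# Proportion of candidate genes whose p value in the validation set is below alpha
validation_rate <- function(p_validation, alpha = 0.05) mean(p_validation < alpha)

# Hypothetical univariate Cox p values in the training and validation sets
p_train <- c(geneA = 1e-4, geneB = 0.003, geneC = 0.04, geneD = 0.5)
p_valid <- c(geneA = 0.01, geneB = 0.20,  geneC = 0.03, geneD = 0.60)

GS   <- -log10(p_train)                            # gene significance in the training set
top  <- names(sort(GS, decreasing = TRUE))[1:3]    # candidate list of size 3
rate <- validation_rate(p_valid[top])              # validation success rate of that list
print(rate)
```

Repeating the last two lines for different list sizes, and for lists chosen with and without connectivity information, gives a comparison of the kind summarized in Figure 12.5c.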
12.6 IN SILICO SCREENING OF CANDIDATE GENES

In the previous two sections, we discussed data-based gene prioritization. However, researchers sometimes want to explore a disease trait but lack the resources, or are otherwise unable, to perform large-scale screening assays. Several in silico methods that mine public databases have been proposed for this purpose. Given a disease phenotype, investigators can prepare an input list of a few thousand genes that show preliminary evidence of association with the disease, based on prior knowledge or experimental data. Further functional analysis can then be applied to the list to prioritize a handful of genes for validation experiments. In this section, we discuss in detail how to start from a phenotype of interest and develop a short list of candidate genes.
12.6.1 Input Gene List Preparation

There are several ways to build the input list. The first and easiest way is to search the literature. By using the Entrez Programming Utilities (eUtils), a set of tools developed by NCBI to facilitate information retrieval from Entrez data, including PubMed, one can automate the search for gene-disease associations in the literature. A detailed description of how to use eUtils can be found at http://eutils.ncbi.nlm.nih.gov. In short, a fixed URL syntax is used to translate a standard set of input parameters into the values necessary for various NCBI software components to retrieve the requested data. It is easy to call the eUtils from R/Bioconductor to run batch searches. In the following, we describe how to combine ESearch, one of the eUtils, with the annotate package in Bioconductor to retrieve abstracts from PubMed. The key step is to build an appropriate query URL, which consists of the base URL http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term= followed by the search terms. Some special characters in the search term must be replaced by query characters as follows:
Text Char     space    [      ]      "      #
Query Char    +        %5B    %5D    %22    %23
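The character substitutions in the table can be applied programmatically. The following base R sketch defines a small helper, encode_term (a name of our own, not part of eUtils or annotate), that performs exactly the five replacements listed above; the search string is a hypothetical example.

```r
# Replace the special text characters in a PubMed search term by their query characters
encode_term <- function(term) {
  map <- c(" " = "+", "[" = "%5B", "]" = "%5D", "\"" = "%22", "#" = "%23")
  for (ch in names(map)) {
    term <- gsub(ch, map[[ch]], term, fixed = TRUE)   # fixed = TRUE: no regular expressions
  }
  term
}

base.url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
query    <- encode_term('"iron overload" AND (Case Reports[ptyp])')
full.url <- paste0(base.url, query)                   # URL that would be passed to ESearch
print(query)
```

The helper covers only the characters in the table; a term containing other reserved characters would need additional entries in the map.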
For example, to obtain abstracts for all case reports that contain the key words iron overload, the regular input in PubMed would be "iron overload" AND (Case Reports[ptyp]). The implementation in R is shown in Box 12.2. The code can be easily modified for tasks such as detecting co-occurrences of genes and disease phenotypes in the literature. Input genes can also be collected from experimental evidence, including association studies, linkage scans, and gene expression studies. Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo) is a public functional genomics data repository that stores thousands of genomic array datasets. The input list could result from analyses of these experiments with loose criteria, such as a calculated false discovery rate [...]

BOX 12.2
> library(annotate); library(XML)
> base.url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
> my.term = "%22iron+overload%22+AND+(Case+Reports%5Bptyp%5D)"
> # first query: find the total number of matching records
> url.txt = scan(paste(base.url, my.term, sep = ""), what = "")
> # NOTE: the count-extraction lines are garbled in the source; the next line is a reconstruction that
> # assumes the total hit count is the first <Count>...</Count> element of the ESearch XML
> totalcount = sub(".*?<Count>([0-9]+)</Count>.*", "\\1", paste(url.txt, collapse = " "), perl = TRUE)
> # second query: retrieve all PubMed IDs
> url.txt = scan(paste(base.url, my.term, "&retmax=", totalcount, sep = ""), what = "")
> ids = url.txt[grep("<Id>", url.txt)]; ids = gsub("<Id>", "", ids); ids = gsub("</Id>", "", ids)
> # fetch the records and extract the abstracts
> x = pubmed(ids); a = xmlRoot(x); numAbst = length(xmlChildren(a))
> arts = vector("list", length = numAbst)
> absts = rep(NA, numAbst)
> for (i in 1:numAbst) {arts[[i]] = buildPubMedAbst(a[[i]]); absts[i] = abstText(arts[[i]])}
Let $p_0$ = (number of input genes in term $u$)/(number of genes in term $u$) and $p_1$ = (number of input genes not in term $u$)/(number of genes in the gene pool but not in term $u$). The test hypotheses are formulated as $H_0: p_0 = p_1$ against $H_1: p_0 > p_1$. Rejection of the null hypothesis $H_0$ suggests enrichment of the significant gene set in term $u$. Standard tests such as Fisher's exact test or the Kolmogorov-Smirnov test can be applied. Several tools can be used to perform GSEA based on GO terms (Al-Shahrour et al., 2004; Beissbarth and Speed, 2004). However, most of them do not consider the hierarchical structure of the GO terms. In this section, we introduce topGO, a Bioconductor package that takes into account the dependency structure of the GO terms.

Figure 12.6. Enrichment of a significant gene set in GO terms.

12.6.2.1 Software: topGO

topGO is a Bioconductor package developed by Alexa et al. (2006) for scoring functional groups by decorrelating GO graphs. R code for topGO functions can be found at www.koders.com/noncode/fidBD151204CB40891793D227DE8E474F119A9020A7.aspx. The key step in using the topGO package is to create a topGOdata object, for which two inputs are needed: a GO term-to-term structure relationship file and a GO term gene annotation file. The GO structure can be obtained from GO.db, a data package from Bioconductor, so users do not need to prepare this input. However, the gene annotations for GO require some work because different applications have different setups. This input can be prepared from the BioMart database. The code in Box 12.3 describes how to perform GSEA with topGO. Before the code is run, two variables, geneList and gene2GO.goa, need to be constructed. geneList is a factor vector of 0s and 1s with input genes coded as 1. The vector has the same length as the number of genes in the gene pool, and the vector names are the gene names. gene2GO.goa is a list in which each entry is a gene annotated by a set of GO terms. It requires the same gene order as the names of geneList.

The genes in the significant GO terms but not in the original input list are considered to be candidate genes. This process eliminates genes that are unlikely to be associated with the disease of interest. The limitation of the process is that the analysis depends entirely on the GO annotation file, which reflects current knowledge. Functional terms that are less well understood will contain fewer genes. Therefore, results based on GO terms are generally biased toward what is already known. Nevertheless, it is a good exploratory analysis, and further experiments are needed to confirm any findings from it.
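The two input objects and the one-sided test behind the enrichment analysis can be illustrated without topGO. The sketch below uses base R only; the gene pool, the input list, and the term labels GO:TERM_U and GO:OTHER are hypothetical placeholders, not real annotations. It builds geneList and a gene-to-GO list in the shapes described above, then tests $H_0: p_0 = p_1$ against $H_1: p_0 > p_1$ for a single term with Fisher's exact test, ignoring the GO hierarchy that topGO accounts for.

```r
# Hypothetical gene pool of 12 genes, 4 of which form the input list
pool  <- paste0("gene", 1:12)
input <- c("gene1", "gene2", "gene3", "gene7")

# geneList: named factor of 0s and 1s, input genes coded as 1
geneList <- factor(as.integer(pool %in% input), levels = c(0, 1))
names(geneList) <- pool

# gene2GO: one entry per gene, in the same order as geneList (placeholder term labels)
gene2GO <- lapply(setNames(pool, pool), function(g)
  if (g %in% paste0("gene", 1:4)) c("GO:TERM_U", "GO:OTHER") else "GO:OTHER")

# 2 x 2 table for term u: membership in the term versus membership in the input list
term    <- "GO:TERM_U"
in_term <- sapply(gene2GO, function(go) term %in% go)
tab <- table(in_term = factor(in_term, levels = c(TRUE, FALSE)),
             input   = factor(geneList == 1, levels = c(TRUE, FALSE)))

p0 <- tab["TRUE", "TRUE"]  / sum(tab["TRUE", ])    # input genes in term / genes in term
p1 <- tab["FALSE", "TRUE"] / sum(tab["FALSE", ])   # input genes not in term / genes not in term
res <- fisher.test(tab, alternative = "greater")   # one-sided test of enrichment
print(res$p.value)
```

In this toy example three of the four genes annotated to the term are input genes, against one of the eight genes outside it, so $p_0$ exceeds $p_1$, although with only 12 genes the evidence remains weak.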
BOX 12.3. Sample Code for Gene Set Enrichment Analysis by topGO
> library(topGO)
> head(geneList)
44M2.3   A1BG   A1CF A1IGU5 A1L167 A1L4H1
     0      0      1      0      0      0
> head(gene2GO.goa)
$`1`
[1] "GO:0000166" "GO:0003723" "GO:0004527" "GO:0005622" "GO:0005730" "GO:0016787"
$`2`
[1] "GO:0003674" "GO:0005576" "GO:0008150"
> # create a topGO object
> GOdata = new("topGOdata", ontology = "MF", allGenes = geneList, annot = annFUN.gene2GO,
+              gene2GO = gene2GO.goa)
> # classic test without considering the correlation among GO terms
> test.stat = new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")
> resultFis = getSigGroups(GOdata, test.stat)
> # weighted test with consideration of the correlation among GO terms
> test.stat = new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio =
+              "ratio")
> resultWeight = getSigGroups(GOdata, test.stat)
> # results summary and plotting the significant GO graphs
> l = list(classic = score(resultFis), weight = score(resultWeight))
> allRes = genTable(GOdata, l, orderBy = "weight", ranksOf = "classic", top = 20)
> showSigOfNodes(GOdata, score(resultWeight), firstTerms = 5, useInfo = "all")
12.6.3 Protein–Protein Interaction Network Analysis
PPIs are important for virtually every aspect of cellular function. Although small diagrams of PPIs are commonly seen, the whole PPI network is hard to visualize, and its complexity makes it difficult to generate and analyze. Network analysis is one way to approach this problem. Network measures capture certain notions of the importance of a gene within a network that may be related to a disease trait; genes that score highly have a good chance of being related to the disease. The hypothesis is that a change of gene function in the hub nodes (genes) causes greater instability of the whole network, so working on these genes has a higher chance of yielding experimentally positive results. Several measures can be used to describe a network structure and search for the important nodes, including degree, closeness centrality, and vulnerability.

It is natural to view a network as a graph, whose fundamental elements are its nodes and edges. For a node $i$, its degree is simply the number of edges incident upon it. It measures the connection of a node to its neighboring nodes. It is reasonable to assume that the higher a node's degree, the more it contributes to network stability. Unlike degree, closeness centrality measures the importance of a node in a more global sense. It is based on how close a node is to all other nodes in the graph. The closeness centrality is calculated as

$$CC_i = \frac{|V_i| - 1}{\sum_{j \neq i} d_{ij}},$$

where $|V_i|$ is the size of the subnetwork reachable from node $i$, and $d_{ij}$ is the shortest distance between node $i$ and node $j$ (Kolaczyk et al., 2008). Vulnerability is another measure of the importance of a node to a network and is calculated from the network efficiency (Gol'dshtein et al., 2004). The network efficiency quantifies the efficiency of information transmission within the network. Assuming that the efficiency between two nodes is inversely proportional to their distance, measured in edges, the global efficiency of a network is calculated as

$$E = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}},$$

where $N$ is the total number of nodes in the network. Then, the vulnerability of node $i$ is

$$V_i = \frac{E - E_i}{E},$$

where $E_i$ is the global efficiency of the network with node $i$ and all edges connected to node $i$ removed. Therefore, vulnerability is the relative efficiency loss of a network if node $i$ is missing. Studies have shown that these measures capture some level of importance in a PPI network, as important genes tend to be in more central positions (Ortutay et al., 2009). These values can be easily calculated with igraph.

12.6.3.1 Software: igraph

igraph is a free software tool for creating and manipulating undirected and directed graphs (Csardi et al., 2006; http://igraph.sourceforge.net). It supports R and can be installed as an R package.
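Before turning to igraph, it may help to see the three measures defined in section 12.6.3 computed by hand on a toy network. The sketch below uses base R only, with a Floyd-Warshall shortest-path routine written for the purpose; the four-node network and the helper names shortest_paths and global_eff are hypothetical and serve only to make the formulas concrete.

```r
# Hypothetical undirected network: edges a-b, b-c, c-d, and b-d
edges <- rbind(c("a", "b"), c("b", "c"), c("c", "d"), c("b", "d"))
nodes <- sort(unique(as.vector(edges)))
N <- length(nodes)
A <- matrix(0, N, N, dimnames = list(nodes, nodes))
A[edges] <- 1; A[edges[, 2:1]] <- 1               # symmetric adjacency matrix

# All-pairs shortest path lengths (Floyd-Warshall, unweighted edges)
shortest_paths <- function(A) {
  d <- ifelse(A > 0, 1, Inf); diag(d) <- 0
  for (m in seq_len(nrow(d))) d <- pmin(d, outer(d[, m], d[m, ], "+"))
  d
}

# Global efficiency E: mean of 1/d_ij over ordered pairs i != j (unreachable pairs add 0)
global_eff <- function(A) {
  n <- nrow(A); if (n < 2) return(0)
  d <- shortest_paths(A); diag(d) <- Inf
  sum(1 / d) / (n * (n - 1))
}

d <- shortest_paths(A)
degree <- rowSums(A)                              # number of incident edges
closeness <- sapply(nodes, function(i) {
  reach <- is.finite(d[i, ])                      # reachable subnetwork, including node i
  (sum(reach) - 1) / sum(d[i, reach])             # CC_i = (|V_i| - 1) / sum_j d_ij
})
E <- global_eff(A)
vulnerability <- sapply(nodes, function(i) {
  keep <- nodes != i                              # remove node i and its edges
  (E - global_eff(A[keep, keep, drop = FALSE])) / E
})
print(rbind(degree, closeness, vulnerability))
```

In this example node b has the highest degree, closeness, and vulnerability, because removing it disconnects node a from the rest of the network. For a network of realistic size, the matrix-based routine above would be too slow, and a graph library is the practical choice.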
BOX 12.4. Sample Code for Network Analysis
> library(igraph)
> # vertex ids are 1-based in igraph 0.6 and later; the original listing used the 0-based ids of earlier versions
> ppis.sel = ppi[ppi[,1] %in% gene.input | ppi[,2] %in% gene.input, ]
> ppis.dm = data.frame(p1 = as.character(ppis.sel[,1]), p2 = as.character(ppis.sel[,2]))
> ppi.graph = graph_from_data_frame(ppis.dm, directed = FALSE); geneNames = V(ppi.graph)$name
> # function to calculate global efficiency
> global.eff = function(graph){
+   n = vcount(graph); Es = rep(0, n)
+   for (i in 1:n){
+     spi = distances(graph, v = i); spi[spi == 0] = Inf   # unreachable nodes are already Inf
+     Es[i] = sum(1/spi)
+   }
+   sum(Es)/n/(n-1)
+ }
> E = global.eff(ppi.graph)
> # calculating the vulnerability of node i
> # gene.sel, N.gene, and v.vec are not defined in the original listing; the definitions below are assumptions
> gene.sel = intersect(gene.input, geneNames); N.gene = length(gene.sel); v.vec = 1:vcount(ppi.graph)
> Eis = rep(0, N.gene)
> for (i in 1:N.gene){
+   v.i = v.vec[v.vec != which(geneNames %in% gene.sel[i])]
+   ppi.gi = induced_subgraph(ppi.graph, v.i); Eis[i] = global.eff(ppi.gi)
+ } # this process may take a few hours if the network is large
> Vis = (E-Eis)/E   # vulnerability
The PPI data can be stored in the following format, with each row corresponding to a PPI:
A1CF     SYNCRIP
A1CF     TNPO2
A26C3    MME
A2BP1    ATN1
…        …
The network characteristics for a list of genes of interest can be calculated using the code in Box 12.4, where ppi is a matrix read from the data file and gene.input contains the gene list. The top-ranked genes are considered candidate genes worthy of further pursuit.
12.6.4 PID Example
To illustrate the methods discussed, we take the primary immunodeficiency (PID) study (Ortutay et al., 2009) as an example. PID occurs when part of the body's immune system is missing or does not function properly. Numerous mechanisms can cause PID, most of which are related to dysfunctional immune genes. To search for candidate genes for PID, a set of 847 genes that are crucial for the immune system was first constructed as the input list, based on exhaustive analyses of the literature and databases (Ortutay et al., 2008). A PPI network for the 847 genes was built from the HPRD PPI database and used for network analysis. The top genes (e.g., the top 50) from the analyses of degree, vulnerability, and closeness centrality were merged for further GO enrichment analysis. The combination of interaction and enrichment analyses resulted in a list of 39 significant genes, of which 13 were previously known to be PID genes. This suggests that the other 26 could be very promising PID candidate genes.

12.6.5 Other Bioinformatics Tools
Many other web-based tools can also be used to search for candidate disease genes. Some of them share similar ideas and use similar procedures, with slight differences in how the algorithms are implemented. Here we briefly discuss four of them. Details about how to use the software can be found on the corresponding websites.

12.6.5.1 GeneSeeker (www.cmbi.ru.nl/GeneSeeker)

GeneSeeker is a server that gathers information from several online databases to filter positional candidate disease genes (van Driel et al., 2003, 2005). The rationale is that genes causing a disease are most likely expressed in the tissue affected by that disease. In addition, through synteny or protein homology comparison, information from other species such as mice can be borrowed to infer the function of human genes and proteins. GeneSeeker automates the combination of data on cytogenetic locations, phenotypes, and expression patterns. It is particularly well suited for syndromes in which disease genes alter their expression patterns in the affected tissues.

12.6.5.2 G2D (www.ogic.ca/projects/g2d_2)

G2D (gene to disease) provides three strategies (phenotype, known genes, and interactions) to prioritize disease candidate genes (Perez-Iratxeta et al., 2005, 2007). The input of each algorithm includes a location box, in which the user provides a genomic region of interest that may be linked with a disease trait, and an additional box containing either an MIM disease number (phenotype), Entrez gene identifiers known to be associated with the disease (known genes), or another locus region (interaction). The phenotype algorithm searches PubMed for MeSH C and MeSH D terms associated with a disease and associates them with GO terms. The genes annotated with those GO terms are then compared with genes in the location box by sequence homology. The main assumption of this method is that for a given disease with an undiscovered associated gene X and a phenotypically similar disease with a known associated gene Y, some functions of genes X and Y will be related and relevant to those phenotypes (Perez-Iratxeta et al., 2002). The known gene algorithm compares the known genes in enriched GO terms with the locus of interest. The interaction algorithm performs PPI analysis on genes from both loci. The justification for the interaction algorithm is that mutations in two proteins that participate in the same pathway or interact directly will produce the same or very similar disease phenotypes.

12.6.5.3 SUSPECTS (www.genetics.med.ed.ac.uk/suspects)

The inputs of SUSPECTS are exactly the same as the inputs for the known gene algorithm in G2D: one is the coordinates of the requested genomic region and the other is a list of known genes involved in the disease of interest. Users may also simply enter the name of the disease, and SUSPECTS retrieves an appropriate gene list from OMIM, HGMD, and GAD (Adie et al., 2006). This list is known as the training set. SUSPECTS scores each gene in the requested region on three features: (1) how well its GO annotation compares with the annotation found in the training set (similar to what is done in G2D), (2) how well its InterPro domains are shared with the training set, and (3) how its gene expression profile compares with the profiles from the match set using Spearman's rho rank-order correlation. A weighted average is then calculated to rank genes in order of likelihood of involvement in the disease. Genes near the top of the list are, in theory, better candidates than those farther down.
12.6.5.4 PGMapper (www.genediscovery.org/pgmapper)

PGMapper is a software tool for automatically matching phenotypes to genes from a defined genome region or a given group of genes. PGMapper retrieves information on all genes in the prespecified region from the Ensembl database and then searches OMIM and PubMed to find candidate genes relevant to a disease trait in the literature. Users can specify key words that describe the particular disease or phenotype features. PGMapper is currently available for candidate gene searches in humans, mice, rats, zebrafish, and 12 other species (Xiong et al., 2008).
12.7 FUTURE DIRECTIONS

In this chapter, we reviewed several bioinformatics tools for candidate gene screening, and we are certain that many more will be developed in the foreseeable future. The tools in most urgent need are those that can integrate data and information from multiple resources and produce consistent findings, for example by constructing a reliable network from different platforms such as SNP
data, gene expression data, and PPI data. Because different platforms rely on different high-throughput technologies and differ in knowledge depth, the networks constructed from them contain similar but not identical information. They query the genome at different stages and therefore provide a mechanism to double-check the edges and directions of the network (e.g., for Bayesian networks). Conceptually, a network that integrates more information should be more dependable. Groundbreaking work has been done to use SNP array data to orient the directions of network edges (Aten et al., 2008) and to create a method that integrates a seeded prior network to boost the reliability of a gene expression network (Djebbari and Quackenbush, 2008). More research should be directed toward this area.

12.8 QUESTIONS

1. Compare the Bayesian networks constructed by BNArray and NATbox. The sample data can be found at www.bnlearn.com.
2. Visit www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ and replicate the WGCNA provided on the website.
3. Generate a GO Biological Process term annotation file for probes in an Affy HG-U133A array. (Hint: Biomart database.)
4. Randomly pick 100 probes from the Affy HG-U133A array, and use the annotation file above to perform a GO enrichment analysis with topGO.
5. Randomly choose 500 genes and construct a PPI subnetwork containing only those genes from the HPRD database. (Hint: igraph package.)
6. Pick a disease you are interested in (e.g., Alzheimer disease or obesity) and use the method used for the PID example to find a candidate gene list.
7. For Alzheimer disease (OMIM #104300), compare the candidate genes within the genomic region of 17q23 found by GeneSeeker, G2D, SUSPECTS, and PGMapper.

12.9 ACKNOWLEDGMENTS
We thank David Galloway in Scientific Editing at St. Jude Children’s Research Hospital for his professional editing support. This work was supported in part by the American Lebanese Syrian Associated Charities.

12.10 REFERENCES

Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS. (2006). SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22:773.
Albert R, Jeong H, Barabasi AL. (2000). Error and attack tolerance of complex networks. Nature 406:378.
Alexa A, Rahnenfuhrer J, Lengauer T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22:1600.
Al-Shahrour F, Díaz-Uriarte R, Dopazo J. (2004). FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20:578–580.
Aten JE, Fuller TF, Lusis AJ, Horvath S. (2008). Using genetic markers to orient the edges in quantitative trait networks: The NEO software. BMC Syst Biol 2:34.
Barabasi AL, Albert R. (1999). Emergence of scaling in random networks. Science 286:509.
Barabasi AL, Bonabeau E. (2003). Scale-free networks. Scientific American 288:60–69.
Beissbarth T, Speed TP. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20:1464–65.
Carlson MRJ, Zhang B, Fang Z, Mischel PS, Horvath S, Nelson SF. (2006). Yeast Network Application. Gene Connectivity, Function, Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks. BMC Genomics 7:40.
Chavan SS, Bauer MA, Scutari M, Nagarajan R. (2009). NATbox: A network analysis toolbox in R. BMC Bioinformatics 10:S14.
Chen X, Chen M, Ning KD, et al. (2006). BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network. Bioinformatics 22:2952.
Chickering DM. (1996). Learning Bayesian networks is NP-complete. Springer Verlag.
Csardi G, Nepusz T. (2006). The igraph software package for complex network research. Int. J. Complex Syst. 1695.
Djebbari A, Quackenbush J. (2008). Seeded Bayesian networks: constructing genetic networks from microarray data. BMC Syst Biol 2:57.
Dong J, Horvath S. (2007). Understanding network concepts in modules. BMC Syst Biol 1:24.
Friedman N, Linial M, Nachman I, Pe’er D. (2000). Using Bayesian networks to analyze expression data. J Comput Biol 7:601–20.
Gargalovic PS, Imura M, Zhang B, Gharavi NM, Clark MJ, Pagnon J, Yang WP, He AQ, Truong A, Patel S, Nelson SF, Horvath S, Berliner JA, Kirchgessner TG, Lusis AJ. (2006). Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proc Natl Acad Sci USA 103:12741.
Gentleman RR. (2008). R Programming for Bioinformatics. CRC Press.
Giaever G, Chu AM, Ni L, Connelly C, Riles L, et al. (2002). Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.
Gol’dshtein V, Koganov G, Surdutovich G. (2004). Vulnerability and hierarchy of complex networks. arXiv preprint cond-mat/0409298.
Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, et al. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430:88.
Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu Q, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS. (2006). Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target. Proc Natl Acad Sci USA 103:17402–07.
Kolaczyk ED. (2008). Statistical Analysis of Network Data. Springer, New York.
Margaritis D. (2003). Learning Bayesian Network Model Structure from Data. Ph.D. Thesis, Carnegie-Mellon University, Pittsburgh, PA.
Mathivanan S, Periaswamy B, Gandhi TK, Kandasamy K, Suresh S, Mohmood R, Ramachandra YL, Pandey A. (2006). An evaluation of human protein-protein interaction data in the public domain. BMC Bioinformatics 18:S19.
Ortutay C, Vihinen M. (2008). Efficiency of the immunome protein interaction network increases during evolution. Immunome Res 4:4.
Ortutay C, Vihinen M. (2009). Identification of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeficiencies. Nucleic Acids Res 37:622.
Perez-Iratxeta C, Bork P, Andrade MA. (2002). Association of genes to genetically inherited diseases using data mining. Nat Genet 31:316.
Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. (2005). G2D: A Tool for Mining Genes Associated to Disease. BMC Genetics 6:45.
Perez-Iratxeta C, Bork P, Andrade-Navarro MA. (2007). Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Res 35:W212.
Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL. (2002). Hierarchical organization of modularity in metabolic networks. Science 297:1551.
Smith B, Welty C. (2001). Ontology: towards a new synthesis. In FOIS ’01: Proceedings of the international conference on Formal Ontology in Information Systems. October 17–19, 2001. Ogunquit, Maine.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. (1998). Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9:3273.
Spirtes P, Glymour C, Scheines R, Kauffman S, Aimale V, Wimberly F. (2000). Constructing Bayesian network models of gene expression networks from microarray data. Available at www.phil.cmu.edu/projects/genegroup/papers/spirtes2002a.pdf.
Sun J, Jia P, Fanous AH, Webb BT, van den Oord EJ, Chen X, Bukszar J, Kendler KS, Zhao Z. (2009). A multi-dimensional evidence-based candidate gene prioritization approach for complex diseases-schizophrenia as a case. Bioinformatics 25:2595.
Tsamardinos I, Aliferis CF, Statnikov A. (2003). Algorithms for Large Scale Markov Blanket Discovery. Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference.
Tsamardinos I, Brown LE, Aliferis CF. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning 65:3.
van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG. (2003). A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet 11:57.
van Driel MA, Cuelenaere K, Kemmeren PPCW, Leunissen JAM, Brunner HG, Vriend G. (2005). GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Res 33:W758–61.
Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285:901.
Xiong Q, Qiu YH, Gu WK. (2008). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24:1011–13.
Yaramakala S, Margaritis D. (2005). Speculative Markov Blanket Discovery for Optimal Feature Selection. Paper presented at the Fifth IEEE International Conference on Data Mining. November 27–30, 2005, Houston, Texas, USA.
Yip AM, Horvath S. (2007). Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8:22.
Zhang B, Horvath S. (2005). A general framework for weighted gene coexpression network analysis. Stat Appl Genet Mol Biol 4:1.
Zhao W, Mischel P, Carlson M, Zhang B, Nelson SF, Horvath S. (2006). A network-based gene screening approach for improving the validation success of microarray. Paper presented at the 2006 International Conference on Bioinformatics & Computational Biology. June 26–29, 2006, Las Vegas, Nevada, USA.
CHAPTER 13
Using an Integrative Strategy to Identify Mutations
YAN JIAO and WEIKUAN GU
Contents
13.1 Introduction
13.2 Identifying Possible Candidate Genes within the Genome Region of Interest
13.2.1 Selection of Genomic Database
13.2.2 Identification of All Genes and Other Genetic Elements
13.3 Identification of Possible Nucleotide Differences/Mutations within the GRI
13.3.1 Confirmation of Genomic Mutations in cDNA
13.3.2 Limitations and Alternative Approaches
13.4 Identifying Differentially Expressed Candidate Genes within GRI
13.4.1 Analyzing Candidate Gene Expression Levels
13.4.2 Microarray
13.4.3 Gene Expression Profiles Determined by Gene Microarray Analysis and Quantitative RT-PCR
13.5 Functional Prediction for Genes within GRI by Bioinformatics Approaches
13.5.1 Literature and Webpage Searching
13.5.2 Gene Network Analysis
13.5.3 Limitations and Alternative Approaches
13.6 Candidate Selection and Prioritization
13.6.1 The Prioritization of Candidate Genes Should Be Done According to Possible Function and the Nature of the Differences
13.6.2 The Importance of a Gene’s Potential Function in the Candidate Gene Selection Process
13.6.3 Final Prioritization of Candidate Genes Should Be Based on Integrative Information from All the Analyses
13.6.4 Limitations and Alternative Approaches
13.7 Confirmation of the Function of Selected Genes in GRI
13.7.1 Molecular Biological Approach
13.7.2 Genetic Approach
13.7.3 Limitation and Alternative Approaches
13.8 Questions and Answers
13.9 References

Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
13.1 INTRODUCTION

The initial work on positional cloning began in the late 1980s (Baehner et al., 1986; Royer-Pokora et al., 1986). However, the classic protocol for positional cloning was challenged in 2002, when the initial version of the whole mouse genome sequence was completed by the Mouse Genome Sequencing Consortium (2002). Eight years later, a historic transition in the strategy and direction of gene discovery has indeed occurred. Extraordinary progress has been made in gene discovery through positional cloning: (1) a very large number of mutated genes have been identified, (2) the time needed to identify mutations has been greatly shortened, and (3) the mutated genes of many decades-old mouse mutants have been discovered. In this chapter we describe an integrative strategy that has been used successfully to search for mutated genes in animal models of human diseases, mostly the mouse model.

The relative importance of each candidate gene depends on its relevance to the tissues of interest or to the phenotype: specifically, whether its sequence differs between the mutant and the wild type or control, and whether the gene is expressed in the relevant tissues. Therefore, before evaluating candidates, one needs to fully characterize each candidate gene. First, one needs to identify every possible sequence difference in the candidate genes. Next, the expression profile of each gene must be examined. Finally, one needs to elucidate their known or possible functions by bioinformatics analysis or literature searching. By doing so, a profile for each candidate gene can be established for mutation confirmation and further functional analyses.

13.2 IDENTIFYING POSSIBLE CANDIDATE GENES WITHIN THE GENOME REGION OF INTEREST

For most chronic animal disease models, it appears that the causative mutation is not among well-known candidate genes.
Therefore, it is important not to dismiss any possibility, and, accordingly, it is necessary to examine all possible genes and other genetic components such as regulatory elements. Understanding
the genetic elements within the genome region of interest (GRI) ensures that no candidate gene will be missed. The first imperative step is to identify every possible genetic factor within this region. Although the entire genomes of humans and many animals have officially been sequenced, gaps and errors in the sequence assemblies still remain. It is essential to obtain not only accurate genome information but also every possible candidate gene within the genome sequences. This seemingly simple step is crucial in searching for the candidate gene. With current bioinformatics tools, we can select candidate genes by hierarchically examining every nucleotide in the GRI. First, all of the coding sequences in the chromosomal region of interest are identified. Second, introns and 5′ and 3′ sequences are determined. Third, nucleotide organization, gene order, and chromosomal structure are analyzed. Very importantly, gene regulatory elements in nongenic regions of the GRI should also be carefully identified.

13.2.1 Selection of Genomic Database
Genome data are essential for determining and verifying the genome sequences of the GRI. Currently, several genomic databases provide genomic sequences. The Ensembl Genome Browser (www.ensembl.org/index.html) is one example and is commonly used as the basic database for gene identification. Previously, Ensembl was used to determine the location of targeted genome regions in the identification of mutated genes for several mouse disease models, during which one assembly problem was identified (Jiao et al., 2005a). Currently, Ensembl provides the most accurate assembly of genome sequences. Other databases include the University of California Santa Cruz genome browser (http://genome.ucsc.edu) and the NCBI genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome). Both of these databases are used for comparing and confirming the results obtained from Ensembl.

13.2.2 Identification of All Genes and Other Genetic Elements
Although the Ensembl database has complete sequence coverage of entire genomes, genes and transcripts annotated by other sources/programs (e.g., EMBL mRNAs, UniGene, Genscan) are presented in the genome browser as well. Combining Ensembl with these other sources provides the most complete genome information, which is critical for identifying candidate genes.
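Combining annotation sources, as recommended above, amounts to taking the union of the gene lists and flagging entries that are not reported by every source. The following Python sketch assumes each source has already been exported as a set of gene symbols for the GRI; the source labels and gene names are hypothetical.

```python
def merge_annotations(sources):
    """Map every gene in the union to the sorted list of sources reporting it.

    sources: {source_name: set of gene symbols found in the GRI}
    """
    merged = {}
    for name, genes in sources.items():
        for gene in genes:
            merged.setdefault(gene, []).append(name)
    return {gene: sorted(names) for gene, names in merged.items()}


def discordant_genes(merged, n_sources):
    """Genes missing from at least one source; these deserve a manual check."""
    return sorted(g for g, names in merged.items() if len(names) < n_sources)


# Hypothetical gene lists for one region from three annotation sources.
sources = {
    "ensembl": {"GeneA", "GeneB", "GeneC"},
    "ucsc": {"GeneA", "GeneB"},
    "ncbi": {"GeneA", "GeneC", "GeneD"},
}
merged = merge_annotations(sources)
to_check = discordant_genes(merged, len(sources))
```

Keeping the union rather than the intersection follows the chapter's advice not to dismiss any possible candidate; the discordant list simply points to where the databases disagree.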
13.3 IDENTIFICATION OF POSSIBLE NUCLEOTIDE DIFFERENCES/MUTATIONS WITHIN THE GRI

Two major types of nucleotide change cause variation in phenotype: mutations and polymorphisms. Mutations usually refer to deletion, insertion, replacement, and
duplication of nucleotides that lead to a change of function of a genetic element, which in turn results in visible or detectable changes in phenotype. Polymorphism usually refers to a single nucleotide polymorphism (SNP) or variation. Polymorphic nucleotides, individually or in groups, generally cause relatively small changes in phenotype on a continuous scale; the underlying loci are usually called quantitative trait loci (QTL). Phenotypic variation due to QTL is usually not considered disease. SNPs exist and segregate in the general population. Identifying an SNP for a QTL can be achieved using available genome databases. Initially, some genome regions can be eliminated through haplotype analysis and polymorphism comparison. Such an analysis can be done with a method partially based on a suggestion by Wade and Daly (2005): use comparative haplotype analysis to limit GRI intervals and to identify likely candidate genes in these regions. A haplotype is a group of markers retained as a block. Haplotypes are typically used to characterize regions where linkage disequilibrium occurs; here, the purpose is to identify nonpolymorphic marker blocks between the GRI of the subject and that of the control. Identifying these blocks is useful because it excludes these regions from further consideration as GRI intervals. Having removed all nonpolymorphic blocks, one can focus gene identification on the regions that are polymorphic between the GRI and the control. Dense marker coverage in GRI regions (averaging one SNP per 400–500 bp) is available by using SNPs between the GRI and a control, archived at Mouse Genome Informatics (www.informatics.jax.org/javawi2/servlet/WIFetch?page=snpQF) and GeneNetwork (www.genenetwork.org/cgi-bin/beta/snpBrowser.py), to determine haplotypes.
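The exclusion of nonpolymorphic blocks described above can be sketched in a few lines of Python. The sketch assumes the SNP alleles for the subject and control strains have already been downloaded and sorted by position; the minimum run length that defines a block (`min_snps`) is a hypothetical parameter, not a value given in the chapter.

```python
def nonpolymorphic_blocks(snps, min_snps=3):
    """Return (start, end) intervals where subject and control alleles agree.

    snps: list of (position, subject_allele, control_allele), sorted by
    position. A block is a run of at least min_snps consecutive identical
    markers; min_snps is an illustrative threshold.
    """
    blocks, run = [], []
    for position, subject, control in snps:
        if subject == control:
            run.append(position)
            continue
        # A polymorphic marker ends the current run of identical markers.
        if len(run) >= min_snps:
            blocks.append((run[0], run[-1]))
        run = []
    if len(run) >= min_snps:
        blocks.append((run[0], run[-1]))
    return blocks


# Hypothetical markers spaced about 400 bp apart.
snps = [
    (100, "A", "A"), (500, "G", "G"), (900, "T", "T"),
    (1300, "C", "T"),
    (1700, "A", "A"), (2100, "G", "G"),
]
strict = nonpolymorphic_blocks(snps, min_snps=3)
lenient = nonpolymorphic_blocks(snps, min_snps=2)
```

Intervals returned by the function would be set aside, leaving the polymorphic remainder of the GRI for gene identification.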
Next, differences can be identified, to some extent, with available SNP databases such as the Roche SNP database (http://mousesnp.roche.com), the MGI mouse SNP database (www.informatics.jax.org/javawi2/servlet/WIFetch?page=snpQF), or the NCBI SNP database (www.ncbi.nlm.nih.gov/SNP).

In contrast to SNPs, mutated nucleotides usually cause disease and therefore do not exist in the normal population. Accordingly, mutations need to be discovered by sequencing or mutation screening. While sequence confirmation is the final call for a mutation, many mutation detection systems have been used for rapidly screening a large number of samples. Currently, the major mutation analysis systems include chemical cleavage of mismatch (CCM) (Tabone et al., 2006) and denaturing high-pressure liquid chromatography (DHPLC) (Hall et al., 2001). CCM is one of the few methods capable of detecting nearly all single-base mismatches. Mutation detection by CCM is based on chemical modification and cleavage at the site of a mismatched C or T in heteroduplexes, using hydroxylamine or osmium tetroxide (OsO4) as chemical probes. DHPLC compares two or more DNA fragments as a mixture of denatured and reannealed PCR amplicons, thereby revealing the presence of a mutation by the differential retention of homo- and heteroduplex DNA on reversed-phase chromatography supports under partial denaturation. Differences among DNA fragments can be detected successfully by UV or fluorescence monitoring. One of the major commercial DHPLC systems used for discovering mutations in animal models is the SpectruMedix system (Jiao et al., 2005a). The SpectruMedix system includes high-throughput capillary electrophoresis instruments, specialized separation polymers, and a suite of automated software applications.

[Figure 13.1. Characterization of candidate genes according to nucleotide difference. Decision tree with importance increasing from lowest to highest: nucleotide differences that are not unique between the GRI and the control are dropped; unique differences are classified as lying in coding sequences, the 3′ and 5′ ends, or introns, and are then checked for an amino acid change (and a change in the type of amino acid), a known motif or regulatory domain, or a splicing element or regulatory domain, before being selected for prioritization.]

Mutation screening should start with examining the coding regions of all the candidate genes for differences. As illustrated in Figure 13.1, initial screening identifies all the nucleotide differences between the GRI and the control. The next step is to identify which of those differences are unique to the GRI, by sequencing or by searching known sequences. Theoretically, a mutation should not exist in other wild-type strains or populations. If one or more different nucleotides are unique to the GRI, then the location of each difference in the targeted genome must be examined. If the difference is located in an exon, it is necessary to determine whether it leads to a change in amino acids and/or whether the change alters a functional domain, such as from hydrophobic to hydrophilic. The following step examines other regions to identify new differences. In addition to identifying known differences, screening with mutation detection systems identifies any other differences. In practice, if the change is in a regulatory region, the next question is whether the change is in a known motif or another regulatory element. If it is in an intron or nongene region, the potential impact of the mutation or polymorphism on gene regulation needs to be investigated. For each such investigation,
extensive searching and comparison should be conducted using sequence databases and other databases such as Ensembl.

13.3.1 Confirmation of Genomic Mutations in cDNA
Once a difference in a coding region is found, several questions should be addressed to confirm the mutation. First, the obvious question is whether it is located within the GRI. Second, is this mutation the only defect detected among the candidate genes and ESTs within the GRI? Answering this question ensures that there are no other differences between the GRIs of the subject and normal controls, so that the possibility of another mutation is ruled out. Third, do the cDNA sequence results agree with the genomic DNA data? Last, is the mutation unique to the subject GRI compared with other available populations or inbred strains (if the mutation is from a mouse disease model)?

13.3.2 Limitations and Alternative Approaches
The next step identifies all possible coding sequences (open reading frames), promoters, intron/exon junctions, sequences that match known genes, and other biologically significant sequences, such as repetitive elements, in the targeted region. Investigators can usually treat the identified fragments as if they were genes or parts of genes. However, fragments that the software does not pick up should be considered undetermined sequences rather than noncoding sequences. In general, the only sequences excluded in this step should be well-known repetitive sequences. Errors may exist among the currently assembled genome sequences. During positional cloning of the spontaneous fracture (sfx) mutation (Jiao et al., 2005a), inconsistencies in the number of exons between the NCBI and Ensembl databases were discovered. Searching through different databases may not be sufficient to obtain a complete list of candidates. Further steps should be taken, such as obtaining information from different sources, including the bacterial artificial chromosome and yeast artificial chromosome contigs deposited in GenBank. Thus every possible measure must be taken to search every possible resource so that specific errors in the genome sequence assembly can be identified. Alternatively, the search region can be enlarged (e.g., by extending it 0.5 Mb on each side of the GRI) to ensure that all possible genes are considered.
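Enlarging the search region by a fixed flank, as suggested above, is a simple interval calculation. The sketch below is illustrative only: it assumes gene coordinates on a single chromosome are already available as (name, start, end) tuples, and the coordinates and gene names are hypothetical.

```python
MB = 1_000_000


def extend_region(start, end, flank_bp=MB // 2, chrom_length=None):
    """Widen a region by flank_bp on each side, clipped to the chromosome."""
    new_start = max(1, start - flank_bp)
    new_end = end + flank_bp
    if chrom_length is not None:
        new_end = min(new_end, chrom_length)
    return new_start, new_end


def genes_overlapping(genes, start, end):
    """Names of genes whose span overlaps [start, end] (same chromosome)."""
    return [name for name, g_start, g_end in genes
            if g_start <= end and g_end >= start]


# Hypothetical gene coordinates near a GRI spanning 1.0-3.0 Mb.
genes = [
    ("GeneA", 400_000, 600_000),
    ("GeneB", 3_400_000, 3_600_000),
    ("GeneC", 4_000_000, 4_100_000),
]
core = genes_overlapping(genes, 1_000_000, 3_000_000)
wide_start, wide_end = extend_region(1_000_000, 3_000_000)
extended = genes_overlapping(genes, wide_start, wide_end)
```

In this toy example the original interval contains none of the listed genes, while the 0.5 Mb flanks bring two additional genes into consideration.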
13.4 IDENTIFYING DIFFERENTIALLY EXPRESSED CANDIDATE GENES WITHIN GRI

Identifying differentially expressed genes in the GRI is an important step for identifying mutated genes and for analyzing the molecular pathways of mutated genes. If the expression level of a gene in the GRI is altered, the gene is either the mutated one or is affected by the mutated gene. It is important to keep in mind that mutations in a regulatory region are expected to lead to changes in gene expression. However, mutations in the coding region of a gene may or may not lead to changes in gene expression.

13.4.1 Analyzing Candidate Gene Expression Levels

An illustration of the experimental steps is shown in Figure 13.2. Briefly, the first step is to examine whether a gene is expressed in the tissues of interest or other relevant tissues. If it is not expressed, the gene is most likely not relevant to the GRI phenotype. The next step is to examine the expression levels and determine whether there is a difference between the GRI and the control. If there is a difference, then its significance should be investigated. The expression level and the differential expression of a gene should be confirmed by real-time PCR. This selection process involves many experimental steps and is described below.

[Figure 13.2. Characterization of candidate genes according to gene expression. Decision tree with importance increasing from lowest to highest: expression in the tissue of interest (if none, stop); expression level (low or high); differential expression (DE); whether the difference is larger than twofold; confirmation of DE by quantitative PCR; selection for prioritization.]

To analyze gene expression level,
whole genome expression and exon arrays offer rapid high-throughput solutions. Currently, several commercial systems are available, including the Illumina, Agilent, and Affymetrix systems. Here we use the Affymetrix system as an example for analyzing gene expression levels. Currently, the Affymetrix array contains almost every gene found in the mouse genome. The advantage of Affymetrix exon arrays is that they allow investigators to characterize gene network(s) by the expression levels of all the possible genes, as well as of their exons. This, in turn, should allow investigators to better assess the potential role of each candidate gene within the GRI or in the phenotype. If any genetic element is not covered by an Affymetrix array, real-time PCR (RT-PCR) or similar experimental technologies must be used to analyze it. High-quality RNA is essential for this step. Many investigators have been very successful in extracting high-quality RNA for microarray analysis by using a modified procedure from Life Technologies that employs TRIzol as a reagent (Gu et al., 2002).

13.4.2 Microarray
For each tissue block of interest, it is important to use equally mixed RNA from multiple samples, with at least three replicates and the same number of controls. Total RNA for each group should be used for cDNA synthesis with the SuperScript Choice System for cDNA Synthesis. The generated cRNA should be hybridized to an Affymetrix GeneChip 430 2.0 array, representing ∼36,000 mouse transcripts, at 45°C for 16 h. Chips should be washed and stained in a fluidics station. MAS 5.0 should be used to control image scanning by the Agilent GeneArray Scanner and data generation.

13.4.3 Gene Expression Profiles Determined by Gene Microarray Analysis and Quantitative RT-PCR

Raw image data from microarray fluorescence scanning should be subjected to quality control analysis by using dChip software. Hybridization signals should be analyzed using software released by Affymetrix (currently, GCOS). Depending on the data distribution of individual transcripts, parametric (ANOVA) or nonparametric (e.g., Kruskal-Wallis) multigroup comparison analyses for replicate samples should be run to determine which transcripts are differentially expressed in both the tissues of interest and lung tissues from the different mouse groups. Analyses can be done using customized SAS analysis tools (SAS Institute, 2001), dChip, GeneSpring (Silicon Genetics, 2002), and R-Affy analysis packages. Additional analysis of array data can be performed with the current or upgraded 8 + 1 node Linux cluster.

[Figure 13.3. Characterization of candidate genes according to known function. Decision tree with importance increasing from lowest to highest, asking in turn: Is the gene known to have a relevant function? Is it in the pathway of a relevant gene? Has it been well investigated? Genes are then selected for prioritization.]

The next step is to use quantitative RT-PCR (qRT-PCR) to analyze and confirm the expression levels of the various candidate genes. Not all genes are included in an Affymetrix array. qRT-PCR should be used to analyze the genes not included in Affymetrix arrays, as well as to confirm the other genes’ data from Affymetrix arrays. Depending on the number of candidate genes, one should vary the selection of samples for each assay. If only a few candidates are identified, one should use a more conventional 96-well format and larger volumes. In either case, the data should detect genes that show differences between the control and GRI. In addition, investigators should use some genes randomly selected from the microarray experiment so that the genes can be independently evaluated by qRT-PCR.
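The twofold criterion shown in Figure 13.2 can be expressed as a short calculation that flags genes for qRT-PCR confirmation. The Python sketch below assumes linear-scale, strictly positive signal intensities from replicate arrays; the default threshold of 2 follows the figure, while the gene names and intensity values are hypothetical. It is not a substitute for the statistical tests (ANOVA or Kruskal-Wallis) mentioned above.

```python
from math import log2
from statistics import mean


def fold_change(subject, control):
    """Ratio of mean signal (subject / control); values must be positive."""
    return mean(subject) / mean(control)


def is_differentially_expressed(subject, control, threshold=2.0):
    """True when the change exceeds the threshold in either direction."""
    return abs(log2(fold_change(subject, control))) > log2(threshold)


def flag_for_qpcr(expression, threshold=2.0):
    """Genes passing the fold-change screen, to be confirmed by qRT-PCR.

    expression: {gene: (subject_replicates, control_replicates)}
    """
    return sorted(
        gene for gene, (subject, control) in expression.items()
        if is_differentially_expressed(subject, control, threshold)
    )


# Hypothetical replicate intensities (three subject, three control arrays).
expression = {
    "GeneA": ([400, 420, 380], [100, 100, 100]),  # about 4-fold up
    "GeneB": ([50, 50, 50], [200, 200, 200]),     # about 4-fold down
    "GeneC": ([110, 100, 90], [100, 100, 100]),   # unchanged
}
to_confirm = flag_for_qpcr(expression)
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a fourfold decrease is flagged just as a fourfold increase is.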
13.5 FUNCTIONAL PREDICTION FOR GENES WITHIN GRI BY BIOINFORMATICS APPROACHES

Functional analysis of candidate genes should be conducted via multiple approaches, including gene annotation, sequence comparison or domain recognition, known reported functions, and gene network construction.

13.5.1 Literature and Webpage Searching

One obvious approach is literature searching and comparison (Fig. 13.3). The literature search should be conducted using web-based search programs
such as PGMapper (www.genediscovery.org/pgmapper) (Xiong et al., 2008a). PGMapper is a software tool that automatically matches phenotypes to their causative genes/genotypes. PGMapper provides detailed information concerning the candidate genes and all related references from the OMIM and PubMed databases to support their candidacy. PGMapper can search publications and examine the protein structure encoded by each candidate gene to look for a possible connection between gene function and the phenotype of interest. If the candidate gene is novel, GenBank should be searched for possible similarities between the selected candidate genes and other known genes. The available protein structure models should be used to predict the probable function of each candidate gene. Detailed procedures for functional analysis with PGMapper should follow the outline given in publications (Xiong et al., 2008b, 2009). Briefly, first, the names of all the candidate genes and the key terms should be entered into the space provided on the PGMapper webpage. Second, the program should be instructed to search OMIM and PubMed for any publications containing the name of the candidate gene and/or the keywords. Third, all identified publications are retrieved from the databases into a list. Fourth, the abstract or full content of each retrieved publication must be reviewed individually to determine the relevance of the identified gene to the tissues of interest or the fibrotic phenotype in the mouse GRI. Finally, a table consisting of a list of candidate genes and publication references relevant to the phenotype or trait of interest should be constructed.

13.5.2 Gene Network Analysis

The second approach should be to examine the role of each candidate gene in the gene network or pathway network (Rhodes et al., 2002; Selaru et al., 2002; Pereira et al., 2004). Not all genes relevant to the GRI disease have been studied or reported.
However, large numbers of gene networks based on whole genome microarray gene chips currently provide connections among genes. If a gene has not been studied for a role in the tissues or trait of interest but is connected, in a gene network or pathway, to a gene that is highly relevant to a known function of interest, it should rank much higher in the candidate list than a gene with no known relevance to the function of interest. Microarray data are one of the resources used to construct gene networks. User-friendly software has been used with great success, for example: data clustering with Cluster and TreeView (Rhodes et al., 2002; Eisen Laboratory, http://rana.lbl.gov/EisenSoftware.htm); biological profiling of altered gene expression patterns with Ingenuity Pathways Analysis; and construction and visualization of gene relation networks using the web server of the University of Tennessee (http://132.192.64.224/geneinfoviz/search.php). Another important resource is the database Genenetwork (www.genenetwork.org). Currently, this database contains gene expression data
from mouse tissues. Genenetwork is extremely important for studying gene pathways, particularly in mouse models, because it contains gene expression profiles from a variety of tissues of more than 60 recombinant inbred mouse lines derived from two popular mouse strains, C57BL/6 and DBA/2. When the name of a gene is entered, the database returns a network that contains that gene and its correlation with other genes.

13.5.3 Limitations and Alternative Approaches
Investigators should realize that the one thing not under their control is the genome database. For example, the mouse genome database in Ensembl is improved and modified from time to time and is currently at its 50th version. Nevertheless, there is no guarantee that every sequence, or the assembly of the sequences, is 100% accurate. Investigators should keep checking the genome database for updated information and make corrections whenever necessary. For a similar reason, literature searches on the reported function of genes are limited by the available literature and research data.
13.6 CANDIDATE SELECTION AND PRIORITIZATION

Candidate selection and prioritization will determine which genes have the greatest potential to be responsible for the dermal fibrosis GRI. Through gene searching and profile analysis, many sequence differences and many functionally relevant genes should be identified for the GRI. However, it is unlikely that all of them regulate the variations in the tissues of interest. For example, it is possible to find differences in every gene. Clearly, not every gene should be a candidate. It is reasonable to assume that, if a difference is related to the GRI phenotype, the difference should be specific to the trait of interest. Thus any nucleotide difference should co-segregate with the phenotype of interest. In contrast, if a difference segregates randomly among different strains, it should not be considered a candidate. Similarly, not all the genes expressed in the tissues of interest are responsible for the trait of interest, and not all phenotypically relevant genes function in the regulation of a particular disease. Therefore, one needs to eliminate obvious noncandidate genes. If there are still too many candidate genes to handle, it is necessary to prioritize the remaining genes according to current knowledge. Candidate genes should be prioritized according to their sequence differences, expression profiles, and known or potential biological function. Investigators should select the most favorable candidate gene(s) according to an overall evaluation. As shown in Figure 13.4, among sequence differences, expression levels, and potential functions, a confirmed useful difference should be given much more weight than either expression or known functional relevance.
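The co-segregation criterion described above can be illustrated with a minimal Python sketch. The strain names, genotype calls, and the function name `cosegregates` are hypothetical and serve only to show the logic: a difference is retained when all affected strains share one allele that no unaffected strain carries.

```python
def cosegregates(genotypes, phenotypes):
    """Return True if a variant co-segregates with the phenotype.

    genotypes:  dict mapping strain name -> allele observed at the variant
    phenotypes: dict mapping strain name -> True (affected) or False (unaffected)
    """
    affected = {genotypes[s] for s, is_affected in phenotypes.items() if is_affected}
    unaffected = {genotypes[s] for s, is_affected in phenotypes.items() if not is_affected}
    # All affected strains must share a single allele, at least one unaffected
    # strain must be available for comparison, and no unaffected strain may
    # carry the allele found in the affected strains.
    return len(affected) == 1 and bool(unaffected) and affected.isdisjoint(unaffected)


# Hypothetical example data (not taken from the chapter).
phenotypes = {"mutant_A": True, "mutant_B": True, "control_1": False, "control_2": False}
variants = {
    "snp_1": {"mutant_A": "T", "mutant_B": "T", "control_1": "C", "control_2": "C"},
    "snp_2": {"mutant_A": "G", "mutant_B": "A", "control_1": "G", "control_2": "A"},
}

candidates = [name for name, calls in variants.items() if cosegregates(calls, phenotypes)]
print(candidates)
```

In this toy data set only `snp_1` is retained; `snp_2` segregates independently of the phenotype and would be dropped from the candidate list.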
[Figure 13.4: flowchart. Candidate genes are evaluated for relevant function, unique polymorphism (yes/no), and differential expression (yes/no), and are ranked from lowest to highest prioritization to arrive at the favorite candidate(s).]

Figure 13.4. Prioritization and candidate gene selection.
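The ranking idea behind Figure 13.4 can be sketched in a few lines of Python. This is an illustration only: the numeric weights are assumptions chosen so that a confirmed unique polymorphism outweighs expression and functional evidence combined, as the text recommends, and the gene names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    unique_polymorphism: bool
    differential_expression: bool
    relevant_function: bool


# Assumed weights (not from the chapter): the sequence difference counts for
# more than the other two lines of evidence together.
WEIGHTS = {
    "unique_polymorphism": 3,
    "differential_expression": 1,
    "relevant_function": 1,
}


def priority_score(candidate):
    """Sum the weights of every criterion the candidate satisfies."""
    return sum(weight for attr, weight in WEIGHTS.items() if getattr(candidate, attr))


def prioritize(candidates):
    """Return candidates ordered from highest to lowest priority."""
    return sorted(candidates, key=priority_score, reverse=True)


# Hypothetical candidates within a GRI.
genes = [
    Candidate("gene_B", False, True, True),
    Candidate("gene_D", False, False, False),
    Candidate("gene_A", True, True, True),
    Candidate("gene_C", True, False, False),
]

ranked = [c.name for c in prioritize(genes)]
print(ranked)
```

A weighted score is only one way to encode the flowchart; a strict yes/no decision tree would serve equally well when the three criteria are assessed one after another.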
13.6.1 The Prioritization of Candidate Genes Should Be Done According to Possible Function and the Nature of the Differences

With regard to the position of a difference, the general order of importance, from highest to lowest, is (1) coding regions, (2) 3′ and 5′ end sequences, and (3) intron sequences. However, in many cases, the nature of the difference is more important than its position. For example, within coding sequences, a difference that alters the amino acid is much more important than a difference that changes only the nucleotide. Among the differences that alter amino acids, changes in conserved sequences are more important than those in nonconserved sequences. Changes that convert an amino acid into another type (e.g., a hydrocarbon R-group becomes an acidic or basic R-group) are more important than changes that yield a similar type of amino acid (e.g., one hydrocarbon R-group changes to another hydrocarbon R-group) (Jiao et al., 2007). When examining a difference in coding sequences, it is essential first to check the codon table to see whether the difference changes the amino acid. The difference should then be examined to determine whether the new codon encodes the same type or a different type of amino acid.

Although differences in noncoding regions are generally given lower priority than differences in the coding region, some may be more important than those in coding regions, especially when they are relevant to gene expression and splicing. One should try to identify commonly known potential functional motifs, sequences that may affect differential splicing, and common promoter sequences. Software for automatically identifying promoter sequences is available for such a search. For example, Promoter 2.0 Prediction Server (www.cbs.dtu.dk/services/promoter) is a tool for predicting potential promoter regions from a given sequence. Another web-based search tool is Transcription Regulatory Element Search (TRES). TRES can simultaneously search up to 20 promoter sequences for known transcription factor binding sites, cis-acting elements, palindromic motifs, and/or conserved k-tuples (phylogenetic footprints). It is useful for comparative promoter sequence analysis to elucidate common themes (modules) in functionally or phylogenetically related promoters. Accordingly, one should first select sequences that contain identified, potential regulatory elements. Next, one should examine the rest of the nonrepetitive sequences. The repetitive sequences should be examined last.

After examining noncoding regions, the next task is to identify differences that segregate between the GRI and the corresponding genome region of the control or wild type and that are associated with the tissues of interest. Nucleotide differences or SNPs within those haplotypes, strains, or substrains should be confirmed and analyzed. One should amplify, from each strain, substrain, or haplotype, every DNA fragment in the GRI interval that is polymorphic between the control and the GRI.

13.6.2 The Importance of a Gene’s Potential Function in the Candidate Gene Selection Process

The first priority should be genes relevant to the susceptibility of the disease of interest. The second priority is novel genes that have no known biological function. The next is genes with known function in other pathways. The lowest priority genes are those that have been extensively studied (indicated by a large number of literature reports in searching) but have no connection to the susceptibility of the disease of interest in either functional or molecular pathways.
13.6.3 Final Prioritization of Candidate Genes Should Be Based on Integrative Information from All the Analyses

The first priority is the genes that possess a unique difference affecting the amino acid, are highly expressed in the tissues of interest, and have a demonstrated connection to the disease of interest. The next level of priority is the genes that carry differences altering the amino acid sequence and are expressed in the tissues of interest. At the level of sequence difference, emphasis should be put on the amino acid change and the possible impact it may have. If a difference is in a noncoding region, it may be in an intron, the 5′ or 3′ end, or in sequences between genes. The importance of such a difference is difficult to evaluate. However, differences in noncoding regions may affect regulation of gene expression. Much importance should be given to combining literature, gene function, and gene expression profiling.
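The evaluation of an amino acid change can be sketched in Python as follows. The sketch uses the standard genetic code; the four side-chain classes are one common textbook grouping and are an assumption here, as is the function name `classify_codon_change`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumed grouping of amino acids by side-chain (R-group) type.
SIDE_CHAIN_CLASS = {
    aa: cls
    for cls, residues in {
        "nonpolar": "AVLIMFWPG",
        "polar": "STCYNQ",
        "acidic": "DE",
        "basic": "KRH",
    }.items()
    for aa in residues
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify the effect of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    if SIDE_CHAIN_CLASS[ref_aa] == SIDE_CHAIN_CLASS[alt_aa]:
        return "conservative missense"
    return "non-conservative missense"


for ref, alt in [("CTG", "CTA"), ("GTG", "GCG"), ("GTG", "GAG"), ("TGG", "TGA")]:
    print(ref, alt, classify_codon_change(ref, alt))
```

Under this grouping, a valine-to-alanine change stays within the nonpolar class, whereas valine-to-glutamate moves from a hydrocarbon to an acidic side chain and would be ranked higher.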
If a gene shows no change in its expression level between the control and the GRI, and no nucleotide difference anywhere in the gene co-segregates with the disease phenotype or trait of interest, the gene should be eliminated from the candidate list. At the nucleotide level, some differences in a gene may be retained for further study, while others may not. Polymorphisms can occur in the regulatory regions, the 5′ or 3′ end, the coding region, or the introns. If the change is in the regulatory 5′ or 3′ region, it is important to know whether the difference causes any change in gene expression, posttranscriptional or translational modification, or function. If a gene is not differentially expressed between the control and the GRI and the difference in a noncoding region of this gene has no obvious function, the difference may be considered unimportant. However, if the difference is in a coding region, it is necessary to examine whether the difference causes any possible change in protein translation and whether the change in protein sequence alters the function or activity of the protein. If a difference potentially results in altered splicing, the altered splicing should be reflected in the data from the exon arrays. It is important to examine the exon array data closely, or to repeat the experiment, to confirm the data and to determine the importance of potential differential splicing (Jiao et al., 2007).

13.6.4 Limitations and Alternative Approaches
It is not possible to predict the extent to which the number of candidate genes can be reduced. For some genes, especially novel genes within the GRI, one may have no information as to their function. It will be difficult to eliminate them according to their function. In this case, sequence difference and expression screening should play key roles in eliminating such genes. One should be cautious about polymorphic comparisons between multiple strains. The same GRI or the same gene does not necessarily regulate the disease within a strain and among available strains. Also, the phenotype of interest may be affected by a modifier. With all of these uncertain factors, one should draw conclusions from nucleotide comparisons with great caution. Prioritization must be based on a combination of nucleotide-specific differences and functional and bioinformatics information.
13.7 CONFIRMATION OF THE FUNCTION OF SELECTED GENES IN GRI

In general, one should expect that one or more mutations lead to the phenotype of interest. The confirmation of the candidate gene(s) should be an integrative process that includes both bioinformatics and experiment. The consequence of the mutation may be easily seen from the change in the DNA codon and the resulting amino acid. The mutation can be tested experimentally for its transcription and translational products. The mutation can also be transferred into another mouse strain to examine the phenotype. After initially confirming the gene defect(s) in the GRI, one should examine the identified genes to see whether (1) the gene has been studied for the same or a similar phenotype, (2) the gene has not been studied and therefore the potential pathway in which it participates is unexplored, (3) the new gene may play a significant role in the pathway, (4) the gene is a known gene or the function revealed represents a new function of the gene, and (5) the new function affects the phenotype of interest. Confirmation of the function of a candidate gene can be accomplished using a combination of the molecular biological and genetic approaches described next.

13.7.1 Molecular Biological Approach
13.7.1.1 Relative Quantitative RT-PCR

It is likely that the GRI gene(s) will be identified as one or more genes or coding regions. In this case, it will be necessary to perform expression studies by RT-PCR. Unique probes should be designed from the sequence data of the GRI gene obtained earlier. To determine in what tissues and to what extent the gene is expressed, message levels for the GRI gene(s) should be determined by RT-PCR (Jiao et al., 2005b, 2008).

13.7.1.2 Expression of GRI Gene(s) In Vitro

To compare the protein sequences from normal and GRI genes, one should insert the corresponding nucleotide sequences into expression vectors. First, one should carefully design a pair of primers that flank the entire normal gene. This pair of primers will then be used to amplify cDNA derived by RT-PCR from normal and GRI mouse tissues. Protein products should be analyzed using SDS-PAGE and/or native PAGE electrophoresis with known enzymes to confirm the predicted differences between the normal and congenic proteins.

13.7.1.3 In Situ Hybridization

In situ hybridization is still useful for detecting gene expression in different tissues and at different time points. For a gene that has not been studied or has an unknown function, one needs to detect its expression during development and in a variety of tissues. The procedure should follow that outlined in our previous publications (Jiao et al., 2005a, 2005b, 2008).

13.7.1.4 Antibody Generation and Immunolocalization

The generation of one or more high-quality antibodies to the encoded proteins of interest should allow for subcellular localization studies and, in addition, corroborate the in situ hybridization results (Jiao et al., 2005a). Furthermore, antibodies may be needed for protein–protein interaction and other biochemical studies.
In addition, BLAST analyses should be performed to exclude significant homology to other peptide sequences.

13.7.1.5 Western Blotting to Characterize Antibody Specificity

Standard immunohistochemical methods should be used to localize encoded proteins in the tissues of interest (Gu et al., 2002; Jiao et al., 2005a).

13.7.2 Genetic Approach
Depending on the number of candidate genes and the nature of the mutations, one or more of the following approaches should be used to confirm the function(s) of candidate genes.

13.7.2.1 Transfer Selected Gene from GRI to the Control or Wild Type

One should plan to transfer the selected gene into its original control background. Theoretically, the genes in the GRI and the control differ only in the mutation that results in the fibrotic phenotype. Therefore, if the identified mutation truly causes disease, transferring the selected GRI gene back into the control should result in the expected phenotype. The experimental protocol is straightforward. Initially, a cross between a heterozygous individual and a control should be made. Beginning with the F1 generation, one should use primers that flank the mutation to select heterozygous individuals for backcrossing to the control. After five to six generations of backcrossing, one should intercross heterozygous mice to create homozygous individuals for the phenotype test. A key disadvantage of this approach is the difficulty of transferring only the selected GRI gene into the control mice.

13.7.2.2 Creation of Transgenic Mice That Overexpress, or a Knockout That Does Not Express, the Candidate Gene

Whether one chooses to express or to knock out the GRI gene depends on the nature of the selected gene. If it is expressed at a much higher level in the mutant than in the wild-type control mice, using an siRNA approach to knock it out or suppress its expression may cure the disease phenotype. If the expression of the selected GRI gene is decreased or absent in the mutant, then creation of a transgenic mouse with normal expression of the GRI gene may rescue the diseased individual.

13.7.3 Limitations and Alternative Approaches
It is possible that the expression results obtained do not give any idea about the function of the selected GRI genes. In this case, one should perform microarray analyses comparing homozygous GRI/+ as well as GRI/− tissues of interest to +/+ tissues of interest to identify genes that are either up- or down-regulated by the selected GRI compared to controls. If there are multiple mutations in multiple genes, double or triple knockout or transgenic individuals need to
be created. This process will involve much more work and additional difficulties. A further complication is the interaction between mutation(s) within the GRI locus and the genomic background. Phenotypic variation of the same mutation in different genomic backgrounds is very common. Once the mutation(s) is identified, its interactions with other genes and in different genomic backgrounds should be an important aspect of future studies.
13.8 QUESTIONS AND ANSWERS

Q1. When determining the relative importance of a gene in the genome region of interest of a disease model, which three major features of the gene need to be considered?

A1. Whether there is a difference in DNA sequence between the study subject and the wild type or control, whether there is a difference in expression level between subject and control, and whether the gene has a reported relevant function or a pathway connection to the disease.
13.9 REFERENCES

Baehner RL, Kunkel LM, Monaco AP, Haines JL, Conneally PM, Palmer C, Heerema N, Orkin SH. (1986). DNA linkage analysis of X chromosome-linked chronic granulomatous disease. Proc Natl Acad Sci U S A 83(10):3398–401.
Gu W, Li X, Lau KH, Edderkaoui B, Donahue LR, Rosen CJ, Beamer WG, Shultz KL, Srivastava A, Mohan S, Baylink DJ. (2002). Gene expression between a congenic strain that contains a quantitative trait locus of high bone density from CAST/EiJ and its wild-type strain C57BL/6J. Funct Integr Genomics 1(6):375–86.
Hall AG, Hamilton P, Minto L, Coulthard SA. (2001). The use of denaturing high-pressure liquid chromatography for the detection of mutations in thiopurine methyltransferase. J Biochem Biophys Meth 47(1–2):65–71.
Jiao Y, Jin X, Yan J, Zhang C, Jiao F, Li X, Roe BA, Mount DB, Gu W. (2008). A deletion mutation in Slc12a6 is associated with neuromuscular disease in gaxp mice. Genomics 91(5):407–14.
Jiao Y, Li X, Beamer WG, Yan J, Tong Y, Goldowitz D, Roe B, Gu W. (2005a). Identification of a deletion causing spontaneous fracture by screening a candidate region of mouse chromosome 14. Mamm Genome 16(1):20–31.
Jiao Y, Yan J, Jiao F, Yang H, Donahue LR, Li X, Roe BA, Stuart J, Gu W. (2007). A single nucleotide mutation in Nppc is associated with a long bone abnormality in lbab mice. BMC Genet 8:16.
Jiao Y, Yan J, Zhao Y, Donahue LR, Beamer WG, Li X, Roe BA, Ledoux MS, Gu W. (2005b). Carbonic anhydrase-related protein VIII deficiency is associated with a distinctive lifelong gait disorder in waddles mice. Genetics 171(3):1239–46.
Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–62.
Pereira E, Tamia-Ferreira MC, Cardoso RS, Mello SS, Sakamoto-Hojo ET, Passos GA, Donadi EA. (2004). Immunosuppressive therapy modulates T lymphocyte gene expression in patients with systemic lupus erythematosus. Immunology 113(1):99–105.
Rhodes DR, Miller JC, Haab BB, Furge KA. (2002). CIT: identification of differentially expressed clusters of genes from microarray data. Bioinformatics 18(1):205–06.
Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, Cole FS, Curnutte JT, Orkin SH. (1986). Cloning the gene for an inherited human disorder–chronic granulomatous disease–on the basis of its chromosomal location. Nature 322(6074):32–38.
Selaru FM, Zou T, Xu Y, et al. (2002). Global gene expression profiling in Barrett’s esophagus and esophageal cancer: a comparative analysis using cDNA microarrays. Oncogene 21(3):475–78.
Tabone T, Sallmann G, Chiotis M, Law M, Cotton R. (2006). Chemical cleavage of mismatch (CCM) to locate base mismatches in heteroduplex DNA. Nat Protoc 1(5):2297–304.
Wade CM, Daly MJ. (2005). Genetic variation in laboratory mice. Nat Genet 37(11):1175–80.
Xiong Q, Jiao Y, Hasty KA, Canale ST, Stuart JM, Beamer WG, Deng HW, Baylink D, Gu W. (2009). Quantitative trait loci, genes, and polymorphisms that regulate bone mineral density in mouse. Genomics 93(5):401–14.
Xiong Q, Jiao Y, Hasty KA, Stuart JM, Postlethwaite A, Kang AH, Gu W. (2008b). Genetic and molecular basis of QTL of rheumatoid arthritis in rat: genes and polymorphisms. J Immunol 181(2):859–64.
Xiong Q, Qiu Y, Gu W. (2008a). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13.
CHAPTER 14

Determination of the Function of a Mutation

BOUCHRA EDDERKAOUI

Contents
14.1 Introduction
14.2 Concept of Quantitative Trait Loci
14.2.1 Mouse Model
14.2.2 Human Diseases and Association Studies
14.3 How to Determine if a Mutation is Functional
14.3.1 Test to Determine if the Mutation is Null
14.3.2 Dosage Analysis: Quantitative or Qualitative Changes Due to Mutation
14.3.3 Complementation Test
14.4 Effect of Mutations on the Function of the Gene
14.4.1 Nonsynonymous SNPs
14.4.2 Synonymous SNPs
14.4.3 Regulatory SNPs
14.5 General Strategy to Assess the Effect of a Mutation on the Function of a Gene and the Observed Phenotypic Variations
14.5.1 Sequence Analyses
14.5.2 Tissue Specificity
14.5.3 Expression Profiling
14.5.4 In Vitro Functional Studies
14.5.5 In Vivo Functional Studies and Mouse Model
14.6 Questions and Answers
14.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
14.1 INTRODUCTION

Mutations are permanent changes in the DNA sequence. They range in size from a single DNA building block (DNA base) to a large segment of a chromosome and can be inherited from a parent or acquired during a lifetime. The concept of mutation has inflamed imaginations for generations, as its potential power to benefit or harm humans is enormous. Depending on the type of mutation and the type of cells affected, the results can be positive, negative, or neutral. Some genetic changes are very rare; others are common in the population.

The single nucleotide polymorphism (SNP) is the simplest form of DNA variation among individuals. SNPs can be transitions or transversions. They occur throughout the genome at a frequency of about one per 1,000 bp (Shastry, 2009), although some genomic regions show a higher SNP frequency than others. SNPs that change the encoded amino acid are called nonsynonymous SNPs; those that change the DNA sequence but not the encoded amino acid are called synonymous SNPs. SNPs can also occur in noncoding regions, where they may influence promoter activity (gene expression), messenger RNA (mRNA) conformation (stability), and the subcellular localization of mRNAs and/or proteins, and hence may lead to disease. These small DNA variations create diversity among individuals and could be responsible for genome evolution and for the most common familial traits, such as skin and eye color, and interindividual differences in drug response. They can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. Therefore, understanding the effect of specific mutations on gene function and individual health is key to developing the concept of personalized medicine.

To determine the effect of a mutation on the function of a gene, the following parameters need to be carefully analyzed:

• Mutation-phenotype linkage.
• Position of the mutation in the gene sequence.
• Genomic changes caused by the mutation.
• Protein changes caused by the mutation.
• Tissue specificity.

In this chapter, I will discuss each parameter and the strategies that could lead to a better understanding of the impact of different mutations on gene function, with a focus on SNPs.

14.2 CONCEPT OF QUANTITATIVE TRAIT LOCI

14.2.1 Mouse Model

A quantitative trait locus (QTL) is a polymorphic locus containing alleles that differentially affect the expression of a specific phenotypic trait (a genetic basis
for physiological variation) (Nadeau and Frankel, 2000). The purpose of QTL research is to identify genes and gene variants or mutations that contribute to the expression of these traits. QTL are identified via association of the studied traits with genetic markers, which are polymorphic sequences that characterize each species or strain of mice. The use of inbred strains of mice in this setting has proven to be a viable alternative to human genetic studies, given the degree of control that can be exercised over experimental parameters such as environment, breeding scheme, and detailed phenotyping. Over the past 20 years, QTL mapping has led to the identification of numerous genetic loci for a variety of traits relevant to human diseases, including behavioral differences, lipid levels, obesity, atherosclerosis, and osteoporosis (Korstanje and Paigen, 2002; Allayee et al., 2003; Williams and Spector, 2007). Genomewide SNP screens combined with microsatellite markers have been used successfully to identify QTL for complex traits. Bice et al. (2009) genotyped a total of 867 informative SNPs in an F2 population of 989 mice derived from a cross between high- and low-alcohol-preferring mouse lines. QTL analyses detected significant evidence of association between the phenotypic variation and multiple chromosomal regions in the mouse. Several of the identified regions included candidate genes previously associated with alcohol dependence in humans or other animal models. (More details on QTL studies are described in Chapter 19.) After the identification of a QTL, most researchers perform further fine mapping to reduce the size of the region and hence refine the list of potential candidate genes. This can be achieved by creating additional recombination events through selective breeding (Darvasi, 1998) or by exploiting historical recombinations (Cardon and Bell, 2001).
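The marker-trait association that underlies QTL detection can be illustrated with a single-marker LOD score. The Python sketch below is one common formulation and is not taken from the chapter: it assumes a normally distributed trait, compares a model with one mean per genotype class against a model with a single overall mean, and uses invented F2 data. The function name `marker_lod` is hypothetical.

```python
import math
from collections import defaultdict


def marker_lod(genotypes, trait):
    """LOD score for association between one marker and a quantitative trait.

    LOD = (n / 2) * log10(RSS_null / RSS_full), where RSS_null uses the grand
    mean and RSS_full uses a separate mean for each genotype class.
    Assumes some residual variation remains within classes (RSS_full > 0).
    """
    n = len(trait)
    grand_mean = sum(trait) / n
    rss_null = sum((y - grand_mean) ** 2 for y in trait)

    groups = defaultdict(list)
    for genotype, y in zip(genotypes, trait):
        groups[genotype].append(y)
    rss_full = sum(
        (y - sum(values) / len(values)) ** 2
        for values in groups.values()
        for y in values
    )
    return (n / 2) * math.log10(rss_null / rss_full)


# Invented F2 data: trait values and genotypes at two markers.
trait = [10, 12, 15, 17, 20, 22]
linked_marker = ["AA", "AA", "AB", "AB", "BB", "BB"]
unlinked_marker = ["AA", "BB", "AB", "AA", "BB", "AB"]

lod_linked = marker_lod(linked_marker, trait)
lod_unlinked = marker_lod(unlinked_marker, trait)
print(round(lod_linked, 2), round(lod_unlinked, 2))
```

Real QTL studies scan many markers, use interval mapping between markers, and set significance thresholds by permutation; this sketch shows only the core comparison at one marker.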
The candidate genes are then selected based on the following criteria: (1) position within the QTL, (2) known function, (3) expression profile, and (4) sequence variations between the strains of mice analyzed. Several genes responsible for complex traits have been identified by combining QTL analyses, gene expression profiling, and mutation analyses (Kleeberger et al., 2000; Klein et al., 2004; Edderkaoui et al., 2007).

14.2.2 Human Diseases and Association Studies

14.2.2.1 Genomewide Associations

Scientists have spent decades mapping human disease genes. Initially, the focus was on identifying genetic mutations responsible for single-gene disorders. More recently, new technology has made it possible to link multiple genes to a single disease and to connect multiple diseases to one another through the genes associated with them. Genomewide association (GWA) studies are performed to identify the genes involved in complex traits and human diseases; this method searches the genome for DNA variations that are associated with different traits. If association is present, a particular allele, genotype, or haplotype of one or more polymorphisms will be seen more often
than expected by chance within a population that shows the trait. Thus a person carrying one or two copies of a high-risk variant is at increased risk of developing the associated disease or having the associated trait. Because GWA studies examine DNA variations across the genome, they represent a promising way to study complex, common diseases in which many genetic variations each contribute to a different degree to the development of the disease. In comparison to family linkage-based approaches, association studies have two key advantages: (1) They are able to capitalize on all meiotic recombination events in a population rather than only those in the families studied. Thus association signals are localized to small regions of the chromosome containing only one or a few genes, enabling rapid detection of the actual disease susceptibility gene. (2) GWA studies allow the identification of disease genes that confer only modest increases in risk. Detecting such genes is a severe limitation of linkage studies, yet they are the very type of genes one expects for common disorders. The power to detect association between genetic variation and disease is a function of several factors, including the frequency of the risk allele or genotype, the relative risk conferred by the disease-associated allele or genotype, the correlation between the genotyped marker and the risk allele, sample size, disease prevalence, and genetic heterogeneity of the sample population. Although the first three factors are unknown before a specific GWAS is performed, their impact can be influenced by the study design. The key elements for the success of an association study include sufficient sample sizes, rigorous phenotypes, comprehensive maps, accurate high-throughput genotyping technologies, sophisticated information technology infrastructure, rapid algorithms for data analysis, and rigorous assessment of genomewide signatures. GWAS have played a large role in unraveling these complex relationships.
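The statement that an allele is "seen more often than expected by chance" can be made concrete with an allelic chi-square test on a 2 × 2 table of allele counts in cases and controls. The Python sketch below is a minimal illustration with invented counts; the function name `allelic_association` is hypothetical, and a real GWAS would also correct for testing many markers.

```python
import math


def allelic_association(case_risk, case_other, control_risk, control_other):
    """Chi-square test (1 degree of freedom) on a 2 x 2 table of allele counts.

    Returns the chi-square statistic, its p-value, and the odds ratio.
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Invented counts: 100 cases and 100 controls, two alleles per person.
chi2, p_value, odds_ratio = allelic_association(60, 140, 40, 160)
print(round(chi2, 2), round(p_value, 3), round(odds_ratio, 2))
```

With these counts the risk allele is enriched in cases (odds ratio about 1.7), but the p-value would be far from significant at a genomewide threshold, which illustrates why sample size is listed among the key elements of a successful association study.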
Although these studies do not account for the many environmental factors that contribute to disease, they have revealed numerous gene–disease associations (Lees and Satsangi, 2009; Bajaj et al., 2010; Vanunu et al., 2010). These findings have encouraged healthcare professionals to think about the molecular pathways of diseases, which in turn has led to the development of various new treatment options. Of course, there is still a long way to go before much of the knowledge provided by GWAS can actually be used to treat or cure human disorders. One prominent obstacle along the path to this goal is determining how best to manage the ever-growing body of gene association data.

14.2.2.2 Haplotypes

Genetic association studies can also be performed with haplotypes. A haplotype is the series of genetic variants on a specific chromosome that are inherited from one parent. In subsequent generations, the chromosomal haplotype is progressively broken up by crossing over events in meiosis. In general, the term haplotype refers to closely linked genetic loci. SNPs that are located in close proximity tend to travel together,
a phenomenon that is known as linkage disequilibrium (LD). In general, loci that are located more closely together on a chromosome will be in stronger LD than loci located far apart, but the correlation between LD and the physical distance separating two loci is modest: some loci that are separated by 20 bp will not be in LD, whereas other loci separated by 200,000 bp will be in tight LD (Pritchard and Przeworski, 2001). One approach to haplotype association testing is to assign the most likely haplotypes to each individual in a study population and then determine whether the distribution of assigned haplotypes differs between cases and control subjects or within families. However, this approach does not adjust for the uncertainty in haplotype assignment. Therefore, approaches that explicitly incorporate the relative probabilities of each haplotype for each individual are preferred. The statistical genetic issues in haplotype association studies of unrelated individuals have been reviewed by Schaid (2004). A variety of haplotype-based association methods have been developed for both unrelated subjects and families (Clayton and Jones, 1999; Schaid et al., 2002; Horvath et al., 2004; Morris et al., 2004; Satten and Epstein, 2004). Schaid et al. (2002) developed a regression-based score test for haplotype association in unrelated subjects that allows testing of both global haplotype association and individual haplotype association, as implemented in the Haplo.Stats program. Such regression-based approaches have a number of advantages, including the inclusion of covariates for environmental and other nongenetic factors as well as the inclusion of haplotype by environment interactions (Lake et al., 2003).
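The strength of LD between two biallelic loci is usually summarized by D′ and r². The Python sketch below computes both from haplotype counts using the standard definitions; the counts are invented, the function name `linkage_disequilibrium` is hypothetical, and the sketch assumes phased haplotypes are already available and that both loci are polymorphic.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) for two biallelic loci from phased haplotype counts.

    Assumes both loci are polymorphic; a monomorphic locus would give a
    zero denominator.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at locus 2
    d = n_AB / n - p_A * p_B         # deviation from independence

    # Largest |D| possible given the allele frequencies.
    if d > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r_squared = d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


# Invented counts: 100 phased haplotypes in partial LD, and 100 in equilibrium.
d, d_prime, r2 = linkage_disequilibrium(40, 10, 10, 40)
d0, d_prime0, r2_0 = linkage_disequilibrium(25, 25, 25, 25)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```

Values of D′ and r² near 1 indicate that the two loci are inherited together almost completely, while values near 0 indicate independence, regardless of the physical distance between them.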
14.3 HOW TO DETERMINE IF A MUTATION IS FUNCTIONAL
14.3.1 Test to Determine if the Mutation is Null
When analyzing gene function, geneticists start by searching for multiple mutations with a particular phenotype in a genetic screen. Once this is accomplished, the next step is usually to do complementation tests to figure out how many genes are represented by the mutations that were isolated in the screen. However, it is important to figure out whether the individual mutations are loss of function, gain of function, or something else (dominant negative, neomorphic, etc.). In the case of null mutations, the function of the gene is completely abolished either due to a total absence of the translated product or the production of a totally inactive product. Thus one way to know if the mutation is null is to know the type of DNA modification caused by the mutation and the protein structure. Another way is to determine whether the phenotype is completely altered by the mutation and to evaluate if the gene and the protein are expressed and translated, respectively, by simply comparing the mRNA and the protein expression levels in the mutated and the control gene.
Three types of mutations can cause loss of function:
• Deletion is the loss of one or several bases.
• Insertion represents the addition of one or several bases, which can be of various natures, for instance, duplication of a preexisting DNA sequence or insertion of a foreign sequence, such as a viral sequence.
• Substitution consists in the replacement of one base by a different one, with no change in the total number of bases in the sequence.
Deletions, insertions, or substitutions can create a stop codon at the mutated position or further along the DNA sequence, which will lead either to a truncated protein or to altered translation. When located in noncoding regions, deletions or insertions may affect gene expression; when located in coding regions, they may affect the structure of the protein and consequently its function. In the case of base substitutions, three subtypes can be identified:
• Nonsense mutation, where the replacement of one base by another creates a stop codon in place of a codon specifying an amino acid.
• Missense mutation, where the mutated codon specifies a different amino acid.
• Splice mutation, where a splicing site is suppressed or created.
Depending on the position of the polymorphism within the gene, it may affect DNA transcription, RNA splicing, RNA translation, or protein structure or quantity. The functional consequences at the phenotypic level vary in degree.
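The distinction between nonsense and missense substitutions can be illustrated with a short Python sketch that looks up the original and mutated codons in the standard genetic code. The function name and its arguments are hypothetical, and the sketch assumes a DNA codon read in frame on the coding strand. Splice mutations cannot be recognized from a single codon and are outside its scope.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon, position, new_base):
    """Classify a single-base substitution within one in-frame DNA codon.

    codon    -- three-letter DNA codon on the coding strand
    position -- 0, 1, or 2: index of the substituted base within the codon
    new_base -- the replacement base
    """
    codon, new_base = codon.upper(), new_base.upper()
    if codon not in CODON_TABLE or new_base not in BASES or position not in (0, 1, 2):
        raise ValueError("expected a DNA codon, a position 0-2, and a base in TCAG")
    if codon[position] == new_base:
        raise ValueError("the new base is identical to the original base")
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"      # an amino acid codon becomes a stop codon
    if before == "*":
        return "stop-loss"     # a stop codon becomes an amino acid codon
    return "missense"          # the codon specifies a different amino acid
```

For example, changing TAC (tyrosine) to TAA is classified as nonsense, GAG (glutamate) to GTG (valine) as missense, and CTG to CTA (both leucine) as synonymous.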
14.3.2 Dosage Analysis: Quantitative or Qualitative Changes Due to Mutation
14.3.2.1 Loss of Function
Loss of function mutations generally behave in a recessive way, because the normal allele of a heterozygous carrier retains its function and may sometimes even be transcribed at a higher level than in the normal homozygote. Thus the heterozygous carrier will show an intermediate amount of the gene product, which can be sufficient to maintain the function. Variations may occur if the function is greatly impaired once the amount of the gene product falls below a certain threshold level (relative to the wild type level). This latter phenomenon corresponds to a dosage effect (Fig. 14.1). If the amount of the gene product in the heterozygous carrier is below the threshold, then the mutation will behave in a dominant way. This situation is also described as haploinsufficiency, where the presence of one normal allele is not sufficient to maintain the function. Because the normal activity of a gene can be disturbed in many different ways, one can expect considerable molecular diversity among loss of function mutations.
[Figure 14.1 plot: amount of gene product (y-axis, 0 to 120) for the normal, heterozygote, and mutant (homozygote) genotypes. (a) Threshold > 50%: phenotype H = phenotype M, dominance. (b) Threshold < 50%: phenotype H = phenotype N, recessivity.]
Figure 14.1. The dosage effect for a loss of function mutation. Threshold values a and b used to distinguish between dominance and recessivity. N represents the homozygous normal genotype, H, the heterozygous genotype for the mutation, and M, the homozygous mutant genotype (Tixier-Boichard, 2002).
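The threshold reasoning summarized in Figure 14.1 can be expressed as a small Python sketch. The function name, the default amounts (100 for the normal homozygote, 50 for the heterozygote, 0 for the homozygous mutant), and the treatment of an amount exactly equal to the threshold as sufficient are assumptions made for illustration, not values taken from the cited study.

```python
def heterozygote_behaviour(threshold, het_amount=50.0,
                           normal_amount=100.0, mutant_amount=0.0):
    """Classify a loss-of-function allele under a simple threshold model.

    All amounts are expressed as a percentage of the wild type level.
    Assumption: function is maintained when amount >= threshold.
    """
    def functional(amount):
        return amount >= threshold

    # The model only makes sense if N is functional and M is not.
    if not functional(normal_amount) or functional(mutant_amount):
        raise ValueError("threshold must lie between the mutant and normal amounts")
    if functional(het_amount):
        return "recessive"                     # phenotype H = phenotype N
    return "dominant (haploinsufficiency)"     # phenotype H = phenotype M
```

With the default amounts, a threshold of 30% gives a recessive mutation and a threshold of 70% gives a dominant one. If the remaining normal allele is upregulated so that the heterozygote reaches 80%, the same 70% threshold again yields recessive behaviour.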
Purely genetic evidence, without biochemical studies, can often suggest whether a phenotype is caused by loss or gain of function. When a clinical phenotype results from loss of function of a gene, we would expect any change that inactivates the gene product to produce the same clinical result. We should be able to find point mutations that have the same effect as mutations that delete or disrupt the gene. Waardenburg syndrome type 1 provides a good example of loss of function mutation, since missense mutations, nonsense mutations, and, in some patients, complete deletion of the PAX3 sequence all produce the same clinical result (Sheffer and Zlotogora, 1992; Wollnik et al., 2003).
14.3.2.2 Dominant Negative Mutations
This situation is observed when the gene product of the mutated allele is only partially active and may interfere with the normal gene product. This is particularly the case when the mutant gene product acts antagonistically to the wild type product as a cofactor, or when it is involved in the formation of a dimer or polymer. In the case of polymeric molecules, such as collagen, dominant negative mutations are often more deleterious than mutations that yield no gene product (null mutations or null alleles). A defect in one component of a dimer or polymer may be sufficient to impair the overall function, because the abnormal product also disturbs the normal product when both allelic products form a dimer or polymer. The mutation behaves in a dominant way because a single mutated allele is able to impair the gene function. This phenomenon is well illustrated in human genetics by mutations of the nuclear hormone receptors (Yen and Chin, 1994). The same applies when the dimer involves the product of a different gene; the mutation of one gene has a negative epistatic effect on the function usually associated with the other gene. Another example of a dominant negative mutation was reported by Thomas et al. (1997). The group identified a point mutation in the gene encoding cartilage-derived morphogenetic protein 1 (CDMP-1). The mutation substitutes a tyrosine for the first of seven highly conserved cysteine residues in the mature active domain of the protein. They showed that the mutation results in a protein that is not secreted and is inactive in vitro. It produces a dominant negative effect by preventing the secretion of other related bone morphogenetic protein family members.
14.3.2.3 Gain of Function Mutations
Gain of function mutations usually cause a dominant phenotype; they change the gene product such that it gains a new and abnormal function. A new function can arise through the production of an aberrant protein or the expression of a normal protein under abnormal conditions, either at an unusual age or in an aberrant location. These mutations may have severe phenotypic effects. Activating mutations have most commonly been found in thyroid nodules; many are not heritable (germline) mutations but somatic mutations that develop under specific environmental conditions such as long-term iodine deficiency or exposure to goitrogens. Several germline mutations within the thyroid-stimulating hormone receptor (TSHR) gene have been identified as gain of function mutations (Esapa et al., 1999; Alberti et al., 2001) in families that showed hyperthyroidism with severe thyrotoxicosis. The effect of the mutation on the function of the gene was evaluated after cloning and site-directed mutagenesis. COS-7 cells were transfected with wild type or mutated TSHR cDNA. The cells transfected with mutated TSHR displayed increased constitutive activity toward the cAMP pathway when compared with cells transfected with the wild type TSHR (Alberti et al., 2001).
14.3.3 Complementation Test
Genetic studies start with the identification of the molecular components that contribute to a particular process and then determine how those molecules communicate or work together to execute that process.
There are several strategies to unravel the networking of each process. However, there are certain rules that apply to all situations. The first is to identify multiple mutations with a particular phenotype in a genetic screen. Once this is accomplished, the next step is to do complementation tests to figure out how many genes are represented by the mutations that were isolated in the screen. It is important to figure out whether the individual mutations isolated are loss of function, gain of function, or something else (dominant negative, neomorphic, etc.). For that purpose, there is a need to combine the mutations in different genes that give opposite or at least different phenotypes. Combining mutations with the same or similar phenotypes is not often informative, since the combination is likely to give the same phenotype as either individual gene. The one exception is when two mutations act as enhancers of each other. In this case, the two mutations that generate similar phenotype can be combined to generate either a new phenotype or a probably stronger phenotype. Two
loss of function mutations that give different phenotypes or a loss of function allele of one gene with a gain of function allele of another gene can be combined in the suppressor and enhancer problem set. Suppressor and enhancer screens are usually designed to find additional genes that act in the same process or related/parallel processes. They generally work by mutating cells or animals that already carry one mutation and looking for a milder phenotype (to find suppressors) or a stronger or different phenotype (to find enhancers).
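The step described above, in which complementation tests are used to work out how many genes are represented by the mutations isolated in a screen, amounts to grouping mutations that fail to complement one another. A minimal Python sketch follows. The function name and the input format are hypothetical, and the sketch assumes recessive loss of function alleles, for which failure to complement indicates that two mutations lie in the same gene. Exceptions such as intragenic complementation are ignored.

```python
def complementation_groups(mutations, complements):
    """Group mutations into complementation groups.

    mutations   -- iterable of mutation names
    complements -- dict mapping a pair (a, b) to True if the double
                   heterozygote is wild type (the mutations complement) or
                   False if it shows the mutant phenotype (they fail to)
    """
    parent = {m: m for m in mutations}   # union-find forest, one tree per group

    def find(m):
        # Follow parent links to the group representative, shortening the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), complemented in complements.items():
        if not complemented:             # failure to complement: same gene
            parent[find(a)] = find(b)

    groups = {}
    for m in mutations:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(group) for group in groups.values())
```

The number of groups returned is the estimated number of genes identified in the screen. For instance, if m1 and m2 fail to complement each other, m3 and m4 fail to complement each other, and every other pair complements, the result is two groups: m1 with m2, and m3 with m4.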
14.4 EFFECT OF MUTATIONS ON THE FUNCTION OF THE GENE
The candidate mutation, whether identified through linkage, candidate gene, and/or bioinformatics analyses, can be verified through a number of methods. The choice of approach for determining the effect of a specific mutation on the function of a gene depends on the position of the mutation and the type of mutation (synonymous, nonsynonymous, or regulatory SNP).
14.4.1 Nonsynonymous SNPs
SNPs located within the coding region of genes have been extensively studied, including those that cause amino acid codon alterations (nonsynonymous variants) that can lead to protein misfolding, polarity shift, improper phosphorylation, and other functional consequences. The impact of nonsynonymous SNPs can be assessed using bioinformatic tools to evaluate the importance of the amino acids they affect. These SNPs fall into three categories: (1) the SNP that causes a change in a part of the protein sequence that is not conserved over long evolutionary distances; (2) the SNP that causes a change in a conserved domain of the protein, but the changed residue is not conserved; and (3) the SNP that changes a conserved residue within a conserved domain. Using publicly available sequences and some simple computational tools available at the websites listed below, the conserved domains of the protein as well as the amino acids conserved through species evolution can be identified. The SNP can then be categorized as among those most likely or least likely to affect the function of the protein.
www.ensembl.org/index.html
www.ncbi.nlm.nih.gov/guide/sequence-analysis
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
Other, more sophisticated computational tools can be used. The Sorting Intolerant from Tolerant (SIFT) software uses sequence homology to predict whether an amino acid substitution will affect protein function and hence
potentially alter the phenotype (Ng and Henikoff, 2001, 2002). The program is available at http://sift.jcvi.org. After computational analyses, it is still necessary to perform functional studies to confirm the effect of the mutation on the function of the gene product. The different experiments that can be performed to confirm the function of specific mutations are explained later in this book.
14.4.2 Synonymous SNPs
Because of the relatively high frequency of SNPs in the human genome, synonymous SNPs (sSNPs) were often disregarded in pharmacogenomic studies on the assumption that they are silent mutations. However, recent genetic studies have shown that synonymous mutations can have important fitness consequences, with >40 genetic diseases now associated with such silent mutations (Chamary et al., 2006). There is now clear evidence that synonymous codons are not used randomly, that preferred codons correlate strongly with the relative abundance of the corresponding tRNAs, and that natural selection acts on synonymous mutations. Kimchi-Sarfaty et al. (2007) investigated the effect of a synonymous SNP in the human multidrug resistance 1 (MDR1) gene. The SNP was selected as part of a haplotype previously linked to altered function of the MDR1 gene product, P-glycoprotein (P-gp), which is implicated both in determining drug pharmacokinetics and in multidrug resistance in human cancer cells. The SNP results in P-gp with altered drug and inhibitor interactions. No difference in mRNA or protein levels was found between cells transfected with the mutant allele and cells transfected with the wild type allele, but the mutated P-gp showed an altered conformation. The authors therefore suggested that the presence of a rare codon, marked by the synonymous polymorphism, affects the timing of cotranslational folding and insertion of P-gp into the membrane, thereby altering the structure of substrate and inhibitor interaction sites. Similarly, Nackley et al. (2006) reported that haplotypes divergent in synonymous SNPs in the catechol-O-methyltransferase gene exhibited the largest difference in enzymatic activity, due to a reduced amount of translated protein. The mutated gene showed a change in local RNA stem-loop structures, such that the most stable structure was associated with the lowest protein levels and enzymatic activity. Site-directed mutagenesis that eliminated the stable structure restored the amount of translated protein, which highlights the functional significance of synonymous SNPs. Furthermore, it has been shown that a synonymous SNP of the corneodesmosin gene leads to increased mRNA stability, and carriers of haplotypes that include this SNP are more likely to develop psoriasis (Capon et al., 2004). In summary, there are at least three relatively well resolved mechanisms by which synonymous mutations can affect fitness (Joanna and Parmley, 2007): (1) mRNA structure and stability, (2) kinetics of translation, and (3) alternative splicing (Fig. 14.2). It is also likely that overlapping transcripts, which may well
[Figure 14.2 diagram: three mechanisms (mRNA structure and stability; kinetics of translation; alternate splicing) leading to a change in protein amount, structure, and/or function.]
Figure 14.2. Mechanisms by which synonymous mutations could affect the protein expression and/or its function.
be much more common than one would think (Sun et al., 2006), will impose some form of extra constraint (Lipman, 1997) on mutations that are synonymous in only one of the two genes. Very recently, Okamoto et al. (2010) identified the potassium inwardly rectifying channel subfamily J member 15 (KCNJ15) gene as a new type 2 diabetes mellitus (T2DM) susceptibility gene. A sSNP, rs3746876, in exon 4 (C566T) of this gene has been associated with T2DM in three independent Japanese sample sets. To determine the effect of the sSNP on the function of the gene, the authors cloned the wild type and the mutated KCNJ15 in human embryonic kidney 293 cells. The functional analysis demonstrated that the risk allele of the sSNP in exon 4 increased KCNJ15 expression via increased mRNA stability, which resulted in higher protein expression than that of the nonrisk allele. Overexpression of KCNJ15 decreased insulin secretion in high-glucose conditions, while no significant change was found under normoglycemic conditions.
14.4.3 Regulatory SNPs
SNPs located within noncoding regions of the genome are the least predictable. While mostly regarded as nonfunctional, this type of alteration can affect gene regulatory sequences such as promoters, enhancers, and silencers (Ponomarenko et al., 2002). Termed regulatory SNPs (rSNPs), these variations have become more prominent in recent studies (Wang et al., 2005, 2007; Knight, 2003, 2005). Transcription factor (TF) binding sites are the most attractive
[Figure 14.3 diagram: TF–TFBS interaction in a promoter upstream of the gene start codon (ATG). Panels show a SNP in the TFBS causing no change, increased binding, decreased binding, no binding, or novel binding (gene controlled by another TF), together with the resulting expression.]
Figure 14.3. The impact of a regulatory SNP in a transcription factor binding site (TFBS). In most cases, the SNP will not change TF binding activity or target gene expression because the TF, in general, tolerates variation in the consensus sequence of the binding site. In some cases, a SNP may increase or decrease the stability of the binding, leading to allele-specific gene expression. Rarely, the SNP eliminates the natural binding site or generates a novel binding site, and consequently the gene is no longer controlled by the original TF (Chorley et al., 2008).
regions to search for functional rSNPs. A SNP in a TF binding site can have multiple consequences. In most cases, a SNP does not change the interaction between the TF and its binding site, nor does it alter gene expression, since a TF will usually recognize a considerable number of binding sites. In some cases, a SNP may increase or decrease the binding, leading to allele-specific gene expression. In rare cases, a SNP may eliminate the natural binding site or generate a novel binding site, so that the gene can no longer be regulated by the original TF. Thus, functional rSNPs in TF binding sites may lead to differences in gene expression (Fig. 14.3) and phenotypes, and ultimately affect susceptibility to environmental exposure. Indeed, there are numerous examples of rSNPs associated with disease susceptibility, including hypercholesterolemia (Ono et al., 2003), hyperbilirubinemia (Sugatani et al., 2002; Bosma et al., 1995), myocardial infarction (Nakamura et al., 2002), acute lung injury (Marzec et al., 2007), and asthma (Jinnai et al., 2004). Identification and experimental verification of functional rSNPs are key limiting steps in an efficient functional polymorphism discovery process.
14.4.3.1 Bioinformatics
Successful bioinformatics identification of functional rSNPs requires identification of putative regulatory sequences, or motifs, and the co-location of SNPs in these sequences. Some genes have cis-regulatory sequences within 10 kb (sometimes more) of the transcription start site. Computational methods for the identification of cis-regulatory sequences have been successfully applied to simple organisms such as yeast
and worm. Although some methods have been plagued by high false positive rates in mammals, primarily because of the very large quantity of intergenic sequence present (Chang et al., 2006), many recent bioinformatics algorithms have improved prediction (Warner et al., 2008). These include examining evolutionarily conserved regulatory sequences in upstream sequences of orthologous genes across species (Wang et al., 2007; Boffelli et al., 2003; Sun et al., 2004; Cliften et al., 2003) and identifying statistically overrepresented motifs in the upstream regions of genes that are co-regulated in microarray expression profiles (Warner et al., 2008; Haverty et al., 2004).
14.4.3.2 Experimental Assessment of Regulatory SNPs
After computational analyses, it is still necessary to verify the functional impact of a novel gene polymorphism using basic molecular biology techniques such as electrophoretic mobility shift assay (EMSA) to assess DNA binding; chromatin immunoprecipitation (ChIP), a binding assay that provides insight into gene regulation in an endogenous state; and luciferase reporter constructs to test the effect of a SNP on regulatory element function. These procedures tend to be laborious and are not well suited to screening large numbers of DNA elements and SNPs. It is, therefore, imperative to develop high-throughput methods to assess regulatory regions in the genome. Indeed, Chorley and collaborators (2008) have reviewed some new high-throughput methodologies, such as surface plasmon resonance (SPR) imaging array analysis, oligoconjugated microsphere binding assays, and allelic imbalance methods, that can help shorten the assessment of rSNP function.
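As an illustration of the bioinformatic step in Section 14.4.3.1, a SNP that falls within a putative TF binding site can be screened by scoring both alleles against a position weight matrix for the motif and comparing the results. The Python sketch below is a simplified illustration, not a method taken from the studies cited: the function names, the pseudocount, and the uniform background frequency are assumptions, only the forward strand is scanned, and sequences are assumed to contain only A, C, G, and T.

```python
import math


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into a log2-odds weight matrix.

    counts -- list of dicts, one per motif position, mapping base to count
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def best_site_score(sequence, matrix):
    """Return the highest motif score over all forward-strand windows."""
    sequence = sequence.upper()
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(matrix[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


def allele_score_difference(ref_seq, alt_seq, matrix):
    """Score change caused by the SNP: negative values suggest weaker binding."""
    return best_site_score(alt_seq, matrix) - best_site_score(ref_seq, matrix)
```

A large negative difference flags a SNP that may weaken or abolish a predicted site, and a large positive difference flags a possible novel or strengthened site. Either prediction would still need to be confirmed experimentally, for example by EMSA, ChIP, or reporter assays.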
14.5 GENERAL STRATEGY TO ASSESS THE EFFECT OF A MUTATION ON THE FUNCTION OF A GENE AND THE OBSERVED PHENOTYPIC VARIATIONS
In summary, the overall strategy to determine the effect of the mutation on the function of the gene, or the mechanism by which a mutation affects a specific phenotype, starts by identifying the type of mutation, the location of the mutation within the gene sequence, and the effect of the mutation on the RNA or protein structure. Then, in vitro functional studies allow the evaluation of the molecular pathways affected by the mutation, and in vivo studies determine the effect of the mutation on cell interaction and on the overall phenotype expressed by the affected individual.
14.5.1 Sequence Analyses
As described earlier, the first step in determining the role of a mutation in the function of a gene is to determine the type of mutation, its position in the DNA sequence, and its effect on the structure of the RNA or on the amino acid sequence. This can be done by sequencing, followed by analysis with the computer tools described above.
14.5.2 Tissue Specificity
It is important to determine which cells or tissues express the gene under analysis, since some genes are tissue or cell specific. Gene expression therefore has to be evaluated in different tissues and cells. This information helps in estimating the function of the gene and in choosing the experiments used to test the effect of the mutation on that function.
14.5.3 Expression Profiling
There are different methods to determine the effect of a mutation on the expression of the mutated gene or protein. The tissues or cells that express the analyzed gene are isolated from affected and nonaffected individuals, and gene expression is then evaluated in the wild type and the affected cells by quantitative real-time polymerase chain reaction (RT-PCR) (Logan et al., 2009). Protein expression can be evaluated by western immunoblotting; protocols that describe different methods for isolating proteins and performing western blotting are available online (www.westernblotting.org). Immunohistochemical staining is also used to determine the effect of a mutation on protein expression and on the localization of the protein in different cell compartments.
14.5.4 In Vitro Functional Studies
Biologists mostly start with in vitro studies to determine the functional consequence of a specific mutation. For in vitro studies, the choice of the cells to be used is crucial for the success of the study; in the case of a candidate gene/mutation, some investigators first use the cells that show high expression of the candidate. The expression and the activity of the candidate gene, or of any molecule that interacts with this gene, are assessed in cells isolated from affected and nonaffected individuals, on the assumption that any change identified between the affected and nonaffected cells is due to the mutation of interest. This approach allows the investigators to measure different parameters with great precision and to explore the molecular mechanisms of the mutation studied. However, it does not eliminate the effect of the differing genetic backgrounds of the two cell populations, especially when the cells are human derived. Therefore, it is necessary to confirm the results by cloning and site-directed mutagenesis. cDNA from nonaffected individuals is generated by reverse transcriptase, and the cDNA of the analyzed gene is amplified and cloned under a strong promoter. The mutation of interest is then introduced by site-directed mutagenesis. The vector carrying the mutated allele and the control vector are each transfected into a single cell line to overcome the effect of different genetic backgrounds. The choice of the cell line to be used in the study depends on the planned experiments. Wataha et al. (1994) tested four cell lines to determine the best cell line
for in vitro biological tests that assess the cytotoxicity of dental materials. Lindén et al. (2007) tested nine human gastrointestinal epithelial cell lines to improve in vitro model systems for gastrointestinal infection studies. The use of in vitro cell models for functional studies has many advantages over in vivo systems. First, variation among individuals is eliminated, especially when using cell lines, as are the confounding interactions with cells other than the one under study. Moreover, in vitro systems can be manipulated in ways not possible in vivo, allowing investigators to measure the effects of different variables (e.g., temperatures and pharmacological agents) with greater precision and to explore the molecular mechanisms of the gene/mutation studied. However, these advantages are offset by the loss of the in vivo context (e.g., cues from extracellular matrix, other cell types), which undoubtedly provides levels of regulation that are missing in vitro. For this reason, it is important to confirm the in vitro results in the in vivo situation and to compare results obtained in the two systems.
14.5.5 In Vivo Functional Studies and Mouse Model
Animal models such as fruit fly (Drosophila melanogaster), zebrafish (Danio rerio), and mouse (Mus musculus) have been used since 1930 to investigate different human traits and diseases (Paigen, 2003; Lieschke and Currie, 2007; Rosenthal and Brown, 2007). However, the laboratory mouse is considered the model organism of choice for determining the mechanism by which different mutations regulate specific human diseases in vivo, for the following reasons:
• The mouse genome is the most completely described of any animal model, so mouse gene sequences can be compared to human sequences.
• Mice and humans share 99% of their genes.
• Mice and humans share most physiological and pathological features; similarities in nervous, cardiovascular, endocrine, immune, musculoskeletal, and other internal organ systems have been extensively documented.
• When using a mouse model, the environmental factors as well as genetic variations are controlled.
At present, a number of mutagenesis strategies based on embryonic stem (ES) cells are used, all of which use homologous recombination to alter genes in their original location, producing either knockouts to cripple gene function or knockins to introduce a mutated gene version. Typically, this is done in mice since the technology for this process is more refined, and because mouse embryonic stem cells are easily manipulated. The rationale for using ES cells to introduce mutations is that ES cells have the capacity for self-renewal and broad differentiation plasticity. ES cells can be propagated as a homogeneous, uncommitted cell population for an almost unlimited period of time without losing their pluripotency and their stable karyotype.
Figure 14.4. Two methods of generating standard transgenic mice.
Even after extensive genetic manipulation, mouse ES cells are able to reintegrate fully into viable embryos when injected into a host blastocyst or aggregated with a host morula.
14.5.5.1 Standard Transgenic Mice
The definition of transgenesis is the introduction of DNA from one species into the genome of another species, but at present the term transgenic is applied to any animal that carries foreign DNA (even from the same species) that was deliberately inserted into its genome. Many of the first transgenic mice were generated to study the overexpression of a human protein (Masliah and Rockenstein, 2000). To generate a standard transgenic mouse, three components are needed: the gene with the mutation of interest; a strong mouse gene promoter and enhancer to allow the gene to be expressed; and bacterial or viral vector DNA to enable the transgene to be inserted into the mouse genome. The investigator can choose either a cell-specific promoter that will induce gene expression only in specific cells or a promoter that is active in all cells. It is important to add a selection marker, such as G418 resistance. Two methods of producing transgenic mice are widely used (Fig. 14.4):
• Transform ES cells growing in tissue culture. The successfully transformed cells are then selected and injected into the inner cell mass of mouse blastocysts to generate embryos.
• Inject the desired gene into the pronucleus of a fertilized mouse egg. This method has been used to introduce a mutation at the nuclear factor interleukin 6 (NF-IL-6) DNA binding site in a mouse model and to evaluate the role of IL-6 in the response to environmental oxygen deprivation (Yan et al., 1997). Chiesa et al. (1998) used this transgenic method to evaluate the effect of an insertional mutation in a prion gene on a neurological disorder characterized clinically by ataxia and neuropathologically by cerebellar atrophy.
The embryos are then transferred into the uterus of a pseudopregnant foster mouse. Since the success of implantation is estimated to be no more than 33% (Wang and Dey, 2006), at least three embryos should be transferred each time to improve the chance that at least one implants. The offspring are then tested: a small piece of tissue from the tail is isolated, and its DNA is examined for the specific mutation. It is estimated that 10–20% of the progeny will have the introduced mutation, and they will be heterozygous for the mutated allele. Heterozygous mice are then mated, and their offspring are screened for the one in four that will be homozygous for the transgene. The transgenic mouse approach is relatively quick, but carries the risk that the DNA may insert itself into a critical locus, causing an unexpected, detrimental genetic mutation. For this reason, several independent mouse lines containing the same transgene must be created and studied to ensure that any resulting phenotype is not due to toxic gene dosing or to mutations created at the site of transgene insertion.
14.5.5.2 Knockin Mice
To avoid the problems of a standard transgenic, biologists over the last 10 years have relied on knockin mice to study the exogenous expression of a protein. In this method, a mutated DNA sequence is exchanged for the endogenous sequence without any other disruption of the gene. Knockin strategies rely on a method developed by Orban et al. (1992).
This procedure comprises heritable tissue-specific and site-specific DNA recombination as a function of recombinase expression in transgenic mice. Transgenes encoding the bacteriophage P1 Cre recombinase and the loxP-flanked β-galactosidase gene were used to generate transgenic mice. Gene vectors with flanking sequences, termed loxP, are constructed to delete a specific exon of a gene in embryonic stem cells. When exposed to an enzyme called Cre recombinase, the loxP sites undergo reciprocal recombination, leading to the deletion of the intervening DNA. With this method, it is possible to replace the wild type gene sequence with the mutated sequence or vice versa and to delete unnecessary sequences. The gene for Cre recombinase has been knocked into targeted loci in a way that brings its expression under the direction of the endogenous gene promoter, thus allowing tissue-specific or temporal-specific expression of the Cre enzyme and hence recombination of the loxP sites that flank the gene of interest. Site-specific knockins result in a more consistent level of expression of the transgene from generation to generation because it is known that the
overexpression cassette is present as a single copy. Also, because a targeted transgene does not interfere with a critical locus, the investigator can be more certain that any resulting phenotype is due to the exogenous expression of the protein. The knockin mouse procedure requires more time to assemble the vector and to identify ES cells that have undergone homologous recombination, but it does avoid many of the problems of a traditional transgenic mouse. The applications of this method are numerous, and some are already clinically useful. Knockin mouse models of Huntington disease have been developed by introducing the mutation responsible for this fatal disease, which is an abnormally expanded and unstable CAG repeat within the coding region of the gene encoding huntingtin (Menalled, 2005). Furthermore, knockin mouse models that carry the mutation R345W in the EFEMP1 gene (also called fibulin-3) have been developed to determine the mechanism by which this mutation causes age-related macular degeneration (Fu et al., 2007).
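The net outcome of the loxP excision described above can be sketched as a simple string operation. This is a minimal illustration, not a model of the enzyme: the function name `cre_excise` and the toy floxed allele are hypothetical, the 34-bp sequence is the commonly cited wild-type loxP site, and the sketch assumes that both sites lie in the same orientation on a linear molecule (sites in opposite orientation would give inversion rather than deletion).

```python
# Commonly cited 34-bp wild-type loxP site:
# two 13-bp inverted repeats flanking an 8-bp spacer.
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def cre_excise(seq: str, site: str = LOXP) -> str:
    """Return seq after excision between the first two same-orientation sites.

    Only the net result is modeled: the intervening DNA and one copy of the
    site are removed, and a single site remains. A sequence with fewer than
    two sites is returned unchanged.
    """
    first = seq.find(site)
    if first == -1:
        return seq
    second = seq.find(site, first + len(site))
    if second == -1:
        return seq
    # Keep everything up to the first site, then resume at the second site.
    return seq[:first] + seq[second:]


# Hypothetical floxed allele: upstream DNA, loxP, target exon, loxP, downstream DNA.
floxed = "GGGAAA" + LOXP + "CCCTTTCCC" + LOXP + "TTTGGG"
recombined = cre_excise(floxed)
```

In this toy allele the exon placeholder `CCCTTTCCC` is lost and one loxP site is left behind, which mirrors the statement that recombination deletes the intervening DNA.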
14.6 QUESTIONS AND ANSWERS
Q1. What is the advantage of association mapping over linkage analysis?
Q2. How do you distinguish between dominant and recessive mutations?
Q3. What is the purpose of a complementation test?
Q4. Which of the following SNPs can affect the expression and the function of the protein: synonymous, nonsynonymous, or regulatory SNPs?
Q5. What is the possible effect of a mutation in a noncoding region?
A1. In comparison to family-based linkage approaches, association studies have two key advantages. (1) They are able to capitalize on all meiotic recombination events in a population, rather than only those in the families studied. Because of this, association signals are localized to small regions of the chromosome containing only one to a few genes, enabling rapid detection of the actual disease susceptibility gene. (2) GWAS allow the identification of disease genes that confer only a modest increase in risk, which linkage studies are severely limited in detecting.
A2. You can distinguish between dominant and recessive mutations by comparing the gene product of the heterozygous mutant with the normal product and with that of the homozygous mutant. If the heterozygote product is equal or close to the normal product, the mutation is considered recessive. If the product of the heterozygote differs from the normal product and is close to that of the homozygous mutant, the mutation is dominant.
A3. A complementation test allows the identification of the molecules that interact with the gene studied and determines whether the gene studied
is upstream or downstream of the others and whether it inhibits or activates a downstream target.
A4. All three types of SNPs can affect function. A nonsynonymous SNP can alter the conformation of the protein and therefore its function, which can in turn affect the expression of the gene or protein. A synonymous SNP can affect the translation of the protein, and regulatory SNPs can alter promoter activity (gene expression), mRNA conformation (stability), and the subcellular localization of mRNAs and/or proteins.
A5. A mutation in a noncoding region can affect the expression of the affected gene and therefore its function. A mutation in a noncoding region can affect the stability of the transcription factor, so that expression is increased, decreased, or even completely altered, or it can create a new transcription factor binding site.
14.7 REFERENCES
Alberti L, Proverbio MC, Costagliola S, Weber G, Beck-Peccoz P, Chiumello G, Persani L. (2001). A novel germline mutation in the TSH receptor gene causes nonautoimmune autosomal dominant hyperthyroidism. Eur J Endocrinol 145(3): 249–54. Allayee H, Ghazalpour A, Lusis AJ. (2003). Using mice to dissect genetic factors in atherosclerosis. Arterioscler Thromb Vasc Biol 23:1501–09. Bajaj A, Driver JA, Schernhammer ES. (2010). Parkinson’s disease and cancer risk: a systematic review and meta-analysis. Cancer Causes Control 21(5):697–707. Bice P, Valdar W, Zhang L, Liu L, Lai D, Grahame N, Flint J, Li TK, Lumeng L, Foroud T. (2009). Genomewide SNP screen to detect quantitative trait loci for alcohol preference in the high alcohol preferring and low alcohol preferring mice. Alcohol Clin Exp Res 33(3):531–37. Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299:1391–94. 
Bosma PJ, Chowdhury JR, Bakker C, Gantla S, de Boer A, Oostra BA, Lindhout D, Tytgat GNJ, Jansen PLM, Elferink RPJO, Chowdhury NR. (1995). The genetic basis of the reduced expression of bilirubin UDP-glucuronosyltransferase 1 in Gilbert’s syndrome. N Engl J Med 333:1171–75. Capon F, Allen MH, Ameen M, Burden AD, Tillman D, Barker JN, Trembath RC. (2004). A synonymous SNP of the corneodesmosin gene leads to increased mRNA stability and demonstrates association with psoriasis across diverse ethnic groups. Hum Mol Genet 13:2361–68. Cardon LR, Bell JI. (2001). Association study designs for complex diseases. Nat Rev Genet 2:91–99. Chamary J-V, Parmley JL, Hurst LD. (2006). Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet 7:98–108.
Chang LW, Nagarajan R, Magee JA, Milbrandt J, Stormo GD. (2006). A systematic model to predict transcriptional regulatory mechanisms based on overrepresentation of transcription factor binding profiles. Genome Res 16:405–14. Chiesa R, Piccardo P, Ghetti B, Harris DA. (1998). Neurological illness in transgenic mice expressing a prion protein with an insertional mutation. Neuron 21(6):1339– 51. Clayton D, Jones H. (1999). Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet 65:1161–69. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M. (2003). Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76. Chorley BN, Wang X, Campbell MR, Pittman GS, Noureddine MA, Bell DA. (2008). Discovery and verification of functional single nucleotide polymorphisms in regulatory genomic regions: current and developing technologies. Mutat Res 659(1–2): 147–57. Darvasi A. (1998). Experimental strategies for the genetic dissection of complex traits in animal models. Nat Genet 18:19–24. Edderkaoui B, Baylink DJ, Beamer WG, Wergedal JE, Porte R, Chaudhuri A, Mohan S. (2007). Identification of mouse Duffy antigen receptor for chemokines (Darc) as a BMD QTL gene. Genome Res 17(5):577–85. Fu L, Garland D, Yang Z, Shukla D, Rajendran A, Pearson E, Stone EM, Zhang K, Pierce EA. (2007). The R345W mutation in EFEMP1 is pathogenic and causes AMD-like deposits in mice. Hum Mol Genet 16(20):2411–22. Esapa CT, Duprez L, Ludgate M, Mustafa MS, Kendall-Taylor P, Vassart G, Harris PE. (1999). A novel thyrotropin receptor mutation in an infant with severe thyrotoxicosis. Thyroid 9(10):1005–10. Haverty PM, Hansen U, Weng Z. (2004). Computational inference of transcriptional regulatory networks from expression profiling and transcription factor binding site identification. Nucleic Acids Res 32:179–88. Horvath S, Xu X, Lake SL, Silverman EK, Weiss ST, Laird NM. (2004). 
Family based tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet Epidemiol 26:61–69. Jinnai N, Sakagami T, Sekigawa T, Kakihara M, Nakajima T, Yoshida K, Goto S, Hasegawa T, Koshino T, Hasegawa Y, Inoue H, Suzuki N, Sano Y, Inoue I. (2004). Polymorphisms in the prostaglandin E2 receptor subtype 2 gene confer susceptibility to aspirin-intolerant asthma: a candidate gene approach. Hum Mol Genet 13:3203–17. Parmley JL, Hurst LD. (2007). How do synonymous mutations affect fitness? Bioessays 29:515–19. Kimchi-Sarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM. (2007). A silent polymorphism in the MDR1 gene changes substrate specificity. Science 315:525–28. Kleeberger SR, Reddy S, Zhang LY, Jedlicka AE. (2000). Genetic susceptibility to ozone-induced lung hyperpermeability: role of toll-like receptor 4. Am J Respir Cell Mol Biol 22(5):620.
Klein RF, Allard J, Avnur Z, Nikolcheva T, Rotstein D, Carlos AS, Shea M, Waters RV, Belknap JK, Peltz G, Orwoll ES. (2004). Regulation of bone mass in mice by the lipoxygenase gene Alox15. Science 303:229–32. Knight JC. (2003). Functional implications of genetic variation in non-coding DNA for disease susceptibility and gene regulation. Clin Sci (Lond) 104:493–501. Knight JC. (2005). Regulatory polymorphisms underlying complex disease traits. J Mol Med 83:97–109. Korstanje R, Paigen B. (2002). From QTL to gene: the harvest begins. Nature Genet 31:235–36. Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ. (2003). Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered 55:56–65. Lees CW, Satsangi J. (2009). Genetics of inflammatory bowel disease: implications for disease pathogenesis and natural history. Expert Rev Gastroenterol Hepatol 3(5):513–34. Lieschke GJ, Currie PD. (2007). Animal models of human disease: zebrafish swim into view. Nat Rev Genet 8(5):353–67. Lindén SK, Driessen KM, McGuckin MA. (2007). Improved in vitro model systems for gastrointestinal infection by choice of cell line, pH, microaerobic conditions, and optimization of culture conditions. Helicobacter 12(4):341–53. Lipman DJ. (1997). Making (anti)sense of non-coding sequence conservation. Nucl Acids Res 25:3580–83. Logan J, Edwards K and Saunders N. (2009). Real-Time PCR: Current Technology and Applications. Applied and Functional Genomics. Health Protection Agency, London. Marzec JM, Christie JD, Reddy SP, Jedlicka AE, Vuong H, Lanken PN, Aplenc R, Yamamoto T, Yamamoto M, Cho HY, Kleeberger SR. (2007). Functional polymorphisms in the transcription factor NRF2 in humans increase the risk of acute lung injury. Faseb J 21(9):2237–46. Masliah E, Rockenstein E. (2000). Genetically altered transgenic models of Alzheimer’s disease. J Neural Transm Suppl 59:175–83. Menalled LB. (2005). Knock-in mouse models of Huntington’s disease. 
NeuroRx 2(3):465–70. Morris AP, Whittaker JC, Balding DJ. (2004). Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am J Hum Genet 74:945–53. Nackley AG, Shabalina SA, Tchivileva IE, Satterfield K, Korchynskyi O, Makarov SS, Maixner W, Diatchenko L. (2006). Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure. Science 314:1930–33. Nadeau JH, Frankel WN. (2000). The roads from phenotypic variation to gene discovery: mutagenesis versus QTL. Nature Genet 25:381–84. Nakamura S, Kugiyama K, Sugiyama S, Miyamoto S, Koide S, Fukushima H, Honda O, Yoshimura M, Ogawa H. (2002). Polymorphism in the 5’-flanking region of human glutamate-cysteine ligase modifier subunit gene is associated with myocardial infarction. Circulation 105:2968–73.
Ng PC, Henikoff S. (2001). Predicting deleterious amino acid substitutions. Genome Res 11(5):863–74. Ng PC, Henikoff S. (2002). Accounting for human polymorphisms predicted to affect protein function. Genome Res 12(3):436–46. Okamoto K, Iwasaki N, Nishimura C, Doi K, Noiri E, Nakamura S, Takizawa M, Ogata M, Fujimaki R, Grarup N, Pisinger C, Borch-Johnsen K, Lauritzen T, Sandbaek A, Hansen T, Yasuda K, Osawa H, Nanjo K, Kadowaki T, Kasuga M, Pedersen O, Fujita T, Kamatani N, Iwamoto Y, Tokunaga K. (2010). Identification of KCNJ15 as a susceptibility gene in Asian patients with type 2 diabetes mellitus. Am J Hum Genet 86(1):54–64. Ono S, Ezura Y, Emi M, Fujita Y, Takada D, Sato K, Ishigami T, Umemura S, Takahashi K, Kamimura K, Bujo H, Saito Y. (2003). A promoter SNP (-1323T>C) in G-substrate gene (GSBS) correlates with hypercholesterolemia. J Hum Genet 48:447–50. Orban PC, Chui D, Marth JD. (1992). Tissue- and site-specific DNA recombination in transgenic mice. Proc Natl Acad Sci U S A 89(15):6861–65. Paigen K. (2003). One hundred years of mouse genetics: an intellectual history. II. The molecular revolution (1981–2002). Genetics 163(4):1227–35. Ponomarenko JV, Orlova GV, Merkulova TI, Gorshkova EV, Fokin ON, Vasiliev GV, Frolov AS, Ponomarenko MP. (2002). rSNP guide: an integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites. Hum Mutat 20:239–48. Pritchard JK, Przeworski M. (2001). Linkage disequilibrium in humans: models and data. Am J Hum Genet 69:1–14. Rosenthal N, Brown S. (2007). The mouse ascending: perspectives for human-disease models. Nat Cell Biol 9(9):993–99. Satten GA, Epstein MP. (2004). Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet Epidemiol 27:192–201. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70:425–34. 
Schaid DJ. (2004). Evaluating associations of haplotypes with traits. Genet Epidemiol 27:348–64. Shastry BS. (2009). SNPs: impact on gene function and phenotype. Meth Mol Biol 578:3–22. Sheffer R, Zlotogora J. (1992). Autosomal dominant inheritance of Klein-Waardenburg syndrome. Am J Med Genet 42(3):320–22. Sugatani J, Yamakawa K, Yoshinari K, Machida T, Takagi H, Mori M, Kakizaki S, Sueyoshi T, Negishi M, Miwa M. (2002). Identification of a defect in the UGT1A1 gene promoter and its association with hyperbilirubinemia. Biochem Biophys Res Commun 292:492–97. Sun YV, Boverhof DR, Burgoon LD, Fielden MR, Zacharewski TR. (2004). Comparative analysis of dioxin response elements in human, mouse and rat genomic sequences. Nucl Acids Res 32:4512–23. Sun M, Hurst LD, Carmichael GG, Chen J. (2006). Evidence for variation in abundance of antisense transcripts between multicellular animals but no relationship between antisense transcription and organismic complexity. Genome Res 16:922–33.
Thomas JT, Kilpatrick MW, Lin K, Erlacher L, Lembessis P, Costa T, Tsipouras P, Luyten FP. (1997). Disruption of human limb morphogenesis by a dominant negative mutation in CDMP1. Nat Genet 17(1):58–64. Tixier-Boichard M. (2002). From phenotype to genotype: Major genes in chickens. World’s Poult Sci J 58:35–43, 65–75. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. (2010). Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol 6(1):e1000641. Wang H, Dey SK. (2006). Roadmap to embryo implantation: Clues from mouse models. Nat Rev Genet 7(3):185–99. Review. Wang X, Tomso DJ, Chorley BN, Cho HY, Cheung VG, Kleeberger SR, Bell DA. (2007). Identification of polymorphic antioxidant response elements in the human genome. Hum Mol Genet 16:1188–200. Wang X, Tomso DJ, Liu X, Bell DA. (2005). Single nucleotide polymorphism in transcriptional regulatory regions and expression of environmentally responsive genes. Toxicol Appl Pharmacol 207:84–90. Warner JB, Philippakis AA, Jaeger SA, He FS, Lin J, Bulyk ML. (2008). Systematic identification of mammalian regulatory motifs’ target genes and functions. Nat Methods 5:347–53. Wataha JC, Hanks CT, Sun Z. (1994). Effect of cell line on in vitro metal ion cytotoxicity. Dent Mater 10:156–61. Williams FM, Spector TD. (2007). The genetics of osteoporosis. Acta Reumatol Port 32(3):231–40. Wollnik B, Tukel T, Uyguner O, Ghanbari A, Kayserili H, Emiroglu M, Yuksel-Apak M. (2003). Homozygous and heterozygous inheritance of PAX3 mutations causes different types of Waardenburg syndrome. Am J Med Genet A 122A(1):42–5. Yan SF, Zou YS, Mendelsohn M, Gao Y, Naka Y, Du Yan S, Pinsky D, Stern D. (1997). Nuclear factor interleukin 6 motifs mediate tissue-specific gene transcription in hypoxia. J Biol Chem 272(7):4287–94. Yen PM, Chin WW. (1994). Molecular mechanisms of dominant negative activity by nuclear hormone receptors. Mol Endocrinol 8:1450–54.
CHAPTER 15
Confirmation of a Mutation by Multiple Molecular Approaches
HECTOR MARTINEZ-VALDEZ and BLANCA ORTIZ-QUINTERO
Contents
15.1 Introduction
  15.1.1 Gene Expression Overview
  15.1.2 Mutations
  15.1.3 Other Factors That Affect Gene Expression
15.2 mRNA Expression by Real-Time PCR
  15.2.1 Theory, Scheme, and Scope
  15.2.2 Comparative Appraisals
  15.2.3 Quantitation and Data Report
15.3 DNA Sequencing
  15.3.1 Scope and Evolution
  15.3.2 Chemistry, Reaction, and Analysis
  15.3.3 Outsourcing
15.4 In Situ Hybridization
  15.4.1 Experimental Considerations
  15.4.2 Scope and New Developments
15.5 Expression at the Protein Level
  15.5.1 Overview
  15.5.2 Antibody Technology
  15.5.3 Evolution and Scope of Protein Analyses
15.6 Genetically Engineered Animals
  15.6.1 Enforced Expression in Transgenic Mice
  15.6.2 Gene Targeting/Knockout
  15.6.3 Pitfalls and Solutions
15.7 Concluding Remarks
15.8 Acknowledgments
15.9 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
15.1 INTRODUCTION
15.1.1 Gene Expression Overview
In most eukaryotic cells, the DNA content is several orders of magnitude higher than that required for the coding of proteins (Rangel et al., 2005). However, beyond the sequence annotations for protein coding, the human genome stands out for a complex gene expression diversity that breaches the limits of gene usage and function (Rangel et al., 2005). Such diversity arises largely from genetic rearrangements (Chen and Alt, 1993), differential RNA processing, and alternative translation initiation mechanisms (Sims-Mourtada et al., 2005). In keeping with this notion, genes previously thought to yield only noncoding germline mRNAs are increasingly being documented to translate productively into small and long polypeptides, which conform to the cell lineage phenotype and function (Rangel et al., 2005; Frances et al., 1994; Jolly and O’Neill, 1997; Saint-Ruf et al., 1994; Erdmann et al., 2000; McKeller and Martinez-Valdez, 2006). Such remarkable diversity stems from the distinct expression patterns of genes in different tissues (Rangel et al., 2005). Whereas some gene products are required for basal cell functions and are constitutively expressed by most cells, restricted transcription and translation dictate tissue- and cell-specific functions (Rangel et al., 2005). For instance, the expression of antigen (Ag) receptor genes is developmentally controlled and is exclusive to immune cells (Alt et al., 1992). A problem with such restriction, as experienced by the adaptive immune system in mammals, is the requirement for a broad repertoire of Ag receptor genes that surpasses the coding capacity of the genome (Yang et al., 2003). However, nature has tailored gene rearrangement mechanisms to circumvent the diversity constraints of the Ag receptor loci and to ensure a large repertoire of Ag specificities (Chen and Alt, 1993; Rajewsky, 1996; Sleckman et al., 1996). 
Notably, mouse and human genomes carry fewer than twice the number of genes present in lower eukaryotic genomes, which indicates that their greater degree of complexity requires mechanisms that diversify and increase the number of gene products derived from a single gene (Sims-Mourtada et al., 2005; Landry et al., 2003). Among these, alternative mRNA splicing is perhaps the most frequent and diversifying (Sims-Mourtada et al., 2005). Recent estimates indicate that 35–50% of mammalian genes undergo alternative splicing (Landry et al., 2003; Wen et al., 2004). Whereas alternative splicing frequently leads to changes in protein structure, subcellular localization and/or function, in some cases differential mRNA processing only affects 5′ or 3′ untranslated regions without altering the makeup of functional motifs or protein structures. However, taken as a whole, alternative splicing has a profound influence on protein output and magnifies gene expression diversity. Another means to enhance gene expression diversity is the use of alternative promoters. This has the additional feature of having the potential to create
complex regulatory diversity (Landry and Mager, 2002). Although not as common as alternative splicing, alternative promoter usage is a frequent regulatory mechanism that occurs in at least 18% of the mammalian genes (Trinklein et al., 2003). The use of alternative promoters permits genes to exhibit more than one pattern of tissue specificity and developmental control (Trinklein et al., 2003; Saleh et al., 2002; Medstrand et al., 2001). In some cases, alternative promoters not only provide regulatory versatility but dictate the expression of different protein isoforms, thereby greatly expanding both the translational and transcriptional capacity of the genome (Sims-Mourtada et al., 2005). In higher eukaryotes, polyadenylation serves as an added layer of regulatory diversification, which is accomplished by the addition of poly(A) tails at the 3′ end of all mRNAs, except histone transcripts (Zarudnaya et al., 2003). The site where polyadenylation occurs can have profound functional consequences, as it dictates the length and sequence of the 3′ untranslated region, which in turn can control RNA splicing events, mRNA stability and translation rates (Zarudnaya et al., 2003). Thus gene rearrangement, alternative splicing, differential promoter usage, and alternative polyadenylation constitute important mechanisms that have significant impact on the control of gene expression and functional diversity.
15.1.2 Mutations
The complexity of the human genome is underscored by its unprecedented sequence diversity, variation in gene copy number, inherent potential for base-pair mutations that can lead to single nucleotide polymorphisms (SNPs) and the enormous plasticity for gene rearrangements, insertions, deletions and translocations, which can radically change gene expression, cellular functions and organism phenotype. In keeping with this notion, point mutations, deletions, insertions and recombinations of selected gene loci can occur, under physiological conditions, as a result of programmed mechanisms that enhance genome usage. For instance, the efficacy of antibody protection against pathogens relies on specific recognition functions. Because antibodies are encoded by distinct immunoglobulin (Ig) gene segments, mechanisms are tailored (the good news) to ensure a broad spectrum of pathogen recognition by multifaceted gene rearrangements and mutations. Conversely, these mechanisms are susceptible to intrinsic (genetic) and/or extrinsic (biological, physical or chemical gene lesions) derailment that can result in the loss of the programmed function or the gain of deleterious pathology (the bad news). The physiological (good news) and pathological (bad news) genetic scenarios are detailed below.
15.1.2.1 The Good News
Physiological gene rearrangement, somatic hypermutation and class switch recombination (CSR). During B lymphocyte development in the bone marrow, gene segments on the immunoglobulin (Ig)
heavy (H) and light (L) chain loci assemble in a defined, ordered manner (Rangel et al., 2005; Chen and Alt, 1993; Frances et al., 1994; McKeller and Martinez-Valdez, 2006; Alt et al., 1992; Tonegawa, 1983; Dudley et al., 2005; Puebla-Osorio and Zhu, 2008). Thus germline B cell progenitors (pro-B cells) rearrange DH and JH genes (pre-B1 cells) and subsequently join a given VH gene to the rearranged DH-JH segment (pre-B2 cells), allowing the expression of the IgH chain. Finally, if a successful VL to JL rearrangement occurs, the cells can express κ or λ chains and can mature into B cells, displaying surface IgM antigen receptor (Rangel et al., 2005; Chen and Alt, 1993; Frances et al., 1994; McKeller and Martinez-Valdez, 2006; Alt et al., 1992; Tonegawa, 1983; Dudley et al., 2005; Puebla-Osorio and Zhu, 2008). The subsequent acquisition of B cell memory from naïve cells takes place within a highly specialized microenvironment, the germinal centers (GC) of the secondary lymphoid organs (Siddiqa et al., 2001; Guzman-Rojas et al., 2002). 
Within the GC microenvironment, the B cell maturation program faces a series of genetic events that include changes in the regulation of cell cycle checkpoint genes that result in the proliferation of Ag-specific B cells (Thorbecke et al., 1994; Siepmann et al., 2001), somatic diversification of the IgV domains by the introduction of point mutations that occurs as a result of active DNA polymerase during clonal cell expansion (Jacob et al., 1991; Pascual et al., 1994; Liu et al., 1996a; 1996b), selection of the functional (high-affinity) Ag-specific B cell repertoire against low-affinity and autoreactive cells by a concerted regulation of survival or death-inducing genes (Choe et al., 1996; Martinez-Valdez et al., 1996; Rathmell et al., 1996), and the intramolecular switch of the constant Ig regions, from IgM to the IgG, IgA, or IgE isotypes (Liu et al., 1996b, 1996c), to express immunoglobulin (Ig) receptors with high-affinity Ag-binding (IgV) domains that are associated with μ, γ, α, or ε constant regions (Liu et al., 1996a, 1996c). Thus, the memory of the immune system is borne by B and T lymphocytes, which make rapid and robust humoral (antibody) responses or cell-mediated responses upon repeated antigenic invasion (Siddiqa et al., 2001; Guzman-Rojas et al., 2002; Liu et al., 1996b, 1996c; Martinez-Valdez et al., 1996).
15.1.2.2 The Bad News
Autoreactive and malignant GC B cells represent a unique class of disorders because they originate from cells of the immune system that divert from the normal maturation programs, via genetic rearrangements or somatic mutations (Guzman-Rojas et al., 2002). Since gene rearrangement, somatic hypermutation, Ag receptor editing and isotype switching are the physiological landmarks of Ag-driven GC responses (Guzman-Rojas et al., 2002; Unniraman and Schatz, 2006; Casellas et al., 2001; Meffre et al., 1998; Franco et al., 2006), the risk for genetic lesions and hence autoimmunity and malignant transformation is exponentially enhanced. 
In line with this rationale, T cells, which do not undergo receptor editing, somatic hypermutation, or isotype switching, give rise to 10 to 20 times fewer lymphoproliferative diseases than B cells (Guzman-Rojas et al., 2002). It is during these
stages that B lymphocytes become targets of aberrant rearrangements and mutations that result in diverse forms of lymphoproliferative disorders, including autoimmunity, leukemia and lymphoma (Frances et al., 2000; Malisan et al., 1996a; O’Brien et al., 1995; Sawyers et al., 1991). Moreover, somatic hypermutation is a fundamental mechanism by which diversity for the antibody repertoire is created and such function is exponentially amplified within GC (Liu et al., 1996b, 1996c). Whereas immunoglobulin (Ig) genes rapidly became the stereotyped targets, due to their high molecular frequency, non-Ig genes can also be targeted for somatic hypermutation (Pasqualucci et al., 1998; Storb et al., 2001; Parsa et al., 2007; Watanabe-Fukunaga et al., 1992; Takahashi et al., 1994; Muschen et al., 2002; Muschen et al., 2000a, 2000b). Consequently, Fas and other GC genes gained celebrity-like notoriety and gradually became de facto endangered tumor suppressors at the mercy of the predatory GC somatic mutation machinery (Guzman-Rojas et al., 2002).
15.1.3 Other Factors That Affect Gene Expression
Decades of research in diverse walks of science have unveiled how chromosome rearrangements, mutations and epigenetic mechanisms contribute to altered cell functions and organism phenotypes (Pfeifer and Besaratinia, 2009; Partanen et al., 2009; Mathews et al., 2009; Herceg and Hainaut, 2007). However, deciphering the pathological scenario at the molecular level can be complex and involve the cooperative alteration of multiple genes, which override normal checkpoint mechanisms and produce an intricate phenotype (Hussain et al., 2009; Jones and Thompson, 2009). Moreover, the majority of gene lesions affect cells undergoing distinct developmental stages (Malisan et al., 1996b; O’Brien et al., 1995; Sawyers et al., 1991), where the genetic and epigenetic events can equally target proto-oncogene and tumor suppressor functions and lead to deregulation of cell proliferation and/or survival (Hussain et al., 2009; Jones and Thompson, 2009; Porter and Polyak, 2003; Fusco and Fedele, 2007; Van Vlierberghe et al., 2008). It is for this reason that the characterization of genes, whose function may equally influence the regulation of normal cell development and tumorigenesis, is of critical significance. As an example of how ancillary genetic and epigenetic events can be efficiently documented, genomewide screens and data processing emerge as a technological breakthrough (Landvik et al., 2009; Marcucci et al., 2008; Savas and Liu, 2009; Thye et al., 2003) that enables array-based comparative genome hybridization (CGH) studies and concomitant assessment of major regions of chromosome fragility (Blaveri et al., 2005; Engelmark et al., 2008; Nymark et al., 2006; Ross et al., 2007). 
A major advantage of the technology is its versatility to query common fragile sites (CFS) of synteny with mouse chromosomes (Bauer and Rondini, 2009; Helmrich et al., 2006), which prompts the rationale for the generation and study of genetically engineered animal models (Sims-Mourtada et al., 2005; Bauer and Rondini, 2009; Helmrich et al., 2006;
Callahan et al., 2003; Festing et al., 1998; Kleeberger et al., 2000). Furthermore, genomewide scans can also interrogate gene structure, expression profiles and copy number variations of normal, experimental and pathological specimens (Geisert et al., 2009; Xiong et al., 2008a; Ikram et al., 2009; Schejeide et al., 2009; Takezaki and Nei, 2009; Yan et al., 2009; Li et al., 2008). Added features of genomewide screening that are amenable to quantitative measurement include transcriptional activity (Jiao et al., 2009; Xiong et al., 2008b), DNA methylation and/or acetylation and regulatory microRNA expression. As gene screens reveal exceedingly broad variations in chromatin revisions, microRNA engagement, gene copy numbers, insertions, deletions, inversions, and translocations (Bild et al., 2006; Hartmann et al., 2008; Yuille et al., 2001; Pritchard and Przeworski, 2001; Mullighan et al., 2009; Tay et al., 2009; Fan et al., 2007; Zhang et al., 2009; Jaillard et al., 2009; Karnan et al., 2006; Calin and Croce, 2007; Liu et al., 2008; Yu et al., 2008), apt technology must be applied to validate the pathophysiological relevance of genomewide screens (Zhang et al., 2009; Bentley et al., 2008; Dunckley et al., 2007; Gallardo et al., 2008; Scheinfeldt et al., 2009; Wheeler et al., 2008). These are discussed in subsequent sections.
15.2 mRNA EXPRESSION BY REAL-TIME PCR
15.2.1 Theory, Scheme, and Scope
As stated, genomewide microarray screens permit high-throughput genetic queries and large-scale gene expression analyses. However, quantitative confirmation is necessary for meaningful interpretations, and real-time PCR technology possesses the accuracy, sensitivity and specificity to validate gene discovery, structure, mutation patterns, polymorphisms and expression. 
Quantitative real-time PCR has held center-stage attention for over a decade (Gibson et al., 1996; Luthra and Medeiros, 2006; Mocellin et al., 2003; Murphy and Bustin, 2009; Rooney et al., 2005; Wang and Brown, 1999; Garcia-Castillo and Barros-Nunez, 2009) for quite pragmatic reasons: broad dynamic range, high sensitivity, unprecedented specificity and reproducibility, and the potential for large sample management (Murphy and Bustin, 2009; Rooney et al., 2005; Wong and Medrano, 2005; Szczepanski, 2007). The rapid pace of real-time PCR applications, virtually reaching personalized genetic analyses, enables molecular medicine to quantitatively confirm gene expression patterns, copy number variations, single nucleotide polymorphisms, allelic discrimination, and gene lesions (deletions, insertions, or inversions), a particularly critical feat when mutations are underrepresented in the sample population (Deepak et al., 2007). In keeping with personalized genetic analyses, individual records can be cross-examined against multifaceted databases that include clinical, drug response, epidemiology, gender, race, phenotype, job history, and geography/environment parameters. Cross-examined information can thus serve to establish genetic links to disease, adverse reactions to medication, and drug resistance traits (Deepak et al., 2007;
Severino and Zompo, 2004). Hence the coordinated application of quantitative real-time PCR with multiparameter database information could become a rational means to personalized medical intervention and assessment of individual responses to therapy. Cancer, immunodeficiency, autoimmunity, pathogen infection, diabetes, neurodegeneration, and cardiovascular and respiratory disorders are among the most challenging threats, where confirmatory real-time PCR can be of enormous value to decode the genetic substrates associated with the pathology.

Real-time PCR measures the initial content of DNA or reverse-transcribed mRNA templates by combining log-phase amplification with detection, so that the progression of the reaction can be monitored in real time. These key features are in contrast with other PCR methods, which are designed to record only the endpoint amplified product (Espy et al., 2006; Freeman et al., 1999; Raeymaekers, 2000). Hence real-time PCR has become the preferred method to quantitatively assess gene structure, copy number, rearrangements, deletions, insertions, inversions, and expression (Gibson et al., 1996; Luthra and Medeiros, 2006; Mocellin et al., 2003; Murphy and Bustin, 2009; Garcia-Castillo and Barros-Nunez, 2009; Muller et al., 2004). The principle of real-time PCR analysis is the detection of the fluorescence emitted at each cycle as a measure of amplicon production per cycle. Toward that end, real-time PCR relies on the quantitative detection of reporter probes, whose fluorescence emission increases proportionally with the amount of PCR product generated. Quantitation of the fluorescence emitted per cycle serves as the means to assess the state of the reaction at the exponential stage, in which the level of amplified product corresponds to the starting amount of template. The overall conceptual message is that the higher the concentration of the DNA or cDNA template, the earlier the log-phase increase in fluorescence is detected.
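The relationship between template input and detection cycle can be made explicit with an idealized model. As a sketch only, assume a constant per-cycle amplification efficiency E (E = 1 for perfect doubling), a starting copy number N_0, and a fixed detection level N_t reached at cycle c_t. These symbols are introduced here for illustration and are not the chapter's own notation.

```latex
\[
N_c = N_0\,(1+E)^{c}
\qquad\Longrightarrow\qquad
c_t = \frac{\log N_t - \log N_0}{\log(1+E)}
\]
```

Under these assumptions, and with E = 1, a tenfold increase in N_0 brings detection forward by log2(10), or about 3.32 cycles, which is the sense in which a more concentrated template is detected earlier.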
Whereas there are multiple (albeit redundant) modes to define the progression and significance of real-time PCR stages, in the interest of simplicity a standard reaction can be divided into three main stages: (1) exponential, (2) linear, and (3) plateau. During the early exponential stage, fluorescence emission reaches a level of increment above background baseline that reflects template concentration at origin. The threshold value obtained at the cycle where the increased fluorescence shift is first detected, known as cycle threshold (Ct), is applied to quantitatively determine the changes in the experimental samples (Gibson et al., 1996; Heid et al., 1996; Lay and Wittwer, 1997). Under optimal conditions, exponential doubling of DNA or cDNA copies reaching the most favorable amplification ratios follows the primary fluorescent shift. The linear stage is characterized by a higher degree of variability, which results from substrate consumption and degradation, and which ultimately leads to a slower pacing reaction. The plateau stage is basically the end point of the reaction, marked by the arrest of amplification products and increased vulnerability for DNA or cDNA degradation. It must be emphasized that upon reaching the plateau stage, changes in fluorescence emission are negligible, irrelevant, and without quantitative value (Wong and Medrano, 2005; Bustin, 2000).
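The progression from exponential growth to plateau, and the Ct readout taken from the exponential stage, can be illustrated with a toy simulation. The sketch below assumes a simple saturating growth rule (per-cycle efficiency shrinking as product approaches a ceiling), an arbitrary detection threshold, and log-scale interpolation between cycles. The function names `simulate_curve` and `cycle_threshold`, the plateau of 1e12 copies, and the threshold of 1e9 copies are hypothetical choices made for illustration, not values from the chapter or from any instrument software.

```python
import math

def simulate_curve(n0, cycles=40, efficiency=1.0, plateau=1e12):
    """Return product copies after each cycle (index 0 is the starting template)."""
    copies = [float(n0)]
    for _ in range(cycles):
        n = copies[-1]
        # Efficiency falls as product approaches the plateau (reagents run out).
        copies.append(n * (1.0 + efficiency * (1.0 - n / plateau)))
    return copies

def cycle_threshold(copies, threshold):
    """Fractional cycle at which the curve first crosses the threshold."""
    for c in range(1, len(copies)):
        lo, hi = copies[c - 1], copies[c]
        if lo < threshold <= hi:
            # Interpolate on a log scale, where the exponential stage is linear.
            frac = (math.log(threshold) - math.log(lo)) / (math.log(hi) - math.log(lo))
            return (c - 1) + frac
    return None  # threshold never reached

threshold = 1e9  # assumed detection level, well below the plateau
ct_high = cycle_threshold(simulate_curve(1e5), threshold)  # more template
ct_low = cycle_threshold(simulate_curve(1e4), threshold)   # tenfold less template
print(f"Ct (1e5 copies) = {ct_high:.2f}")
print(f"Ct (1e4 copies) = {ct_low:.2f}")
print(f"Delay for tenfold less template = {ct_low - ct_high:.2f} cycles")
```

In this toy model the tenfold more dilute sample crosses the threshold a little over three cycles later, while both curves end at the same plateau, which is why late-cycle fluorescence carries no quantitative information.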
15.2.2 Comparative Appraisals
Given the nature of the real-time PCR chemistry, kinetic measurements of DNA or cDNA amplification can be made during the early exponential phase of the reaction. This is in contrast with nonquantitative standard PCR, in which detection can be accomplished only at the end of the reaction, through the resolution of final plateau-phase PCR products by gel (agarose or acrylamide) electrophoresis and fluorescent staining (ethidium bromide, acridine orange, and SYBR dyes), or by autoradiography (alternatively phosphorimaging) when electrophoresis is followed by Southern transfer and hybridization with radioactively labeled (commonly ³²P-NTP) probes. It must be emphasized that the qualitative information obtained by combined gel electrophoresis, fluorescent staining, and/or autoradiography originates from endpoint PCR products. In this context, the results are largely based on size discrimination against a reference, which is prone to inaccuracy due to potential product degradation and hence without quantitative value. Congruent with these differences, real-time PCR quantitation does not require postreaction processing, which facilitates the management of large and complex sample analyses, enables the simultaneous processing of multiple replicates for statistical validation, and reduces the risk of sample contamination. Moreover, unlike nonquantitative methods, real-time PCR possesses a broader range over which variations in template concentration can be accurately detected (roughly 10⁷-fold, against 10³-fold for standard PCR), a property commonly known as reaction dynamics or dynamic range (Wagatsuma et al., 2005; Louvel et al., 2008). In brief, a broad dynamic range provides permissible concentration ratios between experimental and housekeeping control templates, so that the reaction is carried out with comparable sensitivity and specificity. In other words, accuracy of the reaction is proportional to the dynamic range.
Overall, real-time PCR not only surpasses conventional PCR in quantitative power, accuracy, and sensitivity but also stands out against the most widely used approaches to genetic analysis, including the RNase protection assay (Wang and Brown, 1999; Wong and Medrano, 2005) and dot-blot hybridization (Wong and Medrano, 2005; Malinen et al., 2003). Among the key features that gained real-time PCR center-stage prominence in biomedicine are the feasibility of detecting a single copy of DNA or mRNA (Hyvarinen et al., 2009; Palmer et al., 2003; Barragan et al., 2001), the capability to discern subtle differences in detection levels between the templates under comparison (Gentle et al., 2001; Reil et al., 2008), and the capacity to discriminate virtually identical gene copy isoforms (Wong and Medrano, 2005; Rodriguez-Manotas et al., 2006; Louis et al., 2004). Because real-time PCR applications can amplify both DNA and mRNA substrates, inherent differences must be taken into account. For instance, under homeostatic conditions, DNA content, gene mutations, and polymorphisms are equally represented in all cells of a given organism and usually uninfluenced by internal or external environmental cues (Bustin and Nolan, 2004; Nannya et al., 2005). In contrast, mRNA transcription, processing and
stability largely depend on both intrinsic and extrinsic factors, which ultimately determine the physiological steady-state levels (Neu-Yilik and Kulozik, 2008). This means that whereas DNA presents a stable number of target gene templates, mRNA copies of a given transcript can be highly variable and depend on the state of cell maturation, differentiation, and/or activation (Bustin and Nolan, 2004; Neu-Yilik and Kulozik, 2008). Unlike DNA, mRNA is relatively unstable and susceptible to degradation during extraction procedures, which can contribute to target template variability in quality and quantity. Such limitation impinges not only on the accuracy of reverse-transcription and real-time PCR reactions but also on data interpretation and biological impact. Whereas human error cannot be completely ruled out, the fragile quality of mRNA preparations is inevitably associated with the nature of the biological specimen, such as those obtained postmortem, sampled as intraoperative biopsies, or collected as sorted or microdissected single-cell preparations. Multiple preventive approaches (not discussed in this chapter) are now in place to preserve the quality of unique mRNA samples; hence cDNA amplification by real-time PCR remains the state-of-the-art method to evaluate common and rare gene expression.

Other noteworthy considerations include the cost of equipment and validated reagents. Although equipment may not be a deciding factor, since most departments of prominent institutions provide the necessary hardware, real-time reagents of superior quality that ensure reproducible experimentation are expensive consumables and are the sole responsibility of the investigator. Last, real-time PCR reactions are designed neither to measure the size of the amplified product nor to discriminate between DNA and cDNA templates.
Then again, these parameters are not needed to evaluate the accuracy, specificity, or reproducibility of real-time PCR reactions and can be assessed independently. For example, amplified products can be readily analyzed by electrophoresis or directly sequenced, which concomitantly provides size and gene identity information. On the other hand, RNA-specific chromatography, oligo d(T)-dependent reverse transcription, and/or enzymatic clearing of DNA can be applied to achieve discriminating cDNA amplification.

15.2.3 Quantitation and Data Report
Real-time PCR quantitation is a direct function of the detection of fluorescent reporters, whose log increase is proportional to initial template input and the magnitude of amplified DNA or cDNA. Namely, earlier fluorescence detection reflects higher template input. For pragmatic purposes, three categories of fluorescent reporter chemistries applied to real-time PCR are herein considered: (1) hydrolysis, (2) hybridization, and (3) DNA-binding (Mackay and Landt, 2007). As an example of hydrolysis chemistries (often referred to as 5′ nuclease reactions), TaqMan fluorescence is emitted only when the 5′ exonuclease activity of Taq polymerase cleaves the reporter probe, usually a 20- to 30-mer oligonucleotide
that carries a 5′ fluorochrome and a 3′ quencher (with or without native fluorescence). The function of the quencher in the intact probe is to prevent the emission of fluorescence from the reporter by fluorescence resonance energy transfer (FRET). However, when the sequence of interest is recognized, the reporter probe specifically hybridizes between the sites where the amplifying primers anneal. Upon hybridization of the probe to the target sequence, the fluorochrome reporter is released from the quencher by the 5′ exonuclease activity of the polymerase, which thus enables fluorescence emission. In this context, increased fluorescence emission by the reporter fluorochrome is directly proportional to PCR product buildup. The specificity of these reactions stems from the sequence complementarity between the probe and the target of amplification.

In considering the chemistries that depend on probe hybridization to DNA and cDNA templates, annealing and melting temperatures stand out as inherent features of nucleotide sequences that are key in the design and application of real-time PCR technology to a variety of demanding research projects (Mackay and Landt, 2007). The exploitation of DNA melting chemistries is equally central to the development of equipment designed to control and record temperature shifts (Arya et al., 2005; Winter et al., 2004) and to the use of suitable fluorescent DNA probes (Kutyavin et al., 2003; Wong and Bai, 2006). Probe hybridization chemistries can be designed in two alternative formats. One uses two template-specific probes that hybridize head to tail between the amplifying oligonucleotides, one proximal to the forward primer and the other proximal to the reverse primer. In this format, the probe proximal to the forward primer is tagged at its 3′ end (acceptor fluorochrome), whereas the one nearing the reverse primer is tagged at its 5′ end (donor fluorochrome). Increased fluorescence is emitted upon probe DNA binding, via FRET.
Alternatively, the forward primer can carry the acceptor fluorochrome at its 3′ end and thus reduce the reaction to a tripartite oligonucleotide reaction. Although the technology is designed to facilitate multiple gene/sequence target assessments in a single reaction, the approach is demanding and entails laborious optimization. The availability of new fluorochromes and detection modules should circumvent these limitations. Irrespective of the format, the probe hybridization approach enables high-resolution assessment of template amplification under stringent conditions through the recorded fluorescence emission.

DNA-binding dye chemistries target double-stranded (ds) DNA and are designed to measure PCR amplification products through sequence-independent fluorochrome intercalation. Examples of these dyes are SYBR Green and ethidium bromide, which are weak fluorochromes as free molecules but emit increased fluorescence when bound to dsDNA (Lutfalla and Uze, 2006). Since PCR products double with every amplifying cycle, template availability increases, more dye binding ensues, and proportionately higher fluorescence levels are registered. Whereas this method has been applied and improved in numerous and varied applications, it requires standardization and validation at different levels to ensure accuracy and specificity (Lutfalla and Uze, 2006). Also, it must be noted that despite the plasticity of sequence-independent techniques, which allows the analysis of different genes with the
same type of probe, the use of DNA-binding dyes precludes the setting of multiparametric reactions and often detects spurious amplification products (Kutyavin et al., 2003). Nevertheless, and as pointed out earlier, high-stringency conditions together with confirmed amplifying primer specificity are routinely applied to reduce the risk of false-positive products. Saturation of fluorescence detection is a disadvantage often associated with DNA-binding dye chemistries, particularly when longer sequences are amplified. However, amplicon limits and detection parameters can be adjusted.

Upon data processing and irrespective of the chemistry, it is important to bear in mind the significance of cycle threshold (Ct) settings, since the slope generated at the log exponential phase measures the amplification efficiency (Gibson et al., 1996; Wong and Medrano, 2005; Heid et al., 1996; Lay and Wittwer, 1997). The threshold that defines Ct is routinely set above the amplification baseline, namely within the exponential phase, which becomes linear with log conversion. While amplification efficiency is roughly estimated to be near 100%, it can be affected by multiple factors, including the quality of the amplifying oligonucleotides, structural constraints, contaminants, and, as stated earlier, the size of the target sequence (Wong and Medrano, 2005; Bustin and Nolan, 2004; Yuan et al., 2006; Kubista et al., 2006; Mehra and Hu, 2005). Hence steps must be taken to prevent exogenous sources of error that can mislead interpretation of data. For absolute quantitation, standards with validated concentrations are either commercially available or custom prepared, depending on the application. Such standards are used as internal Ct references with experimental linear-log slope validation, usually obtained from serial dilution measurements that cover a relatively broad dynamic range.
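The serial-dilution standard curve described above can be sketched in a few lines. The example assumes an ordinary least-squares fit of Ct against log10 of the known copy number, derives the amplification efficiency from the slope as 10^(-1/slope) - 1, and reads an unknown sample off the fitted line. The dilution series, the unknown Ct of 22.0, and the helper names `fit_line` and `copies_from_ct` are invented for illustration and do not come from the chapter.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented tenfold dilution series: (known copies per reaction, measured Ct).
standards = [(1e7, 14.1), (1e6, 17.5), (1e5, 20.8), (1e4, 24.2), (1e3, 27.5)]

log_copies = [math.log10(copies) for copies, _ in standards]
ct_values = [ct for _, ct in standards]
slope, intercept = fit_line(log_copies, ct_values)

# A slope of about -3.32 corresponds to perfect doubling (efficiency = 1.0).
efficiency = 10 ** (-1.0 / slope) - 1.0

def copies_from_ct(ct):
    """Read an unknown sample off the fitted standard curve."""
    return 10 ** ((ct - intercept) / slope)

unknown_copies = copies_from_ct(22.0)
print(f"slope = {slope:.3f}, efficiency = {efficiency:.1%}")
print(f"unknown sample: about {unknown_copies:.3g} copies")
```

With these made-up values the fitted slope is close to -3.35, which corresponds to an efficiency just under 100%. A slope far from -3.32 would suggest that the dilution series, the primers, or the template quality needs attention before unknowns are interpolated.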
Examples of these standards include dsDNA, single-stranded (ss) DNA, and complementary RNA (cRNA) that carry the target sequence of amplification (Wong and Medrano, 2005). A basic requirement for absolute quantitation is that the standards routinely used possess quasi-constant amplification efficiencies. Conversely, relative quantitation relies on comparative housekeeping genes that serve as reference parameters for amplification levels. Toward that end, relative quantitation is best accomplished when expression of the endogenous reference is relatively abundant and independent of pathophysiological events. Moreover, target amplification can be normalized for concentration differences relative to the endogenous control by the amount of total template added to the experimental reaction. Among the options for quantitative reference of amplification, ribosomal RNA (rRNA), glyceraldehyde 3-phosphate dehydrogenase (G3PDH), β-actin, and α-tubulin stand out as routine housekeeping controls (Wang et al., 2009; Teste et al., 2009). However, these controls have been noted to exhibit some limitations. For instance, rRNA lacks poly-(A) tailing and hence does not represent a true reference for normalization of oligo d(T)-dependent reverse-transcribed mRNA, whereas G3PDH, β-actin, and α-tubulin expression can be influenced by physiological or pathological conditions (Wang et al., 2009; Teste et al., 2009). In keeping with these constraints, careful selection of reference genes is necessary to ensure accurate application of relative quantitation (Wang et al., 2009; Teste et al., 2009).
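Relative quantitation against a housekeeping gene is often computed with the comparative Ct calculation (commonly written 2^-ΔΔCt), which the chapter does not name explicitly and which is shown here only as one common way to carry out the normalization. The sketch assumes that the target and reference assays amplify with the same efficiency (1.0, that is, perfect doubling, by default). The function name `relative_expression` and all Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control,
                        efficiency=1.0):
    """Fold change of the target in sample versus control, normalized to a reference gene."""
    # Normalize each condition to its own housekeeping (reference) Ct.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    # Each cycle of difference is one (1 + efficiency)-fold step in template.
    return (1.0 + efficiency) ** (-delta_delta)

# Hypothetical Ct values: target and reference gene in a test and a control sample.
fold = relative_expression(ct_target_sample=24.0, ct_ref_sample=18.0,
                           ct_target_control=26.0, ct_ref_control=18.5)
print(f"fold change = {fold:.2f}")
```

Here the normalized difference is -1.5 cycles, so the target is about 2.8-fold more abundant in the test sample. If the reference gene itself shifts between conditions, as the text cautions, the same arithmetic returns a biased fold change, which is why reference selection matters.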
In considering that mutations inevitably impinge on gene expression, real-time PCR has become the technology of choice to quantitatively assess the impact of given genetic lesions.

15.3 DNA SEQUENCING

15.3.1 Scope and Evolution
The original design for nucleotide sequencing reactions over three decades ago (Sanger et al., 1977) launched an unprecedented quest to decode eukaryote and prokaryote genomes. Since then, DNA and RNA sequencing have become virtually ordinary practices in most research laboratories. Complete characterization of newly identified genes, assessment of physiological and aberrant gene rearrangements and mutations, confirmation of genetic polymorphisms, legal proof of parental imprinting, criminal investigation, and evolution research are only a handful of molecular practices where DNA sequencing has had tremendous experimental impact. DNA polymerase-dependent reactions, using radioactively (mainly ³⁵S-dNTP) or nonisotopically (commonly biotin) labeled nucleotides and resolved by polyacrylamide gel electrophoresis, characterized the pioneer DNA sequencing practices, whose results were revealed by autoradiography and read by eye, base by base (Sanger et al., 1977; Church and Gilbert, 1984). The advent of fluorescence-based reactions, automated detection, and the commercialization of high-throughput DNA sequencers gave rise to large and ambitious projects that included the sequencing of the human genome. The phenomenon empowered technology improvement and fostered spinoffs of ancillary methodology to support the demands (McPherson, 2009; Watts and MacBeath, 2001; MacBeath et al., 2001). Although nucleotide reading from an autoradiograph rarely occurs at the present time, the fundamental principle of the chain termination reaction remains, and it was naturally followed by rational modifications that gave rise to primer walking, unidirectional deletions, direct sequencing by PCR, and formamide gel-based RNA sequencing (Sinden et al., 1999; Reddy et al., 2008; Voss et al., 1995; Brent and Guigo, 2004; Motta et al., 2006).

15.3.2 Chemistry, Reaction, and Analysis
The principles of the nucleotide sequencing chemistry are herein summarized. Basically, the method involves (1) a synthetic oligonucleotide complementary to the target DNA that serves as a primer and anneals at the 3′ end of the sequence, and (2) the enzyme DNA polymerase, which directionally catalyzes a 5′ to 3′ reaction to synthesize a complementary copy of the template strand. The in vitro DNA synthesis requires a pool of dATP, dCTP, dGTP, and dTTP to support the extension of the copying strand. While maintaining the equimolarity of the reaction, the dNTP pool also incorporates a labeled deoxynucleotide, usually [α-³⁵S]dATP or [α-³⁵S]dCTP if radioactivity is used, or a biotin- or digoxigenin-conjugated nucleotide when nonisotopic labeling is applied. Since nucleotide elongation needs the presence of 3′-hydroxyl (OH) groups, this requirement is exploited to terminate the reaction at each given base by the use of dideoxynucleotide (ddNTP) analogs, which lack the 3′-OH. Congruent with this rationale, four different reactions are set up, each containing one of the four distinct ddNTP terminators, and are electrophoretically resolved in independent lanes of an acrylamide slab gel. Since the DNA polymerase-dependent reaction stops every time an analog base is incorporated into the growing strand, this creates labeled nucleotide chains of different lengths. Hence each reaction, run in a separate gel lane, identifies the positions of one nucleotide, and the four lanes read together provide the precise nucleotide sequence (Fig. 15.1A).

Figure 15.1. (See color insert.) Manual versus automated DNA sequencing. (A) Acrylamide gel electrophoresis resolving typical chain termination reactions (Ho et al., unpublished data). Each lane corresponds to a designated reaction terminated with ddATP (A), ddCTP (C), ddGTP (G), or ddTTP (T) analogs, which identifies the respective nucleotides on the target DNA template. (B) Color-coded chromatogram of typical automated DNA sequencing data (Albrechtson et al., unpublished).

The spinoffs, unprecedented improvements, and multifaceted applications of this remarkable technology are now history (McPherson, 2009; MacBeath et al., 2001; Sinden et al., 1999; Reddy et al., 2008; Voss et al., 1995; Brent and Guigo, 2004; Motta et al., 2006; Lander et al., 2001). However, essential parameters remain valid and applicable irrespective of chemistry innovations, creative new reagents, and continuously updated equipment. These include pure and intact templates, versatile polymerase enzymes, rational design of priming reagents, and reliable detection systems. Whereas laboratory-based advances such as bacterial artificial chromosome (BAC) cloning and shotgun analyses continue to be the force behind the landmark leap of the nucleotide sequencing technology, software development can be credited with resolving the conundrum of managing the enormous
outburst of data generated by the powerful sequencing hardware. The imaginative software eased complex sequence annotations and analyses, thus giving rise to the numerous, efficient, and constantly evolving sequence databases and genome browsers, including those of the National Center for Biotechnology Information (NCBI), the University of California Santa Cruz (UCSC)/Bioinformatics, and the European Molecular Biology Laboratory (EMBL)/Ensembl (Kent et al., 2001; Hubbard et al., 2002; Wheeler et al., 2001). Among the platforms of relevant clinical interest, PolyPhred, PolyScan, and SNPDetector are technologies in continued evolution that support complex genotyping analyses, including SNP, insertion/deletion variant, and heterozygosity assessment (Chen et al., 2007; Zhang et al., 2005; Bhangale et al., 2006).

15.3.3 Outsourcing
The new generation of automated sequencing hardware continues to be a constant challenge for software evolution. Several sequencers can now produce over 1 × 10⁶ reads with sequence lengths of 400 base pairs (bp) or more, which handily satisfies the demands of the most ambitious projects. The cost of nucleotide sequencing has dropped significantly under the pressure of supply and demand. The most sophisticated sequencing services are available either intramurally or extramurally, with sample pickup and electronic data downloading included in the cost. Owning nucleotide sequencing equipment or manually performing sequencing experiments is no longer cost effective. The quality and quantity of data provided by most nucleotide sequencing core facilities (Fig. 15.1B) are highly competitive and come complete with superb downloadable software linkage. Unless a laboratory is fully engaged in the sequence analysis field, the wise action is to routinely outsource the sequence characterization of the genes of interest. Irrespective of the approach, nucleotide sequencing is an indispensable method that can uncover modification imprints affecting gene expression and function. On one hand, it provides the direct means to validate physiologically relevant gene mutations, rearrangements, and recombinatorial switches, which forecast successful achievement of gene expression diversity (the good news). On the other hand, it enables researchers to diagnose gene lesions of deleterious consequences (the bad news). Moreover, nucleotide sequencing has become a pillar in forensic medicine.

15.4 IN SITU HYBRIDIZATION

15.4.1 Experimental Considerations

Among the multitude of methods designed to investigate gene content, amplification, mutation, and expression, in situ hybridization is distinguished by its unique ability to provide information in the context of the chromosomal, nuclear, cellular, and/or histological microenvironments.
By and large DNA and RNA are the prominent targets of this powerful technology.
Genetic traits linked to diverse forms of disease are recognized as inherent consequences of chromosomal abnormalities, which can affect gene dosage, structure, processing, and function. Therefore, it is not surprising that gene duplications, deletions, and aberrant rearrangements resulting from chromosomal translocations can lead to dramatic phenotypes, which are often lethal or the cause of severe morphological and functional defects (Shaffer and Bejjani, 2004). The development of techniques that enabled the visualization, identification, and analysis of chromosomes marked a new era for accurate counts, integrity assessments, and detection of deletions and translocations (Garcia-Sagredo, 2008). Most of these methods, such as chromosome G-banding, are routine in laboratories where cytogenetics is performed and, as with most technologies, progress continues to be made when high-resolution techniques are required to reveal subtle and/or complex gene abnormalities. Applications include karyotype, chromosome gene assignment, chromatin structure, DNA recombination, gene expression, and radiation dosimetry assessments (Maierhofer et al., 2002). Detection of specific chromosome segments to assess structurally inaccessible gene lesions is laborious and technically demanding. Hence the need for innovative approaches, imposed by clinical demands, resolution limitations, and the inconsistency of available methods, moved the creative development of technology to the next level, and thus fluorescent in situ hybridization (FISH) became a reality. Historically, FISH appeared on the laboratory scene nearly 30 years ago, and despite its rapid and remarkable evolution the elemental principles and applications prevail. Basically, FISH involves the hybridization of fluorescently labeled probes to target segments of chromosomes that have been chemically fixed on slide preparations (Fig. 15.2).
Alternatively, probes can be labeled with biotin- or digoxigenin-conjugated nucleotides, whose detection can be indirectly achieved by fluorescent conjugates, such as streptavidin/avidin or antidigoxigenin antibodies. Irrespective of the use of direct or indirect methods, FISH detection technology has exploded in the past decade, and fluorochrome reagents, probe engineering, and visualization hardware and software are as diverse as they are sophisticated. Accordingly, FISH earned solid credibility for its chromosome/gene mapping capabilities, specificity, precision, flexibility, and superb microscopy and digital imaging support, thus rapidly becoming an indispensable tool in biomedical research and practice. For instance, FISH-dependent genetic queries find widespread use in a variety of biomedical fields, including genetics, neurosciences, reproduction, toxicology, ecology, and evolution (Volpi and Bridger, 2008), to name some.

15.4.2 Scope and New Developments

Because of the enormous diversity of FISH technology, where acronyms are coined for any given application (Volpi and Bridger, 2008), it may be justified to conclude that FISH now appears on the menu in multiple and varied application flavors. Because the aim of the present chapter is to underscore the
impact of the various molecular strategies applied to the pathophysiological assessment of mutations of discovered genes, only brief appraisals of FISH applications are herein presented.

Figure 15.2. (See color insert.) Chromosome locus assignment of a newly discovered gene. Fluorescent in situ hybridization analysis using a biotin-labeled genomic probe reveals the new gene at the 9q32 locus (Sims-Mourtada et al., 2005) upon binding of fluorescent streptavidin (shown herein as pseudo-yellow fluorescence) against a DAPI (blue fluorescence) background. Arrows indicate the gene locus assignment on the respective chromosome 9 alleles.

Unambiguous karyotype analyses, which can concomitantly query gene-specific location, detect cryptic gene fusions, and resolve intricate chromosome rearrangements, called for multicolor chromosome painting. The design and application of multiple fluorochromes and the development of broad wavelength range detection systems led the way to multiplex-FISH (M-FISH) and gave a record boost to cytogenetics (Volpi and Bridger, 2008; Kearney, 2006). The invention had particular biomedical impact on cancer research.

Alterations in chromosome numbers (aneuploidy) are common in trisomy syndromes, such as trisomy 2, and accurate assessment of micronucleation events is not trivial. Blocking cytoplasm partition with the drug cytochalasin-B (CB) in combination with FISH (CB-FISH) became instrumental in assessing an array of chromosome segregation abnormalities (Volpi and Bridger, 2008; Migliore et al., 1999).

Quantitative determination of telomere loss in aging can exploit the power of telomere hybridization, using peptide nucleic acid (PNA) FISH, combined with the versatility of flow cytometry (flow-FISH), which can measure fluorescent telomere signals in cell suspensions (Baerlocher et al., 2006; Potter and Wener, 2005). The approach enables the management of multiple cell analyses with high precision and conveys high clinical potential.

DNA strand breaks are physiologically and pathologically important, and determination of chromosome loci susceptibility is of relevant biomedical interest. The electrophoretic migration of DNA out of the nucleus into an agarose gel field, a detection method known as the comet assay, measures the degree of DNA breaks at the single-cell level. When combined with FISH (comet-FISH), the procedure reveals the chromosome sites with relevant DNA breakage susceptibility (Glei et al., 2009; Escobar et al., 2007).

The use of antibody probes in combination with FISH (immuno-FISH) to simultaneously detect precise gene loci and protein complexes has quasi-unlimited potential (Yang et al., 2004; Zinner et al., 2007; Sun et al., 2003). Likewise, the accurate capture of aberrant sister-chromatid exchanges by combining BrdU/cell cycle labeling with FISH (harlequin-FISH) technology (Pala et al., 2001; Jordan et al., 1999) is enticing.

Focused cytogenetic analysis of gene fusions resulting from chromosome rearrangements found a niche that has relevant diagnostic and prognostic value. By the clever application of dual-color fluorescent probes flanking the breakpoint site of chromosomal translocations (split-signal FISH), precise identification of the rearranging loci can be achieved (Volpi and Bridger, 2008; van Rijk et al., 2008, 2009).

Determination of gene expression in situ at nuclear or cytosolic locales, using fluorescence-based methods (RNA/Expression-FISH), opened a whole new dimension for evaluating transcription, mRNA processing, and decay. The method enables one to equally assess endogenous transcription, enforced expression resulting from plasmid-mediated transfection or lentivirus/retrovirus-dependent transduction/infection, and overexpression in transgenic (Tg) animal models.
The potential applications are virtually unlimited, from single cell-based allelic expression to phenotypically/pathologically distinct expression arrays, gene organization and regulation, nuclear/cytosol traffic, and diagnostic-based expression analyses (Volpi and Bridger, 2008; Ferrai et al., 2010; Mahadevaiah et al., 2009; Voss et al., 2006). As stated above, applications of in situ hybridization technology in general, and FISH in particular, have expanded in many directions. The flexibility of the methods developed suggests that a variety of FISH flavors will remain on the table of leading biomedical researchers.
15.5 EXPRESSION AT THE PROTEIN LEVEL
15.5.1 Overview
The pathophysiological consequences of known and newly identified gene lesions are nowhere more evident than in protein expression, subcellular and extracellular location, molecular interactions, and protein-specific functions.
CONFIRMATION OF A MUTATION BY MULTIPLE MOLECULAR APPROACHES
Proteins can be present in the nucleus to carry out both genetic and epigenetic roles; locate at the nuclear/cytosol boundaries as active gatekeepers; confer structural support and cell plasticity; perform a vast array of cytosolic, mitochondrial, lysosomal, Golgi, centriolar, vacuolar, and endoplasmic reticulum actions; direct intracellular signal traffic; be deployed at the cell surface as intracellular and extracellular communicators; and be secreted as intercellular signal exporters (Scott and Pawson, 2009; Xylourgidis and Fornerod, 2009; Nigg and Raff, 2009; Michelsen and von Hagen, 2009; Koizumi et al., 2007). Whereas certain gene mutations may not preclude transcription, translation, subcellular location, and function, thus resulting in silent phenotypes, others can produce dramatic effects that range from lethal phenotypes to severe functional impairments. Either way, formal proof that point mutations, polymorphisms, deletions, insertions, inversions, or fusions are silent or deleterious must be obtained at the protein level before the pathophysiological significance of a newly identified gene can be established. In keeping with these needs, enormous progress has been achieved in technology development and in the generation of sensitive and specific probes.
15.5.2 Antibody Technology
Although in vitro translation became instrumental in obtaining primary evidence of protein expression and continues to be applied to gene discovery (Frances et al., 1994), the generation of mouse monoclonal antibody (mAb) probes (Kohler and Milstein, 1975) completely transformed the approach to protein analysis. Consistent with its broad reach, this groundbreaking discovery found applications in virtually all fields of science, empowered gene discovery, and enabled the pathophysiological characterization of known and newly identified molecules. Notably, biomedical research became a major beneficiary of both diagnostic and therapeutic reagents, which were successfully turned to lifesaving uses in immunology, oncology, and cardiovascular, respiratory, neurological, and infectious disorders (Nissim and Chernajovsky, 2008). The cloning, engineering, and generation of recombinant monoclonal antibody probes made it possible to circumvent adverse reactions to proteins of murine origin and led the way to the development of human or humanized mAb reagents (Frances et al., 2000; Nissim and Chernajovsky, 2008). Despite this progress, antibody therapeutics still face significant hurdles in preventing undesirable effects that result from systemic and relatively long-term use of mAbs, including anti-idiotype reactions. On the other hand, the use of polyclonal antibody probes, obtained directly from the sera of immunized animals (goat, rabbit, donkey, or chicken), is noteworthy because it provides increased detection diversity, particularly when simultaneous analysis of distinct protein targets is needed.
15.5.3 Evolution and Scope of Protein Analyses
In addition to monoclonal and polyclonal antibody probes, peptide and protein tags can be engineered to facilitate in vitro, ex vivo, and in vivo detection of
proteins of interest, evaluation of their subcellular location, and even assessment of their function, such as cell structure, motility, migration, phagocytosis, killing, and survival (Siddiqa et al., 2001; Chen et al., 2009; Dross et al., 2009; Flannagan and Grinstein, 2009; Li et al., 2007; Lohela and Werb, 2009; Karan et al., 2004). Methods, reagents, and equipment to determine protein expression at the organ, histological, cellular, and biochemical levels are now in place to provide accurate morphological and physiological parameters on the status of wild-type and mutant protein forms. For example, immunohistochemistry (Fig. 15.3A) and immunofluorescent histology (Fig. 15.3B) can reveal the presence or absence of protein expression (Frances et al., 2000), subcellular localization (Siddiqa et al., 2001), cell homing to specific organs (Sims-Mourtada et al., 2003), and altered tissue development and function resulting from partial or complete gene inactivation (Aldrich et al., 2003). Enzyme-linked immunosorbent assays (ELISA) and the spinoff application ELISPOT quantitatively measure extracellular mediators, such as immunoglobulins, active peptides, cytokines, and chemokines (Bondada and Robertson, 2003; Bocchino et al., 2009; Hogrefe, 2005), whereas confocal microscopy captures the dynamics of proteins in action and cell signal synapses (Chen et al., 2009; Dross et al., 2009; Flannagan and Grinstein, 2009; Li et al., 2007; Contento et al., 2008). In retrospect, it is easy to appreciate how all areas of cell biology research came to rely on the availability of antibody probes and the feasibility of engineering molecular tags. Flow cytometry, for instance, enabled researchers to achieve comprehensive phenotyping, discover new and unique cell populations, and provide evidence of cell surface assembly of receptor proteins.
Figure 15.3. (See color insert.) Gene Expression by Histological Methods. (A) Immunohistochemical (IHC) detection of IgD- and CD38-expressing B lymphocytes, identified in the follicular mantle (FM: blue) and germinal centers (GC: red), respectively, of cryopreserved tonsil sections (Martinez-Valdez, unpublished), obtained according to institutional guidelines. IHC reactions result from specific monoclonal anti-IgD or anti-CD38 antibody reactivity, revealed by enzyme-dependent blue or red chromogen activation. (B) Representative immunofluorescent histology (IFH) to confirm the presence of CD3-expressing (red fluorescence) and interleukin-8 (IL-8) receptor/CXCR1-expressing (green fluorescence) T lymphocytes on DAPI-stained cryostat tonsil sections (Sims-Mourtada et al., 2003).
Moreover, flow cytometry allowed the concomitant determination of cell identity and function, the evaluation of cell cycle progression and proliferation, and the recording of cell senescence and death events (Rangel et al., 2005; Frances et al., 1994; Liu et al., 1996a, 1996b, 1996c; Malisan et al., 1996a; Matteucci and Giampietro, 2008; Krysko et al., 2008; Malisan et al., 1996b; Challen et al., 2009; Passos and von Zglinicki, 2007). In turn, access to innovative and continually evolving antibody and recombinant protein tag reagents made it possible to undertake more challenging biochemical tasks. These ranged from simultaneous verification of protein identity, molecular mass, and covalent protein bond formation, using one- and two-dimensional immunoblotting (Rangel et al., 2005; Frances et al., 1994; McKeller and Martinez-Valdez, 2006), to confirmatory assessment of deleterious mutations (Perlman et al., 2003). Furthermore, protein purification by means of antibody-based (affinity) and tag-dependent chromatography brought immunoproteomic experiments (Wu and Mohan, 2009) and the resolution of molecular structures (Wetterholm et al., 2008) within reach. Moreover, antibody probes and genetically engineered tags facilitate the biochemical elucidation of complex molecular interactions, which can be relevant to cell surface, cytosolic, nuclear, or extracellular functions. After protein extraction, the experiment involves a two-step procedure that includes (1) an antibody-based precipitation, or immunoprecipitation (IP), by controlled centrifugation, also known as pull-down, and (2) a subsequent immunoblotting to reveal the identity of the IP protein complexes. Simple reciprocal IP/immunoblots can provide confirmatory information when the identity of the suspected protein interactions is known and intentionally tested (Rangel et al., 2005). The use of a recombinant tag does not usually override the need for an antibody probe, since anti-tag antibodies are routinely used either to perform the IP or as a revealing probe to identify the protein(s) in the immune complex. When novel protein entities are being characterized and possess structural motifs that are hypothesized to be crucial for physiological molecular interactions, the IP/immunoblot approach can be instrumental in determining the consequences of genetic mutations (engineered or naturally occurring) at the active site. When the identity of the proteins interacting with a molecule under study needs to be elucidated, the alternative to the two-step IP/immunoblot approach involves state-of-the-art proteomics, which, though technically demanding, has been widely implemented and is available in most institutional cores. The experiments typically involve target-specific IP in conjunction with two-dimensional electrophoresis and mass spectrometry (Long et al., 2004).
A further evolution of the IP technology led to the development of chromatin IP (ChIP), which directly queries protein/DNA interactions, implicates genetic and epigenetic functions, and has reached high-throughput capacity. The basic method involves the cross-linking of nucleic acid/protein complexes, antibody-dependent IP, amplification of the target DNA region by PCR, and nucleotide sequence identification. Variations of the technology followed, and large-scale scanning projects became feasible (Lund-Olesen et al., 2008; Collas and Dahl, 2008; Barski and Zhao, 2009; Jothi et al., 2008; Barski and Frenkel, 2004; Trelle and Jensen, 2007).
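When the PCR step of a ChIP experiment is run as quantitative real-time PCR, enrichment at a target region is commonly summarized as a percentage of the input chromatin. The chapter does not specify this readout, so the following is only a minimal sketch of that arithmetic under stated assumptions: the function name, the 1% input fraction, and the Ct values are hypothetical, and amplification is assumed to double the product at every cycle.

```python
import math


def percent_input(ct_ip, ct_input, input_fraction=0.01):
    """Percent-of-input enrichment for one ChIP-qPCR target (illustrative).

    ct_input is assumed to come from a diluted input sample that represents
    input_fraction of the chromatin used per IP, so it is first adjusted to
    the Ct expected for 100% input. Assumes ~100% PCR efficiency.
    """
    adjusted_input_ct = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)


if __name__ == "__main__":
    # Hypothetical Ct values for a specific antibody and a control IgG.
    specific = percent_input(ct_ip=24.0, ct_input=25.0)
    control = percent_input(ct_ip=29.0, ct_input=25.0)
    print(f"specific antibody: {specific:.3f}% of input")
    print(f"control IgG:       {control:.3f}% of input")
```

An IP that yields the same Ct as a 1% input sample corresponds to 1% of input, and each cycle gained by the IP doubles that value. Comparing the specific antibody with a control IgG at the same locus is one simple way to judge whether the enrichment is meaningful.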
Few developments have impacted biomedicine more than the application of antibody technology to diagnosis and therapy, as diagnostic biomarkers, neutralizing agents, inhibitors, or active magic bullets. While this chapter cannot cover all dimensions of antibody technology, the discussion does underscore the fact that since the inception of monoclonal antibodies (Kohler and Milstein, 1975) science has
never been the same. Currently, unconjugated and fluorescently or enzymatically conjugated antibody probes are commercially available, and while these reagents can be in-house/custom tailored for new molecules or specific applications, outsourcing to commercial vendors can be more cost-effective.
15.6 GENETICALLY ENGINEERED ANIMALS
15.6.1 Enforced Expression in Transgenic Mice
In vitro studies can uncover fundamental features of key gene functions and involvement in pathways that may be crucial for cellular operations, including transcriptional activation/repression, survival/death, proliferation, motility, signal communication, immunity, and reproduction. Yet gene discovery, elucidation of intricate mechanisms, transcriptional/translational interference, and evaluation of dominant-negative mutations in vitro cannot match the pathophysiological significance of in vivo measurements. It is for this reason that engineered transgenic (Tg) mouse models are needed to assess the consequences of enforced gene overexpression (Blyth et al., 2009). Toward that end, mice can be engineered to enforce transgene expression ubiquitously or in a tissue-specific manner (Blyth et al., 2009). The choice of tissue/cell-specific Tg expression depends on whether the aim is to determine how excess gene expression can alter the balance of known cell pathways or how ectopic expression can uncover additional gene pathophysiology. A typical example is highlighted here. The mechanisms by which antigen (Ag)-specific germinal center (GC) B cells survive and die have not been clear, and hence overexpression of a candidate gene with suspected potential to confer resistance to physiological death could provide fundamental clues. To facilitate in vivo interrogation in a broad or restricted context, ubiquitous and cell-specific promoters can be applied to Tg mouse technology (Blyth et al., 2009; Zhou et al., 2004; Bordon et al., 2008). For instance, the Eμ-SV system uses IgM regulatory sequences to selectively drive transgene expression in the B cell compartment (Blyth et al., 2009; Bordon et al., 2008), whereas the β-globin promoter can direct transgene expression throughout animal tissues (Zhou et al., 2004). An added feature of Tg mouse engineering is that tags can be incorporated into the gene under study to facilitate detection of its expression.
A brief example is the engineering of gene X to generate a B cell–specific Eμ/promoter-gene X-flag transgene, which can be achieved by inserting the entire gene, flag-tagged by 3′ PCR insertion, into available cloning sites of the Eμ-SV40 vector. An alternative option is to use dicistronic transgene constructs, in which larger protein-based (rather than peptide) tags can be independently expressed from a different promoter through the incorporation of internal ribosomal entry sites (IRES) (Bouabe et al., 2008). When the gene locus of interest spans only a few kilobases (kb) or the target gene bears no introns
(Guzman-Rojas et al., 2000; Drysdale et al., 2002), the entire gene can be engineered for overexpression. However, cDNA-based constructs are by and large the most frequently used transgenes. In some applications where tissue-specific overexpression is envisioned, certain gene promoters that are used to target a given cell lineage can drive undesirable side effects. A typical observation is that when the transgene of interest is placed under Eμ-driven control, expression can leak into cell lineages other than B cells. The interpretation of the Tg phenotype could remain unaffected if the information sought is exclusively focused on the B cell compartment. Should the data become confounded by a striking phenotype that indirectly impinges on B cell physiology, the rational remedy is to substitute the promoter with a more stringent cell lineage-specific one, if available. In the described scenario, the CD19 promoter represents the ideal substitution. Advances in Tg technology enable us to test the pathophysiological effect of dominant-negative functional interference by overexpressing function-defective mutant proteins (Halabi et al., 2008). Likewise, potentially deleterious point mutations within functional motifs can be tested using knock-in Tg mice (Yang et al., 2008), which often resemble human disease and thus provide invaluable diagnostic/prognostic clues. Most institutions house genetically engineered mouse core facilities, which operate Tg and knockout (KO) projects within pathogen-free barriers (required for precise applications) and under the supervision of an institutional animal care and use committee (IACUC). Thus a given Tg project follows a fairly standard routine: after sequence verification that the gene of interest has the correct in-frame orientation within the promoter-driven construct of choice, the transgene is excised from the vector and purified according to the institution's Tg core facility specifications.
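The in-frame verification mentioned above can be illustrated with a small script. This is a hypothetical sketch and not the chapter's actual construct or workflow: it assumes a coding sequence laid out as start codon, gene X, a 3′ FLAG tag, and a stop codon; the toy gene X sequence and the function names are invented for illustration, and the FLAG codons shown are only one possible encoding of the DYKDDDDK peptide.

```python
# Hypothetical sketch: sanity-check that a 3' FLAG-tagged coding sequence is in frame.
# Assumed layout (illustrative only): ATG ... gene X ... FLAG ... stop codon.

FLAG_DNA = "GACTACAAAGACGATGACGACAAG"  # one possible encoding of DYKDDDDK
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a DNA string into consecutive triplets (a trailing partial codon is dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def check_tagged_orf(seq, tag=FLAG_DNA):
    """Return a list of problems found; an empty list means no problem was detected."""
    seq = seq.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("missing ATG start codon")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    triplets = codons(seq)
    if not triplets or triplets[-1] not in STOP_CODONS:
        problems.append("no stop codon at the 3' end")
    if any(c in STOP_CODONS for c in triplets[:-1]):
        problems.append("premature in-frame stop codon")
    # The tag is expected immediately upstream of the final three bases.
    if seq[-3 - len(tag):-3] != tag:
        problems.append("tag not found directly upstream of the stop codon")
    return problems


if __name__ == "__main__":
    gene_x = "ATGGCTGAAACC"  # toy 4-codon ORF standing in for gene X
    good = gene_x + FLAG_DNA + "TAA"
    shifted = gene_x + "A" + FLAG_DNA + "TAA"  # one extra base breaks the frame
    print("good construct:", check_tagged_orf(good))
    print("shifted construct:", check_tagged_orf(shifted))
```

For the correctly assembled toy construct the function returns an empty list, whereas a single extra base between gene X and the tag is reported as a frame problem. In practice such a check would complement, not replace, sequencing of the finished construct.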
From here, the Tg core performs all steps, from the pronuclear injections to the delivery of Tg mice. The relative transgene copy number is then determined against the endogenous gene homologue, using quantitative dot-blot hybridization and phosphorimaging. Typically, titrations can be performed using a standard genomic reference, e.g., the G3PDH gene. After F1 Tg mouse lines are established, reverse transcription-based real-time PCR is performed to select the Tg mouse lines with relevant transgene expression that will be used for further studies. Immunoblot confirmation can be applied when the transgene is tagged and/or specific antibody reagents are available. Alternatively, tissue sections and/or cell suspensions can be prepared and subjected to immunohistochemistry, immunofluorescence histology/cytology, or flow cytometry, provided that appropriate antibody reagents are available or the transgene carries a fluorescent tag. In keeping with the Eμ-Gene X-Flag Tg mouse example, downstream pathophysiological assessment of Tg mouse phenotypes involves comprehensive and multiparametric tests, in which the experimental approach is tailored to the information available on the gene of interest. Accordingly, B cell specific
parameters can disclose stage-specific accumulation of progenitor, precursor, and/or immature B cell subsets found in single-cell suspensions from bone marrow. Pure mononuclear cells can be obtained by discontinuous-gradient centrifugation, followed by either fluorescent or magnetic sorting of untouched B cells, using negative selection protocols that deplete all other cell lineages. Cell lineage, morphology, and developmental stage of the pure B cell preparations can be assessed by multiparametric flow cytometry and immunocytology assays. After primary characterization of Tg B cells, basal functions can be examined, including cell division patterns and the capacity to proliferate in the absence of growth and differentiation factors. In the event that the B cells exhibit a potential for autonomous clonal expansion, simple cytogenetic experiments can be performed to investigate whether autonomous proliferation and clonal expansion originate from illegitimate gene rearrangements. Additional cellular tests can be carried out to determine whether the enforced expression of transgene X confers a cell survival advantage or whether clonal expansion emerged from resistance to bone marrow microenvironment-induced death signaling. Depending on the results of the primary evaluation of the Tg B cell phenotype, the information can be supplemented by more mechanistically oriented experimentation. This can include comparative cDNA microarrays and/or protein-profiling experiments to reveal the gene expression patterns, mechanisms, and cellular pathways affected by the enhanced and sustained expression of the transgene. In the peripheral organs, critical parameters of antigen (Ag)-dependent B cell responses within secondary lymphoid organs can be evaluated.
For example, if the working hypothesis is that enforced overexpression of Eμ-gene X-flag may influence mature B cell survival and promote massive accumulation following immunization, attention must be directed to perturbations in all peripheral lymphoid organs. This basically entails postmortem examination of Tg mice for splenomegaly and lymphadenopathies, followed by histopathology experiments to assess architectural abnormalities and defects in GC formation (Guzman-Rojas et al., 2002; Sims-Mourtada et al., 2003; Aldrich et al., 2003). Flow cytometry and immunohistochemistry experiments can help reveal signs of abnormal clonal expansion within the GC microenvironment. Notably, enhanced Tg B cell survival or resistance to death-inducing stimuli can equally be determined by flow cytometry parameters. Following the same line of investigation applied to primary B cell maturation programs in the bone marrow, differential gene expression arrays and protein profiles can be queried to determine whether sustained Eμ-gene X-flag overexpression impinges upon target GC cellular pathways. Although the hypothetical gene X, proposed herein as an example, may serve to understand how high-affinity/Ag-specific GC B cells acquire cell survival benefits, the overall take-home message of this chapter is that the ultimate goal of a given Tg project is to reveal the in vivo significance of
gene X expression. In keeping with this rationale, the study must remain open to evaluating the unexpected emergence of ancillary information resulting from the enforced overexpression of the gene under investigation. This may involve the spontaneous expansion and persistence of abnormal cells, including tumorigenic phenotypes, which can conceivably lead to proliferative disorders. The concept of spontaneous expansion and persistence of abnormal cell phenotypes must be evaluated with caution, since it is equally conceivable that the enforced overexpression of gene X alone may not be responsible for the observed phenomenon. Because tumor progression is likely to be complex and multifactorial, the concept of spontaneity and persistence needs to be rigorously evaluated in the context of complementary studies. Accordingly, breeding hypothetical Tg X mice into either tumor suppressor KO or proto-oncogene Tg backgrounds, in the presence or absence of carcinogens, represents a complementary strategy to further assess gene X's pathophysiological potential.
15.6.2 Gene Targeting/Knockout
Naturally occurring mutations provide useful information about the function of altered genes, particularly those that affect hematopoietic development. The introduction of gene-targeted deletions, whereby mutations are engineered in the mouse (Thomas and Capecchi, 1987; Capecchi, 1989), led to remarkable advances in the analysis of cellular functions. The principle of the method, also known as gene knockout (KO), is the use of murine embryonic stem (ES) cells, which can be maintained in a pluripotent state in culture (Weiss, 1997). When mutant ES cells are introduced into a host blastocyst, they contribute to most tissues of the chimeric mouse, including germ cells. Breeding of the chimeras results in the transmission of the mutation, first manifested in heterozygous offspring. Heterozygous mice can then be mated to develop homozygous (-/-) strains. Genes that exhibit a potential role in development, such as regulators of growth and differentiation, stand out as key candidates. The choice of a gene target largely relies on its background: (1) the pattern of expression in developing and mature cells, (2) known or suspected lineage specificity, (3) expression related to the growth of malignancies (e.g., leukemia and lymphoma), and (4) prior in vitro or in vivo evidence for function in a particular cellular pathway and relationship to other molecules. The most widely used practice for disrupting a gene in vivo to analyze its function is the deletion of a segment of, or the complete, targeted gene by replacing the wild-type gene with a mutant allele through homologous recombination, combined with drug selection (the neomycin resistance gene, for example). At least three targeting strategies, whose effectiveness can be monitored in ES cells, can be envisioned. In the first model, the gene of interest can be inactivated by replacing its 5′ upstream region and first exon with a Neo gene
cassette. The second model could interrupt gene expression by swapping the largest exon with the Neo cassette. Alternatively, homologous recombination can target selective exons that encode known functional motifs or enzyme catalytic sites, by Neo cassette replacement in the opposite transcriptional and translational orientation (Scacheri et al., 2001). The bidirectional arrangement of the Neo cassette marker with respect to the gene target aims at dominant Neo gene transcription from the shared locus to abrogate endogenous gene activity. However, neomycin/G418 selection is by no means proof of gene inactivation, and hence confirmation at the transcriptional and protein levels must be obtained before ES cell/blastocyst transfers. While the approach can be technically demanding, numerous cassette constructs are commercially available. The generation of preliminary in vitro data on the function of the gene under study can be of the utmost value in considering the pathophysiological consequences of its in vivo inactivation. Information on tissue distribution; expression pattern during embryonic and adult development, cell differentiation, and/or activation; subcellular location; function; potential physiological redundancy; and pathway characterization could prove instrumental in the design of the targeting strategy and the assessment of the KO phenotype. Even when embryonic or fetal deaths are not anticipated, constant monitoring of the gene-deficient colony in conjunction with veterinarians and animal facility staff is customary. Should death occur at any stage, necropsy procedures must be carried out, followed by organ fixation, tissue section preparation, and anatomopathological examination. When in utero deaths occur, thorough examination of whole-mount embryos and fetal cryopreserved sections can be carried out by immunohistochemistry and in situ hybridization to register potential developmental alterations that led to the death of the mouse.
Likewise, necropsies of animals that die postnatally should be routinely performed, and tissue sections and cell suspension preparations processed for immunohistochemistry, in situ hybridization, and flow cytometry. These experiments can reveal altered cellularity, morphology, and gene expression. Tissue sections can also be used to perform terminal deoxynucleotidyl transferase (TdT)-mediated dUTP nick-end labeling (TUNEL) assays and uncover defects in cell survival resulting from the targeted gene deficiency. Isolation and purification of distinct cell subsets carry equal experimental value because they give the investigator access to a thorough assessment of phenotype and function, using flow cytometry and cell culture assays. The approach can be of particular immunological value when cell preparations are obtained from bone marrow, thymus, and spleen, which can provide key information on developmental, maturation, and differentiation deficiencies within the immune cell compartment. Otherwise, viable litters can be routinely examined for anatomical integrity, weight at birth, gain or loss of weight during development from the neonatal stage to adulthood, fur color, number and characteristics of limbs, ambulatory properties,
ability to thrive, behavior, and the capacity to respond to external stimuli (light, sound, mild changes in temperature).
15.6.3 Pitfalls and Solutions
While the in vivo inactivation of certain genes often leads to phenotypes that are a priori perceived as inconsequential, numerous experimental approaches can be designed to challenge behavioral, neurological, reproductive, cardiovascular, respiratory, renal, gastrointestinal, metabolic, and immunological functions (Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2010; Komine, 2009; Lerebours et al., 2002; Li et al., 2009; McGivern and Lemon, 2009; Morisawa et al., 2008). Within the immunological context, for instance, lymphoid organ analyses can be performed via preimmunization and postimmunization schemes to examine ligand/receptor interactions, plasma cell differentiation, and the production of Ag-specific antibodies (Siddiqa et al., 2001; Guzman-Rojas et al., 2002; Aldrich et al., 2003; Malisan et al., 1996b; Briere et al., 1996). Flow cytometry, cell sorting, immunohistology, cell culture, ELISA, and mixed lymphocyte reactions (MLR), together with standard genetic/epigenetic, biochemical, and molecular biology parameters, make up the tools routinely used for evaluating immune functions. On the other hand, inactivation of genes that are critical for mouse development or that exert critical functions in adult vital organs frequently results in embryonic or perinatal lethality. To circumvent these limitations, alternative gene targeting strategies can be devised. For example, conditional cell lineage-specific gene targeting, which makes use of the Cre/loxP recombination system of bacteriophage P1, can be envisioned (Aoki and Taketo, 2008; Kirschner, 2009). The target gene is flanked by the recombinase recognition loxP sites (a floxed target gene), which are introduced by homologous recombination into the ES cells. In such models, the expression of the target gene should be the same as in wild-type strains, unless the gene becomes inactivated by Cre-mediated deletion.
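The logic of a floxed allele can be made concrete with a short sketch. It is a simplified, hypothetical model rather than a description of any particular mouse strain: the class and function names and the lineage labels are invented, and Cre-mediated recombination is assumed to be complete wherever the recombinase is expressed, which real experiments need to verify.

```python
# Hypothetical sketch of conditional (Cre/loxP) gene targeting logic.
# Lineage names and the complete-recombination assumption are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Mouse:
    floxed_alleles: int                   # 0, 1, or 2 floxed copies of the target gene
    cre_lineages: frozenset = frozenset()  # lineages in which Cre is expressed


def target_status(mouse, lineage):
    """Predict the state of the target gene in one cell lineage."""
    cre_on = lineage in mouse.cre_lineages
    if not cre_on or mouse.floxed_alleles == 0:
        # Without Cre, a floxed allele is expected to behave like wild type.
        return "wild-type expression"
    if mouse.floxed_alleles == 1:
        return "heterozygous deletion"
    return "homozygous deletion"


if __name__ == "__main__":
    floxed_only = Mouse(floxed_alleles=2)
    conditional_ko = Mouse(floxed_alleles=2, cre_lineages=frozenset({"B cell"}))
    for name, mouse in [("floxed only", floxed_only), ("floxed + Cre", conditional_ko)]:
        for lineage in ("B cell", "hepatocyte"):
            print(f"{name:13s} {lineage:11s} -> {target_status(mouse, lineage)}")
```

In this model a mouse that carries two floxed alleles but no Cre retains wild-type expression in every lineage, while adding Cre in one lineage removes the gene only there. This is what allows a gene whose germline inactivation is lethal to be studied in a selected cell compartment.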
The deletion of the target gene can then be selectively accomplished in vivo in a cell lineage-restricted manner, by mating the mouse strain carrying a floxed target with a transgenic mouse expressing the Cre recombinase under the control of a cell lineage-specific promoter (Aoki and Taketo, 2008; Kirschner, 2009). Furthermore, advances in gene targeting technology have introduced procedures to induce gene inactivation in mice, practically at will, during specific stages of mouse development. The approach exploits inducible promoters to control the expression of diverse Cre recombinase constructs (Aoki and Taketo, 2008). An example of the approach features the Mx1 promoter (Gutierrez et al., 2008), which transiently drives high levels of transcriptional activity under the influence of interferon-α or -β. Thus mice carrying a floxed target gene and an Mx-Cre transgene can be induced to inactivate the target locus following treatment with
IFN (Gutierrez et al., 2008). Such approaches are of great value and continue to be developed. The overall lesson from how genetically engineered mouse technology emerged and where it stands today is that, when faced with the challenge of assessing the biological significance of a new gene, one can rest assured that supporting reagents, methods, and expertise are within reach.
15.7 CONCLUDING REMARKS
Evidence that mutations of pathophysiologically relevant genes account for a large share of human morbidity and mortality is readily available in most biomedical fields (Mullighan et al., 2009; Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2009; Komine, 2009; Lerebours et al., 2002; McGivern and Lemon, 2009; Morisawa et al., 2008). Numerous reports of cardiovascular, respiratory, neurological, renal, metabolic, reproductive, and sexual failures, together with increased susceptibility to inflammation, oxidative toxicity, anaphylactic shock, infections, and tumorigenicity, support this notion (Mullighan et al., 2009; Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2009; Komine, 2009; Lerebours et al., 2002; McGivern and Lemon, 2009; Morisawa et al., 2008). Notably, defective leukocyte reactivities are widely documented in association with genetic lesions of proinflammatory cytokines, chemotactic factors, cell cycle regulators, tumor suppressors, proto-oncogenes, surface receptors, and proteolytic metalloproteases (Mullighan et al., 2009; Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2009; Komine, 2009; Lerebours et al., 2002; McGivern and Lemon, 2009; Morisawa et al., 2008). The collective supporting information validates the development and use of multiparametric methodology to accurately investigate new gene entities, not only to reveal the physiological benefits of their expression but also to assess the pathological risks associated with their lesions. As science is in constant motion and continues to yield information that challenges existing paradigms, the pragmatic purpose of this chapter is to draw the reader to the thought that new findings lead to new questions, pursuits, and problems to be solved. A thorough scrutiny of what we can learn about genetics in particular and biology in general may serve to ignite creativity that contributes to biomedical endeavors.
15.8 ACKNOWLEDGMENTS
The writing of this book chapter was supported by the National Institutes of Health (NIH) R01 grant AI065796-01.
15.9 REFERENCES
Aldrich MB, et al. (2003). Impaired germinal center maturation in adenosine deaminase deficiency. J Immunol 171(10):5562–70. Alt FW, et al. (1992). VDJ recombination. Immunol Today 13(8):306–15. Aoki K, Taketo MM. (2008). Tissue-specific transgenic, conditional knockout and knock-in mice of genes in the canonical Wnt signaling pathway. Methods Mol Biol 468:307–31. Arya M, et al. (2005). Basic principles of real-time quantitative PCR. Expert Rev Mol Diagn 5(2):209–19. Baerlocher GM, et al. (2006). Flow cytometry and FISH to measure the average length of telomeres (flow FISH). Nat Protoc 1(5):2365–76. Barragan E, et al. (2001). Quantitative detection of AML1-ETO rearrangement by real-time RT-PCR using fluorescently labeled probes. Leuk Lymphoma 42(4):747–56. Barski A, Frenkel B. (2004). ChIP Display: novel method for identification of genomic targets of transcription factors. Nucl Acids Res 32(12):e104. Barski A, Zhao K. (2009). Genomic location analysis by ChIP-Seq. J Cell Biochem 107(1):11–18. Bauer AK, Rondini EA. (2009). Review paper: the role of inflammation in mouse pulmonary neoplasia. Vet Pathol 46:369–90. Bentley DR, et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218):53–59. Bhangale TR, et al. (2006). Automating resequencing-based detection of insertion-deletion polymorphisms. Nat Genet 38(12):1457–62. Bhangoo A, Jacobson-Dickman E. (2009). The genetics of idiopathic hypogonadotropic hypogonadism: unraveling the biology of human sexual development. Pediatr Endocrinol Rev 6(3):395–404. Bild AH, et al. (2006). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(7074):353–57. Blaveri E, et al. (2005). Bladder cancer stage and outcome by array-based comparative genomic hybridization. Clin Cancer Res 11:7012–22. Blyth K, et al. (2009). Runx1 promotes B-cell survival and lymphoma development. Blood Cells Mol Dis 43(1):12–19. Bocchino M, et al. (2009). 
IFN-gamma release assays in tuberculosis management in selected high-risk populations. Expert Rev Mol Diagn 9(2):165–77. Bondada S, Robertson DA. (2003). Assays for B lymphocyte function. Curr Protoc Immunol Chapter 3, Unit 3.8. Bordon A, et al. (2008). Enforced expression of the transcriptional coactivator OBF1 impairs B cell differentiation at the earliest stage of development. PLoS One 3(12):e4007. Bouabe H, et al. (2008). Improvement of reporter activity by IRES-mediated polycistronic reporter system. Nucl Acids Res 36(5):e28. Brent MR, Guigo R. (2004). Recent advances in gene structure prediction. Curr Opin Struct Biol 14(3):264–72.
Briere F, et al. (1996). [B lymphocytes of patients with complete IgA deficiency secrete IgA in response to interleukin 10]. Nephrologie 17(5):289–95. Bustin SA. (2000). Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J Mol Endocrinol 25(2):169–93. Bustin SA, Nolan T. (2004). Pitfalls of quantitative real-time reverse-transcription polymerase chain reaction. J Biomol Tech 15(3):155–66. Calin GA, Croce CM. (2007). Chromosomal rearrangements and microRNAs: a new cancer link with clinical implications. J Clin Invest 117(8):2059–66. Callahan G, et al. (2003). Characterization of the common fragile site FRA9E and its potential role in ovarian cancer. Oncogene 22:590–61. Capecchi MR. (1989). The new mouse genetics: altering the genome by gene targeting. Trends Genet 5(3):70–76. Casellas R, et al. (2001). Contribution of receptor editing to the antibody repertoire. Science 291(5508):1541–44. Challen GA, et al. (2009). Mouse hematopoietic stem cell identification and analysis. Cytometry A 75(1):14–24. Chan LS. (2008). Atopic dermatitis in 2008. Curr Dir Autoimmun 10:76–118. Chen J, Alt FW. (1993). Gene rearrangement and B-cell development. Curr Opin Immunol 5(2):194–200. Chen K, et al. (2007). PolyScan: an automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Res 17(5):659–66. Chen Y, et al. (2009). Automated 5-D analysis of cell migration and interaction in the thymic cortex from time-lapse sequences of 3-D multi-channel multi-photon images. J Immunol Methods 340(1):65–80. Choe J, et al. (1996). Cellular and molecular factors that regulate the differentiation and apoptosis of germinal center B cells. Anti-Ig down-regulates Fas expression of CD40 ligand-stimulated germinal center B cells and inhibits Fas-mediated apoptosis. J Immunol 157(3):1006–16. Church GM, Gilbert W. (1984). Genomic sequencing. Proc Natl Acad Sci U S A 81(7):1991–95. Cihakova D, Rose NR. (2008). 
Pathogenesis of myocarditis and dilated cardiomyopathy. Adv Immunol 99:95–115. Collas P, Dahl JA. (2008). Chop it, ChIP it, check it: the current status of chromatin immunoprecipitation. Front Biosci 13:929–43. Contento RL, et al. (2008). CXCR4-CCR5: a couple modulating T cell functions. Proc Natl Acad Sci U S A 105(29):10101–06. Deepak S, et al. (2007). Real-time PCR: revolutionizing detection and expression analysis of genes. Curr Genomics 8(4):234–51. Dross N, et al. (2009). Mapping eGFP oligomer mobility in living cell nuclei. PLoS One 4(4):e5041. Drysdale J, et al. (2002). Mitochondrial ferritin: a new player in iron metabolism. Blood Cells Mol Dis 29(3):376–83. Dudley DD, et al. (2005). Mechanism and control of V(D)J recombination versus class switch recombination: similarities and differences. Adv Immunol 86:43–112. Dunckley T, et al. (2007). Whole-genome analysis of sporadic amyotrophic lateral sclerosis. N Engl J Med 357(8):775–88.
Engelmark MT, et al. (2008). Polymorphisms in 9q32 and TSCOT are linked to cervical cancer in affected sib-pairs with high mean age at diagnosis. Hum Genet 123: 437–43. Erdmann VA, et al. (2000). Non-coding, mRNA-like RNAs database Y2K. Nucl Acids Res 28(1):197–200. Escobar PA, et al. (2007). Leukaemia-specific chromosome damage detected by comet with fluorescence in situ hybridization (comet-FISH). Mutagenesis 22(5):321–27. Espy MJ, et al. (2006). Real-time PCR in clinical microbiology: applications for routine laboratory testing. Clin Microbiol Rev 19(1):165–256. Fan YS, et al. (2007). Detection of pathogenic gene copy number variations in patients with mental retardation by genomewide oligonucleotide array comparative genomic hybridization. Hum Mutat 28(11):1124–32. Ferrai C, et al. (2010). Poised transcription factories prime silent uPA gene prior to activation. PLoS Biol 8(1):e1000270. Festing MF, et al. (1998). At least four loci and gender are associated with susceptibility to the chemical induction of lung adenomas in A/J x BALB/c mice. Genomics 53:129–36. Flannagan RS, Grinstein S. (2009). The application of fluorescent probes for the analysis of lipid dynamics during phagocytosis. Meth Mol Biol 591:121–34. Frances V, et al. (1994). A surrogate 15 kDa JC kappa protein is expressed in combination with mu heavy chain by human B cell precursors. EMBO J 13(24):5937–43. Frances V, et al. (2000). The human anti-bullous pemphigoid monoclonal autoantibody P22 is encoded by genes of the IGHV4 and IGLV4 families. J Autoimmun 15(4):459–68. Franco S, et al. (2006). Pathways that suppress programmed DNA breaks from progressing to chromosomal breaks and translocations. DNA Repair (Amst) 5(9–10): 1030–41. Freeman WM, et al. (1999). Quantitative RT-PCR: pitfalls and potential. Biotechniques 26(1):112–22, 124–15. Fusco A, Fedele M. (2007). Roles of HMGA proteins in cancer. Nat Rev Cancer 7(12):899–910. Gallardo D, et al. (2008). 
Mapping of quantitative trait loci for cholesterol, LDL, HDL, and triglyceride serum concentrations in pigs. Physiol Genomics 35(3):199–209. Garcia-Castillo H, Barros-Nunez P. (2009). Detection of clonal immunoglobulin and T-cell receptor gene recombination in hematological malignancies: monitoring minimal residual disease. Cardiovasc Hematol Disord Drug Targets 9(2):124–35. Garcia-Sagredo JM. (2008). Fifty years of cytogenetics: a parallel view of the evolution of cytogenetics and genotoxicology. Biochim Biophys Acta 1779(6–7):363–75. Geisert EE, et al. (2009). Gene expression in the mouse eye: an online resource for genetics using 103 strains of mice. Mol Vis 15:1730–63. Gentle A, et al. (2001). High-resolution semi-quantitative real-time PCR without the use of a standard curve. Biotechniques 31(3):502–08. Gibson UE, et al. (1996). A novel method for real time quantitative RT-PCR. Genome Res 6(10):995–1001. Glei M, et al. (2009). Use of Comet-FISH in the study of DNA damage and repair: review. Mutat Res 681(1):33–43.
Gungor N, et al. (2010). Genotoxic effects of neutrophils and hypochlorous acid. Mutagenesis 25(2):149–54. Gutierrez L, et al. (2008). Ablation of Gata1 in adult mice results in aplastic crisis revealing its essential role in steady-state and stress erythropoiesis. Blood 111(8):4375–85. Guzman-Rojas L, et al. (2000). PRELI, the human homologue of the avian px19, is expressed by germinal center B lymphocytes. Int Immunol 12(5):607–12. Guzman-Rojas L, et al. (2002). Life and death within germinal centres: a double-edged sword. Immunology 107(2):167–75. Halabi CM, et al. (2008). Interference with PPAR gamma function in smooth muscle causes vascular dysfunction and hypertension. Cell Metab 7(3):215–26. Hartmann S, et al. (2008). Detection of genomic imbalances in microdissected Hodgkin and Reed-Sternberg cells of classical Hodgkin’s lymphoma by array-based comparative genomic hybridization. Haematologica 93(9):1318–26. Heid CA, et al. (1996). Real time quantitative PCR. Genome Res 6(10):986–94. Helmrich A, et al. (2006). Common fragile sites are conserved features of human and mouse chromosomes and relate to large active genes. Genome Res 16:1222–30. Herceg Z, Hainaut P. (2007). Genetic and epigenetic alterations as biomarkers for cancer detection, diagnosis and prognosis. Mol Oncol 1(1):26–41. Hogrefe WR. (2005). Biomarkers and assessment of vaccine responses. Biomarkers 10(Suppl 1):S50–57. Hubbard T, et al. (2002). The Ensembl genome database project. Nucl Acids Res 30(1):38–41. Hussain S, et al. (2009). DUBs and cancer: the role of deubiquitinating enzymes as oncogenes, non-oncogenes and tumor suppressors. Cell Cycle 8(11):1688–97. Hyvarinen K, et al. (2009). Detection and quantification of five major periodontal pathogens by single copy gene-based real-time PCR. Innate Immun 15(4):195–204. Ikram MA, et al. (2009). Genomewide association studies of stroke. N Engl J Med 360(17):1718–28. Jacob J, et al. (1991). 
Intraclonal generation of antibody mutants in germinal centres. Nature 354(6352):389–92. Jaillard S, et al. (2009). Identification of gene copy number variations in patients with mental retardation using array-CGH: Novel syndromes in a large French series. Eur J Med Genet, Epub ahead of print, October 28. Jiao Y, et al. (2009). ENU induced single mutation locus on chr 16 leads to high-frequency hearing loss in mice. Genes Genet Syst 84(3):219–24. Jolly CJ, O’Neill HC. (1997). Specific transcription of the unrearranged TCR V beta 8.2 gene in lymphoid tissues occurs independently of V(D)J rearrangement. Immunol Cell Biol 75(1):13–20. Jones RG, Thompson CB. (2009). Tumor suppressors and cell metabolism: a recipe for cancer growth. Genes Dev 23(5):537–48. Jordan R, et al. (1999). Detection of chromosome aberrations by FISH as a function of cell division cycle (harlequin-FISH). Biotechniques 26(3):532–34. Jothi R, et al. (2008). Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucl Acids Res 36(16):5221–31.
Karan G, et al. (2004). Expression of wild type and mutant ELOVL4 in cell culture: subcellular localization and cell viability. Mol Vis 10:248–53. Karnan S, et al. (2006). Genomewide array-based comparative genomic hybridization analysis of acute promyelocytic leukemia. Genes Chromosomes Cancer 45(4):420–25. Kearney L. (2006). Multiplex-FISH (M-FISH): technique, developments and applications. Cytogenet Genome Res 114(3–4):189–98. Kent WJ. (2002). BLAT—the BLAST-like alignment tool. Genome Res 12(4):656–64. Kent WJ, Haussler D. (2001). Assembly of the working draft of the human genome by GigAssembler. Genome Res 11(9):1541–48. Kirschner LS. (2009). Use of mouse models to understand the molecular basis of tissue-specific tumorigenesis in the Carney complex. J Intern Med 266(1):60–68. Kleeberger SR, et al. (2000). Genetic susceptibility to ozone-induced lung hyperpermeability: role of toll-like receptor 4. Am J Respir Cell Mol Biol 22:620–27. Kohler G, Milstein C. (1975). Continuous cultures of fused cells secreting antibody of predefined specificity. Nature 256(5517):495–97. Koizumi K, et al. (2007). Chemokine receptors in cancer metastasis and cancer cell-derived chemokines in host immune response. Cancer Sci 98(11):1652–58. Komine M. (2009). Analysis of the mechanism for the development of allergic skin inflammation and the application for its treatment: keratinocytes in atopic dermatitis—their pathogenic involvement. J Pharmacol Sci 110(3):260–64. Krysko DV, et al. (2008). Apoptosis and necrosis: detection, discrimination and phagocytosis. Methods 44(3):205–21. Kubista M, et al. (2006). The real-time polymerase chain reaction. Mol Aspects Med 27(2–3):95–125. Kutyavin I, et al. (2003). Chemistry of minor groove binder-oligonucleotide conjugates. Curr Protoc Nucleic Acid Chem Chapter 8, Unit 8.4. Lander ES, et al. (2001). Initial sequencing and analysis of the human genome. Nature 409(6822):860–921. Landry JR, Mager DL. (2002). 
Widely spaced alternative promoters, conserved between human and rodent, control expression of the Opitz syndrome gene MID1. Genomics 80(5):499–508. Landry JR, et al. (2003). Complex controls: the role of alternative promoters in mammalian genomes. Trends Genet 19(11):640–48. Landvik NE, et al. (2009). A specific interleukin-1B haplotype correlates with high levels of IL1B mRNA in the lung and increased risk of non-small cell lung cancer. Carcinogenesis 30:1186–92. Lay MJ, Wittwer CT. (1997). Real-time fluorescence genotyping of factor V Leiden during rapid-cycle PCR. Clin Chem 43(12):2262–67. Lerebours F, et al. (2002). Evidence of chromosome regions and gene involvement in inflammatory breast cancer. Int J Cancer 102(6):618–22. Li H, et al. (2009). Matrix metalloproteinase-9 inhibition ameliorates pathogenesis and improves skeletal muscle regeneration in muscular dystrophy. Hum Mol Genet 18(14):2584–98. Li J, et al. (2007). Noninvasive intravital imaging of thymocyte dynamics in medaka. J Immunol 179(3):1605–15.
Li X, et al. (2008). Clinical utility of microarrays: current status, existing challenges and future outlook. Curr Genomics 9(7):466–74. Liu CG, et al. (2008). MicroRNA expression profiling using microarrays. Nat Protoc 3(4):563–78. Liu YJ, et al. (1996a). Normal human IgD + IgM- germinal center B cells can express up to 80 mutations in the variable region of their IgD transcripts. Immunity 4(6):603–13. Liu YJ, et al. (1996b). Sequential triggering of apoptosis, somatic mutation and isotype switch during germinal center development. Semin Immunol 8(3):169–77. Liu YJ, et al. (1996c). Within germinal centers, isotype switching of immunoglobulin genes occurs after the onset of somatic mutation. Immunity 4(3):241–50. Lohela M, Werb Z. (2009). Intravital imaging of stromal cell dynamics in tumors. Curr Opin Genet Dev 20(1):72–78. Long A, et al. (2004). A multidisciplinary approach to the study of T cell migration. Ann N Y Acad Sci 1028:313–19. Louis M, et al. (2004). Rapid combined genotyping of factor V, prothrombin and methylenetetrahydrofolate reductase single nucleotide polymorphisms using minor groove binding DNA oligonucleotides (MGB probes) and real-time polymerase chain reaction. Clin Chem Lab Med 42(12):1364–69. Louvel S, et al. (2008). Detection of drug-resistant HIV minorities in clinical specimens and therapy failure. HIV Med 9(3):133–41. Lund-Olesen T, et al. (2008). Sensitive on-chip quantitative real-time PCR performed on an adaptable and robust platform. Biomed Microdevices 10(6):769–76. Lutfalla G, Uze G. (2006). Performing quantitative reverse-transcribed polymerase chain reaction experiments. Meth Enzymol 410:386–400. Luthra R, Medeiros LJ. (2006). TaqMan reverse transcriptase-polymerase chain reaction coupled with capillary electrophoresis for quantification and identification of bcr-abl transcript type. Meth Mol Biol 335:135–45. MacBeath JR, et al. (2001). Automated fluorescent DNA sequencing on the ABI PRISM 377. Meth Mol Biol 167:119–52. 
Mackay J, Landt O. (2007). Real-time PCR fluorescent chemistries. Meth Mol Biol 353:237–61. Mahadevaiah SK, et al. (2009). Using RNA FISH to study gene expression during mammalian meiosis. Meth Mol Biol 558:433–44. Maierhofer C, et al. (2002). Multicolor FISH in two and three dimensions for clastogenic analyses. Mutagenesis 17(6):523–27. Malinen E, et al. (2003). Comparison of real-time PCR with SYBR Green I or 5′-nuclease assays and dot-blot hybridization with rDNA-targeted oligonucleotide probes in quantification of selected faecal bacteria. Microbiology 149(Pt 1):269–77. Malisan F, et al. (1996a). B-chronic lymphocytic leukemias can undergo isotype switching in vivo and can be induced to differentiate and switch in vitro. Blood 87(2):717–24. Malisan F, et al. (1996b). Interleukin-10 induces immunoglobulin G isotype switch recombination in human CD40-activated naive B lymphocytes. J Exp Med 183(3):937–47.
Marcucci G, et al. (2008). MicroRNA expression in cytogenetically normal acute myeloid leukemia. N Engl J Med 358:1919–28. Martinez-Valdez H, et al. (1996). Human germinal center B cells express the apoptosis-inducing genes Fas, c-myc, P53, and Bax but not the survival gene bcl-2. J Exp Med 183(3):971–77. Mathews LA, et al. (2009). Epigenetic gene regulation in stem cells and correlation to cancer. Differentiation 78(1):1–17. Matteucci E, Giampietro O. (2008). Flow cytometry study of leukocyte function: analytical comparison of methods and their applicability to clinical research. Curr Med Chem 15(6):596–603. McGivern DR, Lemon SM. (2009). Tumor suppressors, chromosomal instability, and hepatitis C virus-associated liver cancer. Annu Rev Pathol 4:399–415. McKeller MR, Martinez-Valdez H. (2006). The kappa-like pre-B receptor: surplus biology or a missing link? Semin Immunol 18(1):40–43. McPherson JD. (2009). Next-generation gap. Nat Methods 6(11 Suppl):S2–5. Medstrand P, et al. (2001). Long terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes in humans. J Biol Chem 276(3):1896–903. Meffre E, et al. (1998). Antigen receptor engagement turns off the V(D)J recombination machinery in human tonsil B cells. J Exp Med 188(4):765–72. Mehra S, Hu WS. (2005). A kinetic model of quantitative real-time polymerase chain reaction. Biotechnol Bioeng 91(7):848–60. Michelsen U, von Hagen J. (2009). Isolation of subcellular organelles and structures. Meth Enzymol 463:305–28. Migliore L, et al. (1999). Preferential occurrence of chromosome 21 malsegregation in peripheral blood lymphocytes of Alzheimer disease patients. Cytogenet Cell Genet 87(1–2):41–46. Mocellin S, et al. (2003). Quantitative real-time PCR: a powerful ally in cancer research. Trends Mol Med 9(5):189–95. Morisawa T, et al. (2008). Organ-specific profiles of genetic changes in cancers caused by activation-induced cytidine deaminase expression. 
Int J Cancer 123(12):2735–40. Motta FC, et al. (2006). Comparison between denaturing gradient gel electrophoresis and phylogenetic analysis for characterization of A/H3N2 influenza samples detected during the 1999–2004 epidemics in Brazil. J Virol Meth 135(1):76–82. Muller MC, et al. (2004). Standardization of preanalytical factors for minimal residual disease analysis in chronic myelogenous leukemia. Acta Haematol 112(1–2):30–33. Mullighan CG, et al. (2009). Deletion of IKZF1 and prognosis in acute lymphoblastic leukemia. N Engl J Med 360(5):470–80. Murphy J, Bustin SA. (2009). Reliability of real-time reverse-transcription PCR in clinical diagnostics: gold standard or substandard? Expert Rev Mol Diagn 9(2):187–97. Muschen M, et al. (2000a). Somatic mutation of the CD95 gene in human B cells as a side-effect of the germinal center reaction. J Exp Med 192(12):1833–40. Muschen M, et al. (2000b). Somatic mutations of the CD95 gene in Hodgkin and Reed-Sternberg cells. Cancer Res 60(20):5640–43.
Muschen M, et al. (2002). The origin of CD95-gene mutations in B-cell lymphoma. Trends Immunol 23(2):75–80. Nannya Y, et al. (2005). A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 65(14):6071–79. Neu-Yilik G, Kulozik AE. (2008). NMD: multitasking between mRNA surveillance and modulation of gene expression. Adv Genet 62:185–243. Nigg EA, Raff JW. (2009). Centrioles, centrosomes, and cilia in health and disease. Cell 139(4):663–78. Nissim A, Chernajovsky Y. (2008). Historical development of monoclonal antibody therapeutics. Handb Exp Pharmacol (181):3–18. Nymark P, et al. (2006). Identification of specific gene copy number changes in asbestos-related lung cancer. Cancer Res 66:5737–43. O’Brien S, et al. (1995). Advances in the biology and treatment of B-cell chronic lymphocytic leukemia. Blood 85(2):307–18. Pala FS, et al. (2001). In vitro transmission of chromosomal aberrations through mitosis in human lymphocytes. Mutat Res 474(1–2):139–46. Palmer S, et al. (2003). New real-time reverse transcriptase-initiated PCR assay with single-copy sensitivity for human immunodeficiency virus type 1 RNA in plasma. J Clin Microbiol 41(10):4531–36. Parsa JY, et al. (2007). AID mutates a non-immunoglobulin transgene independent of chromosomal position. Mol Immunol 44(4):567–75. Partanen JI, et al. (2009). 3D view to tumor suppression: Lkb1, polarity and the arrest of oncogenic c-Myc. Cell Cycle 8(5):716–24. Pascual V, et al. (1994). Analysis of somatic mutation in five B cell subsets of human tonsil. J Exp Med 180(1):329–39. Pasqualucci L, et al. (1998). BCL-6 mutations in normal germinal center B cells: evidence of somatic hypermutation acting outside Ig loci. Proc Natl Acad Sci U S A 95(20):11816–821. Passos JF, von Zglinicki T. (2007). Methods for cell sorting of young and senescent cells. Meth Mol Biol 371:33–44. Perlman S, et al. (2003). Ataxia-telangiectasia: diagnosis and treatment. 
Semin Pediatr Neurol 10(3):173–82. Pfeifer GP, Besaratinia A. (2009). Mutational spectra of human cancer. Hum Genet 125(5–6):493–506. Porter D, Polyak K. (2003). Cancer target discovery using SAGE. Expert Opin Ther Targets 7(6):759–69. Potter AJ, Wener MH. (2005). Flow cytometric analysis of fluorescence in situ hybridization with dye dilution and DNA staining (flow-FISH-DDD) to determine telomere length dynamics in proliferating cells. Cytometry A 68(1):53–58. Pritchard JK, Przeworski M. (2001). Linkage disequilibrium in humans: models and data. Am J Hum Genet 69(1):1–15. Puebla-Osorio N, Zhu C. (2008). DNA damage and repair during lymphoid development: antigen receptor diversity, genomic integrity and lymphomagenesis. Immunol Res 41(2):103–22.
Raeymaekers, L. (2000). Basic principles of quantitative PCR. Mol Biotechnol 15(2): 115–22. Rajewsky, K. (1996). Clonal selection and learning in the antibody system. Nature 381(6585):751–58. Rangel R, et al. (2005). Assembly of the kappa preB receptor requires a V kappa-like protein encoded by a germline transcript. J Biol Chem 280(18):17807–815. Rathmell JC, et al. (1996). Expansion or elimination of B cells in vivo: dual roles for CD40- and Fas (CD95)-ligands modulated by the B cell antigen receptor. Cell 87(2):319–29. Reddy PS, et al. (2008). A high-throughput genome-walking method and its use for cloning unknown flanking sequences. Anal Biochem 381(2):248–53. Reil H, et al. (2008). Clinical validation of a new triplex real-time polymerase chain reaction assay for the detection and discrimination of Herpes simplex virus types 1 and 2. J Mol Diagn 10(4):361–67. Rodriguez-Manotas M, et al. (2006). Real time PCR assay with fluorescent hybridization probes for genotyping intronic polymorphism in presenilin-1 gene. Clin Chim Acta 364(1–2):343–44. Rooney PH. (2005). Multiplex quantitative real-time PCR of laser microdissected tissue. Meth Mol Biol 293:27–37. Ross AJ, et al. (2007). Transcriptional profiling of mucociliary differentiation in human airway epithelial cells. Am J Respir Cell Mol Biol 37:169–85. Saint-Ruf C, et al. (1994). Analysis and expression of a cloned pre-T cell receptor gene. Science 266(5188):1208–12. Saleh A, et al. (2002). Identification of a novel Ly49 promoter that is active in bone marrow and fetal thymus. J Immunol 168(10):5163–69. Sanger F, et al. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74(12):5463–67. Savas S, Liu G. (2009). Genetic variation as cancer prognostic markers: review and update. Hum Mutat 30:1369–77. Sawyers CL, et al. (1991). Leukemia and the disruption of normal hematopoiesis. Cell 64(2):337–50. Scacheri PC, et al. (2001). 
Bidirectional transcriptional activity of PGK-neomycin and unexpected embryonic lethality in heterozygote chimeric knockout mice. Genesis 30(4):259–63. Scheinfeldt LB, et al. (2009). Population genomic analysis of ALMS1 in humans reveals a surprisingly complex evolutionary history. Mol Biol Evol 26(6):1357–67. Schjeide BM, et al. (2009). GAB2 as an Alzheimer disease susceptibility gene: follow-up of genomewide association results. Arch Neurol 66(2):250–54. Scott JD, Pawson T. (2009). Cell signaling in space and time: where proteins come together and when they’re apart. Science 326(5957):1220–24. Severino G, Del Zompo M. (2004). Adverse drug reactions: role of pharmacogenomics. Pharmacol Res 49(4):363–73. Shaffer LG, Bejjani BA. (2004). A cytogeneticist’s perspective on genomic microarrays. Hum Reprod Update 10(3):221–26. Siddiqa A, et al. (2001). Regulation of CD40 and CD40 ligand by the AT-hook transcription factor AKNA. Nature 410(6826):383–87.
Siepmann K, et al. (2001). Rewiring of CD40 is necessary for delivery of rescue signals to B cells in germinal centres and subsequent entry into the memory pool. Immunology 102(3):263–72. Sims-Mourtada JC, et al. (2003). In vivo expression of interleukin-8, and regulated on activation, normal, T-cell expressed, and secreted, by human germinal centre B lymphocytes. Immunology 110(3):296–303. Sims-Mourtada JC, et al. (2005). The human AKNA gene expresses multiple transcripts and protein isoforms as a result of alternative promoter usage, splicing, and polyadenylation. DNA Cell Biol 24(5):325–38. Sinden RR, et al. (1999). DNA-directed mutations. Leading and lagging strand specificity. Ann N Y Acad Sci 870:173–89. Sleckman BP, et al. (1996). Accessibility control of antigen-receptor variable-region gene assembly: role of cis-acting elements. Annu Rev Immunol 14:459–81. Storb U, et al. (2001). Somatic hypermutation of immunoglobulin and nonimmunoglobulin genes. Philos Trans R Soc Lond B Biol Sci 356(1405):13–19. Sun Y, et al. (2003). Specific interaction of PML bodies with the TP53 locus in Jurkat interphase nuclei. Genomics 82(2):250–52. Szczepanski, T. (2007). Why and how to quantify minimal residual disease in acute lymphoblastic leukemia? Leukemia 21(4):622–26. Takahashi T, et al. (1994). Generalized lymphoproliferative disease in mice, caused by a point mutation in the Fas ligand. Cell 76(6):969–76. Takezaki N, Nei M. (2009). Genomic drift and evolution of microsatellite DNAs in human populations. Mol Biol Evol 26(8):1835–40. Tay SK, et al. (2009). Global discovery of primate-specific genes in the human genome. Proc Natl Acad Sci U S A 106(29):12019–024. Teste MA, et al. (2009). Validation of reference genes for quantitative expression analysis by real-time RT-PCR in Saccharomyces cerevisiae. BMC Mol Biol 10:99. Thomas KR, Capecchi MR. (1987). Site-directed mutagenesis by gene targeting in mouse embryo-derived stem cells. Cell 51(3):503–12. Thorbecke GJ, et al. 
(1994). Biology of germinal centers in lymphoid tissue. FASEB J 8(11):832–40. Thye T, et al. (2003). Genomewide linkage analysis identifies polymorphism in the human interferon-gamma receptor affecting Helicobacter pylori infection. Am J Hum Genet 72:448–53. Tonegawa, S. (1983). Somatic generation of antibody diversity. Nature 302(5909): 575–81. Trelle MB, Jensen ON. (2007). Functional proteomics in histone research and epigenetics. Expert Rev Proteomics 4(4):491–503. Trinklein ND, et al. (2003). Identification and functional analysis of human transcriptional promoters. Genome Res 13(2):308–12. Unniraman S, Schatz DG. (2006). AID and Igh switch region-Myc chromosomal translocations. DNA Repair (Amst) 5(9–10):1259–64. van Rijk A, et al. (2008). Translocation detection in lymphoma diagnosis by split-signal FISH: a standardised approach. J Hematop 1(2):119–26.
van Rijk A, et al. (2009). Double staining chromatic in situ hybridization as a useful alternative of split-signal in situ hybridization. Hematologica 95(2):247–52. Epub 2009 Sep 22. Van Vlierberghe P, et al. (2008). Molecular-genetic insights in paediatric T-cell acute lymphoblastic leukaemia. Br J Haematol 143(2):153–68. Volpi EV, Bridger JM. (2008). FISH glossary: an overview of the fluorescence in situ hybridization technique. Biotechniques 45(4):385–90. Voss H, et al. (1995). Efficient low redundancy large-scale DNA sequencing at EMBL. J Biotechnol 41(2–3):121–29. Voss TC, et al. (2006). Single-cell analysis of glucocorticoid receptor action reveals that stochastic post-chromatin association mechanisms regulate ligand-specific transcription. Mol Endocrinol 20(11):2641–55. Wagatsuma A, et al. (2005). Determination of the exact copy numbers of particular mRNAs in a single cell by quantitative real-time RT-PCR. J Exp Biol 208(Pt 12): 2389–98. Wang F, et al. (2009). Normalizing genes for real-time PCR in epithelial and nonepithelial cells of mouse small intestine. Anal Biochem 399(2):211–17. Wang T, Brown MJ. (1999). mRNA quantification by real time TaqMan polymerase chain reaction: validation and comparison with RNase protection. Anal Biochem 269(1):198–201. Watanabe-Fukunaga R, et al. (1992). Lymphoproliferation disorder in mice explained by defects in Fas antigen that mediates apoptosis. Nature 356(6367):314–17. Watts D, MacBeath JR. (2001). Automated fluorescent DNA sequencing on the ABI PRISM 310 Genetic Analyzer. Meth Mol Biol 167:153–70. Weiss MJ. (1997). Embryonic stem cells and hematopoietic stem cell biology. Hematol Oncol Clin North Am 11(6):1185–98. Wen F, et al. (2004). The impact of very short alternative splicing on protein structures and functions in the human genome. Trends Genet 20(5):232–36. Wetterholm A, et al. (2008). 
High-level expression, purification, and crystallization of recombinant rat leukotriene C(4) synthase from the yeast Pichia pastoris. Protein Expr Purif 60(1):1–6. Wheeler DA, et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature 452(7189):872–76. Wheeler DL, et al. (2001). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 29(1):11–16. Winter H, et al. (2004). Direct gene expression analysis. Curr Pharm Biotechnol 5(2): 191–97. Wong LJ, Bai RK. (2006). Real-time quantitative polymerase chain reaction analysis of mitochondrial DNA point mutation. Meth Mol Biol 335:187–200. Wong ML, Medrano JF. (2005). Real-time PCR for mRNA quantitation. Biotechniques 39(1):75–85. Wu T, Mohan C. (2009). Proteomic toolbox for autoimmunity research. Autoimmun Rev 8(7):595–98. Xiong Q, et al. (2008a). A close examination of genes within quantitative trait loci of bone mineral density in whole mouse genome. Crit Rev Eukaryot Gene Expr 18(4):323–43.
Xiong Q, et al. (2008b). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13. Xylourgidis N, Fornerod M. (2009). Acting out of character: regulatory roles of nuclear pore complex proteins. Dev Cell 17(5):617–25. Yan H, et al. (2009). IDH1 and IDH2 mutations in gliomas. N Engl J Med 360(8): 765–73. Yang F, et al. (2004). Cytogenetic and immuno-FISH analysis of the 4q subtelomeric region, which is associated with facioscapulohumeral muscular dystrophy. Chromosoma 112(7):350–59. Yang SH, et al. (2008). Progerin elicits disease phenotypes of progeria in mice whether or not it is farnesylated. J Clin Invest 118(10):3291–300. Yang XO, et al. (2003). Regulation of T-cell receptor D beta 1 promoter by KLF5 through reiterated GC-rich motifs. Blood 101(11):4492–99. Yu W, et al. (2008). Epigenetic silencing of tumour suppressor gene p15 by its antisense RNA. Nature 451(7175):202–06. Yuan JS, et al. (2006). Statistical analysis of real-time PCR data. BMC Bioinformatics 7:85. Yuille MR, et al. (2001). TCL1 is activated by chromosomal rearrangement or by hypomethylation. Genes Chromosomes Cancer 30(4):336–41. Zarudnaya MI, et al. (2003). Downstream elements of mammalian pre-mRNA polyadenylation signals: primary, secondary and higher-order structures. Nucleic Acids Res 31(5):1375–86. Zhang F, et al. (2009). Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10:451–81. Zhang J, et al. (2005). SNPdetector: a software tool for sensitive and accurate SNP detection. PLoS Comput Biol 1(5):e53. Zhou W, et al. (2004). The role of p22 NF-E4 in human globin gene switching. J Biol Chem 279(25):26227–32. Zinner R, et al. (2007). Biochemistry meets nuclear architecture: multicolor immunoFISH for co-localization analysis of chromosome segments and differentially expressed gene loci with various histone methylations. Adv Enzyme Regul 47: 223–41.
CHAPTER 16
Confirmation of a Mutation by MicroRNA
HONGWEI ZHENG and YONGJUN WANG
Contents
16.1 Basic Concept of MicroRNA and Relevance to Gene Function
  16.1.1 Introduction
  16.1.2 Biogenesis of MiRNAs
  16.1.3 Biological Functions of MiRNAs
  16.1.4 The Mechanism of MiRNA-Target Recognition
  16.1.5 The Mechanism of MiRNA Regulation
  16.1.6 Involvement of MiRNA in Human Diseases
  16.1.7 Variation of MiRNA Binding Sites (Single-Base Mutation/Polymorphism) within 3′-UTR and MiRNA Functional Deregulations
16.2 Designing an Experiment Using MiRNA to Confirm Gene Mutation Function
  16.2.1 Background
  16.2.2 Experiment Design
16.3 Procedure of Confirmation of a Gene Mutation by MiRNA
  16.3.1 Luciferase Reporter Assays
  16.3.2 MiRNA Target Gene Expression Analysis
16.4 Limitations and Troubleshooting
16.5 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
16.1 BASIC CONCEPT OF MICRORNA AND RELEVANCE TO GENE FUNCTION

16.1.1 Introduction

MicroRNAs (miRNAs) are evolutionarily conserved, endogenous, ∼22-nucleotide (nt), noncoding small RNAs that regulate gene expression in a sequence-specific manner via mRNA degradation, transcriptional regulation, or translational repression. Vertebrate miRNA targets are thought to be plentiful. Computational analysis estimates that the human genome may contain up to ∼1000 miRNAs (Berezikov et al., 2005). More than 721 of them have been identified by molecular cloning and registered in the miRNA database, miRBase, and it is predicted that they regulate 30% of protein-encoding transcripts (Lewis et al., 2005; Xie et al., 2005).

16.1.2 Biogenesis of MiRNAs

The basic scheme of the microRNA pathway is shown in Figure 16.1. MiRNAs are generated in multiple steps, and the biogenesis and function of miRNA
[Figure 16.1 schematic. Labels in the image: cell nucleus; Pol II; pri-miRNA with m7G cap and poly(A) tail (AAAA); Drosha and Pasha; Ran-GTP and exportin-5; Dicer; TRBP; PACT; helicase (?); Ago; RISC.]
Figure 16.1. Biogenesis of miRNAs. The primary miRNA (pri-miRNA) is transcribed by RNA polymerase II (Pol II) in the nucleus and then modified by the addition of a 5′-m7G cap and a 3′-poly(A) tail. Following this, the pri-miRNA is processed by the RNase drosha and its co-factor, pasha, to form the hairpin-structured precursor miRNA (pre-miRNA), which is exported from the nucleus by exportin 5. In the cytoplasm, the pre-miRNA is further processed by the RNase dicer, is unwound by a putative helicase, and ends up as a mature single-stranded miRNA, which is loaded into the RNA-induced silencing complex (RISC) in tight association with the argonaute protein (Ago). The miRNA is now ready to interact with its target mRNAs (Cowland et al., 2007).
require a common set of proteins. First, miRNAs are transcribed by RNA polymerase II as long RNA precursors (pri-miRNAs) (Lee et al., 2002; Cai et al., 2004; Lee et al., 2004), which are usually several kilobases long and, like protein-coding mRNAs, contain a 7-methylguanosine cap structure and a poly(A) tail. The RNase III enzyme drosha processes the pri-miRNAs into 60- to 70-nt precursor miRNAs (pre-miRNAs) with a hairpin-shaped stem-loop secondary structure, a 5′ phosphate, and a 2-nt 3′ overhang (Lee et al., 2003). Drosha associates with the double-stranded RNA-binding protein DGCR8 in humans (Gregory et al., 2004) or pasha in flies (Denli et al., 2004) to form the microprocessor complex, which is required for directing the specific cleavage of pri-miRNA by drosha. Pre-miRNAs are exported to the cytoplasm by exportin-5 (Yi et al., 2003; Lund et al., 2004), further processed by another RNase III enzyme, dicer, and released as 22-nt double-stranded miRNA (Hutvagner et al., 2001). After unwinding by a helicase, only one mature miRNA strand (the guide strand) is incorporated into an RNA-induced silencing complex (RISC) that mediates cleavage or translational inhibition of target mRNAs, while the other strand (the passenger strand) is quickly degraded (Matranga et al., 2005; Rand et al., 2005). RISC is composed of dicer, argonaute 2 (Ago2), and the double-stranded RNA-binding protein TRBP. It cleaves target mRNAs more efficiently when supplied with pre-miRNAs than with duplex RNAs that lack the stem-loop structure, which suggests that processing by dicer may be coupled with assembly of the mature miRNA into RISC (Gregory et al., 2005). Ago2, the key component of RISC, may function as an endonuclease that cleaves target mRNAs (Hammond et al., 2001). RISC is guided by the incorporated miRNA to the complementary sequence in the 3′ untranslated region (UTR) of target mRNAs.
When an miRNA binds to the 3′ UTR of the target mRNA with perfect or near-perfect complementarity, the target mRNA is degraded by Ago2. In contrast, partial base pairing between an miRNA and a target mRNA leads to translational silencing of the target mRNA without degradation. Systematic mutation experiments indicate that, in partial base pairing, the binding of certain nucleotides in the 5′ region of the miRNA is functionally important (Doench, 2004; Kiriakidou et al., 2004).

16.1.3 Biological Functions of MiRNAs

Less than 20 years ago, the lin-4 gene, which controls the timing of C. elegans larval development, was unexpectedly found to produce a 21-nt-long noncoding RNA that suppressed lin-14 protein expression without noticeably affecting lin-14 mRNA levels (Lee et al., 1993; Wightman et al., 1993). This first miRNA was initially treated as a genetic oddity and virtually ignored for nearly a decade, but we now recognize that hundreds of these small RNAs exist in the genomes of divergent species and posttranscriptionally regulate gene expression by base pairing to complementary sites in the 3′-UTR of the target gene, thereby negatively affecting translation (Callis et al., 2008). Increasing
evidence indicates that miRNAs may in fact be key regulators of processes such as development (Reinhart et al., 2000; Giraldez et al., 2005), cell proliferation and death (Brennecke et al., 2003), apoptosis and fat metabolism (Xu et al., 2003), hematopoiesis (Chen et al., 2004), and stem cell division (Hatfield et al., 2005).

16.1.3.1 MiRNA Emerges as a Central Regulator for Development

Basic research has found that miRNAs are involved in regulating developmental processes. For instance, zebrafish embryos lacking miR-430 develop defects that can be rescued by supplying miR-430 (Giraldez et al., 2005). Another study of C. elegans miRNAs showed that, without lin-4, C. elegans is unable to make the transition from the first to the second larval stage because of a differentiation defect caused by a failure to posttranscriptionally repress lin-14, the target gene of lin-4 (Lee et al., 1993; Wightman et al., 1993). Similarly, loss of let-7 can cause a failure of the larval-to-adult transition (Reinhart et al., 2000). It is known that lin-41, hbl-1, daf-12, and the fork head transcription factor pha-4 are direct targets of let-7 during this transition (Slack et al., 2000; Abrahante et al., 2003; Grosshans et al., 2005). McGlinn et al. (2009) identified a layer of regulatory control provided by the miR-196 family in defining the boundary of Hox gene expression along the anterior-posterior (A-P) embryonic axis in chick development. Following knockdown of miR-196, they observed a homeotic transformation of the last cervical vertebra toward a thoracic identity.

16.1.3.2 MiRNAs are Involved in Cell Proliferation and Apoptosis

A number of miRNAs have been shown to balance cell proliferation and survival. For example, members of the miR-17-92 cluster are frequently upregulated in lymphomas, representing potential oncomiRs.
It was shown in Eμ-Myc transgenic mice that the miR-17-92 cluster, but not the individual miRNAs, could enhance tumorigenesis by inhibiting apoptosis in c-Myc-overexpressing tumors (He et al., 2005). Additional studies in human cell lines showed that transcription of the miR-17-92 cluster was directly regulated by c-Myc and that the individual miRNAs miR-17-5p and miR-20 regulate the translation of E2F1, a transcription factor with both pro-apoptotic and pro-proliferative activity. Thus co-expression of c-Myc and miR-17 is believed to fine-tune E2F1 activity so that proliferation is enhanced and apoptosis is inhibited (O’Donnell et al., 2005). In addition, miR-21, which is highly expressed in glioblastoma, has anti-apoptotic activity (Ciafre et al., 2005). Knockdown of miR-21 in breast tumor and glioblastoma cell lines led to inhibition of BCL-2 activity, caspase reactivation, and increased apoptosis (Chan et al., 2005; Si et al., 2007).

16.1.3.3 MiRNAs Might Contribute to Maintaining Tissue Identity

Basic research has found that the expression levels of miRNA targets are significantly lower in all mature mouse tissues and in later life stages of Drosophila than in embryos, which indicates that miRNAs might play roles in determining
[Figure 16.2 schematic, panels (a) to (c). Labels in the image: m7G cap; poly(A) tail (AAAA); 5′ and 3′ ends; seed; 18s and 28s; eIF4E; protease; P-body; Ccr4 and Not1; decay; storage; mRNA exit and translation.]
Figure 16.2. Possible mechanisms of miRNA-target recognition. a, Perfect (or near-perfect) complementarity between miRNA and mRNA leads to cleavage of the target mRNA through the siRNA pathway. b, Translation is repressed by an miRNA with incomplete complementarity to the mRNA, through inhibition of ribosomal elongation or recruitment of a protease that degrades the nascent polypeptide chain. Because ribosomes are still associated with the mRNA, the complex cannot enter the P-body. c, Inhibition of translation initiation by interaction between RISC and the translation initiation complex protein eIF4E leads to a ribosome-free miRNA:mRNA structure that is directed to the P-body, where it interacts with the Ccr4:Not1 deadenylase complex. This initiates degradation of the mRNA. Alternatively, the miRNA:mRNA complex may be stored in the P-body and, after an appropriate stimulus, reenter the cytoplasm for renewed translation (Cowland et al., 2007).
the timing of tissue differentiation and maintaining tissue identity during adulthood (Giraldez et al., 2005).

16.1.4 The Mechanism of MiRNA-Target Recognition
The precise mechanism by which individual miRNAs recognize their target sites on mRNAs has not yet been completely unraveled, but some general patterns have been determined (Fig. 16.2). The miRNA binding motif is situated in the 3′-UTR of the transcript, that is, between the protein-coding region of the mRNA and its poly(A) tail (Stark et al., 2005). By sequence comparison of miRNAs and their cognate mRNA target sequences, it has been found that nucleotides 2 to 8 of the miRNA 5′ region
constitute a seed region, which mediates the interaction of miRNA and its target (Lewis et al., 2003; Brennecke et al., 2005).

1. In most cases, the seed region binds to a perfectly complementary recognition sequence on the mRNA (Lewis et al., 2003; Brennecke et al., 2005). The central part of the miRNA (typically nucleotides 10 and 11) usually lacks complementarity to the mRNA, whereas the 3′ region of the miRNA binds more or less specifically to the mRNA and contributes partly to the specificity and affinity of the miRNA:mRNA complex (Brennecke et al., 2005).
2. In a few instances, the seed region does not show complete complementarity to the target sequence, and, in these cases, strong binding of the miRNA 3′ region to the mRNA is required to stabilize the RNA duplex (Enright et al., 2003).

MiRNAs that rely mainly on their seed sequence for binding may exert a function on the mRNA by themselves, whereas those that bind less strongly because of a weaker seed sequence often have to act in concert with other miRNAs binding to the same mRNA to cause an effect (Brennecke et al., 2005). Multiple searchable databases computationally predict miRNA targets in several species using various algorithms, but the predictions require experimental validation. At present, these databases are vital guides for such validation experiments (Giraldez et al., 2005).

16.1.5 The Mechanism of MiRNA Regulation
The repression of mRNA is achieved in two different ways, depending on the degree of complementarity between the miRNA and the target.

16.1.5.1 The Perfectly Complementary Pathway

If perfect base complementarity exists between the miRNA and mRNA, the mRNA is processed through the siRNA pathway and cleaved in an miRNA-directed manner by argonaute proteins (Ago2 in humans), the catalytically active component of RISC (Yekta et al., 2004; Bagga et al., 2005) (Fig. 16.2). RISC is the ribonucleoprotein effector complex for miRNA-mediated regulation of gene expression and consists of argonaute protein family members and accessory factors such as R2D2, along with an miRNA and the targeted mRNA (Hammond et al., 2000; Filipowicz, 2005). This mode of gene silencing is common in plants but occurs only occasionally in animals (for instance, the miR-196-directed degradation of the HOXB8 transcript during mouse embryogenesis) (Yekta et al., 2004).

16.1.5.2 The Imperfectly Complementary Pathway

In animals, the vast majority of miRNAs are imperfectly complementary to the 3′-UTR of targeted mRNAs, which results in suppression of translation and
subsequent partial mRNA decay (Bartel, 2004; Bagga et al., 2005). Three modes of action have been unraveled (Fig. 16.2):

1. Repression of the initiation step of translation. For instance, the distribution of polysomes of let-7-repressed mRNAs was shifted toward the lighter fractions of a sucrose gradient in a manner similar to that observed with known inhibitors of translational initiation. Analogously, the cationic amino acid transporter 1 (CAT-1) mRNA, which is repressed by miR-122 in hepatocarcinoma cells under regular growth conditions, was found in the light polysomal fraction (Bhattacharyya et al., 2006).
2. Repression of the elongation phase of translation. In Caenorhabditis elegans, repression of the lin-14 mRNA by the miRNA lin-4 does not involve a change in polysome distribution, indicating that repression occurs after initiation of translation (Olsen and Ambros, 1999).
3. General destabilization of the transcript as a result of poly(A)-tail shortening. This mechanism relies on the recruitment of deadenylating and decapping enzymes by miRNAs, followed by degradation of the cognate transcript (Behm-Ansmant et al., 2006; Wu et al., 2006).

Mounting data indicate that mRNAs silenced by miRNA accumulate in cytoplasmic compartments known as processing bodies (P-bodies) (Liu et al., 2005; Pillai et al., 2005; Sen and Blau, 2005). The mRNAs found in these locations are devoid of ribosomes and other translation factors (Teixeira et al., 2005). The P-bodies are rich in enzymes involved in mRNA deadenylation, decapping, and degradation and are believed to cause decay of the miRNA-inhibited mRNAs (Sheth and Parker, 2003; Behm-Ansmant et al., 2006). In some instances, however, mRNAs instead appear to be stored in an inactive form in the P-body with the potential to reenter the cytoplasm and reengage in translation (Brengues et al., 2005).
One example of this phenomenon is the miR-122-directed repression of CAT-1 in hepatocarcinoma cells during normal growth, which is relieved by starvation and results in retranslation of the CAT-1 mRNA (Bhattacharyya et al., 2006).

16.1.6 Involvement of MiRNA in Human Diseases
Many studies have revealed a large number of miRNA-disease associations and have shown the mechanisms by which miRNAs are involved in diseases. Mutation of miRNAs, dysfunction of miRNA biogenesis, and deregulation of miRNAs and their targets may all result in disease. Currently, 70 diseases associated with miRNAs have been reported (see http://cmbi.bjmu.edu.cn/hmdd) (Lu et al., 2008). Giraldez et al. (2005) reported a linkage of miRNAs to cardiac hypertrophy and offered new insight into the regulation of this disease process. They found that the expression profiles of a number of miRNAs changed during cardiac hypertrophy. Furthermore, misexpression of miRNAs and loss-of-function experiments in mice demonstrated that specific
miRNAs can augment or attenuate the hypertrophic growth response and suggested the potential of these molecules as novel therapeutic targets. Ample evidence also shows that components of the miRNA machinery, miRNAs themselves, and their binding motifs are involved in many cellular processes that are altered in cancer, such as differentiation, proliferation, and apoptosis. Some miRNAs exhibit differential expression levels in cancer and have demonstrated the capability to affect cellular transformation, carcinogenesis, and metastasis by acting either as oncogenes (oncomiRs) or tumor suppressors (TSmiRs) (Medina and Slack, 2008). In general, the majority of miRNAs are downregulated in cancer specimens (Lu et al., 2005). Because miRNAs have several potential targets that may be the mRNAs of both oncogenes and tumor suppressors, the actual function of a particular miRNA as either TSmiR or oncomiR may depend on the cellular context (Lee et al., 1993). Although the miRNA era started only a few years ago, it has brought great promise for cancer diagnosis, prognosis, and therapy. The rapid development of powerful techniques such as miRNA microarrays, bead-based miRNA profiling, specific quantitative PCR of miRNAs, and antisense technologies is expected to have a significant impact on clinical oncology in the next decade (Medina and Slack, 2008). One mRNA generally dictates the translation of a single protein, whereas one miRNA molecule can regulate the translation of an array of genes governing a certain function (John et al., 2004; Sayed et al., 2007). An miRNA can therefore regulate a whole cellular function, which makes it a more powerful predictor of functional outcome. Moreover, mature miRNA levels are more tightly regulated and less variable. Thus it is expected that miRNA will provide a superior predictive parameter in diseases. When Lu et al.
(2005) put this idea to the test, they found that miRNA profiling was highly accurate in predicting the differentiation state of tumors and in classifying poorly differentiated tumors, which mRNA profiling could not achieve. Another study achieved almost perfect accuracy in classifying the tissue origin of 400 tumor samples from 22 different tumor tissues and metastases (Rosenfeld et al., 2008) and demonstrated the effectiveness of miRNAs as biomarkers for tracing the tissue of origin of cancers of unknown primary origin, a major clinical problem. MiRNA expression profiles also provide important information regarding the prognosis of cancer patients. For example, miRNA expression profiles obtained by miRNA microarrays correlated with survival in lung adenocarcinomas, including those at early pathological stages. High levels of miR-155 and low let-7a-2 expression correlated with poor survival (Yanaihara et al., 2006). Another recent miRNA profiling effort in lung cancer identified five miRNAs as important for prognosis: high levels of miR-221 and let-7a appeared to be protective, while high levels of miR-137, miR-372, and miR-182 correlated with worse clinical outcome. The levels of these miRNAs could also help in predicting relapse of the cancer (Yu et al., 2008). A recent study focused on colorectal cancer showed that high miR-21 expression was associated with poor survival and poor therapeutic outcome (Schetter et al.,
2008). Clearly, however, more studies are needed to further validate the predictive power of miRNAs in cancer.

16.1.7 Variation of MiRNA Binding Sites (Single-Base Mutation/Polymorphism) within 3′-UTR and MiRNA Functional Deregulations

The 3′-UTRs of human protein-coding genes play a pivotal role in regulating mRNA 3′ end formation, stability/degradation, nuclear export, subcellular localization, and translation and hence are particularly rich in cis-acting regulatory elements. One recent addition to the already large repertoire of known cis-acting regulatory elements is the miRNA target binding sites that are present in the 3′-UTRs of many human genes (Chuzhanova et al., 2007). Recently, SNPs residing in miRNA binding sites were shown to affect the expression of miRNA targets and to contribute to susceptibility to complex disorders such as cancer, asthma, cardiovascular disease, and Tourette’s syndrome (Abelson et al., 2005; Martin et al., 2007; Saunders et al., 2007; Tan et al., 2007; Yu et al., 2007). As each miRNA is expected to regulate the translation of up to 100 mRNAs (Brennecke et al., 2005; Lim et al., 2005; Xie et al., 2005), any disturbance of miRNA expression level, processing of the miRNA precursors, or mutation in the sequence of the miRNA, its precursor, or its target mRNA may have detrimental effects on cell physiology. One mechanism leading to such changes is mutation in the miRNA:mRNA interacting sequences. Inappropriate base pairing resulting from variations in the 3′-UTR sequence of the target mRNAs or in the mature miRNA sequence is likely to weaken the interaction between the miRNA and mRNA (He et al., 2005; Iwai and Naraba, 2005) and might contribute to alterations in the translation efficiency of the target mRNA. This interaction is especially sensitive to mutations in the seed region (Brennecke et al., 2005).
Indeed, naturally occurring polymorphisms in miRNA binding sites have been documented in Tourette’s syndrome in humans and muscularity in sheep (Abelson et al., 2005; Clop et al., 2006). Loss of the KIT protein in thyroid cancers has been associated with high expression of miR-221, miR-222, and miR-146b, and polymorphic changes in the 3′-UTR of the KIT mRNA were demonstrated in half of these cases. Owing to the high incidence of familial thyroid cancer, researchers speculated that these polymorphisms might predispose carriers to this disease (He et al., 2005).
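The effect of a single-base change on a seed match can be illustrated with a short Python sketch. It assumes the simplest recognition rule from Section 16.1.4, a perfect match to nucleotides 2 to 8 of the miRNA, and ignores wobble pairs, 3′ compensatory pairing, and site context. The miRNA sequence, the 3′-UTR fragments, and all function names are invented for illustration and do not correspond to any gene discussed in this chapter.

```python
# Illustrative sketch: sequences and function names are hypothetical.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq):
    """Upper-case a sequence and write it in the RNA alphabet."""
    return seq.upper().replace("T", "U")


def seed_site(mirna):
    """Return the mRNA site (5'->3') that pairs perfectly with miRNA nt 2-8."""
    seed = to_rna(mirna)[1:8]  # nucleotides 2-8 in 1-based numbering
    # Pairing is antiparallel, so the site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def seed_match_positions(mirna, utr):
    """Return 0-based start positions of perfect seed matches in a 3'-UTR."""
    site = seed_site(mirna)
    utr = to_rna(utr)
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


def variant_effect(mirna, utr_ref, utr_var):
    """Compare seed matches between reference and variant 3'-UTRs.

    Assumes a substitution, so both sequences share one coordinate system.
    """
    if len(utr_ref) != len(utr_var):
        raise ValueError("sketch handles substitutions only (equal lengths)")
    ref = set(seed_match_positions(mirna, utr_ref))
    var = set(seed_match_positions(mirna, utr_var))
    return {"lost": sorted(ref - var), "gained": sorted(var - ref)}


TOY_MIRNA = "UACGGAUCCAGUUGACCAUGGA"  # invented 22-nt miRNA
UTR_REF = "AAGCUGAUCCGUUCAAGG"        # invented UTR fragment with one seed match
UTR_VAR = "AAGCUGAUUCGUUCAAGG"        # same fragment, C-to-U change inside the site

print(variant_effect(TOY_MIRNA, UTR_REF, UTR_VAR))
```

With these toy sequences the variant removes the only seed match. Swapping the reference and variant arguments reports the same position as gained, which corresponds to a mutation that creates a new binding site. A lost or gained seed match in such a comparison is only a hypothesis about altered regulation and still needs the functional confirmation described in this chapter.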
16.2 DESIGNING AN EXPERIMENT USING MIRNA TO CONFIRM GENE MUTATION FUNCTION

16.2.1 Background
Because miRNAs have emerged as a new class of regulatory genes, miRNA target sites within the 3′-UTRs of human protein-coding genes constitute a new class of cis-acting regulatory elements (Lee et al., 1993). In humans, about
one third of all protein-coding genes contain conserved target sequences for the 163 miRNA families that are conserved among different species (Gardner and Vinther, 2008). Upon binding to their cognate targets, miRNAs posttranscriptionally downregulate gene expression by inducing either mRNA degradation or translational repression. A base change in the mature miRNA or in the target sequence of the mRNA is likely to weaken the interaction between the miRNA and mRNA (He et al., 2005; Iwai and Naraba, 2005). This interaction is especially sensitive to mutations in the seed region (Brennecke et al., 2005). Here we highlight the feasibility of using miRNA to confirm miRNA target site variations by functional experiments in vitro. Several studies have identified genetic miRNA target site variants that are claimed to be associated with disorders ranging from Parkinson’s disease to cancer (Sethupathy and Collins, 2008). One such lesion has recently been reported: a G-to-A transition (absent in 4296 control chromosomes), which replaces a G:U wobble base pair with an A:U Watson-Crick pair in a binding site for the human miRNA hsa-miR-189 within the 3′-UTR of the Slit and Trk-like 1 gene, was identified in two unrelated patients with Tourette’s syndrome and obsessive-compulsive symptoms (Abelson et al., 2005). In vitro functional analysis demonstrated that, in the presence of hsa-miR-189, the mutant allele gave rise to decreased expression of the reporter gene (that is, stronger repression) as compared with the wild-type allele. Other studies have reported that, in the THPO and PTGS2 genes, polymorphisms affect the binding between the miRNA and the target mRNA.
The reduced complementarity of the THPO rs6141(+24) G > A variant allele to hsa-miR-431 and of the PTGS2 9850A > G variant allele to hsa-miR-132, as compared with their respective wild-type alleles, led to overexpression of these genes and was therefore consistent with the functional consequences (Cox et al., 2004; Garner et al., 2006). Alternatively, a polymorphism or mutation may create new, illegitimate miRNA binding sites. For example, the muscular phenotype of the Texel sheep strain is the result of a mutation in the myostatin 3′-UTR that creates a binding site for miR-1 and miR-206, miRNAs that are highly expressed in skeletal muscle. Because myostatin is a key negative regulator of muscle mass, even slight decreases in myostatin activity yield muscle overgrowth (Flynt and Lai, 2008). Based on this evidence, we can use miRNA to confirm the variations in the 3′-UTR of candidate genes.

16.2.2 Experiment Design
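As a starting point for the design, the kind of pairing change reported for the Tourette’s syndrome variant in the Background, a G:U wobble pair replaced by an A:U Watson-Crick pair, can be made explicit in code. The following Python sketch classifies each position of an ungapped miRNA:target duplex. The function names and the 5-nt sequences are hypothetical; they are not taken from the hsa-miR-189 site, and the sketch does not model bulges, loops, or binding energy.

```python
# Illustrative sketch: names and sequences are hypothetical.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(mrna_base, mirna_base):
    """Classify one mRNA:miRNA base pair."""
    pair = (mrna_base.upper(), mirna_base.upper())
    if pair in WATSON_CRICK:
        return "Watson-Crick"
    if pair in WOBBLE:
        return "wobble"
    return "mismatch"


def duplex_pairs(site, mirna_segment):
    """Classify every position of an ungapped duplex.

    Both sequences are written 5'->3'. Pairing is antiparallel, so the
    first base of the mRNA site faces the last base of the miRNA segment.
    """
    if len(site) != len(mirna_segment):
        raise ValueError("sketch assumes equal lengths and no bulges")
    return [pair_type(m, r) for m, r in zip(site, reversed(mirna_segment))]


MIRNA_SEGMENT = "CUGAC"  # invented miRNA fragment
SITE_REF = "GUCGG"       # invented reference site: G faces U at the fourth position
SITE_VAR = "GUCAG"       # G-to-A variant: A now faces U

print(duplex_pairs(SITE_REF, MIRNA_SEGMENT))
print(duplex_pairs(SITE_VAR, MIRNA_SEGMENT))
```

In this toy duplex the reference site contains one wobble pair, and the G-to-A change converts it into a Watson-Crick pair while leaving the other positions untouched. A listing of this kind only describes the predicted pairing; whether the change alters repression has to be tested with the reporter experiments described below.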
16.2.2.1 Search the Existing Databases

As of December 2009, various techniques including small RNA cloning and, most recently, deep-sequencing-based approaches have characterized 721 human miRNAs, which are listed in the official miRNA database (miRBase). Most computational approaches have suggested that there are well over 1000 human miRNAs (Berezikov et al., 2005) and, according to some projections, even tens of thousands (Rigoutsos et al., 2006). Target predictions based primarily on conserved seed pairing and
local sequence or structural features suggest that individual animal miRNAs often have >100 targets and that at least 20–30% of animal transcripts bear one or more conserved miRNA binding sites in their 3′-UTR (Krek et al., 2005; Xie et al., 2005; Ruby et al., 2007). Additional targets may potentially be regulated through miRNA binding to atypical sites with imperfect seeds (Callis et al., 2008; Brennecke et al., 2003, 2005) or nonconserved sites (Farh et al., 2005; Giraldez et al., 2006; Sood et al., 2006). Therefore, the direct target network of animal miRNAs is inferred to be quite substantial. One can also ask whether the miRNA operates through one target or many targets, each of which might behave differently with respect to the quantitative and qualitative consequences of miRNA control (Flynt and Lai, 2008).

To bridge the information in the miRNA database with the biology of the cell, a number of computer programs have been developed for predicting mRNA targets for these miRNAs in animals. In summary, the common criteria used for target prediction by these programs are (1) the degree of base complementarity between the miRNA and mRNA, with special focus on identifying perfect or near-perfect complementarity between a target mRNA and the miRNA in the seed region (i.e., nts 2–8 of the miRNA), (2) the calculated thermodynamic stability of the predicted miRNA:mRNA complex, and (3) the degree of conservation of orthologous target sites in the 3′-UTR of different species. The different programs, however, do not use the same algorithm for calculating the targets, and therefore only a partial overlap is seen between the hit lists of each program (Lee et al., 1993).

Several existing tools and resources provide updated data regarding each of these areas of research. Sanger Institute’s miRBase serves as the central database for experimentally supported mature miRNA sequences (Griffiths-Jones et al., 2006).
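Because the hit lists of different prediction programs overlap only partially, one pragmatic way to shortlist candidates is to keep the targets that several programs agree on. The Python sketch below illustrates that idea with plain set operations. The program names are those mentioned in this section, but the gene identifiers, the hit lists, and the function name are invented; real output formats differ between tools and would need to be parsed first.

```python
# Illustrative sketch: hit lists, gene names, and function name are hypothetical.
def consensus_targets(predictions, min_programs=2):
    """Return {gene: [programs]} for genes predicted by at least min_programs.

    predictions maps a program name to an iterable of predicted target genes.
    """
    supporters = {}
    for program, genes in predictions.items():
        for gene in set(genes):  # ignore duplicate entries within one list
            supporters.setdefault(gene, set()).add(program)
    return {gene: sorted(programs)
            for gene, programs in supporters.items()
            if len(programs) >= min_programs}


# Invented hit lists for a single miRNA.
TOY_PREDICTIONS = {
    "miRanda": ["GENE_A", "GENE_B", "GENE_C"],
    "PicTar": ["GENE_B", "GENE_C", "GENE_D"],
    "TargetScanS": ["GENE_C", "GENE_E"],
}

for gene, programs in sorted(consensus_targets(TOY_PREDICTIONS).items()):
    print(gene, programs)
```

Raising `min_programs` trades sensitivity for specificity: in the toy data two genes are supported by at least two programs, and only one by all three. Agreement between programs is a prioritization heuristic under these assumptions, not evidence of a functional interaction.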
For each supported miRNA, miRBase provides the genomic coordinates of the predicted precursor sequence, the nucleotide sequences of both the precursor and mature miRNA, and predicted targets of the mature miRNA according to the prediction programs miRanda, PicTar, and TargetScanS. Two additional databases, ARGONAUTE (Shahi et al., 2006) and miRNAMap (Hsu et al., 2006), offer enhanced interfaces to the data contained in miRBase for human, mouse, rat, and dog. MiRNAMap also reports computationally predicted miRNAs and their predicted targets, according to the programs miRanda and RNAhybrid (Kruger and Rehmsmeier, 2006). Moreover, it provides cross-links to other biological databases to supply tissue expression and cross-species sequence conservation data for each supported and predicted miRNA. ARGONAUTE, published simultaneously with miRNAMap, provides much of the same information with perhaps a larger miRNA tissue expression dataset collected from various miRNA expression studies. In addition, TarBase offers a manually curated and comprehensive set of experimentally supported targets in eight different species (Sethupathy et al., 2006). It contains over 550 target genes and over 750 individual target sites. For each miRNA:target interaction that has gained experimental support, TarBase reports on the
sufficiency of the interaction to independently induce translational silencing, the type of translational silencing that is induced (repression vs. immediate cleavage), the location of the target site along the 3′-UTR, the nature of the base pairing between the miRNA and target sequence according to the minimum free-energy hybridization, and the types of experimental methods used for verification.

Recently, the suggestion that a polymorphism or mutation in miRNA binding sites (poly-miRTS) can lead to disease was strengthened by a study from Clop et al. (2006). They provided rigorous in vivo evidence that an miRNA target site mutation in myostatin (GDF8 or MSTN) contributes to muscular hypertrophy in sheep. In addition, other studies have claimed an association of one or more poly-miRTS with various human diseases ranging from colorectal cancer to Parkinson’s disease (Sethupathy and Collins, 2008). Roughly 20,000 poly-miRTS have now been cataloged in databases such as PolymiRTS (http://compbio.utmem.edu/miRSNP) and Patrocles (www.patrocles.org) (Georges et al., 2006; Bao et al., 2007) and used to study natural selection on miRNA target sites (Chen and Rajewsky, 2006; Saunders et al., 2007).

16.2.2.2 Confirmation of the Base Variation within the MiRNA Binding Sites by Functional Experiment

16.2.2.2.1 Luciferase Activity Assay

Further evaluation of the predicted target in a biological system is therefore needed. A widely used method is to make a plasmid construct that encodes a reporter, such as firefly luciferase, followed by the 3′-UTR of the predicted miRNA target, and to transfect it into cells expressing the cognate miRNA. If the target and miRNA interact, decreased luciferase activity should be measured (Taganov et al., 2006; Voorhoeve et al., 2006). Conversely, a similar reporter construct with a mutated target sequence shows no reduction in luciferase activity. This approach has been widely used in miRNA functional studies. Zhao et al.
(2009) found that miR-15a inhibits reporter luciferase activity in a dose-dependent manner by binding to its seed regions in the c-myb 3′-UTR. Compared with controls, relative luciferase activity was significantly decreased with as little as 1 nM miR-15a, and the maximal decrease was obtained at 50 nM. In contrast, increasing concentrations of control miRNA had little effect on luciferase activity, even at 100 nM. To demonstrate that miR-15a interacts with a specific target sequence in the human c-myb 3′-UTR, three additional mutant reporter constructs were generated in which two 7-bp seed sequences (i.e., ACGACGA) were deleted individually or simultaneously. The resulting constructs, pBub1/Myb3U/miR-15a1 and pBub1/Myb3U/miR-15a2, were cotransfected together with miR-15a into HEK293T cells, and luciferase activity in the respective cells was then measured. Compared with the decrease in luciferase activity observed when the authentic c-myb 3′-UTR was cotransfected with miR-15a, deletion of the miR-15a binding sites in the c-myb 3′-UTR resulted in a twofold to threefold increase in luciferase activity, indicating that miR-15a was no longer able to bind the 3′-UTR with the same avidity. All these data are consistent with the hypothesis that miR-15a hybridizes with the predicted sequence in the c-myb 3′-UTR and that alteration of this sequence results in enhanced luciferase activity.
Figure 16.3. a, Plasmid construction. A segment of the Hsp20 3′-UTR or a mutated segment was cloned downstream of the luciferase coding region (constructs pLuci, pLuci-p20-3′, and pLuci-3′M, each with a CMV promoter and an SV40 poly(A) signal). b, Relative luciferase activity (Luci/β-Gal) in H9c2 cells cotransfected with the indicated vectors and either miR-Ctl or miR-320. *p < .05 relative to respective controls (Ren et al., 2009).
Further support for this method comes from the work of Ren et al. (2009) (Fig. 16.3). To validate whether miR-320 directly recognizes the 3′-UTR of Hsp20, they cotransfected H9c2 cells with a construct containing the 3′-UTR of Hsp20 fused downstream of the luciferase coding sequence, along with miR-320 or a negative control miRNA. Overexpression of miR-320 strongly inhibited the luciferase activity from the reporter construct containing the 3′-UTR segment of Hsp20, whereas no effect was observed with a construct containing a mutated segment of the Hsp20 3′-UTR (seed sequence
CAGCUUU was mutated to GACACAA). This effect was specific, because no change in luciferase reporter activity was seen when a negative control miRNA was cotransfected with either reporter construct. Collectively, these data indicate that variations located in the 3′-UTR may influence the complementarity between an miRNA and its binding site, and that this can be tested by the luciferase activity assay.
16.2.2.2.2 Altered Expression of the Target Genes
Another way to confirm a mutation is expression analysis in a suitable cell model: cotransfect the experimental cells with a plasmid carrying a known miRNA target gene whose 3′-UTR contains the binding site mutation, together with the miRNA mimic and an appropriate control. Because the mutation can weaken the interaction between the miRNA and its binding site, it releases the mRNA from negative regulation, and protein expression rises. The expression level of the target gene can then be compared by Western blotting or immunostaining in the cell model. Calin et al. (2002), using expression analyses, determined that as many as 68% of all chronic lymphocytic leukemias (CLLs) showed downregulation of miR-15 and miR-16. Both miRNAs were shown to act as tumor suppressors by targeting translation of the mRNA of anti-apoptotic BCL-2 (Calin et al., 2002), an oncogene that is frequently overexpressed in CLL. Downregulation of miR-15 and miR-16 has been shown to correlate with overexpression of the BCL-2 protein, and transfection with either of the two miRNAs completely abolished protein expression and reestablished apoptosis in a leukemia model (Cimmino et al., 2005). Another early and well-documented finding was the downregulation of oncogenic Ras by the let-7 family of miRNAs in lung cancer (Johnson et al., 2005). Low let-7 expression was observed to correlate with shortened postoperative survival in lung cancer patients who had undergone potentially curative operative procedures (Takamizawa et al., 2004).
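The seed-site logic behind the Hsp20 reporter experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seed is taken as miRNA nucleotides 2–8 (a common convention, not stated in the text), only perfect seed matches are considered, the mature miR-320 sequence is assumed and should be checked against miRBase, and the bases flanking the 7-nt site are invented. Only the site CAGCUUU and its mutated form GACACAA come from the text.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna, start=2, end=8):
    """Return the mRNA site (5'->3') complementary to the miRNA seed.

    The seed is taken as miRNA nucleotides start..end (1-based, inclusive).
    """
    seed = mirna[start - 1:end]
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def has_seed_site(utr, mirna):
    """True if the 3'-UTR contains a perfect match to the miRNA seed."""
    return seed_site(mirna) in utr.upper().replace("T", "U")

# Assumed mature miR-320 sequence; verify against miRBase before use.
MIR_320 = "AAAAGCUGGGUUGAGAGGGCGA"

# Toy 3'-UTR fragments: the flanking bases are invented; only the 7-nt site
# (CAGCUUU) and its mutated form (GACACAA) come from the text.
utr_wild_type = "GGCAUCC" + "CAGCUUU" + "AUGGCAA"
utr_mutant = "GGCAUCC" + "GACACAA" + "AUGGCAA"

wild_type_has_site = has_seed_site(utr_wild_type, MIR_320)
mutant_has_site = has_seed_site(utr_mutant, MIR_320)
```

Under these assumptions the wild-type fragment carries a seed site and the mutated fragment does not, which is the sequence-level expectation that the reporter assay then tests in cells. Real prediction tools also weigh hybridization free energy, site context, and conservation, none of which this sketch models.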
16.3 PROCEDURE OF CONFIRMATION OF A GENE MUTATION BY MIRNA
16.3.1 Luciferase Reporter Assays
16.3.1.1 Target Prediction
Target prediction programs are very useful for defining potential miRNA targets. MiRNAs do not switch off their target genes completely but rather fine-tune their expression through binding sites within the 3′-UTR. Identifying the target mRNAs of an miRNA is therefore an important step in elucidating the interaction between the miRNA and its target. Several computational target prediction programs have been developed, but the overlap between the sets of target genes predicted for a given miRNA by different programs is surprisingly low (Sethupathy et al., 2006), suggesting that many predictions are false positives (Nicolas et al., 2008).
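One way to make the low overlap between programs concrete is to compare their output as sets. The sketch below is a hedged illustration, not part of any prediction tool: the program names and gene identifiers are placeholders, and the Jaccard index is simply one reasonable overlap measure chosen here.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(predictions):
    """Return {(prog1, prog2): jaccard} for every pair of programs."""
    return {
        (p, q): jaccard(predictions[p], predictions[q])
        for p, q in combinations(sorted(predictions), 2)
    }

def consensus_targets(predictions, min_programs=2):
    """Genes predicted by at least `min_programs` programs."""
    counts = {}
    for genes in predictions.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {g for g, n in counts.items() if n >= min_programs}

# Hypothetical prediction sets for one miRNA (placeholders, not real output).
predictions = {
    "program_A": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "program_B": {"GENE2", "GENE3", "GENE5"},
    "program_C": {"GENE3", "GENE6"},
}

overlaps = pairwise_overlap(predictions)
shortlist = consensus_targets(predictions, min_programs=2)
```

Genes that recur across programs (the `shortlist` here) are natural first candidates for experimental validation, although agreement between programs does not by itself prove a functional interaction.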
TABLE 16.1. Online Databases for MiRNA Research
Name of the Database | Website Linkage | Description
miRBase | www.mirbase.org | Contains three main sections: (1) miRBase sequences contains all published miRNA sequences, genomic locations, and associated annotations; (2) miRBase targets is a newly developed database of predicted miRNA target genes; (3) miRBase registry provides a confidential service assigning official names for novel miRNA genes before publication of their discovery
ARGONAUTE | www.ma.uni-heidelberg.de/apps/zmf/argonaute | Mammalian miRNAs and their function in gene and pathway regulation
miRNAMap | http://mirnamap.mbc.nctu.edu.tw | Collects experimentally verified miRNAs and experimentally verified miRNA target genes in human, mouse, rat, and other metazoan genomes
TargetScanS | http://genes.mit.edu/targetscan | Predicts biological targets of miRNAs by searching for the presence of conserved 8mer and 7mer sites that match the seed region of each miRNA
PolymiRTS | http://compbio.utmem.edu/miRSNP | Naturally occurring DNA variations in putative miRNA target sites
Patrocles | www.patrocles.org/Patrocles.htm | Polymorphic miRNA-target interactions
16.3.1.1.1 Online Databases and Software
Available online data resources are summarized in Table 16.1. For example, using default parameters, the miRBase software (Griffiths-Jones et al., 2006) has been used to search 79 collated variants located in the upstream sequence (USS) between the translational termination codon and the upstream core polyadenylation signal (UCPAS) for miRNA binding sites. For each variant, both the wild-type 3′-UTR sequence and its mutated counterpart, each about 50 bp in total length flanking the site of mutation, could be screened for the presence of miRNA binding sites, with all possible 25-bp fragments within these flanking sequences examined sequentially. Although three of the six databases overlap in many predicted targets, they diverge in others. Thus it might be beneficial to search all the databases for potential targets of an miRNA
of interest for experimental validation. Once the candidate miRNA has been selected, its sequence can also be looked up online in the different databases. Ready-to-use mature miRNA is commercially available (e.g., Ambion).
16.3.1.2 Experimental Validation by Luciferase Reporter Assay
16.3.1.2.1 Plasmid Construction
The 3′-UTR containing the target binding site, with or without the mutation, can be PCR-amplified from DNA or cDNA samples, and the product can be subcloned into a commercial luciferase vector downstream of the firefly luciferase coding region (Fig. 16.4). The authenticity and orientation of the insert relative to the luciferase gene should be confirmed by sequencing. The experimental reporter vector containing the target elements to be tested is usually cotransfected with a second reporter vector, which serves as an internal control for transfection efficiency. The dual-luciferase reporter (DLR) assay system offers sequential measurement of the activities of two distinct reporter luciferases in the same cell lysate, obtained from cells cotransfected with experimental and control reporter vectors, for example, firefly luciferase (encoded by the experimental vector) and Renilla luciferase (encoded by the control vector). Many commercially available vectors containing a suitable luciferase gene exist, and a particular vector should be selected according to the study (Matuszyk, 2002).
16.3.1.2.2 Cell Culture Conditions
1. One day before the transfection experiment, adjust the cell concentration and plate the cells.
2. Culture the cells overnight to achieve 60–80% confluence.
16.3.1.2.3 Transfection of Experimental Cells
Transfect the experimental cells with the luciferase reporter constructs described above, along with the internal control vector and the appropriate commercially available miRNA mimic, using a transfection reagent according to the manufacturer’s instructions (e.g., Lipofectamine 2000, Invitrogen). After 48 h of incubation, harvest the cells. The relative luciferase activity can be expressed as the ratio of the experimental to the internal control luciferase activity. The transfection steps are as follows:
1. Before the transfection, replace the medium with fresh medium.
2. Prepare the transfection reagent.
3. Prepare the experimental DNA and miRNA; mix gently.
4. Incubate the transfection reagent mixture (DNA/RNA complex) for a minimum of 15 min at room temperature.
5. When ready, add this mixture dropwise directly to the cells through the medium. Be sure to distribute the droplets evenly over the entire area. There is no need to remove the medium and replace it with fresh medium.
6. Incubate for 36–48 h. Harvest the cells.
Figure 16.4. Construction of the reporter plasmid. The vector carries a CMV promoter, the luciferase reporter gene, the 3′-UTR with or without the mutation, and an SV40 poly(A) signal.
16.3.1.2.4 Preparation of Cell Lysate for Luciferase Activity Assay
1. Remove the growth medium from the cultured cells.
2. Rinse the cells in washing buffer (e.g., 1× PBS) without dislodging them. Remove as much of the final wash as possible.
3. Dispense a minimal volume of 1× lysis reagent into each culture vessel (e.g., 200–400 μL per 60-mm culture dish).
4. For culture dishes, scrape the attached cells from the dish, and transfer the cells and solution to a microcentrifuge tube.
5. Pellet debris by brief centrifugation, and transfer the supernatant to a new tube.
6. Mix 20 μL of cell lysate with 100 μL of luciferase assay reagent, and measure the light produced by using a luminometer.
From Nature Protocols: www.natureprotocols.com/2006/10/27/transient_transfection_and_luc.php (May 30, 2009).
16.3.2 MiRNA Target Gene Expression Analysis
16.3.2.1 Target Prediction
Predicted targets are available in online databases.
16.3.2.2 Experimental Cell Model
1. Choose a cell line with a known mutation to be tested.
2. Transfect a cell line with the miRNA target gene containing the 3′-UTR with or without the mutation to be tested (Fig. 16.5).
16.3.2.3 Transfection
Cotransfect the cells with commercially available miRNA mimics along with the target gene expression plasmid.
Figure 16.5. Construction of expression plasmid. The vector carries a CMV promoter, the target gene coding region, and the 3′-UTR with or without the mutation.
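For the reporter assay of Section 16.3.1.2, the relative luciferase activity is the ratio of the experimental reporter signal to the internal control signal. A minimal Python sketch of that arithmetic follows. The readings are invented, and the function names and the choice to average triplicate wells are assumptions made for illustration; the same calculation applies if another internal control (for example β-Gal, as in Fig. 16.3) is used.

```python
from statistics import mean

def relative_activity(firefly, renilla):
    """Per-well ratio of experimental (firefly) to control (Renilla) signal."""
    if len(firefly) != len(renilla):
        raise ValueError("firefly and renilla readings must be paired")
    return [f / r for f, r in zip(firefly, renilla)]

def fold_vs_control(sample_ratios, control_ratios):
    """Mean sample ratio expressed relative to the mean control ratio."""
    return mean(sample_ratios) / mean(control_ratios)

# Hypothetical luminometer readings (arbitrary light units), triplicate wells.
control_mirna = relative_activity([9000, 9600, 9300], [300, 320, 310])
mirna_mimic = relative_activity([4200, 4500, 4650], [280, 300, 310])

fold = fold_vs_control(mirna_mimic, control_mirna)
```

A fold value below 1 for the wild-type 3′-UTR reporter, together with a value close to 1 for the mutated reporter, would be the pattern expected if the mutation disrupts the miRNA binding site. Statistical testing across independent experiments is still needed and is not shown here.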
16.3.2.4 Detection of the Target Gene Expression
16.3.2.4.1 A Brief Protocol of Western Blotting with Monoclonal Antibodies
Sample Preparation
1. Wash dishes three times with 1× PBS.
2. Apply lysis buffer and scrape the cells from the dishes.
3. Transfer the lysate to a microcentrifuge tube and boil for 5 min in a boiling water bath. To reduce viscosity, the sample may be sonicated briefly or passed several times through a 26-gauge needle.
4. Centrifuge the sample for 5 min to pellet insoluble material and collect the supernatant.
5. Determine the protein concentration by BCA (Pierce) protein concentration assay according to the manufacturer’s instructions.
Polyacrylamide Gel Electrophoresis
1. Apply the proper volume of electrophoresis sample buffer to the sample tube and boil for 3–5 min.
2. Apply 5–20 μg of total protein to each lane. Refer to the antibody datasheet for the appropriate positive control cell lysate.
3. Electrophorese until the bromophenol blue in the samples reaches the bottom of the gel. Keep gels in running buffer until ready to transfer.
Semidry Transfer
1. Transfer the protein from the gel to a PVDF membrane at 1.2 mA/cm² for 1.75 h in transfer buffer.
Protein Blotting
1. Blocking: Transfer the blot from the transfer apparatus or staining tray to blocking buffer (5% nonfat dry milk, 10 mM Tris pH 7.5, 100 mM NaCl, 0.1% Tween 20). Incubate the blot for 30 min at 37°C, 1 h at room temperature, or overnight at 4°C.
2. Primary antibody: Decant the blocking buffer from the blot, add the antibody solution at the optimized dilution, and incubate with agitation for 30 min at 37°C, 1 h at room temperature, or overnight at 4°C.
3. Decant the primary antibody solution, and wash the blot with 1× PBS three times, 10 min each.
4. Enzyme-conjugated secondary antibody: Add the enzyme-conjugated secondary antibody and incubate with agitation for 30 min at 37°C or 1 h at room temperature.
5. Decant the secondary antibody solution, and wash the blot with 1× PBS three times, 10 min each.
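Two quantities in the Western blotting protocol above are simple arithmetic: the lysate volume needed to load a chosen protein mass per lane (5–20 μg) and the total current for a semidry transfer at 1.2 mA/cm². The Python sketch below shows both calculations. The BCA concentrations, the 20-μg target, and the 8 cm × 6 cm membrane are invented example values, and the function names are illustrative only.

```python
def loading_volume_ul(target_ug, conc_ug_per_ul):
    """Lysate volume (in microliters) that contains `target_ug` of protein."""
    if conc_ug_per_ul <= 0:
        raise ValueError("protein concentration must be positive")
    return target_ug / conc_ug_per_ul

def transfer_current_ma(width_cm, height_cm, density_ma_per_cm2=1.2):
    """Total semidry transfer current (mA) for a membrane of the given size."""
    return width_cm * height_cm * density_ma_per_cm2

# Hypothetical BCA results (microgram per microliter) for two lysates.
concentrations = {"wild_type_utr": 2.5, "mutant_utr": 4.0}
volumes = {name: loading_volume_ul(20, c) for name, c in concentrations.items()}

current = transfer_current_ma(8, 6)  # assumed 8 cm x 6 cm membrane
```

Loading equal protein mass in every lane matters here because the comparison of interest is the target protein level with and without the 3′-UTR mutation; sample buffer volume must be added on top of the lysate volumes computed above.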
Develop the Blot
1. Put the blot into the chemiluminescent working solution and expose it to X-ray film, adjusting the exposure to obtain the best signal.
From BD Biosciences: www.bdbiosciences.com/pharmingen/protocols/Western_Blotting.shtml (May 30, 2009).
16.3.2.4.2 Brief Protocol of Immunostaining
1. Gently rinse slides containing sections in freezing-cold acetone or another fixative for 20 min.
2. Return the slides to room temperature before the experiment.
3. Rinse slides 3× in PBS, 2 min each time.
4. Block endogenous peroxidase activity by incubating the slides in 0.3% H2O2 solution in PBS for 10 min.
5. Rinse slides 3× in PBS, 2 min each time.
6. Block nonspecific binding by incubating with blocking buffer (10% serum from the host species of the secondary antibody diluted in PBS, or 10% FBS in PBS) for 30–60 min at room temperature in a humidified chamber.
7. Dilute the primary antibody in the antibody diluent. Alternatively, use a buffered solution with a source of protein as antibody diluent. Apply the diluted antibody to the tissue sections on the slide. Incubate for 1 h at room temperature in a humidified chamber.
8. Rinse slides 3× in PBS, 2 min each time.
9. Dilute the biotinylated secondary antibody in the antibody diluent. Alternatively, use a buffered solution with a source of protein as antibody diluent. Apply to the tissue sections on the slide, and incubate for 30 min at room temperature.
10. Rinse slides 3× in PBS, 2 min each time.
11. Apply prediluted streptavidin-horseradish peroxidase to the tissue sections on the slide, and incubate for 30 min at room temperature.
12. Rinse slides 3× in PBS, 2 min each time.
13. Prepare DAB substrate solution following the manufacturer’s recommendations. Safety note: DAB is a suspected carcinogen. Handle with care. Wear gloves, lab coat, and eye protection.
14. Drain PBS from the slides and apply the DAB substrate solution. Allow slides to incubate for 5 min or until the desired color intensity is reached.
15. Wash 3× in water, 2 min each time.
16. Counterstain slides: dip twice in hematoxylin.
17. Rinse thoroughly in water.
18. Dip twice in bluing reagent or diluted ammonia water.
19. Rinse thoroughly in water.
20. Dehydrate through four changes of alcohol (95%, 95%, 100%, and 100%). Clear in three changes of xylene (or xylene substitute), and coverslip using mounting solution.
Note: From BD Biosciences: www.bdbiosciences.com/pharmingen/protocols/Frozen_Tissue_Sections.shtml (May 30, 2009).
16.4 LIMITATIONS AND TROUBLESHOOTING
Limitations
• Target predictions based primarily on conserved seed pairing and local sequence or structural features suggest that individual animal miRNAs often have >100 targets and that at least 20–30% of animal transcripts bear one or more conserved miRNA binding sites in their 3′-UTR. Thus the level of an mRNA or its translation product is governed by the combinatorial effect of the miRNAs that target it.
• One candidate target gene of a known miRNA may have more than one binding site within its 3′-UTR. For example, HMGA2 encodes a chromatin-associated protein and contains seven let-7 sites in its 3′-UTR (Chuzhanova et al., 2007). Search different databases and pick the candidate sequences on which most of them agree.
• It is estimated that 1–4% of genes in the human genome are miRNAs and that a single miRNA can regulate as many as 200 mRNAs. It is clear that disturbances of their binding may have detrimental effects on cell physiology (Esquela-Kerscher and Slack, 2006).
Troubleshooting
• Use low concentrations.
• Use independent miRNAs against the same target.
• MiRNAs cannot be used to confirm mutations within the protein coding region.
• High cost is an important limitation of this method.
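The limitation that one 3′-UTR can carry several sites, for one or for several miRNAs, can be illustrated by counting perfect seed sites in a sequence. The Python sketch below makes several assumptions that are not from the text: the seed is taken as miRNA nucleotides 2–8, the mature let-7a and miR-320 sequences are assumed and should be verified against miRBase, and the 3′-UTR is an invented toy sequence rather than a real transcript.

```python
import re

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna):
    """mRNA site (5'->3') complementary to miRNA nucleotides 2-8."""
    return "".join(COMPLEMENT[b] for b in reversed(mirna[1:8]))

def site_positions(utr, mirna):
    """0-based start positions of all (possibly overlapping) seed sites."""
    pattern = "(?=" + re.escape(seed_site(mirna)) + ")"
    return [m.start() for m in re.finditer(pattern, utr)]

# Assumed mature sequences; verify against miRBase before use.
mirnas = {
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUU",
    "miR-320": "AAAAGCUGGGUUGAGAGGGCGA",
}

# Invented toy 3'-UTR carrying two let-7 sites and one miR-320 site.
utr = "AACUACCUCGGAUCAGCUUUGCCUACCUCAA"

sites = {name: site_positions(utr, seq) for name, seq in mirnas.items()}
```

When a reporter or expression construct contains several sites like this, mutating a single site may only partly relieve repression, which is one reason to map all candidate sites before designing the mutated 3′-UTR.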
16.5 REFERENCES
Abelson JF, Kwan KY, O’Roak BJ, Baek DY, Stillman AA, Morgan TM, Mathews CA, Pauls DL, Rasin MR, Gunel M, Davis NR, Ercan-Sencicek AG, Guez DH, Spertus JA, Leckman JF, Dure LS 4th, Kurlan R, Singer HS, Gilbert DL, Farhi A, Louvi A, Lifton RP, Sestan N, State MW. (2005). Sequence variants in SLITRK1 are associated with Tourette’s syndrome. Science 310(5746):317–20.
Abrahante JE, Daul AL, Li M, Volk ML, Tennessen JM, Miller EA, Rougvie AE. (2003). The Caenorhabditis elegans hunchback-like gene lin-57/hbl-1 controls developmental time and is regulated by microRNAs. Dev Cell 4(5):625–37.
Bagga S, Bracht J, Hunter S, Massirer K, Holtz J, Eachus R, Pasquinelli AE. (2005). Regulation by let-7 and lin-4 miRNAs results in target mRNA degradation. Cell 122(4):553–63.
Bao L, Zhou M, Wu L, Lu L, Goldowitz D, Williams RW, Cui Y. (2007). PolymiRTS Database: linking polymorphisms in microRNA target sites with complex traits. Nucleic Acids Res 35:D51–4.
Bartel DP. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2):281–97.
Behm-Ansmant I, Rehwinkel J, Doerks T, Stark A, Bork P, Izaurralde E. (2006). mRNA degradation by miRNAs and GW182 requires both CCR4:NOT deadenylase and DCP1:DCP2 decapping complexes. Genes Dev 20(14):1885–98.
Berezikov E, Guryev V, van de Belt J, Wienholds E, Plasterk RH, Cuppen E. (2005). Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120(1):21–24.
Bhattacharyya SN, Habermacher R, Martine U, Closs EI, Filipowicz W. (2006). Relief of microRNA-mediated translational repression in human cells subjected to stress. Cell 125(6):1111–24.
Brengues M, Teixeira D, Parker R. (2005). Movement of eukaryotic mRNAs between polysomes and cytoplasmic processing bodies. Science 310(5747):486–89.
Brennecke J, Stark A, Russell RB, Cohen SM. (2005). Principles of microRNA-target recognition. PLoS Biol 3(3):e85.
Brennecke J, Hipfner DR, Stark A, Russell RB, Cohen SM. (2003). bantam encodes a developmentally regulated microRNA that controls cell proliferation and regulates the proapoptotic gene hid in Drosophila. Cell 113(1):25–36.
Cai X, Hagedorn CH, Cullen BR. (2004). Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10(12):1957–66.
Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K, Rassenti L, Kipps T, Negrini M, Bullrich F, Croce CM. (2002). Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci USA 99(24):15524–29.
Callis TE, Tatsuguchi M, Wang DZ. (2008). miRNAs and their emerging role in cardiac hypertrophy. In Erdmann VA, Poller W, Barciszewski J (eds.), RNA Technologies in Cardiovascular Medicine and Research. Springer-Verlag Berlin Heidelberg.
Chan JA, Krichevsky AM, Kosik KS. (2005). MicroRNA-21 is an antiapoptotic factor in human glioblastoma cells. Cancer Res 65(14):6029–33.
Chen CZ, Li L, Lodish HF, Bartel DP. (2004). MicroRNAs modulate hematopoietic lineage differentiation. Science 303(5654):83–86.
Chen K, Rajewsky N. (2006). Natural selection on human microRNA binding sites inferred from SNP data. Nat Genet 38(12):1452–56.
Chuzhanova N, Cooper DN, Férec C, Chen JM. (2007). Searching for potential microRNA-binding site mutations amongst known disease-associated 3′ UTR variants. Genomic Med 1(1–2):29–33.
Ciafrè SA, Galardi S, Mangiola A, Ferracin M, Liu CG, Sabatino G, Negrini M, Maira G, Croce CM, Farace MG. (2005). Extensive modulation of a set of microRNAs in primary glioblastoma. Biochem Biophys Res Commun 334(4):1351–58.
Cimmino A, Calin GA, Fabbri M, Iorio MV, Ferracin M, Shimizu M, Wojcik SE, Aqeilan RI, Zupo S, Dono M, Rassenti L, Alder H, Volinia S, Liu CG, Kipps TJ, Negrini M, Croce CM. (2005). miR-15 and miR-16 induce apoptosis by targeting BCL2. Proc Natl Acad Sci USA 102(39):13944–49.
Clop A, Marcq F, Takeda H, Pirottin D, Tordoir X, Bibé B, Bouix J, Caiment F, Elsen JM, Eychenne F, Larzul C, Laville E, Meish F, Milenkovic D, Tobin J, Charlier C, Georges M. (2006). A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nat Genet 38(7):813–18.
Cowland JB, Hother C, Grønbaek K. (2007). MicroRNAs and cancer. APMIS 115(10):1090–106.
Cox DG, Pontes C, Guino E, Navarro M, Osorio A, Canzian F, Moreno V; Bellvitge Colorectal Cancer Study Group. (2004). Polymorphisms in prostaglandin synthase 2/cyclooxygenase 2 (PTGS2/COX2) and risk of colorectal cancer. Br J Cancer 91(2):339–43.
Denli AM, Tops BB, Plasterk RH, Ketting RF, Hannon GJ. (2004). Processing of primary microRNAs by the Microprocessor complex. Nature 432(7014):231–35.
Doench JG, Sharp PA. (2004). Specificity of microRNA target selection in translational repression. Genes Dev 18(5):504–11.
Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. (2003). MicroRNA targets in Drosophila. Genome Biol 5(1):R1.
Esquela-Kerscher A, Slack FJ. (2006). Oncomirs—microRNAs with a role in cancer. Nat Rev Cancer 6(4):259–69.
Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP. (2005). The widespread impact of mammalian microRNAs on mRNA repression and evolution. Science 310(5755):1817–21.
Filipowicz W. (2005). RNAi: the nuts and bolts of the RISC machine. Cell 122(1):17–20.
Flynt AS, Lai EC. (2008). Biological principles of microRNA-mediated regulation: shared themes amid diversity. Nat Rev Genet 9(11):831–42.
Gardner PP, Vinther J. (2008). Mutation of miRNA target sequences during human evolution. Trends Genet 24(6):262–65.
Garner C, Best S, Menzel S, Rooks H, Spector TD, Thein SL. (2006). Two candidate genes for low platelet count identified in an Asian Indian kindred by genome-wide linkage analysis: glycoprotein IX and thrombopoietin. Eur J Hum Genet 14(1):101–108.
Georges M, Clop A, Marcq F, Takeda H, Pirottin D, Hiard S, Tordoir X, Caiment F, Meish F, Bibé B, Bouix J, Elsen JM, Eychenne F, Laville E, Larzul C. (2006). Polymorphic microRNA-target interactions: a novel source of phenotypic variation. Cold Spring Harb Symp Quant Biol 71:343–50.
Giraldez AJ, Cinalli RM, Glasner ME, Enright AJ, Thomson JM, Baskerville S, Hammond SM, Bartel DP, Schier AF. (2005). MicroRNAs regulate brain morphogenesis in zebrafish. Science 308(5723):833–38.
Giraldez AJ, Mishima Y, Rihel J, Grocock RJ, Van Dongen S, Inoue K, Enright AJ, Schier AF. (2006). Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312(5770):75–79.
Gregory RI, Yan KP, Amuthan G, Chendrimada T, Doratotaj B, Cooch N, Shiekhattar R. (2004). The Microprocessor complex mediates the genesis of microRNAs. Nature 432(7014):235–40.
Gregory RI, Chendrimada TP, Cooch N, Shiekhattar R. (2005). Human RISC couples microRNA biogenesis and posttranscriptional gene silencing. Cell 123(4):631–40.
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34:D140–44.
Grosshans H, Johnson T, Reinert KL, Gerstein M. (2005). The temporal patterning microRNA let-7 regulates several transcription factors at the larval to adult transition in C. elegans. Dev Cell 8(3):321–30.
Hammond SM, Bernstein E, Beach D, Hannon GJ. (2000). An RNA-directed nuclease mediates post-transcriptional gene silencing in Drosophila cells. Nature 404(6775):293–96.
Hammond SM, Boettcher S, Caudy AA, Kobayashi R, Hannon GJ. (2001). Argonaute2, a link between genetic and biochemical analyses of RNAi. Science 293(5532):1146–50.
Hatfield SD, Shcherbata HR, Fischer KA, Nakahara K, Carthew RW, Ruohola-Baker H. (2005). Stem cell division is regulated by the microRNA pathway. Nature 435(7044):974–78.
He H, Jazdzewski K, Li W, Liyanarachchi S, Nagy R, Volinia S, Calin GA, Liu CG, Franssila K, Suster S, Kloos RT, Croce CM, de la Chapelle A. (2005). The role of microRNA genes in papillary thyroid carcinoma. Proc Natl Acad Sci USA 102(52):19075–80.
Hsu PW, Huang HD, Hsu SD, Lin LZ, Tsou AP, Tseng CP, Stadler PF, Washietl S, Hofacker IL. (2006). miRNAMap: genomic maps of microRNA genes and their target genes in mammalian genomes. Nucleic Acids Res 34:D135–39.
Hutvágner G, McLachlan J, Pasquinelli AE, Bálint E, Tuschl T, Zamore PD. (2001). A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293(5531):834–38.
Iwai N, Naraba H. (2005). Polymorphisms in human pre-miRNAs. Biochem Biophys Res Commun 331(4):1439–44.
John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS. (2004). Human microRNA targets. PLoS Biol 2(11):e363.
Johnson SM, Grosshans H, Shingara J, Byrom M, Jarvis R, Cheng A, Labourier E, Reinert KL, Brown D, Slack FJ. (2005). RAS is regulated by the let-7 microRNA family. Cell 120(5):635–47.
Kiriakidou M, Nelson PT, Kouranov A, Fitziev P, Bouyioukos C, Mourelatos Z, Hatzigeorgiou A. (2004). A combined computational-experimental approach predicts human microRNA targets. Genes Dev 18(10):1165–78.
Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, Rajewsky N. (2005). Combinatorial microRNA target predictions. Nat Genet 37(5):495–500.
Krüger J, Rehmsmeier M. (2006). RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res 34:W451–4.
Lee RC, Feinbaum RL, Ambros V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75(5):843–54.
Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Rådmark O, Kim S, Kim VN. (2003). The nuclear RNase III Drosha initiates microRNA processing. Nature 425(6956):415–19.
Lee Y, Jeon K, Lee JT, Kim S, Kim VN. (2002). MicroRNA maturation: stepwise processing and subcellular localization. EMBO J 21(17):4663–70.
Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN. (2004). MicroRNA genes are transcribed by RNA polymerase II. EMBO J 23(20):4051–60.
Lewis BP, Burge CB, Bartel DP. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1):15–20.
Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. (2003). Prediction of mammalian microRNA targets. Cell 115(7):787–98.
Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM. (2005). Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433(7027):769–73.
Liu J, Valencia-Sanchez MA, Hannon GJ, Parker R. (2005). MicroRNA-dependent localization of targeted mRNAs to mammalian P-bodies. Nat Cell Biol 7(7):719–23.
Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR. (2005). MicroRNA expression profiles classify human cancers. Nature 435(7043):834–38.
Lu M, Zhang Q, Deng M, Miao J, Guo Y, Gao W, Cui Q. (2008). An analysis of human microRNA and disease associations. PLoS One 3(10):e3420.
Lund E, Güttinger S, Calado A, Dahlberg JE, Kutay U. (2004). Nuclear export of microRNA precursors. Science 303(5654):95–98.
Martin MM, Buckenberger JA, Jiang J, Malana GE, Nuovo GJ, Chotani M, Feldman DS, Schmittgen TD, Elton TS. (2007). The human angiotensin II type 1 receptor +1166 A/C polymorphism attenuates microrna-155 binding. J Biol Chem 282(33):24262–69.
Matranga C, Tomari Y, Shin C, Bartel DP, Zamore PD. (2005). Passenger-strand cleavage facilitates assembly of siRNA into Ago2-containing RNAi enzyme complexes. Cell 123(4):607–20.
Matuszyk J. (2002). Selection of a control reporter vector for the dual-luciferase reporter assay for transcription activation. E. ZIOLO 7:63.
McGlinn E, Yekta S, Mansfield JH, Soutschek J, Bartel DP, Tabin CJ. (2009). In ovo application of antagomiRs indicates a role for miR-196 in patterning the chick axial skeleton through Hox gene regulation. Proc Natl Acad Sci USA 106(44):18610–15.
Medina PP, Slack FJ. (2008). microRNAs and cancer: an overview. Cell Cycle 7(16):2485–92.
Nicolas FE, Pais H, Schwach F, Lindow M, Kauppinen S, Moulton V, Dalmay T. (2008). Experimental identification of microRNA-140 targets by silencing and overexpressing miR-140. RNA 14(12):2513–20.
O’Donnell KA, Wentzel EA, Zeller KI, Dang CV, Mendell JT. (2005). c-Myc-regulated microRNAs modulate E2F1 expression. Nature 435(7043):839–43.
Olsen PH, Ambros V. (1999). The lin-4 regulatory RNA controls developmental timing in Caenorhabditis elegans by blocking LIN-14 protein synthesis after the initiation of translation. Dev Biol 216(2):671–80.
Pillai RS, Bhattacharyya SN, Artus CG, Zoller T, Cougot N, Basyuk E, Bertrand E, Filipowicz W. (2005). Inhibition of translational initiation by let-7 microRNA in human cells. Science 309(5740):1573–76.
Rand TA, Petersen S, Du F, Wang X. (2005). Argonaute2 cleaves the anti-guide strand of siRNA during RISC activation. Cell 123(4):621–29.
Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403(6772):901–06.
Ren XP, Wu J, Wang X, Sartor MA, Qian J, Jones K, Nicolaou P, Pritchard TJ, Fan GC. (2009). MicroRNA-320 is involved in the regulation of cardiac ischemia/reperfusion injury by targeting heat-shock protein 20. Circulation 119(17):2357–66.
Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A, Platt D. (2006). Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. Proc Natl Acad Sci USA 103(17):6605–10.
Rosenfeld N, Aharonov R, Meiri E, Rosenwald S, Spector Y, Zepeniuk M, Benjamin H, Shabes N, Tabak S, Levy A, Lebanony D, Goren Y, Silberschein E, Targan N, Ben-Ari A, Gilad S, Sion-Vardy N, Tobar A, Feinmesser M, Kharenko O, Nativ O, Nass D, Perelman M, Yosepovich A, Shalmon B, Polak-Charcon S, Fridman E, Avniel A, Bentwich I, Bentwich Z, Cohen D, Chajut A, Barshack I. (2008). MicroRNAs accurately identify cancer tissue origin. Nat Biotechnol 26(4):462–69.
Ruby JG, Stark A, Johnston WK, Kellis M, Bartel DP, Lai EC. (2007). Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs. Genome Res 17(12):1850–64.
Saunders MA, Liang H, Li WH. (2007). Human polymorphism at microRNAs and microRNA target sites. Proc Natl Acad Sci USA 104(9):3300–05.
Sayed D, Hong C, Chen IY, Lypowy J, Abdellatif M. (2007). MicroRNAs play an essential role in the development of cardiac hypertrophy. Circ Res 100(3):416–24.
Schetter AJ, Leung SY, Sohn JJ, Zanetti KA, Bowman ED, Yanaihara N, Yuen ST, Chan TL, Kwong DL, Au GK, Liu CG, Calin GA, Croce CM, Harris CC. (2008). MicroRNA expression profiles associated with prognosis and therapeutic outcome in colon adenocarcinoma. JAMA 299(4):425–36.
Sen GL, Blau HM. (2005). Argonaute 2/RISC resides in sites of mammalian mRNA decay known as cytoplasmic bodies. Nat Cell Biol 7(6):633–36.
Sethupathy P, Corda B, Hatzigeorgiou AG. (2006). TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA 12(2):192–97.
Sethupathy P, Collins FS. (2008). MicroRNA target site polymorphisms and human disease. Trends Genet 24(10):489–97.
Sethupathy P, Megraw M, Hatzigeorgiou AG. (2006). A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3(11):881–86.
Shahi P, Loukianiouk S, Bohne-Lang A, Kenzelmann M, Küffer S, Maertens S, Eils R, Gröne HJ, Gretz N, Brors B. (2006). Argonaute–a database for gene regulation by mammalian microRNAs. Nucleic Acids Res 34:D115–18.
c16.indd 367
1/12/2011 9:44:31 AM
368
CONFIRMATION OF A MUTATION BY MICRORNA
Sheth U, Parker R. (2003). Decapping and decay of messenger RNA occur in cytoplasmic processing bodies. Science 300(5620):805–08. Si ML, Zhu S, Wu H, Lu Z, Wu F, Mo YY. (2007). miR-21-mediated tumor growth. Oncogene 26(19):2799–803. Slack FJ, Basson M, Liu Z, Ambros V, Horvitz HR, Ruvkun G. (2000). The lin-41 RBCC gene acts in the C. elegans heterochronic pathway between the let-7 regulatory RNA and the LIN-29 transcription factor. Mol Cell 5(4):659–69. Sood P, Krek A, Zavolan M, Macino G, Rajewsky N. (2006). Cell-type-specific signatures of microRNAs on target mRNA expression. Proc Natl Acad Sci U S A 103(8): 2746–51. Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM. (2005). Animal microRNAs confer robustness to gene expression and have a significant impact on 3′UTR evolution. Cell 123(6):1133–46. Taganov KD, Boldin MP, Chang KJ, Baltimore D. (2006). NF-kappaB-dependent induction of microRNA miR-146, an inhibitor targeted to signaling proteins of innate immune responses. Proc Natl Acad Sci U S A 103(33):12481–86. Takamizawa J, Konishi H, Yanagisawa K, Tomida S, Osada H, Endoh H, Harano T, Yatabe Y, Nagino M, Nimura Y, Mitsudomi T, Takahashi T. (2004). Reduced expression of the let-7 microRNAs in human lung cancers in association with shortened postoperative survival. Cancer Res 64(11):3753–56. Tan Z, Randall G, Fan J, Camoretti-Mercado B, Brockman-Schneider R, Pan L, Solway J, Gern JE, Lemanske RF, Nicolae D, Ober C. (2007). Allele-specific targeting of microRNAs to HLA-G and risk of asthma. Am J Hum Genet 81(4):829–34. Teixeira D, Sheth U, Valencia-Sanchez MA, Brengues M, Parker R. (2005). Processing bodies require RNA for assembly and contain ontranslating mRNAs. RNA 11(4):371–82. Voorhoeve PM, le Sage C, Schrier M, Gillis AJ, Stoop H, Nagel R, Liu YP, van Duijse J, Drost J, Griekspoor A, Zlotorynski E, Yabuta N, De Vita G, Nojima H, Looijenga LH, Agami R. (2006). 
A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in testicular germ cell tumors. Cell 124(6):1169–81. Wightman B, Ha I, Ruvkun G. (1993). Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 75(5):855–62. Wu L, Fan J, Belasco JG. (2006). MicroRNAs direct rapid deadenylation of mRNA. Proc Natl Acad Sci U S A 103(11):4034–39. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. (2005). Systematic discovery of regulatory motifs in human promoters and 3 UTRs by comparison of several mammals. Nature 434(7031):338–45. Xu P, Vernooy SY, Guo M, Hay BA. (2003). The Drosophila microRNA Mir-14 suppresses cell death and is required for normal fat metabolism. Curr Biol 13(9): 790–95. Yanaihara N, Caplen N, Bowman E, Seike M, Kumamoto K, Yi M, Stephens RM, Okamoto A, Yokota J, Tanaka T, Calin GA, Liu CG, Croce CM, Harris CC. (2006). Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell 9(3):189–98.
c16.indd 368
1/12/2011 9:44:31 AM
REFERENCES
369
Yekta S, Shih IH, Bartel DP. (2004). MicroRNA-directed cleavage of HOXB8 mRNA. Science 304(5670):594–96. Yi R, Qin Y, Macara IG, Cullen BR. (2003). Exportin-5 mediates the nuclear export of pre-microRNAs and short hairpin RNAs. Genes Dev 17(24):3011–16. Yu SL, Chen HY, Chang GC, Chen CY, Chen HW, Singh S, Cheng CL, Yu CJ, Lee YC, Chen HS, Su TJ, Chiang CC, Li HN, Hong QS, Su HY, Chen CC, Chen WJ, Liu CC, Chan WK, Chen WJ, Li KC, Chen JJ, Yang PC. (2008). MicroRNA signature predicts survival and relapse in lung cancer. Cancer Cell 13(1):48–57. Yu Z, Li Z, Jolicoeur N, Zhang L, Fortin Y, Wang E, Wu M, Shen SH. (2007). Aberrant allele frequencies of the SNPs located in microRNA target sites are potentially associated with human cancers. Nucl Acids Res 35(13):4535–41. Zhao H, Kalota A, Jin S, Gewirtz AM. (2009). The c-myb proto-oncogene and microRNA-15a comprise an active autoregulatory feedback loop in human hematopoietic cells. Blood 113(3):505–16.
c16.indd 369
1/12/2011 9:44:31 AM
CHAPTER 17
Confirmation of Gene Function Using Translational Approaches
CAROLINE J. ZEISS
Contents
17.1 Introduction
17.2 Sources of Phenotypic Variability in Genetically Altered Mice
17.2.1 The Effect of Mouse Strain on Expression of an Induced Genetic Alteration
17.2.2 Strain-Specific Pathology
17.2.3 Environmental Phenomena
17.3 Gene-Driven or Reverse Genetics Approach to Mouse Research
17.3.1 Method of Genetic Manipulation and Phenotypic Variability
17.3.2 Determining the Phenotypic Effects of a Known or Novel Gene
17.3.3 Embryonal Lethal Phenotypes
17.4 Phenotype-Driven or Forward Genetics Approach to Mouse Research
17.4.1 Spontaneous Mutations
17.4.2 Genomewide Mutagenesis
17.4.3 High-Throughput Phenotyping
17.5 Information Resources
17.6 Questions and Answers
17.7 References
17.1 INTRODUCTION
The ultimate intent of translational research is to advance patient care by integrating information obtained in basic molecular studies with clinical trials (Woolf, 2008; Goldblatt and Lee, 2010). This approach works in both directions.
Figure 17.1. Forward and reverse approaches to identifying gene function in mice. Forward genetic screen: random mutagenesis, observe phenotype, identify causative gene and infer function. Reverse genetic screen: alter a known gene, observe phenotype, infer function.
Clinical findings provide the impetus for in vitro studies or experiments in animal models that elucidate mechanisms of disease or their proximate causes. Similarly, pathophysiologic clues to the progression of a disease entity, or new approaches to mitigating its effects, typically evolve from applied studies in animals. The mouse has been the primary vehicle of translational exploration. Genetically altered mice provide superb models of human physiology and disease. They allow us to evaluate the effects of single altered genes in the context of the whole organism and provide tremendous insight into gene function. The hereditary basis of a multitude of inherited, neoplastic, and degenerative disorders in humans has long been known. Sequencing the human genome provided the coding sequences of genes causing or contributing to these conditions. Simultaneously, the capacity to genetically manipulate specific genes in the mouse genome (reverse genetics) has resulted in a vast body of knowledge as to how genetic alterations found in humans create disease in a mammalian system and how they may be remedied (Figure 17.1). Although murine studies form the bulk of such investigations, they are increasingly supplemented by more basic studies in simpler model organisms (zebrafish, nematodes, and frogs) and applied clinical studies in larger models such as dogs, pigs, and nonhuman primates. The sequencing of the genomes of multiple organisms and the ability to manipulate genes in vivo have rapidly accelerated our ability to dissect gene function and apply these findings to human disease. Before these developments, exploration of gene function relied on spontaneous appearance of hereditary disorders in animals or random mutagenesis followed by observation of progeny for a
phenotype (forward genetics) (Figure 17.1). This approach depends on relatively laborious identification of the causative gene defect using classical genetics, or on the fortuitous choice of the correct candidate gene. In the forward approach, random mutagenesis in a founder results in numerous defects in unknown genes. Progeny are screened for a phenotype, and those showing one are used to develop individual lines in which the causative defect is mapped. In the reverse approach, a specific gene is mutagenized or introduced and the phenotypic effect on the progeny is observed. Recording the physiologic effect of a gene defect on an animal system is broadly termed phenotyping. Most commonly, the approach taken is hypothesis driven and limited by the interests of the individual investigator. Less commonly, a non-hypothesis-driven approach to standardized characterization of mutants is employed by facilities or consortia devoted to generating and/or screening large numbers of novel mutant animals. The purpose of this chapter is to describe these approaches so as to convey their overall intent, and to provide a guide to performing phenotypic studies in mice.
17.2 SOURCES OF PHENOTYPIC VARIABILITY IN GENETICALLY ALTERED MICE
Variation is intrinsic to animal populations, and restricting its source to the experimental intervention yields more interpretable results. Unexpected variation in murine studies most commonly arises from several sources, including background strain, concurrent infection, and the methods used to create the genetic alteration of interest.
17.2.1 The Effect of Mouse Strain on Expression of an Induced Genetic Alteration
Inbred mouse strains differ dramatically in their appearance, physiologic characteristics, and spectrum of spontaneous disease. These differences arise from strain-specific allelic variation across the genome and profoundly affect expression of the genetic alteration under study.
17.2.1.1 Genetic Manipulation in 129 and C57BL/6J Strains
The C57BL/6 mouse is the reference strain for the mouse genome sequence (Waterston et al., 2002) and is the most commonly used strain for biomedical research. However, the most commonly used strain for embryonic stem (ES) cell manipulation is the 129 strain (Simpson et al., 1997), as ES cells from this strain display robust germline transmission of the genetic alteration. The 129-derived ES cells are injected into C57BL/6J blastocysts because of the high fecundity and good mothering of the latter strain. Chimeric animals are typically bred
to C57BL/6J mice, producing genetically similar F1 animals that share similar chromosomal complements of the 129 and C57BL/6J strains. Due to meiotic recombination, the local contribution of each strain to the genome becomes increasingly variable if progeny are simply bred together in subsequent generations. Phenotypic variability may be caused by alleles that are adjacent to or interact with the target locus. The increasing use of double/triple knockout combinations, inducible transgenes, and cell-specific knockouts provides additional opportunities for creating a genetic background so mixed that results cannot be replicated by other investigators. Genetic heterogeneity is not limited to interstrain variation. Both 129 and C57BL/6 strains exhibit heterogeneity within strain. Over time, genetic drift within a population is inevitable; this means that wild type C57BL/6J mice from the colony of one investigator are no longer identical to the same strain in another facility. Consequently, wild type littermates are more appropriate control animals than wild type animals from another source.
17.2.1.2 Backcrossing
Ideally, the genetic background of control and experimental animals should be identical, with the exception of the target locus. In cases where background effects are likely to be important, the target locus is best propagated in congenic strains by successive backcrossing to one inbred strain. After breeding parental strains, F1 progeny are bred back to one parental strain (usually C57BL/6J). Progeny from this mating (the N2 generation) are then similarly bred back to the parental strain until, after 6 backcross breedings (a process that generally takes 2 years), the resultant offspring are approximately 99% similar to the chosen strain, with the exception of the region surrounding the target locus. After 10 generations, the induced gene defect, enclosed in about 100 Mb of donor genome, is all that remains within an otherwise pure homozygous recipient genome.
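The arithmetic behind these percentages can be sketched briefly. The snippet below is an illustration only: it assumes that each backcross halves, on average, the donor contribution at loci unlinked to the target, so that the expected recipient fraction at backcross generation N (counting the F1 as generation 1) is 1 − (1/2)^N. The function name is hypothetical, and the calculation ignores the donor segment that remains linked to the selected locus.

```python
def expected_recipient_fraction(generation: int) -> float:
    """Expected fraction of the genome (unlinked to the target locus)
    derived from the recipient strain at backcross generation N.

    Assumes the F1 hybrid is generation 1 (50% recipient) and that each
    further backcross halves the remaining donor contribution on average.
    """
    if generation < 1:
        raise ValueError("generation must be >= 1")
    return 1.0 - 0.5 ** generation


if __name__ == "__main__":
    for n in (1, 2, 5, 7, 10):
        print(f"N{n}: {expected_recipient_fraction(n):.2%} recipient genome")
```

Under this convention, an F1 followed by six backcrosses (N7) gives an expected recipient fraction of about 99.2%, in line with the 99% figure quoted above, and N10 gives about 99.9%. Individual animals will deviate from these averages.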
This strategy also provides the opportunity to place the target locus on a number of genetic backgrounds to assess the effects of strain-specific modifier loci (Sigmund, 2000). Although 10 generations of backcrossing are recommended, circumstances frequently dictate that the phenotype of genetically altered animals be evaluated long before congenic strains can be generated. In these cases, the use of age-matched control littermates, rather than inbred animals from another source, is essential.
17.2.1.3 C57BL/6 Embryonic Stem Cells
Genetic manipulation of C57BL/6 ES cells would eliminate the need for backcrossing of progeny; however, development of these has been hindered by the low germline transmission rates of C57BL/6 ES cells compared with 129-derived ES cells. Large-scale mutagenesis programs have been established to mutate all protein-coding genes in the mouse genome, and for these, C57BL/6 ES cells have been chosen to reduce the effects of mixed background (Austin et al., 2004). Recently, robust germline transmission has been achieved with C57BL/6N ES cells (Pettitt et al., 2009). Repair of the Agouti locus (imparting agouti color
Figure 17.2. (See color insert.) WT and rd-1 mouse retina at postnatal day 12 (ONL, outer nuclear layer). In rd-1 mice, a mutation in the cyclic phosphodiesterase beta gene results in rapid loss of the outer nuclear layer in the first 3 weeks of life.
rather than black) in C57BL/6N ES cells and subsequent injection into C57BL/6J blastocysts allow identification of ES cell-derived chimeras by mixed agouti and black coat color (Pettitt et al., 2009).
17.2.2 Strain-Specific Pathology
Mice experience a spectrum of spontaneous diseases that is heavily influenced by strain. Some, such as retinal degeneration caused by the rd1 mutation (in cyclic GMP phosphodiesterase), occur in specific strains such as C3H and FVB. Investigators studying retinal development or disease would be advised to avoid these strains. Other conditions, such as ulcerative dermatitis, are common to many strains but occur with higher frequency in C57BL/6 mice. This condition is significant because it frequently necessitates euthanasia of affected mice and may be worsened by genetic intervention. Clinical examination should
precede behavioral testing, as a host of unrelated factors can cause profound artifactual deficits on behavioral tests. Strain-specific background pathology (e.g., callosal defects in BALB/c, 129, and other mice) may affect behavioral test results. C57BL/6 mice tend to be more active than other strains, and this relative hyperactivity is normal for the strain. Strain-specific anatomy and pathology are described and referenced in several excellent texts (Maronpot et al., 1999; Hof et al., 2000; Ward et al., 2000; Brayton et al., 2001). In addition, online resources such as the Mouse Phenome Database, the Inbred Strain Characteristics database, and the Mouse Tumor Biology Database provide searchable databases of strain-specific anatomy and pathology (Table 17.1). The last of these is complemented by a recent text on murine tumor classification (Mohr, 2001).
17.2.3 Environmental Phenomena
In addition to the combined effects of genetic background and the induced mutation, environmental factors may confound the phenotype. The most significant of these is the presence of subclinical infection within the colony.
17.2.3.1 Concurrent Infection
Experimental mouse colonies are subjected to a rigorous disease-monitoring program to prevent the spread of infectious agents. Consequently, most of the diseases that plagued mouse colonies 30 years ago have been eradicated. However, several infections persist in most colonies and are tolerated because they cause only subclinical disease and are difficult to detect or eradicate. Currently, these include Helicobacter sp., mouse parvovirus, and, to a lesser extent, mouse hepatitis virus. The significance of these infections is probably limited to investigators working on immunologic topics: although subclinical, they stimulate immune responses that could confound immunologic studies. In general, phenotypes that affect the immune system are the most likely to be confounded by a prevalent but subclinical infectious disease. Not infrequently, helicobacteriosis will present as clinical disease (rectal prolapse), particularly in animals prone to inflammatory bowel disease (Chin et al., 2000) as the result of ablation of components of their immune system. Pinworm outbreaks are a relatively frequent occurrence but can be controlled by quarantine and treatment. If animals have been produced at a research facility, information regarding the health status of the room in which they live will be available from the veterinary staff.
17.2.3.2 Epigenetic Phenomena
Environmental phenomena such as stress, food composition, and the light/dark cycle can have substantial effects on phenotype. Behavioral phenotypes, and phenotypes such as obesity that are influenced by behavior and feeding, are particularly susceptible (Crabbe et al., 1999; Tordoff et al., 1999).
TABLE 17.1. Online Resources for Mouse Phenotyping
Mouse Development Atlases
The House Mouse. Atlas of Embryonic Development by Karl Theiler: http://genex.hgu.mrc.ac.uk/Atlas/Theiler_book_download.html
Edinburgh Mouse Atlas Project: http://genex.hgu.mrc.ac.uk/intro.html
Caltech μMRI Atlas of Mouse Development: http://mouseatlas.caltech.edu/
3D embryo images at Duke Center for In Vivo Microscopy: http://www.civm.duhs.duke.edu/devatlas/index.html
Correlation of the Theiler system with embryonic age, size and morphologic features: http://genex.hgu.mrc.ac.uk/
Mouse Brain Atlases
Allen Brain Atlas: http://www.brain-map.org/
MBL: The Mouse Brain Library: http://www.mbl.org/
Mouse Atlas Project: http://map.loni.ucla.edu/
High resolution Mouse Brain Atlas: http://www.hms.harvard.edu/research/brain/
Electronic Prenatal Mouse Brain Atlas: http://www.epmba.org/
Whole Body Mouse Atlases
Three-dimensional atlas of the mouse: http://www.mrpath.com/previousvisiblemouse.html
Mouse anatomy: http://www.informatics.jax.org/cookbook/chapters/contents2.shtml
Mouse Gene Expression Atlases
Allen Brain Atlas: http://www.brain-map.org/
Mouse Atlas of Gene Expression: http://www.mouseatlas.org/data/mouse/
Gene Expression in the PN 7 Mouse Brain: http://www.geneatlas.org/gene/
Mouse Genomics and Phenotyping Resources
Mouse Genome Resources: http://www.ncbi.nlm.nih.gov/projects/genome/guide/mouse/
Mouse Genome Informatics: http://www.informatics.jax.org/
Mouse Phenome Database: http://phenome.jax.org/pub-cgi/phenome/mpdcgi?rtn=docs/home
Inbred Strain Characteristics: http://www.informatics.jax.org/external/festing/search_form.cgi
Mouse Tumor Biology Database: http://tumor.informatics.jax.org/mtbwi/index.do
European Mouse Phenotyping Resource of Standardized Screens: http://empress.har.mrc.ac.uk/
EuroPhenome: http://www.europhenome.org/
EUMORPHIA consortium: http://www.eumorphia.org/
Knockout Mouse Project: http://www.komp.org/
International Knockout Mouse Consortium: http://www.knockoutmouse.org/
European Conditional Mouse Mutagenesis Program: http://www.eucomm.org/
17.3 GENE-DRIVEN OR REVERSE GENETICS APPROACH TO MOUSE RESEARCH
The completion of the Mouse Genome Project (www.ncbi.nlm.nih.gov/projects/genome/guide/mouse) provided the impetus to explore the function of specific genes by introducing a mutation in a gene and then observing the phenotype in mutant progeny. This approach forms the basis of the hypothesis-driven research performed by the majority of individual research labs. Animals derived from targeted gene manipulations are analyzed according to the interests of the lab; the type of analysis ranges from general overall screening to highly specialized organ-specific techniques. The general approach to this type of analysis is described in this section.
17.3.1 Method of Genetic Manipulation and Phenotypic Variability
Understanding the technology used in the experiment is necessary to identify potential factors that may confound the phenotype. Detailed descriptions of methodologies used to create genetically altered animals can be found in Williams and Wagner (2000) and Adams and van der Weyden (2008). The following discussion addresses only the pitfalls associated with the more commonly used methods to generate transgenic and knockout mice. Creating a transgenic animal employs random chromosomal integration of foreign DNA after injection into fertilized oocytes. The resulting offspring are screened to identify those animals in which stable chromosomal integration of the foreign DNA has occurred. In addition to the target gene, the transgenic construct contains a transcriptional regulatory region that directs both the expression level and the tissue specificity of the inserted gene. Depending on the aim of the experiment, the target protein may be overexpressed (excessive amounts of normal protein expressed in tissues that normally express it) or ectopically expressed (a normal protein expressed in tissues that do not normally express it). Alternatively, the transgene may be modified to create a gain-of-function mutant (by which the protein is constitutively expressed) or a loss-of-function mutant (by which the protein interacts with its partners in a dominant negative fashion). The nature of the transgenic manipulation will determine the extent to which individual tissues are examined. Because of the random nature of transgene insertion after pronuclear injection, each resultant founder contains the transgene at a different site in the genome (Clark et al., 1994). This position effect can profoundly affect the expression of both the transgene and endogenous genes whose regulatory elements may be disrupted by the insertion event. Several factors may influence the resultant phenotype. The foreign DNA usually integrates as linear arrays, resulting in variable levels of gene dosage. The site of chromosomal integration may affect the regulatory function of the transcriptional element contained within the construct. These factors result in variable expression levels of the transgene in different founder lines. In addition, random integration of the transgene may disrupt endogenous genes (insertional mutagenesis)
thus further confounding the phenotype. Consequently, it is essential that lines from at least two different founders be examined before a conclusion relating a specific phenotype to transgene expression is made (Sigmund, 2000; Williams and Wagner, 2000). To assess dose-response relationships between transgene expression and phenotype, it is also important to assess lines of mice that express the transgene at different levels. The uncertainties of random integration may be circumvented by the more challenging technology used to create knockout mice. Using homologous recombination, the coding region of a specific endogenous gene can be interrupted to eliminate gene expression (knockout) or replaced with a modified variant of the gene (knockin). The foreign DNA is inserted into cultured ES cells, clones that carry the correct mutation are identified, and these clones are then injected into mouse blastocysts. If chimeric mice have integrated the foreign DNA into their germline, they can pass it along to their progeny to establish a colony of genetically altered animals. Although gene expression may be more precisely controlled with this method, it is possible to inadvertently destroy transcriptional control elements that regulate expression of a neighboring gene, thus creating varying phenotypes (Olson et al., 1996). More sophisticated methods of genetic manipulation are accompanied by their own particular pitfalls. These methods include Cre-lox technology to create conditional mutants and drug-regulated transgene expression (Adams and van der Weyden, 2008).
17.3.2 Determining the Phenotypic Effects of a Known or Novel Gene
In general, the effect of eliminating the gene (knockout model) is investigated first, followed by more sophisticated interventions such as introducing single nucleotide polymorphisms in an otherwise functional gene or inducing tissue-specific expression of a mutant gene. The investigator is required to assess the phenotypic impact of single gene alterations on complex molecular pathways. The effects of genetic background and the variability inherent in the gene construct used to create the animals frequently confound this assessment. Finally, findings must be integrated with published information to draw conclusions and design new experiments.
17.3.2.1 Preparatory Steps
When initiating a phenotypic examination, it is important to collect as much information about the experiment as possible. This includes (1) the aim and design of the experiment, (2) the known physiology of the target gene and the methods used to manipulate its expression, and (3) potential sources of phenotypic variability. Last, but most important, the induced genetic defect should be characterized. At a minimum, the genomic lesion should be defined, and expression of the transcript and cognate protein assessed in mutant animals. This will ensure that the target gene has been successfully deleted or otherwise altered before a phenotyping effort is begun.
17.3.2.2 Assessing the Live Animal
Simple observations and good record-keeping are the fundamentals of phenotypic assessment of a new mutant line. The initial step in any phenotyping project is to determine whether the altered gene is inherited at the expected Mendelian frequency in live progeny. Following birth of progeny, the investigator notes whether neonatal deaths occur in the litter. At some point between birth and weaning, pups are genotyped, and the Mendelian spread of wild type, heterozygous, and homozygous mutant animals is assessed. If fewer live homozygous mutant animals are detected than expected, this is generally the first indication that the altered gene may cause early neonatal or embryonal death. In this case, the approach taken is described in Section 17.3.3. If a normal Mendelian spread is identified, the investigator typically has a selection of mutant, heterozygous, and wild type littermates from each mating. These should be examined clinically and weighed at regular intervals, e.g., in the early neonatal period (P5), at weaning, and at sexual maturity (6–8 weeks). To assess reproductive performance, mutant mice should be mated with one another and with wild type mice to determine whether reproduction of each gender is normal. This provides a broad but sensitive test of normal physiology in a multitude of organ systems. Not uncommonly, no clinical abnormalities are noted at all. In this case, baseline clinical and anatomic pathology phenotyping can be performed in young (8–12 weeks) and older (12–15 months) adult animals. In general, 6–10 animals of each genotype and age are used for terminal assessment. This typically comprises clinical chemistry, hematology, and a relatively detailed panel of histologic tissues. Detailed descriptions of hematologic and morphologic evaluation of mice are given in Brayton et al. (2001) and Car and Eng (2001).
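One way to judge whether genotype counts depart from the expected Mendelian spread is a chi-square goodness-of-fit test. The sketch below is illustrative rather than a prescribed protocol: it assumes a heterozygote × heterozygote intercross (expected 1:2:1 ratio), the function name and the example counts are hypothetical, and the p-value uses the closed form exp(−χ²/2) that holds for 2 degrees of freedom.

```python
import math


def mendelian_chi_square(wild_type: int, heterozygous: int, homozygous: int):
    """Pearson chi-square goodness-of-fit test against a 1:2:1 ratio.

    Returns (chi_square, p_value). With three genotype classes there are
    2 degrees of freedom, for which the chi-square survival function has
    the closed form exp(-x / 2), so no statistics library is needed.
    """
    observed = (wild_type, heterozygous, homozygous)
    total = sum(observed)
    if total == 0:
        raise ValueError("at least one genotyped pup is required")
    expected = (0.25 * total, 0.5 * total, 0.25 * total)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-chi_square / 2.0)
    return chi_square, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts pooled from several het x het litters.
    chi2, p = mendelian_chi_square(wild_type=30, heterozygous=58, homozygous=12)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A small p-value driven by a deficit of homozygous mutants would point toward the embryonal or neonatal lethality work-up described in Section 17.3.3. Counts from small litters should be pooled before testing, since the approximation is unreliable when expected counts are low.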
17.3.2.3 Mice with No Phenotype
Failure to identify a clear phenotype following genetic manipulation is not uncommon. In these cases, the steps outlined above represent the minimum that is typically required to complete a publishable study. The reasons a phenotype fails to emerge range from failure of the genetic alteration to create a corresponding abnormality in the protein to the activation of compensatory mechanisms (Susulic et al., 1995; Cummings et al., 1996). Such compensation is typically identified by altered or increased expression of related genes in the presence of a relatively normal phenotype.
17.3.2.4 Mice with a Clinical Phenotype
If it is relatively obvious, a phenotype may be identified during initial observation and provide a direction for further analysis. Alternatively, the phenotype may be subtle and revealed only when the investigator employs a series of tests designed to reveal a phenotype of interest. Numerous protocols for the antemortem physiologic assessment of mutant mice exist. These are succinctly reviewed in Rao and Verkman (2000).
Figure 17.3. Horizontal activity over 3 days in wild type (WT) and mutant mice kept on a 12-h light:dark cycle (y-axis: horizontal activity as XT beam breaks, 0 to 900; x-axis: time). Data are collected in a metabolic cage where horizontal movement by the mouse elicits a beam break (XT). Activity is recorded over several light/dark cycles. During the day (white segments), mutant mice are more active than WT mice.
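Activity records of the kind shown in Figure 17.3 are often summarized separately for the light and dark phases. The following is a minimal sketch of that summary, not the analysis used for the figure: the lights-on window (07:00 to 19:00), the function names, and the example readings are all assumptions, since the source states only that a 12-h light:dark cycle was used.

```python
from datetime import datetime
from statistics import mean

# Assumed lights-on window; the source states only a 12-h light:dark cycle.
LIGHTS_ON_HOUR = 7
LIGHTS_OFF_HOUR = 19


def phase_of(timestamp: datetime) -> str:
    """Classify a timestamp as falling in the 'light' or 'dark' phase."""
    return "light" if LIGHTS_ON_HOUR <= timestamp.hour < LIGHTS_OFF_HOUR else "dark"


def mean_activity_by_phase(records):
    """Average beam-break count per recording interval for each phase.

    `records` is an iterable of (datetime, beam_break_count) pairs.
    Phases with no readings are omitted from the result.
    """
    counts = {"light": [], "dark": []}
    for timestamp, beam_breaks in records:
        counts[phase_of(timestamp)].append(beam_breaks)
    return {phase: mean(values) for phase, values in counts.items() if values}


if __name__ == "__main__":
    # Hypothetical readings for one animal, not data from Figure 17.3.
    example = [
        (datetime(2011, 1, 12, 13, 3), 120),
        (datetime(2011, 1, 12, 14, 43), 80),
        (datetime(2011, 1, 12, 21, 23), 600),
        (datetime(2011, 1, 13, 0, 43), 400),
    ]
    print(mean_activity_by_phase(example))
```

Computing the per-phase means for each animal first, and only then comparing genotypes, keeps the animal rather than the individual reading as the unit of analysis.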
For progressive conditions, animals in early, middle, and late stages of the condition should be chosen for histologic analysis. The final number of animals assessed in each study is unique and is determined by the variability of the data. This is, in turn, affected by the subtlety of the phenotype, the variability inherent in the tests used to assess physiology, and additional factors such as background strain. General testing protocols for the majority of body systems have been established over the last few years by the EUMORPHIA consortium (www.eumorphia.org), which developed a number of standardized screens known as EMPReSS (European Mouse Phenotyping Resource of Standardized Screens). These are available online from the EMPReSS website (http://empress.har.mrc.ac.uk) and provide a good overall starting point for more in-depth screening of live mice. Following characterization of live animals, age-matched animals of mutant and wild type genotypes are sacrificed and characterized as described above. Frequently, this step is combined with collection of tissues for molecular analysis. Often, the mouse phenotype represents a relatively minor portion of a publication and is located between the description of the genomic intervention and the molecular work describing the mechanism that constitutes the essence of the paper.
17.3.3 Embryonal Lethal Phenotypes
Identifying the time of fetal death requires euthanasia of pregnant dams at successive stages of pregnancy to determine the time at which embryos are lost. Good reviews detailing the evaluation of embryonic death and perinatal mortality can be found in Brayton et al. (2001) and Ward et al. (2000).
17.3.3.1 Collection of Embryos at Specific Developmental Stages
Matings between fertile males and spontaneously cycling females are usually set up in the late afternoon or early evening. Females in proestrus can be
selected by vaginal inspection (Champlin et al., 1973). Approximately half of the females selected this way will mate that night. Consequently, a relatively large number of matings need to be set up to obtain the required number of timed pregnant females. Alternatively, females can be superovulated using intraperitoneal pregnant mare's serum gonadotropin (PMSG; typical dose 5 IU) followed 48 h later by human chorionic gonadotropin (hCG; typical dose 5 IU). Ovulation occurs approximately 12 h later. Depending on the dose administered, ranging from 2.5 IU (physiologic) to 10 IU (high), large numbers of embryos may implant and result in artifactual changes from overcrowding (Kaufman, 2000). Observation of a vaginal plug is required to accurately determine the developmental stage of embryos. In mice kept in a standard 12 h light/dark cycle, it is assumed that mating occurs at the mid-dark point, at approximately 2 a.m. If a vaginal plug is identified the next morning, embryos are assumed to be E0.5 (embryonic day 0.5) or 0.5 dpc (days post coitum) old (Hogan et al., 1994). Implantation usually occurs at E4.5 and the duration of pregnancy is 19.5–21 days. Before implantation, embryos may be retrieved by flushing the oviduct and uterus with phosphate buffered saline. Between E4.5 and E8.5, it is best to isolate the embryo within its intact decidual swelling to avoid damaging it. After E8.5, the embryo can be dissected from the uterus and its yolk sac. It can be retained within the amnion, but considerable care should be taken to avoid damage.
17.3.3.2 Fixation, Embedding, and Orientation
Following collection of tissue for genotyping, embryos may be fixed in Bouin's solution, 10% formalin, or 4% paraformaldehyde. Particularly with Bouin's solution, the tissue will become brittle if left in fixative for too long. Embryos with a crown–rump length of 2 mm require only 1 h in fixative, while those with a crown–rump length of ∼15 mm can be placed in fixative for up to 24 h.
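Returning briefly to the timed-mating arithmetic of Section 17.3.3.1, the following Python sketch illustrates two planning calculations implied there: how many pairings to set up when roughly half of the selected females mate, and the nominal embryonic age at collection when the plug morning is counted as E0.5. The function names, the simple binomial model, and the 90% assurance level are assumptions made for illustration; they are not taken from the cited protocols.

```python
import math
from datetime import datetime


def matings_needed(n_required, p_mate=0.5, assurance=0.9):
    """Smallest number of pairings that yields at least `n_required`
    plugged females with probability >= `assurance`.

    Assumes each pairing succeeds independently with probability
    `p_mate` (0 < p_mate <= 1); the 0.5 default reflects the
    "approximately half" figure quoted in the text.
    """
    n = n_required
    while True:
        # P(X >= n_required) for X ~ Binomial(n, p_mate)
        p_enough = sum(
            math.comb(n, k) * p_mate**k * (1 - p_mate) ** (n - k)
            for k in range(n_required, n + 1)
        )
        if p_enough >= assurance:
            return n
        n += 1


def embryonic_day(plug_found, collected):
    """Nominal embryonic age at collection, counting the morning on
    which the vaginal plug was found as E0.5."""
    elapsed_days = (collected - plug_found).total_seconds() / 86400
    return 0.5 + elapsed_days


if __name__ == "__main__":
    print(matings_needed(6))
    print(embryonic_day(datetime(2011, 1, 10, 8), datetime(2011, 1, 19, 8)))
```

Under these assumptions, six timed pregnant females call for 17 pairings, and a dam collected exactly nine days after the plug morning carries embryos of nominal age E9.5. Littermates can still differ in developmental maturity, so the computed age is a starting point for staging rather than a substitute for it.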
After removal from fixative, embryos may be placed in 70% ethanol for long-term storage at room temperature. Before embedding, the embryos are dehydrated through graded stages of alcohol and then placed in a 1:1 mixture of 100% ethanol and benzene (see Kaufman [2000] for detailed procedures). The addition of a few drops of eosin at the 90% ethanol stage will stain the embryo pink and facilitate its visualization during embedding. Embryos older than E8.5 can be oriented relatively easily, as the head and tail are readily visualized and the embryos tend to fall on their sides in the wax block. Younger embryos within their decidual swellings can be sectioned in the transverse plane by using the decidual swelling to orient the embryo. A large number of specimens may be required to obtain useful sections of embryos younger than E8.5. Transverse sections are generally taken through the majority of the embryo and provide the most morphologic information.
17.3.3.3 Histologic Interpretation and Staging
The most commonly used staging system is that of Theiler (1972, 1989). This system has been
adopted by recent standard texts on mouse embryology (Kaufman, 1994; Kaufman and Bard, 1999). A table correlating the Theiler system with embryonic age, size, and morphologic features can be found online (http://genex.hgu.mrc.ac.uk). The Atlas of Mouse Development (Kaufman, 1994) provides the most comprehensive illustration of each of the Theiler stages. Each Theiler stage up to about E11.5 (Theiler stage 20) lasts for about 12 h. Because tissues develop so rapidly at these stages, precise identification of embryo age may be difficult. Aging of embryos is easier after E12, when each Theiler stage encompasses about 24 h. Aging can also be done by examining the sequence of long bone ossification in whole embryos or tissue sections. This method is best used after E15.5 (Patton and Kaufman, 1995), when ossification centers are present. The pathologist should be aware of intrinsic variations in normal embryonal development. Within the same litter, developmental maturity can vary by 6–12 h. Hematoxylin and eosin staining is sufficient for initial screening. Further analyses frequently make use of the spectrum of techniques traditionally used in light microscopy, for example, special stains, histochemistry (Kaufman and Schnebelen, 1986), immunohistochemistry, and in situ hybridization (Durrant, 1996; Kadkol et al., 1999).
17.3.3.4 Newer Techniques to Assess Embryonal Phenotypes
Detailed two- or three-dimensional visualization of embryonal anatomy can be achieved using a variety of imaging techniques. Volumetric X-ray computed tomography of osmium tetroxide-stained tissues is able to generate virtual histologic images (Johnson et al., 2006). Three-dimensional images can be obtained by magnetic resonance imaging (Petiet et al., 2008). An atlas of normal embryos obtained using this method can be accessed online at the Duke Center for In Vivo Microscopy (CIVM; Table 17.1). These and other in vivo techniques are reviewed by Kulandavelu et al. (2006).
17.4 PHENOTYPE-DRIVEN OR FORWARD GENETICS APPROACH TO MOUSE RESEARCH
In the forward genetics approach, genetic lesions are introduced randomly throughout the genome, followed by observation for a phenotype in progeny. Once the phenotype can be established through successive generations, the causative gene is mapped and identified. Many of the animal models with which we have been familiar for many decades resulted from the forward genetics approach applied to spontaneous mutations that arose in inbred colonies of mice and larger animals. The advantage of this approach is that novel mutants derived from alteration of as yet uncharacterized genes can be developed. Also, the induced mutations tend to be point mutations that can result in subtle or dominant negative phenotypes (Rajan and Kopito, 2005). Different point mutations in the same gene can generate an allelic series, thus more accurately reflecting the disease phenotypes arising from mutations in the
corresponding human disease gene (Bendotti and Carrì, 2004). Last, point mutations and knockouts may have very different phenotypes (Signorini et al., 1997; Jensen et al., 1999).
17.4.1 Spontaneous Mutations
Modern inbred mouse strains were created by progressive inbreeding and are defined by over 20 generations of brother–sister matings. It is surprising that this endeavor generated a body of inherited diseases in mice that have since become established models for their corresponding human diseases. Spontaneous mutants still arise in mouse colonies, and in facilities equipped to pursue these, they form the basis of new spontaneous models of disease. While some transgenic models have been created in larger animals, most commonly pigs, spontaneous inherited diseases in larger animal models remain the most common means by which these models are developed to complement murine and human studies.
17.4.2 Genomewide Mutagenesis
Random genomewide mutagenesis is typically employed by facilities that generate large numbers of novel mutant mice. For example, the aim of the Knockout Mouse Project (now the International Knockout Mouse Consortium) is to mutate all protein-coding genes in the mouse using a combination of gene trapping and gene targeting in C57BL/6 mouse embryonic stem (ES) cells (www.knockoutmouse.org). This approach is also used by individual investigators, most commonly to identify novel genes that modify a known phenotype. Techniques of mutagenesis can be broadly divided into those that facilitate subsequent identification of the mutagenized gene using a genetic tag (such as gene trapping or transposon-based mutagenesis) and those that do not (such as ENU mutagenesis). In all cases, backcrossing is needed to identify recessive phenotypes.
17.4.2.1 ENU Mutagenesis
Because of its ability to induce single base pair mutations in any gene, N-ethyl-N-nitrosourea (ENU) has become a standard mutagen for the phenotype-driven approach in the mouse (Barbaric et al., 2007). Mutagenesis is achieved in male founders by single or multiple treatments with ENU. These males (G0) are bred to normal females, and G1 progeny can be screened for dominant mutations. To obtain recessive mutants, G1 males (each carrying a unique set of mutations) are bred to normal females, and the female G2 progeny of this mating are bred back to their G1 father. The G3 progeny of these matings are screened for phenotypes, a higher proportion of which can now be expected to be recessive (Vitaterna et al., 2006). ENU-induced mutations are most commonly A:T to T:A transversions or A:T to G:C transitions that cause missense mutations; genes with larger coding sequences are most commonly affected (Barbaric et al., 2007). Because ENU mutagenesis does not
introduce a selectable marker, there is no direct means to identify the mutated gene; this must be established by positional cloning.
17.4.2.2 Gene Trapping
Gene trapping is a method of random mutagenesis in which insertion of a synthetic DNA element into endogenous genes results in their transcriptional disruption (Brennan and Skarnes, 2008). A gene trap construct consists of a splice acceptor, a selectable marker gene, and a polyadenylation signal, placed within a retroviral genome. Retroviral particles are used to infect the ES cell line. When insertions occur within transcriptionally active regions, the marker is transcribed and expressed, allowing selection of positive clones. The insertion also disrupts the endogenous transcript and its associated function. In addition, the selectable marker can be used as a tag to identify the insertion location and the disrupted gene.
17.4.2.3 Transposons
DNA transposons are genetic elements consisting of inverted terminal DNA repeats (TRs), which in their naturally occurring configuration flank a transposase coding sequence (CDS). This transposase follows a cut-and-paste mechanism to excise the transposon from its original genomic location and insert it into a new locus (Adams and van der Weyden, 2008; Ivics et al., 2009; Largaespada, 2009). These genetic elements are responsible for both ancient and new phenotypes in mice. Approximately 10% of the mouse genome is made up of endogenous retrovirus (ERV) sequences, which represent the remains of ancient germ line infections by transposable elements. These interrupt gene function at specific loci and are associated with strain-specific cancers, predominantly mammary and lymphoid tumors (Stocking and Kozak, 2008). Recently, two classes of transposons have been engineered to induce random genomewide mutagenesis.
Both the Tc1-like DNA transposon known as Sleeping Beauty and the insect-derived transposon piggyBac have been optimized to induce random mutagenesis in mouse cells. In addition to the transposase, the inverted terminal DNA repeats enclose a selectable marker to aid subsequent identification of the insertion site (Ivics et al., 2009; Largaespada, 2009).
17.4.3 High-Throughput Phenotyping
Several facilities exist, primarily in Europe, that perform high-throughput, standardized phenotyping studies on mice. These are best suited to characterizing mice that are created by random mutagenesis and thus employ a non-hypothesis-driven approach. The strength of this approach lies in its uniformity, as well as the breadth of data collected on each mouse. The majority of tests are performed on live mice; consequently, pathology data constitute only a fraction of the entire dataset. Currently, a multinational consortium to knock out all genes in the mouse genome has been created by EUCOMM in Europe, NorCOMM in Canada, and KOMP in the United States. EMPReSS is a
database of SOPs (Mallon et al., 2008), developed by a now defunct European consortium known as EUMORPHIA (www.eumorphia.org). A truncated version of these SOPs, known as EMPReSS Slim, is currently in use in four major phenotyping facilities in France, Germany, and the UK. Using these protocols, normal baseline data from several inbred mouse strains have already been established and are available at the Europhenome site (www.europhenome.org). Current and future data from high-throughput phenotyping of mutant strains will be deposited in the Europhenome database (Morgan et al., 2009).
17.5 INFORMATION RESOURCES
After completion of the initial phenotyping panel, the data should be assessed in the light of the experimental design. This requires integration of current knowledge of the cellular process in which the target gene is involved, and comparison with described related mouse phenotypes. Collective analysis of numerous mutant mouse studies has generated a massive body of information that is still undergoing organization. Currently, no comprehensive resource exists that correlates structure and function of genes to their cognate cellular pathways and mutant phenotypes. However, extensive data exist for each of these disciplines independently (Hancock and Mallon, 2007), so it falls to the pathologist and investigator to integrate them. In addition to resources listed in this chapter, Table 17.1 provides a list of the most comprehensive resources.
17.6 QUESTIONS AND ANSWERS
Q1. You have generated a knockout line and notice that after genotyping, most litters contain the following distribution of pups at weaning: homozygous mutant 10%, heterozygous 55%, homozygous wild-type 35%. What could be happening and how do you investigate this?
Q2. You have generated a knockout line on a mixed C57BL/6/129 background.
You are testing your line for elevated blood pressure (BP) and notice that the data are so variable that you cannot establish significant differences in BP between genotypes. What could be happening and what could you do?
Q3. You have generated a knockout line and fail to identify a phenotype. What is the minimum amount of data you have to show to claim that your mice have "no phenotype"?
Q4. You have characterized a cancer phenotype in mice carrying a dominant negative allele of the p53 gene. You wish to identify modifier genes that either worsen or improve the phenotype. What approaches can you take?
A1. These data indicate that most homozygous mutant pups are dying in utero. Because some do survive, it also suggests that there is variable penetrance of the phenotype. To investigate this, you need to do timed matings and sacrifice the mother at various stages of pregnancy, working backward from E19/20. Genotype all pups in the uterus until you find a normal Mendelian spread. This will tell you on which day pups are lost in utero. Phenotype mutant and control pups and placenta 1–2 days before the mutants are lost to identify the cause. You are likely to see a spectrum of severity in pups that die in utero and those that survive.
A2. You are working with a phenotype (blood pressure) that is variable to begin with. The additional variability created by a mixed genetic background is probably masking any differences between mutant and control animals. Backcross your mice for 6–10 generations to C57BL/6J mice. Also develop a standardized technique of BP measurement that you rigorously apply to all mice, and make sure mutant and control animals are age and sex matched.
A3. First, ensure that the genetic defect is characterized and that you demonstrate loss or alteration of the transcript and protein. Then ensure that there is a normal Mendelian spread of the mutant, heterozygous, and wild type genotypes in progeny and that these mice have comparable growth curves and morphology until sexual maturity. Determine whether mutant mice breed normally and produce normal offspring carrying the mutant allele. Age some mice to 2 years to assess whether they develop age-related phenotypes. Sacrifice age-matched cohorts of 4–6 male and female mice of all genotypes at 6–12 weeks and around 1 year and perform clinical chemistry, hematology, and histology.
A4. You could induce germline genomewide point mutations in a male p53 mutant mouse using ENU mutagenesis. Breeding this male to a female p53 mutant would deliver dominant phenotypes in G1.
To assess recessive phenotypes, G1 males should be backcrossed to p53 females. Female G2 mice are bred back to their father and phenotypes assessed in progeny. Each G1 male will be used to develop one line of mice. This approach will reveal point mutations and subtle interactions in interacting genes, but requires positional cloning to identify the new gene. Using gene trapping or transposon-mediated genomewide disruption has the advantage of introducing a selectable marker allowing rapid identification of the new locus.
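Two of the calculations implied by these answers can be sketched in Python. The first tests observed genotype counts against the 1:2:1 ratio expected from a heterozygote × heterozygote cross, which is an assumption about the breeding scheme in Q1. The second gives the chance that a single G2 daughter × G1 father litter contains at least one homozygote for a given new recessive mutation, assuming full viability and simple Mendelian transmission. Function names and example counts are hypothetical.

```python
import math


def mendelian_chi_square(n_mut, n_het, n_wt):
    """Chi-square goodness of fit of genotype counts to a 1:2:1 ratio.

    Returns (chi2, p_value). With three classes there are 2 degrees of
    freedom, for which the chi-square tail probability is exactly
    exp(-chi2 / 2), so no statistics library is needed.
    """
    total = n_mut + n_het + n_wt
    observed = (n_mut, n_het, n_wt)
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.exp(-chi2 / 2)


def p_homozygote_in_litter(n_g3_pups, p_daughter_carrier=0.5):
    """Probability that one G2 daughter x G1 father litter contains at
    least one pup homozygous for a given mutation carried by the father.

    A G2 daughter inherits the mutation with probability 1/2; if she is
    a carrier, each G3 pup is homozygous with probability 1/4.
    """
    return p_daughter_carrier * (1 - 0.75 ** n_g3_pups)


if __name__ == "__main__":
    # Hypothetical counts matching the percentages in Q1 for 100 pups.
    chi2, p = mendelian_chi_square(10, 55, 35)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
    print(f"P(homozygote in a litter of 8) = {p_homozygote_in_litter(8):.2f}")
```

For 100 weaned pups distributed 10/55/35, the statistic is 13.5 and the p-value is about 0.001, so the deficit of homozygous mutants is unlikely to be chance. The second function shows why several G2 daughters per G1 male are usually bred: a single litter of eight has under a 50% chance of revealing a given recessive mutation.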
17.7 REFERENCES
Adams DJ, van der Weyden L. (2008). Contemporary approaches for modifying the mouse genome. Physiol Genomics 34(3):225–38.
Austin CP, Battey JF, Bradley A, Bucan M, Capecchi M, Collins FS, Dove WF, Duyk G, Dymecki S, Eppig JT, et al. (2004). The knockout mouse project. Nat Genet 36:921–24.
Barbaric I, Wells S, Russ A, Dear TN. (2007). Spectrum of ENU-induced mutations in phenotype-driven and gene-driven screens in the mouse. Environ Mol Mutagen 48(2):124–42.
Bendotti C, Carrì MT. (2004). Lessons from models of SOD1-linked familial ALS. Trends Mol Med 10(8):393–400.
Brayton C, Justice M, Montgomery CA. (2001). Evaluating mutant mice: anatomic pathology. Vet Pathol 38:1–19.
Brennan J, Skarnes WC. (2008). Gene trapping in mouse embryonic stem cells. Methods Mol Biol 461:133–48.
Car BD, Eng VM. (2001). Special considerations in the evaluation of the hematology and hemostasis of mutant mice. Vet Pathol 38:20–30.
Champlin AK, Dorr DL, Gates AH. (1973). Determining the stage of the estrus cycle in the mouse by the appearance of the vagina. Biol Reprod 8:491–94.
Chin EY, Dangler CA, Fox JG, Schauer DB. (2000). Helicobacter hepaticus infection triggers inflammatory bowel disease in T cell receptor alpha-beta mutant mice. Comp Med 50:586–94.
Clark AJ, Bissinger P, Bullock DW, Damak S, Wallace R, Whitelaw CB, Yull F. (1994). Chromosomal position effects and the modulation of transgene expression. Reprod Fertil Dev 6:589–98.
Copp AJ, Cockcroft DL. (1990). Postimplantation Mammalian Embryos. A Practical Approach. IRL Press, Oxford.
Crabbe JC, Wahlsten D, Dudek BC. (1999). Genetics of mouse behavior: interactions with laboratory environment. Science 284:1670–72.
Cummings DE, Brandon EP, Planas JV, Motamed K, Idzerda RL, McKnight GS. (1996). Genetically lean mice result from targeted disruption of the RII beta subunit of protein kinase A. Nature 382:622–26.
Durrant I. (1996). Nonradioactive in situ hybridization for cells and tissues. Methods Mol Biol 58:155–67.
Goldblatt EM, Lee WH. (2010). From bench to bedside: the growing use of translational research in cancer medicine. Am J Transl Res 2(1):1–18.
Hancock JM, Mallon AM. (2007). Phenobabelomics: mouse phenotype data resources. Brief Funct Genomic Proteomic 6(4):292–301.
Hof PR, Young WG, Bloom F. (2000). Comparative Cytoarchitectonic Atlas of the C57BL/6 and 129/SV: Mouse Brains. Elsevier Science, New York.
Hogan B, Beddington R, Costantini F, Lacy E. (1994). Manipulating the Mouse Embryo. A Laboratory Manual. 2nd ed. Cold Spring Harbor Laboratory, New York.
Ivics Z, Li MA, Mátés L, Boeke JD, Nagy A, Bradley A, Izsvák Z. (2009). Transposon-mediated genome manipulation in vertebrates. Nat Methods 6(6):415–22.
Jensen P, Surmeier DJ, Goldowitz D. (1999). Rescue of cerebellar granule cells from death in weaver NR1 double mutants. J Neurosci 19(18):7991–98.
Johnson JT, Hansen MS, Wu I, Healy LJ, Johnson CR, Jones GM, Capecchi MR, Keller C. (2006). Virtual histology of transgenic mouse embryos for high-throughput phenotyping. PLoS Genet 2(4):e61.
Kadkol SS, Gage WR, Pasternack GR. (1999). In situ hybridization: theory and practice. Mol Diagn 4:169–83.
Kaufman MH, Schnebelen MT. (1986). The histochemical identification of primordial germ cells in diploid parthenogenetic mouse embryos. J Exp Zool 238:103–11.
Kaufman MH. (1994). The Atlas of Mouse Development. Academic Press, London.
Kaufman MH. (2000). Gestational Mortality in Genetically Engineered Mice. In Pathology of Genetically Engineered Mice. Ward JM, Mahler JF, Maronpot RR, Sundberg JP (eds.). Iowa State University Press, Ames, pp. 63–88; 103–122.
Kaufman MH, Bard JB. (1999). The Anatomical Basis of Mouse Development. Academic Press, San Diego, CA.
Kulandavelu S, Qu D, Sunn N, Mu J, Rennie MY, Whiteley KJ, Walls JR, Bock NA, Sun JC, Covelli A, Sled JG, Adamson SL. (2006). Embryonic and neonatal phenotyping of genetically engineered mice. ILAR J 47(2):103–17.
Largaespada DA. (2009). Transposon mutagenesis in mice. Methods Mol Biol 530:379–90.
Maronpot RR, Boorman GA, Gaul BW. (1999). Pathology of the Mouse: Reference and Atlas. Cache River Press, Vienna, IL.
Mallon AM, Blake A, Hancock JM. (2008). EuroPhenome and EMPReSS: online mouse phenotyping resource. Nucl Acids Res 36:D715–8.
Mohr U. (2001). International Classification of Rodent Tumours: The Mouse. Springer Verlag, Heidelberg, Germany.
Morgan H, Beck T, Blake A, Gates H, Adams N, Debouzy G, Leblanc S, Lengger C, Maier H, Melvin D, Meziane H, Richardson D, Wells S, White J, Wood J, de Angelis MH, Brown SD, Hancock JM, Mallon AM. (2009). EuroPhenome: a repository for high-throughput mouse phenotyping data. Nucl Acids Res 38(Database issue):D577–85.
Olson EN, Arnold HH, Rigby PW, Wold BJ. (1996). Know your neighbors: three phenotypes in null mutants of the myogenic bHLH gene MRF4. Cell 85:1–4.
Patton JT, Kaufman MH. (1995). The timing of ossification of the limb bones, and growth rates of various long bones of the fore and hind limbs of the prenatal and early postnatal laboratory mouse. J Anat 186(pt 1):175–85.
Petiet AE, Kaufman MH, Goddeeris MM, Brandenburg J, Elmore SA, Johnson GA. (2008). High-resolution magnetic resonance histology of the embryonic and neonatal mouse: a 4D atlas and morphologic database. Proc Natl Acad Sci U S A 105(34):12331–36.
Pettitt SJ, Liang Q, Rairdan XY, Moran JL, Prosser HM, Beier DR, Lloyd KC, Bradley A, Skarnes WC. (2009). Agouti C57BL/6N embryonic stem cells for mouse genetic resources. Nat Methods 6(7):493–95.
Rajan RS, Kopito RR. (2005). Suppression of wild-type rhodopsin maturation by mutants linked to autosomal dominant retinitis pigmentosa. J Biol Chem 280(2):1284–91.
Rao S, Verkman AS. (2000). Analysis of organ physiology in transgenic mice. Am J Physiol Cell Physiol 279(1):C11–C18.
Signorini S, Liao YJ, Duncan SA, Jan LY, Stoffel M. (1997). Normal cerebellar development but susceptibility to seizures in mice lacking G protein-coupled, inwardly rectifying K+ channel GIRK2. Proc Natl Acad Sci U S A 94(3):923–27.
Sigmund CD. (2000). Viewpoint: are studies in genetically altered mice out of control? Arterioscler Thromb Vasc Biol 20:1425–29.
Simpson EM, Linder CC, Sargent EE, Davisson MT, Mobraaten LE, Sharp JJ. (1997). Genetic variation among 129 substrains and its importance for targeted mutagenesis in mice. Nat Genet 16(1):19–27.
Susulic VS, Frederich RC, Lawitts J, Tozzo E, Kahn BB, et al. (1995). Targeted disruption of the beta 3-adrenergic receptor gene. J Biol Chem 270:29483–92.
Stocking C, Kozak CA. (2008). Murine endogenous retroviruses. Cell Mol Life Sci 65(21):3383–98.
Theiler K. (1972). The House Mouse: Development and Normal Stages from Fertilization to 4 Weeks of Age. Springer Verlag, Berlin.
Theiler K. (1989). The House Mouse: Atlas of Embryonic Development. Springer Verlag, New York.
Tordoff MG, Bachmanov AA, Friedman MI, Beauchamp GK. (1999). Testing the genetics of behavior in mice. Science 285:2069.
Vitaterna MH, Pinto LH, Takahashi JS. (2006). Large-scale mutagenesis and phenotypic screens for the nervous system and behavior in mice. Trends Neurosci 29(4):233–40.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915):520–62.
Ward JM, Mahler JF, Maronpot RR, Sundberg JP. (2000). Pathology of Genetically Engineered Mice. Iowa State University Press, Ames.
Woolf SH. (2008). The meaning of translational research and why it matters. JAMA 299:211–13.
Williams RS, Wagner PD. (2000). Transgenic animals in integrative biology: approaches and interpretations of outcome. J Appl Physiol 88:1119–26.
CHAPTER 18
Confirmation of Single Nucleotide Mutations
JOCHEN GRAW
Contents
18.1 Introduction: Why Single Nucleotide Mutations Are Difficult to Confirm
18.2 Initial Confirmation by Co-Segregation in the Family
18.3 Second Confirmation by Population Screening
18.3.1 Recurrent Mutation or Founder Effect?
18.4 Third Confirmation by Expression Analysis and Functional Studies in Model Systems
18.5 Recapitulation of Human Mutations in Animal Models
18.6 Conclusions and Outlook
18.7 Acknowledgments
18.8 Questions and Answers
18.9 References
18.1 INTRODUCTION: WHY SINGLE NUCLEOTIDE MUTATIONS ARE DIFFICULT TO CONFIRM
The human genome, like any other genome, is dynamic and undergoes a broad variety of changes. These changes may alter the biological function of a given sequence, but in many cases the result is just a polymorphic site without any (actual) consequence. It is a game of evolution, and only in the context of other modifications or under new environmental conditions might such a change have a positive or negative effect for the organism. One of the simplest modifications seems to be the exchange of one single nucleotide. If such point mutations occur within a coding sequence, they may alter the encoded amino
acid (missense mutation). The single-base-pair mutation may also create a stop codon (nonsense mutation), leading to premature termination of translation. In such cases, instability of the corresponding mRNA is discussed, leading finally to its nonsense-mediated decay. Biochemically, mutations like this can be identified by western blotting, either by a band of lower molecular weight than the wild type protein or by the absence of the protein (in the case of nonsense-mediated decay). In a few cases, a mutation can also change a stop codon into an amino acid-coding codon, extending the length of a given protein. Frequently, however, the mutation affects the third base of a triplet and is predicted to have no effect on the protein sequence (silent mutation). Even then, one should consider the fact that the different tRNAs carrying the same amino acid do not have the same concentration within a cell. Therefore, the actual amount of a translated protein might depend on the available amount of tRNA; in other words, the tRNA might be a rate-limiting factor during protein synthesis. Point mutations also occur rather frequently outside the coding region of a gene, namely in its promoter, in its 5′- or 3′-untranslated regions, in an intron, or even in intergenic regions, where they may affect enhancers, chromosomal domains, or functionally less defined regions. A rather new but emerging field concerns the small regulatory RNAs, in which point mutations may also occur. These latter points illustrate the difficulty in confirming single nucleotide mutations: what has to be confirmed is the biological function, against the background noise of single nucleotide polymorphisms (SNPs), whose frequency in a given population can range from very rare to highly frequent.
If a missense or nonsense mutation occurs in a coding region, the functional outcome can be deduced from its consequences for the amino acid sequence and its effects on charge, structure, or possible posttranslational modification sites. In a promoter, a single nucleotide mutation can be tested in a reporter gene assay to quantify its effect on gene expression compared with the wild type sequence. A mutation in an intron may affect splicing, which can be confirmed by analysis of cDNA in cases where the mRNA (for making cDNA) is easily accessible. In all other cases, however, the functional analysis is difficult, which makes it rather complicated to establish a causative relationship to the genetic disorder of interest. An overview of the entire strategy to confirm single point mutations as disease-causing mutations is given in Figure 18.1.
Point mutations can be caused by errors during DNA replication or DNA repair, or can be induced by chemicals or radiation (Nomura, 2008; Sankaranarayanan, 2006). In model systems like mice, alkylating agents like ethylnitrosourea (ENU) are very potent mutagens, leading to the modification of bases, mispairing during replication, and fixation of the mutation in the next replication period (Ehling et al., 1985). Because of the higher rate of cell divisions in male germ cell development compared to females, base substitutions arising from errors during replication tend to be paternal in origin (Eichenlaub-Ritter et al., 2007).
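The categories of coding-region point mutation introduced at the start of this section (missense, nonsense, silent, and loss of a stop codon) can be illustrated with a short Python sketch. It builds the standard genetic code and classifies a single-base substitution within one codon. The function name and the example codons are hypothetical choices for illustration and are not the actual sequences of any gene discussed in this chapter.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change at `position` (0, 1, or 2) of a
    DNA codon. Returns (kind, reference_aa, mutant_aa); '*' is a stop."""
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"
    elif ref_aa == "*":
        kind = "stop-loss"
    else:
        kind = "missense"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    for codon, pos, base in [("ATC", 2, "G"), ("TGG", 2, "A"),
                             ("CTT", 2, "C"), ("TAA", 0, "C")]:
        print(codon, pos, base, classify_substitution(codon, pos, base))
```

Such a classification covers only the predicted effect on the protein sequence. As noted above, it says nothing about tRNA availability, splicing, or regulatory effects, which is why a "silent" label cannot by itself establish that a variant is neutral.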
Figure 18.1. Confirmation of point mutations in humans. [Flowchart: a patient with a suggested hereditary disease undergoes family analysis. A sporadic case or small family (n < 10) leads to a functional candidate approach; a large family (n > 10) leads to linkage analysis. Both routes lead to identification of a mutation, followed by genetic confirmation (co-segregation with the disease in the family; absence in the healthy population) and functional confirmation (analysis in model systems: cell culture, animal models).]
18.2 INITIAL CONFIRMATION BY CO-SEGREGATION IN THE FAMILY
From a genetic point of view, a causative mutation has to co-segregate with the disease within a family. Therefore, familial analysis has to come first in the confirmation of a newly detected mutation. It should be kept in mind, however, that this confirmation is valid for monogenic Mendelian disorders only; special features will be discussed below. A recent example is the observation of a point mutation in the GJA8 gene encoding connexin 50. The mutation leads to a change from Ile to Met at position 247 and is therefore referred to as Cx50I247M (Graw et al., 2009). The mutation was previously described in a Russian family as causative for a dominant congenital cataract (opacity of the eye lens); the authors found co-segregation within the family and did not find the mutation in 25 healthy unrelated people (Polyakov et al., 2001). However, the mutation in the German patient analyzed by Graw et al. (2009) showed no co-segregation within the family, since the mutation was also present in the unaffected mother. This finding
excluded the Cx50I247M mutation in the GJA8 gene as a causative mutation for the cataract; the observation that this particular mutation was not present in 179 controls demonstrated only that it is a very rare allele. Further functional studies by the authors in cell culture systems showed that the mutant protein has the same functional characteristics as the wild type protein. Thus, the biochemical studies confirmed the initial conclusions, which had been based on genetic studies only. In practice, the sequence of investigations can also be reversed, since in large families the first step in mutation analysis of rare diseases or of disorders with genetic heterogeneity (e.g., retinitis pigmentosa; Rivolta et al., 2002; Hamel, 2006) will be a linkage analysis. Such a positional cloning approach will exclude many candidate genes and end up with a critical region in the range of a few megabases (or even less); however, the suggested mutation has to be present in all affected members of the family, but not in the healthy ones. In this context, a haplotype analysis is helpful. The term haplotype is a contraction of the term haploid genotype and refers to a combination of alleles at multiple closely linked loci (including genetic markers such as SNPs or microsatellite markers) that are transmitted together on the same chromatid. A haplotype contains only a few loci; the borders of a haplotype in a given family are determined by the recombination events that have occurred during the family history. It allows a putative disease allele to be traced through the family history ("identity by descent"; Visscher, 2009). However, if the families of interest are rather small (in the extreme case just a trio: the parents and one child), only a functional candidate gene approach is possible. In such cases only genes already known to be involved in the diagnosed disease can be tested.
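The co-segregation criterion applied in the GJA8 example (the variant present in all affected relatives and in no unaffected ones) can be written down as a simple check. The Python sketch below is illustrative only: the record type, the function name, and the three-person family are hypothetical, and the father's status is assumed here because the text describes only the patient and the unaffected carrier mother.

```python
from dataclasses import dataclass


@dataclass
class FamilyMember:
    name: str
    affected: bool
    carries_variant: bool


def cosegregation_report(family):
    """List the family members who violate strict co-segregation of a
    variant with a fully penetrant, monogenic disease."""
    affected_noncarriers = [
        m.name for m in family if m.affected and not m.carries_variant
    ]
    unaffected_carriers = [
        m.name for m in family if not m.affected and m.carries_variant
    ]
    return {
        "cosegregates": not affected_noncarriers and not unaffected_carriers,
        "affected_noncarriers": affected_noncarriers,
        "unaffected_carriers": unaffected_carriers,
    }


if __name__ == "__main__":
    # Hypothetical trio modeled on the case described in the text.
    trio = [
        FamilyMember("patient", affected=True, carries_variant=True),
        FamilyMember("mother", affected=False, carries_variant=True),
        FamilyMember("father", affected=False, carries_variant=False),
    ]
    print(cosegregation_report(trio))
```

An affected non-carrier argues strongly against causality (or points to a phenocopy), whereas an unaffected carrier is formally compatible either with a benign polymorphism or with incomplete penetrance, so the check narrows the interpretation without settling it.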
However, all other aspects for the control of the result have to be considered (segregation of the mutation within the small family, population control of the particular mutation, and its functional relevance, if possible). In the case of the GJA8 mutation (I247M), the biochemical experiments were crucial for the final categorization of the mutation as a polymorphism. However, if the biochemical investigations are not considered (or if such experiments cannot be performed), the result in the German cataract family could be interpreted differently, using the term reduced penetrance. Penetrance is always 100% in classical Mendelian disorders; if one assumes a Mendelian mode of inheritance but the disorder cannot be diagnosed in all carriers of the same mutation, this feature is referred to as reduced penetrance. This term, however, is only a formal description of unknown mechanisms modifying the outcome of a given mutation. An example was published by de Lange et al. (2007), who analyzed a large family suffering from cherubism, a benign fibro-osseous disease of the jaws. The disease has an autosomal dominant inheritance, and causative mutations have been found in the SH3BP2 gene on chromosome 4p16, which encodes the SH3 domain binding protein 2. The authors identified the P418T mutation in the SH3BP2 gene in five members of the family; however, two of those five were apparently healthy.
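The two family-level checks discussed above, co-segregation and penetrance, can be illustrated with a minimal Python sketch. The `Member` record and the toy pedigree are hypothetical: the carrier counts mirror the cherubism family (five carriers, two of them healthy), while the single non-carrier relative is invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One family member (hypothetical record layout, for illustration only)."""
    label: str
    carrier: bool    # carries the variant under test
    affected: bool   # clinically diagnosed with the disorder


def cosegregates(family):
    """True only if the variant is present in all affected and in no unaffected members."""
    return all(m.carrier == m.affected for m in family)


def penetrance(family):
    """Fraction of carriers who are affected, or None if the family has no carriers."""
    carriers = [m for m in family if m.carrier]
    if not carriers:
        return None
    return sum(m.affected for m in carriers) / len(carriers)


# Toy pedigree: five carriers (three affected, two healthy) plus one
# invented non-carrier relative.
family = [
    Member("I-1", carrier=True, affected=False),
    Member("II-1", carrier=True, affected=True),
    Member("II-2", carrier=True, affected=False),
    Member("II-3", carrier=False, affected=False),
    Member("III-1", carrier=True, affected=True),
    Member("III-2", carrier=True, affected=True),
]

print("strict co-segregation:", cosegregates(family))
print("estimated penetrance:", penetrance(family))
```

With these inputs, strict co-segregation fails and the estimated penetrance is 3/5, that is, 60%. Such a number is purely descriptive; as noted above, the term reduced penetrance does not explain the underlying mechanism.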
The new P418T mutation occurs at the codon most frequently altered in cherubism (Pro418 to Leu, Arg, or His). Because it replaces the nonpolar Pro by the polar Thr, the authors suggest that the change to Thr is also pathogenic. However, since no further biochemical or genetic data are given, the pathogenic mechanism of this particular mutation remains to be elucidated. Another unsolved problem in genotype–phenotype correlation is the clinical heterogeneity of particular mutations. An example was published by Tein et al. (2008) for mutations in the ACADS gene (encoding the short-chain acyl-CoA dehydrogenase) among individuals of Ashkenazi Jewish origin. In this population, the heterozygous carrier frequency of a particular point mutation (319C->T; resulting in an R107C mutation in the precursor protein or an R83C mutation in the mature enzyme) is 1:15, and the homozygous frequency is 1:900; therefore, this mutation is discussed as a founder mutation in this population. Another mutation, 625G->A (resulting in a G209S mutation in the precursor protein or a G185S mutation in the mature enzyme), is also quite frequent in the general population and is discussed as a susceptibility mutation. Clinical analysis of 10 individuals who were homozygous for the 319C->T mutation or compound heterozygous for the 319C->T/625G->A mutations revealed a broad range of clinical heterogeneity, including hypotonia and developmental delay in most of the patients, but also myopathy of different grades of severity, facial weakness, lethargy, feeding difficulties, or congenital abnormalities in only a few of the patients. These differences might be caused by biochemical differences between the two mutations, which result in 1–6% of residual enzyme activity. However, the heterogeneity indicates that additional modifiers are involved in the pathogenic process, and these remain to be identified. In addition to the topics discussed above, some particular modes of inheritance need specific consideration.
First of all, if the gene of interest is located on a sex chromosome, it has to be checked whether it lies in one of the two pseudoautosomal regions of the X or Y chromosome. In these cases the classical scheme of sex-linked inheritance is not valid. Similar difficulties in the interpretation of a pedigree may arise if the mutation affects a fertility gene on the Y chromosome. If the mutation occurs in the mitochondrial genome, the matrilineal inheritance of the disease has to be taken into account; additionally, heteroplasmy might be a cause of clinical heterogeneity in these cases.
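Returning to the ACADS example above, the two reported frequencies (carriers 1:15, homozygotes 1:900) can be checked against each other with a short calculation. The sketch below assumes Hardy-Weinberg proportions, an assumption that is not stated in the text and is introduced here only for illustration.

```python
import math

# Assumption (not from the text): Hardy-Weinberg proportions, i.e.
# homozygote frequency q^2 and heterozygous carrier frequency 2pq.
homozygote_freq = 1 / 900              # reported homozygous frequency
q = math.sqrt(homozygote_freq)         # implied frequency of the 319C->T allele
p = 1 - q
carrier_freq = 2 * p * q               # expected heterozygous carrier frequency

print(f"implied allele frequency q = {q:.4f}")
print(f"expected carrier frequency = 1 in {1 / carrier_freq:.1f}")
```

Under this assumption the implied allele frequency is 1/30, and the expected carrier frequency comes out at roughly 1 in 15 to 16, which agrees with the reported value of 1:15.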
18.3 SECOND CONFIRMATION BY POPULATION SCREENING
The essential distinction between a true disease-causing point mutation and a rare polymorphism (as some SNPs are) is its distribution among the healthy population. The human genome contains at least 11 million SNPs, with ∼7 million of these occurring with a minor allele frequency of over 5% and the remainder having minor allele frequencies between 1 and 5% (Frazer et al., 2009). A rare polymorphism can also be found within a healthy population,
but a true mutation must not be found. (Caveat: the tested population has to be screened for the critical characteristics of the corresponding disease; otherwise a finding within the population may simply identify a new patient!) This dogma is a formal one and holds true for classical Mendelian disorders only. For Mendelian disorders it is therefore necessary to include ∼100 controls (=200 chromosomes); if the observed mutation does not occur within this number of control persons at least in a heterozygous condition, the allele frequency is [...] C->T (or G->A) transition (depending on which strand of the DNA the 5-methylcytosine was present); if this occurs during the development of the germ cells, it is fixed as a mutation. 5-Methylcytosine is formed by DNA methylation of cytosine residues and occurs frequently at CpG dinucleotides in the course of regulatory processes (Mukherjee et al., 2003). For such frequently occurring mutations, the question arises whether they are independent de novo mutations or founder mutations that have spread throughout the population. An answer to this question can come from haplotype analysis. If the haplotypes are different, independent de novo mutations have to be considered; conversely, identical haplotypes point to a common origin (i.e., a founder mutation). In the case of the T296M mutation in the FIX gene, 36 patients underwent an additional haplotype analysis, resulting in 15 different haplotypes. This observation strongly argues in favor of a major contribution of de novo mutations as opposed to a founder effect (Mukherjee et al., 2003). In another case, Loidi et al. (2006) investigated CYP21A2 mutations in unrelated Spanish patients suffering from congenital adrenal hyperplasia. They observed the R444X mutation seven times in a total of 138 patients.
Using haplotype analysis, the authors could demonstrate that six of the seven patients share a common haplotype, indicating a unique ancestral origin of this particular mutation.
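What a negative result in ∼100 controls (200 chromosomes) can and cannot show may be made concrete with a small calculation. The following Python sketch assumes that control chromosomes are sampled independently from the population (a simple binomial model, which is an assumption introduced here and not a method taken from the text); the function names are hypothetical.

```python
def prob_unseen(allele_freq, n_chromosomes):
    """Probability that an allele of the given frequency is absent from
    every one of n independently sampled chromosomes."""
    return (1.0 - allele_freq) ** n_chromosomes


def upper_bound(n_chromosomes, confidence=0.95):
    """Largest allele frequency still compatible, at the given confidence,
    with observing zero copies among n chromosomes."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_chromosomes)


n = 200  # about 100 diploid controls
print(f"P(1% allele unseen in {n} chromosomes) = {prob_unseen(0.01, n):.3f}")
print(f"95% upper bound on allele frequency    = {upper_bound(n):.4f}")
```

Under this model, an allele with a frequency of 1% would still be missed in 200 chromosomes about 13% of the time, and absence from 200 chromosomes bounds the allele frequency only to below roughly 1.5% at 95% confidence. This is consistent with the point made for the GJA8 variant: absence from a control panel of this size shows that an allele is rare, not that it is pathogenic.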
18.4 THIRD CONFIRMATION BY EXPRESSION ANALYSIS AND FUNCTIONAL STUDIES IN MODEL SYSTEMS
The identification of a mutation in the DNA sequence alone is just an indicator that it might be causative for an inherited defect, even if all formal criteria are fulfilled (co-segregation within the family, absence in the healthy population). It is also important to make a statement concerning the underlying mechanism. Therefore, expression studies are necessary, as are functional studies in appropriate model systems (model organisms, cell culture, or biochemical tests). It is obvious that a gene that is expressed only in the eye lens cannot be responsible for heart failure. However, if it is expressed in both tissues, a mutation in this gene can be causative for cataracts (lens opacities) as well as for heart problems, as is the case for mutations in CRYAB encoding αB-crystallin (Graw, 2009). Such pleiotropic effects are well known even in classical genetics and describe effects of the same mutation on different organs or tissues. However, the opposite situation can also be observed: a child suffered from an apparently new syndrome of cataract and macular hypoplasia. The first mutation screening, using a functional candidate approach in the small family, revealed a de novo point mutation in the CRYAA gene (R21L) encoding αA-crystallin, which explained the cataract. Since CRYAA is not expressed in the retina, the mutation cannot be responsible for the other pathological findings.
Additional screening revealed compound heterozygosity in the OCA2 gene (R419Q and A481T); one of the two alleles was present in each of the unaffected parents. The macular hypoplasia was explained by the combined effect of the compound heterozygous mutations in the OCA2 gene, manifesting as a mild form of oculocutaneous albinism (Graw et al., 2006). Besides expression analysis, functional studies in appropriate model systems are necessary to confirm a mutation or to characterize it as a rare polymorphism without pathological potential. As mentioned above, the I247M mutation in the GJA8 gene (encoding connexin 50, Cx50) was not found in more than 200 healthy people of two different populations. Nevertheless, when expressed in HeLa cells, both wild-type Cx50 and I247M-Cx50 formed gap junction plaques. Moreover, both wild-type Cx50 and I247M-Cx50 induced gap junctional currents in pairs of Xenopus oocytes, indicating no functional differences between these two forms of Cx50 (Graw et al., 2009).
18.5 RECAPITULATION OF HUMAN MUTATIONS IN ANIMAL MODELS
The gold standard in confirming single nucleotide mutations is the recapitulation of the same mutation in the mouse. From a genetic point of view, the mouse is the best animal model for studying genetic effects in a living organism for comparison to humans. Several ways of making mouse mutants can be considered: the knockout of a gene of interest, leading to a loss-of-function mutation (in genetic terms, a null allele); the replacement of the wild-type allele by the mutation of interest (knockin approach); or random mutagenesis using ENU as mutagen. The knockout of every gene in the mouse is the aim of major consortia worldwide (www.eucomm.org, www.komp.org). This approach has made a lot of information available about genes and their functions. In many cases, the diseases in mouse and human are similar or even identical, particularly if the human point mutation also leads to a loss of function (e.g., by forming a premature stop codon). These mouse mutants and their comparison to wild-type mice allow the study of the expression of the gene of interest in the tissue(s) or organ(s) of interest and of the physiological consequences of its loss at the molecular, cellular, and whole-organ level. Mouse models have been used to prove or disprove causality, necessity, and sufficiency of various genes and their encoded proteins, or their absence, in causing pathological situations in many organs (an excellent review of such mouse models for cardiovascular diseases was published by Yutzey and Robbins [2007]). However, loss-of-function mutations represent only one aspect of the broad spectrum of the consequences of mutations in humans; they do not cover hypermorphic or hypomorphic alleles and their pathophysiological consequences, which are part of daily practice when dealing with human point mutations. Therefore, allelic series of mutations are absolutely required for
understanding the frequently observed clinical heterogeneity. A combination of gene targeting and spontaneous or randomly induced mutations is therefore necessary to represent the entire clinical spectrum. One of the most interesting sets of mutations was published very early in the field for the Mitf gene in the mouse (encoding the microphthalmia-associated transcription factor; Steingrímsson et al., 2004). The mutation spectrum in the mouse ranges from large deletions to point mutations, with phenotypes of differing severity (from severely affected dominant mutants to recessive mutants with almost no pathological phenotype). The human homolog, MITF, is mutated in patients with the pigmentary and deafness disorder Waardenburg syndrome type 2A. A current summary of all available mouse mutants at the Mitf locus (and at all other genes) can be found on the Jackson Laboratory website (www.informatics.jax.org/), which lists 35 Mitf alleles. For comparison, the database for human genetic diseases (Online Mendelian Inheritance in Man [OMIM], www.ncbi.nlm.nih.gov/omim) gives information for 8 selected alleles. Point mutations in the mouse can be induced randomly and with high efficiency by ENU (Ehling et al., 1985). This treatment has been widely used and has yielded a large collection of point mutations in the mouse (Acevedo-Arozena et al., 2008); the mutants are picked up because of an interesting phenotype. The underlying mutations have to be characterized in a similar way as human mutations, that is, by linkage analysis, sequencing of positional candidate genes, and exclusion of polymorphisms. Another way is offered by the Harwell sperm bank, which contains over 4000 DNA samples from individual F1 ENU-mutagenized mice (paralleled by frozen sperm samples). This archive can be screened for mutations in many genes, which allows a target-oriented phenotyping of the mutants afterward (Quwailid et al., 2004).
In this context, it is important to perform phenotyping in a highly standardized manner in order to obtain data sets that are comparable to those from human clinics. One example of a high-throughput, standardized phenotyping unit for mutant mice is the German Mouse Clinic (GMC; Gailus-Durner et al., 2009).
18.6 CONCLUSIONS AND OUTLOOK
Point mutations are frequently causative for inherited disorders in humans. Since single-nucleotide changes also occur frequently as polymorphisms, it is necessary to confirm the pathological nature of a mutation by its co-segregation with the disease in the family, by its absence in the general population (in the case of Mendelian disorders), and by its biological meaning. Next-generation sequencing techniques will allow fast sequencing of entire individual genomes at low cost, and the amount of data on single-nucleotide mutations is expected to increase significantly. Therefore, a clear pipeline of tests is necessary to confirm the identified mutations as causative for a given disorder.
18.7 ACKNOWLEDGMENTS
I thank those clinicians who have sent us samples for mutation analysis. Moreover, I thank Erika Bürkle, Monika Stadler and Maria Kugler for expert technical assistance in the analysis of numerous mutations in mice and humans. This project was supported by grants from the European Community (EUMODIC; LSHG-2006-037188) and from the National Genome Network (NGFN plus; BMBF 01GS0850).
18.8 QUESTIONS AND ANSWERS
Q1. What does SNP mean?
Q2. Give three criteria that define a real single nucleotide mutation as causative for a Mendelian disease.
Q3. Which criterion does not make sense for complex disorders?
A1. Scottish National Party, single nucleotide polymorphism, or Schneider-Neureither & Partner.
A2. Co-segregation in the family, absence in the population, biological meaning.
A3. Absence in the population.
18.9 REFERENCES
Acevedo-Arozena A, Wells S, Potter P, Kelly M, Cox RD, Brown SD. (2008). ENU mutagenesis, a way forward to understand gene function. Annu Rev Genomics Hum Genet 9:49–69.
de Lange J, van Maarle MC, van den Akker HP, Redeker EJ. (2007). A new mutation in the SH3BP2 gene showing reduced penetrance in a family affected with cherubism. Oral Surg Oral Med Oral Pathol Oral Radiol Endod 103:378–81.
Eichenlaub-Ritter U, Adler ID, Carere A, Pacchierotti F. (2007). Gender differences in germ-cell mutagenesis and genetic risk. Environ Res 104:22–36.
Ehling UH, Charles DJ, Favor J, Graw J, Kratochvilova J, Neuhäuser-Klaus A, Pretsch W. (1985). Induction of gene mutations in mice: the multiple endpoint approach. Mutation Res 150:393–401.
Frazer KA, Murray SS, Schork NJ, Topol EJ. (2009). Human genetic variation and its contribution to complex traits. Nat Rev Genet 10:241–51.
Gailus-Durner V, Fuchs H, Adler T, Aguilar Pimentel A, Becker L, Bolle I, Calzada-Wack J, Dalke C, Ehrhardt N, Ferwagner B, Hans W, Hölter SM, Hölzlwimmer G, Horsch M, Javaheri A, Kallnik M, Kling E, Lengger C, Mörth C, Mossbrugger I, Naton B, Prehn C, Puk O, Rathkolb B, Rozman J, Schrewe A, Thiele F, Adamski J, Aigner B, Behrendt H, Busch DH, Favor J, Graw J, Heldmaier G, Ivandic B, Katus H, Klingenspor M, Klopstock T, Kremmer E, Ollert M, Quintanilla-Martinez L, Schulz H, Wolf E, Wurst W, de Angelis MH. (2009). Systemic first-line phenotyping. Meth Mol Biol 530:463–509.
Graw J. (2009). Crystallins: cataract and beyond. Exp Eye Res 88:173–89.
Graw J, Klopp N, Illig T, Preising MN, Lorenz B. (2006). Congenital cataract and macular hypoplasia in humans associated with a de novo mutation in CRYAA and compound heterozygous mutations in P. Graefe’s Arch Clin Exp Ophthalmol 244:912–19.
Graw J, Schmidt W, Minogue PJ, Rodriguez J, Tong JJ, Klopp N, Illig T, Ebihara L, Berthoud VM, Beyer EC. (2009). The GJA8 allele encoding CX50I247M is a rare polymorphism, not a cataract-causing mutation. Mol Vis 14:1881–85.
Hamel C. (2006). Retinitis pigmentosa. Orphanet J Rare Dis 1:40 (doi: 10.1186/1750-1172-1-40).
Kottke-Marchant K. (2002). Genetic polymorphisms associated with venous and arterial thrombosis. Arch Pathol Lab Med 126:295–304.
Loidi L, Quinteiro C, Parajes S, Barreiro J, Lestón DG, Cabezas-Agrícola JM, Sueiro AM, Araujo-Vilar D, Castro-Feijóo L, Costas J, Pombo M, Domínguez F. (2006). High variability in CYP21A2 mutated alleles in Spanish 21-hydroxylase deficiency patients, six novel mutations and a founder effect. Clin Endocrinol 64:330–36.
Mukherjee S, Mukhopadhyay A, Chaudhuri K, Ray K. (2003). Analysis of haemophilia B database and strategies for identification of common point mutations in the factor IX gene. Haemophilia 9:187–92.
Nomura T. (2008). Transgenerational effects from exposure to environmental toxic substances. Mutat Res 659:185–93.
Polyakov AV, Shagina IA, Khlebnikova OV, Evgrafov OV. (2001). Mutation in the connexin 50 gene (GJA8) in a Russian family with zonular pulverulent cataract. Clin Genet 60:476–78.
Quwailid MM, Hugill A, Dear N, Vizor L, Wells S, Horner E, Fuller S, Weedon J, McMath H, Woodman P, Edwards D, Campbell D, Rodger S, Carey J, Roberts A, Glenister P, Lalanne Z, Parkinson N, Coghill EL, McKeone R, Cox S, Willan J, Greenfield A, Keays D, Brady S, Spurr N, Gray I, Hunter J, Brown SD, Cox RD. (2004). A gene-driven ENU-based approach to generating an allelic series in any gene. Mamm Genome 15:585–91.
Rivolta C, Sharon D, DeAngelis MM, Dryja TP. (2002). Retinitis pigmentosa and allied diseases: numerous diseases, genes, and inheritance patterns. Hum Mol Genet 11:1219–27.
Rosendaal FR, Reitsma PH. (2009). Genetics of venous thrombosis. J Thromb Haemost 7(suppl. 1):301–4.
Sankaranarayanan K. (2006). Estimation of the genetic risks of exposure to ionizing radiation in humans: current status and emerging perspectives. J Radiat Res 47(suppl.):B57–66.
Steingrímsson E, Copeland NG, Jenkins NA. (2004). Melanocytes and the microphthalmia transcription factor network. Annu Rev Genet 38:365–411.
Tein I, Elpeleg O, Ben-Zeev B, Korman SH, Lossos A, Lev D, Lerman-Sagie T, Leshinsky-Silver E, Vockley J, Berry GT, Lamhonwah AM, Matern D, Roe CR, Gregersen N. (2008). Short-chain acyl-CoA dehydrogenase gene mutation (c.319C>T) presents with clinical heterogeneity and is candidate founder mutation in individuals of Ashkenazi Jewish origin. Mol Genet Metab 93:179–89.
Visscher PM. (2009). Whole genome approaches to quantitative genetics. Genetica 36:351–58.
Yutzey KE, Robbins J. (2007). Principles of genetic murine models for cardiac disease. Circulation 115:792–99.
CHAPTER 19
Initial Identification and Confirmation of a QTL Gene
DAVID C. AIREY and CHUN LI
Contents
19.1 Introduction
19.1.1 What Are QTLs?
19.1.2 What Are the Goals of QTL Mapping?
19.1.3 Why Map QTL in Mice and Rats?
19.2 Initial Mapping of QTL
19.2.1 Software
19.2.2 Segregating Crosses
19.2.3 Genetic Reference Populations
19.2.4 Experimental Design and Statistical Power
19.3 Fine Mapping QTL
19.3.1 Selective Phenotyping and Recombinant Progeny Testing
19.3.2 Congenics
19.3.3 Advanced Intercrosses
19.3.4 Heterogeneous Stock and Outbred Mice
19.3.5 Recombinant Inbred Segregation Tests
19.3.6 Haplotype Association Mapping
19.3.7 Multiple Cross Mapping and Combining Crosses
19.3.8 Populations on the Horizon: Diversity Outbred and Collaborative Cross Mice
19.4 Confirmation of a QTL Gene
19.4.1 What Is Required to Claim a Gene?
19.5 Bioinformatics, Systems Genetics, and Networks
19.6 Pharmacogenomics and Dynamic Phenotyping
19.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
19.1 INTRODUCTION
19.1.1 What Are QTLs?
Quantitative genetics is broadly defined as the study of biological variation (Lynch and Walsh, 1998). Most phenotypes show continuous variation when carefully measured, despite the tendency of the medical model to categorize disease. Quantitative trait loci (QTLs) are the genetic loci that contribute to quantitative variation in a trait, and QTL mapping in mice and rats is the effort to identify QTLs through an experimental cross. This chapter serves as a pragmatic tour guide for those considering rodent QTL mapping experiments for the first time. The main parts of the chapter summarize QTL mapping using mice and rats, from initial identification to confirmation of a QTL gene. We then briefly discuss the intersection of bioinformatics and systems genetics with QTL mapping. We end the chapter with a suggestion for QTL mapping in the area of pharmacogenomics.
19.1.2 What Are the Goals of QTL Mapping?
The goals of QTL mapping in rodents include detection, localization, and estimation of effect size. For biomedical purposes, where the goal is translation to improved understanding and treatment of human disease, we care most about these goals in that order. Were we conducting agricultural experiments where the end goal was genotype-assisted selection for an improved phenotype, then the heritability of the QTL might be of greatest interest. The heritability of a QTL is the fraction of phenotypic variance explained by it, and is a measure of QTL effect size. In biomedical applications of QTL mapping in rodents, by contrast, any identified QTL gene, regardless of its effect size, may provide clues to disease mechanism.
19.1.3 Why Map QTL in Mice and Rats?
All common human diseases of great economic burden, such as obesity, heart disease, hypertension, diabetes, cancer, and psychiatric illness, have complex genetic etiology. The underlying genetic variation of such diseases is polygenic, and the effect of individual genetic variants is small. In other words, it is difficult to discover the genetic causes of any of these diseases for the vast majority of victims, despite recent progress with genomewide association (GWA) studies in humans. Manolio et al. (2009) outlined several strategies to improve GWA studies in humans, given that the amount of phenotypic variance that is collectively explained by all discovered genes in each disease is thus far much lower than prior estimates of disease heritability. One strategy provided by the authors is improved phenotyping. Ascertainment of human phenotypes can be limited by access and ethics. Thus a complementary approach to finding the “missing heritability of complex diseases” in humans not outlined by
Manolio et al. (2009) is the use of experimental crosses in rodents. Within the ethical guidelines for animal research, experimental crosses in mice and rats provide access to disease model phenotypes not feasible or possible in human genetics research.
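Section 19.1.2 defines the heritability of a QTL as the fraction of phenotypic variance it explains. As a minimal sketch of that definition, the Python snippet below computes the between-genotype sum of squares divided by the total sum of squares at a single locus. The function name and the toy phenotype values are hypothetical and chosen only so that the arithmetic is easy to follow; this is not a procedure taken from the chapter or from any QTL software.

```python
def variance_explained(groups):
    """Fraction of phenotypic variance explained by genotype at one locus.

    groups maps a genotype label to the list of phenotype values of the
    animals carrying it; the result is SS_between / SS_total.
    """
    values = [y for ys in groups.values() for y in ys]
    grand_mean = sum(values) / len(values)
    ss_total = sum((y - grand_mean) ** 2 for y in values)
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
        for ys in groups.values()
    )
    return ss_between / ss_total


# Invented toy intercross data: two animals per genotype class.
toy = {"AA": [10.0, 12.0], "AB": [12.0, 14.0], "BB": [14.0, 16.0]}
h2_qtl = variance_explained(toy)
print(f"fraction of variance explained = {h2_qtl:.3f}")
```

In the toy data the between-genotype sum of squares is 16 and the total is 22, so the locus explains 16/22, about 73%, of the phenotypic variance. QTLs for complex traits typically explain far less, which is one reason why detection and localization rank ahead of effect size in biomedical applications.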
19.2 INITIAL MAPPING OF QTL
It is unfortunate that much of the primary and secondary literature describing the methods of QTL mapping requires a high level of statistical sophistication. As an accessible introduction, Broman (2001) is recommended. Another recent review is by Zou (2009). For non-statisticians willing or required to go beyond that, a recently released book by Karl Broman and Śaunak Sen (2009) provides a complete and highly useful exposition of QTL mapping using the software package R/QTL (www.rqtl.org). Every practitioner of QTL mapping should own a copy of this book; there are other good sources (e.g., Siegmund and Yakir, 2007; Wu et al., 2007), but they are not as useful to the biologist. Broman and Sen (2009) provide enough practical detail on how to actually do QTL mapping, without oversimplifying the material or digressing too far into statistical methodology, while still giving enough methodology to allow good judgment by the biomedical scientist-practitioner. In the following sections, we discuss software first, followed by a synopsis of workflow in two contrasting types of mapping populations for QTL mapping in mice and rats: segregating crosses and genetic reference populations. The distinguishing feature is that genetic variation is between individuals in segregating crosses and between isogenic lines or strains in genetic reference populations.
19.2.1 Software
19.2.1.1 Why Is Advanced Software Necessary?
Rodent QTL mapping methods differ in the organization of the genomes used (e.g., backcross, intercross, advanced intercross, recombinant inbred, consomic, congenic, recombinant congenic, inbred strain diversity panel, or outbred lines of mice), in the type of statistical association evaluated (e.g., t-test, ANOVA, correlation, regression, generalized linear model, nonparametric approaches, or Bayesian approaches), or in both, but all are fundamentally similar in that they relate phenotypic variation to genetic variation. It is important to note that if we had complete genotype data for the entire mapping population, then using a t-test or ANOVA to compare the means of the animals grouped by genotype at every genetic location (called marker regression) would be a satisfactory approach for a single-QTL model. However, three central problems conspire against using this simple approach and result in the requirement for sophisticated software like R/QTL. These three problems are the model
selection problem, the missing data problem, and the test multiplicity problem (Broman and Sen, 2009). First, even if we had complete genotype data, we would still face the daunting problem of finding a good model. For example, the t-test would be inadequate if the phenotype were best predicted by a covariate-moderated two-QTL interaction. This illustrates the model selection problem. Second, because we generally observe only a finite set of markers and not the QTL genotype, we must infer an association between marker and phenotype due to linkage. This is the missing data problem, and it is inadequately handled by marker regression. Marker regression drops animals with missing marker data from the analysis, has reduced power depending on marker density, and cannot distinguish between a smaller-effect QTL close to a marker and a larger-effect QTL farther away from it. Third and finally, the testing of hundreds or thousands of genetic markers makes the nominal criterion for statistical significance invalid. Software such as R/QTL flexibly defines the genomewide statistical significance criterion by permutation methods.
19.2.1.2 Features of R/QTL
Introduced in 2003 by Broman et al., R/QTL is a mature software package for mapping QTLs in rodents. R/QTL runs as a package within R, a free and open-source, cross-platform language and environment for statistical computing and graphics (www.r-project.org). Although all R/QTL functions are executed from the command line, a more user-friendly Java graphical interface called J/QTL is available (Smith et al., 2009).
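The marker regression and permutation threshold described in Section 19.2.1.1 can be sketched in a few lines of plain Python. This is a toy illustration and not R/QTL: it assumes a simulated backcross with complete genotype data on a single chromosome, uses a pooled-variance t-test at each marker, and derives the genomewide threshold from the maximum statistic over all markers in each permutation. All function names and parameter values (sample size, marker number, QTL position, effect size, recombination fraction, seeds) are arbitrary choices made for this example.

```python
import math
import random


def simulate_backcross(n=150, n_markers=20, qtl=10, effect=1.5, r=0.1, seed=1):
    """Simulate one chromosome of a backcross with genotypes coded 0/1.

    Adjacent markers recombine with probability r; the phenotype is the
    QTL effect at marker index `qtl` plus standard normal noise.
    """
    rng = random.Random(seed)
    genotypes = [[0] * n for _ in range(n_markers)]  # genotypes[marker][animal]
    for i in range(n):
        g = rng.randint(0, 1)
        for m in range(n_markers):
            if m > 0 and rng.random() < r:
                g = 1 - g  # recombination between markers m-1 and m
            genotypes[m][i] = g
    y = [effect * genotypes[qtl][i] + rng.gauss(0.0, 1.0) for i in range(n)]
    return y, genotypes


def t_stat(y, g):
    """Pooled-variance two-sample t statistic, genotype class 1 versus class 0."""
    a = [v for v, k in zip(y, g) if k == 0]
    b = [v for v, k in zip(y, g) if k == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def scan(y, genotypes):
    """Marker regression: absolute t statistic at every marker."""
    return [abs(t_stat(y, g)) for g in genotypes]


def permutation_threshold(y, genotypes, n_perm=200, alpha=0.05, seed=2):
    """Approximate (1 - alpha) quantile of the maximum statistic over all
    markers when phenotypes are shuffled relative to genotypes."""
    rng = random.Random(seed)
    shuffled = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(scan(shuffled, genotypes)))
    maxima.sort()
    return maxima[min(n_perm - 1, int((1 - alpha) * n_perm))]


y, genotypes = simulate_backcross()
stats = scan(y, genotypes)
threshold = permutation_threshold(y, genotypes)
peak = max(range(len(stats)), key=stats.__getitem__)
print(f"peak marker {peak}: |t| = {stats[peak]:.2f}, threshold = {threshold:.2f}")
```

Because each permutation records the maximum over all markers, the resulting threshold accounts for the number of tests and for the correlation between linked markers, and it is therefore higher than the nominal single-test cutoff. The sketch deliberately ignores the other two problems named above: it has no mechanism for missing genotypes or for positions between markers, and it fits only a single-QTL model.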
R/QTL has numerous functions for genetic mapping and map construction; data summarization and results plots of many kinds; single-QTL interval mapping scans by four different estimation methods (standard maximum likelihood interval mapping, Haley-Knott approximation, an extended Haley-Knott approximation, and multiple imputation); two-QTL epistasis scans; binary trait and nonparametric mapping; allowance for additive or interacting covariates; proper handling of the X chromosome (missing in other packages); statistical significance by permutation and stratified permutation; multiple-QTL models with a routine for automated model selection; and, under it all, a hidden Markov model that gracefully handles missing data. The rationale and use of each of these many features are clearly explained by Broman and Sen (2009). We add that some caution should be exercised when using automated model selection routines, because overfitting and lack of generalization to similar crosses may occur.
19.2.1.3 Missing from R/QTL
Broman and Sen (2009) note the lack of multiple-trait mapping methods in R/QTL, in which two or more related phenotypes are jointly mapped. Also desirable but missing is the structural or causal analysis of multiple traits described by Li et al. (2006). With a touch of humor, Broman and Sen (2009, pp. 257–58) note the lack of Bayesian approaches to QTL mapping in R/QTL, suggesting that while these approaches are
attractive, they require considerable training to use. Multiple QTL model construction after multiple imputation supports Haley-Knott interval mapping only.
19.2.1.4 Other Software
There are several other free and commercial QTL mapping software environments, such as QTL Cartographer (http://statgen.ncsu.edu/qtlcart/WQTLCart.htm). This software has a Windows graphical user interface and is also scriptable like R/QTL, although it does not run on top of a general-purpose statistical language like R. QTL Cartographer has tools for single-marker analysis, interval mapping, composite interval mapping, Bayesian interval mapping, multiple interval mapping (the extension of standard interval mapping by maximum likelihood to multiple-QTL models), multiple trait analysis, and categorical trait analysis. Brief online documentation is available (http://statgen.ncsu.edu/qtlcart/HTML/index.html) and support is offered by email. For online analysis of genetic reference populations, the website www.webqtl.org is recommended (Wang et al., 2003a; Rosen et al., 2007). For analysis of outbred populations, see, for example, Mott et al. (2000) and Johannesson et al. (2009). Both Karl Broman and Brian Yandell maintain directories of software for QTL mapping using rodents (e.g., see www.rqtl.org).
19.2.2 Segregating Crosses
19.2.2.1 Phenotyping
Successful crosses benefit from using mouse or rat inbred strains that differ in the phenotype of interest. Before firsthand assessment of the technical variance and heritability of a phenotype (www.nervenet.org/papers/shortcourse98.html), the primary literature and online resources should be investigated. For the mouse, a large number of strain phenotypes are deposited at the Jackson Laboratory Mouse Phenome Database (www.jax.org/phenome; Grubb et al., 2009). Similar resources are available for the rat (http://rgd.mcw.edu; Twigger et al., 2006).
An important difference in phenotyping exists between segregating crosses and genetic reference populations. Repeated measurement of mice in a segregating cross can only reduce technical variance. Repeated measurement of mice from genetic reference populations can reduce technical variance (through measures on the same mouse) or environmental variance (through measures on different mice of the same genotype). While careful planning does allow multiple phenotypes to be collected for each mouse from an intercross or backcross (Solberg et al., 2006), greater flexibility in obtaining multiple phenotypes is enabled by the use of genetic reference populations.
19.2.2.2 Genotyping
Each individual mouse or rat from a segregating cross is genetically unique and requires genotyping at equally spaced intervals if it is to contribute genetic information to the mapping of QTLs. Fortunately, multiplex genotyping can be performed rapidly by high-throughput assays,
such as the Illumina Mouse LD (low-density, 377 loci) and Mouse MD (medium-density, 1449 loci) Linkage genotyping panels. Reagent cost per biological sample as of March 2010 is approximately $70 for the LD panel and $90 for the MD panel. For qualified NIH funded projects, use of the Illumina Mouse MD Linkage panel can be free (www.nih.gov/science/models/mouse/resources/geno-service.html). For very high genetic resolution, appropriate in fine mapping (see below) and other applications, a JAX Mouse Diversity Genotyping Array is available with 620,000 SNPs (http://jaxservices.jax.org/mdarray/index.html) (Yang et al., 2009). From a cost-savings point of view, it should be emphasized that it is not necessary to genotype every cross progeny; selective genotyping can be considered when using segregating crosses in QTL mapping experiments (Sen et al., 2005, 2007). In rats, although genotyping in segregating crosses is still performed using PCR of microsatellites and polyacrylamide gel separation, the basis for high-throughput assays is becoming available (Anegon, 2011; Jacob, 2010; STAR Consortium et al., 2008). A highly important new development for mouse QTL mapping is an improved standard genetic map based on a large heterogeneous mouse population (Cox et al., 2009). When using this “revised Shifman” map and examining 78 mapped QTLs, 15 (19%) showed altered localization. The new map also improved concordance between mouse QTLs and human GWAS loci. Ackert-Bicknell et al. (2010) found 26 of 28 human GWAS loci for bone mineral density coincided with mouse QTL support intervals, with 14 GWAS loci within 3 cM of a mouse QTL peak, using the new map. 19.2.2.3 Mapping with Backcrosses and Intercrosses The purpose of an experimental cross is to create genetic variation, so that we can study the association between genetic markers and the phenotype. Backcrosses and intercrosses are the two standard segregating crosses used to map QTLs. 
In the backcross, two inbred strains are crossed to produce isogenic F1 mice that are heterozygous at loci that differ between the two strains. F1 mice are crossed back to one of the inbred strains to produce backcross mice that, at any given locus, are either homozygous for the backcross parent genotype or heterozygous. In the intercross, following the F1 stage, two F1 mice are crossed to produce F2 intercross mice that, at any given locus, are homozygous for either inbred strain genotype or heterozygous. Although no more on this will be said here, to map the X chromosome, one has to keep track of the direction of the crosses with respect to sex (see Broman and Sen, 2009, §4.2). Following creation of a backcross or intercross population, phenotyping, and genotyping, a data set is generated that records columns of data for animal identification matched to one or more phenotypes, sex, covariates, and genotypes with associated chromosomal and centimorgan (cM) position information. Covariates are secondary measurements that may affect the phenotype of interest (in general, if we cannot experimentally control a covariate, then we measure subjects in different strata of the covariate, or at least measure the covariate). The general workflow using R/QTL as illustrated by Broman and Sen (2009) follows.
1. Format and import the data into R/QTL (or J/QTL, see above).
2. Perform various data quality checks, such as checking the distribution shape of the phenotype or the integrity or order of the genotype data.
3. Perform single-QTL analysis by interval mapping.
4. Determine genomewide significance by permutation.
5. Determine the confidence interval estimates of location.
6. Determine effect size estimates for significant QTL.
7. Investigate additive covariates or QTL × covariate interactions, possibly also using QTL markers as covariates (composite interval mapping) to enhance single QTL mapping.
8. Perform a two-QTL epistasis scan.
9. Explore and construct a multiple QTL model using multiple imputation combined with Haley-Knott interval mapping.
Some additional aspects to this work flow can be highlighted. First, the authors note that it is not uncommon to spend half of the intellectual effort of a project editing and cleaning one’s data! Second, some of the permutation procedures can take considerable time (days) and require processing on multiple CPUs. Third, an automated (stepwise) version of the work flow above (#3–9) is implemented in R/QTL and demonstrated in two case studies at the end of Broman and Sen (2009).
19.2.3 Genetic Reference Populations
19.2.3.1 Phenotyping As stated earlier, the distinguishing feature between segregating crosses and genetic reference populations (GRPs) is that genetic variation is between individuals in segregating crosses and between isogenic lines or strains in GRPs. This allows greater flexibility in phenotyping, because the same genotype can be repeatedly sampled under the same conditions to reduce measurement error and environmental variance, or under different conditions to facilitate multivariate analysis. How many animals per strain are usefully measured has been addressed by Crusio (2004) and others. 
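As a rough guide to how the number of replicates per strain trades off against environmental noise, the sketch below uses the standard quantitative-genetic relation for the heritability of a strain mean, h2_mean = VG / (VG + VE/n). The relation is a textbook approximation and the variance values are made up for illustration; neither is taken from Crusio (2004) or from the data discussed in this chapter.

```python
def strain_mean_heritability(v_genetic, v_environment, n_per_strain):
    """Heritability of a strain mean when n isogenic animals are averaged.

    Averaging n animals of the same genotype divides the non-genetic
    variance by n, while the between-strain (genetic) variance is unchanged.
    """
    if n_per_strain < 1:
        raise ValueError("n_per_strain must be at least 1")
    return v_genetic / (v_genetic + v_environment / n_per_strain)


if __name__ == "__main__":
    # Hypothetical trait: 20% genetic and 80% non-genetic variance per animal.
    for n in (1, 2, 4, 8, 16):
        h2 = strain_mean_heritability(0.2, 0.8, n)
        print(f"n = {n:2d} animals per strain -> h2 of strain mean = {h2:.3f}")
```

Under these assumed variances the gain from each added animal shrinks quickly, which is one way to see why a moderate number of replicates per strain is often considered sufficient.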
A great advantage of GRPs is the possibility of using phenotypes from the literature (Philip et al., 2009, Wang et al., 2003a), but caution about environmental differences is appropriate (Wahlsten et al., 2006). Warehouses of GRP data are available at www.webqtl.org and at the JAX Mouse Phenome Database. 19.2.3.2 Genotyping Although genotyping of segregating crosses has become much easier with the high-throughput assays like those described above, once genotyping is established for a GRP, it is archival and need not be repeated (by contrast, every new segregating cross must be genotyped). For example, 15,260 SNPs were genotyped on 480 recombinant inbred lines and standard inbred strains by the Wellcome Trust Centre for Human Genetics (http://mus.well.ox.ac.uk/mouse/INBREDS).
19.2.3.3 Mapping with Recombinant Inbred Lines Recombinant inbred lines (RILs) are created by inbreeding F2 intercross progeny 20 or more generations. Each resulting RIL is an isogenic genetic mosaic of the parental inbred strain genomes. The genotype at any locus is homozygous for either parent strain’s alleles. Because creating RILs is expensive and requires years of effort, QTL mapping with RILs typically uses available mapping populations. RIL populations are available for both mice and rats (e.g., Jirout et al., 2003; Peirce et al., 2004), but they are relatively few in number and size compared to opportunities using segregating crosses. The advantages and disadvantages of using RILs in QTL mapping led to the proposal to create a larger panel of RILs using more than two parental strains (Churchill et al., 2004; Roberts et al., 2007; Chesler et al., 2008). Mapping with RILs is usually performed on the strain averages, and follows the same general progression described above for segregating crosses. 19.2.3.4 Mapping with Other GRPs There are other types of GRPs besides RILs that can be used for initial QTL mapping. They are chromosome substitution strains (also called consomic lines), recombinant congenic lines, and standard inbred strain diversity panels. A chromosome substitution strain (CSS) is an inbred strain with one of its chromosomes replaced by the homologous chromosome of another inbred strain. For example, the C57BL/6J (host) × A/J (donor) CSSs are 22 in number, including one CSS for each autosome, the X and Y sex chromosomes, and the mitochondrial DNA (all available through the Jackson Laboratory). To map a QTL to a particular chromosome, a CSS is simply compared to the control (host) strain C57BL/6J (Hill et al., 2006; Shao et al., 2008). A recombinant congenic line is similar in concept to the chromosome substitution strain, except a smaller donor region of a chromosome is present on the host background (Fortin et al., 2007). 
The same, simple mapping principles apply, but with greater QTL mapping resolution. Finally, there is the possibility of directly mapping QTL using a large panel of inbred strains. This approach, called haplotype association mapping (HAM), has been advocated for both initial mapping and fine mapping, but has also been controversial (Tsaih and Korstanje, 2009). Nevertheless, some experimental success has been achieved using HAM (e.g., Burgess-Herbert et al., 2009). 19.2.4
Experimental Design and Statistical Power
19.2.4.1 Experimental Design Good experimental design balances pragmatism against the scientific goals of a study and often requires extensive experience. Experimental design choices related to QTL mapping are reviewed by Broman and Sen (2009). For example, the relative strengths and weaknesses of using the backcross, intercross, and RILs are discussed as well as some underlying theory. The intercross is the most versatile cross, able to detect both additive and dominance effects. The backcross may be more powerful for
dominance effects, but only if the right backcross is chosen. RILs are theoretically the most powerful choice to detect a purely additive locus, but only if enough lines are available. RILs are incapable of detecting overdominant loci that lack additive effects. Given their different merits, it is a reasonable strategy to use more than one cross type. 19.2.4.2 Statistical Power The R package “qtlDesign” (Sen et al., 2007) can be used to explore statistical power associated with different experimental design choices. For example, Broman and Sen (2009) use this package to show equivalent power to detect an additive QTL of a given effect size using an intercross of 162 animals, a backcross of 247 animals, or 42 RILs with four replicate mice per line. Whereas the choice of cross type and its size can influence the power to detect a QTL, selective genotyping may reduce experimental cost. Broman and Sen (2009) note this generally is useful only when a specific single phenotype is of interest (because selective genotyping is done on the phenotypic extremes) and when the cost of raising and phenotyping mice is much less than genotyping. Finally, the R functions provided by Sen et al. (2007) are approximations given a set of assumptions. Sen et al. (2007) also provide tools to simulate experimental design choices, and although this approach may be more difficult, it may also prove more accurate (Broman and Sen, 2009).
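The simulation route mentioned above can be illustrated with a deliberately simplified sketch. It is not the qtlDesign package and does not reproduce its calculations: it assumes a backcross with a single, fully informative marker located exactly at the QTL, unit residual variance, and a fixed LOD threshold of 3 chosen only for illustration (in practice the threshold would come from permutation). All function names are hypothetical.

```python
import math
import random


def marker_lod(genotypes, phenotypes):
    """LOD score comparing genotype-class means with a single common mean."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    by_class = {}
    for g, y in zip(genotypes, phenotypes):
        by_class.setdefault(g, []).append(y)
    rss_alt = 0.0
    for ys in by_class.values():
        class_mean = sum(ys) / len(ys)
        rss_alt += sum((y - class_mean) ** 2 for y in ys)
    if rss_null == 0.0:
        return 0.0          # no phenotypic variation, nothing to detect
    if rss_alt == 0.0:
        return math.inf     # genotype explains the phenotype perfectly
    return (n / 2.0) * math.log10(rss_null / rss_alt)


def simulate_backcross_power(n_animals, effect_sd, lod_threshold=3.0,
                             n_sims=500, seed=1):
    """Fraction of simulated backcrosses whose LOD reaches the threshold.

    effect_sd is the difference between the two genotype means in units of
    the residual standard deviation (an assumed, illustrative parameter).
    """
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sims):
        genos = [rng.randint(0, 1) for _ in range(n_animals)]
        phenos = [effect_sd * g + rng.gauss(0.0, 1.0) for g in genos]
        if marker_lod(genos, phenos) >= lod_threshold:
            detected += 1
    return detected / n_sims


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, simulate_backcross_power(n, effect_sd=0.5))
```

Because the marker is assumed to sit on the QTL and genotypes are complete, estimates from this sketch are optimistic relative to a real genome scan; it is meant only to show how power can be read off as a detection frequency across simulated crosses.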
19.3 FINE MAPPING QTL The methods described earlier are sufficient to localize QTL to an interval of 10 cM or more. To further reduce the support interval of a QTL, additional fine-mapping methods are required. Because these methods involve considerable expense and effort, independent confirmation of a mapped QTL is desirable before advancing to fine mapping (Abiola et al., 2003). Approaches to fine mapping a QTL will generally result in a 1–5 cM support interval containing 20–100 candidate genes. Although many approaches are touched on briefly below, congenics and advanced intercrosses are popular and proven methods. 19.3.1
Selective Phenotyping and Recombinant Progeny Testing
Once a QTL is initially mapped using an intercross or backcross, additional progeny can be bred, tail snipped, and genotyped within the QTL support interval. Animals recombinant within the interval can be selectively phenotyped allowing the QTL to be confirmed and its mapping interval reduced. Alternatively, animals that have a recombinant chromosome within a QTL support interval can be backcrossed to a parental strain to determine the location of the QTL relative to the recombination point, using a procedure called recombinant progeny testing (Darvasi, 1998).
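The selection step described above, picking out animals that are recombinant within the support interval, can be sketched as follows. The marker positions, animal identifiers, and genotype codes ("A" for homozygous, "H" for heterozygous, None for a missing call) are hypothetical, and the helper names are not part of any published package.

```python
def breakpoint_intervals(positions_cm, calls):
    """Return (left_cM, right_cM) pairs between which the genotype changes.

    Markers with missing calls (None) are skipped, so a breakpoint is
    localized only to the nearest typed markers on either side.
    """
    typed = [(pos, call) for pos, call in zip(positions_cm, calls)
             if call is not None]
    return [(p1, p2)
            for (p1, c1), (p2, c2) in zip(typed, typed[1:])
            if c1 != c2]


def recombinant_animals(positions_cm, genotypes_by_animal):
    """Map each recombinant animal to its breakpoint intervals."""
    result = {}
    for animal, calls in genotypes_by_animal.items():
        breaks = breakpoint_intervals(positions_cm, calls)
        if breaks:
            result[animal] = breaks
    return result


if __name__ == "__main__":
    # Hypothetical markers spanning a QTL support interval (positions in cM).
    positions = [40.0, 45.0, 50.0, 55.0]
    genotypes = {
        "m1": ["A", "A", "A", "A"],    # non-recombinant
        "m2": ["A", "A", "H", "H"],    # crossover between 45 and 50 cM
        "m3": ["H", None, "H", "A"],   # crossover between 50 and 55 cM
    }
    for animal, breaks in recombinant_animals(positions, genotypes).items():
        print(animal, breaks)
```

Animals returned by such a filter would be the candidates for selective phenotyping or for backcrossing in a recombinant progeny test.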
19.3.2
Congenics
Congenic animals contain a defined chromosomal segment from a donor inbred strain on the background of a host inbred strain. Congenics are created by repeatedly backcrossing one inbred strain onto another with selection for a particular marker from the donor strain. When additional selection for the genetic background is performed with markers spanning the genome, speed congenics can be created in 5 generations (15–19 months) rather than the 10 generations (30–36 months) required for traditional congenics. Jackson Laboratory provides full services for speed congenic creation ($25,000) or partial services for genotyping and pairing advice ($7,000–9,000). A variation on this approach called interval-specific congenics has been described by Darvasi (1997). 19.3.3
Advanced Intercrosses
Advanced intercross (AI) animals are created by breeding beyond the F2 stage. Because mapping resolution depends on recombination density, each additional generation in an advanced intercross introduces additional recombination and reduces the support interval of a mapped QTL. For example, after eight additional breeding generations a fivefold reduction in the QTL confidence interval is expected (Darvasi and Soller, 1995). Several groups have used the advanced intercross effectively (Fawcett et al., 2009; Iraqi et al., 2000; Wang et al., 2003b), but some care is required in the breeding of the population (Schmitt et al., 2009) and the analysis (Peirce et al., 2008; Valdar et al., 2009). 19.3.4
Heterogeneous Stock and Outbred Mice
Heterogeneous stock (HS) mouse and rat populations were each separately created from eight inbred strains. HS animals have both high recombination density (50–60+ generations) and genetic diversity (8 strains). The use of these populations for high-resolution genomewide mapping has been successfully pioneered by Flint and colleagues (Valdar et al., 2006, 2009; Johannesson et al., 2009; Huang et al., 2009; Woods et al., 2010). In each of these applications, careful attention was paid to the family structure of this highly recombinant population. Heterogeneous stock mouse and rat populations are a very promising resource for fine mapping QTLs. Other outbred stocks, such as MF1 mice (Ghazalpour et al., 2008; Yalcin et al., 2004), have also been used to fine map QTL. The use of HS and other outbred stocks is a rapidly developing area of interest in rodent complex trait genetics (Aldinger et al., 2009). 19.3.5
Recombinant Inbred Segregation Tests
To perform a recombinant inbred segregation test (RIST) (Darvasi, 1998), recombinant inbred lines that have a recombination event within a QTL support interval are crossed to both parental lines. In one of the crosses, the
QTL will segregate and in the other it will not, providing a test of whether the QTL is above or below the recombination event. Because the RIST is limited to available RILs, yin-yang crosses try to generalize the RIST by treating available inbred strains as RILs and allowing a greater pool of possible recombination events to choose from (Flint et al., 2005). 19.3.6
Haplotype Association Mapping
HAM (also called in silico mapping) looks for associations between the phenotype and haplotypes of mouse inbred strains, treating inbred strains as individuals. While initially regarded as suspect, experimental validation of the method and careful consideration of potential pitfalls have led to continued interest in the approach (e.g., Burgess-Herbert et al., 2009). Tsaih and Korstanje (2009) provide an excellent and current review of HAM methods in mice. 19.3.7
Multiple Cross Mapping and Combining Crosses
In some cases it may be possible to combine QTL mapping crosses by assuming the same locus is segregating in each cross. While this approach can reduce the QTL support interval considerably (Hitzemann et al., 2003), the assumption is safest with crosses of the same parental strains (Peirce et al., 2007; Malmanger et al., 2006). 19.3.8 Populations on the Horizon: Diversity Outbred and Collaborative Cross Mice There are two large-scale QTL mapping populations being developed that will greatly increase the power of complex trait genetics in mice. These are collaborative cross (CC) mice and diversity outbred (DO) mice (http://jaxmice.jax.org/jaxnotes/514/514b.html). CC mice are a large panel of recombinant inbred lines derived from an eight-way cross of standard inbred strains and wild-derived strains (Chesler et al., 2008; Roberts et al., 2007). DO mice are outbred mice derived from the founder breeding mice used to create the collaborative cross. The combination of these two populations, perhaps combined with RIST methods (Flint et al., 2005) and dense genotyping offered by the JAX Mouse Diversity Genotyping Array, should enable researchers to rapidly map genetic loci at high resolution and identify individual genes involved in disease. The expected availability of DO and CC mice is 2010 and 2012, respectively. 19.4 CONFIRMATION OF A QTL GENE 19.4.1
What Is Required to Claim a Gene?
A consortium of authors (Abiola et al., 2003) presented a community white paper describing the evidence needed to identify a candidate gene for a
mapped QTL. A predominance of evidence, including more than one of the following types, should suffice when supported by peer review.
• Polymorphism in coding or regulatory regions.
• Gene function related to the mapped trait.
• In vitro functional studies.
• Transgenesis (BAC lines).
• Knockins.
• QTL-knockout interaction test.
• Mutational analysis.
• Homology searches.
To gain a sense of the use of these guidelines, the reader is directed to outstanding reviews of success stories in mice (Flint et al., 2005) and rats (Aitman et al., 2010). Tables 1 and 2 in Flint et al. (2005) and table 2.1 in Aitman et al. (2010) list cloned quantitative trait loci along with the approaches used. The most noticeable difference between the mouse and rat QTL endgame is the lack of large numbers of available rat knockouts. Aitman et al. (2010) do describe new developments in rat genetic engineering, and the overall tenor of using rats for QTL mapping is extremely promising, given the long history of this excellent animal model of physiology and pharmacology; for additional papers on rat genomics see Anegon (2011). 19.5 BIOINFORMATICS, SYSTEMS GENETICS, AND NETWORKS Perhaps the most remarkable change in rodent genomics over the last decade, and more recently in human genetics, has been the transformation to systems approaches and perspectives (Sieberts and Schadt, 2007; Cookson et al., 2009). The combination of genome mapping with the ability to monitor the transcriptome has allowed both gene expression mapping to discover eQTL (which can very quickly nominate candidate genes for classical trait QTL; Lu et al., 2008) and grander schemes for delineating expression networks causally affected by DNA polymorphism and predictive of disease. Two studies illustrate such outstanding scientific achievement (Chen et al., 2008; Emilsson et al., 2008). Chen et al. (2008) used a large F2 intercross to define a liver and adipose tissue gene expression network by correlations with a suite of traits related to metabolic disorders and used this network to define novel obesity genes. Emilsson et al. (2008) created an adipose gene expression network from a large sample of human volunteers and discovered significant overlap with the mouse network of Chen and colleagues. 
Because the mouse network was predictive of obesity genes, the human network was examined for its relation to obesity. First, Emilsson et al. (2008) found that the human network was
robustly correlated with body mass index (BMI) across subjects. Second, and as a predictive test, Emilsson et al. (2008) genotyped a collection of SNPs from the vicinity of each gene in the human network and found a significant collective association with a large independent group of humans measured for BMI. Although the brevity of this chapter limits detailed discussion of systems genetics in mice and rats, it is worth pointing out two recently published practical reviews on the importance and utility of reproducible bioinformatic workflows in support of eQTL studies (Fisher et al., 2009) and how to actually do eQTL mapping in mice or rats (Tesson and Jansen, 2009). QTL and combined eQTL studies make use of very large data sets (that potentially change over time and build). The analysis often involves bioinformatic services and specialized programs that are only loosely connected by hyperlinks. As Fisher et al. (2009) note, researchers can quickly become overwhelmed. As one solution, researchers should know about workflow systems. Fisher et al. (2009) outline the use of one such workflow system, Taverna (www.taverna.org.uk), to support discovery of classical trait QTL candidate genes using gene expression data sets. Perhaps most useful to those considering eQTL analysis in mice or rats is the chapter by Tesson and Jansen (2009), who provide a detailed step-by-step guide to performing genomewide linkage analysis in an eQTL mapping experiment by using the R statistical framework. Tesson and Jansen (2009) provide a literal computational protocol that more or less (depending on your R skill!) demystifies the process.
19.6 PHARMACOGENOMICS AND DYNAMIC PHENOTYPING We began this chapter by acknowledging how the use of QTL mapping with mice and rats can complement the study of human genetics, especially with phenotypes that cannot be feasibly or ethically obtained. We now return to where we started by discussing a novel approach to pharmacogenomics. Use of mice and rats can elucidate novel drug targets as well as increase understanding of undesirable or toxic properties of current drugs, and thus contribute to the goals of personalized medicine (Harrill et al., 2009; Lum et al., 2009). We suggest that functional QTL mapping of drug dose-response is best applied to genetic reference populations of mice or rats. Functional mapping of QTLs uses biologically motivated nonlinear mathematical models (e.g., the four parameter logistic dose-response function (Kenakin, 2009; Motulsky and Christopoulos, 2004)) to make associations between genotype and phenotype in dynamically patterned data (Gong et al., 2004; Wu and Lin, 2006). Mapping QTL for drug effects in mice typically employs a single dose level. Because drug effects are dose dependent and heritable, choosing a single dose is not an optimal design for the study of multiple genetic backgrounds. Failure to accommodate dose-response can result in reduced signal or interpretive
error. For example (Fig. 19.1, top left), if a QTL affects the maximum response, choosing too low a dose will underestimate the maximum and reduce statistical power. Choosing too high a dose runs the risk of introducing confounding toxicity responses or altered phenotypes. If a QTL affects the ED50 (Fig. 19.1, top right), then not only is it possible to miss the optimal dose and reduce statistical power, but choosing an optimum single dose will not discriminate the differences as a simple right-shift versus a change in maximum. Functional mapping of dose-response QTL avoids these problems by estimating the full curve for each independent genotype. Functional mapping of dose-response QTL is enhanced by using isogenic lines of mice that make up a genetic reference population. Independent mice of an identical genotype can be exposed to different drug doses to achieve genotype specific curves, while also allowing invasive phenotypes. This assumes that environmental similarity is maintained aside from the differences in drug dose level. Shown in the middle left and right panels of Figure 19.1 are two demonstrations of significantly different dose-response profiles from common inbred (isogenic) strains of mice. On the left are three hyperactivity profiles in the open field test chamber in response to a drug of abuse, MDMA (Ecstasy), using two independent mice per dose level per strain (unpublished data). On the right are the dose-response profiles for two isogenic strains of mice, using three independent mice per dose level, for the head twitch response, a behavioral response to the 5-HT2A/5-HT2C receptor agonist and hallucinogen DOI (Canal et al., 2010). Means ± 1 SEM error bars are indicated. Two approaches to functional mapping of dose-response QTL using genetic reference populations of mice are shown in the bottom of Figure 19.1. The simplest approach (bottom left) is to use chromosome substitution strains. 
To locate QTL to a chromosome, each CSS is compared to a reference or control parental strain using nonlinear regression. A potentially more powerful but also more complex approach is the use of recombinant inbred lines, such as the BXD RILs, derived from a C57BL/6J × DBA/2J intercross (bottom right). BXD RIL dose-response QTL analysis requires grouping dose-response curves for a set of RILs by genotype at each of many markers along each chromosome and testing for curve shape difference by genotype. This approach requires nonlinear mixed model methods with random effects for RILs nested in genotype. The use of genetic reference populations of mice is key to enabling this approach. Genetic variation is between strains and lines rather than between individual animals in these groups of mice. A comparable approach in segregating populations, such as an F2 intercross or in human volunteers, would require multiple dosing of the same individual, which poses potentially insurmountable feasibility and ethical concerns. The use of animals also allows collection of invasive phenotypes and terminal endpoint data. Although Figure 19.1 outlines pharmacodynamic dose-response modeling, similar approaches can be developed for pharmacokinetic questions, or combined pharmacokinetic-pharmacodynamic models.
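The single-dose problem discussed in this section can be made concrete with the four parameter logistic function. The sketch below uses the common parameterization response = bottom + (top - bottom) / (1 + (ED50/dose)^slope); the three "genotypes" and all parameter values are hypothetical and chosen only so that a right-shifted curve and a reduced-maximum curve coincide at one dose.

```python
def four_parameter_logistic(dose, bottom, top, ed50, hill_slope):
    """Response at a given dose under the four parameter logistic model."""
    if dose <= 0:
        # With a positive slope the curve approaches `bottom` as dose -> 0.
        return bottom
    return bottom + (top - bottom) / (1.0 + (ed50 / dose) ** hill_slope)


if __name__ == "__main__":
    # Hypothetical genotype-specific curves (arbitrary response units).
    curves = {
        "reference":       dict(bottom=0.0, top=100.0, ed50=1.0, hill_slope=1.0),
        "right-shifted":   dict(bottom=0.0, top=100.0, ed50=3.0, hill_slope=1.0),
        "reduced maximum": dict(bottom=0.0, top=50.0,  ed50=1.0, hill_slope=1.0),
    }
    for dose in (1.0, 30.0):
        print(f"dose = {dose}")
        for name, params in curves.items():
            response = four_parameter_logistic(dose, **params)
            print(f"  {name:16s} response = {response:.1f}")
```

With these assumed parameters the right-shifted and reduced-maximum curves give the same response at a dose of 1 and separate clearly at a dose of 30, so only a multi-dose design distinguishes an ED50 effect from an effect on the maximum.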
Figure 19.1. Functional mapping of dose-response in genetic reference populations of mice. See text for details.
19.7 REFERENCES Abiola O, Angel JM, Avner P, Bachmanov AA, Belknap JK, Bennett B, Blankenhorn EP, Blizard DA, Bolivar V, Brockmann GA, Buck KJ, Bureau JF, Casley WL, Chesler EJ, Cheverud JM, Churchill GA, Cook M, Crabbe JC, Crusio WE, Darvasi A, de Haan G, Dermant P, Doerge RW, Elliot RW, Farber CR, Flaherty L, Flint J, Gershenfeld H, Gibson JP, Gu J, Gu W, Himmelbauer H, Hitzemann R, Hsu HC, Hunter K, Iraqi FF, Jansen RC, Johnson TE, Jones BC, Kempermann G, Lammert F, Lu L, Manly KF, Matthews DB, Medrano JF, Mehrabian M, Mittlemann G, Mock BA, Mogil JS, Montagutelli X, Morahan G, Mountz JD, Nagase H, Nowakowski RS, O’Hara BF, Osadchuk AV, Paigen B, Palmer AA, Peirce JL, Pomp D, Rosemann M, Rosen GD, Schalkwyk LC, Seltzer Z, Settle S, Shimomura K, Shou S, Sikela JM, Siracusa LD, Spearow JL, Teuscher C, Threadgill DW, Toth LA, Toye AA, Vadasz C, Van Zant G, Wakeland E, Williams RW, Zhang HG, Zou F; Complex Trait Consortium. (2003). The nature and identification of quantitative trait loci: a community’s view. Nat Rev Genet 4(11):911–16. Ackert-Bicknell CL, Karasik D, Li Q, Smith RV, Hsu YH, Churchill GA, Paigen BJ, Tsaih SW. (2010). Mouse BMD quantitative trait loci show improved concordance with human genome wide association loci when recalculated on a new, common mouse genetic map. J Bone Miner Res Epub ahead of print, February 23. Aitman TJ, Petretto E, Behmoaras J. (2010). Genetic mapping and positional cloning. Meth Mol Biol 597:13–32. Aldinger KA, Sokolo G, Rosenberg DM, Palmer AA, Millen KJ. (2009). Genetic variation and population substructure in outbred CD-1 mice: implications for genomewide association studies. PLoS One 4(3):e4729. Anegon I, ed. (2011). Rat Genomics. Springer, New York. Broman KW. (2001). Review of statistical methods for QTL mapping in experimental crosses. Lab Anim (NY), 30(7):44–52. Broman K, Sen S. (2009). A Guide to QTL Mapping with R/QTL. Statistics for Biology and Health, vol. 2848. Springer, New York. 
Broman KW, Wu H, Sen S, Churchill GA. (2003). R/QTL: QTL mapping in experimental crosses. Bioinformatics 19(7):889–90. Burgess-Herbert SL, Tsaih S-W, Stylianou IM, Walsh K, Cox AJ, Paigen B. (2009). An experimental assessment of in silico haplotype association mapping in laboratory mice. BMC Genet 10:81. Canal CE, Olaghere da Silva UB, Gresch PJ, Watt EE, Sanders-Bush E, Airey DC. (2010). The serotonin 2C receptor potently modulates the head-twitch response in mice induced by a phenethylamine hallucinogen. Psychopharmacology (Berl) 209(2):163–74. Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK, Leonardson A, Castellini LW, Wang S, Champy MF, Zhang B, Emilsson V, Doss S, Ghazalpour A, Horvath S, Drake TA, Lusis AJ, Schadt EE. (2008). Variations in DNA elucidate molecular networks that cause disease. Nature 452(7186):429–35. Chesler EJ, Miller DR, Branstetter LR, Galloway LD, Jackson BL, Philip VM, Voy BH, Culiat CT, Threadgill DW, Williams RW, Churchill GA, Johnson DK, Manly KF.
(2008). The collaborative cross at Oak Ridge National Laboratory: developing a powerful resource for systems genetics. Mamm Genome 19(6):382–89. Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, Beatty J, Beavis WD, Belknap JK, Bennett B, Berrettini W, Bleich A, Bogue M, Broman KW, Buck KJ, Buckler E, Burmeister M, Chesler EJ, Cheverud JM, Clapcote S, Cook MN, Cox RD, Crabbe JC, Crusio WE, Darvasi A, Deschepper CF, Doerge RW, Farber CR, Forejt J, Gaile D, Garlow SJ, Geiger H, Gershenfeld H, Gordon T, Gu J, Gu W, de Haan G, Hayes NL, Heller C, Himmelbauer H, Hitzemann R, Hunter K, Hsu HC, Iraqi FA, Ivandic B, Jacob HJ, Jansen RC, Jepsen KJ, Johnson DK, Johnson TE, Kempermann G, Kendziorski C, Kotb M, Kooy RF, Llamas B, Lammert F, Lassalle JM, Lowenstein PR, Lu L, Lusis A, Manly KF, Marcucio R, Matthews D, Medrano JF, Miller DR, Mittleman G, Mock BA, Mogil JS, Montagutelli X, Morahan G, Morris DG, Mott R, Nadeau JH, Nagase H, Nowakowski RS, O’Hara BF, Osadchuk AV, Page GP, Paigen B, Paigen K, Palmer AA, Pan HJ, Peltonen-Palotie L, Peirce J, Pomp D, Pravenec M, Prows DR, Qi Z, Reeves RH, Roder J, Rosen GD, Schadt EE, Schalkwyk LC, Seltzer Z, Shimomura K, Shou S, Sillanpää MJ, Siracusa LD, Snoeck HW, Spearow JL, Svenson K, Tarantino LM, Threadgill D, Toth LA, Valdar W, de Villena FP, Warden C, Whatley S, Williams RW, Wiltshire T, Yi N, Zhang D, Zhang M, Zou F; Complex Trait Consortium. (2004). The collaborative cross, a community resource for the genetic analysis of complex traits. Nat Genet 36(11):1133–37. Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. (2009). Mapping complex disease traits with global gene expression. Nat Rev Genet 10(3):184–94. Cox A, Ackert-Bicknell CL, Dumont BL, Ding Y, Bell JT, Brockmann GA, Wergedal JE, Bult C, Paigen B, Flint J, Tsaih SW, Churchill GA, Broman KW. (2009). A new standard genetic map for the laboratory mouse. Genetics 182(4):1335–44. Crusio WE. (2004). 
A note on the effect of within-strain sample sizes on QTL mapping in recombinant inbred strain studies. Genes Brain Behav 3(4):249–51. Darvasi A. (1998). Experimental strategies for the genetic dissection of complex traits in animal models. Nat Genet 18(1):19–24. Darvasi A. (1997). Interval-specific congenic strains (ISCS): an experimental design for mapping a QTL into a 1-centimorgan interval. Mamm Genome 8(3):163–67. Darvasi A, Soller M. (1995). Advanced intercross lines, an experimental population for fine genetic mapping. Genetics 141(3):1199–207. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, Mouy M, Steinthorsdottir V, Eiriksdottir GH, Bjornsdottir G, Reynisdottir I, Gudbjartsson D, Helgadottir A, Jonasdottir A, Jonasdottir A, Styrkarsdottir U, Gretarsdottir S, Magnusson KP, Stefansson H, Fossdal R, Kristjansson K, Gislason HG, Stefansson T, Leifsson BG, Thorsteinsdottir U, Lamb JR, Gulcher JR, Reitman ML, Kong A, Schadt EE, Stefansson K. (2008). Genetics of gene expression and its effect on disease. Nature 452(7186):423–28. Fawcett GL, Jarvis JP, Roseman CC, Wang B, Wolf JB, Cheverud JM. (2009). Fine-mapping of obesity-related quantitative trait loci in an F(9/10) advanced intercross line. Obesity 18(7):1383–92.
CHAPTER 20
Gene Discovery of Crop Disease in the Postgenome Era
YULIN JIA
Contents
20.1 Introduction
20.2 Map-Based Cloning
20.2.1 Mapping Population for Map-Based Cloning
20.3 A Plant Model: The Rice Blast System
20.3.1 The Structure and Function of Blast R Gene
20.3.2 Co-Evolution of Host R and Pathogen AVR Genes
20.4 R Gene Use in Breeding
20.5 Future Prospects
20.6 Acknowledgments
20.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
20.1 INTRODUCTION
Plants that produce essential food and fiber for human survival have been studied intensively worldwide. In nature, plants are attacked by numerous viral, bacterial, and fungal pathogens. Unlike animals and humans, plants cannot move away from pathogens. To survive in natural ecosystems, plants have evolved sophisticated innate defense systems governed by resistance (R) genes to fight invading pathogens. Each of these R genes provides robust protection against pathogens, and R genes are distributed in plant germplasm worldwide. Crop plants are plant species domesticated for human cultivation. Before the advent of breeding and genetics, farmers knew to save seeds from plants that survived disease epidemics. Since then, R genes have been accumulated in
numerous crop landraces. These landraces have been used worldwide as R gene donors in breeding for improved resistance. In contrast with animals and humans, genetic crosses in plants can be made without ethical concerns. This feature has made plants excellent models for R gene discovery. Genetic analysis of crop germplasm has so far identified hundreds of plant R genes that are effective in preventing infection by numerous races of diverse pathogens carrying the corresponding avirulence (AVR) genes (Flor, 1971). Many of these R genes have been mapped with diverse molecular markers for breeding and genetic studies, and some have been isolated. The Pto gene in tomato was the first cloned plant R gene that confers gene-for-gene resistance (Martin et al., 1993). Subsequently, numerous R genes have been cloned from other crop plants and Arabidopsis (Martin et al., 1993). Cloning and characterization of crop R genes have been greatly expedited in the postgenomic era. A wide range of biotechnologies, including numerous functional genomic tools, have been successfully applied to isolate and characterize plant R genes. Among them, the most common technique for R gene isolation is map-based (positional) cloning. Map-based cloning of an R gene takes several years and sometimes more than 10 years. The time needed can be reduced when genome sequences are available. The genome sequences of two subspecies of rice (Oryza sativa indica cv 93-11 and japonica cv Nipponbare) have allowed rapid R gene isolation using in silico cloning (Goff et al., 2002; Yu et al., 2002). Rapid improvement of sequencing technology has made whole-genome sequences available for diverse crop plants. These technological advances have laid a solid foundation for R gene discovery and for studying the molecular mechanisms of R gene-mediated defense responses.
The purposes of this chapter are to describe contemporary map-based cloning for plant R gene discovery and to review the current understanding of the structure and function of plant R genes, with emphasis on R genes against the world-renowned crop killer Magnaporthe oryzae.
20.2 MAP-BASED CLONING
Map-based cloning for plant R gene discovery involves three steps: (1) construction of a genetic linkage map using DNA markers, (2) determination of the genetic and physical locations of candidate genes, and (3) functional confirmation of candidate genes by a complementation test (Fig. 20.1).
Figure 20.1. The procedure of map-based cloning. a, Integrated genetic and physical maps with left and right borders shown. A candidate R gene is identified from a BAC or a YAC clone. b, Complementation test using either Agrobacterium-mediated transformation or particle bombardment to verify that the resistance is due to the presence of the candidate gene. The primary transformant (T1) expressing the candidate R gene is resistant, and its T2 progeny segregate in a ratio of 3 resistant:1 susceptible, demonstrating the resistance function of the R gene.
20.2.1 Mapping Population for Map-Based Cloning
20.2.1.1 Mapping Population
A mapping population is required for the construction of a genetic linkage map. Such a mapping population can be created using any sexually compatible plants. Three types of mapping population are commonly used for map-based cloning of plant genes.
1. An F2 population: An F1 hybrid is produced by crossing two parents. Selfing the F1 hybrid gives rise to multiple F2 progeny, which make up the F2 mapping population. The advantage of an F2 population is that each individual is genetically unique; the disadvantage is that phenotyping cannot be replicated, because each F2 individual exists only once and cannot be reused. One solution is to use a controlled phenotyping method. For example, detached leaf inoculation was developed to measure the phenotype of each F2 individual repeatedly (Jia et al., 2003).
2. A recombinant inbred line (RIL) population: A RIL population overcomes the disadvantage of an F2 population because the phenotype of each segregating line can be evaluated repeatedly. The most common method of developing a RIL population is to self each F2 individual for an additional three to nine generations. Because each generation of selfing halves the expected heterozygosity, each RIL at F5 to F9 is assumed to be homozygous at nearly all genetic loci. A RIL population is ideal for mapping and cloning quantitative trait loci (QTL); however, it normally takes 2 to 5 years to develop a RIL population (Jia et al., 2010). In rice,
two generations can be advanced per year, and sometimes three for short-season rice varieties under greenhouse conditions. In some tropical areas of the world, such as Los Baños in the Philippines, Puerto Rico, and Hainan Island in China, three generations can be advanced per year.
3. A doubled haploid (DH) population: A DH population is designed to reduce the time needed for developing a RIL population. F1 pollen is cultured using the anther culture method, and the chromosomes of the resulting haploids are doubled by chemical treatment so that every locus in a doubled haploid is homozygous. A DH population can be developed within 2 years; however, some rice varieties are not suitable for tissue culture because their regeneration frequencies are extremely low.
20.2.1.2 The Procedure of Map-Based Cloning
20.2.1.2.1 Construction of a Genetic Linkage Map
A genetic linkage map orders genetic markers according to the crossing-over events that occur during crosses. For example, a genetic linkage map of rice was constructed using JoinMap (Kosambi, 1944; Liu et al., 2008). The genetic distance between any two markers may differ from one linkage map to another; however, the physical locations of the majority of markers should be the same for a particular crop. For any plant genome, DNA markers that distinguish the two parents should first be identified for constructing a genetic linkage map. These polymorphic DNA markers, evenly distributed among all chromosomes, are used to determine the genotype of each segregating progeny in a mapping population. Over the past decade, genetic markers have evolved from restriction fragment length polymorphism (RFLP), to random amplified polymorphic DNA (RAPD) (Dioh et al., 2000), to simple sequence repeat (SSR), and to single nucleotide polymorphism (SNP) (Rafalski, 2002). User-friendly co-dominant DNA markers, SSR and SNP, can be readily identified in the genome if sequence is available. In rice, a high-density linkage map consisting of abundant genetic markers on each chromosome was constructed by McCouch and colleagues (1998).
20.2.1.2.2 Determination of Physical Location of the Candidate R Gene
After a genetic linkage map with evenly distributed DNA markers is constructed, the phenotype of each progeny in the mapping population should be evaluated. Replicated experiments are often required to exclude environmental effects. Replications can be easily set up for each RIL and DH line because of unlimited seed supplies. For an F2 mapping population, F2 phenotypes can be verified in F3 families. A genetic map is constructed with both the phenotype and the genotype of each individual of a mapping population.
The most common software for map construction is Mapmaker (Lincoln et al., 1992) for major R genes and QTL Cartographer (Wang et al., 2007) for quantitative trait loci (QTL).
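Map construction programs convert the recombination fraction observed between two markers into a map distance in centimorgans (cM) by means of a mapping function; the Kosambi (1944) function cited above is one common choice. The Python sketch below is illustrative only. The function names and the example counts are hypothetical and are not taken from Mapmaker, JoinMap, or any study cited here, and the simple counting estimate of the recombination fraction assumes a population in which recombinant gametes can be scored directly (for example, a backcross or DH population).

```python
import math


def recombination_fraction(recombinants: int, total: int) -> float:
    """Estimate the recombination fraction r between two markers.

    Hypothetical helper: r is the share of scored gametes that are
    recombinant between the two markers (direct counting assumption).
    """
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need total > 0 and 0 <= recombinants <= total")
    return recombinants / total


def kosambi_cm(r: float) -> float:
    """Kosambi mapping function: recombination fraction -> distance in cM."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(distance_cm: float) -> float:
    """Inverse Kosambi function: distance in cM -> recombination fraction."""
    if distance_cm < 0.0:
        raise ValueError("distance must be non-negative")
    return 0.5 * math.tanh(2.0 * distance_cm / 100.0)


if __name__ == "__main__":
    # Made-up example: 10 recombinants among 100 scored lines.
    r = recombination_fraction(recombinants=10, total=100)
    print(f"r = {r:.3f}, Kosambi distance = {kosambi_cm(r):.2f} cM")
```

For small recombination fractions the Kosambi distance is close to 100 × r (a recombination fraction of 0.10 corresponds to roughly 10.1 cM), and the correction grows as r approaches 0.5.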
The next step is to construct a physical map to determine the physical location of the candidate R gene on a chromosome. Physical mapping can be expedited by using available sequence information. A genomic library that provides several-fold coverage of the genome is needed to ensure that all possible candidate genes can be identified. The two most commonly used genomic libraries for cloning a plant R gene are the yeast artificial chromosome (YAC) library and the bacterial artificial chromosome (BAC) library. Physical mapping begins with the DNA markers most closely linked to the R locus, which are used to generate a contig of overlapping clones spanning the R locus. The YAC or BAC clones spanning the R locus are then sequenced. Candidate R genes in the contig are identified based on homology to known R genes in public databases.
20.2.1.2.3 Functional Confirmation of the Candidate Gene
The candidate R gene can be transferred into a susceptible parent using genetic transformation. Two methods of transformation are routinely used for plants.
20.2.1.2.3.1 Agrobacterium-Mediated Transformation
The most commonly used method for plant transformation is Agrobacterium-mediated transformation using binary vectors (Bevan, 1984). Agrobacterium is commonly used for transferring new genes into plant cells and for securing their stable integration into the host genome. The method rests on the biology of A. tumefaciens, a soil bacterium that causes tumors (crown gall) at wound sites by introducing the genetic information for tumor formation into the plant genome. This transferred DNA (T-DNA) is located not on the bacterial chromosome but on a Ti (tumor-inducing) plasmid, where it is flanked by the right (RB) and left (LB) borders.
Agrobacteria can therefore modify plants genetically, and the original T-DNA containing the information for tumor formation can be replaced with foreign DNA, so that Agrobacterium serves as a transport vector to introduce new genes. Because an indicator is necessary to identify transformed cells, a selectable marker such as hygromycin resistance is often placed on the Ti plasmid alongside the target gene. However, transformation using only one plasmid can lead to random introgression of unwanted plasmid DNA into the plant genome. Binary vectors were later developed to avoid such random introduction of DNA into plant cells. The native Ti plasmid carries both the T-DNA and the virulence (vir) genes that are necessary for T-DNA transfer; in a binary vector system, these two components are split between two plasmids. In the binary vector itself, the vir genes are removed, and the desired DNA is inserted between the borders. These disarmed binary plasmids are easy to propagate in E. coli and are much smaller than a normal Ti plasmid. The vir genes needed for transfer to the plant are carried on a second plasmid from which the T-DNA and both borders have been removed. Using
binary vectors, one plasmid transfers the target gene while the other assists with the transformation, which avoids random integration of unwanted plasmid DNA into the plant genome.
20.2.1.2.3.2 Particle Bombardment
Another common method of genetic transformation is particle bombardment (Christou et al., 1997). Particle bombardment uses gold particles as the vehicle to deliver plasmid DNA that expresses the candidate gene into a plant cell. Fine gold particles are coated with a plasmid containing a DNA construct that expresses the coding region of the candidate R gene and are introduced into calli together with a plasmid expressing an antibiotic selectable marker, such as hygromycin resistance.
20.2.1.2.3.3 Plant Regeneration and Progeny Analysis
Calli without a plasmid expressing hygromycin resistance do not survive, and only transformed calli give rise to seedlings in culture media containing hygromycin. All surviving seedlings can be examined for the presence of the transgene using PCR and/or Southern blot. Once primary (T1) transformants are produced, the next step is to determine their reaction to pathogen infection using standard disease infection assays. If the candidate R gene is responsible for the newly acquired resistance, the T2 progeny are expected to segregate in a ratio of 3 resistant (with the candidate R gene) to 1 susceptible (without it).
20.2.1.2.3.4 Positional Cloning of the First Gene-Specific Plant R Gene
Using the map-based cloning strategy, the Pto gene in tomato was the first race-specific R gene cloned in plants. Pto confers race-specific resistance to Pseudomonas syringae DC3000, which causes bacterial speck disease. The Pto gene was mapped to within 3 cM with RFLP markers, using an F2 mapping population of 251 individuals.
The candidate R gene was identified on a YAC clone from a YAC library using an RFLP marker that co-segregated with resistance (Martin et al., 2003). After a decade of investigation, the molecular mechanisms underlying the Pto gene-mediated defense response are now well understood (Martin et al., 2003). Subsequently, more R genes have been isolated from tomato, pepper, tobacco, and rice (Martin et al., 2003).
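The 3 resistant:1 susceptible ratio expected among T2 progeny in the progeny analysis above can be checked with a chi-square goodness-of-fit test. The sketch below is a minimal Python illustration that assumes a single dominant transgene locus; the function name and the progeny counts are hypothetical and do not come from any experiment described in this chapter.

```python
import math
from typing import Tuple


def chi_square_3_to_1(resistant: int, susceptible: int) -> Tuple[float, float]:
    """Goodness-of-fit test of observed counts against a 3:1 ratio.

    Returns (chi-square statistic, p-value) for 1 degree of freedom.
    No continuity correction is applied.
    """
    total = resistant + susceptible
    if resistant < 0 or susceptible < 0 or total == 0:
        raise ValueError("counts must be non-negative and not both zero")
    expected_r = 0.75 * total  # expected number of resistant plants
    expected_s = 0.25 * total  # expected number of susceptible plants
    chi2 = ((resistant - expected_r) ** 2 / expected_r
            + (susceptible - expected_s) ** 2 / expected_s)
    # With 1 degree of freedom the chi-square survival function reduces to
    # erfc(sqrt(chi2 / 2)), so no statistics library is required.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts for illustration: 78 resistant and 22 susceptible plants.
    chi2, p = chi_square_3_to_1(resistant=78, susceptible=22)
    verdict = "consistent with" if p > 0.05 else "deviates from"
    print(f"chi-square = {chi2:.2f}, p = {p:.3f}: {verdict} a 3:1 ratio")
```

A p-value above the chosen threshold (0.05 here, itself a convention rather than a requirement) means the observed counts do not contradict single-locus segregation; it does not by itself prove that the candidate gene confers the resistance.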
20.3 A PLANT MODEL: THE RICE BLAST SYSTEM
Rice blast disease, caused by the filamentous ascomycete fungus Magnaporthe oryzae (formerly M. grisea), is one of the most damaging rice diseases (Khush and Jena, 2007). Rice, one of the most important food crops, is cultivated under diverse environments around the globe. M. oryzae is highly adaptive to the environment, and infection begins with asexual conidia. The conidial infection is a semi-biotrophic process resulting in loss of productivity and quality of rice grains (Howard et al., 1991; Fig. 20.2). M. oryzae shifted its host to rice after rice became cultivated worldwide. In turn, R genes in rice have evolved over time to prevent infection by M. oryzae. The AVR genes in strains of M. oryzae, which determine the efficacy of R genes, are evidently highly unstable (Kang et al., 2001; Zhou et al., 2007). Under favorable environmental conditions, blast epidemics sometimes cause significant economic losses in some rice-growing areas. This never-ending battle between rice and M. oryzae makes the rice blast system a model for plant R gene discovery.
Figure 20.2. Symptoms of rice blast disease on rice plants without the major blast R genes. a, Four asexual conidia of Magnaporthe oryzae that infect rice seedlings. b, Leaf blast with a close-up of an eye-shaped lesion from an irrigated rice field in the United States. c, Leaf blast showing severely diseased leaves in an upland rice field in Colombia. d, Panicle blast showing affected grain from an irrigated rice field in Arkansas.
20.3.1 The Structure and Function of Blast R Gene
Historically, the blast R genes, called Pi genes, confer resistance to the blast fungus in a race-specific manner (Silue et al., 1992). The Pi genes have commonly been discovered in landrace varieties and wild rice relatives. Today, rice blast R genes are one of the best plant models for understanding the
coevolutionary dynamics of R genes with their cognate AVR genes. Thus far, over 80 blast Pi genes have been tagged with molecular markers; some of them have been cloned, and others have been used in breeding for improved resistance (Ballini et al., 2008). Like other plant R genes, the 13 cloned blast R genes encode putative receptor proteins with nucleotide-binding site and leucine-rich repeat (NBS-LRR) domains, except for Pi-d2 and pi21. Pi-d2 encodes a predicted B-lectin receptor kinase, and pi21 encodes a defective proline-containing protein (Table 20.1). Pi-b was the first blast R gene cloned and encodes a predicted protein with an NBS-LRR domain (Wang et al., 1999). Pi-ta, the second blast R gene isolated from rice, encodes a putative receptor with an NBS and a degenerate LRR domain, referred to as the LRD (Fig. 20.3; Bryan et al., 2000; Jia et al., 2000). Three blast R genes, Pi-ta, Pi-d2, and Pi-36, are single-copy genes in the rice genome; the others are members of small gene families. Interestingly, two members of each of the Pi-km and Pi-5 families are required for complete resistance to some races of the blast fungus (Ashikawa et al., 2008; Lee et al., 2009b). The ability of R proteins to recognize pathogen signaling molecules is the most critical component of signal transduction. Such recognition can be direct or indirect. The Pto gene in tomato was the first plant R gene whose product was demonstrated to bind to the pathogen signaling molecule (Scofield et al., 1996; Tang et al., 1996). Pi-ta is so far the only blast R gene whose product has, like that of Pto, been shown to act as a cytoplasmic receptor for a putative product of the corresponding AVR gene, AVR-Pita176 (Jia et al., 2000). A model for Pi-ta-mediated resistance has been proposed (Fig. 20.4). In this model, the AVR gene product AVR-Pita, with 223 amino acids, is processed by an unknown mechanism into an active protein of 176 amino acids, AVR-Pita176.
TABLE 20.1. DNA Sequences, Chromosomal Locations, and Structural Characteristics of Cloned Rice Blast R Genes
Name of R Gene | GenBank Accession(a) | Chromosome | Motif | Reference
Pi-b | AB013448 | 2 | NBS-LRR | Wang et al. (1999)
Pi-ta | AF207842 | 12 | NBS-LRR | Bryan et al. (2000)
Pi-9 | ABB88855 | 6 | NBS-LRR | Qu et al. (2006)
Pi2/Pizt | ABC94599/DQ352040 | 6 | NBS-LRR | Zhou et al. (2006)
Pi-d2 | Not available | 6 | B-lectin receptor kinase | Chen et al. (2006)
Pi36 | DQ90896 | 8 | NBS-LRR | Li et al. (2009)
Pi37 | DQ923494 | 1 | NBS-LRR | Lin et al. (2007)
Pi5 | EU869185 and EU869186 | 9 | NBS-LRR | Lee et al. (2009)
Pit | AB379815-AB379822 | 1 | NBS-LRR | Hayashi et al. (2009)
Pikm | AB462256, AB462324, and AB462325 | 11 | NBS-LRR | Ashikawa et al. (2008)
Pi-d3 | FJ745365 (Pid3-A4), FJ745366 (Pid3-ZYQ8), FJ745367 (Pid3-TP309), FJ745368 (Pid3-LTH), FJ773285 (Pid3-9311), FJ773286 (Pid3-Nip) | 6 | NBS-LRR | Shang et al. (2009)
Pi-21 | AB430852, AB430853, and AB430854 | 4 | Defective proline protein | Fukuoka et al. (2009)
(a) Available at www.ncbi.nlm.nih.gov.
The AVR-Pita176 protein was demonstrated to interact with the Pi-ta protein, an interaction that may also involve the Pi-ta2 protein in resistance to other races of M. oryzae (Bryan et al., 2000; Jia et al., 2000). As a result of these interactions, defense signals are triggered and presumably transferred to another plant modifier, Ptr(t), subsequently activating the expression of plant defense genes that stop the invading blast fungus (Jia et al., 2002; Jia and Martin, 2008). To date, four other NBS-LRR proteins have been shown to physically interact with cognate AVR proteins in other plant-pathogen systems: the Arabidopsis thaliana protein RRS1-R with PopP2 from Ralstonia solanacearum (Deslandes et al., 2003); the flax L5, L6, and L7 proteins with AvrL567 proteins of flax rust (Melampsora lini) (Dodds et al., 2006); N from tobacco with the p50 elicitor from tobacco mosaic virus (Ueda et al., 2006); and the flax M protein with AvrM of flax rust (Catanzariti et al., 2010). Indirect recognition of pathogen effectors by R proteins was demonstrated in the A. thaliana and P. syringae system (Axtell et al., 2003; Mackey et al., 2003; Shao et al., 2003). In the rice blast system, isolating the cognate AVR gene is not easy because genetic crosses are difficult in M. oryzae. It can take a long time to isolate one AVR gene using map-based cloning (Orbach et al., 2000). Pi-ta and AVR-Pita are still the only pair of R and AVR genes that is well characterized thus far. By genomewide association, three AVR genes from
M. oryzae were rapidly isolated (Yoshida et al., 2009). Similarly, isolation of blast R genes has been accelerated by available genomic resources. In fact, 10 of 13 blast R genes were cloned within the past 5 years (Table 20.1). With more matched R and AVR genes cloned in the future, the recognition mechanisms of blast R genes can be further examined. 20.3.2 Co-Evolution of Host R and Pathogen AVR Genes The structures of rice R genes are extremely conserved. However, AVR genes are known to encode random molecules that may play important roles in pathogen fitness and pathogenicity. One outstanding question is how host R genes have evolved the ability to detect these random molecules from the pathogens. In the blast system, transposition, alternative splicing, gene
Figure 20.3. Cloning of the rice blast resistance gene Pi-ta using map-based cloning. Pi-ta-mediated resistance was located near the centromere of chromosome 12, and a 1-Mb BAC contig was assembled and sequenced. All candidates were analyzed with bioinformatic tools, and the candidate for Pi-ta was identified and isolated from the contig. The resistance function of the candidate Pi-ta gene was verified by transforming the candidate gene into a susceptible rice cultivar (Bryan et al., 2000).
Figure 20.4. A model for the Pi-ta gene-mediated disease resistance response. The avirulence gene product AVR-Pita is predicted to be processed into an active form, AVR-Pita176, in plant cells by an unknown mechanism. Once inside the plant cell, AVR-Pita176 is predicted to bind to the Pi-ta protein. Binding of Pi-ta to AVR-Pita176 may require the Pi-ta2 and Ptr(t) proteins to activate the signaling that produces plant proteins preventing further invasive growth of the blast fungus.
clustering, diversification, and genomic rearrangements are known mechanisms of genetic change that drive the co-evolution of host R genes with pathogen AVR genes.
20.3.2.1 Transposition
Transposons are sequences of DNA that can move to different positions within the genome. Transposons can influence the structure of genes and genomes via insertion, excision, chromosome breakage, and ectopic recombination, often with alteration of gene expression (Bennetzen, 2000). In the rice blast system, Pi-ta is located near the centromere of rice chromosome 12, a region that contains fewer active genes than other regions of the chromosome. Interestingly, a transposon was found in the promoter region of the Pi-ta gene. The presence of this transposon was found to be strictly associated with resistance in the rice germplasm surveyed (Lee et al., 2009a). Similarly, an ancient blast R gene, Pit, was shown to be activated by another transposon in its promoter region (Hayashi et al., 2009). Both cases led to the hypothesis that transposons play a positive role in regulating ancient blast R genes.
20.3.2.2 Alternative Splicing
Alternative splicing is a process by which the exons of the RNA produced by transcription of a gene are reconnected in multiple ways during RNA splicing. The resulting mRNAs may be translated into different protein isoforms; therefore, a single gene may code for multiple proteins. The Pi-ta gene was predicted to encode 12 distinct putative products of between 315 and 1033 amino acids. Among them, five preserve complete NBS-LRR domains and two couple the original NBS-LRR domain of the Pi-ta protein with a C-terminal thioredoxin (TRX) domain. Gene expression analysis demonstrated that transcript variants encoding the TRX domain had the highest level of expression in comparison with other full-length or truncated transcripts. These posttranscriptional modifications of Pi-ta produce a series of transcripts that could have a significant impact on newly evolved resistance specificity.
20.3.2.3 Clustering
Historically, resistance to different pathogens, or to different races of the same pathogen, has often been mapped within a small genomic interval. The presence of clustered R genes suggests that R genes in plant genomes have evolved in clusters to fight against pathogens. This is also true for most of the cloned blast R genes because they are members of small gene families. Whether these family members are in fact R genes remains to be demonstrated. Most notably, a large linkage block (5.4–27 Mbp) was found in rice cultivars that contain the resistant Pi-ta alleles. These findings suggest that many genes involved in gene-specific blast resistance may reside within a small genomic region on the same chromosome.
20.3.2.4 Diversification
Surveys of Pi-ta alleles in a wide range of rice germplasm and wild rice relatives revealed that selection constraints had occurred at the Pi-ta locus in cultivars but not in wild rice relatives (Jia et al., 2003; Lee et al., 2009a; Wang et al., 2008).
Diversification was not common at the Pi-ta gene among cultivated rice varieties (O. sativa); however, diversification was more pronounced in O. rufipogon, a predicted ancestor of the cultivated species of rice. In contrast, pathogen AVR gene products are often involved in promoting virulence and fitness. Diversification of AVR genes is one of the most important strategies that the fungus employs to overcome resistance controlled by R genes. Surveys thus far have identified 37 AVR-Pita variants with minor amino acid differences in field isolates of M. oryzae. In addition, partial and complete deletions and frameshift mutations were found in the AVR-Pita variants in virulent field and laboratory races (Zhou et al., 2007). These genomic rearrangements can alter the stability of resistance conferred by deployed R genes, eventually leading to severe blast epidemics.
20.4 R GENE USE IN BREEDING
The use of plant R genes is the most economical and environmentally friendly method of crop protection. For a long time, breeding for improved resistance
has been accomplished through traditional genetic crosses of donor parents with recurrent parents, followed by selection of resistant individuals in subsequent generations. In practically any given rice variety, the exact number of R genes present is unknown; similarly, any race of M. oryzae often carries an unknown number of AVR genes. Despite this fact, an effective international differential system has been in place for predicting the spectra of R genes. It is known that resistance to one race of M. oryzae is governed by a matched pair of an R gene in rice and an AVR gene in M. oryzae; interaction of the products of the R and AVR genes can result in complete resistance. The presence of one matched pair can therefore mask expression of other matched pairs of R and AVR genes. Selection based on disease reactions can identify resistant progeny but cannot accurately identify a particular R gene. DNA markers closely linked to an R gene can be used immediately for R gene selection in classical plant breeding using marker-assisted selection (MAS). The use of markers is a relatively new tool in plant breeding programs. Under normal circumstances, the use of markers can reduce the time needed to develop a cultivar because trait selection can be made at the seed and seedling stages under controlled laboratory settings. One short-term benefit of cloning a plant R gene is the ability to develop DNA markers from portions of the cloned gene. These markers are derived from the R genes themselves and are referred to as perfect markers. Perfect markers for two blast R genes, Pi-ta and Pi-b, have been developed (Fjellstrom et al., 2004; Jia et al., 2000, 2002, 2009; Wang et al., 2007). Among them, the perfect markers for the Pi-ta gene have been used effectively for MAS since 2002 (Jia et al., 2009). For the long term, the cloned R genes can be used directly to engineer durable resistance to the pathogens.
Transferring R genes using transformation into advanced breeding lines can accelerate R gene incorporation and also avoid linkage drag associated with resistance (Jia, 2009).
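The masking effect described in this section can be illustrated with a minimal Python sketch of the gene-for-gene model. The matched pairs Pi-ta/AVR-Pita and Piz-t/AvrPiz-t come from the chapter and its reference list; the two rice lines, the race, and all function names are hypothetical and serve only as illustration.

```python
# Matched R/AVR pairs named in the chapter. Everything else below (lines, race,
# function names) is a hypothetical illustration, not real genotype data.
MATCHED_PAIRS = {"Pi-ta": "AVR-Pita", "Piz-t": "AvrPiz-t"}


def matched_r_genes(host_r_genes, race_avr_genes):
    """Return the host R genes whose cognate AVR gene is carried by the race."""
    return sorted(
        r for r in host_r_genes if MATCHED_PAIRS.get(r) in race_avr_genes
    )


def disease_reaction(host_r_genes, race_avr_genes):
    """A single matched R/AVR pair is enough for a resistant reaction."""
    if matched_r_genes(host_r_genes, race_avr_genes):
        return "resistant"
    return "susceptible"


# Hypothetical race carrying both AVR genes, and two hypothetical rice lines.
race = {"AVR-Pita", "AvrPiz-t"}
line_a = {"Pi-ta"}
line_b = {"Pi-ta", "Piz-t"}

# Both lines give the same phenotype, so inoculation alone cannot reveal
# that line_b carries a second R gene (the masking effect).
reaction_a = disease_reaction(line_a, race)
reaction_b = disease_reaction(line_b, race)

# A gene-derived marker reports the genotype directly, independent of the race.
has_piz_t_a = "Piz-t" in line_a
has_piz_t_b = "Piz-t" in line_b
```

In this sketch the phenotype is identical for both lines, whereas the marker test separates them, which is the rationale given above for selecting on markers instead of on disease reactions alone.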
20.5 FUTURE PROSPECTS
It is well established that plants have evolved an array of highly regulated defense strategies to prevent pathogen invasion. Among them, R genes regulate effector-triggered immune responses. Rapid advances in biotechnology, including controlled phenotyping, DNA sequence analysis, gene expression analysis using DNA microarrays, and serial analysis of gene expression, have accelerated efforts in crop R gene discovery and use. With the dramatic reduction in the cost of biotechnology, the speed of isolation and characterization of plant R genes will increase greatly. Cloning and characterization of crop R genes have facilitated a better understanding of the molecular mechanisms of disease resistance, co-evolution mechanisms, interaction and signaling recognition, and transduction (Jia et al., 2000). However, several challenges lie ahead, including understanding how crop R genes keep pace with the rapid changes of pathogen effectors. Specifically, (1) the pathogen effector is meant
to promote plant diseases; however, cellular targets for the effector proteins in plant cells are still unknown. (2) The existence of a master controller(s) for plant disease resistance is undetermined. If there is a master controller, why has it not been discovered? (3) Pathogens can overcome engineered resistance a short time after resistance deployment. A methodology that ensures the durability of resistance mediated by R genes has not been demonstrated. (4) Gene flow has been commonly observed in crop fields; however, methods to prevent crop R genes from escaping to weedy relatives of crop plants have not been developed. Finally, (5) the penalty for crop productivity and quality is unknown if a crop plant is immune to invading pathogens (Tian et al., 2003). Besides these challenges, an improved, well-defined genetic system is urgently needed to study resistance to necrotrophic pathogens, such as the soilborne fungal pathogen Rhizoctonia solani, which causes rice sheath blight disease. For sheath blight disease, the major R genes are not functional, and the mechanisms are unknown. Recently, progress has been made in improving a phenotyping method and tagging the major QTLs for MAS (Jia et al., 2007; Liu et al., 2009). Continued identification and cloning of major QTLs will be another important priority for crop protection. In summary, cloned plant R genes can be used directly for genetic engineering of effective resistance and for MAS. There is no doubt that genetic engineering is one of the fastest approaches to developing disease-resistant crops. However, MAS also holds great promise even though it is a relatively young tool in crop breeding (Jia, 2003). The knowledge gained thus far from characterized plant R genes has established a solid foundation for continued exploration of sophisticated natural defense systems, for effective crop protection, and for stable crop production that should benefit humanity.
20.6 ACKNOWLEDGMENTS
The author thanks the present and past members of the Molecular Plant Pathology program of USDA-ARS Dale Bumpers National Rice Research Center (DB NRRC) for excellent technical support, Melissa Jia (staff scientist) and Ellen McWhirter (English editor) of DB NRRC for proofreading, Seonghee Lee (Plant Pathologist, Noble Foundation) and Stefano Costanzo (Plant Pathologist, DB NRRC) for critical reading, and ARS 301 National Program, National Science Foundation and Arkansas Rice Research and Promotion Board for financial support.
20.7 REFERENCES
Ashikawa I, Hayashi N, Yamane H, Kanamori H, Wu J, Matsumoto T, Ono K, Yano M. (2008). Two adjacent nucleotide-binding site-leucine-rich repeat class genes are required to confer Pikm-specific rice blast resistance. Genetics 180:2267–76.
Axtell MJ, Chisholm ST, Dahlbeck D, Staskawicz BJ. (2003). Genetic and molecular evidence that the Pseudomonas syringae type III effector protein AvrRpt2 is a cysteine protease. Mol Microbiol 49:1537–46.
Ballini E, Morel JB, Droc G, Price A, Courtois B, Notteghem JL, Tharreau D. (2008). A genome-wide meta-analysis of rice blast resistance genes and quantitative trait loci provides new insights into partial and complete resistance. Mol Plant Microbe Interact 21:859–68.
Bennetzen JL. (2000). Transposable element contributions to plant gene and genome evolution. Plant Mol Biol 42:251–69.
Bevan M. (1984). Binary Agrobacterium vectors for plant transformation. Nucl Acids Res 12:8711–21.
Bryan GT, Wu K, Farrall L, Jia Y, Hershey HP, McAdams SA, Faulk KN, Donaldson GK, Tarchini R, Valent B. (2000). A single amino acid difference distinguishes resistant and susceptible alleles of rice blast resistance gene Pi-ta. Plant Cell 12:2033–45.
Catanzariti A-M, Dodds PN, Ve T, Kobe B, Ellis JG, Staskawicz BJ. (2010). The AvrM effector from flax rust has a structured C-terminal domain and interacts directly with the M resistance protein. Mol Plant Microbe Interact 23:49–57.
Chen X, Shang J, Chen D, Lei C, Zou Y, Zhai W, Liu G, Xu J, Ling Z, Cao G, Ma B, Wang Y, Zhao X, Li S, Zhu L. (2006). A B-lectin receptor kinase gene conferring rice blast resistance. Plant J 46:794–804.
Christou P. (1997). Rice transformation: bombardment. Plant Mol Biol 35:193–203.
Deslandes L, Olivier J, Peeters N, Feng DX, Khounlotham M, Boucher C, Somssich I, Genin S, Marco Y. (2003). Physical interaction between RRS1-R, a protein conferring resistance to bacterial wilt, and PopP2, a type III effector targeted to the plant nucleus. Proc Natl Acad Sci U S A 100:8024–29.
Dioh W, Tharreau D, Notteghem JL, Orbach M, Lebrun MH. (2000). Mapping of avirulence genes in the rice blast fungus, Magnaporthe grisea, with RFLP and RAPD markers. Mol Plant Microbe Interact 13:217–27.
Dodds PN, Lawrence GJ, Catanzariti A, Teh T, Wang CI, Ayliffe MA, Kobe B, Ellis JG. (2006). Direct protein interaction underlies gene-for-gene specificity and coevolution of the flax resistance genes and flax rust avirulence genes. Proc Natl Acad Sci U S A 103:8888–93.
Fjellstrom RG, Conaway-Bormans CA, McClung AM, Marchetti MA, Shank AR, Park WD. (2004). Development of DNA markers suitable for marker assisted selection of three Pi genes conferring resistance to multiple Pyricularia grisea pathotypes. Crop Sci 44:1790–98.
Flor HH. (1971). Current status of the gene-for-gene concept. Annu Rev Phytopathol 9:275–96.
Fukuoka S, Saka N, Koga H, Ono K, Shimizu T, Ebana K, Hayashi N, Takahashi A, Hirochika H, Okuno K, Yano M. (2009). Loss of function of a proline-containing protein confers durable disease resistance in rice. Science 325:998–1001.
Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J et al. (2002). A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100.
Hayashi K, Yoshida H. (2009). Refunctionalization of the ancient rice blast disease resistance gene Pit by the recruitment of a retrotransposon as a promoter. Plant J 57:413–25.
Howard R, Ferrari J, Roach MA, Roach DH, Money NP. (1991). Penetration of hard substrates by a fungus employing enormous turgor pressures. Proc Natl Acad Sci U S A 88:11281–284.
Jia Y. (2003). Marker assisted selection for the control of rice blast disease. Pesticide Outlook 14:150–52.
Jia Y. (2009). Artificial introgression of a large chromosome fragment around the rice blast resistance gene Pi-ta in backcross progeny and several elite rice cultivars. Heredity 103:333–39.
Jia Y, Correa-Victoria FJ, McClung A, Zhu L, Liu G, Wamishe Y, Xie J, Marchetti MA, Pinson SRM, Rutger JN, Correll JC. (2007). Rapid determination of rice cultivar responses to the sheath blight pathogen Rhizoctonia solani using a micro-chamber screening method. Plant Dis 91:485–89.
Jia Y, Lee F, McClung A. (2009). Determination of resistance spectra to US races of Magnaporthe oryzae causing blast in a recombinant inbred line population. Plant Dis 93:639–44.
Jia Y, Martin R. (2008). Identification of a new locus, Ptr(t), required for rice blast resistance gene Pi-ta-mediated resistance. Mol Plant Microbe Interact 21:396–403.
Jia Y, McAdams S, Bryan G, Hershey H, Valent B. (2000). Direct interaction of resistance gene and avirulence gene products confers rice blast resistance. EMBO J 19:4004–14.
Jia Y, Moldenhauer K. (2010). Development of mono and digenic rice lines of rice blast resistance gene Pi-ta, Pi-k(s/h). Plant Reg 4:163–66.
Jia Y, Valent B, Lee FN. (2003). Determination of host responses to Magnaporthe grisea on detached rice leaves using a spot inoculation method. Plant Dis 87:129–33.
Jia Y, Wang Z, Singh P. (2002). Development of dominant rice blast resistance Pi-ta gene markers. Crop Sci 42:2145–49.
Kang S, Lebrun MH, Farrall L, Valent B. (2001). Gain of virulence caused by insertion of a Pot3 transposon in a Magnaporthe grisea avirulence gene. Mol Plant Microbe Interact 14:671–74.
Khush G, Jena K. (2007). Current status and future prospects of research on blast disease in rice (Oryza sativa). Paper presented at the 4th International Rice Blast Conference, Changsha, China.
Kosambi DD. (1944). The estimation of map distances from recombination values. Ann Eugen 12:172–75.
Lee S, Costanzo S, Jia Y, Olsen K, Caicedo A. (2009a). Evolutionary dynamics of the genomic region around the blast resistance gene Pi-ta in AA genome Oryza species. Genetics 183:1315–25.
Lee SK, Song MY, Seo YS, Kim HK, Ko S, Cao PJ, Suh JP, Yi G, Roh JH, Lee S, An G, Hahn TR, Wang GL, Ronald P, Jeon JS. (2009b). Rice Pi5-mediated resistance to Magnaporthe oryzae requires the presence of two coiled-coil-nucleotide-binding-leucine-rich repeat genes. Genetics 181:1627–38.
Li B, Wang J, Wu Y, Hu X, Zhang Z, Zhang Q, Zhao Q, Feng H, Zhang Z, Wang GL, Wang G, Lu B, Han Z, Wang Z, Zhou B. (2009). The Magnaporthe oryzae avirulence
gene AvrPiz-t encodes a predicted secreted protein that triggers the immunity in rice mediated by the blast resistance gene Piz-t. Mol Plant Microbe Interact 22:411–20.
Lin F, Chen S, Que Z, Wang L, Liu X, Pan Q. (2007). The blast resistance gene Pi37 encodes a nucleotide binding site-leucine-rich repeat protein and is a member of a resistance gene cluster on rice chromosome 1. Genetics 177:1871–80.
Lincoln S, Daly M, Lander ES. (1992). Constructing Genetic Maps with MAPMAKER/EXP 3.0 in Whitehead Institute Technical Report. 2nd ed., Whitehead Institute, Cambridge, MA.
Liu G, Bernhardt JL, Jia MH, Wamishe YA, Jia Y. (2008). Molecular characterization of the recombinant inbred line population derived from a japonica-indica rice cross. Euphytica 159:73–82.
Liu G, Jia Y, Correa-Victoria F, Prado GA, Yeater KM, McClung A, Correll JC. (2009). Mapping quantitative trait loci responsible for resistance to sheath blight in rice. Phytopathology 99:1078–84.
Liu X, Lin F, Wang L, Pan Q. (2007). The in silico map-based cloning of Pi36, a rice coiled-coil–nucleotide-binding site–leucine-rich repeat gene that confers race-specific resistance to the blast fungus. Genetics 176:2541–49.
Mackey D, Belkhadir Y, Alonso JM, Ecker JR, Dangl JL. (2003). Arabidopsis RIN4 is a target of the type III virulence effector AvrRpt2 and modulates RPS2-mediated resistance. Cell 112:379–89.
Martin GB, Bogdanove A, Sessa G. (2003). Understanding the functions of plant disease resistance proteins. Annu Rev Plant Biol 54:23–61.
Martin GB, Brommonschenkel S, Chunwongse J, Frary A, Ganal MW, Spivey R, Wu T, Earle ED, Tanksley SD. (1993). Map-based cloning of a protein kinase gene conferring disease resistance in tomato. Science 262:1432–36.
McCouch SR, Kochert G, Yu Z, Wang Z, Khush GS, Coffman WR, Tanksley SD. (1998). Molecular mapping of rice chromosomes. Theor Appl Genet 76:815–29.
Orbach MJ, Farrall L, Sweigard JA, Chumley FG, Valent B. (2000). A telomeric avirulence gene determines efficacy for the rice blast resistance gene Pi-ta. Plant Cell 12:2019–32.
Qu S, Liu G, Zhou B, Bellizzi M, Zeng L, Dai L, Han B, Wang GL. (2006). The broad-spectrum blast resistance gene Pi9 encodes a nucleotide-binding site–leucine-rich repeat protein and is a member of multigene family in rice. Genetics 172:1901–14.
Rafalski A. (2002). Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol 5:94–100.
Scofield SR, Tobias CM, Rathjen JP, Chang JH, Lavelle DT, Michelmore RW, Staskawicz BJ. (1996). Molecular basis of gene-for-gene specificity in bacterial speck disease of tomato. Science 268:661–67.
Shang J, Tao Y, Chen X, Zou Y, Lei C, Wang J, Li X, Zhao X, Zhang M, Lu Z, Xu J, Cheng Z, Wan J, Zhu L. (2009). Identification of a new rice blast resistance gene, Pid3, by genomewide comparison of paired nucleotide-binding site-leucine-rich repeat genes and their pseudogene alleles between the two sequenced rice genomes. Genetics 182:1303–11.
Shao F, Golstein C, Ade J, Stoutemyer M, Dixon JE, Innes RW. (2003). Cleavage of Arabidopsis PBS1 by a bacterial type III effector. Science 301:1230–33.
Silue D, Notteghem JL, Tharreau D. (1992). Evidence for a gene for gene relationship in the Oryza sativa-Magnaporthe grisea pathosystem. Phytopathology 82:577–82.
Tang X, Frederick R, Zhou J, Halterman DA, Jia Y, Martin GB. (1996). Initiation of plant disease resistance by physical interaction of avrPto and Pto kinase. Science 274:2060–63.
Tian D, Traw MB, Chen JQ, Kreitman M, Bergelson J. (2003). Fitness costs of R-gene-mediated resistance in Arabidopsis thaliana. Nature 423:74–77.
Ueda H, Yamaguchi Y, Sano H. (2006). Direct interaction between the tobacco mosaic virus helicase domain and the ATP-bound resistance protein, N factor during the hypersensitive response in tobacco plants. Plant Mol Biol 61:31–45.
Wang S, Basten CJ, Zeng ZB. (2007). Windows QTL Cartographer 2.5. Department of Statistics, North Carolina State University, Raleigh. Available at http://statgen.ncsu.edu/qtlcart/WQTLCart.htm.
Wang X, Jia Y, Shu QY, Wu D. (2008). Haplotype diversity at the Pi-ta locus in cultivated rice and its wild relatives. Phytopathology 98:1305–11.
Wang X, Yano M, Yamanouchi U, Iwamoto M, Monna L, Hayasaka H, Katayose Y, Sasaki T. (1999). The Pi-b gene for rice blast resistance belongs to the nucleotide binding and leucine-rich repeat class of plant disease resistance genes. Plant J 19:55–64.
Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Yoshida K, Tosa Y, Chuma I, Takano Y, Win J, Kamoun S, Terauchi R. (2009). Association genetics reveals three novel avirulence genes from the rice blast fungal pathogen Magnaporthe oryzae. Plant Cell 21:1573–91.
Yu J, Hu S, Wang J, Wong G, Li S, Liu B, Deng Y, et al. (2002). A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92.
Zhou E, Jia Y, Singh P, Correll JC, Lee FN. (2007). Instability of the Magnaporthe oryzae avirulence gene AVR-Pita alters virulence. Fungal Genet Biol 44:1024–34.
Zhou B, Qu S, Liu G, Dolan M, Sakai H, Lu G, Bellizzi M, Wang G. (2006). The eight amino acid differences within three leucine-rich repeats between Pi2 and Piz-t resistance proteins determine the resistance specificity to Magnaporthe grisea. Mol Plant Microbe Interact 19:1216–28.
CHAPTER 21
Impact of Genomewide Structural Variation on Gene Discovery
LISENKA E.L.M. VISSERS and JORIS A. VELTMAN
Contents
21.1 A Historical Perspective of the Detection of Genomewide Structural Variation and Its Relevance to Disease
  21.1.1 Human Genomic Variation and Visualization of Structural Variants
  21.1.2 Chromosomal Rearrangements Causing Disease
  21.1.3 The Detection of Submicroscopic Chromosomal Rearrangements
  21.1.4 The Clinical Consequence of Submicroscopic Chromosome Rearrangements
  21.1.5 Array-Based Comparative Genomic Hybridization
21.2 The Basic Concept for Disease Gene Discovery through Genomewide Profiling Strategies
  21.2.1 Single Gene Disorders
  21.2.2 Contiguous Gene Syndromes
  21.2.3 Point Mutations, Deletions, and Duplications May Lead to the Same Phenotype
21.3 Disease Gene Discovery through Genomewide Profiling Strategies
  21.3.1 CHARGE Syndrome
  21.3.2 The 9q Subtelomeric Deletion Syndrome
  21.3.3 Defining New Microdeletion Syndromes
21.4 Disease Gene Identification for Common Diseases
  21.4.1 Rare CNVs in Common Diseases
21.5 Discriminating the Disease-Related CNV from All Normal CNVs
  21.5.1 Forging Links between Human Phenotypes and Mouse Gene Knockout Models
21.6 Next-Generation Sequencing for the Detection of Structural Variation
  21.6.1 CNV Detection Using Shotgun Sequencing
  21.6.2 CNV Detection Using Mate-Pair Sequencing
21.7 Conclusion
21.8 Questions and Answers
21.9 Acknowledgments
21.10 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
21.1 A HISTORICAL PERSPECTIVE OF THE DETECTION OF GENOMEWIDE STRUCTURAL VARIATION AND ITS RELEVANCE TO DISEASE
21.1.1 Human Genomic Variation and Visualization of Structural Variants
Human genomic variation is present in many forms, including single nucleotide polymorphisms (SNPs), variable number of tandem repeats (VNTRs), transposable elements, and structural variation. All these variants differ among individuals and as such define human phenotypic variation. Whereas SNPs involve only a single base, structural chromosome variants involve multiple bases and are defined by a chromosome breakage and subsequent joining in an altered configuration. Large structural variants, involving millions of bases, can be visualized using light microscopy, which is referred to as karyotyping. The altered configuration may lead to a loss or gain of genetic material, in which case the rearrangement is unbalanced. Alternatively, all genetic material is retained, in which case the new configuration may be fully balanced. Structural chromosome abnormalities include (1) deletions, (2) duplications, (3) isochromosomes, (4) ring chromosomes, (5) inversions, and (6) translocations (Fig. 21.1). By definition, deletions, isochromosomes, duplications, and ring chromosomes are unbalanced in nature. Translocations and inversions can either be balanced or unbalanced. Structural chromosome rearrangements can lead to a wide variety of serious clinical manifestations, including mental retardation (MR) and congenital malformations, as the altered configuration may affect several genes. The exact clinical manifestations depend on the size of the rearrangement and the genetic information affected. That is, larger genome segments are likely to contain more genes, and as such, may lead to a more severe phenotype.
21.1.2 Chromosomal Rearrangements Causing Disease
With the availability of karyotyping, chromosome rearrangements have been linked to disease. Down syndrome was the first clinical syndrome linked to
Figure 21.1. Structural rearrangements. a, Deletion with loss of genetic material. b, Duplication with insertion of genetic material (gray). c, Isochromosome of the p arm with loss of one chromosome arm and duplication of the other arm (duplication in gray). d, Ring chromosome with joining of two sticky chromosome ends caused by deletion of genetic material on both chromosome arms. e, Inversion with reversion of genetic material on a single chromosome. f, Translocation with transfer of genetic material from one chromosome to another. In this case a translocation of chromosomes 4 (in black) and 7 (in gray). Part of the 4q arm is exchanged with material originating from the 7q arm.
a specific chromosome rearrangement, being trisomy of chromosome 21. Since this discovery, the concept of linking a disease phenotype to a chromosome rearrangement has been further explored. The strategy used most often was the identification of overlapping deletions in multiple patients presenting a similar disease phenotype. The smallest region of overlap, deleted in all patients, would as such define the region causing the phenotype under investigation. Subsequently, a positional gene approach can be used to identify the disease gene. In this approach, all genes present in the shortest region of deletion overlap are examined by DNA sequencing to identify mutations in patients with a similar phenotype but without a causative structural variation. A related disease-gene identification approach starts with the identification of a (balanced) translocation and studies the gene(s) disrupted by the translocation as potential disease gene(s). These approaches have been very successful for localizing genes for several diseases, including holoprosencephaly, retinoblastoma, and Gardner syndrome (Lele et al., 1963; Riccardi et al., 1978; Herrera et al., 1986; Schmickel, 1986; Munke, 1989; Roessler et al., 1996; Brown et al., 1998; Wallis et al., 1999; Gripp et al., 2000). Gross chromosomal rearrangements can be detected by karyotyping. However, this genomewide approach has several limiting factors. First, karyotyping requires actively dividing cells to obtain chromosomes in the optimal configuration for visualization (i.e., metaphase chromosomes), and second, karyotyping has a limited resolution (i.e., the structural aberration needs to be of sufficient size to be detectable, involving at least 5–10 Mb). There is, however, no reason why smaller structural genomic variations should not cause disease, as even a single base pair mutation can cause disease.
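The overlapping-deletion strategy described above amounts to intersecting intervals and then listing the genes inside the shared region. The following Python sketch shows this under stated assumptions: the patient deletions, the gene names, and all coordinates are hypothetical, and both function names are introduced here for illustration only.

```python
def smallest_region_of_overlap(deletions):
    """Intersect (start, end) deletion intervals on one chromosome.

    Returns the region shared by all deletions, or None if there is none.
    """
    if not deletions:
        return None
    start = max(d_start for d_start, _ in deletions)
    end = min(d_end for _, d_end in deletions)
    return (start, end) if start < end else None


def candidate_genes(region, genes):
    """Names of genes that overlap the region, at least partially."""
    start, end = region
    return sorted(
        name
        for name, (g_start, g_end) in genes.items()
        if g_start < end and g_end > start
    )


# Hypothetical deletions in three patients (positions in Mb).
patient_deletions = [(10.0, 25.0), (14.5, 30.0), (12.0, 18.0)]

# Hypothetical gene annotation for the same chromosome (positions in Mb).
genes = {
    "GENE_A": (11.0, 11.4),
    "GENE_B": (15.2, 15.9),
    "GENE_C": (17.8, 18.6),
    "GENE_D": (22.0, 22.3),
}

sro = smallest_region_of_overlap(patient_deletions)
candidates = candidate_genes(sro, genes)
print(sro, candidates)
```

Only the genes returned by `candidate_genes` would then be sequenced in patients who share the phenotype but lack a deletion, which is why a smaller shared region means fewer candidates to examine.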
Moreover, for disease gene identification studies, the detection of a smaller structural variation is much more useful than a larger one, as the number of candidate disease genes in the affected genomic locus will be limited.
21.1.3 The Detection of Submicroscopic Chromosomal Rearrangements
For a considerable number of clinical disorders, the causative rearrangement has been established to be smaller than 5–10 Mb and, as such, remains undetectable by karyotyping. To detect these rearrangements, more sensitive techniques were needed. The resolution of chromosome analysis has greatly benefited from the introduction of fluorescence in situ hybridization (FISH) (Van Prooijen-Knegt et al., 1982). This technology relies on the unique ability of single-stranded fluorescently labeled DNA, known as a probe, to anneal to its complementary sequence in the chromosomes. Next, the location of the annealed (or hybridized) probe can be visualized with a fluorescence microscope. Depending on the application, different types of FISH probes can be used, such as telomere-specific probes, whole chromosome painting probes
and locus-specific probes. The latter type of probe can be used, for instance, to validate a clinical diagnosis by showing that the genomic locus involved in the disease is indeed not present in the normal copy number of two. Although FISH and related high-resolution technologies, such as quantitative PCR, provide a higher resolution than does karyotyping (100–300 kb), these techniques can interrogate only a limited number of loci in a single experiment, and a priori knowledge of the genomic locus to investigate is needed. As such, these types of technologies are too expensive and labor intensive to use for a genomewide analysis. In subsequent years, FISH technologies were further modified to allow for a genomewide analysis. The optimization resulted in the introduction of comparative genomic hybridization (CGH) (Kallioniemi et al., 1992; Lichter et al., 2000). CGH is based on the comparison of two genomic DNA populations, one derived from a test (patient) sample and one derived from a normal reference sample. Equal amounts of DNA are differentially labeled and simultaneously hybridized onto normal human metaphase chromosomes, thus competing for the same targets on the chromosomes. Variation in the fluorescence intensities of test and reference DNA along each chromosome target reveals the genomic locations of chromosome rearrangements in the test DNA. The advantage of CGH over karyotyping is that it does not depend on actively dividing cells from the test sample as a source of metaphase spreads. As such, CGH can be performed on virtually all samples from which DNA can be extracted. CGH has proven particularly useful in cancer research. In general, tumor samples are difficult to culture and harvest for preparing metaphase spreads. In addition, these spreads are difficult to analyze by conventional karyotyping because of the abundance and complexity of the rearrangements present.
The resolution of CGH, however, still depends on the resolution of the target metaphase chromosomes (i.e., it remains difficult to detect rearrangements below the level of 5–10 Mb) (Forozan et al., 1997).
21.1.4 The Clinical Consequence of Submicroscopic Chromosome Rearrangements
Over the last decades, various FISH studies have revealed that submicroscopic subtelomeric rearrangements account for approximately 6% of all previously unexplained cases of MR (Flint et al., 1995; Knight et al., 1999; de Vries et al., 2003). Similarly, it was found that interstitial submicroscopic chromosome rearrangements account for a vast proportion of contiguous gene syndromes (Osborne et al., 2001; Shaikh et al., 2001). However, in cases where the clinical phenotype has not previously been associated with a known genomic rearrangement, there is no a priori knowledge of the region to be tested and, hence, FISH is no longer the method of choice. To detect such submicroscopic rearrangements on a genomewide scale, novel technologies were needed that
combine the resolution of targeted FISH technologies with the genomewide approach of CGH. One such technology is microarray-based comparative genomic hybridization (array CGH).
21.1.5 Array-Based Comparative Genomic Hybridization
Through the development of novel technologies such as array CGH, the resolving power of conventional chromosome analysis techniques has increased from the megabase to the kilobase level (Solinas-Toldo et al., 1997; Pinkel et al., 1998). Tools that have facilitated the development of these technologies include (1) genomewide clone resources integrated into the finished human genome sequence, (2) high-throughput microarray platforms, and (3) optimized CGH protocols and data analysis systems. Together, these microarray-based technological developments have culminated in a so-called molecular karyotyping approach that allows for the sensitive and specific detection of submicroscopic single copy number changes throughout the entire human genome. Array CGH builds on conventional CGH procedures in such a way that the target metaphase spreads are replaced by genomic fragments with known physical locations in a microarray format. In comparison with conventional CGH, the microarray format provides a higher resolution, a higher dynamic range, and a better possibility for automation. In addition, it allows for direct linking of (submicroscopic) chromosome rearrangements, also referred to as copy number variation (CNV), to known genomic sequences and, thus, to genes that may be involved in the disease under investigation (Fig. 21.2).
Initially, genomic microarrays were developed in academia and contained mostly genomic fragments obtained from large-insert genomic clones, mainly bacterial artificial chromosomes (BACs). Different clone sets have been used, the most popular containing one clone per 1 Mb or, later on, a tiling resolution clone set of approximately 30,000 clones, covering the genome with one clone per 100 kb (Vissers et al., 2003; Shaw-Smith et al., 2004; de Vries et al., 2005; Schoumans et al., 2005; Menten et al., 2006; Redon et al., 2006; Rosenberg et al., 2006).
Recently, genomic microarray production has been taken over by private enterprises, and many companies are now offering microarrays for genomewide copy number profiling containing more than a million oligonucleotides and targeting random sequences, SNPs, or a combination thereof. These oligonucleotides have been more evenly spaced across the genome, and optimized protocols are now available for the quantitative detection of CNVs (Fig. 21.2). With this, CNV detection can now reliably be performed at the kilobase level, resulting in the detection of hundreds of CNVs per individual (Redon et al., 2006). These advances have made genomic profiling technology an excellent tool for the genomewide detection of CNVs in health and disease (Friedman et al., 2006; Wagenstaller et al., 2007; Shao et al., 2008; Zhang et al., 2008; McMullan et al., 2009). Consequently, disease gene discovery has been facilitated using these approaches.
[Figure 21.2 panel labels. (a) Array CGH: patient and reference; DNA isolation; differential labeling; simultaneous hybridization in 1:1 ratio; microarray; detection of labeled DNA; computational analysis using internal control; output shown as log2 patient/reference ratio, with duplication and deletion indicated and clones ordered by Mb position on the chromosome. (b) SNP array: patient; DNA isolation; restriction enzyme digestion; adaptor ligation; PCR amplification and complexity reduction; fragmentation and end-labeling; hybridization to unique probe sequences; detection of hybridized material; computational analysis using external controls; genotypes AA, AB, BB.]
Figure 21.2. Overview of (a) the array CGH procedure using BAC arrays and (b) the SNP array technology. a, Genomic DNA samples from a test (patient; left) and reference (normal control; right) are differentially labeled with different fluorochromes, usually Cy3 and Cy5 (for green and red, indicated here by light and dark gray asterisks, respectively). The two DNA samples are mixed in equal amounts and hybridized to the microarray, onto which large-insert clone DNAs (e.g., BAC clones) have been robotically spotted as targets. Subsequent computer imaging assesses the relative fluorescence levels of each labeled DNA for each array target. Clones to which equal amounts of patient DNA and reference DNA have been hybridized will appear in yellow, clones deleted in the patient DNA but not the reference DNA will appear in red, and clones that are duplicated in the patient DNA will appear in green. b, For SNP arrays, genomic DNA of a single patient is hybridized onto the array (single color hybridization). Signal intensities for all probes are determined, and intensity ratios are calculated in silico using signal intensities obtained in previous array runs with control DNA. For both BAC arrays and SNP arrays, the output visualizes the copy number variation, with deletions and duplications showing ratios below and above preset thresholds for loss and gain, respectively.
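The ratio-and-threshold step described for Figure 21.2 can be sketched in a few lines. This is a minimal illustration, not the analysis pipeline of any particular platform: the threshold values and function names are assumptions, and the intensities are idealized, noise-free values for one, two, and three copies against a two-copy reference.

```python
import math

# Hypothetical thresholds on log2(patient/reference); real platforms
# tune these per array type and noise level.
LOSS_THRESHOLD = -0.3
GAIN_THRESHOLD = 0.3


def log2_ratio(patient_intensity, reference_intensity):
    """Return log2(patient/reference) for one array target."""
    return math.log2(patient_intensity / reference_intensity)


def call_target(ratio):
    """Classify one target as 'loss', 'gain', or 'normal'."""
    if ratio <= LOSS_THRESHOLD:
        return "loss"
    if ratio >= GAIN_THRESHOLD:
        return "gain"
    return "normal"


# Idealized intensities proportional to copy number (reference = 2 copies).
expected = {copies: log2_ratio(copies, 2) for copies in (1, 2, 3)}
calls = {copies: call_target(ratio) for copies, ratio in expected.items()}
```

In this idealized setting a single-copy deletion gives a log2 ratio of -1 and a single-copy duplication gives about +0.58, so the duplication lies closer to the normal value of 0 and is more easily lost in experimental noise.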
21.2 THE BASIC CONCEPT FOR DISEASE GENE DISCOVERY THROUGH GENOMEWIDE PROFILING STRATEGIES
21.2.1 Single Gene Disorders
In single gene disorders, or monogenic diseases, the phenotypic spectrum observed can be attributed to the malfunctioning of a single gene. Such malfunctioning can result from physical deletion or duplication of (parts of) a genomic copy of the gene or from more subtle intragenic mutations. The ultimate effect of a deletion or mutation that, for example, leads to a premature stop in the reading frame of the affected gene is haploinsufficiency, a state in which a decrease in the level of the corresponding protein gives rise to the phenotype. Such genes are dosage sensitive. Conversely, duplications or gain-of-function mutations create proteins that exhibit an increase in constitutive activity, even in the absence of a physiological activator, or that are insensitive to negative regulators. To date, over 10,000 single gene disorders are known and listed in the database of Online Mendelian Inheritance in Man (OMIM).
21.2.2 Contiguous Gene Syndromes
In contrast to single gene disorders, it has been shown that several conditions, including mental retardation and additional congenital/developmental abnormalities, may be due to submicroscopic chromosome rearrangements encompassing several genes. In 1986, the term contiguous gene syndrome was coined for these disorders (Schmickel, 1986). Since the introduction of this term, many alternatives have been suggested, including microdeletion/microduplication syndrome and segmental aneusomy syndrome. All these terms imply that the phenotype of the disorder results from an inappropriate dosage of more than one critical gene located within the genomic region affected; that is, individual genes located in such genomic regions contribute to distinct clinical features of the syndrome. It has, therefore, been suggested that the extent of the chromosomal region involved in each case would correlate with the ultimate phenotype and that individual clinical features might be inherited in isolation (Budarf and Emanuel, 1997).
21.2.3 Point Mutations, Deletions, and Duplications May Lead to the Same Phenotype
As outlined above, it is becoming increasingly clear that the only real requirement for a candidate microdeletion syndrome gene is that it should be dosage sensitive. In the case of microduplications, the effect of having a complete extra copy of a gene may result in a phenotype that is not mirrored by other mutations in this gene. The frequencies at which microdeletions or microduplications are encountered in monogenic diseases differ markedly. For example,
there are monogenic diseases that are mostly caused by gene mutations and rarely by deletions or duplications, such as Rubinstein-Taybi syndrome and Alagille syndrome (Krantz et al., 1997; Petrij et al., 2000). In other monogenic diseases, however, large deletions or duplications involving a dosage-sensitive gene are responsible for the majority of the cases, including Pelizaeus-Merzbacher syndrome and Smith-Magenis syndrome (Juyal et al., 1996; Mimault et al., 1999). Thus, microdeletions and microduplications occur at various frequencies in many monogenic diseases with a known genetic cause, and the difference between a microdeletion syndrome with rare mutations and a single gene disorder with occasional large deletions may be gradual rather than absolute. The availability of genomewide technologies to detect submicroscopic CNVs may further enhance the possibilities for a straightforward mapping of the genes underlying these disorders (Fig. 21.3).
21.3 DISEASE GENE DISCOVERY THROUGH GENOMEWIDE PROFILING STRATEGIES
To identify disease-causing genes, a stringent clinical preselection of patients whose DNA can be interrogated using high-resolution genomewide technologies for the detection of CNVs is an important first step. Second, the platform to conduct the molecular study needs to be selected. Whenever possible, the platform of choice is the one with the highest resolution available at the time of testing. Since smaller CNVs can be detected only using higher resolution platforms, the chance of disease gene identification increases with increasing resolution. It is, however, noteworthy that with increasing resolution, more CNVs per individual will be detected (up to 100 CNVs per individual). With this observation, discriminating between benign CNVs occurring in the general population and the causative CNV becomes increasingly important to facilitate disease gene discovery (see Section 21.5). The first syndrome successfully resolved using a high-resolution genomewide approach was CHARGE syndrome, through the detection and characterization of microdeletions by array CGH (Vissers et al., 2004).
21.3.1 CHARGE Syndrome
[Figure 21.3 panel labels: (a) genomewide copy number profile along a chromosome, with the step "prioritize CNVs to find causative variant"; (b) genes A, B, and C within the CNV, with the step "prioritize genes to find dosage sensitive gene"; (c) gene A, with the step "sequence for mutations in large patient cohort".]
Figure 21.3. Disease gene identification strategy using genomewide structural variation. From all CNVs detected in a patient (a), the causal variant needs to be identified using diverse criteria, including de novo occurrence and absence in healthy controls (b). Prioritization of the candidate genes located within this CNV determines which gene to sequence for disease-causing mutations in a larger cohort of patients showing the same phenotype but who do not harbor the deletion or duplication (c). Dark gray and light gray arrowheads in a represent deletions and duplications, respectively. Arrows in b signify the orientation of the genes located within the deleted interval. Circles in c represent mutations in patients without the deletion.
CHARGE syndrome (MIM #214800) is an autosomal dominant disorder with a prevalence of one in 10,000 (Blake et al., 1998). The acronym stands for the cardinal clinical features of the syndrome: coloboma, heart malformation, choanal atresia, retardation of growth and/or development, genital anomalies, and ear anomalies (Pagon et al., 1981). Most cases of CHARGE syndrome are sporadic, but several aspects of this condition support the involvement of a genetic factor that had remained elusive until recently (Tellier et al., 1998, 2000; Martin et al., 2002; Lalani et al., 2003). With the availability of microarray-based approaches, unbiased, genomewide screens were performed hypothesizing that microdeletions and/or microduplications might be the underlying cause of CHARGE syndrome (Vissers et al., 2004). Initial screening of two patients with CHARGE syndrome on a 1 Mb array revealed a microdeletion of ∼5 Mb in one of the patients at chromosome locus 8q12. Subsequent array analysis of an apparently balanced chromosome 8 translocation (based on
karyotyping) with estimated breakpoints within the 8q12 region (Hurst et al., 1991) unraveled two interspersed microdeletions, overlapping with the microdeletion of the first patient. This result showed (1) that higher resolution genomic analyses can reveal deletions at the breakpoint of a translocation, which are impossible to detect with the lower-resolution karyotyping technology, and (2) that deletions in two unrelated patients with the same syndrome point to chromosome 8q12 as the disease-causing locus. Subsequent analyses of DNA from 17 additional CHARGE patients on a tiling resolution chromosome 8 array did not show any additional microdeletions in this genomic locus. As such, it was reasoned that the disease in these patients was caused by point mutations in one of the genes residing in the shortest region of deletion overlap. Sequence analysis of all nine genes located within this region indeed revealed de novo mutations in CHD7, a novel member of the chromodomain helicase DNA-binding gene family, in the majority of individuals with CHARGE syndrome without deletions. Based on these results, it was concluded that CHARGE syndrome is caused by haploinsufficiency of the CHD7 gene, either by microdeletions encompassing the CHD7 gene, or by mutations within this gene (Vissers et al., 2004).
21.3.2 The 9q Subtelomeric Deletion Syndrome
A second well-illustrated example of gene discovery through deletion and/or translocation mapping is the discovery of the euchromatin histone methyl transferase 1 (EHMT1) gene causing 9q subtelomeric deletion syndrome (MIM #610253). Submicroscopic subtelomeric deletions of chromosome 9q (9qSTDS) are associated with a recognizable mental retardation syndrome (Harada et al., 2004; Stewart et al., 2004; Kleefstra et al., 2009).
The identification of the molecular cause of 9qSTDS started with the initial FISH screening of subtelomeric rearrangements in 12 patients, narrowing down the commonly deleted region to an ∼1.2 Mb interval (Stewart et al., 2004). Subsequently, this region was further reduced to ∼700 kb, still containing at least five genes and several ESTs (Yatsenko et al., 2005). The first evidence that 9qSTDS was a single gene disorder came from the characterization of the breakpoints of a balanced translocation t(X;9)(p11.23;q34.3) in a patient presenting with typical features of 9qSTDS, whose chromosome 9 breakpoint disrupted the EHMT1 gene in intron 9 (Kleefstra et al., 2005). Additional evidence was provided by deletion screening and sequence analysis of the gene in 23 patients with a clinical presentation reminiscent of 9qSTDS (Kleefstra et al., 2006). Of these 23 patients, 3 patients showed a deletion including the EHMT1 gene and 2 patients showed a de novo mutation in the EHMT1 gene. With this discovery, it was established that haploinsufficiency of the EHMT1 gene, either by deletion or mutation, leads to 9qSTDS (Kleefstra et al., 2009). Other examples for which this strategy has been successful include Potocki-Lupski syndrome (RAI1), Peters-plus syndrome (B3GALTL), and the MECP2
duplication syndrome (MECP2) (van Esch et al., 2005; Lesnik Oberstein et al., 2006; Potocki et al., 2007).
21.3.3 Defining New Microdeletion Syndromes
In addition to revealing disease genes for known syndromes, the worldwide use of high-resolution platforms on large patient cohorts has also been instrumental in defining novel microdeletion syndromes. Here, the concept of a phenotype-first approach, referring to the identification of a disease gene in a patient cohort (e.g., CHARGE syndrome), is changed into a genotype-first approach. In the genotype-first approach, overlapping CNVs are identified in a large, clinically heterogeneous patient cohort. After this molecular finding, more careful examination of the patients' phenotypes may show phenotypic overlap that is not expected for such a heterogeneous disease, hence allowing the definition of a new syndrome. The 17q21.31 microdeletion syndrome was the first new microdeletion syndrome identified through this approach, by studying large cohorts of patients with unexplained mental retardation. This microdeletion syndrome was described simultaneously by three groups (Koolen et al., 2006; Sharp et al., 2006; Shaw-Smith et al., 2006). Apart from mental retardation, these patients turned out to have additional clinical features in common, such as hypotonia and a specific facial feature (bulbous nose) (Koolen et al., 2006; Sharp et al., 2006; Shaw-Smith et al., 2006). Other examples of novel syndromes include 15q24 microdeletion syndrome, 3q29 microdeletion syndrome, Xq28 microduplication syndrome, and 16p11.2 microdeletion/microduplication syndrome (Van Esch et al., 2005; Willatt et al., 2005; Sharp et al., 2007; Weiss et al., 2008; Brunetti-Pierri et al., 2008; Mefford et al., 2008; El-Hattab et al., 2009). Whether these novel microdeletion and microduplication syndromes can also be attributed to a single dosage-sensitive gene, similar to the syndromes that have been resolved using CNV as the initial discovery, has not yet been established.
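The shortest region of deletion overlap used in the CHARGE syndrome example (Section 21.3.1), like the overlapping CNVs sought in the genotype-first approach, comes down to intersecting genomic intervals. The sketch below is a simplified illustration: the function name is hypothetical, all CNVs are assumed to lie on the same chromosome, and the coordinates are made up rather than taken from any of the cited studies.

```python
def shortest_region_of_overlap(intervals):
    """Return the (start, end) region shared by all intervals, or None.

    Each interval is a (start, end) pair in base pairs on the same
    chromosome, with start < end.
    """
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None


# Made-up deletion coordinates for three hypothetical patients.
deletions = [
    (60_000_000, 65_000_000),
    (61_200_000, 62_000_000),
    (61_500_000, 63_000_000),
]
sro = shortest_region_of_overlap(deletions)

# Two deletions that do not overlap share no region.
no_overlap = shortest_region_of_overlap([(1, 5), (6, 9)])
```

Genes annotated within the returned interval would then be the candidates to sequence in patients without a deletion.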
21.4 DISEASE GENE IDENTIFICATION FOR COMMON DISEASES
Although many common diseases, such as schizophrenia, mental retardation, and autism spectrum disorder, occur at high frequencies in the general population and show an overall high heritability, the genetic contribution to these diseases is only partially explained (Vissers et al., 2003; Shaw-Smith et al., 2004; de Vries et al., 2005; Schoumans et al., 2005; Friedman et al., 2006, 2008; Rosenberg et al., 2006; Owen et al., 2007; Wagenstaller et al., 2007; Ingason et al., 2009; Kirov et al., 2008, 2009; Sanders et al., 2008; Shao et al., 2008; Vrijenhoek et al., 2008; Walsh et al., 2008; Weiss et al., 2008; Xu et al., 2008; Zhang et al., 2008; McMullan et al., 2009). Genomewide CNV microarrays have recently been introduced in the study of these diseases (Gijsbers et al., 2009; Koolen et al., 2009). For mental retardation, CNV studies have already entered the
diagnostic arena and replaced karyotyping as the gold standard in studying chromosomal abnormalities. For the other diseases this has not yet happened, but these have benefited from the fact that CNVs can now be reliably called on SNP-based microarrays that are being routinely used for genomewide association studies. Until now, most studies in all of these common diseases have focused on the detection and interpretation of rare CNVs, as these are easier to link to disease than CNVs that also occur frequently in the normal population (Vissers et al., 2003; Shaw-Smith et al., 2004; de Vries et al., 2005; Schoumans et al., 2005; Friedman et al., 2006; Rosenberg et al., 2006; Ullmann et al., 2007; Wagenstaller et al., 2007; Mefford et al., 2008; Shao et al., 2008; Sharp et al., 2008; Zhang et al., 2008; van Bon et al., 2009; Hannes et al., 2009; McMullan et al., 2009).
21.4.1 Rare CNVs in Common Diseases
One study evaluating the role of rare CNVs in schizophrenia combined a genomewide CNV screen in patients with deficit schizophrenia with a more targeted follow-up study in a general schizophrenia patient–control cohort (Vrijenhoek et al., 2008). The discovery cohort of deficit schizophrenia patients revealed a set of four CNVs containing candidate genes not reported to be copy number variant in healthy individuals. The genes located within these rare CNVs (NRXN1, CTNND2, MYT1L, and ASTN2) were further studied for copy number variation in more than 700 patients with more generalized schizophrenia as well as more than 700 unaffected controls. In total, four additional CNVs were identified in the patient cohort, all leading to deletions, duplications, or disruptions of one of the genes, thereby suggesting an important role in the etiology of schizophrenia (Vrijenhoek et al., 2008). Similarly, dosage variation of the CNTNAP2 gene has been linked to a combination of schizophrenia and epilepsy in three individuals with overlapping aberrations involving this gene (Friedman et al., 2008). In addition, several other rare variants have been reported in patients with schizophrenia, including deletion of the ERBB4 gene and a fusion of the SKP2 and SLC1A genes due to deletion of an intervening segment (Walsh et al., 2008). Although these variants are individually rare, the total number of disease-causing structural variants in common diseases such as mental retardation and schizophrenia indicates that they contribute substantially to the disease etiology.
21.5 DISCRIMINATING THE DISEASE-RELATED CNV FROM ALL NORMAL CNVS
Taking all of the above into account, the general strategy to identify disease genes through CNV mapping is best suited for resolving those diseases that are monogenic/oligogenic and involve haploinsufficiency as the disease-causing mechanism (Vissers et al., 2005). Hence the identification of a CNV that is (1)
relatively large, (2) rare, and (3) de novo in a patient provides a strong indicator of clinical significance, as this combination is rare in the normal population (Conrad et al., 2006; Redon et al., 2006; Lupski, 2007). Increases in microarray resolution have, however, been revealing a much higher rate of CNVs per individual than previously thought (McMullan et al., 2009), and an increasing number of genomic loci are showing variable inheritance and penetrance (Ullmann et al., 2007; Mefford et al., 2008; Sharp et al., 2008; van Bon et al., 2009; Hannes et al., 2009). These observations complicate direct identification of the causative CNV, which potentially harbors the disease-causing gene, and as such argue for methods that predict whether a CNV is benign or disease causing.
21.5.1 Forging Links between Human Phenotypes and Mouse Gene Knockout Models
A first elegant strategy to make this distinction is to forge a link between human MR-associated CNVs and mouse gene knockout models (Webber et al., 2009). In this novel approach, all genes located in 148 MR-associated CNVs were collected and functionally compared to the genes in more than 26,000 CNVs from the general population. The MR-CNVs were found to be significantly enriched in two classes of genes: those whose mouse orthologues, when disrupted, result in either abnormal axon or abnormal dopaminergic neuron morphologies. Additional enrichments highlighted correspondence between relevant mouse phenotypes and secondary presentations including brain abnormalities, cleft palate, and seizures. Already, this approach has identified 78 new candidate genes contributing to MR and associated phenotypes (Webber et al., 2009), thereby demonstrating the power of exploiting mouse knockout data to better understand the distinction between benign and disease-associated CNVs.
These novel candidate genes within the pathogenic CNV(s) can now be prioritized for high-throughput sequencing in large cohorts of patients with a similar phenotype, potentially leading to the identification of mutations in novel disease genes.
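The three criteria named at the start of Section 21.5 (relatively large, rare, and de novo) can be expressed as a simple filter over a patient's CNV calls. The sketch below is illustrative only: the record fields, the size cutoff, and the control-frequency cutoff are assumptions chosen for the example, not values given in this chapter, and in practice variable inheritance and penetrance mean such a filter can only rank candidates.

```python
from dataclasses import dataclass


@dataclass
class CNV:
    chrom: str
    start: int
    end: int
    de_novo: bool             # absent in both parents
    control_frequency: float  # fraction of healthy controls with an overlapping CNV

    @property
    def size(self):
        return self.end - self.start


def prioritize(cnvs, min_size=100_000, max_control_frequency=0.01):
    """Keep CNVs that are large, rare, and de novo; largest first.

    Both cutoffs are hypothetical defaults for illustration.
    """
    kept = [
        c for c in cnvs
        if c.size >= min_size
        and c.control_frequency <= max_control_frequency
        and c.de_novo
    ]
    return sorted(kept, key=lambda c: c.size, reverse=True)


# Made-up calls for one hypothetical patient.
patient_cnvs = [
    CNV("chr8", 61_000_000, 63_300_000, de_novo=True, control_frequency=0.0),
    CNV("chr1", 1_000_000, 1_020_000, de_novo=False, control_frequency=0.15),
    CNV("chr9", 139_000_000, 139_700_000, de_novo=True, control_frequency=0.0),
    CNV("chr15", 20_000_000, 20_500_000, de_novo=False, control_frequency=0.002),
]
candidates = prioritize(patient_cnvs)
```

Here only the two de novo CNVs absent from controls remain, with the larger one listed first.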
21.6 NEXT-GENERATION SEQUENCING FOR THE DETECTION OF STRUCTURAL VARIATION
The ultimate resolution to screen the human genome for disease-causing mutations and structural variants is at the basepair level. Major advances in DNA sequencing technologies, collectively termed next-generation sequencing (NGS) technologies, are now enabling the comprehensive analysis of whole genomes (Korbel et al., 2007; Levy et al., 2007; Kidd et al., 2008; Mardis, 2008; Rusk and Kiermer, 2008; Wheeler et al., 2008; Conrad et al., 2009; Ng et al., 2009). Currently, NGS includes three main non-Sanger-based sequencing methods: (1) pyrosequencing (Roche 454 technology), (2) sequencing with reversible terminators (Solexa technology), and (3) sequencing by ligation (SOLiD technology) (Rusk and Kiermer, 2008). The main differences among the methods are read length, number of reads per run, and the costs involved (Mardis, 2008). It is interesting that all NGS methods are in principle capable of detecting single base mutations and structural variation, including both balanced and unbalanced rearrangements.
[Figure 21.4 panel labels: x-axis, position (Mb), 27.56 to 27.58; (a) shotgun reads; (b) mate pairs.]
Figure 21.4. Detecting structural variation using next-generation sequencing. a, Detection of structural variation by shotgun sequencing depends directly on the read depth of the individual sequence reads derived from patient DNA. As such, a heterozygous deletion will contain half the number of sequence reads compared to flanking genomic regions for which two genomic copies are present. In addition, split-reads will be present, indicating the breakpoints of the deletion interval. b, Alternatively, structural variation can be detected by sequencing a mate-paired library providing positional information. Deletions can be detected by mate pairs spanning a larger genomic segment than anticipated based on the library size. Light gray boxes represent individual sequence reads. Split-reads in a are indicated by dark gray boxes and connected by black dotted lines. In b, appropriately mapped mate pairs are shown in light gray boxes and connected by solid gray lines, indicating the distance between the pairs. Mate pairs that map outside the expected size distribution are shown in dark gray boxes, connected by black solid lines.
21.6.1 CNV Detection Using Shotgun Sequencing
Copy number variation can be identified using shotgun sequencing by studying local differences in read depth, that is, the number of reads mapping to a specific genomic locus, also referred to as coverage (Fig. 21.4a). Hence, for heterozygous deletions half the number of reads should be expected compared to the surrounding regions where two copies are present, whereas for duplications 1.5× the number of sequence reads should be present. Additional evidence for copy number variants can be provided by the presence of split-reads, in which one part of the sequence read maps to one side of the deleted or duplicated interval, whereas the remainder of the sequence read maps to the other side of the interval.
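The read-depth reasoning above (0.5× for a heterozygous deletion, 1.5× for a duplication) can be sketched as follows. This is a toy illustration rather than an established CNV caller: the window counts are invented, the normalization to the median window and the tolerance value are assumptions, and real data would also need correction for factors such as GC content and mappability.

```python
import statistics


def depth_ratios(window_counts):
    """Normalize per-window read counts to the median window count."""
    median = statistics.median(window_counts)
    return [count / median for count in window_counts]


def call_windows(window_counts, tolerance=0.2):
    """Label each window by comparing its depth ratio to 0.5, 1.0, and 1.5.

    The tolerance is a hypothetical value chosen for this example.
    """
    calls = []
    for ratio in depth_ratios(window_counts):
        if abs(ratio - 0.5) <= tolerance:
            calls.append("heterozygous deletion")
        elif abs(ratio - 1.5) <= tolerance:
            calls.append("duplication")
        elif abs(ratio - 1.0) <= tolerance:
            calls.append("two copies")
        else:
            calls.append("unclassified")
    return calls


# Invented read counts for eight consecutive windows.
counts = [100, 98, 102, 51, 49, 101, 150, 99]
calls = call_windows(counts)
```

Windows four and five come out as a heterozygous deletion and window seven as a duplication; split-reads at the window boundaries would then serve as independent support for the breakpoints.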
21.6.2 CNV Detection Using Mate-Pair Sequencing
Currently, the most specific NGS application to identify all structural variation, including balanced rearrangements, is paired-end mapping or mate-paired library sequencing. This application directly provides detailed positional information, in contrast to array-based methods or shotgun sequencing (Fig. 21.4b) (Korbel et al., 2007; Kidd et al., 2008). For mate-pair runs, genomic DNA is randomly sheared and size selected. After several processing steps, shotgun reads are obtained by sequencing both ends of the size-selected DNA library. The positional information arises because the size selection constrains the placement of paired reads within the reference genome. Deviations from this expected size distribution may point to deletions, duplications, and insertions (Fig. 21.5). For example, fragments sequenced from a 3-kb library are expected to map ∼3 kb apart when mapped back onto the reference genome, whereas fragments mapping ∼100 kb apart may point to a deletion in the DNA library tested. Similarly, differences in strand location, orientation, or mapping positions to different chromosomes may indicate inversions and translocations (Fig. 21.5d). It is interesting that paired-end mapping strategies have identified numerous structural variants currently not annotated in the reference genome, suggesting that the reference genome is still incomplete (Kidd et al., 2008).
With this ultimate resolution to screen the genome, new disease genes await discovery. The labor-intensive candidate-gene approach of sequencing all genes within a deletion interval, as for CHARGE syndrome, is now no longer required. A simple unbiased next-generation sequence run will immediately lead to the identification of the causative gene. Also, inversions and balanced rearrangements that were difficult to fully sequence in the pre-NGS era will now be analyzed in the greatest detail, potentially unraveling disrupted genes and fusions thereof.
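The interpretation rules for mate pairs described above and in Figure 21.5 can be written down as a small decision function. The sketch assumes a 3-kb library with a hypothetical tolerance of 500 bp and assumes that a correctly mapped pair has its two tags on opposite strands; the expected orientation actually depends on the library protocol. The tuple layout and labels are invented for the example.

```python
def classify_pair(pair, expected=3000, tolerance=500):
    """Classify one mapped mate pair against the expected insert size.

    pair is (chrom1, pos1, strand1, chrom2, pos2, strand2).
    The expected orientation (opposite strands) and the tolerance are
    assumptions for this illustration.
    """
    chrom1, pos1, strand1, chrom2, pos2, strand2 = pair
    if chrom1 != chrom2:
        return "possible translocation"
    if strand1 == strand2:
        return "possible inversion"
    distance = abs(pos2 - pos1)
    if distance > expected + tolerance:
        return "possible deletion"   # tags map farther apart than the library size
    if distance < expected - tolerance:
        return "possible insertion"  # tags map closer together than the library size
    return "concordant"


# Invented mappings for five mate pairs.
pairs = [
    ("chr8", 1_000, "+", "chr8", 4_000, "-"),
    ("chr8", 1_000, "+", "chr8", 101_000, "-"),
    ("chr8", 1_000, "+", "chr8", 2_000, "-"),
    ("chr8", 1_000, "+", "chr8", 4_000, "+"),
    ("chr8", 1_000, "+", "chrX", 4_000, "-"),
]
labels = [classify_pair(p) for p in pairs]
```

A single discordant pair is weak evidence on its own; in practice one would look for several independent pairs supporting the same event before calling a variant.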
21.7 CONCLUSION
In conclusion, the impact of genomewide structural variation on gene discovery has been enormous. The ability to obtain detailed quantitative copy number information for the whole genome in a single experiment has led to the identification of a significant number of disease genes. Without doubt, further implementation of next-generation sequencing technologies and medical resequencing strategies will allow disease gene identification to continue at a more rapid pace than ever before. Eventually, the vast majority of Mendelian disorders, if not all, may be explained by copy number-dependent gene dosage variations or single base pair substitutions. There are many challenges ahead in the clinical interpretation of structural variation related to disease, especially since not all of these variants will be fully penetrant and not all of these variants will contain functional genes. Nevertheless, it can be expected that many more disease genes will be identified through the study of structural genomic variation.
Figure 21.5. The interpretation of structural variation using mate-paired library sequencing. a, A mate-paired library of a given size is mapped to the reference genome. If both tags are separated by the expected insert size, no structural variation is present. b, For mate pairs spanning a deletion, the distance between the tags exceeds the expected distance. c, For insertions in the test sample, the mate pair spans a shorter distance on the reference genome than expected. d, Balanced rearrangements, such as inversions, are detected by the altered orientation of one of the pairs.
21.8 QUESTIONS AND ANSWERS
1. How are microarray-based technologies used for the detection of chromosomal rearrangements and what are the differences between the most widely used platforms?
2. What is more difficult to detect on a genomic microarray: single copy number deletions or single copy number duplications?
3. Explain how point mutations and microdeletions/duplications may result in the same clinical phenotype and why this is useful to identify novel disease genes.
4. In a clinical diagnostic setting, a patient's DNA is examined using a high-resolution genomic microarray because a CNV is expected to cause the clinical phenotype. Lab results show that several CNVs are present in the patient's DNA. How do you proceed to determine which of these variants is the disease-causing variant?
5. What are the advantages and disadvantages of using next-generation sequencing for the detection of structural variation when compared to microarray-based technologies?
1. By definition, microarray-based technologies use microarrays containing target probes representing anything of interest. For the detection of human chromosomal rearrangements, microarrays that contain target probes representing the entire human genome with an even spacing between the probes are preferred. These target probes can be either genomic DNA fragments (e.g., BAC arrays, two-color hybridization) or oligonucleotides representing SNPs (e.g., SNP arrays, single-color hybridization). To this microarray, DNA of a patient is hybridized. In the case of microarrays using BAC clones or random oligonucleotides, control DNA labeled with a different fluorochrome is simultaneously hybridized onto the same microarray. For SNP arrays, control DNA is hybridized to a separate microarray. Subsequently, hybridization intensities are determined using laser scanners for each target probe on the array. Next, for each target on the array, the ratio of the hybridization intensity of patient DNA (T) over the hybridization intensity of the control DNA (R) is calculated. This ratio is a relative measure for the copy number state of each probe on the array.
Using statistical tools, all target probes are ordered on the physical genome position, and deletions and duplications can be determined over the entire human genome as for each target on the array. 2. Single copy number duplications are more difficult to detect than single copy number deletions. This is because you measure relative changes in copy number on a microarray, and this relative change is less for single copy number duplications (from two copies to three, a relative increase of 50%) than for single copy number deletions (from two to one, a relative decrease of 100%). 3. Point mutations may generate premature stop codons leading to haploinsufficiency of the gene in which the mutation is present. The remaining copy of the gene on the other allele is not enough for a normal functioning of the gene. Similarly, microdeletions may lead to the physical absence of one copy of the gene, thereby also leading to haploinsufficiency of the gene. For point mutations leading to a gain-of-function, the gene is constitutively active, or overstimulation of downstream target genes. Similarly, duplica-
c21.indd 460
1/12/2011 9:44:41 AM
ACKNOWLEDGMENTS
461
tions lead to an additional copy of the gene, thus, potentially leading to the same array of consequences. The fact that deletions and duplications may lead to similar phenotypes as point mutations is used to find disease genes according to the following principle. Point mutations are very difficult to localize in a genomewide manner without a priori information what genomic region to screen for mutations, at least using traditional Sanger sequencing. Next-generation sequencing technology will soon allow unbiased genomewide mutation screening. However, microdeletions/duplications can already be identified in an unbiased genomewide fashion. Since a phenotype can be caused by both deletions/duplications and mutations, any patient with a deletion/ duplication may point to the genomic locus to screen for gene mutations in patients with a similar phenotype but not showing the deletion/ duplication. 4. If more than one CNV is found a patient, there are several steps that will help identify of the causative CNV. The first step is to examine whether the CNVs are de novo by testing the parents for the same CNVs. Second, the presence of the CNV must be checked in control cohorts. This can either be done using in-house tested control samples or, using online databases, such as the Database of Genomic Variants or the HapMap consortium, collecting these CNV data on healthy controls. Third, classifier programs and phenotype databases can be used to predict the disease potential of a given CNV. Especially for those patients whose parental samples are not available, classifier predication programs are of great importance. Current diagnostic practice mostly includes testing parental samples and testing the occurrence of the CNVs in in-house collected CNV databases of healthy controls to determine the disease-causing CNV. 5. 
The advantages of using next-generation sequencing for the detection of structural variation over microarray-based technology include the detection of balanced rearrangements, such as inversions and balanced translocations, which both remain undetected using microarray technologies. In addition, direct positional information is acquired, which directly points to breakpoints for deletions and to the inserted location for duplications. Also, the exact copy number can be established for duplications, which cannot be obtained using microarray-based technologies. Currently, the disadvantages of the next-generation sequence technology include the relative high costs involved per experiment and the complex practical and bioinformatic workflow. Also, the biological interpretation of the data are still challenging.
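The ratio argument in answers 1 and 2 can be checked with a few lines of arithmetic. The Python sketch below computes the test-over-reference ratio (T/R) and its base-2 logarithm for idealized copy number states. The function name `log2_ratio` and the use of a log2 scale are assumptions made here for illustration, and real array data are noisier and usually show compressed ratios compared with these theoretical values.

```python
import math


def log2_ratio(test_copies, reference_copies=2):
    """Idealized log2(T/R) for a locus, assuming signal is proportional to copy number."""
    return math.log2(test_copies / reference_copies)


if __name__ == "__main__":
    states = [
        ("single copy deletion", 1),
        ("normal", 2),
        ("single copy duplication", 3),
    ]
    for label, copies in states:
        ratio = copies / 2
        print(f"{label}: T/R = {ratio:.2f}, log2(T/R) = {log2_ratio(copies):+.2f}")
```

Under these assumptions the deletion shifts the log2 ratio by a full unit while the duplication shifts it by only about 0.58, which is why a duplication of the same size sits closer to the noise around zero.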
21.9 ACKNOWLEDGMENTS
This work was supported by grants from the Netherlands Organisation for Health Research and Development (ZonMW 916.86.016 to LELMV, ZonMW
917.66.363 to JAV) and grants from the AnEUploidy project (LSHG-CT-2006037627 to JAV) supported by the European Commission under FP6.
CHAPTER 22
Impact of Whole Genome Protein Analysis on Gene Discovery of Disease Models
SHENG ZHANG, YONG YANG, and THEODORE W. THANNHAUSER
Contents
22.1 Introduction
22.2 Proteomics Strategies and Workflow
22.2.1 MS Instrumentation for Proteomics
22.2.2 Importance of Experimental Design for Proteomics
22.2.3 Sample Preparation and Separation Technologies
22.2.4 MS Analytical Strategies for Proteomics
22.2.5 Protein Identification by Database Searching
22.2.6 Quantitative Proteomics
22.2.7 PTM Characterization: Phosphoproteome
22.3 Biological Impact of Proteomic Technologies
22.3.1 Understanding Complex Biological Processes
22.3.2 Proteomics-Driven Discovery of Cancer Biomarkers
22.3.3 Proteogenomics: From Proteome to Genome
22.4 Conclusions and Future Perspectives
22.5 Questions and Answers
22.6 Acknowledgments
22.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
22.1 INTRODUCTION
The emergence of technologies that facilitate genomewide data acquisition (DNA sequence, mRNA expression, protein expression, and associated metabolite profiles, etc.) heralds a new paradigm for functional genomics that will allow the development of a deep understanding of gene regulation and other complex molecular and cellular biological processes (Cox et al., 2007; de Hoog et al., 2004). Over the past decade comprehensive genomic sequence information has become available for an ever-increasing number of species, the most significant being the completion of the Human Genome Project (Collins et al., 2004; Venter et al., 2001; Lander et al., 2001). The development of DNA microarray technologies to study transcriptional regulation of genes at the messenger level has permitted genomewide expression analysis in response to various stimuli, providing unparalleled opportunities for biomarker discovery and giving rise to a new discipline, transcriptomics. DNA microarrays have become popular because they allow the evaluation of expression for thousands of genes in parallel and make it possible to assess interactions between expressed genes.
However, mRNA levels do not provide a complete picture of cellular function. First, the vast majority of cellular functions involve the interaction of proteins. Second, protein expression levels depend not only on transcript levels but also on translational efficiency and regulated degradation (Gygi et al., 1999; Lu et al., 2007). Also, proteins function at specific subcellular localizations and are susceptible to posttranslational modification (often required to enable function) in ways that cannot be predicted from transcript expression levels or from the genome sequence. Therefore, it is essential to supplement DNA microarray data with direct measurements at the protein level. In fact, in many cases it is proteins that act as the cellular machinery, directly carrying out the function of genes through enzymatic catalysis, molecular signaling, and physical/chemical interactions.
It is at the protein level that most regulatory processes take place, where the primary disease processes occur, and where most drugs act. Unfortunately, the analogous protein array technologies are much more difficult to implement because proteins cannot be synthesized or replicated as easily as nucleic acids. Furthermore, the physical properties of proteins vary much more widely than those of nucleic acids, making protein–protein binding less predictable and more subject to non-specific interactions. Protein arrays also require antibodies for each protein of interest. Since antibodies recognize only a portion (the epitope) of the target molecule, they tend to cross-react with similar or accidentally homologous proteins, and they are generally unable to distinguish between microheterogeneous forms of the target protein (due to posttranslational modifications, PTMs, etc.). As a result, the rapidly developing field of proteomics has largely been limited to the systematic study of protein expression in particular cell or tissue types as a function of time and biological or environmental stress. These studies typically involve the identification, quantification, and characterization of proteins but often are extended to determine the subcellular localization of certain proteins of interest.
Parallel to the success in genomics and transcriptomics, considerable progress has been made in the field of proteomics in the past decade. Mass spectrometry (MS) has emerged as an indispensable tool for the investigation of the protein components in biological systems (Han et al., 2008; Domon and Aebersold, 2006; Yates, 2004; Aebersold and Mann, 2003). Advances in MS, together with new methods of biochemical separation, protein tagging, chemical labeling, and the development of new bioinformatics tools, have allowed initial efforts focused on protein identification to evolve such that proteomics is currently being applied to high-throughput quantitative applications (Han et al., 2008; Ong and Mann, 2005; Ong et al., 2003), to the characterization of protein modification state (Wiesner et al., 2008; Mirza and Olivier, 2007; Cantin and Yates, 2004; Mann and Jensen, 2003), and to the study of large protein complexes. Moreover, modern MS can be used to study time-resolved changes in protein structure and interactions within a given subcellular compartment or superstructure (Cox et al., 2007; Aebersold and Mann, 2003; Gingras et al., 2007, 2005). These developments have yielded tremendous insight into the composition, regulation, and function of molecular complexes and the metabolic pathways they engender (Cravatt et al., 2007; Yates et al., 2005).
Discovery-based quantitative proteomics has been widely applied to study various disease states (such as cancers) with the aim of identifying biomarkers associated with the diseases or targets for potential therapeutics (Pan et al., 2009; Ferrer-Alcon et al., 2009). These efforts have led to a significant increase in the identification of novel biomarker candidates. However, further characterization and validation of the vast majority of the putative biomarkers remain extremely challenging due to the dynamic nature and complexity of cellular proteomes.
Driven by the challenges associated with protein characterization and quantification and the need to develop strategies to deal with the inherent complexities of the proteome, a wide range of new MS-based analytical platforms and technologies has been developed. Advances in proteomics have been closely tied to continuous improvement in mass spectrometry technology, experimental design, strategies for sample preparation, and data mining tools. Although the target analytes of the genome and proteome are fundamentally different, there is a strong and synergistic relationship between proteomics and genomics: the two disciplines investigate the molecular makeup of the cell at complementary levels, and each provides information that enhances the effectiveness of the other. For example, the MS-based peptide-spectral matching that is used in proteomics to identify peptides is possible only with extensive knowledge of genome sequences. Genomics provides complete genomic sequences, a critical resource for identifying proteins quickly and robustly by the correlation of MS spectra with sequence databases. Meanwhile, recent advances in mass spectrometry hardware and software have enabled the production of large proteomics datasets with broad coverage of the proteome through high-throughput LC-MS/MS analysis. The resulting peptide and fragment mass information has proven useful for genome annotation.
This new discipline of proteogenomics complements computationally based genome annotation in that it unambiguously determines reading frame, translation start and stop sites, and splice boundaries and provides validation of short ORFs (Ansong et al., 2008). By combining gene model-based annotation with proteogenomics, an accurate and more complete protein catalog can be obtained. This chapter reviews the advanced MS-based proteomics technologies, the biological impact of proteomics on biomarker and gene discovery associated with diseases, and the recent development of proteogenomics studies for facilitating genome annotation. Finally, the future perspectives in MS-based proteomics development including its applications and challenges are discussed.
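The idea that an identified peptide can pin down a reading frame can be illustrated with a short sketch. The Python example below is an assumption made for illustration, not the procedure used in the cited studies: it translates a DNA sequence in all six frames with the standard genetic code and reports which frames contain a peptide that is assumed to have been identified by MS. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase ACGT string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    frames = {}
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames


def frames_supporting(dna, peptide):
    """List the frames whose translation contains the observed peptide."""
    return [f for f, protein in six_frames(dna).items() if peptide in protein]


if __name__ == "__main__":
    toy_dna = "CATGGCCAAGTAA"   # hypothetical genomic fragment
    observed = "MAK"            # hypothetical MS-identified peptide
    print(frames_supporting(toy_dna, observed))
```

In this toy case only one frame is compatible with the peptide, which is the sense in which peptide evidence resolves the reading frame. Real proteogenomic pipelines work with enzymatic digests, mass tolerances, and statistical scoring, none of which is modeled here.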
22.2 PROTEOMICS STRATEGIES AND WORKFLOW
A fundamental aspect of proteomics is the ability to systematically identify every protein expressed in a cell or tissue and to comprehensively characterize changes in the identified proteins (their abundance, state of modification/complexation, and subcellular location) in response to environmental and physiological factors. The general workflow and technology for such analyses require the integration of effective separation methods for reduction of sample complexity, advanced MS-based analytical techniques for the identification and quantitation of analytes, and bioinformatics tools for data analysis and interpretation. Among all the hardware and software tools required for proteomics analysis, MS technology has become increasingly important, to the exclusion of almost every other strategy. The combined analytical features of MS (enhanced sensitivity, high selectivity, high mass accuracy, and wide dynamic range) offer unique abilities to handle sample complexity, detect low-abundance protein components, and identify sites of modification and protein complexes.
22.2.1 MS Instrumentation for Proteomics
Mass spectrometers measure the mass-to-charge ratio (m/z) of gas phase ions. Fundamentally, all mass spectrometers consist of three core parts: an ion source that converts analyte molecules into gas phase ions, a mass analyzer that allows the separation of ionized peptides on the basis of m/z ratio, and a detector that registers the number of ions at each m/z value. Historically, MS was limited to the analysis of small, volatile, and thermostable compounds until the late 1980s when two soft ionization techniques—electrospray ionization (ESI) (Fenn et al., 1989) and matrix-assisted laser desorption/ionization (MALDI) (Tanaka et al., 1988; Karas et al., 1988)—were developed and introduced into protein analysis. These two effective ionization techniques allow polar, nonvolatile, and thermally unstable protein/peptide molecules to be ionized and transferred from a condensed phase into the gas phase
without excessive fragmentation. These developments revolutionized the analysis of peptides and proteins, a fact recognized by the award of the 2002 Nobel Prize in Chemistry to Fenn and Tanaka. Not surprisingly, ESI and MALDI are the two most common ionization techniques integrated into modern mass spectrometers intended for proteomic applications. MALDI is a solid-phase ionization technique that ionizes sample molecules out of a crystalline matrix via laser excitation. It is important that the absorbance spectrum of the matrix be well matched to the wavelength of light emitted by the laser. The MALDI matrix absorbs laser energy, becomes excited, and vaporizes, carrying the macromolecular analyte molecules into the gas phase. In this tumultuous and explosive process the analyte molecules undergo collisions with excited matrix ions that are sufficiently energetic to cause the transfer of electrons and protons, creating a population of charged macromolecular ions that can be analyzed, typically in a time-of-flight (TOF) mass analyzer. Two closely related techniques often used are atmospheric pressure MALDI (AP-MALDI) and surface-enhanced laser desorption ionization (SELDI). AP-MALDI allows an easy interchange between MALDI and ESI sources on the same MS instrument. SELDI is essentially MALDI that has been targeted to a specific class of molecules through the introduction of surface affinity ligands on the target plate before analysis. Unless MALDI analysis is coupled with off-line reverse phase liquid chromatography (RPLC), which has relatively low throughput, direct MALDI-MS is limited to the analysis of relatively simple peptide mixtures. Unlike MALDI, ESI is a solution-based ionization method and is therefore readily coupled with liquid-based separation techniques such as liquid chromatography.
ESI is initially driven by a high voltage difference applied between the sample delivery probe and the inlet of the mass spectrometer, which creates a spray of electrically charged droplets. In the low pressure of the mass spectrometer inlet, with the assistance of a heated capillary and sheath gas flow, the liquid in the droplets continues to vaporize, leaving droplets of smaller size with an ever-increasing surface charge. When the repulsion from the charged ions on the surface of the droplet overcomes the surface tension of the liquid, the droplets undergo a Coulombic explosion, creating a set of smaller droplets. This process continues until one of two things happens. Either the liquid in the droplet is fully depleted and the residual charges are deposited on the macromolecules contained within, creating the multiply charged ions observed (the charge deposition model), or the curvature of the droplet surface becomes so great that the field strength at the surface is sufficient to cause direct ionization of the multiply charged macromolecules from the droplet surface (the direct ionization model). Nanoelectrospray ionization (nanoESI) MS, introduced by Wilm and Mann (Wilm et al., 1996; Wilm and Mann, 1996), has become the most widely used ionization technique for proteomics studies due to its low flow rates, low sample consumption, and improved detection limits when coupled with upfront nanoLC. Both MALDI and ESI have strengths and
weaknesses, but many studies have shown that they are highly complementary (Bodnar et al., 2003; Stapels and Barofsky, 2004; Yang et al., 2007).
The mass analyzer is the core component of the mass spectrometer: it can store and/or separate ions based on their m/z and can be manipulated to select specific ions for further analysis. Four types of mass analyzers are widely used in proteomics research: (1) the quadrupole mass analyzer, which separates ions based on trajectory stability, (2) TOF, which separates ions based on velocity (flight time), (3) the ion trap (IT), and (4) Fourier-transform ion cyclotron resonance (FTICR), which separates ions based on their m/z resonance frequency. Each mass analyzer has unique physical properties that help determine the instrument's performance with respect to key parameters, such as detection mass range, scan speed, sensitivity, mass accuracy, resolution, and dynamic range. Each of the mass analyzers can be used individually, but often they are used in combination, creating hybrid tandem mass spectrometry (MS/MS) instruments such as triple quadrupole (Q-q-Q), Q-q-Q-linear IT, Q-TOF, TOF-TOF, and LIT-FTICR. Thus many different types of MS and MS/MS instruments are commercially available for proteomics research. Each has strengths and weaknesses that depend on the specific application, and these characteristics determine which instrument is the most appropriate to use. Many excellent recent reviews have covered the latest developments in MS instrumentation (Han et al., 2008; Domon and Aebersold, 2006; Perry et al., 2008; Liu et al., 2007).
In tandem mass spectrometers, the first mass analyzer is used to separate and isolate ions of a specific m/z, which are then fragmented in a collision cell. The m/z values of the product ions are measured in a second stage of mass analysis and then interpreted to yield information concerning the structure of the selected precursor ion. MS/MS is a fundamental and essential technique for protein/peptide analysis.
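Because every analyzer described above reports m/z rather than mass, a small numerical sketch may help. The Python example below is an illustration under stated assumptions, not a description of any vendor's software: it computes m/z for a protonated ion, recovers the charge from two adjacent charge states of the kind ESI produces, and evaluates the flight time in an idealized linear TOF analyzer. The function names, the example mass, and the default drift length and accelerating voltage are hypothetical.

```python
import math

PROTON = 1.007276               # proton mass, Da
E_CHARGE = 1.602176634e-19      # elementary charge, C
DALTON = 1.66053906660e-27      # 1 Da in kg


def mz(neutral_mass_da, charge):
    """m/z of a molecule carrying `charge` extra protons (positive mode)."""
    return (neutral_mass_da + charge * PROTON) / charge


def neutral_mass(mz_value, charge):
    """Invert mz(): neutral mass from an observed m/z and a known charge."""
    return charge * (mz_value - PROTON)


def charge_from_adjacent_peaks(mz_high, mz_low):
    """Charge z of the higher-m/z peak, assuming mz_low carries z + 1.

    From mz_high = M/z + p and mz_low = M/(z+1) + p it follows that
    z = (mz_low - p) / (mz_high - mz_low).
    """
    return round((mz_low - PROTON) / (mz_high - mz_low))


def tof_flight_time(mz_value, drift_length_m=1.0, accel_voltage_v=20000.0):
    """Idealized linear TOF: z*e*U = m*v^2/2, so t = L*sqrt((m/z)/(2*e*U))."""
    mass_per_charge = mz_value * DALTON / E_CHARGE  # kg per coulomb
    return drift_length_m * math.sqrt(mass_per_charge / (2.0 * accel_voltage_v))


if __name__ == "__main__":
    protein = 16951.0  # hypothetical neutral mass, Da
    peak_10, peak_11 = mz(protein, 10), mz(protein, 11)
    z = charge_from_adjacent_peaks(peak_10, peak_11)
    print(z, neutral_mass(peak_10, z))
    print(tof_flight_time(1000.0), tof_flight_time(4000.0))
```

The flight time grows with the square root of m/z, so an ion with four times the m/z takes twice as long to reach the detector. Real instruments add reflectrons, delayed extraction, and calibration, which this sketch ignores.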
Thus far the most widely used fragmentation means in MS/MS analysis for proteomics is collision-induced dissociation (CID) (Shukla and Futrell, 2000). In the CID process, a gas-phase protein/peptide cation is selected and transmitted into a high-pressure region where it undergoes a number of collisions with gas atoms or molecules. During these inelastic collisions, a portion of kinetic energy is converted into the internal energy in the ion, making the ion unstable, which drives peptide backbone fragmentation resulting in the series of b-fragment and y-fragment ions acquired in the second stage mass analysis (Roepstorff and Fohlman, 1984; Biemann, 1990a, 1990b). In addition to the typical b- and y-ion series observed in low-energy CID analysis, internal fragmentation and neutral-loss of H2O, NH3 and labile modification molecules can be seen due to slow-heating and other energetic features associated with CID. These can combine to produce very complex spectra which often restrict the amount of sequence information one can obtain from large peptides and intact proteins. ESI, which typically produces multiply charged ions, is most often coupled with low energy CID to produce high-quality and sequence-specific MS/MS data. MALDI typically produces singly charged ions that require higher energy to fragment effectively. Thus MALDI instruments are often
coupled to high-energy collision cells or make use of other high-energy fragmentation strategies, such as postsource decay (PSD), to produce sequence-specific fragment ion spectra. While high-energy fragmentation has some unique advantages (such as the ability to distinguish between leucine and isoleucine residues), it generally results in more internal fragmentation and side-chain fragmentation, making the spectra more difficult to interpret than low-energy CID spectra.
A relatively new fragmentation method, electron-capture dissociation (ECD), was developed and introduced by the McLafferty group in 1998 (Zubarev, 2006; Zubarev et al., 1998). ECD involves excitation of the mass-selected, multiply protonated peptide/protein cation by the capture of a thermal, low-energy electron and subsequent fragmentation of the resulting odd-electron ion at the amino alkyl (N-Cα) bond to produce c-type and z-type fragment ions in abundance. Because the process is nonergodic (it does not involve intramolecular vibrational energy redistribution), the fragmentation of large protein ions with the preservation of labile modifications becomes possible. As ECD produces far more backbone cleavages than CID, particularly for large proteins and peptides, it offers better sequence coverage for proteomics analysis and PTM characterization. ECD has therefore become a useful tool. However, its use was initially confined to expensive and sophisticated FTICR mass spectrometers.
Electron transfer dissociation (ETD) is another nonergodic fragmentation method that is analogous to ECD and has recently been developed by the Hunt laboratory. It uses electron transfer between singly charged anions with low electron affinity and multiply charged peptide cations to induce backbone fragmentation at the N-Cα bond (Coon et al., 2005; Syka et al., 2004). ETD fragmentation creates complementary c- and z-type ion series, yielding information highly complementary to conventional CID fragmentation.
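The relationship between the b/y series produced by CID and the c/z series produced by ECD and ETD can be made concrete with a short calculation. The Python sketch below is an illustration, not part of any search engine: it computes singly charged monoisotopic fragment m/z values for an unmodified peptide. Two conventions are assumed here and labeled in the code: c ions are taken as b + NH3, and z ions are reported as the radical z• form, y - NH3 + H. The function name and the example peptides are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276     # Da
WATER = 18.010565     # H2O, Da
AMMONIA = 17.026549   # NH3, Da
HYDROGEN = 1.007825   # H atom, Da


def fragment_ions(peptide):
    """Singly charged b, y, c, and z-dot m/z values for each backbone site.

    Index i in every list refers to cleavage after residue i + 1, so
    ions["b"][i] is b(i+1) and ions["y"][i] is the complementary y ion.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {"b": [], "y": [], "c": [], "z": []}
    for site in range(1, len(peptide)):
        b = sum(masses[:site]) + PROTON
        y = sum(masses[site:]) + WATER + PROTON
        ions["b"].append(b)
        ions["y"].append(y)
        ions["c"].append(b + AMMONIA)              # assumed convention: c = b + NH3
        ions["z"].append(y - AMMONIA + HYDROGEN)   # assumed convention: z-dot
    return ions


if __name__ == "__main__":
    for series, values in fragment_ions("PEPTIDEK").items():
        print(series, [round(v, 4) for v in values])
    # Leucine and isoleucine have identical residue masses, so backbone
    # fragments alone cannot tell them apart.
    print(fragment_ions("LK")["b"] == fragment_ions("IK")["b"])
```

Because leucine and isoleucine are isobaric, the b, y, c, and z values for the two residues are identical, which is why the side-chain fragmentation seen at high energy is needed to distinguish them.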
More importantly, ETD can be implemented on relatively inexpensive RF ion trap mass spectrometers, making it available to a much larger number of researchers. As with ECD, ETD preserves labile PTMs because fragmentation occurs along the peptide backbone in a sequence-independent manner. Thus ETD has been increasingly recognized as an important alternative dissociation technique to CID for the analysis of many PTMs (Wiesner et al., 2008; Mikesh et al., 2006), particularly phosphorylation (Lu et al., 2008; Chi et al., 2007; Molina et al., 2007) and glycosylation (Khidekel et al., 2007; Wuhrer et al., 2007; Catalina et al., 2007). ETD can also be used to analyze large peptides and small intact proteins when it is followed by a sequential proton transfer reaction (PTR), which reduces the charge states of the ETD-generated multiply charged ions so that they can be readily measured on a benchtop instrument. As a result, ETD-integrated ion trap instruments enable rapid sequencing of large peptides (middle-down workflow) and small intact proteins (top-down strategy), allowing the determination of a 15–40 amino acid sequence at both the N- and C-termini of proteins (Bunger et al., 2008; Chi et al., 2007; Wu et al., 2007). In addition, because the ion/ion reaction is highly efficient and fast, ETD can be performed
at a fast scan rate (∼3 scans/s) and is therefore compatible with the chromatographic timescale (Udeshi et al., 2008, 2007) typically used for shotgun analysis, providing high sensitivity and superior sequence coverage for peptides of all sizes. Furthermore, both CID and ETD can be combined in the same experiment, generating two sets of highly complementary data (Hart et al., 2009; Swaney et al., 2008; Good et al., 2007). Table 22.1 summarizes the key features of commonly used MS instruments with their available ion source, fragmentation technique and specific applications in proteomics analysis.
22.2.2 Importance of Experimental Design for Proteomics
The focus of proteomics research has been the systematic identification and quantitation of all expressed cellular components, together with their associated biological modifications and isoforms, in response to a particular treatment or environmental stimulus. Given the dynamic nature of proteins in any given biological system, proteome samples should be collected under specific conditions at various time points so that the analysis fully reflects the various states of proteins in a cell. In contrast to the traditional target analysis used for single-protein characterization, proteomics studies often take a brute-force discovery strategy (nontarget analysis) at a global (hypothesis-free) level. Consequently, proteomics research requires large-scale, multistep analysis of multiple complex samples and the collection of large amounts of data. In addition, there is a plethora of techniques available to carry out experiments, each of which will generate enormous sets of complementary data. For these reasons, generating highly reliable and reproducible methodologies and optimized workflows has been a significant challenge for the entire field (Rifai et al., 2006). Past experience has shown that all these data, often generated at significant cost, can have very little value if appropriate attention is not paid to the design of the experiment. Thus it is advisable to design the experiment so that the right type of data is collected to answer the question of interest as efficiently as possible. Perhaps the most important question to be answered when considering the design of a proteomics experiment is, What are the objectives? Once the experimental objectives are defined, the resources, protocols, and instrumentation can be selected based on their ability to achieve the objectives with the required accuracy, precision, and reliability.
It is necessary to identify the known or expected sources of variability within the experiment so that efforts can be made to reduce their impact on our ability to answer the question of interest. One designs an experiment to improve the precision of the answer. Proteomics is applied to a broad array of experimental objectives. These include dissecting the biomolecular interactions involved in the formation of protein complexes, deciphering the intricacies of metabolic processes or identification of proteins and/or peptides that are characteristic of a particular environmental stress or disease state. Furthermore, proteomic researchers
fm
pm
1E+4 MSn ESI CID/ETD 1,3,6
CID
1,3,6
Fragmentation technology Applicationsf
TOF-TOF Q-q-TOF
1–6
CID
6E+6 MS/MS ESI
1–6
CID
4E+6 MSn ESI
1
CID/PSD
1E+4 n/ae MALDI
No upper limit Fast
fm
1–3
CID/PSD
1E+4 MS/MS MALDI
No upper limit Fast
fm
1–3,6
Moderate to fast 1E+4 MS/MS ESI; MALDI CID
20–100,000
fm
fm
30,000– 100,000