Editorial
Biomedical Data Mining
N. Peek1; C. Combi2; A. Tucker3
1Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands; 2Department of Computer Science, University of Verona, Verona, Italy; 3School of Information Systems, Computing and Mathematics, Brunel University, London, UK
Keywords Data mining, machine learning
Summary
Objective: To introduce the special topic of Methods of Information in Medicine on data mining in biomedicine, with selected papers from two workshops on Intelligent Data Analysis in bioMedicine (IDAMAP) held in Verona (2006) and Amsterdam (2007). Methods: Defining the field of biomedical data mining. Characterizing current developments and challenges for researchers in the field. Reporting on current and future activities of IMIA’s working group on Intelligent Data Analysis and Data Mining. Describing the content of the selected papers in this special topic. Results and Conclusions: In the biomedical field, data mining methods are used to develop clinical diagnostic and prognostic systems, to interpret biomedical signal and image data, to discover knowledge from biological and clinical databases, and in biosurveillance and anomaly detection applications. The main challenges for the field are i) dealing with very large search spaces in a manner that is both computationally efficient and statistically valid, ii) incorporating and utilizing medical and biological background knowledge in the data analysis process, iii) reasoning with time-oriented data and temporal abstraction, and iv) developing end-user tools for interactive presentation, interpretation, and analysis of large datasets.
Correspondence to: Niels Peek Department of Medical Informatics Academic Medical Center University of Amsterdam P.O. Box 22700 1100 DE Amsterdam The Netherlands E-mail:
[email protected] Methods Inf Med 2009; 48: 225–228
What Is Data Mining?
The goal of this special topic is to survey the current state of affairs in biomedical data mining. Data mining is generally described as the (semi-)automatic process of discovering interesting patterns in large amounts of data [1–4]. It is an essential activity for translating the increasing abundance of data in the biomedical field into information that is meaningful and valuable for practitioners. Traditional data analysis methods, such as those originating from statistics, often fail to work when datasets are sizeable, relational in nature, multimedial, or object-oriented. This has led to a rapid development of novel data analysis methods that are increasingly receiving attention in the biomedical literature. Data mining is a young and interdisciplinary field, drawing from fields such as database systems, data warehousing, machine learning, statistics, signal analysis, data visualization, information retrieval, and high-performance computing. It has been successfully applied in diverse areas such as marketing, finance, engineering, security, games, and science. Rather than comprising a clear-cut set of methods, the term “data mining” refers to an eclectic approach to data analysis in which choices are guided by pragmatic considerations concerning the problem at hand. Broadly speaking, the goals of data mining can be classified into two categories: description and prediction [2–4]. Descriptive data mining attempts to discover implicit and previously unknown knowledge, which can be used by humans in making decisions. In this case, data mining is part of a larger knowledge discovery process that includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and presentation of discovered knowledge to end-users. To arrive at usable results, it is essential that the discovered patterns are comprehensible to humans. Typical descriptive data mining tasks are unsupervised machine learning problems such as mining frequent
patterns, finding interesting associations and correlations in data, cluster analysis, outlier analysis, and evolution analysis. Predictive data mining seeks to find a model or function that predicts some crucial but as yet unknown property of a given object, given a set of currently known properties. In prognostic data mining, for instance, one seeks to predict the occurrence of future medical events before they actually occur, based on patients’ conditions, medical histories, and treatments [5]. Predictive data mining tasks are typically supervised machine learning problems such as regression and classification. Well-known supervised learning algorithms are decision tree learners, rule-based classifiers, Bayesian classifiers, linear and logistic regression analysis, artificial neural networks, and support vector machines. The models that result from predictive data mining may be embedded in information systems and need not, in that case, always be comprehensible to humans, even though a sound motivation of the provided prediction is often required in the medical field. The distinction between descriptive and predictive data mining is not always clear-cut. Interesting patterns that were found with descriptive data mining techniques can sometimes be used for predictive purposes. Conversely, a comprehensible predictive model (e.g. a decision tree) may highlight interesting patterns and thus have descriptive qualities. It may also be useful to alternate between descriptive and predictive activities within a data mining process. In all cases, the results of descriptive and predictive data mining should be valid on new, unseen data in order to be valuable to, and trusted by, end-users.
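To make the distinction concrete, the minimal sketch below (not part of the original editorial) trains a small decision tree, one of the supervised learners mentioned above, on a synthetic clinical-style dataset and prints both its accuracy on unseen data and a human-readable description of the induced model; all variable names and data are hypothetical.

```python
# Minimal sketch (hypothetical data): a comprehensible predictive model.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Hypothetical predictors: age, systolic blood pressure, diabetes flag.
X = np.column_stack([
    rng.normal(65, 10, n),     # age
    rng.normal(135, 20, n),    # systolic BP
    rng.integers(0, 2, n),     # diabetes (0/1)
])
# Synthetic outcome loosely depending on the predictors.
y = ((X[:, 0] > 70) | ((X[:, 1] > 150) & (X[:, 2] == 1))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Validity on unseen data, plus a description a clinician could inspect.
print("held-out accuracy:", tree.score(X_te, y_te))
print(export_text(tree, feature_names=["age", "systolic_bp", "diabetes"]))
```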
Data Mining in Biomedicine
Data mining can be applied in biomedicine for a large variety of purposes, and is thus connected to diverse biomedical subfields. Traditionally, data mining and machine learning applications focused on clinical applications, such as decision support for medical practitioners and interpretation of signal and image data. More recently, applications in epidemiology, bioinformatics, and biosurveillance have received increasing attention.
Clinical data mining applications are mostly predictive in nature and attempt to derive models that use patient-specific information to predict a patient’s diagnosis, prognosis, or any other outcome of interest and thereby support clinical decision-making [6]. Historically, diagnostic applications have received most attention [7–9], but in the last decade prognostic models have become more popular [5, 10, 11]. Other tasks that are addressed with clinical data mining are detection of data artifacts [12] and adverse events [13], discovering homogeneous subgroups of patients [14], and extracting meaningful features from signal and image data [15]. Several characteristic features of clinical data may complicate the data mining process, such as the frequent and often meaningful occurrence of missing values, and the fact that data values (e.g. diagnostic categories) may stem from structured and very large medical vocabularies such as the ICD [16]. Furthermore, when the data were collected in routine care settings, it may be misleading to draw conclusions from the data with respect to causal effects of therapies. Data from randomized controlled studies enable researchers to compute unbiased estimates of causal effects, as these studies ensure exchangeability of patient groups [17]. In observational data, however, the analysis is biased due to the lack of this exchangeability. In recent years, biomedical data mining has received a strong impetus from research in molecular biology. In this field, datasets fall into three classes: i) sequence data, often represented by a collection of single nucleotide polymorphisms (SNPs) [18]; ii) gene expression data, which can be measured with DNA microarrays to obtain a snapshot of the activity of all genes in one tissue at a given time [19]; and iii) protein expression data, which can include a complete set of protein profiles obtained with mass spectrometry technologies, or a few protein markers [20]. Initially, most genomic and proteomic research focused upon working with individual data sources and achieved considerable success. However, a number of key barriers have been met, concerning, for example, the variability in microarray data [21] and the enormous search spaces involved in identifying protein-protein interactions and folding, which require substantial data samples. An alternative approach to dealing directly with
genomic and proteomic data is to perform literature mining, which aims to discover related genes and proteins through analysis of biomedical publications [22]. Recent developments have explored methods to combine data sources, such as meta-analysis and consensus algorithms for homogeneous data [23] and Bayesian priors for heterogeneous data [24]. Another major recent development that aims to combine data and knowledge is systems biology. This is an emerging field that attempts to model an entire organism (or a major system within an organism) as a whole [25]. It is starting to show genuine promise when combined with data mining [26], particularly in certain biological subsystems such as the cardiovascular and immune systems. Although many data mining concepts are well established today and toolsets are available to readily apply data mining algorithms [27, 28], various challenges remain for researchers in the biomedical data mining field. First and foremost, biomedical datasets continue to grow in terms of the number of variables (measurements) per patient. This results in exponentially growing search spaces of hypotheses that are explored by data mining algorithms. An important challenge is to deal with these very large search spaces in a manner that is both computationally efficient and statistically valid. Second, knowledge discovery activities are only meaningful when they take advantage of existing background knowledge in the application area at hand [29]. Biomedical knowledge typically resides in structured medical vocabularies and ontologies, clinical practice guidelines and protocols, and results from scientific studies. Few existing data mining methods are capable of utilizing any of these forms of background knowledge. Third, a most characteristic feature of medical data is its temporal dimension. All clinical observations and interventions must occur at some point in time or during a time period, and medical jargon abounds with references to time and temporality [30]. Although the attention to temporal reasoning and data analysis has increased over the last decade [31–33], there is still a lack of established data mining methods that deal with temporality. The fourth and final challenge is the development of software tools for end users (such as biologists and medical professionals). With the growing amounts of data available, there is an
increasing need for interactive tools that support users in the presentation, interpretation, and analysis of datasets.
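As an illustration of the first challenge, one widely used way to keep a large-scale hypothesis search statistically valid is to control the false discovery rate. The sketch below implements the Benjamini-Hochberg procedure on a vector of synthetic p-values; it is a generic illustration of the idea, not a method advocated by any of the papers in this issue.

```python
# Minimal sketch: Benjamini-Hochberg false discovery rate control over a
# vector of p-values (assumed to have been computed elsewhere).
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of p-values rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting the bound
        rejected[order[:k + 1]] = True
    return rejected

# Example: 10,000 hypothetical tests with a handful of true signals.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(0, 1, 9990), rng.uniform(0, 1e-5, 10)])
print("discoveries:", benjamini_hochberg(pvals).sum())
```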
IMIA’s Working Group on Intelligent Data Analysis and Data Mining
In 2000, a Working Group on Intelligent Data Analysis and Data Mining was established as part of the International Medical Informatics Association (IMIA) [34]. The objectives of the working group are i) to increase the awareness and acceptance of intelligent data analysis and data mining methods in the biomedical community, ii) to foster scientific discussion and disseminate new knowledge on AI-based methods for data analysis and data mining techniques applied to medicine, iii) to promote the development of standardized platforms and solutions for biomedical data analysis, and iv) to provide a forum for the presentation of successful intelligent data analysis and data mining implementations in medicine. The working group’s main activity is the organization of a yearly workshop called Intelligent Data Analysis in bioMedicine and Pharmacology (IDAMAP) [35]. IDAMAP workshops are devoted to computational methods for data analysis in medicine, biology and pharmacology that present results of analysis in a form communicable to domain experts and that somehow exploit knowledge of the problem domain. Typical methods include data visualization, data exploration, machine learning, and data mining. Gathering in an informal setting, workshop participants have the opportunity to meet and discuss selected technical topics in an atmosphere that fosters the active exchange of ideas among researchers and practitioners. IDAMAP workshops have been organized since 1996. The most recent workshops were held in Aberdeen (2005), Verona (2006), Amsterdam (2007), and Washington (2008). Other activities of the working group include the organization of tutorials and panel discussions at international conferences on the topics of intelligent data analysis and data mining in biomedicine. In all its activities, there is a close collaboration with the Working Group on Knowledge Discovery and Data Mining of AMIA [36].
Selected Papers
From a total of 35 papers presented at the IDAMAP-2006 and IDAMAP-2007 workshops, the ten best papers were selected based on the workshop review reports, and the authors were invited to submit an extended version of their paper for the special topic. Eight authors responded positively, and five papers were finally accepted after blinded peer review. In our opinion, these papers form a representative sample of the current developments in biomedical data mining. The paper by Curk et al. [37] considers the problem of knowledge discovery from gene expression data, by searching for patterns of gene regulation in microarray datasets. Knowledge of gene regulation mechanisms is essential for understanding gene function and interactions between genes. Curk et al. present a novel descriptive data mining algorithm, called rule-based clustering, that finds groups of genes sharing combinations of promoter elements (regions of DNA that facilitate gene transcription). The main methodological challenge is the vast number of candidate combinations of genes and promoter regions, which the algorithm handles by employing a heuristic search method. Interesting features of this algorithm are that it yields a symbolic cluster representation, and, in contrast to traditional clustering techniques, allows for overlapping clusters. Bielza et al. [38] also address the analysis of microarray gene expression data, but focus on predictive data mining, using logistic regression analysis. As discussed in the previous section, microarray datasets have created new methodological challenges for existing data analysis algorithms. In particular, the number of data attributes (genes) is typically much larger than the number of observations (patients), potentially resulting in unreliable statistical inferences due to a severe ‘multiple testing’ problem. One popular solution in biostatistics is regularization of the model parameters by setting a penalty on the total size of the estimated regression coefficients. However, estimation of regularized model parameters is a complex numeric optimization problem. The paper by Bielza et al. presents an evolutionary algorithm to solve the problem. The third paper, by Andreassen et al. [39], is clinically oriented and uses a Bayesian
learning method to solve a well-known problem in pharmacoepidemiology: discovering which bacterial pathogenic organisms can be treated with particular antibiotic drugs. Again, the large number of possible combinations that need to be considered poses problems for traditional data analysis methods. More specifically, many pathogen-drug combinations will not even occur in the data, or occur in such small numbers that reliable direct inferences are not possible. Andreassen et al. propose to solve this problem by borrowing statistical strength from observations on similar pathogens using hierarchical Dirichlet learning. Castellani et al. [40] consider the identification of tumor areas in dynamic contrast enhanced magnetic resonance imaging (DCE-MRI), a technique that has recently expanded the range and application of imaging assessment in clinical research. DCE-MRI data consist of serial sets of images obtained before and after the injection of a paramagnetic contrast agent. Rapid acquisition of images allows an analysis of the variation of the MR signal intensity over time for each image voxel, which is indicative of the type of tissue represented by the voxel. Castellani et al. use support vector machines to classify the signal intensity time curves associated with image voxels. The fifth and final paper of the special topic, written by Klimov et al. [41], deals with the visual exploration of temporal clinical data. They present a new workbench, called VISITORS (VISualizatIon of Time-Oriented RecordS), which integrates knowledge-based temporal reasoning mechanisms with information visualization methods. The underlying concept is the temporal association chart, a list of raw or abstracted observations. The VISITORS system allows users to interactively visualize temporal data from a set of patient records.
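The regularization idea mentioned above for the paper by Bielza et al. can be illustrated with the standard penalized-likelihood approach: the sketch below fits L2-regularized (ridge) logistic regression to synthetic “large k, small N” data and shows how a stronger penalty shrinks the coefficients. This is the conventional baseline, not the EDA-based regularizer proposed in their paper.

```python
# Minimal sketch of ridge-penalized logistic regression on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, k = 60, 2000                       # few samples, many genes
X = rng.normal(size=(N, k))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=N) > 0).astype(int)

for C in (100.0, 1.0, 0.01):          # smaller C = stronger L2 penalty
    clf = LogisticRegression(penalty="l2", C=C, max_iter=5000).fit(X, y)
    print(f"C={C:>6}: largest |coefficient| = {np.abs(clf.coef_).max():.3f}")
```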
References
1. Fayyad U, Piatetsky-Shapiro G, Smyth P. The KDD process for extracting useful knowledge from volumes of data. Commun ACM 1996; 39 (11): 27–34. 2. Hand DJ, Mannila H, Smyth P. Principles of Data Mining. Cambridge, Massachusetts: MIT Press; 2001. 3. Giudici P. Applied Data Mining. Statistical Methods for Business and Industry. London: John Wiley & Sons; 2003.
4. Han J, Kamber M. Data Mining. Concepts and Techniques. San Francisco, California: Morgan Kaufmann Publishers; 2006. 5. Abu-Hanna A, Lucas PJ. Prognostic models in medicine: AI and statistical approaches. Methods Inf Med 2001; 40 (1): 1–5. 6. Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 2008; 77 (2): 81–97. 7. Lavrac N, Kononenko I, Keravnou E, Kukar M, Zupan B. Intelligent data analysis for medical diagnosis: using machine learning and temporal abstraction. AI Commun 1998; 11: 191–218. 8. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med 2001; 23: 89–109. 9. Sakai S, Kobayashi K, Nakamura J, Toyabe S, Akazawa K. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods Inf Med 2007; 46 (6): 723–726. 10. Pfaff M, Weller K, Woetzel D, Guthke R, Schroeder K, Stein G, Pohlmeier R, Vienken J. Prediction of cardiovascular risk in hemodialysis patients by data mining. Methods Inf Med 2004; 43 (1): 106–113. 11. Tjortjis C, Saraee M, Theodoulidis B, Keane JA. Using T3, an improved decision tree classifier, for mining stroke-related medical data. Methods Inf Med 2007; 46 (5): 523–529. 12. Verduijn M, Peek N, de Keizer NF, van Lieshout EJ, de Pont AC, Schultz MJ, de Jonge E, de Mol BA. Individual and joint expert judgments as reference standards in artifact detection. J Am Med Inform Assoc 2008; 15 (2): 227–234. 13. Jakkula V, Cook DJ. Anomaly detection using temporal data mining in a smart home environment. Methods Inf Med 2008; 47 (1): 70–75. 14. Nannings B, Bosman RJ, Abu-Hanna A. A subgroup discovery approach for scrutinizing blood glucose management guidelines by the identification of hyperglycemia determinants in ICU patients. Methods Inf Med 2008; 47 (6): 480–488. 15. Lessmann B, Nattkemper TW, Hans VH, Degenhard A. A method for linking computed image
features to histological semantics in neuropathology. J Biomed Inform 2007; 40 (6): 631–641. 16. www.who.int/whosis/icd10. Last accessed Mar 3, 2009. 17. Hernán MA. A definition of causal effect for epidemiological research. J Epidemiol Community Health 2004; 58: 265–271. 18. Barker G, Batley J, O’Sullivan H, Edwards KJ, Edwards D. Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics 2003; 19 (3): 421–422. 19. Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. J Comput Biol 2000; 7 (3–4): 601–620. 20. Lobley A, Swindells MB, Orengo CA, Jones DT. Inferring function using patterns of native disorder in proteins. PLoS Comput Biol 2007; 3 (8): e162. 21. Choi JK, Yu U, Kim S, Yoo OJ. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 2003; 19 (Suppl 1): i84–i90. 22. Jelier R, Schuemie MJ, Roes PJ, van Mulligen EM, Kors JA. Literature-based concept profiles for gene annotation: the issue of weighting. Int J Med Inform 2008; 77 (5): 354–362. 23. Steele E, Tucker A. Consensus and meta-analysis regulatory networks for combining multiple microarray gene expression datasets. J Biomed Inform 2008; 41 (6): 914–926. 24. Husmeier D, Werhli AV. Bayesian integration of biological prior knowledge into the reconstruction of gene regulatory networks with Bayesian networks. In: Markstein P, Xu Y, editors. Computational Systems Bioinformatics, Volume 6: Proceedings of the CSB 2007 Conference. World Scientific; 2007. pp 85–95. 25. Kitano H. Computational systems biology. Nature 2002; 420: 206–210. 26. Zhang M. Interactive analysis of systems biology molecular expression data. BMC Systems Biol 2008; 2: 2–23. 27. Witten IH, Frank E. Data Mining. Practical Machine Learning Tools and Techniques. San Francisco, California: Morgan Kaufmann Publishers; 2005.
28. http://www.ailab.si/orange. Last accessed Mar 3, 2009. 29. Zupan B, Holmes JH, Bellazzi R. Knowledge-based data analysis and interpretation. Artif Intell Med 2006; 37 (3): 163–165. 30. Shahar Y. Dimensions of time in illness: an objective view. Ann Intern Med 2000; 132 (1): 45–53. 31. Combi C, Shahar Y. Temporal reasoning and temporal data maintenance in medicine: issues and challenges. Comput Biol Med 1997; 27 (5): 353–368. 32. Adlassnig KP, Combi C, Das AK, Keravnou ET, Pozzi G. Temporal representation and reasoning in medicine: Research directions and challenges. Artif Intell Med 2006; 38 (2): 101–113. 33. Stacey M, McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif Intell Med 2007; 39 (1): 1–24. 34. http://magix.fri.uni-lj.si/idadm. Last accessed Mar 3, 2009. 35. http://www.idamap.org. Last accessed Mar 3, 2009. 36. http://www.amia.org/mbrcenter/wg/kddm. Last accessed Mar 3, 2009. 37. Curk T, Petrovic U, Shaulsky G, Zupan B. Rule-based clustering for gene promoter structure discovery. Methods Inf Med 2009; 48: 229–235. 38. Bielza C, Robles V, Larrañaga P. Estimation of distribution algorithms as logistic regression regularizers of microarray classifiers. Methods Inf Med 2009; 48: 236–241. 39. Andreassen S, Zalounina A, Leibovici L, Paul M. Learning susceptibility of a pathogen to antibiotics using data from similar pathogens. Methods Inf Med 2009; 48: 242–247. 40. Castellani U, Cristani M, Daducci A, Farace P, Marzola P, Murino V, Sbarbati V. DCE-MRI data analysis for cancer area classification. Methods Inf Med 2009; 48: 248–253. 41. Klimov D, Shahar Y, Taieb-Maimon M. Intelligent interactive visual exploration of temporal associations among multiple time-oriented patient records. Methods Inf Med 2009; 48: 254–262.
Original Articles
Rule-based Clustering for Gene Promoter Structure Discovery
T. Curk1; U. Petrovic2; G. Shaulsky3; B. Zupan1, 3
1University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia; 2J. Stefan Institute, Department of Molecular and Biomedical Sciences, Ljubljana, Slovenia; 3Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, Texas, USA
Keywords Promoter analysis, gene expression analysis, machine learning, rule-based clustering
Summary
Background: The genetic cellular response to internal and external changes is determined by the sequence and structure of gene-regulatory promoter regions. Objectives: Using data on gene-regulatory elements (i.e., either putative or known transcription factor binding sites) and data on gene expression profiles we can discover structural elements in promoter regions and infer the underlying programs of gene regulation. Such hypotheses obtained in silico can greatly assist us in experiment planning. The principal obstacle for such approaches is the combinatorial explosion in the different combinations of promoter elements to be examined. Methods: Stemming from several state-of-the-art machine learning approaches, we here propose a heuristic, rule-based clustering method that uses gene expression similarity to guide the search for informative structures in promoters, thus exploring only the most promising parts of the vast and expressively rich rule space. Results: We present the utility of the method in the analysis of gene expression data on budding yeast S. cerevisiae where cells were induced to proliferate peroxisomes. Conclusions: We demonstrate that the proposed approach is able to infer informative relations uncovering relatively complex structures in gene promoter regions that regulate gene expression.

Correspondence to: Tomaz Curk, University of Ljubljana, Faculty of Comp. and Inf. Science, Trzaska c. 25, 1000 Ljubljana, Slovenia. E-mail: [email protected]

Methods Inf Med 2009; 48: 229–235
doi: 10.3414/ME9225
prepublished: April 20, 2009

1. Introduction
Regulation of gene expression is a complex mechanism in the biology of eukaryotic cells. Cells carry out their function and respond to the environment through an orchestration of transcription factors and other signaling molecules that influence gene expression. The resulting products regulate the expression of other genes, thus forming diverse sets of regulatory pathways. To better understand gene function and gene interactions we need to uncover and analyze the programs of gene regulation. Computational analysis [1] of gene-regulatory regions, which can use information from known gene sequences, putative binding sites and sets of gene expression studies, can greatly speed up and automate the tedious discovery process performed by classical genetics. The regulatory region of a gene is defined as a stretch of DNA which is normally located upstream of the gene’s coding region. Transcription factors are special proteins that bind to specific sequences (binding sites) in the regulatory regions, thus inhibiting or activating the expression of target genes. Regulation by binding of transcription factors is just one of many regulatory mechanisms. Expression is also determined by chromatin structure [2], epigenetic effects, and post-transcriptional, translational, post-translational and other forms of regulation [3]. Because there is a
general lack of these kinds of data, most current computational studies focus on inference of relations between gene-regulatory content and gene expression measured using DNA microarrays [4]. Determination of the regulatory region and of putative binding sites are the first crucial steps in such analyses. Regulatory and coding regions differ in nucleotide and codon frequency. This fact is successfully exploited by many prediction algorithms [5], and promoter (regulatory) sequences are readily available in public databases for most model organisms. The next crucial, well-studied, and notoriously difficult step is to determine the transcription factors’ putative binding sites in promoter regions. These are 4–20 nucleotide-long DNA sequences [3] which are highly conserved in the promoter regions of regulated genes. A matrix representation of the frequencies of the four nucleotides (A, T, C, G) at each position in the binding site is normally used in computational analysis. The TRANSFAC database [6] is a good source of experimentally confirmed and computationally inferred binding sites. Candidate binding sites for genes with unknown regulation can be found using local sequence alignment programs such as MEME [7]. A detailed description and evaluation of such tools is presented in the paper by Tompa et al. [8]. Most contemporary methods that try to relate gene structure and expression start with gene expression clustering and then determine cluster-specific binding sites [4, 9]. The success of such approaches relies strongly on the number and composition of gene clusters. Slight parameter changes in clustering procedures can lead to significantly different clusterings [10, 11], and consequently to the inference of different cluster-specific binding sites. Most often these methods search for non-overlapping clusters and may miss interesting relations, as it is known that genes can respond in many different ways and perform various functions [12].
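The matrix representation of binding sites mentioned above can be made concrete with a short sketch: the following fragment builds a toy position weight matrix from hypothetical nucleotide counts and slides it along a short promoter sequence to score candidate sites. The counts and sequence are invented for illustration only.

```python
# Minimal sketch: scoring candidate binding sites with a position weight matrix.
import numpy as np

bases = "ACGT"
# Hypothetical 4-nucleotide motif: rows A, C, G, T; columns = positions.
counts = np.array([[8, 1, 1, 9],
                   [1, 8, 1, 0],
                   [0, 1, 8, 1],
                   [1, 0, 0, 0]], dtype=float)
freqs = (counts + 0.5) / (counts + 0.5).sum(axis=0)   # column-wise, with pseudocounts
pwm = np.log2(freqs / 0.25)                           # log-odds against a uniform background

def best_hit(promoter):
    """Slide the PWM over a promoter sequence and return the best score and its position."""
    w = pwm.shape[1]
    idx = {b: i for i, b in enumerate(bases)}
    scores = [sum(pwm[idx[base], j] for j, base in enumerate(promoter[s:s + w]))
              for s in range(len(promoter) - w + 1)]
    return max(scores), int(np.argmax(scores))

print(best_hit("TTACGAGCATACGT"))   # (score, start position of the best match)
```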
An alternative to clustering-first approaches is provided by methods that start with information on binding sites and search for descriptions shared by similarly expressed genes. For example, in an approach by Chiang et al. [13], the pairwise gene expression intra-correlation is computed for each set of genes containing a specific binding site in the promoter region. Their method reports binding sites for which this correlation is statistically significant, but fails to investigate combinations of two or more putative binding sites; yet it is known that regulation of gene expression can be highly combinatorial and requires the coordinated presence of many transcription factors. There are other approaches in which combinations of binding sites are investigated, but they are often limited to the presence of two sites due to the combinatorial explosion of the search [4, 14]. For example, the number of all possible combinations of three binding sites, from a base of a thousand binding sites available for modeling, quickly grows into hundreds of millions. Transcription is also affected by the absolute or relative orientation of and distance between binding sites and other landmarks in the promoter region (i.e., the translation start ATG), further complicating the language that should be used to model promoter structure and subsequently increasing the search space.
To overcome the limitations described above, we have devised a new algorithm that can infer potentially complex promoter sequence patterns and relate them to gene expression. In the approach, which we call rule-based clustering (RBC), we essentially borrowed from several approaches developed within machine learning that use heuristic search to cope with potentially huge search spaces. The uniqueness of the presented algorithm is its ability to discover groups of genes that share any combination of promoter elements, whose placement and orientation can be specified relative to the start of the gene or to another promoter element. Below, we first define the language we use to describe the constitution of the promoter region, then describe the RBC algorithm and finally illustrate its application on the analysis of peroxisome proliferation data on S. cerevisiae.
2. Rule-based Clustering Method
The inputs to the proposed rule-based clustering (RBC) method are gene expression profiles and data on their promoter-regulatory elements. The algorithm does not include any preprocessing of expression data (e.g., normalization, scaling) and considers the data as provided.
Fig. 1 Example of a rule search trace. Rule refinements that result in a significant increase in gene expression coherence (check mark) are explored further. Search along unpromising branches is stopped (cross).
For each gene, the data on regulatory elements is given as a set of sequence motifs with their position relative to the start of the gene and their orientation. The motifs are represented either by a position weight matrix [7] or a single-line consensus; the former was used in all our experiments. The RBC algorithm aims to find clusters of similarly expressed genes with structurally similar promoter regions. The output of the algorithm is a set of rules of the form “IF structure THEN expression profile”, where structure is an assertion over the regulatory elements in the gene promoter sequence and expression profile is a set of expression profiles of matching genes.
2.1 Descriptive Language for Assertions on Promoter Structure
RBC discovers rules that contain assertions/conditions on the structure of the promoter region that include the presence of binding sites, the distance of the binding sites from the transcription and translation start site (ATG), the distance between binding sites, and the orientation of binding sites. We have devised a simple language to represent these assertions. For instance, the expression “S1” says that site S1 (in whichever orientation) must be present in the promoter, and the expression “S1-@-d1(ref:S2)” asserts that both sites S1 and S2 should be present in the promoter region such that S1, in the nonsense direction, appears d1 nucleotides upstream of S2. The proposed description language is not unequivocal: the same promoter structure may often be described in several different ways. For example, any of the following rules may describe the same structure: “S1+@-d1(ref:ATG) and S2-@d2(ref:S1)”, “S2-@-d3(ref:ATG) and S1+@-d2(ref:S2)”, and “S1+@-d1(ref:ATG) and S2-@-d3(ref:ATG)”. All three descriptions require sites S1 and S2 to be oriented in the sense and nonsense directions, respectively. The first rule requires site S1 to be positioned at distance d1 from the reference ATG (translation start site) and the position of S2 to be relative to S1. According to the second rule, the position of S1 is relative to the absolutely positioned S2 at distance d3 from ATG. The third rule defines the positions of both sites relative to ATG. In such cases, the RBC algorithm will return only one of the semantically equivalent descriptions, depending on the order in which they were found in the heuristic search.
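As a rough sketch of how such assertions could be represented and evaluated in code (this is our illustration, not the authors' implementation), the following fragment encodes terms of the form “S1-@-d1(ref:S2)” as small objects and checks them against a hypothetical annotated promoter; all site names, positions and tolerances are invented.

```python
# Sketch (hypothetical): representing and matching promoter-structure assertions.
from dataclasses import dataclass

@dataclass
class SiteMatch:
    site: str        # motif identifier, e.g. "S1"
    position: int    # position relative to ATG (negative = upstream)
    strand: str      # "+" (sense) or "-" (nonsense)

@dataclass
class Term:
    site: str
    strand: str = None       # None = any orientation
    distance: int = None     # required offset from the reference, if any
    ref: str = "ATG"         # "ATG" or another site name
    tolerance: int = 20      # slack allowed around `distance`

def term_holds(term, matches):
    """Check one term of a rule condition against a promoter's annotations."""
    refs = [0] if term.ref == "ATG" else \
           [m.position for m in matches if m.site == term.ref]
    for m in matches:
        if m.site != term.site:
            continue
        if term.strand and m.strand != term.strand:
            continue
        if term.distance is None:
            return True
        if any(abs((m.position - r) - term.distance) <= term.tolerance for r in refs):
            return True
    return False

promoter = [SiteMatch("S1", -420, "-"), SiteMatch("S2", -300, "+")]
rule = [Term("S1", strand="-", distance=-120, ref="S2"), Term("S2")]
print(all(term_holds(t, promoter) for t in rule))   # True for this toy promoter
```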
Fig. 2 Outline of the RBC algorithm
2.2 RBC Algorithm
The proposed algorithm is outlined in Figure 2. As input it requires data on gene expression profiles P_all and data on promoter elements in the corresponding gene-regulatory regions. The algorithm returns a list of inferred rules of the form R = (C, P), with a condition C on the promoter structure that is shared by genes with similar gene expression profiles P. RBC uses a beam-search approach (lines 3–12) followed by two post-processing steps
(lines 13 and 14 of the algorithm). Beam is a list of at most L currently inferred rules considered for further refinement, ordered according to their associated scores (see below). L is a user-defined parameter (with a default value of 1000) that affects the scope of the search and thus the runtime. At the start of the search, Beam is initialized with a rule “IF True THEN P_all” that covers all genes under consideration. In every iteration of the main loop (lines 3 to 12), the search focuses on the best-scored rule R = (C, P) from Beam and considers all possible single-term extensions of its condition C that are allowed by the given descriptive language. Each such refinement results in a new candidate rule, which is added
to the list of Candidates (line 6). The refinements include adding terms asserting the presence of a site, the presence of a site with a given orientation, or the presence of a site (with or without the information on orientation) at a relative distance from a specific landmark (another site or the start of the gene). Refined rules are then represented in a simplified form. For instance, adding a single-site presence condition S1 to the initial rule “(True, P_all)” yields a rule “True and S1”, which is simplified to its logical equivalent “S1”. Adding a term with the same site but nonsense orientation to the latter yields the rule “S1 and S1-”, which is simplified to “S1-”. Similarly, adding a term with the same site but with information on a distance of 100 to 80 nucleotides
to the ATG may result in a rule such as “S1@-100..-80(ref:ATG)”. Requirements on other binding sites may be added, either simply by requiring their presence (e.g., rule “S1 and S2”) or by adding them as a reference to the presently included sites in conditions (e.g., “S1@-100..-80(ref:S2)”). Candidate rules are retained only if they match at least N genes, where N is a user-defined parameter with a default value of six. Candidate rules are then compared to their (non-refined) parent rule based on the intra-cluster pairwise gene expression profile distances of the covered genes. To identify co-expressed genes, the algorithm uses Pearson correlation as the default distance measure, which, when computing the distance between two genes, ignores experiments in which the expression of either of the two genes is missing. The user can replace it with any other type of distance function that suits the particular type of expression profiles or the biological question addressed. For a set of candidate rules, only those with a significant reduction of this distance are retained in the list of Candidates (line 7). This decrease of variance in the intra-cluster pairwise distances is tested using the F-test statistic:
F = (SS_R / (n_R − 1)) / (SS_Candidate / (n_Candidate − 1)),
where SS_R and SS_Candidate are the sums of squared differences from the mean inside the cluster of genes covered by the parent rule R and by a refined Candidate rule, respectively, and n_R and n_Candidate are the total numbers of genes in each of the two clusters. A p-value is calculated from the F score and used to determine the significance of the change (the threshold, α_F, defaults to 0.05). Figure 1 shows an example of refinements explored during rule search that may lead to the identification of profile-coherent gene clusters. The resulting refined rules stored in the Candidates list are added to Beam (line 9), which retains at most the L best-scored rules (line 10). Because the goal is to discover the most homogeneous clusters, each rule is scored according to the potential coherence of its corresponding sub-cluster, i.e., the sub-cluster potentially obtained after further refinement of the rule. Potential coherence estimates how promising the cluster is in terms of finding a good subset of genes.
While examining all subgroups of genes in the cluster would be an option, such an estimate is computationally expensive because of the potentially large number of subgroups. Instead, we define the potential coherence of a cluster as the average of the k · N · (k · N – 1)/2 minimal pairwise profile distances. This approximates, in a way, the choice of a subset of the k · N most similar genes. If the cluster being estimated contains fewer than k · N genes, its estimated potential equals the average of all pairwise gene distances. Rules for which the above procedure finds no suitable refinements and whose intra-cluster pairwise distance is below a user-defined threshold D are added to Rules, the list that stores the terminal rules discovered by the RBC algorithm (line 12). Note that the process of taking the best-scored rule from Beam, refining it, and adding newly found rules (if any) that improve the intra-cluster profile distances is repeated until Beam is left empty. To further reduce the potentially large number of rules found by the beam search, RBC uses two post-processing steps (lines 13 and 14). RBC may infer rules that describe exactly the same cluster of genes. Each such rule set is considered individually, with the aim of retaining only the most general rules from it. That is, for each pair of rules with conditions C1 and C2, only the first rule from the pair is retained in the rule set if its condition C1 subsumes condition C2, that is, it covers the same genes but is more general in terms of logic. For instance, condition “S1” subsumes condition “S1 and S2”. The remaining list of Rules is further filtered by keeping only the most coherent rules, so that on average no more than a limited number of rules describe any gene (parameter M set by the user, default is five). The final set of rules is formed by selecting the rules with the lowest intra-cluster distance first, and adding them to the final set only if their inclusion does not increase the rule coverage of any gene beyond M. As an alternative to considering all the genes in its input data, RBC can additionally use information on a set of target genes on which the user wants to focus the analysis. Typically, target genes would comprise a subgroup of similarly annotated genes, or a subset of differentially expressed genes. If a target set is given, discovered rules are included in Beam
and in the final set only if they cover at least N target genes. Because the algorithm starts with one rule (line 1), which describes all genes, the discovered rules can cover genes outside the target set. The method is thus able to identify genes that were initially left out of a target set but should have been included based on their regulatory content and gene expression. The proposed rule-based clustering method was inspired by the beam-search procedure successfully used in the well-known supervised machine learning algorithm CN2 [15], and by the unsupervised clustering-trees approach developed by Blockeel et al. [16], but it is substantially different from both in its implementation and application. CN2 infers rules that relate attribute-value-based descriptions of objects to their discrete class, while clustering trees identify attribute-value-based descriptions of non-overlapping clusters of similar objects. RBC combines both approaches by using a beam search to infer symbolic descriptions of potentially overlapping clusters of similarly regulated genes. Compared to the beam search in CN2, where the size of the beam is relatively small (the 10–20 best rules are most often considered for further refinement), RBC uses a much wider beam but also generates potentially overlapping rules in a single loop. In contrast, in CN2, only the best-found rule is retained, the objects covered by it are removed from the data, and the procedure is restarted until no objects to be explored remain. As in CN2, the essence of our algorithm is rule refinement, for which, in the area of machine learning, beam search has proved to be an appropriate heuristic method.
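The following sketch shows one way the potential-coherence score described above could be computed, using 1 − Pearson correlation as the pairwise distance; it reflects our reading of the description and is not the authors' code.

```python
# Sketch: potential coherence as the average of the k*N*(k*N-1)/2 smallest
# pairwise (1 - Pearson) distances within a candidate cluster.
import numpy as np

def pairwise_distances(profiles):
    """1 - Pearson correlation between all pairs of expression profiles (rows)."""
    corr = np.corrcoef(profiles)
    iu = np.triu_indices(len(profiles), k=1)
    return 1.0 - corr[iu]

def potential_coherence(profiles, k=2, N=6):
    d = np.sort(pairwise_distances(profiles))
    m = k * N * (k * N - 1) // 2
    if len(profiles) < k * N:       # small cluster: average over all pairs
        return d.mean()
    return d[:m].mean()             # average of the m smallest distances

rng = np.random.default_rng(0)
cluster = rng.normal(size=(20, 8))  # 20 genes, 8 hypothetical measurements
print(round(potential_coherence(cluster), 3))
```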
3. A Case Study and Experimental Validation
We applied the proposed RBC method to data from a microarray transcription profiling study in which budding yeast S. cerevisiae cells were induced to proliferate peroxisomes – organelles that compartmentalize several oxidative reactions – as the cell’s regulated response to exposure to oleic fatty acid (oleate) and to the absence of glucose, which causes peroxisome repression [17]. The transcriptional profile of each gene consists of six microarray measurements on an oleate induction time course, and two measurements in “oleate vs. glucose” and “glucose vs. glycerol” growth conditions. In total, the gene pairwise distance was calculated on gene expression profiles consisting of eight microarray measurements. We defined the pairwise distance function to be 1.0 – r, where r is the Pearson correlation between two gene profiles.

Fig. 3 a) Gene network, where we connect genes from the same rule, for the peroxisome data set (target genes in gray, genes outside the target set in black). It includes 114 target genes and 7 outside genes, which are clustered in six major groups. b) Group graph of the discovered 37 clusters (two groups are connected if sharing a subset of genes). c and d) Inferred promoter structure and gene expression of the two sub-clusters forming the eight-gene cluster, marked “1” in Figure 3a (also shown as clusters “group 37” and “group 34” in Fig. 3b).
For the target group we selected a set of 224 genes identified by the study to have similar expression profiles to those of genes involved in peroxisome biogenesis and peroxisome function. The goal of our analysis was to further divide the target group into smaller subgroups of genes with common promoter structure and possibly identify genes that were
inadvertently left out of the target group but should have been included based on their expression and promoter structure similarity. We analyzed data on 2135 putative binding sites which were identified using the local sequence alignment software tool MEME [7]. We searched for the presence of these binding sites in 1 Kb promoter regions taken
upstream from the translation start site (ATG) for ~6700 genes. The search identified ~302,000 matches of putative binding sites that were then used to infer rules with RBC. The algorithm was run with the default values of the parameters. Distances between binding sites were rounded to increments of 40 bases; the maximum possible range of 2 Kb (for the given promoter length, relative distances can be from –1 Kb to +1 Kb) was thus reduced to 50 different values. This largely reduced the number of possible subintervals that needed to be considered during rule inference.
The search returned 41 rules that described and divided 114 target genes (51% of the target genes) into 37 subgroups (see Fig. 3b). No rule could be found to describe the remaining 110 target genes. Most of the discovered gene groups are composed of five genes with high pairwise intra-group correlation (above 0.927). Many genes are shared (overlap) between the 37 discovered groups, resulting in six major gene groups visible in Figures 3a and 3b. Seven genes outside the target set were also identified by the method (marked in black in Fig. 3a). For example, the smallest, eight-gene group in the top-left corner of Figure 3a includes two outsiders (INP53 and YIL168W – also named SDL1). Gene ontology annotation shows that INP53 is involved together with two target genes (ATP3 and VHS1) in the biological process phosphate metabolism. Gene SDL1 is annotated to function together with the group’s target gene LYS14 in the biological process amino acid metabolism and other similar parent GO terms (results not shown). Details on the promoter structure and gene expression are given in Figures 3c and 3d. These examples confirm the method’s ability to identify functionally related genes that were not initially included in the target set. The majority of the discovered rules in the case study include conditions that are composed of three terms, describing a binding site’s orientation and distance relative to ATG or to other binding sites. There is no general binding site that would appear in many rules; only two rules include the same binding site (results not shown).
Exhaustive search of even relatively simple rules can quickly grow into a prohibitively hard problem due to combinatorial explosion. Exhaustive search for all possible rules composed of three binding sites with defined orientation (three possible values: positive, negative, no preference) and distance (the distance range is reduced to 50 different values) would, for this case study, require checking a huge number of rules, on the order of (2135 choose 3) · 3^3 · 50^3 ≈ 5 × 10^15. Our method checked approximately 2.11 × 10^9 of the most promising rules, or less than 0.00004% of the entire three-term rule space. The search took 40 minutes on a Pentium 4, 3.4 GHz workstation. This demonstrates RBC’s ability to efficiently derive potentially complex rules within a reasonable time frame.
To evaluate the predictive ability of the approach we used a data set on 1364 S. cerevisiae genes that includes accurate binding site data for 83 transcription factors [18]. We modeled the regulatory region spanning from –800 bp to 0 bp relative to ATG. Pairwise gene distance was calculated as the average pairwise distance across 19 gene expression microarray studies available at SGD’s Expression Connection database (http://www.yeastgenome.org/). All genes were considered to be target genes. Fivefold cross-validation was used, randomly splitting the genes into five sets. Clustering and testing of the inferred rules was repeated five times, each time with a different set of genes for validation of a model constructed using the remaining four sets. Each discovered rule was tested on genes in the test set. If a rule matched the promoter region of a test gene, we calculated the prediction error as the distance between the true gene expression of the test gene and its predicted expression. When more than one rule could be applied to predict the expression of a test gene, the average prediction error was returned for that gene. Overall, the method successfully predicted the expression of 286 genes (21% of all genes considered), with an average cross-validation prediction error of 0.75. If we were to use “random” rules, which would randomly cluster genes into groups of the same size as those produced by the inferred rules, we could expect the prediction error to be 0.96. We believe that the achieved prediction error is a good indication of the predictive quality of the inferred rules.
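The evaluation scheme just described can be sketched as follows; rule induction and rule-based prediction are replaced by a trivial placeholder (predicting the mean training profile), so only the fivefold cross-validation loop and the 1 − Pearson error measure mirror the text, and the data are synthetic.

```python
# Sketch of the fivefold cross-validation evaluation with a placeholder model.
import numpy as np
from sklearn.model_selection import KFold

def profile_distance(a, b):
    """1 - Pearson correlation, as used for the gene pairwise distance."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
profiles = rng.normal(size=(1364, 19))   # genes x expression studies (synthetic)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(profiles):
    # Placeholder "model": in RBC this would be the set of rules inferred on train_idx.
    predicted = profiles[train_idx].mean(axis=0)
    for g in test_idx:                   # in RBC, only genes matched by some rule
        errors.append(profile_distance(profiles[g], predicted))

print("mean cross-validated prediction error:", round(float(np.mean(errors)), 3))
```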
4. Conclusion
The proposed rule-based clustering method can efficiently find rules of gene regulation by searching for groups of similarly expressed genes that also share a similar structure of the regulatory region. Starting from a target set of genes of interest, the method was able to cluster them into subgroups. Concurrently, RBC may expand the target set by identifying other similarly regulated genes that were initially overlooked by the user. The rule search is guided and made efficient by the proposed search heuristics. An important feature of RBC is its ability to discover overlapping groups of genes, potentially indicating common regulation or function. The algorithm uses a number of parameters that essentially determine the size of the search space being examined. The default values provided with the algorithm were set according to particular characteristics of the domain (e.g., about 10,000 genes, a small subset of genes sharing some motif pattern, most known patterns including from one to five motifs [19]). The choice of parameters also affects the run time, and the defaults were chosen to make the implementation practical and to infer the rules within one hour of computational time on a standard personal computer. We have experimentally confirmed the ability of the RBC algorithm with default settings to infer rules that describe a complex regulatory structure and which can be used to reliably predict gene expression from regulatory content. In contrast to other contemporary methods that mainly use information on the presence of binding sites, a principal novelty of our approach is the use of a rich descriptive language to model the promoter structure. The language can be easily extended to accommodate other descriptive features, such as chromatin structure, when such kinds of data become available on a genome-wide scale. To summarize and display the findings of the analysis at different levels of abstraction we have applied different visualizations, which proved useful for understanding and biological interpretation. We believe that the main application of RBC is an exploratory search for additional evidence that genes, in theoretically or experimentally defined groups, actually share a common regulatory
mechanism. The biologist can then gain insight by looking at the presented evidence and can better decide which inferred patterns are worth testing in the laboratory.
Acknowledgments This work was supported in part by Program and Project grants from the Slovenian Research Agency (P2-0209, J2-9699, P1-0207) and by a grant from the National Institute of Child Health and Human Development (P01-HD39691).
References 1. Bellazzi R, Zupan B. Intelligent data analysis – special issue. Methods Inf Med 2001; 40 (5): 362–364. 2. Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, et al. A genomic code for nucleosome positioning. Nature 2006; 442 (7104): 772–778.
3. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004; 5 (4): 276–287. 4. Beer MA, Tavazoie S. Predicting gene expression from sequence. Cell 2004; 117 (2): 185–198. 5. Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004; 22 (11): 1467–1473. 6. Wingender E, Dietze P, Karas H, Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996; 24 (1): 238–241. 7. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994; 2: 28–36. 8. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005; 23 (1): 137–144. 9. Down TA, Bergman CM, Su J, Hubbard TJ. Large-scale discovery of promoter motifs in Drosophila melanogaster. PLoS Comput Biol 2007; 3 (1): e7. 10. Bolshakova N, Azuaje F. Estimating the number of clusters in DNA microarray data. Methods Inf Med 2006; 45 (2): 153–157. 11. Rahnenfuhrer J. Clustering algorithms and other exploratory methods for microarray data analysis. Methods Inf Med 2005; 44 (3): 444–448.
12. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet 2002; 31 (4): 370–377. 13. Chiang DY, Brown PO, Eisen MB. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics 2001; 17 (Suppl 1): S49–55. 14. Pilpel Y, Sudarsanam P, Church GM. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 2001; 29 (2): 153–159. 15. Clark P, Niblett T. The CN2 induction algorithm. Machine Learning 1989; 3 (4): 261–283. 16. Blockeel H, De Raedt L, Ramon J. Top-down induction of clustering trees. Machine Learning 1998. 17. Smith JJ, Marelli M, Christmas RH, Vizeacoumar FJ, Dilworth DJ, Ideker T, et al. Transcriptome profiling to identify genes involved in peroxisome assembly and function. J Cell Biol 2002; 158 (2): 259–271. 18. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 2006; 7: 113. 19. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, et al. Transcriptional regulatory code of a eukaryotic genome. Nature 2004; 431 (7004): 99–104.
Original Articles
Estimation of Distribution Algorithms as Logistic Regression Regularizers of Microarray Classifiers
C. Bielza1; V. Robles2; P. Larrañaga1
1Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Spain; 2Departamento de Arquitectura y Tecnología de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain
Keywords Logistic regression, regularization, estimation of distribution algorithms, DNA microarrays
Summary
Objectives: The “large k (genes), small N (samples)” phenomenon complicates the problem of microarray classification with logistic regression. The indeterminacy of the maximum likelihood solutions, multicollinearity of predictor variables and data over-fitting cause unstable parameter estimates. Moreover, computational problems arise due to the large number of predictor (gene) variables. Regularized logistic regression excels as a solution. However, the difficulties found here involve an objective function that is hard to optimize from a mathematical viewpoint and regularization parameters that require careful tuning. Methods: These difficulties are tackled by introducing a new way of regularizing the logistic regression. Estimation of distribution algorithms (EDAs), a kind of evolutionary algorithm, emerge as natural regularizers. Obtaining the regularized estimates of the logistic classifier amounts to maximizing the likelihood function via our EDA, without the need for a penalty term. Likelihood penalties add a number of difficulties to the resulting optimization problems, which vanish in our case. Simulation of new estimates during the evolutionary process of EDAs is performed in such a way that their shrinkage is guaranteed while the learnt probabilistic dependence relationships are maintained. The EDA process is embedded in an adapted recursive feature elimination procedure, thereby providing the genes that are the best markers for the classification. Results: The consistency with the literature and the excellent classification performance achieved with our algorithm are illustrated on four microarray data sets: Breast, Colon, Leukemia and Prostate. Details on the last two data sets are available as supplementary material. Conclusions: We have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during the EDA evolution process while optimizing the usual likelihood function. The approach is combined with a gene subset selection procedure and automatically tunes the required parameters. Empirical results on microarray data sets yield sparse models with confirmed genes that perform better in classification than other competing regularized methods.

Correspondence to: Concha Bielza, Facultad de Informática, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain. E-mail: [email protected]
Methods Inf Med 2009; 48: 236–241 doi: 10.3414/ME9223 prepublished: March 31, 2009
1. Introduction

The development of DNA microarray technology allows screening of gene expression levels from different tissue samples (e.g. cancerous and normal). The resulting gene expression data help explore gene interactions, discover gene functions and classify individual cancerous/normal samples, using different supervised learning techniques [1, 2]. Among these techniques, logistic regression [3] is widely used because it provides explicit probabilities of class membership and an interpretation of the regression coefficients of the predictor variables, and it avoids Gaussianity or correlation-structure assumptions. Microarray classification is a challenging task since these data typically involve extremely high dimensionality (thousands of genes) and small sample sizes (fewer than one hundred cases). This is the so-called “large k (variables), small N (samples) problem” or the “curse of dimensionality”. This may cause a number of statistical problems for estimating parameters properly. First, a large number of parameters have to be estimated using a very small number of samples. Therefore, an infinite number of solutions is possible, as the problem is underdetermined. Second, multicollinearity is widespread: the likelihood of some gene profiles being linear combinations of other gene profiles grows as more and more variables are introduced into the model, thereby supplying no new information. Third, over-fitting may occur, i.e. the model may fit the training data well but perform badly on new samples. These problems yield unstable parameter estimates. Furthermore, there are also computational problems due to the large number of predictor variables. Traditional numerical algorithms for finding the estimates, like Newton-Raphson's method [4], require prohibitive
computations to invert a huge, sometimes singular, matrix at each iteration. To alleviate this situation within the context of logistic regression, many authors use techniques of dimensionality reduction and feature (or variable) selection [5]. Feature selection methods yield parsimonious models which reduce information costs, are easier to explain and understand, and increase model applicability and robustness. The goodness of a proposed gene subset may be assessed via an initial screening process where genes are selected in terms of some univariate or multivariate scoring metric (the filter approach [6]). By contrast, wrapper approaches search for good gene subsets using the classifier itself as part of their function evaluation [7]. A performance estimate of the classifier trained with each subset assesses the merit of this subset. Imposing a penalty on the size of the logistic regression coefficients is a different solution. Finding a maximum likelihood estimate subject to spherical restrictions on the logistic regression parameters leads to ridge or quadratic (penalized) logistic regression [8]. Therefore, the ridge estimator is a restricted maximum likelihood estimator (MLE). Shrinking the coefficients towards zero and allowing a little bias provide more stable estimates with smaller variance. Apart from ridge penalization, there are other penalties within the more general framework of regularization methods. All of them aim at balancing the fit to the data and the stability of the estimates. These methods are computationally much more efficient than wrapper methods, with similar performance. Furthermore, regularization methods are more continuous than the usual discrete processes of retaining or discarding features, and therefore do not suffer as much from high variability. Here we introduce estimation of distribution algorithms (EDAs) as natural regularizers within the logistic regression context. EDAs are a recent optimization heuristic included in the class of stochastic population-based search methods [9]. EDAs work by constructing an explicit probability model from a set of selected solutions, which is then conveniently used to generate new promising solutions in the next iteration of the evolutionary process. An optimization heuristic is an appropriate tool since shaping the logistic
classifier means estimating its parameters, which in turn entails solving a maximization problem. Unlike traditional numerical methods, EDAs do not require derivative information or matrix inversions. Moreover, with penalized likelihoods used as fitness functions, EDAs could similarly maximize them to tackle the k >> N problem. This would, however, merely pit a heuristic (EDA) against a numerical (Newton-Raphson) method. In this paper we will show that the EDA framework is so general that, under certain parameterizations, it obtains the regularized estimates in a natural way, without penalizing the original likelihood. EDAs receive the unrestricted likelihood as input and generate the restricted MLEs as output.
2. Methods

2.1 Logistic Regression for Microarray Data

Assume we have a (training) data set DN of N independent samples from microarray experiments, DN = {(cj, xj1, ..., xjk), j = 1, ..., N}, where xj = (xj1, ..., xjk)t ∈ Rk is the gene expression profile of the j-th sample, xji indicates the i-th gene expression level of the j-th sample and cj is the known class label of the j-th sample, 0 or 1, for the different states. We assume the expression profile x to be preprocessed, log-transformed and standardized to zero mean and unit variance across genes. Let πj, j = 1, ..., N denote P(C = 1 | xj), i.e. the conditional probability of belonging to class state 1 given gene expression profile xj. Then the logistic regression model is defined as

πj = exp(ηj) / (1 + exp(ηj)), with ηj = β0 + β1xj1 + ... + βkxjk,     (1)

where β = (β0, β1, ..., βk)t denotes the vector of regression coefficients including the intercept β0. From DN, the log-likelihood function is built as

l(β) = Σj=1..N [cj log πj + (1 − cj) log(1 − πj)],     (2)

where πj is given by (1). MLEs, β̂, are obtained by maximizing l with respect to β. Let β̂ be the maximizer of l. Newton-Raphson's algorithm is traditionally used to solve the resulting nonlinear equations. Other methods [10] are gradient ascent, coordinate ascent, conjugate gradient ascent, fixed-Hessian Newton, quasi-Newton algorithms (DFP and BFGS), iterative scaling, Nelder-Mead and random integration.

2.2 Regularized Approaches to Logistic Regression

Ridge logistic regression seeks MLEs subject to spherical restrictions on the parameters. Therefore, the function to be maximized is the penalized log-likelihood given by

l*(β) = l(β) − λ Σi=1..k βi²,     (3)

where λ > 0 is the penalty parameter and controls the amount of shrinkage. λ is usually chosen by cross-validation. The cross-validation deviance, error, BIC or AIC are used as the criteria to be optimized. Let β̂* be the maximizer of Equation 3, or ridge estimator. This estimator always exists and is unique. In the field of microarray classification, Newton-Raphson's algorithm may be employed, but it requires a matrix of dimension k + 1 to be inverted. Inverting huge matrices may be avoided to some extent with algorithms like the dual algorithm based on sequential minimal optimization [11] or SVD [12]. Combined with SVD, [13, 14] use a feature selection method called recursive feature elimination (RFE) [15] that iteratively removes the genes with smaller absolute values of the estimated coefficients β̂i. Within a broader context, the log-likelihood can be penalized as

l*(β) = l(β) − λ Σi=1..k ψ(βi),

where the penalty function is generally of the form ψ(βi) = |βi|^p, p > 0. The L1 penalty ψ(βi) = |βi| results in lasso, introduced by [16] in the context of logistic regression. In a Bayesian setting, the prior corresponding to this case is an independent Laplace distribution (or double exponential) for each βi. Cawley and Talbot [17] even model the penalty parameter λ by using a Jeffreys' prior to eliminate this parameter by integrating it out analytically. Although the objective function is still concave in lasso (as in ridge regression), an added
computational problem is that this function is not differentiable. Generic methods for non-differentiable concave problems, such as the ellipsoid method or subgradient methods, are usually very slow in practice. Faster methods have recently been investigated [18, 19]. Interest in lasso is growing because the L1 penalty encourages the estimators to be either significantly large or exactly zero, which has the effect of automatically performing feature selection and hence yields concise models.
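As a point of reference for the regularized baselines discussed above, the following minimal sketch fits L2- (ridge) and L1- (lasso) penalized logistic regression to a synthetic "large k, small N" data set. It is not part of the original study: the synthetic data, the regularization strength C, the solver choices and the use of scikit-learn are our illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, k = 60, 500                              # small N, large k, as in microarray data
X = rng.standard_normal((N, k))
beta_true = np.zeros(k)
beta_true[:5] = [2.0, -1.5, 1.0, -1.0, 0.5]  # only a few informative "genes"
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
c = rng.binomial(1, p)                      # binary class labels

# Ridge (L2) penalized logistic regression: shrinks all coefficients towards zero.
ridge = LogisticRegression(penalty="l2", C=0.1, solver="lbfgs", max_iter=5000).fit(X, c)
# Lasso (L1) penalized logistic regression: drives many coefficients to exactly zero.
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=5000).fit(X, c)

print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
print("non-zero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
```

Running the sketch illustrates the behavior described in the text: the L2 penalty leaves all coefficients non-zero but small, whereas the L1 penalty performs implicit feature selection.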
2.3 EDAs for Regularizing Logistic Regression-based Microarray Classifiers

Among the stochastic population-based search methods, EDAs have recently emerged as a general framework that overcomes some weaknesses of other well-known methods like genetic algorithms [9]. Unlike genetic algorithms, EDAs avoid the ad hoc design of crossover and mutation operators, as well as the tuning of a large number of parameters, while they explicitly capture the relationships among the problem variables by means of a joint probability distribution (jpd). The main scheme underlying the EDA approach, which will be denoted Proc-EDA, is:
1. D0 ← Generate M points of the search space randomly
2. h = 1
3. do {
4. DSeh−1 ← Select M′ < M points of the search space from Dh−1
5. ph(z) = p(z | DSeh−1) ← Estimate the jpd from the selected points of the search space
6. Dh ← Sample M points of the search space (the new population) from ph(z)
7. } until a stopping criterion is met
The M points of the search space constituting the initial population are generated at random. All of them are evaluated by means of a fitness function (step 1). Then, M′ (M′ < M) points are selected according to a selection method, taking the fitness function into account (step 4). Next, a multidimensional probabilistic model that reflects the interdependencies between the encoded variables in these M′ selected points is induced (step 5). The estimation of this underlying jpd represents the EDA
bottleneck, as different degrees of complexity in the dependencies can be considered. In the next step, M new points of the search space – the new population – are obtained by sampling from the multidimensional probabilistic model learnt in the previous step (step 6). Steps 4 to 6 are repeated until some pre-defined stopping condition is met (step 7). Like other numerical methods (see above), such as Nelder-Mead's, EDAs work by simply evaluating the objective function at some points. However, Nelder-Mead's algorithm is deterministic and evaluates the vertices of a simplex, while EDAs are stochastic and require a population and the learning and simulation of models. If we confine ourselves to logistic regression classifiers, EDAs have been used for estimating the parameters from a multiobjective viewpoint [20]. EDAs could be successfully used to optimize any kind of penalized likelihood because, unlike traditional numerical methods, they do not require derivative information or matrix inversions. However, we investigate here a more interesting approach that shows that EDAs can act as an intrinsic regularizer if we choose a suitable representation. Thus, let us take l(β) (see Eq. 2) as the fitness function that assesses each possible solution β to the (unrestricted) maximum likelihood problem. β is a k + 1 dimensional continuous random variable. EDAs would start by randomly generating the initial population D0 of M points of the search space. After selecting M′ points (e.g. the top M′), the core of the EDA paradigm is step 5 above, estimating the jpd from these selected M′ points. Without loss of generality, we start from a univariate marginal distribution algorithm (UMDAcG) [21] in our continuous β-domain. UMDAcG assumes that at each generation h all variables are independent and normally distributed, i.e.
ph(β) = ∏i=0..k f(βi ; μih, σih),     (4)

where f(βi ; μih, σih) denotes a univariate normal density with mean μih and standard deviation σih.
See [22] for the theoretical support of UMDAcG. We now modify UMDAcG to tackle regularized logistic regression by shrinking the βi parameters during the EDA simulation step. Specifically, we introduce a new algorithm, UMDAcG*, that learns a UMDAcG model given by (4) at step 5 of iteration h. This involves
estimating the new μih and σih with the MLEs computed on the selected set of M′ points of the search space from the previous generation. However, sampling at step 6 now generates points from (4) with the normal distributions ph(βi) constrained to lie in an interval [−bh, bh]. This is readily achieved by generating values from a Gaussian with parameters μih and σih for each variable βi and constraining its outputs, according to a standard rejection method, to fall within [−bh, bh]. The idea is that, as the algorithm progresses, forcing the βi parameters to lie in a bounded interval around 0 constrains and stabilizes their values, just as regularization does. At step 5, we learn, for the random variable β, the multivariate Gaussian distribution with a diagonal covariance matrix that best fits, in terms of likelihood, the M′ β-points that are top ranked in the objective function l(β). We then generate, at step 6, M new points from the previous distribution truncated at each coordinate at −bh (bottom) and at bh (top). The new data are ranked with respect to their l(β) values, the best M′ are chosen, and so on. In spite of optimizing the function l(β) rather than a penalized log-likelihood function such as ridge regression's l*(β), the evolutionary process guarantees that the βi values belong to intervals of the desired size. Therefore, our estimates of βi are regularized estimates. In fact, we have empirically verified that the standard errors of our estimators are smaller than those of regularized approaches like ridge logistic regression, and exhibit fewer outliers than lasso. Moreover, since we use the original l(β) objective function of logistic regression, we do not need to specify the λ parameter of other penalized approaches like (3). Note that plenty of probability models are possible in (4), without necessarily assuming all variables to be Gaussian and independent. Different univariate, bivariate or multivariate dependencies may be designed, with the benefit of having an explicit model of (possibly) complex probabilistic relationships among the different parameters. Traditional numerical methods are unable to provide this kind of information. Thus, the estimation of Gaussian networks algorithm (EGNA) [21] models multivariate dependencies among the βi by learning at each generation a non-restricted normal density that maximizes the Bayesian information criterion (BIC) score. In EGNA, ph(β) factorizes as a Gaussian network [23]. The rationale for this assumption is in part justified by the fact that MLEs asymptotically follow a multivariate normal distribution. However, in our case the number of observations N is small and, as mentioned above, we do not have MLEs either, since our estimators are restricted MLEs. Finally, the last step, say at iteration h = T, would contain the final population DT, from which the final regularized estimate of β would be chosen as the argmax of l(β).

Fig. 1 Number of genes in set S vs. accuracy (%) and vs. bhop for Breast and Colon data sets
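The following sketch (ours, not the authors' C++ implementation) illustrates the UMDAcG*-style loop described in this section on a small synthetic problem: independent Gaussians are fitted to the top M′ points and new candidates are drawn with a rejection step that keeps every coordinate inside [−bh, bh]. The fitness is the logistic log-likelihood l(β) of Equation 2; the data, population sizes, bound bh and the fixed number of generations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training data: N samples, k "genes"; beta[0] plays the role of the intercept.
N, k = 40, 10
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, k))])
beta_true = np.zeros(k + 1)
beta_true[1:4] = [1.5, -1.0, 0.5]
c = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

def loglik(beta):
    """Logistic log-likelihood l(beta) of Equation 2, used as the EDA fitness."""
    eta = X @ beta
    return np.sum(c * eta - np.log1p(np.exp(eta)))

def sample_truncated(mu, sigma, b, size):
    """Rejection sampling from independent Gaussians restricted to [-b, b]."""
    out = np.empty((size, mu.size))
    for i in range(size):
        x = rng.normal(mu, sigma)
        while np.any(np.abs(x) > b):      # redraw until all coordinates lie in [-b, b]
            x = rng.normal(mu, sigma)
        out[i] = x
    return out

def umda_c_star(M=100, M_sel=50, b=2.0, generations=50):
    pop = rng.uniform(-b, b, size=(M, k + 1))            # initial population D0
    for _ in range(generations):
        fitness = np.array([loglik(beta) for beta in pop])
        selected = pop[np.argsort(fitness)[-M_sel:]]     # top M' points
        mu, sigma = selected.mean(axis=0), selected.std(axis=0) + 1e-6
        pop = sample_truncated(mu, sigma, b, M)          # shrinkage via truncation
    fitness = np.array([loglik(beta) for beta in pop])
    return pop[np.argmax(fitness)]                       # regularized estimate of beta

beta_hat = umda_c_star()
print("estimated coefficients:", np.round(beta_hat, 2))
```

The truncation of every coordinate to [−bh, bh] is what plays the role of the penalty term of ridge or lasso regression in this scheme; no λ parameter is needed.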
2.4 Gene Selection

Our EDA-based regularization is now embedded in a gene selection procedure. It takes into account the strength of each gene i, given by its regression coefficient βi, and automatically searches for an optimal bh according to the classification accuracy of the associated regularized model. The general procedure, denoted Proc-gene, is:
1. For a subset of genes S, search for the bh of the EDA approach using the classification accuracy as the criterion. Let bhop be the optimal value.
2. With bhop fixed, eliminate a percentage of the genes with the smallest βi² values. Let S be the new (smaller) set of genes.
3. Repeat steps 1 and 2 until there is only one gene left.
An optimal subset of genes is finally derived. Some remarks follow. In step 1, the subset S used to initialize the process may be chosen in different
ways. Basically, we can start with all the genes or we can use a filter approach to reduce the size of this subset. Since it is not clear which filter criterion to use and different filter criteria may lead to different conclusions, we propose here a kind of consensus among different filter criteria. Thus, for four filters f1, f2, f3 and f4, if gene i is ranked first by f1, second by f2, third by f3 and fourth by f4, then its rank aggregation would be 11. The top-ranked genes by this new agreement would be chosen. In our experiments we have used the following four filter criteria: 1) the BSS/WSS criterion (as in [24]), 2) the Pearson correlation coefficient to the class variable (as in [5, 25]), 3) a p-metric (as in [26]), and 4) a t-score. The search for the optimal bh for the EDA in step 1 amounts to running the EDA (Proc-EDA) several times (for different bh values) and measuring which of the fitted logistic regression models is the best. This is assessed by estimating the classifier's accuracy (percentage of correctly classified microarrays) as the generalization performance of the model.
Braga-Neto and Dougherty [27] proved the .632 bootstrap estimator to be a good overall estimator in small-sample microarray classification, and it was therefore the method chosen in this paper. In step 2 of Proc-gene, EDA has already provided a fitted model (with the best bh value) and then a gene selection method inspired by RFE is carried out. As in [13, 14], we remove more than one feature at a time for computational reasons (the original RFE only removes one), based on the smallest βi² values, which indicate a lower relative importance within the gene subset.
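A minimal sketch of the adapted RFE loop of Proc-gene is given below. It assumes a routine fit_regularized_model (hypothetical here; in the paper this role is played by the EDA with its optimal bh together with the .632 bootstrap accuracy) that returns one coefficient per gene and an accuracy estimate; at each round the 10% of genes with the smallest βi² are discarded.

```python
import numpy as np

def proc_gene(X, c, fit_regularized_model, drop_fraction=0.10):
    """Adapted recursive feature elimination (sketch of Proc-gene, steps 1-3)."""
    genes = np.arange(X.shape[1])          # indices of the current gene subset S
    best_acc, best_genes = -np.inf, genes
    while True:
        # Step 1: fit the regularized model on S and estimate its accuracy.
        beta, acc = fit_regularized_model(X[:, genes], c)
        if acc >= best_acc:
            best_acc, best_genes = acc, genes.copy()
        if genes.size == 1:
            break
        # Step 2: eliminate the drop_fraction of genes with the smallest beta_i^2.
        n_drop = max(1, int(round(drop_fraction * genes.size)))
        keep = np.sort(np.argsort(np.asarray(beta) ** 2)[n_drop:])
        genes = genes[keep]
    return best_acc, best_genes
```

Any classifier-fitting routine that returns one coefficient per gene plus an accuracy estimate can be plugged into this loop; the paper uses the EDA-regularized logistic regression described above.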
3. Results and Discussion

We illustrate how our approach really acts as a regularizer on some publicly available benchmark microarray data sets. First, the Breast data set [25] with 7129 genes and 49 tumor samples, 25 of them representing estrogen receptor-positive (ER+) and the other 24
Table 1 Selected top 7 genes with their β estimate for Breast

GenBank ID [ref.]        Description                                β
X87212_at [25]           H. sapiens mRNA for cathepsin C            –6.988
L26336_at [32]           Heat shock 70kDa protein 2                  6.980
L17131_ma1_at [16, 33]   Human high mobility group protein          –5.402
J03827_at                Y box binding protein-1 mRNA               –3.549
S62539_at [34]           Insulin receptor substrate 1                3.419
HG4716-HT5158_at [35]    Guanosine 5'-monophosphate synthase        –2.685
U30827_s_at [25, 36]     Splicing factor, arginine/serine-rich 5     2.480
Table 2 Selected top 9 genes with their β estimate for Colon

GenBank ID [ref.]   Description                                                                β
T94579 [38]         Human chitotriosidase precursor mRNA, complete cds                        –0.500
D26129 [40]         Ribonuclease pancreatic precursor (human)                                 –0.500
T40578 [39]         Caldesmon 1                                                               –0.499
R80427 [38]         C4-dicarboxylate transport sensor protein dctb (Rhizobium leguminosarum)  –0.497
Z50753 [38]         H.sapiens mRNA for GCAP-II/uroguanylin precursor                           0.496
M76378 [38]         Human cysteine-rich protein (CRP) gene, exons 5 and 6                      0.494
H06061 [38]         Voltage-dependent anion-selective channel protein 1 (Homo sapiens)         0.485
H08393 [38]         Collagen alpha 2(XI) chain (Homo sapiens)                                  0.482
T62947 [38]         60S ribosomal protein L24 (Arabidopsis thaliana)                          –0.480
being estrogen receptor-negative (ER–). Second, the Colon data set [28], which contains 2000 genes for 62 tissue samples: 40 cancer tissues and 22 normal tissues. Other public data sets have been studied: the Leukemia data set [29] and the Prostate cancer data set [30]. See the supplementary material on the web page at http://laurel.datsi.fi.upm.es/~vrobles/eda_lr_reg. We have developed our own implementation in C++ for the EDA-based regularized logistic regression (Proc-EDA) and in R for the gene selection method (Proc-gene) that calls the former. We tried two different EDA approaches: UMDAcG and EGNA. To run the EDAs we found that an initial population of at least M = 100 points and a selection of at least M′ = 50 points for learning guarantee robust β estimates. The relative change in the mean fitness value between successive generations was the chosen criterion for assessing the convergence of the Proc-EDA algorithm. As regards Proc-gene, we considered it reasonable to initialize it with 500 genes for the size of subset S. These were selected according to the aggregation of the four filter criteria as described above. Based on our experience, a good choice in the experiments for the number of bootstrap samples used for training was 100. The percentage of genes to be removed in step 2 was fixed at 10%. Figure 1a and Table 1 show the experimental results on the Breast data set. Since perfect classification (100%) is
achieved with many different gene subsets, we choose the subset with the fewest genes, i.e. the 7-gene model. Note how bhop, obtained at step 1 of procedure Proc-gene, varies as the number of selected genes changes due to the adapted RFE. Its minimum value is 0.5. Running times on an Intel Xeon 2 GHz under Linux are quite acceptable: almost 3 minutes for 500 genes, 39 s for 250, between 2.5 and 5 s for 75–125 genes, and less than 2 s for 70 genes or fewer. The seven genes found to separate ER+ from ER– samples achieve a higher classification accuracy than other up-to-date regularized methods. Shevade and Keerthi [16] report an accuracy of 81.9% and use logistic regression with an L1 penalty solved by the Gauss-Seidel method. They propose a different gene selection procedure and retain six genes, two of them also found by us (see below). Fort and Lambert-Lacroix [31] use a combination of PLS and ridge logistic regression to achieve an accuracy of about 87.5%. They perform a gene selection based on the BSS/WSS criterion, choosing some fixed number of genes (100, 500, 1000), although they do not indicate which they are. Finally, a slightly different approach followed in the original paper by West et al. [25], where a probit (binary) regression model is combined with a stochastic regularization and SVDs, yields an 89.4% accuracy using 100 genes selected according to their Pearson correlation coefficient to the class variable. When our results are compared to the most popular regularization methods, lasso and ridge logistic
regressions only achieve 98.23% and 98.46% accuracies, respectively, using in both cases the same 500 selected genes provided by the aggregation of the four filter criteria. All of our seven selected genes have been linked with breast cancer, confirming the consistency of our results with the literature (see Table 1). Figure 1b and Table 2 show the results on the Colon data set. Classes are less well separated, yielding at most a 99.65% accuracy, for the 9-gene model. Running times are longer than before: almost 10 minutes for 500 genes, 1.5 minutes for 250, between 2 and 7 s for 60–125 genes, and less than 2 s for 55 genes or fewer. An analysis of the selected genes and the accuracy reported by other directly related methods is as follows. Shevade and Keerthi [16] achieve an accuracy of 82.3% with eight genes, three of them – Z50753, T62947 and H08393 – included in our list. Liu et al. [37] use logistic regression with an Lp penalty, where p = 0.1, and retain 12 genes. Genes Z50753, M76378 and H08393 of their list are also in ours. They do not compute the accuracy but the AUC (0.988), which in our case for the 9-gene model is better (0.9996). Using a ridge logistic regression approach, Shen and Tan [14] keep 16 genes with an RFE similar to ours and report a 99.3% accuracy, without any mention of the specific genes selected. When our results are compared to lasso and ridge logistic regressions, these only achieve 89.74% and 90.51% accuracies, respectively, both lower than our 99.65% accuracy. Our 9-gene list includes genes identified as relevant for colon cancer in the literature (see Table 2). See the supplementary material for details on the EGNA factorizations.
4. Conclusions

The high interest of combining regularization with a dimension-reduction step to enhance classifier efficiency has been pointed out elsewhere [31]. Combined with a gene subset selection procedure that adapts the RFE and automatically tunes the required parameters, we have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during the EDA evolution process while optimizing the usual likelihood function. The
empirical results on several microarray data sets have provided models with a low number of relevant genes, most of them confirmed by the literature, and performing better in classification than other competing regularized methods. Unlike the traditional procedures for finding the maximum likelihood βi parameters, the EDA approach is able to use any optimization objective, regardless of its complexity or the non-existence of an explicit formula for its expression. In this respect, our framework could find parameters that maximize the AUC objective (a difficult problem [41]), or it could be used to search for the parameters of any regularized logistic regression. The inclusion of interaction terms among (possibly coregulated) genes in ηj of expression (1) is another feasible future direction to explore.
Acknowledgments The authors are grateful to the referees for their constructive comments. Work partially supported by the Spanish Ministry of Education and Science, projects TIN2007-62626, TIN2007-67148 and TIN2008-06815-C02 and Consolider Ingenio 2010-CSD200700018 and by the National Institutes of Health (USA), project 1 R01 LM009520-01.
References 1. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. Machine learning in bioinformatics. Briefings in Bioinformatics 2006; 17 (1): 86–112. 2. Dugas M, Weninger F, Merk S, Kohlmann A, Haferlach T. A generic concept for large-scale microarray analysis dedicated to medical diagnostics. Methods Inf Med 2006; 45 (2): 146–152. 3. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd edn. New York: J. Wiley and Sons; 2000. 4. Thisted RA. Elements of Statistical Computing. New York: Chapman and Hall; 1988. 5. Markowetz F, Spang R. Molecular diagnosis classification, model selection and performance evaluation. Methods Inf Med 2005; 44 (3): 438–443. 6. Weber G, Vinterbo S, Ohno-Machado L. Multivariate selection of genetic markers in diagnostic classification. Artif Intell Med 2004; 31: 155–167. 7. Heckerling PS, Gerber BS, Tape TG, Wigton R. Selection of predictor variables for pneumonia using neural networks and genetic algorithms. Methods Inf Med 2005; 44 (1): 89–97. 8. Lee A, Silvapulle M. Ridge estimation in logistic regression. Comm Statist Simulation Comput 1988; 17: 1231–1257.
9. Lozano JA, Larrañaga P, Inza I, Bengoetxea E (eds). Towards a New Evolutionary Computation. Advances in Estimation of Distribution Algorithms. New York: Springer; 2006. 10. Minka T. A comparison of numerical optimizers for logistic regression. Tech Rep 758, Carnegie Mellon University; 2003. 11. Keerthi SS, Duan KB, Shevade SK, Poo AN. A fast dual algorithm for kernel logistic regression. Mach Learning 2005; 61: 151–165. 12. Eilers P, Boer J, van Ommen G, van Houwelingen H. Classification of microarray data with penalized logistic regression. In: Proc of SPIE. Progress in Biomedical Optics and Images, 2001. Volume 4266 (2): 187–198. 13. Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics 2004; 5: 427–443. 14. Shen L, Tan EC. Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE Trans Comput Biol Bioinformatics 2005; 2: 166–175. 15. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learning 2002; 46: 389–422. 16. Shevade SK, Keerthi SS. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003; 19: 2246–2253. 17. Cawley GC, Talbot N. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics 2006; 22: 2348–2355. 18. Koh K, Kim SY, Boyd S. An interior-point method for large-scale L1-regularized logistic regression. J Mach Learn Res 2007; 8: 1519–1555. 19. Krishnapuram B, Carin L, Figueiredo M, Hartemink A. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Trans Pattern Anal Mach Intell 2005; 27: 957–968. 20. Robles V, Bielza C, Larrañaga P, González S, OhnoMachado L. Optimizing logistic regression coefficients for discrimination and calibration using estimation of distribution algorithms. TOP 2008; 16: 345–366. 21. Larrañaga P, Etxeberria R, Lozano JA, Peña JM. Optimization in continuous domains by learning and simulation of Gaussian networks. In: Workshop in Optimization by Building and Using Probabilistic Models. Genetic and Evolutionary Computation Conference, GECCO 2000. pp 201–204. 22. González C, Lozano JA, Larrañaga P. Mathematical modelling of UMDAc algorithm with tournament selection. Behaviour on linear and quadratic functions. Internat J Approx Reason 2002; 31: 313–340. 23. Shachter R, Kenley C. Gaussian influence diagrams. Manag Sci 1989; 35: 527–550. 24. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002; 97: 77–87. 25. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001; 98 (20): 11462–11467. 26. Inza I, Larrañaga P, Blanco R, Cerrolaza A. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 2004; 31: 91–103.
27. Braga-Neto UM, Dougherty ER. Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004; 20: 374–380. 28. Alon U et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide microarrays. Proc Natl Acad Sci USA 1999; 96: 6745–6750. 29. Golub TR et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1996; 286: 531–537. 30. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1: 203–209. 31. Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics 2005; 21: 1104–1111. 32. Rohde M, Daugaard M, Jensen MH, Helin K, Nylandsted J, Marja Jaattela M. Members of the heat-shock protein 70 family promote cancer cell growth by distinct mechanisms. Genes Dev 2005; 19: 570–582. 33. Chiappetta G, Botti G, Monaco M, Pasquinelli R, Pentimalli F, Di Bonito M, D’Aiuto G, Fedele M, Iuliano R, Palmieri EA, Pierantoni GM, Giancotti V, Fusco A. HMGA1 protein overexpression in human breast carcinomas: Correlation with ErbB2 expression. Clin Cancer Res 2004; 10: 7637–7644. 34. Sisci D, Morelli C, Garofalo C, Romeo F, Morabito L, Casaburi F, Middea E, Cascio S, Brunelli E, Ando S, Surmacz E. Expression of nuclear insulin receptor substrate 1 in breast cancer. J Clin Pathol 2007; 60: 633–641. 35. Turner GA, Ellis RD, Guthrie D, Latner AL, Monaghan JM, Ross WM, Skillen AW, Wilson RG. Urine cyclic nucleotide concentrations in cancer and other conditions; cyclic GMP: A potential marker for cancer treatment. J Clin Pathol 2004; 35 (8): 800–806. 36. Abba MC, Drake JA, Hawkins KA, Hu Y, Sun H, Notcovich C, Gaddis S, Sahin A, Baggerly K, Aldaz CM. Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression. Breast Cancer Res 2004; 6: 499–513. 37. Liu Z, Jiang F, Tian G, Wang S, Sato F, Meltzer SJ, Tan M. Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology 2007; 6: Article 6. 38. Furlanello C, Serafini M, Merler S, Jurman G. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinform 2003; 4: 54. 39. Gardina PJ. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 2006; 7: 325. 40. Lin YM, Furukawa Y, Tsunoda T, Yue CT, Yang KC, Nakamura Y. Molecular diagnosis of colorectal tumors by expression profiles of 50 genes expressed differentially in adenomas and carcinomas. Oncogene 2002; 21: 4120–4128. 41. Ma S, Huang J. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 2005; 21: 4356–4362.
Original Articles
Learning Susceptibility of a Pathogen to Antibiotics Using Data from Similar Pathogens
S. Andreassen1; A. Zalounina1; L. Leibovici2; M. Paul2
1Center for Model-based Medical Decision Support, Aalborg University, Aalborg, Denmark; 2Department of Medicine E, Rabin Medical Center, Beilinson Hospital, Petah-Tiqva, Israel
Keywords Antimicrobial susceptibility, Dirichlet estimator, Brier score, cross-validation
Summary
Objectives: Selection of empirical antibiotic therapy relies on knowledge of the in vitro susceptibilities of potential pathogens to antibiotics. In this paper the limitations of this knowledge are outlined and a method that can reduce some of the problems is developed.
Methods: We propose hierarchical Dirichlet learning for estimation of pathogen susceptibilities to antibiotics, using data from a group of similar pathogens in a bacteremia database.
Results: A threefold cross-validation showed that maximum likelihood (ML) estimates of susceptibilities based on individual pathogens gave a distance between estimates obtained from the training set and observed frequencies in the validation set of 16.3%. Estimates based on the initial grouping of pathogens gave a distance of 16.7%. Dirichlet learning gave a distance of 15.6%. Inspection of the pathogen groups led to subdivision of three groups, Citrobacter, Other Gram Negatives and Acinetobacter, out of 26 groups. Estimates based on the subdivided groups gave a distance of 15.4% and Dirichlet learning further reduced this to 15.0%. The optimal size of the imaginary sample inherited from the group was 3.
Conclusion: Dirichlet learning improved estimates of susceptibilities relative to ML estimators based on individual pathogens and to classical grouped estimators. The initial pathogen grouping was well founded and improvement by subdivision of the groups was only obtained in three groups. Dirichlet learning was robust to these revisions of the grouping, giving improved estimates in both cases, while the group-based estimates only gave improved estimates after the revision of the groups.

Methods Inf Med 2009; 48: 242–247
doi: 10.3414/ME9226
prepublished: April 20, 2009

Correspondence to: Alina Zalounina, Center for Model-based Medical Decision Support, Aalborg University, Fredrik Bajers Vej 7, 9220 Aalborg, Denmark. E-mail: [email protected]

1. Introduction

Antibiotic treatment of bacteremia relies on knowledge of the susceptibility of the infecting pathogen to antibiotics. At the onset of infection this information is typically not available and it must be assessed from prior knowledge of the probability that a given pathogen is susceptible to a given antibiotic. In practice there are limits to how well these probabilities can be known.
Susceptibilities of bacteria to antibiotics differ between hospitals and estimation of susceptibilities from databases of in vitro susceptibilities must therefore be based on local data. Even for a department of microbiology serving a large hospital or several smaller hospitals, the number of positive blood cultures, i.e. the number of times bacteria can be isolated from the blood, is unlikely to be much greater than about 1000 per year. This effectively limits the size of local databases because
susceptibilities change over time. If, for the purpose of this discussion, we assume that data older than three years should be used with caution, then the effective upper limit on the size of the database is about 3000 bacterial isolates, distributed over about 100 pathogens. This is further aggravated because the susceptibilities for community-acquired and hospital-acquired infections are substantially different and therefore must be estimated separately. It is difficult to set a threshold for how large the sample should be to make the classical maximum likelihood (ML) estimate useful. If we consider a pathogen that has an estimated susceptibility of 70% to an antibiotic, then the standard deviation (SD), calculated based on the binomial distribution, of that estimate is 9% for a sample size N = 25 and 5% for N = 100. So it is probably safe to conclude that the lower limit for useful estimates is somewhere between N = 25 and N = 100. This obviously leaves a large fraction of the pathogens without useful ML estimates. The simplest solution is to group the pathogens, assuming that all pathogens within a group have identical susceptibilities. This is a fairly strong assumption, and this paper will explore if it is possible to find estimates of susceptibility that represent a middle ground between the two extremes mentioned above, either using estimates based on a single pathogen or using estimates based on a whole group of pathogens. Technically, the method will be based on hierarchical Dirichlet learning [1–4], which allows a systematic approach to strengthening sparse data with educated guesses. For example, for Proteus spp., which is one of seven members of the “Proteus group” of pathogens (see Table 2), an educated guess, in the absence of enough data, would be to assume that it resembles other members of the Proteus group in terms of susceptibility. Dirichlet learning then provides a mechanism which allows the susceptibility estimates for Proteus spp. to deviate
from the susceptibilities of other bacteria belonging to the Proteus group, if and when data on the actual susceptibility of Proteus spp. to this antibiotic becomes available. The potential benefit of this idea will be evaluated by applying the proposed method to a bacteremia database and it will be assessed whether our method improves the estimate, relative to the ML estimate for single pathogens and the grouped estimate.
2. Materials and Methods
Table 1 A fragment of the bacteremia database showing 4 out of the 1556 isolates from hospital-acquired infections. Amongst other information, the database contains attributes (columns) specifying the name of the pathogen and the in vitro susceptibility (S = susceptible, R = resistant) to a total of 36 antibiotics, out of which only 3 are shown here.

Pathogen           1. tobramycin   2. piperacillin   3. gentamycin   …
…                  …               …                 …               …
Proteus spp.       R               S                 S               …
Proteus spp.       S               S                 S               …
Proteus spp.       S               S                 S               …
Proteus vulgaris   S               S                 S               …
…                  …               …                 …               …
2.1 The Bacteremia Database and ML Estimates

Prior probabilities used in the model were based on a bacteremia database collected at Rabin Medical Center, Beilinson Campus, in Israel during 2002–2004. The bacteremia database included 3350 patient- and episode-unique clinically significant isolates from blood cultures. We shall restrict our attention to the 1556 isolates from adults with hospital-acquired infections. These isolates were obtained from 76 different pathogens and each isolate was on average tested in vitro for susceptibility to 21 antibiotics (range 1–31) out of a total of 36 antibiotics. A fragment from the bacteremia database is shown in Table 1. The bacteremia database provides the counts of susceptibilities (Mij) and the number of isolates tested (Nij) belonging to each pathogen for a range of antibiotics. The index i identifies the antibiotic and j identifies the pathogen. For example, Table 2 shows the counts of susceptibility (M1j) and the number of isolates tested (N1j) for the antibiotic tobramycin (i = 1) and seven pathogens belonging to the Proteus group (j = 1, …, 7). Using these counts, ML estimates of susceptibility (MLij) were calculated. For example, the ML estimate for the susceptibility of Proteus spp. (j = 1) to tobramycin and its SD were obtained as ML11 = M11/N11 = 2/3 = 0.67 and SD = √(ML11(1 − ML11)/N11) = 0.27.
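A few lines (a sketch, not part of the paper's software) reproduce the figures just quoted: the ML estimate is the observed proportion of susceptible isolates, and its SD follows from the binomial distribution.

```python
from math import sqrt

M, N = 2, 3                      # susceptible isolates / isolates tested (Proteus spp., tobramycin)
ml = M / N                       # ML estimate of susceptibility
sd = sqrt(ml * (1 - ml) / N)     # binomial standard deviation of the estimate
print(f"ML = {ml:.2f}, SD = {sd:.2f}")   # ML = 0.67, SD = 0.27
```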
2.2 Hierarchical Dirichlet Learning over Groups of Pathogens

Dirichlet learning is a Bayesian approach for estimation of the parameters in binomial (or multinomial) distributions. In this paper it will be assumed that a priori estimates of the parameters of the binomial distribution for susceptibility can be guessed from the susceptibilities averaged over pathogens that are assumed to be similar. It is assumed that the a priori distribution of the parameters follows the conjugate prior of the binomial distribution, which is the Beta distribution (or the Dirichlet distribution for the multinomial distribution). In the TREAT project a decision support system for advice on antibiotic treatment has been constructed [5]. As part of this construction, 40 such groups of pathogens with similar susceptibility properties have been identified by clinicians based on clinical knowledge. In Table 3 the 76 different pathogens from the bacteremia database have been allocated to 26 of these groups. Assume that a group of n similar pathogens has been identified, the pathogens being indexed by j ∈ {1, …, n}. On a number of occasions the susceptibility of these pathogens to a certain antibiotic (indexed by i) has been tested, Ni1, …, Nin times respectively, with the counts
of susceptibility being Mi1, …, Min, respectively. The average susceptibility Pi of this group is:

Pi = (Mi1 + … + Min) / Ni , where Ni = Ni1 + … + Nin .     (1)
The ML estimator of susceptibility of a pathogen, MLij = Mij/Nij, is now replaced by the Dirichlet estimator:

Pij = (βi + Mij)/(αi + Nij),     (2)

where βi and αi are imaginary counts, with βi = αi × Pi representing positive outcomes in the binomial distribution and αi representing the imaginary sample size inherited from the pathogen group. Thus, αi indicates how strong the confidence is in the a priori distribution of the parameters, and βi/αi can be used as the a priori estimate of the parameter of the binomial distribution, i.e. as an estimate of the susceptibility averaged over the pathogen group. We let all αi assume the value A, except that we impose an upper limit on each αi:

αi = min (A, Ni),     (3)

since it is not reasonable to let the imaginary sample size αi exceed the number of counts Ni actually available for the group. If A = 0, then the Dirichlet estimate becomes equal to the ML estimate. If A → ∞, then the Dirichlet estimate becomes equal to the grouped estimate Pi. In the next section it will be shown that a “suitable” value for A can be determined empirically.
Table 2 The counts of susceptibility, the number of isolates tested, the ML estimates and the Dirichlet estimators of susceptibility to tobramycin (i = 1) for seven pathogens belonging to the Proteus group

Pathogen                 j    M1j   N1j   ML1j   P1j
Proteus spp.             1     2     3    0.67   0.7
Proteus mirabilis        2    39    49    0.80   0.79
Proteus vulgaris         3     1     1    1      0.78
Proteus penneri          4     2     2    1      0.82
Morganella morganii      5    19    20    0.95   0.91
Providencia spp.         6     4    10    0.4    0.49
Providencia stuartii     7     5    14    0.36   0.44
Sum of Proteus group          72    99    0.73   0.73
Table 3 Allocation of pathogens to the groups. Subgroups of the Acinetobacter, Citrobacter and Other Gram-negative pathogen groups are placed in boxes.

1. Acinetobacter: Acinetobacter baumanii, Acinetobacter spp., Acinetobacter johnsoni, Acinetobacter junii, Acinetobacter lwoffi
2. Campylobacter: Campylobacter spp.
3. Candida: Candida tropicalis
4. Citrobacter: Citrobacter diversus, Citrobacter koserii, Citrobacter freundii, Citrobacter spp.
5. Enterobacter: Enterobacter aerogenes, Enterobacter cloacae, Enterobacter gergoviae, Enterobacter sakazakii, Enterobacter spp.
6. Enterococcus: Enterococcus avium, Enterococcus durans, Enterococcus faecalis, Enterococcus faecium, Enterococcus spp.
7. Eschericia coli: Eschericia coli
8. Gram Negative Anaerobe pathogen: Fusobacterium
9. Gram Positive Anaerobe pathogen: Peptostreptococcus
10. Gram Positive Rod pathogen: Bacillus spp., Corynebacterium aquaticum
11. Klebsiella: Klebsiella oxytoca, Klebsiella pneumoniae, Klebsiella spp.
12. Listeria: Listeria monocytogenes
13. Moraxella: Moraxella, Moraxella lacunata
14. Other Gram Negative pathogen: Alcaligenes xylosoxidans, Methylobacterium mesophilicum, Stenotrophomonas maltophila, Brevundimonas vesicularis, Chryseobacter meningosept., Sphingomonas paucimobilis, Serratia fanticola, Serratia marcescens, Serratia spp.
15. Other Gram Positive: Gemella spp.
16. Pneumococcus: Streptococcus pneumoniae
17. Proteus: Proteus spp., Proteus mirabilis, Proteus vulgaris, Proteus penneri, Morganella morganii, Providencia spp., Providencia stuartii
18. Pseudomonas: Pseudomonas aeruginosa, Pseudomonas alcaligenes, Pseudomonas cepacia, Pseudomonas fluorescens, Pseudomonas mendocida, Pseudomonas putida, Pseudomonas stutzerii, Pseudomonas spp.
19. Salmonella non-typhi: Salmonella enteritidis, Salmonella Group C
20. Staphylococcus negative: Staphylococcus coagulase-negative, Staphylococcus epidermidis
21. Staphylococcus positive: Staphylococcus coagulase-positive
22. Streptococcus Group A: Streptococcus Group A
23. Streptococcus Group B: Streptococcus Group B
24. Streptococcus Group D: Streptococcus Bovis, Streptococcus Bovis I, Streptococcus Bovis II
25. Streptococcus viridans: Streptococcus acidominimus, Streptococcus mitis, Streptococcus oralis, Streptococcus salivarius, Streptococcus viridans
26. Streptococcus: Streptococcus constellatus, Streptococcus Group F, Streptococcus Group G

2.3 Evaluation of the Quality of the Estimates

To evaluate the quality of the estimates a threefold cross-validation procedure is applied. The three years of data are divided into three periods, each containing data from one year. In turn, one of the three periods is designated as the validation set and the other two periods are designated as the training set and used for calculation of the estimators. We wish to evaluate how well the Dirichlet estimator Pij, calculated from the training set, predicts Fij, the observed frequency of susceptibility, calculated from the validation set. Fij is calculated as Fij = Mij/Nij. For this purpose we define the distance measure:

Dist = √( Σij (Nij/N) (Pij − Fij)² ), where N = Σij Nij .     (4)
This distance measure calculates the square distance between Pij and Fij, weighted by the relative frequency of the pathogen. It can be interpreted as the average distance between the estimate derived from the learning set and the observed frequency in the validation set. It is a modified version of the Brier score [6], and algebraically it is easy to prove that any set of estimated Pij that minimizes Dist also minimizes the Brier score. The procedure followed in the threefold cross-validation described above is graphically illustrated in Figure 1. Dist measures the average distance between the Dirichlet estimator from the training set and the observed frequency in the validation set. Since Pij is a function of A (see Eqs. 2–4), Dist is also a function of A. The value of A which minimizes Dist is the optimal size of the imaginary sample to be inherited from a pathogen group by the individual pathogens.
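As an illustration only, the following sketch computes Dist for a handful of (antibiotic, pathogen) pairs under the weighted root-mean-square reading of Equation 4 reconstructed above; both that reading and the numbers used are our assumptions, not figures from the bacteremia database.

```python
import numpy as np

def dist(P, F, N_val):
    """Weighted RMS distance between training-set estimates P and validation-set frequencies F.

    P, F and N_val are arrays over (antibiotic, pathogen) pairs; N_val holds the
    validation-set sample sizes used as weights, as in Equation 4 above.
    """
    w = N_val / N_val.sum()
    return np.sqrt(np.sum(w * (P - F) ** 2))

# Made-up example: estimates from the training set vs. frequencies in the validation set.
P = np.array([0.70, 0.79, 0.91, 0.49])
F = np.array([0.60, 0.85, 0.95, 0.40])
N_val = np.array([5, 30, 12, 8])
print(f"Dist = {dist(P, F, N_val):.1%}")
```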
3. Results

3.1 Comparison of ML Estimates for Individual Pathogens and Grouped Estimates

Only 10 of the pathogens in the bacteremia database (13% of the 76) have been isolated more than 50 times. The counts available for estimation of susceptibility are even smaller, because susceptibility is only tested for a selection of antibiotics. This indicates that the ML estimates of susceptibility for most combinations of pathogens and antibiotics in this database are too uncertain to be useful. When averaged over all pathogens and antibiotics, the distance between the estimated susceptibilities based on individual pathogens and the observed frequency was 16.3% (Dist = 16.3%). If the estimates based on individual pathogens were replaced by estimates based on the groups of pathogens given in Table 3, the distance between the estimators and the observed frequencies rose to 16.7%.
3.2 Hierarchical Dirichlet Learning over Groups of Pathogens

We shall now explore hierarchical Dirichlet learning over groups of pathogens, and we initially consider the Proteus group mentioned above, which has seven members (see Table 2). To illustrate Dirichlet learning of the susceptibility of a single pathogen to a single antibiotic, let us consider learning the susceptibility of Proteus spp. to tobramycin using susceptibility data available for other members of the Proteus group. (The procedure can be applied to any of the seven pathogens in the Proteus group.) First we assume a value for A, e.g. A = 4. This gives α1 = min (4, 99) = 4, because for the Proteus group N1 = 99 (see Table 2). The average susceptibility of the group is P1 = 72/99 = 0.728. Next we calculate β1 = α1 × P1 = 4 × 0.728 = 2.91. Finally we can calculate the Dirichlet estimator as P11 = (β1 + M11)/(α1 + N11) = (2.91 + 2)/(4 + 3) = 0.70. This result, along with the ML estimator and the Dirichlet estimators for the remaining members of the Proteus group, is shown in Table 2, assuming that A = 4.
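The worked example can be reproduced in a few lines (our sketch, not the authors' implementation), using the tobramycin counts of Table 2 for the Proteus group and the assumed imaginary sample size A = 4.

```python
import numpy as np

M = np.array([2, 39, 1, 2, 19, 4, 5])     # susceptible isolates per pathogen (Table 2)
N = np.array([3, 49, 1, 2, 20, 10, 14])   # isolates tested per pathogen
A = 4                                     # assumed imaginary sample size

P_group = M.sum() / N.sum()               # average susceptibility of the group, P1 = 0.728
alpha = min(A, N.sum())                   # Equation 3
beta = alpha * P_group                    # imaginary positive counts, beta1 = 2.91
P_dirichlet = (beta + M) / (alpha + N)    # Equation 2, one estimate per pathogen

print(np.round(P_dirichlet, 2))           # first entry 0.70 for Proteus spp., as in Table 2
```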
Fig. 1 The procedure followed in the threefold cross-validation
An optimal value for A can be determined empirically by minimizing the distance in Equation 4. We have applied the distance measure for tobramycin across the Proteus group (the summation in Eq. 4 was performed across one antibiotic and the seven pathogens in the Proteus group). It was found that the distance reaches its minimum (20.2%) at A = 4 (Fig. 2a), which is therefore the optimal imaginary sample size to be used for calculation of the Dirichlet estimator. Note that the maximum value of Dist (25.8%) is observed at A = 0 and corresponds to the distance achieved by the ML estimator. The distance corresponding to the grouped estimator is observed at A → ∞ and is equal to 25.2%. Next we apply the same method to the Proteus group of pathogens, but averaged over all antibiotics. The result is given in Figure 2b, where it can be seen that for the
Proteus group the susceptibility estimates based on individual pathogens give Dist = 22.4% (the value of Dist for A = 0). The estimates based on the entire Proteus group give Dist = 22.7% (the value of Dist for A → ∞). The lowest value, Dist = 20.9%, is obtained for A = 2. Finally, the method is applied to all pathogen groups across all antibiotics. As mentioned above, the individual and group-based estimates give Dist values of 16.3% and 16.7%, respectively, and from Figure 3a (the full curve) it can be seen that the smallest value, Dist = 15.6%, is obtained for A = 1.
3.3 Revision of Groups of Pathogens

The value of the groups and of the group-based Dirichlet estimates depends on the quality of the groups. We therefore explored
Fig. 2 The distance measure Dist as a function of A for the Proteus group and a) tobramycin; b) all antibiotics. The filled circles represent the distances corresponding to A → ∞.
Fig. 3 The results of the Dirichlet learning applied to all pathogens in the database
whether dividing some of the pathogen groups into smaller subgroups might improve the estimates. Out of the 26 groups represented in the database 16 groups were considered not eligible for subdivision, either because the group consisted of a single pathogen (n = 1) or because the number of isolates in the group was very small (Ni < 10). The remaining 10 groups were divided into three categories, depending on whether the optimal value of A for each group was 0 < A < ∞, A → ∞ or A = 0. These three categories are considered in more detail in the following.
1 ≤ n ≤ N, 1 ≤ m ≤ Mn, where N is the number of patients. Mn is the number of values of Conceptc for Patientn; Tsn,m and Ten,m are the start and end times of the m-th observation (or temporal abstraction) for Patientn, with value valuec,n,m. The delegate value for Patientn of Conceptc within a specific aggregation time period [Tsagg, Teagg] is computed by the concept-specific delegate function DFconcept from the input_data as follows:

delegate_valuec,n,Tsagg,Teagg = DF[(Tsn,1, Ten,1, valuec,n,1), … (Tsn,i, Ten,i, valuec,n,i), … (Tsn,K, Ten,K, valuec,n,K)], Tsagg ≤ Tsn,i, Ten,i ≤ Teagg, 1 ≤ i ≤ K = Kc,n,Tsagg,Teagg,

where K = Kc,n,Tsagg,Teagg is the number of instances of Conceptc for Patientn measured within the [Tsagg, Teagg] period. That is, K varies with each concept, patient, and time period. The delegate function of each concept is defined in the knowledge base, or is chosen at runtime by the user from several predefined
default functions. For example, assume that the results of a patient's three blood glucose (BGL) observations on January 1 revealed the following values: 92 mg/dl at 5 a.m., 140 mg/dl at 11 a.m. and 182 mg/dl at 8 p.m. If maximum is the default daily delegate function for BGL, then the patient had a daily delegate value of 182 mg/dl for BGL. However, the user can choose another suitable delegate function (such as the mode or the mean). Indeed, for a granularity of months, it might be preferable to use the mean as a delegate function (applied to all raw data within each month). In the case of interval-based temporal abstractions, such as intervals of different grades of bone-marrow toxicity, we provide additional delegate functions, such as the value of the abstraction that has the maximal cumulative duration during the relevant time period. Indeed, in theory, almost any function from multiple values into one value (with the same units) can serve the role of a delegate function. However, it must be applied to each time interval in the relevant temporal granularity (e.g., day), and, of course, must make clinical sense, and is thus specific to each clinical concept and medical context.
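A minimal sketch of the daily delegate-value computation for the blood-glucose example above (our illustration; in VISITORS the concept-specific delegate functions come from the knowledge base): observations whose interval falls inside the aggregation period are reduced by the chosen delegate function, here the maximum.

```python
from datetime import datetime

# (start, end, value) triples for one patient's BGL observations, as in the example above.
observations = [
    (datetime(2009, 1, 1, 5, 0),  datetime(2009, 1, 1, 5, 0),  92),
    (datetime(2009, 1, 1, 11, 0), datetime(2009, 1, 1, 11, 0), 140),
    (datetime(2009, 1, 1, 20, 0), datetime(2009, 1, 1, 20, 0), 182),
]

def delegate_value(observations, ts_agg, te_agg, delegate_fn=max):
    """Apply the concept-specific delegate function to values inside [ts_agg, te_agg]."""
    values = [v for ts, te, v in observations if ts_agg <= ts and te <= te_agg]
    return delegate_fn(values) if values else None

day = (datetime(2009, 1, 1), datetime(2009, 1, 1, 23, 59))
print(delegate_value(observations, *day))                                        # 182 (daily maximum)
print(delegate_value(observations, *day, delegate_fn=lambda v: sum(v) / len(v)))  # mean instead
```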
3. Temporal Association Charts

The core of a Temporal Association Chart (TAC) is an ordered list of raw and/or abstract domain concepts (e.g., platelet state, hemoglobin value, WBC count), in a particular order determined by the user. Each concept is measured (or computed, in the case of a temporal abstraction) for a particular patient group during a concept-specific time period. The period can be different for each concept. Between every pair of consecutive concepts in the list, a set of relations amongst the delegate values, for each patient, of these neighboring concepts will be computed. If one of the concepts is raw, each relation will be between a delegate value of the first concept and a delegate value of the second concept for each patient. If both concepts are abstract, the relations between the delegate values for all patients will be aggregated into a set of extended relations – temporal association rules, one rule per each combination of values from both concepts. Each
rule represents the set of patients who have had this particular combination of values for the two abstract concepts. TACs are created by the user in two steps. First, the user selects two or more concepts, using an appropriate interface (not shown here), possibly changing the order as necessary; second, the user selects the group of patients (e.g., from a list of groups retrieved earlier by Select Patients queries). In the current version, the VISITORS system does not recommend which concepts to select, nor the time periods in which to examine them, nor the group of patients. However, as we explain in the Discussion, we intend to combine the VISITORS system with purely computational tools (that we have been developing) for the detection of sufficiently common temporal associations.
3.1 Temporal Association Templates
A Temporal Association Template (TAT) is an ordered list of time-oriented concepts (TOCs) (|TOCs| ≥ 2), where each TOC denotes a combination of a raw or derived domain concept (such as a hemoglobin value or a bone-marrow-toxicity grade) and a time interval <t_start, t_end>. Each TAT is thus a list L_TOCs of TOCs, that is:
L_TOCs = <TOC_1, ..., TOC_i, ..., TOC_I>, ∀i, 1 ≤ i ≤ I, TOC_i ≡ <C_i, t_i^start, t_i^end>,
where C_i ∈ C (the set of domain concepts). The time stamps t_i^start and t_i^end define the time interval of concept C_i (using either absolute or relative time stamps). I is the number of concepts in the TAT. A concept can appear more than once in the TAT, but only within different time intervals. An example of a TAT listing the hemoglobin-state and WBC-state abstract (derived) concepts and the platelet-count raw-data concept, each with its respective time interval, would be <<Hemoglobin-state, t_1^start, t_1^end>, <WBC-state, t_2^start, t_2^end>, <Platelet-count, t_3^start, t_3^end>>. Note that once a TAT is defined, it can be applied to different patient groups. At runtime, a relation will be created between each pair TOC_i and TOC_{i+1}, for each patient, such that the delegate value of
concept C_i for that patient during [t_i^start, t_i^end] is a value val_i, and the delegate value of concept C_{i+1} for that patient during [t_{i+1}^start, t_{i+1}^end] is val_{i+1}.
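As an aid to reading the notation, the short Python sketch below (illustrative only; the class and field names are ours, not those of the VISITORS implementation) captures the structure of a TOC and a TAT.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TOC:
    """Time-oriented concept: a domain concept plus the interval in which to examine it."""
    concept: str       # raw or derived concept, e.g. "Hemoglobin-state"
    t_start: datetime  # start of the (absolute or relative) time interval
    t_end: datetime    # end of the time interval

@dataclass
class TAT:
    """Temporal association template: an ordered list of at least two TOCs."""
    tocs: list[TOC]

    def __post_init__(self):
        if len(self.tocs) < 2:
            raise ValueError("a TAT must contain at least two TOCs")

# A TAT over three concepts, each with its own (here identical) time interval.
tat = TAT([
    TOC("Hemoglobin-state", datetime(1995, 1, 1), datetime(1995, 12, 31)),
    TOC("WBC-count",        datetime(1995, 1, 1), datetime(1995, 12, 31)),
    TOC("Platelet-state",   datetime(1995, 1, 1), datetime(1995, 12, 31)),
])
```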
3.2 Application of a Temporal Association Chart Template to the Set of Patient Records
When applying a TAT to a set P of patient records that includes N patients, we get a Temporal Association Chart (TAC). A TAC is a list of instantiated TOCs and association relations (ARs), in which each instantiated TOC is composed of the original TOC of the TAT upon which it is based, and the patient-specific delegate values for that TOC within its respective time interval, based on the actual values of the records in P. Each set of associations denotes the associations between a pair of consecutive instantiated TOCs <TOC*_i, TOC*_{i+1}>, 1 ≤ i < I. To be included in a TAC, a patient P_n (1 ≤ n ≤ N) must have at least one value for each TOC of the TAT defining the TAC. The group of such patients is the relevant group (relevant patients). In the resulting TAC, each instantiated TOC*_i includes the original TAT TOC_i and the set of delegate values (one delegate value for each patient) of the concept C_i, computed by applying the delegate function appropriate to C_i to the patient data included within the respective time interval [t_i^start, t_i^end] as defined in the TAT:
TOC*_i ≡ <TOC_i, {<P_n, val_i^n>}, [Dist]>, 1 ≤ n ≤ N, 1 ≤ i ≤ I,
where val_i^n is the delegate value of C_i within the period [t_i^start, t_i^end] for patient P_n, N is the number of patients in the relevant group, and I is the number of concepts in the TAT. Dist is an optional distribution data structure {<val_i^l, Prop_i^l>}, where val_i^l is the l-th value of concept C_i, and Prop_i^l is its proportion within the group of patients P. The optional Dist structure is useful only for abstract concepts, and supports the visualization of the relative proportion (i.e., distribution) of all the values of C_i for the N relevant patients within the time interval of the instantiated TOC*_i (see Section 4).
Given a pair of instantiated TOCs <TOC*_i, TOC*_{i+1}>, the set of association relations (ARs) between them is:
AR ≡ {<P_n, val_i^n, val_{i+1}^n>}, 1 ≤ n ≤ N, 1 ≤ i < I,
where N is the number of relevant patients, and I is the number of concepts in the TAC. When at least one of the concepts is raw, the number of ARs between each pair of TOCs is equal to the number of relevant patients. Each AR connects the delegate values val_i^n and val_{i+1}^n of the pair of concepts C_i and C_{i+1}, during the relevant period of each concept, for one specific patient P_n. In the case of an abstract-abstract concept pair, we aggregate the ARs between two consecutive TOCs into groups, where each group includes a set of identical pairs of delegate values (one value for each concept). Each such group denotes a temporal association rule (TAR) and includes:
● Support: the proportion of patients who have the combination of delegate values <val_{i,j}^n, val_{i+1,k}^n>, 1 ≤ j ≤ J, 1 ≤ k ≤ K, where val_{i,j}^n and val_{i+1,k}^n are the j-th and k-th allowed values of C_i and C_{i+1}, respectively. J and K are the numbers of different values of the concepts C_i and C_{i+1}. (We assume a finite number of (symbolic) values for each abstract concept.)
● Confidence: the fraction of patients for whom, given that the delegate value of the concept C_i for patient P_n is val_{i,j}^n, the delegate value of concept C_{i+1} is val_{i+1,k}^n, i.e., P[val_{i+1,k}^n | val_{i,j}^n].
● Actual number of patients: the number of patients who have this combination of values.
The number of possible TARs between two consecutive TOCs is thus J × K. The support and confidence measures are calculated as follows:
support(val_{i,j}^n, val_{i+1,k}^n) ≡ |{P_{i+1,k}^{i,j}}| / N
confidence(val_{i,j}^n, val_{i+1,k}^n) ≡ |{P_{i+1,k}^{i,j}}| / M,
1 ≤ n ≤ N, 1 ≤ i < I, 1 ≤ j ≤ J, 1 ≤ k ≤ K,
where |{P_{i+1,k}^{i,j}}| is the number of patients whose delegate value for concept C_i was val_{i,j}^n and whose delegate value for concept C_{i+1} was val_{i+1,k}^n, N is the number of relevant patients,
M is the number of patients whose delegate value for concept C_i was val_{i,j}^n, I is the number of concepts in the TAC, and J and K are the numbers of symbolic values of concepts C_i and C_{i+1}, respectively.
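Assuming the delegate values for two consecutive abstract TOCs have already been computed per patient, the following Python sketch (ours, for illustration; it is not the VISITORS code) aggregates the patient-level association relations into temporal association rules with the support and confidence measures defined above.

```python
from collections import Counter

def temporal_association_rules(delegates_i, delegates_i1):
    """delegates_i / delegates_i1: dict patient_id -> delegate value for TOC_i / TOC_{i+1}."""
    # Relevant patients must have a delegate value for both TOCs.
    relevant = sorted(set(delegates_i) & set(delegates_i1))
    n = len(relevant)
    pair_counts = Counter((delegates_i[p], delegates_i1[p]) for p in relevant)
    left_counts = Counter(delegates_i[p] for p in relevant)
    rules = []
    for (v_i, v_i1), count in pair_counts.items():
        rules.append({
            "values": (v_i, v_i1),
            "support": count / n,                    # fraction of relevant patients
            "confidence": count / left_counts[v_i],  # P[val_{i+1} | val_i]
            "patients": count,
        })
    return rules

# Toy example with three patients.
platelet = {"p1": "low", "p2": "low", "p3": "normal"}
hgb = {"p1": "moderately low", "p2": "moderately low", "p3": "normal"}
for rule in temporal_association_rules(platelet, hgb):
    print(rule)
```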
4. Display of TACs and Interactive Data Mining Using TACs
Figure 2 presents an example of a TAC computed by applying a TAT (user-defined on the fly, using another interface, not shown here, that enables the user to select the TAT concepts). The TAT includes three hematological concepts (the platelet-state and hemoglobin (HGB)-state abstractions, and the white blood cell (WBC) count raw concept) and two hepatic concepts (the total bilirubin and Alk-phosphatase state abstraction), applied to a group of 58 patients selected earlier by the user. The visualization in Figure 2 shows the distribution
of the values, using the optional Dist structure (see Section 3.2), for the abstract concepts HGB, Platelet, and Alk-phosphatase states; it also shows each patient's mean values for WBC count and total bilirubin during the year 1995. The delegate values of all adjacent concept pairs for each patient are connected by lines, denoting the ARs. Only 49 patients in this particular group happen to have data for all concepts during 1995. As described above, ARs among values of temporal abstractions provide additional statistical information. For example, an AR's width indicates the support for each combination of values, while its color saturation represents the level of confidence: a deep shade of red signifies high confidence, while pink denotes lower confidence. The support, confidence, and number of patients in each association are displayed numerically on the edge. For example, the widest edge in Figure 2 represents the relation between the "low" value of the platelet state
and the "moderately low" value of the HGB state during the respective time periods: 55.8% of the patients in the relevant patient group had this combination of values during these periods (i.e., support = 0.558); among the patients with "low" platelet-state values, 92.6% exhibited "moderately low" HGB-state values (i.e., confidence = 0.926); and this association held for 25 patients. Note that in this case both time periods are similar. Using direct manipulation [15], the user can dynamically apply a value and time lens in the TACs. Generally, the term direct manipulation denotes a human-computer interaction style that involves continuous representation of the objects of interest, and rapid, reversible, incremental actions and feedback. In our case, direct manipulation enables the user to interactively analyze the time and value associations among multiple patients' data:
● Dynamic application of a value lens enables the user to answer the question "how does constraining the value of one concept
Fig. 2 Visualization of associations among three hematological and two hepatic concepts for 49 patients during the year 1995. Association rules are displayed between the Platelet-state and HGB-state abstract concepts. The confidence and support scales are represented on the left.
during a particular time period affect the association between multiple concepts during that and/or during additional time periods?". The user can either select another range of values for the data of a raw concept using trackbars, or select a subset of the relevant values in the case of an abstract concept. In future versions, we plan to also allow the user to vary the delegate function, to enable additional analyses.
● The system also supports the application of a time lens, by changing the range of the time interval for each instantiated TOC, including ranges on the relative time line. The time lens can be especially useful for clinical research involving longitudinal monitoring.
In addition, the user can change the order of the displayed concepts, export all of the visualized data and associations to an electronic spreadsheet, and add or remove displayed concepts.
5. Evaluation of the Functionality and Usability of Temporal Association Charts
5.1 Research Questions
We envision TACs as potentially useful for two user types: clinicians and medical informaticians. We also envision them using the system to answer different clinically motivated questions, while also using the general exploration operators of the VISITORS system. We therefore defined the following three research questions.
5.1.1 Functionality and Usability
Are clinicians and medical informaticians able to answer clinical questions that require the use of TACs at a high level of accuracy and within a reasonably short time? Furthermore, is the integrated VISITORS/TAC system usable, when assessed using the SUS score [16]?
5.1.2 The Effect of the Clinical Question
Are there significant differences in accuracy and response time when answering different clinical questions that require the use of TACs?
5.1.3 The Effect of the Interaction Mode
Are there significant differences in accuracy or time to answer when answering questions that require only the use of the general VISITORS exploration operators, as opposed to questions that require the use of TACs?
5.2 Measurement Methods and Data Collection
When evaluating a new tool such as TACs, it is difficult to produce a control group. As far as we know, no existing method duplicates the effect of using either VISITORS or TACs. Furthermore, the potential users simply cannot answer the complex questions (for which TACs are designed) other than by laborious computations. Thus, we chose an objectives-based approach [17]. In such an approach, certain reasonable objectives are defined for a new system, and the evaluation strives to demonstrate that these objectives have been achieved. In this case, we strove to prove certain functionality and usability objectives of the TAC module when evaluated within the context of a larger framework (VISITORS) for exploration of the time-oriented data of multiple patients. Our evaluation measures and specific research questions, listed below, reflected these objectives.
The evaluation of the TACs was performed in the oncology domain. Ten participants, five medical informaticians (i.e., information-system engineers who work in the medical domain) and five clinicians with different levels of medical training, were each asked to answer five clinical questions that require the use of TACs (the five questions are listed in the Results section). None of the study participants was a member of the VISITORS development team. The five questions were selected in consultation with oncology domain experts. They represent typical questions that are relevant when monitoring a group of oncology patients, or when analyzing an experimental protocol in oncology. The order of the questions was randomly permuted across participants.
Each evaluation session with a participant started with a 20-minute tutorial that included a brief description of the VISITORS general exploration operators and of the TAC operators. A demonstration was given of the general and TAC operators, showing how several typical clinical questions are answered. The scope of the instruction was predetermined and included (after the demonstration) testing each participant by asking them to answer three clinical questions, one of which required the use of a TAC. When the participant could answer the questions correctly, he or she was considered ready for the evaluation.
The TAC evaluation study was performed as part of an overall feasibility and usability assessment of the VISITORS system. Another part of the evaluation involved testing the feasibility and usability of the general exploration operators by asking clinical questions such as "What were the maximal and mean monthly values of the WBC count during August 1995?". Throughout the evaluation, we used a retrospective database of more than 1000 oncology patients who had had a BMT (bone-marrow transplantation) event. The knowledge source used for the evaluation was an oncology knowledge base specific to the bone-marrow transplantation domain.
Our goals for the objectives-based evaluation of the TACs were manifested in our evaluation measures. Functionality was assessed using two parameters: the time in minutes needed to answer each question, and the accuracy of the resultant answer. The accuracy score assigned to each possible answer (in each case measured on a scale of [0 ... 100], 0 being completely wrong and 100 being completely right) was predetermined by a medical expert. To test the usability of TACs, we used the System Usability Scale (SUS) [16], a common validated method for evaluating interface usability. The SUS is a questionnaire that includes ten predefined questions regarding the effectiveness, efficiency, and satisfaction of an interface. SUS scores range from 0 to 100. Informally, a score higher than 50 is considered to indicate a usable system.
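For reference, the SUS follows a fixed scoring scheme (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The small Python sketch below illustrates this standard computation; it is not code from the evaluation itself.

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten 1-5 Likert responses."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    total = 0
    for item, r in enumerate(responses, start=1):
        total += (r - 1) if item % 2 == 1 else (5 - r)  # odd items: r-1; even items: 5-r
    return total * 2.5  # rescale the 0-40 sum to 0-100

print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```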
5.3 Analysis Methods
5.3.1 Functionality and Usability
The capability of the users to answer clinical questions using the TACs was assessed by calculating the means and standard deviations of the accuracy of the answers and of the response times.
5.3.2 The Effect of the Clinical Question
The effects of the five clinical questions and of the two groups of participants (i.e., medical informaticians and clinicians) on the dependent variables of response time and accuracy of answer were examined using two two-way ANOVA tests with repeated measures (one for each dependent variable). The clinical question was a within-subject independent variable, and the group of participants was a between-subjects independent variable.
5.3.3 The Effect of the Interaction Mode
The effects of the interaction mode (i.e., general exploration operators versus TACs) and of the group of participants (i.e., medical informaticians and clinicians) on the dependent variables of response time and accuracy of answer were examined using two two-way ANOVA tests with repeated measures (one for each dependent variable). The interaction mode was a within-subject independent variable and the group of participants was a between-subjects independent variable. Since we did not find statistically significant differences among the response times (or among the resultant accuracy levels) of the different clinical questions within the same interaction mode, the mean value of the response time (and of the accuracy) over the five clinical questions of the same interaction mode was used as the dependent variable.
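A two-way mixed-design ANOVA of this kind could be run, for example, with the pingouin package, as in the sketch below. The sketch is purely illustrative: the input file and column names are hypothetical, and the original analysis was not necessarily performed with this library.

```python
import pandas as pd
import pingouin as pg  # assumed available; any mixed-ANOVA implementation would do

# Long-format data, one row per participant x clinical question, with the
# hypothetical columns: participant, group (clinician / informatician),
# question (1..5, within-subject), and the dependent variables accuracy and time.
df = pd.read_csv("tac_evaluation_long.csv")  # hypothetical file

for dv in ("accuracy", "time"):
    aov = pg.mixed_anova(data=df, dv=dv, within="question",
                         subject="participant", between="group")
    print(dv)
    print(aov[["Source", "F", "p-unc"]])
```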
5.4 Results
This section summarizes the evaluation results of the TACs in terms of the research questions.
5.4.1 Functionality and Usability
Table 1 summarizes the results according to the clinical questions used in the evaluation. The mean accuracy was 97.9 ± 3.4 (the median was 100 for all five questions, with an interquartile range of zero for questions one to four and of 10 for question five). All participants successfully answered the clinical questions with mean accuracies greater than 90, and six of them achieved mean accuracies of 100. The range of mean accuracies per participant across all questions was [90 ... 100].
The mean response time was 2.7 ± 0.4 minutes. Only two participants needed more than 3 minutes (3.2 and 3.6 minutes) to answer. The range of mean response times per participant across all questions was [2.2 ... 3.0] minutes. The mean SUS score for all operators, across all participants, was 69.3 (over 50 is considered usable). A t-test showed that the mean SUS score of the medical informaticians (80.5) was significantly higher than that of the clinicians (58): [t(8) = 3.88, p < 0.01].
Conclusion: Based on the results of the TAC evaluation, we can conclude that after a very short training period, the participants were able to answer the clinical questions with very high accuracy and within short periods of time. The SUS scores show that TACs are usable but still need to be improved.
5.4.2 The Effect of the Clinical Question
Both analyses yielded no significant effect. The interaction effect group × clinical question was not significant in the ANOVA of the accuracy scores [F(4, 32) = 2.02, p = 0.12] or of the response times [F(4, 32) = 0.73, p = 0.58]. There was no significant difference between the mean accuracy scores/response times of the clinicians (96.3 ± 4.3)/(2.6 ± 0.2 min) and of the medical informaticians (99.5 ± 0.9)/(2.9 ± 0.5 min);
Table 1 Details of the questions that all participants had to answer, with mean response times (minutes) and accuracy scores

N | Clinical questions | Response time (mean ± s.d.) | Accuracy score [0..100] (mean ± s.d.)
1 | What percentage of the patients has had a "low" delegate value of the Platelet-state concept? What percentage of the patients has had a "moderately low" delegate value of the HGB-state concept? What percentage of the patients has had both a "low" value of the Platelet-state concept and a "moderately low" value of the HGB-state concept? | 2.4 ± 0.5 | 99.9 ± 0.3
2 | What delegate value of the HGB-state derived concept was the most frequent among the patients who have had a "low" aggregate value of the Platelet-state? | 2.4 ± 0.5 | 99.8 ± 0.6
3 | What were the maximal and minimal delegate values of the WBC count for patients who have the HGB-state delegate value "moderately low"? What were the maximal and minimal delegate values of the RBC for patients who have had a delegate HGB-state value that was "normal"? | 2.9 ± 1.0 | 96.0 ± 9.0
4 | What is the distribution of the delegate values of the Platelet-state in patients whose minimal delegate value of the WBC count raw concept was 5000 cells/ml (instead of the previous minimal value)? What were the new maximal and minimal delegate values of the RBC? | 3.0 ± 0.7 | 98.8 ± 4.0
5 | What percentage of the patients has had a "low" delegate value of the Platelet-state during both the first and second month following bone-marrow transplantation? | 3.0 ± 0.8 | 96.0 ± 5.0
accuracy: [F(1, 8) = 2.72, p = 0.14, regression coefficient ± sd = 3.2 ± 1.95]/time: [F(1, 8) = 1.90, p = 0.20, regression coefficient ± sd = 0.3 ± 0.26]. There was also no significant difference between the mean accuracy scores/response times of the five clinical questions (for mean accuracy ± sd and response times ± sd, see Table 1); accuracy: [F(4, 32) = 2.29, p = 0.08, maximal estimated effect ± sd = 4.8 ± 3.1]/time: [F(4, 32) = 2.24, p = 0.09, maximal estimated effect ± sd = 0.6 ± 0.33]. Conclusion: Neither the clinical question nor the group seems to affect the accuracy or the response time of answers provided using the TAC module.
5.4.3 The Effect of the Interaction Mode
The results of the ANOVA of the accuracy scores showed that the interaction effect group × interaction mode was not significant [F(1, 8) = 2.64, p = 0.14]. There was no significant difference between the mean accuracy scores of the clinicians (97.7 ± 3.0) and of the medical informaticians (99.8 ± 0.4); [F(1, 8) = 2.37, p = 0.16, regression coefficient ± sd = 2.1 ± 1.4]. There was also no significant difference between the mean accuracy scores of the answers obtained by using the general exploration operators (99.5 ± 1.6) and by using the TACs (97.9 ± 3.4); [F(1, 8) = 4.80, p = 0.07, regression coefficient ± sd = 2.3 ± 1.6]. With respect to the response time, the results of the analysis showed that the only significant effect was the main effect of the interaction mode [F(1, 8) = 14.96, p < 0.01]: a mean of 2.2 ± 0.18 minutes for answering the clinical questions when using the general exploration operators of VISITORS, and a mean of 2.7 ± 0.43 minutes when using the TACs. There was no significant difference between the mean response times of the clinicians (2.6 ± 0.2 min) and the medical informaticians (2.9 ± 0.5 min); [F(1, 8) = 3.24, p = 0.11, regression coefficient ± sd = 0.3 ± 0.16]. The interaction effect group × interaction mode was also not significant [F(1, 8) = 0.42, p = 0.54]. Conclusion: The interaction mode does not seem to affect the accuracy of the answers to clinical questions. The mean time needed to answer the clinical scenarios using the TACs is
significantly higher than when using the general exploration operators of VISITORS, but it is still less than three minutes.
5.5 Results of Power Analysis
Since we did not find significant effects in the results for research questions 2 and 3, we performed a statistical power analysis. For each two-way ANOVA test in the results of research question 2 (for each dependent variable), we performed a power analysis for each main effect, namely the five clinical questions and the two groups of participants, and for the interaction effect (questions × groups). In the case of research question 3, the main effects for which the power analysis was performed were the two interaction modes (general exploration operators versus TACs) and the two groups of participants, as well as the interaction effect (modes × groups). The results showed that, in the case of accuracy, assuming a meaningful difference of at least 5 points on a scale of (0 ... 100), the experiment (i.e., N = 10, α = 0.05) would detect an effect (i.e., a difference between the means of the groups) with a probability of at least 80%, which is considered a reasonable power. The same holds in the case of response time, assuming a meaningful difference of at least one minute. Moreover, the power analysis showed that with a larger group of 20 participants (10 clinicians and 10 medical informaticians), a smaller effect of only two points in the mean accuracy (the minimal effect size obtained in our analysis) could be detected with a probability of 80%. However, we consider a difference of two points or less to be relatively insignificant for practical purposes. The same applies in the case of response times for an effect size of 0.3 min, with a sample size of 54 participants (26 clinicians and 26 medical informaticians).
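A power calculation of this general kind can be sketched as follows. The sketch is illustrative only: it uses a simple two-sample t-test power model from statsmodels rather than the repeated-measures ANOVA design, and the assumed standard deviation is a placeholder, not a value from the study.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical inputs: a minimally meaningful difference of 5 accuracy points,
# an assumed within-group standard deviation, and 5 participants per group.
meaningful_diff = 5.0
assumed_sd = 3.0                               # placeholder; take from observed data
effect_size = meaningful_diff / assumed_sd     # Cohen's d

analysis = TTestIndPower()
power = analysis.power(effect_size=effect_size, nobs1=5, ratio=1.0, alpha=0.05)
print(f"power to detect the assumed effect: {power:.2f}")

# Conversely, solve for the per-group sample size needed to reach 80% power.
n_needed = analysis.solve_power(effect_size=effect_size, power=0.8, alpha=0.05)
print(f"participants needed per group: {n_needed:.1f}")
```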
6. Discussion
This paper presents the Temporal Association Chart, a computational and interaction module that enables users to graphically explore and analyze the time and value associations among domain concepts that explicitly or
implicitly exist within multiple time-oriented patient records. Moreover, it enables the exploration of 1) intelligent interpretations (temporal abstractions) of the raw data, derived using context-sensitive, domain-specific knowledge, and 2) temporal aggregations of the patient data, summarized within several specific time periods (including the use of temporal granularities) by applying a delegate function appropriate to each concept and each temporal granularity. The associations displayed between pairs of consecutive abstract concepts include support and confidence measures that can be interactively investigated via manipulation by the user. Note that when the time periods of a pair of concepts are the same, these measures are an interval-based extension of the familiar data mining measures, using delegate functions. When the time periods are different, an extension of temporal association rules, commonly called sequence mining, emerges, again using delegate functions; this extension allows multiple time granularities and does not necessitate the simultaneous existence of different concepts (known as items).
The evaluation of the TAC module, which is integrated within the VISITORS system, has demonstrated its functionality and usability. The only significant difference between the TACs and the general exploration operators was a slightly longer response time. A possible reason for not detecting significant differences in the accuracy scores when using different interaction modes is that the evaluation included a relatively small group of participants and questions. However, it should be noted that the variance among accuracy scores was quite low for both interaction modes, with all of the participants achieving scores above 90. Thus, the absence of a significant effect could not be attributed to random differences and high variability in each interaction mode. This conclusion was also supported by the results of the power analysis. Although applying ANOVA in the context of a ceiling effect in the accuracy scores is potentially problematic, this phenomenon, in our judgment, did not have a significant effect on the conclusions of the study.
One possible conceptual limitation of the TAC approach is the use of a goal-directed (user-driven) method for temporal data mining. Thus, the user must have a meaningful intuition regarding the selection of the
necessary concepts to explore. However, this limitation can be overcome by combining the TAC module with a knowledge-based temporal data mining method, such as the one we have been developing [18]. Associations that have sufficient support are automatically flagged and, in the future, could be visually explored. To summarize, we conclude that TACs might be described as "intelligent equalizers" that result in a uniform performance level with respect to answering complex time-oriented clinical statistical-aggregation questions, regardless of the questions asked or of the user type.
Acknowledgments
This research was supported by Deutsche Telekom Labs at Ben Gurion University and the Israeli Ministry of Defence, BGU award No. 89357628-01.
References
1. Bonneau G, Ertl T, Nielson G. Scientific Visualization: The Visual Extraction of Knowledge from Data. New York: Springer-Verlag; 2005. 2. Soukup T, Davidson I. Visual Data Mining: Techniques and Tools for Data Visualization and Mining. New York: John Wiley & Sons, Inc.; 2002. 3. Spenke M. Visualization and interactive analysis of blood parameters with InfoZoom. Artificial Intelligence in Medicine 2001; 22 (2): 159–172. 4. Wang T, Plaisant C, Quinn A, Stanchak R, Shneiderman B, Murphy S. Aligning Temporal Data by Sentinel Events: Discovering Patterns in Electronic Health Records. SIGCHI Conference on Human Factors in Computing Systems, 2008. 5. Chittaro L, Combi C, Trapasso G. Visual Data Mining of Clinical Databases: An Application to the Hemodialytic Treatment based on 3D Interactive Bar Charts. Proceedings of Visual Data Mining VDM’2002, Helsinki, Finland, 2002. 6. Aigner W, Miksch S, Müller W, Schumann H, Tominski C. Visual Methods for Analyzing Time-Oriented Data. IEEE Transactions on Visualization and Computer Graphics 2008; 14 (1): 47–60. 7. Shahar Y. A framework for knowledge-based temporal abstraction. Artificial Intelligence 1997; 90 (1–2). 8. Stein A, Shahar Y, Musen M. Knowledge Acquisition for Temporal Abstraction. 1996 AMIA Annual Fall Symposium, Washington, D.C.; 1996. 9. Chakravarty S, Shahar Y. Acquisition and Analysis of Repeating Patterns in Time-oriented Clinical Data. Methods Inf Med 2001; 40 (5): 410–420.
10. Shahar Y, Goren-Bar D, Boaz D, Tahan G. Distributed, intelligent, interactive visualization and exploration of time-oriented clinical data and their abstractions. Artificial Intelligence in Medicine 2006; 38 (2): 115–135. 11. Martins S, Shahar Y, Goren-Bar D, Galperin M, Kaizer H, Basso LV, McNaughton D, Goldstein MK. Evaluation of an architecture for intelligent query and exploration of time-oriented clinical data. Artificial Intelligence in Medicine 2008; 4 (3): 17–34. 12. Klimov D, Shahar Y. Intelligent querying and exploration of multiple time-oriented medical records. MEDINFO Annu Symp Proc 2007; 12 (2): 1314–1318. 13. Klimov D, Shahar Y. A Framework for Intelligent Visualization of Multiple Time-Oriented Medical Records. AMIA Annu Symp Proc 2005. pp 405–409. 14. Falkman G. Information visualisation in clinical Odontology: multidimensional analysis and interactive data exploration. Artificial Intelligence in Medicine 2001; 22 (2): 133–158. 15. Shneiderman B, Plaisant C. Designing the User Interface: Strategies for Effective Human-Computer Interaction. 4th edition. Addison Wesley; March 2004. 16. Brooke J. SUS: a “quick and dirty” usability scale. In: Jordan PW, Thomas B, Weerdmeester BA, McClelland AL, editors. Usability Evaluation in Industry. Taylor and Francis; 1996. 17. Friedman C, Wyatt J. Evaluation Methods in Medical Informatics. New York: Springer; 1997. 18. Moskovitch R, Shahar Y. Temporal Data Mining Based on Temporal Abstractions. ICDM-05 Workshop on Temporal Data Mining, Houston, TX, USA; 2005.
Original Articles
Estimation of Patient Accrual Rates in Clinical Trials Based on Routine Data from Hospital Information Systems
M. Dugas1; S. Amler1; M. Lange2; J. Gerß1; B. Breil1; W. Köpcke1
1Department of Medical Informatics and Biomathematics, University of Münster, Münster, Germany; 2IT Centre, Universitätsklinikum Münster, Münster, Germany
Keywords Patient accrual rate, hospital information system, clinical trial
Summary
Background: Delayed patient recruitment is a common problem in clinical trials. According to the literature, only about a third of medical research studies recruit their planned number of patients within the time originally specified.
Objectives: To provide a method to estimate patient accrual rates in clinical trials based on routine data from hospital information systems (HIS).
Methods: Based on inclusion and exclusion criteria for each trial, a specific HIS report is generated to list potential trial subjects. Because not all information relevant for the assessment of patient eligibility is available as coded HIS items, a sample of this patient list is reviewed manually by study physicians. Proportions of matching and non-matching patients are analyzed with a Chi-squared test. An estimation formula for the patient accrual rate is derived from these data.
Results: The method is demonstrated with two datasets from cardiology and oncology. HIS reports should account for previous disease episodes and eliminate duplicate persons.
Conclusion: HIS data in combination with manual chart review can be applied to estimate patient recruitment for clinical trials.

Correspondence to: Prof. Dr. Martin Dugas, Department of Medical Informatics and Biomathematics, University of Münster, Domagkstraße 5, 48149 Münster, Germany. E-mail: [email protected]

Methods Inf Med 2009; 48: 263–266
doi: 10.3414/ME0582
received: June 4, 2008
accepted: November 26, 2008
prepublished: March 31, 2009

Introduction
Delays in patient recruitment are a common problem in clinical trials. Charlson [1] analyzed trials listed in the 1979 inventory of the National Institutes of Health. He found that only 14 of 38 (37%) trials reached their planned recruitment. Twenty-three years later, a review of 114 trials between 1994 and 2003 held by the Medical Research Council and Health Technology Assessment Programmes found that less than one-third recruited their original target within the time originally specified [2]. There is a variety of reasons, such as fewer patients eligible than expected, staff problems, limited funding, complexity of the trial design, length of the recruitment procedure, and others. A recent Cochrane review [3] analyzed strategies to improve recruitment to research studies. Monetary incentives, an additional questionnaire at invitation, and treatment information on the consent form demonstrated benefit; the authors concluded that these specific interventions from individual trials are not easily generalizable. Therefore, from a methodological point of view, methods are needed to estimate patient accrual rates in clinical trials more precisely.
Hospital information systems (HIS) contain data items that are relevant for the inclusion and exclusion of patients in clinical trials. For instance, diagnosis information is coded in the HIS for billing purposes, but it can also be analyzed to screen for potential trial subjects [4]. However, electronic patient records contain a lot of unstructured text information; therefore, automated data analysis has limitations, and expert review of records is needed to assess patient eligibility. In this context, we propose a method to estimate patient accrual rates based on HIS reports in combination with manual review of a sample of HIS records.
Methods
Because not all information relevant for the assessment of patient eligibility is available as coded HIS data items, a two-stage process to estimate patient accrual rates is applied: First, a list of matching patients is generated with a specific HIS report for a given time span T (for instance, T = [January 1, 2007; December 31, 2007]). Second, a sample of these patient records is reviewed manually by an expert to assess eligibility and thereby estimate the patient accrual rate. HIS reports are database queries which can be generated using the reporting tools of the HIS (HIS report generator) or by data queries from a data warehouse. These reports can access all structured data elements within the HIS. Typical examples of HIS data items are admission and discharge diagnoses (primary as well as secondary diagnoses, coded according to the International Classification of Diseases), patient age, patient gender and routine lab values. Depending on the inclusion and exclusion criteria of each trial, all suitable HIS items should be considered for this HIS report to provide high recall and precision.
HIS documentation is focused on a “case”, i.e. a certain episode of care in a hospital with related clinical and administrative data; trials, in contrast, address individual patients. For this reason, HIS reports for patient accrual should analyze all HIS cases of a patient, to avoid duplicate persons and to account for pre-existing diseases. We propose a stepwise approach for these HIS reports: First, select all HIS cases matching the inclusion and exclusion criteria; second, remove duplicate persons; third, identify all HIS cases for each matching patient and retrieve data on pre-existing diseases to check the inclusion and exclusion criteria for each patient. For instance, many trials recruit patients with an initial diagnosis; therefore, it needs to be verified whether this diagnosis was established in the past. The output of this report should be pseudonymized to protect patient data. The number of patients on this HIS report for time span T is denoted as nT. Under the assumptions that the average patient accrual rate does not change over time and that the HIS report identifies exactly all eligible patients, the estimated patient accrual rate would be nT/|T|, where |T| denotes the length of time span T. However, typically only a subset of the information required for inclusion and exclusion is available as coded HIS data items. Therefore, only a subset of the nT patients matches all inclusion and exclusion criteria for a specific trial. A manual expert review of a sample of sT patient records from the HIS report results in mT matching patients.
Manual review of HIS patient records requires access to identifiable patient data; therefore, it needs to be compliant with data protection laws. Physicians with direct involvement in patient care are allowed to access the records of their patients. Therefore, these physicians receive a list of pseudonyms from the HIS report, which enables them to access those patient records. They report for each pseudonym whether this patient is eligible for the trial, without disclosure of the person's identity. In general, data access policies must be approved by the responsible data protection officer.
Before the patient accrual rate is estimated, we propose to assess whether the probability of HIS patients actually matching the trial is constant over time. Therefore, the numbers of matching and non-matching patients are tabulated in a contingency table for a set of predefined sub-intervals t of time span T. Our null hypothesis states that the proportion of matching patients among all reviewed sample patients, mt/st, is constant over all sub-intervals t; it is tested with Pearson's Chi-squared test. If the null hypothesis is not rejected (p > 0.05), we conclude that the probability of HIS patients matching the trial is constant in time and estimate the patient accrual rate (PAR) in the total time span T as follows:
PAR = (mT / sT) · (nT / |T|)   (1)
A confidence interval for the expected PAR can be calculated according to Clopper [6], as
Table 1 A HIS report generates monthly lists of potential trial patients for an atrial fibrillation trial (second column). Experts manually reviewed medical records from these persons and identified matching patients for the trial (third column). Overall, 304 of 544 (56%) HIS report patients were suitable for the trial.

Month | Number of patients in HIS report per month (nt = st) | Number of matching patients from manual expert review per month (mt)
November 2007 | 79 | 71
December 2007 | 60 | 55
January 2008 | 76 | 62
February 2008 | 90 | 71
March 2008 | 70 | 21
April 2008 | 96 | 21
May 2008 | 73 | 3
Total | nT = sT = 544 | mT = 304
implemented in the R function binom.test [5]. Specifically, we assume a fixed rate nT/|T|. This rate is multiplied by the calculated confidence interval of the probability of HIS patients actually matching the trial.
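The complete estimation procedure (the homogeneity check, the point estimate from Formula 1, and the Clopper-Pearson interval) can be sketched in a few lines of Python, as below. This is our illustration with scipy and statsmodels; the original computation used R's binom.test.

```python
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

def estimate_par(m_t, s_t, n_T, months):
    """m_t, s_t: per-subinterval matching / reviewed counts; n_T: total HIS report size."""
    m_T, s_T = sum(m_t), sum(s_t)
    # Step 1: test whether the matching proportion is constant over the subintervals.
    table = [m_t, [s - m for s, m in zip(s_t, m_t)]]
    _, p, _, _ = chi2_contingency(table)
    if p <= 0.05:
        raise ValueError(f"matching proportion not constant over time (p={p:.3g})")
    # Step 2: Formula 1, PAR = (m_T / s_T) * (n_T / |T|).
    par = (m_T / s_T) * (n_T / months)
    # Step 3: Clopper-Pearson interval for m_T/s_T, scaled by the fixed rate n_T/|T|.
    lo, hi = proportion_confint(m_T, s_T, alpha=0.05, method="beta")
    return par, (lo * n_T / months, hi * n_T / months)

# Example 2 (leukemia trial, HIS report-1): six months, 283 listed, 28 matching.
print(estimate_par([2, 5, 5, 6, 5, 5], [49, 30, 52, 63, 47, 42], 283, 6))
# approx. (4.67, (3.15, 6.59)) per month, as reported in the text
```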
Results
We use datasets from ongoing Münster atrial fibrillation trials [7] and leukemia trials [8, 9] to demonstrate this method of patient accrual rate estimation. A HIS-based notification system generated HIS reports for study physicians, who manually reviewed patient records to assess trial eligibility [4].
Example 1: Atrial Fibrillation Trial
Table 1 presents the number of patients identified by a HIS report and the number of matching patients identified by manual expert review. The HIS report queried the diagnosis code (I48.11 or I48.0) for the department of cardiology. In this example, all patients listed on the report were analyzed manually, i.e. sT = nT. Within seven months (November 2007 to May 2008), 544 patients were found in the HIS report; all these patients were reviewed manually and 304 matching patients were found, i.e. T = [November 2007; May 2008], nT = 544, mT = 304, sT = 544. When looking at the data values in Table 1, it is striking that the number of matching patients is very low in March, April and May 2008. Pearson's Chi-squared test to compare the proportions mt/(st – mt) across the sub-intervals t results in a highly significant p-value (p < 2.2E-16); therefore, our estimation Formula 1 cannot be applied in this example.
Example 2: Leukemia Trial
In analogy to Example 1, Table 2 presents the number of patients identified by HIS report-1 and the associated number of matching patients. This report queried the diagnosis code (C92.0-, C92.00 or C92.01) for the department of oncology. Again, all patients were analyzed manually. Within six months (April 2008 to September 2008), 283 patients were listed in HIS report-1.
Twenty-eight matching patients were identified by manual review, i.e. T = [April 2008; September 2008], nT = 283, mT = 28, sT = 283. Pearson's Chi-squared test to compare the proportions mt/(st – mt) across the sub-intervals t results in a non-significant p-value (p = 0.60); therefore, our estimation Formula 1 can be applied. Formula 1 yields an estimated patient accrual rate PAR = 4.67/month with a 95% confidence interval of (3.15/month; 6.59/month). When comparing Table 1 and Table 2, it is striking that the overall proportion of matching patients is much lower in Table 2. Therefore, we applied an improved HIS report-2, which eliminated persons with previous leukemia episodes as well as duplicate persons (Table 3). With this improved report, nT was reduced (nT = sT = 53) for the same number of matching patients (mT = 28), i.e. 53% of HIS report-2 patients were suitable for the trial. Again, Pearson's Chi-squared test to compare the proportions mt/(st – mt) across the sub-intervals t results in a non-significant p-value (p = 0.13); therefore, our estimation Formula 1 can be applied. Formula 1 yields an estimated patient accrual rate PAR = 4.67/month with a 95% confidence interval of (3.41/month; 5.89/month).
Discussion
In Germany and many other countries, electronic HIS are available in almost all hospitals. Initially, they were implemented for administrative purposes (billing, DRG system), but in recent years more and more clinical information has become available in these systems. Due to deficiencies in data monitoring and software validation, they are at present not suited for the documentation of clinical trials, but they contain relevant information, such as diagnosis codes, which can be used to support patient recruitment [4]. Estimation of realistic patient accrual rates is important for the planning of clinical trials, but quite difficult. The phenomenon that patient recruitment often takes much more time than investigators expected is called “Lasagna's Law” [10] (after Louis Lasagna, clinical pharmacologist and investigator of the placebo response). Collins [11] wrote about the “fantasy and reality” of patient recruitment and concluded: “we cannot overemphasize the importance of paying adequate attention to
Table 2 Leukemia trial. HIS report-1 selects potential trial patients based on ICD codes (second column). Matching patients were identified by manual review of medical records (third column). Overall, only 28 of 283 (9.9%) HIS report-1 patients were suitable for the trial.

Month | Number of patients in HIS report-1 per month (nt = st) | Number of matching patients from manual expert review per month (mt)
April 2008 | 49 | 2
May 2008 | 30 | 5
June 2008 | 52 | 5
July 2008 | 63 | 6
August 2008 | 47 | 5
September 2008 | 42 | 5
Total | nT = sT = 283 | mT = 28
Table 3 Leukemia trial. In contrast to Table 2, HIS report-2 eliminates persons with previous leukemia episodes as well as duplicate persons (second column). Matching patients were identified by manual review of medical records (third column, same as in Table 2). Overall, 28 of 53 (53%) HIS report-2 patients were suitable for the trial.

Month | Number of patients in HIS report-2 per month (nt = st) | Number of matching patients from manual expert review per month (mt)
April 2008 | 6 | 2
May 2008 | 10 | 5
June 2008 | 13 | 5
July 2008 | 6 | 6
August 2008 | 11 | 5
September 2008 | 7 | 5
Total | nT = sT = 53 | mT = 28
sample size calculations and patient recruitment during the planning process. A sample size that is too small may turn a potentially important study into one that is indecisive or even an utter failure”. There is a lot of evidence that many clinical trials have fallen behind their recruitment objectives [1, 2]. Data monitoring committees must frequently decide about actions in trials with lower-than-expected accrual [12]. Carter [13] stated that “the most complicated aspect pertaining to the estimation of accrual periods is the determination of the expected rate”. HIS statistics can be used to estimate annual case numbers for a specific disease. However, this approach lacks precision, because due to specific inclusion and exclusion criteria only a subset of these patients is
eligible for a certain trial. Depending on these criteria, the rate of suitable patients within a certain disease may vary considerably. For this reason, we combine a HIS report with manual expert review of patient records to estimate possible accrual rates more precisely. Manual chart review is labor-intensive; especially when nT is large, analysis of a sample sT (sT […]

V. G. Deshmukh et al.: Clinical Decision Support Using CYP2C9 Genotypes

[…]T, 1188delA, etc.), which can also be described by their representative alleles [17, 18] (e.g.: CYP2C9*2, *3, etc.). Warfarin, a commonly prescribed oral anticoagulant, is a vitamin-K antagonist [19] metabolized by CYP2C9 [20], with a narrow therapeutic index and significant inter-patient variability in dose response, due to which it has been underutilized [21]. The variability in dose response to Warfarin has been partially explained by SNPs in the CYP2C9 and VKORC1 genes [22], and the United States Food and Drug Administration's (FDA) new labeling on all Warfarin products [23] underscores the importance of pharmacogenetics in general, and of this use-case in particular. Since the availability of genetic tests for the CYP2C9 and VKORC1 genes, dosing algorithms that incorporate the results of these two tests have become available [24], and their implications for Warfarin dosing may be conceptually represented by applying the hierarchical knowledge model [25] (Fig. 1). In Figure 1, the results of molecular assays used to detect known SNPs [14] constitute raw data, which may be represented by the corresponding alleles [18] that subsume the SNPs (information); these may then be interpreted according to the expected phenotype [15] as slow metabolizers of Warfarin (knowledge), which may be understood by clinicians as an elevated risk of bleeding complications [26] in Warfarin therapy (understanding); the clinicians may then adjust the Warfarin dosage [24] by using a combination of dosing nomograms, pharmacogenetics, the physiological condition of the patient, and their experience in treating patients with similar conditions (wisdom). Alternatively, genetic findings may be reported with recommendations for dose adjustment, which could provide clinical context for the results. Although the above knowledge model may work well in the short term, with increasing use and evolving knowledge of the implications of these test results in clinical practice, the complexity of clinical information is likely to increase [27]. In addition, the issues with navigating information sources that enable genotype-to-phenotype translation of such knowledge into clinical practice [28] necessitate the use of clinical decision support systems (CDSS) at the point of care [29, 30]. CDSS require that the genetic test results, as well as the interpretations, reported in the Laboratory Information System (LIS) and the
Fig. 1 Conceptual levels of data abstraction: A set of SNPs can be considered data, the allele containing the SNPs as information, the resulting slow-metabolizer phenotype as knowledge, the implications on bleeding complications as understanding, while the overall need to adjust Warfarin dosage based on all of the above as wisdom.
EHR be in a discrete, concise and machine-readable format. Further, with the increasing availability of genetic testing in external reference labs [2] (or, potentially, direct-to-consumer genetic testing facilities [31, 32]), the results of these genetic tests may not have been entered in the same LIS as the rest of a patient's results, in which case these genetic test results would have to be collected as part of the history & physical (H&P) examination. However, having the H&P as plain text would negate the benefit of having a point-of-care CDSS, particularly for genetic test results, which often tend to be more complex than other clinical findings [27]; it would therefore become necessary to capture any genetic information provided by the patient in a discrete, coded format, rather than as a textual narrative. In the absence of an appropriate level of data abstraction, the sheer amount and complexity of the information generated by genetic testing have the potential to overwhelm existing EHR systems as well as the clinical end-users. In the present work, we investigate the suitability of reporting genetic data in the EHR at the level of SNPs and alleles by comparing these two data models from the perspective of clinical decision support, reporting within the EHR, and suitability for integrating future discoveries.
2. Methods
Our pilot project involved the CYP2C9 gene and the corresponding alleles and SNPs known to have clinical significance in Warfarin therapy. The initial prototyping was performed in a simulation environment at Cerner Corporation headquarters in Kansas City, MO, and then the decision-support component was reconstructed in a live clinical system environment at the University of Utah Hospital, Salt Lake City, UT. Although all software testing was performed using Cerner software, our methods are generalizable and can be evaluated using any EHR system that integrates a point-of-care clinical decision support system (CDSS); they are independent of the underlying computational environments, databases, etc. (the environments in which we prototyped and tested have different underlying hardware and software architectures). All the decision support rules used in our study are available for download as standard Arden syntax files at http://informatics.bmi.utah.edu/cyp2c9/suppl/.
Fig. 2 Block diagram of information architecture: The above schematic shows a simplified, scaled version of the various components in our EHR system. The EHR system contains several integrated modules, and these communicate with other components and with the database through a middleware layer. The CDSS component resides in the middleware layer, and is available within most components of the EHR.
Table 1 CYP2C9 alleles, SNPs and lab panels (adapted from the Human Cytochrome P450 (CYP) Allele Nomenclature Committee's website) [18]

Allele | Allele Subtype | SNP: cDNA | SNP: Genomic DNA
*1 | *1A | None | None
*1 | *1B | None | 2665_2664delTG; 1188T>C
*1 | *1C | None | 1188T>C
*1 | *1D | None | 2665_2664delTG
*2 | *2A | 430C>T | 1188T>C; 1096A>G; 620G>T; 485T>A; 484C>A; 3608C>T
*2 | *2B | 430C>T | 2665_2664delTG; 1188T>C; 1096A>G; 620G>T; 485T>A; 484C>A; 3608C>T
*2 | *2C | 430C>T | 1096A>G; 620G>T; 485T>A; 484C>A; 3608C>T
*3 | *3A | 1075A>C | 1911T>C; 1885C>G; 1537G>A; 981G>A; 42614A>C
*3 | *3B | 1075A>C | 1911T>C; 1885C>G; 1537G>A; 1188T>C; 981G>A; 42614A>C
*4 | – | 1076T>C | 42615T>C
*5 | – | 1080C>G | 42619C>G
*6 | – | 818delA | 10601delA
The latest version of the CBO is available for download at http://www.clinbioinformatics.org.
The Cerner® Millennium® platform consists of several modules that serve different functions, leveraging a common application and database infrastructure. The block diagram in Figure 2 shows a simplified, scaled-down schematic of the various components of the EHR that are relevant to the present work. Laboratory orders can be placed in the CPOE module of the EHR or in the LIS modules, and the results of lab tests can be charted in the LIS. One of the differences between the EHR environments used during prototyping and testing was that our testing environment at the University of Utah Hospital receives lab results over an HL7 interface, whereas the prototyping environment at Cerner Corporation shared a common database with the LIS module through the EHR middleware. Regardless of the setup, however, once charted, the results are automatically sent to the EHR application, where they appear under the lab results section in the patient's chart. The same genetic results can also be charted as part of discrete patient-care documentation within the EHR itself, as part of the clinical documentation module, which also posts these results to the exact same place within the database through the middleware; this is important when considering scenarios where genetic test results may have been provided directly by the patients themselves, as results from other independent labs [2], during their history and physical examination. The integrated, point-of-care CDSS module in the EHR, which operates within the middleware layer, is able to consume the results and communicate the alerts to several different applications, regardless of the application or method through which these were posted to the EHR, making our methods even more generalizable. Medication orders can be placed in the EHR or in a separate pharmacy module, and the clinical decision support system runs in the background in both of these modules. During prototyping at Cerner Corporation, we developed two lab panels in the LIS (Table 1): a panel for reporting test results as alleles (allele panel) and another panel for reporting test results as SNPs (SNP panel). Discrete genetic test results within each of these lab panels were mapped to their corresponding pre-coordinated concepts within the CBO, using the CYP2C9*2A allele as an example (Table 2). For example, the definition of the CBO allele concept CYP2C9.0004 contains several individual SNPs represented by the respective CBO concepts, and also the synonym CYP2C9*2A, which represents the allele.
Within the allele panel, the concept representing CYP2C9*2A was used as-is for storing one discrete data element, whereas in the SNP panel, the individual concepts representing the SNPs contained in this concept each had their own separate, discrete data elements. The LIS also allowed automated interpretation of genetic results as ‘homozygous normal’, ‘heterozygous affected’ and ‘homozygous affected’, based on the results entered for each copy of the allele or SNP in the corresponding lab panels. Upon electronically signing the results charted in a given lab panel for a test patient, these results and their automated interpretations were posted to the EHR. During testing at the University of Utah, the same genetic test panels were recreated within the discrete clinical documentation module of the EHR itself, and results were charted using that module.
Decision-support rules based on each data model were built in the CDSS module, using the logic illustrated in Figure 3A. In addition to the individual rules based on the allele and SNP models, two other generic rules were created: one that triggered upon adding a medication to the list of medication orders (rule ‘A’), and another that triggered upon actually signing the medication order (rule ‘D’). The individual rules based on the allele and SNP models (rules ‘B’ and ‘C’, respectively) were set to trigger in response to the placing of Warfarin orders, and the rule-evaluation criteria were slightly different for each rule (Fig. 3B), with the allele rule checking for lab values indicating the CYP2C9 *2, *3, or *6 alleles (Table 1, column 1), and the SNP rule checking for all the corresponding SNPs for each of these alleles (Table 1, columns 3/4). The execution order of these rules was set so that, during the process of adding Warfarin to a patient's list of medication orders, the first rule that fired was the generic rule ‘A’, followed by either ‘B’ or ‘C’ depending on whether we were testing the allele model or the SNP model, and then finally rule ‘D’. The common element within each of these rules was an action that created a database time-stamp, with millisecond precision, so that the differences between the three time-stamps in each test case would give us the actual amount of time needed to evaluate the rule, the time taken to respond to the medication alert (Fig. 4), and the total time needed to complete individual orders.
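The difference between the two rule-evaluation strategies can be illustrated with the following Python sketch. It is not the deployed logic (the actual rules were written in Arden syntax); the allele and SNP lists are taken from Table 1, but the function and variable names are ours, and the SNP list is abbreviated.

```python
# Variant CYP2C9 alleles the allele rule looks for (Table 1, column 1).
VARIANT_ALLELES = {"CYP2C9*2", "CYP2C9*3", "CYP2C9*6"}

# SNPs corresponding to those alleles, as the SNP rule would check them
# (Table 1, cDNA / genomic DNA columns; abbreviated for illustration).
VARIANT_SNPS = {"430C>T", "3608C>T", "1075A>C", "42614A>C", "818delA", "10601delA"}

def allele_rule_fires(charted_results):
    """Rule 'B': alert if any charted allele result is a variant allele."""
    return any(r in VARIANT_ALLELES for r in charted_results)

def snp_rule_fires(charted_results):
    """Rule 'C': alert if any charted SNP result corresponds to a variant allele."""
    return any(r in VARIANT_SNPS for r in charted_results)

def on_warfarin_order(charted_results, use_allele_model):
    """Fires between the generic rules 'A' (add order) and 'D' (sign order)."""
    fires = (allele_rule_fires if use_allele_model else snp_rule_fires)(charted_results)
    return "show drug-gene interaction alert" if fires else "no alert"

print(on_warfarin_order({"CYP2C9*1", "CYP2C9*2"}, use_allele_model=True))
```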
Table 2 Concepts and relationships in the Clinical Bioinformatics Ontology, using the CYP2C9*2A allele as an example [12]

CBO Concept 1 | Relationship | CBO Concept 2
Human Allele | Subsumes | CYP2C9.0004
CYP2C9.0004 | Synonym | CYP2C9*2A
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-1188T>C
CYP2C9.0004 | Has constituent variant | CYP2C9.c.430C>T
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-1096A>G
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-620G>T
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-485T>A
CYP2C9.0004 | Has constituent variant | CYP2C9.c.-484C>A
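The relationships in Table 2 can be pictured as simple triples of a semantic network. The sketch below is only an illustration of that structure; the CBO itself is a curated semantic network [12], and this flat triple list and query helper are assumptions made for the example.

```python
# Minimal sketch of the Table 2 relationships as (concept, relationship, concept)
# triples, plus a query that derives the SNP list an SNP panel would need.
CBO_TRIPLES = [
    ("Human Allele", "Subsumes", "CYP2C9.0004"),
    ("CYP2C9.0004", "Synonym", "CYP2C9*2A"),
    ("CYP2C9.0004", "Has constituent variant", "CYP2C9.c.-1188T>C"),
    ("CYP2C9.0004", "Has constituent variant", "CYP2C9.c.430C>T"),
    ("CYP2C9.0004", "Has constituent variant", "CYP2C9.c.-1096A>G"),
    ("CYP2C9.0004", "Has constituent variant", "CYP2C9.c.-620G>T"),
    ("CYP2C9.0004", "Has constituent variant", "CYP2C9.c.-485T>A"),
    ("CYP2C9.0004", "Has constituent variant", "CYP2C9.c.-484C>A"),
]

def constituent_variants(allele_concept, triples=CBO_TRIPLES):
    """Return the SNP concepts that define an allele concept."""
    return [obj for subj, rel, obj in triples
            if subj == allele_concept and rel == "Has constituent variant"]

# constituent_variants("CYP2C9.0004") -> the six SNPs charted by the SNP panel
```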
Fig. 3 Testing methodology. A: The overall testing method, showing the rules being triggered upon placing an order for Warfarin for a test patient for whom CYP2C9 allele/SNP results were available; B: Overall differences in the rule evaluation logic for the allele and SNP rules.
Fig. 4 Medication alert triggered by adding Warfarin to the list of medication orders for a test patient with CYP2C9 allele/SNP results
The CDSS rules 'B' and 'C' were tested in isolation from one another by enabling one while the other was disabled, using test patients for whom the corresponding SNPs or alleles were reported as positive through the LIS or the clinical documentation modules. This was done to prevent test conditions in which both rules could be triggered simultaneously and potentially affect one another, so that the rules fired in the order A-B-D for the allele model and A-C-D for the SNP model.

Fig. 5 Rule execution times: Rule execution times were measured as the difference between the database time-stamp recorded on adding Warfarin to the test patient's list of medication orders and the time-stamp recorded on completing evaluation of the rule logic.

The same medication order can be placed through either the inpatient pharmacy module or the Computerized Provider Order Entry (CPOE) module, and either route allows the rule to execute. The CPOE module is tightly integrated with the main EHR application, and the drug-gene interaction alerts were primarily intended to be seen by physicians at the point where they place orders. For our testing, we therefore chose to place these orders using the CPOE module of the EHR system rather than the pharmacy module. Each Warfarin order placed in this manner was later discontinued, and the process was repeated so that there were no active Warfarin orders on the given test patient at the time of placing another order. This avoided triggering other existing error-checking mechanisms, such as therapeutic duplication checking and dose-range checking, which could have introduced confounders by interfering with rule execution. The execution of the rule in each case generated a pop-up alert (Fig. 4) indicating a medication warning due to an underlying genetic condition, with options to accept or cancel the order or ignore the recommendation, and a link to additional information describing the importance of these genotypes in Warfarin therapy [26]. Although it was also possible to make numeric dosing recommendations based on pharmacogenetic data [24], this was not included in our tests to minimize confounding. More than 50 orders were placed in this manner to evaluate each rule for Warfarin on each test patient, with the rule triggering on every event, and the rule execution times were recorded as time-stamps in the EHR database using database triggers. To further minimize confounding due to differences in system utilization, the simulations were performed during the same time periods of the day to account for system load. Other aspects of reporting genetic test results in the EHR, such as the formats for reporting results and interpretations to clinicians, were also considered, although a formal evaluation of these was not performed as part of the present study.
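As a concrete illustration of how the three time-stamps translate into the reported quantities, the following sketch derives the rule-evaluation time, the alert reaction time, and the total order time from three recorded instants. The variable names and the example values are assumptions; in the study the instants were captured by database triggers inside the EHR, not by application code.

```python
from datetime import datetime

def timing_from_timestamps(t_order_added: datetime,
                           t_rule_evaluated: datetime,
                           t_order_completed: datetime):
    """Derive the three reported intervals from the three database time-stamps."""
    rule_execution_ms = (t_rule_evaluated - t_order_added).total_seconds() * 1000.0
    alert_reaction_s = (t_order_completed - t_rule_evaluated).total_seconds()
    total_time_s = (t_order_completed - t_order_added).total_seconds()
    return rule_execution_ms, alert_reaction_s, total_time_s

# Example with invented instants roughly matching the reported magnitudes:
t0 = datetime(2008, 1, 1, 12, 0, 0, 0)        # Warfarin added to the order list
t1 = datetime(2008, 1, 1, 12, 0, 0, 25_000)   # rule logic evaluated (~25 ms later)
t2 = datetime(2008, 1, 1, 12, 0, 6, 600_000)  # order signed/completed (~6.6 s later)
print(timing_from_timestamps(t0, t1, t2))      # -> (25.0, 6.575, 6.6)
```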
3. Results

3.1 Rule Execution Times

The rule execution times were determined by examining the differences in database time-stamps from the point of rule triggering to the completion of rule-logic evaluation and the generation of an alert in the EHR. The results for rules based on both the allele and the SNP models are plotted in Figure 5. These measurements were logged in milliseconds; the average rule execution times were 25.06 ms (n = 50) for the allele rule and 57.64 ms (n = 50) for the SNP rule (Table 3). Using a two-tailed Student t-test of two samples assuming equal variance, this difference was statistically significant (p < 0.001; Table 3). The alert reaction times and the total order times for the two models, also summarized in Table 3, did not differ significantly. In the SNP panel, results were charted for each copy of the gene (e.g., Copy 1: CYP2C9 1096A > G 'Absent'; Copy 2: CYP2C9 1096A > G 'Present'), thus generating about 32 discrete genetic test results in the EHR per test order.
Table 3 Rule execution times, alert reaction times, and total order times for the allele and SNP models (n = 50 each), with two-tailed equal-variance t-tests comparing the two models.

Time | Sample size | Mean | Standard deviation
Allele rule execution | 50 | 25.06 ms | 2.431 ms
SNP rule execution | 50 | 57.64 ms | 3.997 ms
Allele alert reaction time | 50 | 6.578 s | 0.908 s
SNP alert reaction time | 50 | 6.520 s | 1.182 s
Allele total time | 50 | 6.604 s | 0.909 s
SNP total time | 50 | 6.578 s | 1.182 s

Comparison (allele vs. SNP) | Degrees of freedom | Two-tailed t-statistic (tc = 1.96) | p-value (α = 0.05)
Rule execution | 98 | 49.245 | 6.15E-71
Alert reaction time | 98 | 0.275 | 0.784
Total time | 98 | 0.120 | 0.904
4. Discussion

The integration of genetic data into clinical care is an important step toward delivering personalized medicine, and point-of-care CDSS that enable recommendations based on these data will be an important means of realizing this goal. Given the inherent complexity of genetic data and the need for concise, human-interpretable guidelines at the point-of-care, it will be necessary to present these data in a form that can be consumed by front-line clinicians, and therefore to abstract them. However, each level of data abstraction, going from the complete DNA sequence to SNPs, alleles, haplotypes, etc., comes with trade-offs in terms of current versus future usability of the findings, performance of the CDSS, and loss of information that may be important for secondary use. In the present study, we considered one such data-abstraction scenario by comparing two data models, based on SNPs and alleles, for reporting genetic data in the EHR, and examined some of the implications of choosing either data model for point-of-care CDSS by tackling a real-world clinical problem involving CYP2C9 polymorphisms and Warfarin dosing in a real-world EHR environment.
4.1 Rule Execution Times

From the CDSS database time-stamps, the average rule execution time was estimated at 25.06 ms for the allele model and 57.64 ms for the SNP model (Table 3). Both are within acceptable limits for interactive software applications and would not have a negative impact on
end-user applications if these rules were triggered in isolation. In spite of the significant difference in rule execution times, the differences in total order times and in reaction times to the EHR alerts between the allele and SNP models were insignificant, since these two quantities, measured in seconds, exceed the rule execution times by at least two orders of magnitude. In other words, within our experiments there was ample room to accommodate more complex CDSS rules before they would have a noticeable impact on the end-user applications. However, in a clinical system with multiple CDSS rules being evaluated on multiple patients, the overall rule execution times could still vary and possibly affect the performance and responsiveness of the front-end applications. Some of these potential performance issues could be addressed by consolidating two or more CDSS rules into a single, faster rule, but such an approach could create new problems by increasing rule complexity and making the rules harder to maintain over time.
4.2 Rule Complexity

With the growing complexity of knowledge about the genetics of human diseases, CDSS rules will also tend to become more complex, and this complexity was readily apparent in the differences in rule logic between the SNP and allele models (Fig. 3B). The complexity of CDSS rules can be addressed by constructing an executable knowledge base of SNPs, alleles, haplotypes, etc., together with the relevant clinical effects, interpretations and recommendations, so that pharmacogenetic decision support is driven by stored, updateable knowledge instead of hard-coded logic such as that used in our rules. PharmGKB, a publicly available, searchable online resource, is one such pharmacogenetic knowledge base [33] containing current information on the relationships between drugs, diseases and genes; however, in order to be used as a knowledge base for CDSS rules, the rules based on these relationships will themselves have to be formalized and stored in an executable format. Molecular-genetic vocabularies such as the CBO, which contain pre-coordinated concepts for genetic findings (e.g., the CBO concept CYP2C9.c.430C>T implies a change
from Cytosine to Thymine at position 430 in the cDNA of the CYP2C9 gene) will have to be combined with other clinical vocabularies such as SNOMED CT [10] to describe adequately both the effects and the recommendations required for pharmacogenetic decision support at the level needed to formalize these rules. At present, however, most clinical vocabularies lack the granularity and coverage needed to describe the effects and interpretations of molecular-genetic tests, as well as recommendations based on these results, and may require considerable improvement before they can effectively represent these rules in an executable knowledge base.
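To illustrate the distinction between hard-coded logic and knowledge-driven rules discussed above, the sketch below stores one pharmacogenetic rule as a data record and interprets it with a small generic engine. The record fields, the wording of the recommendation, and the engine itself are assumptions made for illustration; they are not drawn from PharmGKB or from the CDSS used in this study.

```python
# A knowledge-base entry: the triggering drug, the genetic findings of interest,
# and the recommendation to show. Updating knowledge then means editing data,
# not rewriting rule code. All field names and wording are illustrative.
KNOWLEDGE_BASE = [
    {
        "drug": "warfarin",
        "findings_any_of": ["CYP2C9*2", "CYP2C9*3", "CYP2C9*6"],
        "recommendation": "Reduced CYP2C9 activity reported; "
                          "consider dose adjustment and closer monitoring.",
    },
    # further drug-gene entries would be appended here
]

def evaluate(drug_ordered, patient_findings, knowledge_base=KNOWLEDGE_BASE):
    """Generic engine: return the recommendations whose entry matches the order."""
    alerts = []
    for entry in knowledge_base:
        if entry["drug"] == drug_ordered and \
                any(f in patient_findings for f in entry["findings_any_of"]):
            alerts.append(entry["recommendation"])
    return alerts

# evaluate("warfarin", {"CYP2C9*2"}) -> one dose-adjustment alert
```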
4.3 Precision of Allele Assignment

With DNA sequencing becoming the gold standard for genetic information, any other form of capturing SNP findings is subject to imprecision. For instance, an allele-specific PCR panel could generate SNP results that are interpreted as a specific allele; subsequent DNA sequencing could identify a SNP that was not specifically targeted in the initial panel and require the allele assignment to be corrected. It is therefore important to consistently document the method used to generate an allele assignment. A vocabulary concept describing a finding that represents an allele can be semantically related to other concepts that represent the corresponding SNPs for that allele. Using allele concepts for reporting the results of genetic tests does not exclude the possibility of other SNPs being detected by a more comprehensive assay, but the burden falls on the LIS to be configured to describe the methodology accurately. Some allele terminologies imply phylogenetic relationships between allele names, which presents a problem of precision, and accurate descriptions of these methodologies therefore become important. For example, the subtypes of the CYP2C9*2 allele, *2A and *2B, are identical from a functional perspective, causing the same overall change in the cDNA (Table 1). However, the *2B allele includes a few more SNPs in addition to those found in *2A, and reporting the results as only the relevant allele *2 would lead to a loss of this information, which may
be important to retain in the context of future discoveries. Further, allele nomenclature itself is not consistent across different genes, and it may be more suitable to report all individual SNPs within the LIS and the EHR in order to retain the maximum amount of information. The CBO addresses these concerns by creating unique concepts for each allele regardless of functional significance, and through the use of a phylogenetically neutral naming convention that is consistent across all represented genes (modeled after the allele convention used by OMIM). In the light of recent discoveries in the genetics of human diseases [34], it is important to retain as much information as possible about both the findings and the methods used to generate them. Although the allele model allows the construction of simpler CDSS rules (Fig. 3B), it requires clear documentation of the method in order to prevent loss of information that may be useful for 'bubbling up' interpretations of existing data in the future. The HL7 CGS approach has provisions for capturing all possible genetic information for the explicit purpose of allowing future re-interpretation; however, since the EHR system in question is built around a database-driven architecture, it does not implement HL7 information models directly, like a majority of present-day EHRs. Using an ontology like the CBO, which is structured around biological observations, in conjunction with another codification system (e.g., LOINC) to describe the actual method used to collect such observations could reduce such information loss.
4.4 Reporting Genetic Results

Reporting discrete genetic test results in the EHR could pose some unique problems for clinicians. Unlike other lab results such as the International Normalized Ratio (INR), which can be interpreted directly within the context provided by normal ranges and have traditionally been part of clinicians' training, genetic test results such as those obtained during CYP2C9 testing may require clinical recommendations in addition to the results themselves. This becomes particularly evident when considering the SNP panel in our simulations, where 32 discrete results could be reported as part of a
single assay, compared to 12 that can be reported as part of the allele panel. In each of these cases, the only clinically relevant piece of information is that certain SNPs predispose the patient to slower metabolism of Warfarin, increasing the risk of bleeding complications during anticoagulation therapy (Fig. 1) and thereby necessitating dose adjustments. In the light of constantly improving molecular-genetic diagnostic methods, it is important to retain as much information as possible, so that future re-interpretations of existing results can be performed in the proper context of the sensitivity and specificity of these assays. However, presenting all of these discrete results may not be of much direct value to clinicians at the point-of-care and may even be counter-productive, whereas presenting the same results as a decision-support alert (Fig. 4), together with a suitable recommendation at the point of ordering Warfarin, may be far more desirable.
4.5 Limitations

Although we considered the CYP2C9 gene and its implications for anticoagulation therapy, the CYP allele nomenclature itself is unique in some ways, and genetic variation is frequently described by haplotypes rather than alleles, as is the case with the VKORC1 gene; a haplotype would then subsume its constituent SNPs in a manner similar to the alleles described in this work. Nevertheless, with regard to differences in the complexity of CDSS rule design and to scenarios involving loss of information at different levels of data abstraction, these findings can still be generalized. Scenarios involving multiple genes, alleles and haplotypes were not considered in the present study, and these could further add to the complexity of the CDSS rules considered in the evaluation of the two models.
5. Conclusions

The present work represents one of the first efforts to explore the real-world application of genetic data in EHRs using decision support, and the issues we have considered represent a few among the myriad of
questions that will arise from the increased use of genetic data in clinical care in the future. We evaluated two data models for CDSS rules in the EHR on the basis of their performance, complexity, loss of information and reporting within the EHR. Although there was a significant difference between the computational times needed to evaluate rules based on the allele model and the SNP model, this difference, being on the order of milliseconds, did not translate into a significant difference in the time taken to place a Warfarin order. CDSS rules based on the SNP data model are inherently complex and will be difficult to maintain as new knowledge is continuously added in this domain. Although the allele model allowed for simpler clinical decision support and clinical reporting, maintaining an allele nomenclature locally can be a challenge over time, and the issue is further complicated by incorrect assignment of allele concepts during system implementations. At present, due to the lack of both a pharmacogenetic knowledge base containing rules and recommendations in an executable, machine-readable format and a consortium of experts to maintain such a resource, it may be necessary to hard-code many decision-support rules involving pharmacogenetic data, thus necessitating the abstraction of genetic test results for use in EHRs; the appropriate level of data abstraction will ultimately have to be decided on a per-gene basis.
Acknowledgments This research was supported by an education and travel award from the Cerner Corporation to the University of Utah Department of Biomedical Informatics, and by the University of Utah Health Sciences Information Technology Services. The authors are also grateful to Lisa Prins from the University of Utah Hospital, and to Nick Smith, Scott Haven and Ginger Kuhns from Cerner Corporation.
References 1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004; 431 (7011): 931–945. 2. GeneTests.org at the University of Washington. http://www.genetests.org/. 1–15–2007.
3. Personalised medicines: hopes and realities. http://www.royalsoc.ac.uk/displaypagedoc.asp?id =15874. 9–1–2005. The Royal Society (UK). 4. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature 2003; 422 (6934): 835–847. 5. Mitchell DR, Mitchell JA. Status of clinical gene sequencing data reporting and associated risks for information loss. J Biomed Inform 2007; 40 (1): 47–54. 6. Health Level 7. http://www.hl7.org. 3–2–2007. 7. Shabo A. Introduction to the Clinical Genomics Specifications and Documentation of the Genotype Topic. HL7 Clinical Genomics SIG DSTU Update 2 (Genotype topic update 1)[V0.9]. 11–5–2006. HL7.org. 8. Shabo A, Dotan D. The seventh layer of the clinicalgenomics information infrastructure. IBM Systems Journal 2007; 46 (1): 57–67. International Business Machines Corporation. 9. Bioinformatic Sequence Markup Language. http://www.bsml.org. 3–2–2007. 10. SNOMED CT. http://www.snomed.org/snomedct/. 3–2–2007. 11. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clin Chem 2003; 49 (4): 624–633. 12. Hoffman M, Arnoldi C, Chuang I. The clinical bioinformatics ontology: a curated semantic network utilizing RefSeq information. Pac Symp Biocomput 2005; 139–150. 13. Clinical Bioinformatics Ontology Whitepaper. https://www.clinbioinformatics.org/cbopublic/. 2004. Cerner Corporation. 14. Furuya H, Fernandez-Salguero P, Gregory W, et al. Genetic polymorphism of CYP2C9 and its effect on warfarin maintenance dose requirement in patients undergoing anticoagulation therapy. Pharmacogenetics 1995; 5 (6): 389–392. 15. Rettie AE, Wienkers LC, Gonzalez FJ, Trager WF, Korzekwa KR. Impaired (S)-warfarin metabolism catalysed by the R144C allelic variant of CYP2C9. Pharmacogenetics 1994; 4 (1): 39–42. 16. Takahashi H, Echizen H. Pharmacogenetics of CYP2C9 and interindividual variability in anticoagulant response to warfarin. Pharmacogenomics J 2003; 3 (4): 202–214. 17. Stubbins MJ, Harries LW, Smith G, Tarbit MH, Wolf CR. Genetic analysis of the human cytochrome P450 CYP2C9 locus. Pharmacogenetics 1996; 6 (5): 429–439. 18. Oscarson M, Ingelman-Sundberg M. CYP alleles: a web page for nomenclature of human cytochrome P450 alleles. Drug Metab Pharmacokinet 2002; 17 (6): 491–495. 19. Bell RG, Sadowski JA, Matschiner JT. Mechanism of action of warfarin. Warfarin and metabolism of vitamin K 1. Biochemistry 1972; 11 (10): 1959–1961. 20. Kaminsky LS, Zhang ZY. Human P450 metabolism of warfarin. Pharmacol Ther 1997; 73 (1): 67–74. 21. Horton JD, Bushwick BM. Warfarin therapy: evolving strategies in anticoagulation. Am Fam Physician 1999; 59 (3): 635–646. 22. Rost S, Fregin A, Ivaskevicius V, et al. Mutations in VKORC1 cause warfarin resistance and multiple coagulation factor deficiency type 2. Nature 2004; 427 (6974): 537–541. 23. United States Food and Drug Administration. FDA Approves Updated Warfarin (Coumadin) Prescribing Information. 8–16–2007.
24. Sconce EA, Khan TI, Wynne HA, et al. The impact of CYP2C9 and VKORC1 genetic polymorphism and patient characteristics upon warfarin dose requirements: proposal for a new dosing regimen. Blood 2005. 25. Ackoff RL. From Data to Wisdom. Journal of Applied Systems Analysis 1989; 16: 3–9. 26. Aithal GP, Day CP, Kesteven PJ, Daly AK. Association of polymorphisms in the cytochrome P450 CYP2C9 with warfarin dose requirement and risk of bleeding complications. Lancet 1999; 353 (9154): 717–719.
27. Guttmacher AE, Collins FS. Welcome to the genomic era. N Engl J Med 2003; 349 (10): 996–998. 28. Mitchell JA, McCray AT, Bodenreider O. From phenotype to genotype: issues in navigating the available information resources. Methods Inf Med 2003; 42 (5): 557–563. 29. Martin-Sanchez F, Maojo V, Lopez-Campos G. Integrating genomics into health information systems. Methods Inf Med 2002; 41 (1): 25–30. 30. Mitchell JA. The impact of genomics on E-health. Stud Health Technol Inform 2004; 106: 63–74.
31. Navigenics. http://www.navigenics.com/. 9–3– 2008. 32. 23andMe. https://www.23andme.com/. 9–3–2008. 33. Klein TE, Altman RB. PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base. Pharmacogenomics J 2004; 4: 1. 34. Greenman C, Stephens P, Smith R, et al. Patterns of somatic mutation in human cancer genomes. Nature 2007; 446 (7132): 153–158.
Original Articles
Prediction of Postpartum Depression Using Multilayer Perceptrons and Pruning
S. Tortajada1; J. M. García-Gómez1; J. Vicente1; J. Sanjuán2; R. de Frutos2; R. Martín-Santos3; L. García-Esteve3; I. Gornemann4; A. Gutiérrez-Zotes5; F. Canellas6; Á. Carracedo7; M. Gratacos8; R. Guillamat9; E. Baca-García10; M. Robles1
1IBIME, Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universidad Politécnica de Valencia, Valencia, Spain; 2Faculty of Medicine, Universidad de Valencia, Valencia CIBERSAM, Spain; 3IMIM-Hospital del Mar and ICN-Hospital Clínic, Barcelona CIBERSAM, Spain; 4Hospital Carlos Haya, Málaga, Spain; 5Hospital Pere Mata, Reus, Spain; 6Hospital Son Dureta, Palma de Mallorca, Spain; 7National Genotyping Center, Hospital Clínico, Santiago de Compostela, Spain; 8Center for Genomic Regulation, CRG, Barcelona, Spain; 9Hospital Parc Tauli, Sabadell, Spain; 10Hospital Jiménez Díaz, Madrid CIBERSAM, Spain
Keywords Multilayer perceptron, neural network, pruning, postpartum depression
Summary
Objective: The main goal of this paper is to obtain a classification model based on feed-forward multilayer perceptrons in order to improve postpartum depression prediction during the 32 weeks after childbirth with a high sensitivity and specificity, and to develop a tool to be integrated in a decision support system for clinicians.
Materials and Methods: Multilayer perceptrons were trained on data from 1397 women who had just given birth, from seven Spanish general hospitals, including clinical, environmental and genetic variables. A prospective cohort study was made just after delivery, at 8 weeks and at 32 weeks after delivery. The models were evaluated with the geometric mean of accuracies using a hold-out strategy.
Results: Multilayer perceptrons showed good performance (high sensitivity and specificity) as predictive models for postpartum depression.
Conclusions: The use of these models in a decision support system can be clinically evaluated in future work. The analysis of the models by pruning leads to a qualitative interpretation of the influence of each variable in the interest of clinical protocols.

Methods Inf Med 2009; 48: 291–298
doi: 10.3414/ME0562
received: May 15, 2008
accepted: December 8, 2008
prepublished: March 31, 2009

Correspondence to:
Salvador Tortajada
IBIME-Itaca
Universidad Politécnica de Valencia
Camino de Vera s/n
CP 46022 Valencia
Spain
E-mail: [email protected]

1. Introduction

Postpartum depression (PPD) seems to be a universal condition with equivalent prevalence (around 13%) in different countries [1, 2], which implies an increase in medical care costs. Women suffering from PPD experience a considerable deterioration of cognitive and emotional functions that can affect mother-infant attachment. This may have an impact on the child's future development until primary school [3]. The identification of women at risk of developing PPD would be of significant use to clinical practice and would enable preventative interventions to be targeted at vulnerable women.

Multiple studies have been carried out on PPD, and several psychosocial and biological risk factors have been suggested concerning its etiology. For instance, social support, partner relationships and stressful life events related to pregnancy and childbirth [4], as well as neuroticism [5], have all been pointed out as important. With respect to biological factors, it has been shown that inducing an artificial decrease in estrogen can cause depressive symptoms in patients with PPD antecedents. Cortisol alteration, thyroid hormone changes and a low rate of prolactin are also relevant factors [6]. In a comparative study with twin samples, Treloar et al. [7] conclude that genetic factors would explain 40% of the variance in PPD predisposition. In Ross et al. [8], a biopsychosocial model for anxiety and depression symptoms during pregnancy and the postpartum period was developed using structural equations. However, most research studies involving genetic factors are separate from those involving environmental factors. A remarkable exception is the finding that a functional polymorphism in the promoter region of the serotonin transporter gene seems to moderate the influence of stressful life events on depression [9].

An early prediction of PPD may reduce the impact of the illness on the mother, and it can help clinicians to give appropriate treatment to the patient in order to prevent depression. The need for a prediction model rather than a descriptive model is of paramount importance. Artificial neural networks (ANN) have a remarkable ability to characterize discriminating patterns and
derive meaning from complex and noisy data sets. They have been widely applied in general medicine for differential diagnosis, classification and prediction of disease, and condition prognosis. In the field of psychiatric disorders, few studies have used ANNs despite their predictive power. For instance, ANNs have been applied to the diagnosis of dementia using clinical data [10] and, more recently, to predicting Alzheimer's disease using mixed-effects neural networks [11]. EEG data from patients with schizophrenia, obsessive-compulsive disorder and controls have been used to demonstrate that an ANN was able to correctly classify over 80% of the patients with obsessive-compulsive disorder and over 60% of the patients with schizophrenia [12]. In Jefferson et al. [13], evolving neural networks outperformed statistical methods in predicting depression after mania. Berdia and Metz [14] used an ANN to provide a framework for understanding some of the pathological processes in schizophrenia. Finally, Franchini et al. [15] applied these models to support clinical decision making in psychopharmacological therapy.

One of the main goals of this paper is to obtain a classification model based on feed-forward multilayer perceptrons in order to predict PPD with high, well-balanced sensitivity and specificity during the 32 weeks after childbirth, using pruning methods to obtain simple models. This study is part of a large research project on the gene-environment interaction in postpartum depression [16]. These models can later be used in a decision support system [17] to help clinicians in the prediction and treatment of PPD. A secondary goal is to find and interpret the qualitative contribution of each independent variable in order to obtain clinical knowledge from the pruned models.
2. Materials and Methods

Data from postpartum women were collected from seven Spanish general hospitals between December 2003 and October 2004, on the second to third day after delivery. All the participants were Caucasian, none of them were under psychiatric treatment during pregnancy, and all of them were able to read and answer the clinical questionnaires. Women whose children died after delivery were excluded. This study was approved by the Local Ethical Research Committees, and all the patients gave their written informed consent.

Depressive symptoms were assessed with the total score of the Spanish version of the Edinburgh Postnatal Depression Scale (EPDS) [18] just after delivery, at week 8 and at week 32 after delivery. Major depression episodes were established using first the EPDS (cut-off point of 9 or more) at 8 or 32 weeks; probable cases (EPDS ≥ 9) were then evaluated using the Spanish version of the Diagnostic Interview for Genetic Studies (DIGS) [19, 20], adapted to postpartum depression, in order to determine whether the patient was suffering a depression episode (positive class) or not (negative class). All the interviews were conducted by clinical psychologists with previous common training in the DIGS with video recordings. A high level of reliability (K > 0.8) was obtained among interviewers.

From the 1880 women initially included in the study, 76 were excluded because they did not correctly fill out all the scales or questionnaires. With these patients, a prospective study was made just after delivery, at 8 weeks and at 32 weeks after delivery. At the 8-week follow-up, 1407 (78%) women remained in the study; at the 32-week follow-up, 1397 (77.4%) women were evaluated. We compared the cases lost to follow-up with the remainder of the final sample; only the lowest social class was significantly over-represented among the losses to follow-up (p = 0.005). A total of 11.5% (160) of the women assessed at baseline, 8 weeks and 32 weeks had a major depressive episode during the eight months of postpartum follow-up. Hence, from a total number of 1397 patients, we had 160 in the positive class and 1237 in the negative class.

2.1 Independent Variables

Based on the current knowledge about PPD, several variables were taken into account in order to develop predictive models. In a first step, psychiatric and genetic information was used; these predictive models are called subject models. Then, socio-demographic variables were included in the subject-environment models. For each approach, we used
EPDS (just after childbirth) as an input variable in order to measure depressive symptoms. Table 1 shows the clinical variables used in this study.

All participants completed a semi-structured interview that included socio-demographic data: age, education level, marital status, number of children and employment during pregnancy. Personal and family history of psychiatric illness (psychiatric antecedents) and emotional alteration during pregnancy were also recorded; both are binary variables (yes/no).

Neuroticism can be defined as an enduring tendency to experience negative emotional states. It is measured with the Eysenck Personality Questionnaire short scale (EPQ) [21], which is the most widely used personality questionnaire and consists of 12 items. For this study, the validated Spanish version [22] was used. Individuals who score high on neuroticism are more likely than average to experience feelings such as anxiety, anger, guilt and depression.

The life-event variables count the number of stressful life events experienced by the patient just after delivery, in the interval of 0–8 weeks and in the interval of 8–32 weeks, using the St. Paul Ramsey Scale [23, 24]. This is an ordinal variable and depends on the patient's point of view.

Depressive symptoms just after delivery were evaluated with the EPDS, a 10-item self-report scale that has been validated for the Spanish population [18]. The best cut-off of the Spanish validation of the EPDS for postpartum depression was 9. We decided to use its initial value (i.e., at the moment of birth) as an independent variable because the goal is to prevent and predict postpartum depression within 32 weeks.

Social support is measured by means of the Spanish version of the Duke UNC social support scale [25], which originally consists of 11 items. This questionnaire is rated just after delivery, at 6–8 weeks and at week 32. For this work, the variable used was the sum of the scores obtained immediately after childbirth plus the scores obtained at week 8. Since we wanted to predict possible depression risk during the first 32 weeks after childbirth, the Duke score at week 32 was discarded for this experiment.

Genomic DNA was extracted from the peripheral blood of the women. Two functional
polymorphisms of the serotonin transporter gene were analyzed (5-HTTLPR in the promoter region and STin2 within intron 2). For the entire machine learning process, we decided to use the combination genotypes (5-HTT-GC) proposed by Hranilovic in [26]: no low-expressing genotype at either of the loci (HE), a low-expressing genotype at one of the loci (ME), and low-expressing genotypes at both loci (LE).

The medical perinatal risk was measured as seven dichotomous variables: medical problems during pregnancy, use of drugs during pregnancy (including alcohol and tobacco), cesarean section, use of anesthesia during delivery, medical problems of the mother during delivery, medical problems requiring additional admission days in hospital, and newborn medical problems. A two-step cluster analysis was done in order to explore these seven binary variables. This analysis provides an ordinal variable with four values for every woman: no medical perinatal risk, pregnancy problems without delivery problems, pregnancy problems and delivery problems of the mother, and presence of both other and newborn problems.

Other psychosocial and demographic variables were considered in the subject-environment model, such as age, the highest level of education achieved rated on a 3-point scale (low, medium, high), labor situation during pregnancy, household income rated on a 4-point scale (economical level), the gender of the baby, and the number of family members who live with the mother.

Every input variable was normalized to the range [0, 1]. Non-categorical variables were represented by one input unit; missing values were replaced by the mean for continuous variables or by the mode for discrete variables. A dummy representation was used for each categorical variable, i.e., one unit represents one of the possible values of the variable and is activated only when the variable takes that value. Missing values were simply represented by not activating any of the units.
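As an illustration of the encoding just described, the sketch below builds part of an input vector for one categorical variable and one continuous variable. The variable names, ranges and fallback mean are examples only; the study's actual preprocessing pipeline is not published as code.

```python
def encode_categorical(value, categories):
    """Dummy (one-of-K) coding: one unit per category, all zeros if missing."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_continuous(value, lo, hi, fallback_mean):
    """Scale to [0, 1]; replace a missing value by the training-set mean."""
    if value is None:
        value = fallback_mean
    return [(value - lo) / (hi - lo)]

# Example: 5-HTT-GC genotype (categorical) and an EPDS score (0-30 range assumed)
genotype_units = encode_categorical("ME", ["HE", "ME", "LE"])   # -> [0.0, 1.0, 0.0]
missing_units  = encode_categorical(None, ["HE", "ME", "LE"])   # -> [0.0, 0.0, 0.0]
epds_units     = encode_continuous(9, lo=0, hi=30, fallback_mean=6.0)
input_vector   = genotype_units + epds_units
```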
Table 1 There are 160 cases with postpartum depression (PPD) and 1237 cases without it. The second column shows the number of missing values for each independent variable, where '–' indicates no missing value. The last two columns show the number of patients in each class. For categorical variables, the number of patients (percentage) is shown; for non-categorical variables, the mean ± standard deviation is presented.

Input Variable | No. miss. | No PPD | PPD
Psychiatric antecedents | 76 | |
  No | | 790 (90.3%) | 85 (9.7%)
  Yes | | 374 (83.9%) | 72 (16.1%)
Emotional alterations during pregnancy | – | |
  No | | 73 (81.1%) | 17 (18.9%)
  Yes | | 1164 (89.1%) | 143 (10.9%)
Neuroticism (EPQN) | 6 | 3.25 ± 2.73 | 5.68 ± 3.55
Life events after delivery | 2 | 0.99 ± 1.06 | 1.40 ± 1.09
Life events at 8 weeks | 176 | 0.88 ± 1.09 | 1.69 ± 1.33
Life events at 32 weeks | 64 | 0.87 ± 1.07 | 1.95 ± 1.53
Depressive symptoms (Initial EPDS) | – | 5.64 ± 3.97 | 8.96 ± 4.85
Social support (DUKE) | 10 | 88.06 ± 56.27 | 138.63 ± 82.45
5-HTT-GC | 79 | |
  HE, No low-expressing genotype | | 93 (83.8%) | 18 (16.2%)
  ME, Low-expressing genotype at one locus | | 664 (87.5%) | 95 (12.5%)
  LE, Low-expressing genotype at both loci | | 408 (91.1%) | 40 (8.9%)
Medical Perinatal Risk | – | |
  No problems | | 376 (88.1%) | 51 (11.9%)
  Pregnancy problems | | 426 (86.1%) | 69 (13.2%)
  Mother problems | | 117 (89.3%) | 14 (10.7%)
  Mother and child problems | | 318 (92.4%) | 26 (7.6%)
Age | – | 31.89 ± 4.96 | 36.12 ± 4.42
Educational level | 2 | |
  Low | | 324 (85.5%) | 55 (14.5%)
  Medium | | 518 (88.5%) | 67 (11.5%)
  High | | 393 (91.2%) | 38 (8.8%)
Labor situation during pregnancy | 4 | |
  Employed | | 879 (91.1%) | 86 (8.9%)
  Unemployed | | 136 (86.1%) | 22 (13.9%)
  Student/Housewife | | 103 (85.1%) | 18 (14.9%)
  Leave | | 116 (77.9%) | 33 (22.1%)
Economical level | 17 | |
  Suitable income | | 830 (90.9%) | 83 (9.1%)
  Enough income | | 311 (85.9%) | 51 (14.1%)
  Tight income | | 73 (79.3%) | 19 (20.7%)
  Economical problems | | 7 (53.8%) | 6 (46.2%)
Gender of the baby | 18 | |
  Male | | 599 (89.7%) | 69 (10.3%)
  Female | | 623 (87.6%) | 88 (12.4%)
Number of people living together | 31 | 2.67 ± 0.96 | 2.66 ± 0.77

2.2 ANN Theoretical Model

ANNs are inspired by biological systems in which large numbers of simple units work in parallel to perform tasks that conventional
computers have not been able to tackle successfully. These networks are made of many simple processors (neurons or units) based on Rosenblatt's perceptron [27]. A perceptron gives a linear combination, y, of the values of its D inputs, x_i, plus a bias value,

y = \sum_{i=1}^{D} x_i w_i + \theta   (1)
The output, z = f(y), is calculated by applying an activation function to this input. Generally, the activation function is the identity, the logistic function or the hyperbolic tangent. As these functions are monotonic, the form f(y) still determines a linear discriminant function [28]. A single unit has limited computing ability, but a group of interconnected neurons is highly adaptable and able to learn non-linear functions that can model complex relationships between inputs and outputs. Thus, more general functions can be constructed by considering networks with successive layers of processing units, with connections running only from every unit in one layer to every unit in the next layer. A feed-forward multilayer perceptron consists of an input layer with one unit for every independent variable, one or two hidden layers of perceptrons, and an output layer for the dependent variable (in the case of a regression problem) or the possible classes (in the case of a classification problem). A feed-forward multilayer perceptron is called fully connected when every unit of each layer receives an input from every unit in the preceding layer and sends its output to every unit in the next layer. Since PPD is considered in this work as a binary dependent variable, the activation function of the output unit was the logistic function, while the activation function of the hidden units was the hyperbolic tangent. As a first approach, fully connected feed-forward multilayer perceptrons with one or two hidden layers were used. The backpropagation learning algorithm with momentum was used to train the networks, and the connection weights were updated following the gradient descent rule [29].

Although these models, and ANNs in general, exhibit superior predictive power compared to traditional approaches, they have been labeled "black box" methods because they provide little explanatory insight into the relative influence of the independent variables in the prediction process. This lack of explanatory power is a major obstacle to interpreting the influence of each independent variable on PPD. In order to gain some qualitative knowledge of the causal relationships underlying the depression phenomena, we used several pruning algorithms to obtain simpler and more interpretable models [30, 34].
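A minimal sketch of the forward pass of the network just described (hyperbolic tangent in the hidden layer, logistic output unit) is given below; the weight values and layer sizes are placeholders, not the trained parameters of the study.

```python
import math

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer MLP: tanh hidden units, logistic output unit."""
    hidden = [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
              for row, b_i in zip(W_hidden, b_hidden)]
    y = sum(w_i * h_i for w_i, h_i in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-y))   # probability of the positive class (PPD)

# Placeholder network with 3 inputs and 2 hidden units
W_hidden = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.2]]
b_hidden = [0.0, 0.1]
w_out, b_out = [0.7, -0.6], -0.1
print(forward([0.5, 0.2, 1.0], W_hidden, b_hidden, w_out, b_out))
```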
2.2.1 Pruning Algorithms

Based on the fundamental idea of Wald statistics, pruning algorithms estimate the importance of a parameter (or weight) in the model by how much the training error increases if that parameter is eliminated. The least relevant parameter is then removed, and the procedure continues iteratively until some convergence condition is reached. These algorithms were initially conceived as a way to achieve good generalization for connectionist models, i.e., the ability to infer a correct structure from training examples and to perform well on future samples. A very complex model can lead to poor generalization or overfitting, which happens when it adjusts to specific features of the training data rather than to the general ones [31]. Pruning has also been used for feature selection with neural networks [32, 33], making their operation easier to understand since there is less opportunity for the network to spread functions over many nodes. This is important in this critical application, where knowing how the system works is a major concern. The algorithms used here are based on weight pruning. The strategy consists in deleting parameters with small saliency, i.e., those whose deletion will have the least effect on the training error. The Optimal Brain Damage (OBD) algorithm [30] and its descendant, Optimal Brain Surgeon (OBS) [34], use a second-order approximation to predict the saliency of each weight.

Pruned models were obtained from fully connected feed-forward neural networks with two hidden units, i.e., there was initially a connection between every unit of a layer and every unit of the next layer. In order to select the best pruned architecture, a validation set was used to compare the networks. Once the best model was obtained, the influence of each variable was interpreted in the following way. If an input unit is directly connected to the output unit, a positive weight means that the variable is a risk factor, as it increases the probability of having depression, and a negative weight means that it is a protective factor. If a hidden unit is connected to the output unit with a positive weight, then an input unit connected to this hidden unit with a positive weight represents a risk factor, and one connected with a negative weight represents a protective factor. Conversely, if the weight between the hidden unit and the output unit is negative, then a positive weight on the connection between the input and the hidden unit indicates a protective factor, and a negative weight indicates a risk factor. Table 2 summarizes these influences. This interpretation is justified because the hidden units have the hyperbolic tangent as activation function, which confines their output activations between –1 and 1.

Table 2 Summary of the nature of the variables as being a risk factor or a protective factor, depending on the sign of the weights of the input-hidden connection (I-H) and the hidden-output connection (H-O).

I-H | H-O | Factor
+ | + | Risk
+ | – | Protective
– | + | Protective
– | – | Risk
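The sign rule of Table 2 amounts to looking at the product of the two weights on the path from an input to the output; the short helper below makes that explicit. It is an illustrative reading of the interpretation scheme, not code from the original study.

```python
def factor_type(w_input_hidden: float, w_hidden_output: float) -> str:
    """Classify a variable as risk or protective from the two connection weights.

    A positive product (+,+ or -,-) pushes the PPD probability up -> risk factor;
    a negative product (+,- or -,+) pushes it down -> protective factor,
    exactly as summarized in Table 2.
    """
    return "risk" if w_input_hidden * w_hidden_output > 0 else "protective"

assert factor_type(+0.8, +0.3) == "risk"
assert factor_type(+0.8, -0.3) == "protective"
assert factor_type(-0.8, +0.3) == "protective"
assert factor_type(-0.8, -0.3) == "risk"
```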
2.2.2 Comparison with Logistic Regression

The significant variables obtained by the pruned models were compared with those obtained by logistic regression models. The latter are used when the dependent variable is categorical with two possible values; independent variables may be numerical or categorical. The logistic function can be transformed, using the logit transformation, into a linear model [35]:
g(x) = \sum_{i} \beta_i x_i + \beta_0   (2)
The log-likelihood is used to estimate the regression coefficients (β_i) of the model. The exponentials of the regression coefficients give the odds ratios, which reflect the effect of each input variable as a risk or a protective factor. To assess the significance of an independent variable, we compare the likelihood of the model with and without that variable; this comparison follows a chi-square distribution with one degree of freedom, so the associated p-value can be obtained. In this way we obtain both the statistical significance of each factor and its character as protective or risk. A noteworthy fact is that logistic regression models are limited to linear relationships between the dependent and independent variables, a restriction that neural network models can overcome. Thus, linear relationships between the independent variables and the target should be found by both models, while non-linear interactions will appear only in the connectionist model.
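As a small numerical illustration of how fitted coefficients are read (the coefficient values below are invented, not taken from the study's models): exponentiating a coefficient gives the odds ratio, and its sign determines whether the variable acts as a risk or a protective factor.

```python
import math

def interpret_coefficient(name: str, beta: float) -> str:
    """Odds ratio and risk/protective reading of one logistic-regression coefficient."""
    odds_ratio = math.exp(beta)
    kind = "risk factor" if beta > 0 else "protective factor"
    return f"{name}: beta = {beta:+.2f}, odds ratio = {odds_ratio:.2f} -> {kind}"

# Invented coefficients, for illustration only
print(interpret_coefficient("neuroticism (EPQN)", +0.25))  # OR ~1.28 -> risk factor
print(interpret_coefficient("age", -0.05))                  # OR ~0.95 -> protective factor
```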
2.3 Evaluation Criteria

The evaluation of the models was made using hold-out validation, where the observations were chosen randomly to form the validation and evaluation sets. In order to obtain a good error estimate for the predictive model, the database was split into three different datasets: the training set with 1006 patients (72%), the validation set with 112 patients (8%), and the test set with 279 patients (20%). Each partition followed the prevalence of the original database (see Table 3). The best network architecture and parameters were selected empirically using the validation set and then evaluated with the test set. Overfitting was avoided by using the validation set to stop the learning procedure when the validation mean square error reached its minimum. Section 3 shows that a single hidden layer was enough to obtain a good predictive model.

There is an intrinsic difficulty in the nature of the problem: the dataset is imbalanced [36, 37], in the sense that the positive class is under-represented compared to the negative class. With this prevalence of negative examples (89%), a trivial classifier consisting in assigning the most prevalent
class to a new sample would achieve an accuracy of around 89%, but its sensitivity would be null. The main goal is to obtain a predictive model with good sensitivity and specificity. Both measures depend on the accuracy on positive examples, a+, and the accuracy on negative examples, a–; increasing a+ comes at the cost of decreasing a–. The relation between these quantities can be captured by the ROC (Receiver Operating Characteristic) curve [38]: the larger the area under the ROC curve (AUC), the higher the classification potential of the model. This relation can also be estimated by the geometric mean of the two accuracies, G = √(a+ · a–), which reaches high values only if both accuracies are high and in equilibrium. Thus, if we now use the geometric mean to evaluate our trivial model (which always assigns the class with the maximum a priori probability), we see that G = 0, which means that the model is the worst we can obtain.

Table 3 Number of samples per class in each partition of the original database. The prevalence of the original dataset is preserved in each one: 11% for the positive class (major postpartum depression) and 89% for the negative class (no depression).

Dataset | No depression | Major depression | Total
Training | 891 | 115 | 1006
Validation | 99 | 13 | 112
Evaluation | 247 | 32 | 279
Total | 1237 | 160 | 1397
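A short numeric check of the geometric-mean criterion (the helper below is generic; the example values 0.84 and 0.81 are the sensitivity and specificity reported for the unpruned SUBJ model in Table 4):

```python
import math

def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of the accuracies on positive and negative examples."""
    return math.sqrt(sensitivity * specificity)

print(round(g_mean(0.84, 0.81), 2))   # ~0.82, as reported for the SUBJ model
print(g_mean(0.0, 1.0))               # trivial majority-class classifier -> 0.0
```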
3. Results

Table 4 shows the results of the best connectionist models obtained from the first approach. Two models were trained on different sets of input variables: the subject model (SUBJ) and the subject-environment model (SUBENV). Both models included psychiatric antecedents, emotional alterations, neuroticism, life events, depressive symptoms, genetic factors, social support and medical perinatal risk; the SUBENV model also included social and demographic features such as age, economical and educational level, family members and labor situation. The best model (SUBJ with no pruning) achieved a G of 0.82 and an accuracy of 0.81 (95% CI: 0.76–0.86), with a sensitivity of 0.84 and a specificity of 0.81. In general, SUBENV and non-pruned models tend to behave better than SUBJ and pruned ones, but a χ2 test with Bonferroni correction shows no statistically significant difference; notice also that the accuracy confidence intervals overlap (see Table 4). On the other hand, the use of pruning methods leads to more understandable models at the expense of a small loss of sensitivity.

A logistic regression was performed for the SUBJ and SUBENV sets of variables to compare and confirm the significant influence of the features selected by pruning. It is expected that linear relationships between the independent variables and the target will be found by the logistic regression as well as by the neural network models. In the best pruned SUBJ model, the most relevant features appear as statistically significant (α = 0.05) in the logistic regression model: neuroticism, life events from week 8 to week 32, social support and depressive symptoms are considered risk factors. Moreover, the influence of the 5-HTT-GC combination of low-expressing genotypes, LE, is also significant and appears as a protective factor. The remaining input variables in the logistic regression model (emotional alterations, psychiatric antecedents, pregnancy problems and the 5-HTT-GC combination with no low-expressing genotype, HE) are not significant, but in the pruned model these four variables are seen as risk factors. The difference between the significant factors of the pruned models and of the logistic regression may be explained by non-linear interactions of a higher order, because the independent variables interact with each other, as explained in Section 2.2.

Considering the SUBENV model, most of the relevant features appear as significant input variables in the logistic regression: social support, neuroticism, life events from week 8 to week 32, depressive symptoms, being on leave during pregnancy and a female baby are risk factors in both models. Pregnancy problems for the mother and the baby appear as a protective factor, which is explained by the proportion of mothers with postpartum depression in the observations (see Table 1). On the other hand, age and the number of people that the patient lives with appear as protective factors in both models, but they have no statistical significance in the regression model, whereas psychiatric antecedents is a risk factor without statistical significance. Again, we attribute these differences to the interactions between variables, as explained above.

Table 4 Results for the best models with the subject feature set (SUBJ) and the subject-environment feature set (SUBENV). We show the G-mean, the accuracy of the model with its confidence interval at the 5% significance level, and its sensitivity and specificity. Varying the threshold of the classifier yields a continuous classifier, for which the AUC value is shown. The architecture indicates the number of input units, hidden units and output units. When a network is pruned, some input variables are discarded because their connections towards the hidden units are eliminated; these pruned models are therefore simpler than the original ones and may be more interpretable, although they might lose some sensitivity.

Model | Pruning | Architecture | G | Acc (95% CI) | Sen | Spe | AUC
SUBJ | No | 16–14–1 | 0.82 | 0.81 (0.76, 0.86) | 0.84 | 0.81 | 0.82
SUBENV | No | 31–3–1 | 0.81 | 0.84 (0.80, 0.88) | 0.78 | 0.85 | 0.84
SUBJ | Yes | 9–1–1 | 0.77 | 0.78 (0.73, 0.83) | 0.75 | 0.78 | 0.80
SUBENV | Yes | 13–2–1 | 0.80 | 0.84 (0.80, 0.88) | 0.75 | 0.84 | 0.84
Table 5 Independent variables selected for the SUBJ pruned model and the SUBENV pruned model for PPD. risk: risk factor; protect: protective factor; pruned: pruned variable. The table shows which variables were significant for the pruned models and for the logistic regression. If a variable is pruned in the neural network, it is not considered significant. In the case of logistic regression, a variable is significant if and only if the p-value < 0.05. As expected, every significant variable in logistic regression was also significant in the neural network model.

Variable | SUBJ: Pruned Net | SUBJ: Log Reg (p-value) | SUBENV: Pruned Net | SUBENV: Log Reg (p-value)
Neuroticism (EPQN) | risk | risk (= 0.004) | risk | risk (= 0.004)
Social support (DUKE) | risk | risk (< 0.001) | risk | risk (< 0.001)
Depressive symptoms (Initial EPDS) | risk | risk (< 0.001) | risk | risk (< 0.018)
5-HTT-GC, HE | risk | not significant | pruned | not significant
5-HTT-GC, LE | protect | protect (= 0.041) | pruned | not significant
Emotional alteration | risk | not significant | pruned | not significant
Psychiatric antecedents | risk | not significant | risk | not significant
Pregnancy problems | risk | not significant | pruned | not significant
Life events at 32 weeks | risk | risk (< 0.001) | risk | risk (< 0.001)
Life events at 8 weeks | risk | not significant | risk | not significant
Gender girl | – | – | risk | risk (< 0.007)
Labor leave | – | – | risk | risk (< 0.008)
Labor active | – | – | protect | protect (< 0.026)
Mother-child problems | – | – | protect | protect (= 0.008)
Age | – | – | protect | not significant
No. of people living together | – | – | protect | not significant
In Table 5, the SUBJ model shows that neuroticism, social support, life events and depressive symptoms are the most outstanding features and that they are risk factors in the prediction of PPD. In the SUBENV model these variables are also the main risk factors, while age and the number of people that the patient lives with are both protective factors, although they have no statistical significance in the regression model.
4. Discussion

The main objective of this study was to fit a feed-forward ANN classification model to predict PPD with high sensitivity and specificity during the first 32 weeks after delivery. The predictive model showing the best G was selected, ensuring balanced sensitivity and specificity, as Table 4 shows. With this model, we achieved around 81% accuracy. From our results, SUBENV models did not significantly improve on SUBJ models for prediction.

The major concern for medical staff is how PPD is influenced by the variables. These independent variables have different influences on the output of the classification model, and these influences depend on the connections between nodes. While logistic regression models detect only linear relationships between the independent variables and the dependent variable, neural network models can also detect non-linear relationships. Thus, the comparison with logistic regression aims to confirm that the neural network model is not inferring wrong linear influences between the independent variables and the dependent variable. We expect that if a linear relationship is found to be significant in the logistic regression model, it should also be considered by the pruned neural network model, whereas non-linear relationships will only be detected by the neural network model, since logistic regression cannot capture them. If the logistic regression found an independent variable significant but the neural network failed to detect it, this would be evidence of a badly trained model; as Table 5 shows, this situation was not found in this work. In future work, quantitative techniques will be used in order to achieve a numeric measure of the influence of each input feature
and its interactions, following rule extraction methods [39] or numeric methods [40] for ANNs. These prediction models would therefore give clinicians a tool to gain knowledge about PPD.

A classification model with such good performance, i.e., high accuracy, sensitivity and specificity, may be very useful in clinical settings. In fact, the ability of neural networks to tolerate missing information could be relevant when some of the variables are missing, giving high reliability in the clinical field. Since no comparison was established with other machine-learning techniques, it could be interesting to try Bayesian network models, as they can also deal with missing information, find probabilistic dependencies and show good performance [41]. Our models, however, provided better results than the work done by Camdeviren et al. [42] on the Turkish population. Although the number of patients was comparable, our study included more independent variables than Camdeviren's study, where a logistic regression model and a classification tree were compared to predict PPD. Based on logistic regression, they reached an accuracy of 65.4% with a sensitivity of 16% and a specificity of 95%, which means a G of 0.39. With the optimal decision tree, they obtained an accuracy of 71%, a sensitivity of 22% and a specificity of 94%, which gives a G of 0.45. As they explained, there is also a maximal tree that is very complex and overfitted, so its generalization is very limited.

In the best model achieved, neuroticism, life events, social support and depressive symptoms just after delivery were the most important risk factors for PPD. Therefore, women with high levels of neuroticism, depressive symptoms during pregnancy and high HTT genotype are the most likely to suffer from PPD. In this subgroup, a careful postpartum follow-up should be considered in order to improve social support and help with coping with life events [43]. In the long term, the final goal is to improve the clinical management of patients with possible PPD. In this sense, ANN models have been shown to be valuable tools for providing decision support, thus reducing the workload on clinicians. The practical solution for integrating pattern recognition developments into routine clinical workflows is the design of clinical decision support systems (CDSSs)
that also take into account clinical guidelines and user preferences [44]. There are relatively few published clinical trials, and they need more rigorous evaluation methodologies, but the general conclusion is that CDSSs can improve practitioners' performance [45, 46].
In conclusion, four models for predicting PPD have been developed using multilayer perceptrons. These models can predict PPD during the first 32 weeks after delivery with high accuracy. The use of G as a measure for selecting and evaluating the models yields a high, well-balanced sensitivity and specificity. Moreover, pruning methods can lead to simpler models, which are easier to analyze when interpreting the influence of each input variable on PPD. Finally, the resulting models should be incorporated, integrated and clinically evaluated in a CDSS [17] to give this knowledge to clinicians and improve the prevention and early detection of PPD.
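Read together with the conclusions above, the G values cited for the comparison with Camdeviren et al. [42] can be reproduced if G is taken to be the geometric mean of sensitivity and specificity; this definition is an assumption inferred from the reported figures. A minimal Python sketch:

    from math import sqrt

    def g_measure(sensitivity: float, specificity: float) -> float:
        """Geometric mean of sensitivity and specificity (assumed definition of G)."""
        return sqrt(sensitivity * specificity)

    # Figures reported for Camdeviren et al. [42]
    print(round(g_measure(0.16, 0.95), 2))  # logistic regression -> 0.39
    print(round(g_measure(0.22, 0.94), 2))  # optimal decision tree -> 0.45

Both printed values match the G of 0.39 and 0.45 quoted above.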
Acknowledgments
This work was partially funded by the Spanish Ministerio de Sanidad (PIO41635, Vulnerabilidad genético-ambiental a la depresión posparto, 2006–2008) and the Instituto de Salud Carlos III (RETICS Combiomed, RD07/0067/2001). The authors acknowledge the Programa Torres Quevedo from the Ministerio de Educación y Ciencia, co-funded by the European Social Fund (PTQ05–02–03386 and PTQ-08–01–06802).
References 1. Oates MR, Cox JL, Neema S, Asten P, Glangeaud-Freudenthal N, Figueiredo B, et al. Postnatal depression across countries and cultures: a qualitative study. British Journal of Psychiatry 2004; 46 (Suppl): s10–s16. 2. O'Hara MW, Swain AM. Rates and risk of postnatal depression – a meta-analysis. International Review of Psychiatry 1996; 8: 37–54. 3. Cooper PJ, Murray L. Prediction, detection and treatment of postnatal depression. Archives of Disease in Childhood 1997; 77: 97–99. 4. Beck CT. Predictors of postpartum depression: an update. Nursing Research 2001; 50: 275–285. 5. Kendler KS, Kuhn J, Prescott CA. The interrelationship of neuroticism, sex and stressful life events in the prediction of episodes of major depression. American Journal of Psychiatry 2004; 161: 631–636. 6. Bloch M, Daly RC, Rubinow DR. Endocrine factors in the etiology of postpartum depression. Comprehensive Psychiatry 2003; 44: 234–246.
7. Treloar SA, Martin NG, Bucholz KK, Madden PAF, Heath AC. Genetic influences on post-natal depressive symptoms: findings from an Australian twin sample. Psychological Medicine 1999; 29: 645–654. 8. Ross LE, Gilbert EM, Evans SE, Romach MK. Mood changes during pregnancy and the postpartum period: development of a biopsychosocial model. Acta Psychiatrica Scandinavica 2004; 109: 457–466. 9. Caspi A, Sugden K, Moffitt TE, Taylor A, Craig IW, Harrington H, et al. Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science 2003; 301: 386–389. 10. Mulsant BH, Servan-Schreiber E. A connectionist approach to the diagnosis of dementia. In: Proc. 12th Annual Symposium on Computer Applications in Medical Care; 1988. pp 245–249. 11. Tandon R, Adak S, Kaye JA. Neural networks for longitudinal studies in Alzheimer's disease. Artificial Intelligence in Medicine 2006; 36: 245–255. 12. Zhu J, Hazarika N, Chung-Tsoi A, Sergejew A. Classification of EEG signals using wavelet coefficients and an ANN. In: Pan Pacific Conference on Brain Electric Topography. Sydney, Australia; 1994. p 27. 13. Jefferson MF, Pendleton N, Lucas CP, Lucas SB, Horan MA. Evolution of artificial neural network architecture: prediction of depression after mania. Methods Inf Med 1998; 37: 220–225. 14. Berdia S, Metz JT. An artificial neural network simulating performance of normal subjects and schizophrenics on the Wisconsin card sorting test. Artificial Intelligence in Medicine 1998; 13: 123–138. 15. Franchini L, Spagnolo C, Rossini D, Smeraldi E, Bellodi L, Politi E. A neural network approach to the outcome definition on first treatment with sertraline in a psychiatric population. Artificial Intelligence in Medicine 2001; 23: 239–248. 16. Sanjuán J, Martín-Santos R, García-Esteve L, Carot JM, Guillamat R, Gutiérrez-Zotes A, et al. Mood changes after delivery: role of the serotonin transporter gene. British Journal of Psychiatry 2008; 193: 383–388. 17. Vicente J, García-Gómez JM, Vidal C, Martí-Bonmatí L, del Arco A, Robles M. SOC: A distributed decision support architecture for clinical diagnosis. Biological and Medical Data Analysis; 2004. pp 96–104. 18. García-Esteve L, Ascaso L, Ojuel J, Navarro P. Validation of the Edinburgh Postnatal Depression Scale (EPDS) in Spanish mothers. Journal of Affective Disorders 2003; 75: 71–76. 19. Nurnberger JI, Blehar MC, Kaufmann C, York-Cooler C, Simpson S, Harkavy-Friedman J, et al. Diagnostic interview for genetic studies and training. Archives of General Psychiatry 1994; 51: 849–859. 20. Roca M, Martin-Santos R, Saiz J, Obiols J, Serrano MJ, Torrens M, et al. Diagnostic Interview for Genetic Studies (DIGS): Inter-rater and test-retest reliability and validity in a Spanish population. European Psychiatry 2007; 22: 44–48. 21. Eysenck HJ, Eysenck SBG. The Eysenck Personality Inventory. London: University of London Press; 1964. 22. Aluja A, García O, García LF. A psychometric analysis of the revised Eysenck Personality Questionnaire short scale. Personality and Individual Differences 2003; 35: 449–460.
23. Paykel ES. Methodological aspects of life events research. Journal of Psychosomatic Research 1983; 27: 341–352. 24. Zalsman G, Huang YY, Oquendo MA, Burke AK, Hu XZ, Brent DA, et al. Association of a triallelic serotonin transporter gene promoter region (5-HTTLPR) polymorphism with stressful life events and severity of depression. American Journal of Psychiatry 2006; 163: 1588–1593. 25. Bellón JA, Delgado A, Luna JD, Lardelli P. Validity and reliability of the Duke-UNC-11 questionnaire of functional social support. Atención Primaria 1996; 18: 158–163. 26. Hranilovic D, Stefulj J, Schwab S, Borrmann-Hassenbach M, Albus M, Jernej B, et al. Serotonin transporter promoter and intron 2 polymorphisms: relationship between allelic variants and gene expression. Biological Psychiatry 2004; 55: 1090–1094. 27. Rosenblatt F. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 1958; 65 (6): 386–408. 28. Bishop CM. Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press; 1995. 29. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. The MIT Press 1986. pp 318–362. 30. Le Cun Y, Denker JS, Solla A. Optimal brain damage. Advances in Neural Information Processing Systems 1990; 2: 598–605.
31. Duda RO, Hart PE, Stork DG. Pattern Classification. New York, NY: Wiley-Interscience; 2001. 32. Mao J, Jain AK. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks 1995; 6 (2): 296–317. 33. Leray P, Gallinari P. Feature selection with neural networks. Behaviormetrika 1999; 26: 145–166. 34. Hassibi B, Stork DG, Wolf G. Optimal brain surgeon and general network pruning. In: Proceedings of the 1993 IEEE International Conference on Neural Networks. San Francisco, CA; 1993. pp 293–300. 35. Hosmer DW, Lemeshow S. Applied logistic regression. Wiley-Interscience; 2000. 36. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. In: Proc. 14th International Conference on Machine Learning. Morgan Kaufmann; 1997. pp 179–186. 37. Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intelligent Data Analysis Journal 2002; 6 (5): 429–449. 38. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters 2006; 27 (8): 861–874. 39. Saad EW, Wunsch DC. Neural network explanation using inversion. Neural Networks 2007; 20 (1): 78–93. 40. Heckerling PS, Gerber BS, Tape TG, Wigton RS. Entering the black box of neural networks. A descriptive study of clinical variables predicting community-acquired pneumonia. Methods Inf Med 2003; 42: 287–296. 41. Sakai S, Kobayashi K, Nakamura J, Toyabe S, Akazawa K. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods Inf Med 2007; 46: 723–726. 42. Camdeviren HA, Yazici AC, Akkus Z, Bugdayci R, Sungur MA. Comparison of logistic regression model and classification tree: an application to postpartum depression data. Expert Systems with Applications 2007; 32: 987–994. 43. Dennis CL. Psychosocial and psychological interventions for prevention of postnatal depression: systematic review. BMJ 2005; 331 (7507): 15. 44. Fieschi M, Dufour JC, Staccini P, Gouvernet J, Bouhaddou O. Medical Decision Support Systems: Old dilemmas and new paradigms? Methods Inf Med 2003; 42: 190–198. 45. Lisboa PJ, Taktak AFG. The use of artificial neural networks in decision support in cancer: a systematic review. Neural Networks 2006; 19 (4): 408–415. 46. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 2005; 330: 765. doi: 10.1136/bmj.38398.500764.8F.
Original Articles
A Simple Modeling-free Method Provides Accurate Estimates of Sensitivity and Specificity of Longitudinal Disease Biomarkers
F. Subtil1, 2, 3, 4; C. Pouteil-Noble2, 3, 5; S. Toussaint2, 3, 5; E. Villar2, 3, 5; M. Rabilloud1, 2, 3, 4
1Hospices Civils de Lyon, Service de Biostatistiques, Lyon, France; 2Université de Lyon, Lyon, France; 3Université Lyon 1, Villeurbanne, France; 4CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Equipe Biostatistique Santé, Pierre-Bénite, France; 5Hospices Civils de Lyon, Service de Néphrologie-Transplantation, Centre Hospitalier Lyon-Sud, Pierre-Bénite, France
Keywords Sensitivity and specificity, prognosis, early diagnosis, longitudinal study, biological markers
Summary
Objective: To assess the time-dependent accuracy of a continuous longitudinal biomarker used as a test for early diagnosis or prognosis. Methods: A method for accuracy assessment is proposed that takes into account the marker measurement time and the delay between marker measurement and outcome. It deals with markers having interval-censored measurements and a detection threshold. The threshold crossing times were assessed by a Bayesian method. A numerical study was conducted to test the procedures, which were later applied to PCR measurements for prediction of cytomegalovirus disease after renal transplantation. Results: The Bayesian method corrected the bias induced by interval-censored measurements on sensitivity estimates, with corrections from 0.07 to 0.3. In the application to cytomegalovirus disease, the Bayesian method estimated the area under the ROC curve to be over 75% during the first 20 days after graft and within five days between marker measurement and disease onset. However, the accuracy decreased quickly as that delay increased and at later times after graft. Conclusions: The proposed Bayesian method is easy to implement for assessing the time-dependent accuracy of a longitudinal biomarker and gives unbiased results under some conditions.

Correspondence to: Fabien Subtil, Hospices Civils de Lyon – Service de Biostatistique, 162 avenue Lacassagne, 69003 Lyon, France. E-mail: [email protected]

Methods Inf Med 2009; 48: 299–305
doi: 10.3414/ME0583
received: July 1, 2008; accepted: December 12, 2008; prepublished: March 31, 2009

1. Introduction
Today, disease diagnosis is based not only on traditional clinical observations but also on laboratory results; for example, fluorescence polarization, a measure of cellular functionality, is used in the diagnosis of breast cancer [1]. Methods have been developed to use those results as diagnostic tests and to compare their accuracies [2–4]. Molecular biology has also contributed to the improvement of early diagnosis or prognosis of diseases. Recent research fields, such as genomics and proteomics, have led to the development of numerous biomarkers for early diagnosis or prognosis [5, 6]. During patient follow-up, it has become frequent to collect repeated measurements of a quantitative biomarker, such as the CA19-9 antigen in screening for recurrence of colorectal cancer [7]. The prognostic value of such longitudinal clinical biomarkers has to be carefully assessed and analyzed [8, 9].
For a clinician, a biomarker is useful if it has a good discriminant accuracy and if its test becomes positive early enough to allow an efficient reaction between marker measurement and the clinical manifestation of the disease. Thus, the progression of a biomarker's accuracy along the delay from marker measurement to disease onset is of major interest. A marker load may also vary with the time elapsed since a patient's inclusion into a study, regardless of the progression toward disease. Consequently, accuracy analyses should take into account both the marker measurement time and the delay between marker measurement and disease onset.
When a marker is measured with the disease present, it is conventional to use a ROC curve to summarize the accuracy of continuous or ordinal tests [9–12]. That curve displays the relationship between sensitivity (true-positive rate) and 1 – specificity (false-positive rate) across all possible threshold values set for the test. The test accuracy is then measured by the area under the ROC curve (AUC). This area, lying between 0 and 1, may be interpreted as the probability that the diagnostic test result in a diseased subject exceeds that result in a non-diseased one (for a complete review of classical diagnostic methods, see Pepe [3] and Zhou et al. [13]). Recently, several methods have been proposed to assess the time-dependent accuracy of a biomarker when the measurements are repeated before disease onset [14–20].
A first approach consists in modeling semi-parametrically the time-dependent sensitivity and specificity or the ROC curve itself [16, 17]; the model's validity may be checked with methods proposed by Cai and Zheng [21]. A second approach models survival conditional on the marker values [18–20]. A third approach models the marker distribution conditional on the disease status [14, 15]. In each of the previous models, effects related to marker measurement time and to the delay between marker measurement and outcome are introduced. In their comprehensive and very instructive review on the subject, Pepe et al. [22] recommended sensitivity be assessed on events that occur exactly t days after marker measurement (incident sensitivity) and not over a delay following the measurement (cumulative sensitivity). Also, they recommended specificity be evaluated in subjects with follow-up long enough to be considered as subjects who will not develop the disease (static specificity). Five out of the six above-mentioned methods [14–18, 20] use this definition of time-dependent accuracy. However, those methods require sophisticated models that are not currently available in standard statistical software packages. Considering those facts, we developed a simple method to assess the time-dependent accuracy of a longitudinal biomarker using a Bayesian approach. In agreement with the recommendations of Pepe et al., that method takes into account interval-censored measurements and, possibly, biomarkers with a detection threshold. The first section of the present article describes the method. Numerical studies were conducted in order to compare the results obtained with and without consideration of the sparse nature of the measurements. The method is also illustrated by an analysis of data stemming from a clinical study where patients were screened by PCR measurements to predict cytomegalovirus (CMV) disease after renal transplantation.
2. Methods
2.1 Time-dependent Accuracy Definition
Heagerty and Zheng [23] have proposed several ways to integrate time into ROC analysis, according to how "cases" and "controls" are defined. As recommended by Pepe et al. [22], the incident sensitivity definition was used here [23]: cases correspond to patients who develop the disease exactly t time units after marker measurement. Thus, for a delay of t time units between marker measurement and outcome, sensitivity is estimated with measurements taken exactly t time units before the outcome. Sensitivity is assessed at different delays t to follow its progression along the delay from marker measurement to outcome. In this article, a positive test is defined as a marker value higher than or equal to a certain threshold (though equal or lower values may be considered elsewhere). If Yi(s) denotes a measurement relative to patient i at time s since his inclusion into the study, and Ti the event onset time, the incident sensitivity for a delay t between marker measurement and outcome and for a threshold c may be formalized as:

Sensitivity(c, t) = P[Yi(s) ≥ c | Ti – s = t]

The progression of sensitivity along t reflects the test's ability to predict the outcome early. Controls are defined as subjects who do not develop the disease τ days after inclusion into the study, τ being a fixed delay long enough to consider as controls patients who will probably never develop the disease. Specificity is estimated using measurements in those patients, which leads to static specificity estimates. A possible progression of specificity after inclusion may be taken into account by estimating specificity using, in the controls, the measurements taken at different periods after inclusion. For each subject of the control group, the highest measurement obtained during the period [sj, sj+1] is kept, sj and sj+1 denoting successive times since inclusion. The definition of specificity may be formalized as:

Specificity(c, τ, sj, sj+1) = P[Yi(s) < c for all s in [sj, sj+1] | Ti > τ]

2.2 Time-dependent Accuracy Estimation
Estimating incident sensitivity requires that a marker measurement be taken exactly t days before the onset of the disease in each subject who developed that disease, which is not the case in most studies. A first method, called the crude method, consists in using, for each case, the last value obtained before Ti – t; this introduces a bias because the delay between marker measurement and Ti – t may vary widely from one patient to another. Because of measurement sparsity, a marker threshold value is often crossed between two dates; this leads to "interval-censored data" [24]. For example, for each couple of measurements, the crude method supposes that the marker value was Yi at time ti and Yj at time tj, whereas Yj was actually reached and crossed during the interval ]ti; tj]. Biomarkers with a detection threshold raise similar issues: all that can be known is that the biomarker crossed the detection threshold between two dates. One way to deal with interval-censored measurements is to estimate the exact threshold crossing times using a Bayesian method with non-informative priors, assuming that, for a given threshold, the crossing times of all patients who crossed it follow a Weibull distribution. The Weibull distribution was chosen because it is commonly used to model times to event, in particular failure times, but other positive distributions can be used if appropriate. The moment at which each observed marker value is crossed by each patient can then be estimated. Unlike the crude method, this Bayesian method uses all the information contained in interval-censored data or in measurements below a detection threshold. Then, in patients who develop the disease, the most recent threshold value crossed at Ti – t is used as the diagnostic test for ROC analysis. In patients who do not develop the disease, the diagnostic test used is the highest threshold value crossed between sj and sj+1, obtained with the Bayesian method.
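To make the estimation idea concrete, here is a minimal Python sketch of a Bayesian treatment of interval-censored crossing times under a Weibull assumption; the authors' actual implementation is the WinBUGS model reproduced in Appendix 2, so the hand-written sampler below, its vague priors, the proposal scale, and all names are illustrative assumptions only.

    import numpy as np

    def weibull_cdf(t, shape, scale):
        """CDF of a Weibull(shape, scale) distribution."""
        return 1.0 - np.exp(-(t / scale) ** shape)

    def log_posterior(log_shape, log_scale, left, right):
        """Vague (flat on the log scale) priors plus the interval-censored likelihood:
        each crossing time is only known to lie in ]left, right]."""
        shape, scale = np.exp(log_shape), np.exp(log_scale)
        p = weibull_cdf(right, shape, scale) - weibull_cdf(left, shape, scale)
        if np.any(p <= 0):
            return -np.inf
        return np.sum(np.log(p))

    def sample_crossing_times(left, right, n_iter=20000, step=0.1, seed=0):
        """Random-walk Metropolis on the Weibull parameters, then imputation of each
        crossing time from the Weibull truncated to its censoring interval."""
        rng = np.random.default_rng(seed)
        cur = np.array([0.0, np.log(np.mean(right))])   # initial (log shape, log scale)
        cur_lp = log_posterior(*cur, left, right)
        draws = []
        for _ in range(n_iter):
            prop = cur + rng.normal(scale=step, size=2)
            prop_lp = log_posterior(*prop, left, right)
            if np.log(rng.uniform()) < prop_lp - cur_lp:
                cur, cur_lp = prop, prop_lp
            draws.append(cur)
        shape, scale = np.exp(np.mean(draws[n_iter // 2:], axis=0))  # posterior means after burn-in
        # Impute crossing times by inverse-CDF sampling within each interval
        u = rng.uniform(weibull_cdf(left, shape, scale), weibull_cdf(right, shape, scale))
        return scale * (-np.log(1.0 - u)) ** (1.0 / shape)

    # Illustrative intervals ]last measurement below threshold, first measurement at/above it]
    left = np.array([3.0, 6.0, 9.0, 3.0, 12.0])
    right = np.array([6.0, 9.0, 12.0, 6.0, 15.0])
    print(sample_crossing_times(left, right))

For a given threshold, each patient contributes only the interval in which that threshold was crossed; the imputed times can then be used to build the diagnostic test values described above.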
3. Simulation Study
3.1 Numerical Studies
Numerical studies were carried out to compare the results obtained with the crude method to those obtained with the Bayesian method. Let us consider 200 subjects who developed a given disease at time Ti, and 100 subjects who did not develop that disease.
Table 1 Estimated mean AUC values and sensitivities for thresholds 1, 2, 3, and 4, with their respective standard errors, obtained with the Bayesian method and the crude method over 100 simulations, for three delays between marker measurement and disease outcome

Delay  Method       AUC            Se 1           Se 2           Se 3           Se 4
2      Theoretical  0.985          0.999          0.971          0.787          0.378
       Bayesian     0.868 (0.037)  0.959 (0.027)  0.842 (0.045)  0.617 (0.056)  0.312 (0.050)
       Crude        0.697 (0.029)  0.885 (0.026)  0.610 (0.029)  0.284 (0.038)  0.085 (0.043)
4      Theoretical  0.791          0.871          0.5            0.129          0.012
       Bayesian     0.616 (0.031)  0.838 (0.025)  0.509 (0.038)  0.175 (0.048)  0.026 (0.029)
       Crude        0.458 (0.045)  0.717 (0.030)  0.299 (0.049)  0.060 (0.055)  0.012 (0.041)
6      Theoretical  0.618          0.664          0.233          0.03           0.001
       Bayesian     0.418 (0.048)  0.682 (0.037)  0.253 (0.055)  0.043 (0.051)  0.006 (0.025)
       Crude        0.342 (0.049)  0.578 (0.041)  0.176 (0.051)  0.027 (0.045)  0.007 (0.032)

True AUCs and sensitivities were computed according to the process used to generate the biomarker values. Se denotes sensitivity.
Marker measurements were considered throughout a follow-up duration that did not exceed 30 days. High marker values were considered indicative of disease onset. The way the data were simulated is described in Appendix 1. The biomarker's predictive ability was assessed by the crude and the Bayesian methods. Sensitivity was estimated at t = 2, 4, and 6 days before the outcome. Specificity was estimated only during the period [0, 10[ days after inclusion because, in controls, there was no trend for change of biomarker values over time. One hundred simulations were performed. The means obtained over the 100 simulations for the areas under the ROC curve and for the sensitivities at four threshold values (1, 2, 3, and 4) were compared to the theoretical time-dependent area under the ROC curve and sensitivity derived from the process used to generate the biomarker values (Table 1).
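For orientation, once a test value has been assigned to every case (the latest threshold crossed t days before the outcome) and to every control (the highest threshold crossed during the period), sensitivity, specificity, and the empirical AUC follow directly. The Python sketch below is a simplified illustration with made-up inputs, not the simulation code used for Table 1.

    import numpy as np

    def sensitivity(case_values, threshold):
        """Incident sensitivity at a threshold: proportion of cases whose
        test value (latest threshold crossed t days before onset) is >= threshold."""
        return np.mean(np.asarray(case_values) >= threshold)

    def specificity(control_values, threshold):
        """Static specificity: proportion of controls whose highest value
        during the period stays below the threshold."""
        return np.mean(np.asarray(control_values) < threshold)

    def empirical_auc(case_values, control_values):
        """AUC as the probability that a case value exceeds a control value
        (ties counted as one half), i.e. the Mann-Whitney statistic."""
        cases = np.asarray(case_values)[:, None]
        controls = np.asarray(control_values)[None, :]
        return np.mean((cases > controls) + 0.5 * (cases == controls))

    # Illustrative test values (thresholds crossed), not study data
    cases = [4, 3, 2, 4, 1, 3]
    controls = [1, 0, 2, 1, 0]
    print(sensitivity(cases, 2), specificity(controls, 2), empirical_auc(cases, controls))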
3.2 Results
Except for the delay of six days and the threshold value 4, the Bayesian method led to higher sensitivities, with differences ranging between 0.02 and 0.33. The standard errors were roughly of the same order of magnitude with the two methods. The comparisons with the theoretical results showed that, except for the delay of six days and the threshold value 4, the crude method clearly underestimated the test sensitivity and that the Bayesian method corrected this underestimation. Besides, except for a delay of two days, the sensitivities obtained with the Bayesian method were close to the theoretical values, with small differences ranging between –0.05 and 0.03.
The precision of threshold crossing time estimates depends partly on the measurement frequency. With measurements taken approximately every three days, there is a lack of information to precisely estimate the latest threshold crossed two days before the event, especially because the biomarker values increase quickly as the onset of disease gets closer in time. This explains the differences between the theoretical and the Bayesian results. A way to increase the precision of the Bayesian estimates is to make more frequent measurements or to increase the number of cases.
Both the Bayesian and the crude method underestimated the specificities at low thresholds (data not shown). This was not due to the estimation of the threshold crossing times but to the fact that specificity was assessed using the highest value reached in each control during a given period. The longer the period, the larger the bias; hence, the choice of the period should be made with great caution.
The AUC values obtained with the Bayesian method were higher than those obtained with the crude method and partly corrected the underestimation of accuracy observed with the latter. The differences between the Bayesian and the theoretical values came from the underestimation of sensitivity at a delay of two days, but also and mainly from the underestimation of specificity. The Bayesian method led to a better estimation of sensitivity, which is the aim of the present article. The underestimation of specificity came from the empirical assessment of specificity and not from the estimated threshold crossing times.
4. Example: CMV Disease Prediction after Renal Transplantation
4.1 Study Description
The study involved 68 patients who had undergone kidney transplantation between January 1, 1999 and December 31, 2003, at the Centre Hospitalier Lyon-Sud (Lyon, France). All were CMV-seropositive before transplantation; 46 received a CMV-positive graft and 22 a CMV-negative one. They were monitored weekly for CMV by quantitative PCR during the first eight weeks after transplantation, semi-monthly until the third month, then monthly until the sixth month. Because the probability of developing CMV disease six months after renal transplantation is low, patients who did not present a CMV disease after a six-month follow-up were considered disease-free. CMV infection was defined as isolation of CMV by early or late viral culture. CMV disease was defined as the presence of the above-defined CMV infection plus either: i) an association of two of the following clinical or biological signs: temperature above 38 °C for at least two days, leukopenia (less than 3.5 G/L), thrombocytopenia (less than 150 G/L), abnormalities of liver enzymes (twice or more the reference levels); ii) isolated leukopenia (less than 3 G/L); or iii) tissue injury (invasive disease).
The PCR method had a detection threshold of 200 copies/mL; 321 measurements out of 494 fell below this threshold. Those left-censored measurements were given the value 0. Forty-three subjects developed a CMV disease, with transplantation-to-disease quartiles of 21, 25, and 31 days, respectively. The quartiles of the number of measurements in those patients were 3, 4, and 5 measurements, respectively. Most patients who developed a CMV disease had an early sharp increase in the viral load (Fig. 1). The viral load of the 25 subjects who did not develop the disease remained generally low; however, six of them had a slight increase starting from the 20th day, followed by a decrease starting about the 30th day, then a return to the initial level. This may strongly influence the diagnostic test specificity. However, during the first 30 days, the variability between measurements in subjects who did not develop the disease remained very low.

Fig. 1 PCR measurements for cases (solid lines) and controls (dotted lines) versus measurement day after transplantation. x and y scales have been truncated.

Specificity was estimated at four periods after transplantation, p1 to p4: [0; 10[, [10; 20[, [20; 30[, and [20; 30[ days, respectively, with measurements in 25 patients. Sensitivity was estimated at t = 0, 5, and 10 days before the outcome, with measurements in 43 patients. Threshold crossing times were estimated using the Bayesian method. The model was fitted using the WinBUGS software package [25]; the corresponding code is given in Appendix 2. ROC curves were then constructed with those sensitivity and specificity estimates. There was a large gap between thresholds 0 and 200 on the ROC curves, although there was no information on in-between thresholds. Therefore, only the partial area above threshold 200 was estimated [26]. The obtained values were transformed into values between 0 and 1, as proposed by McClish [27]. The confidence intervals (CI) for AUC values and the standard errors (SE) for sensitivity and specificity were assessed by bootstrap, based on 1000 samples.

Fig. 2 ROC curves estimated at three delays between marker measurement and disease onset (t = 0, 5, and 10 days) and during four periods after transplantation for specificity: p1 = [0; 10[, p2 = [10; 20[, p3 = [20; 30[, and p4 = [20; 30[ days.

4.2 Results
For a fixed delay between marker measurement and disease onset, the ROC curves corresponding to the first 10 days (p1) and the 10–20 days (p2) after transplantation were very close (Fig. 2). Regarding the two later periods, p3 and p4, the later the period after transplantation, the closer the ROC curve was to the diagonal. For each period during which specificity was estimated, the ROC curves were all the closer to the diagonal as the delay between marker measurement and disease onset increased.
AUC estimates in Table 2 show that the test accuracy was good during the first two periods after graft and at 0- and 5-day delays between test and disease onset (t = 0 and t = 5). The AUC was then over 75%, but it decreased quickly as the period and the delay increased. The decrease of the AUC with later periods was linked to a decrease of specificity in those periods; thus, specificity depended on the period after graft. The decrease of the AUC with the delay between marker measurement and disease onset was linked to a decrease of sensitivity. The discriminant ability was not significantly greater than 0.5 either in the third period p3 with t = 10 or in the fourth period p4 with t = 5 or t = 10 (the value 0.5 lies within the 95% confidence interval). At the specific threshold of 200 copies/mL, the sensitivity was above 80% for t = 0, but lower than 50% at t = 10 (Table 3). This threshold was associated with good specificity during the first two periods p1 and p2, but that specificity decreased quickly, to less than 50% during the fourth period.
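The 95% confidence intervals in Table 2 were obtained by bootstrap over 1000 resamples. A generic percentile-bootstrap sketch in Python is shown below; the resampling unit (cases and controls resampled separately) and the helper estimator are illustrative assumptions, not the authors' code.

    import numpy as np

    def bootstrap_auc_ci(case_values, control_values, auc_estimate, n_boot=1000, seed=0):
        """Percentile bootstrap 95% CI for an AUC statistic.
        auc_estimate(cases, controls) is any function returning an AUC from test values."""
        rng = np.random.default_rng(seed)
        cases = np.asarray(case_values)
        controls = np.asarray(control_values)
        stats = []
        for _ in range(n_boot):
            # Resample cases and controls (the patients) with replacement
            c = rng.choice(cases, size=cases.size, replace=True)
            d = rng.choice(controls, size=controls.size, replace=True)
            stats.append(auc_estimate(c, d))
        return np.percentile(stats, [2.5, 97.5])

    # Example with a Mann-Whitney AUC estimator
    def mann_whitney_auc(cases, controls):
        c = np.asarray(cases)[:, None]
        d = np.asarray(controls)[None, :]
        return np.mean((c > d) + 0.5 * (c == d))

    print(bootstrap_auc_ci([4, 3, 2, 4, 1, 3], [1, 0, 2, 1, 0], mann_whitney_auc))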
5. Discussion
The Bayesian method for estimating the exact threshold-crossing times described in this article allows estimating the incident sensitivity and static specificity of a longitudinal biomarker. The numerical studies showed that the crude method underestimated sensitivity in the case of interval-censored measurements whereas, under some conditions, the Bayesian method corrected that bias. In the application, quantitative PCR seemed reliable for predicting CMV disease within the five days preceding disease onset and within the first 20 days after transplantation. For delays longer than five days, the test sensitivity decreased quickly with the increasing delay between marker measurement and disease onset, and the test specificity decreased quickly after the 20th day following transplantation. To our knowledge, this is the first study on early diagnosis of CMV disease that took into account the progression of accuracy with both the marker measurement time and the delay between marker measurement and the clinical detection of the disease.
Table 2 Partial AUC values (95% confidence interval) estimated at three delays between marker measurement and disease onset and during four periods for specificity

Period after graft (days)   Delay between test and disease onset (days)
                            0                     5                     10
[0; 10[                     0.852 (0.783; 0.907)  0.769 (0.694; 0.833)  0.662 (0.591; 0.721)
[10; 20[                    0.845 (0.780; 0.906)  0.759 (0.684; 0.833)  0.647 (0.574; 0.717)
[20; 30[                    0.757 (0.661; 0.844)  0.669 (0.569; 0.761)  0.550 (0.349; 0.642)
[20; 30[                    0.634 (0.509; 0.759)  0.555 (0.356; 0.678)  0.344 (0.157; 0.555)
This was found crucial and explains the differences that exist in the literature about quantitative PCR accuracy, where the delay or the measurement period changes from one study to another [28–30].
The use of the highest biomarker value from each control during a given period may lead to an underestimation of specificity; this bias is conservative because we can be sure that the true biomarker accuracy is not smaller than the estimated one. There is no consensus in the literature on how to estimate specificity empirically with repeated marker measurements. Our choice was partly motivated by Murtaugh [31], who also kept the highest marker value from each control to estimate specificity. He compared those results to the ones obtained by keeping the average marker value from each control, but the differences were slight. Emir et al. [32, 33], then Slate and Turnbull [15], proposed another way to assess static specificity without modeling it.
Table 3 Estimated sensitivities and specificities (standard error) for quantitative PCR, the threshold being 200 copies/mL

Sensitivity
Delay between test and disease onset (days)
0           0.814 (0.063)
5           0.651 (0.073)
10          0.442 (0.077)

Specificity
Period after transplantation (days)
[0; 10[     0.960 (0.040)
[10; 20[    0.880 (0.066)
[20; 30[    0.680 (0.091)
[20; 30[    0.480 (0.100)
At a specific threshold, the specificity for each control was estimated by the proportion of negative tests; the global specificity was then defined as the average of all individual specificities, possibly weighted by the number of measurements per subject. The possible bias of this method was not analyzed; the underestimation might be smaller than the one stemming from Murtaugh's method; however, both methods should lead to similar results when the estimation periods are short, with few measurements per subject. All those methods could be used after estimation of the threshold-crossing times. A third method would be to model specificity; but then the bias would depend on the validity of the model assumptions. Certainly, there is still much work to be done on the estimation of specificity with repeated measurements over time. One contribution of this article is the assessment of specificity over different periods, which is relevant when specificity varies with time after inclusion.
The exact estimation of the threshold-crossing times relies on the assumption that, for a specific threshold, the crossing times follow a Weibull distribution. This distribution is commonly used to model failure-time data, as in parametric regression for interval-censored data [34–37]. Lindsey [35] compared the results obtained from nine different distributions (including the Weibull, the log-normal, and the gamma distributions) and concluded that, except for heavily interval-censored data, the results may change with the distributional assumptions. However, in the above CMV study, the use of a log-normal distribution led to results, and especially ROC curves, that were almost identical to those obtained with a Weibull distribution.
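The two empirical specificity estimators discussed above, Murtaugh's highest-value-per-control rule [31] and the per-control proportion of negative tests averaged over controls [15, 32, 33], can be contrasted with a small Python sketch; the data layout (one list of period measurements per control) is assumed purely for illustration.

    import numpy as np

    def specificity_highest_value(controls, threshold):
        """Murtaugh-style estimate: a control is a true negative only if the
        highest marker value during the period stays below the threshold."""
        return np.mean([max(values) < threshold for values in controls])

    def specificity_individual_average(controls, threshold, weight_by_n=False):
        """Emir/Slate-Turnbull-style estimate: per-control proportion of negative
        tests, averaged over controls (optionally weighted by measurement count)."""
        props = np.array([np.mean(np.asarray(values) < threshold) for values in controls])
        if weight_by_n:
            n = np.array([len(values) for values in controls], dtype=float)
            return np.sum(props * n) / np.sum(n)
        return np.mean(props)

    # Each inner list holds one control's measurements during the period (illustrative)
    controls = [[0, 0, 150], [0, 210, 0], [0, 0, 0], [180, 0, 250]]
    print(specificity_highest_value(controls, 200))        # 0.5
    print(specificity_individual_average(controls, 200))   # ~0.83

On such toy values the highest-value rule can only give the smaller estimate, which is the conservative direction of bias discussed above.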
Other forms than incident sensitivity and static specificity have been proposed [23]; for example, estimating the cumulative sensitivity using the measurements taken during the t days preceding the outcome and not exactly t days before the outcome. However, cumulative sensitivity estimates depend on the time-to-disease distribution conditional on the marker measurement time and, thus, do not simply reflect biomarker sensitivity. In the concept of dynamic specificity, the controls are the patients who do not develop the disease during the t days following a measurement. However, in our study, patients developed CMV disease rapidly after transplantation. Among the subjects whose viral load increased during the few days before disease onset, some developed the disease very soon after the t days following a measurement; these would therefore be considered as controls, inducing a high estimate of the false-positive rate and, thus, an underestimation of the real specificity. Thus, the incident sensitivity/static specificity definition of accuracy is, in our opinion, the best way to integrate the concept of time into ROC analysis. As stated by Pepe et al. [22], it should be used in most studies.
Compared to previous methods [15–20], the one proposed here is easy to implement using standard statistical software (the code for the Bayesian computations under WinBUGS is given in Appendix 2). Moreover, there is no need to define and select a model for biomarker progression, sensitivity, specificity, the ROC curve, or the survival conditional on biomarker values; hence, the method can be very quickly adapted to other settings. Despite the need for a complex modeling phase, the method proposed by Cai et al. [17] remains appealing, but it requires large datasets because each biomarker value for which sensitivity or specificity is estimated adds a new parameter to the model; biomarker development studies, however, do not always include a high number of patients. Our method does impose a restriction: it requires control follow-ups to be long enough to assume the controls are real controls, i.e., the method does not so far allow for censoring, but it may be improved to deal with censored data using ideas similar to those proposed by Cai et al. [17]. The next step of our research will be to analyze the effect of the delay between measurements on accuracy estimates when that delay depends on the last measurement value. Within the context of longitudinal biomarker modeling, Shardell and Miller [38], then Liu et al. [39], have directly addressed this problem. We hope our simple method will help statisticians undertake complete and precise analyses of longitudinal biomarker accuracy, taking into account the marker measurement time and the delay between marker measurement and outcome. In most studies, this is essential.
Acknowledgments The authors are grateful to Dr J. Iwaz, PhD, scientific advisor, for his helpful comments on the manuscript.
References 1. Blokh D, Zurgil N, Stambler I, Afrimzon E, Shafran Y, Korech E, Sandbank J, Deutsch M. An information-theoretical model for breast cancer detection. Methods Inf Med 2008; 47: 322–327. 2. Benish WA. The use of information graphs to evaluate and compare diagnostic tests. Methods Inf Med 2002; 41: 114–118. 3. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2003. 4. Sakai S, Kobayashi K, Nakamura J, Toyabe S, Akazawa K. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods Inf Med 2007; 46: 723–726. 5. Maojo V, Martin-Sanchez F. Bioinformatics: towards new directions for public health. Methods Inf Med 2004; 43: 208–214. 6. Goebel G, Muller HM, Fiegl H, Widschwendter M. Gene methylation data – a new challenge for bioinformaticians? Methods Inf Med 2005; 44: 516–519. 7. Liska V, Holubec LJ, Treska V, Skalicky T, Sutnar A, Kormunda S, Pesta M, Finek J, Rousarova M, Topolcan O. Dynamics of serum levels of tumour markers and prognosis of recurrence and survival after liver surgery for colorectal liver metastases. Anticancer Res 2007; 27: 2861–2864. 8. Roy HK, Khandekar JD. Biomarkers for the Early Detection of Cancer: An Inflammatory Concept. Arch Intern Med 2007; 167: 1822–1824. 9. Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 2004; 4: 309–314. 10. Hanley JA. Receiver operating characteristics ROC methodology: The state of the art. Crit Rev Diag Imag 1989; 29: 307–335. 11. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39: 561–577. 12. Pepe MS. Receiver operating characteristic methodology. J Am Stat Ass 2000; 95: 308–311. 13. Zhou X-H, McClish DK, Obuchowski NA. Statistical methods in diagnostic medicine. New York: Wiley; 2002.
14. Etzioni R, Pepe M, Longton G, Hu C, Goodman G. Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Med Decis Making 1999; 19: 242–251. 15. Slate EH, Turnbull BW. Statistical models for longitudinal biomarkers of disease onset. Stat Med 2000; 19: 617–637. 16. Zheng Y, Heagerty PJ. Semiparametric estimation of time-dependent ROC curves for longitudinal marker data. Biostatistics 2004; 5: 615–632. 17. Cai T, Pepe MS, Zheng Y, Lumley T, Jenny NS. The sensitivity and specificity of markers for event times. Biostatistics 2006; 7: 182–197. 18. Zheng Y, Heagerty PJ. Prospective accuracy for longitudinal markers. Biometrics 2007; 63: 332–341. 19. Cai T, Cheng S. Robust combination of multiple diagnostic tests for classifying censored event times. Biostatistics 2008; 9: 216–233. 20. Song X, Zhou X-H. A semiparametric approach for the covariate specific ROC curve with survival outcome. Stat Sinica 2008; 18: 947–966. 21. Cai T, Zheng Y. Model checking for ROC regression analysis. Biometrics 2007; 63: 152–163. 22. Pepe MS, Zheng Y, Jin Y, Huang Y, Parikh CR, Levy WC. Evaluating the ROC performance of markers for future events. Lifetime Data Anal 2008; 14: 86–113. 23. Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics 2005; 61: 92–105. 24. Lindsey JC, Ryan LM. Tutorial in biostatistics: methods for interval-censored data. Stat Med 1998; 17: 219–238. 25. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 2000; 10: 325–337. 26. Zhang DD, Zhou X-H, Freeman DH, Freeman JL. A non-parametric method for the comparison of partial areas under ROC curves and its application to large health care data sets. Stat Med 2002; 21: 701–715. 27. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989; 9: 190–195. 28. Naumnik B, Malyszko J, Chyczewski L, Kovalchuk O, Mysliwiec M. Comparison of serology assays and polymerase chain reaction for the monitoring of active cytomegalovirus infection in renal transplant recipients. Transplant Proc 2007; 39: 2748–2750. 29. Mhiri L, Kaabi B, Houimel M, Arrouji Z, Slim A. Comparison of pp65 antigenemia, quantitative PCR and DNA hybrid capture for detection of cytomegalovirus in transplant recipients and AIDS patients. J Virol Methods 2007; 143: 23–28. 30. Madi N, Al-Nakib W, Mustafa AS, Saeed T, Pacsa A, Nampoory MR. Detection and monitoring of cytomegalovirus infection in renal transplant patients by quantitative real-time PCR. Med Princ Pract 2007; 16: 268–273. 31. Murtaugh PA. ROC curves with multiple marker measurements. Biometrics 1995; 51: 1514–1522. 32. Emir B, Wieand S, Su JQ, Cha S. Analysis of repeated markers used to predict progression of cancer. Stat Med 1998; 17: 2563–2578. 33. Emir B, Wieand S, Jung S-H, Ying Z. Comparison of diagnostic markers with repeated measurements: a non-parametric ROC curve approach. Stat Med 2000; 19: 511–523.
34. Odell PM, Anderson KM, D'Agostino RB. Maximum likelihood estimation for interval-censored data using a Weibull-based accelerated failure time model. Biometrics 1992; 48: 951–959. 35. Lindsey JK. A study of interval censoring in parametric regression models. Lifetime Data Anal 1998; 4: 329–354. 36. Collett D. Modelling Survival Data in Medical Research. London: Chapman and Hall; 2003. 37. Sparling YH, Younes N, Lachin JM, Bautista OM. Parametric survival models for interval-censored data with time-dependent covariates. Biostatistics 2006; 7: 599–614. 38. Shardell M, Miller RR. Weighted estimating equations for longitudinal studies with death and non-monotone missing time-dependent covariates and outcomes. Stat Med 2008; 27: 1008–1025. 39. Liu L, Huang X, O'Quigley J. Analysis of Longitudinal Data in the Presence of Informative Observational Times and a Dependent Terminal Event, with Application to Medical Cost Data. Biometrics 2008; 64: 950–958.
Appendix 1
1. Generation of the Simulated Data
1.1 Notation
i = subject index; k = kth marker measurement; sik = time of the kth measurement for the ith subject; Δik = delay between the kth measurement and the diagnosis time for the ith subject.

1.2 Sampling Times (sik)
Patients should have a biomarker measurement every three days for 30 days after inclusion into the study; but, actually, the measurement is often delayed. Generate:
sik = 3k + εik, k = 0, ..., 9
with εik ~ uniform(1, 2.95) if k = 0, and εik ~ uniform(0, 2.95) if k > 0.

1.3 Time of Diagnosis
The time of diagnosis was generated as follows:
Ti ~ uniform(15, 20) with probability 0.4
Ti ~ uniform(20, 30) with probability 0.6

1.4 Biomarker Values
For Controls
Throughout each simulation, controls have their own biomarker value, normally distributed with mean 1 and variance 0.25; for each measurement, an error is added that follows a normal distribution with mean 0 and variance 0.49.
For Cases
In cases, biomarker values are generated as for controls up to eight days before diagnosis; for later measurements, an extra term is added:
exp(2 – (0.5 + δi) Δik)
δi corresponds to each patient's specific biomarker increase with the time between marker measurement and diagnosis. It follows a normal distribution with mean 0 and variance 0.0025. Measurements taken after the time of diagnosis are removed.

2. Calculation of the Theoretical AUC Values
When biomarkers follow normal distributions in the diseased and non-diseased populations (respectively N(μD, σD²) and N(μD–, σD–²)), Pepe et al. [3] showed that the AUC for the ROC curve is given by
AUC = Φ(a / √(1 + b²))
where a = (μD – μD–)/σD, b = σD–/σD, and Φ denotes the standard normal cumulative distribution function. According to the process of generation of the biomarker values, during each period, measurements in control subjects follow a normal distribution with mean 1 and variance 0.25 + 0.49. In cases, for a delay Δ between the marker measurement and the diagnosis time, the biomarker values follow a normal distribution with mean
1 + exp(0.5(4 – Δ))
and variance
exp(4 – Δ) × Var(exp(–δ × Δ))
where δ follows a normal distribution with mean 0 and variance 0.0025. For small delays Δ, the variance may be approximated using the delta method; for our applications, the variance was estimated using 10⁷ random values stemming from a normal distribution with mean 0 and variance 0.0025. Those results allow us to calculate the theoretical AUC for each period and delay between marker measurement and the onset of disease.
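The theoretical AUC values of Table 1 can be reproduced from these formulas. In the Python sketch below, Var(exp(–δΔ)) is approximated by Monte Carlo, and the case variance is assumed to be the control variance (0.25 + 0.49) plus the variance of the extra term; these are reconstruction assumptions, not code from the study.

    import numpy as np
    from scipy.stats import norm

    def theoretical_case_moments(delta, rng, n_draws=10**6):
        """Mean and variance of case biomarker values at delay `delta` (days) before
        diagnosis, following Appendix 1. The case variance is assumed to add the
        control variance (0.25 + 0.49) to the variance of the extra term."""
        d = rng.normal(0.0, np.sqrt(0.0025), size=n_draws)   # patient-specific slopes delta_i
        mean = 1.0 + np.exp(0.5 * (4.0 - delta))
        var = 0.25 + 0.49 + np.exp(4.0 - delta) * np.var(np.exp(-d * delta))
        return mean, var

    def binormal_auc(mu_d, var_d, mu_nd=1.0, var_nd=0.25 + 0.49):
        """AUC = Phi(a / sqrt(1 + b^2)) with a = (mu_D - mu_Dbar)/sigma_D, b = sigma_Dbar/sigma_D."""
        a = (mu_d - mu_nd) / np.sqrt(var_d)
        b = np.sqrt(var_nd) / np.sqrt(var_d)
        return norm.cdf(a / np.sqrt(1.0 + b ** 2))

    rng = np.random.default_rng(0)
    for delta in (2, 4, 6):
        mu, var = theoretical_case_moments(delta, rng)
        auc = binormal_auc(mu, var)
        se3 = 1.0 - norm.cdf((3.0 - mu) / np.sqrt(var))   # theoretical sensitivity at threshold 3
        print(delta, round(auc, 3), round(se3, 3))

Under these assumptions the printed AUCs (about 0.985, 0.791, and 0.618) and the sensitivities at threshold 3 (about 0.787, 0.129, and 0.030) agree with the theoretical rows of Table 1.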
Appendix 2
The WinBUGS code for estimating the exact threshold-crossing times (see the ROC curve analysis paragraph).

model {
  for (i in 1:N)   ## N corresponds to the number of crossings
  {
    crossing_time[i] ~ dweib(r, mue) I(left[i], right[i])
    ## left[i] corresponds to the date of the last PCR measurement whose result was below the threshold
    ## right[i] corresponds to the date of the first PCR measurement whose result was at or above the threshold
  }
  r ~ dgamma(1.0E-3, 1.0E-3)
  mue