Machine Learning Applications in
Software Engineering
SERIES ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Series Editor-in-Chief: S. K. Chang (University of Pittsburgh, USA)
Vol. 1
Knowledge-Based Software Development for Real-Time Distributed Systems Jeffrey J.-P. Tsai and Thomas J. Weigert (Univ. Illinois at Chicago)
Vol. 2
Advances in Software Engineering and Knowledge Engineering edited by Vincenzo Ambriola (Univ. Pisa) and Genoveffa Tortora (Univ. Salerno)
Vol. 3
The Impact of CASE Technology on Software Processes edited by Daniel E. Cooke (Univ. Texas)
Vol. 4
Software Engineering and Knowledge Engineering: Trends for the Next Decade edited by W. D. Hurley (Univ. Pittsburgh)
Vol. 5
Intelligent Image Database Systems edited by S. K. Chang (Univ. Pittsburgh), E. Jungert (Swedish Defence Res. Establishment) and G. Tortora (Univ. Salerno)
Vol. 6
Object-Oriented Software: Design and Maintenance edited by Luiz F. Capretz and Miriam A. M. Capretz (Univ. Aizu, Japan)
Vol. 7
Software Visualisation edited by P. Eades (Univ. Newcastle) and K. Zhang (Macquarie Univ.)
Vol. 8
Image Databases and Multi-Media Search edited by Arnold W. M. Smeulders (Univ. Amsterdam) and Ramesh Jain (Univ. California)
Vol. 9
Advances in Distributed Multimedia Systems edited by S. K. Chang, T. F. Znati (Univ. Pittsburgh) and S. T. Vuong (Univ. British Columbia)
Vol. 10
Hybrid Parallel Execution Model for Logic-Based Specification Languages Jeffrey J.-P. Tsai and Bing Li (Univ. Illinois at Chicago)
Vol. 11
Graph Drawing and Applications for Software and Knowledge Engineers Kozo Sugiyama (Japan Adv. Inst. Science and Technology)
Vol. 12
Lecture Notes on Empirical Software Engineering edited by N. Juristo & A. M. Moreno (Universidad Politécnica de Madrid, Spain)
Vol. 13
Data Structures and Algorithms edited by S. K. Chang (Univ. Pittsburgh, USA)
Vol. 14
Acquisition of Software Engineering Knowledge SWEEP: An Automatic Programming System Based on Genetic Programming and Cultural Algorithms edited by George S. Cowan and Robert G. Reynolds (Wayne State Univ.)
Vol. 15
Image: E-Learning, Understanding, Information Retrieval and Medical Proceedings of the First International Workshop edited by S. Vitulano (Università di Cagliari, Italy)
Vol. 16
Machine Learning Applications in Software Engineering edited by Du Zhang (California State Univ., USA) and Jeffrey J.-P. Tsai (Univ. Illinois at Chicago)
Machine Learning Applications in Software Engineering

Editors
Du Zhang, California State University, USA
Jeffrey J. P. Tsai, University of Illinois at Chicago, USA
World Scientific
New Jersey • London • Singapore • Beijing • Shanghai • Hong Kong • Taipei • Chennai
Published by World Scientific Publishing Co. Pte. Ltd., 5 Toh Tuck Link, Singapore 596224. USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601. UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE.
The author and publisher would like to thank the following publishers of the various journals and books for their assistance and permission to include the selected reprints found in this volume: IEEE Computer Society (Trans. on Software Engineering, Trans. on Reliability); Elsevier Science Publishers (Information and Software Technology); Kluwer Academic Publishers (Annals of Software Engineering, Automated Software Engineering, Machine Learning).
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
MACHINE LEARNING APPLICATIONS IN SOFTWARE ENGINEERING Copyright © 2005 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-094-7
Cover photo: Meiliu Lu
Printed in Singapore by World Scientific Printers (S) Pte Ltd
DEDICATIONS
DZ: To Jocelyn, Bryan, and Mei
JT: To Jean, Ed, and Christina
ACKNOWLEDGMENT
The authors acknowledge the contribution of Meiliu Lu for the cover photo and the support from the National Science Council under Grant NSC 92-2213-E-468-001, R.O.C. We also thank Kim Tan, Tjan Kwang Wei, and other staff at World Scientific for helping with the preparation of the book.
TABLE OF CONTENTS

Chapter 1 Introduction to Machine Learning and Software Engineering
1.1 The Challenge
1.2 Overview of Machine Learning
1.3 Learning Approaches
1.4 SE Tasks for ML Applications
1.5 State-of-the-Practice in ML&SE
1.6 Status
1.7 Applying ML Algorithms to SE Tasks
1.8 Organization of the Book

Chapter 2 ML Applications in Prediction and Estimation
2.1 Bayesian Analysis of Empirical Software Engineering Cost Models (with S. Chulani, B. Boehm and B. Steece), IEEE Transactions on Software Engineering, Vol. 25, No. 4, July 1999, pp. 573-583.
2.2 Machine Learning Approaches to Estimating Software Development Effort (with K. Srinivasan and D. Fisher), IEEE Transactions on Software Engineering, Vol. 21, No. 2, February 1995, pp. 126-137.
2.3 Estimating Software Project Effort Using Analogies (with M. Shepperd and C. Schofield), IEEE Transactions on Software Engineering, Vol. 23, No. 12, November 1997, pp. 736-743.
2.4 A Critique of Software Defect Prediction Models (with N.E. Fenton and M. Neil), IEEE Transactions on Software Engineering, Vol. 25, No. 5, September 1999, pp. 675-689.
2.5 Using Regression Trees to Classify Fault-Prone Software Modules (with T.M. Khoshgoftaar, E.B. Allen and J. Deng), IEEE Transactions on Reliability, Vol. 51, No. 4, 2002, pp. 455-462.
2.6 Can Genetic Programming Improve Software Effort Estimation? A Comparative Evaluation (with C.J. Burgess and M. Lefley), Information and Software Technology, Vol. 43, No. 14, 2001, pp. 863-873.
2.7 Optimal Software Release Scheduling Based on Artificial Neural Networks (with T. Dohi, Y. Nishio and S. Osaki), Annals of Software Engineering, Vol. 8, No. 1, 1999, pp. 167-185.

Chapter 3 ML Applications in Property and Model Discovery
3.1 Identifying Objects in Procedural Programs Using Clustering Neural Networks (with S.K. Abd-El-Hafiz), Automated Software Engineering, Vol. 7, No. 3, 2000, pp. 239-261.
3.2 Bayesian-Learning Based Guidelines to Determine Equivalent Mutants (with A.M.R. Vincenzi, et al.), International Journal of Software Engineering and Knowledge Engineering, Vol. 12, No. 6, 2002, pp. 675-689.

Chapter 4 ML Applications in Transformation
4.1 Using Neural Networks to Modularize Software (with R. Schwanke and S.J. Hanson), Machine Learning, Vol. 15, No. 2, 1994, pp. 137-168.

Chapter 5 ML Applications in Generation and Synthesis
5.1 Generating Software Test Data by Evolution (with C.C. Michael, G. McGraw and M.A. Schatz), IEEE Transactions on Software Engineering, Vol. 27, No. 12, December 2001, pp. 1085-1110.

Chapter 6 ML Applications in Reuse
6.1 On the Reuse of Software: A Case-Based Approach Employing a Repository (with P. Katalagarianos and Y. Vassiliou), Automated Software Engineering, Vol. 2, No. 1, 1995, pp. 55-86.

Chapter 7 ML Applications in Requirement Acquisition
7.1 Inductive Specification Recovery: Understanding Software by Learning From Example Behaviors (with W.W. Cohen), Automated Software Engineering, Vol. 2, No. 2, 1995, pp. 107-129.
7.2 Explanation-Based Scenario Generation for Reactive System Models (with R.J. Hall), Automated Software Engineering, Vol. 7, 2000, pp. 157-177.

Chapter 8 ML Applications in Management of Development Knowledge
8.1 Case-Based Knowledge Management Tools for Software Development (with S. Henninger), Automated Software Engineering, Vol. 4, No. 3, 1997, pp. 319-340.

Chapter 9 Guidelines and Conclusion

References
Chapter 1
Introduction to Machine Learning and Software Engineering

1.1. The Challenge

The challenge of developing and maintaining large software systems in a changing environment has been eloquently spelled out in Brooks' classic paper, No Silver Bullet [20]. The following essential difficulties inherent in developing large software still hold true today:

> Complexity: "Software entities are more complex for their size than perhaps any other human construct." "Many of the classical problems of developing software products derive from this essential complexity and its nonlinear increases with size."
> Conformity: Software must conform to the many different human institutions and systems it comes to interface with.
> Changeability: "The software product is embedded in a cultural matrix of applications, users, laws, and machine vehicles. These all change continually, and their changes inexorably force change upon the software product."
> Invisibility: "The reality of software is not inherently embedded in space." "As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs, superimposed one upon another." [20]

However, in his "No Silver Bullet" Refired paper [21], Brooks uses the following quote from Glass to summarize his view in 1995:

So what, in retrospect, have Parnas and Brooks said to us? That software development is a conceptually tough business. That magic solutions are not just around the corner. That it is time for the practitioner to examine evolutionary improvements rather than to wait-or hope-for revolutionary ones [56].

Many evolutionary or incremental improvements have been made or proposed, each attempting to address certain aspects of the essential difficulties [13, 47, 57, 96, 110]. For instance, to address changeability and conformity, an approach called transformational programming allows software to be developed, modified, and maintained at the specification level, and then automatically transformed into production-quality software through automatic program synthesis [57]. This software development paradigm would enable software engineering to become the discipline of capturing and automating currently undocumented domain and design knowledge [96]. Software engineers would deliver knowledge-based application generators rather than unmodifiable application programs. A system called LaSSIE was developed to address the complexity and invisibility issues [36]. The multi-view modeling framework proposed in [22] could be considered an attempt to address the invisibility issue.

The application of artificial intelligence techniques to software engineering (AI&SE) has produced some encouraging results [11, 94, 96, 108, 112, 122, 138, 139, 145]. Successful AI techniques include knowledge-based approaches, automated reasoning, expert systems, heuristic search strategies, temporal logic, planning, and pattern recognition. AI techniques can play an important role in ultimately overcoming the essential difficulties. As a subfield of AI, machine learning (ML) deals with the issue of how to build computer programs that improve their performance at some task through experience [105]. It is dedicated to creating and compiling verifiable knowledge related to the design and construction of artifacts [116].
ML algorithms have been utilized in many different problem domains. Some typical applications are: data mining problems where large databases contain valuable implicit regularities that can be discovered automatically, poorly understood domains where there is a lack of knowledge needed to develop effective algorithms, and domains where programs must dynamically adapt to changing conditions [105]. Not surprisingly, the field of software engineering turns out to be a fertile ground where many software development and maintenance tasks could be formulated as learning problems and approached in terms of learning algorithms.

The past two decades have witnessed many ML applications in software development and maintenance. ML algorithms offer a viable alternative and complement to the existing approaches to many SE issues. In his keynote speech at the 1992 annual conference of the American Association for Artificial Intelligence, Selfridge advocated the application of ML to SE (ML&SE):

We all know that software is more updating, revising, and modifying than rigid design. Software systems must be built for change; our dream of a perfect, consistent, provably correct set of specifications will always be a nightmare-and impossible too. We must therefore begin to describe change, to write software so that (1) changes are easy to make, (2) their effects are easy to measure and compare, and (3) the local changes contribute to overall improvements in the software. For systems of the future, we need to think in terms of shifting the burden of evolution from the programmers to the systems themselves...[we need to] explore what it might mean to build systems that can take some responsibility for their own evolution [130].

Though many results in ML&SE have been published in the past two decades, efforts to summarize the state-of-the-practice and to discuss issues and guidelines in applying ML to SE have been few and far between [147-149]. A recent paper [100] focuses on applying decision tree based learning methods to SE issues. Another survey is offered from the perspective of data mining techniques being applied to software processes and products [99]. The AI&SE summaries published so far paint with too broad a brush to give an adequate account of ML&SE. There is also a related and emerging area of research under the umbrella of computational intelligence in software engineering (CI&SE) [80, 81, 91, 92, 113]. Research in this area utilizes fuzzy sets, neural networks, genetic algorithms, genetic programming and rough sets (or combinations of those individual technologies) to tackle software development issues. ML&SE and CI&SE share two common grounds: the targeted software development problems, and some common techniques. However, ML offers many additional mature techniques and approaches that can be brought to bear on solving SE problems.

The scope of this book, as depicted in the shaded area in Figure 1, is to attempt to fill this void by studying various issues pertaining to ML&SE (the applications of other AI techniques in SE are beyond the scope of this book). We think this is an important and helpful step if we want to make any headway in ML&SE. In this book, we address various issues in ML&SE by trying to answer the following questions:

> What types of learning methods are available at our disposal?
> What are the characteristics and underpinnings of different learning algorithms?
> How do we determine which learning method is appropriate for what type of software development or maintenance task?
> Which learning methods can be used to make headway in what aspect of the essential difficulties in software development?
> When we attempt to use some learning method to help with an SE task, what are the general guidelines and how can we avoid some pitfalls?
> What is the state-of-the-practice in ML&SE?
> Where is further effort needed to produce fruitful results?
Figure 1. Scope of this book.

1.2. Overview of Machine Learning
The field of ML includes supervised learning, unsupervised learning and reinforcement learning. Supervised learning deals with learning a target function from training examples of its inputs and outputs. Unsupervised learning attempts to learn patterns in the input for which no output values are available. Reinforcement learning is concerned with learning a control policy through reinforcement from an environment. ML algorithms have been utilized in many different problem domains. Some typical applications are: data mining problems where large databases contain valuable implicit regularities that can be discovered automatically, poorly understood domains where there is a lack of knowledge needed to develop effective algorithms, and domains where programs must dynamically adapt to changing conditions [105]. The following list of publications and web sites offers a good starting point for the interested reader to become acquainted with the state-of-the-practice in ML applications [2, 3, 9, 15, 32, 37-39, 87, 99, 100, 103, 105-107, 117-119, 127, 137].

ML is not a panacea for all SE problems. To better use ML methods as tools to solve real world SE problems, we need to have a clear understanding of both the problems and the tools and methodologies utilized. It is imperative that we know (1) the available ML methods at our disposal, (2) the characteristics of those methods, (3) the circumstances under which the methods can be most effectively applied, and (4) their theoretical underpinnings.
Since many SE development or maintenance tasks rely on some function (or functions, mappings, or models) to predict, estimate, classify, diagnose, discover, acquire, understand, generate, or transform certain qualitative or quantitative aspects of a software artifact or a software process, the application of ML to SE boils down to how to find, through the learning process, such a target function (or functions, mappings, or models) that can be utilized to carry out the SE tasks.

Learning involves a number of components: (1) How is the unknown (target or true) function represented and specified? (2) Where can the function be found (the search space)? (3) How can we find the function (heuristics in search, learning algorithms)? (4) Is there any prior knowledge (background knowledge, domain theory) available for the learning process? (5) What properties do the training data have? And (6) What are the theoretical underpinnings and practical issues in the learning process?

1.2.1. Target functions

Depending on the learning methods utilized, a target function can be represented in different hypothesis language formalisms (e.g., decision trees, conjunctions of attribute constraints, bit strings, or rules). When a target function is not explicitly defined, but the learner can generate its values for given input queries (as is the case in instance-based learning), the function is said to be implicitly defined. A learned target function may be easy for the human expert to understand and interpret (e.g., first order rules), or it may be hard or impossible for people to comprehend (e.g., weights for a neural network).

Figure 2. Issues in target functions (representation formalism, explicit vs. implicit definition, interpretability, properties such as predictive accuracy, statistical significance, information content, and the tradeoff between complexity and degree of fit to data, output type, and eager vs. lazy generalization).

Based on its output, a target function can be utilized for SE tasks that fall into the categories of binary classification, multi-value classification and regression. When learning a target function from a given set of training data, its generalization can be either eager (at the learning stage) or lazy (at the classification stage).
Eager learning may produce a single target function from the entire training data, while lazy learning may adopt a different (implicit) function for each query. Evaluating a target function hinges on many considerations: predictive accuracy, interpretability, statistical significance, information content, and the tradeoff between its complexity and degree of fit to the data. Quinlan in [119] states: Learned models should not blindly maximize accuracy on the training data, but should balance resubstitution accuracy against generality, simplicity, interpretability, and search parsimony.

1.2.2. Hypothesis space

Candidates to be considered for a target function belong to a set called the hypothesis space H. Let f be the true function to be learned. Given a set D of examples (training data) of f, inductive learning amounts to finding a function h ∈ H that is consistent with D; h is said to approximate f. How H is specified and what structure H has would ultimately determine the outcome and efficiency of the learning. The learning becomes unrealizable [124] when f ∉ H. Since f is unknown, we may have to resort to background or prior knowledge to generate an H in which f must exist. How prior knowledge is utilized to specify an appropriate H where the learning problem is realizable (f ∈ H) is a very important issue. There is also a tradeoff between the expressiveness of H and the computational complexity of finding a simple and consistent h that approximates f [124]. Through some strong restrictions, the expressiveness of the hypothesis language can be reduced, thus yielding a smaller H. This in turn may lead to a more efficient learning process, but at the risk of being unrealizable.

Figure 3. Issues in hypothesis space H (structures, realizable (f ∈ H) vs. unrealizable (f ∉ H), prior knowledge and domain theory, and the tradeoff between expressiveness and the computational complexity of finding a simple and consistent h).
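To make the notions of hypothesis space, consistency, and realizability concrete, here is a minimal sketch that enumerates a deliberately restricted H of attribute-value conjunctions and keeps the hypotheses consistent with a toy training set. The attributes, values, and examples are hypothetical and not taken from this book.

```python
# Minimal sketch: a restricted hypothesis space H of attribute-value
# conjunctions, searched exhaustively for hypotheses consistent with D.
# Attributes, values, and examples are hypothetical.
from itertools import product

ATTRS = {"size": ["small", "large"], "coupling": ["low", "high"]}
WILDCARD = "?"  # "don't care" constraint

# Toy training set D: (attribute values, label "fault-prone?")
D = [
    ({"size": "large", "coupling": "high"}, True),
    ({"size": "large", "coupling": "low"},  True),
    ({"size": "small", "coupling": "low"},  False),
]

def hypotheses():
    """Enumerate H: every conjunction of per-attribute constraints."""
    names = list(ATTRS)
    for combo in product(*[[WILDCARD] + ATTRS[a] for a in names]):
        yield dict(zip(names, combo))

def covers(h, x):
    """A hypothesis covers x if every non-wildcard constraint matches."""
    return all(v == WILDCARD or x[a] == v for a, v in h.items())

def consistent(h, data):
    """h is consistent with D if it covers exactly the positive examples."""
    return all(covers(h, x) == label for x, label in data)

version_space = [h for h in hypotheses() if consistent(h, D)]
print(version_space)   # e.g. [{'size': 'large', 'coupling': '?'}]
```

With this restricted H, a concept such as "exactly one of size = large or coupling = high" cannot be expressed by any conjunction, so learning it would be unrealizable; this is the expressiveness tradeoff discussed above.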
1.2.3. Search and bias

How can we find a simple and consistent h ∈ H that approximates f? This question essentially boils down to a search problem. Heuristics (inductive or declarative bias) play a pivotal role in the search process. Depending on how examples in D are defined, learning can be supervised or unsupervised. Different learning methods may adopt different types of bias, different search strategies, and different guiding factors in search (e.g., information gain, a distance metric, a fitness measure, or cumulative reward). For an f, its approximation can be obtained either locally, with regard to a subset of examples in D, or globally, with regard to all examples in D. Learning can result in either knowledge augmentation or knowledge (re)compilation. Depending on the interaction between a learner and its environment, there are query learning and reinforcement learning.

There are stable and unstable learning algorithms, depending on their sensitivity to changes in the training data [37]. For unstable algorithms (e.g., decision tree, neural network, or rule-learning algorithms), small changes in the training data will cause the algorithms to generate significantly different output functions. On the other hand, stable algorithms (e.g., linear regression and nearest neighbor) are immune to (do not easily succumb to) small changes in the data [37].

Instead of using just a single learned hypothesis for the classification of unseen cases, an ensemble of hypotheses can be deployed whose individual decisions are combined to accomplish the task of new case classification. There are two major issues in ensemble learning: how to construct ensembles, and how to combine the individual decisions of the hypotheses in an ensemble [37].

Figure 4. Issues in the search for a hypothesis (guiding factors, search bias and style, outcome, supervision, learner-environment interaction, stability, local vs. global approximation, and ensembles).
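As a concrete instance of one guiding factor mentioned above, the following sketch computes the entropy-based information gain that a decision tree learner (Section 1.3.2) uses to rank candidate attribute tests; the module attributes and labels are invented for the example.

```python
# Minimal sketch: entropy and information gain as a search-guiding factor.
# The attributes and label counts are hypothetical.
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(examples, attribute):
    """Reduction in entropy from splitting the examples on one attribute."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: modules described by two attributes, labelled fault-prone or not.
examples = [
    ({"size": "large", "coupling": "high"}, "fault-prone"),
    ({"size": "large", "coupling": "low"},  "fault-prone"),
    ({"size": "small", "coupling": "high"}, "not"),
    ({"size": "small", "coupling": "low"},  "not"),
]
for a in ("size", "coupling"):
    print(a, round(information_gain(examples, a), 3))
# size has gain 1.0 and coupling 0.0, so a greedy learner would test size first.
```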
Another issue during the search process is the need for interaction with an oracle. If a learner needs an oracle to ascertain the validity of the target function generalization, it is interactive; otherwise, the search is non-interactive [89]. The search is flexible if it can start either from scratch or from an initial hypothesis.

1.2.4. Prior knowledge

Prior (or background) knowledge about the problem domain where an f is to be learned plays a key role in many learning methods. Prior knowledge can be represented in different ways. It helps learning by eliminating otherwise consistent h and by filling in the explanation of examples, which results in faster learning from fewer examples. It also helps define different learning techniques based on its logical roles and identify relevant attributes, thus yielding a reduced H and speeding up the learning process [124]. There are two issues here. First, for some problem domains, the prior knowledge may be sketchy, inaccurate or not available. Second, not all learning approaches are able to accommodate such prior knowledge or domain theories. A common drawback of some general learning algorithms such as decision trees or neural networks is that it is difficult to incorporate prior knowledge from problem domains into the learning algorithms [37]. A major motivation and advantage of stochastic learning (e.g., naive Bayesian learning) and inductive logic programming is their ability to utilize background knowledge from problem domains in the learning algorithm. For those learning methods for which prior knowledge or a domain theory is indispensable, one issue to keep in mind is that the quality of the knowledge (correctness, completeness) will have a direct impact on the outcome of the learning.

Figure 5. Issues in prior knowledge (representation, properties, quality, roles, and accommodation).
1.2.5. Training data

Training data gathered for the learning process can vary in terms of (1) the number of examples, (2) the number of features (attributes), and (3) the number of output classes. Data can be noisy or accurate in terms of random errors, can be redundant, can be of different types, and can have different valuations. The quality and quantity of training data have a direct impact on the learning process, as different learning methods have different criteria regarding training data, with some methods requiring a large amount of data, others being very sensitive to the quality of data, and still others needing both training data and a domain theory. Training data may be used just once, or multiple times, by the learner.

Scaling up is another issue. Real world problems can have millions of training cases, thousands of features and hundreds of classes [37]. Not all learning algorithms are known to be able to scale up well with large problems along those three dimensions. When a target function is not easy to learn from the data in the input space, a need arises to transform the data into a possibly high-dimensional feature space F and learn the target function in F. Feature selection in F becomes an important issue, as both the computational cost and the generalization performance of the target function can degrade as the number of features grows [32].

Finally, based on the way in which training data are generated and provided to a learner, there are batch learning (all data are available at the outset of learning) and on-line learning (data are available to the learner one example at a time).

Figure 6. Issues in training data (properties, feature selection, valuation, frequency of use, availability, and scale-up).
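To make the batch/on-line distinction concrete, the sketch below drives the same perceptron-style update rule in two ways: a batch learner that has all examples at the outset and sweeps them repeatedly, and an on-line learner that sees each example once as it arrives. The features, labels, and learning rate are hypothetical.

```python
# Minimal sketch: the same perceptron update rule driven in batch mode
# (repeated passes over all available data) and in on-line mode (one
# example at a time, seen once). Features and labels are hypothetical.
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def update(w, x, y, rate=0.1):
    """Perceptron rule: adjust weights only on a misclassification."""
    if predict(w, x) != y:
        w = [wi + rate * y * xi for wi, xi in zip(w, x)]
    return w

# Each example: ([bias, metric1, metric2], label in {+1, -1})
data = [([1, 0.9, 0.8], 1), ([1, 0.2, 0.1], -1),
        ([1, 0.8, 0.7], 1), ([1, 0.1, 0.3], -1)]

# Batch learning: all data available at the outset, multiple passes.
w_batch = [0.0, 0.0, 0.0]
for _ in range(10):
    for x, y in data:
        w_batch = update(w_batch, x, y)

# On-line learning: examples arrive one at a time and are used once.
w_online = [0.0, 0.0, 0.0]
for x, y in data:          # imagine this as a stream that is never stored
    w_online = update(w_online, x, y)

print(w_batch, w_online)
```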
1.2.6. Theoretical underpinnings and practical considerations

Different justifications underpin the various learning methods: statistical, probabilistic, or logical. What are the frameworks for analyzing learning algorithms? How can we evaluate the performance of a generated function, and determine whether the learning converges? What types of practical problems do we have to come to grips with? These issues must be addressed if we are to succeed in real world SE applications.
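One standard analysis framework is the PAC (probably approximately correct) model. Following the treatment in [105], for a finite hypothesis space H, the number m of training examples that suffices for any consistent learner to output, with probability at least 1 - δ, a hypothesis whose true error is at most ε is bounded by:

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln |H| + \ln\frac{1}{\delta}\right)
```

For instance, with the nine-hypothesis conjunctive space sketched in Section 1.2.2, ε = 0.1 and δ = 0.05, the bound gives m ≥ 10(ln 9 + ln 20) ≈ 52 examples, illustrating how sample complexity grows with the size of H and the required confidence.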
Figure 7. Theoretical and practical issues (analysis frameworks such as PAC and mistake bounds, sample complexity of H, application types and conditions, evaluation of h, statistical, probabilistic and logical justifications, and practical problems such as overfitting, underfitting, local minima, crowding, and the curse of dimensionality).

1.3. Learning Approaches

There are many different types of learning methods, each having its own characteristics and lending itself to certain learning problems. In this book, we organize the major types of supervised and reinforcement learning methods into the following groups: concept learning (CL), decision tree learning (DT), neural networks (NN), Bayesian learning (BL), reinforcement learning (RL), genetic algorithms (GA) and genetic programming (GP), instance-based learning (IBL, of which case-based reasoning, or CBR, is a popular method), inductive logic programming (ILP), analytical learning (AL, of which explanation-based learning, or EBL, is a method), combined inductive and analytical learning (IAL), ensemble learning (EL) and support vector machines (SVM). The organization of the different learning methods is largely influenced by [105]. In some literature [37, 124], stochastic (statistical) learning is used to refer to learning methods such as BL.
1.3.1. Concept learning

In CL, a target function is represented as a conjunction of constraints on attributes. The hypothesis space H consists of a lattice of possible conjunctions of attribute constraints for a given problem domain. A least-commitment search strategy is adopted to eliminate hypotheses in H that are not consistent with the training set D. This results in a structure called the version space, the subset of hypotheses that are consistent with the training data. The algorithm, called candidate elimination, utilizes generalization and specialization operations to produce the version space with regard to H and D. It relies on a language (or restriction) bias stating that the target function is contained in H. CL is an eager and supervised learning method. It is not robust to noise in the data and does not support the accommodation of prior knowledge.

1.3.2. Decision trees

A target function is defined as a decision tree in DT. Search in DT is often guided by an entropy-based information gain measure that indicates how much information a test on an attribute yields. Learning algorithms in DT often have a bias for small trees. DT is an eager, supervised, and unstable learning method, and is susceptible to noisy data, a cause of overfitting. It cannot accommodate prior knowledge during the learning process. However, it scales up well with large data in several different ways [37]. A popular DT tool is C4.5 [118].

1.3.3. Neural networks

Given a fixed network structure, learning a target function in NN amounts to finding weights for the network such that the network outputs are the same as (or within an acceptable range of) the expected outcomes specified in the training data. A vector of weights in essence defines a target function. This makes the target function very difficult for humans to read and interpret. NN is an eager, supervised, and unstable learning approach and cannot accommodate prior knowledge. A popular algorithm for feed-forward networks is Backpropagation, which adopts a gradient descent search and sanctions an inductive bias of smooth interpolation between data points [105].

1.3.4. Bayesian learning

BL offers a probabilistic approach to inference, based on the assumption that the quantities of interest are governed by probability distributions, and that optimal decisions or classifications can be reached by reasoning about these probabilities together with the observed data [105]. BL methods can be divided into two groups based on the outcome of the learner: those that produce the most probable hypothesis given the training data, and those that produce the most probable classification of a new instance given the training data. A target function is thus explicitly represented in the first group, but implicitly defined in the second group. One of the main advantages of BL is that it accommodates prior knowledge (in the form of Bayesian belief networks, prior probabilities for candidate hypotheses, or a probability distribution over observed data for a possible hypothesis). The classification of an unseen case is obtained through the combined predictions of multiple hypotheses. BL also scales up well with large data. It is an eager and supervised learning method and does not require search during the learning process. Though it has no problem with noisy data, BL has difficulty with small data sets.
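The naive Bayes classifier, one of the BL algorithms listed below, can be sketched in a few lines and illustrates the second group: no explicit hypothesis is output, and a new case is classified directly from probabilities estimated from the training data. The attributes, counts, and add-one (Laplace) smoothing choice are illustrative only.

```python
# Minimal sketch: a naive Bayes classifier over discrete attributes,
# with add-one (Laplace) smoothing. Data and attribute names are
# hypothetical.
import math
from collections import Counter, defaultdict

def train(examples):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (label, attr) -> value counts
    values = defaultdict(set)             # attr -> observed values
    for x, label in examples:
        for attr, value in x.items():
            value_counts[(label, attr)][value] += 1
            values[attr].add(value)
    return class_counts, value_counts, values

def classify(x, class_counts, value_counts, values):
    """Return the most probable class for x under the naive Bayes model."""
    total = sum(class_counts.values())
    best, best_log_p = None, float("-inf")
    for label, count in class_counts.items():
        log_p = math.log(count / total)
        for attr, value in x.items():
            c = value_counts[(label, attr)]
            # add-one smoothing over the observed values of this attribute
            log_p += math.log((c[value] + 1) / (count + len(values[attr])))
        if log_p > best_log_p:
            best, best_log_p = label, log_p
    return best

examples = [
    ({"size": "large", "coupling": "high"}, "fault-prone"),
    ({"size": "large", "coupling": "low"},  "fault-prone"),
    ({"size": "small", "coupling": "low"},  "not"),
    ({"size": "small", "coupling": "high"}, "not"),
]
model = train(examples)
print(classify({"size": "large", "coupling": "high"}, *model))  # fault-prone
```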
BL adopts a bias that is based on the minimum description length principle, which prefers a hypothesis h that minimizes the description length of h plus the description length of the data given h [105].
There are several popular algorithms: MAP (maximum a posteriori), the Bayes optimal classifier, the naive Bayes classifier, Gibbs, and EM [37, 105].

1.3.5. Genetic algorithms and genetic programming

GA and GP are both biologically inspired learning methods. A target function is represented as a bit string in GA, or as a program in GP. The search process starts with a population of initial hypotheses. Through the crossover and mutation operations, members of the current population give rise to the next generation of the population. During each step of the iteration, hypotheses in the current population are evaluated with regard to a given measure of fitness, with the fittest hypotheses being selected as members of the next generation. The search process terminates when some hypothesis h has a fitness value above some threshold. Thus, the learning process is essentially embodied in a generate-and-test beam search [105]. The bias is fitness-driven. There are generational and steady-state algorithms.

1.3.6. Instance-based learning

IBL is a typical lazy learning approach in the sense that generalizing beyond the training data is deferred until an unseen case needs to be classified. In addition, a target function is not explicitly defined; instead, the learner returns a target function value when classifying a given unseen case. The target function value is generated based on a subset of the training data that is considered to be local to the unseen example, rather than on the entire training data. This amounts to approximating a different target function for each distinct unseen example. This is a significant departure from the eager learning methods, where a single target function is obtained as a result of the learner generalizing from the entire training data. The search process is based on statistical reasoning, and consists in identifying training data that are close to the given unseen case and producing the target function value based on its neighbors. Popular algorithms include k-nearest neighbors, CBR and locally weighted regression.

1.3.7. Inductive logic programming

Because a target function in ILP is defined by a set of (propositional or first-order) rules, it is highly amenable to human readability and interpretability. ILP lends itself to the incorporation of background knowledge during the learning process, and is an eager and supervised learning method. The bias sanctioned by ILP includes rule accuracy, FOIL-gain, or a preference for shorter clauses. There are a number of algorithms: SCA, FOIL, PROGOL, and inverted resolution.

1.3.8. Analytical learning

AL allows a target function, represented in terms of Horn clauses, to be generalized from scarce data. However, it is indispensable that the training data D be augmented with a domain theory (prior knowledge about the problem domain) B. The learned h is consistent with both D and B, and is good for human readability and interpretability. AL is an eager and supervised learning method, and its search is performed in the form of deductive reasoning. The search bias in EBL, a major AL method, is B plus a preference for a small set of Horn clauses (for the learned h). One important perspective on EBL is that learning can be construed as recompiling or reformulating the knowledge in B so as to make it operationally more efficient when classifying unseen cases. EBL algorithms include Prolog-EBG.
1.3.9. Inductive and analytical learning

Both inductive learning and analytical (deductive) learning have their pros and cons. The former requires plentiful data (and is thus vulnerable to data quality and quantity problems), while the latter relies on a domain theory (and is hence susceptible to domain theory quality and quantity problems). IAL is meant to provide a framework where the benefits of both approaches can be strengthened and the impact of their drawbacks minimized. IAL usually encompasses an inductive learning component and an analytical learning component, e.g., NN+EBL (EBNN), or ILP+EBL (FOCL) [105]. It requires both D and B, and can be an eager and supervised learning method. The issues of target function representation, search, and bias are largely determined by the underlying learning components involved.

1.3.10. Reinforcement learning

RL is the most general form of learning. It tackles the issue of how to learn a sequence of actions called a control strategy from indirect and delayed reward information (reinforcement). It is an eager and unsupervised learning method. Its search is carried out through training episodes. Two main approaches exist for reinforcement learning: model-based and model-free approaches [39]. The best-known model-free algorithm is Q-learning. In Q-learning, actions with the maximum Q value are preferred.

1.3.11. Ensemble learning

In EL, a target function is essentially the result of combining, through weighted or unweighted voting, a set of component or base-level functions called an ensemble. An ensemble can have better predictive accuracy than its component functions if (1) the individual functions disagree with each other, (2) the individual functions have a predictive accuracy that is slightly better than random classification (e.g., error rates below 0.5 for binary classification), and (3) the individual functions' errors are at least somewhat uncorrelated [37]. EL can be seen as a learning strategy that addresses inadequacies in the training data (insufficient information in the training data to help select a single best h ∈ H), in the search algorithm (deploying multiple hypotheses amounts to compensating for less than perfect search algorithms), and in the representation of H (a weighted combination of individual functions makes it possible to represent a true function f ∉ H). Ultimately, an ensemble is less likely to misclassify than just a single component function.

Two main issues exist in EL: ensemble construction and classification combination. There are bagging, cross-validation and boosting methods for constructing ensembles, and weighted vote and unweighted vote for combining classifications [37]. The AdaBoost algorithm is one of the best methods for constructing ensembles of decision trees [37]. There are two approaches to ensemble construction. One is to combine component functions that are homogeneous (derived using the same learning algorithm and defined in the same representation formalism, e.g., an ensemble of functions derived by DT) and weak (slightly better than random guessing). The other approach is to combine component functions that are heterogeneous (derived by different learning algorithms and represented in different formalisms, e.g., an ensemble of functions derived by DT, IBL, BL, and NN) and strong (each of the component functions performs relatively well in its own right) [44].
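A minimal sketch of the classification-combination side of EL: three hypothetical component classifiers vote on a new case, either by simple majority or weighted by their assumed individual accuracies. The classifiers, thresholds, and weights are invented for illustration.

```python
# Minimal sketch: combining an ensemble's individual decisions by
# unweighted and weighted voting. The component "classifiers" and their
# accuracy-based weights are hypothetical stand-ins for learned functions.
from collections import defaultdict

# Each component: (classify function, weight, e.g. its estimated accuracy)
ensemble = [
    (lambda x: "fault-prone" if x["size"] > 500 else "not",    0.70),
    (lambda x: "fault-prone" if x["coupling"] > 10 else "not", 0.65),
    (lambda x: "fault-prone" if x["churn"] > 100 else "not",   0.60),
]

def vote(x, weighted=True):
    """Tally (weighted) votes of the ensemble and return the winning class."""
    tally = defaultdict(float)
    for classifier, weight in ensemble:
        tally[classifier(x)] += weight if weighted else 1.0
    return max(tally, key=tally.get)

module = {"size": 620, "coupling": 8, "churn": 150}
print(vote(module))                  # weighted vote
print(vote(module, weighted=False))  # simple majority vote
```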
1.3.12. Support vector machines

Instead of learning a non-linear target function from the data in the input space directly, SVM uses a kernel function (defined in the form of an inner product of training data) to transform the training data from the input space into a high-dimensional feature space F first, and then learns the optimal linear separator (a hyperplane) in F. A decision function, defined based on the linear separator, can be used to classify unseen cases. Kernel functions play a pivotal role in SVM. A kernel function relies only on a subset of the training data called support vectors. Table 1 is a summary of the aforementioned learning methods.

1.4. SE Tasks for ML Applications

In software engineering, there are three categories of entities: processes (collections of software related activities, such as constructing a specification, detailed design, or testing), products (artifacts, deliverables, or documents that result from a process activity, such as a specification document, a design document, or a segment of code), and resources (entities required by a process activity, such as personnel, software tools, or hardware) [49]. There are internal and external attributes for entities of the aforementioned categories. Internal attributes describe an entity itself, whereas external attributes characterize the behavior of an entity (how the entity relates to its environment).

SE tasks that lend themselves to ML applications include, but are certainly not limited to:
1. Predicting or estimating measurements for either internal or external attributes of processes, products, or resources.
2. Discovering either internal or external properties of processes, products, or resources.
3. Transforming products to accomplish some desirable or improved external attributes.
4. Synthesizing various products.
5. Reusing products or processes.
6. Enhancing processes (such as recovery of specifications from software).
7. Managing ad hoc products (such as design and development knowledge).

In the next section, we take a look at applications that fall into these application areas.
Table 1. Major learning methods.

Type | Target function representation | Target function generation | Search | Inductive bias | Sample algorithm
AL (EBL) | Horn clauses | Eager, D + B, supervised | Deductive reasoning | B + small set of Horn clauses | Prolog-EBG
BL | Probability tables, Bayesian networks | Eager, supervised, D (global), explicit or implicit | Probabilistic, no explicit search | Minimum description length | MAP, BOC, Gibbs, NBC
CL | Conjunction of attribute constraints | Eager, supervised, D (global) | Version space (VS) guided | Target function contained in H | Candidate elimination
DT | Decision trees | Eager, D (global), supervised | Information gain (entropy) | Preference for small trees | ID3, C4.5, Assistant
EL | Indirectly defined through ensemble of component functions | Eager, D (global), supervised | Ensemble construction, classification combination | Determined by ensemble members | AdaBoost (for ensembles of DT)
GA, GP | Bit strings, program trees | Eager, no D, unsupervised | Hill climbing (simulated evolution) | Fitness-driven | Prototypical GA/GP algorithms
IBL | Not explicitly defined | Lazy, D (local), supervised | Statistical reasoning | Similarity to nearest neighbors | k-NN, LWR, CBR
ILP | If-then rules | Eager, supervised, D (global) | Statistical, general-to-specific | Rule accuracy, FOIL-gain, shorter clauses | SCA, FOIL, PROGOL, inverted resolution
NN | Weights for neural networks | Eager, supervised, D (global) | Gradient descent guided | Smooth interpolation between data points | Backpropagation
IAL | Determined by underlying learning methods | Eager, D + B, supervised | Determined by underlying learning methods | Determined by underlying learning methods | KBANN, EBNN, FOCL
RL | Control strategy π* | Eager, no D, unsupervised | Through training episodes | Actions with max. Q value | Q, TD
SVM | Decision function in inner product form | Eager, supervised, D (local: support vectors) | Kernel mapping | Maximal margin separator | SMO
1.5. State-of-the-Practice in ML&SE

A number of areas in software development have already witnessed machine learning applications. In this section, we take a brief look at the reported results and offer a summary of the existing work. The list of applications included in this section, though not complete or exhaustive, should serve to represent a balanced view of the current status. The trend indicates that people have realized the potential of ML techniques and have begun to reap the benefits of applying them in software development and maintenance. In the organization below, we use the areas discussed in Section 1.4 as the guideline to group ML applications in SE tasks. Tables 2 through 8 summarize the targeted SE objectives and the ML approaches used.

1.5.1. Prediction and estimation
In this group, ML methods are used to predict or estimate measurements for either internal or external attributes of processes, products, or resources. These include: software quality, software size, software development cost, project or software effort, maintenance task effort, software resources, correction cost, software reliability, software defects, reusability, software release timing, productivity, execution times, and the testability of program modules.

1.5.1.1. Software quality prediction

GP is used in [48] to generate software quality models that take as input software metrics collected earlier in development, and predict for each module the number of faults that will be discovered later in development or during operations. These predictions then become the basis for ranking modules, enabling a manager to select as many modules from the top of the list as resources allow for reliability enhancement. A comparative study is done in [88] to evaluate several modeling techniques for predicting the quality of software components; among them is the NN model. Another NN based software quality prediction work, reported in [66], is language specific: design metrics for SDL (Specification and Description Language) are first defined, and then used in building the prediction models for identifying fault-prone components. In [71, 72], NN based models are used to predict faults and software quality measures.

CBR is the learning method used in the software quality prediction efforts of [45, 54, 74, 77, 78]. The focus of [45] is on comparing the performance of different CBR classifiers, resulting in a recommendation of a simple CBR classifier with Euclidean distance, z-score standardization, no weighting scheme, and selection of the single nearest neighbor for prediction. In [54], CBR is applied to software quality modeling of a family of full-scale industrial software systems, and its accuracy is considered better than that of a corresponding multiple linear regression model in predicting the number of design faults. Two practical classification rules (majority voting and data clustering) are proposed in [77] for the software quality estimation of high-assurance systems. [78] discusses an attribute selection procedure that can help identify pertinent software quality metrics to be utilized in CBR-based quality prediction. In [74], the CBR approach is used to calibrate software quality classification models. Data from several embedded systems are collected to validate the results.
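The following sketch illustrates the kind of simple CBR classifier recommended in [45] (z-score standardization, Euclidean distance, single nearest neighbor); the case base, metrics, and values are invented and not taken from that study.

```python
# Minimal sketch of a CBR-style quality classifier: standardize metrics
# with z-scores, then label a new module after its single nearest
# neighbor (Euclidean distance) in the case base. All data are invented.
import math

case_base = [  # (metric vector, observed quality class)
    ([120, 4, 2], "fault-prone"),
    ([300, 9, 7], "fault-prone"),
    ([40,  1, 0], "not fault-prone"),
    ([75,  2, 1], "not fault-prone"),
]

def z_score_params(vectors):
    """Per-metric mean and standard deviation over the case base."""
    cols = list(zip(*vectors))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def standardize(v, means, stds):
    return [(vi - m) / s for vi, m, s in zip(v, means, stds)]

def classify(new_module):
    means, stds = z_score_params([v for v, _ in case_base])
    target = standardize(new_module, means, stds)
    def distance(case):
        return math.dist(target, standardize(case[0], means, stds))
    nearest = min(case_base, key=distance)   # single nearest neighbor
    return nearest[1]

print(classify([250, 8, 5]))   # expected: "fault-prone"
```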
Table 2. Measurement prediction and estimation.

SE Task | ML Method*
Software quality (high-risk or fault-prone component identification) | GP [48], NN [66, 71, 72, 88], CBR [45, 54, 74, 77, 78], DT [18, 75, 76, 115, 121], GP+DT [79], CL [35], ILP [30]
Software size | {NN, GP} [41]
Software development cost | DT [17], CBR [19], BL [28]
Project/software (development) effort | CBR [82, 131, 140, 142], {DT, NN} [135], GA+NN [133], {NN, CBR} [51], GP [93], {NN, CBR, DT} [97], NN [63, 146], {GP, NN, CBR} [25]
Maintenance task effort | {NN, DT} [68]
Software resource analysis | DT [129]
Software cost/correction cost | GP [42], {DT, ILP} [34]
Software reliability | NN [69]
Defects | BL [50]
Reusability | DT [98]
Software release timing | NN [40]
Productivity | BL [136]
Execution time | GA [143]
Testability of program modules | NN [73]

* A note on notation: {...} indicates that multiple ML methods are each independently applied to the same SE task, and "...+..." indicates that multiple ML methods are collectively applied to an SE task. This applies to Tables 2 through 8.

In [115], a DT based approach is used to generate measurement-based models of high-risk components. The proposed method relies on historical data (metrics from previous releases or projects) for identifying components with fault-prone properties. Another DT based approach is used to build models for predicting high-risk Ada components [18]. Regression trees are used in [75] to classify fault-prone software modules. The approach allows one to strike a preferred balance between Type I and Type II misclassification rates.
17
NN is the method used in [63, 146] for software development effort prediction, and the results are encouraging in terms of accuracy. Additional research on ML based software effort prediction includes a genetically trained NN (GA+NN) predictor [133] and a GP based approach [93]. The conclusion in [93] epitomizes the dichotomy in applying an ML method: "GP performs consistently well for the given data, but is harder to configure and produces more complex models", and "the complexity of the GP must be weighed against the small increases in accuracy to decide whether to use it as part of any effort prediction estimation". In addition, in-house data are more significant than public data sets for the estimates. Several comparative studies of software effort estimation have been reported in [25, 51, 97], where [51] deals with NN and CBR, [97] with CBR, NN and DT, and [25] with CBR, GP and NN.

1.5.1.5. Maintenance task effort prediction

Models are generated in terms of NN and DT methods, and regression methods, for software maintenance task effort prediction in [68]. The study measures and compares the prediction accuracy of each model, and concludes that DT-based and multiple regression-based models have better accuracy results. It is recommended that prediction models be used as instruments to support expert estimates and to analyze the impact of the maintenance variables on the process and product of maintenance.

1.5.1.6. Software resource analysis

In [129], DT is utilized in software resource data analysis to identify classes of software modules that have high development effort or faults (the concept of "high" is defined with regard to the uppermost quartile relative to past data). Sixteen software systems are used in the study. The decision trees correctly identify 79.3 percent of the software modules that had high development effort or faults.

1.5.1.7. Correction cost estimation

An empirical study is done in [34] where DT and ILP are used to generate models for estimating correction costs in software maintenance. The generated models prove to be valuable in helping to optimize resource allocations in corrective maintenance activities, and to make decisions regarding when to restructure or reengineer a component so as to make it more maintainable. A comparison leads to the observation that the ILP-based models perform better than the DT-based ones.

1.5.1.8. Software reliability prediction

Software reliability growth models can be used to characterize how software reliability varies with time and other factors. The models offer mechanisms for estimating current reliability measures and for predicting their future values. The work in [69] reports the use of NN for software reliability growth prediction. An empirical comparison is conducted between NN-based models and five well-known software reliability growth models using actual data sets from a number of different software projects. The results indicate that NN-based models adapt well across different data sets and have better prediction accuracy.
1.5.1.9. Defect prediction

BL is used in [50] to predict software defects. Though the system reported is only a prototype, it shows the potential Bayesian belief networks (BBNs) have for incorporating multiple perspectives on defect prediction into a single, unified model. Variables in the prototype BBN system [50] are chosen to represent the life-cycle processes of specification, design and implementation, and testing (Problem-Complexity, Design-Effort, Design-Size, Defects-Introduced, Testing-Effort, Defects-Detected, Defects-Density-At-Testing, Residual-Defect-Count, and Residual-Defect-Density). The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner: for given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected and Defect-Density.

1.5.1.10. Reusability prediction

Predictive models are built through DT in [98] to verify the impact of some internal properties of object-oriented applications on reusability. Effort is focused on establishing a correlation between component reusability and three software attributes (inheritance, coupling and complexity). The experimental results show that some software metrics can be used to predict, with a high level of accuracy, the potentially reusable classes.

1.5.1.11. Software release timing

How to determine the software release schedule is an issue that affects the software product developer, the user, and the market. A method, based on NN, is proposed in [40] for estimating the optimal software release timing. The method adopts the cost minimization criterion and translates it into a time series forecasting problem. NN is then used to estimate the fault-detection time in the future.

1.5.1.12. Testability prediction

The work reported in [73] describes a case study in which NN is used to predict the testability of software modules from static measurements of the source code. The objective in the study is to predict a quantity between zero and one whose distribution is highly skewed toward zero, which proves to be difficult for standard statistical techniques. The results echo the salient feature of NN-based predictive models discussed so far: their ability to model nonlinear relationships.

1.5.1.13. Productivity

A BL based approach is described in [136] for estimating the productivity of software projects. A demonstrative BBN is defined to capture the causal relationships among components in the COCOMO81 model, along with probability tables for the nodes. The results obtained are still preliminary.

1.5.1.14. Execution time

Temporal behaviors of real-time software are pivotal to overall system correctness. Testing whether a real-time system violates its specified timing constraints for certain inputs thus becomes a critical issue. A GA based approach is described in [143] to produce inputs with the longest or shortest execution times, which can be used to check whether they cause a temporal error or a violation of the system's timing constraints.
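In the spirit of [143] (though not their actual system), the sketch below uses a GA to evolve input vectors toward longer measured execution times of a stand-in program under test; the encoding, fitness measurement, and GA parameters are all hypothetical.

```python
# Minimal sketch: a GA searching for inputs that maximize the measured
# execution time of a program under test. The program, encoding, and GA
# parameters below are hypothetical placeholders.
import random
import time

def program_under_test(x):
    """Stand-in for the real-time code being tested."""
    for _ in range(sum(x) % 5000):
        pass

def fitness(x):
    """Measured execution time (seconds) for input vector x."""
    start = time.perf_counter()
    program_under_test(x)
    return time.perf_counter() - start

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(x, rate=0.1):
    return [random.randint(0, 1000) if random.random() < rate else g for g in x]

def evolve(pop_size=20, genes=8, generations=30):
    population = [[random.randint(0, 1000) for _ in range(genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]          # fittest half survives
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

print(evolve())   # input suspected of triggering the longest execution time
```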
19
1.5.2.
Property and model discovery
ML methods are used to identify or discover useful information about software entities. Work in [16] explores using ILP to discover loop invariants. The approach is based on collecting execution traces of a program to be proven correct and using them as learning examples of an ILP system. The states of the program variables at a given point in the execution represent positive examples for the condition associated with that point in the program. A controlled closed-world assumption is utilized to generate negative examples. Table 3. Property discovery SE Task
ML Method
Program invariants
ILP [16]
Identifying objects in programs
NN [1]
Boundary of normal operations
S VM [95]
Equivalent mutants
BL [141]
Process models
NN [31 ], EBL [55]
In [1], NN is used to identify objects in procedural programs as an effort to facilitate many maintenance activities (reuse, understanding). The approach is based on cluster analysis and is capable of identifying abstract data types and groups of routines that reference a common set of data. A data analysis technique called process discovery is proposed in [31] that is implemented in terms of NN. The approach is based on first capturing data describing process events from an ongoing process and then generating a formal model of the behavior of that process. Another application involves the use of EBL to synthesize models of programming activities or software processes [55]. It generates a process fragment (a group of primitive actions which achieves a certain goal given some preconditions) from a recorded process history. Despite its effectiveness at detecting faults, mutation testing requires a large number of mutants to be compiled, executed, and analyzed for possible equivalence to the original program being tested. To reduce the number of mutants to be considered, BL is used in [141] to provide probabilistic information to determine the equivalent mutants. A detection method based on SVM is described in [95] as an approach for validating adaptive control systems. A case study is done with an intelligent flight control system and the results indicate that the proposed approach is effective for discovering boundaries of the safe region for the learned domain, thus being able to separate faulty behaviors from normal events.
20
1.5.3.
Transformation
The work in [125, 126] describes a GP system that can transform serial programs into functionally identical parallel programs. The functional identical property between the input and the output of the transformation can be proven, which greatly enhances the opportunities of the system being utilized in commercial environments. Table 4. Transformation. SE Task
ML Method
Transform serial programs to parallel ones
GP [125, 126]
Improve software modularity
CBR+NN [128], GA [62]
Mapping 0 0 applications to heterogeneous distributed environments
GA [27]
A module architecture assistant is developed in [128] to help assist software architects in improving the modularity of large programs. A model for modularization is established in terms of nearest-neighbor clustering and classification, and is used to make recommendations to rearrange module membership in order to improve modularity. The tool learns similarity judgments that match those of the software architect through performing back propagation on a specialized neural network. Another work for software modularization is reported in [62] that introduces a new representation (aimed at reducing the size of the search space) and a new crossover operator (designed to promote the formation and retention of building blocks) for GA based approach. GA is used in [27] in experimenting and evaluating a partitioning and allocation model for mapping object-oriented applications to heterogeneous distributed environments. By effectively distributing software components of an object-oriented system in a distributed environment, it is hoped to achieve performance goals such as load balancing, maximizing concurrency and minimizing communication costs. 1.5.4.
Generation and synthesis
In [10], a test case generation method is proposed that is based on ILP. An adequate test set is generated as a result of inductive learning of programs from finite sets of input-output examples. The method scales up well when the size or the complexity of the program to be tested grows. It stops being practical if the number of alternatives (or possible errors) becomes too large. A GP based approach is described in [46] to select and evaluate test data. A tool is reported in [101, 102] that uses, among other things, GA to generate dynamic test data for C/C++ programs. The tool is fully automatic and supports all C/C++ language constructs. Test results have been obtained for programs containing up to 2000 lines of source code with complex, nested conditionals. Three separate works on test data generation are also based on GA
21
[14, 24, 144]. In [14], the issue of how to instrument programs with flag variables is considered. GA is used in [24] to help generate test data for program paths, whereas work in [144] is focused on test data generation for structural test methods. Table 5. Generation and synthesis. SE Task
ML Method ILP [10], GA [14,24,101,102,144],
Test cases/data
GP [46] Test resource
GA [33]
Project management rules
{GA, DT] [5]
Software agents
GP [120]
Design repair knowledge
CBR + EBL [6]
Design schemas
IBL [61]
Data structures
GP [86]
Programs/scripts
IBL [12], [CL, AL} [104]
Project management schedule
GA [26]
Testing resource allocation problem is considered in [33] where a GA approach is described. The results are based on consideration of both system reliability and testing cost. In [5], DT and GA are utilized to learn software project management rules. The objective is to provide decision rules that can help project managers to make decisions at any stage during the development process. Synthesizing Unix shell scripts from a high-level specification is made possible through IBL in [12]. The tool has a retrieval mechanism that allows an appropriate source analog to be automatically retrieved given a description of a target problem. Several domain specific retrieval heuristics are utilized to estimate the closeness of two problems at implementation level based on their perceived closeness in the specification level. Though the prototype system demonstrates the viability of the approach, the scalability remains to be seen. A prototype of a software engineering environment is described in [6] that combines CBR and EBL to synthesize design repair rules for software design. Though the preliminary results are promising, the generality of the learning mechanism and the scaling-up issue remain to be open questions, as cautioned by the authors. In [61], IBL provides the impetus to a system that acquires software design schemas from design cases of existing applications.
22
GP is used in [120] to automatically generate agent programs that communicate and interact to solve problems. However, the reported work so far is on a two-agent scenario. Another GP based approach is geared toward generating abstract data types, i.e., data structures and the operations to be performed on them [86]. In [104], CL and AL are used in synthesizing search programs for a Lisp code generator in the domain of combinatorial integer constraint satisfaction problems. GA is behind the effort in generating project management schedules in [26]. Using a programmable goal function, the technique can generate a near-optimal allocation of resources and a schedule that satisfies a given task structure and resource pool. 1.5.5.
Reuse library construction and maintenance
This area presents itself as a fertile ground for CBR applications. In [109], CBR is the corner stone of a reuse library system. A component in the library is represented in terms of a set of feature/term pairs. Similarity between a target and a candidate is defined by the distance measure, which is computed through comparator functions based on the subsumption, closeness and package relations. Components in a software reuse library have an added advantage in that they can be executed on a computer so as to yield stronger results than could be expected from generic CBR. The work reported in [52] takes advantage of this property by first retrieving software modules from the library, adapting them to new problems, and then subjecting those new cases to executions on system-generated test sets in order to evaluate the results of CBR. CBR can be augmented with additional mechanisms to help aid other issues in reuse library. Such is the case in [70] where CBR is adopted in conjunction with a specificity-genericity hierarchy to locate and adopt software components to given specifications. The proposed method focuses its attention on the evolving nature of the software repository. Table 6. Reuse. SE Task
ML Method
Similarity computing
CBR [109]
Active browsing
IBL [43]
Cost of rework
DT [8]
Knowledge representation
CBR [52]
Locate and adopt software to
CBR [70]
specifications Generalizing program abstractions
EBL [65]
Clustering of components
GA [90]
23
How to find a better way of organizing reusable components so as to facilitate efficient user retrieval is another area where ML finds its application. GA is used in [90] to optimize the multiway clustering of software components in a reusable class library. Their approach takes into consideration the following factors: number of clusters, similarity within a cluster and similarity among clusters. In [8], DT is used to model and predict the cost of rework in a library of reusable software components. Prescriptive coding rules can be generated from the model that can be used by programmers as guidelines to reduce the cost of rework in the future. The objective of the work is to use DT to help manage the maintenance of reusable components, and to improve the way the components are produced so as to reduce maintenance costs in the library. A technique called active browsing is incorporated into a tool that helps assist the browsing of a reusable library for desired components [43]. An active browser infers its similarity measure from a designer's normal browsing actions without any special input. It then recommends to the designer components it estimates to be close to the target of the search, which is accomplished through a learning process similar to IBL. EBL is used as the basis to capture and generalize program abstractions developed in practice to increase their potential for reuse [65]. The approach is motivated by the explicit domain knowledge embodied in data type specifications and the mechanisms for reasoning about such knowledge used in validating software. 1.5.6. Requirement acquisition CL is used to support scenario-based requirement engineering in the work reported in [85]. The paper describes a formal method for supporting the process of inferring specifications of system goals and requirements inductively from interaction scenarios provided by stakeholders. The method is based on a learning algorithm that takes scenarios as examples and counter-examples (positive and negative scenarios) and generates goal specifications as temporal rules. Table 7. Process enhancement. SE Task
ML Method
Derivation of specifications of system
CL [85]
goals and requirements Extract specifications from software
ILP [29]
Acquire knowledge for specification refinement and augmentation Acquire and maintain specification consistent with scenarios
{DT, NN) [111] EBL [58, 59]
Another work in [58] presents a scenarios-based elicitation and validation assistant that helps requirements engineers acquire and maintain a specification consistent with scenarios provided.
24
The system relies on EBL to generalize scenarios to state and prove validation lemmas. A scenario generation tool is built in [59] that adopts a heuristic approach based on the idea of piecing together partially satisfying scenarios from the requirements library and using EBL to abstract them in order to be able to co-instantiate them. A technique is developed in [29] to extract specifications from software using ILP. It allows instrumented code to be run on a number of representative cases, and generate examples of the code's behavior. ILP is then used to generalize these examples to form a general description of some aspect of a system's behavior. Software specifications are imperfect reflections of a reality, and are prone to errors, inconsistencies and incompleteness. Because the quality of a software system hinges directly on the accuracy and reliability of its specification, there is dire need for tools and methodologies to perform specification enhancement. In [111], DT and NN are used to extract and acquire knowledge from sample problem data for specification refinement and augmentation. 1.5.7.
Capture development knowledge
How to capture and manage software development knowledge is the theme of this application group where both papers report work utilizing CBR as the tool. In [64], a CBR based infrastructure is proposed that supports evolving knowledge and domain analysis methods that capture emerging knowledge and synthesize it into generally applicable forms. Software process knowledge management is the focus in [4]. A hybrid approach including CBR is proposed to support the customization of software processes. The purpose of CBR is to facilitate reuse of past experiences.
Table 8. Management. SE Task
1.6.
ML Method
Collect and manage software development knowledge
CBR [64]
Software process knowledge
CBR [4]
Status
In this section, we offer a summary of the state-of-the-practice in this niche area. The application patterns of ML methods in the body of existing work are summarized in Table 9.
25
Table 9. Application patterns of ML methods. Pattern
Description
Convergent
Different ML methods each being applied to the same SE task
Divergent
A single ML method being applied to different SE tasks
Compound
Several ML methods being combined together for a single SE task
Figure 8 captures a glimpse of the types of software engineering issues in the seven application areas people have been interested in applying ML techniques to. Figure 9 summarizes the publication counts in those areas. For instance, of the eighty-six publications included in Subsection 1.5 above, forty-five of them (52%) deal with the issue of how to build models to predict or estimate certain property of software development process or artifacts. On the other hand, Figure 10 provides some clue on what types of ML techniques people feel comfortable in using. Based on the classification, IBL/CBR, NN, and DT are the top three popular techniques in that order, amounting to fifty-seven percent of the entire ML applications in our study.
Figure 8. Number of different SE tasks in each application area.
26
Figure 9. Number of publications in each application area.
Figure 10. State-of-the-practice from the perspective of ML algorithms.
Table 10 depicts the distribution of ML algorithms in the seven SE application areas. The trend information captured in Figure 11, though only based on the published work we have been able to collect, should be indicative of the increased interest in ML&SE.
27
Table 10. ML methods in SE application areas. NN
Prediction
V
Discovery
-^
Transformation
V
IBL CBR
V
V
V V
Reuse
V
Management
V
GA
ILP
GP
V
EBL
V V
Generation
Acquisition
DT
yj
V V V yj
V
CL
BL
V
V
V
AL
IAL
V
V
V
V
V
V
V V
V
V
^j
Figure 11. Publications on applying ML algorithms in SE. Tables 11-21 summarize the applications of individual ML methods.
28
EL
SVM
V
V V
RL
Table 11. IBL/CBR applications. Category
Application
Prediction
Quality Development cost Development effort
Transformation
Modularity
Generation
Design repair knowledge Design schemas Programs/scripts
Reuse
Similarity computing Active browsing Knowledge representation Locate/adopt software to specifications
Management
Software development knowledge Software process knowledge
29
Table 12. NN applications. Category
Application
Prediction
Quality Size Development effort Maintenance effort Reliability Release time Testability
Discovery
Identifying objects Process models
Transformation
Modularity
Acquisition
Specification refinement
30
Table 13. DT applications. Category
Application
Prediction
Quality Development cost Development effort Maintenance effort Resource analysis Correction cost Reusability
Generation
Project management rules
Reuse
Cost of rework
Acquisition
Specification refinement Table 14. GA applications.
Category
Application
Prediction
Development effort Execution time
Transformation
Modularity Object-oriented application
Generation
Test data Test resource allocation Project management rules Project management schedule
Reuse
Clustering of components
31
Table 15. GP applications. Category
Application
Prediction
Quality Size Development effort Software cost
Transformation
Parallel programs
Generation
Test data Software agents Data structures Table 16. ILP applications.
Category
Application
Prediction
Quality Correction cost
Discovery
Program invariants
Generation
Test data
Acquisition
Extract specifications from software Table 17. EBL applications.
Category
Application
Discovery
Process models
Generation
Design repair knowledge
Reuse
Generalizing program abstractions
Acquisition
Acquiring specifications from scenarios
32
Table 18. BL applications. Category
Application
Prediction
Development cost Defects Productivity
Discovery
Mutants
Table 19. CL applications. Category
Application
Prediction
Quality
Generation
Programs/scripts
Acquisition
Derivation of specifications
Table 20. AL applications. Category
Application
Generation
Programs/scripts
Table 21. SVM application. Category
Application
Discovery
Operation boundary
33
The body of existing work we have been able to glean definitely represents the efforts that have been underway to take advantage of the unique perspective ML affords us to explore for SE tasks. Here we point out some general issues in ML&SE as follows. > Applicability and justification. When adopting an ML method to an SE task, we need to have a good understanding of the dimensions of the leaning method and characteristics of the SE task, and find a best match between them. Such a justification offers a necessary condition for successfully applying an ML method to an SE task. > Issue of scaling up. Whether a learning method can be effectively scaled up to handle real world SE projects is an issue to be reckoned with. What seems to be an effective method for a scaled-down problem may hit a snag when being subject to a full-scale version of the problem. Some general guidelines regarding the issue are highly desirable. > Performance evaluation. Given some SE task, some ML-based approaches may outperform their conventional (non-ML) counterparts, others may not offer any performance boost but just provide a complement or alternative to the available tools, yet another group may fill in a void in the existing repertoire of SE tools. In addition, we are interested in finding out if there are significant performance differences among applicable ML methods for an SE task. To sort out those different scenarios, we need to establish a systematic way of evaluating the performance of a tool. Let S be a set of SE tasks, and let Tc and TL contain a set of conventional (non-ML) SE tools and a set of ML-based tools, respectively. Figure 12 describes some possible scenarios between S and TQ/TL, where Tc(s) c Tc and TL(S) C TL indicate a subset of tools applicable to an SE task s, respectively. If P is defined to be some performance measure (e.g., prediction accuracy), then we can use P(t, s) to denote the performance of t for seS, where t e (Tc(s) v TL(s)). Let A ::= < | = | >. Given an s e S, the performance of two applicable tools can be compared in terms of the following relationships: P(ti; s) A Pft, s), where % e Tc(s) A tj e TL(s), P(tk, s) A P(t,, s), where tk, t, e TL(s) A |T L (S)| >1.
Figure 12. Relationships between S and Tc, and between S and TL.
34
> Integration. How can an ML-based tool be seamlessly integrated into the SE development environment or tool suite is another issue that deserves attention. If it takes a heroic effort for the tool's integration, it may ultimately affects its applicability. 1.7.
Applying ML Algorithms to SE Tasks
In applying machine learning to solving any real-world problem, there is usually some course of actions to follow. What we propose is a guideline that has the following steps: Problem Formulation. The first step is to formulate a given problem such that it conforms to the framework of a particular learning method chosen for the task. Different learning methods have different inductive bias, adopt different search strategies that are based on various guiding factors, have different requirements regarding domain theory (presence or absence) and training data (valuation and properties), and are based on different justifications of reasoning (refer to Figures 2-7). All these issues must be taken into consideration during the problem formulation stage. This step is of pivotal importance to the applicability of the learning method. Strategies such as divide-and-conquer may be needed to decompose the original problem into a set of subproblems more amenable to the chosen learning method. Sometimes, the best formulation of a problem may not always be the one most intuitive to a machine learning researcher [87]. Problem representation. The next step is to select an appropriate representation for both the training data and the knowledge to be learned. As can be seen in Figure 2, different learning methods have different representational formalisms. Thus, the representation of the attributes and features in the learning task is often problem-specific and formalism-dependent. Data collection. The third step is to collect data needed for the learning process. The quality and the quantity of the data needed are dependent on the selected learning method. Data may need to be preprocessed before they can be used in the learning process. Domain theory preparation. Certain learning methods (e.g., EBL) rely on the availability of a domain theory for the given problem. How to acquire and prepare a domain theory (or background knowledge) and what is the quality of a domain theory (correctness, completeness) therefore become an important issue that will affect the outcome of the learning process. Performing the learning process. Once the data and a domain theory (if needed) are ready, the learning process can be carried out. The data will be divided into a training set and a test set. If some learning tool or environment is utilized, the training data and the test data may need to be organized according to the tool's requirements. Knowledge induced from the training set is validated on the test set. Because of different splits between the training set and test set, the learning process itself is an iterative one. Analyzing and evaluating learned knowledge. Analysis and evaluation of learned knowledge is an integral part of the learning process. The interestingness and the performance of the acquired knowledge will be scrutinized during this step, often with the help from human experts, which hopefully will lead to the knowledge refinement. If learned knowledge is deemed insignificant, uninteresting, irrelevant, or deviating, this may be indicative to the need for revisions at early stages such as problem formulation and representation. There are known practical problems in many learning methods such as overfitting, local minima, or curse of dimensionality that are due
35
to either data inadequacy, noise or irrelevant attributes in data, nature of a search strategy, or incorrect domain theory. Fielding the knowledge base. What this step entails is that the learned knowledge be used [87]. The knowledge could be embedded in a software development system or a software product, or used without embedding it in a computer system. As observed in [87], the power of machine learning methods does not come from a particular induction method, but instead from proper formulation of the problems and from crafting the representation to make learning tractable. 1.8.
Organization of the Book
The rest of the book is organized as follows. Chapters 2 through 8 cover ML applications in seven different categories of SE, respectively. Chapter 2 deals with ML applications in software measurements or attributes prediction and estimation. This is the most concentrated category that includes forty-five publications in our study. In this chapter, a collection of seven papers is selected as representatives for activities in this category. These seven papers include ML applications in predicting or estimating: software quality, software development cost, project effort, software defect, and software release timing. Those applications involve ML methods of BL, DT, NN, CBR, and GP. In Chapter 3, two papers are included to address the use of ML methods for discovering software properties and models, one dealing with using NN to identify objects in procedural programs, and the other tackling the issue of detecting equivalent mutants in mutation testing using BL The main theme in Chapter 4 is software transformation. ML methods are utilized to transform software into one with desirable properties (e.g., from serial programs to parallel programs, from a less modularized program to a more modularized one, mapping object-oriented applications to heterogeneous distributed environments). In this chapter, we include one paper that deals with the issue of transforming software systems for better modularity using nearest-neighbor clustering and a special-purpose NN. Chapter 5 describes ML applications where software artifacts are generated or synthesized. The chapter contains one paper that describes a GA based approach to test data generation. The proposed approach is based on dynamic test data generation and is geared toward generating condition-decision adequate test sets. Chapter 6 takes a look at how ML methods are utilized to improve the process of constructing and maintaining reuse libraries. Software reuse library construction and maintenance has been a fertile ground for ML applications. The paper included in this chapter describes a CBR based approach to locating and adopting reusable components to particular specifications. In Chapter 7, software specification is the target issue. Two papers are selected in the chapter. The first paper describes an ILP based approach to extracting specifications from software. The second paper discusses an EBL based approach to scenario generation that is an integral part of specification modeling. Chapter 8 is concerned with how ML methods are used to capture and manage software development or process knowledge. The one paper in the chapter discusses a CBR based method for collecting and managing software development knowledge as it evolves in an organizational context. Finally, Chapter 9 offers some guidelines on how to select ML methods for SE tasks, how to formulate an SE task into a learning problem, and concludes the book with remarks on where future effort will be needed in this niche area.
36
Chapter 2 ML Applications in Prediction and Estimation As evidenced in Chapter One, the majority of the ML applications (52%) deal with the issue of how to build models to predict or estimate certain property of software development process or artifacts. The subject of the prediction or estimation involves a range of properties: quality, size, cost, effort, reliability, reusability, productivity, and testability. In this chapter, we include a set of 7 papers where ML methods are used to predict or estimate measurements for either internal or external attributes of processes, products, or resources in software engineering. These include: software quality, software cost, project or software development effort, software defect, and software release timing. Table 22 summarizes the current state-of-the-practice in this application area. Table 22. ML methods used in prediction and estimation. NN IBL DT GA GP ILP EBL CL BL AL IAL RL EL SVM CBR Quality
V
Size
V
Development Cost Development Effort
V
Maintenance Effort
-^
Resource Analysis
V
V
V
V
V
V
V
V V
V
\j \j yj
Correction Cost
\j
y]
yj
Defects
yj
Reusability Release Time
A/ yj
Productivity
-\j
Execution Time Testability
V
V
Software Cost
Reliability
V
-\/ yj
A primary concern in prediction or estimation models and methods is accuracy. There are some general issues about prediction accuracy. The first is the measurement, namely, how accuracy is to be measured. There are several accuracy measurements and the choice of which one to use may be dependent on what objectives one has when using the predictor. The second issue is the
37
sensitivity, that is, how sensitive a prediction method's accuracy is to changes in data and time. Different approaches may have different level of sensitivity. The paper by Chulani, Boehm and Steece [28] describes a BL approach to software development cost prediction. A salient feature of BL is that it accommodates prior knowledge and allows both data and prior knowledge to be utilized in making inferences. This proves to be especially helpful in circumstances where data are scarce and incomplete. The results obtained by authors in the paper indicate that the BL approach has a predictive performance (within 30 percent of the actual values 75 percent of the time) that is significantly better than that of the previous multiple regression approach (within 30 percent of the actual values only 52 percent of the time) on their latest sample of 161 project datapoints. The paper by Srinivasan and Fisher [135] deals with the issue of estimating software development effort. This is an important task in software development process, as either underestimation or overestimation of the development effort would have adverse effect. Their work describes the use of two ML methods, DT and NN, for building software development effort estimators from historical data. The experimental results indicate that the performance of DT and NN based estimators are competitive with traditional estimators. Though just as sensitive to various aspects of data selection and representation as the traditional models, a major benefit of ML based estimators is that they are adaptable and nonparametric. The paper by Shepperd and Schofield [131] adopts a CBR approach to software project effort estimation. In their approach, projects are characterized in terms of a feature set that ranges from as few as one and as many as 29, and includes features such as the number of interfaces, development method, the size of functional requirements document. Cases for completed projects are stored along with their features and actual values of development effort. Similarity among cases is defined based on project features. Prediction for the development effort of a new project amounts to retrieving its nearest neighbors in the case base and using their known effort values as the basis for estimation. The sensitivity analysis indicates that estimation by analogy may be highly unreliable if the size of the case base is below 10 known projects, and that this approach can be susceptible to outlying projects, but the influence by a rogue project can be ameliorated as the size of dataset increases. The paper by Fenton and Neil [50] offers a critical analysis of the existing defect prediction models, and proposes an alternative approach to defect prediction using Bayesian belief networks (BBN), part of BL method. Software defect prediction is a very useful and important tool to gauge the likely delivered quality and maintenance effort before software systems are deployed. Predicting defects requires a holistic model rather than a single-issue model that hinges on either size, or complexity, or testing metrics, or process quality data alone. It is argued in [50] that all these factors must be taken into consideration in order for the defect prediction to be successful. BBN proves to be a very useful approach to the software defect prediction problem. A BBN represents the joint probability distribution for a set of variables. 
This is accomplished by specifying (a) a directed acyclic graph (DAG) where nodes represent variables and arcs correspond to conditional independence assumptions (causal knowledge about the problem domain), and (b) a set of local conditional probability tables (one for each variable) [67, 105]. A BBN can be used to infer the probability distribution for a target variable (e.g., "Defects Detected"), which specifies the probability that the variable will take on each of its possible values given the observed values of the other variables. In general, a BBN can be used to compute the probability distribution for any subset of variables given the values or distributions
38
for any subset of the remaining variables. In [50], variables in the BBN model are chosen to represent the life-cycle processes of specification, design and implementation, and testing. The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with regard to the BBN model in the following manner. For given facts about Design-Effort and Design-Size as input, the tool will use Bayesian inference to derive the probability distributions for Defects-Introduced, DefectsDetected and Defect-Density. The paper by Khoshgoftaar, Allen and Deng [75] discusses using a DT approach to classifying fault-prone software modules. The objective is to predict which modules are fault-prone early enough in the development life cycle. In the regression tree to be learned, the s-dependent variable is the response variable that is of the data type real, the ^-independent variables are predictors based on which the internal nodes of the tree are defined, and the leaf nodes are labeled with a real quantity for the response variable. A large legacy telecommunication system is used in the case study where four consecutive releases of the software are the basis for the training and test data sets (release 1 used as the training data set, releases 2-4 used as test data sets). A classification rule is proposed that allows the developer the latitude to have a balance between two types of misclassification rates. The case study results indicate satisfactory prediction accuracy and robustness. The paper by Burgess and Lefley [25] conducts a comparative study of software effort estimation in terms of three ML methods: GP, NN and CBR. A well-known data set of 81 projects in the late 1980s is used for the study. The input variables are restricted to those available from the specification stage. The comparisons are based on the accuracy of the results, the ease of configuration and the transparency of the solutions. The results indicate that the explanatory nature of estimation by analogy gives CBR an advantage when considering its interaction with the end user, and that GP can lead to accurate estimates and has the potential to be a valid addition to the suite of tools for software effort estimation. The paper by Dohi, Nishio and Osaki [40] proposes an NN based approach to estimating the optimal software release timing which minimizes the relevant cost criterion. Because the essential problem behind the software release timing is to estimate the fault-detection time interval in the future, authors adopt two typical NN (a feed forward NN and a recurrent NN) for the purpose of time series forecasting. Six data sets of real software fault-detection time are used in the case study. The results indicate that the predictive accuracy of the NN models outperforms those of software reliability growth models based approaches. Of the two NN models, the recurrent NN yields better results than the feed forward NN. The following papers will be included here: S. Chulani, B. Boehm and B. Steece, "Bayesian analysis of empirical software engineering cost models," IEEE Trans. SE, Vol. 25, No. 4, July 1999, pp. 573-583. K. Srinivasan and D. Fisher, "Machine learning approaches to estimating software development effort," IEEE Trans. SE, Vol. 21, No. 2, Feb. 1995, pp. 126-137. M. Shepperd and C. Schofield, "Estimating software project effort using analogies", IEEE Trans. SE, Vol. 23, No. 12, November 1997, pp. 736-743.
39
N. Fenton and M. Neil, "A critique of software defect prediction models," IEEE Trans. SE, Vol. 25, No. 5, Sept. 1999, pp. 675-689. T. Khoshgoftaar, E.B. Allen and J. Deng, Using regression trees to classify fault-prone software modules, IEEE Transactions on Reliability, Vol.51, No.4, 2002, pp.455-462. CJ. Burgess and M. Lefley, Can genetic programming improve software effort estimation? A comparative evaluation, Information and Software Technology, Vol.43, No. 14, 2001, pp.863873. T. Dohi, Y. Nishio, and S. Osaki, "Optimal software release scheduling based on artificial neural networks", Annals of Software Engineering, Vol.8, No.l, 1999, pp.167-185.
40
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING,
VOL. 25,
NO. 4,
JULY/AUGUST 1999
573
Bayesian Analysis of Empirical Software Engineering Cost Models Sunita Chulani, Member, IEEE, Barry Boehm, Fellow, IEEE, and Bert Steece Abstract—To date many software engineering cost models have been developed to predict the cost, schedule, and quality of the software under development. But, the rapidly changing nature of software development has made it extremely difficult to develop empirical models that continue to yield high prediction accuracies. Software development costs continue to increase and practitioners continually express their concerns over their inability to accurately predict the costs involved. Thus, one of the most important objectives of the software engineering community has been to develop useful models that constructively explain the software development life-cycle and accurately predict the cost of developing a software product. To that end, many parametric software estimation models have evolved in the last two decades [25], [17], [26], [15], [28], [1], [2], [33], [7], [10], [22], [23]. Almost all of the above mentioned parametric models have been empirically calibrated to actual data from completed software projects. The most commonly used technique for empirical calibration has been the popular classical multiple regression approach. As discussed in this paper, the multiple regression approach imposes a few assumptions frequently violated by software engineering datasets. The source data is also generally imprecise in reporting size, effort, and cost-driver ratings, particularly across different organizations. This results in the development of inaccurate empirical models that don't perform very well when used for prediction. This paper illustrates the problems faced by the multiple regression approach during the calibration of one of the popular software engineering cost models, COCOMO II. It describes the use of a pragmatic 10 percent weighted average approach that was used for the first publicly available calibrated version [6]. It then moves on to show how a more sophisticated Bayesian approach can be used to alleviate some of the problems faced by multiple regression. It compares and contrasts the two empirical approaches, and concludes that the Bayesian approach was better and more robust than the multiple regression approach. Bayesian analysis is a well-defined and rigorous process of inductive reasoning that has been used in many scientific disciplines (the reader can refer to [11], [35], [3] for a broader understanding of the Bayesian Analysis approach). A distinctive feature of the Bayesian approach is that it permits the investigator to use both sample (data) and prior (expert-judgment) information in a logically consistent manner in making inferences. This is done by using Bayes' theorem to produce a 'postdata' or posterior distribution for the model parameters. Using Bayes' theorem, prior (or initial) values are transformed to postdata views. This transformation can be viewed as a learning process. The posterior distribution is determined by the variances of the prior and sample information. If the variance of the prior information is smaller than the variance of the sampling information, then a higher weight is assigned to the prior information. On the other hand, if the variance of the sample information is smaller than the variance of the prior information, then a higher weight is assigned to the sample information causing the posterior estimate to be closer to the sample information. 
The Bayesian approach discussed in this paper enables stronger solutions to one of the biggest problems faced by the software engineering community: the challenge of making good decisions using data that is usually scarce and incomplete. We note that the predictive performance of the Bayesian approach (i.e., within 30 percent of the actuals 75 percent of the time) is significantly better than that of the previous multiple regression approach (i.e., within 30 percent of the actuals only 52 percent of the time) on our latest sample of 161 project datapoints. Index Terms—Bayesian analysis, multiple regression, software estimation, software engineering cost models, model calibration, prediction accuracy, empirical modeling, COCOMO, measurement, metrics, project management.
• 1
CLASSICAL MULTIPLE REGRESSION APPROACH
M
OST of the existing empirical software engineering cost
can be used on software engineering data. We also highlight
models are calibrated using the classical multiple regression approach. In Section 1, we focus on the overall description of the multiple regression approach and how it
the assumptions imposed by the multiple regression approach and the resulting problems faced by the software engineering community in trying to calibrate empirical models using this approach. The example dataset used to facilitate the illustration is the 1997 COCOMO II dataset • S. Chulani is with IBM Research, Center for Software Engineering, 650 which is composed of data from 83 completed projects Harry Rd., San Jose, CA 95120. This work was performed while doing c o H e c t e d from commercial, aerospace, government, and research at the Center for Software Engineering, University of Southern ,. . . , , , , , , ., r« „ , T California, Los Angeles. E-mail:
[email protected] com. nonprofit organizations [30]. It should be noted that with • B. Boehm is with the Center for Software Engineering, University of more than a dozen commercial implementations, COCOMO Southern California, Los Angeles, CA 90089. has been one of the most popular cost estimation models of E-mail:
[email protected] the'80s and'90s. COCOMO II [2] is a recent update of the • B. Steece is with the Marshall School of Business, University of Southern ,-w-, J t California, Los Angeles, CA 90089. E-mail:
[email protected]. popular COCOMO model published in [1]. Manuscript received 29 June 1998; revised 25 Feb. 1999. Multiple Regression expresses the response (e.g., Person Recommended for acceptance by D. Ross Jeffery. Months (PM)) as a linear function of k predictors (e.g., For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference 1EEECS Log Number 109543. Source Lines of Code, Product Complexity, etc.). This linear
OO98-5S89/99/$10.00 © 1999 IEEE
41
574
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 25, NO. 4, JULY/AUGUST 1999
function is estimated from the data using the ordinary least squares approach discussed in numerous books such as [18], [34]. A multiple regression model can be written as Vt = 0o + 0i xt\ + ••• + PkXtk
+ Et
RCON. We used a threshold value of 0.65 for high correlation among predictor variables. T a b l e 1s h o w s ^ h i S h l y correlated parameters that were aggregated for the 1997 calibration of COCOMO II. (1) The regression estimated the /? coefficients associated
where * u . . . xtk are the values of the predictor (or . •ii r i i • n r, regressor) variables for the xth observation, 0O • • • 0K are the coefficients to be estimated, st is the usual error term, and yt is the response variable for the j/th observation. Our example model, COCOMO II, has the following mathematical form:
with the scale factors and effort multipliers as shown below in the RCode (statistical software developed at University of M i n n e s o t a [8]) m n :
Data s e t = COCOMOII. 1997 Response = log[PM] - 1. 01*log[SIZE] Coefficient Estimates Label Estimate Std. Error t-value l.oi+VsFi 1? Constant_A 0.701883 0.231930 3.026 Effort = A x [Size] •=• ' x Y[EM{ (2) PMAT*log[SIZE] 0.000884288 0.0130658 0.068 PREC*log[SIZE] -0.00901971 0.0145235 0 . 6 2 1 >=1 where TEAM*log[SIZE] 0.00866128 0.0170206 0.509 FLEX*log[SIZE] 0.0314220 0.0151538 0.074 RBSL*log[SIZE] -0.00558590 0.019035 - 0 . 2 9 3 A = multiplicative constant log[PERS] 0.987472 0.230583 4.282 Size = Size of the software project measured in terms of KSLOC (thousands of Source Lines of Code) log [RELY] 0.798808 0.528549 1.511 [26] or function points [13] and programming l°g[CPLX] 1.13191 0.434550 2.605 1 log[RCON] 1.36588 0.273141 5 . 0 0 1 language log[PEXP] 0.696906 0.527474 1.321 SF = scale factor log[LTEX] -0.0421480 0.672890 0.063 EM = effort multiplier (refer to [2] for further log[DATA] 2.52796 0.723645 3.493 explanation of COCOMOII terms) logtRUSE] -0.444102 0.486480 0.913 • «. rv-^™,^ IT • u ilog[DOCU] -1.32818 0.664557 1.999 We can lineanze the COCOMO II equation by taking x „. 8 5 8 3 0 2 „_ 5 3 2 5 4 4 [ p v 0 L ] x _6 1 2 logarithms on both sides of the equahon as shown: 0.609259 0 . 920 l Q g[ A E x p ] 0 . 5 6 0 5 4 2 ln(PM) = f3o+ 0i-1.01 ••ln(Size) + fo-SF1-ln(Size) log[PCON] 0.488392 0.322021 1.517 + - . . +06-SF5-ln(Size) + 07-ln(EM1) logfTOOL] 2.49512 1.11222 2.243 „ , ' , log[SITE] 1.39701 0.831993 1.679 „ , , , , . + 0z-ln{EM2) + ---+022-ln{EMK) iog[SCED] 2.84074 0.774020 3.670 23
("I
As the results indicate, some of the regression estimates had counter intuitive values, i.e., negative coefficients (shown in
Using (3) and the 1997 COCOMO II dataset consisting of b o l d ) A sa n exam le c o n s i d e r 83 completed projects, we employed the multiple regression P ' the 'Develop for Reuse' (RUSE) , r.nD , ., ,. . . y. , , effort multiplier. This multiplicative parameter captures the r ,,.,. , , c . ,. r , , r . t , , approach [6]. Because some of the rpredictor variables had r r additional effort required to develop components intended high correlations, we formed new aggregate predictor f o r r e u s e on current or future projects. As shown in Table 2, variables. These included analyst capability and program- i f ^ R U S E r a t i n g i s E x t r a H i g h ( X H ) / iS/ d e v e l o p i n g f o r mer capability which were aggregated into personnel reuse across multiple product lines, it will cause an increase capability, PERS, and time constraints and storage con- in effort by a factor of 1.56. On the other hand, if the RUSE straints, which were aggregated into resource constraints, rating is Low (L), i.e., developing with no consideration of TABLE 1 COCOMO 11.1997 Highly Correlated Parameters TIME
STOR
ACAP
PCAP
New Parameter
JlME__L00p0
R C Q N
STOR
0.6860
1.0000
ACAP
-0.2855
-0.0769
1.0000
PCAP
I -0.2015
| -0.0027
[ 0.7339
„_,„„ | 1.0000
|
Legend: timing constraints (TIME); storage constraints (STOR); resource constraints (RCON); analyst capability (ACAP); programmer capability (PCAP); personnel capability (PERS)
42
CHULANI ET A L : BAYESIAN ANALYSIS OF EMPIRICAL SOFTWARE ENGINEERING COST MODELS
575
TABLE 2 RUSE—Expert-Determined a priori Rating Scale, Consistent with 12 Published Studies Develop for Reuse (RUSE) I Low(L) I Nominal (N) I High (H) I Very High (VH) I
Extra High
(XH)
__J
Definition 1997 A-priori Values
None |
0.89
|
Across project 1.00
Across Across product Across multiple program [ine product lines 1.16 | 1.34 | 1.56
TABLE 3 RUSE—Data-Determined Rating Scale, Contradicting 12 Published Studies Develop for Reuse (RUSE) I Low (L) I Nominal (N) I High (H) I Very High (VH) I
_J
Definition
None
1997 Data-Determined Values
1.05
Across project 1.00
future reuse, it will cause effort to decrease by a factor of 0.89. This rationale is consistent with the results of 12 published studies of the relative cost of developing for reuse compiled in [27] and was based on the expertjudgment of the researchers of the COCOMO II team. But, the regression results produced a negative coefficient for the j3 coefficient associated with RUSE. This negative coefficient results in the counter intuitive rating scale shown in Table 3, i.e., an XH rating for RUSE causes a decrease in effort and a L rating causes an increase in effort. Note the opposite trends followed in Table 2 and Table 3. A possible explanation [discussed in a study by [24] on "Why regression coefficients have the wrong sign"] for this contradiction may be the lack of dispersion in the responses associated with RUSE. A possible reason for this lack of dispersion is that RUSE is a relatively new cost factor and our follow-up indicated that the respondents did not have enough information to report its rating accurately during the data collection process. Additionally, many of the responses "I don't know" and "It does not apply" had to be coded as 1.0 (since this is the only way to code no impact on effort). Note (see Fig. 1 on the following page) that with slightly more than 50 of the 83 datapoints for RUSE being set at Nominal and with no observations at XH, the data for RUSE does not exhibit enough dispersion along the entire range of possible values for RUSE. While this is the familiar errors-in-variables problem, our data doesn't allow us to resolve this difficulty. Thus, the authors were forced to assume that the random variation in the responses for RUSE is small compared to the range of RUSE. The reader should note that all other cost models that use the multiple regression approach rarely explicitly state this assumption, even though it is implicitly assumed. Other reasons for the counterintuitive results include the violation of some of the restrictions imposed by multiple regression [4], [5]:
Across program 0.94
Across product line 0.88 '
Extra High
(XH)
Across multiple product lines 0.82
data has and continues to be one of the biggest challenges in the software estimation field. This is caused primarily by immature processes and management reluctance to release cost-related data. 2. There should be no extreme cases (i.e., outliers). Extreme cases can distort parameter estimates and such cases frequently occur in software engineering data due to the lack of precision in the data collection process. 3. The predictor variables (cost drivers and scale factors) should not be highly correlated. Unfortunately, because cost data is historically rather than experimentally collected, correlations among the predictor variables are unavoidable. The above restrictions are violated to some extent by the COCOMO II dataset. The COCOMO II calibration approach determines the coefficients for the five scale factors and the 17 effort multipliers (merged into 15 due to high correlation as discussed above). Considering the rule .of thumb that every parameter being calibrated should have at least five datapoints requires that the COCOMO II dataset have data
1. The number of datapoints should be large relative to the number of model parameters (i.e., there are many degrees of freedom). Unfortunately, collecting
Fig. 1. Distribution of RUSE.
43
576
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL.25, NO. 4, JULY/AUGUST 1999
Fig. 2. Example of the 10 percent weighted average approach: RUSE rating scale. TABLE 4 Prediction Accuracy of COCOMO 11.1997
COCOMO n.1997 PRED(-20) PREDC25) PREDC30)
Before Stratification by Organization 46% 49% 52%
After Stratification by Organization 49% 55% 64%
on at least 110 (or 100 if we consider that parameters are the trends followed by the a priori and the data-determined merged) completed projects. We note that the COCOMO curves are opposite. The data-determined curve has a 11.1997 dataset has just 83 datapoints. negative slope and as shown in Table 3, violates expert The second point above indicates that due to the opinion. imprecision in the data collection process, outliers can The resulting calibration of the COCOMO II model using occur causing problems in the calibration. For example, if a t h e 1 9 9 7 d a t a s e t o f 8 3 p r . t s o d u c e d e s n m a t e s within 30 particular organization had extraordinary documentation n t of ^ achjals 52 n t of ^ H m e for e f f o r t T h e requirements imposed by the management, then even a , . . . . j i ,.„ i i . i i . j i . n r „ . ., . , . , ,, . ., . . rpredichon accuracy improved to 64 percent when the data . , . , . , , • , , very small project would require a lot of effort that is expended in trying to meet the excessive documentation w a s s t r a t l f l e d m t o s e t s b a s e d o n * e 1 8 unique s o u r c e s o f t h e match to the life cycle needs. If the data collected simply d a t a <see I 19 !' I 20 !' t 14 l f o r f u r t h e r confirmation of local used the highest DOCU rating provided in the model, then calibration improving accuracy) The constant, A, of the the huge amount of effort due to the stringent documenta- COCOMO II equation was recalibrated for each of these sets tion needs would be underrepresented and the project i.e., a different intercept was computed for each set. The would have the potential of being an outlier. Outliers in constant value ranged from 1.23 to 3.72 for the 18 sets and software engineering data, as indicated above, are mostly yielded the prediction accuracies as shown in Table 4. due to imprecision in the data collection process. While the 10 percent weighted average procedure The third restriction imposed requires that no para- produced a workable initial model, we want to develop a meters be highly correlated. As described above, in the m o r e formal methodology for combining expert judgment COCOMO 11.1997 calibration, a few parameters were a n d s a m p l e information. A Bayesian analysis with an aggregated to alleviate this problem. informative prior provides such a framework. To resolve some of the counter intuitive results produced by the regression analysis (e.g., the negative coefficient for RUSE as explained above), we used a weighted average of 2 THE BAYESIAN APPROACH the expert-judgment results and the regression results, with 2.1 Basic Framework—Terminology and Theory only 10 percent of the weight going to the regression results T h e B i a n a p p r o a c h p r o v i d e s a formal process by which expert-judgment can be combined with sampling for all the parameters We selected the 10 percent weighting a factor because models with 40 percent and 25 percent . , . ,, ,° , , i • • j i .... , . , ,. . j - i: -n_information (data) to produce a robust a posteriori model, weighting factors produced less accurate predictions. This \ / r r Usin Ba es eorem we S y '^ ' can combine our two information pragmatic calibrating procedure moved the model parameters in the direction suggested by the sample data but sources as follows: retained the rationale contained within the a priori values. i /y i a\\ t IQ\ An example of the 10 percent application using the RUSE f(P\Y) = J{Y\ ^ effort multiplier is given in Fig. 2. As shown in the graph,
44
CHULANI ET AL.: BAYESIAN ANALYSIS OF EMPIRICAL SOFTWARE ENGINEERING COST MODELS
where P is the vector of parameters in which we are interested and Y is the vector of sample observations from the joint density function f(f3\Y). In (4), f(P\Y) is the posterior density function for j3 summarizing all the information about /?, f(Y \ ff) is the sample information and is algebraically equivalent to the likelihood function for fi, and f(/3) is the prior information summarizing the expertjudgment information about p. Equation (4) can be rewritten as: f(P\ Y) oc p | Y) / (P) In words, (5) means:
(5)
Posterior oc Sample x Prior In the Bayesian analysis context, the "prior" probabilities are the simple "unconditional" probabilities to the sample information; while the "posterior" probabilities are the "conditional" probabilities given sample and prior informay n The Bayesian approach makes use of prior information that is not part of the sample data by providing an optimal combination of the two sources of informarionAs described in many books on Bayesian analysis [21], [3], the posterior mean, 6", and variance, Var(b"), are defined as:
[ [
1
1 ~l
I" i
1
577
conducted a Delphi exercise [12], [1], [29]. Eight experts from the field of software estimation were asked to independently provide their estimate of the numeric values associated with each COCOMO II cost driver. Roughly half of these participating experts had been lead cost experts for large software development organizations and a few of them were originators of other proprietary cost models. All of the participants had at least 10 years of industrial software cost estimation experience. Based on the credibility of the participants, the authors felt very comfortable using t h e r e s u l t s o f t h e Delphi rounds as the prior information for tne purposes of calibrating COCOMO 11.1998. The reader is urged to refer to [32] where a study showed that estimates made by experts were more accurate than model-determined estimates. However, in [16] evidence showing the inefficiencies of expert judgment in other domains is highlighted. O ^ m e f i r s t r o u n d o f m e D e l P h i w a s completed, we summarized the results in terms of the means and the r a n e s of S **•* responses. These summarized results were q u i t e r a w w i t h significant variances caused by misunderstanding of the parameter definitions. In an attempt to improve the accuracy of these results and to attain better consensus among the experts, the authors distributed the results back to the participants. A better explanation of the behavior of the scale factors was provided since there was
-JJX'X + #*j x ^-X^fe + #*b*J and . _! (6)
highest variance in the scale factor responses. Each of the participants got a second opportunity to independently
— X'X + H* s J where X is the matrix of predictor variables, s is the variance of the residual for the sample data and H* and 6* are the precision (inverse of variance) and mean of the prior information, respectively From (6), it is clear that in order to determine the Bayesian posterior mean and variance, we need to determine the mean and precision of the prior information and the sampling information. The next two subsections describe the approach taken to determine the prior and sampling information, followed by a section on the Bayesian a posteriori model.
refine his/her response based on the responses of the rest of the participants in round 1. The authors felt that for the 17 effort multipliers the summarized results of round 2 were representative of the real world phenomena and decided to use these as the a priori information. But, for the five scale factors, the authors conducted a third round and made sure that the participants had a very good understanding of the exponential behavior of these parameters. The results of the third round were used as a priori information for the five scale factors. Please note that is the prior variance for any parameter is zero (in our case, if all experts responded the same value) then the Bayesian approach will completely rely o n expert opinion. However, this construct is inoperative since not surprisingly in the software field, disagreement and hence variability amongst the experts exists. Table 5 provides the a priori set of values for the RUSE parameter, i.e., the Develop for Reuse parameter. As
2.2 Prior Information To determine the prior information for the coefficients (i.e., b* and H") for our example model, COCOMO II, we
TABLE 5 COCOMO II 1998 "a priori" Rating Scale for Develop for Reuse (RUSE)
Develop for Reuse (RUSE)
Productivity Range
Low (L)
Nominal (N)
High (H)
Very High (VH)
Extra High (XH)
Definition
Least Productive Rating/ Most Productive Rating
None
Across project
Across program
Across product line
Across multiple product lines
Mean=1.73 I Variance = 0.05
0.89
1.0
1.15
1.33
1.54
1998A-priori Values
45
578
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL.25, NO. 4, JULY/AUGUST 1999
discussed in Section 2, this multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects. As shown in Table 5, if the RUSE rating is Extra High (XH), i.e., developing for reuse across multiple product lines, it will cause an increase in effort by a factor of 1.54. On the other hand, if the RUSE rating is Low (L), i.e., developing with no consideration of future reuse, it will cause effort to decrease by a factor of 0.89. The resulting range of productivity for RUSE is 1.73 (= 1.54/0.89) and the variance computed from the second Delphi round is 0.05. Comparing the results of Table 5 with the expert-determined a priori rating scale for the 1997 calibration illustrated in Table 2 validates the strong consensus of the experts in the Productivity Range of o ~ 1.7. 2.3 Sample Information The sampling information is the result of a data collection activity initiated in September 1994, soon after the initial publication of the COCOMO II description [2]. Affiliates of the Center for Software Engineering at the University of Southern California provided most of the data [30]. These organizations represent the commercial, aerospace, and federally funded research and development center (FFRDC) sectors of software development. Data of completed software projects is recorded on a data collection form that.asks between 33 and 59 questions depending on the degree of source code reuse [30]. A question asked very frequently is the definition of software size, i.e., what defines a line of source code or a Function Point (FP)? Appendix Bin the Model Definition Manual [30] defines a logical line of code using the framework described in [26], and [13] gives details on the counting rules of FPs. In spite of the definitions, the data collected to date exhibits local variations caused by differing interpretations of the counting rules. Another parameter that has different definitions within different organizations in effort, i.e., what is a person months (PM)? In COCOMO II, we define a PM as 152 person/hr. But, this varies from organization to organization. This information is usually derived from time cards maintained by employees. But, uncompensated overtime hours are illegal to report in time cards and hence do not get accounted for in the PM count. This leads to variations in the data reported and the authors took as much caution as possible while collecting the data. Variations also occur in the understanding of the subjective rating scale of the scale factors and effort multipliers [9] developed a system to alleviate this problem and help users apply cost driver definitions consistently for the PRICE S model. For example, a very high rating for analyst capability in one organization could be equivalent to a nominal rating in another organization. All these variations suggest that any organization using a parametric cost model should locally calibrate the model to produce better estimates. Please refer to the local calibration results discussed in Table 4. The sampling information includes data on the response variable, effort in person months (PM), where 1 PM = 152 hr and predictor variables such as actual size of the software in KSLOC (thousands of Source Lines of Code adjusted for breakage and reuse). The database has grown from 83
46
datapoints in 1997 to 161 datapoints in 1998. The disrributions of effort and size for the 1998 database of 161 datapoints are shown in Fig. 3. As can be noted, both the histograms are positively skewed with the bulk of the projects in the database with effort less than 500 PM and size less than 150 KSLOC. Since the multiple regression approach based on least squares estimation assumes that the response variable is normally distributed, the positively skewed histogram for effort indicates the need for a transformation. We also want the relationships between the response variable and the p r e d i c tor variables to be linear. The histograms for size in F i 3 and Fig. 4 and the scatter plot in Fig. 5 show that a log t r a n s f o r m a t i o n is appropriate for size. Furthermore, the log transformations on effort and size are consistent with (2) and (3) above. The egression analysis done in RCode (statistical software developed at University of Minnesota, [8]) on the log transformed COCOMO II parameters using a dataset of 161 datapoints yield the following results: Data s e t = COCOMOII 1998 Response = log[PM] C o e f f i c i e n t Estimates Estimate Std. Error t-value Label 0.103346 9.304 Constant_A 0 .961552 0.0460578 20.015 ± [SIZE] 0.921827 0.684836 0.481078 1.424 ^ C ' l o g l S I Z E ] 1.10203 TEAM*log[SIZE] 0.323318 FLEX*log[SIZE] 0.354658 RESL*log[SIZE] 1.32890 log[PCAP] 1.20332 ^g[RELY] 0.641228 log[CPLX] 1.03515 log [TIME] 1.58101 log[STOR] 0.784218 log[ACAP] 0.926205 log[PEXP] 0.755345 1 og [ LTEX ] 0.171569 l o g [DATA] 0.783232 l o g [RUSE] -0.339964 log[DOCU] 2.05772 lOg[PV0L] 0.867162 logfAEXP] 0.137859 i og [PC0N] 0.488392 l O g [ TOOL ] 0.551063 iog[SITE] 0.674702 l ! 11858 log[SCED]
0.373961 0.497475 0.686944 0.637678 0.307956 0.246435 0.232735 0.385646 0.352459 0.272413 0.356509 0.416269 0.218376 0.286225 0.622163 0.227311 0.330482 0.322021 0.221514 0.498431 0^275329
2.947 0.650 0.516 2.084 3.907 2.602 4.448 4.100 2.225 3.400 2.119 0.412 3.587 -1.188 3.307 3.815 0.417 1.517 2.488 1.354 4^063
T h e above results provide the estimates for the /? coefficients associated with each of the predictor variables (see (3). The t-value (ratio between the estimate and corresponding standard error, where standard error is the square root of the variance) may be interpreted as the signal-to-noise ratio associated with the corresponding predictor variables. Hence, the higher the t-value, the stronger the signal (i.e., statistical significance) being sent by the predictor variable. These coefficients can be used to
CHULANI ET AL: BAYESIAN ANALYSIS OF EMPIRICAL SOFTWARE ENGINEERING COST MODELS
579
Fig. 3. Distribution of effort and size: 1998 dataset of 161 observations.
Fig. 4. Distribution of log transformed effort and size: 1998 dataset of 161 observations.
adjust the a priori Productivity Ranges (PRs) to determine the data-determined PRs for each of the 22 parameters. For example, the data-determined PR for RUSE = (1.73)"034 where 1.73 is the a priori PR as shown in Table 5. While the regression provides intuitively reasonable estimates for most of the predictor variables; the negative coefficient estimate for RUSE (as discussed earlier) and the magnitudes for the coefficients on Applications Experience (AEXP), Language and Tool Experience (LTEX), Development Flexibility FLEX, and Team Cohesion (TEAM), violate our prior opinion about the impact of these parameters on Effort (i.e., PM). The quality of the data probably explains some of the conflicts between the prior information and sample data. However, when compared to the results reported in Section 2, these regression results (using 161 datapoints) produced better estimates. Only, RUSE has a
Fig. 5. Correlation between log[effort] and log[size].
negative coefficient associated with it compared to PREC, RESL, LTEX, DOCU, and RUSE in the regression results using only 83 datapoints. Thus, adding more datapoints (which results in an increase in the degrees of freedom) reduced the problems of counterintuitive results. 2.4
Combining Prior and Sampling Information: Posterior Bayesian Update
As a means of resolving the above conflicts, we will now use the Bayesian paradigm as a means of formally combining prior expert judgment with our sample data. Equation (6) reveals that if the precision of the a priori information (H*) is bigger (or the variance of the a priori information is smaller) than the precision (or the variance) of the sampling information (l/s^X'X) the posterior values will be closer to the a priori values. This situation can arise when the gathered data is noisy as depicted in Fig. 6 for an example cost factor, Develop for Reuse. Fig. 6 illustrates that the degree-of-belief in the prior information is higher than the degree-of-belief in the sample data. As a consequence, a stronger weight is assigned to the prior information causing the posterior mean to be closer to the prior mean. On the other hand (not illustrated), if the precision of the sampling information (l/s2X'X) is larger than the precision of the prior information (H*), then a higher weight is assigned to the sampling information causing the posterior mean to be closer to the mean of the sampling data. The resulting posterior precision will always be higher than the a priori precision or the sample data precision. Note that if the prior variance of any parameter is zero, then the parameter will
47
580
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL.25, NO. 4, JULY/AUGUST 1999
Fig. 6. A posteriori Bayesian update in the presence of noisy data (develop for reuse, RUSE).
Fig. 7. Bayesian a posteriori productivity ranges.
be completely determined by the prior information, Although, this is a restriction imposed by the Bayesian approach, it is of little concern as the situation of complete consensus very rarely arises in the software engineering domain. The complete Bayesian analysis on COCOMO II yields the Productivity Ranges (ratio between the least productive parameter rating, i.e., the highest rating, and the most productive parameter rating, i.e., the lowest rating) illustrated in Fig. 7. Fig. 7 gives an overall perspective of the relative Software Productivity Ranges (PRs) provided by the COCOMO 11.1998 parameters. The PRs provide insight on identifying the high payoff areas to focus on in a software productivity improvement activity. For example, Product Complexity (CPLX) is the highest payoff parameter and Development Flexibility (FLEX) is the lowest payoff parameter. The variance associated with each parameter is indicated along each bar. This indicates that even though
48
the two parameters, Multisite Development (SITE) and Documentation Match to Life Cycle Needs (DOCU), have the same PR, the PR of SITE (variance of 0.007) is predicted with more than five times the certainty than the PR of DOCU (variance of 0.037). The resulting COCOMO 11.1998 model calibrated to 161 datapoints produces estimates within 30 percent of the actuals 75 percent of the time for effort. If the model's multiplicative coefficient is calibrated to each of the 18 major sources of project data, the resulting model (with the coefficient ranging from 1.5 to 4.1) produces estimates within 30 percent of the actuals 80 percent of the time. It is therefore recommended that organizations using the model calibrate it using their own data to increase model accuracy and produce a local optimum estimate for similar type projects. From Table 6, it is clear that the prediction accuracy of the COCOMO 11.1998 model calibrated using the Bayesian approach is better than the prediction accuracy
CHULANI ET AL: BAYESIAN ANALYSIS OF EMPIRICAL SOFTWARE ENGINEERING COST MODELS
581
TABLE 6 Prediction Accuracies of C0COM0 11.1997, a priori COCOMO 11.1998 and Bayesian a posteriori COCOMO 11.1998 Before and After Stratification
Prediction Accuracy
COCOMO n.1997 (83 datapoints) Before After PREP(.2O) 46% 49% PRED(.2S) 49% 55% PREDQ30) I 52% | 64% |
COCOMO A-Priori COCOMO Bayesian A-Posteriori 11.1997(161 11.1998 (Based on Delphi COCOMO 11.1998 datapoints) Results -161 datapoints) (161 datapoints) Before After Before After Before After 54% 57% 48% 54% 63% 70% 59% 65% 55% 63% 68% 76% 63% | 67% | 61% | 65% | 75% | 80%
of the COCOMO 11.1997 model (used on the 1997 dataset of doesn't lend itself to alleviating the third problem of 83 datapoints as well as the 1998 dataset of 161 datapoints) measurement error as discussed in Section 2. and the A Priori COCOMO II Model which is based on the Consider a reduced model developed by using a backexpert opinion gathered via the Delphi exercise. The full-set ward elimination technique, of model parameters for the Bayesian a posteriori COCOData s e t = COCOMOII .1998 MO 11.1998 model are given in Appendix A. Response = log[PM] 2.5
Cross Validation of the Bayesian Calibrated
coefficient Estimates
Model The COCOMO 11.1998 Bayesian calibration discussed above uses the complete dataset of 161 datapoints. The prediction accuracies of COCOMO 11.1998 (depicted in Table 6) are based on the same dataset of 161 datapoints. That is, the calibration and validation datasets are the same. A natural question that arises in this context is how well will the model predict new software development projects? To address this issue, we randomly selected 121 observations for our calibration dataset with the remaining 40 becoming assigned to the validation dataset (i.e "new" data). We repeated this process 15 times creahng 15 calibration and 15
Label log [SIZE] PREC. log [SIZE] RESL.log[SIZE] log[PCAP] 1 Og[RELY] iog[CPLX]
log[TIME] log[PEXP] log[DATA] x
[Ddcu]
1
[pv0L]
PeC u TT Z a prediction TJ equation t' T We/ T then developed for eachTJlu of the log [TOOL]
Estimate 0.933775 0.0120687 0.0209697 2.09098 0.570849 1 02007 1.99341 0.609801 0 714392 2 '.39447 0 .974858
0.772463
Std. Error t-value 0.0318149 29.350 0.00349253 3.456 0.00529576 3.960 0.257052 8.134 0.244610 2.334 0 232718 4 383 0317108 6^286 0.296591 2.056 0 229479 3 115 0]589210 4 ] 064 0.227189 4.291
0.199663
3.869
1.44428 0.437796 3.299 15 calibration datasets. We used the resulting a posteriori l ^ I T E ] models to predict the development effort of the 40 "new" D I S C E D ] 1.06009 0.286442 3.701 projects in the validation datasets. This validation approach, The above results have no counterintuitive estimates for known as out-of-sample validation, provides a true mea- the coefficients associated with the predictor variables. The sure of the model's predictive abilities. This out-of-sample high t-rario associated with each of these variables indicates test yielded an average PRED(0.30) of 69 percent; indicating a significant impact by each of the predictor variables. The that on average, the out-of-sample validation results highest correlation among any two predictor variables is 0.5 produced estimates within 30 percent of the actuals 69 a n d i s b e t w e e n R E L Y and CPLX. Overall, the above results percentofthehme Hence we conclude that our Bayesian a r f i s t a t i s t i c a l ] a c c e p t a b l e . This COCOMO II reduced model has reasonably good predictive qualities. , . . ., , . _ ,, „ u 7 r n ° model gives the accuracy results shown in Table 7. 2.6 Reduced Model These accuracy results are a little worse that the results When calibrating COCOMO II, the three main problems we obtained by the Bayesian A Posteriori COCOMO 11.1998 faced in our data are: 1) lack of degrees of freedom, 2) some model but the model is more parsimonious. In practice, highly correlated predictor variables, and 3) measurement removing a predictor variable is equivalent to stipulating error for a few predictor variables. These limitations led to TABLE 7 some of the regression results being counterintuitive. The posterior Bayesian update discussed in Section 3.4 alle_ .. .. . . , „ . . „ « „ « , , _ . ,, .,„„„ r . , , J ,, r , . . , Prediction Accuracies of Reduced COCOMO 11.1998 t. expert-judgment viated these problems by incorporating derived prior information into the calibration process. But, I p r e d i c t i o n Accuracy I Reduced COCOMO H. 1998 such prior information may not be always available. So, what must one do in the absence of good prior information? PRED(.2O) 54% One way to address this problem is to reduce over fitting by PRED(.25) 64% developing a more parsimonious model. This alleviates the PRED(.3O) 73% first two problems listed above. Unfortunately, our data
49
582
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL.25, NO. 4, JULY/AUGUST 1999
TABLE 8 Acronym and Full Form Parameters " c b c O M O H Parameter
I VL I L I N I H I VH I XH I PR
PREC I Precendentedness " 6.20 4.96 3.72 2.48 1.24 0.00 1.33 FLEX Development Flexibility 5.07 4.05 3.04 2.03 1.01 0.00 1.26 RESL Architecture and Risk Resolution 7.07 5.65 4.24 2.83 1.41 0.00 1.38 TEAM Team cohesion 5.48 4.38 3.29 2.19 1.10 0.00 1.29 PMAT Process Maturity 7.80 6.24 4.68 3.12 1.56 0.00 1.43 RELY Required Software Reliability 0.82 0.92 1.00 1.10 1.26 1.53 DATA Data Base Size 0.90 1.00 1.14 1.28 1.42 CPLX Product Complexity 0.73 0.87 1.00 1.17 1.34 1.74 2.39 RUSE Develop for Reuse - 0.95 1.00 1.07 1.15 1.24 1.31 DOCU Documentation Match to Life-cycle Needs 0.81 0.91 1.00 1.11 1.23 1.52 TIME Time Constraint 1.00 1.11 1.29 1.63 1.63 STOR Storage Constraint 1.00 1.05 1.17 1.46 1.46 PVOL Platform Volatility 0.87 1.00 1.15 1.30 1.50 ACAP Analyst Capability 1.42 1.19 1.00 0.85 0.71 2.00 PCAP Programmer Capability 1.34 1.15 1.00 0.88 0.76 1.77 AEXP Applications Experience 1.22 1.10 1.00 0.88 0.81 1.51 PEXP Platform Experience 1.19 1.09 1.00 0.91 0.85 1.40 LTEX Language and Tool Experience 1.20 1.09 1.00 0.91 0.84 1.43 PCON Personnel Continuity T 2 9 ~ 1.12 1.00 0.90 0.81 1.59 TOOL Use of Software Tools 1.17 1.09 1.00 0.90 0.78 1.50 SITE Multi-Site Development 1.22 1.09 1.00 0.93 0.86 0.80 1.52 SCED I Required Development Schedule | 1.43 | 1.14 [ 1.00 | 1.00 | 1.00 | | 1.43 Multiplicative Effort Calibration Constant (A) = 2.94; Exponential Effort Calibration Constant (B) = 0.91
that variations in this variable have no effect on project effort. When our experts and our behavioral analyses tell us otherwise, we need extremely strong evidence to drop a variable. The authors believe that dropping variables for an individual organization via local calibration of the Bayesian Posteriori COCOMO 11.1998 model is a sounder option.
intuitive estimates when other traditional approaches are employed. We are currently using the approach to develop similar models to estimate the delivered defect density of software products and the cost of integrating commercialoff-the-shelf (COTS) software. APPENDIX A
3 CONCLUSIONS As shown in Table 6 and Table 7 of this paper, the estimation accuracy for the Bayesian a posteriori of COCOMO 11.1998 for the 161-project sample is better than the accuracies for the best version of COCOMO 11.1997, the 1998 a priori model, and a version of COCOMO 11.1998 with a reduced set of variables obtained by backward elimination. The improvement over the 1997 model provides evidence that the 1998 Bayesian variable-by-variable accommodation of expert prior information is stronger than the 1997 approach of one-factor-fits-all averaging of expert data and regression data. Overall, the class of Bayesian estimation models presented here provides a formal process for merging expert prior information with software engineering data. In many traditional models, such prior information is informally used to evaluate the "appropriateness" of the results. However, having a formal mechanism for incorporating expert prior information gives users of the cost model the flexibility to obtain predictions and calibrations based on a
different set of prior information.
Such Bayesian estimation models enable the engineering
This appendix has the acronyms and full forms of the 22 COCOMO II Post Architecture cost drivers and their associated COCOMO II. 1998 rating scales (see Table 8). F o ra further explanation of these parameters, please refer t0 PL [30]. ACKNOWLEDGMENTS This work was supported, both financially and technically, Contract No. F30602-96-C-0274, "KBSA Life Cycle Evaluation," and by the COCOMO II Program Affiliates: Aerospace, Air Force Cost Analysis Agency, A l l i e d Signal, AT&T, Bellcore, EDS, Raytheon E-Systems, GDE Systems, Hughes, IDA, JPL, Litton, Lockheed Martin, Loral, MCC, MDAC, Motorola, Northrop Grumman, Rational, Rockwell, SAIC, SEI, SPC, Sun, TI, TRW, USAF Rome Lab, U.S. Army Research Labs, and Xerox.
under AFRL
REFERENCES
g| ™ « ^ £ ! F £ f f i S tTuZe S ^ e ^
software community to more adequately address the
on Software Process and Product Measurement, J.D. Arthur and S.M.
challenge of making good decisions when the data is scarce , .
&
,
TTT
J i •
j• •
"
50
^ research interests j u n w a i e Development," L / e v e i u p i i i e i u , Software DUJIWUIC Lfi^. u i . o,5,iiu. p . Z.IU-£.AI, •His ««focus W K * J on ^H system's process process model, model, product product model, model, property property 1990. X990. integrating a softwarei system's via an an approach approach called called Model-Based Model-Based model via R.W. Jensen, Jensen, "An "An Improved Improved Macrolevel Macrolevel Software Software Development Development model, and success> model R.W. ware Engineering Engineering (MBASE). (MBASE). His His contributions contributions to Resource Estimation Estimation Model," Model," Proc. Proc. Fifth Fifth ISPA ISPA Conf, Conf, pp. pp. 88-92, Resource 88-92, Architecting and Software to Constructive Cost Model (COCOMO), the Spiral Apr. Apr. 1983. 1983. t n e f i e l d include: the Constructive Cost Model (COCOMO), the Spiral to process, and and the the Theory Theory W W (win-win) (win-win) approach approach to E.J. Johnson, Johnson, "Expertise "Expertise and and Decision Decision Under Under Uncertainty: Uncertainty: PerforPerfor- M o d e ' ° ' the software i process, E.J. it and mance Farr, software management mance and and Process," Process," The The Nature Nature of of Expertise, Expertise, Chi, Chi, Glaser,and Glaser,and Farr, andrequirements requirements determination. determination. He Hehas has served served eral scientific journals and as a member of the eds., 1988. eds., Lawrence Lawrence Earlbaum Earlbaum Assoc. Assoc. 1988. o n * n e board of several scientific journals and as a member of the ie IEEE C. C. Jones, Jones, Applied Applied Software Software Measurement. Measurement. McGraw-Hill, McGraw-Hill, 1997. 1997. governing board of the IEEEComputer ComputerSociety. Society.He Hecurrently currentlyserves servesas as Visitors for the CMU Software Engineering Institute. G.G. G.G. Judge, Judge, W. W. Griffiths, Griffiths, and and R. R. Carter Carter Hill, Hill, Learning Learning and and Practicing Practicing c n a i r o f t n e B o a r d o f v i s i t o r s f o r t h e C M U Software Engineering Institute. EEE, AIAA, and ACM, and a member of the IEEE Econometrics. Wiley, 1993. Econometrics Wiley 1993 He is a fellow of the IEEE, AIAA, and ACM, and member of the IEEE d the National Academy of a Engineering. C.F. Kemerer, "An Empirical Validation of Software Cost C.F. Kemerer, Empirical Validation of Software Cost Computer Society and the National Academy of Engineering. Models/' Comnt."An ACM, vol. 30, no. 5, pp. 416-429, 1987. Bert M. Steece is deputy dean of faculty and B.A. Kitchenham and N.R. Taylor, "Software Cost Models," ICL professor in the Information and Operations Technical J. vol. 1, May 1984. Management Department at the University of E.E. Learner, Specification Searches, ad hoc Inference with NonexperiSouthern California and is a specialist in mental Data. Wiley Series 1978. statistics. His research areas include statistical T.F. Masters, "An Overview of Software Cost Estimating at the modeling, time series analysis, and statistical National Security Agency," /. Parametrics, vol. 5, no. 1, pp. 72-84, computing. He is on the editorial board of 1985. Mathematical Reviews and has served on S.N. Mohanty, "Software Cost Estimation: Present and Future," various committees for the American Statistical Software Practice and Experience, vol. 11, pp. 103-121,1981. Association. Steece has consulted on a variety G.M. Mullet, "Why Regression Coefficients Have the Wrong / Quality Quality Technology, Technology 1976. 
of subjects: including forecasting, accounting, health care systems, legal Sign" 1976. Sign," /. L.H. Putnam and W. Myers, L.H.'putnam Myers', Measures for Excellence. Yourdon Press c a s e s ' a n d c h e m i c a l engineering. engineering. Computing Series, 1992. http://www.qsm.com/slim_estitnahttp://www.qsm.com/slim_estimate.html R.M. Park et al., "Software Size Measurement: A Framework for Counting Source Statements," CMU-SEI-92-TR-20, Software Eng. Inst., Pittsburgh, Pa. 1992. J.S. Poulin, Measuring Software Reuse, Principles, Practices and Economic Models. Addison-Wesley, 1997. H. Rubin, "ESTIMACS," IEEE, 1983. M CIKpnnprH and anH M. \A Schofield, fv-hnfiplrl "Estimating "FnHmatiiiff Software [22], which we will discuss throughout the pa per. In short, our work adds to the collection of machine learning techniques available to software engineers, and our analysis stresses the sensitivity of these approaches to the nature of historical data and other factors, A
Learning Decision and Regression Trees Many learning approaches have been developed that construct decision trees for classifying data [4], [17]. Fig. 1 illustrates a partial decision tree over Boehm's original 63 projects from which COCOMO was developed. Each project is described over dimensions such as AKDSI (i.e., adjusted delivered source instructions), TIME (i.e., the required system response time), and STOR (i.e., main memory limitations). The complete, set of attributes used to describe these data is given in Appendix A. The mean of actual project development months labels each leaf of the tree. Predicting development effort for a project requires that one descend the decision tree along an appropriate path, and the leaf value along that path gives the estimate of development effort of the new project. The decision tree in Fig. 1 is referred to as a regression tree, because the intent of categorization is to generate a prediction along a continuous dependent dimension (here, software development effort). There are many automatic methods for constructing decision and regression trees from data, but these techniques are typically variations on one simple strategy. A "top-down" strategy examines the data and selects an attribute that best divides the data into disjoint subpopulations. The most important aspect of decision and regression tree learners is the criterion used to select a "divisive" attribute during tree construction, In one variation the system selects the attribute with values that maximally reduce the mean squared error {MSE) of the dependent dimension (e.g., software development effort) observed in the training data. The MSE of any set, S,
53
128
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 21, NO. 2, FEBRUARY 1995
1=26.0
F U N C T I O N C A R T X (Instances)
( M £ A N = 5 7 3 ) [3i]
^ termination-condition(Instances)
XKDSI~ I >
1 0
I {MEAN = 1069) (2]
V, of Best-Attribute . I
— AKDSI— f >U1
Nol BUS —
(MEAN = 702) |1]
C A R T X ( { I | 1 is an Instance
-°
*»'
274.0
'
o.925
L_ (MEAN = 3836) [2]
MEAN _ 1600
m
2-partitions only. Similarly, techniques that 2-partition all attribute domains, for both continuous and nominally-valued J
(MEAN = 2250) p]
>315 5
- (MEAN = 9ooo) [2] Fig. 1. A regression tree over Boehm's 63 software project descriptions. Numbers in square brackets represent the number of projects classified under a no e '
of training examples taking on values yk in the continuous dependent dimension is: v~-» / _ _N2 ^—' -^ MSE(S) = where y is the mean of the yk values exhibited in S. The Values of each attribute, A{ partition the entire training data set, T, into subsets, T^, where every example in Tij takes on the same value, say Vj for attribute A;. The attribute, Ait that maximizes the difference: . „ „ „ _ MlFtT^ - Y^ MSFIT \ ~
*• '
2—i 3
^ li'
( i - e "finite,unordered) attributes, have been explored (e.g., [24]). For continuous attributes this bisection process operates as we have just described, but for a nominally-valued attribute all ways to group values of the attribute into two disjoint sets are considered. Suffice it t o say that treating all attributes as though they had the same number of values (e.g., 2) for purposes of attribute selection mitigates Certain biases that are present in some attribute selection measures (e.g., AMSE). As we will note again in Section IV, we ensure that all attributes are either continuous or binary-valued at the outset o f r e g ression-tree construction. j ^ t ^asic r e gression-tree learning algorithm is summarized in Fig. 2. The data set is first tested to see whether tree consanction is worthwhile; if all the data are classified identically or some other statistically-based criterion is satisfied, then expansion ceases. In this case, the algorithm simply returns a leaf labeled by the mean value of the dependent dimension found in the training data. If the data are not sufficiently distinguished, then the best divisive attribute according to AMSE is selected, the attribute's values are used to partition the data into subsets, and the procedure is recursively called on these subsets to expand the tree. When used to construct predictors along continuous dimensions, this general procedure is referred
is selected to divide the tree. Intuitively, the attribute that minimizes the error over the dependent dimension is used, While MSE values are computed over the training data, the inductive assumption is that selected attributes will similarly reduce error over future cases as well. This basic procedure of attribute selection is easily extended to allow continuously-valued attributes: all ordered 2-partitions of the observed values in the training data are examined, In essence, the dimension is split around each observed value. The effect is to 2-partition the dimension in k — 1 alternate ways (where k is the number of observed values), and the binary "split" that is best according to AMSE is considered along with other possible attributes to divide a regression-tree node. Such "splitting" is common in the tree of Fig. 1; see AKDSI, for example. Approaches have also been developed that split a continuous dimension into more than two ranges [9], [15], though we will assume
54
to as recursive-partitioning regression. Our experiments use a partial reimplementation of a system known as CART [4]. We refer to our reimplementation as CARTX. Previously, Porter and Selby [14], [15], [22], have investigated the use of decision-tree induction for estimating development effort and other resource-related dimensions. Their work assumes that if predictions over a continuous dependent dimension are required, then the continuous dimension is "discretized" by breaking it into mutually-exclusive ranges. More commonly used decision-tree induction algorithms, which assume discrete-valued dependent dimensions, are then applied to the appropriately classified data. In many cases this preprocessing of a continuous dependent dimension may be profitable, though regression-tree induction demonstrates that the general tree-construction approach can be adapted for direct manipulation of a continuous dependent dimension. This is also the case with the learning approach that we describe next.
SRINIVASAN AND FISHER: ESTIMATING SOFTWARE DEVELOPMENT EFFORT
129
Fig. 4. An example of function approximation by a regression tree. Fig. 3. A network architecture for software development effort estimation.
B. A Neural Network Approach to Learning A learning approach that is very different from that outlined above is BACKPROPAGATION, which operates on a network of simple processing elements as illustrated in Fig. 3. This basic architecture is inspired by biological nerve nets, and is thus called an artificial neural network. Each line between processing elements has a corresponding and distinct weight. Each processing unit in this network computes a nonlinear function of its inputs and passes the resultant value along as its output. The favored function is 1 7 \~T — I V^ Wili ] \ i / J where £ , wJi is a weighted sum of the inputs, Iit to a processing element T191 T251 The network generates output by propagating the initial inputs, shown on the leffhand side of Fig. 3, through subsequent layers of processing elements to the final output layer. This net illustrates the kind of mapping that we will use for estimating software development effort, with inputs corresponding to various project attributes, and the output line corresponding to the estimated development effort. The inputs and output are restricted to numeric values. For numerically-valued attributes this mapping is natural, but for nominal data such as LANG (implementation language), a numeric representation must be found. In this domain, each value of a nominal attribute is given its own input line. If the value is present in an observation then the input line is set to 1.0, and if the value is absent then it is set to 0.0. Thus, for a given observation the input line corresponding to an observed nominal value (e.g., COB) will be 1.0, and the others (e.g., FTN) will be 0.0. Our application requires only one network output, but other applications may require more than one. Details of the BACKPROPAGATION learning procedure are beyond the scope of this article, but intuitively the goal of learning is to train the network to generate appropriate output patterns for corresponding input patterns. To accomplish this,
comparisons
are made between a network's actual
output
pattern and an a priori known correct output pattern. The difference or error between each output line and its correct co ™sponding value is "backpropagated" through the net and S u i d e s t h e m ° d l f i c a t ' ™ of weights in a manner that will t e n d t 0 r e d u c e t h e c o l l e c t i v e e r r o r b e t w e e n a c t u a l a n d COITect out
Puts
on tralnln
t0 conver
Patterns
Se
in a
8 Pattems- ^ ^ Procedure h a s b e e n s h ° w n PP i n g s b e t w e e n i n P u t a n d °«put variet o f d o m a i n s [ 2 1 ] [25] y ' "
o n a c c u r a t e ma
a
Approximating Arbitrary Functions In trying to approximate an arbitrary function like development effort, regression trees approximate a function with a "staircase" function. Fig. 4 illustrates a function of one continuous, independent variable. A regression tree decomposes this function's domain so that the mean at each
l e a f r e f l e c t s the
8 e w i t h i n a l o c a l reSion- T h e hidden" processing elements that reside between the input and OUt ut la ers P y ° f a n e U r a l n e t w O r k d o rOU S hl y t h e S a m e t h i n g ' thou h me 8 approximating function is generally smoothed. The g r a n u l a n t y o f t h i s P h o n i n g of the function is modulated by ^ d e P t h o f a reSression tree or the number of hidden units ln a n e t w o r • . . . Each leamm S a P P r o a c h l s °°°P*»™t™, since it makes no a Priori a s s u m P t l o n s a b o u t * e f o r m o f * * f u n c t i o n b « n g a roximated PP - ""«*« ** a w i d e v a r i e t y o f Parametnc methods for function a P P r o x i m a t i o n s u c h a s regression methods of statistics a n d P^om^ interpolation methods of numerical a n a l s i s [10] O t h e r n o n a r a m e t r l c m e t h o d s i n c l u d e y P 8enetic al orithms and nearest nei hbor a S ^ S PProaches [1], though w e wiU not elaborate o n a n y of these a l t e m a t l v e s here -
D
function>s ran
- Sensitivity to Configuration Choices Both BACKPROPAGATION and CARTX require that the analyst make certain decisions about algorithm implementation. For example, BACKPROPAGATION can be used to train networks with differing numbers of hidden units. Too few hidden units can compromise the ability of the network to approximate a desired function. In contrast, too many hidden units can lead to "overfitting," whereby the learning system fits the "noise"
55
130
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 21, NO. 2, FEBRUARY 1995
present in the training data, as well as the meaningful trends that we would like to capture. BACKPROPAGATION is also typically trained by iterating through the training data many times. In general, the greater the number of iterations, the greater the reduction in error over the training sample, though there is no general guarantee of this. Finally, BACKPROPAGATION assumes that weights in the neural network are initialized to small, random values prior to training. The initial random weight settings can also impact learning success, though in many applications this is not a significant factor. There are other parameters that can effect BACKPROPAGATION 'S performance, but we will not explore these here. In CARTX, the primary dimension under control by the experimenter is the depth to which the regression tree is allowed to grow. Growth to too great a depth can lead to overfitting, and too little growth can lead to underfitting. Experimental results of Section IV-B illustrate the sensitivity of each learning system to certain configuration choices.
linear relationship and those close to 0.0 suggest no such relationship. Our experiments will characterize the abilities of BACKPROPAGATION and CARTX using the same dimensions as Kemerer: MRE and R2. As we noted, each system imposes certain constraints on the representation of data. There are a number of nominally-valued attributes in the project databases, including implementation language. BACKPROPAGATION requires that each value of such an attribute was treated as a binary-valued attribute that was either present (1) or absent (0) in each project. Thus, each value of a nominal attribute corresponded to a unique input to the neural network as noted in Section III-B. We represent each nominal attribute as a set of binary-valued attributes for CARTX as well. As we noted in Section III-A this mitigates certain biases in attribute selection measures such as AMSE. In contrast, each continuous attribute identified by Boehm corresponded to one input to the neural network. There was one output unit, which reflected a prediction of development effort and was also continuous. Preprocessing for the neural IV. OVERVIEW OF EXPERIMENTAL STUDIES network normalized these values between 0.0 and 1.0. A simple scheme was used where each value was divided by We conducted several experiments with CARTX and ^ m a j d m u m o f ^ y a , u e s for ^ a t t r i b u t e i n ^ ^ BACKPROPAGATION for the task of estimating software ^ ft h a s b e e n s h o w n ^ neural networks iricall development effort In general, each of our experiments c o n ^ckly tf a U ^ y a l u e s f ( j r ^ a t t r i b u t e s K]a&wl partitions historical data into samples used to train our learning ^ b e ( w e e n z e f o a n d Q n e NQ such norma]ization was systems, and disjoint samples used to test the accuracy of the d o n e for c s i n c e u w o u W h a y e n Q e f f e c t ofl CARTX>S trained classifier in predicting development effort. , _ f b r performance. For purposes of comparison, we refer to previous experimental results by Kemerer [11]. He conducted comparative analyses between SLIM, COCOMO, and FUNCTION POINTS on A Experiment 1: Comparison with Kemerer's Results a database of 15 projects.1 These projects consist mainly of O u r first business applications with a dominant proportion of them experiment compares the performance of machine leamin 8 algorithms with standard models of software devel(12/15) written in the COBOL language. In contrast, the COCOMO database includes instances of business, scientific, ° P m e n t e s t i m a t i o n u s i n g Kemerer's data as a test sample. To and system software projects, written in a variety of languages t e s t C A R T X a n d BACKPROPAGATION, we trained each system including COBOL, PL1, HMI, and FORTRAN. For compar- o n COCOMO'S database of 63 projects and tested on Kemerer's isons involving COCOMO, Kemerer coded his 15 projects using 1 5 P r o J e c t s ' F o r BACKPROPAGATION we initially configured the n e t w o r k with 3 3 in ut units 1 0 h i d d e n umts a n d l out ut the same attributes used by Boehm. P ' ' P One way that Kemerer characterized the fit between the pre- u n i t ' a n d r e 1 u i r e d m a t *** t r a m i n 8 s e t e r r o r r e a c h ° 0 0 0 0 1 o r dieted (M e 8 t ) and actual (Mact) development person-months c o n t i n u e f o r a maximum of 12 000 presentations of the training data T r a i n i n " 8 c e a s e d a f t e r 1 2 0 0 ° Presentations without conwas by the magnitude of relative error (MRE): verging to the required error criterion. 
The experiment was _ Mest ~ Mact done on an AT&T PC 386 under DOS. It required about 6-7 Mpp Mact hours for 12000 presentations of the training patterns. We __. ,. , ,.., , . actually repeated this experiment 10 times, though we only , , f . f , This measure normalizes the difference between actual and , . , , , , , ,. , ., report the results of one run here; we summarize the complete an analyst with a . . _ . . . . _ predicted development months, and supplies K yy J c ,. . . . . ' . , ,.„ , , set of expenments in Section IV-B. . , • •• i • c ^ v n J.U measure of the reliability of estimates by different models. T c , . . , . , . . In our initial configuration of CARTX, we allowed the TT However, when using a model developed at one site for . : „ . „ , „, , , , . , i •, , c , regression tree to grow to a maximum depth, where each , . . . . , . .. . ., estimation at another site, there may be local factors that , , a ... ... , , • , leaf represented a single software project description from the are not modeled, but which nonetheless impact development .-,„„„„„ . . „ , _• . f. . • „ , . . ., . ~ . ^, „„ . , . , COCOMO data. We were motivated initially to extend the tree • , , i_ J • i • effort m a systematic way. Thus, following earlier work u , ... . t^ ,r ",.,,• , , . to singleton leaves, because the data is very sparse relative to by Albrecht r[2], Kemerer did a linear regression/correlation , , • , , ., , , c ,. , . .. '., „ , ... ., , . , the number of dimensions used to describe each data point; , ... -„. . . . , analysis to calibrate the rpredictions, with Mest treated as . . . , . ,, , .. , : , , our concern is not so much with overfitting, as it is with , ,, . ., , t „ . „ ... ., the independent variable and Mact treated as the dependent • •, r™ > - , • > , • j . , ^ underfitting the data. Expenments with the regression tree , , . „ . „_ . ^, , variable. The R value indicates the amount of variation in , , . , , , ,. , . , . . , learner were performed on a SUN 3/60 under TTXTT UNIX, and the actual values accounted for by a linear relationship with JU . . „, ... u u • J r r , . , , _2 , , , . required about a minute. The predictions obtained from the the estimated values. R* values close to 1.0 suggest a strong , . , •, ,a •• »u ^ AN 00 " learning algorithms (after training on the COCOMO data) are 'we thank Professor Chris Kemerer for supplying this dataset. shown in Table I with the actual person-months of Kemerer's
56
SRINIVASAN AND FISHER: ESTIMATING SOFTWARE DEVELOPMENT EFFORT
131
TABLE I
TABLE II
CARTX AND BACKPROPAGATION ESTIMATES ON KEMERER'S DATA —| 1 n Actual CARTX BACKPROP
287.00
1893.30
8145
82 50
162 03
14 14
A COMPARISON OF LEARNING AND ALGORITHMIC APPROACHES. THE REGRESSION EQUATIONS GIVE Mact AS A FUNCTION OF Mest(x)
MRE(%) R-Square CARTX
364
0.83
BACKPROP 70 FuNC - PTS- 103 COCOMO 610
0.80
Regress. Eq. 102.5 + 0.075x
1107.31 11400.00 86.90 243.00 336.30 6600.00
1000.43 88.37 540.42
84.00 23.20
129.17 129.17
13.16 45.38
130.30
243.00
78.92
c a s e of
116.00 72.00 258.70 230.70 157.00 246.90
1272.00 129.17 243.00 243.00 243.00 243.00
113.18 15.72 80.87 28.65 44.29 39.17
by "calibrating" a model's predictions in a new environment, the adjusted model predictions can be reliably used. Along the R2 dimension learning methods provide significant fits to the data. Unfortunately, a primary weakness of these learning approaches is that their performance is sensitive to a number of implementation decisions. Experiment 2 illustrates some of
69.90
129.17
214.71
these sensitivities.
II S u M ^j
I772
modeiSi
058
0.70 I0'89
78.13 + 0.88* -37 + 0.96* 27.7 + 0.156* I49'9 + 0 M 2 x
Kemerer argues that high R2 suggests that
B. Experiment 2: Sensitivity of the Learning Algorithms 15 projects. We note that some predictions of CARTX do not correspond to exact person-month values of any COCOMO (training set) project, even though the regression tree was developed to singleton leaves. This stems from the presence of missing values for some attributes in Kemerer's data. If, during classification of a test project, we encounter a decision node that tests an attribute with an unknown value in the test project, both subtrees under the decision node are explored, In such a case, the system's final prediction of development effort is a weighted mean of the predictions stemming from each subtree. The approach is similar to that described in [17]. Table II summarizes the MRE and R2 values resulting from a linear regression of Mest and Mact values for the two learning algorithms, and results obtained by Kemerer with COCOMO-BASIC, FUNCTION POINTS, and SLIM. 2 These results
indicate that CARTX'S and BACKPROPAGATION 'S predictions show a strong linear relationship with the actual development effort values for the 15 test projects.3 On this dimension, the performance of the learning systems is less than SUM'S performance in Kemerer's experiments, but better than the other two models. In terms of mean MRE, BACKPROPAGATION does strikingly well compared to the other approaches, and CARTX'S MRE is approximately one-half that of SLIM and COCOMO. In sum, Expenment 1 rilustrates two points. In an absolute sense, none of the models does particularly well at estimating software development effort, particularly along the MRE dimension, but in a relative sense both learning approaches are competitive with traditional models examined by Kemerer on one dataset. In general, even though MRE is high in the ,
'Results are reported for COCOMO-BASIC (i.e., without cost drivers), which was comparable to the intermediate and detailed models on this data, in addition, Kemerer actually reported R2, which is R? adjusted for degrees of freedom^and which is slightly lower than the unadjusted R2 values that we report. R2 valuesreportedby Kemerer are 0.55,0.68, and 0.88 for FUNCTION POINTS, COCOMO, and SLIM, respectively.
,
3
, ,
,
. ,, .
. -c
„„
Both the slope and R value are significant at the 99% confidence level. The t coefficients for determining the significance of slope are 8.048 and 7.25 for CARTX and BACKPROPAGATION, respectively.
W e have noted m a t each leaming system assumes a number
«grow- regres. ] u d e d i n t h e neural n e t w o r k T h e s e c h o i c e s c a n significantly impact the success o f l e a m i n g . E x p e riment 2 illustrates the sensitivity of our t w o l e a m i n g systems relative t0 different choices along ^ ^ d i m e n s i o n s . I n particular, W e repeated Experiment 1 using BACKPROPAGATION with differing numbers of hidden units a n d u s i n g C A R T X w i t h d i f f e r i n g c o n s t r aints on regression-tree growth T a b l e m i l l u s t r a t e s o u r r e s u l t s w i t h BACKPROPAGATION. E a c h c e l l s u m m a r i z e s reS ults over 10 experimental trials, rather ta one ^ w h i c h w a s lepoTted in Section IVA for p r e s e n t a t i o n p u r p o s e s . Thus, Max, and Min values of important choices such as depth to which t0 sion
of
^ ^
R2
o r ±e
a n d
of hidden units i n c
mmhel
in
M R E
each
cell
of
Table
m
suggest
± e
to initial random weight settingS( w h k h w e r e different in e a c h o f t h e 1 0 e x p e r i m e n t a i t r f a l s T h e e x p e r i m e n t a l r e s u lts of Section IV-A reflect the .< best ,, ^ ^ 10 ^ s u m m a r i z e d in Table Ill's 10m d d e n _ u n i t c o l u m n . I n general, however, for 5, 10, and 15 h i d d e n ^ ^ MRE s c o r e s m s t i n c o m p a r a b i e o r s u p e r i o r to s o m e o f m e o t h e r m o d e l s s u m m a r i z e d i n Table II, and mean R2 s c o r e s s u g g e s t m a t s i g ni f i c ant linear relationships between p r e d i c t e d a n d a c t u a l development months are often found. Poor resuks obtained with n o hidden units indicate ^ i r n p O rt a nce o f m e s e for a c c u r a t e f u n c t i o n a p p r o x i m a t i o n . T h e p e r f o r m a n c e o f C A R T X c a n vary with the depth to w h i c h w e e x t e n d t h e regression tree. The results of Experiment l ^ repeated ^ a n d r e p r e S ent the case where required sensitivity
of
BACKPROPAGATION
accuracy over the training data is 0%—that is, the tree is
, , . , , _ _ . . . decomposed to singleton leaves. However, w e experimented with more conservative tree expansion policies, where CARTX extended the tree only to the point where an error threshold ( r e l a t i v e to the training data) is satisfied. In particular, trees , , .,__,
were grown to leaves where the mean MRE among projects .
,
._ l prespecified threshold that ranged from 0% to 500%. The MRE of each project at a leaf
a t a l e a f w a s less t h a n o r e< ual t 0 a
57
132
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 21, NO. 2, FEBRUARY 1995
TABLE III
TABLE V
BACKPROPAGATION RESULTS WITH VARYING NUMBERS OF HIDDEN NODES
SENSITIVITY OVER 20 RANDOMIZED TRIALS ON COMBINED COCOMO AND KEMERER'S DATA
(J Mean R2
I
Hidden Units I 5 10 15
CARTX
0.04 0.52 0.60 0.59
2
Max R Min R2
y
0.04 0.84 0.80 0.85 0.03 0.08 0.10 0.01
Mean MRE(%)
618
104
133
1
n
111
Mini? 2
1
0.00
BACKPROPAGATION 0.00 '
'
1 n Meanii 2 Max R2 — — —
0.48
0.97
0.40
0.99
'
U
TABLE VI
„
SENSITIVITY OVER 20 RANDOMIZED TRIALS ON KEMERER'S DATA
Max MRE{%) Min MRE(%)
915 I 369
163 72
254 70
161 77
Min ft2 Mean R2 Max i P CARTX BACKPROPAGATION
TABLE IV
0.00 0.03
0.26 0.39
0.90 0.90
'
"
CARTX RESULTS WITH VARYING TRAINING ERROR THRESHOLDS
R2
o% 25% 50% ioo% 200% 300% 400% 500%
0.83 0.62 o,63 0.60 0.59 0.59 o 59 o 60
tree configuration. The holdout method divides the available data into two sets; one set, generally the larger, is used to build decision/regression trees or train networks under different configurations. The second subset is then classified using each alternative configuration, and the configuration yielding the best results over this second subset is selected as the final configuration. Better yet, a choice of configuration may rest on a orm f °f resampling that exploits many randomized holdout trials. Holdout could have been used in this case by dividing the COCOMO data, but the COCOMO dataset is very small as is. Thus, we have satisfied ourselves with a demonstration of the sensitivity of each learning algorithm to certain configuration decisions. A more complete treatment of resampling and other strategies for making configuration choices can be found in Weiss and Kulikowski [24].
MRE(%) 364 404 461 870 931 931 931 835
was calculated by M — Mact Mact where M is the mean person-months development effort of projects at that node Table IV shows CARTX's performance when we vary the required accuracy of the tree over the training data. Table entries correspond to the MRE and R2 scores of the learned trees over the Kemerer test data. In general, there is degradation in performance as one tightens the requirement for regressiontree expansion, though there are applications in which this would not be the case. Importantly, other design decisions in decision and regression-tree systems, such as the manner in which continuous attributes are "split" and the criteria used to select divisive attributes, might also influence prediction accuracy. Selby and Porter [22] have evaluated different design choices along a number of dimensions on the success of decision-tree induction systems using NASA software project descriptions as a test-bed. Their evaluation of decision trees, not regression trees, limits the applicability of their findings to the evaluation reported here, but their work sets an excellent example of how sensitivity to various design decisions can be evaluated. The performance of both systems is sensitive to certain configuration choices, though we have only examined sensitivity relative to one or two dimensions for each system. Thus, it seems important to posit some intuition about how learning systems can be configured to yield good results on new data, given only knowledge of performance on training data. In cases where more training data is available a holdout method can be used for selecting an appropriate network or regression-
58
^ Experiment 3: Sensitivity to Training and Test Data Thus far, our results suggest that using learning algorithms to discover regularities in a historical database can facilitate predictions on new cases. In particular, comparisons between our experimental results and those of Kemerer indicate that relatively speaking, learning system performance is competitive with some traditional approaches on one common data set. However, Kemerer found that performance of algorithmic approaches was sensitive to the test data. For example, when a selected subset of 9 of the 15 cases was used to test the models, each considerably improved along the R2 dimension, By implication, performance on the other 6 projects was likely poorer. We did not repeat this experiment, but we did perform similarly-intended experiments in which the COCOMO and Kemerer data sets were combined into a single dataset of 78 projects; 60 projects were randomly selected for training the learning algorithms and the remaining 18 projects were used for test. Table V summarizes the results over 20 such randomized trials. The low average R2 should not mask the fact that many runs yielded strong linear relationships. For example, on 9 of the 20 CARTX runs, R2 was above 0.80. We also ran 20 randomized trials in which 10 of Kemerer's cases were used to train each learning algorithm, and 5 were used for test. The results are summarized in Table VI. This experiment was motivated by a study with ESTOR [23], a casebased approach that we summarized in Section II: an expert's protocols from 10 of Kemerer's projects were used to construct
SRINIVASAN AND FISHER: ESTIMATING SOFTWARE DEVELOPMENT EFFORT
133
a "case library" and the remaining 5 cases were used to test the model's predictions; the particular cases used for test were not reported, but ESTOR outperformed COCOMO and FUNCTION POINTS on this set. We do not know the robustness of ESTOR in the face of the kind of variation experienced in our 20 randomized trials (Table VI), but we might guess that rules inferred from expert problem solving, which ideally stem from human learning over a larger set of historical data, would render ESTOR more robust along this dimension. However, our experiments and those of Kemerer with selected subsets of his 15 cases suggest that care must be taken in evaluating the robustness of any model with such sparse data. In defense of Vicinanza's et al.P methodology, we should note that the creation of a case library depended on an analysis of expert protocols and the derivation of expert-like rules for modifying the predictions of best matching cases, thus increasing the "cost" of model construction to a point that precluded more complete randomized trials. Vicinanza et al. also point out that their study is best viewed as indicating ESTOR's "plausibility" as a good estimator, while broader claims require further study. In addition to experiments with the combined COCOMO and Kemerer data, and the Kemerer data alone, we experimented with the COCOMO data alone for completeness. When experimenting with Kemerer's data alone, our intent was to weakly explore the kind of variation faced by ESTOR. Using the COCOMO data we have no such goal in mind. Thus, this analysis uses an JV-fold cross validation or a "leave-one-out" methodology, which is another form of resampling. In particular, if a data sample is relatively sparse, as ours is, then for each of JV (i.e., 63) projects, we remove it from the sample set, train the learning system with the remaining TV - 1 samples, and then test on the removed project. MRE and R2 are computed over the N tests. CARTX's R2 value was 0.56 (144.48+0.74*, t = 8.82) and MRE was 125.2%. In this experiment we only report results obtained with CARTX, since a fair and comprehensive exploration of BACKPROPAGATION across possible network configurations is computationally expensive and of limited relevance. Suffice it to say that over the COCOMO data alone, which probably reflects a more uniform sample than the mixed COCOMO/Kemerer data, CARTX provides a significant linear fit to the data with markedly smaller MRE than its performance on Kemerer's data.
In sum, our initial results indicating the relative merits of a learning approach to software development effort estimation must be tempered. In fact, a variety of randomized experiments reveal that there is considerable variation in the performance of these systems as the nature of historical training data changes. This variation probably stems from a number of factors. Notably, there are many projects in both the COCOMO and Kemerer datasets that differ greatly in their actual development effort, but are very similar in other respects, including SLOC. Other characteristics, which are currently unmeasured in the COCOMO scheme, are probably responsible for this variation.

V. GENERAL DISCUSSION

Our experimental comparisons of CARTX and BACKPROPAGATION with traditional approaches to development effort estimation suggest the promise of an automated learning approach to the task. Both learning techniques performed well on the R2 and MRE dimensions relative to some other approaches on the same data. Beyond this cursory summary, our experimental results and the previous literature suggest several issues that merit discussion.
A. Limitations of Learning from Historical Data

There are well-known limitations of models constructed using historical data. In particular, attributes used to predict software development effort can change over time and/or differ between software development environments. Mohanty [13] makes this point in comparisons between the predictions of a wide variety of models on a single hypothetical software project. In particular, Mohanty surveyed approximately 15 models and methods for predicting software development effort. These models were used to predict the software development effort of a single hypothetical software project. Mohanty's main finding was that estimated effort on this single project varied significantly over models. Mohanty points out that each model was developed and calibrated with data collected within a unique software environment. The predictions of these models, in part, reflect underlying assumptions that are not explicitly presented in the data. For example, software development sites may use different development tools. These tools are constant within a facility and thus not represented explicitly in data collected at that facility, but this environmental factor is not constant across facilities. Differing environmental factors not reflected in data are undoubtedly responsible for much of the unexplained variance in our experiments. To some extent, the R2 derived from linear regression is intended to provide a better measure of a model's "fit" to arbitrary new data than MRE in cases where the environment from which a model was derived is different from the environment from which new data was drawn. Even so, these environmental differences may not be systematic in a way that is well accounted for by a linear model. In sum, great care must be taken when using a model constructed from data from one environment to make predictions about data from another environment. Even within a site, the environment may evolve over time, thus compromising the benefits of previously-derived models. Machine learning research has recently focussed on the problem of tracking the accuracy of a learned model over time, which triggers relearning when experience with new data suggests that the environment has changed [6]. However, in an application such as software development effort estimation, there are probably explicit indicators that an environmental change is occurring or will occur (e.g., when new development tools or quality control practices are implemented).

B. Engineering the Definition of Data

If environmental factors are relatively constant, then there is little need to explicitly represent these in the description of data. However, when the environment exhibits variance along some dimension, it often becomes critical that this variance be codified and included in data description. In this way,
differences across data points can be observed and used in model construction. For example, Mohanty argues that the desired quality of the finished product should be taken into account when estimating development effort. A comprehensive survey by Scacchi [20] of previous software production studies leads to considerable discussion on the pros and cons of many attributes for software project representation.

Thus, one of the major tasks is deciding upon the proper codification of factors judged to be relevant. Consider the dimension of response time requirements (i.e., TIME) which was included by Boehm in project descriptions. This attribute was selected by CARTX during regression-tree construction. However, is TIME an "optimal" codification of some aspect of software projects that impacts development effort? Consider that strict response time requirements may motivate greater coupling of software modules, thereby necessitating greater communication among developers and in general increasing development effort. If predictions of development effort must be made at the time of requirements analysis, then perhaps TIME is a realistic dimension of measurement, but better predictive models might be obtained and used given some measure of software component coupling.

In sum, when building models via machine learning or statistical methods, it is rarely the case that the set of descriptive attributes is static. Rather, in real-world success stories involving machine learning tools the set of descriptive attributes evolves over time as attributes are identified as relevant or irrelevant, the reasons for relevance are analyzed, and additional or replacement attributes are added in response to this analysis [8]. This "model" for using learning systems in the real world is consistent with a long-term goal of Scacchi [20], which is to develop a knowledge-based "corporate memory" of software production practices that is used for both estimating and controlling software development. The machine-learning tools that we have described, and other tools such as ESTOR, might be added to the repertoire of knowledge-acquisition strategies that Scacchi suggests. In fact, Porter and Selby [14] make a similar proposal by outlining the use of decision-tree induction methods as tools for software development.

C. The Limitations of Selected Learning Methods

Despite the promising results on Kemerer's common database, there are some important limitations of CARTX and BACKPROPAGATION. We have touched upon the sensitivity to certain configuration choices. In addition to these practical limitations, there are also some important theoretical limitations, primarily concerning CARTX. Perhaps the most important of these is that CARTX cannot estimate a value along a dimension (e.g., software development effort) that is outside the range of values encountered in the training data. Similar limitations apply to a variety of other techniques as well (e.g., nearest neighbor approaches of machine learning and statistics). In part, this limitation appears responsible for a sizable amount of error on test data. For example, in the experiment illustrating CARTX's sensitivity to training data using 10/5 splits of Kemerer's projects (Section IV-C), CARTX is doomed to being at least a factor of 3 off the mark when estimating the person-month effort required for the project requiring 23.20 M or the project requiring 1107.31 M; the projects closest to each among the remaining 14 projects are 69.90 M and 336.30 M, respectively.

The root of CARTX's difficulties lies in its labeling of each leaf by the mean of development months of projects classified at the leaf. An alternative approach that would enable CARTX to extrapolate beyond the training data would label each leaf by an equation derived through regression (e.g., a linear regression). After classifying a project to a leaf, the regression equation labeling that leaf would then be used to predict development effort given the object's values along the independent variables. In addition, the criterion for selecting divisive attributes would be changed as well. To illustrate, consider only two independent attributes, development team experience and KDSI, and the dependent variable of software development effort. CARTX would undoubtedly select KDSI, since lower (higher) values of KDSI tend to imply lower (higher) means of development effort. In contrast, development team experience might not provide as good a fit using CARTX's error criterion. However, consider a CART-like system that divides data up by an independent variable, finds a best fitting linear equation that predicts development effort given development team experience and KDSI, and assesses error in terms of the differences between predictions using this best fitting equation and actual development months. Using this strategy, development team experience might actually be preferred; even though lesser (greater) experience does not imply lesser (greater) development effort, development team experience does imply subpopulations for which strong linear relationships might exist between independent and dependent variables. For example, teams with lesser experience may not adjust as well to larger projects as do teams with greater experience; that is, as KDSI increases, development effort increases are larger for less experienced teams than more experienced teams. Recently, machine learning systems have been developed that have this flavor [18]. We have not yet experimented with these systems, but the approach appears promising.

The success of CARTX, and decision/regression-tree learners generally, may also be limited by two other processing characteristics. First, CARTX uses a greedy attribute selection strategy: tree construction assesses the informativeness of a single attribute at a time. This greedy strategy might overlook attributes that participate in more accurate regression trees, particularly when attributes interact in subtle ways. Second, CARTX builds one classifier over a training set of software projects. This classifier is static relative to the test projects; any subsequent test project description will match exactly one conjunctive pattern, which is represented by a path in the regression tree. If there is noise in the data (e.g., an error in the recording of an attribute value), then the prediction stemming from the regression-tree path matching a particular test project may be very misleading. It is possible that other conjunctive patterns of attribute values matching a particular test project, but which are not represented in the regression tree, could ameliorate CARTX's sensitivity to errorful or otherwise noisy project descriptions.
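The contrast drawn above between a mean-labeled leaf and a regression-labeled leaf can be sketched as follows; the single KDSI split, the toy project data, and the one-variable fit are illustrative assumptions of our own rather than the CARTX algorithm or the systems cited in [18].

    # Rough contrast between a CART-style leaf (mean of training efforts) and a
    # model-tree-style leaf (a least-squares line fitted within the leaf).
    # The split point and the project data are illustrative assumptions only.
    import numpy as np

    def fit_leaf(kdsi, effort, use_regression):
        if use_regression:
            slope, intercept = np.polyfit(kdsi, effort, 1)   # least-squares line
            return lambda x: slope * x + intercept
        mean_effort = float(np.mean(effort))
        return lambda x: mean_effort

    def build_stump(kdsi, effort, split, use_regression):
        """One split on KDSI; each side gets either a mean label or a regression label."""
        kdsi, effort = np.asarray(kdsi, float), np.asarray(effort, float)
        low, high = kdsi <= split, kdsi > split
        left = fit_leaf(kdsi[low], effort[low], use_regression)
        right = fit_leaf(kdsi[high], effort[high], use_regression)
        return lambda x: left(x) if x <= split else right(x)

    kdsi   = [10, 25, 40,  90, 150, 300]     # hypothetical project sizes (KDSI)
    effort = [20, 45, 70, 180, 310, 650]     # hypothetical person-months

    mean_tree  = build_stump(kdsi, effort, split=60, use_regression=False)
    model_tree = build_stump(kdsi, effort, split=60, use_regression=True)
    # A 500-KDSI project lies outside the training range: the mean-labeled leaf
    # cannot predict beyond its training efforts, while the regression leaf can.
    print(mean_tree(500), model_tree(500))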
The Optimized Set Reduction (OSR) strategy of Briand,
Basili, and Thomas [5] is related to the CARTX approach in several important ways, but may mitigate problems associated with CARTX: OSR conducts a more extensive search for multiple patterns that match each test observation. In contrast to CARTX's construction of a single classifier that is static relative to the test projects, OSR can be viewed as dynamically building a different classifier for each test project. The specifics of OSR are beyond the scope of this paper, but suffice it to say that OSR looks for multiple patterns that are statistically justified by the training project descriptions and that match a given test project. The predictions stemming from different patterns (say, for software development effort) are then combined into a single, global prediction for the test project. OSR was also evaluated in [5] using Kemerer's data for test, and COCOMO data as a (partial) training sample.4 The authors report an average MRE of 94% on Kemerer's data. However, there are important differences in experimental design that make a comparison between results with OSR, BACKPROPAGATION, and CARTX unreliable. In particular, when OSR was used to predict software development effort for a particular Kemerer project, the COCOMO data and the remaining 14 Kemerer projects were used as training examples. In addition, recognizing that Kemerer's projects were selected from the same development environment, OSR was configured to weight evidence stemming from these projects more heavily than those in the COCOMO data set. The sensitivity of results to this "weighting factor" is not described. We should note that the experimental conditions assumed in [5] are quite reasonable from a pragmatic standpoint, particularly the decision to weight more heavily those projects that are drawn from the same environment as the test project. These different training assumptions simply confound comparisons between experimental results, and OSR's robustness across differing training and test sets is not reported. In addition, like the work of Porter and Selby [14], [15], [22], OSR assumes that the dependent dimension of software development effort is nominally-valued for purposes of learning. Thus, this dimension is partitioned into a number of collectively exhaustive and mutually exclusive ranges prior to learning. Neither BACKPROPAGATION nor CARTX requires this kind of preprocessing. In any case, OSR appears unique relative to other machine learning systems in that it does not learn a static classifier; rather, it combines predictions from multiple, dynamically-constructed patterns. Whether one is interested in software development effort estimation or not, this latter facility appears to have merits that are worth further exploration.

In sum, CARTX suffers from certain theoretical limitations: it cannot extrapolate beyond the data on which it was trained, it uses a greedy tree expansion strategy, and the resultant classifier generates predictions by matching a project against a single conjunctive pattern of attribute values. However, there appear to be extensions that might mitigate these problems.

4. Our choice of using COCOMO data for training and Kemerer's data for test was made independently of [5].
VI. CONCLUDING REMARKS
This article has compared the CARTX and BACKPROPAGATION learning methods to traditional approaches for software effort estimation. We found that the learning approaches were competitive with SLIM, COCOMO, and FUNCTION POINTS as represented in a previous study by Kemerer. Nonetheless, further experiments showed the sensitivity of learning to various aspects of data selection and representation. Mohanty and Kemerer indicate that traditional models are quite sensitive as well.

A primary advantage of learning systems is that they are adaptable and nonparametric; predictive models can be tailored to the data at a particular site. Decision and regression trees are particularly well-suited to this task because they make explicit the attributes (e.g., TIME) that appear relevant to the prediction task. Once implicated, a process that engineers the data definition is often required to explain relevant and irrelevant aspects of the data, and to encode it accordingly. This process is best done locally, within a software shop, where the idiosyncrasies of that environment can be factored in or out. In such a setting analysts may want to investigate the behavior of systems like BACKPROPAGATION, CART, and related approaches [5], [14], [15], [22] over a range of permissible configurations, thus obtaining performance that is optimal in their environment.
APPENDIX A
DATA DESCRIPTIONS

The attributes defining the COCOMO and Kemerer databases are those used to develop the COCOMO model. The following is a brief description of the attributes and some of their suspected influences on development effort. The interested reader is referred to [3] for a detailed exposition of them. These attributes can be classified under four major headings: Product Attributes, Computer Attributes, Personnel Attributes, and Project Attributes.

A. Product Attributes
1) Required Software Reliability (RELY): This attribute measures how reliable the software should be. For example, if serious financial consequences stem from a software fault, then the required reliability should be high.
2) Database Size (DATA): The size of the database to be used by the software may affect development effort. Larger databases generally suggest that more time will be required to develop the software product.
3) Product Complexity (CPLX): The application area has a bearing on the software development effort. For example, communications software will likely have greater complexity than software developed for payroll processing.
4) Adaptation Adjustment Factor (AAF): In many cases software is not developed entirely from scratch. This factor reflects the extent that previous designs are reused in the new project.
B. Computer Attributes
1) Execution Time Constraint (TIME): If there are constraints on processing time, then the development time may be high.
2) Main Storage Constraint (STOR): If there are memory constraints, then the development effort will tend to be high.
3) Virtual Machine Volatility (VIRT): If the underlying hardware and/or system software change frequently, then development effort will be high.
C. Personnel Attributes
1) Analyst Capability (ACAP): If the analysts working on the software project are highly skilled, then the development effort of the software will be less than projects with less-skilled analysts.
2) Applications Experience (AEXP): The experience of project personnel influences the software development effort.
3) Programmer Capability (PCAP): This is similar to ACAP, but it applies to programmers.
4) Virtual Machine Experience (VEXP): Programmer experience with the underlying hardware and the operating system has a bearing on development effort.
5) Language Experience (LEXP): Experience of the programmers with the implementation language affects the software development effort.
6) Personnel Continuity Turnover (CONT): If the same personnel work on the project from beginning to end, then the development effort will tend to be less than similar projects experiencing greater personnel turnover.

D. Project Attributes
1) Modern Programming Practices (MODP): Modern programming practices like structured software design reduce the development effort.
2) Use of Software Tools (TOOL): Extensive use of software tools like source-line debuggers and syntax-directed editors reduces the software development effort.
3) Required Development Schedule (SCED): If the development schedule of the software project is highly constrained, then the development effort will tend to be high.

Apart from the attributes mentioned above, other attributes that influence the development effort are: programming language, and the estimated lines of code (unadjusted and adjusted for the use of existing software).

ACKNOWLEDGMENT
The authors would like to thank the three reviewers and the action editor for their many useful comments.
REFERENCES
[1] D. Aha, D. Kibler, and M. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37-66, 1991.
[2] A. Albrecht and J. Gaffney Jr., "Software function, source lines of code, and development effort prediction: A software science validation," IEEE Trans. Software Eng., vol. 9, pp. 639-648, 1983.
[3] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International, 1984.
[5] L. Briand, V. Basili, and W. Thomas, "A pattern recognition approach for software engineering data analysis," IEEE Trans. Software Eng., vol. 18, pp. 931-942, Nov. 1992.
[6] C. Brodley and E. Rissland, "Measuring concept change," in AAAI Spring Symp. Training Issues in Incremental Learning, 1993, pp. 98-107.
[7] K. De Jong, "Learning with genetic algorithms," Machine Learning, vol. 3, pp. 121-138, 1988.
[8] B. Evans and D. Fisher, "Overcoming process delays with decision tree induction," IEEE Expert, vol. 9, pp. 60-66, Feb. 1994.
[9] U. Fayyad, "On the induction of decision trees for multiple concept learning," doctoral dissertation, EECS Dept., Univ. of Michigan, 1991.
[10] L. Johnson and R. Riess, Numerical Analysis. Reading, MA: Addison-Wesley, 1982.
[11] C. F. Kemerer, "An empirical validation of software cost estimation models," Comm. ACM, vol. 30, no. 5, pp. 416-429, 1987.
[12] A. Lapedes and R. Farber, "Nonlinear signal prediction using neural networks: Prediction and system modeling," Los Alamos National Laboratory, Tech. Rep. LA-UR-87-2662, 1987.
[13] S. Mohanty, "Software cost estimation: Present and future," Software: Practice and Experience, vol. 11, pp. 103-121, 1981.
[14] A. Porter and R. Selby, "Empirically-guided software development using metric-based classification trees," IEEE Software, vol. 7, pp. 46-54, Mar. 1990.
[15] A. Porter and R. Selby, "Evaluating techniques for generating metric-based classification trees," J. Systems Software, vol. 12, pp. 209-218, July 1990.
[16] L. H. Putnam, "A general empirical solution to the macro software sizing and estimating problem," IEEE Trans. Software Eng., vol. 4, pp. 345-361, 1978.
[17] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[18] J. R. Quinlan, "Combining instance-based and model-based learning," in Proc. 10th Int. Machine Learning Conf., 1993, pp. 236-243.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
[20] W. Scacchi, "Understanding software productivity: Toward a knowledge-based approach," Int. J. Software Eng. and Knowledge Eng., vol. 1, pp. 293-320, 1991.
[21] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," Complex Systems, vol. 1, pp. 145-168, 1987.
[22] R. Selby and A. Porter, "Learning from examples: Generation and evaluation of decision trees for software resource analysis," IEEE Trans. Software Eng., vol. 14, pp. 1743-1757, 1988.
[23] S. Vicinanza, M. J. Prietula, and T. Mukhopadhyay, "Case-based reasoning in software effort estimation," in Proc. 11th Int. Conf. Information Systems, 1990.
[24] S. Weiss and C. Kulikowski, Computer Systems that Learn. San Mateo, CA: Morgan Kaufmann, 1991.
[25] J. Zurada, Introduction to Artificial Neural Networks. St. Paul, MN: West, 1992.
Krishnamoorthy Srinivasan received the M.B.A. in management information systems from the Owen Graduate School of Management, Vanderbilt University, and the M.S. in computer science from Vanderbilt University. He also received the Post Graduate Diploma in industrial engineering from the National Institute for Training in Industrial Engineering, Bombay, India, and the B.E. from the University of Madras, Madras, India. He is currently working as a Principal Software Engineer with Personal Computer Consultants, Inc., Washington, D.C. Before joining PCC, he worked as a Senior Specialist with McKinsey & Company, Inc., Cambridge, MA. His primary research interests are in exploring applications of machine learning techniques to real-world business problems.
Douglas Fisher (M'92) received his Ph.D. in information and computer science from the University of California at Irvine in 1987. He is currently an Associate Professor in computer science at Vanderbilt University. He is an Associate Editor of Machine Learning and IEEE Expert, and serves on the editorial board of the Journal of Artificial Intelligence Research. His research interests include machine learning, cognitive modeling, data analysis, and cluster analysis. An electronic addendum to this article, which reports any subsequent analysis, can be found at http://www.vuse.vanderbilt.edu/~dfisher/dfisher.html. Dr. Fisher is a member of the ACM and AAAI.
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 23, NO. 12, NOVEMBER 1997
Estimating Software Project Effort Using Analogies

Martin Shepperd and Chris Schofield

Abstract—Accurate project effort prediction is an important goal for the software engineering community. To date most work has focused upon building algorithmic models of effort, for example COCOMO. These can be calibrated to local environments. We describe an alternative approach to estimation based upon the use of analogies. The underlying principle is to characterize projects in terms of features (for example, the number of interfaces, the development method or the size of the functional requirements document). Completed projects are stored and then the problem becomes one of finding the most similar projects to the one for which a prediction is required. Similarity is defined as Euclidean distance in n-dimensional space where n is the number of project features. Each dimension is standardized so all dimensions have equal weight. The known effort values of the nearest neighbors to the new project are then used as the basis for the prediction. The process is automated using a PC-based tool known as ANGEL. The method is validated on nine different industrial datasets (a total of 275 projects) and in all cases analogy outperforms algorithmic models based upon stepwise regression. From this work we argue that estimation by analogy is a viable technique that, at the very least, can be used by project managers to complement current estimation techniques.

Index Terms—Effort prediction, estimation process, empirical investigation, analogy, case-based reasoning.
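The procedure summarized in the abstract can be sketched roughly as below; the feature set, the two-neighbor average, and all figures are assumptions made for illustration and do not reproduce the ANGEL tool itself.

    # Sketch of estimation by analogy: standardize features, find the nearest
    # completed projects by Euclidean distance, and use their known effort.
    # Feature names, the k=2 choice, and all numbers are illustrative assumptions.
    import numpy as np

    def estimate_by_analogy(completed, efforts, new_project, k=2):
        completed = np.asarray(completed, float)
        new_project = np.asarray(new_project, float)
        # Standardize every dimension so each feature carries equal weight.
        mean, std = completed.mean(axis=0), completed.std(axis=0)
        std[std == 0] = 1.0
        scaled = (completed - mean) / std
        target = (new_project - mean) / std
        # Euclidean distance in n-dimensional feature space.
        distances = np.sqrt(((scaled - target) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        return float(np.mean(np.asarray(efforts, float)[nearest]))

    # Columns: function points, number of interfaces, team size (hypothetical).
    completed = [[120, 4, 3], [300, 9, 6], [180, 5, 4], [450, 14, 9]]
    efforts   = [12.0, 41.0, 19.0, 70.0]          # person-months
    print(estimate_by_analogy(completed, efforts, new_project=[200, 6, 4]))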
1 INTRODUCTION

An important aspect of any software development project is to know how much it will cost. In most cases the major cost factor is labor. For this reason estimating development effort is central to the management and control of a software project. A fundamental question that needs to be asked of any estimation method is how accurate are the predictions. Accuracy is usually defined in terms of mean magnitude of relative error (MMRE) [6], which is the mean of absolute percentage errors:

MMRE = (100/n) * sum_{i=1..n} |E_i - Ê_i| / E_i    (1)

where there are n projects, E is the actual effort and Ê is the predicted effort. There has been some criticism of this measure, in particular that it is unbalanced and penalizes overestimates more than underestimates. For this reason Miyazaki et al. [19] propose a balanced mean magnitude of relative error measure as follows:

(100/n) * sum_{i=1..n} |E_i - Ê_i| / min(E_i, Ê_i)    (2)

This approach has been criticized by Hughes [8], among others, as effectively being two distinct measures that should not be combined. Other workers have used the adjusted R squared or coefficient of determination to indicate the percentage of variation in the dependent variable that can be "explained" in terms of the independent variables. Unfortunately, this is not always an adequate indicator of prediction quality where there are outlier or extreme values. Yet another approach is to use Pred(25), which is the percentage of predictions that fall within 25 percent of the actual value. Clearly the choice of accuracy measure to a large extent depends upon the objectives of those using the prediction system. For example, MMRE is fairly conservative with a bias against overestimates while Pred(25) will identify those prediction systems that are generally accurate but occasionally wildly inaccurate. In this paper we have decided to adopt MMRE and Pred(25) as prediction performance indicators since these are widely used, thereby rendering our results more comparable with those of other workers.

The remainder of this paper reviews work to date in the field of effort prediction (both algorithmic and nonalgorithmic) before going on to describe an alternative approach to effort prediction based upon the use of analogy. Results from this approach are compared with traditional statistical methods using nine datasets. The paper then discusses the results of a sensitivity analysis of the analogy method. An estimation process is then presented. The paper concludes by discussing the strengths and limitations of analogy as a means of predicting software project effort.
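For concreteness, the two accuracy indicators adopted above can be computed directly from their definitions; the effort figures in this sketch are invented purely to exercise the measures.

    # MMRE (formula (1)) and Pred(25) for a set of actual and predicted efforts.
    # The effort figures below are made up purely to exercise the two measures.
    import numpy as np

    def mmre(actual, predicted):
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return 100.0 * np.mean(np.abs(actual - predicted) / actual)

    def pred(actual, predicted, level=25):
        """Percentage of predictions within `level` percent of the actual value."""
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        within = np.abs(actual - predicted) / actual <= level / 100.0
        return 100.0 * np.mean(within)

    actual    = [100.0, 250.0, 40.0, 600.0]
    predicted = [ 90.0, 310.0, 40.0, 900.0]
    print(mmre(actual, predicted), pred(actual, predicted, 25))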
2 A BRIEF HISTORY OF EFFORT PREDICTION

Over the past two decades there has been considerable activity in the area of effort prediction with most approaches being typified as being algorithmic in nature. Well known examples include COCOMO [4] and function points [2].1

M. Shepperd and C. Schofield are with the Department of Computing, Bournemouth University, Talbot Campus, Poole, BH12 5BB, United Kingdom.
Manuscript received 10 Feb. 1997. Recommended for acceptance by D.R. Jeffery.
1. We include function points as an algorithmic method since they are dimensionless and therefore need to be calibrated in order to estimate effort.
Whatever the exact niceties of the model, the general form tends to be:

E = a S^b    (3)

where E is effort, S is size typically measured as lines of code (LOC) or function points, a is a productivity parameter, and b is an economies or diseconomies of scale parameter. COCOMO represents an approach that could be regarded as off the shelf. Here the estimator hopes that the equations contained in the cost model adequately represent their development environment and that any variations can be satisfactorily accounted for in terms of cost drivers or ...

... compares linear regression with a neural net approach using the COCOMO dataset. Both approaches seem to perform badly, with MMREs of 520.7 and 428.1 percent, respectively. Srinivasan and Fisher [27] also report on the use of a neural net with a back propagation learning algorithm. They found that the neural net outperformed other techniques and gave results of MMRE = 70 percent. However, it ...

| 1131.8 | 24.2 | 7.6 | 3 | 3 |

The above data is drawn from the dataset Telecom 1. ACT is actual effort, ACT DEV and ACT TEST are actual development and testing effort, respectively. CHNGS is the number of changes made as recorded by the configuration management system and FILES is the number of files changed by the particular enhancement project. Only FILES can be used for predictive purposes since none of the other information would be available at the time of making the prediction.

7 CONCLUSIONS

Accurate estimation of software project effort at an early stage in the development process is a significant challenge for the software engineering community. This paper has described a technique based upon the use of analogy, sometimes referred to as case-based reasoning. We have compared the use of analogy with prediction models based upon stepwise regression analysis for nine datasets, a total of 275 projects. A striking pattern emerges in that estimation by analogy produces a superior predictive performance in all cases when measured by MMRE and in seven out of nine cases for the Pred(25) indicator. Moreover, estimation by analogy is able to operate in circumstances where it is not possible to generate an algorithmic model, such as the dataset Real-time 1 where all the data was categorical in nature or the Mermaid N dataset where no statistically significant relationships could be found. We believe ...

ACKNOWLEDGMENTS

The authors are grateful to the Finnish TIEKE organization for granting the authors leave to use the Finnish dataset; to Barbara Kitchenham for supplying the Mermaid dataset; to Bob Hughes for supplying the dataset Telecom 2; and to anonymous staff for the provision of datasets Telecom 1 and Real-time 1. Many improvements have been suggested by Dan Diaper, Pat Dugard, Bob Hughes, Barbara Kitchenham, Steve MacDonell, Austen Rainer, and Bill Samson. This work has been supported by British Telecom, the U.K. Engineering and Physical Sciences Research Council under Grant GR/L37298, and the Defence Research Agency.
REFERENCES
[1] D.W. Aha, "Case-Based Learning Algorithms," Proc. 1991 DARPA Case-Based Reasoning Workshop, Morgan Kaufmann, 1991.
[2] A.J. Albrecht and J.R. Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Trans. Software Eng., vol. 9, no. 6, pp. 639-648, 1983.
[3] K. Atkinson and M.J. Shepperd, "The Use of Function Points to Find Cost Analogies," Proc. European Software Cost Modelling Meeting, Ivrea, Italy, 1994.
[4] B.W. Boehm, "Software Engineering Economics," IEEE Trans. Software Eng., vol. 10, no. 1, pp. 4-21, 1984.
[5] L.C. Briand, V.R. Basili, and W.M. Thomas, "A Pattern Recognition Approach for Software Engineering Data Analysis," IEEE Trans. Software Eng., vol. 18, no. 11, pp. 931-942, 1992.
[6] S. Conte, H. Dunsmore, and V.Y. Shen, Software Engineering Metrics and Models. Menlo Park, Calif.: Benjamin Cummings, 1986.
[7] J.M. Desharnais, "Analyse statistique de la productivitie des projets informatique a partie de la technique des point des fonction," masters thesis, Univ. of Montreal, 1989.
[8] R.T. Hughes, "Expert Judgement as an Estimating Method," Information and Software Technology, vol. 38, no. 2, pp. 67-75, 1996.
[9] D.R. Jeffery, G.C. Low, and M. Barnes, "A Comparison of Function Point Counting Techniques," IEEE Trans. Software Eng., vol. 19, no. 5, pp. 529-532, 1993.
[10] R. Jeffery and J. Stathis, "Specification Based Software Sizing: An Empirical Investigation of Function Metrics," Proc. NASA Goddard Software Eng. Workshop, Greenbelt, Md., 1993.
[11] N. Karunanithi, D. Whitley, and Y.K. Malaiya, "Using Neural Networks in Reliability Prediction," IEEE Software, vol. 9, no. 4, pp. 53-59, 1992.
[12] C.F. Kemerer, "An Empirical Validation of Software Cost Estimation Models," Comm. ACM, vol. 30, no. 5, pp. 416-429, 1987.
[13] B.A. Kitchenham and K. Kansala, "Inter-Item Correlations among Function Points," Proc. First Int'l Symp. Software Metrics, Baltimore, Md.: IEEE CS Press, 1993.
[14] B.A. Kitchenham and N.R. Taylor, "Software Cost Models," ICL Technology J., vol. 4, no. 3, pp. 73-102, 1984.
[15] P. Kok, B.A. Kitchenham, and J. Kirakowski, "The MERMAID Approach to Software Cost Estimation," Proc. ESPRIT Technical Week, 1990.
[16] J.L. Kolodner, Case-Based Reasoning. Morgan Kaufmann, 1993.
[17] J.E. Matson, B.E. Barrett, and J.M. Mellichamp, "Software Development Cost Estimation Using Function Points," IEEE Trans. Software Eng., vol. 20, no. 4, pp. 275-287, 1994.
[18] Y. Miyazaki and K. Mori, "COCOMO Evaluation and Tailoring," Proc. Eighth Int'l Software Eng. Conf., London: IEEE CS Press, 1985.
[19] Y. Miyazaki et al., "Method to Estimate Parameter Values in Software Prediction Models," Information and Software Technology, vol. 33, no. 3, pp. 239-243, 1991.
[20] T. Mukhopadhyay, S.S. Vicinanza, and M.J. Prietula, "Examining the Feasibility of a Case-Based Reasoning Model for Software Effort Estimation," MIS Quarterly, vol. 16, pp. 155-171, June 1992.
[21] A. Porter and R. Selby, "Empirically Guided Software Development Using Metric-Based Classification Trees," IEEE Software, no. 7, pp. 46-54, 1990.
[22] A. Porter and R. Selby, "Evaluating Techniques for Generating Metric-Based Classification Trees," J. Systems Software, vol. 12, pp. 209-218, 1990.
[23] E. Rich and K. Knight, Artificial Intelligence, second edition. McGraw-Hill, 1995.
[24] B. Samson, D. Ellison, and P. Dugard, "Software Cost Estimation Using an Albus Perceptron (CMAC)," Information and Software Technology, vol. 39, nos. 1/2, 1997.
[25] C. Serluca, "An Investigation into Software Effort Estimation Using a Back Propagation Neural Network," MSc dissertation, Bournemouth Univ., 1995.
[26] M.J. Shepperd, C. Schofield, and B.A. Kitchenham, "Effort Estimation Using Analogy," Proc. 18th Int'l Conf. Software Eng., Berlin: IEEE CS Press, 1996.
[27] K. Srinivasan and D. Fisher, "Machine Learning Approaches to Estimating Development Effort," IEEE Trans. Software Eng., vol. 21, no. 2, pp. 126-137, 1995.
[28] G.E. Wittig and G.R. Finnie, "Using Artificial Neural Networks and Function Points to Estimate 4GL Software Development Effort," Australian J. Information Systems, vol. 1, no. 2, pp. 87-94, 1994.

Martin Shepperd received a BSc degree (honors) in economics from Exeter University, an MSc degree from Aston University, and the PhD degree from the Open University, the latter two in computer science. He has a chair in software engineering at Bournemouth University. Professor Shepperd has written three books and published more than 50 papers in the areas of software metrics and process modeling.

Chris Schofield received a BSc degree (honors) in software engineering management from Bournemouth University, where he is presently studying for his PhD. His research interests include software metrics and cost estimation.
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 25, NO. 5, SEPTEMBER/OCTOBER 1999
A Critique of Software Defect Prediction Models Norman E. Fenton, Member, IEEE Computer Society, and Martin Neil, Member, IEEE Computer Society Abstract—Many organizations want to predict the number of defects (faults) in software systems, before they are deployed, to gauge the likely delivered quality and maintenance effort. To help in this numerous software metrics and statistical models have been developed, with a correspondingly large literature. We provide a critical review of this literature and the state-of-the-art. Most of the wide range of prediction models use size and complexity metrics to predict defects. Others are based on testing data, the "quality" of the development process, or take a multivariate approach. The authors of the models have often made heroic contributions to a subject otherwise bereft of empirical studies. However, there are a number of serious theoretical and practical problems in many studies. The models are weak because of their inability to cope with the, as yet, unknown relationship between defects and failures. There are fundamental statistical and data quality problems that undermine model validity. More significantly many prediction models tend to model only part of the underlying problem and seriously misspecify it. To illustrate these points the "Goldilock's Conjecture," that there is an optimum module size, is used to show the considerable problems inherent in current defect prediction approaches. Careful and considered analysis of past and new results shows that the conjecture lacks support and that some models are misleading. We recommend holistic models for software defect prediction, using Bayesian Belief Networks, as alternative approaches to the single-issue models used at present. We also argue for research into a theory of "software decomposition" in order to test hypotheses about defect introduction and help construct a better science of software engineering. Index Terms—Software faults and failures, defects, complexity metrics, fault-density, Bayesian Belief Networks. A
1 INTRODUCTION

Organizations are still asking how they can predict the quality of their software before it is used despite the substantial research effort spent attempting to find an answer to this question over the last 30 years. There are many papers advocating statistical models and metrics which purport to answer the quality question. Defects, like quality, can be defined in many different ways but are more commonly defined as deviations from specifications or expectations which might lead to failures in operation.

Generally, efforts have tended to concentrate on the following three problem perspectives [1], [2], [3]:
1) predicting the number of defects in the system;
2) estimating the reliability of the system in terms of time to failure;
3) understanding the impact of design and testing processes on defect counts and failure densities.

A wide range of prediction models have been proposed. Complexity and size metrics have been used in an attempt to predict the number of defects a system will reveal in operation or testing. Reliability models have been developed to predict failure rates based on the expected operational usage profile of the system. Information from defect detection and the testing process has been used to predict defects. The maturity of design and testing processes have been advanced as ways of reducing defects. Recently large complex multivariate statistical models have been produced in an attempt to find a single complexity metric that will account for defects.

This paper provides a critical review of this literature with the purpose of identifying future avenues of research. We cover complexity and size metrics (Section 2), the testing process (Section 3), the design and development process (Section 4), and recent multivariate studies (Section 5). For a comprehensive discussion of reliability models, see [4]. We uncover a number of theoretical and practical problems in these studies in Section 6, in particular the so-called "Goldilock's Conjecture."

Despite the many efforts to predict defects, there appears to be little consensus on what the constituent elements of the problem really are. In Section 7, we suggest a way to improve the defect prediction situation by describing a prototype, Bayesian Belief Network (BBN) based, model which we feel can at least partly solve the problems identified. Finally, in Section 8 we record our conclusions.

N.E. Fenton and M. Neil are with the Centre for Software Reliability, Northampton Square, London EC1V 0HB, England. E-mail: {n.fenton, martin}@csr.city.ac.uk.
Manuscript received 3 Sept. 1997; revised 25 Aug. 1998. Recommended for acceptance by R. Hamlet.

2 PREDICTION USING SIZE AND COMPLEXITY METRICS

Most defect prediction studies are based on size and complexity metrics. The earliest such study appears to have been Akiyama's, [5], which was based on a system developed at Fujitsu. It is typical of regression based "data fitting" models which became commonplace in the literature. The study showed that linear models of some simple metrics provide reasonable estimates for the total number of defects D (the dependent variable), which is actually defined as the sum of the defects found during testing and the defects found during two months after release. Akiyama computed four regression equations. Akiyama's first Equation (1) predicted defects from lines of code (LOC):
D = 4.86 + 0.018L    (1)

From (1) it can be calculated that a 1,000 LOC (i.e., 1 KLOC) module is expected to have approximately 23 defects. Other equations had the following dependent metrics: number of decisions C; number of subroutine calls J; and a composite C + J.

Another early study by Ferdinand, [6], argued that the expected number of defects increases with the number n of code segments; a code segment is a sequence of executable statements which, once entered, must all be executed. Specifically the theory asserts that for smaller numbers of segments, the number of defects is proportional to a power of n; for larger numbers of segments, the number of defects increases as a constant to the power n.

Halstead, [7], proposed a number of size metrics, which have been interpreted as "complexity" metrics, and used these as predictors of program defects. Most notably, Halstead asserted that the number of defects D in a program P is predicted by (2):

D = V / 3,000    (2)

where V is the (language dependent) volume metric (which like all the Halstead metrics is defined in terms of number of unique operators and unique operands in P; for details see [8]). The divisor 3,000 represents the mean number of mental discriminations between decisions made by the programmer. Each such decision possibly results in error and thereby a residual defect. Thus, Halstead's model was, unlike Akiyama's, based on some kind of theory. Interestingly, Halstead himself validated (1) using Akiyama's data. Ottenstein, [9], obtained similar results to Halstead.

Lipow, [10], went much further, because he got round the problem of computing V directly in (3), by using lines of executable code L instead. Specifically, he used the Halstead theory to compute a series of equations of the form:

D/L = A0 + A1 ln L + A2 (ln L)^2    (3)

where each of the Ai are dependent on the average number of usages of operators and operands per LOC for a particular language. For example, for Fortran A0 = 0.0047; A1 = 0.0023; A2 = 0.000043. For an assembly language A0 = 0.0012; A1 = 0.0001; A2 = 0.000002.

Gaffney, [11], argued that the relationship between D and L was not language dependent. He used Lipow's own data to deduce the prediction (4):

D = 4.2 + 0.0015 L^(4/3)    (4)

An interesting ramification of this was that there was an optimal size for individual modules with respect to defect density. For (4) this optimum module size is 877 LOC. Numerous other researchers have since reported on optimal module sizes. For example, Compton and Withrow of UNISYS derived the following polynomial equation, [12]:

D = 0.069 + 0.00156 L + 0.00000047 L^2    (5)

Based on (5) and further analysis Compton and Withrow concluded that the optimum size for an Ada module, with respect to minimizing error density, is 83 source statements. They dubbed this the "Goldilocks Principle" with the idea that there is an optimum module size that is "not too big nor too small."

The phenomenon that larger modules can have lower defect densities was confirmed in [13], [14], [15]. Basili and Perricone argued that this may be explained by the fact that there are a large number of interface defects distributed evenly across modules. Moller and Paulish suggested that larger modules tend to be developed more carefully; they discovered that modules consisting of greater than 70 lines of code have similar defect densities. For modules of size less than 70 lines of code the defect density increases significantly. Similar experiences are reported by [16], [17].

Hatton examined a number of data sets, [15], [18], and concluded that there was evidence of "macroscopic behavior" common to all data sets despite the massive internal complexity of each system studied, [19]. This behavior was likened to "molecules" in a gas and used to conjecture an entropy model for defects which also borrowed from ideas in cognitive psychology. Assuming the short-term memory affects the rate of human error he developed a logarithmic model, made up of two parts, and fitted it to the data sets.1 The first part modeled the effects of small modules on short-term memory, while the second modeled the effects of large modules. He asserted that, for module sizes above 200-400 lines of code the human "memory cache" overflows and mistakes are made, leading to defects. For systems decomposed into smaller pieces than this cache limit the human memory cache is used inefficiently storing links between the modules, thus also leading to more defects. He concluded that larger components are proportionally more reliable than smaller components. Clearly this would, if true, cast serious doubt over the theory of program decomposition which is so central to software engineering.

1. There is nothing new here since Halstead [3] was one of the first to apply Miller's finding that people can only effectively recall seven plus or minus two items from their short-term memory. Likewise the construction of a partitioned model contrasting "small" module effects on faults and "large" module effects on faults was done by Compton and Withrow in 1990 [7].

The realization that size-based metrics alone are poor general predictors of defect density spurred on much research into more discriminating complexity metrics. McCabe's cyclomatic complexity, [20], has been used in many studies, but it too is essentially a size measure (being equal to the number of decisions plus one in most programs). Kitchenham et al., [21], examined the relationship between the changes experienced by two subsystems and a number of metrics, including McCabe's metric. Two different regression equations resulted (6), (7):

C = 0.042 MCI - 0.075 N + 0.00001 HE    (6)
C = 0.25 MCI - 0.53 DI + 0.09 VG    (7)

For the first subsystem changes, C, was found to be reasonably dependent on machine code instructions, MCI, operator and operand totals, N, and Halstead's effort metric, HE. For the other subsystem McCabe's complexity metric, VG, was found to partially explain C along with machine code instructions, MCI, and data items, DI.

All of the metrics discussed so far are defined on code. There are now a large number of metrics available earlier in the life-cycle, most of which have been claimed by their proponents to have some predictive powers with respect
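Two of the figures quoted above can be checked by direct calculation: Equation (1) gives roughly 23 defects for a 1 KLOC module, and the defect density implied by Equation (4) is minimized near 877 LOC. The small sketch below performs that arithmetic; the grid search is our own device, not part of the cited studies.

    # Check two figures quoted in the text: Akiyama's Equation (1) at 1 KLOC, and
    # the module size that minimizes defect density under Gaffney's Equation (4).
    def akiyama_defects(loc):
        return 4.86 + 0.018 * loc                 # Equation (1)

    def gaffney_defects(loc):
        return 4.2 + 0.0015 * loc ** (4.0 / 3.0)  # Equation (4)

    print(akiyama_defects(1000))                  # about 23 defects for a 1 KLOC module

    # Defect density D/L under Equation (4), minimized by a simple grid search.
    best_loc = min(range(100, 3000), key=lambda loc: gaffney_defects(loc) / loc)
    print(best_loc)                               # close to the 877 LOC optimum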
to residual defect density. For example, there have been numerous attempts to define metrics which can be extracted from design documents using counts of "between module complexity" such as call statements and data flows; the most well known are the metrics in [22]. Ohlsson and Alberg, [23], reported on a study at Ericsson where metrics derived automatically from design documents were used to predict especially fault-prone modules prior to testing. Recently, there have been several attempts, such as [24], [25], to define metrics on object-oriented designs.

The advent and widespread use of Albrecht Function Points (FPs) raises the possibility of defect density predictions based on a metric which can be extracted at the specification stage. There is widespread belief that FPs are a better (one-dimensional) size metric than LOC; in theory at least they get round the problems of lack of uniformity and they are also language independent. We already see defect density defined in terms of defects per FP, and empirical studies are emerging that seem likely to be the basis for predictive models. For example, in Table 1, [26] reports the following bench-marking study, reportedly based on large amounts of data from different commercial sources.

TABLE 1

| Defect Origins | Defects per Function Point |
| Requirements | 1.00 |
| Design | 1.25 |
| Coding | 1.75 |
| Documentation | 0.60 |
| Bad fixes | 0.40 |
| Total | 5.00 |

3 PREDICTION USING TESTING METRICS

Some of the most promising local models for predicting residual defects involve very careful collection of data about defects discovered during early inspection and testing phases. The idea is very simple: you have n predefined phases at which you collect data dn (the defect rate). Suppose phase n represents the period of the first six months of the product in the field, so that dn is the rate of defects found within that period. To predict dn at phase n - 1 (which might be integration testing) you look at the actual sequence d1, ..., dn-1 and compare this with profiles of similar, previous products, and use statistical extrapolation techniques. With enough data it is possible to get accurate predictions of dn based on observed d1, ..., dm where m is less than n - 1. This method is an important feature of the Japanese software factory approach [27], [28], [29]. Extremely accurate predictions are claimed (usually within 95 percent confidence limits) due to stability of the development and testing environment and the extent of data collection. It appears that the IBM NASA Space shuttle team is achieving similarly accurate predictions based on the same kind of approach [18].

In the absence of an extensive local database it may be possible to use published bench-marking data to help with this kind of prediction. Dyer, [30], and Humphrey, [31], contain a lot of this kind of data. Buck and Robbins, [32], report on some remarkably consistent defect density values during different review and testing stages across different types of software projects at IBM. For example, for new code developed the number of defects per KLOC discovered with Fagan inspections settles to a number between 8 and 12. There is no such consistency for old code. Also the number of man-hours spent on the inspection process per major defect is always between three and five. The authors speculate that, despite being unsubstantiated with data, these values form "natural numbers of programming," believing that they are "inherent to the programming process itself." Also useful (providing you are aware of the kind of limitations discussed in [33]) is the kind of data published by [34] in Table 2.

TABLE 2
DEFECTS FOUND PER TESTING APPROACH

| Testing Type | Defects Found/hr |
| Regular use | 0.210 |
| Black box | 0.282 |
| White box | 0.322 |
| Reading/inspections | 1.057 |

One class of testing metrics that appear to be quite promising for predicting defects are the so called test coverage measures. A structural testing strategy specifies that we have to select enough test cases so that each of a set of "objects" in a program lie on some path (i.e., are "covered") in at least one test case. For example, statement coverage is a structural testing strategy in which the "objects" are the statements. For a given strategy and a given set of test cases we can ask what proportion of coverage has been achieved. The resulting metric is defined as the Test Effectiveness Ratio (TER) with respect to that strategy. For example, TER1 is the TER for statement coverage; TER2 is the TER for branch coverage; and TER3 is the TER for linear code sequence and jump coverage. Clearly we might expect the number of discovered defects to approach the number of defects actually in the program as the values of these TER metrics increase. Veevers and Marshall, [35], report on some defect and reliability prediction models using these metrics which give quite promising results. Interestingly Neil, [36], reported that the modules with high structural complexity metric values had a significantly lower TER than smaller modules. This supports our intuition that testing larger modules is more difficult and that such modules would appear more likely to contain undetected defects.

Voas and Miller use static analysis of programs to conjecture the presence or absence of defects before testing has taken place, [37]. Their method relies on a notion of program testability, which seeks to determine how likely a program will fail assuming it contains defects. Some programs will contain defects that may be difficult to discover by testing by virtue of their structure and organization. Such programs have a low defect revealing potential and may, therefore, hide defects until they show themselves as failures during operation. Voas and Miller use program mutation analysis to simulate the conditions that would cause a defect to reveal itself as a failure if a defect was indeed present. Essentially if program testability could be estimated before testing takes place the estimates could help predict those programs that would reveal less defects during testing even if they contained
defects. Bertolino and Strigini, [38], provide an alternative exposition of testability measurement and its relation to testing, debugging, and reliability assessment.
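As a purely illustrative sketch of how a TER is computed, the following fragment calculates TER1 (statement coverage) for a hypothetical module and test suite; the statement identifiers and test cases are invented and are not drawn from any of the studies cited above.

    # Statement-coverage TER (TER1) for a hypothetical module: the ratio of
    # statements executed by at least one test case to all statements.
    all_statements = set(range(1, 11))            # statements 1..10 in the module
    executed_by_test = {
        "t1": {1, 2, 3, 4},
        "t2": {1, 2, 5, 6, 7},
        "t3": {1, 8},
    }

    covered = set().union(*executed_by_test.values())
    ter1 = len(covered & all_statements) / len(all_statements)
    print(f"TER1 = {ter1:.2f}")                   # 0.80: eight of ten statements covered

TER2 and TER3 follow the same pattern, with branches or linear code sequences and jumps as the covered "objects" instead of statements.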
4 PREDICTION USING PROCESS QUALITY DATA
There are many experts who argue that the "quality" of the development process is the best predictor of product quality (and hence, by default, of residual defect density). This issue, and the problems surrounding it, is discussed extensively in [33]. There is a dearth of empirical evidence linking process quality to product quality. The simplest metric of process quality is the five-level ordinal scale SEI Capability Maturity Model (CMM) ranking. Despite its widespread popularity, there was until recently no evidence to show that level (n + 1) companies generally deliver products with lower residual defect density than level (n) companies. The Diaz and Sligo study, [39], provides the first promising empirical support for this widely held assumption. Clearly the strict 1-5 ranking, as prescribed by the SEI CMM, is too coarse to be used directly for defect prediction since not all of the processes covered by the CMM will relate to software quality. The best available evidence relating particular process methods to defect density concerns the Cleanroom method [30]. There is independent validation that, for relatively small projects (less than 30 KLOC), the use of Cleanroom results in approximately three errors per KLOC during statistical testing, compared with traditional development postdelivery defect densities of between five to 10 defects per KLOC. Also, Capers Jones hypothesizes quality targets expressed in "defect potentials" and "delivered defects" for different CMM levels, as shown in Table 3 [40].

TABLE 3
RELATIONSHIP BETWEEN CMM LEVELS AND DELIVERED DEFECTS

SEI CMM Level   Defect Potentials   Removal Efficiency (%)   Delivered Defects
1               5                   85                       0.75
2               4                   89                       0.44
3               3                   91                       0.27
4               2                   93                       0.14
5               1                   95                       0.05
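The delivered-defect figures in Table 3 appear to follow directly from the other two columns, with delivered defects = defect potential × (1 − removal efficiency); for example, 5 × (1 − 0.85) = 0.75 at level 1 and 1 × (1 − 0.95) = 0.05 at level 5.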
5 MULTIVARIATE APPROACHES
There have been many attempts to develop multilinear regression models based on multiple metrics. If there is a consensus of sorts about such approaches it is that the accuracy of the predictions is never significantly worse when the metrics set is reduced to a handful (say 3-6 rather than 30), [41]; a major reason for this is that many of the metrics are colinear; that is, they capture the same underlying attribute (so the reduced set of metrics has the same information content, [42]). Thus, much work has concentrated on how to select those small number of metrics which are somehow the most powerful and/or representative. Principal Component Analysis (see [43]) is used in some of the studies to reduce the dimensionality of many related metrics to a smaller set of "principal components," while retaining most of the variation observed in the original metrics.
For example, [42] discovered that 38 metrics, collected on around 1,000 modules, could be reduced to six orthogonal dimensions that account for 90 percent of the variability. The most important dimensions (size, nesting, and prime) were then used to develop an equation to discriminate between low and high maintainability modules.
Munson and Khoshgoftaar in various papers, [41], [43], [44], use a similar technique, factor analysis, to reduce the dimensionality to a number of "independent" factors. These factors are then labeled so as to represent the "true" underlying dimension being measured, such as control, volume and modularity. In [43] they used factor analytic variables to help fit regression models to a number of error data sets, including Akiyama's [5]. This helped to get over the inherent regression analysis problems presented by multicolinearity in metrics data. Munson and Khoshgoftaar have advanced the multivariate approach to calculate a "relative complexity metric." This metric is calculated using the magnitude of variability from each of the factor analysis dimensions as the input weights in a weighted sum. In this way a single metric integrates all of the information contained in a large number of metrics. This is seen to offer many advantages over using a univariate decision criterion such as McCabe's metric [44].
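The following is a minimal sketch of this kind of data reduction applied to an invented module-by-metric matrix; it illustrates the general technique only and is not a reconstruction of the models in [41], [42], or [43]. Orthogonal component scores are extracted from the correlation matrix of the standardized metrics and then combined, weighted by the variance each component explains, into a single "relative complexity" style score per module.

    import numpy as np

    # Invented data: rows are modules, columns are metrics (e.g., LOC, statement
    # count, cyclomatic complexity, fan-in/out, ...); column 1 is made deliberately
    # colinear with column 0 to mimic the colinearity discussed above.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 6))
    X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=30)

    Z = (X - X.mean(axis=0)) / X.std(axis=0)                 # standardize each metric
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Z @ eigvecs                                     # orthogonal component scores
    weights = eigvals / eigvals.sum()                        # variance explained per component
    relative_complexity = scores @ weights                   # one combined score per module
    print(relative_complexity[:5])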
6 A CRITIQUE OF CURRENT APPROACHES TO DEFECT PREDICTION
Despite the heroic contributions made by the authors of previous empirical studies, serious flaws remain and have detrimentally influenced our models for defect prediction. Of course, such weaknesses exist in all scientific endeavours but if we are to improve scientific enquiry in software engineering we must first recognize past mistakes before suggesting ways forward.
The key issues affecting the software engineering community's historical research direction, with respect to defect prediction, are:
• the unknown relationship between defects and failures (Section 6.1);
• problems with the "multivariate" statistical approach (Section 6.2);
• problems of using size and complexity metrics as sole "predictors" of defects (Section 6.3);
• problems in statistical methodology and data quality (Section 6.4);
• false claims about software decomposition and the "Goldilock's Conjecture" (Section 6.5).

6.1 The Unknown Relationship between Defects and Failures
There is considerable disagreement about the definitions of defects, errors, faults, and failures. In different studies defect counts refer to:
• postrelease defects;
• the total of "known" defects;
• the set of defects discovered after some arbitrary fixed point in the software life cycle (e.g., after unit testing).
The terminology differs widely between studies; defect rate, defect density, and failure rate are used almost interchangeably. It can also be difficult to tell whether a model is predicting discovered defects or residual defects. Because of these problems (which are discussed extensively in [45]) we have to be extremely careful about the way we interpret published predictive models.
Apart from these problems of terminology and definition, the most serious weakness of any prediction of residual defects or defect density concerns the weakness of defect count itself as a measure of software reliability.² Even if we knew exactly the number of residual defects in our system we have to be extremely wary about making definitive statements about how the system will operate in practice. The reasons for this appear to be:
• difficulty of determining in advance the seriousness of a defect; few of the empirical studies attempt to distinguish different classes of defects;
• great variability in the way systems are used by different users, resulting in wide variations of operational profiles. It is thus difficult to predict which defects are likely to lead to failures (or to commonly occurring failures).
The latter point is particularly serious and has been highlighted dramatically by [46]. Adams examined data from nine large software products, each with many thousands of years of logged use world wide. He charted the relationship between detected defects and their manifestation as failures. For example, 33 percent of all defects led to failures with a mean time to failure greater than 5,000 years. In practical terms, this means that such defects will almost never manifest themselves as failures. Conversely, the proportion of defects which led to a mean time to failure of less than 50 years was very small (around 2 percent). However, it is these defects which are the important ones to find, since these are the ones which eventually exhibit themselves as failures to a significant number of users. Thus Adams' data demonstrates the Pareto principle: a very small proportion of the defects in a system will lead to almost all the observed failures in a given period of time; conversely, most defects in a system are benign in the sense that in the same given period of time they will not lead to failures.
It follows that finding (and removing) large numbers of defects may not necessarily lead to improved reliability. It also follows that a very accurate residual defect density prediction may be a very poor predictor of operational reliability, as has been observed in practice [47]. This means we should be very wary of attempts to equate fault densities with failure rates, as proposed for example by Capers Jones (Table 4 [48]). Although highly attractive in principle, such a model does not stand up to empirical validation.

TABLE 4
DEFECT DENSITY (F/KLOC) VS. MTTF

Defect density (F/KLOC)   MTTF
20-30                     4-5 min
5-10                      1 hr
2-5                       several hours
0.5-1                     1 month

Defect counts cannot be used to predict reliability because, despite their usefulness from a system developer's point of view, they do not measure the quality of the system as the user is likely to experience it. The promotion of defect counts as a measure of "general quality" is, therefore, misleading. Reliability prediction should, therefore, be viewed as complementary to defect density prediction.

6.2 Problems with the Multivariate Approach
Applying multivariate techniques, like factor analysis, produces metrics which cannot be easily or directly interpreted in terms of program features. For example, in [43] a factor dimension metric, control, was calculated by the weighted sum (8):

    control = a1HNK + a2PRC + a3E + a4VG + a5MMC + a6Error + a7FMP + a8LOC    (8)

where the ai are derived from factor analysis. HNK was Henry and Kafura's information flow complexity metric, PRC is a count of the number of procedures, E is Halstead's effort metric, VG is McCabe's complexity metric, MMC is Harrison's complexity metric, and LOC is lines of code. Although this equation might help to avoid multicolinearity, it is hard to see how you might advise a programmer or designer on how to redesign the programs to achieve a better control metric value for a given module. Likewise the effects of such a change in module control on defects are less clear.
These problems are compounded in the search for an ultimate or relative complexity metric [43]. The simplicity of a single number is appealing, but the rules of measurement are based on identifying differing well-defined attributes with single standard measures [45]. Although there is a clear role for data reduction and analysis techniques such as factor analysis, this should not be confused with, or used instead of, measurement theory. For example, statement count and lines of code are highly correlated because programs with more lines of code typically have a higher number of statements. This does not mean that the true size of programs is some combination of the two metrics. A more suitable explanation would be that both are alternative measures of the same attribute. After all, centigrade and fahrenheit are highly correlated measures of temperature. Meteorologists have agreed a convention to use one of these as a standard in weather forecasts. In the United States temperature is most often quoted as fahrenheit, while in the United Kingdom it is quoted as centigrade. They do not take a weighted sum of both temperature measures. This point lends support to the need to define meaningful and standard measures for specific attributes rather than searching for a single metric using the multivariate approach.
2. Here we use the technical concept of reliability, defined as mean time to failure or probability of failure on demand, in contrast to the "looser" concept of reliability with its emphasis on defects.
6.3 Problems in Using Size and Complexity Metrics to Predict Defects
A discussion of the theoretical and empirical problems with many of the individual metrics discussed above may be
found in [45]. There are as many empirical studies (see, for example, [49], [50], [51]) refuting the models based on Halstead and McCabe as there are studies "validating" them. Moreover, some of the latter are seriously flawed. Here we concentrate entirely on their use within models used to predict defects.
The majority of size and complexity models assume a straightforward relationship with defects: defects are a function of size, or defects are caused by program complexity. Despite the reported high correlations between design complexity and defects, the relationship is clearly not a straightforward one. It is clear that it is not entirely causal because, if it were, we couldn't explain the presence of defects introduced when the requirements are defined. It is wrong to mistake correlation for causation. An analogy would be the significant positive correlation between IQ and height in children. It would be dangerous to predict IQ from height because height doesn't cause high IQ; the underlying causal factor is physical and mental maturation.
There are a number of interesting observations about the way complexity metrics are used to predict defect counts:
• the models ignore the causal effects of programmers and designers. After all, it is they who introduce the defects, so any attribution for faulty code must finally rest with individuals;
• overly complex programs are themselves a consequence of poor design ability or problem difficulty. Difficult problems might demand complex solutions and novice programmers might produce "spaghetti code";
• defects may be introduced at the design stage because of the overcomplexity of the designs already produced. Clerical errors and mistakes will be committed because the existing design is difficult to comprehend. Defects of this type are "inconsistencies" between design modules and can be thought of as quite distinct from requirements defects.

6.4 Problems in Data Quality and Statistical Methodology
The weight given to knowledge obtained by empirical means rests on the quality of the data collected and the degree of rigor employed in analyzing this data. Problems in either data quality or analysis may be enough to make the resulting conclusions invalid. Unfortunately some defect prediction studies have suffered from such problems. These problems are caused, in the main, by a lack of attention to the assumptions necessary for successful use of a particular statistical technique. Other serious problems include the lack of distinction made between model fitting and model prediction and the unjustified removal of data points or misuse of averaged data.
The ability to replicate results is a key component of any empirical discipline. In software development different findings from diverse experiments could be explained by the fact that different, perhaps uncontrolled, processes were used on different projects. Comparability over case studies might be better achieved if the processes used during development were documented, along with estimates of the extent to which they were actually followed.

6.4.1 Multicolinearity
Multicolinearity is the most common methodological problem encountered in the literature. Multicolinearity is present when a number of predictor variables are highly positively or negatively correlated. Linear regression depends on the assumption of zero correlation between predictor variables, [52]. The consequences of multicolinearity are manyfold; it causes unstable coefficients, misleading statistical tests and unexpected coefficient signs. For example, one of the equations in [21],

    C = 0.042MCI - 0.075N + 0.00001HE,

shows clear signs of multicolinearity. If we examine the equation coefficients we can see that an increase in the operator and operand total, N, should result in a decrease in changes, C, all things being equal. This is clearly counterintuitive. In fact, analysis of the data reveals that machine code instructions, MCI, operand and operator count, N, and Halstead's effort metric, HE, are all highly correlated [42]. This type of problem appears to be common in the software metrics literature and some recent studies appear to have fallen victim to the multicolinearity problem [12], [53].
Colinearity between variables has also been detected in a number of studies that reported a negative correlation between defect density and module size. Rosenberg reports that, since there must be a negative correlation between X (size) and 1/X, it follows that the correlation between X and Y/X (defects/size) must be negative whenever defects are growing at most linearly with size [54]. Studies which have postulated such a linear relationship are more than likely to have detected negative correlation, and therefore concluded that large modules have smaller defect densities, because of this property of arithmetic.
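Rosenberg's observation is easy to reproduce numerically. In the sketch below (all numbers are invented), defect counts are generated with a small constant component plus a term proportional to module size, i.e., growing at most linearly with size; the correlation between size and defects/size still comes out negative, purely because of the 1/X term.

    import numpy as np

    rng = np.random.default_rng(1)
    size = rng.uniform(50, 2000, size=500)                       # module sizes in LOC (invented)
    defects = 2 + 0.01 * size + rng.normal(scale=2, size=500)    # grows at most linearly with size
    defects = np.clip(defects, 0, None)

    density = defects / size
    print(np.corrcoef(size, defects)[0, 1])    # clearly positive: bigger modules have more defects
    print(np.corrcoef(size, density)[0, 1])    # negative, an artifact of dividing by size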
6.4.2 Factor Analysis vs. Principal Components Analysis
The use of factor analysis and principal components analysis solves the multicolinearity problem by creating new orthogonal factors or principal component dimensions, [43]. Unfortunately the application of factor analysis assumes the errors are Gaussian, whereas [55] notes that most software metrics data is non-Gaussian. Principal components analysis can be used instead of factor analysis because it does not rely on any distributional assumptions, but will on many occasions produce results broadly in agreement with factor analysis. This makes the distinction a minor one, but one that needs to be considered.

6.4.3 Fitting Models vs. Predicting Data
Regression modeling approaches are typically concerned with fitting models to data rather than predicting data. Regression analysis typically finds the least-squares fit to the data and the goodness of this fit demonstrates how well the model explains historical data. However, a truly successful model is one which can predict the number of defects discovered in an unknown module. Furthermore, this must be a module not used in the derivation of the model. Unfortunately, perhaps because of the shortage of data, some researchers have tended to use their data to fit the model without being able to test the resultant model out on a new data set. See, for example, [5], [12], [16].
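The difference between fitting and predicting can be illustrated on synthetic data: a deliberately over-flexible model achieves a small least-squares error on the modules it was derived from, while its error on held-out modules (the quantity that actually matters for prediction) is typically much larger. All data below are invented.

    import numpy as np

    rng = np.random.default_rng(2)
    size = np.sort(rng.uniform(100, 1000, size=40))
    defects = 3 + 0.01 * size + rng.normal(scale=3, size=40)      # synthetic defect counts

    z = (size - size.mean()) / size.std()                         # normalised sizes for a stable fit
    fit_idx, hold_idx = np.arange(0, 40, 2), np.arange(1, 40, 2)  # half for fitting, half held out
    coeffs = np.polyfit(z[fit_idx], defects[fit_idx], deg=9)      # deliberately over-flexible model

    def sse(idx):
        return float(np.sum((np.polyval(coeffs, z[idx]) - defects[idx]) ** 2))

    print("error on modules used to fit the model:", sse(fit_idx))
    print("error on held-out modules:", sse(hold_idx))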
6.4.4 Removing Data Points
In standard statistical practice there should normally be strong theoretical or practical justification for removing data points during analysis. Recording and transcription errors are often an acceptable reason. Unfortunately it is often difficult to tell from published papers whether any data points have been removed before analysis, and if they have, the reasons why. One notable case is Compton and Withrow, [12], who reported removing a large number of data points from the analysis because they represented modules that had experienced zero defects. Such action is surprising in view of the conjecture they wished to test: that defects were minimised around an optimum size for Ada. If the majority of smaller modules had zero defects, as it appears, then we cannot accept Compton and Withrow's conclusions about the "Goldilock's Conjecture."

6.4.5 Using "Averaged" Data
We believe that the use of averaged data in analysis, rather than the original data, prejudices many studies. The study in [19] uses graphs, apparently derived from the original NASA-Goddard data, plotting average size in "statements" against "number of defects" or "defect density." Analysis of averages is one step removed from the original data and it raises a number of issues. Using averages reduces the amount of information available to test the conjecture under study and any conclusions will be correspondingly weaker. The classic study in [13] used average fault density of grouped data in a way that suggested a trend that was not supported by the raw data. The use of averages may be a practical way around the common problem where defect data is collected at a higher level, perhaps at the system or subsystem level, than is ideal (the ideal being defects recorded against individual modules or procedures). As a consequence data analysis must match defect data on systems against statement counts automatically collected at the module level. There may be some modules within a subsystem that are over-penalized when others keep the average high because the other modules in that subsystem have more defects, or vice versa. Thus, we cannot completely trust any defect data collected in this way.
Misuse of averages has occurred in one other form. In Gaffney's paper, [11], the rule for optimal module size was derived on the assumption that to calculate the total number of defects in a system we could use the same model as had been derived using module defect counts. The model derived at the module level is shown by (4) and can be extended to count the total defects in a system, DT, based on the module sizes Li, as in (9). The total number of modules in the system is denoted by N:

    DT = Σ_{i=1..N} Di = 4.2 N + 0.0015 Σ_{i=1..N} Li^(4/3)    (9)

Gaffney assumes that the average module size can be used to calculate the total defect count and also the optimum module size for any system, using (10):

    DT = 4.2 N + 0.0015 N (Σ_{i=1..N} Li / N)^(4/3)    (10)

However we can see that (9) and (10) are not equivalent. The use of (10) mistakenly assumes that the power of a sum is equal to a sum of powers.
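A quick numeric check with invented module sizes makes the point: because x^(4/3) is convex, summing the individual Li^(4/3) terms, as in (9), gives a larger total than raising the average size to the power 4/3, as in (10), whenever the sizes are unequal.

    # Invented module sizes (LOC); any unequal sizes give the same qualitative result.
    sizes = [100, 200, 400, 1600]
    N = len(sizes)

    dt_eq9 = 4.2 * N + 0.0015 * sum(L ** (4 / 3) for L in sizes)      # sum of powers, as in (9)
    dt_eq10 = 4.2 * N + 0.0015 * N * (sum(sizes) / N) ** (4 / 3)      # power of the average, as in (10)

    print(dt_eq9, dt_eq10)    # (9) exceeds (10): roughly 52 vs. 46 for these sizes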
6.5 The "Goldilock's Conjecture"
The results of inaccurate modeling and inference are perhaps most evident in the debate that surrounds the "Goldilock's Conjecture" discussed in Section 2: the idea that there is an optimum module size that is "not too big nor too small." Hatton, [19], claims that there is
"compelling empirical evidence from disparate sources to suggest that in any software system, larger components are proportionally more reliable than smaller components."
If these results were generally true the implications for software engineering would be very serious indeed. It would mean that program decomposition as a way of solving problems simply did not work. Virtually all of the work done in software engineering, extending from fundamental concepts, like modularity and information-hiding, to methods, like object-oriented and structured design, would be suspect because all of them rely on some notion of decomposition. If decomposition doesn't work then there would be no good reason for doing it.
Claims with such serious consequences as these deserve special attention. We must ask whether the data and knowledge exist to support them. These are clear criteria: if the data exist to refute the conjecture that large modules are better, and if we have a sensible explanation for this result, then a claim will stand. Our analysis shows that, using these criteria, these claims cannot currently stand. In the studies that support the conjecture we found the following problems:
• none define "module" in such a way as to make comparison across data sets possible;
• none explicitly compare different approaches to structuring and decomposing designs;
• the data analysis or quality of the data used could not support the results claimed;
• a number of factors exist that could partly explain the results which these studies have neglected to examine.
Additionally, there are other data sets which do not show any clear relationships between module size and defect density.
If we examine the various results we can divide them into three main classes. The first class contains models, exemplified by Fig. 1a, that show how defect density falls as module size increases. Models such as these have been produced by Akiyama, Gaffney, and Basili and Perricone. The second class of models, exemplified by Fig. 1b, differ from the first because they show the Goldilock's principle at work: here defect density rises as modules get bigger in size. The third class, exemplified by Fig. 1c, shows no discernible pattern whatsoever: here the relationship between defect density and module size appears random (no meaningful curvilinear models could be fitted to the data at all).
The third class of results shows the typical data pattern from a number of very large industrial systems. One data set was collected at the Tandem Corporation and was reported in [56]. The Tandem data was subsequently analyzed by Neil [42], using the principal components technique to produce a
Fig. 1. Three classes of defect density results: (a) Akiyama (1971), Basili and Perricone (1984), and Gaffney (1984); (b) Moeller and Paulish (1993), Compton and Withrow (1990), and Hatton (1997); (c) Neil (1992) and Fenton and Ohlsson (1997).
"combined measure" of different size measures, such as deci- 7 PREDICTING DEFECTS USING BBNs sion counts. This principal component statistic was then plot^ ft from Qur is jn Section 6 ^ ion ted against the number of changes made to the system mod£ ^ defects be or sizf meas. ules (these were rpredominantlyJ changes made to fix defects). . . r . , J , .. -L, , „ . , , , , ,. , ° ,. . ures alone presents only a skewed picture. The number ofc This defect data was standardized according to normal statis, . ,. ,. , , , »j ; U r.. *• . , . • , .i . c-^. . i defects discovered is clearly related to ithe amount of testing ,. _• .. J / I_- u t. tical practice. A polynomial regression curve was fitted to the A , ,. . . * i • u tu *u • -cperformed, as discussed above. A program which has never data in order to determine whether there was significant f j , c .. , / &... , , . ,. _. „ . , r . , .t T , ,. been tested, or used for that matter, will have a zero defect nonlinear effects of size on defect density. The results were , , . . . , ,. , ., , . . , , , , ,, . p. o count, even though its complexity may be very high. Moreb here in Fig. 2. *u * * « *• c i ypublished and are reproduced y „ r i_ i •i u • over, we can assume the test effectiveness of complex proDespite some parameters of the polynomial curve being . . , , .„_, , , ij u . . „ .„ . . i . u /., . j. r grams is relatively low, 137], and such programs could be statistically significant it is obvious that there is no discerm- ° , . uu*i u r j c * ir ,. 1 . . . I _i c J j i • . expected to exhibit a lower number of defects per line of ble relationship between defect counts and module size in j , • ,. . . *u -U-J » J r » rr . _ , , ,, ii j i • j code during testing because they hide defects more effec. _. ,. , . ,,, .. , , ., . the Tandem data set. Many small modules experienced no , , „ , , ~ . , -i ij u tively.T1 This could explain many of the empirical results that defects at all and the fitted polynomial curve would be use- , J , , , , , i . , .i. T . J c . . ... „. , •: , , , . • ! • • larger modules have lower defect densities. Therefore, cfrom c less for prediction. This data clearly refutes the simplistic , i r ^ u-iu i J i_ i • -r- _• •_ i T~ i jiu J i iu what we know of testability, we could conclude that large ,, *• j j i j r ^ u u assumptions typified by class Fig. la and lb models (these ji ii. i • i- T J J \ . modules contained many residual defects, rather than conmodels couldn t explain the Tandem data) nor accurately ^ rf m o d u l e s w e f e m o r e reljable (and im predict the defect density values of these Tandem modules. A «n ^ s o « w a r e d e c o m i t i o n is w r o n } F S similar analysis and result is presented in [47]. ^ c l e a r aH of ^ o b l e m s d e s c r i b e d i n Sec tion 6 a r e n o t VVe conclude that the relationship between defects and ^ tQ so]ved easil H o w e v e r w e believe that model. module size is too complex in general, to admit to straightfhe c o m l e x i t i e s o f s o f t w a r e development using new forward curve fitting models. These results, therefore, conb a b i l i s t i c t e c h n i q u e s presents a positive way forward, tradict the idea that there is a general law linking defect T h e s e m e t h o d s c a l l e d gayesian Belief Networks (BBNs), density and software component size as suggested by the a l l o w u s t 0 e x p r e s s c o m p l e x interrelations within the model "Goldilock's Conjecture."
Fig. 2. Tandem data: defect counts vs. size "principal component."
at a level of uncertainty commensurate with the problem. In this section, we first provide an overview of BBNs (Section 7.1) and describe the motivation for the particular BBN example used in defects prediction (Section 7.2). In Section 7.3, we describe the actual BBN.

7.1 An Overview of BBNs
Bayesian Belief Networks (also known as Belief Networks, Causal Probabilistic Networks, Causal Nets, Graphical Probability Networks, Probabilistic Cause-Effect Models, and Probabilistic Influence Diagrams) have attracted much recent attention as a possible solution for the problems of decision support under uncertainty. Although the underlying theory (Bayesian probability) has been around for a long time, the possibility of building and executing realistic models has only been made possible because of recent algorithms and software tools that implement them [57]. To date BBNs have proven useful in practical applications such as medical diagnosis and diagnosis of mechanical failures. Their most celebrated recent use has been by Microsoft, where BBNs underlie the help wizards in Microsoft Office; also the "intelligent" printer fault diagnostic system which you can run when you log onto Microsoft's web site is in fact a BBN which, as a result of the problem symptoms you enter, identifies the most likely fault.
A BBN is a graphical network that represents probabilistic relationships among variables. BBNs enable reasoning under uncertainty and combine the advantages of an intuitive visual representation with a sound mathematical basis in Bayesian probability. With BBNs, it is possible to articulate expert beliefs about the dependencies between different variables and to propagate consistently the impact of evidence on the probabilities of uncertain outcomes, such as "future system reliability." BBNs allow an injection of scientific rigor when the probability distributions associated with individual nodes are simply "expert opinions."
A BBN is a special type of diagram (called a graph) together with an associated set of probability tables. The graph is made up of nodes and arcs where the nodes represent uncertain variables and the arcs the causal/relevance relationships between the variables. Fig. 3 shows a BBN for an example reliability prediction problem. The nodes represent discrete or continuous variables; for example, the node "use of IEC 1508" (the standard) is discrete, having two values "yes" and "no," whereas the node "reliability" might be continuous (such as the probability of failure). The arcs represent causal/influential relationships between variables. For example, software reliability is defined by the number of (latent) faults and the operational usage (frequency with which faults may be triggered). Hence, we model this relationship by drawing arcs from the nodes "number of latent faults" and "operational usage" to "reliability."
For the node "reliability" the node probability table (NPT) might, therefore, look like that shown in Table 5 (for ultra-simplicity we have made all nodes discrete so that here reliability takes on just three discrete values low, medium, and high). The NPTs capture the conditional probabilities of a node given the state of its parent nodes. For nodes without parents (such as "use of IEC 1508" in Fig. 3) the NPTs are simply the marginal probabilities.
There may be several ways of determining the probabilities for the NPTs. One of the benefits of BBNs stems from the fact that we are able to accommodate both subjective probabilities (elicited from domain experts) and probabilities based on objective data. Recent tool developments, notably on the SERENE project [58], mean that it is now possible to build very large BBNs with very large probability tables (including continuous node variables). In three separate industrial applications we have built BBNs with several hundred nodes and several millions of probability values [59].
There are many advantages of using BBNs, the most important being the ability to represent and manipulate complex models that might never be implemented using conventional methods. Another advantage is that the model can predict events based on partial or uncertain data. Because BBNs have a rigorous, mathematical meaning there are software tools that can interpret them and perform the complex calculations needed in their use [58].
The benefits of using BBNs include:
• specification of complex relationships using conditional probability statements;
• use of "what-if?" analysis and forecasting of effects of process changes;
• easier understanding of chains of complex and seemingly contradictory reasoning via the graphical format;
• explicit modeling of "ignorance" and uncertainty in estimates;
• use of subjectively or objectively derived probability distributions;
• forecasting with missing data.

7.2 Motivation for BBN Approach
Clearly defects are not directly caused by program complexity alone. In reality the propensity to introduce defects will be influenced by many factors unrelated to code or design complexity. There are a number of causal factors at play when we want to explain the presence of defects in a program:
• difficulty of the problem;
• complexity of designed solution;
• programmer/analyst skill;
• design methods and procedures used.
Eliciting requirements is a notoriously difficult process and is widely recognized as being error prone. Defects introduced at the requirements stage are claimed to be the most expensive to remedy if they are not discovered early enough. Difficulty depends on the individual trying to understand and describe the nature of the problem as well as the problem itself. A "sorting" problem may appear difficult to a novice programmer but not to an expert. It also seems that the difficulty of the problem is partly influenced by the number of failed attempts at solutions there have been and whether a "ready made" solution can be reused. Thus, novel problems have the highest potential to be difficult and "known" problems tend to be simple because known solutions can be identified and reused. Any software development project will have a mix of "simple" and "difficult" problems depending on what intellectual resources are available to tackle them. Good managers know this and attempt to prevent defects by pairing up people and problems; easier problems to novices and difficult problems to experts.
Fig. 3. "Reliability prediction" BBN example.
TABLE 5
NODE PROBABILITY TABLE (NPT) FOR THE NODE "RELIABILITY"

operational usage    low    low    low    med    med    med    high   high   high
faults               low    med    high   low    med    high   low    med    high
reliability = low    0.10   0.20   0.33   0.20   0.33   0.50   0.20   0.33   0.70
reliability = med    0.20   0.30   0.33   0.30   0.33   0.30   0.30   0.33   0.20
reliability = high   0.70   0.50   0.33   0.50   0.33   0.20   0.50   0.33   0.10
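The NPT in Table 5 can be exercised directly. The sketch below encodes the table and computes a marginal distribution over reliability under assumed prior distributions for the two parent nodes; the priors are purely illustrative (they are not from the paper), and the brute-force enumeration simply stands in for the propagation that a BBN tool such as Hugin performs on much larger models.

    # P(reliability | operational usage, faults), keyed by (usage, faults), from Table 5.
    STATES = ("low", "med", "high")
    NPT = {
        ("low", "low"):   {"low": 0.10, "med": 0.20, "high": 0.70},
        ("low", "med"):   {"low": 0.20, "med": 0.30, "high": 0.50},
        ("low", "high"):  {"low": 0.33, "med": 0.33, "high": 0.33},
        ("med", "low"):   {"low": 0.20, "med": 0.30, "high": 0.50},
        ("med", "med"):   {"low": 0.33, "med": 0.33, "high": 0.33},
        ("med", "high"):  {"low": 0.50, "med": 0.30, "high": 0.20},
        ("high", "low"):  {"low": 0.20, "med": 0.30, "high": 0.50},
        ("high", "med"):  {"low": 0.33, "med": 0.33, "high": 0.33},
        ("high", "high"): {"low": 0.70, "med": 0.20, "high": 0.10},
    }

    # Illustrative priors for the parent nodes (assumed, not taken from the paper).
    p_usage = {"low": 0.2, "med": 0.5, "high": 0.3}
    p_faults = {"low": 0.3, "med": 0.4, "high": 0.3}

    p_rel = {s: 0.0 for s in STATES}
    for u in STATES:
        for f in STATES:
            for r in STATES:
                p_rel[r] += p_usage[u] * p_faults[f] * NPT[(u, f)][r]

    print(p_rel)    # marginal probability that reliability is low/med/high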
When assessing a defect it is useful to determine when it was introduced. Broadly speaking there are two types of defect: those that are introduced in the requirements and those introduced during design (including coding/implementation, which can be treated as design).
Useful defect models need to explain why a module has a high or low defect count if we are to learn from its use, otherwise we could never intervene and improve matters. Models using size and complexity metrics are structurally limited to assuming that defects are solely caused by the internal organization of the software design. They cannot explain defects introduced because:
• the "problem" is "hard";
• problem descriptions are inconsistent;
• the wrong "solution" is chosen and does not fulfill the requirements.
We have long recognized in software engineering that program quality can be potentially improved through the use of proper project procedures and good design methods. Basic project procedures like configuration management, incident logging, documentation and standards should help reduce the likelihood of defects. Such practices may not help the unique genius you need to work on the really difficult problems but they should raise the standards of the mediocre.
Central to software design method is the notion that problems and designs can be decomposed into meaningful chunks where each can be readily understood alone and finally recomposed to form the final system. Loose coupling between design components is supposed to help ensure that defects are localized and that consistency is maintained. What we have lacked as a community is a theory of program composition and decomposition; instead we have fairly ill-defined ideas on coupling, modularity and cohesiveness. However, despite not having such a theory, everyday experience tells us that these ideas help reduce defects and improve comprehension. It is indeed hard to think of any other scientific or engineering discipline that has not benefited from this approach.
Surprisingly, much of the defect prediction work has been pursued without reference to testing or testability. According to [37], [38] the testability of a program will dictate its propensity to reveal failures under test conditions and use. Also, at a superficial level, the amount of testing performed will determine how many defects will be discovered, assuming there are defects there to discover. Clearly, if no testing is done then no defects will be found. By extension we might argue that difficult problems, with complex solutions, might be difficult to test and so might demand more test effort. If such testing effort is not forthcoming (as is typical in many commercial projects when deadlines
loom) then less defects will be discovered, thus giving an over-estimate of the quality achieved and a false sense of security. Thus, any model to predict defects must include testing and testability as crucial factors.

7.3 A Prototype BBN
While there is insufficient space to fully describe the development and execution of a BBN model here, we have developed a prototype BBN to show the potential of BBNs and illustrate their useful properties. This prototype does not exhaustively model all of the issues described in Section 7.2, nor does it solve all of the problems described in Section 6. Rather, it shows the possibility of combining the different software engineering schools of thought on defect prediction into a single model. With this model we should be able to show how predictions might be made and explain historical results more clearly.
The majority of the nodes have the following states: "very-high," "high," "medium," "low," "very low," except for the design size node and defect count nodes, which have integer values or ranges, and the defect density nodes, which have real values. The probabilities attached to each of these states are fictitious but are determined from an analysis of the literature or common-sense assumptions about the direction and strength of relations between variables.
The defect prediction BBN can be explained in two stages. The first stage covers the life-cycle processes of specification, design or coding and the second stage covers testing. In Fig. 4 problem complexity represents the degree of complexity inherent in the set of problems to be solved by development. We can think of these problems as being discrete functional requirements in the specification. Solving these problems accrues benefits to the user. Any mismatch between the problem complexity and design effort is likely to cause the introduction of defects, defects introduced, and a greater design size. Hence the arrows between design effort, problem complexity, introduced defects, and design size. The testing stage follows the design stage and in practice the testing effort actually allocated may be much less than that required. The mismatch between testing effort and design size will influence the number of defects detected, which is bounded by the number of defects introduced. The difference between the defects detected and defects introduced is the residual defects count. The defect density at testing is a function of the design size and defects detected (defects/size). Similarly, the residual defect density is residual defects divided by design size.
Fig. 5 shows the execution of the defect density BBN model under the "Goldilock's Conjecture" using the Hugin Explorer tool [58]. Each of the nodes is shown as a window with a histogram of the predictions made based on the facts entered (facts are represented by histogram bars with 100 percent probability). The scenario runs as follows. A very complex problem is represented as a fact set at "very high" and a "high" amount of design effort is allocated, rather than "very high" commensurate with the problem complexity. The design size is between 1.0-2.0 KLOC. The model then propagates these "facts" and predicts the introduced defects, detected defects and the defect density statistics. The distribution for defects introduced peaks at two with 33 percent probability but, because less testing effort was allocated than required, the distribution of defects detected peaks around zero with probability 62 percent. The distribution for defect density at testing contrasts sharply with the residual defect density distribution in that the defect density at testing appears very favourable. This is of course misleading because the residual defect density distribution shows a much higher probability of higher defect density levels.
From the model we can see a credible explanation for observing large "modules" with lower defect densities. Underallocation of design effort for complex problems results in more introduced defects and higher design size. Higher design size requires more testing effort, which, if unavailable, leads to less defects being discovered than are actually there. Dividing the small detected defect counts by large design size values will result in small defect densities at the testing stage. The model explains the "Goldilock's Conjecture" without ad hoc explanation.
Clearly the ability to use BBNs to predict defects will depend largely on the stability and maturity of the development processes. Organizations that do not collect metrics data, do not follow defined life-cycles or do not perform any forms of systematic testing will find it hard to build or apply such models. This does not mean to say that less mature organizations cannot build reliable software, rather it implies that they cannot do so predictably and controllably. Achieving predictability of output, for any process, demands a degree of stability rare in software development organizations. Similarly, replication of experimental results can only be predicated on software processes that are defined and repeatable. This clearly implies some notion of Statistical Process Control (SPC) for software development.

8 CONCLUSIONS
Much of the published empirical work in the defect prediction area is well in advance of the unfounded rhetoric sadly typical of much of what passes for software engineering research. However, every discipline must learn as much, if not more, from its failures as its successes. In this spirit we have reviewed the literature critically with a view to better understand past failures and outline possible avenues for future success.
Our critical review of the state-of-the-art of models for predicting software defects has shown that many methodological and theoretical mistakes have been made. Many past studies have suffered from a variety of flaws ranging from model misspecification to use of inappropriate data. The issues and problems surrounding the "Goldilock's Conjecture" illustrate how difficult defect prediction is and how easy it is to commit serious modeling mistakes. Specifically, we conclude that the existing models are incapable of predicting defects accurately using size and complexity metrics alone. Furthermore, these models offer no coherent explanation of how defect introduction and detection variables affect defect counts. Likewise, any conclusions that large modules are more reliable and that software decomposition doesn't work are premature.
Fig. 4. BBN topology for defect prediction.
Fig. 5. A demonstration of the "Goldilock's Conjecture."
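The arithmetic behind the Fig. 5 scenario can be spelled out for a single hypothetical module; the numbers below are invented and simply mirror the deterministic relationships described in Section 7.3: residual defects are the introduced defects that go undetected, defect density at testing divides detected defects by design size, and residual defect density divides residual defects by the same size.

    # Hypothetical single-module scenario mirroring the prototype BBN's deterministic nodes.
    design_size_kloc = 1.5        # between 1.0 and 2.0 KLOC, as in the Fig. 5 scenario
    defects_introduced = 20       # invented: under-resourced design of a complex problem
    defects_detected = 2          # invented: testing effort well below what the size demands

    residual_defects = defects_introduced - defects_detected
    density_at_testing = defects_detected / design_size_kloc
    residual_density = residual_defects / design_size_kloc

    print(f"defect density at testing: {density_at_testing:.1f} defects/KLOC")   # looks favourable
    print(f"residual defect density:   {residual_density:.1f} defects/KLOC")     # the real picture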
Each of the different "schools of thought" have their own view of the prediction problem despite the interactions and subtle overlaps between process and product identified here. Furthermore each of these views model a part of the problem rather than the whole. Perhaps the most critical issue in any scientific endeavor is agreement on the constituent elements or variables of the problem under study, Models are developed to represent the salient features of the problem in a systemic fashion. This is as much the case in physical sciences as social sciences. Economists could not predict the behavior of an economy without an integrated, complex, macroeconomic model of all of the known, pertinent variables. Excluding key variables such as savings rate or productivity would make the whole exercise invalid. By taking the wider view we can construct a more accurate picture and explain supposedly puzzling and contradictory results. Our analysis of the studies surrounding the "Goldilock's Conjecture" shows how empirical results about defeet density can make sense if we look for alternative explanations. Collecting data from case studies and subjecting it to isolated analysis is not enough because statistics on its own does not provide scientific explanations. We need compelling and sophisticated theories that have the power to explain the empirical observations. The isolated pursuit of these single issue perspectives on the quality prediction problem are, in the longer-term, fruitless. Part of the solution to many of the difficulties presented above is to develop prediction models that unify the key elements from the diverse software quality prediction models. We need models that predict software quality by taking into account information from the development process, problem complexity, defect detection processe, and design complexity, We must understand the cause and effect relations between important variables in order to explain why certain design processes are more successful than others in terms of the products they produce. It seems that successful engineers already operate in a way that tacitly acknowledges these cause-effect relations. After all if they didn't how else could they control and deliver quality products? Project managers make decisions about software quality using best guesses; it seems to us that will always be the case and the best that researchers J , . can do is 1) recognize this fact and . . . . „ n{ . 2) improve the guessing process. We, therefore, need to model the subjectivity and uncertainty that is pervasive in software development. Likewise, the challenge for researchers is in transforming this uncertain knowledge, which is already evident in elements of the
various quality models already discussed, into a prediction model that other engineers can learn from and apply. We are already working on a number of projects using Bayesian Belief Networks as a method for creating more sophisticated models for prediction, [59], [60], [61], and have described one of the prototype BBNs to outline the approach. Ultimately, this research is aiming to produce a method for the statistical process control (SPC) of software production implied by the SEI's Capability Maturity Model.
All of the defect prediction models reviewed in this paper operate without the use of any formal theory of program/problem decomposition. The literature is however replete with acknowledgments to cognitive explanations of shortcomings in human information processing. While providing useful explanations of why designers employ decomposition as a design tactic, they do not, and perhaps cannot, allow us to determine objectively the optimum level of decomposition within a system (be it a requirements specification or a program). The literature recognizes the two structural³ aspects of software, "within" component structural complexity and "between" component structural complexity, but we lack a way to integrate these two views that would allow us to say whether one design was more or less structurally complex than another. Such a theory might also allow us to compare different decompositions of the same solution to the same problem requirement, thus explaining why different approaches to problem or design decomposition might have caused a designer to commit more or less defects. As things currently stand, without such a theory we cannot compare different decompositions and, therefore, cannot carry out experiments comparing different decomposition tactics. This leaves a gap in any evolving science of software engineering that cannot be bridged using current case study based approaches, despite their empirical flavor.

ACKNOWLEDGMENTS
The work carried out here was partially funded by the ESPRIT projects SERENE and DeVa, the EPSRC project IMPRESS, and the DISPO project funded by Scottish Nuclear. The authors are indebted to Niclas Ohlsson and Peter Popov for comments that influenced this work and also to the anonymous reviewers for their helpful and incisive contributions.
3. We are careful here to use the term structural complexity when discussing attributes of design artifacts and cognitive complexity when referring to an individual's understanding of such an artifact. Suffice it to say that structural complexity would influence cognitive complexity.
REFERENCES
[1] [Authors and title illegible in source], IEEE Trans. Software Eng., vol. 5, no. 3, May 1979.
[2] D. Potier, J.L. Albin, R. Ferreol, and A. Bilodeau, "Experiments with Computer Software Complexity and Reliability," Proc. Sixth Int'l Conf. Software Eng., pp. 94-103, 1982.
[3] T. Nakajo and H. Kume, "A Case History Analysis of Software Error Cause-Effect Relationships," IEEE Trans. Software Eng., vol. 17, no. 8, Aug. 1991.
[4] S. Brocklehurst and B. Littlewood, "New Ways to Get Accurate Reliability Measures," IEEE Software, pp. 34-42, 1992.
[5] F. Akiyama, "An Example of Software System Debugging," Information Processing, vol. 71, pp. 353-379, 1971.
[6] A.E. Ferdinand, "A Theory of System Complexity," Int'l J. General Systems, vol. 1, 1974.
[7] M.H. Halstead, Elements of Software Science. Elsevier North-Holland, 1975.
[8] N.E. Fenton and B.A. Kitchenham, "Validating Software Measures," J. Software Testing, Verification & Reliability, vol. 1, no. 2, pp. 27-42, 1991.
[9] L.M. Ottenstein, "Quantitative Estimates of Debugging Requirements," IEEE Trans. Software Eng., vol. 5, no. 5, pp. 504-514, 1979.
[10] M. Lipow, "Number of Faults per Line of Code," IEEE Trans. Software Eng., vol. 8, no. 4, pp. 437-439, 1982.
[11] J.R. Gaffney, "Estimating the Number of Faults in Code," IEEE Trans. Software Eng., vol. 10, no. 4, 1984.
[12] T. Compton and C. Withrow, "Prediction and Control of Ada Software Defects," J. Systems and Software, vol. 12, pp. 199-207, 1990.
[13] V.R. Basili and B.T. Perricone, "Software Errors and Complexity: An Empirical Investigation," Comm. ACM, vol. 27, no. 1, pp. 42-52, 1984.
[14] V.Y. Shen, T. Yu, S.M. Thebaut, and L.R. Paulsen, "Identifying Error-Prone Software - An Empirical Study," IEEE Trans. Software Eng., vol. 11, no. 4, pp. 317-323, 1985.
[15] K.H. Moeller and D. Paulish, "An Empirical Investigation of Software Fault Distribution," Proc. First Int'l Software Metrics Symp., pp. 82-90, IEEE CS Press, 1993.
[16] L. Hatton, "The Automation of Software Process and Product Quality," Software Quality Management, M. Ross, C.A. Brebbia, G. Staples, and J. Stapleton, eds., pp. 727-744, Southampton: Computational Mechanics Publications, Elsevier, 1993.
[17] L. Hatton, C and Safety Related Software Development: Standards, Subsets, Testing, Metrics, Legal Issues. McGraw-Hill, 1994.
[18] T. Keller, "Measurements Role in Providing Error-Free Onboard Shuttle Software," Proc. Third Int'l Applications of Software Metrics Conf., La Jolla, Calif., pp. 2.154-2.166, 1992. Proc. available from Software Quality Engineering.
[19] L. Hatton, "Re-examining the Fault Density-Component Size Connection," IEEE Software, vol. 14, no. 2, pp. 89-98, Mar./Apr. 1997.
[20] T.J. McCabe, "A Complexity Measure," IEEE Trans. Software Eng., vol. 2, no. 4, pp. 308-320, 1976.
[21] B.A. Kitchenham, L.M. Pickard, and S.J. Linkman, "An Evaluation of Some Design Metrics," Software Eng. J., vol. 5, no. 1, pp. 50-58, 1990.
[22] S. Henry and D. Kafura, "The Evaluation of Software System's Structure Using Quantitative Software Metrics," Software - Practice and Experience, vol. 14, no. 6, pp. 561-573, June 1984.
[23] N. Ohlsson and H. Alberg, "Predicting Error-Prone Software Modules in Telephone Switches," IEEE Trans. Software Eng., vol. 22, no. 12, pp. 886-894, 1996.
[24] V. Basili, L. Briand, and W.L. Melo, "A Validation of Object Oriented Design Metrics as Quality Indicators," IEEE Trans. Software Eng., 1996.
[25] S.R. Chidamber and C.F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Trans. Software Eng., vol. 20, no. 6, pp. 476-498, 1994.
[26] C. Jones, Applied Software Measurement. McGraw-Hill, 1991.
[27] M.A. Cusumano, Japan's Software Factories. Oxford Univ. Press, 1991.
[28] K. Koga, "Software Reliability Design Method in Hitachi," Proc. Third European Conf. Software Quality, Madrid, 1992.
[29] K. Yasuda, "Software Quality Assurance Activities in Japan," Japanese Perspectives in Software Eng., pp. 187-205, Addison-Wesley, 1989.
[30] M. Dyer, The Cleanroom Approach to Quality Software Development. Wiley, 1992.
[31] W.S. Humphrey, Managing the Software Process. Reading, Mass.: Addison-Wesley, 1989.
[32] R.D. Buck and J.H. Robbins, "Application of Software Inspection Methodology in Design and Code," Software Validation, H.-L. Hausen, ed., pp. 41-56, Elsevier Science, 1984.
[33] N.E. Fenton, S. Lawrence Pfleeger, and R. Glass, "Science and Substance: A Challenge to Software Engineers," IEEE Software, pp. 86-95, July 1994.
[34] R.B. Grady, Practical Software Metrics for Project Management and Process Improvement. Prentice Hall, 1992.
[35] A. Veevers and A.C. Marshall, "A Relationship between Software Coverage Metrics and Reliability," J. Software Testing, Verification and Reliability, vol. 4, pp. 3-8, 1994.
[36] M.D. Neil, "Statistical Modelling of Software Metrics," PhD thesis, South Bank Univ. and Strathclyde Univ., 1992.
[37] J.M. Voas and K.W. Miller, "Software Testability: The New Verification," IEEE Software, pp. 17-28, May 1995.
[38] A. Bertolino and L. Strigini, "On the Use of Testability Measures for Dependability Assessment," IEEE Trans. Software Eng., vol. 22, no. 2, pp. 97-108, 1996.
[39] M. Diaz and J. Sligo, "How Software Process Improvement Helped Motorola," IEEE Software, vol. 14, no. 5, pp. 75-81, 1997.
[40] C. Jones, "The Pragmatics of Software Process Improvements," Software Engineering Technical Council Newsletter, Technical Council on Software Eng., IEEE Computer Society, vol. 14, no. 2, Winter 1996.
[41] J.C. Munson and T.M. Khoshgoftaar, "Regression Modelling of Software Quality: An Empirical Investigation," Information and Software Technology, vol. 32, no. 2, pp. 106-114, 1990.
[42] M.D. Neil, "Multivariate Assessment of Software Products," J. Software Testing, Verification and Reliability, vol. 1, no. 4, pp. 17-37, 1992.
[43] T.M. Khoshgoftaar and J.C. Munson, "Predicting Software Development Errors Using Complexity Metrics," IEEE J. Selected Areas in Comm., vol. 8, no. 2, pp. 253-261, 1990.
[44] J.C. Munson and T.M. Khoshgoftaar, "The Detection of Fault-Prone Programs," IEEE Trans. Software Eng., vol. 18, no. 5, pp. 423-433, 1992.
[45] N.E. Fenton and S. Lawrence Pfleeger, Software Metrics: A Rigorous and Practical Approach, second edition. Int'l Thomson Computer Press, 1996.
[46] E. Adams, "Optimizing Preventive Service of Software Products," IBM Research J., vol. 28, no. 1, pp. 2-14, 1984.
[47] N. Fenton and N. Ohlsson, "Quantitative Analysis of Faults and Failures in a Complex Software System," IEEE Trans. Software Eng., to appear, 1999.
[48] T. Stalhane, "Practical Experiences with Safety Assessment of a System for Automatic Train Control," Proc. SAFECOMP'92, Zurich, Switzerland. Oxford, U.K.: Pergamon Press, 1992.
[49] P. Hamer and G. Frewin, "Halstead's Software Science: A Critical Examination," Proc. Sixth Int'l Conf. Software Eng., pp. 197-206, 1982.
[50] V.Y. Shen, S.D. Conte, and H. Dunsmore, "Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support," IEEE Trans. Software Eng., vol. 9, no. 2, pp. 155-165, 1983.
[51] M.J. Shepperd, "A Critique of Cyclomatic Complexity as a Software Metric," Software Eng. J., vol. 3, no. 2, pp. 30-36, 1988.
[52] B.F. Manly, Multivariate Statistical Methods: A Primer. Chapman & Hall, 1986.
[53] F. Zhou, B. Lowther, P. Oman, and J. Hagemeister, "Constructing and Testing Software Maintainability Assessment Models," Proc. First Int'l Software Metrics Symp., Baltimore, Md., IEEE CS Press, 1993.
[54] J. Rosenberg, "Some Misconceptions About Lines of Code," Proc. Software Metrics Symp., pp. 137-142, IEEE Computer Society, 1997.
[55] B.A. Kitchenham, "An Evaluation of Software Structure Metrics," Proc. COMPSAC'88, Chicago, Ill., 1988.
[56] S. Cherf, "An Investigation of the Maintenance and Support Characteristics of Commercial Software," Proc. Second Oregon Workshop on Software Metrics (AOWSM), Portland, 1991.
[57] S.L. Lauritzen and D.J. Spiegelhalter, "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with discussion)," J. Royal Statistical Soc. Series B, vol. 50, no. 2, pp. 157-224, 1988.
[58] HUGIN Expert Brochure. Hugin Expert A/S, Aalborg, Denmark, 1998.
[59] Agena Ltd., "Bayesian Belief Nets," http://www.agena.co.uk/bbnarticle/bbns.html
[60] M. Neil and N.E. Fenton, "Predicting Software Quality Using Bayesian Belief Networks," Proc. 21st Ann. Software Eng. Workshop, pp. 217-230, NASA Goddard Space Flight Centre, Dec. 1996.
[61] M. Neil, B. Littlewood, and N. Fenton, "Applying Bayesian Belief Networks to Systems Dependability Assessment," Proc. Safety Critical Systems Club Symp., Springer-Verlag, Leeds, Feb. 1996.
85
FENTON AND NEIL: A CRITIQUE OF SOFTWARE DEFECT PREDICTION MODELS
Norman E. Fenton is professor of computing science at the Centre for Software Reliability, City University, London, and is also a director at Agena Ltd. His research interests include software metrics, empirical software engineering, safety-critical systems, and formal development methods. However, the focus of his current work is on applications of Bayesian nets; these applications include critical systems assessment, vehicle reliability prediction, and software quality assessment. He is a chartered engineer (member of the IEE), a fellow of the IMA, and a member of the IEEE Computer Society.
Martin Neil holds a first degree in mathematics for business analysis from Glasgow Caledonian University and a PhD in statistical analysis of software metrics awarded jointly by South Bank University and Strathclyde University. Currently he is a lecturer in computing at the Centre for Software Reliability, City University, London. Before joining the CSR, he spent three years with Lloyd's Register as a consultant and researcher and a year at South Bank University. He has also worked with J.P. Morgan as a software quality consultant. His research interests cover software metrics, Bayesian probability, and the software process. Dr. Neil is a director at Agena Ltd., a consulting company specializing in decision support and risk assessment of safety- and business-critical systems. He is a member of the CSR Council, the IEEE Computer Society, and the ACM.
IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 4, DECEMBER 2002
Using Regression Trees to Classify Fault-Prone Software Modules Taghi M. Khoshgoftaar, Member, IEEE, Edward B. Allen, Member, IEEE, and Jianyu Deng
Abstract—Software faults are defects in software modules that might cause failures. Software developers tend to focus on faults, because they are closely related to the amount of rework necessary to prevent future operational software failures. The goal of this paper is to predict which modules are fault-prone, and to do it early enough in the life cycle to be useful to developers. A regression tree is an algorithm represented by an abstract tree, where the response variable is a real quantity. Software modules are classified as fault-prone or not by comparing the predicted value to a threshold. A classification rule is proposed that allows one to choose a preferred balance between the two types of misclassification rates. A case study of a very large telecommunications system considered software modules to be fault-prone if any faults were discovered by customers. Our research shows that classifying fault-prone modules with regression trees, and then using the classification rule in this paper, resulted in predictions with satisfactory accuracy and robustness.
Index Terms—Classification, fault-prone modules, regression trees, software metrics, software reliability, S-Plus.

ACRONYMS (the singular and plural of an acronym are spelled the same)
EMERALD    Enhanced Measurement for Early Risk Assessment of Latent Defects
Cdf        cumulative distribution function
fp         fault-prone
nfp        not fault-prone
pdf        probability density function

NOTATION
j             identifier of a predictor (a software metric)
X_j           predictor #j
l             node identifier
n             number of objects (modules)
x_ij          object #i's value of X_j
x_i           vector of predictor values for object #i
faults_i      number of customer-discovered faults in object #i
y_i           response for object #i
yhat_i        predicted y_i
ybar(l)       average response for training objects in node #l
D(l)          s-deviance of node #l
pi            prior probabilities of class membership
mindev        s-deviance threshold
minsize       minimum number of objects in a decision node
L(x_i)        the leaf that object #i falls into
n_l           number of training objects that fall into leaf #l
Class_i       actual class of object #i
Class(x_i)    predicted class of object #i, based on its x_i
q_l           Pr{an object in leaf #l is fault-prone}
qhat(L(x_i))  estimated q_l
xi            classification-rule parameter
Pr{fp|nfp}    Type I misclassification rate, Pr{Class(x_i) = fp | Class_i = nfp}
Pr{nfp|fp}    Type II misclassification rate, Pr{Class(x_i) = nfp | Class_i = fp}
phi           pdf of the Gaussian distribution
Phi           Cdf of the Gaussian distribution
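As a reading aid for the notation above, the following minimal Python sketch computes the node average ybar(l) and a sum-of-squares deviance D(l) for a toy node. The sum-of-squares form is the usual S-Plus regression-tree definition and is assumed here rather than quoted from the paper; the fault counts are invented.

```python
# Hedged sketch of the node statistics named in the NOTATION list.
# The sum-of-squares deviance is an assumed (conventional) form,
# and the response values below are invented for illustration.

def node_mean(responses):
    """ybar(l): average response of the training objects in node l."""
    return sum(responses) / len(responses)

def node_deviance(responses):
    """D(l): sum of squared deviations from the node mean (assumed form)."""
    ybar = node_mean(responses)
    return sum((y - ybar) ** 2 for y in responses)

# Toy node: customer-discovered fault counts for the modules in one node.
faults_in_node = [0, 0, 1, 0, 3, 0, 0, 2]
print(node_mean(faults_in_node), node_deviance(faults_in_node))
```

Roughly speaking, mindev and minsize limit how small a node may become, in deviance and in object count, before splitting stops; the paper gives the precise criteria used by S-Plus.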
I. INTRODUCTION
HIGH software reliability is important for many software systems, especially those that support society's infrastructures, such as telecommunication systems. Reliability is usually measured from the user's viewpoint in terms of time between failures, according to an operational profile [29]. A software fault is defined as a defect in an executable software product that may cause a failure [26]. Thus, faults are attributed to the software modules that cause failures. Developers tend to focus on faults, because they are closely related to the amount of rework necessary to prevent future failures. This paper defines a software module as fault-prone when there is a high risk that faults will be discovered during operations.

Faulty modules cannot be identified until failures occur during operation. This is too late to be useful to developers. However, if one could predict during development which modules are fault-prone, then developers could take cost-effective proactive measures to prevent the release of faulty software. This, in turn, would reduce the amount of expensive rework needed to repair faulty software during the operational phase.

Manuscript received December 29, 1999; revised October 1, 2001 and November 15, 2001. This work was supported in part by a grant from Nortel Networks through the Software Reliability Engineering Department. The findings and opinions in this paper belong solely to the authors, and are not necessarily those of the sponsor. Moreover, our results do not in any way reflect the quality of the sponsor's software products. Responsible Editor: M. A. Vouk.
T. M. Khoshgoftaar is with the Empirical Software Engineering Lab., Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]).
E. B. Allen is with the Department of Computer Science, Mississippi State University, Mississippi State, MS 39762 USA (e-mail: Edward.Allen@computer.org).
J. Deng is with Motorola Metrowerks Corp., Austin, TX 78758 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TR.2002.804488
The goal of this research is to find ways to predict, early enough in the life cycle to be useful to developers, which modules are fault-prone. The exact nature of the software improvement processes that developers could apply to fault-prone modules is not addressed here. In a well-built system, fault-prone modules typically are only a small fraction of the total.

A variety of classification techniques have been used to model software quality, including:
• logistic regression [2], [14];
• discriminant analysis [21], [28];
• discriminant power [34], [35];
• discriminant coordinates [30];
• optimal set reduction [4];
• neural networks [24];
• fuzzy classification [7];
• classification trees [37].

A classification tree is an algorithm represented by an abstract tree of decision rules. The s-dependent variable is the response variable, which is categorical (e.g., fault-prone or not). The s-independent variables are predictors. Each internal node represents a decision that is based on a predictor. Each edge leads to a potential next decision. Each leaf is labeled with a class. An object (e.g., a software module) is classified by traversing a path from the root of the tree to a leaf, according to the values of the object's predictors; finally, the object's response variable is assigned the leaf's class. A classification tree accommodates nonmonotonic and nonlinear relationships among combinations of variables in a model that is easy to understand and use.
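To make the traversal just described concrete, here is a minimal Python sketch; it is not taken from the paper, and the tree layout, metric names, and thresholds are invented for illustration.

```python
# Hypothetical illustration of classifying one module with a decision tree.
# Node layout, metric names, and thresholds are invented for this sketch.

def classify(node, module):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:                     # internal decision node
        metric, threshold = node["metric"], node["threshold"]
        node = node["left"] if module[metric] <= threshold else node["right"]
    return node["label"]                           # leaf: assign its class

# A toy tree: split first on lines of code, then on cyclomatic complexity.
tree = {
    "metric": "loc", "threshold": 500,
    "left":  {"label": "not fault-prone"},
    "right": {
        "metric": "cyclomatic", "threshold": 20,
        "left":  {"label": "not fault-prone"},
        "right": {"label": "fault-prone"},
    },
}

print(classify(tree, {"loc": 800, "cyclomatic": 35}))   # -> fault-prone
```

A regression tree, discussed next, is traversed the same way, but its leaf carries a predicted quantity rather than a class label.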
References [31], [37] model software quality using the ID3 algorithm [32] to build trees using an entropy-based criterion. Reference [38] extended the ID3 algorithm by applying Akaike Information Criterion procedures [1] to prune the tree. The authors' research group has classified fault-prone modules with the CART algorithm [3], [17], [22] and the TREEDISC algorithm [23], [33], which is a refinement of the CHAID algorithm [12]. S-Plus also has an algorithm for constructing classification trees [5]. However, this algorithm does not incorporate prior probabilities of membership nor costs of misclassifications [13]. In one case study, this algorithm did not build a tree, because our data had a very small proportion of fault-prone modules. This led the authors to explore the use of regression trees for the purpose of classifying fault-prone modules.

A regression tree is also an algorithm represented by an abstract tree. However, the response variable is a real quantity instead of a class. Decision nodes are similar to a classification tree's, but each leaf is labeled with a quantity for the response variable. The processing of an object is similar to a classification tree's, but once the object reaches a leaf, the response variable is assigned the appropriate quantity. Reference [25] briefly reports using the Classification and Regression Trees (CART) regression algorithm.

A tree model is built with a training data set, and its accuracy is estimated with an evaluation data set that is similar to, but s-independent of, the training data set. Both the training and evaluation data sets must represent historical software modules where actual faults are known. After a tree model has been built and evaluated with historical data, it is ready to make predictions for a similar current development project, where predictors are known but faults have not yet been discovered.

The accuracy of a classification model is characterized by misclassification rates. When the response variable can be one of two classes, e.g., fault-prone or not, then a model can make two kinds of misclassifications. In the application in this paper, a Type I misclassification is when the model predicts that a module is fault-prone when it is not. Conversely, a Type II misclassification is when the model predicts that a module is not fault-prone when it is.

This paper presents a method for using regression trees to classify software modules as fault-prone or not, allowing one to choose a preferred balance between Type I and Type II misclassification rates (see the sketch below). To our knowledge, this is the first time the S-Plus regression tree algorithm has been used for classification of software quality. A case study of a very large telecommunication system illustrates the approach [6]. Future work might include a comparative study of the various tree algorithms.

The remainder of this paper explains how S-Plus builds a regression tree, defines the authors' classification rule for choosing a preferred balance between misclassification rates, and presents details of the authors' case study.
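The authors' own classification rule is developed in Section II. As a hedged stand-in, the Python sketch below uses a plain threshold on the tree's predicted fault count and shows how the Type I and Type II rates defined above are computed on an evaluation set; the threshold, predicted counts, and class labels are invented for illustration.

```python
# Hedged sketch: a generic threshold rule standing in for the paper's
# classification rule (Section II).  Predictions, labels, and the
# threshold below are made up for illustration.

def classify_by_threshold(predicted_faults, threshold):
    """Label a module 'fp' if its predicted response exceeds the threshold."""
    return "fp" if predicted_faults > threshold else "nfp"

def misclassification_rates(actual, predicted):
    """Type I: nfp modules called fp.  Type II: fp modules called nfp."""
    nfp = [p for a, p in zip(actual, predicted) if a == "nfp"]
    fp = [p for a, p in zip(actual, predicted) if a == "fp"]
    type1 = sum(p == "fp" for p in nfp) / len(nfp)
    type2 = sum(p == "nfp" for p in fp) / len(fp)
    return type1, type2

# Toy evaluation data: tree-predicted fault counts and actual classes.
predicted_counts = [0.1, 0.4, 2.3, 0.0, 1.8, 0.2]
actual_classes   = ["nfp", "nfp", "fp", "nfp", "nfp", "fp"]

labels = [classify_by_threshold(c, threshold=1.0) for c in predicted_counts]
print(misclassification_rates(actual_classes, labels))   # (0.25, 0.5)
```

In this toy example, raising the threshold lowers the Type I rate but raises the Type II rate; choosing that balance deliberately is exactly what the classification rule in Section II is designed to do.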
II. A CLASSIFICATION RULE FOR REGRESSION TREES