LECTURE NOTES ON
EMPIRICAL SOFTWARE ENGINEERING
SERIES ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING Series Editor-in-Chief S K CHANG (University of Pittsburgh, USA)
Vol. 1
Knowledge-Based Software Development for Real-Time Distributed Systems Jeffrey J.-P. Tsai and Thomas J. Weigert (Univ. Illinois at Chicago)
Vol. 2
Advances in Software Engineering and Knowledge Engineering edited by Vincenzo Ambriola (Univ. Pisa) and Genoveffa Tortora (Univ. Salerno)
Vol. 3
The Impact of CASE Technology on Software Processes edited by Daniel E. Cooke (Univ. Texas)
Vol. 4
Software Engineering and Knowledge Engineering: Trends for the Next Decade edited by W. D. Hurley (Univ. Pittsburgh)
Vol. 5
Intelligent Image Database Systems edited by S. K. Chang (Univ. Pittsburgh), E. Jungert (Swedish Defence Res. Establishment) and G. Tortora (Univ. Salerno)
Vol. 6
Object-Oriented Software: Design and Maintenance edited by Luiz F. Capretz and Miriam A. M. Capretz (Univ. Aizu, Japan)
Vol. 7
Software Visualisation edited by P. Eades (Univ. Newcastle) and K. Zhang (Macquarie Univ.)
Vol. 8
Image Databases and Multi-Media Search edited by Arnold W. M. Smeulders (Univ. Amsterdam) and Ramesh Jain (Univ. California)
Vol. 9
Advances in Distributed Multimedia Systems edited by S. K. Chang, T. F. Znati (Univ. Pittsburgh) and S. T. Vuong (Univ. British Columbia)
Vol. 10 Hybrid Parallel Execution Model for Logic-Based Specification Languages Jeffrey J.-P. Tsai and Bing Li (Univ. Illinois at Chicago) Vol. 11 Graph Drawing and Applications for Software and Knowledge Engineers Kozo Sugiyama (Japan Adv. Inst. Science and Technology)
Forthcoming titles: Acquisition of Software Engineering Knowledge edited by Robert G. Reynolds (Wayne State Univ.) Monitoring, Debugging, and Analysis of Distributed Real-Time Systems Jeffrey J.-P. Tsai, Steve J. H. Yong, R. Smith and Y. D. Bi (Univ. Illinois at Chicago)
LECTURE NOTES ON
EMPIRICAL SOFTWARE ENGINEERING
Editors
Natalia Juristo and Ana M. Moreno
Universidad Politecnica de Madrid, Spain
World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
EMPIRICAL SOFTWARE ENGINEERING Copyright © 2003 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4914-4
Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore
Preface

The use of reliable and validated knowledge is essential in any engineering discipline. Software engineering, as a body of knowledge that guides the construction of software systems, also needs tested and true knowledge whose application produces predictable results. There are different intermediate stages on the scale from knowledge considered as proven facts to beliefs or speculations: facts given as founded and accepted by all, undisputed statements, disputed statements, and conjectures or speculations. The ranking of a statement depends on changes in its factuality status, and these changes are determined by the path from subjectivity to objectivity paved by testing or empirical comparison with reality.

For software development to really be an engineering discipline and to predictably build quality software, it has to make the transition from development based on speculations to development based on facts. Software engineering needs to lay aside perceptions, bias and market-speak to provide fair and impartial analysis and information.

Several stakeholders in the software developer community can contribute to developing the engineering knowledge of the software discipline in different ways. Firstly, researchers are responsible for testing their proposals, providing data that demonstrate what benefits they have and identifying the best application conditions. These validations are usually performed, first, under controlled conditions, by means of what are called laboratory tests or in vitro experiments, which distinguishes them from the real conditions in which these artefacts would be employed in an industrial environment. For example, software development techniques or tools would be applied on projects not subjected to market pressures and with developers of certain known characteristics. Additionally, researchers also have to replicate the studies performed by their peers, either to corroborate the results under the same application conditions or to provide additional information in new contexts. Far from being a straightforward process, replication is beset by a series of problems related to variations in the hypotheses, in the factors that affect the studies, or in the data collected during the experiments.
Through this experimental process, the research community would provide practitioners with tested knowledge of the benefits of applying given artefacts and their application conditions. However, studies need to be repeated in industrial settings to assure that these benefits also occur in an industrial environment and that practitioners can make use of the respective artefacts knowing beforehand what results their application will have. These are in-vivo experimental studies.

This handbook deals with the above two levels of the process of empirical testing (there is another empirical study level, not addressed in this book, which the software industry should perform routinely: having introduced a software artefact into an industrial setting, practitioners are responsible for measuring and monitoring the improvement and changes that have taken place in their development processes and software products; this activity is commonly referred to as process control). The first three chapters focus on the process of experimental validation in laboratories, whereas the other three address this process of validation in industrial settings. Let us now turn to the individual topics discussed in each of these two groups.

Verification techniques belong to one of the software development areas that have traditionally been subjected to empirical testing. Studies aiming to identify the strengths of the different error detection techniques have been undertaken since the 80s. The first chapter of this handbook aims to identify the knowledge available about testing techniques, after 20 years of experimentation. The authors find that, despite all the studies and replications in the area, it is not easy to extract coherent knowledge on all of these techniques due to the dispersion of the testing techniques and response variables analysed. From the study shown in "Limitations of Empirical Testing Techniques Knowledge", Juristo, Moreno and Vegas conclude that more coherence and coordination are needed in experimental efforts. It is not enough to run one-off experiments. Coherent and coordinated replication is just as important as running new experiments. Often, it would be more beneficial to establish a given item of knowledge through replication before trying to validate new items that are based on this knowledge. A single experiment is better than none, but it is not sufficient to convert an item of knowledge into a validated fact.

The second chapter of this handbook provides guidelines on how to perform replications. The lessons from this chapter aim to contribute to
creating a body of factual knowledge, overcoming the problems detected in chapter one. The replication of empirical studies is a complex process that calls for an exhaustive analysis of the studies to be replicated and the precise definition of new working conditions. In "Replicated Studies: Building a Body of Knowledge about Software Reading Techniques", Shull, Carver, Maldonado, Travassos, Conradi and Basili analyse how to run coherent replications of empirical studies. The authors further discuss which variables should change from one study to another to generate more knowledge of the different tests. The proposed approach is applied to specific validation techniques, namely, reading techniques.

Having replicated studies, one of the most important tasks is to extract reliable knowledge from these studies. In the third chapter of the book, "Combining Data from Reading Experiments in Software Inspections. A Feasibility Study", Wohlin, Petersson and Aurum analyse the process of information extraction to be performed when we have a coherent and coordinated set of experiments. Chapter three of the handbook illustrates the types of generalised results that can be derived when combining different studies. By way of an example, the authors present some results from the combination of studies found in the software inspections area.

The remainder of the book focuses on the second level of empirical testing, that is, on performing studies in industrial settings. It is important to note that, although considerable progress has been made recently in the field of empirical validation, there is still a lack of empirical work in the industrial setting. Therefore, the three remaining chapters of the book aim to propose alternatives that make this task easier. Chapters four and five propose two alternatives for making empirical testing more appealing to industry. In chapter four, "External Experiments - A workable paradigm for collaboration between industry and academia", Houdek proposes an empirical validation approach by means of which to share out the workload of empirical testing in an industrial setting between the university setting and industry and assure, on the one hand, that the result of the study represents the real conditions and, on the other, relieve industry of the work and effort required to run this sort of study. In chapter five, "(Quasi)-Experimental Studies in Industrial Settings", Laitenberger and Rombach analyse the difficulties (for example, in terms of
cost or control of the different variables that can have an impact on the studies) faced by industry when running empirical studies, and present a practical approach that makes it workable for practitioners to perform this sort of study by relaxing the conditions of the experiment. Finally, chapter six, "Experimental validation of new software technology", presents an analysis of the empirical validation techniques in use by researchers and practitioners. Zelkowitz, Wallace and Binkley identify commonalities and differences among these methods and propose some guidelines to assure that both kinds of models can work complementarily.
Natalia Juristo and Ana M. Moreno
Universidad Politecnica de Madrid, Spain
Contents
Preface
Chapter 1  Limitations of Empirical Testing Technique Knowledge
           N. Juristo, A. M. Moreno and S. Vegas — 1
Chapter 2  Replicated Studies: Building a Body of Knowledge about Software Reading Techniques
           F. Shull, J. Carver, G. H. Travassos, J. C. Maldonado, R. Conradi and V. R. Basili — 39
Chapter 3  Combining Data from Reading Experiments in Software Inspections — A Feasibility Study
           C. Wohlin, H. Petersson and A. Aurum — 85
Chapter 4  External Experiments — A Workable Paradigm for Collaboration Between Industry and Academia
           F. Houdek — 133
Chapter 5  (Quasi-)Experimental Studies in Industrial Settings
           O. Laitenberger and D. Rombach — 167
Chapter 6  Experimental Validation of New Software Technology
           M. V. Zelkowitz, D. R. Wallace and D. W. Binkley — 229
CHAPTER 1
Limitations of Empirical Testing Technique Knowledge

N. Juristo, A. M. Moreno and S. Vegas
Facultad de Informatica, Universidad Politecnica de Madrid
Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain
Engineering disciplines are characterised by the use of mature knowledge by means of which they can achieve predictable results. Unfortunately, the type of knowledge used in software engineering can be considered to be of a relatively low maturity, and developers are guided by intuition, fashion or market-speak rather than by facts or undisputed statements proper to an engineering discipline. Testing techniques determine different criteria for selecting the test cases that will be used as input to the system under examination, which means that an effective and efficient selection of test cases conditions the success of the tests. The knowledge for selecting testing techniques should come from studies that empirically justify the benefits and application conditions of the different techniques. This paper analyses the maturity level of the knowledge about testing techniques by examining existing empirical studies about these techniques. For this purpose, we classify testing technique knowledge according to four categories. Keywords: Testing techniques; empirical maturity of testing knowledge; testing techniques empirical studies.
1. Introduction

Engineering disciplines are characterised by using mature knowledge that can be applied to output predictable results. (Latour and Woolgar, 1986) discuss a series of intermediate steps on a scale that ranges from the most mature knowledge, considered as proven facts, to the least mature knowledge, composed of beliefs or speculations: facts given as founded and
accepted by all, undisputed statements, disputed statements and conjectures or speculations. The path from subjectivity to objectivity is paved by testing or empirical comparison with reality. It is knowledge composed of facts and undisputed statements that engineering disciplines apply to output products with predictable characteristics. Unfortunately, software development has been characterised from its origins by a serious want of empirical facts tested against reality that provide evidence of the advantages or disadvantages of using different methods, techniques or tools to build software systems. The knowledge used in our discipline can be considered to be relatively immature, and developers are guided by intuition, fashion or market-speak rather than by the facts or undisputed statements proper to an engineering discipline. This is equally applicable to software testing and is in open opposition to the importance of software quality control and assurance and, in particular, software testing. Testing is the last chance during development to detect and correct possible software defects at a reasonable price. It is a well-known fact that it is a lot more expensive to correct defects that are detected during later system operation (Davis, 1993). Therefore, it is of critical importance to apply knowledge that is mature enough to get predictable results during the testing process. The selection of the testing techniques to be used is one of the circumstances during testing where objective and factual knowledge is essential. Testing techniques determine different criteria for selecting the test cases that will be used as input to the system under examination, which means that an effective and efficient selection of test cases conditions the success of the tests. The knowledge for selecting testing techniques should come from studies that empirically justify the benefits and application conditions of the different techniques. However, as authors like (Hamlet, 1989) have noted, formal and practical studies of this kind do not abound, as: (1) it is difficult to compare testing techniques, because they do not have a solid theoretical foundation; (2) it is difficult to determine what testing techniques variables are of interest in these studies. In view of the importance of having mature testing knowledge, this chapter intends to analyse the maturity level of the knowledge in this area. For this purpose, we have surveyed the major empirical studies on testing in order to analyse their results and establish the factuality and objectivity level of the body of testing knowledge regarding the benefits of some techniques over others. The maturity levels that we have used are as follows:
• Program use and laboratory faults pending/confirmed: An empirical study should be/has been performed to check whether the perception of the differences between the different testing techniques is subjective or can be objectively confirmed by measurement.
• Formal analysis pending/confirmed: Statistical analysis techniques should be/have been applied to the results output to find out whether the differences observed between the techniques are really significant and are not due to variations in the environment.
• Laboratory replication pending/confirmed: Other investigators should replicate/have replicated the same experiment to confirm that they get the same results and that they are not the fruit of any uncontrolled variation.
• Field study pending/confirmed: The study should be/has been replicated using real rather than toy programs or faults.

For this purpose, the chapter has been structured as follows. Section 2 presents the chosen approach for grouping the different testing studies. Sections 3, 4, 5, 6 and 7 focus on each of the study categories described in section 2. Each of these sections will first describe the studies considered depending on the testing techniques addressed in each study and the aspects examined by each one. Each study and its results are analysed in detail and, finally, the findings are summarised. Finally, section 8 outlines the practical recommendations that can be derived from these studies, along with their maturity level, that is, how reliable these recommendations are. Section 8 also indicates what aspects should be addressed in future studies in order to increase the body of empirical knowledge on testing techniques.

The organisation of this chapter means that it can be read differently by different audiences. Software practitioners interested in the practical results of the application of testing techniques will find section 8, which summarises the practical recommendations on the use of different testing techniques and their confidence level, more interesting. Researchers interested in raising the maturity of testing knowledge will find the central sections of this chapter, which contain a detailed description of the different studies and their advantages and limitations, more interesting. The replication of particular aspects of these studies to overcome the above-mentioned limitations will contribute to providing useful knowledge on testing techniques. Researchers will also find a quick reference to aspects of testing techniques in need of further investigation in section 8.
2. Classification of Testing Techniques

Software testing is the name that identifies a set of corrective practices (as opposed to the preventive practices applied during the software construction process), whose goal is to determine software system quality. In testing, quality is determined by analysing the results of running the software product (there is another type of corrective measure, known as static analysis, which examines the product under evaluation at rest and which is studied in other chapters of this book). Testing techniques determine different criteria for selecting the test cases that are to be run on the software system. These criteria can be used to group the testing techniques into families. Accordingly, techniques belonging to one and the same family are similar as regards the information they need to generate test cases (source code or specifications) or the aspect of code to be examined by the test cases (control flow, data flow, typical errors, etc.). This is not the place to describe the features of testing techniques or their families, as this information can be gathered from the classical literature on testing techniques, for example (Beizer, 1990) and (Myers, 1979). For readers not versed in the ins and outs of each testing technique family, however, we will briefly mention each family covered in this chapter, the techniques of which it is composed, the information they require and the aspect of code they examine:

• Random Testing Techniques. The random testing techniques family is composed of the oldest and most intuitive techniques. This family proposes randomly generating test cases without following any pre-established guidelines. Nevertheless, pure randomness seldom occurs in reality, and the other two variants of the family, shown in Table 1, are the most commonly used.

  TECHNIQUE — TEST CASE GENERATION CRITERION
  Pure random — Test cases are generated at random, and generation stops when there appear to be enough.
  Guided by the number of cases — Test cases are generated at random, and generation stops when a given number of cases has been reached.
  Error guessing — Test cases are generated guided by the subject's knowledge of what typical errors are usually made when programming; generation stops when they all appear to have been covered.

Table 1. Random techniques family.
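To make the criteria in Table 1 concrete, the following sketch (not taken from any of the studies surveyed in this chapter) generates a fixed number of random test cases for a small, hypothetical unit under test; the function, its input range, the seeded fault and the oracle are all invented for illustration and correspond to the "guided by the number of cases" variant.

    import random

    def classify(x):
        # Hypothetical unit under test: should label an integer as
        # "negative", "zero" or "positive".  A seeded fault makes it
        # mislabel x == 1 (the comparison should be x > 0).
        if x < 0:
            return "negative"
        if x > 1:                      # seeded fault
            return "positive"
        return "zero"

    def oracle(x):
        # Expected behaviour, used to decide whether a test case fails.
        return "negative" if x < 0 else "zero" if x == 0 else "positive"

    def random_test_cases(n, low=-1000, high=1000, seed=0):
        # "Guided by the number of cases": stop after n random inputs.
        rng = random.Random(seed)
        return [rng.randint(low, high) for _ in range(n)]

    failures = [x for x in random_test_cases(50) if classify(x) != oracle(x)]
    print("failing inputs:", failures)   # the fault is found only if 1 happens to be drawn

The sketch also hints at the weakness discussed later in this chapter: a narrow fault is revealed only if the random generator happens to hit the failing input.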
• Functional Testing Techniques. This family of techniques proposes an approach in which the program specification is used to generate test cases. The component to be tested is viewed as a black box, whose behaviour is determined by studying its inputs and associated outputs. Of the set of possible system inputs, this family considers a subset formed by the inputs that cause anomalous system behaviour. The key for generating the test cases is to find the system inputs that have a high probability of belonging to this subset. For this purpose, the technique divides the system input set into subsets termed equivalence classes, where each class element behaves similarly, so that all the elements of a class will be inputs that cause either anomalous or normal system behaviour. The techniques of which this family is composed (Table 2) differ from each other in terms of the rigorousness with which they cover the equivalence classes.

  TECHNIQUE — TEST CASE GENERATION CRITERION
  Equivalence partitioning — A test case is generated for each equivalence class found. The test case is selected at random from within the class.
  Boundary value analysis — Several test cases are generated for each equivalence class: one that belongs to the inside of the class and as many as necessary to cover the limits (or boundaries) of the class.

Table 2. Functional testing technique family.
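As an illustration of how these two criteria differ in practice, the sketch below assumes a hypothetical specification (an input field that accepts integers from 1 to 100) and derives equivalence-class and boundary-value test cases for it; the function name and the chosen values are assumptions, not part of the original text.

    def accepts(age):
        # Hypothetical specification: valid ages are the integers 1..100;
        # anything else must be rejected.
        return 1 <= age <= 100

    # Equivalence partitioning: one representative chosen from inside each
    # class (two invalid classes and one valid class).
    equivalence_cases = {
        "below range (invalid)": -5,
        "within range (valid)": 37,
        "above range (invalid)": 250,
    }

    # Boundary value analysis: the class limits and their neighbours are
    # added to the interior representatives.
    boundary_cases = [0, 1, 2, 99, 100, 101]

    for name, value in equivalence_cases.items():
        print(name, value, "->", accepts(value))
    for value in boundary_cases:
        print("boundary", value, "->", accepts(value))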
• Control Flow Testing Techniques. Control flow testing techniques require knowledge of source code. This family selects a series of paths (a path is a code sequence that goes from the start to the end of the program) throughout the program, thereby examining the program control model. The techniques in this family vary as to the rigour with which they cover the code. Table 3 shows the techniques of which this family is composed, giving a brief description of the coverage criterion followed, in ascending order of rigorousness.
• Data Flow Testing Techniques. Data flow testing techniques also require knowledge of source code. The objective of this family is to select program paths to explore sequences of events related to the data state. Again, the techniques in this family vary as to the rigour
with which they cover the code variable states. Table 4 reflects the techniques, along with their associated coverage criterion.

  TECHNIQUE — TEST CASES GENERATION CRITERION
  Sentence coverage — The test cases are generated so that all the program sentences are executed at least once.
  Decision coverage (branch testing) — The test cases are generated so that all the program decisions take the value true or false.
  Condition coverage — The test cases are generated so that all the conditions (predicates) that form the logical expression of the decision take the value true or false.
  Decision/condition coverage — Decision coverage is not always achieved with condition coverage. Here, the cases generated with condition coverage are supplemented to achieve decision coverage.
  Path coverage — Test cases are generated to execute all program paths. This criterion is not workable in practice.

Table 3. Control flow testing technique family.

  TECHNIQUE — TEST CASES GENERATION CRITERION
  All-definitions — Test cases are generated to cover each definition of each variable for at least one use of the variable.
  All-c-uses/some-p-uses — Test cases are generated so that there is at least one path of each variable definition to each c-use of the variable. If there are variable definitions that are not covered, use p-uses.
  All-p-uses/some-c-uses — Test cases are generated so that there is at least one path of each variable definition to each p-use of the variable. If there are variable definitions that are not covered, use c-uses.
  All-c-uses — Test cases are generated so that there is at least one path of each variable definition to each c-use of the variable.
  All-p-uses — Test cases are generated so that there is at least one path of each variable definition to each p-use of the variable.
  All-uses — Test cases are generated so that there is at least one path of each variable definition to each use of the definition.
  All-du-paths — Test cases are generated for all the possible paths of each definition of each variable to each use of the definition.
  All-dus — Test cases are generated for all the possible executable paths of each definition of each variable to each use of the definition.

Table 4. Data flow testing techniques.

(There is said to be a c-use of a variable when the variable appears in a computation, that is, on the right-hand side of an assignation; there is said to be a p-use of a variable when the variable appears as a predicate of a logical expression.)
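The following sketch, built around a hypothetical two-decision function, may help to contrast the two families: it shows a test set that achieves decision (branch) coverage and a definition-use pair that such a set can still miss, which an all-uses test set would have to exercise. The function and the test values are invented for illustration.

    def price(total, member):
        # Hypothetical unit under test with two decisions.
        discount = 0                 # first definition of `discount`
        if member:                   # decision 1 (p-use of `member`)
            discount = 10            # second definition of `discount`
        if total > 100:              # decision 2 (p-use of `total`)
            total = total - discount # c-use of `discount` and `total`
        return total

    # Decision (branch) coverage: every decision must evaluate to both
    # True and False at least once.  Two cases suffice here.
    branch_suite = [(150, True), (50, False)]

    # All-uses (data flow): every definition of `discount` must reach each
    # of its uses.  The pair (discount = 0 -> total - discount) is only
    # exercised when member is False and total > 100, so the branch suite
    # above does not cover it; a third case is needed.
    all_uses_suite = branch_suite + [(150, False)]

    for case in all_uses_suite:
        print(case, "->", price(*case))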
• Mutation Testing Techniques. Mutation testing techniques are based on modelling typical programming faults by means of what are known as mutation operators (dependent on the programming language). Each mutation operator is applied to the program, giving rise to a series of mutants (programs that are exactly the same as the original program, apart from one modified sentence, originated precisely by the mutation operator). Having generated the set of mutants, test cases are generated to examine the mutated part of the program. After generating test cases to cover all the mutants, all the possible faults should, in theory, be accounted for (in practice, however, coverage is confined to the faults modelled by the mutation operators). The problem with the techniques that belong to this family is scalability. A mutation operator can generate several mutants per line of code. Therefore, there will be a sizeable number of mutants for long programs. The different techniques within this family aim to improve the scalability of standard (or strong) mutation to achieve greater efficiency. Table 5 shows the techniques of which this family is composed and gives a brief description of the mutant selection criterion.

  TECHNIQUE — TEST CASES GENERATION CRITERION
  Strong (standard) mutation — Test cases are generated to cover all the mutants generated by applying all the mutation operators defined for the programming language in question.
  Selective (or constrained) mutation — Test cases are generated to cover all the mutants generated by applying some of the mutation operators defined for the programming language. This gives rise to selective mutation variants depending on the selected operators, like, for example, 2-, 4- or 6-selective mutation (depending on the number of mutation operators not taken into account) or abs/ror mutation, which only uses these two operators.
  Weak mutation — Test cases are generated to cover a given percentage of the mutants generated by applying all the mutation operators defined for the programming language in question. This gives rise to weak mutation variants, depending on the percentage covered, for example, randomly selected 10% mutation, ex-weak, st-weak, bb-weak/1, or bb-weak/n.

Table 5. Mutation testing technique family.
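A minimal sketch of the idea behind mutation testing follows; the program, the two hand-written mutants (standing in for what mutation operators such as relational operator replacement or constant replacement would generate automatically) and the test values are all hypothetical.

    def passes(score):
        # Hypothetical program under test: a mark of 60 or more passes.
        return score >= 60

    # Two hand-made mutants, as mutation operators would produce them
    # automatically (relational operator replacement; constant replacement).
    mutants = [
        lambda score: score > 60,    # ">=" changed to ">"
        lambda score: score >= 0,    # 60 changed to 0
    ]

    def killed(mutant, tests):
        # A mutant is killed when some test case makes its output differ
        # from the original program's output.
        return any(passes(t) != mutant(t) for t in tests)

    def mutation_score(tests):
        # Fraction of mutants killed by the test set.
        return sum(killed(m, tests) for m in mutants) / len(mutants)

    print(mutation_score([30, 90]))       # 0.5: only the second mutant dies
    print(mutation_score([30, 90, 60]))   # 1.0: the boundary case kills both

Test cases are then added until all (or a chosen percentage of) the mutants are killed, which is exactly where the cost and scalability differences between the variants in Table 5 arise.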
Our aim is to review the empirical studies designed to compare testing techniques in order to identify what knowledge has been empirically validated. We have grouped the empirical studies reviewed into several subsets taking into account which techniques they compare:
• Intra-family studies, which compare techniques belonging to the same family to find out the best criterion, that is, which technique of all the family members should be used. We have identified:
  o Studies on the data flow testing techniques family.
  o Studies on the mutation testing techniques family.
• Inter-family studies, which study techniques belonging to different families to find out which family is better, that is, which type of techniques should be used. We have identified:
  o Comparative studies between the control flow and data flow testing techniques families.
  o Comparative studies between the mutation and data flow testing techniques families.
  o Comparative studies between the functional and control flow testing techniques families.
In the following sections, we examine all these sets of studies, together with the empirical results obtained.
3. Studies on the Data Flow Testing Techniques Family

The objective of this series of studies is to analyse the differences between the techniques within the data flow testing techniques family. Table 6 shows
  STUDY — ASPECT STUDIED — TESTING TECHNIQUES
  (Weyuker, 1990) — Criterion compliance; number of test cases generated — All-c-uses, all-p-uses, all-uses, all-du-paths
  (Bieman & Schultz, 1992) — Number of test cases generated — All-du-paths
Table 6. Studies on data flow testing techniques.
which aspects were studied for which testing techniques. For example, Weyuker analysed the criterion compliance and the number of test cases generated by four techniques (all-c-uses, all-p-uses, all-uses and all-du-paths), whereas Bieman and Schultz studied the number of test cases generated for the all-du-paths technique alone.

(Weyuker, 1990) (see also (Weyuker, 1988)) conducts a quantitative study to check the theoretical relationship of inclusion among the test case generation criteria followed for each technique. This theoretical relationship can be represented as follows:

  all-du-paths => all-uses
  all-uses => all-c-uses
  all-uses => all-p-uses
  all-p-uses and all-c-uses cannot be compared

Which would read as follows: the test cases that comply with the all-du-paths criterion satisfy the all-uses criterion; the test cases that comply with the all-uses criterion satisfy the all-c-uses criterion, and so on. Weyuker's empirical results (obtained by studying twenty-nine programs taken from a book on Pascal with five or more decision sentences) reveal that the following, generally, holds:

  all-uses => all-du-paths
  all-p-uses => all-c-uses
  all-p-uses => all-uses

So, the author establishes an inverse relationship with respect to the theory between all-uses and all-du-paths and between all-p-uses and all-uses. That is, she concludes that, in practice, the test cases generated to meet the all-uses criterion also normally comply with all-du-paths, and the test cases generated by all-p-uses also comply with all-uses. According to these results, it would suffice with respect to criterion compliance to use all-uses instead of all-du-paths and all-p-uses instead of all-uses, as the test cases that meet one criterion will satisfy the other. However, the number of test cases generated by each criterion needs to be examined to account for the cost (and not only the benefits) of these relationships. Analysing this variable, Weyuker gets the following relationship:
  all-c-uses < all-p-uses < all-uses < all-du-paths

which would read as: more test cases are generated to comply with all-p-uses than to meet all-c-uses, and fewer than to satisfy all-uses and all-du-paths. Bearing in mind the results concerning the number of generated test cases and criteria compliance, we could deduce that it is better to use all-p-uses than all-uses and it is better to use all-uses than all-du-paths, as the former generate fewer test cases and generally meet the other criterion. With respect to all-c-uses, although it generates fewer test cases than all-p-uses, the test cases generated by all-c-uses do not meet the criterion of all-p-uses, which means that it does not yield equivalent results to all-p-uses. Note that the fact that the set of test cases generated for one criterion is bigger than for another does not necessarily mean that the technique detects more faults, as defined in other studies examined later. And the same applies to the relationship of inclusion: the fact that a criterion includes another does not say anything about the number of faults it can detect.

Another of Weyuker's results is that the number of test cases generated by all-du-paths, although exponential in theory, is in practice linear with respect to the number of program decisions. (Bieman and Schultz, 1992) partly corroborate these results using a real industrial software system, deducing that the number of test cases required to meet this criterion is reasonable. Bieman and Schultz indicate that the number of cases in question appears to depend on the number of lines of code, but they do not conduct a statistical analysis to test this hypothesis, nor do they establish what relationship there is between the number of lines of code and the number of generated test cases.

The results yielded by this group of studies have the following limitations:
• Weyuker uses relatively simple toy programs, which means that the results cannot be directly generalised to real practice.
• On the other hand, Bieman and Schultz do not conduct a statistical analysis of the extracted data, and their study is confined to a qualitative interpretation of the data.
• The response variable used by Weyuker and Bieman and Schultz is the number of test cases generated. This characteristic merits analysis insofar as the fewer test cases are generated, the fewer are run and the fewer need to be maintained. However, it should be supplemented by a study of case effectiveness, which is a variable that better describes what is expected of the testing techniques.
• What the number of test cases generated by all-du-paths depends on needs to be examined in more detail, as one study says it is related to the number of decisions and the other to the number of lines of code, although neither further specifies this relationship.
However, despite these limitations, the following conclusions can be drawn:
• All-p-uses should be used instead of all-uses, and all-uses instead of all-du-paths, as they generate fewer test cases and generally cover the test cases generated by the other criteria.
• It is not clear that it is better to use all-c-uses instead of all-p-uses, as, even though all-c-uses generates fewer test cases, there is no guarantee that the generated test cases meet the criterion imposed by all-p-uses.
• Both Weyuker, using toy programs, and Bieman and Schultz, using industrial software, appear to agree that, contrary to testing theory, the all-du-paths technique is usable in practice, since it does not generate too many test cases.
These results are summarised in Table 7.
4. Studies on the Mutation Testing Techniques Family

This family is examined in three papers, which look at types of mutation that are less costly than traditional mutation. Generally, these papers aim to ascertain what the costs and benefits of using different mutation testing techniques are. These studies, along with the characteristics they examine and the techniques they address, are shown in Table 8. As shown in Table 8, the efficiency of these techniques is measured differently. So, whereas (Offut and Lee, 1994) (see also (Offut and Lee, 1991)) and (Offut et al., 1996) (see also (Offut et al., 1993)) measure efficiency as the percentage of mutants killed by each technique, (Wong and Mathur, 1995) measure it as the percentage of generated test sets that detect at least one fault. On the other hand, all the studies consider the cost of the techniques, identified as the number of generated test cases and/or the number of generated mutants. The results of the three studies appear to corroborate each other as regards mutation being much more costly than any of its variants, while
there does not appear to be too drastic a loss of effectiveness for the variants as compared with strong mutation. After analysing 11 subroutines of no more than 30 LOC, Offut and Lee indicate in this respect that, for non-critical applications, it is recommendable to use weak as opposed to strong mutation, because it generates fewer test cases and kills a fairly high percentage of mutants. In particular, they suggest that bb-weak-1 and st-weak kill a higher percentage of mutants, but they also generate more test cases.
Results for (Weyuker, 1990):
  Criteria compliance — All-p-uses includes all-uses; all-uses includes all-du-paths.
  Number of test cases generated — All-c-uses generates fewer test cases than all-p-uses; all-p-uses generates fewer test cases than all-uses; all-uses generates fewer test cases than all-du-paths. The number of test cases generated by all-du-paths is linear as regards the number of decisions in the program, rather than exponential as stated in theory.
Results for (Bieman & Schultz, 1992):
  Number of test cases generated — The number of test cases generated with all-du-paths is not exponential, as stated in theory, and is reasonable; it seems to depend on the number of lines of code.
PRACTICAL RESULTS:
  - All-p-uses should be used instead of all-uses, and all-uses instead of all-du-paths, as they generate fewer test cases and generally cover the test cases generated by the other criteria.
  - It is not clear that it is better to use all-c-uses instead of all-p-uses, as, even though all-c-uses generates fewer test cases, coverage is not assured.
  - Contrary to testing theory, the all-du-paths technique is usable in practice, since it does not generate too many test cases.
LIMITATIONS:
  - It remains to ratify the laboratory results of Weyuker's study in industry.
  - The results of Bieman and Schultz's study have to be corroborated using formal statistical analysis techniques.
  - Technique effectiveness should be studied, as the fact that the test cases generated with one criterion cover the other criteria is not necessarily related to effectiveness.
  - What the number of test cases generated in all-du-paths depends on should be studied in more detail, as one study says it depends on the number of decisions and the other on the number of lines of code.

Table 7. Results of the studies on data flow testing techniques.
  STUDY: (Offut & Lee, 1994)
    Aspects studied — % mutants killed by each technique; no. of generated cases.
    Techniques — Mutation (strong/standard); weak mutation variants (MD ex-weak, MD st-weak, MD bb-weak/1, MD bb-weak/n).
  STUDY: (Offut et al., 1996)
    Aspects studied — % mutants killed by each technique; no. of generated cases; no. of generated mutants.
    Techniques — Mutation (strong/standard); 2-selective, 4-selective and 6-selective mutation.
  STUDY: (Wong & Mathur, 1995)
    Aspects studied — % generated sets that detect at least 1 fault.
    Techniques — Mutation (strong/standard); randomly selected 10% mutation; constrained (abs/ror) mutation.
Table 8. Studies on mutation testing techniques.
Furthermore, Offut et al. analyse 10 programs (9 of which were studied by Offut and Lee, of no more than 48 LOC) and find that the percentage of strong mutation mutants killed by each selective variant is over 99% and is, in some cases, 100%. Therefore, the authors conclude that selective mutation is an effective alternative to strong mutation. Additionally, selective mutation cuts test costs substantially, as it reduces the number of generated mutants. As regards Wong and Mathur, they compare strong mutation with two selective variants (randomly selected 10%> mutation and constrained mutation, also known as abs/ror mutation). They find, on 10 small programs, that strong or standard mutation is equally as or more effective than either of the other two techniques. However, these results are not supported statistically, which means that it is impossible to determine whether or not this difference in effectiveness is significant.
(A mutant is killed when a test case causes it to fail.)
Finally, Wong and Mathur refer to other studies they have performed, which determined that abs/ror mutation and 10% mutation generate fewer test cases than strong mutation. This gain is offset by a loss of less than 5% in terms of coverage as compared with strong mutation. This could mean that many of the faults are made in expressions and conditions, which are the questions evaluated by abs/ror mutation. For this reason and for noncritical applications, they suggest the possibility of applying abs/ror mutation to get good cost/benefit performance (less time and effort with respect to a small loss of coverage).
In summary, the conclusions reached by this group of studies are:
• Standard mutation appears to be more effective, but is also more costly than any of the other techniques studied.
• The mutation variants provide similar, although slightly lower, effectiveness and are less costly (they generate fewer mutants and, therefore, fewer test cases), which means that the different mutation variants could be used instead of strong mutation for non-critical systems.
However, the following limitations have to be taken into account:
• The programs considered in these studies are not real, which means that the results cannot be generalised to industrial cases, and a replication in this context is required to get greater results reliability.
• Offut et al. and Wong and Mathur do not use formal techniques of statistical analysis, which means that their results are questionable.
• Additionally, it would be interesting to compare the standard mutation variants with each other and not only with standard mutation, to find out which are more effective from the cost and performance viewpoint.
• Furthermore, the number of mutants that a technique kills is not necessarily a good measure of effectiveness, because it is not explicitly related to the number of faults the technique detects.
Table 9 shows the results of this set of studies.
Results for (Offut & Lee, 1994):
  % mutants killed by each technique — The percentage of mutants killed by weak mutation is high.
  No. of cases generated — Weak mutation generates fewer test cases than strong mutation; st-weak/1 and bb-weak/1 generate more test cases than ex-weak/1 and bb-weak/n.
Results for (Offut et al., 1996):
  % mutants killed by each technique — Selective mutation kills more than 99% of the mutants generated by strong mutation.
  No. of cases generated — Although not explicitly stated, strong mutation generates more test cases than selective mutation.
  No. of mutants generated — Selective mutation generates fewer mutants than strong mutation.
Results for (Wong & Mathur, 1995):
  % sets generated that detect at least 1 fault — Standard mutation is more effective than 10% mutation in 90% of cases and equal in 10%; standard mutation is more effective than abs/ror in 40% of the cases and equal in 60%; abs/ror is equally or more effective than 10% mutation in 90% of the cases.
  No. of mutants generated — 10% mutation generates fewer mutants than abs/ror mutation (approx. half); abs/ror generates from 50 to 100 times fewer mutants than standard mutation.
PRACTICAL RESULTS:
  - Where time is a critical factor, it is better to use weak as opposed to standard mutation, as it generates fewer test cases and effectiveness is approximately the same.
  - Where time is a critical factor, it is better to use selective (ex-weak/1 and bb-weak/n) as opposed to standard mutation, as it generates fewer mutants (and therefore fewer test cases) and its effectiveness is practically the same.
  - Where time is a critical factor, it is better to use 10% selective as opposed to standard mutation, although there is some loss in effectiveness, because it generates much fewer test cases. In intermediate cases, it is preferable to use abs/ror mutation, because, although it generates more cases (from 50 to 100 times more), it raises effectiveness by 7 points. If time is not a critical factor, it is preferable to use standard mutation.
LIMITATIONS:
  - It remains to ratify the laboratory results of these studies in industry.
  - The results of the studies by Offut et al. and Wong and Mathur should be corroborated using formal techniques of statistical analysis.
  - It remains to compare the variants of strong mutation with each other.
  - The studies should be repeated with another measure of effectiveness, as the number of mutants killed by a technique is not necessarily a good measure of effectiveness.
Table 9. Results of the studies on mutation testing techniques.
5. Comparative Studies Between the Data-Flow, Control-Flow and Random Testing Techniques Families

The objective of this series of studies is to analyse the differences between three families, selecting, for this purpose, given techniques from each family. The selected techniques are the branch testing (decision coverage) control flow technique, all-uses and all-dus within the data flow family, and random in the random testing technique family. Table 10 shows the studies considered, the aspects studied by each one and for which testing techniques.
  STUDY: (Frankl & Weiss, 1993)
    Aspects studied — Number of test cases generated; no. of sets with at least 1 fault/no. of sets generated.
    Techniques — All-uses; branch testing (all-edges); random (null).
  STUDY: (Hutchins et al., 1994)
    Aspects studied — Number of test cases generated; no. of sets with at least 1 fault/no. of sets generated.
    Techniques — Branch testing (all-edges); all-dus (modified all-uses); random (null).
  STUDY: (Frankl & Iakounenko, 1998)
    Aspects studied — No. of sets with at least 1 fault/no. of sets generated.
    Techniques — All-uses; branch testing (all-edges); random (null).
Table 10. Comparative studies of the data flow, control flow and random testing technique families. (Frankl and Weiss, 1993) (see also (Frankl and Weiss, 1991a) and (Frankl and Weiss, 1991b)) and (Frankl and Iakounenko, 1998) study the effectiveness of the all-uses, branch testing and random testing techniques in terms of the probability of a set of test cases detecting at least one fault, measured as the number of sets of test cases that detect at least one fault/total number of sets generated. Frankl and Weiss use nine toy programs containing one or more faults to measure technique effectiveness. The results of the study indicate that the probability of a set of test cases detecting at least one fault is greater (from a statistically significant viewpoint) for all-uses than for all-edges in five of
Limitations of Empirical Testing Technique Knowledge
17
the nine cases. Additionally, all-uses behaves better than random in six of the nine cases and all-edges behaves better than random testing in five of the nine cases. Analysing the five cases where all-uses behaves better than all-edges, Frankl and Weiss find that all-uses provides a greater probability of a set of cases detecting at least one fault with sets of the same size in four of the five cases. Also, analysing the six cases where all-uses behaves better than random testing, the authors find that all-uses provides a greater probability of a set of cases detecting at least one fault in four of these cases. That is, all-uses has a greater probability of detecting a fault not because it works with sets containing more test cases than all-edges or random testing, but thanks to the very strategy of the technique. Note that the diffrence in the behaviour of the techniques (of nine programs, there are five for which a difference is observed and four for which none is observed for all-uses and six out of nine for random testing) is not statistically significant, which means that it cannot be claimed outright that all-uses is more effective than all-edges or random testing. Analysing the five cases where all-edges behaves better than random testing, Frankl and Weiss find that in no case does all-edges provide a greater probability of a set of cases detecting at least one fault with sets of the same size. That is, in this case, all-edges has a greater probability of detecting a fault than random testing because it works with larger sets. Frankl and Weiss also discern a relationship between technique effectiveness and coverage, but they do not study this connection in detail. Frankl and Iakounenko, however, do study this relationship and, as mentioned above, again define effectiveness as the probability of a set of test cases finding at least one fault, measured as the number of sets of test cases that detect at least one fault/total number of generated test cases. Frankl and Iakounenko deal with eight versions of an industrial program, each containing a real fault. Although the study data are not analysed statistically and its conclusions are based on graphical representations of the data, the qualitative analysis indicates that, as a general rule, effectiveness is greater when coverage is higher, irrespective of the technique. However, there are occasions where effectiveness is not 1, which means that some faults are not detected, even when coverage is 100%. This means that coverage increases the probability of finding a fault, but it does not guarantee that it is detected. Additionally, both all-uses and all-edges appear to behave similarly in terms of effectiveness, which is a similar result to what Frankl and Weiss found. For high coverage levels, both all-uses and
18
N. Juristo, A. M. Moreno & S. Vegas
all-edges behave much better than random testing. Indeed, Frankl and Weiss believe that the behaviour of random testing is unrelated to coverage. Hence, as random testing does not improve with coverage, it deteriorates with respect to the other two. Note that even when technique coverage is close to 100%, there are programs for which the technique's fault detection effectiveness is not close to 1. This leads us to suspect that there are techniques that work better or worse depending on the fault type. The better techniques for a given fault type would be the ones for which effectiveness is 1, whereas the poorer ones would be the techniques for which effectiveness is not 1, even though coverage is optimum. However, Frankl and Weiss do not further research this relationship. The study by (Hutchins et al, 1994) compares all-edges with all-dus and with random testing. As shown in Table 10, the authors study the number of test cases generated by each technique, the effectiveness of the techniques, again measured as the number of sets that detect at least one fault/the total sets, as well as the relationship to coverage. Hutchins et al. consider seven toy programs (with a number of lines of code from 138 to 515), of which they generate versions with just one fault. The results of the study show that the greater coverage is, the more effective the techniques are. While there is no evidence of a significant difference in effectiveness between all-edges and all-dus, there is for random testing. Furthermore, the authors study the sizes of the test cases generated by all-edges and all-dus, and how big a set of cases generated using random testing would have to be for a given coverage interval to be equally effective. They reach the conclusion that the sizes generated by all-edges and all-dus are similar and that the increase in the size of one set of cases generated by random testing can vary from 50 to 160% for high coverages (over 90%). The authors further examine the study, analysing the fault types detected by each technique. They find that each technique detects different faults, which means that although the effectiveness of all-edges and all-dus is similar, the application of one instead of the other is not an option, as they find different faults.
•
The limitations discovered in these studies are: Frankl and Weiss and Hutchins et al. use relatively simple, nonindustrial programs, which means that the results cannot be directly generalised to real practice.
Limitations of Empirical Testing Technique Knowledge
•
Of the three studies, Frankl and Iakounenko do not run a statistical analysis of the extracted data, which means that the significance of the results is questionable. The evaluation of the effectiveness of the techniques studied, measured as the probability of detecting at least one fault in the programs, is not useful in real practice. Measures of effectiveness like, for example, number of faults detected over number of total faults are more attractive in practice. Besides technique effectiveness, Frankl and Weiss and Frankl and Iakounenko should also study technique complementarity (as in Hutchins et al.), in order to be able to determine whether or not technique application could be considered exclusive, apart from extracting results regarding similar technique effectiveness levels. As regards the conclusions, it can be said that: There does not appear to be a difference between all-uses, all-edges and random testing as regards effectiveness from the statistical viewpoint, as the number of programs in which one comes out on top of the other is not statistically significant. However, from the practical viewpoint, random testing is easier to satisfy than all-edges and, in turn, all-edges is easier to satisfy than all-uses. On the other hand, all-uses is better than all-edges and than random testing as a technique, whereas all-edges is better than random because it generates more test cases. It follows from the results of the above studies that, in the event of time constraints, the use of the random testing technique can be relied upon to yield an effectiveness similar to all-uses in 50% of the cases. Where testing needs to be exhaustive, the application of all-uses provides assurance, as, in the other half of the cases, this criterion yielded more efficient results thanks to the actual technique and not because it generated more test cases. A logical relationship between coverage and effectiveness was also detected (the greater the coverage, the greater the effectiveness). However, effectiveness is not necessarily optimum in all cases even if maximum coverage is achieved. Therefore, it would be interesting to analyse in detail the faults entered in the programs in which the effectiveness of the techniques is below optimum, as a dependency could possibly be identified between the fault types and the techniques that detect these faults.
19
20
•
N. Juristo, A. M. Moreno & S. Vegas
Hutchins et al. discover a direct relationship between coverage and effectiveness for all-uses and all-edges, whereas no such relationship exists for random testing. For high coverage levels, the effectiveness of all-uses and all-edges is similar. Frankl and Iakounenko also discover a direct relationship between coverage and effectiveness for all-uses and all-edges. Again, the effectiveness of both techniques is similar, although all-edges and alldus are complementary because they detect different faults. Even when there is maximum coverage, however, there is no guarantee that a fault will be detected. This suggests that the techniques may be sensitive to certain fault types.
Taking into account these limitations, however, we do get some interesting results, which have been summarised in Table 11.
6. Comparisons between the Mutation and the Data-Flow Testing Techniques Families We have found two studies that compare mutation with data flow techniques. These studies, along with the characteristics studied and the techniques addressed, are shown in Table 12. (Frankl et al., 1997) (see also (Frakl et al, 1994)) compare the effectiveness of mutation testing and all-uses. They study the ratio between the sets of test cases that detect at least one fault vs. total sets for these techniques. The effectiveness of the techniques is determined at different coverage levels (measured as the percentage of mutants killed by each technique). The results for 9 programs at high coverage levels with a number of faults of no more than 2 are as follows: • Mutation is more effective for 5 of the 9 cases • All-uses is more effective than mutation for 2 of the 9 cases • There is no significant difference for the other two cases. With regard to (Wong and Mathur, 1995), they compare strong mutation, as well as two variants of strong mutation (randomly selected 10% and constrained mutation, also known as abs/ror mutation), with all-uses, studying again the ratio between the sets of test cases that detect at least 1 fault vs. total sets. For this purpose, the authors study 10 small programs, finding that the mutation techniques behave similarly to all-uses.
ASPECT STUDIED: Number of test cases generated
- (Frankl & Weiss, 1993): All-uses is a better technique than all-edges and random by the technique itself. All-edges is better than random because it generates more test cases.
- (Hutchins et al., 1994): All-edges and all-dus generate approximately the same number of test cases. To achieve the same effectiveness as all-edges and all-dus, random has to generate from 50% to 160% more test cases.
- (Frankl & Iakounenko, 1998): Not studied.

ASPECT STUDIED: No. of sets detecting at least 1 fault / no. of sets generated
- (Frankl & Weiss, 1993): There is no convincing result regarding all-uses being more effective than all-edges and random: in approximately 50% of the cases, all-uses is more effective than all-edges and random, and all-edges is more effective than random; in the other 50% of the cases, all-uses, all-edges and random behave equally.
- (Hutchins et al., 1994): The effectiveness of all-edges and all-dus is similar, but they find different faults. Maximum coverage does not guarantee that a fault will be detected.
- (Frankl & Iakounenko, 1998): There is an effectiveness/coverage relationship for all-edges and all-uses (not so for random). There is no difference as regards effectiveness between all-uses and all-edges for high coverages.

PRACTICAL RESULTS
- In the event of time constraints, the use of the random testing technique can be relied upon to yield an effectiveness similar to all-uses and all-edges (the differences being smaller the higher the coverage is) in 50% of the cases. Where testing needs to be exhaustive, the application of all-uses provides assurance, as, in the other half of the cases, this criterion yielded more efficient results thanks to the actual technique, unlike all-edges, which was more efficient because it generated more test cases.
- All-edges should be applied together with all-dus, as they are equally effective and detect different faults. Additionally, they generate about the same number of test cases, and the random testing technique has to generate between 50% and 160% more test cases to achieve the same effectiveness as all-edges and all-dus.
- High coverage levels are recommended for all-edges, all-uses and all-dus, as this increases their effectiveness. This is not the case for the random testing technique. Even when there is maximum coverage, however, there is no guarantee that a fault will be detected.

LIMITATIONS
- It remains to ratify the laboratory results of the studies by Hutchins et al. and Frankl and Iakounenko in industry.
- The results of the studies by Frankl and Weiss should be corroborated using formal techniques of statistical analysis.
- The type of faults should be studied in the programs where maximum effectiveness is not achieved despite there being maximum coverage, as this would help to determine technique complementarity.
- The studies should be repeated for a more practical measure of effectiveness, as the percentage of test case sets that find at least one fault is not realistic.

Table 11. Results of the studies comparing data flow, control flow and random testing techniques.
ASPECT STUDIED
- % mutants killed by each technique: (Frankl et al., 1997).
- Ratio of generated sets that detect at least 1 fault: (Frankl et al., 1997) and (Wong & Mathur, 1995).

TESTING TECHNIQUE
- Mutation (strong or standard): (Frankl et al., 1997) and (Wong & Mathur, 1995).
- All-uses: (Frankl et al., 1997) and (Wong & Mathur, 1995).
- Randomly selected 10% mutation: (Wong & Mathur, 1995).
- Constrained (abs/ror) mutation: (Wong & Mathur, 1995).
Table 12. Comparative studies between the mutation and data flow testing techniques families.

We cannot conclude from these results that there is a clear difference in terms of effectiveness between mutation testing and all-uses. Additionally, the authors highlight that it is harder to get high coverage with mutation than with all-uses. The limitations are:
• The results of Frankl et al. can be considered as a first attempt at comparing mutation testing techniques with all-uses, as this study has some drawbacks. First, the faults introduced into the programs were faults that, according to the authors, "occurred naturally". However, the programs are relatively small (no more than 78 LOC), and it is not said whether or not they are real. Additionally, the fact that the programs had no more than two faults is not significant from a practical viewpoint.
• Wong and Mathur do not use real programs or formal techniques of statistical analysis, which means that their results cannot be considered conclusive until a formal analysis of the results has been conducted on real programs.
• The use of the percentage of sets that discover at least one fault as the response variable is not significant from a practical viewpoint.
• Note that a potentially interesting question for this study would have been to examine the differences in the programs for which mutation and data flow testing techniques yield different results. This could have identified a possible relationship between program or fault types and the techniques studied, which would help to define application conditions for these techniques. There should be a more detailed study of the dependency between the technique and the program type to be able to more objectively determine the benefits of each of these techniques.
• In any replications of this study, it would be important to analyse the cost of technique application (in the sense of application time and number of test cases to be applied) to conduct a more detailed cost/benefit analysis.
The main results of this group are summarised in Table 13. As a general rule, mutation testing appears to be as or more effective than all-uses, although it is more costly.

STUDY: (Frankl et al., 1997)
ASPECT STUDIED: % mutants killed by each technique
RESULTS:
- It is more costly to reach high coverage levels with mutation than with all-uses.
- There is not a clear difference between mutation and all-uses.

STUDY: (Wong & Mathur, 1995)
ASPECT STUDIED: Ratio of sets generated that detect at least 1 fault
RESULTS:
- Standard mutation is more effective than all-uses in 63% of the cases and equally effective in 37%.
- Abs/ror is more effective than all-uses in 50% of the cases, equally effective in 30% and less effective in 20%.
- All-uses is more effective than 10% mutation in 40% of the cases, equally effective in 20% and less effective in 40%.

PRACTICAL RESULTS
- If high coverage is important and time is limited, it is preferable to use all-uses as opposed to mutation, as it will be just as effective as mutation in about half of the cases.
- All-uses behaves similarly as regards effectiveness to abs/ror and 10% mutation.

LIMITATIONS
- It remains to ratify the laboratory results of the studies in industry.
- The studies should be repeated for a more practical measure of effectiveness, as the percentage of sets of cases that find at least one fault is not realistic.
- It would be of interest to further examine the differences in the programs in which mutation and the data flow testing technique yield different results.
- The cost of technique application should be studied.

Table 13. Comparisons between mutation and all-uses.
7. Comparisons Between the Functional and Control-Flow Testing Techniques Families

The four studies of which this group is composed are reflected in Table 14. These are empirical studies in which the authors investigate the differences between the control flow and the functional testing technique families. These studies actually also compare these two testing technique families with some static code analysis techniques, which are not taken into account for the purposes of this paper, as they are not testing techniques.

In Myers' study (Myers, 1978), inexperienced subjects choose to apply one control flow and one functional testing technique, which they apply to a program taken from a programming book, analysing the variables: number of faults detected, time to detect faults, time to detect a fault per fault type, number of faults detected combining techniques, and time taken to combine techniques per fault type. Myers does not specify which particular techniques were used, which means that this study does not provide very practical results. One noteworthy result, however, is that the author does not find a significant difference as regards the number of faults detected by the two technique types. However, the author indicates that different methods detect some fault types better than others (although this analysis is not performed statistically). Myers also studies fault detection efficiency combining the results of two different people. Looking at Table 14, we find that (Wood et al., 1997) also address this factor. The conclusions are similar in the two studies, that is, more faults are detected combining the faults found by two people. However, there are no significant differences between the different technique combinations.

Of the studies in Table 14, we find that (Basili and Selby, 1987) (see also (Basili and Selby, 1985) and (Selby and Basili, 1984)) and Wood et al. use almost the same response variables: number of detected faults, percentage of detected faults, time to detect faults, number of faults detected per hour and percentage of faults detected per fault type for Basili and Selby, and number of detected faults, number of faults detected combining techniques, number of faults detected per hour and percentage of detected faults for Wood et al. Apart from these results, (Kamsties and Lott, 1995) also take an interest in the faults that cause the different failures, studying another set of variables, as shown in Table 14: time to detect faults, number of faults found
per hour, number of faults isolated per hour, percentage of faults detected per type, percentage of faults isolated per type, time to isolate faults, percentage of faults isolated, percentage of faults detected and total time to detect and isolate faults. Whereas Basili and Selby replicate the experiment with experienced and inexperienced subjects (two and one replications, respectively), Wood et al., like Kamsties and Lott, use only inexperienced subjects.
STUDY: (Myers, 1978)
- Aspects studied: no. of faults detected; time to detect faults; time to detect faults per fault type; no. of faults detected combining techniques; time combining techniques per fault type.
- Testing techniques: white box and black box (the concrete techniques are not specified).

STUDY: (Basili & Selby, 1987)
- Aspects studied: no. of faults detected; % of faults detected; time to detect faults; no. of faults found per hour; % of faults detected per fault type.
- Testing techniques: boundary value analysis; sentence coverage.

STUDY: (Kamsties & Lott, 1995)
- Aspects studied: time to detect faults; no. of faults found per hour; no. of faults isolated per hour; % of faults detected per fault type; % of faults isolated per fault type; time to isolate faults; % of faults detected; % of faults isolated; total time to detect and isolate faults.
- Testing techniques: boundary value analysis; condition coverage.

STUDY: (Wood et al., 1997)
- Aspects studied: no. of faults detected; no. of faults detected combining techniques; no. of faults found per hour; % of faults detected.
- Testing techniques: boundary value analysis; decision coverage (branch testing).

Table 14. Comparative studies of functional and control testing techniques.
This means that Basili and Selby can further examine the effect of experience on the fault detection rate (number of faults detected per hour) or the time taken to detect faults. As regards the first variable, the authors indicate that the fault detection rate is the same for experienced and inexperienced subjects for both techniques (boundary value analysis and sentence coverage), that is, neither experience nor the technique influences this result. With respect to time, Basili and Selby indicate that the experienced subjects take longer to detect a fault using the functional technique than with sentence coverage. This means that experienced subjects detect fewer faults with the functional testing technique than with the structural technique within a given time. For inexperienced subjects, on the other hand, the findings are inconclusive, as the results of the replications are not the same (in one replication no differences were observed between the techniques and in the other, the functional testing technique took longer to detect faults).

Also as regards time, the study by Kamsties and Lott (who, remember, worked with inexperienced subjects) indicates that the total time to detect and isolate faults is less using the functional testing technique than with condition coverage. As these authors studied the time to detect and isolate faults separately, they were able to determine statistically that it takes longer to isolate the fault using the functional technique than with condition coverage, but the time to detect the fault is less. Note that this result cannot be directly compared with the findings of Basili and Selby, where the functional technique did not take less time to detect faults, as the two consider different structural testing techniques: sentence coverage (Basili and Selby) and condition coverage (Kamsties and Lott). As regards efficiency, Kamsties and Lott indicate that the fault detection rate was greater for the functional testing technique than for condition coverage.

Kamsties and Lott note that there were no significant differences between the percentage of isolated and detected faults, that is, both techniques behaved similarly, because the program was the influential factor. This result was corroborated by the studies by Basili and Selby and Wood et al., who claim that the percentage of detected faults depends on the program and, according to Wood et al., more specifically, on the faults present in these programs. Basili and Selby and Kamsties and Lott have also studied the percentage of faults detected by the techniques according to fault type. In this respect, whereas Basili and Selby claim that the functional technique detects more
control faults than sentence coverage, Kamsties and Lott indicate that, generally, there are no significant differences between the functional testing technique and condition coverage with regard to the percentage of isolated and detected faults by fault type.

Finally, it should be mentioned that Wood et al. also focus on the study of the number of detected faults using each technique individually and combining the results of two people applying the same or different techniques. Individually, they reach the conclusion that it is impossible to ascertain which technique is more effective, as the program (fault) is also influential. On the other hand, they find that the number of different faults detected is higher combining the results of different people, instead of considering only the results of the individual application of each technique. However, a formal analysis of the data would show that there is no significant difference between two people applying the same or different techniques, which might suggest that it is the people and not the techniques that find different faults (although this claim would require further examination).

The studies considered in this group generally include an experimental design and analysis, which means that their results are reliable. However, caution needs to be exercised when generalising and directly comparing these results for several reasons:
• They use relatively small programs, between 150 and 350 LOC, which are generally toy programs and might not be representative of industrial software.
• Most, although not all, of the faults considered in these programs are inserted by the authors ad hoc for the experiments run, which means that there is no guarantee that these are faults that would occur in real programs.
• The studies by Basili and Selby, Kamsties and Lott and Wood et al. compare the boundary value analysis technique with three different structural techniques. Hence, although some results of different studies may appear to be contradictory at first glance, a more detailed analysis would be needed to compare the structural techniques with each other.
• Although the response variables used in all the studies are quite similar, care should be exercised when directly comparing the results, because, as mentioned above, the techniques studied are not absolutely identical.
• In Myers' study, it is not very clear which concrete techniques the subjects apply, since they are asked to apply a control flow and a functional testing technique of their choice.

The conclusions that can be drawn are:
• The boundary value analysis technique appears to behave differently compared with different structural testing techniques (particularly, sentence coverage and condition coverage). Note that from the practical viewpoint, condition coverage is more applicable, which means that future replications should focus on condition coverage rather than sentence coverage. Nothing can be said about branch testing.
• Basili and Selby, Kamsties and Lott and Wood et al. find effectiveness-related differences between functional and structural techniques depending on the program to which they are applied. Wood et al. further examine this relationship, indicating that it is the fault type that really influences the detected faults (and, more specifically, the influential factor is the failures that these faults cause in programs), whereas Kamsties and Lott and Myers find no such difference.
• Also, there appears to be a relationship between the programs, or the type of faults entered in the programs, and technique effectiveness, as indicated by all three studies. However, this relationship has not been defined in detail. Basili and Selby point out that the functional technique detects more control faults. Myers also discerns a difference as regards different faults, but fails to conduct a statistical analysis. Finally, Kamsties and Lott find no such difference, which means that a more exhaustive study would be desirable.
• More faults are detected using the same technique if the results of different people are combined than individually (a minimal sketch of this combination follows the list).
• Any possible extensions of these studies should deal, whenever possible, with real problems and faults in order to be able to generalise the results obtained.
• Finally, it would be recommendable to unify the techniques under study in future replications to be able to generalise conclusions.
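As a minimal illustration of the point about combining the results of different people, the following sketch (hypothetical fault identifiers, not data from any of the studies) shows how the set of distinct faults found by a pair of reviewers is obtained.

    # Hypothetical faults detected by two people applying the same technique.
    faults_person_a = {"F1", "F3", "F4", "F7"}
    faults_person_b = {"F2", "F3", "F5", "F7"}

    combined = faults_person_a | faults_person_b   # distinct faults found by the pair
    print(len(faults_person_a), len(faults_person_b), len(combined))   # 4 4 6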
Taking this into account, we have summarised the results of this group in Table 15 and Table 16.
PBR versus PBR → CBR). Before we explain the expected effects when comparing PBR with CBR in the form of hypotheses, we define the following notation (we use d(a|b) as shorthand for d(a) or d(b), respectively).

e_i(CBR|PBR): the individual defect detection effort for inspector i when using (CBR|PBR).
m(CBR|PBR): the meeting effort after the individual inspectors used (CBR|PBR).
D(CBR|PBR): the total number of unique defects found by two inspectors using (CBR|PBR).

Table 1. Notation.
We use this terminology to define our expectations.

1. The Effectiveness of PBR is Larger than the Effectiveness of CBR for Teams. Given the anticipated defect detection benefits of PBR, we would expect the following to hold within a given experiment:

D(CBR) < D(PBR)   (5)
2. Defect Detection Cost for Teams is Lower with PBR than with CBR. The PBR technique requires an inspector to perform certain activities based on the content of the inspected document and to actively work with the document instead of passively scanning through it. Moreover, individual inspectors have to document some intermediate results, such as a call graph, a description of the function, or some test cases. It is therefore expected that an inspector will spend at least as much effort for defect detection using PBR as using CBR:

e_i(PBR) ≥ e_i(CBR)   (6)
The effect that we anticipate is:

D(PBR) / (e_1(PBR) + e_2(PBR)) > D(CBR) / (e_1(CBR) + e_2(CBR))   (7)
which states that the increase in detected defects must be larger than the additional effort spent for defect detection using PBR. In this case, the total cost to detect a defect using CBR will be higher than for PBR.

3. Meeting Cost for Teams is Lower with PBR than with CBR. Performing these extra activities during PBR is expected to result in a better understanding of the document. Therefore, during the meeting, inspectors do not have to spend a lot of extra effort in explaining the defects that they found to their counterpart in the team. Furthermore, it will take less effort to resolve false positives due to the enhanced
understanding. The better understanding is expected to translate into an overall reduction in meeting effort for PBR compared to CBR:7

m(CBR) > m(PBR)   (8)
It follows from (5) and (8):

m(CBR) / D(CBR) > m(PBR) / D(PBR)   (9)
Therefore, one would hypothesise that the meeting cost for PBR will be less than for CBR.

4. Overall Inspection Cost is Lower with PBR than with CBR. Following from the above two arguments about the cost per defect for defect detection and for the meeting, we would expect the overall cost per defect for both phases to be smaller for PBR than for CBR.

Based on the above explanation of the expected effects, below we state our four null hypotheses:

H01: An inspection team is as effective or more effective using CBR than it is using PBR.
H02: An inspection team using PBR finds defects at the same or higher cost per defect than a team using CBR for the defect detection phase of the inspection.
H03: An inspection team using PBR finds defects at the same or higher cost per defect than a team using CBR for the meeting phase of the inspection.
H04: An inspection team using PBR finds defects at the same or higher cost per defect than a team using CBR for all phases of the inspection.
7 One can argue that more time could be taken during the meeting for an inspector to understand the other perspective and aspects of the code not demanded by his/her own perspective. This could result in a greater meeting time for PBR than for CBR. However, in such a case one would expect inspectors to share the abstractions that they have constructed during their individual reading, making it easier for an inspector to comprehend the reasoning behind the other perspective's defects. For CBR such abstractions are not available.
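To make the quantities behind expectations (5) to (9) concrete, here is a minimal sketch, with invented numbers, that computes team effectiveness and the two cost-per-defect ratios from the notation of Table 1. The values are purely illustrative and are not the study data.

    # Hypothetical data for one two-person team per technique (minutes and defect counts).
    total_defects = 10

    D_cbr, e_cbr, m_cbr = 6, (90, 100), 60     # CBR: defects found, individual efforts, meeting effort
    D_pbr, e_pbr, m_pbr = 8, (110, 120), 45    # PBR: defects found, individual efforts, meeting effort

    def effectiveness(D, total):
        return D / total

    def detection_cost_per_defect(e, D):       # inverse of the ratio compared in (7)
        return sum(e) / D

    def meeting_cost_per_defect(m, D):         # the ratio compared in (9)
        return m / D

    for name, D, e, m in [("CBR", D_cbr, e_cbr, m_cbr), ("PBR", D_pbr, e_pbr, m_pbr)]:
        print(name, effectiveness(D, total_defects),
              detection_cost_per_defect(e, D), meeting_cost_per_defect(m, D))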
Research Method

Experimental Design

Our study was conducted as part of a large training course on software inspection within Bosch Telecom GmbH. We ran the initial quasi-experiment, and then replicated it twice (we refer to each one of these as a "run"). In each run we had 20 subjects, giving a total of 60 subjects who took part in the original quasi-experiment and its two replications. Each of the original quasi-experiment and its two replications took 6 days. Therefore the full study lasted 18 days. In this section we describe the rationale behind the quasi-experimental design that we have employed, and the alternatives considered. In particular, since we make causal interpretations of the results, we discuss the potential weaknesses of a quasi-experiment and how we have addressed them. As a prelude to the discussion of the experimental design, we emphasise that experimental design is, according to Hays [42], a problem in economics. Each choice that one makes for a design has its price. For example, the more treatments, subjects, and number of hypotheses one considers, the more costly an experiment is likely to be. This is particularly true for field experiments. Therefore, the objective is to choose a design that minimises the threats to validity within the prevailing cost constraints.

Description of the Environment

Bosch Telecom GmbH is a major player in the telecommunication market and develops high quality telecommunication systems (e.g., modern transmission systems based on SDH technology, access networks, switching systems) containing embedded software. One major task of the embedded software is the management of these systems. In the transmission systems and access networks this comprises alarm management, configuration management, performance management, on-line software download, and on-line database down- and up-load. There are four typical characteristics of this kind of software. First, it is event triggered. Second, there are real time requirements. Third, the software must be highly reliable, which basically means that the software system must be available 24 hours a day. Finally, the developed software system must be tailorable to different hardware configurations. Because of these characteristics and the ever increasing competition in the telecommunications market, high product quality is one of the most crucial demands for software development projects at Bosch Telecom GmbH.
Although reviews are performed at various stages of the development process to ensure that the quality goals are met, a large percentage of defects in the software system are actually found throughout the integration testing phase. A more detailed analysis of these defects revealed that many (typically one half) have their origin in the implemented code. Since defects detected in the integration testing phase are expensive (estimate: 5000 DM per detected and fixed defect), several departments at Bosch Telecom GmbH decided to introduce code inspections to detect these defects earlier and, thus, save detection as well as correction cost.8

Part of the effort to introduce code inspections was a code inspection training course for the software developers. We organised this as a quasi-experiment and two internal replications. This quasi-experimental organisation is advantageous for both participants and researchers. The participants not only learnt how software inspection works in theory but also practised it using a checklist as well as a PBR scenario for defect detection. In fact, the practical exercises offered the developers the possibility to convince themselves that software inspection helps improve the quality of code artefacts and, therefore, is beneficial in the context of their own software development projects. In a sense, these exercises help overcome the "not applicable here (NAH)" syndrome [50] often observed in software development organisations, and alleviated many objections against software inspections on the developers' behalf. From a researcher's perspective, the training effort offered the possibility to collect inspection data in a quasi-controlled manner.

8 In this environment, code inspections are expected to reduce detection and correction costs for the following reasons. First, the integration test requires considerable effort just to set up the test environment. A particular test run consumes several hours or days. Once a failure is observed this process is stopped and the setup needs to be repeated after the correction. So, if some defects are removed beforehand, the effort for some of these cycles is saved (defect detection effort). Second, once a failure is observed it usually consumes considerable effort to locate or isolate the defect that led to the failure. In a code inspection, a defect can be located immediately. Hence, code inspection helps save correction cost (assuming that the effort for isolating defects is part of the defect correction effort).

Subjects

All subjects in our study were professional software developers of a particular business unit at Bosch Telecom GmbH. Each developer within this unit was to be trained in inspection and, thus, participated in this study. In order to capture their experiences we used a debriefing questionnaire. We captured the subjects' experience in the C-Programming language and in the
application domain on a six-point scale as the most prevalent types of experience that may impact a subject's performance. Fig. 3 shows boxplots of subjects' C-programming and application domain experiences.

Fig. 3. Subjects' experience with the C-programming language and the application domain (boxplots showing min-max, the 25%-75% range and the median, on a scale from very inexperienced to very experienced).

We found that subjects perceived themselves experienced with respect to the programming language (median of 5 on the 6 item scale) and rather experienced regarding software development in the application domain (median of 4 on the six item scale). We can consider our subject pool a representative sample of the population of professional software developers. Because of cost and company constraints many empirical studies on software inspection were performed with students, that is, novices as subjects. Although these studies provide valuable insights, they are limited in the sense that the findings cannot be easily generalised to a broader context. As Curtis points out [22], generalisations of empirical results can only be made if the study population are professional software engineers rather than novices, and it has been forcefully emphasised that the characteristics of professional engineers differ substantially from student subjects [23]. The few existing results in the context of empirical software engineering support this statement since
differences between professional software developers and novices were found to be qualitative as well as quantitative [98]. This means that expert and novice software developers have different problem solving processes, causing experts not only to perform tasks better, but also to perform them in a different manner than novices. Hence, in the absence of a body of studies that find the same results with expert and novice subjects, such as the one presented by Porter and Votta [78], generalisations between the two groups are questionable and more studies with professional software developers as subjects are required.

The 2x2 Factorial Counterbalanced Repeated-Measures Design

Our final experimental design is depicted in Table 2. For the original quasi-experiment, the subjects were split in two groups. The first group (Group 1) performed a reading exercise using PBR first, and then measures were collected (O1). Subsequently they performed a reading exercise using CBR, and again measures were collected (O2). The second group (Group 2) performed the treatments the other way round. The vertical sequence indicates elapsed time during the whole study. Therefore, the first replication was executed with Groups 3 and 4 after Groups 1 and 2, and the second replication was run with Groups 5 and 6 after Groups 3 and 4.

Original Quasi-Experiment
  Group 1: X_PBR / Module 1, O1; X_CBR / Module 2, O2
  Group 2: X_CBR / Module 1, O3; X_PBR / Module 2, O4
First Replication
  Group 3: X_PBR / Module 3, O1; X_CBR / Module 4, O2
  Group 4: X_CBR / Module 3, O3; X_PBR / Module 4, O4
Second Replication
  Group 5: X_PBR / Module 5, O1; X_CBR / Module 6, O2
  Group 6: X_CBR / Module 5, O3; X_PBR / Module 6, O4

Table 2. Design of the quasi-experiment and its two replications.
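The schedule in Table 2 can also be written down programmatically. The sketch below is a simplified representation (it is not part of the original study materials) that encodes the six groups and checks the counterbalancing property discussed in the following paragraphs: within each run, every module is read once with each technique.

    from collections import defaultdict

    # (run, group): sequence of (technique, module) treatments in reading order.
    design = {
        ("quasi-experiment", 1): [("PBR", 1), ("CBR", 2)],
        ("quasi-experiment", 2): [("CBR", 1), ("PBR", 2)],
        ("1st replication", 3):  [("PBR", 3), ("CBR", 4)],
        ("1st replication", 4):  [("CBR", 3), ("PBR", 4)],
        ("2nd replication", 5):  [("PBR", 5), ("CBR", 6)],
        ("2nd replication", 6):  [("CBR", 5), ("PBR", 6)],
    }

    used_with = defaultdict(set)   # (run, module) -> techniques applied to that module
    for (run, _group), sequence in design.items():
        for technique, module in sequence:
            used_with[(run, module)].add(technique)

    # Within a run, each module is read once with each technique, so the reading
    # technique is not completely confounded with the module.
    assert all(techniques == {"PBR", "CBR"} for techniques in used_with.values())
    print("Counterbalanced within each run")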
Below we discuss a number of issues related to this design and its execution.

Different Code Modules Within a Run. Given that each group performs two reading exercises using different reading techniques, it is not possible for a group to read the same code modules twice, otherwise they would remember the defects that they found during the first reading, hence invalidating the results of the second reading. Therefore, a group reads different code modules each time. Within a run, a module was used once with each reading technique so as not to confound the reading technique completely with the module that is used.

Different Code Modules Across Runs. During each of the quasi-experiment and its two replications different pairs of code modules were used. Since there was considerable elapsed time between each run, it was thought that past subjects may communicate with prospective subjects about the modules that were part of the training. By using different code modules, this problem is considerably alleviated.

Confounding. In this design the interaction between the module and the reading technique is confounded with the group main effect. This means that the interaction cannot be estimated separately from the group effect. However, it has been argued in the past that this interaction is not important because all modules come from the same application domain, which is the application domain that the subjects work in [3] [4].

Effect of Intact Groups. As noted earlier, the groups in our study were intact (i.e., they were not formed randomly). The particular situation where there is an inability to assign subjects to groups randomly in a counterbalanced design was discussed by Campbell and Stanley [9]. In this design there is the danger that the group interaction with, say, practice effects confounds the treatment effect. However, if we only interpret a significant treatment effect as meaningful if it is not due to one group, then such a confounding would have to occur on different occasions in all groups in turn, which is a highly unlikely scenario [9]. Furthermore, it is noted that if there are sufficient intact groups and if they are assigned to the sequences at random, then the quasi-experiment would become a true experiment [9], which is also echoed by Spector [90]. This last point would argue for pooling the original quasi-experiment and its replications into one large experiment. However, it is known that pooling data from different studies
can mask even strong effects [89] [99], making it much preferable to combine the results through a meta-analysis [58]. A meta-analysis allows explicit testing of homogeneity of the groups. Homogeneity refers to the question whether the different groups share a common effect size. If homogeneity is ensured, one can combine the results to come up with an overall conclusion. Otherwise, the results must be treated separately from each other. An explicit test for homogeneity is conducted during our analysis to check if effect size estimates exhibit greater variability than would be expected if their corresponding effect size parameters were identical [44].

Replication to Alleviate Low Power. Ideally, researchers should perform a power analysis9 before conducting a study to ensure that their experimental design will find a statistically significant effect if one exists. However, in our case, such an a priori power analysis was difficult because the effect size is unknown. As mentioned earlier, there have been no previous studies that compared CBR with PBR for code documents, and therefore it was not possible to use prior effect sizes as a basis for a power analysis. We therefore defined a medium effect size, that is an effect size of at least 0.5 [17],10 as a reasonable expectation given the claimed potential benefits of PBR. Moreover, we set an alpha level of α = 0.1. Usually, the commonly accepted practice is to set α = 0.05.11 However, controlling both the Type I error (α) and the Type II error (β) requires either rather large effect sizes or rather large sample sizes. This represents a dilemma in a software engineering context, since much treatment effectiveness research in this area involves relatively modest effect sizes and, in general, small sample sizes. As pointed out in [66], if neither effect size nor sample size can be increased to maintain a low risk of error, the only remaining strategy (other than abandoning the research altogether) is to permit a higher risk of error. This explains why we used a more relaxed alpha level for our studies.

9 The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the investigated phenomenon exists [17]. Statistical power analysis exploits the relationship among the following four variables involved in statistical inference:
• Power: the statistical power of a significance test is the probability of rejecting a false null hypothesis.
• Significance level α: the risk of mistakenly rejecting the null hypothesis and thus the risk of committing a Type I error.
• Sample size: the number of subjects/teams participating in the experiment.
• Effect size: the discrepancy between the null hypothesis and the alternative hypothesis. According to Cohen [17], effect size means "the degree to which the phenomenon is present in the population", or "the degree to which the null hypothesis is false". In a sense, the effect size is an indicator of the strength of the relationship under study. It takes the value zero when the null hypothesis is true and some other nonzero value when the null hypothesis is false. In our context, for example, an effect size of 0.5 between the defect detection effectiveness of CBR inspectors and PBR inspectors would indicate a difference of 0.5 standard deviation units.
Ideally, in planning a study, power analysis can be used to select a sample size that will ensure a specified degree of power to detect an effect of a particular size at some specified alpha level.

10 In the allied discipline of MIS, Cohen's guidelines for interpreting effect sizes [17] have also been suggested in the context of meta-analysis [48].

With the anticipation of 10x2 inspector teams and an α = 0.1, t-test power curves [57] [66] for a one-tailed significance test12 indicated that the experimental design has about a 0.3 probability of detecting, at least, a medium effect size. This was deemed to be a small probability of rejecting the null hypotheses if they were false (Cohen [17] recommends a value of 0.8). While this power level was not based on observed effect sizes, it already indicated potential problems in doing a single quasi-experiment without replications. After the performance of the quasi-experiment, we found for several hypotheses that the difference between CBR and PBR was not statistically significant. One potential reason for insignificant findings is low power. Using the obtained effect size from the quasi-experiment, an a posteriori power analysis was performed for all four hypotheses. Table 3 presents the a posteriori power levels.
                     H1      H2      H3      H4
Quasi-Experiment    >0.9    0.49    >0.9    0.69
1st Replication     0.51    0.71    0.84    0.81
2nd Replication     0.84    0.28    0.76    0.49

Table 3. A posteriori power analysis results.
11 It is conventional to use an α level of 0.05. Cowles and Davis [21] trace this convention to the turn of the century, but credit Fisher for adopting this and setting the trend. However, some authors note that the choice of an α level is arbitrary [46] [36], and it is clear that null hypothesis testing at fixed alpha levels is controversial [19] [92]. Moreover, it is noted that typical statistical texts provide critical value tables for α = 0.1 [87] [88], indicating that this choice of α level is appropriate in some instances. As we explain in the text, our choice of α level was driven by power considerations.

12 We used one-sided hypothesis tests since we seek a directional difference between the two reading techniques.
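As a rough illustration of the kind of power calculation discussed above, the following sketch approximates the power of a one-tailed matched-pair t-test via SciPy's noncentral t distribution. It is only meant to show the mechanics; the figures quoted in the text were read off published power curves and may rest on different assumptions, so the numbers need not coincide.

    from scipy import stats

    def paired_ttest_power(effect_size, n_pairs, alpha=0.1):
        """Approximate power of a one-tailed matched-pair t-test.

        effect_size: standardised mean of the paired differences (Cohen's d).
        n_pairs: number of matched pairs (here: inspection teams).
        """
        df = n_pairs - 1
        noncentrality = effect_size * n_pairs ** 0.5
        t_crit = stats.t.ppf(1 - alpha, df)          # one-sided critical value
        return 1 - stats.nct.cdf(t_crit, df, noncentrality)

    # Example call with a medium effect size, 10 teams and alpha = 0.1.
    print(round(paired_ttest_power(0.5, 10, alpha=0.1), 2))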
It is seen that low power was a potentially strong contributor for not finding statistical significance, making it difficult to interpret these results.13 One possibility to tackle the problem of low power is to replicate an empirical study and merge the results of the studies using meta-analysis techniques. Meta-analysis techniques have been primarily designed to combine results from a series of studies, each of which had insufficient statistical power to reliably accept or reject the null hypothesis. This is the approach we have adopted and hence the prevalent reason for performing the two replications.

Replications to Increase Generalisability of Results. Replication of experimental studies provides a basis for confirming the results of the original experiment [24]. However, replications can also be useful for generalising results. A framework that distinguishes between close and differentiated replications has been suggested to explain the benefits of replication in terms of generalising results [79]. Our two replications can be considered as close replications since they were performed by the same investigators, using the same design and reading artefacts, under the same conditions, within the same organisation, and during the same period of time. However, there were also some differences that facilitate generalisation of the results. First, the subjects were different. Therefore, if consistent results are obtained in the replications we can claim that the results hold across subjects at Bosch Telecom GmbH. Second, the modules were varied. Again, if consistent results are obtained then we can claim that they are applicable across different code modules at Bosch Telecom GmbH. By varying these two elements in the replications, one attempts to find out if the same results occur despite these differences [79].14

13 Not finding a statistically significant result using a low power study does not actually tell us very much, because the inability to reject the null hypothesis may be due to the low power. A high power study that does not reject the null hypothesis has more credibility in its findings.

14 As noted in [79], however, this entails a gamble, since if the results do differ in the replications, then one would not know which of the two elements that were changed is the culprit.

Process Conformance. It is plausible that subjects do not perform the CBR and PBR reading techniques but revert to their usual technique that they use in everyday practice. This may occur, for example, if they are faced with familiar documents (i.e., documents from their own application domain
within their own organisation) [4]. This is particularly an issue given that the subjects are professional software engineers who do have everyday practices. As alluded to earlier, with PBR it is possible to check this explicitly by examining the intermediate artefacts that are turned in. We did, and determined that the subjects did perform PBR as defined. For CBR, it will be seen in the post-hoc analysis section of the paper that more subjects found CBR easier to use for defect detection than their current reading technique (ad-hoc); actually more than double. Given such a discrepancy, it is unlikely that the subjects will revert to a technique that is harder to use when offered CBR. Therefore, we expect process conformance when using CBR to also be high.15

15 In [4] it is suggested that subjects could be asked to cross-reference the defects that they find with the specific steps of the reading technique. This would allow more detailed checking of process conformance. However, this would have added to the effort of the subjects, and it is unknown in what ways this additional activity would affect the effort when using CBR and PBR. If such an effect does occur differently for CBR and PBR, then it would contaminate the effort data that we collected.

It has also been suggested that subjects, when faced with time pressure, may revert to using techniques that they are more familiar with rather than make the effort of using a new technique [4]. During the conduct of the quasi-experiment and its replications, there were no time limits. That is, the subjects were given as much time as required for defect detection using each of the reading techniques. Therefore, this would not have affected the application of the reading techniques. Furthermore, when conducting analyses using defect detection cost, time limits would not invalidate conclusions drawn since there was no artificial ceiling on effort.

Feedback Between Treatments. The subjects did not receive feedback about how well they were performing after the first treatment. This alleviates some of the problems that may be introduced due to learning [4]. If subjects do not receive feedback then they are less likely to continue applying the practices they learned during the first treatment. However, we found it very important to provide feedback at the end of each run.

Trainer Effects. For the quasi-experiment and its two replications, the same trainer was used. This trainer had given the same course before to multiple organisations, and therefore the quasi-experiment was not the first time that this course was given by the same person. Consequently, we expect no
differences across the studies due to different trainers, nor due to the trainer improving dramatically in delivering the material.16

An Alternative Design

Researchers in empirical software engineering have, at some point, to decide about the experimental design. Of course, many factors impact this decision. One is the availability of subjects and subject groups. In educational settings, it frequently occurs that treatments are assigned to intact classes, where the unit of analysis is the individual student (or teams) within the classes. This situation is akin to our current study, whereby we have intact groups. In such situations one can explicitly take into account the fact that subjects are nested within groups and employ a hierarchical experimental design. This allows the researcher to pool all the data into one larger experiment [90]. One can proceed by first testing if there are significant differences amongst groups. If there is group equivalence within treatments, then one can analyse the data as if they were not nested. However, if there were group differences, the unit of analysis would have to be the group and the analysis performed on that basis. This creates a difficulty in that reverting to a group unit of analysis would substantially reduce the degrees of freedom in the analysis. Hence, it would be harder to find any statistically significant effects even if such differences existed. Given the potential for this considerable methodological disadvantage, we opted a priori not to pool the results and rather perform a meta-analysis.

Experimental Materials

In each group the subjects inspected two different code modules. Apart from a code module, an inspector received a specification document, which can be regarded as a type of requirements document for the expected functionality and which is considered defect free. All code modules were part of running software systems of Bosch Telecom GmbH and, thus, can be considered almost defect free. Table 4 shows the size of each code module in Lines of Code (without blank lines), their average cyclomatic complexity using McCabe's complexity measure [67], and the number of defects we considered in the analysis.
16 Although, of course, an experienced trainer is still expected to improve as more courses are given; this is inevitable. However, the impact would be minimal.
                 Size (LOC)   Average Cycl. Complexity   Number of defects
Code Module 1       375              6.67                       8
Code Module 2       627             11.00                       9
Code Module 3       666              3.20                      10
Code Module 4       915              4.38                      16
Code Module 5       375              3.29                      11
Code Module 6       627              5.44                      11
Table 4. Characteristics of code modules.

Execution

The quasi-experiment and its two replications were performed between March and July 1998. Each consisted of two sessions and each session lasted 2.5 days and was conducted in the following manner (see Table 5). On the first day of each session, we did an intensive exercise introducing the principles of software inspection. Depending on the reading order, we then explained in detail either CBR or PBR. This explanation covered the theory behind the different reading approaches as well as how to apply them in the context of a code module from Bosch Telecom GmbH. Then, the subjects used the explained reading approach for individual defect detection in a code module. Regarding the PBR technique, the subjects either used the code analyst scenario or the tester scenario but not both. While inspecting a code module, the subjects were asked to log all detected defects on a defect report form. After the reading exercise we asked the subjects to fill out a debriefing questionnaire. The same procedure was used on the second day for the reading approach not applied on the first day. After the reading exercises, we described how to perform inspection meetings. To provide the participants with more insight into an inspection meeting, we randomly assigned two subjects to an inspection team and let them perform two inspection meetings (one for each code module) in which they could discuss the defects found in the two reading exercises. Of course, we ensured that within each of these teams one participant read a code module from the perspective of a code analyst and one read the code module from the perspective of a tester. The inspection team was asked to log all defects upon which both agreed. We then did an initial analysis by checking all
defects that were either reported by individual subjects or by the teams against the known defect list, and presented the results on the third (half) day. We consider the presentation of initial results a very important issue because, first, it gives experimenters the chance to get quick feedback on the results and, second, the participants have the possibility to see and interpret their own data, which motivates further data collection. At the end of each training session, each participant was given a debriefing questionnaire in which we asked the participant about the effectiveness and efficiency of inspections and the subjects' opinion about the training session.
Group x:
  Day 1, morning: Introduction of Inspection Principles
  Day 1, afternoon: PBR Explanation; Defect Detection with PBR (Module A)
  Day 2, morning: CBR Explanation; Defect Detection with CBR (Module B)
  Day 2, afternoon: Team Meetings for Module A and Module B
  Day 3, morning: Feedback on Results

Group y:
  Day 1, morning: Introduction of Inspection Principles
  Day 1, afternoon: CBR Explanation; Defect Detection with CBR (Module A)
  Day 2, morning: PBR Explanation; Defect Detection with PBR (Module B)
  Day 2, afternoon: Team Meetings for Module A and Module B
  Day 3, morning: Feedback on Results

Table 5. Execution of the quasi-experiment and its replications.

Data Analysis Methods
Analysis Strategy

To understand how the different treatments affected individual and team results across the quasi-experiment and its replications, we started the data analysis by calculating some descriptive statistics of the individual and team results. We continued testing the stated hypotheses using a t-test for repeated measures, i.e., a matched-pair t-test [1]. The t-test allowed us to investigate whether a difference in the defect detection effectiveness or the cost per defect ratio is due to chance. Although the t-test seems to be quite
robust against violation of certain assumptions (i.e., the normality and homogeneity of the data), we also performed the Wilcoxon signed ranks test [88], which is the non-parametric counterpart of the matched-pair t-test. The Wilcoxon signed rank test corroborated the findings of the t-test in all cases. Hence, we do not present the detailed results of this test. We ran the statistical tests for the quasi-experiment and the two replications separately. Before starting the analysis, we performed the Shapiro-Wilks' W test of normality, which is the preferred test of normality because of its good power properties [91]. In all cases we could not reject the null hypothesis that our data were normally distributed. Cohen [16] provides ample empirical evidence that a non-parametric test should be substituted for a parametric test only under conditions of extreme assumption violation, and that such violations rarely occur in behavioural or psychological research.

Our experimental design permits the possibility of carry-over effects. Grizzle [41] points out that when there are carry-over effects from treatment 1 to treatment 2, it is impossible to estimate or test the significance of any change in performance from Period 1 to Period 2 over and beyond carry-over effects. In that situation, the only legitimate follow-up test is an assessment of the differences between the effects of the two treatments using Period 1 data only. Hence, it is recommended that investigators first test for carry-over effects. Only if this test is not significant, even at a very relaxed level, is a further analysis of the total data set appropriate. In that case, all the obtained data may properly be used in separate tests of significance for treatments.

One goal of any empirical work is to produce a single reliable conclusion, which is, at first glance, difficult if the results of several studies are divergent or not statistically significant in each case. Close replication, and replication in general, therefore raises questions concerning how to combine the obtained results with each other and with the results of the original study. Meta-analysis techniques can be used to merge study results. Meta-analysis refers to the statistical analysis of a collection of analysis results from individual studies for the purpose of integrating the findings. Although meta-analysis allows the combination and aggregation of scientific results, there has been some criticism about its use [38]. The main critical points include diversity, design, publication bias, and dependence of studies. Diversity refers to the fact that logical conclusions cannot be drawn by comparing and aggregating empirical results that include different measuring techniques, definitions of variables, and subjects, because they are too dissimilar. Design means that results of a meta-analysis cannot be interpreted because results from "poorly" designed studies are included
along with results from "good" designed studies. Publication bias refers to the fact that published research is biased in favour of significant findings, because non-significant findings are rarely published. This in turn leads to biased meta-analysis results. Finally, dependence of the studies means that meta-analyses are conducted on large data sets in which multiple results are derived from the same study. Diversity, design, and publication bias do not play a role in our case. However, we need to discuss the issue of study dependency in more detail. It has been remarked that experiments using common experimental materials exhibit strong inter-correlations regarding the variables involving the materials. Such correlations result in non-independent studies [70]. In general, Rosenthal [82] discusses issues of non-independence in meta-analysis for studies performed by the same laboratory or research group, which is our case since we have two internal replications. He presents an example from research on interpersonal expectancy effects demonstrating that combining all studies across laboratories and combining studies where the unit of analysis is the laboratory results in negligible differences. Hence, in practice, combining studies from the same laboratory or research group is not perceived to be problematic.

Another legitimate question is whether it is appropriate to perform a meta-analysis with a small meta-sample (in our case a meta-sample of three studies). It has been noted that there is nothing that precludes the application of meta-analytic techniques to a small meta-sample [58]: "Meta-analyses can be done with as few as two studies or with as many studies as are located. In general, the procedures are the same [...] having only a few studies should not be of great concern". For instance, Kramer and Rosenthal [58] report on a meta-analysis of only two studies evaluating the efficacy of a vaccination to SIV in monkeys, using data from Cohen [18]. In the realm of software engineering, meta-analyses have also tended to have small meta-samples. For example, Hayes [43] performs a meta-analysis of five experiments evaluating DBR techniques in the context of software inspections. Miller [70] performs a meta-analysis of four experiments comparing defect detection techniques.

Comparing and Combining Results using Meta-Analysis Techniques

Meta-analysis is a set of statistical procedures designed to accumulate empirical results across studies that address the same or a related set of research questions [97]. As pointed out by Rosenthal [82], there are two major ways to merge and subsequently evaluate empirical findings in meta-analysis:
meta-analysis: in terms of their statistical significance (e.g., p-levels) and in terms of their effect sizes (e.g., the difference between means divided by the common standard deviation). One reason for the importance of the effect size is that many statistical tests, such as the t-test, can be broken down mathematically into the following two components [84]:

Significance test = Effect size x Size of study

This relationship reveals that the result of a significance test, such as the t-test, is determined by both the effect size and the size of the study. Many researchers, however, base their decisions only on whether the result of applying a particular test is statistically significant. They often do not take the effect size or the size of the study into consideration. This almost exclusive reliance on the results of null hypothesis significance testing has been heavily criticised in other disciplines, such as psychology and the social sciences [19] [83] [86]. We therefore explicitly take the effect size into consideration during the meta-analysis.

Two major meta-analytic processes can be applied to the set of studies to be evaluated: comparing and combining [83]. When studies are compared as to their significance levels or their effect sizes, we want to know whether they differ significantly among themselves with respect to significance levels or effect sizes, respectively. This is referred to as homogeneity. When studies are combined, we want to know how to estimate the overall level of significance and the average effect size, respectively. In most cases, researchers performing a meta-analysis first compare the studies to determine their homogeneity. This is particularly important in a software engineering context, since empirical studies there are often heterogeneous [7] [69]. Once it is shown that the studies are, in fact, homogeneous, the researchers continue with a combination of results. Otherwise, they look for the reasons that cause the variation.

According to this procedure, we first compared and combined the p-values and, second, compared and combined the effect sizes of the team results. We limited the meta-analysis to the team results since these are of primary interest. The meta-analytic approach is described in more detail in [63].

Results

In this section we present the detailed results. Note that all t-values and effect sizes have been calculated to be consistent with the direction PBR - CBR.
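To make the relationship between significance tests, effect size, and study size stated above concrete, the following Python sketch (not part of the original study; all numbers are invented) shows that, for paired differences, the matched-pair t statistic equals the standardized mean difference of the differences multiplied by the square root of the number of pairs, so the same effect size becomes "more significant" as the study grows.

```python
# Illustrative sketch: for a matched-pair t-test the statistic factors as
#   t = d_z * sqrt(n),  where d_z = mean(diff) / sd(diff).
# The difference values below are invented for illustration only.
import numpy as np

rng = np.random.default_rng(1)
diff_small = rng.normal(loc=0.05, scale=0.10, size=10)   # 10 pairs
diff_large = rng.normal(loc=0.05, scale=0.10, size=100)  # 100 pairs

for diff in (diff_small, diff_large):
    n = len(diff)
    d_z = diff.mean() / diff.std(ddof=1)   # effect size of the differences
    t = d_z * np.sqrt(n)                   # paired t statistic
    print(f"n={n:3d}  effect size={d_z:5.2f}  t={t:6.2f}")
```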
Defect Detection Effectiveness

Fig. 4 shows boxplots of the teams' defect detection effectiveness.

[Fig. 4 appears here: box plots (mean, mean ± SE, mean ± SD) of team defect detection effectiveness by reading technique (CBR, PBR) for the quasi-experiment, the 1st replication, and the 2nd replication.]

Fig. 4. Box-plots of the team defect detection effectiveness.
In the quasi-experiment, one member of one team dropped out of the study. Therefore, we have only nine teams for this analysis.
The boxplots show that the inspection teams detected on average between 58% and 78% of the defects in a code module. These percentages are in line with the ones reported in the literature [37]. Teams using PBR for defect detection had a slightly higher team defect detection effectiveness than the same teams using CBR. In addition to the effectiveness difference, the boxplots illustrate that PBR teams exhibit less variability than CBR teams. The lower variability for PBR may be explained by the fact that the more prescriptive approach for defect detection to some extent removes the effects of human factors on the results. Hence, all the PBR teams achieved similar scores. However, before providing further explanation of these results, we need to check whether the difference between CBR and PBR is due to chance.

We first assessed whether there is a carry-over effect for the team effectiveness. The results indicate no carry-over effect. Therefore we can proceed with the analysis of the data from the two periods. We then investigated whether the difference among teams is due to chance. Table 6 presents the summary of the results of the matched-pair t-test for the team defect detection effectiveness.

                     t-value   df   p-value (one-sided)
Quasi-Experiment       3.09     8   0.007
1st Replication        1.37     9   0.10
2nd Replication        2.39     9   0.02
Table 6. t-Test results of the team defect detection effectiveness.

Taking an alpha level of α = 0.1, we can reject H01 for the quasi-experiment and the 2nd replication. We cannot reject hypothesis H01 for the 1st replication. The findings suggest a treatment effect in two out of three cases, which is not due to chance. At this point we need to decide whether there is an overall treatment effect across the studies. We therefore performed a meta-analysis as described previously to compare and combine the results. The test for homogeneity of the p-values results in p = 0.7. Hence, we cannot reject the null hypothesis that our p-values are homogeneous. This means that we can combine the p-values of the three studies according to Fisher's procedure in the following manner:
P = -2 \sum_{i=1}^{k} \ln p_i = 22.14

Based on the χ²-distribution, this value of P results in a p-value of p = 0.000016. Hence, we can reject the null hypothesis and conclude that the combination of the quasi-experiment and its replications reveals that a team using the PBR technique for defect detection had a significantly higher defect detection effectiveness than a team using CBR.
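For readers who want to experiment with this kind of combination themselves, the sketch below (not part of the chapter) implements the classical Fisher procedure in Python, where the combined statistic is referred to a χ² distribution with 2k degrees of freedom. The p-values used here are invented for illustration and are not the study's values.

```python
# Sketch of Fisher's procedure for combining independent one-sided p-values.
# Classical formulation: P = -2 * sum(ln p_i) is referred to a chi-square
# distribution with 2k degrees of freedom.  The p-values are invented.
import math
from scipy.stats import chi2

def fisher_combine(p_values):
    k = len(p_values)
    P = -2.0 * sum(math.log(p) for p in p_values)
    return P, chi2.sf(P, df=2 * k)          # upper-tail probability of P

P, p_combined = fisher_combine([0.04, 0.20, 0.01])
print(f"P = {P:.2f}, combined p = {p_combined:.4f}")
```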
We continued the analysis by looking at the effect sizes. Table 7 reveals the effect sizes of the three studies.

                     S pooled   Hedges g      d
Quasi-Experiment       0.16       0.79      1.46
1st Replication        0.21       0.42      0.81
2nd Replication        0.14       1.06      0.97

Table 7. Effect sizes of the team defect detection effectiveness.

Table 7 shows that the 1st replication has the lowest effect size, which explains why the result of its test was not statistically significant. To compare and combine the effect sizes, we first checked the effect size homogeneity by calculating Q. The calculated value of Q is Q = 0.91, which leads to a p-value of p = 0.63. Hence, we cannot reject the null hypothesis that the effect sizes are homogeneous. The combination of the effect sizes reveals a mean effect size value of 1.08. This represents a large effect size, i.e., one would really become aware of a difference in the team defect detection effectiveness between the use of PBR and CBR. Based on our findings, we therefore can reject hypothesis H01.

For three of the modules in our study, all defects were detected using PBR. For the other three, all defects except one were detected. No discernible pattern could be identified in the types of defects that were not detected.
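As a rough illustration of how effect sizes can be computed and checked for homogeneity, the Python sketch below uses the common independent-groups formulas: pooled standard deviation, standardized mean difference, inverse-variance weights, and the Q statistic referred to a χ² distribution with k - 1 degrees of freedom. The chapter's own calculations follow the approach described in [63] and account for the matched-pair design, so the details differ; all summary statistics below are invented.

```python
# Hedged sketch of effect-size computation and the Q homogeneity test,
# using standard independent-groups formulas with invented summary data.
import math
from scipy.stats import chi2

def standardized_mean_difference(m1, s1, n1, m2, s2, n2):
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled                                # effect size
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # its variance
    return d, var_d

# invented per-study summaries: (mean_PBR, sd_PBR, n_PBR, mean_CBR, sd_CBR, n_CBR)
studies = [(0.75, 0.10, 9, 0.60, 0.15, 9),
           (0.70, 0.12, 10, 0.62, 0.14, 10),
           (0.78, 0.09, 10, 0.64, 0.13, 10)]

effects = [standardized_mean_difference(*s) for s in studies]
weights = [1.0 / v for _, v in effects]                     # inverse-variance weights
d_bar = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
Q = sum(w * (d - d_bar)**2 for (d, _), w in zip(effects, weights))
p_hom = chi2.sf(Q, df=len(studies) - 1)                     # Q ~ chi-square, k-1 df
print(f"mean effect size = {d_bar:.2f}, Q = {Q:.2f}, p = {p_hom:.2f}")
```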
Cost per Defect for the Defect Detection Phase

Before looking at the team results, we first investigated how much effort each subject spent on defect detection using either of the two techniques and whether there is a difference. Table 8 depicts the average effort in minutes that a subject spent on defect detection using either CBR or PBR, together with the results of the matched-pair t-test.
                     CBR (minutes)   PBR (minutes)   p-value (one-sided)
Quasi-Experiment         115.2           145.6            0.0003
1st Replication          118.8           132.5            0.16
2nd Replication          168.0           150.5            0.13
Table 8. Average effort per subject for the defect detection phase.

We found that in the quasi-experiment and the 1st replication, the subjects consumed less effort for CBR than for PBR. However, only in the first case was the difference statistically significant. This finding seems to indicate that, where there is a significant difference, PBR requires more effort on the inspector's part. The question is whether the extra effort is justified in terms of more detected and documented defects in the team meeting. Fig. 5 depicts boxplots of the cost per defect ratio of both inspectors on an inspection team. Fig. 5 reveals that, on average, an inspection team consumed between 28 and 60.5 minutes per defect. The average cost per defect of PBR teams is consistently lower than the cost per defect of CBR teams. In this case, there is also less variability in the cost per defect ratio of PBR teams. Based on these findings, the extra effort for PBR, if any, seems to be justified because PBR teams have a better cost per defect ratio than CBR teams. A formal test for carry-over effects was conducted, and none were identified.
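The matched-pair t-tests reported throughout this analysis can be reproduced with standard statistical libraries. A minimal Python sketch (not part of the study; the effort values are invented) is:

```python
# Minimal sketch of a matched-pair (paired) t-test with a one-sided p-value.
# The effort values below are invented, not the study's data.
import numpy as np
from scipy import stats

effort_cbr = np.array([110, 120, 105, 130, 118, 112, 125, 109, 116, 121])  # minutes
effort_pbr = np.array([140, 150, 138, 155, 142, 135, 160, 141, 148, 152])  # minutes

t, p_two_sided = stats.ttest_rel(effort_cbr, effort_pbr)
# one-sided p-value for the directional hypothesis "CBR effort < PBR effort"
p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
print(f"t = {t:.2f}, one-sided p = {p_one_sided:.4f}")
```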
                     t-value   df   p-value (one-sided)
Quasi-Experiment      -1.33     8   0.11
1st Replication       -1.93     9   0.04
2nd Replication       -0.75     9   0.23

Table 9. t-Test results of the cost per defect ratio for the defect detection phase.
[Fig. 5 appears here: box plots (mean, mean ± SE, mean ± SD) of the cost per defect ratio (minutes per defect) by reading technique (CBR, PBR) for the quasi-experiment, the 1st replication, and the 2nd replication.]
Fig. 5. Box-plots of the cost per defect ratio for the defect detection phase.

Table 9 shows the t-test results for the cost per defect during the defect detection phase. Taking α = 0.1, we can reject H02 for the 1st replication. We cannot reject hypothesis H02 for the quasi-experiment and the 2nd replication. To compare and combine the results, we first performed the homogeneity check for the p-values. The homogeneity statistic, with 2 degrees of freedom, results in a p-value of p = 0.77. Hence, we cannot reject the hypothesis that our p-values are homogeneous. This finding allowed us to combine the p-values of the three studies. Calculating the combination of the p-values according to Fisher's procedure results in:
P = -2 \sum_{i=1}^{k} \ln p_i = 13.57

Based on the χ²-distribution, this value of P results in a p-value of p = 0.001. Hence, we can reject hypothesis H02 and conclude that the combination of the quasi-experiment and its replications reveals that a team using the PBR technique had a significantly lower cost per defect in the defect detection phase than a team using CBR. We continued the analysis by looking at the effect sizes. Table 10 reveals the effect sizes of the three studies.
                     S pooled   Hedges g       d
Quasi-Experiment      24.03      -0.50      -0.72
1st Replication       22.59      -0.87      -0.78
2nd Replication       12.59      -0.33      -0.30
Table 10. Effect sizes of the cost per defect ratio for the defect detection phase.

Table 10 shows that the quasi-experiment and the 2nd replication have the lowest effect sizes, which explains why the results of these tests were not statistically significant. To compare and combine the effect sizes, we first checked the effect size homogeneity by calculating Q. The calculated value of Q is Q = 0.64, which leads to a p-value of p = 0.73. Hence, we cannot reject the hypothesis that the effect sizes are homogeneous. The combination of the effect sizes gives a mean effect size value of 0.6. Considering our effect size threshold of 0.5, we can conclude that we have, in fact, found an effect of practical significance. We therefore can reject hypothesis H02.

Cost per Defect for the Meeting Phase

We subsequently consider the cost per defect when accounting for the meeting phase. Fig. 6 shows the boxplots for the quasi-experiment and its replications. Fig. 6 reveals that the average cost per defect ratio of PBR was lower than that of CBR when only considering the effort of the meeting phase. Although there seems to be less variability for the 1st replication, there does not seem to be as much difference in the variability for the
quasi-experiment and the 2nd replication. Overall, this result indicates that the meeting cost per defect is higher for CBR than for PBR.

[Fig. 6 appears here: box plots (mean, mean ± SE, mean ± SD) of the cost per defect for the meeting phase by reading technique (CBR, PBR) for the quasi-experiment, the 1st replication, and the 2nd replication.]
Fig. 6. Box-plots of the cost per defect for the meeting phase.

The results of a formal test for carry-over effects did not indicate any carry-over effects.
                     t-value   df   p-value (one-sided)
Quasi-Experiment      -5.20     8   0.0004
1st Replication       -2.05     9   0.035
2nd Replication       -2.55     9   0.016

Table 11. t-Test results of the cost per defect ratio for the meeting phase.
Taking α = 0.1, we can reject H03 for all three studies. The findings suggest a treatment effect in all three cases, which is not due to chance.

We calculated \sum_{i=1}^{k} (z_i - \bar{z})^2 = 1.396. This value, with 2 degrees of freedom, results in a p-value of p = 0.50. Hence, we cannot reject the null hypothesis that our p-values are homogeneous.
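This homogeneity statistic converts each one-sided p-value into a standard normal deviate and measures how much the deviates scatter around their mean. A small Python sketch (not from the chapter; the p-values below are invented) is:

```python
# Sketch of the homogeneity test for one-sided p-values: each p becomes a
# standard normal deviate z, and sum((z - mean(z))^2) is referred to a
# chi-square distribution with k-1 degrees of freedom.
from scipy.stats import norm, chi2

p_values = [0.004, 0.03, 0.02]                 # invented one-sided p-values
z = [norm.isf(p) for p in p_values]            # z_i = Phi^{-1}(1 - p_i)
z_bar = sum(z) / len(z)
stat = sum((zi - z_bar) ** 2 for zi in z)
p_hom = chi2.sf(stat, df=len(p_values) - 1)
print(f"homogeneity statistic = {stat:.3f}, p = {p_hom:.2f}")
```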
Calculating the combination of the p-values according to Fisher's procedure results in:

P = -2 \sum_{i=1}^{k} \ln p_i = 30.59

Based on the χ²-distribution, this value of P results in a p-value of p ...

... is the importance of that criterion (from Figure 5.2). In this case, the lowest composite value would determine the most significant method. Table 5.5 presents these results.
Sample 1 (Research)            Sample 3 (Prof. student)         Sample 2 (Industry)
Simulation             288     Case study              284      Project monitoring     258
Static analysis        292     Legacy data             314      Legacy data            305
Dynamic analysis       298     Field study             315      Theoretical analysis   324
Project monitoring     301     Simulation              333      Literature search      325
Lessons learned        339     Dynamic analysis        355      Case study             326
Legacy data            345     Static analysis         361      Field study            327
Synthetic study        346     Literature search       370      Pilot study            329
Theoretical analysis   348     Replicated experiment   387      Feature benchmark      338
Field study            363     Project monitoring      388      Demonstrator project   345
Literature search      367     Lessons learned         391      Replicated project     361
Replicated experiment  368     Theoretical analysis    405      External               407
Case study             398     Synthetic study         418      Vendor opinion         469
Table 5.5. Composite measures.

Table 5.5 reveals some interesting observations:

1. For the research community, tool-based techniques dominate the rankings. Simulation, static analysis, and dynamic analysis are techniques that are easy to automate and can be handled in the laboratory. On the other hand, techniques that are labor intensive and require interacting with industrial groups, e.g., replicated experiment, case study, field study, and legacy data, are at the bottom of the list. This confirms the anecdotal experience of the authors over the past 25 years: working with industry on real projects certainly is harder to manage than building evaluation tools in the lab.
2. For the industrial community (the professional student population), almost the opposite is true. Techniques that can confirm a technique in the field using industry data (e.g., case study, field study, legacy data) dominate the rankings, while "artificial" environments (e.g., theoretical analysis, synthetic study) are at the bottom. Again, this seems to support the notion that industrial professionals are more concerned with the effectiveness of techniques in live situations than with simply validating a concept.

3. The industrial group evaluating the industrial validation methods cannot be compared directly with the other two groups, since the methods they evaluated were different; however, there are some interesting observations. For one, project monitoring (measurement, i.e., the continual collection of data on development practices) clearly dominates the ranking. The situation has apparently not changed much since a 1984 study conducted by the University of Maryland [Zelkowitz84]. In that earlier survey, the authors found that data was owned by individual project managers and was not available to the company as a whole for building corporate-wide experience bases. This is surprising considering the difficulty the software engineering measurement community has had in getting industry to recognize the need to measure development practices. With models like the Software Engineering Institute's Capability Maturity Model (CMM), the SEI's Personal Software Process (PSP), and Basili's Experience Factory [Basili88] promoting measurement, perhaps the word is finally getting out about the need to measure. But actual practice does not seem to agree with the desires of the professionals in the field. For example, theoretical analysis came out fairly high in this composite score, but that does not seem to relate to experiences in the field and may be wishful thinking.

4. Finally, within the industrial group, the need to be state of the art (the edict classification) came near the bottom of the list (11th out of 12), i.e., it was rated as not important. Basing decisions on vendor opinions was last. Yet image (being state of the art) and vendors often influence the decision-making process. Furthermore, vendor opinion was also judged to be least effective with respect to internal and external validity. Apparently, since vendor opinion was judged to be one of the easiest methods to apply, users rely on it even though they know the results are not to be trusted.
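The composite values in Table 5.5 are weighted sums of per-criterion scores, where the weight reflects the importance of each criterion and lower composites indicate more significant methods. The exact criteria and weights appear earlier in the chapter (Figure 5.2) and are not reproduced here, so the Python sketch below is only a hypothetical illustration of this kind of scoring; every name and number in it is invented.

```python
# Hedged sketch of a composite ranking: each method gets a rank per criterion,
# each criterion has an importance weight, and the weighted ranks are summed.
# The lowest composite value identifies the most significant method.
# Criteria, weights, and ranks below are invented for illustration.
criteria_weights = {"cost": 3, "internal_validity": 5, "external_validity": 4}

ranks = {  # rank of each method per criterion (1 = best)
    "case study":  {"cost": 2, "internal_validity": 3, "external_validity": 1},
    "simulation":  {"cost": 1, "internal_validity": 2, "external_validity": 3},
    "field study": {"cost": 3, "internal_validity": 1, "external_validity": 2},
}

composite = {
    method: sum(criteria_weights[c] * r for c, r in method_ranks.items())
    for method, method_ranks in ranks.items()
}
for method, score in sorted(composite.items(), key=lambda kv: kv[1]):
    print(f"{method:12s} {score}")
```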
6. Discussion

Software engineering has been described as being in its alchemy stage. However, some widely recognized scientific principles are emerging. One of these is the importance of validating software engineering techniques. This chapter identifies the experimental validation techniques in use by two communities: researchers and practitioners.

The study identified 14 research models found to be in use in the research community. These models are driven by the demands on the research community and reflect the biases and reward system of that community. Similarly, there are 15 different models in use by industry. These reflect the ultimate goal of industrial software engineering, which is to solve a problem given a set of constraints. In this case, the problem is the production of a piece of software and the main constraint is funding. It is clear from comparing these techniques that the research community is primarily focused on exploratory methods while industry focuses on confirmatory techniques.

There are many similarities between the two sets of models. These provide a place to begin bridging the gap between the two communities. However, the relationships between the models need to be better understood. Some of these relationships were explored in Section 5. Section 5 also provides an example of how to set up and run an experiment.

This chapter is in essence an example of data mining and a field study of software engineering validation techniques. The results of this experiment facilitate the understanding of the models currently used by the two communities and of the connections between them. This, in turn, facilitates technology transfer, the ultimate goal.

Technology transfer is known to be a difficult process. A 1985 study by Redwine and Riddle showed that a typical software technology took up to 17 years to move from the research laboratory to general practice in industry [Redwine85]. (This time is consistent with other engineering technologies.) In fact, many new technologies do not even last 20 years! But once developed, it often takes up to 5 years for a new technology to be fully integrated into any one organization [Zelkowitz96]. Because of the short lifecycle of many critical technologies, we need to understand the transition process better in order to enable effective methods to be adopted more rapidly.

This chapter is a first step toward understanding the models in current use by the two communities. To formalize the relationships between them and
to better understand the universe of possible techniques, it is useful to have a formalism in which to place the models. One such formalization is the three-faceted approach of Shaw. Each method (an "experiment" in Shaw's terminology) fulfills three facets [Shaw01]:

1. Question - Why was the research done?
2. Strategy - How was it done?
3. Validation - How does one know it worked?

This results in a 3 by 5 matrix (Table 6.1), where a research project is the result of choosing one item from each column.

Question           Strategy            Validation
Feasibility        Qualitative model   Persuasion
Characterization   Technique           Implementation
Method             System              Evaluation
Generalization     Empirical model     Analysis
Selection          Analytic model      Experience
Table 6.1. Shaw's research validation model.
A case study that tries software inspections on an important new development could be classified according to this model as:

1. Question: Feasibility - Do software inspections work?
2. Strategy: Technique - Apply it on a real project.
3. Validation: Experience - See if it has a positive effect.

On the other hand, determining whether inspections or code walkthroughs are more effective could be classified as:

1. Question: Selection - Are inspections or walkthroughs more effective?
2. Strategy: Technique - Apply them on multiple real projects.
3. Validation: Evaluation - Use them and compare the results.

This chapter contains definitions for the strategies and validation mechanisms as shown by the authors' research. Persuasion is most similar to the assertion method in the authors' taxonomy. In fact, this classification can be mapped into Shaw's model as given in Table 6.2.
Method               Strategy              Validation
Project monitoring   Technique             Persuasion
Case study           System                Experience
Field study          Qualitative model     Evaluation
Literature search    Technique             Evaluation
Legacy data          Empirical model       Evaluation
Lessons learned      Qualitative model     Persuasion
Static analysis      System                Evaluation
Replicated           Empirical model       Evaluation
Synthetic            Empirical model       Evaluation
Dynamic analysis     System                Evaluation
Simulation           Technique             Evaluation
Theoretical          Analytic model        Analysis
Assertion            Technique or System   Persuasion or Experience
No validation        —                     Persuasion

Table 6.2. Research validation techniques modeled in Shaw's model.
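As an aside, this mapping is simple enough to encode as data, for instance to support tooling that classifies studies automatically. A minimal Python sketch (only a partial, illustrative transcription of Table 6.2; the representation itself is not from the chapter) is:

```python
# Sketch: a few rows of Table 6.2 represented as plain data.
from typing import NamedTuple

class ShawFacets(NamedTuple):
    strategy: str     # how the research was done
    validation: str   # how one knows it worked

MAPPING = {  # partial transcription of Table 6.2
    "Project monitoring": ShawFacets("Technique", "Persuasion"),
    "Case study":         ShawFacets("System", "Experience"),
    "Field study":        ShawFacets("Qualitative model", "Evaluation"),
    "Legacy data":        ShawFacets("Empirical model", "Evaluation"),
    "Theoretical":        ShawFacets("Analytic model", "Analysis"),
}

print(MAPPING["Case study"].validation)   # -> "Experience"
```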
Given the assumption that some new means of performing a software engineering task has been proposed or has been attempted, the following may serve as generic definitions for the questions:

• Feasibility: the possibility or probability that a software engineering task may be conducted successfully by applying the method or technique X under study to a software engineering problem.
• Characterization: the features that identify X: its purpose, that is, what problem it attempts to solve; its measurable characteristics; its differences from other solutions to the same problem; and the circumstances under which it may be applied.
• Method / Means: the procedures by which X may be accomplished. In addition to the actual conduct of X, these include the potential for automating X and for incorporating it into an existing software engineering paradigm or culture.
• Generalization: the features or process steps that specify how X may be applied generally, beyond the specific research example where it has been used.
• Discrimination: the criteria by which X is to be judged for use.
Identifying the models in current use is a first step in understanding technology transfer in the software field. A formal understanding of the different models used to evaluate software is a second step. While the final step is still far off, the efficient transfer of software technology remains necessary.
References

[Adrion93] Adrion W. R., Research methodology in software engineering, Summary of the Dagstuhl Workshop on Future Directions in Software Engineering, W. Tichy (Ed.), ACM SIGSOFT Software Engineering Notes, 18, 1, (1993).
[Basili02] Basili V., F. McGarry, R. Pajerski and M. Zelkowitz, Lessons learned from 25 years of process improvement: The rise and fall of the NASA Software Engineering Laboratory, IEEE Computer Society and ACM International Conf. on Soft. Eng., Orlando, FL, May 2002.
[Basili95] Basili V., M. Zelkowitz, F. McGarry, J. Page, S. Waligora and R. Pajerski, SEL's software process improvement program, IEEE Software 12, 6 (1995) 83-87.
[Berry00] Berry M. and G. Linoff, Mastering Data Mining, John Wiley & Sons, 2000.
[Brooks87] Brooks F., No silver bullet: Essence and accidents of software engineering, IEEE Computer (1987) 10-19.
[Brown96] Brown A. W. and K. C. Wallnau, A framework for evaluating software technology, IEEE Software (September, 1996) 39-49.
[Campbell63] Campbell D. and J. Stanley, Experimental and Quasi-Experimental Designs for Research, Rand McNally, Chicago (1963).
[Daly97] Daly J., K. El Emam and J. Miller, Multi-method research in software engineering, 1997 IEEE Workshop on Empirical Studies of Software Maintenance (WESS '97), Bari, Italy, October 3, 1997.
[Fenton94] Fenton N., S. L. Pfleeger and R. L. Glass, Science and substance: A challenge to software engineering, IEEE Software, Vol. 11, No. 4, 1994, 86-95.
[Kitchenham96] Kitchenham B. A., Evaluating software engineering methods and tools, ACM SIGSOFT Software Engineering Notes (January, 1996) 11-15.
[Miller99] Miller J., Can software engineering experiments be safely combined?, IEEE Symposium on Software Metrics (METRICS'99), Bethesda, MD (November 1999).
[Redwine85] Redwine S. and W. Riddle, Software technology maturation, 8th IEEE/ACM International Conference on Software Engineering, London, UK (August, 1985) 189-200.
[Scargle99] Scargle J. D., Publication bias (the "file-drawer problem") in scientific inference, Sturrock Symposium, Stanford University, Stanford, CA (March, 1999).
[Shaw01] Shaw M., Keynote presentation, International Conference on Software Engineering, Toronto, Canada, May 2001 (http://www.cs.cmu.edu/~shaw).
[Snow63] Snow C. P., The Two Cultures and the Scientific Revolution, New York: Cambridge University Press, 1963.
[Thorngate76] Thorngate W., "In general" vs. "it depends": Some comments on the Gergen-Schlenker debate, Personality and Social Psychology Bulletin 2, 1976, 404-410.
[Tichy95] Tichy W. F., P. Lukowicz, L. Prechelt and E. A. Heinz, Experimental evaluation in computer science: A quantitative study, J. of Systems and Software, Vol. 28, No. 1, 1995, 9-18.
[Tichy98] Tichy W., Should computer scientists experiment more?, IEEE Computer, 31, 5, 1998, 32-40.
[Zelkowitz84] Zelkowitz M. V., Yeh R. T., Hamlet R. G., Gannon J. D. and Basili V. R., Software engineering practices in the United States and Japan, IEEE Computer 17, 6 (1984) 57-66.
[Zelkowitz96] Zelkowitz M. V., Software engineering technology infusion within NASA, IEEE Trans. on Eng. Mgmt. 43, 3 (August, 1996) 250-261.
[Zelkowitz97] Zelkowitz M. and D. Wallace, Experimental validation in software engineering, Information and Software Technology, Vol. 39, 1997, 735-743.
[Zelkowitz98] Zelkowitz M. and D. Wallace, Experimental models for validating technology, IEEE Computer, 31, 5, 1998, 23-31.