requirements
Editor: Suzanne Robertson ■ The Atlantic Systems Guild ■ [email protected]

Are We Afraid of the Dark?

Welcome to the first IEEE Software column on Requirements. I am the column's editor, Suzanne Robertson. My goal is to make requirements and their importance more widely understood by developers, business people, and management. I am looking for columnists who come from a variety of disciplines (not necessarily software development) and encourage them to communicate what requirements mean to them along with examples from their work. The aim is to make practical ideas about requirements engineering accessible to a wide range of people. I would value your input on proposals for columns and ideas for subjects that you would like to see the column cover—send your suggestions to [email protected].
Requirements are hot. Conferences, books, courses, and now this column devoted to the subject all point to a growing interest in the discipline. Increasingly, my clients tell me that they recognize the importance of requirements and that they are ready to invest effort in making improvements. They plan to formalize their processes, standardize their specifications, clarify requirements for users, and so forth. Mostly, they tell me that they are in the process of buying a tool. Of course these things are important, but it's all too easy to retreat into tools and technical software engineering details at the expense of other less familiar aspects. We are comfortable with tools, we believe that methods and modeling formalisms will help us, and we feel secure with the amount of attention given to these aspects of the requirements discipline. Practically each month a new vendor or author turns up the lights and directs our attention to one of the formal or mechanical (and thus easily automated) aspects of requirements. However, that's not all there is to requirements—there are other aspects of the field still illuminated by little more than candlepower. I'm not seeing the effort put into the hard work of finding the requirements by listening, observing, and stimulating people by acting as a translator, catalyst, imagineer, explorer, and trout tickler. We are still making the mistake of expecting people to know precisely and exactly what they want and of being able to communicate those wants. So let's look at our requirements skills and consider what we are doing well and where we can improve.

Where are the brightest lights?

Over the past 25 years we have developed skills that help us define some types of requirements. Our learning curve has led us through structured, real-time, and object revolutions. Techniques for modeling process, data, and state have focused us on functional requirements (the requirements that focus on what a product must do). I see a lot of requirements specifications, and the functional requirements that are best specified are those that reflect some kind of current reality. In this well-lit place, we can look at the way things are done now (by whatever combination of manual or automated processes we can see), and we can build models to specify a new system's functionality. We ask people what they want and they tell us what is uppermost in their minds, usually influenced by the status quo. However, this approach is limited to finding conscious requirements—things that people know they want. To find the unconscious and undreamed-of requirements, we need to look into some dark corners.
Where are the badly lit places? One place we need to look is in the area of nonfunctional requirements such as look and feel, usability, operational environment, performance, maintainability, security, legality, and culture. These requirements are usually vaguely specified, if they are mentioned at all. Of all these requirements, specifications for performance requirements are the least ambiguous. There’s a good reason for this. To make performance requirements unambiguous, we quantify them using time. We have grown up with time as an everyday scale of measurement. If someone asks for fast, our natural response is how many days, seconds, years, centuries...whatever we mean by fast. The possession of this familiar and well-understood scale of measurement gives us the basis for exploration because it makes us feel safe and confident and helps us ask the right questions.1,2 On the other hand, cultural requirements are found in the darkest places. We are not sure what we are looking for or how we will know it when we find it or even whether we want to look for it in the first place, but our ambitions make cultural requirements increasingly critical. We build products that involve multiple technologies operating in multiple environments, and this inevitably involves different people operating in a multiplicity of what we refer to as cultures. We talk about the culture of organizations, projects, countries, and individuals and the need to build culturally acceptable products, but what do we really mean? When we try to understand a culture, we are trying to understand a particular civilization well enough to help the inhabitants (in our case, the project’s stakeholders) discover their requirements.
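As a small aside, and not from the column itself, here is a hypothetical illustration of the difference a scale of measurement makes: "fast" becomes a testable statement once it is expressed in seconds, in the spirit of a measurable fit criterion. The requirement wording, the 2-second threshold, and the measurements below are all invented for the example.

```python
# Hypothetical example: a performance requirement made unambiguous by
# quantifying it with time, then checked against measured response times.
requirement = {
    "description": "The product shall display a seat map after the user "
                   "selects a flight.",
    "fit_criterion": "95% of seat maps appear within 2.0 seconds of selection",
    "threshold_s": 2.0,
    "required_fraction": 0.95,
}

measured_response_times_s = [1.2, 1.7, 0.9, 2.4, 1.1, 1.8, 1.5, 3.1, 1.0, 1.3]

within = sum(t <= requirement["threshold_s"] for t in measured_response_times_s)
fraction = within / len(measured_response_times_s)
verdict = "met" if fraction >= requirement["required_fraction"] else "not met"
print(f"{fraction:.0%} within {requirement['threshold_s']}s "
      f"(required {requirement['required_fraction']:.0%}): {verdict}")
```

Cultural requirements resist this treatment precisely because no such everyday scale of measurement exists for them, which is the point of the discussion that follows.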
The project sociology

A few months ago, I worked with a group of 20 people including business users, designers, managers, and software engineers from seven countries. The project's aim was to produce a software product that would satisfy all the requirements of all the users, no matter what country they worked in. Someone high in the organization had the sense to get the key stakeholders together before embarking on the project. My main role was to help the different people talk to each other—they (mostly) wanted to, but it was difficult. We spent a lot of time trying to understand the diverse mixture of cultures brought together under the project's umbrella and each other's differences—huge differences in stakeholder intentions, understanding, meaning, and just about every other damn thing. After four days, we came up with some consensus, but it was a different project from the one that they had walked into the room with on Monday. A group of stakeholders is like a
family. In a family, everyone has different opinions and levels of importance, yet, in a healthy family, the members (without losing their individuality) collaborate to contribute to the family’s overall happiness and success. Requirements engineers can learn a lot about how to work with a stakeholder family from the work of family therapists.3 We gave the project a healthy start by devoting some time to understanding the project’s sociology, actively searching for cultural differences, and facilitating collaboration between the project’s stakeholders.4,5 This is difficult; listening to people and helping them listen to each other is hard work. It leads us into all sorts of dark and uncertain places where awareness of other people’s unpredictable behavior is our only clue to what is going on. Finding cultural requirements Does it surprise you that Muslim
women do not shake hands with men? Did you know that when an African-American wears a white rose on Mother’s Day, it means that his or her mother is dead? Why is it not appropriate to send carnations to a French house? What does the color purple mean in Spain? If you do not know the answers and you are about to deploy software outside your own immediate geographical environment, then you might be making some cultural faux pas. The icons you have used, or the words, pictures, or spelling, might have a different meaning from the ones you intended. The problem is that there are so many questions to ask that it is hard to know where to start. My father was a world traveler who was at home in many different cultures. His way of getting to know a culture was to explore the food, music, and language. This provides an insight into people’s everyday lives—things that they are familiar
with and would not think to mention to someone else. So many of our differences are reflected in what we eat, what we listen to, and in the words we use. Try having a meal with key stakeholders and ask questions about favorite foods, singers, or literature. When you get an answer that sounds unfamiliar, out of place, different, or just plain wrong (to your ears), that’s an indication that you might have a cultural difference. Consider whether this difference might affect your project’s success. If the answer is yes or maybe, chances are you have identified a relevant cultural requirement. Carry a flashlight Where else can we look to get the best advances in discovering the right requirements? Well, almost predictably, the best places to look are the darkest. I have discussed the advisability of investigating cultural requirements, but there’s more. Look and feel is another area worth illuminating.
What is the most appropriate appearance that your product should have? Will your product look like all the rest, or is there a special appearance that will distance it from its competitors? This might seem like a trivial exercise, but remember that the world’s best-known brands got that way by paying attention to how their products appear.6 To appreciate the importance of look and feel, go to the Web sites of well-known brands such as Coca-Cola, Federal Express, McDonald’s, and so forth and find their corporate branding pages. The good news is that we don’t need to invent ways of specifying look and feel requirements; we already have experts available. Once you have defined the image that the product is aiming for (the business requirement), then call for help from graphic artists. Graphic artists are specialists in color, type, material— all the things that give a product the desired image. The idea of calling on expert help for specifying details works well for many nonfunctional requirements. Legal, security, usability, environmental, and many other areas have recognized experts. We can ask these people to act as consultant stakeholders, provided we give them a clear specification of the business requirement. How much effort do you spend observing your clients, customers, and users at work? How much time do you spend sitting with them as they do their daily tasks? How much learning have you put in to discovering what kind of people they are or imagining what kinds of products would do them the most good? When you take your people away from their tasks to interview them, you are destroying the very link that you most need—the link between that person and his or her job. It is this mysterious, sometimes seemingly unfathomable feeling that people have for their work that helps you learn what kind of product they need. These are the sorts of requirements that people do not mention— until you implement a product that lacks them.
Another dark area that we need to explore is creative invention.7 People can tell us only so much of what they want; we (the requirements engineers) must invent and inspire the rest. Customers can tell us how they would like the existing product to be improved, but they cannot tell us what they would like us to invent for them. Did customers ask for their telephones to be portable? I think that most of us asked for speed dial, caller identification, and so on, but not for the mobile phone—it was invented for us. How many of the truly successful products—including software products—were invented? How much of successful software is innovation instead of re-implementation? This is an area that deserves significant illumination.
Why don't we look into the dark? Are we afraid that it represents an unknown quantity and is therefore dangerous? If we have little time before our deadline, do we want to spend it looking into areas that our tools and methods don't mention? I guess the answer depends on the kind of product you intend to build. If your product is to be the seventh incarnation of the payroll system that your company first implemented in 1976 and has changed very little in the intervening years, then look no further. If you want your next product to anticipate the future, to be one that its users like, to offer significant advantages to your organization, then "requirements" has to take on a different meaning. It has to mean looking into areas that are vague or unexplored, poking into the unlit corners to discover dark secrets, and inventing what your users never dreamed of.

References
1. T. Gilb and S. Finzi, Principles of Software Engineering Management, Addison-Wesley, Wokingham, England, 1988.
2. J. Robertson and S. Robertson, Mastering the Requirements Process, Addison-Wesley, New York, 1999.
3. V. Satir et al., The Satir Model: Family Therapy and Beyond, Behavior Books, Palo Alto, Calif., 1991.
4. H. Beyer and K. Holtzblatt, Contextual Design: Defining Customer-Centered Systems, Morgan Kaufmann, San Francisco, 1998.
5. J.A. Highsmith III, Adaptive Software Development: A Collaborative Approach to Managing Complex Systems, Dorset House, New York, 2000.
6. D. Ogilvy, Ogilvy on Advertising, Pan Books, London, 1984.
7. R. von Oech, A Whack on the Side of the Head: How You Can Be More Creative, Creative Think, Menlo Park, Calif., 1998.
Suzanne Robertson is a principal and founder of The Atlantic Systems Guild, a London- and New York-based think tank. She specializes in the field of requirements and is coauthor of Volere, a widely used approach to requirements engineering. Contact her at [email protected]; www.systemsguild.com.
quality time
Editor: Jeffrey Voas ■ Cigital ■ [email protected]

Composing Software Component "ilities"
Jeffrey Voas, Cigital
I recently had the opportunity to attend part of the Component-Based Software Engineering (CBSE) Workshop at this year's International Conference on Software Engineering in Toronto. One of the more interesting discussions dealt with the issue of certification. The speakers and participants discussed how to calculate the "ilities" of two software components (hooked in a simple series) for a composition before they're joined and executed. ("ilities" is my term for the nonfunctional properties of software components that define characteristics such as security, reliability, fault tolerance, performance, availability, and safety.) The workshop impressed on me a sense that a fruitful research area for the software engineering community exists here, so I have opted to share some of the interesting issues related to "ility" composability in this column.

A few examples

For the past 10 years, much of the work on CBSE and component-based development (CBD) dealt with functional composability. FC deals with whether F(A) ξ F(B) = F(A ξ B) is true (where ξ is some mathematical operator)—that is, whether a composite system results with the desired functionality, given a system created solely by joining A and B. Increasingly, our community is discovering that even if FC were a solved problem (using formal methods, architectural design approaches, model checking, and so on), the discipline is not mature enough to address other serious concerns that arise in CBSE and CBD. These concerns stem from the problem of composing "ilities." Composing "ilities" is complicated because we can't, for example, know a priori about the security of a system composed of two components, A and B, based on knowledge of the components' security. This is because we base the composite's security on more than just the individual components' security. Numerous reasons for this exist; here, we'll just discuss component performance and calendar time. Suppose that A is an operating system and B is an intrusion detection system. Operating systems have some authentication security built into them, and intrusion detection systems have some definition for the types of event patterns that often warn of an attack. Thus, the composition's security clearly depends on the individual components' security models. But even if A has a weak security policy or flawed implementation, the composite can still be secure. If you
make A’s performance so poor that no one can log on—that is, if the operating system authenticates inefficiently—security is actually increased. And, if the implementation of A’s security mechanism is so unreliable that it disallows all users—even legitimate ones—access, security is again increased. Although these examples are clearly not a desirable way to attain higher system security, both do decrease the likelihood that a system will be successfully attacked. If we reuse our example of A as an operating system and B as an intrusion detection system and this time assume that A provides excellent security and B provides excellent security, we must accept that B’s security is a function of calendar time. This is simply because we are always discovering new threats and ways to “break in.” Even if you could create a scheme such as Security(A) ξ Security(B) = Security(A ξ B), Security(B) is clearly a function of the version of B being composed and what new threats have arisen. Easy “ilities”? Which “ilities,” if any, are easy to compose? Unfortunately, the answer to this question is that no “ilities” are easy to compose, and some are much harder than others. Furthermore, we lack widely accepted algorithms for the composition. I just demonstrated this problem for security, but the same holds true for other areas such as reliability. For reliability, consider a twocomponent system in which component A feeds information in B, and B produces the composite’s output. Assume that both components are reliable. What can we assume about the composite’s reliability? Although it certainly suggests that the composite system will be reliable, the components (which were tested in isolation for their individual reliabilities) can suddenly behave unreliably when connected to other components, particularly if the isolated test distributions did not reflect the distribution of transferred information after composition. Moreover, we could have nonfunctional component behaviors, which we can’t observe and don’t manifest until after composition occurs. Such behaviors can undermine the composition’s reliability. Finally, if one of the components is simply the wrong component— although highly reliable—the resulting system will be useless. Nonfunctional behaviors are particularly
worrisome in commercial off-the-shelf (COTS) software products. Nonfunctional behaviors can include malicious code (such as Trojan horses and logic bombs) and any other undocumented behavior or side effect. Another worrisome problem facing CBSE and CBD is hidden interfaces. Hidden interfaces are typically channels through which application or component software can convince an operating system to execute undesirable tasks or processes. An example would be an application requesting higher permission levels than the application should have. Interestingly, fault injection can partially address this issue by detecting hidden interfaces and nonfunctional behaviors by forcing software systems to reveal those behaviors and events after a COTS component’s input stream receives corrupted inputs. In addition to reliability and security, performance appears—at least on the surface—to be the “ility” with the best possibility for successful composability. This possibility is problematic, however, from a practical sense. Even if you performed a Big O algorithmic analysis on a component, the component’s performance after composition depends heavily on the hardware and other physical resources. What we most likely need, then, is to drag many different hardware variables along with a certificate that makes even minimal, worstcase claims about the component’s performance. Clearly, this can introduce serious difficulties.
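To make the composition problem concrete, here is a small, invented illustration (none of the components or numbers come from the column): two components that each look reliable when measured in isolation, yet compose badly because the profile used to test B in isolation does not match what A actually feeds it after composition.

```python
# Toy illustration of why per-component "ility" figures do not simply compose:
# B's isolated test profile (uniform inputs on [0, 1)) never exercises the
# large values that A actually hands it, so the naive product R(A)*R(B)
# overestimates the composite's reliability. All failure rules are invented.
import random

random.seed(0)
N = 200_000

def comp_a(x):
    """Returns a result, or None to model a failure."""
    return None if x > 0.98 else 2.0 * x          # fails on ~2% of its inputs

def comp_b(y):
    return None if y > 1.9 else y / 2.0           # fails only on large inputs

system_inputs = [random.random() for _ in range(N)]          # uniform on [0, 1)

# Reliability of A on the real input profile.
r_a = sum(comp_a(x) is not None for x in system_inputs) / N

# Reliability of B on the profile its testers assumed: also uniform on [0, 1).
b_test_inputs = [random.random() for _ in range(N)]
r_b_isolated = sum(comp_b(y) is not None for y in b_test_inputs) / N

# Reliability of the actual series composition A -> B.
def composite(x):
    y = comp_a(x)
    return None if y is None else comp_b(y)

r_ab = sum(composite(x) is not None for x in system_inputs) / N

print(f"R(A)                ~ {r_a:.3f}")
print(f"R(B), isolated test ~ {r_b_isolated:.3f}")
print(f"naive product       ~ {r_a * r_b_isolated:.3f}")
print(f"measured composite  ~ {r_ab:.3f}")        # lower than the prediction
```

The same mismatch argument applies to the security and performance examples above: a figure attached to a component certificate is only meaningful relative to the environment and profile in which it was measured.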
For CBSE and CBD to flourish, technologies must exist that let you successfully predict software component interoperability before composition occurs. Without predictability, interoperability cannot be known a priori until after a system is built. At that point, it might be too late in the life cycle to recover financially if you discover that certain components are incompatible. Therefore, predictive technologies that address the "ilities" are truly needed.
focus
guest editor’s introduction
Software Fault Tolerance: Making Software Behave Jeffrey Voas, Cigital
Between the late 1960s and early 1990s, the software engineering community strove to formalize schemes that would lead to perfectly correct software. Although a noble undertaking at first, it soon became apparent that correct software was, in general, unobtainable. And furthermore, the costs, even if achievable, would be overwhelming.
Modern software systems, even if correct, can still exhibit undesirable behaviors as they execute. How? Well, the simplest example would be if a software system were forced to experience an input it should not have. In this situation, the software could 1. handle it gracefully by ignoring it, 2. execute on it and experience no ill effects, or 3. execute on it and experience catastrophic effects. Note that 1 and 2 are desirable, but 3 is not, yet in all cases, the software is still correct. This issue’s focus is dedicated to the research results and ideas from a group of experts who discuss their views on how to create fault-tolerant software—that is, software that is designed to deliberately resist exhibiting undesirable behaviors as a result of a failure of some subsystem. That subsystem could be part of the software, an external software or hardware failure, or even a human operator failure. Generally speaking, fault-tolerant software differs from other software in that it can gracefully handle anomalous internal and external events that could lead to catastrophic system consequences. Because correct software is an oxymoron in most cases and, as I just mentioned, correct software can still hurt you, software fault tolerance is one of the most important areas in software engineering. The first article, “Using Simplicity to Control Complexity,” by Lui Sha begins by discussing the widely held belief that diversity in software constructions entails robustness. The article then questions whether this is really true. It goes on to investigate the relationship between software complexity, reliability, and the resources available for software development. The article also presents a forward recovery approach based on the idea of “using simplicity to control complexity” as a way to improve the robustness of complex software systems. Karama Kanoun’s article analyzes data collected during the seven-year development of a real-life software system. The software under consideration comprised two diverse variants. For each development phase, Kanoun evaluated the cost overhead induced by the second variant’s development with respect to the principal variant’s cost.
The results concluded that the cost overhead varies from 25 to 134 percent according to the development phase. Les Hatton's article, "Exploring the Role of Diagnosis in Software Failure," builds on the premise that software systems have, among engineering systems, the unique characteristic of repetitive failure. His article explores various reasons for this situation, particularly poor diagnosability. Hatton argues that this cause exists largely because of educational problems. Through examples, the article highlights the need for an attitude change toward software failure and for improved diagnostics. Finally, he introduces the concepts of diagnostic distance and diagnostic quality to help with categorization. Michel Raynal and Mukesh Singhal's article deals with ways to overcome agreement problems in distributed systems. The authors focus on practical solutions for a well-known agreement problem—the nonblocking atomic commitment. Finally, William Yurcik and David Doss's article addresses software aging. The article discusses two approaches to this problem:

■ provide a system with the proactive capability of reinitializing to a known reliable state before a failure occurs, or
■ provide a system with the reactive capability of reconfiguring after a failure occurs such that the service provided by the software remains operational.
The authors discuss the complementary nature of these two methods for developing fault-tolerant software and give the reader a good overview of the field in general. So in conclusion, I hope that after you read these articles you will have a better understanding of the underlying principles of software fault tolerance. All systems need defensive mechanisms at some point. These articles, along with the references in the Roundtable (see p. 54), provide information on how to get started.

About the Author

Jeffrey Voas is a cofounder and chief scientist of Cigital. His research interests include composition strategies for COTS software, software product certification and warranties, and software quality measurement. He coauthored Software Fault Injection: Inoculating Programs Against Errors (Wiley, 1998) and is working on Software Certificates and Warranties: Ensuring Quality, Reliability, and Interoperability. He received his BS in computer engineering from Tulane University and his PhD in computer science from the College of William & Mary. He was the program chair for the Eighth IEEE International Conference on Engineering of Computer-Based Systems. He was named the 1999 Young Engineer of the Year by the District of Columbia Council of Engineering and Architectural Societies, was corecipient of the 2000 IEEE Reliability Engineer of the Year award, and received an IEEE Third Millennium Medal and an IEEE Computer Society Meritorious Service award. He is a senior member of the IEEE, a vice president of the IEEE Reliability Society, and an associate editor in chief on IEEE Software. Contact him at [email protected].
focus
fault tolerance
Using Simplicity to Control Complexity Lui Sha, University of Illinois at Urbana-Champaign
Does diversity in construction improve robustness? The author investigates the relationship between complexity, reliability, and development resources, and presents an approach to building a system that can manage upgrades and repair itself when complex software components fail.

According to a US government IT initiative, "As our economy and society become increasingly dependent on information technology, we must be able to design information systems that are more secure, reliable, and dependable."1 There are two basic software reliability approaches. One is fault avoidance, using formal specification and verification methods2 and a rigorous software development process. An example of a high-assurance software development process is the DO 178B standard adopted by the US Federal Aviation Administration. Fault avoidance methods allow computer-controlled safety-critical systems such as flight control, but they can only handle modestly complex software. The trend toward using a large network of systems based on commercial-off-the-shelf components (COTS) also makes applying fault avoidance methods more difficult. Another approach is software fault tolerance through diversity (for example, using the N-version programming method3). Many believe that diversity in software construction results in improved robustness, but is that true? Would the system be more reliable if we devoted all our effort to developing a single version? In this article, I show that dividing resources for diversity can lead to either improved or reduced reliability, depending on the architecture. The key to improving reliability is not the degree of diversity, per se. Rather, it is the existence of a simple and reliable core component that ensures the system's critical functions despite the failure of noncore software components. I call this approach using simplicity to control complexity. I will show how to use the approach systematically in the automatic-control-applications domain, creating systems that can manage upgrades and fix themselves when complex software components fail.

The power of simplicity

Software projects have finite budgets. How can we allocate resources in a way that improves system reliability? Let's develop a simple model to analyze the relationship between reliability, development effort, and software's logical complexity. Computational complexity is modeled as the number of steps to complete the computation. Likewise, we can view logical complexity as the number of
steps to verify correctness. Logical complexity is a function of the number of cases (states) that the verification or testing process must handle. A program can have different logical and computational complexities. For example, compared to quicksort, bubble sort has lower logical complexity but higher computational complexity. Another important distinction is the one between logical complexity and residual logical complexity. For a new module, logical complexity and residual logical complexity are the same. A program could have high logical complexity initially, but if users verified the program before and can reuse it as is, the residual complexity is zero. It is important to point out that we cannot reuse a known reliable component in a different environment, unless the component's assumptions are satisfied. Residual complexity measures the effort needed to ensure the reliability of a system comprising both new and reused software components. I focus on residual logical complexity (just "complexity" for the remainder of the article) because it is a dominant factor in software reliability. From a development perspective, the higher the complexity, the harder to specify, design, develop, and verify. From a management perspective, the higher the complexity, the harder to understand the users' needs and communicate them to developers, find effective tools, get qualified personnel, and keep the development process smooth without many requirement changes. Based on observations of software development, I make three postulates:

■ P1: Complexity breeds bugs. All else being equal, the more complex the software project, the harder it is to make it reliable.
■ P2: All bugs are not equal. Developers spot and correct the obvious errors early during development. The remaining errors are subtler and therefore harder to detect and correct.
■ P3: All budgets are finite. We can only spend a certain amount of effort (budget) on any project.
P1 implies that for a given mission duration t, the software reliability decreases as complexity increases. P2 implies that for a given degree of complexity, the reliability function has a monotonically decreasing improvement
rate with respect to development effort. P3 implies that diversity is not free (diversity necessitates dividing the available effort).

Figure 1. Reliability and complexity. C is the software complexity.

A simple reliability model

The following model satisfies the three postulates. We adopt the commonly used exponential reliability function R(t) = e^(-λt) and assume that the failure rate, λ, is proportional to the software complexity, C, and inversely proportional to the development effort, E. That is, R(t) = e^(-kCt/E). To focus on the interplay between complexity and development effort, we normalize the mission duration t to 1 and let the scaling constant k = 1. As a result, we can rewrite the reliability function with a normalized mission duration in the form R(E, C) = e^(-C/E). Figure 1 plots the reliability function R(E, C) = e^(-C/E) with C = 1 and C = 2, respectively. As Figure 1 shows, the higher the complexity, the more effort needed to achieve a given degree of reliability. R(E, C) also has a monotonically decreasing rate of reliability improvement, demonstrating that the remaining errors are subtler and, therefore, detecting and correcting them requires more effort. Finally, the available budget E should be the same for whatever fault-tolerant method you use. We now have a simple model that lets us analyze the relationship between development effort, complexity, diversity, and reliability. The two well-known software fault tolerance methods that use diversity are N-version programming and recovery block.3–5 I'll use them as examples to illustrate the model's application. For fairness, I'll compare each method under its own ideal condition. That is, I assume faults are independent under N-version programming and acceptance test is perfect under recovery block. However, neither assumption is easy to realize in practice (leading to the forward-recovery approach,6 which I'll discuss later).
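For readers who want to see the numbers behind Figure 1, the model is small enough to evaluate directly. The script below is just the article's own formula, R(E, C) = e^(-C/E); only the sample effort values are arbitrary.

```python
# Evaluate the reliability model R(E, C) = exp(-C/E) for two complexities,
# mirroring the two curves of Figure 1. Effort values are arbitrary samples.
import math

def reliability(effort, complexity):
    """Normalized-mission reliability R(E, C) = e^(-C/E)."""
    return math.exp(-complexity / effort)

for effort in (1, 2, 4, 6, 8, 10):
    r1 = reliability(effort, 1.0)   # complexity C = 1
    r2 = reliability(effort, 2.0)   # complexity C = 2
    print(f"E = {effort:2d}:  R(E, 1) = {r1:.3f}   R(E, 2) = {r2:.3f}")
```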
Figure 2. Effect of divided effort in three-version programming when the failure rate is inversely proportional to effort (black), the square of effort (red), and the square root of effort (blue).

Figure 3. Effect of divided efforts in recovery block.
N-version programming

To focus on the effect of dividing the development effort for diversity, I assume that the nominal complexity is C = 1. First, consider the case of N-version programming with N = 3. The key idea in this method is to have three teams independently design and implement different program versions from the same specification, hoping that any faults resulting from software errors are independent. During runtime, the results from different versions are voted on and the majority of the three outputs is selected (the median is used if outputs are floating-point numbers). In the case N = 3, the reliability function of the three-version programming system is R_3 = R_{E/3}^3 + 3R_{E/3}^2(1 - R_{E/3}). Replacing E with (E/3) in R(E, 1) = e^(-1/E) provides the reliability function of each version, R_{E/3} = e^(-3/E), because the total effort E is divided by three teams. Each team is responsible for a version. The black lines in Figure 2 show that the reliability of single-version programming
with undivided effort is superior to three-version programming over a wide range of development effort. This result counters the belief that diversity results in improved reliability. To check the result's sensitivity, I make two assumptions. First, I make the optimistic assumption that the failure rate is inversely proportional to the square of software engineering effort; that is, R(E, 1) = e^(-1/E^2) (plotted in red in Figure 2). Second, I make the pessimistic assumption that the failure rate is inversely proportional to the square root of software engineering effort, R(E, 1) = e^(-1/E^(1/2)) (plotted in blue in Figure 2). The plots show that a single version's reliability is also superior to three-version programming under the two assumptions over a wide range of efforts. However, single-version programming might not always be superior to its N-version counterpart. Sometimes, we can obtain additional versions inexpensively. For example, if you use a Posix-compliant operating system, you can easily add a new low-cost version from different vendors. This is a reasonable heuristic to improve reliability in non–safety-critical systems. The difficulty with this approach is that there is no method that assures that faults in different versions are independent. Nor is there a reliable method to quantify the impact of potentially correlated faults. This is why FAA DO 178B discourages the use of N-version programming as a primary tool to achieve software reliability.

Recovery block

Now, consider diversity's effect in the context of recovery block, where we construct different alternatives and then subject them to a common acceptance test. When input data arrive, the system checkpoints its state and then executes the primary alternative. If the result passes the acceptance test, the system will use it. Otherwise, the system rolls back to the checkpointed state and tries the other alternatives until either an alternative passes the test or the system raises the exception that it cannot produce a correct output. Under the assumption of a perfect acceptance test, the system works as long as any of the alternatives works. When three alternatives exist, the recovery block system's reliability is R_B = 1 - (1 - R_{E/3})^3, where R_{E/3} = e^(-1/(E/3)). Figure 3 shows the reliability of single-version programming and of the recovery block with a three-way-divided effort. When the available effort is low, single-version programming is better. However, recovery block quickly becomes better after E > 2.6. Recovery block scores better than N-version programming because only one version must be correct under recovery block—easier to achieve than N-version programming's requirement that the majority of the versions be correct. Diversity in the form of recovery block helps, but to what degree? Figure 4 compares system reliability under recovery block when the total effort E is divided evenly into two, three, and 10 alternatives. (An alternative is a different procedure—called when the primary fails the acceptance test.) All the alternatives have the same nominal complexity C = 1. Clearly, dividing the available effort in many ways is counterproductive.

Figure 4. The results of dividing total effort into two (RB2), three (RB3), and 10 (RB10) alternatives in recovery block.

Next, consider the effect of using a reduced-complexity alternative. Figure 5 shows the reduced-complexity-alternative effect in a two-alternative recovery block. In this plot, I divide the total effort E equally into two alternatives. RB2 has two alternatives with no complexity reduction; that is, C1 = 1 and C2 = 1. RB2L2 has two alternatives with C1 = 1 and C2 = 0.5. RB2L10 has two alternatives with C1 = 1 and C2 = 0.1. Clearly, system reliability improves significantly when one alternative is much simpler. (The recovery block approach recommends using a simpler alternative.)

Figure 5. Effect of reducing complexity in a two-alternative recovery block.

To underscore the power of simplicity, let's consider the effect of a good but imperfect acceptance test. Suppose that if the acceptance test fails, the system fails. Figure 6 plots two reliability functions: recovery block RB2 with a perfect acceptance test and two alternatives, where each has complexity C = 1; and recovery block *RB2L5, with an imperfect acceptance test whose reliability equals that of the low-complexity alternative. The two alternatives' complexities are C1 = 1 and C2 = 0.2. As we can see, with a five-fold complexity reduction in one alternative, *RB2L5 is superior to recovery block with a perfect acceptance test but without the complexity reduction in its alternatives.

Figure 6. Effect of reducing complexity in a two-alternative recovery block with an imperfect acceptance test.

You should not be surprised by the observation that the key to improving reliability is having a simple and reliable core component with which we can ensure a software system's critical functions. After all, "Keep it simple" has long been reliability engineering's mantra. What is surprising is that as the role of software systems increases, we have not taken simplicity seriously in software construction.
23
We can exploit the features and performance of complex software even if we cannot verify them, provided we can guarantee the critical requirements with simple software.
A two-alternative recovery block with a reduced-complexity alternative is an excellent approach whenever we can construct high-reliability acceptance tests. Unfortunately, constructing effective acceptance tests that can check each output’s correctness is often difficult. For example, from a single output, determining whether a uniform random-number generator is generating random numbers uniformly is impossible. The distribution is apparent only after many outputs are available. Many phenomena share this characteristic: diagnosing from a single sample is difficult, but a pattern often emerges when a large sample is available. Unfortunately, many computer applications are interactive in nature; they do not let us buffer a long output sequence and analyze the outputs before using them. Fortunately, we can leverage the power of simplicity using forward recovery. Using simplicity to control complexity The wisdom of “Keep it simple” is selfevident. We know that simplicity leads to reliability, so why is keeping systems simple so difficult? One reason involves the pursuit of features and performance. Gaining higher performance and functionality requires that we push the technology envelope and stretch the limits of our understanding. Given the competition on features, functionality, and performance, the production and usage of complex software components (either custom or COTS) are unavoidable in most applications. Useful but unessential features cause most of the complexity. Avoiding complex software components is not practical in most applications. We need an approach that lets us safely exploit the features the applications provide. A conceptual framework From a software engineering perspective, using simplicity to control complexity lets us separate critical requirements from desirable properties. It also lets us leverage the power of formal methods and a high-reliability software development process. For example, in sorting, the critical requirement is to sort items correctly, and the desirable property is to sort them fast. Suppose we can verify the bubble sort program but not the quicksort program. One solution is to use the slower bubble sort
24
IEEE SOFTWARE
July/August 2001
as the watchdog for quicksort. That is, we first sort the data items using quicksort and then pass the sorted items to bubble sort. If quicksort works correctly, bubble sort will output the sorted items in a single pass. Hence, the expected computational complexity is still O(n log(n)). If quicksort sorts the items in an incorrect order, bubble sort will correct the sort and thus guarantee the critical requirement of sorting. Under this arrangement, we not only guarantee sorting correctness but also have higher performance than using bubble sort alone, as long as quicksort works most of the time. The moral of the story is that we can exploit the features and performance of complex software even if we cannot verify them, provided that we can guarantee the critical requirements with simple software. This is not an isolated example. Similar arrangements are possible for many optimization programs that have logically simple greedy algorithms (simple local optimization methods such as the nearest-neighbor method in the traveling salesman problem) with lower performance and logically complex algorithms with higher performance. In the following, I show how to systematically apply the idea of using simplicity to control complexity in the context of automatic control applications. Control applications are ubiquitous. They control home appliances, medical devices, cars, trains, airplanes, factories, and power generators and distribution networks. Many of them have stringent reliability or availability requirements. The forward recovery solution Feedback control is itself a form of forward recovery. The feedback loop continuously corrects errors in the device state. We need feedback because we have neither a perfect mathematical model of the device, nor perfect sensors and actuators. A difference (error) often exists between the actual device state and the set-point (desired state). In the feedback control framework, incorrect control-software outputs translate to actuation errors. So, we must contain the impact resulting from incorrect outputs and keep the system within operational constraints. One way to achieve those goals is to keep the device states in an envelope established by a simple and reliable controller. An example of this idea in practice is the
Boeing 777 flight control system, which uses triple-triple redundancy for hardware reliability.7 At the application-software level, the system uses two controllers. The normal controller is the sophisticated software that engineers developed specifically for the Boeing 777. The secondary controller is based on the Boeing 747’s control laws. The normal controller is more complex and can deliver optimized flight control over a wide range of conditions. On the other hand, Boeing 747’s control laws—simple, reliable, and well understood—have been used for over 25 years. I’ll call the secondary controller a simple component because it has low residual complexity. To exploit the advanced technologies and ensure a high degree of reliability, a Boeing 777, under the normal controller, should fly within its secondary controller’s stability envelope. That is a good example of using forward recovery to guard against potential faults in complex software systems. However, using forward recovery in software systems is an exception rather than the rule. Forward recovery also receives relatively little attention in software fault tolerance literature8 due to the perceived difficulties. For a long time, systematically designing and implementing a forward recovery approach in feedback control has been a problem without a solution. Using the recent advancement in the linear matrix inequality theory,9 I found that we can systematically design and implement forward recovery for automatic control systems if the system is piecewise linearizable (which covers most practical control applications). The Simplex architecture With my research team at the Software Engineering Institute, I developed the Simplex architecture to implement the idea of using simplicity to control complexity. Under the Simplex architecture, the control system is divided into a high-assurance-control subsystem and a high-performance-control subsystem. The high-assurance-control subsystem. In the Simplex architecture’s HAC subsystem, the simple construction lets us leverage the power of formal methods and a rigorous development process. The prototypical example of HAC is Boeing 777’s secondary controller. The HAC subsystem uses the following technologies:
■
■ Application level: well-understood classical controllers designed to maximize the stability envelope. The trade-off is performance for stability and simplicity.
■ System software level: high-assurance OS kernels such as the certifiable Ada runtime developed for the Boeing 777. This is a no-frills OS that avoids complex data-structure and dynamic resource-allocation methods. It trades off usability for reliability.
■ Hardware level: well-established and simple fault-tolerant hardware configurations, such as pair-pair or triplicate modular redundancy.
■ System development and maintenance process: a suitable high-assurance process appropriate for the applications—for example, FAA DO 178B for flight control software.
■ Requirement management: the subsystem limits requirements to critical functions and essential services. Like a nation's constitution, the functions and services should be stable and change very slowly.
The high-performance-control subsystem. An HPC subsystem complements the conservative HAC core. In safety-critical applications, the high-performance subsystem can use more complex, advanced control technology. The same rigorous standard must also apply to the HPC software. The prototypical example of HPC is Boeing 777’s normal controller. Many industrial-control applications, such as semiconductor manufacturing, are not safety critical, but the downtime can be costly. With high-assurance control in place to ensure the process remains operational, we can aggressively pursue advanced control technologies and cost reduction in the high-performance subsystem as follows: ■
■
Application level: advanced control technologies, including those difficult to verify—for example, neural nets. System software level: COTS real-time OS and middleware designed to simplify application development. To facilitate application-software-component upgrades, we can also add dynamic real-time component-replacement capability in the middleware layer, which supports advanced upgrade management—replacing July/August 2001
IEEE SOFTWARE
25
Figure 7. The Simplex architecture. The circle represents the switch that the decision logic controls.
Decision logic High-assurance-control subsystem
Plant
High-performance-control subsystem
■
■
■
software components during runtime without shutting down the OS.10 Hardware level: standard industrial hardware, such as VMEBus-based hardware or industrial personal computers. System development and maintenance process: standard industrial software development processes. Requirement management: the subsystem handles requirements for features and performance here. With the protection that the high-assurance subsystem offers, requirements can change relatively fast to embrace new technologies and support new user needs.
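As promised in the system software bullet above, here is a minimal sketch of what runtime component replacement can look like at the application level. The controller classes, the locking scheme, and the loop are invented for illustration; a real middleware layer would also deal with state transfer, timing guarantees, and rollback.

```python
# A control loop that keeps running while its controller component is swapped.
# The replacement happens atomically between control periods, so the loop
# itself never stops. Purely illustrative; not the Simplex middleware API.
import threading
import time

class Controller:
    def __init__(self, gain):
        self.gain = gain
    def output(self, error):
        return self.gain * error

class ControlLoop:
    def __init__(self, controller):
        self._controller = controller
        self._lock = threading.Lock()
    def replace_controller(self, new_controller):
        # Swap the component without shutting the loop down.
        with self._lock:
            self._controller = new_controller
    def step(self, error):
        with self._lock:
            return self._controller.output(error)

loop = ControlLoop(Controller(gain=0.5))
for period in range(6):
    if period == 3:                              # an upgrade arrives mid-run
        loop.replace_controller(Controller(gain=0.8))
    print(f"period {period}: u = {loop.step(error=1.0):.2f}")
    time.sleep(0.01)
```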
Figure 7 diagrams the Simplex architecture, which supports using simplicity to control complexity. The high-assurance and high-performance systems run in parallel, but the software stays separate. The HPC can use the HAC's outputs, but not vice versa. Normally, the complex software controls the plant. The decision logic ensures that the plant's state under the high-performance controller stays within an HAC-established stability envelope. Otherwise, the HAC takes control. Certain real-time control applications such as manufacturing systems are not safety critical, but they still need a high degree of availability, because downtime is very expensive. In this type of application, the main concern is application-software upgradability and availability. For such non–safety-critical applications, we can run Simplex architecture middleware on top of standard industrial hardware and real-time OSs. A number of applications have used this technique, including those performed in a semiconductor-wafer-making facility.11 For educational purposes, I had my group at the University of Illinois at Urbana-Champaign develop a Web-based control lab—the Telelab (www-drii.cs.uiuc.edu/download.html)—which uses a physical inverted pendulum that your software can control to explore
this article's principles. Once you submit your software through the Web, Telelab dynamically replaces the existing control software with your software and uses it to control the inverted pendulum without stopping the normal control. Through streaming video, you can watch how well your software improves the control. You can also test this approach's reliability by embedding arbitrary application-level bugs in your software. In this case, Telelab will detect the deterioration of control performance, switch off your software, take back control, and keep the pendulum from falling down. Also, it will restore the control software in use prior to yours. Telelab demonstrates the feasibility of building systems that manage upgrades and self-repair.

Forward recovery using high-assurance controller and recovery region

In plant (or vehicle) operation, a set of state constraints, called operation constraints, represents the devices' physical limitations and the safety, environmental, and other operational requirements. We can represent the operation constraints as a normalized polytope (an n-dimensional figure whose faces are hyperplanes) in the system's n-dimensional state space. Figure 8 shows a two-dimensional example. Each line on the boundary represents a constraint. For example, the engine rotation must be no greater than k rpm. The states inside the polytope are called admissible states, because they obey the operational constraints. To limit the loss that a faulty controller can cause, we must ensure that the system states are always admissible. That means

1. we must be able to remove control from a faulty control subsystem and give it to the HAC subsystem before the system state becomes inadmissible,
2. the HAC subsystem can control the system after the switch, and
3. the system state's future trajectory after the switch will stay within the set of admissible states and converge to the setpoint.

We cannot use the polytope's boundary as the switching rule, just as we cannot stop a car without collision when it's about to touch a wall. Physical systems have inertia. A subset of the admissible states that satisfies the three conditions is called a recovery region. A Lyapunov function inside the state constraint polytope represents the recovery region (that is, the recovery region is a stability region inside the state constraint polytope). Geometrically, a Lyapunov function defines an n-dimensional ellipsoid in the n-dimensional system state space, as Figure 8 illustrates.

Figure 8. State constraints and the switching rule (Lyapunov function).

An important property of a Lyapunov function is that, if the system state is in the ellipsoid associated with a controller, it will stay there and converge to the equilibrium position (setpoint) under this controller. So, we can use the boundary of the ellipsoid associated with the high-assurance controller as the switching rule. A Lyapunov function is not unique for a given system–controller combination. To not unduly restrict the state space that high-performance controllers can use, we must find the largest ellipsoid in the polytope that represents the operational constraints. Mathematically, we can use the linear matrix inequality method to find the largest ellipsoid in a polytope.9 Thus, we can use Lyapunov theory and LMI tools to solve the recovery region problem (to find the largest ellipsoid, we downloaded the package that Steven Boyd's group at Stanford developed). For example, consider a dynamic system dX/dt = AX + BKX, where X is the system state, A is the system matrix, and K represents a controller. We can first choose K by using well-understood robust controller designs; that is, the system stability should be insensitive to model uncertainty. The system under this reliable controller is dX/dt = ĀX, where Ā = (A + BK). Additionally, Ā^T Q + QĀ < 0 represents the stability condition, where Q is the Lyapunov function. A normalized polytope represents the operational constraints. We can find the largest ellipsoid in the polytope by minimizing log det Q^(-1) (det stands for determinant),9 subject to the stability condition. The resulting Q defines the largest normalized ellipsoid X^T QX = 1, the recovery region, in the polytope (see Figure 8). In practice, we use a smaller ellipsoid—for example, X^T QX = 0.7, inside X^T QX = 1. The shortest distance between X^T QX = 1 and X^T QX = 0.7 is the margin reserved to guard against model errors, actuator errors, and measurement errors.
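The construction just described can be prototyped in a few lines. The sketch below is only an approximation of the article's method: it uses SciPy's Lyapunov solver rather than an LMI package, and instead of maximizing the ellipsoid volume it simply scales a Lyapunov ellipsoid until it fits inside a box of operational constraints, so it finds an invariant ellipsoid but not necessarily the largest one. The plant, the gain K, and the constraint bounds are invented for illustration.

```python
# Sketch: build a recovery region for a high-assurance controller.
# 1) Close the loop with a well-understood gain K.
# 2) Solve Abar^T Q + Q Abar = -I (a positive-definite Q certifies stability).
# 3) Scale Q so the ellipsoid {x : x^T Q x <= 1} fits inside the constraint box.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])             # double-integrator plant, x = (pos, vel)
B = np.array([[0.0],
              [1.0]])
K = np.array([[-1.0, -1.6]])           # conservative high-assurance gain
Abar = A + B @ K                       # closed loop under the HAC

Q = solve_continuous_lyapunov(Abar.T, -np.eye(2))   # Abar^T Q + Q Abar = -I

# Operational constraints as a box: |pos| <= 1.0, |vel| <= 2.0.
bounds = np.array([1.0, 2.0])

# Over {x : x^T Q x <= 1}, the extreme of |x_i| is sqrt((Q^-1)_ii), so scale Q
# until those extremes respect the bounds.
scale = max(np.diag(np.linalg.inv(Q)) / bounds**2)
Q_region = scale * Q                   # recovery region: x^T Q_region x <= 1

print("Q_region =\n", np.round(Q_region, 3))
print("Lyapunov condition eigenvalues:",
      np.round(np.linalg.eigvals(Abar.T @ Q + Q @ Abar), 6))   # all negative
print("extremes of |pos|, |vel| on the boundary:",
      np.round(np.sqrt(np.diag(np.linalg.inv(Q_region))), 3))  # within bounds
```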
During runtime, the HPC subsystem normally controls the plant. The decision logic checks the plant state X every sampling period. If X is inside the n-dimensional ellipsoid X^T QX = c, 0 < c < 1, the decision logic considers the system under the high-performance controller admissible. Otherwise, the HAC subsystem takes over, which ensures that plant operation never violates the operational constraints. The software that implements the decision rule "if (X^T QX > c), switch to high-assurance controller" is simple and easy to verify. Once we ensure that the system states will remain admissible, we can safely conduct statistical performance evaluations of the HPC subsystem in the plant. If the new "high-performance" controller delivers poor performance, we can replace it online. I would point out that the high-assurance subsystem also protects the plant against latent faults in the high-performance control software that tests and evaluations fail to catch.

Application notes

The development of the high-assurance controller and its recovery region satisfies forward recovery's basic requirement: the impact caused by incorrect actions must be tolerable and recoverable. In certain applications such as chemical-process control, we typically do not have a precise plant model. In such applications, we might have to codify the recovery region experimentally. When a controller generates faulty output, the plant states will move away from the setpoint. It is important to choose a sufficiently fast sampling rate so that we can detect errors earlier. Will the simple controller unreasonably restrict the state space that the high-performance controller can use? This turns out to be a nonproblem in most applications. The controllers' design involves a trade-off between agility (control performance) and stability. Because the high-performance controller often focuses on agility, its stability envelope is naturally smaller than the stability envelope of the safety controller that sacrifices performance for stability.
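Returning to the decision rule quoted above, the runtime side of the architecture is equally small. In this sketch the plant, the deliberately faulty high-performance controller, the gain, the matrix Q (the scaled Lyapunov matrix from the previous sketch, rounded), and the threshold c are all invented for illustration; the point is only the shape of the switching logic.

```python
# Simplex-style switching: run the high-performance controller (HPC) until the
# state leaves the region x^T Q x <= c, then hand control to the verified
# high-assurance controller (HAC). Forward-Euler simulation of a toy plant.
import numpy as np

dt = 0.01
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])                 # double integrator: x = (pos, vel)
B = np.array([[0.0],
              [1.0]])
K_hac = np.array([[-1.0, -1.6]])           # simple, well-understood HAC gain
Q = np.array([[1.39, 0.49],
              [0.49, 0.61]])               # recovery region: x^T Q x <= 1
c = 0.7                                    # switch inside the region, with margin

def u_hpc(x):
    # Stand-in for a complex controller with a latent bug: positive feedback.
    return np.array([[1.0, 0.0]]) @ x

def u_hac(x):
    return K_hac @ x

x = np.array([[0.2], [0.0]])
active = "HPC"
for k in range(1000):                      # 10 seconds of simulated time
    V = (x.T @ Q @ x).item()
    if active == "HPC" and V > c:          # the decision rule
        active = "HAC"
        print(f"t = {k * dt:.2f} s: x^T Q x = {V:.2f} > {c}, HAC takes over")
    u = u_hpc(x) if active == "HPC" else u_hac(x)
    x = x + dt * (A @ x + B @ u)           # forward-Euler integration step
print(f"t = 10.00 s: state = {np.round(x.ravel(), 4)}, controller = {active}")
```

Because the switch fires while the state is still well inside the operational constraints, the faulty controller is shut out before any constraint is violated, which is the forward-recovery behavior the Telelab demonstration exhibits with the inverted pendulum.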
Figure 9. Using imperfect high-performance control. The high-assurance-control subsystem takes over after a failure at time t2. When the system is stable again, the restarted high-performance-control subsystem resumes control, at time t3.
About the Author Lui Sha is a professor of computer science at the University of Illinois at Urbana-Champaign. His research interests include QoS-driven resource management and dynamic and reliable software architectures. He obtained his PhD from Carnegie Mellon University. He is a fellow of the IEEE, awarded “for technical leadership and research contributions that transformed real-time computing practice from an ad hoc process to an engineering process based on analytic methods.” He is an associate editor of IEEE Transactions on Parallel and Distributed Systems and the Real-Time Systems Journal. Contact him at 1304 W. Springfield DCL, Urbana, IL 61801;
[email protected]; www.cs.uiuc.edu/contacts/faculty/sha.html.
With the HAC subsystem in place, we can exploit the less-than-perfect HPC subsystem using COTS components. Reasonably good-quality software fails only occasionally when encountering unusual conditions. When the software restarts under a different condition, it will work again. When the HPC subsystem fails under unusual conditions, the HAC subsystem steps in until the condition becomes normal again, at which time we can resume using the HPC subsystem. Figure 9 illustrates the control selection from the two control subsystems. The vertical axis represents control performance levels and the horizontal axis represents time. At time t0, the system starts using the HAC subsystem. At time t1, the operator switches the system to the new HPC subsystem. Unfortunately, something triggers an error in the HPC subsystem, so the system automatically switches to the HAC subsystem at time t2. As the HAC subsystem stabilizes the system, the system control goes back to the restarted HPC subsystem at time t3. Thanks to the HAC subsystem, we can test new HPC software safely and reliably online in applications such as process-control upgrades in factories.
Forward recovery using feedback is not limited to automatic-control applications. Ethernet, for example, rests on the idea that correcting occasional packet collisions as they occur is easier than completely preventing them. The same goes for the transmission-control protocol: correcting occasional congestion is easier than completely avoiding it. Forward recovery is also the primary tool for achieving robustness in human organizations. Democracy's endurance does not rely on infallible leaders; rather, the system provides a mechanism for removing undesirable ones.
Given the success of forward recovery with feedback in so many engineering disciplines and human organizations, I believe we can apply it to other types of software applications. The notion of using simplicity to control complexity ensures the critical properties. It provides us with a "safety net" that lets us safely exploit the features that complex software components offer. That, in turn, lets us build systems that can manage upgrades and fix themselves when complex software components fail.
Acknowledgments The US Office of Naval Research is the major sponsor of this work. The Lockheed Martin Corporation and the Electric Power Research Institute also sponsored part of this work. Many people contributed. In particular, I thank Danbing Seto for his contributions to control-theoretic development, Bruce Krogh for the semiconductor wafer manufacture experiment, and Michael Gagliardi for development of the experimental demonstration systems at the Software Engineering Institute. I also thank Kris Wehner, Janek Schwarz, Xue Liu, Joao Lopes, and Xiaoyan He for their contributions to the Telelab demonstration system. I thank Alexander Romanovsky for his comments on an earlier draft of this article. Finally, I thank Gil Alexander Shif, whose editing greatly improved this article’s readability.
References
1. "Information Technology for the 21st Century: A Bold Investment in America's Future," www.ccic.gov/it2/initiative.pdf (current 14 April 2001).
2. E.M. Clarke and J.M. Wing, "Formal Methods, State of the Art, and Future Directions," ACM Computing Surveys, vol. 28, no. 4, Dec. 1996, pp. 626–643.
3. A. Avizienis, "The Methodology of N-Version Programming," Software Fault Tolerance, M.R. Lyu, ed., John Wiley & Sons, New York, 1995.
4. B. Randell and J. Xu, "The Evolution of the Recovery Block Concept," Software Fault Tolerance, M.R. Lyu, ed., John Wiley & Sons, New York, 1995.
5. S. Brilliant, J.C. Knight, and N.G. Leveson, "Analysis of Faults in an N-Version Programming Software Experiment," IEEE Trans. Software Eng., Feb. 1990.
6. N.G. Leveson, "Software Fault Tolerance: The Case for Forward Recovery," Proc. AIAA Conf. Computers in Aerospace, AIAA, Hartford, Conn., 1983.
7. Y.C. Yeh, "Dependability of the 777 Primary Flight Control System," Proc. Dependable Computing for Critical Applications, IEEE CS Press, Los Alamitos, Calif., 1995.
8. Software Fault Tolerance (Trends in Software, No. 3), M.R. Lyu, ed., John Wiley & Sons, New York, 1995.
9. S. Boyd et al., Linear Matrix Inequalities in System and Control Theory, SIAM Studies in Applied Mathematics, Philadelphia, 1994.
10. L. Sha, "Dependable System Upgrade," Proc. IEEE Real-Time Systems Symp. (RTSS 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 440–449.
11. D. Seto et al., "Dynamic Control System Upgrade Using Simplex Architecture," IEEE Control Systems, vol. 18, no. 4, Aug. 1998, pp. 72–80.
focus
fault tolerance
Real-World Design Diversity: A Case Study on Cost Karama Kanoun, LAAS, Centre National de la Recherche Scientifique, France
Some say design diversity is a waste of resources that might be better spent on the original design. The author describes a real-world study that analyzed work hours spent on variant design. The results show that in fact, costs did not double by developing a second variant.

Researchers have long shown that design diversity improves system reliability based both on controlled experiments1–4 and modeling.5,6 Design diversity consists of checking software's dynamic behavior by executing two or more units (called variants) that were designed in separate processes but deliver the same services. Developers can use design diversity to detect errors—by comparing the
variant results—as well as to support fault tolerance in the N-version programming approach7 and N-self-checking programming approach.8,9 When developers use back-to-back testing,10 diversity helps develop software variants. It also helps detect certain errors that other methods are unlikely to find.4 Moreover, studies show that design diversity does not necessarily double or triple development costs as many people might assume. Despite these benefits, design diversity is often viewed negatively because of the extra effort needed to develop and validate additional variants. Indeed, the typical question is: Is it better to devote extra effort to developing a variant or to verifying and validating the original unit? A 1997 IEEE Software article11 presents interesting arguments in favor of diversity. Here, I offer results from a real-world study that validates the design diversity approach. At LAAS-CNRS, we conducted a study
analyzing data on work hours spent developing two variants over a seven-year period. The study's objective was to evaluate the cost overhead of developing the second variant as compared to the principal variant, which we use as the reference. Our results show that this overhead varies from 25 to 134 percent, depending on the development phase. These results confirm those published in previous work; their value rests in the fact that we carried out the evaluation in a real-world industrial development environment. To our knowledge, it is the only such evaluation performed on a real-life system.

Software development characteristics
The software in our study was composed of two variants. The system compares results from executing the variants and, when the variants agree, it transmits the outputs of the principal variant (PAL). When they disagree, it reports an error.
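A minimal sketch of that compare-and-forward arrangement, with hypothetical names (the real system ran the variants and the comparison on dedicated hardware rather than in a single process):

def run_self_checking(pal, sel, inputs):
    # Execute both variants on the same inputs; forward PAL's output only
    # when the variants agree, otherwise signal the disagreement.
    pal_out = pal(inputs)
    sel_out = sel(inputs)
    if pal_out != sel_out:
        raise RuntimeError(f"variant disagreement: PAL={pal_out!r} SEL={sel_out!r}")
    return pal_out

# Example with two trivially diverse implementations of the same service.
print(run_self_checking(pal=lambda xs: sum(xs), sel=lambda xs: sum(sorted(xs)), inputs=[3, 1, 2]))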
An earlier version of this article was published as "Cost of Software Design Diversity—An Empirical Evaluation" in the 1999 Proceedings of the 10th IEEE Int'l Conf. on Software Reliability.
Figure 1. The software development process was incremental and involved the specification, software, and system departments.
In this way, the secondary variant (SEL) is used for self-checking. The software development process was incremental and involved three departments: specification, software, and system. The Specification Department decomposed the functional specifications into elementary functions specified in a nonambiguous graphical language (an in-house formal language) in specification sheets. These sheets were simulated automatically to check for data exchange consistency, variable types, and so on. Because a small part of the functionality could not be specified in the formal language, the department provided it in natural language in informal specifications.

The Specification Department delivered 193 specification sheets to the software development teams. Of these, 113 were common to PAL and SEL, 48 were specific to PAL, and 32 were specific to SEL. The development teams thus had to work on 161 specification sheets for PAL and 145 specification sheets for SEL to derive the source and executable codes. The PAL coding language was assembly. To make the variants as dissimilar as possible, the development team wrote half of SEL in assembly and the other half in a high-level language.

The system compares variant results for error detection at 14 checkpoints. Eight of these checkpoints are common to PAL and SEL. The Specification Department specified each common checkpoint only once, though the development teams developed them separately within each variant. The remaining six checkpoints are specific to the variants:
four were specific to PAL and two to SEL. Starting from the functional specifications, the Software Department handled specification analysis, high-level and detailed design, coding (including unit test), and integration and general tests. The resulting software had about 40,000 noncommented lines of code for PAL and 30,000 noncommented lines of code for SEL. The System Department tested the system using functional tests, complete-simulator tests, and field tests. As Figure 1 shows, the teams conducted functional-specifications and system tests on PAL and SEL simultaneously.

Within the Software Department, two teams worked separately on PAL and SEL. The Specification Department provided each team with the common functional specifications, its own specifications in the form of specification sheets, and informal specifications. Although the teams did not otherwise communicate, they did participate in Specification Department meetings, so they had the same information related to specifications and computer hardware. Within each Software Department team, any person could code and test. The only rule was that team members could not test a part they coded. Tests for the common specification sheets were specified and designed only once, but each team performed them separately on PAL and SEL.

Functions that were specified completely in the specification sheets were coded using automatic code-generation tools. The remaining ones required complete design and coding activities. High-level design consisted of structuring the software; detailed design decomposed the functions into abstract machines. The teams coded the abstract machines automatically or manually, depending on whether or not the function was entirely specified in the specification sheets. The teams defined unit tests classically, based on the specification sheets where possible. For integration, general tests, and system tests, developers took advantage of the two variants (using back-to-back testing, for example). However, their test strategy went beyond such tests, and they performed particular tests for each variant.

Database information
We recorded information about work hours for people in the specification and system departments over a seven-year period:

■ The Specification Department devoted 10,000 hours to functional specifications—2,000 hours dedicated specifically to SEL and 8,000 hours devoted to PAL and the common specifications.
■ The System Department spent 10,200 hours on system testing. The target system was tested as a black box; they did not distinguish between PAL and SEL working hours.
Table 1. Working Hours Recorded in Database 2 with Respect to Total Working Hours

Phase                     Percent
Specification analysis      18.3
High-level design            4.1
Detailed design             16.4
Coding                      18.2
Integration                  9.8
General tests               46.3
Documentation               24.8
Maintenance                100.0
Analysis                   100.0
Overall                     34.2
Over the same period, the Software Department devoted 24,375 hours to PAL and SEL and recorded information in two databases. Database 1 covers the first four years of the study, and Database 2 covers the last three years. Both databases give information about each of the development phases (specification analysis, high-level design, detailed design, coding and unit testing, and integration and general tests). They also list time spent in documentation. Although Database 1 discriminates between PAL and SEL working hours, Database 2 makes no such distinction. Database 2 also introduces two new headings—maintenance and analysis. The fifth year was a transition period; all work performed during that year was attributed to maintenance. Table 1 shows the percentage of working hours recorded in Database 2 relative to the total work hours for the seven-year period. About two-thirds of the Software Department's working hours over the time period were recorded in Database 1. Figure 2 details the number of work hours per year, as derived from Database 1 and Database 2, without distinguishing between PAL and SEL. Figure 3 shows the work hours devoted to PAL and SEL from Database 1 for the first four years. In the early years, SEL needed almost as much effort as PAL.

Data analysis
Because different departments provided different information and we were dealing with two separate databases, I'll first present the data from Database 1 alone and then consider the whole data set.

Analysis: Database 1
Figures 4 and 5 show the percentage of time dedicated to PAL and SEL in various phases. For SEL, more than half the time was allotted to coding and integration.
Figure 2. Yearly working hours from both databases over the seven-year period.
Figure 3. Working hours spent on principal variant (PAL) and secondary variant (SEL) during the first four years.
This might be because SEL is written in two languages, making such activities more time consuming. Table 2 shows the work hours dedicated to PAL and SEL by phase.
Table 2. Percent of Team Working Hours on PAL and SEL per Phase during the First Four Years

Phase                     PAL (percent)   SEL (percent)
Specification analysis         65.0            35.0
High-level design              64.5            35.5
Detailed design                65.2            34.8
Coding                         41.1            58.9
Integration                    49.0            51.0
General tests                  50.6            49.4
Documentation                  67.4            32.6
Overall                        54.3            45.7
Table 3. Percent of SEL Overhead Costs Compared to PAL for the First Four Years

SEL phase                 Percent of cost compared to PAL
Specification analysis         54
High-level design              55
Detailed design                53
Coding                        143
Integration                   104
General tests                  98
Documentation                  48
Overall                        84

Table 4. Percent of SEL Overhead Costs Compared to PAL for the Seven-Year Period

SEL phase                 Percent of cost compared to PAL
Functional specifications      25
Specification analysis         61
High-level design              56
Detailed design                59
Coding                        134
Integration                   104
General tests                  99
Documentation                  59
Maintenance                   100
Analysis                      100
Overall                        64
As the table shows, the teams devoted only 35 percent of specification analysis to SEL, but almost 60 percent of the coding effort. Because the teams performed integration and general tests globally, the effort was nearly the same for both variants. Overall, the teams devoted 54.3 percent and 45.7 percent of the time to PAL and SEL, respectively. Table 3 shows the overall costs of developing SEL in relation to PAL based on data in Database 1 for the first four years.
Figure 4. Time spent on principal variant by phase: specification analysis 16%, high-level design 9%, detailed design 18%, coding 15%, integration 23%, general tests 12%, documentation 7%.
Figure 5. Time spent on secondary variant by phase: specification analysis 10%, high-level design 6%, detailed design 11%, coding 26%, integration 29%, general tests 14%, documentation 4%.
Note that we are comparing development costs on SEL (which had 10 checkpoints) to PAL (which had 12 checkpoints). Although the checkpoints certainly added extra costs, they also helped to speed validation. Unfortunately, we can't estimate checkpoint-related costs from the recorded data set. However, if we consider the information that is in Databases 1 and 2, we can see the influence in terms of time. Most of the work related to specification analysis, design, and coding was recorded in Database 1, whereas Database 2 focused on work related to validation (see Table 1). As a result, even if our cost overheads here do not correspond precisely to the extra cost induced by design diversity, they give a good order of magnitude of this overhead.

Analysis: Whole data set
To evaluate the cost overhead of developing the additional variant (developed for self-checking), we must consider all development phases from functional specifications to field tests. We thus had to partition the data from Database 2 and from the System Department according to PAL and SEL. For Database 2, the development teams said that they spent about equal time on PAL and SEL for all phases. Indeed, this figure is confirmed by the more accurate data collected in Database 1. In the first four
years, 45.7 percent of the global effort was dedicated to SEL (see Table 2). Given this, we partitioned the working hours in Database 2 equally between PAL and SEL. Table 4 gives the cost overhead for SEL in relation to PAL for the different phases over the entire seven-year period, excluding requirement specifications and system test. As the table shows, developing SEL created an overhead of 134 percent for coding and unit tests together; about 100 percent for integration tests, general tests, maintenance, and analysis; about 60 percent for specification analysis, design, and documentation; and 25 percent for functional specifications. Because we were unable to obtain detailed information from the System Department, we made two extreme assumptions:
■ an optimistic one, in which we assumed that SEL created no overhead, and
■ a pessimistic one, in which we assumed that the time overhead for SEL testing was 100 percent.
Although the first assumption is unrealistic, it let us evaluate a lower bound of the cost overhead. The second assumption is no more likely, because the second variant helped detect errors in several situations; we used it to evaluate the cost overhead's upper bound. Given these assumptions, the global overhead from functional specifications to system test is between 42 and 71 percent, creating a cost increase factor between 1.42 and 1.71. These results do not include the requirement specifications phase, which was performed for the whole software system. Taking the corresponding time into account reduces the overall cost overhead. Unfortunately, the associated effort was not recorded, and all we know is that it was a time-consuming phase.
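To make the arithmetic concrete, the following sketch recombines the figures quoted in the article: 10,000 functional-specification hours (2,000 of them SEL-specific), 24,375 Software Department hours, and 10,200 system-test hours, splitting the Software Department hours 54.3/45.7 in Database 1 and equally in Database 2, with Database 2 holding 34.2 percent of the total (Table 1). Because that 34.2 percent figure is itself rounded, the computed bounds land near, rather than exactly on, the published 42 and 71 percent.

# Figures quoted in the article, in hours.
spec_total, spec_sel = 10_000, 2_000      # functional specifications (SEL-specific share)
software_total = 24_375                   # Software Department, both variants, seven years
system_test = 10_200                      # System Department black-box testing

db2_share = 0.342                         # Table 1: share of software hours in Database 2
db1_hours = software_total * (1 - db2_share)
db2_hours = software_total * db2_share

# Database 1 recorded PAL and SEL separately (54.3 / 45.7); Database 2 is split equally.
software_pal = 0.543 * db1_hours + 0.5 * db2_hours
software_sel = 0.457 * db1_hours + 0.5 * db2_hours
spec_pal = spec_total - spec_sel

def overhead(sel_share_of_system_test):
    # SEL cost relative to PAL cost for a given attribution of system-test hours.
    sel = spec_sel + software_sel + sel_share_of_system_test * system_test
    pal = spec_pal + software_pal + (1 - sel_share_of_system_test) * system_test
    return sel / pal

print(f"optimistic (no SEL test overhead):    {overhead(0.0):.0%}")   # roughly 43%
print(f"pessimistic (100% SEL test overhead): {overhead(0.5):.0%}")   # roughly 72%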
Although our results compare the second variant's cost with the cost of a single variant developed for self-checking, they compare very well to those already observed in controlled experiments, in which the cost overhead is evaluated with respect to the cost of a non-fault-tolerant variant. For example, in one study,1 the cost of a variant in a recovery-block programming approach was estimated at 0.8 times the cost of a non-fault-tolerant variant, giving a cost increase factor of 1.6. In another study,12 the
cost of a variant in an N-version programming approach was 0.75 times the cost of a non-fault-tolerant variant (a cost increase factor of 2.26 for three variants). We have obtained similar figures through cost models.8 Likewise, other researchers have shown that the cost of software design diversity with N variants is not N times the cost of a single software variant, though they provided no specific figures.13-15
References 1. T. Anderson et al., “Software Fault Tolerance: An Evaluation,” IEEE Trans. Software Eng., vol. 11, no. 12, 1985, pp. 1502–1510. 2. D.E. Eckhardt et al., “An Experimental Evaluation of Software Redundancy as a Strategy for Improving Reliability,” IEEE Trans. Software Eng., vol. 17, no. 7, 1991, pp. 692–702. 3. J.P.J. Kelly et al., “A Large Scale Second Generation Experiment in Multi-Version Software: Description and Early Results,” Proc. IEEE 18th Int’l Symp. FaultTolerant Computing (FTCS-18), IEEE CS Press, Los Alamitos, Calif., 1988, pp. 9–14. 4. P.G. Bishop, “The PODS Diversity Experiment,” Software Diversity in Computerized Control Systems, Dependable Computing and Fault-Tolerant Systems, vol. 2, U. Voges, ed., Springer-Verlag, New York, 1988, pp. 51–84. 5. J.B. Dugan and M.R. Lyu, “Dependability Modeling for Fault-Tolerant Software and Systems,” Software Fault Tolerance, M. Lyu, ed., John Wiley & Sons, New York, 1995, pp. 109–138. 6. J.C. Laprie et al., “Architectural Issues in Software Fault Tolerance,” Software Fault Tolerance, M. Lyu, ed., John Wiley & Sons, New York, 1995, pp. 47–80. 7. A. Avizienis, “The N-version Approach to FaultTolerant Systems,” IEEE Trans. Software Eng., vol. 11, no. 12, 1985, pp. 1491–1501. 8. J.C. Laprie et al., “Definition and Analysis of Hardware and Software Fault-Tolerant Architectures,” Computer, vol. 23, no. 7, 1990, pp. 39–51. 9. P. Traverse, “Dependability of Digital Computers on Board Airplanes,” Dependable Computing for Critical Applications, Dependable Computing and Fault-Tolerant Systems, vol. 1, A. Avizienis and J.C. Laprie, eds., Springer-Verlag, New York, 1987, pp. 133–152. 10. J.P.J. Kelly, T.I. McVittie, and W.I. Yamamoto, “Implementing Design Diversity to Achieve Fault Tolerance,” IEEE Software, July 1991, pp. 61–71. 11. L. Hatton, “N-Version Design Versus One Good Version,” IEEE Software, Nov./Dec. 1997, pp. 71–76. 12. A. Avizienis et al., “DEDIX 87—A Supervisory System for Design Diversity Experiments at UCLA,” in Software Diversity in Computerized Control Systems, Dependable Computing and Fault-Tolerant Systems, vol. 2, U. Voges, ed., Springer-Verlag, New York, 1988, pp. 129–168. 13. P.G. Bishop et al., “PODS—A Project on Diverse Software,” IEEE Trans. Software Eng., vol. 12, no. 9, 1986, pp. 929–940. 14. G. Hagelin, “Ericsson Safety System for Railway Control,” Software Diversity in Computerized Control Systems, Dependable Computing and Fault-Tolerant Systems, vol. 2, U. Vogues, ed., Springer-Verlag, New York, 1988, pp. 9–21. 15. P.G. Bishop, “The PODS Diversity Experiment,” Software Diversity in Computerized Control Systems, Dependable Computing and Fault-Tolerant Systems, vol. 2, U. Vogues, ed., Springer-Verlag, New York, 1988, pp. 51–84.
About the Author Karama Kanoun is
director of research at the LAAS, Centre National de la Recherche Scientifique, France. Her current research interests include modeling and evaluating computer system dependability in both hardware and software. She was program committee cochair of the International Symposium on Software Reliability Engineering ’94 and the International Conference on Dependable Systems and Networks 2000, and general chair of ISSRE ’95 and the 18th International Conference on Computer Safety, Reliability, and Security ’99. She is associate editor of IEEE Transactions on Reliability and is a project reviewer for the European Commission. Contact her at LAAS-CNRS, 7 Avenue du Colonel Roche, 31077 Toulouse Cedex 4, France;
[email protected].
focus
fault tolerance
Exploring the Role of Diagnosis in Software Failure Les Hatton, Oakwood Computing and the University of Kent, UK
Among engineering systems, software systems are unique in their tendency toward repetitive failure, and the situation is worsening as systems become larger and more tightly coupled. To stem the tide, the author advocates a humbler attitude toward software failure and improved diagnostics.
When an engineering system behaves in an unexpected way, we say it has failed. If the failure is significant, we use a process called diagnosis to discover why the failure occurred. By definition, the successful outcome of diagnosis is the discovery of a deficiency whose correction prevents the system from failing in the same way again. We call such a deficiency a fault. If diagnosis is unsuccessful, the fault remains undiscovered and can cause the system to fail again in the future.
Such a failure is called a repetitive failure mode, although it would probably be clearer to call it a diagnostic failure mode. The first main point of this article is that repetitive failure is very common in software-controlled systems as compared to other kinds of engineering systems. Second, I argue that this state of affairs stems from an overly optimistic approach by software system designers and implementers, coupled with rapidly increasing system complexity. Finally, in discussing ways to improve the situation, I define two attributes of diagnosis that may prove useful in clarifying the problem.

Evidence for repetitive failure
Repetitive failure is indeed a widespread property of software-controlled systems. For early evidence of this, we can go back more than 15 years, when Ed Adams published his now famous study at IBM,1
showing that a very small percentage of faults—only around 2%—were responsible for most observed failures. In other words, his data showed that his systems were dominated by repetitive failure. Figure 1 shows the relationship between fault and failure. In essence, every failure results from at least one fault, but not all faults fail during the software's life cycle. In fact, Adams showed that a third of all faults failed so rarely that for all intents and purposes they never failed at all in practice. It is quite easy to envisage a fault that does not fail. For example, large systems sometimes contain unintentionally unreachable code; a fault located in an unreachable piece of code could never fail. In general, however, under some set of conditions, a fault or group of faults in combination will cause the system to fail.
Reasons for repetitive failure
Why would software systems be dominated by repetitive failure? After all, other engineering systems aren't. If bridges failed in the same way repeatedly, there would be a public outcry. The answer seems to lie in maturity: If you go far enough back in time, bridges did fail repeatedly, and there was public outcry. For example, in 1879, the Tay Bridge in Scotland fell down in a strong gale. It was poorly designed and built, and the wind blew it over, causing the deaths of 75 people in the train that was on it at the time. More or less as a direct result of the ensuing public outcry, the Forth Bridge in Scotland, built a little later, is almost embarrassingly overengineered. If a comet ever hits the earth, let's hope that it hits the Forth Bridge first. It will very likely bounce off. In fact, in the 1850s, as many as one in four iron railway bridges fell down until the reason was discovered.5 The process of engineering maturation, whereby later designs gradually avoided past mistakes, has led to the almost complete disappearance of repetitive failure—not only in bridges, but also in most other areas of engineering. The single exception is software. It is easy to think that this is because software development is only around 50 years old, but this is overly generous and ignores the lessons of history. The fact is that software engineering is gripped by unconstrained and very rapid creativity, whereas the elimination of repetitive failure requires painstaking analysis of relatively slowly moving technologies.
Figure 1. The relationship between fault and failure. There is not necessarily a one-to-one relationship between failure and “faults that fail.” A failure is simply a difference between a system’s expected behavior and its actual behavior; some failures result from two or more faults acting collaboratively.
This standard method of engineering improvement, summarized in Figure 3, has been known for a very long time. It used to be called common sense, but in these enlightened times, with the addition of a little mathematics, we know it as control process feedback. Regrettably, use of this simple principle is not widespread in software engineering, although process control models such as the SEI CMM firmly espouse it.
Figure 2 presents an analysis of dependencies on a significant class of software faults occurring in a large study of commercial C systems.2 In this case, I calculated the weighted fault rate by counting the occurrences of 100 of C's better-known fault modes and weighting them according to severity between 0 and 1.3 This class of faults is widely published, has been known for the best part of 10 (and in some cases 20) years, is known to fail, and yet nothing much has been done to prevent it. Precisely the same thing occurs to differing extents for other standardized languages. It is perhaps no wonder that independently written programs tend to suffer from nonindependent failure,4 when software in general is riddled with a growing number of known repetitive failure modes that we seem unable to avoid.
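The article does not give the tool, fault list, or weights behind Figure 2, so the following is only a hypothetical sketch of what such a weighted fault-rate metric looks like:

def weighted_fault_rate(findings, weights, kloc):
    # Weighted faults per 1,000 lines of code.
    #   findings: fault mode -> number of occurrences flagged by static analysis
    #   weights:  fault mode -> severity weight in [0, 1]
    #   kloc:     thousands of (preprocessed, nonblank) source lines analyzed
    total = sum(count * weights.get(mode, 0.0) for mode, count in findings.items())
    return total / kloc

# Toy example with two of C's better-known fault modes.
findings = {"uninitialized_variable": 12, "real_equality_test": 7}
weights = {"uninitialized_variable": 1.0, "real_equality_test": 0.5}
print(weighted_fault_rate(findings, weights, kloc=3.2))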
Figure 2. Weighted fault rates per 1,000 lines of code for a wide variety of commercially released C applications, plotted as a function of package number. This study analyzed 68 packages (totaling over 2,000,000 nonblank, preprocessed lines) from approximately 50 application areas in approximately 30 industrial groupings. The groupings ranged from non-safety-related areas, such as advertising and insurance, to safety-critical areas, such as air-traffic control and medical systems. Safety-critical systems were a little better than the average (which was 5.73), but one of the games measured scored better than two of the medical systems.
Figure 3. The silver bullet of engineering, control process feedback embodies the extraordinarily simple principle that it is not a sin to make a mistake, it is a sin to repeat one. Careful analysis of failure reveals the all-important clues as to how to avoid it in the future. The analysis mechanism is known as root-cause analysis.
It is almost as if the idea is too obvious, a drawback that also might explain attitudes over the last 200 years or so toward Darwinian evolution. The writer and technological seer Arthur C. Clarke once said that "any sufficiently complex technology is indistinguishable from magic." I would add that any sufficiently simple technology is also indistinguishable from magic. Darwinian evolution and control process feedback are two outstanding examples of my principle in action. They are apparently so obvious that they shouldn't be right. However, probably the most important characteristic these two share is that although improvement is inexorable, it also takes time. This is particularly relevant in these days of "reduced time to market"—perhaps better known as "don't test it as much."

Another demonstrable source of repetitive failure in software systems is imprecisely defined programming languages—a problem that many organizations make no effort to avoid. The language standardization process exacerbates this problem. Obviously, standardization is an important step forward in engineering maturity, but the process should not ignore historical lessons. As practiced today, language standardization suffers from two important drawbacks. First, language committees (and I've sat on a few in my time) seem unable to resist the temptation to fiddle, albeit with every good intention. These committees add features that seem like a good idea at the time, but often without really knowing whether they will work in practice. Of course, such creativity is normal in engineering. It is similar to the role of mutation in Darwinian evolution. What is not normal, however, is language standardization's second drawback, the concept of backwards compatibility, often expressed in the hallowed rule, "Thou shalt not break old code." The backwards compatibility principle operates in direct opposition to control process feedback. So, drawback one guarantees the continual injection of features that may not work,
and drawback two guarantees the extreme difficulty of taking them out again. In other words, these two characteristics guarantee a standardization technique that largely excludes learning from previous mistakes. If other engineering disciplines pursued this doctrine, hammers, for example, would have microprocessor-controlled ejection mechanisms to ensure that their heads would fly off randomly every few minutes—just as they did when made with wooden handles 40 years ago. Not surprisingly, hammers were redesigned to eliminate this feature. We can clearly see the effects of the language situation in Figure 2.

A further reason that software is full of repetitive failure modes is my main subject in this article—the diagnosis problem. If a system fails and diagnosis does not reveal the fault or all of the faults that caused the failure (bearing in mind that some experiments have shown that one in seven defect corrections is itself faulty!1), the failure will occur again in some form in the future. In other words, the inability to diagnose a fault inevitably makes the resulting failure repetitive. Now I'll attempt to analyze why diagnosis often fails.

Quantifying diagnosis
The essence of diagnosis is to trace back from a failure to the culpable fault or faults. Unfortunately, we have no systematic model that lets us go in the other direction—that is, to predict failure from a particular fault or group of faults. This is the province of the discipline known as "software metrics," wherein we attempt to infer runtime behavior and, in particular, failure occurrence from some statically measurable property of the code or design. For example, we might say that components with a large number of decisions are unusually prone to failure. In general, we have so far failed to produce any really useful relationship, and the area is very complex.6 The primary reason for this appears to be that software failure is fundamentally chaotic. In conventional engineering systems, we have learned through hard experience to work in the linear zone, where stress is linearly related to strain and the system's behavior at runtime is much more predictable, although still occasionally prone to chaotic failure.5 Unfortunately, at the present stage of understanding, software is much less predictable.
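As a toy illustration of the kind of static measure the metrics literature tries to correlate with failure, the fragment below counts branching keywords in a C-like source string; the token list, and the very idea that such a count predicts anything, are assumptions rather than established results.

import re

# A crude stand-in for a "number of decisions" metric; it ignores comments,
# strings, and macros, so treat the count as indicative only.
DECISION_KEYWORDS = re.compile(r"\b(if|for|while|case)\b")

def decision_count(source: str) -> int:
    return len(DECISION_KEYWORDS.findall(source))

print(decision_count("if (x > 0) { while (y--) { if (z) break; } }"))  # prints 3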
Dereference pointer content 0x0 at strlen(…)
  called from line 126 of myc_constexpr.c
  called from line 247 of myc_evalexpr.c
  called from line 2459 of myc_expr.c
Figure 4. A typical stack trace automatically produced by a reasonable C compiler (in this case the estimable GNU compiler), at the point of dereferencing a pointer containing the address zero.
This asymmetry between prediction and diagnosis is particularly clear when we consider the difference between code inspections and traditional testing. Code inspections address the whole fault space of Figure 1. Consequently, a large percentage of faults revealed during code inspections would never actually cause the system to fail in a reasonable life cycle. In spite of this, the often dramatic effectiveness of code inspections compared with traditional runtime testing is unquestionable.7 Traditional testing, of course, finds a problem at runtime. This by definition is a failure and lies in the small subset of Figure 1. (In fact, we can predict that code inspections will be even more effective on systems that rapidly build up many execution years, such as modern consumer embedded systems, because in these cases the subset of faults that fail is much bigger.)

Two essential parameters help us categorize how easily a software failure can be diagnosed. These are diagnostic distance and diagnostic quality.

Diagnostic distance
In essence, diagnostic distance is the "distance" in the system state between a fault being executed and the resulting failure manifesting itself. We can reasonably visualize this distance as the number of changes of state (for example, variables or files modified) that occur between the fault's execution and the observation of failure. The greater the number of changes, the more difficult in general it is to trace the failure back to the fault—that is, to diagnose it. Let's consider some simple examples from different languages and different systems.
… if (tolerance.eq. acceptable_tolerance) then …
Consider Figure 4, a typical "stack trace" resulting from a core dump, generated in this case by a C program. The program has tried to look at the contents of address zero, a fatal mistake in C leading to an immediate program failure. In other words, the "distance" between the fault firing and the failure occurring is very short. This, coupled with the detailed nature of the stack trace, points unerringly back at the precise location of the fault. Not surprisingly, these failures are very easy to diagnose.

We can contrast this failure with the next one, shown in Figure 5, which affected me some years ago. This failure occurred in a Fortran 77 program, but exactly the same problem manifests itself in other languages without warning today. In this case, a comparison of real variables behaved slightly differently on different machines, a common problem. The effect was an unacceptable drop in the significance of agreement between two different computers—from around four decimal places to only two decimal places—in some parts of a data set that formed part of an acceptance test. This line of code was buried in the middle of a 70,000+-line package containing various signal-processing algorithms. From the point of failure, a colleague and I spent some three months on-and-off tracing the problem back to this line of code. In this case, the diagnostic distance between the fault and the failure was sufficiently large to lead to an exceptionally difficult debugging problem—although in the rich glow of hindsight, it is statically detectable. We can see other examples of large diagnostic distance in dynamic-memory failures in mainstream languages such as C, C++, and Fortran, to a lesser extent in Ada, and also in several aspects of OO implementations. In these cases, dynamic memory is allocated, corrupted at some stage, and rather later on leads to a failure. We can also call this kind of failure nonlocal behavior.
Figure 5. A comparison of real-valued variables. This is a broken concept in just about every programming language in existence and is one of the features included in the study that led to the data of Figure 2.
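A two-line illustration of why such comparisons are fragile (Python here, but the underlying binary floating-point behavior is the same one that bit the Fortran code in Figure 5):

a = 0.1 + 0.2
print(a == 0.3)              # False: neither value is exactly representable in binary
print(abs(a - 0.3) < 1e-9)   # True: the usual repair is an explicit tolerance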
Diagnostic quality
In contrast to diagnostic distance, diagnostic quality refers to how well the diagnostic distance is signposted between the state at which the fault executed and the state at which we observe the software to fail. To give a simple example of how influential this can be, remove a significant token from the middle of a computer program, for example, a "," from a C program.
Figure 6. Part of the fault cascade on one of the author’s programs, which has had a significant token, in this case a “,” deliberately removed. Warnings near the point of occurrence are relatively easy to understand. Warnings later in the cascade rapidly become more and more esoteric.
… myc_decl.c:297 parse error before 'name'
… (lots of increasingly more exotic messages)
… myc_decl.c:313: warning: data definition has no type or storage class
When such a program is recompiled, the compiler will usually diagnose the absence of this token with a reasonably comprehensible error message. Figure 6 shows an example from the GNU compiler. Note that the missing token leads to a cascade of error messages that start out reasonably comprehensible at the fault location but rapidly degenerate into entirely spurious messages, until the compiler eventually gives up in disgust. This is a classic problem in compiler design for all programming languages. Now imagine trying to diagnose the missing token from the last compiler warning. Clearly, this can be very difficult. The full error cascade signposts the fault-failure path. Removing part of the signposting can fatally cripple your ability to diagnose the fault or faults from the failure.

Such cascades are becoming common in real systems as they become more tightly coupled and larger. Consider the example in Figure 7, quoted by Peter Mellor8 in an excellent discussion of this topic. Imagine trying to diagnose the landing gear problem from the last error message in this case! In fact, this led to a repetitive failure mode, as at least one more incident occurred (19 November 1988), and a further nine months went by before a fix could be made.

Training engineers to provide adequate signposting so that faults can be diagnosed from failures is an educational issue. We all too often fail to train software engineers to anticipate failure, and it is uncommon to teach the fundamental role of testing and techniques such as hazard analysis in our universities. This lack encourages overoptimism and leads to inadequate preparation for the inevitable failures. The diagnostic link between failure and fault is simply not present, and the fault becomes difficult, if not impossible, to find. Failure is one of the natural properties of a software-engineering system, and we ignore this immutable fact at our peril.
• MAN PITCH TRIM ONLY, followed in quick succession by:
• Fault in right main landing gear;
• At 1,500 feet, fault in ELAC2 (one of seven computers in the Electronic Flight Control System);
• LAF alternate ground spoilers 1-2-3-5 (fault in Load Alleviation Function);
• Fault in left pitch control green hydraulic circuit;
• Loss of attitude protection (which prevents dangerous maneuvers);
• Fault in Air Data System 2;
• Autopilot 2 shown as engaged despite the fact that it was disengaged; and then, finally,
• LAVATORY SMOKE, indicating a (nonexisting) fire in the toilets.
Figure 7. The diagnostic sequence appearing on the primary flight display of an Airbus A320, Flight AF 914 on 25 August 1988. The faults were not spurious—the aircraft had difficulty getting its landing gear down and had to do three passes at low altitude by the control tower so that the controllers could check visually.
Even when diagnostic links are present during testing, engineers often remove them from or truncate them in the released system for space reasons, crippling our ability to diagnose failure properly when it occurs in the field. Consider the following examples of possible warnings issued in commercial systems.

Please wait ...
accompanied a failure in the flight management system on an Airbus A340 in September 1994.9 To my knowledge, the cause has still not been found. System over-stressed ...
appeared in a cash register system in a public bar. It later transpired that the printer had run out of paper.10 More than 64 TCP or UDP streams open ...
appeared in a G3 Macintosh running OS8.1. It turned out that the modem was not switched on.10 Of course, the ultimate in poor diagnostics is no warning at all. At the time of writing, there is considerable public debate in the UK about the Chinook helicopter crash in 1994 on the Mull of Kintyre in Scotland.11 The crash killed 30 people, including both pilots and several very senior security people from Northern Ireland. Although the crash was blamed on pilot error, prior to this incident concerns had been raised about the quality of the FADEC software controlling this aircraft's engines—including concerns about a design flaw that had precipitated the near destruction of another Chinook in 1989. The essence of the verdict in this case seems to be that because the crash investigators found no evidence that the FADEC software failed, it must have been the pilots. However, as any PC user knows, most PC crashes leave no trace, so that on reboot, there is no evidence that anything was ever wrong. It is therefore grievously wrong to equate no diagnosis with no failure.
Unification
We can put together diagnostic distance and diagnostic quality to summarize the diagnosis problem, as shown in Figure 8. Here, we can see that when diagnostic distance is considerable, unless diagnostic quality is good, we have a very difficult diagnostic problem, one that in general will be intractable. The "difficult" sector of Figure 8 can be seen as a source of repetitive failure. We can ameliorate such a problem either by improving the diagnostic quality or by reducing the distance between the point where the fault executed and the point where the failure was observed. In practice, the latter is much easier.
In the last few years, software engineering has begun producing more and more systems where diagnostic distance is large. Networking, for example, is a classic method of increasing diagnostic distance. Embedded systems are similarly difficult to diagnose. Unless we can make rapid progress in improving diagnostic quality—essentially an educational issue—things are going to spiral out of control, and we can look forward to repetitive failure becoming a permanent fixture in software-engineering systems. This is clearly an unacceptable scenario. We can address both these issues by educating software engineers to realize that failure is an inevitable—indeed, a natural—property of software systems and to reflect this fact in design, implementation, and provision for a priori diagnosis.
Figure 8. Diagnostic distance versus diagnostic quality. Failures discussed earlier in the article give examples of the four sectors:

• close distance, good diagnostic quality: easy ("core dump");
• close distance, poor diagnostic quality: moderate ("system stressed");
• large distance, good diagnostic quality: moderate ("smoke in lavatory");
• large distance, poor diagnostic quality: difficult ("please wait").
Acknowledgments All sorts of people contributed to this article. In particular, I thank Peter Mellor for his input and the anonymous reviewers, who contributed enormously.
References 1. N.E. Adams, “Optimizing Preventive Service of Software Products,” IBM J. Research and Development, vol. 28, no. 1, 1984, pp. 2–14. 2. L. Hatton, “The T Experiments: Errors in Scientific Software,” IEEE Computational Science & Eng., vol. 4, no. 2, Apr.–Jun. 1997, pp. 27–38. 3. L. Hatton, Safer C: Developing for High-Integrity and Safety-Critical Systems, McGraw-Hill, New York, 1995. 4. J.C. Knight and N.G. Leveson, “An Experimental Evaluation of the Assumption of Independence in MultiVersion Programming,” IEEE Trans. Software Eng., vol. 12, no. 1, 1986, pp. 96–109. 5. H. Petroski, To Engineer is Human: The Role of Failure in Successful Design, Random House, New York, 1992. 6. N.E. Fenton and S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, 2nd edition, PWS Publishing, Boston, 1997. 7. T. Gilb and D. Graham, Software Inspections, AddisonWesley, Wokingham, England, 1993. 8. P. Mellor, “CAD: Computer-Aided Disaster,” High Integrity Systems, vol. 1, no. 2, 1994, pp. 101–156. 9. AAIB Bulletin 3/95, Air Accident Investigation Branch, DRA Farnborough, UK, 1995. 10. L. Hatton, “Repetitive Failure, Feedback and the Lost Art of Diagnosis,” J. Systems and Software, vol. 47, 1999, pp. 183–188. 11. A. Collins, “MPs Misled Over Chinook,” Computer Weekly, 27 May 1999.
About the Author Les Hatton is an independent consultant in software reliability. He is also Professor of Software Reliability at the Computing Laboratory, University of Kent, UK. He received a number of international prizes for geophysics in the 1970s and 80s, culminating in the 1987 Conrad Schlumberger prize for his work in computational geophysics. Shortly afterwards, he became interested in software reliability, and changed careers to study the design of high-integrity and safety-critical systems. In October 1998, he was voted amongst the leading international scholars of systems and software engineering for the period 1993-1997 by the Journal of Systems and Software. He holds a BA from King’s College, Cambridge, and an MSc and PhD from the University of Manchester, all in mathematics; an ALCM in guitar from the London College of Music; and an LLM in IT law from the University of Strathclyde. Contact him at Oakwood Computing Associates and the Computing Laboratory, University of Kent, UK;
[email protected]; www.oakcomp.co.uk.
focus
fault tolerance
Mastering Agreement Problems in Distributed Systems Michel Raynal, IRISA Mukesh Singhal, Ohio State University
Overcoming agreement problems in distributed systems is a primary challenge to systems designers. These authors focus on practical solutions for a well-known agreement problem—the nonblocking atomic commitment.

Distributed systems are difficult to design and implement because of the unpredictability of message transfer delays and process speeds (asynchrony) and of failures. Consequently, systems designers have to cope with these difficulties intrinsically associated with distributed systems. We present the nonblocking atomic commitment (NBAC) agreement problem as a case study in the context of both synchronous and asynchronous distributed systems where processes can fail by
crashing. This problem lends itself to further exploration of fundamental mechanisms and concepts such as time-out, multicast, consensus, and failure detector. We will survey recent theoretical results of studies related to agreement problems and show how system engineers can use this information to gain deeper insight into the challenges they encounter.

Commitment protocols
In a distributed system, a transaction usually involves several sites as participants. At a transaction's end, its participants must enter a commitment protocol to commit it (when everything goes well) or abort it (when something goes wrong). Usually, this protocol obeys a two-phase pattern (the two-phase commit, or 2PC). In the first phase, each participant votes Yes or No. If for any reason (deadlock, a storage
problem, a concurrency control conflict, and so on) a participant cannot locally commit the transaction, it votes No. A Yes vote means the participant commits to make local updates permanent if required. The second phase pronounces the order to commit the transaction if all participants voted Yes or to abort it if some participants voted No. Of course, we must enrich this protocol sketch to take into account failures to correctly implement the failure atomicity property. The underlying idea is that a failed participant is considered to have voted No.
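A bare-bones sketch of the coordinator's side of the protocol just described; the participant objects and their request_vote and deliver calls are hypothetical, and a real implementation would also force-log each step to stable storage before sending it.

def two_phase_commit(participants):
    # Phase 1: collect votes; a participant that fails to answer counts as a No vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.request_vote())      # returns "Yes" or "No"
        except TimeoutError:                    # crashed or unreachable participant
            votes.append("No")
    decision = "Commit" if all(v == "Yes" for v in votes) else "Abort"

    # Phase 2: broadcast the decision. A coordinator crash at this point is
    # exactly the scenario that makes basic 2PC blocking (see the sidebar).
    for p in participants:
        try:
            p.deliver(decision)
        except TimeoutError:
            pass
    return decision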
The Two-Phase Commit Protocol Can Block
A coordinator-based two-phase commit protocol obeys the following message exchanges. First, the coordinator requests a vote from all transaction participants and waits for their answers. If all participants vote Yes, the coordinator returns the decision Commit; if even a single participant votes No or failed, the coordinator returns the decision Abort.

Consider the following failure scenario (see Figure A).1 Participant p1 is the coordinator and p2, p3, p4, and p5 are the other participants. The coordinator sends a request to all participants to get their votes. Suppose p4 and p5 answer Yes. According to the set of answers p1 receives, it determines the decision value D (Commit or Abort) and sends it to p2 and p3, but crashes before sending it to p4 and p5. Moreover, p2 and p3 crash just after receiving D. So, participants p4 and p5 are blocked: they cannot know the decision value because they voted Yes and because p1, p2, and p3 have crashed (p4 and p5 cannot communicate with one of them to get D). Participants p4 and p5 cannot force a decision value because the forced value could be different from D. So, p4 and p5 are blocked until one of the crashed participants recovers. That is why the basic two-phase commit protocol is blocking: situations exist where noncrashed participants cannot progress because of participant crash occurrences.

Reference
1. Ö. Babaoğlu and S. Toueg, "Non-Blocking Atomic Commitment," Distributed Systems, S. Mullender, ed., ACM Press, New York, 1993, pp. 147–166.
Atomic commitment protocols
Researchers have proposed several atomic commitment protocols and implemented them in various systems.1,2 Unfortunately, some of them (such as the 2PC with a main coordinator) exhibit the blocking property in some failure scenarios.3,4 "Blocking" means that nonfailed participants must wait for the recovery of failed participants to terminate their commit procedure. For instance, a commitment protocol is blocking if it admits executions in which nonfailed participants cannot decide. When such a situation occurs, nonfailed participants cannot release resources they acquired for exclusive use on the transaction's behalf (see the sidebar "The Two-Phase Commit Protocol Can Block"). This not only prevents the concerned transaction from terminating but also prevents other transactions from accessing locked data. So, it is highly desirable to devise NBAC protocols that ensure transactions will terminate (by committing or aborting) despite any failure scenario.

Several NBAC protocols (called 3PC protocols) have been designed and implemented. Basically, they add handling of failure scenarios to a 2PC-like protocol and use complex subprotocols that make them difficult to understand, program, prove, and test. Moreover, these protocols assume the underlying distributed system is synchronous (that is, the process-scheduling delays and message transfer delays are upper bounded, and the protocols know and use these bounds).
Figure A. A failure scenario for the two-phase commit protocol.
Liveness guarantees
Nonblocking protocols necessitate that the underlying system offers a liveness guarantee—a guarantee that failures will be eventually detected. Synchronous systems provide such a guarantee, thanks to the upper bounds on scheduling and message transfer delays (protocols use time-outs to safely detect failures in a bounded time). Asynchronous systems do not have such bounded delays, but we can compensate for this by equipping them with unreliable failure detectors, as we'll describe later in this article. (For more on the difference between synchronous and asynchronous distributed systems, see the related sidebar.) So, time-outs in synchronous systems and unreliable failure detectors in asynchronous systems constitute building blocks offering the liveness guarantee. With these blocks we can design solutions to the NBAC agreement problem.

Distributed systems and failures
A distributed system is composed of a finite set of sites interconnected through a communication network. Each site has a local memory and stable storage and executes one or more processes. To simplify, we assume that each site has only one process. Processes communicate and synchronize by exchanging messages through the underlying network's channels. A synchronous distributed system is characterized by upper bounds on message transfer delays, process-scheduling delays, and message-processing time. δ denotes an upper time bound for "message transfer
IEEE SOFTWARE
41
A Fundamental Difference between Synchronous and Asynchronous Distributed Systems We consider a synchronous distributed system in which a channel connects each pair of processes and the only possible failure is the crash of processes. To simplify, assume that only communication takes time and that a process that receives an inquiry message responds to it immediately (in zero time). Moreover, let ∆ be the upper bound of the round-trip communication delay. In this context, a process pi can easily determine the crash of another process pj. Process pi sets a timer’s time-out period to ∆ and sends an inquiry message to pj. If it receives an answer before the timer expires, it can safely conclude that pj had not crashed before receiving the inquiry message (see Figure B1). If pi has not received an answer from pj when the timer expires, it can safely conclude that pj ∆ has crashed (see Figure B2). Pi In an asynchronous system, processes can use timers but cannot rely on them, irInquiry I am here Pi respective of the time-out period. This is because message transfer delays in asyn(1) chronous distributed systems have no up∆ per bound. If pi uses a timer (with some Pi time-out period ∆) to detect the crash of pj, the two previous scenarios (Figures B1 and Pi Crash B2) can occur. However, a third scenario (2) can also occur (see Figure B3): the timer ∆ expires while the answer is on its way to Pi pi. This is because ∆ is not a correct upper bound for a round-trip delay. Pi Only the first and second scenarios can occur in a synchronous distributed system. (3) All three scenarios can occur in an asynchronous distributed system. Moreover, in Figure B. Three scenarios of such a system, pi cannot distinguish the attempted communication second and third scenarios when its timer in a distributed system: (1) expires. When considering systems with communication is timely; process crash failures, this is the funda(2) process pj has crashed; mental difference between a synchronous (3) link (pi, pj) is slow. and an asynchronous distributed system.
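To make the sidebar's time-out argument concrete, the following minimal sketch (ours, not from the article) shows the inquiry test a process pi might run against pj, assuming a hypothetical channel object with a blocking receive that accepts a time-out. Under the synchronous assumption, a time-out after ∆ proves a crash (scenario 2); without that bound, the very same code can only yield a suspicion, because scenario 3 remains possible.

import queue

class Channel:
    """Hypothetical point-to-point channel to a remote process (assumed API)."""
    def __init__(self):
        self.inbox = queue.Queue()
    def send_inquiry(self):
        pass                      # a real channel would transmit "inquiry" to pj here
    def receive(self, timeout):
        try:
            return self.inbox.get(timeout=timeout)    # wait for the "I am here" answer
        except queue.Empty:
            return None

def probe(channel, delta, synchronous):
    """Return 'alive', 'crashed', or 'suspected' for the remote process."""
    channel.send_inquiry()
    answer = channel.receive(timeout=delta)
    if answer is not None:
        return "alive"            # scenario 1: the answer arrived within the bound
    # the timer expired without an answer
    return "crashed" if synchronous else "suspected"   # scenario 2, or the 2/3 ambiguity

The single flag is the whole point: only the synchrony assumption lets the time-out be interpreted as proof rather than as a hint.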
Crash failures
The underlying communication network is assumed to be reliable; that is, it doesn't lose, generate, or garble messages. The famous "Generals Paradox" shows that distributed systems with unreliable communication do not admit solutions to the NBAC problem.2
A site or transaction participant can fail by crashing. Its state is correct until it crashes. A participant that does not crash during the transaction execution is also said to be correct; otherwise, it is faulty. Because we are interested only in commitment and not in recovery, we assume that a faulty participant does not recover. Moreover, whatever the number of faulty participants and the network configuration, we assume that each pair of correct participants can always communicate. In synchronous systems, crashes can be easily and safely detected by using time-out mechanisms. This is not the case in asynchronous distributed systems.

The consensus problem
Consensus6 is a fundamental problem of distributed systems (see the sidebar "Why Consensus Is Important in Distributed Systems"). Consider a set of processes that can fail by crashing. Each process has an input value that it proposes to the other processes. The consensus problem consists of designing a protocol where all correct processes unanimously and irrevocably decide on a common output value that is one of the input values. The consensus problem comprises four properties:
■ Termination. Every correct process eventually decides some value.
■ Integrity. A process decides at most once.
■ Agreement. No two correct processes decide differently.
■ Validity. If a process decides a value, some process proposed that value.
The uniform consensus problem is defined by the previous four properties plus the uniform agreement property: no two processes decide differently. When the only possible failures are process crashes, the consensus and uniform consensus problems have relatively simple solutions in synchronous distributed systems. Unfortunately, this is not the case in asynchronous distributed systems, where the most famous result is a negative one.
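The four properties (and the uniform variant) are easy to state as checks over a finished execution. The sketch below is illustrative only: it validates a recorded run rather than implementing consensus, and the data layout is our own assumption; integrity (deciding at most once) is enforced here simply by recording a single decision per process.

def check_consensus(proposals, decisions, correct, uniform=False):
    """proposals: process -> proposed value; decisions: process -> decided value or None;
    correct: set of processes that never crash."""
    deciders = {p: v for p, v in decisions.items() if v is not None}
    termination = all(p in deciders for p in correct)                   # every correct process decides
    validity = all(v in proposals.values() for v in deciders.values())  # decided values were proposed
    scope = deciders if uniform else {p: v for p, v in deciders.items() if p in correct}
    agreement = len(set(scope.values())) <= 1                           # (uniform) agreement
    return {"termination": termination, "validity": validity, "agreement": agreement}

# p3 crashed before deciding; with uniform=True, even values decided by processes
# that later crash must agree with the others.
print(check_consensus(proposals={"p1": 5, "p2": 7, "p3": 7},
                      decisions={"p1": 7, "p2": 7, "p3": None},
                      correct={"p1", "p2"},
                      uniform=True))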
The FLP result states that it is impossible to design a deterministic protocol solving the consensus problem in an asynchronous system even with only a single process crash failure6 (see the sidebar "The Impossibility of Consensus in Asynchronous Distributed Systems"). Intuitively, this is due to the impossibility of safely distinguishing a very slow process from a crashed process in an asynchronous context. This impossibility result has motivated researchers to discover a set of minimal properties that, when satisfied by an asynchronous distributed system, make the consensus problem solvable.

Why Consensus Is Important in Distributed Systems
In asynchronous distributed systems prone to process crash failures, Tushar Chandra and Sam Toueg have shown that the consensus problem and the atomic broadcast problem are equivalent.1 (The atomic broadcast problem specifies that all processes be delivered the same set of messages in the same order. This set includes only messages broadcast by processes but must include all messages broadcast by correct processes.) Therefore, you can use any solution for one of these problems to solve the other. Suppose we have a solution to the atomic broadcast problem. To solve the consensus problem, each process "atomically broadcasts" its value and the first delivered value is considered the decision value. (See Chandra and Toueg's article1 for information on a transformation from consensus to atomic broadcast.) So, all theoretical results associated with the consensus problem are also applicable to the atomic broadcast problem. Among others, it is impossible to design an atomic broadcast protocol in an asynchronous distributed system prone to process crash failures without considering additional assumptions limiting the system's asynchronous behavior. Also, in asynchronous distributed systems with limited behavior that make the consensus problem solvable, consensus acts as a building block on top of which you can design solutions to other agreement or coordination problems (such as nonblocking atomic commitment, leader election, and group membership).

Reference
1. T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, Mar. 1996, pp. 245–267.

The Impossibility of Consensus in Asynchronous Distributed Systems
Let's design a consensus protocol based on the rotating coordinator paradigm. Processes proceed by asynchronous rounds; in each round, a predetermined process acts as the coordinator. Let pc be the coordinator in round r, where c = (r mod n) + 1. Here's the protocol's principle:
1. During round r (initially, r = 1), pc tries to impose its value as the decision value.
2. The protocol terminates as soon as a decision has been obtained.
Let's examine two scenarios related to round r:
1. Process pc has crashed. A noncrashed process pi cannot distinguish this crash from the situation in which pc or its communication channels are very slow. So, if pi waits for a decision value from pc, it will wait forever. This violates the consensus problem's termination property.
2. Process pc has not crashed, but its communication channel to pi is fast while its communication channels to pj and pc′ are very slow. Process pi receives the value vc of pc and decides accordingly. Processes pj and pc′, after waiting for pc for a long period, suspect that pc has crashed and start round r + 1. Let c′ = ((r + 1) mod n) + 1. During round r + 1, pj and pc′ decide on vc′. If vc ≠ vc′, the agreement property is violated.
This discussion shows that the protocol just sketched does not work. What is more surprising is that it is impossible to design a deterministic consensus protocol in an asynchronous distributed system even with a single process crash failure. This impossibility result, established by Michael Fischer, Nancy Lynch, and Michael Paterson,1 has motivated researchers to find a set of properties that, when satisfied by an asynchronous distributed system, make the consensus problem solvable. Minimal synchronism,2 partial synchrony,3 and unreliable failure detectors4 constitute answers to such a challenge. Researchers have also investigated randomized protocols to get nondeterministic solutions.5

References
1. M. Fischer, N. Lynch, and M. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, Apr. 1985, pp. 374–382.
2. D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," J. ACM, vol. 34, no. 1, Jan. 1987, pp. 77–97.
3. C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, Apr. 1988, pp. 288–323.
4. T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, Mar. 1996, pp. 245–267.
5. M. Rabin, "Randomized Byzantine Generals," Proc. 24th Symp. Foundations of Computer Science, IEEE Press, Piscataway, N.J., 1983, pp. 403–409.

Failure detection in asynchronous systems
Because message transfer and process-scheduling delays are arbitrary and cannot
be bounded in asynchronous systems, it’s impossible to differentiate a transaction participant that is very slow (owing to a very long scheduling delay) or whose messages are very slow (owing to long transfer delays) from a participant that has crashed. This simple observation shows that detecting failures with completeness and accuracy in asynchronous distributed systems is impossible. Completeness means a faulty participant will eventually be detected as faulty, while accuracy means a correct participant will not be considered faulty. We can, however, equip every site with a failure detector that gives the site hints on other sites it suspects to be faulty. Such failure detectors function by suspecting participants that do not answer in a timely fashion. These detectors are inherently unreliable because they can erroneously suspect a correct participant or not suspect a faulty one. In this context, Tushar Chandra and Sam Toueg have refined the completeness and accuracy properties of failure detection.7 They have shown that you can solve the consensus problem in executions where unreliable failure detectors satisfy some of these properties:
■ Strong completeness. Eventually every faulty participant is permanently suspected by every correct participant.
■ Weak completeness. Eventually every faulty participant is permanently suspected by some correct participant.
■ Weak accuracy. There is a correct participant that is never suspected.
■ Eventual weak accuracy. Eventually there is a correct participant that is never suspected.

Based on these properties, Chandra and Toueg have defined several classes of failure detectors. For example, the class of "eventual weak failure detectors" includes all failure detectors that satisfy weak completeness and eventual weak accuracy. We can use time-outs to meet completeness of failure detections because a faulty participant does not answer. However, as we indicated previously, accuracy can only be approximate because, even if most messages are received within some predictable time, an answer not yet received does not mean that the sender has crashed. So, implementations of failure detectors can only have "approximate" accuracy in purely asynchronous distributed systems. Despite such approximate implementations, unreliable failure detectors, satisfying only weak completeness and eventual weak accuracy, allow us to solve the consensus problem.7
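As a rough illustration of such approximate implementations (a sketch under assumed names, not a prescribed design), the detector below suspects any participant whose heartbeat has not been seen within its current time-out. Crashed participants stop refreshing their entry and so end up permanently suspected, which gives completeness; a slow but correct participant can be wrongly suspected and later unsuspected, and raising its time-out after each mistake is the usual route toward eventual weak accuracy in runs that eventually stabilize.

import time

class HeartbeatFailureDetector:
    """Unreliable failure detector sketch: suspicions can be wrong and are revised."""
    def __init__(self, participants, initial_timeout=1.0, backoff=0.5):
        now = time.monotonic()
        self.timeout = {p: initial_timeout for p in participants}
        self.last_heartbeat = {p: now for p in participants}
        self.suspected = set()
        self.backoff = backoff

    def heartbeat_received(self, p):
        self.last_heartbeat[p] = time.monotonic()
        if p in self.suspected:              # false suspicion detected
            self.suspected.discard(p)
            self.timeout[p] += self.backoff  # be more patient with p from now on

    def update_suspicions(self):
        now = time.monotonic()
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)        # possibly wrong: p may only be slow
        return set(self.suspected)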
Nonblocking atomic commitment
As we mentioned earlier, the NBAC problem consists of ensuring that all correct participants of a transaction take the same decision—namely, to commit or abort the transaction. If the decision is Commit, all participants make their updates permanent; if the decision is Abort, no change is made to the data (the transaction has no effect). The NBAC protocol's Commit/Abort outcome depends on the votes of participants and on failures. More precisely, the solution to the NBAC problem has these properties:
■ Termination. Every correct participant eventually decides.
■ Integrity. A participant decides at most once.
■ Uniform agreement. No two participants decide differently.
■ Validity. A decision value is Commit or Abort.
■ Justification. If a participant decides Commit, all participants have voted Yes.
The obligation property
The previous properties could be satisfied by a trivial protocol that would always output Abort. So, we add the obligation property, which aims to eliminate "unexpected" solutions in which the decision would be independent of votes and of failure scenarios. The obligation property stipulates that if all participants voted Yes and everything went well, the decision must be Commit. "Everything went well" is of course related to failures. Because failures can be safely detected in synchronous systems, the obligation property for these systems is as follows:
S-Obligation: If all participants vote Yes and there is no failure, the outcome decision is Commit.
Because failures can only be suspected, possibly erroneously, in asynchronous systems, we must weaken this property for the
problem to be solvable.8 So, for these systems we use this obligation property:
AS-Obligation: If all participants vote Yes and there is no failure suspicion, the outcome decision is Commit.

A generic NBAC protocol
Next, we describe a generic protocol9 for the NBAC problem. In this protocol, the control is distributed: each participant sends its vote to all participants and no main coordinator exists.4 Compared to a central coordinator-based protocol, it requires more messages but fewer communication phases, thus reducing latency. The protocol is described by the procedure nbac(vote, participants), which each correct participant executes (see Figure 1).

Figure 1. A generic nonblocking atomic commitment (NBAC) protocol.

procedure NBAC (vote, participants)
begin
(1.1)   multicast (vote, participants);
(2.1)   wait (    (delivery of a vote No from a participant)
(2.2)          or (∃ q ∈ participants: exception(q) has been notified to p)
(2.3)          or (from each q ∈ participants: delivery of a vote Yes)
(2.4)        );
(3.1)   case
(3.2)     a vote No has been delivered    → outcome := propose (Abort)
(3.3)     an exception has been notified  → outcome := propose (Abort)
(3.4)     all votes are Yes               → outcome := propose (Commit)
(3.5)   end case
end

The procedure uses three generic statements (multicast, exception, and propose) whose instantiations are specific to synchronous or to asynchronous distributed systems. Each participant has a variable outcome that will locally indicate the final decision (Commit or Abort) at the end of the protocol execution. A participant p first sends its vote to all participants (including itself) by using the multicast statement (Figure 1, line 1.1). Then p waits until it has been either
■ delivered a No vote from a participant (line 2.1),
■ notified of an exception concerning a participant q (line 2.2), or
■ delivered a Yes vote from each participant (line 2.3).
At line 2.2, the notification exception(q) concerns q’s failure and will be instantiated appropriately in each type of system. Finally,
according to the votes and the exception notifications that p has received, it executes the statement outcome := propose(x) (lines 3.1 to 3.5) with Commit as the value of x if it has received a Yes from all participants, and with Abort in all other cases. We'll discuss propose in the next section.

Instantiations
Now we'll look at instantiations of the generic protocol for synchronous and asynchronous distributed systems (see Table 1).9

Table 1. Instantiations of the generic protocol for distributed systems.
Generic statement     Synchronous instantiation     Asynchronous instantiation
Multicast(v,P)        Rel_Multicast(v,P)            Multisend(v,P)
exception             timer expiration              failure suspicion
propose(x)            x                             Unif_Cons(x)

Synchronous systems
For these systems, the instantiations are Rel_Multicast(v,P), timer expiration, and x.

Multicast(v,P). The Rel_Multicast(v,P)
primitive allows a process to reliably send a message m to all processes of P. It has these properties:3,10
■ Termination. If a correct process multicasts a message m to P, some correct process of P delivers m (or all processes of P are faulty).
■ Validity. If a process p delivers a message m, then m has been multicast to a set P and p belongs to P (there is no spurious message).
■ Integrity. A process p delivers a message m at most once (no duplication).
■ Uniform agreement. If any (correct or not) process belonging to P delivers a message m, all correct processes of P deliver m.
The previous definition is independent of the system's synchrony. In synchronous distributed systems, the definition of Rel_Multicast(v,P) includes the additional property3 of timeliness: there is a time constant ∆ such that, if the multicast of m is initiated at real-time T, no process delivers m after T + ∆. Let f be the maximum number of processes that might crash and δ be the a priori known upper bound defined earlier in the section "Distributed systems and failures." We can show that ∆ = (f + 1)δ. Özalp Babaoğlu and Sam Toueg describe several message- and time-efficient implementations of Rel_Multicast(m,P).3 So, in our instantiation, Rel_Multicast ensures that if a correct participant is delivered a vote sent at time T, all correct participants will deliver this vote by T + ∆.
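One classical way to obtain these properties, offered here only as an illustration and not necessarily one of the implementations cited above, is a relay rule: a process forwards a message to every other process before delivering it. If the original sender crashes partway through its multicast, any process that received the message propagates it, so no correct process can be left out; in the synchronous case, a chain of up to f + 1 relays is what lies behind a bound of the form ∆ = (f + 1)δ.

class RelayMulticast:
    """Sketch of the relay idea behind reliable multicast (illustrative API)."""
    def __init__(self, pid, peers, send):
        self.pid = pid
        self.peers = peers            # all process ids in P
        self.send = send              # send(dest, msg): assumed point-to-point primitive
        self.delivered = set()

    def rel_multicast(self, msg_id, payload):
        self.on_receive(msg_id, payload)          # the sender handles its own message like any other

    def on_receive(self, msg_id, payload):
        if msg_id in self.delivered:
            return                                 # integrity: deliver at most once
        self.delivered.add(msg_id)
        for q in self.peers:
            if q != self.pid:
                self.send(q, (msg_id, payload))    # relay first ...
        self.deliver(msg_id, payload)              # ... then deliver locally

    def deliver(self, msg_id, payload):
        print(f"{self.pid} delivers {payload}")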
Exception. Because the crash of a participant q in a synchronous system can be safely detected, the exception associated with q is raised if q crashed before sending its vote. We implement this by using a single timer for all possible exceptions (there is one possible exception per participant): p sets a timer to δ + ∆ when it sends its vote (with the Rel_Multicast primitive). If the timer expires before a vote from each participant has been delivered, p can safely conclude that participants from which votes have not been delivered have crashed.

Propose(x). In this case, propose is simply the identity function—that is, propose(x) = x. So, outcome := propose(x) is trivially instantiated by outcome := x.

Asynchronous systems
For these systems, the instantiations are Multisend(v,P), failure suspicion, and Unif_Cons(x). These instantiations "reduce" the NBAC problem to the uniform consensus problem. This means that in asynchronous distributed systems, if we have a solution for the uniform consensus problem, we can use it to solve the NBAC problem8 (see Figure 2).

Figure 2. A protocol stack (layers shown: applications, nonblocking atomic commitment, consensus, point-to-point communication, and failure detection).

Multicast(v,P). The primitive Multisend(m,P) is a syntactical notation that abstracts
for each p ∈ P do send(m) to p end do
where send is the usual point-to-point communication primitive. Multisend(m,P) is the simplest multicast primitive. It is not fault tolerant: if the sender crashes after having sent message m to a subset P′ of P, message m will not be delivered to the processes of P that do not belong to P′. In our instantiation, if a participant crashes during the execution of Multisend, it will be suspected by any failure detector that satisfies the completeness property.

Exception. When the failure detector associated with participant p suspects (possibly erroneously) that a participant q has crashed, it raises the exception associated with q by setting a local Boolean flag suspected(q) to the value true. If failure detectors satisfy the completeness property, all participants that crashed before sending their vote will be suspected. (The protocol's termination is based on this observation.)
Propose(x). The function propose is instantiated by any subprotocol solving the uniform consensus problem. Let Unif_Cons be such a protocol; it is executed by all correct participants.
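Putting the three asynchronous instantiations together gives a compact rendering of Figure 1. The sketch below is ours; the helper callables (a Multisend-style sender, an event loop that reports received votes, the failure detector's suspicion set, and a Unif_Cons subprotocol) are assumed interfaces, not an API defined in the article.

COMMIT, ABORT = "Commit", "Abort"

def nbac(my_vote, participants, multisend, wait_event, suspected, unif_cons):
    """Asynchronous instantiation of the generic NBAC protocol (sketch)."""
    multisend(("VOTE", my_vote), participants)            # line 1.1
    while True:                                           # lines 2.1-2.4
        votes = wait_event()                              # blocks until a vote or a new suspicion
        if any(v == "No" for v in votes.values()):
            proposal = ABORT                              # line 3.2
            break
        if suspected() & set(participants):
            proposal = ABORT                              # line 3.3: exception = failure suspicion
            break
        if all(votes.get(q) == "Yes" for q in participants):
            proposal = COMMIT                             # line 3.4
            break
    return unif_cons(proposal)                            # propose(x) instantiated as Unif_Cons(x)

Because every participant funnels its local Commit/Abort value through the same uniform consensus instance, all correct participants obtain the same outcome even when their suspicions differ, which is how the AS-Obligation property can be met without reliable failure detection.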
The study of the NBAC problem teaches us two lessons. In the presence of process crash failures, the first lesson comes from the generic statement propose. In a synchronous system, a correct participant can locally take a globally consistent decision when a timer
expires. In an asynchronous system, the participant must cooperate with others to take this decision. The second lesson comes from failure detectors. They let us precisely characterize the set of executions in which we can solve the NBAC problem in purely asynchronous systems. Weak completeness and eventual weak accuracy delineate the precise frontier beyond which the consensus problem cannot be solved.11 To master the difficulty introduced by the detection of failures in distributed systems, it’s necessary to understand the few important notions we’ve presented. These notions should be helpful to researchers and engineers to state precise assumptions under which a given problem can be solved. In fact, the behavior of reliable distributed applications running on asynchronous distributed systems should be predictable in spite of failures.
References
1. P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, Reading, Mass., 1987.
2. J.N. Gray, "Notes on Database Operating Systems," Operating Systems: An Advanced Course, Lecture Notes in Computer Science, no. 60, Springer-Verlag, Heidelberg, Germany, 1978, pp. 393–481.
3. Ö. Babaoğlu and S. Toueg, "Non-Blocking Atomic Commitment," Distributed Systems, S. Mullender, ed., ACM Press, New York, 1993, pp. 147–166.
4. D. Skeen, "Non-Blocking Commit Protocols," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1981, pp. 133–142.
5. F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Comm. ACM, vol. 34, no. 2, Feb. 1991, pp. 56–78.
6. M. Fischer, N. Lynch, and M. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, Apr. 1985, pp. 374–382.
7. T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, Mar. 1996, pp. 245–267.
8. R. Guerraoui, "Revisiting the Relationship between Non-Blocking Atomic Commitment and Consensus," Proc. 9th WDAG, Lecture Notes in Computer Science, no. 972, Springer-Verlag, Heidelberg, Germany, 1995, pp. 87–100.
9. M. Raynal, "Non-Blocking Atomic Commitment in Distributed Systems: A Tutorial Based on a Generic Protocol," to be published in Int'l J. Computer Systems Science and Eng., vol. 15, no. 2, Mar. 2000, pp. 77–86.
10. V. Hadzilacos and S. Toueg, "Reliable Broadcast and Related Problems," Distributed Systems, 2nd ed., S. Mullender, ed., ACM Press, New York, 1993, pp. 97–145.
11. T.D. Chandra, V. Hadzilacos, and S. Toueg, "The Weakest Failure Detector for Solving Consensus," J. ACM, vol. 43, no. 4, July 1996, pp. 685–822.

About the Authors
Michel Raynal is a professor of computer science at the University of Rennes, France. He is involved in several projects focusing on the design of large-scale, fault-tolerant distributed operating systems. His research interests are distributed algorithms, operating systems, parallelism, and fault tolerance. He received his Doctorat d'Etat en Informatique from the University of Rennes. Contact him at IRISA Campus de Beaulieu, 35042 Rennes Cedex, France; [email protected].
Mukesh Singhal is a full professor of computer and information science at Ohio State University, Columbus. He is also the program director of the Operating Systems and Compilers program at the National Science Foundation. His research interests include operating systems, database systems, distributed systems, performance modeling, mobile computing, and computer security. He has coauthored Advanced Concepts in Operating Systems (McGraw-Hill, 1994) and Readings in Distributed Computing Systems (IEEE CS Press, 1993). He received his Bachelor of Engineering in electronics and communication engineering with high distinction from the University of Roorkee, Roorkee, India, and his PhD in computer science from the University of Maryland, College Park. He is a fellow of the IEEE. Contact him at the Dept. of Computer and Information Science, Ohio State Univ., Columbus, OH 43210; [email protected]; www.cis.ohio-state.edu/~singhal.
focus
fault tolerance
Achieving Fault-Tolerant Software with Rejuvenation and Reconfiguration
William Yurcik and David Doss, Illinois State University
The authors present two complementary ways of dealing with software aging: reinitializing to a known operating state before a failure occurs or reconfiguring after a failure such that the service the software provides remains operational.

Requirements for constantly functioning software have increased dramatically with commercialization of the Internet. Application service providers and service-level agreements specify contractual software performance in terms of guaranteed availability and error thresholds (failed connection attempts, transaction failures, and fulfillment failures). These requirements are difficult to satisfy, particularly as applications grow in complexity, but the alternative of letting systems unpredictably crash is becoming less of an option. Such crashes are becoming increasingly expensive to business and potentially life threatening to those who depend on essential services built on networked software systems. As the makeup of systems is increasingly composed of software relative to hardware, system crashes are more likely to be the result of a software fault than a hardware fault. Although enormous efforts go into developing defect-free software, it isn't always possible to find and eliminate every software bug. Software engineers develop software that works in the best of all possible worlds, but the real world includes environmental disruptions, transient faults, human errors, and malicious attacks.1 Building constantly functioning software systems in such a highly dynamic and unbounded environment is a challenge.
Even if individual software could be certifiably "assured" as bug-free, this assured software would likely have to execute on systems with "nonassured" software that could potentially introduce new faults into the system.2 Developing systems through software integration and reuse (rather than customized design) has become a cornerstone of modern software engineering. Thus, when considering software systems as a whole, it is prudent to assume that bugs are inherent and software should be fault tolerant. Furthermore, when specific software continuously executes, software aging occurs: The software ages due to error conditions that accumulate with time and use.3–5 Causes include memory leaks, memory fragmentation, memory bloating, missing scheduling deadlines, broken pointers, poor register use, and build-up of numerical round-off errors. This aging manifests itself in terms of system failures due to deteriorating operating system resources, unreleased file locks, and data corruption.6 (For more information, see the "Software Decay" sidebar.) Software aging has occurred in both software used on a massive scale (Microsoft Windows95 and Netscape Navigator) and specialized high-availability safety-critical software.7 As more PC users leave their computers "always-on" through a cable modem or DSL connections to the Internet, the likelihood of system crashes due to software aging is increasingly relevant.

Software rejuvenation
Most software theory has focused on static behavior by analyzing software listings. Little work was performed on longitudinal dynamic behavior and performance under varying loads until Yennun Huang and colleagues introduced the concept of software rejuvenation in 1995.3 Software rejuvenation is a proactive approach that involves stopping executing software periodically, cleaning internal states, and then restarting the software. Rejuvenation may involve all or some of the following: garbage collection, memory defragmentation, flushing operating system kernel tables, and reinitializing internal data structures.7 Software rejuvenation does not remove bugs resulting from software aging but rather prevents them from manifesting themselves as unpredictable whole system failures. Periodic rejuvenation limits the state space in the execution domain and transforms a nonstationary random process into a stationary process that can be predicted and avoided. To all computer users, rejuvenation is as intuitive as occasionally rebooting your computer. Of course, Murphy's Law holds that this reboot will occur when irreplaceable data will be lost. Examples of using large-scale software rejuvenation include3,7
■ the Patriot missile defense system's requirement to switch the system off and on every eight hours (1992);
■ software rejuvenation for AT&T billing applications (1995);
■ telecommunications switching software rejuvenation to prevent performance degradation (1997); and
■ on-board preventative maintenance for long-life deep space missions (1998).

Software Decay
Software decay is a proposed phenomenon, and it refers to the changing behavior of software.1 Specifically, software decay refers to software that degrades through time as it becomes increasingly difficult and expensive to maintain. If software remains in the same supporting environment, it is possible for it to constantly function without changing its behavior, but this is unrealistic. Hardware- and software-supporting environments change over time, and new features are added to change or enhance functionality. The original software architects can't anticipate all possible changes, so unanticipated changes sometimes violate design principles or fail to follow the intent of imprecise requirements. The result is software that has decayed in function and would be more efficient if completely rewritten. While software decay occurs in incremental steps through change processes that humans initiate, software aging occurs through underlying operating system resource management in response to dynamic events and a varying load over time.

Reference
1. S.G. Eick et al., "Does Software Decay? Assessing the Evidence from Change Management Data," IEEE Trans. Software Eng., vol. 27, no. 1, Jan. 2001.
Rejuvenation is similar to preventive maintenance for hardware systems.7 While rejuvenation incurs immediate overhead in terms of some services being temporarily unavailable, the idea is to prevent more lengthy unexpected failures from occurring in the future. Cluster computing also provides a similar fault tolerance by using planned outages. When we detect a failure in one computer in a cluster, we can "fail over" the executing process to another computer within the cluster. Similar to rejuvenation, computers can be removed from the cluster as if under failure, serviced, and upgraded, and then restored back to the cluster.5 This ability to handle unexpected failures and scheduled maintenance makes clusters the only information systems that can attain 100 percent availability.
The critical factor in making scheduled downtime preferable to unscheduled downtime is determining how often a system must be rejuvenated. If unexpected failures are catastrophic, then a more aggressive rejuvenation schedule might be justified in terms of cost and availability. If unexpected failures are equivalent to scheduled downtime in terms of cost and availability, then a reactive approach is more appropriate. Currently, the two techniques used to determine an optimal rejuvenation schedule are a measurement-based technique (which estimates rejuvenation timing based on system resource metrics) and a modeling-based technique (which uses mathematical models and simulation to estimate rejuvenation timing based on predicted performance).7 IBM is pioneering software rejuvenation technology, in conjunction with Duke University, and products are beginning to appear. Software rejuvenation has been incorporated in IBM's Netfinity Director for Microsoft Windows-based IBM and non-IBM servers, desktops, workstations, and notebook systems, and an extension has been created for Microsoft Cluster Service.
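As a toy illustration of the measurement-based idea (our sketch, not a description of the IBM or Duke tools), a monitor can watch a resource metric that degrades with aging (free memory, say) and trigger a rejuvenating restart during a maintenance window once the metric crosses a threshold.

def rejuvenation_check(get_free_memory_mb, restart_service,
                       threshold_mb=200, in_maintenance_window=lambda: True):
    """Trigger a rejuvenation restart when free memory falls below the threshold.
    get_free_memory_mb and restart_service are hooks supplied by the operator."""
    if get_free_memory_mb() < threshold_mb and in_maintenance_window():
        restart_service()          # stop, clean internal state, restart
        return True
    return False

# Example wiring (if the psutil package is available; my_restart_hook is hypothetical):
#   import psutil
#   rejuvenation_check(lambda: psutil.virtual_memory().available // 2**20,
#                      restart_service=my_restart_hook)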
N-Version Programming
NVP, first proposed by Algirdas Avizienis in 1977, refers to multiple (N > 2) functionally equivalent program versions based on the same specification.1 To provide fault tolerance, each version must employ design diversity (different algorithms and programming languages) to maximize the probability that any error results are distinguishable. This is based on the conjecture that the probability of a random, independent fault producing the same error results in two or more versions is less when the versions are diverse. A consistent set of inputs is supplied to all N versions and all N versions are executed in parallel. Similar to majority-voting hardware units, a consensus software decision mechanism then examines the results from all N versions to determine the accurate result and mask error results. NVP is increasingly feasible because asymmetrical multiprocessing now allows different processors running different operating systems for applications requiring reliability. Research continues into the feasibility of NVP for different problems and whether the NVP assumption of independent failures from functionally equivalent but independently developed versions holds (or whether failures remain correlated).

Reference
1. A. Avizienis, "The Methodology of N-Version Programming," Software Fault Tolerance, M.R. Lyu, ed., John Wiley & Sons, New York, 1995, pp. 23–46.
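A minimal voter illustrates the consensus decision mechanism the sidebar describes (N = 3 here; the three square routines stand in for independently developed versions, and the error handling is our own simplification).

from collections import Counter

def n_version_execute(versions, args):
    """Run all versions on the same inputs and return the majority result,
    masking a disagreeing (presumably faulty) version."""
    results = []
    for run in versions:
        try:
            results.append(run(*args))
        except Exception:
            results.append("<version failed>")    # a crashed version simply loses its vote
    value, count = Counter(results).most_common(1)[0]
    if count >= len(versions) // 2 + 1:
        return value
    raise RuntimeError("no majority: the versions disagree")

# Three functionally equivalent "versions" of squaring:
print(n_version_execute([lambda x: x * x, lambda x: x ** 2, lambda x: pow(x, 2)], (7,)))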
Software reconfiguration
In contrast to proactive rejuvenation, fault-tolerance techniques have traditionally been reactive. The reactive approach to achieving fault-tolerant software is to reconfigure the system after detecting a failure—and redundancy is the primary tool used. For hardware, the reconfiguration approach to providing fault tolerance is redundancy in terms of backup processors, power supplies, disk drives, and circuits. For software, the reconfiguration approach uses redundancy in three different dimensions:
■ software redundancy expressed as independently written programs performing the same task executing in parallel (see the "N-Version Programming" sidebar) and comparing outputs;
■ time redundancy expressed as repetitively executing the same program to check consistent outputs; and
■ information redundancy expressed as redundancy bits that help detect and correct errors in messages and outputs.

Redundancy in these three dimensions provides flexible and efficient recovery, independent of knowledge about the underlying failure (such as fault identification or causal events). While robust software can be built with enough redundancy to handle almost any arbitrary failure, the challenge is to provide fault tolerance by minimizing redundancy—which reduces cost and complexity.8 However, reactive techniques don't have to mean that a system must crash before it can be gracefully recovered. Software reconfiguration can use redundant resources for real-time recovery while dynamically considering a large number of factors (operating system services, processor load, and memory variables among others)—a human-in-the-loop might not be necessary.9
To PC users, however, reactive reconfiguration means recovery after a system crash. When your PC freezes, reconfiguration can take place by listing executing processes (<Ctrl><Alt><Delete>) and attempting to identify and terminate the process responsible for the problem, often using a trial-and-error approach. If something catastrophic has occurred, a reboot from tape backup or original system disks might be necessary. Realistically, most users do not keep current backups, so a market for automatic software reconfiguration products has taken off (see the "Software Reconfiguration Products" sidebar).
Reconfiguration techniques have been pioneered for fault-tolerant networks and include8,10–12
■ preplanned reconfiguration with disjoint working and backup circuits, such that after detecting a failure in a working circuit, the traffic can be automatically rerouted to its dedicated backup circuit;
■ dynamic reconfiguration, such that after detecting a failure, signaling messages search the network for spare circuits on which to reroute traffic;
■ multilayer reconfiguration, in which recovery from a failure at one layer might take place at higher layers either independently or in coordination with each layer having different characteristics; and
■ priority reconfiguration, which might involve re-optimizing an entire network to reconnect disrupted high-priority circuits over currently established lower-priority circuits.
Each of these reconfiguration techniques requires the provisioning of spare resources for redundancy that can be used when a failure occurs. The redundant resources can be dedicated, to guarantee reconfiguration, or shared, in which case recovery might not be possible (spare resources might not be available at the time of a fault). On the other hand, sharing redundant resources is more efficient in environments of low fault probability or when reconfiguration need not be guaranteed. Reconfiguration can be provided at different layers and implemented with different algorithms at each layer.13 In fact, if all restoration mechanisms are similar at each layer, there is increased whole system vulnerability.2 For example, if all layers used preplanned mechanisms, then each layer—and the system as a whole—will not be able to handle unexpected fault events. If all layers used real-time search algorithms, then system behavior would be hard to predict. Instead, it is better to use complementary reconfiguration algorithms at different layers and draw on the benefits of each. For example, a preplanned algorithm at a lower layer (for speed), followed by a real-time search mechanism at a higher layer, can handle unexpected faults that lower layers have been unable to handle. In general, reconfiguration of successfully executing software for recovery from a failure in another part of a system should only be performed if it can be accomplished transparently such that it is imperceptible to users. However, there are cases when high-priority software fails and requires resources for recovery. In this scenario, lower-priority software should be delayed or terminated and its resources reassigned to aid this recovery.9 Intentional system degradation to maintain essential processing is the most extreme type of reconfiguration.
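To make the preplanned/dynamic contrast concrete, here is a small sketch (illustrative data structures, not a description of any product or protocol in the references): the preplanned scheme recovers with a table lookup into dedicated spares, while the dynamic scheme searches whatever topology survives the failure.

from collections import deque

def preplanned_reconfigure(circuit, backups):
    """Dedicated, disjoint backup per working circuit: recovery is a fast lookup."""
    return backups.get(circuit)          # None means no spare was provisioned

def dynamic_reconfigure(graph, src, dst, failed_links):
    """Breadth-first search of the surviving topology for any spare route."""
    down = {frozenset(link) for link in failed_links}
    parent, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return list(reversed(path))
        for v in graph.get(u, []):
            if frozenset((u, v)) not in down and v not in parent:
                parent[v] = u
                frontier.append(v)
    return None                          # shared spares exhausted: recovery not guaranteed

The trade-off mirrors the text: the lookup is fast but ties up dedicated spare capacity, whereas the search shares spares across failures at the cost of slower, unguaranteed recovery.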
Software Reconfiguration Products
Given the maturation of software reconfiguration techniques, products have begun to appear, particularly in the PC operating system market. These products do not protect from hardware failures, such as CPU malfunction or disk failure, but they can be useful tools against buggy software and human error. In general, these products track software changes (system, application, data file, and registry setting), use a hard disk to make redundant copies, and let the user restore (reconfigure) a system to a previous "snapshot." Note that there are trade-offs for providing this reconfiguration capability versus system performance and hard disk space requirements. For more information, here is a representative sampling of current products:
■ ConfigSafe v 4.0, by imagine LAN (www.configsafe.com),
■ GoBack v 2.21, by Roxio (www.roxio.com),
■ Rewind, by Power On Software (www.poweronsoftware.com), and
■ System Restore Utility, included in Microsoft Windows ME (www.microsoft.com).
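The core mechanism behind such products can be pictured with a few lines (a toy sketch in the spirit of the list above, not how any of these tools actually work): copy the state before a risky change, and restore the copy when the change turns out to be the problem.

import copy

class SnapshotStore:
    """Toy snapshot/rollback store: redundant copies of state kept for later restore."""
    def __init__(self):
        self._snapshots = []

    def take(self, state):
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1          # snapshot id

    def restore(self, snapshot_id):
        return copy.deepcopy(self._snapshots[snapshot_id])

store = SnapshotStore()
config = {"autostart": ["svc_a"], "version": 1}
sid = store.take(config)                         # snapshot before the change
config["autostart"].append("buggy_update")       # the change we later regret
config = store.restore(sid)                      # reconfigure back to the snapshot
print(config)                                    # {'autostart': ['svc_a'], 'version': 1}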
Figure 1. A model depicting the complementary nature of rejuvenation and reconfiguration.7 (States shown: operational state, failure-probable state, and failed state, with transitions labeled aging, rejuvenation, bug manifestation, and reconfiguration.)
Fault-tolerant software requires a whole system approach.2,9 We have attempted to outline the use of contrasting proactive and reactive approaches to achieve fault-tolerant software, but it's not easy to say which approach is better. Both approaches are nonexclusive and complementary, such that they work well together in an integrated system (see Figure 1). The high cost of redundancy required for reactive reconfiguration suggests it is better suited for software in which a rejuvenation schedule appears unrealistic due to imminent faults or where an outage's effect could be catastrophic. Proactive rejuvenation is the preferred solution when faults can be efficiently avoided using a realistic rejuvenation schedule or where the risk an outage presents is low. Because both approaches are relatively new and under study, we direct readers to our references for more details.
An analogy can be made between these fault-tolerant software approaches and CPU communications in a computer system. Reactive reconfiguration is equivalent to event-driven interrupts, and proactive rejuvenation is equivalent to polling resources. Preventing failures before they occur might be the best approach when finding all software bugs is possible, just as polling is preferable when CPU communication can be anticipated. However, when finding all bugs is improbable (or maybe testing is not even attempted), then having the flexibility to react to multipriority interrupts with robust service-handling routines might be the critical last line of defense against software faults.
Acknowledgments
The authors thank Katerina Goseva-Popstojanova, Duke University, who provided an outstanding introduction to the concept of rejuvenation; David Tipper, University of Pittsburgh, and Deep Medhi, University of Missouri–Kansas City, for making significant contributions to the field of fault-tolerant networking using reconfiguration; and Kishor S. Trivedi, Duke University, for making seminal contributions in the development of software rejuvenation. Lastly, we thank past reviewers for their specific feedback that has significantly improved this article.

References
1. D. Milojicic, "Fred B. Schneider on Distributed Computing," IEEE Distributed Systems Online, vol. 1, no. 1, 2000, http://computer.org/dsonline/archives/ds100/ds1intprint.htm (current 11 June 2001).
2. W. Yurcik, D. Doss, and H. Kruse, "Survivability-Over-Security: Providing Whole System Assurance," IEEE/SEI/CERT Information Survivability Workshop (ISW), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 201–204.
3. Y. Huang et al., "Software Rejuvenation: Analysis, Module and Applications," 25th IEEE Int'l Symp. Fault Tolerant Computing, 1995, pp. 381–390.
4. K.S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, "Modeling and Analysis of Software Aging and Rejuvenation," 33rd Ann. Simulation Symp., Soc. Computer Simulation Int'l Press, 2000, pp. 270–279.
5. K. Vaidyanathan et al., "Analysis of Software Rejuvenation in Cluster Systems," Fast Abstracts, Pacific Rim Int'l Symp. Dependable Computing (PRDC), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 3–4.
6. S. Garg et al., "A Methodology for Detection and Estimation of Software Aging," Ninth Int'l Symp. Software Reliability Eng., 1998, pp. 283–292.
7. T. Dohi, K. Goseva-Popstojanova, and K.S. Trivedi, "Statistical Non-Parametric Algorithms to Estimate the Optimal Rejuvenation Schedule," Pacific Rim Int'l Symp. Dependable Computing (PRDC), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 77–84.
8. W. Yurcik and D. Tipper, "Survivable ATM Group Communications: Issues and Techniques," Eighth Int'l Conf. Telecomm. Systems, Vanderbilt Univ./Owen Graduate School of Management, Nashville, Tenn., 2000, pp. 518–537.
9. D. Wells et al., "Software Survivability," Proc. DARPA Information Survivability Conf. and Exposition (DISCEX), IEEE Computer Soc. Press, Los Alamitos, Calif., vol. 2, Jan. 2000, pp. 241–255.
10. D. Medhi, "Network Reliability and Fault Tolerance," Wiley Encyclopedia of Electrical and Electronics Engineering, John Wiley, New York, 1999.
11. D. Medhi and D. Tipper, "Multi-Layered Network Survivability—Models, Analysis, Architecture, Framework and Implementation: An Overview," Proc. DARPA Information Survivability Conf. and Exposition (DISCEX), IEEE Computer Soc. Press, Los Alamitos, Calif., vol. 1, Jan. 2000, pp. 173–186.
12. D. Medhi and D. Tipper, "Towards Fault Recovery and Management in Communications Networks," J. Network and Systems Management, vol. 5, no. 2, June 1997, pp. 101–104.
13. D. Johnson, "Survivability Strategies for Broadband Networks," IEEE Globecom, 1996, pp. 452–456.
About the Authors
William Yurcik is an assistant professor in the Department of Applied Computer Science at Illinois State University. Prior to his academic career, he worked for organizations such as the Naval Research Laboratory, MITRE, and NASA. Contact him at [email protected].
David Doss is an associate professor and the graduate program coordinator in the Department of Applied Computer Science at Illinois State University. He is also a retired Lt. Commander US Navy (SSN). Contact him at [email protected].
roundtable
Fault Tolerance
Jeffrey Voas

This is a "virtual" roundtable discussion between five respected experts from the fault-tolerant computing field: Joanne Bechta Dugan (UVA), Les Hatton (Oakwood Computing), Karama Kanoun and Jean-Claude Laprie (LAAS-CNRS), and Mladen Vouk (NC State University). The reason for including this piece is to provide a mini tutorial for readers who are unfamiliar with the field of software fault tolerance or who might have had more exposure to hardware fault tolerance. Eleven questions were posed to each member of the panel over email, and here we present their responses. —Jeffrey Voas

Participants
Joanne Bechta Dugan is a professor of electrical and computer engineering at the University of Virginia. Her research interests include hardware and software reliability engineering, fault-tolerant computing, and mathematical modeling using dynamic fault trees, Markov models, Petri nets, and simulation. Contact her at [email protected].
Les Hatton's biography appears on page 39.
Karama Kanoun's biography appears on page 33.
Jean-Claude Laprie is a research director at CNRS, the French National Organization for Scientific Research. He is Director of LAAS-CNRS, where he founded and previously directed the research group on Fault Tolerance and Dependable Computing. He also founded and previously directed the Laboratory for Dependability Engineering, a joint academia–industry laboratory. His research interests have focused on fault tolerance, dependability evaluation, and the formulation of the basic concepts of dependability. Contact him at [email protected].
Mladen A. Vouk is a professor of computer science at the N.C. State University, Raleigh, North Carolina. His research and development interests include software engineering, scientific computing, computer-based education, and high-speed networks. He received his PhD from the King's College, University of London, UK. He is an IEEE Fellow and a member of the IEEE Reliability, Communications, Computer and Education Societies, and of the IEEE TC on Software Engineering, ACM, ASQC, and Sigma Xi. Contact him at [email protected].
What are the basic principles of building fault-tolerant systems?
Joanne Bechta Dugan: To design and build a fault-tolerant system, you must understand how the system should work, how it might fail, and what kinds of errors can occur. Error detection is an essential component of fault tolerance. That is, if you know an error has occurred, you might be able to tolerate it—by replacing the offending component, using an alternative means of computation, or raising an exception. However, you want to avoid adding unnecessary complexity to enable fault tolerance because that complexity could result in a less reliable system.
Les Hatton: The basic principles depend, to a certain extent, on whether you're designing hardware or software. Classic techniques such as independence work in different ways and with differing success for hardware as compared to software.
Karama Kanoun and Jean-Claude Laprie: Fault tolerance is generally implemented by error detection and subsequent system recovery. Recovery consists of error handling (to eliminate errors from the system state) and fault handling (to prevent located faults from being activated again). Fault handling involves four steps: fault diagnosis, fault isolation, system reconfiguration, and system reinitialization. Using sufficient redundancy might allow recovery without explicit error detection. This form of recovery is called fault masking. Fault tolerance can also be implemented preemptively and preventively—for example, in the so-called software rejuvenation, aimed at preventing accrued error conditions from leading to failure.
Mladen Vouk: A principal way of introducing fault tolerance into a system is to provide a method to dynamically determine if the system is behaving as it should—that is, you introduce a self-checking or "oracle" capability. If the method detects unexpected and unwanted behavior, a fault-tolerant system must provide the means to recover or continue operation (preferably, from the user's perspective, in a seamless manner).

What is the difference between hardware and software fault tolerance?
Joanne: The real question is what's the difference between design faults (usually software) and physical faults (usually hardware). We can tolerate physical faults in redundant (spare) copies of a component that are identical to the original, but we can't generally tolerate design faults in this way because the error is likely to recur on the spare component if it is identical to the original. However, the distinction between design and physical faults is not so easily drawn. A large class of errors
arise from design faults that do not recur because the system’s state might slightly differ when the computation is retried. Les: From a design point of view, the basic principles of thinking about system behavior as a whole are the same, but again, we need to consider what we’re designing—software or hardware. I suspect the major difference is cost. Software fault tolerance is usually a lot more expensive. Karama and Jean-Claude: A widely used approach to fault tolerance is to perform multiple computations in multiple channels, either sequentially or concurrently. To tolerate hardware physical faults, the channels might be identical, based on the assumption that hardware components fail independently. Such an approach has proven to be adequate for soft faults— that is, software or hardware faults whose activation is not systematically reproducible—through rollback. Rollback involves returning the system back to a saved state (a checkpoint) that existed prior to error detection. Tolerating solid design faults (hardware or software) necessitates that the channels implement the same function through separate designs and implementations—that is, through design diversity. Mladen: Hardware fault tolerance, for the most part, deals with random failures that result from hardware defects occurring during system operation. Software fault tolerance contends with random (and sometimes not-sorandom) invocations of software paths (or path and state or environment combinations) that usually activate software design and implementation defects. These defects can then lead to system failures. What key technologies make software fault-tolerant? Joanne: Software involves a system’s conceptual model, which is easier than a physical model to engineer to test for things that violate basic concepts. To the extent that a software system can evaluate its own performance and correctness, it can be made fault-tolerant—or at least error-
aware. What I mean is that, to the extent a software system can check its responses before activating any physical components, a mechanism for improving error detection, fault tolerance, and safety exists. Think of the adage, “Measure twice, cut once.” The software analogy might be, “Compute twice, activate once.” Les: I’m not sure there is a key technology. The behavior of the system as a whole and the hardware and software interaction must be thought through very carefully. Efforts to support software fault tolerance, such as exception handling in languages, are rather crude, and education on how to use them is not well established in universities. Karama and Jean-Claude: We can use three key technologies—design diversity, checkpointing, and exception handling—for software fault tolerance, depending on whether the current task should be continued or can be lost while avoiding error propagation (ensuring error containment and thus avoiding total system failure). Tolerating solid software faults for task continuity requires diversity, while checkpointing tolerates soft software faults for task continuity. Exception handling avoids system failure at the expense of current task loss. Mladen: Runtime failure detection is often accomplished through either an acceptance test or comparison of results from a combination of “different” but functionally equivalent system alternates, components, versions, or variants. However, other techniques—ranging from mathematical consistency checking to error coding to data diversity—are also useful. There are many options for effective system recovery after a problem has been detected. They range from complete rejuvenation (for example, stopping with a full data and software reload and then restarting) to dynamic forward error correction to partial state rollback and restart. What is the relationship between software fault tolerance and software safety? Joanne: Both require good error
detection, but the response to errors is what differentiates the two approaches. Fault tolerance implies that the software system can recover from —or in some way tolerate—the error and continue correct operation. Safety implies that the system either continues correct operation or fails in a safe manner. A safe failure is an inability to tolerate the fault. So, we can have low fault tolerance and high safety by safely shutting down a system in response to every detected error. Les: It’s certainly not a simple relationship. Software fault tolerance is related to reliability, and a system can certainly be reliable and unsafe or unreliable and safe as well as the more usual combinations. Safety is intimately associated with the system’s capacity to do harm. Fault tolerance is a very different property. Karama and Jean-Claude: Fault tolerance is—together with fault prevention, fault removal, and fault forecasting—a means for ensuring that the system function is implemented so that the dependability attributes, which include safety and availability, satisfy the users’ expectations and requirements. Safety involves the notion of controlled failures: if the system fails, the failure should have no catastrophic consequence—that is, the system should be fail-safe. Controlling failures always include some forms of fault tolerance—from error detection and halting to complete system recovery after component failure. The system function and environment dictate, through the requirements in terms of service continuity, the extent of fault tolerance required. Mladen: You can have a safe system that has little fault tolerance in it. When the system specifications properly and adequately define safety, then a well-designed fault-tolerant system will also be safe. However, you can also have a system that is highly faulttolerant but that can fail in an unsafe way. Hence, fault tolerance and safety are not synonymous. Safety is concerned with failures (of any nature) that can harm the user; fault tolerance is primarily concerned with runtime prevention of failures in any shape or July/August 2001
What are the four most seminal papers or books on this topic?

Joanne: Important references on my desk include Michael Lyu’s Software Fault Tolerance (John Wiley, 1995), Nancy Leveson’s Safeware: System Safety and Computers—especially for the case studies in the appendix (Addison-Wesley, 1995), Debra Herrmann’s Software Safety and Reliability (IEEE Press, 2000), and Henry Petroski’s To Engineer Is Human: The Role of Failure in Successful Design (Vintage Books, 1992).

Les: The literature is pretty packed, although some of Nancy Leveson’s contributions are right up there. Beyond that, I usually read books on failure tolerance and safety in conventional engineering.

Karama and Jean-Claude: B. Randell, “System Structure for Software Fault Tolerance,” IEEE Trans. Software Eng., vol. SE-1, no. 2, 1975, pp. 220–232. A. Avizienis and L. Chen, “On the Implementation of N-Version Programming for Software Fault Tolerance During Execution,” Proc. IEEE COMPSAC 77, 1977, pp. 149–155. Software Fault Tolerance, M.R. Lyu, ed., John Wiley, 1995. J.C. Laprie et al., “Definition and Analysis of Hardware- and Software-Fault-Tolerant Architectures,” Computer, vol. 23, no. 7, July 1990, pp. 39–51.

Mladen: I agree that M. Lyu’s Software Fault Tolerance and B. Randell’s “System Structure for Software Fault Tolerance” are important. Also, A. Avizienis, “The N-Version Approach to Fault-Tolerant Software,” IEEE Trans. Software Eng., vol. SE-11, no. 12, 1985, pp. 1491–1501, and J.C. Laprie et al., “Definition and Analysis of Hardware- and Software-Fault-Tolerant Architectures,” Computer, vol. 23, no. 7, July 1990, pp. 39–51. (Reprinted in Fault-Tolerant Software Systems: Techniques and Applications, Hoang Pham, ed., IEEE Computer Soc. Press, 1992, pp. 5–17.)
Much research has been done in this area, but it seems little has made it into practice. Is this true?

Joanne: I think it’s a psychological issue. In general, I don’t think that software engineers are trained to consider failures as well as other engineers. Software designers like to think about neat ways to make a system work—they aren’t trained to think about errors and failures. Remember that most software faults are design faults, traceable to human error. If I have a finite amount of resources to expend on a project, it’s hard to make a case for fault tolerance (which usually implies redundancy) rather than just spending more time to get the one version right.

Les: Yes, I agree that little has made it into practice. Most industries are in too much of a rush to do it properly—if at all—with reduced time-to-market so prevalent, and software engineers are not normally trained to do it well. There is also a widespread and wholly mistaken belief that software is either right or wrong, and testing proves that it is “right.”

Karama and Jean-Claude: This could be true for design diversity, as it has been used mainly for safety-critical systems. Exception handling is used in most systems: telecommunications, commercial servers, and so forth.

Mladen: I think we’ve made a lot of progress—we now know how to make systems that are quite fault-tolerant. Most of the time, it is not an issue of technology (although many research issues exist) but an issue of cost and schedules. Economic and business models dictate cost-effective and timely solutions. Available fault-tolerant technology, depending on the level of confidence you desire, might offer unacceptably costly or time-consuming solutions. However, most of our daily critical reliability expectations are being met by fault-tolerant systems—for example, 911 services and aircraft flight systems—so it is possible to strike a successful balance between economic and technical models.
How can a company deciding whether to add fault tolerance to a system determine whether the return on investment is sufficient for the additional costs?

Joanne: I think it all hinges on the cost of failure. The more expensive the failure (in terms of actual cost, reputation, human life, environmental issues, and other less tangible measures), the more it’s worth the effort to prevent it.

Les: This is very difficult. Determining the cost of software failure in actuarial terms, given the basically chaotic nature of such failure, is just about impossible at the current level of knowledge—the variance on the estimates is usually ridiculous. ROI must include a knowledge of risk. Risk is when you are not sure what will happen but you know the odds. Uncertainty is when you don’t know either. We generally have uncertainty. On the other hand, I am inclined to believe that the cost of failure in consumer embedded systems is so high that all such systems should incorporate fault-tolerant techniques.

Karama and Jean-Claude: It is misleading to consider ROI as the only criterion for deciding whether to add fault tolerance. For some application domains, the cost of a system outage is a more important driver than the additional development cost due to fault tolerance. Indeed, the cost of system outage is usually the determining factor, because it includes all sources of loss of profit and income to the company.

Mladen: There are several fault-tolerance cost models we can use to answer this question. The first step is to create a risk-based operational profile for the system and decide which system elements need protection. Risk analysis will yield the problem-occurrence and cost-to-benefit figures, which we can then use to make appropriate decisions.
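As a rough illustration of the cost-to-benefit comparison described above—the figures below are invented, and, as Les notes, real estimates carry large uncertainty—the expected loss avoided by fault tolerance can be weighed against its annual cost:

```python
# Hypothetical figures for illustration only.
failures_per_year_without_ft = 4          # estimated from a risk-based operational profile
failures_per_year_with_ft = 0.5           # residual failures after adding fault tolerance
cost_per_failure = 250_000                # outage cost: lost income, reputation, penalties
annual_ft_cost = 400_000                  # added development, test, and maintenance cost

expected_loss_avoided = (failures_per_year_without_ft
                         - failures_per_year_with_ft) * cost_per_failure
net_benefit = expected_loss_avoided - annual_ft_cost
print(f"expected loss avoided: {expected_loss_avoided:,.0f}")   # 875,000
print(f"net annual benefit:    {net_benefit:,.0f}")             # 475,000
```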
What key standards, government agencies, or standards bodies require fault-tolerant systems?

Joanne: I’d be surprised if there are any standards that require fault tolerance (at least for design faults) per se. I would expect that correct operation is required, which might imply fault tolerance as an internal means to achieve correct operation.

Les: Quite a few standards have something to say about this—for example, IEC 61508. However, IEC 61508 also has a lot to say about a lot of other things. I do not find the proliferation of complex standards particularly helpful, although they are usually well meaning. There is too much opinion and not enough experimentation.

Karama and Jean-Claude: Several standards for safety-critical applications recommend fault tolerance—for hardware as well as for software. For example, the IEC 61508 standard (which is generic and application-sector independent) recommends, among other techniques, “failure assertion programming, safety bag technique, diverse programming, backward and forward recovery.” Also, the Defense standard (MOD 00-55), the avionics standard (DO-178B), and the standard for space projects (ECSS-Q-40A) list design diversity as a possible means for improving safety.

Mladen: Usually, the requirement is not so much for fault tolerance (by itself) as it is for high availability, reliability, and safety. Hence, IEEE, FAA, FCC, DOE, and other standards and regulations appropriate for reliable computer-based systems apply. We can achieve high availability, reliability, and safety in different ways. They involve a proper reliable and safe design, proper safeguards, and proper implementation. Fault tolerance is just one of the techniques that assure that a system’s quality of service (in a broader sense) meets user needs (such as high safety).

How do you demonstrate that fault tolerance is achieved?

Joanne: This is a difficult question to answer. If we can precisely describe the classes of faults we must tolerate, then a collection of techniques to demonstrate fault tolerance (including simulation, modeling, testing, fault injection, formal analysis, and so forth) can be used to build a credible case for fault tolerance.
Les: I honestly don’t know. Demonstrating that something has been achieved in classic terms means comparing the behavior with and without the additional tolerance-inducing techniques. We don’t do experiments like this in software engineering, and our prediction systems are often very crude. There are some things you can do, but comparing software technologies in general is about as easy as comparing supermarkets.

Karama and Jean-Claude: As for any system, analysis and testing are mandatory. However, fault injection constitutes a very efficient technique for testing fault-tolerance mechanisms: well-selected sets of faults are injected into the system, and the reaction of the fault-tolerance mechanisms is observed. They can thus be improved. The main problem remains the representativeness of the injected faults. For design diversity, the implementation of the specifications by different teams has proven to be efficient for detecting specification faults. Back-to-back testing has also proven to be efficient in detecting some difficult faults that no other method can detect.

Mladen: You know you’ve achieved fault tolerance through a combination of good requirements specifications; realistic quantitative availability, reliability, and safety parameters; thorough design analysis; formal methods; and actual testing.
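A minimal sketch of the fault-injection idea described above—illustrative only, with invented names: wrap a component so that a fraction of calls fail, then check whether the surrounding mechanism masks the injected faults.

```python
import random

def inject_faults(component, fault_rate, seed=0):
    """Wrap `component` so that a chosen fraction of calls raises an error,
    letting us observe how the surrounding fault-tolerance mechanism reacts."""
    rng = random.Random(seed)
    def faulty(*args, **kwargs):
        if rng.random() < fault_rate:
            raise RuntimeError("injected fault")   # representativeness is the hard part
        return component(*args, **kwargs)
    return faulty

# Exercise a hypothetical mechanism that is supposed to mask single failures.
def tolerant_call(primary, backup, x):
    try:
        return primary(x)
    except RuntimeError:
        return backup(x)                            # fall back on a diverse alternate

flaky_sqrt = inject_faults(lambda x: x ** 0.5, fault_rate=0.3)
results = [tolerant_call(flaky_sqrt, lambda x: x ** 0.5, 9.0) for _ in range(1000)]
print(all(abs(r - 3.0) < 1e-9 for r in results))    # True: injected faults were masked
```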
What well-known projects contain fault-tolerant software?

Les: Certainly some avionics systems and some automobile systems, but the techniques are not usually widely publicized, so I’m not sure that many would be well known. Ariane 5 is a good, well-documented example of a system with hardware tolerance but no software tolerance.

Karama and Jean-Claude: The Airbus A-320 and its successors use design diversity for software fault tolerance in the flight control system. The Boeing B777 uses hardware diversity and compiler diversity. The Elektra Austrian railway signaling system uses design diversity for hardware and software fault tolerance. Tandem ServerNet and its predecessors (NonStop) tolerate soft software faults through checkpointing.

Mladen: Well-known projects include the French train systems, Boeing and Airbus fly-by-wire aircraft, military fly-by-wire, and NASA space shuttles. Several proceedings and books, sponsored by IFIP WG 10.4 and the IEEE CS Technical Committee on Fault-Tolerant Computing, describe older fault-tolerance projects (see www.dependability.org).

Is much progress being made in fault-tolerant software?

Les: I don’t think there is much new in the theory. Endless and largely pointless technological turnover in software engineering just makes it harder to do the right things in practice. With a new operating system environment emerging about every two years on average, a different programming language emerging about every three years, and a new paradigm about every five years, it’s a wonder we get anywhere at all.

Karama and Jean-Claude: To the best of our knowledge, the main concepts and means for software fault tolerance have been defined for some years. However, some refinements of these concepts are still being made and implemented in practice, mainly for distributed systems such as fault-tolerant communication protocols.

Mladen: The basic ideas of how to provide fault tolerance have been known for decades. Specifics on how to do that cost-effectively in modern systems with current and future technologies (wireless, personal devices, nanodevices, and so forth) require considerable research as well as ingenious engineering solutions. For example, our current expectations for 911 services require telephone switch availabilities in the range of 5 to 6 nines (0.999999). These expectations are likely to start extending to more mundane things such as personal computing and networking applications, car automation (such as GIS locators), and, perhaps, space tourism. Almost none of these would measure up today, and many will require new fault-tolerance approaches that are far from standard textbook solutions.
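For readers unfamiliar with “nines,” a quick conversion (my arithmetic, not the panelists’): five nines of availability allow roughly five minutes of downtime per year, and six nines roughly half a minute.

```python
minutes_per_year = 365.25 * 24 * 60
for nines in (3, 4, 5, 6):
    availability = 1 - 10 ** -nines
    downtime_minutes = minutes_per_year * (1 - availability)
    print(f"{nines} nines ({availability:.6f}): {downtime_minutes:8.2f} min of downtime per year")
# 5 nines allow about 5.3 minutes per year; 6 nines about 0.5 minutes (roughly 32 seconds).
```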
feature requirements
Requirements Engineering as a Success Factor in Software Projects
Hubert F. Hofmann, General Motors
Franz Lehner, University of Regensburg
Deficient requirements are the single biggest cause of software project failure. From studying several hundred organizations, Capers Jones discovered that RE is deficient in more than 75 percent of all enterprises.1 In other words, getting requirements right might be the single most important and difficult part of a software project. Despite its importance, we know surprisingly little about the actual process of specifying software. “The RE Process” sidebar provides a basic description.
Based on their field study of 15 requirements engineering teams, the authors identify the RE practices that clearly contribute to project success, particularly in terms of team knowledge, resource allocation, and process.
Most RE research is conceptual and concentrates on methods or techniques, primarily supporting a single activity. Moreover, the rare field studies we actually have do not establish a link between RE practices and performance. We therefore conducted this study to identify the RE practices that clearly contribute to software project success.

Stakeholders and teams

Stakeholders are individuals and organizations that are actively involved in a software project or whose interests the project affects. Stakeholders of any computer system can include customers, users, project managers, analysts, developers, senior management, and quality assurance staff. Table 1 illustrates the wide range of expertise and motivations that stakeholders typically exhibit.2 A typical software project team consists of a project manager, analysts, developers, and quality assurance personnel.
Often it includes users or their representatives. In the case of commercial off-the-shelf (COTS) software, marketers such as sales representatives and account managers tend to substitute for users and customers.

Field study

Seven field studies have reported on RE in practice.3–9 Unfortunately, these rare studies have not established a clear link to performance and tend to focus on a narrow set of variables. Our study provides a more integrated view of RE by investigating team knowledge, allocated resources, and deployed RE processes (see Figure 1) and their contribution to project success. In addition, we incorporate the observations of previous field studies. Fifteen RE teams, including six COTS and nine customized application development projects in nine software companies and development organizations in the telecommunications and banking industries, participated in our study.
The RE Process

Requirements engineering denotes both the process of specifying requirements by studying stakeholder needs and the process of systematically analyzing and refining those specifications.1 A specification, the primary result of RE, is a concise statement of the requirements that the software must satisfy—that is, of the conditions or capabilities that a user needs to achieve an objective or that a system must possess to satisfy a contract or standard.2 Ideally, a specification enables stakeholders to quickly learn about the software and developers to understand exactly what the stakeholders want. Despite heterogeneous terminology throughout the literature, RE must include four separate but related activities: elicitation, modeling, validation, and verification. In practice, they will most likely vary in timing and intensity for different projects. Typically, we first elicit requirements from whatever sources are available (experts, repositories, or the current software) and then model them to specify a solution. Eliciting and modeling requirements are interrelated. Modeling describes a perceived solution in the context of an application domain using informal, semiformal, or formal notations. The gradual normalization of such models in terms of the requirements leads to a satisfactory candidate specification, which then must be validated and verified. This gives stakeholders feedback on the interpretation of their requirements so they can correct misunderstandings as early as possible.
Elicitation

Elicitation is often treated as a simple matter of interviewing users or analyzing documents, but several other elicitation methods are available. Some emphasize group sessions in the form of focus groups or workshops; others are employed primarily to elicit requirements for specific types of systems. For example, developers frequently use repertory grids, sorts, and laddering methods in specifying knowledge-based systems. Elicitation also includes those activities that explore how software can meet organizational goals, what alternatives might exist, and how they affect various stakeholders.
Modeling

Experts have proposed many modeling methods and specification languages to make requirements precise and consistent. Traditionally these methods have separated the data, functional, and behavioral aspects of requirements and specified software by creating one or more distinct models. Prototypes, for instance, attempt to create an operational model that stakeholders can directly experience. Paul Ward and Stephen Mellor, followed by many others, proposed extensions to basic models. Most of these extensions focus on modeling real-time systems. Developments in OO programming and design have introduced a more integrated approach to modeling requirements. In addition, advanced modeling methods attempt to establish a closer link between models and “the customer’s voice,” stakeholders’ viewpoints, and business goals.
Validation and verification

The purpose of validating requirements is to certify that they meet the stakeholders’ intentions: Am I specifying the right software? In other words, validation examines a work product (for example, a specification) to determine conformity with stakeholder needs. Verification, on the other hand, determines whether a work product conforms to the allocated requirements: Am I specifying the software correctly? That is, it checks a specification for internal consistency through mathematical proofs or inspection techniques. An important point in validating and verifying requirements is prioritizing them. By addressing high-priority requirements before considering low-priority ones, you can significantly reduce project costs and duration. Moreover, throughout RE you should revisit the priorities assigned, for example, during elicitation to ensure that they continue to adequately reflect the stakeholders’ needs. This highlights the recurrent nature of requirements validation and verification. Methods for validating and verifying requirements are relatively scarce. Peer reviews, inspections, walk-throughs, and scenarios figure most prominently. Moreover, the recording of decisions and their rationales is quite useful.
References
1. H.F. Hofmann, Requirements Engineering: A Situated Discovery Process, Gabler, Wiesbaden, Germany, 2000.
2. IEEE Guide to Software Requirements Specification, IEEE Std. 830-1998, IEEE Press, Piscataway, N.J., 1998.
Table 1. What Stakeholders Want and What They Offer

Stakeholder | Motivation | Expertise areas
Customer | Introduce change with maximum benefit | Business and information system strategies, industry trends
User | Introduce change with minimum disruption | Business process, operating procedures
Project manager | Successfully complete the project with the given resources | Project management, software development and delivery process
Analyst | Specify requirements on time and within budget | RE methods and tools
Developer | Produce technically excellent system, use latest technologies | Latest technologies, design methods, programming environments and languages
Quality assurance | Ensure compliance to process and product standards | Software process, methods, and standards
Figure 1. We studied three factors that contribute to project success: team knowledge, allocated resources, and exhibited RE processes. (The figure shows high-technology leaders building critical business applications, with three groups of factors: Knowledge—application domain, information technology, requirements engineering process; Resources—effort and duration, team size, cohesive team; Process—defined process, cycles or “activity patterns,” activities.)
Figure 2. Research results regarding knowledge of the application domain (5.4), information technology (5.5), and the RE process being used (5.4), each rated on a 1–7 Likert scale.
There were 76 stakeholders: 15 project managers, 34 team members, and 27 other stakeholders such as customers, management, and quality assurance personnel. The development projects we targeted were recently released critical business applications. On average, the participating projects finished in 16.5 months with an expended effort of approximately 120 person-months. Through questionnaires and interviews, we collected data directly from each project’s stakeholders to avoid potential bias and to obtain a more complete understanding of the RE process. We assessed the participants’ confidence in their responses by using a 1–7 Likert scale in the questionnaires.
That is, individuals rated their own perceived degree of accuracy of statements (such as “the project has a well-documented process for specifying requirements”) by assigning a value from 1 (very inaccurate) to 7 (very accurate). The Likert scale let us calculate a total numerical value from the responses.

Our research approach

We decided to study knowledge because, as Bill Curtis, Herb Krasner, and Neil Iscoe have emphasized, deep application-specific knowledge is required to successfully build most large, complex systems.4 We investigated team knowledge with regard to the application domain, the technology needed to implement the proposed solution, and the RE process used. To gather the participants’ perceptions of team knowledge, we employed a 7-point Likert scale for each focus area.

Allocated resources included team size, expended effort in person-months, and duration (in months) of RE. With regard to team size, the project managers provided data on full- and part-time members of the RE and project teams. We also gathered perceptions about the teams’ coordination and interaction capabilities during RE. In the questionnaire, we used the construct of “cohesiveness,” measured on the Likert scale, to address this aspect of RE. During follow-up interviews, we further investigated communication breakdowns, quality of interaction, and conflict resolution.

We gathered the stakeholders’ perceptions of the defined RE process by considering the extent of standardization (for example, is it well documented? is it tailored from an organizational standard?) and the configuration of work products and their changes, as well as by obtaining independent reviews of RE activities and deliverables. We measured these constructs on the Likert scale. To enable the participants to characterize the practiced RE process, we gave them a list of typical elicitation, modeling, verification, and validation activities, some of which we identified by surveying the RE literature. We also applied the construct of RE cycles to distinguish activity patterns over time. An RE cycle is a set of activities that contains at least one each of elicitation, modeling, validation, and verification activities. Generally, a specific deliverable (for example, a prototype or data model) also characterizes a completed RE cycle.
We gathered stakeholders’ perceptions of RE performance in terms of process control, the quality of RE service, and the quality of RE products. Process control addresses the team’s capability to execute according to plan; thus, we gathered data to compare planned and actual cost, duration, and effort. We evaluated the quality of RE service in terms of the stakeholders’ satisfaction with the RE process and the perceived “organizational fit” of the proposed solution. The stakeholders rated the quality of RE products using these quality attributes: correct, unambiguous, complete, consistent, prioritized, verifiable, modifiable, and traceable.10 Requirements coverage, the functional and nonfunctional requirements addressed by a project’s RE products, also influences quality. We assessed functional requirements from the perspective of functions, data, and behavior, and nonfunctional requirements by gathering perceived coverage of product, process, and external requirements.11

Findings

We focused on three factors that contribute to project success—knowledge, resources, and process—and analyzed their contribution to a project’s performance.

Knowledge

Group research emphasizes the impact of experience and expertise on team effectiveness. For instance, in the study by Curtis, Krasner, and Iscoe, project managers and division vice presidents consistently commented on how differences in individual talents and skills affected project performance.4 The authors identified the thin spread of application domain knowledge as one of the most salient problems in software projects. In this study, stakeholders perceived the team’s domain knowledge as relatively good. It reached 5.4 on a 7-point Likert scale, where 7 indicates an RE team with a high degree of knowledge (see Figure 2). Several teams repeatedly referred to their “good use of domain experts.” (The quotes here and throughout this article indicate a direct quote from a participant in our study.) They also included “end users and customers from the very beginning” and “tried to get as much feedback as possible from team members when defining requirements.” Some teams, however, worked with marketing personnel rather than the actual customer, assuming that marketing knew what was best for the customer.
In some instances, this produces “unrealistic” requirements, because the marketing staff’s understanding often differs from the application-specific knowledge of software use that design requires.4 On half of the participating projects, senior management, project managers, and system analysts defined at least the initial requirements. Only a few software teams involved technically knowledgeable stakeholders such as software developers, quality assurance, test, and configuration management personnel early in the project. In the rare case that they were involved in RE, they were selected because they had greater application domain knowledge than their colleagues. Involving stakeholders early also resulted in an increased understanding of the RE process being used. On the other hand, a lack of training and “weak project management” led to teams that were less familiar with the RE process. For instance, some organizations assign project managers based on their availability rather than their capabilities.5 In those cases, both project managers and team members must learn the basics as they work. Participants in this study confirmed that this can result in severe budget overruns, frequent schedule adjustments, and canceled projects.3
Resources

Traditionally, RE receives a relatively small percentage of project resources throughout the software life cycle. For example, in 1981 Barry Boehm found that 6 percent of a project’s cost and 9 to 12 percent of project duration are allocated to specifying requirements.12 Over the last 20 years, the resources allocated to RE activities have increased. In this study, perhaps because the participants are high-technology leaders, the resource allocation to RE was significantly higher. Project teams expended on average 15.7 percent of project effort on RE activities. The average amount of RE time equaled 38.6 percent of total project duration. Patricia Guinan gives an average RE team size of 5.6 members in the 66 projects she studied.7 For the 15 projects in our study, we calculated an average RE team headcount of 6.2 (see Figure 3).
Figure 3. Comparing the resources spent on RE by team size and part-time versus full-time allocation of people.

 | Full-time (avg.) | Part-time (avg.) | Total (avg.)
Project team | 10.9 | 5.7 | 16.6
RE team | 3.4 | 2.8 | 6.2
The varied uses of full- and part-time resources during RE were interesting: Some projects allocated only dedicated resources (“product teams”) to RE, whereas others relied on RE “experts” whom various projects “shared.”

We also investigated participants’ perceptions of team cohesiveness. The teams rated themselves as relatively cohesive—5.5 on a 7-point Likert scale. Overall, stakeholders gave higher scores to RE teams that frequently communicated with customers and other teams involved in the development project. In other words, such teams executed more “boundary management activities.”7 While some teams leveraged “a lot of confidence and trust between customers and developers,” others struggled to achieve “buy-in from all stakeholders” or to secure “the necessary participation of other teams to stay on track and complete the specification.”

Several RE teams used specification templates to facilitate communication among stakeholders. They also included “comprehensive examples” to improve specification readability. The most common tool used during RE was an internal Web site, accessible to all stakeholders, where the project team posted and maintained the requirements. Some RE teams experimented with commercially available RE tools. In all but one case, these tools interfered with rather than supported RE activities. We believe that either a lack of well-defined RE processes or the RE team members’ lack of training in the selected tools caused this undesired effect.

Process

Only some projects defined their RE process explicitly or tailored an organizational (“standard”) RE process (see Figure 4) to their needs. A tailored RE process uses or adapts a collection of RE process “assets” such as methods, templates, and tools to better fit a specific project’s characteristics.
Although several RE teams executed a documented RE process, with quality assurance providing insight into the compliance of a team’s actions to their plan, most stakeholders perceived RE as an ad hoc process. Most projects specify software in a dynamic environment and therefore struggle with the classic problem of rapidly fluctuating requirements.4 This requires “flexible requirements” that “can be clarified and changed as the product progresses.” In other words, the RE process has to account for stakeholders’ learning curves and for requirements negotiation throughout the development project. Early on, some projects lamented the lack of “a detailed enough system architecture to adequately specify requirements,” while others “froze the specification early” only to face customers that “kept changing requirements late in the cycle.” The average ratings of configured RE work products (for example, prototypes and object-oriented models) and configured requirement changes reflect the stakeholders’ struggle to balance the “drivers for change and the desire for stability.” The most successful projects, however, recognized elusive requirements early in the process (with such statements as “the specification is a living document”). They managed requirements change explicitly rather than “freezing the whole specification.”

Most of the RE teams performed multiple RE cycles. This is consistent with Prodromos Chatzoglou’s research finding that more than half of the 107 projects he studied exhibited three or more RE cycles.3 In our study, RE teams focused significantly more on eliciting and modeling requirements than on validating and verifying them (6.4 and 6.2 percent compared to 3.1 percent of project effort). All teams performed some level of document analysis. Some, for instance, analyzed business plans and market studies while others derived requirements from contracts and requests for proposals. Most of the projects also used unstructured interviews, brainstorming, and focus groups. Only two projects held workshops to elicit requirements. Most RE teams did not use “textbook” modeling methods,8 but they did adopt some of those methods’ principles, such as abstraction and partitioning.
The majority of projects developed prototypes ranging from simple mock-ups to operational prototypes. A third of the RE teams developed OO models. Most projects verified and validated requirements with multiple stakeholders. More than half the projects performed peer reviews or walk-throughs to verify or validate requirements. Repeatedly, participants emphasized the importance of including customers and users in peer reviews. Moreover, several teams created scenarios to validate requirements. Only five of the 15 RE teams explicitly tracked requirements throughout their projects’ life cycles.

Performance

We considered three dimensions of performance: quality of RE service, quality of RE products, and process control. Using existing preference measures,5 we calculated the product of weighted performance dimensions to obtain a total performance indicator (TPI). That is, quality of RE service is most important (with a weight of 1.438), followed by the quality of RE products (1.126) and process control (0.438).

Stakeholders rated the quality of service at 76 percent on average. Several stakeholders mentioned their early and frequent involvement in RE activities and the “good interaction between all groups” as important aspects that influenced their rating. They were more satisfied with the fit of the recommended solution than with RE, however. Frequently, this was due to difficulties in project planning (for example, “planning for high-priority items slipped” or “maintaining requirements in later development phases not planned”).

The quality of RE products considers both requirements coverage and specification quality. Stakeholders gave it an average rating of 66 percent; the customers’ perception of requirements coverage was relatively low, particularly with regard to nonfunctional requirements. Management seemed satisfied with data requirements, whereas the team and project managers focused on functions. Stakeholders emphasized, however, that concentrating on functions and data resulted in a “lack of total system requirements attention” and in “incomplete performance, capacity, and external interface requirements.”
Figure 4. Evaluation of the participating teams’ RE processes, based on a 7-point Likert scale.

Well-documented RE process: 4.6
Tailored RE process: 4.2
QA audits of RE products and process: 5.0
Team executes process: 4.9
Configured RE work products: 5.3
Configured requirements changes: 4.6
The stakeholders also scored specification quality. Overall, they were most satisfied with the specification’s consistency and the ability to modify it as necessary, but they emphasized that the lack of traceability hurt their project. Prioritization of requirements, however, caused the most difficulty for RE teams. The average stakeholder rating for this quality attribute was significantly lower than any other attribute. RE teams struggled with adequately involving customers to identify high-priority requirements, management’s inattention to resolving “unknowns,” and nonfunctional requirements that were not managed “in the same process or as tightly.” Several teams also mentioned the inability to consistently execute RE according to stakeholders’ priorities rather than their own interpretation of “what’s important,” and some reported difficulties in keeping everybody on the project informed of changing priorities.

Process control, the third dimension of RE performance, was rated at 59 percent on average. The less the variance between actual and planned cost, duration, and effort, the better a project’s process control. On average, project teams contained cost overruns within less than 2 percent and effort overruns within less than 6 percent. In addition, their cost and effort performance was predictable, with a standard deviation of less than 25 percent. Project duration, however, showed an average overrun of 20 percent, with a standard deviation of 47 percent. Managers frequently focused on cost to control the RE process, leading to undesired changes of expended effort and predicted duration.
Figure 5. Comparing the deployed processes and their total performance indicator (TPI), the quality of RE service, the quality of RE products, and process control.

 | Quality of RE service | Quality of RE products | Process control | TPI
Projects (TPI > 75%) | 95% | 68% | 85% | 83%
PT/model-based process | 83% | 76% | 71% | 78%
PT-based process | 71% | 71% | 62% | 70%
Model-based process | 64% | 69% | 43% | 63%
Projects (TPI < 50%) | 50% | 43% | 16% | 42%
Total (avg.) | 76% | 66% | 59% | 70%
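The article describes the TPI as the product of weighted performance dimensions; the percentages in Figure 5 are, to within a point of rounding, consistent with a weighted average using the stated weights. The sketch below reproduces several TPI values from Figure 5 under that assumption—an interpretation of the aggregation, not the authors’ published formula.

```python
WEIGHTS = {"service": 1.438, "products": 1.126, "control": 0.438}

def tpi(service, products, control):
    """Total performance indicator as a weighted combination of the three
    RE performance dimensions, all expressed in percent."""
    weighted = (WEIGHTS["service"] * service
                + WEIGHTS["products"] * products
                + WEIGHTS["control"] * control)
    return weighted / sum(WEIGHTS.values())

# Rows from Figure 5: (quality of RE service, quality of RE products, process control)
print(round(tpi(95, 68, 85)))  # 83 -- "Projects (TPI > 75%)"
print(round(tpi(50, 43, 16)))  # 42 -- "Projects (TPI < 50%)"
print(round(tpi(76, 66, 59)))  # 70 -- "Total (avg.)"
```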
Figure 6. A successful RE process. (The diagram shows domain analysis drawing on the application domain, system artifacts, and source material to build domain knowledge; the team develops basic and advanced models and, in parallel, prototypes and mock-ups; peer reviews, scenarios, and walk-throughs generate stakeholder feedback on the models and prototypes; and the results are compiled into the specification.)
Moreover, some project managers perceived “pressure” to underestimate effort and duration to meet cost targets determined outside their control.

The “top” performers in this study (TPI > 75 percent) dynamically balance knowledge, resources, and process. For example, as Figure 5 shows, RE teams that used either prototypes or models exclusively fared better on average than projects with a TPI below 50 percent. A combined prototyping and model-based (PT/M) process resulted in higher ratings in all performance dimensions. The most successful projects also expended twice the effort to specify requirements. They performed RE activities throughout the project’s duration, while lower-rated RE teams (TPI < 50 percent) were primarily involved during the “front end” of the project.
For both types of projects, about four full-time team members participated in the RE team. On the most successful projects, however, several part-time members also supported the team. For projects with a higher TPI, stakeholders reported that RE teams were more knowledgeable about the application domain, IT, and the RE process. Whereas Mitchell Lubars suggests that a “better atmosphere” contributes to project success,8 Guinan concludes that an environment where team members share positive and friendly feelings is unrelated to team performance.7 In our study, we found low cohesiveness only on projects with a TPI of less than 50 percent. This suggests that cohesiveness is a “trouble” indicator rather than a performance predictor. In other words, an increase in cohesiveness reduces the risk of failure but does not guarantee success.

Successful teams performed on average three iterations of the RE process (see Figure 6). They identified major stakeholders and domain boundaries, examined system artifacts and source material from current and previous systems, and frequently obtained stakeholder feedback and expert guidance on how to proceed. This resulted in much more explicit “win conditions” for stakeholders. The successful RE teams used advanced modeling methods such as OO models, knowledge models, and quality function deployment matrices that translated “the voice of the customer” into quantitative technical requirements. However, they did not abandon basic models (for example, the entity-relationship model or state transition diagrams). They simply tried “to create a more complete model of the system.” Moreover, the teams developed (basic and advanced) models and prototypes together to clarify “known” requirements and to guide the discovery of new ones. A combination of models and prototypes helped the stakeholders, especially customers and users, envision the proposed solution. During peer reviews, the RE team, technical and domain experts, and customers and users examine the models. The resulting feedback enables the RE team to create acceptable models that accurately specify the requirements.
Table 2. Best Practices

Focus area | Best practice | Cost of introduction | Cost of application | Key benefit
Knowledge | Involve customers and users throughout RE | Low | Moderate | Better understanding of “real needs”
Knowledge | Identify and consult all likely sources of requirements | Low to moderate | Moderate | Improved requirements coverage
Knowledge | Assign skilled project managers and team members to RE activities | Moderate to high | Moderate | More predictable performance
Resources | Allocate 15 to 30 percent of total project effort to RE activities | Low | Moderate to high | Maintain high-quality specification throughout the project
Resources | Provide specification templates and examples | Low to moderate | Low | Improved quality of specification
Resources | Maintain good relationships among stakeholders | Low | Low | Better satisfy customer needs
Process | Prioritize requirements | Low | Low to moderate | Focus attention on the most important customer needs
Process | Develop complementary models together with prototypes | Low to moderate | Moderate | Eliminate specification ambiguities and inconsistencies
Process | Maintain a traceability matrix | Moderate | Moderate | Explicit link between requirements and work products
Process | Use peer reviews, scenarios, and walk-throughs to validate and verify requirements | Low | Moderate | More accurate specification and higher customer satisfaction
Scenarios and walk-throughs further guide the discovery of requirements. The breakdowns experienced while using prototypes and mock-ups lead to an evolutionary improvement of the specification.

Best practices

Successful RE teams have in-depth knowledge of the application domain, IT, and the RE process. In other words, successful projects have the “right combination” of knowledge, resources, and process. Table 2 summarizes the best practices exhibited by the most successful RE teams.

Stakeholder feedback plays a decisive role from the beginning to the end of successful RE projects. The most successful teams always involve customers and users in the RE process and maintain a good relationship with stakeholders. They have an ongoing collaboration with stakeholders to make sure that requirements are interpreted properly, to deal with fluctuating requirements, and to avoid communication breakdowns. Research supports this best practice: according to one study, user participation is one of the most important factors contributing to RE success.5

Successful RE teams identify the boundaries of the application domain and of the major stakeholders. To validate their understanding of the application domain, they identify and consult all likely requirements sources.
They examine, for example, system artifacts and source material from current and previous systems. As other research points out,4 some individuals perform “10 times better than others.” Thus, managers of successful RE teams should

■ carefully select team members skilled in the application domain, IT, and the RE process;
■ always assign experienced, capable project managers to RE; and
■ consult domain experts and stakeholders early on to augment and validate the team’s knowledge base.
Successful projects allocate a significantly higher amount of resources to RE (28 percent) than the average project in this or previous field studies, and they expend these resources according to a well-defined process. Successful teams also maintain a balance between RE activities. That is, they allocate on average 11 percent of project effort to elicitation, 10 percent to modeling, and 7 percent to validation and verification. To streamline RE activities, successful teams frequently leverage specification templates and examples from previous projects.

Requirements prioritized by stakeholders drive successful RE teams.
This allows the RE team to decide which requirements to investigate when and to what degree of detail. To specify prioritized requirements, the RE team develops various models together with prototypes. Moreover, the team maintains a requirements traceability matrix to track a requirement from its origin through its specification to its implementation. This lets the team show how its work products contribute to satisfying the requirements. In addition, successful teams repeatedly validate and verify requirements with multiple stakeholders. They use peer reviews, scenarios, and walk-throughs to improve the specification throughout the software’s life cycle.
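A traceability matrix of the kind described here can be as simple as a table linking each requirement to its origin, its place in the specification, and the work products that implement or verify it. A minimal, hypothetical sketch (all identifiers invented):

```python
# One row per requirement: origin, specification element, and implementing work products.
traceability = [
    {"id": "R-01", "origin": "customer workshop", "spec": "SRS 3.2.1",
     "work_products": ["design D-4", "module billing.py", "test T-11"]},
    {"id": "R-02", "origin": "contract clause 7",  "spec": "SRS 3.4.2",
     "work_products": []},   # gap: nothing implements or verifies this requirement yet
]

# Reviews can then flag requirements that no work product covers.
uncovered = [row["id"] for row in traceability if not row["work_products"]]
print("requirements without work products:", uncovered)   # ['R-02']
```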
Teams often struggle with fluctuating requirements, communication breakdowns, and difficulties in prioritizing requirements. RE goes through recurrent cycles of exploring the perceived problem, proposing improved specifications, and validating and verifying those specifications. It is a learning, communication, and negotiation process; to succeed, you must integrate your technical, cognitive, social, and organizational processes to suit your project’s particular needs and characteristics. In other words, you must progressively discover your project requirements to specify software successfully.
Acknowledgments

We thank the participating companies and the outstanding executives managing their software development and procurement processes for making this study possible. We also thank Steve McConnell and the reviewers, especially Karl E. Wiegers, for their helpful comments and suggestions. Further thanks go to Theresa Hofmann and John Overmars.
References

1. C. Jones, Applied Software Measurement: Assuring Productivity and Quality, McGraw-Hill, New York, 1996.
2. L.A. Macaulay, Requirements Engineering, Springer, London, 1996.
3. P.D. Chatzoglou, “Factors Affecting Completion of the Requirements Capture Stage of Projects with Different Characteristics,” Information and Software Technology, vol. 39, no. 9, Sept. 1997, pp. 627–640.
4. B. Curtis, H. Krasner, and N. Iscoe, “A Field Study of the Software Design Process for Large Systems,” Comm. ACM, vol. 31, no. 11, Nov. 1988, pp. 1268–1287.
5. K.E. Emam and N.H. Madhavji, “A Field Study of Requirements Engineering Practices in Information Systems Development,” Second Int’l Symp. Requirements Eng., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 68–80.
6. N.F. Doherty and M. King, “The Consideration of Organizational Issues during the System Development Process: An Empirical Analysis,” Behavior & Information Technology, vol. 17, no. 1, Jan. 1998, pp. 41–51.
7. P.J. Guinan, J.G. Cooprider, and S. Faraj, “Enabling Software Development Team Performance During Requirements Definition: A Behavioral Versus Technical Approach,” Information Systems Research, vol. 9, no. 2, 1998, pp. 101–125.
8. M. Lubars, C. Potts, and C. Richter, “A Review of the State of the Practice in Requirements Modeling,” First Int’l Symp. Requirements Eng., IEEE CS Press, Los Alamitos, Calif., 1993, pp. 2–14.
9. S.R. Nidumolu, “A Comparison of the Structural Contingency and Risk-Based Perspectives on Coordination in Software-Development Projects,” J. Management Information Systems, vol. 13, no. 2, 1995, pp. 77–113.
10. IEEE Guide to Software Requirements Specification, IEEE Std. 830-1998, IEEE Press, Piscataway, N.J., 1998.
11. H.F. Hofmann, Requirements Engineering: A Situated Discovery Process, Gabler, Wiesbaden, Germany, 2000.
12. B.W. Boehm, Software Engineering Economics, Prentice Hall, Englewood Cliffs, N.J., 1981.
About the Authors

Hubert F. Hofmann is manager of information systems and services for General Motors. His professional interests include strategic planning, enterprise-wide program management, software development, and system delivery processes. He received a PhD in business informatics from the University of Regensburg and an MBA from the University of Linz and the University of Zurich. He is a member of the IEEE Computer Society. Contact him at 7000 Chicago Rd., Warren, MI 48090; [email protected].

Franz Lehner is a professor of management and information systems at the University of Regensburg, Germany, and head of the Department of Business Informatics. His fields of interest are software engineering and reengineering, knowledge management, and distance education. Contact him at the University of Regensburg, Business Informatics, D-93040 Regensburg, Germany; [email protected].
feature user interfaces
Virtual Windows: Linking User Tasks, Data Models, and Interface Design
Soren Lauesen and Morten Borup Harning, Copenhagen Business School
User interface design comprises three major activities: organizing data into a set of windows or frames, defining functions that let the user control the system, and designing the graphical appearance of windows and functions. These design activities can build on analysis results such as task analysis and data modeling, and they can include checking activities such as reviews and usability tests. The goal is to create a system that is easy to learn, is easy to understand, and supports user tasks efficiently. (Some systems are for entertainment rather than task support, and in such cases, task efficiency is of no concern. We will not consider such systems in this article.)
The authors show an approach for designing user interfaces that balances a good overview of data with efficient task support, and allows user validation much earlier than do traditional usability tests.
Many developers wonder whether there is a systematic way to design a user interface. How do we get from data models and use cases to the actual interface? Two systematic approaches have been around for a long time in various versions: the data-oriented approach (easy to do with popular database tools) and the task-oriented approach (dominant in the field of human–computer interaction). Each approach has its strengths and weaknesses. In this article, we show a new approach called virtual windows that eliminates many weaknesses of these two classic approaches.
Two traditional approaches

The data-oriented approach starts with a description of the data the system must maintain, typically in the form of a data model (a static class model). From this, designers define a set of windows such that all data is visible.1,2 The functions tend to be standard functions for creating, updating, and deleting data. The graphical design is dominated by the built-in presentations the database tool offers. Examples of this approach are simple applications made with Microsoft Access or Oracle Forms. This approach’s main problem is that it doesn’t ensure efficient task support. We have seen many data-oriented interfaces where the user cannot get an overview of the data necessary for important tasks.3

The task-oriented approach starts with a list of user tasks (use cases) that the system must support.
Analysts break down each task into a series of steps. From there, designers define a window for each step.4,5 In an extreme version, each window holds only the input and output fields strictly necessary to carry out the step. The user can move to the next window and usually to the previous one as well. Wizards (which are popular in many modern interfaces) follow this extreme approach. This approach’s main problem is that the user never gets an overview of the data available. He or she only sees the data through a “soda straw.” Furthermore, real-life tasks are usually much more varied and complex than the analyst assumes. Consequently, the system supports only the most straightforward tasks and not the variants. In some systems, only straightforward tasks must be supported, which is when the task-oriented approach is excellent. Withdrawing cash from an ATM is a good example. In more complex systems, however, another approach is needed.

Why it works

The virtual-windows technique is an approach based on employing data and tasks at the same time. Part of the approach is to design and test the graphical appearance before the functions are defined. A virtual window is a picture on an idealized screen. With a PC application, think of it as a GUI window on a large physical screen. This window shows data but has no buttons, menus, or other functions. For devices with specialized displays, think of the virtual window as a picture on a large display; for Web systems, think of it as a page or frame. A complex application needs several virtual windows.

The basic idea when composing a set of virtual windows is this: Create as few virtual windows as possible while ensuring that all data is visible somewhere and that important tasks need only a few windows. Furthermore, design the graphical appearance of the windows such that real-life data will be shown conveniently and that users can understand what the windows show. An important part of the graphical design is to decide whether data appears as text, curves, dials, pictures, and so on. Later, the designer develops the user interface: he or she organizes the virtual windows into physical windows or screens; adds buttons, menus, and other functions; and adds messages, help, and so on.
We have used the virtual-windows approach for several years in teaching and projects such as special devices and business, Web, and client-server systems. Compared to user interfaces designed traditionally, the ones based on virtual windows appear to have several advantages:

■ there are fewer windows,
■ there is efficient task support (also for task variants),
■ users can validate the database, and
■ users better understand the final system.
Virtual windows give fewer windows than data-oriented approaches because we do not create a window for each entity—the windows group information according to user needs. They also give fewer windows than task-oriented approaches because we can reuse windows across several tasks. Not surprisingly, the virtual-windows approach provides more efficient task support than the data-oriented approach because the windows are designed for the tasks. Blindly applied, the virtual windows might support straightforward tasks a bit less efficiently than the task-oriented approach, but they can better handle task variants. Less blindly applied, the virtual windows let the designer support important tasks as efficiently as the task-oriented approach.

The validation of database contents is the result of the early graphical design with realistic data. Furthermore, all data is shown in a work context, so users relate to the data more vividly. Users better understand the final system because fewer windows reduce the user’s mental load. In practice, we also test whether users understand the virtual windows. They often don’t understand the first version, but because the windows are made early—much earlier than paper mockups of the user interface—we can afford to make radical design changes. Because the later physical windows resemble the virtual windows, chances are higher that users will also understand the final interface.

Understandable windows allow the users to form a better mental model of not only the data in the system, but also the system’s functions.
Figure 1 illustrates the principle. The user sees or enters some data on the screen. When the system removes the window or changes to another window, the user does not assume that the data disappears but that it is stored somehow. The figure shows this stored data at the back of the system. The user thus forms a mental model of what is stored and how it relates to other stored data, based on the way the data appears on the screen. This psychological mechanism corresponds to Piaget’s law of object constancy: When we see an object and it becomes hidden, we automatically assume that it still exists somewhere. The figure also suggests that the user understands the functions in terms of the stored data and its relation to what is on the screen. Many users comment that the picture reflects quite well how they imagine the system.
Figure 1. The user forms a mental model of data and functions in the system from what he or she sees on the screen. The virtual windows are a consistent basis for a mental model. (The figure shows three layers for Guest and Order data: physical windows—what the user sees; virtual windows—what the computer remembers; and the functional model—what the commands do, for example Get guest, Find orders, and Cancel order.)
How to do it

Let’s examine how to use virtual windows to design a user interface for a hotel booking system.

Make a task list (use cases)

The first step is to list the tasks (use cases) the system must support. This step is part of many analysis approaches and not something special for the virtual-window approach.4–6 Figure 2 shows a simple task list for the hotel example. The list shows that our system supports booking, checking in and out, keeping track of breakfast servings, and so on. Implicitly, the list delimits the system’s scope. Our list shows, for instance, that the simple hotel system does not support areas such as accounting, personnel, and purchases.

Make a data model

In most cases, the virtual-window approach benefits from a traditional data model (static object model), which is part of many analysis approaches.7,8 The data model specifies the data to be stored long-term in the computer. Usually, short-term data such as search criteria are not specified in the data model. A data model is an excellent tool for developers but is hard for even expert users to understand. Figure 2 shows the data model for the hotel example.
Figure 2. The hotel domain and its formalization as a data model and a list of user tasks. (Task list: Book guest, Check-in booked, Check-in unbooked, Checkout, Change room, Enter breakfast list. Data model: Guest (name, address, passport, pay method); Service (count, date); Service type (price, name); Room state (date, #persons, booked|occupied); Room (room#, bath, #beds, price1, price2).)
Figure 3. Virtual windows in outline version. A few windows cover all data, and each task needs only a few windows. (The tasks Book, Check-in, Change room, and Checkout map to a Guest window (name, address, period, booked rooms, bednights, price, servings, price) and a Rooms window (prices, bath, status, date); the Enter breakfast list task maps to a Breakfast list window (date, room#, type, servings, ...).)
The data model shows that the hotel system must keep track of guests (one record per guest). To shorten the example, we simplified the system a bit and assumed that each guest only had one stay at the hotel—the system doesn't track regular customers. The system must also record various services to be paid by the guests, such as breakfast servings. There are several types of service. According to the data model, each guest might receive several services, and each service is of exactly one ServiceType. The system must keep track of rooms. Each room has a list of RoomStates, one for each date in the period of interest. The room's state on any given day can be free or booked. A guest's stay relates to one or more RoomStates corresponding to the dates where he or she has booked a room or actually occupies it. In most hotel systems, rooms are booked by room type (single or double), whereas check-in is by room number. To simplify the example, we assume that rooms are booked by room number, too.

Outline the virtual windows
In this step, we outline virtual windows, looking at the tasks one by one. For each task we identify the data to be seen by the user and group it into a few virtual windows. The important trick, however, is to reuse or expand previous windows rather than design new ones for each task.

The step uses a few guidelines. To explain them, we distinguish between a window type and a window instance. Think of the window type as a template and the window instance as a window filled in with data. For example, a guest window with data for a specific guest is a window instance, while all the guest windows use the same window type. Here are those guidelines:
■ Few window types. Keep the total number of window templates small. (It is easier to grasp fewer windows as a mental model.)
■ Few window instances per task. For each task, the user should access few window instances. (This improves task support.)
■ Data in one window only. Avoid having the user enter the same data item through several window instances. Preferably, each data item should also be shown in only one window instance. (Seeing the same data item in several windows makes the mental model more complex. Being able to enter it at several places causes confusion about the entry's effect.)
■ Virtual windows close to final physical windows. Although we assume a large physical screen, the virtual windows shouldn't be too far from what is realistic in the final interface. (Otherwise, the user will not generate a mental model close to the virtual windows.)
These are not rules set in stone. In many cases they conflict, forcing us to strike a balance between task efficiency and ease of understanding (we show examples later). The rules cover only the high-level composition of screens. They do not pretend to cover graphical design for the individual windows.

Windows for booking. We will use these guidelines for the hotel system. Let's start with a frequent task, the booking task. Which data should the user see to book a guest? He or she must see which rooms are vacant in the period concerned, what their prices are, and so forth. The user also must record the booking and the name, address, and other guest information. Figure 3 shows how this data could be allocated to two virtual windows, Guest and Rooms. The pile of completed Guest windows suggests that we have recorded several guests. We have only one Rooms window, because all rooms are shown there with their occupation status for a period of days. This is also where the user selects free rooms for a guest. Why do we need two virtual windows? Doesn't it violate the principle of few window types? Apparently, yes, but if we had only one
type, the room occupation status would appear in all the guest windows, violating the "data in one window only" guideline. This kind of conflict is common and is not a result of the virtual-window approach—the approach just reveals the conflicting demands. Although we need two virtual windows for the booking task, we might later choose to show them together on the same physical screen—for instance, as two physical windows or two frames in a single physical page. This is a matter of detailed dialog design, to be handled later.

Check-in, change room. Next, let's consider
check-in. Fortunately, the same two virtual windows suffice. If the guest hasn’t booked in advance, the procedure is much the same as for booking, except that the room becomes occupied rather than booked. If the guest has booked, we just need to find the guest in the pile of guest forms. We need some search criteria to support that, but again, we delay that to detailed dialog design. Room changes use the same two windows: one for the customer who wants to change and one to see free rooms.
Checkout. When checking out, the receptionist needs some more data. He or she must see how many nights the guest stayed, what services the guest received, the prices, and the total. It is useful to verify the data with the guest before printing the bill. Where do we put these data? In a new virtual window? No, because that would violate the "few window types" guideline. Instead, we extend the guest window as in Figure 3. Whether we want to always show the extensions on the physical window is a matter of later dialog design. But when we show them, the graphical look should clearly indicate that this is an extension of the guest data—for instance, it might be shown in the same frame as the other guest data.

Record services. The last task on our list is recording services such as breakfast servings. In principle, we don't need a new virtual window for this—the receptionist could simply find the guest window and somehow record the service there. However, in many hotels, waiters record breakfast servings on a list with preprinted room numbers. The waiter brings it to the receptionist, who enters the data. The system could handle that with one window instance per room, using a special function that scans through guests in room order. However, this would violate the "few window instances per task" guideline. Figure 3 shows another solution, a virtual window that holds a breakfast list. This solution is important for task support, although it violates the "data in one window only" guideline, because breakfast servings are also shown in the guest windows. The system is now conceptually more complex because the user has to understand the relation between the guest and breakfast windows. For instance, the user might worry whether the system keeps both of them updated or whether he or she has to do something to have the latest breakfast list transferred to guest windows. This is a quite serious design conflict between task efficiency and ease of understanding. It is important to test whether users understand the suggested solution properly and, if necessary, to supply adequate guidance in the windows.

Supplementary techniques. In larger systems,
defining windows based only on the guidelines listed earlier can be difficult. We use supplementary techniques such as defining larger chunks of data than the simple objects in the data model,9 setting up a matrix with the relation between tasks and the necessary data for the task,10 and defining several virtual-window models by using different tasks as the starting point.

Detail the graphics and populate windows
In this step, we make a detailed graphical design of the virtual windows and fill in the windows with realistic data. Still, we don't add dialog functions such as push buttons or menus—that is left to dialog design. Many windows are rather straightforward to design, and standard GUI controls suffice. This is the case with the guest window and the breakfast list. Other windows must give a good overview of a lot of complex data, and they are difficult to design. This is the case with the room window—it's easy to show the RoomStates as a simple list of records, but that doesn't give the necessary overview. Our solution uses a spreadsheet-like display with mnemonic codes for room occupation, as in Figure 4. It is somewhat complex to implement on most GUI platforms. Showing complex data comprehensively is a relatively unexplored area.11,12 However, virtual windows are excellent for experimenting with advanced presentation forms.

Figure 4. Detailed virtual windows that use graphical design and are filled in with realistic data. (The figure shows the Guest window for stay #714, John Simpson, with room and breakfast lines and their prices; a Breakfast list for 9/8 with servings per room; the Rooms window as an occupation grid per room and date, with mnemonic codes such as B for booked and O for occupied; and a Service charges window listing the prices of breakfast in the restaurant and in the room.)

Show realistic and extreme data. It is very important to test the design by filling in the windows with realistic data, as in Figure 4 (abbreviated here for space reasons). We also show a complex, but slightly unusual, situation: a single guest checks into a room and later becomes two guests, who move into a double room. The virtual window shows that, conceptually, this is one stay and one guest. Apart from filling the windows with ordinary data, it is useful to try filling them with extreme, but realistic, data, which often reveals the need to modify the design. In the example, extreme data for a guest window include cases where a single guest books rooms for a conference with scores of rooms or where the guest stays for a very long period. Users might need presentation forms with a better overview in such situations.

Reuse old windows or forms. If the old system uses some forms or screens already, should we use them in the new system? It will ease learning the new system, but it might counteract designs with better task support. In the hotel system, for instance, the virtual guest window resembles an invoice, thus helping the novice. However, if we want a good overview for a guest who books scores of rooms, another presentation might be advantageous.

Comparison with traditional approaches. If we
had used a traditional data-oriented approach, what kind of physical windows would we have seen? Most likely, the service-charge window would look much the same, because it corresponds closely to the service-type table in the database. However, the guest window would probably become three windows: one with data from the guest record (name, address, and so on), one with the room lines for a specific guest (based on the room state and the room table), and one with the service lines for the guest. (We have seen such solutions in several commercial Enterprise Resource Planning products.) The rooms window would probably become a list of room states for a specific date, corresponding to one column of the virtual window. Finally, the breakfast window might not be there at all, because it is not necessary from a data point of view.

If we use a task-oriented approach, the user first must select a task: book, check-in, and so on. For the book task, the user might be guided first through one window to enter the desired room type and stay period, next through a window to select available rooms in that period, then through a window to enter the guest name and address, and eventually through a confirmation window. The windows for another task (such as room change) might look very different.

Check the design
At this stage, the design artifacts are a task list, a set of virtual windows, and possibly a data model. They overlap considerably, which lets us check for completeness and consistency. Here are some useful checks and tests:

Check virtual windows against the data model.
Check that all data in the virtual windows exist in the data model or can be derived from it.

CRUD check. CRUD stands for create, read, update, and delete. Check that it is possible to create, read, and so on all data through some virtual window. Otherwise a window might be missing—and probably a task. (This is a check for oversights—some data might be imported from other systems and thus need no manual update.) If we check the hotel system, we should
notice that service types and their prices cannot be seen in a way where changing them or creating new services makes sense. The solution is to add a new virtual window to show the list of service types and their prices (shown in Figure 4 as Service charges). We also lack at least one task that uses this window. We might call it Maintain Service List, and we should add it to the task list because it is useful for later checking and testing.
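To show how such a CRUD check could be mechanized, here is a small, hypothetical Java sketch. The window names, data-item labels, and the idea of automating the check are ours; the article itself treats the CRUD check as a manual review.

import java.util.*;

// Hypothetical sketch: cross-check that every data item can be created,
// read, updated, and deleted through at least one virtual window.
public class CrudCheck {
    // window name -> (data item -> supported operations, e.g. "CRUD")
    static Map windowOps = new HashMap();

    static void allow(String window, String dataItem, String ops) {
        Map items = (Map) windowOps.get(window);
        if (items == null) { items = new HashMap(); windowOps.put(window, items); }
        items.put(dataItem, ops);
    }

    static void report(String[] dataItems) {
        String crud = "CRUD";
        for (int i = 0; i < dataItems.length; i++) {
            for (int j = 0; j < crud.length(); j++) {
                char op = crud.charAt(j);
                boolean covered = false;
                for (Iterator it = windowOps.values().iterator(); it.hasNext() && !covered; ) {
                    String ops = (String) ((Map) it.next()).get(dataItems[i]);
                    covered = ops != null && ops.indexOf(op) >= 0;
                }
                if (!covered)
                    System.out.println(dataItems[i] + ": no window supports '" + op + "'");
            }
        }
    }

    public static void main(String[] args) {
        allow("Guest", "Guest.name", "CRUD");
        allow("Rooms", "RoomState.booked", "CRUD");
        // ServiceType.price is editable nowhere, so the check reports the gap;
        // this is exactly the oversight that led to the Service charges window
        // and the Maintain Service List task described above.
        report(new String[] { "Guest.name", "RoomState.booked", "ServiceType.price" });
    }
}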
Walk through all tasks. Take the tasks one by one and walk through them—manually simulate how to perform each task by means of the virtual windows. As part of this, you might write down the functions needed in each virtual window.

Understandability test with users. An in-depth review of a data model, even with expert users, is rarely possible. They might accept the model, but they rarely understand it fully. With virtual windows, reviews are fruitful because of the detailed graphics and the realistic data. Show the virtual windows one by one to a user. Ask the user what he or she believes the windows show, whether they show a realistic situation, whether he or she could imagine a more complex situation that would be hard to show on the screen, and whether some data is missing. Perform the review with expert users and ordinary users separately—they reveal different kinds of problems. For example, in the hotel case, reviews with experts revealed that seasonal prices were missing. Reviews with ordinary users revealed that the mnemonic marking of room state was not obvious. Some users believed that O for occupied meant zero, meaning that the room was free. You can also ask the user to walk through some tasks. Ask which window he or she would use, what data he or she would enter, what functions he or she would expect to switch to other windows, and so forth. The whole exercise can get quite close to a real usability test,13,14 although the windows contain no functions (menus, buttons, and so on) at this stage.

Design the dialog
In this step, we design the dialog in detail. This is not part of the virtual-windows approach, but the virtual windows are the basis.15 Detailed design involves several things:

■ Organizing the physical windows. The basic part of this is to adjust virtual windows to physical window size by splitting them, using scroll bars, tabs, and so forth. In some cases, two or more virtual windows might be combined into one physical window or screen.
■ Adding temporary dialog data such as search criteria and dialog boxes. The earlier walkthrough and understandability tests are good sources.
■ Adding functions to navigate between windows, perform domain-oriented functions, and so on. Again, the walkthrough and understandability tests are good sources. CRUD checks and state-transition diagrams might reveal missing functions.
■ Adding error messages.
■ Adding help and other guidance.
The design should produce a more or less functional prototype. An important purpose of the prototype is usability testing, which can effectively reveal usability problems in the design.14 Although early testing of the virtual windows can reveal many problems, new problems creep in during detailed design.

Experience
Over the last six years, we have gathered much experience with the virtual-windows approach in real projects and in courses for designers. Here are some of our observations.

Need for early graphical design
In early versions of the approach, we didn't split virtual-window design into an outline step and a detail step.15 We observed that some design teams produced excellent user interfaces that scored high during usability tests, while other teams produced bad designs. Furthermore, excellent designs were produced much faster than bad designs. Why? Gradually we realized that the main difference between good and bad teams was the amount of detail they put into the virtual windows. Both groups could quickly produce the outline version. The bad teams then continued with dialog design, but when they designed the actual user interface, everything collapsed. The outline could not become
useful physical windows, fields could not contain what they were supposed to, getting an overview of data was impossible, and so on. The teams had to redesign everything, which resulted in a mess. The good teams spent more effort designing the graphical details of the virtual windows, filling them with realistic data, and so forth. As part of that, they often modified the outline and grouped data in other ways. These changes were easy to handle at that time, and from then on, things went smoothly. Dialog functions were added, and the physical window design was largely a matter of cutting and pasting parts of the virtual windows.

Adding functions too early
We have observed that designers tend to add buttons to the virtual windows from the very beginning. This goes against the idea of dealing with only data at that stage and delaying functions to later steps. The designers cannot, however, resist the temptation to put that check-in button on the guest window. We have learned to accept that. It doesn't really harm anything as long as the designers don't focus too much on functions at this stage. Maybe it actually makes the windows more understandable, because a button suggests how to use the window. For instance, we have noticed that users better understand that something is a long list if there is a scroll bar.

Forgetting the virtual windows
We observed many cases where designers made excellent virtual windows, only to forget them when designing the physical windows. The physical design then became driven by available GUI controls and the belief that traditional windows were adequate. The concern for understandability and efficient task support disappeared, and the final user interface became a disaster. Two things can help overcome this: ensuring that the window designers know the GUI platform to be used (so that they don't propose unrealistic designs) and ensuring that quality assurance includes tracing from the virtual to the final windows.

Examples of projects
Virtual windows stood its first real-life test in 1993 in a project for booking and scheduling classrooms in a university.
Classrooms were the most critical resource at that time, and the existing system was entirely manual and an organizational disaster. The university had attempted a computer system, but owing to its heavily data-oriented design, it was not successful. We designed a new system by means of virtual windows. It became a success and is still the way all room allocation is handled. The database has 20 entity types and 2 million records; there are 150 rooms in 12 buildings and 7,000 courses; and there are two highly complex, 10 moderately complex, and eight simple windows to view.

A recent example is the redesign of a Web-based job match system where applicants could record their qualifications and companies announce jobs. Several such systems exist, but the one in question was the easiest and most widely used. Still, it attracted few users. In the old design, the user had to go through 14 screens to record qualifications. Usability tests showed that most users gave up in the middle of this. The designers used virtual windows to redesign the application, and the result was four screens to record qualifications. Usability tests showed that all users completed the task (except for one, who turned out to be drunk).
Many of our students and clients have successfully picked up the virtual-windows approach. It has become the natural way to design user interfaces—for us as well as for them. Over the years they have encouraged us to promote the approach more widely. Until now, however, we have never taken the time to do it, except at a few conferences. We refine the approach a bit every now and then, but basically consider it finished. We have gradually included the approach into more comprehensive methods for user interface design and requirements engineering, but the methods are not yet published in English.
References
1. R. Baskerville, "Semantic Database Prototypes," J. Information Systems, vol. 3, no. 2, Apr. 1993, pp. 119–144.
2. C. Janssen, A. Weisbecker, and J. Ziegler, "Generating User Interfaces from Data Models and Dialogue Net Specifications," Proc. Int'l Conf. Computer–Human Interaction, ACM Press, New York, 1993, pp. 418–423.
3. S. Lauesen, "Real-Life Object-Oriented Systems," IEEE Software, vol. 15, no. 2, Mar./Apr. 1998, pp. 76–83.
4. K.Y. Lim and J.B. Long, The MUSE Method for Usability Engineering, Cambridge Univ. Press, Cambridge, UK, 1994.
5. A.G. Sutcliffe, Human–Computer Interface Design, Macmillan Press, London, 1995.
6. A. Cockburn, "Structuring Use Cases with Goals," J. Object Oriented Programming, Sept./Oct. 1997, pp. 35–40.
7. P. Chen, "The Entity-Relationship Model: Toward a Unified View of Data," ACM Trans. Database Systems, vol. 1, no. 1, Mar. 1976, pp. 9–36.
8. E. Yourdon, Modern Structured Analysis, Prentice-Hall, New York, 1989.
9. M.B. Harning, "An Approach to Structured Display Design: Coping with Conceptual Complexity," Proc. 2nd Int'l Workshop Computer-Aided Design of User Interfaces, Presses Universitaires de Namur, Namur, France, 1996, pp. 121–138.
10. S. Lauesen, M.B. Harning, and C. Gronning, "Screen Design for Task Efficiency and System Understanding," Proc. Australian Conf. Computer–Human Interaction (OZCHI 94), CHISIG, Downer, Australia, 1994, pp. 271–276.
11. E. Tufte, Envisioning Information, Graphics Press, Cheshire, Conn., 1990.
12. L. Tweedie, "Interactive Visualisation Artifacts: How Can Abstraction Inform Design?" Proc. Human–Computer Interaction (HCI 95), Cambridge University Press, 1995, pp. 247–266.
13. J.S. Dumas and J.C. Redish, A Practical Guide to Usability Testing, Ablex, Westport, Conn., 1993.
14. A.H. Jorgensen, "Thinking-Aloud in User Interface Design: A Method Promoting Cognitive Ergonomics," Ergonomics, vol. 33, no. 4, 1990, pp. 501–507.
15. S. Lauesen and M.B. Harning, "Dialogue Design through Modified Dataflow and Data Modeling," Proc. Human–Computer Interaction, Springer-Verlag, New York, 1993, pp. 172–183.
About the Authors
Soren Lauesen is a professor at the IT University of Copenhagen. He has worked in the IT industry for 20 years and at the Copenhagen Business School for 15. His research interests include human–computer interaction, requirements specification, object-oriented design, quality assurance, systems development, marketing and product development, and cooperation between research and industry. He is a member of the Danish Academy of Technical Sciences and the Danish Data Association. Contact him at the IT Univ. of Copenhagen, Glentevej 67, DK-2400 Copenhagen NV, Denmark; [email protected].

Morten Borup Harning is a chief design officer at Open Business Innovation. His research interests include user interface design frameworks, design methods and design notations, user interface design tools, user interface management systems, and human–computer interaction. He received his PhD from the Copenhagen Business School. He is a member of IFIP WG 2.7/13.4 on User Interface Engineering and chairs SIGCHI.dk, the Danish Special Interest Group on Human–Computer Interaction. Contact him at Dialogical, Inavej 30, DK-3500, Denmark; [email protected].
feature self-testability
Reliable Objects: Lightweight Testing for OO Languages
Jean-Marc Jézéquel, Irisa, University of Rennes
Daniel Deveaux, Valoria, University of Bretagne Sud
Yves Le Traon, Irisa, University of Rennes
To keep test costs and overhead low, the authors propose making software components self-testable. Using mutation techniques to evaluate such components' testing efficiency gives an assessment of their quality.

In the fast-moving, highly reactive arena of software development, low-cost, low-overhead methods are key to delivering high-quality products quickly and within tight budgets. Among the many aspects of the development process, testing is an obvious target for such methods, because of both its cost and its impact on product reliability. But classical views on testing and their associated test models, based on the waterfall model, don't work well with an incremental, object-oriented development process. The standardization of semiformal modeling methods, such as UML, reveals the trend toward exactly this kind of process: We can no longer separate testing from specification, design, and coding. What we need is a testing philosophy geared toward OO development and a low-overhead test approach that we can integrate into the process.

We propose a lightweight, quality-building method that developers can implement without sophisticated and costly tools. The basic idea is to embed tests into components (loosely defined here as sets of tightly coupled classes that can be deployed independently), making them self-testable. By this method, to establish a self-testable component's reliability, we estimate the quality of its test sequence. Thus, we can associate each self-testable component with a value—its level of trustability—that quantifies the unit test sequence's effectiveness at testing a given implementation of the component. Our method for achieving this quantification is a version of selective mutation analysis that we adapted to OO languages. Relying on this objective estimation of component trustability, the software developer can then consciously trade reliability for resources to meet time and budget constraints. In this article, we outline the Java implementation of our methodology.

Self-testable classes
Now let's turn to the idea of self-testable components, in which we embed a component's specification and test sequence along with its implementation into a deployment unit.

Specifying behavior with contracts
Before we consider running tests to check a component's quality, we must know what the component should do in a given situation.
This knowledge might exist only in the brain of the programmer or the tester; ideally, it is formalized in a dedicated specification language. However, for many software developments, formal specification technology is seldom feasible because of tight constraints on time and other resources. The notion of software contracts offers a lightweight way to capture mutual obligations and benefits among components. Experience tells us that simply spelling out these contracts unambiguously is a worthwhile design approach to software construction, for which Bertrand Meyer coined the phrase design by contract.1

The design-by-contract approach prompts developers to specify precisely every consistency condition that could go wrong and to assign explicitly the responsibility of each condition's enforcement to either the routine caller (the client) or the routine implementation (the contractor). Along the lines of abstract data type theory, a common way of specifying software contracts is to use Boolean assertions (called pre- and postconditions) for each service offered, along with class invariants for defining general consistency properties. A contract carries mutual obligations and benefits: The client should only call a contractor routine in a state respecting the class invariant and the routine's precondition. In return, the contractor promises that when the routine returns, the work specified in the postcondition will be done and the class invariant will still be respected. A second benefit of such contracts, when they are runtime checkable, is that they provide a specification against which we can test a component's implementation.

Making components self-testable
Because a software component can evolve during its life cycle (through maintenance, for example), an organic link between its specification, test set, and current implementation is crucial. In our methodology, based on an integrated design and test approach for OO software components, we consider a set of tightly coupled classes as a basic unit component and define a test suite as an organic part of each component. Indeed, we think of each component as a triangle composed of its specification (documentation, method signatures, pre- and postconditions, and invariant properties), one implementation, and the test cases needed for testing it.
Table 1. Three classes from the Pylon library (open source software)
Class name | Inherits from | Inst | Role
Container | — | Abstract | Base for all collection classes; defines count, isEmpty, ...
Dispenser | Container | Abstract | Containers to which new items can be added and existing items can be removed one at a time
Stack | Dispenser | Concrete | Standard stack structure
To a component's specified functionality, we add a new feature that makes it self-testable. In this approach, the class implementor must ensure that all the embedded tests are satisfied so that we can estimate test quality relative to the specification, a test sequence, and a given implementation. If the component does not reach the necessary quality level, the class implementor has to enhance the test sequence. Thus, a self-testable component can test itself with a guaranteed level of quality. We could define this quality level several ways—for example, the classical definition involves code coverage. We propose mutation analysis2 as a relevant way for analyzing a test sequence's quality. Therefore, we base the quality level on the test sequence's fault-revealing power under systematic fault injection. Once a designer can associate such test quality estimates with a set of functionally equivalent components, he or she can choose the component with the best self-testability.

Self-testable classes in Java
We've implemented the self-testable concept in the Eiffel,3 Java, C++, and Perl languages. Because Eiffel has direct support for design by contract, implementing self-testable classes in that language is straightforward. On the other hand, the lack of standardized introspection facilities made it more difficult in several other aspects. To outline the Java implementation, we'll use the simplified example of a set of three classes taken from the Pylon library (www.nenie.org/eiffel/pylon), implementing a generic stack (see Table 1). Our Web site (www.iu-vannes.fr/docinfo/stclass) includes the complete source code of these classes as well as the self-testable classes distribution for all supported languages.

Design by contract in Java
Because Java does not directly support the design-by-contract approach, we must implement contract watchdogs and trace and assertion mechanisms.
/** add() : Add an item to the dispenser */
public void add (Object element) {
    precond ("writable", isWritable(), "real elem", !isVoid(element));
    int old_count = count();
    .....
    postcond ("keep elem", has(element), "", count() == old_count + 1);
}
(a)

/**
 * add() : Add item to the dispenser
 *
 * @pre  isWritable()       // writable
 * @pre  !isVoid(element)   // real elem
 *
 * @post has(element)                 // keep elem
 * @post count() == count()@pre + 1
 */
public void add (Object element) {
    .....
}
(b)

Figure 1. Possible syntaxes for contracts in Java, showing the use of (a) inherited methods and (b) the preprocessor implementation.
/** TST_stack() : stack structure verification */ public void TST_stack() { SelfTest.testTitle (3, “LIFO stack,” “A stack of int”) ; reset() ; // start with known state add (new Integer (1)) ; add (new Integer (2)) ; add (new Integer (3)) ; SelfTest.check (-1, “stack image 1,” out(). equals (“[1, 2, 3]”)) ; SelfTest.testMsg (“after three ‘add()’ : “ + out() + “... Ok”) ; SelfTest.check (-1, “3 on top,” ((Integer) item()).intValue() == 3) ; SelfTest.testMsg (“3 on stack top... Ok”) ; ..... } // —————————————— TST_stack() Figure 3. A test function for Stack. 78
(A contract watchdog monitors the contract validity on entry and exit of any method call and, in case of a contract violation, raises an exception.) To this end, two roads are possible: the first uses inherited functions that programmers call directly in their code; the second uses a contract definition syntax embedded in comments and uses a preprocessor that instruments the code before compilation (see Figure 1). Both approaches use the Java exception mechanism. Although it is very simple to implement and explain, the inherited-methods approach has two major drawbacks:
■ Because of Java's single inheritance, using inheritance for implementing contracts prohibits the specialization of noninstrumented classes.
■ More seriously, the contract calls are located inside methods; thus, the contracts are forgotten if a method is redefined in a subclass.
Recently, researchers have proposed several preprocessors to manage contracts. We developed the example we use in this article with iContract from Reto Kramer (www.reliable-systems.com/tools). This tool is very simple to use, and, because it uses only comments, it does not affect production code. In addition, it completely implements the design-by-contract scheme, including inheritance of contractual obligations and Object Constraint Language–style @pre expressions corresponding to Eiffel old expressions.

How to make self-testable classes
Applying our design-for-testability approach, we define invariants for each class and pre- and postconditions for each method. In addition, we can instrument a method code with check() and trace() instructions to help further debugging. A standard solution for this in Eiffel and C++ is to use multiple inheritance; in Java, we must use interface inheritance plus delegation. Thus, we make a Java class self-testable by making it implement the SelfTestable interface: We add an implements ubs.cls.SelfTestable clause to the declaration and define the test() method so as to delegate to the corresponding method in the ubs.cls.SelfTest utility class. To enable the self-testable class to run as a standalone program, we
can append a main() function (see Figure 2). A class usually has several methods that should be called and tested. In addition to these standard methods, we define testing methods: each one frames a testing unit and has a goal (explained in its comment) of testing that the implementation of a set of methods corresponds to its specifications. By convention, the testing method name begins with TST_. Figure 3 shows the outline of a testing method for the class Stack. Usually, a method test consists of a simple call because the class invariant and method postcondition are sufficient oracles, a concept we’ll return to later. To control behaviors that combine multiple method calls, we can import a check() function that behaves like the Eiffel check instruction or the C/C++ assert macro. The utility class SelfTest defines the check() method and a set of useful functions that support the management of tracing inside testing methods (testTitle(), testMsg(), and so on). An array in the test suite order declares the testing method names; the test launcher method test() uses this array. Moreover, in each method of the class, we add SelfTest. profile() as the first statement: this allows the counting of method calls during the test. For validation and verification, we compile a class through iContract and then through a standard Java compiler. At that point, test execution is very easy; we only have to run the class. By default, running the class runs the validation test and produces a test report such as that in Figure 4. Through options on the command line, we can control debugging and tracing levels and select which testing methods to execute. We can also redirect the execution logs to files. This instrumentation is useful not only for the validation phase, but also for code verification and debugging. When compiling in production mode, we can tell the iContract tool to tune the level of assertion checking or even bypass it. A script called hideTest lets us hide the testing methods as well as the profile() or check() calls in the Java source code. Inheritance and abstract classes As in Eiffel, iContract allows contracts to be inherited from standard classes, abstract classes, and interfaces. For example the Container and Dispenser classes declare several abstract methods (count(), item(), has(), add(), remove(), and so on) that define pre-
>java ubs.struct.Stack -stat
Test of class 'ubs.struct.Stack'

Test unit n. 1  Container creation: empty, readable and writable
  - newly created Container... Ok
  - Container empty... Ok
  - Container readable... Ok
  .....
Test unit n. 3  LIFO stack: A stack of int
  - three 'add()' : [1, 2, 3]... Ok
  - 3 on stack top... Ok
  .....
Test TST_stack ended     Number of called methods: 49    Number of aborted calls: 0

End of test sequence     Number of called methods: 101   Number of aborted calls: 0
List of tested methods: add, count, has, isEmpty, .....
Figure 4. Test report of the Stack class.

/** add() : Add item to dispenser
 *
 * @pre  isWritable()     // writable
 * @pre  !isVoid (elem)   // real element
 *
 * @post has(elem)                  // keep element
 * @post count() == count()@pre + 1
 */
public abstract void add (Object elem);     // in Dispenser.j

/** add() : PUSH
 *
 * @post elem == item()   // new on top
 */
public void add (Object elem) {
    SelfTest.profile();
    ....
}                                           // in Stack.j

Figure 5. Postcondition strengthening in add().
and postconditions. The concrete class Stack implements these methods but does not redefine the contracts. On the other hand, in other methods, a precondition might need to be weakened or a postcondition might need to be strengthened (see Figure 5). July/August 2001
Because abstract classes can have neither a constructor nor a main() function, we cannot make them self-testable. Nevertheless, an abstract class contains testing methods inherited by subclasses. We can write these testing methods because, very often at the abstract class level, we know the way a set of methods (both abstract and concrete) should be used. In our example, Container defines a testing method (TST_create()) that tests the instantiation mechanism and the methods count(), isEmpty(), and so on. In the same way, Dispenser defines TST_addsup(), which tests add() and remove(). The concrete class Stack defines only one testing method (TST_stack(), shown in Figure 3), but it uses all three methods in its testing sequence.

It is very easy to reuse testing units, especially in large libraries that use inheritance widely. In such cases we can improve the reuse mechanism, redefining a testing method and calling a parent method:

public void TST_foo() {
    super.TST_foo();
    .....
}
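As a worked illustration of that reuse pattern, here is a sketch of a hypothetical subclass that extends the article's Stack, reuses the inherited TST_addsup() checks, and adds its own. BoundedStack, its capacity field, and the extra check are our own invention; the TST_ naming, SelfTest.check(), reset(), add(), and count() follow the article's example classes and are assumed to be available.

// Illustrative sketch only: reuse a parent testing method and extend it.
public class BoundedStack extends Stack {
    private int capacity = 10;   // hypothetical extension

    public boolean isFull() { return count() >= capacity; }

    /** TST_addsup() : reuse the inherited add()/remove() checks, then add our own */
    public void TST_addsup() {
        super.TST_addsup();                 // checks inherited from Dispenser
        reset();                            // start from a known state
        for (int i = 0; i < capacity; i++) {
            add (new Integer (i));
        }
        SelfTest.check (-1, "stack full after capacity adds", isFull());
    }
}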
Estimating test quality
In our approach, it is the programmer's responsibility to produce the test cases along with, or even before, implementing a class. As several other researchers have pointed out,4 programmers love writing tests because it gives them immediate feedback in an incremental development setting. Still, if we want to build trust in components developed this way, it is of the utmost importance to have a quality estimate for each self-test. If the component passes its tests, we can use this quality estimate to quantify our trust in the component. The quality criterion we propose here is the proportion of injected faults the self-test detects when we systematically inject faults into the component implementation. We derived this criterion from mutation testing techniques adapted for OO development.

Mutation testing for OO
Mutation testing was first developed to create effective test data with significant fault-revealing power.5,6 As originally proposed by De Millo, Lipton, and Sayward in
1978,7 the technique is to create a set of faulty versions or mutants of a program with the ultimate goal of designing a test set that distinguishes the program from all its mutants. In practice, the technique models faults using a set of mutation operators, where each operator represents a class of software faults. To create a mutant, you apply its operator to the original program. A test set is relatively adequate if it distinguishes the original program from all its nonequivalent mutants. Each test set receives a mutation score that measures its effectiveness in terms of the percentage of revealed nonequivalent mutants. (A mutant is considered equivalent to the original program if there is no input data on which the mutant and the original program produce a different output.) A benefit of the mutation score is that even if the test set finds no errors, the score still measures how well the software has been tested. This gives the user information about certain kinds of errors being absent from the program. Thus, the technique provides a kind of reliability assessment for the tested software. Selective mutation reduces the computational expense of mutation testing by limiting the number of mutation operators applied. (We can omit many expensive operators without losing anything in terms of the tests’ fault-revealing power.) Selective mutation considers faults from syntactic and semantic points of view. A mutation’s semantic size represents its impact on the program’s outputs; the syntactic size represents the modification’s syntactic importance. Because mutations have quite a small syntactic size (modifying one instruction at most), selective-mutation operators must find the better trade-off between large and small semantic faults. Actually, there is no easy way to appraise semantic size, except possibly by sensitive analysis.6 In the context of our methodology, we are looking for a subset of mutation operators general enough to be applied to various OO languages (Java, C++, Eiffel, and so on), implying limited computational expense, and ensuring at least control-flow coverage of methods. Table 2 summarizes our current choice of mutation operators. During the test selection process, we say that a mutant program is killed if at least one test case detects the fault injected into the
mutant. Conversely, we say that a mutant is alive if no test case detects the injected fault.
Table 2. Selective-mutation operators
Operator | Action
EHF | Causes an exception when executed. This semantically large mutation operator allows forced code coverage.
AOR | Replaces occurrences of arithmetic operators with their inverses (for example, + replaces –).
LOR | Replaces each occurrence of a logical operator (and, or, nand, nor, xor) with each of the other operators. Also replaces the expression TRUE with FALSE.
ROR | Replaces each occurrence of one of the relational operators (<, >, ≤, ≥, =, ≠) with one or more of the other operators in a way that avoids semantically large mutations.
NOR | Replaces each statement with the empty statement.
VCP | Slightly modifies constant and variable values to emulate domain perturbation testing. Each constant or variable of arithmetic type is both incremented by one and decremented by one. Each Boolean is replaced by its complement.
RFI | Resets an object's reference to the null value. Suppresses a clone or a copy instruction. Inserts a clone instruction for each reference assignment.
RFI
MS(Ci, Ti) = di / mi
STQ(S) = Σi=1,n di / Σi=1,n mi.
We associate these quality parameters with each component and compute and update the global system’s test quality depending on the number of components actually integrated into the system.
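A short sketch of how these two measures could be computed once the killed (di) and total (mi) mutant counts are known per component. The class name, arrays, and sample counts are purely illustrative assumptions, not part of the authors' tooling.

// Illustrative only: MS per component and STQ for the whole system.
public class TestQuality {
    public static double mutationScore (int killed, int total) {
        return (double) killed / total;          // MS(Ci, Ti) = di / mi
    }

    public static double systemTestQuality (int[] killed, int[] total) {
        int d = 0, m = 0;
        for (int i = 0; i < killed.length; i++) { d += killed[i]; m += total[i]; }
        return (double) d / m;                   // STQ(S) = sum(di) / sum(mi)
    }

    public static void main (String[] args) {
        int[] killed = { 45, 30 };   // hypothetical counts for two components
        int[] total  = { 50, 40 };
        System.out.println ("MS(C1) = " + mutationScore (killed[0], total[0]));
        System.out.println ("STQ    = " + systemTestQuality (killed, total));
    }
}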
Test selection process The whole process can be driven by either quality or effort. In the first case, test quality guides test selection, while in the second case, some test effort constraint (estimated by a number of test cases) guides the selection. Figure 6 outlines the process of generating unit test cases. The first step is mutant generation. Next comes the test enhancement process, which consists of applying each test case to each live mutant. If a mutant is still alive after execution of each test case, then a diagnosis stage occurs, which cannot be completely automated. For each live mutant, diagnosis determines why the test cases have not detected the injected fault. The diagnosis leads to one of three possible actions: equivalent mutant elimination, test-case set enhancement (if the tests are inadequate), or specification enhancement (if the specification is incomplete). This analysis is very helpful for increasing a component’s quality, because it enforces the organic link between the component’s specification, test, and implementation facets. After a test-case set reaches the desired
Figure 6. Testing process based on mutation analysis. (The figure shows mutants A1 through A6 generated from class A. The self-test of A is executed against each mutant Aj: if the injected error is detected, the mutant is killed and the self-test is OK; if not, the mutant stays alive and a diagnosis follows, leading to (1) considering Aj an equivalent mutant, (2) adding contracts to an incomplete specification, or (3) enhancing the self-test. Only the diagnosis step is non-automated.)

quality level, we apply a reduction process to delete redundant test cases. This process
consists of creating a matrix marking which test cases kill which mutants. A classical matrix Boolean reduction algorithm generates the final test-case set, minimizing the set size while maintaining the fault-revealing power. Selection of the test-case set fixes the mutation score as well as the component’s test quality. Except for the diagnosis July/August 2001
Select one of the following:
■ Quality-driven: Wanted quality level = WQL
■ Effort-driven: Maximum number of test cases = MaxTC
Let nTC be the number of actual generated test cases.
While Q(Ci) < WQL and nTC < MaxTC, do
1. Enhance the test-case set, update nTC (nTC++);
2. Apply each new test case to each live mutant;
3. Diagnosis;
4. Compute the new Q(Ci).

Figure 7. Algorithm for the test selection process.
step, the process can be completely automated. Figure 7 summarizes the algorithm’s main steps. Test case generation and oracle determination Our technique uses deterministic test data generation because each class consists of a set of functionally coherent methods. Designers and developers can easily generate basic efficient data; experience teaches that deterministic test generation can follow these rules: ■
■
■
Methods that are functionally linked belong to the same testing family; they must be tested together. (For example, a container remove and add must be tested together because you cannot test remove without adding one element in the container.) In a family, basic independent sets of methods must be exercised first; for example, has cannot be tested before add, but you also need has to check that add is correct. So has and add must be tested together first, before remove, which is less basic and needs both has and add to be tested. In case of inheritance, redefined methods must be retested with specific tests.
and relationships with the input ones—as partial oracle functions. Thus, in a systematic design-by-contract approach we can detect many of the faults without writing explicit oracle functions. Postconditions cover a larger test-data space than the explicit test oracles but are generally not sufficient for detecting semantically rich test results. In some cases, a postcondition is sufficiently precise to replace deterministic oracles: For a sort method, testing whether the result is effectively sorted is a complete and simple-to-express oracle function. However, in most cases, functional dependencies between methods are difficult to express through general invariants.
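For readers who want to see what such an explicit oracle looks like in code, here is a sketch of the [add(2), remove(2)] example discussed above, written in the TST_ style of Figure 3. The container methods and SelfTest.check() are those of the article's example classes; the test method itself and its name are our own illustrative addition.

/** TST_addRemoveOracle() : explicit oracle for the add/remove sequence */
public void TST_addRemoveOracle() {
    reset();                          // start from a known state
    add (new Integer (2));
    remove (new Integer (2));
    // Explicit oracle: after add(2) followed by remove(2), has(2) must be false.
    SelfTest.check (-1, "2 no longer present", !has (new Integer (2)));
}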
We hope the approach we've presented here will provide a consistent, practical design-for-testability methodology for the growing number of software development projects for which timeliness and reactivity are key. The key benefits of our approach are these:

■ Writing tests is easy at the unit class level.4
■ Verifying the test quality is feasible through fault injection.
■ The process of estimating test quality can be automated.
■ Because a test driver consists of a set of self-test method calls, structural system tests are easy to launch.
■ Relying on the objective estimation of component trustability that our method provides, software developers can consciously trade reliability for resources to meet time and budget constraints.
For future work, we’ve planned experimental studies for validating the relevance of both language-independent and language-specific mutation operators. We’ll also look at integration strategies based on the underlying test dependency model.
References
1. B. Meyer, "Applying 'Design by Contract,'" Computer, vol. 25, no. 10, Oct. 1992, pp. 40–52.
2. J. Offutt, "Investigation of the Software Testing Coupling Effect," ACM Trans. Software Engineering Methodology, vol. 1, no. 1, Jan. 1992, pp. 5–20.
3. Y. Le Traon, D. Deveaux, and J.-M. Jézéquel, "Self-Testable Components: From Pragmatic Tests to a Design-for-Testability Methodology," Proc. TOOLS-Europe'99, IEEE Computer Society, Los Alamitos, Calif., 1999, pp. 96–107.
4. K. Beck and E. Gamma, "Test-Infected: Programmers Love Writing Tests," Java Report, vol. 3, no. 7, Jul. 1998, pp. 37–50.
5. J. Offutt et al., "An Experimental Evaluation of Data Flow and Mutation Testing," Software Practice and Experience, vol. 26, no. 2, Feb. 1996, pp. 165–176.
6. J. Voas and K. Miller, "The Revealing Power of a Test Case," Software Testing, Verification and Reliability, vol. 2, no. 1, May 1992, pp. 25–42.
7. R. De Millo, R. Lipton, and F. Sayward, "Hints on Test Data Selection: Help for the Practicing Programmer," Computer, vol. 11, no. 4, Apr. 1978, pp. 34–41.
About the Authors
Jean-Marc Jézéquel is a professor at Irisa, University of Rennes, France, where he leads an INRIA research project called Triskell. His research interests include software engineering based on OO technologies for telecommunications and distributed systems. He received an engineering degree in telecommunications from the École Nationale Supérieure des Télécommunications de Bretagne and a PhD in computer science from the University of Rennes. Contact him at IRISA, Campus de Beaulieu, F-35042 Rennes, France; [email protected].

Daniel Deveaux is an assistant professor at the University of Bretagne Sud, where he leads the Aglae research team. His research interests include OO design and programming, with an emphasis on documentation, testing, and trusted software component production methodologies. He received an engineering degree in agronomy from the École Nationale Supérieure d'Agronomie de Rennes. Contact him at UBS—Lab VALORIA, BP 561, 56017 Vannes, France; [email protected].

Yves Le Traon is an assistant professor at Irisa, University of Rennes, France. His research interests include testing, design for testability, and software measurement. He received his engineering degree and his PhD in computer science from the Institut National Polytechnique de Grenoble, France. Contact him at IRISA, Campus de Beaulieu, F-35042 Rennes, France; [email protected].
Related Work

Few of the numerous first-generation books on analysis, design, and implementation of OO software explicitly address validation and verification issues. Despite this initial lack of interest, researchers have begun to devote more attention to testing of OO systems. Robert V. Binder has published a detailed state of the art.1

Concerning OO testing techniques, most researchers have focused on the dynamic aspects of OO systems: They view a system as a set of cooperating agents, modeling objects. And they model the system with finite-state machines or equivalent object-state modeling.1–3 Such approaches must deal with limitations concerning the computational expense of mapping objects' behavior into the underlying model. One solution consists of decomposing the program into hierarchical and functionally coherent parts. This decomposition provides a framework for unit, integration, and system test definition. John D. McGregor and Tim Korson leave behind the waterfall model, proposing an integrated test and development approach.4 However, these state-based models constrain the design methodology to dividing the system into small parts with respect to behavioral complexity.

Other researchers, approaching the test problem from a pragmatic point of view, have come up with simple-to-apply methodologies based on explicit test philosophies. Ivar Jacobson and colleagues describe how to codesign accompanying test classes with "normal" classes.5 Kent Beck and Erich Gamma propose a methodology based on pragmatic unit test generation.6 This methodology can also serve as a basis for bridging the existing gap between unit and system dynamic tests through incremental integration testing.7 Binder discusses the existing analogy between hardware and OO software testing and suggests an OO testing approach close to hardware notions of built-in test and design for testability.

In this article, we go even further by detailing how to create self-testable OO components, explicitly using concepts and terminology from hardware's built-in self-test. We also define an original measure of a component's quality based on the quality of its associated tests (in turn based on fault injection). For measuring test quality, our approach differs from classical mutation analysis8 in that it uses a reduced set of mutation operators and integrates oracle functions into the component. Classical mutation analysis uses differences between the original program and mutant behaviors to craft a pseudo-oracle function.
References

1. R.V. Binder, "Testing Object-Oriented Software: A Survey," J. Software Testing, Verification and Reliability, vol. 6, no. 3–4, Sept.–Dec. 1996, pp. 125–252.
2. P.C. Jorgensen and C. Erickson, "Object-Oriented Integration Testing," Comm. ACM, vol. 37, no. 9, Sept. 1994, pp. 30–38.
3. D.C. Kung et al., "On Regression Testing of Object-Oriented Programs," J. Systems and Software, vol. 32, no. 1, Jan. 1996, pp. 232–244.
4. J.D. McGregor and T. Korson, "Integrating Object-Oriented Testing and Development Processes," Comm. ACM, vol. 37, no. 9, Sept. 1994, pp. 59–77.
5. I. Jacobson et al., Object-Oriented Software Engineering—A Use Case Driven Approach, Addison-Wesley/ACM Press, Reading, Mass., 1992.
6. K. Beck and E. Gamma, "Test-Infected: Programmers Love Writing Tests," Java Report, vol. 3, no. 7, Jul. 1998, pp. 37–50.
7. Y. Le Traon et al., "Efficient OO Integration and Regression Testing," IEEE Trans. Reliability, vol. 49, no. 1, Mar. 2000, pp. 12–25.
8. J. Offutt et al., "An Experimental Evaluation of Data Flow and Mutation Testing," Software Practice and Experience, vol. 26, no. 2, Feb. 1996, pp. 165–176.
focus
quality control
Using Card Sorts to Elicit Web Page Quality Attributes

Linda Upchurch, J. Scott (Thrapston)
Gordon Rugg, University College Northampton
Barbara Kitchenham, Butley Software Services

Card sorts are a good way to elicit Web page quality attributes and measures from respondents with no measurement expertise. This study used a specialized commercial Web site for sugar production.

Software measurement is a key component in assuring and evaluating software quality, which is at the heart of software engineering. Although substantial literature covers software quality and its measurement, most of it focuses on traditional system and application software. The Internet's rapid expansion has given rise to a new set of problems relating to the development of measures for software products such as Web pages. These problems are epitomized in the question, "How good is my Web site?"
Although numerous guidelines for Web site design have been published, no recognized standards have been set for assessing Web page quality.1–3 This article identifies Web page design quality attributes and explains how to measure them using card sorts.

Eliciting quality measures

One problem for software quality measurement is identifying the key attributes to measure. Conventionally, software engineers use methods such as Goal-Question-Metric to identify the metrics needed in a particular context.4,5 For Web page evaluation, the goal is to assess a Web page's quality from the viewpoints of designers and users. The first question we might ask is "What do we mean by the quality of a Web page?" followed by "How do we assess Web page quality?" These questions are too abstract to allow an easy mapping to a metric or a measurable concept. The fuzzier the property of interest and the more novel the context in which the property needs to be measured, the more difficult identifying appropriate measures is. We must identify approaches that operationalize the elicitation of measures in such circumstances.

Considerable research has focused on this problem. Many researchers have provided guidelines and models for constructing and validating software measures.6–9 However, there are no universally agreed-on procedures. Recently, researchers and practitioners in requirements engineering have been investigating a variety of methods from other areas to elicit requirements.10 Although that research is still in its early stages, results indicate that this approach looks promising. Neil Maiden and Gordon Rugg found that the choice of elicitation method is an important variable in the types of elicited knowledge. A significant proportion of knowledge is semi-tacit and can be accessed only through particular techniques; traditional interviews and questionnaires will simply fail to access such knowledge.10 When identifying relevant measures for software, a domain expert will probably fail to mention specific knowledge for a variety of reasons. Maiden and Rugg developed a model for selecting elicitation techniques to match knowledge types in the ACRE (Acquisition of Requirements) framework.10 We used this model because card sorts are particularly suitable for eliciting measures for Web page quality evaluation.

Card sort findings

This case study involved a specific commercial instance—Web pages commissioned by British Sugar, containing information about UK sugar production. This is a commercially important domain; British Sugar produces over one million tons of sugar annually, supplied under contract by 9,000 growers. The strategic objective of the Web pages was to strengthen British Sugar's links with its grower-suppliers. The site was set up to provide up-to-date information to growers, reduce costs and time delays, and promote a progressive company image. The Web site's quality is therefore of considerable importance to British Sugar.

The first step was choosing the appropriate type of card sorts (see the sidebar). The choice, for most domains, is either item sorts (sorting the physical entities involved), picture sorts (sorting photographs or pictures of the entities), or "true" card sorts (sorting cards, each of which has the name or a brief description of an entity on it). For this domain, we chose picture sorts, using full-color screen dumps of the Web pages as the entities to sort. The next step was deciding which entities to use. Sorts involving fewer than eight entities risk missing criteria; sorts involving more than 20 entities become increasingly cumbersome. We opted for nine screen dumps, because each screen dump contained visual and textual information that the respondents would need to process. This study involved repeated single-criterion sorts (sorting the same set of cards repeatedly, using a different criterion for the sorting each time).
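Before turning to the analyses, it may help to picture how one respondent's repeated single-criterion sorts can be recorded. The sketch below is purely illustrative and not from the study: the criterion and category names are invented examples, and the cards are the nine screen dumps, numbered 1 through 9.

```java
// Illustrative recording of one respondent's repeated single-criterion sorts.
// Criterion and category names are invented; cards are numbered 1 to 9.
import java.util.List;
import java.util.Map;

public class SortRecord {
    public static void main(String[] args) {
        // criterion -> (category chosen by the respondent -> cards placed in it)
        Map<String, Map<String, List<Integer>>> sorts = Map.of(
                "clarity of information", Map.of(
                        "most concise", List.of(2, 5, 7, 8),
                        "too cluttered", List.of(1, 3, 4, 6, 9)),
                "font size", Map.of(
                        "large", List.of(1, 2, 3),
                        "medium", List.of(4, 5, 6, 7),
                        "small", List.of(8, 9)));

        // The simplest analysis just counts criteria and categories per respondent.
        int criteria = sorts.size();
        int categories = sorts.values().stream().mapToInt(Map::size).sum();
        System.out.println(criteria + " criteria, " + categories + " categories");
    }
}
```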
Card Sorts and Related Techniques

Card sorts originated in George Kelly's Personal Construct Theory.1 PCT is based on the belief that different people categorize the world differently, with enough commonality to let us understand each other but enough differences to make us individuals. The most widely known PCT technique is the repertory grid technique. A repertory grid consists of a matrix of objects, attributes, and values, and is used to elicit attributes of importance to the respondent. The values are either numeric or specific types of nominal values (yes, no, or not applicable). This can lead to difficulties in representing domains where nominal categories predominate—discrete categories such as operating system are difficult to represent in a matrix of this sort. The strength of repertory grids is that grids composed of numeric values can be analyzed using various statistical approaches, including correlations and principal component analysis. Software packages supporting such analysis are readily available, and repertory grids are widely used in fields ranging from market research to clinical psychology.

In cases involving nonnumeric values, card sorts are a useful complement to repertory grids.2 They also perform well in relation to other techniques in terms of speed and quantity of information elicited.3 Entities are sorted into groups of the respondent's choice; the groupings can then be compared across respondents. The entities might be pictures (picture sorts); physical items (item sorts); or names of entities, descriptions of situations, and so forth, on cards (card sorts). There are several versions of card sorts. We prefer a version that involves asking the respondents to sort all the cards repeatedly, using a different basis for the sorting each time. This lets us identify the range of sorting criteria that the respondents used and compare the groups into which the cards are sorted within each criterion.

Although card sorts have been widely used for many years, they have tended until recently to serve as an informal method for initial exploration of a domain. One possible explanation for this is that data acquired through card sorts is not so obviously amenable to statistical analysis as data acquired with repertory grids. However, card sorts can be used rigorously, and they provide a valuable complement to repertory grids, rather than a poor substitute.

Both repertory grids and card sorts produce flat outputs consisting of a series of object:attribute:value triplets, with nothing to show the semantic links between the triplets. Other PCT techniques support this. Implication grids let respondents state how strongly each member of a set of attributes implies each other member (for instance, lack of visual clutter and easiness to read). Another technique is laddering, which can be used to elicit respondents' goals and values relating to the constructs elicited through repertory grids or card sorts.4,5 Laddering can also elicit explanations of technical or subjective terms. Laddering is applicable when the respondents' knowledge appears to be hierarchically structured. In practice, it works well with elicitation of goals and values, classifications, and explanations. Laddering involves using a small set of verbal probes to elicit higher- or lower-level concepts successively—for instance, asking for an explanation of a term, then recursively applying the same method to any terms in the explanation that require explanation, and so forth until all the resulting terms have been explained.
References

1. G. Kelly, The Psychology of Personal Constructs, W.W. Norton, New York, 1955.
2. P. McGeorge and G. Rugg, "The Uses of 'Contrived' Knowledge Elicitation Techniques," Expert Systems, vol. 9, no. 3, Aug. 1992, pp. 149–154.
3. A.M. Burton et al., "The Efficacy of Knowledge Elicitation Techniques: A Comparison across Domains and Levels of Expertise," Knowledge Acquisition, vol. 2, 1990, pp. 167–178.
4. D. Hinkle, "The Change of Personal Constructs from the Viewpoint of a Theory of Construct Implications," unpublished PhD thesis, Ohio State Univ., 1965, cited in D. Bannister and F. Fransella, Inquiring Man, Penguin, Harmondsworth, 1990, p. 73.
5. G. Rugg and P. McGeorge, "Laddering," Expert Systems, vol. 12, no. 4, Nov. 1995, pp. 339–346.
Table 1. Number of Criteria for Sorting (Group of Respondents)

                                            Farmers   Web designers   Laypeople
Minimum number of criteria per respondent      5             4             5
Maximum number of criteria per respondent      6             9            11
Total number of criteria                      22            27            32
Table 2. Number of Respondents Using Constructs, in Superordinate Construct Groups

Superordinate construct                        Farmers   Web designers   Laypeople   Total
Pages I would read                                2             0             0         1
Readability                                       2             1             0         3
Order would read pages                            1             1             1         3
Attractiveness of layout                          1             1             1         3
Font size                                         1             2             2         5
Clarity of information                            4             1             1         6
Layout                                            2             4             4        10
Subject matter                                    2             2             2         6
Amount of color (objective)                       3             3             2         8
Use of color (subjective)                         1             1             1         3
Relevance to farmers                              1             0             1         2
Contain comparable data                           1             0             1         2
Attention drawn to page/information               1             0             3         4
Information source                                0             2             3         5
Factory information                               0             1             2         3
Headings                                          0             1             2         3
Time information relates to                       0             1             1         2
Amount farmers would read                         0             2             0         2
Amount of content                                 0             1             0         1
Geographical area covered                         0             1             0         1
Pages with similar/varied information             0             1             0         1
Who will understand the information               0             1             0         1
Typeface                                          0             0             1         1
Inclusion of financial information                0             0             1         1
Inclusion of technical information                0             0             1         1
Information relating to supply and demand         0             0             1         1
Pages relating to safety                          0             0             1         1
Total superordinate constructs                   22            27            32        80
We gave respondents a set of nine Web page screen dumps to sort; they sorted them repeatedly into groups, using one sorting criterion at a time. The respondents chose the groups and the sorting criteria.11 We used this technique to identify

■ the attributes of interest—that is, the specific properties of a Web page that contribute to its perceived quality;
■ the unit or scale by which the attribute could be measured; and
■ the preferred (target) values of the measure that would indicate good quality.
The respondents were a group of farmers, a group of Web designers, and a group of laypeople. Each group comprised four respondents. We asked the latter two groups to imagine themselves as the farmers so that we could investigate differences between experts and novices with relation to appropriate measures for this domain. The issue of roles and viewpoints is important, because respondents are frequently able to perform the sorting task from more than one viewpoint, resulting in very different responses. So, the instructions to respondents should frame the task, explicitly stating the viewpoint from which to perform the sorts. In addition, there are often well-established different ways (facets)11 of categorizing a domain even within a single viewpoint. This set of factors can lead to inconsistent or contradictory results if the respondent is using more than one viewpoint, role, or facet without making this explicit. Simply asking the respondent which viewpoint he or she is using will resolve this.

Results

Results from card sorts can be analyzed in various ways. The sequence described here is the one recommended by Gordon Rugg and Peter McGeorge. It starts with the data in its rawest form and condenses the data further at each stage.

Analysis 1: Number of criteria and categories

Counting the number of criteria each respondent used and the number of groups ("categories") used within each criterion is one way to start. This can serve to roughly indicate expertise, because in general more experienced respondents will identify more criteria and categories. In this case study, the number of criteria for sorting varied considerably between groups, with an inverse relationship between expertise and number of criteria (see Table 1).

Analysis 2: Content analysis into verbatim, gist, and superordinate agreement

The next stage involved agreement (commonality) between respondents. There are two forms of agreement: verbatim (when different respondents use exactly the same words) and gist (when different respondents use different words for the same thing). Independent judges assess gist agreement. Judges find the categories a useful indicator of whether two criteria are equivalent. For brevity, we have not included examples of this here, but the format in which the results are tabulated is the same as in Table 2.

Once verbatim and initial gist agreement has been analyzed, the criteria or categories can be grouped into superordinate (higher-level) constructs, again using an independent judge. For instance, "clarity of page content" and "how easy pages are to read" can be grouped into the superordinate construct of "clarity of information." Table 2 shows the superordinate constructs identified in our case study, by a colleague acting as independent judge. Table 2 also identifies the number of respondents in each group who used the construct.

Analysis 3: Distribution of constructs across groups of respondents

Identifying the verbatim, gist, and superordinate constructs helps us see which groups used which constructs. It is most practical to do this with superordinate constructs, where the absolute numbers are high enough for any differences between groups to be visible. Table 3 shows the superordinate constructs applicable to all the respondent groups.

We can also examine permutations of commonality between combinations of respondent groups. This can be useful when developing software that will be used by different types of users; it helps identify the measures that are relevant across the different types. We found that there was more commonality between farmers and laypeople (11 superordinate constructs in common) than between farmers and Web designers (nine superordinate constructs in common). However, Web designers and laypeople shared four concepts that farmers did not identify.

Although many respondents generated the same superordinate constructs (as Table 2 shows), inspection of the names given to the categories for each original criterion shows low commonality levels. For instance, the superordinate construct "clarity of information" applied to respondents from all three groups, but the names given to the categories into which the cards were sorted revealed little or no commonality. Examples of categories used include "most concise/clear/too cluttered" from one respondent and "very good/good/bad" from another.
Table 3. Superordinate Constructs Generated by All Respondent Groups

Superordinate construct          Farmers   Web designers   Laypeople   Total
Order would read pages              1             1             1         3
Attractiveness of layout            1             1             1         3
Font size                           1             2             2         5
Clarity of information              4             1             1         6
Layout                              2             4             4        10
Subject matter                      2             2             2         6
Amount of color (objective)         3             3             2         8
Use of color (subjective)           1             1             1         3
Total superordinate constructs     15            15            14        44
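The pairwise commonality reported above (11 superordinate constructs shared by farmers and laypeople, nine by farmers and Web designers) boils down to set intersections over each group's constructs. The following sketch is illustrative only: the construct lists are abridged, hypothetical samples rather than the full Table 2 data, so its printed counts will not match the study's figures.

```java
// Illustrative pairwise commonality between respondent groups via set intersection.
// The construct lists are abridged samples, not the full Table 2 data.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Commonality {
    public static void main(String[] args) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        groups.put("farmers", new HashSet<>(List.of(
                "clarity of information", "layout", "font size", "relevance to farmers")));
        groups.put("Web designers", new HashSet<>(List.of(
                "clarity of information", "layout", "font size", "information source")));
        groups.put("laypeople", new HashSet<>(List.of(
                "clarity of information", "layout", "font size", "typeface")));

        List<String> names = new ArrayList<>(groups.keySet());
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                Set<String> shared = new HashSet<>(groups.get(names.get(i)));
                shared.retainAll(groups.get(names.get(j)));   // set intersection
                System.out.println(names.get(i) + " and " + names.get(j) + ": "
                        + shared.size() + " constructs in common");
            }
        }
    }
}
```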
However, although the names varied, analysis of which cards were grouped together showed more commonality. For example, all the farmers identified cards 2, 5, 7, and 8 as having clearly presented information. This commonality does not extend between respondent groups because lay respondents categorized cards 5 and 8 as "too cluttered," and Web designers did not include cards 2 and 7 in the "good readability" category. We also asked the respondents to say which categories would be most desirable for each criterion (with the laypeople and the Web designers answering as if they were farmers). Table 4 shows the results.
Table 4. Preferred Value Ratings for Superordinate Constructs

Superordinate construct                Farmers     Web designers   Laypeople    Most desired value
Readability                            7/9         9               0            readable/easier
Order would read pages                 8           10              8            first
Attractiveness of layout               8           5               10           attractive/appealing
Font size                              3           6/8             7            medium/large
Clarity of information                 8/9/9/10    8               7            very good/easy/concise/quick/most concise
Layout                                 8/8         8/8/9/10        6/7/8/9      text broken/tables/consistent
Subject matter                         5/7         8/8/8           8            figures/charts statistics/campaign/reports/all
Amount of color (objective)            4/7/8       6/7/7           9/10         a lot color/blue and red/some
Use of color (subjective)              7           7               7            right amount/about right/not
Relevance to farmers                   10          0               9            pertinent/specific to farmer
Contains comparable data               9           0               9            contain statistics/included
Attention drawn to page/information    6           0               10/10/10     stand out/eye-catching/do/very
Information source                     0           2/8             2/8/9        both/British Sugar/factory
Factory information                    0           5               1/8          others/factory
Headings                               0           7               9/9          block and graphic/red
Time information relates to            0           9               8            current/all

Analysis 4: Further analysis

The previous analyses derive from the categorization used by the respondents. Performing further analysis using categorization of the investigators' choice has its advantages. Such advantages include analyzing the constructs in terms of subjective versus objective constructs—for example, too much color versus three to five different colors—or in terms of intrinsic constructs (inherent in the item) versus extrinsic ones (constructs that apply to the item but are not inherent in the item). Using statistical analysis to show clusterings of cards across criteria is also possible.11

The card sort session outputs can serve as inputs for another technique. The criteria elicited through card sorts can be used to generate a questionnaire, with categories in each criterion generating the possible response categories to each question in a closed-question questionnaire. Another option is using laddering to elicit goals and values associated with each criterion elicited through card sorts. By asking the respondent to choose his or her preferred category from the categories associated with a particular criterion, we can then use standard laddering procedures to elicit successively higher-level goals and values. The criteria elicited during a card sort session can also serve as cards for a sorting session to elicit underlying factors in the respondent's categorization—the repertory grid literature uses a similar approach.
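The statistical clustering of cards mentioned under Analysis 4 typically starts from a simple co-occurrence count: how often each pair of cards ends up in the same category across all sorts. The sketch below is illustrative; the category contents are invented rather than taken from the study.

```java
// Illustrative co-occurrence counting, the usual starting point for clustering
// cards across criteria. The category contents below are invented sample data.
import java.util.List;

public class CoOccurrence {
    public static void main(String[] args) {
        // Each inner list is one category from one single-criterion sort,
        // holding the numbers of the cards that were placed together in it.
        List<List<Integer>> categories = List.of(
                List.of(2, 5, 7, 8),
                List.of(1, 3, 4, 6, 9),
                List.of(2, 7),
                List.of(5, 8));

        int[][] together = new int[10][10];   // cards are numbered 1 to 9
        for (List<Integer> category : categories) {
            for (int a : category) {
                for (int b : category) {
                    if (a < b) {
                        together[a][b]++;     // count each unordered pair once
                    }
                }
            }
        }
        System.out.println("cards 2 and 7 grouped together "
                + together[2][7] + " times");
    }
}
```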
Discussion

Card sorts performed well in terms of eliciting attributes in a form that could be easily compared across respondents. Respondents found the technique easy to use, and the nature of the technique matched well with the visual nature of the domain being investigated. These findings are consistent with our experience of card sorts in other domains and suggest that the technique should have wider use. Card sorts alone cannot elicit all types of knowledge; they should be combined with other techniques as appropriate.10 For instance, for the dynamic aspects of this domain (for example, download time), a technique such as online self-report can detect attributes used in short-term memory.10

Attributes elicited

The farmers used a remarkably consistent number of attributes; interestingly, there was a negative correlation between the group's knowledge of the domain and the number of attributes that they used. This might be because the farmers were able to discard potential attributes that were of peripheral importance for the domain. Conversely, it might be because farmers' expertise is farming, not Web page design. This might also explain the finding that there was more commonality between farmers and laypeople than between farmers and Web designers. We might initially have expected the Web designers to be better than laypeople at seeing a problem from the client's viewpoint.

Respondents described the attributes using both objective and subjective criteria; the number of categories used in each criterion varied between two and four, with farmers on average using fewer categories than the Web designers and similar numbers of categories to the laypeople. This is consistent with the explanation that farmers are not experts in categorizing Web pages, which is what we would expect from the literature.12 Further work is clearly needed to establish the extent to which these findings could be generalized across this domain, but that is outside this article's scope.

Implications for measurement theory

One striking finding was the predominance of nominal (nonscalar) measures in the constructs that the respondents used. From the viewpoint of elicitation techniques, nominal categories are difficult to handle through repertory grids. An investigation of the same domain based solely on the repertory grid technique would probably have failed to elicit many of these constructs. There are also validity issues—for instance, the contradiction between preferred values relating to color. The preferred values in each group for use of color and amount of color are different (see Table 4). There are possible solutions to this. Considerable research in psychology has investigated how valid people's introspections are into their own knowledge, and within card sort methodology the concept of facets provides a possible explanation. This topic needs further investigation.

From the viewpoint of measurement theory, nominal categories can be a basis for measures, though they are often viewed as being less tractable and less powerful than scalar categories. If nominal categories are as predominant in other domains as in this one, the implication is that measurement theory must pay particular attention to ways of handling these categories, such as Boolean algebra, set theory in general, and rough set theory in particular. Subjective nominal measures are particularly difficult to handle. Problems arise when you attempt to use subjective measures to evaluate a Web page because different respondents might interpret the evaluation categories differently. This problem can be reduced by basing the evaluation on an assessment made by a random sample of users rather than a single quality assurance specialist. Unpacking different interpretations by using laddering—involving recursively eliciting explanations of how the respondent interprets a term, until unambiguous terms are reached—is another possibility.13
Card sorts and other knowledge elicitation techniques can assist software measurement activities by providing a mechanism for eliciting measures suited to abstract and novel situations. These techniques can easily be incorporated into standard quality measurement approaches. In particular, they provide a mechanism by which non-metrics experts can contribute fully to software quality measurement. In terms of measurement theory, the role of nominal categories needs further investigation, both to identify the extent to which these categories are used in different domains, and to raise awareness about methods for handling nominal categories in measurement theory. Anyone choosing methods to elicit measures should include at least one method that can elicit nominal categories. In terms of Web page quality evaluation, card sorts clearly offer a systematic, practical method for identifying the Web quality measures that stakeholders consider important and for comparing the choice of measures between different stakeholders.
About the Authors

Linda Upchurch is a systems manager for J. Scott (Thrapston) Ltd., UK. She received her B.Ed. from Leicester University and her M.Sc. in office systems and data communications from University College Northampton.

Gordon Rugg is a reader in technology acceptance in the School of Information Systems, University College Northampton. He received his PhD in psychology from the University of Reading, followed by postdoctoral research in the Department of Psychology at Nottingham University, in the School of Information Science at City University, and in the Centre for HCI Design at City University. He is the editor of Expert Systems: The International Journal of Knowledge Engineering and Neural Networks.

Barbara Kitchenham is the managing director of Butley Software Services Ltd. and works part-time as a principal researcher in software engineering at the University of Keele. She is a visiting professor at the University of Bournemouth and the University of Ulster. Her research interests are software metrics and empirical software engineering. She received her PhD from the University of Leeds. She is a Chartered Mathematician and a fellow of the Royal Statistical Society.
References

1. J. Nielsen and R. Molich, "Heuristic Evaluation of User Interfaces," Proc. ACM Conf. Human Factors in Computing Systems (CHI 90), ACM Press, New York, 1990, pp. 249–256.
2. E. Berk and J. Devlin, eds., Hypertext/Hypermedia Handbook, McGraw-Hill, New York, 1991.
3. B. Pfaffenberger, The Elements of Hypertext Style, AP Professional, San Diego, 1997.
4. V. Basili and D. Rombach, "The TAME Project: Towards Improvement-Oriented Software Environments," IEEE Trans. Software Eng., vol. 14, no. 6, June 1988, pp. 758–773.
5. R. Van Solingen and E. Berghout, The Goal/Question/Metrics Method: A Practical Guide for Quality Improvement of Software Development, McGraw-Hill, Maidenhead, UK, 1999.
6. J.-P. Jacquet and A. Abran, "From Software Metrics to Software Measurement Methods: A Process Model," Proc. 3rd Int'l Symp. and Forum on Software Eng. Standards (ISESS 97), 1997.
7. J.-P. Jacquet and A. Abran, "Metrics Validation Proposals: A Structured Analysis," Proc. 8th Int'l Workshop Software Measurement, 1999, pp. 43–60.
8. B.A. Kitchenham, S.L. Pfleeger, and N.E. Fenton, "Towards a Framework for Software Measurement Validation," IEEE Trans. Software Eng., vol. 21, no. 12, Dec. 1995, pp. 929–944.
9. L.C. Briand, S. Morasca, and V.R. Basili, "Property-Based Software Engineering Measurement," IEEE Trans. Software Eng., vol. 22, no. 1, Jan. 1996, pp. 68–86.
10. N.A.M. Maiden and G. Rugg, "ACRE: A Framework for Acquisition of Requirements," Software Eng., vol. 11, no. 3, May 1996, pp. 183–192.
11. G. Rugg and P. McGeorge, "The Sorting Techniques: A Tutorial Paper on Card Sorts, Picture Sorts and Item Sorts," Expert Systems, vol. 14, no. 2, May 1997.
12. M. Chi, R. Glaser, and M. Farr, eds., The Nature of Expertise, Lawrence Erlbaum Associates, London, 1988.
13. G. Rugg and P. McGeorge, "Laddering," Expert Systems, vol. 12, no. 4, Nov. 1995, pp. 339–346.
interview

Developers Need Some Slack

Ware Myers
Tom DeMarco enlightens us from a background of practical experience, several decades of consulting, and the reflections expressed in nearly a dozen books. That experience began with the ESS-1 switching system at Bell Telephone Laboratories and continued in Western Europe with distributed online banking systems. His books cover a broad gamut—the early ones dealt with the software field's technical aspects. By the mid 1980s, his books focused on the overriding importance of people. This effort continues in Slack: Getting Past Burnout, Busywork, and the Myth of Total Efficiency, published this spring by Random House.
As readers of this magazine, you are caught up in the mantra of our times: faster, better, and cheaper. But you, perhaps, and certainly the Masters of the Universe are less aware of that mantra's downside: When is it time to reflect, learn, or change? "Change" should actually be the fourth term in that mantra, because without it, an enterprise experiences dissolution, sometimes slowly, often quickly, as the history of the last few decades illustrates. In an interview with contributing editor Ware Myers, Tom DeMarco argues that the capacity to change grows in the interstices of busy-ness, that is, in the "slack."
IEEE Software: "Slack" is an unusual title for a work its cover calls "A Handbook for Managers, Entrepreneurs, and CEOs." The word calls to mind such concepts as negligent, careless, remiss, slow, sluggish, or indolent.

Tom DeMarco: It's true that you wouldn't want slack-jawed slackards working on your project, so in that sense the word is pejorative. But when a worker comes to you and asks, "Can you cut me some slack on this assignment?" I assure you that the word slack has no pejorative implications in his or her mind. If you conducted a survey that asked, "What change would make your job better from your perspective: (a) more salary, (b) more power, or (c) more slack?" you'd detect a clearly positive feeling about slack. Think of "slack" in its primary meaning: "not tight, taut, or tense."
IEEE Software: Yes, but the "tight" ship analogy seems to have been popular in management circles from the beginning. Why is it desirable now for software organizations to be "not tight" or "slack"?

DeMarco: The phrase "running a tight ship" comes to us from the British Navy, dating back to or before the Napoleonic wars. In those days, people on the captain's (manager's) staff were referred to as "hands." That's all they were: pairs of hands to do management's bidding. There could not be a worse model for today's knowledge organization. If you manage such an organization today, your workers are not "hands," they are "minds." When you run a staff of hands, you succeed as a function of how quickly and mindlessly they rush to do your bidding. When you run a staff of bright, creative, ambitious, interested minds, your success is much more a function of how well they innovate.

The persistence of the tight-ship analogy today stems, I believe, from talent-free management. If you haven't a clue how to motivate, encourage innovation, form rich communities of practice, help teams gel, or hire people who add immeasurably to your organization without disrupting its fragile chemistry, then you tend to talk about running a tight ship.

IEEE Software: Did Thomas Edison's observation that invention is 99 percent perspiration and one percent inspiration mean there was room in the invention game for only one percent of slack?

DeMarco: Edison would have been chagrined to hear that his quote implied that one percent is sufficient time for inspiration. I think what he meant was that there were 99 parts of sweat for each part of "Ah ha!" But that doesn't imply that there wasn't far more time available (certainly there was in his own schedule) for seeking inspiration. How about this for a more realistic approximation: Invention is one part inspiration, 99 parts waiting for (and seeking) inspiration, and 99 parts perspiration to make an innovation reality.

IEEE Software: You have conspicuously failed to sign up with Dilbert worshippers. "Scott Adams, his creator, is one of my heroes," you write. "Dilbert is a jerk." That seems harsh.
DeMarco: My point is a little different. Let's understand that Dilbert himself is at least partly responsible for his company's deplorable management. I say it's time to grow up and realize that people who keep their heads down and let bad management happen are not heroes; they are part of the problem.

IEEE Software: Let's say we are beginning to grow up. What can software people do about assignments that make them overstressed, overcommitted, and overworked to the point where, as you once put it, they vote with their feet?

DeMarco: Maybe nothing. I certainly understand the bread-and-butter requirement to stay employed, especially in a down market. But I'm a very patient man; I'm content to wait until the people who have learned to invest in slack for their knowledge workers rise up in management.

Some 50 years ago, Richard M. Weaver published the first edition of his classical work Ideas Have Consequences (Univ. of Chicago Press, 1984). The book's most important lesson is in its title: ideas matter. If we realize, for example, that Dilbert is no hero, little by little we shy away from Dilbert-like action (or inaction, in Dilbert's case). When we discover a good way to talk about the lovely growth-and-innovation aspect of slack, we begin to value it and ensure that we and our subordinates are nourished with sufficient slack.
While we’re waiting for the age of sufficient slack, why are we all so damned busy all the time—the very opposite of slack? I have a hunch that everyone in the corporate world knows deep in his or her heart of hearts that all this overwork is counterproductive. I call it “The Problem of Infernal Busy-ness.” We know it isn’t good for us to be so busy, but I think it also isn’t so good for the companies that employ us. When corporate employees are made too busy by organizational policy, the result is not in anyone’s best interests, certainly not in the company’s. As an independent worker, I too am sometimes too busy, but at least it’s my own problem. More importantly, I have begun to make once-unthinkable changes to my schedule. I have introduced a lot more slack into my routine. I took last summer off and did nothing more than garden and think. July/August 2001
IEEE SOFTWARE
91
And last year I stole away to write a novel (Dark Harbor House), which Down East Books published in November. I now recognize that a day with no choices—where the clock pushes me all day long to put one foot in front of the other and do exactly what has to be done—is a day with no real opportunity for growth and damn little pleasure.

IEEE Software: In a tight organization, with people working lots of overtime, can workers have business-related ideas on their own more-or-less slack time, such as during breakfast?

DeMarco: Sure, people who work overtime might dedicate some additional amount of their personal time to solving the company's problems. They might. But is this the best scheme the organization can come up with to get itself reinvented? I hardly think so. Counting on your overworked employees to reinvent your company out of the goodness of their hearts on the ride home hardly seems like a genius corporate strategy. Reinvention is essential; healthy companies allocate their resources accordingly. You make unstructured invention time an integral part of everyone's day. This is enlightened, but hardly radical. Good companies everywhere are doing it.

IEEE Software: How might management discourage further informal development of an idea it regards as dubious?

DeMarco: Companies are doing just fine at discouraging innovation without any help from me, so I decline to answer.

IEEE Software: How can management transform unofficial development of an idea into a formal project?

DeMarco: Transforming a skunk-works project into a real project is more delicate than you might think, because part of the energy of the project is due to its skunk-works nature. I think a key to making it official without damping the energy is something the late Peter Conklin used to call "enrollment management." (See P. Conklin, "Enrollment Management: Managing the Alpha AXP Program," IEEE Software, July 1996, pp. 53–64.) The name comes from the idea of "vision enrollment." Here the program's managers use vision to enroll the related groups in the goals of the program, which involves 2,000 people in tens of groups scattered throughout a very large company. There's more to it than just vision enrollment, of course. Go back and read the article.

IEEE Software: I did—very impressive, too. It's a realistic account of how a skillful manager brought in a big and difficult project on time. To go on to one last question, your new book is not entirely clear on the role of standards in software development. Where do you draw the line between standards as "a very good thing" and as "too much of a good thing?"

DeMarco: When you set out to make the point that standards are a good thing in general, you'll find yourself citing examples such as batteries, film cartridges, DIN exposure standards, screw threads, Ethernet connectors and cables, and RJ11 plugs. These are interface standards: They set rules for how a product matches up to other products in interaction. So, we tell Fuji film with great precision what size and shape a film canister has to be, where the sprocket holes belong, how much tail must be left out, and what camera-readable protocol to use in flagging film type. All this guarantees that you can drop a Fuji film cartridge into your Leica, Kodak, or Canon camera. In setting the standard, we typically don't say anything about how Fuji should go about implementing it—that would be a process standard. To my mind, interface standards are utterly necessary in the modern world, and process standards are a highly dubious proposition.
Tom DeMarco is a principal of the Atlantic Systems Guild, a New York and London-based consulting practice. He consults mainly in the area of project management and litigation involving software-intensive endeavors. He has authored seven books on software method and management. Contact him at [email protected].

Ware Myers is a contributing editor for IEEE Software. His principal interest in recent years has been the application of metrics to software planning, estimating, bidding, and project control. Contact him at [email protected].
country report
Editor: Deependra Moitra ■ Lucent Technologies ■ [email protected]
Germany: Combining Software and Application Competencies

Manfred Broy, Technical University of Munich
Susanne Hartkopf and Kirstin Kohler, Fraunhofer Institute for Experimental Software Engineering
Dieter Rombach, University of Kaiserslautern
Traditionally, Germany's economic strength is based on various sectors of high-tech engineering (cars, mechanical devices, or domestic appliances) and business services (such as financial or logistic services). As more functionalities of these engineering products are implemented as software—and as business processes increasingly depend on such software—its importance increases and becomes a competitive advantage of German industry. As a result, Germany's software market is dominated by application software products—software embedded in processes or in nonsoftware products—rather than stand-alone software products, which leads to special demands on software and software development in companies of all industry sectors. This report elaborates on the typical characteristics of Germany's software-intensive industry sectors and the resulting challenges for technology, process, and the people involved.

German economy

Until 1980, the German economy was based on the production of consumer goods (satisfying the domestic market) and investment goods (meeting the high demand abroad). In the 1970s, the production and export of investment goods in the automotive, mechanical, electrical, and chemical sectors increased dramatically. Since 1980, Germany has turned into a service society: the majority of people now work in the service sector, where most of the value added is produced, and productivity is growing above average.1 Nevertheless, investment goods are still the single largest contributor to Germany's export industry, according to Germany's Federal Statistical Office (40.7 percent in 2000).2 Our economic strength in engineering and manufacturing, rooted in technical strength in engineering, and our move toward a service-based society offer great opportunities for Germany as a software-developing country.

German software

A recent government study evaluated the kinds of software that Germany develops.3 The study's results show that German industry's strength strongly influences the software market and the software products developed.
[Figure 1 (layered diagram): materials and equipment; processes (processors, procedures, and methods); component systems (memory, logic, microprocessors, sensors, and display components; industrial, telecommunications, and entertainment electronics); industrial systems (automation, mechanical engineering); and services (medical, transportation, environmental, and energy technology), with percentage figures ranging from 5–10% to 80–85%.]

Figure 1. Companies with software development in selected secondary sectors.
German Software Industry

When I recently talked to a high-school graduate about her plans regarding university education, my suggestion to consider studying computer science received strong rejection. When I inquired about the reasons, I learned that her perception of the job profile was very biased toward programming. My explanation that computer science graduates do not necessarily work in a software firm and do not necessarily program in front of a terminal all day took her by surprise. This story exemplifies, on the one hand, the specific situation of the software industry in Germany and, on the other, the fact that this specific situation has not even been communicated well to Germany's general population.

In Germany, the so-called primary software sector (such as companies that sell software products) is certainly smaller, as far as the number of software developers employed is concerned, than the so-called secondary software sectors (that is, companies in other sectors of industry that develop software to support intelligent services or software that is embedded in high-tech products such as cars). This strong bias toward building application software rather than software products is rooted in Germany's traditional strengths in various engineering sectors. When more functionalities of these traditional engineering products began to be realized as software, these companies employed highly qualified software development personnel. This close integration of software and application competence in Germany has implications on the education of software developers—that is, they have to understand the application domain well—and on subcontracting relationships with countries where this application knowledge does not exist. Overall, it is important to communicate the true engineering requirements for software developers and the challenging and interesting job profiles of software developers in this context.

This little story is a personal success story of mine: following a long discussion about these issues, the high-school graduate mentioned earlier decided to study computer science after all. —Dieter Rombach
Software is not only developed in software-developing companies and data-processing services (referred to as the primary sector), but has an increasing importance for other industrial sectors such as mechanical, electrical, and automotive engineering; telecommunications; and financial services (referred to as the secondary sector). This statement also holds for avionics, logistics, and administration. The government study pointed out that Germany is strong in producing individual products, not standard or mass products. Fifty-six percent of primary-sector companies specialize in developing application software products—they provide dedicated solutions for specific industry sectors. More than half of these companies develop custom-made software products with fewer than 100 installations.

The German software industry serves both production and service industries. Electrical and mechanical engineering companies are already highly involved in embedded software development—24 percent of software-developing companies produce embedded software. Moreover, 77 percent of electrical engineering companies and 60 percent of mechanical engineering companies produce embedded software. Fifty-five percent of primary-sector companies develop software for business applications—support for purchasing, billing, and controlling applications are the most popular.

Companies developing software

The German software development market is growing enormously. For primary-sector companies, this is easy to demonstrate by looking at the turnover growth. In the secondary sectors, the growth in software's importance cannot be quantified as easily, but it can be shown by several significant findings.

The primary sector comprises young companies (67 percent were founded after 1990). They were often founded as spin-offs of universities, research institutions, and software companies. Some came from established secondary-sector companies developing products out of their internal projects. Thus, companies from traditional sectors create new bases in the primary sector by developing their internal products (such as merchandising systems). More secondary-sector companies are starting to justify software development projects—no longer through their cost share in product development, but rather through the amount of sales that they can realize with this software's help. Companies increasingly understand that software is an investment and an enabling technology, no longer a pure cost factor. Figure 1 shows an estimation of how software permeates business processes as well as products and services. Figure 2 shows how the number of secondary-sector companies that are developing software increases with the number of employees.

Software development personnel

Estimates of the number of software development employees depend largely on the scope of the different studies: they cover different industrial sectors and use different terminology to name the professions. The estimate for people currently "qualified for IT" is about 1.7 million,3,4 whereas 177,000 employees "developed software" in 2000.2 All studies forecast huge growth over the next five years. By 2005, the number of people developing software in the primary and secondary sectors will rise from 177,000 to an estimated 385,000. This is an increase of almost 120 percent.4 This demand for highly qualified people cannot be covered by the standard courses of study at the universities and other educational institutions. On average, German universities graduate between 7,000 and 8,000 computer science engineers per year.5
[Figure 2 (bar chart): percentage of companies with software development in selected secondary sectors (electrical engineering, mechanical engineering, automotive manufacturing, telecommunications, and financial services), broken down by company size: 1–9, 10–49, 50–199, and 200+ employees.]

Figure 2. The cross-sectional importance of software in Germany.
Education challenges in software development

The demand for qualified professionals requires differentiation and expansion of software education and training. This can be supported by increasing capacities in computer science, better integration of IT application know-how in other courses of study, reduction of study times, and additional funding for apprenticeship professions in the IT area. Moreover, the computer science discipline has to address industry's specific demands more carefully.

Large companies are looking for engineers with software development degrees from universities. The foci of these employees' activities are in the area of requirements analysis and architecture. They develop the software's architecture and decide which components to purchase externally and integrate. Understanding the application and computer science concepts on a high level of abstraction is a prerequisite. These concepts are taught through application-oriented courses of computer science (such as practical computer science with a focus on mechanical engineering or business administration). Small- and medium-size companies, however, present an entirely different picture. They have a high demand for graduates of technical colleges. The focus of their employees' activities lies in programming and application adaptation.
The increasing demand for computer science professionals in leading positions also means that there is a huge demand for engineers with management skills, marketing know-how, and skills in intercultural communication.

Technology challenges in software development

Technological abilities must be improved to strengthen Germany's position as a strong manufacturing industry with a culture of individual rather than mass production. Therefore, research should concentrate on the following application domains in particular:

■ software in products (embedded software, such as in cars and cellular phones);
■ software for the support of services and business processes (such as insurance companies, public administration, the health sector, planning, and logistics); and
■ software for value-added services that accompany a product (such as traffic guidance systems for vehicles).
Within these domains, research should focus on the following:

Continued on p. 100