CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
PETER D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
FRANK HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
SAUL ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems—Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
DAVID GOTTLIEB and STEVEN A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Groups and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes
JERROLD E. MARSDEN, Lectures on Geometric Methods in Mathematical Physics
BRADLEY EFRON, The Jackknife, the Bootstrap, and Other Resampling Plans
Lectures on the Measurement and Evaluation of the Performance of Computing Systems

SAUL ROSEN
Purdue University
SOCIETY for INDUSTRIAL and APPLIED MATHEMATICS • 1976 PHILADELPHIA, PENNSYLVANIA 19103
Copyright 1976 by Society for Industrial and Applied Mathematics All rights reserved Second printing 1981 Third printing 1983 ISBN: 0-89871-020-0
Printed for the Society for Industrial and Applied Mathematics by J. W. Arrowsmith Ltd., Bristol, England BS3 2NT
Acknowledgments This monograph is based on a series of ten lectures delivered at a regional conference on Measurement and Evaluation of Computer Systems held at the College of William and Mary on July 15-July 19, 1974. The conference was sponsored by the Conference Board of the Mathematical Sciences and the National Science Foundation. Some of the research discussed here was supported by the National Science Foundation under Grant GJ-41289. I want to take this opportunity to thank the many people at the College of William and Mary who made the arrangements, and who extended their very gracious hospitality to me and my wife and to my youngest daughter, Janet. I especially want to mention Dr. Norman Gibbs, a former student and colleague at Purdue University, who first suggested to me the idea of a regional conference at William and Mary, Dr. William G. Poole, Jr. who handled many of the preconference negotiations, and Dr. Stuart W. Katzke, director of the conference, who so ably handled all of the details of organizing and running the conference. My thanks go also to the National Science Foundation and the Conference Board of the Mathematical Sciences, and to the participants in the conference who contributed so much to the exchange of ideas and information that took place there. SAUL ROSEN
Contents

Acknowledgments  iii

Chapter 1 INTRODUCTION
1.1. Aspects of performance  1
1.2. Capacity  6
1.3. Workload and benchmarks  7
1.4. Performance prediction  8
1.5. Literature  9

Chapter 2 A CONCEPTUAL MODEL OF A COMPUTING SYSTEM
2.1. Introduction  11
2.2. Resources and processes  11
2.3. Terminology  13
2.4. The supervisor process  13
2.5. The spooling process  16
2.6. Time-sharing process  16
2.7. Job processes  17

Chapter 3 PERSONAL OBSERVATION
3.1. Introduction  19
3.2. Human engineering  19
3.3. Turnaround and response  19
3.4. Direct observation  20
3.5. Elementary considerations  20
3.6. Examples  21

Chapter 4 ACCOUNTING SYSTEMS
4.1. Introduction  25
4.2. Validity of data  25
4.3. Understanding the system  25
4.4. Accounting and billing systems  26
4.5. Dayfile analyzer  33
4.6. Consistency of data  44

Chapter 5 SOFTWARE PROBES
5.1. Introduction  45
5.2. Sampling probes  45
5.3. Trace programs  47
5.4. Tracing probes  47
5.5. The probe process  50
5.6. Other tracing probes  71

Chapter 6 HARDWARE MONITORS
6.1. Introduction  73
6.2. IBM hardware monitors  73
6.3. Commercially available hardware monitors  74
6.4. Hardware monitors and software probes  74
6.5. Minicomputer-based hardware monitors  75
6.6. Hardware program monitors  78

Chapter 7 BENCHMARK WORKLOAD MODELS
7.1. Introduction  79
7.2. Real benchmark models  79
7.3. Synthetic benchmark models  82
7.4. Simulating a benchmark workload  82
7.5. System dependent workload models  84
7.6. Use of benchmarks in computer design  84
7.7. Comments on equipment selection  85

Chapter 8 SIMULATION
8.1. Introduction  87
8.2. Simulation languages  87
8.3. Characteristics and difficulties of simulation  89
8.4. A classic simulation study  90
8.5. General simulation models  90
8.6. Simulation of a time-sharing process  91
8.7. Length of simulation runs  99
8.8. The starting and stopping problem  100
8.9. Trace-driven modeling  101
8.10. Models driven by empirical distributions  102
Simulation program listings and outputs  103

Chapter 9 MATHEMATICAL MODELS
9.1. Models of computing systems  119
9.2. The single server queue  120
9.3. Time slicing  126
9.4. Mathematical modeling and simulation  129
9.5. Networks of queues  130

References  135
CHAPTER 1
Introduction

Our topic is the measurement and evaluation of the performance of computing systems. Just about everything that is done with the aid of a computer is an instance of computer system performance that could be a subject of measurement and evaluation. We shall limit the scope of our discussions to certain types of general purpose computing systems. In Chapter 2 we shall try to describe in broad outline some of the important characteristics of this type of system. We shall not discuss a particular real system, but rather an abstraction, a model that represents many computing systems of the class under consideration.

We are mainly interested in multipurpose computer systems whose characteristics involve multiprogramming, multiprocessors, communication lines serving many users, and on-line terminal systems. We study such systems, at least in part, because they are inherently interesting. It has been suggested that they are the most complex artifacts that man has ever devised. It has also been suggested that they are too complicated for the tasks for which they were intended, that their complexity makes them inherently inefficient, that they are temporary phenomena that will soon disappear. All of these suggestions may be true.

There are many practical reasons for measurement and evaluation. The advertising literature and even the technical literature (in this area it is sometimes hard to distinguish between the two) tell many stories of how applications of measurement techniques led to large savings, increases in efficiency, informed decisions about whether or not to upgrade hardware, etc. All of these practical considerations can be seen to result from a better understanding of computing systems, and the goal of all of our investigations is thus to gain a better understanding of computing systems through the techniques of measurement and performance evaluation.

1.1. Aspects of performance. The aspects of performance that we shall consider fall into three general categories: throughput, responsiveness and quality. There is, of course, a great deal of overlap and mutual interaction among these categories. We shall discuss computing systems that are combined hardware and software systems. The descriptive phrase "extended machine" was used in an earlier generation to represent this concept. The machine that serves the user, the machine the user knows and sees, is the combination of hardware and software that performs his jobs.
1.1.1. Throughput. The throughput of a system is the amount of work actually accomplished by the system. Throughput is measured in terms of items such as: number of jobs processed, number of processor hours used, amount of input/output data transferred, number of transactions processed, number of terminal hours logged, etc. A good billing and accounting system can give a useful combined measure of total throughput in terms of the amounts billed to the users of the system. The accounting system should also contain a great deal of information about various components of system throughput. The use of accounting system data in performance studies is discussed in Chapter 4.

The throughput of a system is usually considerably below the maximum capacity of the system. This is partly due to inefficiencies that are bound to exist in any complex system. It is partly by design, since a heavily loaded system almost always gives poor response to its users. An attempt to push the throughput of a system up to and beyond its capacity produces saturation phenomena that are not easily predictable. Figure 1.1 is a typical graph of throughput as a function of system load. Beyond a certain point the throughput actually decreases as the load increases.
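The shape of the curve in Fig. 1.1 can be imitated by a deliberately crude toy model, sketched below in present-day Python notation. The capacity figure and the quadratic contention penalty are invented for the illustration and have no connection with any measured system; the point is only that useful throughput rises with offered load and then falls as saturation effects take over.

    def toy_throughput(load, capacity=100.0, overhead=0.004):
        """Illustrative only: useful jobs per hour as a function of offered load.

        Useful work cannot exceed an assumed capacity, and an invented
        contention penalty grows with the square of the offered load, so
        beyond a certain point throughput declines as load increases.
        """
        useful = min(load, capacity)
        penalty = overhead * load * load
        return max(useful - penalty, 0.0)

    if __name__ == "__main__":
        for load in range(0, 201, 25):
            print(f"offered load {load:3d} jobs/hr -> throughput {toy_throughput(load):5.1f} jobs/hr")

The printed table rises to a peak and then falls off, which is all the model is meant to show.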
FIG. 1.1. Typical throughput of a batch system as a function of size of load
1.1.2. Responsiveness. Every task submitted to a computer has a stated or an implied response time requirement. Turnaround time is an important measure of the responsiveness of a batch processing system, or of the batch processing component of a more general purpose system. Turnaround is the elapsed time between the time the user submits a job and the time the results are available to him. In some computing systems there may be a significant interval between the time the user submits a job and the time the computer becomes aware of it. For example, there may be a courier service that delivers card decks from a pickup station to the computing center. Similarly there may be a considerable interval between the time the computer finishes printing its results and the time the output is seen by the user. Turnaround may be quite dependent on factors other than the capability of the computing system. It may depend very strongly on the administration of the computing center.
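As a small illustration of the kind of turnaround data discussed in this section, the sketch below (in present-day Python, with an invented record format and invented times) computes average turnaround overall and by job class from submit and results-available timestamps. Real accounting records, of the sort treated in Chapter 4, would of course be the source of such data.

    from collections import defaultdict

    # Invented job records: (job class, submit hour, results-available hour).
    job_log = [
        ("short",     9.00,  9.10),
        ("short",     9.25,  9.40),
        ("long",      9.30, 14.75),
        ("priority", 10.00, 10.60),
        ("long",     11.00, 18.20),
    ]

    turnaround_by_class = defaultdict(list)
    for job_class, submitted, available in job_log:
        turnaround_by_class[job_class].append(available - submitted)

    all_times = [t for times in turnaround_by_class.values() for t in times]
    print(f"average turnaround, all jobs: {sum(all_times) / len(all_times):.2f} hours")
    for job_class, times in sorted(turnaround_by_class.items()):
        print(f"  {job_class:8s}: {sum(times) / len(times):.2f} hours over {len(times)} jobs")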
Most computing centers establish policies that favor some classes of jobs at the expense of other classes. A typical policy attempts to guarantee good turnaround for all short jobs and for longer jobs that have some reason to be granted special priority. In a loaded system this may result in very long delays for long non-priority jobs. Some computing centers use their pricing policies to permit the user some control over turnaround. He may choose fast turnaround service at a premium price, or overnight or weekend service at a discount. Average turnaround, average turnaround by job class, and turnaround time as a function of job characteristics, are data that can give a good deal of insight into the performance of a system. These data are usually very sensitive to the nature and magnitude of the load on the system. A system with good turnaround is almost always running with a great deal of idle capacity. One of the recurring problems in performance analysis is to estimate the effect on turnaround when the load on a system increases and approaches the saturation level. 1.1.2.1. Response time. Response time is a term used in connection with on-line interactive systems, the so-called time-sharing systems, in which there are
a number of users (sometimes a large number) at keyboard terminals, or at other types of terminals that permit interaction with the computing system. Our discussion of such systems will be in terms of keyboard terminals. The response time for an interaction is measured from the time the user strikes a key that transmits a request for service (a carriage return or other attention key) to the time the response to that request starts printing at the terminal. Response time will be discussed in more detail in Chapter 2 and in later chapters.

Average response time and response time distributions are important performance data in time-sharing systems and in the time-sharing components of general purpose systems. A very large amount of the literature of computer performance analysis deals with the study of the response time characteristics of real systems and of models of such systems. Here again it is usually necessary to have a good deal of unused capacity in order to maintain good response. A very difficult and important problem is to estimate the effect on the response time distribution of an increase in the number of simultaneous users of a time-sharing system, and the extent to which service deteriorates as that number approaches and possibly exceeds the capacity of the system. A typical curve of response time as a function of number of terminals is shown in Fig. 1.2. Chapters 8 and 9 discuss mathematical and simulation models of time-sharing systems.

FIG. 1.2. Typical response time graph for a time-sharing system

Turnaround can be a factor in response time as experienced by the user. A good deal of activity at interactive terminals in a general purpose system consists of the preparation and editing of program files, which are sent as jobs to the batch system. The results can be retrieved and displayed at the terminal, errors can be corrected, the job resubmitted, etc. The speed with which jobs, especially small jobs, can be processed for the terminal user is an important aspect of the performance of such systems.

1.1.3. Quality. Performance analysts deal mostly with quantitative measures of performance. They respond to questions such as: How much work can the system do under actual or hypothetical conditions? How fast can a real or hypothetical system respond to requests for various kinds of service? Most of the literature in the field addresses itself to questions of this kind, and the measurements of throughput and response supply data that can be used in studies in these areas. It is also important, though usually much more difficult, to consider qualitative aspects of performance. In the study of such qualitative aspects, performance analysis overlaps many areas of computer science and computer engineering. It might be interesting to study the whole computer field from the point of view of performance evaluation, but that is not our purpose here. In this section we shall merely point out and discuss briefly a few of the questions that should be raised, and subjects that should be considered, in evaluating the quality of performance of a computing system.

1.1.3.1. Hardware reliability. Computing systems and components of systems experience failures that cause all or part of the system to malfunction. The mean time between failures is often used as a measure of hardware reliability. This is a
much better measure than up-time or down-time percentages, which were popular measures for the early computer systems and are still sometimes used. Systems that were essentially unusable because of high error frequency could still report 90% and more up-time if the duration of each malfunction were short. Mean time to failure is not always unequivocal, since it is not always clear whether a number of failures that occur close together should be reported as multiple failures or as a single failure. Another useful performance measure in this area is mean time to repair. It is usually assumed that the computer hardware gives correct answers, and this is generally true. Even though they are infrequent, there are hardware errors, and the extent and validity of error detection and correction is an aspect of performance that should be considered. Many computing systems, even relatively modest ones, are networks that involve communications equipment and a variety of processors and peripheral equipment. The very high level of reliability that is usually present in the central computer does not give any insight into the reliability of the whole network. 1.1.3.2. Software quality. Some of the same measures that are used for hardware can be used for software in assessing its reliability. Mean time to failure and mean time to repair are relevant measures. Valid data in this area are rarely available. Software reliability has only recently started to receive serious attention as an area of research and study. The first major conference in this area was held in 1973 [IE73]. Software quality considerations include the quality and availability of language processors, of libraries, and utility routines. The nature and ease of use of the job control language can be an important factor in performance. Many computer resources are wasted because efficient ways of using the system often require that users of high-level programming languages must know and use difficult low-level command languages. The quality of the documentation may be another important performance factor. In time-sharing systems the quality of the editors, the capacity and convenience of file-storage systems, the availability of incremental compilers and other appropriate software, are at least as important as average response time in evaluating the performance of the system. There are time-sharing systems that cost as low as $1000-$2000* per terminal served, and others that cost $10,000-$20,000 and more per terminal. These systems may very well all have similar response time characteristics. At least some of the differences in price can be attributed to qualitative differences in performance that are seldom measured or analyzed in detail. 1.1.3.3. Quality of service. There are many management considerations that affect the quality of performance of a computing system. In a poorly run or overloaded center, jobs may be lost, tapes that should be protected may be overwritten, poorly maintained printers may give unsatisfactory output. Job submission and retrieval facilities may be inconvenient or inadequate. The quality
of the operational policies and the competence of the operating staff can often have more effect on performance than the quality of either the hardware or the software. 1.2. Capacity. The capacity of a general purpose computing system is defined as the maximum throughput that the system is able to achieve while providing the required level of responsiveness and quality. Other definitions of capacity might be appropriate in special purpose systems. For example, the capacity of a time sharing system might be defined as the number of simultaneous on-line users that the system can support with a stated mean and/or maximum response time. A major goal of many performance studies is to estimate the capacity of existing systems and of proposed systems. There are many factors that contribute to the capacity of a computing system. One of these, often overemphasized but still important, is the speed of the computing equipment. 1.2.1. Central processor speed. In his very interesting history of computers [GO72], Goldstine reports von Neumann's enthusiastic response when he was told that the Eniac project was developing a machine that would perform more than 333 multiplications per second. In very early computers a few simple data like multiply time, add time, memory size, and memory cycle time, were sometimes enough to characterize the capacity of the computer. The "Gibson mix" was the best known of many efforts to provide a more precise evaluation of the raw speed of a computer, by using a weighted average of the execution times of a number of different instructions. The weights were the expected frequency of occurrence of the various instructions in typical programs. Another approach to measuring processor speed, and to comparing the speed of different computers, is the use of kernels. Kernels are short examples of typical computer code whose speed of execution can be measured or estimated. A measure of processor speed that is often used, especially in promotional literature, is the maximum number of instructions that can be executed per second. There is still some excitement generated by announcements that computers can (or will be able to) execute hundreds of millions or even billions of instructions per second. These speeds represent very impressive technological achievements, but the speed attainable in short bursts by the fastest components of a computing system usually gives very little information about the capacity of a system. 1.2.2. Disc-based systems. The high speed processing units with their correspondingly fast central memory usually end up spending large amounts of time waiting for the relatively slow peripheral storage devices, and the even slower input-output units. Even when central processors are not waiting, they are usually forced to devote much of their processing time to "overhead" tasks of the large and inefficient operating systems that came along with the "third generation" computing systems.
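Returning to the processor speed measures of section 1.2.1, the sketch below shows, in present-day Python notation, the arithmetic of a "Gibson mix" style figure: a weighted average of instruction execution times, with the weights taken to be the expected frequencies of the instruction classes. The classes, weights and timings here are invented for illustration; they are not the actual Gibson weights or the timings of any real machine.

    # Invented instruction-class weights (expected frequencies summing to 1.0)
    # and per-class execution times in microseconds for two hypothetical machines.
    weights = {"load/store": 0.35, "fixed add": 0.25, "branch": 0.20,
               "multiply": 0.12, "floating add": 0.08}

    times_machine_a = {"load/store": 1.0, "fixed add": 0.8, "branch": 1.2,
                       "multiply": 4.0, "floating add": 3.0}
    times_machine_b = {"load/store": 0.6, "fixed add": 0.5, "branch": 0.9,
                       "multiply": 2.0, "floating add": 1.5}

    def mix_time(times):
        """Weighted average instruction time, in microseconds, under the mix."""
        return sum(weights[c] * times[c] for c in weights)

    a, b = mix_time(times_machine_a), mix_time(times_machine_b)
    print(f"machine A: {a:.2f} microseconds per weighted instruction")
    print(f"machine B: {b:.2f} microseconds per weighted instruction")
    print(f"machine B is roughly {a / b:.1f} times faster than A on this mix")

On these invented figures machine B comes out roughly 1.7 times faster than machine A, even though its advantage on the individual instruction classes varies from about 1.3 to 2.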
Some of the first performance measurements on the then new OS 360 system were presented by IBM to the SHARE organization in 1965. In a number of test runs there was little difference in total processing time required by the slow model 40 processor and the very much faster model 65 of the 360 line. This was a dramatic illustration of the fact that the system that was being tested was a disc-based and disc-access limited system. It was only through the improvements of peripheral storage (faster devices, more and faster channels) and through more effective use of peripheral storage (for example, multiprogramming, buffering, etc.), that the extra speed of the main frame could be translated into extra capacity for the system as a whole.

1.2.3. Other hardware and software features. In a paper written in 1968 [RO68], I stated that "all of the (hardware) features that have been designed into digital computers may be considered to be ... reflections of software needs." I might equally well have stated that all hardware and software developments are a reflection of the need and desire to improve the capacity and the performance of computing systems. There is a great deal that could be said about the ways in which the logic of hardware and software organization affects capacity and performance. A discussion of the many aspects of hardware and software that influence performance would be an appropriate subject for a text in computer design. There are no such texts, but the reader is referred to [BU62], [TH70] and [BE71] for interesting discussions of hardware performance considerations in the design of specific computers. We shall not pursue this subject further here, since it would take us beyond the projected scope and magnitude of this monograph.

1.3. Workload and benchmarks. The workload of a system is defined as the set of tasks or jobs that the system is called upon to process. In a system in which the workload characteristics are known and can be specified quantitatively, it is possible to define a measure of capacity as the number of workload units that can be processed in a unit of time. Thus, a system whose only function is to handle transactions of a certain kind (for example, an inventory control system) can be said to have a capacity of so many transactions per hour. If several different computing systems are proposed to handle this type of workload, it is only necessary to measure or estimate their average time per transaction in order to obtain an estimate of their capacity for handling the workload.

A benchmark is a program or a set of programs that is selected or designed for the purpose of testing and/or measuring the performance of a computing system, or comparing the performance of a number of computing systems. Benchmarks are usually chosen or constructed to be typical of the jobs a system runs, and representative of the system's workload. For a system whose workload can be characterized by a benchmark set of programs, the time it takes to run the benchmark set (or preferably the reciprocal of that time) can be taken as a measure of the capacity of the system.
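To make the reciprocal-of-run-time idea concrete, the following sketch compares three hypothetical systems on the same hypothetical benchmark set. The elapsed times are invented; the capacity measure is simply the number of benchmark sets that could be processed per hour.

    # Invented elapsed times, in minutes, to run one benchmark set on each system.
    benchmark_minutes = {"system X": 42.0, "system Y": 28.0, "system Z": 35.0}

    # Capacity measure: benchmark sets processed per hour (reciprocal of run time).
    capacity = {name: 60.0 / minutes for name, minutes in benchmark_minutes.items()}

    baseline = capacity["system X"]
    for name in sorted(capacity):
        print(f"{name}: {capacity[name]:.2f} benchmark sets per hour, "
              f"{capacity[name] / baseline:.2f} times system X")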
Such benchmark sets are frequently used in equipment procurement studies to provide an estimate of the relative capacities of different systems, and of different system configurations. Benchmarks and workload characterization will be discussed in Chapter 7.

In a general purpose system it may be difficult or even impossible to provide a valid characterization of the workload, especially since the workload may vary significantly, even over relatively short periods of time. There may exist only a very general concept of the kinds of work that such a system will be required to handle. This is characteristic of the many situations in which the computer functions as a public or semi-public utility. Examples are computing centers at universities and research establishments, computer service bureaus, and many multipurpose industrial and military computer installations. For existing systems of this type various measurement techniques can be used to obtain measures of throughput under a variety of real and synthetic workloads. The measurement techniques will be discussed in later chapters. Measures of throughput under varying load conditions can be used to obtain estimates of capacity as a function of the nature of the workload of the system.

1.4. Performance prediction. The performance analyst is very often faced with the problem of estimating the capacity (and other performance characteristics) of systems which cannot be measured under actual production loads. These are estimates that are used in decisions about designing or procuring and installing new systems, and making additions or changes in existing systems. In some cases examples of the proposed new system exist, perhaps in prototype form, in the manufacturing plant or in another user's installation. In such cases a certain amount of testing and measurement may be possible, and the analyst is then called upon to design such tests and to make extrapolations from them. Systems on which no measurements at all are possible provide a greater challenge and much greater difficulty.

Every hardware and software design project involves, either explicitly or implicitly, predictions concerning the performance of the resulting product. Manufacturers who develop new products, and users who order products before their use is well established, often stake very large amounts of money on the expected performance of a product. It is surprising how little effort has been devoted to performance prediction, even for huge projects involving expenditures of millions of dollars over periods of many years. As the computer industry has matured, there has developed a greater appreciation and understanding of the need for performance prediction and evaluation, and the major computer manufacturers and some government and industrial organizations have established groups and departments devoted to these efforts. An example is the Federal Computer Performance Evaluation and Simulation Center, which was set up by the General Services Administration in Washington, D.C. [CO74].

Performance prediction is not and probably can never be an exact science. Simulation and mathematical modeling techniques can be used to estimate aspects of the performance of new systems, and of the effect of changes in existing systems. Some of these techniques will be discussed in later chapters. Modeling
techniques are important for developing insight into performance, and they can lead to the solution of many interesting performance problems. Performance analysts must be aware of these techniques and of the problems and areas in which they can be applied successfully. They must also be aware of their limitations. Real systems, especially general purpose systems, are often very complex structures that cannot be represented adequately by the relatively simple models for which analytical solutions can be found or for which successful simulation studies can be devised. In discussing some specific changes that were considered in connection with the Purdue MACE operating system on the CDC 6500 [RO73], I stated: "It would be nice to have an accurate model that could predict the change in performance that would be produced by a change in a system component. We have expended considerable effort in applying modeling and simulation techniques, but the models still seem to be much too coarse to be able to give us this kind of information. With the present state of the art, it is still necessary to rely on more informal techniques based on careful extrapolation from measurements of the existing system, to estimate expected performance improvement." There is a great deal of work going on in the area of improving techniques for modeling, and for the solution or simulation of models, some of which will be discussed in Chapters 8 and 9. 1.5. Literature. There have been a large number of survey papers and general discussions of performance evaluation, including several books [DR73], [ST74] and a number of bibliographies [BU69], [CR71], [AN72], [MI72]. Every major computer conference proceedings contains articles in this area, and there have been a number of conferences specifically in the area of measurement and evaluation of performance [AC71], [AC73]. Because of the dominant position of IBM in the computer field most performance studies have involved IBM equipment, and the IBM Systems Journal provides some of the most useful papers on performance analysis. Their special issue on system performance evaluation (Volume 8, No. 4, 1969) contains a number of papers that have had a great deal of influence on subsequent developments in this area. The SHARE organization of large IBM users has had an ongoing Computer Measurement and Evaluation project, and selected papers in this area presented to SHARE in the years 1967-1973 are published in [SH74]. A great deal of the performance evaluation literature deals with immediate practical problems in connection with specific systems, and many of the papers are only of transient interest. Some of the best general papers in the area of performance evaluation are those written by T. E. Bell while he was associated with the Computer Performance Analysis project at Rand Corporation. In [BL73] he lists the important tools (or techniques) of performance evaluation: personal inspection, accounting data, hardware monitors,
software monitors, benchmarks, simulation models, analytical models. Any author who plans to cover the field of computer performance analysis at all completely should include a discussion of these techniques and their application. The relative weight (that is, the amount of space) given to each is a matter of background and interest. Some authors might choose to devote a great deal of space to queuing theory, in an attempt to develop a mathematical basis for performance analysis. Others feel, and state, that no progress will be made until we find a way to measure and model the workload of a computing system, an area subsumed under the heading of benchmarks in Bell's list of tools and techniques. Still others prefer to concentrate on the gathering and analysis and reduction of real data from running systems. All of Bell's seven categories will be covered to some extent in the chapters that follow. The size of each chapter is not necessarily an indication of my opinion of the relative importance of each topic. Personal inspection is important, but there is not very much of a general nature that can be said about it, and its chapter is therefore quite short. The chapter on hardware monitors is probably shorter than it should be, but my own experience in this area is quite limited. The chapter on simulation may be longer than it should be relative to the others, but I thought it would be interesting to include actual examples of simulation programs and their output. The examples and accompanying text expanded that chapter, perhaps beyond its level of importance in the total performance evaluation picture.
CHAPTER 2
A Conceptual Model of a Computing System

2.1. Introduction. In subsequent chapters we shall present and discuss measurement and evaluation techniques, with examples of the application of these techniques to one particular computing system and with data gathered during the running of that system. A complete description of the system would run to hundreds or even thousands of pages, depending on the level of detail of the description. It would be necessary to know a great deal about the system in order to judge the validity of the data presented and in order to use and interpret such data in an analysis and evaluation of the performance of the system. However, the measurement and evaluation techniques that are used and the data, considered simply as examples of the kind of information produced by the measurement tools, can be discussed and understood in terms that apply to a large class of computing systems of which this particular system is one instance. This class of computing systems can be described by means of a conceptual model that is an abstraction of characteristics that are common to most, if not in all cases to all, members of the class. This chapter is a brief presentation of such a conceptual model. It is an attempt to present it on a level of detail that will be adequate for an understanding and appreciation of the material in the chapters that follow. The reader can judge its success for himself.

The system on which the data presented in later chapters was collected is the Purdue MACE operating system that runs on Control Data 6000 (or Cyber 70) equipment [AB70], [AB74]. This system has much in common with Control Data's KRONOS system, and also (though less) with their Scope system. These systems and IBM's various systems on their 360 and 370 computers were kept very much in mind in developing the conceptual model presented here, but there was also an awareness of features of other large systems, including the Univac EXEC systems on their 1100 series, and the GCOS systems on the large Honeywell machines. The model is based on these large scale general purpose systems, but it may also apply to smaller systems, even to some of the very capable mini-computer systems that have been developed. All features of the model need not be present in a particular instance of a system that is described in terms of the model. Figure 2.1 presents a skeletal outline of the system model.

FIG. 2.1. Conceptual model of a computing system

2.2. Resources and processes. The elements of the computing system model are resources and processes. The primary resources are central processors, central
memory, channels and peripheral storage. Input-output devices, terminals and auxiliary processors (for example, front-end or communications processors) are secondary resources that are often under the control of a specific process. A process is an activity of the system that involves the execution of sets of programs and the use and modification of sets of data, in the performance of a major function for the system itself or for a user of the system. The programs and data that are used by a process reside in peripheral storage. (There are exceptions in the case of processes that communicate with external devices, but these will be ignored in this discussion.) A process becomes active when part (in some cases, all) of its programs and data are moved into central memory. Processors execute programs for active processes, and channels transfer data and programs between peripheral storage and central memory. Processes use central memory space, central processor time, channel time and peripheral storage space; and records that are kept of their use of these commodities provide some of the primary data for performance analysis. 2.2.1. Permanent processes and jobs. There are a few processes that are designed to remain active over long periods of time. These include the supervisor process, the spooling process, and the time-sharing process, all of which will be described below. These are called permanent processes. The supervisor process, for example, must be active whenever the system is operating. It is characteristic of the permanent processes that they are not scheduled by the system job scheduler. Most processes are not permanent, and are referred to as jobs. Jobs are loaded into the system by a spooling process or by a time-sharing process. They are activated and deactivated by the system scheduler which is part of the supervisor process. Jobs that are in the system and that are not active exist as job files in queues in peripheral storage. 2.3. Terminology. A file is a named collection of data terminated by an end of file mark. In IBM terminology a file would be called a data set. The term core memory or simply core is often used in place of the term central memory, since between 1954 and 1972 almost all central memory was made of arrays of magnetic cores. The term disc memory or simply disc is often used in place of the term peripheral storage. Disc units are used as peripheral storage in most computing systems (in 1974), but drum storage systems are also widely used, and other, nonrotating, peripheral storage systems may soon be introduced. 2.4. The supervisor process. The supervisor process is a permanent process that contains an executive program and other programs that contribute to the system's ultimate purpose, which is to run user jobs. The executive program provides a variety of services for all of the active processes. Any significant change of state in the system involves one or more requests for the execution of executive functions, and a record of the execution of these functions can be one of the most
useful sources of data on the performance of the system. Part of the executive program must remain permanently resident in central memory, but the execution of some of the functions provided by the executive may require that program segments be loaded from peripheral storage. Our model does not require a knowledge of all or even most of the executive functions. These will vary in different hardware systems and even in different operating systems using the same hardware. The supervisor process has several major functions that are carried out by supervisor programs. Among these are the programs that carry out the system scheduling and the dispatching functions. It is important to distinguish clearly between these functions, since the nomenclature used in describing them is often ambiguous and can lead to some confusion. In our model the system scheduler manages the allocation of that part of central memory that is available for dynamic allocation to job processes. A separate memory management or paging supervisor program should be added if paging problems that are characteristic of virtual memory systems are to be discussed. This important specialized area of performance analysis has been discussed very extensively in the literature and will not be covered in this monograph. A good survey and bibliography in that area is contained in [DE70]. 2.4.1. The system scheduler. The scheduler is the supervisor program that decides which job processes are to be active in the system at any given time. The scheduler can make a job active by allocating central memory to it, and then by causing all or part of its job file to be loaded into central memory. Central memory space may become available either because a job has released some or all of the space allocated to it (for example, if a job terminates), or because a job is temporarily suspended, either by the scheduler or by the action of some other program. In the case of a suspended job, the scheduler can make the central memory space occupied by that job available for reallocation by causing the job to be rolled out from central memory to peripheral storage. There are various parameters associated with each job, including a job priority, which are used by the scheduler to determine which jobs should be made active, and which active jobs should be suspended. A scheduling strategy is an algorithm used by the scheduler in making this determination. It is assumed that a job that is preempted (suspended) and rolled out to peripheral storage goes into a rollout queue, from which it can later be rolled back into central memory and resumed (made active), and continue to execute from the point in its program at which it was suspended. The scheduling strategies will thus in general be preempt-resume strategies. The scheduling strategy used can have a very significant effect on the performance of a computing system. It is usually very difficult to predict the effect on performance of a change in scheduling strategy. Typically, such a change will improve some areas of performance and worsen other areas. Thus, even if one can predict the changes in performance, it may still be extremely difficult to evaluate them.
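The sketch below is one possible rendering, in present-day Python, of the preempt-resume scheduling just described: the scheduler allocates central memory to jobs, and may roll a lower priority job out to peripheral storage to make room for a more urgent one. The priority convention, the memory figures and the simple highest-priority-fits rule are all invented for the illustration; a real scheduler weighs many more parameters.

    from dataclasses import dataclass, field

    @dataclass
    class Job:
        name: str
        priority: int        # invented convention: larger value means more urgent
        memory_needed: int   # words of central memory requested

    @dataclass
    class Scheduler:
        total_memory: int
        active: list = field(default_factory=list)         # jobs holding central memory
        rollout_queue: list = field(default_factory=list)  # preempted jobs in peripheral storage
        waiting: list = field(default_factory=list)        # jobs not yet granted memory

        def free_memory(self):
            return self.total_memory - sum(j.memory_needed for j in self.active)

        def try_activate(self, job):
            """Make a job active, rolling out lower priority jobs if memory is short."""
            victims = sorted((j for j in self.active if j.priority < job.priority),
                             key=lambda j: j.priority)
            while self.free_memory() < job.memory_needed and victims:
                victim = victims.pop(0)
                self.active.remove(victim)
                self.rollout_queue.append(victim)  # preempt-resume: it can be rolled back later
                print(f"rolled out {victim.name}")
            if self.free_memory() >= job.memory_needed:
                self.active.append(job)
                print(f"activated {job.name}")
            else:
                self.waiting.append(job)
                print(f"queued {job.name}: not enough central memory")

    scheduler = Scheduler(total_memory=100_000)
    scheduler.try_activate(Job("A", priority=1, memory_needed=60_000))
    scheduler.try_activate(Job("B", priority=2, memory_needed=30_000))
    scheduler.try_activate(Job("C", priority=5, memory_needed=50_000))  # forces a rollout

Running the three calls above activates A and B, then rolls A out to peripheral storage to make room for the higher priority job C.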
2.4.2. Dispatcher. One or more of the processors in the system are central processors. These perform the main computing and data manipulation functions of the system. The dispatcher is the supervisor program that assigns central processors to active processes. A process may be active and not need a central processor. For example, it may be waiting for an input/output operation to complete. An active process that needs a central processor is said to be in Wait state until a central processor is assigned to it. When a central processor is assigned to a process, the process is said to be executing. The dispatcher maintains a dispatching queue of those processes that are in Wait state and those that are executing. The central processor time allocation unit is called a quantum. A quantum is the maximum amount of time that a process can use a central processor before that processor becomes eligible for reassignment by the dispatcher. Whenever a central processor becomes eligible for assignment, because it has been interrupted or has become idle or because a process has finished a time quantum, the dispatcher assigns the processor to one of the processes in the dispatching queue in accordance with its dispatching strategy. If the queue is empty, the processor remains idle and eligible for assignment. At the end of a quantum the dispatcher may permit a process that was executing to retain the central processor for an additional quantum, or it may put that process into Wait state and reassign the central processor. A typical dispatching strategy might give permanent processes priority over job processes, but assign quanta of central processor time to job processes on a round-robin basis. A typical quantum might be 10 or 20 milliseconds or more. The quantum size is an item in the dispatching strategy. Considerations in determining quantum size include the speed of the central processor, and the time it takes to switch a central processor from serving one process to serving another. The purpose of the round-robin allocation of central processor time is to permit all active jobs to make some progress toward completion while they are tying up central memory and possibly other system resources. Another typical dispatching strategy is to attempt to favor input/output-bound jobs at the expense of compute-bound jobs. The value of this type of dispatching strategy can be enhanced in a system in which the scheduling strategy maintains a good mix of input/output-bound and compute-bound jobs in the set of active jobs. The size of the quantum, and the dispatching strategy may have significant effects on system performance. These effects have been studied in [BA70] and [SH72]. Some systems permit a job process that is using a central processor to retain that processor for as long as it can use it, subject only to preemption by the supervisor process. The effect is the same as would be observed in a system that uses a very large central processor quantum size. This may be a valid strategy in a system in which switching from one task to another is prohibitively expensive, or where there is only very limited multiprogramming activity. There is some danger of confusion between the use of the word quantum in connection with central processor dispatching, and its frequent use in the computer literature in connection with time slicing in time-sharing systems. I have
suggested the use of the word partum in place of quantum in the present context, but I hesitate to use an unfamiliar word to describe a simple concept. In this monograph I shall try to avoid any confusion in the use of the word quantum through use of time slice rather than quantum in the discussion of time-sharing processes. 2.4.3. Input-output manager. The input-output manager is the supervisor program that handles requests from all of the processes for the transfer of information between central memory and peripheral storage units. It may also handle transfers of information to some or all input-output devices. A typical input request may ask for a record from a specified file. The I/O manager uses appropriate tables to determine the peripheral unit and the position on that unit at which the record is located. It must have a mechanism for denying or queuing the request if the unit is busy or, in the case of several units on the same channel, if the channel is busy. 2.5. The spooling process. The word "spooling" is based on the acronym Simultaneous Processing Operations On Line introduced by IBM in 1960 to describe a feature of their 7070 system. It has become an accepted part of computer system nomenclature. The spooling process is a permanent process that reads jobs from one or more input devices and puts the jobs into input queues in peripheral storage. It also reads files from output queues in peripheral storage and transmits them to output devices such as printers, punches, microfiche machines, etc. The output spooling process may have a scheduling function, and the order in which output files are released to output devices may have a significant effect on the observed performance of the system. 2.6. Time-sharing process. A time-sharing process is a permanent process that provides interactive computing service to a number of on-line terminals. Each active (i.e., logged-on) terminal alternates between user state and system state. User state time is sometimes called think time, although in our model it includes printing and typing time as well. System time is also called response time. For each active terminal associated with the time-sharing process there is a current task. At the end of a user's input to the system a carriage return or other attention signal transfers the terminal from user state to system state, and causes a request for service to be transmitted to the time-sharing process. There is space, usually in the fastest peripheral storage available, that is allocated as swapping storage for the time-sharing process. Each task associated with an active terminal has an information area in the swapping storage that contains programs and data necessary for initiating or continuing the execution of the programs that must be run in order to respond to the terminal's request. When a particular task is selected as the next one to be served, its information area is moved from swapping storage into a central memory area that belongs to the time-sharing process. All of the services provided to a permanent process, including quanta of central
processor time and access to peripheral storage, are available to the task while it is occupying the time-sharing process' central memory space. Unless a task terminates or is preempted, service to the task will continue for a length of time called a time slice, at the end of which its information area is swapped out, and the information area that belongs to the next task to be served is swapped into central memory for the next time slice. A time slice may be, for example, 50, 150, or 250 milliseconds. The time slice size is an element of the time-sharing process scheduling strategy. The amount of service time needed to respond to requests from terminals may vary from just a few milliseconds to several seconds or more. The service request may thus require one time slice (or a fraction thereof), or may require a number of time slices. The time-sharing process provides time slices to the system state terminals according to its scheduling strategy. The simplest strategy is a round-robin strategy in which a task that requests a time slice starts out at the end of a queue consisting of all tasks that need time slices. It goes back to the end of the queue for each additional time slice that it needs. When the time-sharing process has provided the required service for a terminal, it initiates a response to the terminal and places the terminal in user state. A variant of the model would permit a time-sharing process to provide service to more than one task at a time. In order to simplify some of the later discussions we shall assume that a time-sharing process serves one task at a time, and permit two or even more time-sharing processes to be active simultaneously. 2.7. Job processes. Jobs are users of system resources. On the most elementary level a job is specified by listing the nature and the amount of the system resources it will use. The simplest job specification will list the amount of central memory needed, the amount of central processor time used, and the amount of I/O transferred, possibly along with a priority level to be used by the system scheduler. In addition to the total resource requirements, it is necessary to model the sequence in which some of the resources are used. In a very simple model, an active job can be considered to consist of a sequence of alternating central processor requests and channel requests. A central processor request specifies the amount of central processor time required before the next channel request. A channel request specifies a channel number and the amount of information to be transferred. The devices associated with the different channels are not assumed to be identical, and thus the time it takes to satisfy an I/O request for a given amount of information may be quite different for different channels. A job that makes a channel request gives up the central processor, and becomes ineligible for central processor dispatching until the channel request has been satisfied. Some systems permit a job to retain a central processor while one or more input-output requests are being processed for that job. It is possible to achieve maximum overlap between central processor utilization and channel utilization for the job in such a system. Attempts to optimize the performance of one job process may sometimes result in slowing the execution of other active jobs, or even of permanent processes [RO73].
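To illustrate the simple job model just described, the sketch below (present-day Python, invented figures) represents a job as a sequence of alternating central processor and channel requests, and computes the elapsed time the job would need if it ran alone with no overlap between computation and transfer. The channel transfer rates are invented, and differ between channels because the devices on different channels need not be identical.

    # A job as alternating requests, following the simple model in the text:
    #   ("cpu", seconds)             central processor time until the next channel request
    #   ("channel", number, words)   amount of information to transfer on a given channel
    job = [
        ("cpu", 0.40),
        ("channel", 1, 20_000),
        ("cpu", 1.10),
        ("channel", 2, 60_000),
        ("cpu", 0.25),
    ]

    # Invented transfer rates in words per second for each channel.
    channel_rate = {1: 200_000, 2: 50_000}

    def elapsed_if_alone(requests):
        """Elapsed seconds for the job run by itself, with no CPU/channel overlap."""
        total = 0.0
        for request in requests:
            if request[0] == "cpu":
                total += request[1]
            else:
                _, channel, words = request
                total += words / channel_rate[channel]
        return total

    print(f"elapsed time running alone: {elapsed_if_alone(job):.2f} seconds")

With these figures the job needs 1.75 seconds of central processor time and 1.30 seconds of channel time, or 3.05 seconds in all; overlapping computation with transfers, as discussed above, is what a multiprogrammed system tries to exploit.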
A job may be permitted to have a terminal, or in some cases several terminals, associated with it. A job that provides interactive service to a terminal is not the same kind of thing as a time-sharing process. The job is not a permanent process. It is scheduled by the system scheduler. The scheduling algorithm may make special provision for such jobs. Note that the system scheduler is not involved at all with a time-sharing process since a time-sharing process is a permanent process. A system without time-sharing processes may still be a time-sharing system. The time-sharing process is a mechanism that attempts to provide some, possibly limited, interactive services to a large number of users. Its main service may, for example, be a file creator and editor that provides a conversational system for submitting batch jobs to the system.
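As a bridge to the simulation and queueing chapters, the following sketch works through the round-robin time slicing of section 2.6 in present-day Python. Each terminal's current request needs some total amount of service, receives it one time slice at a time, and its response time is taken here as the time from the request (all assumed to arrive together at time zero) until its service is complete. Swapping time and think time are ignored, and every number is invented.

    from collections import deque

    TIME_SLICE = 0.150  # seconds per time slice (an illustrative value)

    # (terminal name, total service seconds needed for its current request)
    requests = [("T1", 0.100), ("T2", 0.600), ("T3", 0.250)]

    queue = deque(requests)
    clock = 0.0
    response = {}

    # Round-robin: a task receives one time slice, then returns to the end of
    # the queue if it still needs more service.
    while queue:
        name, remaining = queue.popleft()
        served = min(TIME_SLICE, remaining)
        clock += served
        remaining -= served
        if remaining > 1e-9:
            queue.append((name, remaining))
        else:
            response[name] = clock

    for name, seconds in sorted(response.items()):
        print(f"{name}: response time {seconds:.3f} seconds")

With these figures the short request at T1 finishes within one slice, while T2, which needs four slices, waits behind the others and sees the longest response time.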
CHAPTER 3
Personal Observation

3.1. Introduction. It is possible to learn a good deal about the performance of a computing system through personal observation, without the use of sophisticated data gathering or analytical tools. Direct observation may be the only way to study and evaluate some of the important qualitative aspects of computing system performance. A computer room usually satisfies environmental requirements with respect to measurable quantities such as temperature and humidity, but even these should be checked. There are other unmeasurable environmental parameters, some of which are simply questions of good housekeeping. Dust and dirt in the computer room may be the cause of poor tape and disc performance. Cards and tapes stacked apparently at random on equipment are usually a sign of sloppy procedures that result in lost and improperly run jobs.

3.2. Human engineering. There are important human engineering considerations that can and do affect performance. The layout of the computer room, the accessibility of readers and printers and tape and disc units, the facilities for communication between operators in multi-operator installations, can all affect performance. A performance analyst can observe if input decks are handled properly and with care, if tapes are retrieved and restored and mounted and dismounted carefully and efficiently, if output is permitted to accumulate, or is quickly and accurately distributed back to the user. These are not all traditional areas of computer performance evaluation, but they are important parts of the performance of the system as seen by the user.

3.3. Turnaround and response. In the performance areas that are more frequently studied, such as turnaround and response time, an analyst who can arrange to use a computing system in the same ways in which typical users use it can often gain insights into the performance of the system that might not be apparent in data provided by the accounting logs and software and hardware probes that are discussed in Chapters 4, 5 and 6. A good way to get a feel for the turnaround provided by a system is to submit batch jobs and wait for the output. A good way to get some idea of the response time is to spend some time exercising various features of the system from a keyboard terminal. In both of these cases the analyst may also learn a great deal about the quality of the system's performance.

Informal observation and informal studies have many of the same problems as more formal ones. In order to be able to generalize about turnaround or response,
it is necessary to make tests under "typical" conditions, and since typical conditions can usually neither be defined nor achieved, it becomes necessary to make a large number of tests under different sets of conditions so as to make the results representative. The in-house analyst who lives with the system can make such frequent tests. The analyst who is called in for a brief study would probably have to rely on information provided by users of the system. Observing users of the system, and discussing with them their experiences and satisfactions and dissatisfactions with the system, can provide the analyst with a general feel for how well the system is performing, and may suggest the existence and source of performance problems and inadequacies.

3.4. Direct observation. An analyst who understands a particular system well can often get a great deal of information by observing system displays and lights associated with various hardware components. The fact that a unit is idle, or that a channel is underutilized, or that a particular system resource is a system bottleneck, would usually become apparent in any well-planned system evaluation study. Some of this same type of information can be obtained much more directly by observing that certain indicator lamps are rarely lit while others are almost continuously on. Intelligent and informed observation of a system can sometimes be the most important part of the performance evaluation process. It is sometimes possible to obtain considerable insight into performance problems by asking some obvious questions and making routine observations of how the system is working.

3.5. Elementary considerations. When an electrical appliance is not functioning, the advice usually given by the analyst (repairman) is, "Make sure that it's plugged in and turned on before you look for any other problems." There are analogous situations with respect to some computer performance problems, especially in connection with the use of peripheral storage devices. There are cases where devices have remained unused for extended periods of time because the "SYSGEN" never turned on the device as far as the operating system was concerned. There are situations in which a necessary software module was not installed or a necessary software change was not made at the time a hardware device was installed.

It is not unusual for equipment manufacturers to deliver equipment with powerful capabilities which are not used by typical FORTRAN and COBOL programs. The ability of a device to perform disc seeks on a number of spindles simultaneously means nothing if the seeks are not being issued in such a way as to take advantage of this feature. Position sensing capabilities built into modern disc pack controllers remain unused if the programming language processors, and the input-output packages they use, are not designed to use them. The performance analyst, who is aware of the current state of development of the standard software products, is often in a position to explain why the installation of new and more powerful equipment has not produced the performance improvements that were expected as quickly as they were expected.
3.6. Examples. Later chapters will include some examples of measurements taken on a particular CDC 6000 installation at Purdue University. The remainder of this chapter will contain a few examples taken from this same system, to illustrate the kinds of performance information that are available to direct personal observation, and some thoughts on the uses and limitations of this kind of information. There is no attempt at any kind of completeness, since there is almost no limit to the kinds of things that could be observed. Three separate work stations in the computing center were visited and observed, and some comments on each are presented here.

1. A card reader. As I walk down the hall in the basement of the computing center, I come to a self-service card reader with a rated card-reading speed of 1,000 cards per minute, and a queue of about 15 students waiting to load their jobs. I observe a very significant pause between the time that one student finishes loading his job and the time the next student's job starts feeding through the reader. This pause includes time for the first student to walk away from the reader, and time for the second student to approach the reader, insert a header card in front of his deck, put the deck into the hopper and start the reader. Then there is a short delay before the reader picks the first few cards, another very short delay while the software is presumably making some job validity checks, and finally the full speed reading of the rest of the deck. Since most of the decks are short, the rated maximum reading speed of the card reader has little effect on the actual throughput of the device. The mechanics of getting people up to and away from the reader are of great significance. The only way to get information about queuing at the reader is the very old-fashioned way of watching and counting. The user at the end of the queue, who may wait 15 to 20 minutes before he can reach the reader to submit his job, is not going to be impressed with internal reports of job turnaround time that ignore his waiting time.

2. A terminal room. Further down the hall there is a small room with four public keyboard terminals that are in almost constant use. I know that the system is very heavily loaded at this time, and as a result any system measurement tools would show that response is slow. Just what does this mean to these four users? I note that two of the terminals are clearly keeping up with, or staying ahead of, their users. At one of them the obviously inexperienced user is consulting a manual to figure out what to do next. At the second the user is creating a file, and his typing speed is clearly the limiting factor. At the third terminal the user is impatiently waiting for the results of a job he has submitted. Every 10 seconds the system types out his current job status, and from the looks of things, he would do better to request a batch printer printout and leave his terminal. The fourth user has just retrieved a file from the permanent file system, and even though this took almost a minute because of the heavy load, he seems satisfied with this kind of response for this type of request. There is not much information here, but repeated observation of terminal use does give one some background feeling that makes response time averages and distributions more meaningful. There are some obvious truths about the response
time of time-sharing systems that sometimes need to be reinforced. One such obvious truth is that whether or not response time is satisfactory depends on the nature of the application. Even when internal data shows that average response time is poor, it is possible, indeed even probable, that most users who are creating and editing small job files and occasionally submitting jobs to be run are getting better than adequate response. At the same time the student who is trying to complete a CAI (Computer Assisted Instruction) assignment in a half-hour segment of real time may find the response intolerable.

3. The system console. Among the innovations in large computer design introduced by the CDC 6000 systems, one of the most interesting was the use of a twin cathode ray tube console driven by display programs in the system's peripheral processors. These displays permit constant monitoring of the performance of the system. A major purpose is to give the operator information that will permit him to intervene intelligently in the operation. The amount and types of operator intervention that are required in the running of a computing system, and the nature and extent of the information that is immediately available to the operator, are factors that should be taken into account in an evaluation of system performance. This is a type of performance evaluation data that must at least partially be obtained by observing the operator and the displays.

There are many specialized displays available to the operator, but most of the data necessary for following the principal activities of the system are presented on a few major displays. Thus a single display shows the status of each active process (see Chapter 2) and of each processor. It shows the degree of multiprogramming, the activity of the preempt-resume scheduler, the activity of the processors as the job mix changes back and forth between processor limited and input-output limited, the activity of the channels, and thus indirectly the activity of the I/O devices. Another display shows the activity of the time-sharing subsystem, including the status of each terminal and the switching of process time among them. Another shows the status of the system queues, and the availability of peripheral storage space on public disc storage units. Several displays show the current values of parameters that can be changed directly from the console keyboard in response to unusual loading or queuing conditions. Changing these parameters alters some of the scheduling strategies of the system. For example, a change might increase or decrease the central memory space allocated to jobs submitted through the terminal system, or it might increase or decrease the priority given to jobs as their resource usage approaches their maximum allocation. Experience has shown that caution and restraint should be used in making any strategy changes in a system while it is running under a heavy production load. Changes often have unanticipated side effects, and may worsen the situation they were meant to improve.

Direct observation of the system displays provides a great deal of data, but the interrelationships among the data are not at all obvious. The observer can easily be misled into believing that he understands more about the system than he really
does. Personal observation is important for understanding and evaluating systems, but it presents only a small part of the total performance picture, of which other parts are discussed in subsequent chapters.
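The card reader example in § 3.6 lends itself to a simple back-of-the-envelope calculation. The sketch below is only an illustration; the deck size and the per-user overhead are assumed values, not measurements taken at the installation described above.

    def effective_reader_throughput(rated_cpm, deck_size, setup_seconds):
        # Effective cards per minute when every deck carries a fixed per-user
        # overhead: walking up to the reader, inserting a header card,
        # starting the reader, and the initial software checks.
        read_seconds = deck_size / rated_cpm * 60.0
        return deck_size * 60.0 / (read_seconds + setup_seconds)

    # A 1,000 card-per-minute reader fed 100-card decks with 45 seconds of
    # per-user overhead delivers only about 118 cards per minute, and a queue
    # of 15 such decks implies roughly 15 * 51 / 60, or about 13 minutes, of
    # waiting before the last user reaches the reader.
    print(effective_reader_throughput(1000, 100, 45))

The calculation makes the point of § 3.6 explicit: with short decks, the fixed per-user overhead, not the rated reading speed, determines both throughput and the queue wait seen by the user.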
CHAPTER 4
Accounting Systems

4.1. Introduction. This chapter has a twofold purpose. The first is to explain and illustrate one of the important sources of data and information about the performance of computing systems. The second is to discuss some general problems concerning the interpretation and validity of data. This latter discussion applies equally to the sections on data collection and analysis in other chapters. Here and in subsequent chapters we shall discuss some specific data gathering tools, and give examples of data from the Purdue MACE operating system running on Control Data 6000 series equipment. Data and techniques for obtaining data about specific systems are of little general interest in themselves. The samples of data presented in these chapters are not presented to give the reader information about the performance of this particular system. Their purpose is only to enhance the text by providing examples of kinds of data that may be of interest, and of some of the problems involved in collecting, understanding and verifying such data.

4.2. Validity of data. All data that is collected through the observation of complex systems should be approached with skepticism. I recently came across an interesting comment attributed to Sir Arthur Eddington [CH74] to the effect that "you cannot believe in astronomical observations before they are confirmed by theory". A similar statement could be made about observations that produce data in many other fields. For the computer field I would paraphrase Eddington's remark to state that you cannot believe in data about the performance of large computing systems unless they can be explained in terms of a conceptual model of the system. Even relatively simple performance statistics can be misleading. The physical sciences place a great deal of emphasis on the reproducibility of experiments. In matters of any importance, experiments are repeated in a number of laboratories by different groups, sometimes using different techniques, collecting data that may eventually establish a high level of confidence in (or possibly disprove) earlier results. The data that we deal with in computer performance studies are rarely of the type that will be confirmed by other independent studies. Decisions about the validity of data are usually based solely on the judgement and integrity of individual investigators.

4.3. Understanding the system. In order to understand and interpret data concerning the performance of a computing system, it is necessary to understand
the system itself at an appropriate level of depth and detail. In some cases a performance analyst with only a superficial understanding of a system can come up with important insights about the reasons why the system is not performing up to capacity. We have already discussed situations in which simple observations of the system in production can suggest important system improvements. Simple observations of routine data on the system can sometimes be similarly useful. Data on equipment utilization that show a printer or a disc unit or a channel consistently underutilized can suggest changes that result in better system balance and better system performance. System bottlenecks, especially very bad ones, can sometimes become apparent through relatively simple observation and analysis.

There are many aspects of performance that cannot be studied and understood without a detailed understanding of the system. It is usually not possible to achieve this type of understanding without a good deal of very specific knowledge of, and experience with, the system being studied or systems very much like it. Such knowledge may only be available to people who have worked very closely with the system: to high level system designers and implementers, and to performance evaluation experts associated with the designers and implementers. A performance evaluation group working for a computer manufacturer might be the only group competent to do an accurate evaluation of the performance of that manufacturer's hardware and software system. Data and evaluations produced by groups that have a vested interest in the sale of the system they are measuring will almost always show some bias, if only in terms of the selection of data that is to be presented, and the context in which it is presented. System development projects at universities and other research centers are often under great pressure to publish their accomplishments in the public literature, and there is frequently a tendency to publish data in such a way as to enhance the reputation of those engaged in a system development project, even when the performance of the system under development falls short of the system's original goals.

4.4. Accounting and billing systems. Accounting and billing are major problems in large general purpose computing systems. In a multiprogramming system the amount of time that a job occupies central memory and peripheral storage space, and the amount of processor time used, may vary significantly in different runs of the same job, depending on the resource requirements of the other jobs that are active while it is being run. It is not at all obvious just how to create a billing system that is fair to all users, and a considerable literature exists in this area that is not of direct interest here. At least partially as a result of these problems, every major operating system contains elaborate accounting subsystems that keep detailed records about the utilization of system resources. As a minimum, in a typical system, each job creates one or more accounting messages containing information about the resources that the job has used. These messages are saved in an accounting file that serves as the input to billing and accounting runs. In the standard IBM 370 systems the subsystem that produces the resource utilization records is known as the System Management Facility (SMF) [IB73],
which is itself an optional component of the operating system, and which contains numerous options that determine which data are to be included. In the CDC systems from which we shall draw our examples, the file that holds the raw accounting records is called the "accounting dayfile." Other terminology is used in other systems, but in almost every case the data that is produced can provide a great deal of information about the performance of the system, including the information contained in standard billing and accounting reports, but often a great deal more. There is almost no limit to the amount of summary data and the number of reports that can be produced from this data. Here, as in any large scale data collection and data reduction application, the power and speed of the computer and its printers can be a mixed blessing, since it is so easy to produce frequent, voluminous, and often superfluous reports.

4.4.1. Daily summary report. One of the most important measures of performance is the actual volume of production achieved in the day-to-day operation of the system. A short daily summary report can be very useful for keeping management and systems personnel informed about a number of performance parameters, and for making them aware of difficulties and anomalies that may occur. Weekly and monthly and other regular reports may also be useful, in the same way as they are useful in any large production operation, for management and control purposes. Here we shall only consider an example of a daily report, keeping in mind the fact that the management group that gets this report receives a similar report every day, and also receives periodic summary reports. A very brief glance is usually enough to determine whether anything unexpected happened on the previous day.

Tables 4.1 and 4.2 are two pages of a report that summarizes one day's operation (April 22, 1974) of the academic computing center at Purdue University. These tables are typical of this kind of report. They are not meant to show the results of a typical day's operation, since in this kind of system the computing load shows very marked daily and seasonal variations. A quick glance at this report by one who receives them regularly would indicate that this was a day on which total production was unusually heavy, but that nothing unusual or unexpected happened during that day. At the beginning of the report there is a "dead-start history" for the two major computers, a CDC 6500 and a CDC 6400 that run under the Purdue Dual MACE system. Dead starts correspond to IPL's (Initial Program Loads) in IBM systems, and to warm starts (and possibly cold starts) in a number of other systems. The data here reveal that there were no unscheduled dead starts, i.e., no system crashes and recoveries, on April 22. A study of these reports for the whole month of April showed that on the average there was one system recovery per day. Such recoveries require that jobs that are active at the time of the crash be restarted. Other jobs are usually unaffected. Recoveries of this type usually go completely unnoticed by those who have submitted batch jobs to the system. They are, of course, noticed by on-line users, and by anyone waiting to submit a job at the time the crash occurs.
TABLE 4.1
Summary of one day's operation of a computing centre
PURDUE UNIVERSITY CDC 6400/6500 DAYFILE SUMMARY, 04/22/74

6500: the dayfile covers 20.709 hours, from 09.04.20 on 04/22/74 through 05.46.52 on 04/23/74; this includes 18.602 production hours and 2.107 hours lost. The 6500 dead-start history lists the day's scheduled dead starts and the times at which the machine was taken off.

6400: the dayfile covers 20.208 hours, from 09.03.12 on 04/22/74 through 05.15.40 on 04/23/74; this includes 18.111 production hours and 2.097 hours lost. The 6400 dead-start history is given in the same form.

Combined machine performance history: 9557 jobs completed, 30.865 CP hours, 55.80 percent central processor utilization, 805.057 terminal connect hours, 26.775 plot hours, 4,453,910 lines printed, 71,004 cards punched.

The table continues with a job history by account (account groups 1000 through 9000, plus system overhead, with a total line), giving for each group the number of jobs, the CP hours, and the I/O units used.
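As an illustration of how little machinery is needed to turn raw accounting messages into a report of the kind shown in Table 4.1, the sketch below reduces one day's accounting records to a few totals and a breakdown by account. The record layout (job identifier, account, central processor seconds, lines printed, cards punched) is an invented stand-in chosen for illustration; it is not the actual format of the MACE accounting dayfile or of the SMF records mentioned above.

    import csv
    from collections import Counter

    def daily_summary(dayfile_path):
        # Reduce one day's accounting messages to a handful of totals,
        # one accounting message (record) per completed job.
        jobs_by_account = Counter()
        jobs = 0
        cp_hours = 0.0
        lines_printed = 0
        cards_punched = 0
        with open(dayfile_path, newline="") as f:
            for job_id, account, cp_sec, lines, cards in csv.reader(f):
                jobs += 1
                jobs_by_account[account] += 1
                cp_hours += float(cp_sec) / 3600.0
                lines_printed += int(lines)
                cards_punched += int(cards)
        print("%d jobs completed, %.3f CP hours" % (jobs, cp_hours))
        print("%d lines printed, %d cards punched" % (lines_printed, cards_punched))
        for account, count in sorted(jobs_by_account.items()):
            print("  account %s: %d jobs" % (account, count))

The real value of such a reduction lies less in the program than in the discipline of producing and reading it every day, so that an unusual day stands out at a glance.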