Advances in COMPUTERS
VOLUME 19

Contributors to This Volume
RICHARD DUBES, C. W. GEAR, DAVID K. HSIAO, A. K. JAIN, ROB KLING, H. T. KUNG, WALT SCACCHI

Edited by
MARSHALL C. YOVITS
Purdue School of Science, Indiana University-Purdue University at Indianapolis, Indianapolis, Indiana

ACADEMIC PRESS
A Subsidiary of Harcourt Brace Jovanovich, Publishers
New York  London  Toronto  Sydney  San Francisco  1980
COPYRIGHT © 1980, BY ACADEMIC PRESS, INC.
ALL RIGHTS RESERVED. NO PART OF THIS PUBLICATION MAY BE REPRODUCED OR TRANSMITTED IN ANY FORM OR BY ANY MEANS, ELECTRONIC OR MECHANICAL, INCLUDING PHOTOCOPY, RECORDING, OR ANY INFORMATION STORAGE AND RETRIEVAL SYSTEM, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.

ACADEMIC PRESS, INC.
111 Fifth Avenue, New York, New York 10003

United Kingdom Edition published by
ACADEMIC PRESS, INC. (LONDON) LTD.
24/28 Oval Road, London NW1 7DX

LIBRARY OF CONGRESS CATALOG CARD NUMBER: 59-15761
ISBN 0-12-012119-0
PRINTED IN THE UNITED STATES OF AMERICA
80 81 82 83    9 8 7 6 5 4 3 2 1
Contents

CONTRIBUTORS . . . vii
PREFACE . . . ix

Data Base Computers
David K. Hsiao
1. Introduction: The 90-10 Rule - A Characterization of the Problem . . . 1
2. Approaches to a Solution . . . 4
3. Where Are We Now? . . . 6
4. Two Kinds of Data Base Computers . . . 8
5. Can the New Data Base Computers Replace the Existing Data Base Management Software with Improved Performance? . . . 40
6. The Future . . . 58
References . . . 59

The Structure of Parallel Algorithms
H. T. Kung
1. Introduction . . . 65
2. The Space of Parallel Algorithms: Taxonomy and Relation to Parallel Architectures . . . 66
3. Algorithms for Synchronous Parallel Computers . . . 72
4. Algorithms for Asynchronous Multiprocessors . . . 93
5. Concluding Remarks . . . 106
References . . . 108

Clustering Methodologies in Exploratory Data Analysis
Richard Dubes and A. K. Jain
1. Introduction . . . 113
2. Data Representation . . . 117
3. Cluster Analysis . . . 123
4. Applications . . . 208
References . . . 215

Numerical Software: Science or Alchemy?
C. W. Gear
1. Introduction . . . 229
2. What Is Numerical Software and Why Is It Difficult? . . . 232
3. The Science . . . 241
4. The Alchemy . . . 245
5. Conclusion . . . 247
References . . . 248

Computing as Social Action: The Social Dynamics of Computing in Complex Organizations
Rob Kling and Walt Scacchi
1. Perspectives on Computing in Organizations . . . 250
2. The Computing System Life Cycle . . . 261
3. Computer System Use in Complex Organizations . . . 290
4. Impact of Computing on Organizational Life . . . 295
5. Conclusions . . . 317
References . . . 323

AUTHOR INDEX . . . 329
SUBJECT INDEX . . . 338
CONTENTS OF PREVIOUS VOLUMES . . . 346
Contributors to Volume 19

Numbers in parentheses indicate the pages on which the authors' contributions begin.

RICHARD DUBES, Department of Computer Science, Michigan State University, East Lansing, Michigan 48824 (113)
C. W. GEAR, Department of Computer Science, University of Illinois, Urbana, Illinois 61801 (229)
DAVID K. HSIAO, Department of Computer and Information Science, Ohio State University, Columbus, Ohio 43210 (1)
A. K. JAIN, Department of Computer Science, Michigan State University, East Lansing, Michigan 48824 (113)
ROB KLING, Department of Information and Computer Science, and Public Policy Research Organization, University of California at Irvine, Irvine, California 92717 (249)
H. T. KUNG, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213 (65)
WALT SCACCHI, Department of Information and Computer Science, and Public Policy Research Organization, University of California at Irvine, Irvine, California 92717 (249)
Preface
Volume 19 of Advances in Computers continues the publication of in-depth articles of important current interest to computer and information science. Despite the currency of the topics, subjects have been chosen for inclusion largely because they are expected to be of interest over a considerable span of time. The Advances provides an opportunity for renowned experts to write review or tutorial articles covering a major segment of the field. The very nature of Advances in Computers permits the publication of longer survey articles that have been prepared in a somewhat more leisurely fashion. Volume 19 continues a long and unbroken series that began in 1960 - in terms of computers, generations ago. Included here are chapters on a variety of important subjects, including the hardware of data base computers, software involved in numerical analysis, theory concerned with parallel algorithms, applications to data clustering, and the sociology of computing.

A major area of concern in computer and information science is the problem of handling very large data bases efficiently. David Hsiao treats in detail the problems faced by contemporary computers in attempting to process large data bases and shows why their performance is poor. He offers several solutions and treats at some length the data base computer, now feasible both because of available technology and because of a better understanding of data base theory and software. He predicts that data base computers will easily outperform current data base management software systems. This development in turn will allow the growth of data bases and permit many more applications, as well as the inclusion of new features in data base management.

Parallel algorithms have been studied, H. T. Kung points out, since the early 1960s. Increasing interest has been created by the recent emergence of large-scale parallel computers, and as a result many algorithms have been designed for various parallel computer architectures. Kung presents a number of different examples and studies them under a consistent framework, classifying these algorithms according to a number of attributes. His chapter provides a single reference for anyone interested in understanding the basic issues and techniques involved, as well as a reference incorporating the parallel algorithms that are in fact now available.

The practice of classifying objects according to perceived similarities began with primitive man, and modern science is in fact empirically based on clustering methodology. In their article on clustering methodologies, Richard Dubes and A. K. Jain review cluster, or classification, analysis, whereby objects are grouped according to intrinsic characteristics. The objective of cluster analysis is to uncover natural groupings in order to help understand the phenomena being studied. Without modern digital computers, the authors point out, only very elementary techniques could be realized, and these only for small samples. They provide enough introductory material so that the reader unfamiliar with the field will grasp the main techniques, and then proceed to develop an applications-oriented treatment of cluster analysis in the spirit of exploratory data analysis. They convey the sense in which cluster analysis can be applied and provide a review of important parameters and factors with which a user should be familiar.

The question of numerical software is an intriguing one; the term is currently used very liberally to describe a wide range of activities. In his chapter, C. W. Gear examines numerical software and discusses what is particularly difficult about it. He then examines the science behind such software and also the areas in which the science does not help. He emphasizes two important characteristics of numerical software: namely, it deals with approximation, and it must be usable on a range of computers with different approximation capabilities. The chapter concludes with some predictions regarding the future of numerical software.

In the final article Rob Kling and Walt Scacchi consider computing as social action and examine in particular computing in complex organizations. This issue is extremely important and is sometimes overlooked or minimized by technologists. In their chapter they examine six common perspectives that provide different indicators by which to understand how people live and work with computing in organizations. Their examination provides an indication of the usefulness of different perspectives for explaining how computing developments work in complex organizations, and they demonstrate how each phase of computing entails a complex set of social activities.

I am pleased to thank the contributors to this volume. They have given extensively of their time and effort and have accordingly made it an important and lasting contribution to the profession. Their participation ensures that this volume achieves a high level of excellence and that it will be of great value for many years. It has been a rewarding experience to edit this volume and to work with the authors.
MARSHALL C. YOVITS
Data Base Computers

DAVID K. HSIAO
Department of Computer and Information Science, Ohio State University, Columbus, Ohio

1. Introduction: The 90-10 Rule - A Characterization of the Problem . . . 1
2. Approaches to a Solution . . . 4
   2.1 Software Back-End Solution . . . 4
   2.2 Intelligent Controller Solution . . . 5
   2.3 Hardware Back-End Solution . . . 5
3. Where Are We Now? . . . 6
   3.1 Advances in Technology . . . 7
   3.2 Better Research in Data Base Computers . . . 7
4. Two Kinds of Data Base Computers . . . 8
   4.1 Text-Retrieval Computers . . . 9
   4.2 Data Base Computers for Formatted Data Bases . . . 19
5. Can the New Data Base Computers Replace the Existing Data Base Management Software with Improved Performance? . . . 40
   5.1 Data Base Transformation . . . 43
   5.2 Query Translation . . . 48
6. The Future . . . 58
References . . . 59

1. Introduction: The 90-10 Rule - A Characterization of the Problem
The 90-10 rule characterizes the problems of contemporary data base systems. It also points toward hardware solutions to the problems of software data base systems. Let us examine the 90-10 rule in the context of a contemporary data base system with very large data bases.

Typically, a contemporary data base system resides in a general-purpose host computer and consists of four major software components: the application programs, the operating system, the data base management software, and the on-line I/O routines.
The application programs for data base management are mostly developed by the user and written in a high-level language in which the data definition and manipulation facilities of the data base management software are embedded. The user can create a new data base by defining the formats and types of data items, records, files, and the relationships among the data types. Furthermore, one can manipulate or access the data base by way of the data manipulation facilities. The data base application programs, together with the data base, represent long-term user investments. They are usually local to an installation and reflect its specialized data base management needs. The other three software components, on the other hand, are commonly provided by the vendor. The operating system is the principal software interfacing with the user and managing the physical resources of the computer system. The data base management software parses and processes the data definition and manipulation facilities, translates symbolic data references into data addresses, initiates access operations, and verifies the data retrieved from the data bases. The on-line I/O routines carry out the access operations and interface with the controller of the storage devices of the data bases.

Whenever an application program issues a data definition and manipulation facility during the execution of the program, the facility (termed a user query or request) is intercepted by the operating system and passed along to the data base management software. After the query (or request) is parsed and processed, the addresses of data relevant to the query (or request) are determined. Finally, the relevant data are screened and the answer to the query (or request) is routed to the application program by way of the operating system. In the course of determining the addresses of relevant data and subsequently retrieving them, the data base management software relies on the on-line I/O routines. The relationship among the application program, operating system, data base management software, and on-line I/O routines is depicted in Fig. 1.

FIG. 1. Conventional data base management environment.

Owing to the sheer size of data bases and the great complexity of the built-in relationships (among data types), it is not possible for the on-line I/O routines to locate on the secondary on-line storage those and only those data relevant to the user request. Instead, a large amount of raw data must be brought into the main memory of the host computer for processing. By providing large buffers, blocking the data bases into pages, and overlapping I/O operations with computations, the on-line I/O routines can find the relevant data in the pages. For very large data bases the 90-10 rule¹ says that, in order to find the relevant data, nine times as much additional irrelevant data will have to be brought into the main memory for processing; that is, only 10% of all the data processed will be relevant to the request or query.

¹ An empirical rule of large data base management deduced from experiments (see Banerjee and Hsiao, 1978a-c; Banerjee et al., 1980).

Once the relevant data are found by the on-line I/O routines, they are forwarded to the data base management software. Because a user query (or request) usually consists of predicates in the form of Boolean and arithmetic expressions over attribute values of the relevant data, the data base management software must process the relevant data and produce the answer that satisfies the predicates. Here the 90-10 rule applies again: to produce one item of the desired result, nine items of the relevant data are handled by the data base management software. The repeated application of the 90-10 rule - once for finding the relevant data and once for searching for the right answer - creates a situation of "double jeopardy" in contemporary data base systems (as depicted in Fig. 2), causing a shortage of memory and CPU cycles, I/O contention, channel saturation, and overall system degradation.

FIG. 2. Conventional operation - double jeopardy. From 90% of raw data to produce 10% of relevant data; from 90% of relevant data to produce 10% of answer.
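The double jeopardy can be put in numbers. The following sketch is ours (only the 10% relevance figure comes from the 90-10 rule itself); it shows how much relevant and raw data must move to produce a given volume of answer.

```python
# A back-of-the-envelope sketch of double jeopardy: applying the 90-10 rule
# twice means delivering one unit of answer moves two orders of magnitude
# more raw data through the host.
def raw_data_needed(answer_volume, relevance=0.10):
    relevant = answer_volume / relevance   # 90-10 rule at the software level
    raw = relevant / relevance             # 90-10 rule again at the I/O level
    return relevant, raw

relevant, raw = raw_data_needed(answer_volume=1.0)   # e.g., 1 megabyte of answer
print(relevant, raw)                                 # 10.0 100.0 (megabytes)
```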
Contemporary systems are always short of memory and CPU cycles for large data base applications. In addition, the use of large buffers by the on-line I/O routines and high-volume data transfers from secondary storage tend to create memory shortage, device contention, and channel saturation. Consequently, the performance of the host computer in supporting the user's needs (data base management and otherwise) is degraded. In the following section we will consider various approaches to upgrading the computing environment so that both data base management and non-data-base-management applications can be supported with good performance.

2. Approaches to a Solution
One obvious solution is to replace the present host computer with a higher-performance computer. With faster CPUs, larger memory, greater channel capacity, and additional controllers, the new host computer may overcome the bottlenecks created by the 90-10 rule. However, there are other approaches to the problem.

2.1 Software Back-End Solution
One solution, termed the software back-end approach, utilizes a medium-size computer for stand-alone data base management software and on-line I/O routines. By relegating these software components to the back-end computer, the host computer is freed from data base management and on-line I/O activities. Instead, the host computer merely routes queries to and receives answers from the back-end data base management system. Consequently, much of the system resources (e.g., cycles and memory) of the host computer can be used for (application) program preparation and development. This approach, as depicted in Fig. 3, has the additional advantages that (1) the cost of a modest back-end computer may be lower than the cost of a high-power host replacement; (2) there is a clear delineation between data base management and nondata base management activities, owing to the physical separation of the host and back-end computers; and (3) clearly delineated interfaces of large software components tend to produce more reliable software, making the host and back end less prone to software failures. However, this approach can only defer the problem: soon the back-end computer will become the victim of the 90-10 rule.

FIG. 3. Software back-end approach - deferred jeopardy.

2.2 Intelligent Controller Solution
Since the 90-10 rule has such a detrimental influence on the performance of the data base management system, efforts have been made to eliminate its application. The intelligent controller approach has the benefit of eliminating one application of the 90-10 rule (see Fig. 4). This elimination is accomplished by building processing logic directly into the controller, enabling the large amount of raw data to be processed prior to its movement to the main memory of the host computer. In fact, the only data transferred to the host computer are those relevant to the user query. Although the traffic between the secondary storage and the main memory is considerably reduced, thereby reducing device contention and channel saturation, the data base management software still requires considerable processing power for searching for answers among the relevant data. Furthermore, the volume of the relevant data and the size of the data base management software are still too large to minimize their demand for system resources. Finally, finding answers among relevant data again falls victim to the 90-10 rule. Ideally, one would like to eliminate any application of the 90-10 rule.

FIG. 4. Intelligent controller approach - eliminating one application of the 90-10 rule.
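The saving promised by the intelligent controller can be made concrete with a small sketch. The sketch below is ours - the record layout and predicate are invented for illustration - and it contrasts the channel traffic when filtering happens in the host with the traffic when the predicate is evaluated next to the storage.

```python
# Host-side filtering ships every record across the channel; controller-side
# filtering ships only the relevant records.
def host_side(pages, predicate):
    moved = sum(len(p) for p in pages)              # every record crosses the channel
    hits = [r for p in pages for r in p if predicate(r)]
    return hits, moved

def controller_side(pages, predicate):
    hits = [r for p in pages for r in p if predicate(r)]
    return hits, len(hits)                          # only relevant records move

pages = [[("A", 5), ("B", 1)], [("C", 7), ("D", 2)]]
wanted = lambda rec: rec[1] > 4
print(host_side(pages, wanted)[1], controller_side(pages, wanted)[1])   # 4 2
```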
2.3 Hardware Back-End Solution

By building both the data base management and on-line I/O capabilities into hardware, the hardware back-end approach (see Fig. 5) effectively eliminates any application of the 90-10 rule. In other words, the back-end computer receives the user request and returns the answer without going through intermediate software steps. Furthermore, there is no need to move any data to the host except the answer. Consequently, the traffic to (and from) the host is light, thereby eliminating possible saturation of the channel. Because the volume of data being moved to the main memory is not high, there is no need for large buffers and buffer processing. In addition, the absence of data base management software and on-line I/O routines in the host computer enables the host computer to make more resources available to new applications and enhances the performance and reliability of the existing software in the host computer, e.g., the operating system.

FIG. 5. Hardware back-end solution - eliminating any application of the 90-10 rule.

The goal of every data base computer architect is, therefore, to incorporate as much data base management and on-line I/O functionality in a special-purpose back-end computer as possible, thereby freeing the front-end general-purpose computer from data base management chores. If there is any need for data base management related software in the front end, it should be minimal and inconsequential (see Fig. 6). To meet this goal, the technology must be available, and there must also be a foundation of knowledge about specialized data base computers. Unless the technology is near and the foundation of knowledge is sufficient, the architect will not be able to build a cost-effective data base computer with the intended performance gain. In the following section we will see that the technology and knowledge necessary to make this concept both possible and cost effective are just around the corner.

FIG. 6. New environment: a special-purpose back end handling data base management for a general-purpose front end.

3. Where Are We Now?

Progress is being made on two fronts: there are advances in technology and advances in research. Let us elaborate upon them in sequence.
3.1 Advances in Technology
There is evidence that both microprocessor technology and memory technologies are progressing rapidly. Emerging memory technologies suggest that magnetic bubble memories, charge-coupled devices, or electron-beam addressable memories are likely to replace fixed-head disks in the near future. These block-oriented, shift-register-like storage devices are particularly versatile for parallel configuration and on-board logic. The configuration for parallelism is primarily intended to improve block access time; the on-board logic via microprocessors enables data base management processing capabilities to be brought close to the store. Both of these innovative uses of technology were difficult to accomplish with fixed-head disks.

On the other hand, there is also the trend of doubling the density of the moving-head disk every 30 months. Thus, it seems that the moving-head disk technology will remain the mainstay of the on-line data base store; there is no on-line memory technology in sight that could compete with or replace it. Recently, progress has been made on reconfiguring moving-head disks for parallelism and on-board logic. The parallelism is achieved by reading all the tracks of a cylinder in parallel in the same revolution. Furthermore, the controller is endowed with a set of microprocessors that process the incoming data on the fly. In other words, at the end of a revolution, not only has the entire cylinder of data been read, but the answer to the user request has been found by the controller of the disk. Thus, such a moving-head disk with built-in processing logic in the controller is in effect a multiple-data-stream-and-single-request (MDSR) computer. The costs for tracks-in-parallel readout and on-board processing logic seem reasonable: the former requires only the widening of the data buses, since the read/write heads are already there, and the latter is time-shared among all the cylinders of a device. The spectrum of the memory technology is depicted in Fig. 7.

FIG. 7. On-line memory technology: price-performance projection over the next decade. The arrows indicate the movement of the technology toward shorter access time and lower cost per bit.
There has been a number of advances in data base computer research prompted by the availability of more versatile technology and a better understanding of data base theory and software. For example, at present there are three prevailing data base models (hierarchical, CODASYL, and relational) that underscore most of the contemporary data base management software systems and data base organizations. One may then orient the architecture of a data base computer toward a particular data base model or generalize on several data base models. It is clear that the choice of data base model (or models) will greatly influence the data and instruction formats of a data base computer. However, once these choices are
DAVID K. HSIAO
8
-.
-/y Bipolar,
Large capacity
- - \
\
/
'/ '*I
\
FIG.7. On-line memory technology: price-performance projection over the next decade. The arrows indicate the movement of the technology toward shorter access time and lower cost per bit over the next decade.
made, it is possible to study the storage requirements in the old and new computer environments for the same data bases, since the data base organizations of the old and new environments are known. It is also possible to study the query execution times of a particular data base manipulation language in the old and new environments. Many such studies and analyses have been carried out and they tend to show that the number of accesses to the data bases and the time required to interpret and execute language constructs are drastically reduced in the new environment, leading toward more efficient and effective data base management. In the following sections, we shall review various design approaches to data base computer architecture. 4. Two Kinds of Data Base Computers
Data bases fall into two broad categories: those with limited formats and those with elaborate formats. Data bases with limited formats mainly consist of textual information in the form of character strings: for exam-
DATA BASE COMPUTERS
9
ple, a newspaper. Textual information is archival in nature and requires little alteration and update. Thus, the main activity involving textual information in a data base environment is that of searching for terms, words, phrases, and paragraphs. Although search patterns are simple combinations of terms and words, their formulations tend to be ad hoc and unpredictable. In other words, at the time the data base is created in the data base computer, there is no knowledge on the part of data base creators regarding the anticipated term patterns that may be used in the future for searching. Since data base structures are employed for efficient accessing of the data base for anticipated queries, there is little use for these structures in textual data bases. As a result of this lack of structures and absence of a priori knowledge of search patterns, the entire (or a good portion of the) data base is involved in every search. Data base computers designed for this kind of data base must therefore have very high-volume readout, allow fast formulation of search patterns, and provide very highspeed loading and matching capabilities for substantial parts of the data bases. Data bases with elaborate formats are intended to support multiple data models (such as hierarchical, CODASYL, and relational). These models attempt to capture both inter- and intrarecord processing requirements so that search and update operations can be restricted to those and only those records relevant to the requests. The search and update operations use predicates based on the attribute values of the records. It is therefore important for the data base computer, upon receiving a predicate, to narrow down the search space, to determine the whereabouts of the relevant records, and to process these records rapidly. Since predicates refer to the properties of the data base contents, the data base computers must have content-addressable capabilities to be effective. Because search and update queries are always stated in a given data base manipulation language and the data base is always structured after some known data model, it is possible for the data base computer designer to have some a priori knowledge of data base usage. Accordingly, the designer can provide some hardware mechanisms for usage enhancement, such as clustering mechanisms for performance improvement and security mechanisms for access control. 4.1 Text-Retrieval Corn puters
The data base computers specializing in data bases with limited format are called text-retrieval computers. In general, text-retrieval computers have the basic architecture depicted in Fig. 8. Referring to this figure, w,e note that one application of the 90-10 rule is removed because
10
DAVID K. HSlAO
FIG. 8. Text-retrieval computer. The main functions of a query resolver are to check context, proximity, and threshold requirements for a matching term. The query resolver does not need fast speed and can be realized with a minicomputer. The main function of the comparator is to match a term against the input stream. Usually, the term comparator is specially built and utilizes parallelism in hardware for fast matching speed.
the disk controller, with the aid of the data base computer controller, can output only relevant data, The potential bottleneck (i.e., next application of the 90-10 rule) of the architecture is therefore the term compururor and query resolver that must utilize the relevant data and produce the answer. Since the task of the query resolver is to decide among several answers that one should be routed to the host computer, the bulk of the work is performed by the term comparator. It is therefore apparent that the term Comparator, instead of the query resolver, is the only potential bottleneck of the entire architecture. In the following sections we present four different approaches to the design of high-performance term comparator. Regardless of their internal designs, the term comparator, in conjunction with the query resolver, is sufficient to carry out some or all the data base commands listed in Fig. 9. These commands are typical for character matching, word search, and text processing. Matching a pattern of characters with a character stream from the data base touches upon a number of fundamental problems. These problems can be easily exemplified by the following exercise: Find ISSIP in MISSISSIPPI via one-character-at-a-timecomparison. The successive comparisons of the exercise are depicted in Fig. 10 where the character comparator hardware is represented by a solid dot between the input stream and the comparand register. We observe that: 1. Loading and shifting both the input character stream and comparand register one-character-at-a-timeis not a good practice, since serial loading and serial shifting are slow.
DATA BASE COMPUTERS
11
~
Single word
T
Either word
A A or B
Both words
A and B
Finds any that contains both A and B anywhere in the text
One but not the other
A and not B
Finds any that contains A but not B
Specified context
A and B in sentence
Finds any that contains both A and B in the same sentence
Finds a contiguous word phrase
AB
Finds any that contains A immediately followed by B
One followed the other
A
Directed proximity
A.N.B.
Finds any that contains A followed by B within N terms
Undirected proximity
N
Finds any that contains A and B within N terms of each other
Threshold OR
( A . B. C . ) N
Finds any document that Contains at least N of the different terms A, B, or C Note that i f N = l . this operation is an OR, while if N equals the number of different terms specified, i t i s an AND
Fixed-length don’t care
All6
Matches the character string A, followed by two arbitrary characters, followed by the string B
Variable-length don’t care
A?B
Matches the character string A, followed by an arbitrary number of characters (possible none), followed b y the string B. Note that ?A matches A with any prefix, while A ? matches A with any suffix
E R M
P R 0 C
E S S I N G
C H A R A C T
P R S O T C R E I S N S G I
E
N
R
G
..B
Finds any text that contains the A Finds any text that contains either the A or the B
Finds any that contains A followed (either immediately or after an arbitrary number of term) by B
FIG.9. Typical text-retrieval commands.
2. The characters, shifted out of the comparand register, are lost and cannot be reused when there is a mismatch near the end of the comparand. 3. A single-character comparator is not sufficient for high-speed matching (i.e., the solid dot in Fig. 10). It is necessary to have as many such comparators as there are characters in the comparand. 4. The comparand register should be able to hold very large search patterns.
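To make the semantics of the Fig. 9 commands concrete, here is a minimal software rendering of a few of them. The function names and the tokenizer are ours, not part of the original proposals; the hardware realizations discussed below work on character streams rather than on Python lists.

```python
# Software analogues of several Fig. 9 text-retrieval commands.
import re

def terms(text):
    """Split a document into lowercase terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def both_words(doc, a, b):                 # "A and B"
    ts = terms(doc)
    return a in ts and b in ts

def directed_proximity(doc, a, b, n):      # "A.N.B": A followed by B within N terms
    ts = terms(doc)
    return any(t == a and b in ts[i + 1 : i + 1 + n]
               for i, t in enumerate(ts))

def threshold_or(doc, words, n):           # "(A, B, C)N": at least N distinct terms present
    ts = set(terms(doc))
    return len(ts.intersection(words)) >= n

def variable_dont_care(doc, a, b):         # "A?B": string A, any characters, then string B
    return re.search(re.escape(a) + r".*" + re.escape(b), doc) is not None

doc = "The data base computer searches the text stream for matching terms."
assert both_words(doc, "data", "terms")
assert directed_proximity(doc, "data", "computer", 2)
assert threshold_or(doc, {"data", "base", "missing"}, 2)
assert variable_dont_care(doc, "data", "stream")
```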
FIG. 10. One-character-at-a-time comparison. The comparand ISSIP is matched against MISSISSIPPI one character at a time: on a mismatch the input character is skipped; on a match both the input stream and the register move by one character; a mismatch after several matches raises the question "Mismatch, what to do?"
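The observations above can be reproduced with a few lines of code. This is a deliberately naive sketch (ours) of the Fig. 10 procedure: on every mismatch the alignment advances by one input character and the comparison restarts, so characters are examined repeatedly - precisely the waste the following designs attack.

```python
# A minimal sketch of the naive one-character-at-a-time search from Fig. 10.
# On a mismatch the comparand is realigned one input character to the right,
# so previously examined characters must be reloaded (observation 2).
def naive_find(stream, pattern):
    comparisons = 0
    for start in range(len(stream) - len(pattern) + 1):
        for j, ch in enumerate(pattern):
            comparisons += 1
            if stream[start + j] != ch:
                break                      # mismatch: back up and shift by one
        else:
            return start, comparisons      # match found
    return -1, comparisons

pos, cost = naive_find("MISSISSIPPI", "ISSIP")
print(pos, cost)   # 4 13: ISSIP begins at index 4 (0-based), after 13 comparisons
```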
These observations have prompted the following four design approaches to term comparators. We shall introduce them chronologically.

4.1.1 Parallel Discrete Comparators Approach
The parallel discrete comparators approach, as depicted in Fig. 11, provides the following solutions to the aforementioned problems.
FIG. 11. Parallel discrete comparators approach (Stellhorn). A delimiter detector marks the beginning and end of context in the input stream; the data window buffer feeds n term comparators in parallel, and their term-match signals go to the query resolver.
For parallel loading of the input stream, a high-bandwidth input channel coupled with very large buffers is used. The size of the buffer, known as the data window, is the product of the number (n) of comparators and the bandwidth of the input bus of a comparator. Thus, the data window converts a serial string of terms into n parallel streams of terms for comparison. In addition, all n comparators work in unison and match the characters of the term in a comparator with the characters coming from a corresponding stream, one character at a time. To add another degree of parallelism, the loading of the comparators may be made parallel; in this way parallel matching of characters can also be conducted in each comparator. The results of the comparison are sent to the query resolver for sentential analysis, e.g., to determine whether at least one, some, or all matching words appeared in the sentence. Because these discrete comparators are tailor-made for certain terms, they are one of a kind. Furthermore, they may have to be hard-wired each time a set of new terms is warranted. Therefore, it is rather tedious and time consuming to set up the matching patterns (i.e., to configure the hard-wired term comparators for the search criterion of sentences). A software sketch of this scheme follows.
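In the rough software analogue below - ours, with an invented term list - the inner loop stands in for what the hardware comparators do simultaneously in a single cycle.

```python
# A sketch of the parallel discrete comparators: the data window fans the
# input out to n comparators, each hard-wired with one term, and all of them
# examine the same stream position in unison.
def parallel_term_match(term_stream, wired_terms):
    """Yield (position, term) for every hit of any wired comparator."""
    for pos, term in enumerate(term_stream):
        # In hardware all comparators fire in the same cycle; this loop is
        # the software stand-in for that simultaneous compare.
        for t in wired_terms:
            if term == t:
                yield pos, t

stream = ["national", "bureau", "of", "standards", "and", "farmers"]
print(list(parallel_term_match(stream, ["bureau", "farmers"])))
# [(1, 'bureau'), (5, 'farmers')]
```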
4.1.2 Associative Array Approach

Instead of performing parallel matching of n terms (i.e., comparands) against the incoming stream, as was the case in the discrete comparators approach, the associative array approach matches n terms of the input stream against a single comparand in parallel. The rationale in either case is that parallelism speeds up the comparison; the difference is where the parallelism is applied. In the associative array approach, a single matching pattern is first loaded into the comparand register. The incoming stream is then loaded into the array elements one block at a time; on the average a block contains 1200 terms (see Fig. 12). Thus, the associative array is able to match 1200 terms in parallel against the term in the comparand register. The results of the comparison can be used to "flag" matching terms in the data stream, which in turn are routed to the query resolver (see Fig. 13).

FIG. 12. Associative array. The registers contain terms from the input data stream; a mark register records the matches against the comparand register.

FIG. 13. Associative array approach (Bird, Tu, Worthy). Associative memory (AM) capacity: 8192 characters, about 1200 terms.
FIG.13. Associative array approach.
15
DATA BASE COMPUTERS
parators and relies on a single general-purpose comparand register, it has a number of limitations. For example, it limits the use of elaborate matching patterns. For more complex matching patterns it is only possible to use a part of the pattern at a time. For each part of the pattern the entire data base will have to be loaded one block at a time into the associative array for matching. In this case the speed is limited by the rate at which the input stream can be loaded. 4.1.3 Cellular Logic Approach
One way to solve the problem of the expense of discrete components for term comparators and the difficulty of reconfiguring them for everchanging matching patterns, is to use a basic building block from which term comparators can be configured for various new matching patterns. The cellular logic approach proposes such a basic building block, known as the cell. The cell, as depicted in Fig. 14, can store one matching character and compare the character with one from the input stream rapidly, and generate a (match) signal if the match is successful. In addition it receives two signals from outside. One signal (actually, some other cell’s match signal) enables the present comparison to take place. The other signal can be used either to clear the cell or to synchronize this cell’s comparison with other comparisons taking place in other cells. By cascading these cells a term comparator can be configured for a new matching pattern. Since each cell can accommodate only a single character, the number of cells of the term comparator is equal to the number of characters of the term. Furthermore, the loading of the pattern characters can be done in parallel. Since each cell can also receive a character from Character stream
Enable
--d
Match
Setup/control
Match character
(Mukhopadhyay, Cowland)
FIG. 14. The cell-a
basic building block.
DAVID K. HSlAO
16 Character Stream
Anchor
Match ~
Character Stream
M 0 U T P U
T
1 S I G N A
0
I 1
S 0
S 0
I
S
S
I
P
P
I
1
0
0
1
0
0
1
2
0
0
1
0
0
1
0
0
0
0
0
3
0
0
0
1
0
0
1
0
0
0
0
4
0
0
0
0
1
0
0
1
0
0
0
Match
0
0
0
0
0
0
0
0
1
0
0
L
the input stream for comparison, loading of input stream for term comparison can also be made parallel in characters. Let us now illustrate the use of cascaded cells to find ISSIP in MISSISSIPPI, as depicted in Fig. 15. We represent each cell with a box and identify the stored character of the matching pattern (in this case, ISSIP) by placing the character in the middle of the box. Because there are n matching characters (in this case, n = 3,there are n cells and n output signals. Let us number the first n - 1 output signals and name the last output signal “match.” The input stream (i.e., MISSISSIPPI) of m characters loaded into the cascaded cells n character at a time for comparison. We do not show the control signals. However, it is understood that after loading n (i.e., 5 ) cells, the comparison takes place simultaneously. This process of parallel loading and simultaneous comparison will be repeated at mostly m - n + 1 times where each time the string to be loaded begins with the next character of the input stream. Whenever there is the presence of the match signal, we know that the cascaded cells have found a match (i,e., ISSIP) in the input stream. The number of match signals indicates the number of successful matches. In the illustration of Fig. 15, we learn that ISSIP is found in MISSISSIPPI after the fifth load. We also note that the intermediate signals can be used to trigger more complex matching patterns. For an example, they are used as input signals to logical gates such as AND (A) and OR (v).This is depicted in Fig. 16. The most attractive feature of the cellular logic approach is the simplicity and effectiveness of the basic cells. However, the repeated loading of
DATA BASE COMPUTERS
17
More Complex Patterns Using AND And OR Gates
0 is the blank character FIG.16. More complex patterns using A N D and OR gates to find CBS or NBS or OSU.
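The sketch below is ours; it models only the enable-signal chain, not the parallel hardware timing, and it reports the load on which the match signal appears, matching the Fig. 15 walkthrough.

```python
# A behavioral sketch of a cascade of cells. Each cell stores one pattern
# character; cell j matches only if cell j-1 matched on the previous
# position, mirroring the enable-signal chain.
def cascade_match(stream, pattern):
    n = len(pattern)
    matches = []
    # One "load" per alignment: at most m - n + 1 loads, as in Fig. 15.
    for load in range(len(stream) - n + 1):
        window = stream[load : load + n]
        enable = True                     # enable signal entering the first cell
        for cell_char, in_char in zip(pattern, window):
            enable = enable and (cell_char == in_char)
        if enable:                        # the last cell's output is "match"
            matches.append(load + 1)      # report the load number (1-based)
    return matches

print(cascade_match("MISSISSIPPI", "ISSIP"))   # [5]: match on the fifth load
```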
However, the repeated loading of the input stream (i.e., once for each of the first m - n + 1 characters of the input stream) may be slow and unnecessary. Such repetition is mainly due to the fact that these simple cells do not have "memories": they do not recall which characters from the input stream were used for previous comparisons.

4.1.4 Finite-State Automata Approach
The finite-state automata approach has a built-in memory. Since a finite-state automaton always knows the current and next states of the automaton, it does not have to backtrack through all the previous states in order to progress to the next state. This desirable property overcomes the problem of repeated loading of the input stream commonly associated with the cellular logic approach; we shall illustrate this point later with an example. Meanwhile, let us introduce the notion of finite-state automata and their applications to text processing.

Essentially, a finite-state automaton is characterized by a set of states (in which there is one beginning state and at least one final state) and a set of connections between states. A state can have one or more inputs and zero or more outputs. It receives one character at a time from the input stream, although different characters of the input stream may be received via different inputs at different times. The outputs are switching signals. A switching signal causes the next character of the input stream to enter a specific state. Again, although many different switching signals may cause the input stream to arrive at different next states for a current state, no more than one switching signal can be generated by the current state at one time; therefore, only one specific state is entered as the next state at that time.
What are these states? States are comparators. More specifically, a state consists of either a single-character comparator or a collection of parallel single-character comparators, each of which matches a different character. If there is a match between the matching character and the character of the input stream, the comparator of a state generates a switching signal and causes the next character of the input stream to be handled by a specific next state. Obviously, the single-character comparators of the states may be built with the cells discussed in the previous section. The important differences between the finite-state automata approach and the cellular logic approach are as follows:

1. The switching signals generated by the states automatically route the next character of the input stream to the appropriate state for comparison, without repeatedly loading all the previous characters of the input stream for comparison.
2. A (hardware) switching network can be developed for the states. The switching signal generated by a state self-directs the input stream by opening the gates to the next state. Thus, routing is deterministic and automatic.
3. The states can be realized by one or more single-character comparators made of cells. Since the matching characters (i.e., the comparand) are stored in the comparators via programming means, these comparators can be reused for new patterns.
4. A general switching network can be constructed so that specific connections among states for a given matching pattern may be facilitated by programming means (i.e., software) or switch control (i.e., hardware).
The finite-state automata approach to software development has been used extensively by compiler writers for lexical and syntactical analysis of programming languages. It is, therefore, ideal for text processing. The challenge lies in the cost-effective implementation of the switching network. Let us return to the example of finding ISSIP in MISSISSIPPI by referring to Fig. 17. Let us represent states with circles, number each state of the automaton, and designate the first and last states to be the beginning and final states, respectively. Let us represent the switching (matching) signal of a state by an arrow leading from the state to another state; the outgoing arrow thus shows the connection between the present state and the next state. We place the stored matching character(s) (i.e., the comparand character(s) of a state) not in the circle but on the outgoing arrow(s) of the state. This convention has the advantage that, when a match occurs, the comparand character on the outgoing arrow is indeed the character that causes the switching signal to be generated. The input stream starts at
FIG. 17. Finite-state automaton for finding ISSIP in MISSISSIPPI. @ = a character that is not on another output arc; beginning state = 1, final state = 6.
the beginning state (i.e., State 1). Whenever the current character of the input stream is a character other than I, the next character of the input stream remains to be processed by State 1. When the current character is I, the next character of the input stream will be processed by the comparators of State 2. It is easy to see in Fig. 17 that when State 6, the final state, is reached, ISSIP has been found; otherwise the automaton never arrives at the final state. In a real sense this automaton represents a recognizer of the string ISSIP. One observes that the recognizer has a good memory: if the present state is State 5 and the current character of the input stream is not P but S, the recognizer will cause the input stream to be processed at State 3. In other words, the recognizer recalls that a part of the matching pattern (i.e., the IS of ISSIP) has already been found and there is no need to reload it for comparison.

In Fig. 18 a number of finite-state automata for different patterns are used in parallel and in serial. This allows more complex patterns to be set up for matching. An example of such usage is depicted in Fig. 19; we eliminate many arrows from Fig. 19 for the sake of clarity.

FIG. 18. Multiple finite-state automata for complex patterns: a context delimiter recognizer feeding automata for, say, single words, variable-length don't-care words, and continuous word phrases.

FIG. 19. Multiple patterns from the same input character stream (blank and delimiting characters marked). Beginning state = 1; detection of any character other than those shown causes a transition to State 1. Final states recognize national AIMS, national bureau of standards, and national bureau of farmers.

A software sketch of the ISSIP recognizer follows.
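The following sketch is ours: it derives the transition table of a Fig. 17-style recognizer directly from the pattern (the construction is the classical one used by compiler writers), and it reproduces the memory property noted above - at State 5 an S sends the automaton back to State 3, not to State 1.

```python
# The ISSIP recognizer of Fig. 17 as a table-driven automaton. States are
# numbered 1..6 as in the figure; state k means k-1 pattern characters have
# been matched, and state 6 is the final (accepting) state.
def build_recognizer(pattern, alphabet):
    n = len(pattern)
    delta = {}
    for state in range(n + 1):              # 0..n matched characters
        for ch in alphabet:
            # Longest prefix of the pattern that is a suffix of
            # (matched text + ch): this is the automaton's "memory".
            s = pattern[:state] + ch
            k = min(n, len(s))
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            delta[(state + 1, ch)] = k + 1  # shift to 1-based state numbers
    return delta

pattern, alphabet = "ISSIP", set("MISP")
delta = build_recognizer(pattern, alphabet)
print(delta[(5, "S")])                      # 3: at State 5, an S resumes at State 3

state, hits = 1, 0
for ch in "MISSISSIPPI":                    # run the recognizer over the stream
    state = delta[(state, ch)]
    if state == 6:                          # final state reached: ISSIP found
        hits += 1
        state = 1                           # restart for further occurrences (our choice)
print(hits)                                 # 1
```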
Base Computers for Formatted Data Bases
By formatted data bases we mean those data bases that are structured for some prevailing data base model such as the hierarchical, CODASYL, or relational. Because data base models are developed for the purpose of capturing the interrelationships of various data aggregates and types, they are ideal to represent the intended relations among data base aggregates. For example, the relation “all those parts supplied by the supplier” establishes a relationship between the file of parts and the file of suppliers. If a data base model can represent this relation, then subsequent references to
DAVID K. HSIAO
20
Context delimiter recognizer
U
Finitestate automaton 1 (say, for recognizing single words)
I
I
Finitestate automaton 3 (continuous word phrases)
I
Finitestate automaton 2 (say, for recognizing variabie4ength don’t care words)
FIG. 18. Multiple finite-state automata for complex patterns.
the related records of the files for the supplier-part transactions can be made naturally and expeditiously. To allow the use of models, the data base management software must devote considerable efforts toward keeping track of the relations, related aggregates, and types. Furthermore, it must provide efficient storage and retrieval of these entities. The benefits to the user, however, are many: 1. Since the data base model is an ideal means to represent the form and content of the data base, the user does not have to be concerned with the storage layout and performance issues of the database. This notion of data base independence from storage and performance considerations greatly enhances the user’s acceptance of modern data base management. O
N
A
T
I
O
N
A
L
A
B
U
R
E
A
U
A
O
F
A
F
A
R
M
E
R
S
O
Detection of any charcter other than those shown causes transition to state 1.
FIG. 19. Multiple patterns from the same input character stream. A = blank and 0 = delimiting character. Beginning state = 1; final states: @ = national AIMS. @ = national bureau of standards, and = national bureau of farmers.
DATA BASE COMPUTERS
21
2. The presence of a data base model allows a data base definition and manipulation language to be easily developed for the model. Furthermore, the language enables the user’s application programs to interface with the data base management software in a clear and precise way. 3. By standardizing on a data base model, it is possible to make different languages developed for the model compatible (i.e., application programs written in one definition and manipulation language for the model can be converted automatically into another definition and manipulation language for the same model). 4. Both the user’s data base and his applications programs can be transported from one computer to another computer, provided that the data base management software of the new computer supports the same model. The price to pay for the transportation is a one-time automatic conversion of the data base and application programs. Nevertheless, there are severe penalties on the part of the data base management software and on-line I/O routines. Since the user’s applications programs merely make requests in the form of predicates, they do not give hints to the data base management software as to the most expeditious way to locate and access the data that satisfy the predicates. The lack of such hints is due to data base independence. Thus, data independence shifts the responsibility to the data base management software for finding ways and means to provide adequate performance and efficient storage. This demand is compounded by the fact that in formatted data bases, the search of a portion of a data base is necessarily more frequent than the sweeping of the entire data base. Unlike text retrieval, where the entire text collection may have to be searched, the formatted data bases tend to focus on those records participating in some given relations. Although the volume of these related records may be high, it is essential for the data base management software to identify the related records readily and complete the transaction in a short time period. To overcome the aforementioned problems, the contemporary data base management software maintains additional information about relations and related aggregates and utiIizes additional data such as pointers and indices to facilitate the inherent relationships among aggregates. The 90-10 rule is, therefore, felt more acutely in a formatted data base management environment than elsewhere. Owing to this additional information, the amount of relevant and raw data for a given user answer tends to increase. This of course exaggerates the impact of the 90-10 rule. Hardware solutions to data base management of formatted data base without the impact of the 90-10 rule can be considered in three categories. They are presented chronologically in the following sections.
DAVID K. HSIAO 4.2.1 Cellular Logic Approach
The cellular logic approach to data base computer architecture for formatted data bases is exemplified by the employment of a building block known as the cell, the use of on-board processing logic for the cell, and the parallel processing of all the cells. There are some important differences in their implementation. An earlier implementation of a cellular logic data base computer for processing both textual and formatted data bases was centered on a fixed-head disk. Subsequently, the same approach was implemented on a fixed-head disk replacement (i.e. , charge-coupled devices), with specialization on the relational data base model. Studies of storage layout of the relational data bases for more rapid access and processing have also been conducted. Let us concentrate our discussion on the fixed-head disk implementation as depicted in Fig. 20. For processing formatted data bases the cell size must be large and the processing logic must have content addressability, both of which are achieved in this cellular logic computer. The tracks of the disk become the cells. Since a track can accommodate 24K bytes, the cell size is 24K
a Control bus
On-cell processing logic
/'/ --e.g., a fixed-head disk track
lo8 bytesldevice 24K byteslcell
(Lipovski and Su. Schuster, Smith and Smith)
FIG.20. Cellular logic approach.
DATA BASE COMPUTERS
23
bytes. For each track (i.e., cell) a processing unit (say, two microprocessors with some additional registers) is attached. In addition, there are two processors, one for device control and the other for I10 arbitration. The processing units are connected by two buses, the control bus and the I10 bus. The control bus is used by the device control processor to broadcast the data base management instructions (one at a time) to all the processing units and to cause, simultaneously, all the cells to be content-addressed by their corresponding processing units. The machine cycle is one disk revolution. In other words, the typical time required to broadcast an instructbn to all the processing units, to have all the cells contentaddressed, and to allow the answer to be placed on the I/O bus is one disk rotation time. When the volume of the answer is high, there may be situations where many processing units attempt to place their portions of the answer on the I10 bus. In this case the I/O bus is overloaded and the I/O arbitrator determines the right-of-way for the processing units on the basis of some priority scheme. The output of the low-priority processing units will have to wait until the next machine cycle (i.e., the next disk revolution) to be placed on the I/O bus. Let us illustrate the use of a cellular logic data base computer for processing formatted data bases. For this purpose we choose a hierarchical data base of airline routes as illustrated in Fig. 21. This data base consists of only one file, two record types, five attributes in the first record type, and three attributes in the second type. The level of the hierarchy is three. We note that the values of the attributes, of the types, and of the file names are schematically related in the hierarchy. This hierarchical relaValue -
Type File Name
/I\
Rouier 3 A B 11 12
R# Dest.
Record Type
0
Routes
Record Type
Org. DeDt. Ar&.
Level -
Routes
F l i r
Flights
2
I
3
9 C A 17
18
I
I
I i i i h (
... Record
Date: SR: MS:
D1 N1 M1
D2 N2 M2
1
Routes 2 A D 13 T4
D3 N3 M3
D4 N4 M4
FIG.21. Sample hierarchical data base of airline routes.
24
-
D D N N N N N D D N N - N D
--
s q b c
CIR I
Router Rout=
R#: Dmt: Org: Dept: Ariv: Flights Flights
Date: SR: MS: Fliphts N Date: N SR: MS: - N D Routes
-
0 1 3 A B T1 T2 2 3 D1 N1 M1 3 D2 N2 M2 1 MnICh p.d In Cell 0
-D
-
-
riG.
N N N N N D D N N N D
Routes
R#: Dat: Org: Dept: Ariv: Flights Flights Date: SR: MS: Car-
+
1)
r q b c
c.110+ 1)
1 9 C A T7 T0 2 3
D4 N4 M4 0
LL. storage layout of a hierarchicat data base.
tion of values and types must also be captured by the cellular logic data base computer. To capture the hierarchical relation of values, this cellular logic computer employs the storage layout scheme as depicted in Fig. 22. We note that in the figure the data base is sufficiently small to be stored in two cells. There are three columns of data in the cells. The first column consists of the codes “D” for delimiters and “N” for nondelimiters. The second column consists of the attributes, type names, and file names of the data base. Finally, the third column consists of the level numbers, attribute values, type values, and named values. To distinguish a value from a level number, one merely looks at the code of the row in which the number appears. If the code is a nondelimiter (N), then the number is a value of the attribute (also appearing in the row). If the code is a delimiter (D), then the number represents the level of the type in the hierarchy. With this storage layout scheme it is possible to capture the hierarchical relations in
the data base computer. In Fig. 22 we also draw some lines to the left of the cells to indicate the “span” of the hierarchy at each level of the hierarchy as dictated by the delimiters. However, the lines are there for the convenience of the reader and are not part of the cellular logic computer. On the other hand, the working spaces to the right of the cells are permanent parts of the cells. We draw them separately from the cells again for the convenience of the reader. However, they are physically on the tracks (i.e., in the cells).
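The three-column layout reduces naturally to a list of rows, as the following sketch suggests (the dictionary encoding of the hierarchy and the abbreviated field values are assumptions for the example): a "D" row carries a type or file name and its level, and an "N" row carries an attribute and its value.

    # Encode a hierarchy into (code, name, value) rows: 'D' rows hold a
    # type/file name and its level; 'N' rows hold an attribute and a value.
    def encode(node, level=1, rows=None):
        if rows is None:
            rows = []
        rows.append(("D", node["type"], level))          # delimiter row
        for attr, value in node.get("fields", []):
            rows.append(("N", attr, value))              # value row
        for child in node.get("children", []):
            encode(child, level + 1, rows)
        return rows

    route = {"type": "Routes", "fields": [("R#", 1), ("Dest", "B")],
             "children": [{"type": "Flights",
                           "fields": [("Date", "D1"), ("SR", "N1")]}]}
    for row in encode(route):
        print(row)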
The following may be a typical user request: Find those route numbers of the routes scheduled for either date D1 or date D2.
To answer the above query the data base computer parses the query and generates the following machine instructions. These instructions will then be broadcast to the cellular logic computer one instruction at a time.
1. Locate the routes file.
2. Mark all the routes (i.e., set s bits on for the routes). See Fig. 23.
FIG. 23. Content of the working spaces of the cells at the end of each revolution (CASSM).
3. Mark all the flights of the marked routes (i.e., set q bits on for the flights).
4. For each marked route, assign a stack (i.e., the b stack). Push a bit into the stack of a marked route if any date of its flights is either D1 or D2.
5. Collect the route numbers whose stacks are not empty.
It is easy to see that since there are five machine instructions, five machine cycles are needed to obtain the answer. The contents of the working spaces associated with the two cells are displayed at the end of each of the first four revolutions (disk rotations) in Fig. 23. (A software paraphrase of the instruction sequence is given at the end of this discussion.) The use of the cells themselves as “scratch pads” (e.g., marked bits and a stack) distinguishes the cellular logic approach. This distinct feature is also retained in the other implementation using charge-coupled devices.
The cellular logic approach to data base computer architecture has the attractiveness of supporting a relatively large data base (e.g., 10^8 characters/device) and providing very powerful on-board logic (4000 processing units/device). Although the magnetic fixed-head disk technology will become obsolete, its electronic replacements such as magnetic bubbles, charge-coupled devices, and electronic beam addressable memories can be used for implementation. There are also limitations, however, having to do with the cost and performance of the cellular logic computers.
1. The emphasis on a logic-per-track implementation of the cellular logic may not be cost effective. For example, 4000 processing units (approximately 8000 microprocessors) per device may be an expensive undertaking. One solution to this problem is to use very large cells. In other words, instead of a track (24K bytes), a cell may be a cylinder (4 megabytes). This would effectively cut down the number of processing units by a factor of 20.
2. A large number of revolutions may be necessary to answer a query. For example, the sample query in the previous paragraph requires five revolutions. One argument is that if the answer to the query is high in volume, then the large rotation time may be compensated for by the high volume of the answer. This argument has a fallacy in that high-volume output tends to saturate the I/O bus. The I/O arbitrator can only relieve the saturation by spreading the output over additional revolutions.
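The software paraphrase promised above follows. It is only a sketch: the route data are invented, and a real cellular logic device would keep the s, q, and b markings on the tracks themselves rather than in main memory. Each commented step corresponds to one machine cycle (revolution).

    # Software analogy of the five broadcast instructions; each step
    # corresponds to one revolution over all cells in parallel.
    routes = [{"R#": 1, "flights": [{"Date": "D1"}, {"Date": "D2"}]},
              {"R#": 9, "flights": [{"Date": "D4"}]}]

    for r in routes:                       # 1-2: locate file, set s bits
        r["s"] = True
    for r in routes:                       # 3: set q bits on the flights
        for f in r["flights"]:
            f["q"] = r["s"]
    for r in routes:                       # 4: push a bit onto the b stack
        r["b"] = [1 for f in r["flights"]
                  if f["q"] and f["Date"] in ("D1", "D2")]
    answer = [r["R#"] for r in routes if r["b"]]   # 5: collect route numbers
    print(answer)                          # -> [1]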
4.2.2 Associative Array Approach
The associative array approach to data base computer architecture for formatted data bases is similar to the one presented in Section 4.1.2 for text retrieval. In fact, in both cases, the associative arrays employed are identical. The associative array processor (or associative memory module) is 256 x 256 bits and has a 500-μsec array load time. The search time, depending on the search criterion, is small (less than 100 μsec). Because of the very large size of formatted data bases, the need to process only small portions of a data base at a time, and the disparity between the very fast associative arrays and very slow conventional on-line storage devices, this approach includes the use of a large, fast buffer memory made of either charge-coupled devices or magnetic bubbles for staging. The architecture is depicted in Fig. 24. There are some attractive features of this approach.
1. These 256 x 256-bit associative arrays are within the state of the art of the present technology. In fact, for the configuration of Fig. 24, four such arrays are proposed.
2. The use of conventional on-line storage devices such as moving-head disks enables the data base computer to support very large data bases, say, 10^9 or 10^10 characters.
3. The use of magnetic-bubble memories or charge-coupled devices for the buffer memory seems reasonable, since these emerging technologies are known to be good “gap filling” memory technologies (see Fig. 7).
FIG. 24. Associative array approach to data base computer architecture. (Timings noted in the figure: 500 μsec/AM load; 5 msec/partition; 50 msec/buffer load; 1 sec/100 cylinders; moving-head disks totaling 8.4 x 10^9 bytes.)
In the worst case, where the entire data base may have to be swept through for a small percentage of the answer (i.e., subject to the 90-10 rule), it is estimated that this approach requires only 100 seconds. This is achieved as follows. First, both the associative arrays and buffer memories have multiple ports and buses for concurrent loading and unloading of the data. Second, the size of the buffer memories is made large. In the example of Fig. 24, the total buffer size is 2 megabytes. Each partition of the buffer memory is of 320K bytes, and has its own ports in order to load (unload) the data coming from (going to) either the associative arrays or the moving-head disks. In this approach the processing logic is in the associative arrays and the data base store consists of the moving-head disks. The associative arrays are the “cache” of the data base computer. The data base computer stages the data base via the buffer memory for processing. The basic limitation of this approach is inherent in the fundamental limitations of staging. Unless the data base computer knows the whereabouts of the relations and the related aggregates, it may have to resort to excessive staging. This, in turn, implies that the performance of the computer will always approach the worst case (i.e., 100 seconds for a typical request).
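The 100-second figure can be checked by rough arithmetic. The parameters below are taken from the partially legible Fig. 24 (8.4 x 10^9 bytes of disk, four 256 x 256-bit arrays, a 500-μsec array load) and should be treated as assumptions of the sketch rather than published specifications.

    # Worst-case sweep: every byte of the data base passes through one of
    # the four associative arrays, each load taking 500 microseconds.
    DATA_BASE_BYTES = 8.4e9
    ARRAY_BYTES = 256 * 256 // 8      # one 256 x 256-bit array = 8192 bytes
    ARRAYS = 4
    LOAD_TIME = 500e-6                # seconds per array load

    loads_per_array = DATA_BASE_BYTES / ARRAY_BYTES / ARRAYS
    print(loads_per_array * LOAD_TIME)   # about 128 seconds, i.e., ~100 s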
4.2.3 Functionally Specialized Approach
The inherent limitation imposed by staging in the associative array approach is difficult to overcome. Although the use of indices may reduce the amount of staging of the data base pages, the indices themselves will have to be staged and processed. Thus, it is better to entirely avoid staging in a data base computer. The associative array seems too powerful and expensive to be employed for content-addressing of the data bases. The problem of loading and unloading of the arrays is acute. With regard to the cellular logic approach, we make the following observations:
1. The fixed-head disk technology is becoming obsolete and the use of its electronic replacements such as magnetic bubbles is not viable for very large data bases.
2. The insistence on logic-per-track implementation of a cellular logic data base computer is expensive and difficult. One way to cut down the cost and reduce the complexity is to have a very large cell size, thereby reducing the unit investment of on-board logic per cell.
3. The use of the cells themselves as scratch pads is not desirable.
In other words, cellular logic data base computers utilize the cell as their memory for “programming” tasks. Program-driven tasks (e.g., use of marked bits and stacks) require many accesses to the memory. At present, the memory cycle of a data base computer is usually long, say, in the milliseconds. Therefore storage-layout schemes that require no use of the cells as scratch pads and minimize the accesses to the cells are desirable.
4. The machine instruction repertoire of the cellular logic data base computer is too primitive. Thus, a single user request may have to be parsed into and carried out by a large number of machine instructions. The machine cycle of the data base computer is also long, say, in the milliseconds. Obviously, to carry out more machine instructions requires more machine cycles, which prolongs the length of time the computer needs to answer the user request. Direct execution of high-level data base manipulation language constructs is therefore desirable.
The functionally specialized approach attempts to overcome some of the limitations of these approaches and improve on their attractiveness. Functional specialization means that the data base computer contains a number of specialized components with considerably different processing speeds and memory capacity requirements. This approach allows us to build a relatively well-balanced computer and to avoid bottlenecks by providing each component with the right amount of processing power and memory capacity. For example, the data base computer discussed in the following section has seven major functionally specialized components: a keyword transformation unit (KXU), a structure memory (SM), a mass memory (MM), a structure memory information processor (SMIP), an index translation unit (IXU), a data base command and control processor (DBCCP), and a security filter processor (SFP). This data base computer will be able to support gigabyte data base capacities while providing full and efficient retrieval, update, and security capabilities (see Fig. 25).
Let us examine three of the components (namely, the mass memory, structure memory, and structure memory information processor) from the standpoint of their functions and their role in data base management.
(a) Design Considerations for an On-Line Data Base Store. The design of the mass memory (MM) is heavily influenced by the storage and processor technologies, data base size, and processing characteristics. Let us consider each of these factors in sequence.
A survey of the current and emerging technologies in Fig. 7 indicates that the various on-line memory technologies may be divided into three major classes on the basis of their cost and performance. In terms of low cost per bit and high storage capacity, however, there are no known or emerging technologies that can compete with the moving-head disk technology.
FIG. 25. Overall architecture of a functionally specialized data base computer (Banerjee, Baum, Hsiao, Kannan).
Moving-head disks, therefore, seem to be the only practical alternative for large, on-line data base storage. Once the moving-head disk technology is chosen, one can ask what kind of modification is necessary in order to support data base management. The performance gain owing to any modification must be cost- and performance-effective, so that the cost-performance projection of the modified disks will not exceed either the fixed-head disk or its replacements. Referring to the 90-10 rule, we note that typical data base management operations require the processing of very large amounts of relevant and raw data to produce the results. It is desirable that the mass memory should process the raw and relevant data rapidly so that the results can be obtained without being delayed by the sheer volume of the data. This calls for high-volume read-out and processing capabilities. Conventional moving-head disks, as well as fixed-head disks, allow the read out of only one track per disk revolution. By modifying the read-out mechanism of moving-head disks, the mass memory can read, instead of one track per disk revolution, all the tracks of a cylinder in the same revolution. This modification is called tracks-in-parallel read out. Such modification is now feasible and relatively low in cost, since some of the read/write electronics are already part of the moving-head disk. Modifications are necessary to trigger the read/write heads to read simultaneously and to enlarge the data buses for accommodating the increased data rate.
With the moving-head disks modified for high-volume read out, the
mass memory must now provide high-volume processing. The mass memory information processor (MMIP) obtains and processes an entire cylinder of information during one disk rotation. Since the rotation speed of the disks is relatively slow, it is possible to process information “on the fly.” Each track of the cylinder is actually processed by a separate processing unit called a track information processor (TIP) having some amount of buffer space. For instance, a disk rotation speed of 3000 revolutions per minute and a track capacity of 30,000 bytes require a processing speed (for comparison-type operations) of no more than 1.5 megabytes per second from each TIP, a rate within the present state-of-the-art microprocessor technology. Furthermore, if there are 40 tracks in a cylinder, then there will be 40 TIPs in the MMIP. The MMIP is time-shared among all the cylinders of the mass memory as depicted in Fig. 26.
In data management, processing means content-addressable search, retrieval, and update. With the mass memory modified for high-volume read out and with the high-performance processors, we now illustrate how the mass memory (MM) performs content-addressing. For this discussion, we must introduce some notions and terminology.
The data base computer accepts and stores a data base as a collection of records. Each record consists of a record body and a set of variable-length
FIG. 26. Moving-head disk with tracks-in-parallel read out and on-board logic capability. (TIP = track information processor.)
attribute-value pairs, where the attribute may represent the type, quality, or characteristic of the value. The record body is composed of a (possibly empty) string of characters that are ignored by the mass memory for search purposes. For logical reasons, all the attributes in a record are required to be distinct. An example of a record is shown below.
(<. . .>, <JOB, MGR>, <. . .>, <SALARY, 10000>)
The record consists of four attribute-value pairs. The value of the attribute JOB, for instance, is MGR. Attribute-value pairs are called, for short, keywords. They obviously characterize records and may be used as “keys” in a search operation. An important feature of the data base computer commands is that they allow natural expressions for specifying a record collection in terms of a keyword predicate, or simply, predicate, which is a triple consisting of an attribute, a relational operator (such as =, ≠, >, ≥, <, ≤), and a value. For example, the predicate
(SALARY > 10000)
may be used to indicate all records that have SALARY as one of the attributes, the value of that attribute being greater than 10,000. A record collection may also be specified in terms of a conjunction of predicates called the query conjunction, for example,
(SALARY ≥ 25000) ∧ (JOB ≠ MGR) ∧ (RELATION = EMP).
Carefully planned physical layouts of the record should be used in the data base computer to eliminate unnecessary disk revolutions and to reduce the cost and size of the TIPS buffers. Each attribute is first encoded by the data base computer, so that it has a unique numerical identifier. In Fig. 27 an encoded file (i.e., the employee file) is illustrated. The nine logical records of the file are actually stored in the mass memory as nine physical records, with attributes replaced by their uniquely assigned identifiers. The record template is a special physical record that is usually retrieved by the data base computer at the time the file is first opened. It is retained by the data base computer for the open duration of the file. The predicates in a query conjunction, like the keywords in a record, are arranged in an ascending order based on the attribute identifiers. A query conjunction is stored in a sequentially accessed memory. The TIP reads a record from the track as a part of one data stream and the query conjunction from the sequentially accessed buffer as another data stream, and carries out a simple bit-by-bit comparison of the two streams. Whenever there is a match between an attribute identifier in the record and an attribute identifier in the conjunction, the TIP then compares the
FIG. 27. Physical records of the employee file.
value parts to determine if the corresponding predicate is satisfied. If the attribute identifier in the record is less than the attribute identifier in the conjunction, then the TIP skips over the corresponding value to the next attribute identifier of the same record. If the attribute identifier in the record is greater than the one in the conjunction, then the TIP skips the entire record. The above logic is repeated until either all the predicates in the conjunction are satisfied or the record does not satisfy the conjunction. The scheme just described will result in a simple serial-by-bit comparison. A conjunction, after it is broadcast by the mass memory controller, is stored in each of the TIPs. Simultaneously each TIP evaluates the query conjunction against its corresponding incoming record stream. For example, the first TIP searches the records of the first track of the cylinder. At the same time the ith TIP searches all the records of the ith track of the same cylinder. In one disk revolution all tracks of an entire cylinder are thus searched in parallel by the TIPs (see Fig. 28). In summary, this approach of the data base store exploits existing technologies. The on-line mass memory (MM) is made from moving-head disks, the least expensive of all large-capacity on-line storage devices.
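The identifier-skipping comparison lends itself to a compact restatement. The sketch below is illustrative only; the numeric attribute identifiers and the operator encoding of predicates are assumptions for the example.

    # One pass over a record whose keywords are sorted by attribute id,
    # against a conjunction sorted the same way. Returns True if every
    # predicate is satisfied; skips a value or the whole record as the
    # identifier comparison dictates.
    def tip_match(record, conjunction):
        preds = iter(conjunction)
        pred = next(preds, None)
        for attr_id, value in record:          # record: sorted keyword list
            if pred is None:
                return True
            pid, op, pval = pred               # predicate: (attr id, op, value)
            if attr_id < pid:
                continue                       # skip over this value
            if attr_id > pid:
                return False                   # record cannot satisfy predicate
            if not op(value, pval):
                return False
            pred = next(preds, None)
        return pred is None

    import operator
    record = [(1, 30000), (2, "HSIAO"), (3, "OHIO")]
    conj = [(1, operator.gt, 15000), (2, operator.eq, "HSIAO")]
    print(tip_match(record, conj))             # -> True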
FIG. 28. Content-addressable search utilizing bit-serial, single-sweep comparison.
The disks, however, are modified to allow parallel read out of an entire cylinder in one revolution, instead of one track at a time. The parallel read-out capability of the data base computer provides rapid access to a relatively large block of data. These data can now be content-addressed simultaneously by a set of TIPs in the same revolution. It is sufficient that access is limited to one or a few cylinders, since single-user transactions seldom refer to data beyond megabytes in size. As long as data are not physically scattered, sweeping of a large number of disk cylinders can be avoided. The physical dispersion of related data is prevented by a built-in clustering mechanism that uses information provided by the creators of the data base via the host computer.
(b) The Design of a Structure Memory. Despite all the improvements that can be made to the moving-head disk technology, there is still one fundamental limitation of the technology: the time delay in repositioning the read/write heads from one cylinder to another. This delay is particularly acute if many cylinders are to be accessed.
Two factors may cause the unnecessary search of a large number of cylinders:
1. The data base creator inadvertently scatters records over a large number of cylinders, thus requiring the mass memory (MM) to “sweep” through all those cylinders.
2. For a given query conjunction the MM does not have any knowledge of those records that may satisfy the query conjunction. If, on the other hand, it knows which cylinders may contain the desired records, then the MM can restrict its content-addressable search to just those cylinders, instead of the entire cylinder space.
To eliminate the first problem, the data base computer provides a clustering mechanism in the data base command and control processor (DBCCP). With the clustering mechanism, the data base computer allows physical grouping of records that are likely to be retrieved and updated together into as few content-addressable cylinders of the MM as possible. For example, in the implementation of a relational data base, each record (corresponding to a relational tuple) contains a keyword <RELATION, relation-name>, where RELATION is an attribute and relation-name is the relation to which the record (tuple) belongs. If RELATION is declared as a clustering attribute at the relational data base creation time, then every single-relation query can be executed by searching only as few cylinders as are required to store the entire relation.
To address the second problem, the data base computer maintains some auxiliary information about the data base in a separate component known as the structure memory (SM). Indices are maintained in the SM for selected attributes and value ranges. Clustering attributes are likely candidates for indices, since most queries are expected to refer to these attributes. This auxiliary information also makes it possible to preprocess the user’s access authorization for security purposes. Although both the access and security-related information are likely to be at most 1% of the size of the data base, they are still quite large since the data base itself is of size 10^10 bytes. Furthermore, there will usually be many accesses to the information in the SM for every access to the data base. Therefore, the structure memory (SM), which is the repository of all auxiliary information, must provide a large capacity and good access speed. Such performance can be currently achieved through the use of fixed-head disks, and in the future by charge-coupled devices or magnetic bubble memories.
For every keyword designated for indexing, there is an entry in the structure memory (SM) consisting of the keyword itself and a list of index terms. An index term is composed of a cylinder number f and a security compartment number s.
An index term (f, s) for a keyword K, therefore, indicates that there exist one or more records containing the keyword K that reside in the cylinder f of the mass memory (MM) and that belong to the security compartment s. For access control, the query conjunction of a user is processed as follows. For each predicate with an indexed attribute, the structure memory (SM) determines all those keywords that satisfy the predicate. Corresponding to each of such keywords, a set of index terms is retrieved. The sets of index terms for all such predicates are then intersected (by the structure memory information processor to be discussed in the following section). The result of the intersection is a list L of index terms for the given query conjunction. This list L of index terms is compared, by the data base command and control processor (DBCCP), against the user’s access privilege list to determine the final list L′. The list L′ includes only those (f, s) pairs of L where the required access is permitted on the security compartment s. The list L′, together with the query conjunction and the requested access, is now forwarded to the mass memory (MM).
As we stated earlier, the mass memory stores a record as variable-length attribute-value pairs, together with a record body. For the purpose of identifying the security compartment to which it belongs, each record is also tagged with the security compartment number (which was not shown in Fig. 27). Given a query conjunction Q and a list L′ of index terms (f, s), the mass memory can then narrow its content-addressable search to those cylinders whose numbers appear in L′. For each unique cylinder number f in L′, the mass memory will access cylinder f, disregard those records that are not tagged with one of the corresponding security compartment numbers s, and output only those that satisfy the conjunction.
Since the structure memory is on the order of 100M bytes, the speed requirement implies that the memory must be content-addressable and that the searching operation should be carried out by multiple processing elements. Cost effectiveness requires that we (1) utilize a small number of processors; (2) assign a number of memory modules to each processor; and (3) provide a mechanism to identify a single module (if possible) for search by each processor in response to an index search request. The structure memory organization, depicted in Fig. 29, adheres to these guidelines.
The structure memory is organized as an array of memory unit-processor pairs, which are managed by a controller. A memory unit, in turn, is composed of a set of memory modules. All memory modules are of the same fixed size. A processor can address any memory module within its memory unit, and then content-address the entire module.
FIG. 29. Organization of the structure memory. (Note: shaded modules constitute a single physical bucket.)
Furthermore, the structure memory controller can trigger all the processors to content-address their corresponding modules simultaneously.
Whenever possible, searching the structure memory on the basis of a given keyword should be restricted to at most one module from each memory unit. To achieve this goal all keywords and their index terms corresponding to a particular attribute (and lying within a given value range) will constitute a bucket. Each bucket is physically distributed among the various memory units in order that it may be searched in parallel by all the processors. Ideally, a bucket is placed in n modules, one from each of the n different memory units. Each bucket may be placed in one or more modules (as many as necessary) evenly distributed among different memory units. The concept is also illustrated in Fig. 29, where the shaded modules contain a single bucket. The bucket to which a keyword and its index terms belong is determined by a separate component of the data base computer called the keyword transformation unit (KXU), which we will not discuss here.
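One plausible reading of the placement rule is a round-robin spread of a bucket's entries over the memory units, as in the sketch below; the module capacity and the bucket contents are invented for the illustration.

    # Distribute a bucket's entries round-robin over the memory units so
    # that all processors can search the bucket in parallel.
    def place_bucket(entries, n_units, module_capacity):
        units = [[] for _ in range(n_units)]
        for i, entry in enumerate(entries):
            units[i % n_units].append(entry)
        # each unit stores its share in as many fixed-size modules as needed
        return [[unit[j:j + module_capacity]
                 for j in range(0, len(unit), module_capacity)]
                for unit in units]

    bucket = [("salary>15000", ("f2", "s2")), ("salary>20000", ("f5", "s1")),
              ("salary>25000", ("f2", "s3"))]
    print(place_bucket(bucket, n_units=2, module_capacity=2))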
FIG. 30. Sample of three buckets. (Example: file A; attributes: salary, name, and location; attribute IDs: 1, 2, and 3.)
Nevertheless, we provide a sample of bucket names and bucket information in Fig. 30. One of the functions of the structure memory is to map a bucket name into the memory modules allocated to the bucket. For this purpose, the controller has a small random access memory in which it records a bucket name and stores the corresponding module numbers. Thus, given a bucket name, all the processors can work simultaneously on the modules that contain the bucket.
(c) The Design of a Structure Memory Information Processor. The structure memory information processor (SMIP) intersects the sets of index terms delivered by the structure memory. For an understanding of the operation of the SMIP, let us consider a query conjunction Q, Q = P_1 ∧ P_2 ∧ . . . ∧ P_n, where each P_i is a predicate. The data base command and control processor (DBCCP) makes use of the structure memory and the SMIP to determine the set of index terms to be sent to the mass memory. After the SMIP memory is cleared, the first set of index terms for keywords satisfying P_1, called the argument set of P_1, is provided by the structure memory and then stored in the SMIP memory. Each of the stored index terms is initially associated with a count of one, indicating the number of predicates it has satisfied. Next, the argument set of P_2 is provided by the structure memory and sent to the SMIP. The associated count of an existing index term in the SMIP memory is incremented by one if the index term matches an index term of the argument set of P_2. The process for P_2 is repeated for each of the other predicates. At the end of this entire process, the stored index terms whose counts are n represent a refined list applicable to the evaluation of Q. This list of index terms is then retrieved by the SMIP and forwarded to the data base command and control processor (DBCCP). Subsequently, the list is checked by the DBCCP for security clearance, before being transmitted to the mass memory.
The most important part of the above procedure is the determination of whether an index term already exists in the SMIP memory. To perform
this task rapidly the SMIP is implemented as a set of MU-PE pairs, where MU is a memory unit and PE is a processing element. Since the total number of index terms stored in the SMIP memory is small (in fact, this number is never more than the largest number of index terms of a single attribute), the memory units (MUs) forming the SMIP memory can be made from fast random access memory.
The hardware organization of the SMIP is shown in Fig. 31. Each memory unit is a single module of random access memory. The processing
FIG. 31. Hardware organization of the SMIP, shown after the last (i.e., third) set of indices is processed; the surviving index term (f2, s2) carries a count of 3.
elements are composed of microprocessors and are capable of doing comparison-type operations. The SMIP controller must be quite fast, so that it can process index terms at the same rate as it receives them. The common memory bus is used for data transfer when an MU overflows and requires space within another MU.
The contents of the SMIP reflect the intersection of three buckets of indices in response to the following conjunction:
(Salary > 15000) ∧ (Name = HSIAO) ∧ (Location ≠ Michigan).
It first shows the loading of the first bucket of indices. It then shows the result after the third bucket of indices has been loaded. We note that the index term which has a count of 3 is (f2, s2). In other words, only in cylinder f2 and security compartment s2 is there at least one record whose salary is greater than 15,000, name is HSIAO, and location is not in Michigan. (Incidentally, these three buckets of indices are the same as those that appeared in Fig. 30.)
(d) Toward Very Large Data Base Management. Since a large number of common data base management functions are implemented in hardware, the functionally specialized data base computer is expected to perform appreciably better than a computer that provides these functions by software. The high cost and slow performance of software security enforcement may also be absorbed by the hardware. Support of very large data bases in an on-line and interactive mode should now be cost-effective, since storage is made of relatively low-cost and simply modified moving-head disks. The mass memory information processor (MMIP), if need arises, may be expanded to simultaneously handle several cylinders, each of which is from a separate disk drive. It is only necessary that the number of TIPs in the MMIP be increased accordingly (i.e., one set of TIPs for each drive). Although the mass memory is expanded into even larger content-addressable blocks (each made up of several cylinders), a structure memory is still required, since no two blocks may be accessed concurrently. However, as the size of these blocks grows, the need for clustering and the amount of indexing decrease. Thus, the structure memory may decrease in size. Another benefit may occur if there is a multiplicity of MMIPs where each MMIP handles a separate query conjunction, thereby allowing user queries to be multiprocessed.
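The counting scheme of subsection (c) and the DBCCP's security filtering can be paraphrased as follows. The bucket contents echo the Salary/Name/Location example above; the privilege list is an assumption of the sketch.

    # SMIP-style intersection: count how many argument sets each index
    # term appears in; keep those whose count equals the number of
    # predicates, then filter by the user's security compartments.
    from collections import Counter

    def intersect(argument_sets):
        counts = Counter()
        for arg_set in argument_sets:
            counts.update(set(arg_set))        # one count per predicate
        n = len(argument_sets)
        return [term for term, c in counts.items() if c == n]

    salary_terms   = [("f1", "s1"), ("f2", "s2")]
    name_terms     = [("f2", "s2"), ("f3", "s1")]
    location_terms = [("f2", "s2"), ("f1", "s1")]

    L = intersect([salary_terms, name_terms, location_terms])
    authorized = {"s2"}                        # assumed privilege list
    L_prime = [(f, s) for (f, s) in L if s in authorized]
    print(L, L_prime)                          # only ('f2', 's2') survives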
5. Can the New Data Base Computers Replace the Existing Data Base Management Software with Improved Performance?
In the area of text retrieval, there has never been any satisfactory text-retrieval software. Thus, the arrival of text-retrieval hardware is
particularly welcomed. The associative array data base computer for text processing depicted in Fig. 13 is a working system. The cellular logic approach to text retrieval has reached its prototype stage at Tektronix. Both the CIA in the United States and IRIA in France are experimenting with text-retrieval computers based on the finite-state automata approach. Meanwhile, the Universities of Illinois and Central Florida are pursuing some additional research and prototype work on the finite-state automata and cellular logic approaches, respectively.
In the area of formatted data bases, there are many widely used data base management software systems running with adequate performance on conventional general-purpose computers. Thus, in order to replace these existing data base management software systems, two issues must be addressed. First, the existing data base must be transformed into the storage format of the new data base computer. This one-time transformation, known as data base transformation, is required to preserve the semantics of the data base and to take advantage of the advanced hardware features of the new computer. Second, the data base sublanguage used in the existing application programs must be supported in real time by the new data base computer so that application programs may be executed in the new environment without the need of program conversion. Such real-time translation of sublanguage calls to the instructions of the new data base computer, known as query translation, must be straightforward and require minimal software support.
We recall in our earlier discussion that there are three prevailing data base models (hierarchical, CODASYL, and relational) that underscore the formats and data sublanguages of the contemporary data bases. For our convenience we will single out one hierarchical data base system for the study of the data base transformation and query translation issues. In particular, we choose the Information Management System (IMS), which is one of the most commonly used hierarchical data base systems.
An IMS data base consists of a number of hierarchically related segments, each of which belongs to a segment type. In the example of Fig. 32, segment type A, the root segment type, has three segments: A1, A2, and A3. All others are dependent segment types, each having a unique parent segment type and zero or more child segment types. Some relationships among the various segments in our example are: A1 is the parent of B1 and G1. H1, H2, I1 are children of G1. J1 and J2 are twins. H1, H2, I1, J1, J2 are descendants or dependents of G1.
Successive levels are numbered such that a root segment is at level 1. All segment occurrences are made of one or more fields.
FIG. 32. Schematic representation of an IMS data base.
An IMS data base is traversed in the order parent to child, front to back among twins, and left to right among children. The traversal sequence for
the data base of Fig. 32 is A1, B1, C1, D1, D2, D3, E1, F1, E2, F2, F3, G1, H1, H2, I1, J1, J2, A2, A3. Notice that the traversal sequence defines a next segment with respect to a given segment. A hierarchical path is a sequence of segments, one per level, starting at the root (e.g., A1, G1, I1, J2). An IMS user processes an IMS data base with application programs using Data Language/1 (DL/1). A DL/1 call has the following format:
FUNCTION  search-list
where FUNCTION is one of insert (ISRT), delete (DLET), replace (REPL), and get (GET) calls, and where search-list is a sequence of segment search arguments (SSAs), at most one per level, which are used to select a hierarchical path. Each segment search argument is of the form <segment-type> (<Boolean expression>), in which the Boolean expression relates values of attributes of the given segment type (see Fig. 33). After each retrieval or insertion operation, a segment is “established”
FIG. 33. Logical data structure of an IMS data base. There are five types: course, prereq, offering, teacher, and student. There are thirteen attributes that are “placed” in the boxes. There are five sequence fields which are marked with *. The number of records (or segments) for each type is not shown and is not important for the illustration.
in the traversal sequence of the IMS data base. For a retrieval operation, this segment refers to the segment just retrieved; for an insertion operation, this segment refers to the segment just inserted. Such a segment in the traversal sequence is termed the current position in the data base. The hierarchical path leading from the root segment to the current position in the data base consists of many segments. Each of these segments is called the segment on which position is established at that level.
There are several forms of the get call, each of which returns a single segment. A get-unique (GU) call retrieves a specific segment at level n by starting at the root segment type, finding the first segment at each level i satisfying SSA_i, and finally retrieving the segment satisfying the last SSA. A rather detailed example of GU is given later in Section 5.2.1. A get-next (GN) call starts the search at the current position in the data base and proceeds along the traversal sequence satisfying the SSAs and retrieving the segment satisfying the last SSA. A get-next-within-parent (GNP) call restricts the search to descendants of a specific parent segment. Thus IMS also maintains a parent position that is set at the last segment that was retrieved by a GU or GN call. The parent position remains constant for successive GNP calls. Several examples involving get calls are analyzed in Section 5.2.2.
We shall call the environment consisting of the new data base computer and a front-end host computer the new environment. The environment consisting of a general-purpose computer acting alone with data base management software and on-line I/O routines will be called the old environment.
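The traversal rule, parent to child, front to back among twins, and left to right among children, is a preorder walk. The sketch below encodes one tree shape consistent with the relationships quoted for Fig. 32 (the figure itself is only partly recoverable, so the exact shape is an assumption) and reproduces the traversal sequence given earlier.

    # Preorder traversal of an IMS-style hierarchy: parent before child,
    # twins front to back, children left to right.
    def traverse(forest):
        order = []
        def walk(seg):
            name, children = seg
            order.append(name)
            for child in children:
                walk(child)
        for root in forest:
            walk(root)
        return order

    fig32 = [("A1", [("B1", [("C1", [("D1", []), ("D2", []), ("D3", [])]),
                             ("E1", [("F1", [])]),
                             ("E2", [("F2", []), ("F3", [])])]),
                     ("G1", [("H1", []), ("H2", []),
                             ("I1", [("J1", []), ("J2", [])])])]),
             ("A2", []), ("A3", [])]
    print(traverse(fig32))
    # -> A1, B1, C1, D1, D2, D3, E1, F1, E2, F2, F3, G1, H1, H2, I1, J1, J2, A2, A3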
5.1 Data Base Transformation
An IMS data base can be restructured by considering every IMS segment as a data base computer record (or, simply, a record) composed of
keywords. Address-dependent pointers of the segments are replaced in the records by keywords that are not dependent on physical location.
5.1.1 Notion of Symbolic Identifier
An IMS segment includes a sequence field whenever it is necessary to indicate the order among the twin segments. Since each segment becomes a record and no address-dependent pointers are allowed, we assign a symbolic identifier to each segment, identifying it uniquely from all other segments in the data base. The symbolic identifier of a segment S is a group of fields consisting of (1) the symbolic identifier of the parent of S and (2) the sequence field of S. Since the sequence fields of different segment types may use the same field name, we may qualify the field name with the segment type.
5.1.2 Conversion of IMS Segments
The creation of a record from an IMS segment can now be accomplished by forming keywords as follows:
1. For each field in the segment, form a keyword using the field name as the attribute and the field value as the value.
2. Form a keyword of the form <Type, SEGTYPE>, where Type is a literal and SEGTYPE is the segment type in consideration.
3. For each sequence field in the symbolic identifier of the segment, form a keyword using the field name as the attribute and the field value as the value.
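Rules 1-3, together with the recursive definition of the symbolic identifier, translate directly into code. In the sketch below the segment types and field names follow Fig. 33; the function and its calling convention are inventions of the example.

    # Convert an IMS segment to a data base computer record (a set of
    # keywords). The symbolic identifier is the parent's identifier plus
    # this segment's own sequence field.
    def to_record(seg_type, fields, seq_field, parent_identifier=()):
        identifier = parent_identifier + ((seq_field, fields[seq_field]),)
        record = [("Type", seg_type)]                                # rule 2
        record += [(name, value) for name, value in fields.items()]  # rule 1
        record += [kw for kw in identifier if kw not in record]      # rule 3
        return record, identifier

    course, course_id = to_record("Course", {"Course#": "M101",
                                             "Title": "Calculus"}, "Course#")
    prereq, _ = to_record("Prereq", {"Prereq.Course#": "M100",
                                     "Title": "Algebra"},
                          "Prereq.Course#", parent_identifier=course_id)
    print(prereq)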
FIG. 34. Attribute templates of records for the segments of Fig. 33. Every symbolic identifier is underlined.
For example, for the IMS data base shown in Fig. 33 the attribute templates of the five collections of records corresponding to the five segment types are shown in Fig. 34. Qualified field names such as Prereq.Course# are used to distinguish the same field names (i.e., Course#) among different segment types.
5.1.3 Clustering of the New Data Base
The access pattern of the segments should be used to determine the clustering policy in the new computer. Since the traversal of an IMS data base is usually along a hierarchical path, one clustering policy is to first cluster the records that represent all the IMS root segments and then, for each root segment, cluster the records that represent all dependent segments. An application of this policy is illustrated in Fig. 35. An advantage of using this policy is that if several root segments are to be accessed collectively, a single cylinder access will retrieve each segment. Furthermore, since for a given root segment the average size of all the segments in a hierarchical path is usually smaller than the size of a cylinder, it is possible to cluster all the dependent segments of the root segment in the same cylinder.
For the clustering policy an IMS data base is, therefore, created on the new computer with two kinds of clusters, one cluster containing only the converted root segments and the other type of cluster containing the remainder of the converted segments. The only clustering attribute for the first cluster is Type.
FIG. 35. Application of the clustering policy.
This assures that all the converted root segments form a single cluster, since they all have the same clustering keyword. If the sequence field of the root segment is called Seq, then the only clustering attribute for the second kind of cluster is Seq. Thus, there are as many clusters as there are unique sequence field values in the root segments (see Fig. 35). The clustering keywords also constitute the only keywords to be stored in the structure memory. Since there is a one-to-one correspondence between an IMS segment and its converted form, we shall refer to them in the sequel without confusion with either terminology.
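The two-kind clustering policy amounts to a grouping rule: root records cluster on the keyword Type, and each dependent record clusters on the sequence-field value of its root. A sketch with invented records:

    # Cluster placement: all converted root segments share one cluster
    # (keyword Type); dependents of a root cluster on its Seq value.
    from collections import defaultdict

    def assign_clusters(records):
        clusters = defaultdict(list)
        for rec in records:
            if rec["is_root"]:
                clusters[("Type", rec["Type"])].append(rec["name"])
            else:
                clusters[("Seq", rec["Seq"])].append(rec["name"])
        return dict(clusters)

    records = [{"name": "A1", "is_root": True,  "Type": "Course", "Seq": 1},
               {"name": "A2", "is_root": True,  "Type": "Course", "Seq": 2},
               {"name": "B1", "is_root": False, "Seq": 1},
               {"name": "G1", "is_root": False, "Seq": 1}]
    print(assign_clusters(records))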
5.1.4 Storage Analysis
Storage is required for the indices and the data base store. If we ignore all secondary indices that may be maintained in the old environment, then the index storage requirements are about the same in the two environments. In fact, the primary index maintained in the old environment has almost the same number of entries as there are keywords of the structure memory in the new environment. There is only one keyword (namely, . . .

. . . 0.99, evidence exists for concluding that the proximity matrix was not chosen at random. The threshold 0.99 is arbitrary. The intuitive idea behind this test is that, when the data are clustered, the within-cluster edges will tend to occur before the between-cluster edges, thus delaying the formation of a connected graph. Two other tests are based on the empirical distribution of node degrees in a threshold graph G_v, with n and v fixed, and on the number of cycles of order k in an experimentally derived graph, but are not as easily applied as the connectivity test. A single-link clustering algorithm (Section 3.3.1), which is especially appropriate for a rank-order proximity matrix, supplies v*,
and tables in Ling and Killough (1976), which cover values of n up to 100, are consulted for significance level.
Ling (1973a) has proposed a means for testing the “compactness” of a clustering structure based on the number of nodes in clusters and the Random Graph Hypothesis. In our framework, Ling’s procedure provides information concerning clustering tendency. The number i* of nodes incident to one or more edges is observed in a threshold graph, G_v. Let P(i; n, v) be the probability, under the Random Graph Hypothesis, that exactly i nodes are incident to some edge in G_v. Ling (1973a) provides a recurrence relation for P(i; n, v) for i = 1, . . . , n. The probability that at most i* nodes are incident to some edge in G_v under the Random Graph Hypothesis is
F(i*; n, v) = Σ_{i=1}^{i*} P(i; n, v).
The sequence of numbers F(i*; n, v) for v = 1, 2, . . . , n(n − 1)/2 thus indicates the levels, if any, at which i* clustered points differ from that expected when no clustering exists. Small values of F(i*; n, v) would lead to rejection of the null hypothesis. One difficulty with this statistical test is its very poor power. For example, if the data points were organized into a moderate number of compact clusters, all nodes would soon be in clusters and F(i*; n, v) would be large for all v. Also, the presence of a few outliers in otherwise random data would artificially delay the inclusion of all nodes in clusters, leading to unjustified rejection of the null hypothesis.
No quantitative procedures are available in the literature for testing clustering tendency when the entries in the proximity matrix are on interval or ratio scales. The tests cited above based on Random Graph Theory must be used with great care. As Ling (1973a) points out, a negative result, or acceptance of the null hypothesis, is a true indication that no clustering structure exists in the data. However, a rejection of the null hypothesis is not particularly significant because meaningful alternative hypotheses have not been formulated; a practical and mathematically usable definition of “clustering structure” does not exist. Thus, these tests for clustering structure are mainly ways of identifying structure-less data.
3.2.2 Pattern Matrices
When the patterns to be examined occur as n points in a d-dimensional pattern space, the clustering tendency problem has a geometrical nature. The null, or “randomness,” hypothesis is that the patterns are independently
drawn from a uniform or unimodal distribution over some simple hypervolume, such as a hypercube or hypersphere. Such a hypothesis is called a Random Position Hypothesis.
Strauss (1975, 1977) has suggested a means for testing clustering tendency in a pattern matrix. He characterized the clustering mechanism in terms of a “clustering parameter” V, which is zero when no clustering exists. The test for clustering tendency is based on the value of V. Kelly and Ripley (1976) disproved the Strauss characterization lemma and provided a slightly different characterization. Saunders and Funk (1977) clarified several issues and demonstrated that, when the patterns are sufficiently sparse, the statistic used for testing clustering tendency has a known asymptotic distribution. Their test statistic is Y_n(r), the number of interpoint distances that are r or less. That is, Y_n(r) is the number of points captured in spheres of radius r centered at each of the n patterns in turn, not counting the points at the centers. Stated in rough terms, Saunders and Funk (1977) showed that if n increases and the volume of the feature space increases as n^2, Y_n(r) has, under a Random Position Hypothesis, an asymptotic Poisson distribution with parameter
[n(n − 1)/2] [V_sphere / V_space],
where V_sphere is the volume of a sphere of radius r and V_space is the volume of the space holding the patterns. The test is to compare Y_n(r) to a threshold chosen as the (1 − α)th percentile of the null distribution and reject the null hypothesis if Y_n(r) is large.
Several practical difficulties must be overcome to implement this interesting idea. Determining V_space can be difficult. For example, the assumption of a spherically shaped volume can lead to inaccuracy if the volume is actually a hypercube, and the inaccuracy worsens as d increases. If one applies an eigenvector transformation, which rotates the feature space to uncorrelate the original features, and applies a diagonal transformation to equalize variances, a spherical volume might be reasonable. One is then faced with choosing a suitable r. Saunders and Funk (1977) suggest an intuitive test. However, when d is large, say 10 or more, small changes in r, n, and the radius of the space can dramatically affect the results of the test. The test is also sensitive to nonuniformity in the underlying distribution. Silverman and Brown (1978), Ripley and Silverman (1978), Besag and Diggle (1977), and Diggle (1979) have all studied this problem of examining the tendency of spatial patterns to cluster, with most results being directly applicable only to two-dimensional pattern spaces.
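Under the stated assumptions (patterns uniform over a known volume, sparseness sufficient for the Poisson limit, and boundary effects ignored), a minimal version of the Saunders-Funk test might be sketched as follows; choosing r and determining V_space remain the user's burden.

    # Count interpoint distances <= r and compare with the Poisson null:
    # Y_n(r) ~ Poisson(C(n,2) * V_sphere / V_space) under randomness.
    import math, itertools, random

    def y_n(patterns, r):
        return sum(1 for p, q in itertools.combinations(patterns, 2)
                   if math.dist(p, q) <= r)

    def poisson_quantile(lam, alpha=0.05):
        # smallest k with P[Y <= k] >= 1 - alpha
        k, term = 0, math.exp(-lam)
        cdf = term
        while cdf < 1 - alpha:
            k += 1
            term *= lam / k
            cdf += term
        return k

    d, n, r = 2, 100, 0.05
    pts = [tuple(random.random() for _ in range(d)) for _ in range(n)]
    v_sphere = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d
    lam = math.comb(n, 2) * v_sphere / 1.0     # V_space = unit square
    print(y_n(pts, r), "reject if above:", poisson_quantile(lam))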
One seemingly straightforward way of studying the tendency of patterns to cluster is to examine the distribution of interpattern distances. That is, choose a statistic sensitive to variations in interpattern distance, establish its distribution under the Random Position Hypothesis, and formulate a test of this hypothesis in the usual way. Kruskal (1972) provided a specific example of this approach when he proposed a version of the coefficient of variation, computed from observed interpattern distances, as an index of clustering. Although Kruskal provided empirical evidence for the rationality of this test, he was unable to develop all the theory needed to establish a generally applicable test.
It would seem that distributions derived theoretically under a Random Position Hypothesis could be exploited to create tests of clustering tendency. A body of theoretical results exists that might be so applied (Harding and Kendall, 1974; Coleman, 1969; Bartlett, 1975). Hammersley (1950), Lord (1954), and Alagar (1976) have all derived the distribution of the distance between two points chosen randomly inside a d-dimensional hypersphere, which is the basis for several Intrinsic Dimensionality algorithms. The distribution of the Euclidean distance r between two randomly chosen points can also be found in other cases. If the coordinates for each point are values of independent, standardized normal random variables (mean zero, variance one), then any statistics textbook (e.g., Wilks, 1963) shows that r^2/2 has a chi-squared distribution with parameter d, viz.,
I271 = 10’
1 2(d/2q-
):(
$d/2)-le-W2
’&
if r > 0.
The mean value of r^2 is 2d and the variance is 8d. Parzen (1960) shows that (r^2/2)^{1/2} has a chi distribution with parameter d, but the distribution of r itself does not have a recognizable form.
Translating theoretical results into practical tests for clustering tendency is not as easy as might be expected. In applying statistical tests for randomness to investigate clustering tendency, we encounter many pitfalls, some of which are now discussed. To begin, the Central Limit Theorem suggests that r^2 has an asymptotically normal distribution no matter what the distribution of the individual coordinates, the shape of the surface enclosing the points, or the clustering structure. This suggests that first- and second-order moments of interpoint proximity measures might indicate clustering tendency. Unfortunately, the departure from normality for a fixed d is unknown; the exact distribution for r is known only in a few situations.
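The quoted moments are easy to confirm by simulation, as in the following sketch (the sample size is arbitrary):

    # For two independent standard normal points in d dimensions,
    # r^2 = sum of (x_i - y_i)^2 has mean 2d and variance 8d.
    import random

    def squared_distance_moments(d, trials=100_000):
        samples = []
        for _ in range(trials):
            r2 = sum((random.gauss(0, 1) - random.gauss(0, 1)) ** 2
                     for _ in range(d))
            samples.append(r2)
        mean = sum(samples) / trials
        var = sum((s - mean) ** 2 for s in samples) / trials
        return mean, var

    d = 5
    print(squared_distance_moments(d), "expected:", (2 * d, 8 * d))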
Normalization of data is also a factor. As many spurious factors as possible, such as scale differences and interfeature correlation, should be eliminated by normalization so that tests of clustering tendency are sensitive only to data structure. For example, comparing a set of patterns to patterns generated under a uniform distribution requires an adjustment of the sample mean, equalization of the sample variance, and diagonalization of the correlation matrix. One way of doing this is to express the patterns in terms of principal components, which raises questions about the intrinsic, or “true,” dimensionality of the patterns. Outliers, or patterns far removed from the main swarm of patterns, must also be isolated and deleted. Outliers can be caused by errors in recording data or anomalies in the phenomenon being studied. The presence of a few outliers can make otherwise randomly generated patterns appear nonrandom.
These considerations generate more basic questions such as: Does “random” mean “unclustered”? Are patterns from a single unimodal Gaussian distribution unclustered or clustered into one cluster? Statistically reasonable definitions of randomness should not dictate the meaning of “cluster.” One other point that must be considered pertains to the independence assumption. The theoretical derivations imagine two points selected independently. A set of n points involves n(n − 1)/2 interpoint distances, but each point is involved in n − 1 of these distances so only n/2 of the distances can be independent. Which set of n/2 distances should be observed?
Clustering tendency has been virtually ignored in applications, for many of the reasons discussed above. No suitable theory exists and few practical procedures are available. However, we feel that trying to answer these questions will save a great deal of time and effort in subsequent analysis and obviate inappropriate applications of clustering algorithms.
3.3 Hierarchical Clustering
A hierarchical clustering method is a procedure for creating a nested sequence of partitions of the patterns from a proximity matrix and tries to organize the proximity matrix as a type of evolutionary tree. We can begin with each pattern in a distinct group, or cluster, and proceed to link the pair of groups that most resemble one another in an iterative fashion until all n patterns are in a single group. This is an agglomerative approach (Section 3.1). We might also use a divisive approach, where all n patterns begin in a single cluster that is split into two subsets of patterns. Each subset is split again and again until each pattern is in a cluster by itself. The tree, or dendrogram (also called phenogram), that arises naturally from these procedures provides a picture of a hierarchical structure. The investigator uses this picture to explore data and pursue the goals of Cluster Analysis, perhaps in conjunction with other representations of the data.
We concentrate here on the mechanism whereby the proximity matrix is transformed into a dendrogram. We distinguish between a clustering method, which is a mathematical characterization or idealization of a procedure for doing this, and a clustering algorithm, which is a specific sequence of steps that can be programmed on a computer. For example, some clustering methods can be implemented by both agglomerative and divisive algorithms. The method that corresponds to a particular algorithm is not always clear, especially since the algorithm often comes first in the developmental process (Everitt, 1974; Anderberg, 1973). A clustering method can be stated in various mathematical frameworks, such as graph theory and set theory. Theoretically interesting questions arise when attempting to demonstrate equivalences between clustering methods stated in different frameworks. The ingredients necessary for a specification of a clustering method have not been universally established. A few metastudies have faced this question (Lance and Williams, 1967a,b), but are beyond the bounds of this review.
3.3.1 Hierarchical Clustering Phrased in Graph Theory
Hubert (1974a) has explained hierarchical clustering methods in graph-theoretic terms. This explanation has several advantages over earlier methods. It provides a general overview of several methods that is mathematical, yet understandable. It also shows how new clustering methods can be devised, and leads directly to measures of cluster validity. The most important ground rule for this section is that the dissimilarity matrix is ordinal without ties. Not allowing ties simplifies the presentation considerably.
(a) Threshold and Proximity Graphs. Consider the n × n dissimilarity matrix D = [d(i, j)]. We view d(i, j) as a measure of dissimilarity between patterns x_i and x_j. The entries of D are required to be on an ordinal scale with no ties. Since there are n(n − 1)/2 entries above the main diagonal, we take the entries of D to be an arrangement of the integers {1, 2, . . . , n(n − 1)/2}.
Hierarchical clustering methods are expressed in terms of a series of undirected graphs (Zahn, 1971; Everitt, 1974) without self-loops (slings) or multiple edges. These graphs contain n nodes, one per pattern, and are pictorial representations of the binary symmetric relations R_v defined below. For each distinct proximity value v, the relation R_v is the set of pairs of integers (i, j), where 1 ≤ i ≤ n and 1 ≤ j ≤ n, for which
(i, j) ∈ R_v   if d(i, j) ≤ v,   i ≠ j.
Note that
(i, j) ∉ R_v   if i = j.
For simplicity, we require that node i corresponds to pattern x_i. The graph G_v representing R_v contains an edge between nodes i and j if and only if (i, j) ∈ R_v; i.e., nodes i and j are linked. We will call G_v a threshold graph when only the existence of the edges is of importance and a proximity graph when the edges (i, j) are weighted with proximity d(i, j). Threshold graphs are usually sufficient for clustering with ordinal proximity matrices.
Example 3.1. An ordinal dissimilarity matrix for a set of five patterns is given below.
        1    2    3    4    5
   1    0    3    4   10    7
   2    3    0    6    1    2
D =  3    4    6    0    8    9
   4   10    1    8    0    5
   5    7    2    9    5    0
Because of symmetry and diagonal requirements, only half the proximity matrix need be shown. The relation R_6 and graph G_6 are shown in Fig. 3. An asterisk in position (i, j) in the matrix means (i, j) ∈ R_6.

(b) Clusters and Clusterings. We begin with the set X = (x_1, x_2, . . . , x_n) of n patterns and an ordinal dissimilarity matrix D = [d(i, j)] and proceed to generate a nested sequence of clusterings. Let

(C_{u1}, C_{u2}, . . . , C_{uW_u})

be the uth clustering, which contains W_u clusters. That is, the following properties are satisfied (∅ is the empty set):

i. For every x_i, x_i ∈ C_{uj} for some j;
ii. C_{uj} ∩ C_{uk} = ∅ if j ≠ k;
iii. C_{uj} ≠ ∅;
iv. ∪_{j=1}^{W_u} C_{uj} = X.
FIG. 3. Relation R_6 and graph G_6 for the dissimilarity matrix of Example 3.1.
In other words, a clustering is a partition of X into nonempty subsets called clusters. The clusterings are nested because:
i. C_{0j} = (x_j), j = 1, . . . , n, and W_0 = n;
ii. C_{uj} ⊆ C_{u+1,k} for all j and some k, u ≥ 0;
iii. W_{u+1} = W_u - 1, u ≥ 0;
iv. C_{n-1,1} = X.
The zeroth clustering is called the disjoint clustering and the (n - 1)st, the conjoint clustering. Two clusters from the uth clustering merge to form the (u + 1)st clustering. A clustering method can be established by specifying the manner in which the clusters merge.

(c) The Single-Link Method. This hierarchical clustering method is also known as the near-neighbor method and the connectedness method. It has been shown to have many desirable theoretical properties (Jardine and Sibson, 1971; Sibson, 1973) and some undesirable characteristics in practice (Wishart, 1969). Given the uth clustering, define the function Q_S for all pairs of clusters as

Q_S(C_{ur}, C_{ut}) = min [d(i, j) | the maximal subgraph of G_{d(i,j)} defined by C_{ur} ∪ C_{ut} is connected].
Beginning with the disjoint clustering, the single-link clustering method is defined by specifying the pair of clusters to be merged. The pair (C_{up}, C_{uq}) is merged under the single-link method if

Q_S(C_{up}, C_{uq}) = min [Q_S(C_{ur}, C_{ut})],

where the minimum is taken over all 1 ≤ r ≠ t ≤ W_u. In other words, begin with a graph containing the n nodes, representing the patterns in X, and no edges. Add edges one at a time, in the order of the entries in D. Every connected subgraph is a cluster and every new edge that establishes a connected subgraph defines a clustering. The requirement that a set of nodes in a cluster form a connected subgraph of a threshold graph is a minimal requirement and can lead to the formation of clusters with little cohesion, sometimes called "straggly" clusters. A second criticism of single-link clusters is that they can chain together easily into loose clusters. That is, only a single link between two clusters in a threshold graph is needed to merge the clusters and, in a sense, make all the nodes in the two clusters equivalent. Chaining also occurs when the hierarchy of single-link clusters forms by adding one node at a time to a single large cluster. Several of the clustering methods described in this section were motivated by the need to overcome this chaining.
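To make the edge-addition view concrete, the following minimal Python sketch (the names single_link and find are illustrative and not part of the original presentation) adds edges in order of increasing dissimilarity and records a new clustering each time an edge joins two previously disconnected subgraphs, using a union-find forest to track connected subgraphs.

    # Single-link clustering by successive edge addition (a sketch of the
    # graph-theoretic formulation above).
    def single_link(D):
        """D: n x n symmetric dissimilarity matrix with no ties.
        Returns a list of (level, clustering) pairs, one per merger;
        patterns are reported with 1-based labels to match the text."""
        n = len(D)
        parent = list(range(n))                 # union-find forest

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
        clusterings = []
        for v, i, j in edges:                   # add edges in rank order
            ri, rj = find(i), find(j)
            if ri != rj:                        # a new connected subgraph forms
                parent[ri] = rj
                groups = {}
                for k in range(n):
                    groups.setdefault(find(k), []).append(k + 1)
                clusterings.append((v, sorted(groups.values())))
        return clusterings

Applied to the matrix of Example 3.1, this should retrace the single-link clusterings of Example 3.2; the n - 1 edges that cause mergers are exactly the edges of the minimum spanning tree discussed in Section 3.3.2.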
FIG. 4. (a) Sequence of threshold graphs for Example 3.2 and (b) dendrogram for Example 3.2.
Example 3.2. The sequence of threshold graphs for the dissimilarity matrix in Example 3.1 is shown in Fig. 4a. For simplicity, the patterns are referenced (1, 2, 3, 4, 5) instead of (x_1, . . . , x_5). Since all nodes are linked in a single connected subgraph in G_4, we need go no further. The sequence of single-link clusterings is listed below.
v    clusters (single-link)
1    (2, 4), (1), (3), (5)
2    (2, 4, 5), (1), (3)
3    (1, 2, 4, 5), (3)
4    (1, 2, 3, 4, 5)
The list of clusterings is conveniently summarized in the dendrogram shown in Fig. 4b. Note that, in general, the entire list must be available before the dendrogram can be constructed because the patterns cannot be ordered arbitrarily. Several items should be noted regarding Example 3.2. Only four of the ten different entries (above the main diagonal) are used. Second, G_4 is the minimum spanning tree for the complete graph G_10 (Section 3.3.2). Third, if D were altered so that edge 4-5 were added before edge 1-3, then more threshold graphs would have been needed to connect all five nodes. However, the list of clusterings and the dendrogram would not change. Finally, no more than one pair of clusters can merge at any level. The clusterings in the nested sequence forming the single-link hierarchy
are numbered sequentially. When the sequence numbers are listed next to the tree, as in Example 3.2, we have a threshold dendrogram.

(d) The Complete-Link Method. Changing the word "connected" to the word "complete" in the definition of single-link clustering establishes an entirely new clustering method called the "complete-link" or "diameter" method. Again, we begin with a set of patterns and an ordinal dissimilarity matrix [d(i, j)]. Given the uth clustering (C_{u1}, C_{u2}, . . . , C_{uW_u}), define the measure Q_C of proximity between a pair of clusters as

Q_C(C_{ur}, C_{ut}) = min [d(i, j) | the maximal subgraph of G_{d(i,j)} defined by C_{ur} ∪ C_{ut} is complete].
The pair (C_{up}, C_{uq}) is merged under the complete-link method if

Q_C(C_{up}, C_{uq}) = min [Q_C(C_{ur}, C_{ut})],

the minimum being taken for 1 ≤ r ≠ t ≤ W_u. Here, a subgraph must be complete before it is recognized as a cluster. One forms new threshold graphs by adding edges until complete subgraphs appear. Thus, complete-link clusters can be characterized in graph-theoretic terms as cliques, or maximally complete subgraphs, just as single-link clusters can be characterized as connected subgraphs. Completeness is a much stronger property than connectedness. For a node to join an existing cluster, it must link to all the nodes in the cluster. Note that not all cliques are complete-link clusters. Peay (1975a) proposes a clustering strategy in which all cliques in each threshold graph are viewed as clusters, even though the cliques overlap. A hierarchy of overlapping clusterings is created as the dissimilarity threshold is increased. Peay (1975b) extended this idea to asymmetric proximity matrices. The large number of cliques that can occur (see Matula, 1977) makes this method somewhat impractical for large sample sizes. As part of this development, Peay (1975a) shows that the complete-link method can lead to a set of clusterings, even though an algorithm will generate only one member of the set. Defays (1977) provides an efficient complete-link algorithm.

Example 3.3. The dissimilarity matrix for Example 3.1 is used once again. From threshold graphs G_1-G_4, we find three clusterings. Graphs G_2 and G_3 do not define new clusterings. Threshold graph G_5, shown in Fig. 5a, links x_5 with the (x_2, x_4) cluster. The conjoint clustering requires all pairs of nodes to be linked under the complete-link method, which occurs only in G_10. The list of clusterings and threshold dendrogram are shown in Fig. 5b.

(e) Some Other Methods. It is clear that the technique used to explain the complete-link and the single-link hierarchical clustering methods can be extended.
FIG. 5. (a) Threshold graph G_5 for Example 3.3 and (b) complete-link clusterings and dendrogram for Example 3.3.
Hubert (1974a) uses the following general formulation, given an ordinal dissimilarity matrix [d(i, j)] containing no ties. Define the function Q_R for all pairs of clusters in the clustering (C_{u1}, C_{u2}, . . . , C_{uW_u}) as

Q_R(C_{ur}, C_{us}) = min [d(i, j) | the maximal subgraph of G_{d(i,j)} defined by C_{ur} ∪ C_{us} is connected and either has property R or is complete].

Then, merge C_{up} and C_{uq} to form the (u + 1)st clustering if

Q_R(C_{up}, C_{uq}) = min [Q_R(C_{ur}, C_{us})],
the minimum being taken for 1 ≤ r ≠ s ≤ W_u. Some examples of property R are listed below; a computational sketch of two of them follows the list.

i. k-node connectivity. A graph is k-node connected if and only if every pair of nodes is joined by at least k node-disjoint paths.
ii. k-edge connectivity. A graph is k-edge connected if and only if every pair of nodes is joined by at least k edge-disjoint paths.
iii. k-minimum degree. A graph has minimum degree k or more if each node is linked to at least k other nodes.
iv. k-diameter. A connected graph has a k-diameter if the maximum distance between any two nodes is k or less. The distance between two nodes in a graph is the length (number of edges) of the minimum-length path from one node to the other.
v. k-radius. A connected graph has a k-radius if there is one node within a distance k of all other nodes.
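The following sketch (Python; the function names and the adjacency-set input format are illustrative assumptions, not from the original presentation) tests k-minimum degree directly and the k-diameter property by breadth-first search on the maximal subgraph defined by a candidate cluster.

    from collections import deque

    def has_min_degree(adj, k):
        """k-minimum degree: each node is linked to at least k others.
        adj maps each node to the set of its neighbors in the subgraph."""
        return all(len(neighbors) >= k for neighbors in adj.values())

    def has_k_diameter(adj, k):
        """k-diameter: the subgraph is connected and every pair of nodes
        is within graph distance k."""
        for source in adj:
            dist = {source: 0}
            queue = deque([source])
            while queue:                       # breadth-first search
                v = queue.popleft()
                for w in adj[v]:
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        queue.append(w)
            if len(dist) < len(adj) or max(dist.values()) > k:
                return False
        return True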
Theorems from graph theory can be used to interrelate these methods. For example, k-node connected implies k-edge connected, but the converse is not true; k-edge connected implies k-minimum degree, but not conversely.

Example 3.4. The following dissimilarity matrix is given in upper-triangular form.

[d(i, j)] = [6 x 6 ordinal dissimilarity matrix; the entries are the integers 1-15]
FIG. 6. Threshold graphs for Example 3.4.
Several threshold graphs are sketched in Fig. 6. Dendrograms for some clustering methods are shown in Fig. 7. Some details are given below. The third clustering in the two-diameter method is

C_{31} = (1, 5, 4);  C_{32} = (2);  C_{33} = (3, 6).

For the two-diameter property R, we have

Q_R(C_{31}, C_{32}) = 6;  Q_R(C_{31}, C_{33}) = 10;  Q_R(C_{32}, C_{33}) = 5.

Thus C_{32} and C_{33} are merged. The same results are obtained under the one-radius method.

Ling (1972, 1973a) studied hierarchical clustering schemes based on another property. Consider a proximity graph G_r and let S be a subset of nodes. The following definitions use only the maximal subgraph of G_r induced on S. This subgraph is r connected if all pairs of nodes in S are connected by paths.
FIG. 7. (a) Threshold dendrogram for single-link and 2-radius methods (Example 3.4), (b) threshold dendrogram for complete-link method (Example 3.4), and (c) threshold dendrogram for 2-diameter and 1-radius methods (Example 3.4).
The subgraph is (k, r) bonded, k a positive integer, if every node has at least k near neighbors; it is (k, r) connected if it is both r connected and (k, r) bonded. Finally, S is a (k, r) cluster if r is the smallest s for which S is (k, s) connected and S is not properly contained in any other (k, r) connected subgraph. Ling (1972, 1973a) shows that the (k, r) clusters form partitions and exhibit trees for picturing the hierarchical structure. Note that (1, r) clusters are single-link clusters. Ling (1972, 1973a) also defines a k cluster as a (k, r) cluster for some r, which is equivalent to a k-minimum degree cluster.

The chromatic number of a graph provides another means for defining a hierarchical clustering method. In general, consider the problem of coloring the nodes in a graph G such that no two adjacent nodes (connected by a single edge) have the same color. The minimum number of colors needed is the chromatic number of G. Baker and Hubert (1976) show that coloring the nodes of the complement of a threshold graph is equivalent to dividing the nodes into k clusters.

Several ad hoc clustering methods have been proposed for specific applications. We do not intend to review all such methods. Lefkovitch (1978) motivated his alternative to hierarchical clustering with a good discussion of the inadequacies of methods based on graph theory. He defines clusters from measures of separation and internal consistency and uses mathematical programming to find clusterings. McQuitty (1967, 1971), McQuitty and Frary (1971), and McQuitty and Koch (1975a,b, 1976) have proposed a series of hierarchical clustering methods for rank-order proximity matrices. These methods define various "types," or groups of objects (patterns) that satisfy certain proximity relations. The basis for this work can be expressed in graph-theoretic terms. McQuitty (1967) defines a "comprehensive type" as a cluster in which every pattern is more like every other pattern in the cluster than like any pattern not in the cluster. In other words, an isolated clique in a threshold graph would establish a comprehensive type. Rank Order Typal Analysis, analogous to complete-link clustering, searches for comprehensive types. A "restricted type" requires that every pattern in the cluster be most like some other pattern in the cluster. Connected subgraphs of a near-neighbor graph are restricted types. A near-neighbor graph links all nodes that are near neighbors and is a subgraph of the minimum spanning tree but is not, in general, a threshold graph. The process of finding restricted types is called Linkage Analysis (analogous to single-link clustering). These clustering procedures share several of the disadvantages of complete-link clustering (many small clusters) and single-link clustering (chaining), so several heuristic procedures have been established to overcome these deficiencies.
For example, McQuitty (1967) proposes Reciprocal Pair Hierarchical Clustering, in which mutual near neighbors are sought; McQuitty and Frary (1971) offer Reliable and Valid Hierarchical Clustering, in which "square," "identity," "elongated," and "spotted" types are sought; McQuitty (1971) defines Iterative Intercolumnar Correlation Analysis, a divisive algorithm that generates sequences of matrices of intercolumn correlations; McQuitty and Koch (1975a,b) put forward Highest Entry Hierarchical Clustering and Highest Column Hierarchical Clustering. The lack of algorithmic details and the large assortment of ad hoc procedures make a comparison of these methods difficult. McQuitty and Koch (1975a) claim that Reciprocal Pair Hierarchical analysis is suited to proximity matrices having up to 1000 rows and columns, and suggest a technique for breaking a large matrix into submatrices, the analyses of which can be reformulated to describe the large matrix. Practical computational matters are not addressed. Koch (1976) notes that the run time (Univac 1100 series) for a 400 x 400 matrix varied from four minutes to three hours, depending on the complexity of the data; up to 500 output pages were produced.

(f) Proximity Dendrograms. The hierarchical clustering methods presented earlier were all based on threshold graphs. Adding edge weights produces proximity graphs. The effect of this on the hierarchical clustering is best seen on the scale of the dendrogram. Rather than having new clusterings for each integer value of u between 0 and n - 1, the "level" of a clustering is the value of v in the proximity graph G_v in which the clustering first appears. For instance, if no new clusterings form in G_4 and G_5, then the scale on the dendrogram is stretched and a gap appears. These gaps can aid in the interpretation of dendrograms, although some questions of uniqueness do occur, as demonstrated below. A dendrogram with level values is called a proximity dendrogram.

Example 3.5. Consider the ordinal dissimilarity matrix in Example 3.4 and the threshold graphs in Fig. 6. The proximity dendrograms are obtained by inserting gaps in the threshold dendrograms of Fig. 7 and are shown in Fig. 8. The single-link dendrogram is not changed, while the complete-link dendrogram is drastically altered. The order in which the clusters form remains the same.

3.3.2 Hierarchical Clustering Phrased in Algorithmic Terms
Some researchers prefer to define hierarchical clustering in terms of the algorithm used to generate the hierarchy. This is somewhat unfortunate because it equates a method, which can be implemented in several ways, with a specific algorithm. However, this approach to defining a clustering method is direct, simple to understand, and is commonly used in much of the literature.
FIG. 8. (a) Proximity dendrogram for single-link method (Example 3.5), (b) proximity dendrogram for complete-link method (Example 3.5), and (c) proximity dendrogram for 2-diameter method (Example 3.5).
The most important ground rule of this section is that the dissimilarity matrix contains interval or ratio data, rather than ordinal data as in Section 3.3.1. Again we will assume that no ties exist. Note that one can treat ordinal data as if it were interval data and follow the procedures of this section. However, the interpretation of some of the results will be questionable.

(a) Single-Link and Complete-Link Algorithms. Suppose we begin with a quantitative dissimilarity matrix D = [d(i, j)]. Johnson (1967) popularized the following algorithm that grows clusters by a merging process. The algorithm described below, called the Johnson Scheme, produces a nested sequence of clusterings and a "value" or "level" for each clustering.
Step 1. Initially assign each pattern to a unique cluster. We refer to pattern x_i as (i). The initial level is L_0 = 0. The initial clustering is

C_0 = ((1), (2), (3), . . . , (n)).

The index h is the number of the clustering; h ← 0.

Step 2. h ← h + 1.
a. Find the smallest entry in D:

d(i, j) = min over (s, t) of [d(s, t)].

b. Merge clusters (i) and (j) into a new cluster to establish clustering C_h with n - h clusters.
c. L_h ← d(i, j).

Step 3. Update D by deleting the row and column corresponding to one of the designated clusters, say (i). The entries in the row and column corresponding to the other designated cluster, now denoted (i, j), are given below. For the single-link method,

d[(k), (i, j)] ← min { d[(k), (i)], d[(k), (j)] },   k ≠ i, k ≠ j,
d[(i, j), (k)] ← d[(k), (i, j)].

For the complete-link method,

d[(k), (i, j)] ← max { d[(k), (i)], d[(k), (j)] },   k ≠ i, k ≠ j,
d[(i, j), (k)] ← d[(k), (i, j)].

Step 4. If h = n - 1, stop. Else go to Step 2.

Some items to note:
i. It is not immediately obvious that these "single-link" and "complete-link" methods are the same as those previously established. However, the Johnson Single-Link Scheme demonstrates one way of constructing those proximity graphs needed to form the single-link dendrogram in Section 3.3.1. Similarly, the Johnson Complete-Link Scheme identifies complete proximity subgraphs in the proper order.
ii. Since all the entries in D are distinct and no new numbers are created, the question of ties never occurs. Jardine and Sibson (1971) show that the single-link method has a "continuity" property, so that ties in the proximity matrix do not affect the dendrogram. However, ties can have a dramatic effect on the complete-link dendrogram. Examples can be created in which the order of resolving ties dictates the structure of the dendrogram.
iii. This algorithm is very tedious to carry out by hand, but relatively
easy to program on a computer. The upper triangle of D can record the current matrix and the lower triangle, the past matrix. Keeping track of the information needed to construct the proximity dendrogram requires some ingenuity.
iv. If the v subscript on the proximity subgraphs G_v in Section 3.3.1 is allowed to assume noninteger values, the procedures of Section 3.3.1 apply directly here.
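In that spirit, here is a minimal Python sketch of the Johnson Scheme (the name johnson and the dictionary bookkeeping are illustrative, not from the original presentation); the only difference between the two methods is whether Step 3 takes the min or the max of the two old entries.

    def johnson(D, method="single"):
        """D: symmetric dissimilarity matrix (no ties).
        Returns a list of (level, clustering) pairs for h = 1, ..., n-1."""
        n = len(D)
        clusters = [(i,) for i in range(n)]            # Step 1: singletons
        dist = {(ci, cj): D[ci[0]][cj[0]]
                for a, ci in enumerate(clusters) for cj in clusters[a + 1:]}
        update = min if method == "single" else max    # Step 3 rule
        levels = []
        while len(clusters) > 1:
            (ci, cj), level = min(dist.items(), key=lambda kv: kv[1])  # Step 2a
            cij = ci + cj                              # Step 2b: merge
            clusters.remove(ci)
            clusters.remove(cj)
            for ck in clusters:                        # Step 3: update D
                dik = dist.pop((min(ci, ck), max(ci, ck)))
                djk = dist.pop((min(cj, ck), max(cj, ck)))
                dist[(min(cij, ck), max(cij, ck))] = update(dik, djk)
            del dist[(ci, cj)]
            clusters.append(cij)
            levels.append((level, list(clusters)))     # Step 2c: L_h = level
        return levels

Run on the matrix of Fig. 9a (indexing the patterns 0-5), johnson(D, "single") should produce mergers at levels 0.2, 0.6, 1.7, 1.8, and 2.5, retracing the sequence of Fig. 9b.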
(a)
        2     3     4     5     6
   1   0.6   0.2   1.7   3.2   2.6
   2         0.7   4.6   1.9   8.3
   3               3.6   9.0   6.4
   4                     1.8   3.4
   5                           2.5

FIG. 9. (a) Dissimilarity matrix for Example 3.6, (b) applying the Johnson Scheme for the single-link method, and (c) applying the Johnson Scheme for the complete-link method.
Example 3.6. The dissimilarity matrix D in Fig. 9a is given. The sequence of dissimilarity matrices and the final dendrogram obtained by following the Johnson Scheme are shown in Figs. 9b and c. These dendrograms could also have been constructed by the graph-theoretical methods of Section 3.3.1. In particular, drawing the proximity graphs G_0.2, G_0.6, G_1.7, G_1.8, and G_2.5 will exhibit connected subgraphs corresponding to new clusters. Similarly, graphs G_0.7, G_3.4, and G_9.0 will show complete subgraphs.

Example 3.7. The data set for this example was derived from the Munson handprinted Fortran character set, available from the Computer Society, Institute of Electrical and Electronics Engineers. The Munson data consist of binary-coded (24 x 24) handwritten characters from several authors, each of whom prepared three alphabets of 46 characters. Our derived data used the three alphabets from the first five authors, selecting the characters "8," "0," and "X" from each alphabet for a total of 45 patterns. Each pattern is expressed as eight features that count the number of squares from the perimeter to the character, as demonstrated for character "I" in Fig. 10. Eigenvector (43.1% of the variance retained) and discriminant plane projections (see Section 2.2) are shown in Figs. 11 and 12. This data set, called the 8 0 X data, can be clustered by pattern and by feature. Dissimilarity between patterns is measured by Euclidean distance in the pattern space. The single- and complete-link dendrograms based on this dissimilarity index are shown in Figs. 13a and b, respectively. The single-link dendrogram exhibits the chaining characteristic of the single-link method. The first 15 patterns are from category 8 (the character 8), the next 15 are from category 0, and the last 15 are from category X.
FIG. 10. Eight-dimensional feature vector (pattern): (11, 11, 5, 6, 10, 10, 5, 5).
FIG. 11. Two-dimensional projection of 8 0 X data onto the first two principal components (Example 3.7).
Cutting the complete-link dendrogram at level 14, which appears to be a reasonable cutting point and produces a small number of clusters, defines the cluster-by-category table in Table I. Category X is separated from the combination of categories 8 and 0 in Cluster 2. This agrees with the eigenvector projection in Fig. 11, which shows category X separate from categories 8 and 0. The complete- and single-link methods were also applied to cluster the eight features, using the absolute correlation coefficient as a similarity measure. The dendrograms are shown in Fig. 14. There is little agreement between the clusterings exhibited by the dendrograms, so these features are not hierarchically related.

(b) Other Agglomerative Methods. An advantage to defining hierarchical clustering methods algorithmically becomes apparent when the quantitative nature of the data is employed. The single- and complete-link methods depend only on the order of the data, since a transformation which changes the proximities without changing their order does not alter the clusters they generate; only the levels in the dendrogram change.
FIG. 12. Two-dimensional projection of 8 0 X data using discriminant analysis (Example 3.7).
Several hierarchical clustering methods will now be defined which do not have this property (see Williams and Lance, 1977). We again begin with a quantitative dissimilarity matrix D = [d(i, j)] and follow the Johnson Scheme. At stage h, suppose we have clustering C_h. Let n_k be the number of patterns in cluster k. Parameters [α_k], β, and γ are chosen to define a clustering method by altering Step 3 of the Johnson Scheme as follows:

d[(k), (i, j)] ← α_i d[(k), (i)] + α_j d[(k), (j)] + β d[(i), (j)] + γ |d[(k), (i)] - d[(k), (j)]|.
FIG. 13. (a) Single-link dendrogram for the 45 patterns of the 8 0 X data set (Example 3.7) and (b) complete-link dendrogram for the 45 patterns of the 8 0 X data set (Example 3.7).
If α_k = 1/2 for all k and β = 0, the single-link method is obtained when γ = -1/2 and the complete-link method is reproduced when γ = +1/2. These parameters cannot be chosen arbitrarily if crossovers are to be avoided. A crossover occurs when two clusters are merged at a lower level (smaller dissimilarity) than a previous merger. Williams and Lance (1977) show that crossovers are avoided if γ ≠ 0; when γ = 0, crossovers cannot occur if α_i + α_j + β ≥ 1. Two popular hierarchical clustering methods are defined below in terms of these parameters.
i. Group Average Method [UPGMA (Unweighted Pair Groups Using Metric Averages)]:

α_i = n_i/(n_i + n_j),  α_j = n_j/(n_i + n_j),  β = γ = 0.

Thus, the dissimilarity between cluster (k) and the newly formed cluster is a weighted average that depends on the numbers of patterns in the merged clusters (i) and (j) as well as on the number of patterns in (k).

ii. Ward's Method:

α_i = (n_i + n_k)/(n_i + n_j + n_k),  α_j = (n_j + n_k)/(n_i + n_j + n_k),
β = -n_k/(n_i + n_j + n_k),  γ = 0.
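All four methods can therefore share one implementation of the Step 3 update. The sketch below (Python; the function name and signature are illustrative assumptions) evaluates the general recurrence with the parameter choices given above.

    def lance_williams(d_ki, d_kj, d_ij, n_i, n_j, n_k, method):
        """Dissimilarity between cluster (k) and the merger of (i) and (j)."""
        if method == "single":
            ai = aj = 0.5; beta = 0.0; gamma = -0.5
        elif method == "complete":
            ai = aj = 0.5; beta = 0.0; gamma = 0.5
        elif method == "group_average":                # UPGMA
            ai = n_i / (n_i + n_j); aj = n_j / (n_i + n_j)
            beta = gamma = 0.0
        elif method == "ward":
            t = n_i + n_j + n_k
            ai = (n_i + n_k) / t; aj = (n_j + n_k) / t
            beta = -n_k / t; gamma = 0.0
        return ai * d_ki + aj * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)

For example, with the matrix of Fig. 9a and singleton clusters, lance_williams(1.7, 3.6, 0.2, 1, 1, 1, "ward") should give approximately 3.47 for the dissimilarity between pattern 4 and the merged cluster (1, 3), the first update in Ward's sequence.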
TABLE I
CLUSTERS BY CATEGORY USING COMPLETE-LINK CLUSTERING (EXAMPLE 3.7)
Ward (1963) derived this method to minimize an error sum of squares objective function. See Anderberg (1973) for details. Proximity dendrograms for the Group Average Method and Ward's Method corresponding to the dissimilarity matrix in Fig. 9(a) are shown in Figs. 15(f) and 15(g); the sequences of dissimilarity matrices are given in Figs. 15(b) and 15(c).

(c) The Minimum Spanning Tree and the Single-Link Method. The most efficient algorithm for computing a single-link dendrogram is a result of the correspondence between the single-link method and a minimum spanning tree (MST) on the complete proximity graph. A spanning tree for a (connected) graph G is a subgraph of G touching all n nodes and containing no loops. A spanning tree contains exactly (n - 1) edges. An MST is a spanning tree, the sum of whose edge weights is minimum among all spanning trees. Two key properties of MSTs are listed.
FIG. 14. (a) Single-link dendrogram for the 8 features of the 8 0 X data set (Example 3.7) and (b) complete-link dendrogram for the 8 features of the 8 0 X data set (Example 3.7).
EXPLORATORY DATA ANALYSIS 2 0.6
3 0.2
4 5
4 1.7
5 3.2 1.9 9.0
6 2.6 8.3 6.4
1.8
3.4 2.5
1[
(a) 4 4.6
5 1.9
6 8.3
2.65 6.1 1.8
4.5 3.4 2.5
5
6
1
151
(1.3)
4
5
6
0.8
4.6
1.9
8.3
’
3.47 8.07 5.93 1.8 3.4 2.5
5
4 5 (2.1,3) 4.72 6.8 4 1 1.8
r
6 8.41 3.4
1 I
2.5
[
(2;:(,
(4,5) 2.13
6 8.411 3.33 (4,5,6)
(2,1,3)
[I10.711 (C)
0
-
1
-
2
-
3
-
4
-
5
-
6
-
7
-
8
-
9
-
1
L
1
(e)
FIG 15. (a) Dissimilarity matrix, (b) sequence of matrices for group-average method, (c) sequence of matrices for Ward’s method, (d) single-link dendrogram for (a), (e) completelink dendrogram for (a), (0 group-average dendrogram for (b), and ( g ) Ward’s dendrogram for (c). ~~
FIG. 16. Minimum spanning tree for the dissimilarity matrix in Fig. 9(a).
i. Every node is linked to its nearest neighbor in an MST.
ii. Let (S, S') be a partition of the nodes of G. The minimum-weight edge linking a node in S to a node in S' is in the MST.
Two new algorithms for implementing the single-link method are given below. A dissimilarity matrix [d(i, j)] is assumed given. Both algorithms employ the MST for the complete proximity graph, the proximity graph containing all possible edges. Merging the pair of nodes defining the smallest edge in the MST into a cluster is the first step in an agglomerative algorithm, which continues by collapsing one edge at a time. This scheme works because of the following characteristic of the single-link method. If (C_{u1}, . . . , C_{uW_u}) is a single-link clustering, and the next single-link clustering is formed by merging clusters C_{up} and C_{uq}, then

min [d(i, j) | i ∈ C_{up}, j ∈ C_{uq}] = min { min [d(k, l) | k ∈ C_{ur}, l ∈ C_{us}] | r ≠ s }.
For example, the MST for the dissimilarity matrix in Fig. 9a is found by inspection and is shown in Fig. 16. Collapsing the edges one at a time starting with the smallest produces the single-link clusters in Fig. 17. Another way of employing the MST to construct a single-link hierarchy is to begin with the conjoint clustering and break it into two clusters by cutting the longest edge in the MST. Repeating on each cluster and continuing until the disjoint clustering is reached generates the single-link hierarchy. The edge with the largest weight among all edges in the entire (disconnected) graph is cut at each step. Applying this procedure to the MST in Fig. 16 produces the sequence of graphs in Fig. 18. Zahn (1971) has proposed several other methods based on the MST (Section 3.4.4). Gower and Ross (1969) are credited with formally noting the connection between the MST and single-link clustering. Rohlf (1973) has published an efficient algorithm that looks at each dissimilarity value only once to compute a single-link hierarchy. The problem of computing an MST has been thoroughly researched (Bentley and Friedman, 1978); most algorithms are versions of Prim's (1957) algorithm. Hall et al. (1973) and Ross (1968) have suggested schemes for computing approximate MSTs for use with large numbers of patterns.
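A minimal sketch of both MST-based routes follows (Python; prim_mst and divisive_single_link are illustrative names, not from the original): Prim's algorithm grows the tree one nearest edge at a time, and the divisive routine generates the single-link hierarchy by repeatedly discarding the largest remaining MST edge.

    def prim_mst(D):
        """D: n x n dissimilarity matrix. Returns the n-1 MST edges (w, i, j)."""
        n = len(D)
        in_tree = {0}
        edges = []
        while len(in_tree) < n:
            # cheapest edge from the tree to a node not yet in the tree
            w, i, j = min((D[i][j], i, j)
                          for i in in_tree for j in range(n) if j not in in_tree)
            in_tree.add(j)
            edges.append((w, i, j))
        return edges

    def divisive_single_link(D):
        """Cut the largest remaining MST edge at each step, yielding the
        single-link hierarchy from conjoint to disjoint clustering."""
        n = len(D)
        mst = sorted(prim_mst(D))                  # ascending edge weights
        hierarchy = []
        for m in range(n - 1, -1, -1):             # keep the m smallest edges
            parent = list(range(n))

            def find(i):
                while parent[i] != i:
                    parent[i] = parent[parent[i]]
                    i = parent[i]
                return i

            for w, i, j in mst[:m]:
                parent[find(i)] = find(j)
            groups = {}
            for k in range(n):
                groups.setdefault(find(k), []).append(k + 1)
            hierarchy.append(sorted(groups.values()))
        return hierarchy

Applied to the matrix of Fig. 9a, divisive_single_link should reproduce the sequence of clusterings sketched in Fig. 18.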
FIG. 17. Single-link hierarchy by an agglomerative algorithm based on the MST in Fig. 16.
It is interesting to ponder the degree of equivalence between the MST and the single-link dendrogram. We know that the dendrogram can be constructed from the MST, but the reverse is not true. To see this, consider the single-link dendrogram in Fig. 15d. We see that cluster (2) merges with cluster (1, 3), but we do not know whether edge (1, 2) or edge (2, 3) is in the MST. Another way of looking at this is to realize that the cophenetic matrix (Section 3.3.3), computed from the single-link dendrogram, is not the same as the matrix of path distances computed from the MST, even though all the MST edge weights occur in the cophenetic matrix.

The MST can provide a useful picture of the data. Figure 19 is the MST of the 8 0 X data defined in Example 3.7. Note that the edge lengths in Fig. 19 are proportional to dissimilarity, but the only interpattern distances that have meaning are between pairs of points joined by a single edge. The fan out of the points is arbitrary. Removing edge (8, 43) separates all 15 patterns in the X category from the other two categories.
FIG. 18. Single-link hierarchy by divisive algorithm based on MST in Fig. 16.
Only pattern 2, from the 8 category, is part of the resulting connected subgraph. Removing edge (6, 21) separates the remaining 14 patterns in the 8 category from the 15 patterns in the 0 category. Thus, the MST exhibits good separation by category, as might be expected from the eigenvector projection in Fig. 11. Why does the single-link dendrogram in Fig. 13a, which can be determined from the MST, not show better separation by category? The reason is that several MST links are longer than the two links (8, 43) and (6, 21) needed to break the MST according to category. Zahn (1971) has suggested several clustering methods which heuristically delete edges from an MST (Section 3.4.4). [See also D'Andrade (1978) and Slagle et al. (1974).]
FIG. 19. MST of the 80X data set.
3.3.3 A Mathematical Characterization of Hierarchical Clustering
Graph theory provides one way of characterizing a class of hierarchical clustering methods. Jardine and Sibson (1971) proposed another such characterization that illustrates the role of the ultrametric inequality in hierarchical clustering and leads to an important goodness measure for a hierarchical clustering. Let X = (x_1, . . . , x_n) be the set of n patterns to be clustered. A particular clustering induces an equivalence relation, r, on X. An equivalence relation on X is a subset of X x X (Cartesian product) that is reflexive, symmetric, and transitive. Specifically,

(x_i, x_j) ∈ r   if x_i and x_j are in the same cluster.
The clusters from any particular clustering are then the equivalence classes for an equivalence relation r. The equivalence relation will henceforth be implied by clustering. Clustering C_u = (C_{ui}) is nested into clustering C_v = (C_{vj}) if, for each i, there is exactly one j such that

C_{ui} ⊆ C_{vj}.

This nesting is denoted C_u ⊆ C_v. If [C_0, C_1, . . . , C_{n-1}] is the sequence of clusterings generated by one of the methods in Section 3.3.1, we have

C_0 ⊆ C_1 ⊆ . . . ⊆ C_{n-1}.
Let r_u and r_v be the equivalence relations on X induced by clusterings C_u and C_v, respectively. We denote the nesting of C_u into C_v in terms of the equivalence relations as r_u ⊆ r_v.
A dendrogram can be defined in this framework as follows. Let E(X) be the set of all equivalence relations on X. Then, a dendrogram is a function c,

c : R+ → E(X),

mapping R+, the set of nonnegative real numbers (including zero), into E(X), having the following properties.
Ia Ib,
then c(a) C c(b).
Note that since c(a) is an equivalence relation on X,the clustering corresponding to c(a) is nested into that for c(b). ii. For large enough a E R+, c(a) is the conjoint clustering. iii. For any a E R+, there exists 6 > 0 such that c(a + 6) = c(a). iv. c(0) is the disjoint clustering. Example 3.8. Suppose we are given the nested sequence of clusterings and level values shown below, resulting from some hierarchical clustering method applied to X = (1, 2, 3, 4) level
level        0      0.6     1.0     2.5
clustering   C_0    C_1     C_2     C_3

The dendrogram is established by defining the mapping c, which is equivalent to the normal tree picture of a dendrogram, e.g., Fig. 13:

c(a) = C_0   if 0 ≤ a < 0.6,
c(a) = C_1   if 0.6 ≤ a < 1.0,
c(a) = C_2   if 1.0 ≤ a < 2.5,
c(a) = C_3   if 2.5 ≤ a.
Since equivalence relations such as c(a) are clumsy to write down, the equivalence classes are shown instead. Jardine and Sibson (1971) call the dissimilarity index a "dissimilarity coefficient," or DC. In mathematical terms, a DC can be expressed as a mapping, viz:

d : X x X → R+.
This mathematical formalism allows us to picture a hierarchical clustering method as a mapping from the set of all DCs to the set of all dendrograms. We now study the character of this mapping and try to match the DC to the dendrogram. Consider a mapping U from the set of all dendrograms (on X) to the set of all DCs (on X). That is, U maps a particular dendrogram c into a particular dissimilarity matrix. The notation for this is rather clumsy. We must begin with a dendrogram and end with the assignment of a number to each pair of patterns in X. Jardine and Sibson (1971) express this mapping U as follows.
(Uc)(x_i, x_j) = inf [a | (x_i, x_j) ∈ c(a)].
In other words, determine where x_i and x_j first occur in the same cluster and assign that level as the dissimilarity between x_i and x_j. The dissimilarity matrix formed in this way is called the cophenetic matrix. For example, a dendrogram is pictured in Fig. 20 along with the corresponding cophenetic matrix. The interesting thing about the cophenetic matrix is that both the single-link and the complete-link methods will reproduce the dendrogram if applied to the cophenetic matrix. There are several ties, but the ties are always arranged so that the maxima and minima required in the Johnson Scheme are the same. That is, the dendrogram is exactly "equivalent" to the dissimilarity matrix, so we can say that a hierarchical structure exactly reproduces a cophenetic matrix. In hierarchical clustering studies we are, of course, faced with the inverse question. When does a dendrogram contain the same information as a given dissimilarity matrix? That is, when can the given dissimilarity matrix be reconstructed from the derived dendrogram? What is it about the cophenetic matrix that produces the one-to-one relationship between dendrogram and DC? Answering these questions will lead to a constraint on the dissimilarity index called the ultrametric inequality.

Consider a mapping T from the set of all DCs (on X) to the set of all dendrograms (on X); T is really some hierarchical clustering method. In Jardine and Sibson's (1971) notation, the image of the DC, denoted d, is written as the dendrogram Td, so (Td)(a), a ∈ R+, is an equivalence relation, or the corresponding partition.
FIG. 20. Dendrogram and the corresponding cophenetic matrix.
A particular mapping T of interest is
(Td)(a) = [(i, j) | d(i, j) ≤ a].
That is, patterns x_i and x_j are "related" under (Td)(a) if their dissimilarity is a or less. Under what conditions is (Td)(a) an equivalence relation? Clearly (Td)(a) is a subset of X x X, so it is a relation. We must check the three conditions required of an equivalence relation.

i. Reflexive: d(i, i) = 0 for all x_i ∈ X, so x_i is related to itself for all a ∈ R+.
ii. Symmetric: d(i, j) = d(j, i), so if (i, j) ∈ (Td)(a), then (j, i) ∈ (Td)(a).
iii. Transitive: If (i, k) ∈ (Td)(a) and (k, j) ∈ (Td)(a), does it follow that (i, j) ∈ (Td)(a)? The following counterexample answers this negatively. Let X = (1, 2, 3) and consider the following DC.
      2   3
 1    4   2
 2        3
Then (1, 3) ∈ (Td)(3.5) and (3, 2) ∈ (Td)(3.5), but (1, 2) ∉ (Td)(3.5). For (Td)(a) to be an equivalence relation for any a, we must have the following condition: If d(i, k) ≤ a so that (i, k) ∈ (Td)(a), and if d(k, j) ≤ a so that (k, j) ∈ (Td)(a), then we must have d(i, j) ≤ a to force (i, j) to belong to (Td)(a). This condition must be satisfied for all triples (i, j, k) and is called the ultrametric inequality. Another way of expressing the ultrametric inequality is

d(i, j) ≤ max [d(i, k), d(k, j)]   for all i, j, k.
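The inequality is easy to test by brute force; the following sketch (Python; the function name is an illustrative assumption) scans all triples.

    def is_ultrametric(D):
        """Check d(i, j) <= max(d(i, k), d(k, j)) for all triples (i, j, k)."""
        n = len(D)
        return all(D[i][j] <= max(D[i][k], D[k][j])
                   for i in range(n) for j in range(n) for k in range(n))

A cophenetic matrix always passes this test; the three-pattern counterexample above fails it at the triple i = 1, j = 2, k = 3.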
If (Td)(a) is an equivalence relation for all a, the definition of T assures nesting, so (Td) is a dendrogram. It is easy to see that the entries of the cophenetic matrix satisfy the ultrametric inequality and, in fact, that a one-to-one relation exists between the set of ultrametric DCs and the set of dendrograms. In other words, if d is restricted to the set of ultrametric DCs, T and U are inverse relations. Johnson (1967) first explained the key role of the ultrametric inequality in hierarchical clustering. Some other mathematical constraints of this sort have been imposed to study data structure. For example, see Dobson (1974). We can now state a new characterization of a hierarchical clustering method. Corresponding to every dendrogram c is the (ultrametric) dissimilarity matrix (Uc). A hierarchical clustering method maps a dissimilarity matrix into a dendrogram. Thus, we can picture a hierarchical clustering method as a mapping from the set of all DCs to the set of ultrametric
DCs. This notion leads to a means for evaluating a particular hierarchical clustering. The dendrogram that "best" fits a given DC can be viewed as the one whose corresponding ultrametric DC is "closest" to the given DC. How is closeness measured? One popular way is the cophenetic correlation coefficient (CPCC), which is the ordinary product-moment correlation coefficient between the two dissimilarity matrices. If the CPCC is to be a measure of closeness, why not devise a hierarchical clustering method to maximize it? Farris (1969) has tried this and shown that crossovers often occur. The CPCC is also a means of evaluating the global fit of a hierarchy, as discussed in Section 3.5.1. Specifically, let [d_ij] be the given dissimilarity matrix and let [u_ij] be the ultrametric DC obtained by applying some hierarchical clustering method to [d_ij] and applying U to the resulting dendrogram. Then,

CPCC = [(1/N) Σ d_ij u_ij - d̄ū] / {[(1/N) Σ d_ij² - d̄²][(1/N) Σ u_ij² - ū²]}^(1/2),

where d̄ = (1/N) Σ d_ij, ū = (1/N) Σ u_ij, N = n(n - 1)/2, and all sums are for i < j. Values of 0.8 have been observed in practical applications, although Rohlf (1970) warns that values of 0.9 may cause trouble.

Example 3.9. Three CPCCs will be computed, corresponding to three hierarchical clustering methods applied to the following dissimilarity matrix.
      2     3     4
 1   3.2   1.0   2.6
 2         0.5   2.0
 3               3.0
The single-link, complete-link, and group-average dendrograms and corresponding cophenetic matrices are shown in Fig. 21. The three cophenetic correlation coefficients are almost equal although the dendrograms, especially the single-link and complete-link, are very different. Note that three entries in the dissimilarity matrix match three entries in the single-link and complete-link cophenetic matrices, but not the same three. In general, the group-average method produces high values of the CPCC.

FIG. 21. (a) Single-link dendrogram and cophenetic matrix (Example 3.9), (b) complete-link dendrogram and cophenetic matrix (Example 3.9), and (c) group-average dendrogram and cophenetic matrix (Example 3.9).
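A direct computation of the CPCC from the upper-triangle entries is sketched below (Python; cpcc is an illustrative name); d and u hold the N = n(n - 1)/2 entries of the given dissimilarity matrix and of the cophenetic matrix in the same pair order.

    from math import sqrt

    def cpcc(d, u):
        """Product-moment correlation between two dissimilarity matrices."""
        N = len(d)
        d_bar = sum(d) / N
        u_bar = sum(u) / N
        cov = sum(x * y for x, y in zip(d, u)) / N - d_bar * u_bar
        var_d = sum(x * x for x in d) / N - d_bar ** 2
        var_u = sum(y * y for y in u) / N - u_bar ** 2
        return cov / sqrt(var_d * var_u)

For Example 3.9, d would be the six entries of the matrix above and u the corresponding entries of one of the cophenetic matrices in Fig. 21.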
3.3.4 Using Hierarchical Clustering

The most distinct advantage of hierarchical clustering methods over partitional methods is that they represent the structure of the data in the form of a tree or a dendrogram that can be examined visually. Even though identifying clusters requires choosing an appropriate proximity level for cutting the dendrogram, the visual picture in the form of a dendrogram
offers more information about the data than a single clustering. A hierarchical clustering method generates a sequence of nested clusterings and leads to a more complete interpretation of the data than if only a single partition were available. This is one of the main reasons for the popularity of hierarchical clustering methods in biological sciences where, quite often, the data represent a taxonomy, and hence are expected to exhibit a natural hierarchical relationship. Aldenderfer (1977) warns that, even though hierarchical clustering methods have been extensively applied to data in nonbiological sciences, the resulting dendrograms may not be the optimal way to represent such data. The adequacy of a hierarchical structure and the significance of a particular partition for a given proximity matrix are treated in Section 3.5. Suppose the data are available in the form of an n x d pattern matrix, as is the case in most engineering applications. Hierarchical clustering methods can be used to cluster either the n patterns or the d features (Jain and Dubes, 1978). There is obviously a loss of information in converting a pattern matrix to a proximity matrix, which explains why hierarchical
methods are not as popular in engineering applications as in other application areas. The most serious drawback of hierarchical clustering algorithms is their computational and memory requirements, which limit their application to situations involving small numbers of patterns. In fact, visually inspecting a dendrogram becomes unmanageable or awkward when the number of patterns exceeds (say) 200. Anderberg (1973) provides a careful analysis of the computational aspects of hierarchical clustering algorithms and lists three approaches to implementation: the stored matrix approach, the stored data approach, and the sorted matrix approach. In the stored matrix approach the entire proximity matrix of n(n - 1)/2 numbers is stored in random access memory, so no more than 200 patterns can be clustered in a typical computer. The amount of computation is proportional to 2n². Since the number of features is usually smaller than the number of patterns, the stored data approach saves memory space by storing the pattern matrix instead of the proximity matrix. The proximity values between patterns are computed from the stored pattern matrix when they are needed. The stored data approach requires more computation than the stored matrix approach, although the number of computations still increases as n². The sorted matrix approach does not store the pattern matrix or the proximity matrix in central memory. Instead, the proximity matrix is sorted (using a sort/merge package) and stored on an auxiliary storage device (magnetic disk or tape). Patterns are clustered by successively reading the sorted proximities. This approach is well suited to hierarchical methods based on graph theory (Section 3.3.1).

A number of commonly available statistical packages contain hierarchical clustering programs. Some well-known packages devoted to clustering are BCTRY, BMDP, CLUSTAN, NT-SYS, and OSIRIS. Anderberg (1973) and Hartigan (1975) also provide listings of several hierarchical clustering algorithms in their books. CLUSTAN (Wishart, 1978) is probably the most versatile of these packages. Blashfield (1977c) gives a critique of various packages and points out that none have been used extensively because of their idiosyncrasies and nonportability. None of these packages is complete in the sense of containing all hierarchical clustering algorithms.

3.4 Partitional Clustering
The goal of a partitional clustering method is to divide a given set of n patterns into exactly C disjoint subsets or clusters, where C ≪ n. The ideal is to put all those patterns that have a "natural" relation to one another in the same cluster, and unrelated patterns in different clusters.
While a hierarchical clustering gives a sequence of nested partitions, a partitional clustering generates a single partition. Partitional clustering is more popular than hierarchical clustering in engineering applications because the data are generally expected to be available in the form of a pattern matrix. An optimal solution to partitional clustering is intuitively simple. It involves selecting a clustering criterion for evaluating the "goodness" of any particular clustering or partition, evaluating this criterion for all possible partitions, and selecting that partition which optimizes the criterion. It can be shown (Duran and Odell, 1974) that the number of partitions of n patterns into C clusters, S(n, C), is a Stirling number of the second kind, viz:

S(n, C) = (1/C!) Σ_{i=0}^{C} (-1)^{C-i} (C choose i) i^n.
A complete enumeration of all the partitions is not feasible even for moderate values of n and C. For example, for n = 19 and C = 3, S(n, C) = 1.93 x 10^8. Some attempts have been made to use the techniques of mathematical programming, in particular, the dynamic programming approach, to cut down the search time for an optimal partition by eliminating redundant partitions. For details, see Jensen (1969), Rao (1971), Vinod (1969), Koontz et al. (1975), and Edwards and Cavalli-Sforza (1965). Unfortunately, these approaches do not result in computationally efficient partitional clustering algorithms, and one must resort to heuristics and be content with suboptimal partitions.

What criterion will lead to a "good" partition? Translating one's intuitive notion about what makes a good clustering into a mathematical formula is not easy. No single mathematical criterion can be expected to obtain good partitions for all data and for all structures. Some two-dimensional examples are shown in Fig. 22. Zahn (1971) and Casey and Nagy (1971) illustrate other types of two-dimensional structures. Geometrical configurations are much more varied in high-dimensional spaces than in two dimensions, but visual descriptions are impossible in more than three dimensions. The clustering structures in Fig. 22 merely hint at the difficulty inherent in defining the term "cluster." In any event, a criterion for evaluating a clustering must be simple for computational reasons but complex enough to encompass a variety of clustering structures. Just as people do not always agree on or "see" the same clustering in two dimensions, different clustering criteria will not always identify the same clusterings in many dimensions. Partitional clustering methods are particularly valuable because they operate in many dimensions, where clusters cannot be visualized. The square-error criterion discussed below is the most popular criterion in the literature.
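The count quoted above can be checked directly from the formula; a short sketch follows (Python; stirling2 is an illustrative name).

    from math import comb, factorial

    def stirling2(n, C):
        """S(n, C): number of partitions of n patterns into C nonempty clusters."""
        total = sum((-1) ** (C - i) * comb(C, i) * i ** n for i in range(C + 1))
        return total // factorial(C)

    print(stirling2(19, 3))   # 193448101, about 1.93 x 10^8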
FIG. 22. (a) Two well-separated, ball-shaped clusters, (b) two cigar-shaped clusters, (c) two annular clusters, and (d) two straggly clusters.
3.4.1 Square-Error Clustering
Clustering methods that minimize a square-error criterion are more widely used than methods based on other criteria. This criterion defines clusters that are hyperellipsoidal in shape. Square-error clustering methods are sometimes characterized as nonparametric methods for fitting mixtures of Gaussian distributions to the data. Let the ith pattern, i = 1, . . . , n, from the data set under study be written as

x_i^T = (x_i1, x_i2, . . . , x_id),

where d denotes the number of features. Let m_k be the sample mean
vector of the kth cluster in some partition of the data. The square error for cluster k is

e_k² = Σ_{i ∈ I_k} (x_i - m_k)^T (x_i - m_k),

where I_k is the set of integers corresponding to the subscripts of the patterns in cluster k. The square error for a clustering containing C clusters is

E_C² = Σ_{k=1}^{C} e_k².
The objectives of a square-error clustering method are to define, for a given C, a clustering that minimizes E_C² and to find a suitable C, much smaller than n. An exhaustive search being computationally infeasible, as already stated, the various square-error methods adopt different tactics for searching through the possible clusterings. All methods essentially obtain a local minimum of E_C² with the hope that the local minimum coincides with the global minimum. One usually starts with some initial partition that assigns a cluster label to each pattern, and changes the cluster labels so as to minimize square error. Gordon and Henderson (1977) have expressed the problem of minimizing E_C² as an unconstrained minimization problem, the variables being the cluster labels placed on the patterns. They propose a steepest descent algorithm that has been included in the CLUSTAN clustering package (Wishart, 1978).

(a) Details of Square-Error Clustering. While the basic idea behind a partitional clustering method that minimizes square error is simple, the various algorithms that have been reported in the literature differ substantially according to details of implementation; Blashfield (1977b), Anderberg (1973), and MacQueen (1967) provide excellent coverage of this topic. We begin with a set of n patterns in d dimensions. Figure 23 is a general purpose flowchart for square-error clustering. Details on the various operations are given below.

i. Starting the algorithm: Most algorithms require an initial clustering. For example, C cluster centers or seed points can be specified by the user or picked at random, and all the patterns can be assigned to the closest cluster center. Alternatively, the n patterns can be randomly assigned to the C clusters.
ii. Choosing the number of clusters: The desired number of clusters is usually specified a priori. Several approaches to this problem are discussed in Section 3.5.
iii. Type of pass: A "pass" involves examining each pattern in order to
FIG. 23. Simplified flowchart for the FORGY/ISODATA algorithm.
obtain a better clustering or a reduction in the square error. In a k-means pass each pattern is assigned to the closest cluster center and then the cluster centers are recomputed. The updating of cluster centers can be done either immediately after a pattern changes cluster membership (a combinational pass) or after all the n patterns have been assigned labels (a noncombinational pass; a sketch of such a pass appears after this list). A hill-climbing pass changes the cluster membership of a pattern only if it results in an improvement in the value of a criterion function. A forcing pass forces a pattern into a new cluster to initiate another partitioning sequence. A partitional clustering algorithm usually consists of a sequence of these passes.
iv. Distance metric: The Euclidean metric is normally used to compute the distance between the patterns and the cluster centers. However, Minkowski metrics or the Mahalanobis distance measure can also be used. Hall et al. (1973) point out problems in using the Euclidean metric for partitional clustering.
v. Identifying outliers: The data may contain some noisy points that do not conform with the rest of the patterns and can seriously distort the shape of clusters. Some algorithms call a pattern an outlier if it is sufficiently far removed from all the cluster centers. These patterns are either forced into clusters at the end of an iteration or reported as outliers.
vi. Merging and splitting heuristics: Several algorithms employ heuristics to change the number of clusters to achieve a better clustering. If a cluster gets too large (contains too many patterns) it is split; if two cluster centers are sufficiently close then they are merged. Small clusters can also be treated as outliers and deleted.
vii. Optimization criterion: The square error E_C² can be related to statistically motivated criteria based on the concepts of Discriminant Analysis and Multivariate Analysis of Variance. The computational burden imposed by trying to optimize these criteria (Friedman and Rubin, 1967) has made heuristic algorithms based on minimizing square error attractive. The within-cluster scatter matrix (S_W), between-cluster scatter matrix (S_B), and the total scatter matrix (S) are used to define the following criteria: (1) minimize tr(S_W), where tr denotes the trace operator, (2) minimize the determinant |S_W|, (3) maximize tr(S_W^{-1} S_B), (4) minimize |S_W|/|S|, the scatter ratio, and (5) minimize tr(S^{-1} S_W). Criterion (1) is equivalent to minimizing E_C² and criterion (2) is similar to minimizing E_C². Criteria (3)-(5) are invariant to linear transformations of the pattern matrix, which is not true for partitional algorithms based on minimizing square error when Euclidean distance is used. It can be shown that criterion (3) is equivalent to minimizing the square error under the Mahalanobis metric. Criterion (4) is based on the well-known Wilks' lambda statistic and is used in Discriminant Analysis (Section 2.2). It is important to note that all criteria generate clusters which are essentially hyperellipsoidal in shape, whether one uses the simple square-error criterion or any one of the above-mentioned statistical criteria.
viii. Ending the algorithm: When the cluster labels for all the patterns do not change between two successive iterations, the algorithm terminates. Alternatively, the stopping criterion can be stated in terms of the percent reduction in square error between two successive iterations. However, a maximum number of iterations is usually specified to avoid oscillations.
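For concreteness, a minimal sketch of one noncombinational k-means pass, together with the square error E_C² it tries to reduce, is given below (Python; the names are illustrative and Euclidean distance is assumed, as in item iv).

    def kmeans_pass(patterns, centers):
        """One noncombinational pass: label every pattern with its closest
        center, then recompute each center as the mean of its patterns."""
        labels = []
        for x in patterns:
            dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
            labels.append(dists.index(min(dists)))
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, lab in zip(patterns, labels) if lab == k]
            if members:                       # empty clusters keep the old center
                d = len(members[0])
                new_centers.append([sum(m[j] for m in members) / len(members)
                                    for j in range(d)])
            else:
                new_centers.append(centers[k])
        return labels, new_centers

    def square_error(patterns, centers, labels):
        """E_C^2: summed squared distances of patterns to their cluster centers."""
        return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, centers[lab]))
                   for x, lab in zip(patterns, labels))

Alternating kmeans_pass until the labels stop changing yields a local minimum of E_C², as described above.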
Most of the above heuristics are somewhat interchangeable among various square-error clustering programs. However, these heuristic procedures often spell success or failure for a clustering program, irrespective of the criterion chosen in (vii), and are crucial elements in the clustering program. Realize that the square error tends to decrease as the number of clusters increases, so one can mathematically minimize E_C² only for fixed C. The heuristics in (vi) allow the program to select a "natural" or "suitable" number of clusters, subject to user-imposed restrictions. Section 3.5 discusses some tests for determining an appropriate value for C.

(b) Square-Error Algorithms. The most popular of all the square-error partitional clustering algorithms available (Wishart, 1969; Anderberg, 1973; Blashfield, 1977b) use the k-means pass, defined previously. Anderberg (1973) calls this nearest centroid sorting. We will now describe two such algorithms, FORGY (Forgy, 1965) and ISODATA (Ball, 1965). A simplified flowchart of FORGY and ISODATA is given in Fig. 23. FORGY is the simplest of the square-error programs and consists of the following steps: (1) Initialize cluster centers. (2) Assign cluster labels to all patterns by finding the closest cluster center for each pattern. (3) Compute new cluster centers. (4) Alternate steps 2 and 3 until either the cluster labels do not change in step 2 or the number of iterations exceeds a user-supplied limit, L_F. FORGY was constructed to be as direct as possible. The initialization procedure follows this philosophy and fixes the initial cluster centers by selecting K_F patterns at random, where K_F is supplied by the user. Although, in principle, FORGY generates a fixed number, K_F, of clusters, several of its implementations (Dubes and Jain, 1976) allow the resulting number of clusters to be different from K_F by adding the outside loop in Fig. 23. For example, after convergence is obtained for a given K_F, a new cluster is created when a pattern is found that is sufficiently far removed from the existing structure. Let d_k(i) be the distance between pattern x_i and cluster center k. Let d̄(i) be the average distance from pattern x_i to all K_F cluster centers:
ā(i) = (1/K_F) Σ_{k=1}^{K_F} d_k(i).
A new cluster is created, centered at pattern x_i, if d_{k_1}(i) is sufficiently large relative to ā(i), where k_1 is the cluster center closest to pattern x_i and T_F is a user-supplied threshold between zero and one that governs the comparison. The larger T_F, the more new clusters will be created. If new clusters are created, the algorithm returns to Step 2; otherwise it stops. A sketch of the basic pass is given below.
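The basic FORGY pass, steps (1)-(4), is the familiar nearest-centroid-sorting iteration. The following is a minimal Python/numpy sketch of those steps only; the outer loop that creates new clusters and the outlier-elimination rule discussed next are omitted, and all names are ours, not Forgy's.

```python
import numpy as np

def forgy(X, KF, LF=50, seed=None):
    """A minimal FORGY-style pass: centers initialized to KF randomly
    chosen patterns (step 1); iterate until no label changes or LF
    iterations are reached.  X is an n x d pattern matrix; KF and LF
    play the roles of the user-supplied parameters K_F and L_F."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=KF, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(LF):
        # step 2: label each pattern with its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # step 4: labels unchanged
            break
        labels = new_labels
        for k in range(KF):                      # step 3: recompute centers
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```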
The only means available in FORGY for reducing the number of clusters is to eliminate outliers. Let n_k be the number of patterns in cluster k. If n_k ≤ N_F, all patterns in cluster k are removed and henceforth ignored, where N_F is again a user-supplied parameter. The number of iterations required in FORGY before convergence is obtained is usually small.

Finally, the pairs (3, 3) and (7, 12) are neither concordant nor discordant. The γ statistic is invariant under monotonic transformations of the proximities. The closer γ is to one, the better the match between dendrogram and proximity matrix. A hierarchical structure truly fits a proximity matrix only when γ = 1, which occurs when the proximity matrix is ultrametric (Section 3.3.3) and either the single-link or complete-link clustering method is applied. The maximum value of γ = 1 can also be achieved with nonultrametric proximity matrices. Values of γ other than one can be interpreted only with respect to specific clustering methods. In a sense, the γ statistic is more a tool for evaluating clustering methods than for measuring global fit. Baker (1974) and Hubert (1974b) demonstrated the applicability of γ in comparative studies of clustering methods, as explained in Section 3.6. This work also provides benchmarks for γ when judging the global fit of hierarchies imposed by the single-link and complete-link methods. Hubert (1974b) provides tables of percentage points of the estimated distribution function of γ for 4 ≤ n ≤ 16 and tables of the means and standard deviations of γ for 4 ≤ n ≤ 25 under the Random Graph Hypothesis. He suggests that the quantity nγ − a ln(n) has an approximate standardized normal distribution, where a = 1.8 under complete-link clustering and a = 1.1 under single-link clustering.
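The γ in this discussion is the standard Goodman-Kruskal statistic: if s⁺ and s⁻ count the concordant and discordant pairs, then γ = (s⁺ − s⁻)/(s⁺ + s⁻). A brute-force Python sketch following that standard definition (names ours), taking two sequences of corresponding proximity and dendrogram values:

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma between paired sequences x and y, e.g., the
    entries of a proximity matrix and of the corresponding cophenetic
    (dendrogram) matrix.  Pairs tied in either sequence are ignored."""
    s_plus = s_minus = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:
            s_plus += 1      # concordant pair
        elif prod < 0:
            s_minus += 1     # discordant pair
    return (s_plus - s_minus) / (s_plus + s_minus)
```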
3.5.2 Global Fit Criteria for Partitions
Clustering algorithms provide one or more clusterings, or partitions, of the data. Here we address the validity of an individual clustering. The two questions of interest are: (i) What is an appropriate number of clusters? and (ii) How valid is the clustering itself? The first question deals with tradeoffs between parsimony and cluster homogeneity. The second compares a given clustering with random partitions of the same size achieved with the same clustering method. The discussion is separated into partitions achieved from partitional methods and those obtained from hierarchical methods.

(a) Partitional Methods. We first consider the case when the n patterns are expressed in terms of d features in a pattern matrix and the features
are all quantitative. The ideal clusters under this approach are compact, well-separated, hyperellipsoidal swarms of points in d dimensions. Clustering algorithms that partition a set of pattern vectors optimize some criterion (Section 3.4). We concentrate on the most popular criterion, square error (Section 3.4.1). Since the square error is a monotonically decreasing function of the number of clusters, C, it is usually minimized for a fixed C. Thus, either the number of clusters desired must be specified a priori or the clustering method must suggest a suitable value of C with the help of user-specified criteria for merging and splitting clusters. Choosing the proper value of C has received a great deal of attention. An empirical approach is to plot the square error as a function of C. If indeed there are C_0 clusters present in the data, we expect to see a "knee" in this curve at C_0. Duda and Hart (1973) attempt to provide a statistical test to determine whether or not the decrease in the square error as a result of increasing the number of clusters from C to (C + 1) is significant. More specifically, they test the null hypothesis that a given set of patterns is a random sample from a Gaussian distribution. Let E_1² and E_2² denote the square errors when the set of patterns is treated as one cluster and when it is partitioned into two subsets, respectively. Duda and Hart (1973) obtain the null distributions of E_1² and E_2² (for a suboptimal partitioning) and propose the following test: reject the null hypothesis at the p-percent significance level if

E_2²/E_1² < 1 − 2/(πd) − α[2(1 − 8/(π²d))/(nd)]^{1/2},
where α is the (1 − p)th percentile of the standardized normal distribution.
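A minimal sketch of the test, assuming the threshold in the inequality above; the function and parameter names are ours.

```python
import math

def duda_hart_reject(E1_sq, E2_sq, n, d, alpha):
    """One cluster vs. two: reject the single-Gaussian null hypothesis
    when E2^2/E1^2 falls below the threshold; alpha is the (1 - p)th
    percentile of the standardized normal distribution."""
    threshold = (1.0 - 2.0 / (math.pi * d)
                 - alpha * math.sqrt(2.0 * (1.0 - 8.0 / (math.pi ** 2 * d)) / (n * d)))
    return (E2_sq / E1_sq) < threshold
```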
Wolfe (1970) suggested likelihood ratio tests for determining an appropriate value of C. To test the hypothesis of C components against (C − 1) components in a mixture (Section 3.4.3), one computes

χ² = −2 log(L_{C−1}/L_C) = 2(log L_C − log L_{C−1}),

where L_C is the likelihood of the sample based on C components. The quantity χ² has an asymptotic chi-square distribution with degrees of freedom equal to the difference in the number of parameters under the hypotheses of C and of (C − 1) components. Thus the clustering should be repeated for increasing C until the observed χ² value is below some critical threshold. The conditions under which the asymptotic chi-square distribution is appropriate are not always applicable in clustering problems.
Hall et al. (1973) take a similar approach, except that they test the validity of a cluster by a multivariate version of the Kolmogorov-Smirnov (KS) test. First, they test the given set of data for unimodality by applying the multivariate version of the KS test. If the test fails, they split the data into two clusters by using the basic ISODATA procedure; otherwise the program stops. Each cluster, in turn, is split unless it passes the KS test. The goal is to obtain the smallest number of multivariate Gaussian clusters. This test is limited to small d for computational reasons. Davies and Bouldin (1979) define a clustering criterion involving the distances between cluster centers and dispersions of individual clusters. A minimum in a plot of the criterion as a function of the number of clusters indicates an appropriate number of clusters. The distribution of this criterion has not been studied. Bezdek (1974) defines a partition coefficient and a separation index for evaluating fuzzy clusters and selecting the proper number of fuzzy clusters, but does not provide distributions for the indices. The problem of deciding whether a swarm of points is better viewed as one cluster or as two clusters is both a clustering tendency problem (Section 3.2) and a partitional-fit problem. We now review two approaches to this problem that can be used to test the distinctiveness between a pair of clusters. That is, the clusters are viewed (Section 3.4.3) as components of a mixture density (Hall et al., 1973; Sclove, 1977; Scott and Symons, 1971; Wolfe, 1970) and multivariate Gaussian distributions are fitted to each cluster by estimating the mean vector and covariance matrix based on cluster labels. These parameters are required to check the pairwise validity of clusters, but no single test of validity for the entire partition has been based on them. Consider first the one-feature problem when n patterns are split so as to minimize the scalar S_W, the within-cluster sum of squares. Hartigan (1978) assumes that cluster C_i is distributed as N(μ_i, σ²), i = 1, 2. The null hypothesis is μ_1 = μ_2. Engleman and Hartigan (1969) obtain the empirical distribution of the scalar ratio (S_B/S_W) (Section 2.4.2) under the null hypothesis; S_B is the between-cluster sum of squares. In one dimension only (n − 1) possible splits between the n patterns need be considered. Hartigan (1978) provides a theoretical study of this problem. Hartigan (1975) also demonstrates that log(S_B/S_W) has an asymptotic Gaussian distribution and extends the test to clusters in multivariate space under the null hypothesis that all cluster centers are equal. The expected value and the variance of (S_B/S_W) for large n and d are known under this null hypothesis. If the observed value of (S_B/S_W) is several standard deviations away from its expected value, then the null hypothesis is rejected. Sneath (1977) has proposed an interesting test for the distinctiveness of two clusters by projecting the patterns onto the line joining the two centroids and measuring the overlap between the distributions of projected
patterns. Rather than asking whether the two centroids are significantly different, Sneath (1977) tests whether the "distinctness" of the two clusters is above a certain level. The index of disjunction involves the numbers of patterns in the two clusters (n_1 and n_2), the sample standard deviations of the projected patterns (s_1 and s_2), and the intercentroid distance Δ, and is defined as

W = Δ/[(n_1 + n_2)(s_1²/n_1 + s_2²/n_2)]^{1/2}.

The quantity t_w = W(n_1 + n_2)^{1/2} has a noncentral t distribution under the null hypothesis, and Sneath (1977) suggests a means for determining the noncentrality parameter.
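A sketch of the index, assuming the patterns have already been projected onto the line joining the two centroids; the function name is ours.

```python
import math

def sneath_disjunction(proj1, proj2):
    """Index of disjunction W and the derived t_w for two clusters,
    given the projections of their patterns onto the line joining
    the two centroids."""
    n1, n2 = len(proj1), len(proj2)
    m1, m2 = sum(proj1) / n1, sum(proj2) / n2
    s1_sq = sum((x - m1) ** 2 for x in proj1) / (n1 - 1)   # sample variances
    s2_sq = sum((x - m2) ** 2 for x in proj2) / (n2 - 1)
    delta = abs(m1 - m2)                                   # intercentroid distance
    W = delta / math.sqrt((n1 + n2) * (s1_sq / n1 + s2_sq / n2))
    t_w = W * math.sqrt(n1 + n2)                           # referred to a noncentral t
    return W, t_w
```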
Mountford (1970) takes a different point of view and assumes a mathematical model for the n² entries in the proximity matrix D, rather than for the patterns themselves, viewing them as jointly distributed Gaussian random variables based on the following equation for the similarity between patterns i and j:

d(i, j) = μ(i, j) + g(i) + g(j) + ε(i, j).

Here μ(i, j) is the nominal value of the proximity, g(i) and g(j) are effects peculiar to patterns i and j, and ε(i, j) is the effect of interaction between patterns i and j. The analogy to an analysis of variance model based on the entries in a pattern matrix is clear. The covariance between d(i, j) and d(k, m) depends on the number of matches in the subscripts. Thus d(i, j) and d(k, m) will be independent if i, j ≠ k, m, but d(i, j) and d(i, k) will be correlated. Only the two-cluster case is considered in detail. The ideal clustering has proximities for all pairs of patterns within the same cluster [μ(1, 1) and μ(2, 2)] that are greater than those for all pairs of patterns from separate clusters [μ(1, 2)]. The natural measure of separation between the two clusters is

μ(1, 1) + μ(2, 2) − 2μ(1, 2).

The larger this measure, the more "real" the clusters. The null hypothesis is μ(1, 1) = μ(2, 2) = μ(1, 2), which corresponds to "no clustering." Any clusters generated by a clustering method under this hypothesis would be strictly artifacts of the clustering method. Mountford suggests a simplified statistic, denoted b, to test this hypothesis. Adams et al. (1977) and Attinger et al. (1978) have used this statistic to evaluate partitions of time functions. The distribution of b under the null hypothesis is very difficult to obtain because the sample has been treated by a clustering algorithm.
Mountford suggests a conceptual strategy for approximating the distribution of an upper bound on b. In summary, no practical means has been established for determining the validity of a partition of points based on the coordinates of the points themselves. McClain and Rao (1974) have proposed Monte Carlo simulations of statistics to properly evaluate partitions. The problems of choosing an appropriate number of clusters and finding an index of separation between a pair of clusters have not been completely solved, much less the problem of evaluating an entire partition.

(b) Partitions from a Hierarchy. In hierarchical clustering, partitions are achieved by cutting a dendrogram, which selects one of the clusterings in the nested sequence of clusterings that comprises the hierarchy (Section 3.3). As many as (n − 2) such clusterings are available, so it is essential, in applications, to be able to select those clusterings which have objective merit and discard those which are artifacts of the clustering method. Baker and Hubert (1975, 1976) have proposed two approaches to this problem of partitional adequacy. A characteristic function for the clustering at level m is

T_m(i, j) = 0 if patterns (i, j) are in the same cluster at level m,
          = 1 if not.
This crude measure of dissimilarity between patterns is compared to the given rank-order similarity matrix [r(i, j)] using the Goodman-Kruskal (1954) γ statistic defined in Section 3.5.1. The statistic at level m is denoted γ_m. Ideally, a rank r_m exists such that for all i ≠ j,

r(i, j) < r_m if T_m(i, j) = 0;
r(i, j) > r_m if T_m(i, j) = 1.
Baker and Hubert (1975) estimate the distribution of γ_m when n = 12 by Monte Carlo methods under the Random Graph Hypothesis, which serves as the statement of "no clustering." Alternative hypotheses can be formed by choosing functions T(i, j) to reflect appropriate structure. However, the distribution of γ_m must be obtained by Monte Carlo simulations for each case. An interesting result from this approach is the comparison of the relative ability of single-link and complete-link methods to recognize various structures. Baker and Hubert (1976) also studied the validity of complete-link clustering using the number of extra edges in the threshold graph. Each complete-link cluster is a clique (maximally complete subgraph) of a threshold graph. The ideal clustering structure has a clique for each cluster with no extra edges, or edges between the cliques. The more extra
edges are present, the less valid is the complete-link clustering. Baker and Hubert (1976) express the number of extra edges at level k, E_k, as

E_k = A_k − T_k,
where A_k is the number of edges in a threshold graph and T_k is the minimum number of edges required in the threshold graph to establish cliques for all clusters. They provide information on the joint distribution of A_k and T_k for a few values of n and k. The main practical difficulty with both these techniques is the need to simulate distributions for every case. The distributions depend on both n, the number of nodes, and k, the level of the dendrogram. Monte Carlo simulations for even moderate values of k and n would not be trivial undertakings. Another approach to validating partitional structures, called quadratic assignment, has been proposed by Hubert and Schultz (1976) as part of a comprehensive scheme for validating clustering structures. We begin with an a priori notion of an ideal structure, which specifies not only the number of clusters, but also the number of patterns in each cluster, expressed in a structure matrix [c(i, j)]. Only the blocks along the main diagonal contain nonzero elements. The ith main-diagonal block contains zero on its main diagonal and the number k_ii elsewhere, where k_ii = 1/[n_i(n_i − 1)] and n_i is the number of patterns in cluster i. Letting D = [d(i, j)] be the dissimilarity matrix, the following statistic is used to test the partition:
Γ(g) = Σ_{i,j} d(i, j) c[g(i), g(j)],
where g indicates a permutation function on the integers from 1 to n and the sum is over all n values of i and of j. This statistic is a product moment correlation between the given dissimilarities and the ideal structure after the rows and columns of [c(i, j)] have been rearranged according to g. If g_0 is the identity permutation, so g_0(k) = k for k = 1, . . . , n, then Γ(g_0) is the sum of the average within-cluster dissimilarities, summed over all clusters. The smaller Γ(g_0), the better the partition defined by g_0 fits the data. The idea behind the validity test is to pick out unusual clusterings of type (n_1, n_2, . . . , n_K), where K is the number of clusters and n_1 + n_2 + . . . + n_K = n. Conceptually, one would compute Γ(g) for all n! permutations of the pattern numbers and develop the sampling distribution of Γ. Then one could see if a particular permutation, say g′, were unusual by seeing where Γ(g′) occurs in the distribution. One can think of the sample distribution created in this way as the distribution of Γ under the null hypothesis that all permutations are equally likely.
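The Monte Carlo version of this idea is straightforward; the sketch below approximates the permutation distribution by sampling rather than enumerating all n! permutations, and all names are ours.

```python
import numpy as np

def gamma_stat(D, C, g):
    """Hubert-Schultz statistic: the sum over all i, j of d(i, j) * c[g(i), g(j)]."""
    return float(np.sum(D * C[np.ix_(g, g)]))

def permutation_test(D, C, n_perm=999, seed=None):
    """Monte Carlo estimate of how unusual the identity permutation g0 is
    under the null hypothesis that all permutations are equally likely.
    Small values of the statistic indicate a good fit, so the p-value
    counts permutations scoring no larger than the observed value."""
    rng = np.random.default_rng(seed)
    n = len(D)
    observed = gamma_stat(D, C, np.arange(n))          # g0 = identity
    hits = sum(gamma_stat(D, C, rng.permutation(n)) <= observed
               for _ in range(n_perm))
    return observed, (1 + hits) / (n_perm + 1)
```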
The unique aspect of this approach is that formulas for the mean, E(Γ), and variance, Var(Γ), of Γ are known (Hubert and Schultz, 1976), so one can compute the statistic

z = [Γ(g_0) − E(Γ)]/[Var(Γ)]^{1/2}.
The exact significance of this z score is not known, although we would expect it to have an approximate standardized normal distribution when no clustering is present.

3.5.3 Validity of Individual Clusters
Individual clusters can be described as compact or isolated, depending on their interaction with their environment. We discuss techniques for examining the characteristics of individual clusters formed by hierarchical methods and by partitional methods. We also review cluster profiles, which provide the most extensive tests of clusters based on rank-order proximities yet proposed.

(a) Hierarchical Clusters. Compactness criteria, or indices of internal adhesion, and isolation criteria, or indices of separation, have been established for single-link clusters based on ordinal proximity matrices. Ling (1972, 1973a) calls a cluster real if it forms early in the single-link dendrogram and remains a separate entity for a reasonably long range of proximities. Ling's measure of compactness is, in our framework, a measure of clustering tendency (Section 3.2). The measure of isolation is lifetime. If r_1 is the rank at which a cluster, C, is born, or first appears, and r_2 is the rank at which the cluster dies, or is merged with another cluster, the lifetime of the cluster is l(C) = r_2 − r_1.
Ling (1972, 1973a) was able to find the distribution of l(C) under the Random Graph Hypothesis, assuming a rank-order proximity matrix. The distribution of l(C) depends on cluster size and the number of patterns being clustered. Thus, one can judge whether an observed lifetime is unusual, as compared to lifetimes for other single-link clusters of the same size, N(C), under the Random Graph Hypothesis. That is, if F_n(l) = Prob[l(C) > l | N(C)], then a single-link cluster C is judged isolated if F_n(l_C) is sufficiently low, where l_C is the lifetime of the cluster containing N(C) patterns. A clustering method is biased towards certain types of clusters.
Single-link clusters tend to be weakly connected and "chained" since only the connectivity of a subgraph is needed to form the cluster. Complete-link clusters require maximally complete subgraphs and all tend to have roughly equal size. Thus, the evaluation of individual cluster validity must involve the clustering method. Ling's results cannot be used for anything other than single-link clusters. This natural bias imposed by a clustering method also means that one cannot objectively evaluate the isolation of a cluster by comparing it to that of a randomly chosen cluster. The selection of a particular clustering method automatically rules out several subsets, so comparing a cluster achieved by a particular method to a randomly chosen cluster tends to make the cluster appear more isolated than it is. Cluster profiles, discussed below, were designed to overcome this objection. Another potential trap for the unwary was pointed out by Ling (1972, 1973b). If F_n(l_C) is computed for all single-link clusters, some of the values can be small by pure chance. Ling observed this phenomenon in purely random data. No theory exists for the joint occurrence of isolation indices, although we know that the isolation indices of the single-link clusters are not independent. Several other indices of cluster isolation have been proposed. Hartigan (1975) suggested a procedure for measuring cluster validity that depends both on the clustering method and the algorithm chosen to implement the method. Estabrook (1966) and Wirth et al. (1966) provide examples of application-related isolation indices. However, none of these indices is supported by theoretical or comprehensive empirical work, so they must be viewed as qualitative measures.

(b) Partitional Clusters. No theory analogous to Ling's work on single-link clusters from ordinal data exists for clusters formed by partitional methods. It is clear that proximities between patterns inside the cluster should be compared to proximities between pairs of patterns, one inside and one outside the cluster, but no general theory for such measures has been developed that is applicable to cluster validity. Some of the measures of global fit for partitions could be adapted to the dichotomy of the data consisting of patterns in the cluster and outside the cluster. The Hubert and Schultz (1976) gamma statistic can be defined from an n × n "structure" matrix [c(i, j)] constructed as follows.
Here n_1 is the number of patterns in the proposed cluster and the n_1 × n_1 block in the structure matrix corresponds to the proposed cluster; n_2 = n − n_1. The entries in this block are all k_11 = [n_1(n_1 − 1)]^{-1}, except for zero diagonal entries. All entries in the n_1 × n_2 and n_2 × n_1 blocks are −k_12, where k_12 = (2n_1n_2)^{-1}, and the entries in the n_2 × n_2 block are zero.
[Ail, g(~91,
Ug) = M i , jk
where d(i,j) is the dissimilarity between patterns i a n d j and g defines a permutation of the integers from 1 to n. If patterns g*U),
-
*
.
7
g*h)
define the cluster, then one can show that r(g*) = (average within-cluster dissimilarity) - (average dissimilarity between the cluster and noncluster).
The null hypothesis is that all permutations of the n patterns are equally likely. If the distribution of r were known, over all permutations g , then T(g*) could be located on this distribution to test the hypothesis; the more negative r(s*),the more valid the cluster. Although Hubert and Schultz (1976) provide expressions for the mean and variance of I? under the null hypothesis, the distribution of r is not known and must be established by Monte Carlo simulation for each nl,n, and dissimilarity matrix. A second impediment is that n, must be specified a priori, before the dissimilarity matrix is observed through a clustering method, if this test is to be applied properly, since the size of a cluster is affected by the clustering method. We reiterate our earlier warnings about using statistical tests based on multivariate analysis for testing separation between clusters. One cannot label patterns so as to separate the patterns into clusters and then test validity against randomly labeled patterns. Monte Carlo simulations are necessary to establish the distributions of the standard statistics under clustering. ( c ) Cluster Profiles. Bailey (1978) has introduced a promising procedure for validating individual clusters based on the theory of random graphs. The clusters can be formed by any possible means, so Bailey's techniques do not refer to any particular clustering method. However, they are limited to data described by a rank-order proximity matrix. Cluster profiles help one judge whether a cluster is unusually compact and/or isolated by recording the growth of the threshold graph around a particu-
Bailey (1978) chooses simple indices of isolation and compactness for ease of computation and interpretation, suggested by the "extra edges" idea of Baker and Hubert (1976). Using the usual correspondence between a graph and a clustering problem, suppose a subset, A, of nodes is fixed and a threshold graph (Section 3.3.1) containing edges having dissimilarity t or less is observed. The compactness index, e_A(t), is the number of edges in the threshold graph internal to A (both nodes in A) while the isolation index, b_A(t), is the number of linking edges (one node in A, the other outside of A). Graphs of b_A(t) and e_A(t) as functions of t are called raw profiles and are qualitative evaluators of the validity of A as a cluster. A probability profile plots b_A(t) and e_A(t) on a probability scale, thus providing quantitative measures of isolation and compactness. The type of probability model adopted determines the manner in which the profile can be interpreted. Under the Random Graph Hypothesis (Section 3.2), all proximity matrices are equally likely. Probability profiles compare the growth in e_A(t) and b_A(t) for the given proximity matrix with that for a randomly chosen matrix. To quantify this growth, let B and E be random variables representing isolation and compactness, respectively, of a randomly chosen subset of k nodes; B is the number of linking edges and E is the number of internal edges in some threshold graph. The distributions of E and B can be determined for a randomly selected subset of k nodes from the hypergeometric model and depend on n, the number of nodes, and N, the number of edges in the threshold graph.
P(E = e) = C(k:2, e) C(n:2 − k:2, N − e) / C(n:2, N),

P(B = b) = C(k(n − k), b) C(n:2 − k(n − k), N − b) / C(n:2, N).
The notation n:2 stands for n things taken 2 at a time, or n(n − 1)/2, and C(m, r) denotes the number of ways of choosing r of m objects. Plots of P(B ≤ b_A) and P(E ≥ e_A) as functions of N score the growth of the isolation and compactness indices for cluster A on a probability scale. As long as cluster A is chosen without reference to the proximity matrix, these plots provide an objective evaluation of A.
If cluster A is identified after applying a clustering method to the proximity matrix, comparing the cluster's compactness and isolation to those for a randomly chosen set of nodes is not a fair comparison. Any clustering method will almost certainly generate a better-than-random cluster, so such a procedure will provide unduly optimistic validity conclusions. When cluster A is identified by a clustering method, all clusters that cannot be achieved by that clustering method must be deleted from the population of clusters against which the isolation and compactness of A are compared. One could develop distributions for B and E by Monte Carlo simulation (Baker and Hubert, 1975) and judge compactness and isolation by profiles based on this distribution. To avoid such computations, Bailey (1978) proposed a "best-case" distribution. Suppose a threshold graph containing n nodes and N edges is chosen at random and the isolation and compactness indices are determined for all k-node subsets. Let B* and E* be those subsets on which the best isolation and compactness indices, respectively, are achieved. That is, the number of linking edges is minimum for subsets in B* and the number of internal edges is maximum for subsets in E*. Then plots of
P(B ≤ b_A | B*) and P(E ≥ e_A | E*)
would provide reasonable measures of isolation and compactness for cluster A, since A would be compared to the "best" clusters in a randomly chosen threshold graph, and would provide a standard of performance against which any cluster, regardless of origin, could be compared. The exact probabilities have not been derived theoretically, but Bailey (1978) has found bounds on them that serve as conservative measures of Cluster Validity.
Further evidence of validity can be found by asking: How unusual is an observed isolation index among clusters having a fixed compactness index in a randomly selected graph? Similarly, the novelty of an observed compactness among all clusters having a fixed isolation can be studied. Bailey (1978) applied the hypergeometric model to show that
P(B = b | E = e) = C(k(n − k), b) C((n − k):2, N − e − b) / C(k(n − k) + (n − k):2, N − e),

P(E = e | B = b) = C(k:2, e) C((n − k):2, N − b − e) / C(k:2 + (n − k):2, N − b).
These probabilities lead to the following relative measures of isolation and compactness: writing I(A) = P(B ≤ b_A) and C(A) = P(E ≥ e_A) for the absolute indices, the relative indices are I_R(A) = P(B ≤ b_A | E = e_A) and C_R(A) = P(E ≥ e_A | B = b_A).
Plots of I(A), C(A), I_R(A), and C_R(A) against N form the probability profiles for cluster A and provide a comprehensive basis for evaluating the validity of individual clusters. This approach avoids the need for Monte Carlo simulations and evaluates the cluster over its entire evolution. A general strategy for applying the probability profiles is to first search for small values of I(A) and C(A) over a span of ranks, which is evidence for saying that A is unusually compact and/or isolated. If both labels apply, the cluster is called "valid." If a cluster is compact but not isolated, I_R(A) allows one to see if A is at least isolated among all clusters having the same compactness as A. Similarly, C_R(A) permits evaluation of relative compactness.
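Under the Random Graph Hypothesis the absolute indices are hypergeometric tail probabilities and can be computed directly with scipy; the conditional (relative) indices follow the same pattern using the adjusted populations given above. A sketch of the absolute indices (names ours):

```python
from scipy.stats import hypergeom

def absolute_profile(n, k, N, e_A, b_A):
    """Probability-profile values for a k-node subset A of an n-node
    threshold graph with N edges: the compactness index P(E >= e_A)
    and the isolation index P(B <= b_A) under the Random Graph
    Hypothesis."""
    total = n * (n - 1) // 2        # n:2 possible edges
    internal = k * (k - 1) // 2     # k:2 possible internal edges
    linking = k * (n - k)           # possible linking edges
    E = hypergeom(total, internal, N)
    B = hypergeom(total, linking, N)
    return E.sf(e_A - 1), B.cdf(b_A)   # P(E >= e_A), P(B <= b_A)
```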
3.5.4 Stability

Stability can describe hierarchical structures, partitions, or individual clusters. The intuitive idea is that a stable structure will remain fixed through minor changes in the data and/or clustering algorithm. If a few patterns or features are added to or deleted from a data set, the identity of a stable cluster should remain essentially the same. The identities of all clusters in a stable clustering should survive the addition or deletion of a few patterns. The only effect of adding new patterns to a stable hierarchical clustering structure should be new binary merges involving the new patterns. The precise, quantitative expression of this idea depends on the situation and is difficult to formulate in general terms. Jardine and Sibson (1971) study the stability of a clustering method in a different context than described above. Their theoretical analysis shows that the single-link clustering method possesses a continuity property (with respect to the proximity matrix) that other clustering methods do not have. As a proximity matrix is gradually changed, the single-link dendrogram adapts in a continuous fashion while, for example, complete-link clustering exhibits a sudden change in cluster identity. Mezzich (1978) compared clustering algorithms by assessing "cluster replicability."
He randomly halved the data and computed a correlation coefficient between corresponding entries of cluster-by-category tables from the two halves. Strauss (1973) and Strauss et al. (1973) suggested the same thing in a different context. Baker (1978) also evaluated the complete-link method by studying the effects of removing a few patterns from the data sets. His results apply only for n = 16 and uncovered some unstable behavior in this method. Gnanadesikan et al. (1977) suggest two ways of investigating the stability of clusterings and individual clusters. "Jackknifing" means omitting one feature or one pattern from the analysis and investigating resultant changes in the structure. "Shaking the data" means perturbing the data by adding noise with a multivariate normal distribution and observing the effects on clustering structure. Benchmarks do not exist for either of these suggestions, so one has no quantitative way of predicting the changes expected under any form of a null hypothesis. The same criticism applies to the suggestions of Strauss (1973) and Strauss et al. (1973). One is faced with extensive Monte Carlo simulations for each special case. Smith and Dubes (1979) suggested a statistic for comparing the clusterings achieved on a randomly halved data set and empirically determined its null distribution. Let R be the given rank-order proximity matrix whose rows and columns are partitioned into two sets of size n_1 and n_2, where n = n_1 + n_2. Rank-order proximity matrices R_1 and R_2 are formed from R corresponding to this partition. Suppose a hierarchical clustering method is applied to R; let the partition ranks (cophenetic matrix) be denoted U. Define U_1 and U_2 from U in the same way that R_1 and R_2 are formed from R. Finally, apply the same hierarchical clustering method to R_1 and R_2 that was applied to R and let W_1 and W_2 be the corresponding cophenetic matrices. If U_1 and W_1 are very similar and U_2 and W_2 are very similar, then the two halves of the data produced clusterings similar to the original clustering and the clustering is stable. This does not depend on the number of clusters. Smith and Dubes (1979) applied the Goodman-Kruskal (1954) gamma statistic to measure similarity. Baker and Hubert (1975) use this statistic in other evaluations of clustering results requiring a measure of rank correlation. Let γ_1 be the rank correlation between U_1 and W_1, and let γ_2 be that between U_2 and W_2. The two stability statistics are

γ_m = min(γ_1, γ_2)  and  γ_a = (γ_1 + γ_2)/2.

Smith and Dubes (1979) obtained the distributions of these two statistics by Monte Carlo simulation under the Random Graph Hypothesis with n_1 = n_2 = n/2 for several values of n between 16 and 48. Both the complete-link and the single-link methods were studied.
In both cases, the means and variances of γ_m and γ_a decreased with n, as expected, for random data. The means of the single-link statistics were consistently larger than those for the complete-link statistics, which suggests that the single-link method finds more stability in the hierarchy, even for random data. Besides choosing proximity matrices at random, Smith and Dubes (1979) studied the effects of global fit by rejecting all proximity matrices that did not conform to a hierarchical structure, as measured by a Goodman-Kruskal gamma statistic of 0.5 or more. This raised the value of the gamma statistics, but was not a significant factor. A sketch of the halving procedure appears below.
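The halving construction can be sketched with scipy's hierarchical clustering. For brevity, Spearman's rank correlation stands in for the Goodman-Kruskal gamma here, so the resulting numbers are not directly comparable to Smith and Dubes' tables; everything else follows the construction of U_1, U_2, W_1, W_2 described above (names ours).

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def halving_stability(R, method="complete", seed=None):
    """Randomly halve an n x n proximity matrix R, cluster the whole
    matrix and each half with the same hierarchical method, and
    rank-correlate the cophenetic values of each half of the full
    hierarchy (U_i) with those of the hierarchy built on that half
    alone (W_i).  Returns (gamma_m, gamma_a)."""
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(R)
    idx = rng.permutation(n)
    U = squareform(cophenet(linkage(squareform(R), method)))
    gammas = []
    for half in (np.sort(idx[: n // 2]), np.sort(idx[n // 2:])):
        u = squareform(U[np.ix_(half, half)])                             # U_i (condensed)
        w = cophenet(linkage(squareform(R[np.ix_(half, half)]), method))  # W_i
        gammas.append(spearmanr(u, w).correlation)
    return min(gammas), sum(gammas) / 2.0
```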
3.6 Comparative Analysis

How does one choose, from among the array of clustering methods and algorithms, the one that is "best" for a particular application? Besides shedding some light on this question, comparative analysis should determine whether one clustering method is inherently better than another for certain classes of data and should provide differential information on the characteristics and peculiarities of competing clustering methods and algorithms. We begin by noting that one should not expect a single clustering method to be best in any useful sense of the word "best." Since clustering is a tool for exploring data, personal preference for and confidence in a certain method may override other criteria during selection of a clustering method. In this section we review objective criteria for comparing clustering methods and algorithms. One's choice of clustering method is, of course, strongly affected by nonobjective factors. Matching algorithm to data type, the desired type of output, the availability of the algorithm on the local computer, and acceptance of the clustering method in the methodology of the application are all factors that strongly influence the choice of clustering method. Comparative analysis studies can themselves be clustered into three groups. Theoretical studies specify a mathematical framework into which several clustering methods can be placed. Clustering methods can then be compared on the basis of the mathematical properties inherent in the methods themselves. The second group of comparative studies simulate clustering methods and characterize their responses to artificially generated, or "well-known," data sets. Clustering methods can be rated on the degree to which their responses match the proper one, which is known a priori. Comparative studies in the third group are called practical because they apply several algorithms to the same "real" data sets and try to evaluate them on a relative basis. We now investigate the three groups in more detail.
3.6.1 Theoretical Studies
Several mathematical formalisms for clustering have been proposed. A formalism should provide a showcase for clustering methods and permit a comparison of the relative merits and demerits of the various methods. Some examples of formalisms are Hubert's (1974a) work on imbedding clustering in graph theory, Jardine and Sibson's (1971) and Jardine et al.'s (1967) axiomatic approach, Backer's (1978) generalizations based on fuzzy set theory, and Wright's (1973) axiomatic theory. Binder (1978) has imposed a Bayesian decision-making structure on the partitional clustering problem and has related certain a priori information to clustering criteria. Although such formalisms enhance our mathematical understanding of clustering, they are often implicitly biased toward one method or the other, as by the choice of criterion function. Creating an algorithm for implementing a theoretical criterion depends on factors that are not part of the formalism. For example, any formulation of clustering based on the intuitively appealing criterion of square error will require an iterative algorithm, such as K-MEANS or ISODATA. As discussed in Section 3.4, the success of such algorithms in achieving a good partition is often governed by factors such as the ability to recognize outliers and techniques that split and lump clusters; a theoretical comparison of such algorithms is impractical. Thus, the general theory itself stops short of providing satisfactory answers to the optimality question. Some theoretical studies directed at objective, comparative analysis are discussed below. Fisher and Van Ness (1971) attack the problem of choosing a clustering procedure by formulating properties that any reasonable procedure should satisfy and calling a procedure satisfying them admissible. Admissibility eliminates poor clustering algorithms, but does not identify the best method. Three of the admissibility properties relate to the manner in which the clusters are formed, two reflect the structure of the data, and four test the sensitivity of the clustering method to nonessential changes in the data. All the criteria apply to clustering methods, not to algorithms. Single-link clustering satisfied all but one admissibility condition. All other methods tested failed to satisfy at least two conditions, so one must select those admissibility conditions that are important in each application. Unfortunately, the admissibility conditions do not discriminate among competing clustering methods very well. Dubes and Jain (1976) included these admissibility conditions in their comparison of eight clustering methods. Williams et al. (1971) compared four hierarchical agglomerative strategies for clustering that differed in the procedures for combining groups when going from one clustering to another. Advantages and disadvantages
of the four strategies were considered for various data types and sample sizes. The evolution of a new clustering method is often motivated by deficiencies in existing methods. Thus, any proposal for a new clustering method or for a modification of an existing method has some basis for evaluating existing methods. Examples are Wishart's (1969) mode analysis, in which 13 methods are compared by a "minimized factor" and ways of overcoming single-link chaining are proposed, Sneath's (1969) comparison of several methods, which is really a motivation for work in Cluster Validity, and Hubert and Schultz's (1975) work establishing the "space-contracting" character of single-link clustering and the "space-dilating" character of complete-link clustering in a quantitative manner. Blashfield (1977a,b) calls for the type of comparative analysis we are discussing. Aldenderfer (1977) and Blashfield (1977a,b) provide excellent summaries of hierarchical and partitional clustering methods in a manner that emphasizes comparisons from a consumer's viewpoint. Their careful analyses failed to identify markers of "good" clustering methods and they could only admonish the consumer to beware of traps. Jardine and Sibson (1971) demonstrated that the single-link method is the only hierarchical clustering method with certain desirable properties, such as continuity. This exemplifies the difficulty in attempting to theoretically characterize clustering methods in a meaningful way. Sibson (1971) also showed that the single-link method recovers a certain type of structure in a simple case. However, several careful empirical studies (Baker, 1974; Baker and Hubert, 1975; Hubert, 1974a) show that the complete-link method is preferable in many ways. Baker (1978) found that the complete-link method can produce misleading results for small sample sizes (16 or less). One other aspect of comparative analysis is the problem of relating newly proposed clustering algorithms to existing ones. A clustering method motivated by one theoretical framework may end up being the same as another algorithm developed from an entirely different framework. No theoretical tools exist for objectively comparing methods. Recognizing the equivalences between methods developed from totally disparate viewpoints should be part of comparative analysis, but has received little attention. An example of such a situation was recently noted by Shaffer et al. (1979), who observed empirical similarities between the results of Kittler's (1976) clustering method based on mode analysis and the standard single-link clustering method. Kittler (1979) did not agree that the methods were the same. This example demonstrates the need for formal comparisons.
We conclude this section by emphasizing the futility of searching for a purely theoretical means that will identify the “best” clustering method or algorithm, even in a particular application. Theory can establish certain general characteristics of clustering methods and provide perspective, but the approaches described in the next two sections are much better guides to selection of method and algorithm than is theory.
3.6.2 Simulated Comparisons

Here we discuss several studies in which artificially generated data are created and clustering algorithms are evaluated by their responses to them. The key is that the known structure of the data serves as a fixed reference point. The goodness criteria relate entirely to the performance of the algorithm in properly interpreting the data, and ignore computational criteria, so the underlying method is being characterized, not the algorithm. This approach links comparative analysis to Cluster Validity. Clustering methods that produce "good" or the "right" results are grouped together and the "good" results correspond to valid clusters. Thus, concepts from Cluster Validity are inextricably woven into our discussion. Milligan (1978) has reported a very thorough simulated comparison. A total of eleven hierarchical and four partitional algorithms were examined for their ability to recover clustering structure masked by errors of various kinds. Data sets of size 50 were chosen by a three-factor design; the number of clusters (2, 3, 4, or 5), the number of features (4, 6, or 8), and the type of distribution (three levels) were controlled. Three replications of each of the 36 possible data sets were generated for a total of 108 error-free sets of data. Six types of error (no error, outliers, distance perturbation, noise dimensions, non-Euclidean distance, and variable standardization) were imposed on each set of data. Recovery of structure was evaluated by an external criterion (Rand Coefficient) and an internal coefficient (correlation between dissimilarity matrix and characteristic function of the clustering). We mention only a few of the conclusions drawn from this excellent study. Single-link clustering was extremely sensitive to perturbations in distances but was surprisingly robust to the addition of outliers. All of the methods were sensitive to the addition of extra features containing purely random numbers. The partitional clustering algorithms (K-MEANS type) generally outperformed hierarchical algorithms, but were very sensitive to starting configuration. Mojena (1975) empirically evaluated seven hierarchical clustering methods to aid a user in objectively selecting good partitions, rather than good hierarchies. The seven methods were single-link, complete-link,
simple-average, group-average, median, centroid, and Ward's method. All are identified by selection of three parameters (Section 3.3.2) for combining groups as part of agglomerative algorithms. The main objective was to establish a "stopping rule," or decision regarding the stage (level) that best reproduces the underlying structure. If the levels (proximity values) at which clusters are formed are denoted by the ordered numbers a_1, a_2, . . . , a_{n-1}, and if the sample average and standard deviation of these numbers are denoted ā and s_a, respectively, then Rule 1 calls the first stage satisfying

a_{j+1} ≥ ā + k s_a

the proper stage for cutting the dendrogram; k is a user-selected cutting parameter (a sketch of Rule 1 is given below). Rule 2 views the sequence of a values as a time series and employs a different means for identifying a significant stopping level. The artificially generated data used for this study consisted of n patterns and K populations in five features. Each of the n patterns was a set of five numbers selected independently from gamma distributions. The gamma distributions' parameters were determined by selecting a degree of overlap, based on the maximum-likelihood decision rule that would have been used to separate populations if the patterns were labeled according to population. A total of 12 data sets were formed; one group of four data sets had K = 2 and n = 60, the second had K = 3 and n = 90, while the third had K = 4 and n = 120. Four degrees of overlap were used in each group. A particular clustering is described by a characteristic matrix (Section 3.5.2).
This matrix is compared to the "desired" matrix based on a priori category information by the simple matching coefficient (Section 2.1),

C = Σ_{i<j} c_ij / [n(n − 1)/2],

where c_ij = 1 if entries (i, j) from the two matrices match and 0 if not. High values for C imply good clustering results, especially with low overlap. When the number of clusters was chosen equal to the number of populations, Mojena's (1975) results show that performance tends to deteriorate as the degree of overlap and the number of populations increase. Ward's method performed best, single-link performed worst, and all other methods were essentially the same, with complete-link somewhat better than the average. These conclusions are based on one realization of each of the twelve data bases. Values of the cutting parameter k in the range from 2.75 to 3.5 gave the best overall results with Rule 1. Mojena (1975) concluded that Ward's method was better than the rest, but stressed the importance of consistency among the conceptualization of clusters, the measure of proximity, the type of data, and the clustering method.
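Mojena's Rule 1 takes only a few lines of Python; the default cutting parameter reflects the 2.75-3.5 range reported above, and the function name is ours.

```python
import statistics

def mojena_rule1(levels, k=3.0):
    """Rule 1: given the ordered fusion levels a_1, ..., a_{n-1} of an
    agglomerative hierarchy, return the first stage j for which
    a_{j+1} >= mean + k * (standard deviation); the dendrogram is cut
    at that stage.  Returns None if no stage qualifies."""
    mean = statistics.mean(levels)
    sd = statistics.stdev(levels)
    for j, level in enumerate(levels[1:], start=1):   # level is a_{j+1}
        if level >= mean + k * sd:
            return j
    return None
```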
We note that the data consisted of compact swarms of patterns in five dimensions, so it is not too surprising that Ward's method performed well. Cunningham and Ogilvie (1972) compared the same seven hierarchical clustering methods as Mojena (1975). The data consisted of 20 × 20 proximity matrices [d(i, j)]. Three proximity matrices were derived by placing four groups of five points each in the plane and measuring proximity with Euclidean distance. The three matrices differed in degree of overlap. Two other proximity matrices were cophenetic matrices (Section 3.3.3) derived from prespecified dendrograms. Cunningham and Ogilvie (1972) defined "output" distances d*(i, j) as the (i, j) entries in the cophenetic matrix obtained by applying a clustering method to a distance matrix. They suggest Kendall's tau statistic, a measure of rank correlation between the entries of [d(i, j)] and [d*(i, j)], for evaluating the structure derived by the clustering method. In the sense of Cluster Validity (Section 3.5.1), this is a measure of global fit; Hubert (1974b) prefers the Goodman-Kruskal gamma statistic for this purpose. Cunningham and Ogilvie (1972) also used a "stress" measure, reminiscent of Kruskal's (1964a,b) stress (Section 2.2), in which they adjusted the dendrogram levels [i.e., found d*(i, j)] to minimize a squared-difference criterion of the form

Σ_{i<j} [d(i, j) − d*(i, j)]².
Cunningham and Ogilvie (1972) evaluated clustering methods by computing this stress and comparing it with Kendall's tau using two different methods of assigning dendrogram levels. The group-average method seemed to work best while the performance of the single-link method was highly data-dependent. This study stressed the importance of the interaction between the type of input data and the particular clustering method. The small sample sizes and the biases introduced by rigidly fixing the data types limit the applicability of these conclusions. However, this work motivated other empirical comparisons. Blashfield (1976) compared four of the hierarchical methods studied above (single-link, complete-link, group-average, and Ward's method) by their response to randomly generated data, described by a mixture model. Blashfield (1976) chose the number of features, the number of populations, and the numbers of patterns in each population at random. Each population was described by a multivariate Gaussian distribution whose mean vector and correlation structure were also chosen randomly. Fifty data sets were generated this way and the four methods were applied to each, using CLUSTAN 1B (Wishart, 1978) implementations. Euclidean distance was the proximity measure. Agreement between the cluster structure and
the a priori structure was measured by choosing the number of clusters equal to the number of populations and using the kappa statistic,

κ = (P_o − P_c)/(1 − P_c),

where P_o is the observed proportion of agreement in classification and P_c is the proportion of agreement expected by chance. When comparing the clustering methods on the basis of the median of this statistic (over the 50 data sets), Ward's method was clearly the best and single-link clearly the worst in retrieving the structure of the normal mixture of distributions. Blashfield (1976) suggests some strategies for deciding on the adequacy of a classification generated by a clustering method and warns the user to regard the classification generated by clustering methods with great skepticism. The same seven hierarchical clustering methods included in Mojena's (1975) work were studied by Holgersson (1978), who used the CPCC, or cophenetic correlation coefficient (Section 3.3.3), and Kruskal's (1964a,b) stress to measure the amount of agreement between the given proximity matrix and the derived cophenetic matrix. Wilks' lambda statistic (Section 2.2) reflected the amount of clustering in the randomly generated data. The distributions of CPCC and stress were obtained for each clustering method over data that exhibited roughly the same value of Wilks' lambda. The proximity matrices were computed from Euclidean distance between 10 points in one dimension, five points from each of two unit-variance normal distributions with the difference in the means, m_1 − m_2, between 0 and 5.8. The complete-link method was studied more extensively than the other methods. Holgersson (1978) concluded that CPCC and related measures of agreement may be misleading if used to indicate the presence of clustering. We note that CPCC is a measure of global fit (Section 3.5.1). As such, it should reflect the degree to which a hierarchical structure fits data, whether or not the data are clustered, although hierarchical structures should fit clustered data better than nonclustered data. Holgersson (1978) exhibited an inverse relationship between Wilks' lambda, which measures group separation, and CPCC. The simple matching coefficient (Section 2.1) with which Mojena (1975) compared a priori and achieved classifications can distinguish two clusterings, whether or not they have the same number of clusters. Rand (1971) proposed this "C" statistic as the basis for evaluating clustering methods. Rand's (1971) evaluation included the ability of a clustering method to retrieve "natural" clusters, the sensitivity of a method to perturbations in the data and missing individuals, and the differential ability of two methods to establish data structure. Kuiper (1971) and Kuiper and Fisher (1972) compared six hierarchical clustering methods with this statistic.
Dubes and Jain (1976) also used this statistic to compare the performance of eight clustering algorithms on a single data set, following the suggestion of Anderberg (1973). Rohlf (1974) also suggests Rand's coefficient for comparing two clusterings, besides proposing several ways of comparing two dendrograms. Blashfield (1977b) and Aldenderfer (1977) compared clustering algorithms from the unique viewpoint of the user of a clustering package. Blashfield (1977b) input the same data sets to three statistical packages and found the results were different because the definition of Euclidean distance was not consistent among the packages. Some other examples of comparative analysis are provided by Zahl (1977), who compared three measures of density, Paulus (1976), who used randomly generated data to demonstrate that different clustering algorithms require different a priori information, Maronna and Jacovkis (1976), who studied the effects of various metrics on clustering algorithms, and Gower (1967). We conclude this discussion by recommending the four Consumer Reports by Blashfield (1977b,c), Aldenderfer (1977), and Blashfield and Aldenderfer (1977) to the careful user of clustering algorithms. The casual and nondiscriminating user of clustering can easily be misled. The proper interpretation of results requires thorough knowledge of the basis of clustering algorithms and the interplay between proximity measures, clustering algorithms, and data type.
3.6.3 Practical Comparisons
One general conclusion that can be drawn from our review thus far is that no truly objective methodology exists for evaluating clustering methods. How is the potential user to proceed when faced with the task of choosing a clustering method for a "real" data set? Several researchers have responded by suggesting that several clustering methods be used (Everitt, 1979). We review some of these approaches. Psychiatric classification is an area in which a great deal of clustering effort has been expended to establish diagnostic categories in an objective manner. Recent literature calls for quantitative classification techniques and for researchers to become familiar with modern procedures. Morey and Blashfield (1979) provide a list of over 80 studies using cluster analysis in psychiatric classifications. Kendell (1976) decries the use of statistical packages when unaware of the underlying assumptions and considers Cluster Analysis an advance in psychiatric classification. Blashfield and Draguns (1976a,b) describe the characteristics of a good classification system and motivate the application of Cluster Analysis to psychiatry.
Garside and Roth (1978) point out several caveats for the user of statistical methods and suggest that several clustering methods be applied before attempting to draw conclusions. Besides noting that Cluster Analysis is meant to generate hypotheses, not validate them, Garside and Roth (1978) claim that at least 100 patterns are needed to establish two meaningful dimensions in diagnosis and call for Monte Carlo studies on artificial data to evaluate the effectiveness of multivariate methods. Although standard diagnostic categories, such as manic depressive and paranoid schizophrenic, exist, mental health professionals do not agree on the classification of individual patients or on all the characteristics of the standard categories. In psychiatric classification, the true classifications of the patterns (patients) are unknown, but one is willing to admit that some categorization exists that must be objectively established for communication in the field. Mezzich (1978) provided an excellent comparative study of clustering methods on a data base of 88 patterns that were archetypal patients, developed by 22 experienced psychiatrists. Each psychiatrist fabricated four archetypal patients to represent each of four diagnostic categories by providing 17 scores thought to be appropriate for each "patient." The 88 patterns were randomly divided into two sets of 44 patterns each. Ten clustering methods, including both hierarchical and partitional methods, were applied to the resulting 44 × 17 pattern matrices. Three measures of proximity (correlation, Euclidean distance, Manhattan distance) were used. Four clusters were sought in all cases. Besides clustering, multidimensional scaling and Chernoff face analysis (Section 2.2) were applied. Evaluative criteria in Mezzich's (1978) study were external and internal validity, replicability, and inter-rater reliability. External validity is measured by creating a 4 × 4 cluster-by-category table showing the number of patterns from each diagnostic category in each cluster and measuring concordance between clinician and clustering method. The average-linkage hierarchical clustering method with correlation coefficient proximity measure and the K-MEANS and ISODATA clustering methods with Euclidean distance performed best, while single-link performed worst, with concordance percentages ranging from 100% to 52%. The CPCC (Section 3.3.3) tested global fit of clustering to the original structure of the data. None of the results were outstanding. Cluster replicability represents the degree of similarity between the two data sets, each containing 44 patterns, and is measured by computing correlations between the two cluster-by-category tables. The rankings of the clustering methods agreed quite well with the ranking under the external validity measure. The clustering methods were given an overall ranking, combining the above three criteria.
with Euclidean and Manhattan distances, complete-link clustering with correlation coefficient and Manhattan distance, and average-linkage clustering were best. The worst were Wolfe's (1970) NORMAP/NORMIX program and Chernoff's (1973) faces. After averaging across all clustering methods, the correlation coefficient was the best proximity measure, followed by Manhattan and Euclidean distances. Mezzich (1978) concludes by recommending K-MEANS, ISODATA, and complete-link clustering in psychiatric classification. Strauss et al. (1973) reported a similar study in which 100 archetypal patients were fabricated, 20 from each of five diagnostic categories, by rating each patient on 48 symptoms and signs. The 100 patterns were clustered by four different methods, and several ways of preparing the data for clustering were tested. Strauss et al. (1973) emphasized the importance of rationally selecting a clustering program, matching data preparation to the program, and selecting evaluative criteria. A clustering method is stable (Section 3.5.4) if small changes in patterns or features cause only small changes in the clustering. Strauss et al. (1973) suggest comparing the clusterings on randomly chosen subsets of the original pattern matrix and testing stability by removing some of the variables. These techniques are called "shaking the data" and "jackknifing" by Cohen et al. (1977). Several serious applications of clustering methods have included limited comparisons of clustering methods. Each application employs some ad hoc procedure for evaluating the results, and almost all call for empirical comparative studies of clustering methods and for useful indices of Cluster Validity. Strauss (1973) reported on several clustering studies on psychiatric patients and suggested measures of stability. Kiev (1976) used Ward's method to cluster suicide attempters. Williams et al. (1976) applied the complete-link method to patterns representing patients admitted for psychiatric evaluation, based on 39 symptoms, and were able to identify four significant clusters. Carpenter et al. (1976) provided an excellent example of a practical application of clustering methodology to a large data set. Paykel (1971, 1978) applied Ward's method, along with other statistical procedures, to identify three groups of suicide attempters and paid special attention to the stability of the results. All of these studies adopted reasonable, but ad hoc, evaluative criteria. Comparisons of clustering methods on practical data have also been provided by Boyce (1969) and Filsinger and Sauer (1978). Dubes and Jain (1976) compared eight clustering algorithms on a 192-pattern data set from four categories in a character recognition problem. Hubert (1973b) called for "evaluations of the differential adequacy" of clustering methods on large data sets. Hubert (1973a) compared five hierarchical clustering methods, including single-link and complete-link, with an objective
method that tried to minimize the number of "discrepancies." A discrepancy exists if some pair of patterns from different clusters is more similar than some pair from the same cluster. Hubert (1973a) applied all the clustering methods to some small data sets using the number of discrepancies as a criterion and found that the complete-link method was the best, next to the objective function method, and single-link clustering was the worst. Wolfe (1978) clustered 113 occupations by three methods and tried to validate the resulting clusters. We have concentrated on comparative studies of clustering methods because they have received more attention in the literature than other types of comparative studies. Green and Rao (1969) provide an example of a comparative study of proximity measures. Ten proximity measures were computed for a 15-pattern data set to determine which measures were providing redundant information.
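To make the discrepancy criterion concrete, the following sketch is our own illustration, not Hubert's program; the function name and the counting convention are ours, under one reading of the definition (count the between-cluster pairs that are more similar than the least similar within-cluster pair):

```python
import numpy as np

def count_discrepancies(D, labels):
    # D is an n x n symmetric dissimilarity matrix; labels[i] gives the
    # cluster of pattern i.  Count between-cluster pairs whose
    # dissimilarity is smaller than that of the least similar
    # within-cluster pair (one reading of Hubert's "discrepancy").
    n = len(labels)
    i, j = np.triu_indices(n, k=1)        # each pair of patterns once
    same = labels[i] == labels[j]         # True for within-cluster pairs
    worst_within = D[i, j][same].max()    # least similar within-cluster pair
    return int(np.sum(D[i, j][~same] < worst_within))

# Two well-separated one-dimensional clusters:
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])
print(count_discrepancies(D, np.array([0, 0, 0, 1, 1, 1])))  # 0
print(count_discrepancies(D, np.array([0, 0, 1, 1, 0, 1])))  # many
```

The labeling that respects the two obvious groups produces no discrepancies; scrambling the labels produces many.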
3.7 Summary

The three basic activities discussed in this section are clustering tendency, clustering algorithms, and cluster verification. The mathematical tools for describing these activities are graph theory, geometry, and statistics, with the form and scale of the data dictating the models. A bewildering array of clustering algorithms exists for proximity matrices and for patterns in a space. By comparison, procedures for evaluating clustering tendency and for verifying clustering structures are scarce. Considering the number of clustering algorithms that have been implemented and the wide range of applications of clustering, the paucity of quantitative techniques for validating clustering structures is both surprising and alarming. If clustering is ever to be more than a qualitative tool, quantitative techniques for evaluating clustering tendency and cluster validity must be established and become accepted in applications, much as standard multivariate statistical tests are accepted. The quantitative results we have reported are applicable to somewhat specific cases and are sensitive to clustering method and sample size. Outside of the case of single-link clusters formed from rank-order proximity matrices, little is known of the mathematical structure of individual clusters. Even less is known about partitional and hierarchical structures and about clustering tendency. No unified theory has been suggested. Our main conclusion is to warn the potential user of a clustering method that visual evaluation of results can be misleading and that one should not expect a panacea in the form of a single statistic for solving all cluster validity problems. At present, it is more practical to validate a clustering structure by applying several different clustering methods and hoping the
results agree than to objectively evaluate the results of a single algorithm. The main difficulty seems to be the absence of generally applicable definitions of "clustering structure" and of "cluster." Even when such definitions can be stated, deriving the distributions of statistics is usually intractable. Perhaps the broad range of applications precludes the establishment of a general theory that would ease these practical problems. Let the user beware! This succinct statement summarizes our review of comparative analysis. The uninformed and indiscriminate user of clustering is likely to draw false conclusions. The results of clustering algorithms must be viewed skeptically, and more effort should be devoted to validating the results than was devoted to obtaining the clustering. The data type, data scale, normalization, clustering method, and evaluative criteria should be matched. The user should also be prepared to replicate the results. One factor that would enhance communication among users of clustering is the adoption of a common terminology (e.g., Everitt, 1979; Blashfield and Aldenderfer, 1978b). The jargon employed in particular areas of application inhibits the interchange of ideas and forces the needless replication of algorithms.
4. Applications
There is a tremendous number of potential applications of clustering in such diverse fields as engineering, the life sciences, medical sciences, behavioral and social sciences, and information sciences. New applications of clustering are constantly being reported. For example, Chen et al. (1974) apply hierarchical clustering to 90 corporations, Adams et al. (1977) cluster time functions, and Friedman and Rubin (1968) cluster patients. Anderberg (1973) and Hartigan (1975) give long lists of applications of clustering methodology. Our intent here is to illustrate, with the help of a few novel applications, how clustering can aid in data interpretation, along the lines of Fig. 1. It is essential for the user of a clustering algorithm not only to have a thorough understanding of the particular clustering algorithm being utilized, but also to know the details of the data-gathering process and to have some expertise in the subject area in which the clustering methodology is being applied. Since the goal of a clustering method or program is to describe the data, the more information the user has about the data at hand, the more likely the user is to succeed in assessing the true structure of the data. The applications in this section reflect our engineering orientation. Several examples from other areas have been mentioned earlier in this chapter. Section 3.6.3 pays particular attention to the problem of psychiatric classification.
4.1 Workload Characterization of Computer Systems
As computer systems become more pervasive in our society, the need to realistically characterize the expenditure of computer resources becomes apparent. This application of Cluster Analysis tries to accurately model the system workload. Such a model can be used to simulate performance when evaluating computer systems. Each user request, or job, in a computer system can be represented as a pattern vector x_i^T = (x_{i1}, . . . , x_{id}), where x_{ij} represents the amount of resource j specified by user i. The features are the resources, or significant characteristics, of the workload of the computer system. Some examples of these resources are CPU time, disk I/O time, and number of tape assignments. A job can also be characterized in terms of its location in the computer system, timing properties, and priority status. Starting with n user requests, a clustering algorithm investigates the data structure and obtains clusters that represent jobs with similar properties. A density function can be fitted to each cluster to generate synthetic jobs for performance evaluation studies (Sreenivasan and Kleinman, 1974). Agrawala et al. (1976) obtained a workload model for a UNIVAC 1108 using clustering techniques. They collected data for three different time periods and analyzed them separately using a K-MEANS type partitional clustering algorithm. The eight features in their study consisted of CPU time, executive request and control card charges, average number of 512-word core blocks used, number of job steps (programs) executed, wall clock time, I/O to FASTRAND (a drum device) or disk drives, I/O to tape, and I/O to high-speed drum devices. The numbers of patterns corresponding to the three time periods (end-of-semester, mid-semester, and holiday) were 2876, 3028, and 408, respectively. A partitional clustering method was chosen because the data were available in the form of a pattern matrix. The data were scaled prior to clustering so that the relative ranges of the eight features were approximately the same. Scaling is necessary because the units measuring the various features are not directly comparable. As discussed in Section 2.1, the scale of one feature can annihilate the rest of the features when Euclidean distance measures proximity. Agrawala et al. (1976) used as a proximity a generalized distance measure that incorporated the variance of each of the features within each cluster. The clusters of jobs found by Agrawala et al. (1976) were fairly homogeneous, and each cluster was interpretable in terms of the amount of resources required by each job within the cluster. For example, the largest cluster (containing 1428 jobs) in the data corresponding to the mid-semester time period was easily interpreted as consisting of jobs that did
not need any disk I/O. One other cluster contained only large program-development jobs. No statistical test of significance was used in this study. Two-dimensional Kiviat graphs for each of the clusters exhibited the intercluster and intracluster variations in terms of the eight features. The results of this study demonstrate that it is possible to use clustering methodology to characterize the workload of a computer system.
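A minimal sketch of this style of analysis follows; it assumes range scaling and ordinary Euclidean K-MEANS rather than the generalized distance of Agrawala et al. (1976), and the job data and features are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated job log: each row is a job, each column a resource feature
# (say, CPU seconds, core blocks, tape I/O); the units are not comparable.
jobs = rng.lognormal(mean=[1.0, 4.0, 0.5], sigma=0.6, size=(500, 3))

# Scale each feature to the range [0, 1] so that no single feature
# dominates the Euclidean distance (Section 2.1).
lo, hi = jobs.min(axis=0), jobs.max(axis=0)
X = (jobs - lo) / (hi - lo)

def kmeans(X, k, iters=50):
    # Plain K-MEANS: alternate nearest-centroid assignment and
    # recomputation of centroids from the current partition.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(X, k=4)
# Each centroid, mapped back to raw units, summarizes a synthetic job type.
print(centroids * (hi - lo) + lo)
```

A density function fitted within each cluster would then supply the synthetic jobs used in performance evaluation.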
4.2 Digital Image Segmentation

During the past decade, digital image processing has emerged as a major field of application for computers. The development of automated vision systems, the interpretation and analysis of LANDSAT data, the detection of tumors in radiographs, chromosome analysis, and character recognition are some examples where digital image processing techniques have been successfully applied (Rosenfeld, 1976). A digitized picture f(x, y) is a function of two spatial variables; the x and y coordinates are discretized into N levels. The value f(x, y) represents the brightness of the picture at the pixel, or picture element, marked (x, y). The continuous range of brightness levels is also quantized to some finite number of gray levels, G. Thus, a digital image is represented as an N × N matrix containing integer-valued features in the range (1, G). Typical values for N and G are 256 and 64, respectively, so a digital image requires a substantial amount of computer memory. Several types of digitizers are commercially available for converting an image into its digital form (Gonzalez and Wintz, 1977).
The aim of image segmentation is to identify homogeneous and meaningful regions in a given digital image, so Cluster Analysis is clearly needed. Segmenting a digital image leads to data compression and an efficient description of the given image. Some popular techniques for image segmentation include thresholding, edge detection, use of texture information, template matching, and tracking (Rosenfeld and Kak, 1976). The simple technique of thresholding is nothing but clustering in one dimension, namely, the gray level (see the sketch following this paragraph). It was recently extended by Panda and Rosenfeld (1978) to two dimensions (gray level and gradient). Various partitional and hierarchical clustering techniques have been proposed for image segmentation utilizing various types of features (Haralick and Dinstein, 1975; Haralick and Kelly, 1969; Jarvis, 1977a; Backer, 1978; Narendra and Goldberg, 1977; Schachter et al., 1979; Jain et al., 1978; Burr and Chien, 1976; Coleman, 1977).
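The sketch below (our illustration; the synthetic image and the two-means formulation are ours) carries out thresholding as two-cluster K-MEANS on the gray levels, the one-dimensional clustering just described:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 64 x 64 image: a dark background with a brighter square,
# quantized to G = 64 gray levels.
img = rng.normal(15, 4, size=(64, 64))
img[16:48, 16:48] = rng.normal(45, 4, size=(32, 32))
img = np.clip(np.round(img), 0, 63).astype(int)

# Thresholding as clustering in one dimension: two-means on the gray
# levels; the threshold is the midpoint between the two cluster means.
m = np.array([img.min(), img.max()], dtype=float)   # initial cluster means
for _ in range(20):
    t = m.mean()                                    # current threshold
    m = np.array([img[img <= t].mean(), img[img > t].mean()])
segmented = img > m.mean()                          # binary segmentation
print("threshold:", m.mean())
```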
We summarize only a few of these approaches.

i. Jain et al. (1978) use a hierarchical clustering technique to segment muscle-cell pictures. Their segmentation procedure consists of two logical parts. The first part segments the digital image (containing numerous cells) into regions composed of cells or clumps of cells using a number of low-level operations (smoothing, gradient, thresholding, thinning). The cell clumps represent collections of adjacent cells whose boundaries are weak. The second part of the procedure segments the cell clumps into individual cells. A hierarchical clustering algorithm groups together those boundary points of a cell clump that belong to the same globally convex section of the boundary. The segmentation of clumps does not depend on the gray levels internal to the boundary of clumps, but only on the shape of the boundary. We can formally state the clustering problem involved in this application as follows. Let V = (v_1, v_2, . . . , v_n) be an ordered set of n points representing the vertices of a polygonal approximation to the boundary of a clump, such that the line segments forming the clump join v_i to v_{i+1}, i = 1, . . . , n - 1, and v_n to v_1. We would like to partition the set V into subsets that represent polygonal approximations of cells in the clump. Given this set of n points, an n × n dissimilarity matrix is constructed whose entries are based on the Line-Segment-Interior (LI) relation of Shapiro and Haralick (1979). The LI relation is a binary relation on V × V consisting of all pairs of vertices (v_i, v_j) such that the line segment joining v_i to v_j is an interior line segment of V. If the line segment joining v_i and v_j is an interior line segment, then the dissimilarity between v_i and v_j is the length of this segment; otherwise the dissimilarity between v_i and v_j is an arbitrarily large value. The UPGMA hierarchical clustering algorithm (Section 3.3.2) is applied to this dissimilarity matrix. A partition is obtained by cutting the dendrogram at the level that corresponds to the largest difference in the dissimilarity values between two successive joins. This method was applied to several types of clumps containing up to five individual cells. The segmentation results were extremely good. One drawback of this segmentation procedure is that its computational complexity is of order n^3 and, for complex shapes, n must be large to obtain the correct number of components.

ii. Remotely sensed data are commonly described by representing each pixel as a point in a multidimensional pattern space. For example, the multispectral scanners on LANDSAT satellites sense the reflected energy from the earth's surface in four different bands, or wavelengths, from 0.5 μm to 1.1 μm of the electromagnetic spectrum (Lintz and Simonett, 1976). Thus, each pixel (which corresponds to an area of 79 × 79 m on the ground) in a LANDSAT image can be represented in a four-dimensional pattern space. Clustering techniques have been extensively used (Coleman, 1977; Narendra and Goldberg, 1977) to segment these multispectral images and find homogeneous regions that correspond to, say, water, wheat field, and
urban areas. Haralick and Kelly (1969) and Haralick and Dinstein (1975) suggested taking spatial information into consideration: pixels that are close to each other in the spatial domain (x and y coordinates) are more likely to belong to the same cluster than pixels that are far apart. The clustering scheme of Narendra and Goldberg (1977) for LANDSAT data starts by computing a four-dimensional histogram of the four-channel LANDSAT data. This is actually a data compression step, since there are approximately 7.6 million pixels in a LANDSAT frame, 95% of which are accounted for by fewer than 6000 distinct intensity vectors. Therefore, instead of clustering all the available pixels, Narendra and Goldberg (1977) clustered only these distinct intensity vectors. The frequency of occurrence of a particular intensity vector in the histogram was taken as a density estimate, and a valley-seeking clustering algorithm (Koontz et al., 1976) clustered these distinct vectors. The cluster label for each vector was stored in a table that was used to classify the pixels in the image. The effectiveness of this methodology for segmenting LANDSAT images was visually observed by displaying the results on a color monitor where a distinct color was assigned to each cluster. In order to eliminate clusters containing very few pixels, the histogram was smoothed prior to applying the clustering algorithm. This clustering scheme is fairly efficient and took approximately 2 minutes of CPU time on a PDP-10 computer to cluster a 512 × 512 image. While clustering techniques have been extensively used in the analysis and interpretation of multispectral images, they have only recently been applied to segment gray-scale images, where each pixel is represented as a vector of local feature values (Schachter et al., 1979; Backer, 1978; Coleman, 1977). The term "local" means that the features are computed in a small neighborhood, usually 3 × 3, of every pixel. Local features can be computed at every pixel, and the pixels can then be clustered. The local features chosen by Schachter et al. (1979) were the gray level; the mean, median, total variation, and standard deviation of gray levels over a 3 × 3 neighborhood; the Laplacian; and a texture feature. The segmentation based on local feature values was not very good, which Schachter et al. (1979) attributed to the lack of spatial information in the clustering model. They suggest incorporating this information through the use of relaxation techniques (Rosenfeld et al., 1976).
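The data-compression step in the Narendra and Goldberg (1977) scheme is easy to sketch. The fragment below is our illustration, with fabricated four-band pixel data and a placeholder labeling in place of the valley-seeking algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fabricated four-band image: each pixel is a vector of four intensities
# (a small intensity range keeps the example compact).
pixels = rng.integers(0, 8, size=(100_000, 4))

# Data compression: keep only the distinct intensity vectors, taking
# each vector's frequency of occurrence as a density estimate.
vecs, counts = np.unique(pixels, axis=0, return_counts=True)
print(len(pixels), "pixels ->", len(vecs), "distinct intensity vectors")

# The distinct vectors would now be clustered (valley seeking in the
# original scheme); a fabricated labeling stands in for that step.
labels_of_vecs = rng.integers(0, 8, size=len(vecs))
table = {tuple(v): c for v, c in zip(vecs, labels_of_vecs)}  # lookup table
pixel_labels = np.array([table[tuple(p)] for p in pixels])   # classify pixels
```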
4.3 Syntactic Clustering

In syntactic pattern recognition (Fu, 1974) each pattern is represented as a string of "primitives." The placement of primitives in the string is governed by the structural relationship among the primitives in the
patterns, which depends on the category to which the patterns belong. We can associate with each pattern class a grammar that generates all the patterns belonging to that class. One can draw an analogy between a pattern and a sentence, between a feature value and a primitive, and between the structure of a pattern and the syntax of a language. The tools of formal language theory have been extensively used in syntactic pattern recognition. The goal in statistical pattern recognition (Section 2.3) is to classify a pattern into one of several categories; the syntactic approach offers an additional advantage. In attempting to classify a pattern syntactically, we also determine the structure of the pattern, or the manner in which the pattern is constructed from the primitives. The syntactic approach has been successfully used in chromosome analysis, character recognition, and waveform analysis. The selection of features in the statistical approach is analogous to the selection of primitives in the syntactic approach. Most practical pattern recognition problems involve some sort of learning. In the statistical approach, estimating or learning a density function based on a set of samples is a well-understood problem (Duda and Hart, 1973). However, the problem of learning a grammar based on a set of sample strings, called the grammatical inference problem, remains to be solved in a general setting. The complexity of the resulting grammars has limited the applicability of the syntactic approach. Clustering techniques have recently been applied to patterns represented as strings in grammatical inference (Liou and Dubes, 1977) and to partition the available strings so that all the strings corresponding to objects belonging to the same category fall in the same cluster (Fu and Lu, 1977, 1979; Lu and Fu, 1978). In order to cluster strings we must first define a proximity measure that determines the syntactic difference between pairs of strings. The WLD (Weighted Levenshtein Distance; Okuda et al., 1976) is the same as the distance measure proposed by Wagner and Fischer (1974) and has been extensively used as a measure of string dissimilarity. The WLD is based on the number of edit operations needed to change one string into another. Fu and Lu (1977) and Lu and Fu (1978) applied partitional clustering to 51 strings representing nine different English characters. In order to convert each character into a string, the character was first digitized on a 20 × 20 grid, four primitives (line segments with different orientations) were extracted from the digitized picture, and Shaw's (1969) picture description language (PDL) was used for the string representation. The dissimilarity measures in their clustering were both the weighted and the unweighted Levenshtein distance. The WLD resulted in perfect clustering: nine clusters were obtained, each containing strings corresponding to only one pattern class. Recently, Fu and Lu (1979) proposed a
modification of the WLD measure that is invariant with respect to the size and orientation of patterns. Liou and Dubes (1977) analyzed the structure of a given set of strings by clustering them based on the WLD. Their clustering algorithm consisted of generating the minimum spanning tree (MST) and removing the inconsistent edges from the MST. While Zahn's (1971) notion of inconsistent edges depends only on the dissimilarity measure, Liou and Dubes (1977) used additional structural relations such as length and substring relations. The resulting clusters controlled the construction of initial and candidate grammars, which substantially reduced the number of grammars that needed to be examined, as compared with the usual procedure of constructing a grammar for each string and merging them in an ad hoc fashion. The cluster analysis of strings also suggested recursive productions for the inferred grammar.
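A sketch of the unweighted Levenshtein distance follows (our illustration; the WLD assigns a cost to each edit operation, whereas here every operation costs 1):

```python
def levenshtein(a, b):
    # Minimum number of substitutions, insertions, and deletions needed
    # to change string a into string b, computed by dynamic programming.
    prev = list(range(len(b) + 1))             # distances a[:0] -> b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                              # distance a[:i] -> b[:0]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # delete ca
                           cur[j - 1] + 1,     # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

# Strings of PDL-style primitives (letters stand for line segments):
print(levenshtein("abab", "abb"))   # 1: delete one primitive
print(levenshtein("aaaa", "bbbb"))  # 4: substitute every primitive
```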
4.4 Summary

Cluster Analysis is a new and useful tool for exploratory data analysis that is especially appropriate when little is known about the data. The recent literature exhibits a broad spectrum of fields of application. A data set is explored by imposing several clustering structures on it and interpreting the results. Many factors, such as the choice of proximity index, normalization, outlier identification, interactions among patterns and among features, sample size, and the choice of algorithmic parameters, can have profound and unexpected effects on the output of a particular clustering algorithm, so close attention to details is necessary. Everitt (1979) summarizes many of the current problems in Cluster Analysis. A bewildering array of clustering algorithms exists and more appear regularly, derived from a variety of viewpoints and in several disciplines. We have categorized algorithms, embedded them in a methodological framework, examined computational requirements, and tried to highlight practical aspects of using them. We conclude with two recommendations. If one searches long enough, a clustering algorithm can probably be found whose results the experimenter likes. Clustering algorithms will find clusters whether or not the clusters are inherent in the data, as the sketch below demonstrates. These facts motivated our emphasis on cluster validity and objective verification, clustering tendency, and comparative analysis. Unless one can have confidence in a clustering structure, Cluster Analysis cannot be a quantitative tool for data analysis. We recommend that measures of clustering tendency and cluster validity be carefully chosen before the data are clustered.
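The point is easy to demonstrate; the sketch below (our illustration) runs two-means on uniform random data and dutifully reports two "clusters":

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 2))      # no inherent cluster structure

# Two-means always returns a partition, structure or not.
c = X[rng.choice(len(X), 2, replace=False)]
for _ in range(50):
    labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    c = np.array([X[labels == j].mean(axis=0) for j in range(2)])

print("cluster means:", c)                   # look convincingly separated
print("cluster sizes:", np.bincount(labels))
```

Only a validity measure, not the algorithm itself, can reveal that such a partition is spurious.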
Cluster Analysis should be a supplement to other methods of data analysis, not an alternative. Successful data analysis calls for a blending of Cluster Analysis with appropriate techniques such as Multidimensional Scaling, Intrinsic Dimensionality, Pattern Recognition, and classical statistical methods. We recommend that the serious data analyst be conversant with as broad a range of data analysis techniques and programs as possible, and be aware of the assumptions on which the techniques are based.

ACKNOWLEDGMENTS

We wish to thank the National Science Foundation and the Division of Engineering Research at Michigan State University for funding our research and Dr. Harry G. Hedges, Chairman of the Computer Science Department at Michigan State University, for providing technical and moral support for this chapter. We are grateful to several graduate students, past and present, who programmed algorithms and enthusiastically endured earlier versions of parts of this chapter, particularly Dr. Thomas Bailey, Gautam Biswas, James Coggins, George Cross, Stephen Smith, and Neal Wyse. Finally, we thank Brenda S. Minott for her cheerfulness, accuracy, and speed in typing this chapter. This work was partially supported by NSF Grant ENG 76-11936A01.
REFERENCES

Adams, J. M., DelPresto, P. V., Anne, A., and Attinger, E. O. (1977). Techniques in consolidation, characterization, and expression of physiologic signals in the time domain. Proc. IEEE 65, 689-696.
Alagar, V. S. (1976). The distribution of the distance between random points. J. Appl. Probability 13, 558-566.
Aldenderfer, M. S. (1977). A consumer report on cluster analysis software. 2. "Hierarchical Methods," Report on NSF Grant DCR-74-20007. Department of Psychology, Pennsylvania State University, University Park.
Anderberg, M. R. (1973). "Cluster Analysis for Applications." Academic Press, New York.
Anderson, T. W. (1958). "An Introduction to Multivariate Statistical Analysis." Wiley, New York.
Andrews, D. F. (1972). Plots of high-dimensional data. Biometrics 28, 125-136.
Agrawala, A. K., Mohr, J. M., and Bryant, R. M. (1976). An approach to the work-load characterization problem. Computer 9, 18-32.
Attinger, E. O., Henry, D. T., Attinger, F. M. L., Adams, J. M., and Anne, A. (1978). Biological control hierarchies. Part I. Simulation by a three-level, ten-component adaptive model. IEEE Trans. Syst., Man Cybern. smc-8, 11-19.
Backer, E. (1978). "Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets." Delft University Press, Delft, The Netherlands.
Bailey, T. A. (1978). Cluster validity and intrinsic dimensionality. Ph.D. Thesis, Michigan State University, Department of Computer Science, East Lansing.
Baker, F. B. (1974). Stability of two hierarchical grouping techniques. Case 1. Sensitivity to data errors. J. Am. Stat. Assoc. 69, 440-445.
Baker, F. B. (1978). Sensitivity of the complete-link clustering technique to missing individuals. J. Educ. Stat. 3, 233-252.
Baker, F. B., and Hubert, L. J. (1975). Measuring the power of hierarchical cluster analysis. J. Am. Stat. Assoc. 70, 31-38.
Baker, F. B., and Hubert, L. J. (1976). A graph-theoretic approach to goodness-of-fit in complete-link hierarchical clustering. J. Am. Stat. Assoc. 71, 870-878.
Ball, G. H. (1965). Data analysis in the social sciences: What about the details? AFIPS Fall Joint Comput. Conf., pp. 533-560.
Barnes, G. R. (1976). Fuzzy sets and cluster analysis. Proc. Int. Joint Conf. Pattern Recognition, 3rd, 1976, pp. 371-375.
Bartels, P. H., Bahr, G. F., Calhoun, D. W., and Wied, G. L. (1970). Cell recognition by neighborhood grouping techniques in TICAS. Acta Cytol. 14, 313-324.
Bartlett, M. S. (1975). "The Statistical Analysis of Spatial Pattern." Chapman & Hall, London.
Bennett, R. S. (1969). The intrinsic dimensionality of signal collections. IEEE Trans. Inf. Theory it-15, 517-525.
Bentler, P. M., and Weeks, D. G. (1978). Restricted multidimensional scaling models. J. Math. Psychol. 17, 138-151.
Bentley, J. L., and Friedman, J. H. (1978). Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Trans. Comput. c-27, 97-105.
Besag, J., and Diggle, P. J. (1977). Simple Monte Carlo tests for spatial pattern. Appl. Stat. 26, 327-333.
Bezdek, J. C. (1974). Cluster validity with fuzzy sets. J. Cybern. 3, 58-73.
Bezdek, J. C. (1976). A physical interpretation of fuzzy ISODATA. IEEE Trans. Syst., Man Cybern. smc-6, 387-389.
Bhat, M. V., and Haupt, A. (1976). An efficient clustering algorithm. IEEE Trans. Syst., Man Cybern. smc-6, 61-64.
Binder, D. A. (1978). Bayesian cluster analysis. Biometrika 65, 31-38.
Blashfield, R. K. (1976). Mixture model tests of cluster analysis: Accuracy of four agglomerative hierarchical methods. Psychol. Bull. 83, 377-388.
Blashfield, R. K. (1977a). Equivalence of three statistical packages for performing hierarchical cluster analysis. Psychometrika 42, 429-431.
Blashfield, R. K. (1977b). A consumer report on cluster analysis software. 3. "Iterative Partitioning Methods," Report on NSF Grant DCR-74-20007. Department of Psychology, Pennsylvania State University, University Park.
Blashfield, R. K. (1977c). A consumer report on cluster analysis software. 4. "Useability," Report on NSF Grant DCR-74-20007. Department of Psychology, Pennsylvania State University, University Park.
Blashfield, R. K., and Aldenderfer, M. S. (1977). A consumer report on cluster analysis software. 1. "Cluster Analysis Methods and Their Literature," Report on NSF Grant DCR-74-20007. Department of Psychology, Pennsylvania State University, University Park.
Blashfield, R. K., and Aldenderfer, M. S. (1978a). Cluster analysis literature on validation. Pap., 9th Annu. Meet. Classification Soc.
Blashfield, R. K., and Aldenderfer, M. S. (1978b). Literature on cluster analysis. Multivar. Behav. Res. 13, 271-295.
Blashfield, R. K., and Draguns, J. G. (1976a). Toward a taxonomy of psychopathology: The purpose of psychiatric classification. Br. J. Psychiatry 129, 574-583.
Blashfield, R. K., and Draguns, J. G. (1976b). Evaluative criteria for psychiatric classification. J. Abnorm. Psychol. 85, 140-150.
Boyce, A. J. (1969). Mapping diversity: A comparative study of some numerical methods. In "Numerical Taxonomy" (A. J. Cole, ed.), pp. 1-31. Academic Press, New York.
Burr, D. J., and Chien, R. T. (1976). The minimal spanning tree in visual data segmentation. Proc. Int. Joint Conf. Pattern Recognition, 3rd, 1976, pp. 519-523.
Butler, G. A. (1969). A vector field approach to cluster analysis. Pattern Recognition 1, 291-299.
Calvert, T. W. (1970). Nonorthogonal projections for feature extraction. IEEE Trans. Comput. c-19, 447-452.
Carpenter, W. T., Bartko, J. J., Carpenter, C. L., and Strauss, J. S. (1976). Another view of schizophrenic subtypes. Arch. Gen. Psychiatry 33, 508-516.
Carroll, J. D., and Wish, M. (1971). Measuring preference and perception with multidimensional models. Bell Lab. Rec. pp. 147-154.
Casey, R. G., and Nagy, G. (1971). Advances in pattern recognition. Sci. Am. 233, No. 4, 56-71.
Chang, C. L., and Lee, R. C. T. (1973). A heuristic relaxation method for nonlinear mapping in cluster analysis. IEEE Trans. Syst., Man Cybern. smc-3, 197-200.
Chen, C. K., and Andrews, H. C. (1974). Nonlinear intrinsic dimensionality computations. IEEE Trans. Comput. c-23, 178-184.
Chen, H. J., Gnanadesikan, R., and Kettenring, J. R. (1974). Statistical methods for grouping corporations. Sankhya, Ser. B 34, Part 1, 1-28.
Chernoff, H. (1973). Using faces to represent points in k-dimensional space graphically. J. Am. Stat. Assoc. 68, 361-368.
Chernoff, H. (1978). "Graphical Representations as a Discipline," Rep. ADIA-056-633. U.S. Dept. of Commerce, Washington, D.C.
Cleveland, W. S., and Relles, D. A. (1975). Clustering by identification with special application to two-way tables of counts. J. Am. Stat. Assoc. 70, 626-630.
Clifford, H. T., and Stephenson, W. (1975). "An Introduction to Numerical Classification." Academic Press, New York.
Cohen, A., Gnanadesikan, R., Kettenring, J. R., and Landwehr, J. M. (1977). Methodological developments in some applications of clustering. In "Applications of Statistics" (P. R. Krishnaiah, ed.), pp. 141-162. North-Holland Publ., Amsterdam.
Cole, A. J., ed. (1969). "Numerical Taxonomy." Academic Press, New York.
Coleman, G. B. (1977). "Image Segmentation by Clustering," Tech. Rep. No. 750. Image Processing Institute, University of Southern California, Los Angeles, California.
Coleman, R. (1969). Random paths through convex bodies. J. Appl. Probl. 6, 430-441.
Conover, W. J. (1971). "Practical Nonparametric Statistics." Wiley, New York.
Cooley, W. W., and Lohnes, P. R. (1971). "Multivariate Data Analysis." Wiley, New York.
Cormack, R. M. (1971). A review of classification. J. R. Stat. Soc., Ser. A 134, 321-367.
Cramer, H. (1945). "Mathematical Models of Statistics." Princeton Univ. Press, Princeton, New Jersey.
Cunningham, J. P., and Shepard, R. N. (1974). Monotone mapping of similarities into a general metric space. J. Math. Psychol. 11, 335-363.
Cunningham, K. M., and Ogilvie, J. C. (1972). Evaluation of hierarchical grouping techniques: A preliminary study. Comput. J. 15, 209-213.
Curry, D. J. (1976). Some statistical considerations in clustering with binary data. Multivar. Behav. Res. 11, 175-188.
D'Andrade, R. G. (1978). U-statistic hierarchical clustering. Psychometrika 43, 59-67.
Davies, D. L., and Bouldin, D. W. (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. pami-1, 224-227.
Day, W. H. E. (1977). Validity of clusters formed by graph-theoretic cluster methods. Math. Biosci. 36, 299-317.
Day, W. H. E. (1978). Classification and specification of flat cluster methods. Math. Biosci. 41, 1-19.
Defays, D. (1977). An efficient algorithm for a complete link method. Comput. J. 20, 364-366.
Diggle, P. J. (1979). On parameter estimation and goodness-of-fit testing for spatial point patterns. Biometrics 35, 87-101.
Dobson, A. J. (1974). Unrooted trees for numerical taxonomy. J. Appl. Probl. 11, 32-42.
Dorofeyuk, A. A. (1971). Automatic classification algorithms. Autom. Remote Control (Engl. Transl.) 32, 1928-1958.
Dubes, R., and Jain, A. K. (1976). Clustering techniques: The user's dilemma. Pattern Recognition 8, 247-260.
Dubes, R., and Jain, A. K. (1978). Models and methods in cluster validity. Proc. IEEE Conf. Pattern Recognition Image Process. pp. 148-155.
Dubes, R., and Jain, A. K. (1979). Validity studies in clustering methodologies. Pattern Recognition 11, 235-254.
Duda, R. O., and Hart, P. E. (1973). "Pattern Classification and Scene Analysis." Wiley, New York.
Dunn, J. C. (1974). Well-separated clusters and optimal fuzzy partitions. J. Cybern. 4, 95-104.
Duran, B. S., and Odell, P. L. (1974). "Cluster Analysis: A Survey." Springer-Verlag, Berlin and New York.
Edwards, A. W. F., and Cavalli-Sforza, L. L. (1965). A method for cluster analysis. Biometrics 21, 362-375.
Eigen, D. J., Fromm, F. R., and Northouse, R. A. (1974). Cluster analysis based on dimensional information with applications to feature selection and classification. IEEE Trans. Syst., Man Cybern. smc-4, 284-294.
Eisenstein, B. A., and Fehlauer, J. (1976). Declustering the minimal spanning tree. In "Milwaukee Symposium on Automatic Computation and Control," pp. 137-141.
Engelman, L., and Hartigan, J. A. (1969). Percentage points of a test for clusters. J. Am. Stat. Assoc. 64, 1647-1648.
Enslein, K., Ralston, A., and Wilf, H. S. (1977). "Statistical Methods for Digital Computers." Wiley, New York.
Erdos, P., and Renyi, A. (1959). On random graphs. Publ. Math. Debrecen 6, 290-297.
Erdos, P., and Renyi, A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5, 17-61.
Erdos, P., and Renyi, A. (1961). On the strength of connectedness of a random graph. Acta Math. Acad. Sci. Hung. 12, 261-267.
Estabrook, G. F. (1966). A mathematical model in graph theory for biological classification. J. Theor. Biol. 12, 297-310.
Everitt, B. (1974). "Cluster Analysis." Wiley, New York.
Everitt, B. S. (1978). "Graphical Techniques for Multivariate Data." North-Holland Publ., New York.
Everitt, B. S. (1979). Unresolved problems in cluster analysis. Biometrics 35, 169-181.
Everitt, B. S., and Nicholls, P. (1975). Visual techniques for representing multivariate data. Statistician 24, 37-49.
Farris, J. S. (1969). On the cophenetic correlation coefficient. Syst. Zool. 18, 279-285.
Fehlauer, J., and Eisenstein, B. A. (1978). Structural editing by a point density function. IEEE Trans. Syst., Man Cybern. smc-8, 362-370.
Ferguson, T. S. (1967). "Mathematical Statistics-A Decision Theoretic Approach." Academic Press, New York.
Fillenbaum, S., and Rapoport, A. (1971). "Structures in the Subjective Lexicon." Academic Press, New York.
Filsinger, E., and Sauer, W. J. (1978). Empirical typology of adjustment to aging. J. Gerontol. 33, 437-445.
Fisher, L., and Van Ness, J. W. (1971). Admissible clustering procedures. Biometrika 58, 91-104.
Fleiss, J. L., and Zubin, J. (1969). On the methods and theory of clustering. Multivar. Behav. Res. 4, 235-250.
Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21, 768 (abstr.).
Fortier, J. J., and Solomon, H. (1966). Clustering procedures. In "Multivariate Analysis" (P. R. Krishnaiah, ed.), pp. 493-506. Academic Press, New York.
Friedman, H. P., and Rubin, J. (1967). On some invariant criteria for grouping data. J. Am. Stat. Assoc. 62, 1159-1178.
Friedman, H. P., and Rubin, J. (1968). The logic of the statistical methods. In "The Borderline Syndrome" (R. Grinker, B. Werble, and R. Drye, eds.), pp. 53-72. Basic Books, New York.
Friedman, J. H., and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. c-23, 881-890.
Fu, K. S. (1968). "Sequential Methods in Pattern Recognition and Machine Learning." Academic Press, New York.
Fu, K. S. (1974). "Syntactic Methods in Pattern Recognition." Academic Press, New York.
Fu, K. S., and Lu, S. Y. (1977). A clustering procedure for syntactic patterns. IEEE Trans. Syst., Man Cybern. smc-7, 734-742.
Fu, K. S., and Lu, S. Y. (1979). A size normalization and pattern orientation problem in syntactic clustering. IEEE Trans. Syst., Man Cybern. smc-9, 55-58.
Fu, K. S., and Rosenfeld, A. (1976). Pattern recognition and image processing. IEEE Trans. Comput. c-25, 1336-1346.
Fukunaga, K. (1972). "Introduction to Statistical Pattern Recognition." Academic Press, New York.
Fukunaga, K., and Ando, M. (1977). The optimum nonlinear features for a scatter criterion in discriminant analysis. IEEE Trans. Inf. Theory it-23, 453-459.
Fukunaga, K., and Olsen, D. R. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Comput. c-20, 176-183.
Fukunaga, K., and Short, R. D. (1978). Generalized clustering for problem localization. IEEE Trans. Comput. c-27, 176-181.
Garside, R. F., and Roth, M. (1978). Multivariate statistical methods and problems of classification in psychiatry. Br. J. Psychiatry 133, 53-67.
Gitman, I., and Levine, M. D. (1970). An algorithm for detecting unimodal fuzzy sets and its application as a clustering technique. IEEE Trans. Comput. c-19, 583-593.
Gnanadesikan, R., Kettenring, J. R., and Landwehr, J. M. (1977). Interpreting and assessing the results of cluster analysis. Bull. Int. Stat. Inst. 47, 451-463.
Gonzalez, R. C., and Thomason, M. G. (1978). "Syntactic Pattern Recognition." Addison-Wesley, Reading, Massachusetts.
Gonzalez, R. C., and Wintz, P. (1977). "Digital Image Processing." Addison-Wesley, Reading, Massachusetts.
Good, I. J. (1965). Categorization of classification. In "Mathematics and Computer Science in Biology and Medicine," pp. 115-128. HM Stationery Office, London.
Good, I. J. (1977). The botryology of botryology. In "Classification and Clustering" (J. Van Ryzin, ed.), pp. 73-94. Academic Press, New York.
Goodall, D. W. (1964). A probabilistic similarity index. Nature (London) 203, 1098.
Goodall, D. W. (1966). A new similarity index based on probability. Biometrics 22, 882-907.
Goodman, L. A., and Kruskal, W. H. (1954). Measures of association for cross-classifications. J. Am. Stat. Assoc. 49, 732-764.
Gordon, A. D., and Henderson, J. T. (1977). Algorithm for Euclidean sum of squares classification. Biometrics 33, 355-362.
Gowda, K. C., and Krishna, G. (1978). Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recognition 10, 105-112.
Gower, J. C. (1967). A comparison of some methods of cluster analysis. Biometrics 23, 623-637.
Gower, J. C., and Ross, G. J. S. (1969). Minimum spanning trees and single-linkage cluster analysis. Appl. Stat. 18, 54-64.
Green, P. E., and Carmone, F. J. (1970). "Multidimensional Scaling and Related Techniques." Allyn & Bacon, Boston, Massachusetts.
Green, P. E., and Rao, V. R. (1969). A note on proximity measures and cluster analysis. J. Market. Res. 6, 359-364.
Guttman, L. (1968). A general nonmetric technique for finding the smallest configuration of points. Psychometrika 33, 469-506.
Hall, A. V. (1967). Methods for demonstrating resemblance in taxonomy and ecology. Nature (London) 214, 830-831.
Hall, D. J., Duda, R. O., Huffman, D. A., and Wolf, D. E. (1973). "Development of New Pattern Recognition Methods." Aerospace Research Laboratories (AD-772614), Wright Patterson AFB, Ohio.
Hamdan, M. A., and Tsokos, C. P. (1971). An information measure of association. Inf. Control 19, 174-179.
Hammersley, J. M. (1950). The distribution of distance in a hypersphere. Ann. Math. Stat. 21, 447-452.
Haralick, R. M., and Dinstein, I. (1975). A spatial clustering procedure for multi-image data. IEEE Trans. Circuits Syst. 22, 440-450.
Haralick, R. M., and Kelly, G. L. (1969). Pattern recognition with measurement space and spatial clustering for multiple images. Proc. IEEE 57, 654-665.
Harding, E. F., and Kendall, D. G. (1974). "Stochastic Geometry." Wiley, New York.
Harman, H. H. (1967). "Modern Factor Analysis." Univ. of Chicago Press, Chicago, Illinois.
Hartigan, J. A. (1975). "Clustering Algorithms." Wiley, New York.
Hartigan, J. A. (1978). Asymptotic distributions for clustering criteria. Ann. Stat. 6, 117-131.
Henrichon, E. G., and Fu, K. S. (1969). A nonparametric partitioning procedure for pattern classification. IEEE Trans. Comput. c-18, 614-624.
Holgersson, M. (1978). The limited value of cophenetic correlation as a clustering criterion. Pattern Recognition 10, 287-295.
Howarth, R. J. (1973). Preliminary assessment of a nonlinear mapping algorithm in a geological context. Math. Geol. 5, 39-57.
Hubert, L. J. (1973a). Monotone invariant clustering procedures. Psychometrika 38, 47-62.
Hubert, L. J. (1973b). Min and max hierarchical clustering using asymmetric similarity measures. Psychometrika 38, 63-72.
Hubert, L. J. (1974a). Some applications of graph theory to clustering. Psychometrika 39, 283-309.
Hubert, L. J. (1974b). Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures. J. Am. Stat. Assoc. 69, 698-704.
Hubert, L. J., and Baker, F. B. (1977). The comparison and fitting of given classification schemes. J. Math. Psychol. 16, 233-253.
Hubert, L. J., and Busk, P. (1976). Normative location theory: Placement in continuous space. J. Math. Psychol. 14, 187-210.
Hubert, L. J., and Schultz, J. (1975). Hierarchical clustering and the concept of space distortion. Br. J. Math. Stat. Psychol. 28, 121-135.
Hubert, L. J., and Schultz, J. (1976). Quadratic assignment as a general data-analysis strategy. Br. J. Math. Stat. Psychol. 29, 190-241.
Huizinga, D. (1978). "Are There Any Clusters?" Tech. Rep. 78-2. Behavioral Research Institute, Boulder, Colorado.
Jain, A. K., and Dubes, R. (1978). Feature definition in pattern recognition with small sample size. Pattern Recognition 10, 85-97.
Jain, A. K., Smith, S. P., and Backer, E. (1978). "Segmentation of Muscle-Cell Pictures," Tech. Rep. TR-78-01. Department of Computer Science, Michigan State University, East Lansing.
Janowitz, M. F. (1978). Order theoretic model for cluster analysis. SIAM J. Appl. Math. 34, 55-72.
Jardine, C. J., Jardine, N., and Sibson, R. (1967). The structure and construction of taxonomic hierarchies. Math. Biosci. 1, 173-179.
Jardine, N., and Sibson, R. (1971). "Mathematical Taxonomy." Wiley, New York.
Jarvis, R. A. (1977a). "Computer Image Segmentation: First Partition Using Shared Near Neighbor Clustering," Tech. Rep. TR-EE-77-43. School of Electrical Engineering, Purdue University, West Lafayette, Indiana.
Jarvis, R. A. (1977b). "Shared Near Neighbor Maximal Spanning Trees for Cluster Analysis," Tech. Rep. TR-EE-77-45. School of Electrical Engineering, Purdue University, West Lafayette, Indiana.
Jarvis, R. A., and Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. c-22, 1025-1034.
Jensen, R. E. (1969). A dynamic programming algorithm for cluster analysis. Oper. Res. 17, 1034-1057.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254.
Johnston, B., Bailey, T., and Dubes, R. (1979). A variation on a nonparametric clustering method. IEEE Trans. Pattern Anal. Mach. Intell. pami-1, 400-408.
Kamenskii, V. S. (1977). Methods and models of nonmetric multidimensional scaling. Autom. Remote Control 38, 1212-1243.
Kaminuma, T., Takekawa, T., and Watanabe, S. (1969). Reduction of clustering problems to pattern recognition. Pattern Recognition 1, 195-205.
Kanal, L. N. (1972). Interactive pattern analysis and classification systems: A survey and commentary. Proc. IEEE 69, 1200-1215.
Kanal, L. N. (1974). Patterns in pattern recognition. IEEE Trans. Inf. Theory it-20, 697-722.
Karonski, M. (1973). On a definition of cluster and pseudocluster for multivariate normal populations. Bull. Int. Stat. Inst. 45, 593-598.
Katz, J. O., and Rohlf, F. J. (1973). Function-point cluster analysis. Syst. Zool. 22, 295-301.
Kelly, F. P., and Ripley, B. D. (1976). A note on Strauss's model for clustering. Biometrika 63, 357-360.
Kemeny, J. G., and Snell, J. L. (1962). "Mathematical Models in the Social Sciences." Ginn, Boston, Massachusetts.
Kendall, M. G. (1966). Discrimination and classification. In "Multivariate Analysis" (P. R. Krishnaiah, ed.). Academic Press, New York.
Kendall, M. G. (1973). The basic problems of cluster analysis. In "Discriminant Analysis and Applications" (T. Cacoullos, ed.), pp. 179-191. Academic Press, New York.
Kendell, R. E. (1976). The classification of depressions: A review of contemporary confusion. Br. J. Psychiatry 129, 15-28.
Kiev, A. (1976). Cluster analysis profiles of suicide attempters. Am. J. Psychiatry 133, 150-153.
Kittler, J. (1976). A locally sensitive method for cluster analysis. Pattern Recognition 8, 22-33.
Kittler, J. (1979). Comments on "single-link characteristics of mode-seeking algorithm." Pattern Recognition 11, 71-73.
Klahr, D. (1969). A Monte Carlo investigation of the statistical significance of Kruskal's nonmetric scaling procedure. Psychometrika 34, 319-330.
Koch, V. L. (1976). Computer program for clustering large matrices. Educ. Psychol. Meas. 36, 175-178.
Koontz, W. L. G., and Fukunaga, K. (1972a). A nonlinear feature extraction algorithm using distance transformation. IEEE Trans. Comput. c-21, 56-63.
Koontz, W. L. G., and Fukunaga, K. (1972b). A non-parametric valley-seeking technique for cluster analysis. IEEE Trans. Comput. c-21, 171-178.
Koontz, W. L. G., Narendra, P. M., and Fukunaga, K. (1975). A branch and bound clustering algorithm. IEEE Trans. Comput. c-23, 908-914.
Koontz, W. L. G., Narendra, P. M., and Fukunaga, K. (1976). A graph-theoretic approach to nonparametric cluster analysis. IEEE Trans. Comput. c-25, 936-944.
Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1-27.
Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika 29, 115-129.
Kruskal, J. B. (1971). Comments on a nonlinear mapping for data structure analysis. IEEE Trans. Comput. c-20, 1614.
Kruskal, J. B. (1972). Linear transformation of multivariate data to reveal clustering. In "Multidimensional Scaling" (R. N. Shepard, A. K. Romney, and S. B. Nerlove, eds.), Vol. 1, pp. 179-191. Seminar Press, New York.
Kruskal, J. B. (1977). Multidimensional scaling and other methods for discovering structure. In "Statistical Methods for Digital Computers" (K. Enslein, A. Ralston, and H. S. Wilf, eds.), pp. 296-339. Wiley, New York.
Kruskal, J. B., and Carroll, J. D. (1969). Geometrical models and badness-of-fit functions. In "Multivariate Analysis II" (P. R. Krishnaiah, ed.), pp. 639-671. Academic Press, New York.
Kruskal, J. B., and Hart, R. E. (1966). A geometric interpretation of diagnostic data from a digital machine. Bell Syst. Tech. J. 45, 1299-1338.
Kruskal, J. B., and Shepard, R. N. (1974). A nonmetric variety of linear factor analysis. Psychometrika 39, 123-157.
Krzanowski, W. J. (1979). Some exact percentage points of a statistic useful in analysis of variance and principal component analysis. Technometrics 21, 261-263.
Kuiper, F. K. (1971). A Monte Carlo comparison of six clustering procedures. M.S. Thesis, University of Washington, Seattle.
Kuiper, F. K., and Fisher, L. D. (1972). Monte-Carlo comparison of six clustering procedures. Biometrics 28, 1162-1163.
Lachenbruch, P. A., and Goldstein, M. (1979). Discriminant analysis. Biometrics 35, 69-85.
Lance, G. N., and Williams, W. T. (1967a). Mixed-data classificatory programs. I. Aust. Comput. J. 1, 15-20.
Lance, G. N., and Williams, W. T. (1967b). A general theory of classificatory sorting strategies. II. Clustering systems. Comput. J. 10, 271-277.
Lance, G. N., and Williams, W. T. (1971). A note on a new divisive classificatory program for mixed data. Comput. J. 14, 154-155.
Lee, R. C. T., Slagle, J. R., and Blum, H. (1977). A triangulation method for the sequential mapping of points from N-space to two-space. IEEE Trans. Comput. c-26, 288-292.
Lefkovitch, L. P. (1978). Cluster generation and grouping using mathematical programming. Math. Biosci. 41, 91-110.
Levine, D. M. (1977). Nonmetric multidimensional scaling and hierarchical clustering procedure for investigations of perception of sports. Res. Q. 48, 341-348.
Levine, D. M. (1978). Monte Carlo study of Kruskal's variance based on measure of stress. Psychometrika 43, 307-315.
Lindman, H., and Caelli, T. (1978). Constant curvature Riemannian scaling. J. Math. Psychol. 17, 89-109.
Ling, R. F. (1972). On the theory and construction of k-clusters. Comput. J. 15, 326-332.
Ling, R. F. (1973a). Probability theory of cluster analysis. J. Am. Stat. Assoc. 68, 159-164.
Ling, R. F. (1973b). The expected number of components in random linear graphs. Ann. Probability 1, 876-881.
Ling, R. F. (1975). An exact probability distribution on the connectivity of random graphs. J. Math. Psychol. 12, 90-98.
Ling, R. F., and Killough, G. S. (1976). Probability tables for cluster analysis based on a theory of random graphs. J. Am. Stat. Assoc. 71, 293-300.
Lingoes, J. C. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika 36, 195-203.
Lingoes, J. C., and Cooper, T. (1971). PEP-I: A FORTRAN IV program for Guttman-Lingoes nonmetric probability clustering. Behav. Sci. 16, 255-261.
Lintz, J., and Simonett, D. S. (1976). "Remote Sensing of Environment." Addison-Wesley, Reading, Massachusetts.
Liou, J. I., and Dubes, R. C. (1977). "A Constructive Method for Grammatical Inference Based on Clustering," Tech. Rep. TR77-01. Department of Computer Science, Michigan State University, East Lansing.
Lord, R. D. (1954). The distribution of distance in a hypersphere. Ann. Math. Stat. 25, 794-798.
Lu, S. Y., and Fu, K. S. (1978). A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst., Man Cybern. smc-8, 381-389.
MacCallum, R. C. (1974). Relations between factor analysis and multidimensional scaling. Psychol. Bull. 81, 505-516.
MacCallum, R. C., and Cornelius, E. T. (1977). A Monte Carlo investigation of recovery of structure by ALSCAL. Psychometrika 42, 401-428.
McClain, J. O., and Rao, V. R. (1974). "Tradeoffs and Conflicts in Evaluation of Health Services Alternatives: Methodology for Analysis," pp. 35-52. Health Services Research, Spring, 1974.
McClain, J. O., and Rao, V. R. (1975). CLUSTSIZ: A program to test for the quality of clustering of a set of objects. J. Market. Res. 12, 456-460.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proc. Berkeley Symp. Math. Stat. Probability, 5th, 1967, pp. 281-297.
McQuitty, L. L. (1967). A mutual development of some typological theories and pattern analytic methods. Educ. Psychol. Meas. 27, 21-46.
McQuitty, L. L. (1971). A comparative study of some selected methods of pattern analysis. Educ. Psychol. Meas. 31, 607-626.
McQuitty, L. L., and Frary, J. M. (1971). Reliable and valid hierarchical classification. Educ. Psychol. Meas. 31, 321-346.
McQuitty, L. L., and Koch, V. L. (1975a). A method for hierarchical clustering of a matrix of a thousand by a thousand. Educ. Psychol. Meas. 35, 239-254.
McQuitty, L. L., and Koch, V. L. (1975b). Highest entry hierarchical clustering. Educ. Psychol. Meas. 35, 751-766.
McQuitty, L. L., and Koch, V. L. (1976). Highest column entry hierarchical clustering: A redevelopment and elaboration of elementary linkage analysis. Educ. Psychol. Meas. 36, 243-258.
Magnuski, H. S. (1975). Minimal spanning tree clustering method. Commun. ACM 18, 119.
Maronna, R., and Jacovkis, P. M. (1976). Multivariate clustering procedures with variable metrics. Biometrics 30, 499-505.
Marriot, F. H. C. (1971). Practical problems in a method of cluster analysis. Biometrics 27, 501-514.
Matula, D. W. (1977). Graph theoretic techniques for cluster analysis algorithms. In "Classification and Clustering" (J. Van Ryzin, ed.), pp. 95-129. Academic Press, New York.
Mezzich, J. E. (1978). Evaluating clustering methods for psychiatric diagnosis. Biol. Psychiatry 13, 265-281.
Milligan, G. W. (1978). "An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms," Working Paper Series. College of Administrative Science, Ohio State University, Columbus.
Mojena, R. (1975). Hierarchical grouping methods and stopping rules: An evaluation. Comput. J. 20, 359-363.
Morey, L. C., and Blashfield, R. K. (1979). Cluster analysis and the classification of alcoholism. Pap., 10th Annu. Meet. Classification Soc.
Mountford, M. D. (1970). A test of the difference between clusters. In "Statistical Ecology" (G. P. Patil, E. C. Pielou, and W. Waters, eds.), pp. 237-257. Pennsylvania State Univ. Press, University Park.
Mucciardi, A. N., and Gose, E. E. (1972). An automatic clustering algorithm and its properties in high-dimensional spaces. IEEE Trans. Syst., Man Cybern. smc-2, 247-254.
Narendra, P. M., and Goldberg, M. (1977). A non-parametric clustering scheme for LANDSAT. Pattern Recognition 9, 207-215.
Niemann, H., and Weiss, J. (1979). A fast converging algorithm for nonlinear mapping of high-dimensional data to a plane. IEEE Trans. Comput. c-28, 142-147.
Okuda, T., Tanaka, E., and Kasai, T. (1976). A method for the correction of garbled words based on the Levenshtein metric. IEEE Trans. Comput. c-25, 172-178.
Orford, J. D. (1976). Implementation of criteria for partitioning a dendrogram. Math. Geol. 8, 75-85.
Page, R. L. (1974). Minimal spanning tree clustering method. Commun. ACM 17, 321-323.
EXPLORATORY DATA ANALYSIS
225
Panda, D. P., and Rosenfeld, A. (1978). Image segmentation by pixel classification in (gray level, gradient) space. IEEE Trans. Comput. c-27, 387-389. Parzen, E. ( 1960). “Modern Probability Theory and Its Applications.” Wiley, New York. Patrick, E. A. (1972). “Fundamentals of Pattern Recognition.” Rentice-Hall. Englewood Cliffs, New Jersey. Paulus, E. ( 1976). A comparative study on automatic discrimination between significant and nonsignificant results in cluster analysis. I n “COMPSTAT 1976” (J. Gordesch and P. Naere, eds.), pp. 97-104. Physica-Verlag, Wien. Paykel, E. S. (1971). Classification of depressed patients-cluster analysis derived grouping. Er. J. Psychiatry 118, 275-288. Paykel, E. S. ( 1978). Classification of suicide attempters by cluster-analysis. Er. J . Psychiatry 133, 45-52. Peay, E. R. (1975a). Nonmetric grouping and cliques. Psychometrika 40, 297-313. Peay, E. R. (1975b). Grouping by cliques for directed relationships. Psychometrika 40, 573-574. Pettis, K.,Bailey, T., Jain, A. K., and Dubes, R. (1979). An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Pattern Anal. Mach. Infell. pami-I, 25-37. Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell Sysr. Tech. J . 36, 1389-1401. Pykett, C. E. (1978). Improving the efficiency of Sammon’s nonlinear mapping by using clustering archetypes. Electron. Lett. 14, 799-800. Ramsay, J . 0. (1977). Maximum likelihood estimation in multidimensional scaling. Psychometrika 42, 241-266. Ramsay, J. 0. ( 1978). Confidence regions for multidimensional scaling analysis. Psychometrika 43, 145-160. Rand, W. M . (1971). Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846-850. Rao, M. R . (1971). Cluster analysis and mathematical programming. J. A m . Stat. Assoc. 66, 622-626. Rapoport, A,, and Fillenbaum, S. (1972). An experimental study of semantic structures. I n “Multidimensional Scaling” (A. K. Romney, R. N. Shepard, and S. B. Nerlove, eds.), Vol. 2, pp. 93-13], Seminar Press, New York. Raven, P. H., Berlin, B., and Breedlove. D. E. (1971). The originsoftaxonomy. Science 174, 1210-1213. Riddell, R. J., and Uhlenbeck, G. E. (1953). On the theory of the virial development of the equation of state of monoatomic gases. J . Chem. Phys. 21, 2056-2064. Ripley, B. D., and Silverman, B. W. (1978). Quick tests for spatial interaction. Eiometrika 65, 641-642. Rohlf, F. J. (1963. A randomization test of the nonspecificity hypothesis in numerical taxonomy. Taxon 14, 263-267. Rohlf, F. J. (1970). Adaptive hierarchical clustering schemes. Syst. Zool. 19, 58-82. Rohlf, F. J . (1973). Hierarchical clustering using minimum spanning tree. Comput. J. 16, 93-95. Rohlf, F. J . (1974). Methods of comparing classifications. Annu. Rev. Ecol. Sysr. 5 , 101113. Rohlf, F. J., and Fisher, D. R. (1968). Tests for hierarchical structure in random data sets. Sysr. Zool. 17, 407-412.
RICHARD DUBES AND A. K. JAlN Romney, A. K., Shepard, R. N., and Nerlove, S. B., eds. (1972). “Multidimensional Scaling,’’ Vol. 2. Seminar Press, New York. Rosenfeld, A. ( 1976). “Digital Picture Analysis.” Springer-Verlag, Berlin and New York. Rosenfeld, A., and Kak, A. C. (1976). “Digital Picture Processing.” Academic Press, New York. Rosenfeld, A., Zucker, S. W., and Hummel, R. A. (1976). Scene labelling by relaxation operations. IEEE Trans. Syst., Man Cybern. smc-6, 420-433. Ross, G. J. S. (1968). Classification techniques for large sets of data. In “Numerical Taxonomy” (A. J. Cole, ed.), pp. 224-229. Academic Press, New York. Rubin, J. (1967). Optimal classification into groups: An approach for solving the taxonomy problem. J. Theor. Biol. 15, 103-144. Ruspini, E. H. (1973). New experimental results in fuzzy clustering. fnJ Sci. 6, 273-284. Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. fEEE Trans. Comput. C-18, 401-409. Saunders, R., and Funk, G. M. (1977). Poisson limits for a clustering model of Strauss. J. Appl. Probt. 14, 776-784. Schachter, B. J., Davis, L. S., and Rosenfeld, A. (1979). Some experiments in image segmentation by clustering of local feature values. Pattern Recognition 11, 19-28. Schultz, J. R., and Hubert, L. J. (1973). Data-analysis and connectivity of random graphs.J. Math. Psychol. 10, 42 1-428. Schwartzmann, D. H . , and Vidal, J. J. (1975). An algorithm for determining the topological dimensionality of point clusters. IEEE Trans. Comput. c-24, 1175-1 182. Sclove, S. L. ( 1977). Population mixture-models and clustering algorithms. Commun. Stat.-Theory Methods, A 6, 417-434. Scott, A. J., and Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biomettics 21, 387-397. Sebestyen, G. S. (1962). “Decision-Making Processes in Pattern Recognition.” Macmillan, New York. Sebestyen, G. S., and Edie, J. (1966). An algorithm for non-parametric pattern recognition. IEEE Trans. Electron. Comput. 15, 908-915. S h d e r , E., Dubes, R., and Jain, A. K. (1979). Single-link characteristics of a mode-seeking algorithm. Pattern Recognition 11, 65-73. Shanley, R. J., and Mahtab, M. A. (1976). Delineation and analysis of clusters in orientation data. Math. Geol. 8, 9-23. Shapiro, L. G., and Haralick, R. M.(1979). Decomposition of two-dimensional shapes by graph-theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. pami-1, 10-20. Shaw, A. C. (1969). A formal picture description scheme as a basis for picture processing system. Inf. Control 14, 9-52. Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychornetrika 27, 125-140. Shepard, R. N. (1974). Representation of structure in similarity data-problems and prospects. Psychometrika 39, 373-421. Shepard, R. N., and Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychol. Rev. 86, 87-123. Shepard, R. N., and Carroll, J. D. (1966). Parametric representation of nonlinear data structures. I n “Multivariate Analysis” (P.R. Krishnaiah, ed.), pp. 561-592. Academic Press, New York. Shepard, R. N., Romney, A. K., and Nerlove, S. B., eds. (1972). “Multidimensional Scaling,” Vol. 1. Seminar Press, New York. Sibson, R. (1971). Some observations on a paper by Lance and Williams. Comput. J . 14, 156157.
EXPLORATORY DATA ANALYSIS
227
Sibson, R. ( 1973). SLINK-Optimally efficient algorithm for single-link cluster method. Comput. J . 16, 30-34. Silverman. B., and Brown, T. (1978). Short distances, flat triangles and Poisson limits. J . Appl. Probabiliry 15, 815-825. Simon, J. C. (1978). Some current topics in clustering in relation with pattern recognition. Proc. Int. Joint Conf. on Pattern Recognition, 41h, 1978 pp. 19-29. Slagle, J. R., Chang, C . L., and Lee, R. C. T. (1974). Experiments with some cluster analysis algorithms. Pattern Recognition 6 , 181-187. Smith, S. P., and Dubes, R. C. (1979). “The Stability of Hierarchical Clustering,” Tech. Rep. TR-79-02. Computer Science Department, Michigan State University, East Lansing. Sneath, P. H. A. ( 1969). Evaluation of clustering methods. In “Numerical Taxonomy” (A. J. Cole, ed.), pp. 257-267. Academic Press, New York. Sneath, P. H. A. (1977). Method for testing distinctness of clusters-test of disjunction of 2 clusters in Euclidean space as measured by their overlap. Math. Geol. 9, 123-143. Sneath, P. H. A,, and Sokal, R. R. (1973). “Numerical Taxonomy.” Freeman, San Francisco, California. Sokal, R. R. ( 1974). Classification: Purposes, principles, progress, prospects. Science 185, 1 115-1 123. Sreenivasan, K., and Kleinman, A. J. (1974). On the construction of a representative synthesis workload. Commun. A C M 17, 127-133. Srivastava, J. N. ( 1973). An information function approach to dimensionality analysis and curved manifold clustering. In “Multivariate Analysis 111” (P. R. Krishnaiah, ed.), pp. 369-382. Academic Press, New York. Stenson, H. H., and Knoll, R. L. (1969). Goodness of fit for random rankings in Kruskal’s nonmetric scaling procedure. Psycho/. Bull. 71, 122-126. Strauss, D. J. (1975). Model for clustering. Biometrika 62, 467-475. Strauss, D. J. (1977). Clustering on colored lattices. J . Appl. Probability 14, 135-143. Strauss, J. S. (1973). Classification by cluster analysis. I n “Report of the International Pilot Study of Schizophrenia,” Vol. 1, pp. 336-359. World Health Organ., Geneva. Strauss, J. S ., Bartko, J. J., and Carpenter, W. J. (1973). The use of clustering techniques for the classification of psychiatric patients. Er. J . Psvchiutry 122, 531-540. Thompson, H. K., Jr., and Woodbury, M. A. (1970). Clinical data representation in multidimensional space. Cornput. Biomed. R e s . 3, 58-73. Torgerson, W. S. (1958). “Theory and Methods of Scaling.” Wiley, New York. Torn, A. 1. (1977). Cluster analysis using seed points and density-determined hyperspheres as an aid to global optimization. IEEE Trans. Syst., Man Cybern. smc-7, 610-616. Tou, J. T., and Heydorn, R. P. (1967). Some approaches to optimum feature extraction. I n “Computer and Information Sciences-11” (J. T. Tou, ed.), pp. 57-89. Academic Press. New York. Toussaint, G. T. ( 1975). Subjective clustering and bibliography of books on pattern recognition. Inf. Sci. 8, 251-257. Trunk, G. V. (1976). Statistical estimation of the intrinsic dimensionality of a noisy signal collection. IEEE Trans. Cornput. c-25, 165-171. Tryon, R. C., and Bailey, D. E . (1970). “Cluster Analysis.” McGraw-Hill, New York. Turner, M. E. (1969). Credibility and cluster. Ann. N.Y. A c a d . Sci. 161, 680-688. Tzeng, D. C. S . , and Landis, D. (1978). Three-mode multidimensional-scaling with points of view solutions. Multivar. Eehav. R e s . 13, 181-213. VanRijsbergen, C. J. (1970). A clustering algorithm. Cornput. J. 13, 113-1 15. Van Ryzin, J., ed. (1977). “Classification and Clustering.” Academic Press. New York.
RICHARD DUBES AND A.
K. JAlN
Vinod, H. D. (1969). Integer programming and theory of grouping. J . Am. Stat. Assoc. 64, 506-519. Wagner, R. A., and Fischer, M. J. (1974). The string-to-string correction problem. J. Assoc. Comput. Mach. 21, 168-173. Ward, J. H., Jr. (1%3). Hierarchical grouping to optimize an objective function. J . Am. Stat. ASSOC.58, 236-244. Watanabe, S., Lambert, P. F.,Kulikowski, C. A., and Buxton, J. L. (1967). Evaluation and selection of variables in pattern recognition. In “Computer and Information Science-11” (J. T. Tou, ed.), pp. 91-122. Academic Press, New York. White, R. F., and Lewinson, T. M. (1977). Probabilistic clustering for attributes of mixed type with biopharmaceutical applications. J . Am. Stat. Assoc. 72, 271-277. Wiks, S. S. (1963). “Mathematical Statistics.” Wiley, New York. Williams, G . W., Barton, G. M., White, A. A., and Won, H. (1976). Cluster analysis applied to symptom ratings of psychiatric patients: An evaluation of its predictive ability. Er. J . Psychiatry 129, 178-1856. Williams, W. T., and Lance, G . N. (1977). Hierarchical classificatory methods. In “Statistical Methods for Digital Computers” (K. Enslein, A. Ralston, and H. S. Wilf, eds.), pp. 269-295. Wiley, New York. Williams, W. T., Clifford, H. T., and Lance, G. N. (1971). Group-size dependence: A rationale for choice between numerical classifications. Cumput. J . 14, 157-162. Wirth, M., Estabrook, G. F., and Rogers, D. J. (1966). A graph theory model for systematic biology with an example for the oncidiinae (orchidaceae). Sysr. Zoo/. 15, 59-69. Wishart, D. (1%9). Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In “Numerical Taxonomy” (A. J. Cole, ed.), pp. 282-308. Academic Press, New York. Wishart, D. (1978). “CLUSTAN User Manual.” Program Library Unit, Edinburgh University, Scotland. Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivar. Eehav. Rex. 5, 329-350. Wolfe, J. H. (1978). Comparative cluster analysis of patterns of vocational interest. Multivar. Eehav. Res. 13, 33-44. Wong, A. K . C., and Liu, T. S. (1977). A decision-directed algorithm for discrete data. IEEE Trans. Cornput. c-26, 75-82. Wright, W. E. (1973). A formalization of cluster analysis. Pattern Recognition 5 , 273-282. Young, F. W. ( 1970). Nonmetric multidimensional scaling: Recovery of metric information. Psychornetrika 35, 455-473. Zadeh, L. A. (1965). Fuzzy sets. Znf. Control 8, 338-353. Zahl, S. ( 1977). A comparison of three methods for the analysis of spatial pattern. Biornetrics 33, 681-692. Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing Gestalt clusters. IEEE Trans. Comput. c-20, 68-86.
Numerical Software: Science or Alchemy?

C. W. GEAR

Department of Computer Science, University of Illinois, Urbana, Illinois
1. Introduction . . . 229
2. What Is Numerical Software and Why Is It Difficult? . . . 232
3. The Science . . . 241
4. The Alchemy . . . 245
5. Conclusion . . . 247
References . . . 248
1. Introduction
“Numerical Software” is a term that is used rather liberally today to describe a range of activities. This article addresses the questions: “Is there anything being done under the heading Numerical Software that was not done in past years when we just called it programming?” “Are those things being done important?” and “Are they a science?” First, this article examines the nature of numerical software and then discusses what is particularly difficult about it. The third section briefly examines the science behind such software and the areas where that science does not help us. Numerical software production is viewed by many people either as a routine programming task or as a by-product of that dull subject, numerical analysis, which itself falls somewhere between mathematics and computer science, too applied for the one and too irrelevant to the other. However, we want to show that there is a significant difference between the concern and approach of either a numerical analyst or of a programmer, on the one hand, and a person who writes numerical software, on the other. We will call the latter person a “numerician,” for want of a better name.
By and large, most big numerical codes are not written by numerical analysts, or even by computer scientists, but by engineers, physicists, and other large computer users. These people tend to underestimate the difficulty of producing reliable code, but in spite of this, or perhaps because of it, they have been responsible for most of the important methods that have been developed in the past (for example, most integration methods, the relaxation method, and the finite element method). When these people had a real problem to solve, they could not afford to be deterred by minor mathematical difficulties, so they invented new methods. There has been a tendency for numerical analysts and computer scientists to ignore or disparage the accomplishments of the writers of large problem-oriented packages (“just hack programming”), although many major developments in our field have started with hack programming; the structured, polished programs and proofs have appeared much later for distribution and publication. Many of the large codes utilize a fine blend of “engineering insight” and applicable theory to obtain results that could not be obtained by a numerical analyst or computer scientist. (In fact, one of the great difficulties facing us is how to codify such “insight” so that it can be applied to the development of general purpose methods and packages.) Part of the challenge of numerical software is to produce codes that can be embedded within large packages to handle standard operations such as the solution of differential equations. Although some modern numerical software is far more reliable than the corresponding sections of the large problem-oriented packages, these packages do not generally use library software because the latter is insufficiently flexible to be tailored to a particular class of applications and still retain efficiency. Unfortunately, many of us get too involved in our own theoretical interests to produce useful codes for the large body of users. We produce computer science-trained programmers who are more interested in clever garbage collection than in avoiding generating functional garbage in the first place, or numerical analysts who hibernate in Hilbert space. It is certainly not true that all numerical code written by the user has desirable properties. Vast numbers of small problems are “solved” every day by ordinary users in ways that are not only painful to the theoretician, but which are wrong sufficiently often that we should be concerned. Regrettably, much of this code has found its way into computer libraries in the past, although not only is it not suitable for a public library, it is often too crude for most adult bookstores! The wrong answers arise because, as Shampine¹ has pointed out, crude numerical methods are often not adequate for solving crude numerical models.
¹ L. Shampine, Sandia Laboratory, Albuquerque, New Mexico. This and many other comments in this paper are taken from talks by others. In the case that they have not appeared in print, we will not give an actual reference, but will acknowledge the source of the wisdom.
Whereas a user faced with a very large job invests time in trying numerous methods and at least making an empirical choice, a user faced with a small job tends to choose the first simple method that appears to work (that is, one which gives an answer close to that expected). These results are incorrect for precisely those problems that are of more than normal interest to the user, namely those problems that have an unexpected answer because of behavior not detected by a crude method. Thus the small user has a great need for reliable software that will solve large classes of problems. Efficiency is not as important for small problems because the human time presenting the problem to the computer is the more expensive resource. The desire for reliability is obvious; the need for broad applicability is less obvious but equally important, because, if a user must select from a large set of codes on the basis of the characteristics of the methods and their application to the problem at hand, the choice made will most likely be wrong. It is unreasonable to expect a user to understand large bodies of knowledge in areas other than his or her expertise, which is not usually numerical analysis and computer science. We see that some of the attributes needed in numerical software are reliability, efficiency, and broad applicability. Is there any reason why this type of code cannot be, or is not, written by numerical analysts? In a sense it is because the scientists/engineers/programmers are the people who could be said to be the true numerical analysts: they practice numerical analysis in the sense that they analyze problems numerically. However, the term “numerical analysis” has come to mean something different, the study of numerical methods themselves, and that is what today’s numerical analyst does, mostly studying the methods originally developed by the engineer or scientist. Today’s numerical analyst does not write code (if it can be avoided); that is a job for a programmer. But not to write code is to ignore a very important step in the two-stage activity we find in almost all intellectual endeavor: analysis and synthesis. In this case, we must analyze classes of problems and classes of methods, but before the result is of any utility, the result of the analysis must be synthesized into a computer code. A numerician is a person interested in both of these activities. The analysis is the science and the synthesis is the art or alchemy. The result will be numerical software, if properly done. Numerical software must also strive to be completely automatic. Let us compare the situation with the telephone system. Forty years ago the projection “everybody will be a telephone operator in thirty years” was made on the basis of the growth of the system. In the last decade we made the same projection about programming: “Everybody will be a programmer in twenty years.” The fact of the matter is that everybody is a telephone operator today, at least in the western world. Almost everybody connects their own calls.
Of course, they use a language that is high level compared to the actual connections that must be made. In the same way, everyone will be a programmer very shortly, but they will be using languages that are very high level compared to what computer scientists think of as high level. These languages will allow access to the software tools that solve a range of problems automatically. The job of the numerician is to synthesize codes that can solve numerical problems automatically. (However, this is not to suggest that the solution to the sorts of problems which concern us is the design of yet another language. Far from it. Too many people are fond of designing new languages, languages which have wonderful structures for handling the “n + ½ loop” problem. What we have is not the “n + ½ loop” problem, but the “n + ½ language” problem. We already have n languages, and there is always another ridiculous proposal being made.)

2. What Is Numerical Software and Why Is It Difficult?
The previous remarks apply equally well to any form of software if we substitute “computer scientist” for “numerical analyst.” What is special about numerical software other than that it deals with numbers? Primarily, numerical software must tolerate errors. The word “error” is an unfortunate one because numerical errors are not errors in the usual sense of the word, but differences between approximations that can be computed in a finite length of time and the true solution. Wilkinson² relates an incident in which he was visiting a university to give a talk. At dinner the previous night he found himself sitting next to another guest of the university, a bishop. Opening conversation, the bishop enquired the subject of Wilkinson’s talk. “Error,” replied Wilkinson. “That’s a coincidence,” replied the bishop, “that’s my topic also, but I call it sin.” Numerical error is unavoidable; it is not sin. Numerical software has two principal characteristics:

i. It deals with approximations to real numbers.
ii. It is usable on a range of computers that have different approximation capabilities.

The fact that it deals with approximations to the real numbers causes the dimension of all aspects of software production to increase. Nonnumerical software deals with finite or countable sets, be they numbers or not. In it we are concerned with the design and analysis of an algorithm.
² J. Wilkinson, National Physical Laboratory, Teddington, England.
Execution time studies are done on the algorithm. Memory space studies are done on the algorithm. Program proofs are prepared for the algorithm. The algorithm either “fits” into the computer (i.e., the range of integers and memory space is adequate), or it does not. If it fits, the analysis of the algorithm carries over to the behavior of the computer program. In numerical software, the characteristics of the algorithm are only one of the set of problems to be studied; the behavior of the actual implementation of the algorithm on a computer must also be analyzed. The fact that computers differ in their treatment of floating-point numbers means that we must be concerned with classes of computers. As Battiste³ has pointed out, two decades ago we concerned ourselves mainly with the algorithm, a decade later we were also concerned about its embodiment on a particular computer, while today we are concerned about its embodiment on classes of computers in classes of languages. These classes of computers have different numeric ranges, different precisions, and different round-off properties. Numerical software is written to work in these varying environments. It must be sensitive to the precision of the computer so that it does not try to achieve more accuracy than is possible on a particular machine. It must be sensitive to the range of numbers so that it can avoid unnecessary overflows and underflows. It must be aware of the peculiarities of round off. It must also work around the difficulties of many computer languages. A good example of this is one given by Cody (1971) in which he needed to compute 1.0 − X without any unnecessary rounding error. If X is in the range 0.5 to 1.0, this can be done without error on a computer that uses a guard digit during addition/subtraction. If not, it can be done by forming (0.5 − X) + 0.5. However, most optimizing compilers will “improve” this for the user, back to the original form. This can be circumvented in some systems by coding T ← 0.5 − X and R ← T + 0.5, but a slightly better optimizing compiler will still oblige with the improvement. In many cases, the compiler can be outsmarted by adding a statement label to the second statement. (I hope pure computer scientists will pardon such constructs.) However, if the compiler does a flow analysis we are foiled again. Worse yet, we may get an error message UNNECESSARY STATEMENT LABEL ON LINE nnn.

³ E. Battiste, IMSL, Houston, Texas.
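The same dodge can be written today without the statement-label trick. The sketch below is ours, not Cody’s original code: it assumes a C compiler, and it uses the volatile qualifier (rather than a label) to keep the optimizer from folding (0.5 − X) + 0.5 back into 1.0 − X.

```c
#include <stdio.h>

/* A sketch of Cody's (0.5 - X) + 0.5 trick, assuming X lies in [0.5, 1.0].
   The volatile qualifier stands in for Gear's statement label: it stops an
   optimizing compiler from rewriting the expression as 1.0 - X. */
static double one_minus(double x)
{
    volatile double t = 0.5 - x;   /* exact when 0.5 <= x <= 1.0 */
    return t + 0.5;
}

int main(void)
{
    printf("1 - 0.7 = %.17g\n", one_minus(0.7));
    return 0;
}
```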
(Sometimes it seems that the system programmer cannot leave well enough alone. I am reminded of the story about the priest, lawyer, and system programmer waiting to go to the guillotine. The priest went first, was asked whether he wished to lie face up or face down, and chose face up to look at heaven. The blade fell and miraculously stopped a millimeter from his neck, so he went free. The lawyer went next, and, unwilling to break precedent, chose to lie the same way. Again the blade stopped a millimeter short. The system programmer went last, and chose face up because he was interested in examining the mechanism. “Ah,” he said, “I see the problem. The rope isn’t on the pulley correctly.” The other viewpoint is seen in the probably true story of a programmer working for an aircraft company some twenty years ago. Observing that the square-root routine was happily returning a value even when the argument was negative, he changed it so it would trap and warn the user. The inevitable result was that programs which had previously worked “perfectly” now failed. Needless to say, the decision of the management was to restore the routine to its previous state in which it gave no unpleasant suggestion that all might not be well.)

Numerical software can be particularly difficult to design because, even before precision, range, and round-off problems are considered, many numerical tasks are, in a sense, unsolvable. It is no use telling the user that the problem is theoretically impossible; it has to be “solved.” After the solution process a numerical program can come to one of three outcomes: answers correct to within the tolerance requested or expected by the user; a statement by the program that the task is impossible; or answers not within tolerance. The latter are more commonly called wrong answers. Ideally, we do not want wrong answers, and many users are prepared to ask that they not occur. However, the best we can usually ask is that we minimize the frequency of occurrence of wrong answers, even at the expense of telling the user more frequently that the job is impossible. It is not true that “some answer is better than none.” In most cases, a wrong answer is much worse than no answer. Suppose the user thereby makes a wrong decision; it is clearly better not to build a plane than to build one that will not fly. Why is it so difficult for numerical algorithms to distinguish between possible and impossible cases? In some instances it is not; it depends on the type of problem. The types of numerical problems can be classified according to three criteria: data properties, algorithm properties, and round-off error properties. The first subdivision is on the basis of the initial data provided by the user to specify the problem. These data may take the form (in ascending order of difficulty for the computation):

1. A set of isolated values, such as the coefficients of a system of linear equations or the argument to a sine function.
2. Symbolic data, such as the specification of a differential equation.
3. A “black box” program that will compute any specific value of a function for specified values of its arguments.
The first two forms of data give a complete specification of the problem,
although considerable (nonnumeric) manipulation may be required if the data are in the second form. If the data are in the third form, we must accept that absolute reliability is impossible without additional assumptions. For example, if we write a program to compute the value of the integral of a function f(x) by forming a weighted sum of values of the function f for various values of its argument x, we can only sample f at a finite number of points, say x₁, x₂, . . . , xₙ. We can substitute another function g(x) that is identical to f(x) at these points, and the integral of g will be “identical” (numerically) to that of f. For example, let

g(x) = f(x) + M(x − x₁)²(x − x₂)² ⋯ (x − xₙ)²

for an arbitrarily large constant M. Clearly the integral of g is arbitrarily larger than that of f.
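This blindness of any finite sampling rule is easy to demonstrate. The fragment below is our own illustration (the function names and the choice of eight equal subintervals are ours, not the text’s): a crude equal-weight rule returns the same value for f and for a g that agrees with f at every sample point.

```c
#include <math.h>
#include <stdio.h>

static const double PI = 3.14159265358979323846;

static double f(double x) { (void)x; return 1.0; }  /* true integral over [0,1] is 1 */

static double g(double x)               /* agrees with f at every sample x = i/8    */
{
    double s = sin(8.0 * PI * x);       /* vanishes at each x = i/8                 */
    return f(x) + 1.0e6 * s * s;        /* true integral is about 5.0e5, not 1      */
}

static double equal_weight_rule(double (*fn)(double))
{
    double sum = 0.0;
    for (int i = 0; i <= 8; i++)
        sum += fn(i / 8.0);             /* sample only at the nine grid points      */
    return sum / 9.0;
}

int main(void)
{
    printf("estimate for f: %g\n", equal_weight_rule(f));  /* prints 1 */
    printf("estimate for g: %g\n", equal_weight_rule(g));  /* also 1   */
    return 0;
}
```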
The second subdivision is on the basis of the relation between the algorithm and the problem, and is independent of round-off error problems. It leads to the breakdown into four groups:

A. The problem can be solved by a finite sequence of calculations in real number arithmetic (that is, infinite precision arithmetic). Linear equation solution is an example of this.
B. The problem cannot be solved exactly in a finite sequence of arithmetic operations, but there exists one or more finite sets of steps for which everything needed to complete an error analysis and obtain error bounds is known. An example of this is a program for cosine.
C. Error bounds can be given in terms of unknown characteristics. These characteristics are values such as the bounds on derivatives that cannot be computed. Examples of this include the solution of the ordinary differential equation y′ = f(y) and the solution of the nonlinear equation f(x) = 0, where f is a function given by another program. Note that in these examples there is an assumption that certain derivatives exist and are bounded. These assumptions are usually correct, but cannot be verified.
D. All known proofs of error bounds depend on assumptions that not only cannot be verified, but that are often not true. An example is a program for partial differential equations. Most theories rely on linearity or small deviations from linearity, but practical problems that do not satisfy these assumptions are often solved.

Group A problems are trivial at the algorithm level, as there is no analytical problem to consider. Neither do group B problems cause difficulties at the algorithm level; a code can be designed using a number of steps determined by the accuracy needed.
This is not to say that the implementation of group A and B problems on an actual computer is trivial; there are still the problems of precision, range, and round off to consider. (These are discussed for a variety of functions in the previously cited paper of Cody.) However, the difficulties are soluble. Group C problems are the first to exhibit serious difficulties. Much software allows the user to request answers within a given error tolerance. However, the error bounds that can be computed depend on unknown quantities. Although these quantities can be estimated, and that is what numerical software does, there can be no guarantee that these estimates are correct. Consequently, a wrong answer is always a possibility. Since a key objective of numerical software is reliability, the most the numerician can hope to do is to keep the probability of wrong answers low. If instead we can “get hold” of the function by requiring the user to specify form 2 data, we might be able to move the problem into group A or B. Alternatively, it is sometimes possible to convert a group C problem to one in group B by requiring the user to provide additional information. For example, in solving the equation f(x) = 0 we could also ask the user to provide a subroutine that will give a bound on the derivative of f(x) over a range of values of x. In that case, we can compute error bounds (if round-off error is ignored) and can write programs that make statements such as “there are no roots of f(x) = 0 in the range specified,” as the sketch below illustrates.
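A minimal sketch of how such a statement can be certified, under our own assumptions rather than anything from the text: the user supplies a bound LBOUND on |f′| over the interval, round-off error is ignored, and any root would have to lie within half a grid spacing of some sample, so sample values everywhere larger than LBOUND·d/2 exclude roots.

```c
#include <math.h>
#include <stdio.h>

static double f(double x) { return x * x + 1.0; }  /* no real roots                     */
#define LBOUND 4.0                                 /* user's bound on |f'| over [-2, 2] */

int main(void)
{
    double a = -2.0, b = 2.0;
    int    n = 10000;
    double d = (b - a) / n;                        /* grid spacing                      */
    int    no_root = 1;

    for (int i = 0; i <= n; i++) {
        double x = a + i * d;
        /* a root within d/2 of x would force |f(x)| <= LBOUND * d / 2 */
        if (fabs(f(x)) <= LBOUND * d / 2.0) { no_root = 0; break; }
    }
    puts(no_root ? "there are no roots of f(x) = 0 in [-2, 2]"
                 : "a root cannot be excluded");
    return 0;
}
```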
It is only recently that numerical software has been attempted for problems in group D, both because of the great difficulty in providing much reliability, and because it is still difficult to know how to handle many aspects of such problems. For example, many partial differential equations must be solved in regions with very irregular boundaries. These give difficulties both in specification and in numerical treatment. Consequently, we find a number of software packages on the market at the moment, each suited to a particular combination of equation types and boundary conditions. See, for example, the Ellpack project (Rice, 1977). Future advances in mathematical understanding may cause some group D problems to move to group C. The third criterion by which we can categorize problems is their dependence on round-off errors. This leads to a subdivision into the types:

a. Those in which a priori bounds can be computed on the effects of round-off errors. This occurs, for example, if we wish to compute a cosine over a limited range of its argument.
b. Those in which we can compute bounds on the effects of round-off errors once we know the data for a particular problem. Linear equations are a case of this if we want to determine the error in the answer.
c. Those in which no bounds can be specified on the effects of round-off errors. This usually arises with problems in groups C and D because the effect of the propagation of round-off error is dependent on unknown characteristics of the actual problem.
I have talked about the “effects” of round-off error without being very specific. By “effect,” most users mean the change to the answer. Bounding the change to an answer is called forward error analysis because it computes the effect of the error as it propagates forward with the solution process. In this type of analysis the error at the end of the calculation will depend on the way in which it is amplified or reduced by the problem and the solution process. If the problem is such that small errors are amplified greatly, the problem is called ill conditioned, because small changes in the initial data, whether by error or user perturbation, cause large changes in the answers. An example of an ill-conditioned problem is the problem of computing the trajectory of a rocket with no guidance system fired from earth and aimed at Mars. A very small error will send it past Mars and probably into the sun, causing about a 100% error in the result! There is no way of avoiding growth of errors if the problem is ill conditioned. If the method is such that it causes small errors to be amplified even though the underlying problem does not, we say that the method is unstable. Unstable methods are to be avoided! The other type of error analysis that is very popular with numerical analysts is backward error analysis. This attempts to determine the smallest change to the input data that could lead to the answer obtained. Expressed mathematically, we have a problem, say P[d, x], where d is a set of input data and x is the unknown. Ideally, we would like to solve this, that is, to find a value y such that

P[d, y] = 0.
Unfortunately, we compute a numerical answer z. In forward error analysis we attempt to determine the size of z − y, whereas in backward error analysis we ask how large Δd must be in order that

P[d + Δd, z] = 0.
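The two measures are easy to contrast on a toy problem. The fragment below is our own example, not the text’s: for P[d, x] = x² − d = 0 the forward error of an approximate square root z is z − y, while the backward error is the data change Δd = z² − d for which z is the exact answer.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double d = 2.0;
    double y = sqrt(d);        /* the (nearly) exact solution of x*x - d = 0 */
    double z = 1.41421;        /* an approximate answer                      */

    printf("forward error  z - y  : %g\n", z - y);
    printf("backward error delta d: %g\n", z * z - d);  /* P[d + delta d, z] = 0 */
    return 0;
}
```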
The advantage of backward error analysis is that it is often possible to obtain bounds on the backward error that are independent of the problem data, even if the problem is ill conditioned. In the rocket example it may happen that the rocket would actually hit Mars, but a numerical computation of the trajectory shows that the rocket will miss and crash into the sun. However, we can say that the computation produced the correct answer to a problem with very slightly different data. For a more down-to-earth example, suppose we want to compute the sine of a large number, say sin(10⁶π). In a seven-digit precision computer, we will compute sin(3141592). Will this be zero? Of course not. Its correct value is about −0.608, but with round-off error, we could get any answer between −1 and +1. Thus, a bound in forward error analysis will be of little value. However, a bound in backward error analysis will tell us that we have the
sine of a number that is within one part in about 0.7 × 10⁻⁶ of the given argument because, even with the worst case error of 1.608, we know that sin[(10⁶ + 0.5)π] = 1.0 and

[(10⁶ + 0.5)π − 3141592]/3141592 = 0.708 × 10⁻⁶.

Consequently, backward error analysis is a very appealing concept for the numerical analyst; it passes the buck back to the users who, after all, cannot expect good answers to bad problems. If we give them an answer to a very close problem, what more can they ask? Unfortunately, even in a simple example such as linear equations, the meaning of “very close” may depend heavily on the problem area. For example, a simple linear electrical network of resistors leads to a system of linear equations. It is possible to say for a reasonable code that the answer obtained is the answer to the system of equations changed by a small amount. However, to an electrical engineer, a “small change” means that the network differs from the original only by some small changes to resistor values. Unfortunately, the numerical analyst means that there may also be some additional small resistors, or that Kirchhoff’s laws are only satisfied approximately. The engineer may not agree that such a network is in any sense close to the original! It is important to realize that the classification of a problem is dependent on the demand the user places on the error. If a backward error bound is sufficient, the problem may be computationally much simpler. Thus, with a backward error bound, a linear equation problem is 1Aa (class 1, group A, type a), but with a forward error bound, it is 1Ab. Earlier we said that a key attribute of software is reliability. How do we obtain this? Traditionally, testing has played a major role in checking for reliability; today, program proof techniques are taking a role in checking nonnumerical software. Can they be applied to numerical software? Generally speaking, no, for three reasons:

i. We cannot prove theorems if we do not know what we want to prove.
ii. We cannot prove theorems for algorithms if we cannot even prove theorems for the underlying mathematical problem.
iii. Theorems must be proved for computer implementations using inexact computer arithmetic.

The first statement is true for proofs of any subject matter! In terms of good programming practice, it is usually expressed in the statement “we
y′ − f(x, y) = 0,

we can replace the derivative by a difference approximation to obtain

[z(x + h) − z(x)]/h − f(x, z) = 0,
which can be solved explicitly for a sequence of values z(x₀ + nh) given z(x₀). Naturally, we expect z to be close to y. We hope to achieve this by making A “like” P. There are several ways of asking whether A is like P. The two principal ones are to examine the residual or the truncation error and see if it is small. In the residual approach, we ask whether the numerical solution z comes close to “satisfying” the problem P; that is, whether P[z] is small. In the truncation error approach we ask if the true solution y comes close to satisfying the algorithm or code; that is, whether A[y] is small. The definitions of residual and truncation error are, therefore,

residual r = P[z],
truncation error t = A[y].
By subtracting P[y] and A[z] (both of which are zero) from these, we obtain

r = P[z] − P[y] = [∂P/∂x](z − y),
t = A[y] − A[z] = [∂A/∂x](y − z).

If we can invert the partial derivatives (which are matrices), we obtain

z − y = [∂P/∂x]⁻¹ r

and

z − y = −[∂A/∂x]⁻¹ t.
The first says that the error in the answer (z − y) is small if the residual is small and the quantity [∂P/∂x]⁻¹ is small. The latter quantity depends on the problem, and we say that the problem is well conditioned if the quantity is small. There are some problems, usually in groups A and B, for which we can calculate a residual (for example, in linear equation solution). In those cases we can compute an error bound on the result if we have a problem whose condition we know. For example, in the linear equation Py = b, we can, in principle, compute the residual of the solution z by forming r = Pz − b. If we do this in extended precision, we can ignore round-off errors in the residual computation. Combining this with Py − b = 0, we obtain z − y = P⁻¹r. Thus, in this case we can compute both the residual and the condition of the problem.
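For instance, the residual of a small linear system can be accumulated in a wider format. The sketch below is ours, with made-up data; it assumes that C’s long double offers more precision than double, which is true on common desktop targets but not guaranteed everywhere.

```c
#include <stdio.h>

int main(void)
{
    double P[2][2] = {{1.0, 2.0}, {3.0, 4.0}};
    double b[2]    = {3.0, 7.0};
    double z[2]    = {1.0000001, 0.9999998};   /* approximate solution of Pz = b */

    for (int i = 0; i < 2; i++) {
        long double r = 0.0L;                  /* accumulate r = Pz - b in extended precision */
        for (int j = 0; j < 2; j++)
            r += (long double)P[i][j] * (long double)z[j];
        r -= (long double)b[i];
        printf("r[%d] = %Lg\n", i, r);
    }
    return 0;
}
```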
In other problems, particularly in groups C and D, we cannot calculate a residual because P involves operations such as differentiation that cannot be done exactly when the values of functions are available only at discrete points. In that case we usually use the truncation error approach and say that the solution error is small if the truncation error is small and the quantity [∂A/∂x]⁻¹ is small. If the latter quantity is small, we say that the method is stable. This is the type of analysis applied to differential equations. In the example given above, which happens to be Euler’s method, the simplest method for solving differential equations, we can substitute the true solution y into the method and find that the truncation error is hy″/2, where the second derivative y″ is evaluated at some unknown point on the solution. In this case we can show that the method is stable if the original problem is well conditioned, so that the method is a good one. If something is known about the condition of the problem, it is possible to say something about the accuracy of the answer when it is possible to compute the residual. However, it may not be easy to see how to create an algorithm with small residuals. On the other hand, it is frequently easy to see how to create an algorithm with small truncation error; then it may be much more difficult to make the algorithm stable. Truncation errors are often analyzed using the techniques of asymptotic analysis. Essentially, these are applications of Taylor’s series to appropriate expressions. When it is impossible to do an exact computation for P, even in the absence of round-off error, we usually have an algorithm that depends on a parameter. Suppose this parameter is a small number, say h, and the algorithm is
A[h, z] = 0.
For many problems, particularly those arising in differential equations, the algorithm run time depends on h, and becomes infinite as h approaches zero. On the other hand, the algorithm becomes more accurate as h decreases because the algorithm has been chosen so that it is exact when h is zero; that is, so that A[0, y] = 0. Then we can compute A[h, y] by Taylor’s series to obtain

A[h, y] = A[0, y] + h ∂A/∂h + (h²/2) ∂²A/∂h² + ⋯ .
We already have arranged to make the first term zero. The remaining terms are the truncation error. Usually we design an algorithm to make a number of the additional terms zero. Suppose all of the terms up to hᵖ⁻¹ are zero. Then, for small enough h, the error is close to the first nonzero term, which is proportional to hᵖ. Numerical software frequently relies on estimating this term to measure the error and to control h. It ignores higher order terms.
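A common way to estimate that first nonzero term, sketched below with our own numbers rather than anything from the text, is to compare one Euler step of size h with two steps of size h/2 for y′ = y, y(0) = 1; the difference recovers an estimate of the leading error term.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double h    = 0.1;
    double z_h  = 1.0 + h;                   /* one Euler step of size h       */
    double z_h2 = (1.0 + 0.5 * h)            /* two Euler steps of size h/2    */
                * (1.0 + 0.5 * h);

    /* one step of size h incurs an error ~C*h^2; two half steps incur about
       half that, so twice the difference estimates the error in z_h */
    double est = 2.0 * (z_h - z_h2);
    double err = z_h - exp(h);               /* actual error                   */

    printf("estimated error %g, actual error %g\n", est, err);
    return 0;
}
```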
Traditional floating-point error analysis allows the error introduced in each step in the computation to be related to the size of the numbers taking part in that computation. Thus, the initial errors are proportional to the initial data, and subsequent errors are related to the intermediate data values. For this reason, it is often straightforward to relate the effect of round-off errors to the effect of changes in the initial data, since these tend to propagate proportionally through each intermediate result. That is why backward error analysis is such a handy tool. For example, in a simple calculation such as B*(C + D*E), the value of the round-off error in the result is a complex expression involving the values of B, C, D, and E, whereas it is easy to say that the error in the result is no worse than that caused by a small change in each of the input values.
4. The Alchemy
The creation of numerical software is based on the type of analysis discussed in the previous section, but none of this analysis is strictly valid in the real world of computers.

i. Stability analysis may be invalid because it depends on the differentiation of codes, not algorithms.
ii. Asymptotic analysis may break down because often we cannot be certain we are working with small values of the parameter; in fact, sometimes we know we want to use large values.
iii. Round-off error analysis breaks down because, in the presence of underflow, it is more complex than the usual analysis indicates.
Most of the time the analysis is not badly in error, but since the goal of a numerical software project is reliability, the success is related to the probability of avoiding wrong answers. An occasional breakdown of the analysis may cause a large code to frequently give misleading answers. The differentiation of programs obviously fails in the presence of significant round-off errors, although round-off errors are frequently small compared to truncation errors. However, differentiation can also fail because many programs are adaptive; that is, they try to adjust some parameters to achieve close to optimal behavior. If these parameters behave in an erratic way or are confined to discrete values as would happen if they represented switching between various methods, differentiation appears to have no meaning, even in an approximate sense. This means that a totally different approach is needed for the analysis of stability. There has so far been little progress in this area. Asymptotic analysis also fails when differentiation is not meaningful, but the major practical difficulties with asymptotic analysis seem to arise from the “small h” assumption. When h is not small, error estimates are unreliable, if not totally wrong. Because of this, some of the toughest problems are those in which we want very little accuracy. As contradictory as it may seem, we know how to solve many problems very accurately, but we do not know how to solve them inaccurately and cheaply. In some cases this is simply the statement that the first digit is the most expensive to obtain, and others become progressively less expensive, but in other cases, we just cannot find ways to compute low-accuracy answers reliably. This is particularly the case in differential equations where the only form of error estimates known to us are based on asymptotic error estimates that break down at low accuracy. Consequently, we have no idea of the size of the error until we have computed a very accurate answer. An alternate theory to asymptotic analysis is needed. Approximation theory provides a basis
for some cases, but it has not been applied successfully to many problems that currently use asymptotic analysis. Asymptotic analysis has another peculiarity in that if we use it to estimate an error we finish up with no error estimate! The reason for this is that if we estimate the error in a numerical result, we naturally subtract that error from the result to obtain a “more accurate” answer. However, we now have no error estimate for the more accurate answer. Kahan⁵ argues that we should solve this dilemma by computing an “uncertainty” rather than an error estimate. The uncertainty represents the possible error in the solution owing to our lack of more precise knowledge of the input or unwillingness to compute more precise information about the solution; it does not represent an estimate of the actual error. The breakdown of simple round-off error analysis can be seen in the example used earlier, B*(C + D*E). Suppose that this is computed in floating point using a finite exponent range. Overflow is not a serious problem in that we are told when overflow occurs, and it warns us of trouble. We could take the same attitude to underflow, but most users ignore underflows entirely, accepting a zero result. Most of the time this is reasonable. However, suppose that C is the smallest number that can be represented in the machine, and B is its inverse, assumed representable. Let D and E be small enough that D*E is just less than C, so it is underflowed. Then, the computed answer for the above expression is 1.0, whereas the correct answer is almost 2.0. Mention should be made of an effort underway by an IEEE subcommittee on microprocessor floating-point standards. In addition to standardizing formats, they are considering two proposals that would help with this underflow problem. One would virtually eliminate underflows and overflows by using system traps to extend the range of the exponent when either occur. A heap would be maintained to keep this additional information, and bit patterns in the data would be used to indicate extended exponent range. Arithmetic is likely to be slow for extended range numbers in this scheme, but some of the problems facing the numerician disappear. The other proposal introduces the idea of “gradual underflow.” This allows numbers with the minimum exponent to be unnormalized. The effect of this is that if expressions such as the above have a nonzero value, they will not be in error by more than one expects from a conventional error analysis (Coonen, 1978).

⁵ W. Kahan, University of California at Berkeley.
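The effect is easy to see on today’s machines, which adopted gradual underflow. The sketch below is our adaptation, not the text’s example verbatim: under IEEE arithmetic 1/C would overflow if C were the very smallest representable value, so we take C to be the smallest normal double instead, with D*E falling just below it.

```c
#include <float.h>
#include <stdio.h>

int main(void)
{
    double C = DBL_MIN;      /* smallest normal double, about 2.2e-308  */
    double B = 1.0 / C;      /* representable, about 4.5e307            */
    double D = 0.5 * C;      /* a subnormal under gradual underflow     */
    double E = 1.9999999;    /* makes D*E just less than C              */

    /* With gradual underflow this prints a value close to the correct 2.0;
       a machine that flushed underflowed values to zero would print 1.0. */
    printf("B*(C + D*E) = %.7f\n", B * (C + D * E));
    return 0;
}
```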
In the past a number of schemes have been tried to minimize the impact of round-off errors. Some of these have been designed into computers, but have not been effective. For example, significant-digit arithmetic (Metropolis and Ashenhurst, 1958) was tried about 20 years ago. The idea was to keep as many digits in the floating-point mantissa as were known to be correct. While the scheme did not give answers that were wrong, it gave answers with no precision, because most schemes for bounding the effects of round-off errors lead to answers that are far too pessimistic. A more recent scheme that has had some success is interval arithmetic (Moore, 1966), in which an interval known to contain the result is computed. Schemes of this form are, unfortunately, limited to problems that are not in class 3 or in groups C or D, so round-off errors will remain a problem for the numerician for some time.

5. Conclusion
What is the future direction of numerical software? Today we see a large amount of activity in areas that will formalize the measurement of methods, the selection of algorithms, and the techniques of testing. For example, Jackson et al. (1978) have used a model problem and selected methods that are optimal over a class of model problems. Rice (1976) has examined the algorithm selection problem as an approximation problem. A recent IFIP working conference (Fosdick, 1979) was devoted to the question of evaluation of the performance of numerical software. This is a very difficult area because the performance of software for group C and D problems is extremely problem dependent, so at the moment we can do little but test packages on large numbers of problems and take average statistics. In the future we need to abstract features of problems so that we can relate performance to these features. However, these features must be ones that can be determined either by the user or a program. We might ask whether numerical software is a part of computer science and should be studied in computer science departments. I believe that computer science is primarily concerned with the analysis and synthesis of the design and use of computers. From this point of view, I place program design, programming language design, and architecture at the center of computer science. Abstract theory, while vitally important and interesting in its own right, is only computer science when it is concerned with real computer problems. From this point of view, numerical software is very much a part of computer science. It is concerned with automatic problem solving; that is, with analyzing methods and synthesizing techniques. It has some solid scientific foundations, but still involves a great deal of judgment because the idealized goals are not achievable. Thus, the numerician is partially reduced to alchemy: brewing new concoctions and testing the results. Sometimes heuristics are used (although they are usually called adaptive methods and given something of a scientific footing).
Many people have commented that the goals of automatic problem solving and the use of heuristics suggest a potential affinity with the AI community. Perhaps that is true because our AI colleagues are the outstanding alchemists of computer science; however, the numerician is usually more concerned with writing a package that works reasonably well over a broad spectrum of problems than with a package that does remarkably well for a smaller set of problems. Perhaps that will lead to accusations that numericians are engaged in a search for the mediocre. If by that it is meant that they are trying to build the Model A for everyone rather than the custom design for the few, then yes, that is what they are trying to accomplish. They want a model that runs reliably, smoothly, and that can be maintained easily.

ACKNOWLEDGMENT

The thoughts presented here are the result of listening to numerous colleagues. Unfortunately, I cannot always remember who was responsible for which idea, so they have not always been acknowledged, but I would like specifically to thank Bob Skeel of the University of Illinois for his many comments and suggestions.

REFERENCES

Brown, W. S. (1977). A realistic model of floating-point computation. In “Mathematical Software III” (J. R. Rice, ed.), pp. 343-360. Academic Press, New York.
Cody, W. J. (1971). Software for elementary functions. In “Mathematical Software” (J. R. Rice, ed.), pp. 171-186. Academic Press, New York.
Coonen, J. T. (1978). “Specifications for a Proposed Standard for Floating-Point Arithmetic,” Memo. No. UCB/ERL M78/72. Department of Mathematics, University of California at Berkeley.
Dekker, T. J. (1979). Correctness proof and machine arithmetic. In “Proceedings of IFIP Working Conference on the Performance Evaluation of Mathematical Software” (L. Fosdick, ed.). North-Holland Publ., Amsterdam.
Fosdick, L., ed. (1979). “Performance Evaluation of Numerical Software.” North-Holland Publ., Amsterdam.
Jackson, K. R., Enright, W. H., and Hull, T. E. (1978). A theoretical criterion for comparing Runge-Kutta methods. SIAM J. Numer. Anal. 15 (3), 618-641.
Jacobs, D., ed. (1978). “Numerical Software: Needs and Availability.” Academic Press, New York.
Metropolis, N., and Ashenhurst, R. L. (1958). Significant digit computer arithmetic. IRE Trans. Electron. Comput. 7, 265-267.
Moore, R. (1966). “Interval Arithmetic.” Prentice-Hall, Englewood Cliffs, New Jersey.
Rice, J. R. (1976). The algorithm selection problem. In “Advances in Computers” (M. Rubinoff and M. C. Yovits, eds.), Vol. 15, pp. 65-118. Academic Press, New York.
Rice, J. R. (1977). ELLPACK: A research tool for elliptic partial differential equations software. In “Mathematical Software III” (J. R. Rice, ed.), pp. 319-341. Academic Press, New York.
Computing as Social Action: The Social Dynamics of Computing in Complex Organizations

ROB KLING AND WALT SCACCHI

Department of Information and Computer Science and Public Policy Research Organization, University of California at Irvine, Irvine, California
1. Perspectives on Computing in Organizations . . . 250
   1.1 The Social Side of Computing in Organizations . . . 250
   1.2 The Case of a Client-Tracking Welfare Information System . . . 251
   1.3 Six Perspectives on Computing in Organizations . . . 253
2. The Computing System Life Cycle . . . 261
   2.1 Initiation and Adoption . . . 263
   2.2 Selection . . . 266
   2.3 Requirements Specification . . . 267
   2.4 Design . . . 269
   2.5 Implementation . . . 272
   2.6 Testing and Validation . . . 274
   2.7 Documentation . . . 275
   2.8 Maintenance, Modification, and Conversion . . . 276
   2.9 Evaluation . . . 278
   2.10 CSRO Case: A System Life Cycle . . . 280
   2.11 Six Perspectives of Text Processing at CSRO . . . 286
   2.12 A Retrospective of the System Life Cycle . . . 287
3. Computer System Use in Complex Organizations . . . 290
   3.1 Discretion in Computer System Use . . . 290
   3.2 Organizational Distribution of Computing Knowledge . . . 292
   3.3 Mistakes in Computing System Use . . . 294
4. Impact of Computing on Organizational Life . . . 295
   4.1 Decision Making . . . 296
   4.2 Organizational Power . . . 309
   4.3 Computing and Work . . . 312
5. Conclusions . . . 317
   5.1 The Six Perspectives in Perspective . . . 317
   5.2 Lines for Development . . . 319
References . . . 323
1. Perspectives on Computing in Organizations

1.1 The Social Side of Computing in Organizations
During the last 30 years computer technologies have evolved from unusual laboratory instruments into common fixtures. Most businesses that gross over $100 million per year in sales, and government agencies that serve publics of several hundred thousand, have automated some of their operations. Computer technologies are becoming widespread and increasingly visible in small businesses and homes. One can easily count the increasing number and variety of applications that have been automated. What these technologies mean for the people and organizations who use them is much less clear and less amenable to explication by simple measures like dollars spent and dollars saved (Gotlieb and Borodin, 1973).

Almost anyone who deals with computer use has occasion to explain why some computing arrangements work out the way they do. A company president may wonder why the staff of several operating divisions each request their own minicomputers. A systems analyst may wonder why the prospective users of a new information system continually change their minds about their requirements. A bank auditor may wonder whether he will have a harder or easier time understanding the records that are placed on a computer that uses a data base management system. City councilmen may wonder whether use of a “fiscal impact model” by planners will facilitate understanding of the costs and revenues associated with new commercial and residential developments. The customer of a credit card firm may wonder why a billing error is still uncorrected after several phone calls and promises made over a three-month period.

Questions like these are commonplace. Our answers are usually based upon some assumptions about how the particular computing technology works and how people in the organization who develop, provide, or use computer-based services go about their business. Technical assumptions include simple understandings such as “Data stored in a common data base should be identical for each user.” But one might even have to invoke more complex understandings of a particular application in order to appreciate why one’s airline reservations for a whole trip are cancelled when only one leg of the trip is cancelled. Similarly, simple and subtle assumptions must be made about the behavior of organizations and their participants. Under what conditions do lower level staff usually follow policies set by higher level managers? Will the activities of groups outside of the computer-using organization affect its computer operations? When can Federal funders, government or private auditors, vendors, competitors, or professional associations be ignored and when are they important parties? Should one view complex organizations largely as composed
of groups working toward a common goal, or are they better understood as federations of groups that build a shifting set of alliances and in the process serve selected interests, legitimate and illegitimate?

In order to give good answers to these questions, an analyst must understand computing technology and life in organizations sufficiently to model the situation being examined. We believe that the technical aspects of most computer systems in use today are relatively well understood. In contrast, the behavioral and social aspects of computing are poorly understood. This chapter examines the ways in which the behavior of people and groups in organizations influences the development, use, and consequences of computing. It indicates recent research findings that help answer questions like those listed above.

When one asks why organizations adopt computing or why some people are more enthusiastic about particular applications, different analysts provide different answers. Some analyses emphasize the economic efficiencies provided by computing (Kanter, 1977). Other analyses emphasize the ways in which automated data analysis provides users with important political resources (Laudon, 1974; Kling, 1978b). Still other analyses emphasize the ways in which state-of-the-art computing use enables both computer specialists and users to enhance their image within social worlds that value innovative activities (Kling and Gerson, 1977). Should questions about the use or consequences of computing in organizations be answered primarily in terms of rational economic calculations, interpersonal relations of key staff, routinized organizational procedures, or the politics of the computer-using organizations? Each of these phrases indicates a different perspective from which the social activities that constitute computing development and use in organizations can be understood. In this chapter we shall examine six common perspectives that provide different terms or “storylines” with which to understand how people live and work with computing in organizations.

This chapter can be read as a review of recent studies of computing development, use, and impacts in organizations. It can also be read as an examination of the usefulness of different social perspectives for explaining how computing developments “work” in complex organizations (Kling, 1980).

1.2 The Case of a Client-Tracking Welfare Information System
These six perspectives are best introduced by indicating how they help explain a complex case of computer use. First we will introduce the case of a welfare information system used by the staff of an American local government. Then we will indicate how analysts, using each of six different perspectives on behavior in organizations, would view this case.
In the mid-1960s the U.S. Department of Health, Education, and Welfare funded a major municipality, “Riverville,” to build an automated information system (UMIS) to help “integrate services” among dozens of different local welfare agencies. This integration would help improve administrative efficiency as well as the quality of service provided to welfare clients in Riverville. Agreements to this effect were given by municipal staff and representatives of the computer vendor, who jointly developed the earliest, batch version of UMIS. Data provided by welfare clients upon application for welfare assistance were entered on UMIS at one of several information and referral offices supported by public funds. Participating agencies were also requested to enroll their clients on UMIS. Whenever a person was referred to a service, it was recorded on UMIS. Whenever the services were provided, that too was recorded on UMIS.

Staff in the welfare agencies were suspicious of UMIS. A cable connecting UMIS to one of the neighborhood service centers was cut just as it was being introduced. Through administrative fiat and persuasion, UMIS was adopted by the staff in the municipally operated neighborhood service centers and 34 welfare programs. The staff of other agencies that had small clientele or that were large and had their own automated recordkeeping systems were reluctant to utilize UMIS.

By 1970 the technology of both UMIS and of welfare operations had changed. UMIS supported sets of terminals in several neighborhood centers to ease the entry and retrieval of data about individual clients or their families. By 1970, 170 different local agencies were listed in a directory of “participating agencies” published by the staff of UMIS. Also, the 34 welfare programs operated by the city were administratively consolidated into one agency, and UMIS was supposed to help evaluate programs and eliminate duplicate paperwork.

A careful study conducted in the mid-1970s showed that UMIS was in place and used routinely for recording the transactions between some agencies in Riverville and their clientele. However, data on the services provided to many clients were incomplete because many agencies refused to use UMIS. Data were also inaccurate since UMIS was used as a training aid for a job training program designed to teach people data entry skills. UMIS provided little useful information to welfare caseworkers and it had minimal influence on their work styles. UMIS did not help integrate the operations of any agencies and was not used to increase administrative efficiency. Rather, its reports were used to help generate more Federal funding by taking advantage of the fact that HEW auditors believe an agency with such extensive automation must be well managed, and that computerized data are more credible than manually aggregated data (Kling, 1978b).
1.3 Six Perspectives on Computing in Organizations
When confronted with a complex set of events involving many participants over 10 years, we are easily overwhelmed. How do we focus our attention and make sense of the bewildering array of facts and relations that appear in the case of UMIS? When people knowledgeable about computing in organizations attempt to explain UMIS operations, effects, and failures, they usually adopt one of six perspectives:

1. Rational analyses emphasize the formal ends which computer-based technologies are supposed to serve. Rational analysts emphasize economic payoffs of computing, and explain its value, diffusion, successes, and failures by the ways computer applications help increase the information processing capacity of social actors, and thereby help them satisfy their espoused goals. Rational analysts explain the discrepancies between the espoused goals of UMIS and its actual patterns of use by indicating that UMIS was based upon a simple technology (Hiltz and Turoff, 1978). Hiltz and Turoff argue, for example, that UMIS would “really” have helped caseworkers if only they had utilized computer conferencing.

2. Structural analysts share with rational analysts an emphasis upon the formal tasks claimed to be important by the most credible officials in an organization. Structural analysts enrich the rational account by situating decisions in the context of ongoing organizational activities that may be scattered across organizational subunits, episodic, and disrupted repeatedly by alterations in the environment. Structural analysts have typically viewed the choice of the “best” information system for an organization as contingent upon some features of the organization’s activities and the stability of its environment (Galbraith, 1977). Structural analysts are concerned with the ways in which “structural properties” of organization such as size, complexity, and centralization are influenced by information systems. For structural analysts, the key questions about UMIS would be whether it helped top administrators consolidate several welfare agencies into one super-agency and whether it helped centralize decision making in their hands. They would also be interested in the reasons why UMIS did not help integrate the operations of the diverse array of welfare agencies situated in Riverville.

3. Human relations analysts are particularly concerned that the people using an automated system are not constricted by its use. This perspective, commonly adopted by organizational psychologists, focuses upon the way in which different computing arrangements alter the quality of working life of participants. Concerns with job satisfaction, motivation, and alienation are focal for these analysts. When faced with UMIS in Riverville, human relations analysts would wonder why some people cut
the cable between UMIS and a neighborhood service center. They would be interested in whether the shift from a batch system to an on-line operation eased the time pressures and demands for accuracy faced by clerks who were entering and retrieving clients’ records. They would wonder whether UMIS was used to monitor the workloads of caseworkers, and to increase closeness of supervision.

4. Interactionist analysts view computing technologies as objects that can take on rich social meaning for people who deal with them. Analysts of this persuasion are concerned with the ways in which people define their social situations, and the ways in which they create strategies of action in line with their perceptions and intentions. Interactionist theorists believe that many of the rules and policies that are supposed to regulate life in organizations are in fact moderately flexible, fluid, and frequently altered, through interpersonal negotiations, to fit the lives of the participants. Interactionist analysts who studied Riverville would focus on the conceptions of UMIS held by different participants in the local welfare agencies, the municipal administration, and the local computing world. They would wonder why HEW auditors believed that data from a computer were more credible than manually counted data. They would be interested in learning how municipal managers developed UMIS as a negotiating instrument to deal with Federal auditors. They would also ask whether the relations between clients and welfare agencies were altered by virtue of UMIS. Were welfare applicants aware of its presence? Did it alter their perceptions of how formal or helpful the agency staff was?

5. Some analysts emphasize the political character of organizational life. Participants in organizations are viewed as players in positions from which they can command certain resources, and will often jockey to increase their influence. They view the decisions made by organizations as the actions in power games, or as the resultants of many such actions. An organizational politics analyst would select the power dimensions of social life for careful scrutiny. Like the structural analyst, he would be concerned with whether top administrators in Riverville were able to use UMIS to enhance their control over welfare operations. Like the human relations analyst, he would be interested in the extent to which casework supervisors used UMIS to increase their control over the activities of caseworkers and clients. He would also focus on the relations of power and dependency between the different welfare agencies in Riverville. Last, he would highlight UMIS’ utility as an instrument of bureaucratic politics between local officials and federal auditors and funders.

6. Class politics analyses may be viewed as a special kind of political analysis. Usually using Marxism as a point of departure, class analysts examine how the use of computing in organizations reinforces existing
class relationships. In particular, they are concerned that workers become increasingly disenfranchised and work becomes increasingly degraded in automated workplaces. Ironically, class analysts often share the assumption of rational analysts that automated information systems cannot help but effectively rationalize the operations of computer-using organizations. Class analysts would explain how the American welfare system is designed to foster a labor pool of low paid workers who can be easily exploited by potential employers. They would argue that UMIS does nothing to alter the fundamental relations between social classes within which welfare services are provided. They would also expect UMIS to lower the quality of working conditions of staff in the welfare agencies. In particular, they would share with human relations and organizational politics analysts a focus on the role of UMIS in altering the power relations between workers and their managers.

Most of the articles and books that explain some facet of computing in organizations are primarily informed by one or another of these six perspectives.¹ None of these approaches is unique to computing. Analyses representing each approach may be easily found that examine other aspects of technology and organizational life. Each perspective provides a way for the analyst to focus on a few relations of particular interest, and to develop a coherent storyline about his topic.

¹ This classification system is based on our reading of the literature on computing in organizations. Keen and Scott Morton’s (1978) book on decision-support systems examines the social dynamics of implementation from five different perspectives. Some of their categories differ from ours. They identify two subschools of the rational approach, “satisficing” and “individual differences,” which we cluster under the same common label. They ignore the structural, interactionist, and class politics perspectives. Labeling theoretical perspectives is more an art than a science. By spanning a wider array of theoretical perspectives than Keen and Scott Morton, we have chosen to cluster some important variations of given perspectives under common labels.

In Table I we indicate some of the key features of each perspective. Each perspective includes a view of technology and a way of defining the social setting in which the technology is used. For example, both rational (Hiltz and Turoff, 1978) and organizational politics (Laudon, 1974) analysts view computing as an “instrument” or tool that is directed to some specified end. In contrast, the interactionist views computing as creating a complex social milieu that brings a variety of participants, including “users,” into a rich, shifting, and sometimes problematic set of social arrangements (Kling and Gerson, 1978; Kling and Scacchi, 1979b). Rational analysts typically identify the individual or group who uses computing as the major focus of attention. Analysts who adopt an interactionist (Kling and Gerson, 1977, 1978) or political perspective (Laudon, 1974; Kling, 1978b) identify a variety of groups and individuals who participate in the social and organizational world of the identified user as critical actors.
TABLE I. SIX PERSPECTIVES FOR ANALYZING COMPUTING

Rational
  Technology: Equipment as instrument.
  Social setting: Unified organization. 1. The user. 2. Tasks and goals. 3. Consistency and consensus over goals (assumed).
  Organizing concepts: Rationalization; formal procedures; individual (“personality”) differences; intended effects (assumed); authority; productivity; need; cost-benefit; efficiency; task; better management.
  Dynamics of technical diffusion: 1. Economic substitution: “meet a need.” 2. Educate users. 3. A good technology “sells itself.”
  Good technology: 1. Effective in meeting explicit goals or “sophisticated” use. 2. Efficient. 3. Correct.
  Workplace ideology: Scientific management.

Structural
  Technology: Equipment as instrument.
  Social setting: Organizations and formal units (e.g., departments). 1. Formal organizational arrangements. 2. Hierarchy of authority, reporting relationships.
  Organizing concepts: Organizational structure; organizational environment; communication; standard operating procedures; organizational resources and rewards; uncertainty absorption; rules; authority/power; information flow.
  Dynamics of technical diffusion: The fit between attributes of the innovation, organization, and environment enables diffusion.
  Good technology: Helps organizations adapt to their environments.
  Workplace ideology: Scientific management.

Human relations
  Technology: Equipment as instrument/environment.
  Social setting: Small groups and individuals. 1. Task groups and their interactions. 2. Individual needs. 3. Organizational resources and rewards.
  Organizing concepts: Trust; motivation; expectations and rewards; job satisfaction (subjective alienation); self-esteem; leadership; sense of competence; user involvement; group autonomy.
  Dynamics of technical diffusion: Acceptance through participation in design.
  Good technology: Promotes job satisfaction (e.g., enlarges jobs) and improves job performance.
  Workplace ideology: Individual fulfillment through work.

Interactionist
  Technology: “Package” as milieu.
  Social setting: Situated social actors. 1. Differentiated work organizations and their clientele. 2. Groups with overlapping and shifting interests.
  Organizing concepts: Defining situations; labeling events as a social construction; work opportunities/constraints; package; career; legitimacy; social world; social conflict; interaction; role; negotiations.
  Dynamics of technical diffusion: Accepted technologies preserve important social meanings of dominant actors.
  Good technology: Does not destroy social meanings important to lower level participants, public, and underdogs.
  Workplace ideology: Individual fulfillment through evocation of valued social meanings.

Organizational politics
  Technology: Equipment as instrument.
  Social setting: Social actors in positions. 1. Individuals/groups and their interests.
  Organizing concepts: Work opportunities/constraints; power; social conflict; legitimacy; elites; coalitions; political resources; bargaining; power reinforcement.
  Dynamics of technical diffusion: Accepted technologies serve specific interests.
  Good technology: Serves the interests of all legitimate parties and does not undermine legitimate political process.
  Workplace ideology: Several conflicting ideologies.

Class politics
  Technology: Equipment as instrument.
  Social setting: Social classes in stratified system.
  Organizing concepts: Ownership of means of production; power; social conflict; alienation; deskilling; surplus value; social class.
  Dynamics of technical diffusion: Accepted technologies serve dominant class interests.
  Good technology: 1. Does not alienate workers. 2. Does not reproduce class relations.
  Workplace ideology: Worker’s control over production.
Analyses written from a particular perspective are framed with some of the organizing concepts identified in Table I. An analysis of UMIS in Riverville that centrally emphasizes the rationalization of administrative procedures and its intended effects should be understood as a rational analysis. Rational analyses typically indicate the utility of computing to some individual or group with little attention to variations between them or their organizational worlds. Similarly, rational analyses often slight examination of different modes of computing and their fit in a variety of settings. If the conceptual vocabulary is enriched to account for different organizational arrangements of computing development, operations, and use (e.g., centralization of control), or if the analyst pays attention to the critical uncertainties in resources and lines of action which actors cope with in an organization, then one is reading a structural analysis.

In practice, analysts may draw upon the organizing concepts of two or more perspectives. However, some couplings are most common. Some structural analyses reflect the arid, consensual world of the rational analyst (Galbraith, 1977). Some rational analysts indicate a few structural features of the computing world they examine (e.g., transaction volumes, or organizational hierarchy) without explicitly hinging the validity of their analyses to carefully specified structural conditions (Hiltz and Turoff, 1978). The frameworks of interactionist and political analysis are sometimes blended, particularly in the studies of the symbolic politics of computing and social power (Kling, 1974, 1978a; Kling and Gerson, 1977). Couplings of political and structural analyses are also common (Greenberger et al., 1976; Colton, 1978; Goodman, 1979; Kraemer et al., 1980), but most analyses emphasize one dominant set of organizing concepts and selectively utilize others to fill in conceptual gaps on an ad hoc basis.

In this article we will indicate how analyses of computing system developments, uses, and consequences for organizational life have been informed by these six alternative perspectives. The major value of identifying these perspectives is not to provide a static typology for sorting studies. Rather, it is to indicate how critical assumptions about the social world pervade common arguments about the development, use, and consequences of computing. These six perspectives indicate some of the major clusters of social assumptions and organizing concepts that inform contemporary analysts. The concepts of each perspective have been developed and used in relatively rich literature devoted to both theory and application. Because of space limitations, we will not elaborate the core concepts of each perspective that are listed in Table I. Each perspective
has a rich and distinct literature. The interested reader may find detailed examinations of these perspectives in Allison (1971), Perrow (1979), Cuff and Payne (1979), and Braverman (1974).

We will examine software development in different stages of the system life cycle, some elements of computer use, and the impacts of computing on decisions, power, and worklife in organizations. There are many topics we are not examining. We have not selected particular technologies such as minicomputers or decision-support systems for scrutiny. Nor have we examined certain managerial schemes such as radically (de)centralizing computer operations, appointing policy boards, or full chargeback pricing. The computing world is alive with many novel and interesting technologies. There are many strategies for most effectively developing them and managing their use. Rather than examining these options in fine detail, we examine the efficacy of the six perspectives commonly used to analyze new technologies and the social arrangements for managing their operation.

How computing “really” fits into the social life of organizations depends in part upon one’s particular theoretical preferences. Structural analysts “see” a world of communication channels, standard operating procedures, and environmental uncertainty (Galbraith, 1977). Interactionists see a world of people negotiating their own definitions of potentially fluid situations. These social meanings are constructed and pursued through interactions with others and with technologies at hand. These social interactions, and their focal concerns related to computing, will vary in place, time, and context (Kling and Gerson, 1977, 1978). Political analysts, in contrast, see a world in which participants actively play and manipulate to increase their power (Kraemer and Dutton, 1977).

While these perspectives denote languages that punctuate what will be seen and what will be ignored, they do not determine the substantive beliefs of an analyst about how computing “works” in different settings. Within any given theoretical perspective there can be serious debates or marked consensus over “what is really happening.” Rational analysts generally agree that management information systems have been oversold and “underused” by managers. Recently, two researchers tested the viability of the theoretical assumptions of rational and political perspectives within the same study (Kling, 1978b; Markus, 1979) and found that rational models explained a much narrower range of important behavior. However, there is no consensus among analysts of organizational politics whether automated information systems enhance the power of the topmost managers in an organization (elite politics), or simply enhance the power of those already strongest in an organization (reinforcement
politics) (Kling, 1980). For example, in large organizations such as the U.S. Federal government or horizontally merged conglomerates, heads of the largest departments or operating divisions, not the president, may exert effective influence over what business is done and how it is done.

We are not indifferent in our choice of perspectives. We believe that the interactionist and organizational politics perspectives go farther in providing appropriate languages and conceptions of social dynamics to help explain the attractions and dilemmas of computer use in organizations than do their competitors. However, we believe the social elements of computer use will be understood much more keenly if we explicitly contrast analyses from different perspectives.

Analysts also differ with respect to the unit of analysis they focus upon. Some analysts emphasize the behavior of participants in organizations, others focus on the behavior of organizations, and still others try to link the two (Kling, 1978b). For example, a structural analyst might claim that organizations will adopt computing to help increase the ease of coordinating large volumes of routine transactions. However, an analyst who studies the specific computing arrangements used by the staff of such a firm may find its economic rationality less obvious and would emphasize the politics of its computing decisions (Pettigrew, 1973). (He might have found that the board of directors of a firm dominate decisions on choice of equipment and choose equipment from a prestigious, but expensive vendor rather than the more flexible and technically appropriate equipment of a competitor.) These two accounts are congruent even though they are couched in different theoretical terms. In this case the structural analyst is describing the behavior of the organization while the political analyst is scrutinizing the behavior of its participants. A political account that truly conflicts with this structural account would focus on the behavior of the organization, rather than just of its individual members. In contrast, an analyst who argues that the staff of an organization adopted computing to enhance their image of being efficient or progressive, rather than to enhance their ability to coordinate routine operations, would be offering a contending analysis from a political perspective.

In this chapter we will examine these different social elements of computing development and use in organizations. We will focus upon what we have learned about the actual patterns of computer development and use in “representative” organizations, and emphasize the findings of careful empirical studies of computing in organizational life. We believe that careful studies of computing development and use in real organizations offer an essential set of data to help us understand how computing arrangements of newer and older design may “work” in organizations, bearing in
mind that the “empirical” nature of studies does not mean that they look for the same things in studying the social aspects of computing, or that they interpret the findings in the same way.

Typically, the social aspects of computing development and computing use are separately treated. Some management analysts and software engineers have been concerned about ways to organize software development and computer operations so that they are smooth, cost effective, and efficient (Cooper, 1978). Other analysts have focused their attention on the social repercussions of computer use (e.g., worklife, organizational structure). We view these diverse topics as pieces of a common fabric. Whenever computing is embedded in an organization it becomes a markedly social phenomenon. This viewpoint is ironically underlined by the complaints of people who adopt a rational perspective and who wish that people or politics could be taken out of computing. “Supportive” managers, “clever” programmers, “indifferent” machine operators, “career-oriented” auditors, “stupid” users, and “foot-dragging” vendors often appear on the playbill of the rational analysts who wish that systems could be perfect in their theater, and that the actors would depart, leaving the director alone with the plans, technology, and the audience (Strassman, 1973). But with such a cast of characters on hand, something of interest might just be happening; and often it is.

In the following sections the development, use, and impact of computing in organizations will be examined in light of the six perspectives outlined above. Section 2 examines the development and provision of computer services through the life cycle from initiation to evaluation. Section 3 indicates how knowledge about computing is distributed throughout organizations and how this leads to systematic misperceptions of computer use and increases the likelihood of computing errors. Section 4 examines the consequences of computer use for the ways decisions are made, the worklives of computer users, and the distributions of power in computer-using organizations.

The narrative thread we have woven is at times somewhat complicated as we examine topics such as software maintenance, discretion in computing use, and the role of computing in decision making from several different perspectives. Usually each topic has been most carefully examined by analysts who represent only a few of the six perspectives. In principle, we would wish to examine each topic from each perspective, and provide a more consistent development. Two difficulties intrude. Primarily, the literature composed of careful studies of computing has been extremely fragmented, and researchers have typically not developed lines of inquiry that examine these diverse topics from any one theoretical perspective. Sometimes we have filled in appropriate analyses when we believed them to be
particularly relevant and could ground them in data collected during the course of our own fieldwork. Otherwise, our gaps reflect those in the literature. For example, human relations and class politics (Kraft, 1977) analyses of computing work in general can be developed straightforwardly. But peculiarly human relations or class politics approaches to software testing are less transparent. In Section 5 we integrate the major themes and suggest lines for further work.

2. The Computing System Life Cycle
Some analysts conceptualize computing systems as developing through a temporal sequence of stages, a “life cycle.” A computer system is designed before it is implemented, tested after it is implemented, and maintained until it “dies.” Organizing research by phases of the system life cycle focuses attention upon the generic activities of system development integrally linked with computing use (cf. Learmonth and Merten, 1978).

Traditionally, the computing system life cycle draws attention to the different phases of system development independent of their organizational setting (Cave and Salisbury, 1978; Cooper, 1978). This rational tradition reflects the desire by many analysts and computing specialists to view computing system development, use, and maintenance as essentially technical and managerial endeavors (DeRoze and Nyman, 1978). But the tide may be beginning to turn. For example, informed software engineers have begun to see that

concerns for the social implications of computer systems are part of the software engineer’s job, and techniques for dealing with these concerns must be built into the software engineer’s practical methodology, rather than being treated as a separate topic isolated from our day-to-day practice. (Boehm, 1979; emphasis added)
We will examine some of the concerns encountered by computing users, specialists, and managers according to the perspectives that have informed the analysis of different techniques. In this section we will show how each phase of the computing system life cycle is shaped predominantly by social interactions between computing promoters, developers, users, maintainers, and their organizations. We will investigate each of the system life cycle phases in Table II as covered in the literature. When applicable, we will provide examples of situations drawn from our own case studies. However, we will investigate computer system use separately in Section 3.

Many authors and organizations have defined the phases of the computing system life cycle, but their definitions vary widely with respect to the number of phases, their names, and the activities within each. For this chapter, we define the phases as in Table II.
TABLE II. THE COMPUTING SYSTEM LIFE CYCLE

1. Initiation and Adoption. The activities related to system adoption and requirements specification. Getting the organization ready for the arrival of the intended computing system. The common choice is between developing a system in-house or transferring-in a system developed by some other organization.
2. Requirements Specification. The activities that outline the design of a computing system. The system specifications are determined by some mix of managers, computing specialists, and instrumental users.
3. Selection. The activities surrounding the choice of where to acquire a computing system.
4. Design. The activities dealing with the detailed specification of the structure and functions of the computing system. Computing systems can be distinguished according to whether their designs are primarily user centered or machine centered.
5. Implementation. The activity of building and installing a specified computing system in an organization. Some management and MIS analysts include system design in this phase.
6. Testing and Validation. The activity of demonstrating, verifying, or “proving” that a system works as expected.
7. Documentation. The activities in which a written record of the functions, structure, and operating instructions of a computing system is produced.
8. Computing System Use. The activities that entail work with a computing system on the part of managers, computing specialists, other professionals, and clerks.
9. Maintenance, Modification, and Conversion. The activities concurrent to the use of a computing system which keep it alive and operational. Now considered to be the most costly phase of the system life cycle.
10. Evaluation. Measuring the fit and performance of a computing system with respect to computing arrangements and organizational goals.
The “system life cycle” as used here does not refer to some predetermined sequence of steps through which all systems must inevitably pass. Rather, it directs our attention to the beginnings, middles, and endings of computing systems that may lead to new systems beginnings. In organizational settings where computing systems have been in use for some time, new application systems are continually added to the existing system environment. “Systems” and application software tend to be built upon the layers of the existing kernel of software. Application software often depends on systems software for proper operation. As software components grow larger and “bump” into each other, computing systems intended as solutions to certain problems can lead to new problems to which systems must be adapted or replaced. Thus, as a system develops within its life cycle, a number of the “phases” mentioned in Table II occur in parallel. One indicator of this concurrency is the continuing growth of software systems found in the use and maintenance phases.
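The layering dynamic just described can be given a concrete, if simplified, form. The sketch below is not drawn from the chapter; it is a hypothetical model (the component names and the affected_by_change helper are our own inventions) of how applications stacked upon shared systems software inherit exposure to changes made in the layers beneath them.

# Toy model of a layered computing environment: each component lists
# the components it depends upon (applications atop systems software).
deps = {
    "operating_system": [],
    "dbms": ["operating_system"],
    "report_writer": ["dbms"],
    "payroll_application": ["dbms", "operating_system"],
    "client_tracking": ["dbms", "report_writer"],
}

def affected_by_change(changed, deps):
    """Return every component that transitively depends on `changed`."""
    affected = set()
    frontier = {changed}
    while frontier:
        frontier = {c for c, ds in deps.items()
                    if any(d in frontier for d in ds) and c not in affected}
        affected |= frontier
    return affected

# A change to the DBMS layer "bumps into" every application built upon
# it; each must now be adapted or replaced, so yesterday's solution
# creates today's maintenance problems.
print(sorted(affected_by_change("dbms", deps)))
# prints: ['client_tracking', 'payroll_application', 'report_writer']

Even in this toy form, the dependency graph makes visible why few systems simply “die”: removing or replacing a lower layer obligates work throughout everything built above it.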
Once a system is implemented, it either stays in operation and maintenance, or is converted and assimilated within some other system. Most systems are replaced or upgraded upon obsolescence. Few computing systems “die” or disappear without replacement by another computing system.

In the remainder of this section we will review the phases in the system life cycle utilizing the six analytic perspectives listed in Section 1. We will then examine a common software system found in modern organizational computing environments, an automated text processing system. Here we will draw upon a recent study we conducted of such a system in use at CSRO, a computer science research organization (Scacchi, 1980). After presenting this case as a detailed example of a software system life cycle, we will review it with respect to the six analytic perspectives.

2.1 Initiation and Adoption
System initiation is the phase of the system life cycle when a computing system is proposed and the decision to adopt it is made. In considering computing system initiation we are interested in the activities surrounding system promotion and the incentives for computing system adoption. Traditionally, new or better technical components or complete systems are said to move “naturally” from inventor or developer toward users and consumers. Computing system promoters or adopters can, however, be motivated by different, potentially conflicting, incentives. Vendors are interested in selling (or overselling) systems for profit. Federal agencies seek ways to improve their efficiency and spheres of influence. Specialists sometimes want to work with newer, more sophisticated systems instead of procrustean user applications. Users seek ways of improving their productivity or organizational effectiveness with minimal or extensive computing system use.

Rational analysts hold pleasingly simple conceptions of the diffusion of technical innovations. To wit: successful technologies meet a “need” of individuals or organizations (Licklider and Vezza, 1978); as people learn to appreciate that a new technology meets some need better than its alternatives, and as the costs of obtaining the technology decrease, the technology is adopted on a large scale (Simon, 1977). This rational account is commonly articulated by technologists and the staff of computer-using organizations responsible for developing new computer applications.

The primary analytical difficulty that confronts any scholar who studies these claims arises in the attempt to conceptualize “need.” Structural analysts and analysts using an organizational politics perspective
approach this problem differently. Structural analysts typically focus on the information processing task to which computing technologies can be applied. Gerson and Koenig (1979), for example, analyze the applicability of computing to different tasks by examining the extent to which the tasks can be easily rationalized. They indicate that organizations that carry out a large volume of easily rationalizable tasks (e.g., insurance companies that process bills, claims, and payments) are much more likely to find computing applicable than are firms where rationalizable tasks are a smaller fraction of the work done (e.g., law firms). Organizational politics analysts, on the other hand, focus on the interests and intentions of participants in a decision to adopt a new computing arrangement.

One important study examined the relative importance of structural and political explanations of computing adoption. Danziger and Dutton (1977) carefully examined the role of institutional needs and the social features within and outside a large sample of American local governments to predict the rates at which they would adopt computing applications. By comparing rates of computing adoption within similar kinds of organizations, they were able to use organizational domains (population of the jurisdiction) as an index of need. Most American counties, for example, administer property taxes. One can assume that the utility of computing to help ease the burdens of tax roll preparation would be similar for counties with similar populations. Danziger and Dutton found that governmental size and complexity (i.e., “need”) account for about half of the explained variance in levels of computing adoption. The other half of the explained variance is accounted for by elements of the government’s milieu (e.g., presence of outside funding for applications) and the social milieu inside the government agency (e.g., reliance upon professional management practices, variety of participants influencing decisions to adopt computing). As indicated by this study, then, organizational adoption of computing is not simply based upon economic conceptions of task demands; it is also influenced by a variety of specifiable social features of an organization.

Often the staff of an organization will point to “technical benefits” such as labor reduction, internal efficiencies, and increased managerial control to justify the adoption and use of computing technologies. However, careful studies of the actual payoffs of computing within organizations indicate that other, more private rationales often dominate the decisions to adopt or sustain computer-based systems. Concerns for increasing administrative attractiveness, rather than internal administrative efficiencies, helped sustain an otherwise relatively useless client-tracking information system in Riverville (Kling, 1978b). The welfare agencies that used UMIS appeared more efficient to federal auditors and were able to maintain a steady flow of funds since the presence of computing enhanced their image of
effective administration. UMIS was a political resource for the municipal welfare agency in Riverville.

Laudon’s (1974) insightful case studies of four police computing systems in state and local governments illustrate how these systems served the interests of particular elite groups in the governments that adopted them. Sometimes the introduction of these systems was instrumental in bureaucratic politics. For example, one state governor wished to establish a state attorney general’s office, but the local police departments were extremely autonomous and their staff did not want a state attorney general’s office. As a quid pro quo to help establish the attorney general’s office, the governor offered to develop a state-run computer system to keep track of wants, warrants, and criminal histories for all jurisdictions within the state. Its administration would be centralized in the new attorney general’s office. Thus the governor sought to establish an attorney general’s office by offering in exchange a police information network for local police. Laudon views new computer systems as political instruments that are selected to fit the political contours of existing organizations with their own ongoing conflicts and coalitions.

In one of our own case studies of a multinational engineering firm (WESCO), similar patterns were sometimes found. Chemical engineers at WESCO utilize automated simulations to help calculate the design parameters of chemical processing plants. Sometimes they seek simply more precise estimates for the sizes of pumps, compressors, and costly equipment. But there are also occasions when engineers have trouble convincing auditors hired by their clients of the efficacy of their designs. At times, engineers have moved from hand calculations to available simulation programs in order to help snow the auditors. The calculations look precise, accurate, and sophisticated. Since WESCO’s simulation programs are proprietary, it is also more difficult for auditors to double check the correctness of the calculations or the assumptions made.²
² These simulations utilize tables for material properties (e.g., specific heat) and special thermodynamic correlations for computing others. Properties of materials change in different temperature ranges. It is impossible to determine which values are used without careful inspection of the programs. This is not a trivial matter. Some of the chemical processes engineered by WESCO operate at cryogenic temperatures, while others operate near the boiling point of water. The chemical engineers select among a dozen thermodynamic correlations to compute other properties. These correlations are proprietary and an outside auditor cannot confirm whether the selected correlation for a particular analysis was appropriate. Even engineers within WESCO differ in their judgments about which correlations should be selected for a particular set of physical conditions. Processing plant design parameters vary as much as 20%, depending upon the correlation chosen. The engineers typically work within 10% tolerances and selection of an appropriate correlation is technically and substantively important.
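The stakes described in this footnote can be illustrated with a small numerical sketch. Nothing below comes from WESCO: the two correlations, the temperatures, and the flow figures are invented stand-ins, chosen only to show how two plausible property estimates can push the same sizing calculation well outside a 10% tolerance.

# Two hypothetical correlations for a fluid's specific heat, in
# kJ/(kg*K), as a function of temperature in kelvins. Real proprietary
# correlations would be far more elaborate; these are illustrative only.
def correlation_a(temp_k):
    return 2.0 + 0.004 * temp_k

def correlation_b(temp_k):
    return 1.2 + 0.006 * temp_k

def required_duty(mass_flow, t_in, t_out, cp):
    """Heat duty in kW to warm mass_flow kg/s from t_in to t_out."""
    return mass_flow * cp * (t_out - t_in)

t_mean = 0.5 * (120.0 + 300.0)   # evaluate cp at the mean temperature
duty_a = required_duty(10.0, 120.0, 300.0, correlation_a(t_mean))
duty_b = required_duty(10.0, 120.0, 300.0, correlation_b(t_mean))

spread = abs(duty_a - duty_b) / min(duty_a, duty_b)
print(round(duty_a), round(duty_b), round(100.0 * spread, 1))
# prints: 5112 4428 15.4

Here the two estimates of the same heat duty differ by roughly 15%, outside a 10% design tolerance, and an auditor with no access to the correlations has no way to tell which figure to trust.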
In the language of rational analysis, one could say that engineers at WESCO and administrators in Riverville found computing “applicable” to decisions in their organizations (Licklider and Vezza, 1978). But much is lost with such banal characterizations. Sometimes people choose to utilize computer-based systems because they value the expected technical benefits such as more precise calculations, speedier information flows, and more flexibly organized reports. On other occasions, as shown above, computer use serves as an instrument of bureaucratic politics (Kling, 1974; Laudon, 1974). Computing applications used as political instruments do not easily fit the accounts of rational analysts who ignore the politics likely to be found in any organization (Simon, 1973; Burch et al., 1979). Many analyses of computer use developed within the rational tradition are sociologically naive. These analyses focus on the technical payoffs of computing in organizations and neglect patterns of use that are inconsistent with an economic conception of organizational activity. But to the extent organizational politics influences the move to adopt computing, its dynamics must be part of any systematic understanding of the antecedents or consequences of computer use in organizations.
2.2 Selection

It is commonly held that computer-using organizations benefit by adopting computer applications developed by an outside vendor or some other organization. Computer scientists and federal research supporters have paid serious attention to schemes for developing machine-independent software, to help diminish the costs of transferring applications across organizations. Kraemer (1977) studied the actual rates of application software transfer between American cities. He found that a typical city automates about 40 applications, but, in general, only one has been transferred in. Kraemer and King jointly studied federal projects in which cities were funded specifically to transfer applications, and found that applications were rarely transferred to other cities (Kraemer and King, 1979).

One could interpret these findings in a number of ways. One could bemoan the ignorance of public officials and wish that they were “wiser, technically competent, and more innovative.” More sensibly, Kraemer analyzed the incentives provided for particular groups in an organization to transfer-in an application. Subsequently, Perry and Kraemer (1978) found that the structural attributes of the organization and of the computing application influenced their adoption and selection. In most organizations transferring-in a new application does not provide strong rewards for a local computer group, compared with local development. A manager who supervises local development can increase or
maintain staff, increase his budget, and become an expert on yet another critical operation. Interactionist analysts note that given the development bias of the computing world, a new developer may gain additional prestige compared to an actor who merely transfers-in an existing application (Kling and Gerson, 1977). Together, these create a mobilization of bias against transferring-in new applications.

When computing applications are adopted to help the staff of an organization cope with some information processing demands, there may be a “political” dimension to their decisions. Often, participants in an organization differ over the forms of computing they prefer. Pettigrew (1973) found that the selection of new computer systems made by the board of directors in a British manufacturing firm was influenced more by their trust in certain advocates than by the technical merits of the various proposals. Those having good personal relations with board members were most likely to have their proposals for choice of vendors or kinds of equipment accepted, even when their adversaries within the firm had better cases on procedural or technical grounds.

Rational analysts often assume that technical criteria or the technical merit of system proposals will determine what computing systems will be selected by an organization. On this view, software transfer should likewise be “natural,” barring technical limitations or poor dissemination of transfer information. But the studies cited here point to the potent role that incentives, structural attributes of a computer-using organization, and organizational politics play in influencing whether computing is adopted, which technologies will be selected, how they will be developed, and whom they will serve.
2.3 Requirements Specification
Lately, within the computer science community, much attention has been directed at the capture and analysis of system specifications (Boehm, 1976). The current rational emphasis is to treat the capture of system specifications as a technical task which can be analyzed in a “structured” manner. But analysts utilizing other perspectives primarily attend to the interplay between users, specialists, and managers when developing system specifications.

The primary organizational concern for system specification activities, according to human relations analysts, is the closeness of fit between the system and user needs. Oftentimes, satisfying users’ needs is described as an important concern in system design (Hedberg and Mumford, 1975; Lucas, 1974). But a careful examination of this work reveals that emphasis on
user involvement is really directed at specifying the functions and workings of an intended system: specifying the system design. According to human relations analysts, lack of involvement in specifying a system’s design can lead users to be dissatisfied with their jobs (Lucas, 1978; Mumford and Banks, 1967; Mumford, 1972). Lucas (1974, 1975, 1978) argues that one recurrent reason why computing systems fail (i.e., fail to satisfy user desires) is that users are not directly involved in specifying the system’s design. He also argues that smooth interaction between the users who specify and the specialists who develop the system is essential for achieving satisfying and successful computing systems. Mumford and Pettigrew (1976) claim that computing increases uncertainty among people in the organization and decreases their ability to cope with these uncertainties. Mumford stresses that if system designers (or specification analysts) attend to the “humanistic needs” of users, they will mitigate users’ uncertainties and keep them satisfied when using a computing system. Attending to these needs should therefore facilitate organizational success with computing. But it is also quite clear that constraints on organizational resources and the distribution of computing expertise must be attended to in order to determine whether these needs can be met and the best ways to meet them.

In related work, Kling (1973, 1977) examined the way in which emphasis on different system requirements translates into computing systems that are either “machine centered” or “user centered.” Machine-centered systems reflect the focusing of system specifications toward efficient utilization of the computer. User- or person-centered systems reflect the focusing of system specifications toward minimizing computing hassles for users and improving users’ job content. Kling (1977) has developed a conceptual framework that stresses both concern for the intended uses of the computing system, and sensitivity to local organizational politics and to the value orientations of a system’s designers, as a means for analyzing a computing system’s focus.

Computing systems may be specified to reflect a variety of distinct interests. These include the perceived cost savings of computerization, the problem-solving challenge for specialists to develop new systems and skills, reduced work demands and improved work effectiveness for instrumental users, and the perceived power payoffs achieved with computer-based organizational reforms. The specifications of a new computing system are shaped by the conflicts and agreements established between participants pursuing these different interests. The rational perspective suggests that conflicts over system design specification, with respect to interests served, are static, hence resolvable, rather than
potentially problematic, fluid, and ongoing. But interactionist and organizational politics analysts view these shifting conflicts as an intrinsic part of computing system specification dynamics. They suggest that system adopters may be as much concerned, if not more, with organizational gains, losses, and strategies for achieving favorable balances of resources and administrative influence as with specifying the technical intricacies of a computing system. The system developer thus may face a difficult dilemma in deciding which system “requirements” are to be further specified and met, and for whom.

2.4 Design
System design is often cast as an essential activity in the system life cycle. After all, if you cannot build a computing system that is understandable and manageable, why should the system be expected to work well? Though such a concern may seem simple and obvious, practical design decisions depend in part on the size of the system and on the number of different people developing it. Many computing specialists are trained with systems that are relatively well-designed and small. But some analysts, for example, Kernighan and Mashey (1979), discount the utility of structured software design techniques for small programs (less than 1500 lines of code) when programming in a sophisticated computing environment. However, most organizations probably have limited computing resources and technical expertise. In many large organizations, large-scale integrated software systems can require multimillion-dollar expenditures and many man-years of labor for development. In these settings, the efficacy of a system design can mean new savings and opportunities for the organization, or massive cost overruns.

Software design has emerged with a strong engineering influence, and a rational perspective which, at times, views software as a form of mathematics. So conceived, software design is thus a series of structured tasks executed by a hierarchically organized team of computing specialists (Mills, 1976, 1977). The common problems of software systems that support a rationale for directing attention to system design activities include: the high cost of design errors (when the system does not function as specified); unfulfilled need for well-designed systems which are easily extended; and the absence of appeal to the intuitive correctness embodied in an engineering approach. Furthermore, engineering design strategies are perceived to provide benefits in later stages of a system’s life cycle [e.g., a good system design results in easier-to-maintain systems (Boehm, 1976); system reliability is achieved through structured design and not
However, the literature citing empirical demonstration of such claims is scarce. These techniques are usually promoted for their rational, but heuristic, value. A finer examination of the engineering approach to system design reveals a casual concern for the actual users of a system. Part of this problem stems from the pervasive ambiguity of “who the users really are.” That is, users can be organizations, managers, specialists, instrumental users, or clerical users. These people might each “use” a computing system in a very distinct way to realize different ends. Much of the software engineering literature tends to identify users as large organizations and their managers (Boehm, 1976; Brooks, 1975; Mills, 1976, 1977). But little attention is directed at how users are to be involved and which users are to be involved in system design activities. Judging by the literature on design activities, there is little understanding of the ease with which computing-naive users can participate in system design activities. Human relations analysts continually emphasize the importance of user participation in system design as the key to system implementation success (Lucas, 1974, 1975, 1978; Mumford, 1972; Mumford and Banks, 1967; Mumford and Sackman, 1975). Many computing users have little or no prior experience in system design activities; thus they cannot be expected to understand all of the technical exigencies that must be attended to during system development. Though these users may be more satisfied because of their participation, neither their interest nor their satisfaction guarantees that adequate system specifications will be captured. Similarly, participation does not guarantee that the implemented system will meet users’ needs or serve their interests; we may all be aware of how participation in organized activities can become merely passive or symbolic when outcome decisions have already been specified by others in control. Concerns such as these may be less at issue for technically skilled professionals but more at issue for casual or clerical users of computing systems. Another approach to system design directs attention to the roles of systems designers and actual users as well as the intended administrative, social, and political uses of resultant systems in organizational settings. This approach represents a merger between the organizational politics and interactionist perspectives. From this spliced perspective, software design is viewed as an activity inseparably technical and social, taking place in settings where organizational politics, the negotiation of computing resource balances, staff sentiment, and the like are all formative ingredients which shape computing system design, development, and use. Keen and Gerson (1977) outline the way in which the software system design process is embedded in the political order of an organizational setting.
The political order of the organization shapes, constrains, or defines every phase of systems development and use. For example, conflicts over which system specifications are to be reflected in a system design may be resolved or disputed according to who has the greatest influence in specifying the design or use of the intended system rather than according to one’s technical competency. These authors suggest strategic heuristics to help software designers to recognize and avoid political entanglements in the system design process. The recent work of Kling and Scacchi builds upon many of the preceding findings of the social character of system design (Kling, 1978b; Kling and Scacchi, 1978, 1979a,b). Computing system design is a recurrent activity in many social settings. They examine computing system designs with respect to the present and prospective distributions of organizational influence and computing resources for managers, end-users, and system designers within an organization. Different system designs reflect the selected interests of those with substantial influence or resource control. Similarly, different designs when implemented can shift influence and resource patterns within an organization. But such shifts are constrained with respect to the proposed system, the “purposeful” activities of organizational actors, and the local computing package (Kling and Scacchi, 1979b).³ With respect to a computing package, the purposeful activities of organizational actors may include specialists trying to pursue rational system design techniques, managers attempting to extend their domain of control and influence, instrumental users trying to attend to job content and work deadlines, and computing promoters advocating the adoption of new or “better” computing technologies. The situated interplay of organizational actors pursuing their interests, the proposed system, and the local computing package outlines the social character of computing system design (Kling and Scacchi, 1979a,b). Situations like these often give rise to conflicting system design specifications. Depending on the influence of different users, the resulting system design will emerge from the resolution or stalemates of their system design conflicts. This, however, does not imply that such conflicts are monumental. Many will be resolved through deference to technical expertise, managerial authority, or administrative fiat. But system designers often can find themselves at the nexus of these interactions. Interactionist analysts thus point to the need to account for the meanings different actors give to system design, the social and technical resources at their disposal, the organizational setting, and the organizational computing arrangements as inseparable features of the system design process.

³ According to Kling and Scacchi (1979b, p. 108) a computing package “includes not only hardware and software facilities, but also a diverse set of skills, organizational units to supply and maintain computing-based services and data, and sets of beliefs about what computing is good for and how it may be used efficaciously.” Furthermore, an organization’s computing package is situated within the shifting segments of the computing world and other relevant social worlds. A computing package is thus an evolving ensemble of social and technical “fixtures” that encourage and constrain organizational computing activities (Kling and Gerson, 1978; Kling and Scacchi, 1979b; Scacchi, 1980).

2.5 Implementation
Much of the present discussion on system implementation comes from the Management Information System (MIS) arena. In this arena, implementation concerns the management of organizational change brought about through technological changes (Keen and Scott Morton, 1978). Presently, the introduction of new computing systems into an organizational setting is the major focus for system implementation research. The major issue pervading the literature is: will a project to implement an organizational information system succeed or fail? The research emphasis is often on assessing what features of an organization and its staff, and the computing system and its design, influence system success or failure a priori. Most researchers utilize a structural perspective in their analysis of system implementation. Studies of system implementation addressed to computing specialists often rely on anecdote and “horror stories” (Brooks, 1975; Glass, 1977) rather than systematic empirical studies. These arm-chair analyses similarly deal with system implementation in terms of success or failure. For example, Brooks (1975, p. 47), commenting on the implementation of OS 360, notes: “It was very humbling to make a multimillion-dollar mistake, but it was also very memorable.” Most of us are of course aware that OS 360⁴ and its descendants are now found in many organizational settings. So why would it be considered a failure? An important distinction must be made in defining the success or failure of a system implementation. MIS analysts treat success or failure in terms of the achievement of management goals, whereas computing specialists are often attentive to technical criteria. Thus construed, a system implementation project could be considered by computing specialists to be a technical success, while failing to meet organizational or management goals; and conversely, a system implementation project might succeed in involving satisfied users and attaining management goals, yet be considered a technical failure. An interactionist analyst might therefore interpret this distinction to reflect the social meanings that different organizational actors can utilize in assessing the success or failure of a system implementation project.
⁴ OS 360 is not an information system, but is a large software system nonetheless.
Computer-based systems may be adopted by an organization, but still be difficult for the staff to use. Frequently, casual observations are made about computer-based information systems that produce voluminous or unreadable reports. Addressing this difficulty of system use, human relations studies of MIS implementation repeatedly conclude that substantive involvement of computing users in system specification and design leads to systems which are better accepted. Unfortunately, “user involvement” has become a cliché, and both “user” and “involvement” are ambiguous in referent. The better studies differentiate kinds of “users”: managers who occasionally see computer-based reports; staff analysts who generate reports; clerks who enter and retrieve individual transactions; and computing specialists who use tools or techniques to tune or maintain the performance of the system. Given this variety of users, we might infer that there exist divergent expectations about the potential for system implementation success or failure, and the nature of the expectations of a given user depends on the extent of his or her involvement. “Involvement” can range from being informed about design decisions made by technical staff, to actually creating design specifications. The research indicates that people who use computer-based systems will appreciate them to the extent that the systems help them meet their task demands and to the extent they feel some control over their system’s behavior. The empirical studies of MIS implementation generally focus on comparisons between the conventional wisdom of how to implement an information system and what is done in practice (Alter and Ginzberg, 1978; Keen and Scott Morton, 1978; Lucas, 1975, 1978). Many of these studies indicate that practice does not usually follow academic wisdom. For computing specialists, for example, implementing an information system may be challenging, necessary, or a preferably avoidable set of tasks. We ask, why does practice not follow advice? Rational analysts often slight system implementation activities. It is as if they define system implementation as just coding the system design. Thus construed, “implemented” systems should somehow naturally “slip” into existing organizational activities. But should the organization’s implementation of a computing system be clogged up by unforeseen organizational problems, rational analysts will often cite the need for “better management” to improve things. Structural MIS analysts often focus on those characteristics of the organization that must or will be changed to achieve the successful introduction of major technological changes (Alter and Ginzberg, 1978; Brooks, 1975). Human relations analysts, on the other hand, have documented the importance of user participation in system design to the achievement of successful information system implementation (Edstrom, 1977; Hedberg and Mumford, 1975; Lucas, 1974, 1975, 1978; Mumford and Banks, 1967).
In addition, the organizational role and power of different users are posed as a complementary characterization to the human relations approach as the basis for the way user involvement is practiced and achieved (Mumford and Pettigrew, 1976). But analysts following an interactionist and organizational politics perspective would attend to how the implementation activity is operationally defined by different actors, the constraints on their involvement, and the placements of value on managerial versus technical objectives as characterizing features of the social process of system implementation.

2.6 Testing and Validation
Software system development practices often stress the importance of testing, validating, and verifying the correctness of a system (Boehm, 1976; Mills, 1976, 1977). In theory, one would like to be able to generate automatically a set of test data sufficient to demonstrate the probable correctness of a robust class of programs and accompanying data sets. However, for programs of moderate complexity, the set of test data needed to exercise all paths through a program is infeasibly large (see the sketch at the end of this section). Some promising research is proceeding on various schemes to automate tests for special program conditions. However, basic questions have been raised as to whether people will trust an automated program tester or verifier, because of what it takes in practice to convince someone of the efficacy of automated “proofs” (DeMillo et al., 1979). DeMillo and associates argue that the rationalist appeal for viewing software as a part of mathematics (Mills, 1976, 1977) (thus facilitating the application of “formal” theorem-proving techniques) is confused and misconceived. DeMillo and company sympathetically agree with the appeal and robustness of the mathematical perspective, but for nonrationalist reasons. They argue that mathematics is, rather than a formal process, an ongoing, interactional social process. In contrast to the research on new systems or techniques for program testing, the current state of practice does not rely upon much automation at all. Computing systems are “proved correct,” validated, or thoroughly tested, for example, either by using the system in production or by selectively demonstrating the system’s capabilities with data “manually” selected by a specialist or knowledgeable user. An adjacent matter to that of software testing and verification concerns the accuracy, reliability, and auditability of results processed with automated systems. For example, though a program (or parts of it) may be shown to behave “correctly” for a given test data set, ensuring the accuracy of data processed by many individual users across many other programs and organizational units leads to practical difficulties.
These difficulties become most apparent when one attempts to reconstruct what processing occurred and when. These are often practical concerns of organizational computing users who are accountable to clients and governmental agencies as well as to other computing specialists and instrumental users. Such topics are often of interest to analysts of organizational politics. But these topics have not yet garnered much attentive research within the computer science community.
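The path-explosion point above can be made concrete with a small sketch. It is our illustration, not an example drawn from the cited literature, and the routine is hypothetical: each independent branch doubles the number of distinct execution paths, so even a short program overwhelms exhaustive path testing.

    /* Our hypothetical illustration of path explosion: n independent
       branches yield 2^n distinct execution paths through process(). */
    #include <stdio.h>

    void process(const int flags[], int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (flags[i])                    /* each test doubles the path count */
                printf("option %d enabled\n", i);
            else
                printf("option %d disabled\n", i);
        }
    }

    int main(void)
    {
        int i;
        double paths = 1.0;
        for (i = 0; i < 40; i++)             /* a routine with only 40 such tests */
            paths *= 2.0;
        printf("paths to exercise: %.0f\n", paths);  /* about 1.1e12 paths */
        return 0;
    }

At 2⁴⁰ paths, roughly a trillion test cases would be needed for complete path coverage of a routine with only forty independent tests, which is one reason practice falls back on selective, “manual” demonstration.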
2.7 Documentation

System documentation usually is intended to function as a source of information for what a given computing system is good for and how to use it. Rational analysts argue that the preparation of documentation should not be separated from system development. Furthermore, they argue that good interactive text processing facilities permit quick and convenient production of many kinds of documentation that might otherwise be unobtainable, impractical, or very expensive (Mashey and Smith, 1976). When a new computing system is introduced, its users benefit from both tutorial manuals and reference manuals. Documentation used to supplement user training activities usually provides demonstrations of example processing tasks. System specialists, who must maintain a system, can use good design and program implementation documents. Such documents may in fact include routine procedures for (limited) data and program testing. However, adequate and up-to-date software documentation of these kinds is continually a weak feature of most software systems (Kling and Scacchi, 1979a,b). Our observation is that it is rare for a computing system to have up-to-date, high-quality documents of all these kinds. Software system manuals can often be measured in inches, but their currency and adequacy vary. Palme (1978) has reported that though documentation may abound for individual programs, documents that describe how multiple programs can interact under user direction are unavailable or effectively unretrievable from the documentation morass. Maynard (1979) has indicated that some users want manuals that are less wordy, while others want manuals with more explanations. Belady (1978) and Brooks (1975) have indicated that, as an organizational task, the sheer production of system documentation can consume significant amounts of organizational materials and services. In addition, the comprehensibility of computing systems with substantial documentation of varying quality has been raised as a practical organizational, technical, and person-centered concern (Belady, 1978; Brooks, 1975; Kling and Scacchi, 1979a,b).
According to interactionist analysts, providing practical and up-to-date documentation demands time, attention, skills in clear and concise writing, an inclination to help users, and an organizational commitment to encourage and support documentation activities (Kling and Scacchi, 1979a,b). In practice, documentation work is subject to many of the contingent demands that can pervade an organizational computing setting. Interactionist analysts observe that the people who produce or use system documentation manage or shirk the contingent demands for documentation upkeep and adequacy. On the other hand, rational analysts suggest that such dilemmas do not arise in resource-rich computing environments where suitable on-line documentation and storage are available (Kernighan and Mashey, 1979). But it is questionable whether most organizational computing environments are resource rich. Though the rational analyst might then argue for computing resource redistribution, organizational politics, interactionist, and perhaps structural analysts would counter that such redistributions are not achieved without attendant organizational conflicts.
2.8 Maintenance, Modification, and Conversion

Current system life cycle costs reflect the high and growing cost of system maintenance (Belady, 1978; Boehm, 1973; Lientz et al., 1978). The organizational resources spent on system maintenance and modification are often much larger than expected. Furthermore, maintenance costs are difficult to predict or determine after the fact because these costs can be distributed across or hidden within an organization’s budget. The traditional concern, as primarily addressed by rational analysts, is with the high cost of software maintenance rather than with the nature of the activities that constitute it. The available figures for system maintenance do not distinguish the costs of system alteration in terms of time, skills, inclination, forsaken opportunities, attention, and other organizational resources or their distribution across an organization (Boehm, 1976; Wolverton, 1974). Similarly, the figures do not account for how these social resources are consumed by users and managers interacting with specialists trying to implement the requested enhancements in an already complex computing system. For example, in order to convert a related collection of data files (a data base) over to an integrated data base management system, we might assume the specialists doing the conversion have the necessary knowledge of the contents (and structure) of the data base (Oliver, 1978).
But if the data base has been in the organization longer than the conversion specialists, then these specialists must rely on information provided by others to implement and test the conversion. This information may be in the form of either system documentation or verbal information from others. Neither of these sources is necessarily reliable, up-to-date, or sufficiently detailed. System documentation, as we have already observed, is usually described as being of poor quality and out-of-date, while deriving information from others can sometimes be problematic because of (a) differences in perspectives toward the uses of a data base, (b) the specialists’ concern for detailed structural information versus the users’ concern with organizational procedure, and (c) the possibility that the data base is sufficiently large and complex to be incomprehensible to any single person. Interactionist analysts note that these organizational dynamics and dilemmas may contribute significantly to the high cost (and organizational distribution) of system maintenance (Kling and Scacchi, 1979b); a sketch at the end of this section illustrates the kind of detailed layout knowledge such a conversion demands. In order to better control rising maintenance costs, some analysts recommend rearranging the system development and maintenance work of programmers. Also, they suggest managing programmers with “proper” supervisory control. In other words, they recommend altering or increasing staff resources and management attention. These suggestions conform to a structural analysis. Rational analysts also suggest the importance of “better management” and improved “programmer motivation” as partial solutions to the system maintenance problem (Cave and Salisbury, 1978; De Roze and Nyman, 1978). But these are often vague and expensive remedies to recurrent computing work problems. Such remedies may not be readily implemented because of contending demands (and battles) for finite organizational resources and staff attention, according to organizational politics analysts. The contingent needs of organizational groups to get software systems maintained and system maintainers more committed to meeting user requests may be better understood by attending to work group interactions and resource constraints, according to interactionist analysts. Currently, participants within the computing community emphasize software system development over system maintenance (Kling and Scacchi, 1978, 1979a,b; Learmonth and Merten, 1978). Many development activities are based on the attractive assumption that good system development practices can significantly reduce system maintenance costs. Rational analysts, for example, claim that demands for software maintenance activities in resource-rich, cooperative work settings are minimal due to an abundance of software development tools and an integrated programming environment (Kernighan and Mashey, 1979).
But demands for system alteration can come from organizational actors far removed from system operation. Computing systems are often treated as extensible entities. As such, older, embedded systems may be revitalized and extended by altering existing subsystems or adding new ones. In this manner, systems are developed layer upon layer: each layer adding an increasingly complex web of interacting system modules (Scacchi, 1980). But many systems become distended, procrustean, and fragile after numerous repairs, adaptations, and extensions. Understanding the workings and intricacies of such embedded systems becomes more difficult, thereby complicating system maintenance tasks. Though software systems may have no physical moving parts, systems apparently “wear out.” Wearing out is another way of saying that a system is too complicated or costly to use and maintain. Depending on the importance of a worn out system, coalitional or individual efforts may be initiated to convert the old system to a new one. The availability of new technological developments may further encourage some organizational actors to initiate system (and data) conversions or to adopt newer systems (Oliver, 1978). Similarly, the managerial desire to reorganize a perceived fragmentation of staff interactions and computing work can be another force acting to bring about the demise of an old system and the initiation of a new one. These conversion activities help recreate a computing system “life cycle.” Rational and structural analysts tend to slight or ignore the practice of converting worn out systems into improved, often more complex systems. Their analysis, which may promote or assess the efficacy of computing system adoption, often treats system initiation as an activity discrete from (if not independent of) the conversion of older systems. Interactionist analysts, on the other hand, assert that the organizational and technological dynamics that surround computing system maintenance, conversion, and initiation characterize, in particular, the continuous production of computing system life cycles (Kling and Gerson, 1977; Scacchi, 1980).
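As a minimal illustration of the conversion work discussed earlier in this section, consider the following sketch. Everything in it is assumed for the example: a hypothetical 80-column case record whose layout (columns 1-8 case identifier, 9-38 client name, 39-46 date opened) the conversion specialist must learn from documentation or from other staff. If that layout information is stale or wrong, the program still runs, and the records loaded into the new data base are silently wrong.

    /* Hypothetical flat-file-to-data-base conversion step. The record
       layout is an assumption obtained from others; the program cannot
       check that the assumption is right. */
    #include <stdio.h>
    #include <string.h>

    struct case_record {
        char id[9];        /* columns 1-8  */
        char name[31];     /* columns 9-38 */
        char opened[9];    /* columns 39-46 */
    };

    int main(void)
    {
        char line[128];
        struct case_record r;
        while (fgets(line, sizeof line, stdin) != NULL) {
            if (strlen(line) < 46)             /* shorter than the assumed layout; */
                continue;                      /* skip: possibly a layout mismatch */
            memcpy(r.id, line, 8);             r.id[8] = '\0';
            memcpy(r.name, line + 8, 30);      r.name[30] = '\0';
            memcpy(r.opened, line + 38, 8);    r.opened[8] = '\0';
            /* here the record would be handed to the data base loader */
            printf("load: %s | %s | %s\n", r.id, r.name, r.opened);
        }
        return 0;
    }

The parsing logic is trivial; what is costly is establishing and verifying the layout assumptions recorded in the comments, which is precisely the social work that aggregate maintenance cost figures do not capture.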
2.9 Evaluation

With the ever-growing complexity of computing systems, problems related to system performance evaluation have received variable attention from researchers and practitioners alike. Interest in computing system evaluation is usually split between specialist and management concerns. For computing specialists, systems are evaluated according to some set of performance measures. These are generally diagnostic, but machine-centered, measures.
For example, hardware or software monitors may be used to measure processor utilization and operating system scheduling policies, and analysis of the derived measures can be used to support decisions regarding hardware enhancements or software tuning (a minimal sketch of such a measure appears at the end of this section). Use of system performance measurements can thus “create” certain demands for system maintenance. Management concerns, on the other hand, address evaluation of the effectiveness of a computing system in fulfilling “hard” or “soft” organizational goals (Keen and Scott Morton, 1978). These evaluations are typically post facto rather than diagnostic (Keen, 1975). For example, UMIS in Riverville can be evaluated in terms of how well it improved caseworker productivity (i.e., number of cases processed per unit of time/person), or how well it integrated welfare services. But the assessment of how UMIS enhanced administrative credibility in Riverville is an evaluation not likely to be conducted by management, at least publicly. People who work with computing systems are as much part of an organization’s information processing system as is the computer system itself. Computer performance measures that “objectively” monitor machine-centered system activity discount or ignore the import of instrumental users’ concerns. Rational analysts often assume that maximum or optimal system utilization is the important user need. But it is often computing specialists or computing facility management who decide whether a system utilization pattern is suitable or constrained. It is these people who decide what measurements to make and how to interpret them. Though we might believe that such decisions are always based on competent, technical reasoning, there is no empirical basis that supports the accuracy of such a belief. In fact, we might as easily assume just the opposite: decisions concerning effective computing resource utilization are based on ad hoc or impressionistic assumptions of computing specialists or managers. Interactionist and organizational politics analysts might argue that specialists’ or managers’ use of system performance measurements, selective even though well-intended, can serve to legitimate attempts on their part to establish or reconfigure computing resource utilization patterns according to their desires. The accuracy of either of these system performance management perspectives awaits further empirical study. One management approach to system evaluation is to quantify and assess the costs and benefits that accrue from an organizational computing system. Computing systems are often initiated, designed, and implemented with the outlook that certain intangible or qualitative benefits will occur. Rational analysts often claim that improved managerial control, “better” and more timely information, improved decision support, and the like are the desired benefits (Bernard et al., 1977; Champine, 1978; Simon, 1973; Walston and Felix, 1977).
But a thorough accounting of computing costs should note cost displacements, and should reveal the presence or absence of real cost savings. Cost displacements are “hidden” costs borne, for example, by users when hassling with poor documentation or when trying to figure out why a system does not behave as expected. In practice, such accountings may not be done regularly. But it is often assumed that the outcomes of such accountings will reveal real cost savings for computing use over alternative techniques. Again, we must wait for empirical assessments. In general, then, computing work is assumed to be cheaper than manual work for comparable tasks, but, in practice, this assumption is not systematically checked. Furthermore, it is not clear that, if such an accounting were performed and revealed the lesser expense of manual techniques, users would or could readily drop computing and move to the “rational” cost-efficient method.
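To make the contrast between machine-centered measures and user concerns concrete, the following sketch shows the kind of figure a software monitor yields. The tick counts are hypothetical, and the sampling mechanism is assumed rather than shown; the point is that the resulting number is mute on whose work was served or hindered.

    /* Sketch of the machine-centered measurement a software monitor
       yields: utilization as busy time over elapsed time. */
    #include <stdio.h>

    int main(void)
    {
        /* hypothetical counts from one observation interval */
        long busy_ticks  = 8300;    /* ticks the processor was busy */
        long total_ticks = 10000;   /* ticks in the whole interval  */
        double utilization = 100.0 * (double)busy_ticks / (double)total_ticks;
        printf("processor utilization: %.1f%%\n", utilization);
        return 0;
    }

An 83% utilization figure may prompt a hardware enhancement or a rescheduling of jobs, yet it says nothing about user hassles, report quality, or the hidden cost displacements discussed above.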
2.10 CSRO Case: A System Life Cycle

In the early 1970s a computer science research organization (CSRO) acquired a large-scale time-sharing computer system to support its ongoing activities in various aspects of computer science research. We studied this computing setting during 1977-1978, well after the computer system had been installed and usage patterns rooted (Scacchi, 1980). In this section we will examine CSRO’s text processing system. The computer system at CSRO is linked to similar organizations and computing facilities through a national computer network. The computer system utilizes virtual memory in addition to 2+ megabytes of main memory and 100+ megabytes of disk storage. The system had demonstrated its capability to support more than 50 interactive users during peak use times. Some users of this computer system have developed very large interactive application systems written in a popular dialect of the programming language LISP (Sandewall, 1978). In short, the CSRO computing environment is quite sophisticated, “state of the art.” The automated text processing system at CSRO is regularly used by project managers, computer science researchers, system programmers, and secretarial staff. Project managers, usually senior researchers, use the complete text system less often, except when work deadlines necessitate its use. Managers take advantage of their organizational position in allocating secretarial or junior staff to certain text processing tasks. These tasks usually amount to executing the computational tasks necessary to obtain a finished version of the source text document. Such a computing work allocation scheme often leads the secretarial staff to use the system on a daily basis.
The systems programmers and other CSRO researchers use the text processing system regularly, as they need it. The junior researchers and staff often must individually tend to all aspects of system use. But for some of these users, this is the preferred arrangement: working with the system autonomously, without secretarial support. The text processing system can be configured in different combinations by users. We suspect this leads to different patterns of system use. Most CSRO researchers and programmers used system configurations of their own choosing. But some users openly reported some disdain for use of certain subsystem modules. In the words of one systems programmer: “I grew up knowing nothing but displays (screen-based text editors). It’s hard for me to go to less elegant (line-oriented text) editors. They’re such a pain.” Most system configurations (that is, different subsystem ensembles) were plagued with recurrent difficulties. Many of these concerned the use of the dominant text formatting program, NEAT. NEAT is said to be a powerful text formatting program: a macro-based⁵ document “compiler.” But many users did not appreciate all of its general features. One programmer remarked: “NEAT is an interesting example of a program that was made so general, it’s unusable. It’s so mysterious. Nobody knows all of it.” Another user commented: “NEAT can be nice. But there are a lot of idiosyncrasies that are a pain in the ass. As I use it more, I run into more of them. If you want to do something exotic, you have to find somebody who knows it. For example, there are some hassles to using special characters. You have to know how to use them. They’re not in the manual.” Given these sentiments, we became interested in discovering what NEAT users were doing to manage these difficulties. Most users rely on the help of other experienced users in order to manage or minimize NEAT hassles. One user commented: “there are a lot of bugs in it. But people who use it regularly have charted out the bugs, which is sort of nice because (then) you know where they are.” Another user stated: “Most people around here treat it (NEAT) as a black box. You usually go to one of the people around here who knows it to avoid trouble.” It puzzled us why the presence of system bugs should pose recurrent difficulties in system use. Why weren’t these bugs readily fixed? According to one systems programmer, “actually, it’s not maintained at all.” What resulted was that many different sets of NEAT macros were implemented by capable users to get around the bugs they encountered. However, many of these NEAT macro routines were incompatible or redundant. This gave rise to new system maintenance demands: the upkeep of NEAT bug-avoidance macro libraries.

⁵ Macro here refers to a programming construct.
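NEAT’s own macro syntax is not given in our field notes, so the following sketch renders the bug-avoidance pattern the users describe as an analogy in C; the routine, the defect, and the substitution convention are all hypothetical. A user-maintained wrapper routes text around a known, charted bug rather than fixing the unmaintained routine itself.

    #include <stdio.h>
    #include <string.h>

    /* stand-in for the unmaintained formatter routine: it has a known,
       charted bug and garbles any word containing a '$' character */
    void format_word(const char *w)
    {
        if (strchr(w, '$') != NULL)
            printf("<garbled>");
        else
            printf("%s", w);
    }

    /* the user-maintained "macro library" entry: substitute a safe
       spelling before the buggy routine ever sees the character */
    #define SAFE_FORMAT(w) format_word_safe(w)

    void format_word_safe(const char *w)
    {
        char buf[256];
        size_t i, j = 0;
        for (i = 0; w[i] != '\0' && j + 8 < sizeof buf; i++) {
            if (w[i] == '$') {
                memcpy(buf + j, "(dollar)", 8);  /* detour around the bug */
                j += 8;
            } else {
                buf[j++] = w[i];
            }
        }
        buf[j] = '\0';
        format_word(buf);   /* never sees the character that trips it */
    }

    int main(void)
    {
        SAFE_FORMAT("cost$per$page");   /* works around, not through, the bug */
        printf("\n");
        return 0;
    }

Each user group that builds such wrappers conforms them to its own needs, which is how incompatible and redundant macro libraries, and the new maintenance demands they carry, accumulate.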
The text processing system came to our attention in part because of its regular use by most CSRO users. The “system” was composed of different software systems: “line-oriented” and “screen-oriented” text editors (four now in use); a spelling checker; text formatters (two in use, one in development); and supporting systems software, which handles source file “movements” and output listings. In addition, on-line CRT terminals and high-quality printers constitute part of this system. The users at CSRO include graduate student research assistants, research associates, academic faculty, project managers, and computer programmers, all computing professionals. Many users characterized the need for a text processing system as “essential,” given their desire to produce “professional” documents or publications, in light of the limited secretarial services at CSRO. Some users were quite concerned with their ability to control most or all aspects of document production, and some of them indicated that their use of the text processing system reduced or eliminated the need for secretarial assistance. Since the system is time shared, text processing users must share the available computing resources with all other users. One possible difficulty with this arrangement is the availability of computer terminals and of raw computing resources during peak use periods. Even in sophisticated computing environments, computing resources cannot be viewed as free and always at one’s disposal. All CSRO’s computer users apparently use the text processing facilities. Some users reported that 50-80% of their computing activities dealt with text processing. With this indication of the extent of system use in mind, together with the preceding characteristics of the setting, we will step through the phases of the system life cycle for CSRO’s text processing system as we found them. When the current computer system was installed at CSRO, it came with two line-oriented text editors and one text formatter supplied by the vendor. Different users chose to import screen-oriented text editors, text formatting, and spelling checker programs from other compatible computing facilities. The adoption of these was easy and natural. We attribute this, in part, to the software being “free,” to the sophistication of computing at CSRO, and to the users’ expertise in computer science. In addition, there were workplace incentives for adopting software “tools” like special text editors. Adopting these software tools could reduce the perceived amount of work needed to complete other “important” computing tasks (e.g., producing professional documents). Furthermore, some users were willing to enhance versions of certain programs to satisfy their specific needs. Thus, initial development of text processing system components was straightforward: some parts came without asking (from the vendor), and other parts were brought in by eager users.
The design history of these various text processing tools was somewhat more difficult to discover. None of these software systems was designed or developed by anyone within CSRO; thus our history of the text processing system was fragmentary. Some users knew who designed and built these software components. Some also readily pointed out what they considered to be system design flaws. System bugs, missing features, irrelevant system features, awkward stylistic conventions, and poor system performance characteristics were pointed out by users. Nonetheless, they used the system. The implementation of the text processing system followed a trajectory similar to that of system design, that is, sketchy and fragmentary. Some of the system flaws may have resulted from the method of implementation by the system builders. There was no overall system design to integrate the different system components. Many users had implemented their own version of one of the text formatting programs. Users had accessed the original “public” version of the program and altered it to meet their own needs. Users said they followed this strategy because they could identify and repair (and hence understand) program bugs and the program itself. Furthermore, a user could to some extent add, subtract, or revise program features given his or her own version of the program. But different users indicated that no one at CSRO really knew everything about how the original program worked. Nearly all users reported that certain software components, particularly the text formatter they preferred, were rife with bugs. Most bugs were found through testing with real data. That is, the system was not sufficiently tested by specialists or users a priori. (In practice, users attempt an action they believe to be correct, as indicated by someone else or by the system documentation, and find that some unexpected system behavior results.) The documentation for NEAT was roughly five years old and not updated to reflect maintenance alterations. This led to some problems for users who acquired an intermediate version of the NEAT program that had been altered without the documentation also being changed to reflect this. According to the users, other software components did not have equally bad documentation. But an opinion expressed repeatedly about all software documentation was that “it could always be better.” As we have noted, some analysts have suggested that one set of recurrent problems in maintaining computing systems is attributable to a primary emphasis on issues of system development. The traditional practice has been to focus on system development activities as opposed to system maintenance.
The apparent underlying rationale for this position follows the belief that if a computing system is developed in an orderly manner, then naturally the resulting system will be more easily maintained. But this rationale disregards the problem of maintaining systems that have been previously developed in a less than orderly manner. This latter position seems to be the more appropriate characterization of the development of the text processing system at CSRO. Given its development path, what issues in system maintenance have surfaced? We saw in the case of the NEAT subsystem that system bugs and idiosyncrasies are not necessarily repaired. No one was assigned to maintain the NEAT system. One user strategy to avoid, circumvent, or undo adverse system effects was to develop manageable NEAT macro routines. Another strategy was simply to chart out the location of known system deficiencies and to structure the content of text processing computing tasks to avoid problems. Bug charting activities can entail relying on the experiences of other users who have passed through similarly troubled waters. The point drawn from both of these strategies is this: the fact that design or implementation errors are known to exist in a given computing system does not necessarily mean that the system will be repaired; users will alter their computing work styles to get around the limitations. In addition, since many users develop their own sets of mistake-avoiding software routines, these routines must also be maintained. These routines are likely to conform to the particular needs of some, but not all, users. For the text processing system at CSRO, system maintenance activities are distributed across the user community. Many computing analysts report the increasing cost of software system maintenance. At CSRO, the text processing system is regularly used by many users. Maintenance on this system, however, does vary. Earlier, we described how the text processing system developed from two text editors and one text formatter to encompass four text editors, a spelling checker, and two text formatters. There are now more individual subsystems to maintain, and unless members of the CSRO user community are somehow persuaded or restricted to use only some subset of these, they must all be maintained. The necessary maintenance may be distributed throughout CSRO and not confined to the systems programmers (those normally assigned to system maintenance), but it is unclear whether the cost of maintaining the system will increase. That is, while there are more software components to maintain, a major portion of maintenance activity is subsumed by users distributed throughout the organization. In addition, their maintenance activities may not be directly accounted for when system operation and maintenance costs are assessed. Such a system maintenance arrangement may be common practice, but it can be deceptive in that it can increase “hidden” costs.
Under this maintenance arrangement, personnel turnover can act to further exacerbate the costs of maintaining existing systems. All organizations experience personnel turnover. As different people pass into and out of the organization, so do experiences, understandings, and macro libraries for different computing systems. In a case where understanding the workings (and bugs) of a given computing system is spread across the user population, we might expect that new organizational members (or infrequent computing system users) will have to spend some time acquiring a working knowledge of a system. Again, such arrangements may be common or expected, especially in a sophisticated computing environment. But what does this say about the cost of system maintenance? System maintenance is usually accounted for according to expended resources: labor, computer time, organizational overhead, etc. We might expect that greater personnel turnover will drive up the cost of system maintenance because more people have to take time (labor and possibly computer time) to learn the system. CSRO is set up to regularly “turn over” skilled graduate students and visiting research associates. Furthermore, when organizational knowledge of the workings of a given computing system is spread across many computing users rather than confined to system programmers and good documentation, the ease and frequency of interactions between instrumental users and others will influence the content of system maintenance, and hence, maintenance costs. Finally, to what extent do system maintainers avidly pursue or desire system maintenance work? In the words of one system programmer at CSRO: “computer programming is a lot of fun, like solving puzzles. But there is a lot of shit work around that you have to get involved with: documentation, inheriting somebody else’s program that is rife with bugs. These are not fun.” Thus we were not surprised to find users distinguishing maintenance activities from “real” computing, that is, system use or development. The cost impacts of these conditions are difficult to quantify, but it is not clear that these arrangements can be ignored in determining what affects system maintenance costs. Concerns for system maintenance naturally led us to address matters of system evaluation. How well does the CSRO text processing system perform? We solicited evaluations from users on the practical utility of CSRO’s text processing system. The general user sentiment about automated text processing was quite positive. Text processing was said to be one of the main benefits of interactive computing. In the words of CSRO’s computing facility manager: “it’s a very powerful tool, especially for preparing large documents.” Though there was much apparent support for text processing, many users cited their reservations.
Three CSRO users, each working in different research groups, commented, respectively:

There is a lure to text processing. It’s very seductive to keep hacking at a paper to make it look nicer since you have the ability. I may get more work done if I used a secretary, but not as nice looking or reading as clear.

The process of making text pretty is enormously time-consuming. (But) it’s getting to the point where a secretary, rather than a programmer, can do it.

As things have got fancier, the first (system) introduction seems powerful. But as you advance (to other systems), it seems hard to step down or back to the others.
From these remarks it is unclear what the real organizational or performance benefits (e.g., improved appearance of research documents) of automated text processing are. There are some conflicts between the generally positive sentiment and these individual assessments, which appeal to different system values. We may each draw conclusions about the efficacy of this computing system, its users, or even CSRO. But let us turn to review this case data with respect to the six perspectives on computing in organizations, outlined in Section 1, to gain further insight into the nature of an organization’s computing system.

2.11 Six Perspectives of Text Processing at CSRO
In initially reflecting on the data on CSRO’s text processing system, we each can draw a variety of conclusions. We may, however, discount the viability of a computer science research organization as our object of examination because it is a research organization and hence atypical. But CSRO, like any organization with computing systems, can be examined for those computing activities that are involved in the production of organizational products. We also might have to hedge our analysis by the fact that CSRO is a computer science organization and thus atypical. Our concern here would be that organizations consisting primarily of computer scientists are somehow different. But that same challenge must apply to any organization populated by some particular group of professionals. Instead, we propose to review the life cycle of the text processing system at CSRO, using the six perspectives on computing in organizations. These perspectives might not provide suitable insight into each phase of the system life cycle, but each perspective might lead to deductions or insights which can be contrasted with those drawn from the others. Thus we should be able to enlarge our general understanding of the dynamics of text processing at CSRO, of computing in a professional and sophisticated setting, and of the social character of a computing system life cycle.
Rather than go through the issues raised by each perspective, we provide representative questions (Table III) that the reader can choose to answer given our previous discussion of each perspective and the CSRO data just presented. These questions exemplify the concerns of analysts adhering to different perspectives in understanding the life cycle of CSRO’s text processing system.

2.12 A Retrospective of the System Life Cycle
We believe the CSRO case supports our earlier claim that computing systems evolve by way of a process of layering throughout their life cycle. Computing system components and their interconnections are reassembled, reconfigured, and enhanced by instrumental users as well as by computing specialists. People tinker with a computing system to make it do what they want it to do (cf. Jacob, 1977; Scacchi, 1980). At CSRO, some users took it upon themselves to acquire their own version of NEAT, which they would enhance to fit their needs.

TABLE III
SIX PERSPECTIVES ON THE CSRO DATA

Rational
1. In what manner is the text system a labor-saving device?
2. How does the text system augment current secretarial service limitations?
3. How is the automated text processing better than conventional techniques in the ways it is “naturally” used?
4. How would automated documentation tools aid in resolving current documentation problems?
5. According to what criteria can system performance be measured?
6. Are text productivity impacts positive for the text system? If not, is the system efficient?

Structural
1. Are text production activities now more centralized? What difference does/would this make?
2. Can users more readily manipulate and produce complex documents with more speed and ease?
3. Can more users utilize the existing text system as compared to a manual system?
4. What attributes of the organization and the existing computing facilities “encourage” the use of the system?
Human Relations
1. Are users satisfied with their use and control of the system?
2. Are users troubled about not being involved in specifying the design of the text processing system?
3. Does the use of the text system provide some motivation for its continuing use?
4. Does the text system help users cope with certain uncertainties (e.g., ease or timeliness of producing documents) in their work?
Interactionist
1. What import do CSRO text users assign to the text system?
2. Do text users give contradictory definitions of text production activities on the one hand in contrast to the efficiency of the text processing system on the other?
3. How do users view the text processing system as a social object and not just a technical artifact?
4. What interpersonal negotiations take place that shape the different phases of system development?
5. What social and technical resources are bartered in these negotiations?
6. Why do text users prefer to use the text system in spite of their feeling that manual alternatives (secretaries) may be a more efficient way?
7. Why are text system users willing to reshape the style or content of their work to conform to the “contours” (including system limitations) of the system?
8. To what extent do users sense greater negotiating ability or control over text processing because of the availability of multiple system configurations?

Organizational Politics
1. What resources do CSRO text system users command and how do they use these to increase their power?
2. Do CSRO managers make use of the text system to enhance their control over CSRO operations or products?
3. To what extent are managers able to extend their control over others at CSRO because of the text system?
4. What are the relations of organizational power and dependency that surface in the use of the text system?
5. What highlights the text system’s utility as an instrument of bureaucratic politics for dealings with outside organizations (e.g., preparing grant proposals and project summaries as production activities conducted or supervised by CSRO managers)?

Class Politics
1. How does the use of the text system at CSRO deskill the quality of work, knowledge, or control over computing or other work activities for the users?
The evolution of a computing system may thus be a process of tinkering and layering. That is, users’ tinkerings will accumulate in many shapes and forms (some by design, some by hasty effort, and others just obscure or bizarre) that coalesce and emerge as new layers of system. When many system layers accumulate and their interactions are tinkered beyond clarity, these and related aspects of a computing package will be “defined” to be unmanageable. (At CSRO, the general quality of documentation for NEAT degrades as multiple-version enhancements are made and as haphazardly documented macro libraries build up, unless systematic efforts are undertaken to restore it.) As this occurs in an organization, efforts will emerge to stop using the system, to convert it, or to replace it with some newer arrangement.
These efforts denote the passage of a computing system from the end of its life cycle to the beginning of another. It also denotes a passage in the evolution of an organization’s computing package. Thus, we expect that in the near future NEAT will be seen to be decreasingly useful throughout CSRO, especially as new, more encompassing components appear that subsume many of NEAT’s enhanced capabilities. It appears that instrumental users may have to take on a significant portion of system maintenance tasks in order to stay abreast of their growing and contracting computing environment. However, the skill with which these people tinker with the layers of a system will shape the ease or hassle of computer use, and probably a major share of system maintenance costs. If instrumental users do not take up maintenance tasks, then these users must renegotiate or reshape the work content to conform to the contours of the computing environment. In this latter arrangement, users bear the costs of poorly fitting computing systems. Will advances in system life cycle engineering improve this situation? We believe that that depends on what perspective(s) computing system developers will utilize when conceiving the development of new, or the conversion of old, systems. We have tried to show how analysts following different perspectives define and assess problems in the system life cycle. We believe the interactionist perspective captures the richest description, and provides the most “active” analysis, of the nature of these activities. The analysis within this perspective suggests that people will shape their role in the different activities of a system life cycle in order to complete their computing tasks with the least hassle, unless what they value is threatened. Other analysts point out that many “technical” activities in the system life cycle are influenced by workplace incentives, who controls key decisions, whose interests are served, and how these are mixed and meshed (organizational politics). Human relations analysts tell us that success or failure in fitting a computing system into its organizational environment depends on involving users in specification, design, and implementation of a system. Structural analysts note that attributes of the structure of a computing system, the organization in which it is embedded, and the external environment of the organization also play a part in determining the fit of a system. The rational position, unfortunately, attempts to limit our attention to just the technical nature of activities in the system life cycle. While such technical matters often pose challenging problems of their own, we believe that constant attention to a technological or “systems” point of view denatures our understandings about what different people can and will do to make a computing system habitable and useful.
Finally, we note in passing that the class politics perspective presently provides no contribution to understanding the dynamics of a system life cycle. We will now move on to a section that examines a subject intimately entwined with a system’s life cycle: computer system use. In this next section we will present a social analysis of computer system use in complex organizations.

3. Computer System Use in Complex Organizations
Empirical and theoretical studies of computer use range in focus from promotional speculations to analyses of the recurrent dilemmas of computer use in organizational settings (Kling and Scacchi, 1979b). In this section we will focus our discussion on four facets of computer use: discretion in computing system use; the distribution of knowledge for computing system use; commitment to computing system use; and mistakes in computing use. These will be examined with an eye to the six perspectives on computing in organizations.

3.1 Discretion in Computer System Use
Organizations provide an arena for interaction among computing users of all kinds. Managers in organizations use computing selectively. They most often use computer-generated reports for routine organizational decision making, rather than directly accessing terminals or programming their own applications (Dutton and Kraemer, 1978). Instrumental users such as accountants, urban planners, and insurance actuaries use computing to routinely analyze data and prepare reports for administrative purposes. Occasionally, they conduct exploratory data analysis (Kling, 1978e). In addition, managers and instrumental users often have the option of using computing; computer use may not be desirable for all tasks given individual or organizational interests and limitations. Computing specialists are generally eager to use and further develop a given computing system. The amount of attention specialists give to a system or a particular application depends upon the demands for scheduled service delivery and personal or professional interests. Clerical users, on the other hand, are constrained in their use of a computing system to data entry and/or retrieval (Mumford and Banks, 1967). The work of clerical users is generally managed in such a way that they must use a computing system as provided. Thus, patterns of computing use are characterized by differential discretion on the part of organizational actors.
Managers are assigned organizational decision- and policy-making authority. When it comes to computing, few managers actually use a computing system on-line. Rather, they usually delegate information retrieval tasks to subordinate professional or clerical staff. If a given manager is particularly influential, either because of organizational responsibilities or power base, then the manager can often get his computing tasks assigned high priority. Similarly, tenured organizational staff may be able to utilize this kind of influence. While we all may recognize managerial prerogative, managers can utilize their organizational authority to assure their subordinates easier access to computing.

Managers in organizations can and do influence other users’ patterns of computing use. Though managers use computing primarily for routine needs, they may occasionally need to exercise their prerogatives for computing use. In small organizations with few sub-units, this poses few difficulties. However, in a large organization with many sub-units and special projects, unexpected managerial prerogatives for computing use can interrupt some computing tasks for different instrumental users in that organization.

Structural analysts might point to how a manager’s control of computing data and expertise determines staff access to management-oriented computing use. Human relations analysts assert that people value what they control (Lawler and Rhode, 1976). Thus, managers should value their ability to control staff access to management-oriented computing uses. But instrumental users, clerks, and specialists will also value computing when they have greater control over its use and orientation. Organizational politics analysts are quick to point to the pending conflicts between managers and others over control of staff access to computing use and computing job schedules. Additionally, these analysts could point to the conflicts stemming from the conflicting requirements placed on a single system by managers, users, and specialists. Interactionist analysts, on the other hand, seek to understand and distinguish (a) when managers utilize their computing prerogatives, (b) how the distribution of managerial authority influences their utilization, (c) which conflicts are rooted in fixed versus negotiable commitments to schedules and system (user) orientation, and (d) how and when others are able to control computing use.

Users continually request computing system alterations even for routine computing applications (Lientz et al., 1978). These alterations are requested for the purpose of meeting changing conditions in the organization’s environment, and they are implemented by specialists who follow some scheme for prioritizing changes and for scheduling commitments of organizational resources. The priorities and schedules are sometimes established or arbitrated by an intervening administrative group (e.g., a
“change control board”) of representative managers, users, and specialists. In order for users to get a system change request serviced, they may have to meet and coordinate with their managers, other users, the change control board, and the specialists who will implement the change. Thus, as interactionist and organizational politics analysts show, the negotiations among system users, change controllers, and specialists over system enhancements mutually affect the patterns of computing use and the contingencies of computing work.
3.2 Organizational Distribution of Computing Knowledge

Our preceding observations on the discretionary use of computing point to a differential distribution of computing knowledge within organizations. How do different distributions of computing expertise facilitate or hamper easy, hassle-free computer use? Rational analysts suggest that because computing systems are “rational,” users can rely on personal cognitive abilities, the computing system, and system documentation to facilitate computer use (Kernighan and Mashey, 1979). In this way the distribution of computing knowledge is localized and “efficient.” Alternatively, interactionist analysts suggest that users rely primarily on information gained through dealings with other users, system specialists, or personal experience (Kling and Scacchi, 1979b). From this perspective, computing expertise is organizationally decentralized but “effective.” Interactionist analysts argue that when a computing system and its documentation lack coherence and comprehensibility, users will turn to others for help to complete computing tasks. Many computer scientists point to the dilemmas that fragmentary and dispersed computing knowledge creates for sophisticated computing use (Brooks, 1975; Palme, 1978; Kling and Scacchi, 1978, 1979a,b).

Two related social features of organizations affect the distribution of computing knowledge. These are ease of access to computing expertise and the level of commitment to computing. Having easy access to computing expertise is a concern for many instrumental computer users. The level of commitment to computing helps distinguish computing specialists from instrumental computer users.

3.2.1 Access to Computing Expertise
Instrumental users often interact with other users and specialists. These interactions sometimes arise from the need to resolve inadequacies in the system documentation. At other times, users need assistance to understand how to process “special cases” or “exceptions” given the range of
system functions normally provided. In rich computing environments with many competent and cooperative computing users, computing expertise is often readily available. In most computing settings where computer use is scheduled or regulated, access to computing specialists is mediated through a chain of liaisons or supervisors (Kling and Scacchi, 1979b). Some specialists prefer this arrangement because it “buffers” contact with end users. Users, on the other hand, sometimes covertly bypass liaisons in order to get access to particular specialists familiar with their computing needs (Scacchi, 1980). While access to system documentation is generally provided, access to specialists knowledgeable in the workings of a given system is restricted but possible.

Structural, human relations, and organizational politics analysts suggest that the distribution of computing knowledge, together with bureaucratic mechanisms (such as chains of liaisons between specialists and users), can either smooth or muddle easy access to computing (Danziger, 1977, 1979; Mumford and Pettigrew, 1976; Pettigrew, 1972). These analysts point to the nature of particular organizational computing arrangements as a major influence affecting the ease or bureaucratic hassle of computing use. However, rational analysts seem to treat the distribution of and access to computing knowledge as nonissues because of the tacit assumption that all users have an equally strong interest in computing.

3.2.2 Commitment to Computing System Use
It seems to be the case that specialists have an extensive commitment to computing, while instrumental users often have a much weaker commitment. This is indicated by the fact that instrumental users have been known to rely on others, including (sometimes to their chagrin) specialists, to help complete computational tasks. Specialists, on the other hand, seem to identify professional expertise with their ability to read raw code in order to “really” find out how a system works. In fact, system specialists are often surprised when users seek their help to acquire information that can somehow be found in the system documentation. Although some instrumental users may occasionally look at system code (often the only up-to-date form of “documentation”), it can be supposed that, in general, they do not share the detailed understanding of or interest in software system intricacies that are earmarks of commitment.

If all system users were equally committed to computing, then all users could be expected to read system manuals to learn the system. The rationalist ideal is that a user thoroughly understands a system before using it. But while specialists can usually be expected to be attentive to the intricacies and details of a system’s workings, instrumental
users frequently have other interests and organizational commitments beyond computing system use. For this reason, instrumental users seem to be willing to learn those functions and commands necessary for specific computing uses as long as they can do so without expending more than the smallest effort.

For interactionist analysts, a user’s commitment to computing involves both social and technical computing resources. The social resources include time, information, skill, budget, inclination, and opportunity. The technical computing resources include memory space, disk storage, software utilities, processor time, computer terminals, etc. These analysts note that negotiations between users and specialists over computer use and computing work focus on altering one’s computing resource commitments to accommodate the needs of others. Users’ and specialists’ respective resource commitments to computer use are but one set of demands or investments that each attends to in completing computing tasks. The computing resources that are users’ and specialists’ objects of negotiation or commitment shape, in part, organizational computing arrangements and computer use patterns.

3.3 Mistakes in Computing System Use
Computing system use often results in the regular occurrence of “mistakes” at work (cf. Riemer, 1976). Computing-use mistakes fall into four categories: miscalculations (such as enhancing the wrong code module, deleting the wrong disk files, changing the wrong parameter value, or using the wrong analytic equations); hold-ups (such as unscheduled system down time owing to a new operator, or unexpected system slowdown owing to high system loads); circumstantial errors (such as botching the conversion to the latest version of an operating system); and “natural” accidents (such as operating system crashes or hardware failure). Mistakes such as these rarely occur intentionally, but they do occur frequently.

Mistakes in computing use are a topic that has received surprisingly limited coverage in the literature. What materials there are focus on programming errors (Youngs, 1974), program debugging (Rustin, 1972), or fault-tolerant and reliable software systems (Boehm, 1976). These are generally machine-centered concerns for computing specialists and rational analysts. Though such concerns may be familiar and important to computing specialists, they may be quite remote to many instrumental users. User-centered concerns are often indirectly approached through efforts to improve or “naturalize” the user-system interface (Dzida et al., 1978;
Schneiderman, 1978). Our interest is in examining how users deal with mistakes in computing system use, and the relationship between these mistakes and other work activities.6

6 The need for studies of errors in computerized transaction processing systems has recently emerged (Sterling, 1979). The relationship between mistakes in computing system use and errors encountered by the “end targets” of these systems is presently unexamined and poorly understood.

We saw in the CSRO data that many system bugs (i.e., someone else’s miscalculations) are either repaired (if the user can get the system fixed), circumvented with the help of additional software, or avoided by a rearrangement of computing tasks. These are all strategies users follow in dealing with system design errors and maintenance inadequacies, and they indicate that even sophisticated computer users may adapt and reshape their work rather than invest the time, skill, and effort necessary to get the system corrected.

Rational analysts seem to suggest that mistakes in computing use can usually be accounted for by the inadequacies of computing resources or by the incompetence of users. But at CSRO, the computing environment is sophisticated and the users highly talented. Structural analysts might show how the uncertainties that users face in using a complex computing system facilitate mistakes. Human relations analysts might focus on whether users are satisfied with the reward system, which encourages good computing use and documentation practices in order to discourage computing mistakes. But interactionist analysts point to (a) the dealings between users and specialists to overcome inadequacies in system documentation, (b) strategies for circumventing regulated access to computing specialists, (c) users’ restructuring their work tasks to accommodate emerging system limitations, and (d) user/specialist negotiations over work, resource, and computer use sovereignties.

We have seen how the social dynamics of computing use in organizations affect the efficacy of computing, and how these effects are viewed from the six perspectives. Now we will move on to examine the consequences of how computing is used in organizational settings.

4. Impact of Computing on Organizational Life
Analysts of almost every persuasion have suspected that a technology which enlarges the information processing capacity of people or organizations by orders of magnitude must have potent influences on their interaction and work techniques. What kinds of differences do computerized
information systems make in the nature of the activities performed by and within organizations? Most of the accounts of the impacts of computing on organizational life focus on the ways in which computing alters the efficiency or effectiveness of organizations, the ways in which decisions are made, the work life of participants in the organizations, the ways activities are structured, the kinds of control managers can exercise in their administrative domains, and the power of different participants to influence the activities of their organizations. Different analysts who adopt different perspectives approach the questions “What difference does computing make?” or “Is computing a good organizational technology?” very differently (see Table I).

The main line of development in this section is to examine the consequences of organizational computing use as viewed by analysts from each of the six perspectives. Rather than examine every important difference that computer use may make in organizational life, we will examine a few important areas that have been most studied. These include the impact of computer use on the way decisions are made in organizations, the worklife of computer users, and the alteration in patterns of power and managerial control. We will not examine other topics, such as the ways in which computer use may alter the “structure” of organizations, the ethos of their members, or relations between organizations.

4.1 Decision Making
A critical question faced by anyone who analyzes the role of automated information systems in decision making is the directness of the linkage of data, information, the way decisions are made, and the effects of different decision styles on activities within and by an organization. This section will examine this question from several viewpoints. Then several classes of automated information systems will be examined.

Analysts who focus on the role of computer-based systems in supporting decisions usually differentiate among different kinds of tasks, which are typically performed at different hierarchical levels in organizations. Most common is the trichotomy of “operations,” “management control,” and “strategic planning” (Burch et al., 1979). The lion’s share of computing in organizations is devoted to applications that support routine operations within the organization: billing, transaction processing, simple record keeping, engineering analysis, etc. Fewer computational resources are devoted to applications that support the ability of managers to schedule, allocate, and control their resources. The applications used to support long-range planning and policy analysis are the most analytically interesting and the rarest of all.
4.1.1 Data, Information, Automation, and Decisions
Rational analysts take the links between data and decisions to be direct, while analysts from other perspectives treat them as more problematic. H. A. Simon has clearly been the most influential theorist and structural analyst to link computer-based systems and decision making in organizations. The conjunction of computing and decision making is natural for Simon, since he has developed a view of organizations which emphasizes people’s activities as decision makers and problem solvers (Simon, 1947, 1965, 1973, 1977). He argues that data and methods that help focus attention and evaluate choices improve the technical performance of a decision maker. He has also argued that computer-based information systems can help participants in organizations act more rationally by enabling them to compensate for weaknesses in judiciously selecting and remembering information (Simon, 1973, 1977). To the extent that computer-based systems can help them organize and filter larger volumes of information and that simulations enable them to consider a wider variety of complex dynamics simultaneously, Simon views computer-based systems as helpful instruments. Simon’s claims have rarely been tested empirically, but they have provided the framework for much theorizing about the social roles of computer-based information by analysts who adopt a rational perspective (Burch et al., 1979; Licklider and Vezza, 1978; Streeter, 1974).

Many studies that relate computer-based information systems to decisions made by actors in organizations assume that there is usually a good match between the data available in computer systems and the kinds of information used to inform decisions (Burch et al., 1979; Kanter, 1977; Simon, 1977). In practice, the relation between the data contained in computerized information systems and the decisions that they are held to inform is problematic. No one is surprised when the connections are direct and simple. For instance, most readers will have had the experience of altering airline flights in mid-trip with the assistance of an airline reservationist who uses a computerized information system to find available flights. When we are also told that a police patrolman is able to locate more stolen vehicles with the assistance of an on-line stolen vehicles file that he can access through the department’s dispatcher, our expectations that information in computerized systems is useful are confirmed. These examples typify many cases in which some participant in an organization directly uses some item of information from a computerized information system to inform a decision. Clearly, decisions based upon data from computerized systems might be poor if the link between data and decision is direct and the data used are in error. There are also several kinds of situations in which the link between data and decisions is not so direct.
Rational analysts readily identify swollen data bases as troublesome. Many automated information systems have been built which provide data that are unusable to the people they are supposed to inform (Ackoff, 1967; Danziger, 1977; Dery, 1977). The problems vary, but they include systems which contain inappropriate data items, data which cannot be cross-referenced in useful ways, data which are inaccurate or out-of-date, and reporting formats which are cumbersome. Often these difficulties appear amenable to social or technical fixes.

Structural analysts also identify situations in which the decisions are made independently of the formal data system. Gibson (1975) reports an interesting study of the ways in which decisions to find new branches for acquisition were made by the staff of an eastern bank. A management analyst who computed the costs and potential profitability of prospective acquisitions believed that his analyses were the primary elements in decisions about which banks to acquire. However, his supervisor, who was responsible for acquiring new banks, used a variety of informal sources such as friends and newspaper stories to locate prospective branches.

In one land development firm we have studied, financial analysts prepared extensive cash flow and profitability analyses for new development projects. According to the financial analysts, the reports were critical aids for managers of new development projects. However, according to the project managers, the reports were of minor utility since any of the projects they proposed were profitable in the climate of land development they faced. They were more concerned with shaping a project so that it would meet with the approval of the local planning commissions. Project managers spent more of their time coping with regulatory uncertainty than with fiscal uncertainty, and they evaluated their reports accordingly.

Several years ago we studied the use of a manpower allocation model which was developed by a major midwestern police department. The use of the model received national attention, and we expected to find it of real utility to police officials who were responsible for shifting allocations between different police precincts. When asked about its usefulness, the police captain responsible for staffing replied, “It’s really useful, if you want to allocate manpower.” The police staff had developed the model in the late 1960s with the assistance of analytical support from a major computer vendor. In 1969, the police were able to increase their staffing by 30% by convincing the city council that ghetto riots and increasing crime rates were imminent. The model was used to help decide how the new recruits were to be allocated to the various precincts. In the early 1970s, the riots did not materialize, the crime rate fell, and no new police were hired. The original allocations to different precincts were maintained, and the model was not needed.
Both rational and structural analysts focus “inward” on the formal task organization and on the decisions that are most critically related to the formal tasks. Interactionist and organizational politics analysts examine the web of social and economic relations in which each social unit is embedded. These choices have profound effects on the sense one makes of computing use. For example, a rational analyst would argue that computer use would be of little value to WESCO, discussed in Section 2.1, if it did not help engineers design products better or more easily. In contrast, a political analyst would argue that automated design programs could be of tremendous value in helping engineers convince auditors of their design choices, even if the automated system did not alter the quality or lower the direct cost of engineering design.

Rational and structural accounts of computing in organizations emphasize the “formal tasks” of organizational participants (Burch et al., 1979; Galbraith, 1977; Kanter, 1977; Simon, 1977). These include engineers designing, accountants calculating rates of return and balancing books, and clerks auditing and correcting records. Interactionist analyses also examine the work setting of participants in organizations and examine the way they adopt, define, and negotiate all of the tasks they find important during their working day. People attend to their ongoing social relations in their work groups, and to their relationships with clients and staff in other organizational units, with auditors, and with subordinates and superordinates in the organization. These “decisions” directing attention may focus on ways of making sense of new activities in one’s immediate setting, on maintaining autonomy or security in the work group, on hiding unauthorized work practices, and on maintaining the cooperation of the variety of people with whom one interacts. As we shall see later, these decisions, which are not “about” the public content of a participant’s authorized task, may nevertheless play important roles in shaping the uses of computing in organizations.

But this broader ecology of social relationships in which organizational members act can also influence the relations between computer use and decisions. Computing can serve as a symbolic element in organizational life. In the cases of Riverville described in Section 1.2 and of WESCO described in Section 2.1, computing was used because it convinced important parties that careful decisions were being made. In the case of Riverville, UMIS was not used to alter the decisions made by managers in the municipal welfare agency. In the case of WESCO, we described an episode in which engineering calculations were shifted from paper and pencil to computers to provide the appearance of greater accuracy. These are extreme cases in which computer use serves almost exclusively symbolic and political functions. However, more common examples are those in
which computer use serves several social roles simultaneously. Computing may be used to help a manager plan his organization’s activities and gain more credibility among other managers for his plans (Keen and Scott Morton, 1978).

4.1.2 Operations
Routine operations such as billing and airline reservation processing were thought to be relatively dull. If there is any class of task for which computer systems ought to pay off, it is the support of routine operations. Most of the system successes reported by managers or designers of computer-based systems were written from a rational perspective. The typical report describes how goals were set and how the author and his co-workers successfully designed a computer-based system to meet them. All the drama focuses upon the battles faced by the implementers in getting a “successful” system designed on time, within budget, and loved by its users.

There are relatively few careful studies of computer use in organizations, and few of these focus on routine operations. The better evaluations are the less triumphant and report complex or ambiguous evaluations. Many routine systems such as traffic ticket processing help increase organizational efficiency (Colton, 1978; Kraemer et al., 1980). Systems may fail because they are technically unsound (e.g., response time is too slow in a demanding decision-making environment) or because they do not contain terribly useful or accurate information (Kling, 1978b). Most important, the criteria adopted for success may strongly influence one’s evaluations.

Laudon (1974), Colton (1978), and Kraemer et al. (1980) have all studied police patrol support systems that provide information about wanted persons, stolen property, and criminal records to patrol officers in the field. Colton and Kraemer et al. examined the use of these systems by patrolmen and found the systems to be “successful” with respect to two measures. Colton contrasted a “successful” system in which the mean response time was 5-10 seconds with a nearly worthless system in which the mean response time was ten minutes. Patrolmen made about four times as many inquiries, per capita, with the better automated system. Kraemer et al. found that police who used a locally automated information system were much more likely to find people with outstanding arrest warrants and locate stolen vehicles than were police who only had access to statewide and national systems. These are the kinds of internal efficiencies that rational analysts identify as major values of computing use.

However, as one enlarges the array of activities and subjects included in an evaluation, the picture can alter dramatically. For example, Colton
also reported that the (ex-)Chief of the Kansas City Police Department at times lamented the enhanced efficiency of Kansas City’s system in helping patrolmen find stolen cars, unpaid parking tickets, and unregistered vehicles, because patrol officers made more field stops for these relatively minor offenses and displaced time from other important police tasks. Laudon (1974) adopted a frame of reference that included the network of criminal justice agencies from police through courts. He argued that the increasing success of police in locating stolen vehicles and in citing minor traffic offenders further clogged the courts, which were already jammed. Furthermore, Kraemer et al. found that in locally automated sites police were more likely to detain people who should not have been detained and to arrest people who should not have been arrested.

Studies that identify an array of parties with possibly conflicting interests in computing support for routine operations are relatively rare. Sterling’s (1979) study of the frequency of billing errors attributed to automated systems is an example. Again, systems that may perform well according to efficiency criteria narrow in scope (e.g., cost savings for the organization) may be problematic when viewed more broadly (e.g., inconvenience to clients). But it is also true that computerized systems that seem ineffective when evaluated with respect to narrow criteria of internal efficiency may appear of substantial value to their users when they are viewed in the context of a larger ecology of social relations. UMIS did little to improve the internal efficiencies of the welfare agencies in Riverville, but it improved their image of efficient administration enough that they had an easier time attracting Federal funding. The engineering simulations at WESCO that confounded the designers but which helped move a stalled project through another stage of auditing were valued for their political role.

Many computer-based systems persist because they save time, money, or tedium for their primary users. Some of these systems may pose a few important problems for their users or other relevant parties (e.g., clients, data subjects, colleagues). The use of automated text processing at CSRO that we reported in Section 2.10 illustrates such a case. As the groups affected by the use of a computer-based system increase in number and variety, one can expect that computing will serve some groups better than others. Moreover, it is also more likely that some groups will use computer-based technologies to alter their power and dependency relative to other groups rather than simply using the technology to maximize efficiencies in their own operations. Efficiency gains should be findings, however, rather than assumptions.

Rational analysts usually evaluate computerized systems in terms of efficiency or effectiveness criteria which are referenced to the computer-using organization or one of its subunits. In contrast, political analysts
assume that different groups which use a computer system or which interact with a computer-using group may have conflicting orientations and interests. In empirical inquiries, a political analyst would examine the orientations and interests of a wide array of participants in a computing milieu rather than assume they were relatively homogeneous. Thus, a political analysis of an organization’s milieu will be much more informative for understanding the utility and effects of computing on internal operations.
4.1.3 Management Control
Control systems are usually initiated by or for managers to help them direct their organizational resources or the activities of their subordinates. These systems parallel lines of authority and usually measure only a few aspects of behavior or activity. Rational analysts have argued that computer-based information systems can dramatically increase the effective control of higher level managers. Automated control systems are seen to provide managers with fine-grained, timely, and accurate information about the activities within their administrative domains. But few studies have empirically investigated the accuracy of these claims.

Independent of automation, organizational control systems may be problematic. When the resource that is controlled is inanimate, like parts for a manufacturing plant, the greatest dilemma may be in developing a good control procedure. Rational analysts have observed that a poor procedure, independent of whether it is automated, may lead to costly overstock (Ackoff, 1967). Conversely, a good procedure, such as “material requirements planning,” may demand such fine-grained records (of subassemblies, sub-sub-assemblies, their constituent parts, and stock levels to be manipulated) that for complex manufactured items that are made to order, digital computers provide the only economically feasible medium.

When the activities that are to be controlled depend critically upon people’s skill and cooperation, human relations theorists and political analysts have noted that control systems can become even more problematic (Lawler and Rhode, 1976). The staff of an organization may have several conflicting goals, but most control or reporting systems are designed to measure one or two aspects of a person’s or work group’s performance. These control systems are easily “gamed.” The participants in an organization may simply alter their work style to “make the numbers look good.” Assemblers may produce more components, but more of those components may be of lower quality. Employment counselors can gravitate toward clients who are easiest to place in jobs. Furthermore, these shifts may have severe consequences for the organization’s
achievement of a variety of important goals. Co-workers in assembly may sufficiently intensify their competition that morale drops and production suffers through continual turnover and the costs of retraining. Employment counselors may turn their attention away from the clients most in need, or they may place unsuitable clients to raise their own performance measures in the short run but lower both the agency’s credibility with employers, and the number of potential placements, in the long run. These are dilemmas of organizational control. These dilemmas are not simply resolved through the use of automated reporting systems.

In studies of computer use it is often difficult to decide which alterations to attribute to the algorithm or strategy adopted and which alterations to attribute to the automation of the algorithm or strategy. That dilemma is particularly critical in the case of control systems. The keen separation between the algorithm and its automation is often moot in practice because computer-based information systems may be the only technically or economically feasible instrumentality for implementing them.

The efficacy of several control systems has been studied in private firms by Markus (1979) and in public agencies by Albrecht (1976), Herbert and Colton (1978), Kling (1978b), and Kraemer et al. (1980). At this time, the efficacy of these systems seems mixed. Albrecht found that a caseload reporting system used in a large metropolitan court led to substantial reductions in case processing time. These findings are consonant with the rational perspective. In contrast, Kling and Markus found control systems that were of little utility to managers in controlling activities within their administrative domains. Markus studied a financial/accounting system and found that its primary utility was to rationalize the existence of the office of the corporate comptroller. Similarly, Kling explained the persistence of UMIS in primarily political terms.

Kraemer et al. investigated budget monitoring systems in local governments, and systems to help police commanders allocate their staff across beats, districts, and shifts. Herbert and Colton (1978) also studied police manpower allocation systems. Kraemer et al. investigated budget monitoring systems that were favorably viewed by central managers as helping them to track the expenditures of operating departments and to predict potential problem areas. The budgetary control prerogative of central managers is so well accepted that the staff of operating departments did not find this surveillance to be intrusive. When the staff in operating departments complained about budgeting systems, it was primarily because the systems were designed to serve the needs of central managers but did not serve their needs very well. Usually these complaints were assuaged when budgeting systems were redesigned to accommodate the interests of staff both in central finance departments and in the operating departments.
However, neither Herbert and Colton nor Kraemer et al. found police manpower control systems to be particularly efficacious. Herbert and Colton present a structural argument for the difficulties of linking automated manpower allocation to improved police performance. One line of argument addresses computing and police performance as part of a larger strategy for crime deterrence. Thus, manpower allocation models are part of a strategy to increase the number of criminal apprehensions by increasing the speed with which a patrolman can reach the scene of a crime; and manpower allocation schemes are used to deploy patrolmen to the beats and shifts where apprehensions would be most likely. But, Herbert and Colton argue, even if the models were relatively efficacious in helping police departments allocate their patrol, the effects on criminal activity would be negligible. Even if it were twice as likely that a suspect would be apprehended, his chance of conviction by a court would be only slightly increased given the delays, plea bargaining, and dismissed cases that are common in the American courts.

In contrast, the analysis presented by Kraemer et al. lends itself to a political interpretation. They find that manpower allocation models are least efficacious in cities where elected officials have relatively strong influence in computing decisions. In these cities, elected officials are likely to have strong influence in decisions about the allocation of other resources as well, including police. Since the quality of police service is a highly politicized topic in cities with moderate or high crime rates, city councilmen may refuse to accede to model-based analyses that lead to the loss of patrol support in their districts. For example, manpower allocation models in Atlanta indicated that police should be moved from outlying suburban areas to inner city beats. The councilmen in the suburban areas mustered a majority of votes to prevent the proposed shift. Thus, the models were “ineffective.”

Examination of control systems for budgets and police manpower reveals interesting contrasts in efficacy. Here a structural perspective seems fruitful. While both expenditure levels and crime rates are subject to exogenous influences, the extent to which departments spend their allocated budgets is more subject to administrative control. Moreover, budgetary control has been a traditional prerogative of central managers, who set budgets for operating departments at the beginning of the fiscal year. The staff of operating departments may feel that budget monitoring by central accounting staff does not strongly intrude upon their traditional prerogatives.

When there are major conflicts over who will control critical organizational resources, computerized information systems may simply be used as political instruments by the contenders, and fought as political intrusions. Albrecht (1979) presents an intriguing study of a case-tracking system
that the legal staff (e.g., judges) in a Southern court introduced to manage the work of the probation staff. Each of the two groups brought a different orientation to its work with defendants and convicts, and each was able to exercise moderate autonomy in its control over the meaning of its work. Albrecht indicates that the legal staff were concerned that cases be processed through the courts in an orderly manner and emphasized due process. In contrast, the probation staff emphasized rehabilitating individuals so that they could become productive and trusted members of the community. When the information system was being designed, each group proposed a reporting structure that minimized its accountability and maximized the visibility of, and possible control over, the other group.

An automated system that included a compromise set of data was built and operated for four years. But it was primarily used as a record-keeping system and rarely used to enhance the control of court administrators. Finally, it was removed and the court reverted to a manual record-keeping system. Albrecht views this information system as an instrument in the struggle between the legal staff and the probation staff for control of each other. Neither group was able to gain sufficient power to force the other to submit to its form of measurement and management. Since neither group could tightly manage the other, and thus provide “objective” data about the productivity and efficacy of court activities, the automated system was a sterile tool. This case also highlights the close coupling of management control systems and the exercise of power in organizations. Organizational power will be carefully examined in Section 4.2.
4.1.4 Policy Making and Strategic Planning
The roles of automated information systems in policy making have been examined from rational, structural, and political perspectives. This section examines sample studies from each framework. An example of the rational perspective is provided by a recent account of computer-based models by Licklider and Vezza (1978):

Computer-based modeling and simulation are applicable to essentially all problem-solving and decision-making . . . they are far from ubiquitous as computer applications . . . The trouble at present is that most kinds of modeling and simulation are much more difficult, expensive, and time-consuming than intuitive judgment and are cost-effective only under special conditions that can justify and pay for facilities and expertise.
These casual explanatory comments typify the weaknesses of much rational commentary on the social effects of computer-based technologies. They embody a simplified version of Simon’s decision-making perspective, and lack the attention to social context which Simon, on occasion,
provides. Neglecting the obiter-dicta claim that modeling and simulation are “applicable to essentially all problem-solving and decision-making,” presumably including ethical decisions, one is left with an odd account of the problems of modeling. Models “are far from ubiquitous,” and “the trouble is” they are difficult and costly to develop and use. But the appropriateness of modeling is not linked by Licklider and Vezza to any discernible social setting or the interests of its participants. Licklider and Vezza’s claims are not aimed at policy making in particular. They could include simulations for engineering design as well as for projecting the costs of new urban development. However, their comments typify the rational perspective when it is applied to information systems in policy making; the presumption is that differences in social settings make no difference.7

But scholars who have carefully studied the conditions under which computer-based technologies are adopted and used have found that the character of social settings is a potent influence (Danziger and Dutton, 1977; Greenberger et al., 1976; Kling, 1974, 1978a,b; Rule, 1974). Whether models and simulations “are applicable to essentially all problem-solving and decision-making” by virtue of helping inform decisions, or whether they are used to rationalize and obscure the bases of decisions made on other grounds, depends in part on the degree of consensus over means and ends held by the active participants in a decision arena (Greenberger et al., 1976; Kling, 1978a). Exactly what “special conditions” Licklider and Vezza believe “can justify and pay for facilities and expertise” cannot be discerned from their account. The reader is provided with an ambiguous analysis, which is framed in rational categories, but which can be interpreted within broad political understandings of organizational action.

Greenberger et al.’s (1976) investigation of the use of computer-based modeling systems as guides to policy makers in public agencies is primarily a structural analysis with some political elements. They studied the role of econometric models in developing U.S. fiscal policy; Forrester’s WORLD III model, which stimulated the “limits to growth” debate in the United States; and an operations research model for locating fire stations used by the City of New York. An advocate of rational modeling would assume that models such as these could and should be used by policy makers to help sharpen their perceptions and help select among alternative policies. Greenberger, Crenson, and Crissey rarely found that policy makers’ choices were influenced directly by model-based analyses.
7 Simon waffles on the role of social settings in altering the roles played by computer-based systems. In Organizations (1958), he and March indicate that analytical problem solving will occur under special organizational conditions. However, when he recommends the use of computer-based systems, he does not qualify his advice by attention to social context (Simon, 1977). See Kling (1978f) for a critique of Simon’s analyses.
Political actors often used models, but they were employed to generate support for policies selected in advance. When modeling efforts were influential in a decision arena, it was often the modeler who was called upon to inform decisions rather than modeling runs. Models did help inform decisions indirectly by stirring partisan debate. While “modeling as advocacy” was best illustrated through the “limits to growth” debate and through their account of Laffer’s predictions of the U.S. GNP in 1971, modeling as a tool of advocacy is commonplace. Under such conditions, the most constructive role for models has been to help advocates of different positions make their cases and to critique the assumptions of their antagonists.

Greenberger, Crenson, and Crissey’s subtle analyses of each modeling effort indicate that details and structural arrangements differ across problem domains and social arenas. Bargaining over firehouse locations in New York City differs from predictions of the Gross National Product in response to an increase in the Federal Reserve Board’s prime interest rate. City councils and Congress, firemen’s unions and bankers, act differently. But in each case, Greenberger, Crenson, and Crissey take the world of policy making as given, with its short time horizons, fluctuating attention to issues, and attempts of participants to mobilize support, placate critics, and displace problems that do not require immediate attention. They find that modeling does not easily fit the fragmented world of public policy making. They note that modeling efforts often require good definitions of the questions to be asked, while policy makers are often working with shifting definitions of the dilemmas they face. In addition, modeling efforts often require several years to design, program, and fine-tune. Many policy matters are resolved more rapidly; indeed, many political actors are out of office or transferred to other jobs within any two-year period. Greenberger, Crenson, and Crissey also note that models help an actor gain intellectual mastery over a given problem domain by integrating many factors and their interactions. In contrast, political arenas, particularly in the Federal government, are designed to manage problems by factoring them into chunks which can be delegated to different administrative and regulatory agencies. Each of these structural features (e.g., time horizon, stability of people and problems) becomes an element in their explanation of why modeling has been problematic.

In addition, they play down the political attractiveness that modeling and modelers offer politicians. If the modeler is called upon to provide sage advice, he can be viewed as a flexible political resource. He can be chosen to have a world view that is closely aligned with the politician’s, but to use the “objective authority” of his modeling to enhance the credibility of casual, partisan analyses.

Despite the structural “misfits” between model-based analyses and the
dynamics of policy making that they emphasize, Greenberger, Crenson, and Crissey are optimistic about the value of modeling efforts and recommend strategies to improve their utility. They focus upon social strategies that institutionalize the development of specific models which may be used to answer recurrent questions, and that intensify the uses of countermodeling and multimodeling. They clearly eschew those explanations of the failures of rational modeling efforts that hinge on the technical weaknesses of contemporary models.

In a study of the use of automated information systems in municipal policy making, Kling (1978f) contrasted the explanatory power of rational models and organizational politics models. Automated information systems that provided demographic, economic, housing, and transportation data were available to most of the city councils, planning staff, and top administrators in the 42 cities he studied. Computer-based reports were provided to city councils several times a year, on the average. In about half the cities these reports never led to clearer perceptions or surprises about city conditions. In less than 10% of the cities did these reports generally provide clearer perceptions or surprises to city council members. However, in 35% of the cities, the reports were generally used to enhance the legitimacy of decisions and perceptions of city councilmen. In about 10% of the cities, the reports were generally used to gain publicity for programs. In about 25% of the cities, these reports were sometimes used to legitimize perceptions and gain publicity for programs supported by council members.

These patterns could be attributed to the primitive state of automation in American cities. If cities utilized more sophisticated data resources, then perhaps policy makers would find more surprises in the reports they received and be less likely to use them as political resources. Kling developed several measures of the degree and richness of automated data systems supporting policy analysis in each of the 42 cities. He found that in more highly automated cities, policy makers were more likely to report clearer perceptions and surprises gleaned from reports based on automated analyses. However, he also found that policy makers in more highly automated cities were more likely to use these same reports to legitimize their perceptions and gain publicity for their preferred programs. These findings indicate that computer-based analyses are a social resource used by political actors in the same manner as other social resources. They do not alter the policy-making style of political bodies; they are appropriated by policy makers and adapted to their styles of organizational work.

These studies indicate that computing can play multiple social roles in the organizations which adopt it. While the studies cited here have been carried out within public agencies, there is no reason to believe that the
internal organizational dynamics of private organizations are substantially different. For example, Pettigrew’s (1973) study of computing adoption decisions in a private manufacturing firm indicates that office politics can be as pervasive as in public agencies. In an effort to implement an analytical model to assist the officers of a Boston bank in locating new branches, Gibson (1975) also discovered actively “political” modes of decision making at the higher corporate levels. Simon indicates similar perceptions when he suggests that the manager in a more automated firm:

. . . will find himself dealing more than in the past with a well-structured system whose problems have to be diagnosed and corrected objectively and analytically, and less with unpredictable and sometimes recalcitrant people who have to be persuaded, prodded, rewarded, and cajoled. . . . Man does not generally work well with his fellow man in relations saturated with arbitrary authority and dependence, with control and subordination, even though these have been the predominant human relations in such settings in the past. He works much better when he works with his fellow man in coping with an objective, understandable, external environment (Simon, 1977, pp. 132-133).
Despite this hope that automation will help diminish the abrasive conflicts within organizations, there is no evidence that computer-based systems have “depoliticized” the organizational settings in which they are employed. Strassman (1973) observes:

The motivating forces behind every game in the systems field are the achievement of power, influence, “political posture,” high compensation levels for its members, and a disproportionate share of the corporation’s resources.
There is a vast literature on the design of information systems for organizations, but remarkably few studies actually evaluate the use of systems in place. Only a handful of these treat the interplay between automated information systems and the social order of their host organizations.8 The studies that have been summarized here indicate that computer-based information systems that are regularly used are grafted into the ongoing politics of their host organizations. Rather than altering the character of social “relations saturated with ‘arbitrary’ authority and dependence,” automated systems are best received when they buttress the positions of organizational actors with substantial power (Dutton and Kraemer, 1978; Kling, 1974, 1978f; Kraemer and Dutton, 1977).

8 In addition to the studies described here, see Colton’s (1979) studies of computer applications for police, the studies of Kling (1978b, 1980) and Laudon (1974) described earlier, and Stewart’s (1971) studies of large-scale decision-support systems in British firms.

4.2 Organizational Power

In Section 4.1.3 we discussed “managerial control,” by which an administrator increases his influence within a domain of activity over which
he has primary supervisory responsibility. However, there are many arenas of organizational life in which participants share their influence, and may attempt to increase their influence (Kling, 1980). One participant exerts power over others to the extent that he can effectively influence them to see things and do things his way. It is a relatively common observation that computer-based systems increase the influence of parties who have access to the technology and understand its use (Danziger, 1977). Such an analysis led Downs (1967) to suggest that data custodians will gain power relative to other staff, and that full-time administrators will gain power relative to elected officials.

Recently, Kling (1978f) and Kraemer and Dutton (1979) have investigated the role of information systems in altering balances of power in American municipal governments. They collected data for investigating alterations in power relations attributable to the use of computer-based reports. In each city the researchers coded the extent to which different actors gained or lost power because of computer-based reports. “Power” was treated operationally, as “effective influence”; actors who seemed to have gained influence in the outcomes of decisions made during the previous year or two were viewed as having gained power. A weak group gaining power need not be relatively strong, and a strong group losing power need not be relatively weak. The following patterns stood out in the data:

1. In a majority of the cities there were discernible shifts of power to or from some role (e.g., mayor) attributable to computer use.
2. However, there were no shifts of power to or from any single role (e.g., mayor) in a majority of the cities. In each city, different sets of actors gained or lost power because of computer use. Moreover, many actors had their relative power unaffected by computer use. (Respondents also attributed shifts of power to a variety of sources including demographic changes in the city, personal style of top officials, etc.)
3. Data custodians (e.g., planners) were most likely to gain power, and never lost power, because of computer-based analyses. Nevertheless, data custodians remained relatively weak. At best they appeared as the favored experts of more powerful actors. At worst, their counsel was distrusted and their reports received little sustained interest.
4. Supporting Downs, city councilmen were most likely to lose power (20%) and rarely gained power (5%) because of computer-based analyses.
5. Supporting Downs, top-level administrators (city managers and CAOs) often gained power (27%) and rarely lost power (3%) when there were any shifts at all.
The conditions under which mayors and departments gained and lost power were more complex. The top elected officials in a city were more
likely to gain power and less likely to lose power than all other role takers except the data custodians. This suggests that computer-based analyses reinforce the patterns of influence in municipal governments (Kling, 1978f; Kraemer and Dutton, 1979). Kling (1974) suggests that computer-based information systems should reinforce the structure of power in an organization simply because computer-based systems are expensive to develop and use. Thus, top officials who can authorize large expenditures will, on the average, ensure that expensive analyses serve their interests. However, the data reported by Kling and by Kraemer and Dutton indicate even more subtle patterns of influence. Correlations were computed between power shifts attributable to computing and measures of both city size and the extent of automation in policy analysis. Departments gained power while top officials lost power in the largest cities. This pattern follows the structure of influence in American municipalities. In the smaller cities, the mayors and city managers were able to stay on top of department operations and place a strong stamp upon municipal activities. In larger cities, the departments could be vast fiefdoms (particularly police and public works) and the top officials were placed in a weaker bargaining position vis-a-vis department heads. In the cities with several hundred thousand residents the larger departments had sufficiently large budgets to afford their own skilled analysts to build information systems that suited their needs, and to do it with less scrutiny from top officials. Thus, in these cities, it is not top officials who gained most, but managers who were sufficiently high in the organizational hierarchy that they could command large resources. A general observation drawn from the analysis of computer-based systems for policy analysis in cities is that they often serve as political, power-reinforcing instruments (Kling, 1978f; Kraemer and Dutton, 1979). They differ from automated upward reporting systems that may cause data providers to lose power to data collectors (Kling, 1974). In general, to predict the influences of a computer-based information system, one must have in mind a sharp characterization of the distribution of influence in the social setting in which it is used (Dutton and Kraemer, 1978; Gibson, 1975; Greenberger et al., 1976). Automated information systems should be viewed as social resources that are absorbed into ongoing organizational games, but which do not materially influence the structure of the games being played. In organizations in which “normal business” is highly politicized, computing will not be a neutral resource. It can easily be used to reinforce the power of potent actors (Kling, 1980). This analysis is neither vacuous nor obvious. Recently, McLaughlin (1978) recommended that the Office of the President should develop a “policy management system” with extensive automated support to help provide greater control over the (then) United States Department of
Health, Education, and Welfare. Since the Federal cabinet agencies have developed into extensive bases of independent power vis-a-vis the Presidency, they may be analogous to departments in the larger cities. The analysis presented here suggests that extensive automation is likely to reinforce the power of the executive agencies. In that case McLaughlin’s strategy would backfire. This analysis indicates that the political order of the social setting in which a computer-based system is utilized must be well understood, in addition to the technical features of the system, to predict the likely uses and impacts of computing. This principle undermines the sufficiency of the formulations of analysts who emphasize the technical characteristics of systems and neglect the political dynamics of the settings in which automated data systems are utilized.
4.3 Computing and Work

There is no paucity of images to portray the effects of computing on work life. Bell (1973) and Myers (1970) provide enthusiastic and largely rational accounts of computing as an aid to “knowledge work.” They both emphasize how computer-based technologies enlarge the range and speed at which data are available. Myers, for example, concludes that “computing will help relieve ‘specialists and professionals’ of the time-consuming and repetitive parts of their work.” In contrast, Braverman (1974), a class politics analyst, argues that managers conceive of workers as general purpose machines which they operate. He views automation as a managerial strategy to replace unreliable and finicky “human machines” with more reliable, more productive, and less self-interested mechanical or electronic machines. He argues that computerized information systems usually are organized to routinize white collar work and weaken the power of lower level participants in an organization. Human relations analysts occupy a middle ground between these extremes of euphoria and gloom. They believe that the effects of computerization are contingent upon the technical arrangements chosen and the way they are introduced into the workplace (Argyris, 1971; Kling, 1973). While human relations analysts place considerable value on jobs which allow people opportunity for self-expression and trust (Argyris, 1973), they argue that the character of jobs and the role of computing in work is to be empirically determined (Mann and Williams, 1960; Kling, 1973).

4.3.1 Computing and Job Characteristics
Empirical studies of computing in the workplace have been dominated by human relations approaches (Mann and Williams, 1960; Mumford and
Banks, 1967; Kling, 1978e; Robey, 1979). Class politics analysts often draw upon studies done by others, and then interpret them in their own framework (Braverman, 1974; Mowshowitz, 1976), although Noble (1978) has developed a provocative, empirically grounded account of computer numerical control in machine shops that is based upon his own field studies. Rational analysts usually emphasize the information processing potentials of computing, but remain relatively remote from empirical inquiries. Regardless of their theoretical orientation, most analysts view computing as a potent technology that is likely to have a powerful influence on the workplaces where it is used. It is easy to think of technological changes, such as the transformation of craft fabrication to assembly lines, which have been organized to have powerful influences on the character of work. Kling (1978e) recently investigated the impacts of computer use on the character of white collar jobs through a survey of 1200 managers, data analysts (e.g., urban planners), and clerks in 42 municipal governments. While his study is carried out within the human relations tradition, it casts doubt on the easy assumption one finds in class politics and rational analyses that computing is a potent workplace-transforming technology. Most of Kling’s respondents used computer-based reports and attributed job-enlarging influences to computer use. Respondents also attributed increases in job pressure to computing. These effects increased with the extent to which respondents used computer-based reports in their work. Computer use had perceptible, but not dominant, effects on the jobs of many people who used the technology; overall, Kling’s respondents reported that computer use enlarged their jobs, but did not profoundly alter the character of their jobs. Computing was viewed as a salient technology by managers, data analysts, and clerks who utilized it regularly. For some, computing substantially increased the ease with which they could obtain valued information, and made possible more complex data analyses. For others, it had little utility. White collar workers in several different occupational specialties attributed clear, often positive influences to computing in their jobs. Computing use also moderately increased the level of task significance and job pressure in the work of computer users. Kling’s (1978e) findings support those of Whisler (1970), who indicates that managers are better served by computing than are clerks. But even traffic clerks in Kling’s study generally attributed job-enlarging influences to computing. Kling’s data lend no support to analysts like Braverman (1974) and Mowshowitz (1976), who argue that the dominant effect of white collar computer use is to diminish the quality of working life. On the contrary, they support claims that computer use often enlarges the jobs of workers who use the technology.
The detailed patterns and levels of the impact of computing vary from one occupational specialty to another and can be understood best in terms of different work contingencies and information system designs. In addition, the technology is not always easily implemented: difficulties in getting data, dealing with computer specialists, the computing services organization, and the technology itself all add minor and continual turmoil to the workplace; but overall, the technology does not dramatically change the character of work. Rather, according to Kling’s findings, it has a benign and minor influence on the work of computer users.

4.3.2 Automated Information Systems and Supervision
It is often assumed that when automated information systems become available, managers and line supervisors exploit them to enhance their own control over different resources, particularly their subordinates’ activities (Downs, 1967; Whisler, 1970). Two strategies are commonly adopted: reorganizing work so that tasks are more finely subdivided or control over key decisions moves “up” the organization (Noble, 1978), or directly employing automated information systems to enhance the grain, comprehensiveness, and speed of organizational control systems (Kling, 1974; Lawler and Rhode, 1976). In this section we examine the latter strategy. Episodes illustrating the use of automated information systems to better monitor the activities of employees, particularly the quantity of their work, can be found in many computer-using organizations (Kling, 1974). Both the image of computing as enhancing the rationalization of white collar work (Braverman, 1974; Mowshowitz, 1976), and reports by managers that they were losing autonomy (Mumford and Banks, 1967), support the hunch that such uses are widespread and effective. In short, computing and closer supervision are held to go hand-in-hand. Nevertheless, there are few good studies of automated information systems and patterns of supervision. In Section 4.1.3 we reviewed the case reported by Albrecht (1979) in which the probation staff of a court were able to prevent the legal staff from employing an automated information system to monitor their casework. Albrecht’s case provides a rich account of how a specific information system was introduced by actors (legal staff) who wished to control others in their work organization. How commonly do organizational actors attempt to utilize information systems as direct aids to supervision, and how successful are they? Kling’s (1978e) study of the role of computing in white collar work provides some helpful data. Each of his respondents (described in Section 4.3.1) was asked to report whether or not automated information systems had increased or
decreased the extent to which he or she was supervised. Few of the respondents attributed increases in their level of supervision to computer use. Kling found that traffic clerks and managers, rather than accountants, detectives, and planners, most frequently (12-16%) reported increases in their level of supervision. The respondents were also asked whether computing provided their supervisors with information about the quantity and quality of their work. Between 25% and 36% of the respondents believed that computer-based reports conveyed such information. In addition, the managers were asked whether computer-based reports helped them monitor and control the work groups within their organizational domains. A majority of the managers indicated either that they received no relevant computer-based data, or that the computer-based data they received were of no use for enhancing their control over their subordinates. A small fraction indicated that computer-based reports were a major aid. Many managers indicated that computer-based data were sometimes, but not usually, helpful. Again, understanding the contingencies of different jobs and the ways in which information systems are used helps us understand these patterns. Computer-based reports are sometimes used to monitor the workloads of detectives, property appraisers, and social workers. However, such data are usually available in manual records as well and are simply spun off as a by-product of the computer-based systems (e.g., a welfare client-tracking system). First-line supervisors often have some concept of the relative quantity and quality of the work done by their staffs. In most police detective details cases are assigned to individual detectives by a supervisor; planners are assigned work on particular projects and their reports are reviewed by higher level staff; accountants work on particular projects or with particular funds. Much of the work is carried out in small groups where supervision can be informal and close. Even when formal statistics may help, their explicit use seems to produce resentment, suspicion, and controversy over the meaning of the data. The information systems used by most of Kling’s nonclerical respondents were not well tailored to provide information about their work style, or the quantity or quality of their work. Ledgers do not record the activity of an accountant, and demographic data bases do not record the activities of planners who use them. As we indicated in Section 4.1.3, automated systems can enhance the control of one group over another (Kling, 1974).⁹

⁹ Sometimes closer supervision accompanies computing even though it is not accomplished through computing. One municipal water department installed an on-line inquiry system to help citizens learn about and negotiate their bills and payments. After the system was installed, the clerks who used it and who handled incoming phone calls had their telephone time monitored to ensure that they did not spend more than 2 minutes, on the average, in handling inquiries. Examples such as this lead casual observers to associate computer use with increases in supervision even when computing is not the instrument of enhanced supervision.
But, in many workplaces where computing is used to automate records and data analysis, the technology has not been exploited by managers to increase their information about the activities of their subordinates. Rather, the scenario of computing and control is of accountants in central budget units monitoring the expenditures of line departments; detectives seeking to catch lawbreakers; and municipal staff using automated traffic ticket processing systems to help catch scofflaws. Taken together with Albrecht’s study, these data indicate that computing should not be viewed as a technology with necessary, fixed patterns of use and consequences. Computing is selectively exploited, as one strategy among many, for organizing work and information. However, the patterns of computer use appear to fit the workplace politics of the computer-using organization. In current practice semiprofessionals do not attribute increased supervision to the use of automated information systems. Structural analysts would suggest that managers might find other strategies for controlling their subordinates to be more effective. Political analysts would suggest that semiprofessionals such as probation officers, accountants, and detectives have sufficient control over their own work that managers cannot succeed even when they attempt to use formal reporting systems to enhance their supervisory control.

4.3.3 The Integration of Computing in Work
The published studies of computing in the workplace are relatively mechanistic in conception. When workers have little discretion over their use of computing, they may experience computing as an external force to which they must adapt. However, many white collar workers have some discretion over the conditions and occasions under which they will use computing. Detectives, for example, can initiate searches in extensive file systems relatively early or late in working a case. Individual detectives have substantial flexibility in integrating computing use into their work. Similar observations apply to the engineers at WESCO described in Section 2.1 or the managers of stock portfolios described by Keen and Scott Morton (1978). People who have substantial discretion in their decisions about whether, when, and how to use computing can utilize it as a relatively flexible resource to fit many social agendas. It may be used primarily as a rational instrument to save time or to ease the handling of complex bodies of
information. Some modes of automation, such as automated testing of electronic components, may help managers employ less skilled workers to reduce costs or maintain stable supplies of labor. But its use may also be invoked as a symbolic gesture to appear innovative, as a medium for playful work, or as a political instrumentality. Little is known about the way that semiprofessional workers can adapt computing to their worklives, or the ways in which their worklives are altered with different computing arrangements. We believe that the interactionist approach, which emphasizes the ways in which people construct social meanings around the different elements of their worklives (tasks, technologies, and people), may provide the richest starting point for this line of inquiry (Kling, 1980).

5. Conclusions
In this chapter we have analyzed the development, use, and consequences of computing from six different theoretical perspectives which we introduced in Section 1.3. This has been a rather complex task, even though we have neglected many important aspects of computing in organizations. We have tried to demonstrate how each “phase” of computing (its development, use, and consequences) entails a complex set of social activities. In addition, we have indicated how the assumptions one makes about behavior in organizations color what we look for, what we find, and what sense we make of computing developments. We have examined some of the more important and recent studies that shed light on the social dimension of computing in organizations. The research studies of the few topics we have covered, however, form a rather fragmented and not terribly cumulative literature. Rather than summarize the substantive findings here, we reexamine the six perspectives that we have used to integrate the diverse topics we have examined. Then we suggest some promising lines for further investigation.

5.1 The Six Perspectives in Perspective
The six theoretical perspectives help us understand the assumptions behind the questions we have asked and the answers different analysts have found. The rational perspective dominates the majority of analyses of computing, particularly those that are written by practitioners and found in trade journals or the internal documents of organizations. Rational analyses provide coherent “first approximations” which help provide the reader with simple, memorable, and clear conceptions of how computing systems are developed and used, and what interests they serve. Unfortunately, all too often rational analyses systematically mislead.
The rational perspective severely simplifies the world of computing by assuming that major participants in a computing setting share common goals and values (Kling, 1980). In addition, rational analysts usually assume that the formal structure of authority in an organization provides a good guide to the way activities are carried out. Reference to consensus over goals helps mobilize diverse groups to share, or at least not impede, a collective effort. But statements born of institutional convenience are unlikely to provide adequate conceptual schemes for undertaking serious analytical work. Unfortunately, aside from the most careful studies of computing use which “triangulate” accounts of different participants and examine formal data, much of what we believe about computing in organizational life is based upon a fragmented set of interested accounts. Structural analyses often enrich the rational perspective by indicating the conditions and arrangements under which specific activities will occur or are appropriate. Human relations analysts add richness to the relatively dry and mechanistic world of structural analysts by “bringing people back in.” However, the human relations tradition has been almost exclusively preoccupied with individuals and work groups. It provides managers with studies informing them how best to organize so that their workers will be both happy and productive. However, it provides little insight into other aspects of information technology in organizational life: the diverse incentives for computer use, the fragmented world of computing in many organizations, and the alterations of power between computer users and others. Both interactionist and organizational politics analyses keep people sharply in focus, and are usually less biased toward the interests of managers than are human relations analyses. They can also investigate a wider array of social forms beyond the task group (Danziger et al., 1980; Kling and Gerson, 1977, 1978). Class analysts have undertaken few studies of computing in organizations. Largely, they treat computing simply as a complex technology that supports the power of dominant classes in a capitalist society. Class analysts have taken the position that work in capitalist society is degrading, and have tried to demonstrate how managers attempt to deskill programmers (Kraft, 1977; Greenbaum, 1976) and clerical computer users (Braverman, 1974). Class analysts play an important role in pointing at the ways in which managers of work organizations often organize work so as to maximize productivity rather than provide richer worklives and significant forms of control for their subordinates. We believe that each perspective provides us some insight into some questions (Kling, 1980). For reasons that we have tried to clarify in this chapter, we nevertheless find that the interactionist and organizational politics perspectives provide more analytical bite under a wider variety of circumstances
than their competitors. The researcher’s choice of perspective leads to many important ramifications for appropriate research methods. If, like rational analysts, one believes that there is consensus over goals, then one “key” informant may provide sufficient data about the development, use, and consequences of a given computer system. On the other hand, if one believes that knowledge about computing is fragmented, and that participants are interested rather than disinterested, then one may need many informants, perhaps several dozen, to “adequately” indicate the social contour of a given computer system (Kling, 1978b; Markus, 1979; Scacchi, 1980). These choices have important consequences for the research style most appropriate to the perspective and the questions asked. The more data needed to make sense of a given event or computing arrangement, the more costly the research if many events or systems are to be studied. Conversely, political and interactionist analysts are also more likely to produce qualitative reports, since a common research instrument is less likely to capture the array of meanings and nuances communicated by a wider array of situated respondents. Simply specifying a particular “perspective” is relatively weak since there can be alternative accounts within any approach. For example, Danziger et al. (1980) examine the explanatory power of three different political models for the control of computing in organizations: elite pluralism, technocratic elite, and reinforcement politics. Much of the theoretical diversity (or confusion) of the studies reported here reflects the state of organizational theory generally. There are many competing perspectives (see Perrow, 1979). Moreover, scholars of organizations have trouble agreeing on even simple matters such as how to understand “effective” organizations (Goodman et al., 1977). Studies of the social dynamics of computer use adopt theoretical perspectives from the sociological literature (e.g., Gerson, 1976; Strauss, 1978a,b), and they reflect important weaknesses or difficulties that sociological theorists have not yet resolved.

5.2 Lines for Development
There is also much that we have omitted. First, many computer scientists would have developed this chapter by focusing on specific software development strategies (e.g., structured programming, automated testing), on particular technologies (e.g., minicomputers, networking), or on managerial arrangements (e.g., pricing computer services, centralization vs. decentralization of computing). These are the practical terms with which developers and users of computing often express their choices. Each of the six perspectives developed in this chapter can be used as a
frame of reference for analyzing these topics. We hope that the analysis presented here can be picked up by practitioners and scholars and turned to these more practically framed questions. Second, most of the studies that examine computing or computing practices in organizations emphasize the instrumental side of computing. But computing is the occasion for symbolic, ceremonial, and ritual behavior as well. For example, “cost-benefit” studies are often invoked as rationales for the adoption of a new computing application or for some major reorganization of computing arrangements. However, cost-benefit studies are often carried out more to support beliefs than for their analytical role in helping select cost-effective choices. Important “benefits” such as enhanced organizational control or images of innovation are usually difficult to quantify in monetary terms. As a consequence, “cost-benefit analyses” often boil down to “we knew there would be costs, we hoped there would be benefits.” Slogans like “cost-benefit analysis” or “state-of-the-art systems” indicate just a few of the many symbolic and ceremonial events that surround the development and use of computing. Third, we have treated organizations as relatively closed systems. Recent studies of the industrial and social world in which computing is developed and used indicate that organizations and groups outside the computer-using organization can have profound influences upon computing activity (Kling and Gerson, 1977, 1978; Kling and Scacchi, 1979a,b; Scacchi, 1980). Professional societies provide arenas for their members to learn about kinds of computing that they can “import” into the organizations for whom they work. They are also arenas in which specialists (accountants, urban planners, production planners, programmers, data processing managers) develop their conceptions of “adequate” or “interesting” technologies. Programmers now work in a tight labor market in most urban centers, and employers often have to keep their operations technically current to attract and retain skilled staff. These links between computer-using organizations and larger social worlds merit serious attention. Moreover, computer-based information systems are fit into a complex ecology of organizations with increasing frequency. The FBI, state police agencies, and police in the largest urban centers operate information systems that contain criminal histories and information about wants, warrants, and stolen property that cross a variety of organizational and jurisdictional boundaries. Networks of point-of-sale terminals and credit verification networks link geographically dispersed merchants with central banks and credit reporting agencies (Kling, 1978d). Some manufacturing firms that have automated their entering of orders receive streams of automated messages from their larger customers who place orders
through their own automated systems. On the consumption side, both retailers and manufacturers are turning to automated inventory control systems that trigger purchase orders in an increasingly automated manner. In short, information systems that are shared between organizations or which contain data about transactions between them appear to be tightening the links between organizations in certain industries. However, the nature of these developments, their modes of use, and their consequences for their participants are poorly understood. As the number of groups with diverse perspectives who deal with a given computerized system increases, its character is likely to become more markedly social. Software developed in such settings is likely to be more complex as different groups place somewhat conflicting demands upon the ways a given system should operate and what modes of information processing it should incorporate. Moreover, since any system entails some compromises over how it will actually work and what activities it will support, the use of systems of larger “social scope” is even more of an arena for social conflict (Kling, 1978e; Albrecht, 1979; Markus, 1979). Fourth, an appropriate new line of investigation would directly examine the relations between computer-using organizations and their clients. The earliest accounts of computing that examined the losses of personal privacy that accompanied large-scale computerization served to focus attention on privacy as an issue. But they also deflected attention from other important elements in the relationships between the public and computer-using organizations. Rule (1974) made an important contribution by framing concerns about personal privacy as an issue about the kind of social control that organizations can exert over their clients. Sterling’s (1979) important study of billing errors opens this line of investigation in a new way. Traditionally, studies of computer-using organizations that provide public services focus on the computer-using organization and its administrative practices (Rule, 1974; Laudon, 1974; Kling, 1978b; Albrecht, 1976, 1979; Kraemer et al., 1980). These studies emphasize the character of services, while Sterling’s gives explicit attention to the people who are receiving services from the organization. As a wider variety of organizations automate their operations, the way that computer-using organizations interact with their publics will become more critical (Kling, 1978c,d; Rule, 1974; Sterling, 1979). Moreover, larger firms are more likely to invest in computing than their smaller competitors. In retail sales and banking, for example, these are likely to be organizations whose operations span wider geographical areas and which have larger clienteles. However, larger firms provide a different kind of “service” or product than smaller, local competitors that sell the same commodity. In some cases, such as chain stores, they provide a
wider variety of nationally distributed merchandise at lower prices. But they are usually less personal, and less likely to distribute locally produced merchandise. In these cases the customer finds a wider choice of mass produced items or services, but is likely to have to deal with remote organizational units to make special requests and deal with problems. To the extent that such regional, chain-organized firms utilize computer systems to rationalize their marketing, distribution, and billing, computing may appear as “essential” technology. But automated information systems may help support larger scales of effort against which smaller local firms cannot easily compete. Ceteris paribus, computerization helps increase the concentration of selected consumer industries, such as retail chains, restaurants, and banking (Kling, 1978d). The net effect for consumers would be not only dealing with the automated systems used by specific firms, but also having a narrower choice of standardized products from which to choose, and fewer firms to choose between. The opportunities for examining these interactions in relatively permeable service organizations such as chain stores, rather than courts and welfare agencies, will make investigation easier. While these four sets of questions are not the only ones which might be asked about the development, use, or consequences of computing in organizations, the last three strike us as some of the more fruitful points of departure. This chapter has examined “computing as social action.” Most attention to computing activities focuses on the equipment (hardware and software) and, secondarily, on the organizational procedures set up to manage the technology and the information flows it supports. Only the most meager attention is directed to the ways that people live and work with computing, and the ways they shape and are shaped by its uses. We have focused most of our attention on the latter because we believe that it is the most neglected aspect of computing. People who work in complex organizations develop computer-based systems, embed them in their work, and find themselves and others dealing with the consequences. Computer technologies are not artifacts of nature like limestone caves. They are conceived, designed, shaped, ignored, tinkered with, layered, redesigned, sabotaged, criticized, and appraised to fit a complex web of human interests. That, in large part, is the interest and importance of computing.

ACKNOWLEDGMENT
This work was supported by NSF Grant MCS 77-20831. The analyses reported here were sharpened by discussions with colleagues at the University of California at Irvine and elsewhere. In particular we wish to thank James Danziger, William Dutton, Les Gasser,
John Leslie King, Kenneth Kraemer, and Jim Neighbors. In addition, Peter Freeman and Elihu Gerson provided helpful and careful readings of an early draft of the manuscript.

REFERENCES

Ackoff, R. (1967). Management misinformation systems. Manage. Sci. 14, 147-156.
Albrecht, G. (1976). The effects of computerized information systems on juvenile courts. Justice Syst. J. 2, 107-120.
Albrecht, G. (1979). Defusing technical change in juvenile courts. Sociol. Work Occup. 6, 259-282.
Allison, G. T. (1971). “Essence of Decision: Explaining the Cuban Missile Crisis.” Little, Brown, and Co., Boston.
Alter, S., and Ginzberg, M. (1978). Managing uncertainty in MIS implementation. Sloan Manage. Rev. 19, 23-31.
Argyris, C. (1971). Management information systems: the challenge to rationality and emotionality. Manage. Sci. 17, 275-292.
Argyris, C. (1973). Personality and organization theory revisited. Public Admin. Rev. 33, 141-167.
Barkin, S. R., and Dickson, G. W. (1977). An investigation of information system utilization. Inf. Manage. 1, 35-45.
Belady, L. A. (1978). “Large Software Systems.” IBM RC 6966. IBM Thomas J. Watson Research Center, Yorktown Heights, New York.
Bell, D. (1973). “The Coming of the Post-Industrial Society: A Venture in Social Forecasting.” Basic Books, New York.
Bernard, D., Emery, J. C., Nolan, R. L., and Scott, R. H. (1977). “Charging for Computer Services: Principles and Guidelines.” Petrocelli Books, New York.
Blau, P., Falbe, C. M., McKinley, W., and Tracy, P. (1976). Technology and organization in manufacturing. Admin. Sci. Q. 21, 20-40.
Boehm, B. W. (1973). Software and its impact: A quantitative assessment. Datamation 19, 48-59.
Boehm, B. W. (1976). Software engineering. IEEE Trans. Comput. C-25, 1226-1241.
Boehm, B. W. (1979). Software engineering - as it is. Proc. Int. Conf. Software Eng., 4th, pp. 11-21.
Braverman, H. (1974). “Labor and Monopoly Capital: The Degradation of Work in the Twentieth Century.” Monthly Review Press, New York.
Brooks, F. P. (1975). “The Mythical Man-Month.” Addison-Wesley, Reading, Massachusetts.
Burch, J., Strater, F., and Grudnitski, G. (1979). “Information Systems: Theory and Practice,” 2nd ed. Wiley, New York.
Cave, W. C., and Salisbury, A. B. (1978). Controlling the software life cycle - the project management task. IEEE Trans. Software Eng. SE-4, 326-334.
Champine, G. A. (1978). “Computer Technology Impact on Management.” North-Holland Publ., Amsterdam.
Colton, K., ed. (1978). “Police Computer Technology.” Lexington Books, Lexington, Massachusetts.
Colton, K. (1979a). The impact and use of computer technology by the police. Commun. ACM 22, 10-20.
Colton, K. (1979b). Police and computer technology - the expectations and the results. Proc. 1979 Natl. Comput. Conf. Vol. 48, pp. 443-453.
Cooper, J. D. (1978). Corporate level software management. IEEE Trans. Software Eng. SE-4, 319-326.
Cuff, E. C., and Payne, G. C. F., eds. (1979). “Perspectives in Sociology.” Allen & Unwin, London.
Danziger, J. (1977). Computers, local government and the litany to EDP. Public Admin. Rev. 37, 28-37.
Danziger, J. (1979). The “skill bureaucracy” and intraorganizational control. Sociol. Work Occup. 6, 204-226.
Danziger, J., and Dutton, W. H. (1977). Computers as an innovation in American local governments. Commun. ACM 20, 945-956.
Danziger, J., Dutton, W. H., Kling, R., and Kraemer, K. L. (1980). “Computers, Bureaucrats, and Politicians: High Technology in American Local Governments.” Columbia Univ. Press, New York.
DeBrabander, B., Deschoolmeester, D., Leyder, R., and Vanlommel, E. (1972). The effects of task volume and complexity on computer use. J. Business 45, 56-84.
DeMillo, R. A., Lipton, R. J., and Perlis, A. J. (1979). Social processes and proofs of theorems and programs. Commun. ACM 22, 271-280.
De Roze, B. C., and Nyman, T. H. (1978). The software life cycle - a management and technological challenge in the department of defense. IEEE Trans. Software Eng. SE-4, 309-318.
Dery, D. (1977). The bureaucratic organization of information technology: computers, information systems, and welfare management. Ph.D. Dissertation, Department of Public Policy, University of California, Berkeley.
Downs, A. (1967). A realistic look at the final payoffs from urban data systems. Public Admin. Rev. 27, 204-210.
Dutton, W. H., and Kraemer, K. L. (1978). Management utilization of computers in American local governments. Commun. ACM 21, 206-218.
Dzida, W., Herda, S., and Itzfeldt, W. D. (1978). User-perceived quality of interactive systems. IEEE Trans. Software Eng. SE-4, 270-276.
Edstrom, A. (1977). User influence and the success of MIS projects: A contingency approach. Hum. Relat. 30, 589-607.
Galbraith, J. (1977). “Organization Design.” Addison-Wesley, Reading, Massachusetts.
Gerson, E. M. (1976). On “quality of life.” Am. Sociol. Rev. 41, 793-806.
Gerson, E. M., and Koenig, S. R. (1979). “Information Systems Technology and Organizations: The Impact of Computing on the Division of Labor and Task Organization,” Working Paper. Pragmatica Systems, San Francisco, California.
Gibson, C. (1975). A methodology for implementation research. In “Implementing Operations Research/Management Science” (R. Schultze and D. Slevin, eds.). American Elsevier, New York.
Glass, R. L. (1977). “The Universal Elixir and Other Computing Projects Which Failed.” Computing Trends, Seattle, Washington.
Goodman, P., Pennings, J., et al. (1977). “New Perspectives in Organizational Effectiveness.” Jossey-Bass, San Francisco, California.
Goodman, S. (1979). Soviet computing and technology transfer: An overview. World Politics 31(4), 539-570.
Gottlieb, C. C., and Borodin, A. (1973). “Social Issues in Computing.” Academic Press, New York.
Greenbaum, J. (1976). Division of labor in the computer field. Monthly Rev. 28, 40-55.
Greenberger, M., Crenson, M. A., and Crissey, B. L. (1976). “Models in the Policy Process.” Russell Sage Found., New York.
Hedberg, B., and Mumford, E. (1975). The design of computer systems. In “Human Choice and Computers” (E. Mumford and H. Sackman, eds.), pp. 31-59. North-Holland Publ., Amsterdam.
Herbert, S., and Colton, K. (1978). The implementation of computer-assisted police resource allocation methods. In “Police Computer Technology” (K. Colton, ed.), Lexington Books, Lexington, Massachusetts.
Hiltz, S. R., and Turoff, M. (1978). “Network Nation: Human Communication via Computer.” Addison-Wesley, Reading, Massachusetts.
Jacob, F. (1977). Evolution and tinkering. Science 196, 1161-1166.
Kanter, J. (1977). “Management-Oriented Information Systems,” 2nd ed. Prentice-Hall, Englewood Cliffs, New Jersey.
Keen, P. G. W. (1975). Computer-based decision aids: The evaluation problem. Sloan Manage. Rev. 16, 13-21.
Keen, P. G. W., and Gerson, E. (1977). The politics of software system design. Datamation 23, 80-84.
Keen, P. G. W., and Scott Morton, M. (1978). “Decision Support Systems: An Organizational Perspective.” Addison-Wesley, Reading, Massachusetts.
Kernighan, B. W., and Mashey, J. R. (1979). The UNIX programming environment. Software: Practice and Experience 9, 1-15.
Kling, R. (1973). Toward a person-centered computer technology. Proc. 1973 ACM Natl. Conf.
Kling, R. (1974). Computers and social power. Comput. Soc. 5 (Summer), 6-11.
Kling, R. (1977). The organizational context of user-centered software design. MIS Q. 1 (Winter), 41-52.
Kling, R. (1978a). Information systems and policymaking: Computer technology and organizational arrangements. Telecommun. Policy 2, 22-32.
Kling, R. (1978b). Automated welfare client-tracking and service integration: The political economy of computing. Commun. ACM 21, 484-492.
Kling, R. (1978c). EFT and the quality of life. Proc. 1978 Natl. Comput. Conf. Vol. 47, pp. 191-197.
Kling, R. (1978d). Value conflicts and social choice in electronic fund transfer system development. Commun. ACM 21, 642-656.
Kling, R. (1978e). “The Impacts of Computing on the Work of Managers, Data Analysts, and Clerks,” WP-78-64. Public Policy Research Organization, University of California, Irvine.
Kling, R. (1978f). Information systems as social resources in policy-making. Proc. 1978 Natl. ACM Conf., pp. 666-674.
Kling, R. (1980). Social analyses of computing: Theoretical perspectives in recent empirical research. Comput. Surveys 12 (March 1980) (in press).
Kling, R., and Gerson, E. (1977). The social dynamics of technical innovation in the computing world. Symbolic Interact. 1, 132-146.
Kling, R., and Gerson, E. (1978). Patterns of segmentation and intersection in the computing world. Symbolic Interact. 1, 24-43.
Kling, R., and Scacchi, W. (1978). Assumptions about the social and technical character of production programming environments. “Proceedings of the Irvine Workshop on the Alternatives for the Environment, Certification, and Control of the DoD Common High Order Language.” Department of Information and Computer Science, University of California, Irvine.
Kling, R., and Scacchi, W. (1979a). The DoD common high order programming language effort (DoD-1): What will the impacts be? SIGPLAN Notices 14 (February), 29-43.
Kling, R., and Scacchi, W. (1979b). Recurrent dilemmas of computer use in complex organizations. Proc. 1979 Natl. Comput. Conf. Vol. 48, pp. 107-116.
Kling, R., Scacchi, W., and Crabtree, P. (1978). The social dynamics of instrumental computer use. SIGSOC Bull. 10 (Summer), 9-21.
Kraemer, K. L. (1977). Local government, information systems, and technology transfer. Public Admin. Rev. 37, 368-382.
Kraemer, K. L., and Dutton, W. H. (1977). Technology and urban management: The power payoffs of computing. Admin. Soc. 9, 305-340.
Kraemer, K. L., and King, J. L. (1979). A requiem for USAC. Policy Anal. 5, 313-349.
Kraemer, K. L., Dutton, W. H., and Northrop, A. (1980). “The Management of Information Systems.” Columbia Univ. Press, New York.
Kraft, P. (1977). “Programmers and Managers: The Routinization of Computer Programming in the United States.” Springer-Verlag, Berlin and New York.
Laudon, K. (1974). “Computers and Bureaucratic Reform.” Wiley (Interscience), New York.
Lawler, E. E., and Rhode, J. G. (1976). “Information and Control in Organizations.” Goodyear Publ., Pacific Palisades, California.
Learmonth, G. P., and Merten, A. G. (1978). Operation and evolution of organizational information systems. In “Information Systems Methodology” (G. Bracchi and P. C. Lockeman, eds.), pp. 664-683. Springer-Verlag, Berlin and New York.
Licklider, J. C. R., and Vezza, A. (1978). Applications of information networks. Proc. IEEE 66, 1330-1346.
Lientz, B. P., Swanson, E. B., and Tompkins, G. E. (1978). Characteristics of applications software maintenance. Commun. ACM 21, 466-471.
Lucas, H. (1974). “Toward Creative System Design.” Columbia Univ. Press, New York.
Lucas, H. (1975). “Why Information Systems Fail.” Columbia Univ. Press, New York.
Lucas, H. (1978). Unsuccessful implementation: The case of a computer-based order entry system. Decision Sci. 9, 68-79 and 197-205.
McLaughlin, R. A. (1978). The (mis)use of DP in government agencies. Datamation 24, 147-157.
Mann, F. C., and Williams, L. K. (1960). Observations on the dynamics of change to electronic data processing equipment. Admin. Sci. Q. 5, 217-256.
Markus, M. L. (1979). Understanding information system use in organizations: A theoretical explanation. Ph.D. Dissertation, Department of Organizational Behavior, Case Western Reserve University, Cleveland, Ohio.
Mashey, J. R., and Smith, D. W. (1976). Documentation tools and techniques. Proc. Int. Conf. Software Eng., 2nd, pp. 177-181.
Maynard, J. (1979). A user-driven approach to better user manuals. Computer 12, 72-75.
Mills, H. (1976). Software development. IEEE Trans. Software Eng. SE-4, 265-273.
Mills, H. (1977). Software engineering. Science 195, 1195-1205.
Mowshowitz, A. (1976). “The Conquest of Will: Information Processing in Human Affairs.” Addison-Wesley, Reading, Massachusetts.
Mumford, E. (1972). “Job Satisfaction: A Study of Computer Specialists.” Longmans, Green, New York.
Mumford, E., and Banks, O. (1967). “The Computer and the Clerk.” Routledge & Kegan Paul, London.
Mumford, E., and Pettigrew, A. M. (1976). “Implementing Strategic Decisions.” Longmans, Green, New York.
Mumford, E., and Sackman, H., eds. (1975). “Human Choice and Computers.” North-Holland Publ., Amsterdam.
Myers, C. (1970). “Computers in Knowledge-Based Fields.” MIT Press, Cambridge, Massachusetts.
Noble, D. (1978). Social choice in machine design: The case of automatically controlled machine tools, and a challenge for labor. Politics Soc. 8(3-4), 313-347.
Oliver, P. (1978). Guidelines to software conversion. Proc. 1978 Natl. Comput. Conf. Vol. 47, pp. 877-886.
Palme, J. (1978). How I fought with hardware and software and succeeded. Software: Practice and Experience 8, 77-83.
Perrow, C. (1979). “Complex Organizations: A Critical Essay,” 2nd ed. Scott, Foresman & Co., Glenview, Illinois.
Perry, J. L., and Kraemer, K. L. (1978). Innovation attributes, policy intervention, and the diffusion of computer applications among local governments. Policy Sci. 9, 179-205.
Pettigrew, A. M. (1972). Information control as a power resource. Sociology 6, 187-204.
Pettigrew, A. M. (1973). “The Politics of Organizational Decision-Making.” Tavistock, London.
Riemer, J. W. (1976). Mistakes at work: The social organization of error in building construction work. Sociol. Probl. 23, 255-267.
Robey, D. (1979). MIS effects on managers’ task scope and satisfaction. Proc. 1979 Natl. Comput. Conf. Vol. 48, pp. 391-395.
Rule, J. (1974). “Private Lives and Public Surveillance: Social Control in the Computer Age.” Schocken Books, New York.
Rustin, R., ed. (1972). “Debugging Techniques in Large Systems.” Prentice-Hall, Englewood Cliffs, New Jersey.
Sandewall, E. (1978). Programming in an interactive environment: The “LISP” experience. Comput. Surv. 10, 35-71.
Scacchi, W. (1980). The process of innovation in computing: A study of the social dynamics of computing. Ph.D. Dissertation, Department of Information and Computer Science, University of California, Irvine.
Shneiderman, B. (1978). Improving the human factors aspect of database interactions. ACM Trans. Database Syst. 3, 417-439.
Simon, H. A. (1947). “Administrative Behavior.” Macmillan, New York.
Simon, H. A. (1965). “The Shape of Automation for Men and Management.” Harper, New York.
Simon, H. A. (1973). Applying information technology to organizational design. Public Admin. Rev. 33, 268-278.
Simon, H. A. (1977). “The New Science of Management Decision.” Prentice-Hall, Englewood Cliffs, New Jersey.
Sterling, T. (1979). Consumer difficulties with computerized transactions: An empirical analysis. Commun. ACM 22, 283-289.
Stewart, R. (1971). “How Computers Affect Management.” MIT Press, Cambridge, Massachusetts.
Strassman, P. A. (1973). Managing the evolution to advanced information systems. In “Information Systems Administration” (F. W. McFarland, R. L. Nolan, and D. P. Norton, eds.), pp. 347-363. Holt, Rinehart, and Winston, New York.
Strauss, A. (1978a). “Negotiations.” Jossey-Bass, San Francisco, California.
Strauss, A. (1978b). A social world perspective. In “Studies in Symbolic Interaction” (N. Denzin, ed.), Vol. 1, pp. 99-128. JAI Press, Chicago, Illinois.
Streeter, D. (1974). “The Scientific Process and the Computer.” Wiley (Interscience), New York.
Walston, C. E., and Felix, C. P. (1977). A method of programming measurement and estimation. IBM Syst. J. 16, 54-73.
Whisler, T. L. (1970). “The Impact of Computers on Organizations.” Praeger, New York.
Wolverton, R. W. (1974). The cost of developing large-scale software. IEEE Trans. Comput. C-23, 615-636.
Youngs, E. A. (1974). Human errors in programming. Int. J. Man-Mach. Stud. 6, 361-376.
Author Index

Numbers in italics refer to the pages on which the complete references are listed.
A

Ackerman, A., 62 Ackoff, R., 298, 302, 323 Adams, J. M., 187, 208, 215 Agrawala, A. K., 179, 209, 215 Aho, A., 76, 86, 89, 108 Alagar, V. S., 120, 130, 215 Albrecht, G., 303, 305, 314, 321, 323 Aldenderfer, M. S., 114, 115, 160, 180, 204, 208, 215, 216 Alter, S., 273, 323 Anderberg, M. R., 114, 118, 132, 150, 161, 164, 167, 181, 204, 208, 215 Anderson, G. A., 60, 66, 108 Anderson, T. W., 182, 215 Ando, M., 121, 219 Andrews, D. F., 122, 215 Andrews, H. C., 120, 217 Anne, A., 187, 208, 215 Arabie, P., 124, 226 Argyris, C., 312, 323 Ashcroft, E. A., 101, 108 Ashenhurst, R. L., 246, 247, 248 Astrahan, M. M., 62 Attinger, E. O., 187, 208, 215 Attinger, F. M. L., 187, 215
B

Babb, E., 63 Backer, E., 178, 198, 210, 212, 215 Bahr, G. F., 118, 182, 216 Bailey, D. E., 114, 227 Bailey, T., 120, 175, 221, 225 Bailey, T. A., 192, 193, 194, 215 Baker, F. B., 140, 179, 184, 188, 189, 194, 196, 199, 215, 216, 221 Ball, G. H., 167, 180, 216 Banerjee, J., 2, 59, 64 Banks, O., 268, 270, 274, 290, 312, 313, 314, 326
Barker, W. B., 69, 109 Barkin, S. R., 323 Barnes, G. H., 66, 108 Barnes, G. R., 178, 216 Bartels, P. H., 118, 182, 216 Bartko, J. J., 196, 206, 217, 227 Bartlett, M. S., 130, 216 Barton, G. M., 206, 228 Batcher, K. E., 62, 64, 90, 91, 108 Baudet, G. M., 73, 104, 106, 108 Baum, R. I., 60, 64 Bayer, R., 102, 108 Belady, L. A., 275, 276, 323 Bell, C. G., 69, 111 Bell, D., 312, 323 Benes, V. E., 91, 108 Bennett, R. S., 119, 216 Bentler, P. M., 123, 216 Bentley, J. L., 88, 108, 152, 216 Berlin, B., 113, 225 Bernard, D., 279, 323 Bernstein, P. A., 100, 108 Berra, P. B., 62 Besag, J., 129, 216 Bezdek, J. C., 178, 186, 216 Bhat, M. V., 178, 216 Binder, D. A., 198, 216 Bird, R. M., 61 Blashfield, R. K., 114, 115, 161, 164, 167, 169, 180, 199, 202, 203, 204, 208, 216, 224 Blau, P., 323 Blum, H., 122, 223 Boehm, B. W., 261, 267, 269, 270, 274, 276, 294, 323 Borodin, A., 250, 324 Bouldin, D. W., 186, 217 Boyce, A. J., 206, 216 Braverman, H., 258, 312, 313, 314, 318, 323 Breedlove, D. E., 113, 225 Brent, R. P., 82, 106, 107, 108 Brooks, F. P., 270, 272, 273, 275, 276, 292, 323 Brown, T., 129, 227
Brown, R. M., 66, 108 Brown, W. S., 240, 248 Browning, S., 88, 108 Bryant, R. M., 179, 209, 215 Burch, J., 266, 296, 297, 299, 323 Burr, D. J., 210, 217 Bush, G. A., 62 Busk, P., 123, 221 Butler, G. A., 179, 217 Buxton, J. L., 121, 228
C

Caelli, T., 123, 223 Calhoun, D. W., 118, 182, 216 Calvert, T. W., 121, 217 Canaday, R. H., 60 Carpenter, C. L., 206, 217 Carpenter, W. J., 196, 206, 227 Carpenter, W. T., 206, 217 Carroll, J. D., 122, 217, 222, 226 Carmone, F. J., 122, 220 Casey, R. G., 162, 217 Cavalli-Sforza, L. L., 162, 218 Cave, W. C., 261, 276, 323 Champine, G. A., 279, 323 Chang, C. L., 122, 154, 217, 227 Chang, H., 60 Chansler, R. J., Jr., 107, 109 Chazan, D., 104, 108 Chen, C. K., 120, 217 Chen, H. J., 208, 217 Chen, T. C., 60, 77, 108 Chen, W. F., 63 Chernoff, H., 122, 206, 217 Chien, R. T., 210, 217 Cleveland, W. S., 181, 217 Clifford, H. T., 114, 124, 126, 180, 198, 217, 228 Codd, E. F., 62 Cody, W. J., 233, 248 Cohen, A., 180, 206, 217 Cohen, D., 107, 108 Cohen, J. B., 60 Cole, A. J., 114, 180, 217 Coleman, G. B., 210, 211, 212, 217 Coleman, R., 130, 217 Colton, K., 257, 300, 303, 323, 325 Conover, W. J., 182, 217
Cooley, W. W., 121, 217 Coonen, J. T., 246, 248 Cooper, J. D., 260, 261, 323 Cooper, T., 123, 223 Copeland, G. P., 61, 63 Cormack, R. M., 114, 217 Cornelius, E. T., 123, 223 Cosley, E. S., 60 Coulouris, G. F., 60 Crabtree, P., 325 Cramer, H., 217 Crenson, M. A., 257, 306, 311, 324 Crissey, B. L., 257, 306, 311, 324 Crowther, W. R., 69, 109 Cuff, E. C., 258, 324 Cunningham, K. M., 202, 217 Cunningham, J. P., 122, 217 Curry, D. J., 182, 217
D

D’Andrade, R. G., 154, 217 Danziger, J., 264, 293, 298, 306, 310, 319, 324 Date, C. J., 62 Davies, D. L., 186, 217 Davis, E. W., 60 Davis, L. S., 210, 212, 226 Dawson, W. N., 61 Day, W. H. E., 179, 181, 217, 218 DeBrabander, B., 324 Defays, D., 136, 218 DeFiore, C., 62 Dekker, T. J., 240, 248 DelPresto, P. V., 187, 208, 215 DeMillo, R. A., 274, 324 Dery, D., 298, 324 DeRoze, B. C., 261, 277, 324 Deschoolmeester, D., 324 DeWitt, D. J., 63 Dickson, G. W., 323 Diggle, P. J., 129, 216, 218 Dijkstra, E. W., 105, 109 Dinstein, I., 210, 211, 220 Dobson, A. J., 158, 218 Dorofeyuk, A. A., 114, 218 Doty, K. L., 63 Downs, A., 310, 314, 324 Draguns, J. G., 204, 216
Dubes, R., 120, 121, 123, 126, 160, 167, 169, 175, 180, 196, 197, 198, 199, 204, 206, 213, 214, 218, 221, 223, 225, 226, 227 Duda, R. O., 123, 152, 166, 174, 175, 185, 186, 213, 218, 220 Dunn, J. C., 178, 218 Duran, B. S., 114, 162, 180, 218 Durham, I., 69, 107, 109 Dutton, W. H., 257, 258, 264, 290, 300, 303, 306, 309, 311, 318, 319, 321, 324, 326
E

Eastlake, D. E., III, 60 Edelberg, M., 60 Edie, J., 174, 226 Edstrom, A., 274, 324 Edwards, A. W. F., 162, 218 Eigen, D. J., 179, 218 Eisenstein, B. A., 176, 179, 218 Emam, A., 63 Emery, J. C., 279, 323 Engelman, L., 179, 186, 218 Enright, W. H., 247, 248 Enslein, K., 114, 218 Enslow, P. H., 66, 109 Erdos, P., 127, 218 Estabrook, G. F., 180, 191, 218, 228 Eswaran, K. P., 109 Evans, J. M., 60 Everitt, B. S., 114, 122, 159, 180, 183, 204, 208, 214, 218
F

Falbe, C. M., 323 Farnesworth, D. L., 62 Farris, J. S., 159, 183, 218 Fehlauer, J., 176, 179, 218 Feiler, P. H., 107, 109 Felix, C. P., 280, 327 Ferguson, T. S., 182, 218 Fillenbaum, S., 127, 182, 219, 225 Filsinger, E., 206, 219 Fischer, M. J., 213, 228
Fisher, D. R., 180, 183, 225 Fisher, L., 198, 219 Fisher, L. D., 203, 223 Fisher, P. S., 60 Fleiss, J. L., 114, 219 Floyd, R. W., 101, 109 Flynn, M. J., 68, 79, 109 Forgy, E., 167, 219 Fortier, J. J., 121, 219 Fosdick, L., 247, 248 Foster, M., 72, 86, 87, 109 Frary, J. M., 140, 141, 224 Friedman, H. P., 121, 166, 208, 219 Friedman, J. H., 152, 179, 216, 219 Fromm, F. R., 179, 218 Fu, K. S., 123, 179, 212, 213, 219, 220, 223 Fukunaga, K., 120, 121, 123, 162, 174, 175, 179, 212, 219, 222 Fuller, S. H., 69, 109 Fung, H. S., 61 Funk, G. M., 129, 226
G

Galbraith, J., 253, 257, 258, 299, 324 Garside, R. F., 205, 219 Gates, R., 62 Gerson, E., 251, 255, 257, 258, 264, 267, 271, 278, 318, 319, 320, 324, 325 Gibson, C., 298, 309, 311, 324 Ginzberg, M., 273, 323 Gitman, I., 174, 178, 179, 219 Glass, R. L., 272, 324 Gnanadesikan, R., 180, 196, 206, 208, 217, 219 Goldberg, M., 210, 211, 212, 224 Goldstein, M., 121, 223 Gonzalez, R. C., 123, 210, 219 Good, I. J., 114, 124, 219, 220 Goodall, D. W., 118, 220 Goodman, L. A., 183, 188, 196, 220 Goodman, N., 100, 108 Goodman, P., 319, 324 Goodman, S., 257, 324 Gordon, A. D., 164, 220 Gose, E. E., 174, 224 Gottlieb, C. C., 250, 324 Gowda, K. C., 177, 178, 220 Gower, J. C., 152, 220
332
AUTHOR INDEX
Gray, J. N., 95, 109 Green, P. E., 122, 207,220 Greenbaum, J., 318,324 Greenberger, M., 257, 306, 311,324 Guibas, L. J., 86, 109 Guttman, L., 122,220
H
Hall, A. V., 118, 220
Hall, D. J., 152, 166, 175, 185, 186, 220
Hallin, T. G., 79, 109
Hamdan, M. A., 118, 220
Hammersley, J. M., 120, 130, 220
Haralick, R. M., 210, 211, 212, 220, 226
Harding, E. F., 130, 220
Harman, H. H., 119, 220
Harrison, R. D., 60
Hart, P. E., 123, 174, 175, 185, 213, 218
Hart, R. E., 123, 222
Hartigan, J. A., 161, 179, 180, 186, 191, 208, 218, 220
Haupt, A., 178, 216
Heacox, M. C., 60
Healy, L. D., 61, 63
Heart, F. E., 69, 109
Hedberg, B., 267, 274, 324
Heller, D., 69, 109
Henderson, J. T., 164, 220
Henrichon, E. G., 179, 220
Henry, D. T., 187, 215
Herbert, S., 303, 325
Herda, S., 294, 324
Heydorn, R. P., 121, 227
Higbie, L. C., 60
Hiltz, S. R., 253, 255, 257, 325
Hoagland, A. S., 60
Hoffman, C. P., 62
Holgersson, M., 183, 203, 220
Hollaar, L. A., 60, 62
Hopcroft, J. E., 76, 86, 89, 108
Howarth, R. J., 122, 220
Hsiao, D. K., 2, 59, 60, 61, 64
Hubert, L. J., 123, 127, 132, 137, 140, 179, 180, 183, 184, 188, 189, 190, 191, 192, 198, 199, 202, 206, 216, 220, 221, 226
Huffman, D. A., 152, 166, 175, 185, 186, 220
Hull, T. E., 247, 248
Hummel, R. A., 212, 226

I
Ichikawa, T., 73, 110
Itzfeldt, W. D., 294, 324
Ivie, E. L., 60

J
Jackson, K. R., 247, 248
Jacobs, D., 248
Jacobs, F., 287, 325
Jacovkis, P. M., 204, 224
Jain, A. K., 120, 121, 123, 126, 160, 167, 175, 180, 198, 199, 204, 206, 210, 218, 225, 226
Janowitz, M. F., 114, 221
Jardine, C. J., 198, 221
Jardine, N., 114, 134, 143, 155, 156, 180, 195, 198, 199, 221
Jarvis, R. A., 177, 179, 210, 221
Jensen, E. D., 66, 108
Jensen, R. E., 162, 221
Jino, M., 60
Johnson, S. C., 142, 158, 221
Johnston, B., 175, 221
Jones, A. K., 69, 107, 109
Juliussen, J. E., 61

K
Kain, R. Y., 60
Kak, A. C., 210, 226
Kamenskii, V. S., 122, 221
Kaminuma, T., 121, 221
Kanal, L. N., 123, 221
Kannan, K., 64
Kanter, J., 251, 297, 299, 325
Karonski, M., 180, 221
Karp, R. M., 89, 109
Kasai, T., 213, 224
Kato, M., 66, 108
Katz, J. O., 174, 221
Kautz, W. H., 86, 109, 110
Keen, P. G. W., 255, 270, 272, 273, 279, 316, 325
Keller, R. M., 101, 109
Kelly, F. P., 129, 221
Kelly, G. L., 210, 212, 220
Kemeny, J. G., 118, 222
Kendall, D. G., 130, 220
Kendall, M. G., 113, 124, 180, 222
Kendell, R. E., 204, 222
Kernighan, B. W., 269, 276, 278, 292, 325
Kerr, D. S., 64
Kettenring, J. R., 180, 196, 206, 208, 217, 219
Kiev, A., 206, 222
Killough, G. S., 127, 128, 223
King, J. L., 266, 310, 311, 326
Kittler, J., 174, 179, 199, 222
Klahr, D., 122, 222
Kleinman, A., 209, 227
Kling, R., 251, 252, 255, 257, 258, 259, 264, 266, 267, 268, 270, 271, 275, 276, 277, 278, 290, 292, 293, 300, 303, 306, 308, 309, 310, 311, 312, 313, 314, 315, 318, 319, 320, 321, 322, 324, 325
Knoll, R. L., 122, 227
Knuth, D. E., 73, 89, 91, 102, 109
Koch, V. L., 140, 141, 222, 224
Koenig, S. R., 264, 324
Koontz, W. L. G., 162, 174, 175, 179, 212, 222
Kosaraju, S. R., 86, 109
Kraemer, K. L., 257, 258, 266, 290, 300, 303, 309, 310, 311, 318, 319, 321, 324, 325, 327
Kraft, P., 261, 318, 326
Krishna, G., 177, 178, 220
Kruskal, J. B., 122, 123, 130, 202, 203, 222
Kruskal, W. H., 183, 188, 196, 220
Krzanowski, W. J., 120, 222
Kuck, D. J., 66, 87, 91, 92, 108, 109
Kuiper, F. K., 203, 222, 223
Kulikowski, C. A., 121, 228
Kung, H. T., 66, 69, 72, 76, 77, 79, 82, 86, 87, 88, 89, 91, 92, 93, 94, 95, 97, 98, 100, 103, 104, 105, 106, 107, 108, 109, 110, 111
L
Lachenbruch, P. A., 121, 223
Lambert, P. F., 121, 228
Lamport, L., 101, 105, 109, 110
Lance, G. N., 124, 125, 132, 147, 148, 198, 223, 228
Landis, D., 123, 227
Landwehr, J. M., 180, 196, 206, 217, 219
Langdon, G. G., Jr., 63
Laudon, K., 251, 253, 255, 257, 265, 266, 300, 301, 309, 321, 326
Lawler, E. E., 291, 302, 314, 326
Lawrie, D. H., 92, 110
Learmonth, G. P., 261, 277, 326
Lee, R. C. T., 122, 154, 217, 223, 227
Lefkovitch, L. P., 140, 223
Lehman, P. L., 82, 86, 109, 110
Leilich, H. O., 64
Leiserson, C. E., 76, 82, 89, 110
Levine, D. M., 122, 123, 223
Levine, M. D., 174, 178, 179, 219
Levitt, K. N., 86, 109, 110
Lewinson, T. M., 180, 228
Lewis, P. M., II, 100, 111
Leyder, R., 324
Li, H. F., 66, 79, 111
Licklider, J. C. R., 263, 266, 297, 305, 326
Lientz, B. P., 276, 291, 326
Lin, C. S., 60, 63
Linde, R., 62
Lindman, H., 123, 223
Ling, R. F., 127, 128, 139, 140, 180, 190, 191, 223
Lingoes, J. C., 122, 123, 223
Lintz, J., 211, 223
Liou, J. I., 213, 214, 223
Lipovski, G. J., 61, 62, 63
Lipton, R. J., 274, 324
Liu, J.-W. S., 60
Liu, T. S., 174, 228
Lohnes, P. R., 114, 217
Lord, R. D., 120, 130, 223
Lorie, R. A., 109
Love, H. H., 61
Lowenthal, E. I., 60
Lu, S. Y., 213, 219, 223
Lucas, H., 267, 268, 270, 273, 274, 326
Lum, V. W., 60
Lum, V. Y., 77, 108
M
MacCallum, R. C., 123, 223
McClain, J. O., 188, 223, 224
McCreight, E., 102, 108
McGregor, D. R., 61
McKinley, W., 323
McLaughlin, R. A., 311, 326
MacQueen, J. B., 164, 224
McQuitty, L. L., 140, 141, 181, 224
Madnick, S. E., 60, 61
Magnuski, H. S., 176, 224
Mahtab, M. A., 182, 226
Mann, F. C., 312, 326
Markus, M. L., 258, 303, 319, 321, 326
Maronna, R., 204, 224
Marriot, F. H. C., 180, 224
Martin, A. J., 105, 109
Maryanski, F. J., 60
Mashey, J. R., 275, 326
Masri, A. E., 62
Matula, D. W., 136, 180, 224
Maynard, J., 275, 326
Mead, C. A., 72, 88, 110, 111
Merten, A. G., 261, 277, 326
Metropolis, N., 246, 247, 248
Mezzich, J. E., 195, 205, 206, 224
Milligan, G. W., 200, 224
Mills, H., 269, 270, 274, 326
Minsky, N., 61
Miranker, W. L., 65, 105, 110
Mitchell, R. W., 60
Mohr, J. M., 179, 209, 215
Mojena, R., 200, 201, 202, 203, 224
Moore, R., 247, 248
Morey, L. C., 204, 224
Moulder, R., 62
Mountford, M. D., 180, 182, 187, 224
Mowshowitz, A., 313, 314, 326
Mucciardi, A. N., 174, 224
Mukhopadhyay, A., 61, 73, 110
Mumford, E., 267, 268, 270, 274, 290, 293, 312, 313, 314, 324, 326
Myers, C., 312, 326
N
Nagy, G., 162, 217
Narendra, P. M., 162, 174, 179, 210, 211, 212, 222, 224
Nerlove, S. B., 122, 123, 226
Newell, A., 107, 110
Nicholls, P., 122, 218
Niemann, H., 122, 224
Ng, F. K., 2, 59, 64
Nguyen, H. B., 63
Noble, D., 313, 314, 326
Nolan, R. L., 279, 323
Northouse, R. A., 179, 218
Northrop, A., 257, 300, 303, 321, 326
Nyman, T. H., 261, 277, 324
O
Odell, P. L., 114, 162, 180, 218
Ogilvie, J. C., 202, 217
Okuda, T., 213, 224
Oleinick, P. N., 93, 110
Oliver, E., 62
Oliver, P., 277, 278, 326
Olsen, D. R., 120, 219
Orford, J. D., 180, 224
Ornstein, S. M., 69, 109
Owicki, S., 101, 111
Ozkarahan, E. A., 63
P
Page, R. L., 176, 225
Palme, J., 275, 292, 327
Panda, D. P., 210, 225
Papadimitriou, C. H., 94, 95, 97, 100, 111
Parhami, B., 61
Parker, J. L., 61
Parzen, E., 130, 225
Patrick, E. A., 123, 177, 179, 221, 225
Paulus, E., 204, 225
Paykel, E. S., 206, 225
Payne, G. C. F., 258, 324
Pease, M. C., 90, 111
Peay, E. R., 136, 225
Peleg, S., 87, 111
Peng, T., 62
Pennings, J., 319, 324
Perlis, A. J., 324
Perrow, C., 258, 319, 327
Perry, J. L., 266, 327
Pettigrew, A. M., 259, 267, 268, 274, 309, 326, 327
Pettis, K., 120, 225
Prim, R. C., 152, 225
Pykett, C. E., 122, 225
R
Ralston, A., 114, 218
Ramamoorthy, C. V., 66, 79, 111
Ramsay, J. O., 123, 225
Rand, W. M., 203, 225
Rao, M. R., 162, 225
Rao, V. R., 188, 207, 220, 223, 224
Rapoport, A., 127, 182, 219, 225
Raven, P. H., 113, 225
Relles, D. A., 181, 217
Rem, M., 88, 110
Renyi, A., 127, 218
Rhode, J. G., 291, 302, 314, 326
Rice, J. R., 236, 247, 248
Riddell, R. J., 127, 225
Riemer, J. W., 294, 327
Ripley, B. D., 129, 221, 225
Roberts, D. C., 60, 62
Robertson, G., 107, 110
Robey, D., 313, 327
Robinson, J. T., 98, 100, 106, 107, 110, 111
Rogers, D. J., 180, 191, 228
Rohlf, F. J., 152, 159, 174, 180, 183, 204, 221, 225
Rohmer, J., 62
Romney, A. K., 122, 123, 226
Rosenfeld, A., 87, 111, 123, 210, 212, 219, 225, 226
Rosenkrantz, D. J., 100, 111
Ross, G. J. S., 152, 220, 226
Roth, M., 205, 219
Rothnie, J. B., 100, 108
Rubin, J., 180, 208, 219, 226
Rudolph, J. A., 61
Rule, J., 306, 321, 327
Ruspini, E. H., 178, 226
Rustin, R., 294, 327
Ryder, J. L., 60
S
Sackman, H., 270, 326
Salisbury, A. B., 261, 276, 323
Samadi, B., 102, 111
Sameh, A. H., 66, 111
Sammon, J. W., 122, 226
Sandewall, E., 280, 327
Sauer, W. J., 206, 219
Saunders, R., 129, 226
Savitt, D. A., 61
Scacchi, W., 263, 270, 271, 275, 276, 277, 278, 280, 287, 290, 292, 293, 303, 319, 320, 325, 327
Scelza, D. A., 107, 109
Schachter, B. J., 210, 212, 226
Schissler, L. R., 60
Schkolnick, M., 102, 108
Schneiderman, B., 295, 327
Scholten, C. S., 105, 109
Schultz, J., 189, 190, 191, 192, 199, 221
Schultz, J. R., 127, 180, 226
Schuster, S. A., 63
Schwans, K., 107, 109
Schwartzmann, D. H., 120, 226
Schwarz, P., 107, 109
Sclove, S. L., 175, 186, 226
Scott, A. J., 175, 180, 186, 226
Scott, R. H., 279, 323
Sebestyen, G. S., 121, 174, 226
Sevcik, K. C., 63
Shaffer, E., 175, 199, 226
Shanley, R. J., 182, 226
Shapiro, L. G., 211, 226
Shaw, A. C., 213, 226
Shepard, R. N., 122, 123, 124, 217, 222, 226
Short, R. D., 179, 219
Shutt, J. J., 62
Sibson, R., 114, 132, 134, 143, 155, 156, 157, 180, 195, 198, 199, 221, 226, 227
Silverman, B. W., 129, 225, 227
Simon, H. A., 263, 266, 280, 297, 299, 306, 309, 327
Simon, J. C., 123, 227
Simonett, D. S., 211, 223
Slagle, J. R., 122, 154, 223, 227
Slotnick, D. L., 66, 108
Smith, A. R., III, 86, 111
Smith, D. C. P., 63
Smith, D. W., 275, 326
Smith, J. M., 63
Smith, K. C., 63
Smith, S. P., 196, 197, 210, 221, 227
Sneath, P. H. A., 114, 117, 180, 182, 186, 187, 199, 227
Snell, J. L., 118, 222
Sokal, R. R., 113, 114, 117, 180, 182, 227
Solomon, H., 121, 219
Song, S. W., 105, 110
Sreenivasan, K., 209, 227
Srivastava, J. N., 120, 227
Stearns, R. E., 100, 111
Steele, G. L., Jr., 105, 111
Steffens, E. F. M., 105, 109
Stelhorn, W. H., 61, 64
Stenson, H. H., 122, 227
Stephenson, W., 114, 124, 126, 180, 217
Sterling, T., 295, 301, 321, 327
Stevenson, D., 73, 92, 108, 110
Stewart, R., 309, 327
Stiege, G., 64
Stillman, N., 62
Stokes, R. A., 66, 108
Stone, H. S., 66, 90, 111
Stonebraker, M., 63, 100, 111
Strassman, P. A., 260, 309, 327
Strauss, A., 319, 327
Strauss, D. J., 126, 129, 227
Strauss, J. S., 196, 206, 217, 227
Streeter, D., 297, 327
Su, S. K. W., 61, 62, 63
Sutherland, I. E., 72, 111
Swanson, E. B., 276, 291, 326
Symons, M. J., 175, 180, 186, 226
T
Takekawa, T., 121, 221
Tanaka, E., 213, 224
Thomason, M. G., 123, 219
Thompson, C. D., 86, 91, 107, 109, 111
Thompson, H. K., Jr., 122, 227
Thompson, R. G., 61
Todd, S., 61
Tompkins, G. E., 276, 291, 326
Toombs, D., 61
Torgerson, W. S., 123, 227
Torn, A. I., 179, 227
Tou, J. T., 121, 227
Toussaint, G. T., 123, 227
Tracy, P., 323
Traiger, I. L., 109
Troop, R. E., 61
Trunk, G. V., 120, 227
Tryon, R. C., 114, 227
Tsokos, C. P., 118, 220
Tu, J. C., 61
Tukey, J. W., 179, 219
Tung, C., 60, 77, 108
Turner, M. E., 182, 227
Turoff, M., 253, 255, 257, 325
Tusera, D., 62
Tzeng, D. C. S., 123, 227

U
Uhlenbeck, G. E., 127, 225
Ullman, J. D., 76, 86, 89, 108

V
Vanlommel, E., 324
Van Ness, J. W., 198, 219
Van Rijsbergen, C. J., 181, 227
Van Ryzin, J., 114, 227
Vezza, A., 263, 266, 297, 305, 326
Vidal, J. J., 120, 226
Vinod, H. D., 162, 228
Voigt, R. G., 79, 111
von Neumann, J., 86, 111
W
Wagner, R. A., 213, 228
Waksman, A., 86, 109
Wallentine, V. E., 60
Walston, C. E., 280, 327
Ward, J. H., Jr., 150, 228
Watanabe, S., 121, 221, 228
Watson, J. K., 61, 62
Weeks, D. G., 123, 216
Wehr, L. A., 60
Weiss, J., 122, 224
Whisler, T. L., 313, 314, 327
White, A. A., 206, 208
White, R. F., 180, 228
Wied, G. L., 118, 182, 216
Wilf, H. S., 114, 218
Wilks, S. S., 121, 130, 228
Williams, G. W., 206, 228
Williams, L. K., 312, 326
Williams, W. T., 124, 125, 132, 147, 148, 223, 228
Wintz, P., 210, 219
Wirth, M., 180, 191, 228
Wish, M., 122, 217
Wishart, D., 132, 134, 161, 164, 167, 199, 202, 228
Wolf, D. E., 152, 166, 175, 185, 186, 220
Wolfe, J. H., 175, 185, 186, 206, 207, 228
Wolverton, R. W., 216, 327
Won, H., 206, 228
Wong, A. K. C., 174, 228
Woodbury, M. A., 122, 227
Worthy, R. M., 61
Wright, W. E., 198, 228
Wulf, W. A., 69, 111
Y
Yau, S. S., 61
Young, F. W., 122, 228
Youngs, E. A., 294, 327
Z
Zadeh, L. A., 178, 228
Zahl, S., 204, 228
Zahn, C. T., 132, 152, 154, 162, 176, 177, 178, 179, 214, 228
Zeidler, H. C., 64
Zubin, J., 114, 219
Zucker, S. W., 212, 226
Subject Index

A
Agglomerative approach, 131
Algorithms
  asynchronous iterative, 104-105
  for asynchronous multiprocessors, 93-106
  chaotic relaxation, 104
  complete-link, 142-146
  concurrent data base systems and, 94-102
  correctness and efficiency of, 93
  "implicit form" of, 242
  intrinsic dimensionality, 130
  with large module granularity, 92
  mode-seeking, 174-175
  multidimensional scaling, 122
  parallel, see Parallel algorithms
  shuffle-exchange networks in, 90-91
  single-link, 142-146
  for special problems, 102-106
  synchronized iterative, 104
Argument set, 38
Associative array approach
  for formatted data bases, 26-28
  in text-processing computers, 13-15
Asynchronous multiprocessors, algorithms for, 93-106
Asymptotic analysis
  breakdown of, 245
  error estimate and, 246
  truncation error and, 244
Attribute-value pairs, 32
Automated information systems, see also Computer; Computer-based system; Computing; Data-base computers
  decision making and, 296
  in policy making and strategic planning, 305-309
  supervision and, 314-316

B
Backward error analysis, 237
Balance of power, information systems and, 310
Band matrix, LU decomposition of, 84-86
BCTRY clustering program, 161
Binary search trees, access to, 103
BMDP clustering program, 161
Boolean expressions, in user query, 3
Boolean matrix, transitive closure of, 8
B trees, access to, 102-103
Bucket, structure memory and, 37
Bucket samples, 38
Budget monitoring systems, 303-304

C
Cascaded cells, 16
Cellular logic approach
  cell as memory in, 29
  fixed-head vs. moving-head technology in, 28-29
  formatted data bases and, 22-26
  logic-per-track implementation and, 2
  in text-processing computer, 15-17
Central limit theorem, 130
Chaotic relaxation algorithm, 104
Character stream, matching patterns with, 10-11
Child segment types, 41
Classification
  agglomerative vs. divisive, 126
  exclusive vs. nonexclusive, 125
  extrinsic vs. intrinsic, 125
  hierarchical vs. partitional, 125-126
  monothetic vs. polythetic, 126
Class politics analysts, Marxian view of, 254-255
Client-tracking welfare information system, 251-252
CLUSTAN program, 161
Cluster(s)
  defined, 134
  distinctiveness of, 186-187
  individual validity of, 190-195
  "straggly," 134
Cluster analysis, see also Clustering; Cluster methods; Cluster techniques; Cluster validity
  applications of, 115, 208-215
  data representation in, 117-123
  data set dimensionality in, 119
  defined, 114, 123-124
  in digital image segmentation, 210-212
  history of, 114-115
  "jackknifing" in, 196
  pattern recognition and, 123
  procedures in, 116
  in psychiatry, 204-205
Clustering, see also Cluster(s); Cluster analysis; Cluster hierarchy
  characteristic matrix in, 201-202
  defined, 134
  disjoint, 134
  in exclusive classification, 125
  fuzzy, 178
  graph-theoretic, 176-177
  hierarchical, see Hierarchical clustering
  mathematical formalism for, 198
  near-neighbor, 177-178
  of new data base, 45-46
  null hypothesis and, 180, 187, 192
  partitional, see Partitional clustering
  square-error, 163-174
  syntactic, 212-214
Clustering algorithms
  comparative analysis in, 197-207
  mode-seeking, 174
Clustering methodologies, in exploratory data analysis, 113-215
Clustering methods
  application of, 205-215
  comparative analysis of, 197-206
  evolution of, 199
  graph theory and, 132
  hierarchical, 202-203
  practical comparisons in, 204-207
  set theory and, 132
  simulation comparisons in, 200-204
  in syntactic pattern recognition, 212-214
Clustering parameter, 129
Clustering problems, classification of, 124
Clustering programs, 161
Clustering tendency problem, 126-131, 186
Cluster membership, in square-error clustering, 165
Cluster profiles, validity of, 192-195
Cluster replicability, 195-196
Cluster stability, 195-197
Cluster validity, 179-197
  comparative analysis and, 200
  ideal structures in, 181
  imposed structure in, 181
  null hypothesis in, 180-181
  objective procedures in, 179-180
CODASYL data base model, 7, 9, 19, 58
Complete-link algorithm, 136, 142-146
Complete-link dendrogram, 146
Complete-link method, 136
Complex organizations, see also Organization
  computer system use in, 290-295
  computing knowledge distribution in, 292
Computer, data base, see Data base computer
Computer analysis, perspectives on, 253-257
Computer analysts, unit of analysis and, 259
Computer-based systems, political power and, 311-312; see also Computer system; Computing; Data base computer
Computer science research organization, see CSRO
Computer system, see also Automated information systems; Computing system life cycle
  cluster analysis in, 209-210
  in complex organizations, 290-295
  cost/benefit studies in, 320
  future development of, 319-322
  integration of in work, 316-317
  job characteristics of, 312-314
  managers' benefits from, 313-314
  as money or time saver, 301
  organizational power and, 309-312
  requirement specifications for, 267-269
  six perspectives on, 317-319
  work function and, 312, 316
  workload characterization in, 209-210
Computer system use
  commitment to, 293-294
  in complex organizations, 290-295
  in decision making, 296-312
  discretion in, 290-292
  mistakes in, 294-295
  span of control in, 291
Computer technology, advances in, 7
Computer upgrading
  goal in, 6
  hardware back-end solution in, 5-6
  intelligent controller solution in, 5
  progress in, 6-8
  software back-end approach in, 4-5
Computing, see also Computer system
  in complex organization, 249-322
  impact of on organizational life, 295-317
  organizational behavior and, 250-251
  as social action, 249-322
Computing expertise, access to, 292-293
Computing in organizations
  perspectives on, 253-261
  UMIS operations and, 253-254
Computing package, defined, 271
Computing system life cycle, 261-290
  CSRO case in, 280-286
  documentation in, 275-276
  evaluation in, 278-280
  implementation in, 272-274
  maintenance and modification in, 276-278
  management goals and, 272-273
  NEAT formatting program and, 288-290
  requirement specification in, 267-269
  software design in, 269-272
  system initiation and adoption in, 263-266
  system selection in, 266-267
  testing and validation in, 274-275
Computing technology, managers' benefits from, 313
Concurrency controls, of parallel algorithms, 67
Concurrent data base, reorganization of, 105-106
Connectedness method, 134
Consistent data base, defined, 94
Content-addressable capability, 9
Content addressing, by mass memory, 31
Control bus, 23
Cophenetic correlation coefficient, 182-183, 205
Cophenetic matrix, 157
Correct transaction, defined, 94
Cost/benefit studies, in computer adoption, 320
CPCC, see Cophenetic correlation coefficient
CPU (central processing unit) cycles, 3-4
CRAY-1, pipeline processing and, 79
CSRO (computer science research organization), 280-286
  NEAT formatting program and, 281-284
  six perspectives and text processing at, 286-288
Current position, in data base, 43
D
Data base(s)
  consistent, 94
  current position in, 43
  segments in, 41
  swollen, 298
  two categories of, 8-9
Data base architecture, cellular logic approach to, 22-26
Data base command and control processor, 35-38
Data base computers, 1-59, see also Computer system; Computing; Text processing computer
  for formatted data bases, 19-40
  language development for, 21
  new and old environments in, 43
  one-character-at-a-time comparison in
  overall architecture of, 30
  as replacement for existing software, 40-58
  research progress in, 7-8
  text-processing, 9-19
  text-processing commands in, 11
  two kinds of, 8-40
Data base independence, 20
Data base management, 40
Data base reorganization, 105-106
Data base software, penalties for, 21, see also Numerical software; Software
Data base storage, in old and new environments, 46-47
Data base storage ratio, 47-49
Data base system
  components of, 1
  "double jeopardy" situation in, 3
  90-10 rule in, 1-4
Data base transformation, 41-48
Data normalization, 130
Data representation, in cluster analysis, 117-123
DBCCP (data base command and control processor), 35-38
Decision making
  automated information systems and, 296
  computer system use in, 296-312
  management control in, 302-305
  required data in, 297-300
  routine operations in, 300-302
Dendrogram, 131
  complete-link, 146
  group theory and, 155-156
  in hierarchical clustering, 155-158
  proximity, 141
  single-link, 160
  threshold, 136
Dependency graph, 96
Digital image segmentation, 210-212
Discriminant analysis, linear projection in, 121
Disjoint clustering, 134
Dissimilarity coefficient, 156
Dissimilarity index, 118
DL/I calls, execution of, 48-50
"Double jeopardy" situation, in data base system, 3

E
Eighty-X data set, 149-150
Errors
  in numerical software, 233-239
  studies of, 295
Euclidean metric, 174
Exclusive classification, vs. nonexclusive, 125
Exploratory data analysis, clustering methodology in, 113-215

F
Fast Fourier transform, 90
Finite-impulse response filtering, 74-75
Finite-state automata
  defined, 17
  multiple, 20
  in text-processing computer, 17-19
Floating-point computation, model of, 240
Floating-point error analysis, 244
Floating-point numbers, 233
Floating-point standards, for microprocessors, 246
Forcing pass, in square-error clustering, 165
FORGY algorithm, 167-174
FORGY/ISODATA algorithm, flow chart for, 165
Formatted data bases
  associative array approach in, 26-28
  cellular logic approach in, 22-26
  data base computers for, 19-40
  functionally specialized approach in, 28-40
Forward error analysis, 237
Functionally specialized approach, in associative array approach, 28-40
Functional specialization, defined, 29
Fuzzy clustering, 178, 186
Fuzzy ISODATA clustering, 178
Fuzzy partitions, 178
Fuzzy set concept, 178
G
Get-next call, 43
Get-next-within-parent call, 43
Get-unique call, 43
  execution of, 48
Global fit criteria, for partitions, 184-190
Goodman-Kruskal statistic, 183
Graph-theoretic clustering, 176-177
Graph theory, hierarchical clustering in, 132-141
Group average method, 149-150

H
"Hardware back-end" solution, in computer upgrading, 5-6
HDAM (hierarchical direct access method), 51-53
Health, Education, and Welfare Department, U.S., 252
HIDAM (hierarchical indexed direct access method), 51-53
Hierarchical clustering, 131-161, 202-203
  advantages of, 159-160
  in algorithmic terms, 141-154
  dendrograms in, 155-158
  graph theory and, 132-141
  mathematical characterization of, 155-159
  reliability of, 141
  single-link, 134-136
  use of, 159-161
  validity of, 141, 190-191
Hierarchical data base
  of airline routes, 23
  contents at end of revolution, 25
  storage layout of, 24
Hierarchical direct access method, 51-53
Hierarchical indexed direct access method, 51-53
Hierarchical model, of data base computer, 7, 19
Hierarchy, global fit of, 182-184
Higher performance computer, upgrading to, 4
Hill-climbing pass, cluster membership and, 165
Human relations analysts, automated systems and, 253-254

I
Identifier, in mass memory, 32
IMS (Information Management System) data base, 41-44, 48, see also Management Information System
IMS segment, conversion of, 44-45
Index term, 35-36
Information Management System, 41-48
Information systems, balance of power and, 310-311, see also Automated information systems
Input character stream, loading and shifting of, 10
"Intelligent controller" approach, in computer upgrading, 5
Intel-MRI Company, 59
Interactionist analysts
  computing technologies and, 254
  system improvement and, 277
International Computers Ltd., 59
Intrinsic dimensionality algorithm, 130
I/O (input/output) arbitration, in cellular logic approach, 23
I/O routines, 2-4
ISODATA algorithm, 167-174
  user-supplied algorithms for, 169
ISODATA clustering methods, 205-206
ISODATA statistics, 170-173
J
Jackknifing, in cluster analysis, 196
Job characteristics, computing system and, 312-314
Johnson complete-link scheme, 143, 147

K
Keyword
  bucket and, 37
  index term for, 36
Keyword transformation unit, 37
K-MEANS clustering method, 205-206
KXU, see Keyword transformation unit
L
LANDSAT data, clustering technique in, 212
Large-scale integration technology, 69-72
Linear projection
  defined, 120
  in discriminant analysis, 121
LU (lower/upper) decomposition, of matrix, 84
M
Magnetic bubble memories, 27
Mahalanobis metric, 174
Management control, computer systems and, 302-305
Management Information System, computing system implementation in, 272-274, see also Information Management System
Mass memory
  design of, 29-30
  identifier in, 32
  query conjunction and, 32, 35
Mass memory information processor, 31
Matrix, LU decomposition of, 84-86
Matrix multiplication, for parallel algorithms, 82-84
MDSCAL, see Multidimensional scaling algorithm
MDSR, see Multiple-data-stream and single-request computer
Memory unit-processor pairs, 36
Memory units, in structure memory processing unit, 39
Microprocessor technology, advances in
MIMD, see Multiple-instruction stream multiple-data stream parallel computers
Minimum spanning tree, 150-154, 214
Mixture resolving, 175-176
MM, see Mass memory
MMIP, see Mass memory information processor
Model-based analysis, policy making and, 306-308
Mode-seeking algorithms, 174
Moving-head disks
  in data-base store, 28
  for high-value read out, 30-31
  technological advances in, 7
MST, see Minimum spanning tree
Multidimensional scaling algorithms, 122
Multiple-data-stream and single-request computer, 7
Multiple finite-state automata, 20
Multiple-instruction stream multiple-data stream parallel computers, 68-71
Municipal policy making, automated information systems in, 308
N
n x d pattern matrix, 117
n x n matrix, 118, 191
Near-neighbor method, 134, 177-178
NEAT formatting program, 281-289
New data base, clustering in, 45-46
New environment, data base storage in, 43, 46
90-10 rule, 1-4
  hardware solutions and, 21
  mass memory and, 30
Nonlinear algorithms, 121-122
Nonserialization approach, 100-102
NT-SYS clustering program, 161
Null hypothesis, 128-129
  in cluster validity, 180, 187, 192
Numerical analysis, 229
Numerical analysts, 230, 232
Numerical codes, 230
Numerical error, 232
Numerical software, 229-248
  as "alchemy," 245-247
  asymptotic analysis and, 244-246
  attributes of, 231
  automatic, 231-232
  characteristics of, 232
  difficulty of, 234-237
  error in, 233-237, 244
  future of, 247-248
  reliability in, 238-239
  as science, 241-244
  stability analysis and, 245
Numerician
  concerns of, 241-242
  defined, 229
O
Odd-even transposition sort, 73-74
Old environment, data base storage in, 43, 47
On-line data base storage, design considerations for, 29-34, see also Data-base storage; Data-base system
On-line I/O routines, 2-4
On-line mass memory, 33
On-line memory technology, changes in, 8
Operating Systems Incorporated, 59
Optimization criterion, in square-error clustering, 166
Organization
  behavior of, 250-251
  complex, 290-295
  computing development and use in, 259-260
  social life of, 258-259
Organizational behavior, 250-251
  structural analyst and, 259
Organizational life
  computing impact on, 295-317, 321
  political character of, 254
Organizational power, managerial control and, 309-312
OSIRIS clustering program, 161
P
Parallel algorithms
  asynchronous iterative, 104
  communication geometry and, 68
  concurrency controls and, 67-68
  literature on, 65-66
  matrix multiplication for, 82-84
  module granularity of, 68
  one-dimensional linear arrays and, 73-81
  and pipeline processing of arithmetic operations, 79-81
  serialization approach in, 95-100
  shuffle-exchange networks in, 90-93
  space of, 66-72
  structure of, 65-107
  synchronized iterative, 104
  for synchronous parallel computers, 72-93
  systolic search tree for, 89
  taxonomy for, 70-72
  three dimensions of, 67-68
  tree structure of, 87-88
  two-dimensional arrays for, 81-87
  types of, 71
Parallel algorithm space, examples in, 71
Parent position, in IMS, 43
Parent segment types, 41
Partition
  global fit criteria for, 184-190
  "good" vs. "poor," 162
Partitional clustering, 161-215, see also Clustering
  mixture resolving in, 175-176
  validity of, 191-192
Pattern, exclusive vs. nonexclusive, 125
Pattern matrix, 124, 128
Pattern recognition
  defined, 123
  terminology in, 117
Pattern space, 117
Phenogram, 131
Picture description language, 213
Pipeline processing, of arithmetic operations, 79-81
PLD, see Picture description language
Police manpower allocation, computer-based, 304
Policy making, automated information systems in, 305-309
Political power, computer-based systems in, 311
Priority queue, parallel algorithms and, 76-77
Probability profile, 193
Programmers, "deskilling" of, 318
Projection, defined, 120
Projection algorithm, nonlinear, 121-122
Proximity dendrograms, 141-142
Proximity graphs, 132-133
Proximity matrix, 118, 124, 127
Psychiatry, clustering analysis in, 204-205
Q
Q-mode clustering, 117
Quadratic assignment, 189
Query conjunction, 32
Query resolver, 10
Query translation, 41, 48-58
R
Random graph hypothesis, 127-128
Random position hypothesis, 129-130
Rank order proximity matrix, 140
Rank order typal analysis, 140
Rational analysts
  "better management" criterion for, 2
  computer-based technologies and, 251
  decision making and, 297
  swollen data bases and, 298
  in system initiation and adoption, 263
Record body, mass memory and, 31-32
Recurrence evaluation, 77-79
Recursive filtering, 77-79
Request facility, 2
R-mode clustering, 117
Round-off errors, minimizing of, 246-247
Routine operations, decision making in, 300-302

S
Schedule, in two-phase transaction method, 95-96
Search algorithms, 88-89, see also Parallel algorithms
Search trees, concurrent access to, 102-103
Segment search arguments, 42-43, 49
Segment type, data base and, 41
Semantic information approach, 100-102
Sequence field, in IMS segment, 44
Serialization approach, in parallel algorithms, 95-100
Serial schedule, defined, 96
Shift registers, 73
Shuffle-exchange networks, 90-91
SIMD (single-instruction stream multiple-data stream) parallel computers, 68-87
Single-link dendrogram, 160, see also Dendrogram
Single-link hierarchy, 153-154
Single-link method, 134-136
  minimum-spanning tree and, 150
SM, see Structure memory
SMIP, see Structure memory information processor
Software, numerical, see Numerical software
SUBJECT INDEX Software design, in computing system life cycle, 269-272 Software development, organization and, 258 Software system variables, 275 Span of control, in management-oriented computer use, 291 Sperry-Univac Company, 59 Square arrays, two-dimensional, 86-87 Square-error algorithms, 167-174 Square-error clustering, 163-174 SSA, see Segment search arguments STAR-100, floating-point adders in, 79 Storage analysis, for data base computers, 46-48 Tee also Data base storage Straggly clusters, 134 Strategic planning, automated information systems in, 305-309 Structural analysts decision making and, 298 rational perspective and, 3 18 UMIS system and, 253 Structure memory, 35 bucket and, 37 design of, 34-38 Structure memory information processor contents of, 39 design of, 38 Supervision, automated information systems and, 314-316 Swollen data bases, in decision making, 298 Symbolic identifier. in data base computer, 44 Synchronized iterative algorithm, 104 Synchronous parallel computers. algorithms for, 72-93, see d s o Parallel algorithms Syntactic clustering, 212-214 System initiation and adoption, need in, 264 Systole, defined, 70 Systolic machines, algorithms for, 69-70 Systolic search tree, 89
T
Tektronix, 59
Term comparator, 10
Text-processing computers, 9-10
  associative array approach in, 13-15
  cellular logic approach in, 15-17
  finite-state automata approach in, 17-19
  parallel discrete comparators approach in, 12-13
Threshold dendrogram, 136
Threshold graph, 132
Track information processor, 31-33
Tracks-in-parallel read out, 30
Transaction
  concurrent, 99
  correct, 94
  read and validation phases in, 98
Transaction execution time analysis, 50-58
Transaction execution time ratio, 56-57
Transactions, classification and analysis of, 52-58
Tree breadth, 52
Tree structure
  arithmetic expressions and evaluations in, 89
  for parallel algorithms, 87-90
Two-dimensional arrays, algorithms using, 81-87
Two-phase transaction method, 95-98
  serial schedule and, 97
U
Ultrametric inequality, 155, 158
UMIS (automated information system), 252-254, 257
UNIVAC 1108, workload model for, 209
UPGMA (unweighted pair groups using metric averages), 149, 183
  hierarchical clustering algorithm and, 211
User query, predicates in, 2-3

V
Validation phase, defined, 98, see also Cluster validity
Very large-scale integration technology, 89

W
Weighted Levenshtein Distance, 213
Welfare information system, client-tracking, 251-252
WESCO
  decision making and, 299-300, 316
  simulation programs of, 265
WLD, see Weighted Levenshtein Distance
Work function
  computer integration in, 316-317
  computer systems in, 316-317
Contents of Previous Volumes

Volume 1
General-Purpose Programming for Business Applications
CALVIN C. GOTLIEB
Numerical Weather Prediction
NORMAN A. PHILLIPS
The Present Status of Automatic Translation of Languages
YEHOSHUA BAR-HILLEL
Programming Computers to Play Games
ARTHUR L. SAMUEL
Machine Recognition of Spoken Words
RICHARD FATEHCHAND
Binary Arithmetic
GEORGE W. REITWIESNER

Volume 2
A Survey of Numerical Methods for Parabolic Differential Equations
JIM DOUGLAS, JR.
Advances in Orthonormalizing Computation
PHILIP J. DAVIS AND PHILIP RABINOWITZ
Microelectronics Using Electron-Beam-Activated Machining Techniques
KENNETH R. SHOULDERS
Recent Developments in Linear Programming
SAUL I. GASS
The Theory of Automata, a Survey
ROBERT MCNAUGHTON

Volume 3
The Computation of Satellite Orbit Trajectories
SAMUEL D. CONTE
Multiprogramming
E. F. CODD
Recent Developments of Nonlinear Programming
PHILIP WOLFE
Alternating Direction Implicit Methods
GARRETT BIRKHOFF, RICHARD S. VARGA, AND DAVID YOUNG
Combined Analog-Digital Techniques in Simulation
HAROLD F. SKRAMSTAD
Information Technology and the Law
REED C. LAWLOR

Volume 4
The Formulation of Data Processing Problems for Computers
WILLIAM C. MCGEE
All-Magnetic Circuit Techniques
DAVID R. BENNION AND HEWITT D. CRANE
Computer Education
HOWARD E. TOMPKINS
Digital Fluid Logic Elements
H. H. GLAETTLI
Multiple Computer Systems
WILLIAM A. CURTIN

Volume 5

The Role of Computers in Election Night Broadcasting
JACK MOSHMAN
Some Results of Research on Automatic Programming in Eastern Europe
WLADYSLAW TURSKI
A Discussion of Artificial Intelligence and Self-organization
GORDON PASK
Automatic Optical Design
ORESTES N. STAVROUDIS
Computing Problems and Methods in X-Ray Crystallography
CHARLES L. COULTER
Digital Computers in Nuclear Reactor Design
ELIZABETH CUTHILL
An Introduction to Procedure-Oriented Languages
HARRY D. HUSKEY

Volume 6

Information Retrieval
CLAUDE E. WALSTON
Speculations Concerning the First Ultraintelligent Machine
IRVING JOHN GOOD
Digital Training Devices
CHARLES R. WICKMAN
Number Systems and Arithmetic
HARVEY L. GARNER
Considerations on Man versus Machine for Space Probing
P. L. BARGELLINI
Data Collection and Reduction for Nuclear Particle Trace Detectors
HERBERT GELERNTER

Volume 7

Highly Parallel Information Processing Systems
JOHN C. MURTHA
Programming Language Processors
RUTH M. DAVIS
The Man-Machine Combination for Computer-Assisted Copy Editing
WAYNE A. DANIELSON
Computer-Aided Typesetting
WILLIAM R. BOZMAN
Programming Languages for Computational Linguistics
ARNOLD C. SATTERTHWAIT
Computer Driven Displays and Their Use in Man/Machine Interaction
ANDRIES VAN DAM

Volume 8
Time-shared Computer Systems
THOMAS N. PYKE, JR.
Formula Manipulation by Computer
JEAN E. SAMMET
Standards for Computers and Information Processing
T. B. STEEL, JR.
Syntactic Analysis of Natural Language
NAOMI SAGER
Programming Languages and Computers: A Unified Metatheory
R. NARASIMHAN
Incremental Computation
LIONELLO A. LOMBARDI

Volume 9
What Next in Computer Technology
W. J. POPPELBAUM
Advances in Simulation
JOHN MCLEOD
Symbol Manipulation Languages
PAUL W. ABRAHAMS
Legal Information Retrieval
AVIEZRI S. FRAENKEL
Large Scale Integration-an Appraisal
L. M. SPANDORFER
Aerospace Computers
A. S. BUCHMAN
The Distributed Processor Organization
L. J. KOCZELA

Volume 10
Humanism, Technology, and Language
CHARLES DECARLO
Three Computer Cultures: Computer Technology, Computer Mathematics, and Computer Science
PETER WEGNER
Mathematics in 1984-The Impact of Computers
BRYAN THWAITES
Computing from the Communication Point of View
E. E. DAVID, JR.
Computer-Man Communication: Using Computer Graphics in the Instructional Process
FREDERICK P. BROOKS, JR.
Computers and Publishing: Writing, Editing, and Printing
ANDRIES VAN DAM AND DAVID E. RICE
A Unified Approach to Pattern Analysis
ULF GRENANDER
Use of Computers in Biomedical Pattern Recognition
ROBERT S. LEDLEY
Numerical Methods of Stress Analysis
WILLIAM PRAGER
Spline Approximation and Computer-Aided Design
J. H. AHLBERG
Logic per Track Devices
D. L. SLOTNICK

Volume 11
Automatic Translation of Languages Since 1960: A Linguist's View
HARRY H. JOSSELSON
Classification, Relevance, and Information Retrieval
D. M. JACKSON
Approaches to the Machine Recognition of Conversational Speech
KLAUS W. OTTEN
Man-Machine Interaction Using Speech
DAVID R. HILL
Balanced Magnetic Circuits for Logic and Memory Devices
R. B. KIEBURTZ AND E. E. NEWHALL
Command and Control: Technology and Social Impact
ANTHONY DEBONS

Volume 12
Information Security in a Multi-User Computer Environment
JAMES P. ANDERSON
Managers, Deterministic Models, and Computers
G. M. FERRERO DIROCCAFERRERA
Uses of the Computer in Music Composition and Research
HARRY B. LINCOLN
File Organization Techniques
DAVID C. ROBERTS
Systems Programming Languages
R. D. BERGERON, J. D. GANNON, D. P. SHECHTER, F. W. TOMPA, AND A. VAN DAM
Parametric and Nonparametric Recognition by Computer: An Application to Leukocyte Image Processing
JUDITH M. S. PREWITT

Volume 13
Programmed Control of Asynchronous Program Interrupts
RICHARD L. WEXELBLAT
Poetry Generation and Analysis
JAMES JOYCE
Mapping and Computers
PATRICIA FULTON
Practical Natural Language Processing: The REL System as Prototype
FREDERICK B. THOMPSON AND BOZENA HENISZ THOMPSON
Artificial Intelligence-The Past Decade
B. CHANDRASEKARAN

Volume 14
On the Structure of Feasible Computations
J. HARTMANIS AND J. SIMON
A Look at Programming and Programming Systems
T. E. CHEATHAM, JR., AND JUDY A. TOWNLEY
Parsing of General Context-Free Languages
SUSAN L. GRAHAM AND MICHAEL A. HARRISON
Statistical Processors
W. J. POPPELBAUM
Information Secure Systems
DAVID K. HSIAO AND RICHARD I. BAUM

Volume 15
Approaches to Automatic Programming
ALAN W. BIERMANN
The Algorithm Selection Problem
JOHN R. RICE
Parallel Processing of Ordinary Programs
DAVID J. KUCK
The Computational Study of Language Acquisition
LARRY H. REEKER
The Wide World of Computer-Based Education
DONALD BITZER

Volume 16
3-D Computer Animation
CHARLES A. CSURI
Automatic Generation of Computer Programs
NOAH S. PRYWES
Perspectives in Clinical Computing
KEVIN C. O'KANE AND EDWARD A. HALUSKA
The Design and Development of Resource-Sharing Services in Computer Communication Networks: A Survey
SANDRA A. MAMRAK
Privacy Protection in Information Systems
REIN TURN
Volume 17

Semantics and Quantification in Natural Language Question Answering
W. A. WOODS
Natural Language Information Formatting: The Automatic Conversion of Texts to a Structured Data Base
NAOMI SAGER
Distributed Loop Computer Networks
MING T. LIU
Magnetic Bubble Memory and Logic
TIEN CHI CHEN AND HSU CHANG
Computers and the Public's Right of Access to Government Information
ALAN F. WESTIN

Volume 18

Image Processing and Recognition
AZRIEL ROSENFELD
Recent Progress in Computer Chess
MONROE M. NEWBORN
Advances in Software Science
M. H. HALSTEAD
Current Trends in Computer-Assisted Instruction
PATRICK SUPPES
Software in the Soviet Union: Progress and Problems
S. E. GOODMAN