1
Fundamentals of Computer Design
And now for something completely different. Monty Python’s Flying Circus
1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology Trends
1.4 Cost, Price and their Trends
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together: Performance and Price-Performance
1.8 Another View: Power Consumption and Efficiency as the Metric
1.9 Fallacies and Pitfalls
1.10 Concluding Remarks
1.11 Historical Perspective and References
Exercises
1.1
Introduction
Computer technology has made incredible progress in the roughly 55 years since the first general-purpose electronic computer was created. Today, less than a thousand dollars will purchase a personal computer that has more performance, more main memory, and more disk storage than a computer bought in 1980 for $1 million. This rapid rate of improvement has come both from advances in the technology used to build computers and from innovation in computer design. Although technological improvements have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution; but beginning in about 1970, computer designers became largely dependent upon integrated circuit technology. During the 1970s, performance continued to improve at about 25% to 30% per year for the mainframes and minicomputers that dominated the industry. The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology more closely than the less integrated mainframes and minicomputers led to a higher rate of improvement: roughly 35% growth per year in performance.
This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors. In addition, two significant changes in the computer marketplace made it easier than ever before to be commercially successful with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture. These changes made it possible to successfully develop a new set of architectures, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques, the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations). The combination of architectural and organizational enhancements has led to 20 years of sustained growth in performance at an annual rate of over 50%. Figure 1.1 shows the effect of this difference in performance growth rates. The effect of this dramatic growth rate has been twofold. First, it has significantly enhanced the capability available to computer users. For many applications, the highest performance microprocessors of today outperform the supercomputer of less than 10 years ago. Second, this dramatic rate of improvement has led to the dominance of microprocessor-based computers across the entire range of the computer design. Workstations and PCs have emerged as major products in the computer industry. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, have been replaced by servers made using microprocessors. 
Mainframes have been almost completely replaced with multiprocessors consisting of small numbers of off-the-shelf microprocessors. Even high-end supercomputers are being built with collections of microprocessors. Freedom from compatibility with old designs and the use of microprocessor technology led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This renaissance is responsible for the higher performance growth shown in Figure 1.1, a rate that is unprecedented in the computer industry. This rate of growth has compounded so that by 2001, the difference between the highest-performance microprocessors and what would have been obtained by relying solely on technology, including improved circuit design, is about a factor of fifteen. In the last few years, the tremendous improvement in integrated circuit capability has allowed older, less-streamlined architectures, such as the x86 (or IA-32) architecture, to adopt many of the innovations first pioneered in the RISC designs. As we will see, modern x86 processors basically consist of a front end that fetches and decodes x86 instructions and maps them into simple ALU, memory access, or branch operations that can be executed on a RISC-style pipelined processor. Beginning at the end of the 1990s, as transistor counts soared, the overhead in transistors of interpreting the more complex x86 architecture became negligible as a percentage of the total transistor count of a modern microprocessor.
[Figure 1.1 plot: SPECint rating (y-axis, 0 to 350) versus year (x-axis, 1984 to 1995) for the SUN4, MIPS R2000, MIPS R3000, IBM Power1, HP 9000, IBM Power2, and DEC Alpha, with trend lines of 1.35x per year and 1.58x per year.]
FIGURE 1.1 Growth in microprocessor performance since the mid 1980s has been substantially higher than in earlier years as shown by plotting SPECint performance. This chart plots relative performance as measured by the SPECint benchmarks with base of one being a VAX 11/780. (Since SPEC has changed over the years, performance of newer machines is estimated by a scaling factor that relates the performance for two different versions of SPEC, e.g., SPEC92 and SPEC95.) Prior to the mid 1980s, microprocessor performance growth was largely technology driven and averaged about 35% per year. The increase in growth since then is attributable to more advanced architectural and organizational ideas. By 2001 this growth leads to about a factor of 15 difference in performance. Performance for floating-point-oriented calculations has increased even faster.
Change this figure as follows: 1. The y-axis should be labeled “Relative Performance.” 2. Plot only even years. 3. The following data points should be changed/added: a. 1992 136 HP 9000; 1994 145 DEC Alpha; 1996 507 DEC Alpha; 1998 879 HP 9000; 2000 1582 Intel Pentium III. 4. Extend the lower line by increasing by 1.35x each year.
This text is about the architectural ideas and accompanying compiler improvements that have made this incredible growth rate possible. At the center of this dramatic revolution has been the development of a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text. Sustaining the recent improvements in cost and performance will require continuing innovations in computer design, and the authors believe such innovations will be founded on this quantitative approach to computer design. Hence, this book has been written not only to document this design style, but also to stimulate you to contribute to this progress.
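The factor-of-fifteen gap quoted for Figure 1.1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the two trend lines (1.35x and 1.58x per year) diverge around 1984, the first year on the figure's axis, and the function names are hypothetical.

```python
import math

def relative_performance(rate_per_year: float, years: float) -> float:
    """Performance multiple after compounding rate_per_year (e.g. 1.35) for years."""
    return rate_per_year ** years

def years_to_gap(fast: float, slow: float, gap: float) -> float:
    """Years until the faster curve leads the slower one by a factor of gap."""
    return math.log(gap) / math.log(fast / slow)

TECHNOLOGY_ONLY = 1.35  # lower trend line in Figure 1.1
OBSERVED = 1.58         # upper trend line in Figure 1.1

# Assumption: the curves start together in 1984, so 2001 is 17 years later.
for years in (5, 10, 17):
    ratio = (relative_performance(OBSERVED, years)
             / relative_performance(TECHNOLOGY_ONLY, years))
    print(f"after {years:2d} years the gap is about {ratio:.1f}x")

print(f"years to reach a 15x gap: {years_to_gap(OBSERVED, TECHNOLOGY_ONLY, 15):.1f}")
```

Under these assumptions the gap after 17 years comes out a little under 15x, which is consistent with the "about a factor of fifteen" stated in the text.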
1.2
The Changing Face of Computing and the Task of the Computer Designer
In the 1960s, the dominant form of computing was on large mainframes, machines costing millions of dollars and stored in computer rooms with multiple operators overseeing their support. Typical applications included business data processing and large-scale scientific computing. The 1970s saw the birth of the minicomputer, a smaller-sized machine initially focused on applications in scientific laboratories, but rapidly branching out as the technology of timesharing (multiple users sharing a computer interactively through independent terminals) became widespread. The 1980s saw the rise of the desktop computer based on microprocessors, in the form of both personal computers and workstations. The individually owned desktop computer replaced timesharing and led to the rise of servers, computers that provided larger-scale services such as reliable, long-term file storage and access, larger memory, and more computing power. The 1990s saw the emergence of the Internet and the world-wide web, the first successful handheld computing devices (personal digital assistants or PDAs), and the emergence of high-performance digital consumer electronics, varying from video games to set-top boxes.
These changes have set the stage for a dramatic change in how we view computing, computing applications, and the computer markets at the beginning of the millennium. Not since the creation of the personal computer more than twenty years ago have we seen such dramatic changes in the way computers appear and in how they are used. These changes in computer use have led to three different computing markets, each characterized by different applications, requirements, and computing technologies.
Desktop Computing
The first, and still the largest market in dollar terms, is desktop computing. Desktop computing spans from low-end systems that sell for under $1,000 to high-end, heavily configured workstations that may sell for over $10,000. Throughout this range in price and capability, the desktop market tends to be driven to optimize price-performance. This combination of performance (measured primarily in terms of compute performance and graphics performance) and price of a system is what matters most to customers in this market and hence to computer designers. As a result, desktop systems often are where the newest, highest-performance microprocessors appear, as well as where recently cost-reduced microprocessors and systems appear first (see Section 1.4 for a discussion of the issues affecting the cost of computers). Desktop computing also tends to be reasonably well characterized in terms of applications and benchmarking, though the increasing use of web-centric, interactive applications poses new challenges in performance evaluation. As we discuss in Section 1.9 (Fallacies and Pitfalls), the PC portion of the desktop space seems recently to have become focused on clock rate as the direct measure of performance, and this focus can lead to poor decisions by consumers as well as by designers who respond to this predilection.
Servers
As the shift to desktop computing occurred, the role of servers to provide larger-scale and more reliable file and computing services grew. The emergence of the world-wide web accelerated this trend due to the tremendous growth in demand for web servers and the growth in sophistication of web-based services. Such servers have become the backbone of large-scale enterprise computing, replacing the traditional mainframe. For servers, different characteristics are important. First, availability is critical. We use the term availability, which means that the system can reliably and effectively provide a service.
This term is to be distinguished from reliability, which says that the system never fails. Parts of large-scale systems unavoidably fail; the challenge in a server is to maintain system availability in the face of component failures, usually through the use of redundancy. This topic is discussed in detail in Chapter 6. Why is availability crucial? Consider the servers running Yahoo!, taking orders for Cisco, or running auctions on eBay. Obviously such systems must be operating seven days a week, 24 hours a day. Failure of such a server system is far more catastrophic than failure of a single desktop. Although it is hard to estimate the cost of downtime, Figure 1.2 shows one analysis, assuming that downtime is distributed uniformly and does not occur solely during idle times. As we can see, the estimated costs of an unavailable system are high, and the estimated costs in
Figure 1.2 are purely lost revenue and do not account for the cost of unhappy customers!
Application                   Cost of downtime   Annual losses (millions of $) with downtime of
                              per hour           1%              0.5%            0.1%
                              (thousands of $)   (87.6 hrs/yr)   (43.8 hrs/yr)   (8.8 hrs/yr)
Brokerage operations          $6,450             $565            $283            $56.5
Credit card authorization     $2,600             $228            $114            $22.8
Package shipping services     $150               $13             $6.6            $1.3
Home shopping channel         $113               $9.9            $4.9            $1.0
Catalog sales center          $90                $7.9            $3.9            $0.8
Airline reservation center    $89                $7.9            $3.9            $0.8
Cellular service activation   $41                $3.6            $1.8            $0.4
On-line network fees          $25                $2.2            $1.1            $0.2
ATM service fees              $14                $1.2            $0.6            $0.1
FIGURE 1.2 The cost of an unavailable system is shown by analyzing the cost of downtime (in terms of immediately lost revenue), assuming three different levels of availability. This assumes downtime is distributed uniformly. This data is from Kembel [2000] and was collected and analyzed by Contingency Planning Research.
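The annual-loss columns of Figure 1.2 follow directly from the hourly cost and the fraction of the year the system is down. The sketch below reproduces that arithmetic under the figure's own assumption that downtime is spread uniformly over an 8760-hour year; the function and constant names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def annual_loss_millions(cost_per_hour_thousands: float,
                         downtime_fraction: float) -> float:
    """Annual lost revenue in millions of dollars.

    cost_per_hour_thousands: cost of one hour of downtime, in thousands of dollars.
    downtime_fraction: fraction of the year the system is unavailable (0.01 = 1%).
    """
    hours_down = downtime_fraction * HOURS_PER_YEAR
    return cost_per_hour_thousands * hours_down / 1000.0

# Two rows taken from Figure 1.2 (hourly cost in thousands of dollars).
applications = {
    "Brokerage operations": 6450,
    "Credit card authorization": 2600,
}

for name, hourly_cost in applications.items():
    losses = [annual_loss_millions(hourly_cost, f) for f in (0.01, 0.005, 0.001)]
    print(name, [f"${loss:.1f}M" for loss in losses])
```

Note that 0.1% of a year is 8.76 hours, so the figure's "8.8 hrs/yr" column heading is a rounded value.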
A second key feature of server systems is an emphasis on scalability. Server systems often grow over their lifetime in response to a growing demand for the services they support or an increase in functional requirements. Thus, the ability to scale up the computing capacity, the memory, the storage, and the I/O bandwidth of a server is crucial. Lastly, servers are designed for efficient throughput. That is, the overall performance of the server, in terms of transactions per minute or web pages served per second, is what is crucial. Responsiveness to an individual request remains important, but overall efficiency and cost-effectiveness, as determined by how many requests can be handled in a unit time, are the key metrics for most servers. (We return to the issue of performance and assessing performance for different types of computing environments in Section 1.5.)
Embedded Computers
Embedded computers, the name given to computers lodged in other devices where the presence of the computer is not immediately obvious, are the fastest growing portion of the computer market. The range of application of these devices goes from simple embedded microprocessors that might appear in everyday machines (most microwaves and washing machines, most printers, most networking switches, and all cars contain such microprocessors) to handheld digital devices (such as palmtops, cell phones, and smart cards) to video games and digital set-top boxes. Although in some applications (such as palmtops) the computers are programmable, in many embedded applications the only programming occurs in connection with the initial loading of the application code or a later software upgrade of that application. Thus, the application can usually be carefully tuned for the processor and system; this process sometimes includes limited use of assembly language in key loops, although time-to-market pressures and good software engineering practice usually restrict such assembly language coding to a small fraction of the application. This use of assembly language, together with the presence of standardized operating systems and a large code base, has meant that instruction set compatibility has become an important concern in the embedded market. Simply put, as in other computing applications, software costs are often a large factor in the total cost of an embedded system.
Embedded computers have the widest range of processing power and cost, from low-end 8-bit and 16-bit processors that may cost less than a dollar, to full 32-bit microprocessors capable of executing 50 million instructions per second that cost under $10, to high-end embedded processors (that can execute a billion instructions per second and cost hundreds of dollars) for the newest video game or for a high-end network switch. Although the range of computing power in the embedded computing market is very large, price is a key factor in the design of computers for this space. Performance requirements do exist, of course, but the primary goal is often meeting the performance need at a minimum price, rather than achieving higher performance at a higher price.
Often, the performance requirement in an embedded application is a real-time requirement. A real-time performance requirement is one where a segment of the application has an absolute maximum execution time that is allowed. For example, in a digital set-top box the time to process each video frame is limited, since the processor must accept and process the next frame shortly.
In some applications, a more sophisticated requirement exists: the average time for a particular task is constrained as well as the number of instances when some maximum time is exceeded. Such approaches (sometimes called soft real-time) arise when it is possible to occasionally miss the time constraint on an event, as long as not too many are missed. Real-time performance tends to be highly application dependent. It is usually measured using kernels either from the application or from a standardized benchmark (see the EEMBC benchmarks described in Section 1.5). With the growth in the use of embedded microprocessors, a wide range of benchmark requirements exist, from the ability to run small, limited code segments to the ability to perform well on applications involving tens to hundreds of thousands of lines of code. Two other key characteristics exist in many embedded applications: the need to minimize memory and the need to minimize power. In many embedded applications, the memory can be a substantial portion of the system cost, and memory size is important to optimize in such cases. Sometimes the application is expected to fit totally in the memory on the processor chip; other times the application needs to fit totally in a small off-chip memory. In any event, the importance of memory size translates to an emphasis on code size, since data size is dictated by
the application. As we will see in the next chapter, some architectures have special instruction set capabilities to reduce code size. Larger memories also mean more power, and optimizing power is often critical in embedded applications. Although the emphasis on low power is frequently driven by the use of batteries, the need to use less expensive packaging (plastic versus ceramic) and the absence of a fan for cooling also limit total power consumption. We examine the issue of power in more detail later in the chapter.
Another important trend in embedded systems is the use of processor cores together with application-specific circuitry. Often an application’s functional and performance requirements are met by combining a custom hardware solution together with software running on a standardized embedded processor core, which is designed to interface to such special-purpose hardware. In practice, embedded problems are usually solved by one of three approaches:
1. using a combined hardware/software solution that includes some custom hardware and typically a standard embedded processor,
2. using custom software running on an off-the-shelf embedded processor, or
3. using a digital signal processor and custom software. (Digital signal processors are processors specially tailored for signal processing applications. We discuss some of the important differences between digital signal processors and general-purpose embedded processors in the next chapter.)
Most of what we discuss in this book applies to the design, use, and performance of embedded processors, whether they are off-the-shelf microprocessors or microprocessor cores, which will be assembled with other special-purpose hardware. The design of special-purpose application-specific hardware and the detailed aspects of DSPs, however, are outside of the scope of this book. Figure 1.3 summarizes these three classes of computing environments and their important characteristics.
The Task of a Computer Designer
The task the computer designer faces is a complex one: Determine what attributes are important for a new machine, then design a machine to maximize performance while staying within cost and power constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging. In the past, the term computer architecture often referred only to instruction set design. Other aspects of computer design were called implementation, often
Feature                          Desktop                   Server                        Embedded
Price of system                  $1,000–$10,000            $10,000–$10,000,000           $10–$100,000 (including network routers at the high-end)
Price of microprocessor module   $100–$1,000               $200–$2000 (per processor)    $0.20–$200
Microprocessors sold per year    150,000,000               4,000,000                     300,000,000 (32-bit and 64-bit processors only)
(estimates for 2000)
Critical system design issues    Price-performance,        Throughput, availability,     Price, power consumption,
                                 graphics performance      scalability                   application-specific performance
FIGURE 1.3 A summary of the three computing classes and their system characteristics. The total number of embedded processors sold in 2000 is estimated to exceed 1 billion, if you include 8-bit and 16-bit microprocessors. In fact, the largest selling microprocessor of all time is an 8-bit microcontroller sold by Intel! It is difficult to separate the low end of the server market from the desktop market, since low-end servers–especially those costing less than $5,000–are essentially no different from desktop PCs. Hence, up to a few million of the PC units may be effectively servers.
insinuating that implementation is uninteresting or less challenging. The authors believe this view is not only incorrect, but is even responsible for mistakes in the design of new instruction sets. The architect’s or designer’s job is much more than instruction set design, and the technical hurdles in the other aspects of the project are certainly as challenging as those encountered in doing instruction set design. This challenge is particularly acute at present, when the differences among instruction sets are small and there are three rather distinct application areas.
In this book the term instruction set architecture refers to the actual programmer-visible instruction set. The instruction set architecture serves as the boundary between the software and hardware, and that topic is the focus of Chapter 2. The implementation of a machine has two components: organization and hardware. The term organization includes the high-level aspects of a computer’s design, such as the memory system, the bus structure, and the design of the internal CPU (central processing unit, where arithmetic, logic, branching, and data transfer are implemented). For example, two processors with nearly identical instruction set architectures but very different organizations are the Pentium III and Pentium 4. Although the Pentium 4 has new instructions, these are all in the floating point instruction set. Hardware is used to refer to the specifics of a machine, including the detailed logic design and the packaging technology of the machine. Often a line of machines contains machines with identical instruction set architectures and nearly identical organizations, but they differ in the detailed hardware implementation. For example, the Pentium II and Celeron are nearly identical, but offer different clock rates and different memory systems, making the Celeron more effective for low-end computers.
In this book the word architecture is intended to cover all three aspects of computer design—instruction set architecture, organization, and hardware.
Computer architects must design a computer to meet functional requirements as well as price, power, and performance goals. Often, they also have to determine what the functional requirements are, and this can be a major task. The requirements may be specific features inspired by the market. Application software often drives the choice of certain functional requirements by determining how the machine will be used. If a large body of software exists for a certain instruction set architecture, the architect may decide that a new machine should implement an existing instruction set. The presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the machine competitive in that market. Figure 1.4 summarizes some requirements that need to be considered in designing a new machine. Many of these requirements and features will be examined in depth in later chapters.
Functional requirements              Typical features required or supported
Application area                     Target of computer
  General purpose desktop            Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio (Ch 2,3,4,5)
  Scientific desktops and servers    High-performance floating point and graphics (App A,B)
  Commercial servers                 Support for databases and transaction processing, enhancements for reliability and availability. Support for scalability. (Ch 2,7)
  Embedded computing                 Often requires special support for graphics or video (or other application-specific extension). Power limitations and power control may be required. (Ch 2,3,4,5)
Level of software compatibility      Determines amount of existing software for machine
  At programming language            Most flexible for designer; need new compiler (Ch 2,8)
  Object code or binary compatible   Instruction set architecture is completely defined (little flexibility), but no investment needed in software or porting programs
Operating system requirements        Necessary features to support chosen OS (Ch 5,7)
  Size of address space              Very important feature (Ch 5); may limit applications
  Memory management                  Required for modern OS; may be paged or segmented (Ch 5)
  Protection                         Different OS and application needs: page vs. segment protection (Ch 5)
Standards                            Certain standards may be required by marketplace
  Floating point                     Format and arithmetic: IEEE 754 standard (App A), special arithmetic for graphics or signal processing
  I/O bus                            For I/O devices: Ultra ATA, Ultra SCSI, PCI (Ch 6)
  Operating systems                  UNIX, PalmOS, Windows, Windows NT, Windows CE, CISCO IOS
  Networks                           Support required for different networks: Ethernet, Infiniband (Ch 7)
  Programming languages              Languages (ANSI C, C++, Java, Fortran) affect instruction set (Ch 2)
FIGURE 1.4 Summary of some of the most important functional requirements an architect faces. The left-hand column describes the class of requirement, while the right-hand column gives examples of specific features that might be needed. The right-hand column also contains references to chapters and appendices that deal with the specific issues.
Once a set of functional requirements has been established, the architect must try to optimize the design. Which design choices are optimal depends, of course, on the choice of metrics. The changes in the computer applications space over the last decade have dramatically changed the metrics. Although desktop computers remain focused on optimizing cost-performance as measured by a single user, servers focus on availability, scalability, and throughput cost-performance, and embedded computers are driven by price and often power issues. These differences and the diversity and size of these different markets lead to fundamentally different design efforts. For the desktop market, much of the effort goes into designing a leading-edge microprocessor and into the graphics and I/O system that integrate with the microprocessor. In the server area, the focus is on integrating state-of-the-art microprocessors, often in a multiprocessor architecture, and designing scalable and highly available I/O systems to accompany the processors. Finally, in the leading edge of the embedded processor market, the challenge lies in adopting the high-end microprocessor techniques to deliver most of the performance at a lower fraction of the price, while paying attention to demanding limits on power and sometimes a need for high-performance graphics or video processing.
In addition to performance and cost, designers must be aware of important trends in both the implementation technology and the use of computers. Such trends not only impact future cost, but also determine the longevity of an architecture. The next two sections discuss technology and cost trends.
1.3
Technology Trends
If an instruction set architecture is to be successful, it must be designed to survive rapid changes in computer technology. After all, a successful new instruction set architecture may last decades; the core of the IBM mainframe has been in use for more than 35 years. An architect must plan for technology changes that can increase the lifetime of a successful computer. To plan for the evolution of a machine, the designer must be especially aware of rapidly occurring changes in implementation technology. Four implementation technologies, which change at a dramatic pace, are critical to modern implementations:
• Integrated circuit logic technology: Transistor density increases by about 35% per year, quadrupling in somewhat over four years. Increases in die size are less predictable and slower, ranging from 10% to 20% per year. The combined effect is a growth rate in transistor count on a chip of about 55% per year. Device speed scales more slowly, as we discuss below.
• Semiconductor DRAM (dynamic random-access memory): Density increases by between 40% and 60% per year, quadrupling in three to four years. Cycle time has improved very slowly, decreasing by about one-third in 10 years. Bandwidth per chip increases about twice as fast as latency decreases. In addition, changes to the DRAM interface have also improved the bandwidth; these are discussed in Chapter 5.
• Magnetic disk technology: Recently, disk density has been improving by more than 100% per year, quadrupling in two years. Prior to 1990, density increased by about 30% per year, doubling in three years. It appears that disk technology will continue the faster density growth rate for some time to come. Access time has improved by one-third in 10 years. This technology is central to Chapter 6, and we discuss the trends in greater detail there.
• Network technology: Network performance depends both on the performance of switches and on the performance of the transmission system; both latency and bandwidth can be improved, though recently bandwidth has been the primary focus. For many years, networking technology appeared to improve slowly: for example, it took about 10 years for Ethernet technology to move from 10 Mb to 100 Mb. The increased importance of networking has led to a faster rate of progress, with 1 Gb Ethernet becoming available about five years after 100 Mb. The Internet infrastructure in the United States has seen even faster growth (roughly doubling in bandwidth every year), both through the use of optical media and through the deployment of much more switching hardware.
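The growth rates in the list above can be converted into doubling and quadrupling times, and back again, with a short calculation. The following is a rough sketch with hypothetical function names; it assumes smooth compound growth, so its results are approximations to the rounded figures quoted in the text (for example, 30% per year doubles in a bit under three years).

```python
import math

def years_to_multiply(annual_growth: float, factor: float) -> float:
    """Years needed to grow by factor at a compound rate (0.35 means 35% per year)."""
    return math.log(factor) / math.log(1.0 + annual_growth)

def combined_growth(*annual_growths: float) -> float:
    """Compound several independent annual growth rates into a single rate."""
    total = 1.0
    for growth in annual_growths:
        total *= 1.0 + growth
    return total - 1.0

def implied_annual_growth(factor: float, years: float) -> float:
    """Annual rate implied by improving by factor over a span of years."""
    return factor ** (1.0 / years) - 1.0

# Transistor density at 35% per year: time to quadruple.
print(f"logic density quadruples in {years_to_multiply(0.35, 4):.1f} years")
# Density (35%) combined with die size growth (taking 15% as a midpoint of 10% to 20%).
print(f"transistor count grows {combined_growth(0.35, 0.15):.0%} per year")
# DRAM density at 40% and at 60% per year.
print(f"DRAM quadruples in {years_to_multiply(0.60, 4):.1f} to "
      f"{years_to_multiply(0.40, 4):.1f} years")
# Ethernet moving from 10 Mb to 100 Mb over about 10 years.
print(f"Ethernet bandwidth grew about {implied_annual_growth(10, 10):.0%} per year")
```

The 15% die-size figure is an assumed midpoint of the range given in the text, chosen only to show how the roughly 55% combined rate arises.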
These rapidly changing technologies impact the design of a microprocessor that may, with speed and technology enhancements, have a lifetime of five or more years. Even within the span of a single product cycle for a computing system (two years of design and two to three years of production), key technologies, such as DRAM, change sufficiently that the designer must plan for these changes. Indeed, designers often design for the next technology, knowing that when a product begins shipping in volume that next technology may be the most cost-effective or may have performance advantages. Traditionally, cost has decreased at very nearly the rate at which density increases.
Although technology improves fairly continuously, the impact of these improvements is sometimes seen in discrete leaps, as a threshold that allows a new capability is reached. For example, when MOS technology reached the point where it could put between 25,000 and 50,000 transistors on a single chip in the early 1980s, it became possible to build a 32-bit microprocessor on a single chip. By the late 1980s, first-level caches could go on-chip. By eliminating chip crossings within the processor and between the processor and the cache, a dramatic increase in cost/performance and performance/power was possible. This design was simply infeasible until the technology reached a certain point. Such technology thresholds are not rare and have a significant impact on a wide variety of design decisions.
Scaling of Transistor Performance, Wires, and Power in Integrated Circuits
Integrated circuit processes are characterized by the feature size, which is the minimum size of a transistor or a wire in either the x or y dimension. Feature sizes have decreased from 10 microns in 1971 to 0.18 microns in 2001. Since a transistor is a 2-dimensional object, the density of transistors increases quadratically with a linear decrease in feature size. The increase in transistor performance, however, is more complex. As feature sizes shrink, devices shrink quadratically in the horizontal dimensions and also shrink in the vertical dimension. The shrink in the vertical dimension requires a reduction in operating voltage to maintain correct operation and reliability of the transistors. This combination of scaling factors leads to a complex interrelationship between transistor performance and process feature size. To first approximation, transistor performance improves linearly with decreasing feature size.
The fact that transistor count improves quadratically with a linear improvement in transistor performance is both the challenge and the opportunity that computer architects were created for! In the early days of microprocessors, the higher rate of improvement in density was used to quickly move from 4-bit, to 8-bit, to 16-bit, to 32-bit microprocessors. More recently, density improvements have supported the introduction of 64-bit microprocessors as well as many of the innovations in pipelining and caches, which we discuss in Chapters 3, 4, and 5.
Although transistors generally improve in performance with decreased feature size, wires in an integrated circuit do not. In particular, the signal delay for a wire increases in proportion to the product of its resistance and capacitance. Of course, as feature size shrinks, wires get shorter, but the resistance and capacitance per unit length get worse. This relationship is complex, since both resistance and capacitance depend on detailed aspects of the process, the geometry of a wire, the loading on a wire, and even the adjacency to other structures.
There are occasional process enhancements, such as the introduction of copper, which provide one-time improvements in wire delay. In general, however, wire delay scales poorly compared to transistor performance, creating additional challenges for the designer. In the past few years, wire delay has become a major design limitation for large integrated circuits and is often more critical than transistor switching delay. Larger and larger fractions of the clock cycle have been consumed by the propagation delay of signals on wires. In 2001, the Pentium 4 broke new ground by allocating two stages of its 20+ stage pipeline just for propagating signals across the chip.
Power also provides challenges as devices are scaled. For modern CMOS microprocessors, the dominant energy consumption is in switching transistors. The power required per transistor is proportional to the product of the load capacitance of the transistor, the frequency of switching, and the square of the voltage. As we move from one process to the next, the increase in the number of transistors switching and the frequency with which they switch dominates the decrease in load capacitance and voltage, leading to an overall growth in power consumption. The first microprocessors consumed tenths of watts, while a Pentium 4 consumes between 60 and 85 watts, and a 2 GHz Pentium 4 will be close to 100 watts. The fastest workstation and server microprocessors in 2001 consume between 100 and 150 watts. Distributing the power, removing the heat, and preventing hot spots have become increasingly difficult challenges, and it is likely that power rather than raw transistor count will become the major limitation in the near future.
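The proportionality described above (load capacitance times switching frequency times voltage squared, summed over the transistors that switch) can be sketched as a ratio between two process generations. This is an illustration only: the function name and every scale factor in the usage lines are hypothetical and do not come from the text.

```python
def relative_switching_power(
    capacitance_scale: float,
    voltage_scale: float,
    frequency_scale: float,
    transistor_scale: float = 1.0,
) -> float:
    """Ratio of new to old switching power, given per-quantity scale factors.

    Follows the proportionality: power ~ transistors * capacitance * voltage^2 * frequency.
    """
    return transistor_scale * capacitance_scale * voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Hypothetical process step: capacitance and voltage fall, while
    # frequency and transistor count rise. None of these values are measured data.
    ratio = relative_switching_power(
        capacitance_scale=0.7,
        voltage_scale=0.85,
        frequency_scale=1.5,
        transistor_scale=2.0,
    )
    print(f"power changes by a factor of {ratio:.2f}")
```

Under these assumed factors the growth in transistor count and frequency outweighs the savings from lower capacitance and voltage, which is the pattern the text describes.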
1.4 Cost, Price and their Trends
Although there are computer designs where costs tend to be less important (specifically supercomputers), cost-sensitive designs are of growing importance: more than half the PCs sold in 1999 were priced at less than $1,000, and the average price of a 32-bit microprocessor for an embedded application is in the tens of dollars. Indeed, in the past 15 years, the use of technology improvements to achieve lower cost, as well as increased performance, has been a major theme in the computer industry.
Textbooks often ignore the cost half of cost-performance because costs change, thereby dating books, and because the issues are subtle and differ across industry segments. Yet an understanding of cost and its factors is essential for designers to be able to make intelligent decisions about whether or not a new feature should be included in designs where cost is an issue. (Imagine architects designing skyscrapers without any information on costs of steel beams and concrete.)
This section focuses on cost and price, specifically on the relationship between price and cost: price is what you sell a finished good for, and cost is the amount spent to produce it, including overhead. We also discuss the major trends and factors that affect cost and how it changes over time. The Exercises and Examples use specific cost data that will change over time, though the basic determinants of cost are less time sensitive. This section will introduce you to these topics by discussing some of the major factors that influence cost of a computer design and how these factors are changing over time.
The Impact of Time, Volume, Commodification, and Packaging
The cost of a manufactured computer component decreases over time even without major improvements in the basic implementation technology. The underlying principle that drives costs down is the learning curve: manufacturing costs decrease over time.
The learning curve itself is best measured by change in yield— the percentage of manufactured devices that survives the testing procedure. Whether it is a chip, a board, or a system, designs that have twice the yield will have basically half the cost. Understanding how the learning curve will improve yield is key to projecting costs over the life of the product. As an example of the learning curve in action, the price per megabyte of DRAM drops over the long term by 40% per year. Since DRAMs tend to be priced in close relationship to cost–with the exception
of periods when there is a shortage–price and cost of DRAM track closely. In fact, there are some periods (for example early 2001) in which it appears that price is less than cost; of course, the manufacturers hope that such periods are both infrequent and short. Figure 1.5 plots the price of a new DRAM chip over its lifetime. Between the start of a project and the shipping of a product, say two years, the cost of a new DRAM drops by a factor of between five and ten in constant dollars. Since not all component costs change at the same rate, designs based on projected costs result in different cost/performance trade-offs than those using current costs. The caption of Figure 1.5 discusses some of the long-term trends in DRAM price.
Microprocessor prices also drop over time, but because they are less standardized than DRAMs, the relationship between price and cost is more complex. In a period of significant competition, price tends to track cost closely, although microprocessor vendors probably rarely sell at a loss. Figure 1.6 shows processor price trends for the Pentium III.
Volume is a second key factor in determining cost. Increasing volumes affect cost in several ways. First, they decrease the time needed to get down the learning curve, which is partly proportional to the number of systems (or chips) manufactured. Second, volume decreases cost, since it increases purchasing and manufacturing efficiency. As a rule of thumb, some designers have estimated that cost decreases about 10% for each doubling of volume. Also, volume decreases the amount of development cost that must be amortized by each machine, thus allowing cost and selling price to be closer. We will return to the other factors influencing selling price shortly.
Commodities are products that are sold by multiple vendors in large volumes and are essentially identical. Virtually all the products sold on the shelves of grocery stores are commodities, as are standard DRAMs, disks, monitors, and keyboards.
In the past 10 years, much of the low end of the computer business has become a commodity business focused on building IBM-compatible PCs. There are a variety of vendors that ship virtually identical products and are highly competitive. Of course, this competition decreases the gap between cost and selling price, but it also decreases cost. Reductions occur because a commodity market has both volume and a clear product definition, which allows multiple suppliers to compete in building components for the commodity product. As a result, the overall product cost is lower because of the competition among the suppliers of the components and the volume efficiencies the suppliers can achieve. This has led to the low end of the computer business being able to achieve better price-performance than other sectors, and yielded greater growth at the low end, albeit with very limited profits (as is typical in any commodity business).
[Plot: dollars per DRAM chip (vertical axis, 0 to 80) versus year (horizontal axis, 1978 to 1995), with one curve per DRAM generation.]
FIGURE 1.5 Prices of six generations of DRAMs (from 16 Kb to 64 Mb) over time in 1977 dollars, showing the learning curve at work. A 1977 dollar is worth about $2.95 in 2001; more than half of this inflation occurred in the five-year period of 1977–82, during which the value changed to $1.59. The cost of a megabyte of memory has dropped incredibly during this period, from over $5000 in 1977 to about $0.35 in 2000, and an amazing $0.08 in 2001 (in 1977 dollars)! Each generation drops in constant dollar price by a factor of 10 to 30 over its lifetime. Starting in about 1996, an explosion of manufacturers has dramatically reduced margins and increased the rate at which prices fall, as well as the eventual final price for a DRAM. Periods when demand exceeded supply, such as 1987–88 and 1992–93, have led to temporary higher pricing, which shows up as a slowing in the rate of price decrease; more dramatic short-term fluctuations have been smoothed out. In late 2000 and through 2001, there has been tremendous oversupply leading to an accelerated price decrease, which is probably not sustainable.
[Plot: price of the Intel Pentium III over time, with one curve per clock frequency: 450, 500, 600, 733, 867, and 1000 MHz.]
FIGURE 1.6 The price of an Intel Pentium III at a given frequency decreases over time as yield enhancements decrease the cost of good die and competition forces price reductions. Data courtesy of Microprocessor Report, May 2000 issue. The most recent introductions will continue to decrease until they reach similar prices to the lowest cost parts available today ($100-$200). Such price decreases assume a competitive environment where price decreases track cost decreases closely.
Cost of an Integrated Circuit
Why would a computer architecture book have a section on integrated circuit costs? In an increasingly competitive computer marketplace where standard parts (disks, DRAMs, and so on) are becoming a significant portion of any system’s cost, integrated circuit costs are becoming a greater portion of the cost that varies between machines, especially in the high-volume, cost-sensitive portion of the market. Thus computer designers must understand the costs of chips to understand the costs of current computers. Although the costs of integrated circuits have dropped exponentially, the basic procedure of silicon manufacture is unchanged: A wafer is still tested and chopped into dies that are packaged (see Figures 1.7 and 1.8). Thus the cost of a packaged integrated circuit is
$$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$
FIGURE 1.7 Photograph of a 12-inch wafer containing Intel Pentium 4 microprocessors. (Courtesy Intel.)
In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at the end. A longer discussion of the testing costs and packaging costs appears in the Exercises. Predicting the number of good chips per wafer requires first learning how many dies fit on a wafer and then learning how to predict the percentage of those that will work. From there it is simple to predict cost:
$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$
The most interesting feature of this first term of the chip cost equation is its sensitivity to die size, shown below. The number of dies per wafer is basically the area of the wafer divided by the area of the die. It can be more accurately estimated by
$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$
The first term is the ratio of wafer area (πr²) to die area. The second compensates for the “square peg in a round hole” problem: rectangular dies near the periphery of round wafers. Dividing the circumference (πd) by the diagonal of a square die is approximately the number of dies along the edge. For example, a wafer 30 cm (≈ 12 inch) in diameter produces π × 225 - (π × 30 / 1.41) = 640 dies that are 1 cm on a side.
EXAMPLE Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side.
ANSWER The total die area is 0.49 cm². Thus
$$\text{Dies per wafer} = \frac{\pi \times (30/2)^2}{0.49} - \frac{\pi \times 30}{\sqrt{2 \times 0.49}} = \frac{706.5}{0.49} - \frac{94.2}{0.99} = 1347$$
FIGURE 1.8 Photograph of a 12-inch wafer containing NEC MIPS 4122 processors.
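The dies-per-wafer estimate above translates directly into a short function. This is a sketch of the formula as stated, with illustrative names; it assumes square dies, as the diagonal term in the formula does.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Estimate dies per wafer: wafer area over die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    # Circumference divided by the diagonal of a square die approximates
    # the number of partial dies lost along the wafer edge.
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


if __name__ == "__main__":
    for side_cm in (1.0, 0.7):
        area = side_cm ** 2
        print(f"{side_cm} cm die: about {int(dies_per_wafer(30.0, area))} dies per 30-cm wafer")
```

Truncating the result to a whole number reproduces the two figures worked out in the text for the 30-cm wafer.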
But this only gives the maximum number of dies per wafer. The critical question is, What is the fraction or percentage of good dies on a wafer, or the die yield? A simple empirical model of integrated circuit yield, which assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, leads to the following:
$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$
where wafer yield accounts for wafers that are completely bad and so need not be tested. For simplicity, we’ll just assume the wafer yield is 100%. Defects per unit area is a measure of the random manufacturing defects that occur. In 2001, these values typically range between 0.4 and 0.8 per square centimeter, depending on the maturity of the process (recall the learning curve, mentioned earlier). Lastly,
α is a parameter that corresponds inversely to the number of masking levels, a measure of manufacturing complexity, critical to die yield. For today’s multilevel metal CMOS processes, a good estimate is α = 4.0.
EXAMPLE Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a defect density of 0.6 per cm².
ANSWER The total die areas are 1 cm² and 0.49 cm². For the larger die the yield is
$$\text{Die yield} = \left(1 + \frac{0.6 \times 1}{4.0}\right)^{-4} = 0.57$$
For the smaller die, it is
$$\text{Die yield} = \left(1 + \frac{0.6 \times 0.49}{4.0}\right)^{-4} = 0.75$$
The bottom line is the number of good dies per wafer, which comes from multiplying dies per wafer by die yield (which incorporates the effects of defects). The examples above predict 366 good 1-cm² dies from the 30-cm wafer and 1014 good 0.49-cm² dies. Most 32-bit and 64-bit microprocessors in a modern 0.25-micron technology fall between these two sizes, with some processors being as large as 2 cm² in the prototype process before a shrink. Low-end embedded 32-bit processors are sometimes as small as 0.25 cm², while processors used for embedded control (in printers, automobiles, etc.) are often less than 0.1 cm². Figure 1.34 on page 81 in the Exercises shows the die size and technology for several current microprocessors.
Given the tremendous price pressures on commodity products such as DRAM and SRAM, designers have included redundancy as a way to raise yield. For a number of years, DRAMs have regularly included some redundant memory cells, so that a certain number of flaws can be accommodated. Designers have used similar techniques in both standard SRAMs and in large SRAM arrays used for caches within microprocessors. Obviously, the presence of redundant entries can be used to significantly boost the yield.
Processing a 30-cm-diameter wafer in a leading-edge technology with 4-6 metal layers costs between $5000 and $6000 in 2001. Assuming a processed wafer cost of $5500, the cost of the 0.49-cm² die is around $5.42, while the cost per die of the 1-cm² die is about $15.03, or almost three times the cost for a die that is two times larger.
What should a computer designer remember about chip costs? The manufacturing process dictates the wafer cost, wafer yield, α, and defects per unit area, so the sole control of the designer is die area. Since α is around 4 for the advanced
processes in use today, die costs rise steeply with die area, approaching proportionality to the fifth (or higher) power of the die area once the defect term dominates the yield:
$$\text{Cost of die} = f(\text{Die area}^5)$$
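The yield and cost relations in this section can be combined into one small calculation. The sketch below is illustrative rather than definitive: the function names are made up here, wafer yield is assumed to be 100%, α = 4.0 is applied in both the denominator and the exponent exactly as the yield formula is written, and the $5500 wafer cost and 0.6 defects per cm² are the values quoted in the text.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area over die area, minus the edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    return wafer_area / die_area_cm2 - math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)


def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 4.0, wafer_yield: float = 1.0) -> float:
    """Empirical yield model: wafer_yield * (1 + D * A / alpha) ** -alpha."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost: float, wafer_diameter_cm: float, die_area_cm2: float,
             defects_per_cm2: float, alpha: float = 4.0) -> float:
    """Cost of wafer divided by the number of good dies per wafer."""
    good_dies = dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield(
        defects_per_cm2, die_area_cm2, alpha)
    return wafer_cost / good_dies


if __name__ == "__main__":
    for area in (0.49, 1.0):
        y = die_yield(0.6, area)
        c = die_cost(5500.0, 30.0, area, 0.6)
        print(f"die area {area} cm^2: yield {y:.2f}, cost ${c:.2f}")
```

Because both the die count and the yield fall as the die grows, doubling the area more than doubles the cost per good die, which is why die area is the designer's main cost lever.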
The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins. Before we have a part that is ready for use in a computer, the die must be tested (to separate the good dies from the bad), packaged, and tested again after packaging. These steps all add significant costs. These processes and their contribution to cost are discussed and evaluated in Exercise 1.9.
The above analysis has focused on the variable costs of producing a functional die, which is appropriate for high volume integrated circuits. There is, however, one very important part of the fixed cost that can significantly impact the cost of an integrated circuit for low volumes (less than one million parts), namely the cost of a mask set. Each step in the integrated circuit process requires a separate mask. Thus, for modern high density fabrication processes with four to six metal layers, mask costs often exceed $1 million. Obviously, this large fixed cost affects the cost of prototyping and debugging runs and, for small volume production, can be a significant part of the production cost. Since mask costs are likely to continue to increase, designers may incorporate reconfigurable logic to enhance the flexibility of a part, or choose to use gate arrays (that have fewer custom mask levels) and thus reduce the cost implications of masks.
Distribution of Cost in a System: An Example
To put the costs of silicon in perspective, Figure 1.9 shows the approximate cost breakdown for a $1,000 PC in 2001. Although the costs of some parts of this machine can be expected to drop over time, other components, such as the packaging and power supply, have little room for improvement. Furthermore, we can expect that future machines will have larger memories and disks, meaning that prices drop more slowly than the technology improvement.
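Returning to the mask-set cost discussed earlier in this section, its effect on low-volume parts can be sketched by spreading the fixed cost over the production run. The function name, the $10 variable cost, and the volumes below are hypothetical; only the $1 million mask-set figure comes from the text.

```python
def cost_per_part(variable_cost: float, mask_set_cost: float, volume: int) -> float:
    """Variable cost per part plus the mask-set cost amortized over the run."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return variable_cost + mask_set_cost / volume


if __name__ == "__main__":
    MASK_SET_COST = 1_000_000.0  # order of magnitude quoted in the text
    VARIABLE_COST = 10.0         # hypothetical tested, packaged part cost
    for volume in (10_000, 100_000, 1_000_000, 10_000_000):
        total = cost_per_part(VARIABLE_COST, MASK_SET_COST, volume)
        print(f"{volume:>10,} parts: ${total:.2f} each")
```

At small volumes the amortized mask cost dominates the per-part cost, while at high volumes it nearly vanishes, which is why the text treats it as a concern mainly for runs under a million parts.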
Cost Versus Price: Why They Differ and By How Much
Costs of components may confine a designer’s desires, but they are still far from representing what the customer must pay. But why should a computer architecture book contain pricing information? Cost goes through a number of changes before it becomes price, and the computer designer should understand how a design decision will affect the potential selling price. For example, changing cost by $1000 may change price by $3000 to $4000. Without understanding the relationship of cost to price, the computer designer may not understand the impact on price of adding, deleting, or replacing components.
System            Subsystem                                             Fraction of total
Cabinet           Sheet metal, plastic                                  2%
                  Power supply, fans                                    2%
                  Cables, nuts, bolts                                   1%
                  Shipping box, manuals                                 1%
                  Subtotal                                              6%
Processor board   Processor                                             23%
                  DRAM (128 MB)                                         5%
                  Video card                                            5%
                  Motherboard with basic I/O support, and networking    5%
                  Subtotal                                              38%
I/O devices       Keyboard and mouse                                    3%
                  Monitor                                               20%
                  Hard disk (20 GB)                                     9%
                  DVD drive                                             6%
                  Subtotal                                              37%
Software          OS + Basic Office Suite                               20%
FIGURE 1.9 Estimated distribution of costs of the components in a $1,000 PC in 2001. Notice that the largest single item is the CPU, closely followed by the monitor. (Interestingly, in 1995, the DRAM memory at about 1/3 of the total cost was the most expensive component! Since then, cost per MB has dropped by about a factor of 15!) Touma [1993] discusses computer system costs and pricing in more detail. These numbers are based on estimates of volume pricing for the various components.
The relationship between price and volume can increase the impact of changes in cost, especially at the low end of the market. Typically, fewer computers are sold as the price increases. Furthermore, as volume decreases, costs rise, leading to further increases in price. Thus, small changes in cost can have a larger than obvious impact. The relationship between cost and price is a complex one with entire books written on the subject. The purpose of this section is to give you a simple introduction to what factors determine price and typical ranges for these factors.
The categories that make up price can be shown either as a tax on cost or as a percentage of the price. We will look at the information both ways. These differences between price and cost also depend on where in the computer marketplace a company is selling. To show these differences, Figure 1.10 shows how the difference between cost of materials and list price is decomposed, with the price increasing from left to right as we add each type of overhead.
Direct costs refer to the costs directly related to making a product. These include labor costs, purchasing components, scrap (the leftover from yield), and warranty, which covers the costs of systems that fail at the customer’s site during the warranty period. Direct cost typically adds 10% to 30% to component cost. Service or maintenance costs are not included because the customer typically pays those costs, although a warranty allowance may be included here or in gross margin, discussed next.
The next addition is called the gross margin, the company’s overhead that cannot be billed directly to one product. This can be thought of as indirect cost. It includes the company’s research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes. When the component costs are added to the direct cost and gross margin, we reach the average selling price (ASP in the language of MBAs), the money that comes directly to the company for each product sold. The gross margin is typically 10% to 45% of the average selling price, depending on the uniqueness of the product. Manufacturers of low-end PCs have lower gross margins for several reasons. First, their R&D expenses are lower. Second, their cost of sales is lower, since they use indirect distribution (by mail, the Internet, phone order, or retail store) rather than salespeople. Third, because their products are less unique, competition is more intense, thus forcing lower prices and often lower profits, which in turn lead to a lower gross margin.
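[Stacked-bar chart with four columns, left to right; each percentage is a share of that column's total.]
Component costs only: component costs 100%.
After adding 20% for direct costs: component costs 83%, direct costs 17%.
Average selling price (after adding 33% for gross margin): component costs 62.2%, direct costs 12.8%, gross margin 25%.
List price (after adding 33% for average discount): component costs 46.6%, direct costs 9.6%, gross margin 18.8%, average discount 25%.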
FIGURE 1.10 The components of price for a $1,000 PC. Each increase is shown along the bottom as a tax on the prior price. The percentages of the new price for all elements are shown on the left of each column.
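The build-up from component cost to list price in Figure 1.10 can be sketched as a chain of markups, each applied as a tax on the prior subtotal. This is an illustration under stated assumptions: the function and key names are made up here, and the two 33% steps are interpreted as exactly one-third. The shares computed this way are close to, but not identical with, the percentages printed in the figure.

```python
def price_buildup(component_cost: float,
                  direct_rate: float = 0.20,
                  margin_rate: float = 1.0 / 3.0,
                  discount_rate: float = 1.0 / 3.0) -> dict:
    """Apply each overhead as a markup on the prior subtotal."""
    direct = component_cost * direct_rate
    gross_margin = (component_cost + direct) * margin_rate
    average_selling_price = component_cost + direct + gross_margin
    average_discount = average_selling_price * discount_rate
    return {
        "component_cost": component_cost,
        "direct_costs": direct,
        "gross_margin": gross_margin,
        "average_selling_price": average_selling_price,
        "average_discount": average_discount,
        "list_price": average_selling_price + average_discount,
    }


if __name__ == "__main__":
    parts = price_buildup(100.0)  # hypothetical $100 of components
    list_price = parts["list_price"]
    for name in ("component_cost", "direct_costs", "gross_margin", "average_discount"):
        print(f"{name:>18}: {parts[name] / list_price:.1%} of list price")
```

Adding one-third to a subtotal makes the added amount one-quarter of the new total, which is why a 33% markup appears as a 25% share in the next column.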
List price and average selling price are not the same. One reason for this is that companies offer volume discounts, lowering the average selling price. As personal computers became commodity products, the retail mark-ups have dropped significantly, so list price and average selling price have converged.
As we said, pricing is sensitive to competition: A company may not be able to sell its product at a price that includes the desired gross margin. In the worst case, the price must be significantly reduced, lowering gross margin until profit becomes negative! A company striving for market share can reduce price and profit to increase the attractiveness of its products. If the volume grows sufficiently, costs can be reduced. Remember that these relationships are extremely complex and to understand them in depth would require an entire book, as opposed to one section in one chapter. For example, if a company cuts prices, but does not obtain a sufficient growth in product volume, the chief impact will be lower profits.
Many engineers are surprised to find that most companies spend only 4% (in the commodity PC business) to 12% (in the high-end server business) of their income on R&D, which includes all engineering (except for manufacturing and field engineering). This well-established percentage is reported in companies’ annual reports and tabulated in national magazines, so this percentage is unlikely to change over time. In fact, experience has shown that computer companies with R&D percentages of 15-20% rarely prosper over the long term.
The information above suggests that a company uniformly applies fixed-overhead percentages to turn cost into price, and this is true for many companies. But another point of view is that R&D should be considered an investment. Thus an investment of 4% to 12% of income means that every $1 spent on R&D should lead to $8 to $25 in sales. This alternative point of view then suggests a different gross margin for each product depending on the number sold and the size of the investment.
Large, expensive machines generally cost more to develop: a machine costing 10 times as much to manufacture may cost many times as much to develop. Since large, expensive machines generally do not sell as well as small ones, the gross margin must be greater on the big machines for the company to maintain a profitable return on its investment. This investment model places large machines in double jeopardy, because there are fewer sold and they require larger R&D costs, and gives one explanation for a higher ratio of price to cost versus smaller machines.
The issue of cost and cost/performance is a complex one. There is no single target for computer designers. At one extreme, high-performance design spares no cost in achieving its goal. Supercomputers have traditionally fit into this category, but the market that only cares about performance has been the slowest growing portion of the computer market. At the other extreme is low-cost design, where performance is sacrificed to achieve lowest cost; some portions of the embedded market, for example, the market for cell phone microprocessors, behave exactly like this. Between these extremes is cost/performance design, where the designer balances cost versus performance. Most of the PC market, the workstation market, and most of the server market (at least including both low-end and midrange servers) operate in this region. In the past 10 years, as computers have downsized, both low-cost design and cost/performance design have become increasingly important. This section has introduced some of the most important factors in determining cost; the next section deals with performance.
1.5 Measuring and Reporting Performance
When we say one computer is faster than another, what do we mean? The user of a desktop machine may say a computer is faster when a program runs in less time, while the computer center manager running a large server system may say a computer is faster when it completes more jobs in an hour. The computer user is interested in reducing response time (the time between the start and the completion of an event), also referred to as execution time. The manager of a large data processing center may be interested in increasing throughput, the total amount of work done in a given time.
In comparing design alternatives, we often want to relate the performance of two different machines, say X and Y. The phrase “X is faster than Y” is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, “X is n times faster than Y” will mean
$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
Since execution time is the reciprocal of performance, the following relationship holds:
$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$
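The relationship above can be expressed in a few lines. This is a minimal sketch with illustrative function names and made-up execution times; it only restates the definition that performance is the reciprocal of execution time.

```python
def performance(execution_time: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time


def times_faster(execution_time_x: float, execution_time_y: float) -> float:
    """Return n such that X is n times faster than Y."""
    return execution_time_y / execution_time_x


if __name__ == "__main__":
    time_x, time_y = 10.0, 15.0  # hypothetical execution times in seconds
    n_from_times = times_faster(time_x, time_y)
    n_from_performance = performance(time_x) / performance(time_y)
    print(f"X is {n_from_times:.2f} times faster than Y")
    print(f"same ratio from performance: {n_from_performance:.2f}")
```

Both routes give the same n because dividing the reciprocals simply inverts the ratio of execution times.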
The phrase “the throughput of X is 1.3 times higher than Y” signifies here that the number of tasks completed per unit time on machine X is 1.3 times the number completed on Y. Because performance and execution time are reciprocals, increasing performance decreases execution time. To help avoid confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time” when we mean increase performance and decrease execution time. Whether we are interested in throughput or response time, the key measurement is time: The computer that performs the same amount of work in the least time is the fastest. The difference is whether we measure one task (response time) or many tasks (throughput). Unfortunately, time is not always the metric quoted in comparing the performance of computers. A number of popular measures have been adopted in the quest for a easily understood, universal measure of computer
performance, with the result that a few innocent terms have been abducted from their well-defined environment and forced into a service for which they were never intended. The authors’ position is that the only consistent and reliable measure of performance is the execution time of real programs, and that all proposed alternatives to time as the metric or to real programs as the items measured have eventually led to misleading claims or even mistakes in computer design. The dangers of a few popular alternatives are shown in Fallacies and Pitfalls, section 1.9.

Measuring Performance

Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, and operating system overhead: everything. With multiprogramming the CPU works on another program while waiting for I/O and may not necessarily minimize the elapsed time of one program. Hence we need a term to take this activity into account. CPU time recognizes this distinction and means the time the CPU is computing, not including the time waiting for I/O or running other programs. (Clearly the response time seen by the user is the elapsed time of the program, not the CPU time.) CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks requested by the program, called system CPU time. These distinctions are reflected in the UNIX time command, which returns four measurements when applied to an executing program:

90.7u 12.9s 2:39 65%
User CPU time is 90.7 seconds, system CPU time is 12.9 seconds, elapsed time is 2 minutes and 39 seconds (159 seconds), and the percentage of elapsed time that is CPU time is (90.7 + 12.9)/159 or 65%. More than a third of the elapsed time in this example was spent waiting for I/O or running other programs or both. Many measurements ignore system CPU time because of the inaccuracy of operating systems’ self-measurement (the above inaccurate measurement came from UNIX) and the inequity of including system CPU time when comparing performance between machines with differing system codes. On the other hand, system code on some machines is user code on others, and no program runs without some operating system running on the hardware, so a case can be made for using the sum of user CPU time and system CPU time. In the present discussion, a distinction is maintained between performance based on elapsed time and that based on CPU time. The term system performance is used to refer to elapsed time on an unloaded system, while CPU performance refers to user CPU time on an unloaded system. We will focus on CPU performance in this chapter, though we do consider performance measurements based on elapsed time.
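The arithmetic behind the 65% figure can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the `measure` function is an added illustration (not part of the original discussion) that uses the standard `time.perf_counter` and `time.process_time` calls to separate elapsed time from the CPU time of the current process.

```python
import time


def parse_elapsed(text: str) -> float:
    """Convert a "minutes:seconds" string such as "2:39" into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def cpu_fraction(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed time that was user plus system CPU time."""
    return (user_s + system_s) / elapsed_s


def measure(fn):
    """Return (elapsed seconds, CPU seconds) for one call of fn()."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


# Values taken from the sample output "90.7u 12.9s 2:39 65%".
elapsed = parse_elapsed("2:39")
fraction = cpu_fraction(90.7, 12.9, elapsed)
print(f"elapsed = {elapsed:.0f} s, CPU share = {fraction:.0%}")
```

Calling `measure(lambda: time.sleep(0.05))` should report an elapsed time of at least roughly 0.05 seconds but very little CPU time, which mirrors the distinction between elapsed time and CPU time drawn above.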
Choosing Programs to Evaluate Performance Dhrystone does not use floating point. Typical programs don’t … Rick Richardson, Clarification of Dhrystone (1988)
This program is the result of extensive research to determine the instruction mix of a typical Fortran program. The results of this program on different machines should give a good indication of which machine performs better under a typical load of Fortran programs. The statements are purposely arranged to defeat optimizations by the compiler. H. J. Curnow and B. A. Wichmann [1976], Comments in the Whetstone Benchmark
A computer user who runs the same programs day in and day out would be the perfect candidate to evaluate a new computer. To evaluate a new system the user would simply compare the execution time of her workload, the mixture of programs and operating system commands that users run on a machine. Few are in this happy situation, however. Most must rely on other methods to evaluate machines and often other evaluators, hoping that these methods will predict performance for their usage of the new machine. There are five levels of programs used in such circumstances, listed below in decreasing order of accuracy of prediction.

1. Real applications: Although the buyer may not know what fraction of time is spent on these programs, she knows that some users will run them to solve real problems. Examples are compilers for C, text-processing software like Word, and other applications like Photoshop. Real applications have input, output, and options that a user can select when running the program. There is one major downside to using real applications as benchmarks: Real applications often encounter portability problems arising from dependences on the operating system or compiler. Enhancing portability often means modifying the source and sometimes eliminating some important activity, such as interactive graphics, which tends to be more system-dependent.
2. Modified (or scripted) applications: In many cases, real applications are used as the building block for a benchmark either with modifications to the application or with a script that acts as stimulus to the application. Applications are modified for two primary reasons: to enhance portability or to focus on one particular aspect of system performance. For example, to create a CPU-oriented benchmark, I/O may be removed or restructured to minimize its impact on execution time.
Scripts are used to reproduce interactive behavior, which might occur on a desktop system, or to simulate complex multiuser interaction, which occurs in a server system.
3. Kernels: Several attempts have been made to extract small, key pieces from real programs and use them to evaluate performance. Livermore Loops and Linpack are the best known examples. Unlike real programs, no user would run kernel programs, for they exist solely to evaluate performance. Kernels are best used to isolate performance of individual features of a machine to explain the reasons for differences in performance of real programs.
4. Toy benchmarks: Toy benchmarks are typically between 10 and 100 lines of code and produce a result the user already knows before running the toy program. Programs like Sieve of Eratosthenes, Puzzle, and Quicksort are popular because they are small, easy to type, and run on almost any computer. The best use of such programs is beginning programming assignments.
5. Synthetic benchmarks: Similar in philosophy to kernels, synthetic benchmarks try to match the average frequency of operations and operands of a large set of programs. Whetstone and Dhrystone are the most popular synthetic benchmarks. A description of these benchmarks and some of their flaws appears in section 1.9 on page 59. No user runs synthetic benchmarks, because they don’t compute anything a user could want. Synthetic benchmarks are, in fact, even further removed from reality than kernels because kernel code is extracted from real programs, while synthetic code is created artificially to match an average execution profile. Synthetic benchmarks are not even pieces of real programs, although kernels might be.

Because computer companies thrive or go bust depending on price/performance of their products relative to others in the marketplace, tremendous resources are available to improve performance of programs widely used in evaluating machines. Such pressures can skew hardware and software engineering efforts to add optimizations that improve performance of synthetic programs, toy programs, kernels, and even real programs.
The advantage of the last of these is that adding such optimizations is more difficult in real programs, though not impossible. This fact has caused some benchmark providers to specify the rules under which compilers must operate, as we will see shortly. Benchmark Suites Recently, it has become popular to put together collections of benchmarks to try to measure the performance of processors with a variety of applications. Of course, such suites are only as good as the constituent individual benchmarks. Nonetheless, a key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks. This advantage is especially true if the methods used for summarizing the performance of the benchmark suite reflect the time to run the entire suite, as opposed to rewarding performance increases on programs that may be defeated by targeted optimizations. Later in this section, we discuss the strengths and weaknesses of different methods for summarizing performance.
One of the most successful attempts to create standardized benchmark application suites has been the SPEC (Standard Performance Evaluation Corporation), which had its roots in the late 1980s efforts to deliver better benchmarks for workstations. Just as the computer industry has evolved over time, so has the need for different benchmark suites, and there are now SPEC benchmarks to cover different application classes, as well as other suites based on the SPEC model. Although we focus our discussion on the SPEC benchmarks in many of the following sections, there is also a large set of benchmarks that have been developed for PCs running the Windows operating system. These cover a variety of different application environments, as Figure 1.11 shows.

| Benchmark name | Benchmark description |
|---|---|
| Business Winstone 99 | Runs a script consisting of Netscape Navigator and several office suite products (Microsoft, Corel, WordPerfect). The script simulates a user switching among and running different applications. |
| High-end Winstone 99 | Also simulates multiple applications running simultaneously, but focuses on compute intensive applications such as Adobe Photoshop. |
| CC Winstone 99 | Simulates multiple applications focused on content creation, such as Photoshop, Premiere, Navigator, and various audio editing programs. |
| Winbench 99 | Runs a variety of scripts that test CPU performance, video system performance, and disk performance using kernels focused on each subsystem. |
FIGURE 1.11 A sample of some of the many PC benchmarks with the first four being scripts using real applications and the last being a mixture of kernels and synthetic benchmarks. These are all now maintained by Ziff Davis, a publisher of much of the literature in the PC space. Ziff Davis also provides an independent testing service. For more information on these benchmarks, see: http://www.zdnet.com/etestinglabs/filters/benchmarks/.
Desktop Benchmarks
Desktop benchmarks divide into two broad classes: CPU intensive benchmarks and graphics intensive benchmarks (although many graphics benchmarks include intensive CPU activity). SPEC originally created a benchmark set focusing on CPU performance (initially called SPEC89), which has evolved into its fourth generation: SPEC CPU2000, which follows SPEC95 and SPEC92. (Figure 1.30 on page 64 discusses the evolution of the benchmarks.) SPEC CPU2000, summarized in Figure 1.12, consists of a set of twelve integer benchmarks (CINT2000) and fourteen floating point benchmarks (CFP2000). The SPEC benchmarks are real programs modified for portability and to minimize the role of I/O in overall benchmark performance. The integer benchmarks vary from part of a C compiler to a VLSI place and route tool to a graphics application. The floating point benchmarks include code for quantum chromodynamics, finite element modeling, and fluid dynamics. The SPEC CPU suite is useful for CPU benchmarking for both desktop systems and single-processor servers. We will see data on many of these programs throughout this text.
In the next subsection, we show how a SPEC 2000 report describes the machine, compiler, and OS configuration. In section 1.9 we describe some of the pitfalls that have occurred in attempting to develop the SPEC benchmark suite, as well as the challenges in maintaining a useful and predictive benchmark suite.

| Benchmark | Type | Source | Description |
|---|---|---|---|
| gzip | Integer | C | Compression using the Lempel-Ziv algorithm |
| vpr | Integer | C | FPGA circuit placement and routing |
| gcc | Integer | C | Consists of the GNU C compiler generating optimized machine code |
| mcf | Integer | C | Combinatorial optimization of public transit scheduling |
| crafty | Integer | C | Chess playing program |
| parser | Integer | C | Syntactic English language parser |
| eon | Integer | C++ | Graphics visualization using probabilistic ray tracing |
| perlbmk | Integer | C | Perl (an interpreted string processing language) with four input scripts |
| gap | Integer | C | A group theory application package |
| vortex | Integer | C | An object-oriented database system |
| bzip2 | Integer | C | A block sorting compression algorithm |
| twolf | Integer | C | Timberwolf: a simulated annealing algorithm for VLSI place and route |
| wupwise | FP | F77 | Lattice gauge theory model of quantum chromodynamics |
| swim | FP | F77 | Solves shallow water equations using finite difference equations |
| mgrid | FP | F77 | Multigrid solver over 3-dimensional field |
| applu | FP | F77 | Parabolic and elliptic partial differential equation solver |
| mesa | FP | C | Three dimensional graphics library |
| galgel | FP | F90 | Computational fluid dynamics |
| art | FP | C | Image recognition of a thermal image using neural networks |
| equake | FP | C | Simulation of seismic wave propagation |
| facerec | FP | C | Face recognition using wavelets and graph matching |
| ammp | FP | C | Molecular dynamics simulation of a protein in water |
| lucas | FP | F90 | Performs primality testing for Mersenne primes |
| fma3d | FP | F90 | Finite element modeling of crash simulation |
| sixtrack | FP | F77 | High energy physics accelerator design simulation |
| apsi | FP | F77 | A meteorological simulation of pollution distribution |
FIGURE 1.12 The programs in the SPEC CPU2000 benchmark suites. The twelve integer programs (all in C, except one in C++) are used for the CINT2000 measurement, while the fourteen floating point programs (six in Fortran-77, five in C, and three in Fortran-90) are used for the CFP2000 measurement. See http://www.spec.org/osg/cpu2000/ for more on these benchmarks.
Although SPEC CPU2000 is aimed at CPU performance, two different types of graphics benchmarks were created by SPEC: SPECviewperf (see http://www.spec.org/gpc/opc.static/opcview.htm) is used for benchmarking systems supporting the OpenGL graphics library, while SPECapc (http://www.spec.org/gpc/apc.static/apcfaq.htm) consists of applications that make extensive use of graphics. SPECviewperf measures the 3-D rendering performance of systems running under OpenGL using a 3-D model and a series of OpenGL calls that transform the model. SPECapc consists of runs of three large applications:

1. Pro/Engineer: a solid modeling application that does extensive 3-D rendering. The input script is a model of a photocopying machine consisting of 370,000 triangles.
2. SolidWorks 99: a 3-D CAD/CAM design tool running a series of five tests varying from I/O intensive to CPU intensive. The largest input is a model of an assembly line consisting of 276,000 triangles.
3. Unigraphics V15: The benchmark is based on an aircraft model and covers a wide spectrum of Unigraphics functionality, including assembly, drafting, numeric control machining, solid modeling, and optimization. The inputs are all part of an aircraft design.

Server Benchmarks
Just as servers have multiple functions, so there are multiple types of benchmarks. The simplest benchmark is perhaps a CPU throughput oriented benchmark. SPEC CPU2000 uses the SPEC CPU benchmarks to construct a simple throughput benchmark where the processing rate of a multiprocessor can be measured by running multiple copies (usually as many as there are CPUs) of each SPEC CPU benchmark and converting the CPU time into a rate. This leads to a measurement called the SPECRate.

Other than SPECRate, most server applications and benchmarks have significant I/O activity arising from either disk or network traffic, including benchmarks for file server systems, for web servers, and for database and transaction processing systems. SPEC offers both a file server benchmark (SPECSFS) and a web server benchmark (SPECWeb). SPECSFS (see http://www.spec.org/osg/sfs93/) is a benchmark for measuring NFS (Network File System) performance using a script of file server requests; it tests the performance of the I/O system (both disk and network I/O) as well as the CPU. SPECSFS is a throughput oriented benchmark but with important response time requirements. (Chapter 6 discusses some file and I/O system benchmarks in detail.) SPECWeb (see http://www.spec.org/osg/web99/ for the 1999 version) is a web-server benchmark that simulates multiple clients requesting both static and dynamic pages from a server, as well as clients posting data to the server.

Transaction processing benchmarks measure the ability of a system to handle transactions, which consist of database accesses and updates. An airline reservation system or a bank ATM system are typical simple TP systems; more complex TP systems involve complex databases and decision making. In the mid 1980s, a group of concerned engineers formed the vendor-independent Transaction Processing Council (TPC) to try to create a set of realistic and fair benchmarks for transaction processing. The first TPC benchmark, TPC-A, was published in 1985 and has since been replaced and enhanced by four different benchmarks. TPC-C, initially created in 1992, simulates a complex query environment. TPC-H models ad-hoc decision support, meaning that the queries are unrelated and knowledge of past queries cannot be used to optimize future queries; the result is that query execution times can be very long. TPC-R simulates a business decision support system where users run a standard set of queries. In TPC-R, pre-knowledge of the queries is taken for granted and the DBMS system can be optimized to run these queries. TPC-W is a web-based transaction benchmark that simulates the activities of a business oriented transactional web server. It exercises the database system as well as the underlying web server software. The TPC benchmarks are described at: http://www.tpc.org/. All the TPC benchmarks measure performance in transactions per second. In addition, they include a response-time requirement, so that throughput performance is measured only when the response time limit is met. To model real-world systems, higher transaction rates are also associated with larger systems, both in terms of users and the database that the transactions are applied to. Finally, the system cost for a benchmark system must also be included, allowing accurate comparisons of cost-performance.

Embedded Benchmarks
Benchmarks for embedded computing systems are in a far more nascent state than those for either desktop or server environments. In fact, many manufacturers quote Dhrystone performance, a benchmark that was criticized and given up by desktop systems more than 10 years ago! As mentioned earlier, the enormous variety in embedded applications, as well as differences in performance requirements (hard real-time, soft real-time, and overall cost-performance), make the use of a single set of benchmarks unrealistic. In practice, many designers of embedded systems devise benchmarks that reflect their application, either as kernels or as stand-alone versions of the entire application. For those embedded applications that can be characterized well by kernel performance, the best standardized set of benchmarks appears to be a new benchmark set: the EDN Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced embassy). The EEMBC benchmarks fall into five classes: automotive/industrial, consumer, networking, office automation, and telecommunications. Figure 1.13 shows the five different application classes, which include 34 benchmarks. Although many embedded applications are sensitive to the performance of small kernels, remember that often the overall performance of the entire application (which may be thousands of lines) is also critical. Thus, for many embedded
systems, the EEMBC benchmarks can only be used to partially assess performance.

| Benchmark type | # of this type | Example benchmarks |
|---|---|---|
| Automotive/industrial | 16 | 6 microbenchmarks (arithmetic operations, pointer chasing, memory performance, matrix arithmetic, table lookup, bit manipulation), 5 automobile control benchmarks, and 5 filter or FFT benchmarks. |
| Consumer | 5 | 5 multimedia benchmarks (JPEG compress/decompress, filtering, and RGB conversions). |
| Networking | 3 | Shortest path calculation, IP routing, and packet flow operations. |
| Office automation | 4 | Graphics and text benchmarks (Bezier curve calculation, dithering, image rotation, text processing). |
| Telecommunications | 6 | Filtering and DSP benchmarks (autocorrelation, FFT, decoder, and encoder). |
FIGURE 1.13 The EEMBC benchmark suite, consisting of 34 kernels in five different classes. See www.eembc.org for more information on the benchmarks and for scores.
Reporting Performance Results

The guiding principle of reporting performance measurements should be reproducibility: list everything another experimenter would need to duplicate the results. A SPEC benchmark report requires a fairly complete description of the machine, the compiler flags, as well as the publication of both the baseline and optimized results. As an example, Figure 1.14 shows portions of the SPEC CINT2000 report for a Dell Precision Workstation 410. In addition to hardware, software, and baseline tuning parameter descriptions, a SPEC report contains the actual performance times, shown both in tabular form and as a graph. A TPC benchmark report is even more complete, since it must include results of a benchmarking audit and must also include cost information. A system’s software configuration can significantly affect the performance results for a benchmark. For example, operating system performance and support can be very important in server benchmarks. For this reason, these benchmarks are sometimes run in single-user mode to reduce overhead. Additionally, operating system enhancements are sometimes made to increase performance on the TPC benchmarks. Likewise, compiler technology can play a big role in CPU performance. The impact of compiler technology can be especially large when modification of the source is allowed (see the example with the EEMBC benchmarks on page 63) or when a benchmark is particularly susceptible to an optimization (see the example from SPEC described on page 61). For these reasons it is important to describe exactly the software system being measured as well as whether any special nonstandard modifications have been made. Another way to customize the software to improve the performance of a benchmark has been through the use of benchmark-specific flags; these flags often caused transformations that would be illegal on many programs or would
slow down performance on others. To restrict this process and increase the significance of the SPEC results, the SPEC organization created a baseline performance measurement in addition to the optimized performance measurement. Baseline performance restricts the vendor to one compiler and one set of flags for all the programs in the same language (C or FORTRAN). Figure 1.14 shows the parameters for the baseline performance; in section 1.9, Fallacies and Pitfalls, we’ll see the tuning parameters for the optimized performance runs on this machine.

| Hardware | | Software | |
|---|---|---|---|
| Model number | Precision WorkStation 410 | O/S and version | Windows NT 4.0 |
| CPU | 700 MHz, Pentium III | Compilers and version | Intel C/C++ Compiler 4.5 |
| Number of CPUs | 1 | Other software | See below |
| Primary cache | 16KBI+16KBD on chip | File system type | NTFS |
| Secondary cache | 256KB(I+D) on chip | System state | Default |
| Other cache | None | | |
| Memory | 256 MB ECC PC100 SDRAM | | |
| Disk subsystem | SCSI | | |
| Other hardware | None | | |

SPEC CINT2000 base tuning parameters/notes/summary of changes:
+FDO: PASS1=-Qprof_gen PASS2=-Qprof_use
Base tuning: -QxK -Qipo_wp shlW32M.lib +FDO
shlW32M.lib is the SmartHeap library V5.0 from MicroQuill www.microquill.com
Portability flags:
176.gcc: -Dalloca=_alloca /F10000000 -Op
186.crafty: -DNT_i386
253.perlbmk: -DSPEC_CPU2000_NTOS -DPERLDLL /MT
254.gap: -DSYS_HAS_CALLOC_PROTO -DSYS_HAS_MALLOC_PROTO

FIGURE 1.14 The machine, software, and baseline tuning parameters for the CINT2000 base report on a Dell Precision WorkStation 410. This data is for the base CINT2000 report. The data is available online at: http://www.spec.org/osg/cpu2000/results/cpu2000.html.
In addition to the question of flags and optimization, another key question is whether source code modifications or hand-generated assembly language are allowed. There are four broad categories of approaches here:

1. No source code modifications are allowed. The SPEC benchmarks fall into this class, as do most of the standard PC benchmarks.
2. Source code modifications are allowed, but are essentially difficult or impossible. Benchmarks like TPC-C rely on standard databases, such as Oracle or Microsoft’s SQL Server. Although these third party vendors are interested in the overall performance of their systems on important industry-standard benchmarks, they are highly unlikely to make vendor-specific changes to enhance the performance for one particular customer. TPC-C also relies heavily on the operating system, which can be changed, provided those changes become part of the production version.
3. Source modifications are allowed. Several supercomputer benchmark suites allow modification of the source code. For example, the NAS benchmarks specify the input and output and supply the source, but vendors are allowed to rewrite the source, including changing the algorithms, as long as the result is the same. EEMBC also allows source-level changes to its benchmarks and reports these as “optimized” measurements, versus “out-of-the-box” measurements that allow no changes.
4. Hand-coding is allowed. EEMBC allows assembly language coding of its benchmarks. The small size of its kernels makes this approach attractive, although in practice with larger embedded applications it is unlikely to be used, except for small loops. Figure 1.31 on page 65 shows the significant benefits from hand-coding on several different processors.

The key issue that benchmark designers face in deciding to allow modification of the source is whether such modifications will reflect real practice and provide useful insight to users, or whether such modifications simply reduce the accuracy of the benchmarks as predictors of real performance.

Comparing and Summarizing Performance

Comparing performance of computers is rarely a dull event, especially when the designers are involved. Charges and countercharges fly across the Internet; one is accused of underhanded tactics and the other of misleading statements. Since careers sometimes depend on the results of such performance comparisons, it is understandable that the truth is occasionally stretched. But more frequently discrepancies can be explained by differing assumptions or lack of information.
We would like to think that if we could just agree on the programs, the experimental environments, and the definition of faster, then misunderstandings would be avoided, leaving the networks free for scholarly discourse. Unfortunately, that’s not the reality. Once we agree on the basics, battles are then fought over what is the fair way to summarize relative performance of a collection of programs. For example, two articles on summarizing performance in the same journal took opposing points of view. Figure 1.15, taken from one of the articles, is an example of the confusion that can arise. Using our definition of faster than, the following statements hold: A is 10 times faster than B for program P1. B is 10 times faster than A for program P2. A is 20 times faster than C for program P1.
| | Computer A | Computer B | Computer C |
|---|---|---|---|
| Program P1 (secs) | 1 | 10 | 20 |
| Program P2 (secs) | 1000 | 100 | 20 |
| Total time (secs) | 1001 | 110 | 40 |
FIGURE 1.15 Execution times of two programs on three machines. Data from Figure I of Smith [1988].
C is 50 times faster than A for program P2. B is 2 times faster than C for program P1. C is 5 times faster than B for program P2.

Taken individually, any one of these statements may be of use. Collectively, however, they present a confusing picture: the relative performance of computers A, B, and C is unclear.

Total Execution Time: A Consistent Summary Measure
The simplest approach to summarizing relative performance is to use total execution time of the two programs. Thus B is 9.1 times faster than A for programs P1 and P2. C is 25 times faster than A for programs P1 and P2. C is 2.75 times faster than B for programs P1 and P2. This summary tracks execution time, our final measure of performance. If the workload consisted of running programs P1 and P2 an equal number of times, the statements above would predict the relative execution times for the workload on each machine. An average of the execution times that tracks total execution time is the arithmetic mean:

$$\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$$
where $\text{Time}_i$ is the execution time for the ith program of a total of n in the workload.

Weighted Execution Time

The question arises: What is the proper mixture of programs for the workload? Are programs P1 and P2 in fact run equally in the workload as assumed by the arithmetic mean? If not, then there are two approaches that have been tried for summarizing performance. The first approach when given an unequal mix of programs in the workload is to assign a weighting factor $w_i$ to each program to indicate the relative frequency of the program in that workload. If, for example, 20% of the tasks in the workload were program P1 and 80% of the tasks in the workload were program P2, then the weighting factors would be 0.2 and 0.8. (Weighting factors add up to 1.) By summing the products of weighting factors and execution times, a clear picture of performance of the workload is obtained. This is called the weighted arithmetic mean:

$$\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$
where $\text{Weight}_i$ is the frequency of the ith program in the workload and $\text{Time}_i$ is the execution time of that program. Figure 1.16 shows the data from Figure 1.15 with three different weightings, each proportional to the execution time of a workload with a given mix.
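The total-time comparison and the two arithmetic means can be checked with a short script. The sketch below uses the execution times from Figure 1.15 and the 20%/80% mix mentioned in the text; the function and variable names are hypothetical.

```python
# Execution times in seconds, taken from Figure 1.15.
times = {
    "A": {"P1": 1.0, "P2": 1000.0},
    "B": {"P1": 10.0, "P2": 100.0},
    "C": {"P1": 20.0, "P2": 20.0},
}


def total_time(machine):
    """Total execution time of all programs on one machine."""
    return sum(times[machine].values())


def arithmetic_mean(machine):
    """Unweighted mean: total time divided by the number of programs."""
    return total_time(machine) / len(times[machine])


def weighted_mean(machine, weights):
    """Sum of weight * time over all programs; weights should sum to 1."""
    return sum(weights[prog] * t for prog, t in times[machine].items())


def times_faster(fast, slow):
    """How many times faster `fast` is than `slow` on total time."""
    return total_time(slow) / total_time(fast)


print(round(times_faster("B", "A"), 1))   # 9.1
print(round(times_faster("C", "A"), 1))   # 25.0
print(round(times_faster("C", "B"), 2))   # 2.75

mix = {"P1": 0.2, "P2": 0.8}  # the 20% / 80% example from the text
for machine in times:
    print(machine, arithmetic_mean(machine), weighted_mean(machine, mix))
```

With the 20%/80% mix, machine A's mean is dominated by its 1000-second run of P2, while machine C's mean stays at 20 seconds regardless of the weights because both of its programs take the same time.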
Programs A
Weightings
B
C
W(1)
W(2)
W(3)
Program P1 (secs)
1.00
10.00
20.00
0.50
0.909
0.999
Program P2 (secs)
1000.00
100.00
20.00
0.50
0.091
0.001
Arithmetic mean:W(1)
500.50
55.00
20.00
Arithmetic mean:W(2)
91.91
18.19
20.00
Arithmetic mean:W(3)
2.00
10.09
20.00
FIGURE 1.16 Weighted arithmetic mean execution times for three machines (A, B, C) and two programs (P1 and P2) using three weightings (W1, W2, W3). The top table contains the original execution time measurements and the weighting factors, while the bottom table shows the resulting weighted arithmetic means for each weighting. W(1) equally weights the programs, resulting in a mean (row 3) that is the same as the unweighted arithmetic mean. W(2) makes the mix of programs inversely proportional to the execution times on machine B; row 4 shows the arithmetic mean for that weighting. W(3) weights the programs in inverse proportion to the execution times of the two programs on machine A; the arithmetic mean with this weighting is given in the last row. The net effect of the second and third weightings is to “normalize” the weightings to the execution times of programs running on that machine, so that the running time will be spent evenly between each program for that machine. For a set of $n$ programs each taking $\text{Time}_i$ on one machine, the equal-time weightings on that machine are

$$w_i = \frac{1}{\text{Time}_i \times \sum_{j=1}^{n} \dfrac{1}{\text{Time}_j}}$$
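The weighted arithmetic mean and the equal-time weighting formula can be checked numerically. The following is a minimal sketch, not part of the original text; the function names are illustrative, and the execution times are the ones shown in Figure 1.16. Because the figure rounds its weights to three decimal places, the means computed here differ slightly from the printed ones (for example, about 91.8 rather than 91.91 for machine A under W(2)).

```python
def weighted_arithmetic_mean(times, weights):
    """Return sum(weight_i * time_i); the weights are expected to sum to 1."""
    return sum(w * t for w, t in zip(weights, times))


def equal_time_weights(times):
    """Weights that make each program account for equal time on one machine.

    Implements w_i = 1 / (time_i * sum_j(1 / time_j)).
    """
    inverse_sum = sum(1.0 / t for t in times)
    return [1.0 / (t * inverse_sum) for t in times]


# Execution times in seconds for programs P1 and P2 (data from Figure 1.16).
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

w1 = [0.5, 0.5]                       # W(1): equal weights
w2 = equal_time_weights(times["B"])   # W(2): equal time on machine B
w3 = equal_time_weights(times["A"])   # W(3): equal time on machine A

for name, weights in [("W(1)", w1), ("W(2)", w2), ("W(3)", w3)]:
    means = {m: weighted_arithmetic_mean(t, weights) for m, t in times.items()}
    print(name, [round(w, 3) for w in weights],
          {m: round(v, 2) for m, v in means.items()})
```

Deriving the weights from the measured times, rather than typing them in, makes it explicit which machine the weighting has been normalized to.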
Normalized Execution Time and the Pros and Cons of Geometric Means
A second approach to unequal mixture of programs in the workload is to normalize execution times to a reference machine and then take the average of the normalized execution times. This is the approach used by the SPEC benchmarks,
where a base time on a SPARCstation is used for reference. This measurement gives a warm fuzzy feeling, because it suggests that performance of new programs can be predicted by simply multiplying this number times its performance on the reference machine. Average normalized execution time can be expressed as either an arithmetic or geometric mean. The formula for the geometric mean is

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
where $\text{Execution time ratio}_i$ is the execution time, normalized to the reference machine, for the $i$th program of a total of $n$ in the workload. Geometric means also have a nice property for two samples $X_i$ and $Y_i$:

$$\frac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\frac{X_i}{Y_i}\right)$$
As a result, taking either the ratio of the means or the mean of the ratios yields the same result. In contrast to arithmetic means, geometric means of normalized execution times are consistent no matter which machine is the reference. Hence, the arithmetic mean should not be used to average normalized execution times. Figure 1.17 shows some variations using both arithmetic and geometric means of normalized times. Because the weightings in weighted arithmetic means are set proportionate to execution times on a given machine, as in Figure 1.16, they are influenced not only by frequency of use in the workload, but also by the peculiarities of a particular machine and the size of program input. The geometric mean of normalized execution times, on the other hand, is independent of the running times of the individual programs, and it doesn’t matter which machine is used to normalize. If a situation arose in comparative performance evaluation where the programs were fixed but the inputs were not, then competitors could rig the results of weighted arithmetic means by making their best performing benchmark have the largest input and therefore dominate execution time. In such a situation the geometric mean would be less misleading than the arithmetic mean.
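| | Normalized to A: A | Normalized to A: B | Normalized to A: C | Normalized to B: A | Normalized to B: B | Normalized to B: C | Normalized to C: A | Normalized to C: B | Normalized to C: C |
|---|---|---|---|---|---|---|---|---|---|
| Program P1 | 1.0 | 10.0 | 20.0 | 0.1 | 1.0 | 2.0 | 0.05 | 0.5 | 1.0 |
| Program P2 | 1.0 | 0.1 | 0.02 | 10.0 | 1.0 | 0.2 | 50.0 | 5.0 | 1.0 |
| Arithmetic mean | 1.0 | 5.05 | 10.01 | 5.05 | 1.0 | 1.1 | 25.03 | 2.75 | 1.0 |
| Geometric mean | 1.0 | 1.0 | 0.63 | 1.0 | 1.0 | 0.63 | 1.58 | 1.58 | 1.0 |
| Total time | 1.0 | 0.11 | 0.04 | 9.1 | 1.0 | 0.36 | 25.03 | 2.75 | 1.0 |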
FIGURE 1.17 Execution times from Figure 1.15 normalized to each machine. The arithmetic mean performance varies depending on which is the reference machine—in column 2, B’s execution time is five times longer than A’s, although the reverse is true in column 4. In column 3, C is slowest, but in column 9, C is fastest. The geometric means are consistent independent of normalization—A and B have the same performance, and the execution time of C is 0.63 of A or B (1/1.58 is 0.63). Unfortunately, the total execution time of A is 10 times longer than that of B, and B in turn is about 3 times longer than C. As a point of interest, the relationship between the means of the same set of numbers is always harmonic mean ≤ geometric mean ≤ arithmetic mean.
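The contrast described in the caption can be reproduced with a few lines of code. This is an illustrative sketch only: the helper names are made up for this example, and the data are the execution times from Figure 1.15. It normalizes to each machine in turn and shows that the arithmetic means change their ranking with the reference machine while the geometric-mean ratio between two machines does not.

```python
import math

# Execution times in seconds for programs P1 and P2 (data from Figure 1.15).
TIMES = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}


def normalize(times, reference):
    """Divide every machine's times by the reference machine's times."""
    ref = times[reference]
    return {m: [t / r for t, r in zip(ts, ref)] for m, ts in times.items()}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def summarize(times, reference):
    """Return (arithmetic means, geometric means) of the normalized times."""
    norm = normalize(times, reference)
    am = {m: arithmetic_mean(v) for m, v in norm.items()}
    gm = {m: geometric_mean(v) for m, v in norm.items()}
    return am, gm


for ref in TIMES:
    am, gm = summarize(TIMES, ref)
    print(f"normalized to {ref}:")
    print("  arithmetic:", {m: round(v, 2) for m, v in am.items()})
    print("  geometric: ", {m: round(v, 2) for m, v in gm.items()})
    print("  C relative to A (geometric):", round(gm["C"] / gm["A"], 2))
```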
The strong drawback to geometric means of normalized execution times is that they violate our fundamental principle of performance measurement: they do not predict execution time. The geometric means from Figure 1.17 suggest that for programs P1 and P2 the performance of machines A and B is the same, yet this would only be true for a workload that ran program P1 100 times for every occurrence of program P2 (see Figure 1.16 on page 37). The total execution time for such a workload (1100 seconds on A or B versus 2020 seconds on C) suggests that machines A and B are about 80% faster than machine C, in contrast to the geometric mean, which says machine C is faster than A and B! In general there is no workload for three or more machines that will match the performance predicted by the geometric means of normalized execution times. Our original reason for examining geometric means of normalized performance was to avoid giving equal emphasis to the programs in our workload, but is this solution an improvement?
An additional drawback of using geometric mean as a method for summarizing performance for a benchmark suite (as SPEC CPU2000 does) is that it encourages hardware and software designers to focus their attention on the benchmarks where performance is easiest to improve rather than on the benchmarks that are slowest. For example, if some hardware or software improvement can cut the running time for a benchmark from 2 seconds to 1, the geometric mean will reward those designers with the same overall mark that it would give to designers that improve the running time on another benchmark in the suite from 10,000 seconds to 5000 seconds. Of course, everyone interested in running the second program thinks of the second batch of designers as their heroes and the first group as useless. Small programs are often easier to “crack,” obtaining a large but unrepresentative performance improvement, and the use of geometric mean rewards such behavior more than a measure that reflects total running time.
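The second drawback is easy to see numerically. In the sketch below, the first two benchmark times are the ones used in the text (2 seconds and 10,000 seconds); the third benchmark and all variable names are hypothetical, added only to make a small suite. Since dividing every time by a fixed reference time scales both geometric means by the same factor, the comparison is made on raw times.

```python
import math


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


# Hypothetical three-benchmark suite, times in seconds.
baseline = [2.0, 10_000.0, 50.0]
small_fix = [1.0, 10_000.0, 50.0]   # first benchmark: 2 s -> 1 s
large_fix = [2.0, 5_000.0, 50.0]    # second benchmark: 10,000 s -> 5,000 s

gain_small = geometric_mean(baseline) / geometric_mean(small_fix)
gain_large = geometric_mean(baseline) / geometric_mean(large_fix)
saved_small = sum(baseline) - sum(small_fix)
saved_large = sum(baseline) - sum(large_fix)

print(f"small fix: geometric-mean gain {gain_small:.3f}, time saved {saved_small:.0f} s")
print(f"large fix: geometric-mean gain {gain_large:.3f}, time saved {saved_large:.0f} s")
```

Both changes earn the same geometric-mean gain, although one saves a single second of total running time and the other saves 5000 seconds.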
The ideal solution is to measure a real workload and weight the programs according to their frequency of execution. If this can’t be done, then normalizing so that equal time is spent on each program on some machine at least makes the relative weightings explicit and will predict execution time of a workload with that mix. The problem above of unspecified inputs is best solved by specifying the inputs when comparing performance. If results must be normalized to a specific machine, first summarize performance with the proper weighted measure and then do the normalizing. Lastly, we must remember that any summary measure necessarily loses information, especially when the measurements may vary widely. Thus, it is important to ensure that both the results of individual benchmarks and the summary number are available. Furthermore, the summary number should be used with caution, since a subset of the individual scores, rather than the summary, may be the best indicator of performance for a customer’s applications.
1.6
Quantitative Principles of Computer Design
Now that we have seen how to define, measure, and summarize performance, we can explore some of the guidelines and principles that are useful in design and analysis of computers. In particular, this section introduces some important observations about designing for performance and cost/performance, as well as two equations that we can use to evaluate design alternatives.
Make the Common Case Fast
Perhaps the most important and pervasive principle of computer design is to make the common case fast: In making a design trade-off, favor the frequent case over the infrequent case. This principle also applies when determining how to spend resources, since the impact of making some occurrence faster is higher if the occurrence is frequent. Improving the frequent event, rather than the rare event, will obviously help performance, too. In addition, the frequent case is often simpler and can be done faster than the infrequent case. For example, when adding two numbers in the CPU, we can expect overflow to be a rare circumstance and can therefore improve performance by optimizing the more common case of no overflow. This may slow down the case when overflow occurs, but if that is rare, then overall performance will be improved by optimizing for the normal case. We will see many cases of this principle throughout this text. In applying this simple principle, we have to decide what the frequent case is and how much performance can be improved by making that case faster. A fundamental law, called Amdahl’s Law, can be used to quantify this principle.
Amdahl’s Law
The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Amdahl’s Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a machine that will improve performance when it is used. Speedup is the ratio

$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$

Alternatively,

$$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$
Speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original machine. Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:
1. The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call $\text{Fraction}_{\text{enhanced}}$, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode: If the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2. We will call this value, which is always greater than 1, $\text{Speedup}_{\text{enhanced}}$.
The execution time using the original machine with the enhanced mode will be the time spent using the unenhanced portion of the machine plus the time spent using the enhancement:

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

The overall speedup is the ratio of the execution times:

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
EXAMPLE
Suppose that we are considering an enhancement to the processor of a server system used for web serving. The new CPU is 10 times faster on computation in the web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
ANSWER
$$\text{Fraction}_{\text{enhanced}} = 0.4 \qquad \text{Speedup}_{\text{enhanced}} = 10$$

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$
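The calculation above translates directly into a small function. This is a minimal sketch rather than anything from the original text; the function and parameter names are chosen for this illustration and mirror the symbols used in the formula.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the original time is sped up.

    fraction_enhanced: share of the ORIGINAL execution time that can use
        the enhancement (between 0 and 1).
    speedup_enhanced: how much faster that share runs (greater than 0).
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Web-serving example: computation is 40% of the time and becomes 10x faster.
web_server = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10.0)
print(f"overall speedup: {web_server:.4f}")
```

Note that the first argument is measured on the original machine, before the enhancement is applied.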
Amdahl’s Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahl’s Law is that if an enhancement is only usable for a fraction of a task, we can’t speed up the task by more than the reciprocal of 1 minus that fraction.
A common mistake in applying Amdahl’s Law is to confuse “fraction of time converted to use an enhancement” and “fraction of time after enhancement is in use.” If, instead of measuring the time that we could use the enhancement in a computation, we measure the time after the enhancement is in use, the results will be incorrect! (Try Exercise 1.2 to see how wrong.)
Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance. The goal, clearly, is to spend resources proportional to where time is spent. Amdahl’s Law is particularly useful for comparing the overall system performance of two alternatives, but it can also be applied to compare two CPU design alternatives, as the following Example shows.
EXAMPLE
A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
ANSWER
We can compare these two alternatives by comparing the speedups:

$$\text{Speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} = 1.22$$

$$\text{Speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \frac{1}{0.8125} = 1.23$$

Improving the performance of the FP operations overall is slightly better because of the higher frequency.
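The comparison, together with the corollary that speedup is bounded by the reciprocal of 1 minus the enhanced fraction, can be sketched as follows. The function names are illustrative and not from the original text; the fractions and speedups are those given in the Example.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def speedup_upper_bound(fraction_enhanced):
    """Limit of the overall speedup as the enhanced part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


alternatives = {
    "FPSQR 10x faster": (0.2, 10.0),   # 20% of time, sped up by 10
    "all FP 1.6x faster": (0.5, 1.6),  # 50% of time, sped up by 1.6
}

for name, (fraction, factor) in alternatives.items():
    print(f"{name}: speedup {amdahl_speedup(fraction, factor):.3f}, "
          f"upper bound {speedup_upper_bound(fraction):.2f}")

# Diminishing returns: making FPSQR ever faster approaches, but never
# exceeds, the bound of 1 / (1 - 0.2).
for factor in (10.0, 100.0, 1000.0):
    print(f"FPSQR {factor:g}x: {amdahl_speedup(0.2, factor):.4f}")
```

The bound shows why the FP alternative has more headroom: even an infinitely fast square root could not push the overall speedup past 1.25, whereas the bound for the FP fraction is 2.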
In the above example, we needed to know the time consumed by the new and improved FP operations; often it is difficult to measure these times directly. In the next section, we will see another way of doing such comparisons based on the use of an equation that decomposes the CPU execution time into three separate components. If we know how an alternative affects these three components, we can determine its overall performance effect. Furthermore, it is often possible to build simulators that measure these components before the hardware is actually designed.
The CPU Performance Equation
Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

or

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$
In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed: the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count we can calculate the average number of clock cycles per instruction (CPI). Because it is easier to work with and because we will deal with simple processors in this chapter, we use CPI. Designers sometimes also use Instructions per Clock or IPC, which is the inverse of CPI. CPI is computed as:

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$
This CPU figure of merit provides insight into different styles of instruction sets and implementations, and we will use it extensively in the next four chapters. By transposing instruction count in the above formula, clock cycles can be defined as $\text{IC} \times \text{CPI}$. This allows us to use CPI in the execution time formula:

$$\text{CPU time} = \text{Instruction count} \times \text{Clock cycle time} \times \text{Cycles per instruction}$$

or

$$\text{CPU time} = \frac{\text{Instruction count} \times \text{Cycles per instruction}}{\text{Clock rate}}$$

Expanding the first formula into the units of measurement and inverting the clock rate shows how the pieces fit together:

$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}$$
As this formula demonstrates, CPU performance is dependent upon three characteristics: clock cycle (or rate), clock cycles per instruction, and instruction count. Furthermore, CPU time is equally dependent on these three characteristics: A 10% improvement in any one of them leads to a 10% improvement in CPU time. Unfortunately, it is difficult to change one parameter in complete isolation from others because the basic technologies involved in changing each characteristic are interdependent:
- Clock cycle time: Hardware technology and organization
- CPI: Organization and instruction set architecture
- Instruction count: Instruction set architecture and compiler technology
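The equal dependence of CPU time on its three factors can be illustrated with a short sketch. The function name and the baseline numbers (one billion instructions, a CPI of 2.0, a 1 GHz clock) are hypothetical values chosen for this illustration, not measurements from the text.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical baseline: 1e9 instructions, CPI of 2.0, 1 GHz clock.
ic, cpi, rate = 1e9, 2.0, 1e9
base = cpu_time(ic, cpi, rate)

# Reduce each factor by 10% in isolation. A 10% shorter clock cycle time
# corresponds to dividing the clock rate by 0.9.
fewer_instructions = cpu_time(0.9 * ic, cpi, rate)
lower_cpi = cpu_time(ic, 0.9 * cpi, rate)
shorter_cycle = cpu_time(ic, cpi, rate / 0.9)

print(f"baseline: {base:.2f} s")
for label, t in [("10% fewer instructions", fewer_instructions),
                 ("10% lower CPI", lower_cpi),
                 ("10% shorter clock cycle", shorter_cycle)]:
    print(f"{label}: {t:.2f} s ({100 * (1 - t / base):.0f}% less CPU time)")
```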
Luckily, many potential performance improvement techniques primarily improve one component of CPU performance with small or predictable impacts on the other two. Sometimes it is useful in designing the CPU to calculate the number of total CPU clock cycles as

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

where $\text{IC}_i$ represents the number of times instruction $i$ is executed in a program and $\text{CPI}_i$ represents the average number of clock cycles per instruction for instruction $i$. This form can be used to express CPU time as
$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$

and overall CPI as:

$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction count}} \times \text{CPI}_i$$
The latter form of the CPI calculation uses each individual $\text{CPI}_i$ and the fraction of occurrences of that instruction in a program (i.e., $\text{IC}_i \div \text{Instruction count}$). $\text{CPI}_i$ should be measured and not just calculated from a table in the back of a reference manual since it must include pipeline effects, cache misses, and any other memory system inefficiencies. Consider our earlier example, here modified to use measurements of the frequency of the instructions and of the instruction CPI values, which, in practice, are obtained by simulation or by hardware instrumentation.
EXAMPLE
Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.
ANSWER
First, observe that only the CPI changes; the clock rate and instruction count remain identical. We start by finding the original CPI with neither enhancement:

$$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$$

We can compute the CPI for the enhanced FPSQR by subtracting the cycles saved from the original CPI:

$$\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{of new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64$$

We can compute the CPI for the enhancement of all FP instructions the same way or by summing the FP and non-FP CPIs. Using the latter gives us

$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) = 1.625$$

Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better. Specifically, the speedup for the overall FP enhancement is

$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.00}{1.625} = 1.23$$

Happily, this is the same speedup we obtained using Amdahl’s Law on page 42.
It is often possible to measure the constituent parts of the CPU performance equation. This is a key advantage for using the CPU performance equation versus Amdahl’s Law in the above example. In particular, it may be difficult to measure things such as the fraction of execution time for which a set of instructions is responsible. In practice this would probably be computed by summing the product of the instruction count and the CPI for each of the instructions in the set. Since the starting point is often individual instruction count and CPI measurements, the CPU performance equation is incredibly useful.
Measuring and Modeling the Components of the CPU Performance Equation
To use the CPU performance equation as a design tool, we need to be able to measure the various factors. For an existing processor, it is easy to obtain the execution time by measurement, and the clock speed is known. The challenge lies in discovering the instruction count or the CPI. Most newer processors include counters for both instructions executed and for clock cycles. By periodically monitoring these counters, it is also possible to attach execution time and instruction count to segments of the code, which can be helpful to programmers trying to understand and tune the performance of an application. Often, a designer or programmer will want to understand performance at a more fine-grained level than what is available from the hardware counters. For example, they may want to know why the CPI is what it is. In such cases, simulation techniques like those used for processors that are being designed are used.
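Once instruction frequencies and per-class CPI values have been measured, they plug directly into the CPU performance equation. As a sketch of that workflow, the following recomputes the FPSQR comparison from the Example earlier in this section using the measurements stated there; the function and variable names are illustrative only.

```python
def overall_cpi(mix):
    """Overall CPI as the sum of (instruction frequency x class CPI)."""
    return sum(frequency * cpi for frequency, cpi in mix)


# Measurements from the Example: 25% FP operations at CPI 4.0,
# 75% other instructions at CPI 1.33.
cpi_original = overall_cpi([(0.25, 4.0), (0.75, 1.33)])

# Alternative 1: FPSQR (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20 - 2)

# Alternative 2: the average CPI of all FP operations drops to 2.5.
cpi_new_fp = overall_cpi([(0.25, 2.5), (0.75, 1.33)])

# Instruction count and clock cycle time are unchanged, so they cancel and
# the speedup is just the ratio of CPIs.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"original CPI:  {cpi_original:.3f}")
print(f"new FPSQR CPI: {cpi_new_fpsqr:.3f}  speedup {speedup_fpsqr:.2f}")
print(f"new FP CPI:    {cpi_new_fp:.3f}  speedup {speedup_fp:.2f}")
```

The unrounded values differ in the third decimal place from the figures in the Example, which rounds the original CPI to 2.0.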
There are three general classes of simulation techniques that are used. In general, the more sophisticated techniques yield more accuracy, particularly for more recent architectures, at the cost of longer execution time.
The first and simplest technique, and hence the least costly, is profile-based, static modeling. In this technique a dynamic execution profile of the program, which indicates how often each instruction is executed, is obtained by one of three methods:
1. By using hardware counters on the processor, which are periodically saved. This technique often gives an approximate profile, but one that is within a few percent of exact.
2. By using instrumented execution, in which instrumentation code is compiled into the program. This code is used to increment counters, yielding an exact profile. (This technique can also be used to create a trace of the memory addresses that are accessed, which is useful for other simulation techniques.)
3. By interpreting the program at the instruction set level, compiling instruction counts in the process.
Once the profile is obtained, it is used to analyze the program in a static fashion by looking at the code. Obviously with the profile, the total instruction count is easy to obtain. It is also easy to get a detailed dynamic instruction mix telling what types of instructions were executed with what frequency. Finally, for simple processors, it is possible to compute an approximation to the CPI. This approximation is computed by modeling and analyzing the execution of each basic block (or straightline code segment) and then computing an overall estimate of CPI or total compute cycles by multiplying the estimate for each basic block by the number of times it is executed. Although this simple model ignores memory behavior and has severe limits for modeling complex pipelines, it is a reasonable and very fast technique for modeling the performance of short integer pipelines.
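The basic-block calculation described above can be sketched in a few lines. Everything in this example is hypothetical: the block names, the instruction counts, the per-block cycle estimates, and the execution counts stand in for what a real profile and a static pipeline analysis would supply.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    name: str
    instructions: int       # instructions in the block
    estimated_cycles: int   # static cycle estimate for one execution
    executions: int         # execution count taken from the profile


# Hypothetical profile of a small program.
profile = [
    BasicBlock("setup", instructions=20, estimated_cycles=25, executions=1),
    BasicBlock("loop_body", instructions=8, estimated_cycles=10, executions=1_000_000),
    BasicBlock("loop_exit", instructions=3, estimated_cycles=4, executions=1_000),
]

# Multiply each block's estimate by the number of times it executed.
total_instructions = sum(b.instructions * b.executions for b in profile)
total_cycles = sum(b.estimated_cycles * b.executions for b in profile)
estimated_cpi = total_cycles / total_instructions

print(f"instruction count: {total_instructions}")
print(f"estimated cycles:  {total_cycles}")
print(f"estimated CPI:     {estimated_cpi:.3f}")
```

As the text notes, an estimate of this kind leaves out memory system behavior entirely.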
Trace-driven simulation is a more sophisticated technique for modeling performance and is particularly useful for modeling memory system performance. In trace-driven simulation, a trace of the memory references executed is created, usually either by simulation or by instrumented execution. The trace includes what instructions were executed (given by the instruction address), as well as the data addresses accessed. Trace-driven simulation can be used in several different ways. The most common use is to model memory system performance, which can be done by simulating the memory system, including the caches and any memory management hardware using the address trace. A trace-driven simulation of the memory system can be combined with a static analysis of pipeline performance to obtain a reasonably accurate performance model for simple pipelined processors. For more complex pipelines, the trace data can be used to perform a more detailed analysis of the pipeline performance by simulation of the processor pipeline.
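A very small trace-driven memory simulation might look like the following. This is a sketch under stated assumptions, not a description of any particular tool: it models a direct-mapped cache, and the cache sizes, the 32-byte block size, and the synthetic trace (two sequential sweeps over a 4 KB array of 4-byte words) are all invented for illustration.

```python
def simulate_direct_mapped(trace, num_lines, block_bytes):
    """Replay an address trace through a direct-mapped cache.

    Returns (hits, misses). Each cache line remembers the tag of the
    memory block it currently holds.
    """
    tags = [None] * num_lines
    hits = misses = 0
    for address in trace:
        block = address // block_bytes   # which memory block is touched
        index = block % num_lines        # the only line it may occupy
        tag = block // num_lines         # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag            # replace the line's contents
    return hits, misses


# Synthetic trace: sweep a 4 KB array of 4-byte words twice.
trace = [4 * i for i in range(1024)] * 2

small = simulate_direct_mapped(trace, num_lines=64, block_bytes=32)   # 2 KB cache
large = simulate_direct_mapped(trace, num_lines=128, block_bytes=32)  # 4 KB cache

for label, (hits, misses) in [("2 KB cache", small), ("4 KB cache", large)]:
    print(f"{label}: {hits} hits, {misses} misses, "
          f"miss rate {misses / (hits + misses):.3f}")
```

The same trace can be replayed against as many cache configurations as needed, which is the main attraction of separating trace collection from simulation.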
Since the trace data allows a simulation of the exact ordering of instructions, higher accuracy can be achieved than with a static approach. Trace-driven simulation typically isolates the simulation of any pipeline behavior from the memory system. In particular, it assumes that the trace is completely independent of the memory system behavior. As we will see in Chapters 3 and 5, this is not the case for the most advanced processors, so a third technique is needed.
The third technique, which is the most accurate and most costly, is execution-driven simulation. In execution-driven simulation, detailed simulations of the memory system and the processor pipeline are done simultaneously. This allows the exact modeling of the interaction between the two, which is critical as we will see in Chapters 3 and 5.
There are many variations on these three basic techniques. We will see examples of these tools in later chapters and use various versions of them in the exercises.
Locality of Reference
Although Amdahl’s Law is a theorem that applies to any system, other important fundamental observations come from properties of programs. The most important program property that we regularly exploit is locality of reference: Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. Locality of reference also applies to data accesses, though not as strongly as to code accesses. Two different types of locality have been observed. Temporal locality states that recently accessed items are likely to be accessed in the near future. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. We will see these principles applied in Chapter 5.
Take Advantage of Parallelism
Taking advantage of parallelism is one of the most important methods for improving performance. We give three brief examples, which are expounded on in later chapters. Our first example is the use of parallelism at the system level. To improve the throughput performance on a typical server benchmark, such as SPECWeb or TPC, multiple processors and multiple disks can be used. The workload of handling requests can then be spread among the CPUs or disks, resulting in improved throughput. This is the reason that scalability is viewed as a valuable asset for server applications.
At the level of an individual processor, taking advantage of parallelism among instructions is critical to achieving high performance. One of the simplest ways to do this is through pipelining. The basic idea behind pipelining, which is explained in more detail in Appendix A and is a major focus of Chapter 3, is to overlap the execution of instructions, so as to reduce the total time to complete a sequence of instructions. Viewed from the perspective of the CPU performance equation, we can think of pipelining as reducing the CPI by allowing instructions that take multiple cycles to overlap. A key insight that allows pipelining to work is that not every instruction depends on its immediate predecessor, and thus, executing the instructions completely or partially in parallel may be possible.
Parallelism can also be exploited at the level of detailed digital design. For example, set associative caches use multiple banks of memory that are typically searched in parallel to find a desired item. Modern ALUs use carry-lookahead, which uses parallelism to speed the process of computing sums from linear in the number of bits in the operands to logarithmic. There are many different ways designers take advantage of parallelism. One common class of techniques is parallel computation of two or more possible outcomes, followed by late selection. This technique is used in carry select adders, in set associative caches, and in handling branches in pipelines. Virtually every chapter in this book will have an example of how performance is enhanced through the exploitation of parallelism.
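The effect of pipelining on CPI can be illustrated with an idealized model. The sketch below assumes a hypothetical pipeline in which every instruction passes through the same number of one-cycle stages, a new instruction starts every cycle, and there are no stalls of any kind; real pipelines fall short of this. The function names and the five-stage depth are assumptions made for the example.

```python
def unpipelined_cycles(n_instructions, stages):
    """Each instruction finishes all of its stages before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages):
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


stages = 5  # hypothetical pipeline depth
for n in (1, 10, 1000):
    serial = unpipelined_cycles(n, stages)
    overlapped = pipelined_cycles(n, stages)
    print(f"{n} instructions: {serial} cycles unpipelined, "
          f"{overlapped} cycles pipelined, "
          f"effective CPI {overlapped / n:.3f}")
```

For long instruction sequences the effective CPI of this ideal model approaches 1, which is the sense in which overlapping multi-cycle instructions reduces CPI.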
1.7
Putting It All Together: Performance and Price-Performance
In the Putting It All Together sections that appear near the end of every chapter, we show real examples that use the principles in that chapter. In this section we look at measures of performance and price-performance first in desktop systems using the SPEC CPU benchmarks, then at servers using TPC-C as the benchmark, and finally at the embedded market using EEMBC as the benchmark.
Performance and Price-Performance for Desktop Systems
Although there are many benchmark suites for desktop systems, a majority of them are OS or architecture specific. In this section we examine the CPU performance and price-performance of a variety of desktop systems using the SPEC CPU2000 integer and floating point suites. As mentioned earlier, SPEC CPU2000 summarizes CPU performance using a geometric mean normalized to a Sun system, with larger numbers indicating higher performance. Each system was configured with one CPU, 512 MB of SDRAM (with ECC if available), approximately 20 GB of disk, a fast graphics system, and a 10/100 Mb Ethernet connection. The seven systems we examined and their processors and price are shown in Figure 1.18. The wide variation in price is driven by a number of factors, including system expandability, the use of cheaper disks (ATA versus SCSI), less expensive memory (PC memory versus custom DIMMs), software differences (Linux or a Microsoft OS versus a vendor specific OS), the cost of the CPU, and the commoditization effect, which we discussed on page 14. (See the further discussion on price variation in the caption of Figure 1.18.)

| Vendor | Model | Processor | Clock Rate (MHz) | Price |
|---|---|---|---|---|
| Compaq | Presario 7000 | AMD Athlon | 1,400 | $2,091 |
| Dell | Precision 420 | Intel Pentium III | 1,000 | $3,834 |
| Dell | Precision 530 | Intel Pentium 4 | 1,700 | $4,175 |
| HP | Workstation c3600 | PA 8600 | 552 | $12,631 |
| IBM | RS6000 44P/170 | IBM III-2 | 450 | $13,889 |
| Sun | Sunblade 100 | UltraSPARC II-e | 500 | $2,950 |
| Sun | Sunblade 1000 | UltraSPARC III | 750 | $9,950 |
FIGURE 1.18 Seven different desktop systems from five vendors using seven different microprocessors showing the processor, its clock rate, and the selling price. All these systems are configured with 512 MB of ECC SDRAM, a high-end graphics system (which is not the highest performance system available for the more expensive platforms), and approximately 20 GB of disk. Many factors are responsible for the wide variation in price despite these common elements. First, the systems offer different levels of expandability (with the Presario system being the least expandable, the Dell systems and Sunblade 100 being moderately expandable, and the HP, IBM, and Sunblade 1000 being very flexible and expandable). Second, the use of cheaper disks (ATA versus SCSI) and less expensive memory (PC memory versus custom DIMMs) has a significant impact. Third, the cost of the CPU varies by at least a factor of two. In 2001 the Athlon sells for about $200, the Pentium III for about $240, and the Pentium 4 for about $500. Fourth, software differences (Linux or a Microsoft OS versus a vendor specific OS) probably affect the final price. Fifth, the lower end systems use PC commodity parts in other areas (fans, power supply, support chip sets), which lower costs. Finally, the commoditization effect, which we discussed on page 14, is at work especially for the Compaq and Dell systems. These prices are as of July 2001.
Figure 1.19 shows the performance and the price-performance of these seven systems using SPEC CINT2000 as the metric. The Compaq system using the AMD Athlon CPU offers both the highest performance and the best price-performance, followed by the two Dell systems, which have comparable price-performance, although the Pentium 4 system is faster. The Sunblade 100 has the lowest performance, but somewhat better price-performance than the other UNIX-based workstation systems. Figure 1.20 shows the performance and price-performance for the SPEC floating point benchmarks. The floating point instruction set enhancements in the Pentium 4 give it a clear performance advantage, although the Compaq Athlon-based system still has superior price-performance. The IBM, HP, and Sunblade 1000 all outperform the Dell 420 with a Pentium III, but the Dell system still offers better price-performance than the IBM, Sun, or HP workstations.
Performance and Price-Performance for Transaction Processing Servers
One of the largest server markets is online transaction processing (OLTP), which we described earlier. The standard industry benchmark for OLTP is TPC-C, which relies on a database system to perform queries and updates. Five factors
[Figure 1.19: bar chart. Axes: SPECbase CINT2000 (performance) and SPEC CINT2000 per $1,000 in price (performance/cost). Systems shown: Compaq Presario 7000, Dell Precision 530, Dell Precision 420, HP Workstation c3600, Sun Sunblade 1000/1750, IBM RS6000 44P/170, Sun Sunblade 100.]
FIGURE 1.19 Performance and price-performance for seven systems are measured using SPEC CINT2000 as the benchmark. With the exception of the Sunblade 100 (Sun’s low-end entry system), price-performance roughly parallels performance, contradicting the conventional wisdom (at least on the desktop) that higher performance systems carry a disproportionate price premium. Price-performance is plotted as CINT2000 performance per $1,000 in system cost. These performance numbers and prices are current as of July 2001. The measurements are available online at http://www.spec.org/osg/cpu2000/.
make the performance of TPC-C particularly interesting. First, TPC-C is a reasonable approximation to a real OLTP application; although this makes benchmark set-up complex and time consuming, it also makes the results reasonably indicative of real performance for OLTP. Second, TPC-C measures total system performance, including the hardware, the operating system, the I/O system, and the database system, making the benchmark more predictive of real performance. Third, the rules for running the benchmark and reporting execution time are very complete, resulting in more comparable numbers. Fourth, because of the importance of the benchmark, computer system vendors devote significant effort to making TPC-C run well. Fifth, vendors are required to report both performance and price-performance, enabling us to examine both. Because the OLTP market is large and quite varied, there is an incredible range of computing systems used for these applications, ranging from small single processor servers to midrange multiprocessor systems to large-scale clusters
[Figure 1.20: bar chart. Axes: SPECbase CFP2000 (performance) and SPEC CFP2000 per $1,000 in price (performance/cost). Systems shown: Dell Precision 530, Compaq Presario 7000, HP Workstation c3600, Sun Sunblade 1000/1750, IBM RS6000 44P/170, Dell Precision 420, Sun Sunblade 100.]
FIGURE 1.20 Performance and price-performance for seven systems are measured using SPEC CFP2000 as the benchmark. Price-performance is plotted as CFP2000 performance per $1,000 in system cost. The dramatically improved floating point performance of the Pentium 4 versus the Pentium III is clear in this figure. Price-performance partially parallels performance but not as clearly as in the case of the integer benchmarks. These performance numbers and prices are current as of July 2001. The measurements are available online as http://www.spec.org/osg/cpu2000/.
consisting of tens to hundreds of processors. To allow an appreciation for this diversity and its range of performance and price-performance, we will examine six of the top results by performance (and the comparative price-performance) and six of the top results by price-performance (and the comparative performance). For TPC-C, performance is measured in transactions per minute (TPM), while price-performance is measured in TPM per dollar. Figure 1.21 shows the characteristics of a dozen systems whose performance or price-performance is near the top in one measure or the other. Figure 1.22 charts the performance and price-performance of six of the highest performing OLTP systems described in Figure 1.21. The IBM cluster system, consisting of 280 Pentium III processors, provides the highest overall performance, beating any other system by almost a factor of three, as well as the best price-performance by just over a factor of 1.5. The other systems are all moderate-scale multiprocessors and offer fairly comparable performance and similar
Vendor & System | CPUs | Database | OS | Price
IBM exSeries 370 c/s | 280 x Pentium III @ 900 MHz | Microsoft SQL Server 2000 | Microsoft Windows Adv. Server | $15,543,346
Compaq Alpha server GS 320 | 32 x Alpha 21264 @ 1 GHz | Oracle 9i | Compaq Tru64 UNIX | $10,286,029
Fujitsu PRIMEPOWER 20000 | 48 x SPARC64 GP @ 563 MHz | SymfoWARE Server Enterpr. | Sun Solaris 8 | $9,671,742
IBM eServer 680 7017S85 | 24 x IBM RS64-IV 600 MHz | Oracle 8 8.1.7.1 | IBM AIX 4.3.3 | $7,546,837
HP 9000 Enterprise Server | 48 x HP PA-RISC 8600 552 MHz | Oracle8 v8.1.7.1 | HP UX 11.i 64-bit | $8,522,104
IBM eServer 400 8402420 | 24 x iSeries400 Model 840 | IBM DB2 for AS/400 V4 | IBM OS/400 V4 | $8,448,137
Dell PowerEdge 6400 | 3 x Pentium III 700 MHz | Microsoft SQL Server 2000 | Microsoft Windows 2000 | $131,275
IBM eserver xSeries 250 c/s | 4 x Pentium III 700 MHz | Microsoft SQL Server 2000 | Microsoft Windows Adv. Server | $297,277
Compaq Proliant ML570 6/700 2 | 4 x Intel Pentium III @ 700 MHz | Microsoft SQL Server 2000 | Microsoft Windows Adv. Server | $375,016
HP NetServer LH 6000 | 6 x Pentium III @ 550 MHz | Microsoft SQL Server 2000 | Microsoft Windows NT Enterprise | $372,805
NEC Express 5800/180 | 8 x Pentium III 900 MHz | Microsoft SQL Server 2000 | Microsoft Windows Adv. Server | $682,724
HP 9000 / L2000 | 4 x PA-RISC 8500 440 MHz | Sybase Adaptive Server | HP UX 11.0 64-bit | $368,367
FIGURE 1.21 The characteristics of a dozen OLTP systems with either high total performance (top half of the table) or superior price-performance (bottom half of the table). The IBM exSeries with 280 Pentium IIIs is a cluster, while all the other systems are tightly coupled multiprocessors. Surprisingly, none of the top performing systems by either measure are uniprocessors! The system descriptions and detailed benchmark reports are available at: http://www.tpc.org/.
price-performance to the others in the group. Chapters 7 and 8 discuss the design of cluster and multiprocessor systems. Figure 1.23 charts the performance and price-performance of the six OLTP systems from Figure 1.21 with the best price-performance. These systems are all multiprocessor systems, and, with the exception of the HP system, are based on Pentium III processors. Although the smallest system (the 3-processor Dell system) has the best price-performance, several of the other systems offer better performance at about a factor of 0.65 of the price-performance. Notice that the systems with the best price-performance in Figure 1.23 average almost four times better in price-performance (TPM/$ = 99 versus 27) than the high performance systems in Figure 1.22.
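The price-performance metric used in Figures 1.22 and 1.23 is simple to compute. The sketch below is illustrative only: the function name is ours, the throughput and price passed to it are hypothetical round numbers, and the two averages are the ones quoted in the text, which we assume are expressed in TPM per $1,000 as plotted in the figures.

```python
def tpm_per_thousand_dollars(tpm, system_price):
    """Price-performance as plotted in the figures: TPM per $1,000 of system cost."""
    return tpm / (system_price / 1000.0)


# Averages quoted in the text (assumed to be in TPM per $1,000).
avg_best_price_performance = 99.0   # systems in Figure 1.23
avg_highest_performance = 27.0      # systems in Figure 1.22

advantage = avg_best_price_performance / avg_highest_performance
print(f"Price-performance advantage: {advantage:.2f}x")

# Hypothetical system: 27,000 TPM for $1,000,000.
print(tpm_per_thousand_dollars(tpm=27_000, system_price=1_000_000))
```

The ratio works out to roughly 3.7, which is the "almost four times" advantage cited above.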
[Figure 1.22: bar chart. Axes: Transactions per Minute (thousands) and Transactions per Minute per $1,000. Series: Performance (Transactions per Minute) and Price-performance (TPM per $1,000). Systems shown: IBM exSeries 370 c/s, Compaq Alphaserver GS 320, Fujitsu PRIMEPOWER 20000, IBM eServer 680 7017S85, HP 9000 Enterprise Server, IBM eServer 400 8402420.]
FIGURE 1.22 The performance (measured in thousands of transactions per minute) and the price-performance (measured in transactions per minute per $1,000) are shown for six of the highest performing systems using TPC-C as the benchmark. Interestingly, IBM occupies three of these six positions, with different hardware platforms (a cluster of Pentium IIIs, a Power III-based multiprocessor, and an AS/400-based multiprocessor).
Performance and Price-Performance for Embedded Processors
Comparing performance and price-performance of embedded processors is more difficult than for the desktop or server environments because of several characteristics. First, benchmarking is in its comparative infancy in the embedded space. Although the EEMBC benchmarks represent a substantial advance in benchmark availability and benchmark practice, as we discussed earlier, these benchmarks have significant drawbacks. Equally important, in the embedded space, processors are often designed for a particular class of applications; such designs are often not measured outside of their application space, and when they are, they may not perform well. Finally, as mentioned earlier, cost and power are often the most important factors for an embedded application. Although we can partially measure cost by looking at the cost of the processor, other aspects of the design can be critical in determining system cost. For example, whether or not the memory controller and I/O control are integrated into the chip affects both power and cost of the system. As we said earlier, power is often the critical constraint in embedded systems, and we focus on the relationship between performance and power in the next section.
[Figure 1.23: bar chart. Axes: Transactions per Minute (thousands) and Transactions per Minute per $1,000. Series: Price-Performance (TPM per $1,000) and Performance (Transactions per Minute). Systems shown: Dell PowerEdge 6400, IBM eserver xSeries 250 c/s, Compaq Proliant ML570 6/700 2, HP NetServer LH 6000, NEC Express 5800/180, HP 9000 / L2000.]
FIGURE 1.23 Price-performance (plotted as transactions per minute per $1,000 of system cost) and overall performance (plotted as thousands of transactions per minute).
Figure 1.24 shows the characteristics of the five processors whose price and price-performance we examine. These processors span a wide range of cost, power, and performance and thus are used in very different applications. The high-end processors, such as the IBM PowerPC 750CX and AMD K6-2E+, are used in applications such as network switches and possibly high-end laptops. The NEC VR 5432 series is a newer version of the VR 5400 series, which is one of the most heavily used processors in color laser printers. In contrast, the NEC VR 4122 is a low-end, low-power device used primarily in PDAs; in addition to the core computing
functions, the 4122 provides a number of system functions, reducing the cost of the overall system.
Processor | Instr. Set | Processor Clock Rate (MHz) | Cache Instr./Data; On-chip Secondary cache | Processor organization | Typical power (in mW) | Price ($)
AMD Elan SC520 | x86 | 133 | 16K/16K | Pipelined: single issue | 1600 | $38
AMD K6-2E+ | x86 | 500 | 32K/32K; 128K | Pipelined: 3+ issues/clock | 9600 | $78
IBM PowerPC 750CX | PowerPC | 500 | 32K/32K; 128K | Pipelined: 4 issues/clock | 6000 | $94
NEC VR 5432 | MIPS-64 | 167 | 32K/32K | Pipelined: 2 issues/clock | 2088 | $25
NEC VR 4122 | MIPS-64 | 180 | 32K/16K | Pipelined: single issue | 700 | $33
FIGURE 1.24 Five different embedded processors spanning a range of performance (more than a factor of ten, as we will see) and a wide range in price (roughly a factor of four and probably 50% higher than that if total system cost is considered). The price does not include interface and support chips, which could significantly increase the deployed system cost. Likewise, the power indicated includes only the processor’s typical power consumption (in milliwatts). These processors also differ widely in terms of execution capability, from a maximum of four instructions per clock to one! All the processors except the NEC VR 4122 include a hardware floating point unit.
Figure 1.25 shows the relative performance of these five processors on three of the five EEMBC benchmark suites. The summary number for each benchmark suite is proportional to the geometric mean of the individual performance measures for each benchmark in the suite (measured as iterations per second). The clock rate differences explain between 33% and 75% of the performance differences. For machines with similar organization (such as the AMD Elan SC520 and the NEC VR 4122), the clock rate is the primary factor in determining performance. For machines with widely differing cache structures (such as the presence or absence of a secondary cache) or different pipelines, clock rate explains less of the performance difference. Figure 1.26 shows the price-performance of these processors, where price is measured only by the processor cost. Here, the wide range in price narrows the performance differences, making the slower processors more cost effective. If our cost analysis also included the system support chips, the differences would narrow even further, probably boosting the VR 5432 to the top in price-performance and making the VR 4122 at least competitive with the high-end IBM and AMD chips.
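As a rough illustration of how the clock rates and prices in Figure 1.24 enter these comparisons, the sketch below computes each processor's clock ratio against the AMD Elan SC520 and converts a relative performance score into relative price-performance. The clock rates and prices come from Figure 1.24; the relative performance value in the example is hypothetical, since the measured scores appear only in the charts, and the function names are ours.

```python
# (clock rate in MHz, processor price in dollars), from Figure 1.24.
processors = {
    "AMD Elan SC520": (133, 38),
    "AMD K6-2E+": (500, 78),
    "IBM PowerPC 750CX": (500, 94),
    "NEC VR 5432": (167, 25),
    "NEC VR 4122": (180, 33),
}
BASELINE = "AMD Elan SC520"


def clock_ratio(name):
    """Clock rate of a processor relative to the baseline processor."""
    return processors[name][0] / processors[BASELINE][0]


def relative_price_performance(name, relative_performance):
    """Relative performance divided by the price relative to the baseline."""
    price_ratio = processors[name][1] / processors[BASELINE][1]
    return relative_performance / price_ratio


for name in processors:
    print(f"{name}: clock ratio {clock_ratio(name):.2f}")

# Hypothetical score: suppose the K6-2E+ ran a suite 8.0x faster than the Elan.
print(f"{relative_price_performance('AMD K6-2E+', 8.0):.2f}")
```

Because the K6-2E+ costs about twice as much as the Elan, a hypothetical eightfold performance lead shrinks to roughly a fourfold lead in price-performance, which is the narrowing effect described above.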
[Figure 1.25: bar chart. Vertical axis: Performance Relative to AMD Elan SC520. Benchmark suites: Automotive, Office, Telecomm. Processors shown: AMD Elan SC520, AMD K6-2E+, IBM PowerPC 750CX, NEC VR 5432, NEC VR 4122.]
FIGURE 1.25 Relative performance for three of the five EEMBC benchmark suites on five different embedded processors. The performance is scaled relative to the AMD Elan SC520, so that the scores across the suites have a narrower range.
[Figure 1.26: bar chart. Vertical axis: Relative Performance / Price. Benchmark suites: Automotive, Office, Telecomm. Processors shown: AMD Elan SC520, AMD K6-2E+, IBM PowerPC 750CX, NEC VR 5432, NEC VR 4122.]
FIGURE 1.26 Relative price-performance for three of the five EEMBC benchmark suites on five different embedded processors, using only the price of the processor.
1.8
Another View: Power Consumption and Efficiency as the Metric
Throughout the chapters of this book, you will find sections entitled Another View. These sections emphasize the way in which different segments of the computing market may solve a problem. For example, if the Putting It All Together section emphasizes the memory system for a desktop microprocessor, the Another View section may emphasize the memory system of an embedded application or a server. In this first Another View section, we look at the issue of power consumption in embedded processors. As mentioned several times in this chapter, cost and power are often at least as important as performance in the embedded market. In addition to the cost of the processor module (which includes any required interface chips), memory is often the next most costly part of an embedded system. Recall that, unlike a desktop or server system, most embedded systems do not have secondary storage; instead, the entire application must reside in either FLASH or DRAM (as described in Chapter 5). Because many embedded systems, such as PDAs and cell phones, are constrained by both cost and physical size, the amount of memory needed for the application is critical. Likewise, power is often a determining factor in choosing a processor, especially for battery-powered systems. As we saw in Figure 1.24 on page 56, the power for the five embedded processors we examined varies by more than a factor of 10. Clearly, the high performance AMD K6, with a typical power consumption of 9.3 W, cannot be used in environments where power or heat dissipation are critical. Figure 1.27 shows the relative performance per watt of typical operating power. Compare this figure to Figure 1.25 on page 57, which plots raw performance, and notice how different the results are. The NEC VR4122 has a clear advantage in performance per watt, but is the second lowest performing processor!
From the viewpoint of power consumption the NEC VR4122, which was designed for battery-based systems, is the big winner. The IBM PowerPC displays efficient use of power to achieve its high performance, although at 6 watts typical, it is probably not suitable for most battery-based devices.
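The performance-per-watt view can be reproduced from the typical power column of Figure 1.24. In the minimal sketch below, the power figures are taken from that table, while the relative performance value in the example is a hypothetical placeholder (the measured values are only plotted in Figure 1.25); the names are ours.

```python
# Typical processor power in milliwatts, from Figure 1.24.
typical_power_mw = {
    "AMD Elan SC520": 1600,
    "AMD K6-2E+": 9600,
    "IBM PowerPC 750CX": 6000,
    "NEC VR 5432": 2088,
    "NEC VR 4122": 700,
}

# Spread between the most and least power-hungry processors.
power_range = max(typical_power_mw.values()) / min(typical_power_mw.values())


def relative_perf_per_watt(name, relative_performance, baseline="AMD Elan SC520"):
    """Relative performance divided by power relative to the baseline processor."""
    power_ratio = typical_power_mw[name] / typical_power_mw[baseline]
    return relative_performance / power_ratio


print(f"Power range: {power_range:.1f}x")
# Hypothetical score: a VR 4122 that is 1.5x as fast as the Elan SC520.
print(f"{relative_perf_per_watt('NEC VR 4122', 1.5):.2f}")
```

A processor with modest raw performance can therefore lead on this metric simply because its power draw is so much lower than the baseline's.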
[Figure 1.27: bar chart. Vertical axis: Relative performance per Watt. Benchmark suites: Automotive, Office, Telecomm. Processors shown: AMD Elan SC520, AMD K6-2E+, IBM PowerPC 750CX, NEC VR 5432, NEC VR 4122.]
FIGURE 1.27 Relative performance per watt for the five embedded processors. The power is measured as typical operating power for the processor, and does not include any interface chips.
1.9
Fallacies and Pitfalls
The purpose of this section, which will be found in every chapter, is to explain some commonly held misbeliefs or misconceptions that you should avoid. We call such misbeliefs fallacies. When discussing a fallacy, we try to give a counterexample. We also discuss pitfalls, which are easily made mistakes. Often pitfalls are generalizations of principles that are true in a limited context. The purpose of these sections is to help you avoid making these errors in machines that you design.
Fallacy: The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite.
As processors have become faster and more sophisticated, processor performance in one application area can diverge from that in another area. Sometimes the instruction set architecture is responsible for this, but increasingly the pipeline structure and memory system are responsible. This also means that clock rate is
not a good metric, even if the instruction sets are identical. Figure 1.28 shows the performance of a 1.7 GHz Pentium 4 relative to a 1 GHz Pentium III. The figure also shows the performance of a hypothetical 1.7 GHz Pentium III assuming linear scaling of performance based on the clock rate. In all cases except the SPEC floating point suite, the Pentium 4 delivers less performance per MHz than the Pentium III. As mentioned earlier, instruction set enhancements (the SSE2 extensions), which significantly boost floating point execution rates, are probably responsible for the better performance of the Pentium 4 for these floating point benchmarks.
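The per-MHz comparison above amounts to dividing the measured speedup by the clock ratio. The following sketch makes that explicit; the clock rates are those given in the text, while the speedup values passed in are hypothetical, since the measured ones are available only as bars in Figure 1.28.

```python
P3_CLOCK_GHZ = 1.0  # Pentium III clock rate from the text
P4_CLOCK_GHZ = 1.7  # Pentium 4 clock rate from the text


def per_clock_ratio(relative_performance):
    """Pentium 4 performance per MHz divided by Pentium III performance per MHz.

    relative_performance is the P4's performance relative to the P3's.
    A result below 1.0 means the P4 does less work per clock cycle.
    """
    return relative_performance / (P4_CLOCK_GHZ / P3_CLOCK_GHZ)


# Hypothetical speedups, for illustration only.
for speedup in (1.36, 1.70, 1.87):
    print(f"speedup {speedup:.2f} -> per-clock ratio {per_clock_ratio(speedup):.2f}")
```

A speedup equal to the clock ratio of 1.7 corresponds to identical per-clock performance; anything lower means the higher clock rate is partly offset by less work per cycle.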
[Figure 1.28: bar chart. Vertical axis: Relative performance. Benchmark sets: SPECbase CINT2000, SPECbase CFP2000, Multimedia, Game benchmark, Web benchmark.]
FIGURE 1.28 A comparison of the performance of the Pentium 4 (P4) relative to the Pentium III (P3) on five different sets of benchmark suites. The bars show the relative performance of a 1.7 GHz P4 versus a 1 GHz P3. The triple vertical line at 1.7 shows how much faster a Pentium 4 at 1.7 GHz would be than a 1 GHz Pentium III assuming performance scaled linearly with clock rate. Of course, this line represents an idealized approximation to how fast a P3 would run. The first two sets of bars are the SPEC integer and floating point suites. The third set of bars represents three multimedia benchmarks. The fourth set represents a pair of benchmarks based on the game Quake, and the final benchmark is the composite Webmark score, a PC-based web benchmark.
Performance within a single processor implementation family (such as Pentium III) usually scales slower than clock speed because of the increased relative cost of stalls in the memory system. Across generations (such as the Pentium 4 and Pentium III) enhancements to the basic implementation usually yield a performance that is somewhat better than what would be derived from just clock rate scaling. As Figure 1.28 shows, the Pentium 4 is usually slower than the Pentium III when performance is adjusted by linearly scaling the clock rate. This may partly derive from the focus on high clock rate as a primary design goal. In Chapter 3 we discuss further both the differences between the Pentium III and Pentium 4 and why performance does not scale as fast as the clock rate does.
Fallacy: Benchmarks remain valid indefinitely.
Several factors influence the usefulness of a benchmark as a predictor of real performance and some of these may change over time. A big factor influencing the usefulness of a benchmark is the ability of the benchmark to resist “cracking,” also known as benchmark engineering or “benchmarksmanship.” Once a benchmark becomes standardized and popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark. Small kernels or programs that spend their time in a very small number of lines of code are particularly vulnerable. For example, despite the best intentions, the initial SPEC89 benchmark suite included a small kernel, called matrix300, which consisted of eight different 300 × 300 matrix multiplications. In this kernel, 99% of the execution time was in a single line (see SPEC [1989]). Optimization of this inner loop by the compiler (using an idea called blocking, discussed in Chapter 5) for the IBM Powerstation 550 resulted in performance improvement by a factor of more than 9 over an earlier version of the compiler!
This benchmark tested compiler performance and was not, of course, a good indication of overall performance, nor of this particular optimization. Even after the elimination of this benchmark, vendors found methods to tune the performance of individual benchmarks by the use of different compilers or preprocessors, as well as benchmark-specific flags. Although the baseline performance measurements require the use of one set of flags for all benchmarks, the tuned or optimized performance does not. In fact, benchmark-specific flags are allowed, even if they are illegal in general and could lead to incorrect compilation! Allowing benchmark and even input-specific flags has led to long lists of options, as Figure 1.29 shows. This list of options, which is not significantly different from the option lists used by other vendors, is used to obtain the peak performance for the Compaq AlphaServer DS20E Model 6/667. The list makes it clear why the baseline measurements were needed. The performance difference between the baseline and tuned numbers can be substantial. For the SPEC CFP2000 benchmarks on the AlphaServer DS20E Model 6/667, the overall performance (which by SPEC CPU2000 rules is summarized by geometric mean) is
1.12 times higher for the peak numbers. As compiler technology improves, a system tends to achieve closer to peak performance using the base flags. Similarly, as the benchmarks improve in quality, they become less susceptible to highly application specific optimizations. Thus, the gap between peak and base, which in early times was often 20%, has narrowed.
Peak: -v -g3 -arch ev6 -non_shared ONESTEP plus:
168.wupwise: f77 -fast -O4 -pipeline -unroll 2
171.swim: f90 -fast -O5 -transform_loops
172.mgrid: kf77 -O5 -transform_loops -tune ev6 -unroll 8
173.applu: f77 -fast -O5 -transform_loops -unroll 14
177.mesa: cc -fast -O4
178.galgel: kf90 -O4 -unroll 2 -ldxml RM_SOURCES = lapak.f90
179.art: kcc -fast -O4 -ckapargs='-arl=4 -ur=4' -unroll 10
183.equake: kcc -fast -ckapargs='-arl=4' -xtaso_short
187.facerec: f90 -fast -O4
188.ammp: cc -fast -O4 -xtaso_short
189.lucas: kf90 -fast -O5 -fkapargs='-ur=1' -unroll 1
191.fma3d: kf90 -O4
200.sixtrack: f90 -fast -O5 -transform_loops
301.apsi: kf90 -O5 -transform_loops -unroll 8 -fkapargs='-ur=1'
FIGURE 1.29 The tuning parameters for the SPEC CFP2000 report on an AlphaServer DS20E Model 6/667. This is the portion of the SPEC report for the tuned performance corresponding to that in Figure 1.14 on page 34. These parameters describe the compiler options (four different compilers are used). Each line shows the options used for one of the SPEC CFP2000 benchmarks. Data from: http://www.spec.org/osg/cpu2000/results/res1999q4/cpu2000-1999113000012.html.
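Because SPEC CPU2000 summarizes a suite with a geometric mean, the overall peak-to-base factor (1.12 in the example above) is the geometric mean of the per-benchmark peak-to-base ratios. The sketch below illustrates this property with made-up scores; the numbers and variable names are hypothetical and are not taken from the AlphaServer report.

```python
import math


def geometric_mean(values):
    """Geometric mean computed via logarithms to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark scores (higher is better); not real SPEC data.
base_scores = [400.0, 500.0, 650.0]
peak_scores = [440.0, 600.0, 690.0]

ratio_of_means = geometric_mean(peak_scores) / geometric_mean(base_scores)
mean_of_ratios = geometric_mean([p / b for p, b in zip(peak_scores, base_scores)])

print(f"ratio of geometric means: {ratio_of_means:.4f}")
print(f"geometric mean of ratios: {mean_of_ratios:.4f}")
```

The two quantities agree, so a single summary factor is well defined no matter which way it is computed; this would not hold for an arithmetic mean.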
Ongoing improvements in technology can also change what a benchmark measures. Consider the benchmark gcc, considered one of the most realistic and challenging of the SPEC92 benchmarks. Its performance is a combination of CPU time and real system time. Since the input remains fixed and real system time is limited by factors, including disk access time, that improve slowly, an increasing amount of the runtime is system time rather than CPU time. This may be appropriate. On the other hand, it may be appropriate to change the input over time, reflecting the desire to compile larger programs. In fact, the SPEC92 input was changed to include four copies of each input file used in SPEC89; although this increases runtime, it may or may not reflect the way compilers are actually being used. Over a long period of time, these changes may make even a well-chosen benchmark obsolete. For example, more than half the benchmarks added to the 1992 and 1995 SPEC CPU benchmark release were dropped from the next generation of the suite! To show how dramatically benchmarks must adapt over time, we summarize the status of the integer and FP benchmarks from SPEC 89, 92, and 95 in Figure 1.30.
Pitfall: Comparing hand-coded assembly and compiler-generated high-level language performance.
In most applications of computers, hand-coding is simply not tenable. A combination of the high cost of software development and maintenance together with time-to-market pressures has made it impossible for many applications to consider assembly language. In parts of the embedded market, however, several factors have continued to encourage limited use of hand coding, at least of key loops. The most important factors favoring this tendency are the importance of a few small loops to overall performance (particularly real-time performance) in some embedded applications, and the inclusion of instructions that can significantly boost performance of certain types of computations, but that compilers cannot effectively use. When performance is measured either by kernels or by applications that spend most of their time in a small number of loops, hand coding of the critical parts of the benchmark can lead to large performance gains. In such instances, the performance difference between the hand-coded and machine-generated versions of a benchmark can be very large, as shown for three machines in Figure 1.31. Both designers and users must be aware of this potentially large difference
Benchmark name
Integer or FP
SPEC 89
SPEC 92
SPEC 95
SPEC 2000
gcc
integer
adopted
espresso
integer
adopted
modified
modified
modified
modified
dropped
li
integer
eqntott
integer
adopted
modified
modified
adopted
dropped
spice doduc
FP
adopted
modified
FP
adopted
nasa7
FP
adopted
dropped
fpppp
FP
adopted
modified
dropped
matrix300
FP
adopted
tomcatv
FP
adopted
modified
dropped
compress
integer
adopted
modified
dropped
sc
integer
adopted
dropped
mdljdp2
FP
adopted
dropped
wave5
FP
adopted
modified
ora
FP
adopted
dropped
mdljsp2
FP
adopted
dropped
alvinn
FP
adopted
dropped
ear
FP
adopted
dropped
dropped
dropped dropped
dropped
dropped
swm256 (aka swim)
FP
adopted
modified
modified
su2cor
FP
adopted
modified
dropped
FP
adopted
hydro2d
modified
dropped
integer
adopted
dropped
m88ksim
integer
adopted
dropped
ijpeg
integer
adopted
dropped
perl
integer
adopted
modified
vortex
integer
adopted
modified
mgrid
FP
adopted
modified
applu
FP
adopted
dropped
apsi
FP
adopted
modified
adopted
dropped
go
turb3d
FIGURE 1.30 The evolution of the SPEC benchmarks over time showing when benchmarks were adopted, modified, and dropped. All the programs in the 89, 92, and 95 releases are shown. Modified indicates that either the input or the size of the benchmark was changed, usually to increase its running time and avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.
and not extrapolate performance for compiler-generated code from hand-coded benchmarks.
Machine | EEMBC benchmark set | Performance, compiler generated | Performance, hand coded | Ratio hand/compiler
Trimedia 1300 @ 166 MHz | Consumer | 23.3 | 110.0 | 4.7
BOPS Manta @ 136 MHz | Telecomm | 2.6 | 225.8 | 44.6
TI TMS320C6203 @ 300 MHz | Telecomm | 6.8 | 68.5 | 10.1
FIGURE 1.31 The performance of three embedded processors on C and hand-coded versions of portions of the EEMBC benchmark suite. In the case of the BOPS and TI processors, the vendors also provide versions that are compiled but where the C is altered initially to improve performance and code generation; such versions can achieve most of the benefit from hand optimization, at least for these machines and these benchmarks.
Fallacy: Peak performance tracks observed performance.
The only universally true definition of peak performance is “the performance level a machine is guaranteed not to exceed.” The gap between peak performance and observed performance is typically a factor of 10 or more in supercomputers. (See Appendix B on vectors for an explanation.) Since the gap is so large and can vary significantly by benchmark, peak performance is not useful in predicting observed performance unless the workload consists of small programs that normally operate close to the peak. As an example of this fallacy, a small code segment using long vectors ran on the Hitachi S810/20 in 1.3 seconds and on the Cray X-MP in 2.6 seconds. Although this suggests the S810 is two times faster than the X-MP, the X-MP runs a program with more typical vector lengths two times faster than the S810. These data are shown in Figure 1.32.
Measurement | Cray X-MP | Hitachi S810/20 | Performance
A(i)=B(i)*C(i)+D(i)*E(i) (vector length 1000 done 100,000 times) | 2.6 secs | 1.3 secs | Hitachi 2 times faster
Vectorized FFT (vector lengths 64,32,…,2) | 3.9 secs | 7.7 secs | Cray 2 times faster
FIGURE 1.32 Measurements of peak performance and actual performance for the Hitachi S810/20 and the Cray X-MP. Note that the gap between peak and observed performance is large and can vary across benchmarks. Data from pages 18–20 of Lubeck, Moore, and Mendez [1985]. Also see Fallacies and Pitfalls in Appendix B.
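The "2 times faster" entries in Figure 1.32 follow directly from the measured run times. The short sketch below recomputes them; the run times are taken from the figure, while the dictionary keys and function name are our own labels.

```python
# Run times in seconds, from Figure 1.32.
run_times_secs = {
    "long vectors": {"Cray X-MP": 2.6, "Hitachi S810/20": 1.3},
    "vectorized FFT": {"Cray X-MP": 3.9, "Hitachi S810/20": 7.7},
}


def faster_machine(measurement):
    """Return (machine, factor) for the machine with the shorter run time."""
    runs = run_times_secs[measurement]
    winner = min(runs, key=runs.get)
    loser = max(runs, key=runs.get)
    return winner, runs[loser] / runs[winner]


for measurement in run_times_secs:
    machine, factor = faster_machine(measurement)
    print(f"{measurement}: {machine} is {factor:.2f} times faster")
```

The winner flips between the two measurements with a factor of about two in each direction, which is why a single peak-oriented kernel says little about observed performance.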
Fallacy: The best design for a computer is the one that optimizes the primary objective without considering implementation.
Although in a perfect world where implementation complexity and implementation time could be ignored, this might be true, design complexity is an important factor. Complex designs take longer to complete, prolonging time to market. Given the rapidly improving performance of computers, longer design time means that a design will be less competitive. The architect must be constantly aware of the impact of his design choices on the design time for both hardware and software. The many postponements of the availability of the Itanium processor (roughly a two year delay from the initial target date) should serve as a topical reminder of the risks of introducing both a new architecture and a complex design. With processor performance increasing by just over 50% per year, each week of delay translates to roughly a 1% loss in relative performance!
Pitfall: Neglecting the cost of software in either evaluating a system or examining cost-performance.
For many years, hardware was so expensive that it clearly dominated the cost of software, but this is no longer true. Software costs in 2001 can be a large fraction of both the purchase and operational costs of a system. For example, for a medium size database OLTP server, Microsoft OS software might run about $2,000, while the Oracle software would run between $6,000 and $9,000 for a four-year, one-processor license. Assuming a four-year software lifetime means a total software cost for these two major components of between $8,000 and $11,000. A midrange Dell server with 512 MB of memory, a Pentium III at 1 GHz, and between 20 and 100 GB of disk would cost roughly the same amount as these two major software components, meaning that software costs are roughly 50% of the total system cost! Alternatively, consider a professional desktop system, which can be purchased with a 1 GHz Pentium III, 128 MB DRAM, 20 GB disk, and a 19 inch monitor for just under $1,000. The software costs of a Windows OS and Office 2000 are about $300 if bundled with the system and about double that if purchased separately, so the software costs are somewhere between 23% and 38% of the total cost!
Pitfall: Falling prey to Amdahl’s Law.
Virtually every practicing computer architect knows Amdahl’s Law. Despite this, we almost all occasionally fall into the trap of expending tremendous effort optimizing some aspect of a system before we measure its usage. Only when the overall speedup is unrewarding do we recall that we should have measured the usage of that feature before we spent so much effort enhancing it!
Fallacy: Synthetic benchmarks predict performance for real programs.
This fallacy appeared in the first edition of this book, published in 1990. With the arrival and dominance of organizations such as SPEC and TPC, we thought perhaps the computer industry had learned a lesson and reformed its faulty practices, but the emerging embedded market has embraced Dhrystone as its most quoted benchmark! Hence, this fallacy survives.
The best known examples of synthetic benchmarks are Whetstone and Dhrystone. These are not real programs and, as such, may not reflect program behavior for factors not measured. Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs. The other side of the coin is that because these benchmarks are not natural programs, they don’t reward optimizations of behaviors that occur in real programs. Here are some examples:
■ Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary. To address these problems the authors of the benchmark “require” both optimized and unoptimized code to be reported. In addition, they “forbid” the practice of inline-procedure expansion optimization, since Dhrystone’s simple procedure structure allows elimination of all procedure calls at almost no increase in code size.
■ Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs. As a result Whetstone underrewards many loop optimizations and gains little from techniques such as multiple issue (Chapter 3) and vectorization (Appendix B).
■ Compilers can optimize a key piece of the Whetstone loop by noting the relationship between square root and exponential, even though this is very unlikely to occur in real programs. For example, one key loop contains the following FORTRAN code:
X = SQRT(EXP(ALOG(X)/T1))
It could be compiled as if it were
X = EXP(ALOG(X)/(2×T1))
since
$$\mathrm{SQRT}(\mathrm{EXP}(X)) = \sqrt{e^{X}} = e^{X/2} = \mathrm{EXP}(X/2)$$
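The identity behind this rewrite can be checked numerically. The following is a minimal sketch in Python, not part of the Whetstone benchmark itself; the function names and the sample values of X and T1 are assumptions chosen only for illustration.

```python
import math


def whetstone_original(x: float, t1: float) -> float:
    """X = SQRT(EXP(ALOG(X)/T1)); ALOG is FORTRAN's natural logarithm."""
    return math.sqrt(math.exp(math.log(x) / t1))


def whetstone_rewritten(x: float, t1: float) -> float:
    """X = EXP(ALOG(X)/(2*T1)); the square root call has disappeared."""
    return math.exp(math.log(x) / (2.0 * t1))


# Arbitrary positive sample inputs (hypothetical, for illustration only).
SAMPLES = [(0.75, 0.5), (1.0, 0.5), (2.5, 0.25), (10.0, 2.0)]

for x, t1 in SAMPLES:
    a = whetstone_original(x, t1)
    b = whetstone_rewritten(x, t1)
    # The two forms agree up to floating-point rounding.
    assert math.isclose(a, b, rel_tol=1e-12)
    print(f"x={x:<5} t1={t1:<5} original={a:.12g} rewritten={b:.12g}")
```

Both forms return the same value for every sample, which is why a compiler that recognizes the pattern can drop the square root call entirely.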
It would be surprising if such optimizations were ever invoked except in this synthetic benchmark. (Yet one reviewer of this book found several compilers that performed this optimization!) This single change converts all calls to the square root function in Whetstone into multiplies by 2, surely improving performance, if Whetstone is your measure.
Fallacy: MIPS is an accurate measure for comparing performance among computers.
This fallacy also appeared in the first edition of this book, published in 1990. Your authors initially thought it could be retired, but, alas, the embedded market not only uses Dhrystone as the benchmark of choice, but reports performance as “Dhrystone MIPS”, a measure that this fallacy will show is problematic. One alternative to time as the metric is MIPS, or million instructions per second. For a given program, MIPS is simply
$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$$
Some find this rightmost form convenient since clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time,
$$\text{Execution time} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^{6}}$$
Since MIPS is a rate of operations per unit time, performance can be specified as the inverse of execution time, with faster machines having a higher MIPS rating. The good news about MIPS is that it is easy to understand, especially by a customer, and faster machines mean bigger MIPS, which matches intuition. The problem with using MIPS as a measure for comparison is threefold:
■ MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets.
■ MIPS varies between programs on the same computer.
■ Most importantly, MIPS can vary inversely to performance!
The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Since it generally takes more clock cycles per floating-point instruction than per integer instruction, floating-point programs using the optional hardware instead of software floating-point routines take less time but have a lower MIPS rating. Software floating point executes simpler instructions, resulting in a higher MIPS rating, but it executes so many more that overall execution time is longer. MIPS is sometimes used by a single vendor (e.g., IBM) within a single set of applications, where this measure is less harmful since relative differences among MIPS ratings of machines with the same architecture and the same benchmarks are reasonably likely to track relative performance differences. To try to avoid the worst difficulties of using MIPS as a performance measure, computer designers began using relative MIPS, which we discuss in detail in Section 1.11, and this is what the embedded market reports for Dhrystone. Although less harmful than an actual MIPS measurement, relative MIPS have their shortcomings (e.g., they are not really MIPS!), especially when measured using Dhrystone!
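The floating-point case can be made concrete with a small calculation. The sketch below applies the MIPS and execution-time formulas of this section to one program run two ways; the clock rate, instruction counts, and CPI values are hypothetical numbers chosen only to illustrate the effect, not measurements of any real machine.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def execution_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time in seconds = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock

# Hypothetical program using floating-point hardware:
# few instructions, but many cycles per instruction.
hw_count, hw_cpi = 10e6, 6.0

# Same program using software floating-point routines:
# simple instructions with low CPI, but ten times as many of them.
sw_count, sw_cpi = 100e6, 2.0

hw_mips, hw_time = mips(CLOCK_HZ, hw_cpi), execution_time(hw_count, hw_cpi, CLOCK_HZ)
sw_mips, sw_time = mips(CLOCK_HZ, sw_cpi), execution_time(sw_count, sw_cpi, CLOCK_HZ)

print(f"hardware FP: {hw_mips:.1f} MIPS, {hw_time:.2f} s")
print(f"software FP: {sw_mips:.1f} MIPS, {sw_time:.2f} s")
```

Under these assumptions the software version reports the higher MIPS rating yet takes more than three times as long, which is the inverse relationship the third bullet warns about.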
1.10
Concluding Remarks This chapter has introduced a number of concepts that we will expand upon as we go through this book. The major ideas in instruction set architecture and the alternatives available will be the primary subjects of Chapter 2. Not only will we see the functional alternatives, we will also examine quantitative data that enable us to understand the trade-offs. The quantitative principle, Make the common case fast, will be a guiding light in this next chapter, and the CPU performance equation will be our major tool for examining instruction set alternatives. Chapter 2 concludes an examination of how instruction sets are used by programs. In Chapter 2, we will include a section, Crosscutting Issues, that specifically addresses interactions between topics addressed in different chapters. In that section within Chapter 2, we focus on the interactions between compilers and instruction set design. This Crosscutting Issues section will appear in all future chapters. In Chapters 3 and 4 we turn our attention to instruction level parallelism (ILP), of which pipelining is the simplest and most common form. Exploiting ILP is one of the most important techniques for building high speed uniprocessors. The presence of two chapters reflects the fact that there are two rather different approaches to exploiting ILP. Chapter 3 begins with an extensive discussion of basic concepts that will prepare you not only for the wide range of ideas examined in both chapters, but also to understand and analyze new techniques that will be introduced in the coming years. Chapter 3 uses examples that span about 35 years, drawing from one of the first modern supercomputers (IBM 360/91) to the fastest processors in the market in 2001. It emphasizes what is called the dynamic or runtime approach to exploiting ILP. Chapter 4 focuses on compile-time approaches to exploiting ILP. 
These approaches were heavily used in the early 1990s and return again with the introduction of the Intel Itanium. Appendix G is a version of an introductory chapter on pipelining from the 1995, Second Edition of this text. For readers without much experience and background in pipelining, that appendix is a useful bridge between the basic topics explored in this chapter (which we expect to be review for many readers, including those of our more introductory text, Computer Organization and Design: The Hardware/Software Interface) and the advanced topics in Chapter 3. In Chapter 5 we turn to the all-important area of memory system design. We will examine a wide range of techniques that conspire to make memory look infinitely large while still being as fast as possible. As in Chapters 3 and 4, we will see that hardware-software cooperation has become a key to high-performance memory systems, just as it has to high-performance pipelines. In Chapters 6 and 7, we move away from a CPU-centric view and discuss issues in storage systems and interconnect. We apply a similar quantitative approach, but one based on observations of system behavior and using an end-to-
end approach to performance analysis. Chapter 6 addresses the important issue of how to efficiently store and retrieve data using primarily lower-cost magnetic storage technologies. As we saw earlier, such technologies offer better cost per bit by a factor of 50–100 over DRAM. Magnetic storage is likely to remain advantageous wherever cost or nonvolatility (it keeps the information after the power is turned off) are important. In Chapter 6, our focus is on examining the performance of disk storage systems for typical I/O-intensive workloads, which are the counterpart to the CPU benchmarks we saw in this chapter. We extensively explore the idea of RAID-based systems, which use many small disks, arranged in a redundant fashion to achieve both high performance and high availability. Chapter 7 discusses the primary interconnection technology used for I/O devices. This chapter explores the topic of system interconnect more broadly, including wide-area and system-area networks used to allow computers to communicate. Chapter 7 also describes clusters, which are growing in importance due to their suitability and efficiency for database and web server applications. Our final chapter returns to the issue of achieving higher performance through the use of multiple processors, or multiprocessors. Instead of using parallelism to overlap individual instructions, multiprocessing uses parallelism to allow multiple instruction streams to be executed simultaneously on different processors. Our focus is on the dominant form of multiprocessors, shared-memory multiprocessors, though we introduce other types as well and discuss the broad issues that arise in any multiprocessor. Here again, we explore a variety of techniques, focusing on the important ideas first introduced in the 1980s and 1990s. .
1.11
Historical Perspective and References
If... history... teaches us anything, it is that man in his quest for knowledge and progress, is determined and cannot be deterred.
John F. Kennedy, Address at Rice University (1962)
A section of historical perspectives closes each chapter in the text. This section provides historical background on some of the key ideas presented in the chapter. The authors may trace the development of an idea through a series of machines or describe significant projects. If you’re interested in examining the initial development of an idea or machine or interested in further reading, references are provided at the end of the section. In this historical section, we discuss the early development of digital computers and the development of performance measurement methodologies. The development of the key innovations in desktop, server, and embedded processor architectures is discussed in historical sections in virtually every chapter of the book.
The First General-Purpose Electronic Computers J. Presper Eckert and John Mauchly at the Moore School of the University of Pennsylvania built the world’s first fully-operational electronic general-purpose computer. This machine, called ENIAC (Electronic Numerical Integrator and Calculator), was funded by the U.S. Army and became operational during World War II, but it was not publicly disclosed until 1946. ENIAC was used for computing artillery firing tables. The machine was enormous—100 feet long, 8 1/2 feet high, and several feet wide. Each of the 20 10-digit registers was 2 feet long. In total, there were 18,000 vacuum tubes. Although the size was three orders of magnitude bigger than the size of the average machines built today, it was more than five orders of magnitude slower, with an add taking 200 microseconds. The ENIAC provided conditional jumps and was programmable, which clearly distinguished it from earlier calculators. Programming was done manually by plugging up cables and setting switches and required from a half-hour to a whole day. Data were provided on punched cards. The ENIAC was limited primarily by a small amount of storage and tedious programming. In 1944, John von Neumann was attracted to the ENIAC project. The group wanted to improve the way programs were entered and discussed storing programs as numbers; von Neumann helped crystallize the ideas and wrote a memo proposing a stored-program computer called EDVAC (Electronic Discrete Variable Automatic Computer). Herman Goldstine distributed the memo and put von Neumann’s name on it, much to the dismay of Eckert and Mauchly, whose names were omitted. This memo has served as the basis for the commonly used term von Neumann computer. Several early inventors in the computer field believe that this term gives too much credit to von Neumann, who conceptualized and wrote up the ideas, and too little to the engineers, Eckert and Mauchly, who worked on the machines. 
Like most historians, your authors (winners of the 2000 IEEE von Neumann Medal) believe that all three individuals played a key role in developing the stored program computer. von Neumann’s role in writing up the ideas, in generalizing them, and in thinking about the programming aspects was critical in transferring the ideas to a wider audience.
In 1946, Maurice Wilkes of Cambridge University visited the Moore School to attend the latter part of a series of lectures on developments in electronic computers. When he returned to Cambridge, Wilkes decided to embark on a project to build a stored-program computer named EDSAC, for Electronic Delay Storage Automatic Calculator. (The EDSAC used mercury delay lines for its memory; hence the phrase “delay storage” in its name.) The EDSAC became operational in 1949 and was the world’s first full-scale, operational, stored-program computer [Wilkes, Wheeler, and Gill 1951; Wilkes 1985, 1995]. (A small prototype called the Mark I, which was built at the University of Manchester and ran in 1948, might be called the first operational stored-program machine.) The EDSAC was an accumulator-based architecture. This style of instruction set architecture remained popular until the early 1970s. (Chapter 2 starts with a brief summary of the EDSAC instruction set.)
In 1947, Eckert and Mauchly applied for a patent on electronic computers. The dean of the Moore School, by demanding the patent be turned over to the university, may have helped Eckert and Mauchly conclude they should leave. Their departure crippled the EDVAC project, which did not become operational until 1952.
Goldstine left to join von Neumann at the Institute for Advanced Study at Princeton in 1946. Together with Arthur Burks, they issued a report based on the 1944 memo [Burks, Goldstine, and von Neumann 1946]. The paper led to the IAS machine built by Julian Bigelow at Princeton’s Institute for Advanced Study. It had a total of 1024 40-bit words and was roughly 10 times faster than ENIAC. The group thought about uses for the machine, published a set of reports, and encouraged visitors. These reports and visitors inspired the development of a number of new computers, including the first IBM computer, the 701, which was based on the IAS machine. The paper by Burks, Goldstine, and von Neumann was incredible for the period. Reading it today, you would never guess this landmark paper was written more than 50 years ago, as most of the architectural concepts seen in modern computers are discussed there (e.g., see the quote at the beginning of Chapter 5).
In the same time period as ENIAC, Howard Aiken was designing an electromechanical computer called the Mark-I at Harvard. The Mark-I was built by a team of engineers from IBM. Aiken followed the Mark-I with a relay machine, the Mark-II, and a pair of vacuum tube machines, the Mark-III and Mark-IV. The Mark-III and Mark-IV were being built after the first stored-program machines. Because they had separate memories for instructions and data, the machines were regarded as reactionary by the advocates of stored-program computers. The term Harvard architecture was coined to describe this type of machine.
Though clearly different from the original sense, this term is used today to apply to machines with a single main memory but with separate instruction and data caches.
The Whirlwind project [Redmond and Smith 1980] began at MIT in 1947 and was aimed at applications in real-time radar signal processing. Although it led to several inventions, its overwhelming innovation was the creation of magnetic core memory, the first reliable and inexpensive memory technology. Whirlwind had 2048 16-bit words of magnetic core. Magnetic cores served as the main memory technology for nearly 30 years.
Important Special-Purpose Machines
During the Second World War, there were major computing efforts in both Great Britain and the United States focused on special-purpose code-breaking computers. The work in Great Britain was aimed at decrypting messages encoded with the German Enigma coding machine. This work, which occurred at a location called Bletchley Park, led to two important machines. The first, an electromechanical machine, conceived of by Alan Turing, was called BOMB [see Good in
Metropolis 1980]. The second, much larger and electronic machine, conceived and designed by Newman and Flowers, was called COLOSSUS [see Randall in Metropolis 1980]. These were highly specialized cryptanalysis machines, which played a vital role in the war by providing the ability to read coded messages, especially those sent to U-boats. The work at Bletchley Park was highly classified (indeed some of it is still classified), and so its direct impact on the development of ENIAC, EDSAC, and other computers is hard to trace, but it certainly had an indirect effect in advancing the technology and gaining understanding of the issues.
Similar work on special-purpose computers for cryptanalysis went on in the United States. The most direct descendant of this effort was a company, Engineering Research Associates (ERA) [see Thomash in Metropolis 1980], which was founded after the war to attempt to commercialize the key ideas. ERA built several machines, which were sold to secret government agencies, and was eventually purchased by Sperry Rand, which had earlier purchased the Eckert Mauchly Computer Corporation.
Another early set of machines that deserves credit was a group of special-purpose machines built by Konrad Zuse in Germany in the late 1930s and early 1940s [see Bauer and Zuse in Metropolis 1980]. In addition to producing an operating machine, Zuse was the first to implement floating point, which von Neumann claimed was unnecessary! His early machines used a mechanical store that was smaller than other electromechanical solutions of the time. His last machine was electromechanical but, because of the war, never completed.
An important early contributor to the development of electronic computers was John Atanasoff, who built a small-scale electronic computer in the early 1940s [Atanasoff 1940]. His machine, designed at Iowa State University, was a special-purpose computer (called the ABC: Atanasoff Berry Computer) that was never completely operational. Mauchly briefly visited Atanasoff before he built ENIAC, and several of Atanasoff’s ideas (e.g., using binary representation) likely influenced Mauchly. The presence of the Atanasoff machine, together with delays in filing the ENIAC patents (the work was classified and patents could not be filed until after the war) and the distribution of von Neumann’s EDVAC paper, were used to break the Eckert-Mauchly patent [Larson 1973]. Though controversy still rages over Atanasoff’s role, Eckert and Mauchly are usually given credit for building the first working, general-purpose, electronic computer [Stern 1980]. Atanasoff, however, demonstrated several important innovations included in later computers. Atanasoff deserves much credit for his work, and he might fairly be given credit for the world’s first special-purpose electronic computer and for possibly influencing Eckert and Mauchly.
Commercial Developments
In December 1947, Eckert and Mauchly formed Eckert-Mauchly Computer Corporation. Their first machine, the BINAC, was built for Northrop and was shown
in August 1949. After some financial difficulties, the Eckert-Mauchly Computer Corporation was acquired by Remington-Rand, later called Sperry-Rand. Sperry-Rand merged the Eckert-Mauchly acquisition, ERA, and its tabulating business to form a dedicated computer division, called UNIVAC. UNIVAC delivered its first computer, the UNIVAC I, in June 1951. The UNIVAC I sold for $250,000 and was the first successful commercial computer: 48 systems were built! Today, this early machine, along with many other fascinating pieces of computer lore, can be seen at the Computer Museum in Mountain View, California. Other places where early computing systems can be visited include the Deutsches Museum in Munich, and the Smithsonian in Washington, D.C., as well as numerous online virtual museums.
IBM, which earlier had been in the punched card and office automation business, didn’t start building computers until 1950. The first IBM computer, the IBM 701 based on von Neumann’s IAS machine, shipped in 1952 and eventually sold 19 units [see Hurd in Metropolis 1980]. In the early 1950s, many people were pessimistic about the future of computers, believing that the market and opportunities for these “highly specialized” machines were quite limited. Nonetheless, IBM quickly became the most successful computer company. The focus on reliability and a customer- and market-driven strategy was key. Although the 701 and 702 were modest successes, IBM’s next machine, the 704/705, first delivered in 1954, greatly exceeded its initial sales forecast of 50 machines, thanks in part to the inclusion of core memory.
Several books describing the early days of computing have been written by the pioneers [Wilkes 1985, 1995; Goldstine 1972], as well as [Metropolis, Howlett, and Rota 1980], which is a collection of recollections by early pioneers. There are numerous independent histories, often built around the people involved [Slater 1987], as well as a journal, Annals of the History of Computing, devoted to the history of computing. The history of some of the computers invented after 1960 can be found in Chapter 2 (the IBM 360, the DEC VAX, the Intel 80x86, and the early RISC machines), Chapters 3 and 4 (the pipelined processors, including Stretch and the CDC 6600), and Appendix B (vector processors including the TI ASC, CDC Star, and Cray processors).
Development of Quantitative Performance Measures: Successes and Failures
In the earliest days of computing, designers set performance goals: ENIAC was to be 1000 times faster than the Harvard Mark-I, and the IBM Stretch (7030) was to be 100 times faster than the fastest machine in existence. What wasn’t clear, though, was how this performance was to be measured. In looking back over the years, it is a consistent theme that each generation of computers obsoletes the performance evaluation techniques of the prior generation.
The original measure of performance was time to perform an individual operation, such as addition. Since most instructions took the same execution time, the timing of one gave insight into the others. As the execution times of instructions in a machine became more diverse, however, the time for one operation was no longer useful for comparisons. To take these differences into account, an instruction mix was calculated by measuring the relative frequency of instructions in a computer across many programs. The Gibson mix [Gibson 1970] was an early popular instruction mix. Multiplying the time for each instruction times its weight in the mix gave the user the average instruction execution time. (If measured in clock cycles, average instruction execution time is the same as average CPI.) Since instruction sets were similar, this was a more accurate comparison than add times. From average instruction execution time, then, it was only a small step to MIPS (as we have seen, the one is the inverse of the other). MIPS had the virtue of being easy for the layman to understand. As CPUs became more sophisticated and relied on memory hierarchies and pipelining, there was no longer a single execution time per instruction; MIPS could not be calculated from the mix and the manual. The next step was benchmarking using kernels and synthetic programs. Curnow and Wichmann [1976] created the Whetstone synthetic program by measuring scientific programs written in Algol 60. This program was converted to FORTRAN and was widely used to characterize scientific program performance. An effort with similar goals to Whetstone, the Livermore FORTRAN Kernels, was made by McMahon [1986] and researchers at Lawrence Livermore Laboratory in an attempt to establish a benchmark for supercomputers. These kernels, however, consisted of loops from real programs. 
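The instruction-mix calculation described above is a weighted average, and a short sketch may help. The mix below is hypothetical (it is not the Gibson mix), as are the per-class cycle counts and the clock rate; the code only illustrates how average CPI, average instruction time, and a native MIPS figure follow from a mix and a manual.

```python
# Hypothetical instruction mix: class -> (relative frequency, clock cycles).
HYPOTHETICAL_MIX = {
    "load":   (0.25, 2),
    "store":  (0.10, 2),
    "alu":    (0.45, 1),
    "branch": (0.20, 3),
}


def average_cpi(mix: dict) -> float:
    """Weighted average of cycles per instruction over the mix."""
    total_weight = sum(weight for weight, _ in mix.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(weight * cycles for weight, cycles in mix.values())


def native_mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK_HZ = 10e6  # hypothetical 10 MHz machine

cpi = average_cpi(HYPOTHETICAL_MIX)
avg_instruction_time_ns = cpi / CLOCK_HZ * 1e9

print(f"average CPI              = {cpi:.2f}")
print(f"average instruction time = {avg_instruction_time_ns:.0f} ns")
print(f"native MIPS              = {native_mips(CLOCK_HZ, cpi):.2f}")
```

The calculation assumes each instruction class has a single fixed cycle count, which is exactly the assumption that memory hierarchies and pipelining later invalidated.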
As it became clear that using MIPS to compare architectures with different instruction sets would not work, a notion of relative MIPS was created. When the VAX-11/780 was ready for announcement in 1977, DEC ran small benchmarks that were also run on an IBM 370/158. IBM marketing referred to the 370/158 as a 1-MIPS computer, and since the programs ran at the same speed, DEC marketing called the VAX-11/780 a 1-MIPS computer. Relative MIPS for a machine M was defined based on some reference machine as
$$\text{MIPS}_{M} = \frac{\text{Performance}_{M}}{\text{Performance}_{\text{reference}}} \times \text{MIPS}_{\text{reference}}$$
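As a rough illustration of the relative MIPS definition, the sketch below takes performance to be the inverse of execution time on a common benchmark, as elsewhere in this chapter. The function name and the benchmark times are hypothetical; only the 1-MIPS reference rating comes from the text.

```python
def relative_mips(time_on_machine_s: float,
                  time_on_reference_s: float,
                  reference_mips: float = 1.0) -> float:
    """Relative MIPS of machine M, given execution times for one benchmark.

    Performance is 1 / execution time, so the performance ratio
    Performance_M / Performance_reference equals
    time_on_reference / time_on_machine.
    """
    performance_ratio = time_on_reference_s / time_on_machine_s
    return performance_ratio * reference_mips


# Hypothetical benchmark: 10 s on the 1-MIPS reference machine,
# 2 s on the machine being rated (five times faster).
print(relative_mips(time_on_machine_s=2.0, time_on_reference_s=10.0))
```

Note that the result depends only on the time ratio and the rating assigned to the reference machine; no instructions are counted on the machine being rated.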
The popularity of the VAX-11/780 made it a popular reference machine for relative MIPS, especially since relative MIPS for a 1-MIPS computer is easy to calculate: If a machine was five times faster than the VAX-11/780, for that benchmark its rating would be 5 relative MIPS. The 1-MIPS rating was unquestioned for four years, until Joel Emer of DEC measured the VAX-11/780 under a timesharing load. He found that the VAX-11/780 native MIPS rating was 0.5. Subsequent VAXes that run 3 native MIPS for some benchmarks were therefore called
6-MIPS machines because they run six times faster than the VAX-11/780. By the early 1980s, the term MIPS was almost universally used to mean relative MIPS.
The 1970s and 1980s marked the growth of the supercomputer industry, which was defined by high performance on floating-point-intensive programs. Average instruction time and MIPS were clearly inappropriate metrics for this industry, hence the invention of MFLOPS (Millions of FLoating-point Operations Per Second), which effectively measured the inverse of execution time for a benchmark. Unfortunately, customers quickly forgot the program used for the rating, and marketing groups decided to start quoting peak MFLOPS in the supercomputer performance wars.
SPEC (System Performance and Evaluation Cooperative) was founded in the late 1980s to try to improve the state of benchmarking and make a more valid basis for comparison. The group initially focused on workstations and servers in the UNIX marketplace, and that remains the primary focus of these benchmarks today. The first release of SPEC benchmarks, now called SPEC89, was a substantial improvement in the use of more realistic benchmarks.
References
AMDAHL, G. M. [1967]. “Validity of the single processor approach to achieving large scale computing capabilities,” Proc. AFIPS 1967 Spring Joint Computer Conf. 30 (April), Atlantic City, N.J., 483–485.
ATANASOFF, J. V. [1940]. “Computing machine for the solution of large systems of linear equations,” Internal Report, Iowa State University, Ames.
BELL, C. G. [1984]. “The mini and micro industries,” IEEE Computer 17:10 (October), 14–30.
BELL, C. G., J. C. MUDGE, AND J. E. MCNAMARA [1978]. A DEC View of Computer Engineering, Digital Press, Bedford, Mass.
BURKS, A. W., H. H. GOLDSTINE, AND J. VON NEUMANN [1946]. “Preliminary discussion of the logical design of an electronic computing instrument,” Report to the U.S. Army Ordnance Department, p. 1; also appears in Papers of John von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, Mass., and Tomash Publishers, Los Angeles, Calif., 1987, 97–146.
CURNOW, H. J. AND B. A. WICHMANN [1976]. “A synthetic benchmark,” The Computer J., 19:1.
FLEMMING, P. J. AND J. J. WALLACE [1986]. “How not to lie with statistics: The correct way to summarize benchmarks results,” Comm. ACM 29:3 (March), 218–221.
FULLER, S. H. AND W. E. BURR [1977]. “Measurement and evaluation of alternative computer architectures,” Computer 10:10 (October), 24–35.
GIBSON, J. C. [1970]. “The Gibson mix,” Rep. TR. 00.2043, IBM Systems Development Division, Poughkeepsie, N.Y. (Research done in 1959.)
GOLDSTINE, H. H. [1972]. The Computer: From Pascal to von Neumann, Princeton University Press, Princeton, N.J.
JAIN, R. [1991]. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, New York.
LARSON, E. R. [1973]. “Findings of fact, conclusions of law, and order for judgment,” File No. 4–67, Civ. 138, Honeywell v. Sperry Rand and Illinois Scientific Development, U.S. District Court for the State of Minnesota, Fourth Division (October 19).
LUBECK, O., J. MOORE, AND R. MENDEZ [1985]. “A benchmark comparison of three supercomputers: Fujitsu VP-200, Hitachi S810/20, and Cray X-MP/2,” Computer 18:12 (December), 10–24.
METROPOLIS, N., J. HOWLETT, AND G-C ROTA, EDITORS [1980]. A History of Computing in the Twentieth Century, Academic Press, N.Y.
MCMAHON, F. M. [1986]. “The Livermore FORTRAN kernels: A computer test of numerical performance range,” Tech. Rep. UCRL-55745, Lawrence Livermore National Laboratory, Univ. of California, Livermore (December).
REDMOND, K. C. AND T. M. SMITH [1980]. Project Whirlwind: The History of a Pioneer Computer, Digital Press, Boston.
SHURKIN, J. [1984]. Engines of the Mind: A History of the Computer, W. W. Norton, New York.
SLATER, R. [1987]. Portraits in Silicon, MIT Press, Cambridge, Mass.
SMITH, J. E. [1988]. “Characterizing computer performance with a single number,” Comm. ACM 31:10 (October), 1202–1206.
SPEC [1989]. SPEC Benchmark Suite Release 1.0, October 2, 1989.
SPEC [1994]. SPEC Newsletter (June).
STERN, N. [1980]. “Who invented the first electronic digital computer,” Annals of the History of Computing 2:4 (October), 375–376.
TOUMA, W. R. [1993]. The Dynamics of the Computer Industry: Modeling the Supply of Workstations and Their Components, Kluwer Academic, Boston.
WEICKER, R. P. [1984]. “Dhrystone: A synthetic systems programming benchmark,” Comm. ACM 27:10 (October), 1013–1030.
WILKES, M. V. [1985]. Memoirs of a Computer Pioneer, MIT Press, Cambridge, Mass.
WILKES, M. V. [1995]. Computing Perspectives, Morgan Kaufmann, San Francisco.
WILKES, M. V., D. J. WHEELER, AND S. GILL [1951]. The Preparation of Programs for an Electronic Digital Computer, Addison-Wesley, Cambridge, Mass.
EXERCISES
Each exercise has a difficulty rating in square brackets and a list of the chapter sections it depends on in angle brackets. See the Preface for a description of the difficulty scale.
1.1 [20/10/10/15] In this exercise, assume that we are considering enhancing a machine by adding a vector mode to it. When a computation is run in vector mode it is 20 times faster than the normal mode of execution. We call the percentage of time that could be spent using vector mode the percentage of vectorization. Vectors are discussed in Appendix B, but you don’t need to know anything about how they work to answer this question!
a. [20] Draw a graph that plots the speedup as a function of the percentage of the computation performed in vector mode. Label the y axis “Net speedup” and label the x axis “Percent vectorization.”
b. [10] What percentage of vectorization is needed to achieve a speedup of 2?
c. [10] What percentage of vectorization is needed to achieve one-half the maximum speedup attainable from using vector mode?
d. [15] Suppose you have measured the percentage of vectorization for programs to be 70%. The hardware design group says they can double the speed of the vector rate with a significant additional engineering investment. You wonder whether the compiler crew could increase the use of vector mode as another approach to increasing performance. How much of an increase in the percentage of vectorization (relative to current usage) would you need to obtain the same performance gain? Which investment would you recommend?
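When exploring Exercise 1.1 numerically, it can help to have Amdahl's Law in executable form. The following is a minimal sketch, not a worked solution: the function name `vector_speedup` and its parameter names are assumptions made for illustration, and only the 20x vector-mode factor comes from the exercise statement.

```python
def vector_speedup(fraction_vectorized, vector_factor=20.0):
    """Overall speedup predicted by Amdahl's Law.

    fraction_vectorized: fraction (0.0 to 1.0) of the ORIGINAL execution
        time that can run in vector mode.
    vector_factor: how many times faster vector mode is than normal mode.
    """
    if not 0.0 <= fraction_vectorized <= 1.0:
        raise ValueError("fraction_vectorized must be between 0 and 1")
    # Unenhanced part keeps its time; enhanced part shrinks by vector_factor.
    new_time = (1.0 - fraction_vectorized) + fraction_vectorized / vector_factor
    return 1.0 / new_time


if __name__ == "__main__":
    # Tabulate speedup against percent vectorization in steps of 10%.
    for percent in range(0, 101, 10):
        print(f"{percent:3d}%  {vector_speedup(percent / 100):6.2f}")
```

The table it prints can be plotted by hand or fed to any plotting tool; the curve stays nearly flat until most of the computation is vectorized.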
1.2 [15/10] Assume, as in the Amdahl’s Law Example on page 41, that we make an enhancement to a computer that improves some mode of execution by a factor of 10. Enhanced mode is used 50% of the time, measured as a percentage of the execution time when the enhanced mode is in use. Recall that Amdahl’s Law depends on the fraction of the original, unenhanced execution time that could make use of enhanced mode. Thus, we cannot directly use this 50% measurement to compute speedup with Amdahl’s Law.
a. [15] What is the speedup we have obtained from fast mode?
b. [10] What percentage of the original execution time has been converted to fast mode?
1.3 [15] Show that the problem statements in the Examples on page 42 and page 45 are the same.
1.4
1.5 [15] Suppose we are considering a change to an instruction set. The base machine initially has only loads and stores to memory, and all operations work on the registers. Such machines are called load-store machines (see Chapter 2). Measurements of the load-store machine showing the instruction mix and clock cycle counts per instruction are given in Figure 1.32 on page 69. Let’s assume that 25% of the arithmetic logic unit (ALU) operations directly use a loaded operand that is not used again. We propose adding ALU instructions that have one source operand in memory. These new register-memory instructions have a clock cycle count of 2. Suppose that the extended instruction set increases the clock cycle count for branches by 1, but it does not affect the clock cycle time. (Chapter 3, on pipelining, explains why adding register-memory instructions might slow down branches.) Would this change improve CPU performance?
1.6 [15] Assume that we have a machine that, with a perfect cache, behaves as given in Figure 1.32. With a cache, we have measured that instructions have a miss rate of 5%, data references have a miss rate of 10%, and the miss penalty is 40 cycles. Find the CPI for each instruction type with cache misses and determine how much faster the machine is with no cache misses versus with cache misses.
1.7 [20] After graduating, you are asked to become the lead computer designer at Hyper Computers, Inc. Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations. You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:
■ The clock rate of the unoptimized version is 5% higher.
■ Thirty percent of the instructions in the unoptimized version are loads or stores.
■ The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
■ All instructions (including load and store) take one clock cycle.
Which is faster? Justify your decision quantitatively.
1.8 [15/15/8/12] The Whetstone benchmark contains 195,578 basic floating-point operations in a single iteration, divided as shown in Figure 1.33.
Operation | Count
Add | 82,014
Subtract | 8,229
Multiply | 73,220
Divide | 21,399
Convert integer to FP | 6,006
Compare | 4,710
Total | 195,578
FIGURE 1.33 The frequency of floating-point operations in the Whetstone benchmark.
Whetstone was run on a Sun 3/75 using the F77 compiler with optimization turned on. The Sun 3/75 is based on a Motorola 68020 running at 16.67 MHz, and it includes a floating-point coprocessor. The Sun compiler allows the floating point to be calculated with the coprocessor or using software routines, depending on compiler flags. A single iteration of Whetstone took 1.08 seconds using the coprocessor and 13.6 seconds using software. Assume that the CPI using the coprocessor was measured to be 10, while the CPI using software was measured to be 6.
a. [15] What is the MIPS rating for both runs?
b. [15] What is the total number of instructions executed for both runs?
c. [8] On the average, how many integer instructions does it take to perform a floating-point operation in software?
d. [12] What is the MFLOPS rating for the Sun 3/75 with the floating-point coprocessor running Whetstone? (Assume all the floating-point operations in Figure 1.33 count as one operation.)
1.9 [15/10/15/15/15] This exercise estimates the complete packaged cost of a microprocessor using the die cost equation and adding in packaging and testing costs. We begin with a short description of testing cost and follow with a discussion of packaging issues. Testing is the second term of the chip cost equation:
\[ \text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging}}{\text{Final test yield}} \]
Testing costs are determined by three components:
\[ \text{Cost of testing die} = \frac{\text{Cost of testing per hour} \times \text{Average die test time}}{\text{Die yield}} \]
Since bad dies are discarded, die yield is in the denominator in the equation: the good dies must shoulder the costs of testing those that fail. (In practice, a bad die may take less time to test, but this effect is small, since moving the probes on the die is a mechanical process that takes a large fraction of the time.) Testing costs about $50 to $500 per hour, depending on the tester needed. High-end designs with many high-speed pins require the more expensive testers. For higher-end microprocessors testing would run $300 to $500 per hour. Die tests take about 5 to 90 seconds on average, depending on the simplicity of the die and the provisions to reduce testing time included in the chip.
The cost of a package depends on the material used, the number of pins, and the die area. The cost of the material used in the package is in part determined by the ability to dissipate power generated by the die. For example, a plastic quad flat pack (PQFP) dissipating less than 1 watt, with 208 or fewer pins, and containing a die up to 1 cm on a side costs $2 in 1995. A ceramic pin grid array (PGA) can handle 300 to 600 pins and a larger die with more power, but it costs $20 to $60. In addition to the cost of the package itself is the cost of the labor to place a die in the package and then bond the pads to the pins, which adds from a few cents to a dollar or two to the cost. Some good dies are typically lost in the assembly process, thereby further reducing yield. For simplicity we assume the final test yield is 1.0; in practice it is at least 0.95. We also ignore the cost of the final packaged test. This exercise requires the information provided in Figure 1.34.
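The two cost equations in this exercise translate directly into code. The sketch below is only an illustration of the formulas as stated: the function names are assumptions, and the sample inputs in the usage section are hypothetical values picked from within the ranges quoted in the text, not data for any processor in Figure 1.34.

```python
def cost_of_testing_die(test_cost_per_hour, avg_test_seconds, die_yield):
    """Cost of testing die = (cost per hour x average test time) / die yield."""
    if not 0.0 < die_yield <= 1.0:
        raise ValueError("die_yield must be in (0, 1]")
    test_hours = avg_test_seconds / 3600.0
    return test_cost_per_hour * test_hours / die_yield


def cost_of_integrated_circuit(die_cost, testing_cost, packaging_cost,
                               final_test_yield=1.0):
    """(Cost of die + cost of testing die + cost of packaging) / final test yield."""
    if not 0.0 < final_test_yield <= 1.0:
        raise ValueError("final_test_yield must be in (0, 1]")
    return (die_cost + testing_cost + packaging_cost) / final_test_yield


if __name__ == "__main__":
    # Hypothetical inputs: a $300/hour tester, a 30-second test, 50% die
    # yield, a $40 die, and a $2 package. Final test yield is 1.0, the
    # simplifying assumption made in the exercise.
    testing = cost_of_testing_die(300.0, 30.0, 0.5)
    total = cost_of_integrated_circuit(40.0, testing, 2.0)
    print(f"testing cost per good die: ${testing:.2f}")
    print(f"packaged part cost:        ${total:.2f}")
```

Dividing by die yield inside `cost_of_testing_die` is what makes the good dies carry the tester time spent on the bad ones.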
Microprocessor | Die area (mm²) | Pins | Technology | Estimated wafer cost ($) | Package
MIPS 4600 | 77 | 208 | CMOS, 0.6µ, 3M | 3200 | PQFP
PowerPC 603 | 85 | 240 | CMOS, 0.6µ, 4M | 3400 | PQFP
HP 71x0 | 196 | 504 | CMOS, 0.8µ, 3M | 2800 | Ceramic PGA
Digital 21064A | 166 | 431 | CMOS, 0.5µ, 4.5M | 4000 | Ceramic PGA
SuperSPARC/60 | 256 | 293 | BiCMOS, 0.6µ, 3.5M | 4000 | Ceramic PGA
FIGURE 1.34 Characteristics of microprocessors. The technology entry is the process type, line width, and number of interconnect levels.
a. [15] For each of the microprocessors in Figure 1.34, compute the number of good chips you would get per 20-cm wafer using the model on page 18. Assume a defect density of one defect per cm², a wafer yield of 95%, and α = 3.
b. [10] For each microprocessor in Figure 1.34, compute the cost per projected good die before packaging and testing. Use the number of good dies per wafer from part (a) of this exercise and the wafer cost from Figure 1.34.
c. [15] Both package cost and test cost are proportional to pin count. Using the additional assumption shown in Figure 1.35, compute the cost per good, tested, and packaged part using the costs per good die from part (b) of this exercise.
Package type | Pin count | Package cost ($) | Test time (secs) | Test cost per hour ($)
PQFP
1 (001₂)
5 (101₂) => 5 (101₂)
6 (110₂) => 3 (011₂)
7 (111₂) => 7 (111₂)
Without special support such address transformation would take an extra memory access to get the new address, or involve a fair number of logical instructions to transform the address. The DSP solution is based on the observation that the resulting binary address is simply the reverse of the initial address! For example, address 100₂ (4) becomes 001₂ (1). Hence, many DSPs have this second novel addressing mode, bit reverse addressing, whereby the hardware reverses the lower bits of the address, with the number of bits reversed depending on the step of the FFT algorithm.
As DSP programmers migrate towards larger programs and hence become more attracted to compilers, they have been trying to use the compiler technology developed for the desktop and embedded computers. Such compilers have no hope of taking high-level language code and producing these two addressing modes, so the modes are limited to assembly language programmers. As stated before, the DSP community routinely uses library routines, and hence programmers may benefit even if they write at a higher level.
Figure 2.11 shows the static frequency of data addressing modes in a DSP for a set of 54 library routines. This architecture has 17 addressing modes, yet the 6 modes also found in Figure 2.6 on page 108 for desktop and server computers account for 95% of the DSP addressing. Despite measuring hand-coded routines to derive Figure 2.11, the use of novel addressing modes is sparse.
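Returning to bit reverse addressing: the transformation that the hardware performs can be mimicked in software to check the address pairs quoted above. This is a minimal sketch of the idea only, not a model of any particular DSP; the function name `bit_reverse` and its parameters are assumptions for illustration.

```python
def bit_reverse(address, num_bits):
    """Reverse the low `num_bits` bits of `address`; higher bits are kept."""
    low = address & ((1 << num_bits) - 1)
    reversed_low = 0
    for _ in range(num_bits):
        # Shift the next low-order bit of `low` into the bottom of the result.
        reversed_low = (reversed_low << 1) | (low & 1)
        low >>= 1
    high = (address >> num_bits) << num_bits
    return high | reversed_low


if __name__ == "__main__":
    # Three reversed bits correspond to an 8-point FFT.
    for addr in range(8):
        rev = bit_reverse(addr, 3)
        print(f"{addr} ({addr:03b}) => {rev} ({rev:03b})")
```

A software loop like this costs several instructions per address, which is the overhead the dedicated addressing mode is meant to avoid.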
These results are for just one library on just one DSP; other libraries might use more addressing modes, and static and dynamic frequencies may differ. Yet Figure 2.11 still makes the point that there is often a mismatch between what programmers and compilers actually use versus what architects expect, and this is just as true for DSPs as it is for more traditional processors.
Addressing Mode | Assembly Symbol | Percent
Immediate | #num | 30.02%
Displacement | ARx(num) | 10.82%
Register indirect | *ARx | 17.42%
Direct | num | 11.99%
Autoincrement, pre increment (increment register before use contents as address) | *+ARx | 0
Autoincrement, post increment (increment register after use contents as address) | *ARx+ | 18.84%
Autoincrement, pre increment with 16b immediate | *+ARx(num) | 0.77%
Autoincrement, pre increment, with circular addressing | *ARx+% | 0.08%
Autoincrement, post increment with 16b immediate, with circular addressing | *ARx+(num)% | 0
Autoincrement, post increment by contents of AR0 | *ARx+0 | 1.54%
Autoincrement, post increment by contents of AR0, with circular addressing | *ARx+0% | 2.15%
Autoincrement, post increment by contents of AR0, with bit reverse addressing | *ARx+0B | 0
Autodecrement, post decrement (decrement register after use contents as address) | *ARx- | 6.08%
Autodecrement, post decrement, with circular addressing | *ARx-% | 0.04%
Autodecrement, post decrement by contents of AR0 | *ARx-0 | 0.16%
Autodecrement, post decrement by contents of AR0, with circular addressing | *ARx-0% | 0.08%
Autodecrement, post decrement by contents of AR0, with bit reverse addressing | *ARx-0B | 0
Total | | 100.00%
FIGURE 2.11 Frequency of addressing modes for TI TMS320C54x DSP. The C54x has 17 data addressing modes, not counting register access, but the four found in MIPS account for 70% of the modes. Autoincrement and autodecrement, found in some RISC architectures, account for another 25% of the usage. This data was collected from a measurement of static instructions for the C-callable library of 54 DSP routines coded in assembly language. See http://www.ti.com/sc/docs/products/dsp/c5000/c54x/54dsplib.htm
Summary: Memory Addressing
First, because of their popularity, we would expect a new architecture to support at least the following addressing modes: displacement, immediate, and register indirect. Figure 2.7 on page 109 shows they represent 75% to 99% of the addressing modes used in our SPEC measurements. Second, we would expect the size of the address for displacement mode to be at least 12 to 16 bits, since the caption in Figure 2.8 on page 110 suggests these sizes would capture 75% to 99% of the displacements. Third, we would expect the size of the immediate field to be at least 8 to 16 bits. As the caption in Figure 2.10 suggests, these sizes would capture 50% to 80% of the immediates.
Desktop and server processors rely on compilers, and so addressing modes must match the ability of the compilers to use them, while historically DSPs rely on hand-coded libraries to exercise novel addressing modes. Even so, there are times when programmers find they do not need the clever tricks that architects thought would be useful, or tricks that other programmers promised that they would use. As DSPs head towards relying even more on compiled code, we expect increasing emphasis on simpler addressing modes.
Having covered instruction set classes and decided on register-register architectures plus the recommendations on data addressing modes above, we next cover the sizes and meanings of data.
2.5
Type and Size of Operands
How is the type of an operand designated? Normally, encoding in the opcode designates the type of an operand; this is the method used most often. Alternatively, the data can be annotated with tags that are interpreted by the hardware. These tags specify the type of the operand, and the operation is chosen accordingly. Computers with tagged data, however, can only be found in computer museums.
Let’s start with desktop and server architectures. Usually the type of an operand (integer, single-precision floating point, character, and so on) effectively gives its size. Common operand types include character (8 bits), half word (16 bits), word (32 bits), single-precision floating point (also 1 word), and double-precision floating point (2 words). Integers are almost universally represented as two’s complement binary numbers. Characters are usually in ASCII, but the 16-bit Unicode (used in Java) is gaining popularity with the internationalization of computers. Until the early 1980s, most computer manufacturers chose their own floating-point representation. Almost all computers since that time follow the same standard for floating point, the IEEE standard 754. The IEEE floating-point standard is discussed in detail in Appendix G.
Some architectures provide operations on character strings, although such operations are usually quite limited and treat each byte in the string as a single character. Typical operations supported on character strings are comparisons and moves.
For business applications, some architectures support a decimal format, usually called packed decimal or binary-coded decimal: 4 bits are used to encode the values 0–9, and 2 decimal digits are packed into each byte. Numeric character strings are sometimes called unpacked decimal, and operations (called packing and unpacking) are usually provided for converting back and forth between them.
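The packing and unpacking operations described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, odd-length strings are padded with a leading zero, and no sign is encoded, whereas packed decimal formats on real machines typically reserve space for one.

```python
def pack_decimal(digits):
    """Pack a string of decimal digits into bytes, two 4-bit digits per byte."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of the digits 0-9")
    if len(digits) % 2:
        digits = "0" + digits  # pad so every byte holds exactly two digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))


def unpack_decimal(packed):
    """Expand packed decimal bytes back into a string of digit characters."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in packed)


if __name__ == "__main__":
    packed = pack_decimal("1995")
    print(packed.hex(), unpack_decimal(packed))
```

Because each nibble holds one digit, the hexadecimal dump of the packed bytes reads the same as the original decimal string.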
One reason to use decimal operands is to get results that exactly match decimal numbers, as some decimal fractions do not have an exact representation in binary. For example, 0.1₁₀ is a simple fraction in decimal, but in binary it requires an infinite set of repeating digits: 0.0001100110011...₂. Thus, calculations that are exact in decimal can be close but inexact in binary, which can be a problem for financial transactions. (See Appendix G to learn more about precise arithmetic.)
Our SPEC benchmarks use byte or character, half word (short integer), word (integer), double word (long integer) and floating-point data types. Figure 2.12 shows the dynamic distribution of the sizes of objects referenced from memory for these programs. The frequency of access to different data types helps in deciding what types are most important to support efficiently. Should the computer have a 64-bit access path, or would taking two cycles to access a double word be satisfactory? As we saw earlier, byte accesses require an alignment network: How important is it to support bytes as primitives? Figure 2.12 uses memory references to examine the types of data being accessed.
In some architectures, objects in registers may be accessed as bytes or half words. However, such access is very infrequent: on the VAX, it accounts for no more than 12% of register references, or roughly 6% of all operand accesses in these programs.
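The point about 0.1 having no exact binary representation is easy to check. The sketch below uses Python's standard `fractions` and `decimal` modules; the helper names are assumptions chosen for illustration. It generates the repeating binary digits of one-tenth exactly, then compares adding ten dimes in binary floating point against decimal arithmetic.

```python
from decimal import Decimal
from fractions import Fraction


def binary_fraction_digits(value, count):
    """Return the first `count` binary digits after the point of 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2                      # shift the next binary digit into view
        bit = 1 if value >= 1 else 0
        digits.append(str(bit))
        value -= bit
    return "".join(digits)


def ten_dimes_binary():
    """Add 0.1 ten times using binary floating point."""
    return sum([0.1] * 10)


def ten_dimes_decimal():
    """Add 0.10 ten times using decimal arithmetic."""
    return sum([Decimal("0.10")] * 10, Decimal("0"))


if __name__ == "__main__":
    # Fraction keeps 1/10 exact, so the digits show the true repeating pattern.
    print("0." + binary_fraction_digits(Fraction(1, 10), 13) + "...")
    print("binary sum equals 1? ", ten_dimes_binary() == 1.0)
    print("decimal sum equals 1?", ten_dimes_decimal() == Decimal("1"))
```

Software decimal arithmetic like this is much slower than hardware binary arithmetic, which is one reason the level of hardware decimal support depends on the intended use of the computer.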
[Figure 2.12: bar chart showing, for applu, equake, gzip, and perl, the percentage of memory accesses (0% to 100%) to double words (64 bits), words (32 bits), half words (16 bits), and bytes (8 bits).]
FIGURE 2.12 Distribution of data accesses by size for the benchmark programs. The double word data type is used for double-precision floating-point in floating-point programs and for addresses, since the computer uses 64-bit addresses. On a 32-bit address computer the 64-bit addresses would be replaced by 32-bit addresses, and so almost all double-word accesses in integer programs would become single word accesses.
2.6
Operands for Media and Signal Processing
Graphics applications deal with 2D and 3D images. A common 3D data type is called a vertex, a data structure with four components: an x coordinate, a y coordinate, a z coordinate, and a fourth coordinate (w) to help with color or hidden surfaces. Three vertices specify a graphics primitive such as a triangle. Vertex values are usually 32-bit floating-point values. Assuming a triangle is visible, when it is rendered it is filled with pixels. Pixels are usually 32 bits, usually consisting of four 8-bit channels: R (red), G (green), B (blue) and A (which denotes the transparency of the surface or transparency of the pixel when the pixel is rendered).
DSPs add fixed point to the data types discussed so far. If you think of integers as having a binary point to the right of the least significant bit, fixed point has a binary point just to the right of the sign bit. Hence, fixed-point data are fractions between -1 and +1.
EXAMPLE
Here are three simple 16-bit patterns:
0100 0000 0000 0000
0000 1000 0000 0000
0100 1000 0000 1000
What values do they represent if they are two’s complement integers? Fixed-point numbers?
ANSWER
Number representation tells us that the i-th digit to the left of the binary point represents 2ⁱ⁻¹ and the i-th digit to the right of the binary point represents 2⁻ⁱ. First assume these three patterns are integers. Then the binary point is to the far right, so they represent 2¹⁴, 2¹¹, and (2¹⁴ + 2¹¹ + 2³), or 16384, 2048, and 18440.
Fixed point places the binary point just to the right of the sign bit, so as fixed point these patterns represent 2⁻¹, 2⁻⁴, and (2⁻¹ + 2⁻⁴ + 2⁻¹²). The fractions are 1/2, 1/16, and (2048 + 256 + 1)/4096 or 2305/4096, which represent about 0.50000, 0.06250, and 0.56274. Alternatively, for an n-bit two’s complement, fixed-point number we could just divide the integer representation by 2ⁿ⁻¹ to derive the same results: 16384/32768 = 1/2, 2048/32768 = 1/16, and 18440/32768 = 2305/4096.
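The two readings of the same bit pattern in this example can be reproduced mechanically. A minimal sketch follows; the function names are assumptions for illustration, and the fixed-point reading uses the divide-by-2ⁿ⁻¹ shortcut mentioned at the end of the answer.

```python
def as_twos_complement(pattern, bits=16):
    """Interpret an unsigned bit pattern as a two's complement integer."""
    if pattern & (1 << (bits - 1)):        # sign bit set: value is negative
        return pattern - (1 << bits)
    return pattern


def as_fixed_point(pattern, bits=16):
    """Interpret the pattern with the binary point just right of the sign bit."""
    return as_twos_complement(pattern, bits) / (1 << (bits - 1))


if __name__ == "__main__":
    for text in ("0100 0000 0000 0000",
                 "0000 1000 0000 0000",
                 "0100 1000 0000 1000"):
        pattern = int(text.replace(" ", ""), 2)
        print(text, as_twos_complement(pattern),
              f"{as_fixed_point(pattern):.5f}")
```

Nothing in the bits themselves says which reading is intended; as the text notes for operand types generally, the interpretation comes from the operation applied.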
Fixed point can be thought of as just low cost floating point. Unlike floating point, it does not include an exponent in every word, nor does it have hardware that automatically aligns and normalizes operands. Instead, fixed point relies on the DSP programmer to keep the exponent in a separate variable and ensure that each result is shifted left or right to keep the answer aligned to that variable. Since this exponent variable is often shared by a set of fixed-point variables, this style of arithmetic is also called blocked floating point, since a block of variables have a common exponent.
To support such manual calculations, DSPs usually have some registers that are wider to guard against round-off error, just as floating-point units internally have extra guard bits. Figure 2.13 surveys four generations of DSPs, listing data sizes and width of the accumulating registers. Note that DSP architects are not bound by the powers of 2 for word sizes. Figure 2.14 shows the size of data operands for the TI TMS320C540x DSP.
Generation | Year | Example DSP | Data Width | Accumulator Width
1 | 1982 | TI TMS32010 | 16 bits | 32 bits
2 | 1987 | Motorola DSP56001 | 24 bits | 56 bits
3 | 1995 | Motorola DSP56301 | 24 bits | 56 bits
4 | 1998 | TI TMS320C6201 | 16 bits | 40 bits
FIGURE 2.13 Four generations of DSPs, their data width, and the width of the registers that reduce round-off error. Section 2.8 explains that multiply-accumulate operations use wide registers to avoid losing precision when accumulating double-length products [Bier 1997].
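The blocked floating point style described above, in which a block of fixed-point values shares one exponent kept in a separate variable, can be sketched as follows. This is an illustration under assumptions, not a description of any DSP library: the function names, the 16-bit mantissa width, and the choice of exponent (the smallest one that keeps every value a fraction below 1 in magnitude) are all choices made for the example.

```python
import math


def to_block_fp(values, bits=16):
    """Convert floats to `bits`-wide fixed-point mantissas sharing one exponent."""
    peak = max(abs(v) for v in values)
    # frexp writes peak as m * 2**exponent with 0.5 <= m < 1 (exponent 0 for 0.0),
    # so dividing every value by 2**exponent leaves fractions below 1 in magnitude.
    exponent = math.frexp(peak)[1]
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    mantissas = []
    for v in values:
        scaled = round(math.ldexp(v, bits - 1 - exponent))
        mantissas.append(max(lo, min(hi, scaled)))  # clamp rounding overshoot
    return mantissas, exponent


def from_block_fp(mantissas, exponent, bits=16):
    """Recover approximate float values from mantissas and the shared exponent."""
    return [math.ldexp(m, exponent - (bits - 1)) for m in mantissas]


if __name__ == "__main__":
    mantissas, exponent = to_block_fp([3.0, 1.5, -0.75])
    print(mantissas, exponent)
    print(from_block_fp(mantissas, exponent))
```

The cost of sharing one exponent is that small values in a block with a large peak lose low-order bits, which is part of why the wider accumulators matter.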
Data Size | Memory Operand in Operation | Memory Operand in Data Transfer
16 bits | 89.3% | 89.0%
32 bits | 10.7% | 11.0%
FIGURE 2.14 Size of data operands for TMS320C540x DSP. About 90% of operands are 16 bits. This DSP has two 40-bit accumulators. There are no floating-point operations, as is typical of many DSPs, so these data are all fixed-point integers. For details on these measurements, see the caption of Figure 2.11 on page 113.
Summary: Type and Size of Operands
From this section we would expect a new 32-bit architecture to support 8-, 16-, and 32-bit integers and 32-bit and 64-bit IEEE 754 floating-point data. A new 64-bit address architecture would need to support 64-bit integers as well. The level of support for decimal data is less clear, and it is a function of the intended use of the computer as well as the effectiveness of the decimal support. DSPs need wider accumulating registers than the size in memory to aid accuracy in fixed-point arithmetic.
We have reviewed instruction set classes and chosen the register-register class, reviewed memory addressing and selected displacement, immediate, and register indirect addressing modes, and selected the operand sizes and types above. Now we are ready to look at instructions that do the heavy lifting in the architecture.
2.7
Operations in the Instruction Set
The operators supported by most instruction set architectures can be categorized as in Figure 2.15. One rule of thumb across all architectures is that the most widely executed instructions are the simple operations of an instruction set. For example, Figure 2.16 shows 10 simple instructions that account for 96% of instructions executed for a collection of integer programs running on the popular Intel 80x86. Hence, the implementor of these instructions should be sure to make them fast, as they are the common case.
Operator type | Examples
Arithmetic and logical | Integer arithmetic and logical operations: add, subtract, and, or, multiply, divide
Data transfer | Loads-stores (move instructions on computers with memory addressing)
Control | Branch, jump, procedure call and return, traps
System | Operating system call, virtual memory management instructions
Floating point | Floating-point operations: add, multiply, divide, compare
Decimal | Decimal add, decimal multiply, decimal-to-character conversions
String | String move, string compare, string search
Graphics | Pixel and vertex operations, compression/decompression operations
FIGURE 2.15 Categories of instruction operators and examples of each. All computers generally provide a full set of operations for the first three categories. The support for system functions in the instruction set varies widely among architectures, but all computers must have some instruction support for basic system functions. The amount of support in the instruction set for the last four categories may vary from none to an extensive set of special instructions. Floating-point instructions will be provided in any computer that is intended for use in an application that makes much use of floating point. These instructions are sometimes part of an optional instruction set. Decimal and string instructions are sometimes primitives, as in the VAX or the IBM 360, or may be synthesized by the compiler from simpler instructions. Graphics instructions typically operate on many smaller data items in parallel; for example, performing eight 8-bit additions on two 64-bit operands.
As mentioned before, the instructions in Figure 2.16 are found in every computer for every application (desktop, server, embedded), with the variations of operations in Figure 2.15 largely depending on which data types the instruction set includes.
Rank | 80x86 instruction | Integer average (% total executed)
1 | load | 22%
2 | conditional branch | 20%
3 | compare | 16%
4 | store | 12%
5 | add | 8%
6 | and | 6%
7 | sub | 5%
8 | move register-register | 4%
9 | call | 1%
10 | return | 1%
Total | | 96%
FIGURE 2.16 The top 10 instructions for the 80x86. Simple instructions dominate this list, and are responsible for 96% of the instructions executed. These percentages are the average of the five SPECint92 programs.
2.8
Operations for Media and Signal Processing
Because media processing is judged by human perception, the data for multimedia operations is often much narrower than the 64-bit data word of modern desktop and server processors. For example, floating-point operations for graphics are normally in single precision, not double precision, and often at precision less than required by IEEE 754. Rather than waste the 64-bit ALUs when operating on 32-bit, 16-bit, or even 8-bit integers, multimedia instructions can operate on
several narrower data items at the same time. Thus, a partitioned add operation on 16-bit data with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The extra hardware cost is simply to prevent carries between the four 16-bit partitions of the ALU. For example, such instructions might be used for graphical operations on pixels. These operations are commonly called Single-Instruction Multiple Data (SIMD) or vector instructions. Chapter 6 and Appendix F describe the full machines that pioneered these architectures.
Most graphics multimedia applications use 32-bit floating-point operations. Some computers double peak performance of single-precision, floating-point operations; they allow a single instruction to launch two 32-bit operations on operands found side-by-side in a double precision register. Just as in the prior case, the two partitions must be insulated to prevent operations on one half from affecting the other. Such floating-point operations are called paired-single operations. For example, such an operation might be used for graphical transformations of vertices. This doubling in performance is typically accomplished by doubling the number of floating-point units, making it more expensive than just suppressing carries in integer adders. Figure 2.17 summarizes the SIMD multimedia instructions found in several recent computers.
Instruction category
Alpha MAX
HP PA-RISC MAX2
Intel Pentium MMX
Power PC AltiVec
SPARC VIS
Add/subtract
4H
8B,4H,2W
16B, 8H, 4W
4H,2W
Saturating add/sub
4H
8B,4H
16B, 8H, 4W
4H
16B, 8H
8B,4H,2W (=,>)
16B, 8H, 4W (=,>,>=, n bits)
2W->2B, 4H->4B
Unpack/merge
2B->2W, 4B->4H
Permute/shuffle
4H,2W (=,not=,>,8B
4H
4H->4B, 2W->2H
4W->4B, 8H->8B
2W->2H, 2W->2B, 4H>4B
2B->2W, 4B->4H
4B->4W, 8B->8H
4B->4H, 2*4B->8B
16B, 8H, 4W
FIGURE 2.17 Summary of multimedia support for desktop RISCs. Note the diversity of support, with little in common across the five architectures. All are fixed width operations, performing multiple narrow operations on either a 64-bit or 128-bit ALU. B stands for byte (8 bits), H for halfword (16 bits), and W for word (32 bits). Thus, 8B means an operation on 8 bytes in a single instruction. Note that AltiVec assumes a 128-bit ALU, and the rest assume 64 bits. Pack and unpack use the notation 2*2W to mean 2 operands each with 2 words. This table is a simplification of the full multimedia architectures, leaving out many details. For example, HP MAX2 includes an instruction to calculate averages, and SPARC VIS includes instructions to set registers to constants. Also, this table does not include the memory alignment operation of AltiVec, MAX and VIS.
DSP operations
DSPs also provide operations found in the first three rows of Figure 2.15, but they change the semantics a bit. First, because they are often used in real time applications, there is not an option of causing an exception on arithmetic overflow (otherwise it could miss an event); thus, the result will be used no matter what the inputs. To support such an unyielding environment, DSP architectures use saturating arithmetic: if the result is too large to be represented, it is set to the largest representable number, depending on the sign of the result. In contrast, two’s complement arithmetic can add a small positive number to a large positive number and end up with a negative result. DSP algorithms rely on saturating arithmetic, and would be incorrect if run on a computer without it. A second issue for DSPs is that there are several modes to round the wider accumulators into the narrower data words, just as the IEEE 754 has several rounding modes to choose from.
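Two ideas from this section, the partitioned add and saturating arithmetic, can be emulated in software to make their behavior concrete. The sketch below is an illustration only: the function names and the choice of four 16-bit lanes in a 64-bit word are assumptions matching the partitioned add described earlier, and real hardware does all lanes at once rather than looping.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1


def partitioned_add(a, b):
    """Add four 16-bit lanes packed into 64-bit words; no carry crosses lanes."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift  # discard the lane's carry out
    return result


def wrapping_add16(x, y):
    """Ordinary 16-bit two's complement add: overflow wraps around."""
    total = (x + y) & 0xFFFF
    return total - 0x10000 if total & 0x8000 else total


def saturating_add16(x, y):
    """16-bit saturating add: overflow clamps to the nearest representable value."""
    return max(-32768, min(32767, x + y))


if __name__ == "__main__":
    a, b = 0x0001_FFFF_0003_0004, 0x0001_0001_0001_0001
    print(f"partitioned: {partitioned_add(a, b):016x}")
    print(f"plain 64-bit: {(a + b) & 0xFFFFFFFFFFFFFFFF:016x}")
    print("wrapping:  ", wrapping_add16(32767, 1))
    print("saturating:", saturating_add16(32767, 1))
```

The wrapping result shows the sign flip the text warns about: adding a small positive number to a large positive one yields a negative value, whereas the saturating version stays at the largest positive value.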
Finally, the targeted kernels for DSPs accumulate a series of products, and hence have a multiply-accumulate or MAC instruction. MACs are key to dot product operations for vector and matrix multiplies. In fact, MACs/second is the primary peak-performance metric that DSP architects brag about. The wide accumulators are used primarily to accumulate products, with rounding used when transferring results to memory.
Instruction | Percent
store mem16 | 32.2%
load mem16 | 9.4%
add mem16 | 6.8%
call | 5.0%
push mem16 | 5.0%
subtract mem16 | 4.9%
multiply-accumulate (MAC) mem16 | 4.6%
move mem-mem 16 | 4.0%
change status | 3.7%
pop mem16 | 2.8%
conditional branch | 2.6%
load mem32 | 2.5%
return | 2.5%
store mem32 | 2.0%
branch | 2.0%
repeat | 2.0%
multiply | 1.8%
NOP | 1.5%
add mem32 | 1.3%
subtract mem32 | 0.9%
Total | 97.2%
FIGURE 2.18 Mix of instructions for TMS320C540x DSP. As in Figure 2.16, simple instructions dominate this list of most frequent instructions. Mem16 stands for a 16-bit memory operand and mem32 stands for a 32-bit memory operand. The large number of change status instructions is to set mode bits to affect instructions, essentially saving opcode space in these 16-bit instructions by keeping some of it in a status register. For example, status bits determine whether 32-bit operations operate in SIMD mode to produce 16-bit results in parallel or produce a single 32-bit result. For details on these measurements, see the caption of Figure 2.11 on page 113.
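The multiply-accumulate pattern behind the MAC entries in Figure 2.18 can be sketched as a dot product that sums into a wide accumulator and is rounded only when the result is narrowed. The 40-bit accumulator width follows the C54 description in the text; saturating the accumulator, the round-to-nearest step, and the choice of which bits are stored are assumptions made for illustration, not the documented behavior of any DSP.

```python
ACC_BITS = 40  # accumulator width, matching the C54 accumulators in the text

def mac(acc, x, y):
    """One multiply-accumulate step: acc + x*y, clamped to ACC_BITS signed bits (assumed)."""
    hi = 2**(ACC_BITS - 1) - 1
    lo = -(2**(ACC_BITS - 1))
    return max(lo, min(hi, acc + x * y))

def dot_product(xs, ys):
    """Accumulate a series of products, one MAC per element pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc

def round_to_16(acc, shift=16):
    """Narrow the accumulator to a 16-bit word: add half an LSB, shift, then clamp.

    The shift amount and rounding rule are hypothetical choices for this sketch.
    """
    rounded = (acc + (1 << (shift - 1))) >> shift
    return max(-(2**15), min(2**15 - 1, rounded))

total = dot_product([1, 2, 3], [4, 5, 6])
print(total)
print(round_to_16(0x18000))
```

Keeping the running sum wide and rounding once at the end avoids accumulating a rounding error on every product.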
Figure 2.18 shows the static mix of instructions for the TI TMS320C540x DSP for a set of library routines. This 16-bit architecture uses two 40-bit accumulators, plus a stack for passing parameters to library routines and for saving return addresses. Note that DSPs have many more multiplies and MACs than in desktop
programs. Although not shown in the figure, 15% to 20% of the multiplies and MACs round the final sum. The C54 also has 8 address registers that can be accessed via load and store instructions, as these registers are memory mapped: that is, each register also has a memory address. The larger number of stores is due in part to writing portions of the 40-bit accumulators to 16-bit words, and also to transfers between registers, as the index registers also have memory addresses. There are no floating-point operations, as is typical of many DSPs, so these operations are all on fixed-point integers.
Summary: Operations in the Instruction Set
From this section we see the importance and popularity of simple instructions: load, store, add, subtract, move register-register, and, and shift. DSPs add multiplies and multiply-accumulates to this simple set of primitives. Reviewing where we are in the architecture space, we have looked at instruction classes and selected register-register. We selected displacement, immediate, and register indirect addressing and selected 8-, 16-, 32-, and 64-bit integers and 32- and 64-bit floating point. For operations we emphasize the simple list mentioned above. We are now ready to show how computers make decisions.
2.9
Instructions for Control Flow
Because the measurements of branch and jump behavior are fairly independent of other measurements and applications, we now examine the use of control-flow instructions, which have little in common with operations of the prior sections. There is no consistent terminology for instructions that change the flow of control. In the 1950s they were typically called transfers. Beginning in 1960 the name branch began to be used. Later, computers introduced additional names. Throughout this book we will use jump when the change in control is unconditional and branch when the change is conditional.
We can distinguish four different types of control-flow change:
1. Conditional branches
2. Jumps
3. Procedure calls
4. Procedure returns
We want to know the relative frequency of these events, as each event is different, may use different instructions, and may have different behavior. Figure 2.19 shows the frequencies of these control-flow instructions for a load-store computer running our benchmarks.
Class | Floating-point average | Integer average
call/return | 8% | 19%
jump | 10% | 6%
cond. branch | 82% | 75%
(Bar chart; horizontal axis: frequency of branch instructions, 0% to 100%.)
FIGURE 2.19 Breakdown of control flow instructions into three classes: calls or returns, jumps, and conditional branches. Conditional branches clearly dominate. Each type is counted in one of three bars. The programs and computer used to collect these statistics are the same as those in Figure 2.8.
Addressing Modes for Control Flow Instructions
The destination address of a control flow instruction must always be specified. This destination is specified explicitly in the instruction in the vast majority of cases (procedure return being the major exception), since for return the target is not known at compile time. The most common way to specify the destination is to supply a displacement that is added to the program counter, or PC. Control flow instructions of this sort are called PC-relative. PC-relative branches or jumps are advantageous because the target is often near the current instruction, and specifying the position relative to the current PC requires fewer bits. Using PC-relative addressing also permits the code to run independently of where it is loaded. This property, called position independence, can eliminate some work when the program is linked and is also useful in programs linked dynamically during execution.
To implement returns and indirect jumps when the target is not known at compile time, a method other than PC-relative addressing is required. Here, there must be a way to specify the target dynamically, so that it can change at runtime. This dynamic address may be as simple as naming a register that contains the target address; alternatively, the jump may permit any addressing mode to be used to supply the target address. These register indirect jumps are also useful for four other important features:
1. case or switch statements found in most programming languages (which select among one of several alternatives);
2. virtual functions or methods in object-oriented languages like C++ or Java (which allow different routines to be called depending on the type of the argument);
3. higher-order functions or function pointers in languages like C or C++ (which allow functions to be passed as arguments, giving some of the flavor of object-oriented programming); and
4. dynamically shared libraries (which allow a library to be loaded and linked at runtime only when it is actually invoked by the program rather than loaded and linked statically before the program is run).
In all four cases the target address is not known at compile time, and hence is usually loaded from memory into a register before the register indirect jump.
As branches generally use PC-relative addressing to specify their targets, an important question concerns how far branch targets are from branches. Knowing the distribution of these displacements will help in choosing what branch offsets to support and thus will affect the instruction length and encoding. Figure 2.20 shows the distribution of displacements for PC-relative branches, measured in instructions. About 75% of the branches are in the forward direction.
[Figure 2.20: plot with two series, integer average and floating-point average; horizontal axis: bits of branch displacement, 0 to 20; vertical axis: 0 to 30.]
FIGURE 2.20 Branch distances in terms of number of instructions between the target and the branch instruction. The most frequent branches in the integer programs are to targets that can be encoded in four to eight bits. This result tells us that short displacement fields often suffice for branches and that the designer can gain some encoding density by having a shorter instruction with a smaller branch displacement. These measurements were taken on a load-store computer (Alpha architecture) with all instructions aligned on word boundaries. An architecture that requires fewer instructions for the same program, such as a VAX, would have shorter branch distances. However, the number of bits needed for the displacement may increase if the computer has variable-length instructions that can be aligned on any byte boundary. Exercise 2.1 shows the cumulative distribution of this branch displacement data (see Figure 2.42 on page 173). The programs and computer used to collect these statistics are the same as those in Figure 2.8.
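The link between branch distance and displacement field width can be made concrete with a short sketch. It assumes a signed two's complement displacement counted in instructions, fixed 4-byte instructions, and a displacement taken relative to the address of the branch itself; real architectures differ on that last point, and the function names are hypothetical.

```python
def displacement_bits(distance):
    """Smallest two's complement field width, in bits, that can hold `distance`."""
    bits = 1
    while not (-(1 << (bits - 1)) <= distance <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

def branch_target(pc, displacement, instr_bytes=4):
    """Target address of a PC-relative branch.

    Assumes the displacement counts instructions and is added to the
    branch's own address (an assumption made for this sketch).
    """
    return pc + displacement * instr_bytes

print(displacement_bits(100))          # a branch 100 instructions forward
print(displacement_bits(-100))         # a branch 100 instructions backward
print(hex(branch_target(0x1000, -3)))  # loop back three instructions
```

Because the field is signed, each added bit doubles the reach in both directions, which is why a handful of bits covers most of the branches in the distribution.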
Conditional Branch Options
Since most changes in control flow are branches, deciding how to specify the branch condition is important. Figure 2.21 shows the three primary techniques in use today and their advantages and disadvantages.
Name | Examples | How condition is tested | Advantages | Disadvantages
Condition code (CC) | 80x86, ARM, PowerPC, SPARC, SuperH | Special bits are set by ALU operations, possibly under program control. | Sometimes condition is set for free. | CC is extra state. Condition codes constrain the ordering of instructions since they pass information from one instruction to a branch.
Condition register | Alpha, MIPS | Tests arbitrary register with the result of a comparison. | Simple. | Uses up a register.
Compare and branch | PA-RISC, VAX | Compare is part of the branch. Often compare is limited to a subset. | One instruction rather than two for a branch. | May be too much work per instruction for pipelined execution.
FIGURE 2.21 The major methods for evaluating branch conditions, their advantages, and their disadvantages. Although condition codes can be set by ALU operations that are needed for other purposes, measurements on programs show that this rarely happens. The major implementation problems with condition codes arise when the condition code is set by a large or haphazardly chosen subset of the instructions, rather than being controlled by a bit in the instruction. Computers with compare and branch often limit the set of compares and use a condition register for more complex compares. Often, different techniques are used for branches based on floating-point comparison versus those based on integer comparison. This dichotomy is reasonable since the number of branches that depend on floating-point comparisons is much smaller than the number depending on integer comparisons.
One of the most noticeable properties of branches is that a large number of the comparisons are simple tests, and a large number are comparisons with zero. Thus, some architectures choose to treat these comparisons as special cases, especially if a compare and branch instruction is being used. Figure 2.22 shows the frequency of different comparisons used for conditional branching.
[Figure 2.22: horizontal bar chart. Categories: not equal, equal, greater than or equal, greater than, less than or equal, less than. Series: floating-point average and integer average. Horizontal axis: frequency of comparison types in branches, 0% to 50%.]
FIGURE 2.22 Frequency of different types of compares in conditional branches. Less than (or equal) branches dominate this combination of compiler and architecture. These measurements include both the integer and floating-point compares in branches. The programs and computer used to collect these statistics are the same as those in Figure 2.8.
DSPs add another looping structure, usually called a repeat instruction. It allows a single instruction or a block of instructions to be repeated up to, say, 256 times. For example, the TMS320C54 dedicates three special registers to hold the block starting address, ending address, and repeat counter. The memory instructions in a repeat loop will typically have autoincrement or autodecrement addressing to access a vector. The goal of such instructions is to avoid loop overhead, which can be significant in the small loops of DSP kernels.
Procedure Invocation Options
Procedure calls and returns include control transfer and possibly some state saving; at a minimum the return address must be saved somewhere, sometimes in a special link register or just a GPR. Some older architectures provide a mechanism to save many registers, while newer architectures require the compiler to generate stores and loads for each register saved and restored.
There are two basic conventions in use to save registers: either at the call site or inside the procedure being called. Caller saving means that the calling procedure must save the registers that it wants preserved for access after the call, and thus the called procedure need not worry about registers. Callee saving is the opposite: the called procedure must save the registers it wants to use, leaving the caller unrestrained. There are times when caller save must be used because of access patterns to globally visible variables in two different procedures. For example, suppose we have a procedure P1 that calls procedure P2, and both procedures manipulate the global variable x. If P1 had allocated x to a register, it must be sure to save x to a location known by P2 before the call to P2. A compiler’s ability to discover when a called procedure may access register-allocated quantities is complicated by the possibility of separate compilation. Suppose P2 may not touch x but can call another procedure, P3, that may access x, yet P2 and P3 are compiled separately. Because of these complications, most compilers will conservatively caller save any variable that may be accessed during a call.
In the cases where either convention could be used, some programs will do better with callee save and some will do better with caller save. As a result, most real systems today use a combination of the two mechanisms.
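Why neither convention wins everywhere can be seen by counting saves for a single call. The sketch below is a deliberately simplified cost model, not a description of any real ABI: it assumes the caller saves every register whose value it needs after the call, the callee saves every register it overwrites, and a mixed convention splits the register file into two fixed sets. All names are hypothetical.

```python
def caller_save_count(live_across_call):
    """Pure caller save: the caller saves every register it needs after the call."""
    return len(live_across_call)

def callee_save_count(used_by_callee):
    """Pure callee save: the callee saves every register it is going to overwrite."""
    return len(used_by_callee)

def hybrid_save_count(live_across_call, used_by_callee, callee_saved_regs):
    """Mixed convention: each side saves only the registers assigned to it."""
    caller_part = {r for r in live_across_call if r not in callee_saved_regs}
    callee_part = {r for r in used_by_callee if r in callee_saved_regs}
    return len(caller_part) + len(callee_part)

# A caller with many live values calling a small leaf procedure.
live = {"r1", "r2", "r3"}
used = {"r1"}
print(caller_save_count(live), callee_save_count(used))
print(hybrid_save_count(live, used, callee_saved_regs={"r1", "r2"}))
```

Swapping the two sets (few live values, a callee that uses many registers) reverses which pure convention is cheaper, which is the motivation for dividing the registers between the two.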
This convention is specified in an application binary interface (ABI) that sets down the basic rules as to which registers should be caller saved and which should be callee saved. Later in this chapter we will examine the mismatch between sophisticated instructions for automatically saving registers and the needs of the compiler.
Summary: Instructions for Control Flow
Control flow instructions are some of the most frequently executed instructions. Although there are many options for conditional branches, we would expect branch addressing in a new architecture to be able to jump to hundreds of instructions either above or below the branch. This requirement suggests a PC-relative branch displacement of at least 8 bits. We would also expect to see register-indirect and PC-relative addressing for jump instructions to support returns as well as many other features of current systems.
We have now completed our instruction architecture tour at the level seen by an assembly language programmer or compiler writer. We are leaning towards a register-register architecture with displacement, immediate, and register indirect addressing modes. The data are 8-, 16-, 32-, and 64-bit integers and 32- and 64-bit floating-point data. The instructions include simple operations, PC-relative conditional branches, jump and link instructions for procedure call, and register indirect jumps for procedure return (plus a few other uses). Now we need to select how to represent this architecture in a form that makes it easy for the hardware to execute.
2.10
Encoding an Instruction Set
Clearly, the choices mentioned above will affect how the instructions are encoded into a binary representation for execution by the processor. This representation affects not only the size of the compiled program; it also affects the implementation of the processor, which must decode this representation to quickly find the operation and its operands. The operation is typically specified in one field, called the opcode. As we shall see, the important decision is how to encode the addressing modes with the operations.
This decision depends on the range of addressing modes and the degree of independence between opcodes and modes. Some older computers have one to five operands with 10 addressing modes for each operand (see Figure 2.6 on page 108). For such a large number of combinations, typically a separate address specifier is needed for each operand: the address specifier tells what addressing mode is used to access the operand. At the other extreme are load-store computers with only one memory operand and only one or two addressing modes; obviously, in this case, the addressing mode can be encoded as part of the opcode.
When encoding the instructions, the number of registers and the number of addressing modes both have a significant impact on the size of instructions, as the register field and addressing mode field may appear many times in a single instruction. In fact, for most instructions many more bits are consumed in encoding addressing modes and register fields than in specifying the opcode. The architect must balance several competing forces when encoding the instruction set:
1. The desire to have as many registers and addressing modes as possible.
2. The impact of the size of the register and addressing mode fields on the average instruction size and hence on the average program size.
3. A desire to have instructions encoded into lengths that will be easy to handle in a pipelined implementation. (The importance of having easily decoded instructions is discussed in Chapters 3 and 4.) As a minimum, the architect wants instructions to be in multiples of bytes, rather than an arbitrary bit length. Many desktop and server architects have chosen to use a fixed-length instruction to gain implementation benefits while sacrificing average code size.
Figure 2.23 shows three popular choices for encoding the instruction set. The first we call variable, since it allows virtually all addressing modes to be used with all operations. This style is best when there are many addressing modes and operations. The second choice we call fixed, since it combines the operation and the addressing mode into the opcode. Often fixed encoding will have only a single size for all instructions; it works best when there are few addressing modes and operations. The trade-off between variable encoding and fixed encoding is size of programs versus ease of decoding in the processor. Variable tries to use as few bits as possible to represent the program, but individual instructions can vary widely in both size and the amount of work to be performed.
Let’s look at an 80x86 instruction to see an example of the variable encoding:
add EAX,1000(EBX)
The name add means a 32-bit integer add instruction with two operands, and this opcode takes 1 byte. An 80x86 address specifier is 1 or 2 bytes, specifying the source/destination register (EAX) and the addressing mode (displacement in this case) and base register (EBX) for the second operand. This combination takes one byte to specify the operands. When in 32-bit mode (see Appendix C), the size of the address field is either 1 byte or 4 bytes. Since 1000 is bigger than 2^8, the total length of the instruction is 1 + 1 + 4 = 6 bytes.
The length of 80x86 instructions varies between 1 and 17 bytes. 80x86 programs are generally smaller than those for the RISC architectures, which use fixed formats (Appendix B).
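The byte count in the worked example can be reproduced with a small sketch. It follows the simplified rule in the text (a 1-byte displacement when the value fits below 2^8, otherwise 4 bytes) and covers only this one instruction shape; it is not a general 80x86 length decoder, and the function name is hypothetical. Actual 80x86 encodings treat the one-byte displacement as a signed value, so the real cutoff differs slightly from the one used here.

```python
def x86_length_sketch(displacement, opcode_bytes=1, specifier_bytes=1):
    """Length in bytes of a register/base+displacement instruction, per the text's rule."""
    # Simplified cutoff from the text: small displacements take 1 byte, others 4.
    disp_bytes = 1 if 0 <= displacement < 2**8 else 4
    return opcode_bytes + specifier_bytes + disp_bytes

print(x86_length_sketch(1000))  # add EAX,1000(EBX): 1 + 1 + 4
print(x86_length_sketch(100))   # same shape with a small displacement: 1 + 1 + 1
```

The same operation therefore costs half as many bytes when the displacement is small, which is the kind of saving a variable encoding is designed to capture.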
(a) Variable (e.g., VAX, Intel 80x86)
Operation and no. of operands | Address specifier 1 | Address field 1 | ... | Address specifier n | Address field n
(b) Fixed (e.g., Alpha, ARM, MIPS, PowerPC, SPARC, SuperH)
Operation | Address field 1 | Address field 2 | Address field 3
(c) Hybrid (e.g., IBM 360/70, MIPS16, Thumb, TI TMS320C54x)
Operation | Address specifier | Address field
Operation | Address specifier 1 | Address specifier 2 | Address field
Operation | Address specifier | Address field 1 | Address field 2
FIGURE 2.23 Three basic variations in instruction encoding: variable length, fixed length, and hybrid. The variable format can support any number of operands, with each address specifier determining the addressing mode and the length of the specifier for that operand. It generally enables the smallest code representation, since unused fields need not be included. The fixed format always has the same number of operands, with the addressing modes (if options exist) specified as part of the opcode (see also Figure C.3 on page C-4). It generally results in the largest code size. Although the fields tend not to vary in their location, they will be used for different purposes by different instructions. The hybrid approach has multiple formats specified by the opcode, adding one or two fields to specify the addressing mode and one or two fields to specify the operand address (see also Figure D.7 on page D-12).
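The decoding advantage of the fixed format in Figure 2.23(b) is that every field sits at a known bit position, so it can be extracted with a shift and a mask before the opcode has even been examined. The sketch below assumes a hypothetical 32-bit layout with a 6-bit opcode and three 5-bit register fields; the field widths and positions are chosen for illustration and are not taken from any of the architectures named in the figure.

```python
# Hypothetical fixed 32-bit layout (bit 31 is the most significant):
#   [31:26] opcode   [25:21] rd   [20:16] rs1   [15:11] rs2   [10:0] unused

def encode(opcode, rd, rs1, rs2):
    """Pack the four fields into one 32-bit instruction word."""
    assert 0 <= opcode < 64 and all(0 <= r < 32 for r in (rd, rs1, rs2))
    return (opcode << 26) | (rd << 21) | (rs1 << 16) | (rs2 << 11)

def decode(word):
    """Extract all fields with fixed shifts and masks, independent of the opcode."""
    return (
        (word >> 26) & 0x3F,  # opcode
        (word >> 21) & 0x1F,  # rd
        (word >> 16) & 0x1F,  # rs1
        (word >> 11) & 0x1F,  # rs2
    )

word = encode(opcode=5, rd=1, rs1=2, rs2=3)
print(hex(word), decode(word))
```

A variable format cannot do this: the position of each later field depends on the specifiers that precede it, so the decoder has to work through the instruction sequentially.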
Given these two poles of instruction set design of variable and fixed, a third alternative immediately springs to mind: Reduce the variability in size and work of the variable architecture but provide multiple instruction lengths to reduce code size. This hybrid approach is the third encoding alternative, and we’ll see examples shortly.
Reduced Code Size in RISCs
As RISC computers started being used in embedded applications, the 32-bit fixed format became a liability since cost, and hence smaller code, is important. In response, several manufacturers offered a new hybrid version of their RISC instruction sets, with both 16-bit and 32-bit instructions. The narrow instructions
support fewer operations, smaller address and immediate fields, fewer registers, and two-address format rather than the classic three-address format of RISC computers. Appendix B gives two examples, the ARM Thumb and MIPS MIPS16, which both claim a code size reduction of up to 40%.
In contrast to these instruction set extensions, IBM simply compresses its standard instruction set, and then adds hardware to decompress instructions as they are fetched from memory on an instruction cache miss. Thus, the instruction cache contains full 32-bit instructions, but compressed code is kept in main memory, ROMs, and the disk. The advantage of MIPS16 and Thumb is that instruction caches act as if they are about 25% larger, while IBM’s CodePack means that compilers need not be changed to handle different instruction sets and instruction decoding can remain simple.
CodePack starts with run-length encoding compression on any PowerPC program, and then loads the resulting compression tables in a 2KB table on chip. Hence, every program has its own unique encoding. To handle branches, which are no longer to an aligned word boundary, the PowerPC creates a hash table in memory that maps between compressed and uncompressed addresses. Like a TLB (Chapter 5), it caches the most recently used address maps to reduce the number of memory accesses. IBM claims an overall performance cost of 10% in exchange for a code size reduction of 35% to 40%.
Hitachi simply invented a RISC instruction set with a fixed, 16-bit format, called SuperH, for embedded applications (see Appendix B). It has 16 rather than 32 registers to make it fit the narrower format and fewer instructions, but otherwise looks like a classic RISC architecture.
Summary: Encoding the Instruction Set
Decisions made in the components of instruction set design discussed in prior sections determine whether the architect has the choice between variable and fixed instruction encodings.
Given the choice, the architect more interested in code size than performance will pick variable encoding, and the one more interested in performance than code size will pick fixed encoding. The appendices give 11 examples of the results of architects’ choices. In Chapters 3 and 4, the impact of variability on performance of the processor will be discussed further. We have almost finished laying the groundwork for the MIPS instruction set architecture that will be introduced in section 2.12. Before we do that, however, it will be helpful to take a brief look at compiler technology and its effect on program properties.
2.11
Crosscutting Issues: The Role of Compilers
Today almost all programming is done in high-level languages for desktop and server applications. This development means that since most instructions executed are the output of a compiler, an instruction set architecture is essentially a compiler target. In earlier times for these applications, and currently for DSPs, architectural decisions were often made to ease assembly language programming or for a specific kernel. Because the compiler will significantly affect the performance of a computer, understanding compiler technology today is critical to designing and efficiently implementing an instruction set.
Once it was popular to try to isolate the compiler technology and its effect on hardware performance from the architecture and its performance, just as it was popular to try to separate architecture from its implementation. This separation is essentially impossible with today’s desktop compilers and computers. Architectural choices affect the quality of the code that can be generated for a computer and the complexity of building a good compiler for it, for better or for worse. For example, section 2.14 shows the substantial performance impact on a DSP of compiling vs. hand optimizing the code.
In this section, we discuss the critical goals in the instruction set primarily from the compiler viewpoint. We start with a review of the anatomy of current compilers. Next we discuss how compiler technology affects the decisions of the architect, and how the architect can make it hard or easy for the compiler to produce good code. We conclude with a review of compilers and multimedia operations, which unfortunately is a bad example of cooperation between compiler writers and architects.
The Structure of Recent Compilers
To begin, let’s look at what optimizing compilers are like today. Figure 2.24 shows the structure of recent compilers.
A compiler writer’s first goal is correctness: all valid programs must be compiled correctly. The second goal is usually speed of the compiled code.
Typically, a whole set of other goals follows these two, including fast compilation, debugging support, and interoperability among languages. Normally, the passes in the compiler transform higher-level, more abstract representations into progressively lower-level representations. Eventually the representation reaches the instruction set. This structure helps manage the complexity of the transformations and makes writing a bug-free compiler easier.
The complexity of writing a correct compiler is a major limitation on the amount of optimization that can be done. Although the multiple-pass structure helps reduce compiler complexity, it also means that the compiler must order and perform some transformations before others. In the diagram of the optimizing compiler in Figure 2.24, we can see that certain high-level optimizations are performed long before it is known what the resulting code will look like. Once such a transformation is made, the compiler can’t afford to go back and revisit all steps, possibly undoing transformations. Such iteration would be prohibitive, both in compilation time and in complexity. Thus, compilers make assumptions about the ability of later steps to deal with certain problems. For example, compilers usually have to choose which procedure calls to expand in-line before they know the exact size of the procedure being called. Compiler writers call this problem the phase-ordering problem.
Pass | Dependencies | Function
Front end per language | Language dependent; machine independent | Transform language to common intermediate form
High-level optimizations | Somewhat language dependent, largely machine independent | For example, loop transformations and procedure inlining (also called procedure integration)
Global optimizer | Small language dependencies; machine dependencies slight (e.g., register counts/types) | Including global and local optimizations + register allocation
Code generator | Highly machine dependent; language independent | Detailed instruction selection and machine-dependent optimizations; may include or be followed by assembler
(The front end passes an intermediate representation to the high-level optimizations.)
FIGURE 2.24 Compilers typically consist of two to four passes, with more highly optimizing compilers having more passes. This structure maximizes the probability that a program compiled at various levels of optimization will produce the same output when given the same input. The optimizing passes are designed to be optional and may be skipped when faster compilation is the goal and lower quality code is acceptable. A pass is simply one phase in which the compiler reads and transforms the entire program. (The term phase is often used interchangeably with pass.) Because the optimizing passes are separated, multiple languages can use the same optimizing and code-generation passes. Only a new front end is required for a new language.
How does this ordering of transformations interact with the instruction set architecture? A good example occurs with the optimization called global common subexpression elimination. This optimization finds two instances of an expression that compute the same value and saves the value of the first computation in a temporary. It then uses the temporary value, eliminating the second computation of the common expression. For this optimization to be significant, the temporary must be allocated to a register. Otherwise, the cost of storing the temporary in memory and later reloading it may negate the savings gained by not recomputing the expression. There are, in fact, cases where this optimization actually slows down code when the temporary is not register allocated. Phase ordering complicates this problem, because register allocation is typically done near the end of the global optimization pass, just before code generation. Thus, an optimizer that performs this optimization must assume that the register allocator will allocate the temporary to a register.
Optimizations performed by modern compilers can be classified by the style of the transformation, as follows:
1. High-level optimizations are often done on the source with output fed to later optimization passes.
2. Local optimizations optimize code only within a straight-line code fragment (called a basic block by compiler people).
3. Global optimizations extend the local optimizations across branches and introduce a set of transformations aimed at optimizing loops.
4. Register allocation.
5. Processor-dependent optimizations attempt to take advantage of specific architectural knowledge.
Register Allocation
Because of the central role that register allocation plays, both in speeding up the code and in making other optimizations useful, it is one of the most important (if not the most important) optimizations. Register allocation algorithms today are based on a technique called graph coloring. The basic idea behind graph coloring is to construct a graph representing the possible candidates for allocation to a register and then to use the graph to allocate registers. Roughly speaking, the problem is how to use a limited set of colors so that no two adjacent nodes in a dependency graph have the same color. The emphasis in the approach is to achieve 100% register allocation of active variables. The problem of coloring a graph in general can take exponential time as a function of the size of the graph (NP-complete). There are heuristic algorithms, however, that work well in practice, running in near-linear time and yielding allocations close to optimal.
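The coloring idea can be sketched with a simple greedy heuristic. This is an illustration only, not the algorithm used by any particular compiler: it assumes the graph is given as a dictionary mapping each variable to the set of variables it cannot share a register with, visits the most-constrained variables first, and marks a variable as spilled (left in memory) when no color is free. The name `color_graph` and the ordering rule are assumptions.

```python
def color_graph(interference, num_registers):
    """Greedy coloring: returns (assignment, spilled).

    interference maps each variable to the set of variables that are live
    at the same time; colors 0..num_registers-1 stand for registers.
    """
    assignment = {}
    spilled = []
    # Visit nodes with the most neighbors first (ties broken by name).
    for node in sorted(interference, key=lambda n: (-len(interference[n]), n)):
        used = {assignment[m] for m in interference[node] if m in assignment}
        free = [c for c in range(num_registers) if c not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # no register left for this variable
    return assignment, spilled

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(color_graph(graph, 3))
print(color_graph(graph, 2))
```

With three registers every variable gets one; with two, the triangle formed by a, b, and c cannot be colored and one of them is spilled, which previews why a small register set limits the technique.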
Graph coloring works best when there are at least 16 (and preferably more) general-purpose registers available for global allocation for integer variables and additional registers for floating point. Unfortunately, graph coloring does not work very well when the number of registers is small because the heuristic algorithms for coloring the graph are likely to fail.
Impact of Optimizations on Performance
It is sometimes difficult to separate some of the simpler optimizations (local and processor-dependent optimizations) from transformations done in the code generator. Examples of typical optimizations are given in Figure 2.25. The last column of Figure 2.25 indicates the frequency with which the listed optimizing transforms were applied to the source program.
Optimization name | Explanation | Percentage of the total number of optimizing transforms
High-level | At or near the source level; processor-independent |
Procedure integration | Replace procedure call by procedure body | N.M.
Local | Within straight-line code |
Common subexpression elimination | Replace two instances of the same computation by single copy | 18%
Constant propagation | Replace all instances of a variable that is assigned a constant with the constant | 22%
Stack height reduction | Rearrange expression tree to minimize resources needed for expression evaluation | N.M.
Global | Across a branch |
Global common subexpression elimination | Same as local, but this version crosses branches | 13%
Copy propagation | Replace all instances of a variable A that has been assigned X (i.e., A = X) with X | 11%
Code motion | Remove code from a loop that computes same value each iteration of the loop | 16%
Induction variable elimination | Simplify/eliminate array-addressing calculations within loops | 2%
Processor-dependent | Depends on processor knowledge |
Strength reduction | Many examples, such as replace multiply by a constant with adds and shifts | N.M.
Pipeline scheduling | Reorder instructions to improve pipeline performance | N.M.
Branch offset optimization | Choose the shortest branch displacement that reaches target | N.M.
FIGURE 2.25 Major types of optimizations and examples in each class. These data tell us about the relative frequency of occurrence of various optimizations. The third column lists the static frequency with which some of the common optimizations are applied in a set of 12 small FORTRAN and Pascal programs. There are nine local and global optimizations done by the compiler included in the measurement. Six of these optimizations are covered in the figure, and the remaining three account for 18% of the total static occurrences. The abbreviation N.M. means that the number of occurrences of that optimization was not measured. Processor-dependent optimizations are usually done in a code generator, and none of those was measured in this experiment. The percentage is the portion of the static optimizations that are of the specified type. Data from Chow [1983] (collected using the Stanford UCODE compiler).
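A few of the transformations in Figure 2.25 can be shown side by side in C. This is a hand-written sketch of what an optimizer might produce, not output from any compiler; the function names sum_scaled and sum_scaled_opt are hypothetical.

```c
/* Straightforward form: x * y is computed twice per iteration, the value of
   scale never changes inside the loop, and i * 8 is a multiply by a constant. */
long sum_scaled(const long *a, int n, long x, long y)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        long scale = x * y + 8;
        sum += a[i] * scale + (x * y) + (long)i * 8;
    }
    return sum;
}

/* Same computation after three transformations applied by hand. */
long sum_scaled_opt(const long *a, int n, long x, long y)
{
    long xy = x * y;        /* common subexpression elimination: one multiply */
    long scale = xy + 8;    /* code motion: loop-invariant value hoisted */
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * scale + xy + ((long)i << 3);  /* strength reduction */
    return sum;
}
```

Both versions return the same result for the same inputs; only the number of operations executed inside the loop differs.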
Figure 2.26 shows the effect of various optimizations on instructions executed for two programs. In this case, optimized programs executed roughly 25% to 90% fewer instructions than unoptimized programs. The figure illustrates the importance of looking at optimized code before suggesting new instruction set features, for a compiler might completely remove the instructions the architect was trying to improve.
2.11
Crosscutting Issues: The Role of Compilers
[Figure 2.26: horizontal bar chart. Vertical axis: program and compiler optimization level. Horizontal axis: % of unoptimized instructions executed (0% to 100%). Each bar is divided into Branches/Calls, Fl. Pt. ALU Ops, Loads/Stores, and Integer ALU Ops. Bar totals: lucas, level 3: 11%; lucas, level 2: 12%; lucas, level 1: 21%; lucas, level 0: 100%; mcf, level 3: 76%; mcf, level 2: 76%; mcf, level 1: 84%; mcf, level 0: 100%.]
FIGURE 2.26 Change in instruction count for the programs lucas and mcf from the SPEC2000 as compiler optimization levels vary. Level 0 is the same as unoptimized code. Level 1 includes local optimizations, code scheduling, and local register allocation. Level 2 includes global optimizations, loop transformations (software pipelining), and global register allocation. Level 3 adds procedure integration. These experiments were performed on the Alpha compilers.
The Impact of Compiler Technology on the Architect’s Decisions
The interaction of compilers and high-level languages significantly affects how programs use an instruction set architecture. There are two important questions: How are variables allocated and addressed? How many registers are needed to allocate variables appropriately? To address these questions, we must look at the three separate areas in which current high-level languages allocate their data:
• The stack is used to allocate local variables. The stack is grown and shrunk on procedure call or return, respectively. Objects on the stack are addressed relative to the stack pointer and are primarily scalars (single variables) rather than arrays. The stack is used for activation records, not as a stack for evaluating expressions. Hence, values are almost never pushed or popped on the stack.
• The global data area is used to allocate statically declared objects, such as global variables and constants. A large percentage of these objects are arrays or other aggregate data structures.
• The heap is used to allocate dynamic objects that do not adhere to a stack discipline. Objects in the heap are accessed with pointers and are typically not scalars.
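The three areas listed above can be seen in a few lines of C. This is only an illustrative sketch; the names table and sum_with_copy are hypothetical.

```c
#include <stdlib.h>

/* Global data area: a statically declared aggregate. */
int table[4] = {1, 2, 3, 4};

int sum_with_copy(void)
{
    int total = 0;                     /* stack: a local scalar */
    int *copy = malloc(sizeof table);  /* heap: reached only through a pointer */
    if (copy == NULL)
        return -1;
    for (int i = 0; i < 4; i++) {
        copy[i] = table[i];
        total += copy[i];
    }
    free(copy);
    return total;
}
```

Here total is the kind of object a register allocator handles well, while table and the block behind copy are aggregates that normally stay in memory.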
Register allocation is much more effective for stack-allocated objects than for global variables, and register allocation is essentially impossible for heap-allocated objects because they are accessed with pointers. Global variables and some stack variables are impossible to allocate because they are aliased, which means that there are multiple ways to refer to the address of a variable, making it illegal to put it into a register. (Most heap variables are effectively aliased for today’s compiler technology.) For example, consider the following code sequence, where & returns the address of a variable and * dereferences a pointer:
    p = &a      -- gets address of a in p
    a = ...     -- assigns to a directly
    *p = ...    -- uses p to assign to a
    ...a...     -- accesses a
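The same sequence can be written as compilable C. This is a sketch with hypothetical function names; the second function shows why a compiler that sees only pointer parameters has to be conservative.

```c
/* The four-step sequence: a is written directly and then through p. */
int alias_sequence(void)
{
    int a;
    int *p = &a;   /* gets address of a in p */
    a = 1;         /* assigns to a directly */
    *p = 2;        /* uses p to assign to a */
    return a;      /* accesses a: must see 2, not a stale copy of 1 */
}

/* From this function alone the compiler cannot tell whether x and y
   refer to the same object, so it must reload *x after the store to *y. */
int store_store_load(int *x, int *y)
{
    *x = 1;
    *y = 2;
    return *x;     /* 2 if x and y alias, 1 otherwise */
}
```

Calling store_store_load with two different variables returns 1, while passing the same address twice returns 2.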
The variable a could not be register allocated across the assignment to *p without generating incorrect code. Aliasing causes a substantial problem because it is often difficult or impossible to decide what objects a pointer may refer to. A compiler must be conservative; some compilers will not allocate any local variables of a procedure in a register when there is a pointer that may refer to one of the local variables.
How the Architect Can Help the Compiler Writer
Today, the complexity of a compiler does not come from translating simple statements like A = B + C. Most programs are locally simple, and simple translations work fine. Rather, complexity arises because programs are large and globally complex in their interactions, and because the structure of compilers means decisions are made one step at a time about which code sequence is best. Compiler writers often are working under their own corollary of a basic principle in architecture: Make the frequent cases fast and the rare case correct. That is, if we know which cases are frequent and which are rare, and if generating code for both is straightforward, then the quality of the code for the rare case may not be very important, but it must be correct! Some instruction set properties help the compiler writer. These properties should not be thought of as hard and fast rules, but rather as guidelines that will make it easier to write a compiler that will generate efficient and correct code.
1. Regularity: Whenever it makes sense, the three primary components of an instruction set (the operations, the data types, and the addressing modes) should be orthogonal. Two aspects of an architecture are said to be orthogonal if they are independent. For example, the operations and addressing modes are orthogonal if for every operation to which one addressing mode can be applied, all addressing modes are applicable. This regularity helps simplify code generation and is particularly important when the decision about what code to generate is split into two passes in the compiler. A good counterexample of this property is restricting what registers can be used for a certain class of instructions. Compilers for special-purpose register architectures typically get stuck in this dilemma. This restriction can result in the compiler finding itself with lots of available registers, but none of the right kind!
2. Provide primitives, not solutions: Special features that “match” a language construct or a kernel function are often unusable. Attempts to support high-level languages may work only with one language, or do more or less than is required for a correct and efficient implementation of the language. An example of how such attempts have failed is given in section 2.14.
3. Simplify trade-offs among alternatives: One of the toughest jobs a compiler writer has is figuring out what instruction sequence will be best for every segment of code that arises. In earlier days, instruction counts or total code size might have been good metrics, but, as we saw in the last chapter, this is no longer true. With caches and pipelining, the trade-offs have become very complex. Anything the designer can do to help the compiler writer understand the costs of alternative code sequences would help improve the code. One of the most difficult instances of complex trade-offs occurs in a register-memory architecture in deciding how many times a variable should be referenced before it is cheaper to load it into a register. This threshold is hard to compute and, in fact, may vary among models of the same architecture.
4. Provide instructions that bind the quantities known at compile time as constants: A compiler writer hates the thought of the processor interpreting at runtime a value that was known at compile time. Good counterexamples of this principle include instructions that interpret values that were fixed at compile time. For instance, the VAX procedure call instruction (calls) dynamically interprets a mask saying what registers to save on a call, but the mask is fixed at compile time (see section 2.14).
Compiler Support (or lack thereof) for Multimedia Instructions
Alas, the designers of the SIMD instructions that operate on several narrow data items in a single clock cycle consciously ignored the prior subsection. These instructions tend to be solutions, not primitives; they are short of registers; and the data types do not match existing programming languages. Architects hoped to find an inexpensive solution that would help some users, but in reality, only a few low-level graphics library routines use them.
The SIMD instructions are really an abbreviated version of an elegant architecture style that has its own compiler technology. As explained in Appendix F, vector architectures operate on vectors of data. Vector architectures were invented originally for scientific codes, but multimedia kernels are often vectorizable as well. Hence, we can think of Intel’s MMX or PowerPC’s AltiVec as simply short vector computers: MMX with vectors of eight 8-bit elements, four 16-bit elements, or two 32-bit elements, and AltiVec with vectors twice that length. They are implemented as simply adjacent, narrow elements in wide registers. These abbreviated architectures build the vector register size into the architecture: the sum of the sizes of the elements is limited to 64 bits for MMX and 128 bits for AltiVec. When Intel decided to expand to 128-bit vectors, it added a whole new set of instructions, called SSE.
The missing elegance from these architectures involves the specification of the vector length and the memory addressing modes. With a variable vector width, vectors switch seamlessly between different data widths simply by increasing the number of elements per vector. For example, vectors could have, say, 32 64-bit elements, 64 32-bit elements, 128 16-bit elements, and 256 8-bit elements. Another advantage is that the number of elements per vector register can vary between generations while remaining binary compatible. One generation might have 32 64-bit elements per vector register, and the next have 64 64-bit elements. (The number of elements per register is located in a status register.) The number of elements executed per clock cycle is also implementation dependent, and all run the same binary code. Thus, one generation might operate on 64 bits per clock cycle, and another on 256 bits per clock cycle. A major advantage of vector computers is hiding latency of memory access by loading many elements at once and then overlapping execution with data transfer.
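The element counts quoted above follow from dividing the register width by the element width. The short sketch below makes that arithmetic explicit; the function names are hypothetical, and the 2048-bit register width is inferred from the example of 32 64-bit elements rather than stated in the text.

```c
/* Number of elements of a given width that fit in one vector register. */
unsigned elements_per_register(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}

/* Clock cycles needed to process one full register when the datapath
   handles datapath_bits per cycle. */
unsigned cycles_per_vector_op(unsigned register_bits, unsigned datapath_bits)
{
    return register_bits / datapath_bits;
}
```

For a 64-bit MMX register this gives eight 8-bit, four 16-bit, or two 32-bit elements. For a 2048-bit vector register, a 64-bit datapath needs 32 cycles per operation and a 256-bit datapath needs 8, while both run the same binary code.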
The goal of vector addressing modes is to collect data scattered about memory, place them in a compact form so that they can be operated on efficiently, and then place the results back where they belong. Over the years traditional vector computers added strided addressing and gather/scatter addressing to increase the number of programs that can be vectorized. Strided addressing skips a fixed number of words between each access, so sequential addressing is often called unit stride addressing. Gather and scatter find their addresses in another vector register: think of it as register indirect addressing for vector computers. From a vector perspective, in contrast, these short-vector SIMD computers support only unit strided accesses: memory accesses load or store all elements at once from a single wide memory location. Since the data for multimedia applications are often streams that start and end in memory, strided and gather/scatter addressing modes are essential to successful vectorization.
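The three addressing styles can be modeled with scalar loops. This sketch describes only what each access pattern reads or writes, not how vector hardware implements it; the function names are hypothetical, and strides and indices are counted in elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Strided load: element i comes from base[i * stride].
   A stride of 1 is the unit-stride (sequential) case. */
void strided_load(uint8_t *dst, const uint8_t *base, size_t stride, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = base[i * stride];
}

/* Gather: the addresses come from an index vector. */
void gather(uint8_t *dst, const uint8_t *base, const size_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = base[index[i]];
}

/* Scatter: the inverse of gather, writing results back where they belong. */
void scatter(uint8_t *base, const uint8_t *src, const size_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        base[index[i]] = src[i];
}
```

A stride of 3 over interleaved three-byte pixels, for example, collects one color component from each pixel into a compact vector.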
EXAMPLE
As an example, compare a vector computer to MMX for color representation conversion of pixels from RGB (red green blue) to YUV (luminosity chrominance), with each pixel represented by three bytes. The conversion is just 3 lines of C code placed in a loop:
    Y = ( 9798*R + 19235*G +  3736*B) / 32768;
    U = (-4784*R -  9437*G + 14221*B) / 32768 + 128;
    V = (20218*R - 16941*G -  3277*B) / 32768 + 128;
A 64-bit wide vector computer can calculate eight pixels simultaneously. One vector computer for media with strided addresses takes:
• 3 vector loads (to get RGB),
• 3 vector multiplies (to convert R),
• 6 vector multiply adds (to convert G and B),
• 3 vector shifts (to divide by 32768),
• 2 vector adds (to add 128), and
• 3 vector stores (to store YUV).
The total is 20 instructions to perform the 20 operations in the C code above to convert 8 pixels [Kozyrakis 2000]. (Since a vector might have 32 64-bit elements, this code actually converts up to 32 x 8 or 256 pixels.) In contrast, Intel’s web site shows a library routine to perform the same calculation on eight pixels takes 116 MMX instructions plus 6 80x86 instructions [Intel 2001]. This sixfold increase in instructions is due to the large number of instructions to load and unpack RGB pixels and to pack and store YUV pixels, since there are no strided memory accesses.
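For reference, the three conversion lines can be placed in a complete loop. This is a sketch under several assumptions: the function name rgb_to_yuv is hypothetical, pixels are assumed to be interleaved R, G, B bytes, results are kept as int without clamping, and the blue weight in the U line is taken as 14221 so that the three U weights sum to zero and a gray pixel maps to U = 128.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n interleaved RGB pixels (3 bytes each) to separate Y, U, V arrays.
   Results are stored as int and are not clamped to a byte range. */
void rgb_to_yuv(const uint8_t *rgb, int *y, int *u, int *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Each color component is a stride-3 access into the pixel stream. */
        int R = rgb[3 * i];
        int G = rgb[3 * i + 1];
        int B = rgb[3 * i + 2];

        y[i] = ( 9798 * R + 19235 * G +  3736 * B) / 32768;
        u[i] = (-4784 * R -  9437 * G + 14221 * B) / 32768 + 128;
        v[i] = (20218 * R - 16941 * G -  3277 * B) / 32768 + 128;
    }
}
```

The stride-3 indexing in the loop body is exactly the access pattern that strided vector loads provide directly and that unit-stride SIMD code has to rebuild with unpack and pack instructions.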
Having short, architecture-limited vectors with few registers and simple memory addressing modes makes it more difficult to use vectorizing compiler technology. Another challenge is that no programming language (yet) has support for operations on these narrow data. Hence, these SIMD instructions are commonly found only in hand-coded libraries.
Summary: The Role of Compilers
This section leads to several recommendations. First, we expect a new instruction set architecture to have at least 16 general-purpose registers (not counting separate registers for floating-point numbers) to simplify allocation of registers using graph coloring. The advice on orthogonality suggests that all supported addressing modes apply to all instructions that transfer data. Finally, the last three pieces of advice (provide primitives instead of solutions, simplify trade-offs between alternatives, don’t bind constants at runtime) all suggest that it is better to err on the side of simplicity. In other words, understand that less is more in the design of an instruction set. Alas, SIMD extensions are more an example of good marketing than of outstanding achievement in hardware/software co-design.
2.12
Putting It All Together: The MIPS Architecture
In this section we describe a simple 64-bit load-store architecture called MIPS. The instruction set architecture of MIPS and RISC relatives was based on observations similar to those covered in the last sections. (In section 2.16 we discuss how and why these architectures became popular.) Reviewing our expectations from each section for desktop applications:
• Section 2.2: Use general-purpose registers with a load-store architecture.
• Section 2.3: Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect.
• Section 2.5: Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers.
• Section 2.7: Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and, shift.
• Section 2.9: Support compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
• Section 2.10: Use fixed instruction encoding if interested in performance and use variable instruction encoding if interested in code size.
• Section 2.11: Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set. This section didn’t cover floating-point programs, but they often use separate floating-point registers. The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.
We introduce MIPS by showing how it follows these recommendations. Like most recent computers, MIPS emphasizes
• A simple load-store instruction set
• Design for pipelining efficiency (discussed in Appendix A), including a fixed instruction set encoding
• Efficiency as a compiler target
MIPS provides a good architectural model for study, not only because of the popularity of this type of processor (see Chapter 1), but also because it is an easy architecture to understand. We will use this architecture again in Chapters 3 and 4, and it forms the basis for a number of exercises and programming projects. In the 15 years since the first MIPS processor, there have been many versions of MIPS (see Appendix B). We will use a subset of what is now called MIPS64, which we will often abbreviate to just MIPS, but the full instruction set is found in Appendix B.
Registers for MIPS
MIPS64 has 32 64-bit general-purpose registers (GPRs), named R0, R1, ..., R31. GPRs are also sometimes known as integer registers. Additionally, there is a set of 32 floating-point registers (FPRs), named F0, F1, ..., F31, which can hold 32 single-precision (32-bit) values or 32 double-precision (64-bit) values. (When holding one single-precision number, the other half of the FPR is unused.) Both single- and double-precision floating-point operations (32-bit and 64-bit) are provided. MIPS also includes instructions that operate on two single-precision operands in a single 64-bit floating-point register. The value of R0 is always 0. We shall see later how we can use this register to synthesize a variety of useful operations from a simple instruction set. A few special registers can be transferred to and from the general-purpose registers. An example is the floating-point status register, used to hold information about the results of floating-point operations. There are also instructions for moving between an FPR and a GPR.
Data types for MIPS
The data types are 8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words for integer data and 32-bit single precision and 64-bit double precision for floating point. Half words were added because they are found in languages like C and are popular in some programs, such as operating systems, that are concerned about the size of data structures. They will also become more popular if Unicode becomes widely used. Single-precision floating-point operands were added for similar reasons. (Remember the earlier warning that you should measure many more programs before designing an instruction set.) The MIPS64 operations work on 64-bit integers and 32- or 64-bit floating point. Bytes, half words, and words are loaded into the general-purpose registers with either zeros or the sign bit replicated to fill the 64 bits of the GPRs. Once loaded, they are operated on with the 64-bit integer operations.
I-type instruction
Fields (width in bits): Opcode (6) | rs (5) | rt (5) | Immediate (16)
Encodes: Loads and stores of bytes, half words, words, double words; all immediates (rt ← rs op immediate); conditional branch instructions (rs is register, rd unused); jump register, jump and link register (rd = 0, rs = destination, immediate = 0)
R-type instruction
Fields (width in bits): Opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
Encodes: Register-register ALU operations (rd ← rs funct rt), where the function field encodes the data path operation: Add, Sub, ...; read/write special registers and moves
J-type instruction
Fields (width in bits): Opcode (6) | Offset added to PC (26)
Encodes: Jump and jump and link; trap and return from exception
FIGURE 2.27 Instruction layout for MIPS. All instructions are encoded in one of three types, with common fields in the same location in each format.
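Because every format keeps its fields at fixed positions, decoding reduces to shifts and masks. The sketch below packs and unpacks the I-type layout; the type and function names are hypothetical, and the opcode value used in any example is arbitrary rather than taken from the MIPS opcode map.

```c
#include <stdint.h>

typedef struct {
    unsigned opcode;   /* most-significant 6 bits */
    unsigned rs;       /* next 5 bits */
    unsigned rt;       /* next 5 bits */
    int32_t  imm;      /* low 16 bits, sign-extended */
} itype_fields;

/* Pack the four I-type fields into one 32-bit instruction word. */
uint32_t encode_itype(unsigned opcode, unsigned rs, unsigned rt, int32_t imm)
{
    return ((uint32_t)(opcode & 0x3F) << 26) |
           ((uint32_t)(rs & 0x1F) << 21) |
           ((uint32_t)(rt & 0x1F) << 16) |
           ((uint32_t)imm & 0xFFFF);
}

/* Extract the fields again; the immediate is sign-extended by hand. */
itype_fields decode_itype(uint32_t word)
{
    itype_fields f;
    f.opcode = (word >> 26) & 0x3F;
    f.rs     = (word >> 21) & 0x1F;
    f.rt     = (word >> 16) & 0x1F;
    f.imm    = (int32_t)(word & 0xFFFF);
    if (f.imm & 0x8000)
        f.imm -= 0x10000;
    return f;
}
```

Since rs and rt sit in the same place in the I-type and R-type formats, a decoder can begin reading registers before it knows which format it is looking at.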
Addressing modes for MIPS data transfers
The only data addressing modes are immediate and displacement, both with 16-bit fields. Register indirect is accomplished simply by placing 0 in the 16-bit displacement field, and absolute addressing with a 16-bit field is accomplished by using register 0 as the base register. Embracing zero gives us four effective modes, although only two are supported in the architecture.
MIPS memory is byte addressable in Big Endian mode with a 64-bit address. As it is a load-store architecture, all references between memory and either GPRs or FPRs are through loads or stores. Supporting the data types mentioned above, memory accesses involving GPRs can be to a byte, half word, word, or double word. The FPRs may be loaded and stored with single-precision or double-precision numbers. All memory accesses must be aligned.
MIPS Instruction Format
Since MIPS has just two addressing modes, these can be encoded into the opcode. Following the advice on making the processor easy to pipeline and decode,
all instructions are 32 bits with a 6-bit primary opcode. Figure 2.27 shows the instruction layout. These formats are simple while providing 16-bit fields for displacement addressing, immediate constants, or PC-relative branch addresses. Appendix B shows a variant of MIPS, called MIPS16, which has 16-bit and 32-bit instructions to improve code density for embedded applications. We will stick to the traditional 32-bit format in this book.
MIPS Operations
MIPS supports the list of simple operations recommended above plus a few others. There are four broad classes of instructions: loads and stores, ALU operations, branches and jumps, and floating-point operations.
Example instruction | Instruction name | Meaning
LD R1,30(R2) | Load double word | Regs[R1] ←_64 Mem[30+Regs[R2]]
LD R1,1000(R0) | Load double word | Regs[R1] ←_64 Mem[1000+0]
LW R1,60(R2) | Load word | Regs[R1] ←_64 (Mem[60+Regs[R2]]_0)^32 ## Mem[60+Regs[R2]]
LB R1,40(R3) | Load byte | Regs[R1] ←_64 (Mem[40+Regs[R3]]_0)^56 ## Mem[40+Regs[R3]]
LBU R1,40(R3) | Load byte unsigned | Regs[R1] ←_64 0^56 ## Mem[40+Regs[R3]]
LH R1,40(R3) | Load half word | Regs[R1] ←_64 (Mem[40+Regs[R3]]_0)^48 ## Mem[40+Regs[R3]] ## Mem[41+Regs[R3]]
L.S F0,50(R3) | Load FP single | Regs[F0] ←_64 Mem[50+Regs[R3]] ## 0^32
L.D F0,50(R2) | Load FP double | Regs[F0] ←_64 Mem[50+Regs[R2]]
SD R3,500(R4) | Store double word | Mem[500+Regs[R4]] ←_64 Regs[R3]
SW R3,500(R4) | Store word | Mem[500+Regs[R4]] ←_32 Regs[R3]
S.S F0,40(R3) | Store FP single | Mem[40+Regs[R3]] ←_32 Regs[F0]_{0..31}
S.D F0,40(R3) | Store FP double | Mem[40+Regs[R3]] ←_64 Regs[F0]
SH R3,502(R2) | Store half | Mem[502+Regs[R2]] ←_16 Regs[R3]_{48..63}
SB R2,41(R3) | Store byte | Mem[41+Regs[R3]] ←_8 Regs[R2]_{56..63}
FIGURE 2.28 The load and store instructions in MIPS. All use a single addressing mode and require that the memory value be aligned. Of course, both loads and stores are available for all the data types shown.
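The register contents produced by the byte and half-word loads in Figure 2.28 can be modeled in C. This is a sketch of the semantics only: memory is modeled as a small byte array indexed by the effective address, bytes are combined in Big Endian order, and all function names are hypothetical.

```c
#include <stdint.h>

/* Displacement addressing: base register plus a sign-extended 16-bit
   displacement. A displacement of 0 gives register indirect addressing,
   and a base of 0 (register R0) gives absolute addressing. */
uint64_t effective_address(uint64_t base, int16_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}

/* LB: replicate the sign bit of the byte into the upper 56 bits. */
uint64_t load_byte(const uint8_t *mem, uint64_t base, int16_t disp)
{
    uint8_t b = mem[effective_address(base, disp)];
    uint64_t fill = (b & 0x80) ? 0xFFFFFFFFFFFFFF00ULL : 0;
    return fill | b;
}

/* LBU: fill the upper 56 bits with zeros. */
uint64_t load_byte_unsigned(const uint8_t *mem, uint64_t base, int16_t disp)
{
    return mem[effective_address(base, disp)];
}

/* LH: two bytes, most-significant first, sign-extended to 64 bits. */
uint64_t load_half(const uint8_t *mem, uint64_t base, int16_t disp)
{
    uint64_t ea = effective_address(base, disp);
    uint64_t h = ((uint64_t)mem[ea] << 8) | mem[ea + 1];
    uint64_t fill = (h & 0x8000) ? 0xFFFFFFFFFFFF0000ULL : 0;
    return fill | h;
}
```

Loading the byte 0x80 with load_byte yields 0xFFFFFFFFFFFFFF80, while load_byte_unsigned yields 0x0000000000000080, which is the difference between LB and LBU in the figure.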
Any of the general-purpose or floating-point registers may be loaded or stored, except that loading R0 has no effect. Figure 2.28 gives examples of the load and store instructions. Single-precision floating-point numbers occupy half a floating-point register. Conversions between single and double precision must be done explicitly. The floating-point format is IEEE 754 (see Appendix G). A list of all the MIPS instructions in our subset appears in Figure 2.31 (page 146).
To understand these figures we need to introduce a few additional extensions to our C description language presented initially on page 107:
• A subscript is appended to the symbol ← whenever the length of the datum being transferred might not be clear. Thus, ←_n means transfer an n-bit quantity. We use x, y ← z to indicate that z should be transferred to x and y.
• A subscript is used to indicate selection of a bit from a field. Bits are labeled from the most-significant bit starting at 0. The subscript may be a single digit (e.g., Regs[R4]_0 yields the sign bit of R4) or a subrange (e.g., Regs[R3]_{56..63} yields the least-significant byte of R3).
• The variable Mem, used as an array that stands for main memory, is indexed by a byte address and may transfer any number of bytes.
• A superscript is used to replicate a field (e.g., 0^48 yields a field of zeros of length 48 bits).
• The symbol ## is used to concatenate two fields and may appear on either side of a data transfer.
A summary of the entire description language appears on the back inside cover. As an example, assuming that R8 and R10 are 64-bit registers:
    Regs[R10]_{32..63} ←_32 (Mem[Regs[R8]]_0)^24 ## Mem[Regs[R8]]
means that the byte at the memory location addressed by the contents of register R8 is sign-extended to form a 32-bit quantity that is stored into the lower half of register R10. (The upper half of R10 is unchanged.)
All ALU instructions are register-register instructions. Figure 2.29 gives some examples of the arithmetic/logical instructions. The operations include simple arithmetic and logical operations: add, subtract, AND, OR, XOR, and shifts. Immediate forms of all these instructions are provided using a 16-bit sign-extended immediate. The operation LUI (load upper immediate) loads bits 32 to 47 of a register, while setting the rest of the register to 0. LUI allows a 32-bit constant to be built in two instructions, or a data transfer using any constant 32-bit address in one extra instruction.
As mentioned above, R0 is used to synthesize popular operations. Loading a constant is simply an add immediate where one source operand is R0, and a register-register move is simply an add where one of the sources is R0. (We sometimes use the mnemonic LI, standing for load immediate, to represent the former and the mnemonic MOV for the latter.)
MIPS Control Flow Instructions
MIPS provides compare instructions, which compare two registers to see if the first is less than the second. If the condition is true, these instructions place a
Example instruction | Instruction name | Meaning
DADDU R1,R2,R3 | Add unsigned | Regs[R1] ← Regs[R2] + Regs[R3]
DADDIU R1,R2,#3 | Add immediate unsigned | Regs[R1] ← Regs[R2] + 3
LUI R1,#42 | Load upper immediate | Regs[R1] ← 0^32 ## 42 ## 0^16
SLL R1,R2,#5 | Shift left logical | Regs[R1] ← Regs[R2] << 5
TMS320C54 (“C54”) for DSPstone kernels | Ratio to assembly in execution time (> 1 means slower) | Ratio to assembly code space (> 1 means bigger)
Convolution | 11.8 | 16.5
FIR | 11.5 | 8.7
Matrix 1x3 | 7.7 | 8.1
FIR2dim | 5.3 | 6.5
Dot product | 5.2 | 14.1
LMS | 5.1 | 0.7
N real update | 4.7 | 14.1
IIR n biquad | 2.4 | 8.6
N complex update | 2.4 | 9.8
Matrix | 1.2 | 5.1
Complex update | 1.2 | 8.7
IIR one biquad | 1.0 | 6.4
Real update | 0.8 | 15.6
C54 Geometric Mean | 3.2 | 7.8
TMS 320C6203 (“C62”) for EEMBC Telecom kernels | Ratio to assembly in execution time (> 1 means slower) | Ratio to assembly code space (> 1 means bigger)
Convolutional Encoder | 44.0 | 0.5
Fixed-Point Complex FFT | 13.5 | 1.0
Viterbi GSM Decoder | 13.0 | 0.7
Fixed-point Bit Allocation | 7.0 | 1.4
Autocorrelation | 1.8 | 0.7
C62 Geometric Mean | 10.0 | 0.8
FIGURE 2.40 Ratio of execution time and code size for compiled code vs. hand written code for TMS320C54 DSPs on left (using the 14 DSPstone kernels) and Texas Instruments TMS 320C6203 on right (using the 6 EEMBC Telecom kernels). The geometric mean of performance improvements is 3.2:1 for C54 running DSPstone and 10.0:1 for the C62 running EEMBC. The compiler does a better job on code space for the C62, which is a VLIW processor, but the geometric mean of code size for the C54 is almost a factor of 8 larger when compiled. Modifying the C code gives much better results. The EEMBC results were reported May 2000. For DSPstone, see Ropers [1999]
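The geometric means in Figure 2.40 can be rechecked from the per-kernel ratios. The sketch below finds the value g whose n-th power equals the product of the n ratios, using bisection so that no math library is needed; the function names are hypothetical, and the inputs are assumed to be positive with n at least 1.

```c
#include <stddef.h>

/* x raised to a non-negative integer power by repeated multiplication. */
static double ipow(double x, size_t n)
{
    double r = 1.0;
    for (size_t i = 0; i < n; i++)
        r *= x;
    return r;
}

/* Geometric mean of n positive ratios: the g with g^n equal to the product.
   The answer lies between 0 and max(1, largest ratio), so bisect there. */
double geometric_mean(const double *ratio, size_t n)
{
    double product = 1.0, lo = 0.0, hi = 1.0;
    for (size_t i = 0; i < n; i++) {
        product *= ratio[i];
        if (ratio[i] > hi)
            hi = ratio[i];
    }
    for (int iter = 0; iter < 100; iter++) {
        double mid = 0.5 * (lo + hi);
        if (ipow(mid, n) < product)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

Applied to the five C62 execution-time ratios (44.0, 13.5, 13.0, 7.0, 1.8) and the five code-space ratios (0.5, 1.0, 0.7, 1.4, 0.7), the results land close to the reported 10.0 and 0.8; small differences are expected because the table entries are themselves rounded.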
Despite these major difficulties, the 80x86 architecture has been enormously successful. The reasons are threefold: first, its selection as the microprocessor in the initial IBM PC makes 80x86 binary compatibility extremely valuable. Second, Moore’s Law provided sufficient resources for 80x86 microprocessors to translate to an internal RISC instruction set and then execute RISC-like instructions (see section 3.8 in the next chapter). This mix enables binary compatibility with the valuable PC software base and performance on par with RISC processors. Third, the very high volumes of PC microprocessors mean Intel can easily pay for the increased design cost of hardware translation. In addition, the high volumes allow the manufacturer to go up the learning curve, which lowers the cost of the product. The larger die size and increased power for translation may be a liability for embedded applications, but it makes tremendous economic sense for the desktop. And its cost-performance in the desktop also makes it attractive for servers, with its main weakness for servers being 32-bit addresses: companies already offer high-end servers with more than one terabyte (2^40 bytes) of memory.
Fallacy: You can design a flawless architecture.
All architecture design involves trade-offs made in the context of a set of hardware and software technologies. Over time those technologies are likely to change, and decisions that may have been correct at the time they were made look like mistakes. For example, in 1975 the VAX designers overemphasized the importance of code-size efficiency, underestimating how important ease of decoding and pipelining would be five years later. An example in the RISC camp is delayed branch (see Appendix B). It was a simple way to control pipeline hazards with five-stage pipelines, but a challenge for processors with longer pipelines that issue multiple instructions per clock cycle. In addition, almost all architectures eventually succumb to the lack of sufficient address space. In general, avoiding such flaws in the long run would probably mean compromising the efficiency of the architecture in the short run, which is dangerous, since a new instruction set architecture must struggle to survive its first few years.
2.15
Concluding Remarks
The earliest architectures were limited in their instruction sets by the hardware technology of that time. As soon as the hardware technology permitted, computer architects began looking for ways to support high-level languages. This search led to three distinct periods of thought about how to support programs efficiently. In the 1960s, stack architectures became popular. They were viewed as being a good match for high-level languages, and they probably were, given the compiler technology of the day. In the 1970s, the main concern of architects was how to reduce software costs. This concern was met primarily by replacing software with hardware, or by providing high-level architectures that could simplify the task of software designers. The result was both the high-level-language computer architecture movement and powerful architectures like the VAX, which has a large number of addressing modes, multiple data types, and a highly orthogonal architecture. In the 1980s, more sophisticated compiler technology and a renewed emphasis on processor performance saw a return to simpler architectures, based mainly on the load-store style of computer.
The following instruction set architecture changes occurred in the 1990s:
■ Address size doubles: The 32-bit address instruction sets for most desktop and server processors were extended to 64-bit addresses, expanding the width of the registers (among other things) to 64 bits. Appendix B gives three examples of architectures that have gone from 32 bits to 64 bits.
■ Optimization of conditional branches via conditional execution: In the next two chapters we see that conditional branches can limit the performance of aggressive computer designs. Hence, there was interest in replacing conditional branches with conditional completion of operations, such as conditional move (see Chapter 4), which was added to most instruction sets.
■ Optimization of cache performance via prefetch: Chapter 5 explains the increasing role of memory hierarchy in performance of computers, with a cache miss on some computers taking as many instruction times as page faults took on earlier computers. Hence, prefetch instructions were added to try to hide the cost of cache misses by prefetching (see Chapter 5).
■ Support for multimedia: Most desktop and embedded instruction sets were extended with support for multimedia and DSP applications, as discussed in this chapter.
■ Faster floating-point operations: Appendix G describes operations added to enhance floating-point performance, such as operations that perform a multiply and an add and paired single execution. (We include them in MIPS.)
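As a rough illustration of the conditional-move idea in the list above, the following Python sketch contrasts a select written with a branch against a branch-free select built from a mask. The function names and the 32-bit word width are assumptions made for this example only; a real conditional move is a single hardware instruction, and this bit manipulation merely mimics its effect.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit register width (illustrative)


def select_with_branch(condition, if_true, if_false):
    """Control-flow version: a compiler would emit a conditional branch."""
    if condition:
        return if_true
    return if_false


def select_with_cmov(condition, if_true, if_false):
    """Branch-free version that mimics a conditional move on 32-bit values."""
    # mask is all ones when the condition holds and all zeros otherwise
    mask = -int(bool(condition)) & WORD_MASK
    return (if_true & mask) | (if_false & ~mask & WORD_MASK)
```

Both functions return the same value for unsigned operands that fit in 32 bits; the second has no data-dependent change of control flow, which is the property that matters to an aggressive pipeline.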
Looking to the next decade, we see the following trends in instruction set design:
■ Long instruction words: The desire to achieve more instruction-level parallelism by changing the architecture to support wider instructions (see Chapter 4).
■ Increased conditional execution: More support for conditional execution of operations to support greater speculation.
■ Blending of general-purpose and DSP architectures: Parallel efforts between desktop and embedded processors to add DSP support vs. extending DSP processors to make them better targets for compilers, suggesting a culture clash in the marketplace between general-purpose processors and DSPs.
■ 80x86 emulation: Given the popularity of software for the 80x86 architecture, many companies are looking to see if changes to the instruction sets can significantly improve performance, cost, or power when emulating the 80x86 architecture.
Between 1970 and 1985 many thought the primary job of the computer architect was the design of instruction sets. As a result, textbooks of that era emphasize instruction set design, much as computer architecture textbooks of the 1950s and 1960s emphasized computer arithmetic. The educated architect was expected to have strong opinions about the strengths and especially the weaknesses of the popular computers. The importance of binary compatibility in quashing innovations in instruction set design was unappreciated by many researchers and textbook writers, giving the impression that many architects would get a chance to design an instruction set. The definition of computer architecture today has been expanded to include design and evaluation of the full computer system—not just the definition of the instruction set and not just the processor—and hence there are plenty of topics for the architect to study. (You may have guessed this the first time you lifted this book.) Hence, the bulk of this book is on design of computers versus instruction sets.
Chapter 2 Instruction Set Principles and Examples
The many appendices may satisfy readers interested in instruction set architecture: Appendix B compares seven popular load-store computers with MIPS. Appendix C describes the most widely used instruction set, the Intel 80x86, and compares instruction counts for it with that of MIPS for several programs. For those interested in the historical computers, Appendix D summarizes the VAX architecture and Appendix E summarizes the IBM 360/370.
2.16
Historical Perspective and References
One’s eyebrows should rise whenever a future architecture is developed with a stack- or register-oriented instruction set. [p. 20]
Meyers [1978]
The earliest computers, including the UNIVAC I, the EDSAC, and the IAS computers, were accumulator-based computers. The simplicity of this type of computer made it the natural choice when hardware resources were very constrained. The first general-purpose register computer was the Pegasus, built by Ferranti, Ltd. in 1956. The Pegasus had eight general-purpose registers, with R0 always being zero. Block transfers loaded the eight registers from the drum memory.
Stack Architectures
In 1963, Burroughs delivered the B5000. The B5000 was perhaps the first computer to seriously consider software and hardware-software trade-offs. Barton and the designers at Burroughs made the B5000 a stack architecture (as described in Barton [1961]). Designed to support high-level languages such as ALGOL, this stack architecture used an operating system (MCP) written in a high-level language. The B5000 was also the first computer from a U.S. manufacturer to support virtual memory. The B6500, introduced in 1968 (and discussed in Hauck and Dent [1968]), added hardware-managed activation records. In both the B5000 and B6500, the top two elements of the stack were kept in the processor and the rest of the stack was kept in memory. The stack architecture yielded good code density, but only provided two high-speed storage locations. The authors of both the original IBM 360 paper [Amdahl, Blaauw, and Brooks 1964] and the original PDP-11 paper [Bell et al. 1970] argue against the stack organization. They cite three major points in their arguments against stacks:
1. Performance is derived from fast registers, not the way they are used.
2. The stack organization is too limiting and requires many swap and copy operations.
3. The stack has a bottom, and when placed in slower memory there is a performance loss.
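To make the stack organization discussed above concrete, here is a minimal Python sketch of a stack machine evaluating d = a * b + c. The opcode names (push, pop, add, mul) and the dictionary used as memory are hypothetical and chosen only for illustration; they are not the B5000 instruction set.

```python
def run_stack_program(program, memory):
    """Interpret a tiny, hypothetical stack instruction set.

    program: list of tuples such as ("push", "a") or ("add",)
    memory:  dict mapping variable names to integer values
    """
    stack = []
    for opcode, *operands in program:
        if opcode == "push":      # copy a memory operand onto the stack
            stack.append(memory[operands[0]])
        elif opcode == "pop":     # store the top of stack back to memory
            memory[operands[0]] = stack.pop()
        elif opcode == "add":     # operands are implicit: the top two entries
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "mul":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


# d = a * b + c expressed with implicit stack operands
PROGRAM = [("push", "a"), ("push", "b"), ("mul",),
           ("push", "c"), ("add",), ("pop", "d")]
```

Note that the arithmetic instructions name no operands at all, which is where the good code density comes from, while every value still has to pass through the top of the stack, which is the limitation raised in the second objection above.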
Stack-based hardware fell out of favor in the late 1970s and, except for the Intel 80x86 floating-point architecture, essentially disappeared. For example, except for the 80x86, none of the computers listed in the SPEC report uses a stack. In the 1990s, however, stack architectures received a shot in the arm with the success of the Java Virtual Machine (JVM). The JVM is a software interpreter for an intermediate language produced by Java compilers, called Java bytecodes ([Lindholm 1999]). The purpose of the interpreter is to provide software compatibility across many platforms, with the hope of “write once, run everywhere.” Although the slowdown is about a factor of ten due to interpretation, there are times when compatibility is more important than performance, such as when downloading a Java “applet” into an Internet browser. Although a few have proposed hardware to directly execute the JVM instructions (see [McGhan 1998]), thus far none of these proposals have been significant commercially. The hope instead is that Just In Time (JIT) Java compilers, which compile during run time to the native instruction set of the computer running the Java program, will overcome the performance penalty of interpretation. The popularity of Java has also led to compilers that compile directly into the native hardware instruction sets, bypassing the illusion of the Java bytecodes.
Computer Architecture Defined
IBM coined the term computer architecture in the early 1960s. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the IBM 360 instruction set. They believed that a family of computers of the same architecture should be able to run the same software. Although this idea may seem obvious to us today, it was quite novel at that time. IBM, although it was the leading company in the industry, had five different architectures before the 360. Thus, the notion of a company standardizing on a single architecture was a radical one.
The 360 designers hoped that defining a common architecture would bring six different divisions of IBM together. Their definition of architecture was ... the structure of a computer that a machine language programmer must understand to write a correct (timing independent) program for that machine. The term “machine language programmer” meant that compatibility would hold, even in machine language, while “timing independent” allowed different implementations. This architecture blazed the path for binary compatibility, which others have followed. The IBM 360 was the first computer to sell in large quantities with both byte addressing using 8-bit bytes and general-purpose registers. The 360 also had register-memory and limited memory-memory instructions. Appendix E summarizes this instruction set. In 1964, Control Data delivered the first supercomputer, the CDC 6600. As Thornton [1964] discusses, he, Cray, and the other 6600 designers were among the first to explore pipelining in depth. The 6600 was the first general-purpose,
load-store computer. In the 1960s, the designers of the 6600 realized the need to simplify architecture for the sake of efficient pipelining. Microprocessor and minicomputer designers largely neglected this interaction between architectural simplicity and implementation during the 1970s, but it returned in the 1980s.
High Level Language Computer Architecture
In the late 1960s and early 1970s, people realized that software costs were growing faster than hardware costs. McKeeman [1967] argued that compilers and operating systems were getting too big and too complex and taking too long to develop. Because of inferior compilers and the memory limitations of computers, most systems programs at the time were still written in assembly language. Many researchers proposed alleviating the software crisis by creating more powerful, software-oriented architectures. Tanenbaum [1978] studied the properties of high-level languages. Like other researchers, he found that most programs are simple. He then argued that architectures should be designed with this in mind and that they should optimize for program size and ease of compilation. Tanenbaum proposed a stack computer with frequency-encoded instruction formats to accomplish these goals. However, as we have observed, program size does not translate directly to cost/performance, and stack computers faded out shortly after this work.
Strecker’s article [1978] discusses how he and the other architects at DEC responded to this by designing the VAX architecture. The VAX was designed to simplify compilation of high-level languages. Compiler writers had complained about the lack of complete orthogonality in the PDP-11. The VAX architecture was designed to be highly orthogonal and to allow the mapping of a high-level-language statement into a single VAX instruction. Additionally, the VAX designers tried to optimize code size because compiled programs were often too large for available memories. Appendix D summarizes this instruction set.
The VAX-11/780 was the first computer announced in the VAX series. It is one of the most successful, and most heavily studied, computers ever built. The cornerstone of DEC’s strategy was a single architecture, VAX, running a single operating system, VMS. This strategy worked well for over 10 years. The large number of papers reporting instruction mixes, implementation measurements, and analysis of the VAX make it an ideal case study [Wiecek 1982; Clark and Levy 1982]. Bhandarkar and Clark [1991] give a quantitative analysis of the disadvantages of the VAX versus a RISC computer, essentially a technical explanation for the demise of the VAX.
While the VAX was being designed, a more radical approach, called high-level-language computer architecture (HLLCA), was being advocated in the research community. This movement aimed to eliminate the gap between high-level languages and computer hardware (what Gagliardi [1973] called the “semantic gap”) by bringing the hardware “up to” the level of the programming language. Meyers [1982] provides a good summary of the arguments and a history of high-level-language computer architecture projects.
HLLCA never had a significant commercial impact. The increase in memory size on computers eliminated the code-size problems arising from high-level languages and enabled operating systems to be written in high-level languages. The combination of simpler architectures together with software offered greater performance and more flexibility at lower cost and lower complexity.
Reduced Instruction Set Computers
In the early 1980s, the direction of computer architecture began to swing away from providing high-level hardware support for languages. Ditzel and Patterson [1980] analyzed the difficulties encountered by the high-level-language architectures and argued that the answer lay in simpler architectures. In another paper [Patterson and Ditzel 1980], these authors first discussed the idea of reduced instruction set computers (RISC) and presented the argument for simpler architectures. Clark and Strecker [1980], who were VAX architects, rebutted their proposal.
The simple load-store computers such as MIPS are commonly called RISC architectures. The roots of RISC architectures go back to computers like the 6600, where Thornton, Cray, and others recognized the importance of instruction set simplicity in building a fast computer. Cray continued his tradition of keeping computers simple in the CRAY-1. Commercial RISCs are built primarily on the work of three research projects: the Berkeley RISC processor, the IBM 801, and the Stanford MIPS processor. These architectures have attracted enormous industrial interest because of claims of a performance advantage of anywhere from two to five times over other computers using the same technology.
Begun in 1975, the IBM project was the first to start but was the last to become public. The IBM computer was designed as a 24-bit ECL minicomputer, while the university projects were both MOS-based, 32-bit microprocessors. John Cocke is considered the father of the 801 design.
He received both the Eckert-Mauchly and Turing awards in recognition of his contribution. Radin [1982] describes the highlights of the 801 architecture. The 801 was an experimental project that was never designed to be a product. In fact, to keep down cost and complexity, the computer was built with only 24-bit registers.
In 1980, Patterson and his colleagues at Berkeley began the project that was to give this architectural approach its name (see Patterson and Ditzel [1980]). They built two computers called RISC-I and RISC-II. Because the IBM project was not widely known or discussed, the role played by the Berkeley group in promoting the RISC approach was critical to the acceptance of the technology. They also built one of the first instruction caches to support hybrid format RISCs (see Patterson [1983]). It supported 16-bit and 32-bit instructions in memory but 32 bits in the cache. The Berkeley group went on to build RISC computers targeted toward Smalltalk, described by Ungar et al. [1984], and LISP, described by Taylor et al. [1986].
In 1981, Hennessy and his colleagues at Stanford published a description of the Stanford MIPS computer. Efficient pipelining and compiler-assisted scheduling of the pipeline were both important aspects of the original MIPS design. MIPS stood for Microprocessor without Interlocked Pipeline Stages, reflecting the lack of hardware to stall the pipeline, as the compiler would handle dependencies. These early RISC computers—the 801, RISC-II, and MIPS—had much in common. Both university projects were interested in designing a simple computer that could be built in VLSI within the university environment. All three computers used a simple load-store architecture, fixed-format 32-bit instructions, and emphasized efficient pipelining. Patterson [1985] describes the three computers and the basic design principles that have come to characterize what a RISC computer is. Hennessy [1984] provides another view of the same ideas, as well as other issues in VLSI processor design. In 1985, Hennessy published an explanation of the RISC performance advantage and traced its roots to a substantially lower CPI—under 2 for a RISC processor and over 10 for a VAX-11/780 (though not with identical workloads). A paper by Emer and Clark [1984] characterizing VAX-11/780 performance was instrumental in helping the RISC researchers understand the source of the performance advantage seen by their computers. Since the university projects finished up, in the 1983–84 time frame, the technology has been widely embraced by industry. Many manufacturers of the early computers (those made before 1986) claimed that their products were RISC computers. These claims, however, were often born more of marketing ambition than of engineering reality. In 1986, the computer industry began to announce processors based on the technology explored by the three RISC research projects. Moussouris et al. 
[1986] describe the MIPS R2000 integer processor, while Kane’s book [1986] is a complete description of the architecture. Hewlett-Packard converted their existing minicomputer line to RISC architectures; Lee [1989] describes the HP Precision Architecture. IBM never directly turned the 801 into a product. Instead, the ideas were adopted for a new, low-end architecture that was incorporated in the IBM RT-PC and described in a collection of papers [Waters 1986]. In 1990, IBM announced a new RISC architecture (the RS 6000), which is the first superscalar RISC processor (see Chapter 4). In 1987, Sun Microsystems began delivering computers based on the SPARC architecture, a derivative of the Berkeley RISC-II processor; SPARC is described in Garner et al. [1988]. The PowerPC joined the forces of Apple, IBM, and Motorola. Appendix B summarizes several RISC architectures. To help resolve the RISC vs. traditional design debate, designers of VAX processors later performed a quantitative comparison of VAX and a RISC processor for implementations with comparable organizations. Their choices were the VAX 8700 and the MIPS M2000. The differing goals for VAX and MIPS have led to very different architectures. The VAX goals, simple compilers and code density,
[Figure 2.41 chart: y-axis is the MIPS/VAX ratio, from 0.0 to 4.0; plotted series are the performance ratio, the instructions executed ratio, and the CPI ratio; x-axis lists the SPEC89 benchmarks (li, eqntott, espresso, doduc, tomcatv, fpppp, nasa7, matrix, spice).]
FIGURE 2.41 Ratio of MIPS M2000 to VAX 8700 in instructions executed and performance in clock cycles using SPEC89 programs. On average, MIPS executes a little over twice as many instructions as the VAX, but the CPI for the VAX is almost six times the MIPS CPI, yielding almost a threefold performance advantage. (Based on data from Bhandarkar and Clark [1991].)
led to powerful addressing modes, powerful instructions, efficient instruction encoding, and few registers. The MIPS goals were high performance via pipelining, ease of hardware implementation, and compatibility with highly optimizing compilers. These goals led to simple instructions, simple addressing modes, fixed-length instruction formats, and a large number of registers.
Figure 2.41 shows the ratio of the number of instructions executed, the ratio of CPIs, and the ratio of performance measured in clock cycles. Since the organizations were similar, clock cycle times were assumed to be the same. MIPS executes about twice as many instructions as the VAX, while the CPI for the VAX is about six times larger than that for the MIPS. Hence, the MIPS M2000 has almost three times the performance of the VAX 8700. Furthermore, much less hardware is needed to build the MIPS processor than the VAX processor. This cost/performance gap is the reason the company that used to make the VAX has dropped it and is now making the Alpha, which is quite similar to MIPS. Bell and Strecker [1998] summarize the debate inside the company.
Looking back, only one CISC instruction set survived the RISC/CISC debate, and that one had binary compatibility with PC software. The volume of chips is so high in the PC industry that there is sufficient revenue stream to pay the extra design costs (and sufficient resources due to Moore’s Law) to build microprocessors which translate from CISC to RISC internally. Whatever loss in efficiency, due to longer pipeline stages and bigger die size to accommodate translation on the chip, was hedged by having a semiconductor fabrication line dedicated to producing just these microprocessors. The high volumes justify the economics of a fab line tailored to these chips.
Thus, in the desktop/server market, RISC computers use compilers to translate into RISC instructions and the remaining CISC computer uses hardware to translate into RISC instructions. One recent novel variation for the laptop market is the Transmeta Crusoe (see section 4.8 of Chapter 4), which interprets 80x86 instructions and compiles on the fly into internal instructions.
The embedded market, which competes in cost and power, cannot afford the luxury of hardware translation and thus uses compilers and RISC architectures. More than twice as many 32-bit embedded microprocessors as PC microprocessors were shipped in 2000, with RISC processors responsible for over 90% of that embedded market.
A Brief History of Digital Signal Processors
(Jeff Bier prepared this DSP history.)
In the late 1990s, digital signal processing (DSP) applications, such as digital cellular telephones, emerged as one of the largest consumers of embedded computing power. Today, microprocessors specialized for DSP applications (sometimes called digital signal processors, DSPs, or DSP processors) are used in most of these applications. In 2000 this was a $6 billion market. Compared to other embedded computing applications, DSP applications are differentiated by:
■ Computationally demanding, iterative numeric algorithms often composed of vector dot products; hence the importance of multiply and multiply-accumulate instructions.
■ Sensitivity to small numeric errors; for example, numeric errors may manifest themselves as audible noise in an audio device.
■ Stringent real-time requirements.
■ “Streaming” data; typically, input data is provided from an analog-to-digital converter as an infinite stream. Results are emitted in a similar fashion.
■ High data bandwidth.
■ Predictable, simple (though often eccentric) memory access patterns.
■ Predictable program flow (typically characterized by nested loops).
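The first two items in the list above (dot products built from multiply-accumulate operations, and sensitivity to numeric errors) can be sketched in a few lines of Python. The example below accumulates a dot product and clamps the running sum instead of letting it wrap around on overflow, a technique known as saturation arithmetic. The 32-bit accumulator width and the function names are assumptions made for illustration; actual DSP datapaths differ in accumulator width and rounding behavior.

```python
ACC_MAX = 2**31 - 1   # assumed 32-bit signed accumulator (illustrative)
ACC_MIN = -2**31


def saturate(value):
    """Clamp to the accumulator range instead of wrapping around."""
    return max(ACC_MIN, min(ACC_MAX, value))


def dot_product_mac(samples, coefficients):
    """Dot product as a chain of multiply-accumulate steps with saturation."""
    accumulator = 0
    for sample, coefficient in zip(samples, coefficients):
        # one multiply-accumulate step: acc = sat(acc + sample * coefficient)
        accumulator = saturate(accumulator + sample * coefficient)
    return accumulator
```

With wraparound arithmetic, an overflowing sum would flip sign and produce a large error (a loud click in an audio device, for instance); clamping keeps the error bounded at the cost of clipping.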
In the 1970s there was strong interest in using DSP techniques in telecommunications equipment, such as modems and central office switches. The microprocessors of the day did not provide adequate performance, though. Fixed-function
hardware proved effective in some applications, but lacked the flexibility and reusability of a programmable processor. Thus, engineers were motivated to adapt microprocessor technology to the needs of DSP applications.
The first commercial DSPs emerged in the early 1980s, about 10 years after Intel’s introduction of the 4004. A number of companies, including Intel, developed early DSPs, but most of these early devices were not commercially successful. NEC’s µPD7710, introduced in 1980, became the first merchant-market DSP to ship in volume quantities, but was hampered by weak development tools. AT&T’s DSP1, also introduced in 1980, was limited to use within AT&T, but it spawned several generations of successful devices which AT&T soon began offering to other system manufacturers. In 1982, Texas Instruments introduced its first DSP, the TMS32010. Backed by strong tools and applications engineering support, the TI processor was a solid success.
Like the first microprocessors, these early DSPs had simple architectures. In contrast with their general-purpose cousins, though, DSPs adopted a range of specialized features to boost performance and efficiency in signal processing tasks. For example, a single-cycle multiplier aided arithmetic performance. Specialized datapaths streamlined multiply-accumulate operations and provided features to minimize numeric errors, such as saturation arithmetic. Separate program and data memories provided the memory bandwidth required to keep the relatively powerful datapaths fed. Dedicated, specialized addressing hardware sped simple addressing operations, such as autoincrement addressing. Complex, specialized instruction sets allowed these processors to combine many operations in a single instruction, but only certain limited combinations of operations were supported.
From the mid 1980s to the mid 1990s, many new commercial DSP architectures were introduced.
For the most part, these architectures followed a gradual, evolutionary path, adopting incremental improvements rather than fundamental innovations when compared with the earliest DSPs like the TMS32010. DSP application programs expanded from a few hundred lines of source code to tens of thousands of lines. Hence, the quality of development tools and the availability of off-the-shelf application software components became, for many users, more important than performance in selecting a processor. Today, chips based on these “conventional DSP” architectures still dominate DSP applications, and are used in products such as cellular telephones, disk drives (for servo control), and consumer audio devices.
Early DSP architectures had proven effective, but the highly specialized and constrained instruction sets that gave them their performance and efficiency also created processors that were difficult targets for compiler writers. The performance and efficiency demands of most DSP applications could not be met by the resulting weak compilers, so much software (all software, for some processors) was written in assembly language. As applications became larger and more complex, assembly language programming became less practical. Users also suffered from the incompatibility of many new DSP architectures with their predecessors, which forced them to periodically rewrite large amounts of existing application software.
In roughly 1995, architects of digital signal processors began to experiment with very different types of architectures, often adapting techniques from earlier high-performance general-purpose or scientific-application processor designs. These designers sought to further increase performance and efficiency, but to do so with architectures that would be better compiler targets, and that would offer a better basis for future compatible architectures. For example, in 1997, Texas Instruments announced the TMS320C62xx family, an eight-issue VLIW design boasting increased parallelism, a higher clock speed, and a radically simple, RISC-like instruction set. Other DSP architects adopted SIMD approaches, superscalar designs, chip multiprocessing, or a combination of these techniques. Therefore, DSP architectures today are more diverse than ever, and the rate of architectural innovation is increasing.
DSP architects were experimenting with new approaches, often adapting techniques from general-purpose processors. In parallel, designers of general-purpose processors (both those targeting embedded applications and those intended for computers) noticed that DSP tasks were becoming increasingly common in all kinds of microprocessor applications. In many cases, these designers added features to their architectures to boost performance and efficiency in DSP tasks. These features ranged from modest instruction set additions to extensive architectural retrofits. In some cases, designers created all-new architectures intended to encompass capabilities typically found in a DSP and those typically found in a general-purpose processor. Today, virtually every commercial 32-bit microprocessor architecture, from ARM to 80x86, has been subject to some kind of DSP-oriented enhancement.
Throughout the 1990s, an increasing number of system designers turned to system-on-chip devices.
These are complex integrated circuits typically containing a processor core and a mix of memory, application-specific hardware (such as algorithm accelerators), peripherals, and I/O interfaces tuned for a specific application. An example is second-generation cellular phones. In some cases, chip manufacturers provide a complete complement of application software along with these highly integrated chips. These processor-based chips are often the solution of choice for relatively mature, high-volume applications. Though these chips are not sold as “processors,” the processors inside them define their capabilities to a significant degree.
More information on the history of DSPs can be found in Boddie [2000], Stauss [1998], and Texas Instruments [2000].
Multimedia Support in Desktop Instruction Sets
Since every desktop microprocessor by definition has its own graphical displays, as transistor budgets increased it was inevitable that support would be added for graphics operations. The earliest color for PCs used 8 bits per pixel in the “256 color” format of VGA, which some PCs still support for compatibility. The next step was 16 bits per pixel by encoding R in 5 bits, G in 6 bits, and B in 5 bits.
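The 16-bit pixel layout just described (5 bits of red, 6 of green, 5 of blue) can be sketched as a pair of Python helpers. The text gives only the field widths; placing red in the most significant bits and starting from 8-bit channel values are assumptions made here for illustration, as are the function names.

```python
def pack_rgb565(red, green, blue):
    """Pack 8-bit R, G, B channels into one 16-bit pixel (assumed R-high layout)."""
    red5 = red >> 3       # keep the 5 most significant bits
    green6 = green >> 2   # keep the 6 most significant bits
    blue5 = blue >> 3     # keep the 5 most significant bits
    return (red5 << 11) | (green6 << 5) | blue5


def unpack_rgb565(pixel):
    """Split a 16-bit pixel back into its 5-, 6-, and 5-bit fields."""
    red5 = (pixel >> 11) & 0x1F
    green6 = (pixel >> 5) & 0x3F
    blue5 = pixel & 0x1F
    return red5, green6, blue5
```

Green receives the extra bit in this layout; packing discards the low-order bits of each channel, so a round trip through these helpers recovers only the truncated values.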
This format is called high color on PCs. On PCs the 32-bit format discussed above, with R, G, B, and A, is called true color.
The addition of speakers and microphones for teleconferencing and video games suggested support of sound as well. Audio samples of 16 bits are sufficient for most end users, but professional audio work uses 24 bits.
The architects of the Intel i860, which was justified as a graphical accelerator within the company, recognized that many graphics and audio applications would perform the same operation on vectors of these data. Although a vector unit was beyond the transistor budget of the i860 in 1989, by partitioning the carry chains within a 64-bit ALU, it could perform simultaneous operations on short vectors. It operated on eight 8-bit operands, four 16-bit operands, or two 32-bit operands. The cost of such partitioned ALUs was small. Applications that lend themselves to such support include MPEG (video), games like DOOM (3D graphics), Adobe Photoshop (digital photography), and teleconferencing (audio and image processing). Operations on four 8-bit operands were for operating on pixels.
Like a virus, over time such multimedia support has spread to nearly every desktop microprocessor. HP was the first successful desktop RISC to include such support. The paired single floating-point operations, which came later, are useful for operations on vertices.
These extensions have been called partitioned ALU, subword parallelism, vector, or SIMD (single instruction, multiple data). Since Intel marketing uses SIMD to describe the MMX extension of the 80x86, SIMD has become the popular name.
Summary
Prior to the RISC architecture movement, the major trend had been highly microcoded architectures aimed at reducing the semantic gap and code size. DEC, with the VAX, and Intel, with the iAPX 432, were among the leaders in this approach. Although those two computers have faded into history, one contemporary survives: the 80x86.
This architecture did not have a philosophy about high-level languages; it had a deadline. Since the iAPX 432 was late and Intel desperately needed a 16-bit microprocessor, the 8086 was designed in a few months. It was forced to be assembly language compatible with the 8-bit 8080, and assembly language was expected to be widely used with this architecture. Its saving grace has been its ability to evolve. The 80x86 dominates the desktop with an 85% share, which is due in part to the importance of binary compatibility as a result of IBM’s selection of the 8086 in the early 1980s. Rather than change the instruction set architecture, recent 80x86 implementations translate into RISC-like instructions internally and then execute them (see section 3.8 in the next chapter). RISC processors dominate the embedded market with a similar market share, because binary compatibility is unimportant and because die size and power goals make hardware translation a luxury.
VLIW is currently being tested across the board, from DSPs to servers. Will code size be a problem in the embedded market, where the instruction memory in a chip could be bigger than the processor? Will VLIW DSPs achieve respectable cost-performance if compilers produce the code? Will server VLIWs, with their high power and large die, be successful at a time when concern for power efficiency of servers is increasing? Once again an attractive feature of this field is that time will shortly tell how VLIW fares, and we should know answers to these questions by the fourth edition of this book.

References
AMDAHL, G. M., G. A. BLAAUW, AND F. P. BROOKS, JR. [1964]. “Architecture of the IBM System 360,” IBM J. Research and Development 8:2 (April), 87–101. BARTON, R. S. [1961]. “A new approach to the functional design of a computer,” Proc. Western Joint Computer Conf., 393–396. Bier, J. [1997] “The Evolution of DSP Processors,” presentation at U.C.Berkeley, November 14. BELL, G., R. CADY, H. MCFARLAND, B. DELAGI, J. O’LAUGHLIN, R. NOONAN, AND W. WULF [1970]. “A new architecture for mini-computers: The DEC PDP-11,” Proc. AFIPS SJCC, 657–675. Bell, G. and W. D. Strecker [1998]. “Computer Structures: What Have We Learned from the PDP11?” 25 Years of the International Symposia on Computer Architecture (Selected Papers). ACM, 138-151. BHANDARKAR, D., AND D. W. CLARK [1991]. “Performance from architecture: Comparing a RISC and a CISC with similar hardware organizations,” Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Palo Alto, Calif., 310–19. BODDIE, J.R. [2000] “History of DSPs,” HTTP://WWW.LUCENT.COM/MICRO/DSP/DSPHIST.HTML. CHOW, F. C. [1983]. A Portable Machine-Independent Global Optimizer—Design and Measurements, Ph.D. Thesis, Stanford Univ. (December). CLARK, D. AND H. LEVY [1982]. “Measurement and analysis of instruction set use in the VAX-11/780,” Proc.
Ninth Symposium on Computer Architecture (April), Austin, Tex., 9–17. CLARK, D. AND W. D. STRECKER [1980]. “Comments on ‘the case for the reduced instruction set computer’,” Computer Architecture News 8:6 (October), 34–38. CRAWFORD, J. AND P. GELSINGER [1988]. Programming the 80386, Sybex Books, Alameda, Calif. DITZEL, D. R. AND D. A. PATTERSON [1980]. “Retrospective on high-level language computer architecture,” in Proc. Seventh Annual Symposium on Computer Architecture, La Baule, France (June), 97–104. EMER, J. S. AND D. W. CLARK [1984]. “A characterization of processor performance in the VAX-11/ 780,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 301–310. GAGLIARDI, U. O. [1973]. “Report of workshop 4–Software-related advances in computer hardware,” Proc. Symposium on the High Cost of Software, Menlo Park, Calif., 99–120. GAME, M. and A. BOOKER [1999]. “CodePack code compression for PowerPC processors,” MicroNews, First Quarter 1999, Vol. 5, No. 1., http://www.chips.ibm.com/micronews/vol5_no1/codepack.html GARNER, R., A. AGARWAL, F. BRIGGS, E. BROWN, D. HOUGH, B. JOY, S. KLEIMAN, S. MUNCHNIK, M. NAMJOO, D. PATTERSON, J. PENDLETON, AND R. TUCK [1988]. “Scalable processor architecture (SPARC),” COMPCON, IEEE (March), San Francisco, 278–283. HAUCK, E. A., AND B. A. DENT [1968]. “Burroughs’ B6500/B7500 stack mechanism,” Proc. AFIPS
SJCC, 245–251. HENNESSY, J. [1984]. “VLSI processor architecture,” IEEE Trans. on Computers C-33:11 (December), 1221–1246. HENNESSY, J. [1985]. “VLSI RISC processors,” VLSI Systems Design VI:10 (October), 22–32. HENNESSY, J., N. JOUPPI, F. BASKETT, AND J. GILL [1981]. “MIPS: A VLSI processor architecture,” Proc. CMU Conf. on VLSI Systems and Computations (October), Computer Science Press, Rockville, MY. Intel [2001] Using MMX™ Instructions to Convert RGB To YUV Color Conversion, http://cedar.intel.com/cgi-bin/ids.dll/content/content.jsp?cntKey=Legacy::irtm_AP548_9996&cntType=IDS_EDITORIAL KANE, G. [1986]. MIPS R2000 RISC Architecture, Prentice Hall, Englewood Cliffs, N.J. Kozyrakis, C. [2000] “Vector IRAM: A Media-oriented vector processor with embedded DRAM,” presentation at Hot Chips 12 Conference, Palo Alto, CA, 13-15, 2000 LEE, R. [1989]. “Precision architecture,” Computer 22:1 (January), 78–91. LEVY, H. AND R. ECKHOUSE [1989]. Computer Programming and Architecture: The VAX, Digital Press, Boston. Lindholm, T. and F. Yellin [1999]. The Java Virtual Machine Specification, second edition, AddisonWesley. Also available online at http://java.sun.com/docs/books/vmspec/. LUNDE, A. [1977]. “Empirical evaluation of some features of instruction set processor architecture,” Comm. ACM 20:3 (March), 143–152. McGhan, H.; O'Connor, M. [1998] “PicoJava: a direct execution engine for Java bytecode.” Computer, vol.31, (no.10), Oct. 1998. p.22-30. MCKEEMAN, W. M. [1967]. “Language directed computer design,” Proc. 1967 Fall Joint Computer Conf., Washington, D.C., 413–417. MEYERS, G. J. [1978]. “The evaluation of expressions in a storage-to-storage architecture,” Computer Architecture News 7:3 (October), 20–23. MEYERS, G. J. [1982]. Advances in Computer Architecture, 2nd ed., Wiley, New York. MOUSSOURIS, J., L. CRUDELE, D. FREITAS, C. HANSEN, E. HUDSON, S. PRZYBYLSKI, T. RIORDAN, AND C. ROWEN [1986]. “A CMOS RISC processor with integrated system functions,” Proc. 
COMPCON, IEEE (March), San Francisco, 191. PATTERSON, D. [1985]. “Reduced instruction set computers,” Comm. ACM 28:1 (January), 8–21. PATTERSON, D. A. AND D. R. DITZEL [1980]. “The case for the reduced instruction set computer,” Computer Architecture News 8:6 (October), 25–33. Patterson, D.A.; Garrison, P.; Hill, M.; Lioupis, D.; Nyberg, C.; Sippel, T.; Van Dyke, K. “Architecture of a VLSI instruction cache for a RISC,” 10th Annual International Conference on Computer Architecture Conference Proceedings, Stockholm, Sweden, 13-16 June 1983, 108-16. RADIN, G. [1982]. “The 801 minicomputer,” Proc. Symposium Architectural Support for Programming Languages and Operating Systems (March), Palo Alto, Calif., 39–47. Riemens, A. Vissers, K.A.; Schutten, R.J.; Sijstermans, F.W.; Hekstra, G.J.; La Hei, G.D. [1999] “Trimedia CPU64 application domain and benchmark suite.” Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors, ICCD'99, Austin, TX, USA, 10-13 Oct. 1999, 580-585. Ropers, A. H.W. Lollman, and J. Wellhausen [1999] “DSPstone: Texas Instruments TMS320C54x”, Technical Report Nr.IB 315 1999/9-ISS-Version 0.9, Aachen University of Technology, http://www.ert.rwth-aachen.de/Projekte/Tools/coal/dspstone_c54x/index.html STRAUSS, W. “DSP Strategies 2002,” Forward Concepts, 1998. http://www.usadata.com/ market_research/spr_05/spr_r127-005.htm
STRECKER, W. D. [1978]. “VAX-11/780: A virtual address extension of the PDP-11 family,” Proc. AFIPS National Computer Conf. 47, 967–980. TANENBAUM, A. S. [1978]. “Implications of structured programming for machine architecture,” Comm. ACM 21:3 (March), 237–246. TAYLOR, G., P. HILFINGER, J. LARUS, D. PATTERSON, AND B. ZORN [1986]. “Evaluation of the SPUR LISP architecture,” Proc. 13th Symposium on Computer Architecture (June), Tokyo. TEXAS INSTRUMENTs [2000]. “History of Innovation: 1980s,” http://www.ti.com/corp/docs/company/ history/1980s.shtml. THORNTON, J. E. [1964]. “Parallel operation in Control Data 6600,” Proc. AFIPS Fall Joint Computer Conf. 26, part 2, 33–40. UNGAR, D., R. BLAU, P. FOLEY, D. SAMPLES, AND D. PATTERSON [1984]. “Architecture of SOAR: Smalltalk on a RISC,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 188–197. van Eijndhoven, J.T.J.; Sijstermans, F.W.; Vissers, K.A.; Pol, E.J.D.; Tromp, M.I.A.; Struik, P.; Bloks, R.H.J.; van der Wolf, P.; Pimentel, A.D.; Vranken, H.P.E.[1999] “Trimedia CPU64 architecture,” Proc. 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors, ICCD'99, Austin, TX, USA, 10-13 Oct. 1999, 586-592. WAKERLY, J. [1989]. Microcomputer Architecture and Programming, J. Wiley, New York. WATERS, F., ED. [1986]. IBM RT Personal Computer Technology, IBM, Austin, Tex., SA 23-1057. WIECEK, C. [1982]. “A case study of the VAX 11 instruction set usage for compiler execution,” Proc. Symposium on Architectural Support for Programming Languages and Operating Systems (March), IEEE/ACM, Palo Alto, Calif., 177–184. WULF, W. [1981]. “Compilers and computer architecture,” Computer 14:7 (July), 41–47.
EXERCISES

■ Where do instruction sets come from? Since the earliest computers date from just after WWII, it should be possible to derive the ancestry of the instructions in modern computers. This project will take a good deal of delving into libraries and perhaps contacting pioneers, but see if you can derive the ancestry of the instructions in, say, MIPS.
■ It would be nice to try to do some comparisons with media processors and DSPs. How about this: “Very long instruction word (VLIW) computers are discussed in Chapter 4, but increasingly DSPs and media processors are adopting this style of instruction set architecture. One example is the TI TMS320C6203. See if you can compare code size of VLIW to more traditional computers. One attempt would be to code a common kernel across several computers. Another would be to get access to compilers for each computer and compare code sizes. Based on your data, is VLIW an appropriate architecture for embedded applications? Why or why not?”
■ Explicit reference to example Trimedia code.
■ 2.1 seems like a reasonable exercise, but make it second or third instead of leadoff?
2.1 [20/15/10] We are designing instruction set formats for a load-store architecture and are trying to decide whether it is worthwhile to have multiple offset lengths for branches and memory references. We have decided that both branch and memory references can have only 0-, 8-, and 16-bit offsets. The length of an instruction would be equal to 16 bits + offset length in bits. ALU instructions will be 16 bits. Figure 2.42 contains the data in cumulative form. Assume an additional bit is needed for the sign on the offset. For instruction set frequencies, use the data for MIPS from the average of the five benchmarks for the load-store computer in Figure 2.32. Assume that the miscellaneous instructions are all ALU instructions that use only registers.

Offset bits    Cumulative data references    Cumulative branches
0              30%                           0%
1              34%                           3%
2              35%                           11%
3              40%                           23%
4              47%                           37%
5              54%                           57%
6              60%                           72%
7              67%                           85%
8              72%                           91%
9              73%                           93%
10             74%                           95%
11             75%                           96%
12             77%                           97%
13             88%                           98%
14             92%                           99%
15             100%                          100%
FIGURE 2.42 The second and third columns contain the cumulative percentage of the data references and branches, respectively, that can be accommodated with the corresponding number of bits of magnitude in the displacement. These are the average distances of all programs in Figure 2.8.
a. [20] Suppose offsets were permitted to be 0, 8, or 16 bits in length, including the sign bit. What is the average length of an executed instruction?
b. [15] Suppose we wanted a fixed-length instruction and we chose a 24-bit instruction length (for everything, including ALU instructions). For every offset of longer than 8 bits, an additional instruction is required. Determine the number of instruction bytes fetched in this computer with fixed instruction size versus those fetched with a byte-variable-sized instruction as defined in part (a).
c. [10] Now suppose we use a fixed offset length of 16 bits so that no additional instruction is ever required. How many instruction bytes would be required? Compare this result to your answer to part (b), which used 8-bit fixed offsets that used additional instruction words when larger offsets were required.
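The bookkeeping behind part (a) of Exercise 2.1 can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the two lists are copied from Figure 2.42, an 8-bit offset is taken to hold 7 magnitude bits plus the sign bit, and a 16-bit offset 15 magnitude bits plus the sign bit. The function names and the instruction mix passed in at the end are hypothetical placeholders; the real frequencies must come from Figure 2.32.

```python
# Cumulative percentages copied from Figure 2.42, indexed by magnitude bits 0..15.
DATA_CUM = [30, 34, 35, 40, 47, 54, 60, 67, 72, 73, 74, 75, 77, 88, 92, 100]
BRANCH_CUM = [0, 3, 11, 23, 37, 57, 72, 85, 91, 93, 95, 96, 97, 98, 99, 100]


def offset_shares(cumulative):
    """Split references into those served by 0-, 8-, and 16-bit offsets.

    Assumption: one bit of each nonzero offset is the sign, so an 8-bit
    offset covers 7 magnitude bits and a 16-bit offset covers 15.
    """
    zero = cumulative[0] / 100
    short = (cumulative[7] - cumulative[0]) / 100
    long_ = (cumulative[15] - cumulative[7]) / 100
    return zero, short, long_


def mean_length_bits(cumulative, base_bits=16):
    """Average length of one instruction class: 16 bits plus the offset."""
    zero, short, long_ = offset_shares(cumulative)
    return base_bits + 0 * zero + 8 * short + 16 * long_


def average_instruction_bits(freq_mem, freq_branch):
    """Weighted average over memory references, branches, and 16-bit ALU ops."""
    freq_alu = 1.0 - freq_mem - freq_branch
    return (freq_mem * mean_length_bits(DATA_CUM)
            + freq_branch * mean_length_bits(BRANCH_CUM)
            + freq_alu * 16)


if __name__ == "__main__":
    print("memory reference:", round(mean_length_bits(DATA_CUM), 2), "bits")
    print("branch:", round(mean_length_bits(BRANCH_CUM), 2), "bits")
    # Hypothetical mix, NOT taken from Figure 2.32.
    print("average:", round(average_instruction_bits(0.36, 0.16), 2), "bits")
```

Swapping in the measured frequencies from Figure 2.32 for the placeholder mix gives the quantity the exercise asks for.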
■ OK exercise

2.2 [15/10] Several researchers have suggested that adding a register-memory addressing mode to a load-store computer might be useful. The idea is to replace sequences of

    LOAD  R1,0(Rb)
    ADD   R2,R2,R1

by

    ADD   R2,0(Rb)

Assume the new instruction will cause the clock cycle to increase by 10%. Use the instruction frequencies for the gcc benchmark on the load-store computer from Figure 2.32. The new instruction affects only the clock cycle and not the CPI.
a. [15] What percentage of the loads must be eliminated for the computer with the new instruction to have at least the same performance?
b. [10] Show a situation in a multiple instruction sequence where a load of R1 followed immediately by a use of R1 (with some type of opcode) could not be replaced by a single instruction of the form proposed, assuming that the same opcode exists.
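The break-even condition in part (a) of Exercise 2.2 follows from execution time = instruction count × CPI × clock cycle time. With CPI unchanged and the cycle 10% longer, the instruction count must shrink by a factor of 1/1.1, and every replaced LOAD/ADD pair removes exactly one load. The sketch below works through that arithmetic; the function name is hypothetical and the load frequency in the example is a placeholder, not the gcc value from Figure 2.32.

```python
def required_load_elimination(clock_increase, load_freq):
    """Fraction of loads that must be removed to break even on execution time.

    Break-even: new_count * (1 + clock_increase) == old_count, with CPI
    unchanged. Each replaced LOAD/ADD pair removes exactly one instruction.
    """
    count_reduction = 1.0 - 1.0 / (1.0 + clock_increase)
    return count_reduction / load_freq


if __name__ == "__main__":
    # Hypothetical load frequency of 25%; substitute the gcc figure from Figure 2.32.
    share = required_load_elimination(clock_increase=0.10, load_freq=0.25)
    print(f"loads to eliminate: {share:.1%}")
```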
■ Classic exercise, although it has been confusing to some in the past.

2.3 [20] Your task is to compare the memory efficiency of four different styles of instruction set architectures. The architecture styles are
1. Accumulator: All operations occur between a single register and a memory location.
2. Memory-memory: All three operands of each instruction are in memory.
3. Stack: All operations occur on top of the stack. Only push and pop access memory; all other instructions remove their operands from the stack and replace them with the result. The implementation uses a stack for the top two entries; accesses that use other stack positions are memory references.
4. Load-store: All operations occur in registers, and register-to-register instructions have three operands per instruction. There are 16 general-purpose registers, and register specifiers are 4 bits long.
To measure memory efficiency, make the following assumptions about all four instruction sets:
■ The opcode is always 1 byte (8 bits).
■ All memory addresses are 2 bytes (16 bits).
■ All data operands are 4 bytes (32 bits).
■ All instructions are an integral number of bytes in length.
There are no other optimizations to reduce memory traffic, and the variables A, B, C, and D are initially in memory. Invent your own assembly language mnemonics and write the best equivalent assembly language code for the high-level-language fragment given. Write the four code sequences for

    A = B + C;
    B = A + C;
    D = A - B;

Calculate the instruction bytes fetched and the memory-data bytes transferred. Which architecture is most efficient as measured by code size? Which architecture is most efficient as measured by total memory bandwidth required (code + data)?

2.4 [Discussion] What are the economic arguments (i.e., more computers sold) for and against changing instruction set architecture in desktop and server markets? What about embedded markets?

2.5 [25] Find an instruction set manual for some older computer (libraries and private bookshelves are good places to look). Summarize the instruction set with the discriminating characteristics used in Figure 2.3. Write the code sequence for this computer for the statements in Exercise 2.3. The size of the data need not be 32 bits as in Exercise 2.3 if the word size is smaller in the older computer.

2.6 [20] Consider the following fragment of C code: for (i=0; i Repeat Exercise 2.6, but this time write the code for the 80x86.

2.8 [20] For this question use the code sequence of Exercise 2.6, but put the scalar data (the value of i, the value of C, and the addresses of the array variables, but not the actual array) in registers and keep them there whenever possible. Write the code for MIPS; how many instructions are required dynamically? How many memory-data references will be executed? What is the code size in bytes?

2.9 [20] <App. D> Make the same assumptions and answer the same questions as the prior exercise, but this time write the code for the 80x86.

2.10 [15] When designing memory systems it becomes useful to know the frequency of memory reads versus writes and also accesses for instructions versus data. Using the
average instruction-mix information for MIPS in Figure 2.32, find
■ the percentage of all memory accesses for data
■ the percentage of data accesses that are reads
■ the percentage of all memory accesses that are reads
Ignore the size of a datum when counting accesses.

2.11 [18] Compute the effective CPI for MIPS using Figure 2.32. Suppose we have made the following measurements of average CPI for instructions:

Instruction                 Clock cycles
All ALU instructions        1.0
Loads-stores                1.4
Conditional branches
    Taken                   2.0
    Not taken               1.5
Jumps                       1.2

Assume that 60% of the conditional branches are taken and that all instructions in the miscellaneous category of Figure 2.32 are ALU instructions. Average the instruction frequencies of gcc and espresso to obtain the instruction mix.

2.12 [20/10] Consider adding a new index addressing mode to MIPS. The addressing mode adds two registers and an 11-bit signed offset to get the effective address. Our compiler will be changed so that code sequences of the form

    ADD R1, R1, R2
    LW  Rd, 100(R1)    (or store)

will be replaced with a load (or store) using the new addressing mode. Use the overall average instruction frequencies from Figure 2.32 in evaluating this addition.
a. [20] Assume that the addressing mode can be used for 10% of the displacement loads and stores (accounting for both the frequency of this type of address calculation and the shorter offset). What is the ratio of instruction count on the enhanced MIPS compared to the original MIPS?
b. [10] If the new addressing mode lengthens the clock cycle by 5%, which computer will be faster and by how much?
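One way to set up the arithmetic for Exercise 2.12 is sketched below in Python. It assumes that every use of the new addressing mode removes exactly one instruction (the ADD that formed the address) and that CPI is unchanged, so the execution-time ratio is the instruction-count ratio multiplied by the clock-cycle stretch. The function names are hypothetical, and the load and store frequencies in the example are placeholders rather than the averages from Figure 2.32.

```python
def instruction_count_ratio(load_freq, store_freq, usable_fraction=0.10):
    """Enhanced-to-original instruction count.

    Assumption: each load or store that can use the new mode makes the
    preceding ADD unnecessary, so one instruction disappears per use.
    """
    return 1.0 - usable_fraction * (load_freq + store_freq)


def execution_time_ratio(load_freq, store_freq,
                         usable_fraction=0.10, clock_stretch=0.05):
    """Enhanced-to-original execution time, with CPI assumed unchanged."""
    count = instruction_count_ratio(load_freq, store_freq, usable_fraction)
    return count * (1.0 + clock_stretch)


if __name__ == "__main__":
    # Hypothetical frequencies; substitute the averages from Figure 2.32.
    loads, stores = 0.26, 0.10
    print("instruction count ratio:", round(instruction_count_ratio(loads, stores), 4))
    print("execution time ratio:", round(execution_time_ratio(loads, stores), 4))
```

A time ratio above 1 means the original MIPS is faster; below 1, the enhanced version wins.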
2.13 [25/15] Find a C compiler and compile the code shown in Exercise 2.6 for one of the computers covered in this book. Compile the code both optimized and unoptimized.
a. [25] Find the instruction count, dynamic instruction bytes fetched, and data accesses done for both the optimized and unoptimized versions.
b. [15] Try to improve the code by hand and compute the same measures as in part (a) for your hand-optimized version.

2.14 [30] Small synthetic benchmarks can be very misleading when used for measuring instruction mixes. This is particularly true when these benchmarks are optimized. In this exercise and Exercises 2.15–2.17, we want to explore these differences. These programming exercises can be done with any load-store processor. Compile Whetstone with optimization. Compute the instruction mix for the top 20 most frequently executed instructions. How do the optimized and unoptimized mixes compare? How does the optimized mix compare to the mix for swim256 on the same or a similar processor?

2.15 [30] Follow the same guidelines as the prior exercise, but this time use Dhrystone and compare it with gcc.

2.16 [30] Many computer manufacturers now include tools or simulators that allow you to measure the instruction set usage of a user program. Among the methods in use are processor simulation, hardware-supported trapping, and a compiler technique that instruments the object-code module by inserting counters. Find a processor available to you that includes such a tool. Use it to measure the instruction set mix for one of TeX, gcc, or spice. Compare the results to those shown in this chapter.

2.17 [30] MIPS has only three operand formats for its register-register operations. Many operations might use the same destination register as one of the sources. We could introduce a new instruction format into MIPS called R2 that has only two operands and is a total of 24 bits in length. By using this instruction type whenever an operation had only two different register operands, we could reduce the instruction bandwidth required for a program. Modify the MIPS simulator to count the frequency of register-register operations with only two different register operands. Using the benchmarks that come with the simulator, determine how much more instruction bandwidth MIPS requires than MIPS with the R2 format.

2.18 [25] <App. C> How much do the instruction set variations among the RISC processors discussed in Appendix C affect performance? Choose at least three small programs (e.g., a sort), and code these programs in MIPS and two other assembly languages. What is the resulting difference in instruction count?
3
Instruction-Level Parallelism and its Dynamic Exploitation
“Who’s first?” “America.” “Who’s second?” “Sir, there is no second.” Dialog between two observers of the sailing race later named “The America’s Cup” and run every few years. This quote was the inspiration for John Cocke’s naming of the IBM research processor as “America.” This processor was the precursor to the RS/6000 series and the first superscalar microprocessor.
3.1   Instruction-Level Parallelism: Concepts and Challenges      167
3.2   Overcoming Data Hazards with Dynamic Scheduling             177
3.3   Dynamic Scheduling: Examples and the Algorithm              185
3.4   Reducing Branch Costs with Dynamic Hardware Prediction      193
3.5   High Performance Instruction Delivery                       207
3.6   Taking Advantage of More ILP with Multiple Issue            214
3.7   Hardware-Based Speculation                                  224
3.8   Studies of the Limitations of ILP                           240
3.9   Limitations on ILP for Realizable Processors                255
3.10  Putting It All Together: The P6 Microarchitecture           262
3.11  Another View: Thread Level Parallelism                      275
3.12  Crosscutting Issues: Using an ILP Datapath to Exploit TLP   276
3.13  Fallacies and Pitfalls                                      276
3.14  Concluding Remarks                                          279
3.15  Historical Perspective and References                       283
      Exercises                                                   291
3.1 Instruction-Level Parallelism: Concepts and Challenges

All processors since about 1985, including those in the embedded space, use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel. In this chapter and the next, we look at a wide range of techniques for extending the pipelining ideas by increasing the amount of parallelism exploited among instructions.

This chapter is at a considerably more advanced level than the material in Appendix A. If you are not familiar with the ideas in Appendix A, you should review that appendix before venturing into this chapter.

We start this chapter by looking at the limitation imposed by data and control hazards and then turn to the topic of increasing the ability of the processor to exploit parallelism. Section 3.1 introduces a large number of concepts, which we build on throughout these two chapters. While some of the more basic material in
this chapter could be understood without all of the ideas in Section 3.1, this basic material is important to later sections of this chapter as well as to Chapter 4.

There are two largely separable approaches to exploiting ILP. This chapter covers techniques that are largely dynamic and depend on the hardware to locate the parallelism. The next chapter focuses on techniques that are static and rely much more on software. In practice, this partitioning between dynamic and static and between hardware-intensive and software-intensive is not clean, and techniques from one camp are often used by the other. Nonetheless, for exposition purposes, we have separated the two approaches and tried to indicate where an approach is transferable. The dynamic, hardware-intensive approaches dominate the desktop and server markets and are used in a wide range of processors, including the Pentium III and 4, the Athlon, the MIPS R10000/12000, the Sun UltraSPARC III, the PowerPC 603, G3, and G4, and the Alpha 21264. The static, compiler-intensive approaches, which we focus on in the next chapter, have seen broader adoption in the embedded market than the desktop or server markets, although the new IA-64 architecture and Intel’s Itanium use this more static approach.

In this section, we discuss features of both programs and processors that limit the amount of parallelism that can be exploited among instructions, as well as the critical mapping between program structure and hardware structure, which is key to understanding whether a program property will actually limit performance and under what circumstances. Recall that the value of the CPI (Cycles per Instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
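The CPI decomposition above can be made concrete with a short Python sketch. The stall values are invented for illustration only and are not measurements from this chapter; the function names are likewise hypothetical. The second call lowers the ideal CPI while holding the stall terms fixed, which shows how the stalls then account for a larger share of the total.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + structural + data hazard + control stalls."""
    return ideal_cpi + structural + data_hazard + control


def stall_share(ideal_cpi, structural, data_hazard, control):
    """Fraction of the total CPI that is due to stalls."""
    total = pipeline_cpi(ideal_cpi, structural, data_hazard, control)
    return (total - ideal_cpi) / total


if __name__ == "__main__":
    # Illustrative stall cycles per instruction (assumed, not measured).
    stalls = dict(structural=0.05, data_hazard=0.30, control=0.15)
    for ideal in (1.0, 0.5):
        cpi = pipeline_cpi(ideal, **stalls)
        print(f"ideal CPI {ideal}: CPI = {cpi:.2f}, IPC = {1 / cpi:.2f}, "
              f"stall share = {stall_share(ideal, **stalls):.0%}")
```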
The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms of the right-hand side, we minimize the overall pipeline CPI and thus increase the IPC (Instructions per Clock). In this chapter we will see that the techniques we introduce to increase the ideal IPC can increase the importance of dealing with structural, data hazard, and control stalls. The equation above allows us to characterize the various techniques we examine in this chapter by what component of the overall CPI a technique reduces. Figure 3.1 shows the techniques we examine in this chapter and in the next, as well as the topics covered in the introductory material in Appendix A. Before we examine these techniques in detail, we need to define the concepts on which these techniques are built. These concepts, in the end, determine the limits on how much parallelism can be exploited.

Technique                                        Reduces                                                                      Section
Forwarding and bypassing                         Potential data hazard stalls                                                 A.2
Delayed branches and simple branch scheduling    Control hazard stalls                                                        A.2
Basic dynamic scheduling (scoreboarding)         Data hazard stalls from true dependences                                     A.8
Dynamic scheduling with renaming                 Data hazard stalls and stalls from antidependences and output dependences    3.2
Dynamic branch prediction                        Control stalls                                                               3.4
Issuing multiple instructions per cycle          Ideal CPI                                                                    3.6
Speculation                                      Data hazard and control hazard stalls                                        3.5
Dynamic memory disambiguation                    Data hazard stalls with memory                                               3.2, 3.7
Loop unrolling                                   Control hazard stalls                                                        4.1
Basic compiler pipeline scheduling               Data hazard stalls                                                           A.2, 4.1
Compiler dependence analysis                     Ideal CPI, data hazard stalls                                                4.4
Software pipelining, trace scheduling            Ideal CPI, data hazard stalls                                                4.3
Compiler speculation                             Ideal CPI, data, control stalls                                              4.4

FIGURE 3.1 The major techniques examined in Appendix A, Chapter 3, or Chapter 4 are shown together with the component of the CPI equation that the technique affects.

Instruction-Level Parallelism
All the techniques in this chapter and the next exploit parallelism among instructions. As we stated above, this type of parallelism is called instruction-level parallelism or ILP. The amount of parallelism available within a basic block (a straight-line code sequence with no branches in except to the entry and no branches out except at the exit) is quite small. For typical MIPS programs the average dynamic branch frequency is often between 15% and 25%, meaning that between four and seven instructions execute between a pair of branches. Since these instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks. The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel: for (i=1; i 30%), there are almost certainly some negative performance effects.
commit. On average, one uop commits per cycle, but, as Figure 3.56 shows, 23% of the time 3 uops commit in a cycle. This distribution demonstrates the ability of a dynamically-scheduled pipeline to fall behind (on 55% of the cycles, no uops commit) and later catch up (31% of the cycles have 2 or 3 uops committing). Figure 3.57 sums up all the possible issue and stall cycles per IA-32 instruction and compares it to the actual measured CPI on the processor. The uop cycles in Figure 3.57 are the number of cycles per instruction assuming that the processor sustains three uops per cycle and accounting for the number of uops required per IA-32 instruction for that benchmark. The sum of the issue cycles plus stalls exceeds the actual measured CPI by an average of 1.37, varying from 1.0 to 1.75. This difference arises from the ability of the dynamically-scheduled pipeline to overlap and hide different classes of stalls arising in different types of programs. The average CPI is 1.15 for the SPECint programs and 2.0 for the SPECFP programs. The P6 microarchitecture is clearly designed to focus on integer programs.

[Figure 3.56: stacked bar chart, one bar per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5); horizontal axis: 0% to 100% of cycles; bar segments: 0, 1, 2, or 3 uops commit.]
FIGURE 3.56 The breakdown in how often 0, 1, 2, or 3 uops commit in a cycle. The average number of uop completions per cycle is distributed as: 0 completions 55% of the cycles, 1 completion 13% of the cycles, 2 completions 8% of the cycles, and 3 completions 23% of the cycles.

The Pentium III versus the Pentium 4
The microarchitecture of the Pentium 4, which is called NetBurst, is similar to that of the Pentium III (called the P6 microarchitecture): both fetch up to three IA-32 instructions per cycle, decode them into micro-ops, and send the uops to an out-of-order execution engine that can graduate up to three uops per cycle. There are, however, many differences that are designed to allow the NetBurst microarchitecture to operate at a significantly higher clock rate than the P6 microarchitecture and to help maintain or close the gap between peak and sustained execution throughput. Among the most important of these are:
■ A much deeper pipeline: P6 requires about 10 clock cycles from the time a simple add instruction is fetched until the availability of its results. In comparison, NetBurst takes about 20 cycles, including 2 cycles reserved simply to drive results across the chip!
■ NetBurst uses register renaming (like the MIPS R10K and the Alpha 21264) rather than the reorder buffer, which is used in P6. Use of register renaming allows many more outstanding results (potentially up to 128) in NetBurst versus the 40 that are permitted in P6.
■ There are seven integer execution units in NetBurst versus five in P6. The additions are an additional integer ALU and an additional address computation unit.
[Figure 3.57: stacked bar chart, one bar per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5); horizontal axis: clock cycles per instruction, 0 to 6; bar segments: uops, instruction cache stalls, resource capacity stalls, branch mispredict penalty, data cache stalls.]
FIGURE 3.57 The actual CPI (shown as a line) is lower than the sum of the number of uop cycles plus all stalls. The uop cycles assume that three uops are completed every cycle and include the number of uops per instruction for the specific benchmark. All other stalls are the actual number of stall cycles. (TLB stalls that contribute less than 0.1 stalls/cycle are omitted). The overall CPI is lower than the sum of the uop cycles plus stalls through the use of dynamic scheduling.
n
n
n
An aggressive ALU (operating at twice the clock rate) and an aggressive data cache lead to lower latencies for the basic ALU operations (effectively one-half a clock cycle in NetBurst versus one in P6) and for data loads (effectively two cycles in NetBurst versus three in P6). These high-speed functional units are critical to lowering the potential increase in stalls from the very deep pipeline. NetBurst uses a sophisticated trace cache (see Chapter 5) to improve instruction fetch performance, while P6 uses a conventional prefetch buffer and instruction cache. Netburst has a branch target buffer that is eight times larger and has an improved prediction algorithm.
NetBurst has a level-1 data cache that is 8 KB, compared to P6’s 16 KB L1 data cache. NetBurst’s larger level-2 cache (256 KB) with higher bandwidth should offset this disadvantage.
NetBurst implements the new SSE2 floating-point instructions that allow two floating-point operations per instruction; these operations are structured as a 128-bit SIMD or short-vector structure. As we saw in Chapter 1, this gives the Pentium 4 a considerable advantage over the Pentium III on floating-point code.
A Brief Performance Comparison of the Pentium III and Pentium 4
As we saw in Figure 1.28 on page 60, the Pentium 4 at 1.7 GHz outperforms the Pentium III at 1 GHz by a factor of 1.26 for SPEC CINT2000 and 1.8 for SPEC CFP2000. Figure 3.58 shows the performance of the Pentium III and Pentium 4 on four of the SPEC benchmarks that are in both SPEC95 and SPEC2000. The floating-point benchmarks clearly take advantage of the new instruction set extensions and yield an advantage of 1.6–1.7 above clock rate scaling.
FIGURE 3.58 The performance of the Pentium 4 for four SPEC2000 benchmarks (two integer: gcc and vortex, and two floating point: applu and mgrid) exceeds the Pentium III by a factor of between 1.2 and 2.9. This exceeds the purely clock speed advantage for the floating-point benchmarks and is less than the clock speed advantage for the integer programs. [Plot not reproduced. Vertical axis: SPEC ratio, 0 to 900. Benchmarks: gcc, vortex, applu, mgrid. Series: Pentium III, Pentium 4.]
For the two integer benchmarks, the situation is somewhat different. In both cases the Pentium 4 delivers less than linear scaling with the increase in clock rate. If we assume the instruction counts are identical for integer codes on the two processors, then the CPI for the two integer benchmarks is higher on the Pentium 4 (by a factor of 1.1 for gcc and a factor of 1.5 for vortex). Looking at the data for the Pentium Pro, we can see that these benchmarks have relatively low level-2 miss rates and that they hide much of their level-1 miss penalty through dynamic scheduling and speculation. Thus, it is likely that the deeper pipeline and larger pipeline stall penalties on the Pentium 4 lead to a higher CPI for these two programs and reduce some of the gain from the high clock rate.
One interesting question is: why did the designers at Intel decide on the approach they took for the Pentium 4? On the surface, the alternative of doubling the issue rate of the Pentium III, as opposed to doubling the pipeline depth and the clock rate, looks at least as attractive. Of course, there are numerous changes between the two architectures, making an exact analysis of the tradeoffs difficult. Furthermore, because of the changes in the floating-point instruction set, a comparison of the two pipeline organizations needs to focus on integer performance.
There are two sources of performance loss that arise if we compare the deeper pipeline of the Pentium 4 with that of the Pentium III. The first is the increase in clock overhead that occurs due to increased clock skew and jitter. This overhead is given by the difference between the ideal clock speed and the achieved clock speed. In comparable technologies, the Pentium 4 clock rate is between 1.7 and 1.8 times higher than the Pentium III clock rate. This range represents between 85% and 90% of the ideal clock rate, which is 2 times higher.
The second source of performance loss is the increase in CPI that arises from the deeper pipeline. We can estimate this by comparing the ratio of the clock rates with the ratio of achieved overall performance. Using SPECInt as the performance measure and comparing a 1 GHz Pentium III to a 1.7 GHz Pentium 4, the performance ratio is 1.26.
This tells us that the CPI for SPECInt on the Pentium 4 must be about 1.7/1.26 = 1.35 times higher, or alternatively that the Pentium 4 achieves about 1.26/1.7 = 74% of the efficiency of the Pentium III. Of course, some of this loss is in the memory system, rather than in the pipeline. The key question is whether doubling the issue width would result in a greater than 1.26 times overall performance gain. This is a very difficult question to answer, since we must account for the improvement in pipeline CPI, the relative increase in cost of memory stalls, and the potential clock rate impact of a processor with twice the issue width. It is unlikely, looking at the data in Section 3.9, that doubling the issue rate will achieve better than a factor of 1.5 improvement in ideal instruction throughput. When combined with the potential impact on clock rate and the memory system costs, it appears that the choice of the Intel Pentium 4 designers to favor a deeper pipeline rather than wider issue is at least a reasonable design choice.
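The two loss estimates above follow directly from the processor performance equation. The Python sketch below redoes the arithmetic under the text's assumption that instruction counts are identical on the two processors, so that performance is proportional to clock rate divided by CPI. The function names are illustrative and do not come from the original.

```python
# Assumption (from the text): both processors execute the same number of
# instructions, so performance is proportional to clock_rate / CPI.

def clock_efficiency(achieved_ratio: float, ideal_ratio: float = 2.0) -> float:
    """Fraction of the ideal clock-rate gain that was actually achieved."""
    return achieved_ratio / ideal_ratio


def cpi_ratio(clock_ratio: float, perf_ratio: float) -> float:
    """How many times higher the CPI is on the faster-clocked processor."""
    return clock_ratio / perf_ratio


def relative_efficiency(clock_ratio: float, perf_ratio: float) -> float:
    """Performance gain obtained per unit of clock-rate gain."""
    return perf_ratio / clock_ratio


if __name__ == "__main__":
    # Pentium 4 at 1.7 GHz versus Pentium III at 1 GHz on SPECInt.
    print(f"clock efficiency: {clock_efficiency(1.7):.2f} to {clock_efficiency(1.8):.2f}")
    print(f"CPI ratio:        {cpi_ratio(1.7, 1.26):.2f}")
    print(f"efficiency:       {relative_efficiency(1.7, 1.26):.2f}")
```

The same three functions can be applied to any pair of processors for which a clock ratio and a measured performance ratio are known, as long as the equal-instruction-count assumption is plausible.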
3.11
Another View: Thread Level Parallelism
Throughout this chapter, our discussion has focused on exploiting parallelism in programs by finding and using the parallelism among instructions within the program. Although this approach has the great advantage that it is reasonably transparent to the programmer, as we have seen, ILP can be quite limited or hard to exploit in some applications. Furthermore, there may be significant parallelism occurring naturally at a higher level in the application that cannot be exploited with the approaches discussed in this chapter. For example, an online transaction processing system has natural parallelism among the multiple queries and updates that are presented by requests. These queries and updates can be processed mostly in parallel, since they are largely independent of one another. Similarly, embedded applications often have natural high-level parallelism. For example, a processor in a network router can exploit parallelism among independent packets.
This higher-level parallelism is called thread level parallelism because it is logically structured as separate threads of execution. A thread is a separate process with its own instructions and data. A thread may represent a process that is part of a parallel program consisting of multiple processes, or it may represent an independent program on its own. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute. Unlike instruction level parallelism, which exploits implicit parallel operations within a loop or straight-line code segment, thread level parallelism is explicitly represented by the use of multiple threads of execution that are inherently parallel.
Thread level parallelism is an important alternative to instruction level parallelism primarily because it could be more cost-effective to exploit than instruction level parallelism.
There are many important applications where thread level parallelism occurs naturally, as it does in many server applications. In other cases, the software is being written from scratch and expressing the inherent parallelism is easy, as is true in some embedded applications. Chapter 6 explores multiprocessors and the support they provide for thread level parallelism. The investment required to program applications to expose thread-level parallelism makes it costly to switch the large established base of software to multiprocessors. This is especially true for desktop applications, where the natural parallelism that is present in many server environments is harder to find. Thus, despite the potentially greater efficiency of exploiting thread-level parallelism, it is likely that ILP-based approaches will continue to be the primary focus for desktop-oriented processors.
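As a concrete picture of the explicit, thread-level parallelism described above, the following Python sketch hands independent requests to a pool of threads. The `handle_request` function and the request IDs are hypothetical stand-ins for the queries of a transaction processing system. The sketch shows only how the parallelism is expressed in software and makes no claim about speedup; in standard CPython builds the global interpreter lock limits how much Python bytecode actually runs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def handle_request(request_id: int) -> str:
    """Hypothetical unit of work that does not depend on any other request."""
    return f"request {request_id} done"


def serve(request_ids: Iterable[int], workers: int = 4) -> List[str]:
    """Process independent requests on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, regardless of which
        # thread finished first.
        return list(pool.map(handle_request, request_ids))


if __name__ == "__main__":
    for line in serve(range(5)):
        print(line)
```

Because the requests share no state, the programmer, rather than the hardware, identifies the parallel work, which is the contrast with ILP that the text draws.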
3.12
Crosscutting Issues: Using an ILP Datapath to Exploit TLP
Thread-level and instruction-level parallelism exploit two different kinds of parallel structure in a program. One natural question to ask is whether it is possible for a processor oriented toward instruction level parallelism to exploit thread level parallelism. The motivation for this question comes from the observation that a datapath designed to exploit higher amounts of ILP will find that functional units are often idle because of either stalls or dependences in the code. Could the parallelism among threads be used as a source of independent instructions that might keep the processor busy during stalls? Could this thread-level parallelism be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?
Multithreading, and a variant called simultaneous multithreading, take advantage of these insights by using thread level parallelism either as the primary form of parallelism exploitation (for example, on top of a simple pipelined processor) or as a method that works in conjunction with ILP mechanisms. In both cases, multiple threads are executed within a single processor by duplicating the thread-specific state (program counter, registers, and so on) and sharing the other processor resources by multiplexing them among the threads. Since multithreading is a method for exploiting thread level parallelism, we discuss it in more depth in Chapter 6.
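To make the idea concrete, here is a toy Python model of a single-issue pipeline. It assumes a hypothetical workload in which each instruction is described only by the number of cycles its thread must wait before it can issue again. The model compares running threads back to back with interleaving them cycle by cycle, so that one thread's stall cycles are filled by another thread's instructions. The model and its names are illustrative assumptions, not a description of any real processor.

```python
from typing import List


def cycles_to_issue_all(threads: List[List[int]]) -> int:
    """Cycles a single-issue processor needs to issue every instruction.

    threads[t][i] is the number of cycles thread t must wait after
    issuing its i-th instruction before it can issue the next one
    (1 means no stall). Each cycle, the first ready thread in
    round-robin order issues one instruction; if no thread is ready,
    the issue slot is wasted.
    """
    n = len(threads)
    next_instr = [0] * n   # per-thread position in its instruction stream
    ready_at = [0] * n     # first cycle at which each thread may issue
    remaining = sum(len(t) for t in threads)
    last = n - 1           # so that thread 0 is tried first
    cycle = 0
    while remaining > 0:
        for k in range(1, n + 1):
            t = (last + k) % n
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + threads[t][next_instr[t]]
                next_instr[t] += 1
                remaining -= 1
                last = t
                break
        cycle += 1
    return cycle


if __name__ == "__main__":
    thread = [3, 3, 3]  # every instruction is followed by two stall cycles
    alone = cycles_to_issue_all([thread])
    print("two threads back to back:", 2 * alone)
    print("two threads interleaved: ", cycles_to_issue_all([thread, thread]))
```

In this model two such threads need 14 cycles when run one after the other but only 8 when interleaved, because the second thread issues during cycles in which the first is stalled. Only the per-thread state (`next_instr`, `ready_at`) is duplicated; the issue slot itself is shared.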
3.13
Fallacies and Pitfalls
Our first fallacy has two parts; it shows that simple rules do not hold and that the choice of benchmarks plays a major role.
Fallacies: Processors with lower CPIs will always be faster. Processors with faster clock rates will always be faster.
Although a lower CPI is certainly better, sophisticated pipelines typically have slower clock rates than processors with simple pipelines. In applications with limited ILP or where the parallelism cannot be exploited by the hardware resources, the faster clock rate often wins. But, when significant ILP exists, a processor that exploits lots of ILP may be better. The IBM Power III processor is designed for high-performance FP and is capable of sustaining four instructions per clock, including two FP and two load-store instructions. In 2000 it offers a 400 MHz clock rate and achieves a SPEC CINT2000 peak rating of 249 and a SPEC CFP2000 peak rating of 344. The Pentium III has a comparably aggressive integer pipeline but has less aggressive FP units. An 800 MHz Pentium III in 2000 achieves a SPEC CINT2000 peak rating of 344 and a SPEC CFP2000 peak rating of 237. Thus, the faster clock rate of the Pentium III (800 MHz vs. 400 MHz) leads to an integer rating that is 1.38 times higher than the Power III, but the more aggressive FP pipeline of the Power III (and a better instruction set for floating point) leads to a lower CPI. If we assume comparable instruction counts, the Power III CPI must be almost 3x better than that of the Pentium III for the SPEC CFP2000 benchmarks (a 1.45 times higher rating at half the clock rate implies a CPI advantage of about 2 × 1.45 = 2.9), leading to an overall performance advantage of 1.45. Of course, this fallacy is nothing more than a restatement of a pitfall from Chapter 2 (see page XXX) about comparing processors using only one part of the performance equation.
Pitfall: Emphasizing an improvement in CPI by increasing issue rate while sacrificing clock rate can lead to lower performance.
The TI SuperSPARC design is a flexible multiple-issue processor capable of issuing up to three instructions per cycle. It had a 1994 clock rate of 60 MHz. The HP PA 7100 processor is a simple dual-issue processor (integer and FP combination) with a 99-MHz clock rate in 1994. The HP processor is faster on all the SPEC92 benchmarks except two of the integer benchmarks and one FP benchmark, as shown in Figure 3.59. On average, the two processors are close on integer, but the HP processor is about 1.5 times faster on the FP benchmarks. Of course, differences in compiler technology, detailed tradeoffs in the processor (including the cache size and memory organization), and the implementation technology could all contribute to the performance differences.
The potential of multiple-issue techniques has caused many designers to focus on improving CPI while possibly not focusing adequately on the trade-off in cycle time incurred when implementing these sophisticated techniques. This inclination arises at least partially because it is easier with good simulation tools to evaluate the impact of enhancements that affect CPI than it is to evaluate the cycle time impact. There are two factors that lead to this outcome. First, it is difficult to know the clock rate impact of an approach until the design is well underway, and then it may be too late to make large changes in the organization. Second, the design simulation tools available for determining and improving CPI are generally better than those available for determining and improving cycle time. In understanding the complex interaction between cycle time and various organizational approaches, the experience of the designers seems to be one of the most valuable factors. With ever more complex designs, however, even the best designers find it hard to understand the complex tradeoffs between clock rate and other organizational decisions. At the end of Section 3.10, we saw the opposite problem: how emphasizing a high clock rate, obtained through a deeper pipeline, can lead to degraded CPI and a lower performance gain than might be expected based solely on the higher clock rate.
FIGURE 3.59 The performance of a 99-MHz HP PA 7100 processor versus a 60-MHz SuperSPARC. The comparison is based on 1994 measurements. [Plot not reproduced. Vertical axis: SPEC ratio, 0 to 300. Horizontal axis: benchmarks. Series: HP PA 7100, TI SuperSPARC.]
Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement.
This pitfall is simply a restatement of Amdahl’s Law. A designer might simply look at a design, see a poor branch prediction mechanism and improve it, expecting to see significant performance improvements. The difficulty is that many factors limit the performance of multiple-issue machines, and improving one aspect of a processor often exposes some other aspect that previously did not limit performance. We can see examples of this in the data on ILP. For example, looking just at the effect of branch prediction in Figure 3.39 on page 302, we can see that going from a standard two-bit predictor to a tournament predictor significantly improves the parallelism in espresso (from an issue rate of 7 to an issue rate of 12). If the processor provides only 32 registers for renaming, however, the amount of parallelism is limited to 5 issues per clock cycle, even with a branch prediction scheme better than either alternative.
Pitfall: Sometimes bigger and dumber is better.
Advanced pipelines have focused on novel and increasingly sophisticated schemes for improving CPI. The 21264 uses a sophisticated tournament predictor with a total of 29 Kbits (see page 258), while the earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits). For the SPEC95 benchmarks, the more sophisticated branch predictor of the 21264 outperforms the simpler 2-bit scheme on all but one benchmark. On average, for SPECInt95, the 21264 has 11.5 mispredictions per 1000 instructions committed while the 21164 has about 16.5 mispredictions. Somewhat surprisingly, the simpler 2-bit scheme works better for the transaction processing workload than the sophisticated 21264 scheme (17 mispredictions vs. 19 per 1000 completed instructions)! How can a predictor with less than 1/7 the number of bits and a much simpler scheme actually work better? The answer lies in the structure of the workload. The transaction processing workload has a very large code size (more than an order of magnitude larger than any SPEC95 benchmark) with a large branch frequency. The ability of the 21164 predictor to hold twice as many branch predictions based on purely local behavior (2K vs. the 1K local predictor in the 21264) seems to provide a slight advantage. This pitfall also reminds us that different applications can produce different behaviors. As processors become more sophisticated, including specific microarchitectural features aimed at some particular program behavior, it is likely that different applications will see more divergent behavior.
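For readers who want to see what a simple 2-bit predictor with 2K entries involves, here is a minimal Python sketch of a table of 2-bit saturating counters. It is a generic textbook scheme, not the actual 21164 design: the indexing by low-order PC bits, the 4-byte instruction size, and the initial counter value are all assumptions made for illustration.

```python
from typing import Iterable


class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, entries: int = 2048, initial: int = 2):
        self.entries = entries
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        self.counters = [initial] * entries

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

    def storage_bits(self) -> int:
        return 2 * self.entries


def count_mispredictions(predictor: TwoBitPredictor, pc: int,
                         outcomes: Iterable[bool]) -> int:
    """Run one branch through the predictor and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    p = TwoBitPredictor()
    loop_branch = ([True] * 9 + [False]) * 3  # a 10-iteration loop, run 3 times
    print("storage bits:  ", p.storage_bits())
    print("mispredictions:", count_mispredictions(p, 0x400, loop_branch))
```

With 2K entries the table occupies 4 Kbits, matching the figure quoted above. Branches whose addresses share the same low-order bits share a counter, which is one reason table capacity matters for a workload with a very large code footprint.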
3.14
Concluding Remarks
The tremendous interest in multiple-issue organizations came about because of an interest in improving performance without affecting the standard uniprocessor programming model. Although taking advantage of ILP is conceptually simple, the design problems are amazingly complex in practice. It is extremely difficult to achieve the performance you might expect from a simple first-level analysis. The trade-offs between increasing clock speed and decreasing CPI through multiple issue are extremely hard to quantify. In the 1995 edition of this book, we stated: “Although you might expect that it is possible to build an advanced multiple-issue processor with a high clock rate, a factor of 1.5 to 2 in clock rate has consistently separated the highest clock rate processors and the most sophisticated multiple-issue processors. It is simply too early to tell whether this difference is due to fundamental implementation trade-offs, or to the difficulty of dealing with the complexities in multiple-issue processors, or simply a lack of experience in implementing such processors.” Given the availability of the Alpha 21264 at 800 MHz, the Pentium III at 1.1 GHz, the AMD Athlon at 1.3 GHz, and the Pentium 4 at 2 GHz, it is clear that the limitation was primarily our understanding of how to build such processors. It is also likely that the first generation of CAD tools used for more than two million logic transistors was a limitation.
One insight that was clear in 1995 and remains clear in 2000 is that the peak to sustained performance ratios for multiple-issue processors are often quite large and typically grow as the issue rate grows. Thus, increasing the clock rate by X is almost always a better choice than increasing the issue width by X, though often the clock rate increase may rely largely on deeper pipelining, substantially narrowing the advantage. This insight probably played a role in motivating Intel to pursue a deeper pipeline for the Pentium 4, rather than trying to increase the issue width. Recall, however, the fundamental observation we made in Chapter 1 about the improvement in semiconductor technologies: the number of transistors available grows faster than the speed of the transistors. Thus, a strategy that focuses only on deeper pipelining may not be the best use of the technology in the long run.
Rather than embracing dramatic new approaches in microarchitecture, the last five years have focused on raising the clock rates of multiple-issue machines and narrowing the gap between peak and sustained performance. The dynamically scheduled, multiple-issue processors announced in the last two years (the Alpha 21264, the Pentium III and 4, and the AMD Athlon) have the same basic structure and similar sustained issue rates (three to four instructions per clock) as the first dynamically scheduled, multiple-issue processors announced in 1995! But the clock rates are 4 to 8 times higher, the caches are 2 to 4 times bigger, there are 2 to 4 times as many renaming registers, and twice as many load/store units! The result is performance that is 6 to 10 times higher. All the leading-edge desktop and server processors are large, complex chips with more than 15 million transistors per processor.
Notwithstanding, a simple two-way superscalar that issues FP instructions in parallel with integer instructions, or dual issues integer instructions (but not memory references) can probably be built with little impact on clock rate and with a tiny die size (in comparison to today’s processes). Such a processor should perform well with a higher sustained to peak ratio than the high-end wide-issue processors and can be amazingly cost-effective. As a result, the high end of the embedded space has recently moved to multiple-issue processors! Whether approaches based primarily on faster clock rates, simpler hardware, and more static scheduling or approaches using more sophisticated hardware to achieve lower CPI will win out is difficult to say and may depend on the benchmarks.
Practical Limitations on Exploiting More ILP
Independent of the method used to exploit ILP, there are potential limitations that arise from employing more transistors. When the number of transistors employed is increased, the clock period is often determined by wire delays encountered both in distributing the clock and in the communication path of critical signals, such as those that signal exceptions. These delays make it more difficult to employ increased numbers of transistors to exploit more ILP, while also increasing the clock rate. These problems are sometimes overcome by adding additional stages, which are reserved just for communicating signals across longer wires. The Pentium 4 does this. These increased clock stages, however, can lead to more stalls and a higher CPI, since they increase pipeline latency. We saw exactly this phenomenon when comparing the Pentium 4 to the Pentium III.
Although the limitations explored in Section 3.8 act as significant barriers to exploiting more ILP, it may be that more basic challenges would prevent the efficient exploitation of additional ILP, even if it could be uncovered. For example, doubling the issue rates above the current rates of four instructions per clock will probably require a processor to sustain three or four memory accesses per cycle and probably resolve two or three branches per cycle. In addition, supplying eight instructions per cycle will probably require fetching sixteen, speculating through multiple branches, and accessing roughly twenty registers per cycle. None of this is impossible, but whether it can be done while simultaneously maintaining clock rates exceeding 2 GHz is an open question and will surely be a significant challenge for any design team!
Equal in importance to the CPI versus clock rate trade-off are realistic limitations on power. Recall that dynamic power is proportional to the product of the number of switching transistors and the switching rate. A microprocessor trying to achieve both a low CPI and a high clock rate fights both of these factors. Achieving an improved CPI means more instructions in flight and more transistors switching every clock cycle. Two factors make it likely that the switching transistor count grows faster than performance. The first is the gap between peak issue rates and sustained performance, which continues to grow.
Since the number of transistors switching is likely to be proportional to the peak issue rate and the performance is proportional to the sustained rate, the growing performance gap translates to more transistor switches per unit of performance. Second, issuing multiple instructions incurs some overhead in logic that grows as the issue rate grows. This logic is responsible for instruction issue analysis, including dependence checking, register renaming, and similar functions. The combined result is that, without voltage reductions to decrease power, lower CPIs are likely to lead to lower ratios of performance per watt.
A similar conundrum applies to attempts to increase clock rate. Of course, increasing the clock rate will increase transistor switching frequency and directly increase power consumption. As we saw in the Pentium 4 discussion, a deeper pipeline structure can be used to achieve a clock rate increase that exceeds what could be obtained just from improvements in transistor speed. Deeper pipelines, however, incur additional power penalties, resulting from several sources. The most important of these is the simple observation that a deeper pipeline means more operations are in flight every clock cycle, which means more transistors are switching, which means more power!
The key is to understand the extent to which this potential growth in power, caused by an increase in both the switching frequency and the number of transistors switching, is offset by a reduction in the operating voltage. Although this relationship is complex to understand, we can look at the results empirically and draw some conclusions. The Pentium III and Pentium 4 provide an opportunity to examine this issue. As discussed on page 324, the Pentium 4 has a much deeper pipeline and can exploit more ILP than the Pentium III, although its basic peak issue rate is the same. The operating voltage of the Pentium 4 at 1.7 GHz is slightly higher than that of a 1 GHz Pentium III: 1.75 V versus 1.70 V. The power difference, however, is much larger: the 1.7 GHz Pentium 4 consumes 64 W typical, while the 1 GHz Pentium III consumes only 30 W. Figure 3.60 shows the effective performance of a 1.7 GHz Pentium 4 per watt relative to the performance per watt of a 1 GHz Pentium III using the same benchmarks presented in Figure 1.28 on page 60. Clearly, while the Pentium 4 is faster, its higher clock rate, deeper pipeline, and higher sustained execution rate make it significantly less power efficient. Whether the decrease in power efficiency from the Pentium III to the Pentium 4 reflects deep issues that are unlikely to be overcome, or is an artifact of the two implementations, is a key question that will probably be settled in future implementations. What is clear is that neither deeper pipelines nor wider issue rates can circumvent the need to consume more power to improve performance.
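A rough version of this comparison can be worked out from numbers already quoted in the chapter. The Python sketch below combines the SPEC performance ratios cited earlier from Figure 1.28 (1.26 for CINT2000 and 1.8 for CFP2000) with the typical power figures above. The helper name is illustrative, and the values plotted in Figure 3.60 may differ somewhat because they come from the measured benchmark data.

```python
def relative_perf_per_watt(perf_ratio: float,
                           power_new_w: float,
                           power_old_w: float) -> float:
    """Performance per watt of the new processor relative to the old one.

    perf_ratio is new performance divided by old performance.
    """
    return perf_ratio * power_old_w / power_new_w


if __name__ == "__main__":
    P4_WATTS, P3_WATTS = 64.0, 30.0  # typical power, as quoted in the text
    for suite, ratio in (("CINT2000", 1.26), ("CFP2000", 1.8)):
        rel = relative_perf_per_watt(ratio, P4_WATTS, P3_WATTS)
        print(f"{suite}: {rel:.2f}")
```

Both results come out below 1 (roughly 0.59 and 0.84), that is, about 41% and 16% lower performance per watt, in line with the 15% to 40% range given in the Figure 3.60 caption.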
FIGURE 3.60 The relative performance per watt of the Pentium 4 is 15% to 40% less than the Pentium III on these five sets of benchmarks. [Plot not reproduced. Vertical axis: relative performance per watt, 0.00 to 0.90. Benchmark sets: SPECbase CINT2000, SPECbase CFP2000, Multimedia, Game benchmark, Web benchmark.]
More generally, the question of how best to exploit parallelism remains open. Clearly, ILP will continue to play a big role because of its smaller impact on programmers and applications when compared to an explicitly parallel model using multiple threads and parallel processors. What sort of parallelism computer architects will employ as they try to achieve higher performance levels, and what type of parallelism programmers will accept, are hard to predict. Likewise, it is unclear whether vectors will play a larger role in processors designed for multimedia and DSP applications, or whether such processors will rely on limited SIMD and ILP approaches. We will return to these questions in the next chapter as well as in Chapter 6.
3.15
Historical Perspective and References
This section describes some of the major advances in dynamically scheduled pipelines and ends with some of the recent literature on multiple-issue processors. Ideas such as data flow computation derived from observations that programs were limited by data dependence. The history of basic pipelining and the CDC 6600, the first dynamically scheduled processor, are contained in Appendix A.
The IBM 360 Model 91: A Landmark Computer
The IBM 360/91 introduced many new concepts, including tagging of data, register renaming, dynamic detection of memory hazards, and generalized forwarding. Tomasulo’s algorithm is described in his 1967 paper. Anderson, Sparacio, and Tomasulo [1967] describe other aspects of the processor, including the use of branch prediction. Many of the ideas in the 360/91 faded from use for nearly 25 years before being broadly resurrected in the 1990s. Unfortunately, the 360/91 was not successful and only a handful were sold. The complexity of the design made it late to the market and allowed the Model 85, which was the first IBM processor with a cache, to outperform the 91.
Branch Prediction Schemes
The two-bit dynamic hardware branch prediction scheme was described by J. E. Smith [1981]. Ditzel and McLellan [1987] describe a novel branch-target buffer for CRISP, which implements branch folding. McFarling and Hennessy [1986] did a quantitative comparison of a variety of compile-time and runtime branch prediction schemes. The correlating predictor we examine was described by Pan, So, and Rameh [1992]. Yeh and Patt [1992, 1993] generalized the correlation idea and described multilevel predictors that use branch histories for each branch, similar to the local history predictor used in the 21264. McFarling’s tournament prediction scheme, which he refers to as a combined predictor, is described in his 1993 technical report. There are a variety of more recent papers on branch prediction based on variations in the multilevel and correlating predictor ideas. Kaeli and Emma [1991] describe return address prediction.
The Development of Multiple-Issue Processors
The concept of multiple-issue designs has been around for a while, though much of the work in the 1970s focused on statically scheduled approaches, which we discuss in the next chapter. IBM did pioneering work on multiple issue. In the 1960s, a project called ACS was underway in California. It included multiple-issue concepts, a proposal for dynamic scheduling (although with a simpler mechanism than Tomasulo’s scheme, which used back-up registers), and fetching down both branch paths. The project originally started as a new architecture to follow Stretch and surpass the CDC 6600/6800. ACS started in New York but was moved to California, later changed to be S/360 compatible, and eventually canceled. John Cocke was one of the intellectual forces behind the team, which included a number of IBM veterans and younger contributors, many of whom went on to other important roles in IBM and elsewhere: Jack Bertram, Ed Sussenguth, Gene Amdahl, Herb Schorr, Fran Allen, Lynn Conway, and Phil Dauber, among others. While the compiler team published many of their ideas and had great influence outside IBM, the architecture ideas were not widely disseminated at that time. The most complete accessible documentation of this important project is at http://www.cs.clemson.edu/~mark/acs.html, which includes interviews with the ACS veterans and pointers to other sources. Sussenguth [1999] is a good overview of ACS.
More than 10 years after ACS was canceled, John Cocke made a new proposal for a superscalar processor that dynamically made issue decisions; he described the key ideas in several talks in the mid 1980s and coined the name superscalar. He called the design America; it is described by Agerwala and Cocke [1987].
The IBM Power-1 architecture (the RS/6000 line) is based on these ideas (see Bakoglu et al. [1989]).
J. E. Smith [1984] and his colleagues at Wisconsin proposed the decoupled approach that included multiple issue with limited dynamic pipeline scheduling. A key feature of this processor is the use of queues to maintain order among a class of instructions (such as memory references) while allowing it to slip behind or ahead of another class of instructions. The Astronautics ZS-1 described by Smith et al. [1987] embodies this approach with queues to connect the load-store unit and the operation units. The Power-2 design uses queues in a similar fashion. J. E. Smith [1989] also describes the advantages of dynamic scheduling and compares that approach to static scheduling.
The concept of speculation has its roots in the original 360/91, which performed a very limited form of speculation. The approach used in recent processors combines the dynamic scheduling techniques of the 360/91 with a buffer to allow in-order commit. J. E. Smith and Pleszkun [1988] explored the use of buffering to maintain precise interrupts and described the concept of a reorder buffer. Sohi [1990] describes adding renaming and dynamic scheduling, making it possible to use the mechanism for speculation. Patt and his colleagues were early proponents of aggressive reordering and speculation. They focused on checkpoint and restart mechanisms and pioneered an approach called HPSm, which is also an extension of Tomasulo’s algorithm [Hwu and Patt 1986].
The use of speculation as a technique in multiple-issue processors was evaluated by Smith, Johnson, and Horowitz [1989] using the reorder buffer technique; their goal was to study available ILP in nonscientific code using speculation and multiple issue. In a subsequent book, M. Johnson [1990] describes the design of a speculative superscalar processor. Johnson later led the AMD K-5 design, one of the first speculative superscalars.
Studies of ILP and Ideas to Increase ILP
A series of early papers, including Tjaden and Flynn [1970] and Riseman and Foster [1972], concluded that only small amounts of parallelism could be available at the instruction level without investing an enormous amount of hardware. These papers dampened the appeal of multiple instruction issue for more than ten years. Nicolau and Fisher [1984] published a paper based on their work with trace scheduling and asserted the presence of large amounts of potential ILP in scientific programs.
Since then there have been many studies of the available ILP. Such studies have been criticized since they presume some level of both hardware support and compiler technology. Nonetheless, the studies are useful to set expectations as well as to understand the sources of the limitations. Wall has participated in several such studies, including Jouppi and Wall [1989], Wall [1991], and Wall [1993].
Although the early studies were criticized as being conservative (e.g., they didn’t include speculation), the last study is by far the most ambitious study of ILP to date and the basis for the data in section 3.10. Sohi and Vajapeyam [1989] give measurements of available parallelism for wide-instruction-word processors. Smith, Johnson, and Horowitz [1989] also used a speculative superscalar processor to study ILP limits. At the time of their study, they anticipated that the processor they specified was an upper bound on reasonable designs. Recent and upcoming processors, however, are likely to be at least as ambitious as their processor. Lam and Wilson [1992] looked at the limitations imposed by speculation and showed that additional gains are possible by allowing processors to speculate in multiple directions, which requires more than one PC. (Such schemes cannot exceed what perfect speculation accomplishes, but they help close the gap between realistic prediction schemes and perfect prediction.) Wall’s 1993 study includes a limited evaluation of this approach (up to 8 branches are explored).
Chapter 3 Instruction-Level Parallelism and its Dynamic Exploitation
Going Beyond the Data Flow Limit
One other approach that has been explored in the literature is the use of value prediction. Value prediction can allow speculation based on data values. There have been a number of studies of the use of value prediction. Lipasti and Shen published two papers in 1996 evaluating the concept of value prediction and its potential impact on ILP exploitation. Sodani and Sohi [1997] approach the same problem from the viewpoint of reusing the values produced by instructions. Moshovos, Breach, Vijaykumar, and Sohi [1997] show that deciding when to speculate on values, by tracking whether such speculation has been accurate in the past, is important to achieving performance gains with value speculation. Moshovos and Sohi [1997] and Chrysos and Emer [1998] focus on predicting memory dependences and using this information to eliminate the dependence through memory. González and González [1998], Gabbay and Mendelson [1998], and Calder, Reinman, and Tullsen [1999] are more recent studies of the use of value prediction. This area is currently highly active with new results being published in every conference.

Recent Advanced Microprocessors

The years 1994–95 saw the announcement of wide superscalar processors (3 or more issues per clock) by every major processor vendor: Intel Pentium Pro and Pentium II (these processors share the same core pipeline architecture, described by Colwell and Steck [1995]), AMD K5, K6, and Athlon, Sun UltraSPARC (see Lauterbach and Horel [1999]), Alpha 21164 (see Edmondson et al. [1995]) and 21264 (see Kessler [2000]), MIPS R10000 and R12000 (see Yeager [1996]), PowerPC 603, 604, 620 (see Diep, Nelson, and Shen [1995]), and HP 8000 (Kumar [1997]). The latter part of the decade (1996–2000) saw second generations of many of these processors (Pentium III, AMD Athlon, Alpha 21264, among others). The second generation, although similar in issue rate, could sustain a lower CPI and provided much higher clock rates; all included dynamic scheduling, and almost universally supported speculation. In practice, many factors, including the implementation technology, the memory hierarchy, the skill of the designers, and the type of applications benchmarked, all play a role in determining which approach is best. Figure 3.61 shows the most interesting processors of the past five years and their characteristics.
3.15
Historical Perspective and References
Processor | System ship | Max. current CR (MHz) | Power (W) | Transistors (M) | Window size | Rename registers (int/FP) | Issue rate: Maximum/Memory/Integer/FP/Branch | Branch predict buffer | Pipe stages (int/load)
MIPS R14000 | 2000 | 400 | 25 | 7 | 48 | 32/32 | 4/1/2/2/1 | 2K x 2 | 6
UltraSPARC III | 2001 | 900 | 65 | 29 | N.A. | None | 4/1/4/3/1 | 16K x 2 | 14/15
Pentium III | 2000 | 1000 | 30 | 24 | 40 | Total: 40 | 3/2/2/1/1 | 512 entries | 12/14
Pentium 4 | 2001 | 1700 | 64 | 42 | 126 | Total: 128 | 3/2/3/2/1 | 4K x 2 | 22/24
HP PA 8600 | 2001 | 552 | 60 | 130 | 56 | Total: 56 | 4/2/2/2/1 | 2K x 2 | 7/9
Alpha 21264B | 2001 | 833 | 75 | 15 | 80 | 41/41 | 4/2/4/2/1 | multilevel (see p. 258) | 7/9
PowerPC 7400 (G4) | 2000 | 450 | 5 | 7 | 5 | 6/6 | 3/1/2/1/1 | 512 x 2 | 4/5
AMD Athlon | 2001 | 1330 | 76 | 37 | 72 | 36/36 | 3/2/3/3/1 | 4K x 9 | 9/11
IBM Power3-II | 2000 | 450 | 36 | 23 | 32 | 16/24 | 4/2/2/2/2 | 2K x 2 | 7/8
FIGURE 3.61 Recent high-performance processors and their characteristics. The window size column shows the size of the buffer available for instructions and, hence, the maximum number of instructions in flight. Both the Pentium III and the Athlon schedule microoperations, and the window is the maximum number of microoperations in execution. The IBM, HP, and UltraSPARC processors support dynamic issue, but not speculation. To read more about these processors, the following references are useful: IBM Journal of Research and Development (contains issues on Power and PowerPC designs), the Digital Technical Journal (contains issues on various Alpha processors), Proceedings of the Hot Chips Symposium (annual meeting at Stanford, which reviews the newest microprocessors), the International Solid State Circuits Conference, the annual Microprocessor Forum meetings, and the annual International Symposium on Computer Architecture. Much of the data in this table came from Microprocessor Report online, April 30, 2001.
References

AGERWALA, T. AND J. COCKE [1987]. “High performance reduced instruction set processors,” IBM Tech. Rep. (March).
ANDERSON, D. W., F. J. SPARACIO, AND R. M. TOMASULO [1967]. “The IBM 360 Model 91: Processor philosophy and instruction handling,” IBM J. Research and Development 11:1 (January), 8–24.
AUSTIN, T. M. AND G. SOHI [1992]. “Dynamic dependency analysis of ordinary programs,” Proc. 19th Symposium on Computer Architecture (May), Gold Coast, Australia, 342–351.
BAKOGLU, H. B., G. F. GROHOSKI, L. E. THATCHER, J. A. KAELI, C. R. MOORE, D. P. TATTLE, W. E. MALE, W. R. HARDELL, D. A. HICKS, M. NGUYEN PHU, R. K. MONTOYE, W. T. GLOVER, AND S. DHAWAN [1989]. “IBM second-generation RISC processor organization,” Proc. Int’l Conf. on Computer Design, IEEE (October), Rye, N.Y., 138–142.
BHANDARKAR, D. AND D. W. CLARK [1991]. “Performance from architecture: Comparing a RISC and a CISC with similar hardware organizations,” Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Palo Alto, Calif., 310–319.
BHANDARKAR, D. AND J. DING [1997]. “Performance characterization of the Pentium Pro processor,” Proc. Third International Sym. on High Performance Computer Architecture, IEEE (February), San Antonio, 288–297.
BLOCH, E. [1959]. “The engineering design of the Stretch computer,” Proc. Fall Joint Computer Conf., 48–59.
BUCHOLTZ, W. [1962]. Planning a Computer System: Project Stretch, McGraw-Hill, New York.
CALDER, B., G. REINMAN, AND D. TULLSEN [1999]. “Selective value prediction,” Proc. 26th International Symposium on Computer Architecture (ISCA) (June), Atlanta.
CHEN, T. C. [1980]. “Overlap and parallel processing,” in Introduction to Computer Architecture, H. Stone, ed., Science Research Associates, Chicago, 427–486.
CHRYSOS, G. Z. AND J. S. EMER [1998]. “Memory dependence prediction using store sets,” Proc. 25th Int. Symposium on Computer Architecture (ISCA) (June), Barcelona, 142–153.
CLARK, D. W. [1987]. “Pipelining and performance in the VAX 8800 processor,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 173–177.
COLWELL, R. P. AND R. STECK [1995]. “A 0.6um BiCMOS process with dynamic execution,” Proc. Int. Sym. on Solid State Circuits.
CVETANOVIC, Z. AND R. E. KESSLER [2000]. “Performance analysis of the Alpha 21264-based Compaq ES40 system,” Proc. 27th Symposium on Computer Architecture (June), Vancouver, Canada, 192–202.
DAVIDSON, E. S. [1971]. “The design and control of pipelined function generators,” Proc. Conf. on Systems, Networks, and Computers, IEEE (January), Oaxtepec, Mexico, 19–21.
DAVIDSON, E. S., A. T. THOMAS, L. E. SHAR, AND J. H. PATEL [1975]. “Effective control for pipelined processors,” COMPCON, IEEE (March), San Francisco, 181–184.
DIEP, T. A., C. NELSON, AND J. P. SHEN [1995]. “Performance evaluation of the PowerPC 620 microarchitecture,” Proc. 22nd Symposium on Computer Architecture (June), Santa Margherita, Italy.
DITZEL, D. R. AND H. R. MCLELLAN [1987]. “Branch folding in the CRISP microprocessor: Reducing the branch delay to zero,” Proc. 14th Symposium on Computer Architecture (June), Pittsburgh, 2–7.
EDMONDSON, J. H., P. I. RUBINFIELD, R. PRESTON, AND V. RAJAGOPALAN [1995]. “Superscalar instruction execution in the 21164 Alpha microprocessor,” IEEE Micro 15:2, 33–43.
EMER, J. S. AND D. W. CLARK [1984]. “A characterization of processor performance in the VAX-11/780,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 301–310.
FOSTER, C. C. AND E. M. RISEMAN [1972]. “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. on Computers C-21:12 (December), 1411–1415.
GABBAY, F. AND A. MENDELSON [1998]. “Using value prediction to increase the power of speculative execution hardware,” ACM Trans. on Computer Systems 16:3 (August), 234–270.
GONZÁLEZ, J. AND A. GONZÁLEZ [1998]. “Limits of instruction level parallelism with data speculation,” Proc. of the VECPAR Conf., 585–598.
HEINRICH, J. [1993]. MIPS R4000 User’s Manual, Prentice Hall, Englewood Cliffs, N.J.
HWU, W.-M. AND Y. PATT [1986]. “HPSm, a high performance restricted data flow architecture having minimum functionality,” Proc. 13th Symposium on Computer Architecture (June), Tokyo, 297–307.
IBM [1990]. “The IBM RISC System/6000 processor” (collection of papers), IBM J. of Research and Development 34:1 (January).
JOHNSON, M. [1990]. Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J.
JORDAN, H. F. [1983]. “Performance measurements on HEP: A pipelined MIMD computer,” Proc. 10th Symposium on Computer Architecture (June), 207–212.
JOUPPI, N. P. AND D. W. WALL [1989]. “Available instruction-level parallelism for superscalar and superpipelined processors,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 272–282.
KAELI, D. R. AND P. G. EMMA [1991]. “Branch history table prediction of moving target branches due to subroutine returns,” Proc. 18th Int. Sym. on Computer Architecture (ISCA) (May), Toronto, 34–42.
KELLER, R. M. [1975]. “Look-ahead processors,” ACM Computing Surveys 7:4 (December), 177–195.
KESSLER, R. [1999]. “The Alpha 21264 microprocessor,” IEEE Micro 19:2 (March/April), 24–36.
KILLIAN, E. [1991]. “MIPS R4000 technical overview–64 bits/100 MHz or bust,” Hot Chips III Symposium Record (August), Stanford University, 1.6–1.19.
KOGGE, P. M. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.
KUMAR, A. [1997]. “The HP PA-8000 RISC CPU,” IEEE Micro 17:2 (March/April).
KUNKEL, S. R. AND J. E. SMITH [1986]. “Optimal pipelining in supercomputers,” Proc. 13th Symposium on Computer Architecture (June), Tokyo, 404–414.
LAM, M. S. AND R. P. WILSON [1992]. “Limits of control flow on parallelism,” Proc. 19th Symposium on Computer Architecture (May), Gold Coast, Australia, 46–57.
LAUTERBACH, G. AND T. HOREL [1999]. “UltraSPARC-III: Designing third generation 64-bit performance,” IEEE Micro 19:3 (May/June).
LIPASTI, M. H., C. B. WILKERSON, AND J. P. SHEN [1996]. “Value locality and load value prediction,” Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (October), 138–147.
LIPASTI, M. H. AND J. P. SHEN [1996]. “Exceeding the dataflow limit via value prediction,” Proc. 29th Annual ACM/IEEE International Symposium on Microarchitecture (December).
MCFARLING, S. [1993]. “Combining branch predictors,” WRL Technical Note TN-36 (June), Digital Western Research Laboratory, Palo Alto, Calif.
MOSHOVOS, A. AND G. S. SOHI [1997]. “Streamlining inter-operation memory communication via data dependence prediction,” Proc. 30th Annual Int. Sym. on Microarchitecture (MICRO-30) (December), 235–245.
MOSHOVOS, A., S. BREACH, T. N. VIJAYKUMAR, AND G. S. SOHI [1997]. “Dynamic speculation and synchronization of data dependences,” Proc. 24th Int. Sym. on Computer Architecture (ISCA) (June), Boulder.
NICOLAU, A. AND J. A. FISHER [1984]. “Measuring the parallelism available for very long instruction word architectures,” IEEE Trans. on Computers C-33:11 (November), 968–976.
PAN, S.-T., K. SO, AND J. T. RAMEH [1992]. “Improving the accuracy of dynamic branch prediction using branch correlation,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 76–84.
POSTIFF, M. A., D. A. GREENE, G. S. TYSON, AND T. N. MUDGE [1999]. “The limits of instruction level parallelism in SPEC95 applications,” Computer Architecture News 27:1, ACM (March), 31–40.
RAMAMOORTHY, C. V. AND H. F. LI [1977]. “Pipeline architecture,” ACM Computing Surveys 9:1 (March), 61–102.
RISEMAN, E. M. AND C. C. FOSTER [1972]. “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. on Computers C-21:12 (December), 1411–1415.
RYMARCZYK, J. [1982]. “Coding guidelines for pipelined processors,” Proc. Symposium on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 12–19.
SITES, R. [1979]. Instruction Ordering for the CRAY-1 Computer, Tech. Rep. 78-CS-023 (July), Dept. of Computer Science, Univ. of Calif., San Diego.
SMITH, A. AND J. LEE [1984]. “Branch prediction strategies and branch-target buffer design,” Computer 17:1 (January), 6–22.
SMITH, J. E. AND A. R. PLESZKUN [1988]. “Implementing precise interrupts in pipelined processors,” IEEE Trans. on Computers 37:5 (May), 562–573. This paper is based on an earlier paper that appeared in Proc. 12th Symposium on Computer Architecture, June 1985.
SMITH, J. E. [1981]. “A study of branch prediction strategies,” Proc. Eighth Symposium on Computer Architecture (May), Minneapolis, 135–148.
SMITH, J. E. [1984]. “Decoupled access/execute computer architectures,” ACM Trans. on Computer Systems 2:4 (November), 289–308.
SMITH, J. E. [1989]. “Dynamic instruction scheduling and the Astronautics ZS-1,” Computer 22:7 (July), 21–35.
SMITH, J. E., G. E. DERMER, B. D. VANDERWARN, S. D. KLINGER, C. M. ROZEWSKI, D. L. FOWLER, K. R. SCIDMORE, AND J. P. LAUDON [1987]. “The ZS-1 central processor,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 199–204.
SMITH, M. D., M. HOROWITZ, AND M. S. LAM [1992]. “Efficient superscalar performance through boosting,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 248–259.
SMITH, M. D., M. JOHNSON, AND M. A. HOROWITZ [1989]. “Limits on multiple instruction issue,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 290–302.
SODANI, A. AND G. SOHI [1997]. “Dynamic instruction reuse,” Proc. 24th Int. Symp. on Computer Architecture (June).
SOHI, G. S. [1990]. “Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers,” IEEE Trans. on Computers 39:3 (March), 349–359.
SOHI, G. S. AND S. VAJAPEYAM [1989]. “Tradeoffs in instruction format design for horizontal architectures,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 15–25.
SUSSENGUTH, E. [1999]. “IBM’s ACS-1 machine,” IEEE Computer 22:11 (November).
THORLIN, J. F. [1967]. “Code generation for PIE (parallel instruction execution) computers,” Proc. Spring Joint Computer Conf. 27.
THORNTON, J. E. [1964]. “Parallel operation in the Control Data 6600,” Proc. AFIPS Fall Joint Computer Conf., Part II, 26, 33–40.
THORNTON, J. E. [1970]. Design of a Computer, the Control Data 6600, Scott, Foresman, Glenview, Ill.
TJADEN, G. S. AND M. J. FLYNN [1970]. “Detection and parallel execution of independent instructions,” IEEE Trans. on Computers C-19:10 (October), 889–895.
TOMASULO, R. M. [1967]. “An efficient algorithm for exploiting multiple arithmetic units,” IBM J. Research and Development 11:1 (January), 25–33.
WALL, D. W. [1991]. “Limits of instruction-level parallelism,” Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Santa Clara, Calif., 248–259.
WALL, D. W. [1993]. Limits of Instruction-Level Parallelism, Research Rep. 93/6, Western Research Laboratory, Digital Equipment Corp. (November).
WEISS, S. AND J. E. SMITH [1984]. “Instruction issue logic for pipelined supercomputers,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 110–118.
WEISS, S. AND J. E. SMITH [1987]. “A study of scalar compilation techniques for pipelined supercomputers,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 105–109.
WEISS, S. AND J. E. SMITH [1994]. Power and PowerPC, Morgan Kaufmann, San Francisco.
YEAGER, K. ET AL. [1996]. “The MIPS R10000 superscalar microprocessor,” IEEE Micro 16:2 (April), 28–40.
YEH, T. AND Y. N. PATT [1992]. “Alternative implementations of two-level adaptive branch prediction,” Proc. 19th International Symposium on Computer Architecture (May), Gold Coast, Australia, 124–134.
YEH, T. AND Y. N. PATT [1993]. “A comparison of dynamic branch predictors that use two levels of branch history,” Proc. 20th Symposium on Computer Architecture (May), San Diego, 257–266.
Exercises

3.1 Exercise from Dave (not fully thought out, but a good direction): Given a table like that in Figure 3.25 on page 275 or Figure 3.26 on page 276 and some of the following, deduce the rest:
a. the original code
b. the number of functional units
c. the number of instructions issued per clock
d. the functional units
3.2 [10] For the following code fragment, list the control dependences. For each control dependence, tell whether the statement can be scheduled before the if statement based on the data references. Assume that all data references are shown, that all values are defined before use, and that only b and c are used again after this segment. You may ignore any possible exceptions.

    if (a > c) {
        d = d + 5;
        a = b + d + e;
    } else {
        e = e + 2;
        f = f + 2;
        c = c + f;
    }
    b = a + f;
A good exercise but requires describing how scoreboards work. There are a number of problems based on scoreboards, which may be salvageable by one of the following: introducing scoreboards (maybe not worth it), removing part of the renaming capability (WAW or WAR) and asking about the result, or recasting the problem to ask how Tomasulo avoids the problem.

3.3 [20] It is critical that the scoreboard be able to distinguish RAW and WAR hazards, since a WAR hazard requires stalling the instruction doing the writing until the instruction reading an operand initiates execution, but a RAW hazard requires delaying the reading instruction until the writing instruction finishes, which is just the opposite. For example, consider the sequence:

    MUL.D   F0,F6,F4
    SUB.D   F8,F0,F2
    ADD.D   F2,F10,F2

The SUB.D depends on the MUL.D (a RAW hazard) and thus the MUL.D must be allowed to complete before the SUB.D; if the MUL.D were stalled for the SUB.D due to the inability to distinguish between RAW and WAR hazards, the processor would deadlock. This sequence contains a WAR hazard between the ADD.D and the SUB.D, and the ADD.D cannot be allowed to complete until the SUB.D begins execution. The difficulty lies in distinguishing the RAW hazard between MUL.D and SUB.D, and the WAR hazard between the SUB.D and ADD.D. Describe how the scoreboard for a processor with two multiply units and two add units avoids this problem and show the scoreboard values for the above sequence assuming the ADD.D is the only instruction that has completed execution (though it has not written its result). (Hint: Think about how WAW hazards are prevented and what this implies about active instruction sequences.)
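
As a rough aid for Exercise 3.3, the register hazards in the three-instruction sequence can be enumerated mechanically. The Python sketch below is illustrative only: the tuple encoding and the `find_hazards` helper are assumptions made for this example, and the pairwise check deliberately ignores whether an intervening instruction rewrites the register.

```python
from itertools import combinations

# (opcode, destination register, source registers) for the Exercise 3.3 sequence.
# This encoding is a hypothetical convenience, not a format from the text.
SEQUENCE = [
    ("MUL.D", "F0", ("F6", "F4")),
    ("SUB.D", "F8", ("F0", "F2")),
    ("ADD.D", "F2", ("F10", "F2")),
]


def find_hazards(instrs):
    """Return (kind, earlier_op, later_op, register) for each register hazard.

    Every ordered pair (earlier, later) is checked independently, so this is
    a simplification: it does not notice when a write in between would hide
    the dependence.
    """
    hazards = []
    for early, late in combinations(instrs, 2):
        e_op, e_dst, e_srcs = early
        l_op, l_dst, l_srcs = late
        if e_dst in l_srcs:      # later instruction reads what earlier one writes
            hazards.append(("RAW", e_op, l_op, e_dst))
        if l_dst in e_srcs:      # later instruction writes what earlier one reads
            hazards.append(("WAR", e_op, l_op, l_dst))
        if e_dst == l_dst:       # both write the same register
            hazards.append(("WAW", e_op, l_op, e_dst))
    return hazards


if __name__ == "__main__":
    for hazard in find_hazards(SEQUENCE):
        print(hazard)
```

For this sequence the sketch reports a RAW hazard on F0 between MUL.D and SUB.D and a WAR hazard on F2 between SUB.D and ADD.D, matching the discussion above. It says nothing about how a scoreboard records or resolves them, which remains the point of the exercise.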
A good exercise. I would transform it by saying that sometimes the CDB bandwidth acts as a limit: using the 2-issue Tomasulo pipeline, show a sequence where 2 CDBs are not enough and can eventually cause a stall.

3.4 [12] A shortcoming of the scoreboard approach occurs when multiple functional units that share input buses are waiting for a single result. The units cannot start simultaneously, but must serialize. This property is not true in Tomasulo’s algorithm. Give a code sequence that uses no more than 10 instructions and shows this problem. Assume the hardware configuration from Figure 4.3 for the scoreboard and Figure 3.2 for Tomasulo’s scheme. Use the FP latencies from Figure 4.2 (page 224). Indicate where the Tomasulo approach can continue, but the scoreboard approach must stall.
A good exercise, but does it require reworking (e.g., showing how even with 1 issue/clock, a single CDB can be a problem) to save it?

3.5 [15] Tomasulo’s algorithm also has a disadvantage versus the scoreboard: only one result can complete per clock, due to the CDB. Use the hardware configuration from Figures 4.3 and 3.2 and the FP latencies from Figure 4.2 (page 224). Find a code sequence of no more than 10 instructions where the scoreboard does not stall, but Tomasulo’s algorithm must due to CDB contention. Indicate where this occurs in your sequence.
Maybe also try a version of this with multiple issue?

3.6 [45] One benefit of a dynamically scheduled processor is its ability to tolerate changes in latency or issue capability without requiring recompilation. This capability was a primary motivation behind the 360/91 implementation. The purpose of this programming assignment is to evaluate this effect. Implement a version of Tomasulo’s algorithm for MIPS to issue one instruction per clock; your implementation should also be capable of in-order issue. Assume fully pipelined functional units and the latencies shown in Figure 3.62.

Unit | Latency
Integer | 7
Branch | 9
Load-store | 11
FP add | 13
FP mul | 15
FP divide | 17

FIGURE 3.62 Latencies for functional units.
A one-cycle latency means that the unit and the result are available for the next instruction. Assume the processor takes a one-cycle stall for branches, in addition to any data-dependent stalls shown in the above table. Choose 5–10 small FP benchmarks (with loops) to run; compare the performance with and without dynamic scheduling. Try scheduling the loops by hand and see how close you can get with the statically scheduled processor to the dynamically scheduled results. Change the processor to the configuration shown in Figure 3.63.

Unit | Latency
Integer | 19
Branch | 21
Load-store | 23
FP add | 25
FP mul | 27
FP divide | 29

FIGURE 3.63 Latencies for functional units, configuration 2.
Rerun the loops and compare the performance of the dynamically scheduled processor and the statically scheduled processor.

3.7 [15] Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume 90% hit rate and 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.

3.8 [10] Determine the improvement from branch folding for unconditional branches. Assume a 90% hit rate, a base CPI without unconditional branch stalls of 1, and an unconditional branch frequency of 5%. How much improvement is gained by this enhancement versus a processor whose effective CPI is 1.1?

3.9 [30] Implement a simulator to evaluate the performance of a branch-prediction buffer that does not store branches that are predicted as untaken. Consider the following prediction schemes: a one-bit predictor storing only predicted taken branches, a two-bit predictor storing all the branches, a scheme with a target buffer that stores only predicted taken branches and a two-bit prediction buffer. Explore different sizes for the buffers keeping the total number of bits (assuming 32-bit addresses) the same for all schemes. Determine what the branch penalties are, using Figure 3.21 as a guideline. How do the different schemes compare both in prediction accuracy and in branch cost?

3.10 [30] Implement a simulator to evaluate various branch prediction schemes. You can use the instruction portion of a set of cache traces to simulate the branch-prediction buffer. Pick a set of table sizes (e.g., 1K bits, 2K bits, 8K bits, and 16K bits). Determine the performance of both (0,2) and (2,2) predictors for the various table sizes. Also compare the performance of the degenerate predictor that uses no branch address information for these table sizes. Determine how large the table must be for the degenerate predictor to perform as well as a (0,2) predictor with 256 entries.
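
A possible starting point for the simulator in Exercise 3.10 is sketched below in Python. It is a minimal model under stated assumptions, not a reference solution: the class name `CorrelatingPredictor`, the `accuracy` helper, the use of global history, counters initialized to zero, and indexing by `(pc >> 2) % entries` (which assumes 4-byte instructions) are all choices made for this example. The trace is assumed to be a list of `(pc, taken)` pairs extracted from whatever trace source is used.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history pick one of 2**m
    n-bit saturating counters in the entry selected by the branch address."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                      # last m outcomes, newest in bit 0
        self.max_count = (1 << n) - 1
        # All counters start at 0 (strongly not taken); an assumption.
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits carry no information.
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        # Predict taken when the counter is in the upper half of its range.
        return self._row(pc)[self.history] > self.max_count // 2

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        row[self.history] = min(count + 1, self.max_count) if taken else max(count - 1, 0)
        if self.m:
            mask = (1 << self.m) - 1
            self.history = ((self.history << 1) | int(taken)) & mask

    def total_bits(self):
        return self.entries * (1 << self.m) * self.n


def accuracy(predictor, trace):
    """Fraction of branches in trace, a list of (pc, taken), predicted correctly."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


if __name__ == "__main__":
    # Synthetic trace: one branch that alternates taken / not taken.
    trace = [(0x400, i % 2 == 0) for i in range(100)]
    for m, entries in ((0, 1024), (2, 256)):   # both tables hold 2K bits
        p = CorrelatingPredictor(m, 2, entries)
        print((m, 2), p.total_bits(), accuracy(p, trace))
```

Setting `entries=1` gives the degenerate predictor that uses no branch address information. On the synthetic alternating trace, the (2,2) configuration learns the pattern after a short warm-up while the (0,2) configuration is right only half the time; real traces are needed to answer the exercise.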
This is an interesting exercise to do in several forms: Tomasulo, multiple issue with Tomasulo, and even speculation. Needs some reworking. May want to ask them to create tables like those in the text (Figures 3.25 on page 275 and 3.26 on page 276).

3.11 [20/22/22/22/22/25/25/25/20/22/22] In this exercise, we will look at how a common vector loop runs on a variety of pipelined versions of MIPS. The loop is the so-called SAXPY loop (discussed extensively in Appendix B) and the central operation in Gaussian elimination. The loop implements the vector operation Y = a × X + Y for a vector of length 100. Here is the MIPS code for the loop:

    foo:  L.D     F2,0(R1)     ;load X(i)
          MUL.D   F4,F2,F0     ;multiply a*X(i)
          L.D     F6,0(R2)     ;load Y(i)
          ADD.D   F6,F4,F6     ;add a*X(i) + Y(i)
          S.D     F6,0(R2)     ;store Y(i)
          DADDUI  R1,R1,#8     ;increment X index
          DADDUI  R2,R2,#8     ;increment Y index
          DSGTUI  R3,R1,done   ;test if done
          BEQZ    R3,foo       ;loop if not done
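
For readers who prefer to see the computation at a higher level, the following Python sketch shows what the loop above calculates. The function name `saxpy` and the list-based representation are assumptions for illustration; in the MIPS code, the comments indicate that F0 holds the scalar a and that R1 and R2 index X and Y.

```python
def saxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i] for every element."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]   # MUL.D, then ADD.D, then S.D back to Y(i)
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    print(saxpy(2.0, x, y))
```

Each trip through the Python loop corresponds to one iteration of the MIPS loop: two loads, a multiply, an add, a store, and the index and branch overhead.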
For (a)–(e), assume that the integer operations issue and complete in one clock cycle (including loads) and that their results are fully bypassed. Ignore the branch delay. You will use the FP latencies shown in Figure 4.2 (page 224). Assume that the FP unit is fully pipelined.

a. [20] For this problem use the standard single-issue MIPS pipeline with the pipeline latencies from Figure 4.2. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) on the first iteration of the loop. How many clock cycles does each loop iteration take?
b. [22] Use the MIPS code for SAXPY above and a fully pipelined FPU with the latencies of Figure 4.2. Assume Tomasulo’s algorithm for the hardware with one integer unit taking one execution cycle (a latency of 0 cycles to use) for all integer operations. Show the state of the reservation stations and register-status tables (as in Figure 3.3) when the DSGTUI writes its result on the CDB. Do not include the branch.
c. [22] Using the MIPS code for SAXPY above, assume a scoreboard with the FP functional units described in Figure 4.3, plus one integer functional unit (also used for load-store). Assume the latencies shown in Figure 3.64. Show the state of the scoreboard (as in Figure 4.4) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take? You may ignore any register port/bus conflicts.

Instruction producing result | Instruction using result | Latency in clock cycles
FP multiply | FP ALU op | 6
FP add | FP ALU op | 4
FP multiply | FP store | 5
FP add | FP store | 3
Integer operation (including load) | Any | 0

FIGURE 3.64 Pipeline latencies where latency is number of cycles between producing and consuming instruction.

d. [25] Use the MIPS code for SAXPY above. Assume Tomasulo’s algorithm for the hardware using one fully pipelined FP unit and one integer unit. Assume the latencies shown in Figure 3.64. Show the state of the reservation stations and register status tables (as in Figure 3.3) when the branch is executed for the second time. Assume the branch was correctly predicted as taken. How many clock cycles does each loop iteration take?
e. [25] Assume a superscalar architecture with Tomasulo’s algorithm for scheduling that can issue any two independent operations in a clock cycle (including two integer operations). Unwind the MIPS code for SAXPY to make four copies of the body and schedule it assuming the FP latencies of Figure 4.2. Assume one fully pipelined copy of each functional unit (e.g., FP adder, FP multiplier) and two integer functional units with latency to use of 0. How many clock cycles will each iteration on the original code take? When unwinding, you should optimize the code as in section 3.1. What is the speedup versus the original code?

f. [25] In a superpipelined processor, rather than have multiple functional units, we would fully pipeline all the units. Suppose we designed a superpipelined MIPS that had twice the clock rate of our standard MIPS pipeline and could issue any two unrelated instructions in the same time that the normal MIPS pipeline issued one operation. If the second instruction is dependent on the first, only the first will issue. Unroll the MIPS SAXPY code to make four copies of the loop body and schedule it for this superpipelined processor, assuming the FP latencies of Figure 3.64. Also assume the load to use latency is 1 cycle, but other integer unit latencies are 0 cycles. How many clock cycles does each loop iteration take? Remember that these clock cycles are half as long as those on a standard MIPS pipeline or a superscalar MIPS.
g.
[22] Using the MIPS code for SAXPY above, assume a speculative processor with the functional unit organization used in section 3.5 and separate functional units for comparison, for branches, for effective address calculation, and for ALU operations. Assume the latencies shown in Figure 3.64. Show the state of the processor (as in Figure 3.30) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take?
h.
[22] Using the MIPS code for SAXPY above, assume a speculative processor like Figure 3.29 that can issue one load-store, one integer operation, and one FP operation each cycle. Assume the latencies in clock cycles of Figure 3.64. Show the state of the processor (as in Figure 3.30) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take?
3.12 [15/15] Consider our speculative processor from section 3.5. Since the reorder buffer contains a value field, you might think that the value field of the reservation stations could be eliminated. a.
[15] Show an example where this is the case and an example where the value field of the reservation stations is still needed. Use the speculative processor shown in Figure 3.29. Show MIPS code for both examples. How many value fields are needed in each reservation station?
b.
[15] Find a modification to the rules for instruction commit that allows elimination of the value fields in the reservation station. What are the negative side effects of such a change?
3.13 [20] Our implementation of speculation uses a reorder buffer and introduces the concept of instruction commit, delaying commit and the irrevocable updating of the registers until we know an instruction will complete. There are two other possible implementation techniques, both originally developed as a method for preserving precise interrupts when issuing out of order. One idea introduces a future file that keeps future values of a register; this idea is similar to the reorder buffer. An alternative is to keep a history buffer that records values of registers that have been speculatively overwritten.
Design a speculative processor like the one in section 3.5 but using a history buffer. Show the state of the processor, including the contents of the history buffer, for the example in Figure 3.31. Show the changes needed to Figure 3.32 for a history buffer implementation. Describe exactly how and when entries in the history buffer are read and written, including what happens on an incorrect speculation.

3.14 [30/30] This exercise involves a programming assignment to evaluate what types of parallelism might be expected in more modest, and more realistic, processors than those studied in section 3.8. These studies can be done using traces available with this text or obtained from other tracing programs. For simplicity, assume perfect caches. For a more ambitious project, assume a real cache. To simplify the task, make the following assumptions:

- Assume perfect branch and jump prediction: hence you can use the trace as the input to the window, without having to consider branch effects; the trace is perfect.
- Assume there are 64 spare integer and 64 spare floating-point registers; this is easily implemented by stalling the issue of the processor whenever there are more live registers required.
- Assume a window size of 64 instructions (the same for alias detection). Use greedy scheduling of instructions in the window. That is, at any clock cycle, pick for execution the first n instructions in the window that meet the issue constraints.

a. [30] Determine the effect of limited instruction issue by performing the following experiments:

- Vary the issue count from 4 to 16 instructions per clock.
- Assuming eight issues per clock: determine what the effect of restricting the processor to two memory references per clock is.

b. [30] Determine the impact of latency in instructions. Assume the following latency models for a processor that issues up to 16 instructions per clock:

- Model 1: All latencies are one clock.
- Model 2: Load latency and branch latency are one clock; all FP latencies are two clocks.
- Model 3: Load and branch latency is two clocks; all FP latencies are five clocks.
Remember that with limited issue and a greedy scheduler, the impact of latency effects will be greater.

3.15 [Discussion] Dynamic instruction scheduling requires a considerable investment in hardware. In return, this capability allows the hardware to run programs that could not be run at full speed with only compile-time, static scheduling. What trade-offs should be taken into account in trying to decide between a dynamically and a statically scheduled implementation? What situations in either hardware technology or program characteristics are likely to favor one approach or the other? Most speculative schemes rely on dynamic scheduling; how does speculation affect the arguments in favor of dynamic scheduling?

3.16 [Discussion] There is a subtle problem that must be considered when implementing Tomasulo’s algorithm. It might be called the “two ships passing in the night problem.” What happens if an instruction is being passed to a reservation station during the same clock period as one of its operands is going onto the common data bus? Before an instruction is in a reservation station, the operands are fetched from the register file; but once it is in the station, the operands are always obtained from the CDB. Since the instruction and its operand tag are in transit to the reservation station, the tag cannot be matched against the tag on the CDB. So there is a possibility that the instruction will then sit in the reservation station forever waiting for its operand, which it just missed. How might this problem be solved? You might consider subdividing one of the steps in the algorithm into multiple parts. (This intriguing problem is courtesy of J. E. Smith.)

3.17 [Discussion] Discuss the advantages and disadvantages of a superscalar implementation, a superpipelined implementation, and a VLIW approach in the context of MIPS. What levels of ILP favor each approach? What other concerns would you consider in choosing which type of processor to build? How does speculation affect the results?
4
Exploiting Instruction Level Parallelism with Software Approaches
Processors are being produced with the potential for very many parallel operations on the instruction level....Far greater extremes in instruction-level parallelism are on the horizon.
J. Fisher [1981], in the paper that inaugurated the term “instruction-level parallelism”
One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that an EPIC processor is less complex than a superscalar processor. It's hard to know why this is so, but one can speculate that the overall complexity involved in focusing on CPI, as IA-64 does, makes it hard to get high megahertz. M. Hopkins [2000], in a commentary on the IA-64 architecture, a joint development of HP and Intel designed to achieve dramatic increases in the exploitation of ILP while retaining a simple architecture, which would allow higher performance.
4.1 Basic Compiler Techniques for Exposing ILP
4.2 Static Branch Prediction
4.3 Static Multiple Issue: the VLIW Approach
4.4 Advanced Compiler Support for Exposing and Exploiting ILP
4.5 Hardware Support for Exposing More Parallelism at Compile-Time
4.6 Crosscutting Issues
4.7 Putting It All Together: The Intel IA-64 Architecture and Itanium Processor
4.8 Another View: ILP in the Embedded and Mobile Markets
4.9 Fallacies and Pitfalls
4.10 Concluding Remarks
4.11 Historical Perspective and References
Exercises
4.1 Basic Compiler Techniques for Exposing ILP

This chapter starts by examining the use of compiler technology to improve the performance of pipelines and simple multiple-issue processors. These techniques are key even for processors that make dynamic issue decisions but use static scheduling and are crucial for processors that use static issue or static scheduling. After applying these concepts to reducing stalls from data hazards in single issue pipelines, we examine the use of compiler-based techniques for branch prediction. Armed with this more powerful compiler technology, we examine the design and performance of multiple-issue processors using static issuing or scheduling. Sections 4.4 and 4.5 examine more advanced software and hardware techniques, designed to enable a processor to exploit more instruction-level parallelism. Putting It All Together examines the IA-64 architecture and its first implementation, Itanium. Two different static, VLIW-style processors are covered in Another View.
Basic Pipeline Scheduling and Loop Unrolling

To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. A compiler’s ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline. Throughout this chapter we will assume the FP unit latencies shown in Figure 4.1, unless different latencies are explicitly stated. We assume the standard 5-stage integer pipeline, so that branches have a delay of one clock cycle. We assume that the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards.

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
FIGURE 4.1 Latencies of FP operations used in this chapter. The first column shows the originating instruction type. The second column is the type of the consuming instruction. The last column is the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.
In this subsection, we look at how the compiler can increase the amount of available ILP by unrolling loops. This example serves both to illustrate an important technique and to motivate the more powerful program transformations described later in this chapter. We will rely on an example similar to the one we used in the last chapter, adding a scalar to a vector:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
We can see that this loop is parallel by noticing that the body of each iteration is independent. We will formalize this notion later in this chapter and describe how we can test whether loop iterations are independent at compile-time. First, let’s look at the performance of this loop, showing how we can use the parallelism to improve its performance for a MIPS pipeline with the latencies shown above. The first step is to translate the above segment to MIPS assembly language. In the following code segment, R1 is initially the address of the element in the array
with the highest address, and F2 contains the scalar value, s. Register R2 is precomputed, so that 8(R2) is the last element to operate on. The straightforward MIPS code, not scheduled for the pipeline, looks like this:

Loop: L.D     F0,0(R1)     ;F0=array element
      ADD.D   F4,F0,F2     ;add scalar in F2
      S.D     F4,0(R1)     ;store result
      DADDUI  R1,R1,#-8    ;decrement pointer 8 bytes (per DW)
      BNE     R1,R2,Loop   ;branch R1!=R2
Let’s start by seeing how well this loop will run when it is scheduled on a simple pipeline for MIPS with the latencies from Figure 4.1.

EXAMPLE
Show how the loop would look on MIPS, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for both delays from floating-point operations and from the delayed branch.

ANSWER
Without any scheduling the loop will execute as follows:

                                Clock cycle issued
Loop: L.D     F0,0(R1)          1
      stall                     2
      ADD.D   F4,F0,F2          3
      stall                     4
      stall                     5
      S.D     F4,0(R1)          6
      DADDUI  R1,R1,#-8         7
      stall                     8
      BNE     R1,R2,Loop        9
      stall                     10

This code requires 10 clock cycles per iteration. We can schedule the loop to obtain only one stall:

Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      BNE     R1,R2,Loop   ;delayed branch
      S.D     F4,8(R1)     ;altered & interchanged with DADDUI

Execution time has been reduced from 10 clock cycles to 6. The stall after ADD.D is for the use by the S.D.
Notice that to schedule the delayed branch, the compiler had to determine that it could swap the DADDUI and S.D by changing the address to which the S.D stored: the address was 0(R1) and is now 8(R1). This change is not trivial, since most compilers would see that the S.D instruction depends on the DADDUI and would refuse to interchange them. A smarter compiler, capable of limited symbolic optimization, could figure out the relationship and perform the interchange. The chain of dependent instructions from the L.D to the ADD.D and then to the S.D determines the clock cycle count for this loop. This chain must take at least 6 cycles because of dependencies and pipeline latencies. In the above example, we complete one loop iteration and store back one array element every 6 clock cycles, but the actual work of operating on the array element takes just 3 (the load, add, and store) of those 6 clock cycles. The remaining 3 clock cycles consist of loop overhead—the DADDUI and BNE—and a stall. To eliminate these 3 clock cycles we need to get more operations within the loop relative to the number of overhead instructions. A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code. Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the data use stall by creating additional independent instructions within the loop body. If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop. Thus, we will want to use different registers for each iteration, increasing the required register count. EXAMPLE
Show our loop unrolled so that there are four copies of the loop body, assuming R1 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.
ANSWER
Here is the result after merging the DADDUI instructions and dropping the unnecessary BNE operations that are duplicated during unrolling. Note that R2 must now be set so that 32(R2) is the starting address of the last four elements.

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)       ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)      ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1)    ;drop DADDUI & BNE
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated to allow the DADDUI instructions on R1 to be merged. This optimization may seem trivial, but it is not; it requires symbolic substitution and simplification. We will see more general forms of these optimizations that eliminate dependent computations in Section 4.4.

Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 28 clock cycles (each L.D has 1 stall, each ADD.D 2, the DADDUI 1, the branch 1, plus 14 instruction issue cycles), or 7 clock cycles for each of the four elements. Although this unrolled version is currently slower than the scheduled version of the original loop, this will change when we schedule the unrolled loop. Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.
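The unrolled loop above relies on the iteration count being a multiple of 4. A minimal Python sketch of one way to lift that restriction follows: a short cleanup loop runs the original body n mod k times, and the unrolled body then runs n/k times. The function name `add_scalar_unrolled` and the fixed unroll factor `K = 4` are illustrative assumptions, and the sketch walks the array upward for readability rather than downward as the MIPS code does.

```python
K = 4  # unroll factor (assumed for illustration)

def add_scalar_unrolled(x, s):
    """Add s to every element of x with a cleanup loop plus a 4-way unrolled loop."""
    n = len(x)
    i = 0
    # First loop: the original body, executed (n mod K) times.
    for _ in range(n % K):
        x[i] = x[i] + s
        i += 1
    # Second loop: the unrolled body, executed (n // K) times.
    for _ in range(n // K):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += K
    return x
```

For large n almost all of the work happens in the second loop, so the cleanup loop costs little.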
In real programs we do not usually know the upper bound on the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body. Instead of a single unrolled loop, we generate a pair of consecutive loops. The first executes (n mod k) times and has a body that is the original loop. The second is the unrolled body surrounded by an outer loop that iterates (n/k) times. For large values of n, most of the execution time will be spent in the unrolled loop body. In the above Example, unrolling improves the performance of this loop by eliminating overhead instructions, although it increases code size substantially. How will the unrolled loop perform when it is scheduled for the pipeline described earlier? EXAMPLE
Show the unrolled loop in the previous example after it has been scheduled for the pipeline with the latencies shown in Figure 4.1 on page 222.

ANSWER

Loop: L.D     F0,0(R1)
      L.D     F6,-8(R1)
      L.D     F10,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F4,F0,F2
      ADD.D   F8,F6,F2
      ADD.D   F12,F10,F2
      ADD.D   F16,F14,F2
      S.D     F4,0(R1)
      S.D     F8,-8(R1)
      DADDUI  R1,R1,#-32
      S.D     F12,16(R1)
      BNE     R1,R2,Loop
      S.D     F16,8(R1)      ;8-32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 7 cycles per element before scheduling and 6 cycles when scheduled but not unrolled.
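The cycle counts quoted so far (10, 6, 28, and 14) can be checked with a small in-order, single-issue timing model. The following Python sketch is only an approximation under stated assumptions: the FP entries in `LATENCY` come from Figure 4.1, the one-stall integer-ALU-to-branch entry is inferred from the stall shown between DADDUI and BNE, an unfilled branch delay slot is modeled as a `NOP`, and every helper name is hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer.
# FP entries follow Figure 4.1; ("intalu", "branch") is an assumption
# taken from the stall between DADDUI and BNE in the unscheduled loop.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,
}

def cycles(seq):
    """Return the issue cycle of the last instruction; in-order, one issue per cycle."""
    producer = {}  # register -> (issue cycle, kind) of its most recent writer
    clock = 0
    for kind, dest, sources in seq:
        earliest = clock + 1
        for reg in sources:
            if reg in producer:
                when, pkind = producer[reg]
                earliest = max(earliest, when + LATENCY.get((pkind, kind), 0) + 1)
        clock = earliest
        if dest is not None:
            producer[dest] = (clock, kind)
    return clock

# Instruction constructors; address offsets do not affect timing and are omitted.
def ld(f):
    return ("load", f, ("R1",))

def add(fd, fs):
    return ("fpalu", fd, (fs, "F2"))

def sd(f):
    return ("store", None, (f, "R1"))

DADDUI = ("intalu", "R1", ("R1",))
BNE = ("branch", None, ("R1", "R2"))
NOP = ("nop", None, ())  # unfilled branch delay slot

REGS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

original = [ld("F0"), add("F4", "F0"), sd("F4"), DADDUI, BNE, NOP]
scheduled = [ld("F0"), DADDUI, add("F4", "F0"), BNE, sd("F4")]
unrolled = [i for l, a in REGS for i in (ld(l), add(a, l), sd(a))] + [DADDUI, BNE, NOP]
unrolled_scheduled = (
    [ld(l) for l, _ in REGS]
    + [add(a, l) for l, a in REGS]
    + [sd("F4"), sd("F8"), DADDUI, sd("F12"), BNE, sd("F16")]
)

for name, seq, elements in [
    ("original", original, 1),
    ("scheduled", scheduled, 1),
    ("unrolled", unrolled, 4),
    ("unrolled and scheduled", unrolled_scheduled, 4),
]:
    print(name, cycles(seq), "cycles,", cycles(seq) / elements, "per element")
```

In the scheduled single-copy loop this model places the one stall in front of the S.D in the delay slot rather than in front of the BNE; the total is the same 6 cycles.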
The gain from scheduling on the unrolled loop is even larger than on the original loop. This increase arises because unrolling the loop exposes more computation that can be scheduled to minimize the stalls; the code above has no stalls. Scheduling the loop in this fashion necessitates realizing that the loads and stores are independent and can be interchanged.

Summary of the Loop Unrolling and Scheduling Example

Throughout this chapter we will look at a variety of hardware and software techniques that allow us to take advantage of instruction-level parallelism to fully utilize the potential of the functional units in a processor. The key to most of these techniques is to know when and how the ordering among instructions may be changed. In our example we made many such changes, which to us, as human beings, were obviously allowable. In practice, this process must be performed in a methodical fashion either by a compiler or by hardware. To obtain the final unrolled code we had to make the following decisions and transformations:

1. Determine that it was legal to move the S.D after the DADDUI and BNE, and find the amount to adjust the S.D offset.
2. Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
3. Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
4. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
6. Schedule the code, preserving any dependences needed to yield the same result as the original code.

The key requirement underlying all of these transformations is an understanding of how an instruction depends on another and how the instructions can be changed or reordered given the dependences. Before examining how these techniques work for higher issue rate pipelines, let us examine how the loop unrolling and scheduling techniques affect data dependences.

EXAMPLE
Show how the process of optimizing the loop overhead by unrolling the loop actually eliminates data dependences. In this example and those used in the remainder of this chapter, we use nondelayed branches for simplicity; it is easy to extend the examples to use delayed branches.

ANSWER
Here is the unrolled but unoptimized code with the extra DADDUI instructions, but without the branches. (Eliminating the branches is another type of transformation, since it involves control rather than data.) The arrows show the data dependences that are within the unrolled body and involve the DADDUI instructions. The underlined registers are the dependent uses.

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8      ;drop BNE
      L.D     F6,0(R1)
      ADD.D   F8,F6,F2
      S.D     F8,0(R1)
      DADDUI  R1,R1,#-8      ;drop BNE
      L.D     F10,0(R1)
      ADD.D   F12,F10,F2
      S.D     F12,0(R1)
      DADDUI  R1,R1,#-8      ;drop BNE
      L.D     F14,0(R1)
      ADD.D   F16,F14,F2
      S.D     F16,0(R1)
      DADDUI  R1,R1,#-8
      BNE     R1,R2,Loop
As the arrows show, the DADDUI instructions form a dependent chain that involves the DADDUI, L.D, and S.D instructions. This chain forces the body to execute in order, as well as making the DADDUI instructions necessary, which increases the instruction count. The compiler removes this dependence by symbolically computing the intermediate values of R1 and folding the computation into the offset of the L.D and S.D instructions and by changing the final DADDUI into a decrement by 32. This transformation makes the first three DADDUI instructions unnecessary, and the compiler can remove them. There are other types of dependences in this code, as the next few examples show.
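The folding step just described can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any real compiler pass: instructions are modeled as `(opcode, operands)` tuples, memory operands as `(register, offset, base)`, and the function name `fold_decrements` is hypothetical.

```python
def fold_decrements(code, reg="R1"):
    """Fold DADDUI reg,reg,#imm into later load/store offsets and merge the updates."""
    pending = 0  # decrement accumulated but not yet applied to reg
    out = []
    for op, args in code:
        if op == "DADDUI" and args[0] == reg and args[1] == reg:
            pending += args[2]  # remember the update instead of emitting it
        elif op in ("L.D", "S.D") and args[2] == reg:
            freg, offset, base = args
            out.append((op, (freg, offset + pending, base)))
        else:
            if reg in args and pending != 0:
                # A non-memory use (here the BNE) needs the real value of reg.
                out.append(("DADDUI", (reg, reg, pending)))
                pending = 0
            out.append((op, args))
    if pending != 0:
        out.append(("DADDUI", (reg, reg, pending)))
    return out

# The unrolled but unoptimized body: one DADDUI after each copy.
unoptimized = []
for fl, fa in [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]:
    unoptimized += [
        ("L.D", (fl, 0, "R1")),
        ("ADD.D", (fa, fl, "F2")),
        ("S.D", (fa, 0, "R1")),
        ("DADDUI", ("R1", "R1", -8)),
    ]
unoptimized.append(("BNE", ("R1", "R2", "Loop")))

folded = fold_decrements(unoptimized)
for op, args in folded:
    print(op, args)
```

The four decrements collapse into a single DADDUI by 32 placed just before the BNE, and the loads and stores pick up offsets 0, -8, -16, and -24.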
EXAMPLE
Unroll our example loop, eliminating the excess loop overhead, but using the same registers in each loop copy. Indicate both the data and name dependences within the body. Show how renaming eliminates name dependences that reduce parallelism.
ANSWER
Here’s the loop unrolled but with the same registers in use for each copy. The data dependences are shown with gray arrows and the name dependences with black arrows. As in earlier examples, the direction of the arrow indicates the ordering that must be preserved for correct execution of the code:

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)       ;drop DADDUI & BNE
      L.D     F0,-8(R1)
      ADD.D   F4,F0,F2
      S.D     F4,-8(R1)      ;drop DADDUI & BNE
      L.D     F0,-16(R1)
      ADD.D   F4,F0,F2
      S.D     F4,-16(R1)
      L.D     F0,-24(R1)
      ADD.D   F4,F0,F2
      S.D     F4,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop
The name dependences force the instructions in the loop to be almost completely ordered, allowing only the order of the L.D following each S.D to be interchanged. When the registers used for each copy of the loop body are renamed, only the true dependences within each body remain:

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)       ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)      ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel. This renaming process can be performed either by the compiler or in hardware, as we saw in the last chapter.
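To make the effect of renaming concrete, here is a minimal Python sketch that counts name dependences (antidependences and output dependences) on FP registers before and after renaming. It rests on simplifying assumptions: each instruction is reduced to `(destination, sources)`, the address register R1 is ignored, fresh names are drawn from the even-numbered FP registers with the live-in scalar F2 excluded, and the names `name_dependences` and `rename` are hypothetical.

```python
def name_dependences(seq):
    """Count WAR and WAW dependences; each instruction is (dest, sources)."""
    last_writer = {}  # register -> index of its most recent writer
    readers = {}      # register -> indices that read it since that write
    count = 0
    for idx, (dest, sources) in enumerate(seq):
        for reg in sources:
            readers.setdefault(reg, []).append(idx)
        if dest is not None:
            # Antidependences: earlier readers of the value being overwritten.
            count += len([r for r in readers.get(dest, []) if r != idx])
            # Output dependence: an earlier writer of the same register.
            if dest in last_writer:
                count += 1
            last_writer[dest] = idx
            readers[dest] = []
    return count

def rename(seq, live_in=("F2",)):
    """Give every written value a fresh register and patch later reads."""
    pool = (f"F{n}" for n in range(0, 32, 2) if f"F{n}" not in live_in)
    current = {}  # original name -> name holding its latest value
    out = []
    for dest, sources in seq:
        new_sources = tuple(current.get(r, r) for r in sources)
        new_dest = None
        if dest is not None:
            new_dest = next(pool)
            current[dest] = new_dest
        out.append((new_dest, new_sources))
    return out

# One copy of the body: L.D F0 ; ADD.D F4,F0,F2 ; S.D F4 (store writes no register).
body = [("F0", ()), ("F4", ("F0", "F2")), (None, ("F4",))]
reused = body * 4
renamed = rename(reused)

print(name_dependences(reused), name_dependences(renamed))
print([dest for dest, _ in renamed if dest is not None])
```

Under these assumptions the renamer happens to pick the same registers as the listing above (F0, F4, F6, F8, and so on), and only the load-to-add and add-to-store true dependences inside each copy remain.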
There are three different types of limits to the gains that can be achieved by loop unrolling: a decrease in the amount of overhead amortized with each unroll, code size limitations, and compiler limitations. Let’s consider the question of loop overhead first. When we unrolled the loop four times, it generated sufficient parallelism among the instructions that the loop could be scheduled with no stall cycles. In fact, in fourteen clock cycles, only two cycles were loop overhead: the DADDUI, which maintains the index value, and the BNE, which terminates the loop. If the loop is unrolled eight times, the overhead is reduced from 1/2 cycle per original iteration to 1/4. One of the exercises asks you to compute the theoretically optimal number of times to unroll this loop for a random number of iterations.

A second limit to unrolling is the growth in code size that results. For larger loops, the code size growth may be a concern either in the embedded space where memory may be at a premium or if the larger code size causes an increase in the instruction cache miss rate. We return to the issue of code size when we consider more aggressive techniques for uncovering instruction level parallelism in Section 4.4.

Another factor often more important than code size is the potential shortfall in registers that is created by aggressive unrolling and scheduling. This secondary effect that results from instruction scheduling in large code segments is called register pressure. It arises because scheduling code to increase ILP causes the number of live values to increase. After aggressive instruction scheduling, it may not be possible to allocate all the live values to registers. The transformed code, while theoretically faster, may lose some or all of its advantage, because it generates a shortage of registers. Without unrolling, aggressive scheduling is sufficiently limited by branches so that register pressure is rarely a problem. The combination of unrolling and aggressive scheduling can, however, cause this problem. The problem becomes especially challenging in multiple issue machines that require the exposure of more independent instruction sequences whose execution can be overlapped. In general, the use of sophisticated high-level transformations, whose potential improvements are hard to measure before detailed code generation, has led to significant increases in the complexity of modern compilers.

Loop unrolling is a simple but useful method for increasing the size of straightline code fragments that can be scheduled effectively. This transformation is useful in a variety of processors, from simple pipelines like those in MIPS to the statically scheduled superscalars we described in the last chapter, as we will see now.

Using Loop Unrolling and Pipeline Scheduling with Static Multiple Issue

We begin by looking at a simple two-issue, statically-scheduled superscalar MIPS pipeline from the last chapter, using the pipeline latencies from Figure 4.1 on page 222 and the same example code segment we used for the single issue examples above. This processor can issue two instructions per clock cycle, where one of the instructions can be a load, store, branch, or integer ALU operation, and the other can be any floating-point operation. Recall that this pipeline did not generate a significant performance enhancement for the example above, because of the limited ILP in a given loop iteration. Let’s see how loop unrolling and pipeline scheduling can help.

EXAMPLE
Unroll and schedule the loop used in the earlier examples and shown on page 223.
ANSWER
To schedule this loop without any delays, we will need to unroll the loop to make five copies of the body. After unrolling, the loop will contain five each of L.D, ADD.D, and S.D; one DADDUI; and one BNE. The unrolled and scheduled code is shown in Figure 4.2. This unrolled superscalar loop now runs in 12 clock cycles per iteration, or 2.4 clock cycles per element, versus 3.5 for the scheduled and unrolled loop on the ordinary MIPS pipeline.

      Integer instruction      FP instruction         Clock cycle
Loop: L.D     F0,0(R1)                                1
      L.D     F6,-8(R1)                               2
      L.D     F10,-16(R1)      ADD.D F4,F0,F2         3
      L.D     F14,-24(R1)      ADD.D F8,F6,F2         4
      L.D     F18,-32(R1)      ADD.D F12,F10,F2       5
      S.D     F4,0(R1)         ADD.D F16,F14,F2       6
      S.D     F8,-8(R1)        ADD.D F20,F18,F2       7
      S.D     F12,-16(R1)                             8
      DADDUI  R1,R1,#-40                              9
      S.D     F16,16(R1)                              10
      BNE     R1,R2,Loop                              11
      S.D     F20,8(R1)                               12

FIGURE 4.2 The unrolled and scheduled code as it would look on a superscalar MIPS.

In this Example, the performance of the superscalar MIPS is limited by the balance between integer and floating-point computation. Every floating-point instruction is issued together with an integer instruction, but there are not enough floating-point instructions to keep the floating-point pipeline full. When scheduled, the original loop ran in 6 clock cycles per iteration. We have improved on that by a factor of 2.5, more than half of which came from loop unrolling. Loop unrolling took us from 6 to 3.5 (a factor of 1.7), while superscalar execution gave us a factor of 1.5 improvement.
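The speedup factors quoted in this example follow directly from the cycles-per-element figures. The short Python check below only restates that arithmetic; the variable names are illustrative.

```python
# Cycles per array element for the three versions discussed in the text.
scheduled_original = 6 / 1      # scheduled, not unrolled, single issue
unrolled_single_issue = 14 / 4  # unrolled four times and scheduled
unrolled_superscalar = 12 / 5   # unrolled five times on the two-issue pipeline

total = scheduled_original / unrolled_superscalar
from_unrolling = scheduled_original / unrolled_single_issue
from_dual_issue = unrolled_single_issue / unrolled_superscalar

print(round(total, 2), round(from_unrolling, 2), round(from_dual_issue, 2))
```

The two partial factors multiply to the overall factor, which is why the text can attribute the larger share of the gain to unrolling.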
4.2 Static Branch Prediction

In Chapter 3, we examined the use of dynamic branch predictors. Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile-time; static prediction can also be used to assist dynamic predictors. In Chapter 1, we discussed an architectural feature that supports static branch prediction, namely delayed branches. Delayed branches expose a pipeline hazard so that the compiler can reduce the penalty associated with the hazard. As we saw, the effectiveness of this technique partly depends on whether we correctly guess which way a branch will go. Being able to accurately predict a branch at compile time is also helpful for scheduling data hazards. Loop unrolling is one simple example of this; another example arises from conditional selection branches. Consider the following code segment:

      LD      R1,0(R2)
      DSUBU   R1,R1,R3
      BEQZ    R1,L
      OR      R4,R5,R6
      DADDU   R10,R4,R3
L:    DADDU   R7,R8,R9
The dependence of the DSUBU and BEQZ on the LD instruction means that a stall will be needed after the LD. Suppose we knew that this branch was almost always taken and that the value of R7 was not needed on the fall-through path. Then we could increase the speed of the program by moving the instruction DADDU R7,R8,R9 to the position after the LD. Correspondingly, if we knew the branch was rarely taken and that the value of R4 was not needed on the taken path, then we could contemplate moving the OR instruction after the LD. In addition, we can also use the information to better schedule any branch delay, since choosing how to schedule the delay depends on knowing the branch behavior. We will return to this topic in section 4.4, when we discuss global code scheduling.

To perform these optimizations, we need to predict the branch statically when we compile the program. There are several different methods to statically predict branch behavior. The simplest scheme is to predict a branch as taken. This scheme has an average misprediction rate that is equal to the untaken branch frequency, which for the SPEC programs is 34%. Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%).

An alternative is to predict on the basis of branch direction, choosing backward-going branches to be taken and forward-going branches to be not taken. For some programs and compilation systems, the frequency of forward taken branches may be significantly less than 50%, and this scheme will do better than just predicting all branches as taken. In the SPEC programs, however, more than half of the forward-going branches are taken. Hence, predicting all branches as taken is the better approach. Even for other benchmarks or compilers, direction-based prediction is unlikely to generate an overall misprediction rate of less than 30% to 40%.
An enhancement of this technique was explored by Ball and Larus; their approach uses program context information and generates more accurate predictions than a simple scheme based solely on branch direction. A still more accurate technique is to predict branches on the basis of profile information collected from earlier runs. The key observation that makes this worthwhile is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken. Figure 4.3 shows the success of branch prediction using this strategy. The same input data were used for runs and for collecting the profile; other studies have shown that
changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.
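Profile-based prediction can be sketched as follows, assuming the profiling run leaves a pair of counters for each static branch. The BranchProfile structure and the function names are hypothetical, and ties are resolved toward taken only for definiteness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts gathered for one static branch during a profiling run.
   The layout is an assumption made for this sketch. */
typedef struct {
    unsigned long taken;      /* times the branch was taken */
    unsigned long not_taken;  /* times the branch fell through */
} BranchProfile;

/* Fix the prediction to the more frequent direction seen in the profile. */
bool profile_predict_taken(const BranchProfile *p) {
    assert(p != NULL);
    return p->taken >= p->not_taken;
}

/* Mispredictions this branch incurs if a later run behaves like the profile. */
unsigned long profile_mispredictions(const BranchProfile *p) {
    assert(p != NULL);
    return profile_predict_taken(p) ? p->not_taken : p->taken;
}

/* Overall misprediction rate over all profiled branches. */
double profile_misprediction_rate(const BranchProfile *branches, size_t n) {
    unsigned long wrong = 0;
    unsigned long total = 0;
    assert(branches != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        wrong += profile_mispredictions(&branches[i]);
        total += branches[i].taken + branches[i].not_taken;
    }
    return total == 0 ? 0.0 : (double)wrong / (double)total;
}
```

The more strongly each branch is biased toward one direction, the smaller the minority count that this scheme charges as mispredictions, which is why the bimodal behavior noted above matters.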
[Figure 4.3: bar chart. Vertical axis: misprediction rate, 0% to 25%. Horizontal axis: benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor).]
FIGURE 4.3 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is generally better for the FP programs, which have an average misprediction rate of 9% with a standard deviation of 4%, than for the integer programs, which have an average misprediction rate of 15% with a standard deviation of 5%. The actual performance depends on both the prediction accuracy and the branch frequency, which varies from 3% to 24%; we will examine the combined effect in Figure 4.4.
Although we can derive the prediction accuracy of a predict-taken strategy and measure the accuracy of the profile scheme, as in Figure 4.3, the wide range of frequency of conditional branches in these programs, from 3% to 24%, means that the overall frequency of a mispredicted branch varies widely. Figure 4.4 shows the number of instructions executed between mispredicted branches for both a profile-based and a predict-taken strategy. The number varies widely, both because of the variation in accuracy and the variation in branch frequency. On average, the predict-taken strategy has 20 instructions per mispredicted branch and the profile-based strategy has 110. These averages, however, are very different for integer and FP programs, as the data in Figure 4.4 show. Static branch behavior is useful for scheduling instructions when the branch delays are exposed by the architecture (either delayed or canceling branches), for assisting dynamic predictors (as we will see in the IA-64 architecture in section 4.7), and for determining which code paths are more frequent, which is a key step in code scheduling (see section 4.4, page 251).
[Figure 4.4: bar chart on a log scale. Vertical axis: instructions between mispredictions, 1 to 1000. Horizontal axis: benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor). Two bars per benchmark: predict taken and profile based.]
FIGURE 4.4 Accuracy of a predict-taken strategy and a profile-based predictor for SPEC92 benchmarks as measured by the number of instructions executed between mispredicted branches and shown on a log scale. The average number of instructions between mispredictions is 20 for the predict-taken strategy and 110 for the profile-based prediction; however, the standard deviations are large: 27 instructions for the predict-taken strategy and 85 instructions for the profile-based scheme. This wide variation arises because programs such as su2cor have both low conditional branch frequency (3%) and predictable branches (85% accuracy for profiling), although eqntott has eight times the branch frequency with branches that are nearly 1.5 times less predictable. The difference between the FP and integer benchmarks as groups is large. For the predict-taken strategy, the average distance between mispredictions for the integer benchmarks is 10 instructions, and it is 30 instructions for the FP programs. With the profile scheme, the distance between mispredictions for the integer benchmarks is 46 instructions, and it is 173 instructions for the FP benchmarks.
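The distance between mispredictions follows from the two quantities discussed here. As a rough sketch, and assuming that mispredictions are spread evenly through the instruction stream (an assumption of this example, not a claim made by the measurements), the average distance is the reciprocal of branch frequency times misprediction rate. The function name is hypothetical.

```c
#include <assert.h>

/* Average number of instructions executed per mispredicted branch.
   branch_frequency:   fraction of instructions that are conditional branches
   misprediction_rate: fraction of those branches that are mispredicted */
double instructions_between_mispredictions(double branch_frequency,
                                           double misprediction_rate) {
    assert(branch_frequency > 0.0 && branch_frequency <= 1.0);
    assert(misprediction_rate > 0.0 && misprediction_rate <= 1.0);
    return 1.0 / (branch_frequency * misprediction_rate);
}
```

For su2cor, a 3% branch frequency and a 15% misprediction rate (85% accuracy) give a little over 220 instructions between mispredictions. That is the same order as the value plotted for the profile-based scheme; exact agreement should not be expected from inputs rounded this coarsely.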
4.3 Static Multiple Issue: the VLIW Approach

Superscalar processors decide on the fly how many instructions to issue. A statically scheduled superscalar must check for any dependences between instructions in the issue packet as well as between any issue candidate and any instruction already in the pipeline. As we have seen in Section 4.1, a statically scheduled superscalar requires significant compiler assistance to achieve good performance. In contrast, a dynamically scheduled superscalar requires less compiler assistance, but has significant hardware costs.

An alternative to the superscalar approach is to rely on compiler technology not only to minimize the potential data hazard stalls, but to actually format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences. The compiler may be required to ensure that dependences within the issue packet cannot be present or, at a minimum, indicate when a dependence may be present. Such an approach offers the potential advantage of simpler hardware while still exhibiting good performance through extensive compiler optimization.
The first multiple-issue processors that required the instruction stream to be explicitly organized to avoid dependences used wide instructions with multiple operations per instruction. For this reason, this architectural approach was named VLIW, standing for Very Long Instruction Word, and denoting that the instructions, since they contained several operations, were very wide (64 to 128 bits, or more). The basic architectural concepts and compiler technology are the same whether multiple operations are organized into a single instruction, or whether a set of instructions in an issue packet is preconfigured by a compiler to exclude dependent operations (since the issue packet can be thought of as a very large instruction).

Early VLIWs were quite rigid in their instruction formats and effectively required recompilation of programs for different versions of the hardware. To reduce this inflexibility and enhance performance of the approach, several innovations have been incorporated into more recent architectures of this type, while still requiring the compiler to do most of the work of finding and scheduling instructions for parallel execution. This second generation of VLIW architectures is the approach being pursued for desktop and server markets.

In the remainder of this section, we look at the basic concepts in a VLIW architecture. Section 4.4 introduces additional compiler techniques that are required to achieve good performance for compiler-intensive approaches, and Section 4.5 describes hardware innovations that improve flexibility and performance of explicitly parallel approaches. Finally, Section 4.7 describes how the Intel IA-64 supports explicit parallelism.

The Basic VLIW Approach

VLIWs use multiple, independent functional units.
Rather than attempting to issue multiple, independent instructions to the units, a VLIW packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy the same constraints. Since there is no fundamental difference between the two approaches, we will just assume that multiple operations are placed in one instruction, as in the original VLIW approach.

Since the burden of choosing the instructions to be issued simultaneously falls on the compiler, the hardware that a superscalar needs to make these issue decisions is unneeded. Since this advantage of a VLIW increases as the maximum issue rate grows, we focus on a wider-issue processor. Indeed, for simple two-issue processors, the overhead of a superscalar is probably minimal. Many designers would probably argue that a four-issue processor has manageable overhead, but as we saw in the last chapter, this overhead grows with issue width.

Because VLIW approaches make sense for wider processors, we choose to focus our example on such an architecture. For example, a VLIW processor might have instructions that contain five operations, including one integer operation (which could also be a branch), two floating-point operations, and two memory references. The instruction would have a set of fields for each functional unit, perhaps 16 to 24 bits per unit, yielding an instruction length of between 80 and 120 bits for the five units.
To keep the functional units busy, there must be enough parallelism in a code sequence to fill the available operation slots. This parallelism is uncovered by unrolling loops and scheduling the code within the single larger loop body. If the unrolling generates straight-line code, then local scheduling techniques, which operate on a single basic block, can be used. If finding and exploiting the parallelism requires scheduling code across branches, a substantially more complex global scheduling algorithm must be used. Global scheduling algorithms are not only more complex in structure, but they must deal with significantly more complicated tradeoffs in optimization, since moving code across branches is expensive.

In the next section, we will discuss trace scheduling, one of these global scheduling techniques developed specifically for VLIWs. In Section 4.5, we will examine hardware support that allows some conditional branches to be eliminated, extending the usefulness of local scheduling and enhancing the performance of global scheduling. For now, let’s assume we have a technique to generate long, straight-line code sequences, so that we can use local scheduling to build up VLIW instructions and instead focus on how well these processors operate.

EXAMPLE
Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s (see page 223 for the MIPS code) for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch-delay slot.

ANSWER
The code is shown in Figure 4.5. The loop has been unrolled to make seven copies of the body, which eliminates all stalls (i.e., completely empty issue cycles), and runs in 9 cycles. This code yields a running rate of seven results in 9 cycles, or 1.29 cycles per result, nearly twice as fast as the two-issue superscalar of Section 4.1 that used unrolled and scheduled code.
For the original VLIW model, there are both technical and logistical problems. The technical problems are the increase in code size and the limitations of lock-step operation. Two different elements combine to increase code size substantially for a VLIW. First, generating enough operations in a straight-line code fragment requires ambitiously unrolling loops (as in earlier examples), thereby increasing code size. Second, whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding. In Figure 4.5, we saw that only about 60% of the functional units were used, so almost half of each instruction was empty. In most VLIWs, an instruction may need to be left completely empty if no operations can be scheduled.

To combat this code size increase, clever encodings are sometimes used. For example, there may be only one large immediate field for use by any functional unit. Another technique is to compress the instructions in main memory and expand them when they are read into the cache or are decoded. We will see techniques to reduce code size increases in both Sections 4.7 and 4.8.

Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Integer operation/branch
L.D F0,0(R1)       | L.D F6,-8(R1)      |                   |                   |
L.D F10,-16(R1)    | L.D F14,-24(R1)    |                   |                   |
L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2    | ADD.D F8,F6,F2    |
L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2  | ADD.D F16,F14,F2  |
                   |                    | ADD.D F20,F18,F2  | ADD.D F24,F22,F2  |
S.D F4,0(R1)       | S.D F8,-8(R1)      | ADD.D F28,F26,F2  |                   |
S.D F12,-16(R1)    | S.D F16,-24(R1)    |                   |                   |
S.D F20,-32(R1)    | S.D F24,-40(R1)    |                   |                   | DADDUI R1,R1,#-56
S.D F28,8(R1)      |                    |                   |                   | BNE R1,R2,Loop

FIGURE 4.5 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes nine cycles assuming no branch delay; normally the branch delay would also need to be scheduled. The issue rate is 23 operations in nine clock cycles, or 2.5 operations per cycle. The efficiency, the percentage of available slots that contained an operation, is about 60%. To achieve this issue rate requires a larger number of registers than MIPS would normally use in this loop. The VLIW code sequence above requires at least eight FP registers, while the same code sequence for the base MIPS processor can use as few as two FP registers or as many as five when unrolled and scheduled. In the superscalar example in Figure 4.2, six registers were needed.

Early VLIWs operated in lock-step; there was no hazard detection hardware at all. This structure dictated that a stall in any functional unit pipeline must cause the entire processor to stall, since all the functional units must be kept synchronized. Although a compiler may be able to schedule the deterministic functional units to prevent stalls, predicting which data accesses will encounter a cache stall and scheduling them is very difficult. Hence, caches needed to be blocking and to cause all the functional units to stall. As the issue rate and number of memory references become large, this synchronization restriction becomes unacceptable. In more recent processors, the functional units operate more independently, and the compiler is used to avoid hazards at issue time, while hardware checks allow for unsynchronized execution once instructions are issued.

Binary code compatibility has also been a major logistical problem for VLIWs. In a strict VLIW approach, the code sequence makes use of both the instruction set definition and the detailed pipeline structure, including both functional units and their latencies. Thus, different numbers of functional units and unit latencies require different versions of the code. This requirement makes migrating between successive implementations, or between implementations with different issue widths, more difficult than it is for a superscalar design. Of course, obtaining improved performance from a new superscalar design may require recompilation. Nonetheless, the ability to run old binary files is a practical advantage for the superscalar approach.

One possible solution to this migration problem, and the problem of binary code compatibility in general, is object-code translation or emulation. This technology is developing quickly and could play a significant role in future migration
schemes. Another approach is to temper the strictness of the approach so that binary compatibility is still feasible. This latter approach is used in the IA-64 architecture, as we will see in Section 4.7.

The major challenge for all multiple-issue processors is to try to exploit large amounts of ILP. When the parallelism comes from unrolling simple loops in FP programs, the original loop probably could have been run efficiently on a vector processor (described in Appendix B). It is not clear that a multiple-issue processor is preferred over a vector processor for such applications; the costs are similar, and the vector processor is typically the same speed or faster. The potential advantages of a multiple-issue processor versus a vector processor are twofold. First, a multiple-issue processor has the potential to extract some amount of parallelism from less regularly structured code, and, second, it has the ability to use a more conventional, and typically less expensive, cache-based memory system. For these reasons multiple-issue approaches have become the primary method for taking advantage of instruction-level parallelism, and vectors have become primarily an extension to these processors.
4.4 Advanced Compiler Support for Exposing and Exploiting ILP

In this section we discuss compiler technology for increasing the amount of parallelism that we can exploit in a program. We begin by defining when a loop is parallel and how a dependence can prevent a loop from being parallel. We also discuss techniques for eliminating some types of dependences. As we will see in later sections, hardware support for these compiler techniques can greatly increase their effectiveness. This section serves as an introduction to these techniques. We do not attempt to explain the details of ILP-oriented compiler techniques, since this would take hundreds of pages, rather than the 20 we have allotted. Instead, we view this material as providing general background that will enable the reader to have a basic understanding of the compiler techniques used to exploit ILP in modern computers.

Detecting and Enhancing Loop-Level Parallelism

Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop. For now, we will consider only data dependences, which arise when an operand is written at some point and read at a later point. Name dependences also exist and may be removed by renaming techniques like those we used earlier.

The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence. Most of the examples we considered in Section 4.1 have no loop-carried dependences and, thus, are loop-level parallel. To see that a loop is parallel, let us first look at the source representation:

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
In this loop, there is a dependence between the two uses of x[i], but this dependence is within a single iteration and is not loop-carried. There is a dependence between successive uses of i in different iterations, which is loop-carried, but this dependence involves an induction variable and can be easily recognized and eliminated. We saw examples of how to eliminate dependences involving induction variables during loop unrolling in Section 4.1, and we will look at additional examples later in this section.

Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, the compiler can do this analysis more easily at or near the source level, as opposed to the machine-code level. Let’s look at a more complex example.

EXAMPLE
Consider a loop like this one: for (i=1; i
#include <sys/times.h>
#include <sys/types.h>
#include
#define CACHE_MIN (1024) /* smallest cache */
#define CACHE_MAX (1024*1024) /* largest cache */
#define SAMPLE 10 /* to get a larger time sample */
#ifndef CLK_TCK
#define CLK_TCK 60 /* number clock ticks per second */
#endif
int x[CACHE_MAX]; /* array going to stride through */
double get_seconds() { /* routine to read time */
    struct tms rusage;
    times(&rusage); /* UNIX utility: time in clock ticks */
    return (double) (rusage.tms_utime)/CLK_TCK;
}
void main() {
    int register i, index, stride, limit, temp;
    int steps, tsteps, csize;
    double sec0, sec; /* timing variables */
    for (csize=CACHE_MIN; csize
1,000,000
>1,000,000
(see caption)
$41
$595
$165
2B, 4H->4B
2*2W->4H, 2*4H->8B
Unpack/merge
2B->2W, 4B->4H
2*4B->8B, 2*2H->4H
Permute/shuffle
8B, 4H Integer
8B, 4H, 2W
8B, 4H, 2W 8B
Max/min
Register sets
PowerPC
■
2*4H->8B
2W->2H, 2W->2B, 4H->4B 4B->4H, 2*4B->8B
4H
Fl. Pt. + 192b Acc. Integer
Fl. Pt.
Figure C.20 Summary of multimedia support for desktop RISCs. B stands for byte (8 bits), H for half word (16 bits), and W for word (32 bits). Thus 8B means an operation on 8 bytes in a single instruction. Pack and unpack use the notation 2*2W to mean 2 operands each with 2 words. Note that MDMX has vector/scalar operations, where the scalar is specified as an element of one of the vector registers. This table is a simplification of the full multimedia architectures, leaving out many details. For example, MIPS MDMX includes instructions to multiplex between two operands, HP MAX2 includes an instruction to calculate averages, and SPARC VIS includes instructions to set registers to constants. Also, this table does not include the memory alignment operation of MDMX, MAX, and VIS.
These machines largely used existing register sets to hold operands: integer registers for Alpha and HP PA-RISC and floating-point registers for MIPS and Sun. Hence data transfers are accomplished with standard load and store instructions. MIPS also added a 192-bit (3*64) wide register to act as an accumulator for some operations. By having 3 times the native data width, it can be partitioned to accumulate either 8 bytes with 24 bits per field or 4 half words with 48 bits per field. This wide accumulator can be used for add, subtract, and multiply/add instructions. MIPS claims performance advantages of 2 to 4 times for the accumulator. Perhaps the surprising conclusion of this table is the lack of consistency. The only operations found on all four are the logical operations (AND, OR, XOR), which do not need a partitioned ALU. If we leave out the frugal Alpha, then the only other common operations are parallel adds and subtracts on 4 half words.
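A partitioned ALU performs several narrow additions in one wide register by stopping carries at the field boundaries. The effect can be imitated in software, as in the sketch below, which adds four 16-bit fields packed in a 64-bit word. This is an illustration of the idea only: the function names are hypothetical, the arithmetic wraps around within each field rather than saturating, and none of the surveyed instruction sets is being modeled exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Add four independent 16-bit fields packed into 64-bit words.
   Each field wraps around on overflow; no carry crosses a field boundary. */
uint64_t add_4x16(uint64_t a, uint64_t b) {
    const uint64_t HIGH = 0x8000800080008000ULL;   /* top bit of each field */
    /* Add the low 15 bits of every field; any carry lands in the field's
       own top bit and goes no further. */
    uint64_t low_sum = (a & ~HIGH) + (b & ~HIGH);
    /* Combine with the original top bits using XOR, which adds them
       without producing a carry out of the field. */
    return low_sum ^ ((a ^ b) & HIGH);
}

/* Extract field i (0 is the least significant) from a packed word. */
uint16_t field_4x16(uint64_t v, int i) {
    assert(i >= 0 && i < 4);
    return (uint16_t)(v >> (16 * i));
}
```

A hardware implementation gets the same result by breaking the carry chain of the adder at each field boundary, which is why such operations need a partitioned ALU while the logical operations do not.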
Appendix C A Survey of RISC Architectures for Desktop, Server, and Embedded Computers
Each manufacturer states that these are instructions intended to be used in hand-optimized subroutine libraries, an intention likely to be followed, as a compiler that works well with all desktop RISCs’ multimedia extensions would be challenging.
C.5 Instructions: Digital Signal-Processing Extensions of the Embedded RISCs

One feature found in every digital signal processor (DSP) architecture is support for integer multiply-accumulate. The multiplies tend to be on shorter words than regular integers, such as 16 bits, and the accumulator tends to be on longer words, such as 64 bits. The reason for multiply-accumulate is to efficiently implement digital filters, common in DSP applications. Since Thumb and MIPS16 are subset architectures, they do not provide such support. Instead, programmers should use the DSP or multimedia extensions found in the 32-bit mode instructions of ARM and MIPS64.

Figure C.21 shows the size of the multiply, the size of the accumulator, and the operations and instruction names for the embedded RISCs. Machines with accumulator sizes greater than 32 and less than 64 bits will force the upper bits to remain as the sign bits, thereby “saturating” the add to set to maximum and minimum fixed-point values if the operations overflow.
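The role of multiply-accumulate in a digital filter can be seen in the inner loop of a FIR filter, which is a dot product of samples and coefficients. The sketch below is written in portable C rather than with any of the instructions in Figure C.21; the function names are hypothetical, and the saturation helper only illustrates clamping to a signed field of a chosen width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of 16-bit samples and coefficients with a 64-bit accumulator:
   one multiply-accumulate per tap of a FIR filter. */
int64_t mac16(const int16_t *x, const int16_t *h, size_t n) {
    int64_t acc = 0;
    assert((x != NULL && h != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* A 16 x 16 multiply always fits in 32 bits. */
        acc += (int32_t)x[i] * (int32_t)h[i];
    }
    return acc;
}

/* Clamp a wide value to the range of a signed field of the given width,
   setting it to the maximum or minimum value on overflow. */
int64_t saturate(int64_t v, int bits) {
    assert(bits >= 2 && bits <= 63);
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    if (v > max) {
        return max;
    }
    if (v < min) {
        return min;
    }
    return v;
}
```

A wide accumulator lets many products be summed before overflow is possible, which is why the accumulators in the table are wider than the multiplier operands.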
                                | ARM v.4 | Thumb | SuperH | M32R | MIPS16
Size of multiply                | 32B × 32B | - | 32B × 32B, 16B × 16B | 32B × 16B, 16B × 16B | -
Size of accumulator             | 32B/64B | - | 32B/42B, 48B/64B | 56B | -
Accumulator name                | Any GPR or pairs of GPRs | - | MACH, MACL | ACC | -
Operations                      | 32B/64B product + 64B accumulate signed/unsigned | - | 32B product + 42B/32B accumulate (operands in memory); 64B product + 64B/48B accumulate (operands in memory); clear MAC | 32B/48B product + 64B accumulate, round, move | -
Corresponding instruction names | MLA, SMLAL, UMLAL | - | MAC, MACS, MAC.L, MAC.LS, CLRMAC | MACHI/MACLO, MACWHI/MACWLO, RAC, RACH, MVFACHI/MVFACLO, MVTACHI/MVTACLO | -
Figure C.21 Summary of five embedded RISC approaches to multiply-accumulate.
C.6 Instructions: Common Extensions to MIPS Core

Figures C.22 through C.28 list instructions not found in Figures C.9 through C.17 in the same four categories. Instructions are put in these lists if they appear in more than one of the standard architectures. The instructions are defined using the hardware description language defined in Figure C.29. Although most of the categories are self-explanatory, a few bear comment:

■ The “atomic swap” row means a primitive that can exchange a register with memory without interruption. This is useful for operating system semaphores in a uniprocessor as well as for multiprocessor synchronization (see Section 6.7).

■ The 64-bit data transfer and operation rows show how MIPS, PowerPC, and SPARC define 64-bit addressing and integer operations. SPARC simply defines all register and addressing operations to be 64 bits, adding only special instructions for 64-bit shifts, data transfers, and branches. MIPS includes
Name
Definition
Alpha
MIPS64
PA-RISC 2.0
PowerPC
SPARC v.9
Atomic swap R/M (for locks and semaphores)
Temp 1)
17.3%
63.4%
14.2%
7.1%
30.7%
26.5%
Memory
81.6%
36.6%
85.8%
92.7%
68.7%
73.1%
Figure D.14 The percentage of instructions for the floating-point operations (add, sub, mul, div) that use each of the three options for specifying a floating-point operand on the 80x86. The three options are 1) the strict stack model of implicit operands on the stack, 2) register version naming an explicit operand that is not one of the top two elements of the stack, and 3) memory operand.
in this measurement, the stack model of execution is rarely followed. (See Section D.8 for a historical explanation of this observation.) Finally, Figures D.15 and D.16 show the instruction mixes for 10 SPEC92 programs.
Appendix D An Alternative to RISC: The Intel 80x86
Instruction load
compress
eqntott
espresso
gcc (cc1)
li
Int. average
20.8%
18.5%
21.9%
24.9%
23.3%
22%
store
13.8%
3.2%
8.3%
16.6%
18.7%
12%
add
10.3%
8.8%
8.15%
7.6%
6.1%
8%
sub
7.0%
10.6%
3.5%
2.9%
3.6%
5%
mul
0.1%
0%
div compare
0% 8.2%
27.7%
15.3%
13.5%
7.7%
16%
7.8%
4%
mov reg-reg
7.9%
0.6%
5.0%
4.2%
load imm
0.5%
0.2%
0.6%
0.4%
cond. branch
0%
15.5%
28.6%
18.9%
17.4%
15.4%
20%
uncond. branch
1.2%
0.2%
0.9%
2.2%
2.2%
1%
call
0.5%
0.4%
0.7%
1.5%
3.2%
1%
return, jmp indirect
0.5%
0.4%
0.7%
1.5%
3.2%
1%
2.5%
1.7%
1.0%
8.7%
4.5%
8.4%
6%
0.4%
1%
shift
3.8%
and
8.4%
or
0.6%
2.7%
0.4%
other (xor, not, . . .)
0.9%
2.2%
0.1%
1%
1%
load FP
0%
store FP
0%
add FP
0%
sub FP
0%
mul FP
0%
div FP
0%
compare FP
0%
mov reg-reg FP
0%
other (abs, sqrt, . . .)
0%
Figure D.15 80x86 instruction mix for five SPECint92 programs.
Comparative Operation Measurements

Figures D.17 and D.18 show the number of instructions executed for each of the 10 programs on the 80x86 and the ratio of instruction execution compared with that for DLX; numbers less than 1.0 mean the 80x86 executes fewer instructions than DLX. The instruction count is surprisingly close to DLX for many integer programs, since you would expect a load-store instruction set architecture like DLX to execute more instructions than a register-memory architecture like the 80x86. The floating-point programs always have higher counts for the 80x86, presumably due to the lack of floating-point registers and the use of a stack architecture.
Instruction
doduc
ear
hydro2d
mdljdp2
su2cor
FP average
load
8.9%
6.5%
18.0%
27.6%
27.6%
20%
store
12.4%
3.1%
11.5%
7.8%
7.8%
8%
add
5.4%
6.6%
14.6%
8.8%
8.8%
10%
sub
1.0%
2.4%
3.3%
2.4%
2.4%
mul div compare
3% 0% 0%
1.8%
5.1%
0.8%
1.0%
1.0%
1.8%
2.3%
2.3%
2%
mov reg-reg
3.2%
0.1%
load imm
0.4%
1.5%
cond. branch
5.4%
8.2%
5.1%
2.7%
2.7%
5%
uncond branch
0.8%
0.4%
1.3%
0.3%
0.3%
1%
call
0.5%
1.6%
0.1%
0.1%
0%
return, jmp indirect
0.5%
1.6%
0.1%
0.1%
0%
shift
1.1%
and
0.8%
or
0.1%
0.8%
2% 0%
4.5%
2.5%
2.5%
2%
0.7%
1.3%
1.3%
1%
0.1%
0.1%
other (xor, not, . . .)
0% 0%
load FP
14.1%
22.5%
9.1%
12.6%
12.6%
14%
store FP
8.6%
11.4%
4.1%
6.6%
6.6%
7%
add FP
5.8%
6.1%
1.4%
6.6%
6.6%
5%
sub FP
2.2%
2.7%
3.1%
2.9%
2.9%
3%
mul FP
8.9%
8.0%
div FP
2.1%
compare FP
9.4%
mov reg-reg FP
2.5%
other (abs, sqrt, . . .)
3.9%
4.1%
12.0%
12.0%
9%
0.8%
0.2%
0.2%
0%
6.9%
10.8%
0.5%
0.5%
5%
0.8%
0.3%
0.8%
0.8%
1%
3.8%
4.1%
0.8%
0.8%
2%
Figure D.16 80x86 instruction mix for five SPECfp92 programs.
Another question is the total amount of data traffic for the 80x86 versus DLX, since the 80x86 can specify memory operands as part of operations while DLX can only access via loads and stores. Figures D.17 and D.18 also show the data reads, data writes, and data read-modify-writes for these 10 programs. The total accesses ratio to DLX of each memory access type is shown in the bottom rows, with the read-modify-write counting as one read and one write. The 80x86 performs about two to four times as many data accesses as DLX for floating-point programs, and 1.25 times as many for integer programs. Finally, Figure D.19 shows the percentage of instructions in each category for 80x86 and DLX.
                                            | compress | eqntott | espresso | gcc (cc1) | li   | Int. avg.
Instructions executed on 80x86 (millions)   | 2226     | 1203    | 2216     | 3770      | 5020 |
Instructions executed ratio to DLX          | 0.61     | 1.74    | 0.85     | 0.96      | 0.98 | 1.03
Data reads on 80x86 (millions)              | 589      | 229     | 622      | 1079      | 1459 |
Data writes on 80x86 (millions)             | 311      | 39      | 191      | 661       | 981  |
Data read-modify-writes on 80x86 (millions) | 26       | 1       | 129      | 48        | 48   |
Total data reads on 80x86 (millions)        | 615      | 230     | 751      | 1127      | 1507 |
Data read ratio to DLX                      | 0.85     | 1.09    | 1.38     | 1.25      | 0.94 | 1.10
Total data writes on 80x86 (millions)       | 338      | 40      | 319      | 709       | 1029 |
Data write ratio to DLX                     | 1.67     | 9.26    | 2.39     | 1.25      | 1.20 | 3.15
Total data accesses on 80x86 (millions)     | 953      | 269     | 1070     | 1836      | 2536 |
Data access ratio to DLX                    | 1.03     | 1.25    | 1.58     | 1.25      | 1.03 | 1.23
Figure D.17 Instructions executed and data accesses on 80x86 and ratios compared to DLX for five SPECint92 programs.
                                            | doduc | ear    | hydro2d | mdljdp2 | su2cor | FP average
Instructions executed on 80x86 (millions)   | 1223  | 15,220 | 13,342  | 6197    | 6197   |
Instructions executed ratio to DLX          | 1.19  | 1.19   | 2.53    | 2.09    | 1.62   | 1.73
Data reads on 80x86 (millions)              | 515   | 6007   | 5501    | 3696    | 3643   |
Data writes on 80x86 (millions)             | 260   | 2205   | 2085    | 892     | 892    |
Data read-modify-writes on 80x86 (millions) | 1     | 0      | 189     | 124     | 124    |
Total data reads on 80x86 (millions)        | 517   | 6007   | 5690    | 3820    | 3767   |
Data read ratio to DLX                      | 2.04  | 2.36   | 4.48    | 4.77    | 3.91   | 3.51
Total data writes on 80x86 (millions)       | 261   | 2205   | 2274    | 1015    | 1015   |
Data write ratio to DLX                     | 3.68  | 33.25  | 38.74   | 16.74   | 9.35   | 20.35
Total data accesses on 80x86 (millions)     | 778   | 8212   | 7965    | 4835    | 4782   |
Data access ratio to DLX                    | 2.40  | 3.14   | 5.99    | 5.73    | 4.47   | 4.35
Figure D.18 Instructions executed and data accesses for five SPECfp92 programs on 80x86 and ratio to DLX.
D.7 Concluding Remarks

Beauty is in the eye of the beholder.
Old Adage
As we have seen, “orthogonal” is not a term found in the Intel architectural dictionary. To fully understand which registers and which addressing modes are available, you need to see the encoding of all addressing modes and sometimes the encoding of the instructions.

Some argue that the inelegance of the 80x86 instruction set is unavoidable, the price that must be paid for rampant success by any architecture. We reject that notion. Obviously no successful architecture can jettison features that were added in previous implementations, and over time some features may be seen as undesirable. The awkwardness of the 80x86 began at its core with the 8086 instruction set and was exacerbated by the architecturally inconsistent expansions of the 8087, 80286, and 80386.

A counterexample is the IBM 360/370 architecture, which is much older than the 80x86. It dominates the mainframe market just as the 80x86 dominates the PC market. Due undoubtedly to a better base and more compatible enhancements, this instruction set makes much more sense than the 80x86 more than 30 years after its first implementation.

For better or worse, Intel had a 16-bit microprocessor years before its competitors’ more elegant architectures, and this head start led to the selection of the 8086 as the CPU for the IBM PC. What it lacks in style is made up in quantity, making the 80x86 beautiful from the right perspective. The saving grace of the 80x86 is that its architectural components are not too difficult to implement, as Intel has demonstrated by rapidly improving performance of integer programs since 1978. High floating-point performance is a larger challenge in this architecture.

Category                 | Integer average, x86 | Integer average, DLX | FP average, x86 | FP average, DLX
Total data transfer      | 34%                  | 36%                  | 28%             | 2%
Total integer arithmetic | 34%                  | 31%                  | 16%             | 12%
Total control            | 24%                  | 20%                  | 6%              | 10%
Total logical            | 8%                   | 13%                  | 3%              | 2%
Total FP data transfer   | 0%                   | 0%                   | 22%             | 33%
Total FP arithmetic      | 0%                   | 0%                   | 25%             | 41%

Figure D.19 Percentage of instructions executed by category for 80x86 and DLX for the averages of five SPECint92 and SPECfp92 programs of Figures D.17 and D.18.
D.8
Historical Perspective and References
The complexity of the x86 is not an impassable barrier. . . . The biggest weakness in the x86 instruction set is the lack of registers coupled with an extremely painful addressing scheme.
Mike Johnson, Leader of 80x86 Design at AMD, Microprocessor Report (1994)
There are numerous published descriptions of the 80x86 architecture; Wakerly’s [1989] is both concise and easy to understand. Crawford and Gelsinger [1988] is a thorough description of the 80386. The ancestors of the 80x86 were the first microprocessors, produced late in the first half of the 1970s. The Intel 4004 and 8008 were extremely simple 4- and 8-bit accumulator-style machines. Morse et al. [1980] describe the evolution of the 8086 from the 8080 in the late 1970s as an attempt to provide a 16-bit machine with better throughput. At that time almost all programming for microprocessors was done in assembly language, since both memory and compilers were in short supply. Intel wanted to keep its base of 8080 users, so the 8086 was designed to be “compatible” with the 8080. The 8086 was never object-code compatible with the 8080, but the machines were close enough that translation of assembly language programs could be done automatically.
In early 1980, IBM selected a version of the 8086 with an 8-bit external bus, called the 8088, for use in the IBM PC. They chose the 8-bit version to reduce the cost of the machine. This choice, together with the tremendous success of the IBM PC, has made the 8086 architecture ubiquitous. The success of the IBM PC was due in part to IBM’s decision to open the architecture of the PC, which enabled the PC-clone industry to flourish. As discussed in the introduction of this appendix, the 80286, 80386, 80486, Pentium, and P6 have extended the architecture and provided a series of performance enhancements. Although the 68000 was chosen for the popular Macintosh, the Macintosh was never as pervasive as the PC, partly because Apple did not allow clones until recently, and the 68000 did not acquire the same software leverage that the 8086 enjoys. The Motorola 68000 may have been more significant technically than the 8086, but the impact of the selection by IBM and IBM’s open architecture strategy dominated the technical advantages of the 68000 in the market.
Kahan’s history [1990] of the stack architecture selection for the 8087 is entertaining. The floating-point architecture of the companion 8087 had to be retrofitted into the 8086 opcode space, making it inconvenient to offer two operands per instruction as found in the rest of the 8086. Hence the decision for one operand per instruction using a stack: “The designer’s task was to make a Virtue of this Necessity.” A classical stack architecture has no provision for keeping common subexpressions from being pushed and popped between memory and the top of the stack held in registers; Intel instead tried to combine a flat register file with a stack. The reasoning was that restricting one operand to the top of the stack was not so bad, since it required only the execution of an FXCH instruction (which swapped registers) to get the same result as a two-operand instruction, and FXCH was much faster than the floating-point operations of the 8087. Since floating-point expressions are not that complex, Kahan reasoned that eight registers meant that the stack would rarely overflow. Hence he urged that the 8087 use this hybrid scheme with the provision that stack overflow or stack underflow would interrupt the 8086 so that interrupt software could give the illusion to the compiler writer of an unlimited stack for floating-point data. The Intel 8087 was implemented in Israel, and 7500 miles and 10 time zones made communication with California difficult. According to Palmer and Morse [1984]:
Unfortunately, nobody tried to write a software stack manager until after the 8087 was built, and by then it was too late; what was too complicated to perform in hardware turned out to be even worse in software. One thing found lacking is the ability to conveniently determine if an invalid-operation exception is indeed due to a stack overflow. . . . Also lacking is the ability to restart the instruction that caused the stack overflow . . . [p. 93]
The result is that the stack exceptions are too slow to handle in software. As Kahan [1990] says: Consequently, almost all higher-level languages’ compilers emit inefficient code for the 80x87 family, degrading the chip’s performance by typically 50% with spurious stores and loads necessary simply to preclude stack over/underflow. . . . I still regret that the 8087’s stack implementation was not quite so neat as my original intention. . . . If the original design had been realized, compilers today would use the 80x87 and its descendents more efficiently, and Intel’s competitors could more easily market faster but compatible 80x87 imitations.
The P6 renames the floating-point registers (see Chapter 3), effectively providing up to 40 floating-point registers at any given instant. The main effect of the stack organization is to force design teams to use transistors for dereferencing the stack before doing the renaming. Hewlett-Packard and Intel have announced a new, common instruction set architecture. It is also upward compatible with the 80x86, and thus the 80x86 instruction set will be available in some form in computers of this century. Instruction set anthropologists will peel off layer by layer from such machines until they uncover artifacts from the first microprocessor. Given such a find, how will they judge 20th-century computer architecture?
References
Crawford, J., and P. Gelsinger [1988]. Programming the 80386, Sybex Books, Alameda, Calif.
Kahan, J. [1990]. “On the advantage of the 8087’s stack,” unpublished course notes, Computer Science Division, University of California at Berkeley.
Morse, S., B. Ravenal, S. Mazor, and W. Pohlman [1980]. “Intel microprocessors—8080 to 8086,” Computer 13:10 (October).
Palmer, J., and S. Morse [1984]. The 8087 Primer, J. Wiley, New York, p. 93.
Wakerly, J. [1989]. Microcomputer Architecture and Programming, J. Wiley, New York.
E Another Alternative to RISC: The VAX Architecture
In principle, there is no great challenge in designing a large virtual address minicomputer system. . . . The real challenge lies in two areas: compatibility—very tangible and important; and simplicity— intangible but nonetheless important. William Strecker “VAX-11/780—A Virtual Address Extension to the PDP-11 Family,” AFIPS Proc., National Computer Conference, 1978.
Entities should not be multiplied unnecessarily. William of Occam Quodlibeta Septem, 1320 (This quote is known as “Occam’s Razor.”)
© 2003 Elsevier Science (USA). All rights reserved.
E.1 Introduction
E.2 VAX Operands and Addressing Modes
E.3 Encoding VAX Instructions
E.4 VAX Operations
E.5 An Example to Put It All Together: swap
E.6 A Longer Example: sort
E.7 Fallacies and Pitfalls
E.8 Concluding Remarks
E.9 Historical Perspective and Further Reading
Exercises
E.1
Introduction
To enhance your understanding of instruction set architectures, we chose the VAX as the representative Complex Instruction Set Computer (CISC) because it is so different from MIPS and yet still easy to understand. By seeing two such divergent styles, we are confident that you will be able to learn other instruction sets on your own. At the time the VAX was designed, the prevailing philosophy was to create instruction sets that were close to programming languages in order to simplify compilers. For example, because programming languages had loops, instruction sets should have loop instructions. As VAX architect William Strecker said (“VAX-11/780—A Virtual Address Extension to the PDP-11 Family,” AFIPS Proc., National Computer Conference, 1978):
A major goal of the VAX-11 instruction set was to provide for effective compiler generated code. Four decisions helped to realize this goal: . . . 1) A very regular and consistent treatment of operators. . . . 2) An avoidance of instructions unlikely to be generated by a compiler. . . . 3) Inclusions of several forms of common operators. . . . 4) Replacement of common instruction sequences with single instructions. Examples include procedure calling, multiway branching, loop control, and array subscript calculation.
Recall that DRAMs of the mid-1970s contained less than 1/1000th the capacity of today’s DRAMs, so code space was also critical. Hence, another prevailing philosophy was to minimize code size, which is de-emphasized in fixed-length instruction sets like MIPS. For example, MIPS address fields always use 16 bits, even when the address is very small. In contrast, the VAX allows instructions to be a variable number of bytes, so there is little wasted space in address fields. Books the size of the one you are reading have been written just about the VAX, so this VAX extension cannot be exhaustive. Hence, the following sections describe only a few of its addressing modes and instructions. To show the VAX instructions in action, later sections show VAX assembly code for two C procedures. The general style will be to contrast these instructions with the MIPS code that you are already familiar with. The differing goals for VAX and MIPS have led to very different architectures. The VAX goals, simple compilers and code density, led to the powerful addressing modes, powerful instructions, and efficient instruction encoding. The MIPS goals were high performance via pipelining, ease of hardware implementation, and compatibility with highly optimizing compilers. The MIPS goals led to simple instructions, simple addressing modes, fixed-length instruction formats, and a large number of registers.
E.2
VAX Operands and Addressing Modes
The VAX is a 32-bit architecture, with 32-bit-wide addresses and 32-bit-wide registers. Yet the VAX supports many other data sizes and types, as Figure E.1
shows. Unfortunately, VAX uses the name “word” to refer to 16-bit quantities; in this text a word means 32 bits. Figure E.1 shows the conversion between the MIPS data type names and the VAX names. Be careful when reading about VAX instructions, as they refer to the names of the VAX data types. The VAX provides 16 32-bit registers. The VAX assembler uses the notation r0, r1, . . . , r15 to refer to these registers, and we will stick to that notation. Alas, 4 of these 16 registers are effectively claimed by the instruction set architecture. For example, r14 is the stack pointer (sp) and r15 is the program counter (pc). Hence, r15 cannot be used as a general-purpose register, and using r14 is very difficult because it interferes with instructions that manipulate the stack. The other dedicated registers are r12, used as the argument pointer (ap), and r13, used as the frame pointer (fp); their purpose will become clear later. (Like MIPS, the VAX assembler accepts either the register number or the register name.) VAX addressing modes include those discussed in Chapter 2, which has all the MIPS addressing modes: register, displacement, immediate, and PC-relative. Moreover, all these modes can be used for jump addresses or for data addresses. But that’s not all the addressing modes. To reduce code size, the VAX has three lengths of addresses for displacement addressing: 8-bit, 16-bit, and 32-bit addresses called, respectively, byte displacement, word displacement, and long displacement addressing. Thus, an address can be not only as small as possible, but also as large as necessary; large addresses need not be split, so there is no equivalent to the MIPS lui instruction (see page 134). Those are still not all the VAX addressing modes. Several have a deferred option, meaning that the object addressed is only the address of the real object, requiring another memory access to get the operand. This addressing mode is called indirect addressing in other machines. 
Thus, register deferred, autoincrement deferred, and byte/word/long displacement deferred are other addressing modes to choose from. For example, using the notation of the VAX assembler, r1 means the operand is register 1 and (r1) means the operand is the location in memory pointed to by r1.
Bits   Data type          MIPS name          VAX name
8      Integer            Byte               Byte
16     Integer            Half word          Word
32     Integer            Word               Long word
32     Floating point     Single precision   F_floating
64     Integer            Double word        Quad word
64     Floating point     Double precision   D_floating or G_floating
8n     Character string   Character          Character
Figure E.1 VAX data types, their lengths, and names. The first letter of the VAX type (b, w, l, f, q, d, g, c) is often used to complete an instruction name. Examples of move instructions include movb, movw, movl, movf, movq, movd, movg, and movc3. Each move instruction transfers an operand of the data type indicated by the letter following mov.
There is yet another addressing mode. Indexed addressing automatically converts the value in an index operand to the proper byte address to add to the rest of the address. For a 32-bit word in MIPS code, we needed to multiply the index of a 4-byte quantity by 4 before adding it to a base address. Indexed addressing, called scaled addressing on some computers, automatically multiplies the index of a 4-byte quantity by 4 as part of the address calculation. To cope with such a plethora of addressing options, the VAX architecture separates the specification of the addressing mode from the specification of the operation. Hence, the opcode supplies the operation and the number of operands, and each operand has its own addressing mode specifier. Figure E.2 shows the name, assembler notation, example, meaning, and length of the address specifier. The VAX style of addressing means that an operation doesn’t know where its operands come from; a VAX add instruction can have three operands in registers, three operands in memory, or any combination of registers and memory operands.
Example
How long is the following instruction?
    addl3 r1,737(r2),(r3)[r4]
The name addl3 means a 32-bit add instruction with three operands. Assume the length of the VAX opcode is 1 byte.
Answer
The first operand specifier, r1, indicates register addressing and is 1 byte long. The second operand specifier, 737(r2), indicates displacement addressing and has two parts: The first part is a byte that specifies the word displacement addressing mode and base register (r2); the second part is the 2-byte displacement (737). The third operand specifier, (r3)[r4], also has two parts: The first byte specifies register deferred addressing mode ((r3)), and the second byte specifies the index register and the use of indexed addressing ([r4]). Thus, the total length of the instruction is 1 + (1) + (1 + 2) + (1 + 1) = 7 bytes. In this example instruction, we show the VAX destination operand on the left and the source operands on the right, just as we show MIPS code. The VAX assembler actually expects operands in the opposite order, but we felt it would be less confusing to keep the destination on the left for both machines. Obviously, left or right orientation is arbitrary; the only requirement is consistency.
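The length arithmetic in the answer can be checked mechanically. The following is a minimal sketch in Python, not VAX tooling: the function names and mode labels are invented for illustration, only the modes used in this example are covered, and the lengths are those listed in Figure E.2.

```python
# Sketch: byte length of a VAX instruction from its operand specifiers.
# Mode labels and function names are hypothetical; lengths follow Figure E.2.

DISPLACEMENT_BYTES = {"byte": 1, "word": 2, "long": 4}


def specifier_length(mode, disp_size=None, base=None):
    """Length in bytes of one operand specifier."""
    if mode in ("register", "register_deferred"):
        return 1                                   # mode and register share one byte
    if mode == "displacement":
        return 1 + DISPLACEMENT_BYTES[disp_size]   # specifier byte + displacement
    if mode == "indexed":
        return 1 + specifier_length(**base)        # index byte + base mode
    raise ValueError(f"mode not covered by this sketch: {mode}")


def instruction_length(operands, opcode_bytes=1):
    """Opcode plus the sum of the operand specifier lengths."""
    return opcode_bytes + sum(specifier_length(**op) for op in operands)


# addl3 r1,737(r2),(r3)[r4]
ADDL3_EXAMPLE = [
    {"mode": "register"},
    {"mode": "displacement", "disp_size": "word"},
    {"mode": "indexed", "base": {"mode": "register_deferred"}},
]

if __name__ == "__main__":
    print(instruction_length(ADDL3_EXAMPLE))  # 1 + 1 + (1 + 2) + (1 + 1) = 7
```

Changing the second operand to a long displacement would add two bytes, which is the kind of variation that makes VAX instruction lengths so variable.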
Elaboration
Because the PC is one of the 16 registers that can be selected in a VAX addressing mode, 4 of the 22 VAX addressing modes are synthesized from other addressing modes. Using the PC as the chosen register in each case, immediate addressing is really autoincrement, PC-relative is displacement, absolute is autoincrement deferred, and relative deferred is displacement deferred.
Addressing mode name                  Syntax             Example   Meaning                                              Length of address specifier in bytes
Literal                               #value             #–1       –1                                                   1 (6-bit signed value)
Immediate                             #value             #100      100                                                  1 + length of the immediate
Register                              rn                 r3        r3                                                   1
Register deferred                     (rn)               (r3)      Memory[r3]                                           1
Byte/word/long displacement           displacement(rn)   100(r3)   Memory[r3 + 100]                                     1 + length of the displacement
Byte/word/long displacement deferred  @displacement(rn)  @100(r3)  Memory[Memory[r3 + 100]]                             1 + length of the displacement
Indexed (scaled)                      base mode[rx]      (r3)[r4]  Memory[r3 + r4 × d] (where d is data size in bytes)  1 + length of base addressing mode
Autoincrement                         (rn)+              (r3)+     Memory[r3]; r3 = r3 + d                              1
Autodecrement                         –(rn)              –(r3)     r3 = r3 – d; Memory[r3]                              1
Autoincrement deferred                @(rn)+             @(r3)+    Memory[Memory[r3]]; r3 = r3 + d                      1

Figure E.2 Definition and length of the VAX operand specifiers. The length of each addressing mode is 1 byte plus the length of any displacement or immediate field needed by the mode. Literal mode uses a special 2-bit tag and the remaining 6 bits encode the constant value. If the constant is too big, it must use the immediate addressing mode. Note the length of an immediate operand is dictated by the length of the data type indicated in the opcode, not the value of the immediate. The symbol d in the last four modes represents the length of the data in bytes; d is 4 for 32-bit add.
E.3
Encoding VAX Instructions
Given the independence of the operations and addressing modes, the encoding of instructions is quite different from MIPS. VAX instructions begin with a single byte opcode containing the operation and the number of operands. The operands follow the opcode. Each operand begins with a single byte, called the address specifier, that describes the addressing mode for that operand. For a simple addressing mode, such as register addressing, this byte specifies the register number as well as the mode (see the rightmost column in Figure E.2). In other cases, this initial byte can be followed by many more bytes to specify the rest of the address information. As a specific example, let’s show the encoding of the add instruction from the example in Section E.2:
    addl3 r1,737(r2),(r3)[r4]
Assume that this instruction starts at location 201. Figure E.3 shows the encoding. Note that the operands are stored in memory in opposite order to the assembly code above. The execution of VAX instructions begins with fetching the source operands, so it makes sense for them to come first. Order is not important in fixed-length instructions like MIPS, since the source and destination operands are easily found within a 32-bit word.

Byte address   Contents at each byte                               Machine code
201            opcode containing addl3                             c1hex
202            index mode specifier for [r4]                       44hex
203            register indirect mode specifier for (r3)           63hex
204            word displacement mode specifier using r2 as base   c2hex
205, 206       the 16-bit constant 737                             e1hex 02hex
207            register mode specifier for r1                      51hex

Figure E.3 The encoding of the VAX instruction addl3 r1,737(r2),(r3)[r4], assuming it starts at address 201. To satisfy your curiosity, the right column shows the actual VAX encoding in hexadecimal notation. Note that the 16-bit constant 737ten takes two bytes.

The first byte, at location 201, is the opcode. The next byte, at location 202, is a specifier for the index mode using register r4. Like many of the other specifiers, the left 4 bits of the specifier give the mode and the right 4 bits give the register used in that mode. Since addl3 is a 4-byte operation, r4 will be multiplied by 4 and added to whatever address is specified next. In this case it is register deferred addressing using register r3. Thus bytes 202 and 203 combined define the third operand in the assembly code. The following byte, at address 204, is a specifier for word displacement addressing using register r2 as the base register. This specifier tells the VAX that the following two bytes, locations 205 and 206, contain a 16-bit displacement to be added to r2. The final byte of the instruction, at location 207, gives the destination operand, and this specifier selects register addressing using register r1. Such variability in addressing means that a single VAX operation can have many different lengths; for example, an integer add varies from 3 bytes to 19 bytes. VAX implementations must decode the first operand before they can find the second, and so implementors are strongly tempted to take one clock cycle to decode each operand; thus this sophisticated instruction set architecture can result in higher clock cycles per instruction, even when using simple addresses.
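The byte layout of Figure E.3 can be reproduced with a few lines of Python. This is a sketch for this one instruction only, not an assembler: the helper names are hypothetical, and the opcode and 4-bit mode codes are simply read off the hexadecimal column of Figure E.3 using the rule that the left 4 bits give the mode and the right 4 bits give the register.

```python
# Sketch: assemble the seven bytes of addl3 r1,737(r2),(r3)[r4].
# Helper names are hypothetical; codes are taken from the hex column of Figure E.3.

ADDL3_OPCODE = 0xC1
MODE_INDEX = 0x4
MODE_REGISTER = 0x5
MODE_REGISTER_DEFERRED = 0x6
MODE_WORD_DISPLACEMENT = 0xC


def specifier(mode, register):
    """Pack a 4-bit mode (left) and a 4-bit register number (right) into one byte."""
    return bytes([(mode << 4) | register])


def word_displacement(register, displacement):
    """Specifier byte followed by a 16-bit displacement, low byte first."""
    return specifier(MODE_WORD_DISPLACEMENT, register) + displacement.to_bytes(
        2, "little", signed=True
    )


def encode_addl3_example():
    """Operands appear in memory in the reverse of the order used in this text."""
    return (
        bytes([ADDL3_OPCODE])
        + specifier(MODE_INDEX, 4)               # [r4]
        + specifier(MODE_REGISTER_DEFERRED, 3)   # (r3)
        + word_displacement(2, 737)              # 737(r2)
        + specifier(MODE_REGISTER, 1)            # r1
    )


if __name__ == "__main__":
    print(encode_addl3_example().hex(" "))  # c1 44 63 c2 e1 02 51
```

The two displacement bytes come out as e1 02 because 737 is 02e1 in hexadecimal and the low-order byte is stored first.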
E.4
VAX Operations
In keeping with its philosophy, the VAX has a large number of operations as well as a large number of addressing modes. We review a few here to give the flavor of the machine.
Given the power of the addressing modes, the VAX move instruction performs several operations found in other machines. It transfers data between any two addressable locations and subsumes load, store, register-register moves, and memory-memory moves as special cases. The first letter of the VAX data type (b, w, l, f, q, d, g, c in Figure E.1) is appended to the acronym mov to determine the size of the data. One special move, called move address, moves the 32-bit address of the operand rather than the data. It uses the acronym mova. The arithmetic operations of MIPS are also found in the VAX, with two major differences. First, the type of the data is attached to the name. Thus addb, addw, and addl operate on 8-bit, 16-bit, and 32-bit data in memory or registers, respectively; MIPS has a single add instruction that operates only on the full 32-bit register. The second difference is that to reduce code size, the add instruction specifies the number of unique operands; MIPS always specifies three even if one operand is redundant. For example, the MIPS instruction
    add $1, $1, $2
takes 32 bits like all MIPS instructions, but the VAX instruction
    addl2 r1, r2
uses r1 for both the destination and a source, taking just 24 bits: 8 bits for the opcode and 8 bits each for the two register specifiers.
Number of Operations
Now we can show how VAX instruction names are formed:
    (operation)(datatype)(2 or 3)
The operation add works with data types byte, word, long, float, and double and comes in versions for either 2 or 3 unique operands, so the following instructions are all found in the VAX:
    addb2  addw2  addl2  addf2  addd2
    addb3  addw3  addl3  addf3  addd3
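The naming rule is regular enough to generate the list above. The sketch below is illustrative Python with hypothetical names; it only concatenates an operation, a type letter from Figure E.1, and an operand count, and says nothing about which combinations exist for operations other than add.

```python
# Sketch: form VAX instruction names as (operation)(datatype)(2 or 3).
# Function and constant names are hypothetical.
from itertools import product

ADD_TYPE_LETTERS = ["b", "w", "l", "f", "d"]   # byte, word, long, float, double
OPERAND_COUNTS = [2, 3]


def instruction_names(operation, type_letters, operand_counts):
    """Return every name built from one operation, type letter, and operand count."""
    return [f"{operation}{t}{n}" for t, n in product(type_letters, operand_counts)]


if __name__ == "__main__":
    print(" ".join(instruction_names("add", ADD_TYPE_LETTERS, OPERAND_COUNTS)))
```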
Accounting for all addressing modes (but ignoring register numbers and immediate values) and limiting to just byte, word, and long, there are more than 30,000 versions of integer add in the VAX; MIPS has just 4! Another reason for the large number of VAX instructions is the instructions that either replace sequences of instructions or take fewer bytes to represent a single instruction. Here are four such examples (* means the data type):
VAX operation   Example    Meaning
clr*            clrl r3    r3 = 0
inc*            incl r3    r3 = r3 + 1
dec*            decl r3    r3 = r3 – 1
push*           pushl r3   sp = sp – 4; Memory[sp] = r3;
The push instruction in the last row is exactly the same as using the move instruction with autodecrement addressing on the stack pointer:
    movl –(sp), r3
Brevity is the advantage of pushl: It is one byte shorter since sp is implied.
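The equivalence can be made concrete with a toy model. In the Python sketch below, registers and memory are plain dictionaries and both function names are hypothetical; it models only the meaning given in the table (sp = sp – 4; Memory[sp] = r3), not instruction timing or encoding.

```python
# Sketch: pushl r3 versus movl -(sp), r3 on a toy machine state.
# Registers and memory are dictionaries; all names are hypothetical.

LONG_BYTES = 4


def pushl(regs, memory, src):
    """pushl src: decrement sp by 4, then store src at the new top of stack."""
    regs["sp"] -= LONG_BYTES
    memory[regs["sp"]] = regs[src]


def movl_autodecrement(regs, memory, base, src):
    """movl -(base), src: decrement base by 4, then store src at Memory[base]."""
    regs[base] -= LONG_BYTES
    memory[regs[base]] = regs[src]


if __name__ == "__main__":
    regs_a, mem_a = {"sp": 0x1000, "r3": 42}, {}
    regs_b, mem_b = {"sp": 0x1000, "r3": 42}, {}
    pushl(regs_a, mem_a, "r3")
    movl_autodecrement(regs_b, mem_b, "sp", "r3")
    print(regs_a == regs_b and mem_a == mem_b)
```

The only difference between the two forms is that pushl does not need an operand specifier naming sp, which is where the saved byte comes from.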
Branches, Jumps, and Procedure Calls
The VAX branch instructions are related to the arithmetic instructions because the branch instructions rely on condition codes. Condition codes are set as a side effect of an operation, and they indicate whether the result is positive, negative, zero, or if an overflow occurred. Most instructions set the VAX condition codes according to their result; instructions without results, such as branches, do not. The VAX condition codes are N (Negative), Z (Zero), V (oVerflow), and C (Carry). There is also a compare instruction cmp* just to set the condition codes for a subsequent branch. The VAX branch instructions include all conditions. Popular branch instructions include beql (=), bneq (≠), blss (<), and bgeq (≥), which do just what you would expect. There are also unconditional branches whose name is determined by the size of the PC-relative offset. Thus brb (branch byte) has an 8-bit displacement and brw (branch word) has a 16-bit displacement. The final major category we cover here is the procedure call and return instructions. Unlike their MIPS counterparts, these elaborate instructions can take dozens of clock cycles to execute. The next two sections show how they work, but we need to explain the purpose of the pointers associated with the stack manipulated by calls and ret. The stack pointer, sp, is just like the stack pointer in MIPS; it points to the top of the stack. The argument pointer, ap, points to the base of the list of arguments or parameters in memory that are passed to the procedure. The frame pointer, fp, points to the base of the local variables of the procedure that are kept in memory (the stack frame). The VAX call and return instructions manipulate these pointers to maintain the stack in proper condition across procedure calls and to provide convenient base registers to use when accessing memory operands.
As we shall see, call and return also save and restore the general-purpose registers as well as the program counter. Figure E.4 gives a further sampling of the VAX instruction set.
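To illustrate how condition codes fall out of an arithmetic result, here is a small Python sketch for a 32-bit add. It assumes the conventional two's-complement definitions of N, Z, V, and C; the function name is hypothetical, and the exact rules individual VAX instructions apply to each code are not modeled.

```python
# Sketch: N, Z, V, C after a 32-bit add, using conventional definitions.
# Not a model of any particular VAX instruction; names are hypothetical.

MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


def addl_condition_codes(a, b):
    """Return the 32-bit sum and a dict of the four condition codes."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    codes = {
        "N": bool(result & SIGN32),                        # result is negative
        "Z": result == 0,                                  # result is zero
        "V": bool(~(a ^ b) & (a ^ result) & SIGN32),       # signed overflow
        "C": full > MASK32,                                # carry out of bit 31
    }
    return result, codes


if __name__ == "__main__":
    print(addl_condition_codes(0x7FFFFFFF, 1))
    print(addl_condition_codes(0xFFFFFFFF, 1))
```

A branch such as beql would then test Z, and blss would test N, which is why a compare or arithmetic instruction normally precedes a conditional branch.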
Instruction type     Example       Instruction meaning

Data transfers: Move data between byte, half-word, word, or double-word operands; * is data type
                     mov*          Move between two operands
                     movzb*        Move a byte to a half word or word, extending it with zeros
                     mova*         Move the 32-bit address of an operand; data type is last
                     push*         Push operand onto stack

Arithmetic/logical: Operations on integer or logical bytes, half words (16 bits), words (32 bits); * is data type
                     add*_         Add with 2 or 3 operands
                     cmp*          Compare and set condition codes
                     tst*          Compare to zero and set condition codes
                     ash*          Arithmetic shift
                     clr*          Clear
                     cvtb*         Sign-extend byte to size of data type

Control: Conditional and unconditional branches
                     beql, bneq    Branch equal, branch not equal
                     bleq, bgeq    Branch less than or equal, branch greater than or equal
                     brb, brw      Unconditional branch with an 8-bit or 16-bit address
                     jmp           Jump using any addressing mode to specify target
                     aobleq        Add one to operand; branch if result ≤ second operand
                     case_         Jump based on case selector

Procedure: Call/return from procedure
                     calls         Call procedure with arguments on stack (see Section E.6)
                     callg         Call procedure with FORTRAN-style parameter list
                     jsb           Jump to subroutine, saving return address (like MIPS jal)
                     ret           Return from procedure call

Floating point: Floating-point operations on D, F, G, and H formats
                     addd_         Add double-precision D-format floating numbers
                     subd_         Subtract double-precision D-format floating numbers
                     mulf_         Multiply single-precision F-format floating point
                     polyf         Evaluate a polynomial using table of coefficients in F format

Other: Special operations
                     crc           Calculate cyclic redundancy check
                     insque        Insert a queue entry into a queue
Figure E.4 Classes of VAX instructions with examples. The asterisk stands for multiple data types: b, w, l, d, f, g, h, and q. The underline, as in addd_, means there are 2-operand (addd2) and 3-operand (addd3) forms of this instruction.
E.5
An Example to Put It All Together: swap
To see programming in VAX assembly language, we translate two C procedures, swap and sort. The C code for swap is reproduced in Figure E.5. The next section covers sort. We describe the swap procedure in three general steps of assembly language programming:
1. Allocate registers to program variables
2. Produce code for the body of the procedure
3. Preserve registers across the procedure invocation
The VAX code for these procedures is based on code produced by the VMS C compiler using optimization.
Register Allocation for swap
In contrast to MIPS, VAX parameters are normally allocated to memory, so this step of assembly language programming is more properly called “variable allocation.” The standard VAX convention on parameter passing is to use the stack. The two parameters, v[] and k, can be accessed using register ap, the argument pointer: the address 4(ap) corresponds to v[] and 8(ap) corresponds to k. Remember that with byte addressing the address of sequential 4-byte words differs by 4. The only other variable is temp, which we associate with register r3.
Code for the Body of the Procedure swap
The remaining lines of C code in swap are
    temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k + 1];
        v[k + 1] = temp;
    }
Figure E.5 A C procedure that swaps two locations in memory. This procedure will be used in the sorting example in the next section.
Since this program uses v[] and k several times, to make the programs run faster the VAX compiler first moves both parameters into registers:
    movl  r2, 4(ap)                ;r2 = v[]
    movl  r1, 8(ap)                ;r1 = k
Note that we follow the VAX convention of using a semicolon to start a comment; the MIPS comment symbol # represents a constant operand in VAX assembly language. The VAX has indexed addressing, so we can use index k without converting it to a byte address. The VAX code is then straightforward:
    movl  r3, (r2)[r1]             ;r3 (temp) = v[k]
    addl3 r0, #1,8(ap)             ;r0 = k + 1
    movl  (r2)[r1],(r2)[r0]        ;v[k] = v[r0] (v[k + 1])
    movl  (r2)[r0],r3              ;v[k + 1] = r3 (temp)
Unlike the MIPS code, which is basically two loads and two stores, the key VAX code is one memory-to-register move, one memory-to-memory move, and one register-to-memory move. Note that the addl3 instruction shows the flexibility of the VAX addressing modes: It adds the constant 1 to a memory operand and places the result in a register. Now we have allocated storage and written the code to perform the operations of the procedure. The only missing item is the code that preserves registers across the routine that calls swap.
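The effect of these six instructions can be traced with a toy memory model. The Python sketch below is an illustration under stated assumptions: memory is a dictionary from byte addresses to 32-bit values, the addresses 0x100 and 0x200 are arbitrary, and the function names are hypothetical. Each statement mirrors one VAX instruction, written destination-first as in this text.

```python
# Sketch: trace the body of swap with indexed (scaled) addressing.
# Memory is a dict of byte address -> 32-bit value; names and addresses are hypothetical.

WORD = 4  # bytes in a VAX long word (32 bits)


def indexed(base, index, size=WORD):
    """Address formed by indexed addressing: base + index * size."""
    return base + index * size


def swap_body(memory, ap):
    """Mirror the six VAX instructions of the swap body."""
    r2 = memory[ap + 4]                                  # movl  r2, 4(ap)
    r1 = memory[ap + 8]                                  # movl  r1, 8(ap)
    r3 = memory[indexed(r2, r1)]                         # movl  r3, (r2)[r1]
    r0 = 1 + memory[ap + 8]                              # addl3 r0, #1,8(ap)
    memory[indexed(r2, r1)] = memory[indexed(r2, r0)]    # movl  (r2)[r1],(r2)[r0]
    memory[indexed(r2, r0)] = r3                         # movl  (r2)[r0],r3


def demo():
    """Swap v[1] and v[2] of a three-element array and return the array."""
    v_base, ap = 0x100, 0x200
    memory = {v_base + WORD * i: x for i, x in enumerate([10, 20, 30])}
    memory[ap + 4] = v_base      # first argument: address of v
    memory[ap + 8] = 1           # second argument: k
    swap_body(memory, ap)
    return [memory[v_base + WORD * i] for i in range(3)]


if __name__ == "__main__":
    print(demo())  # [10, 30, 20]
```

Note that the multiplication by 4 lives inside the address calculation, which is the work the MIPS version has to do with an explicit instruction.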
Preserving Registers across Procedure Invocation of swap
The VAX has a pair of instructions that preserve registers: calls and ret. This example shows how they work. The VAX C compiler uses a form of callee-save convention. Examining the code above, we see that the values in registers r0, r1, r2, and r3 must be saved so that they can later be restored. The calls instruction expects a 16-bit mask at the beginning of the procedure to determine which registers are saved: if bit i is set in the mask, then register i is saved on the stack by the calls instruction. In addition, calls saves this mask on the stack to allow the return instruction (ret) to restore the proper registers. Thus the calls executed by the caller does the saving, but the callee sets the call mask to indicate what should be saved. One of the operands for calls gives the number of parameters being passed, so that calls can adjust the pointers associated with the stack: the argument pointer (ap), frame pointer (fp), and stack pointer (sp). Of course, calls also saves the program counter so that the procedure can return! Thus, to preserve these four registers for swap, we just add the mask at the beginning of the procedure, letting the calls instruction in the caller do all the work:
    .word ^m<r0,r1,r2,r3>    ;set bits in mask for 0,1,2,3
This directive tells the assembler to place a 16-bit constant with the proper bits set to save registers r0 through r3. The return instruction undoes the work of calls. When finished, ret sets the stack pointer from the current frame pointer to pop everything calls placed on the stack. Along the way, it restores the register values saved by calls, including those marked by the mask and old values of the fp, ap, and pc. To complete the procedure swap, we just add one instruction:
    ret                      ;restore registers and return
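The mask itself is just a bit vector. The Python sketch below builds and decodes such a mask using only the rule stated above (bit i set means register i is saved); the function names are hypothetical, and any further restrictions a real VAX places on particular mask bits are not modeled.

```python
# Sketch: build and decode a 16-bit register save mask (bit i <-> register i).
# Function names are hypothetical; only the rule stated in the text is modeled.


def register_save_mask(register_numbers):
    """Set bit i of the mask for every register number i in the list."""
    mask = 0
    for i in register_numbers:
        if not 0 <= i <= 15:
            raise ValueError(f"register number out of range: {i}")
        mask |= 1 << i
    return mask


def registers_in_mask(mask):
    """List the register numbers whose bits are set in the mask."""
    return [i for i in range(16) if mask & (1 << i)]


if __name__ == "__main__":
    mask = register_save_mask([0, 1, 2, 3])
    print(f"{mask:04x}", registers_in_mask(mask))  # 000f [0, 1, 2, 3]
```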
The Full Procedure swap
We are now ready for the whole routine. Figure E.6 identifies each block of code with its purpose in the procedure, with the MIPS code on the left and the VAX code on the right. This example shows the advantage of the scaled indexed addressing and the sophisticated call and return instructions of the VAX in reducing the number of lines of code. The 17 lines of MIPS assembly code became 8 lines of VAX assembly code. It also shows that passing parameters in memory results in extra memory accesses. Keep in mind that the number of instructions executed is not the same as performance; the fallacy in Section E.7 makes this point.
      MIPS                              VAX
Saving registers
swap: addi $29,$29,-12            swap: .word ^m
      sw   $2, 0($29)
      sw   $15, 4($29)
      sw   $16, 8($29)
Procedure body
      muli $2, $5,4                     movl  r2, 4(ap)
      add  $2, $4,$2                    movl  r1, 8(ap)
      lw   $15, 0($2)                   movl  r3, (r2)[r1]
      lw   $16, 4($2)                   addl3 r0, #1,8(ap)
      sw   $16, 0($2)                   movl  (r2)[r1],(r2)[r0]
      sw   $15, 4($2)                   movl  (r2)[r0],r3
Restoring registers
      lw   $2, 0($29)
      lw   $15, 4($29)
      lw   $16, 8($29)
      addi $29,$29, 12
Procedure return
      jr   $31                          ret
Figure E.6 MIPS versus VAX assembly code of the procedure swap in Figure E.5 on page E-10.
Elaboration
VAX software follows a convention of treating registers r0 and r1 as temporaries that are not saved across a procedure call, so the VMS C compiler does not include registers r0 and r1 in the register saving mask. Also, the C compiler should have used r1 instead of 8(ap) in the addl3 instruction; such examples inspire computer architects to try to write compilers!
E.6 A Longer Example: sort
We show the longer example of the sort procedure. Figure E.7 shows the C version of the program. Once again we present this procedure in several steps, concluding with a side-by-side comparison to MIPS code.
Register Allocation for sort
The two parameters of the procedure sort, v and n, are found in the stack in locations 4(ap) and 8(ap), respectively. The two local variables are assigned to registers: i to r6 and j to r4. Because the two parameters are referenced frequently in the code, the VMS C compiler copies the address of these parameters into registers upon entering the procedure:
moval r7,8(ap) ;move address of n into r7
moval r5,4(ap) ;move address of v into r5
It would seem that moving the value of the operand to a register would be more useful than its address, but once again we bow to the decision of the VMS C compiler. Apparently the compiler cannot be sure that v and n don’t overlap in memory.
Code for the Body of the sort Procedure
The procedure body consists of two nested for loops and a call to swap, which includes parameters. Let’s unwrap the code from the outside to the middle.
sort (int v[], int n)
{
    int i, j;
    for (i = 0; i < n; i = i + 1) {
        for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j = j - 1) {
            swap(v,j);
        }
    }
}
Figure E.7 A C procedure that performs a bubble sort on the array v.
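For readers who want to run the procedure, the following is a minimal self-contained sketch. The body of swap is an assumption reconstructed from the VAX code in Figure E.6 (it exchanges v[k] and v[k + 1]); the explicit void return types are added so that the code compiles cleanly under a modern C compiler.

```c
/* Assumed behavior of swap, inferred from Figure E.6:
   exchange the adjacent elements v[k] and v[k + 1]. */
void swap(int v[], int k)
{
    int temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}

/* The sort procedure of Figure E.7, unchanged apart from the
   explicit return type. */
void sort(int v[], int n)
{
    int i, j;
    for (i = 0; i < n; i = i + 1) {
        for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j = j - 1) {
            swap(v, j);
        }
    }
}
```

Calling sort(a, 5) on an array a holding 5, 2, 9, 1, 5 leaves it in ascending order.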
The Outer Loop
The first translation step is the first for loop:
for (i = 0; i < n; i = i + 1) {
Recall that the C for statement has three parts: initialization, loop test, and iteration increment. It takes just one instruction to initialize i to 0, the first part of the for statement:
         clrl r6          ;i = 0
It also takes just one instruction to increment i, the last part of the for:
         incl r6          ;i = i + 1
The loop should be exited if i < n is false, or said another way, exit the loop if i ≥ n. This test takes two instructions:
for1tst: cmpl r6,(r7)     ;compare r6 and memory[r7] (i:n)
         bgeq exit1       ;go to exit1 if r6 ≥ mem[r7] (i ≥ n)
Note that cmpl sets the condition codes for use by the conditional branch instruction bgeq. The bottom of the loop just jumps back to the loop test:
         brb for1tst      ;branch to test of outer loop
exit1:
The skeleton code of the first for loop is then
         clrl r6          ;i = 0
for1tst: cmpl r6,(r7)     ;compare r6 and memory[r7] (i:n)
         bgeq exit1       ;go to exit1 if r6 ≥ mem[r7] (i ≥ n)
         ...
         (body of first for loop)
         ...
         incl r6          ;i = i + 1
         brb for1tst      ;branch to test of outer loop
exit1:
The Inner Loop
The second for loop is
for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j = j - 1) {
The initialization portion of this loop is again one instruction:
         subl3 r4,r6,#1   ;j = i - 1
The decrement of j is also one instruction:
         decl r4          ;j = j - 1
The loop test has two parts. We exit the loop if either condition fails, so the first test must exit the loop if it fails (j < 0):
for2tst: blss exit2       ;go to exit2 if r4 < 0 (j < 0)
Notice that there is no explicit comparison. The lack of comparison is a benefit of condition codes, with the conditions being set as a side effect of the prior instruction. This branch skips over the second condition test. The second test exits if v[j] > v[j + 1] is false, or exits if v[j] ≤ v[j + 1]. First we load v and put j + 1 into registers:
         movl r3,(r5)     ;r3 = Memory[r5] (r3 = v)
         addl3 r2,r4,#1   ;r2 = r4 + 1 (r2 = j + 1)
Register indirect addressing is used to get the operand pointed to by r5. Once again the index addressing mode means we can use indices without converting to the byte address, so the two instructions for v[j] ≤ v[j + 1] are
         cmpl (r3)[r4],(r3)[r2] ;v[r4] : v[r2] (v[j]:v[j + 1])
         bleq exit2       ;go to exit2 if v[j] ≤ v[j + 1]
The bottom of the loop jumps back to the full loop test:
         brb for2tst      ;jump to test of inner loop
Combining the pieces, the second for loop looks like this:
         subl3 r4,r6,#1   ;j = i - 1
for2tst: blss exit2       ;go to exit2 if r4 < 0 (j < 0)
         movl r3,(r5)     ;r3 = Memory[r5] (r3 = v)
         addl3 r2,r4,#1   ;r2 = r4 + 1 (r2 = j + 1)
         cmpl (r3)[r4],(r3)[r2] ;v[r4] : v[r2]
         bleq exit2       ;go to exit2 if v[j] ≤ v[j + 1]
         ...
         (body of second for loop)
         ...
         decl r4          ;j = j - 1
         brb for2tst      ;jump to test of inner loop
exit2:
Notice that the instruction blss (at the top of the loop) is testing the condition codes based on the new value of r4 (j), set either by the subl3 before entering the loop or by the decl at the bottom of the loop.
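The control flow of the two loops can also be written in C with explicit labels and branches, which makes the correspondence with the VAX code easier to trace. The sketch below is an illustration, not compiler output; the names sort_lowered and swap_pair are hypothetical, and swap_pair is assumed to exchange v[k] and v[k + 1]. Because C has no condition codes, the test that blss performs implicitly appears here as an explicit comparison.

```c
/* Hypothetical stand-in for swap: exchange v[k] and v[k + 1]. */
static void swap_pair(int v[], int k)
{
    int temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}

/* sort with each for loop lowered to labels and gotos; the comments
   name the VAX instructions that play the same role in the text. */
void sort_lowered(int v[], int n)
{
    int i, j;

    i = 0;                              /* clrl  r6             */
for1tst:
    if (i >= n) goto exit1;             /* cmpl, bgeq exit1     */

    j = i - 1;                          /* subl3 r4,r6,#1       */
for2tst:
    if (j < 0) goto exit2;              /* blss  exit2          */
    if (v[j] <= v[j + 1]) goto exit2;   /* cmpl, bleq exit2     */
    swap_pair(v, j);                    /* body of second loop  */
    j = j - 1;                          /* decl  r4             */
    goto for2tst;                       /* brb   for2tst        */
exit2:
    i = i + 1;                          /* incl  r6             */
    goto for1tst;                       /* brb   for1tst        */
exit1:
    return;                             /* ret                  */
}
```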
The Procedure Call
The next step is the body of the second for loop:
swap(v,j);
Calling swap is easy enough:
calls #2,swap
The constant 2 indicates the number of parameters pushed on the stack.
Passing Parameters
The C compiler passes variables on the stack, so we pass the parameters to swap with these two instructions:
pushl (r5) ;first swap parameter is v
pushl r4 ;second swap parameter is j
Register indirect addressing is used to get the operand of the first instruction.
Preserving Registers across Procedure Invocation of sort
The only remaining code is the saving and restoring of registers using the callee save convention. This procedure uses registers r2 through r7, so we add a mask with those bits set:
.word ^m ;set mask for registers 2-7
Since ret will undo all the operations, we just tack it on the end of the procedure.
The Full Procedure sort
Now we put all the pieces together in Figure E.8. To make the code easier to follow, once again we identify each block of code with its purpose in the procedure and list the MIPS and VAX code side by side. In this example, 11 lines of the sort procedure in C become 44 lines in MIPS assembly language and 20 lines in VAX assembly language. The biggest VAX advantages are in register saving and restoring and indexed addressing.
         MIPS                                    VAX
Saving registers
sort:    addi $29,$29,-36               sort:    .word ^m
         sw   $15, 0($29)
         sw   $16, 4($29)
         sw   $17, 8($29)
         sw   $18,12($29)
         sw   $19,16($29)
         sw   $20,20($29)
         sw   $24,24($29)
         sw   $25,28($29)
         sw   $31,32($29)
Procedure body: move parameters
         move $18, $4                            moval r7,8(ap)
         move $20, $5                            moval r5,4(ap)
Procedure body: outer loop
         add  $19, $0, $0                        clrl  r6
for1tst: slt  $8, $19, $20              for1tst: cmpl  r6,(r7)
         beq  $8, $0, exit1                      bgeq  exit1
Procedure body: inner loop
         addi $17, $19, -1                       subl3 r4,r6,#1
for2tst: slti $8, $17, 0                for2tst: blss  exit2
         bne  $8, $0, exit2                      movl  r3,(r5)
         muli $15, $17, 4                        addl3 r2,r4,#1
         add  $16, $18, $15                      cmpl  (r3)[r4],(r3)[r2]
         lw   $24, 0($16)                        bleq  exit2
         lw   $25, 4($16)
         slt  $8, $25, $24
         beq  $8, $0, exit2
Procedure body: pass parameters and call
         move $4, $18                            pushl (r5)
         move $5, $17                            pushl r4
         jal  swap                               calls #2,swap
Procedure body: inner loop
         addi $17, $17, -1                       decl  r4
         j    for2tst                            brb   for2tst
Procedure body: outer loop
exit2:   addi $19, $19, 1               exit2:   incl  r6
         j    for1tst                            brb   for1tst
Restoring registers
exit1:   lw   $15, 0($29)
         lw   $16, 4($29)
         lw   $17, 8($29)
         lw   $18,12($29)
         lw   $19,16($29)
         lw   $20,20($29)
         lw   $24,24($29)
         lw   $25,28($29)
         lw   $31,32($29)
         addi $29,$29, 36
Procedure return
         jr   $31                       exit1:   ret
Figure E.8 MIPS32 versus VAX assembly version of procedure sort in Figure E.7 on page E-13.
E.7 Fallacies and Pitfalls
The ability to simplify means to eliminate the unnecessary so that the necessary may speak.
Hans Hoffman, Search for the Real, 1967
Fallacy: It is possible to design a flawless architecture.
All architecture design involves trade-offs made in the context of a set of hardware and software technologies. Over time those technologies are likely to change, and decisions that may have been correct at one time later look like mistakes. For example, in 1975 the VAX designers overemphasized the importance of code size efficiency and underestimated how important ease of decoding and pipelining would be ten years later. And almost all architectures eventually succumb to the lack of sufficient address space. Avoiding these problems in the long run, however, would probably mean compromising the efficiency of the architecture in the short run.
Fallacy: An architecture with flaws cannot be successful.
The IBM 360 is often criticized in the literature: the branches are not PC-relative, and the address is too small in displacement addressing. Yet the machine has been an enormous success because it correctly handled several new problems. First, the architecture has a large amount of address space. Second, it is byte addressed and handles bytes well. Third, it is a general-purpose register machine. Finally, it is simple enough to be efficiently implemented across a wide performance and cost range. The Intel 8086 provides an even more dramatic example. The 8086 architecture is the only widespread architecture in existence today that is not truly a general-purpose register machine. Furthermore, the segmented address space of the 8086 causes major problems for both programmers and compiler writers. Finally, it is hard to implement. It has generally provided only half the performance of the RISC architectures for the last eight years, despite significant investment by Intel. Nevertheless, the 8086 architecture, because of its selection as the microprocessor in the IBM PC, has been enormously successful.
Fallacy: The architecture that executes fewer instructions is faster.
Designers of VAX machines performed a quantitative comparison of VAX and MIPS for implementations with comparable organizations, the VAX 8700 and the MIPS M2000. Figure E.9 shows the ratio of the number of instructions executed and the ratio of performance measured in clock cycles. MIPS executes about twice as many instructions as the VAX while the MIPS M2000 has almost three times the performance of the VAX 8700.
[Figure E.9: bar chart of the MIPS/VAX ratio (vertical axis, 0 to 4) for two series, Performance and Instructions executed, across the SPEC89 programs spice, matrix, nasa7, fpppp, tomcatv, doduc, espresso, eqntott, and li.]
Figure E.9 Ratio of MIPS M2000 to VAX 8700 in instructions executed and performance in clock cycles using SPEC89 programs. On average, MIPS executes a little over twice as many instructions as the VAX, but the CPI for the VAX is almost six times the MIPS CPI, yielding almost a threefold performance advantage. (Based on data from “Performance from Architecture: Comparing a RISC and CISC with Similar Hardware Organization,” by D. Bhandarkar and D. Clark in Proc. Symp. Architectural Support for Programming Languages and Operating Systems IV, 1991.)
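The arithmetic behind this fallacy can be checked with a short calculation. Execution time in clock cycles is the instruction count times the CPI, so the MIPS advantage is the CPI ratio divided by the instruction-count ratio. The sketch below uses the rounded figures from the caption (about 2 and about 6) as illustrative inputs; the function name mips_speedup is hypothetical.

```c
#include <assert.h>

/* cycles = instruction count x CPI, so
     cycles_vax / cycles_mips
       = (IC_vax x CPI_vax) / (IC_mips x CPI_mips)
       = cpi_ratio / instruction_ratio
   where instruction_ratio = IC_mips / IC_vax
   and   cpi_ratio         = CPI_vax / CPI_mips.  */
double mips_speedup(double instruction_ratio, double cpi_ratio)
{
    assert(instruction_ratio > 0.0);
    return cpi_ratio / instruction_ratio;
}
```

With an instruction ratio of 2 and a CPI ratio of 6, the function returns 3, in line with the "almost threefold" advantage reported in the caption.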
E.8 Concluding Remarks
The Virtual Address eXtension of the PDP-11 architecture . . . provides a virtual address of about 4.3 gigabytes which, even given the rapid improvement of memory technology, should be adequate far into the future.
William Strecker, “VAX-11/780—A Virtual Address Extension to the PDP-11 Family,” AFIPS Proc., National Computer Conference, 1978
We have seen that instruction sets can vary quite dramatically, both in how they access operands and in the operations that can be performed by a single instruction. Figure E.10 compares instruction usage for both architectures for two programs; even very different architectures behave similarly in their use of instruction classes. A product of its time, the VAX emphasis on code density and complex operations and addressing modes conflicts with the current emphasis on easy decoding, simple operations and addressing modes, and pipelined performance. With more than 600,000 sold, the VAX architecture has had a very successful run. In 1991 DEC made the transition from VAX to Alpha, a 64-bit address architecture very similar to MIPS.
Program  Machine  Branch  Arithmetic/logical  Data transfer  Floating point  Totals
gcc      VAX      30%     40%                 19%                            89%
gcc      MIPS     24%     35%                 27%                            86%
spice    VAX      18%     23%                 15%            23%             79%
spice    MIPS     4%      29%                 35%            15%             83%
Figure E.10 The frequency of instruction distribution for two programs on VAX and MIPS.
Orthogonality is key to the VAX architecture; the opcode is independent of the addressing modes, which are independent of the data types and even the number of unique operands. Thus a few hundred operations expand to hundreds of thousands of instructions when accounting for the data types, operand counts, and addressing modes.
E.9 Historical Perspective and Further Reading
VAX: the most successful minicomputer design in industry history . . . the VAX was probably the hacker’s favorite machine. . . . Especially noted for its large, assembler-programmer-friendly instruction set, an asset that became a liability after the RISC revolution.
Eric Raymond, The New Hacker’s Dictionary, 1991
In the mid-1970s, DEC realized that the PDP-11 was running out of address space. The 16-bit space had been extended in several creative ways, but the small address space was a problem that could only be postponed, not overcome. In 1977, DEC introduced the VAX. Strecker described the architecture and called the VAX “a Virtual Address eXtension of the PDP-11.” One of DEC’s primary goals was to keep the installed base of PDP-11 customers. Thus, the customers were to think of the VAX as a 32-bit successor to the PDP-11. A 32-bit PDP-11 was possible (there were three designs), but Strecker reports that they were “overly compromised in terms of efficiency, functionality, programming ease.” The chosen solution was to design a new architecture and include a PDP-11 compatibility mode that would run PDP-11 programs without change. This mode also allowed PDP-11 compilers to run and to continue to be used. The VAX-11/780 resembled the PDP-11 in many ways. These are among the most important:
1. Data types and formats are mostly equivalent to those on the PDP-11. The F and D floating formats came from the PDP-11. G and H formats were added later. The use of the term “word” to describe a 16-bit quantity was carried from the PDP-11 to the VAX.
2. The assembly language was made similar to the PDP-11’s.
3. The same buses were supported (Unibus and Massbus).
4. The operating system, VMS, was “an evolution” of the RSX-11M/IAS OS (as opposed to the DECsystem 10/20 OS, which was a more advanced system), and the file system was basically the same.
The VAX-11/780 was the first machine announced in the VAX series. It is one of the most successful and heavily studied machines ever built. It relied heavily on microprogramming, taking advantage of the increasing capacity of fast semiconductor memory to implement the complex instructions and addressing modes. The VAX is so tied to microcode that we predict it will be impossible to build the full VAX instruction set without microcode. To offer a single-chip VAX in 1984, DEC reduced the instructions interpreted by microcode by trapping some instructions and performing them in software. DEC engineers found that 20% of VAX instructions are responsible for 60% of the microcode, yet are only executed 0.2% of the time. The final result was a chip offering 90% of the performance with a reduction in silicon area by more than a factor of 5.
The cornerstone of DEC’s strategy was a single architecture, VAX, running a single operating system, VMS. This strategy worked well for over ten years. Today, DEC has transitioned to the Alpha RISC architecture. Like the transition from the PDP-11 to the VAX, Alpha offers the same operating system, file system, and data types and formats of the VAX. Instead of providing a VAX compatibility mode, the Alpha approach is to “compile” the VAX machine code into the Alpha machine code.
To Probe Further Levy, H., and R. Eckhouse [1989]. Computer Programming and Architecture: The VAX, Digital Press, Boston. This book concentrates on the VAX, but includes descriptions of other machines.
Exercises
E.1 [3] <E.4> The following VAX instruction decrements the location pointed to by register r5:
decl (r5)
What is the single MIPS instruction, or if it cannot be represented in a single instruction, the shortest sequence of MIPS instructions, that performs the same operation? What are the lengths of the instructions on each machine?
E.2 [5] <E.4> This exercise is the same as Exercise E.1, except this VAX instruction clears a location using autoincrement deferred addressing:
clrl @(r5)+
E.3 [5] <E.5> This exercise is the same as Exercise E.1, except this VAX instruction adds 1 to register r5, placing the sum back in register r5, compares the sum to register r6, and then branches to L1 if r5 < r6:
aoblss r6, r5,L1 ;r5 = r5 + 1; if (r5 < r6) goto L1
E.4 [5] <E.2> Show the single VAX instruction, or minimal sequence of instructions, for this C statement:
a = b + 100;
Assume a corresponds to register r3 and b corresponds to register r4.
E.5 [10] <E.2> Show the single VAX instruction, or minimal sequence of instructions, for this C statement:
x[i + 1] = x[i] + c;
Assume c corresponds to register r3, i to register r4, and x is an array of 32-bit words beginning at memory location 4,000,000 (decimal).
F The IBM 360/370 Architecture for Mainframe Computers
We are not at all humble in this announcement. This is the most important product announcement that this corporation has ever made in its history. It’s not a computer in any previous sense. It’s not a product, but a line of products . . . that spans in performance from the very low part of the computer line to the very high. IBM spokesman at announcement of System/360 (1964)
© 2003 Elsevier Science (USA). All rights reserved.
F.1 Introduction
F.2 System/360 Instruction Set
F.3 360 Detailed Measurements
F.4 Historical Perspective and References
F.1 Introduction
The term “computer architecture” was coined by IBM in 1964 for use with the IBM 360. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the instruction set. They believed that a family of machines of the same architecture should be able to run the same software. Although this idea may seem obvious to us today, it was quite novel at the time. IBM, even though it was the leading company in the industry, had five different architectures before the 360. Thus, the notion of a company standardizing on a single architecture was a radical one. The 360 designers hoped that six different divisions of IBM could be brought together by defining a common architecture. Their definition of architecture was
. . . the structure of a computer that a machine language programmer must understand to write a correct (timing independent) program for that machine.
The term “machine language programmer” meant that compatibility would hold, even in assembly language, while “timing independent” allowed different implementations. The IBM 360 was introduced in 1964 with six models and a 25:1 performance ratio. Amdahl, Blaauw, and Brooks [1964] discuss the architecture of the IBM 360 and the concept of permitting multiple object-code–compatible implementations. The notion of an instruction set architecture as we understand it today was the most important aspect of the 360. The architecture also introduced several important innovations, now in wide use:
1. 32-bit architecture
2. Byte-addressable memory with 8-bit bytes
3. 8-, 16-, 32-, and 64-bit data sizes
In 1971, IBM shipped the first System/370 (models 155 and 165), which included a number of significant extensions of the 360, as discussed by Case and Padegs [1978], who also discuss the early history of System/360. The most important addition was virtual memory, though virtual memory 370s did not ship until 1972 when a virtual memory operating system was ready. By 1978, the high-end 370 was several hundred times faster than the low-end 360s shipped ten years earlier. In 1984, the 24-bit addressing model built into the IBM 360 needed to be abandoned, and the 370-XA (eXtended Architecture) was introduced. While old 24-bit programs could be supported without change, several instructions could not function in the same manner when extended to a 32-bit addressing model (31-bit addresses supported) because they would not produce 31-bit addresses. Converting the operating system, which was written mostly in assembly language, was no doubt the biggest task.
Several studies of the IBM 360 and instruction measurement have been made. Shustek’s thesis [1978] is the best known and most complete study of the 360/370 architecture. He made several observations about instruction set complexity that were not fully appreciated until some years later. Another important study of the 360 is the Toronto study by Alexander and Wortman [1975] done on an IBM 360 using 19 XPL programs.
F.2 System/360 Instruction Set
The 360 instruction set is shown in the following tables, organized by instruction type and format. System/370 contains 15 additional user instructions.
Integer/Logical and Floating-Point R-R Instructions
The * indicates the instruction is floating point, and may be either D (double precision) or E (single precision).
Instruction  Description
ALR          Add logical register
AR           Add register
A*R          FP addition
CLR          Compare logical register
CR           Compare register
C*R          FP compare
DR           Divide register
D*R          FP divide
H*R          FP halve
LCR          Load complement register
LC*R         Load complement
LNR          Load negative register
LN*R         Load negative
LPR          Load positive register
LP*R         Load positive
LR           Load register
L*R          Load FP register
LTR          Load and test register
LT*R         Load and test FP register
MR           Multiply register
M*R          FP multiply
NR           And register
OR           Or register
SLR          Subtract logical register
SR           Subtract register
S*R          FP subtraction
XR           Exclusive or register
Branches and Status Setting R-R Instructions
These are R-R format instructions that either branch or set some system status; several of them are privileged and legal only in supervisor mode.
Instruction  Description
BALR         Branch and link
BCTR         Branch on count
BCR          Branch/condition
ISK          Insert key
SPM          Set program mask
SSK          Set storage key
SVC          Supervisor call
Branches/Logical and Floating-Point Instructions: RX Format
These are all RX format instructions. The symbol “+” means either a word operation (and then stands for nothing) or H (meaning half word); for example, A+ stands for the two opcodes A and AH. The symbol “*” is D or E standing for double- or single-precision floating point.
Instruction  Description
A+           Add
A*           FP add
AL           Add logical
C+           Compare
C*           FP compare
CL           Compare logical
D            Divide
D*           FP divide
L+           Load
L*           Load FP register
M+           Multiply
M*           FP multiply
N            And
O            Or
S+           Subtract
S*           FP subtract
SL           Subtract logical
ST+          Store
ST*          Store FP register
X            Exclusive or
Branches and Special Loads and Stores: RX Format
Instruction  Description
BAL          Branch and link
BC           Branch condition
BCT          Branch on count
CVB          Convert-binary
CVD          Convert-decimal
EX           Execute
IC           Insert character
LA           Load address
STC          Store character
RS and SI Format Instructions
These are the RS and SI format instructions. The symbol “*” may be A (arithmetic) or L (logical).
Instruction  Description
BXH          Branch/high
BXLE         Branch/low-equal
CLI          Compare logical immediate
HIO          Halt I/O
LPSW         Load PSW
LM           Load multiple
MVI          Move immediate
NI           And immediate
OI           Or immediate
RDD          Read direct
SIO          Start I/O
SL*          Shift left A/L
SLD*         Shift left double A/L
SR*          Shift right A/L
SRD*         Shift right double A/L
SSM          Set system mask
STM          Store multiple
TCH          Test channel
TIO          Test I/O
TM           Test under mask
TS           Test-and-set
WRD          Write direct
XI           Exclusive or immediate
SS Format Instructions
These are add decimal or string instructions.
Instruction  Description
AP           Add packed
CLC          Compare logical chars
CP           Compare packed
DP           Divide packed
ED           Edit
EDMK         Edit and mark
MP           Multiply packed
MVC          Move character
MVN          Move numeric
MVO          Move with offset
MVZ          Move zone
NC           And characters
OC           Or characters
PACK         Pack (Character → decimal)
SP           Subtract packed
TR           Translate
TRT          Translate and test
UNPK         Unpack
XC           Exclusive or characters
ZAP          Zero and add packed
F.3 360 Detailed Measurements
Figure F.1 shows the frequency of instruction usage for four IBM 360 programs.
Instruction
PLIC
FORTGO
Control BC, BCR BAL, BALR Arithmetic/logical A, AR SR SLL LA CLI NI C TM MH Data transfer L, LR MVI ST LD STD LPDR LH IC LTR Floating point AD MDR Decimal, string MVC AP ZAP CVD MP CLC CP ED Total
32% 28% 3% 29% 3% 3%
13% 13%
5% 5%
35% 17% 7% 6% 1%
29% 21%
5% 3%
4% 1%
4%
17% 7% 2% 3%
40% 23%
8% 7%
7% 7% 3%
PLIGO
Average
16% 14% 2% 9%
16% 15% 1% 26% 10% 3% 2% 2% 2% 2% 3% 2% 1% 33% 19% 5% 3% 2% 2% 1% 1% 1% 0% 2% 1% 1% 11% 3% 3% 2% 1% 1% 1% 1% 0% 88%
7% 0% 3% 20% 19% 1%
3% 2% 1% 7% 3% 3% 4% 4%
82%
95%
90%
COBOLGO
3% 1%
2% 56% 28% 16% 7% 2% 2%
40% 7% 11% 9% 5% 3% 3% 2% 1% 85%
Figure F.1 Distribution of instruction execution frequencies for the four 360 programs. All instructions with a frequency of execution greater than 1.5% are included. Immediate instructions, which operate on only a single byte, are included in the section that characterizes their operation, rather than with the long character-string versions of the same operation. By comparison, the average frequencies for the major instruction classes of the VAX are 23% (control), 28% (arithmetic), 29% (data transfer), 7% (floating point), and 9% (decimal). Once again, a 1% entry in the average column can occur because of entries in the constituent columns. These programs are a compiler for the programming language PL/I and run time systems for the programming languages FORTRAN, PL/I, and Cobol.
F.4 Historical Perspective and References
The IBM 360 was the first computer to sell in large quantities with both byte addressing using 8-bit bytes and general-purpose registers. The 360 also had register-memory and limited memory-memory instructions. This architecture blazed the path for binary compatibility, which others have followed. The architects of the IBM 360 were aware of the importance of address size and planned for the architecture to extend to 32 bits of address. Only 24 bits were used in the IBM 360, however, because the low-end 360 models would have been even slower with the larger addresses in 1964. Unfortunately, the architects didn’t reveal their plans to the software people, and programmers who stored extra information in the upper 8 “unused” address bits foiled the expansion effort. Virtually every computer since then will check to make sure the unused bits stay unused, and will trap if the bits have the wrong value. IBM officially extended the address to 32 bits in 1970 with the IBM S/370 architecture. Only recently did IBM expand this architecture to a flat, 64-bit address, with the IBM S/390.
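The problem with reusing the upper 8 address bits can be made concrete with a small C sketch. Everything here is illustrative: the function names and the tag value are hypothetical, and the 31-bit mask reflects the 370-XA addressing mentioned in Section F.1 rather than any specific program.

```c
#include <stdint.h>

#define ADDR24_MASK 0x00FFFFFFu   /* 24 address bits, as on the 360   */
#define ADDR31_MASK 0x7FFFFFFFu   /* 31 address bits, as on the 370-XA */

/* Store an 8-bit tag in the upper byte of a 32-bit address word,
   the kind of trick the text says programmers relied on. */
uint32_t pack_tagged(uint8_t tag, uint32_t addr24)
{
    return ((uint32_t)tag << 24) | (addr24 & ADDR24_MASK);
}

/* Address seen by hardware that ignores the upper 8 bits. */
uint32_t effective_addr24(uint32_t word)
{
    return word & ADDR24_MASK;
}

/* Address seen once 31 of the bits are treated as address bits. */
uint32_t effective_addr31(uint32_t word)
{
    return word & ADDR31_MASK;
}
```

A word packed with tag 0x12 and address 0xABCDEF still resolves to 0xABCDEF under the 24-bit mask, but under the 31-bit mask the tag bits become part of the address and the word resolves to 0x12ABCDEF, which is why such programs could not simply be run with wider addresses.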
References
Alexander, W. G., and D. B. Wortman [1975]. “Static and dynamic characteristics of XPL programs,” IEEE Computer 8:11 (November), 41–46.
Amdahl, G. M., G. A. Blaauw, and F. P. Brooks, Jr. [1964]. “Architecture of the IBM System/360,” IBM J. Research and Development 8:2 (April), 87–101.
Case, R. P., and A. Padegs [1978]. “The architecture of the IBM System/370,” Communications of the ACM 21:1, 73–96.
Shustek, L. J. [1978]. Analysis and Performance of Computer Instruction Sets. Ph.D. dissertation, Stanford University (January).
G Vector Processors Revised by Krste Asanovic Department of Electrical Engineering and Computer Science, MIT
I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. . . . One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made. Seymour Cray Public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976)
G.1 Why Vector Processors?
G.2 Basic Vector Architecture
G.3 Two Real-World Issues: Vector Length and Stride
G.4 Enhancing Vector Performance
G.5 Effectiveness of Compiler Vectorization
G.6 Putting It All Together: Performance of Vector Processors
G.7 Fallacies and Pitfalls
G.8 Concluding Remarks
G.9 Historical Perspective and References
Exercises
G.1 Why Vector Processors?
In Chapters 3 and 4 we saw how we could significantly increase the performance of a processor by issuing multiple instructions per clock cycle and by more deeply pipelining the execution units to allow greater exploitation of instruction-level parallelism. (This appendix assumes that you have read Chapters 3 and 4 completely; in addition, the discussion on vector memory systems assumes that you have read Chapter 5.) Unfortunately, we also saw that there are serious difficulties in exploiting ever larger degrees of ILP. As we increase both the width of instruction issue and the depth of the machine pipelines, we also increase the number of independent instructions required to keep the processor busy with useful work. This means an increase in the number of partially executed instructions that can be in flight at one time. For a dynamically-scheduled machine, hardware structures, such as instruction windows, reorder buffers, and rename register files, must grow to have sufficient capacity to hold all in-flight instructions, and worse, the number of ports on each element of these structures must grow with the issue width. The logic to track dependencies between all in-flight instructions grows quadratically in the number of instructions. Even a statically scheduled VLIW machine, which shifts more of the scheduling burden to the compiler, requires more registers, more ports per register, and more hazard interlock logic (assuming a design where hardware manages interlocks after issue time) to support more in-flight instructions, which similarly cause quadratic increases in circuit size and complexity. This rapid increase in circuit complexity makes it difficult to build machines that can control large numbers of in-flight instructions, and hence limits practical issue widths and pipeline depths.
Vector processors were successfully commercialized long before instruction-level parallel machines and take an alternative approach to controlling multiple functional units with deep pipelines. Vector processors provide high-level operations that work on vectors, which are linear arrays of numbers. A typical vector operation might add two 64-element, floating-point vectors to obtain a single 64-element vector result. The vector instruction is equivalent to an entire loop, with each iteration computing one of the 64 elements of the result, updating the indices, and branching back to the beginning. Vector instructions have several important properties that solve most of the problems mentioned above:
■ A single vector instruction specifies a great deal of work: it is equivalent to executing an entire loop. Each instruction represents tens or hundreds of operations, and so the instruction fetch and decode bandwidth needed to keep multiple deeply pipelined functional units busy is dramatically reduced.
■ By using a vector instruction, the compiler or programmer indicates that the computation of each result in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction. The elements in the vector can be computed using an array of parallel functional units, or a single very deeply pipelined functional unit, or any intermediate configuration of parallel and pipelined functional units.
■ Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors. That means the dependency checking logic required between two vector instructions is approximately the same as that required between two scalar instructions, but now many more elemental operations can be in flight for the same complexity of control logic.
■ Vector instructions that access memory have a known access pattern. If the vector’s elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well (as we saw in Section 5.8). The high latency of initiating a main memory access versus accessing a cache is amortized, because a single access is initiated for the entire vector rather than to a single word. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector.
■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the application domain can use them frequently. As mentioned above, vector processors pipeline and parallelize the operations on the individual elements of a vector. The operations include not only the arithmetic operations (multiplication, addition, and so on), but also memory accesses and effective address calculations. In addition, most high-end vector processors allow multiple vector instructions to be in progress at the same time, creating further parallelism among the operations on different vectors. Vector processors are particularly useful for large scientific and engineering applications, including car crash simulations and weather forecasting, for which a typical job might take dozens of hours of supercomputer time running over multigigabyte data sets. Multimedia applications can also benefit from vector processing, as they contain abundant data parallelism and process large data streams. A high-speed pipelined processor will usually use a cache to avoid forcing memory reference instructions to have very long latency. Unfortunately, big, long-running, scientific programs often have very large active data sets that are sometimes accessed with low locality, yielding poor performance from the memory hierarchy. This problem could be overcome by not caching these structures if it were possible to determine the memory access patterns and pipeline the memory accesses efficiently. Novel cache architectures and compiler assistance through blocking and prefetching are decreasing these memory hierarchy problems, but they continue to be serious in some applications.
G-4
■
Appendix G Vector Processors
G.2 Basic Vector Architecture

A vector processor typically consists of an ordinary pipelined scalar unit plus a vector unit. All functional units within the vector unit have a latency of several clock cycles. This allows a shorter clock cycle time and is compatible with long-running vector operations that can be deeply pipelined without generating hazards. Most vector processors allow the vectors to be dealt with as floating-point numbers, as integers, or as logical data. Here we will focus on floating point. The scalar unit is basically no different from the type of advanced pipelined CPU discussed in Chapters 3 and 4, and commercial vector machines have included both out-of-order scalar units (NEC SX/5) and VLIW scalar units (Fujitsu VPP5000).

There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors. In a vector-register processor, all vector operations, except load and store, are among the vector registers. These architectures are the vector counterpart of a load-store architecture. All major vector computers shipped since the late 1980s use a vector-register architecture, including the Cray Research processors (Cray-1, Cray-2, X-MP, Y-MP, C90, T90, and SV1), the Japanese supercomputers (NEC SX/2 through SX/5, Fujitsu VP200 through VPP5000, and the Hitachi S820 and S-8300), and the minisupercomputers (Convex C-1 through C-4).

In a memory-memory vector processor, all vector operations are memory to memory. The first vector computers were of this type, as were CDC’s vector computers. From this point on we will focus on vector-register architectures only; we will briefly return to memory-memory vector architectures at the end of the appendix (Section G.9) to discuss why they have not been as successful as vector-register architectures. We begin with a vector-register processor consisting of the primary components shown in Figure G.1.
This processor, which is loosely based on the Cray-1, is the foundation for discussion throughout most of this appendix. We will call it VMIPS; its scalar portion is MIPS, and its vector portion is the logical vector extension of MIPS. The rest of this section examines how the basic architecture of VMIPS relates to other processors. The primary components of the instruction set architecture of VMIPS are the following:

■ Vector registers: Each vector register is a fixed-length bank holding a single vector. VMIPS has eight vector registers, and each vector register holds 64 elements. Each vector register must have at least two read ports and one write port in VMIPS. This will allow a high degree of overlap among vector operations to different vector registers. (We do not consider the problem of a shortage of vector-register ports. In real machines this would result in a structural hazard.) The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs or outputs by a pair of crossbars. (The description of the vector-register file design has been simplified here. Real machines make use of the regular access pattern within a vector instruction to reduce the costs of the vector-register file circuitry [Asanovic 1998]. For example, the Cray-1 manages to implement the register file with only a single port per register.)

■ Vector functional units: Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards, both from conflicts for the functional units (structural hazards) and from conflicts for register accesses (data hazards). VMIPS has five functional units, as shown in Figure G.1. For simplicity, we will focus exclusively on the floating-point functional units. Depending on the vector processor, scalar operations either use the vector functional units or use a dedicated set. We assume the functional units are shared, but again, for simplicity, we ignore potential conflicts.

■ Vector load-store unit: This is a vector memory unit that loads or stores a vector to or from memory. The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency. This unit would also normally handle scalar loads and stores.

■ A set of scalar registers: Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit. These are the normal 32 general-purpose registers and 32 floating-point registers of MIPS. Scalar values are read out of the scalar register file, then latched at one input of the vector functional units.

[Figure G.1 block diagram: main memory, vector load-store unit, vector registers, scalar registers, and the vector functional units (FP add/subtract, FP multiply, FP divide, integer, logical).]

Figure G.1 The basic structure of a vector-register architecture, VMIPS. This processor has a scalar architecture just like MIPS. There are also eight 64-element vector registers, and all the functional units are vector functional units. Special vector instructions are defined both for arithmetic and for memory accesses. We show vector units for logical and integer operations. These are included so that VMIPS looks like a standard vector processor, which usually includes these units. However, we will not be discussing these units except in the exercises. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations. These ports are connected to the inputs and outputs of the vector functional units by a set of crossbars (shown in thick gray lines). In Section G.4 we add chaining, which will require additional interconnect capability.
Figure G.2 shows the characteristics of some typical vector processors, including the size and count of the registers, the number and types of functional units, and the number of load-store units. The last column in Figure G.2 shows the number of lanes in the machine, which is the number of parallel pipelines used to execute operations within each vector instruction. Lanes are described later in Section G.4; here we assume VMIPS has only a single pipeline per vector functional unit (one lane).

In VMIPS, vector operations use the same names as MIPS operations, but with the letter “V” appended. Thus, ADDV.D is an add of two double-precision vectors. The vector instructions take as their input either a pair of vector registers (ADDV.D) or a vector register and a scalar register, designated by appending “VS” (ADDVS.D). In the latter case, the value in the scalar register is used as the input for all operations: the operation ADDVS.D will add the contents of a scalar register to each element in a vector register. The scalar value is copied over to the vector functional unit at issue time. Most vector operations have a vector destination register, although a few (such as population count) produce a scalar value, which is stored to a scalar register. The names LV and SV denote vector load and vector store, and they load or store an entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other operand, which is a MIPS general-purpose register, is the starting address of the vector in memory.

Figure G.3 lists the VMIPS vector instructions. In addition to the vector registers, we need two additional special-purpose registers: the vector-length and vector-mask registers. We will discuss these registers and their purpose in Sections G.3 and G.4, respectively.
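As a minimal sketch of the vector-vector and vector-scalar semantics just described, the following Python models ADDV.D and ADDVS.D on 64-element registers. The function names `addv_d` and `addvs_d` and the list-based register model are illustrative assumptions, not part of VMIPS; real hardware pipelines the element operations instead of looping over them.

```python
# Illustrative model of two VMIPS instructions (an assumption-laden sketch,
# not a description of real hardware).
MVL = 64  # elements per VMIPS vector register


def addv_d(v2, v3):
    """ADDV.D V1,V2,V3: add corresponding elements of two vector registers."""
    assert len(v2) == len(v3) == MVL
    return [a + b for a, b in zip(v2, v3)]


def addvs_d(v2, f0):
    """ADDVS.D V1,V2,F0: add the scalar F0 to every element of V2."""
    assert len(v2) == MVL
    return [a + f0 for a in v2]


v2 = [float(i) for i in range(MVL)]
v3 = [1.0] * MVL
v1 = addv_d(v2, v3)    # v1[i] = v2[i] + v3[i]
v4 = addvs_d(v2, 0.5)  # v4[i] = v2[i] + 0.5
```

Each call stands in for a single vector instruction; the per-element independence is what lets hardware skip hazard checks inside one instruction.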
| Processor (year) | Clock rate (MHz) | Vector registers | Elements per register (64-bit elements) | Vector arithmetic units | Vector load-store units | Lanes |
|---|---|---|---|---|---|---|
| Cray-1 (1976) | 80 | 8 | 64 | 6: FP add, FP multiply, FP reciprocal, integer add, logical, shift | 1 | 1 |
| Cray X-MP (1983) / Cray Y-MP (1988) | 118 / 166 | 8 | 64 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads, 1 store | 1 |
| Cray-2 (1985) | 244 | 8 | 64 | 5: FP add, FP multiply, FP reciprocal/sqrt, integer add/shift/population count, logical | 1 | 1 |
| Fujitsu VP100/VP200 (1982) | 133 | 8–256 | 32–1024 | 3: FP or integer add/logical, multiply, divide | 2 | 1 (VP100), 2 (VP200) |
| Hitachi S810/S820 (1983) | 71 | 32 | 256 | 4: FP multiply-add, FP multiply/divide-add unit, 2 integer add/logical | 3 loads, 1 store | 1 (S810), 2 (S820) |
| Convex C-1 (1985) | 10 | 8 | 128 | 2: FP or integer multiply/divide, add/logical | 1 | 1 (64 bit), 2 (32 bit) |
| NEC SX/2 (1985) | 167 | 8 + 32 | 256 | 4: FP multiply/divide, FP add, integer add/logical, shift | 1 | 4 |
| Cray C90 (1991) / Cray T90 (1995) | 240 / 460 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads, 1 store | 2 |
| NEC SX/5 (1998) | 312 | 8 + 64 | 512 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 16 |
| Fujitsu VPP5000 (1999) | 300 | 8–256 | 128–4096 | 3: FP or integer multiply, add/logical, divide | 1 load, 1 store | 16 |
| Cray SV1 (1998) / SV1ex (2001) | 300 / 500 | 8 | 64 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 1 load-store, 1 load | 2, 8 (MSP) |
| VMIPS (2001) | 500 | 8 | 64 | 5: FP multiply, FP divide, FP add, integer add/shift, logical | 1 load-store | 1 |

Figure G.2 Characteristics of several vector-register architectures. If the machine is a multiprocessor, the entries correspond to the characteristics of one processor. Several of the machines have different clock rates in the vector and scalar units; the clock rates shown are for the vector units. The Fujitsu machines’ vector registers are configurable: The size and count of the 8K 64-bit entries may be varied inversely to one another (e.g., on the VP200, from eight registers each 1K elements long to 256 registers each 32 elements long). The NEC machines have eight foreground vector registers connected to the arithmetic units plus 32–64 background vector registers connected between the memory system and the foreground vector registers. The reciprocal unit on the Cray processors is used to do division (and square root on the Cray-2). Add pipelines perform add and subtract. The multiply/divide-add unit on the Hitachi S810/820 performs an FP multiply or divide followed by an add or subtract (while the multiply-add unit performs a multiply followed by an add or subtract). Note that most processors use the vector FP multiply and divide units for vector integer multiply and divide, and several of the processors use the same units for FP scalar and FP vector operations. Each vector load-store unit represents the ability to do an independent, overlapped transfer to or from the vector registers. The number of lanes is the number of parallel pipelines in each of the functional units as described in Section G.4. For example, the NEC SX/5 can complete 16 multiplies per cycle in the multiply functional unit. The Convex C-1 can split its single 64-bit lane into two 32-bit lanes to increase performance for applications that require only reduced precision. The Cray SV1 can group four CPUs with two lanes each to act in unison as a single larger CPU with eight lanes, which Cray calls a Multi-Streaming Processor (MSP).

| Instruction | Operands | Function |
|---|---|---|
| ADDV.D | V1,V2,V3 | Add elements of V2 and V3, then put each result in V1. |
| ADDVS.D | V1,V2,F0 | Add F0 to each element of V2, then put each result in V1. |
| SUBV.D | V1,V2,V3 | Subtract elements of V3 from V2, then put each result in V1. |
| SUBVS.D | V1,V2,F0 | Subtract F0 from elements of V2, then put each result in V1. |
| SUBSV.D | V1,F0,V2 | Subtract elements of V2 from F0, then put each result in V1. |
| MULV.D | V1,V2,V3 | Multiply elements of V2 and V3, then put each result in V1. |
| MULVS.D | V1,V2,F0 | Multiply each element of V2 by F0, then put each result in V1. |
| DIVV.D | V1,V2,V3 | Divide elements of V2 by V3, then put each result in V1. |
| DIVVS.D | V1,V2,F0 | Divide elements of V2 by F0, then put each result in V1. |
| DIVSV.D | V1,F0,V2 | Divide F0 by elements of V2, then put each result in V1. |
| LV | V1,R1 | Load vector register V1 from memory starting at address R1. |
| SV | R1,V1 | Store vector register V1 into memory starting at address R1. |
| LVWS | V1,(R1,R2) | Load V1 from address at R1 with stride in R2, i.e., R1+i × R2. |
| SVWS | (R1,R2),V1 | Store V1 to address at R1 with stride in R2, i.e., R1+i × R2. |
| LVI | V1,(R1+V2) | Load V1 with vector whose elements are at R1+V2(i), i.e., V2 is an index. |
| SVI | (R1+V2),V1 | Store V1 to vector whose elements are at R1+V2(i), i.e., V2 is an index. |
| CVI | V1,R1 | Create an index vector by storing the values 0, 1 × R1, 2 × R1,...,63 × R1 into V1. |
| S--V.D, S--VS.D | V1,V2; V1,F0 | Compare the elements (EQ, NE, GT, LT, GE, LE) in V1 and V2. If condition is true, put a 1 in the corresponding bit vector; otherwise put 0. Put resulting bit vector in vector-mask register (VM). The instruction S--VS.D performs the same compare but using a scalar value as one operand. |
| POP | R1,VM | Count the 1s in the vector-mask register and store count in R1. |
| CVM | (none) | Set the vector-mask register to all 1s. |
| MTC1 | VLR,R1 | Move contents of R1 to the vector-length register. |
| MFC1 | R1,VLR | Move the contents of the vector-length register to R1. |
| MVTM | VM,F0 | Move contents of F0 to the vector-mask register. |
| MVFM | F0,VM | Move contents of vector-mask register to F0. |

Figure G.3 The VMIPS vector instructions. Only the double-precision FP operations are shown. In addition to the vector registers, there are two special registers, VLR (discussed in Section G.3) and VM (discussed in Section G.4). These special registers are assumed to live in the MIPS coprocessor 1 space along with the FPU registers. The operations with stride are explained in Section G.3, and the use of the index creation and indexed load-store operations are explained in Section G.4.

How Vector Processors Work: An Example

A vector processor is best understood by looking at a vector loop on VMIPS. Let’s take a typical vector problem, which will be used throughout this appendix:

Y = a × X + Y

X and Y are vectors, initially resident in memory, and a is a scalar. This is the so-called SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark. (SAXPY stands for single-precision a × X plus Y; DAXPY for double-precision a × X plus Y.) Linpack is a collection of linear algebra routines, and the routines for performing Gaussian elimination constitute what is known as the Linpack benchmark. The DAXPY routine, which implements the preceding loop, represents a small fraction of the source code of the Linpack benchmark, but it accounts for most of the execution time for the benchmark.

For now, let us assume that the number of elements, or length, of a vector register (64) matches the length of the vector operation we are interested in. (This restriction will be lifted shortly.)
Example

Show the code for MIPS and VMIPS for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and Ry, respectively.

Answer

Here is the MIPS code.

          L.D      F0,a          ;load scalar a
          DADDIU   R4,Rx,#512    ;last address to load
    Loop: L.D      F2,0(Rx)      ;load X(i)
          MUL.D    F2,F2,F0      ;a × X(i)
          L.D      F4,0(Ry)      ;load Y(i)
          ADD.D    F4,F4,F2      ;a × X(i) + Y(i)
          S.D      0(Ry),F4      ;store into Y(i)
          DADDIU   Rx,Rx,#8      ;increment index to X
          DADDIU   Ry,Ry,#8      ;increment index to Y
          DSUBU    R20,R4,Rx     ;compute bound
          BNEZ     R20,Loop      ;check if done

Here is the VMIPS code for DAXPY.

          L.D      F0,a          ;load scalar a
          LV       V1,Rx         ;load vector X
          MULVS.D  V2,V1,F0      ;vector-scalar multiply
          LV       V3,Ry         ;load vector Y
          ADDV.D   V4,V2,V3      ;add
          SV       Ry,V4         ;store the result
There are some interesting comparisons between the two code segments in this example. The most dramatic is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only six instructions versus almost 600 for MIPS. This reduction occurs both because the vector operations work on 64 elements and because the overhead instructions that constitute nearly half the loop on MIPS are not present in the VMIPS code. Another important difference is the frequency of pipeline interlocks. In the straightforward MIPS code every ADD.D must wait for a MUL.D, and every S.D must wait for the ADD.D. On the vector processor, each vector instruction will only stall for the first element in each vector, and then subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the pipeline stall frequency on MIPS will be about 64 times higher than it is on VMIPS. The pipeline stalls can be eliminated on MIPS by using software pipelining or loop unrolling (as we saw in Chapter 4). However, the large difference in instruction bandwidth cannot be reduced.
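The "almost 600" figure can be checked with a short calculation. The sketch below counts dynamic instructions for the scalar loop (2 setup instructions plus 9 per element, read off the MIPS listing) and compares them with the six VMIPS instructions. The helper names are hypothetical, and `daxpy` is only a reference implementation of the arithmetic, not a model of either machine.

```python
def daxpy(a, x, y):
    """Reference DAXPY: returns a * x[i] + y[i] for every element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def mips_dynamic_instructions(n, setup=2, per_iteration=9):
    """Scalar MIPS loop: 2 setup instructions, then 9 instructions per element."""
    return setup + per_iteration * n


VMIPS_INSTRUCTIONS = 6  # L.D, LV, MULVS.D, LV, ADDV.D, SV

n = 64
scalar_count = mips_dynamic_instructions(n)
print(scalar_count, VMIPS_INSTRUCTIONS)  # 578 6
```

Four of the nine loop instructions (two DADDIU, DSUBU, and BNEZ) are pure overhead, which is the "nearly half" the text refers to.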
Vector Execution Time

The execution time of a sequence of vector operations primarily depends on three factors: the length of the operand vectors, structural hazards among the operations, and the data dependences. Given the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results, we can compute the time for a single vector instruction. All modern supercomputers have vector functional units with multiple parallel pipelines (or lanes) that can produce two or more results per clock cycle, but may also have some functional units that are not fully pipelined. For simplicity, our VMIPS implementation has one lane with an initiation rate of one element per clock cycle for individual operations. Thus, the execution time for a single vector instruction is approximately the vector length.

To simplify the discussion of vector execution and its timing, we will use the notion of a convoy, which is the set of vector instructions that could potentially begin execution together in one clock period. (Although the concept of a convoy is used in vector compilers, no standard terminology exists. Hence, we created the term convoy.) The instructions in a convoy must not contain any structural or data hazards (though we will relax this later); if such hazards were present, the instructions in the potential convoy would need to be serialized and initiated in different convoys. Placing vector instructions into a convoy is analogous to placing scalar operations into a VLIW instruction. To keep the analysis simple, we assume that a convoy of instructions must complete execution before any other instructions (scalar or vector) can begin execution. We will relax this in Section G.4 by using a less restrictive, but more complex, method for issuing instructions.

Accompanying the notion of a convoy is a timing metric, called a chime, that can be used for estimating the performance of a vector sequence consisting of convoys.
A chime is the unit of time taken to execute one convoy. A chime is an approximate measure of execution time for a vector sequence; a chime measurement is independent of vector length. Thus, a vector sequence that consists of m convoys executes in m chimes, and for a vector length of n, this is approximately m × n clock cycles. A chime approximation ignores some processor-specific overheads, many of which are dependent on vector length. Hence, measuring time in chimes is a better approximation for long vectors. We will use the chime measurement, rather than clock cycles per result, to explicitly indicate that certain overheads are being ignored. If we know the number of convoys in a vector sequence, we know the execution time in chimes. One source of overhead ignored in measuring chimes is any limitation on initiating multiple vector instructions in a clock cycle. If only one vector instruction can be initiated in a clock cycle (the reality in most vector processors), the chime count will underestimate the actual execution time of a convoy. Because the vector length is typically much greater than the number of instructions in the convoy, we will simply assume that the convoy executes in one chime.
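The convoy rule and the m × n chime estimate can be sketched in a few lines of Python, applied here to the vector part of the VMIPS DAXPY sequence shown earlier. The tuple encoding of instructions and the function `form_convoys` are assumptions made for illustration. The grouping is a simple in-order greedy pass that checks only functional-unit conflicts and read-after-write dependences within the current convoy.

```python
def form_convoys(instrs):
    """Greedy in-order grouping of (name, unit, dest, sources) tuples.

    A new convoy starts when the instruction needs a functional unit already
    used in the current convoy (structural hazard) or reads a register written
    in the current convoy (data hazard).
    """
    convoys, units, written = [], set(), set()
    for name, unit, dest, srcs in instrs:
        hazard = unit in units or any(s in written for s in srcs)
        if hazard or not convoys:
            convoys.append([])
            units, written = set(), set()
        convoys[-1].append(name)
        units.add(unit)
        if dest is not None:
            written.add(dest)
    return convoys


daxpy = [
    ("LV",      "load-store", "V1", []),
    ("MULVS.D", "multiply",   "V2", ["V1"]),
    ("LV",      "load-store", "V3", []),
    ("ADDV.D",  "add",        "V4", ["V2", "V3"]),
    ("SV",      "load-store", None, ["V4"]),
]

convoys = form_convoys(daxpy)
chimes = len(convoys)       # m convoys take m chimes
n = 64
approx_cycles = chimes * n  # chime approximation: m × n clock cycles
```

For this sequence the sketch yields four convoys, so the chime estimate for 64-element vectors is 256 clock cycles, with start-up and issue overheads ignored as the text describes.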
Example

Show how the following code sequence lays out in convoys, assuming a single copy of each vector functional unit:

          LV       V1,Rx         ;load vector X
          MULVS.D  V2,V1,F0      ;vector-scalar multiply
          LV       V3,Ry         ;load vector Y
          ADDV.D   V4,V2,V3      ;add
          SV       Ry,V4         ;store the result

How many chimes will this vector sequence take? How many cycles per FLOP (floating-point operation) are needed, ignoring vector instruction issue overhead?

Answer

The first convoy is occupied by the first LV instruction. The MULVS.D is dependent on the first LV, so it cannot be in the same convoy. The second LV instruction can be in the same convoy as the MULVS.D. The ADDV.D is dependent on the second LV, so it must come in yet a third convoy, and finally the SV depends on the ADDV.D, so it must go in a following convoy. This leads to the following layout of vector instructions into convoys:

1. LV
2. MULVS.D  LV
3. ADDV.D
4. SV

The sequence requires four convoys and hence takes four chimes. Since the sequence takes a total of four chimes and there are two floating-point operations per result, the number of cycles per FLOP is 2 (ignoring any vector instruction issue overhead). Note that although we allow the MULVS.D and the LV both to execute in convoy 2, most vector machines will take 2 clock cycles to initiate the instructions.

The chime approximation is reasonably accurate for long vectors. For example, for 64-element vectors, the time in chimes is four, so the sequence would take about 256 clock cycles. The overhead of issuing convoy 2 in two separate clocks would be small.

Another source of overhead is far more significant than the issue limitation. The most important source of overhead ignored by the chime model is vector start-up time. The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used. The start-up time increases the effective time to execute a convoy to more than one chime. Because of our assumption that convoys do not overlap in time, the start-up time delays the execution of subsequent convoys. Of course, the instructions in successive convoys either have structural conflicts for some functional unit or are data dependent, so the assumption of no overlap is reasonable. The actual time to complete a convoy is determined by the sum of the vector length and the start-up time. If vector lengths were infinite, this start-up overhead would be amortized, but finite vector lengths expose it, as the following example shows.

Example
Assume the start-up overhead for functional units is shown in Figure G.4. Show the time that each convoy can begin and the total number of cycles needed. How does the time compare to the chime approximation for a vector of length 64?
Answer
Figure G.5 provides the answer in convoys, assuming that the vector length is n. One tricky question is when we assume the vector sequence is done; this determines whether the start-up time of the SV is visible or not. We assume that the instructions following cannot fit in the same convoy, and we have already assumed that convoys do not overlap. Thus the total time is given by the time until the last vector instruction in the last convoy completes. This is an approximation, and the start-up time of the last vector instruction may be seen in some sequences and not in others. For simplicity, we always include it. The last result appears at cycle 41 + 4n, so the sequence takes 42 + 4n clock cycles in total. The time per result for a vector of length 64 is therefore 4 + (42/64) ≈ 4.66 clock cycles, while the chime approximation would be 4. The execution time with start-up overhead is about 1.16 times higher.
| Unit | Start-up overhead (cycles) |
|---|---|
| Load and store unit | 12 |
| Multiply unit | 7 |
| Add unit | 6 |
Figure G.4 Start-up overhead.
| Convoy | Starting time | First-result time | Last-result time |
|---|---|---|---|
| 1. LV | 0 | 12 | 11 + n |
| 2. MULVS.D LV | 12 + n | 12 + n + 12 | 23 + 2n |
| 3. ADDV.D | 24 + 2n | 24 + 2n + 6 | 29 + 3n |
| 4. SV | 30 + 3n | 30 + 3n + 12 | 41 + 4n |
Figure G.5 Starting times and first- and last-result times for convoys 1 through 4. The vector length is n.
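The entries of Figure G.5 can be reproduced with a short sketch, assuming (as the text does) that convoys never overlap, that each convoy's start-up is that of its slowest unit, and that one result is produced per clock after start-up. The function `convoy_timing` and the dictionary `STARTUP` are illustrative names; the start-up values are those of Figure G.4.

```python
STARTUP = {"LV": 12, "SV": 12, "MULVS.D": 7, "ADDV.D": 6}  # cycles, Figure G.4


def convoy_timing(convoys, n):
    """Return (start, first_result, last_result) cycle numbers per convoy."""
    rows, start = [], 0
    for convoy in convoys:
        startup = max(STARTUP[op] for op in convoy)  # slowest unit dominates
        first = start + startup
        last = first + n - 1       # one result per clock after the first
        rows.append((start, first, last))
        start = last + 1           # next convoy begins once this one is done
    return rows


convoys = [["LV"], ["MULVS.D", "LV"], ["ADDV.D"], ["SV"]]
n = 64
rows = convoy_timing(convoys, n)
total_cycles = rows[-1][2] + 1              # cycles 0 through last, inclusive
cycles_per_result = total_cycles / n
slowdown = cycles_per_result / len(convoys)  # relative to the chime estimate
```

With n = 64 the last result appears at cycle 297 (that is, 41 + 4n), giving 298 cycles in total and a little over 4.65 cycles per result, roughly 1.16 times the four-chime estimate.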
For simplicity, we will use the chime approximation for running time, incorporating start-up time effects only when we want more detailed performance or to illustrate the benefits of some enhancement. For long vectors, a typical situation, the overhead effect is not that large. Later in the appendix we will explore ways to reduce start-up overhead.

Start-up time for an instruction comes from the pipeline depth for the functional unit implementing that instruction. If the initiation rate is to be kept at 1 clock cycle per result, then

$$\text{Pipeline depth} = \frac{\text{Total functional unit time}}{\text{Clock cycle time}}$$

For example, if an operation takes 10 clock cycles, it must be pipelined 10 deep to achieve an initiation rate of one per clock cycle. Pipeline depth, then, is determined by the complexity of the operation and the clock cycle time of the processor. The pipeline depths of functional units vary widely (from 2 to 20 stages is not uncommon), although the most heavily used units have pipeline depths of 4–8 clock cycles.

For VMIPS, we will use the same pipeline depths as the Cray-1, although latencies in more modern processors have tended to increase, especially for loads. All functional units are fully pipelined. As shown in Figure G.6, pipeline depths are 6 clock cycles for floating-point add and 7 clock cycles for floating-point multiply. On VMIPS, as on most vector processors, independent vector operations using different functional units can issue in the same convoy.
Vector Load-Store Units and Vector Memory Systems

The behavior of the load-store vector unit is significantly more complicated than that of the arithmetic functional units. The start-up time for a load is the time to get the first word from memory into a register. If the rest of the vector can be supplied without stalling, then the vector initiation rate is equal to the rate at which new words are fetched or stored. Unlike simpler functional units, the initiation rate may not necessarily be 1 clock cycle because memory bank stalls can reduce effective throughput.
| Operation | Start-up penalty |
|---|---|
| Vector add | 6 |
| Vector multiply | 7 |
| Vector divide | 20 |
| Vector load | 12 |
Figure G.6 Start-up penalties on VMIPS. These are the start-up penalties in clock cycles for VMIPS vector operations.
Typically, penalties for start-ups on load-store units are higher than those for arithmetic functional units (over 100 clock cycles on some processors). For VMIPS we will assume a start-up time of 12 clock cycles, the same as the Cray-1. Figure G.6 summarizes the start-up penalties for VMIPS vector operations.

To maintain an initiation rate of 1 word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by creating multiple memory banks, as discussed in Section 5.8. As we will see in the next section, having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data. Most vector processors use memory banks rather than simple interleaving for three primary reasons:

1. Many vector computers support multiple loads or stores per clock, and the memory bank cycle time is often several times larger than the CPU cycle time. To support multiple simultaneous accesses, the memory system needs to have multiple banks and be able to control the addresses to the banks independently.

2. As we will see in the next section, many vector processors support the ability to load or store data words that are not sequential. In such cases, independent bank addressing, rather than interleaving, is required.

3. Many vector computers support multiple processors sharing the same memory system, and so each processor will be generating its own independent stream of addresses.

In combination, these features lead to a large number of independent memory banks, as shown by the following example.

Example

The Cray T90 has a CPU clock cycle of 2.167 ns and in its largest configuration (Cray T932) has 32 processors, each capable of generating four loads and two stores per CPU clock cycle. The cycle time of the SRAMs used in the memory system is 15 ns. Calculate the minimum number of memory banks required to allow all CPUs to run at full memory bandwidth.
Answer
The maximum number of memory references each cycle is 192 (32 CPUs times 6 references per CPU). Each SRAM bank is busy for 15/2.167 = 6.92 clock cycles, which we round up to 7 CPU clock cycles. Therefore we require a minimum of 192 × 7 = 1344 memory banks! The Cray T932 actually has 1024 memory banks, and so the early models could not sustain full bandwidth to all CPUs simultaneously. A subsequent memory upgrade replaced the 15 ns asynchronous SRAMs with pipelined synchronous SRAMs that more than halved the memory cycle time, thereby providing sufficient bandwidth.
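The arithmetic in this answer can be written as a small calculation. The function `min_banks` below is a hypothetical helper, assuming every reference occupies one bank for a whole number of CPU clocks (the bank cycle time rounded up).

```python
import math


def min_banks(cpus, refs_per_cpu_per_clock, bank_cycle_ns, cpu_clock_ns):
    """Minimum banks so that every reference issued each clock finds a free bank."""
    refs_per_clock = cpus * refs_per_cpu_per_clock
    busy_clocks = math.ceil(bank_cycle_ns / cpu_clock_ns)  # 15 / 2.167 rounds up to 7
    return refs_per_clock * busy_clocks


# Cray T932: 32 CPUs, each issuing 4 loads + 2 stores per clock
banks = min_banks(32, 4 + 2, 15.0, 2.167)
print(banks)  # 1344
```

Halving the bank cycle time to 7.5 ns in the same formula gives 192 × 4 = 768 banks, which is below the 1024 banks the machine has.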
In Chapter 5 we saw that the desired access rate and the bank access time determined how many banks were needed to access a memory without a stall. The next example shows how these timings work out in a vector processor. Example
Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU?
Answer
Six clocks per access require at least six banks, but because we want the number of banks to be a power of two, we choose to have eight banks. Figure G.7 shows the timing for the first few sets of accesses for an eight-bank system with a 6-clock-cycle access latency.
           Bank
Cycle no.  0     1     2     3     4     5     6     7
0                136
1                busy  144
2                busy  busy  152
3                busy  busy  busy  160
4                busy  busy  busy  busy  168
5                busy  busy  busy  busy  busy  176
6                      busy  busy  busy  busy  busy  184
7          192               busy  busy  busy  busy  busy
8          busy  200               busy  busy  busy  busy
9          busy  busy  208               busy  busy  busy
10         busy  busy  busy  216               busy  busy
11         busy  busy  busy  busy  224               busy
12         busy  busy  busy  busy  busy  232
13               busy  busy  busy  busy  busy  240
14                     busy  busy  busy  busy  busy  248
15         256               busy  busy  busy  busy  busy
16         busy  264               busy  busy  busy  busy
Figure G.7 Memory addresses (in bytes) by bank number and time slot at which access begins. Each memory bank latches the element address at the start of an access and is then busy for 6 clock cycles before returning a value to the CPU. Note that the CPU cannot keep all eight banks busy all the time because it is limited to supplying one new address and receiving one data item each cycle.
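The schedule in Figure G.7 can be generated mechanically. The following Python sketch is illustrative only: it assumes 8-byte elements, banks interleaved on double-word boundaries (bank = word address mod number of banks), one new address per clock, and data arriving a fixed number of clocks after the address is latched. The function name and tuple layout are hypothetical.

```python
def bank_schedule(start_addr, n_elements, n_banks, access_clocks, elem_bytes=8):
    """Return (issue_cycle, byte_address, bank, arrival_cycle) per element of a unit-stride load."""
    schedule = []
    for i in range(n_elements):
        addr = start_addr + i * elem_bytes
        bank = (addr // elem_bytes) % n_banks   # assumed double-word interleaving
        issue = i                               # one new address per clock
        arrival = issue + access_clocks         # assumed: data returns after the access time
        schedule.append((issue, addr, bank, arrival))
    return schedule

sched = bank_schedule(start_addr=136, n_elements=64, n_banks=8, access_clocks=6)
for row in sched[:9]:
    print(row)
```

Under these assumptions the first element (address 136) goes to bank 1 at cycle 0, and bank 1 is not needed again until cycle 8, after its 6-clock access has finished, so no request ever waits.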
The timing of real memory banks is usually split into two different components, the access latency and the bank cycle time (or bank busy time). The access latency is the time from when the address arrives at the bank until the bank returns a data value, while the busy time is the time the bank is occupied with one request. The access latency adds to the start-up cost of fetching a vector from memory (the total memory latency also includes time to traverse the pipelined interconnection networks that transfer addresses and data between the CPU and memory banks). The bank busy time governs the effective bandwidth of a memory system because a processor cannot issue a second request to the same bank until the bank busy time has elapsed. For simple unpipelined SRAM banks as used in the previous examples, the access latency and busy time are approximately the same. For a pipelined SRAM bank, however, the access latency is larger than the busy time because each element access only occupies one stage in the memory bank pipeline. For a DRAM bank, the access latency is usually shorter than the busy time because a DRAM needs extra time to restore the read value after the destructive read operation. For memory systems that support multiple simultaneous vector accesses or allow nonsequential accesses in vector loads or stores, the number of memory banks should be larger than the minimum; otherwise, memory bank conflicts will exist. We explore this in more detail in the next section.
G.3
Two Real-World Issues: Vector Length and Stride
This section deals with two issues that arise in real programs: What do you do when the vector length in a program is not exactly 64? How do you deal with nonadjacent elements in vectors that reside in memory? First, let’s consider the issue of vector length.
Vector-Length Control
A vector-register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for VMIPS, is unlikely to match the real vector length in a program. Moreover, in a real program the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:
      do 10 i = 1,n
10    Y(i) = a * X(i) + Y(i)
The size of all the vector operations depends on n, which may not even be known until run time! The value of n might also be a parameter to a procedure containing the above loop and therefore be subject to change during execution.
The solution to these problems is to create a vector-length register (VLR). The VLR controls the length of any vector operation, including a vector load or store. The value in the VLR, however, cannot be greater than the length of the vector registers. This solves our problem as long as the real length is less than or equal to the maximum vector length (MVL) defined by the processor. What if the value of n is not known at compile time, and thus may be greater than MVL? To tackle the second problem where the vector is longer than the maximum length, a technique called strip mining is used. Strip mining is the generation of code such that each vector operation is done for a size less than or equal to the MVL. We could strip-mine the loop in the same manner that we unrolled loops in Chapter 4: create one loop that handles any number of iterations that is a multiple of MVL and another loop that handles any remaining iterations, which must be less than MVL. In practice, compilers usually create a single strip-mined loop that is parameterized to handle both portions by changing the length. The strip-mined version of the DAXPY loop written in FORTRAN, the major language used for scientific applications, is shown with C-style comments:
      low = 1
      VL = (n mod MVL)                   /*find the odd-size piece*/
      do 1 j = 0,(n / MVL)               /*outer loop*/
         do 10 i = low, low + VL - 1     /*runs for length VL*/
            Y(i) = a * X(i) + Y(i)       /*main operation*/
10       continue
         low = low + VL                  /*start of next vector*/
         VL = MVL                        /*reset the length to max*/
1     continue
The term n/MVL represents truncating integer division (which is what FORTRAN does) and is used throughout this section. The effect of this loop is to block the vector into segments that are then processed by the inner loop. The length of the first segment is (n mod MVL), and all subsequent segments are of length MVL. This is depicted in Figure G.8.
Value of j    Range of i
0             1 .. m
1             (m + 1) .. m + MVL
2             (m + MVL + 1) .. m + 2 * MVL
3             (m + 2 * MVL + 1) .. m + 3 * MVL
...           ...
n/MVL         (n - MVL + 1) .. n
Figure G.8 A vector of arbitrary length processed with strip mining. All blocks but the first are of length MVL, utilizing the full power of the vector processor. In this figure, the variable m is used for the expression (n mod MVL).
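To make the blocking concrete, here is a minimal Python sketch of the same strip-mined DAXPY loop. It is only an emulation: the inner Python loop stands in for one vector operation of length VL, indices are 0-based rather than FORTRAN's 1-based, and the returned list of (start, length) strips is an addition for illustration.

```python
def strip_mined_daxpy(a, x, y, mvl=64):
    """Compute y = a*x + y in strips of at most mvl elements; return the strips used."""
    n = len(x)
    low = 0                      # 0-based start of the current strip
    vl = n % mvl                 # odd-size first piece (may be 0)
    strips = []
    for _ in range(n // mvl + 1):            # truncating division, as in the FORTRAN
        for i in range(low, low + vl):       # stands in for one vector op of length vl
            y[i] = a * x[i] + y[i]
        strips.append((low, vl))
        low += vl                # start of next strip
        vl = mvl                 # all later strips use the full length
    return strips

x = [float(i) for i in range(200)]
y = [1.0] * 200
strips = strip_mined_daxpy(2.0, x, y)
print(strips)  # [(0, 8), (8, 64), (72, 64), (136, 64)]
```

For n = 200 and MVL = 64 the first strip has 200 mod 64 = 8 elements and the remaining three have 64 each, matching the layout in Figure G.8.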
The inner loop of the preceding code is vectorizable with length VL, which is equal to either (n mod MVL) or MVL. The VLR register must be set twice, once at each place where the variable VL in the code is assigned. With multiple vector operations executing in parallel, the hardware must copy the value of VLR to the vector functional unit when a vector operation issues, in case VLR is changed for a subsequent vector operation.
Several vector ISAs have been developed that allow implementations to have different maximum vector-register lengths. For example, the IBM vector extension for the IBM 370 series mainframes supports an MVL of anywhere between 8 and 512 elements. A “load vector count and update” (VLVCU) instruction is provided to control strip-mined loops. The VLVCU instruction has a single scalar register operand that specifies the desired vector length. The vector-length register is set to the minimum of the desired length and the maximum available vector length, and this value is also subtracted from the scalar register, setting the condition codes to indicate if the loop should be terminated. In this way, object code can be moved unchanged between two different implementations while making full use of the available vector-register length within each strip-mined loop iteration.
In addition to the start-up overhead, we need to account for the overhead of executing the strip-mined loop. This strip-mining overhead, which arises from the need to reinitiate the vector sequence and set the VLR, effectively adds to the vector start-up time, assuming that a convoy does not overlap with other instructions. If that overhead for a convoy is 10 cycles, then the effective overhead per 64 elements increases by 10 cycles, or about 0.16 cycles per element.
There are two key factors that contribute to the running time of a strip-mined loop consisting of a sequence of convoys:
1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.
2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip-mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.
There may also be a fixed overhead associated with setting up the vector sequence the first time. In recent vector processors this overhead has become quite small, so we ignore it. The components can be used to state the total running time for a vector sequence operating on a vector of length n, which we will call Tn:
\[ T_n = \left\lceil \frac{n}{\mathrm{MVL}} \right\rceil \times (T_{\mathrm{loop}} + T_{\mathrm{start}}) + n \times T_{\mathrm{chime}} \]
The values of Tstart, Tloop, and Tchime are compiler and processor dependent. The register allocation and scheduling of the instructions affect both what goes in a convoy and the start-up overhead of each convoy.
For simplicity, we will use a constant value for Tloop on VMIPS. Based on a variety of measurements of Cray-1 vector execution, the value chosen is 15 for Tloop. At first glance, you might think that this value is too small. The overhead in each loop requires setting up the vector starting addresses and the strides, incrementing counters, and executing a loop branch. In practice, these scalar instructions can be totally or partially overlapped with the vector instructions, minimizing the time spent on these overhead functions. The value of Tloop of course depends on the loop structure, but the dependence is slight compared with the connection between the vector code and the values of Tchime and Tstart. Example
What is the execution time on VMIPS for the vector operation A = B × s, where s is a scalar and the length of the vectors A and B is 200?
Answer
Assume the addresses of A and B are initially in Ra and Rb, s is in Fs, and recall that for MIPS (and VMIPS) R0 always holds 0. Since (200 mod 64) = 8, the first iteration of the strip-mined loop will execute for a vector length of 8 elements, and the following iterations will execute for a vector length of 64 elements. The starting byte address of the next segment of each vector advances by eight times the vector length. Since the vector length is either 8 or 64, we increment the address registers by 8 × 8 = 64 after the first segment and 8 × 64 = 512 for later segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600. Here is the actual code:
        DADDUI   R2,R0,#1600   ;total # bytes in vector
        DADDU    R2,R2,Ra      ;address of the end of A vector
        DADDUI   R1,R0,#8      ;loads length of 1st segment
        MTC1     VLR,R1        ;load vector length in VLR
        DADDUI   R1,R0,#64     ;length in bytes of 1st segment
        DADDUI   R3,R0,#64     ;vector length of other segments
Loop:   LV       V1,Rb         ;load B
        MULVS.D  V2,V1,Fs      ;vector * scalar
        SV       Ra,V2         ;store A
        DADDU    Ra,Ra,R1      ;address of next segment of A
        DADDU    Rb,Rb,R1      ;address of next segment of B
        DADDUI   R1,R0,#512    ;load byte offset next segment
        MTC1     VLR,R3        ;set length to 64 elements
        DSUBU    R4,R2,Ra      ;at the end of A?
        BNEZ     R4,Loop       ;if not, go back
The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3. Let’s use our basic formula:
\[ T_n = \left\lceil \frac{n}{\mathrm{MVL}} \right\rceil \times (T_{\mathrm{loop}} + T_{\mathrm{start}}) + n \times T_{\mathrm{chime}} \]
\[ T_{200} = 4 \times (15 + T_{\mathrm{start}}) + 200 \times 3 \]
\[ T_{200} = 60 + (4 \times T_{\mathrm{start}}) + 600 = 660 + (4 \times T_{\mathrm{start}}) \]
The value of Tstart is the sum of
■ The vector load start-up of 12 clock cycles
■ A 7-clock-cycle start-up for the multiply
■ A 12-clock-cycle start-up for the store
Thus, the value of Tstart is given by
Tstart = 12 + 7 + 12 = 31
So, the overall value becomes
T200 = 660 + 4 × 31 = 784
The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three. In Section G.4, we will be more ambitious and allow overlapping of separate convoys. Figure G.9 shows the overhead and effective rates per element for the previous example (A = B × s) with various vector lengths. A chime counting model would lead to 3 clock cycles per element, while the two sources of overhead add 0.9 clock cycles per element in the limit. The next few sections introduce enhancements that reduce this time. We will see how to reduce the number of convoys and hence the number of chimes using a technique called chaining. The loop overhead can be reduced by further overlapping the execution of vector and scalar instructions, allowing the scalar loop overhead in one iteration to be executed while the vector instructions in the previous iteration are completing. Finally, the vector start-up overhead can also be eliminated, using a technique that allows overlap of vector instructions in separate convoys.
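The running-time model used in this example is easy to evaluate for other vector lengths. The Python sketch below is a minimal illustration: the function name and default arguments are assumptions, with the defaults set to the VMIPS values used in the example (MVL = 64, Tloop = 15, Tstart = 31, Tchime = 3), and the number of strip-mined iterations is taken as n/MVL rounded up, which gives the 4 iterations used for n = 200.

```python
import math

def vector_time(n, mvl=64, t_loop=15, t_start=31, t_chime=3):
    """Running time in clock cycles: ceil(n/MVL) * (T_loop + T_start) + n * T_chime."""
    return math.ceil(n / mvl) * (t_loop + t_start) + n * t_chime

t200 = vector_time(200)
print(t200, round(t200 / 200, 1))  # 784 3.9

# Overhead per element (total time minus the chime component) for a few lengths.
for n in (10, 64, 65, 200):
    overhead = vector_time(n) - 3 * n
    print(n, round(overhead / n, 2))
```

Evaluating the function just below and just above a multiple of 64 shows the jump of Tloop + Tstart cycles that produces the steps visible in Figure G.9.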
Vector Stride
The second problem this section addresses is that the position in memory of adjacent elements in a vector may not be sequential. Consider the straightforward code for matrix multiply:
      do 10 i = 1,100
         do 10 j = 1,100
            A(i,j) = 0.0
            do 10 k = 1,100
10          A(i,j) = A(i,j)+B(i,k)*C(k,j)
At the statement labeled 10 we could vectorize the multiplication of each row of B with each column of C and strip-mine the inner loop with k as the index variable.
[Plot: clock cycles per element (vertical axis, 0 to 9) versus vector size (horizontal axis, 10 to 190), with two curves: total time per element and total overhead per element.]
Figure G.9 The total execution time per element and the total overhead time per element versus the vector length for the example on page G-19. For short vectors the total start-up time is more than one-half of the total time, while for long vectors it reduces to about one-third of the total time. The sudden jumps occur when the vector length crosses a multiple of 64, forcing another iteration of the strip-mining code and execution of a set of vector instructions. These operations increase Tn by Tloop + Tstart.
To do so, we must consider how adjacent elements in B and adjacent elements in C are addressed. As we discussed in Section 5.5, when an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, if the preceding loop were written in FORTRAN, which allocates column-major order, the elements of B that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry) for a total of 800 bytes. In Chapter 5, we saw that blocking could be used to improve the locality in cache-based systems. For vector processors without caches, we need another technique to fetch elements of a vector that are not adjacent in memory. This distance separating elements that are to be gathered into a single register is called the stride. In the current example, using column-major layout for the matrices means that matrix C has a stride of 1, or 1 double word (8 bytes), separating successive elements, and matrix B has a stride of 100, or 100 double words (800 bytes). Once a vector is loaded into a vector register it acts as if it had logically adjacent elements. Thus a vector-register processor can handle strides greater than one, called nonunit strides, using only vector-load and vector-store operations with stride capability. This ability to access nonsequential memory locations and
to reshape them into a dense structure is one of the major advantages of a vector processor over a cache-based processor. Caches inherently deal with unit stride data, so that while increasing block size can help reduce miss rates for large scientific data sets with unit stride, increasing block size can have a negative effect for data that is accessed with nonunit stride. While blocking techniques can solve some of these problems (see Section 5.5), the ability to efficiently access data that is not contiguous remains an advantage for vector processors on certain problems. On VMIPS, where the addressable unit is a byte, the stride for our example would be 800. The value must be computed dynamically, since the size of the matrix may not be known at compile time, or—just like vector length—may change for different executions of the same statement. The vector stride, like the vector starting address, can be put in a general-purpose register. Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used. In some vector processors the loads and stores always have a stride value stored in a register, so that only a single load and a single store instruction are required. Unit strides occur much more frequently than other strides and can benefit from special case handling in the memory system, and so are often separated from nonunit stride operations as in VMIPS. Complications in the memory system can occur from supporting strides greater than one. In Chapter 5 we saw that memory accesses could proceed at full speed if the number of memory banks was at least as large as the bank busy time in clock cycles. Once nonunit strides are introduced, however, it becomes possible to request accesses from the same bank more frequently than the bank busy time allows. 
When multiple accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. The number of accesses issued before a strided sequence returns to the same bank is the number of banks divided by the greatest common divisor of the stride and the number of banks. A bank conflict, and hence a stall, will therefore occur if
\[ \frac{\text{Number of banks}}{\text{Greatest common divisor}(\text{Stride}, \text{Number of banks})} < \text{Bank busy time} \]
Example
Suppose we have 8 memory banks with a bank busy time of 6 clocks and a total memory latency of 12 cycles. How long will it take to complete a 64-element vector load with a stride of 1? With a stride of 32?
Answer
Since the number of banks is larger than the bank busy time, for a stride of 1, the load will take 12 + 64 = 76 clock cycles, or 1.2 clocks per element. The worst possible stride is a value that is a multiple of the number of memory banks, as in this case with a stride of 32 and 8 memory banks. Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-clock-cycle bank busy time. The total time will be 12 + 1 + 6 × 63 = 391 clock cycles, or 6.1 clocks per element.
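Both results can be reproduced by simulating the banks directly. The Python sketch below is a simplified model, not a description of any real memory system: it assumes the stride is given in elements, the first element maps to bank 0, at most one address is issued per clock, and an address stalls until its bank's busy time has elapsed. The function and parameter names are illustrative.

```python
def strided_load_cycles(n, stride, n_banks, busy, latency):
    """Clock cycles to load n elements, issuing at most one address per clock."""
    bank_free = [0] * n_banks    # earliest cycle at which each bank accepts a request
    issue = -1
    for i in range(n):
        bank = (i * stride) % n_banks
        issue = max(issue + 1, bank_free[bank])   # stall while the bank is busy
        bank_free[bank] = issue + busy
    return issue + 1 + latency   # last address issued, then its data returns

print(strided_load_cycles(64, 1, 8, 6, 12))   # 76
print(strided_load_cycles(64, 32, 8, 6, 12))  # 391
```

With a stride of 1 the same bank is revisited only every 8 clocks, so no access stalls. With a stride of 32 every element maps to the same bank, and each access after the first waits out the full 6-clock busy time.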
Memory bank conflicts will not occur within a single vector memory instruction if the stride and number of banks are relatively prime with respect to each other and there are enough banks to avoid conflicts in the unit stride case. When there are no bank conflicts, multiword and unit strides run at the same rates. Increasing the number of memory banks to a number greater than the minimum to prevent stalls with a stride of length 1 will decrease the stall frequency for some other strides. For example, with 64 banks, a stride of 32 will stall on every other access, rather than every access. If we originally had a stride of 8 and 16 banks, every other access would stall; with 64 banks, a stride of 8 will stall on every eighth access. If we have multiple memory pipelines and/or multiple processors sharing the same memory system, we will also need more banks to prevent conflicts. Even machines with a single memory pipeline can experience memory bank conflicts on unit stride accesses between the last few elements of one instruction and the first few elements of the next instruction, and increasing the number of banks will reduce the probability of these interinstruction conflicts. In 2001, most vector supercomputers have at least 64 banks, and some have as many as 1024 in the maximum memory configuration. Because bank conflicts can still occur in nonunit stride cases, programmers favor unit stride accesses whenever possible. A modern supercomputer may have dozens of CPUs, each with multiple memory pipelines connected to thousands of memory banks. It would be impractical to provide a dedicated path between each memory pipeline and each memory bank, and so typically a multistage switching network is used to connect memory pipelines to memory banks. Congestion can arise in this switching network as different vector accesses contend for the same circuit paths, causing additional stalls in the memory system.
G.4
Enhancing Vector Performance
In this section we present five techniques for improving the performance of a vector processor. The first, chaining, deals with making a sequence of dependent vector operations run faster, and originated in the Cray-1 but is now supported on most vector processors. The next two deal with expanding the class of loops that can be run in vector mode by combating the effects of conditional execution and sparse matrices with new types of vector instruction. The fourth technique increases the peak performance of a vector machine by adding more parallel execution units in the form of additional lanes. The fifth technique reduces start-up overhead by pipelining and overlapping instruction start-up.
Chaining: the Concept of Forwarding Extended to Vector Registers
Consider the simple vector sequence
      MULV.D   V1,V2,V3
      ADDV.D   V4,V1,V5
In VMIPS, as it currently stands, these two instructions must be put into two separate convoys, since the instructions are dependent. On the other hand, if the vector register, V1 in this case, is treated not as a single entity but as a group of individual registers, then the ideas of forwarding can be conceptually extended to work on individual elements of a vector. This insight, which will allow the ADDV.D to start earlier in this example, is called chaining. Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available: The results from the first functional unit in the chain are “forwarded” to the second functional unit. In practice, chaining is often implemented by allowing the processor to read and write a particular register at the same time, albeit to different elements. Early implementations of chaining worked like forwarding, but this restricted the timing of the source and destination instructions in the chain. Recent implementations use flexible chaining, which allows a vector instruction to chain to essentially any other active vector instruction, assuming that no structural hazard is generated. Flexible chaining requires simultaneous access to the same vector register by different vector instructions, which can be implemented either by adding more read and write ports or by organizing the vector-register file storage into interleaved banks in a similar way to the memory system. We assume this type of chaining throughout the rest of this appendix. Even though a pair of operations depend on one another, chaining allows the operations to proceed in parallel on separate elements of the vector. This permits the operations to be scheduled in the same convoy and reduces the number of chimes required. For the previous sequence, a sustained rate (ignoring start-up) of two floating-point operations per clock cycle, or one chime, can be achieved, even though the operations are dependent! 
The total running time for the above sequence becomes
\[ \text{Vector length} + \text{Start-up time}_{\mathrm{ADDV}} + \text{Start-up time}_{\mathrm{MULV}} \]
Figure G.10 shows the timing of a chained and an unchained version of the above pair of vector instructions with a vector length of 64. This convoy requires one chime; however, because it uses chaining, the start-up overhead will be seen in the actual timing of the convoy. In Figure G.10, the total time for chained operation is 77 clock cycles, or 1.2 cycles per result. With 128 floating-point operations done in that time, 1.7 FLOPS per clock cycle are obtained. For the unchained version, there are 141 clock cycles, or 0.9 FLOPS per clock cycle. Although chaining allows us to reduce the chime component of the execution time by putting two dependent instructions in the same convoy, it does not eliminate the start-up overhead. If we want an accurate running time estimate, we must count the start-up time both within and across convoys. With chaining, the number of chimes for a sequence is determined by the number of different vector functional units available in the processor and the number required by the application. In particular, no convoy can contain a structural hazard. This means, for example, that a sequence containing two vector memory instructions must take at least two convoys, and hence two chimes, on a processor like VMIPS with only one vector load-store unit. We will see in Section G.6 that chaining plays a major role in boosting vector performance. In fact, chaining is so important that every modern vector processor supports flexible chaining.
  Unchained:  [7][64: MULV][6][64: ADDV]    Total = 141
  Chained:    [7][64: MULV]
                 [6][64: ADDV]              Total = 77
Figure G.10 Timings for a sequence of dependent vector operations ADDV and MULV, both unchained and chained. The 6- and 7-clock-cycle delays are the latency of the adder and multiplier.
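The totals in Figure G.10 follow from two simple sums, sketched here in Python. The function names are illustrative, and the model is the simplified one used in the text: without chaining each dependent instruction waits for the previous one to complete, and with chaining only the start-up latencies accumulate before results stream out at one per clock.

```python
def unchained_time(n, startups):
    """Each dependent instruction runs to completion before the next begins."""
    return sum(s + n for s in startups)

def chained_time(n, startups):
    """Dependent instructions overlap; only the start-up latencies add up."""
    return sum(startups) + n

n = 64
startups = [7, 6]                 # MULV.D, then ADDV.D start-up latencies
flops = 2 * n                     # one multiply and one add per element

unchained = unchained_time(n, startups)
chained = chained_time(n, startups)
print(unchained, round(flops / unchained, 1))  # 141 0.9
print(chained, round(flops / chained, 1))      # 77 1.7
```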
Conditionally Executed Statements
From Amdahl’s Law, we know that the speedup on programs with low to moderate levels of vectorization will be very limited. Two reasons why higher levels of vectorization are not achieved are the presence of conditionals (if statements) inside loops and the use of sparse matrices. Programs that contain if statements in loops cannot be run in vector mode using the techniques we have discussed so far because the if statements introduce control dependences into a loop. Likewise, sparse matrices cannot be efficiently implemented using any of the capabilities we have seen so far. We discuss strategies for dealing with conditional execution here, leaving the discussion of sparse matrices to the following subsection. Consider the following loop:
      do 100 i = 1, 64
         if (A(i).ne. 0) then
            A(i) = A(i) - B(i)
         endif
100   continue
This loop cannot normally be vectorized because of the conditional execution of the body; however, if the inner loop could be run for the iterations for which A(i) ≠ 0, then the subtraction could be vectorized. In Chapter 4, we saw that the conditionally executed instructions could turn such control dependences into data dependences, enhancing the ability to parallelize the loop. Vector processors can benefit from an equivalent capability for vectors.
The extension that is commonly used for this capability is vector-mask control. The vector-mask control uses a Boolean vector of length MVL to control the execution of a vector instruction just as conditionally executed instructions use a Boolean condition to determine whether an instruction is executed. When the vector-mask register is enabled, any vector instructions executed operate only on the vector elements whose corresponding entries in the vector-mask register are 1. The entries in the destination vector register that correspond to a 0 in the mask register are unaffected by the vector operation. If the vector-mask register is set by the result of a condition, only elements satisfying the condition will be affected. Clearing the vector-mask register sets it to all 1s, making subsequent vector instructions operate on all vector elements. The following code can now be used for the previous loop, assuming that the starting addresses of A and B are in Ra and Rb, respectively:
      LV        V1,Ra       ;load vector A into V1
      LV        V2,Rb       ;load vector B
      L.D       F0,#0       ;load FP zero into F0
      SNEVS.D   V1,F0       ;sets VM(i) to 1 if V1(i)!=F0
      SUBV.D    V1,V1,V2    ;subtract under vector mask
      CVM                   ;set the vector mask to all 1s
      SV        Ra,V1       ;store the result in A
Most recent vector processors provide vector-mask control. The vector-mask capability described here is available on some processors, but others allow the use of the vector mask with only a subset of the vector instructions. Using a vector-mask register does, however, have two disadvantages. When we examined conditionally executed instructions, we saw that such instructions still require execution time when the condition is not satisfied. Nonetheless, the elimination of a branch and the associated control dependences can make a conditional instruction faster even if it sometimes does useless work. The first disadvantage of the vector mask is similar: vector instructions executed with a vector mask still take execution time, even for the elements where the mask is 0. Even with a significant number of 0s in the mask, however, using vector-mask control may still be significantly faster than using scalar mode. In fact, the large difference in potential performance between vector and scalar mode makes the inclusion of vector-mask instructions critical. Second, in some vector processors the vector mask serves only to disable the storing of the result into the destination register, and the actual operation still occurs. Thus, if the operation in the previous example were a divide rather than a subtract and the test was on B rather than A, false floating-point exceptions might result since a division by 0 would occur. Processors that mask the operation as well as the storing of the result avoid this problem.
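The effect of the masked sequence above can be emulated in a few lines of Python. This is only a behavioral sketch with hypothetical names: the list comprehension visits every element, which mirrors the point that masked-off elements still cost execution time, while only the elements whose mask bit is 1 are changed.

```python
def masked_subtract(a, b):
    """Emulate SNEVS.D followed by SUBV.D under mask: a[i] - b[i] only where a[i] != 0."""
    mask = [1 if ai != 0 else 0 for ai in a]       # SNEVS.D V1,F0 sets VM(i)
    result = [ai - bi if m else ai                 # masked-off elements are unchanged
              for ai, bi, m in zip(a, b, mask)]
    return result, mask

a = [3.0, 0.0, 5.0, 0.0]
b = [1.0, 2.0, 3.0, 4.0]
result, mask = masked_subtract(a, b)
print(result, mask)  # [2.0, 0.0, 2.0, 0.0] [1, 0, 1, 0]
```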
Sparse Matrices
There are techniques for allowing programs with sparse matrices to execute in vector mode. In a sparse matrix, the elements of a vector are usually stored in some compacted form and then accessed indirectly. Assuming a simplified sparse structure, we might see code that looks like this:
      do 100 i = 1,n
100   A(K(i)) = A(K(i)) + C(M(i))
This code implements a sparse vector sum on the arrays A and C, using index vectors K and M to designate the nonzero elements of A and C. (A and C must have the same number of nonzero elements, n of them.) Another common representation for sparse matrices uses a bit vector to say which elements exist and a dense vector for the nonzero elements. Often both representations exist in the same program. Sparse matrices are found in many codes, and there are many ways to implement them, depending on the data structure used in the program.

A primary mechanism for supporting sparse matrices is scatter-gather operations using index vectors. The goal of such operations is to support moving between a dense representation (i.e., zeros are not included) and a normal representation (i.e., the zeros are included) of a sparse matrix. A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets given in the index vector. The result is a nonsparse vector in a vector register. After these elements are operated on in dense form, the sparse vector can be stored in expanded form by a scatter store, using the same index vector. Hardware support for such operations is called scatter-gather and appears on nearly all modern vector processors. The instructions LVI (load vector indexed) and SVI (store vector indexed) provide these operations in VMIPS. For example, assuming that Ra, Rc, Rk, and Rm contain the starting addresses of the vectors in the previous sequence, the inner loop of the sequence can be coded with vector instructions such as

    LV      Vk,Rk         ;load K
    LVI     Va,(Ra+Vk)    ;load A(K(I))
    LV      Vm,Rm         ;load M
    LVI     Vc,(Rc+Vm)    ;load C(M(I))
    ADDV.D  Va,Va,Vc      ;add them
    SVI     (Ra+Vk),Va    ;store A(K(I))
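The gather, add, scatter pattern of this sequence can be sketched in scalar Python. This is only an illustration under stated assumptions: the helper names `gather`, `scatter`, and `sparse_vector_sum` are hypothetical, indices are 0-based element offsets rather than byte addresses, and the entries of K are assumed to be distinct, as the vector version requires.

```python
def gather(memory, index_vector):
    """Model of LVI: fetch memory[i] for each offset i in the index vector."""
    return [memory[i] for i in index_vector]


def scatter(memory, index_vector, values):
    """Model of SVI: store each value back at its offset in memory."""
    for i, v in zip(index_vector, values):
        memory[i] = v


def sparse_vector_sum(a, c, k, m):
    """A(K(i)) = A(K(i)) + C(M(i)), assuming the entries of k are distinct."""
    va = gather(a, k)                      # LVI Va,(Ra+Vk)
    vc = gather(c, m)                      # LVI Vc,(Rc+Vm)
    va = [x + y for x, y in zip(va, vc)]   # ADDV.D Va,Va,Vc
    scatter(a, k, va)                      # SVI (Ra+Vk),Va


# Illustrative data (not from the text): three nonzero elements in each array.
a = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0]
c = [10.0, 0.0, 20.0, 0.0, 0.0, 30.0]
k = [1, 3, 5]   # positions of the nonzero elements of a
m = [0, 2, 5]   # positions of the nonzero elements of c
sparse_vector_sum(a, c, k, m)
print(a)
```

Only the three gathered elements are touched; the zeros of A are never loaded or stored, which is the point of working in dense form.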
This technique allows code with sparse matrices to be run in vector mode. A simple vectorizing compiler could not automatically vectorize the source code above because the compiler would not know that the elements of K are distinct values, and thus that no dependences exist. Instead, a programmer directive would tell the compiler that it could run the loop in vector mode. More sophisticated vectorizing compilers can vectorize the loop automatically without programmer annotations by inserting run-time checks for data dependences. These run-time checks are implemented with a vectorized software version of the advanced load address table (ALAT) hardware described in Chapter 4 for the Itanium processor. The associative ALAT hardware is replaced with a software hash table that detects if two element accesses within the same strip-mine iteration are to the same address. If no dependences are detected, the strip-mine iteration can complete using the maximum vector length. If a dependence is detected, the vector length is reset to a smaller value that avoids all dependency violations, leaving the remaining elements to be handled on the next iteration of the strip-mined loop. Although this scheme adds considerable software overhead to the loop, the overhead is mostly vectorized for the common case where there are no dependences, and as a result the loop still runs considerably faster than scalar code (although much slower than if a programmer directive was provided).

A scatter-gather capability is included on many of the recent supercomputers. These operations often run more slowly than strided accesses because they are more complex to implement and are more susceptible to bank conflicts, but they are still much faster than the alternative, which may be a scalar loop. If the sparsity properties of a matrix change, a new index vector must be computed. Many processors provide support for computing the index vector quickly. The CVI (create vector index) instruction in VMIPS creates an index vector given a stride (m), where the values in the index vector are 0, m, 2 × m, . . . , 63 × m. Some processors provide an instruction to create a compressed index vector whose entries correspond to the positions with a 1 in the mask register. Other vector architectures provide a method to compress a vector. In VMIPS, we define the CVI instruction to always create a compressed index vector using the vector mask. When the vector mask is all 1s, a standard index vector will be created.

The indexed loads-stores and the CVI instruction provide an alternative method to support conditional vector execution. Here is a vector sequence that implements the loop we saw on page G-25:

    LV       V1,Ra         ;load vector A into V1
    L.D      F0,#0         ;load FP zero into F0
    SNEVS.D  V1,F0         ;sets the VM to 1 if V1(i)!=F0
    CVI      V2,#8         ;generates indices in V2
    POP      R1,VM         ;find the number of 1’s in VM
    MTC1     VLR,R1        ;load vector-length register
    CVM                    ;clears the mask
    LVI      V3,(Ra+V2)    ;load the nonzero A elements
    LVI      V4,(Rb+V2)    ;load corresponding B elements
    SUBV.D   V3,V3,V4      ;do the subtract
    SVI      (Ra+V2),V3    ;store A back
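The effect of this sequence can be modeled in a few lines of Python. The sketch below is an illustration, not VMIPS semantics in full: the function names are hypothetical, the index vector holds element offsets (the `stride` parameter stands in for the byte stride of 8 used by `CVI V2,#8`), and the loop is assumed to be the conditional subtract `if A(i) != 0 then A(i) = A(i) - B(i)` that the instruction comments describe.

```python
def compressed_index_vector(mask, stride=1):
    """Model of CVI under a mask: offsets i*stride where mask[i] is 1."""
    return [i * stride for i, bit in enumerate(mask) if bit]


def conditional_subtract(a, b):
    """if A(i) != 0 then A(i) = A(i) - B(i), via compress, gather, scatter."""
    mask = [1 if x != 0.0 else 0 for x in a]   # SNEVS.D V1,F0
    idx = compressed_index_vector(mask)        # CVI V2 (element offsets here)
    vl = sum(mask)                             # POP R1,VM ; MTC1 VLR,R1
    va = [a[i] for i in idx]                   # LVI V3,(Ra+V2)
    vb = [b[i] for i in idx]                   # LVI V4,(Rb+V2)
    va = [x - y for x, y in zip(va, vb)]       # SUBV.D V3,V3,V4
    for i, v in zip(idx, va):                  # SVI (Ra+V2),V3
        a[i] = v
    return vl


# Illustrative data (not from the text): half of the elements are nonzero.
a = [4.0, 0.0, 7.0, 0.0]
b = [1.0, 5.0, 2.0, 9.0]
vl = conditional_subtract(a, b)
print(vl, a)
```

The returned vector length is the population count of the mask, so the gather, subtract, and scatter steps process only the elements for which the condition holds.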
Whether the implementation using scatter-gather is better than the conditionally executed version depends on the frequency with which the condition holds and the cost of the operations. Ignoring chaining, the running time of the first version (on page G-25) is 5n + c1. The running time of the second version, using indexed loads and stores with a running time of one element per clock, is 4n + 4fn + c2, where f is the fraction of elements for which the condition is true (i.e., A(i) ≠ 0). If we assume that the values of c1 and c2 are comparable, or that they are much smaller than n, we can find when this second technique is better.
$$\mathrm{Time}_1 = 5n$$

$$\mathrm{Time}_2 = 4n + 4fn$$

We want Time1 ≥ Time2, so

$$5n \ge 4n + 4fn$$

$$\frac{1}{4} \ge f$$
That is, the second method is faster if less than one-quarter of the elements are nonzero. In many cases the frequency of execution is much lower. If the index vector can be reused, or if the number of vector statements within the if statement grows, the advantage of the scatter-gather approach will increase sharply.
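The break-even point can be checked numerically. The following sketch simply evaluates the two running-time expressions with the constants c1 and c2 dropped, as the text assumes; the function names and the sample values of f are illustrative.

```python
def time_masked(n):
    """Conditionally executed version: 5n cycles (constant c1 ignored)."""
    return 5 * n


def time_scatter_gather(n, f):
    """Indexed version: 4n + 4fn cycles (constant c2 ignored)."""
    return 4 * n + 4 * f * n


n = 64
for f in (0.10, 0.25, 0.50):
    print(f, time_masked(n), time_scatter_gather(n, f))
```

At f = 0.25 the two expressions are equal; below that fraction the scatter-gather version takes fewer cycles, and above it the masked version does.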
Multiple Lanes

One of the greatest advantages of a vector instruction set is that it allows software to pass a large amount of parallel work to hardware using only a single short instruction. A single vector instruction can include tens to hundreds of independent operations yet be encoded in the same number of bits as a conventional scalar instruction. The parallel semantics of a vector instruction allows an implementation to execute these elemental operations using either a deeply pipelined functional unit, as in the VMIPS implementation we’ve studied so far, or by using an array of parallel functional units, or a combination of parallel and pipelined functional units. Figure G.11 illustrates how vector performance can be improved by using parallel pipelines to execute a vector add instruction.

The VMIPS instruction set has been designed with the property that all vector arithmetic instructions only allow element N of one vector register to take part in operations with element N from other vector registers. This dramatically simplifies the construction of a highly parallel vector unit, which can be structured as multiple parallel lanes. As with a traffic highway, we can increase the peak throughput of a vector unit by adding more lanes. The structure of a four-lane vector unit is shown in Figure G.12.

Each lane contains one portion of the vector-register file and one execution pipeline from each vector functional unit. Each vector functional unit executes vector instructions at the rate of one element group per cycle using multiple pipelines, one per lane. The first lane holds the first element (element 0) for all vector registers, and so the first element in any vector instruction will have its source and destination operands located in the first lane. This allows the arithmetic pipeline local to the lane to complete the operation without communicating with other lanes. Interlane wiring is only required to access main memory.

This lack of interlane communication reduces the wiring cost and register file ports required to build a highly parallel execution unit, and helps explain why current vector supercomputers can complete up to 64 operations per cycle (2 arithmetic units and 2 load-store units across 16 lanes).
Figure G.11 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The machine shown in (a) has a single add pipeline and can complete one addition per cycle. The machine shown in (b) has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group. (Reproduced with permission from Asanovic [1998].)
Adding multiple lanes is a popular technique to improve vector performance as it requires little increase in control complexity and does not require changes to existing machine code. Several vector supercomputers are sold as a range of models that vary in the number of lanes installed, allowing users to trade price against peak vector performance. The Cray SV1 allows four two-lane CPUs to be ganged together using operating system software to form a single larger eight-lane CPU.
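The interleaving of elements across lanes can be sketched as follows. This is a minimal model, assuming the mapping described for the four-lane unit (element i lives in lane i mod 4) and ignoring start-up latency; the function names are hypothetical.

```python
import math


def lane_of(element, lanes=4):
    """Lane holding a given element index of every vector register."""
    return element % lanes


def element_groups(vector_length, lanes=4):
    """Elements that move through the parallel pipelines together, per cycle."""
    return [list(range(start, min(start + lanes, vector_length)))
            for start in range(0, vector_length, lanes)]


def cycles_for(vector_length, lanes):
    """Cycles a functional unit is occupied, ignoring start-up latency."""
    return math.ceil(vector_length / lanes)


print(element_groups(10))
print([lane_of(e) for e in range(10)])
print(cycles_for(64, 1), cycles_for(64, 4))
```

Because both source elements and the destination element of each operation share an index, they share a lane, so each element group completes without any lane-to-lane communication.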
Pipelined Instruction Start-Up

Adding multiple lanes increases peak performance, but does not change start-up latency, and so it becomes critical to reduce start-up overhead by allowing the start of one vector instruction to be overlapped with the completion of preceding vector instructions. The simplest case to consider is when two vector instructions access a different set of vector registers. For example, in the code sequence

    ADDV.D  V1,V2,V3
    ADDV.D  V4,V5,V6

an implementation can allow the first element of the second vector instruction to immediately follow the last element of the first vector instruction down the FP adder pipeline. To reduce the complexity of control logic, some vector machines require some recovery time or dead time in between two vector instructions dispatched to the same vector unit. Figure G.13 is a pipeline diagram that shows both start-up latency and dead time for a single vector pipeline. The following example illustrates the impact of this dead time on achievable vector performance.

[Figure G.12 diagram: four lanes, numbered 0 to 3. Lane k contains FP add pipe k, FP mul. pipe k, and vector-register elements k, k+4, k+8, . . . ; all four lanes connect to the vector load-store unit.]

Figure G.12 Structure of a vector unit containing four lanes. The vector-register storage is divided across the lanes, with each lane holding every fourth element of each vector register. There are three vector functional units shown, an FP add, an FP multiply, and a load-store unit. Each of the vector arithmetic units contains four execution pipelines, one per lane, that act in concert to complete a single vector instruction. Note how each section of the vector-register file only needs to provide enough ports for pipelines local to its lane; this dramatically reduces the cost of providing multiple ports to the vector registers. The path to provide the scalar operand for vector-scalar instructions is not shown in this figure, but the scalar value must be broadcast to all lanes.

Example

The Cray C90 has two lanes but requires 4 clock cycles of dead time between any two vector instructions to the same functional unit, even if they have no data dependences. For the maximum vector length of 128 elements, what is the reduction in achievable peak performance caused by the dead time? What would be the reduction if the number of lanes were increased to 16?
[Figure G.13 pipeline diagram: every element passes through the stages R, X1, X2, X3, W. After the start-up latency, the elements of the first vector instruction follow one another through element 63; four dead cycles then separate it from elements 0, 1, . . . of the second vector instruction.]
Figure G.13 Start-up latency and dead time for a single vector pipeline. Each element has a 5-cycle latency: 1 cycle to read the vector-register file, 3 cycles in execution, then 1 cycle to write the vector-register file. Elements from the same vector instruction can follow each other down the pipeline, but this machine inserts 4 cycles of dead time between two different vector instructions. The dead time can be eliminated with more complex control logic. (Reproduced with permission from Asanovic [1998].)
Answer

A maximum length vector of 128 elements is divided over the two lanes and occupies a vector functional unit for 64 clock cycles. The dead time adds another 4 cycles of occupancy, reducing the peak performance to 64/(64 + 4) = 94.1% of the value without dead time. If the number of lanes is increased to 16, maximum length vector instructions will occupy a functional unit for only 128/16 = 8 cycles, and the dead time will reduce peak performance to 8/(8 + 4) = 66.7% of the value without dead time. In this second case, the vector units can never be more than 2/3 busy!

Pipelining instruction start-up becomes more complicated when multiple instructions can be reading and writing the same vector register, and when some instructions may stall unpredictably, for example, a vector load encountering memory bank conflicts. However, as both the number of lanes and pipeline latencies increase, it becomes increasingly important to allow fully pipelined instruction start-up.
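The arithmetic in this answer generalizes to any lane count. The sketch below, with a hypothetical function name, computes the fraction of peak throughput that survives a fixed dead time, assuming back-to-back maximum-length instructions on one functional unit.

```python
import math


def dead_time_efficiency(vector_length, lanes, dead_cycles):
    """Fraction of peak throughput left after dead time between instructions."""
    busy = math.ceil(vector_length / lanes)   # cycles of useful occupancy
    return busy / (busy + dead_cycles)


for lanes in (2, 4, 8, 16):
    print(lanes, round(dead_time_efficiency(128, lanes, 4), 3))
```

Each doubling of the lane count halves the useful occupancy while the dead time stays fixed, so the loss grows quickly as lanes are added.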
G.5 Effectiveness of Compiler Vectorization

Two factors affect the success with which a program can be run in vector mode. The first factor is the structure of the program itself: Do the loops have true data dependences, or can they be restructured so as not to have such dependences? This factor is influenced by the algorithms chosen and, to some extent, by how they are coded. The second factor is the capability of the compiler. While no compiler can vectorize a loop where no parallelism among the loop iterations
exists, there is tremendous variation in the ability of compilers to determine whether a loop can be vectorized. The techniques used to vectorize programs are the same as those discussed in Chapter 4 for uncovering ILP; here we simply review how well these techniques work.

As an indication of the level of vectorization that can be achieved in scientific programs, let's look at the vectorization levels observed for the Perfect Club benchmarks. These benchmarks are large, real scientific applications. Figure G.14 shows the percentage of operations executed in vector mode for two versions of the code running on the Cray Y-MP. The first version is that obtained with just compiler optimization on the original code, while the second version has been extensively hand-optimized by a team of Cray Research programmers. The wide variation in the level of compiler vectorization has been observed by several studies of the performance of applications on vector processors.

The hand-optimized versions generally show significant gains in vectorization level for codes the compiler could not vectorize well by itself, with all codes now above 50% vectorization. It is interesting to note that for MG3D, FLO52, and DYFESM, the faster code produced by the Cray programmers had lower levels of vectorization. The level of vectorization is not sufficient by itself to determine performance. Alternative vectorization techniques might execute fewer instructions, or keep more values in vector registers, or allow greater chaining and overlap among vector operations, and therefore improve performance even if the vectorization level remains the same or drops. For example, BDNA has almost the same level of vectorization in the two versions, but the hand-optimized code is over 50% faster.

                 Operations executed in vector mode
Benchmark name   Compiler-optimized   Hand-optimized   Speedup from hand optimization
BDNA             96.1%                97.2%            1.52
MG3D             95.1%                94.5%            1.00
FLO52            91.5%                88.7%            N/A
ARC3D            91.1%                92.0%            1.01
SPEC77           90.3%                90.4%            1.07
MDG              87.7%                94.2%            1.49
TRFD             69.8%                73.7%            1.67
DYFESM           68.8%                65.6%            N/A
ADM              42.9%                59.6%            3.60
OCEAN            42.8%                91.2%            3.92
TRACK            14.4%                54.6%            2.52
SPICE            11.5%                79.9%            4.06
QCD              4.2%                 75.1%            2.15

Figure G.14 Level of vectorization among the Perfect Club benchmarks when executed on the Cray Y-MP [Vajapeyam 1991]. The first column shows the vectorization level obtained with the compiler, while the second column shows the results after the codes have been hand-optimized by a team of Cray Research programmers. Speedup numbers are not available for FLO52 and DYFESM as the hand-optimized runs used larger data sets than the compiler-optimized runs.

There is also tremendous variation in how well different compilers do in vectorizing programs. As a summary of the state of vectorizing compilers, consider the data in Figure G.15, which shows the extent of vectorization for different processors using a test suite of 100 handwritten FORTRAN kernels. The kernels were designed to test vectorization capability and can all be vectorized by hand; we will see several examples of these loops in the exercises.

Processor          Compiler               Completely vectorized   Partially vectorized   Not vectorized
CDC CYBER 205      VAST-2 V2.21           62                      5                      33
Convex C-series    FC5.0                  69                      5                      26
Cray X-MP          CFT77 V3.0             69                      3                      28
Cray X-MP          CFT V1.15              50                      1                      49
Cray-2             CFT2 V3.1a             27                      1                      72
ETA-10             FTN 77 V1.0            62                      7                      31
Hitachi S810/820   FORT77/HAP V20-2B      67                      4                      29
IBM 3090/VF        VS FORTRAN V2.4        52                      4                      44
NEC SX/2           FORTRAN77 / SX V.040   66                      5                      29

Figure G.15 Result of applying vectorizing compilers to the 100 FORTRAN test kernels. For each processor we indicate how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the Cray X-MP show the large dependence on compiler technology.
G.6 Putting It All Together: Performance of Vector Processors

In this section we look at different measures of performance for vector processors and what they tell us about the processor. To determine the performance of a processor on a vector problem we must look at the start-up cost and the sustained rate. The simplest and best way to report the performance of a vector processor on a loop is to give the execution time of the vector loop. For vector loops people often give the MFLOPS (millions of floating-point operations per second) rating rather than execution time. We use the notation Rn for the MFLOPS rating on a vector of length n. Using the measurements Tn (time) or Rn (rate) is equivalent if the number of FLOPS is agreed upon (see Chapter 1 for a longer discussion on MFLOPS). In any event, either measurement should include the overhead.
In this section we examine the performance of VMIPS on our DAXPY loop by looking at performance from different viewpoints. We will continue to compute the execution time of a vector loop using the equation developed in Section G.3. At the same time, we will look at different ways to measure performance using the computed time. The constant values for Tloop used in this section introduce some small amount of error, which will be ignored.
Measures of Vector Performance

Because vector length is so important in establishing the performance of a processor, length-related measures are often applied in addition to time and MFLOPS. These length-related measures tend to vary dramatically across different processors and are interesting to compare. (Remember, though, that time is always the measure of interest when comparing the relative speed of two processors.) Three of the most important length-related measures are

■ R∞: The MFLOPS rate on an infinite-length vector. Although this measure may be of interest when estimating peak performance, real problems do not have unlimited vector lengths, and the overhead penalties encountered in real problems will be larger.

■ N1/2: The vector length needed to reach one-half of R∞. This is a good measure of the impact of overhead.

■ Nv: The vector length needed to make vector mode faster than scalar mode. This measures both overhead and the speed of scalars relative to vectors.
Let’s look at these measures for our DAXPY problem running on VMIPS. When chained, the inner loop of the DAXPY code in convoys looks like Figure G.16 (assuming that Rx and Ry hold starting addresses). Recall our performance equation for the execution time of a vector loop with n elements, Tn:

$$T_n = \left\lceil \frac{n}{\mathrm{MVL}} \right\rceil \times (T_{\mathrm{loop}} + T_{\mathrm{start}}) + n \times T_{\mathrm{chime}}$$
    LV       V1,Rx
    MULVS.D  V2,V1,F0      Convoy 1: chained load and multiply

    LV       V3,Ry
    ADDV.D   V4,V2,V3      Convoy 2: second load and add, chained

    SV       Ry,V4         Convoy 3: store the result

Figure G.16 The inner loop of the DAXPY code in chained convoys.

Chaining allows the loop to run in three chimes (and no less, since there is one memory pipeline); thus Tchime = 3. If Tchime were a complete indication of performance, the loop would run at an MFLOPS rate of 2/3 × clock rate (since there are 2 FLOPS per iteration). Thus, based only on the chime count, a 500 MHz VMIPS would run this loop at 333 MFLOPS assuming no strip-mining or start-up overhead. There are several ways to improve the performance: add additional vector load-store units, allow convoys to overlap to reduce the impact of start-up overheads, and decrease the number of loads required by vector-register allocation. We will examine the first two extensions in this section. The last optimization is actually used for the Cray-1, VMIPS’s cousin, to boost the performance by 50%. Reducing the number of loads requires an interprocedural optimization; we examine this transformation in Exercise G.6. Before we examine the first two extensions, let’s see what the real performance, including overhead, is.
The Peak Performance of VMIPS on DAXPY

First, we should determine what the peak performance, R∞, really is, since we know it must differ from the ideal 333 MFLOPS rate. For now, we continue to use the simplifying assumption that a convoy cannot start until all the instructions in an earlier convoy have completed; later we will remove this restriction. Using this simplification, the start-up overhead for the vector sequence is simply the sum of the start-up times of the instructions:

$$T_{\mathrm{start}} = 12 + 7 + 12 + 6 + 12 = 49$$

Using MVL = 64, Tloop = 15, Tstart = 49, and Tchime = 3 in the performance equation, and assuming that n is not an exact multiple of 64, the time for an n-element operation is

$$T_n = \left\lceil \frac{n}{64} \right\rceil \times (15 + 49) + 3n \le (n + 64) + 3n = 4n + 64$$

The sustained rate is actually over 4 clock cycles per iteration, rather than the theoretical rate of 3 chimes, which ignores overhead. The major part of the difference is the cost of the start-up overhead for each block of 64 elements (49 cycles versus 15 for the loop overhead). We can now compute R∞ for a 500 MHz clock as

$$R_\infty = \lim_{n \to \infty} \left( \frac{\text{Operations per iteration} \times \text{Clock rate}}{\text{Clock cycles per iteration}} \right)$$

The numerator is independent of n, hence

$$R_\infty = \frac{\text{Operations per iteration} \times \text{Clock rate}}{\lim_{n \to \infty} (\text{Clock cycles per iteration})}$$

$$\lim_{n \to \infty} (\text{Clock cycles per iteration}) = \lim_{n \to \infty} \left( \frac{T_n}{n} \right) = \lim_{n \to \infty} \left( \frac{4n + 64}{n} \right) = 4$$

$$R_\infty = \frac{2 \times 500\ \text{MHz}}{4} = 250\ \text{MFLOPS}$$
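The limit can also be observed numerically by evaluating the performance equation directly. The sketch below uses the constants given above (MVL = 64, Tloop = 15, Tstart = 49, Tchime = 3, a 500 MHz clock, 2 FLOPS per iteration); the function and constant names are illustrative.

```python
MVL, T_LOOP, T_START, T_CHIME = 64, 15, 49, 3
CLOCK_MHZ, FLOPS_PER_ITER = 500, 2


def t_n(n):
    """Execution time in clock cycles for an n-element DAXPY on VMIPS."""
    strips = (n + MVL - 1) // MVL          # ceiling of n / MVL
    return strips * (T_LOOP + T_START) + n * T_CHIME


def mflops(n):
    """Sustained MFLOPS rate R_n for a vector of length n."""
    return FLOPS_PER_ITER * n * CLOCK_MHZ / t_n(n)


for n in (64, 1000, 10**6):
    print(n, t_n(n), round(t_n(n) / n, 3), round(mflops(n), 1))
```

As n grows, the cycles per element settle at 4 and the rate approaches 250 MFLOPS, in agreement with the derivation.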
The performance without the start-up overhead, which is the peak performance given the vector functional unit structure, is now 1.33 times higher. In actuality the gap between peak and sustained performance for this benchmark is even larger!
Sustained Performance of VMIPS on the Linpack Benchmark

The Linpack benchmark is a Gaussian elimination on a 100 × 100 matrix. Thus, the vector element lengths range from 99 down to 1. A vector of length k is used k times. Thus, the average vector length is given by

$$\frac{\sum_{i=1}^{99} i^2}{\sum_{i=1}^{99} i} = 66.3$$

Now we can obtain an accurate estimate of the performance of DAXPY using a vector length of 66.

$$T_{66} = 2 \times (15 + 49) + 66 \times 3 = 128 + 198 = 326$$

$$R_{66} = \frac{2 \times 66 \times 500}{326}\ \text{MFLOPS} = 202\ \text{MFLOPS}$$
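Both the weighted average and the resulting rate are easy to verify. The following sketch recomputes them from the definitions above; the function names are hypothetical, and the default parameters are the VMIPS constants used in this section.

```python
def average_vector_length(size=100):
    """Weighted mean of lengths 1..size-1, where length k is used k times."""
    lengths = range(1, size)
    return sum(k * k for k in lengths) / sum(lengths)


def t_n(n, mvl=64, t_loop=15, t_start=49, t_chime=3):
    """Clock cycles for an n-element DAXPY, from the performance equation."""
    strips = (n + mvl - 1) // mvl          # ceiling of n / MVL
    return strips * (t_loop + t_start) + n * t_chime


avg = average_vector_length()
cycles = t_n(66)
rate = 2 * 66 * 500 / cycles               # MFLOPS at a 500 MHz clock
print(round(avg, 1), cycles, round(rate))
```

A length of 66 just exceeds MVL = 64, so the loop pays the strip overhead twice, which is why the factor of 2 appears in T66.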
The peak number, ignoring start-up overhead, is 1.64 times higher than this estimate of sustained performance on the real vector lengths. In actual practice, the Linpack benchmark contains a nontrivial fraction of code that cannot be vectorized. Although this code accounts for less than 20% of the time before vectorization, it runs at less than one-tenth of the performance when counted as FLOPS. Thus, Amdahl’s Law tells us that the overall performance will be significantly lower than the performance estimated from analyzing the inner loop.

Since vector length has a significant impact on performance, the N1/2 and Nv measures are often used in comparing vector machines.

Example

What is N1/2 for just the inner loop of DAXPY for VMIPS with a 500 MHz clock?

Answer

Using R∞ as the peak rate, we want to know the vector length that will achieve about 125 MFLOPS. We start with the formula for MFLOPS assuming that the measurement is made for N1/2 elements:

$$\text{MFLOPS} = \frac{\text{FLOPS executed in } N_{1/2} \text{ iterations}}{\text{Clock cycles to execute } N_{1/2} \text{ iterations}} \times \frac{\text{Clock cycles}}{\text{Second}} \times 10^{-6}$$

$$125 = \frac{2 \times N_{1/2}}{T_{N_{1/2}}} \times 500$$

Simplifying this and then assuming N1/2 ≤ 64, so that $T_{n \le 64} = 1 \times 64 + 3 \times n$, yields

$$T_{N_{1/2}} = 8 \times N_{1/2}$$

$$1 \times 64 + 3 \times N_{1/2} = 8 \times N_{1/2}$$

$$5 \times N_{1/2} = 64$$

$$N_{1/2} = 12.8$$

So N1/2 = 13; that is, a vector of length 13 gives approximately one-half the peak performance for the DAXPY loop on VMIPS.
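The same value can be found by a direct search rather than by algebra. This sketch assumes the single-strip time Tn = 64 + 3n for n ≤ 64 and the peak rate R∞ = 250 MFLOPS derived earlier; the names are illustrative.

```python
def t_n(n):
    """Clock cycles for n <= 64 elements: one strip of overhead plus 3 chimes."""
    return 64 + 3 * n


def mflops(n, clock_mhz=500):
    """MFLOPS for a length-n DAXPY (2 FLOPS per element)."""
    return 2 * n * clock_mhz / t_n(n)


R_INF = 250.0
# Smallest vector length whose rate reaches half of the peak rate.
n_half = next(n for n in range(1, 65) if mflops(n) >= R_INF / 2)
print(n_half, round(mflops(n_half), 1))
```

The search confirms that length 12 falls just short of 125 MFLOPS and length 13 is the first to reach it.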
Example

What is the vector length, Nv, such that the vector operation runs faster than the scalar?

Answer

Again, we know that Nv < 64. The time to do one iteration in scalar mode can be estimated as 10 + 12 + 12 + 7 + 6 + 12 = 59 clocks, where 10 is the estimate of the loop overhead, known to be somewhat less than the strip-mining loop overhead. In the last problem, we showed that this vector loop runs in vector mode in time $T_{n \le 64} = 64 + 3 \times n$ clock cycles. Therefore,

$$64 + 3N_v = 59N_v$$

$$N_v = \left\lceil \frac{64}{56} \right\rceil$$

$$N_v = 2$$

For the DAXPY loop, vector mode is faster than scalar as long as the vector has at least two elements. This number is surprisingly small, as we will see in the next section (“Fallacies and Pitfalls”).
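The break-even length can be confirmed by comparing the two time estimates element count by element count. The sketch assumes the figures used in the answer (59 clocks per scalar iteration, 64 + 3n clocks in vector mode for n ≤ 64); the function names are hypothetical.

```python
def vector_time(n):
    """Vector-mode clock cycles for n <= 64 elements."""
    return 64 + 3 * n


def scalar_time(n, cycles_per_iteration=59):
    """Scalar-mode clock cycles: 10 + 12 + 12 + 7 + 6 + 12 = 59 per iteration."""
    return cycles_per_iteration * n


# Smallest vector length for which vector mode is strictly faster.
n_v = next(n for n in range(1, 65) if vector_time(n) < scalar_time(n))
print(n_v)
```

With a single element the vector version costs 67 clocks against 59 for scalar code, and with two elements it costs 70 against 118.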
DAXPY Performance on an Enhanced VMIPS

DAXPY, like many vector problems, is memory limited. Consequently, performance could be improved by adding more memory access pipelines. This is the major architectural difference between the Cray X-MP (and later processors) and the Cray-1. The Cray X-MP has three memory pipelines, compared with the Cray-1’s single memory pipeline, and the X-MP has more flexible chaining. How does this affect performance?

Example

What would be the value of T66 for DAXPY on VMIPS if we added two more memory pipelines?

Answer

With three memory pipelines all the instructions fit in one convoy and take one chime. The start-up overheads are the same, so

$$T_{66} = \left\lceil \frac{66}{64} \right\rceil \times (T_{\mathrm{loop}} + T_{\mathrm{start}}) + 66 \times T_{\mathrm{chime}}$$

$$T_{66} = 2 \times (15 + 49) + 66 \times 1 = 194$$
With three memory pipelines, we have reduced the clock cycle count for sustained performance from 326 to 194, a factor of 1.7. Note the effect of Amdahl’s Law: We improved the theoretical peak rate as measured by the number of chimes by a factor of 3, but only achieved an overall improvement of a factor of 1.7 in sustained performance.

Another improvement could come from allowing different convoys to overlap and also allowing the scalar loop overhead to overlap with the vector instructions. This requires that one vector operation be allowed to begin using a functional unit before another operation has completed, which complicates the instruction issue logic. Allowing this overlap eliminates the separate start-up overhead for every convoy except the first and hides the loop overhead as well.

To achieve the maximum hiding of strip-mining overhead, we need to be able to overlap strip-mined instances of the loop, allowing two instances of a convoy as well as possibly two instances of the scalar code to be in execution simultaneously. This requires the same techniques we looked at in Chapter 4 to avoid WAR hazards, although because no overlapped read and write of a single vector element is possible, copying can be avoided. This technique, called tailgating, was used in the Cray-2. Alternatively, we could unroll the outer loop to create several instances of the vector sequence using different register sets (assuming sufficient registers), just as we did in Chapter 4. By allowing maximum overlap of the convoys and the scalar loop overhead, the start-up and loop overheads will only be seen once per vector sequence, independent of the number of convoys and the instructions in each convoy. In this way a processor with vector registers can have both low start-up overhead for short vectors and high peak performance for very long vectors.

Example

What would be the values of R∞ and T66 for DAXPY on VMIPS if we added two more memory pipelines and allowed the strip-mining and start-up overheads to be fully overlapped?

Answer

$$R_\infty = \lim_{n \to \infty} \left( \frac{\text{Operations per iteration} \times \text{Clock rate}}{\text{Clock cycles per iteration}} \right)$$

$$\lim_{n \to \infty} (\text{Clock cycles per iteration}) = \lim_{n \to \infty} \left( \frac{T_n}{n} \right)$$

Since the overhead is only seen once, Tn = n + 49 + 15 = n + 64. Thus,

$$\lim_{n \to \infty} \left( \frac{T_n}{n} \right) = \lim_{n \to \infty} \left( \frac{n + 64}{n} \right) = 1$$

$$R_\infty = \frac{2 \times 500\ \text{MHz}}{1} = 1000\ \text{MFLOPS}$$
Adding the extra memory pipelines and more flexible issue logic yields an improvement in peak performance of a factor of 4. However, T66 = 130, so for shorter vectors, the sustained performance improvement is about 326/130 = 2.5 times.

In summary, we have examined several measures of vector performance. Theoretical peak performance can be calculated based purely on the value of Tchime as

$$\frac{\text{Number of FLOPS per iteration} \times \text{Clock rate}}{T_{\mathrm{chime}}}$$

By including the loop overhead, we can calculate values for peak performance for an infinite-length vector (R∞) and also for sustained performance, Rn for a vector of length n, which is computed as

$$R_n = \frac{\text{Number of FLOPS per iteration} \times n \times \text{Clock rate}}{T_n}$$
Using these measures we also can find N1/2 and Nv, which give us another way of looking at the start-up overhead for vectors and the ratio of vector to scalar speed. A wide variety of measures of performance of vector processors are useful in understanding the range of performance that applications may see on a vector processor.
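The three DAXPY configurations analyzed in this section can be compared side by side. The sketch below restates the three timing models under the constants used throughout (MVL = 64, Tloop = 15, Tstart = 49, a 500 MHz clock); the function names are illustrative labels, not terms from the text.

```python
def strips(n, mvl=64):
    """Number of strip-mined iterations, the ceiling of n / MVL."""
    return (n + mvl - 1) // mvl


def t_base(n):
    """One memory pipeline: 3 chimes, overhead paid on every strip."""
    return strips(n) * (15 + 49) + 3 * n


def t_three_pipes(n):
    """Three memory pipelines: 1 chime, overhead still paid on every strip."""
    return strips(n) * (15 + 49) + 1 * n


def t_overlapped(n):
    """Three pipelines with fully overlapped overhead, seen only once."""
    return n + 49 + 15


def mflops(cycles, n, clock_mhz=500):
    """Sustained rate R_n for 2 FLOPS per element."""
    return 2 * n * clock_mhz / cycles


for name, t in (("base", t_base), ("three pipes", t_three_pipes),
                ("overlapped", t_overlapped)):
    print(name, t(66), round(mflops(t(66), 66)))
```

For n = 66 the cycle counts are 326, 194, and 130, which reproduces the factors of about 1.7 and 2.5 quoted for sustained performance.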
G.7 Fallacies and Pitfalls

Pitfall

Concentrating on peak performance and ignoring start-up overhead.

Early memory-memory vector processors such as the TI ASC and the CDC STAR-100 had long start-up times. For some vector problems, Nv could be greater than 100! On the CYBER 205, derived from the STAR-100, the start-up overhead for DAXPY is 158 clock cycles, substantially increasing the break-even point. With a single vector unit, which contains 2 memory pipelines, the CYBER 205 can sustain a rate of 2 clocks per iteration. The time for DAXPY for a vector of length n is therefore roughly 158 + 2n. If the clock rates of the Cray-1 and the CYBER 205 were identical, the Cray-1 would be faster until n > 64. Because the Cray-1 clock is also faster (even though the 205 is newer), the crossover point is over 100. Comparing a four-lane CYBER 205 (the maximum-size processor) with the Cray X-MP that was delivered shortly after the 205, the 205 has a peak rate of two results per clock cycle, twice as fast as the X-MP. However, vectors must be longer than about 200 for the CYBER 205 to be faster.
The problem of start-up overhead has been a major difficulty for the memory-memory vector architectures, hence their lack of popularity.

Pitfall

Increasing vector performance, without comparable increases in scalar performance.

This was a problem on many early vector processors, and a place where Seymour Cray rewrote the rules. Many of the early vector processors had comparatively slow scalar units (as well as large start-up overheads). Even today, processors with higher peak vector performance can be outperformed by a processor with lower vector performance but better scalar performance. Good scalar performance keeps down overhead costs (strip mining, for example) and reduces the impact of Amdahl’s Law. A good example of this comes from comparing a fast scalar processor and a vector processor with lower scalar performance. The Livermore FORTRAN kernels are a collection of 24 scientific kernels with varying degrees of vectorization. Figure G.17 shows the performance of two different processors on this benchmark. Despite the vector processor's higher peak performance, its low scalar performance makes it slower than a fast scalar processor as measured by the harmonic mean. The next fallacy is closely related.
Fallacy
You can get vector performance without providing memory bandwidth. As we saw with the DAXPY loop, memory bandwidth is quite important. DAXPY requires 1.5 memory references per floating-point operation, and this ratio is typical of many scientific codes. Even if the floating-point operations took no time, a Cray-1 could not increase the performance of the vector sequence used, since it is memory limited. The Cray-1 performance on Linpack jumped when the compiler used clever transformations to change the computation so that values could be kept in the vector registers. This lowered the number of memory references per FLOP and improved the performance by nearly a factor of 2! Thus, the memory bandwidth on
Processor | Minimum rate for any loop (MFLOPS) | Maximum rate for any loop (MFLOPS) | Harmonic mean of all 24 loops (MFLOPS)
MIPS M/120-5 | 0.80 | 3.89 | 1.85
Stardent-1500 | 0.41 | 10.08 | 1.72
Figure G.17 Performance measurements for the Livermore FORTRAN kernels on two different processors. Both the MIPS M/120-5 and the Stardent-1500 (formerly the Ardent Titan-1) use a 16.7 MHz MIPS R2000 chip for the main CPU. The Stardent-1500 uses its vector unit for scalar FP and has about half the scalar performance (as measured by the minimum rate) of the MIPS M/120, which uses the MIPS R2010 FP chip. The vector processor is more than a factor of 2.5 times faster for a highly vectorizable loop (maximum rate). However, the lower scalar performance of the Stardent-1500 negates the higher vector performance when total performance is measured by the harmonic mean on all 24 loops.
the Cray-1 became sufficient for a loop that formerly required more bandwidth. This ability to reuse values from vector registers is another advantage of vector-register architectures compared with memory-memory vector architectures, which have to fetch all vector operands from memory, requiring even greater memory bandwidth.
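The memory-bound argument in this fallacy can be made concrete with a rough bound. This is a sketch under simplifying assumptions: the function name is hypothetical, the machine is modeled only by how many memory words it can move per clock, and start-up and strip-mining costs are ignored. The 1.5 references per FLOP for DAXPY is from the text (two loads and one store for each multiply-add pair).

```python
def max_flops_per_clock(mem_words_per_clock, mem_refs_per_flop):
    """Upper bound on floating-point rate when memory traffic is the only limit."""
    if mem_refs_per_flop <= 0:
        raise ValueError("mem_refs_per_flop must be positive")
    return mem_words_per_clock / mem_refs_per_flop

# DAXPY: load X(i), load Y(i), store Y(i) -> 3 references for 2 FLOPs.
DAXPY_REFS_PER_FLOP = 3 / 2

if __name__ == "__main__":
    # Assuming one memory word per clock, DAXPY cannot exceed 2/3 FLOP per
    # clock, no matter how fast the arithmetic units are.
    print(max_flops_per_clock(1, DAXPY_REFS_PER_FLOP))
    # Halving the references per FLOP (values kept in vector registers)
    # doubles the bound.
    print(max_flops_per_clock(1, DAXPY_REFS_PER_FLOP / 2))
```

Under this model, the only ways to raise the bound are to add memory bandwidth or to lower the references per FLOP, which is what the register-reuse transformation described above does.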
G.8
Concluding Remarks
During the 1980s and 1990s, rapid performance increases in pipelined scalar processors led to a dramatic closing of the gap between traditional vector supercomputers and fast, pipelined, superscalar VLSI microprocessors. In 2002, it is possible to buy a complete desktop computer system for under $1000 that has a higher CPU clock rate than any available vector supercomputer, even those costing tens of millions of dollars. Although the vector supercomputers have lower clock rates, they support greater parallelism through the use of multiple lanes (up to 16 in the Japanese designs) versus the limited multiple issue of the superscalar microprocessors. Nevertheless, the peak floating-point performance of the low-cost microprocessors is within a factor of 4 of the leading vector supercomputer CPUs. Of course, high clock rates and high peak performance do not necessarily translate into sustained application performance.
Main memory bandwidth is the key distinguishing feature between vector supercomputers and superscalar microprocessor systems. The fastest microprocessors in 2002 can sustain around 1 GB/sec of main memory bandwidth, while the fastest vector supercomputers can sustain around 50 GB/sec per CPU. For nonunit stride accesses the bandwidth discrepancy is even greater. For certain scientific and engineering applications, performance correlates directly with nonunit stride main memory bandwidth, and these are the applications for which vector supercomputers remain popular. Providing this large nonunit stride memory bandwidth is one of the major expenses in a vector supercomputer, and traditionally SRAM was used as main memory to reduce the number of memory banks needed and to reduce vector start-up penalties. While SRAM has an access time several times lower than that of DRAM, it costs roughly 10 times as much per bit!
To reduce main memory costs and to allow larger capacities, all modern vector supercomputers now use DRAM for main memory, taking advantage of new higher-bandwidth DRAM interfaces such as synchronous DRAM. This adoption of DRAM for main memory (pioneered by Seymour Cray in the Cray-2) is one example of how vector supercomputers are adapting commodity technology to improve their price-performance. Another example is that vector supercomputers are now including vector data caches. Caches are not effective for all vector codes, however, and so these vector caches are designed to allow high main memory bandwidth even in the presence of many cache misses. For example, the cache on the Cray SV1 can support 384 outstanding cache misses per CPU, while for microprocessors 8–16 outstanding misses is a more typical maximum number.
Another example is the demise of bipolar ECL or gallium arsenide as technologies of choice for supercomputer CPU logic. Because of the huge investment in CMOS technology made possible by the success of the desktop computer, CMOS now offers competitive transistor performance with much greater transistor density and much reduced power dissipation compared with these more exotic technologies. As a result, all leading vector supercomputers are now built with the same CMOS technology as superscalar microprocessors. The primary reason that vector supercomputers now have lower clock rates than commodity microprocessors is that they are developed using standard cell ASIC techniques rather than full custom circuit design to reduce the engineering design cost. While a microprocessor design may sell tens of millions of copies and can amortize the design cost over this large number of units, a vector supercomputer is considered a success if over a hundred units are sold! Conversely, superscalar microprocessor designs have begun to absorb some of the techniques made popular in earlier vector computer systems. Many multimedia applications contain code that can be vectorized, and as discussed in Chapter 2, most commercial microprocessor ISAs have added multimedia extensions that resemble short vector instructions. A common technique is to allow a wide 64-bit register to be split into smaller subwords that are operated on in parallel. This idea was used in the early TI ASC and CDC STAR-100 vector machines, where a 64-bit lane could be split into two 32-bit lanes to give higher performance on lower-precision data. Although the initial microprocessor multimedia extensions were very limited in scope, newer extensions such as AltiVec for the IBM/Motorola PowerPC and SSE2 for the Intel x86 processors have both increased the vector length to 128 bits (still small compared with the 4096 bits in a VMIPS vector register) and added better support for vector compilers. 
Vector instructions are particularly appealing for embedded processors because they support high degrees of parallelism at low cost and with low power dissipation, and have been used in several game machines such as the Nintendo-64 and the Sony Playstation 2 to boost graphics performance. We expect that microprocessors will continue to extend their support for vector operations, as this represents a much simpler approach to boosting performance for an important class of applications compared with the hardware complexity of increasing scalar instruction issue width, or the software complexity of managing multiple parallel processors.
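The subword idea described above can be illustrated in a few lines. The sketch below emulates a 64-bit register holding four 16-bit lanes and adds two such registers lane by lane with wraparound. The helper names are hypothetical, and real multimedia extensions do this in hardware with dedicated instructions rather than with the masking trick used here.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
LOW_BITS = MASK64 ^ HIGH_BITS       # low 15 bits of each lane

def pack16(lanes):
    """Pack four 16-bit values (lane 0 in the low bits) into one 64-bit word."""
    assert len(lanes) == 4
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def unpack16(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

def packed_add16(a, b):
    """Add four 16-bit lanes at once; carries never cross a lane boundary."""
    # Add the low 15 bits of every lane; any carry lands in that lane's top bit.
    partial = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the original top bits with XOR, discarding the carry out of each lane.
    return (partial ^ ((a ^ b) & HIGH_BITS)) & MASK64

if __name__ == "__main__":
    x = pack16([1, 2, 3, 0xFFFF])
    y = pack16([1, 1, 1, 1])
    print(unpack16(packed_add16(x, y)))   # [2, 3, 4, 0]
```

One wide add produces four independent results, which is the sense in which these extensions resemble very short vector instructions.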
G.9
Historical Perspective and References
The first vector processors were the CDC STAR-100 (see Hintz and Tate [1972]) and the TI ASC (see Watson [1972]), both announced in 1972. Both were memory-memory vector processors. They had relatively slow scalar units (the STAR used the same units for scalars and vectors), making the scalar pipeline extremely deep. Both processors had high start-up overhead and worked on vectors of several hundred to several thousand elements. The crossover between scalar and vector could be over 50 elements. It appears that not enough attention was paid to the role of Amdahl’s Law on these two processors.
Cray, who worked on the 6600 and the 7600 at CDC, founded Cray Research and introduced the Cray-1 in 1976 (see Russell [1978]). The Cray-1 used a vector-register architecture to significantly lower start-up overhead and to reduce memory bandwidth requirements. The Cray-1 also had efficient support for nonunit stride, and Cray invented chaining. Most importantly, the Cray-1 was the fastest scalar processor in the world at that time. This matching of good scalar and vector performance was probably the most significant factor in making the Cray-1 a success. Some customers bought the processor primarily for its outstanding scalar performance. Many subsequent vector processors are based on the architecture of this first commercially successful vector processor. Baskett and Keller [1977] provide a good evaluation of the Cray-1.
In 1981, CDC started shipping the CYBER 205 (see Lincoln [1982]). The 205 had the same basic architecture as the STAR, but offered improved performance all around as well as expandability of the vector unit with up to four lanes, each with multiple functional units and a wide load-store pipe that provided multiple words per clock. The peak performance of the CYBER 205 greatly exceeded the performance of the Cray-1. However, on real programs, the performance difference was much smaller.
The CDC STAR processor and its descendant, the CYBER 205, were memory-memory vector processors. To keep the hardware simple and support the high bandwidth requirements (up to three memory references per floating-point operation), these processors did not efficiently handle nonunit stride. While most loops have unit stride, a nonunit stride loop had poor performance on these processors because memory-to-memory data movements were required to gather together (and scatter back) the nonadjacent vector elements; these operations used special scatter-gather instructions.
In addition, there was special support for sparse vectors that used a bit vector to represent the zeros and nonzeros and a dense vector of nonzero values. These more complex vector operations were slow because of the long memory latency, and it was often faster to use scalar mode for sparse or nonunit stride operations. Schneck [1987] described several of the early pipelined processors (e.g., Stretch) through the first vector processors, including the 205 and Cray-1. Dongarra [1986] did another good survey, focusing on more recent processors. In 1983, Cray Research shipped the first Cray X-MP (see Chen [1983]). With an improved clock rate (9.5 ns versus 12.5 ns on the Cray-1), better chaining support, and multiple memory pipelines, this processor maintained the Cray Research lead in supercomputers. The Cray-2, a completely new design configurable with up to four processors, was introduced later. A major feature of the Cray-2 was the use of DRAM, which made it possible to have very large memories. The first Cray-2 with its 256M word (64-bit words) memory contained more memory than the total of all the Cray machines shipped to that point! The Cray-2 had a much faster clock than the X-MP, but also much deeper pipelines; however, it lacked chaining, had an enormous memory latency, and had only one memory pipe per processor. In general, the Cray-2 is only faster than the Cray X-MP on problems that require its very large main memory.
The 1980s also saw the arrival of smaller-scale vector processors, called minisupercomputers. Priced at roughly one-tenth the cost of a supercomputer ($0.5 to $1 million versus $5 to $10 million), these processors caught on quickly. Although many companies joined the market, the two companies that were most successful were Convex and Alliant. Convex started with the uniprocessor C-1 vector processor and then offered a series of small multiprocessors ending with the C-4 announced in 1994. The keys to the success of Convex over this period were their emphasis on Cray software capability, the effectiveness of their compiler (see Figure G.15), and the quality of their UNIX OS implementation. The C-4 was the last vector machine Convex sold; they switched to making large-scale multiprocessors using Hewlett-Packard RISC microprocessors and were bought by HP in 1995. Alliant [1987] concentrated more on the multiprocessor aspects; they built an eight-processor computer, with each processor offering vector capability. Alliant ceased operation in the early 1990s.
In the early 1980s, CDC spun out a group, called ETA, to build a new supercomputer, the ETA-10, capable of 10 GFLOPS. The ETA processor was delivered in the late 1980s (see Fazio [1987]) and used low-temperature CMOS in a configuration with up to 10 processors. Each processor retained the memory-memory architecture based on the CYBER 205. Although the ETA-10 achieved enormous peak performance, its scalar speed was not comparable. In 1989 CDC, the first supercomputer vendor, closed ETA and left the supercomputer design business.
In 1986, IBM introduced the System/370 vector architecture (see Moore et al. [1987]) and its first implementation in the 3090 Vector Facility. The architecture extends the System/370 architecture with 171 vector instructions. The 3090/VF is integrated into the 3090 CPU. Unlike most other vector processors, the 3090/VF routes its vectors through the cache.
In 1983, processor vendors from Japan entered the supercomputer marketplace, starting with the Fujitsu VP100 and VP200 (see Miura and Uchida [1983]), and later expanding to include the Hitachi S810 and the NEC SX/2 (see Watanabe [1987]). These processors have proved to be close to the Cray X-MP in performance. In general, these three processors have much higher peak performance than the Cray X-MP. However, because of large start-up overhead, their typical performance is often lower than the Cray X-MP (see Figure 1.32 in Chapter 1). The Cray X-MP favored a multiple-processor approach, first offering a two-processor version and later a four-processor. In contrast, the three Japanese processors had expandable vector capabilities. In 1988, Cray Research introduced the Cray Y-MP—a bigger and faster version of the X-MP. The Y-MP allows up to eight processors and lowers the cycle time to 6 ns. With a full complement of eight processors, the Y-MP was generally the fastest supercomputer, though the single-processor Japanese supercomputers may be faster than a one-processor Y-MP. In late 1989 Cray Research was split into two companies, both aimed at building high-end processors available in the early 1990s. Seymour Cray headed the spin-off, Cray Computer Corporation,
until its demise in 1995. Their initial processor, the Cray-3, was to be implemented in gallium arsenide, but they were unable to develop a reliable and cost-effective implementation technology. A single Cray-3 prototype was delivered to the National Center for Atmospheric Research (NCAR) for evaluation purposes in 1993, but no paying customers were found for the design. The Cray-4 prototype, which was to have been the first processor to run at 1 GHz, was close to completion when the company filed for bankruptcy. Shortly before his tragic death in a car accident in 1996, Seymour Cray started yet another company, SRC Computers, to develop high-performance systems but this time using commodity components. In 2000, SRC announced the SRC-6 system that combines 512 Intel microprocessors, 5 billion gates of reconfigurable logic, and a high-performance vector-style memory system.
Cray Research focused on the C90, a new high-end processor with up to 16 processors and a clock rate of 240 MHz. This processor was delivered in 1991. Typical configurations are about $15 million. In 1993, Cray Research introduced their first highly parallel processor, the T3D, employing up to 2048 Digital Alpha 21064 microprocessors. In 1995, they announced the availability of both a new low-end vector machine, the J90, and a high-end machine, the T90. The T90 is much like the C90, but offers a clock that is nearly twice as fast (460 MHz), using three-dimensional packaging and optical clock distribution. Like the C90, the T90 costs in the tens of millions, though a single CPU is available for $2.5 million. The T90 was the last bipolar ECL vector machine built by Cray. The J90 is a CMOS-based vector machine using DRAM memory starting at $250,000, but with typical configurations running about $1 million.
In mid-1995, Cray Research was acquired by Silicon Graphics, and in 1998 released the SV1 system, which grafted considerably faster CMOS processors onto the J90 memory system, and which also added a data cache for vectors to each CPU to help meet the increased memory bandwidth demands. Silicon Graphics sold Cray Research to Tera Computer in 2000, and the joint company was renamed Cray Inc. Cray Inc. plans to release the SV2 in 2002, which will be based on a completely new vector ISA. The Japanese supercomputer makers have continued to evolve their designs and have generally placed greater emphasis on increasing the number of lanes in their vector units. In 2001, the NEC SX/5 was generally held to be the fastest available vector supercomputer, with 16 lanes clocking at 312 MHz and with up to 16 processors sharing the same memory. The Fujitsu VPP5000 was announced in 2001 and also had 16 lanes and clocked at 300 MHz, but connected up to 128 processors in a distributed-memory cluster. In 2001, Cray Inc. announced that they would be marketing the NEC SX/5 machine in the United States, after many years in which Japanese supercomputers were unavailable to U.S. customers after the U.S. Commerce Department found NEC and Fujitsu guilty of bidding below cost for a 1996 NCAR supercomputer contract and imposed heavy import duties on their products. The basis for modern vectorizing compiler technology and the notion of data dependence was developed by Kuck and his colleagues [1974] at the University of Illinois. Banerjee [1979] developed the test named after him. Padua and Wolfe [1986] give a good overview of vectorizing compiler technology.
Benchmark studies of various supercomputers, including attempts to understand the performance differences, have been undertaken by Lubeck, Moore, and Mendez [1985], Bucher [1983], and Jordan [1987]. In Chapter 1, we discussed several benchmark suites aimed at scientific usage and often employed for supercomputer benchmarking, including Linpack and the Lawrence Livermore Laboratories FORTRAN kernels. The University of Illinois coordinated the collection of a set of benchmarks for supercomputers, called the Perfect Club. In 1993, the Perfect Club was integrated into SPEC, which released a set of benchmarks, SPEChpc96, aimed at high-end scientific processing in 1996. The NAS parallel benchmarks developed at the NASA Ames Research Center [Bailey et al. 1991] have become a popular set of kernels and applications used for supercomputer evaluation.
In less than 30 years vector processors have gone from unproven, new architectures to playing a significant role in the goal to provide engineers and scientists with ever larger amounts of computing power. However, the enormous price-performance advantages of microprocessor technology are bringing this era to an end. Advanced superscalar microprocessors are approaching the peak performance of the fastest vector processors, and in 2001, most of the highest-performance machines in the world were large-scale multiprocessors based on these microprocessors. Vector supercomputers remain popular for certain applications including car crash simulation and weather prediction that rely heavily on scatter-gather performance over large data sets and for which effective massively parallel programs have yet to be written. Over time, we expect that microprocessors will support higher-bandwidth memory systems, and that more applications will be parallelized and/or tuned for cached multiprocessor systems.
As the set of applications best suited for vector supercomputers shrinks, they will become less viable as commercial products and will eventually disappear. But vector processing techniques will likely survive as an integral part of future microprocessor architectures, with the currently popular SIMD multimedia extensions representing the first step in this direction.
References
Alliant Computer Systems Corp. [1987]. Alliant FX/Series: Product Summary (June), Acton, Mass.
Asanovic, K. [1998]. “Vector microprocessors,” Ph.D. thesis, Computer Science Division, Univ. of California at Berkeley (May).
Bailey, D. H., E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga [1991]. “The NAS parallel benchmarks,” Int’l. J. Supercomputing Applications 5, 63–73.
Banerjee, U. [1979]. “Speedup of ordinary programs,” Ph.D. thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign (October).
Baskett, F., and T. W. Keller [1977]. “An Evaluation of the Cray-1 Processor,” in High Speed Computer and Algorithm Organization, D. J. Kuck, D. H. Lawrie, and A. H. Sameh, eds., Academic Press, San Diego, 71–84.
Brandt, M., J. Brooks, M. Cahir, T. Hewitt, E. Lopez-Pineda, and D. Sandness [2000]. The Benchmarker’s Guide for Cray SV1 Systems. Cray Inc., Seattle, Wash.
Bucher, I. Y. [1983]. “The computational speed of supercomputers,” Proc. SIGMETRICS Conf. on Measuring and Modeling of Computer Systems, ACM (August), 151–165.
Callahan, D., J. Dongarra, and D. Levine [1988]. “Vectorizing compilers: A test suite and results,” Supercomputing ’88, ACM/IEEE (November), Orlando, Fla., 98–105.
Chen, S. [1983]. “Large-scale and high-speed multiprocessor system for scientific applications,” Proc. NATO Advanced Research Work on High Speed Computing (June); also in K. Hwang, ed., “Superprocessors: Design and applications,” IEEE (August), 1984.
Dongarra, J. J. [1986]. “A survey of high performance processors,” COMPCON, IEEE (March), 8–11.
Fazio, D. [1987]. “It’s really much more fun building a supercomputer than it is simply inventing one,” COMPCON, IEEE (February), 102–105.
Flynn, M. J. [1966]. “Very high-speed computing systems,” Proc. IEEE 54:12 (December), 1901–1909.
Hintz, R. G., and D. P. Tate [1972]. “Control data STAR-100 processor design,” COMPCON, IEEE (September), 1–4.
Jordan, K. E. [1987]. “Performance comparison of large-scale scientific processors: Scalar mainframes, mainframes with vector facilities, and supercomputers,” Computer 20:3 (March), 10–23.
Kuck, D., P. P. Budnik, S.-C. Chen, D. H. Lawrie, R. A. Towle, R. E. Strebendt, E. W. Davis, Jr., J. Han, P. W. Kraska, and Y. Muraoka [1974]. “Measurements of parallelism in ordinary FORTRAN programs,” Computer 7:1 (January), 37–46.
Lincoln, N. R. [1982]. “Technology and design trade offs in the creation of a modern supercomputer,” IEEE Trans. on Computers C-31:5 (May), 363–376.
Lubeck, O., J. Moore, and R. Mendez [1985]. “A benchmark comparison of three supercomputers: Fujitsu VP-200, Hitachi S810/20, and Cray X-MP/2,” Computer 18:1 (January), 10–29.
Miranker, G. S., J. Rubenstein, and J. Sanguinetti [1988]. “Squeezing a Cray-class supercomputer into a single-user package,” COMPCON, IEEE (March), 452–456.
Miura, K., and K. Uchida [1983]. “FACOM vector processing system: VP100/200,” Proc. NATO Advanced Research Work on High Speed Computing (June); also in K. Hwang, ed., “Superprocessors: Design and applications,” IEEE (August 1984), 59–73.
Moore, B., A. Padegs, R. Smith, and W. Bucholz [1987]. “Concepts of the System/370 vector architecture,” Proc. 14th Symposium on Computer Architecture (June), ACM/IEEE, Pittsburgh, 282–292.
Padua, D., and M. Wolfe [1986]. “Advanced compiler optimizations for supercomputers,” Comm. ACM 29:12 (December), 1184–1201.
Russell, R. M. [1978]. “The Cray-1 processor system,” Comm. of the ACM 21:1 (January), 63–72.
Schneck, P. B. [1987]. Superprocessor Architecture, Kluwer Academic Publishers, Norwell, Mass.
Smith, B. J. [1981]. “Architecture and applications of the HEP multiprocessor system,” Real-Time Signal Processing IV 298 (August), 241–248.
Sporer, M., F. H. Moss, and C. J. Mathais [1988]. “An introduction to the architecture of the Stellar Graphics supercomputer,” COMPCON, IEEE (March), 464.
Vajapeyam, S. [1991]. “Instruction-level characterization of the Cray Y-MP processor,” Ph.D. thesis, Computer Sciences Department, University of Wisconsin-Madison.
Watanabe, T. [1987]. “Architecture and performance of the NEC supercomputer SX system,” Parallel Computing 5, 247–255.
Watson, W. J. [1972]. “The TI ASC—a highly modular and flexible super processor architecture,” Proc. AFIPS Fall Joint Computer Conf., 221–228.
Exercises
In these exercises assume VMIPS has a clock rate of 500 MHz and that Tloop = 15. Use the start-up times from Figure G.4, and assume that the store latency is always included in the running time.
G.1
[10] Write a VMIPS vector sequence that achieves the peak MFLOPS performance of the processor (use the functional unit and instruction description in Section G.2). Assuming a 500-MHz clock rate, what is the peak MFLOPS?
G.2
[20/15/15] Consider the following vector code run on a 500-MHz version of VMIPS for a fixed vector length of 64:
    LV      V1,Ra
    MULV.D  V2,V1,V3
    ADDV.D  V4,V1,V3
    SV      Rb,V2
    SV      Rc,V4
Ignore all strip-mining overhead, but assume that the store latency must be included in the time to perform the loop. The entire sequence produces 64 results.
a. [20] Assuming no chaining and a single memory pipeline, how many chimes are required? How many clock cycles per result (including both stores as one result) does this vector sequence require, including start-up overhead?
b. [15] If the vector sequence is chained, how many clock cycles per result does this sequence require, including overhead?
c. [15] Suppose VMIPS had three memory pipelines and chaining. If there were no bank conflicts in the accesses for the above loop, how many clock cycles are required per result for this sequence?
G.3
[20/20/15/15/20/20/20] Consider the following FORTRAN code:
          do 10 i=1,n
             A(i) = A(i) + B(i)
             B(i) = x * B(i)
    10    continue
Use the techniques of Section G.6 to estimate performance throughout this exercise, assuming a 500-MHz version of VMIPS.
a. [20] Write the best VMIPS vector code for the inner portion of the loop. Assume x is in F0 and the addresses of A and B are in Ra and Rb, respectively.
b. [20] Find the total time for this loop on VMIPS (T100). What is the MFLOPS rating for the loop (R100)?
c. [15] Find R∞ for this loop.
d. [15] Find N1/2 for this loop.
e. [20] Find Nv for this loop. Assume the scalar code has been pipeline scheduled so that each memory reference takes six cycles and each FP operation takes three cycles. Assume the scalar overhead is also Tloop.
f. [20] Assume VMIPS has two memory pipelines. Write vector code that takes advantage of the second memory pipeline. Show the layout in convoys.
g. [20] Compute T100 and R100 for VMIPS with two memory pipelines.
G.4
[20/10] Suppose we have a version of VMIPS with eight memory banks (each a double word wide) and a memory access time of eight cycles.
a. [20] If a load vector of length 64 is executed with a stride of 20 double words, how many cycles will the load take to complete?
b. [10] What percentage of the memory bandwidth do you achieve on a 64-element load at stride 20 versus stride 1?
G.5
[12/12] Consider the following loop:
          C = 0.0
          do 10 i=1,64
             A(i) = A(i) + B(i)
             C = C + A(i)
    10    continue
a. [12] Split the loop into two loops: one with no dependence and one with a dependence. Write these loops in FORTRAN, as a source-to-source transformation. This optimization is called loop fission.
b. [12] Write the VMIPS vector code for the loop without a dependence.
G.6
[20/15/20/20] The compiled Linpack performance of the Cray-1 (designed in 1976) was almost doubled by a better compiler in 1989. Let's look at a simple example of how this might occur. Consider the DAXPY-like loop (where k is a parameter to the procedure containing the loop):
          do 10 i=1,64
             do 10 j=1,64
                Y(k,j) = a*X(i,j) + Y(k,j)
    10    continue
a. [20] Write the straightforward code sequence for just the inner loop in VMIPS vector instructions.
b. [15] Using the techniques of Section G.6, estimate the performance of this code on VMIPS by finding T64 in clock cycles. You may assume that Tloop of overhead is incurred for each iteration of the outer loop. What limits the performance?
c. [20] Rewrite the VMIPS code to reduce the performance limitation; show the resulting inner loop in VMIPS vector instructions. (Hint: Think about what establishes Tchime; can you affect it?) Find the total time for the resulting sequence.
d. [20] Estimate the performance of your new version, using the techniques of Section G.6 and finding T64.
G.7
[15/15/25] Consider the following code.
          do 10 i=1,64
             if (B(i) .ne. 0) then
                A(i) = A(i) / B(i)
             endif
    10    continue
Assume that the addresses of A and B are in Ra and Rb, respectively, and that F0 contains 0.
a. [15] Write the VMIPS code for this loop using the vector-mask capability.
b. [15] Write the VMIPS code for this loop using scatter-gather.
c. [25] Estimate the performance (T100 in clock cycles) of these two vector loops, assuming a divide latency of 20 cycles. Assume that all vector instructions run at one result per clock, independent of the setting of the vector-mask register. Assume that 50% of the entries of B are 0. Considering hardware costs, which would you build if the above loop were typical?
G.8
[15/20/15/15] In “Fallacies and Pitfalls” of Chapter 1, we saw that the difference between peak and sustained performance could be large: For one problem, a Hitachi S810 had a peak speed twice as high as that of the Cray X-MP, while for another more realistic problem, the Cray X-MP was twice as fast as the Hitachi processor. Let’s examine why this might occur using two versions of VMIPS and the following code sequences:
    C     Code sequence 1
          do 10 i=1,10000
             A(i) = x * A(i) + y * A(i)
    10    continue
    C     Code sequence 2
          do 10 i=1,100
             A(i) = x * A(i)
    10    continue
Assume there is a version of VMIPS (call it VMIPS-II) that has two copies of every floating-point functional unit with full chaining among them. Assume that both VMIPS and VMIPS-II have two load-store units. Because of the extra functional units and the increased complexity of assigning operations to units, all the overheads (Tloop and Tstart) are doubled.
a. [15] Find the number of clock cycles for code sequence 1 on VMIPS.
b. [20] Find the number of clock cycles on code sequence 1 for VMIPS-II. How does this compare to VMIPS?
c. [15] Find the number of clock cycles on code sequence 2 for VMIPS.
d. [15] Find the number of clock cycles on code sequence 2 for VMIPS-II. How does this compare to VMIPS?
G.9
[20] Here is a tricky piece of code with two-dimensional arrays. Does this loop have dependences? Can these loops be written so they are parallel? If so, how? Rewrite the source code so that it is clear that the loop can be vectorized, if possible.
          do 290 j = 2,n
             do 290 i = 2,j
                aa(i,j)= aa(i-1,j)*aa(i-1,j)+bb(i,j)
    290   continue
G.10
[12/15] Consider the following loop:
          do 10 i = 2,n
             A(i) = B
    10       C(i) = A(i-1)
a. [12] Show there is a loop-carried dependence in this code fragment.
b. [15] Rewrite the code in FORTRAN so that it can be vectorized as two separate vector sequences.
G.11
[15/25/25] As we saw in Section G.5, some loop structures are not easily vectorized. One common structure is a reduction, a loop that reduces an array to a single value by repeated application of an operation. This is a special case of a recurrence. A common example occurs in dot product:
      dot = 0.0
      do 10 i=1,64
10    dot = dot + A(i) * B(i)
This loop has an obvious loop-carried dependence (on dot) and cannot be vectorized in a straightforward fashion. The first thing a good vectorizing compiler would do is split the loop to separate out the vectorizable portion and the recurrence and perhaps rewrite the loop as
      do 10 i=1,64
10    dot(i) = A(i) * B(i)
      do 20 i=2,64
20    dot(1) = dot(1) + dot(i)
The variable dot has been expanded into a vector; this transformation is called scalar expansion. We can try to vectorize the second loop either relying strictly on the compiler (part (a)) or with hardware support as well (part (b)). There is an important caveat in the use of vector techniques for reduction. To make reduction work, we are relying on the associativity of the operator being used for the reduction. Because of rounding and finite range, however, floating-point arithmetic is not strictly associative. For this reason, most compilers require the programmer to indicate whether associativity can be used to more efficiently compile reductions.
a. [15] One simple scheme for compiling the loop with the recurrence is to add sequences of progressively shorter vectors: two 32-element vectors, then two 16-element vectors, and so on. This technique has been called recursive doubling. It is faster than doing all the operations in scalar mode. Show how the FORTRAN code would look for execution of the second loop in the preceding code fragment using recursive doubling.
b. [25] In some vector processors, the vector registers are addressable, and the operands to a vector operation may be two different parts of the same vector register. This allows another solution for the reduction, called partial sums. The key idea in partial sums is to reduce the vector to m sums where m is the total latency through the vector functional unit, including the operand read and write times. Assume that the VMIPS vector registers are addressable (e.g., you can initiate a vector operation with the operand V1(16), indicating that the input operand began with element 16). Also, assume that the total latency for adds, including operand read and write, is eight cycles. Write a VMIPS code sequence that reduces the contents of V1 to eight partial sums. It can be done with one vector operation.
c. Discuss how adding the extension in part (b) would affect a machine that had multiple lanes.
G.12
[40] Extend the MIPS simulator to be a VMIPS simulator, including the ability to count clock cycles. Write some short benchmark programs in MIPS and VMIPS assembly language. Measure the speedup on VMIPS, the percentage of vectorization, and usage of the functional units.
G.13
[50] Modify the MIPS compiler to include a dependence checker. Run some scientific code and loops through it and measure what percentage of the statements could be vectorized.
G.14
[Discussion] Some proponents of vector processors might argue that the vector processors have provided the best path to ever-increasing amounts of processor power by focusing their attention on boosting peak vector performance. Others would argue that the emphasis on peak performance is misplaced because an increasing percentage of the programs are dominated by nonvector performance. (Remember Amdahl’s Law?) The proponents would respond that programmers should work to make their programs vectorizable. What do you think about this argument?
G.15
[Discussion] Consider the points raised in “Concluding Remarks” (Section G.8). This topic—the relative advantages of pipelined scalar processors versus FP vector processors—was the source of much debate in the 1990s. What advantages do you see for each side? What would you do in this situation?
H Computer Arithmetic by David Goldberg Xerox Palo Alto Research Center
The Fast drives out the Slow even if the Fast is wrong. W. Kahan
© 2003 Elsevier Science (USA). All rights reserved.
H.1 Introduction
H.2 Basic Techniques of Integer Arithmetic
H.3 Floating Point
H.4 Floating-Point Multiplication
H.5 Floating-Point Addition
H.6 Division and Remainder
H.7 More on Floating-Point Arithmetic
H.8 Speeding Up Integer Addition
H.9 Speeding Up Integer Multiplication and Division
H.10 Putting It All Together
H.11 Fallacies and Pitfalls
H.12 Historical Perspective and References
Exercises
H.1
Introduction
Although computer arithmetic is sometimes viewed as a specialized part of CPU design, it is a very important part. This was brought home for Intel in 1994 when their Pentium chip was discovered to have a bug in the divide algorithm. This floating-point flaw resulted in a flurry of bad publicity for Intel and also cost them a lot of money. Intel took a $300 million write-off to cover the cost of replacing the buggy chips. In this appendix we will study some basic floating-point algorithms, including the division algorithm used on the Pentium. Although a tremendous variety of algorithms have been proposed for use in floating-point accelerators, actual implementations are usually based on refinements and variations of the few basic algorithms presented here.
In addition to choosing algorithms for addition, subtraction, multiplication, and division, the computer architect must make other choices. What precisions should be implemented? How should exceptions be handled? This appendix will give you the background for making these and other decisions.
Our discussion of floating point will focus almost exclusively on the IEEE floating-point standard (IEEE 754) because of its rapidly increasing acceptance. Although floating-point arithmetic involves manipulating exponents and shifting fractions, the bulk of the time in floating-point operations is spent operating on fractions using integer algorithms (but not necessarily sharing the hardware that implements integer instructions). Thus, after our discussion of floating point, we will take a more detailed look at integer algorithms.
Some good references on computer arithmetic, in order from least to most detailed, are Chapter 4 of Patterson and Hennessy [1994]; Chapter 7 of Hamacher, Vranesic, and Zaky [1984]; Gosling [1980]; and Scott [1985].
H.2
Basic Techniques of Integer Arithmetic
Readers who have studied computer arithmetic before will find most of this section to be review.
Ripple-Carry Addition
Adders are usually implemented by combining multiple copies of simple components. The natural components for addition are half adders and full adders. The half adder takes two bits $a$ and $b$ as input and produces a sum bit $s$ and a carry bit $c_{out}$ as output. Mathematically, $s = (a + b) \bmod 2$ and $c_{out} = \lfloor (a + b)/2 \rfloor$, where $\lfloor\ \rfloor$ is the floor function. As logic equations, $s = a\bar{b} + \bar{a}b$ and $c_{out} = ab$, where $ab$ means $a \wedge b$ and $a + b$ means $a \vee b$. The half adder is also called a (2,2) adder, since it takes two inputs and produces two outputs. The full adder is a (3,2) adder and is defined by $s = (a + b + c) \bmod 2$, $c_{out} = \lfloor (a + b + c)/2 \rfloor$, or the logic equations
$$s = \bar{a}\bar{b}c + \bar{a}b\bar{c} + a\bar{b}\bar{c} + abc \qquad \text{(H.2.1)}$$
$$c_{out} = ab + ac + bc \qquad \text{(H.2.2)}$$
The principal problem in constructing an adder for n-bit numbers out of smaller pieces is propagating the carries from one piece to the next. The most obvious way to solve this is with a ripple-carry adder, consisting of n full adders, as illustrated in Figure H.1. (In the figures in this appendix, the least-significant bit is always on the right.) The inputs to the adder are $a_{n-1}a_{n-2} \cdots a_0$ and $b_{n-1}b_{n-2} \cdots b_0$, where $a_{n-1}a_{n-2} \cdots a_0$ represents the number $a_{n-1}2^{n-1} + a_{n-2}2^{n-2} + \cdots + a_0$. The $c_{i+1}$ output of the ith adder is fed into the $c_{i+1}$ input of the next adder (the (i + 1)-th adder) with the lower-order carry-in $c_0$ set to 0. Since the low-order carry-in is wired to 0, the low-order adder could be a half adder. Later, however, we will see that setting the low-order carry-in bit to 1 is useful for performing subtraction.
In general, the time a circuit takes to produce an output is proportional to the maximum number of logic levels through which a signal travels. However, determining the exact relationship between logic levels and timings is highly technology dependent. Therefore, when comparing adders we will simply compare the number of logic levels in each one. How many levels are there for a ripple-carry adder? It takes two levels to compute $c_1$ from $a_0$ and $b_0$. Then it takes two more levels to compute $c_2$ from $c_1$, $a_1$, $b_1$, and so on, up to $c_n$. So there are a total of 2n levels. Typical values of n are 32 for integer arithmetic and 53 for double-precision floating point. The ripple-carry adder is the slowest adder, but also the cheapest. It can be built with only n simple cells, connected in a simple, regular way.
Because the ripple-carry adder is relatively slow compared with the designs discussed in Section H.8, you might wonder why it is used at all. In technologies like CMOS, even though ripple adders take time O(n), the constant factor is very small. In such cases short ripple adders are often used as building blocks in larger adders.
Figure H.1 Ripple-carry adder, consisting of n full adders. The carry-out of one full adder is connected to the carry-in of the adder for the next most-significant bit. The carries ripple from the least-significant bit (on the right) to the most-significant bit (on the left).
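The ripple-carry structure can be mimicked in software as a rough illustration. The sketch below is not a hardware description; it is a bit-level Python model in which the function names (full_adder, ripple_carry_add, to_bits, from_bits) and the choice to store bits least-significant first are assumptions made for this example only.

```python
def full_adder(a, b, c):
    """(3,2) adder: return (sum bit, carry-out) for input bits a, b, c."""
    s = (a + b + c) % 2
    c_out = (a & b) | (a & c) | (b & c)   # cout = ab + ac + bc
    return s, c_out


def ripple_carry_add(a_bits, b_bits, c0=0):
    """Add two equal-length bit lists stored least-significant bit first.

    Returns (sum_bits, carry_out); the carry ripples from bit 0 upward.
    """
    assert len(a_bits) == len(b_bits)
    carry = c0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(x, n):
    """Low-order n bits of x, least-significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Unsigned value of a bit list stored least-significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))
```

With n = 4, calling ripple_carry_add(to_bits(6, 4), to_bits(11, 4)) yields the sum bits for 1 together with a carry-out of 1, since 6 + 11 = 17 does not fit in 4 bits. Passing c0=1 models the carry-in setting that the text mentions as useful for subtraction.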
Radix-2 Multiplication and Division
The simplest multiplier computes the product of two unsigned numbers, one bit at a time, as illustrated in Figure H.2(a). The numbers to be multiplied are $a_{n-1}a_{n-2} \cdots a_0$ and $b_{n-1}b_{n-2} \cdots b_0$, and they are placed in registers A and B, respectively. Register P is initially 0. Each multiply step has two parts.
Multiply Step
(i) If the least-significant bit of A is 1, then register B, containing $b_{n-1}b_{n-2} \cdots b_0$, is added to P; otherwise $00 \cdots 00$ is added to P. The sum is placed back into P.
Figure H.2 Block diagram of (a) multiplier and (b) divider for n-bit unsigned integers. Each multiplication step consists of adding the contents of P to either B or 0 (depending on the low-order bit of A), replacing P with the sum, and then shifting both P and A one bit right. Each division step involves first shifting P and A one bit left, subtracting B from P, and, if the difference is nonnegative, putting it into P. If the difference is nonnegative, the low-order bit of A is set to 1.
(ii) Registers P and A are shifted right, with the carry-out of the sum being moved into the high-order bit of P, the low-order bit of P being moved into register A, and the rightmost bit of A, which is not used in the rest of the algorithm, being shifted out.
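As a hedged illustration of the two-part multiply step, the following Python sketch models registers P, A, and B as integers. The function name multiply_unsigned and the variable total (the adder output including its carry-out) are assumptions introduced here; the loop simply repeats parts (i) and (ii) n times and reads the result out of the (P,A) pair.

```python
def multiply_unsigned(a, b, n):
    """Radix-2 shift-and-add multiply of two n-bit unsigned integers."""
    mask = (1 << n) - 1
    P, A, B = 0, a & mask, b & mask
    for _ in range(n):
        # (i) add B or 0 to P; the sum may need n + 1 bits (carry-out)
        total = P + (B if A & 1 else 0)
        # (ii) shift right: low bit of the sum enters the top of A,
        #      the carry-out becomes the high-order bit of P
        A = (A >> 1) | ((total & 1) << (n - 1))
        P = total >> 1
    # P holds the high-order half, A the low-order half
    return (P << n) | A
```

For example, multiply_unsigned(13, 11, 4) steps through four iterations and returns 143.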
After n steps, the product appears in registers P and A, with A holding the lower-order bits.
The simplest divider also operates on unsigned numbers and produces the quotient bits one at a time. A hardware divider is shown in Figure H.2(b). To compute a/b, put a in the A register, b in the B register, 0 in the P register, and then perform n divide steps. Each divide step consists of four parts:
Divide Step
(i) Shift the register pair (P,A) one bit left.
(ii) Subtract the content of register B (which is $b_{n-1}b_{n-2} \cdots b_0$) from register P, putting the result back into P.
(iii) If the result of step (ii) is negative, set the low-order bit of A to 0, otherwise to 1.
(iv) If the result of step (ii) is negative, restore the old value of P by adding the contents of register B back into P.
After repeating this process n times, the A register will contain the quotient, and the P register will contain the remainder. This algorithm is the binary version of the paper-and-pencil method; a numerical example is illustrated in Figure H.3(a). Notice that the two block diagrams in Figure H.2 are very similar. The main difference is that the register pair (P,A) shifts right when multiplying and left when dividing. By allowing these registers to shift bidirectionally, the same hardware can be shared between multiplication and division.
The division algorithm illustrated in Figure H.3(a) is called restoring, because if subtraction by b yields a negative result, the P register is restored by adding b back in. The restoring algorithm has a variant that skips the restoring step and instead works with the resulting negative numbers. Each step of this nonrestoring algorithm has three parts:
Nonrestoring Divide Step
If P is negative,
(i-a) Shift the register pair (P,A) one bit left.
(ii-a) Add the contents of register B to P.
Else,
(i-b) Shift the register pair (P,A) one bit left.
(ii-b) Subtract the contents of register B from P.
(iii) If P is negative, set the low-order bit of A to 0, otherwise set it to 1.
After repeating this n times, the quotient is in A. If P is nonnegative, it is the remainder. Otherwise, it needs to be restored (i.e., add b), and then it will be the remainder. A numerical example is given in Figure H.3(b). Since (i-a) and (i-b) are the same, you might be tempted to perform this common step first, and then test the sign of P. That doesn’t work, since the sign bit can be lost when shifting.

(a)
P         A
 00000    1110    Divide $14 = 1110_2$ by $3 = 11_2$. B always contains $0011_2$.
 00001    110     step 1(i): shift.
−00011            step 1(ii): subtract.
−00010    1100    step 1(iii): result is negative, set quotient bit to 0.
 00001    1100    step 1(iv): restore.
 00011    100     step 2(i): shift.
−00011            step 2(ii): subtract.
 00000    1001    step 2(iii): result is nonnegative, set quotient bit to 1.
 00001    001     step 3(i): shift.
−00011            step 3(ii): subtract.
−00010    0010    step 3(iii): result is negative, set quotient bit to 0.
 00001    0010    step 3(iv): restore.
 00010    010     step 4(i): shift.
−00011            step 4(ii): subtract.
−00001    0100    step 4(iii): result is negative, set quotient bit to 0.
 00010    0100    step 4(iv): restore. The quotient is $0100_2$ and the remainder is $00010_2$.

(b)
P         A
 00000    1110    Divide $14 = 1110_2$ by $3 = 11_2$. B always contains $0011_2$.
 00001    110     step 1(i-b): shift.
+11101            step 1(ii-b): subtract b (add two’s complement).
 11110    1100    step 1(iii): P is negative, so set quotient bit to 0.
 11101    100     step 2(i-a): shift.
+00011            step 2(ii-a): add b.
 00000    1001    step 2(iii): P is nonnegative, so set quotient bit to 1.
 00001    001     step 3(i-b): shift.
+11101            step 3(ii-b): subtract b.
 11110    0010    step 3(iii): P is negative, so set quotient bit to 0.
 11100    010     step 4(i-a): shift.
+00011            step 4(ii-a): add b.
 11111    0100    step 4(iii): P is negative, so set quotient bit to 0.
+00011            Remainder is negative, so do final restore step.
 00010            The quotient is $0100_2$ and the remainder is $00010_2$.

Figure H.3 Numerical example of (a) restoring division and (b) nonrestoring division.

The explanation for why the nonrestoring algorithm works is this. Let $r_k$ be the contents of the (P,A) register pair at step k, ignoring the quotient bits (which are simply sharing the unused bits of register A). In Figure H.3(a), initially A contains 14, so $r_0 = 14$. At the end of the first step, $r_1 = 28$, and so on. In the restoring algorithm, part (i) computes $2r_k$ and then part (ii) $2r_k - 2^n b$ ($2^n b$ since b is subtracted from the left half). If $2r_k - 2^n b \ge 0$, both algorithms end the step with identical values in (P,A). If $2r_k - 2^n b < 0$, then the restoring algorithm restores this to $2r_k$, and the next step begins by computing $r_{res} = 2(2r_k) - 2^n b$. In the nonrestoring algorithm, $2r_k - 2^n b$ is kept as a negative number, and in the next step $r_{nonres} = 2(2r_k - 2^n b) + 2^n b = 4r_k - 2^n b = r_{res}$. Thus (P,A) has the same bits in both algorithms.
If a and b are unsigned n-bit numbers, hence in the range $0 \le a,b \le 2^n - 1$, then the multiplier in Figure H.2 will work if register P is n bits long. However, for division, P must be extended to n + 1 bits in order to detect the sign of P. Thus the adder must also have n + 1 bits.
Why would anyone implement restoring division, which uses the same hardware as nonrestoring division (the control is slightly different) but involves an extra addition? In fact, the usual implementation for restoring division doesn’t actually perform an add in step (iv). Rather, the sign resulting from the subtraction is tested at the output of the adder, and only if the sum is nonnegative is it loaded back into the P register.
As a final point, before beginning to divide, the hardware must check to see whether the divisor is 0.
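The restoring and nonrestoring divide steps can be compared side by side in a small software model. The sketch below is only an illustration under stated assumptions: the function names divide_restoring and divide_nonrestoring are invented here, and P is held as an unbounded signed Python integer rather than an (n + 1)-bit register, so the sign test is simply P < 0. Note that the nonrestoring version samples the sign of P before shifting, as the text requires.

```python
def divide_restoring(a, b, n):
    """Restoring division of n-bit unsigned a by b; returns (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("divisor must be checked before dividing")
    mask = (1 << n) - 1
    P, A = 0, a & mask
    for _ in range(n):
        # (i) shift the register pair (P,A) one bit left
        P = (P << 1) | ((A >> (n - 1)) & 1)
        A = (A << 1) & mask
        # (ii) subtract B from P
        P -= b
        if P < 0:
            P += b      # (iii) quotient bit stays 0, (iv) restore P
        else:
            A |= 1      # (iii) quotient bit is 1
    return A, P


def divide_nonrestoring(a, b, n):
    """Nonrestoring division of n-bit unsigned a by b; returns (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("divisor must be checked before dividing")
    mask = (1 << n) - 1
    P, A = 0, a & mask
    for _ in range(n):
        negative = P < 0                       # test the sign before shifting
        P = (P << 1) | ((A >> (n - 1)) & 1)    # (i-a)/(i-b) shift left
        A = (A << 1) & mask
        P = P + b if negative else P - b       # (ii-a) add or (ii-b) subtract
        if P >= 0:
            A |= 1                             # (iii) set quotient bit
    if P < 0:
        P += b                                 # final restore of the remainder
    return A, P
```

Both functions return (4, 2) for the inputs 14, 3, and n = 4, matching the quotient and remainder shown in Figure H.3.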
Signed Numbers
There are four methods commonly used to represent signed n-bit numbers: sign magnitude, two’s complement, one’s complement, and biased. In the sign magnitude system, the high-order bit is the sign bit, and the low-order n − 1 bits are the magnitude of the number. In the two’s complement system, a number and its negative add up to $2^n$. In one’s complement, the negative of a number is obtained by complementing each bit (or alternatively, the number and its negative add up to $2^n - 1$). In each of these three systems, nonnegative numbers are represented in the usual way. In a biased system, nonnegative numbers do not have their usual representation. Instead, all numbers are represented by first adding them to the bias, and then encoding this sum as an ordinary unsigned number. Thus a negative number k can be encoded as long as k + bias ≥ 0. A typical value for the bias is $2^{n-1}$.
Example
Using 4-bit numbers (n = 4), if k = 3 (or in binary, $k = 0011_2$), how is −k expressed in each of these formats?
Answer
In sign magnitude, the leftmost bit in $k = 0011_2$ is the sign bit, so flip it to 1: −k is represented by $1011_2$. In two’s complement, $k + 1101_2 = 2^n = 16$. So −k is represented by $1101_2$. In one’s complement, the bits of $k = 0011_2$ are flipped, so −k is represented by $1100_2$. For a biased system, assuming a bias of $2^{n-1} = 8$, k is represented by k + bias = $1011_2$, and −k by −k + bias = $0101_2$.
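The four encodings in the example can be reproduced with a few lines of Python. This is a minimal sketch, assuming 0 < k < 2^(n−1) and a bias of 2^(n−1); the function name encode_negative and the dictionary keys are labels chosen for this illustration.

```python
def encode_negative(k, n):
    """Return the n-bit patterns of -k (0 < k < 2**(n-1)) in four formats."""
    mask = (1 << n) - 1
    bias = 1 << (n - 1)                          # assumed bias of 2**(n-1)
    patterns = {
        "sign magnitude": (1 << (n - 1)) | k,    # set the sign bit
        "two's complement": ((1 << n) - k) & mask,  # k + (-k) = 2**n
        "one's complement": ~k & mask,           # flip every bit
        "biased": -k + bias,                     # encode -k + bias as unsigned
    }
    return {name: format(value, f"0{n}b") for name, value in patterns.items()}
```

Calling encode_negative(3, 4) gives the bit strings 1011, 1101, 1100, and 0101 for the four formats, in agreement with the answer above.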
The most widely used system for representing integers, two’s complement, is the system we will use here. One reason for the popularity of two’s complement is that it makes signed addition easy: Simply discard the carry-out from the high-order bit. To add 5 + −2, for example, add $0101_2$ and $1110_2$ to obtain $0011_2$, resulting in the correct value of 3. A useful formula for the value of a two’s complement number $a_{n-1}a_{n-2} \cdots a_1a_0$ is
$$-a_{n-1}2^{n-1} + a_{n-2}2^{n-2} + \cdots + a_1 2^1 + a_0 \qquad \text{(H.2.3)}$$
As an illustration of this formula, the value of $1101_2$ as a 4-bit two’s complement number is $-1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = -8 + 4 + 1 = -3$, confirming the result of the example above.
Overflow occurs when the result of the operation does not fit in the representation being used. For example, if unsigned numbers are being represented using 4 bits, then $6 = 0110_2$ and $11 = 1011_2$. Their sum (17) overflows because its binary equivalent ($10001_2$) doesn’t fit into 4 bits. For unsigned numbers, detecting overflow is easy; it occurs exactly when there is a carry-out of the most-significant bit. For two’s complement, things are trickier: Overflow occurs exactly when the carry into the high-order bit is different from the (to be discarded) carry-out of the high-order bit. In the example of 5 + −2 above, a 1 is carried both into and out of the leftmost bit, avoiding overflow.
Negating a two’s complement number involves complementing each bit and then adding 1. For instance, to negate $0011_2$, complement it to get $1100_2$ and then add 1 to get $1101_2$. Thus, to implement a − b using an adder, simply feed a and $\bar{b}$ (where $\bar{b}$ is the number obtained by complementing each bit of b) into the adder and set the low-order, carry-in bit to 1. This explains why the rightmost adder in Figure H.1 is a full adder.
Multiplying two’s complement numbers is not quite as simple as adding them. The obvious approach is to convert both operands to be nonnegative, do an unsigned multiplication, and then (if the original operands were of opposite signs) negate the result. Although this is conceptually simple, it requires extra time and hardware. Here is a better approach: Suppose that we are multiplying a times b using the hardware shown in Figure H.2(a). Register A is loaded with the number a; B is loaded with b. Since the content of register B is always b, we will use B and b interchangeably.
If B is potentially negative but A is nonnegative, the only change needed to convert the unsigned multiplication algorithm into a two’s complement one is to ensure that when P is shifted, it is shifted arithmetically; that is, the bit shifted into the high-order bit of P should be the sign bit of P (rather than the carry-out from the addition). Note that our n-bit-wide adder will now be adding n-bit two’s complement numbers between $-2^{n-1}$ and $2^{n-1} - 1$.
Next, suppose a is negative. The method for handling this case is called Booth recoding. Booth recoding is a very basic technique in computer arithmetic and will play a key role in Section H.9. The algorithm on page H-4 computes a × b by examining the bits of a from least significant to most significant. For example, if $a = 7 = 0111_2$, then step (i) will successively add B, add B, add B, and add 0. Booth recoding “recodes” the number 7 as $8 - 1 = 1000_2 - 0001_2 = 100\bar{1}$, where $\bar{1}$ represents −1. This gives an alternate way to compute a × b; namely, successively subtract B, add 0, add 0, and add B. This is more complicated than the unsigned algorithm on page H-4, since it uses both addition and subtraction. The advantage shows up for negative values of a. With the proper recoding, we can treat a as though it were unsigned. For example, take $a = -4 = 1100_2$. Think of $1100_2$ as the unsigned number 12, and recode it as $12 = 16 - 4 = 10000_2 - 0100_2 = 10\bar{1}00$. If the multiplication algorithm is only iterated n times (n = 4 in this case), the high-order digit is ignored, and we end up subtracting $0100_2 = 4$ times the multiplier, which is exactly the right answer. This suggests that multiplying using a recoded form of a will work equally well for both positive and negative numbers. And indeed, to deal with negative values of a, all that is required is to sometimes subtract b from P, instead of adding either b or 0 to P. Here are the precise rules: If the initial content of A is $a_{n-1} \cdots a_0$, then at the ith multiply step, the low-order bit of register A is $a_i$, and step (i) in the multiplication algorithm becomes
I. If $a_i = 0$ and $a_{i-1} = 0$, then add 0 to P.
II. If $a_i = 0$ and $a_{i-1} = 1$, then add B to P.
III. If $a_i = 1$ and $a_{i-1} = 0$, then subtract B from P.
IV. If $a_i = 1$ and $a_{i-1} = 1$, then add 0 to P.
For the first step, when i = 0, take $a_{i-1}$ to be 0.
Example
When multiplying −6 times −5, what is the sequence of values in the (P,A) register pair?
Answer
See Figure H.4.
P        A
 0000    1010    Put $-6 = 1010_2$ into A, $-5 = 1011_2$ into B.
 0000    1010    step 1(i): $a_0 = a_{-1} = 0$, so from rule I add 0.
 0000    0101    step 1(ii): shift.
+0101            step 2(i): $a_1 = 1$, $a_0 = 0$. Rule III says subtract b (or add $-b = -1011_2 = 0101_2$).
 0101    0101
 0010    1010    step 2(ii): shift.
+1011            step 3(i): $a_2 = 0$, $a_1 = 1$. Rule II says add b (1011).
 1101    1010
 1110    1101    step 3(ii): shift. (Arithmetic shift: load 1 into leftmost bit.)
+0101            step 4(i): $a_3 = 1$, $a_2 = 0$. Rule III says subtract b.
 0011    1101
 0001    1110    step 4(ii): shift. Final result is $00011110_2 = 30$.

Figure H.4 Numerical example of Booth recoding. Multiplication of a = –6 by b = –5 to get 30.
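The trace in Figure H.4 can be regenerated with a short Python model of rules I through IV. This is a sketch under stated assumptions rather than a description of real hardware: the name booth_multiply and the trace format are invented here, and P is kept as an unbounded signed Python integer, so the model does not reproduce any overflow behavior of an n-bit adder. Python's >> on integers is an arithmetic shift, which stands in for the arithmetic shift of P.

```python
def booth_multiply(a, b, n):
    """Booth-recoded multiply of n-bit two's complement a and b.

    Returns (product, trace), where trace lists the (P, A) bit patterns
    after the shift in each of the n steps.
    """
    mask = (1 << n) - 1
    A = a & mask          # multiplier as an n-bit pattern
    P = 0                 # signed accumulator (not truncated to n bits)
    prev = 0              # a_{i-1}; taken to be 0 for the first step
    trace = []
    for _ in range(n):
        cur = A & 1                       # a_i, the low-order bit of A
        P += (prev - cur) * b             # rules I-IV: add (a_{i-1} - a_i) * B
        prev = cur
        A = (A >> 1) | ((P & 1) << (n - 1))   # low bit of P moves into A
        P >>= 1                               # arithmetic right shift of P
        trace.append((format(P & mask, f"0{n}b"), format(A, f"0{n}b")))
    return (P << n) | A, trace
```

For a = −6, b = −5, and n = 4, the returned product is 30 and the four trace entries are (0000, 0101), (0010, 1010), (1110, 1101), and (0001, 1110), the post-shift rows of Figure H.4.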
The four cases above can be restated as saying that in the ith step you should add $(a_{i-1} - a_i)B$ to P. With this observation, it is easy to verify that these rules work, because the result of all the additions is
$$\sum_{i=0}^{n-1} b(a_{i-1} - a_i)2^i = b(-a_{n-1}2^{n-1} + a_{n-2}2^{n-2} + \cdots + a_1 2 + a_0) + ba_{-1}$$
Using Equation H.2.3 (page H-8) together with $a_{-1} = 0$, the right-hand side is seen to be the value of b × a as a two’s complement number.
The simplest way to implement the rules for Booth recoding is to extend the A register one bit to the right so that this new bit will contain $a_{i-1}$. Unlike the naive method of inverting any negative operands, this technique doesn’t require extra steps or any special casing for negative operands. It has only slightly more control logic. If the multiplier is being shared with a divider, there will already be the capability for subtracting b, rather than adding it. To summarize, a simple method for handling two’s complement multiplication is to pay attention to the sign of P when shifting it right, and to save the most recently shifted-out bit of A to use in deciding whether to add or subtract b from P.
Booth recoding is usually the best method for designing multiplication hardware that operates on signed numbers. For hardware that doesn’t directly implement it, however, performing Booth recoding in software or microcode is usually too slow because of the conditional tests and branches. If the hardware supports arithmetic shifts (so that negative b is handled correctly), then the following method can be used. Treat the multiplier a as if it were an unsigned number, and perform the first n − 1 multiply steps using the algorithm on page H-4. If a < 0 (in which case there will be a 1 in the low-order bit of the A register at this point), then subtract b from P; otherwise (a ≥ 0) neither add nor subtract. In either case, do a final shift (for a total of n shifts). This works because it amounts to multiplying b by $-a_{n-1}2^{n-1} + \cdots + a_1 2 + a_0$, which is the value of $a_{n-1} \cdots a_0$ as a two’s complement number by Equation H.2.3. If the hardware doesn’t support arithmetic shift, then converting the operands to be nonnegative is probably the best approach.
Two final remarks: A good way to test a signed-multiply routine is to try $-2^{n-1} \times -2^{n-1}$, since this is the only case that produces a (2n − 1)-bit result. Unlike multiplication, division is usually performed in hardware by converting the operands to be nonnegative and then doing an unsigned divide. Because division is substantially slower (and less frequent) than multiplication, the extra time used to manipulate the signs has less impact than it does on multiplication.
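The non-Booth method described above (n − 1 ordinary multiply steps, then a conditional subtract in the last step) can be sketched in the same style. The function name signed_multiply_last_step is an assumption for this example, and P is again an unbounded signed Python integer with >> acting as the arithmetic shift, so the sketch checks the arithmetic of the method rather than the width of any real adder.

```python
def signed_multiply_last_step(a, b, n):
    """Two's complement multiply: n-1 unsigned-style steps, then subtract
    b in the final step if the sign bit of a is 1."""
    mask = (1 << n) - 1
    A = a & mask
    P = 0
    for step in range(n):
        bit = A & 1
        if bit:
            if step < n - 1:
                P += b        # ordinary multiply step
            else:
                P -= b        # a < 0: bit a_{n-1} has weight -2**(n-1)
        A = (A >> 1) | ((P & 1) << (n - 1))
        P >>= 1               # arithmetic right shift
    return (P << n) | A
```

The test case suggested in the text, −2^(n−1) × −2^(n−1), corresponds to signed_multiply_last_step(-8, -8, 4) when n = 4, which returns 64.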
Systems Issues
When designing an instruction set, a number of issues related to integer arithmetic need to be resolved. Several of them are discussed here.
First, what should be done about integer overflow? This situation is complicated by the fact that detecting overflow differs depending on whether the operands are signed or unsigned integers. Consider signed arithmetic first. There are three approaches: Set a bit on overflow, trap on overflow, or do nothing on overflow. In the last case, software has to check whether or not an overflow occurred. The most convenient solution for the programmer is to have an enable bit. If this bit is turned on, then overflow causes a trap. If it is turned off, then overflow sets a bit (or alternatively, have two different add instructions). The advantage of this approach is that both trapping and nontrapping operations require only one instruction. Furthermore, as we will see in Section H.7, this is analogous to how the IEEE floating-point standard handles floating-point overflow. Figure H.5 shows how some common machines treat overflow.
What about unsigned addition? Notice that none of the architectures in Figure H.5 traps on unsigned overflow. The reason for this is that the primary use of unsigned arithmetic is in manipulating addresses. It is convenient to be able to subtract from an unsigned address by adding. For example, when n = 4, we can subtract 2 from the unsigned address $10 = 1010_2$ by adding $14 = 1110_2$. This generates an overflow, but we would not want a trap to be generated.
A second issue concerns multiplication. Should the result of multiplying two n-bit numbers be a 2n-bit result, or should multiplication just return the low-order n bits, signaling overflow if the result doesn’t fit in n bits? An argument in favor of an n-bit result is that in virtually all high-level languages, multiplication is an operation in which arguments are integer variables and the result is an integer variable of the same type. Therefore, compilers won’t generate code that utilizes a double-precision result. An argument in favor of a 2n-bit result is that it can be used by an assembly language routine to substantially speed up multiplication of multiple-precision integers (by about a factor of 3).
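The two overflow conditions can be made concrete with a small Python sketch. The function name add_with_flags is hypothetical; the flag names C and V are borrowed from the "C bit" and "V bit" wording of Figure H.5. C is the carry-out of the high-order bit (unsigned overflow), and V is set when the carry into the high-order bit differs from the carry out of it (two's complement overflow).

```python
def add_with_flags(a, b, n):
    """Add two n-bit patterns; return (result pattern, C flag, V flag)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    total = a + b
    result = total & mask
    carry_out = total >> n                        # carry out of the high-order bit
    low = (a & (mask >> 1)) + (b & (mask >> 1))   # sum of the low n-1 bits
    carry_in = low >> (n - 1)                     # carry into the high-order bit
    return result, carry_out, carry_in ^ carry_out
```

With n = 4, add_with_flags(10, 14, 4) returns (8, 1, 0): adding 14 to the unsigned address 10 gives 8 (that is, 10 − 2) with the carry flag set, which is the case the text says should not trap. By contrast, add_with_flags(7, 1, 4) returns (8, 0, 1), a signed overflow with no carry-out.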
Machine | Trap on signed overflow? | Trap on unsigned overflow? | Set bit on signed overflow? | Set bit on unsigned overflow?
VAX | If enable is on | No | Yes. Add sets V bit. | Yes. Add sets C bit.
IBM 370 | If enable is on | No | Yes. Add sets cond code. | Yes. Logical add sets cond code.
Intel 8086 | No | No | Yes. Add sets V bit. | Yes. Add sets C bit.
MIPS R3000 | Two add instructions: one always traps, the other never does. | No | No. Software must deduce it from sign of operands and result. | (entry spans both set-bit columns)
SPARC | No | No | Addcc sets V bit. Add does not. | Addcc sets C bit. Add does not.

Figure H.5 Summary of how various machines handle integer overflow. Both the 8086 and SPARC have an instruction that traps if the V bit is set, so the cost of trapping on overflow is one extra instruction.

A third issue concerns machines that want to execute one instruction every cycle. It is rarely practical to perform a multiplication or division in the same amount of time that an addition or register-register move takes. There are three possible approaches to this problem. The first is to have a single-cycle multiply-step instruction. This might do one step of the Booth algorithm. The second approach is to do integer multiplication in the floating-point unit and have it be part of the floating-point instruction set. (This is what DLX does.) The third approach is to have an autonomous unit in the CPU do the multiplication. In this case, the result either can be guaranteed to be delivered in a fixed number of cycles (with the compiler charged with waiting the proper amount of time) or there can be an interlock. The same comments apply to division as well. As examples, the original SPARC had a multiply-step instruction but no divide-step instruction, while the MIPS R3000 has an autonomous unit that does multiplication and division (newer versions of the SPARC architecture added an integer multiply instruction). The designers of the HP Precision Architecture did an especially thorough job of analyzing the frequency of the operands for multiplication and division, and they based their multiply and divide steps accordingly. (See Magenheimer et al. [1988] for details.)
The final issue involves the computation of integer division and remainder for negative numbers. For example, what is −5 DIV 3 and −5 MOD 3? When computing x DIV y and x MOD y, negative values of x occur frequently enough to be worth some careful consideration. (On the other hand, negative values of y are quite rare.) If there are built-in hardware instructions for these operations, they should correspond to what high-level languages specify. Unfortunately, there is no agreement among existing programming languages. See Figure H.6.
One definition for these expressions stands out as clearly superior; namely, x DIV y = $\lfloor x/y \rfloor$, so that 5 DIV 3 = 1, −5 DIV 3 = −2. And MOD should satisfy x = (x DIV y) × y + x MOD y, so that x MOD y ≥ 0. Thus 5 MOD 3 = 2, and −5 MOD 3 = 1. Some of the many advantages of this definition are as follows:
1. A calculation to compute an index into a hash table of size N can use MOD N and be guaranteed to produce a valid index in the range from 0 to N − 1.
2. In graphics, when converting from one coordinate system to another, there is no “glitch” near 0. For example, to convert from a value x expressed in a system that uses 100 dots per inch to a value y on a bitmapped display with 70 dots per inch, the formula y = (70 × x) DIV 100 maps one or two x coordinates into each y coordinate. But if DIV were defined as in Pascal to be x/y rounded to 0, then 0 would have three different points (−1, 0, 1) mapped into it.
| Language | Division | Remainder |
|---|---|---|
| FORTRAN | −5/3 = −1 | MOD(−5, 3) = −2 |
| Pascal | −5 DIV 3 = −1 | −5 MOD 3 = 1 |
| Ada | −5/3 = −1 | −5 MOD 3 = 1; −5 REM 3 = −2 |
| C | −5/3 undefined | −5 % 3 undefined |
| Modula-3 | −5 DIV 3 = −2 | −5 MOD 3 = 1 |

Figure H.6 Examples of integer division and integer remainder in various programming languages.
3. $x \text{ MOD } 2^k$ is the same as performing a bitwise AND with a mask of k bits, and $x \text{ DIV } 2^k$ is the same as doing a k-bit arithmetic right shift.

Finally, a potential pitfall worth mentioning concerns multiple-precision addition. Many instruction sets offer a variant of the add instruction that adds three operands: two n-bit numbers together with a third single-bit number. This third number is the carry from the previous addition. Since the multiple-precision number will typically be stored in an array, it is important to be able to increment the array pointer without destroying the carry bit.
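The three advantages of the floored definition of DIV and MOD can be tried out in Python, whose built-in `//` and `%` operators happen to follow that definition. This is only an illustrative sketch: the helper `trunc_div`, the table size `N`, and the sample values are assumptions chosen for the example.

```python
def trunc_div(x, y):
    """Quotient rounded toward 0 (the Pascal-style DIV of Figure H.6)."""
    q = abs(x) // abs(y)
    return q if (x < 0) == (y < 0) else -q

# Python's // and % implement the floored definition of DIV and MOD.
floored = (-5 // 3, -5 % 3)            # (-2, 1)
truncated = trunc_div(-5, 3)           # -1

# Advantage 1: MOD N always yields a valid index in 0..N-1.
N = 8                                  # hypothetical hash-table size
index = -13 % N                        # 3

# Advantage 2: no glitch near 0 when rescaling coordinates.
floor_map = [(70 * x) // 100 for x in (-1, 0, 1)]          # [-1, 0, 0]
trunc_map = [trunc_div(70 * x, 100) for x in (-1, 0, 1)]   # [0, 0, 0]

# Advantage 3: powers of two reduce to a mask and an arithmetic shift.
k, x = 3, -13
mask_matches = (x % 2**k) == (x & (2**k - 1))   # True
shift_matches = (x // 2**k) == (x >> k)         # True
```

With truncation, the three inputs −1, 0, and 1 all collapse onto 0, which is the glitch described in item 2.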
H.3 Floating Point

Many applications require numbers that aren’t integers. There are a number of ways that nonintegers can be represented. One is to use fixed point; that is, use integer arithmetic and simply imagine the binary point somewhere other than just to the right of the least-significant digit. Adding two such numbers can be done with an integer add, whereas multiplication requires some extra shifting. Other representations that have been proposed involve storing the logarithm of a number and doing multiplication by adding the logarithms, or using a pair of integers (a,b) to represent the fraction a/b. However, only one noninteger representation has gained widespread use, and that is floating point. In this system, a computer word is divided into two parts, an exponent and a significand. As an example, an exponent of −3 and significand of 1.5 might represent the number $1.5 \times 2^{-3} = 0.1875$.

The advantages of standardizing a particular representation are obvious. Numerical analysts can build up high-quality software libraries, computer designers can develop techniques for implementing high-performance hardware, and hardware vendors can build standard accelerators. Given the predominance of the floating-point representation, it appears unlikely that any other representation will come into widespread use.

The semantics of floating-point instructions are not as clear-cut as the semantics of the rest of the instruction set, and in the past the behavior of floating-point operations varied considerably from one computer family to the next. The variations involved such things as the number of bits allocated to the exponent and significand, the range of exponents, how rounding was carried out, and the actions taken on exceptional conditions like underflow and overflow. Computer architecture books used to dispense advice on how to deal with all these details, but fortunately this is no longer necessary.
That’s because the computer industry is rapidly converging on the format specified by IEEE standard 754-1985 (also an international standard, IEC 559). The advantages of using a standard variant of floating point are similar to those for using floating point over other noninteger representations. IEEE arithmetic differs from many previous arithmetics in the following major ways:
1. When rounding a “halfway” result to the nearest floating-point number, it picks the one that is even.
2. It includes the special values NaN, ∞, and −∞.
3. It uses denormal numbers to represent the result of computations whose value is less than $1.0 \times 2^{E_{\min}}$.
4. It rounds to nearest by default, but it also has three other rounding modes.
5. It has sophisticated facilities for handling exceptions.

To elaborate on (1), note that when operating on two floating-point numbers, the result is usually a number that cannot be exactly represented as another floating-point number. For example, in a floating-point system using base 10 and two significant digits, 6.1 × 0.5 = 3.05. This needs to be rounded to two digits. Should it be rounded to 3.0 or 3.1? In the IEEE standard, such halfway cases are rounded to the number whose low-order digit is even. That is, 3.05 rounds to 3.0, not 3.1. The standard actually has four rounding modes. The default is round to nearest, which rounds ties to an even number as just explained. The other modes are round toward 0, round toward +∞, and round toward −∞.

We will elaborate on the other differences in following sections. For further reading, see IEEE [1985], Cody et al. [1984], and Goldberg [1991].
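The base-10, two-digit example above can be reproduced with Python's standard `decimal` module, which lets the precision and rounding mode be set explicitly. This is a sketch for illustration only; the helper name `multiply_2_digits` and the mode labels are assumptions, and `decimal` is a software decimal arithmetic rather than IEEE 754 binary hardware.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

def multiply_2_digits(a, b, rounding):
    """Multiply two decimal strings, rounding the product to 2 significant digits."""
    ctx = Context(prec=2, rounding=rounding)
    return ctx.multiply(Decimal(a), Decimal(b))

# The four rounding directions named in the text, mapped to decimal's constants.
modes = {
    "nearest (ties to even)": ROUND_HALF_EVEN,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}
results = {name: multiply_2_digits("6.1", "0.5", mode)
           for name, mode in modes.items()}

# A halfway case whose neighbor with an even last digit lies above it.
tie_up = multiply_2_digits("6.3", "0.5", ROUND_HALF_EVEN)   # 3.15 -> 3.2
```

The product 3.05 rounds to 3.0 in every mode except round toward +∞, while 3.15 rounds up to 3.2 because 2 is the even neighbor.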
Special Values and Denormals

Probably the most notable feature of the standard is that by default a computation continues in the face of exceptional conditions, such as dividing by 0 or taking the square root of a negative number. For example, the result of taking the square root of a negative number is a NaN (Not a Number), a bit pattern that does not represent an ordinary number. As an example of how NaNs might be useful, consider the code for a zero finder that takes a function F as an argument and evaluates F at various points to determine a zero for it. If the zero finder accidentally probes outside the valid values for F, F may well cause an exception. Writing a zero finder that deals with this case is highly language and operating-system dependent, because it relies on how the operating system reacts to exceptions and how this reaction is mapped back into the programming language. In IEEE arithmetic it is easy to write a zero finder that handles this situation and runs on many different systems. After each evaluation of F, it simply checks to see whether F has returned a NaN; if so, it knows it has probed outside the domain of F.

In IEEE arithmetic, if the input to an operation is a NaN, the output is NaN (e.g., 3 + NaN = NaN). Because of this rule, writing floating-point subroutines that can accept NaN as an argument rarely requires any special case checks. For example, suppose that arccos is computed in terms of arctan, using the formula $\arccos x = 2 \arctan\left(\sqrt{(1 - x)/(1 + x)}\right)$. If arctan handles an argument of NaN properly, arccos will automatically do so too. That’s because if x is a NaN, then 1 + x, 1 − x, (1 − x)/(1 + x), and $\sqrt{(1 - x)/(1 + x)}$ will also be NaNs. No checking for NaNs is required.
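The NaN propagation rule can be observed in a short Python sketch of the arccos formula. The function name `arccos_via_arctan` is an assumption for illustration. Note also that Python's `math.sqrt` raises an exception for a negative argument instead of returning a NaN, so the example feeds in a NaN directly rather than producing one from a square root.

```python
import math

def arccos_via_arctan(x):
    """arccos x = 2 arctan(sqrt((1 - x) / (1 + x))), with no NaN checks."""
    return 2.0 * math.atan(math.sqrt((1.0 - x) / (1.0 + x)))

nan = float("nan")

# Any arithmetic operation with a NaN input yields a NaN.
sum_is_nan = math.isnan(3 + nan)                       # True

# The NaN flows through 1 - x, 1 + x, the quotient, sqrt, and atan.
result_is_nan = math.isnan(arccos_via_arctan(nan))     # True

# For an ordinary argument the formula agrees with the library arccos.
ordinary_error = abs(arccos_via_arctan(0.5) - math.acos(0.5))
```

A zero finder in the style described above would only need a test such as `math.isnan(F(x))` after each probe.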
While the result of $\sqrt{-1}$ is a NaN, the result of 1/0 is not a NaN, but +∞, which is another special value. The standard defines arithmetic on infinities (there is both +∞ and −∞) using rules such as 1/∞ = 0. The formula $\arccos x = 2 \arctan\left(\sqrt{(1 - x)/(1 + x)}\right)$ illustrates how infinity arithmetic can be used. Since arctan x asymptotically approaches π/2 as x approaches ∞, it is natural to define arctan(∞) = π/2, in which case arccos(−1) will automatically be computed correctly as 2 arctan(∞) = π.

The final kind of special values in the standard are denormal numbers. In many floating-point systems, if $E_{\min}$ is the smallest exponent, a number less than $1.0 \times 2^{E_{\min}}$ in magnitude cannot be represented, and a floating-point operation that results in a number less than this is simply flushed to 0. In the IEEE standard, on the other hand, numbers less than $1.0 \times 2^{E_{\min}}$ are represented using significands less than 1. This is called gradual underflow. Thus, as numbers decrease in magnitude below $2^{E_{\min}}$, they gradually lose their significance and are only represented by 0 when all their significance has been shifted out. For example, in base 10 with four significant figures, let $x = 1.234 \times 10^{E_{\min}}$. Then x/10 will be rounded to $0.123 \times 10^{E_{\min}}$, having lost a digit of precision. Similarly x/100 rounds to $0.012 \times 10^{E_{\min}}$, and x/1000 to $0.001 \times 10^{E_{\min}}$, while x/10000 is finally small enough to be rounded to 0.

Denormals make dealing with small numbers more predictable by maintaining familiar properties such as x = y ⇔ x − y = 0. For example, in a flush-to-zero system (again in base 10 with four significant digits), if $x = 1.256 \times 10^{E_{\min}}$ and $y = 1.234 \times 10^{E_{\min}}$, then $x - y = 0.022 \times 10^{E_{\min}}$, which flushes to zero. So even though x ≠ y, the computed value of x − y = 0. This never happens with gradual underflow. In this example, $x - y = 0.022 \times 10^{E_{\min}}$ is a denormal number, and so the computation of x − y is exact.
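Infinity arithmetic and gradual underflow can both be observed with Python floats, which are IEEE double-precision numbers on essentially all current platforms (an assumption of this sketch). The specific values 1.5 and 1.25 times the smallest normalized number are chosen for illustration so that every intermediate result is exact.

```python
import math
import sys

# Infinity arithmetic: 1/inf = 0 and arctan(inf) = pi/2.
one_over_inf = 1.0 / math.inf                    # 0.0
arccos_minus_one = 2.0 * math.atan(math.inf)     # approximately pi

# Gradual underflow: two distinct tiny numbers have a nonzero difference.
tiny = sys.float_info.min        # smallest normalized double, 2**-1022
x = 1.5 * tiny
y = 1.25 * tiny
diff = x - y                     # 0.25 * 2**-1022, a denormal number

# Significance is lost bit by bit until nothing is left.
smallest = math.ldexp(1.0, -1074)    # smallest positive denormal
below_smallest = smallest / 2        # rounds to 0.0
```

On a flush-to-zero system `diff` would be 0 even though `x != y`; with denormals it is exactly 2^(−1024).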
Representation of Floating-Point Numbers

Let us consider how to represent single-precision numbers in IEEE arithmetic. Single-precision numbers are stored in 32 bits: 1 for the sign, 8 for the exponent, and 23 for the fraction. The exponent is a signed number represented using the bias method (see the subsection “Signed Numbers,” page H-7) with a bias of 127. The term biased exponent refers to the unsigned number contained in bits 1 through 8 and unbiased exponent (or just exponent) means the actual power to which 2 is to be raised. The fraction represents a number less than 1, but the significand of the floating-point number is 1 plus the fraction part. In other words, if e is the biased exponent (value of the exponent field) and f is the value of the fraction field, the number being represented is $1.f \times 2^{e-127}$.

Example

What single-precision number does the following 32-bit word represent?

1 10000001 01000000000000000000000
Answer

Considered as an unsigned number, the exponent field is 129, making the value of the exponent 129 − 127 = 2. The fraction part is $.01_2 = .25$, making the significand 1.25. Thus, this bit pattern represents the number $-1.25 \times 2^2 = -5$.

The fractional part of a floating-point number (.25 in the example above) must not be confused with the significand, which is 1 plus the fractional part. The leading 1 in the significand 1.f does not appear in the representation; that is, the leading bit is implicit. When performing arithmetic on IEEE format numbers, the fraction part is usually unpacked, which is to say the implicit 1 is made explicit.

Figure H.7 summarizes the parameters for single (and other) precisions. It shows the exponents for single precision to range from −126 to 127; accordingly, the biased exponents range from 1 to 254. The biased exponents of 0 and 255 are used to represent special values. This is summarized in Figure H.8. When the biased exponent is 255, a zero fraction field represents infinity, and a nonzero fraction field represents a NaN. Thus, there is an entire family of NaNs. When the biased exponent and the fraction field are 0, then the number represented is 0. Because of the implicit leading 1, ordinary numbers always have a significand greater than or equal to 1. Thus, a special convention such as this is required to represent 0. Denormalized numbers are implemented by having a word with a zero exponent field represent the number $0.f \times 2^{E_{\min}}$.

| | Single | Single extended | Double | Double extended |
|---|---|---|---|---|
| p (bits of precision) | 24 | ≥ 32 | 53 | ≥ 64 |
| Emax | 127 | ≥ 1023 | 1023 | ≥ 16383 |
| Emin | −126 | ≤ −1022 | −1022 | ≤ −16382 |
| Exponent bias | 127 | | 1023 | |

Figure H.7 Format parameters for the IEEE 754 floating-point standard. The first row gives the number of bits in the significand. The blanks are unspecified parameters.
| Exponent | Fraction | Represents |
|---|---|---|
| e = Emin − 1 | f = 0 | ±0 |
| e = Emin − 1 | f ≠ 0 | $0.f \times 2^{E_{\min}}$ |
| Emin ≤ e ≤ Emax | any | $1.f \times 2^{e}$ |
| e = Emax + 1 | f = 0 | ±∞ |
| e = Emax + 1 | f ≠ 0 | NaN |

Figure H.8 Representation of special values. When the exponent of a number falls outside the range Emin ≤ e ≤ Emax, then that number has a special interpretation as indicated in the table.
The primary reason why the IEEE standard, like most other floating-point formats, uses biased exponents is that it means nonnegative numbers are ordered in the same way as integers. That is, the magnitude of floating-point numbers can be compared using an integer comparator. Another (related) advantage is that 0 is represented by a word of all 0’s. The downside of biased exponents is that adding them is slightly awkward, because it requires that the bias be subtracted from their sum.
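The single-precision layout and the ordering property of biased exponents can be checked with a short Python sketch built on the standard `struct` module. The function names `decode_single` and `bits_of` and the sample values are assumptions made for this illustration.

```python
import struct

def decode_single(word):
    """Split a 32-bit word into (sign, biased exponent, fraction, value)."""
    sign = word >> 31
    biased = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if biased == 255:                        # e = Emax + 1: infinity or NaN
        value = float("inf") if fraction == 0 else float("nan")
    elif biased == 0:                        # e = Emin - 1: zero or denormal
        value = (fraction / 2**23) * 2.0**-126
    else:                                    # ordinary number, implicit leading 1
        value = (1 + fraction / 2**23) * 2.0**(biased - 127)
    return sign, biased, fraction, -value if sign else value

def bits_of(x):
    """The 32-bit pattern of x rounded to single precision, as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# The worked example: 1 10000001 01000000000000000000000
decoded = decode_single(0xC0A00000)          # (1, 129, 0x200000, -5.0)

# Nonnegative floats sort in the same order as their bit patterns.
values = [2.0, 1e-40, 1.5, 0.0, 1e30, 0.5, 1.0]
same_order = sorted(values) == sorted(values, key=bits_of)
```

Because the exponent field sits above the fraction and is stored with a bias, an unsigned integer comparison of the two words gives the same answer as a floating-point comparison of nonnegative values.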
H.4 Floating-Point Multiplication

The simplest floating-point operation is multiplication, so we discuss it first. A binary floating-point number x is represented as a significand and an exponent, $x = s \times 2^e$. The formula

$$(s_1 \times 2^{e_1}) \cdot (s_2 \times 2^{e_2}) = (s_1 \cdot s_2) \times 2^{e_1 + e_2}$$

shows that a floating-point multiply algorithm has several parts. The first part multiplies the significands using ordinary integer multiplication. Because floating-point numbers are stored in sign magnitude form, the multiplier need only deal with unsigned numbers (although we have seen that Booth recoding handles signed two’s complement numbers painlessly). The second part rounds the result. If the significands are unsigned p-bit numbers (e.g., p = 24 for single precision), then the product can have as many as 2p bits and must be rounded to a p-bit number. The third part computes the new exponent. Because exponents are stored with a bias, this involves subtracting the bias from the sum of the biased exponents.

Example

How does the multiplication of the single-precision numbers

1 10000010 000... = $-1 \times 2^3$
0 10000011 000... = $1 \times 2^4$

proceed in binary?
Answer
When unpacked, the significands are both 1.0, their product is 1.0, and so the result is of the form

1 ???????? 000...

To compute the exponent, use the formula

$$\text{biased exp}(e_1 + e_2) = \text{biased exp}(e_1) + \text{biased exp}(e_2) - \text{bias}$$

From Figure H.7, the bias is $127 = 01111111_2$, so in two’s complement −127 is $10000001_2$. Thus the biased exponent of the product is

$$\begin{array}{r} 10000010 \\ 10000011 \\ +\ 10000001 \\ \hline 10000110 \end{array}$$

Since this is 134 decimal, it represents an exponent of 134 − bias = 134 − 127 = 7, as expected.

The interesting part of floating-point multiplication is rounding. Some of the different cases that can occur are illustrated in Figure H.9. Since the cases are similar in all bases, the figure uses human-friendly base 10, rather than base 2. In the figure, p = 3, so the final result must be rounded to three significant digits. The three most-significant digits are in boldface. The fourth most-significant digit (marked with an arrow) is the round digit, denoted by r. If the round digit is less than 5, then the bold digits represent the rounded result. If the round digit is greater than 5 (as in (a)), then 1 must be added to the least-significant bold digit. If the round digit is exactly 5 (as in (b)), then additional digits must be examined to decide between truncation or incrementing by 1. It is only necessary to know if any digits past 5 are nonzero. In the algorithm below, this will be recorded in a sticky bit. Comparing (a) and (b) in the figure shows that there are two possible positions for the round digit (relative to the least-significant digit of the product). Case (c) illustrates that when adding 1 to the least-significant bold digit, there may be a carry-out. When this happens, the final significand must be 10.0.

There is a straightforward method of handling rounding using the multiplier of Figure H.2 (page H-4) together with an extra sticky bit. If p is the number of bits in the significand, then the A, B, and P registers should be p bits wide. Multiply the two significands to obtain a 2p-bit product in the (P,A) registers (see
| Part | Product | Round digit | Rounded result |
|---|---|---|---|
| (a) | 1.23 × 6.78 = 8.3394 | r = 9 > 5, so round up | 8.34 |
| (b) | 2.83 × 4.47 = 12.6501 | r = 5 and a following digit ≠ 0, so round up | $1.27 \times 10^1$ |
| (c) | 1.28 × 7.81 = 09.9968 | r = 6 > 5, so round up | $1.00 \times 10^1$ |

Figure H.9 Examples of rounding a multiplication. Using base 10 and p = 3, parts (a) and (b) illustrate that the result of a multiplication can have either 2p − 1 or 2p digits, and hence the position where a 1 is added when rounding up (just left of the arrow) can vary. Part (c) shows that rounding up can cause a carry-out.
| | Significand bits | rnd | sticky |
|---|---|---|---|
| Product in (P, A) | P = x0 x1 . x2 x3 x4 x5, A = g r s s s s | | |
| Case (1): x0 = 0, shift needed | x1 . x2 x3 x4 x5 g | r | OR of the s bits |
| Case (2): x0 = 1, increment exponent | x0 . x1 x2 x3 x4 x5 (adjust binary point, add 1 to exponent to compensate) | g | OR of r and the s bits |

Figure H.10 The two cases of the floating-point multiply algorithm. The top line shows the contents of the P and A registers after multiplying the significands, with p = 6. In case (1), the leading bit is 0, and so the P register must be shifted. In case (2), the leading bit is 1, no shift is required, but both the exponent and the round and sticky bits must be adjusted. The sticky bit is the logical OR of the bits marked s.
Figure H.10). During the multiplication, the first p − 2 times a bit is shifted into the A register, OR it into the sticky bit. This will be used in halfway cases. Let s represent the sticky bit, g (for guard) the most-significant bit of A, and r (for round) the second most-significant bit of A. There are two cases:

1. The high-order bit of P is 0. Shift P left 1 bit, shifting in the g bit from A. Shifting the rest of A is not necessary.
2. The high-order bit of P is 1. Set s := s ∨ r and r := g, and add 1 to the exponent.

Now if r = 0, P is the correctly rounded product. If r = 1 and s = 1, then P + 1 is the product (where by P + 1 we mean adding 1 to the least-significant bit of P). If r = 1 and s = 0, we are in a halfway case, and round up according to the least-significant bit of P, that is, only if that bit is 1.

As an example, apply the decimal version of these rules to Figure H.9(b). After the multiplication, P = 126 and A = 501, with g = 5, r = 0, s = 1. Since the high-order digit of P is nonzero, case (2) applies and r := g, so that r = 5, as the arrow indicates in Figure H.9. Since r = 5, we could be in a halfway case, but s = 1 indicates that the result is in fact slightly over 1/2, so add 1 to P to obtain the correctly rounded product.

The precise rules for rounding depend on the rounding mode and are given in Figure H.11. Note that P is nonnegative, that is, it contains the magnitude of the result. A good discussion of more efficient ways to implement rounding is in Santoro, Bewick, and Horowitz [1989].

Example
In binary with p = 4, show how the multiplication algorithm computes the product −5 × 10 in each of the four rounding modes.
Answer
In binary, −5 is $-1.010_2 \times 2^2$ and 10 = $1.010_2 \times 2^3$. Applying the integer multiplication algorithm to the significands gives $01100100_2$, so P = $0110_2$, A = $0100_2$, g = 0, r = 1, and s = 0. The high-order bit of P is 0, so case (1) applies. Thus P becomes $1100_2$, and since the result is negative, Figure H.11 gives

| Rounding mode | P | Reason |
|---|---|---|
| round to −∞ | $1101_2$ | add 1 since r ∨ s = 1 ∨ 0 = TRUE |
| round to +∞ | $1100_2$ | |
| round to 0 | $1100_2$ | |
| round to nearest | $1100_2$ | no add since r ∧ p0 = 1 ∧ 0 = FALSE and r ∧ s = 1 ∧ 0 = FALSE |

The exponent is 2 + 3 = 5, so the result is $-1.100_2 \times 2^5 = -48$, except when rounding to −∞, in which case it is $-1.101_2 \times 2^5 = -52$.

Overflow occurs when the rounded result is too large to be represented. In single precision, this occurs when the result has an exponent of 128 or higher. If $e_1$ and $e_2$ are the two biased exponents, then $1 \le e_i \le 254$, and the exponent calculation $e_1 + e_2 - 127$ gives numbers between 1 + 1 − 127 and 254 + 254 − 127, or between −125 and 381. This range of numbers can be represented using 9 bits. So one way to detect overflow is to perform the exponent calculations in a 9-bit adder (see Exercise H.12). Remember that you must check for overflow after rounding: the example in Figure H.9(c) shows that this can make a difference.
Denormals

Checking for underflow is somewhat more complex because of denormals. In single precision, if the result has an exponent less than −126, that does not necessarily indicate underflow, because the result might be a denormal number. For example, the product of $(1 \times 2^{-64})$ with $(1 \times 2^{-65})$ is $1 \times 2^{-129}$, and −129 is below the legal exponent limit. But this result is a valid denormal number, namely, $0.125 \times 2^{-126}$. In general, when the unbiased exponent of a product dips below −126, the resulting product must be shifted right and the exponent incremented until the
| Rounding mode | Sign of result ≥ 0 | Sign of result < 0 |
|---|---|---|
| −∞ | | +1 if r ∨ s |
| +∞ | +1 if r ∨ s | |
| 0 | | |
| Nearest | +1 if r ∧ p0 or r ∧ s | +1 if r ∧ p0 or r ∧ s |

Figure H.11 Rules for implementing the IEEE rounding modes. Let S be the magnitude of the preliminary result. Blanks mean that the p most-significant bits of S are the actual result bits. If the condition listed is true, add 1 to the pth most-significant bit of S. The symbols r and s represent the round and sticky bits, while p0 is the pth most-significant bit of S.
exponent reaches −126. If this process causes the entire significand to be shifted out, then underflow has occurred. The precise definition of underflow is somewhat subtle—see Section H.7 for details. When one of the operands of a multiplication is denormal, its significand will have leading zeros, and so the product of the significands will also have leading zeros. If the exponent of the product is less than –126, then the result is denormal, so right-shift and increment the exponent as before. If the exponent is greater than –126, the result may be a normalized number. In this case, left-shift the product (while decrementing the exponent) until either it becomes normalized or the exponent drops to –126. Denormal numbers present a major stumbling block to implementing floating-point multiplication, because they require performing a variable shift in the multiplier, which wouldn’t otherwise be needed. Thus, high-performance, floating-point multipliers often do not handle denormalized numbers, but instead trap, letting software handle them. A few practical codes frequently underflow, even when working properly, and these programs will run quite a bit slower on systems that require denormals to be processed by a trap handler. So far we haven’t mentioned how to deal with operands of zero. This can be handled by either testing both operands before beginning the multiplication or testing the product afterward. If you test afterward, be sure to handle the case 0 × ∞ properly: This results in NaN, not 0. Once you detect that the result is 0, set the biased exponent to 0. Don’t forget about the sign. The sign of a product is the XOR of the signs of the operands, even when the result is 0.
Precision of Multiplication

In the discussion of integer multiplication, we mentioned that designers must decide whether to deliver the low-order word of the product or the entire product. A similar issue arises in floating-point multiplication, where the exact product can be rounded to the precision of the operands or to the next higher precision. In the case of integer multiplication, none of the standard high-level languages contains a construct that would generate a “single times single gets double” instruction. The situation is different for floating point. Many languages allow assigning the product of two single-precision variables to a double-precision one, and the construction can also be exploited by numerical algorithms. The best-known case is using iterative refinement to solve linear systems of equations.
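The value of a "single times single gets double" product can be seen in a small Python sketch. Python floats are doubles, so single precision is emulated here by rounding through the `struct` module; the helper name `to_single` and the operands 1.1 and 1.3 are assumptions chosen for illustration. Two 24-bit significands have a product of at most 48 bits, which fits in the 53-bit significand of a double.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

a = to_single(1.1)
b = to_single(1.3)

# Delivered in double precision, the product of two singles is exact.
double_product = a * b
is_exact = Fraction(double_product) == Fraction(a) * Fraction(b)

# Rounded back to single precision, low-order bits of the product are lost.
single_product = to_single(double_product)
lost_bits = single_product != double_product
```

An algorithm such as iterative refinement can exploit the exact double-precision product when accumulating residuals.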
H.5 Floating-Point Addition

Typically, a floating-point operation takes two inputs with p bits of precision and returns a p-bit result. The ideal algorithm would compute this by first performing the operation exactly, and then rounding the result to p bits (using the current rounding mode). The multiplication algorithm presented in the previous section follows this strategy. Even though hardware implementing IEEE arithmetic must return the same result as the ideal algorithm, it doesn’t need to actually perform the ideal algorithm. For addition, in fact, there are better ways to proceed. To see this, consider some examples.

First, the sum of the binary 6-bit numbers $1.10011_2$ and $1.10001_2 \times 2^{-5}$: When the summands are shifted so they have the same exponent, this is

$$\begin{array}{r@{}l} 1 & .10011 \\ +\ & .0000110001 \end{array}$$

Using a 6-bit adder (and discarding the low-order bits of the second addend) gives

$$\begin{array}{r@{}l} 1 & .10011 \\ +\ & .00001 \\ \hline 1 & .10100 \end{array}$$

The first discarded bit is 1. This isn’t enough to decide whether to round up. The rest of the discarded bits, 0001, need to be examined. Or actually, we just need to record whether any of these bits are nonzero, storing this fact in a sticky bit just as in the multiplication algorithm. So for adding two p-bit numbers, a p-bit adder is sufficient, as long as the first discarded bit (round) and the OR of the rest of the bits (sticky) are kept. Then Figure H.11 can be used to determine if a roundup is necessary, just as with multiplication. In the example above, sticky is 1, so a roundup is necessary. The final sum is $1.10101_2$.

Here’s another example:

$$\begin{array}{r@{}l} 1 & .11011 \\ +\ & .0101001 \end{array}$$

A 6-bit adder gives

$$\begin{array}{r@{}l} 1 & .11011 \\ +\ & .01010 \\ \hline 10 & .00101 \end{array}$$

Because of the carry-out on the left, the round bit isn’t the first discarded bit; rather, it is the low-order bit of the sum (1). The discarded bits, 01, are OR’ed together to make sticky. Because round and sticky are both 1, the high-order 6 bits of the sum, $10.0010_2$, must be rounded up for the final answer of $10.0011_2$.

Next, consider subtraction and the following example:

$$\begin{array}{r@{}l} 1 & .00000 \\ -\ & .00000101111 \end{array}$$

The simplest way of computing this is to convert $-.00000101111_2$ to its two’s complement form, so the difference becomes a sum
$$\begin{array}{r@{}l} 1 & .00000 \\ +\ 1 & .11111010001 \end{array}$$

Computing this sum in a 6-bit adder gives

$$\begin{array}{r@{}l} 1 & .00000 \\ +\ 1 & .11111 \\ \hline 0 & .11111 \end{array}$$

Because the top bits canceled, the first discarded bit (the guard bit) is needed to fill in the least-significant bit of the sum, which becomes $0.111110_2$, and the second discarded bit becomes the round bit. This is analogous to case (1) in the multiplication algorithm (see page H-19). The round bit of 1 isn’t enough to decide whether to round up. Instead, we need to OR all the remaining bits (0001) into a sticky bit. In this case, sticky is 1, so the final result must be rounded up to 0.111111. This example shows that if subtraction causes the most-significant bit to cancel, then one guard bit is needed. It is natural to ask whether two guard bits are needed for the case when the two most-significant bits cancel. The answer is no, because if x and y are so close that the top two bits of x − y cancel, then x − y will be exact, so guard bits aren’t needed at all.

To summarize, addition is more complex than multiplication because, depending on the signs of the operands, it may actually be a subtraction. If it is an addition, there can be carry-out on the left, as in the second example. If it is subtraction, there can be cancellation, as in the third example. In each case, the position of the round bit is different. However, we don’t need to compute the exact sum and then round. We can infer it from the sum of the high-order p bits together with the round and sticky bits.

The rest of this section is devoted to a detailed discussion of the floating-point addition algorithm. Let $a_1$ and $a_2$ be the two numbers to be added. The notations $e_i$ and $s_i$ are used for the exponent and significand of the addends $a_i$. This means that the floating-point inputs have been unpacked and that $s_i$ has an explicit leading bit. To add $a_1$ and $a_2$, perform these eight steps.

1. If $e_1 < e_2$, swap the operands. This ensures that the difference of the exponents satisfies $d = e_1 - e_2 \ge 0$. Tentatively set the exponent of the result to $e_1$.
2. If the signs of $a_1$ and $a_2$ differ, replace $s_2$ by its two’s complement.
3. Place $s_2$ in a p-bit register and shift it $d = e_1 - e_2$ places to the right (shifting in 1’s if $s_2$ was complemented in the previous step). From the bits shifted out, set g to the most-significant bit, r to the next most-significant bit, and set sticky to the OR of the rest.
4. Compute a preliminary significand $S = s_1 + s_2$ by adding $s_1$ to the p-bit register containing $s_2$. If the signs of $a_1$ and $a_2$ are different, the most-significant bit of S is 1, and there was no carry-out, then S is negative. Replace S with its two’s complement. This can only happen when d = 0.
5. Shift S as follows. If the signs of $a_1$ and $a_2$ are the same and there was a carry-out in step 4, shift S right by one, filling in the high-order position with 1 (the carry-out). Otherwise shift it left until it is normalized. When left-shifting, on the first shift fill in the low-order position with the g bit. After that, shift in zeros. Adjust the exponent of the result accordingly.
6. Adjust r and s. If S was shifted right in step 5, set r := low-order bit of S before shifting and s := g OR r OR s. If there was no shift, set r := g, s := r OR s. If there was a single left shift, don’t change r and s. If there were two or more left shifts, r := 0, s := 0. (In the last case, two or more shifts can only happen when $a_1$ and $a_2$ have opposite signs and the same exponent, in which case the computation $s_1 + s_2$ in step 4 will be exact.)
7. Round S using Figure H.11; namely, if a table entry is nonempty, add 1 to the low-order bit of S. If rounding causes carry-out, shift S right and adjust the exponent. This is the significand of the result.
8. Compute the sign of the result. If $a_1$ and $a_2$ have the same sign, this is the sign of the result. If $a_1$ and $a_2$ have different signs, then the sign of the result depends on which of $a_1$, $a_2$ is negative, whether there was a swap in step 1, and whether S was replaced by its two’s complement in step 4. See Figure H.12.

Example

Use the algorithm to compute the sum $(-1.001_2 \times 2^{-2}) + (-1.111_2 \times 2^{0})$.

Answer

$s_1 = 1.001$, $e_1 = -2$, $s_2 = 1.111$, $e_2 = 0$

1. $e_1 < e_2$, so swap. d = 2. Tentative exp = 0.
2. Signs of both operands negative, don’t negate $s_2$.
3. Shift $s_2$ (1.001 after swap) right by 2, giving $s_2 = .010$, g = 0, r = 1, s = 0.
4. Add the significands:

$$\begin{array}{r@{}l} 1 & .111 \\ +\ & .010 \\ \hline (1)0 & .001 \end{array}$$

S = 0.001, with a carry-out.

5. Carry-out, so shift S right, S = 1.000, exp = exp + 1, so exp = 1.
| swap | compl | sign(a1) | sign(a2) | sign(result) |
|---|---|---|---|---|
| Yes | | + | − | − |
| Yes | | − | + | + |
| No | No | + | − | + |
| No | No | − | + | − |
| No | Yes | + | − | − |
| No | Yes | − | + | + |

Figure H.12 Rules for computing the sign of a sum when the addends have different signs. The swap column refers to swapping the operands in step 1, while the compl column refers to performing a two’s complement in step 4. Blanks are “don’t care.”
6. r = low-order bit of sum = 1, s = g ∨ r ∨ s = 0 ∨ 1 ∨ 0 = 1.
7. r AND s = TRUE, so Figure H.11 says round up, S = S + 1 or S = 1.001.
8. Both signs negative, so sign of result is negative. Final answer: $-S \times 2^{\text{exp}} = -1.001_2 \times 2^{1}$.

Example

Use the algorithm to compute the sum $(-1.010_2) + 1.100_2$.

Answer

$s_1 = 1.010$, $e_1 = 0$, $s_2 = 1.100$, $e_2 = 0$

1. No swap, d = 0, tentative exp = 0.
2. Signs differ, replace $s_2$ with 0.100.
3. d = 0, so no shift. r = g = s = 0.
4. Add the significands:

$$\begin{array}{r@{}l} 1 & .010 \\ +\ 0 & .100 \\ \hline 1 & .110 \end{array}$$

Signs are different, most-significant bit is 1, no carry-out, so must two’s complement sum, giving S = 0.010.

5. Shift left twice, so S = 1.000, exp = exp − 2, or exp = −2.
6. Two left shifts, so r = g = s = 0.
7. No addition required for rounding.
8. Answer is sign × S × $2^{\text{exp}}$ or sign × 1.000 × $2^{-2}$. Get sign from Figure H.12. Since complement but no swap and sign($a_1$) is −, the sign of sum is +. Thus answer = $+1.000_2 \times 2^{-2}$.
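The claim that a p-bit adder plus a round bit and a sticky bit is enough can be checked with a simplified Python sketch. It covers only the same-sign case with round to nearest, so it is not the full eight-step algorithm: there is no two's complement step and no separate guard bit, and the first discarded bit is called the round bit, as in the informal examples at the start of this section. The function name `add_same_sign` and the integer encoding of significands (leading 1 in bit p − 1) are assumptions of the sketch.

```python
def add_same_sign(s1, e1, s2, e2, p):
    """Add the magnitudes s1 * 2**e1 and s2 * 2**e2 of two same-sign operands.

    Significands are p-bit integers with the leading 1 in bit p - 1.
    Returns (significand, exponent), rounded to nearest with ties to even.
    """
    if e1 < e2:                               # make operand 1 the larger exponent
        s1, e1, s2, e2 = s2, e2, s1, e1
    d = e1 - e2
    kept = s2 >> d                            # bits that reach the p-bit adder
    shifted_out = s2 & ((1 << d) - 1)         # bits discarded by the alignment
    if d == 0:
        r, sticky = 0, 0
    else:
        r = (shifted_out >> (d - 1)) & 1      # first discarded bit
        sticky = int(shifted_out & ((1 << (d - 1)) - 1) != 0)
    total = s1 + kept                         # the p-bit addition
    e = e1
    if total >> p:                            # carry-out on the left
        sticky |= r                           # old round bit joins sticky
        r = total & 1                         # low-order sum bit becomes round
        total >>= 1
        e += 1
    if r and (sticky or (total & 1)):         # Figure H.11, nearest mode
        total += 1
        if total >> p:                        # rounding carried out
            total >>= 1
            e += 1
    return total, e

first = add_same_sign(0b110011, 0, 0b110001, -5, 6)    # 1.10011 + 1.10001 x 2^-5
second = add_same_sign(0b111011, 0, 0b101001, -2, 6)   # 1.11011 + .0101001
third = add_same_sign(0b1001, -2, 0b1111, 0, 4)        # magnitudes of the first worked example
```

The three calls reproduce the results in the text: `1.10101`, `10.0011` (returned as `1.00011` with exponent 1), and `1.001` with exponent 1.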
Speeding Up Addition

Let’s estimate how long it takes to perform the algorithm above. Step 2 may require an addition, step 4 requires one or two additions, and step 7 may require an addition. If it takes T time units to perform a p-bit add (where p = 24 for single precision, 53 for double), then it appears the algorithm will take at least 4T time units. But that is too pessimistic. If step 4 requires two adds, then $a_1$ and $a_2$ have the same exponent and different signs. But in that case the difference is exact, and so no roundup is required in step 7. Thus only three additions will ever occur. Similarly, it appears that a variable shift may be required both in step 3 and step 5. But if $|e_1 - e_2| \le 1$, then step 3 requires a right shift of at most one place, so only step 5 needs a variable shift. And if $|e_1 - e_2| > 1$, then step 3 needs a variable shift, but step 5 will require a left shift of at most one place. So only a single variable shift will be performed. Still, the algorithm requires three sequential adds, which, in the case of a 53-bit double-precision significand, can be rather time consuming.
A number of techniques can speed up addition. One is to use pipelining. The “Putting It All Together” section gives examples of how some commercial chips pipeline addition. Another method (used on the Intel 860 [Kohn and Fu 1989]) is to perform two additions in parallel. We now explain how this reduces the latency from 3T to T.
There are three cases to consider. First, suppose that both operands have the same sign. We want to combine the addition operations from steps 4 and 7. The position of the high-order bit of the sum is not known ahead of time, because the addition in step 4 may or may not cause a carry-out. Both possibilities are accounted for by having two adders. The first adder assumes the add in step 4 will not result in a carry-out. Thus the values of r and s can be computed before the add is actually done. If r and s indicate a roundup is necessary, the first adder will compute S = s1 + s2 + 1, where the notation +1 means adding 1 at the position of the least-significant bit of s1. This can be done with a regular adder by setting the low-order carry-in bit to 1. If r and s indicate no roundup, the adder computes S = s1 + s2 as usual. One extra detail: when r = 1, s = 0, you will also need to know the low-order bit of the sum, which can also be computed in advance very quickly. The second adder covers the possibility that there will be carry-out. The values of r and s and the position where the roundup 1 is added are different from above, but again they can be quickly computed in advance. It is not known whether there will be a carry-out until after the add is actually done, but that doesn’t matter. By doing both adds in parallel, one adder is guaranteed to produce the correct answer.
The next case is when a1 and a2 have opposite signs, but the same exponent. The sum a1 + a2 is exact in this case (no roundup is necessary), but the sign isn’t known until the add is completed.
So don’t compute the two’s complement (which requires an add) in step 2, but instead compute $\overline{s_1} + s_2 + 1$ and $s_1 + \overline{s_2} + 1$ in parallel, where the overbar denotes the bitwise complement. The first sum has the result of simultaneously complementing $s_1$ and computing the sum, resulting in $s_2 - s_1$. The second sum computes $s_1 - s_2$. One of these will be nonnegative and hence the correct final answer. Once again, all the additions are done in one step using two adders operating in parallel.
The last case, when $a_1$ and $a_2$ have opposite signs and different exponents, is more complex. If $|e_1 - e_2| > 1$, the location of the leading bit of the difference is in one of two locations, so there are two cases just as in addition. When $|e_1 - e_2| = 1$, cancellation is possible and the leading bit could be almost anywhere. However, only if the leading bit of the difference is in the same position as the leading bit of $s_1$ could a roundup be necessary. So one adder assumes a roundup, the other assumes no roundup. Thus the addition of step 4 and the rounding of step 7 can be combined. However, there is still the problem of the addition in step 2! To eliminate this addition, consider the following diagram of step 4:

        |-- p --|
s1      1.xxxxxxx
s2    −       1yyzzzzz

If the bits marked z are all 0, then the high-order p bits of $S = s_1 - s_2$ can be computed as $s_1 + \overline{s_2} + 1$. If at least one of the z bits is 1, use $s_1 + \overline{s_2}$. So $s_1 - s_2$ can be
computed with one addition. However, we still don’t know g and r for the two’s complement of $s_2$, which are needed for rounding in step 7. To compute $s_1 - s_2$ and get the proper g and r bits, combine steps 2 and 4 as follows. Don’t complement $s_2$ in step 2. Extend the adder used for computing S two bits to the right (call the extended sum S′). If the preliminary sticky bit (computed in step 3) is 1, compute $S' = s'_1 + \overline{s'_2}$, where $s'_1$ has two 0 bits tacked onto the right, $s'_2$ has preliminary g and r appended, and the overbar denotes the bitwise complement. If the sticky bit is 0, compute $s'_1 + \overline{s'_2} + 1$. Now the two low-order bits of S′ have the correct values of g and r (the sticky bit was already computed properly in step 3). Finally, this modification can be combined with the modification that combines the addition from steps 4 and 7 to provide the final result in time T, the time for one addition.
A few more details need to be considered, as discussed in Santoro, Bewick, and Horowitz [1989] and Exercise H.17. Although the Santoro paper is aimed at multiplication, much of the discussion applies to addition as well. Also relevant is Exercise H.19, which contains an alternate method for adding signed magnitude numbers.
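The trick used for operands with opposite signs and equal exponents can be illustrated in a few lines. The sketch below is only a software model under simplifying assumptions: significands are held as p-bit Python integers, and the function name `parallel_subtract` is hypothetical. It forms both complemented sums and uses the carry-out to pick the nonnegative difference, which is what the two hardware adders would do in parallel.

```python
def parallel_subtract(s1: int, s2: int, p: int) -> tuple[int, bool]:
    """Return (|s1 - s2|, s1_is_smaller) for p-bit significands held as integers."""
    mask = (1 << p) - 1
    not_s1 = ~s1 & mask          # bitwise complement within p bits
    not_s2 = ~s2 & mask
    # The two candidate sums; the "+ 1" models a low-order carry-in of 1.
    a = not_s1 + s2 + 1          # equals s2 - s1 + 2**p
    b = s1 + not_s2 + 1          # equals s1 - s2 + 2**p
    # A carry-out of bit p marks the sum whose difference is nonnegative.
    if b >> p:
        return b & mask, False
    return a & mask, True


# The worked example above: 1.010 and 1.100 as 4-bit integers.
magnitude, s1_is_smaller = parallel_subtract(0b1010, 0b1100, 4)
```

In this model the two sums are computed one after the other; the point is only that neither depends on the other, so hardware can evaluate them at the same time.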
Denormalized Numbers Unlike multiplication, for addition very little changes in the preceding description if one of the inputs is a denormal number. There must be a test to see if the exponent field is 0. If it is, then when unpacking the significand there will not be a leading 1. By setting the biased exponent to 1 when unpacking a denormal, the algorithm works unchanged. To deal with denormalized outputs, step 5 must be modified slightly. Shift S until it is normalized, or until the exponent becomes Emin (that is, the biased exponent becomes 1). If the exponent is Emin and, after rounding, the high-order bit of S is 1, then the result is a normalized number and should be packed in the usual way, by omitting the 1. If, on the other hand, the high-order bit is 0, the result is denormal. When the result is unpacked, the exponent field must be set to 0. Section H.7 discusses the exact rules for detecting underflow. Incidentally, detecting overflow is very easy. It can only happen if step 5 involves a shift right and the biased exponent at that point is bumped up to 255 in single precision (or 2047 for double precision), or if this occurs after rounding.
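As a rough sketch of the unpacking rule just described, the following code splits a single-precision value into its fields and treats a zero exponent field as a denormal: no leading 1 is inserted and the biased exponent is set to 1. The helper names `unpack_single` and `value` are illustrative, and infinities and NaNs (exponent field 255) are deliberately not handled.

```python
import struct


def unpack_single(x: float):
    """Split an IEEE single into (sign, biased exponent, 24-bit significand)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exp_field == 0:
        # Denormal (or zero): no hidden 1, and the biased exponent is treated as 1.
        return sign, 1, fraction
    return sign, exp_field, (1 << 23) | fraction


def value(sign, biased_exp, significand):
    """Rebuild the number; the significand has 23 bits after the binary point."""
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 127 - 23)


smallest_denormal = 2.0 ** -149
fields = unpack_single(smallest_denormal)
```

Because the denormal is unpacked with biased exponent 1, the same reconstruction formula works for normalized and denormalized inputs, which is why the addition algorithm needs no other change.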
H.6
Division and Remainder In this section, we’ll discuss floating-point division and remainder.
Iterative Division We earlier discussed an algorithm for integer division. Converting it into a floating-point division algorithm is similar to converting the integer multiplication algorithm into floating point. The formula $(s_1 \times 2^{e_1}) / (s_2 \times 2^{e_2}) = (s_1 / s_2) \times 2^{e_1 - e_2}$
shows that if the divider computes $s_1/s_2$, then the final answer will be this quotient multiplied by $2^{e_1 - e_2}$. Referring to Figure H.2(b) (page H-4), the alignment of operands is slightly different from integer division. Load $s_2$ into B and $s_1$ into P. The A register is not needed to hold the operands. Then the integer algorithm for division (with the one small change of skipping the very first left shift) can be used, and the result will be of the form $q_0.q_1\cdots$. To round, simply compute two additional quotient bits (guard and round) and use the remainder as the sticky bit. The guard digit is necessary because the first quotient bit might be 0. However, since the numerator and denominator are both normalized, it is not possible for the two most-significant quotient bits to be 0. This algorithm produces one quotient bit in each step.
A different approach to division converges to the quotient at a quadratic rather than a linear rate. An actual machine that uses this algorithm will be discussed in Section H.10. First, we will describe the two main iterative algorithms, and then we will discuss the pros and cons of iteration when compared with the direct algorithms.
There is a general technique for constructing iterative algorithms, called Newton’s iteration, shown in Figure H.13. First, cast the problem in the form of finding the zero of a function. Then, starting from a guess for the zero, approximate the function by its tangent at that guess and form a new guess based on where the tangent has a zero. If $x_i$ is a guess at a zero, then the tangent line has the equation
$$y - f(x_i) = f'(x_i)(x - x_i)$$
This equation has a zero at
$$x = x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \tag{H.6.1}$$
To recast division as finding the zero of a function, consider $f(x) = x^{-1} - b$. Since the zero of this function is at 1/b, applying Newton’s iteration to it will give
Figure H.13 Newton’s iteration for zero finding. If $x_i$ is an estimate for a zero of f, then $x_{i+1}$ is a better estimate. To compute $x_{i+1}$, find the intersection of the x-axis with the tangent line to f at $f(x_i)$.
an iterative method of computing 1/b from b. Using $f'(x) = -1/x^2$, Equation H.6.1 becomes
$$x_{i+1} = x_i - \frac{1/x_i - b}{-1/x_i^2} = x_i + x_i - x_i^2 b = x_i(2 - x_i b) \tag{H.6.2}$$
Thus, we could implement computation of a/b using the following method:
1. Scale b to lie in the range 1 ≤ b < 2 and get an approximate value of 1/b (call it $x_0$) using a table lookup.
2. Iterate $x_{i+1} = x_i(2 - x_i b)$ until reaching an $x_n$ that is accurate enough.
3. Compute $a x_n$ and reverse the scaling done in step 1.
Here are some more details. How many times will step 2 have to be iterated? To say that $x_i$ is accurate to p bits means that $|x_i - 1/b| / (1/b) = 2^{-p}$, and a simple algebraic manipulation shows that when this is so, then $|x_{i+1} - 1/b| / (1/b) = 2^{-2p}$. Thus the number of correct bits doubles at each step. Newton’s iteration is self-correcting in the sense that making an error in $x_i$ doesn’t really matter. That is, it treats $x_i$ as a guess at 1/b and returns $x_{i+1}$ as an improvement on it (roughly doubling the digits). One thing that would cause $x_i$ to be in error is rounding error. More importantly, however, in the early iterations we can take advantage of the fact that we don’t expect many correct bits by performing the multiplication in reduced precision, thus gaining speed without sacrificing accuracy. Another application of Newton’s iteration is discussed in Exercise H.20.
The second iterative division method is sometimes called Goldschmidt’s algorithm. It is based on the idea that to compute a/b, you should multiply the numerator and denominator by a number r with rb ≈ 1. In more detail, let $x_0 = a$ and $y_0 = b$. At each step compute $x_{i+1} = r_i x_i$ and $y_{i+1} = r_i y_i$. Then the quotient $x_{i+1}/y_{i+1} = x_i/y_i = a/b$ is constant. If we pick $r_i$ so that $y_i \to 1$, then $x_i \to a/b$, so the $x_i$ converge to the answer we want. This same idea can be used to compute other functions. For example, to compute the square root of a, let $x_0 = a$ and $y_0 = a$, and at each step compute $x_{i+1} = r_i^2 x_i$, $y_{i+1} = r_i y_i$. Then $x_{i+1}/y_{i+1}^2 = x_i/y_i^2 = 1/a$, so if the $r_i$ are chosen to drive $x_i \to 1$, then $y_i \to \sqrt{a}$. This technique is used to compute square roots on the TI 8847.
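The doubling of correct bits in Newton’s iteration for 1/b can be checked numerically. The sketch below uses exact rational arithmetic so that rounding error does not blur the picture; the starting guess 5/8 merely stands in for a table lookup, and the function names are illustrative rather than taken from any real divider.

```python
from fractions import Fraction
import math


def newton_reciprocal(b: Fraction, x0: Fraction, steps: int):
    """Return the iterates x0, x1, ... of x_{i+1} = x_i * (2 - x_i * b)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] * (2 - xs[-1] * b))
    return xs


def correct_bits(x: Fraction, b: Fraction) -> float:
    """Number of correct bits p, from the relative error |x - 1/b| / (1/b) = 2**-p."""
    rel_err = abs(x * b - 1)
    return -math.log2(rel_err)


b = Fraction(3, 2)                             # already scaled into 1 <= b < 2
xs = newton_reciprocal(b, Fraction(5, 8), 4)   # x0 = 0.625 is the assumed table value
bits = [correct_bits(x, b) for x in xs]        # about 4, 8, 16, 32, 64
```

With this starting guess the relative error is exactly 1/16, so each iteration squares it and the count of correct bits doubles every step.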
Returning to Goldschmidt’s division algorithm, set $x_0 = a$ and $y_0 = b$, and write $b = 1 - \delta$, where $|\delta| < 1$. If we pick $r_0 = 1 + \delta$, then $y_1 = r_0 y_0 = 1 - \delta^2$. We next pick $r_1 = 1 + \delta^2$, so that $y_2 = r_1 y_1 = 1 - \delta^4$, and so on. Since $|\delta| < 1$, $y_i \to 1$. With this choice of $r_i$, the $x_i$ will be computed as $x_{i+1} = r_i x_i = (1 + \delta^{2^i}) x_i = (1 + (1 - b)^{2^i}) x_i$, or
$$x_{i+1} = a\,[1 + (1 - b)]\,[1 + (1 - b)^2]\,[1 + (1 - b)^4] \cdots [1 + (1 - b)^{2^i}] \tag{H.6.3}$$
There appear to be two problems with this algorithm. First, convergence is slow when b is not near 1 (that is, δ is not near 0); and second, the formula isn’t self-correcting—since the quotient is being computed as a product of independent terms, an error in one of them won’t get corrected. To deal with slow
convergence, if you want to compute a/b, look up an approximate inverse to b (call it b′), and run the algorithm on ab′/bb′. This will converge rapidly since bb′ ≈ 1. To deal with the self-correction problem, the computation should be run with a few bits of extra precision to compensate for rounding errors. However, Goldschmidt’s algorithm does have a weak form of self-correction, in that the precise value of the $r_i$ does not matter. Thus, in the first few iterations, when the full precision of $1 - \delta^{2^i}$ is not needed you can choose $r_i$ to be a truncation of $1 + \delta^{2^i}$, which may make these iterations run faster without affecting the speed of convergence. If $r_i$ is truncated, then $y_i$ is no longer exactly $1 - \delta^{2^i}$. Thus, Equation H.6.3 can no longer be used, but it is easy to organize the computation so that it does not depend on the precise value of $r_i$. With these changes, Goldschmidt’s algorithm is as follows (the notes in brackets show the connection with our earlier formulas).
1. Scale a and b so that 1 ≤ b < 2.
2. Look up an approximation to 1/b (call it b′) in a table.
3. Set $x_0 = ab'$ and $y_0 = bb'$.
4. Iterate until $x_i$ is close enough to a/b:
   Loop
      r ≈ 2 − y      [if $y_i = 1 + \delta_i$, then $r \approx 1 - \delta_i$]
      y = y × r      [$y_{i+1} = y_i \times r \approx 1 - \delta_i^2$]
      x = x × r      [$x_{i+1} = x_i \times r$]
   End loop
The two iteration methods are related. Suppose in Newton’s method that we unroll the iteration and compute each term $x_{i+1}$ directly in terms of b, instead of recursively in terms of $x_i$. By carrying out this calculation (see Exercise H.22), we discover that
$$x_{i+1} = x_0(2 - x_0 b)\,[1 + (x_0 b - 1)^2]\,[1 + (x_0 b - 1)^4] \cdots [1 + (x_0 b - 1)^{2^i}]$$
This formula is very similar to Equation H.6.3. In fact they are identical if a, b in H.6.3 are replaced with ax0, bx0 and a = 1. Thus if the iterations were done to infinite precision, the two methods would yield exactly the same sequence xi. The advantage of iteration is that it doesn’t require special divide hardware. Instead, it can use the multiplier (which, however, requires extra control). Further, on each step, it delivers twice as many digits as in the previous step—unlike ordinary division, which produces a fixed number of digits at every step. There are two disadvantages with inverting by iteration. The first is that the IEEE standard requires division to be correctly rounded, but iteration only delivers a result that is close to the correctly rounded answer. In the case of Newton’s
iteration, which computes 1/b instead of a/b directly, there is an additional problem. Even if 1/b were correctly rounded, there is no guarantee that a/b will be. An example in decimal with p = 2 is a = 13, b = 51. Then a/b = .2549. . . , which rounds to .25. But 1/b = .0196. . . , which rounds to .020, and then a × .020 = .26, which is off by 1. The second disadvantage is that iteration does not give a remainder. This is especially troublesome if the floating-point divide hardware is being used to perform integer division, since a remainder operation is present in almost every high-level language. Traditional folklore has held that the way to get a correctly rounded result from iteration is to compute 1/b to slightly more than 2p bits, compute a/b to slightly more than 2p bits, and then round to p bits. However, there is a faster way, which apparently was first implemented on the TI 8847. In this method, a/b is computed to about 6 extra bits of precision, giving a preliminary quotient q. By comparing qb with a (again with only 6 extra bits), it is possible to quickly decide whether q is correctly rounded or whether it needs to be bumped up or down by 1 in the least-significant place. This algorithm is explored further in Exercise H.21. One factor to take into account when deciding on division algorithms is the relative speed of division and multiplication. Since division is more complex than multiplication, it will run more slowly. A common rule of thumb is that division algorithms should try to achieve a speed that is about one-third that of multiplication. One argument in favor of this rule is that there are real programs (such as some versions of spice) where the ratio of division to multiplication is 1:3. Another place where a factor of 3 arises is in the standard iterative method for computing square root. This method involves one division per iteration, but it can be replaced by one using three multiplications. This is discussed in Exercise H.20.
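Goldschmidt’s algorithm, as listed in steps 1 through 4 earlier in this section, translates almost line for line into code. The following is a minimal sketch using exact rationals, so it shows only the convergence behavior and none of the rounding issues just discussed; the operand values and the table value 5/8 for b′ are assumptions chosen for illustration.

```python
from fractions import Fraction


def goldschmidt_divide(a: Fraction, b: Fraction, b_inv_approx: Fraction, steps: int) -> Fraction:
    """Approximate a/b, assuming 1 <= b < 2 and b_inv_approx is close to 1/b."""
    x = a * b_inv_approx        # x0 = a * b'
    y = b * b_inv_approx        # y0 = b * b', close to 1
    for _ in range(steps):
        r = 2 - y               # if y = 1 + d, then r = 1 - d
        y = y * r               # y becomes 1 - d**2
        x = x * r               # x / y stays equal to a / b
    return x


a, b = Fraction(13), Fraction(51, 32)            # b already scaled into [1, 2)
q = goldschmidt_divide(a, b, Fraction(5, 8), 4)
rel_err = abs(q / (a / b) - 1)                   # relative error of the quotient
```

Here $y_0 = 255/256$, so the error in $y$ starts at $2^{-8}$ and is squared on each of the four iterations. Since $x_i/y_i$ stays equal to a/b, the quotient inherits the same relative error as $y$.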
Floating-Point Remainder For nonnegative integers, integer division and remainder satisfy a = (a DIV b)b + a REM b, 0 ≤ a REM b < b
A floating-point remainder x REM y can be similarly defined as x = INT(x/y)y + x REM y. How should x/y be converted to an integer? The IEEE remainder function uses the round-to-even rule. That is, pick n = INT(x/y) so that $|x/y - n| \le 1/2$. If two different n satisfy this relation, pick the even one. Then REM is defined to be x − yn. Unlike integers where 0 ≤ a REM b < b, for floating-point numbers $|x \text{ REM } y| \le y/2$. Although this defines REM precisely, it is not a practical operational definition, because n can be huge. In single precision, n could be as large as $2^{127}/2^{-126} = 2^{253} \approx 10^{76}$. There is a natural way to compute REM if a direct division algorithm is used. Proceed as if you were computing x/y. If $x = s_1 2^{e_1}$ and $y = s_2 2^{e_2}$ and the divider is as in Figure H.2(b) (page H-4), then load $s_1$ into P and $s_2$ into B. After $e_1 - e_2$ division steps, the P register will hold a number r of the form x − yn satisfying
0 ≤ r < y. Since the IEEE remainder satisfies |REM| ≤ y/2, REM is equal to either r or r − y. It is only necessary to keep track of the last quotient bit produced, which is needed to resolve halfway cases. Unfortunately, $e_1 - e_2$ can be a lot of steps, and floating-point units typically have a maximum amount of time they are allowed to spend on one instruction. Thus, it is usually not possible to implement REM directly. None of the chips discussed in Section H.10 implements REM, but they could by providing a remainder-step instruction; this is what is done on the Intel 8087 family. A remainder step takes as arguments two numbers x and y, and performs divide steps until either the remainder is in P or n steps have been performed, where n is a small number, such as the number of steps required for division in the highest-supported precision. Then REM can be implemented as a software routine that calls the REM step instruction $\lceil (e_1 - e_2)/n \rceil$ times, initially using x as the numerator, but then replacing it with the remainder from the previous REM step.
REM can be used for computing trigonometric functions. To simplify things, imagine that we are working in base 10 with five significant figures, and consider computing sin x. Suppose that x = 7. Then we can reduce by π = 3.1416 and compute sin(7) = sin(7 − 2 × 3.1416) = sin(0.7168) instead. But suppose we want to compute $\sin(2.0 \times 10^5)$. Then $2 \times 10^5/3.1416 = 63661.8$, which in our five-place system comes out to be 63662. Since multiplying 3.1416 times 63662 gives 200000.5392, which rounds to $2.0000 \times 10^5$, argument reduction reduces $2 \times 10^5$ to 0, which is not even close to being correct. The problem is that our five-place system does not have the precision to do correct argument reduction. Suppose we had the REM operator. Then we could compute $2 \times 10^5$ REM 3.1416 and get −.53920. However, this is still not correct because we used 3.1416, which is an approximation for π.
The value of $2 \times 10^5$ REM π is −.071513.
Traditionally, there have been two approaches to computing periodic functions with large arguments. The first is to return an error for their value when x is large. The second is to store π to a very large number of places and do exact argument reduction. The REM operator is not much help in either of these situations. There is a third approach that has been used in some math libraries, such as the Berkeley UNIX 4.3bsd release. In these libraries, π is computed to the nearest floating-point number. Let’s call this machine π, and denote it by π′. Then when computing sin x, reduce x using x REM π′. As we saw in the above example, x REM π′ is quite different from x REM π when x is large, so that computing sin x as sin(x REM π′) will not give the exact value of sin x. However, computing trigonometric functions in this fashion has the property that all familiar identities (such as $\sin^2 x + \cos^2 x = 1$) are true to within a few rounding errors. Thus, using REM together with machine π provides a simple method of computing trigonometric functions that is accurate for small arguments and still may be useful for large arguments.
When REM is used for argument reduction, it is very handy if it also returns the low-order bits of n (where x REM y = x − ny). This is because a practical implementation of trigonometric functions will reduce by something smaller than 2π. For example, it might use π/2, exploiting identities such as sin(x − π/2) = −cos
x, sin(x − π) = −sin x. Then the low bits of n are needed to choose the correct identity.
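The definition of REM used in this section can be written down directly in exact rational arithmetic. This is only a sketch of the definition, not of the divide-step hardware, and the name `ieee_rem` is hypothetical. Python’s standard library also provides `math.remainder`, which computes the IEEE remainder on doubles and is used here for comparison.

```python
from fractions import Fraction
import math


def ieee_rem(x: Fraction, y: Fraction) -> Fraction:
    """x REM y = x - y*n, where n is x/y rounded to nearest, ties to even."""
    n = round(x / y)            # round() on a Fraction rounds halves to even
    return x - y * n


# The five-digit decimal example from the text: 2 x 10^5 REM 3.1416.
r = ieee_rem(Fraction(200000), Fraction("3.1416"))   # exactly -0.5392

# Halfway cases pick the even n: 5/2 = 2.5 gives n = 2, 7/2 = 3.5 gives n = 4.
r5 = ieee_rem(Fraction(5), Fraction(2))     # 1
r7 = ieee_rem(Fraction(7), Fraction(2))     # -1

# Reduction by machine pi, as in the library approach described above.
reduced = math.remainder(200000.0, math.pi)  # close to -.071513
```

Note that the result can be negative, which is the visible difference from the integer REM, where the remainder always lies in [0, b).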
H.7
More on Floating-Point Arithmetic Before leaving the subject of floating-point arithmetic, we present a few additional topics.
Fused Multiply-Add Probably the most common use of floating-point units is performing matrix operations, and the most frequent matrix operation is multiplying a matrix times a matrix (or vector), which boils down to computing an inner product, x1⋅y1 + x2⋅y2 + . . . + xn⋅yn. Computing this requires a series of multiply-add combinations. Motivated by this, the IBM RS/6000 introduced a single instruction that computes ab + c, the fused multiply-add. Although this requires being able to read three operands in a single instruction, it has the potential for improving the performance of computing inner products. The fused multiply-add computes ab + c exactly and then rounds. Although rounding only once increases the accuracy of inner products somewhat, that is not its primary motivation. There are two main advantages of rounding once. First, as we saw in the previous sections, rounding is expensive to implement because it may require an addition. By rounding only once, an addition operation has been eliminated. Second, the extra accuracy of fused multiply-add can be used to compute correctly rounded division and square root when these are not available directly in hardware. Fused multiply-add can also be used to implement efficient floating-point multiple-precision packages. The implementation of correctly rounded division using fused multiply-add has many details, but the main idea is simple. Consider again the example from Section H.6 (page H-31), which was computing a/b with a = 13, b = 51. Then 1/b rounds to b′ = .020, and ab′ rounds to q′ = .26, which is not the correctly rounded quotient. Applying fused multiply-add twice will correctly adjust the result, via the formulas r = a − bq′ q′′ = q′ + rb′
Computing to two-digit accuracy, bq′ = 51 × .26 rounds to 13, and so r = a − bq′ would be 0, giving no adjustment. But using fused multiply-add gives r = a − bq′ = 13 − (51 × .26) = −.26, and then q′′ = q′ + rb′ = .26 − .0052 = .2548, which rounds to the correct quotient, .25. More details can be found in the papers by Montoye, Hokenek, and Runyon [1990] and Markstein [1990].
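The two-digit decimal example can be replayed with Python’s `decimal` module, whose `fma` operation rounds only once. The code is a sketch of this one example, with a two-digit context standing in for the hypothetical p = 2 decimal machine; it is not the full correctly rounded division procedure from the cited papers.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

two_digit = Context(prec=2, rounding=ROUND_HALF_EVEN)   # p = 2 decimal digits
a, b = Decimal(13), Decimal(51)

b_inv = two_digit.divide(Decimal(1), b)   # 1/b = .0196... rounds to .020
q1 = two_digit.multiply(a, b_inv)         # .26, not the correctly rounded quotient

# Without fusing, b*q1 = 13.26 is rounded to 13 before the subtraction.
r_unfused = two_digit.subtract(a, two_digit.multiply(b, q1))   # 0

# Fused: r = a - b*q1 and q2 = q1 + r*b_inv, each rounded only once.
r = two_digit.fma(-b, q1, a)              # -.26
q2 = two_digit.fma(r, b_inv, q1)          # .2548 rounds to .25
```

The unfused residual is 0 and so carries no information, while the fused residual keeps the −.26 that is needed to pull the quotient down to .25.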
Precisions The standard specifies four precisions: single, single extended, double, and double extended. The properties of these precisions are summarized in Figure H.7 (page H-16). Implementations are not required to have all four precisions, but are encouraged to support either the combination of single and single extended or all of single, double, and double extended. Because of the widespread use of double precision in scientific computing, double precision is almost always implemented. Thus the computer designer usually only has to decide whether to support double extended and, if so, how many bits it should have. The Motorola 68882 and Intel 387 coprocessors implement extended precision using the smallest allowable size of 80 bits (64 bits of significand). However, many of the more recently designed, high-performance floating-point chips do not implement 80-bit extended precision. One reason is that the 80-bit width of extended precision is awkward for 64-bit buses and registers. Some new architectures, such as SPARC V8 and PA-RISC, specify a 128-bit extended (or quad) precision. They have established a de facto convention for quad that has 15 bits of exponent and 113 bits of significand.
Although most high-level languages do not provide access to extended precision, it is very useful to writers of mathematical software. As an example, consider writing a library routine to compute the length of a vector (x,y) in the plane, namely, $\sqrt{x^2 + y^2}$. If x is larger than $2^{E_{\max}/2}$, then computing this in the obvious way will overflow. This means that either the allowable exponent range for this subroutine will be cut in half or a more complex algorithm using scaling will have to be employed. But if extended precision is available, then the simple algorithm will work. Computing the length of a vector is a simple task, and it is not difficult to come up with an algorithm that doesn’t overflow.
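To make the overflow concrete, here is a small sketch in double precision, which is all that standard Python floats offer. It contrasts the obvious formula with one possible scaling algorithm; the function names are illustrative, and the scaling shown is just one simple choice, not necessarily what a production library uses.

```python
import math


def naive_length(x: float, y: float) -> float:
    """sqrt(x^2 + y^2) computed the obvious way; x*x can overflow to infinity."""
    return math.sqrt(x * x + y * y)


def scaled_length(x: float, y: float) -> float:
    """Same length, scaling by the larger magnitude so the squares stay at most 1."""
    m = max(abs(x), abs(y))
    if m == 0.0:
        return 0.0
    return m * math.sqrt((x / m) ** 2 + (y / m) ** 2)


overflowed = naive_length(3e200, 4e200)    # intermediate squares exceed the double range
rescued = scaled_length(3e200, 4e200)      # close to 5e200
```

The standard library function `math.hypot` addresses the same problem. With a wider exponent range for intermediates, as extended precision would provide, the naive version would have been enough.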
However, there are more complex problems for which extended precision means the difference between a simple, fast algorithm and a much more complex one. One of the best examples of this is binary-to-decimal conversion. An efficient algorithm for binary-to-decimal conversion that makes essential use of extended precision is very readably presented in Coonen [1984]. This algorithm is also briefly sketched in Goldberg [1991]. Computing accurate values for transcendental functions is another example of a problem that is made much easier if extended precision is present. One very important fact about precision concerns double rounding. To illustrate in decimals, suppose that we want to compute 1.9 × 0.66, and that single precision is two digits, while extended precision is three digits. The exact result of the product is 1.254. Rounded to extended precision, the result is 1.25. When further rounded to single precision, we get 1.2. However, the result of 1.9 × 0.66 correctly rounded to single precision is 1.3. Thus, rounding twice may not produce the same result as rounding once. Suppose you want to build hardware that only does double-precision arithmetic. Can you simulate single precision by computing first in double precision and then rounding to single? The above example suggests that you can’t. However, double rounding is not always dangerous. In fact, the following rule is true (this is not easy to prove, but see Exercise H.25).
If x and y have p-bit significands, and x + y is computed exactly and then rounded to q places, a second rounding to p places will not change the answer if q ≥ 2p + 2. This is true not only for addition, but also for multiplication, division, and square root. In our example above, q = 3 and p = 2, so q ≥ 2p + 2 is not true. On the other hand, for IEEE arithmetic, double precision has q = 53, p = 24, so q = 53 ≥ 2p + 2 = 50. Thus, single precision can be implemented by computing in double precision—that is, computing the answer exactly and then rounding to double—and then rounding to single precision.
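The decimal double-rounding example from this section (1.9 × 0.66 with two-digit “single” and three-digit “extended”) can be reproduced with the `decimal` module. The two contexts below are stand-ins for those hypothetical precisions; here q = 3 and p = 2, so the condition q ≥ 2p + 2 fails and the two results are allowed to differ.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

single = Context(prec=2, rounding=ROUND_HALF_EVEN)     # stands in for "single"
extended = Context(prec=3, rounding=ROUND_HALF_EVEN)   # stands in for "extended"

x, y = Decimal("1.9"), Decimal("0.66")

once = single.multiply(x, y)                     # 1.254 rounded once to 2 digits
twice = single.plus(extended.multiply(x, y))     # 1.254 -> 1.25 -> 2 digits
```

Rounding once gives 1.3, while rounding through the intermediate precision gives 1.2, because the first rounding manufactures a halfway case that round-to-even then resolves downward.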
Exceptions The IEEE standard defines five exceptions: underflow, overflow, divide by zero, inexact, and invalid. By default, when these exceptions occur, they merely set a flag and the computation continues. The flags are sticky, meaning that once set they remain set until explicitly cleared. The standard strongly encourages implementations to provide a trap-enable bit for each exception. When an exception with an enabled trap handler occurs, a user trap handler is called, and the value of the associated exception flag is undefined. In Section H.3 we mentioned that $\sqrt{-3}$ has the value NaN and 1/0 is ∞. These are examples of operations that raise an exception. By default, computing $\sqrt{-3}$ sets the invalid flag and returns the value NaN. Similarly 1/0 sets the divide-by-zero flag and returns ∞.
The underflow, overflow, and divide-by-zero exceptions are found in most other systems. The invalid exception is for the result of operations such as $\sqrt{-1}$, 0/0, or ∞ − ∞, which don’t have any natural value as a floating-point number or as ±∞. The inexact exception is peculiar to IEEE arithmetic and occurs either when the result of an operation must be rounded or when it overflows. In fact, since 1/0 and an operation that overflows both deliver ∞, the exception flags must be consulted to distinguish between them. The inexact exception is an unusual “exception,” in that it is not really an exceptional condition because it occurs so frequently. Thus, enabling a trap handler for the inexact exception will most likely have a severe impact on performance. Enabling a trap handler doesn’t affect whether an operation is exceptional except in the case of underflow. This is discussed below.
The IEEE standard assumes that when a trap occurs, it is possible to identify the operation that trapped and its operands. On machines with pipelining or multiple arithmetic units, when an exception occurs, it may not be enough to simply have the trap handler examine the program counter.
Hardware support may be necessary to identify exactly which operation trapped. Another problem is illustrated by the following program fragment.
   r1 = r2 / r3
   r2 = r4 + r5
These two instructions might well be executed in parallel. If the divide traps, its argument r2 could already have been overwritten by the addition, especially since addition is almost always faster than division. Computer systems that support trapping in the IEEE standard must provide some way to save the value of r2, either in hardware or by having the compiler avoid such a situation in the first place. This kind of problem is not peculiar to floating point. In the sequence r1 = 0(r2) r2 = r3 it would be efficient to execute r2 = r3 while waiting for memory. But if accessing 0(r2) causes a page fault, r2 might no longer be available for restarting the instruction r1 = 0(r2). One approach to this problem, used in the MIPS R3010, is to identify instructions that may cause an exception early in the instruction cycle. For example, an addition can overflow only if one of the operands has an exponent of Emax, and so on. This early check is conservative: It might flag an operation that doesn’t actually cause an exception. However, if such false positives are rare, then this technique will have excellent performance. When an instruction is tagged as being possibly exceptional, special code in a trap handler can compute it without destroying any state. Remember that all these problems occur only when trap handlers are enabled. Otherwise, setting the exception flags during normal processing is straightforward.
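Sticky flags and trap enables are easy to experiment with in software. Python’s `decimal` module follows a decimal arithmetic specification rather than binary IEEE 754, but its signals, flags, and traps are closely modeled on the same ideas, so it is borrowed here purely as a stand-in. The precision of 7 digits is an arbitrary choice.

```python
from decimal import Context, Decimal, DivisionByZero, Inexact, InvalidOperation

ctx = Context(prec=7, traps=[])              # all traps disabled: flags only

inf = ctx.divide(Decimal(1), Decimal(0))     # returns Infinity, sets DivisionByZero
nan = ctx.sqrt(Decimal(-3))                  # returns NaN, sets InvalidOperation
third = ctx.divide(Decimal(1), Decimal(3))   # rounded result, sets Inexact
exact = ctx.add(Decimal(1), Decimal(2))      # exact; earlier flags stay set (sticky)

sticky = [bool(ctx.flags[s]) for s in (DivisionByZero, InvalidOperation, Inexact)]

ctx.clear_flags()                            # flags remain set until explicitly cleared
cleared = [bool(ctx.flags[s]) for s in (DivisionByZero, InvalidOperation, Inexact)]

# Enabling a trap turns the flagged condition into a Python exception instead.
trapping = Context(prec=7, traps=[DivisionByZero])
try:
    trapping.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True
```

With traps disabled the computation simply continues with ∞ or NaN, and only an explicit look at the flags reveals that anything unusual happened.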
Underflow We have alluded several times to the fact that detection of underflow is more complex than for the other exceptions. The IEEE standard specifies that if an underflow trap handler is enabled, the system must trap if the result is denormal. On the other hand, if trap handlers are disabled, then the underflow flag is set only if there is a loss of accuracy, that is, if the result must be rounded. The rationale is, if no accuracy is lost on an underflow, there is no point in setting a warning flag. But if a trap handler is enabled, the user might be trying to simulate flush-to-zero and should therefore be notified whenever a result dips below $1.0 \times 2^{E_{\min}}$.
So if there is no trap handler, the underflow exception is signaled only when the result is denormal and inexact. But the definitions of denormal and inexact are both subject to multiple interpretations. Normally, inexact means there was a result that couldn’t be represented exactly and had to be rounded. Consider the example (in a base 2 floating-point system with 3-bit significands) of $(1.11_2 \times 2^{-2}) \times (1.11_2 \times 2^{E_{\min}}) = 0.110001_2 \times 2^{E_{\min}}$, with round to nearest in effect. The delivered result is $0.11_2 \times 2^{E_{\min}}$, which had to be rounded, causing inexact to be signaled. But is it correct to also signal underflow? Gradual underflow loses significance because the exponent range is bounded. If the exponent range were unbounded, the delivered result would be $1.10_2 \times 2^{E_{\min}-1}$, exactly the same answer obtained with gradual underflow. The fact that denormalized numbers
have fewer bits in their significand than normalized numbers therefore doesn’t make any difference in this case. The commentary to the standard [Cody et al. 1984] encourages this as the criterion for setting the underflow flag. That is, it should be set whenever the delivered result is different from what would be delivered in a system with the same fraction size, but with a very large exponent range. However, owing to the difficulty of implementing this scheme, the standard allows setting the underflow flag whenever the result is denormal and different from the infinitely precise result.
There are two possible definitions of what it means for a result to be denormal. Consider the example of $1.10_2 \times 2^{-1}$ multiplied by $1.01_2 \times 2^{E_{\min}}$. The exact product is $0.1111_2 \times 2^{E_{\min}}$. The rounded result is the normal number $1.00_2 \times 2^{E_{\min}}$. Should underflow be signaled? Signaling underflow means that you are using the before rounding rule, because the result was denormal before rounding. Not signaling underflow means that you are using the after rounding rule, because the result is normalized after rounding. The IEEE standard provides for choosing either rule; however, the one chosen must be used consistently for all operations.
To illustrate these rules, consider floating-point addition. When the result of an addition (or subtraction) is denormal, it is always exact. Thus the underflow flag never needs to be set for addition. That’s because if traps are not enabled, then no exception is raised. And if traps are enabled, the value of the underflow flag is undefined, so again it doesn’t need to be set.
One final subtlety should be mentioned concerning underflow. When there is no underflow trap handler, the result of an operation on p-bit numbers that causes an underflow is a denormal number with p − 1 or fewer bits of precision. When traps are enabled, the trap handler is provided with the result of the operation rounded to p bits and with the exponent wrapped around.
Now there is a potential double-rounding problem. If the trap handler wants to return the denormal result, it can’t just round its argument, because that might lead to a double-rounding error. Thus, the trap handler must be passed at least one extra bit of information if it is to be able to deliver the correctly rounded result.
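The before rounding and after rounding rules described above can be checked numerically on the product used in the example. The following is a minimal sketch using exact rational arithmetic. The function names, the choice p = 3, the value E_min = −2, and the use of round-to-nearest-even are all assumptions made for illustration and are not taken from the standard.

```python
from fractions import Fraction

def round_to_multiple(x, ulp):
    """Round a nonnegative Fraction to the nearest multiple of ulp (ties to even)."""
    q, r = divmod(x, ulp)                 # q is an int, 0 <= r < ulp
    if 2 * r > ulp or (2 * r == ulp and q % 2 == 1):
        q += 1
    return q * ulp

def round_unbounded(x, p):
    """Round positive x to p significant bits, pretending the exponent range is unlimited."""
    e = 0
    while Fraction(2) ** (e + 1) <= x:    # find e with 2^e <= x < 2^(e+1)
        e += 1
    while Fraction(2) ** e > x:
        e -= 1
    return round_to_multiple(x, Fraction(2) ** (e - (p - 1)))

def underflow_report(x, y, p, emin):
    """Classify the product x*y under both definitions of 'denormal' (hypothetical helper)."""
    exact = x * y
    smallest_normal = Fraction(2) ** emin
    if exact < smallest_normal:
        # Denormals are spaced 2^(emin - (p - 1)) apart.
        delivered = round_to_multiple(exact, Fraction(2) ** (emin - (p - 1)))
    else:
        delivered = round_unbounded(exact, p)
    return {
        "tiny_before_rounding": exact < smallest_normal,
        "tiny_after_rounding": round_unbounded(exact, p) < smallest_normal,
        "inexact": delivered != exact,
        "delivered": delivered,
    }

EMIN = -2                                  # illustrative value only
x = Fraction(3, 4)                         # 1.10_2 x 2^-1
y = Fraction(5, 4) * Fraction(2) ** EMIN   # 1.01_2 x 2^Emin
report = underflow_report(x, y, p=3, emin=EMIN)
```

With these inputs the exact product lies below 2^E_min (tiny before rounding), yet it rounds up to exactly 2^E_min, so it is not tiny after rounding. An implementation following the before rounding rule would signal underflow here, and one following the after rounding rule would not.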
H.8
Speeding Up Integer Addition
The previous section showed that many steps go into implementing floating-point operations. However, each floating-point operation eventually reduces to an integer operation. Thus, increasing the speed of integer operations will also lead to faster floating point. Integer addition is the simplest operation and the most important. Even for programs that don't do explicit arithmetic, addition must be performed to increment the program counter and to calculate addresses. Despite the simplicity of addition, there isn't a single best way to perform high-speed addition. We will discuss three techniques that are in current use: carry-lookahead, carry-skip, and carry-select.
Carry-Lookahead
An n-bit adder is just a combinational circuit. It can therefore be written by a logic formula whose form is a sum of products and can be computed by a circuit with two levels of logic. How do you figure out what this circuit looks like? From Equation H.2.1 (page H-3) the formula for the ith sum can be written as

$$s_i = a_i \bar{b}_i \bar{c}_i + \bar{a}_i b_i \bar{c}_i + \bar{a}_i \bar{b}_i c_i + a_i b_i c_i \tag{H.8.1}$$

where $c_i$ is both the carry-in to the ith adder and the carry-out from the (i−1)-st adder. The problem with this formula is that although we know the values of $a_i$ and $b_i$ (they are inputs to the circuit), we don't know $c_i$. So our goal is to write $c_i$ in terms of $a_i$ and $b_i$. To accomplish this, we first rewrite Equation H.2.2 (page H-3) as

$$c_i = g_{i-1} + p_{i-1} c_{i-1}, \quad g_{i-1} = a_{i-1} b_{i-1}, \quad p_{i-1} = a_{i-1} + b_{i-1} \tag{H.8.2}$$

Here is the reason for the symbols p and g: If $g_{i-1}$ is true, then $c_i$ is certainly true, so a carry is generated. Thus, g is for generate. If $p_{i-1}$ is true, then if $c_{i-1}$ is true, it is propagated to $c_i$. Start with Equation H.8.1 and use Equation H.8.2 to replace $c_i$ with $g_{i-1} + p_{i-1} c_{i-1}$. Then, use Equation H.8.2 with i − 1 in place of i to replace $c_{i-1}$ with $c_{i-2}$, and so on. This gives the result

$$c_i = g_{i-1} + p_{i-1} g_{i-2} + p_{i-1} p_{i-2} g_{i-3} + \cdots + p_{i-1} p_{i-2} \cdots p_1 g_0 + p_{i-1} p_{i-2} \cdots p_1 p_0 c_0 \tag{H.8.3}$$

An adder that computes carries using Equation H.8.3 is called a carry-lookahead adder, or CLA. A CLA requires one logic level to form p and g, two levels to form the carries, and two for the sum, for a grand total of five logic levels. This is a vast improvement over the 2n levels required for the ripple-carry adder.

Unfortunately, as is evident from Equation H.8.3 or from Figure H.14, a carry-lookahead adder on n bits requires a fan-in of n + 1 at the OR gate as well as at the rightmost AND gate. Also, the $p_{n-1}$ signal must drive n AND gates. In addition, the rather irregular structure and many long wires of Figure H.14 make it impractical to build a full carry-lookahead adder when n is large.

However, we can use the carry-lookahead idea to build an adder that has about $\log_2 n$ logic levels (substantially fewer than the 2n required by a ripple-carry adder) and yet has a simple, regular structure. The idea is to build up the p's and g's in steps. We have already seen that

$$c_1 = g_0 + c_0 p_0$$

This says there is a carry-out of the 0th position ($c_1$) either if there is a carry generated in the 0th position or if there is a carry into the 0th position and the carry propagates. Similarly,

$$c_2 = G_{01} + P_{01} c_0$$
Figure H.14 Pure carry-lookahead circuit for computing the carry-out $c_n$ of an n-bit adder, where $c_n = g_{n-1} + p_{n-1} g_{n-2} + \cdots + p_{n-1} p_{n-2} \cdots p_1 g_0 + p_{n-1} p_{n-2} \cdots p_0 c_0$.
$G_{01}$ means there is a carry generated out of the block consisting of the first two bits. $P_{01}$ means that a carry propagates through this block. P and G have the following logic equations:

$$G_{01} = g_1 + p_1 g_0$$
$$P_{01} = p_1 p_0$$

More generally, for any j with i < j, j + 1 < k, we have the recursive relations

$$c_{k+1} = G_{ik} + P_{ik} c_i \tag{H.8.4}$$
$$G_{ik} = G_{j+1,k} + P_{j+1,k} G_{ij} \tag{H.8.5}$$
$$P_{ik} = P_{ij} P_{j+1,k} \tag{H.8.6}$$

Equation H.8.5 says that a carry is generated out of the block consisting of bits i through k inclusive if it is generated in the high-order part of the block (j + 1, k) or if it is generated in the low-order part of the block (i, j) and then propagated through the high part. These equations will also hold for i ≤ j < k if we set $G_{ii} = g_i$ and $P_{ii} = p_i$.

Example
Express $P_{03}$ and $G_{03}$ in terms of p's and g's.

Answer
Using Equation H.8.6, $P_{03} = P_{01} P_{23} = P_{00} P_{11} P_{22} P_{33}$. Since $P_{ii} = p_i$, $P_{03} = p_0 p_1 p_2 p_3$. For $G_{03}$, Equation H.8.5 says $G_{03} = G_{23} + P_{23} G_{01} = (G_{33} + P_{33} G_{22}) + (P_{22} P_{33})(G_{11} + P_{11} G_{00}) = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$.
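The recursive relations H.8.4 through H.8.6 can be exercised in software and compared against a plain ripple-carry computation. The sketch below is illustrative only. The function names are hypothetical, and splitting each block at its midpoint is just one admissible choice of j.

```python
def gp_bits(a, b, n):
    """Per-bit generate and propagate signals: g_i = a_i b_i, p_i = a_i + b_i."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    return g, p

def block_gp(g, p, i, k):
    """Return (G_ik, P_ik) using Equations H.8.5 and H.8.6, splitting at the midpoint j."""
    if i == k:
        return g[i], p[i]                 # G_ii = g_i, P_ii = p_i
    j = (i + k) // 2
    G_lo, P_lo = block_gp(g, p, i, j)
    G_hi, P_hi = block_gp(g, p, j + 1, k)
    return G_hi | (P_hi & G_lo), P_lo & P_hi

def lookahead_carries(a, b, n, c0=0):
    """Carries c_0..c_n from Equation H.8.4 with i = 0."""
    g, p = gp_bits(a, b, n)
    carries = [c0]
    for k in range(n):
        G, P = block_gp(g, p, 0, k)
        carries.append(G | (P & c0))
    return carries

def ripple_carries(a, b, n, c0=0):
    """Reference carries from the bit-by-bit recurrence of Equation H.8.2."""
    carries = [c0]
    for i in range(n):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, carries[-1]
        carries.append((ai & bi) | ((ai | bi) & ci))
    return carries
```

For a = 1011 and b = 0101 in binary, only bit 0 generates and every bit propagates, so `block_gp` returns (1, 1) for the block 0 through 3, in agreement with the expansion of $G_{03}$ and $P_{03}$ in the example.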
With these preliminaries out of the way, we can now show the design of a practical CLA. The adder consists of two parts. The first part computes various values of P and G from $p_i$ and $g_i$, using Equations H.8.5 and H.8.6; the second part uses these P and G values to compute all the carries via Equation H.8.4. The first part of the design is shown in Figure H.15. At the top of the diagram, input numbers $a_7 \ldots a_0$ and $b_7 \ldots b_0$ are converted to p's and g's using cells of type 1. Then various P's and G's are generated by combining cells of type 2 in a binary tree structure. The second part of the design is shown in Figure H.16. By feeding $c_0$ in at the bottom of this tree, all the carry bits come out at the top. Each cell must know a pair of (P,G) values in order to do the conversion, and the value it needs is written inside the cells.

Now compare Figures H.15 and H.16. There is a one-to-one correspondence between cells, and the value of (P,G) needed by the carry-generating cells is exactly the value known by the corresponding (P,G)-generating cells. The combined cell is shown in Figure H.17. The numbers to be added flow into the top and downward through the tree, combining with $c_0$ at the bottom and flowing back up the tree to form the carries. Note that one thing is missing from Figure H.17: a small piece of extra logic to compute $c_8$ for the carry-out of the adder.

Figure H.15 First part of carry-lookahead tree. As signals flow from the top to the bottom, various values of P and G are computed. Cells of type 1 compute $g_i = a_i b_i$ and $p_i = a_i + b_i$; cells of type 2 compute $P_{i,k} = P_{i,j} P_{j+1,k}$ and $G_{i,k} = G_{j+1,k} + P_{j+1,k} G_{i,j}$.

Figure H.16 Second part of carry-lookahead tree. Signals flow from the bottom to the top, combining with P and G to form the carries. Each cell computes $c_{j+1} = G_{ij} + P_{ij} c_i$.

Figure H.17 Complete carry-lookahead tree adder. This is the combination of Figures H.15 and H.16. The numbers to be added enter at the top, flow to the bottom to combine with $c_0$, and then flow back up to compute the sum bits. Cells of type A compute $s_i = a_i \oplus b_i \oplus c_i$, $p_i = a_i + b_i$, and $g_i = a_i b_i$; cells of type B take $(G_{ij}, P_{ij})$, $(G_{j+1,k}, P_{j+1,k})$, and $c_i$ and produce $G_{i,k}$, $P_{i,k}$, and $c_{j+1}$.

The bits in a CLA must pass through about $\log_2 n$ logic levels, compared with 2n for a ripple-carry adder. This is a substantial speed improvement, especially for a large n. Whereas the ripple-carry adder had n cells, however, the CLA has 2n cells, although in our layout they will take n log n space. The point is that a small investment in size pays off in a dramatic improvement in speed.

A number of technology-dependent modifications can improve CLAs. For example, if each node of the tree has three inputs instead of two, then the height of the tree will decrease from $\log_2 n$ to $\log_3 n$. Of course, the cells will be more complex and thus might operate more slowly, negating the advantage of the decreased height. For technologies where rippling works well, a hybrid design might be better. This is illustrated in Figure H.18. Carries ripple between adders at the top level, while the "B" boxes are the same as those in Figure H.17. This design will be faster if the time to ripple between four adders is faster than the time it takes to traverse a level of "B" boxes. (To make the pattern more clear, Figure H.18 shows a 16-bit adder, so the 8-bit adder of Figure H.17 corresponds to the right half of Figure H.18.)
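The two-pass structure of the tree adder (P and G values flowing down, carries flowing back up) can be modeled with two small recursive functions. This is only a software sketch of the data flow, not a description of the hardware cells. The function names and the nested-tuple tree representation are assumptions made for the example.

```python
def build_gp_tree(g, p, lo, hi):
    """First pass (as in Figure H.15): return (G, P, low_subtree, high_subtree) for bits lo..hi."""
    if lo == hi:
        return (g[lo], p[lo], None, None)
    mid = (lo + hi) // 2
    low = build_gp_tree(g, p, lo, mid)
    high = build_gp_tree(g, p, mid + 1, hi)
    G = high[0] | (high[1] & low[0])      # Equation H.8.5
    P = low[1] & high[1]                  # Equation H.8.6
    return (G, P, low, high)

def distribute_carries(node, lo, hi, c_in, carries):
    """Second pass (as in Figure H.16): record the carry into every bit position."""
    G, P, low, high = node
    if low is None:
        carries[lo] = c_in
        return
    mid = (lo + hi) // 2
    distribute_carries(low, lo, mid, c_in, carries)
    c_mid = low[0] | (low[1] & c_in)      # Equation H.8.4 applied to the low half
    distribute_carries(high, mid + 1, hi, c_mid, carries)

def cla_add(a, b, n, c0=0):
    """Add two n-bit numbers; return (n-bit sum, carry-out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    root = build_gp_tree(g, p, 0, n - 1)
    carries = [0] * n
    distribute_carries(root, 0, n - 1, c0, carries)
    total = 0
    for i in range(n):
        s_i = ((a >> i) ^ (b >> i) ^ carries[i]) & 1
        total |= s_i << i
    c_out = root[0] | (root[1] & c0)      # the extra logic for the carry-out
    return total, c_out
```

The recursion depth is about $\log_2 n$, which mirrors the number of levels a signal traverses in each direction through the tree.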
Figure H.18 Combination of CLA and ripple-carry adder. In the top row, carries ripple within each group of four boxes.

Carry-Skip Adders
A carry-skip adder sits midway between a ripple-carry adder and a carry-lookahead adder, both in terms of speed and cost. (A carry-skip adder is not called a CSA, as that name is reserved for carry-save adders.) The motivation for this adder comes from examining the equations for P and G. For example,

$$P_{03} = p_0 p_1 p_2 p_3$$
$$G_{03} = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$$

Computing P is much simpler than computing G, and a carry-skip adder only computes the P's. Such an adder is illustrated in Figure H.19. Carries begin rippling simultaneously through each block. If any block generates a carry, then the carry-out of a block will be true, even though the carry-in to the block may not be correct yet. If at the start of each add operation the carry-in to each block is 0, then no spurious carry-outs will be generated. Thus, the carry-out of each block can be thought of as if it were the G signal. Once the carry-out from the least-significant block is generated, it not only feeds into the next block, but is also fed through the AND gate with the P signal from that next block. If the carry-out and P signals are both true, then the carry skips the second block and is ready to feed into the third block, and so on. The carry-skip adder is only practical if the carry-in signals can be easily cleared at the start of each operation, for example by precharging in CMOS.

Figure H.19 Carry-skip adder. This is a 20-bit carry-skip adder (n = 20) with each block 4 bits wide (k = 4).

To analyze the speed of a carry-skip adder, let's assume that it takes 1 time unit for a signal to pass through two logic levels. Then it will take k time units for a carry to ripple across a block of size k, and it will take 1 time unit for a carry to skip a block. The longest signal path in the carry-skip adder starts with a carry being generated at the 0th position. If the adder is n bits wide, then it takes k time units to ripple through the first block, n/k − 2 time units to skip blocks, and k more to ripple through the last block. To be specific: if we have a 20-bit adder broken into groups of 4 bits, it will take 4 + (20/4 − 2) + 4 = 11 time units to perform an add.

Some experimentation reveals that there are more efficient ways to divide 20 bits into blocks. For example, consider five blocks with the least-significant 2 bits in the first block, the next 5 bits in the second block, followed by blocks of size 6, 5, and 2. Then the add time is reduced to 9 time units. This illustrates an important general principle. For a carry-skip adder, making the interior blocks larger will speed up the adder. In fact, the same idea of varying the block sizes can sometimes speed up other adder designs as well.
Because of the large amount of rippling, a carry-skip adder is most appropriate for technologies where rippling is fast.
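The carry-skip logic and the uniform-block timing estimate above can be written out as a short sketch. It models only the logical behavior, with each block's G taken as its ripple carry-out for a cleared carry-in, and restates the k + (n/k − 2) + k estimate. The function names are hypothetical, and the timing function assumes equal-width blocks with k dividing n.

```python
def carry_skip_add(a, b, n, k, c0=0):
    """Add two n-bit numbers using blocks of width k; return (n-bit sum, carry-out)."""
    total, carry = 0, c0
    for lo in range(0, n, k):
        width = min(k, n - lo)
        mask = (1 << width) - 1
        x, y = (a >> lo) & mask, (b >> lo) & mask
        total |= ((x + y + carry) & mask) << lo   # sum bits ripple inside the block
        block_g = (x + y) >> width                # carry-out with the carry-in cleared to 0
        block_p = int((x | y) == mask)            # P = AND of all p_i = a_i + b_i
        carry = block_g | (block_p & carry)       # skip path: AND with P, then OR with G
    return total, carry

def uniform_add_time(n, k):
    """Time units for equal blocks: ripple first block, skip the middle, ripple last block."""
    return k + (n // k - 2) + k
```

Under these assumptions `uniform_add_time(20, 4)` evaluates to 11, the figure quoted in the text, and among the equal block widths 2, 4, 5, and 10 for a 20-bit adder the width 4 gives the smallest estimate. The function does not model the variable-size blocks discussed above.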
Carry-Select Adder
A carry-select adder works on the following principle: Two additions are performed in parallel, one assuming the carry-in is 0 and the other assuming the carry-in is 1. When the carry-in is finally known, the correct sum (which has been precomputed) is simply selected. An example of such a design is shown in Figure H.20. An 8-bit adder is divided into two halves, and the carry-out from the lower half is used to select the sum bits from the upper half. If each block is computing its sum using rippling (a linear time algorithm), then the design in Figure H.20 is twice as fast at 50% more cost. However, note that the $c_4$ signal must drive many muxes, which may be very slow in some technologies.

Figure H.20 Simple carry-select adder. At the same time that the sum of the low-order 4 bits is being computed, the high-order bits are being computed twice in parallel: once assuming that $c_4 = 0$ and once assuming $c_4 = 1$.

Instead of dividing the adder into halves, it could be divided into quarters for a still further speedup. This is illustrated in Figure H.21. If it takes k time units for a block to add k-bit numbers, and if it takes 1 time unit to compute the mux input from the two carry-out signals, then for optimal operation each block should be 1 bit wider than the next, as shown in Figure H.21. Therefore, as in the carry-skip adder, the best design involves variable-size blocks.

Figure H.21 Carry-select adder. As soon as the carry-out of the rightmost block is known, it is used to select the other sum bits.

As a summary of this section, the asymptotic time and space requirements for the different adders are given in Figure H.22. (The times for carry-skip and carry-select come from a careful choice of block size. See Exercise H.26 for the carry-skip adder.) These different adders shouldn't be thought of as disjoint choices, but rather as building blocks to be used in constructing an adder. The utility of these different building blocks is highly dependent on the technology used. For example, the carry-select adder works well when a signal can drive many muxes, and the carry-skip adder is attractive in technologies where signals can be cleared at the start of each operation.

Adder          Time          Space
Ripple         O(n)          O(n)
CLA            O(log n)      O(n log n)
Carry-skip     O(√n)         O(n)
Carry-select   O(√n)         O(n)

Figure H.22 Asymptotic time and space requirements for four different types of adders.

Knowing the asymptotic behavior of adders is useful in understanding them, but relying too much on that behavior is a pitfall. The reason is that asymptotic behavior is only important as n grows very large. But n for an adder is the bits of precision, and double precision today is the same as it was 20 years ago, about 53 bits. Although it is true that as computers get faster, computations get longer (and thus have more rounding error, which in turn requires more precision), this effect grows very slowly with time.
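The carry-select principle described earlier in this section (compute each block twice, then let the incoming carry pick a result) can be sketched as follows. The function name and the list-of-widths interface are assumptions for illustration. The widths 4, 4, 5, 6 used in the usage note follow the sum-bit grouping visible in Figure H.21.

```python
def carry_select_add(a, b, block_widths, c0=0):
    """Add two numbers block by block; block_widths lists sizes from least- to most-significant."""
    total, carry, lo = 0, c0, 0
    for width in block_widths:
        mask = (1 << width) - 1
        x, y = (a >> lo) & mask, (b >> lo) & mask
        sum_if_0 = x + y              # precomputed assuming carry-in = 0
        sum_if_1 = x + y + 1          # precomputed assuming carry-in = 1
        chosen = sum_if_1 if carry else sum_if_0   # the mux, driven by the incoming carry
        total |= (chosen & mask) << lo
        carry = chosen >> width       # carry-out of this block selects in the next block
        lo += width
    return total, carry
```

Calling `carry_select_add(a, b, [4, 4])` models the 8-bit design of Figure H.20, and `carry_select_add(a, b, [4, 4, 5, 6])` models a 19-bit adder with growing blocks. In hardware the two candidate sums of every block are formed in parallel, whereas this loop only reproduces the selection logic.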
H.9
Speeding Up Integer Multiplication and Division
The multiplication and division algorithms presented in Section H.2 are fairly slow, producing 1 bit per cycle (although that cycle might be a fraction of the CPU instruction cycle time). In this section we discuss various techniques for higher-performance multiplication and division, including the division algorithm used in the Pentium chip.

Shifting over Zeros
Although the technique of shifting over zeros is not currently used much, it is instructive to consider. It is distinguished by the fact that its execution time is operand dependent. Its lack of use is primarily attributable to its failure to offer enough speedup over bit-at-a-time algorithms. In addition, pipelining, synchronization with the CPU, and good compiler optimization are difficult with algorithms that run in variable time. In multiplication, the idea behind shifting over zeros is to add logic that detects when the low-order bit of the A register is 0 (see Figure H.2(a) on page H-4) and, if so, skips the addition step and proceeds directly to the shift step, hence the term shifting over zeros.

What about shifting for division? In nonrestoring division, an ALU operation (either an addition or subtraction) is performed at every step. There appears to be no opportunity for skipping an operation. But think about division this way: To compute a/b, subtract multiples of b from a, and then report how many subtractions were done. At each stage of the subtraction process the remainder must fit into the P register of Figure H.2(b) (page H-4). In the case when the remainder is a small positive number, you normally subtract b; but suppose instead you only shifted the remainder and subtracted b the next time. As long as the remainder was sufficiently small (its high-order bit 0), after shifting it still would fit into the P register, and no information would be lost. However, this method does require changing the way we keep track of the number of times b has been subtracted from a. This idea usually goes under the name of SRT division, for Sweeney, Robertson, and Tocher, who independently proposed algorithms of this nature. The main extra complication of SRT division is that the quotient bits cannot be determined immediately from the sign of P at each step, as they can be in ordinary nonrestoring division.

More precisely, to divide a by b where a and b are n-bit numbers, load a and b into the A and B registers, respectively, of Figure H.2 (page H-4).

SRT Division
1. If B has k leading zeros when expressed using n bits, shift all the registers left k bits.
2. For i = 0, n − 1,
   (a) If the top three bits of P are equal, set $q_i = 0$ and shift (P,A) one bit left.
   (b) If the top three bits of P are not all equal and P is negative, set $q_i = -1$ (also written as $\bar{1}$), shift (P,A) one bit left, and add B.
   (c) Otherwise set $q_i = 1$, shift (P,A) one bit left, and subtract B.
   End loop
3. If the final remainder is negative, correct the remainder by adding B, and correct the quotient by subtracting 1 from $q_0$. Finally, the remainder must be shifted k bits right, where k is the initial shift.

A numerical example is given in Figure H.23. Although we are discussing integer division, it helps in explaining the algorithm to imagine the binary point just left of the most-significant bit. This changes Figure H.23 from $01000_2 / 0011_2$ to $0.1000_2 / .0011_2$. Since the binary point is changed in both the numerator and denominator, the quotient is not affected. The (P,A) register pair holds the remainder and is a two's complement number. For example, if P contains $11110_2$ and A = 0, then the remainder is $1.1110_2 = -1/8$. If r is the value of the remainder, then −1 ≤ r < 1.

Given these preliminaries, we can now analyze the SRT division algorithm. The first step of the algorithm shifts b so that b ≥ 1/2. The rule for which ALU operation to perform is this: If −1/4 ≤ r < 1/4 (true whenever the top three bits of P are equal), then compute 2r by shifting (P,A) left one bit; else if r < 0 (and hence r < −1/4, since otherwise it would have been eliminated by the first condition), then compute 2r + b by shifting and then adding; else r ≥ 1/4 and subtract b from 2r. Using b ≥ 1/2, it is easy to check that these rules keep −1/2 ≤ r < 1/2. For nonrestoring division, we only have |r| ≤ b, and we need P to be n + 1 bits wide. But for SRT division, the bound on r is tighter, namely, −1/2 ≤ r < 1/2.
Thus, we can save a bit by eliminating the high-order bit of P (and b and the adder). In particular, the test for equality of the top three bits of P becomes a test on just two bits.

P          A
00000      1000      Divide 8 = 1000 by 3 = 0011. B contains 0011.
00010      0000      Step 1: B had two leading 0s, so shift left by 2. B now contains 1100.
00100      0000      Step 2.1: Top three bits are equal. This is case (a), so set q0 = 0 and shift.
01000      0001      Step 2.2: Top three bits not equal and P > 0 is case (c), so set q1 = 1 and shift.
+ 10100              Subtract B.
11100      0001
11000      0010      Step 2.3: Top bits equal is case (a), so set q2 = 0 and shift.
10000      $010\bar{1}$      Step 2.4: Top three bits unequal is case (b), so set q3 = −1 and shift.
+ 01100              Add B.
11100
+ 01100              Step 3: Remainder is negative so restore it and subtract 1 from q.
01000                Must undo the shift in step 1, so right-shift by 2 to get true remainder. Remainder = 10, quotient = $010\bar{1}$ − 1 = 0010.

Figure H.23 SRT division of $1000_2 / 0011_2$. The quotient bits are shown in bold, using the notation $\bar{1}$ for −1.

The algorithm might change slightly in an implementation of SRT division. After each ALU operation, the P register can be shifted as many places as necessary to make either r ≥ 1/4 or r < −1/4. By shifting k places, k quotient bits are set equal to zero all at once. For this reason SRT division is sometimes described as one that keeps the remainder normalized to |r| ≥ 1/4.

Notice that the value of the quotient bit computed in a given step is based on which operation is performed in that step (which in turn depends on the result of the operation from the previous step). This is in contrast to nonrestoring division, where the quotient bit computed in the ith step depends on the result of the operation in the same step. This difference is reflected in the fact that when the final remainder is negative, the last quotient bit must be adjusted in SRT division, but not in nonrestoring division. However, the key fact about the quotient bits in SRT division is that they can include $\bar{1}$. Although Figure H.23 shows the quotient bits being stored in the low-order bits of A, an actual implementation can't do this because you can't fit the three values −1, 0, 1 into one bit. Furthermore, the quotient must be converted to ordinary two's complement in a full adder. A common way to do this is to accumulate the positive quotient bits in one register and the negative quotient bits in another, and then subtract the two registers after all the bits are known. Because there is more than one way to write a number in terms of the digits −1, 0, 1, SRT division is said to use a redundant quotient representation.

The differences between SRT division and ordinary nonrestoring division can be summarized as follows:
1. ALU decision rule: In nonrestoring division, it is determined by the sign of P; in SRT, it is determined by the two most-significant bits of P.
2. Final quotient: In nonrestoring division, it is immediate from the successive signs of P; in SRT, there are three quotient digits (1, 0, $\bar{1}$), and the final quotient must be computed in a full n-bit adder.
3. Speed: SRT division will be faster on operands that produce zero quotient bits.

The simple version of the SRT division algorithm given above does not offer enough of a speedup to be practical in most cases. However, later on in this section we will study variants of SRT division that are quite practical.
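The SRT steps listed earlier in this section can be followed almost literally in software. The sketch below keeps P as a signed integer with an (n + 1)-bit two's complement range and A as an n-bit value, and accumulates the positive and negative quotient digits separately before subtracting them at the end. The function name and the restrictions n ≥ 2 and b ≠ 0 are assumptions of this example.

```python
def srt_divide(a, b, n):
    """Divide unsigned n-bit a by nonzero n-bit b with the simple SRT steps; return (q, r)."""
    assert n >= 2 and 0 <= a < (1 << n) and 0 < b < (1 << n)
    mask = (1 << n) - 1
    k = n - b.bit_length()             # step 1: number of leading zeros of b in n bits
    B = b << k
    P = a >> (n - k)                   # (P, A) holds a shifted left by k bits
    A = (a << k) & mask
    pos = neg = 0                      # positive and negative quotient digits
    for _ in range(n):                 # step 2
        # Top three bits of the (n+1)-bit two's complement pattern of P
        top3 = (P & ((1 << (n + 1)) - 1)) >> (n - 2)
        shifted = 2 * P + (A >> (n - 1))   # shift (P, A) left one bit
        A = (A << 1) & mask
        pos <<= 1
        neg <<= 1
        if top3 in (0b000, 0b111):     # case (a): quotient digit 0
            P = shifted
        elif P < 0:                    # case (b): quotient digit -1, add B
            P = shifted + B
            neg |= 1
        else:                          # case (c): quotient digit 1, subtract B
            P = shifted - B
            pos |= 1
    q = pos - neg                      # convert the redundant quotient to an ordinary integer
    if P < 0:                          # step 3: fix a negative final remainder
        P += B
        q -= 1
    return q, P >> k                   # undo the initial shift of the remainder
```

For the operands of Figure H.23, `srt_divide(8, 3, 4)` passes through the same P values as the figure (2, 4, −4, −8, −4 as signed integers) and returns quotient 2 and remainder 2.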
Speeding Up Multiplication with a Single Adder
As mentioned before, shifting-over-zero techniques are not used much in current hardware. We now discuss some methods that are in widespread use. Methods that increase the speed of multiplication can be divided into two classes: those that use a single adder and those that use multiple adders. Let's first discuss techniques that use a single adder.

In the discussion of addition we noted that, because of carry propagation, it is not practical to perform addition with two levels of logic. Using the cells of Figure H.17, adding two 64-bit numbers will require a trip through seven cells to compute the P's and G's, and seven more to compute the carry bits, which will require at least 28 logic levels. In the simple multiplier of Figure H.2 on page H-4, each multiplication step passes through this adder. The amount of computation in each step can be dramatically reduced by using carry-save adders (CSAs). A carry-save adder is simply a collection of n independent full adders. A multiplier using such an adder is illustrated in Figure H.24. Each circle marked "+" is a single-bit full adder, and each box represents one bit of a register. Each addition operation results in a pair of bits, stored in the sum and carry parts of P. Since each add is independent, only two logic levels are involved in the add, a vast improvement over 28.

Figure H.24 Carry-save multiplier. Each circle represents a (3,2) adder working independently. At each step, the only bit of P that needs to be shifted is the low-order sum bit.

To operate the multiplier in Figure H.24, load the sum and carry bits of P with zero and perform the first ALU operation. (If Booth recoding is used, it might be a subtraction rather than an addition.) Then shift the low-order sum bit of P into A, as well as shifting A itself. The n − 1 high-order bits of P don't need to be shifted because on the next cycle the sum bits are fed into the next lower-order adder. Each addition step is substantially increased in speed, since each add cell is working independently of the others, and no carry is propagated.

There are two drawbacks to carry-save adders. First, they require more hardware because there must be a copy of register P to hold the carry outputs of the adder. Second, after the last step, the high-order word of the result must be fed into an ordinary adder to combine the sum and carry parts. One way to accomplish this is by feeding the output of P into the adder used to perform the addition operation. Multiplying with a carry-save adder is sometimes called redundant multiplication because P is represented using two registers. Since there are many ways to represent P as the sum of two registers, this representation is redundant. The term carry-propagate adder (CPA) is used to denote an adder that is not a CSA. A propagate adder may propagate its carries using ripples, carry-lookahead, or some other method.

Another way to speed up multiplication without using extra adders is to examine k low-order bits of A at each step, rather than just one bit. This is often called higher-radix multiplication. As an example, suppose that k = 2. If the pair of bits is 00, add 0 to P; if it is 01, add B. If it is 10, simply shift b one bit left before adding it to P. Unfortunately, if the pair is 11, it appears we would have to compute b + 2b. But this can be avoided by using a higher-radix version of Booth recoding. Imagine A as a base 4 number: When the digit 3 appears, change it to $\bar{1}$ and add 1 to the next higher digit to compensate. An extra benefit of using this scheme is that just like ordinary Booth recoding, it works for negative as well as positive integers (Section H.2).

The precise rules for radix-4 Booth recoding are given in Figure H.25. At the ith multiply step, the two low-order bits of the A register contain $a_{2i}$ and $a_{2i+1}$. These two bits, together with the bit just shifted out ($a_{2i-1}$), are used to select the multiple of b that must be added to the P register. A numerical example is given in Figure H.26. Another name for this multiplication technique is overlapping triplets, since it looks at 3 bits to determine what multiple of b to use, whereas ordinary Booth recoding looks at 2 bits.

Besides having more complex control logic, overlapping triplets also requires that the P register be 1 bit wider to accommodate the possibility of 2b or −2b being added to it. It is possible to use a radix-8 (or even higher) version of Booth recoding. In that case, however, it would be necessary to use the multiple 3B as a potential summand. Radix-8 multipliers normally compute 3B once and for all at the beginning of a multiplication operation.
Low-order bits of A      Last bit shifted out
2i + 1      2i           2i − 1                    Multiple
0           0            0                         0
0           0            1                         +b
0           1            0                         +b
0           1            1                         +2b
1           0            0                         −2b
1           0            1                         −b
1           1            0                         −b
1           1            1                         0

Figure H.25 Multiples of b to use for radix-4 Booth recoding. For example, if the two low-order bits of the A register are both 1, and the last bit to be shifted out of the A register is 0, then the correct multiple is −b, obtained from the second-to-last row of the table.
P          A       L
00000      1001            Multiply −7 = 1001 times −5 = 1011. B contains 1011.
+ 11011                    Low-order bits of A are 0, 1; L = 0, so add B.
11011      1001
11110      1110    0       Shift right by two bits, shifting in 1s on the left.
+ 01010                    Low-order bits of A are 1, 0; L = 0, so add −2b.
01000      1110    0
00010      0011    1       Shift right by two bits. Product is 35 = 0100011.

Figure H.26 Multiplication of −7 times −5 using radix-4 Booth recoding. The column labeled L contains the last bit shifted out the right end of A.
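The table of Figure H.25 translates directly into a lookup keyed by the triplet $(a_{2i+1}, a_{2i}, a_{2i-1})$. The sketch below models only the arithmetic of radix-4 Booth recoding, accumulating each selected multiple with its weight $4^i$, and does not model the P, A, and L registers or their shifts. The names `BOOTH4` and `booth_radix4_multiply` are hypothetical, and n is assumed to be even.

```python
# Multiple of b selected by (a_{2i+1}, a_{2i}, a_{2i-1}), as in Figure H.25.
BOOTH4 = {
    (0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_radix4_multiply(a, b, n):
    """Multiply signed a and b, where a fits in n bits of two's complement (n even)."""
    assert n % 2 == 0 and -(1 << (n - 1)) <= a < (1 << (n - 1))
    bits = a & ((1 << n) - 1)          # two's complement bit pattern of a
    product = 0
    last = 0                           # a_{-1}, the bit "shifted out" before the first step
    for i in range(n // 2):
        low = (bits >> (2 * i)) & 1
        high = (bits >> (2 * i + 1)) & 1
        product += BOOTH4[(high, low, last)] * b * 4 ** i
        last = high
    return product
```

For the operands of Figure H.26, the first step selects +b and the second selects −2b, so `booth_radix4_multiply(-7, -5, 4)` computes (−5) + 4·10 = 35.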
Faster Multiplication with Many Adders
If the space for many adders is available, then multiplication speed can be improved. Figure H.27 shows a simple array multiplier for multiplying two 5-bit numbers, using three CSAs and one propagate adder. Part (a) is a block diagram of the kind we will use throughout this section. Parts (b) and (c) show the adder in more detail. All the inputs to the adder are shown in (b); the actual adders with their interconnections are shown in (c). Each row of adders in (c) corresponds to a box in (a). The picture is "twisted" so that bits of the same significance are in the same column. In an actual implementation, the array would most likely be laid out as a square instead.

Figure H.27 An array multiplier. The 5-bit number in A is multiplied by $b_4 b_3 b_2 b_1 b_0$. Part (a) shows the block diagram, (b) shows the inputs to the array, and (c) expands the array to show all the adders.

The array multiplier in Figure H.27 performs the same number of additions as the design in Figure H.24, so its latency is not dramatically different from that of a single carry-save adder. However, with the hardware in Figure H.27, multiplication can be pipelined, increasing the total throughput. On the other hand, although this level of pipelining is sometimes used in array processors, it is not used in any of the single-chip, floating-point accelerators discussed in Section H.10. Pipelining is discussed in general in Appendix A and by Kogge [1981] in the context of multipliers.

Sometimes the space budgeted on a chip for arithmetic may not hold an array large enough to multiply two double-precision numbers. In this case, a popular design is to use a two-pass arrangement such as the one shown in Figure H.28. The first pass through the array "retires" 5 bits of B. Then the result of this first pass is fed back into the top to be combined with the next three summands. The result of this second pass is then fed into a CPA. This design, however, loses the ability to be pipelined.
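The structure of the one-pass array multiplier of Figure H.27 (a chain of CSAs followed by a single propagate adder) can be imitated on whole words. In this sketch the function `csa` acts as n independent (3,2) adders, and ordinary integer addition stands in for the carry-propagate adder. The function names are hypothetical, the operands are taken to be unsigned, and at least three summands are assumed.

```python
def csa(x, y, z):
    """Word-wide (3,2) carry-save adder: returns (sum, carry) with x + y + z == sum + carry."""
    s = x ^ y ^ z                                # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, weighted one place higher
    return s, c

def array_multiply(a, b, n):
    """Multiply unsigned n-bit a and b (n >= 3); return (product, number of CSA rows used)."""
    # Summands b_i * A, each shifted to its own significance.
    summands = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    s, c = csa(summands[0], summands[1], summands[2])   # first CSA row
    rows = 1
    for t in summands[3:]:                       # each later row folds in one more summand
        s, c = csa(s, c, t)
        rows += 1
    return s + c, rows                           # final carry-propagate addition
```

For 5-bit operands the function uses three CSA rows, matching the three CSAs and one propagate adder of Figure H.27.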
Figure H.28 Multipass array multiplier. Multiplies two 8-bit numbers with about half the hardware that would be used in a one-pass design like that of Figure H.27. At the end of the second pass, the bits flow into the CPA. The inputs used in the first pass are marked in bold.
If arrays require as many addition steps as the much cheaper arrangements in Figures H.2 and H.24, why are they so popular? First of all, using an array has a smaller latency than using a single adder—because the array is a combinational circuit, the signals flow through it directly without being clocked. Although the two-pass adder of Figure H.28 would normally still use a clock, the cycle time for passing through k arrays can be less than k times the clock that would be needed for designs like the ones in Figures H.2 or H.24. Second, the array is amenable to various schemes for further speedup. One of them is shown in Figure H.29. The idea of this design is that two adds proceed in parallel or, to put it another way, each stream passes through only half the adders. Thus, it runs at almost twice the speed of the multiplier in Figure H.27. This even/odd multiplier is popular in VLSI because of its regular structure. Arrays can also be speeded up using asynchronous logic. One of the reasons why the multiplier of Figure H.2 (page H-4) needs a clock is to keep the output of the adder from feeding back into the input of the adder before the output has fully stabilized. Thus, if the array in Figure H.28 is long enough so that no signal can propagate from the top through the bottom in the time it takes for the first adder to stabilize, it may be possible to avoid clocks altogether. Williams et al. [1987] discuss a design using this idea, although it is for dividers instead of multipliers. The techniques of the previous paragraph still have a multiply time of O(n), but the time can be reduced to log n using a tree. The simplest tree would combine pairs of summands b0A ⋅⋅⋅ bn–1A, cutting the number of summands from n to n/2. Then these n/2 numbers would be added in pairs again, reducing to n/4, and so on, and resulting in a single sum after log n steps. 
Figure H.29 Even/odd array. The first two adders work in parallel. Their results are fed into the third and fourth adders, which also work in parallel, and so on.
However, this simple binary tree idea doesn’t map into full (3,2) adders, which reduce three inputs to two rather than reducing two inputs to one. A tree that does use full adders, known as a Wallace tree, is shown in Figure H.30. When computer arithmetic units were built out of MSI parts, a Wallace tree was the design of choice for high-speed multipliers. There is, however, a problem with implementing it in VLSI. If you try to fill in all the adders and paths for the Wallace tree of Figure H.30, you will discover that it does not have the nice, regular structure of Figure H.27. This is why VLSI designers have often chosen to use other log n designs such as the binary tree multiplier, which is discussed next.
The problem with adding summands in a binary tree is coming up with a (2,1) adder that combines two digits and produces a single-sum digit. Because of carries, this isn’t possible using binary notation, but it can be done with some other representation. We will use the signed-digit representation 1, 1̄, and 0, which we used previously to understand Booth’s algorithm. This representation has two costs. First, it takes 2 bits to represent each signed digit. Second, the algorithm for adding two signed-digit numbers aᵢ and bᵢ is complex and requires examining aᵢaᵢ₋₁aᵢ₋₂ and bᵢbᵢ₋₁bᵢ₋₂. Although this means you must look 2 bits back, in binary addition you might have to look an arbitrary number of bits back because of carries.
We can describe the algorithm for adding two signed-digit numbers as follows. First, compute sum and carry bits sᵢ and cᵢ₊₁ using Figure H.31. Then compute the final sum as sᵢ + cᵢ. The tables are set up so that this final sum does not generate a carry.
Figure H.30 Wallace tree multiplier. An example of a multiply tree that computes a product in O(log n) steps.
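A minimal Python sketch of the reduction pattern in Figure H.30 may help. It models each carry-save adder as a (3,2) compressor acting on whole integers and keeps grouping the summands in threes until only two remain. The names `csa` and `wallace_multiply`, and the order in which summands are grouped, are assumptions made for illustration and are not the wiring of the figure.

```python
def csa(x, y, z):
    """(3,2) carry-save adder on nonnegative integers: x + y + z == s + c."""
    s = x ^ y ^ z                               # per-bit sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carries, moved up one place
    return s, c


def wallace_multiply(a, b, n=8):
    """Multiply unsigned n-bit a and b; return (product, number of CSA levels)."""
    # Summand i is b_i * A shifted into position, as in the array multipliers.
    summands = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(summands) > 2:
        full = len(summands) // 3 * 3           # summands that fill whole CSAs
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(csa(*summands[i:i + 3]))
        reduced.extend(summands[full:])         # leftovers pass to the next level
        summands = reduced
        levels += 1
    return sum(summands), levels                # final carry-propagate add


product, levels = wallace_multiply(13, 11)
print(product, levels)   # 143 4
```

With eight summands the loop runs four times (8, 6, 4, 3, then 2 values), whereas a linear array passes through one adder per summand.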
   1       1       1̄       0       1 x                      1̄ x
 + 1     + 1̄     + 1̄     + 0     + 0 y                    + 0 y
  10      00      1̄0      00      11̄ if x ≥ 0 and y ≥ 0    01̄ if x ≥ 0 and y ≥ 0
                                  01 otherwise             1̄1 otherwise
Figure H.31 Signed-digit addition table. The leftmost sum shows that when computing 1 + 1, the sum bit is 0 and the carry bit is 1.
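The table can be turned into a short Python sketch. Digits are stored as the integers −1, 0, and 1, most-significant first. The function names `sd_value` and `sd_add` and the list representation are assumptions made for illustration, and the lowest position is treated as if the digits below it were 0.

```python
def sd_value(digits):
    """Value of a signed-digit number given most-significant digit first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v


def sd_add(a, b):
    """Add two equal-length signed-digit numbers (digits -1, 0, 1; MSB first)."""
    a, b = a[::-1], b[::-1]                 # work least-significant digit first
    n = len(a)
    s = [0] * n                             # interim sum digits s_i
    c = [0] * (n + 1)                       # carries c_{i+1}; c_0 = 0
    for i in range(n):
        t = a[i] + b[i]
        # "x >= 0 and y >= 0" in the table: both next-lower digits nonnegative.
        lower_nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        if t == 2:
            c[i + 1], s[i] = 1, 0
        elif t == -2:
            c[i + 1], s[i] = -1, 0
        elif t == 0:
            c[i + 1], s[i] = 0, 0
        elif t == 1:
            c[i + 1], s[i] = (1, -1) if lower_nonneg else (0, 1)
        else:  # t == -1
            c[i + 1], s[i] = (0, -1) if lower_nonneg else (-1, 1)
    # s_i + c_i never leaves {-1, 0, 1}, so no carry ripples at this point.
    result = [s[i] + c[i] for i in range(n)] + [c[n]]
    return result[::-1]


print(sd_add([1, -1, 0], [0, 0, 1]))   # [0, 1, 0, -1], which has value 3
```

Each output digit depends only on the digits at its own position and the two positions below it, which is what lets all positions be computed at once.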
Example What is the sum of the signed-digit numbers 11̄0₂ and 001₂?
Answer The two low-order bits sum to 0 + 1 = 11̄, the next pair sums to 1̄ + 0 = 01̄, and the high-order pair sums to 1 + 0 = 01, so the sum is 11̄ + 01̄0 + 0100 = 101̄₂.
This, then, defines a (2,1) adder. With this in hand, we can use a straightforward binary tree to perform multiplication. In the first step it adds b₀A + b₁A in parallel with b₂A + b₃A, . . . , bₙ₋₂A + bₙ₋₁A. The next step adds the results of these sums in pairs, and so on. Although the final sum must be run through a carry-propagate adder to convert it from signed-digit form to two’s complement, this final add step is necessary in any multiplier using CSAs.
To summarize, both Wallace trees and signed-digit trees are log n multipliers. The Wallace tree uses fewer gates but is harder to lay out. The signed-digit tree has a more regular structure, but requires 2 bits to represent each digit and has more complicated add logic. As with adders, it is possible to combine different multiply techniques. For example, Booth recoding and arrays can be combined.
In Figure H.27 instead of having each input be biA, we could have it be bibi–1A. To avoid having to compute the multiple 3b, we can use Booth recoding.
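As a sketch of the recoding step only, the Python function below rewrites an unsigned multiplier as radix-4 digits drawn from −2, −1, 0, 1, and 2, so that each array input needs only 0, ±A, or ±2A and never a multiple of 3. The names `booth_radix4_digits` and `booth_multiply`, the least-significant-first digit order, and the extra top digit for unsigned operands are assumptions of this example.

```python
def booth_radix4_digits(b, n=8):
    """Recode unsigned n-bit b (n even) into radix-4 digits in {-2..2}, LSD first."""
    digits = []
    prev = 0                                # bit just below the current pair
    for i in range(0, n + 2, 2):            # one extra pair covers the unsigned top bit
        lo = (b >> i) & 1
        hi = (b >> (i + 1)) & 1
        digits.append(lo + prev - 2 * hi)
        prev = hi
    return digits


def booth_multiply(a, b, n=8):
    """Sum the recoded multiples of a; only 0, +-a, and +-2a are ever needed."""
    return sum(d * a * 4**i for i, d in enumerate(booth_radix4_digits(b, n)))


print(booth_radix4_digits(3))   # [-1, 1, 0, 0, 0]
print(booth_multiply(13, 11))   # 143
```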
Faster Division with One Adder
The two techniques we discussed for speeding up multiplication with a single adder were carry-save adders and higher-radix multiplication. However, there is a difficulty when trying to utilize these approaches to speed up nonrestoring division. If the adder in Figure H.2(b) on page H-4 is replaced with a carry-save adder, then P will be replaced with two registers, one for the sum bits and one for the carry bits (compare with the multiplier in Figure H.24). At the end of each cycle, the sign of P is uncertain (since P is the unevaluated sum of the two registers), yet it is the sign of P that is used to compute the quotient digit and decide the next ALU operation.
When a higher radix is used, the problem is deciding what value to subtract from P. In the paper-and-pencil method, you have to guess the quotient digit. In binary division there are only two possibilities. We were able to finesse the problem by initially guessing one and then adjusting the guess based on the sign of P. This doesn’t work in higher radices because there are more than two possible quotient digits, rendering quotient selection potentially quite complicated: You would have to compute all the multiples of b and compare them to P.
Both the carry-save technique and higher-radix division can be made to work if we use a redundant quotient representation. Recall from our discussion of SRT division (page H-46) that by allowing the quotient digits to be −1, 0, or 1, there is often a choice of which one to pick. The idea in the previous algorithm was to choose 0 whenever possible, because that meant an ALU operation could be skipped. In carry-save division, the idea is that, because the remainder (which is the value of the (P,A) register pair) is not known exactly (being stored in carry-save form), the exact quotient digit is also not known. But thanks to the redundant representation, the remainder doesn’t have to be known precisely in order to pick a quotient digit.
Figure H.32 Quotient selection for radix-2 division. The x-axis represents the ith remainder, which is the quantity in the (P,A) register pair. The y-axis shows the value of the remainder after one additional divide step. Each bar on the right-hand graph gives the range of rᵢ values for which it is permissible to select the associated value of qᵢ.
This is illustrated in Figure H.32, where the x-axis represents rᵢ, the remainder after i steps. The line labeled qᵢ = 1 shows the value that rᵢ₊₁ would be if we chose qᵢ = 1, and similarly for the lines qᵢ = 0 and qᵢ = −1. We can choose any value for qᵢ, as long as rᵢ₊₁ = 2rᵢ – qᵢb satisfies |rᵢ₊₁| ≤ b. The allowable ranges are shown in the right half of Figure H.32. This shows that you don’t need to know the precise value of rᵢ in order to choose a quotient digit qᵢ. You only need to know that r lies in an interval small enough to fit entirely within one of the overlapping bars shown in the right half of Figure H.32.
This is the basis for using carry-save adders. Look at the high-order bits of the carry-save adder and sum them in a propagate adder. Then use this approximation of r (together with the divisor, b) to compute qᵢ, usually by means of a lookup table. The same technique works for higher-radix division (whether or not a carry-save adder is used). The high-order bits of P can be used to index a table that gives one of the allowable quotient digits.
The design challenge when building a high-speed SRT divider is figuring out how many bits of P and B need to be examined. For example, suppose that we take a radix of 4, use quotient digits of −2, −1, 0, 1, 2, but have a propagate adder. How many bits of P and B need to be examined? Deciding this involves two steps. For ordinary radix-2 nonrestoring division, because at each stage |r| ≤ b, the P buffer won’t overflow. But for radix 4, rᵢ₊₁ = 4rᵢ – qᵢb is computed at each stage, and if rᵢ is near b, then 4rᵢ will be near 4b, and even the largest quotient digit will not bring r back to the range |rᵢ₊₁| ≤ b. In other words, the remainder might grow without bound. However, restricting |rᵢ| ≤ 2b/3 makes it easy to check that rᵢ will stay bounded. After figuring out the bound that rᵢ must satisfy, we can draw the diagram in Figure H.33, which is analogous to Figure H.32. For example, the diagram shows that if rᵢ is between (1/12)b and (5/12)b, we can pick q = 1, and so on.
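The bound of 2b/3 and the selection intervals both follow from a short calculation, sketched here under the stated assumption that the quotient digits are limited to −2 through 2.

```latex
% Largest bound c that the biggest digit can still restore:
%   worst case r_i = c with q_i = 2 gives r_{i+1} = 4c - 2b, so we need 4c - 2b <= c.
\[
  4c - 2b \le c \iff c \le \tfrac{2}{3}\,b .
\]
% With that bound, q_i is an acceptable digit exactly when the next remainder obeys it:
\begin{align*}
  |r_{i+1}| = |4r_i - q_i b| \le \tfrac{2}{3}\,b
    &\iff q_i b - \tfrac{2}{3}\,b \le 4r_i \le q_i b + \tfrac{2}{3}\,b \\
    &\iff \frac{3q_i - 2}{12} \le \frac{r_i}{b} \le \frac{3q_i + 2}{12}.
\end{align*}
% q_i = 1 gives 1/12 <= r_i/b <= 5/12, and q_i = 2 gives 1/3 <= r_i/b <= 2/3.
```

The five intervals for q = −2, …, 2 overlap by 1/12 and together cover −2/3 ≤ r/b ≤ 2/3, so some digit is always available and the choice near each boundary is not unique.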
Or to put it another way, if r/b is between 1/12 and 5/12, we can pick q = 1. Suppose the divider examines 5 bits of P (including the sign bit) and 4 bits of b (ignoring the sign, since it is always nonnegative). The interesting case is when the high bits of P are 00011xxx⋅⋅⋅, while the high bits of b are 1001xxx⋅⋅⋅. Imagine the binary point at the left end of each register. Since we truncated, r (the value of P concatenated with A) could have a value from 0.0011₂ to 0.0100₂, and b could have a value from .1001₂ to .1010₂. Thus r/b could be as small as 0.0011₂/.1010₂ or as large as 0.0100₂/.1001₂. But 0.0011₂/.1010₂ = 3/10 < 1/3 would require a quotient digit of 1, while 0.0100₂/.1001₂ = 4/9 > 5/12 would require a quotient digit of 2. In other words, 5 bits of P and 4 bits of b aren’t enough to pick a quotient digit. It turns out that 6 bits of P and 4 bits of b are enough. This can be verified by writing a simple program that checks all the cases. The output of such a program is shown in Figure H.34.
Figure H.33 Quotient selection for radix-4 division with quotient digits –2, –1, 0, 1, 2.
 b    Range of P     q        b    Range of P     q
 8    −12 to −7     −2       12    −18 to −10    −2
 8     −6 to −3     −1       12    −10 to −4     −1
 8     −2 to 1       0       12     −4 to 3       0
 8      2 to 5       1       12      3 to 9       1
 8      6 to 11      2       12      9 to 17      2
 9    −14 to −8     −2       13    −19 to −11    −2
 9     −7 to −3     −1       13    −10 to −4     −1
 9     −3 to 2       0       13     −4 to 3       0
 9      2 to 6       1       13      3 to 9       1
 9      7 to 13      2       13     10 to 18      2
10    −15 to −9     −2       14    −20 to −11    −2
10     −8 to −3     −1       14    −11 to −4     −1
10     −3 to 2       0       14     −4 to 3       0
10      2 to 7       1       14      3 to 10      1
10      8 to 14      2       14     10 to 19      2
11    −16 to −9     −2       15    −22 to −12    −2
11     −9 to −3     −1       15    −12 to −4     −1
11     −3 to 2       0       15     −5 to 4       0
11      2 to 8       1       15      3 to 11      1
11      8 to 15      2       15     11 to 21      2
Figure H.34 Quotient digits for radix-4 SRT division with a propagate adder. The top row says that if the high-order 4 bits of b are 1000₂ = 8, and if the top 6 bits of P are between 110100₂ = −12 and 111001₂ = −7, then −2 is a valid quotient digit.
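One way such a checking program might look is sketched below in Python. It is not the program that produced Figure H.34: the names `valid_digits` and `enough_bits` are hypothetical, exact fractions stand in for the register contents, and each truncation cell is tested with its end points included, which is slightly conservative. A digit q counts as safe for a cell when (3q − 2)/12 ≤ r/b ≤ (3q + 2)/12 holds for every remainder in the cell that also satisfies |r| ≤ 2b/3.

```python
from fractions import Fraction as F


def valid_digits(P, B, p_bits=6, b_bits=4):
    """Quotient digits that are safe for every (r, b) sharing these truncated bits.

    P is the signed value of the top p_bits of the remainder register and B the
    top b_bits of the normalized divisor, binary point at the left of each.
    Returns None when no point in the cell satisfies |r| <= 2b/3.
    """
    r_lo, r_hi = F(P, 2 ** (p_bits - 1)), F(P + 1, 2 ** (p_bits - 1))
    b_lo, b_hi = F(B, 2 ** b_bits), F(B + 1, 2 ** b_bits)
    ratios = [r / b for r in (r_lo, r_hi) for b in (b_lo, b_hi)]
    lo, hi = min(ratios), max(ratios)       # r/b takes its extremes at corners
    if lo > F(2, 3) or hi < F(-2, 3):
        return None                         # the remainder can never be here
    digits = []
    for q in range(-2, 3):
        # The outer limits of q = -2 and q = 2 coincide with |r| <= 2b/3,
        # which the remainder already obeys, so they need no separate test.
        low_ok = q == -2 or lo >= F(3 * q - 2, 12)
        high_ok = q == 2 or hi <= F(3 * q + 2, 12)
        if low_ok and high_ok:
            digits.append(q)
    return digits


def enough_bits(p_bits, b_bits):
    """True if every reachable truncated (P, b) pair has at least one safe digit."""
    for B in range(2 ** (b_bits - 1), 2 ** b_bits):          # normalized divisors
        for P in range(-(2 ** (p_bits - 1)), 2 ** (p_bits - 1)):
            if valid_digits(P, B, p_bits, b_bits) == []:
                return False
    return True


print(enough_bits(5, 4), enough_bits(6, 4))   # False True
print(valid_digits(3, 9, p_bits=5))           # []
```

The empty list for P = 00011 and b = 1001 with 5 bits of P is the failing case worked out in the text.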
P             A
000000000     10010101    Divide 149 by 5. B contains 00000101.
000010010     10100000    Step 1: B had 5 leading 0s, so shift left by 5. B now contains 10100000, so use b = 10 section of table.
001001010     1000000     Step 2.1: Top 6 bits of P are 2, so shift left by 2. From table, can pick q to be 0 or 1. Choose q0 = 0.
100101010     000002      Step 2.2: Top 6 bits of P are 9, so shift left 2. q1 = 2.
+ 011000000                         Subtract 2b.
111101010     000002
110101000     00020       Step 2.3: Top bits = –3, so shift left 2. Can pick 0 or –1 for q, pick q2 = 0.
010100000     0202̄        Step 2.4: Top bits = –11, so shift left 2. q3 = –2.
+ 101000000                         Add 2b.
111100000
+ 010100000               Step 3: Remainder is negative, so restore by adding b and subtract 1 from q.
010000000
                          Answer: q = 0202̄ – 1 = 29. To get remainder, undo shift in step 1 so remainder = 010000000 >> 5 = 4.
Figure H.35 Example of radix-4 SRT division. Division of 149 by 5.
Example Using 8-bit registers, compute 149/5 using radix-4 SRT division.
Answer Follow the SRT algorithm on page H-46, but replace the quotient selection rule in step 2 with one that uses Figure H.34. See Figure H.35.
The Pentium uses a radix-4 SRT division algorithm like the one just presented, except that it uses a carry-save adder. Exercises H.34(c) and H.35 explore this in detail. Although these are simple cases, all SRT analyses proceed in the same way. First compute the range of rᵢ, then plot rᵢ against rᵢ₊₁ to find the quotient ranges, and finally write a program to compute how many bits are necessary. (It is sometimes also possible to compute the required number of bits analytically.) Various details need to be considered in building a practical SRT divider. For example, the quotient lookup table has a fairly regular structure, which means it is usually cheaper to encode it as a PLA rather than in ROM. For more details about SRT division, see Burgess and Williams [1995].
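The steps of the 149/5 example can be traced with a small Python simulation. This is a sketch under stated assumptions, not the Pentium's implementation: the (P,A) pair is modeled as a single integer, `select_digit` and `srt4_divide` are hypothetical names, the digit is computed from the truncated bits by rule instead of being read from a stored table, and ties are resolved in favor of the smaller digit. The sketch also rejects the one situation where the initial remainder would exceed 2b/3 (for 8-bit operands, b = 1 with a above 170).

```python
def select_digit(T, Bt):
    """Pick a quotient digit from 6 bits of P (T) and 4 bits of b (Bt)."""
    corners = [(t, b) for t in (T, T + 1) for b in (Bt, Bt + 1)]
    for q in (0, 1, -1, 2, -2):             # prefer small digits when there is a choice
        # r/b = t/(2*b) here, so (3q-2)/12 <= r/b <= (3q+2)/12 becomes:
        low_ok = q == -2 or all(6 * t >= (3 * q - 2) * b for t, b in corners)
        high_ok = q == 2 or all(6 * t <= (3 * q + 2) * b for t, b in corners)
        if low_ok and high_ok:
            return q
    raise ValueError("no safe digit; remainder bound was violated")


def srt4_divide(a, b, n=8):
    """Unsigned radix-4 SRT division for even n; returns (quotient, remainder, digits)."""
    if not (0 <= a < 2 ** n and 0 < b < 2 ** n):
        raise ValueError("operands must fit in n bits and b must be nonzero")
    if 3 * a > 2 * b * 2 ** n:
        raise ValueError("initial remainder exceeds 2b/3; not handled in this sketch")
    k = n - b.bit_length()                  # step 1: normalize the divisor
    bn = b << k
    R = a << k                              # the (P,A) pair as one integer
    digits = []
    for _ in range(n // 2):                 # step 2: one radix-4 digit per pass
        T = R >> (2 * n - 5)                # signed top 6 bits of the (n+1)-bit P
        q = select_digit(T, bn >> (n - 4))
        R = 4 * R - q * (bn << n)           # shift left 2, then subtract q*b from P
        digits.append(q)
    quotient = 0
    for q in digits:
        quotient = 4 * quotient + q
    P = R >> n                              # the low half is all zeros by now
    if P < 0:                               # step 3: restore a negative remainder
        P += bn
        quotient -= 1
    return quotient, P >> k, digits


print(srt4_divide(149, 5))   # (29, 4, [0, 2, 0, -2])
```

The digits 0, 2, 0, −2 match the choices made in Figure H.35, and the final correction turns the quotient 30 with remainder −1 into 29 with remainder 4.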
H.10 Putting It All Together
In this section, we will compare the Weitek 3364, the MIPS R3010, and the Texas Instruments 8847 (see Figures H.36 and H.37). In many ways, these are ideal chips to compare. They each implement the IEEE standard for addition, subtraction, multiplication, and division on a single chip. All were introduced in 1988 and run with a cycle time of about 40 nanoseconds. However, as we will see, they use quite different algorithms.
Features                 MIPS R3010    Weitek 3364    TI 8847
Clock cycle time (ns)    40            50             30
Size (mil²)              114,857       147,600        156,180
Transistors              75,000        165,000        180,000
Pins                     84            168            207
Power (watts)            3.5           1.5            1.5
Cycles/add               2             2              2
Cycles/mult              5             2              3
Cycles/divide            19            17             11
Cycles/square root       –             30             14
Figure H.36 Summary of the three floating-point chips discussed in this section. The cycle times are for production parts available in June 1989. The cycle counts are for double-precision operations.
The Weitek chip is well described in Birman et al.
[1990], the MIPS chip is described in less detail in Rowen, Johnson, and Ries [1988], and details of the TI chip can be found in Darley et al. [1989]. These three chips have a number of things in common. They perform addition and multiplication in parallel, and they implement neither extended precision nor a remainder step operation. (Recall from Section H.6 that it is easy to implement the IEEE remainder function in software if a remainder step instruction is available.) The designers of these chips probably decided not to provide extended precision because the most influential users are those who run portable codes, which can’t rely on extended precision. However, as we have seen, extended precision can make for faster and simpler math libraries. In the summary of the three chips given in Figure H.36, note that a higher transistor count generally leads to smaller cycle counts. Comparing the cycles/op numbers needs to be done carefully, because the figures for the MIPS chip are those for a complete system (R3000/3010 pair), while the Weitek and TI numbers are for stand-alone chips and are usually larger when used in a complete system. The MIPS chip has the fewest transistors of the three. This is reflected in the fact that it is the only chip of the three that does not have any pipelining or hardware square root. Further, the multiplication and addition operations are not completely independent because they share the carry-propagate adder that performs the final rounding (as well as the rounding logic). Addition on the R3010 uses a mixture of ripple, CLA, and carry-select. A carry-select adder is used in the fashion of Figure H.20 (page H-44). Within each half, carries are propagated using a hybrid ripple-CLA scheme of the type indicated in Figure H.18 (page H-42). However, this is further tuned by varying the size of each block, rather than having each fixed at 4 bits (as they are in Figure H.18). 
Figure H.37 Chip layout for the TI 8847, MIPS R3010, and Weitek 3364. In the left-hand columns are the photomicrographs; the right-hand columns show the corresponding floor plans.
The multiplier is midway between the designs of Figures H.2 (page H-4) and H.27 (page H-51). It has an array just large enough so that output can be fed back into the input without having to be clocked. Also, it uses radix-4 Booth
recoding and the even/odd technique of Figure H.29 (page H-53). The R3010 can do a divide and multiply in parallel (like the Weitek chip but unlike the TI chip). The divider is a radix-4 SRT method with quotient digits −2, −1, 0, 1, and 2, and is similar to that described in Taylor [1985]. Double-precision division is about four times slower than multiplication. The R3010 shows that for chips using an O(n) multiplier, an SRT divider can operate fast enough to keep a reasonable ratio between multiply and divide. The Weitek 3364 has independent add, multiply, and divide units. It also uses radix-4 SRT division. However, the add and multiply operations on the Weitek chip are pipelined. The three addition stages are (1) exponent compare, (2) add followed by shift (or vice versa), and (3) final rounding. Stages (1) and (3) take only a half-cycle, allowing the whole operation to be done in two cycles, even though there are three pipeline stages. The multiplier uses an array of the style of Figure H.28 but uses radix-8 Booth recoding, which means it must compute 3 times the multiplier. The three multiplier pipeline stages are (1) compute 3b, (2) pass through array, and (3) final carry-propagation add and round. Single precision passes through the array once, double precision twice. Like addition, the latency is two cycles. The Weitek chip uses an interesting addition algorithm. It is a variant on the carry-skip adder pictured in Figure H.19 (page H-43). However, Pij , which is the
logical AND of many terms, is computed by rippling, performing one AND per ripple. Thus, while the carries propagate left within a block, the value of Pij is propagating right within the next block, and the block sizes are chosen so that both waves complete at the same time. Unlike the MIPS chip, the 3364 has hardware square root, which shares the divide hardware. The ratio of double-precision multiply to divide is 2:17. The large disparity between multiply and divide is due to the fact that multiplication uses radix-8 Booth recoding, while division uses a radix-4 method. In the MIPS R3010, multiplication and division use the same radix.
The notable feature of the TI 8847 is that it does division by iteration (using the Goldschmidt algorithm discussed in Section H.6). This improves the speed of division (the ratio of multiply to divide is 3:11), but means that multiplication and division cannot be done in parallel as on the other two chips. Addition has a two-stage pipeline. Exponent compare, fraction shift, and fraction addition are done in the first stage, normalization and rounding in the second stage. Multiplication uses a binary tree of signed-digit adders and has a three-stage pipeline. The first stage passes through the array, retiring half the bits; the second stage passes through the array a second time; and the third stage converts from signed-digit form to two’s complement. Since there is only one array, a new multiply operation can only be initiated in every other cycle. However, by slowing down the clock, two passes through the array can be made in a single cycle. In this case, a new multiplication can be initiated in each cycle. The 8847 adder uses a carry-select algorithm rather than carry-lookahead. As mentioned in Section H.6, the TI carries 60 bits of precision in order to do correctly rounded division.
These three chips illustrate the different trade-offs made by designers with similar constraints.
One of the most interesting things about these chips is the diversity of their algorithms. Each uses a different add algorithm, as well as a different multiply algorithm. In fact, Booth recoding is the only technique that is universally used by all the chips.
H.11 Fallacies and Pitfalls
Fallacy Underflows rarely occur in actual floating-point application code.
Although most codes rarely underflow, there are actual codes that underflow frequently. SDRWAVE [Kahaner 1988], which solves a one-dimensional wave equation, is one such example. This program underflows quite frequently, even when functioning properly. Measurements on one machine show that adding hardware support for gradual underflow would cause SDRWAVE to run about 50% faster.
Fallacy Conversions between integer and floating point are rare.
In fact, in spice they are as frequent as divides. The assumption that conversions are rare led to a mistake in the SPARC version 8 instruction set, which does not provide an instruction to move from integer registers to floating-point registers.
Pitfall Don’t increase the speed of a floating-point unit without increasing its memory bandwidth.
A typical use of a floating-point unit is to add two vectors to produce a third vector. If these vectors consist of double-precision numbers, then each floating-point add will use three operands of 64 bits each, or 24 bytes of memory. The memory bandwidth requirements are even greater if the floating-point unit can perform addition and multiplication in parallel (as most do).
Pitfall −x is not the same as 0 − x.
This is a fine point in the IEEE standard that has tripped up some designers. Because floating-point numbers use the sign magnitude system, there are two zeros, +0 and −0. The standard says that 0 − 0 = +0, whereas −(0) = −0. Thus −x is not the same as 0 − x when x = 0.
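The difference is easy to observe from software. The Python lines below are a small illustration, assuming the interpreter's floats are IEEE doubles in the default round-to-nearest mode (as they are in CPython on common hardware); `math.copysign` is used only to expose the sign of each zero.

```python
import math

x = 0.0
negated = -x           # sign flip: -0.0
subtracted = 0.0 - x   # IEEE subtraction: +0.0

print(negated, subtracted)               # -0.0 0.0
print(negated == subtracted)             # True: the two zeros compare equal
print(math.copysign(1.0, negated))       # -1.0
print(math.copysign(1.0, subtracted))    # 1.0
```

Because the two results compare equal, the distinction only shows up in operations that look at the sign, such as `copysign` or division by zero in languages that permit it.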
H.12 Historical Perspective and References
The earliest computers used fixed point rather than floating point. In “Preliminary Discussion of the Logical Design of an Electronic Computing Instrument,” Burks, Goldstine, and von Neumann [1946] put it like this:
    There appear to be two major purposes in a “floating” decimal point system both of which arise from the fact that the number of digits in a word is a constant fixed by design considerations for each particular machine. The first of these purposes is to retain in a sum or product as many significant digits as possible and the second of these is to free the human operator from the burden of estimating and inserting into a problem “scale factors” – multiplicative constants which serve to keep numbers within the limits of the machine. There is, of course, no denying the fact that human time is consumed in arranging for the introduction of suitable scale factors. We only argue that the time so consumed is a very small percentage of the total time we will spend in preparing an interesting problem for our machine. The first advantage of the floating point is, we feel, somewhat illusory. In order to have such a floating point, one must waste memory capacity which could otherwise be used for carrying more digits per word. It would therefore seem to us not at all clear whether the modest advantages of a floating binary point offset the loss of memory capacity and the increased complexity of the arithmetic and control circuits.
This enables us to see things from the perspective of early computer designers, who believed that saving computer time and memory were more important than saving programmer time.
The original papers introducing the Wallace tree, Booth recoding, SRT division, overlapped triplets, and so on, are reprinted in Swartzlander [1990]. A good explanation of an early machine (the IBM 360/91) that used a pipelined Wallace tree, Booth recoding, and iterative division is in Anderson et al. [1967]. A discussion of the average time for single-bit SRT division is in Freiman [1961]; this is one of the few interesting historical papers that does not appear in Swartzlander. The standard book of Mead and Conway [1980] discouraged the use of CLAs as not being cost-effective in VLSI. The important paper by Brent and Kung [1982] helped combat that view. An example of a detailed layout for CLAs can be found in Ngai and Irwin [1985] or in Weste and Eshraghian [1993], and a more theoretical treatment is given by Leighton [1992]. Takagi, Yasuura, and Yajima [1985] provide a detailed description of a signed-digit tree multiplier. Before the ascendancy of IEEE arithmetic, many different floating-point formats were in use. Three important ones were used by the IBM 370, the DEC VAX, and the Cray. Here is a brief summary of these older formats. The VAX format is closest to the IEEE standard. Its single-precision format (F format) is like IEEE single precision in that it has a hidden bit, 8 bits of exponent, and 23 bits of fraction. However, it does not have a sticky bit, which causes it to round halfway cases up instead of to even. The VAX has a slightly different exponent range from IEEE single: Emin is −128 rather than −126 as in IEEE, and Emax is 126 instead of 127. The main differences between VAX and IEEE are the lack of special values and gradual underflow. The VAX has a reserved operand, but it works like a signaling NaN: It traps whenever it is referenced. Originally, the VAX’s double precision (D format) also had 8 bits of exponent. 
However, as this is too small for many applications, a G format was added; like the IEEE standard, this format has 11 bits of exponent. The VAX also has an H format, which is 128 bits long.
The IBM 370 floating-point format uses base 16 rather than base 2. This means it cannot use a hidden bit. In single precision, it has 7 bits of exponent and 24 bits (6 hex digits) of fraction. Thus, the largest representable number is 16^(2^7) = 2^(4 × 2^7) = 2^(2^9), compared with 2^(2^8) for IEEE. However, a number that is normalized in the hexadecimal sense only needs to have a nonzero leading digit. When interpreted in binary, the three most-significant bits could be zero. Thus, there are potentially fewer than 24 bits of significance. The reason for using the higher base was to minimize the amount of shifting required when adding floating-point numbers. However, this is less significant in current machines, where the floating-point add time is usually fixed independently of the operands. Another difference between 370 arithmetic and IEEE arithmetic is that the 370 has neither a round digit nor a sticky digit, which effectively means that it truncates rather than rounds. Thus, in many computations, the result will systematically be too small. Unlike the VAX and IEEE arithmetic, every bit pattern is a valid number. Thus, library routines must establish conventions for what to return in case of errors. In the IBM FORTRAN library, for example, √−4 returns 2!
Arithmetic on Cray computers is interesting because it is driven by a motivation for the highest possible floating-point performance. It has a 15-bit exponent field and a 48-bit fraction field. Addition on Cray computers does not have a guard digit, and multiplication is even less accurate than addition. Thinking of
multiplication as a sum of p numbers, each 2p bits long, Cray computers drop the low-order bits of each summand. Thus, analyzing the exact error characteristics of the multiply operation is not easy. Reciprocals are computed using iteration, and division of a by b is done by multiplying a times 1/b. The errors in multiplication and reciprocation combine to make the last three bits of a divide operation unreliable. At least Cray computers serve to keep numerical analysts on their toes! The IEEE standardization process began in 1977, inspired mainly by W. Kahan and based partly on Kahan’s work with the IBM 7094 at the University of Toronto [Kahan 1968]. The standardization process was a lengthy affair, with gradual underflow causing the most controversy. (According to Cleve Moler, visitors to the United States were advised that the sights not to be missed were Las Vegas, the Grand Canyon, and the IEEE standards committee meeting.) The standard was finally approved in 1985. The Intel 8087 was the first major commercial IEEE implementation and appeared in 1981, before the standard was finalized. It contains features that were eliminated in the final standard, such as projective bits. According to Kahan, the length of double-extended precision was based on what could be implemented in the 8087. Although the IEEE standard was not based on any existing floating-point system, most of its features were present in some other system. For example, the CDC 6600 reserved special bit patterns for INDEFINITE and INFINITY, while the idea of denormal numbers appears in Goldberg [1967] as well as in Kahan [1968]. Kahan was awarded the 1989 Turing prize in recognition of his work on floating point. Although floating point rarely attracts the interest of the general press, newspapers were filled with stories about floating-point division in November 1994. A bug in the division algorithm used on all of Intel’s Pentium chips had just come to light. 
It was discovered by Thomas Nicely, a math professor at Lynchburg College in Virginia. Nicely found the bug when doing calculations involving reciprocals of prime numbers. News of Nicely’s discovery first appeared in the press on the front page of the November 7 issue of Electronic Engineering Times. Intel’s immediate response was to stonewall, asserting that the bug would only affect theoretical mathematicians. Intel told the press, “This doesn’t even qualify as an errata . . . even if you’re an engineer, you’re not going to see this.” Under more pressure, Intel issued a white paper, dated November 30, explaining why they didn’t think the bug was significant. One of their arguments was based on the fact that if you pick two floating-point numbers at random and divide one into the other, the chance that the resulting quotient will be in error is about 1 in 9 billion. However, Intel neglected to explain why they thought that the typical customer accessed floating-point numbers randomly. Pressure continued to mount on Intel. One sore point was that Intel had known about the bug before Nicely discovered it, but had decided not to make it public. Finally, on December 20, Intel announced that they would unconditionally replace any Pentium chip that used the faulty algorithm and that they would take an unspecified charge against earnings, which turned out to be $300 million. The Pentium uses a simple version of SRT division as discussed in Section H.9. The bug was introduced when they converted the quotient lookup table to a PLA. Evidently there were a few elements of the table containing the quotient
digit 2 that Intel thought would never be accessed, and they optimized the PLA design using this assumption. The resulting PLA returned 0 rather than 2 in these situations. However, those entries were really accessed, and this caused the division bug. Even though the effect of the faulty PLA was to cause 5 out of 2048 table entries to be wrong, the Pentium only computes an incorrect quotient 1 out of 9 billion times on random inputs. This is explored in Exercise H.34.
References
Anderson, S. F., J. G. Earle, R. E. Goldschmidt, and D. M. Powers [1967]. “The IBM System/360 Model 91: Floating-point execution unit,” IBM J. Research and Development 11, 34–53. Reprinted in Swartzlander [1990]. Good description of an early high-performance floating-point unit that used a pipelined Wallace tree multiplier and iterative division.
Bell, C. G., and A. Newell [1971]. Computer Structures: Readings and Examples, McGraw-Hill, New York.
Birman, M., A. Samuels, G. Chu, T. Chuk, L. Hu, J. McLeod, and J. Barnes [1990]. “Developing the WRL3170/3171 SPARC floating-point coprocessors,” IEEE Micro 10:1, 55–64. These chips have the same floating-point core as the Weitek 3364, and this paper has a fairly detailed description of that floating-point design.
Brent, R. P., and H. T. Kung [1982]. “A regular layout for parallel adders,” IEEE Trans. on Computers C-31, 260–264. This is the paper that popularized CLAs in VLSI.
Burgess, N., and T. Williams [1995]. “Choices of operand truncation in the SRT division algorithm,” IEEE Trans. on Computers 44:7. Analyzes how many bits of divisor and remainder need to be examined in SRT division.
Burks, A. W., H. H. Goldstine, and J. von Neumann [1946]. “Preliminary discussion of the logical design of an electronic computing instrument,” Report to the U.S. Army Ordnance Department, p. 1; also appears in Papers of John von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, Mass., and Tomash Publishers, Los Angeles, 1987, 97–146.
Cody, W. J., J. T. Coonen, D. M. Gay, K. Hanson, D. Hough, W. Kahan, R. Karpinski, J. Palmer, F. N. Ris, and D. Stevenson [1984]. “A proposed radix- and word-length-independent standard for floating-point arithmetic,” IEEE Micro 4:4, 86–100. Contains a draft of the 854 standard, which is more general than 754. The significance of this article is that it contains commentary on the standard, most of which is equally relevant to 754. However, be aware that there are some differences between this draft and the final standard.
Coonen, J. [1984]. Contributions to a Proposed Standard for Binary Floating-Point Arithmetic, Ph.D. thesis, Univ. of Calif., Berkeley. The only detailed discussion of how rounding modes can be used to implement efficient binary decimal conversion.
Darley, H. M., et al. [1989]. “Floating point/integer processor with divide and square root functions,” U.S. Patent 4,878,190, October 31, 1989. Pretty readable as patents go. Gives a high-level view of the TI 8847 chip, but doesn’t have all the details of the division algorithm.
Demmel, J. W., and X. Li [1994]. “Faster numerical algorithms via exception handling,” IEEE Trans. on Computers 43:8, 983–992. A good discussion of how the features unique to IEEE floating point can improve the performance of an important software library.
Freiman, C. V. [1961]. “Statistical analysis of certain binary division algorithms,” Proc. IRE 49:1, 91–103. Contains an analysis of the performance of shifting-over-zeros SRT division algorithm.
Goldberg, D. [1991]. “What every computer scientist should know about floating-point arithmetic,” Computing Surveys 23:1, 5–48. Contains an in-depth tutorial on the IEEE standard from the software point of view.
Goldberg, I. B. [1967]. “27 bits are not enough for 8-digit accuracy,” Comm. ACM 10:2, 105–106. This paper proposes using hidden bits and gradual underflow.
Gosling, J. B. [1980]. Design of Arithmetic Units for Digital Computers, Springer-Verlag, New York. A concise, well-written book, although it focuses on MSI designs.
Hamacher, V. C., Z. G. Vranesic, and S. G. Zaky [1984]. Computer Organization, 2nd ed., McGraw-Hill, New York. Introductory computer architecture book with a good chapter on computer arithmetic.
Hwang, K. [1979]. Computer Arithmetic: Principles, Architecture, and Design, Wiley, New York. This book contains the widest range of topics of the computer arithmetic books.
IEEE [1985]. “IEEE standard for binary floating-point arithmetic,” SIGPLAN Notices 22:2, 9–25. IEEE 754 is reprinted here.
Kahan, W. [1968]. “7094-II system support for numerical analysis,” SHARE Secretarial Distribution SSD-159. This system had many features that were incorporated into the IEEE floating-point standard.
Kahaner, D. K. [1988]. “Benchmarks for ‘real’ programs,” SIAM News (November). The benchmark presented in this article turns out to cause many underflows.
Knuth, D. [1981]. The Art of Computer Programming, vol. II, 2nd ed., Addison-Wesley, Reading, Mass. Has a section on the distribution of floating-point numbers.
Kogge, P. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York. Has a brief discussion of pipelined multipliers.
Kohn, L., and S.-W. Fu [1989]. “A 1,000,000 transistor microprocessor,” IEEE Int’l Solid-State Circuits Conf., 54–55. There are several articles about the i860, but this one contains the most details about its floating-point algorithms.
Koren, I. [1989]. Computer Arithmetic Algorithms, Prentice Hall, Englewood Cliffs, N.J.
Leighton, F. T. [1992]. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Francisco. This is an excellent book, with emphasis on the complexity analysis of algorithms. Section 1.2.1 has a nice discussion of carry-lookahead addition on a tree.
Magenheimer, D. J., L. Peters, K. W. Pettis, and D. Zuras [1988]. “Integer multiplication and division on the HP Precision architecture,” IEEE Trans. on Computers 37:8, 980–990. Gives rationale for the integer- and divide-step instructions in the Precision architecture.
Markstein, P. W. [1990]. “Computation of elementary functions on the IBM RISC System/6000 processor,” IBM J. of Research and Development 34:1, 111–119. Explains how to use fused multiply-add to compute correctly rounded division and square root.
Mead, C., and L. Conway [1980]. Introduction to VLSI Systems, Addison-Wesley, Reading, Mass.
Montoye, R. K., E. Hokenek, and S. L. Runyon [1990]. “Design of the IBM RISC System/6000 floating-point execution,” IBM J. of Research and Development 34:1, 59–70. Describes one implementation of fused multiply-add.
Ngai, T.-F., and M. J. Irwin [1985]. “Regular, area-time efficient carry-lookahead adders,” Proc. Seventh IEEE Symposium on Computer Arithmetic, 9–15. Describes a CLA like that of Figure H.17, where the bits flow up and then come back down.
Patterson, D. A., and J. L. Hennessy [1994]. Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco. Chapter 4 is a gentler introduction to the first third of this appendix.
Peng, V., S. Samudrala, and M. Gavrielov [1987]. “On the implementation of shifters, multipliers, and dividers in VLSI floating point units,” Proc. Eighth IEEE Symposium on Computer Arithmetic, 95–102. Highly recommended survey of different techniques actually used in VLSI designs.
Rowen, C., M. Johnson, and P. Ries [1988]. “The MIPS R3010 floating-point coprocessor,” IEEE Micro, 53–62 (June).
Santoro, M. R., G. Bewick, and M. A. Horowitz [1989]. “Rounding algorithms for IEEE multipliers,” Proc. Ninth IEEE Symposium on Computer Arithmetic, 176–183. A very readable discussion of how to efficiently implement rounding for floating-point multiplication.
Scott, N. R. [1985]. Computer Number Systems and Arithmetic, Prentice Hall, Englewood Cliffs, N.J.
Swartzlander, E., ed. [1990]. Computer Arithmetic, IEEE Computer Society Press, Los Alamitos, Calif. A collection of historical papers in two volumes.
Takagi, N., H. Yasuura, and S. Yajima [1985]. “High-speed VLSI multiplication algorithm with a redundant binary addition tree,” IEEE Trans. on Computers C-34:9, 789–796. A discussion of the binary tree signed multiplier that was the basis for the design used in the TI 8847.
Taylor, G. S. [1981]. “Compatible hardware for division and square root,” Proc. Fifth IEEE Symposium on Computer Arithmetic, 127–134. Good discussion of a radix-4 SRT division algorithm.
Taylor, G. S. [1985]. “Radix 16 SRT dividers with overlapped quotient selection stages,” Proc. Seventh IEEE Symposium on Computer Arithmetic, 64–71. Describes a very sophisticated high-radix division algorithm.
Weste, N., and K. Eshraghian [1993]. Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed., Addison-Wesley, Reading, Mass. This textbook has a section on the layouts of various kinds of adders.
Williams, T. E., M. Horowitz, R. L. Alverson, and T. S. Yang [1987]. “A self-timed chip for division,” Advanced Research in VLSI, Proc. 1987 Stanford Conf., MIT Press, Cambridge, Mass. Describes a divider that tries to get the speed of a combinational design without using the area that would be required by one.
Exercises
H.1
[12] Using n bits, what are the largest and smallest integers that can be represented in the two’s complement system?
H.2
[20/25] In the subsection “Signed Numbers” (page H-7), it was stated that two’s complement overflows when the carry into the high-order bit position is different from the carry-out from that position. a. [20] Give examples of pairs of integers for all four combinations of carry-in and carry-out. Verify the rule stated above. b. [25] Explain why the rule is always true.
H.3
[12] Using 4-bit binary numbers, multiply −8 × −8 using Booth recoding.
H.4
[15] Equations H.2.1 and H.2.2 are for adding two n-bit numbers. Derive similar equations for subtraction, where there will be a borrow instead of a carry.
H.5
[25] On a machine that doesn’t detect integer overflow in hardware, show how you would detect overflow on a signed addition operation in software.
H.6
[15/15/20] Represent the following numbers as single-precision and double-precision IEEE floating-point numbers. a. [15] 10. b. [15] 10.5. c. [20] 0.1.
H.7
[12/12/12/12/12] Below is a list of floating-point numbers. In single precision, write down each number in binary, in decimal, and give its representation in IEEE arithmetic. a. [12] The largest number less than 1. b. [12] The largest number. c. [12] The smallest positive normalized number. d. [12] The largest denormal number. e. [12] The smallest positive number.
H.8
[15] Is the ordering of nonnegative floating-point numbers the same as integers when denormalized numbers are also considered?
H.9
[20] Write a program that prints out the bit patterns used to represent floating-point numbers on your favorite computer. What bit pattern is used for NaN?
H.10
[15] Using p = 4, show how the binary floating-point multiply algorithm computes the product of 1.875 × 1.875.
H.11
[12/10] Concerning the addition of exponents in floating-point multiply: a. [12] What would the hardware that implements the addition of exponents look like? b. [10] If the bias in single precision were 129 instead of 127, would addition be harder or easier to implement?
H.12
[15/12] In the discussion of overflow detection for floating-point multiplication, it was stated that (for single precision) you can detect an overflowed exponent by performing exponent addition in a 9-bit adder. a. [15] Give the exact rule for detecting overflow. b. [12] Would overflow detection be any easier if you used a 10-bit adder instead?
H.13
[15/10] Floating-point multiplication: a. [15] Construct two single-precision floating-point numbers whose product doesn’t overflow until the final rounding step. b. [10] Is there any rounding mode where this phenomenon cannot occur?
H.14
[15] Give an example of a product with a denormal operand but a normalized output. How large was the final shifting step? What is the maximum possible shift that can occur when the inputs are double-precision numbers?
H.15
[15] Use the floating-point addition algorithm on page H-23 to compute $1.010_2 - .1001_2$ (in 4-bit precision).
H.16
[10/15/20/20/20] In certain situations, you can be sure that a + b is exactly representable as a floating-point number, that is, no rounding is necessary. a. [10] If a, b have the same exponent and different signs, explain why a + b is exact. This was used in the subsection “Speeding Up Addition” on page H-25. b. [15] Give an example where the exponents differ by 1, a and b have different signs, and a + b is not exact. c. [20] If a ≥ b ≥ 0, and the top two bits of a cancel when computing a – b, explain why the result is exact (this fact is mentioned on page H-23). d. [20] If a ≥ b ≥ 0, and the exponents differ by 1, show that a − b is exact unless the high order bit of a − b is in the same position as that of a (mentioned in “Speeding Up Addition,” page H-25). e. [20] If the result of a − b or a + b is denormal, show that the result is exact (mentioned in the subsection “Underflow,” page H-36).
H.17
[15/20] Fast floating-point addition (using parallel adders) for p = 5. a. [15] Step through the fast addition algorithm for a + b, where $a = 1.0111_2$ and $b = .11011_2$. b. [20] Suppose the rounding mode is toward +∞. What complication arises in the above example for the adder that assumes a carry-out? Suggest a solution.
H.18
[12] How would you use two parallel adders to avoid the final roundup addition in floating-point multiplication?
H.19
[30/10] This problem presents a way to reduce the number of addition steps in floating-point addition from three to two using only a single adder. a. [30] Let A and B be integers of opposite signs, with a and b their magnitudes. Show that the following rules for manipulating the unsigned numbers a and b gives A + B. 1. Complement one of the operands. 2. Use end-around carry to add the complemented operand and the other (uncomplemented) one. 3. If there was a carry-out, the sign of the result is the sign associated with the uncomplemented operand. 4. Otherwise, if there was no carry-out, complement the result, and give it the sign of the complemented operand. b. [10] Use the above to show how steps 2 and 4 in the floating-point addition algorithm on page H-23 can be performed using only a single addition.
H.20
[20/15/20/15/20/15] Iterative square root. a. [20] Use Newton’s method to derive an iterative algorithm for square root. The formula will involve a division. b. [15] What is the fastest way you can think of to divide a floating-point number by 2? c. [20] If division is slow, then the iterative square root routine will also be slow. Use Newton’s method on $f(x) = 1/x^2 - a$ to derive a method that doesn’t use any divisions. d. [15] Assume that the ratio division by 2 : floating-point add : floating-point multiply is 1:2:4. What ratios of multiplication time to divide time make each iteration step in the method of part (c) faster than each iteration in the method of part (a)? e. [20] When using the method of part (a), how many bits need to be in the initial guess in order to get double-precision accuracy after three iterations? (You may ignore rounding error.)
f. [15] Suppose that when spice runs on the TI 8847, it spends 16.7% of its time in the square root routine (this percentage has been measured on other machines). Using the values in Figure H.36 and assuming three iterations, how much slower would spice run if square root were implemented in software using the method of part (a)?
H.21
[10/20/15/15/15] Correctly rounded iterative division. Let a and b be floating-point numbers with p-bit significands (p = 53 in double precision). Let q be the exact quotient q = a/b, 1 ≤ q < 2. Suppose that $\bar{q}$ is the result of an iteration process, that $\bar{q}$ has a few extra bits of precision, and that $0 < q - \bar{q} < 2^{-p}$. For the following, it is important that $\bar{q} < q$, even when q can be exactly represented as a floating-point number. a. [10] If x is a floating-point number, and 1 ≤ x < 2, what is the next representable number after x? b. [20] Show how to compute $q'$ from $\bar{q}$, where $q'$ has p + 1 bits of precision and $q - q' < 2^{-p}$. c. [15] Assuming round to nearest, show that the correctly rounded quotient is either $q'$, $q' - 2^{-p}$, or $q' + 2^{-p}$. d. [15] Give rules for computing the correctly rounded quotient from $q'$ based on the low-order bit of $q'$ and the sign of $a - bq'$. e. [15] Solve part (c) for the other three rounding modes.
H.22
[15] Verify the formula on page H-30. [Hint: If $x_n = x_0(2 - x_0 b) \times \prod_{i=1}^{n} [1 + (1 - x_0 b)^{2^i}]$, then $2 - x_n b = 2 - x_0 b(2 - x_0 b) \prod [1 + (1 - x_0 b)^{2^i}] = 2 - [1 - (1 - x_0 b)^2] \prod [1 + (1 - x_0 b)^{2^i}]$.]
H.23
[15] Our example that showed that double rounding can give a different answer from rounding once used the round-to-even rule. If halfway cases are always rounded up, is double rounding still dangerous?
H.24
[10/10/20/20] Some of the cases of the italicized statement in the “Precisions” subsection (page H-34) aren’t hard to demonstrate. a. [10] What form must a binary number have if rounding to q bits followed by rounding to p bits gives a different answer than rounding directly to p bits? b. [10] Show that for multiplication of p-bit numbers, rounding to q bits followed by rounding to p bits is the same as rounding immediately to p bits if q ≥ 2p. c. [20] If a and b are p-bit numbers with the same sign, show that rounding a + b to q bits followed by rounding to p bits is the same as rounding immediately to p bits if q ≥ 2p + 1. d. [20] Do part (c) when a and b have opposite signs.
H.25
[Discussion] In the MIPS approach to exception handling, you need a test for determining whether two floating-point operands could cause an exception. This should be fast and also not have too many false positives. Can you come up with a practical test? The performance cost of your design will depend on the distribution of floating-point numbers. This is discussed in Knuth [1981] and the Hamming paper in Swartzlander [1990].
H.26
[12/12/10] Carry-skip adders. a. [12] Assuming that time is proportional to logic levels, how long does it take an n-bit adder divided into (fixed) blocks of length k bits to perform an addition? b. [12] What value of k gives the fastest adder? c. [10] Explain why the carry-skip adder takes time $O(\sqrt{n})$.
H.27
[10/15/20] Complete the details of the block diagrams for the following adders. a. [10] In Figure H.15, show how to implement the “1” and “2” boxes in terms of AND and OR gates. b. [15] In Figure H.18, what signals need to flow from the adder cells in the top row into the “C” cells? Write the logic equations for the “C” box. c. [20] Show how to extend the block diagram in H.17 so it will produce the carry-out bit c8.
H.28
[15] For ordinary Booth recoding, the multiple of b used in the ith step is simply $a_{i-1} - a_i$. Can you find a similar formula for radix-4 Booth recoding (overlapped triplets)?
H.29
[20] Expand Figure H.29 in the fashion of H.27, showing the individual adders.
H.30
[25] Write out the analog of Figure H.25 for radix-8 Booth recoding.
H.31
[18] Suppose that $a_{n-1} \ldots a_1 a_0$ and $b_{n-1} \ldots b_1 b_0$ are being added in a signed-digit adder as illustrated in the example on page H-54. Write a formula for the ith bit of the sum, $s_i$, in terms of $a_i$, $a_{i-1}$, $a_{i-2}$, $b_i$, $b_{i-1}$, and $b_{i-2}$.
H.32
[15] The text discussed radix-4 SRT division with quotient digits of −2, −1, 0, 1, 2. Suppose that 3 and −3 are also allowed as quotient digits. What relation replaces $r_i \le 2b/3$?
H.33
[25/20/30] Concerning the SRT division table, Figure H.34: a. [25] Write a program to generate the results of Figure H.34. b. [20] Note that Figure H.34 has a certain symmetry with respect to positive and negative values of P. Can you find a way to exploit the symmetry and only store the values for positive P? c. [30] Suppose a carry-save adder is used instead of a propagate adder. The input to the quotient lookup table will be k bits of divisor and l bits of remainder, where the remainder bits are computed by summing the top l bits of the sum and carry registers. What are k and l? Write a program to generate the analog of Figure H.34.
H.34
[12/12/12] The first several million Pentium chips produced had a flaw that caused division to sometimes return the wrong result. The Pentium uses a radix-4 SRT algorithm similar to the one illustrated in the example on page H-58 (but with the remainder stored in carry-save format: see Exercise H.33(c)). According to Intel, the bug was due to five incorrect entries in the quotient lookup table. a. [12] The bad entries should have had a quotient of plus or minus 2, but instead had a quotient of 0. Because of redundancy, it’s conceivable that the algorithm could “recover” from a bad quotient digit on later iterations. Show that this is not possible for the Pentium flaw. b. [12] Since the operation is a floating-point divide rather than an integer divide, the SRT division algorithm on page H-46 must be modified in two ways. First, step 1 is no longer needed, since the divisor is already normalized. Second, the very first remainder may not satisfy the proper bound ($r \le 2b/3$ for Pentium, see page H-56). Show that skipping the very first left shift in step 2(a) of the SRT algorithm will solve this problem. c. [12] If the faulty table entries were indexed by a remainder that could occur at the very first divide step (when the remainder is the divisor), random testing would quickly reveal the bug. This didn’t happen. What does that tell you about the remainder values that index the faulty entries?
H.35
[12/12/12] The discussion of the remainder-step instruction assumed that division was done using a bit-at-a-time algorithm. What would have to change if division were implemented using a higher-radix method?
H.36
[25] In the array of Figure H.28, the fact that an array can be pipelined is not exploited. Can you come up with a design that feeds the output of the bottom CSA into the bottom CSAs instead of the top one, and that will run faster than the arrangement of Figure H.28?
I Implementing Coherence Protocols
The devil is in the details. Classic Proverb
© 2003 Elsevier Science (USA). All rights reserved.
I.1 Implementation Issues for the Snooping Coherence Protocol
I.2 Implementation Issues in the Distributed Directory Protocol
Exercises
I.1
Implementation Issues for the Snooping Coherence Protocol
The major complication in actually using the snooping coherence protocol from Section 6.3 is that write misses are not atomic: The operation of detecting a write miss, obtaining the bus, getting the most recent value, and updating the cache cannot be done as if it took a single cycle. In particular, two processors cannot both use the bus at the same time. Thus, we must decompose the write into several steps that may be separated in time, but will still preserve correct execution. The first step detects the miss and requests the bus. The second step acquires the bus, places the miss on the bus, gets the data, and completes the write. Each of these two steps is atomic, but the cache block does not become exclusive until the second step has begun. As long as we do not change the block to exclusive or allow the cache update to proceed before the bus is acquired, writes to the same cache block will serialize when they reach the second step of the coherence protocol.
Unfortunately, this two-step process does introduce new complications in the protocol. Figure I.1 shows the actual finite-state diagram for implementing coherence for this two-step process under the assumption that a bus transaction is atomic once the bus is acquired. This assumption simply means that the bus is not a split-transaction bus: once it is acquired, any requests are processed before another processor can acquire the bus. We discuss the complexities of a split-transaction bus shortly.
In the simplest implementation, the finite-state machine in Figure I.1 is simply replicated for each block in the cache. Since there is no interaction among operations on different cache blocks, this replication of the controller works. Replicating the controller is not necessary, but before we see why, let’s make sure we understand how the finite-state controller in Figure I.1 operates.
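The two-step write described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the controller of Figure I.1: the names `Cache`, `AtomicBus`, `start_write`, `finish_write`, and `grant_next` are hypothetical, a single block is modeled, and the bus is assumed to grant requests oldest first.

```python
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    SHARED = auto()
    EXCLUSIVE = auto()
    PENDING_WRITE_MISS = auto()  # transient: miss detected, bus requested


class Cache:
    """Controller for one cache block (hypothetical, heavily simplified)."""

    def __init__(self, name):
        self.name = name
        self.state = State.INVALID
        self.value = None          # cached copy of the block
        self.pending_value = None  # value the stalled CPU is trying to write

    def start_write(self, bus, value):
        """Step 1: detect the write miss and request the bus. No data changes."""
        self.pending_value = value
        self.state = State.PENDING_WRITE_MISS
        bus.waiting.append(self)

    def finish_write(self, bus):
        """Step 2: the bus is ours; place the miss, then go exclusive and write."""
        for other in bus.caches:
            if other is not self:
                other.snoop_write_miss(bus)
        self.state = State.EXCLUSIVE
        self.value = self.pending_value
        self.pending_value = None

    def snoop_write_miss(self, bus):
        """React to another cache's write miss for this block."""
        if self.state is State.EXCLUSIVE:
            bus.memory = self.value  # write back the only up-to-date copy
        if self.state in (State.SHARED, State.EXCLUSIVE):
            self.state = State.INVALID
            self.value = None
        # In PENDING_WRITE_MISS we hold no copy yet, so we simply keep waiting.


class AtomicBus:
    """Non-split-transaction bus: one transaction at a time, oldest first."""

    def __init__(self, caches):
        self.caches = caches
        self.waiting = []  # caches that have requested the bus
        self.memory = 0    # memory copy of the single block modeled here

    def grant_next(self):
        winner = self.waiting.pop(0)
        winner.finish_write(self)
        return winner


p1, p2 = Cache("P1"), Cache("P2")
bus = AtomicBus([p1, p2])
p1.start_write(bus, 1)
p2.start_write(bus, 2)  # both are pending; neither is exclusive yet
bus.grant_next()        # P1 completes: P1 exclusive, P2 still pending
bus.grant_next()        # P2 completes: P1 writes back and invalidates
print(p1.state.name, p2.state.name, bus.memory)  # INVALID EXCLUSIVE 1
```

Because neither cache becomes exclusive until it holds the bus, the two writes serialize in the order the bus is granted, and at no point do both caches hold the block in the exclusive state.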
The additional states in Figure I.1 over those in Figure 6.12 on page 559 are all transient: The controller will leave those states when the bus is available. Four of the states are pending write-back states that arise because in a write-back cache when a block is replaced (or invalidated) it must be written back to the memory. Four events can cause such a write back:
1. A write miss on the bus by another processor for this exclusive block.
2. A CPU read miss that forces the exclusive block to be replaced.
3. A CPU write miss that forces the exclusive block to be replaced.
4. A read miss on the bus by another processor for this block.
In each of the cases the next state differs, hence there are four separate pending write-back states with four different successor states.
Logically replicating the controller for each cache block allows correct operation if two conditions hold (in addition to our base assumption that the processor blocks until a cache access completes):
1. An operation on the bus for a cache block and a pending operation for a different cache block are noninterfering.
[Figure I.1: state-transition diagram, not reproduced. States: Invalid; Shared (read only); Exclusive (read/write); Pending read; Pending write miss; Pending write back 1, 2, 3, and 4. Transition labels: CPU read; CPU read hit; CPU read miss; CPU write; CPU write hit; CPU write miss; Bus available; Place read miss on bus; Place write miss on bus; Write-back block; Write-back data; Write miss for this block; Write miss for block; Read miss for block.]
Figure I.1 A finite-state controller for a simple cache coherence scheme with a write-back cache. The engine that implements this controller must be reentrant, that is, it must handle multiple requests for different cache lines that are overlapped in time. The diagram assumes the processor stalls until a request is completed, but other transactions must be handled. This controller also assumes that a transition to a new state that involves a bus access does not complete until the bus access is completed. Notice that if we did not require a processor to generate a write miss when it transitioned from the shared to exclusive state, it might not obtain the latest value of a cache block, since some other processor may have updated that block. In a protocol using ownership or upgrade transitions, we will need to be able to transition out of the pending write state and restart an access if a conflicting write obtains the bus first.
2. The controller in Figure I.1 correctly deals with the cases when a pending operation and a bus operation are for the same block.
The first condition is certainly true, since operations for different blocks may proceed in any order and do not affect the state transitions for the other block. To see why the second condition is true, consider each of the pending states and what happens if a conflicting access occurs:
■ Pending write back 1: The cache is writing back the data to eliminate it anyway, so a read or write miss for the block has no new effect. Notice, however, that the pending cache must use the bus cycle generated by the read or write miss to complete the write back. Otherwise, there will be no response to the miss, since the pending cache still has the only copy of the cache block. When it sees that the address of a miss matches the address of the block it is waiting to write back, it recognizes that the bus is available, writes the data, and transitions its state. This applies to all the pending write-back states.
■ Pending write back 2, 3: The cache is eliminating a block in the exclusive state, so another miss for that block simply allows the write back to occur immediately. If the read or write miss on the bus is for the new block that the processor is trying to share, there is no interaction, since the processor does not yet have a copy of the block.
■ Pending write back 4: In this case the processor is surrendering an exclusive block and simply completes the write back.
■ Pending read, pending write miss: The processor does not yet have a copy of the block that it is waiting for, so a read or write miss for that block has no effect. Since the waiting cache still needs to place a miss on the bus and fetch the block, it is guaranteed to get a new copy.
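The address-match rule for the pending write-back states can be written down as a small function. The sketch below is an assumption-laden illustration rather than the book's controller: `PendingWriteBack`, `snoop_miss`, and the use of a plain dictionary for memory are all hypothetical names chosen for the example, and the successor state is carried as an opaque string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingWriteBack:
    address: int     # block this cache is waiting to write back
    data: int        # dirty data; this cache holds the only valid copy
    next_state: str  # successor state once the write back completes


def snoop_miss(pending: Optional[PendingWriteBack], miss_address: int, memory: dict):
    """Handle a read or write miss seen on the bus while a write back may be pending.

    Returns (still_pending, new_state); new_state is None when nothing changes.
    """
    if pending is None or pending.address != miss_address:
        # A miss for a different block does not interfere with the pending one.
        return pending, None
    # Same block: treat this bus cycle as "bus available" and write back now,
    # so the miss is answered with the up-to-date data.
    memory[miss_address] = pending.data
    return None, pending.next_state


memory = {}
wb = PendingWriteBack(address=0x40, data=7, next_state="invalid")
print(snoop_miss(wb, 0x80, memory)[1])  # None: unrelated block, keep waiting
print(snoop_miss(wb, 0x40, memory)[1])  # invalid: write back completed
```

The pending read and pending write miss states need no such handling in this sketch, since a cache in those states holds no copy of the block it is waiting for.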
With these additional states and our assumptions that the bus operates atomically, that misses always cause the state to be updated, and that the processor blocks until an access completes, our coherence implementation is both deadlock-free and correct. If some fairness guarantee is made for bus access, then this controller is also free of livelock. Livelock occurs when some portion of a computation cannot make progress, though other portions can. If one processor could be denied the bus indefinitely, then that processor could never make progress in its computation. Some guarantee of fairness on bus access prevents this.
There is still, however, one more critical implementation detail related to the bus transactions and what happens when a miss is processed. The key difference between the cache coherence case and the standard uniprocessor case occurs when the block is exclusive in some cache. Because it is a write-back cache, the memory copy is stale. In this case, the coherence unit will retrieve the block (called an intervention) and generate a write back. Since the memory does not know the state of the block, it will attempt to respond to the request as well. Since the data have been updated, the cache and the memory will each attempt to drive the bus with different values. To prevent this, a line is added to the bus (often called the shared line) to coordinate the response. When the processor detects that it has a copy in the exclusive state, it signals the memory on this line and the memory
aborts the transaction. When the write back occurs, the memory gets the data and updates its copy. Since it is difficult to bound the amount of time that it can take to snoop the local cache copy, this line is usually implemented as a wired-OR with each processor holding its input low until it knows it does not have the block in exclusive state. The memory waits for the line to go high, indicating that no cache has the copy in the exclusive state, before putting data on the bus.
If the bus had a split-transaction capability, then we could not assume that a response would occur immediately. In fact, implementing a split transaction with coherence is significantly more complex. One complication arises from the fact that we must number and track bus transactions, so that a controller knows when a bus action is a response to its request. Another complication is dealing with races that can arise because two operations for the same cache block could potentially be outstanding simultaneously. An example illustrates this complication best. What happens when two processors try to write a word in the same cache block? Without split transactions, one of the operations reaches the bus first and the other must change the state of the block to invalid and try the operation again. Only one of the transactions is outstanding on the bus at any point.
Example
Suppose we have a split-transaction bus and no cache has a copy of a particular block. Show how when both P1 and P2 try to write a word in that block, we can get an incorrect result using the protocol in Figure I.1 on page I-3.
Answer
With the protocol in Figure I.1, the following sequence of events could occur:
1. P1 places a write miss for the block on the bus. Since P2 has the data in the invalid state, nothing occurs.
2. P2 places its write miss on the bus; again, since no copy exists, no state changes are needed.
3. The memory responds to P1’s request. P1 places the block in the exclusive state and writes the word into the block.
4. The memory responds to P2’s request. P2 places the block in the exclusive state and writes the word into the block.
Disaster! Two caches now have the same block in the exclusive state and memory will be inconsistent.
How can this race be avoided? The simplest way is to use the broadcast capability of the bus. All coherence controllers track all bus accesses. In a split-transaction bus, the transactions must be tagged with the processor identity (or a transaction number), so that a processor can identify a reply to its request. Every controller can keep track of the memory address of any outstanding bus requests, since it can see the request and the corresponding reply on the bus. When the local processor generates a miss, the controller does not place the miss request on the bus until there are no outstanding requests for the same cache block. This will
force P2 in the above example to wait for P1’s access to complete, allowing P1 to place the data in the exclusive state (and write the word into the block). The miss request from P2 will then cause P1 to do a write back and move the block to the invalid state. Alternatively, we could have each processor buffer only its own requests and track the responses to others. If the address of the requested block were included in the reply, then the second processor to request the block could ignore the reply and reissue its request. These race conditions are what make implementing coherence even more tricky as the interconnection mechanism becomes more sophisticated. As we will see in the next section, such problems are slightly worse in a directory-based system that does not have a broadcast mechanism like a bus, which can be used to order all accesses.
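The first remedy for the split-transaction race, in which every controller snoops all requests and replies and holds back a local miss while a request for the same block is outstanding, might be sketched as follows. The class and method names (`SplitBus`, `SplitBusController`, `try_issue`, and so on) are hypothetical, and the sketch tracks only which requests are outstanding, not the cache states or the data.

```python
class SplitBusController:
    """Per-processor controller that snoops every request and reply on the bus."""

    def __init__(self, name):
        self.name = name
        self.outstanding = set()  # block addresses with a request but no reply yet
        self.deferred = []        # local misses held back because of a conflict

    def observe_request(self, address):
        self.outstanding.add(address)

    def observe_reply(self, address):
        self.outstanding.discard(address)

    def try_issue(self, address, bus):
        """Place a miss on the bus only if no request for the block is outstanding."""
        if address in self.outstanding:
            self.deferred.append(address)
            return False
        bus.broadcast_request(self, address)
        return True


class SplitBus:
    """Broadcast medium: every attached controller sees each request and reply."""

    def __init__(self):
        self.controllers = []
        self.log = []  # (requester name, address) in the order placed on the bus

    def attach(self, controller):
        self.controllers.append(controller)

    def broadcast_request(self, requester, address):
        self.log.append((requester.name, address))
        for c in self.controllers:
            c.observe_request(address)

    def broadcast_reply(self, address):
        for c in self.controllers:
            c.observe_reply(address)
        # Controllers that were waiting on this block retry; the first retry
        # wins and any later one is deferred again.
        for c in self.controllers:
            if address in c.deferred:
                c.deferred.remove(address)
                c.try_issue(address, self)


bus = SplitBus()
p1, p2 = SplitBusController("P1"), SplitBusController("P2")
bus.attach(p1)
bus.attach(p2)
BLOCK = 0x40
p1.try_issue(BLOCK, bus)    # P1's write miss goes out
p2.try_issue(BLOCK, bus)    # P2 sees an outstanding request and waits
bus.broadcast_reply(BLOCK)  # reply to P1; P2's miss is now placed on the bus
print(bus.log)              # [('P1', 64), ('P2', 64)]
```

With this rule P2's miss cannot reach the bus until P1's reply has been observed, so steps 2 and 3 of the failing sequence in the example can no longer interleave.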
I.2
Implementation Issues in the Distributed Directory Protocol
One further source of complexity of a directory protocol comes from the lack of atomicity in transactions. Several of the operations that are atomic in a bus-based snoopy protocol cannot be atomic in a directory-based machine. For example, a read miss, which is atomic in the snoopy protocol, cannot be atomic, since it requires messages to be sent to remote directories and caches. In fact, if we attempt to implement these operations in an atomic fashion in a distributed-memory machine, we can have deadlock. Recall from Chapter 6 that a deadlock means that the machine has reached a state from which it cannot make forward progress. This is easy to see with an example.
Example
Show how deadlock can occur if a node treats a read miss as atomic and hence is unable to respond to other requests until the read miss is completed.
Answer
Assume that two nodes, P1 and P2, hold exclusive copies of cache blocks X1 and X2, respectively, and that the two blocks have different home directories. Consider the sequence of events shown in Figure I.2.
Events caused by P1 activity | Events caused by P2 activity
P1 read miss for X2 | P2 read miss for X1
Directory for X2 receives read miss and generates a fetch that is sent to P2 | Directory for X1 receives read miss and generates a fetch that is sent to P1
Fetch arrives at P1, waits for completion of atomic read miss | Fetch arrives at P2, waits for completion of atomic read miss
Figure I.2 Events caused by P1 and P2 leading to deadlock.
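One way to see the problem in Figure I.2 is as a cycle in a wait-for graph. The sketch below is an illustration only; the function `find_cycle` and the dictionary encoding (each node maps to the node whose service it is waiting on) are assumptions made for this example.

```python
def find_cycle(waits_for):
    """Return the nodes on a cycle of a wait-for graph, or None.

    `waits_for` maps each blocked node to the single node it waits on.
    """
    for start in waits_for:
        seen = []
        node = start
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


# Atomic read misses: P1's miss for X2 cannot finish until P2 services the
# fetch, and P2's miss for X1 cannot finish until P1 services its fetch.
atomic_misses = {"P1": "P2", "P2": "P1"}

# If each node can service an incoming fetch while its own miss is pending,
# neither node waits on a stalled node, so the graph has no edges.
nonatomic_misses = {}

deadlocked = find_cycle(atomic_misses)
progress = find_cycle(nonatomic_misses)
```

A non-empty result means every node on the cycle is waiting for another node on the cycle, so none of them can make forward progress.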
At this point the nodes are deadlocked. In this case, since the requests are for separate blocks, deadlock can be avoided by duplicating the controller for each block. This allows the controllers to accept a request for one block while a request for another block is in process. In practice, complications arise because requests for the same block can collide, as we will see shortly.
The almost complete lack of atomicity in transactions causes most of the complexities in translating these state transition diagrams into actual finite-state controllers. There are two assumptions about the interconnection network that significantly simplify the implementation. First, we assume that the network provides point-to-point in-order delivery of messages. This means that two messages sent from a single node to another node arrive in the order they were sent. No assumptions are made about messages originating from, or destined to, different nodes. Second, we assume the network has unlimited buffering. This second assumption means that a message can always be accepted into the network. This reduces the possibility for deadlock and allows us to treat as atomic some nonatomic actions for which we would otherwise need to deal with a full set of network buffers. Of course, we also assume that the network delivers all messages within a finite time.
While the first assumption, in-order transmission, is quite reasonable and is, in fact, true in many machines, the second assumption, unlimited buffering, is not true. Actually, the network need only be capable of buffering a finite number of messages, since we still assume that processors block on misses. In practice, this number may still be large and unreasonable, so later in the section we will discuss what has to change to eliminate the assumption that a message can always be accepted, while still preventing deadlock.
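The two network assumptions can be modeled with one unbounded FIFO queue per (source, destination) pair. The sketch below is a toy model under that assumption; the class name `PointToPointNetwork`, the node names, and the message strings are invented for illustration.

```python
from collections import defaultdict, deque


class PointToPointNetwork:
    """Toy network: in-order per (source, destination) pair, unbounded buffers."""

    def __init__(self):
        # One FIFO per ordered pair of nodes; nothing orders different pairs.
        self.channels = defaultdict(deque)

    def send(self, src, dst, message):
        # Unlimited buffering: a message can always be accepted.
        self.channels[(src, dst)].append(message)

    def deliver(self, src, dst):
        # Messages between one pair of nodes arrive in the order sent.
        queue = self.channels[(src, dst)]
        return queue.popleft() if queue else None


net = PointToPointNetwork()
net.send("directory", "P1", "first message")
net.send("directory", "P1", "second message")
net.send("P2", "P1", "unrelated message")

from_p2 = net.deliver("P2", "P1")             # may be delivered at any time
first = net.deliver("directory", "P1")
second = net.deliver("directory", "P1")
```

Because the queues never fill, `send` cannot fail in this model; that is exactly the property the finite-buffering discussion later has to give up.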
We also assume that the coherence controller is duplicated for each cache block (to avoid having to deal with unrelated transactions) and that a state transition only completes when a message has been transmitted and a data value reply received (when needed). This last assumption simply means that we do not allow the CPU to continue and read or write a cache block until the read or write miss is satisfied by a data value reply message. This simply eliminates a transition state that waits for the block to arrive. Because we are assuming unlimited buffering, we also assume that an outgoing message can always be transmitted before the next incoming message is accepted.
Under these assumptions the state transition diagram of Figure 6.29 on page 581 can be used for the coherence controller at the cache with one small addition: The controller simply throws away any incoming transactions, other than the data value reply, while waiting for a read or write miss.
Let’s look at each possible case that can arise while the cache is waiting for a response from the directory. Cases where the cache is transitioning the block to invalid, either from the shared or exclusive state, do not matter, since any incoming signals for this block do not affect the block once it is invalid. Hence, we need only consider cases where the processor is transitioning to the shared or exclusive state. There are two such cases:
■
CPU read miss from either invalid or exclusive—The directory will not reply until the block is available. Furthermore, since any write back of an exclusive entry for this block has been done, the controller can ignore any requests.
■
CPU write miss—Any required write back is done first and the processor is stalled. Since it cannot hold a block exclusive in this cache entry, it can ignore requests for this block until the write miss is satisfied from the directory.
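The "one small addition" to the cache controller can be sketched as follows. This is a simplified illustration for a single block, not the full state machine of Figure 6.29: the class name `WaitingCacheController`, the message strings, and the `dropped` list are assumptions, and the handling of messages while no miss is pending is deliberately left out.

```python
class WaitingCacheController:
    """Hypothetical cache-side controller for a single block."""

    def __init__(self):
        self.state = "invalid"
        self.pending = None   # "read" or "write" while a miss is outstanding
        self.dropped = []     # messages discarded while waiting

    def cpu_miss(self, kind):
        # Any required write back is assumed to have been sent already,
        # so this entry holds nothing that another node could need.
        self.state = "invalid"
        self.pending = kind

    def receive(self, message):
        if self.pending is None:
            return "handled by the normal state machine (not modeled here)"
        if message != "data value reply":
            # While waiting, everything except the data value reply is ignored.
            self.dropped.append(message)
            return "dropped"
        self.state = "shared" if self.pending == "read" else "exclusive"
        self.pending = None
        return "miss satisfied"


cache = WaitingCacheController()
cache.cpu_miss("write")
first = cache.receive("invalidate")          # ignored: the miss is still pending
second = cache.receive("data value reply")   # the directory's reply completes it
```

Recording the dropped messages is only for inspection here; the point of the two cases above is that nothing needs to be done with them.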
The directory case is more complex to explain, since multiple cache controllers may send a message for the same block close to the same time. These operations must be serialized. Unlike the snoopy case where every controller sees every request on the bus at the same time, the individual caches only know what has happened when they are notified by the directory. Because the directory serializes the messages when it receives them and because all write misses for a given cache block go to the same directory, writes will be serialized by the home directory.
Thus, the directory controller’s main problem is to deal with the distributed representation of the cache state. Since the directory must wait for the completion of certain operations, such as sending invalidates and fetching a cache block before transitioning state, most potential races are eliminated. Because we assume unlimited buffering, the directory can always complete a transaction before accepting the next incoming message. For this reason, the state transition diagram in Figure 6.30 can be used as an implementation.
To see why, we must consider cases where the directory and the local cache do not agree on the state of a block. The cache can only have a block in a less restricted state than the directory believes the block is in, because transitioning to exclusive from invalid or shared, or to shared from invalid, requires a message to the directory and a reply. Thus, the only cases to consider are
■
Local cache state is invalid, directory state is exclusive—The cache controller must have performed a data write back of the block (see Figure 6.29). Hence the directory will shortly obtain the block. Furthermore no invalidation is needed, since the block has been replaced.
■
Local cache state is invalid, directory state is shared (the local cache is replacing the line)—The directory will send an invalidate, which may be ignored, since the block has been replaced. Some directory protocols send a replacement hint message when a shared line is replaced. Such messages are used to eliminate unnecessary invalidates and to reduce the state needed in the directory.
■
Local cache state is shared, directory state is exclusive—The write back has already been done and the block has been replaced, so a fetch/invalidate, which could be sent by the directory, can be ignored.
Hence, the protocol operates correctly with infinite buffering.
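The claim that only three mismatches can occur follows from ordering the states by how restricted they are. The sketch below enumerates them; the numeric ranking in `RANK` and the idea of comparing the cache's state with what the directory believes that cache holds are assumptions made purely for illustration.

```python
from itertools import product

# Assumed ordering for illustration: a larger rank means a more restricted state.
RANK = {"invalid": 0, "shared": 1, "exclusive": 2}

# The cache can lag behind the directory's view but never run ahead of it,
# because moving to a more restricted state needs a message to the directory
# and a reply.
mismatches = [
    (cache, directory)
    for cache, directory in product(RANK, repeat=2)
    if RANK[cache] < RANK[directory]
]
```

The resulting pairs, written as (local cache state, directory state), are the three cases listed in the bullets above.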
Dealing with Finite Buffering
What happens when the network does not have unlimited buffering? The major implication of this limit is that a cache or directory controller may be unable to complete a message send. This could lead to deadlock. The earlier example (Figure I.2) showed such a deadlock case. Even if we assume a separate controller for each cache block, so that the requests do not interfere in the controller, the example will deadlock if there are no buffers available to send the replies. The occurrence of such a deadlock is based on three properties, which characterize many deadlock situations:
1. More than one resource is needed to complete a transaction: Buffers are needed to generate requests, create replies, and accept replies.
2. Resources are held until a nonatomic transaction completes: The buffer used to create the reply cannot be freed until the reply is accepted.
3. There is no global partial order on the acquisition of resources: Nodes can generate requests and replies at will.
These characteristics lead to deadlock, and avoiding deadlock requires breaking one of these properties. Imposing a global partial order, the solution used in a bus-based system, is unworkable in a larger-scale, distributed machine. Freeing up resources without completing a transaction is difficult, since the transaction must be completely backed out and cannot be left half-finished. Hence, our approach will be to try to resolve the need for multiple resources. We cannot simply eliminate this need, but we can try to ensure that the resources will always be available.
One way to ensure that a transaction can always complete is to guarantee that there are always buffers to accept messages. Although this is possible for a small machine with processors that block on a cache miss, it may not be very practical, since a single write could generate many invalidate messages. In addition, features such as prefetch would increase the amount of buffering required.
There is an alternative strategy, which most systems use, and which ensures that a transaction will not actually be initiated until we can guarantee that it has the resources to complete. The strategy has four parts:
1. A separate network (physical or virtual) is used for requests and replies, where a reply is any message that a controller waits for in transitioning between states. This ensures that new requests cannot block replies that will free up buffers.
2. Every request that expects a reply allocates space to accept the reply when the request is generated. If no space is available, the request waits. This ensures that a node can always accept a reply message, which will allow the replying node to free its buffer.
3. Any controller can reject (usually with a negative acknowledge or NAK) any request, but it can never NAK a reply. This prevents a transaction from starting if the controller cannot guarantee that it has buffer space for the reply.
4. Any request that receives a NAK in response is simply retried.
To understand why this is sufficient to prevent deadlock, let’s first consider our earlier example. Because a miss is a request that requires a reply, the space to accept the reply is preallocated. Hence, both nodes will have space for the reply. Since the networks are separate, a reply can be received even if no more space is available for requests. Since the requests are for two different blocks, the separate coherence controllers handle the requests. If the accesses are for the same address, then they are serialized at the directory and no problem exists.
To see that there are no deadlocks more generally, we must ensure that all replies can be accepted, and that every request is eventually serviced. Since a cache controller or directory controller can have at most one request needing a reply outstanding, it can always accept the reply when it returns. To see that every request is eventually serviced, we need only show that any request could be completed. Since every request starts with a read or write miss at a cache, it is sufficient to show that any read or write miss is eventually serviced. Since the write miss case includes the actions for a read miss as a subset, we focus on showing that write misses are serviced. The simplest situation is when the block is uncached; since that case is subsumed by the case when the block is shared, we focus on the shared and exclusive cases. Let’s consider the case where the block is shared:
■
The CPU attempts to do a write and generates a write miss that is sent to the directory. At this point the processor is stalled.
■
The write miss is sent to the directory controller for this memory block. Note that although one cache controller handles all the requests for a given cache block, regardless of its memory contents, there is a controller for every memory block. Thus the only conflict at the directory controller is when two requests arrive for the same block. This is critical to the deadlock-free operation of the controller and needs to be addressed in an implementation using a single controller.
■
Now consider what happens at the directory controller: Suppose the write miss is the next thing to arrive at the directory controller. The controller sends out the invalidates, which can always be accepted if the controller for this block is idle. If the controller is not idle, then the processor must be stalled. Since the processor is stalled, it must have generated a read or write miss. If it generated a read miss, then it has either displaced this block or does not have a copy. If it does not have a copy, then it has sent a read miss and cannot continue until the read miss is processed by the directory (the read miss will not be handled until the write miss is). If the controller has replaced the block, then we need not worry about it. If the controller is idle, then an invalidate occurs, and the copy is eliminated.
The case where the block is exclusive is somewhat trickier. Our analysis begins when the write miss arrives at the directory controller for processing. There are two cases to consider:
■
The directory controller sends a fetch/invalidate message to the processor where it arrives to find the cache controller idle and the block in the exclusive state. The cache controller sends a data write back to the home directory and makes its state invalid. This reply arrives at the home directory controller, which can always accept the reply, since it preallocated the buffer. The directory controller sends back the data to the requesting processor, which can always accept the reply; after the cache is updated the requesting cache controller restarts the processor.
■
The directory controller sends a fetch/invalidate message to the node indicated as owner. When the message arrives at the owner node, it finds that this cache controller has taken a read or write miss that caused the block to be replaced. In this case, the cache controller has already sent the block to the home directory with a data write back and made the data unavailable. Since this is exactly the effect of the fetch/invalidate message, the protocol operates correctly in this case as well.
We have shown that our coherence mechanism operates correctly when controllers are replicated and when requests can be NAKed and retried. Both of these assumptions generate some problems in the implementation.
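The four-part strategy for finite buffering can be sketched with a small model. Everything named here is hypothetical: the `Node` class, its `request_capacity` and `reply_slots` parameters, and the message tuples are assumptions chosen to illustrate separate request and reply buffers, reply-slot preallocation, NAKs, and retry.

```python
from collections import deque


class Node:
    """Hypothetical node with a bounded request buffer and reserved reply slots."""

    def __init__(self, name, request_capacity, reply_slots):
        self.name = name
        self.request_capacity = request_capacity
        self.requests = deque()            # request network input (part 1)
        self.replies = []                  # reply network input (part 1)
        self.free_reply_slots = reply_slots

    def try_issue(self, target, payload):
        # Part 2: reserve space for the reply before sending the request.
        if self.free_reply_slots == 0:
            return "wait"
        self.free_reply_slots -= 1
        return target.accept_request(self, payload)

    def accept_request(self, sender, payload):
        # Part 3: a request may be NAKed; the NAK travels as a reply.
        if len(self.requests) >= self.request_capacity:
            sender.accept_reply(("NAK", payload))
            return "nak"
        self.requests.append((sender, payload))
        return "accepted"

    def accept_reply(self, reply):
        # Replies are never refused: their slot was reserved at issue time.
        self.replies.append(reply)

    def consume_reply(self):
        reply = self.replies.pop(0)
        self.free_reply_slots += 1
        return reply

    def service_one(self):
        sender, payload = self.requests.popleft()
        sender.accept_reply(("data", payload))


directory = Node("directory", request_capacity=1, reply_slots=1)
p1 = Node("P1", request_capacity=1, reply_slots=1)
p2 = Node("P2", request_capacity=1, reply_slots=1)

r1 = p1.try_issue(directory, "write miss X")   # accepted
r2 = p2.try_issue(directory, "write miss X")   # request buffer full: NAK
directory.service_one()                        # P1 gets its data reply
kind, original = p2.consume_reply()            # P2 sees the NAK
r3 = p2.try_issue(directory, original)         # part 4: retry succeeds
```

Because the reply slot is reserved before the request leaves, the NAK and the data reply can both be delivered even when the request buffers are full.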
Implementing the Directory Controllers
First, let’s consider how these controllers, which we have assumed are replicated, can be built without actually replicating them. On the side of the cache controllers, because the processors stall, the actual implementation is quite similar to what was needed for the snoopy controller. We can simply add the transient states just as we did for the snoopy case and note that a transaction for a different cache block can be handled while the current processor-generated operation is pending. Since a processor blocks on a request, at most one pending operation need be dealt with.
On the side of the directory controller, things are more complicated. The difficulty arises from the way we handle the retrieval and return of a block. In particular, during the time a directory retrieves an exclusive block and returns it to the requesting node, the directory must accommodate other transactions. Otherwise, integrating the directory controllers for different cache blocks will lead to the possibility of deadlock. Because of this situation, the directory controller must be reentrant, that is, it must be capable of suspending its execution while waiting for a reply and accept another transaction. The only place this must occur is in response to read or write misses, while waiting for a response from the owner. This leads to three important observations:
1. The state of the controller need only be saved and restored while either a fetch or a fetch/invalidate operation is outstanding.
2. The implementation can bound the number of outstanding transactions being handled in the directory, by simply NAKing read or write miss requests that could cause the number of outstanding requests to be exceeded.
3. If instead of returning the data through the directory, the owner node forwards the data directly to the requester (as well as returning it to the directory), we can eliminate the need for the directory to handle more than one outstanding request. This motivation, in addition to the reduction of latency, is the reason for using the forwarding style of protocol. The forwarding-style protocol introduces another type of problem that we discuss in the exercises.
The major remaining implementation difficulty is to handle NAKs. One alternative is for each processor to keep track of its outstanding transactions, so it knows, when the NAK is received, what the requested transaction was. The other alternative is to bundle the original request into the NAK, so that the controller receiving the NAK can determine what the original request was. Because every request allocates a slot to receive a reply and a NAK is a reply, NAKs can always be received. In fact, the buffer holding the return slot for the request can also hold information about the request, allowing the processor to reissue the request if it is NAKed.
This completes the implementation of the directory scheme. In practice, great care is required to implement these protocols correctly and to avoid deadlock. The key ideas we have seen in this section, dealing with nonatomicity and finite buffering, are critical to ensuring a correct implementation. Designers have found that both formal and informal verification techniques are helpful for ensuring that implementations are correct.
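The first two observations and the NAK-bundling alternative can be illustrated with a small reentrant directory model. The class `DirectoryController`, the `max_outstanding` limit, the tuple message format, and the choice to NAK a second miss to a block that is already suspended are all assumptions made for this sketch.

```python
class DirectoryController:
    """Hypothetical reentrant directory that bounds suspended transactions."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        # Saved state exists only while a fetch or fetch/invalidate is outstanding.
        self.suspended = {}   # block -> (requester, kind)

    def miss(self, requester, kind, block, owner):
        """Handle a read or write miss to a block held exclusive by `owner`."""
        table_full = len(self.suspended) >= self.max_outstanding
        # Assumption for this sketch: a miss to an already suspended block is NAKed.
        if table_full or block in self.suspended:
            # The original request is bundled into the NAK for easy retry.
            return ("NAK", requester, (kind, block))
        self.suspended[block] = (requester, kind)
        message = "fetch" if kind == "read" else "fetch/invalidate"
        return (message, owner, block)

    def write_back(self, block, data):
        # The owner's reply resumes the suspended transaction.
        requester, _kind = self.suspended.pop(block)
        return ("data value reply", requester, (block, data))


directory = DirectoryController(max_outstanding=1)
a = directory.miss("P1", "write", "X", owner="P3")   # suspended, fetch/invalidate sent
b = directory.miss("P2", "read", "Y", owner="P4")    # table full: NAK
c = directory.write_back("X", 42)                    # P1's transaction resumes
retry_kind, retry_block = b[2]                       # request recovered from the NAK
d = directory.miss("P2", retry_kind, retry_block, owner="P4")
```

Bundling the request into the NAK means the requester needs no separate table of outstanding transactions in this model; the NAK alone is enough to reissue the miss.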
Exercises
I.1
[20] The Convex Exemplar is a coherent shared-memory machine organized as a ring of eight-processor clusters. Describe a protocol for this machine, assuming that the ring can be snooped and that a directory sits at the junction of the ring and can also be interrogated from inside the cluster. How much directory storage is needed? If the coherence misses are uniformly distributed and the capacity misses are all within a cluster, what is the average memory access time for Ocean running on 64 processors?
I.2
[15/20] As we discussed in Section I.2, many DSM machines use a forwarding protocol, where a write miss request to a remote dirty block is forwarded to the node that has the copy of the block. The remote node then generates both a write-back operation and a data value reply.
a. [15] Modify the state diagrams of Figures 6.29 and 6.30 so that the diagrams implement a forwarding protocol.
b. [20] Forwarding protocols introduce a race condition into the protocol. Describe this race condition. Show how NAKs can be used to resolve the race condition.
I.3
[20] Supporting lockup-free caches can have different implications for coherence protocols. Show how, without additional changes, allowing multiple outstanding misses from a node in a DSM can lead to deadlock, even if buffering is unlimited.