PARALLEL PROGRAMMING
Ivan Stanimirović
Arcler Press
www.arclerpress.com
Parallel Programming Ivan Stanimirović
Arcler Press 2010 Winston Park Drive, 2nd Floor Oakville, ON L6H 5R7 Canada www.arclerpress.com Tel: 001-289-291-7705 001-905-616-2116 Fax: 001-289-291-7601 Email:
[email protected] e-book Edition 2020 ISBN: 978-1-77407-389-6 (e-book) This book contains information obtained from highly regarded resources. Reprinted material sources are indicated and copyright remains with the original owners. Copyright for images and other graphics remains with the original owners as indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data. The authors, editors, and publisher are not responsible for the accuracy of the information in the published chapters or the consequences of their use. The publisher assumes no responsibility for any damage or grievance to persons or property arising out of the use of any materials, instructions, methods, or thoughts in the book. The authors, editors, and publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify it.
Notice: Registered trademarks of products or corporate names are used only for explanation and identification, without intent to infringe. © 2020 Arcler Press ISBN: 978-1-77407-227-1 (Hardcover) Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com
ABOUT THE AUTHOR
Ivan Stanimirović received his PhD from the University of Niš, Serbia, in 2013. His work spans from multi-objective optimization methods to applications of generalized matrix inverses in areas such as image processing, computer graphics, and visualization. He is currently an Assistant Professor at the Faculty of Sciences and Mathematics, University of Niš, working on the computation of generalized matrix inverses and their applications.
TABLE OF CONTENTS
List of Figures ........................................................................................................xi List of Tables .......................................................................................................xiii List of Abbreviations ............................................................................................xv Preface........................................................................ ......................................xvii Chapter 1
Introduction .............................................................................................. 1 1.1. Background ........................................................................................ 2 1.2. Definition of the Problem ................................................................... 3 1.3. Main Objectives ................................................................................. 3 1.4. Justification ......................................................................................... 4 1.5. Cloud Computing ............................................................................... 4 1.6. Fdtd Method ....................................................................................... 5 1.7. Computational Parallelism .................................................................. 5 1.8. Parallel Programming Models ............................................................. 7
Chapter 2
Creating and Managing Clusters for FDTD Computational Simulations with Meep Package on the EC2 Service for Amazon Web Services............................................................................................. 9 2.1. Starcluster ......................................................................................... 10 2.2. Meep Parallel Package ...................................................................... 11 2.3. GWT – Google Web Toolkit .............................................................. 12 2.4. Ganglia............................................................................................. 12 2.5. Architecture ...................................................................................... 14 2.6. Creating And Configuring A Public Ami ............................................ 16 2.7. Efficiency Test (FDTD Problem) ......................................................... 22 2.8. Efficiency Tests (Using Amazon EC2 Platform) ................................... 22 2.9. Analysis Of The Results ..................................................................... 31
Chapter 3
Parallel Algorithm Designed by Technique “PCAM” ............................... 33 3.1. Partition ............................................................................................ 34 3.2. Domain Decomposition ................................................................... 34 3.3. Functional Decomposition................................................................ 35 3.4. List Partitions Design ......................................................................... 36 3.5. Communication ................................................................................ 36 3.6. Agglomeration .................................................................................. 37 3.7. Reducing Costs of Software Engineering ........................................... 40 3.8. Load Balancing Algorithms ............................................................... 41 3.9. Task Scheduling Algorithms............................................................... 41 3.10. Allocation List Design ..................................................................... 42 3.11. Model of The Atmosphere ............................................................... 42 3.12. Agglomeration ................................................................................ 45 3.13. Load Distribution ............................................................................ 47
Chapter 4
Parallel Computer Systems ...................................................................... 51 4.1. History.............................................................................................. 53 4.2. Parallel Computing ........................................................................... 53 4.3. Background ...................................................................................... 54 4.4. Types Of Parallelism .......................................................................... 59 4.5. Hardware ......................................................................................... 61 4.6. Applications ..................................................................................... 67 4.7. History.............................................................................................. 68
Chapter 5
Parallelization of Web Compatibility Tests In Software Development .... 71 5.1. Web Compatibility Tests .................................................................... 72 5.2. Proposed Technique .......................................................................... 73 5.3. Results ............................................................................................. 76 5.4. Conclusion ....................................................................................... 76
Chapter 6
Theoretical Framework ........................................................................... 79 6.1. Definition of Process......................................................................... 80 6.2. Analysis of Key Processes.................................................................. 84 6.3. Review Process ................................................................................. 88
6.4. Statistical Tools ................................................................................. 95 6.5. Methodological Framework .............................................................. 99 Chapter 7
Modular Programming .......................................................................... 101 7.1. Programs And Judgments ................................................................ 102 7.2. Modularity Linguistics..................................................................... 102 7.3. Normal And Pathological Connections ........................................... 103 7.4. How To Achieve Minimum Cost Systems ........................................ 103 7.5. Complexity In Human Terms........................................................... 104 7.6. Cohesion ........................................................................................ 116
Chapter 8
Recursive Programming ........................................................................ 125 8.1. Classification Of Recursive Functions ............................................. 126 8.2. Design Recursive Functions ............................................................ 127 8.3. Bubble Method ............................................................................... 138 8.4. Sorting By Direct Selection ............................................................. 140 8.5. Method Binary Insertion ................................................................. 141 8.6. Method Quicksort (Quicksort)......................................................... 141 8.7. Mixing Method (Merge Sort) ........................................................... 143 8.8. Sequential Search ........................................................................... 146 8.9. Binary Search (Binary Search) ......................................................... 147 8.10. Seeking The Maximum And Minimum .......................................... 149 8.11. Greedy Method ............................................................................ 152 8.12. Optimal Storage On Tape (Optimal Storage On Tapes) .................. 152 8.13. The Knapsack Problem .................................................................. 153 8.14. Single Source Shortest Paths (Shortest Route From an Origin) ........ 156
Chapter 9
Dynamic Programming ......................................................................... 161 9.1. Optimality Principle ....................................................................... 162 9.2. Multistage Graphs (Multistage Graphs) ........................................... 163 9.3. Traveling Salesman Problem (TSP) ................................................... 166 9.4. Ix Return On The Same Route (Backtracking) .................................. 170 9.5. The Eight Queens Puzzle (8-Queens) .............................................. 171 9.6. Hamiltonian Cycles (Hamiltonian Path) .......................................... 177
Chapter 10 Branch And Bound ................................................................................ 181 10.1. General Description ..................................................................... 182 10.2. Poda Strategies.............................................................................. 184 10.3. Branching Strategies...................................................................... 184 10.4. The Traveling Salesman Problem (TSP) .......................................... 187 Chapter 11 Turing’s Hypothesis ............................................................................... 201 11.1. Hypothesis Church–Turing ............................................................ 202 11.2. Complexity ................................................................................... 202 11.3. Thesis Sequential Computability ................................................... 203 11.4. NP Problems................................................................................. 204 Conclusions ........................................................................................... 211 Bibliography .......................................................................................... 213 Index .................................................................................................... 217
LIST OF FIGURES
Figure 1.1. Distributed memory architecture
Figure 1.2. Sending messages via MPI step between two computers
Figure 2.1. Access files shared via NFS
Figure 2.2. Functional diagram of ganglia
Figure 2.3. Result from the post-processing problem resonance ring (animatedfigure.gif)
Figure 2.4. Graphic node vs. time (minutes) exercise resonance ring
Figure 2.5. Graphic simulation transmission ring
Figure 2.6. Graphic vs. frequency spectrum (input)
Figure 2.7. Graphic vs. frequency spectrum (Step)
Figure 2.8. Graphic vs. frequency spectrum (removal)
Figure 2.9. Statistical graph nodes vs. time (minutes) exercise transmission with ring
Figure 2.10. Graph nodes vs. time (minutes) exercise with ring transmission using Harminv
Figure 2.11. Graph nodes vs. time (minutes) without exercise ring transmission using Harminv
Figure 3.1. The task and channel structure for calculating the difference of two finite-dimensional templates nine points, assuming a grid point for each processor. Only the channels used by the task are shown shaded.
Figure 3.2. Use agglomeration to reduce the communication needs in the model atmosphere. (a) A single point is responsible for each task and therefore must obtain data from eight other tasks to apply the template nine points. (b) Granularity is increased and therefore it increases granularity by 2x2 points
Figure 4.1. Supercomputer cray-2 – the fastest in the world from 1985 to 1989
Figure 6.1. Outline of a process
Figure 6.2. Histogram
Figure 6.3. Pareto diagram
Figure 6.4. Sample control chart
Figure 8.1. Johann Gutenberg (1398–1468)
Figure 8.2. Al Khwarizmi (lived between 780 and 850 A.D.)
Figure 8.3. Leonardo of Pisa (1170–1250)
Figure 8.4. The growth of the main functions of low complexity
Figure 8.5. An example for the shortest path algorithm
Figure 8.6. An example for the shortest time
Figure 9.1. Graph 5 stages
Figure 9.2. Corresponding graph 3 project stages problems
Figure 9.3. Recursive tree traveling salesman
Figure 9.4. Positions that can attack the queen
Figure 9.5. Example two threatened queens on the board 4 by 4
Figure 9.6. Scheme reduced tree solutions
Figure 9.7. A decision tree for the program 4 queens
Figure 9.8. Example of a Hamiltonian cycle
Figure 10.1. Strategies branch FIFO
Figure 10.2. Strategies branching LIFO
Figure 10.3. Tree states for a traveling salesman problem with n = 4 and i0 = i4 = 1
Figure 10.4. Possible paths
Figure 11.1. Computable and non-computable problems
Figure 11.2. A quantum Turing machine
LIST OF TABLES
Table 2.1. Geometrical data structure ring resonator
Table 2.2. Data optical ring resonator geometric structure
Table 2.3. Example tabulation of results optical resonator ring
Table 2.4. Example tabulation of results Harminv
Table 2.5. Example tabulation of results Harminv
Table 8.1. Algorithmic complexity
Table 8.2. Time in seconds it takes to perform f(n) operations
Table 8.3. Knapsack problem
Table 8.4. Cost matrix
Table 9.1. Traveling salesman problem
LIST OF ABBREVIATIONS
AOG     AND/OR decision graph problem
ASIC    application-specific integrated circuit
BOINC   Berkeley open infrastructure for network computing
CDP     clique decision problem
CN      chromatic number
CSS     cascading style sheets
DFD     data flow diagram
DHC     directed Hamiltonian cycle
DOM     document object model
DRR     round-robin database
EBS     elastic block storage
EC2     elastic compute cloud
FDTD    finite difference time domain
FPGA    field-programmable gate array
GPGPU   general-purpose computing on graphics processing units
HPC     high-performance computing
MIMD    multiple-instruction multiple-data
MISD    multiple-instruction single-data
MPI     message passing interface
MPP     massively parallel processor
MTND    non-deterministic Turing machine
NOT     negation operator
NUMA    non-uniform memory access
RRD     round-robin database
SIMD    single-instruction multiple-data
SISD    single-instruction single-data
SMP     symmetric multiprocessor
SSE     streaming SIMD extensions
TCS     total clear sky
TSP     traveling salesman (salesperson) problem
UI      user interface
UMA     uniform memory access
VAC     value added for the customer
VAO     organizational activities that add value
VLSI    very-large-scale integration
PREFACE
Performing electromagnetic simulations is often a task of high computational complexity: generating a solution can demand a very large amount of processing. A good alternative for processing efficiently and reducing that time is to rely on a cloud computing platform and its tools. The book also deals with the notion of algorithmic complexity. This complexity is not about the number of lines of code; rather, it concerns the execution time of a specific algorithm. There are intractable problems, and many algorithms have the property of being recursive, so the reader needs a notion of recursion in both its mathematical and its programming sense.
Among other topics, this book describes the implementation of a tool for creating and managing computational clusters for FDTD simulations on the Amazon EC2 web service. The details of the problem are provided, together with the background, the objectives achieved, and the justification for implementing the algorithm. The theoretical framework, including the cloud computing platform, the FDTD method of computational electromagnetics, and the concept and architecture of the tool, is described in the opening chapters. The tools used both for efficient management of the cluster and for the electromagnetic simulations, namely the GWT framework, StarCluster, the Ganglia monitoring tool, and the parallel Meep package, are also described.
The book details the architecture and implementation of the final tool, including the installation, configuration, and development of each of its components, as well as various tests, performance comparisons on the cloud computing platform, and the efficient management of the available resources.
Chapter 1
Introduction
CONTENTS 1.1. Background ........................................................................................ 2 1.2. Definition of the Problem ................................................................... 3 1.3. Main Objectives ................................................................................. 3 1.4. Justification ......................................................................................... 4 1.5. Cloud Computing ............................................................................... 4 1.6. Fdtd Method ....................................................................................... 5 1.7. Computational Parallelism .................................................................. 5 1.8. Parallel Programming Models ............................................................. 7
At present, the standard approach to electromagnetic simulation is the FDTD method, which solves mathematical equations (Maxwell's equations) by means of finite difference equations. At the software level, there are applications that implement this algorithm [1], run these simulations, and provide visualization of the results. One difficulty is that some simulations are very complex because of the large number of calculations they involve. In addition, researchers often need to run multiple simulations with different parameter values for the same problem structure in order to compare results across settings. Doing this on a local machine can consume a great deal of resources and time; the better option is to use a high-performance cluster. A good alternative for efficient processing, while reducing cost and time, is the cloud computing platform and its associated tools. This chapter describes the implementation of a tool for creating and managing computational clusters for FDTD simulations on the Amazon EC2 service. It also presents the background, the objectives, a detailed explanation of the problem, and the justification for this work.
1.1. BACKGROUND
Around 1870, Maxwell formulated the partial differential equations of electrodynamics. They represent the fundamental unification of the electric and magnetic fields and predict the phenomenon of electromagnetic waves, which Nobel laureate Richard Feynman called the most outstanding achievement of nineteenth-century science [2].
The finite-difference time-domain (FDTD) method solves Maxwell's equations by directly modeling the propagation of electromagnetic waves within a volume. This numerical modeling technique for electrodynamics was introduced in 1966 by Kane Yee. At first it was almost impossible to implement computationally, due to the lack of computing resources. However, with the advent of more powerful and more affordable modern equipment, and with further improvements in the algorithm, the FDTD method has become a standard tool for solving problems of this type. Today, the method brings together a set of numerical techniques for solving Maxwell's equations in the time domain and allows the electromagnetic analysis of a wide range of problems [3]. Scientists and engineers now use computers to solve these equations and investigate the resulting electromagnetic fields.
1.2. DEFINITION OF THE PROBLEM
Currently, the FDTD algorithm is one of the main techniques used for calculations in computational electromagnetics, and there is a Linux implementation of it called Meep. However, one of the main problems of the method is the time and resources required to solve a problem, especially when the simulation involves three-dimensional structures, because of the heavy processing needed to generate the solution. Today, these problems are tackled more easily thanks to the cluster: a set of powerful computers that function as a single system and can greatly improve performance for massive processing [4]. The drawback of clusters is the cost of acquiring one and the difficulty of managing and configuring it. The "System for creation and management of computer clusters for FDTD simulations with the Meep package on the Elastic Compute Cloud (EC2) service" was created to simplify the administration and deployment of computer clusters on Amazon EC2, to solve FDTD problems, to improve performance, and to reduce the time it takes to build the solution using the cloud computing services offered by the Amazon EC2 platform.
1.3. MAIN OBJECTIVES
The "System for creation and management of computer clusters for FDTD simulations with the Meep package on the EC2 service" was conceived to provide a tool for monitoring and running multiple FDTD simulations. To achieve this, the following objectives were set:
• Provide a public AMI that allows the creation and management of computer clusters for FDTD simulations using distributed processing.
• Integrate a Web tool that monitors the resources used by the clusters and graphs the results.
• Implement a Web interface for managing the public AMI.
1.4. JUSTIFICATION
The main justification for developing the "System for creation and management of computer clusters for FDTD simulations with the Meep package on the EC2 service" is to decrease the time it takes to perform FDTD simulations, with better performance, while monitoring the resources used during the execution of multiple FDTD simulations running in parallel. In this way, the user can check the status of their jobs while they are being solved by the Meep package.
1.5. CLOUD COMPUTING
Cloud computing is a technology that provides computing services over the Internet on a pay-as-you-consume basis. Its operation is based on applications and services hosted externally on the web [5]. Thanks to this technology, everything a computer system can offer is provided as a service, so users can access whatever is available "in the cloud" without being experts in managing the underlying resources. There is no need to know the cloud infrastructure: applications and services are easily scalable and efficient, and the user does not need the details of their operation and installation. Examples include Amazon EC2, Google App Engine, eyeOS, and Microsoft Azure.
1.5.1. Amazon EC2
To implement our tool we use the cloud computing service offered by Amazon EC2, a web service that provides resizable computing capacity in the cloud according to user requirements. It is designed to make web-scale computing easier for developers.
1.5.2. Functionality
EC2 presents a virtual computing environment, allowing us to use web service interfaces to request clusters for use, load our own application environment, manage permissions and network access, and run the image on as many systems as required. It is designed to be used together with other Amazon services, such as Amazon S3, Amazon EBS, Amazon SimpleDB, and Amazon SQS, to provide a complete computing solution. It also provides safety, since it offers numerous security mechanisms such as firewalls, access control, network configuration, and so on.
1.6. FDTD METHOD
The FDTD method simulates the evolution of the electromagnetic field in a region of interest, as well as the effect of changes in its structure. The original formulation proposed by Yee studies the behavior of the electromagnetic field in a vacuum. It also allows simple boundary conditions to be defined, such as electric and magnetic walls, making it possible to apply the method to resonant problems. In this method, the Maxwell equations that describe the evolution in time and space of the magnetic field B and the electric field E, which are partial differential equations, are replaced by a set of finite difference equations. A finite difference is a mathematical expression that approximates the solution of a differential equation. The FDTD technique is based on discretizing the electromagnetic fields in both space and time, and on approximating the partial derivatives that appear in the curl Maxwell equations, expressed in the time domain, by finite difference quotients. The result is an explicit algebraic scheme that computes, at successive instants, the value of the electric (magnetic) field at each point in space from the value of the electric (magnetic) field at the same point at the previous time instant and from the values of the magnetic (electric) field at the adjacent nodes at the preceding half time step. It took nine years until the original FDTD method was suitably modified to solve a scattering problem [6].
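As an illustration of this explicit leapfrog update (this sketch is not part of the original text), the following minimal Python script advances a one-dimensional Ez/Hy field pair in a vacuum with normalized units; the grid size, time step count, Courant factor, and the soft Gaussian source are arbitrary choices made only for the example.

```python
import numpy as np

nx, nt = 200, 500            # grid points and time steps (arbitrary for the example)
courant = 0.5                # normalized dt/dx, kept below 1 for stability
ez = np.zeros(nx)            # electric field samples
hy = np.zeros(nx)            # magnetic field samples, staggered half a cell

for t in range(nt):
    # Magnetic field update from the spatial difference of E (first half step)
    hy[:-1] += courant * (ez[1:] - ez[:-1])
    # Electric field update from the spatial difference of H (second half step)
    ez[1:] += courant * (hy[1:] - hy[:-1])
    # Soft Gaussian source injected at the center of the grid
    ez[nx // 2] += np.exp(-((t - 30) / 10.0) ** 2)

print("max |Ez| after", nt, "steps:", np.abs(ez).max())
```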
1.7. COMPUTATIONAL PARALLELISM
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem. A problem is divided into parts that can be solved simultaneously; each part is broken down into a series of instructions, and the instructions execute simultaneously on different CPUs. The computing resources may be a single computer with multiple processors, an arbitrary number of computers connected by a network, or a combination of the two. A suitable computational problem typically has these features: it can be divided into discrete pieces of work that can be solved simultaneously, it can execute multiple program instructions at any point in time, and it can be solved in less time with multiple computing resources than with a single one. Parallelism has been employed for many years, especially in high-performance computing (HPC) [7]. HPC systems use supercomputers and computer clusters to solve advanced calculations, usually for scientific research [8].
1.7.1. Memory Architecture: Distributed Memory
A distributed memory system comprises several independent processors, each with its own local memory, connected via a network; that is, each processor works on its own memory when performing a process. This type of memory makes it easy to scale up computing power, but it also requires a means of communication between nodes so that they can synchronize. In our project this communication task is handled by the Meep package (Figure 1.1). This type of architecture brings many advantages:
• Each processor can access its memory directly, without interfering with or overburdening the others.
• Scalability in the number of processors depends only on the network.
• There are no cache-coherency problems, because each processor works on its own data and does not have to worry about local copies.
Figure 1.1: Distributed memory architecture.
The main difficulty of this architecture is the communication between processors: if a processor requires information held by another, that information must be sent through messages. The two main sources of overhead are the time needed to build and send a message from one processor to another, and the interruption of the receiving processor to handle messages sent by the other processors [7].
1.8. PARALLEL PROGRAMMING MODELS
We have already described the parallel memory architecture used in our project. In this section we explain the parallel programming model we have implemented. A parallel programming model is usually based on some memory architecture; in our case, the application was developed on the distributed memory architecture. A parallel programming model is simply a set of algorithms, processes, or software tools for building applications, communication systems, and parallel I/O. To implement distributed applications, developers must choose an appropriate parallel programming model, or in some cases a combination of models, that matches the type of problem to be solved. Our tool includes and implements an application based on the message passing interface (MPI) model.
1.8.1. MPI (Message Passing Interface)
Message passing is a technique used in parallel programming to exchange information through communication based on sending and receiving messages. Message passing can be synchronous or asynchronous: when the sender waits for the message to be received before continuing execution, it is synchronous; when the sending process does not wait and continues its execution, it is asynchronous. Transferring information requires synchronization between processes to obtain good performance. Its main feature is that each process uses its own local memory during execution; there is no shared memory. MPI has become the standard for communication between the nodes running a particular problem within a distributed system (Figure 1.2).
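As a hedged illustration (not taken from this book), blocking point-to-point message passing looks roughly like the sketch below. It assumes the mpi4py Python bindings, which are not part of the tool chain described here, and would be launched with a command such as `mpirun -np 2 python ping.py`.

```python
from mpi4py import MPI  # assumed MPI binding; the underlying MPI library does the transport

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Process 0 builds some data and performs a blocking send
    payload = {"step": 1, "field": [0.0, 0.5, 1.0]}
    comm.send(payload, dest=1, tag=11)
    print("rank 0 sent", payload)
elif rank == 1:
    # Process 1 blocks until the message from process 0 arrives
    payload = comm.recv(source=0, tag=11)
    print("rank 1 received", payload)
```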
Figure 1.2: Sending messages via MPI step between two computers.
Chapter 2
Creating and Managing Clusters for FDTD Computational Simulations with Meep Package on the EC2 Service for Amazon Web Services
CONTENTS 2.1. Starcluster ......................................................................................... 10 2.2. Meep Parallel Package ...................................................................... 11 2.3. GWT – Google Web Toolkit .............................................................. 12 2.4. Ganglia............................................................................................. 12 2.5. Architecture ...................................................................................... 14 2.6. Creating And Configuring A Public Ami ............................................ 16 2.7. Efficiency Test (FDTD Problem) ......................................................... 22 2.8. Efficiency Tests (Using Amazon EC2 Platform) ................................... 22 2.9. Analysis Of The Results ..................................................................... 31
In this chapter, we describe the tools used to develop the project and explain the important concepts and features each of them provides.
2.1. STARCLUSTER
StarCluster is a utility for creating, managing, and monitoring computer clusters hosted on the Amazon Elastic Compute Cloud (EC2) service, all through a master instance. Its main objective is to minimize the administrative overhead associated with configuring and using computer clusters in research laboratories or in general applications that use distributed computing. To use this tool, a configuration file is created with the account information for Amazon Web Services (AWS), the type of AMI to use, and any additional features we want configured in the cluster. The tool is then driven by running commands [10].
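As a hedged illustration (not reproduced from this book), such a configuration file and the commands that drive it might look roughly like the excerpt below. The section and option names follow StarCluster's documented format as best recalled, and every identifier, key path, and AMI ID is a placeholder; the official StarCluster documentation is the authoritative reference.

```
# ~/.starcluster/config -- illustrative excerpt only; all values are placeholders
[global]
DEFAULT_TEMPLATE = fdtdcluster

[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>
AWS_USER_ID = <your-account-id>

[key fdtdkey]
KEY_LOCATION = ~/.ssh/fdtdkey.rsa

[cluster fdtdcluster]
KEYNAME = fdtdkey
CLUSTER_SIZE = 4
NODE_IMAGE_ID = ami-xxxxxxxx      # e.g., the public AMI described in Section 2.6
NODE_INSTANCE_TYPE = m1.small

# Typical commands (names as documented by StarCluster):
#   starcluster start fdtdcluster      -- create and configure the cluster
#   starcluster sshmaster fdtdcluster  -- open a shell on the master node
#   starcluster terminate fdtdcluster  -- shut everything down
```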
2.1.1. StarCluster Features
• Operated through commands that create, manage, and monitor one or more clusters on EC2.
• Provides a public AMI pre-configured with everything needed to use StarCluster.
• Supports the cloud storage services Elastic Block Storage (EBS) and Simple Storage Service (S3), also offered by Amazon.
• The public AMI includes tools such as OpenMPI, ATLAS, LAPACK, NumPy, and SciPy.
• All nodes in the cluster are automatically configured with the NFS, SGE, and OpenMPI services.
For our project we use the NFS and OpenMPI services, so we detail each of them below.
2.1.2. Network File System (NFS)
NFS allows different clients connected to the same network to access shared remote files as if they were part of their local file system. The protocol works in a client/server environment: the server indicates which directories it wants to share, and the clients mount those directories into their own file systems. In distributed processing, this can be used to provide a shared file environment (Figure 2.1).
Figure 2.1: Access files shared via NFS.
2.1.3. OpenMPI
OpenMPI is a project that combines technologies and resources from other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) to build an MPI library. It is an open-source implementation of the MPI-1 and MPI-2 standards [11].
2.2. MEEP PARALLEL PACKAGE
Meep is a simulation software package developed to model electromagnetic systems. It implements the finite-difference time-domain (FDTD) algorithm of computational electromagnetics. The algorithm divides space into a mesh and follows how the fields change over time using discrete time steps; the solution of the continuous equations becomes essentially approximate, and in this way many practical problems can be simulated. The parallel Meep package provides support for distributed memory parallelism and works on very large problems (such as problems in 3D space), so they can be solved in a distributed manner. The problem must be large enough to benefit from many processors [12]. To achieve this, parallel Meep divides the computational cell of the simulation into "chunks" that are allocated among the processors. Each "chunk" is advanced in time steps, and the processors are responsible for communicating the boundary values using MPI.
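As a hedged illustration (not shown in the book's own listings), a CTL control file is typically run in parallel with an MPI launcher along these lines; the name of the parallel Meep executable varies between installations (for example meep-openmpi or meep-mpi), so treat it as an assumption.

```
# Launch the FDTD simulation on 8 MPI processes; the executable name is an assumption
mpirun -np 8 meep-openmpi ring.ctl > ring.out
```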
2.3. GWT – GOOGLE WEB TOOLKIT
2.3.1. Introduction
GWT, or Google Web Toolkit, is a framework created by Google that facilitates the use of AJAX technology. It solves the big problem of client-code compatibility (HTML, JavaScript) between browsers, enabling the developer to build an application without having to test it in every browser. The concept of Google Web Toolkit is quite simple: basically, you write the code in Java using any Java development environment (IDE), and the compiler translates it to HTML and JavaScript.
2.3.2. GWT Platform
GWT has four main components: a Java-to-JavaScript compiler, a hosted web browser, and two class libraries:
• GWT Java-to-JavaScript Compiler: translates the code developed in Java into the JavaScript language.
• Hosted Web Browser: runs the Java application without translating it into JavaScript, using the Java Virtual Machine in hosted mode.
• JRE Emulation Library: JavaScript implementations of the most commonly used Java class libraries, such as java.lang, java.util, etc.
• GWT Web UI Class Library: a set of user interface (UI) elements that allows the creation of objects such as labels, text boxes, images, and buttons [13].
2.4. GANGLIA
2.4.1. Introduction
Monitoring a computer cluster requires proper management of resources; with this information the administrator can spend less time detecting, investigating, and troubleshooting failures, and can also prepare a contingency plan.
Ganglia is a scalable distributed system for monitoring computer clusters and grids in real time. It is a robust implementation that has been adapted to many different computer architectures and operating systems, and it is currently used in thousands of clusters around the world, including universities, laboratories, and government and business research centers. Ganglia was initially developed by the computer science department of the University of California, Berkeley, to link clusters across campus, and is currently hosted on SourceForge. It is based on a hierarchical clustering scheme and is configured through XML and XDR files on each node, which gives it extensibility and portability. It is completely open source and does not contain any proprietary component. Ganglia also links clusters to one another, a form of distributed computing known as "cluster to cluster" [14].
2.4.2. Functioning
Ganglia is organized in a hierarchical scheme. It relies on a multicast send/receive protocol to monitor the status of the cluster, and uses a tree of point-to-point connections between cluster levels so that nodes can report their status. Ganglia uses status messages in a multicast environment as the basis of its communication protocol. To maintain communication, each node sends its state at a regular time interval; in this way it announces that it is active, and when a node stops sending, it is no longer included in the monitoring.
Figure 2.2: Functional diagram of ganglia.
In addition, each node monitors local resources and sends multicast packets with status information each time an update occurs. All nodes in the same cluster always have an approximate view of the entire cluster state, and this state is easily reconstructed if a collapse occurs (Figure 2.2).
2.4.3. Ganglia Architecture
The main components of Ganglia are two daemons, gmond and gmetad. The Ganglia Monitoring Daemon (gmond) is the cornerstone of the tool: a multi-threaded daemon that runs on each of the cluster nodes to be monitored. Installation is very easy; there is no need for a common NFS file system or a database, since gmond has its own distributed database and its own redundancy. The Ganglia Meta Daemon (gmetad) obtains the information via XML from the nodes at regular intervals, stores it in a round-robin database (RRD), and concatenates the XML from the nodes to share the information with the web server or other front end that runs alongside gmetad. Another major component is the integrated web application. Ganglia uses a round-robin database (RRD) to store and query historical information for the cluster, presenting metrics over time. RRDtool is a popular system for storing and graphing time-series data, using a format specially designed for compact storage of time series. RRDtool generates graphs showing the trend of each metric versus time; these graphs are then exported and displayed in the front end. The front-end template can be customized or used as provided by default. Ganglia separates content from presentation: the content is an XML file that, if desired, can be accessed directly by other applications [14].
2.5. ARCHITECTURE
2.5.1. Input Files
The input files for the application are CTL (Control Type Language) files, which are used by the Meep package to perform the simulations. A CTL file specifies the geometry of the problem, the sources used, the outputs, and everything else necessary to perform the calculation. It is written in a scripting language and allows us to define the structure of a problem as a sequence of statements.
The CTL file belongs to the libctl library, a set of tools based on the Scheme language. A CTL file can be written at any of three levels:
• Scheme, a programming language developed at MIT. It follows the form (function arguments...) and is executed by the GNU Guile interpreter;
• libctl, a library on top of Guile that simplifies communication between Scheme and scientific computing software. Libctl defines the basic interface and a host of useful features; and
• Meep, which builds on libctl and defines all the interfaces specific to FDTD calculations.
2.5.2. Web Application "StarMeep"
The "StarMeep" web application improves the management of the computational cluster for the user, since it provides benefits such as:
• availability through a web browser;
• configuration through forms rather than through a console;
• display of the calculation progress of an electromagnetic simulation; and
• access to the simulation result files.
The development is based on the Java programming language together with the GWT framework for the front-end application. GWT allows us to create a simple UI and to make requests to the server via AJAX technology. The back end is the main access point for sending commands to the instances or nodes of the cluster; this is done over the SSH protocol, using the JSch library, which lets you create connections such as SSH, SCP, and others from Java.
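The back end described here uses JSch from Java. Purely as an illustration of the same idea (not the author's implementation), issuing a command on the master node over SSH can be sketched in Python with paramiko, the SSH2 library that StarCluster itself depends on; the host name, key path, and remote command below are placeholders.

```python
import paramiko  # SSH2 implementation; listed later as a StarCluster dependency

def run_on_master(host, key_file, command):
    """Open an SSH connection to the cluster master and run a single command."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="root", key_filename=key_file)
    try:
        _, stdout, stderr = client.exec_command(command)
        return stdout.read().decode(), stderr.read().decode()
    finally:
        client.close()

# Placeholders: the real host, key, and simulation command come from the cluster setup
out, err = run_on_master("ec2-xx-xx-xx-xx.compute-1.amazonaws.com",
                         "/home/user/fdtdkey.rsa",
                         "mpirun -np 4 meep-openmpi /home/shared/ring.ctl")
print(out or err)
```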
2.5.3. Monitoring
The resources used by the cluster nodes are monitored with the Ganglia tool. Usage information is sent from the cluster in XML files and processed by the tool for viewing. Through its web front end, Ganglia gives administrators and cluster users a real-time graphical view of the resources being consumed.
2.5.4. Master Node and Slave Nodes
When computing an FDTD simulation, the cluster nodes are responsible for processing the input file in parallel through the Meep package. The simulation is divided into tasks of equal size, and each task is delivered to a node for processing. The nodes work on their tasks synchronously, communicating with each other via the MPI protocol. Once the tasks are completed, the results are written through the HDF5 library and placed in a shared NFS directory of the cluster. This directory may live inside an instance, or storage such as Amazon S3 can be used. Information on resource usage within the cluster is collected by a gmond daemon running on each node; the nodes to be monitored are specified in the gmetad daemon found on the master node. Finally, the simulation results and the resource information are accessed through the master node.
2.5.5. Output File Storage
The output files generated by a simulation are in HDF5 format, a standard format used by many scientific visualization tools such as MATLAB, GNU Octave, and others. Apart from the HDF5 files, we also find log files that tell us whether there were any problems at runtime and report the status of the simulation through its time steps. The results generated by the simulation can be stored in the same way, using an EBS block volume or AWS S3 storage.
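As an illustration (not part of the original text), an HDF5 result file can also be inspected directly from Python with the h5py library; the file and dataset names below are hypothetical, since the actual names depend on the simulation that produced them.

```python
import h5py  # generic HDF5 reader, independent of the visualization tools named above

# Hypothetical file and dataset names -- Meep's output naming depends on the run
with h5py.File("ring-ez.h5", "r") as f:
    print("datasets:", list(f.keys()))   # see what the simulation actually wrote
    ez = f["ez"][...]                    # load one field dataset into memory
    print("field shape:", ez.shape, "max |Ez|:", abs(ez).max())
```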
2.6. CREATING AND CONFIGURING A PUBLIC AMI
This section explains the installation and configuration of the various applications used by our tool. Details of these applications were mentioned in Chapter 1.
2.6.1. Installing StarCluster
To install this tool we start from the AMI provided by the StarCluster community, which already contains services and dependencies that speed up the installation. The AMI identifier is specified on the official StarCluster website, and we use it to launch a corresponding instance. The instance has the following characteristics:
• 1.7 GB of RAM;
• 2 virtual cores at 2.5 GHz;
• 350 GB hard drive;
• Ubuntu 9.04, 32-bit.
Before installing StarCluster, we describe some of its most important dependencies:
• Python (2.4): an interpreted programming language in which a program can be divided into modules for reuse in other programs.
• boto (1.9b+): a Python module for handling the current and future infrastructure services offered by AWS.
• paramiko (1.7.6+): another Python module, which implements the SSH2 protocol (encryption and authentication) for connections to remote computers.
To install StarCluster, we downloaded the latest development version from its Git repository (Git is a software version management system), then compiled and installed it via Python. The steps were the following (a hedged sketch of the corresponding commands is given after this list):
• download the StarCluster installer from the repository;
• once downloaded, move into the StarCluster directory that was created;
• compile and install StarCluster with Python; and
• once the installation is complete, check that it runs.
When run for the first time, StarCluster prompts for a configuration file; it will not be created here, but through the StarMeep web application discussed later.
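The original command listing is not reproduced on these pages; as a hedged sketch, the four steps above usually reduce to something like the following, where the repository URL and the use of sudo are assumptions.

```
# Illustrative only -- URL and privileges are assumptions, not the book's listing
git clone https://github.com/jtriley/StarCluster.git
cd StarCluster
sudo python setup.py install
starcluster --help     # verify that the command-line tool runs
```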
2.6.2. Meep-OpenMPI Package Installation
Before installing the Meep package, we review the main components it requires:
• guile-1.8-libs: Guile is an implementation of Scheme (a functional programming language) designed for scripting and embedding.
• libctl3: the implementation of a free library based on Guile, used for scientific simulations.
• libhdf5: a library that lets Meep write its output in HDF5 format, so that it can later be processed as the user requires.
• libmeep-openmpi2: the library that allows Meep to solve FDTD problems in parallel using OpenMPI.
The installation of Meep-OpenMPI uses what the Python-Meep project provides. The installation was performed as follows (a hedged sketch of the commands appears after this list):
• First, add to the Ubuntu repository list the two new addresses from which the Meep-OpenMPI package is downloaded; the repository file, found at "/etc/apt/sources.list," is modified with an editor.
• Then update the Ubuntu package index.
• After updating the repository, install the two additional packages needed by Meep.
• Finally, install the Meep-OpenMPI package.
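The original command listing is likewise not reproduced here; as a rough sketch, the update and install steps normally reduce to standard apt commands such as the following. The package names are assumptions inferred from the libraries listed above, and the added repository addresses are omitted because they cannot be recovered from this text.

```
# Package names are assumptions; the extra repository lines are not shown here
sudo apt-get update
sudo apt-get install libctl3 libhdf5-serial-dev   # supporting libraries (assumed names)
sudo apt-get install meep-openmpi                 # parallel Meep package (assumed name)
```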
2.6.3. Installing Apache
This server hosts the Ganglia web application, which monitors the hardware resources of the running nodes.
For the installation we must do the following:
• First install PHP version 5 with its respective modules, so that Apache can serve applications written in this language; this is done from the terminal.
• Apache itself is then installed by running the corresponding installation command.
• If changes to the web server configuration are needed, they can usually be made in "/etc/apache2/apache2.conf"; we leave the default settings.
2.6.4. Apache-Tomcat Installation
The Apache Tomcat web server supports applications written in Java. It hosts the StarMeep application, which helps with the administration and execution of FDTD jobs. The installation consists of the following steps:
• Download the Apache Tomcat installer from the official website, open a terminal, and run the download command.
• Unzip the downloaded file to generate the Apache Tomcat directory.
• Move the generated directory to the installation directory; in this case we chose to place it in "/usr/local."
• Finally, create an executable file "tomcat.sh" that can start or stop the server and place it in "/etc/init.d/" so that Apache Tomcat is brought up whenever a node is started.
2.6.5. Installing and Configuring Ganglia
To install and configure the Ganglia monitor we performed the following steps.
• First, install the package dependencies that allow Ganglia to work properly; this is done from a terminal.
• Once the dependencies are installed, download the installer from the official Ganglia website.
• The downloaded installer is compressed, so decompress it.
• Decompression automatically generates a directory named "ganglia-3.1.7"; switch to it and run the command that configures the installer. When configuring the installer we pass the parameter "--with-gmetad," which additionally installs the gmetad daemon. We also pass as a parameter the address where we want the installation directory to be created, in this case "/etc/ganglia." With the parameters set, we generate the installer and execute it.
• Ganglia also provides an application for monitoring resources from the web. For this we must move the "web" directory of the Ganglia package into the "/var/www" directory of the Apache server, and then simply rename that folder to "ganglia" so the application can be reached under that name.
• Once the installation is complete, configure the gmetad and gmond daemons. For gmond, generate the default configuration file with the corresponding command; as recommended, only the information in the "cluster" section of the gmond configuration file needs to be changed.
• Next, configure the gmetad daemon; here we also use a template that comes with the Ganglia installation package and move it to the Ganglia installation directory.
• The gmetad configuration file is where we add the names of the nodes we want to monitor. This file is edited by the StarMeep web application, since it manages launching and running the nodes (an illustrative excerpt follows this list). Finally, we define the directory where the RRD files required by the Ganglia web monitor are created, and change their owner to the "ganglia" user so that it can read them.
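As a hedged illustration (not reproduced from the book), the node list and RRD directory mentioned in the last step are normally set in gmetad.conf with directives along these lines; the cluster name, host names, and port are placeholders.

```
# /etc/ganglia/gmetad.conf -- illustrative excerpt; names and port are placeholders
data_source "starmeep-cluster" master:8649 node001:8649
rrd_rootdir "/var/lib/ganglia/rrds"
```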
2.7. EFFICIENCY TEST (FDTD PROBLEM)
For this test we took an example FDTD problem to be solved with Meep. When run on a single machine, it terminates with an error.
This happens when a problem requires a high degree of processing, such as large structures or higher-resolution images that involve heavy memory usage. It severely limits users, since they always depend on the speed and memory of a single machine. With our tool these problems are solved, because the problem is divided into smaller jobs among the nodes, minimizing the processing and memory required on each one.
2.8. EFFICIENCY TESTS (USING THE AMAZON EC2 PLATFORM)
The problems listed below were solved on the Amazon EC2 platform using different numbers of nodes, in order to demonstrate the improvement offered by our tool. We tested three different types of problems, detailed in the following subsections.
2.8.1. Ring Resonator
One of the common tasks in FDTD simulations is to examine the behavior of the electromagnetic field of an object when it is excited by a power source. Solving this type of problem produces output data that is post-processed to generate an image showing how the field behaves. The objective of this exercise is the aforementioned "ring resonator," which is merely a waveguide bent into a circle. The result is an animated image in which the resonances of the electromagnetic field can be appreciated. The structure of the ring resonator example is given by the following steps:
• Define the parameters to be used in the problem.
• Draw the geometry of the problem, namely the ring that resonates; this is done with two cylinder objects, one defined as dielectric material and the other as air, so that together they form a ring.
• Define the Gaussian pulse source that makes the ring resonate and thereby alters its electromagnetic field.
• Finally, run the simulation. The basic idea is to run until the sources have finished and then add extra time during which we record the signals and produce images of how the electromagnetic field behaves.
Table 2.1 shows the data of the geometric structure (a hedged sketch of such a simulation script follows the table).
Table 2.1: Geometrical Data Structure Ring Resonator

Waveguide index ..................................................... 3.4
Waveguide width (micron) ............................................ 1
Inner radius of the ring (micron) ................................... 1
Space between the waveguide and PML layer (micron) .................. 4
PML thickness (micron) .............................................. 2
Pulse width ......................................................... 0.15
Pulse rate .......................................................... 0.1
Resolution .......................................................... 40
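Purely as an illustration of these steps (the book's own control file is written in Scheme/CTL and is not reproduced here), a roughly equivalent script written against the Meep Python interface might look like the sketch below. It follows the standard Meep ring-resonator tutorial; the variable names are mine, the values loosely follow Table 2.1, and a coarser resolution is used to keep the sketch quick.

```python
import meep as mp  # the Meep Python interface, assumed to be available

n, w, r = 3.4, 1.0, 1.0       # waveguide index, width, inner ring radius (Table 2.1)
pad, dpml = 4.0, 2.0          # padding to the PML and PML thickness (Table 2.1)
sxy = 2 * (r + w + pad + dpml)

# Two concentric cylinders: a dielectric ring with an air core
geometry = [mp.Cylinder(radius=r + w, material=mp.Medium(index=n)),
            mp.Cylinder(radius=r)]

fcen, df = 0.15, 0.1          # Gaussian pulse center frequency and width
sources = [mp.Source(mp.GaussianSource(fcen, fwidth=df),
                     component=mp.Ez, center=mp.Vector3(r + 0.1))]

sim = mp.Simulation(cell_size=mp.Vector3(sxy, sxy),
                    geometry=geometry,
                    sources=sources,
                    boundary_layers=[mp.PML(dpml)],
                    resolution=10)   # Table 2.1 uses 40; 10 keeps the example fast

# Run until the source has finished, then 300 more time units while writing Ez snapshots
sim.run(mp.at_beginning(mp.output_epsilon),
        mp.to_appended("ez", mp.at_every(0.6, mp.output_efield_z)),
        until_after_sources=300)
```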
After executing the problem, the result is post-processed with the corresponding commands. Running the problem with five different numbers of nodes produced five results; the relevant comparison was made to verify their validity, and all of them turned out to be exactly the same. Thus, no matter how many nodes are used, the result is the same.
Figure 2.3 shows the final result of the exercise.
Figure 2.3: Result from the post-processing problem resonance ring.
Next, we observe the time it took to complete the exercise with each number of nodes (see the graph in Figure 2.4).
Figure 2.4. Graphic node vs. time (minutes) exercise resonance ring.
As we can see in Figure 2.4, we executed the problem five times with different numbers of nodes, and the time improves when the exercise is solved with more nodes. This happens only up to a point; in this case, with 9 nodes the time goes back up.
2.8.2. 3D Simulation of an Optical Ring Resonator for the Transmission Spectrum
Another very common FDTD task is to study the transmission spectrum of the electromagnetic flux. The following exercise shows how the resonance is obtained from the transmission spectrum in a system of two waveguides and a ring. One guide is excited by a power source and transfers this energy to the ring through resonance, which in turn can transfer energy to the other waveguide. To do this, the exercise has to be run twice: the first time with the ring present and the second time without the ring. The results are then post-processed. Figure 2.5 illustrates the system.
Figure 2.5: Graphic simulation transmission ring.
The result of the problem is three graphs showing the behavior of the transmission spectrum. The structure of the exercise code is the following:
• Define the parameters or variables to be used in the problem.
• Draw the geometry of the problem, namely the two waveguides defined as dielectric material and, when required, the dielectric resonator ring.
• Define the Gaussian pulse source that brings the ring into resonance.
• Write the output condition, in this case when the pulse has turned off and 150 further time units have elapsed, and then define what we want to compute, in this case the flux spectrum.
Table 2.2 shows the data of the geometric structure.
Table 2.2: Data Optical Ring Resonator Geometric Structure

Refractive index .................................................... 3.03
Refractive index substrate .......................................... 1.67
Inner radius of the ring (microns) .................................. 2
Ring outer radius (microns) ......................................... 2.5
Width of the waveguide (microns) .................................... 0.55
Along the waveguide (microns) ....................................... 0.405
Space between the waveguides and the ring (microns) ................. 0.2
Distance from the substrate relative to the guide (microns) ......... 0.75
Substrate width (microns) ........................................... 0.6
PML thickness (microns) ............................................. 1
As mentioned, the problem must be run twice; each run generates an output file containing the result lines.
We then tabulate these results for post-processing, as shown in Table 2.3.

Table 2.3: Example Tabulation of Results, Optical Ring Resonator Spectral Flux

Frequency             Input port               Pass port                Extraction port
0.36                  1.87190375615421e-8      5.33633135330676e-9      2.25302194395918e-10
0.360200100050025     1.91956223787719e-8      5.74839347857691e-9      2.32817150153588e-10
Since there are two executions, at the end we obtain two tables like the one shown above. The data in each column of the run with the ring, except the frequency, is then divided by the corresponding data of the run without the ring. This gives a single table, from which we plot each spectral flux against frequency, as shown in Figure 2.6 (a small sketch of this normalization is given below).
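As an illustration only (the book's own post-processing commands are not reproduced here), the normalization just described can be sketched as follows; the file names and column layout are assumptions.

```python
import numpy as np

# Hypothetical files with columns: frequency, input, pass, extraction
with_ring = np.loadtxt("flux_with_ring.csv", delimiter=",")
no_ring = np.loadtxt("flux_without_ring.csv", delimiter=",")

normalized = with_ring.copy()
# Divide every flux column by the ring-less run; leave the frequency column untouched
normalized[:, 1:] = with_ring[:, 1:] / no_ring[:, 1:]

np.savetxt("flux_normalized.csv", normalized, delimiter=",")
```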
Figure 2.6: Spectrum vs. frequency (input port).
From Figure 2.6, we can observe the behavior of the transmission spectrum between the source and the first waveguide. The graph indicates that most of the fluctuation occurs around the unit value, between 0.3 and 0.5 (see also Figure 2.7).
Figure 2.7: Spectrum vs. frequency (pass port).
In Figure 2.7, we can observe the behavior of the spectrum during the transmission between the first waveguide and the ring; the valleys we see are due to the resonance of the ring. We must also take numerical errors into account: in this case, between 0.3 and 0.4 several spurious fluctuations appear in the response curve (Figure 2.8).
Figure 2.8: Spectrum vs. frequency (extraction port).
From Figure 2.8, we can observe the behavior of the spectrum at the extraction port: once the source stops generating power, energy begins to be extracted from the ring. As before, running the problem with different numbers of nodes produced five results, all of which generated exactly the same values. Finally, we look at the times in which the problem completed; since the execution is in two steps, we get the two graphs shown in Figure 2.9.
Figure 2.9: Nodes vs. time (minutes) for the transmission exercise with the ring.
As we can see in Figure 2.9, each problem was executed five times with different numbers of nodes. The execution with the ring takes a little longer than the execution without the ring; the interesting point, however, is that the proportion by which the time decreases as more nodes are used is very similar in the two cases.
2.8.3. Calculation of Resonance Frequency and Quality Factor of a 3D Resonant Ring
After seeing how the electromagnetic field behaves through a graph and calculating the transmission spectrum to find resonances, it is also important to find numerical values that identify the frequencies and decay rates, as well as the quality factor. This can be done with the harmonic inversion method provided by the Harminv package. The aim of this exercise is to process the electromagnetic signals to calculate the resonance frequencies and quality factor of the system used previously. For this problem we again study the two existing cases: the system with the ring and the ringless system. The result of the problem is a set of values that indicate the resonance frequency, the quality factor Q, the amplitudes, and the margin of error. The structure of the code and the data are the same as in the previous exercise; the only change is that at the end, instead of obtaining the flux spectra, we process the signals. For this, Harminv receives as parameters the electric field component Ez, the position to analyze, and the frequency range. In both cases (with ring and without ring), the run generates an output file showing all the processing steps and the final result (the Harminv values). The final result is then tabulated; Table 2.4 shows how the data is tabulated, and this is done with the results of both runs.
Table 2.4: Example Tabulation of Results (Harminv)
Freq. Real   Freq. Imaginary   Q          |Amp|   Amplitude             Error
harminv0:   0.4621    1.76E-04    -1329.10   0.002   -0.0095 - 0.0012i     2.69E-04
harminv0:   0.4935   -0.0016       149.42    0.049    0.017 + 0.04874i     3.63E-05
harminv0:   0.5065   -5.20E-04     490.13    0.065   -0.037 - 0.05496i     1.40E-05
harminv0:   0.5189   -0.0027        94.93    0.059    0.0519 + 0.013851i   1.15E-04
harminv0:   0.5225   -3.66E-04     723.34    0.134    0.06928 + 0.11025i   2.31E-05
Processing the tabulated data, we find the frequencies at which resonance occurs; however, not all values should be accepted, because there is a margin of error to consider. To find out whether each frequency value is reliable, a comparison is performed: the absolute value of the imaginary frequency must be greater than the margin of error. In Table 2.5, we mark which values are candidates to be correct (a small filtering sketch is given after the table).
Table 2.5: Example Tabulation of Results (Harminv)
Freq. Real   Freq. Imaginary   Q          |Amp|   Amplitude             Error
harminv0:   0.46    1.76E-04    -1329.10   0.002   -0.0095 - 0.0012i     2.69E-04
harminv0:   0.49   -0.0016       149.42    0.049    0.017 + 0.04874i     3.63E-05
harminv0:   0.50   -5.20E-04     490.13    0.065   -0.037 - 0.05496i     1.40E-05
harminv0:   0.51   -0.0027        94.93    0.059    0.0519 + 0.013851i   1.15E-04
harminv0:   0.52   -3.66E-04     723.34    0.134    0.06928 + 0.11025i   2.31E-05
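As a small sketch of the validity check described above (|imaginary frequency| greater than the reported error), using the values tabulated in Table 2.5 and omitting the complex amplitudes for brevity:

    # Each row: (freq_real, freq_imag, Q, |amp|, error).
    rows = [
        (0.46,  1.76e-4, -1329.10, 0.002, 2.69e-4),
        (0.49, -0.0016,    149.42, 0.049, 3.63e-5),
        (0.50, -5.20e-4,   490.13, 0.065, 1.40e-5),
        (0.51, -0.0027,     94.93, 0.059, 1.15e-4),
        (0.52, -3.66e-4,   723.34, 0.134, 2.31e-5),
    ]

    # A mode is a credible resonance candidate when |Im(freq)| > error.
    candidates = [r for r in rows if abs(r[1]) > r[4]]

    for freq_real, freq_imag, q, amp, err in candidates:
        print(f"freq = {freq_real}, Q = {q}, |amp| = {amp}")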
Reviewing the results generated by the problem for the different numbers of nodes used, we can verify that the values obtained are the same. Finally, we note the time in which the problem completed; since it consists of two parts (one with the ring and one without), we obtain two graphs (Figures 2.10 and 2.11).
Figure 2.10: Nodes vs. time (minutes) for the transmission exercise with the ring, using Harminv.
Figure 2.11: Nodes vs. time (minutes) for the transmission exercise without the ring, using Harminv.
As we can see in Figures 2.10 and 2.11, running each problem five times with various numbers of nodes, we notice that in each case the time decreases as more nodes are used.
2.9. ANALYSIS OF THE RESULTS
Each graph shows the performance behavior of the exercises run with StarMeep. In general, it is observed that a larger number of nodes gives a shorter processing time. However, how effectively a given number of nodes is used depends on the type of problem solved and on the output files generated as a result. For the type of problem, several factors are influential, the most important being the desired resolution of the output files and the size of the structure to be simulated. For the output files, what matters is the amount of information, in gigabytes, that the problem requires, because writing to disk is usually time-consuming.
Chapter 3
Parallel Algorithm Designed by Technique “PCAM”
CONTENTS 3.1. Partition ............................................................................................ 34 3.2. Domain Decomposition ................................................................... 34 3.3. Functional Decomposition................................................................ 35 3.4. List Partitions Design ......................................................................... 36 3.5. Communication ................................................................................ 36 3.6. Agglomeration .................................................................................. 37 3.7. Reducing Costs of Software Engineering ........................................... 40 3.8. Load Balancing Algorithms ............................................................... 41 3.9. Task Scheduling Algorithms............................................................... 41 3.10. Allocation List Design ..................................................................... 42 3.11. Model of The Atmosphere ............................................................... 42 3.12. Agglomeration ................................................................................ 45 3.13. Load Distribution ............................................................................ 47
In order to design a parallel algorithm, we first have to turn the problem into an algorithm that exhibits concurrency, scalability, and locality, bearing in mind that such algorithms are not simple because they require integrative thinking. However, a methodical approach maximizes the range of options considered, provides mechanisms for evaluating alternatives, and avoids the high cost of bad early decisions. At each stage, the concept is presented with a brief description and a corresponding example.
3.1. PARTITION
The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield a fine-grained decomposition of a problem. A fine-grained decomposition, just as fine sand is easier to pour than a pile of bricks, provides the greatest flexibility in terms of possible parallel algorithms. In later design stages, evaluation of communication requirements, the target architecture, or software engineering issues may lead us to forgo many of the opportunities for parallel execution identified at this stage.
A good partition divides into small pieces both the computation associated with a problem and the data on which this computation operates. When designing a partition, programmers most commonly focus first on the data associated with a problem, then determine an appropriate partition for the data, and finally work out how to associate computation with data. This partitioning technique is known as domain decomposition. The alternative approach, first decomposing the computation to be performed and then dealing with the data, is termed functional decomposition. These are complementary techniques that may be applied to different components of a single problem, or even applied to the same problem to obtain alternative parallel algorithms.
3.2. DOMAIN DECOMPOSITION
In the domain decomposition approach to partitioning a problem, we seek first to decompose the data associated with the problem. If possible, these data are divided into small pieces of approximately equal size. Next, the computation to be performed is partitioned, typically by associating
each operation with the data on which it operates. This partitioning yields a number of tasks, each comprising some data and a set of operations on that data. An operation may require data from several tasks; in that case, communication is required to move data between tasks. This requirement is addressed in the next phase of the design process.
The data that are decomposed may be the input to the program, the output computed by the program, or intermediate values maintained by the program. Different partitions may be possible, based on different data structures. Good rules of thumb are to focus first on the largest data structure or on the data structure that is accessed most frequently. Different phases of the computation may operate on different data structures or may demand different decompositions of the same data structures. In this case, we treat each phase separately and then determine how the decompositions and parallel algorithms developed for each phase fit together. The issues that arise in this situation are discussed in Chapter 4. A small sketch of the idea follows.
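A minimal sketch of domain decomposition: the data are split into contiguous pieces of nearly equal size, one per task, and the same operation is associated with each piece (the update function is a placeholder):

    def decompose(data, num_tasks):
        # Split `data` into `num_tasks` contiguous pieces of nearly equal size.
        n = len(data)
        base, extra = divmod(n, num_tasks)
        pieces, start = [], 0
        for t in range(num_tasks):
            size = base + (1 if t < extra else 0)
            pieces.append(data[start:start + size])
            start += size
        return pieces

    def update(piece):
        # Placeholder for the computation associated with each piece of data.
        return [x * 2 for x in piece]

    tasks = decompose(list(range(10)), num_tasks=3)   # three tasks over ten data items
    results = [update(piece) for piece in tasks]      # each task operates only on its own data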
3.3. FUNCTIONAL DECOMPOSITION
Functional decomposition represents a different and complementary way of thinking about problems. In this approach, the initial focus is on the computation to be performed rather than on the data manipulated by the computation. If we succeed in dividing this computation into disjoint tasks, we proceed to examine the data requirements of these tasks. These data requirements may be disjoint, in which case the partition is complete. Alternatively, they may overlap significantly, in which case considerable communication is required to avoid replication of data. This is often a sign that a domain decomposition approach should be considered instead.
While domain decomposition forms the basis for most parallel algorithms, functional decomposition is valuable as a different way of thinking about problems. For this reason alone, it should be considered when exploring possible parallel algorithms. A focus on the computations to be performed can sometimes reveal structure in a problem, and hence opportunities for optimization, that would not be obvious from a study of the data alone.
Consider, as an example of a problem for which functional decomposition is most appropriate, an algorithm that explores a search tree looking for nodes that correspond to "solutions." The algorithm has no obvious data structure that can be decomposed. However, a fine-grained partition can be obtained as described next.
Initially, a single task is created for the root of the tree. A task evaluates its node and then, if that node is not a leaf, creates a new task for each child (subtree), as sketched below.
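A minimal sketch of this scheme, assuming the tree is represented by nested (value, children) tuples and using a thread pool to stand in for dynamically created tasks:

    from concurrent.futures import ThreadPoolExecutor

    def is_solution(value):
        # Placeholder test for a "solution" node.
        return value % 7 == 0

    def search(node, pool):
        # Evaluate a node; if it is not a leaf, create a new task for each subtree.
        value, children = node
        found = [value] if is_solution(value) else []
        futures = [pool.submit(search, child, pool) for child in children]
        for f in futures:
            found.extend(f.result())
        return found

    # Note: waiting on child futures from inside pool tasks can exhaust workers
    # for very deep trees; a real implementation would use a work-stealing scheduler.
    tree = (1, [(7, []), (3, [(14, []), (5, [])])])
    with ThreadPoolExecutor() as pool:
        print(search(tree, pool))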
3.4. LIST PARTITIONS DESIGN
The partitioning phase of a design should produce one or more possible decompositions of a problem. Before assessing communication requirements, the following checklist is used to ensure that the design has no obvious flaws. In general, all these questions should be answered in the affirmative.
1. Does your partition define at least an order of magnitude more tasks than there are processors in the target computer? If not, you have little flexibility in subsequent design stages.
2. Does your partition avoid redundant computation and storage?
3. Are the tasks of comparable size?
4. Does the number of tasks scale with the size of the problem? Ideally, an increase in problem size should increase the number of tasks rather than the size of individual tasks.
5. Have you identified several alternative partitions?
3.5. COMMUNICATION
The tasks generated by a partition are intended to execute concurrently, but they cannot, in general, execute independently. The computation performed in one task typically requires data associated with other tasks, which means that data must be transferred between tasks: tasks are linked by channels over which one task can send messages and the other can receive them. The communication associated with an algorithm can therefore be specified in two phases. In the first phase, we define a channel structure that links, directly or indirectly, the tasks that require data (consumers) with the tasks that have those data (producers). In the second phase, we specify the messages that will be sent and received on these channels. How this is done ultimately depends on the implementation technology.
In domain decomposition problems, communication requirements can be difficult to determine. Recall that this strategy first partitions data structures into disjoint subsets and then associates with each datum the operations that operate solely on that datum. This part of the design is usually simple. However, some operations that require data from several tasks usually remain, and communication is then needed to manage the transfer of data required for these tasks to proceed. The organization of this
communication in an efficient manner can be challenging. Even simple decompositions may have complex communication structures. To be clearer, communication can be classified as follows:
•	In local communication, each task communicates with a small set of other tasks (its neighbors).
•	In global communication, each task must communicate with many tasks.
•	In structured communication, a task and its neighbors form a regular structure, such as a tree or a network.
•	In static communication, the identity of the communication partners does not change over time.
•	In dynamic communication, the partners may be determined by data computed at runtime and may vary widely.
•	In synchronous communication, producers and consumers execute in coordination, with producer/consumer pairs cooperating in data transfer operations; in asynchronous communication, a consumer may obtain data without the cooperation of the producer.
3.6. AGGLOMERATION
The algorithm resulting from the two previous phases is not yet complete, in the sense that it is not specialized for efficient execution on any particular parallel computer; indeed, it may be quite inefficient if, for example, it creates many more tasks than there are processors on the target computer and that computer is not designed for the efficient execution of small tasks. Therefore, in this phase we revisit the decisions made in the partitioning and communication phases, with the aim of obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, the tasks identified by the partitioning phase, so as to provide a smaller number of larger tasks. We also determine whether it is worthwhile to replicate data or computation. Three examples that describe this phase are:
1. The size of the tasks is increased by reducing the dimension of the decomposition from three to two.
2. Adjacent tasks are combined to produce a three-dimensional decomposition of higher granularity.
3. Substructures are combined in a "divide and conquer" structure.
•	Nodes in a tree algorithm are combined.
Although this stage may reduce the number of tasks, the design is still somewhat abstract, since issues relating to the mapping of tasks to processors remain unresolved. On the other hand, we may choose at this stage to reduce the number of tasks to exactly one per processor. We might do this because our target parallel environment requires an SPMD program.
This phase focuses on the general issues that arise when the granularity of tasks is increased. Three sometimes conflicting objectives guide decisions about agglomeration and replication: (i) reducing communication costs by increasing computation and communication granularity; (ii) retaining flexibility with respect to scalability and mapping decisions; and (iii) reducing software engineering costs. Regarding increased granularity, we have the following:
•	A critical issue affecting parallel performance is communication cost. Clearly, an improvement can be achieved by sending less data; it can also be achieved by using fewer messages, even if the same total amount of data is sent.
•	Another concern is the cost of task creation. The following figures show the same computation with fine-grained and coarse-grained tasks; in the second example one task is exploded to show its outgoing messages (dark shading) and incoming messages (light shading).
In the fine-grained version, the grid is divided into 8×8 = 64 tasks, each responsible for a single point; 64×4 = 256 communications are required, 4 per task, transferring 256 data values in total.
In the coarse-grained version, partitioned into 2×2 = 4 tasks, each responsible for 16 points, only 4×4 = 16 communications are required and only 16×4 = 64 values are transferred. If the number of communication partners per task is small, we can often reduce both the number of communication operations and the total communication volume by increasing the granularity of our partition, that is, by agglomerating several tasks into one. In other words, the communication requirements of a task are proportional to the surface of the subdomain on which it operates, while the computational requirements are proportional to the volume of the subdomain; hence, the amount of communication performed per unit of computation (the communication/computation ratio) decreases as task size increases. This surface-to-volume effect appears whenever a partition is obtained using the domain decomposition technique (a small numerical sketch of the effect is given after the list below).
•	It is important not to agglomerate a multidimensional data structure down to a single dimension, as this can make it awkward to port the program to a different parallel architecture.
•	The ability to create a variable number of tasks is essential if a program is to be portable and scalable.
•	Flexibility does not necessarily imply that a design must always create a large number of tasks; the granularity can be controlled by a compile-time or runtime parameter.
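The following small sketch reproduces the arithmetic given above for a square grid, counting four messages per task, each carrying one block edge of values, to show how the communication/computation ratio falls as the blocks grow:

    def communication_stats(grid_side, block_side):
        # Tasks are block_side x block_side blocks of a grid_side x grid_side grid.
        tasks = (grid_side // block_side) ** 2
        messages = tasks * 4                   # each task exchanges with 4 neighbors
        values = messages * block_side         # each message carries one block edge
        computation = grid_side ** 2           # one unit of work per grid point
        return tasks, messages, values, values / computation

    for block in (1, 2, 4):
        tasks, msgs, vals, ratio = communication_stats(8, block)
        print(f"{block}x{block}-point blocks: {tasks} tasks, {msgs} messages, "
              f"{vals} values, comm/comp = {ratio:.2f}")

For an 8×8 grid this reproduces the two cases in the text: one-point tasks give 64 tasks, 256 messages, and 256 values, while 4×4-point tasks (a 2×2 partition) give 4 tasks, 16 messages, and 64 values.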
3.7. REDUCING COSTS OF SOFTWARE ENGINEERING
An additional concern, which can be particularly important when parallelizing existing sequential code, is the relative development cost of the different partitioning strategies. Another software engineering topic to be considered is the distribution of the data structures used by other program components. For example, the best algorithm for some program component may require that an input array be decomposed in three dimensions, while a previous computation step generates a two-dimensional decomposition. Either one or both algorithms must be changed, or a restructuring phase must be incorporated explicitly into the computation; each approach has different performance characteristics.
The final stage of parallel algorithm design specifies where each task is to execute, that is, the mapping. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task
scheduling. On such computers, a set of tasks and the associated communication requirements is a sufficient specification for a parallel algorithm; the operating system or the hardware can be relied upon to schedule the executable tasks onto the available processors. Unfortunately, general-purpose mapping mechanisms of this kind have yet to be developed for scalable parallel computers, so in general the mapping problem must be addressed explicitly when designing parallel algorithms. Our goal in developing mapping algorithms is to minimize the total execution time. We use two strategies to achieve this goal:
1. Place tasks that can execute concurrently on different processors, in order to increase concurrency.
2. Place tasks that communicate frequently on the same processor, in order to increase locality.
Clearly, these two strategies often conflict, in which case the design of our algorithm involves trade-offs. In addition, resource limitations tend to restrict the number of tasks that can be allocated to a single processor.
3.8. LOAD BALANCING ALGORITHMS
A wide variety of load balancing techniques, both general and application-specific, have been proposed for parallel algorithms based on domain decomposition. Several representative approaches are reviewed here, namely recursive bisection methods, local algorithms, probabilistic methods, and cyclic mappings. These techniques are all intended to agglomerate the fine-grained tasks defined in an initial partition into one coarse-grained task per processor. Alternatively, we can think of them as partitioning the computational domain to yield one subdomain for each processor; for this reason, they are often referred to as partitioning algorithms.
3.9. TASK SCHEDULING ALGORITHMS
Task scheduling algorithms can be used when a functional decomposition approach produces many tasks, each with weak locality requirements.
In this approach, a centralized or distributed task pool is maintained, into which new tasks are placed and from which tasks are taken for allocation to processors. In effect, the parallel algorithm is reformulated so that the problem is solved by a set of worker tasks, typically one per processor. A sketch of such a pool follows.
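A minimal sketch of a centralized task pool with one worker per processor, using Python's standard library; the task function and the initial list of tasks are placeholders:

    import multiprocessing

    def task_function(task):
        # Placeholder for the work associated with one task.
        return task * task

    def worker(pool_queue, results):
        while True:
            task = pool_queue.get()
            if task is None:                  # sentinel: no more work
                break
            results.put(task_function(task))

    if __name__ == "__main__":
        pool_queue, results = multiprocessing.Queue(), multiprocessing.Queue()
        workers = [multiprocessing.Process(target=worker, args=(pool_queue, results))
                   for _ in range(multiprocessing.cpu_count())]  # one worker per processor
        for w in workers:
            w.start()
        for task in range(20):                # place new tasks into the pool
            pool_queue.put(task)
        for _ in workers:                     # one sentinel per worker
            pool_queue.put(None)
        for w in workers:
            w.join()
        print(sorted(results.get() for _ in range(20)))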
3.10. ALLOCATION LIST DESIGN
Whenever possible, a static mapping scheme that assigns each task to a single processor is used. However, when the number or size of the tasks is not known until runtime, we may use a dynamic load balancing scheme, or reformulate the problem so that a task scheduling structure can be used to schedule the computation. The following questions can serve as a basis for an informal assessment of the mapping design.
1. Can an SPMD design be considered for a complex problem?
2. Can a design based on dynamic task creation and deletion be considered?
3. Can a centralized load balancing scheme be used?
4. Have we assessed the relative costs of the different strategies, and included the implementation costs in our analysis?
5. Is there a sufficiently large number of tasks to ensure reasonable load balancing? Typically, at least ten times as many tasks as processors are required.
3.11. MODEL OF THE ATMOSPHERE
The atmosphere model is a program that simulates the atmospheric processes (clouds, wind, precipitation, etc.) that influence the climate. It can be used to study the evolution of tornadoes, to predict tomorrow's weather, or to study the impact on the climate of increased concentrations of atmospheric carbon dioxide. Like many numerical models of physical processes, an atmosphere model solves a set of partial differential equations, in this case describing the basic fluid-dynamical behavior of the atmosphere. The behavior of these equations on a continuous space is approximated by their behavior on a finite set of regularly spaced points in that space. Usually, these points lie on a rectangular latitude-longitude grid of size Nx × Ny × Nz, with Nz (the number of vertical levels) in the range of 15 to 30 and Nx and Ny in the range of 50 to 500.
This grid is periodic in the x and y dimensions, meaning that grid point (0, j, k) is regarded as being adjacent to point (Nx−1, j, k), and point (i, 0, k) adjacent to point (i, Ny−1, k). A vector of values is maintained at each grid point, representing quantities such as pressure, temperature, wind speed, and humidity. At this stage, let us work through the example with each of the techniques: partition, communication, agglomeration, and mapping.
The grid used to represent the state of the atmosphere model is a natural candidate for domain decomposition; decompositions in the x, y, and/or z dimensions are possible. In the most fine-grained decomposition, a task is defined for each grid point. Each task maintains as its state the various values associated with its grid point and is responsible for the computation required to update that state at each time step. Therefore, we have a total of Nx × Ny × Nz tasks, each with O(1) data and O(1) computation per time step.
First, we consider the communication needs. Let us identify three distinct communications as depicted in Figure 3.1.
Figure 3.1: The task and channel structure for a two-dimensional finite difference computation with a nine-point stencil, assuming one grid point per task. Only the channels used by the shaded task are shown.
1. Finite difference stencils. If we assume a fine-grained decomposition in which each task encapsulates a single grid point, the nine-point stencil used in the horizontal dimension requires that each task obtain values from eight neighboring tasks. The corresponding channel structure is illustrated in Figure 3.1. Similarly, the three-point stencil used in the vertical dimension requires that each task obtain values from two neighbors.
2. Global operations. The atmosphere model periodically computes the total mass of the atmosphere, in order to verify that the simulation is proceeding correctly. This quantity is defined as follows:
Total Mass = Σ_{i=0}^{Nx−1} Σ_{j=0}^{Ny−1} Σ_{k=0}^{Nz−1} M_{ijk}
where M_{ijk} denotes the mass at grid point (i, j, k). This sum can be computed using one of the parallel summation algorithms presented in Section 2.4.1.
3. Physics computations. If each task encapsulates a single grid point, then the physics component of the model requires significant communication. For example, the total clear sky (TCS) at level k is defined as
TCS_k = Π_{i=1}^{k} (1 − cld_i) = TCS_{k−1} (1 − cld_k), with TCS_0 = 1,
where level 0 is the top of the atmosphere and cld_i is the cloud fraction at level i. This is a prefix product operation (a small numerical sketch is given below). In all, the physics component of the model requires on the order of 30 communications per grid point per time step.
The communication associated with the finite difference stencils is distributed over the grid, as is the communication required for the global operation (we might also consider performing this global operation less frequently, since its value is intended only for diagnostic purposes). The one component of our algorithm's communication structure that is problematic is the physics. However, we shall see that the need for this communication can be avoided by agglomeration (Figure 3.2).
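As a small numerical sketch of the TCS prefix product above (the cloud fractions used are made-up values):

    import numpy as np

    # Assumed cloud fraction at each level, ordered from the top of the atmosphere down.
    cld = np.array([0.0, 0.1, 0.3, 0.0, 0.5])

    # TCS at every level: cumulative product of (1 - cld) from the top level down.
    tcs = np.cumprod(1.0 - cld)
    print(tcs)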
Figure 3.2: Use of agglomeration to reduce communication requirements in the atmosphere model. (a) A single task is responsible for a single grid point and must therefore obtain data from eight other tasks in order to apply the nine-point stencil. (b) Granularity is increased to 2×2 grid points per task, which reduces the communication needed.
3.12. AGGLOMERATION Our fine grain domain decomposition atmosphere created model N ×N ×N between tasks: 10 − 10 , depending on the size of the problem. This is likely to be much more than they need and some degree of agglomeration can be considered. We identified three reasons for achieving agglomeration: x
y
1.
z
5
7
A small amount of agglomeration (from one to four mesh points per task) can reduce communication requirements associated
with the nine-point stencil from eight to four messages per task per time step.
2. Communication requirements in the horizontal dimension are relatively small: a total of four messages containing eight data values. In contrast, the vertical dimension requires communication not only for the finite difference stencil (two messages, two data values) but also for various other computations. These communications can be avoided by agglomerating tasks within each vertical column.
3. Agglomeration in the vertical dimension is also desirable from a software engineering standpoint. Horizontal dependencies are restricted to the dynamics component of the model; the physics component operates within individual columns only. Therefore, a two-dimensional horizontal decomposition would allow existing sequential physics code to be reused in a parallel program without modification.
This analysis makes it appear sensible to refine our parallel algorithm to use a two-dimensional horizontal decomposition of the model grid in which each task encapsulates at least four grid points. Communication requirements are then reduced to those associated with the nine-point stencil and the summation operation. Note that this algorithm creates at most Nx × Ny / 4 tasks: between 10^3 and 10^5, depending on the size of the problem. This number is likely to be sufficient for most practical purposes.
It is evident from the figure that further agglomeration can be performed; in the limit, each processor can be assigned a single task responsible for many columns, thereby giving an SPMD program. This allocation strategy is efficient if each grid-column task performs the same amount of computation at each time step. This assumption is valid for many finite difference problems, but it turns out to be invalid for some atmosphere models, because the cost of the physics computations can vary significantly depending on the state variables of the model. For example, radiation calculations are not performed at night, and clouds form only when the humidity exceeds a certain threshold. Even so, the simple mapping strategy can still be used.
3.13. LOAD DISTRIBUTION
Load distribution in an atmosphere model with a 64×128 grid. The figure shows the computational load at each grid point for a single time step, with a histogram giving the relative frequency of the different load values. The left image shows a time step in which radiation calculations are performed and the right image an ordinary time step.
48
Parallel Programming
Load distribution in the physics component of the atmosphere model in the absence of load balancing. At the top of the figure, hatching is used to indicate the computational load on each of 16×32 processors. Strong spatial variation is evident. This effect is due to the day/night cycle (radiation calculations are performed only in sunlight).
In many circumstances, this loss of performance can be considered acceptable. However, if a model is widely used, it is worth spending time to improve its efficiency. One approach is to use a form of cyclic mapping: for example, assigning to each processor tasks from the western and eastern, and from the northern and southern, hemispheres. The image shows the reduction in load imbalance that can be achieved with this technique; this reduction must be weighed against the resulting increase in communication costs.
Chapter 4
Parallel Computer Systems
CONTENTS 4.1. History.............................................................................................. 53 4.2. Parallel Computing ........................................................................... 53 4.3. Background ...................................................................................... 54 4.4. Types Of Parallelism .......................................................................... 59 4.5. Hardware ......................................................................................... 61 4.6. Applications ..................................................................................... 67 4.7. History.............................................................................................. 68
Parallel systems are those that have the ability to perform multiple operations simultaneously. These systems typically handle large amounts of information, on the order of terabytes, and can process hundreds of requests per second. Parallel systems are composed of several systems that share information, resources, and memory in some way. Tightly coupled parallel systems are multiprocessor systems with more than one processor in which the processors share memory and a clock; communication is usually done through the shared memory. Among the advantages gained is an increase in reliability (Figure 4.1).
Figure 4.1: The Cray-2 supercomputer, the fastest in the world from 1985 to 1989.
Parallel computing is a programming technique in which many instructions are executed simultaneously. It is based on the principle that large problems can be divided into smaller parts that can be solved concurrently ("in parallel"). There are several types of parallel computing: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism. For many years, parallel computing has been used in high-performance computing (HPC), but interest in it has increased in recent years due to the physical constraints preventing frequency scaling. Parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors, but recently the power consumption of parallel computers has become a concern.
Parallel computers can be classified according to the level of parallelism their hardware supports: multicore and multiprocessor computers have multiple processing elements on a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Parallel computer programs are more difficult to write than sequential ones because concurrency introduces new types of software errors. Communication and synchronization between the different subtasks are typically the greatest
barriers to achieving good performance in parallel programs. The speedup achieved as a result of parallelizing a program is given by Amdahl's law.
4.1. HISTORY
Software has traditionally been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on the central processing unit of a computer; when one instruction completes, the next one runs. Parallel computing, by contrast, uses multiple processing elements simultaneously to solve a problem. This is achieved by dividing the problem into independent parts so that each processing element can execute its part of the algorithm at the same time as the others. Processing elements can be diverse and include resources such as a single computer with many processors, multiple networked computers, specialized hardware, or a combination thereof.
4.2. PARALLEL COMPUTING
Parallel computing is a form of computation in which many instructions are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). There are several different forms of parallel computing: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism. It has been used for many years, mainly in high-performance computing (HPC), but interest in it has grown in recent years due to the physical constraints preventing frequency scaling. Parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors. However, in recent years, the energy consumption of parallel computers has become a concern.
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism: multicore and multiprocessor computers have multiple processing elements within one machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Parallel computer programs are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks is typically one of the
biggest barriers to getting good parallel program performance. Speedup of a program as a result of parallelization is given by Amdahl’s Law.
4.3. BACKGROUND
Traditionally, software has been written for serial computation. To solve a problem, an algorithm is constructed which produces a serial stream of instructions. These instructions are executed on the central processing unit of a computer. Only one instruction may execute at a time; after that instruction finishes, the next one executes.
Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. Processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.
Frequency scaling was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Keeping everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction; an increase in frequency thus decreases runtime for all compute-bound programs. However, the power consumption of a chip is given by the equation P = C × V^2 × F, where P is power, C is the capacitance switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the processor frequency (cycles per second). Increases in frequency therefore increase the amount of power used in a processor. Increasing processor power consumption ultimately led to Intel's May 2004 cancellation of its Tejas and Jayhawk processors, which is usually cited as the end of frequency scaling as the dominant paradigm in computer architecture.
Moore's law is the empirical observation that transistor density in a microprocessor doubles every 18 to 24 months. Despite power consumption issues and repeated predictions of its end, Moore's law is still in effect. With the end of frequency scaling, these additional transistors (which are no longer used for frequency scaling) can be used to add extra hardware for parallel computing.
4.3.1. Amdahl's Law and Gustafson's Law
Theoretically, the speedup from parallelization should be linear: doubling the number of processing elements should halve the runtime, and doubling it a second time should again halve the runtime. However, very few parallel algorithms achieve optimal speedup. Most of them show near-linear speedup for small numbers of processing elements, which flattens out to a constant value for large numbers of processing elements.
The potential speedup of an algorithm on a parallel computing platform is given by Amdahl's law, originally formulated by Gene Amdahl in the 1960s. It states that a small portion of the program which cannot be parallelized will limit the overall speedup available from parallelization. Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation
S = 1 / (1 − P),
where S is the speedup of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable. If the sequential portion of a program accounts for 10% of the runtime, we can get no more than a 10× speedup, regardless of how many processors are added. This puts an upper limit on the usefulness of adding more parallel execution units. "When a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. The bearing of a child takes nine months, no matter how many women are assigned."
Gustafson's law is another law in computer engineering, closely related to Amdahl's law. It can be formulated as
S(P) = P − α (P − 1),
where P is the number of processors, S is the speedup, and α the non-parallelizable part of the process [11]. Amdahl's law assumes a fixed problem size and that the size of the sequential section is independent of the number of processors, whereas Gustafson's law does not make these assumptions.
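A small sketch evaluating both formulas; the fractions and processor counts used are arbitrary examples:

    def amdahl_max_speedup(parallel_fraction):
        # Upper bound on speedup when the parallel part is spread over unlimited processors.
        return 1.0 / (1.0 - parallel_fraction)

    def gustafson_speedup(processors, alpha):
        # Scaled speedup, with `alpha` the non-parallelizable part of the process.
        return processors - alpha * (processors - 1)

    print(amdahl_max_speedup(0.90))     # 10x limit when 10% of the work is sequential
    print(gustafson_speedup(64, 0.10))  # 64 - 0.1 * 63 = 57.7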
4.3.2. Dependencies
Understanding data dependencies is essential to implementing parallel algorithms. No program can run faster than the longest chain of dependent calculations (known as the critical path), since calculations that depend on previous calculations in the chain must be executed in order. However, most algorithms do not consist of just one long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel.
Let P_i and P_j be two fragments of a program, and let I_i and O_i denote the sets of variables read (input) and written (output) by P_i. Bernstein's conditions describe when the two fragments are independent and can be executed in parallel: I_j ∩ O_i = ∅, I_i ∩ O_j = ∅, and O_i ∩ O_j = ∅. Violation of the first condition introduces a flow dependency: the first statement produces a result used by the second statement. The second condition represents an anti-dependency: the second statement overwrites a variable needed by the first. The third and final condition represents an output dependency: when two statements write to the same location, the final result must come from the logically later statement [13].
Consider the following functions, which demonstrate several kinds of dependencies:
1: function Dep(a, b)
2:     c := a · b
3:     d := 2 · c
4: end function
Operation 3 in Dep(a, b) cannot be executed before (or even in parallel with) operation 2, because operation 3 uses a result of operation 2. It violates the first condition and thus introduces a flow dependence.
1: function NoDep(a, b)
2:     c := a · b
3:     d := 2 · b
4:     e := a + b
5: end function
In this example, there are no dependencies between the instructions, so they can all be run in parallel.
Bernstein's conditions do not allow memory to be shared between different processes. For that, some means of enforcing an ordering between accesses is required, such as semaphores, barriers, or some other synchronization method.
4.3.3. Race Conditions, Mutual Exclusion, Synchronization, and Parallel Slowdown
The subtasks in a parallel program are often called threads. Some parallel computer architectures use smaller, lighter-weight versions of threads known as fibers, while others use larger versions known as processes. However, "threads" is generally accepted as a generic term for subtasks. Threads
will often need to update some variable that is shared between them. The instructions of the two programs may be interleaved in any order. For example, consider the following program:

Thread A                          Thread B
1A: Read variable V               1B: Read variable V
2A: Add 1 to variable V           2B: Add 1 to variable V
3A: Write back to variable V      3B: Write back to variable V

If instruction 1B is executed between 1A and 3A, or if instruction 1A is executed between 1B and 3B, the program will produce incorrect data. This is known as a race condition. The programmer must use a lock to provide mutual exclusion. A lock is a programming-language construct that allows one thread to take control of a variable and prevent other threads from reading or writing it until that variable is unlocked. The thread holding the lock is free to execute its critical section (the section of a program that requires exclusive access to some variable) and to unlock the data when it is finished. Therefore, to guarantee correct program execution, the program above can be rewritten to use locks:

Thread A                          Thread B
1A: Lock variable V               1B: Lock variable V
2A: Read variable V               2B: Read variable V
3A: Add 1 to variable V           3B: Add 1 to variable V
4A: Write back to variable V      4B: Write back to variable V
5A: Unlock variable V             5B: Unlock variable V
Here, one thread will successfully lock variable V, while the other thread is locked out, unable to proceed until V is unlocked again. This guarantees correct execution of the program. Locks, while necessary to ensure correct program execution, can greatly slow a program.
Locking multiple variables using non-atomic locks introduces the possibility of deadlock. An atomic lock locks multiple variables all at once; if it cannot lock all of them, it does not lock any of them. If two threads each need to lock the same two variables using non-atomic locks, it is possible that one thread will lock one of them and the second thread will lock the second variable. In this case, neither thread can finish, and deadlock results.
Many parallel programs require that their subtasks act in synchrony. This requires the use of a barrier. Barriers are typically implemented using a software lock. One class of algorithms, known as lock-free and wait-free algorithms, avoids the use of locks and barriers altogether. However, this approach is generally difficult to implement and requires data structures
that are properly designed. Not all parallelization results in speedup. Generally, as a task is divided into more and more threads, those threads spend an ever-increasing portion of their time communicating with each other. Eventually, the overhead of communication dominates the time spent solving the problem, and further parallelization (that is, splitting the workload over even more threads) increases rather than decreases the amount of time required to finish. This is known as parallel slowdown.
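The same race-condition scenario in runnable form, as a sketch using Python's threading module; without the lock, the final count can fall short of the expected value:

    import threading

    counter = 0
    lock = threading.Lock()

    def add_one_many_times(times):
        global counter
        for _ in range(times):
            with lock:              # lock V, read, add 1, write back, unlock
                counter += 1

    threads = [threading.Thread(target=add_one_many_times, args=(100_000,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                  # always 200000 with the lock held around the update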
4.3.4. Fine-Grained, Coarse-Grained, and Embarrassing Parallelism
Applications are often classified by how frequently their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second; and it is embarrassingly parallel if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.
4.3.5. Consistency Models
Parallel programming languages and parallel computers must have a consistency model (also known as a memory model). The consistency model defines rules for how operations on computer memory occur and how results are produced.
One of the first consistency models was Leslie Lamport's sequential consistency model. Sequential consistency is the property that a parallel execution of a parallel program produces the same results as a sequential execution. Specifically, a program is sequentially consistent if "the results of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program" [14].
Software transactional memory is a common type of consistency model. It borrows from database theory the concept of atomic transactions and applies it to memory accesses.
Mathematically, these models can be represented in several ways. Petri nets, which were introduced in Carl Adam Petri's 1962 doctoral thesis, were an early attempt to codify the rules of consistency models. Dataflow theory was later built upon these, and dataflow architectures were created to
physically implement the ideas of dataflow theory. Beginning in the late 1970s, process calculi such as the Calculus of Communicating Systems and Communicating Sequential Processes were developed to permit algebraic reasoning about systems composed of interacting components. More recent additions to the process calculus family, such as the π-calculus, have added the capability for reasoning about dynamic topologies. Logics such as Lamport's TLA+, and mathematical models such as traces and actor event diagrams, have also been developed to describe the behavior of concurrent systems.
4.3.6. Flynn's Taxonomy
Michael J. Flynn created one of the earliest classification systems for parallel (and sequential) computers and programs, now known as Flynn's taxonomy. Flynn classified programs and computers by whether they operate using a single set or multiple sets of instructions, and whether or not those instructions use a single set or multiple sets of data.

Flynn's taxonomy
                          Single Data    Multiple Data
Single Instruction        SISD           SIMD
Multiple Instructions     MISD           MIMD
The single-instruction-single-data (SISD) classification is equivalent to an entirely sequential program. The single-instruction-multiple-data (SIMD) classification is analogous to performing the same operation repeatedly over a large data set; this is commonly done in signal processing applications. Multiple-instruction-single-data (MISD) is a rarely used classification; while computer architectures to deal with it were devised (such as systolic arrays), few applications that fit this class have materialized. Multiple-instruction-multiple-data (MIMD) programs are by far the most common type of parallel programs. According to David Patterson and John L. Hennessy: "Some machines are hybrids of these categories, of course, but this classic model has survived because it is simple, easy to understand, and gives a good first approximation. It is also, perhaps because of its understandability, the most widely used scheme."
4.4. TYPES OF PARALLELISM
4.4.1. Bit-Level Parallelism
From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer
architecture was driven by doubling the size of the computer word, the amount of information the processor can manipulate per cycle [16]. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, it must first add the 8 lower-order bits of each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, whereas a 16-bit processor could complete the operation with a single instruction.
Historically, 4-bit microprocessors were replaced by 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which were a standard in general-purpose computing for two decades. Not until recently (circa 2003-2004), with the advent of x86-64 architectures, have 64-bit processors become commonplace.
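A small sketch mimicking the add-with-carry sequence described above, adding two 16-bit integers using only 8-bit operations:

    def add16_with_8bit_ops(a, b):
        # Add two 16-bit values using two 8-bit additions and a carry, as an 8-bit CPU would.
        lo = (a & 0xFF) + (b & 0xFF)                            # standard add on the low bytes
        carry = lo >> 8
        hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry      # add-with-carry on the high bytes
        return ((hi & 0xFF) << 8) | (lo & 0xFF)                 # result truncated to 16 bits

    print(hex(add16_with_8bit_ops(0x12F0, 0x0135)))             # 0x1425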
4.4.2. Instruction-Level Parallelism
A computer program is essentially a stream of instructions executed by a processor. These instructions can be reordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s.
Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4 processor had a 35-stage pipeline.
In addition to the instruction-level parallelism obtained from pipelining, some processors can issue more than one instruction at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.
4.4.3. Data Parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across the different computing nodes so that it can be processed in parallel. Parallelizing loops often leads to similar (not necessarily identical) sequences of operations or functions being performed on the elements of a large data structure. Many scientific and engineering applications exhibit data parallelism.
A loop-carried dependence is the dependence of a loop iteration on the output of one or more previous iterations. Loop-carried dependencies prevent the parallelization of loops. For example, consider the following pseudocode, which computes the first few Fibonacci numbers:

    PREV2 := 0
    PREV1 := 1
    CUR := 1
    do:
        CUR := PREV1 + PREV2
        PREV2 := PREV1
        PREV1 := CUR
    while (CUR < limit)    (the loop bound is assumed; it is garbled in the source)

This loop cannot be parallelized, because each iteration's CUR depends on PREV1 and PREV2, which are produced by the previous iteration.
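By contrast, a loop whose iterations are independent is data parallel and can be distributed across processing elements. A minimal sketch using a process pool, with a placeholder per-element operation:

    from multiprocessing import Pool

    def update(x):
        # Placeholder per-element operation; each element is independent of the others.
        return 2 * x + 1

    if __name__ == "__main__":
        data = list(range(1_000))
        with Pool() as pool:                  # one worker per core by default
            result = pool.map(update, data)   # the iterations run in parallel
        print(result[:5])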
Certain problems adapt naturally to recursive solutions.
8.1. CLASSIFICATION OF RECURSIVE FUNCTIONS
Recursive functions can be classified according to how the recursive call is made:
•	Direct recursion: the function calls itself.
•	Indirect recursion: function A calls function B, and function B calls A.
According to the number of recursive calls generated at runtime:
•	Linear or simple recursion: a single internal call is generated.
•	Nonlinear or multiple recursion: two or more internal calls are generated.
According to the point where the recursive call is made, a recursive function can be:
•	Final (tail recursive): the recursive call is the last instruction executed within the function.
•	Non-final (non-tail recursive): some operation is performed after returning from the recursive call.
Tail-recursive functions are usually more efficient (in the multiplicative constant with respect to time, and especially in terms of memory space) than non-tail-recursive ones. (Some compilers can optimize these functions automatically, converting them to iterative form.)
An example of tail recursion is the Euclidean algorithm for calculating the greatest common divisor of two positive integers, sketched below.
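A standard tail-recursive version of the algorithm, as a sketch:

    def gcd(a, b):
        # Greatest common divisor of two positive integers, by the Euclidean algorithm.
        if b == 0:
            return a
        return gcd(b, a % b)    # tail call: the recursive call is the last instruction

    print(gcd(48, 36))          # 12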
8.2. DESIGN OF RECURSIVE FUNCTIONS
•	The original problem can be transformed into a similar but simpler problem.
•	We have some direct way of solving the "trivial problems."
For a recursive module to be correct, the following must be verified:
•	A case analysis of the problem: there is at least one termination condition in which a recursive call is NOT necessary; such trivial cases are solved directly. If n = 0 or n = 1, the factorial is 1.
•	Convergence of the recursive calls: each recursive call is made on smaller data, so that the termination condition is eventually reached. Factorial(n) = n * Factorial(n − 1).
•	If the recursive calls work correctly, the entire module works correctly (the induction principle): Factorial(0) = 1 and Factorial(1) = 1; for n > 1, Factorial(n) = n * Factorial(n − 1) is computed correctly assuming the correct computation of Factorial(n − 1).
Graphically, the factorial function unfolds as sketched below.
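A sketch of the function, with the unfolding of a sample call shown in the comments:

    def factorial(n):
        if n == 0 or n == 1:         # trivial cases, solved directly
            return 1
        return n * factorial(n - 1)  # each call converges toward the trivial case

    # factorial(4)
    #   = 4 * factorial(3)
    #   = 4 * (3 * factorial(2))
    #   = 4 * (3 * (2 * factorial(1)))
    #   = 4 * (3 * (2 * 1)) = 24
    print(factorial(4))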
An important requirement for a recursive algorithm to be correct is that it does not generate an infinite sequence of calls to itself.
8.2.1. Advantages and Disadvantages of Recursion
A recursive function should express the problem in a natural, simple, comprehensible, and elegant way. For example, given a non-negative integer, write its binary coding; a minimal version of such a recursive function is:

    #include <stdio.h>
    void binary(int n) {
        if (n > 1)
            binary(n / 2);
        printf("%d", n % 2);
    }

The recursive computation of the Fibonacci numbers, F(n) = F(n − 1) + F(n − 2), illustrates the main disadvantage. The execution time of the algorithm grows as fast as the Fibonacci numbers themselves: T(n) is exponential in n, which implies that the algorithm is impractical except for very small values of n. A demonstration of the complexity of the algorithm can be seen as follows. Consider a naive approach to calculating the complexity of the recursive Fibonacci function: call S(n) the number of additions needed to find F(n). For the first values we have S(1) = 0 = S(2), S(3) = 1, S(4) = 2, S(5) = 4, S(6) = 7. In general, the number of additions needed to compute F(n) is S(n) = S(n − 1) + S(n − 2) + 1, and by induction F(n − 2) < S(n) is obtained. But how fast does the Fibonacci function grow? We can make an analogy between F(n) = F(n − 1) + F(n − 2) and x^2 = x + 1, that is, the roots of x^2 − x − 1 = 0. This is the characteristic equation, and its positive root is known as the golden ratio, whose exact value is c = (1 + √5)/2. The geometric progression {c^n} satisfies the same recurrence as the Fibonacci function: c^n = c^(n−1) + c^(n−2), which corresponds to F(n) = F(n − 1) + F(n − 2). Now, by induction, since c^0 = 1 = F(2) and c = c^1