When the tasks have to be executed on processors with successive indices, Steinberg [22] proposed an adapted strip-packing algorithm with approximation factor 2. Furthermore, Jansen [16] gave an approximation scheme with makespan at most 1 + ε times the optimum for any fixed ε > 0. There exist other algorithms for special cases of the MPTS with approximation factors close to one (e.g. [9] for identical malleable tasks), but those do not apply to our case.
Scheduling Parallel Eigenvalue Computations in a Quantum Chemistry Code
117
The problem stated in this paper is a standard MPTS problem, so the algorithms mentioned above could in principle be applied. However, as we will see, the number of sub-matrices to be processed is limited in most cases. This allows us to modify the algorithm from [18] by introducing a combinatorial approach in order to find a better solution.

3.2 The Algorithm
We focus on a two-phase approach. In the first phase, a number of allotted processors for each task is determined. This step performs the transformation from an MPTS to an NPTS. In the second phase, an optimal scheduling for the nonmalleable tasks is constructed.

Processor Allotment: In [18], Ludwig and Tiwari showed how a processor allotment can be found in runtime O(mn). The algorithm also computes a lower bound ω on the optimal makespan d∗ such that ω ≤ d∗ ≤ 2ω. The basic idea is to find a number of allotted processors PTi ∈ P for each task Ti such that ti,PTi ≤ τ. Here τ ∈ IR is defined as an upper bound for the execution time of each task, and PTi is the minimum number of processors required to satisfy this condition. The goal is to find the minimum τ∗ which produces a feasible allotment for each task. Furthermore, the values of τ to be considered can be limited to a certain set X = {ti,Pj : i = 1 . . . n, j = 1 . . . m}; thus |X| = mn. Once τ∗ has been found, the algorithm yields an allotment, which is subsequently used for solving the NPTS problem. The algorithm requires t to be strictly monotonic: ti,p1 > ti,p2 for p1 < p2. This property is not generally given, but can be achieved by a suitable choice of the cost function (see below). For further details, please refer to the original paper [18].

Solution of NPTS: Ludwig and Tiwari presented a technique which ". . . takes any existing approximation algorithm for NPTS and uses it to obtain an algorithm for MPTS with the same approximation factor." [18]. In other words, if there is a way to find an optimal schedule for the NPTS, it simultaneously yields an optimal solution for the MPTS.

For each irreducible representation, one block on the diagonal of the Hamilton matrix H has to be diagonalized; in most applications, the number of such blocks is at most 10.
This includes the important point groups Ih and Oh and their subgroups, as well as all point groups with up to fivefold rotational axes as symmetry elements [1]. Thus, point groups including more than 10 irreducible representations are very rarely encountered in practical applications. For such a limited problem size, we can achieve an optimal solution by a combinatorial approach. The number of possible permutations σ of an NPTS is n!; in our case, this leads to a maximum number of 10! ≈ 3.6 · 10^6. Algorithm 1 shows the routine to find the scheduling with the minimum makespan d∗. A scheduling is represented by a sequence of task numbers σ, from which Algorithm 2 generates the actual scheduling. For simplification, we do not explicitly provide the routine that generates the permutations.
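The processor-allotment phase described above (the search for the minimum τ∗ over the candidate set X) can be sketched in Python. This is an illustrative simplification: the feasibility test used here is a plain "area" bound (Σ pᵢ·tᵢ,ₚᵢ ≤ m·τ), which is an assumption of ours, not the exact test of Ludwig and Tiwari [18]; the cost table `times` and all names are hypothetical.

```python
def min_procs(times_i, tau):
    """Smallest processor count p with t_{i,p} <= tau, or None.
    times_i[p-1] = execution time of task i on p processors
    (strictly decreasing, as required by the allotment algorithm)."""
    for p, t in enumerate(times_i, start=1):
        if t <= tau:
            return p
    return None

def allot(times, m):
    """Sketch of the allotment phase: scan the candidate set
    X = {t_{i,p}} for the smallest tau yielding a feasible allotment.
    Feasibility here is the simplified area bound
    sum_i p_i * t_{i,p_i} <= m * tau (our assumption, not the exact
    test from the original paper; a binary search over X would give
    the stated runtime)."""
    candidates = sorted({t for row in times for t in row})
    for tau in candidates:
        allot_p = [min_procs(row, tau) for row in times]
        if any(p is None for p in allot_p):
            continue
        work = sum(p * row[p - 1] for p, row in zip(allot_p, times))
        if work <= m * tau:
            return tau, allot_p
    raise ValueError("no feasible tau in X")
```

The returned allotment fixes PTi for each task and thereby turns the malleable instance into a nonmalleable one.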
118
M. Roderus et al.
Algorithm 1. Finds the scheduling sequence σ∗ from which MakeSchedule (Algorithm 2) generates a scheduling with the minimum makespan

1. Find the tasks which have been allotted all processors: Tm = {Ti ∈ T : PTi = m}
2. Schedule Tm at the beginning
3. For each possible permutation σ of T \ Tm:
   (a) Call MakeSchedule to generate a scheduling from σ
   (b) σ∗ ← σ if the makespan of σ is smaller than the makespan of σ∗
Algorithm 2. The procedure to generate a scheduling from a scheduling sequence σ. It allots to each task σ(i) → Ti a start time and an end time (σ(i).startTime and σ(i).endTime, respectively), as well as a set of allotted processors (σ(i).allottedProcessors). There is a set of processors C = {C1 . . . Cm}; each element Ci has an attribute Ci.availableTime, which indicates until which time the processor is occupied and, thus, the earliest time from which it can be used for a new task.

procedure MakeSchedule(σ)
    for all Ci ∈ C do
        Ci.availableTime ← 0
    end for
    for i ← 1, |σ| do
        Ctmp ← first PTi processors which become available
        σ(i).startTime ← max(available times from Ctmp)
        σ(i).endTime ← σ(i).startTime + ti,PTi
        σ(i).allottedProcessors ← Ctmp
        for all Cj ∈ Ctmp do
            Cj.availableTime ← σ(i).endTime
        end for
    end for
end procedure
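The two routines can be sketched together in Python; this is a minimal, illustrative implementation with names of our own choosing, not code from the paper.

```python
import itertools

def make_schedule(seq, m, procs_needed, exec_time):
    """Algorithm 2 sketch: list-schedule the tasks in the order given
    by seq on m processors and return the makespan.
    procs_needed[t]: processors allotted to task t (P_Tt)
    exec_time[t]:    runtime of task t on that allotment (t_{t,P_Tt})"""
    avail = [0.0] * m                     # availableTime per processor
    for t in seq:
        p = procs_needed[t]
        # "first p processors which become available" = p smallest times
        chosen = sorted(range(m), key=lambda c: avail[c])[:p]
        start = max(avail[c] for c in chosen)
        end = start + exec_time[t]
        for c in chosen:
            avail[c] = end
    return max(avail)

def best_schedule(m, procs_needed, exec_time):
    """Algorithm 1 sketch: schedule the all-processor tasks first, then
    try every permutation of the remaining tasks and keep the best."""
    tasks = list(procs_needed)
    full = [t for t in tasks if procs_needed[t] == m]
    rest = [t for t in tasks if procs_needed[t] < m]
    best = None
    for perm in itertools.permutations(rest):
        seq = full + list(perm)
        d = make_schedule(seq, m, procs_needed, exec_time)
        if best is None or d < best[0]:
            best = (d, seq)
    return best
```

Note that `make_schedule` always picks the processors that become available earliest, matching the "first PTi processors which are available" step; the brute-force loop over permutations is what bounds the practical use to roughly 10 tasks.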
4 Cost Function
The scheduling algorithm described requires a cost function which estimates the execution time of the (Sca)LAPACK routines DSYGV and PDSYGV. It is difficult to determine upfront how accurate the estimates have to be. However, the validation of the algorithm will show whether the error bounds are tight enough for the algorithm to work in practice. The ScaLAPACK User’s Guide [6] proposes a general performance model, which depends on machine-dependent parameters such as floating point or network performance, and data and routine dependent parameters such as total FLOP or communication count. In [10], Demmel and Stanley used this approach to evaluate the general performance behavior of the ScaLAPACK routine PDSYEVX. The validation of the models shows that the prediction error usually lies between 10 and 30%. Apart from that, for the practical use in a scheduling
algorithm, it exhibits an important drawback: establishing a model of that kind requires good knowledge of the routine used. Furthermore, each routine needs its own model; thus, if the routine changes (e.g. due to a revision or the use of a different library), the model has to be adapted as well.

Here we follow a different approach: the routine is handled as a "black box". Predictions of its execution time are based on empirical data, which are recorded by test runs with a set of randomly generated matrices on a set of possible processor allotments P. Then, with a one-dimensional curve-fitting algorithm, a continuous cost function t is generated for each element of P. Thus, each P ∈ P has a related cost function tP : S → tP,S (not to be confused with (2)).

ScaLAPACK uses a two-dimensional block-cyclic data distribution. For each instance of a routine, a Pr × Pc process grid has to be allocated, with Pr process rows and Pc process columns. The ScaLAPACK User's Guide [6] suggests using a square grid (Pr = Pc = √P) for P ≥ 9 and a one-dimensional grid (Pr = 1; Pc = P) for P < 9. Following this suggestion results in a reduced set of processor configurations, e.g. P = {1, 2, . . . , 8, 9, 16, 25, . . . , ⌊√m⌋²}, which is used here.

We used the method of least squares to fit the data. For our purposes, it shows two beneficial properties:
– the data are fitted by polynomials, which are easy to handle and allow us to compute an estimated execution time tP,S with low computational effort;
– during the data generation, processing of a matrix sometimes takes longer than expected due to hardware delays (see circular marks in Fig. 2). As long as those cases are rare, their influence on the cost function is minor and can be neglected.

Finally, we combine the resulting set of P-related cost functions to form the general cost function (2).
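The curve fitting just described can be sketched as follows. The degree-3 polynomial matches Fig. 2; the normal-equations solver below is only one way to compute a least-squares fit (numpy.polyfit would serve equally well), and all names are ours.

```python
def polyfit_lsq(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations:
    solve A c = b with A[j][k] = sum x^(j+k), b[j] = sum y*x^j."""
    n = degree + 1
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    c = [0.0] * n
    for j in reversed(range(n)):
        c[j] = (b[j] - sum(A[j][k] * c[k] for k in range(j + 1, n))) / A[j][j]
    return c          # coefficients: c[0] + c[1]*x + ... + c[degree]*x^degree

def cost(coeffs, size):
    """Evaluate the fitted cost function t_P at matrix size `size`."""
    return sum(c * size ** j for j, c in enumerate(coeffs))
```

A fit is computed once per processor configuration P from the recorded (matrix size, runtime) pairs; evaluating `cost` is then cheap enough to call many times inside the scheduler.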
However, in practice, when a certain number of allotted processors is exceeded, parallel routines no longer show a speedup, or even slow down, see [24]. This behavior does not comply with the assumption of a generally monotonic cost function. To satisfy this constraint, we define (2) as follows:

ti,p = min{tP,Si : P ∈ P ∧ P ≤ p}.    (3)
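As code, (3) is simply a minimum over all supported configurations not exceeding p; in this sketch, `per_p_cost` stands for the fitted functions tP (an assumed representation, not the paper's data structure).

```python
def general_cost(per_p_cost, p, size):
    """Eq. (3) as code: the cost on p processors is the best cost over
    all supported configurations P <= p.  By construction the result is
    monotonically non-increasing in p, even if individual fits are not.
    per_p_cost maps a configuration P to its fitted cost function t_P."""
    return min(per_p_cost[P](size) for P in per_p_cost if P <= p)
```

For example, if the 2-processor fit is slower than the 1-processor fit for small matrices, (3) falls back to the 1-processor cost, restoring monotonicity.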
(Figure 2 axes: size n of a matrix ∈ IRn×n vs. time/seconds; curves: empiric data, fitted data.)
Fig. 2. Execution time measurements of the routine PDSYGV diagonalizing randomly generated matrices. The curve labeled “fitted data” represents a polynomial of degree 3, which was generated by the method of least squares from the empiric data.
All processor counts P ∈ P with P ≤ p are considered. The P which results in the smallest execution time for the given S determines the P-related cost function and thus the result t of the general cost function (3).
5 Evaluation
We evaluated the presented scheduler for two molecular systems as example applications: the gold cluster compound Au55 (PH3 )12 in symmetry S6 and the palladium cluster Pd344 in symmetry Oh . Table 1 lists the sizes of the symmetry adapted basis sets which result in SAu55 and SPd344 .
Table 1. The resulting point group classes (PGC) of the two example systems Au55(PH3)12 in symmetry S6 and Pd344 in symmetry Oh. The third row contains the sizes n of the matrices ∈ IRn×n which comprise the elements of SAu55 and SPd344.

Au55(PH3)12:
i   | 1   | 2    | 3   | 4
PGC | Ag  | Eg   | Au  | Eu
Si  | 782 | 1556 | 782 | 1560

Pd344:
i   | 1   | 2   | 3   | 4   | 5   | 6    | 7   | 8   | 9   | 10
PGC | A1g | A2g | Eg  | T1g | T2g | A1u  | A2u | Eu  | T1u | T2u
Si  | 199 | 317 | 513 | 838 | 956 | 1110 | 317 | 471 | 785 | 956
The test platform was an SGI Altix 4700, installed at the Leibniz-Rechenzentrum München, Germany. The system uses Intel Itanium2 Montecito dual cores as CPUs and an SGI NUMAlink 4 as the underlying network. As numerical library, SGI's SCSL was used to provide BLAS, LAPACK, BLACS and ScaLAPACK support. For further details on the hardware and software specification, please refer to [21].

We performed time and load measurements using the VampirTrace profiling tool [5]. For that purpose, we manually instrumented the code, inserting calls to the VT library. As a result, the two relevant elements of the scheduling algorithm could be measured separately: the execution time of the eigensolvers and the idle time (see below).

Two negative influences on the parallel efficiency can be expected. Firstly, as with virtually every parallel numerical library, the performance of ScaLAPACK does not scale ideally with the number of processors. Consequently, the overall efficiency worsens the more the scheduler parallelizes the given tasks. The second negative influence arises from the scheduling itself. As Fig. 1 shows, a processor can stay idle while waiting for other processors or the whole routine to finish. Those idle times result in load imbalance and hence decreased efficiency.

Figure 3 shows the execution times of a scheduled diagonalization during one SCF cycle. The gap between the predicted and computed execution time is reasonably low (at most ≈ 10%) except for the case p = 1. This shows that the cost function, which generates the values for the predicted makespan, is sufficiently accurate to facilitate the practical use of the scheduling algorithm.
(Figure 3 panels: (a) Au55(PH3)12, (b) Pd344; axes: number of processors vs. time/seconds; curves: predicted, computed, idle time, LPT.)
Fig. 3. Time diagrams of the two test systems Au55(PH3)12 and Pd344. Considered are the execution times of the diagonalization module during one SCF iteration. The curves labeled "predicted" show the predicted makespan of the scheduling algorithm, whereas the curves labeled "computed" provide the real execution time of the scheduled eigensolvers. The curves labeled "idle time" represent the average time during which a processor was not computing. The lines labeled "LPT" indicate the execution time of the sequential LAPACK routine computing the largest matrix from S and thus yield the best possible performance of the previously used LPT-scheduler (see Sect. 1).
The figure also shows a lower bound on the execution time of a sequential scheduler (the "LPT" line). To recapitulate the basic idea of the previously used LPT-scheduler: all matrices are sorted by their size and accordingly scheduled on any processor that becomes available; there the matrix is diagonalized by a sequential LAPACK eigensolver routine (see Fig. 1a). Our new algorithm now breaks this performance bound and improves the execution time beyond it. The diagonalization of the first system, Au55(PH3)12, scales up to 20 processors, whereas the LPT-scheduler can only exploit up to 4 processors. The proposed MPTS scheduler is faster by a factor of about 2 when using 4 processors and by a factor of 8.4 when using 20 processors. For the diagonalization of the second system, Pd344, the execution time improved for up to 28 processors. Compared to the LPT-scheduler, which in this case can only exploit up to 10 processors, the MPTS-scheduler is faster by a factor of 1.6 when using this processor number and by a factor of about 4 when using 28 processors.

The parallel efficiency is given in Fig. 4. For the system Au55(PH3)12, a superlinear speedup was achieved for the cases p = {2, 4, 5, 9}. One can also see for both examples that the idle times of the scheduling can cause a notable loss of efficiency. Further investigations on the scheduling algorithm to reduce those time gaps would thus be an opportunity to improve the overall efficiency.

One also has to consider the cost of establishing the scheduling. It is difficult to estimate this cost beforehand, but recall that the scheduling, once computed, can be re-used in each recurring SCF cycle, at least 10^3 in a typical application
(Figure 4 panels: (a) Au55(PH3)12, (b) Pd344; axes: number of processors vs. efficiency; curves: computed, no idle time.)
Fig. 4. The parallel efficiency in computing the two test systems Au55 (PH3 )12 and Pd344 . The curves labeled “computed” show the efficiency according to the execution times displayed in Fig. 3. The curve labeled “no idle time” represents the efficiency, where the idle-times of the scheduling are subtracted from the execution times. It thus indicates the parallel efficiency of the ScaLAPACK-routine PDSYEVX as used in the given schedules.
(see Sect. 1). Thus, a reasonable criterion is the number of SCF cycles required to amortize these costs. In the worst case considered, with 10! possible task permutations (see Sect. 3.2), our implementation requires a runtime of ≈ 34 s. This case applies to the second test system, Pd344, as its symmetry group Oh implies a total of 10 matrices. In an example case, where these matrices are scheduled on 10 processors, a performance gain of ≈ 1.46 s could be achieved compared to the sequential LPT-scheduler (see Fig. 3b). Accordingly, the initial scheduling costs are amortized after 24 SCF iterations.
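With the figures quoted above, the amortization criterion reduces to a one-line computation:

```python
import math

# Figures from the text: a one-time scheduling cost of ~34 s for the
# 10! permutation search, and a gain of ~1.46 s per SCF cycle.
scheduling_cost = 34.0
gain_per_cycle = 1.46
cycles_to_amortize = math.ceil(scheduling_cost / gain_per_cycle)  # 24
```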
6 Conclusion and Future Work
We demonstrated how a parallel eigensolver can be used efficiently in quantum chemistry software when the Hamilton matrix has a block-diagonal structure. The scalability of the proposed parallel scheduler has been demonstrated on real chemical systems. The first system, Au55(PH3)12, scales up to 20 processors. For the second system, Pd344, a performance gain could be achieved for up to 28 processors. Compared to the previously used LPT-scheduler, the scalability was significantly improved: at the same processor counts, performance improved by factors of about 2 and 1.6, respectively. With the improved parallelizability, the diagonalization step can now be executed about 8.4 and 4 times faster, respectively. Furthermore, the proposed strategy for the cost function, which relies on empirical records of execution times, provides results accurate enough for the scheduler to work in practice. In summary, the technique presented significantly improves the performance and the scalability of the solution of the generalized eigenvalue problem in
parallel Quantum Chemistry codes. It thus makes an important contribution toward preventing this step from becoming a bottleneck in simulations of large symmetric molecular systems, especially nanoparticles. It will be interesting to explore how an approximate scheduling algorithm like the one of [20] compares with the combinatorial approach proposed here. Adopting such an algorithm would also make the presented technique more versatile, because the number of tasks would no longer be limited. Thus, rare practical cases with significantly more than 10 blocks would be covered as well.
Acknowledgments

The work was funded by the Munich Centre of Advanced Computing (MAC) and the Fonds der Chemischen Industrie.
References

1. Altmann, S.L., Herzig, P.: Point-group theory tables. Clarendon, Oxford (1994)
2. Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., Sorensen, D.: LAPACK user's guide. SIAM, Philadelphia (1992)
3. Belling, T., Grauschopf, T., Krüger, S., Mayer, M., Nörtemann, F., Staufer, M., Zenger, C., Rösch, N.: High performance scientific and engineering computing. In: Bungartz, H.J., Durst, F., Zenger, C. (eds.) Lecture Notes in Computational Science and Engineering, vol. 8, pp. 439–453 (1999)
4. Belling, T., Grauschopf, T., Krüger, S., Nörtemann, F., Staufer, M., Mayer, M., Nasluzov, V.A., Birkenheuer, U., Hu, A., Matveev, A.V., Shor, A.V., Fuchs-Rohr, M.S.K., Neyman, K.M., Ganyushin, D.I., Kerdcharoen, T., Woiterski, A., Majumder, S., Rösch, N.: ParaGauss, version 3.1. Tech. rep., Technische Universität München (2006)
5. Bischof, C., Müller, M.S., Knüpfer, A., Jurenz, M., Lieber, M., Brunst, H., Mix, H., Nagel, W.E.: Developing scalable applications with Vampir, VampirServer and VampirTrace. In: Proc. of ParCo 2007, pp. 113–120 (2007)
6. Blackford, L.S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C., Dongarra, J.: ScaLAPACK user's guide. SIAM, Philadelphia (1997)
7. Blazewicz, J., Ecker, K., Pesch, E., Schmidt, G., Weglarz, J.: Handbook on scheduling: from theory to applications. Springer, Heidelberg (2007)
8. Blazewicz, J., Kovalyov, M.Y., Machowiak, M., Trystram, D., Weglarz, J.: Scheduling malleable tasks on parallel processors to minimize the makespan. Ann. Oper. Res. 129, 65–80 (2004)
9. Decker, T., Lücking, T., Monien, B.: A 5/4-approximation algorithm for scheduling identical malleable tasks. Theor. Comput. Sci. 361(2), 226–240 (2006)
10. Demmel, J., Stanley, K.: The performance of finding eigenvalues and eigenvectors of dense symmetric matrices on distributed memory computers. In: Proc. Seventh SIAM Conf. on Parallel Processing for Scientific Computing, pp. 528–533 (1995)
11. Dunlap, B.I., Rösch, N.: The Gaussian-type orbitals density-functional approach to finite systems. Adv. Quantum Chem. 21, 317–339 (1990)
12. Garey, M., Johnson, D.: Computers and intractability: a guide to the theory of NP-completeness. W.H. Freeman and Company, New York (1979)
13. Graham, R.L.: Bounds for certain multiprocessing anomalies. Bell Syst. Tech. J. 45, 1563–1581 (1966)
14. Graham, R.: Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 263–269 (1969)
15. Hein, J.: Improved parallel performance of SIESTA for the HPCx Phase2 system. Tech. rep., The University of Edinburgh (2004)
16. Jansen, K.: Scheduling malleable parallel tasks: an asymptotic fully polynomial time approximation scheme. Algorithmica 39, 59–81 (2004)
17. Koch, W., Holthausen, M.C.: A chemist's guide to density functional theory. Wiley-VCH, Weinheim (2001)
18. Ludwig, W., Tiwari, P.: Scheduling malleable and nonmalleable parallel tasks. In: SODA 1994, pp. 167–176 (1994)
19. Mounié, G., Rapine, C., Trystram, D.: Efficient approximation algorithms for scheduling malleable tasks. In: SPAA 1999, pp. 23–32 (1999)
20. Mounié, G., Rapine, C., Trystram, D.: A 3/2-approximation algorithm for scheduling independent monotonic malleable tasks. SIAM J. Comp. 37(2), 401–412 (2007)
21. National supercomputer HLRB-II, http://www.lrz-muenchen.de/
22. Steinberg, A.: A strip-packing algorithm with absolute performance bound 2. SIAM J. Comp. 26(2), 401–409 (1997)
23. Turek, J., Wolf, J., Yu, P.: Approximate algorithms for scheduling parallelizable tasks. In: SPAA 1992, pp. 323–332 (1992)
24. Ward, R.C., Bai, Y., Pratt, J.: Performance of parallel eigensolvers on electronic structure calculations II. Tech. rep., The University of Tennessee (2006)
25. Yudanov, I.V., Matveev, A.V., Neyman, K.M., Rösch, N.: How the C–O bond breaks during methanol decomposition on nanocrystallites of palladium catalysts. J. Am. Chem. Soc. 130, 9342–9352 (2008)
26. Yudanov, I.V., Metzner, M., Genest, A., Rösch, N.: Size-dependence of adsorption properties of metal nanoparticles: a density functional study on Pd nanoclusters. J. Phys. Chem. C 112, 20269–20275 (2008)
Scalable Parallelization Strategies to Accelerate NuFFT Data Translation on Multicores

Yuanrui Zhang (1), Jun Liu (1), Emre Kultursay (1), Mahmut Kandemir (1), Nikos Pitsianis (2,3), and Xiaobai Sun (3)

(1) Pennsylvania State University, University Park, USA
{yuazhang,jxl1036,euk139,kandemir}@cse.psu.edu
(2) Aristotle University, Thessaloniki, Greece
(3) Duke University, Durham, USA
{nikos,xiaobai}@cs.duke.edu
Abstract. The non-uniform FFT (NuFFT) has been widely used in many applications. In this paper, we propose two new scalable parallelization strategies to accelerate the data translation step of the NuFFT on multicore machines. Both schemes employ geometric tiling and binning to exploit data locality, and use recursive partitioning and scheduling with dynamic task allocation to achieve load balancing. The experimental results collected from a commercial multicore machine show that, with the help of our parallelization strategies, the data translation step is no longer the bottleneck in the NuFFT computation, even for large data set sizes, with any input sample distribution.
1 Introduction
The non-uniform FFT (NuFFT) [2] [7] [15] has been widely used in many applications, including synthetic radar imaging [16], medical imaging [13], telecommunications [19], and geoscience and seismic analysis [6]. Unlike the Fast Fourier Transform (FFT) [4], it allows the sampling in the data or frequency space (or both) to be unequally-spaced or non-equispaced. To achieve the same O(N log N) computational complexity as the FFT, the NuFFT translates the unequally-spaced samples to equally-spaced points, and then applies the FFT to the translated Cartesian grid; the complexity of the first step, named data translation or re-sampling, is linear in the size of the sample ensemble.

Despite the lower arithmetic complexity compared to the FFT, the data translation step has been found to be the most time-consuming part in computing the NuFFT [18]. The reason lies in its irregular data access pattern, which significantly deteriorates the memory performance on modern parallel architectures [5]. Furthermore, as data translation is essentially a matrix-vector multiplication with an irregular and sparse matrix, its intrinsic parallelism cannot be readily exploited by conventional compiler-based techniques [17] that work well mostly for regular dense matrices. Many existing NuFFT algorithms [2] [8] [14] [12] try to reduce the complexity of data translation while maintaining desirable accuracy through mathematical methods, e.g., by designing different kernel

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 125–136, 2010. © Springer-Verlag Berlin Heidelberg 2010
functions. In a complementary effort, we attempt to improve the performance of data translation through different parallelization strategies that take into account the architectural features of the target platform, without compromising accuracy.

In our previous work [20], we developed a tool that automatically generates a fast parallel NuFFT data translation code for user-specified multicore architecture and algorithmic parameters. This tool consists of two major components. The first one applies an architecture-aware parallelization strategy to the input samples, rearranging them in off-chip memory or in a data file. The second one instantiates a parallel C code based on the derived parallel partitions and schedules, using a pool of codelets for various kernel functions. The key to the success of this tool is its parallelization strategy, which directly dictates the performance of the output code. The scheme we developed in [20] has generated significant improvements for the data translation computation, compared to a more straightforward approach which does not consider the architectural features of the underlying parallel platform. However, it is limited to small data set sizes, e.g., 2K × 2K, with non-uniformly distributed samples; the parallelization takes an excessively long time to finish when the data size is large. This slow-down is mainly due to the use of recursive geometric tiling and binning during parallelization, which is intended to improve the data locality (cache behavior) of the translation code.

To overcome this drawback, in this paper, we design and experimentally evaluate two new scalable parallelization strategies that employ equally-sized tiling and binning to cluster the unequally-spaced samples, for both uniform and non-uniform distributions. The first strategy is called the source driven parallelization, and the second one is referred to as the target driven parallelization.
Both strategies use dynamic task allocation to achieve load balancing, instead of the static approach employed in [20]. To guarantee mutual exclusion in data updates during concurrent computation, the first scheme applies a special parallel scheduling, whereas the second one employs a customized parallel partitioning. Although both schemes have comparable performance for data translation with uniformly distributed sources, the target driven parallelization outperforms the other when the input has non-uniformly distributed sources, especially on a large number of cores, in which case synchronization overheads become significant in the first scheme. We conducted experiments on a commercial multicore machine and compared the execution time of the data translation step with the FFT from FFTW [11]. The collected results demonstrate that, with the help of our proposed parallelization strategies, the data translation step is no longer the primary bottleneck in the NuFFT computation, even for non-uniformly distributed samples with large data set sizes.

The rest of the paper is organized as follows. Section 2 explains the NuFFT data translation algorithm and its basic operations. Section 3 describes our proposed parallelization strategies in detail. Section 4 presents the results from our experimental analysis, and finally, Section 5 gives the concluding remarks.
2 Background
2.1 Data Translation Algorithm
Data translation algorithms vary in the type of data sets they target and the type of re-gridding or re-sampling methods they employ. In this paper, we focus primarily on convolution-based schemes for data translation, as represented by Eq. (1) below, where the source samples S are unequally-spaced, e.g., in the frequency domain, and the target samples T are equally-spaced, e.g., in an image domain. The dual case with equally-spaced sources and unequally-spaced targets can be treated in a symmetrical way.

v(T) = C(T, S) · q(S).    (1)

In Eq. (1), q(S) and v(T) denote the input source values and the translated target values, respectively, and C represents the convolution kernel function. The set S of source locations can be provided in different ways. In one case, the sample coordinates are expressed and generated by closed-form formulas, as in the case of sampling on a polar grid (see Figure 1 (a)). Alternatively, the coordinates can be provided in a data file as a sequence of coordinate tuples generated from a random sampling (see Figure 1 (b)). The range and space intervals of the target Cartesian grid T are specified by the algorithm designer, and they can be simplified into a single oversampling factor [7] since the samples are uniformly distributed. With an oversampling factor of α, the relationship between the number of sources and targets can be expressed as |T| = α|S|. The convolution kernel C is obtained either by closed-form formulas or numerically. Examples for the former case are the Gaussian kernel and the central B-splines [14] [7], whereas examples for the latter case are functions obtained numerically according to the local least-square criterion [14] and the min-max criterion [8]. In each case, we assume that the function evaluation routines are provided. In addition, the kernel function is of local support, i.e., each source is only involved in the convolution computation with targets within a window, and vice versa. Figure 1 (c) shows an example for a single source in the 2D case, where the window has a side length of w.
Fig. 1. Illustration of different sampling schemes and the local convolution window: (a) the sampling on a polar grid; (b) a random sampling; (c) the convolution window for a source, with side length w
2.2 Basic Data Translation Procedure
The above data translation algorithm can be described by the following pseudo code:

for each source si in S
    for each target tj in T ∩ window(si)
        V(tj) += c(tj, si) × Q(si)

where Q is a one-dimensional array containing the source values regardless of the geometric dimension, V is a d-dimensional array holding the target values, and c represents a kernel function over T × S, e.g., the Gaussian kernel e^(|tj − si|)/σ, where |tj − si| denotes the distance between a target and a source. The target coordinates are generated on demand during the computation based on the source coordinates, the oversampling value (α), and the range of the data space, e.g., L × L in the 2D case. In the pseudo code, the outer loop iterates over sources, because it is easy to find the specific targets for a source within the window, but not vice versa. We name this type of computation the source driven computation. The alternative is the target driven computation, whose outer loop iterates over targets. The complexity of the code is O(w^d × |S|); however, since w is usually very small (in other words, w^d is near constant), the data translation time is linear in the number of sources |S|.
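As an illustration, a 1D variant of the source driven computation might look as follows; the Gaussian kernel, the σ parameter, and the window handling are illustrative assumptions of ours, not the paper's exact choices.

```python
import math

def translate_1d(sources, values, n_targets, length, w, sigma=1.0):
    """Source-driven data translation in 1D: convolve unequally-spaced
    sources onto an equally-spaced grid of n_targets points over
    [0, length).  Kernel and sigma are illustrative choices."""
    V = [0.0] * n_targets
    h = length / n_targets                    # target grid spacing
    for s, q in zip(sources, values):
        # target indices inside the local window of half-width w/2
        lo = max(0, math.ceil((s - w / 2) / h))
        hi = min(n_targets - 1, math.floor((s + w / 2) / h))
        for j in range(lo, hi + 1):
            t = j * h
            V[j] += math.exp(-((t - s) ** 2) / sigma ** 2) * q
    return V
```

The inner loop visits only the O(w/h) grid points inside each source's window, which is what makes the total cost linear in |S|.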
3 Data Translation Parallelization

3.1 Geometric Tiling and Binning
The convolution-based data translation is essentially a matrix-vector multiplication with sparse and irregular matrices. While the sparsity stems from the local window effect of the kernel function, the irregularity is caused by the unequally-spaced sources. In this case, the conventional tiling on dense and regular matrices [17] cannot help to achieve high data reuse. For instance, the source samples s1 and s2 from the same tile, as shown in Figure 2 (a), may update different targets located far away from each other in the data space, as indicated by Figure 2 (b). To exploit target reuse, geometric tiling [3] is employed to cluster the sources into cells/tiles based on their spatial locations. The tiles can be equally-sized (Figure 3 (a)) or unequally-sized (Figure 3 (b)), with the latter obtained through an adaptive tiling based on the sample distribution. In either case, the basic data translation procedure can be expressed as:

for each non-empty source tile Tk in S
    for each source si in Tk
        for each target tj in T ∩ window(si)
            V(tj) += c(tj, si) × Q(si)

Associated with tiling is a process called binning. It reshuffles the source data in the storage space, e.g., external memory or file, according to tiles. In terms of data movements, the complexity of an equally-sized tiling and binning is O(|S|),
Scalable Parallelization Strategies to Accelerate NuFFT Data Translation
Fig. 2. Conventional tiling: (a) the conventional tiling on the convolution matrix; (b) two geometrically separate sources and their affected targets
Fig. 3. Geometric tiling (with overlapped target windows): (a) equally-sized geometric tiling with uniformly distributed sources; (b) adaptive geometric tiling with non-uniformly distributed sources
whereas the cost of an adaptive recursive tiling and binning is O(|S| log |S|). The latency of the latter increases dramatically when the data set size is very large, as observed in [20]. Consequently, in this work we adopt equally-sized tiling irrespective of the source distribution, i.e., uniform or non-uniform, as illustrated in Figure 3. In this way, the non-Cartesian grid of sources is transformed into a Cartesian grid of tiles before any parallel partitioning and scheduling is applied, and this separation makes our parallelization strategies scalable with the data set sizes.
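A minimal sketch of the equally-sized tiling and binning pass follows. The square data space and the tile-count parameter are illustrative assumptions; the actual tool bins into external storage:

```python
def bin_sources(sources, L, tiles_per_dim):
    """Equally-sized geometric tiling and binning (2D): one pass over the
    sources (O(|S|)) assigns each to its tile; flattening the bins then
    yields the tile-contiguous storage order that binning writes out."""
    side = L / tiles_per_dim
    bins = [[] for _ in range(tiles_per_dim * tiles_per_dim)]
    for idx, (x, y) in enumerate(sources):
        tx = min(int(x / side), tiles_per_dim - 1)
        ty = min(int(y / side), tiles_per_dim - 1)
        bins[tx * tiles_per_dim + ty].append(idx)
    # tile-contiguous order of source indices
    return [i for b in bins for i in b], bins
```

Each source is examined exactly once, which is why equally-sized binning costs O(|S|) data movements, in contrast to the O(|S| log |S|) of the adaptive recursive variant.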
3.2 Parallelization Strategies
To parallelize the data translation step, two critical issues need to be considered: mutual exclusion of data updates and load balancing. On the one hand, a direct parallelization of the source loop of the code in Section 2.2, or of the source tile loop of the code in Section 3.1, may lead to incorrect results, as threads on different cores may attempt to update the same target when they process geometrically nearby sources concurrently. Although parallelizing the inner target loop can avoid this, it would cause significant synchronization overheads. On the other hand, a simple equally-sized parallel partitioning of the data space may lead to unbalanced workloads across multiple processors when the input sources are non-uniformly distributed.

To address these issues, we have designed two parallelization strategies that aim at accelerating data translation on emerging multicore machines with on-chip caches. One of these strategies is called the source driven parallelization, and the other is referred to as the target driven parallelization. They are intended to be used in the context of the source driven computation. To ensure mutual exclusion of target updates, the first strategy employs a special parallel scheduling, whereas the second applies a customized parallel partitioning. Both strategies use a recursive approach with dynamic task allocation to achieve load balance across the cores of the target architecture.
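The mutual exclusion problem can be illustrated with a small 1D sketch (the helper below is hypothetical, not from the paper's code): two geometrically nearby sources have overlapping target windows, so processing them on different threads would race on the shared V entries, while distant sources do not conflict:

```python
def window_targets(s, h, w, n):
    """Indices of the grid targets inside the w-wide window of a 1D
    source s (grid spacing h, n targets total)."""
    i0 = max(0, int(s / h) - w // 2)
    return set(range(i0, min(n, i0 + w)))

# Nearby sources share targets (potential race); distant ones do not.
near_a = window_targets(1.0, 1.0, 3, 16)
near_b = window_targets(1.6, 1.0, 3, 16)
far_c = window_targets(10.0, 1.0, 3, 16)
```

This is exactly why the source driven strategy must keep adjacent source blocks from running in the same step, and why the target driven strategy partitions non-overlapping target blocks instead.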
Y. Zhang et al.
Fig. 4. One-dimensional partition with a 2-step scheduling for the 2D case
Fig. 5. Two-dimensional partition with a 4-step scheduling for the 2D case
1) Source Driven Parallelization

For a data space containing both sources and targets, an explicit partitioning of the sources induces an implicit partitioning of the targets, and vice versa, because of the local convolution window effect. The source driven parallelization carries out parallel partitioning and scheduling in the source domain, whereas the target driven parallelization operates on the targets. In both cases, the source domain has been transformed into a Cartesian grid of tiles through geometric tiling and binning (inside each tile, the samples are still unequally spaced).

When partitioning the sources, a special scheduling is indispensable to guarantee the mutual exclusion of target updates. Consider Figure 4 and Figure 5 for example. No matter how the sources in a 2D region are partitioned, e.g., using a one-dimensional or a two-dimensional partition, adjacent blocks cannot be processed at the same time, because they may affect overlapping targets, as indicated by the dashed lines. However, if the neighboring source regions are further divided into smaller blocks according to the target overlapping patterns, a scheduling can be found that processes those adjacent source regions in several steps, where at each step the sub-blocks with the same number (indicated in the figures) are executed concurrently, and a synchronization takes place when moving from one step to the next. Our observation is that a one-dimensional partition needs a 2-step scheduling to eliminate the contention, whereas a two-dimensional partition needs 4 steps. In general, an x-dimensional partition (1 ≤ x ≤ d) requires a 2^x-step scheduling to ensure correctness.

Based on this observation, our source driven parallelization scheme is designed as follows. Given m threads and a d-dimensional data space, first factorize m into p1 × p2 × ... × pd, then divide dimension i into 2pi segments (1 ≤ i ≤ d), which results in 2p1 × 2p2 × ... × 2pd blocks in the data space, and finally schedule every 2^d neighboring blocks using the same execution-order pattern. Figure 6 (a) illustrates an example for the 2D case with m = 16 and p1 = p2 = 4. The blocks having the same time stamp (number) can be processed concurrently, provided that the window side length w is less than any side length of a block. In the case where m is very large and this window size condition no longer holds, some threads are dropped to decrease m until the condition is met. Although a similar d′-dimensional partition (d′ < d) with a 2^d′-step scheduling can also be used for a d-dimensional data space, e.g., a one-dimensional partition for the 2D
Fig. 6. Illustration of non-recursive source driven parallelization in the 2D case: (a) a two-dimensional partition with a 4-step scheduling for m = 16, p1 = p2 = 4; (b) a one-dimensional partition with a 2-step scheduling for m = 16
Fig. 7. Illustration of recursive source driven parallelization in the 2D case: (a) a two-level recursive partitioning and scheduling for m = 4, p1 = p2 = 2; (b) the scheduling table for the source blocks obtained in (a), with 10 synchronization steps
case, as shown in Figure 6 (b), it is not as scalable as the d-dimensional partition when m is huge.

The scheme explained so far works well with uniformly distributed sources, but not with non-uniformly distributed ones, as it can cause unbalanced workloads in the latter case. However, a slight modification fixes this problem. Specifically, since the amount of computation in a block is proportional to the number of sources it contains, the above scheme can be applied recursively to the source space until the number of sources in a block is less than a preset value (δ); the recursion must ensure that the window size condition is not violated. Figure 7 (a) depicts an example of two-level recursive partitioning for the 2D case, where m = 4, p1 = p2 = 2, and the shaded regions are assumed to have dense samples. The corresponding concurrent schedule is represented by the table shown in Figure 7 (b). A synchronization takes place after processing each group of blocks pointed to by a table entry. However, within each
group, there is no fixed processing order for the blocks, which are dynamically allocated to the threads at run time. The selection of the threshold δ has an impact on the recursion depth as well as on the synchronization latency. A smaller value of δ usually leads to deeper recursion and higher synchronization overheads, but it is also expected to yield better load balance. Thus, there is a tradeoff between minimizing synchronization overheads and balancing workloads, and careful selection of δ is important for the overall performance.

2) Target Driven Parallelization

We have also designed a target driven parallelization strategy that performs recursive partitioning and scheduling in the target domain, which has the advantage of requiring no synchronization. Given m threads and a d-dimensional data space, this scheme employs a 2^d-branch geometric tree to divide the target space into blocks recursively until the number of sources associated with each block is below a threshold (σ), and then uses a neighborhood traversal to obtain an execution order for those blocks, based on which it dynamically allocates their corresponding sources to the threads at run time without any synchronization. Figure 8 (a) shows an example for the 2D case with a quadtree partition [9] and its neighborhood traversal. Since the target blocks are non-overlapping, there is no data update contention. Although the sources associated with adjacent blocks overlap, there is no coherence issue, as the sources are only read from the external memory. The neighborhood traversal helps improve data locality during the computation by exploiting source reuse. The threshold σ is set to be less than |S|/m and greater than the maximum number of sources in a tile. A smaller value of σ usually results in more target blocks and more duplicated source accesses because of the overlapping, but it is also expected to exhibit better load balance, especially with non-uniformly distributed inputs.
Therefore, concerning the selection of σ, there is a tradeoff between minimizing memory access overheads and balancing workloads. In addition, finding the sources associated with a particular target block is not as easy as the other way around: typically, one needs to compare the coordinates of each source with the boundaries of the target block, which takes O(|S|) time. Our parallelization scheme reduces this time to a constant by aligning the window of a target block with the Cartesian grid of tiles, as depicted in Figure 8 (b). Although this method attaches irrelevant sources to each block, the execution overheads it introduces can be reduced by choosing a proper tile size.
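A sketch of the recursive target-domain splitting for the quadtree (d = 2) case follows. The count_sources callback and the fixed depth-first child order are illustrative assumptions; the actual tool obtains the per-block source count from the aligned tile window rather than by scanning all sources:

```python
def quadtree_partition(block, count_sources, sigma, min_size=1.0):
    """Split a square target block (x0, y0, size) into four children
    until the number of sources associated with it drops below sigma.
    The depth-first child order keeps geometric neighbors close
    together, approximating the neighborhood traversal of Fig. 8 (a)."""
    x0, y0, size = block
    if count_sources(block) < sigma or size <= min_size:
        return [block]                       # leaf block
    h = size / 2.0
    leaves = []
    for cx, cy in ((x0, y0), (x0 + h, y0), (x0 + h, y0 + h), (x0, y0 + h)):
        leaves += quadtree_partition((cx, cy, h), count_sources, sigma, min_size)
    return leaves
```

Dense regions are split deeper than sparse ones, so each leaf carries a comparable amount of work, which is what makes the subsequent dynamic allocation of leaves to threads load-balanced without any synchronization.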
4 Experimental Evaluation
We implemented these two parallelization strategies in our tool [20], and evaluated them experimentally on a commercial multicore machine. Two sample inputs are used, both of which are generated in a 2D region of 10 × 10, with a data set size of 15K × 15K. One contains random sources that are uniformly distributed in the data space, whereas the other has samples generated on a
Fig. 8. Example of target driven parallelization in the 2D case: (a) an example of a quadtree partition in the target domain and the neighborhood traversal among target blocks; (b) an illustration of the aligned window to the source tiles for a target block. In particular, (b) shows the Cartesian grid of tiles and a target partition with bold lines, where the window of a target block, aligned to the tiles, is indicated in red.
polar grid with a non-uniform distribution. The kernel is the Gaussian function, and the oversampling (α) is 2.25. The convolution window side length w is set to 5 in terms of the number of targets affected in each dimension; in this case, each source is involved in the computation with 25 targets, and the total number of targets is 22.5K × 22.5K. The underlying platform is an Intel Harpertown multicore machine [1], which features two quad-core processors operating at a frequency of 3 GHz, 8 private L1 caches of size 32 KB, and 4 shared L2 caches of size 6 MB, each connected to a pair of cores.

We first investigated the relationship between the tile size and the cache size (both L1 and L2) on the target platform, and analyzed the impact of the tile size on the performance of binning (the most time-consuming process in the parallelization phase) and data translation. The tile size is determined based on the cache size, so that all the sources and their potentially affected targets in each tile are expected to fit in the cache space. For non-uniform distributions, the tile size is calculated based on the assumption of a uniform distribution.

Figure 9 shows the execution time of binning and data translation for the input with uniformly distributed sources (random samples), using the source driven parallelization strategy. The cache size varies from 16 KB to 12 MB. The four groups depicted from left to right are the results of our experiments with 1 core, 2 cores, 4 cores, and 8 cores, respectively. For binning, the time decreases as the cache size (or tile size) increases, irrespective of the number of cores used. The reason is that each tile employs an array data structure to keep track of its sources; when the tile size increases, or equivalently, the number of tiles decreases, it is more likely that the data structure of the corresponding tile is in the cache when binning a source, i.e., the cache locality is better. Hence, fewer tiles make binning faster.
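The cache-driven tile sizing can be sketched as follows. This is a hedged illustration under the uniform-density assumption mentioned above; the 8-byte samples, the halo term, and the fixed-point refinement are our assumptions, not the paper's formula:

```python
from math import sqrt

def tile_side(cache_bytes, alpha, w, bytes_per_sample=8, density=1.0):
    """Rough tile side length (in data-space units) such that a tile's
    sources plus their potentially affected targets (including the
    window halo) are expected to fit in a cache of the given size.
    density is the expected number of sources per unit area (a uniform
    distribution is assumed, even for non-uniform inputs)."""
    side = sqrt(cache_bytes / bytes_per_sample)  # initial guess, halo ignored
    for _ in range(20):                          # fixed-point refinement
        sources = density * side * side          # sources in the tile
        targets = (alpha * side + w) ** 2        # tile targets plus halo
        footprint = (sources + targets) * bytes_per_sample
        side *= sqrt(cache_bytes / footprint)
    return side
```

An estimate of this kind would be evaluated per core, e.g., against half of a 6 MB L2 shared by two cores, which is consistent with the observed optimum of roughly half the per-core L2 share reported below.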
In contrast, the data translation becomes slower when the tile size increases, since the geometric locality of the sources in each tile worsens, which in turn reduces the target reuse. When binning and data translation are considered together, we can see that there is a lowest point (minimum execution time) in each group, around the tile size of 1.5 MB, half of the L2 cache size per core (3 MB) in the Harpertown processor. A similar behavior can also be observed for
Fig. 9. Execution time of binning and data translation under uniform distribution, as cache (tile) size varies
Fig. 10. Execution time of binning and data translation under non-uniform distribution, as cache (tile) size varies
the input with non-uniformly distributed sources (samples on the polar grid), as shown in Figure 10, but with a shifted minimum.

We then evaluated the efficiency of our two parallelization strategies, using the best tile size found in the first group of experiments. Figure 11 and Figure 12 show their performance with tuned δ and σ, respectively, for non-uniformly distributed sources. We can see that, in both plots, the execution time of data translation scales well with the number of cores, due to improved data locality; however, the execution time of geometric tiling and binning decreases much more slowly as the number of cores increases, as this process involves only physical data movements in memory. In particular, the two execution times become very close to each other on 8 cores. This indicates that our proposed parallelization strategies are suitable for a pipelined NuFFT at this point, where the three steps of the NuFFT, namely parallelization, data translation, and FFT, are expected to have similar latencies in order to achieve balanced pipeline stages for streaming applications.

Furthermore, the two parallelization strategies have comparable performance for data translation with uniformly distributed sources, as shown by the group of bars on the left in Figure 13; however, the target driven strategy outperforms the source driven one by 13% on 4 cores and 37% on 8 cores with non-uniformly distributed sources, as depicted by the group of bars on the right in Figure 13, where the respective data translation times using the source and target driven schemes are 23.1 and 20.1 seconds on 4 cores, and 17.1 and 10.8 seconds on 8 cores. This difference is expected to become more pronounced as the number of cores increases, since there will be more synchronization overheads in the source driven scheme.
We also conducted a performance comparison between the data translation using the target driven parallelization strategy on the input samples and the FFT obtained from FFTW with the "FFTW_MEASURE" option [10,11] on the translated target points. Figure 14 presents the results collected with non-uniformly distributed sources. The graph shows that the execution times of data translation and FFT are comparable on 1, 2, 4, and 8 cores. In particular, the data translation becomes faster than the FFT as the number of cores increases. This good performance indicates that, with our parallelization strategies, the data translation step is no longer the bottleneck in the NuFFT computation.
Fig. 11. Performance (parallelization phase, data translation step, and total execution time) of the source driven parallelization with non-uniformly distributed sources

Fig. 12. Performance (parallelization phase, data translation step, and total execution time) of the target driven parallelization with non-uniformly distributed sources

Fig. 13. Data translation time with the two parallelization strategies, for uniform and non-uniform distributions, respectively

Fig. 14. Execution time of the data translation using target driven parallelization and the FFT from FFTW

5 Concluding Remarks
In this work, we proposed two new parallelization strategies for the NuFFT data translation step. Both schemes employ geometric tiling and binning to exploit data locality, and use recursive partitioning and scheduling with dynamic task allocation to achieve load balance on emerging multicore architectures. To ensure mutual exclusion of data updates during concurrent computation, the first scheme applies a special parallel scheduling, whereas the second one employs a customized parallel partitioning. Our experimental results show that the proposed parallelization strategies work well with large data set sizes, even for non-uniformly distributed input samples, helping the NuFFT achieve good performance for data translation on multicores.
Acknowledgement. This research is supported in part by NSF grants CNS 0720645, CCF 0811687, OCI 821527, CCF 0702519, CNS 0720749, and a grant from Microsoft.
References

1. http://www.intel.com
2. Beylkin, G.: On the fast Fourier transform of functions with singularities. Applied and Computational Harmonic Analysis 2, 363–381 (1995)
3. Chen, G., Xue, L., et al.: Geometric tiling for reducing power consumption in structured matrix operations. In: Proceedings of IEEE International SOC Conference, pp. 113–114 (September 2006)
4. Cooley, J., Tukey, J.: An algorithm for the machine computation of complex Fourier series. Mathematics of Computation 19, 297–301 (1965)
5. Debroy, N., Pitsianis, N., Sun, X.: Accelerating nonuniform fast Fourier transform via reduction in memory access latency. In: SPIE, vol. 7074, p. 707404 (2008)
6. Duijndam, A., Schonewille, M.: Nonuniform fast Fourier transform. Geophysics 64, 539 (1999)
7. Dutt, A., Rokhlin, V.: Fast Fourier transforms for nonequispaced data. SIAM Journal on Scientific Computing 14, 1368–1393 (1993)
8. Fessler, J.A., Sutton, B.P.: Nonuniform fast Fourier transforms using min-max interpolation. IEEE Transactions on Signal Processing 51, 560–574 (2003)
9. Finkel, R., Bentley, J.: Quad trees: A data structure for retrieval on composite keys. Acta Informatica 4, 1–9 (1974)
10. Frigo, M.: A fast Fourier transform compiler. ACM SIGPLAN Notices 34, 169–180 (1999)
11. Frigo, M., Johnson, S.: FFTW: An adaptive software architecture for the FFT. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384 (May 1998)
12. Greengard, L., Lee, J.Y.: Accelerating the nonuniform fast Fourier transform. SIAM Review 46, 443–454 (2004)
13. Knopp, T., Kunis, S., Potts, D.: A note on the iterative MRI reconstruction from nonuniform k-space data. International Journal of Biomedical Imaging 6, 4089–4091 (2007)
14. Liu, Q., Nguyen, N.: An accurate algorithm for nonuniform fast Fourier transforms (NUFFTs). IEEE Microwave and Guided Wave Letters 8, 18–20 (1998)
15. Liu, Q., Tang, X.: Iterative algorithm for nonuniform inverse fast Fourier transform (NU-IFFT). Electronics Letters 34, 1913–1914 (1998)
16. Renganarayana, L., Rajopadhye, S.: An approach to SAR imaging by means of non-uniform FFTs. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, vol. 6, pp. 4089–4091 (July 2003)
17. Renganarayana, L., Rajopadhye, S.: A geometric programming framework for optimal multi-level tiling. In: Proceedings of ACM/IEEE Conference on Supercomputing, p. 18 (May 2004)
18. Sorensen, T., Schaeffter, T., Noe, K., Hansen, M.: Accelerating the nonequispaced fast Fourier transform on commodity graphics hardware. IEEE Transactions on Medical Imaging 27, 538–547 (2008)
19. Ying, S., Kuo, J.: Application of two-dimensional nonuniform fast Fourier transform (2-D NuFFT) technique to analysis of shielded microstrip circuits. IEEE Transactions on Microwave Theory and Techniques 53, 993–999 (2005)
20. Zhang, Y., Kandemir, M., Pitsianis, N., Sun, X.: Exploring parallelization strategies for NUFFT data translation. In: Proceedings of ESWeek, EMSOFT (2009)
Multicore and Manycore Programming

Beniamino Di Martino and Fabrizio Petrini (Topic Chairs), Siegfried Benkner, Kirk Cameron, Dieter Kranzlmüller, Jakub Kurzak, Davide Pasetto, and Jesper Larsson Träff (Members)
We would like to join the other members of the Program Committee in welcoming you to the Multicore and Manycore Programming topic of Euro-Par 2010. Euro-Par is one of the primary forums where researchers, architects, and designers from academia and industry explore new and emerging technologies in multicore programming and algorithmic development. This year, we received 43 submissions. Each paper was reviewed by at least three reviewers, and we were able to select 17 high-quality regular papers. The Topic Committee members handled the paper review process aiming at a high-quality and timely outcome; each TPC member handled a high load, exceeding 20 papers, providing valuable insight and guidance to improve the quality of the scientific contributions.

The accepted papers discuss very interesting issues. In particular, the paper "Parallel Enumeration of Shortest Lattice Vectors" by M. Schneider and O. Dagdelen presents a parallel version of the shortest lattice vector enumeration algorithm for multi-core CPU systems. The paper "Exploiting Fine-Grained Parallelism on Cell Processors" by A. Prell, R. Hoffmann and T. Rauber presents a hierarchically distributed task pool for task-parallel programming on Cell processors. The paper "Optimized on-chip-pipelined mergesort on the Cell/B.E." by R. Hulten, C. Kessler and J. Keller works out the technical issues of applying the on-chip pipelining technique to a parallel mergesort algorithm for the Cell processor. The paper "A Language-Based Tuning Mechanism for Task and Pipeline Parallelism" by F. Otto, C. A. Schaefer, M. Dempe and W. F. Tichy tackles the issue that auto-tuners for parallel applications require several tuning runs to find optimal values for all parameters, by introducing a language-based tuning mechanism. The paper "Near-optimal placement of MPI processes on hierarchical NUMA architecture" by E. Jeannot and G.
Mercier describes a novel algorithm called TreeMatch that maps processes to resources in order to reduce the communication cost of the whole application. The paper "Multithreaded Geant4: Semi-Automatic Transformation into Scalable Thread-Parallel Software" by X. Dong, G. Cooperman and J. Apostolakis presents the transformation into a scalable thread-parallel version of an application case study, Geant4, a 750,000-line toolkit first designed in the early 1990s. The paper "Parallel Exact Time Series Motifs Discovery" by A. Narang presents novel parallel algorithms for exact motif discovery on multi-core architectures. The paper "JavaSymphony: A Programming and Execution Environment for Parallel and Distributed Many-core Architectures" by M. Aleem, R. Prodan and T. Fahringer proposes a new Java-based programming

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 137–138, 2010.
© Springer-Verlag Berlin Heidelberg 2010
model for shared memory multi-core parallel computers as an extension to the JavaSymphony distributed programming environment. The paper "Adaptive Fault Tolerance for Many-Core based Space-Borne Computing" by H. Zima describes an approach for providing software fault tolerance in the context of future deep-space robotic NASA missions, which will require a high degree of autonomy and enhanced on-board computational capabilities, focusing on introspection-based adaptive fault tolerance. The paper "A Parallel GPU Algorithm for Mutual Information based 3D Nonrigid Image Registration" by V. Saxena, J. Rohrer and L. Gong presents the parallel design and implementation of 3D non-rigid image registration for graphics processing units (GPUs). The paper "A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures" by C. Chen, J. Manzano, G. Gan, G. Gao and V. Sarkar presents an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM Cell architecture in particular. The paper "Maestro: Data Orchestration for OpenCL Devices" by K. Spafford, J. S. Meredith and J. Vetter introduces Maestro, an open source library for automatic data orchestration on OpenCL devices. The paper "Optimized dense matrix multiplication on a manycore architecture" by E. Garcia, I. E. Venetis, R. Khan and G. Gao uses dense matrix multiplication as a case study to present a general methodology for mapping applications to manycore architectures. The paper "Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations" by E. Hermann, B. Raffin, F. Faure, T. Gautier and J. Allard proposes a parallelization scheme for dynamically balancing the workload between multiple CPUs and GPUs. The paper "Long DNA Sequence Comparison on Multicore Architectures" by F. Sánchez, F. A. Ramirez and M.
Valero analyzes how large-scale biological sequence comparison takes advantage of current and future multicore architectures, and investigates which memory organization is more efficient in a multicore environment. The paper "Programming CUDA-based GPUs to simulate two-layer shallow water flows" by M. de la Asunción, J. Miguel Mantas Ruiz and M. Castro describes an accelerated implementation of a first-order well-balanced finite volume scheme for 2D two-layer shallow water systems using GPUs supporting the CUDA programming model and double precision arithmetic. The paper "Scalable Producer-Consumer Pools based on Elimination-Diffraction Trees" by Y. Afek, G. Korland, M. Natanzon and N. Shavit presents new highly distributed pool implementations based on a novel combination of the elimination-tree and diffracting-tree paradigms. Finally, the paper "Productivity and Performance: Improving Consumability of Hardware Transactional Memory through a Real-World Case Study" by H. Wang, G. Yi, Y. Wang and Y. Zou shows how, with well-designed encapsulation, HTM can deliver good consumability for commercial applications.

We would like to take the opportunity to thank the authors who submitted contributions, the Euro-Par chairs Domenico Talia, Pasqua D'Ambra and Mario Guarracino, and the referees, whose highly useful comments and efforts have made this conference and this topic possible.
JavaSymphony: A Programming and Execution Environment for Parallel and Distributed Many-Core Architectures Muhammad Aleem, Radu Prodan, and Thomas Fahringer Institute of Computer Science, University of Innsbruck, Technikerstraße 21a, A-6020 Innsbruck, Austria {aleem,radu,tf}@dps.uibk.ac.at
Abstract. Today, software developers face the challenge of re-engineering their applications to exploit the full power of the new emerging many-core processors. However, a uniform high-level programming model and interface for parallelising Java applications is still missing. In this paper, we propose a new Java-based programming model for shared memory many-core parallel computers as an extension to the JavaSymphony distributed programming environment. The concept of dynamic virtual architecture allows modelling of hierarchical resource topologies ranging from individual cores and multi-core processors to more complex parallel computers and distributed Grid infrastructures. On top of this virtual architecture, objects can be explicitly distributed, migrated, and invoked, enabling high-level user control of parallelism, locality, and load balancing. We evaluate the JavaSymphony programming model and the new shared memory run-time environment for six real applications and benchmarks on a modern multi-core parallel computer. We report scalability analysis results that demonstrate that JavaSymphony outperforms pure Java implementations, as well as other alternative related solutions.
1 Introduction
Over the last 35 years, increasing processor clock frequency was the main technique to enhance the overall processor power. Today, power consumption and heat dissipation are the two main factors which resulted in a design shift towards multi-core architectures. A multi-core processor consists of several homogeneous or heterogeneous processing cores packaged in a single chip [5] with possibly varying computing power, cache size, cache levels, and power consumption requirements. This shift profoundly affects application developers who can no longer transparently rely only on Moore’s law to speedup their applications. Rather, they have to re-engineer and parallelise their applications with user controlled load
This research is partially funded by the “Tiroler Zukunftsstiftung”, Project name: “Parallel Computing with Java for Manycore Computers”.
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 139–150, 2010.
© Springer-Verlag Berlin Heidelberg 2010
balancing and locality control to exploit the underlying many-core architectures and increasingly complex memory hierarchies. The locality of tasks and data has a significant impact on application performance, as demonstrated by [9,11]. The use of Java for scientific and high performance applications has increased significantly in recent years. The Java programming constructs related to threads, synchronisation, remote method invocation, and networking are well suited to exploit the medium to coarse grained parallelism required by parallel and distributed applications. Terracotta [10], Proactive [1], DataRush [4], and JEOPARD [6] are some of the prominent efforts which have demonstrated the use of Java for performance-oriented applications. Most of these efforts, however, do not provide user-controlled locality of tasks and data to exploit the complex memory hierarchies of many-core architectures.

In previous work, we developed JavaSymphony [2] (JS) as a Java-based programming paradigm for conventional parallel and distributed infrastructures such as heterogeneous clusters and computational Grids. In this paper, we extend JS with a new shared memory abstraction for programming many-core architectures. JS's design is based on the concept of dynamic virtual architecture, which allows modelling of hierarchical resource topologies ranging from individual cores and multi-core processors to more complex parallel computers and distributed Grid infrastructures. On top of this virtual architecture, objects can be explicitly distributed, migrated, and invoked, enabling high-level user control of parallelism, locality, and load balancing. The extensions to the run-time environment were performed with minimal API changes, meaning that old distributed JS applications can now transparently benefit from being executed on many-core parallel computers with improved locality.

The rest of the paper is organised as follows. The next section discusses related work.
Section 3 presents the JS programming model for many-core processors, including a shared memory run-time environment, dynamic virtual architectures, and locality control mechanisms. In Section 4, we present experimental results on six real applications and benchmarks. Section 5 concludes the paper.
2 Related Work
ProActive [1] is an open-source Java-based library and parallel programming environment for developing parallel, distributed, and concurrent applications. ProActive provides high-level programming abstractions based on the concept of remote active objects [1], which return future objects after asynchronous invocations. Alongside programming, ProActive provides deployment-level abstractions for applications on clusters, Grids, and multi-core machines. ProActive does not, however, provide user-controlled locality of tasks and objects at the processor or core level.

The JEOPARD [6] project's main goal is to provide a complete hardware and software framework for developing real-time Java applications on embedded systems and multi-core SMPs. The project aims to provide operating system support in the form of system-level libraries, hardware support in the form of Java processors, and tool support related to application development and performance
JavaSymphony: A Programming Environment for Manycores
analysis. Although focused on embedded multi-core systems, the API includes functionality for multi-core and NUMA parallel architectures.

Terracotta [10] is a Java-based open-source framework for application development on multi-cores, clusters, Grids, and Clouds. Terracotta uses a JVM clustering technique whereby the application developer sees a combined view of the JVMs. Terracotta targets enterprise and Web applications and does not provide abstractions to hide concurrency from the application developer, who has to take care of these low-level details.

Pervasive DataRush [4] is a Java-based high-performance parallel solution for data-driven applications, such as data mining, text mining, and data services. A DataRush application consists of data flow graphs, which represent data dependencies among different components. The run-time system executes the data flow graphs and handles all underlying low-level details related to synchronisation and concurrency.

Most of these related works either prevent the application developer from controlling the locality of data and tasks, or engage the developer in time-consuming and error-prone low-level parallelisation details of the Java language such as socket communication, synchronisation, remote method invocation, and thread management. High-level user-controlled locality at the application, object, and task level distinguishes JavaSymphony from other Java-based frameworks for performance-oriented applications.
3 JavaSymphony
JavaSymphony (JS) [2] is a Java-based programming paradigm, originally designed for developing applications on distributed cluster and Grid architectures. JS provides high-level programming constructs which abstract low-level programming details and simplify the tasks of controlling parallelism, locality, and load balancing. In this section, we present extensions to the JavaSymphony programming model to support shared memory architectures, ranging from distributed NUMA and SMP parallel computers to modern many-core processors. We provide a unified solution for user-controlled, locality-aware mapping of applications, objects, and tasks on shared and distributed memory infrastructures, with a uniform interface that shields the programmer from the low-level resource access and communication mechanisms.
3.1 Dynamic Virtual Architectures
Most existing Java infrastructures that support performance-oriented distributed and parallel applications hide the underlying physical architecture or assume a single flat hierarchy of a set (array) of computational nodes or cores. This simplified view does not reflect heterogeneous architectures such as multi-core parallel computers or Grids. Often the programmer fully depends on the underlying operating system on shared memory machines, or on local resource managers on clusters and Grids, to properly distribute the data and computations, which results in significant performance losses.
To alleviate this problem, JS introduces the concept of dynamic virtual architecture (VA), which defines the structure of a heterogeneous architecture that may vary from a small-scale multi-core processor or cluster to a large-scale Grid. The VAs are used to control mapping, load balancing, code placement, and migration of objects in a parallel and distributed environment. A VA can be seen as a tree structure, where each node has a certain level that represents a specific resource granularity.

Fig. 1. Four-level locality-aware VA

Originally, JavaSymphony focused exclusively on distributed resources and had no means of specifying hierarchical VAs at the level of shared memory resources (i.e. scheduling of threads on shared memory resources was simply delegated to the operating system). In this paper, we extend the JavaSymphony VA to reflect the structure of shared memory many-core computing resources. For example, Figure 1 depicts a four-level VA representing a distributed memory machine (such as a shared memory NUMA) consisting of a set of SMPs on level 2, multi-core processors on level 1, and individual cores on the leaf nodes (level 0). Lines 6-8 in Listing 1 illustrate the JS statements for creating this VA structure.
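The VA constructor calls referenced above (Listing 1, lines 6-8) suggest that a node carries a level and, optionally, groups of cores below it. As an illustration only (a toy model with hypothetical semantics, not the actual JS VA class), the Figure 1 hierarchy can be modelled as a small tree:

```java
// Toy model of a dynamic virtual architecture (VA) tree.
// Hypothetical reading of the constructors: VA(level, childSizes) creates
// one child per entry, each grouping `size` cores; VA(level) is a container.
class VirtualArch {
    final int level;
    final int cores; // cores grouped directly at this node (0 for containers)
    final java.util.List<VirtualArch> children = new java.util.ArrayList<>();

    VirtualArch(int level) { this(level, 0); }           // empty container node
    VirtualArch(int level, int cores) { this.level = level; this.cores = cores; }
    VirtualArch(int level, int[] childSizes) {           // node with core groups
        this(level, 0);
        for (int s : childSizes) children.add(new VirtualArch(level - 1, s));
    }

    void addVA(VirtualArch sub) { children.add(sub); }

    // Total number of cores (leaves) reachable from this node.
    int totalCores() {
        int n = cores;
        for (VirtualArch c : children) n += c.totalCores();
        return n;
    }
}
```

Mirroring lines 6-8 of Listing 1, a level-3 node built from a 6-core SMP (three dual-core processors) and an 8-core SMP (two quad-core processors) covers 14 cores in total.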
3.2 JavaSymphony Objects
Writing parallel JavaSymphony applications requires encapsulating Java objects into so-called JS objects, which are distributed and mapped onto the hierarchical VA nodes (levels 0 to n). A JS object can be either a single-threaded or a multi-threaded object. A single-threaded JS object is associated with one thread, which executes all invoked methods of that object. A multi-threaded JS object is associated with n parallel threads, all invoking methods of that object. In this paper, we extend JS with a shared memory programming model based on shared JS (SJS) objects. An SJS object can be mapped to a node of level 0-3 of the VA depicted in Figure 1 (see Listing 1, lines 7 and 12), but cannot be distributed onto higher-level remote VA nodes. An SJS object can also be either single-threaded or multi-threaded. Listing 1 (line 12) shows the JS code for creating a multi-threaded SJS object. Three types of method invocation can be used with JS as well as with SJS objects: asynchronous (see Listing 1, line 14), synchronous, and one-sided method invocations.
3.3 Object Agent System
The Object Agent System (OAS), part of the JS run-time system (JSR), manages and processes shared memory jobs for SJS objects and remote memory jobs for JS objects. Figure 2 shows two Object Agents (OAs). An OA is responsible for creating jobs, mapping objects to VAs, and migrating and releasing objects. The shared memory jobs are processed by a local OA, while the remote memory jobs
Fig. 2. The Object Agent System job processing mechanism
are distributed and processed by remote OAs. An OA has a multi-threaded job queue, which contains all jobs related to the multi-threaded JS or SJS objects. A multi-threaded job queue is associated with n job processing threads called Job Handlers. Each single-threaded JS or SJS object has a single-threaded job queue and an associated Job Handler. The results returned by the shared and remote memory jobs are accessed using ResultHandle objects (Listing 1, line 5), which are either local object references in the case of SJS objects, or remote object references in the case of distributed JS objects.
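In plain java.util.concurrent terms (an analogy, not JS code: JobQueue and its ainvoke are names invented here), a single-threaded object's job queue and its Job Handler behave like a single-thread executor, whose futures play the role of ResultHandle objects:

```java
class JobQueue {
    // One handler thread drains this object's job queue in FIFO order,
    // like the single-threaded job queue plus Job Handler of a JS/SJS object.
    private final java.util.concurrent.ExecutorService handler =
            java.util.concurrent.Executors.newSingleThreadExecutor();

    // Asynchronous invocation: enqueue the job and hand back a future
    // result, which stands in for a ResultHandle.
    <T> java.util.concurrent.Future<T> ainvoke(java.util.concurrent.Callable<T> job) {
        return handler.submit(job);
    }

    void shutdown() { handler.shutdown(); }
}
```

A multi-threaded job queue would correspond to a fixed pool of n such handler threads draining one shared queue.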
3.4 Synchronisation Mechanism
JS provides a synchronisation mechanism for one-sided and asynchronous object method invocations. Asynchronous invocations can be synchronised at the individual or at the group level. Group-level synchronisation combines n asynchronous invocations in one ResultHandleSet object (see Listing 1, lines 13-14). Group-level synchronisation may block or examine (without blocking) whether one, a certain number, or all threads have finished processing their methods. JS objects invoked using one-sided method invocations may use barrier synchronisation, implemented using barrier objects. A barrier object has a unique identifier and an integer value specifying the number of threads that will wait on that barrier.
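The group-level semantics can likewise be approximated with futures (again an analogy; ResultGroup is a hypothetical stand-in for the ResultHandleSet class):

```java
class ResultGroup {
    private final java.util.List<java.util.concurrent.Future<?>> handles =
            new java.util.ArrayList<>();

    void add(java.util.concurrent.Future<?> handle) { handles.add(handle); }

    // Block until every asynchronous invocation in the group has finished.
    void waitAll() throws Exception {
        for (java.util.concurrent.Future<?> f : handles) f.get();
    }

    // Non-blocking probe: have all invocations in the group finished?
    boolean allDone() {
        for (java.util.concurrent.Future<?> f : handles)
            if (!f.isDone()) return false;
        return true;
    }
}
```

Waiting for "one or a certain number" of invocations would analogously poll isDone or use a completion service instead of iterating over all handles.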
3.5 Locality Control
The Locality Control Module (LCM) is the part of the JSR that applies and manages locality constraints on the executing JS application by mapping JS objects and tasks onto the VA nodes. In JS, we can specify locality constraints at three levels of abstraction: 1. Application-level locality constraints are applied to all JS or SJS objects and all future data allocations of a JS application. The locality constraints can be
specified with the help of the setAppAffinity static method of the JSRegistry class, as shown in line 9 of Listing 1. 2. Object-level locality constraints are applied to all method invocations and data allocations performed by a JS or SJS object (see line 12 in Listing 1). The object-level locality constraints override any previous application-level constraints for that object. 3. Task-level locality constraints are applied to specific task invocations and override any previous object- or application-level constraints for that task. Mapping an application, object, or task to a specific core constrains the execution to that core; mapping them to a higher-level VA node delegates the scheduling on the inferior VA nodes to the JSR. The LCM, in coordination with the OAS job processing mechanism, applies the locality constraints. The jobs are processed by the Job Handler threads within an OA. The Job Handlers are JVM threads that are executed by system-level threads, such as POSIX threads on a Linux system. The locality constraints are enforced within the Job Handler thread by invoking the appropriate POSIX system calls through the JNI mechanism. First, a unique system-wide thread identifier is obtained for the Job Handler thread with the help of the syscall(__NR_gettid) system call. This unique thread identifier is then used as input to the sched_setaffinity function call to schedule the thread on a certain core of the level 1 VA.
3.6 Matrix Transposition Example
Listing 1 displays the kernel of a simple shared memory matrix transposition application, a simple but widely used kernel in many numerical applications, which interchanges a matrix's rows and columns: A[i, j] = A[j, i].

     1  boolean bSingleThreaded = false; int N = 1024; int np = 4;
     2  int[] startRow = new int[np]; int[][] T = new int[N][N];
     3  int[][] A = new int[N][N]; initializeMatrix(A);
     4  JSRegistry reg = new JSRegistry("MatrixTransposeApp");
     5  ResultHandleSet rhs = new ResultHandleSet();
     6  VA smp1 = new VA(2, new int[]{2, 2, 2});
     7  VA smp2 = new VA(2, new int[]{4, 4});
     8  VA dsm = new VA(3); dsm.addVA(smp1); dsm.addVA(smp2);
     9  JSRegistry.setAppAffinity(dsm);  // Application-level locality
    10  for (int i = 0; i < N; i = i + N / np)
    11      startRow[i / (N / np)] = i;
    12  SJSObject worker = new SJSObject(bSingleThreaded, "workspace.Worker",
            new Object[]{A, T, N}, smp2);  // Object-level locality
    13  for (int i = 0; i < np; i++)
    14      rhs.add(worker.ainvoke("Transpose", new Object[]{i, startRow[i]}), i);
    15  rhs.waitAll();
    16  reg.unregister();

Listing 1. Matrix transposition example

The application first initialises the matrix and some variables (lines 1-3) and registers itself with the JSR (line 4). Then it creates a group-level ResultHandle object rhs and a level 3 VA node dsm (lines 5-8), and specifies the application-level locality (line 9) corresponding to a distributed shared memory NUMA parallel computer. The application then partitions the matrix blockwise among np parallel tasks (lines 10-11). In line 12, a multi-threaded SJS object is created and mapped onto the level 2 VA node smp2. The application then invokes np asynchronous methods, adds the returned ResultHandle objects to rhs (lines 13-14), and waits for all invoked methods to finish their execution (line 15). In the end, the application unregisters itself from the JSR (line 16).
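For comparison, the same blockwise row partitioning can be written with plain Java threads, without the JS runtime (a sketch assuming a square matrix; thread-to-core placement is then left entirely to the operating system):

```java
class BlockTranspose {
    // Transpose rows [startRow, startRow + rows) of A into T.
    static void transposeBlock(int[][] A, int[][] T, int startRow, int rows) {
        for (int i = startRow; i < startRow + rows; i++)
            for (int j = 0; j < A.length; j++)
                T[j][i] = A[i][j];
    }

    // Partition the rows among np worker threads, as Listing 1 does.
    static int[][] transpose(int[][] A, int np) throws InterruptedException {
        int n = A.length;
        int[][] T = new int[n][n];
        Thread[] workers = new Thread[np];
        for (int p = 0; p < np; p++) {
            final int start = p * (n / np);
            workers[p] = new Thread(() -> transposeBlock(A, T, start, n / np));
            workers[p].start();
        }
        for (Thread w : workers) w.join();   // plays the role of waitAll()
        return T;
    }
}
```

Calling BlockTranspose.transpose(A, 4) mirrors the np = 4 decomposition of Listing 1, with ainvoke/waitAll replaced by Thread.start/join.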
4 Experiments
We have developed several applications and benchmarks using the JS shared memory programming model with locality awareness. We used for our experiments a shared memory SunFire X4600 M2 NUMA machine equipped with eight quad-core processors, where each processor has a local memory bank. Each core has private L1 and L2 caches of 128 KB and 512 KB, respectively, and shares a 2 MB L3 cache. Each processor has three cache-coherent HyperTransport links, each supporting up to 8 GB/s of direct inter-processor data transfer.
4.1 Discrete Cosine Transformation
The Discrete Cosine Transformation (DCT) algorithm is used to remove the non-essential information from digital images and audio data. Typically, it is used to compress JPEG images. The DCT algorithm divides the image into square blocks and then applies the transformation to each block to remove the non-essential information. After that, a reverse transformation is applied to produce a restored image, which contains only the essential data.

We developed a shared memory version of JS DCT and compared it with a previous message passing-based implementation [2] in order to test the improvement of the shared memory solution over the old implementation. Figure 3 shows that the shared memory implementation requires approximately 20% to 50% less execution time than the RMI-based distributed implementation. The results validate the scalability of the JS shared memory programming model and also highlight its importance on a multi-core system as compared to the message passing-based model, which introduces costly communication overheads.

Fig. 3. DCT experimental results
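For orientation, the per-block transform itself is the standard 2D DCT-II; a minimal textbook version for one 8 x 8 block (an illustration only, not the JS application's code) is:

```java
class Dct {
    static final int N = 8; // block size used by JPEG-style compression

    // Orthonormal scaling factor of the DCT-II basis.
    static double c(int k) { return k == 0 ? Math.sqrt(1.0 / N) : Math.sqrt(2.0 / N); }

    // Forward 2D DCT-II of one N x N block: spatial samples -> coefficients.
    static double[][] forward(double[][] a) {
        double[][] X = new double[N][N];
        for (int u = 0; u < N; u++) for (int v = 0; v < N; v++) {
            double s = 0;
            for (int x = 0; x < N; x++) for (int y = 0; y < N; y++)
                s += a[x][y] * Math.cos((2 * x + 1) * u * Math.PI / (2 * N))
                             * Math.cos((2 * y + 1) * v * Math.PI / (2 * N));
            X[u][v] = c(u) * c(v) * s;
        }
        return X;
    }

    // Inverse 2D DCT (DCT-III): coefficients -> reconstructed block.
    static double[][] inverse(double[][] X) {
        double[][] a = new double[N][N];
        for (int x = 0; x < N; x++) for (int y = 0; y < N; y++) {
            double s = 0;
            for (int u = 0; u < N; u++) for (int v = 0; v < N; v++)
                s += c(u) * c(v) * X[u][v]
                   * Math.cos((2 * x + 1) * u * Math.PI / (2 * N))
                   * Math.cos((2 * y + 1) * v * Math.PI / (2 * N));
            a[x][y] = s;
        }
        return a;
    }
}
```

A compressor would quantise the coefficients returned by forward (discarding the "non-essential" high frequencies) before applying inverse to restore the block.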
4.2 NAS Parallel Benchmarks: CG Kernel
The NAS parallel benchmarks [3] consist of five kernels and three simulated applications. We used for our experiments the Conjugate Gradient (CG) kernel which
involves irregular long-distance communications. The CG kernel uses the power and conjugate gradient methods to compute an approximation to the smallest eigenvalue of large sparse, symmetric positive-definite matrices. We implemented the kernel in JS using the Java-based implementation of CG [3]. The JS CG implementation is based on the master-worker computational model. The main JS program acts as master and creates multiple SJS worker objects that implement four different methods. The master program asynchronously invokes these methods on the SJS worker objects and waits for them to finish their execution. A single iteration involves several serial computations from the master and many steps of parallel computations from the workers. At the end, the correctness of the results is checked using a validation method.

We compared our locality-aware JS version of the CG benchmark with a pure Java and a ProActive-based implementation. Figure 4(a) shows that JS exhibits better speedup than the Java and ProActive-based implementations for almost all machine sizes. ProActive exhibits the worst performance, since communication among threads is limited to RMI even within shared memory computers. The speedup of the JS implementation for 32 cores is lower than that of the Java-based version because of the overhead induced by the locality control, which becomes significant for the relatively small problem size used. In general, specifying locality constraints for n parallel tasks on an n-core machine does not bring much benefit, but it is still useful for solving problems related to thread migration and to the execution of multiple JVM-level threads by a single native thread. To investigate the effect of locality constraints, we measured the number of cache misses and local DRAM accesses for the locality-aware and non-locality-aware applications.
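The serial core that the master executes between the parallel steps is the classic CG recurrence; a minimal dense-matrix version (for illustration only; the benchmark itself operates on large sparse matrices) is:

```java
class ConjugateGradient {
    // Solve A x = b for a symmetric positive-definite A (dense, for illustration).
    static double[] solve(double[][] A, double[] b, int maxIter, double tol) {
        int n = b.length;
        double[] x = new double[n];
        double[] r = b.clone();      // residual r = b - A x (x starts at 0)
        double[] p = r.clone();      // search direction
        double rr = dot(r, r);
        for (int k = 0; k < maxIter && Math.sqrt(rr) > tol; k++) {
            double[] Ap = mul(A, p);                 // the parallelisable step
            double alpha = rr / dot(p, Ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rrNew = dot(r, r);
            double beta = rrNew / rr;
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rrNew;
        }
        return x;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] mul(double[][] A, double[] v) {
        double[] y = new double[v.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < v.length; j++) y[i] += A[i][j] * v[j];
        return y;
    }
}
```

In the benchmark, the matrix-vector product inside the loop is what the workers compute in parallel; the scalar updates remain with the master.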
Figure 4(b) illustrates that the number of L3 cache misses increased for the locality-aware JS implementation because of contention on the L3 cache shared by the multiple threads scheduled on the same multi-core processor. The locality constraints keep the threads close to the node where the data was allocated, which results in a high number of local memory accesses that significantly boost the overall performance (see Figure 4(c)). Although the JS CG kernel achieves better speedup than all the other versions, the overall speedup is not impressive, because the kernel involves a large number of parallel invocations (11,550 to 184,800 for 2 to 32 cores) to complete
Fig. 4. CG kernel experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses
Fig. 5. Ray tracing experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses
75 iterations. It is a communication-intensive kernel, and the memory access latencies play a major role in the application performance. The non-contiguous data accesses also result in more cache misses. This kernel achieves good speedup up to 4 cores (all local memory accesses), while beyond 8 cores the speedup decreases because of the increased memory access latencies of the threads scheduled on remote cores.
4.3 3D Ray Tracing
The ray tracing application, part of the Java Grande Forum (JGF) benchmark suite [8], renders a scene containing 64 spheres at N x N resolution. We implemented this application in JS using the multi-threaded Java version from the JGF benchmarks. The ray tracing application first creates several ray tracer objects, initialises them with scene and interval data, and then renders at the specified resolution. The interval data points to the rows of pixels a parallel thread will render. The JS implementation parallelises the code by distributing the outermost loop (over rows of pixels) to several SJS objects. The SJS objects render the corresponding rows of pixels and write back the resulting image data. We applied locality constraints by mapping objects to cores close to each other to minimise the memory access latencies.

We experimentally compared our JS implementation with the multi-threaded Java ray tracer from JGF. As shown in Figure 5(a), the JS implementation achieves better speedup for all machine sizes. Figure 5(c) shows that there is a higher number of local memory accesses for the locality-aware JS implementation, which is the main reason for the performance improvement. Figure 5(b) further shows that the locality-aware version has a higher number of L3 cache misses because of contention on this shared resource; however, the performance penalty is significantly lower than the locality gain.
4.4 Cholesky Factorisation
The Cholesky factorisation [7] expresses an N x N symmetric positive-definite matrix A (implying that all the diagonal elements are positive and the non-diagonal elements are not too large) as the product of a triangular matrix L and
Fig. 6. Cholesky factorisation experimental results: (a) speedup, (b) total DRAM accesses
its transpose L^T: A = L x L^T. This numerical method is generally used to calculate the inverse and the determinant of a positive-definite matrix. We developed a multi-threaded Java-based version and a JS-based implementation of the algorithm with locality constraints. The computation of the triangular matrix L is parallelised by distributing the values of a single row among several parallel tasks; in this way, rows are computed one after another. Figure 6(a) shows that the locality-aware JS version achieves better speedup than the Java-based version; however, for machine size 8 both versions show quite similar performance. We observed that, for the Java version, the operating system by default scheduled the threads one router away from the data (using both the right and left neighbour nodes), which matched the locality constraints we applied to the JS version. Figure 6(b) shows that the locality-based version has a lower number of memory accesses than the non-locality-based version due to spatial locality.
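The row-after-row computation described above follows the standard Cholesky recurrence, sketched serially below (in the JS version, the entries of each row are distributed among parallel tasks):

```java
class Cholesky {
    // Return lower-triangular L with A = L * L^T,
    // for a symmetric positive-definite A.
    static double[][] factor(double[][] A) {
        int n = A.length;
        double[][] L = new double[n][n];
        for (int i = 0; i < n; i++) {        // rows are computed one after another
            for (int j = 0; j <= i; j++) {   // entries of row i are independent tasks
                double s = A[i][j];
                for (int k = 0; k < j; k++) s -= L[i][k] * L[j][k];
                L[i][j] = (i == j) ? Math.sqrt(s) : s / L[j][j];
            }
        }
        return L;
    }
}
```

Each entry of row i depends only on previously completed rows and on earlier entries of L's row j, which is why the rows must be computed in order while the work inside a row can be split.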
4.5 Matrix Transposition
We developed a multi-threaded Java-based and a locality-aware JS-based version of the matrix transposition algorithm introduced in Section 3.6. Again, the locality-aware JS version achieved better speedup, as illustrated in Figure 7(a). The locality constraints mapped the threads to all cores of a processor before moving to another node, which resulted in L3 cache contention and, therefore, more cache misses (see Figure 7(b)). The locality constraints also ensure more local and fewer costly remote memory accesses, which produces better application performance (see Figure 7(c)).
4.6 Sparse Matrix-Vector Multiplication
Sparse Matrix-Vector Multiplication (SpMV) is an important kernel used in many scientific applications which computes y = A · x, where A is a sparse matrix and x and y are dense vectors. We developed an iterative version of the SpMV kernel where matrix A was stored in a vector using the Compressed Row Storage format. We used for this experiment a matrix size of 10000 × 10000 and set the number of non-zero elements per row to 1000. We developed both
Fig. 7. Matrix transposition experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses
Fig. 8. SpMV experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses
pure Java and JS versions of this kernel by distributing the rows of the sparse matrix among several parallel threads. The resulting vector y is computed as y_i = sum_{j=1}^{n} a_ij * x_j. SpMV involves indirect and unstructured memory accesses, which negatively affect the pure Java implementation with no locality constraints (see Figure 8(a)), while the locality-aware JS implementation performs significantly better. The vector x is used by all threads and, once the data values from this vector are loaded into the shared L3 cache, they are reused by the other worker threads on that processor. This spatial locality results in fewer cache misses, as shown in Figure 8(b). The locality constraints also ensure fewer remote and more local memory accesses, as shown in Figure 8(c), which further improves the application performance.
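The Compressed Row Storage format mentioned above keeps the non-zeros in three flat arrays; a minimal serial version of the y = A * x kernel (the JS version distributes the outer row loop among threads) is:

```java
class CrsSpmv {
    final double[] val;  // non-zero values, stored row by row
    final int[] colIdx;  // column index of each stored value
    final int[] rowPtr;  // entries rowPtr[i] .. rowPtr[i+1]-1 belong to row i

    CrsSpmv(double[] val, int[] colIdx, int[] rowPtr) {
        this.val = val; this.colIdx = colIdx; this.rowPtr = rowPtr;
    }

    // y_i = sum over the stored non-zeros of row i of a_ij * x_j.
    // The indirect access x[colIdx[k]] is the unstructured memory
    // access pattern discussed in the text.
    double[] multiply(double[] x) {
        int n = rowPtr.length - 1;
        double[] y = new double[n];
        for (int i = 0; i < n; i++)          // this loop is split among threads
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                y[i] += val[k] * x[colIdx[k]];
        return y;
    }
}
```

For the 3 x 3 matrix [[1,0,2],[0,3,0],[4,0,5]], the CRS arrays are val = {1,2,3,4,5}, colIdx = {0,2,1,0,2}, rowPtr = {0,2,3,5}.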
5 Conclusions
We presented JavaSymphony, a parallel and distributed programming and execution environment for many-core architectures. JS's design is based on the concept of dynamic virtual architecture, which allows modelling of hierarchical resource topologies ranging from individual cores and multi-core processors to more complex symmetric multiprocessors and distributed memory parallel computers. On top of this virtual architecture, objects can be explicitly distributed,
migrated, and invoked, enabling high-level user control of parallelism, locality, and load balancing. Additionally, JS provides high-level abstractions to control locality at the application, object, and thread level. We presented a number of experiments with real applications and benchmarks showing that the locality-aware JS implementations outperform conventional technologies such as pure Java, which relies entirely on operating system thread scheduling and data allocation.
Acknowledgements. The authors thank Hans Moritsch for his contribution to earlier stages of this research.
References

1. Denis Caromel, M.L.: ProActive Parallel Suite: From Active Objects-Skeletons-Components to Environment and Deployment. In: Euro-Par 2008 Workshops - Parallel Processing, pp. 423-437. Springer, Heidelberg (2008)
2. Fahringer, T., Jugravu, A.: JavaSymphony: a new programming paradigm to control and synchronize locality, parallelism and load balancing for parallel and distributed computing. Concurr. Comput.: Pract. Exper. 17(7-8), 1005-1025 (2005)
3. Frumkin, M.A., Schultz, M., Jin, H., Yan, J.: Performance and scalability of the NAS parallel benchmarks in Java. In: IPDPS 2003, p. 139a (2003)
4. Norfolk, D.: The growth in data volumes - an opportunity for IT-based analytics with Pervasive DataRush. White Paper (June 2009), http://www.pervasivedatarush.com/Documents/WP%27s/Norfolk%20WP%20-%20DataRush%20%282%29.pdf
5. Kumar, R., Farkas, K.I., Jouppi, N.P., Ranganathan, P., Tullsen, D.M.: Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In: MICRO-36 (2003)
6. Siebert, F.: JEOPARD: Java environment for parallel real-time development. In: JTRES 2008, pp. 87-93. ACM, New York (2008)
7. Siegfried, B., Maria, L., Kvasnicka, D.: Experiments with Cholesky factorization on clusters of SMPs. In: Proceedings of NMCM 2002, pp. 30-31 (July 2002)
8. Smith, L.A., Bull, J.M.: A multithreaded Java Grande benchmark suite. In: Third Workshop on Java for High Performance Computing (June 2001)
9. Song, F., Moore, S., Dongarra, J.: Feedback-directed thread scheduling with memory considerations. In: 16th International Symposium on High Performance Distributed Computing, pp. 97-106. ACM, New York (2007)
10. Terracotta, Inc.: The Definitive Guide to Terracotta: Cluster the JVM for Spring, Hibernate and POJO Scalability. Apress, Berkeley (2008)
11. Yang, R., Antony, J., Janes, P.P., Rendell, A.P.: Memory and thread placement effects as a function of cache usage: A study of the Gaussian chemistry code on the SunFire X4600 M2. In: International Symposium on Parallel Architectures, Algorithms, and Networks, pp. 31-36 (2008)
Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees

Yehuda Afek, Guy Korland, Maria Natanzon, and Nir Shavit
Computer Science Department, Tel-Aviv University, Israel
[email protected]

Abstract. Producer-consumer pools, that is, collections of unordered objects or tasks, are a fundamental element of modern multiprocessor software and a target of extensive research and development. For example, there are three common ways to implement such pools in the Java JDK6.0: the SynchronousQueue, the LinkedBlockingQueue, and the ConcurrentLinkedQueue. Unfortunately, most pool implementations, including the ones in the JDK, are based on centralized structures like a queue or a stack, and are thus limited in their scalability. This paper presents the ED-Tree, a distributed pool structure based on a combination of the elimination-tree and diffracting-tree paradigms, allowing high degrees of parallelism with reduced contention. We use the ED-Tree to provide new pool implementations that compete with those of the JDK. In experiments on a 128-way Sun Maramba multicore machine, we show that ED-Tree based pools scale well, outperforming the corresponding algorithms in the JDK6.0 by a factor of 10 or more at high concurrency levels, while providing similar performance at low levels.
1 Introduction
Producer-consumer pools, that is, collections of unordered objects or tasks, are a fundamental element of modern multiprocessor software and a target of extensive research and development. Pools show up in many places in concurrent systems. For example, in many applications, one or more producer threads produce items to be consumed by one or more consumer threads. These items may be jobs to perform, keystrokes to interpret, purchase orders to execute, or packets to decode. A pool allows push and pop with the usual pool semantics [1]. We call the pushing threads producers and the popping threads consumers. There are several ways to implement such pools. In the Java JDK6.0, for example, they are called "queues": the SynchronousQueue, the LinkedBlockingQueue, and the ConcurrentLinkedQueue. The SynchronousQueue provides a "pairing up" function without buffering; it is entirely symmetric: producers and consumers wait for one another, rendezvous, and leave in pairs. The term unfair refers to the fact that it allows starvation. The other queues provide a buffering mechanism and allow threads to sleep while waiting for their requests

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 151-162, 2010.
© Springer-Verlag Berlin Heidelberg 2010
to be fulfilled. Unfortunately, all these pools, including the new scalable SynchronousQueue of Lea, Scott, and Shearer [2], are based on centralized structures like a lock-free queue or a stack, and are thus limited in their scalability: the head of the stack or queue is a sequential bottleneck and a source of contention.

This paper shows how to overcome this limitation by devising highly distributed pools based on an ED-Tree, a combined variant of the diffracting-tree structure of Shavit and Zemach [3] and the elimination-tree structure of Shavit and Touitou [4]. The ED-Tree does not have a central place through which all threads pass, and thus allows both parallelism and reduced contention. As we explain in Section 2, an ED-Tree uses randomization to distribute the concurrent requests of threads onto many locations so that they collide with one another and can exchange values. It has a specific combinatorial structure called a counting tree [3,5] that allows requests to be properly distributed if such successful exchanges did not occur. As shown in Figure 1, one can add queues at the leaves of the trees so that requests are either matched up or end up properly distributed over the queues at the tree leaves. By "properly distributed" we mean that requests that do not eliminate always end up in the queues: the collection of all the queues together has the behavior of one large queue. Since the nodes of the tree would form a bottleneck in the naive implementation of Figure 1, we replace them with highly distributed nodes that use elimination and diffraction on randomly chosen array locations, as in Figure 2. The elimination and diffraction tree structures were each proposed years ago [4,3] and claimed to be effective through simulation [6]. A single level of an elimination array was also used in implementing shared concurrent stacks [7]. However, elimination trees and diffracting trees were never used to implement real-world structures.
This is mostly due to the fact that there was no need for them: machines with a sufficient level of concurrency and low enough interconnect latency to benefit from them did not exist. Today, multicore machines present the necessary combination of high levels of parallelism and low interconnection costs. Indeed, this paper is the first to show that ED-Tree based implementations of data structures from java.util.concurrent scale impressively on a real machine (a Sun Maramba multicore machine with 2x8 cores and 128 hardware threads), delivering throughput that at high concurrency levels is 10 times that of the existing JDK6.0 algorithms. But what about low concurrency levels? In their elegant paper describing the JDK6.0 SynchronousQueue, Lea, Scott, and Shearer [2] suggest that using elimination techniques may indeed benefit the design of synchronous queues at high loads. However, they wonder whether the benefits of reduced contention achievable by using elimination under high loads can be made to work at lower levels of concurrency, because of the possibility of threads not meeting in the array locations. This paper shows that elimination and diffraction techniques can be combined to work well at both high and low loads. There are two main components that our ED-Tree implementation uses to make this happen. The first is to have each thread adaptively choose an exponentially varying array range from which it
Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees
153
randomly picks a location, and the duration it will wait for another thread at that location. This means that, without coordination, threads will tend to map into a smaller array range as the load decreases, thus increasing the chances of a collision. The second component is the introduction of diffraction for colliding threads that do not eliminate because they are performing the same type of operation. The diffraction mechanism allows threads to continue down the tree at a low cost. The end result is an ED-Tree structure that, as our empirical testing shows, performs well at both high and low concurrency levels.
2
The ED-Tree
Before explaining how the ED-Tree works, let us review its predecessor, the diffracting tree [3] (see Figure 1). Consider a binary tree of objects called balancers, each with a single input wire and two output wires, as depicted in Figure 1. Threads arrive at a balancer, which sends them alternately to its top and bottom output wires, so the top wire always receives the same number of threads as the bottom one, or at most one more. The Tree[k] network of width k is a binary tree of balancers constructed inductively by taking two Tree[k/2] networks of balancers and perfectly shuffling their outputs [3]. As a first step in constructing the ED-Tree, we add to the diffracting tree a collection of lock-free queues at the output wires of the tree leaves. To perform a push, threads traverse the balancers from the root to the leaves and then push the item onto the appropriate queue. In any quiescent state, when there are no
Fig. 1. A Tree[4] [3] leading to 4 lock-free queues. Threads pushing items arrive at the balancers in the order of their numbers, eventually pushing items onto the queues located on their output wires. In each balancer, a pushing thread fetches and then complements the bit, following the wire indicated by the fetched value (if the state is 0, the pushing thread changes it to 1 and continues on the top wire (wire 0); if it was 1, it changes it to 0 and continues on the bottom wire (wire 1)). The tree and queues will end up in the balanced state seen in the figure. The state of the bits corresponds to 5 being the last inserted item, and the next location a pushed item will end up on is the queue containing item 2. Try it! We can add a similar tree structure for popping threads, so that the first will end up on the top queue, removing 1, and so on. This behavior holds for concurrent executions as well: the sequences of values in the queues in all quiescent states, when all threads have exited the structure, can be shown to preserve FIFO order.
Fig. 2. An ED-Tree. Each balancer in Tree[4] is an elimination-diffraction balancer. The start state depicted is the same as in Figure 1, as seen in the pusher’s toggle bits. From this state, a push of item 6 by thread A will not meet any other thread on the elimination-diffraction arrays, and so will toggle the bits and end up on the 2nd queue from the top. Two pops by threads B and C will meet in the top balancer’s array, diffract to the sides, and end up going up and down without touching the bit, popping the first two values, 1 and 2, from the top two lock-free queues. Thread F, which did not manage to diffract or eliminate, will end up as desired on the 3rd queue, returning the value 3. Finally, threads D and E will meet in the top array and “eliminate” each other, exchanging the value 7 and leaving the tree. This is our exception to the FIFO rule: to allow good performance at high loads, we allow threads with concurrent push and pop requests to eliminate and leave, ignoring the otherwise FIFO order.
threads in the tree, the output items are balanced out so that the top queues have at most one more element than the bottom ones, and there are no gaps. One could implement the balancers in a straightforward way using a bit that threads toggle: they fetch the bit and then complement it using a compareAndSet (CAS) operation, exiting on the output wire named by the value they fetched (zero or one). One could keep a second, identical tree for pops, and one would see that from one quiescent state to the next, the items removed are the first ones pushed onto the queue. Thus, we have created a collection of queues that are accessed in parallel, yet act as one quiescent FIFO queue. The bad news is that the above implementation of the balancers using a bit means that every thread entering the tree accesses the same bit at the root balancer, causing that balancer to become a bottleneck. This is true, though to a lesser extent, of balancers lower in the tree. We can parallelize the tree by exploiting a simple observation similar to the one made about the elimination backoff stack: If an even number of threads pass through a balancer, the outputs are evenly balanced on the top and bottom wires, and the balancer’s state remains unchanged.
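As a concrete illustration of the toggle-bit construction just described, here is a hedged Java sketch of our own (not code from the paper): a heap-wired binary tree of CAS-based balancers feeding k queues. Note that the real Tree[k] perfectly shuffles the sub-tree outputs to obtain the exact top-to-bottom step property; this simplified wiring only keeps the queue sizes within one of each other.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a binary tree of CAS-based toggle-bit balancers
// routing pushes onto k FIFO queues. Names and wiring are ours.
class ToggleTree {
    private final AtomicBoolean[] bits;     // one toggle bit per internal balancer
    private final Queue<Integer>[] leaves;  // the lock-free queues of Figure 1
    private final int depth;

    @SuppressWarnings("unchecked")
    ToggleTree(int k) {                     // k must be a power of two
        depth = Integer.numberOfTrailingZeros(k);
        bits = new AtomicBoolean[k - 1];
        for (int i = 0; i < bits.length; i++) bits[i] = new AtomicBoolean(false);
        leaves = (Queue<Integer>[]) new Queue[k];
        for (int i = 0; i < k; i++) leaves[i] = new ArrayDeque<>();
    }

    // Fetch-and-complement; leave on the wire named by the fetched bit.
    private int toggle(int node) {
        while (true) {
            boolean b = bits[node].get();
            if (bits[node].compareAndSet(b, !b)) return b ? 1 : 0;
        }
    }

    void push(int item) {
        int node = 0;                       // root (heap-style indexing)
        for (int level = 0; level < depth; level++)
            node = 2 * node + 1 + toggle(node);
        leaves[node - bits.length].add(item);
    }

    int[] sizes() {
        int[] s = new int[leaves.length];
        for (int i = 0; i < s.length; i++) s[i] = leaves[i].size();
        return s;
    }
}
```

After any number of pushes, the queue sizes differ from one another by at most one, which is the balancing behavior the tree is built to provide.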
The idea behind the ED-Tree is to combine the modified diffracting tree [3] described above with elimination-tree techniques [4]. We use an eliminationArray in front of the bit in every balancer, as in Figure 2. If two popping threads meet in the array, they leave on opposite wires, without any need to touch the bit, as it would anyhow remain in its original state. If two pushing threads meet in the array, they also leave on opposite wires. If a push or pop call does not manage to meet another in the array, it toggles the respective push or pop bit (in this sense it differs from prior elimination and/or diffraction balancer algorithms [4,3], which had a single toggle bit instead of separate ones, and provided LIFO rather than FIFO-like access through the bits) and leaves accordingly. Finally, if a push and a pop meet, they eliminate, exchanging items. It can be shown that all push and pop requests that do not eliminate each other provide quiescently consistent FIFO queue behavior. Moreover, while the worst-case time is log k, where k is the number of lock-free queues at the leaves, in contended cases half the requests are eliminated in the first balancer, another 1/4 in the second, 1/8 in the third, and so on, which converges to an average of 2 steps to complete a push or a pop, independent of k.
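The case analysis just described (eliminate on opposite operation types, diffract on equal types, toggle when alone) can be captured as a small pure function. This is our own illustrative sketch; the Op/Outcome names are assumptions, chosen to anticipate the state names used later in the implementation section.

```java
// Hedged sketch of a balancer's case analysis; not the paper's code.
class BalancerLogic {
    enum Op { PUSH, POP }
    enum Outcome { ELIMINATED, DIFFRACTED0, DIFFRACTED1, TOGGLE }

    // `other` is the operation met in the elimination array, or null if the
    // wait timed out; `wonCoinFlip` picks a wire for two same-type threads.
    static Outcome meet(Op self, Op other, boolean wonCoinFlip) {
        if (other == null) return Outcome.TOGGLE;       // nobody to pair with
        if (self != other) return Outcome.ELIMINATED;   // push met pop: exchange
        return wonCoinFlip ? Outcome.DIFFRACTED0 : Outcome.DIFFRACTED1;
    }
}
```

Only the TOGGLE outcome touches a balancer's bit; the other three leave it untouched, which is exactly what removes the bit as a contention hot spot.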
3
Implementation
As described above, each balancer (see the pseudo-code in Listing 1) is composed of an eliminationArray, a pair of toggle bits, and two pointers, one to each of its child nodes. The last field, lastSlotRange (which supports the adaptive behavior of the elimination array), will be described later in this section.
1 public class Balancer {
2     ToggleBit producerToggle, consumerToggle;
3     Exchanger[] eliminationArray;
4     Balancer leftChild, rightChild;
5     ThreadLocal lastSlotRange;
6 }
Listing 1. A Balancer
The implementation of a toggle bit, as shown in Listing 2, is based on an AtomicBoolean, which provides a CAS operation. To access it, a thread fetches the current value (Line 5) and tries to atomically replace it with the complementary value (Line 6). In case of a failure, the thread retries (Line 6).
1 AtomicBoolean toggle = new AtomicBoolean(true);
2 public boolean toggle() {
3     boolean result;
4     do {
5         result = toggle.get();
6     } while (!toggle.compareAndSet(result, !result));
7     return result;
8 }
Listing 2. The Toggle of a Balancer
The implementation of an eliminationArray is based on an array of Exchangers. Each exchanger (Listing 3) contains a single AtomicReference, which is used as a placeholder for exchanging an ExchangerPackage, where the ExchangerPackage is an object used to wrap the actual data and to mark its state and type.
public class Exchanger {
    AtomicReference<ExchangerPackage> slot;
}
public class ExchangerPackage {
    Object value;
    State state;
    Type type;
}
Listing 3. An Exchanger
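To make the role of the slot concrete, here is a hedged, heavily simplified sketch (our code, not the paper's) of the CAS handshake on a single eliminationArray slot: a thread either publishes its package into an empty slot or captures a package already waiting there. The real exchanger additionally updates the slot so that the waiter learns the outcome (state and type); we omit that bookkeeping.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical single-slot sketch of the exchange step. The nested Package
// class is a stripped-down stand-in for the ExchangerPackage of Listing 3.
class Slot {
    static class Package {
        final Object value;
        Package(Object v) { value = v; }
    }

    private final AtomicReference<Package> slot = new AtomicReference<>();

    // Returns a partner's package if one was captured, or null after
    // publishing our own (the caller then spins, waiting to be eliminated).
    Package tryMeet(Package mine) {
        while (true) {
            Package cur = slot.get();
            if (cur == null) {
                if (slot.compareAndSet(null, mine)) return null;   // published
            } else if (slot.compareAndSet(cur, null)) {
                return cur;                                        // captured
            }
        }
    }
}
```

Both paths go through a CAS, so two threads racing for the same waiting package cannot both capture it.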
Each thread performing either a push or a pop traverses the tree as follows. Starting from the root balancer, the thread tries to exchange its package with a thread performing the complementary operation: a popper tries to exchange with a pusher and vice versa. In each balancer, each thread chooses a random slot in the eliminationArray, publishes its package, and then backs off in time, waiting in a loop to be eliminated. In case of failure, a backoff in “space” is performed several times. The type of space backoff depends on the cause of the failure: If a timeout is reached without meeting any other thread, a new slot is randomly chosen in a smaller range. However, if a timeout is reached after repeatedly failing in the CAS while trying either to pair up or just to swap in, a new slot is randomly chosen in a larger range. Adaptivity. In the backoff mechanism described above, a thread senses the level of contention and, depending on it, randomly selects an appropriate range of the eliminationArray to work on (by iteratively backing off). However, if each thread initialized the backoff parameters at the start of every new operation, it would waste the same unsuccessful rounds of backoff in place until sensing the current level of contention. To avoid this, we let each thread save its last-used range between invocations (Listing 1, line 5). This saved range is used as (a good guess of) the initial range at the beginning of the next operation. This method proved to be a major factor in reducing the overhead in low-contention situations while allowing the ED-Tree to yield good performance under high contention. The result of the meeting of two threads in each balancer is one of the following four states: ELIMINATED, TOGGLE, DIFFRACTED0, or DIFFRACTED1. In the case of ELIMINATED, a popper and a pusher successfully paired up, and the method returns.
If the result is TOGGLE, the thread failed to pair up with any request of the other type, so the toggle() method shown in Listing 2 is called, and according to its result the thread accesses one of the child balancers. Lastly, if the state is either DIFFRACTED0 or DIFFRACTED1, this is the result of two operations of the same type meeting in the same location, and the corresponding child balancer, either 0 or 1, is chosen.
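The adaptive range selection described above can be sketched as follows. This is our illustration: the class name and the exact doubling/halving policy are assumptions, since the paper only specifies that the range shrinks after a lonely timeout and grows after CAS contention.

```java
// Hedged sketch of exponential range adaptation for the eliminationArray:
// shrink the candidate-slot range after waiting alone, grow it after
// repeated CAS failures. The saved value plays the role of lastSlotRange.
class AdaptiveRange {
    private int range;              // current number of candidate slots
    private final int min, max;

    AdaptiveRange(int min, int max) {
        this.min = min;
        this.max = max;
        this.range = min;
    }

    int onTimeout()    { return range = Math.max(min, range / 2); } // met nobody
    int onContention() { return range = Math.min(max, range * 2); } // CAS failures
    int current()      { return range; }
}
```

Because the range persists between operations, a thread starts its next traversal with a range already matched to the recent load instead of re-learning it from scratch.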
Fig. 3. The unfair synchronous queue benchmark: a comparison of the latest JDK 6.0 algorithm and our novel ED-Tree based implementation. The graph on the left is a zoom-in of the low-concurrency part of the one on the right. The number of producers and consumers is equal in each of the tested workloads.
As a final step, the item of a thread that reaches one of the tree leaves is placed in the corresponding queue. A queue can be any of the known queue implementations: a SynchronousQueue, a LinkedBlockingQueue, or a ConcurrentLinkedQueue. Using ED-Trees with different queue implementations, we created the following three types of pools: An Unfair Synchronous Queue. When setting the leaves to hold an unfair SynchronousQueue, we get an unfair synchronous queue [2]. An unfair synchronous queue provides a “pairing up” function without buffering. Producers and consumers wait for one another, rendezvous, and leave in pairs. Thus, though it has internal queues to handle temporary overflows of mismatched items, the unfair synchronous queue does not require any long-term internal storage capacity. An Object Pool. Replacing the former SynchronousQueue with a LinkedBlockingQueue or with a ConcurrentLinkedQueue yields a blocking or a non-blocking object pool, respectively. An object pool is a software design pattern. It consists of a multi-set of initialized objects that are kept ready to use, rather than allocated and destroyed on demand. A client of the object pool requests an object from the pool and performs operations on the returned object. When the client finishes work on an object, it returns it to the pool rather than destroying it. Thus, it is a specific type of factory object. Object pooling can offer a significant performance boost in situations where the cost of initializing a class instance is high, the rate of instantiation of a class is high, and the number of instances in use at any one time is low. The pooled object is obtained in predictable time, whereas the creation of new objects (especially over a network) may take variable time. In this paper we show two versions of an object pool: a blocking and a non-blocking one. The only difference between these pools is the behavior of the popping thread when the pool is empty.
While in the blocking version a popping thread is forced to wait until an
Fig. 4. Throughput of the BlockingQueue and ConcurrentQueue object pool implementations. The number of producers and consumers is equal in each of the tested workloads.
available resource is pushed back to the pool, in the non-blocking version it can leave without an object. An example of a widely used object pool is a connection pool. A connection pool is a cache of database connections maintained by the database so that the connections can be reused when the database receives new requests for data. Such pools are used to enhance the performance of executing commands on the database. Opening and maintaining a database connection for each user, especially for requests made to a dynamic database-driven website application, is costly and wastes resources. In connection pooling, after a connection is created, it is placed in the pool and used again so that a new connection does not have to be established. If all the connections are being used, a new connection is made and added to the pool. Connection pooling also cuts down on the amount of time a user waits to establish a connection to the database. Starvation avoidance. Finally, in order to avoid starvation in the queues (though it has never been observed in any of our tests), we limit the time a thread can be blocked in these queues before it retries the whole Tree[k] traversal.
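The difference between the two pop behaviors can be illustrated directly with the JDK queues the leaves are built from (a hedged sketch of ours, not the paper's code):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hedged sketch of the two pop flavors: a blocking pop waits on a
// LinkedBlockingQueue, while a non-blocking pop returns null from an
// empty ConcurrentLinkedQueue. Class and method names are ours.
class Pools {
    static <T> T blockingPop(LinkedBlockingQueue<T> q) {
        try {
            return q.take();        // waits until a producer pushes an object
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    static <T> T nonBlockingPop(ConcurrentLinkedQueue<T> q) {
        return q.poll();            // null if the pool is currently empty
    }
}
```

Swapping one queue type for the other at the leaves is the only change needed to move between the blocking and non-blocking pool variants.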
4
Performance Evaluation
We evaluated the performance of our new algorithms on a Sun UltraSPARC T2 Plus multicore machine. This machine has 2 chips, each with 8 cores running at 1.2 GHz, and each core has 8 hardware threads, giving 64-way parallelism on a processor and 128-way parallelism across the machine. There is obviously a higher latency when going to memory across the machine (a twofold slowdown). We begin our evaluation in Figure 3 by comparing the new unfair SynchronousQueue of Lea et al. [2], scheduled to be added to the java.util.concurrent library of JDK6.0, to our ED-Tree based version of an unfair synchronous queue. As we explained earlier, an unfair synchronous queue provides a symmetric “pairing up” function without buffering: Producers and consumers wait for one another, rendezvous, and leave in pairs.
Fig. 5. Throughput of a SynchronousQueue as the work load changes, for 32 producer and 32 consumer threads.
One can see that the ED-Tree behaves similarly to the JDK version up to 8 threads (left figure). Above this concurrency level, the ED-Tree scales nicely while the JDK implementation’s overall throughput declines. At its peak, at 64 threads, the ED-Tree delivers more than 10 times the performance of the JDK implementation. Beyond 64 threads, the threads are no longer placed on a single chip, and traffic across the interconnect causes a moderate performance decline for the ED-Tree version. We next compare two versions of an object pool. An object pool is a set of initialized objects that are kept ready to use, rather than allocated and destroyed on demand. A consumer of the pool will request an object from the pool and perform operations on the returned object. When the consumer has finished using an object, it returns it to the pool, rather than destroying it. The object pool is thus a type of factory object. The consumers wait in case there is no available object, while the producers, unlike the producers of an unfair synchronous queue, never wait for consumers: they add the object to the pool and leave. We compared an ED-Tree BlockingQueue implementation to the LinkedBlockingQueue of JDK6.0. Comparison results for the object pool benchmark are shown on the left-hand side of Figure 4. The results are quite similar to those for the unfair SynchronousQueue. The JDK’s LinkedBlockingQueue performs better than its unfair SynchronousQueue, yet it still does not scale well beyond 4 threads. In contrast, our ED-Tree version scales well even up to 80 threads, despite its underlying use of the LinkedBlockingQueue at the leaves. At its peak at 64 threads it has 10 times the throughput of the JDK’s LinkedBlockingQueue. Next, we evaluated implementations of ConcurrentQueue, a more relaxed version of an object pool in which there is no requirement for the consumer to wait in case there is no object available in the pool.
We compared the ConcurrentLinkedQueue of JDK6.0 (which in turn is based on Michael and Scott’s lock-free queue algorithm [8]) to an ED-Tree based ConcurrentQueue (right-hand side of Figure 4). Again, the results show a similar pattern: the JDK’s ConcurrentLinkedQueue scales up to 14 threads, and then drops, while the ED-Tree
Fig. 6. Performance of a resource pool and an unfair SynchronousQueue with a total of 64 threads, as the ratio of consumer threads grows from 50% to 90% of all threads
based ConcurrentQueue scales well up to 64 threads. At its peak at 64 threads, it has 10 times the throughput of the JDK’s ConcurrentLinkedQueue. Since the ED-Tree object pool behaves well at very high loads, we wanted to test how it behaves in scenarios where the working threads are not pounding the pool all the time. To this end we emulated varying workloads by adding a delay between accesses to the pool. We tested 64 threads with a set of dummy delays due to work, varying from 30 to 600 ms. The comparison results in Figure 5 show that even as the load decreases, the ED-Tree synchronous queue outperforms the JDK’s synchronous queue. This is due to the low-overhead adaptive nature of the randomized mapping into the eliminationArray: as the load decreases, a thread tends to dynamically shrink the range of array locations into which it tries to map. Another scenario we tested is one in which the majority of the pool users are consumers, i.e. the rate at which items are inserted into the pool is lower than the rate demanded by the consumers, who have to wait until items become available. Figure 6 shows what happens when the number of threads using the pool is steady (64 threads), but the ratio of consumers changes from 50% to 90%. One can see that the ED-Tree outperforms the JDK’s structures both when the numbers of producer and consumer threads are equal and when there are a
Fig. 7. Elimination rate by levels, as concurrency increases
Fig. 8. Elimination range as the work load changes for 32 producer and 32 consumer threads
lot more consumer threads than producer threads (for example, 90% consumers and 10% producers). Next, we investigated the internal behavior of the ED-Tree with respect to the number of threads. We checked the elimination rate at each level of the tree. The results appear in Figure 7. Surprisingly, we found that the higher the concurrency, that is, the more threads added, the more threads get all the way down the tree to the queues. At 4 threads, all the requests were eliminated at the top level, and throughout the concurrency range, even at 256 threads, 50% or more of the requests were eliminated at the top level of the tree, at least 25% at the next level, and at least 12.5% at the level after that. This, as we mentioned earlier, forms a sequence that converges to less than 2 as n, the number of threads, grows. In our particular 3-level ED-Tree, the average is 1.375 balancer accesses per request, which explains the good overall performance. Lastly, we investigated how the adaptive method of choosing the elimination range behaves under different loads. Figure 8 shows that, as we expected, the algorithm adapts the working range to the load reasonably well. The more time each thread spent doing work not related to the pool, the more the contention decreased and, correspondingly, the smaller the range used by the threads became. Acknowledgements. This paper was supported in part by grants from Sun Microsystems and Intel Corporation, as well as grant 06/1344 from the Israeli Science Foundation and European Union grant FP7-ICT-2007-1 (project VELOX).
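The arithmetic behind these percentages can be checked directly: if a fraction 1/2^i of the requests completes at level i, the mean number of balancer accesses in a tree of depth d is the partial sum of i/2^i, which is 1.375 for d = 3 and approaches 2 as d grows (a back-of-the-envelope check of ours, not the paper's code):

```java
// Mean balancer accesses when a fraction 1/2^i of requests completes at
// level i of the tree: the partial sum of the series i / 2^i.
class EliminationCost {
    static double meanAccesses(int depth) {
        double sum = 0.0;
        for (int i = 1; i <= depth; i++) sum += i / Math.pow(2.0, i);
        return sum;
    }
}
```

The full series sums to 2, so the average cost of a request stays bounded by a small constant no matter how wide the tree is made.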
References
1. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, NY (2008)
2. Scherer III, W.N., Lea, D., Scott, M.L.: Scalable synchronous queues. Commun. ACM 52(5), 100–111 (2009)
3. Shavit, N., Zemach, A.: Diffracting trees. ACM Trans. Comput. Syst. 14(4), 385–428 (1996)
4. Shavit, N., Touitou, D.: Elimination trees and the construction of pools and stacks: preliminary version. In: SPAA 1995: Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 54–63. ACM, New York (1995)
5. Aspnes, J., Herlihy, M., Shavit, N.: Counting networks. Journal of the ACM 41(5), 1020–1048 (1994)
6. Herlihy, M., Lim, B., Shavit, N.: Scalable concurrent counting. ACM Transactions on Computer Systems 13(4), 343–364 (1995)
7. Hendler, D., Shavit, N., Yerushalmi, L.: A scalable lock-free stack algorithm. In: SPAA 2004: Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 206–215. ACM, New York (2004)
8. Michael, M.M., Scott, M.L.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: PODC 1996: Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 267–275. ACM, New York (1996)
Productivity and Performance: Improving Consumability of Hardware Transactional Memory through a Real-World Case Study
Huayong Wang, Yi Ge, Yanqi Wang, and Yao Zou
IBM Research - China
{huayongw,geyi,yqwang}@cn.ibm.com, [email protected]
Abstract. Hardware transactional memory (HTM) is a promising technology for improving the productivity of parallel programming. However, general agreement has not been reached on the consumability of HTM. User experience indicates that the HTM interface is not straightforward for programmers to adopt when parallelizing existing commercial applications, because of the internal limitations of HTM and the difficulty of identifying shared variables hidden in the code. In this paper we demonstrate that, with well-designed encapsulation, HTM can deliver good consumability. Based on the study of a typical commercial application in supply chain simulation, GBSE, we develop a general scheduling engine that encapsulates the HTM interface. With the engine, we can convert the sequential program to a multi-threaded model without changing any source code of the simulation logic. The time spent on parallelization is reduced from two months to one week, and the performance is close to that of the manually tuned counterpart with fine-grained locks. Keywords: hardware transactional memory, parallel programming, discrete event simulation, consumability.
1
Introduction
Despite years of research, parallel programming is still a challenging problem in most application domains, especially for commercial applications. Transactional memory (TM), with hardware-based solutions that provide the desired semantics while incurring the least runtime overhead, has been proposed to ameliorate this challenge. With hardware transactional memory (HTM), programmers simply delimit regions of code that access shared data. The hardware ensures that the execution of these regions appears atomic with respect to other threads. As a result, HTM allows programmers to enforce mutual exclusion as simply as with traditional coarse-grained locks, while achieving performance close to that of fine-grained locks. However, the consumability of the HTM programming model is still a point of controversy for commercial application developers, who have to trade off the cost of parallelization against the performance benefits. Better consumability means that applications can benefit from HTM in performance with less cost
P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 163–174, 2010. © Springer-Verlag Berlin Heidelberg 2010
of modification and debugging. It is not easy to achieve good consumability, for two reasons. First, when parallelizing a sequential program, its source code usually contains a large number of variables shared by functions. These shared variables are neither indicated explicitly nor protected by locks. Thoroughly identifying them is a time-consuming task. Second, the HTM programming model has limitations. Supporting I/O and other irrevocable operations (like system calls) inside transactions is expensive in terms of implementation complexity and performance loss. Much HTM research adopts application kernels as benchmarks for performance evaluation. These benchmarks are designed for hardware tuning purposes and cannot reflect the consumability problem of HTM. How to improve the consumability of HTM is a practical problem at the time of HTM’s imminent emergence in commercial processors. Some research works have tried to provide friendly HTM programming interfaces [1]. However, adding specific semantics to the HTM interfaces raises a new challenge for the compatibility of HTM implementations, which is a major concern for commercial HTM applications. It is more reasonable to solve the problem by combining two approaches: encapsulating a general HTM interface with runtime libraries and taking advantage of the application’s particularities. In this paper a typical commercial application in the supply chain simulation domain is studied as a running example. We parallelize it in two ways: with HTM and with traditional locks. Based on a quantitative comparison of the costs, we demonstrate that HTM can have good consumability with proper encapsulation for this kind of application. This paper makes the following contributions: 1. It shows that our method can reduce the time of the parallelization work from two months to one week by using a new HTM encapsulation interface on the case study application.
The method is also suitable for most real-world applications in the discrete event simulation domain. 2. Besides showing the improvement in productivity, we also evaluate the overall performance and analyze the factors that influence it. With HTM encapsulation, we achieve speedups of 4.36 and 5.13 with 8 threads in two different algorithms. The result is close to the performance achieved through fine-grained locks. The remainder of this paper is organized as follows. Section 2 provides background information on HTM and introduces the application studied in this paper. Section 3 describes the details of the parallelization work and explains how our method helps to improve consumability. Section 4 presents and analyzes experimental results. Section 5 gives the conclusion.
2
Background
This section introduces HTM concepts as well as a case study application.
2.1
HTM Implementation and Interface
In order to conduct a fair evaluation, we choose a basic HTM implementation, Best Effort Transaction (BET), which has been referred to in many papers as a baseline
Improving Consumability of Hardware Transactional Memory
165
design [2],[3],[4]. BET uses the data cache as a buffer to save the data accessed by transactions. Each cache line is augmented with an atomic flag (A-flag) and an atomic color (A-color). The A-flag indicates whether the cache line has been accessed by an uncommitted transaction, and the A-color indicates which transaction has accessed the cache line. When a transaction accesses a cache line, the A-flag is set; when a transaction is committed or aborted, the A-flag of each cache line accessed by the transaction is cleared. When two transactions access a cache line simultaneously, and at least one of them modifies the cache line, one of the two transactions must be aborted to ensure transaction atomicity. This is called a “conflict”. If a transaction is aborted, all cache lines modified by it need to be invalidated. Then the execution flow jumps to a pre-defined error handler, in which cleanup can be performed before the transaction is re-executed. From the programmer’s perspective, this basic HTM exposes five primitives. TM_BEGIN and TM_END mark the start and end of a transaction. TM_BEGIN has a parameter “priority”, which is used to avoid livelock. If a conflict happens, the transaction with lower priority is aborted. In this paper, the lower the value, the higher the priority. Therefore, using the time stamp as the priority means that the older transaction aborts the younger one in a conflict. TM_SUSPEND and TM_RESUME are two primitives used inside a transaction to temporarily stop and restart the transaction’s execution. In the suspended state, memory access operations are treated as regular memory accesses, except that conflicts with the thread’s own suspended transaction are ignored. I/O and system calls are allowed in the suspended state. If a conflict happens in the suspended state and the transaction needs to be aborted, the cancellation is delayed until TM_RESUME is executed.
The last primitive, TM_VALIDATE, is used in the suspend state to check whether a conflict has happened.

2.2 Discrete Event and Supply Chain Simulation
Discrete event simulation (DES) is a method to mimic the dynamics of a real system. Parallel DES (PDES) in this paper refers to DES parallelized by multiple threads on a shared-memory multiprocessor platform, rather than distributed DES running on clusters. The core of a parallel discrete event simulator is an event list and a thread pool processing those events. Each event has a time stamp and a handler function. The event list contains all unprocessed events, sorted by time stamp, as shown in Fig. 1. The main processing loop in a simulator repeatedly removes the oldest event from the event list and calls the handler function for the event. Thus, the process can be viewed as a sequence of event computations. While an event is being processed, it may add one or more events with future time stamps to the event list. The principle of DES is to ensure that events with different time stamps are processed in time stamp order. This requires extra precaution because a thread does not know a priori whether a new event will be added later. Figure 1 demonstrates such a situation. Both threads fetch the event with the oldest time stamp to process. Since event B requires a relatively long processing time, thread 1 fetches event D after finishing event A. However, event B adds a new event C to the event list at time t_add, which causes an out-of-order execution of events C and D.

Fig. 1. Event list

There are two kinds of algorithms to address this problem [5]: conservative and optimistic algorithms. Briefly, the conservative algorithm takes precautions to avoid out-of-order processing: each event is processed only when there is no event with a smaller time stamp. To achieve this, the conservative algorithm uses a Lower Bound on the Time Stamp (LBTS). In this paper, the LBTS is the smallest time stamp in the event list. Events with time stamp equal to the LBTS are safe to process. After all of them have been processed, the LBTS is increased to the next time stamp in the event list. The optimistic algorithm, on the contrary, uses a detection-and-recovery approach. Events are allowed to be processed out of time stamp order; however, if the computations of two events conflict, the event with the larger time stamp must be rolled back and reprocessed.

Supply chain simulation is an application of DES. The General Business Simulation Environment (GBSE) is a supply chain simulation and optimization tool developed by IBM [6]. It is a widely used commercial application and earned the 2008 Supply Chain Excellence Award, a top award in this domain. GBSE-C is the C/C++ version of GBSE. Besides the aforementioned features of DES, GBSE-C presents further properties relevant to the parallelization work.

1. As a sequential program, it has a large number of shared variables not protected by locks in event handlers. To parallelize it, all shared variables in the source code must be identified, which is a very time-consuming task.
2. The number of events is large, and events sharing variables are likely processed at different times. Therefore the actual conflict rate between events is low.
3. For business reasons, the source code of event handlers is changed frequently.
Considering that the code is developed by experts on supply chains rather than experts on parallel programming, it is desirable to keep the programming style as simple as before, i.e., writing sequential code as usual while achieving the performance of parallel execution.
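The event list and main processing loop described in this section can be condensed into a minimal sequential sketch. This is illustrative only, with hypothetical names; GBSE-C itself is far richer:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Minimal sequential DES core (an illustrative sketch, not GBSE-C code):
// events carry a time stamp and a handler, the list is ordered by stamp,
// and a handler may schedule new events with future stamps, as in Fig. 1.
struct Event {
    long stamp;
    std::function<void()> handler;
    bool operator>(const Event& o) const { return stamp > o.stamp; }
};

struct Simulator {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> list;
    std::vector<long> trace;  // stamps in processing order, for inspection

    void add_event(long stamp, std::function<void()> h) {
        list.push(Event{stamp, std::move(h)});
    }
    void run() {
        while (!list.empty()) {
            Event e = list.top();  // always the oldest unprocessed event
            list.pop();
            trace.push_back(e.stamp);
            e.handler();           // may add new, future events
        }
    }
};
```

In the sequential case the time stamp order is maintained trivially; the scheduling problem discussed below arises only once several threads drain the list concurrently.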
Using traditional lock-based techniques to parallelize GBSE-C is impractical because the business processing logic would need to be modified. It is unreasonable to depend on the business logic programmers, who have little knowledge about parallelization, to explicitly specify which parts should be protected by locks and which should not. In contrast, our HTM-based approach only modifies the simulation service layer and is transparent to business logic programmers.
3 Using Transactions in GBSE-C
For the purpose of parallelization, GBSE-C can be viewed as a two-layer program. The upper layer is the business logic, such as the order handling process, inventory control process, and procurement process. This logic is implemented in the form of event handlers. The lower layer is the event scheduling engine, which is the core of the simulator. The engine consists of two modules:

1. The resource management module encapsulates commonly used functions, such as malloc, I/O, and operations on the event list. After encapsulation, those functions are safe to use inside transactions.
2. The scheduling management module is in charge of event scheduling and processing. Threads from a thread pool fetch suitable events from the event list and process them. The scheduling policy is either conservative or optimistic, which defines the event processing mechanism.
3.1 Resource Management
The resource management module includes a memory pool, an I/O wrapper API, and the event list interface.

Memory Pool. If a chunk of memory is allocated inside a transaction and the transaction is aborted, the allocated memory will never be released. Moreover, memory allocation from a global pool usually requires accessing shared variables, and hence incurs conflicts between transactions. To solve these problems, we implement a memory pool per thread. Inside a transaction, the new functions TM_MALLOC and TM_FREE replace the original functions malloc and free. TM_MALLOC obtains a chunk of memory from the corresponding memory pool and records the address and size of the memory chunk in a thread-specific table (MA table). If the transaction is aborted, the error handler releases all memory chunks recorded in the table; otherwise, the allocated memory remains valid. TM_FREE returns the memory chunk to the pool and deletes the corresponding entry in the table.

I/O Wrapper. There are many disk and network I/O operations in GBSE-C. Functionally, they are used to access files, remote services, and databases. To simplify programming, these operations have already been encapsulated by helper functions. I/O operations inside a transaction are
168
H. Wang et al.
re-executed if the transaction is re-executed. Based on whether an I/O operation can tolerate this side effect, the operations can be classified into two categories. The first category is idempotent, such as reading a read-only file or printing for debugging purposes; it can be re-executed without harmful impact. The second category is non-idempotent, for example appending a row of data to a database table; it cannot be executed multiple times. We have encountered many cases in the first category. For these, we add TM_SUSPEND and TM_RESUME at the start and end of the helper functions; the code in event handlers can remain unchanged if it calls the helper functions for these operations. The cases in the second category are more interesting, and we have two approaches to handle them. First, we use I/O buffering, as described in previous work [7]: we buffer the data of the I/O operations until the transaction finishes. If the transaction is committed, the I/O operations are performed with the data in the buffer; otherwise, the data in the buffer is discarded. However, this method is inconvenient for some complex cases where the buffered I/O operation influences later operations in the transaction. We therefore propose another method that adds a new flag, "serial mode", to each event. If it is set, the event handler contains complex I/O operations that should be handled in traditional sequential mode, where only one event is processed at a time. After this event is processed, the scheduler resumes the parallel execution mode.

Event List Interface. An event handler may add a new event to the event list through the event list interface. The interface exposes a function ADD_EVENT for programmers. The function does not manipulate the event list immediately. Instead, it records the function call and the corresponding parameters in a thread-specific table (ELO table). After the transaction is committed, extra code after TM_END executes the operations recorded in the ELO table.
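The MA and ELO tables can be sketched together, since both record operations for later replay or rollback. This is an illustrative stand-in with simplified types, not GBSE-C source:

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Sketch of the two per-thread bookkeeping tables: the MA table lets the
// error handler release memory allocated inside an aborted transaction,
// and the ELO table defers ADD_EVENT calls until the transaction commits.
// Names follow the paper; the pool and event list are simplified stand-ins.
struct EventList { std::vector<long> stamps; };

class TxnTables {
    std::unordered_map<void*, size_t> ma_;  // MA table: address -> size
    std::vector<long> elo_;                 // ELO table: recorded ADD_EVENT stamps
public:
    void* tm_malloc(size_t n) {             // TM_MALLOC: allocate and record
        void* p = std::malloc(n);
        if (p) ma_[p] = n;
        return p;
    }
    void tm_free(void* p) { ma_.erase(p); std::free(p); }  // TM_FREE
    void add_event(long stamp) { elo_.push_back(stamp); }  // record only

    void on_commit(EventList& list) {       // extra code after TM_END
        for (long s : elo_) list.stamps.push_back(s);      // replay ADD_EVENTs
        ma_.clear();                        // allocations stay valid
        elo_.clear();
    }
    void on_abort() {                       // run in the error handler
        for (auto& e : ma_) std::free(e.first);            // undo allocations
        ma_.clear();
        elo_.clear();                       // discard deferred operations
    }
    size_t outstanding_allocs() const { return ma_.size(); }
};
```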
If the transaction is aborted, the table is simply cleared.

3.2 Scheduling Management
In order to make TM programming transparent, the HTM interface is encapsulated in the event scheduler, so event handler programmers need not be aware of the HTM programming primitives. The engine supports both the conservative and the optimistic scheduling algorithm. It also has a special scheduling policy supporting conflict prediction; this policy is based on the conservative algorithm and requires changes to the HTM implementation.

The Conservative Algorithm. The main processing loop of each thread in the conservative algorithm includes the following steps:

1. Fetch an event with the smallest time stamp from the event list.
2. If the time stamp of this event is larger than the LBTS, return the event to the event list and wait until the LBTS is increased. This check guarantees that an event is processed only when there is no event with a smaller time stamp.
3. Execute the event handler in the context of a transaction. After the transaction is committed, execute the delayed event list operations (if any), and clear both the ELO and MA tables.
4. If all events with time stamp equal to the LBTS have been processed, the LBTS is increased to the next time stamp in the event list. After that, a notification about the LBTS change is sent to all other threads; the threads blocked on the LBTS then wake up.

In the conservative algorithm, threads only process events with the same time stamp (LBTS). The thread pool will starve if the number of events with time stamp equal to the LBTS is small. The parallelism of the conservative algorithm is thus limited by the event density, i.e., the average number of events per time stamp in the event list. In addition, the barrier synchronization that prevents out-of-order event processing can be costly when event processing times are uneven: some threads may go idle while others are processing long-running events. Both effects degrade the performance of the conservative algorithm.
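Under the simplifying assumption that the event list is a sorted vector of time stamps, the gate in step 2 and the LBTS advance in step 4 can be sketched as:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of the conservative gate (step 2) and the LBTS advance (step 4);
// the event list is modeled as a sorted vector of time stamps.
bool safe_to_process(long stamp, long lbts) { return stamp <= lbts; }

// Once all events at the current LBTS are done, the LBTS moves to the next
// distinct time stamp remaining in the list (unchanged if none is left).
long advance_lbts(const std::vector<long>& sorted_stamps, long lbts) {
    auto it = std::upper_bound(sorted_stamps.begin(), sorted_stamps.end(), lbts);
    return (it == sorted_stamps.end()) ? lbts : *it;
}
```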
Fig. 2. An example of the optimistic algorithm
The Optimistic Algorithm. The optimistic algorithm overcomes the shortcomings of the conservative algorithm by allowing execution out of time stamp order. In traditional implementations, however, the optimistic algorithm does not necessarily improve performance in general, because it incurs overhead for checkpoint and rollback operations [8]. With HTM, that overhead is minimized since checkpoint and rollback are done by hardware. The main processing loop of the optimistic algorithm includes the following steps:

1. Fetch an event with the smallest time stamp from the event list.
2. Execute the event handler in the context of a transaction. In this case the transaction priority is equal to the time stamp.
3. After the execution of the event handler, suspend the transaction and wait until the event's time stamp is equal to the LBTS. During this period, the handler periodically wakes up and checks whether a conflict has been detected. If so, abort the transaction and re-execute the event handler.
4. After the transaction is committed, execute the delayed event list operations (if any), and clear both the ELO and MA tables.
5. If all events with time stamp equal to the LBTS have been processed, the LBTS is increased to the next time stamp in the event list. After that a notification
about the LBTS change is sent to all other threads; the threads blocked on the LBTS then wake up.

Figure 2 shows an example of the optimistic algorithm. At the beginning, there are three events (A, B, and C) in the event list and the LBTS is equal to k. The three events are being processed concurrently by three separate threads. Since the time stamp of event C is larger than the LBTS, event C is blocked and put into the suspend state; the thread releases the CPU while blocked. Meanwhile, event B adds a new event D with time stamp k+1 to the event list by calling ADD_EVENT. After events A and B have finished, the LBTS is increased to k+1, and a notification about the LBTS change is sent to event C. The execution of event C is woken up, and the corresponding transaction commits soon afterwards. Event D can be processed whenever there is a free thread in the thread pool. If a conflict happens between events A and C, event C is aborted since its priority is lower. The example shows that although events are processed out of order, they are committed strictly in order, which guarantees the correctness of the simulation.

Besides the low cost of checkpoint and rollback, the HTM-based optimistic algorithm has another advantage over traditional implementations: fine-grained rollback. In the HTM-based implementation, only those events really affected are rolled back, since each transaction has its own checkpoint. The optimistic algorithm might suffer from overly optimistic execution, i.e., some threads may advance too far in the event list. The results are two-fold. First, the conflict rate balloons with the increase in the number of events being processed. Second, overflow might happen when many transactions are executed concurrently. Both effects limit the parallelism of the optimistic algorithm.

Scheduler with Conflict Prediction. The performance of the case study application is limited by the high conflict rate when the thread count is large.
Without an appropriate event scheduling mechanism, a large thread pool may consume more on-chip power without improving performance, or even cause performance degradation. In our engine, the event scheduler supports a special policy that predicts conflicts between events with the help of the HTM. The prediction directs each thread to process suitable events in order to avoid unnecessary conflicts. Conflict prediction is feasible based on the following observations:

– Data locality. In PDES applications, the executions of an event show data locality. The memory footprints of previous executions give hints about the memory addresses to be accessed in the future.
– Conflict position. Some event handlers follow a common pattern: first read configuration data, then do some computation, and finally write back the results. Conflicts caused by shared variable accesses therefore usually occur at the beginning and end of the event handler. The time span between the conflict position and the end of the transaction is highly relevant to the conflict probability.

In order to design the scheduler with conflict prediction, we modify the HTM implementation by adding two Bloom filters to each processor to record the
transaction's read and write sets. When the transaction is committed, the contents of the Bloom filters, called signatures, are dumped into memory at pre-defined addresses. When the transaction is aborted, the Bloom filters are cleared and the conflicting address is recorded. Each event maintains a conflict record table (CRT) for the conflict prediction, which contains the signatures and other transaction statistics. The most important fields of the statistics are:

– Total Execution Cycles (TEC): total execution cycles of a transaction between TM_BEGIN and TM_END.
– Conflicting Addresses (CA): conflicting addresses of a transaction.
– Conflicting Execution Cycles (CEC): execution cycles of a transaction between TM_BEGIN and the conflict.

Before an event is processed, the scheduler first checks whether the event's conflicting addresses are contained in the signature of any event under execution. If so, a conflict may happen if this event is executed. The scheduler then uses the possible conflict time span (PCT) as a metric to determine the conflict probability. The PCT is the total time span within which, if the event with lower priority starts to execute, it will be aborted by the running event with higher priority. The smaller the PCT, the less probable a conflict. The PCT between events A and B is computed according to Eq. (1):

PCT_AB = (TEC_A - CEC_A) + (TEC_B - CEC_B)    (1)

Fig. 3 shows three cases of transaction conflict between events A and B. In the first case, the two transactions have conflicting memory accesses at the very beginning; the transaction with lower priority may be aborted if it starts to execute at any time within the span from point M to point N.
It has the largest PCT (PCT_max ≈ TEC_A + TEC_B). In the second case, the conflict position is located in the middle and the conflict probability is lower, with PCT_mid ≈ 1/2 (TEC_A + TEC_B). In the third case, conflicts are rare, with PCT_min ≈ 0. Based on the CRT and the PCT, the scheduler can predict conflicts and decide the scheduling policy for each event.

Fig. 3. Three cases with different PCT

Fig. 4. Transaction conflict rate
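Eq. (1) and the three cases of Fig. 3 can be sanity-checked with a small helper; the cycle values below are illustrative, not measurements from the paper:

```cpp
#include <cassert>

// PCT from Eq. (1): the time span within which starting the lower-priority
// transaction leads to an abort. TEC = total execution cycles of a
// transaction, CEC = cycles from TM_BEGIN to the conflict position.
long pct(long tec_a, long cec_a, long tec_b, long cec_b) {
    return (tec_a - cec_a) + (tec_b - cec_b);
}
```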
4 Productivity and Performance Evaluation
We have parallelized GBSE-C through two approaches: the new approach based on HTM and the traditional approach based on locks. Since the work was done by the same group of developers, the time spent implementing the different approaches can be considered a straightforward indication of productivity.

– HTM-based approach. We spent about one week on the parallelization work for both the conservative and optimistic algorithms. Three days were used to identify I/O operations and handle them accordingly; this was not very difficult because the I/O operations are well encapsulated by helper functions. The remaining time was used to encapsulate the HTM interface in the scheduling engine.
– Lock-based approach. We spent two months on the parallelization work for the conservative algorithm. About half of the time was used to identify the variables shared by events. This task is time-consuming because understanding the program logic in the event handlers requires some domain knowledge, and the use of pointers exacerbates the problem. Performance tuning and debugging took another three weeks, including shortening critical sections (fine-grained locks), choosing proper locks (pthread locks vs. spin locks), and preventing deadlocks. Since the execution order of events with the same time stamp is not deterministic, some bugs were found only after running the program many times. We did not implement the optimistic algorithm using locks because doing checkpoint and rollback in software is difficult in general.

The performance evaluation is carried out on an IBM full system simulator [9], which supports configurable cycle-accurate simulation for POWER/PowerPC based CMPs and SMPs. The target processor contains 4 clusters connected by an on-chip interconnect. Each cluster includes 4 processor cores and a shared L2 cache. Each core runs at a frequency of 2 GHz, with an out-of-order multi-issue pipeline. We have studied the transaction sizes in this application.
The sizes are measured separately for the read and write sets, which contain the data read and written by a transaction, respectively. 88% of the read sets and 91% of the write sets are smaller than 4 KB, indicating that most transactions are small. The read and write sets of the largest transaction approach 512 KB. Since the L2 data cache (2 MB) is much larger, the execution of a single transaction does not incur overflow. Figure 4 shows the transaction conflict rate in the conservative and optimistic algorithms. The conflict rate is defined as the number of conflicts divided by the total number of transactions. The optimistic algorithm has a higher conflict rate than the conservative algorithm since it attempts more parallel execution. When the number of threads in the thread pool is increased, the conflict rate also increases. Overall, the conflict rate is less than 24%, which confirms that many events are actually independent and can be processed in parallel.
Fig. 5. Speedup from three algorithms

Fig. 6. Speedup of scheduler with conflict prediction (CP) vs. the normal scheduler
Figure 5 illustrates the observed speedup of the three approaches: the conservative algorithm with fine-grained locks, the conservative algorithm with HTM, and the optimistic algorithm with HTM. The first approach represents the best performance achievable through manual optimization. The conservative algorithm with HTM achieves performance only slightly below the first approach, indicating that HTM is effective for the parallelization work. The optimistic algorithm with HTM is better than the previous two approaches: it gains more than 2 times speedup with a thread count of 2, because the optimistic algorithm reduces the synchronization overhead at each time stamp and the scheduler has more opportunities to balance workloads among threads. Overall, through HTM, we achieve 4.36 and 5.13 times speedup with 8 threads in the conservative and optimistic algorithms, respectively, and 5.69 times speedup with 16 threads in the optimistic algorithm.

Besides the performance comparison with the normal event scheduler, we also conduct an experiment to evaluate the performance of the scheduler with conflict prediction. Figure 6 illustrates the speedup and conflict rate comparison between the two schedulers using the conservative algorithm. Because of the extra overhead of conflict prediction, the scheduler with conflict prediction shows slightly worse performance than the normal one when the thread count is low. But as the thread count increases, it gradually outperforms the normal scheduler and the performance gap widens; with 16 threads, the speedup is 14% higher than with the normal scheduler. We can also see that the conflict rate is reduced by the scheduler with conflict prediction.
5 Conclusion
In this paper, we demonstrate that encapsulated HTM can offer good consumability for some real-world applications. Based on these findings, we will further investigate the consumability of HTM in a broader range of applications in the future.
References

1. McDonald, A., Chung, J., Carlstrom, B.D., Minh, C.C., Chafi, H., Kozyrakis, C., Olukotun, K.: Architectural semantics for practical transactional memory. In: Proc. of the 33rd International Symposium on Computer Architecture, pp. 53–65. IEEE, Los Alamitos (2006)
2. Wang, H., Hou, R., Wang, K.: Hardware transactional memory system for parallel programming. In: Proc. of the 13th Asia-Pacific Computer System Architecture Conference, pp. 1–7. IEEE, Los Alamitos (2008)
3. Baugh, L., Neelakantam, N., Zilles, C.: Using hardware memory protection to build a high-performance, strongly-atomic hybrid transactional memory. In: Proc. of the 35th International Symposium on Computer Architecture, pp. 115–126. IEEE, Los Alamitos (2008)
4. Chung, J., Baek, W., Kozyrakis, C.: Fast memory snapshot for concurrent programming without synchronization. In: Proc. of the 23rd International Conference on Supercomputing, pp. 117–125. ACM, New York (2009)
5. Perumalla, K.: Parallel and distributed simulation: traditional techniques and recent advances. In: Proc. of the 2006 Winter Simulation Conference, pp. 84–95 (2006)
6. Wang, W., Dong, J., Ding, H., Ren, C., Qiu, M., Lee, Y., Cheng, F.: An introduction on IBM General Business Simulation Environment. In: Proc. of the 2008 Winter Simulation Conference, pp. 2700–2707 (2008)
7. Chung, J., Chafi, H., Minh, C., McDonald, A., Carlstrom, B., Kozyrakis, C., Olukotun, K.: The common case transactional behavior of multithreaded programs. In: Proc. of the 12th International Symposium on High-Performance Computer Architecture, pp. 166–177. IEEE, Los Alamitos (2006)
8. Poplawski, A., Nicol, D.: NOPS: a conservative parallel simulation engine for TeD. In: Proc. of the 12th Workshop on Parallel and Distributed Simulation, pp. 180–187 (1998)
9. Bohrer, P., Peterson, J., Elnozahy, M., Rajamony, R., Gheith, A., Rochhold, R.: Mambo: a full system simulator for the PowerPC architecture. ACM SIGMETRICS Performance Evaluation Review 31(4), 8–12 (2004)
Exploiting Fine-Grained Parallelism on Cell Processors

Ralf Hoffmann, Andreas Prell, and Thomas Rauber
Department of Computer Science, University of Bayreuth, Germany
[email protected]
Abstract. Driven by increasing specialization, multicore integration will soon enable large-scale chip multiprocessors (CMPs) with many processing cores. In order to take advantage of increasingly parallel hardware, independent tasks must be expressed at a fine level of granularity to maximize the available parallelism and thus potential speedup. However, the efficiency of this approach depends on the runtime system, which is responsible for managing and distributing the tasks. In this paper, we present a hierarchically distributed task pool for task parallel programming on Cell processors. By storing subsets of the task pool in the local memories of the Synergistic Processing Elements (SPEs), access latency and thus overheads are greatly reduced. Our experiments show that only a worker-centric runtime system that utilizes the SPEs for both task creation and execution is suitable for exploiting fine-grained parallelism.
1 Introduction
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 175–186, 2010. © Springer-Verlag Berlin Heidelberg 2010

With the advent of chip multiprocessors (CMPs), parallel computing is moving into the mainstream. However, despite the proliferation of parallel hardware, writing programs that perform well on a variety of CMPs remains challenging. Heterogeneous CMPs achieve higher degrees of efficiency than homogeneous CMPs, but are generally harder to program. The Cell Broadband Engine Architecture (CBEA) is the most prominent example [1,2]. It defines two types of cores with different instruction sets and DMA-based "memory flow control" for data movement and synchronization between main memory and software-managed local stores. As a result, exploiting the potential of CBEA-compliant processors requires significant programming effort.

Task parallel programming provides a flexible framework for programming homogeneous as well as heterogeneous CMPs. Parallelism is expressed in terms of independent tasks, which are managed and distributed by a runtime system. Thus, the programmer can concentrate on the task structure of an application, without being aware of how tasks are scheduled for execution. In practice, however, the efficiency of this approach strongly depends on the runtime system and its ability to deal with almost arbitrary workloads. Efficient execution across CMPs with different numbers and types of cores requires the programmer to maximize the available parallelism in an application. This is typically achieved
by exposing parallelism at a fine level of granularity. The finer the granularity, the greater the potential for parallel speedup, but also the greater the communication and synchronization overheads. In the end, it is the runtime system that determines the degree to which fine-grained parallelism can be exploited. When we speak of fine-grained parallelism, we assume task execution times on the order of 0.1–10 µs. Given the trend towards large-scale CMPs, it becomes more and more important to provide support for fine-grained parallelism.

In this work, we investigate task parallel programming with fine-grained tasks on a heterogeneous CMP, using the example of the Cell processor. Both currently available incarnations, the Cell Broadband Engine (Cell/B.E.) and the PowerXCell 8i, comprise one general-purpose Power Processing Element (PPE) and eight specialized coprocessors, the Synergistic Processing Elements (SPEs). Although the runtime system we present is tailored to the Cell's local-store-based memory hierarchy, we believe many concepts will be directly applicable to future CMPs. In summary, we make the following contributions:

– We describe the implementation of a hierarchically distributed task pool designed to take advantage of CMPs with local memories, such as the Cell processor. Lowering the task scheduling overhead is the first step towards exploiting fine-grained parallelism.
– Our experiments indicate that efficient support for fine-grained parallelism on Cell processors requires task creation by multiple SPEs instead of by the PPE. For this reason, we provide functions to offload the process of task creation to the SPEs. Although seemingly trivial, sequential bottlenecks must be avoided in order to realize the full potential of future CMPs.
2 Distributed Task Pools
Task pools are shared data structures for storing the parallel tasks of an application. As long as tasks are available, a number of threads keep accessing the task pool to remove tasks for execution and to insert new tasks upon creation. Task pools generalize the concept of work queueing: while a work queue usually implies an order of execution such as LIFO or FIFO, a task pool need not guarantee such an order. To improve scheduling, task pools are often based on the assumption that inserted tasks are free of dependencies and ready to run. Implementing a task pool can be as simple as setting up a task queue that is shared among a number of threads. While such a basic implementation might suffice for small systems, frequent task pool accesses and increasing contention quickly limit scalability. Distributed data structures, such as per-thread task queues, address the scalability issue of centralized implementations, at the cost of requiring additional strategies for load balancing.

2.1 Task Pool Runtime System
The task pool runtime is implemented as a library, which provides an API for managing the task pool, running SPE threads, performing task pool operations,
and synchronizing execution after parallel sections. In addition to the common approach of creating tasks in a PPE thread, we include functions to delegate task creation to a number of SPEs, intended for those cases in which the performance of the PPE presents a bottleneck to application scalability. Each SPE executes a basic self-scheduling loop that invokes user-defined task functions after removing tasks from the task pool. In this way, the user can focus on the implementation of tasks, rather than writing complete executables for the SPEs. To support workloads with nested parallelism, new tasks can be spawned from currently running tasks, without tying execution to any particular SPE.

2.2 Design and Implementation
To match the Cell’s memory hierarchy, we define a hierarchical task pool organization consisting of two separate storage domains: a local storage domain, providing fast access to a small set of tasks, and a shared storage domain, collecting all remaining tasks outside of local storage. From an implementation viewpoint, the hierarchical task pool may be thought of as two disjoint task pools, one per storage domain, with an interface for moving tasks between the task pools. An SPE thread initiates task movement in two cases: (1) there is no local task left when trying to schedule a task for execution, or (2) there is no local storage space left when trying to insert a new task. To reduce the frequency of these events, tasks should be moved in bundles. However, choosing a large bundle size may lead to increased load balancing activity, up to the point where load balancing overheads outweigh any savings. For this reason, we only try to maximize the bundle size within fixed bounds. Shared Task Storage. Data structures for storing tasks in main memory should meet the following requirements: (1) Allow concurrent access by multiple processing elements. (2) Provide potentially unbounded storage space that can grow and shrink as needed. (3) Facilitate movement of tasks between storage domains. Concurrent access by multiple processing elements is usually achieved by using a distributed task pool. In our previous work, we have described the implementation of a distributed task pool based on a set of double-ended queues (deques) [3]. We noted that dequeuing a single task involved up to 11 small DMA transfers, which added significant latency to the overall operation. As a consequence of these overheads, fine-grained parallelism remained hard to exploit. To increase the efficiency of DMA transfers, we now allocate blocks of contiguous tasks and arrange the blocks in a circular linked list. 
Similar in concept to distributed task queues, blocks are locked and accessed independently, allowing concurrent access to distinct blocks. Each SPE is assigned a separate block on which to operate. Only if an SPE finds its block empty or already locked when searching for tasks does it follow the pointer to the next block, effectively attempting to steal a task from another SPE’s block. Thus, load balancing is built into the data structure, rather than being implemented as part of the scheduling algorithm.
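For illustration, the block-list lookup described above can be sketched in plain C. The names, the block capacity, and the simple integer lock are our assumptions; on the real hardware, lock acquisition and the task copy would be DMA operations on main-memory blocks rather than direct pointer accesses.

```c
#include <stddef.h>

/* Hypothetical sketch of the shared storage domain: a circular list of
 * task blocks, each locked and accessed independently. */
#define BLOCK_CAPACITY 64

typedef struct task_block {
    int lock;                     /* stands in for an atomic/DMA lock */
    int count;                    /* number of tasks stored in this block */
    void *tasks[BLOCK_CAPACITY];  /* contiguous task slots for bulk DMA */
    struct task_block *next;      /* circular link to the next block */
} task_block_t;

/* Each SPE operates on its own block first; only if that block is empty
 * or locked does it follow the next pointer, effectively stealing. */
void *find_task(task_block_t *own)
{
    task_block_t *b = own;
    do {
        if (!b->lock && b->count > 0) {
            b->lock = 1;                  /* try_lock in a real system */
            void *t = b->tasks[--b->count];
            b->lock = 0;
            return t;
        }
        b = b->next;                      /* try another SPE's block */
    } while (b != own);
    return NULL;                          /* list exhausted */
}
```

Because the walk starts at the SPE's own block and wraps around the ring, load balancing falls out of the data structure itself, as described in the text.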
178
R. Hoffmann, A. Prell, and T. Rauber
Fig. 1. Task storage in main memory. The basic data structure is a circular linked list of task blocks, each of which is locked and accessed independently. If more than one list is allocated, the lists are in turn linked together.
The resulting data structure, including our extension to support multiple lists, is illustrated in Fig. 1. Individual task blocks may be resized by requesting help from the PPE, using the standard PPE-assisted library call mechanism. The number of blocks in a list remains constant after allocation and equals the number of SPEs associated with that list. In the case of more than one list, as shown in the example of Fig. 1, SPEs follow the pointer to the head of the next list if they fail to get a task from their current list.

Local Task Storage. Tasks in the local storage domain are naturally distributed across the local stores of the SPEs. Given the limited size of local storage, we can only reserve space for a small set of tasks. In our current implementation, we assume a maximum of ten tasks per local store. With DMA performance in mind, we adopt the task queue shown in Fig. 2. The queue is split into private and public segments for local-only and shared access, respectively. Access to the public segment is protected by a lock. The private segment is accessed without synchronization. Depending on the number of tasks in the queue, the local SPE adjusts the segment boundary to share at least one task via the public segment. Adapting the segments by shifting the boundary requires exclusive access to the queue, including the public segment. If the queue is empty or there is only one task left to share, the queue is public by default (2a). When inserting new tasks, the private segment may grow up to a defined maximum size; beyond that size, tasks remain in the public segment (2b–d). Similarly, when searching for a task, the private segment is checked first before accessing the public segment (2e–f). Unless the public segment is empty, there is no need to shrink the private segment (2g).
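The owner-side boundary adjustment can be sketched as follows. The queue size (ten tasks) and the maximum private size (five tasks, as in the example of Fig. 2) come from the text; the function and field names are our own.

```c
/* Sketch of a local-store queue split into private and public segments.
 * Slots [0, split) are private; [split, count) are public. */
#define QUEUE_SIZE  10
#define MAX_PRIVATE 5

typedef struct {
    void *tasks[QUEUE_SIZE];
    int count;   /* total number of tasks in the queue */
    int split;   /* boundary: tasks[0..split-1] are private */
} local_queue_t;

/* Owner-side rule: always share at least one task via the public
 * segment, and cap the private segment at MAX_PRIVATE. With zero or
 * one task, the queue is public by default. */
void adjust_boundary(local_queue_t *q)
{
    if (q->count <= 1) {
        q->split = 0;                 /* queue is public by default */
    } else {
        int priv = q->count - 1;      /* leave at least one public task */
        if (priv > MAX_PRIVATE)
            priv = MAX_PRIVATE;
        q->split = priv;
    }
}
```

Keeping the private segment lock-free is the point of the split: only boundary shifts and public-segment accesses need synchronization.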
Given that other SPEs are allowed to remove tasks from a public segment, periodic checks are required to determine whether the public segment is empty, in which case the boundary is shifted left to share another task (2h–j).

Dynamic Load Balancing. Task queues with private and public segments provide the basis for load balancing within the local storage domain. If an SPE fails to obtain a task from both local and shared storage, it may attempt to steal a task from another SPE’s local queue. Task stealing transfers a number of tasks between two local stores without involving main memory. We have implemented a two-level stealing protocol for a system containing two Cell processors. At first, task stealing is restricted to finding a local victim, i.e., a victim located on the same chip. SPE_i starts off by peeking at the public segment of its logical neighbor SPE_{(i mod N)+1}, where N is the number of local SPEs and 1 ≤ i ≤ N. Note that two logically adjacent SPEs need not be physically adjacent, though we ensure that they are allocated on the same chip. To reduce the frequency of stealing attempts, a thief tries to steal more than one task at a time, up to half the number of tasks in a given public segment (steal-half policy). If the public segment is already locked or there is no task left to steal, the neighbor of the current victim becomes the next victim, and the procedure is repeated until either a task is found or the thief returns to its own queue. In the latter case, task stealing continues with trying to find a remote victim, i.e., a victim located off-chip. Again, each SPE in question is checked once.

Task Pool Access Sequence. Our current task pool implementation is based on the assumption that tasks are typically executed by the SPEs. Therefore, the PPE is used to insert but not to remove tasks. In the following, we focus on the sequence in which SPEs access the task pool. For the sake of clarity, we omit the functions for adapting the local queue segments.

Get Task. Figure 3(a) shows the basic sequence for removing a task from the task pool. Tasks are always taken from the local queue, but if the queue is empty, the shared storage pool must be searched for a task bundle to swap in. The size of the bundle is ultimately limited by the size of the local queue. Our strategy does not attempt to find the largest possible bundle by inspecting each block in the list; instead, it tries to minimize contention by transferring the first bundle found, even if the bundle is really a single task. Before the lock is released and other SPEs might begin to steal, one of the transferred tasks is reserved for execution by the local SPE.

Fig. 2. Task storage in local memory. The basic data structure is a bounded queue, implemented as an array of tasks. The queue owner is responsible for partitioning the queue dynamically into private and public segments. In this example, the private segment may contain up to five tasks.
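A minimal sketch of the victim order and the steal-half policy, assuming SPE ids 1..N as in the text:

```c
/* Sketch of the local stealing round: SPE i peeks at its logical
 * neighbor SPE (i mod N) + 1 and walks around the ring until it finds
 * a task or returns to its own queue. */
#define N_LOCAL 8

/* SPE ids are 1..N_LOCAL; the neighbor of SPE N_LOCAL is SPE 1. */
int next_victim(int id)
{
    return (id % N_LOCAL) + 1;
}

/* Steal-half policy: take up to half the tasks in the victim's public
 * segment, or the single remaining task if there is only one. */
int steal_count(int public_tasks)
{
    return public_tasks > 1 ? public_tasks / 2 : public_tasks;
}
```

Stealing several tasks at once amortizes the cost of the local-store-to-local-store DMA transfer over multiple tasks, which is why the thief does not take just one.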
Put Task. Figure 3(b) shows the sequence for inserting a task into the task pool. The task is always inserted locally, but if the local queue is full, a number of tasks must be swapped out to the shared storage pool first. For reasons of locality, it makes sense not to clear the entire queue before inserting the new task. Tasks that are moved from local to shared storage are inserted into an SPE’s primary block. If the block is already full, the SPE calls back to its controlling PPE thread, which doubles the size of the block and resumes SPE execution.

(a) Get task:

    void *get_task() {
        void *task = remove_private();
        if (task) return task;
        lock(public);
        task = remove_public();
        if (task) {
            unlock(public);
            return task;
        }
        // Local store queue is empty
        swap_in_from_mm();
        task = remove();
        unlock(public);
        if (task) return task;
        // Nothing left in main memory
        return steal_task();
    }

(b) Put task:

    void put_task(void *task) {
        bool ins = insert_private(task);
        if (ins) return;
        lock(public);
        ins = insert_public(task);
        if (ins) {
            unlock(public);
            return;
        }
        // Local store queue is full
        swap_out_to_mm();
        insert(task);
        unlock(public);
    }

Fig. 3. Simplified sequence for accessing the task pool (SPE code)
3 Experimental Results
We evaluate the performance and scalability of our task pool implementation in two steps. First, we compare different task pool variants, using synthetic workloads generated by a small benchmark application. Second, we present results from a set of three applications with task parallelism: a matrix multiplication, an LU decomposition, and a particle simulation based on the Linked-Cell method [4]. For these applications, we compare performance with Cell Superscalar (CellSs) 2.1, which was shown to achieve good scalability for workloads with tasks in the 50µs range [5,6]. Matrix multiplication and decomposition codes are taken from the examples distributed with CellSs. We performed runtime experiments on an IBM BladeCenter QS22 with two PowerXCell 8i processors and 8 GB of DDR2 memory per processor. The system is running Fedora 9 with Linux kernel 2.6.25-14. Programming support is provided by the IBM Cell SDK version 3.1 [7]. The speedups presented in the following subsections are based on average runtimes from ten repetitions. Task pool parameters are summarized in Table 1.

3.1 Synthetic Application
We consider two synthetic workloads with the following runtime characteristics:
– Static(f, n): Create n tasks of size f. All tasks are identical and require the same amount of computation.
– Dynamic(f, n): Create n initial tasks of size f. Depending on the task, up to two child tasks may be created. Base tasks are identical to those of the static workload.
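The task counts of the two workloads can be illustrated with a small sketch. The actual child-spawning rule of the dynamic workload is not specified here, so the model below assumes a uniform branching factor per level for illustration only; it does not reproduce the exact task count of the experiments.

```c
/* Static(f, n): exactly n identical tasks are created. */
long count_static(long n)
{
    return n;
}

/* Idealized dynamic workload: every task spawns `branching` children,
 * down to `levels` additional levels below the initial tasks. The real
 * workload spawns up to two children per task, data-dependently. */
long count_dynamic(long n_initial, int levels, int branching)
{
    long total = 0, level_tasks = n_initial;
    for (int d = 0; d <= levels; d++) {
        total += level_tasks;           /* tasks created at this level */
        level_tasks *= branching;       /* each task spawns children  */
    }
    return total;
}
```

With branching factor 2, a single initial task and three levels of children yield 1 + 2 + 4 + 8 = 15 tasks; the paper's Dynamic(f, 2^6) workload expands 64 initial tasks into 143 992 tasks under its own (data-dependent) rule.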
Exploiting Fine-Grained Parallelism on Cell Processors
181
Table 1. Distributed task pool parameters used in the evaluation
Task size is a linear function of the workload parameter f such that a base task of size f = 1 is executed in around 500 clock cycles on an SPE. Figure 4 shows the relative performance of selected task pools, based on executing the synthetic workloads Static(f, 10^5) and Dynamic(f, 2^6). In the case of the dynamic workload, n = 2^6 initial tasks result in a total of 143 992 tasks to be created and executed. Task sizes range from very to moderately fine-grained. To evaluate the benefits of local task storage, we compare speedups with a central task pool that uses shared storage only. Such a task pool is simply configured without the local store queues described in the previous section. Lacking the corresponding data structures, tasks are scheduled one at a time without taking advantage of bundling. In addition, we include the results of our previous task pool implementation, reported in [3]. Speedups are calculated relative to the central task pool running with a single SPE.

Figures 4(a) and (b) show that fine-grained parallelism drastically limits the scalability of all task pools that rely on the PPE to create tasks. This is to be expected, since the PPE cannot create and insert tasks fast enough to keep all SPEs busy. The key to breaking this sequential bottleneck is to involve the SPEs in task creation. Because SPEs can further take advantage of their local stores, we see significant improvements in terms of scalability, resulting in a performance advantage of 8.5× and 6.7× over the same task pool using PPE task creation. In this and the following experiments, the tasks to be created are evenly distributed among half of the SPEs, in order to overlap task creation and execution. Without task creation support from the SPEs, our new implementation is prone to parallel slowdowns, as is apparent in Figs. 4(a) and (b).
While the deques of our previous task pool allow for concurrency between enqueue and dequeue operations, the block list requires exclusive access to a given block when inserting or removing tasks. Thus, with each additional SPE accessing the block list, lock contention increases and, as a result, the task creation performance of the PPE degrades. Figure 4(c) shows that tasks of size f = 100 (16µs) are already large enough to achieve good scalability with all distributed task pools, regardless of task creation. The central task pool scales equally well up to eight SPEs, but suffers from increasing contention when using threads allocated on a second Cell processor. Figures 4(d) and (e) show that fine-grained nested parallelism is a perfect fit for the hierarchically distributed task pool, with performance improvements of 25.5× and 14.5× over the central task pool (using 16 SPEs). Even for larger tasks, the new implementation maintains a significant advantage, as can be seen in Fig. 4(f).
(a) Static(1, 10^5)   (b) Static(10, 10^5)   (c) Static(100, 10^5)
(d) Dynamic(1, 2^6)   (e) Dynamic(10, 2^6)   (f) Dynamic(100, 2^6)

Fig. 4. Speedups for the synthetic application with static and dynamic workloads (f, n), where f is the task size factor and n is the number of tasks. The resulting average task size is shown in the upper left of each figure. Speedups are calculated relative to the central task pool (shared storage only) using one SPE.
3.2 Matrix Multiplication
The matrix multiplication workload is characterized by a single type of task, namely the block-wise multiplication of the input matrices. Figure 5 shows the effect of decreasing task size on the scalability of the implementations. Speedups are calculated relative to CellSs using a single SPE. CellSs scales reasonably well for blocks of size 32. However, if we further decrease the block size towards fine-grained parallelism, we clearly see the limitations of CellSs. Likewise, requiring the PPE to create all tasks drastically limits the scalability of our task pool implementation. Because the shared storage pool is contended by both PPE and SPE threads, we face similar slowdowns as described above. For scalable execution of tasks in the low microsecond range, task creation must be offloaded from the PPE to the SPEs. Using 16 SPEs (eight SPEs for task creation), the distributed task pool achieves up to 12.8× the performance of CellSs.

(a) B = 32   (b) B = 16   (c) B = 8

Fig. 5. Speedups for the multiplication of two 1024×1024 matrices. The multiplication is carried out in blocks of size B × B. Speedups are relative to CellSs using one SPE.

(a) B = 32   (b) B = 16   (c) B = 8

Fig. 6. Speedups for the LU decomposition of a 1024×1024 matrix. The decomposition is carried out in blocks of size B × B. Speedups are relative to CellSs using one SPE.

3.3 LU Decomposition
Compared to the matrix multiplication, the LU decomposition exhibits a much more complex task structure. The workload is characterized by four different types of tasks and by decreasing parallelism with each iteration of the algorithm. Mapping the task graph onto the task pool requires stepwise scheduling and barrier synchronization. Figure 6 shows speedups relative to CellSs, based on the same block sizes as in the matrix multiplication example. Due to the complex task dependencies and the sparsity of the input matrix, we cannot expect to see linear speedups up to 16 SPEs. Using a block size greater than 32 results in a small number of coarse-grained tasks. Limited parallelism, especially in the final iterations of the algorithm, motivates a decomposition at a finer granularity. In fact, the best performance is achieved when using blocks of size 32 or 16. Once again, fine-grained parallelism can only be exploited by freeing the PPE from the burden of task creation. In that case, we see an improvement of up to 6× over CellSs.

3.4 Linked-Cell Particle Simulation
The particle simulation code is based on the Linked-Cell method for approximating short-range pair potentials, such as the Lennard-Jones potential in molecular dynamics [4]. To save computational time, the Lennard-Jones potential is truncated at a cutoff distance r_c, whose value is used to subdivide the simulation space into cells of equal length. Force calculations can then be limited to interactions with particles in the same and in neighboring cells.

For our runtime tests, we use a 2D simulation box with reflective boundary conditions. The box is subdivided into 40 000 cells, and particles are randomly distributed, with the added constraint of forming a larger cluster. The workload consists of two types of tasks, force calculation and time integration, which update the particles of a given cell. At the end of each time step, particles that have crossed cell boundaries are copied to their new enclosing cells. This task is performed sequentially by the PPE.

We focus on the execution of three workloads, based on the number of particles to simulate: 1000, 10 000, and 100 000. The potential for parallel speedup increases with the number of particles, but at the same time, clustering of particles leads to increased task size variability. Thus, efficient execution depends on load balancing at runtime.

Figure 7 shows the results of running the simulation for ten time steps. Once again, speedups are calculated relative to CellSs using a single SPE. Maximum theoretical speedups are indicated by the dashed line, representing parallel execution of force calculation and time integration without any task-related overheads. In the case of only 1000 particles, there is little parallelism to exploit. Although SPE task creation leads to some improvement, 16 SPEs end up delivering roughly the same performance as a single SPE executing the workload in sequence. In this regard, it is important to note that tasks are only created for non-empty cells, requiring that each cell be checked for that condition. Whereas the PPE can simply load a value from memory, an SPE has to issue a DMA command. Even with atomic DMA operations, the additional overhead of reading a cell's content is on the order of a few hundred clock cycles. The amount of useful parallelism increases with the number of particles to simulate. Using 16 SPEs, the task pool delivers 62% and 75% of the ideal performance when simulating 10 000 and 100 000 particles, respectively. In contrast, CellSs achieves only 16% and 20% of the theoretical peak.

(a) 1000 particles   (b) 10 000 particles   (c) 100 000 particles

Fig. 7. Speedups for the Linked-Cell application with three different workloads based on the number of particles to simulate. Idealized scheduling assumes zero overhead for task management, scheduling, and load balancing. Speedups are relative to CellSs using one SPE.
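The cell indexing underlying the Linked-Cell method can be sketched as follows. The 200 × 200 grid matches the 40 000 cells of the experiments; the function names and the flat row-major index layout are our own assumptions.

```c
/* Sketch of Linked-Cell spatial indexing: the 2D box is divided into
 * cells of edge length r_c, so the force computation for a particle
 * only needs its own cell and the (up to) eight neighboring cells. */
#define GRID 200   /* 200 x 200 = 40 000 cells, as in the experiments */

/* Map a particle position to a flat cell index (row-major). */
int cell_index(double x, double y, double rc)
{
    int cx = (int)(x / rc);
    int cy = (int)(y / rc);
    return cy * GRID + cx;
}

/* Count the neighbor cells that lie inside the box; interior cells
 * have eight, corner cells only three. */
int neighbor_cells(int cx, int cy)
{
    int n = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;
            int nx = cx + dx, ny = cy + dy;
            if (nx >= 0 && nx < GRID && ny >= 0 && ny < GRID)
                n++;
        }
    return n;
}
```

In the task-pool version, one force-calculation task covers one such cell, which is why clustered particle distributions produce tasks of very different sizes.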
4 Related Work
Efficient support for fine-grained parallelism requires runtime systems with low overheads. Besides bundling tasks in the process of scheduling, runtime systems may decide to increase the granularity of tasks depending on the current load and available parallelism. Strategies such as lazy task creation [8] and task cut-off [9] limit the number of tasks and the associated overhead of task creation, while still preserving enough parallelism for efficient execution. However, not all applications lend themselves to combining tasks dynamically at runtime.

In a recent scalability analysis of CellSs, Rico et al. compare the speedups for a number of applications with different task sizes [10]. The authors conclude that, in some of their applications, the task creation overhead of the PPE limits scalability to less than 16 SPEs. To reduce this overhead and thereby improve scalability, the authors suggest a number of architectural enhancements to the PPE, such as out-of-order execution, wider instruction issue, and larger caches. All in all, task creation could be accelerated by up to 50%. In the presence of fine-grained parallelism, however, task creation must be lightweight and scalable, which we believe is best achieved by creating tasks in parallel.

Regardless of optimizations, runtime systems introduce additional overhead, which limits their applicability to tasks of a certain minimum size. Kumar et al. make a case for exploiting fine-grained parallelism on future CMPs and propose hardware support for accelerating task queue operations [11,12]. Their proposed design, Carbon, adds a set of hardware task queues, which implement a scheduling policy based on work stealing, and per-core task prefetchers to hide the latency of accessing the task queues. On a set of benchmark applications with fine-grained parallelism from the field of Recognition, Mining, and Synthesis (RMS), the authors report up to 109% performance improvement over optimized software implementations. While hardware support for task scheduling and load balancing has great potential for exploiting large-scale CMPs, limited flexibility in terms of scheduling policies will not obviate the need for sophisticated software implementations.
5 Conclusions
In this paper, we have presented a hierarchically distributed task pool that provides a basis for efficient task parallel programming on Cell processors. The task pool is divided into local and shared storage domains, which are mapped to the Cell's local stores and main memory. Task movement between the storage domains is facilitated by DMA-based operations. Although we make extensive use of Cell-specific communication and synchronization constructs, the concepts are general enough to be of interest for other platforms. Based on our experiments, we conclude that scalable execution on the Cell processor requires SPE-centric runtime system support. In particular, the current practice of relying on the PPE to create tasks is not suitable for exploiting fine-grained parallelism. Instead, tasks should be created by a number of SPEs, in order to maximize the available parallelism at any given time.
Acknowledgments. We thank the Forschungszentrum Jülich for providing access to their Cell blades. This work is supported by the Deutsche Forschungsgemeinschaft (DFG).
References

1. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell multiprocessor. IBM J. Res. Dev. 49(4/5) (2005)
2. Johns, C.R., Brokenshire, D.A.: Introduction to the Cell Broadband Engine Architecture. IBM J. Res. Dev. 51(5) (2007)
3. Hoffmann, R., Prell, A., Rauber, T.: Dynamic Task Scheduling and Load Balancing on Cell Processors. In: Proc. of the 18th Euromicro Intl. Conference on Parallel, Distributed and Network-Based Processing (2010)
4. Griebel, M., Knapek, S., Zumbusch, G.: Numerical Simulation in Molecular Dynamics, 1st edn. Springer, Heidelberg (September 2007)
5. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a Programming Model for the Cell BE Architecture. In: Proc. of the 2006 ACM/IEEE Conference on Supercomputing (2006)
6. Perez, J.M., Bellens, P., Badia, R.M., Labarta, J.: CellSs: Making it easier to program the Cell Broadband Engine processor. IBM J. Res. Dev. 51(5) (2007)
7. IBM: IBM Software Development Kit (SDK) for Multicore Acceleration Version 3.1, http://www.ibm.com/developerworks/power/cell
8. Mohr, E., Kranz, D.A., Halstead Jr., R.H.: Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. In: Proc. of the 1990 ACM Conference on LISP and Functional Programming (1990)
9. Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: Proc. of the 2008 ACM/IEEE Conference on Supercomputing (2008)
10. Rico, A., Ramirez, A., Valero, M.: Available task-level parallelism on the Cell BE. Scientific Programming 17, 59–76 (2009)
11. Kumar, S., Hughes, C.J., Nguyen, A.: Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors. In: Proc. of the 34th Intl. Symposium on Computer Architecture (2007)
12. Kumar, S., Hughes, C.J., Nguyen, A.: Architectural Support for Fine-Grained Parallelism on Multi-core Architectures. Intel Technology Journal 11(3) (2007)
Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.

Rikard Hultén (1), Christoph W. Kessler (1), and Jörg Keller (2)

(1) Linköpings Universitet, Dept. of Computer and Inf. Science, 58183 Linköping, Sweden
(2) FernUniversität in Hagen, Dept. of Math. and Computer Science, 58084 Hagen, Germany
Abstract. Limited bandwidth to off-chip main memory is a performance bottleneck in chip multiprocessors for streaming computations, such as the Cell/B.E., and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. In earlier work, we have proposed such a transformation technique: on-chip pipelining. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip bus, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead. In this paper, we consider parallel mergesort on the Cell/B.E. as a representative memory-intensive application, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique on the Cell processor, describe our implementation, evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed reduces, for realistic problem sizes, merging times by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was until now the fastest mergesort implementation on Cell.
1 Introduction

The new generation of multiprocessors-on-chip derives its raw power from parallelism, and explicit parallel programming with platform-specific tuning is needed to turn this power into performance. A prominent example is the Cell Broadband Engine [1] with a PowerPC core and 8 parallel slave processors called SPEs. Yet, many applications use the Cell BE like a dancehall architecture: the SPEs use their small on-chip local memories (256 KB for both code and data) as explicitly managed caches, and they all load and store data from/to the external (off-chip) main memory. The bandwidth to the external memory is much smaller than the SPEs' aggregate bandwidth to the on-chip interconnect bus (EIB). This limits performance and prevents scalability.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 187–198, 2010. © Springer-Verlag Berlin Heidelberg 2010
188
R. Hultén, C.W. Kessler, and J. Keller
External memory is also a bottleneck in other multiprocessors-on-chip. This problem will become more severe as the core count per chip is expected to increase considerably in the foreseeable future. Scalable parallelization on such architectures must therefore use direct communication between the SPEs to reduce communication with off-chip main memory. In this paper, we consider the important domain of memory-intensive computations and take the global merging phase of pipelined mergesort on Cell as a challenging case study, for the following reasons:
– The ratio of computation to data movement is low.
– The computational load of tasks varies widely (by a factor of 2^k for a binary merge tree with k levels).
– The computational load of a merge task is not fixed but only averaged.
– Memory consumption is not proportional to computational load but constant among tasks.
– Communication always occurs between tasks of different computational load.
These factors complicate the mapping of tasks to SPEs. In total, pipelining a merge tree is much more difficult than pipelining task graphs of regular problems such as matrix-vector multiplication.

The task graph of the global merging phase consists of a tree of merge tasks that should contain, in the lowest layer, at least as many merger tasks as there are SPEs available. Previous solutions like CellSort [2] and AAsort [3] process the tasks of the merge tree layer-wise bottom-up in serial rounds, distributing the tasks of a layer equally over SPEs (there is no need to have more than one task per SPE). Each layer of the tree is then processed in a dancehall fashion, where each task operates on (buffered) operand and result arrays residing in off-chip main memory. This organization leads to relatively simple code but puts a high access load on the off-chip memory interface.
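The load imbalance noted in the second bullet can be made concrete with a two-line sketch: in a binary merge tree over n elements, a task at depth d (root at depth 0) processes about n / 2^d elements on average, so root and leaf loads differ by a factor of 2^k.

```c
/* Average per-task load at a given depth of a binary merge tree:
 * the root (depth 0) sees all n elements, each level below halves
 * the per-task share. */
long level_load(long n, int depth)
{
    return n >> depth;          /* n / 2^depth */
}

/* Ratio between root load and leaf load in a tree with k levels. */
long load_ratio(int k)
{
    return 1L << k;             /* 2^k */
}
```

This is exactly why a static one-task-per-SPE mapping cannot balance a pipelined merge tree: several light leaf mergers must share an SPE with, or be weighed against, a single heavy merger near the root.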
On-chip pipelining reorganizes the overall computation in a pipelined fashion such that intermediate results (i.e., temporary stream packets of sorted elements) are not written back to main memory, where they would wait to be reloaded in the next layer processing round, but instead are forwarded immediately to a consuming successor task that possibly runs on a different core. This of course requires some buffering in on-chip memory and on-chip communication of intermediate results where producer and consumer task are mapped to different SPEs, but multi-buffering is necessary anyway in processors like Cell in order to overlap computation with (DMA) communication. It also requires that all merger tasks of the algorithm be active simultaneously; usually there are several tasks mapped to an SPE, which are dynamically scheduled by a user-level round-robin scheduler as data becomes available for processing. However, as we would like to guarantee fast context switching on SPEs, the limited size of Cell's local on-chip memory puts a limit on the number of buffers and thus tasks that can be mapped to an SPE, or correspondingly a limit on the size of data packets that can be buffered, which also affects performance. Moreover, the total volume of intermediate data forwarded on-chip should be low and, in particular, must not exceed the capacity of the on-chip bus. Hence, we obtain a constrained optimization problem for mapping the tasks of streaming computations to the SPEs of Cell such that the resulting throughput is maximized.
In previous work we developed mappings for merge trees [4,5]. In particular, we have developed various optimal, approximative and heuristic mapping algorithms for optimized on-chip pipelining of merge trees. Theoretically, a tremendous reduction of the required memory bandwidth could be achieved, and our simulations for an idealized Cell architecture indicated that considerable speedups over previous implementations are possible. But an implementation on the real processor is very tricky if it is to overcome the overhead related to dynamic scheduling, buffer management, synchronization and communication delays. Here, we detail our implementation, which actually achieves a notable speedup of up to 61% over the best previous implementation; this supports our earlier theoretical estimations with experimental evidence. The results also support the hypothesis that on-chip pipelining is worthwhile as an algorithmic engineering option in general, because simpler applications might profit even more.

The remainder of this article is organized as follows. In Section 2, we give a short overview of the Cell processor, as far as needed for this article. Section 3 develops the on-chip pipelined merging algorithm, Section 4 gives details of the implementation, and Section 5 reports on the experimental results. Further details will soon be available in a forthcoming master's thesis [6]. Section 6 concludes and identifies issues for future work.
2 Cell/B.E. Overview

The Cell/B.E. (Broadband Engine) processor [1] is a heterogeneous multi-core processor consisting of 8 SIMD processors called SPEs and a dual-threaded PowerPC core (PPE), which differ in architecture and instruction set. In earlier versions of the Sony PlayStation 3 (PS3), up to 6 SPEs of its Cell processor could be used under Linux. On IBM's Cell blade servers such as the QS20 and later models, two Cells with a total of 16 SPEs are available. Cell blades are used, for instance, in the nodes of RoadRunner, which was the world's fastest supercomputer in 2008–2009.

While the PPE is a full-fledged superscalar processor with direct access to off-chip memory via L1 and L2 cache, the SPEs are optimized for doing SIMD-parallel computations at a significantly higher rate and lower power consumption than the PPE. The SPE datapaths and registers are 128 bits wide, and the SPU vector instructions operate on them as on vector registers, holding 2 doubles, 4 floats or ints, 8 shorts or 16 bytes, respectively. For instance, four parallel float comparisons between the corresponding sections of two vector registers can be done in a single instruction. However, branch instructions can tremendously slow down the data throughput of an SPE. The PPE should mainly be used for coordinating SPE execution, providing OS services and running control-intensive code.

Each SPE has a small local on-chip memory of 256 KBytes. This local store is the only memory that the SPE's processing unit (the SPU) can access directly, and therefore it needs to accommodate both SPU code and data. There is no cache and no virtual memory on the SPE. Access to off-chip memory is only possible by asynchronous DMA put and get operations that can communicate blocks of up to 16 KB size at a time to and from off-chip main memory.
DMA operations are executed asynchronously by the SPE’s memory flow controller (MFC) unit in parallel with the local SPU; the SPU can initiate a DMA transfer and synchronize with a DMA transfer’s completion. DMA transfer is also possible between an SPE and another SPE’s local store.
190
R. Hult´en, C.W. Kessler, and J. Keller
There is no operating system or runtime system on the SPE except what is linked to the application code in the local store. This necessitates user-level scheduling if multiple tasks are to run concurrently on the same SPE. The SPEs, the PPE and the memory interface are interconnected by the Element Interconnect Bus (EIB) [1]. The EIB is implemented by four uni-directional rings with an aggregate peak bandwidth of 204 GB/s. The bandwidth of each unit on the ring for sending data over or receiving data from the ring is only 25.6 GB/s. Hence, the off-chip memory tends to become the performance bottleneck if heavily accessed by multiple SPEs. Programming the Cell processor efficiently is a challenging task. The programmer should partition an application suitably across the SPEs, coordinate SPE execution with the main PPE program, use the SPE's SIMD architecture efficiently, and take care of proper communication and synchronization at a fairly low level, overlapping DMA communication with local computation where possible. All these different kinds of parallelism are to be orchestrated properly in order to come close to the theoretical peak performance of about 220 GFlops (for single precision). To allow overlapping DMA handling of packet forwarding (both off-chip and on-chip) with computation on Cell, there should be buffer space for at least 2 input packets per input stream and 2 output packets per output stream of each streaming task to be executed on an SPE. While the SPU is processing operands from one buffer, the other one in the same buffer pair can simultaneously be filled or drained by a DMA operation. The two buffers are then switched for each operand and result stream before processing the next packet of data. (Multi-buffering extends this concept from 2 to an arbitrary number of buffers per operand array, ordered in a circular queue.)
This amounts to at least 6 packet buffers for an ordinary binary streaming operation, which need to be accommodated in the size-limited local store of the SPE. Hence, the size of the local store part used for buffers puts an upper bound on the buffer size and thereby on the size of packets that can be communicated. On Cell, the DMA packet size cannot be made arbitrarily small: the absolute minimum is 16 bytes, and to avoid excessive inefficiency, at least 128 bytes should be shipped at a time. Reasonable packet sizes are a few KB (the upper limit is 16 KB). As the size of the SPE local store is severely limited (256 KB for both code and data) and the packet size is the same for all SPEs and throughout the computation, the maximum number of packet buffers over the tasks assigned to any SPE should be kept as small as possible. Another reason to keep the packet size large is the overhead of switching buffers and of user-level runtime scheduling between different computational tasks mapped to the same SPE. Figure 1 shows the sensitivity of the execution time of our pipelined mergesort application (see later) to the buffer size.
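As an illustration, the double-buffering scheme described above can be sketched as follows. This is a minimal Python simulation under our own naming (`PACKET`, `stream_process` are hypothetical); real Cell code would use asynchronous mfc_get/mfc_put DMA intrinsics, which the plain list slicing below merely imitates.

```python
PACKET = 4  # packet size in elements (hypothetical; real packets are a few KB)

def stream_process(data, f):
    # Apply f to a stream packet by packet using two input buffers: while
    # one buffer is being processed, the next packet is 'fetched by DMA'
    # into the other (here the fetch is just a list slice).
    buffers = [None, None]
    out = []
    buffers[0] = data[0:PACKET]   # prefetch the first packet (async DMA get)
    cur, pos = 0, PACKET
    while buffers[cur]:
        # Initiate the 'DMA' for the next packet into the other buffer ...
        buffers[1 - cur] = data[pos:pos + PACKET]
        pos += PACKET
        # ... while the SPU processes the current buffer.
        out.extend(f(x) for x in buffers[cur])
        cur = 1 - cur             # switch buffers for the next packet
    return out
```

The point of the switch at the end of the loop is that the fetch of packet i+1 overlaps the processing of packet i, which is exactly what hides the DMA latency.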
3 On-Chip Pipelined Mergesort
Parallel sorting is needed on every modern platform and is hence heavily investigated. Several sorting algorithms have been adapted and implemented on the Cell BE. The highest performance is achieved by CellSort [2] and AAsort [3]. Both sort data sets that fit into off-chip main memory but not into the local stores, and both implementations are similar in structure. They work in two phases to sort a data set of size N, where S denotes the local memory available per SPE. In the first phase, blocks of data of size 8S that fit into the combined local memories
Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.
191
Fig. 1. Merge times (here for a 7-level merger tree pipeline), shown for various input sizes (number of 128-bit vectors per SPE), strongly depend on the buffer size used in multi-buffering
of the 8 SPEs are sorted. In the second phase, those sorted blocks are merged into a fully sorted data set. We concentrate on the second phase, as the majority of memory accesses occurs there and as it accounts for the largest share of the sorting time for larger input sizes. In CellSort [2], this phase is realized by a bitonic sort, because this avoids data-dependent control flow and thus fully exploits the SPE's SIMD architecture. However, O(N log^2 N) memory accesses are needed, and the reported speedups are small. In AAsort [3], mergesort with 4-to-1 mergers is used in the second phase. The data flow graph of the merge procedures thus forms a fully balanced merge quadtree. The nodes of the tree are executed on the SPEs layer by layer, starting with the leaf nodes. As each merge procedure on each SPE reads from main memory and writes to main memory, all N words are read from and written to main memory in each merge round, resulting in N log4(N/(8S)) = O(N log4 N) words being read from and written to main memory. While this improves the situation, the speedup is still limited. In order to decrease the bandwidth requirements to off-chip main memory and thus increase speedup, we use on-chip pipelining. This means that all merge nodes of all tree levels are active from the beginning, and that results from one merge node are forwarded in packets of fixed size directly to the follow-up merge node, without using main memory as intermediate store. With b-to-1 merger nodes and a k-level merge tree, we realize b^k-to-1 merging with respect to main memory traffic and thus reduce main memory traffic by a factor of k · log4(b).
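The bandwidth arithmetic above can be checked with a tiny model. This is our simplification (each merge round reads and writes all N words once; S denotes the block size produced by the first phase; function names are ours, not the paper's):

```python
import math

def rounds_layered(N, S, b):
    # Merge rounds when a b-ary merge tree is executed layer by layer
    # (AAsort-style): each round performs b-to-1 merging of all N words.
    return math.log(N / S, b)

def rounds_pipelined(N, S, b, k):
    # Merge rounds when k levels of b-to-1 mergers run as one on-chip
    # pipeline, i.e. each round is a b**k-to-1 merge w.r.t. main memory.
    return math.log(N / S, b ** k)

# Relative to 4-to-1 layered merging, the traffic reduction factor of a
# k-level pipeline of b-to-1 mergers is k * log4(b):
reduction = rounds_layered(2**20, 2**12, 4) / rounds_pipelined(2**20, 2**12, 2, 5)
```

For b = 2 and k = 5 this gives a factor of 5 · log4(2) = 2.5, matching the k · log4(b) expression above.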
The decision to forward merged data streams in packets of fixed size makes it possible to use buffers of this fixed size for all merge tasks. It also enables follow-up merge tasks to start work before their predecessor mergers have processed their input streams completely, thus keeping as many merge tasks busy as possible, and it allows pipeline depths independent of the lengths of the data streams. Note that the mergers in the AAsort algorithm [3] already have to work with buffering and fixed-size packets.
The requirement to keep all tasks busy is complicated by the fact that the processing of data streams is not completely uniform over all tasks but depends on the data values in the streams. A merger node may for some time consume data from only one input stream, if those data values are much smaller than the data values in the other input streams. Hence, if all input buffers for those streams are filled, and the output buffers of the respective predecessor merge tasks are filled as well, those merge tasks will stall. Moreover, after some time the throughput of the merger node under consideration will be reduced to the output rate of the predecessor merger producing the input stream with the small data values, so that follow-up mergers might be stalled as a consequence. Larger buffers might alleviate this problem, but are not possible if too many tasks are mapped to one SPE. Finally, the merger nodes should be distributed over the SPEs such that two merger nodes that communicate are placed onto the same SPE whenever possible, to reduce the communication load on the EIB. As a secondary goal, if they cannot be placed onto the same SPE, they might be placed such that their distance on the EIB is small, so that different parts of the EIB can be used in parallel. In our previous work [4], we formulated this problem of mapping tasks to SPEs as an integer linear programming (ILP) optimization problem with the constraints given above. An ILP solver (we use CPLEX 10.2 [7]) can find optimal mappings for small tree sizes (usually for k ≤ 6) within reasonable time; for k = 7, it can still produce an approximative solution. For larger tree sizes, we used an approximation algorithm [4].
Fig. 2. Two Pareto-optimal solutions for mapping a 5-level merge tree onto 5 SPEs, computed by the ILP solver [4]. (a) The maximum memory load is 22 communication buffers (SPE4 has 10 nodes with 2 input buffers each, and 2 of these have output buffers for cross-SPE forwarding) and communication load 1.75 (times the root merger’s data output rate); (b) max. memory load 18 and communication load 2.5. The (expected) computational load is perfectly balanced (1.0 times the root merger’s load on each SPE) in both cases.
The ILP-based mapping optimizer can be configured by a parameter in (0, 1) that controls the priority among the secondary optimization goals, memory load and communication load; computational load balance is always the primary optimization goal. Example mappings computed with different parameter values for a 5-level tree are visualized in Fig. 2.
4 Implementation Details
Merging kernel. SIMD instructions are used as much as possible in the innermost loops of the merger node. Merging two (quad-word) vectors is done entirely with SIMD instructions, as in CellSort [2]. In principle, it is possible to use only SIMD instructions in the entire merge loop, but we found that this did not reduce time, because eliminating an if-statement required too many comparisons and redundant data movement.
Mapping optimizer. The mapping of merger task nodes to SPEs is read in by the PPE from a text file generated by the mapping optimizer. The PPE generates the task descriptors for each SPE at runtime, so that our code is not constrained to a particular merge tree, but still optimized for the merge tree currently used. Due to the complexity of the optimization problem, optimal mappings can be (pre-)computed only for smaller tree sizes, up to k = 6. For larger trees, we use the approximative mapping algorithm DC-map [4], which computes mappings by recursively composing mappings for smaller trees, using the available optimal mappings as base cases.
SPE task scheduler. Tasks mapped to the same SPE are scheduled by a user-level scheduler in round-robin order. A task is ready to run if it has sufficient input and a free output buffer. A task runs as long as it has both input data and space in the output buffer; it then initiates the transfer of its result packet to its parent node and returns control to the scheduler loop. If there are enough other tasks to run afterwards, the DMA time for flushing the output buffer is masked, and hence only one output buffer per task is necessary (see below). Tasks that are not data-ready are skipped. As the root merger is always alone on its SPE, no scheduler is needed there and many buffers are available; its code is optimized for this special case.
Buffer management.
Because nodes (except for the root) are scheduled round-robin, the DMA latency can in general be masked completely by the execution of other tasks, and hence double-buffering of input or output streams is not necessary at all, which reduces buffer requirements considerably. An output stream buffer is only used for tasks whose parents (successors) reside on a different SPE. Each SPE has a fixed-size pool of memory for buffers that is shared equally by the nodes. This means that nodes on less populated SPEs, for instance the root merger that has an SPE of its own, can get larger buffers (still multiples of the packet size). Likewise, an SPE with high locality (few edges to tasks on other SPEs) needs fewer output buffers and may thus use larger buffers than another SPE with equally many nodes but more output buffers. A larger buffering capacity for certain tasks (compared to applying the worst-case size to all) reduces the likelihood of an SPE sitting idle because none of its merger tasks is data-ready.
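The scheduling and readiness rules just described might be modeled as follows (an illustrative Python sketch with invented `Task` fields, not the authors' SPE code):

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.input = []        # packets available in the input buffers
        self.out_free = True   # whether an output buffer is free

    def ready(self):
        # A task is ready to run if it has input and a free output buffer.
        return bool(self.input) and self.out_free

    def run(self):
        # Process the input, then 'initiate DMA' of the result packet and
        # return control to the scheduler loop (both stubbed out here).
        self.input.clear()
        return self.name + " ran"

def schedule(tasks, steps):
    # Round-robin over the tasks mapped to one SPE; tasks that are not
    # data-ready are simply skipped.
    log = []
    for i in range(steps):
        t = tasks[i % len(tasks)]
        if t.ready():
            log.append(t.run())
    return log
```

A task that consumed its input is skipped on later rounds until new data arrives, which is what keeps the SPE free for whichever of its tasks is data-ready.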
Communication. Data is pushed up the tree (i.e., producers/child nodes control cross-SPE data transfers and consumers/parent nodes acknowledge receipt), except for the leaf nodes, which pull their input data from main memory. The communication differs depending on whether the parent (consumer) node being pushed to is located on the same SPE or not. If the parent is local, the memory flow controller cannot be used, because it demands that the receiving address be outside the sender's local store. Instead, the child's output buffer and its parent's input buffer can simply be the same. This eliminates the need for an extra output buffer and makes more efficient use of the limited amount of memory in the local store. The (system-global) addresses of buffers in the local store on the opposite side of cross-SPE DMA communications are exchanged between the SPEs in the beginning.
Synchronization. Each buffer is organized as a cyclic buffer with a head and a tail pointer. A task only reads from its input buffers and thus only updates the tail pointers; it never writes to the head pointers. A child node only writes to its parent's input buffers, which means it only writes to the head pointer and only reads the tail position. The parent task updates the tail pointer of the input buffer for the corresponding child task; the child knows how large the parent's buffer is and how much it has written to the parent's input buffer so far, and thus knows how much space is left for writing data into the buffer. In particular, when a child reads the tail position of its parent's input buffer, the value is pessimistic, so it is safe to use even if the parent is currently using its buffer and updating the tail position simultaneously. The reverse holds for the head position: the child writes to the head position of the corresponding input buffer, and the parent only reads it. This means that no locks are needed for the synchronization between nodes.
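A minimal model of this lock-free head/tail protocol (Python standing in for the C code on the SPEs; one producer, the child, advances only `head`, and one consumer, the parent, advances only `tail`, so a stale read of the other side's pointer is merely pessimistic):

```python
class RingBuffer:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0   # written only by the producer (child)
        self.tail = 0   # written only by the consumer (parent)

    def space(self):
        # Producer's (possibly stale, hence pessimistic) view of free slots.
        return len(self.buf) - (self.head - self.tail)

    def push(self, x):              # called by the child (producer)
        if self.space() == 0:
            return False            # parent's buffer full: producer stalls
        self.buf[self.head % len(self.buf)] = x
        self.head += 1              # publish only after the data is written
        return True

    def pop(self):                  # called by the parent (consumer)
        if self.head == self.tail:
            return None             # no data ready yet
        x = self.buf[self.tail % len(self.buf)]
        self.tail += 1
        return x
```

Because each pointer has a single writer, no lock is needed; the worst a stale pointer can cause is a spurious "full" or "empty" verdict, never a data race on a slot.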
DMA tag management. An SPE can have up to 32 DMA transfers in flight simultaneously and uses tags in {0, ..., 31} to distinguish between them when polling the DMA status. The Cell SDK offers an automatic tag manager for dynamic tag allocation and release. However, if an SPE has many buffers used for remote communication, it may run out of tags. If that happens, the tag-requesting task gives up, steps back into the task queue and tries to initiate that DMA transfer again when it is scheduled next.
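The retry behaviour on tag exhaustion can be illustrated with a toy tag allocator (names are ours; the real Cell SDK tag manager has a different API). A caller receiving `None` gives up and retries when it is next scheduled, exactly as described above:

```python
class TagManager:
    def __init__(self, ntags=32):
        # The SPE distinguishes in-flight DMA transfers by tags 0..ntags-1.
        self.free = set(range(ntags))

    def acquire(self):
        # Returns a free tag, or None if all tags are in flight
        # (the caller then backs off and retries later).
        return self.free.pop() if self.free else None

    def release(self, tag):
        # Called once the DMA transfer using this tag has completed.
        self.free.add(tag)
```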
5 Experimental Results
We used a Sony PlayStation-3 (PS3) with IBM Cell SDK 3.0 and an IBM blade server QS20 with SDK 3.1 for the measurements. We evaluated data sets as large as would fit into RAM on each system, i.e., up to 32Mi integers on the PS3 (6 SPEs, 256 MiB RAM) and up to 128Mi integers on the QS20 (16 SPEs, 1 GiB RAM). The code was compiled with gcc version 4.1.1 and run on Linux kernel version 2.6.18-128.el5. A number of blocks equal to the number of leaf nodes in the tree under test were filled with random data and sorted. This corresponds to the state of the data after the local sorting phase (phase 1) of CellSort [2]. Ideally, each such block would be of the size of the aggregated local storage available for buffering on the processor. CellSort sorts 32Ki (32,768) integers per SPE; blocks would thus be 4 × 128 KiB = 512 KiB on
the PS3 and 16 × 128 KiB = 2 MiB on the QS20. For example, a 6-level tree has 64 leaf nodes, hence the optimal data size on the QS20 would be 64 × 512 KiB = 32 MiB. However, blocks of other sizes were used in the tests in order to magnify the differences between mappings. Different mappings (usually for parameter values 0.1, 0.5 and 0.9) were tested.
5.1 On-Chip-Pipelined Merging Times
The resulting times with on-chip pipelining for 5-level and 6-level trees on the PS3 are shown in Fig. 3. For the QS20, mappings generated with parameter values 0.1, 0.5 and 0.9 were tested on different data sizes and merger tree sizes from k = 5 to k = 8, see Fig. 4. We see that the choice of the mapping can have a major impact on merging time, as even mappings that are optimal for different optimization goals exhibit timing differences of up to 25%.
Fig. 3. Merge times for k = 5 (left) and k = 6 (right) for different mappings (parameter values) on the PS3
5.2 Results of DC-Map
Using the DC-map algorithm [4], mappings for trees with k = 8, 7 and 6 were constructed by recursive composition, using optimal mappings (computed with the ILP algorithm, parameter value 0.5) as base cases for smaller trees. Fig. 5 shows the result for merging 64Mi integers on the QS20.
5.3 Comparison to CellSort
Table 1 shows the direct comparison between the global merging phase of CellSort (which dominates the overall sorting time for large data sets like these) and on-chip-pipelined merging with the best mapping chosen.

Table 1. Timings for the CellSort global merging phase vs. optimized on-chip-pipelined merging for global merging of integers on QS20

k   #ints   CellSort Global Merging   On-Chip-Pipelined Merging   Speedup
5   16Mi    219 ms                    174 ms                      1.26
6   32Mi    565 ms                    350 ms                      1.61
7   64Mi    1316 ms                   772 ms                      1.70
Fig. 4. Merge times for k = 5, 6, 7, 8 and different input sizes and mappings (parameter values) on QS20
Fig. 5. Merge times (64 Mi integers) for trees (k = 8, 7, 6) constructed from smaller trees using DC-map
We achieve significant speedups with on-chip pipelining in all cases; the best speedup of 70% can be obtained with 7 SPEs (64Mi elements) on the QS20, using the mapping with parameter value 0.01 in Fig. 6; the corresponding speedup figure for the PS3 is 143%, at k = 5 with 16Mi elements. This is due to reduced communication with off-chip memory.
Fig. 6. Merge times on QS20 for k = 7, further mappings
5.4 Discussion
Different mappings give some variation in execution times; it seems that the cost model used in the mapping optimizer is more important than the priority parameters in it. Also with on-chip pipelining, using deeper tree pipelines (to fully utilize more SPEs) is not always beneficial beyond a certain depth k, here k = 6 for the PS3 and k = 8 for the QS20, as too large a number of tasks increases the overhead of on-chip pipelining (smaller buffers, scheduling overhead, tag administration, synchronization, communication overhead). The overall pipeline fill/drain overhead is more significant for small workloads but negligible for the larger ones. From Fig. 1 it is clear that, with optimized mappings, the buffer size may be lowered without losing much performance, which frees space in the local store of the SPEs, e.g. for accommodating the code for all phases of CellSort, saving the time overhead of loading a different SPE program segment for the merge phase.
6 Conclusion and Future Work
With an implementation of the global merging phase of parallel mergesort as a case study of a memory-intensive computation, we have demonstrated how to lower memory bandwidth requirements in code for the Cell BE by optimized on-chip pipelining. We obtained speedups of up to 70% on the QS20 and 143% on the PS3 over the global merging phase of CellSort, which dominates the sorting time for larger input sizes. On-chip pipelining is made possible by several architectural features of Cell that may not be available in other multicore processors. For instance, the possibility to forward data by DMA between individual on-chip memory units is not available on current GPUs, where communication is only to and from off-chip global memory. The possibility to lay out buffers in on-chip memory and move data explicitly is not available on cache-based multicore architectures. Nevertheless, on-chip pipelining will be applicable in upcoming heterogeneous architectures for the DSP and multimedia domain with a design similar to Cell, such as ePUMA [9]. Intel's forthcoming 48-core single-chip
cloud computer [8] will support on-chip forwarding between tiles of two cores, with 16KB buffer space per tile, to save off-chip memory accesses. On-chip pipelining is also applicable to other streaming computations such as general data-parallel computations or FFT. In [10] we have described optimal and heuristic methods for optimizing mappings for general pipelined task graphs. The downside of on-chip pipelining is complex code that is hard to debug. We are currently working on an approach to generic on-chip pipelining where, given an arbitrary acyclic pipeline task graph, an (optimized) on-chip-pipelined implementation will be generated for Cell. This feature is intended to extend our BlockLib skeleton programming library for Cell [11]. Acknowledgements. C. Kessler acknowledges partial funding from EU FP7 (project PEPPHER, #248481), VR (Integr. Softw. Pipelining), SSF (ePUMA), Vinnova, and CUGS. We thank Niklas Dahl and his colleagues from IBM Sweden for giving us access to their QS20 blade server.
References
1. Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell Broadband Engine Architecture and its first implementation—a performance view. IBM J. Res. Devel. 51(5), 559–572 (2007)
2. Gedik, B., Bordawekar, R., Yu, P.S.: CellSort: High performance sorting on the Cell processor. In: Proc. 33rd Intl. Conf. on Very Large Data Bases, pp. 1286–1297 (2007)
3. Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In: Proc. 16th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), pp. 189–198. IEEE Computer Society, Los Alamitos (2007)
4. Keller, J., Kessler, C.W.: Optimized pipelined parallel merge sort on the Cell BE. In: Proc. 2nd Workshop on Highly Parallel Processing on a Chip (HPPC-2008) at Euro-Par 2008, Gran Canaria, Spain (2008)
5. Kessler, C.W., Keller, J.: Optimized on-chip pipelining of memory-intensive computations on the Cell BE. In: Proc. 1st Swedish Workshop on Multicore Computing (MCC-2008), Ronneby, Sweden (2008)
6. Hultén, R.: On-chip pipelining on Cell BE. Forthcoming master thesis, Dept. of Computer and Information Science, Linköping University, Sweden (2010)
7. ILOG Inc.: CPLEX version 10.2 (2007), http://www.ilog.com
8. Howard, J., et al.: A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In: Proc. IEEE International Solid-State Circuits Conference, pp. 19–21 (February 2010)
9. Liu, D., et al.: ePUMA parallel computing architecture with unique memory access (2009), http://www.da.isy.liu.se/research/scratchpad/
10. Kessler, C.W., Keller, J.: Optimized mapping of pipelined task graphs on the Cell BE. In: Proc. 14th Int. Workshop on Compilers for Parallel Computing, Zürich, Switzerland (January 2009)
11. Ålind, M., Eriksson, M., Kessler, C.: BlockLib: A skeleton library for Cell Broadband Engine. In: Proc. ACM Int. Workshop on Multicore Software Engineering (IWMSE-2008) at ICSE-2008, Leipzig, Germany (May 2008)
Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures
Emmanuel Jeannot and Guillaume Mercier
LaBRI, INRIA Bordeaux Sud-Ouest, Institut Polytechnique de Bordeaux
{Emmanuel.Jeannot,Guillaume.Mercier}@labri.fr
Abstract. MPI process placement can play a decisive role for application performance. This is especially true on today's architectures (heterogeneous, multicore, with several levels of cache, etc.). In this paper, we describe a novel algorithm called TreeMatch that maps processes to resources in order to reduce the communication cost of the whole application. We have implemented this algorithm and discuss its performance in simulation and on the NAS benchmarks.
1 Introduction
The landscape of parallel computing has undergone tremendous changes since the introduction of multicore architectures. Multicore machines feature novel hardware characteristics, especially when compared to cluster-based architectures. Indeed, the number of cores available within each system is much higher, and the memory hierarchy has become much more complex than before. Thus, communication performance can change dramatically depending on where processes are located within the system, since the closer the data is to the process, the faster the access will be. This is known as the Non-Uniform Memory Access (NUMA) effect and is commonly experienced in modern computers. As the number of cores in a node is expected to grow sharply in the near future, all these changes have to be taken into account in order to exploit such architectures at their full potential. However, there is a gap between the hardware and the software. Indeed, as far as programming is concerned, the change is less drastic, since users still rely on standards such as MPI or OpenMP. Hybrid programming (that is, mixing the message-passing and shared-memory paradigms) is one of the keys to obtaining the best performance from hierarchical multicore machines. This implies new programming practices that users should follow and apply. However, legacy MPI applications can already take advantage of the computing power offered by such complex architectures. One way of achieving this goal is to match an application's communication pattern to the underlying hardware, that is, to bind the processes that communicate the most to cores that share the most levels of the memory hierarchy (e.g. caches). The idea is therefore to build a correspondence between the list of MPI process ranks and
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 199–210, 2010. © Springer-Verlag Berlin Heidelberg 2010
the list of core numbers in the machine. Thus, the placement of MPI processes relies entirely on the matching that is computed by a relevant algorithm. The issue, then, is to use an algorithm that yields a satisfactory solution to our problem. In this paper, we introduce a new algorithm, called TreeMatch, that efficiently computes a solution to our problem by taking into account the specificities of the underlying hardware. The rest of this paper is organized as follows: in section 2 we describe related work. Section 3 exposes the problem and shows how it can be modeled, while section 4 describes our TreeMatch algorithm. Both theoretical and empirical results are detailed in section 5. Finally, section 6 concludes this paper.
2 Related Work
Concerning process placement, pioneering work was done by Kruskal and Snir [9], where the problem is modeled as a multicommodity flow. The MPIPP framework [2] takes a multicluster context into consideration and strives to dispatch the various MPI processes of an application over the different clusters it uses. Graph-theoretic algorithms are widely used to determine the matching (a list of (MPI process rank, core number) pairs). For instance, several vendor MPI implementations, such as the ones provided by Hewlett-Packard [3] or by IBM (according to [4]), make use of such a mechanism. [6] also formalizes the problem with graphs. In these cases, however, the algorithm computing the final mapping is MPI implementation-specific and takes into account neither the complexity of the hierarchy encountered in multicore NUMA machines nor their topologies. In a previous paper [10], we used a graph partitioning tool called Scotch [5] to perform the mapping computation. However, Scotch works on arbitrary graphs, not just on trees as in our case. Making some hypotheses on the graph structure can lead to improvements, which is why we have developed a new algorithm, called TreeMatch, tailored to fit exactly our specific needs.
3 Problem Modeling
3.1 Hardware Architecture
The first step in determining a relevant process placement consists of gathering information about the underlying hardware. Retrieving information about the memory hierarchy in a portable way is not a trivial task. Indeed, no tool was able to provide information about the various cache levels (such as their respective sizes and which cores can access them) on a wide spectrum of systems. To this end, we participated in the development of a specific software tool that fulfills this goal: Hardware Locality, or Hwloc [1]. Thanks to Hwloc, we can get all the needed information about the architecture, that is, the number of NUMA nodes, sockets and cores, as well as information about the memory hierarchy. In previous works [2,10], the architecture is modeled by a topology
(Footnote: HP-MPI has since been replaced by Platform MPI.)
matrix. Entries in this matrix correspond to the communication speed between two cores and take into account the number of elements of the memory hierarchy shared between the cores. However, such a representation has a side effect, as it "flattens" the view of the hardware's structure. The result is a loss of valuable information that could be exploited by the matching algorithm in order to compute the best possible placement. Indeed, a NUMA machine is most of the time hierarchically structured, so a tree provides a more faithful representation than a topology matrix. Since Hwloc internally uses a tree-like data structure to store all information, we can easily build another data structure derived from Hwloc's. Formally, we have a tree of hardware elements. The depth of an element in this tree corresponds to its depth in the hardware hierarchy; cores (or other computing elements) are the leaves of the tree.
3.2 Application Communication Patterns
The second vital piece of information is the target application's communication pattern. In our case, this pattern consists of the global amount of data exchanged between each pair of processes in the application and is stored in a p × p communication matrix (where p is the number of processes). This approximation yields a "flattened" view of the communication activity that occurs during an application's execution. To gather the needed data, we chose to introduce a small amount of profiling code within an existing MPI implementation (MPICH2). By modifying the low-level communication channels in the MPICH2 stack, we are able to trace data exchanges for both point-to-point and collective communication. Since this tracing is very lightweight, it does not disturb an application's execution. This approach has also been implemented by other vendor MPI implementations, such as HP-MPI (now Platform MPI) [3]. The main drawback is that a preliminary run of the application is mandatory in order to get the communication pattern, and a change in the execution (number of processors, input data, etc.) often requires rerunning the profiling. However, there are other possible approaches. For instance, the MPIPP framework uses a tool called FAST [11] that combines static analysis of the application's code with dynamic execution of a modified application that executes much faster while retaining the same communication pattern as the original application. The goal is to reduce the time necessary to determine the communication pattern by several orders of magnitude. Note that in all approaches, an application, either the original one or a simpler version, has to be executed.
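The accumulation into the p × p matrix can be sketched as follows (an assumed shape with a hypothetical `record` hook; a real implementation instruments the MPI library's low-level channels instead). Keeping the matrix symmetric is our design choice here, since only the total volume exchanged per pair matters for the placement:

```python
class CommProfile:
    def __init__(self, p):
        # p x p matrix of the total volume exchanged between each pair of ranks.
        self.m = [[0] * p for _ in range(p)]

    def record(self, src, dst, nbytes):
        # Called for every traced transfer (point-to-point or a collective
        # broken down into point-to-point contributions).
        self.m[src][dst] += nbytes
        self.m[dst][src] += nbytes
```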
4 The TreeMatch Algorithm
We now describe the algorithm that we developed to compute the process placement. Our algorithm, as opposed to other approaches, is able to take into account the hardware's complex hierarchy. However, to slightly simplify the problem, we assume that the topology tree is balanced (leaves are all at the same
(Footnote: This profiling could also have been done by tools such as VampirTrace.)
Algorithm 1. The TreeMatch Algorithm
Input: T // the topology tree
Input: m // the communication matrix
Input: D // the depth of the tree
1: groups[1..D−1] ← ∅ // how nodes are grouped on each level
2: foreach depth ← D−1 .. 1 do // we start from the leaves
3:   p ← order of m
4:   if p mod arity(T, depth−1) ≠ 0 then // extend the communication matrix if necessary
5:     m ← ExtendComMatrix(T, m, depth)
6:   groups[depth] ← GroupProcesses(T, m, depth) // group processes by communication affinity
7:   m ← AggregateComMatrix(m, groups[depth]) // aggregate communication of the groups of processes
8: MapGroups(T, groups) // process the groups to build the mapping
depth) and symmetric (all the nodes of a given depth possess the same arity). Such assumptions are indeed very realistic in the case of a homogeneous parallel machine where all processors, sockets, nodes or cabinets are identical. The goal of the TreeMatch algorithm is to assign to each MPI process a computing element, and hence a leaf of the tree. In order to optimize the communication time of an application, the TreeMatch algorithm maps processes to cores depending on the amount of data they exchange. The TreeMatch algorithm is depicted in Algorithm 1. To describe how it works, we will run it on the example given in Fig. 1. Here, the topology is modeled by a tree of depth 4 with 12 leaves (cores). The communication pattern between MPI processes is modeled by an 8 × 8 matrix (hence, we have 8 processes). The algorithm processes the tree upward, starting at depth 3. At this depth, the arity k = 2 of the nodes of the next level in the tree divides the order p = 8 of the matrix m. Hence, we go directly to line 6, where the algorithm calls the function GroupProcesses. This function first builds the list of possible groups of processes. The size of a group is given by the arity k of the nodes of the tree at the upper level (here 2). For instance, we can group process 0 with processes 1 or 2 up to 7

Proc    0     1     2     3     4     5     6     7
0       0  1000    10     1   100     1     1     1
1    1000     0  1000     1     1   100     1     1
2      10  1000     0  1000     1     1   100     1
3       1     1  1000     0     1     1     1   100
4     100     1     1     1     0  1000    10     1
5       1   100     1     1  1000     0  1000     1
6       1     1   100     1    10  1000     0  1000
7       1     1     1   100     1     1  1000     0
(a) Communication Matrix
(b) Topology Tree (squares represent mapped processes using different algorithms)
Fig. 1. Input Example of the TreeMatch Algorithm
Near-Optimal Placement of MPI Processes
203
Function GroupProcesses(T, m, depth)
Input: T // The topology tree
Input: m // The communication matrix
Input: depth // current depth
1  l ← ListOfAllPossibleGroups(T, m, depth)
2  G ← GraphOfIncompatibility(l)
3  return IndependentSet(G)
and process 1 with process 2 up to 7, and so on. Formally, we have 8·7/2 = 28 possible groups of processes. As we have p = 8 processes and we group them by pairs (k = 2), we need to find p/k = 4 groups that have no processes in common. To find these groups, we build the graph of incompatibilities between the groups (line 2). Two groups are incompatible if they share a process (e.g. group (2,5) is incompatible with group (5,7), as process 5 cannot be mapped at two different locations). In this incompatibility graph, the vertices are the groups, and there is an edge between two vertices if the corresponding groups are incompatible. The set of groups we are looking for is hence an independent set of this graph. In the literature, such a graph is referred to as the complement of a Kneser graph [8]. A valuable property3 of this graph is that, since k divides p, any maximal independent set is maximum and of size p/k. Therefore, any greedy algorithm always finds an independent set of the required size. However, not all groupings of processes (i.e. independent sets) are of equal quality: their quality depends on the values in the matrix. In our example, grouping process 0 with process 5 is not a good idea, as they exchange only one piece of data, and grouping them would leave a lot of communication to perform at the next level of the topology. To account for this, we assign to each vertex of the graph the amount of communication that remains after forming the corresponding group. For instance, based on matrix m, the sum of the communication of process 0 is 1114 and that of process 1 is 2104, for a total of 3218. If we group them together, we reduce the communication volume by 2000. Hence the value of the vertex corresponding to group (0,1) is 3218 − 2000 = 1218. The smaller the value, the better the grouping. Unfortunately, finding such an independent set of minimum weight is NP-hard and cannot be approximated within a constant factor [7].
Hence, we use heuristics to find a "good" independent set:
– smallest values first: we rank vertices by smallest value first and build a maximal independent set greedily, starting with the vertices of smallest value.
– largest values last: we rank vertices by smallest value first and build a maximal independent set such that the largest index of the selected vertices is minimized.
– largest weighted degrees first: we rank vertices by decreasing weighted degree (the average weight of their neighbours) and build a maximal independent set greedily, starting with the vertices of largest weighted degree [7].
http://www.princeton.edu/~jacobfox/MAT307/lecture14.pdf
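The valuation and the "smallest values first" heuristic can be sketched as follows (illustrative function names; the matrix is the one of Fig. 1(a)). On this input, group (0,1) indeed gets the value 1218, and since k divides p the greedy pass returns p/k = 4 disjoint pairs:

```python
from itertools import combinations

# Communication matrix of Fig. 1(a), as transcribed above
m = [
    [   0, 1000,   10,    1,  100,    1,    1,    1],
    [1000,    0, 1000,    1,    1,  100,    1,    1],
    [  10, 1000,    0, 1000,    1,    1,  100,    1],
    [   1,    1, 1000,    0,    1,    1,    1,  100],
    [ 100,    1,    1,    1,    0, 1000,   10,    1],
    [   1,  100,    1,    1, 1000,    0, 1000,    1],
    [   1,    1,  100,    1,   10, 1000,    0, 1000],
    [   1,    1,    1,  100,    1,    1, 1000,    0],
]

def group_value(m, group):
    # Remaining communication if `group` is formed: total traffic of
    # its members minus the traffic absorbed inside the group.
    total = sum(sum(m[i]) for i in group)
    internal = sum(m[i][j] + m[j][i] for i, j in combinations(group, 2))
    return total - internal

def smallest_values_first(m, k):
    # Greedy "smallest values first" heuristic: pick pairwise-disjoint
    # groups of size k by increasing value (ties broken lexically).
    p = len(m)
    candidates = sorted(combinations(range(p), k),
                        key=lambda g: (group_value(m, g), g))
    chosen, used = [], set()
    for g in candidates:
        if used.isdisjoint(g):
            chosen.append(g)
            used.update(g)
    return chosen
```

The disjointness test is exactly the independent-set condition of the incompatibility graph described above.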
In our case, whichever heuristic we use, we find the independent set of minimum weight, which is {(0,1),(2,3),(4,5),(6,7)}. This list is assigned to the array groups[3] in line 6 of the TreeMatch algorithm. This means that, for instance, process 0 and process 1 will be put on leaves sharing the same parent.
Function AggregateComMatrix(m, g)
Input: m // The communication matrix
Input: g // list of groups of (virtual) processes to merge
1  n ← NbGroups(g)
2  for i ← 0..(n−1) do
3      for j ← 0..(n−1) do
4          if i = j then
5              r[i, j] ← 0
6          else
7              r[i, j] ← Σ_{i1 ∈ g[i]} Σ_{j1 ∈ g[j]} m[i1, j1]
8  return r
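A pure-Python transcription of this aggregation (the function name mirrors the pseudocode; the 4 × 4 matrix below is a small made-up example rather than the matrix of Fig. 1(a), on which the same computation yields r[0][1] = 1012):

```python
def aggregate_com_matrix(m, groups):
    # Remaining communication between each pair of groups; traffic
    # internal to a group has already been absorbed, so r[i][i] = 0.
    n = len(groups)
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                r[i][j] = sum(m[a][b] for a in groups[i] for b in groups[j])
    return r

# Small example: 4 processes merged into the pairs (0,1) and (2,3)
m4 = [
    [0, 9, 2, 1],
    [9, 0, 3, 4],
    [2, 3, 0, 9],
    [1, 4, 9, 0],
]
r = aggregate_com_matrix(m4, [(0, 1), (2, 3)])
```

Here r[0][1] sums the four inter-group entries 2 + 1 + 3 + 4 = 10.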
To continue the algorithm, we build the groups at depth 2. Prior to that, however, we need to aggregate the matrix m with the remaining communication. The aggregated matrix is computed by the function AggregateComMatrix. The goal is to compute the remaining communication between each pair of groups of processes. For instance, between the first group (0,1) and the second group (2,3), the amount of communication is 1012, which is put in r[0, 1] (see Fig. 2(a)). The matrix r is of size 4 by 4 (we have 4 groups) and is returned to be assigned to m (line 7 of the TreeMatch algorithm). Now the matrix m corresponds to the communication pattern between the groups of processes (called virtual processes) built during this step. The goal of the remaining steps of the algorithm is to group these virtual processes up to the root of the tree. The algorithm then loops and decrements depth to 2. Here, the arity at depth 1 is 3 and does not divide the order of m (4); hence we add two artificial groups that do not communicate with any other group. This means that we add two rows and two columns full of zeroes to matrix m. The new matrix is depicted in Fig. 2(b). The goal of this step is to allow more flexibility in the mapping, thus yielding a more efficient mapping.
Virt. proc     0     1     2     3
0              0  1012   202     4
1           1012     0     4   202
2            202     4     0  1012
3              4   202  1012     0
(a) Aggregated matrix (depth 2)
Virt. proc     0     1     2     3     4     5
0              0  1012   202     4     0     0
1           1012     0     4   202     0     0
2            202     4     0  1012     0     0
3              4   202  1012     0     0     0
4              0     0     0     0     0     0
5              0     0     0     0     0     0
(b) Extended matrix
Virt. proc     0     1
0              0   412
1            412     0
(c) Aggregated matrix (depth 1)
Fig. 2. Evolution of the communication matrix at different steps of the algorithm
Function ExtendComMatrix(T, m, depth)
Input: T // The topology tree
Input: m // The communication matrix
Input: depth // current depth
1  p ← order of m
2  k ← arity(T, depth+1)
3  return AddEmptyLinesAndCol(m, k, p)
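A minimal sketch of this padding step (`extend_com_matrix` is a hypothetical name; only the zero-padding logic is shown, the arity lookup in the topology tree is left out):

```python
def extend_com_matrix(m, k):
    # Pad m with zero rows and columns (artificial, non-communicating
    # groups) until the arity k of the upper level divides its order.
    p = len(m)
    pad = (-p) % k  # number of artificial groups to add
    extended = [row + [0] * pad for row in m]
    extended += [[0] * (p + pad) for _ in range(pad)]
    return extended

# Order-2 matrix, upper-level arity 3: one artificial group is added
ext = extend_com_matrix([[0, 5], [5, 0]], 3)
```

In the running example, the order-4 matrix of Fig. 2(a) is padded with two such zero rows/columns to reach order 6 (Fig. 2(b)).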
Once this step is performed, we can group the virtual processes (the groups of processes built in the previous step). Here, the graph modeling and the independent set heuristics lead to the following grouping: {(0,1,4),(2,3,5)}. Then we aggregate the remaining communication to obtain a 2 by 2 matrix (see Fig. 2(c)). During the next loop (depth = 1), there is only one possibility to group the virtual processes, {(0,1)}, which is assigned to groups[1]. The algorithm then goes to line 8. The goal of this step is to map the processes to the resources. To perform this task, we use the groups array, which describes a hierarchy of process groups. A traversal of this hierarchy gives the process mapping. For instance, virtual process 0 (resp. 1) of groups[1] is mapped on the left (resp. right) part of the tree. When a group corresponds to an artificial group, no processes are mapped to the corresponding subtree. At the end, processes 0 to 7 are respectively mapped to leaves (cores) 0,2,4,6,1,3,5,7 (see bottom of Fig. 1(b)). This mapping is optimal. Indeed, it is easy to see that the algorithm provides an optimal solution if the communication matrix corresponds to a hierarchical communication pattern (processes can be arranged in a tree, and the closer they are in this tree, the more they communicate) that can be mapped to the topology tree (such as the matrix of Fig. 1(a)). In this case, optimal groups of (virtual) processes are automatically found by the independent set heuristic, as the corresponding weights of these groups are the smallest among all the groups. Moreover, thanks to the creation of artificial groups in line 5, we avoid the Packed mapping 0,2,4,6,8,10,1,3, which is worse, as processes 4 and 5 communicate a lot with processes 6 and 7 and hence must be mapped to the same subtree. On the same figure, we can see that the Round Robin mapping, which maps process i on core i, also leads to a very poor result.
5
Experimental Validation
In this section, we present several sets of results. First, we show simulated performance comparisons of TreeMatch against simple placement policies such as Round Robin, where processes are dispatched to the various NUMA nodes in a round-robin fashion, and Packed, where processes are bound to cores of the same node until it is fully occupied, and so on. We also compare TreeMatch with the algorithm used in the MPIPP framework [2]. Basically, the MPIPP algorithm starts from a random mapping and strives to improve it by swapping the cores assigned to two processes. The algorithm stops when no further improvement is possible. We have implemented
two versions of this randomized algorithm: MPIPP.5, where we take the best result of the MPIPP algorithm over five different initial random mappings, and MPIPP.1, where only one initial mapping is used.
5.1 Experimental Set-Up
For our simulation experiments, we have used the NAS communication patterns, computed by profiling the execution of each benchmark. In this work, the considered benchmarks have a size (number of processes) of 16, 32/36, or 64. The kernels are bt, cg, ep, ft, is, lu, mg, sp, and the classes (sizes of the data) are: A for sizes 16 and 32/36; B, C, and D for sizes 32/36 and 64. For these benchmarks, we use the topology tree constructed from the real Bertha machine described below. We have also used 14 synthetic communication matrices. These are random matrices with a clear hierarchy between the processes: distinct pairs of processes communicate a lot together, then pairs of pairs communicate a little bit less, etc. For the synthetic communication matrices we also used, in addition to the Bertha topology, synthetic topologies built using the hwloc tool. We have used 6 different topologies with depths from 3 to 5, mimicking current parallel machines (e.g. a cluster of 20 nodes with 4 sockets per node and 6 cores per socket, cores being paired by an L2 cache). The real-scale experiments were carried out on a 96-core machine called Bertha. It is a single system composed of four NUMA nodes. In this context, all communication between NUMA nodes is performed through shared memory. Each NUMA node features four sockets and each socket features six cores. The CPUs are Intel Dunnington at 2.66 GHz, where the L2 cache is shared between two cores and the L3 is shared by all the cores on the socket. We use MPICH2 to perform our real-scale experiments, as its process manager (called Hydra) includes hwloc and thus provides a simple way to bind processes to cores.
5.2 Simulation Results
We have carried out simulations to assess the raw performance of our algorithm. Results are depicted in Fig. 3. On the diagonal of each figure are displayed the names of the different heuristics. The lower part displays the histogram and the ECDF (empirical cumulative distribution function) of the average simulated runtime ratio between the two heuristics of the corresponding row and column. If a ratio is greater than 1, the heuristic above outperforms the heuristic below. The upper part gives a numeric summary: the proportion of ratios that are strictly above 1; the proportion of ratios that are equal to 1 (if any); the median ratio; the average ratio; and, in brackets, the minimum and maximum ratios. For example, on Fig. 3(a), we see that TreeMatch outperforms MPIPP.1 in more than 93% of the cases, with a median ratio of 1.415 and an average ratio of 1.446, with a minimum ratio of 1 and a maximum ratio of 2.352.
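For illustration, the numeric summary shown in the upper part of each comparison can be computed as follows (a sketch on a made-up list of ratios; the function name is ours):

```python
import statistics

def summarize_ratios(ratios):
    # Share of ratios strictly above 1, share equal to 1, median,
    # mean, and the (min, max) extremes, as displayed in the figures.
    n = len(ratios)
    return {
        "above_1": sum(r > 1 for r in ratios) / n,
        "equal_1": sum(r == 1 for r in ratios) / n,
        "median": statistics.median(ratios),
        "mean": statistics.fmean(ratios),
        "range": (min(ratios), max(ratios)),
    }

s = summarize_ratios([1.0, 1.2, 0.9, 1.4])
```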
[Figure 3: pairwise-comparison matrices ("Sim NAS ALL" and "Sim synthetic ALL" panels); the histograms, ECDFs, scatter points and per-pair numeric summaries are not reproducible in text]
(a) Simulation of matchings of the NAS benchmark for the different heuristics for all sizes
(b) Simulation of matchings of synthetic input for the different heuristics for all sizes. Here RR is not shown as it is identical to the Packed mapping
Fig. 3. Simulation results
On Fig. 3(a), we see that for the NAS communication patterns, our TreeMatch algorithm is better than all the other algorithms. It is only slightly better than MPIPP.5, but this version of MPIPP is run with five different seeds and hence is always slower than our algorithm4. The MPIPP version with one seed (MPIPP.1) is outperformed by all the other algorithms. The Packed method provides better results than the Round Robin method because cores are not numbered sequentially by the system or the BIOS. On Fig. 3(b), we see that for synthetic input, the results are even more favorable to the TreeMatch algorithm. This comes from the fact that our algorithm finds the optimal matching for these kinds of matrices, as shown in Section 4.
5.3 NAS Parallel Benchmarks
We then compare the results of the heuristics on the real Bertha machine, using the NAS benchmarks. Here, ratios are computed on the average runtime of at least 4 runs. In Fig. 4(a) we see that TreeMatch is the best heuristic among all the tested ones. It slightly outperforms MPIPP.5, but that heuristic is much slower than ours. Surprisingly, TreeMatch is also only slightly better than Round Robin, and in some cases the ratio is under 0.8. Actually, it appears that Round Robin is very good for NAS benchmarks of size 16 (when all processes fit on the same node). This means that for small problem sizes a clever mapping is not required. However, if we plot the ratios for sizes greater than or equal to 32 (Fig. 4(b)), we see that, in such cases, TreeMatch compares even more favorably to Packed,
Up to 82 times slower in some cases.
[Figure 4: pairwise-comparison matrices ("Bertha ALL" and "Bertha 64-32" panels); plot content not reproducible in text]
(a) NAS benchmark comparison of the different heuristics for all sizes
(b) NAS benchmark comparison of the different heuristics for sizes 32, 36 and 64
Fig. 4. NAS Comparison
Round Robin, or MPIPP.1 (Round Robin being in this case the worst method). Moreover, we see that TreeMatch is never outperformed by more than 10% (the ratio is never under 0.9), while in some cases the gain approaches 30%.
5.4 NAS Communication Patterns Modeling
In the previous section, we have seen that, on average, the TreeMatch algorithm sometimes outperforms the other methods by only a small margin, and that many performances are comparable (in the histograms of Fig. 4, most of the ratios are close to 1). We conjecture that this is mainly due to the fact that many NAS kernels are computation-bound, so improvements in communication time are not always visible. Moreover, the communication patterns are an aggregation over the whole execution and do not account for phases in the algorithm. Hence, to evaluate the impact of our mapping on communication, we have designed an MPI program that executes only the communication pattern (exchanging data corresponding to the pattern with MPI_Alltoallv) and does not perform any computation. Results are displayed in Fig. 5. In Fig. 5(a) we see that TreeMatch is the best heuristic. Except against MPIPP.5, it outperforms the other heuristics in almost 75% of the cases. In several cases, the gain exceeds 30%, and the loss never exceeds 10% (except for RR, where the minimum ratio is 0.83). When we restrict the experiments to 64 processes, the results are even more favorable to TreeMatch (Fig. 5(b)). In this case, the overall worst ratio is always greater than 0.92 (8% degradation), while TreeMatch outperforms the other techniques by up to 20%. Moreover, it has performance better than or similar to MPIPP.5 in two thirds of the cases.
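A sketch of how one row of the communication matrix can be turned into MPI_Alltoallv arguments (send counts and displacements); the per-unit `scale` factor is an assumption, since the paper does not state how the recorded volumes are scaled:

```python
def alltoallv_args(row, scale=1):
    # Send counts (bytes per destination) and the standard exclusive
    # prefix-sum displacements expected by MPI_Alltoallv.
    counts = [int(v * scale) for v in row]
    displs = [0]
    for c in counts[:-1]:
        displs.append(displs[-1] + c)
    return counts, displs

# Row of a hypothetical 4-process communication matrix
counts, displs = alltoallv_args([0, 1000, 10, 1])
```

Each rank would call MPI_Alltoallv with the counts and displacements derived from its own matrix row.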
[Figure 5: pairwise-comparison matrices ("Model ALL" and "Model 64" panels); plot content not reproducible in text]
(a) NAS benchmark communication pattern comparison of the different heuristics for all sizes
(b) NAS benchmark communication pattern comparison of the different heuristics for size 64
Fig. 5. Communication pattern comparison for the NAS Benchmark
6
Conclusion and Future Works
Executing a parallel application on a modern architecture requires carefully taking into account the architectural features of the environment. Indeed, current parallel computers are highly hierarchical, both in terms of topology (e.g. clusters made of nodes with several multicore processors) and in terms of data accesses and exchanges (NUMA architectures with various levels of cache, network interconnection of nodes, etc.). In this paper, we have investigated the placement of MPI processes on these modern infrastructures. We have proposed an algorithm called TreeMatch that maps processes to computing elements based on the hierarchical topology of the target environment and on the communication pattern of the processes. Under reasonable assumptions (e.g. if the communication pattern is structured hierarchically), this algorithm provides an optimal mapping. Simulation results show that our algorithm outperforms other approaches (such as the MPIPP algorithm) both in terms of mapping quality and computation speed. Moreover, the quality improves with the number of processors. As, in some cases, the TreeMatch performance is very similar to that of other strategies, we have studied its impact when the computation is removed. In this case, we see greater differences in performance. We can then conclude that this approach delivers its full potential for communication-bound applications. However, this difference also highlights some modeling issues, as the communication matrix is an aggregated view of the whole execution and does not account for different phases of the application with different communication patterns.
References
1. Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., Namyst, R.: hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. In: Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010). IEEE Computer Society Press, Pisa (February 2010), http://hal.inria.fr/inria-00429889
2. Chen, H., Chen, W., Huang, J., Robert, B., Kuhn, H.: MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In: Egan, G.K., Muraoka, Y. (eds.) ICS, pp. 353–360. ACM, New York (2006)
3. Solt, D.: A profile based approach for topology aware MPI rank placement (2007), http://www.tlc2.uh.edu/hpcc07/Schedule/speakers/hpcc_hp-mpi_solt.ppt
4. Duesterwald, E., Wisniewski, R.W., Sweeney, P.F., Cascaval, G., Smith, S.E.: Method and System for Optimizing Communication in MPI Programs for an Execution Environment (2008), http://www.faqs.org/patents/app/20080288957
5. Pellegrini, F.: Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs. In: Proceedings of SHPCC 1994, Knoxville, pp. 486–493. IEEE, Los Alamitos (May 1994)
6. Träff, J.L.: Implementing the MPI process topology mechanism. In: Supercomputing 2002: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pp. 1–14. IEEE Computer Society Press, Los Alamitos (2002)
7. Kako, A., Ono, T., Hirata, T., Halldórsson, M.M.: Approximation algorithms for the weighted independent set problem. In: Kratsch, D. (ed.) WG 2005. LNCS, vol. 3787, pp. 341–350. Springer, Heidelberg (2005)
8. Kneser, M.: Aufgabe 300. Jahresber. Deutsch. Math.-Verein 58 (1955)
9. Kruskal, C., Snir, M.: Cost-performance tradeoffs for communication networks. Discrete Applied Mathematics 37-38, 359–385 (1992)
10. Mercier, G., Clet-Ortega, J.: Towards an efficient process placement policy for MPI applications in multicore environments. In: Ropo, M., Westerholm, J., Dongarra, J. (eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface. LNCS, vol. 5759, pp. 104–115. Springer, Heidelberg (2009)
11. Zhai, J., Sheng, T., He, J., Chen, W., Zheng, W.: FACT: fast communication trace collection for parallel applications through program slicing. In: SC. ACM, New York (2009)
Parallel Enumeration of Shortest Lattice Vectors
Özgür Dagdelen¹ and Michael Schneider²
¹ Center for Advanced Security Research Darmstadt (CASED), [email protected]
² Technische Universität Darmstadt, Department of Computer Science, [email protected]
Abstract. Lattice basis reduction is the problem of finding short vectors in lattices. The security of lattice-based cryptosystems rests on the hardness of lattice reduction. Furthermore, lattice reduction is used to attack well-known cryptosystems such as RSA. One of the algorithms used in lattice reduction is the enumeration algorithm (ENUM), which provably finds a shortest vector of a lattice. We present a parallel version of the lattice enumeration algorithm. Using multi-core CPU systems with up to 16 cores, our implementation gains a speed-up of up to a factor of 14. Compared to the currently best public implementation, our parallel algorithm saves more than 90% of the runtime.
Keywords: lattice reduction, shortest vector problem, cryptography, parallelization, enumeration.
1
Introduction
A lattice L is a discrete subgroup of the space Rd. Lattices are represented by linearly independent basis vectors b1, . . . , bn ∈ Rd, where n is called the dimension of the lattice. Lattices have been studied in number theory since the eighteenth century; they already appear in the work of Lagrange, Gauss, and Hermite on quadratic forms. Nowadays, lattices and hard lattice problems are widely used in cryptography as the basis of promising cryptosystems. One of the main problems in lattices is the shortest vector problem (SVP), which asks for a vector of shortest length in the lattice. The shortest vector problem is known to be NP-hard under randomized reductions. It is also considered to be intractable even in the presence of quantum computers. Therefore, many lattice-based cryptographic primitives, e.g. one-way functions, hash functions, encryption, and digital signatures, leverage the hardness of the SVP. In the field of cryptanalysis, lattice reduction is used to attack the NTRU and GGH cryptosystems. Further, there are attacks on RSA and on low-density knapsack cryptosystems. Lattice reduction also has applications in other fields of mathematics and number theory: it is used for factoring composite numbers and for computing discrete logarithms using diophantine approximations. In the field of discrete optimization, lattice reduction is used to solve linear integer programs.
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 211–222, 2010. © Springer-Verlag Berlin Heidelberg 2010
Ö. Dagdelen and M. Schneider
The fastest algorithms known to solve SVP exactly are the enumeration algorithm of Kannan [8] and the algorithm of Fincke and Pohst [3]. The variant mostly used in practice was presented by Schnorr and Euchner in 1991 [14]. Nevertheless, these algorithms solve SVP in exponential runtime. So far, enumeration is only applicable in low lattice dimensions (n ≤ 60).¹ For higher dimensions it is only possible to find short vectors, but not a shortest vector. Mostly, these approximate solutions of the SVP are sufficient in practice. In 1982 the famous LLL algorithm was presented for factoring polynomials [9]. This algorithm does not solve SVP; rather, it finds a vector with length exponential in the lattice dimension. However, LLL was the first algorithm with a polynomial asymptotic running time, and it can be run in lattice dimensions up to 1000. In practice, the most promising algorithm for lattice reduction in high dimensions is the BKZ block algorithm by Schnorr and Euchner [14]. It mainly consists of two parts, namely enumeration in blocks of small dimension and LLL in high dimension. BKZ finds shorter vectors than LLL, at the expense of a higher runtime. Considering parallelization, there are various works dealing with LLL, e.g. [15,2]. The more time-consuming part of BKZ, namely the enumeration step (assuming the use of high block sizes), was considered in the master's thesis of Pujol [11] and in a very recent work [13]. A GPU version of enumeration was shown in [7]. The enumeration in lattices can be visualized as a depth-first search in a weighted search tree, with different subtrees being independent from each other. Therefore, it is possible to enumerate different subtrees in parallel threads without any communication between threads. We have chosen multi-core CPUs for the implementation of our parallel enumeration algorithm.

Our Contribution. In this paper, we parallelize the enumeration (ENUM) algorithm by Schnorr and Euchner [14].
We implement the parallel version of ENUM and test it on multi-core CPUs. More precisely, we use up to 16 CPU cores to speed up the lattice enumeration in lattice dimensions of 40 and above. Considering the search tree, the main problem is to predict beforehand which subtrees will be examined during enumeration. We gain speed-ups of up to a factor of 14 in comparison to our single-core version.² Compared to the fastest single-core ENUM implementation known, our parallel version of ENUM saves more than 90% of the runtime. We add some clever additional communication among threads, such that by using s processor cores we even gain a speed-up of more than s in some cases. By this work, we show that it is possible to parallelize the entire BKZ algorithm for the search for short lattice vectors. The strength of BKZ is used to assess the practical hardness of lattice reduction, which helps finding suitable parameters for secure lattice-based cryptosystems. The algorithm of Pujol [11,13] uses a volume heuristic to predict the number of enumeration steps that will be performed in a subtree. This estimate is used to
¹ The recent work of [5] could no longer be considered for our final version.
² On a 24-core machine we gain speed-up factors of 22.
Parallel Enumeration of Shortest Lattice Vectors
predict whether a subtree is split recursively for enumeration in parallel. In contrast to that, our strategy is to control the height of subtrees that can be split recursively.

Organization. Section 2 explains the required basic facts on lattices and parallelization, Section 3 describes the ENUM algorithm of [14], Section 4 presents our new algorithm for parallel enumeration, and Section 5 shows our experimental results.
2 Preliminaries
Notation. Vectors and matrices are written in bold face, e.g. v and M. The expression ⌈x⌋ denotes the integer nearest to x ∈ R, i.e., ⌈x⌋ = ⌈x − 0.5⌉. ‖v‖ denotes the Euclidean norm; other norms are indexed with a subscript, like ‖v‖∞. Throughout the paper, n denotes the lattice dimension.
Lattices. A lattice is a discrete additive subgroup of R^d. It can be represented as the linear integer span of n ≤ d linearly independent vectors b_1, ..., b_n ∈ R^d, which are arranged in a column matrix B = [b_1, ..., b_n] ∈ R^{d×n}. The lattice L(B) is the set of all linear integer combinations of the basis vectors b_i, namely L(B) = {∑_{i=1}^n x_i b_i : x_i ∈ Z}. The dimension of the lattice equals the number of linearly independent basis vectors n. If n = d, the lattice is called full-dimensional. For n ≥ 2 there are infinitely many bases of a lattice; one basis can be transformed into another using a unimodular transformation matrix. The first successive minimum λ_1(L(B)) is the length of a shortest nonzero vector of the lattice. There can exist multiple shortest vectors of a lattice; a shortest vector is not unique. Define the Gram-Schmidt orthogonalization B* = [b*_1, ..., b*_n] of B. It is computed via b*_i = b_i − ∑_{j=1}^{i−1} μ_{i,j} b*_j for i = 1, ..., n, where μ_{i,j} = b_i^T b*_j / ‖b*_j‖² for all 1 ≤ j ≤ i ≤ n. We have B = B* μ^T, where B* is orthogonal and μ^T is an upper triangular matrix. Note that B* is not necessarily a lattice basis.
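The Gram-Schmidt coefficients μ_{i,j} and squared norms ‖b*_i‖² are exactly the inputs the ENUM algorithm consumes later, so a small C++ sketch of their computation may be useful. All names are our own illustration, not the paper's code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gram-Schmidt orthogonalization as defined above:
// mu[i][j] = <b_i, b*_j> / ||b*_j||^2 for j < i, and
// b*_i = b_i - sum_{j<i} mu[i][j] * b*_j, with squared norms bstar2[i].
struct GSO {
    std::vector<std::vector<double>> mu;  // mu[i][j], defined for j < i
    std::vector<double> bstar2;           // ||b*_i||^2
};

GSO gram_schmidt(const std::vector<std::vector<double>>& b) {
    int n = (int)b.size(), d = (int)b[0].size();
    std::vector<std::vector<double>> bstar = b;  // b*_i, updated in place
    GSO g{std::vector<std::vector<double>>(n, std::vector<double>(n, 0.0)),
          std::vector<double>(n, 0.0)};
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            double dot = 0.0;
            for (int k = 0; k < d; ++k) dot += b[i][k] * bstar[j][k];
            g.mu[i][j] = dot / g.bstar2[j];
            // subtract the projection onto b*_j
            for (int k = 0; k < d; ++k) bstar[i][k] -= g.mu[i][j] * bstar[j][k];
        }
        for (int k = 0; k < d; ++k) g.bstar2[i] += bstar[i][k] * bstar[i][k];
    }
    return g;
}
```

For the basis b_1 = (1, 0), b_2 = (1, 2) this yields μ_{2,1} = 1, ‖b*_1‖² = 1, and ‖b*_2‖² = 4.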
Lattice Problems and Algorithms. The most famous problem in lattices is the shortest vector problem (SVP). The SVP asks to find a shortest vector in the lattice, namely a vector v ∈ L\{0} with ‖v‖ = λ_1(L). An approximation version of the SVP was solved by Lenstra, Lenstra, and Lovász [9]. The LLL algorithm is still the basis of most algorithms used for basis reduction today. It runs in polynomial time in the lattice dimension and outputs a so-called LLL-reduced basis. This basis consists of nearly orthogonal vectors and a short first basis vector, with approximation factor exponential in the lattice dimension. The BKZ algorithm by Schnorr and Euchner [14] reaches better approximation factors and is the algorithm mostly used in practice today. As a subroutine, BKZ makes use of an exact SVP solver, such as ENUM. In practice, SVP can only be solved in low dimension n, say up to 60, using exhaustive search techniques or, as a second approach, using sieving algorithms that work probabilistically. An overview of enumeration algorithms is presented in [12]. A randomized sieving approach for solving exact SVP was presented in [1] and an improved variant in [10]. In this paper, we only deal with enumeration algorithms.
Exhaustive Search. In [12], Pujol and Stehlé examine the floating point behaviour of the ENUM algorithm. They state that double precision is suitable for lattice dimensions up to 90. It is common practice to pre-reduce lattices before starting enumeration, as this reduces the radius of the search space. In BKZ, the basis is always reduced with the LLL algorithm before starting enumeration. Publicly available implementations of enumeration algorithms are the established implementation of Shoup's NTL library and the fpLLL library of Stehlé et al. Experimental data on enumeration algorithms using NTL can be found in [4,10], both using NTL's enumeration. A parallel implementation of ENUM is available at Xavier Pujol's website.³ To our knowledge, no results have been published using this implementation; Pujol mentions a speed-up factor of 9.7 using 10 CPUs. Our work was developed independently of Pujol's achievements.

Parallelization. Before we present our parallel enumeration algorithm, we need to introduce definitions specifying the quality of the realized parallelization. Furthermore, we give a brief overview of parallel computing paradigms. There exist many parallel environments to perform operations concurrently. Basically, on today's machines, one distinguishes between shared memory and distributed memory (message passing). A multi-core microprocessor follows the shared memory paradigm, in which each processor core accesses the same memory space. Nowadays, such computer systems are commonly available. They possess several cores, each acting as an independent processor unit; the operating system is responsible for distributing work to the cores. In the parallelization context there exist notions that measure the quality achieved by a parallel algorithm compared to the sequential version. In the remainder of this paper, we will need the following definitions:

Speed-up factor: time needed for serial computation divided by the time required for the parallel algorithm.
Using s processes, a speed-up factor of up to s is expected.

Efficiency: speed-up factor divided by the number of processors used. An efficiency of 1.0 means that s processors lead to a speed-up factor of s, which can be seen as a "perfect" parallelization. Normally the efficiency is smaller than 1.0 because of the overhead of inter-process communication. Parallel algorithms such as graph search algorithms may benefit from communication, in such a way that fewer operations need to be computed. As soon as the number of saved operations exceeds the communication overhead, an efficiency of more than 1.0 may be achieved. For instance, branch-and-bound algorithms for integer linear programming can show superlinear speed-up, due to the interdependency between the search order and the condition which enables the algorithm to disregard a subtree. The enumeration algorithm falls into this category as well.
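The two definitions above reduce to one-liners (function names are ours):

```cpp
#include <cassert>

// Speed-up: serial runtime divided by parallel runtime.
double speed_up(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

// Efficiency: speed-up divided by the number of cores used;
// 1.0 corresponds to a "perfect" parallelization.
double efficiency(double t_serial, double t_parallel, int cores) {
    return speed_up(t_serial, t_parallel) / cores;
}
```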
³ http://perso.ens-lyon.fr/xavier.pujol/index_en.html
3 Enumeration of the Shortest Lattice Vector
In this section we give an overview of the ENUM algorithm, first presented in [14]. Originally, the algorithm was proposed as a subroutine of the BKZ algorithm, but ENUM can be used as a stand-alone instance to solve the exact SVP. An example instance of ENUM in dimension 3 is shown by the solid line of Figure 1. A listing is shown as Algorithm 1.

Algorithm 1. Basic Enumeration Algorithm
Input: Gram-Schmidt coefficients (μ_{i,j})_{1≤j≤i≤n}, ‖b*_1‖², ..., ‖b*_n‖²
Output: u_min such that ‖∑_{i=1}^n u_i b_i‖ = λ_1(L(B))
 1: A ← ‖b*_1‖², u_min ← (1, 0, ..., 0), u ← (1, 0, ..., 0), l ← (0, ..., 0), c ← (0, ..., 0)
 2: t ← 1
 3: while t ≤ n do
 4:   l_t ← l_{t+1} + (u_t + c_t)² ‖b*_t‖²
 5:   if l_t < A then
 6:     if t > 1 then
 7:       t ← t − 1                        ▷ move one layer down in the tree
 8:       c_t ← ∑_{i=t+1}^n u_i μ_{i,t},  u_t ← ⌈−c_t⌋
 9:     else
10:       A ← l_t, u_min ← u               ▷ set new minimum
11:     end
12:   else
13:     t ← t + 1                          ▷ move one layer up in the tree
14:     choose next value for u_t using the zig-zag pattern
15:   end
16: end
To find a shortest non-zero vector of a lattice L(B) with B = [b_1, ..., b_n], ENUM takes as input the Gram-Schmidt coefficients (μ_{i,j})_{1≤j≤i≤n}, the squared norms ‖b*_1‖², ..., ‖b*_n‖² of the Gram-Schmidt orthogonalization of B, and an initial bound A. The search space is the set of all coefficient vectors u ∈ Z^n that satisfy ‖∑_{t=1}^n u_t b_t‖² ≤ A. Starting with an LLL-reduced basis, it is common to set A = ‖b*_1‖² in the beginning. If the norm of the shortest vector is known beforehand, it is possible to start with a lower A, which limits the search space and reduces the runtime of the algorithm. If a vector v of length smaller than A is found, A can be reduced to the norm of v; that means A always denotes the size of the current shortest vector. The goal of ENUM is to find a coefficient vector u ∈ Z^n satisfying the equation

    ‖∑_{t=1}^n u_t b_t‖ = min_{x ∈ Z^n} ‖∑_{t=1}^n x_t b_t‖ .    (1)
Therefore, all coefficient combinations u that determine a vector of norm less than A are enumerated. In Equation (1) we replace all b_t by their orthogonalization, i.e., b_t = b*_t + ∑_{j=1}^{t−1} μ_{t,j} b*_j, and get Equation (2):

    ‖∑_{t=1}^n u_t b_t‖² = ‖∑_{t=1}^n u_t · (b*_t + ∑_{j=1}^{t−1} μ_{t,j} b*_j)‖² = ∑_{t=1}^n (u_t + ∑_{i=t+1}^n μ_{i,t} u_i)² · ‖b*_t‖² .    (2)
Let c ∈ R^n with c_t = ∑_{i=t+1}^n μ_{i,t} u_i (line 8), which is predefined by all coefficients u_i with n ≥ i > t. The intermediate norm l_t (line 4) is defined as l_t = l_{t+1} + (u_t + c_t)² ‖b*_t‖². This is the part of the norm in Equation (2) that is predefined by the values u_i with n ≥ i ≥ t. The algorithm enumerates the coefficients in reverse order, from u_n to u_1. This can be considered as finding a minimum in a weighted search tree. The height of the tree is uniquely determined by the dimension n. The root of the tree denotes the coefficient u_n. The coefficient values u_t for 1 ≤ t ≤ n determine the values of the vertices at depth (n − t + 1); leafs of the tree contain coefficients u_1. The inner nodes represent intermediate nodes, not complete coefficient vectors, i.e., a node on level t determines a subtree (⊥, ..., ⊥, u_t, u_{t+1}, ..., u_n), where the first t − 1 coefficients are not yet set. l_t is the norm part predefined by this inner node on level t. We only enumerate parts of the tree with l_t < A. Therefore, the possible values for u_t on the next lower level lie in an interval around −c_t with (u_t + c_t)² < (A − l_{t+1})/‖b*_t‖², following the definition of l_t. ENUM iterates over all possible values for u_t, as long as l_t ≤ A, the current minimal value. If l_t exceeds A, enumeration of the corresponding subtree can be cut off; the intermediate norm will only increase when stepping down in the tree, as l_t ≤ l_{t−1} always holds. The iteration over all possible coefficient values is (due to Schnorr and Euchner) performed in a zig-zag pattern: the values for u_t are sequenced like either ⌈−c_t⌋, ⌈−c_t⌋ + 1, ⌈−c_t⌋ − 1, ⌈−c_t⌋ + 2, ⌈−c_t⌋ − 2, ... or ⌈−c_t⌋, ⌈−c_t⌋ − 1, ⌈−c_t⌋ + 1, ⌈−c_t⌋ − 2, ⌈−c_t⌋ + 2, .... ENUM starts at the leaf (1, 0, ..., 0), which gives the first possible solution for a shortest vector in the given lattice. The algorithm performs its search by moving up (when a subtree can be cut off due to l_t ≥ A) and down in the tree (lines 13 and 7). The norm of leaf nodes is compared to A.
If l_1 < A, the algorithm stores A ← l_1 and u_min ← u (line 10), which define the current shortest vector and its squared norm. When ENUM moves up to the root of the search tree, it terminates and outputs the computed global minimum A and the corresponding shortest vector u_min.
4 Algorithm for Parallel Enumeration of the Shortest Lattice Vector
In this section we describe our parallel algorithm for enumeration of the shortest lattice vector. The algorithm is a parallel version of the algorithm presented in [14]. First we give the main idea of parallel enumeration; secondly, we present a high-level description (Algorithms 2 and 3 depict our parallel ENUM); thirdly, we explain some improvements that speed up the parallelization in practice.

4.1 Parallel Lattice Enumeration
The main idea for parallelization is the following. Different subtrees of the complete search tree are enumerated in parallel, independently from each other, each represented as a thread (Sub-ENUM thread). Using s processors, s subtrees can be enumerated at the same time. All threads ready for enumeration are
Fig. 1. Comparison of serial (solid line) and parallel (dashed line) processing of the search tree
stored in a list L, and each CPU core that has finished enumerating a subtree picks the next subtree from the list. Each of the subtrees is an instance of SVP in smaller dimension; the initial state of the sub-enumeration can be represented by a tuple (u, l, c, t). When the ENUM algorithm increases the level in the search tree, the center (c_t) and the range ((A − l_{t+1})/‖b*_t‖²) of possible values for the current index are calculated. Therefore, it is easy to open one thread for every value in this range. Figure 1 shows a 3-dimensional example and compares the flow of the serial ENUM with our parallel version. Beginning at the starting node, the processing order of the serial ENUM algorithm follows the directed solid edges to the root. In the parallel version, dashed edges represent the preparation of new Sub-ENUM threads, which can be executed by a free processor unit. Crossed-out edges point out irrelevant subtrees. Threads terminate as soon as they reach either a node of another thread or the root node.

Extra Communication – Updating the Shortest Vector. Again, we denote the current minimum, the global minimum, as A. In our parallel version, it is the global minimum over all threads. As soon as a thread has found a new minimum, the Euclidean norm of this vector is written back to shared memory, i.e. A is updated. At certain points every thread checks whether another thread has updated A and, if so, uses the updated value. The smaller A is, the faster a thread terminates, because subtrees that exceed the current minimum can be cut off in the enumeration. The memory access for this update operation is minimal: only one value has to be written back to or read from shared memory. This is the only type of communication among threads; all other computations can be performed independently without communication overhead.
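The low-overhead update of the shared minimum A described above can be sketched with a compare-and-swap loop. This is our own sketch; the paper does not specify which synchronization primitive its implementation uses:

```cpp
#include <atomic>
#include <cassert>

// Lower the shared bound A to `candidate` if it improves on the current
// value. The CAS loop guarantees that a smaller minimum found concurrently
// by another thread is never overwritten by a larger one.
void update_minimum(std::atomic<double>& A, double candidate) {
    double current = A.load();
    while (candidate < current &&
           !A.compare_exchange_weak(current, candidate)) {
        // `current` was reloaded by compare_exchange_weak; retry
    }
}
```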
4.2 The Algorithm for Parallel Enumeration
Algorithm 2 shows the main thread for the parallel enumeration. It is responsible for initializing the first Sub-ENUM thread and managing the thread list L.
Algorithm 2. Main thread for parallel enumeration
Input: Gram-Schmidt coefficients (μ_{i,j})_{1≤j≤i≤n}, ‖b*_1‖², ..., ‖b*_n‖²
Output: u_min such that ‖∑_{i=1}^n u_i b_i‖ = λ_1(L(B))
 1: A ← ‖b*_1‖², u_min ← (1, 0, ..., 0)        ▷ global variables
 2: u ← (1, 0, ..., 0), l ← 0, c ← 0, t ← 1    ▷ local variables
 3: L ← {(u, l, c, t)}                          ▷ initialize list
 4: while L ≠ ∅ or threads are running do
 5:   if L ≠ ∅ and cores available then
 6:     pick Δ = (u, l, c, t) from L
 7:     start Sub-ENUM thread Δ = (u, l, c, t) on a new core
 8:   end
 9: end
A Sub-ENUM thread (SET) is represented by the tuple (u, l, c, t), where u is the coefficient vector, l the intermediate norm of the root to this subtree, c the search region center and t the lattice dimension minus the starting depth of the parent node in the search tree.
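As an illustration, the SET tuple and the thread list L might look as follows in C++. The names and the locking scheme are ours; the paper's implementation uses Boost.Thread (see Section 5):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// The SET tuple (u, l, c, t) described above.
struct SET {
    std::vector<int> u;  // coefficient vector
    double l;            // intermediate norm at the root of the subtree
    double c;            // search region center
    int t;               // starting level of the subtree
};

// Minimal mutex-protected thread list L, shared between the main thread
// (Algorithm 2) and the Sub-ENUM threads (Algorithm 3).
class ThreadList {
    std::vector<SET> list;
    std::mutex m;
public:
    void push(const SET& s) {
        std::lock_guard<std::mutex> lock(m);
        list.push_back(s);
    }
    bool try_pop(SET& out) {   // returns false when L is empty
        std::lock_guard<std::mutex> lock(m);
        if (list.empty()) return false;
        out = list.back();
        list.pop_back();
        return true;
    }
};
```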
Algorithm 3. Sub-ENUM thread (SET)
Input: Gram-Schmidt coefficients (μ_{i,j})_{1≤j≤i≤n}, ‖b*_1‖², ..., ‖b*_n‖², (ū, l̄, c̄, t̄)
 1: u ← ū, l ← (0, ..., 0), c ← (0, ..., 0)
 2: t ← t̄, l_{t+1} ← l̄, c_t ← c̄, bound ← n
 3: while t ≤ bound do
 4:   l_t ← l_{t+1} + (u_t + c_t)² ‖b*_t‖²
 5:   if l_t < A then
 6:     if t > 1 then
 7:       t ← t − 1                        ▷ move one layer down in the tree
 8:       c_t ← ∑_{i=t+1}^n u_i μ_{i,t},  u_t ← ⌈−c_t⌋
 9:       if bound = n then
10:         L ← L ∪ {(u, l_{t+2}, c_{t+1}, t + 1)}   ▷ insert new SET in list L
11:         bound ← t
12:       end
13:     else
14:       A ← l_t, u_min ← u               ▷ set new global minimum
15:     end
16:   else
17:     t ← t + 1                          ▷ move one layer up in the tree
18:     choose next value for u_t using the zig-zag pattern
19:   end
20: end
Whenever the list contains a SET and a free processor unit exists, the first SET of the list is executed. The execution of SETs is performed by Algorithm 3. We process the search tree in the same manner as the serial algorithm (Algorithm 1), except for the introduction of the loop bound bound and the handling of new SETs (lines 9–11). First, the loop bound controls the termination within the subtree and prevents nodes from being visited twice. Second, only the SET whose bound is set to the lattice dimension is allowed to create new SETs. If instead we allowed each SET to create new SETs by itself, this would lead to an explosion of the number of threads, and each thread would have too few computations to perform.
We denote the SET with bound set to n as the unbounded SET (USET). At any time, there exists only one USET, which might be stored in the thread list L. As soon as a USET has the chance to find a new minimum within the current subtree (lines 5–6), its bound is set to the current t value. Thereby, it is transformed into a SET, and the most recently created SET becomes the USET.

4.3 Improvements
We presented a first solution for the parallelization of the ENUM algorithm, providing a runtime speed-up by a divide-and-conquer technique: we distribute subtrees to several processor units to search for the minimum. Our improvements deal with the creation of SETs and result in significantly shorter running times. Recall the definitions of Sub-ENUM thread (SET) and unbounded Sub-ENUM thread (USET). From now on we call a node where a new SET can be created a candidate. Note that a candidate can only be found in a USET. The following paragraphs show worst cases of the presented parallel ENUM algorithm and present possible solutions to overcome the existing drawbacks.

Threads within Threads. Our parallel ENUM algorithm allows new SETs to be created only by a USET. This decision avoids the immense overhead that would arise from permitting the creation of new SETs by any SET. However, if a USET creates a new SET at a node of depth 1, then this new SET is executed by a single processor sequentially; note that this SET solves the SVP in dimension n − 1. It turns out that, when the depth of the currently analyzed node is sufficiently far away from the depth t of the starting node, the creation of a new SET is advantageous with respect to the overall running time and the number of simultaneously occupied processors. Therefore, we introduce a bound s_deep which expresses what we consider to be sufficiently far away: if a SET that is not a USET visits a node with depth k fulfilling k − t ≥ s_deep, where t stands for the depth of the starting node, then this SET is permitted to create a new SET once.

Thread Bound. Although we avoid the execution of SETs where the dimension of the subtree is too big, we are still able to optimize the parallel ENUM algorithm by considering execution bounds. We achieve additional performance improvements by the following idea.
Instead of generating SETs at each possible candidate, we consider the depth of the node. This enables us to avoid big subtrees for new SETs by introducing an upper bound s_up on the distance of a node to the root for it to become a candidate. If ENUM visits a node with depth t fulfilling n − t > s_up and this node is a candidate, we no longer make its subtree ready for a new SET; rather, we prefer to behave in that situation like the serial ENUM algorithm. Good choices for the bounds s_deep and s_up are evaluated in Section 5.
5 Experiments
We performed numerous experiments to test our parallel enumeration algorithm. We created 5 different random lattices in the sense of Goldstein and Mayer [6] for each dimension n ∈ {42, ..., 56}. The bitsize of the entries of the basis matrices was in the order of magnitude of 10n. We started with bases in Hermite normal form and then LLL-reduced the bases (using LLL parameter δ = 0.99). The experiments were performed on a compute server equipped with four AMD Opteron (2.3 GHz) quad-core processors. We compare our results to the highly optimized, serial version of fpLLL in version 3.0.12, the fastest ENUM implementation known, on the same platform. The programs were compiled using gcc version 4.3.2. For handling parallel processes, we used the Boost.Thread sublibrary in version 1.40. Our C++ implementation uses double precision to store the Gram-Schmidt coefficients μ_{i,j} and the ‖b*_1‖², ..., ‖b*_n‖². Due to [12], this is suitable up to dimension 90, which seems to be out of the range of today's enumeration algorithms. We tested the parallel ENUM algorithm for several s_deep values and concluded that s_deep = 25/36 · (n − t) seems to be a good choice, where t is the depth of the starting node in a SET instance. Further, we use s_up = 5/6 · n.
Fig. 2. Average runtimes of enumeration of 5 random lattices in each dimension, comparing our multi-core implementation to fpLLL’s and our own single-core version
Fig. 3. Occupancy of the cores. The x-axis marks the percentage of the complete runtime, the y-axis shows the average occupancy of all CPU cores over 5 lattices.
Table 1. Average time in seconds for enumeration of lattices in dimension n
n              42     44     46     48     50     52      54       56
1 core        3.81   27.7   37.6    241    484   3974   10900   223679
4 cores       0.99    7.2    8.8     55    107    976    2727    56947
8 cores       0.62    4.0    4.8     28     56    504    1390    28813
16 cores      0.52    2.6    3.5     18     36    280     794    16583
fpLLL 1 core  3.32   23.7   29.7    184    367   3274    9116   184730
Fig. 4. Average speed-up of parallel ENUM compared to our single-core version (left) and compared to fpLLL single-core version (right)
Table 1 and Figure 2 present the experimental results that compare our parallel version to our serial algorithm and to the fpLLL library. We only present the timings, as the output of the algorithms is in all cases the same, namely a shortest non-zero vector of the input lattice. The corresponding speed-ups are shown in Figure 4. To show the strength of parallelization of the lattice enumeration, we first compare our multi-core versions to our single-core version. The best speed-ups are 4.5 (n = 50) for 4 cores, 8.6 (n = 50) for 8 cores, and 14.2 (n = 52) for 16 cores. This shows that, using s processor cores, we sometimes gain speed-ups of more than s, which corresponds to an efficiency of more than 1. This is very untypical behavior for (standard) parallel algorithms, but understandable for graph search algorithms such as our lattice enumeration; it is caused by the extra communication for the write-back of the current minimum A. The highly optimized enumeration of fpLLL is around 10% faster than our serial version. Compared to the fpLLL algorithm, we gain a speed-up of up to 6.6 (n = 48) using 8 CPU cores and up to 11.7 (n = 52) using 16 cores. This corresponds to an efficiency of 0.825 (8 cores) and 0.73 (16 cores), respectively. Figure 3 shows the average, the maximum, and the minimum occupancy of all CPU cores during the runtime for 5 lattices in dimension n = 52. The average occupancy of more than 90% points out that all cores are nearly optimally loaded; even the minimum load values are around 80%. These facts show the well-balanced behaviour of our parallel algorithm.
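As a sanity check, the quoted speed-ups can be recomputed directly from the timings in Table 1, e.g. 3974 s / 280 s ≈ 14.2 for 16 cores at n = 52:

```cpp
#include <cassert>
#include <cmath>

// Speed-up as defined in Section 2: serial runtime / parallel runtime.
double speed_up(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}
```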
6 Conclusion and Further Work
In this paper we have presented a parallel version of the most common algorithm for solving the shortest vector problem in lattices, the ENUM algorithm. We have shown that a huge speed-up and a high efficiency are reachable using multi-core processors. As parallel versions of LLL are already known, with our parallel ENUM we have given evidence that both parts of the BKZ reduction algorithm can be parallelized. It remains to combine both, parallel LLL and parallel ENUM, into a parallel version of BKZ. Our experience with BKZ shows that
for higher blocksizes of ≈ 50, ENUM takes more than 99% of the complete runtime. Therefore, the speed-up of ENUM will directly speed up BKZ reduction, which in turn influences the security of lattice-based cryptosystems. Furthermore, to enhance scalability further, an extension of our algorithm to parallel systems with multiple multi-core nodes is considered as future work.
Acknowledgments
We thank Jens Hermans, Richard Lindner, Markus Rückert, and Damien Stehlé for helpful discussions and their valuable comments. We thank Michael Zohner for performing parts of the experiments. We thank the anonymous reviewers for their comments.
References
1. Ajtai, M., Kumar, R., Sivakumar, D.: A sieve algorithm for the shortest lattice vector problem. In: STOC 2001, pp. 601–610. ACM, New York (2001)
2. Backes, W., Wetzel, S.: Parallel lattice basis reduction using a multi-threaded Schnorr-Euchner LLL algorithm. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 960–973. Springer, Heidelberg (2009)
3. Fincke, U., Pohst, M.: A procedure for determining algebraic integers of given norm. In: van Hulzen, J.A. (ed.) ISSAC 1983 and EUROCAL 1983. LNCS, vol. 162, pp. 194–202. Springer, Heidelberg (1983)
4. Gama, N., Nguyen, P.Q.: Predicting lattice reduction. In: Smart, N.P. (ed.) EUROCRYPT 2008. LNCS, vol. 4965, pp. 31–51. Springer, Heidelberg (2008)
5. Gama, N., Nguyen, P.Q., Regev, O.: Lattice enumeration using extreme pruning. To appear in Eurocrypt 2010 (2010)
6. Goldstein, D., Mayer, A.: On the equidistribution of Hecke points. Forum Mathematicum 15(2), 165–189 (2003)
7. Hermans, J., Schneider, M., Buchmann, J., Vercauteren, F., Preneel, B.: Parallel shortest lattice vector enumeration on graphics cards. In: Bernstein, D.J., Lange, T. (eds.) AFRICACRYPT 2010. LNCS, vol. 6055, pp. 52–68. Springer, Heidelberg (2010)
8. Kannan, R.: Improved algorithms for integer programming and related lattice problems. In: STOC 1983, pp. 193–206. ACM, New York (1983)
9. Lenstra, A., Lenstra, H., Lovász, L.: Factoring polynomials with rational coefficients. Mathematische Annalen 261, 515–534 (1982)
10. Micciancio, D., Voulgaris, P.: Faster exponential time algorithms for the shortest vector problem. In: SODA 2010 (2010)
11. Pujol, X.: Recherche efficace de vecteur court dans un réseau euclidien. Master's thesis, ENS Lyon (2008)
12. Pujol, X., Stehlé, D.: Rigorous and efficient short lattice vectors enumeration. In: Pieprzyk, J. (ed.) ASIACRYPT 2008. LNCS, vol. 5350, pp. 390–405. Springer, Heidelberg (2008)
13. Pujol, X., Stehlé, D.: Accelerating lattice reduction with FPGAs. To appear in Latincrypt 2010 (2010)
14. Schnorr, C.P., Euchner, M.: Lattice basis reduction: Improved practical algorithms and solving subset sum problems. Mathematical Programming 66, 181–199 (1994)
15. Villard, G.: Parallel lattice basis reduction. In: ISSAC 1992, pp. 269–277. ACM, New York (1992)
A Parallel GPU Algorithm for Mutual Information Based 3D Nonrigid Image Registration

Vaibhav Saxena¹, Jonathan Rohrer², and Leiguang Gong³

¹ IBM Research - India, New Delhi 110070, India, [email protected]
² IBM Research - Zurich, 8803 Rüschlikon, Switzerland, [email protected]
³ IBM T.J. Watson Research Center, NY 10598, USA, [email protected]

Abstract. Many applications in biomedical image analysis require alignment or fusion of images acquired with different devices or at different times. Image registration geometrically aligns images, allowing their fusion. Nonrigid techniques are usually required when the images contain anatomical structures of soft tissue. Nonrigid registration algorithms are very time consuming and can take hours to align a pair of 3D medical images on commodity workstation PCs. In this paper, we present the parallel design and implementation of 3D nonrigid image registration for graphics processing units (GPUs). Existing GPU-based registration implementations are mainly limited to intra-modality registration problems. Our algorithm uses mutual information as the similarity metric and can process images of different modalities. The proposed design takes advantage of the highly parallel, multi-threaded architecture of the GPU, containing a large number of processing cores. The paper presents optimization techniques to effectively utilize the high memory bandwidth provided by the GPU using on-chip shared memory and co-operative memory updates by multiple threads. Our optimized GPU implementation showed an average performance of 2.46 microseconds per voxel and achieved a factor of 28 speedup over a CPU-based serial implementation. This improves the usability of nonrigid registration for some real-world clinical applications and enables new ones, especially within intra-operative scenarios, where strict timing constraints apply.
1 Introduction
Image registration is a key computational tool in medical image analysis. It is the process of aligning two images usually acquired at different times, with different imaging parameters or slightly different body positions, or using different imaging modalities, such as CT and MRI. Registration can compensate for subject motion and enables a reliable analysis of disease progression or treatment effectiveness over time. Fusion of pre- and intra-operative images can provide guidance to the surgeon. Rigid registration achieves alignment by scaling, rotation, and translation. However, most parts of the human body are soft-tissue

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 223–234, 2010. © Springer-Verlag Berlin Heidelberg 2010
V. Saxena, J. Rohrer, and L. Gong
structures, and more complex transformation models are required to align them. There is a variety of so-called nonrigid registration algorithms. A major obstacle to their widespread clinical use is the high computational cost. Nonrigid registration algorithms typically require many hours to register 3D images, prohibiting interactive use [3]. Some implementations of nonrigid registration algorithms achieve runtimes on the order of minutes [11,5]. However, they typically run on large parallel systems of more than a hundred processors. Acquisition and maintenance of such architectures is expensive, and therefore the availability of such solutions is very limited. In this paper, we present a CUDA-based implementation of a mutual information based nonrigid registration algorithm on the GPU, which achieves significant runtime acceleration and in many cases even sub-minute runtimes for multimodal registration of 3D images. Our solution provides low-cost fast nonrigid registration, which we hope will facilitate widespread clinical use. We present a data partitioning for 3D images that effectively utilizes the large number of threads supported on the GPU, and optimization techniques to achieve high memory throughput from the GPU memory spaces. The proposed implementation achieves consistent speedup across different datasets of varying dimensions. The registration algorithm is based on a B-spline transformation model [15,13]. This approach has been used successfully for the registration of a variety of anatomical structures, such as the brain, the chest [7], the heart, the liver and the breast [13]. Mutual information is the most common metric for both monomodal and multimodal registration [10]. There are fast GPU implementations of nonrigid registration algorithms, but most of them are limited to monomodal registration [8,14,2]. The only multimodality-enabled implementation we are aware of is [17], which, however, uses only 2D textures and is implemented with OpenGL and GLSL.
Another similar work, implemented on the Cell/B.E., is reported in [12]. However, the Cell/B.E. and the GPU represent two very contrasting architectures and require different optimization approaches for achieving good performance. The next section provides an overview of the registration method. Section 3 briefly describes the GPU architecture, followed by a discussion of the proposed parallel algorithm implementation in section 4. The experimental evaluation is discussed in section 5, followed by conclusions in section 6.
2 Mutual Information Based Nonrigid Image Registration
Nonrigid image registration is the process of computing a nonlinear mapping or transformation between two images (2D or 3D). One of the images is called the reference or fixed image, and the other one is called the floating or moving image. We implemented a well-known mutual information based nonrigid image registration algorithm which models the transformation using B-splines. A set of control points is overlaid on the fixed image, and a transformation can be generated by letting these control points move freely. The transformation T(x; μ) is obtained by B-spline interpolation with transformation parameters μ that are B-spline
A Parallel GPU Algorithm for Mutual Information
coefficients located at the control points. The degrees of freedom of the control points are governed by the spacing between the points. To register a fixed image f_{fix} to a moving image f_{mov}, an optimal set of transformation parameters μ is computed that best aligns the fixed and the transformed moving image (i.e. establishes similarity between them). To support the registration of images obtained from different modalities, we use the negative of mutual information as the similarity metric. A gradient descent optimizer with feedback step adjustment [6] is used to iteratively obtain the optimal set of transformation parameters (coefficients) that minimizes the similarity metric. Mathematically, T maps a point in the fixed image space with coordinates x_{fix} to the point in the moving image space with coordinates x_{mov}:

x_{mov} = T(x_{fix}; \mu)
Using the above mapping, we can obtain the warped (transformed) moving image \tilde{f}_{mov}:

\tilde{f}_{mov}(x) = f_{mov}(T(x; \mu))
An optimal set of transformation parameters μ makes the images f_{mov} and f_{fix} comparable (similar) to each other based on a similarity metric. The mutual information calculation is based on a Parzen estimation of the joint histogram p(i_{fix}, i_{mov}) of the fixed and the transformed moving image [16]. As proposed in [7], a zero-order B-spline Parzen window for the fixed image and a cubic B-spline Parzen window for the moving image are used. Together with a continuous cubic B-spline representation of the floating image, this allows the gradient of the similarity metric S to be calculated in closed form. The metric S is computed as

S(\mu) = -\sum_{\tau}\sum_{\eta} p(\tau, \eta; \mu)\, \log \frac{p(\tau, \eta; \mu)}{p_{fix}(\tau; \mu)\, p_{mov}(\eta; \mu)}

where p is the joint pdf with fixed image intensity values τ and warped moving image intensity values η, and p_{fix} and p_{mov} represent the marginal pdfs. The derivative of S with respect to a transformation parameter μ_i is the following sum over all fixed image voxels that are within the support region of μ_i:

\frac{\partial S}{\partial \mu_i} = -\alpha \sum_{x} \left. \frac{\partial p(i_{fix}, i_{mov})}{\partial i_{mov}} \right|_{i_{fix}=f_{fix}(x),\, i_{mov}=f_{mov}(T(x;\mu))} \cdot \left( \left. \frac{\partial}{\partial \xi} f_{mov}(\xi) \right|_{\xi=T(x;\mu)} \right)^{T} \cdot \frac{\partial}{\partial \mu_i} T(x; \mu)

where α is a normalization factor. We use all the voxels of the fixed image to calculate the histogram, not only a subset as in [7]. We also use a multi-resolution approach to increase robustness and speed [16]: we first register with reduced image and transformation resolutions, and then successively increase them until the desired resolution has been reached. We downsample the images by a factor of 2 in each dimension using a Gaussian filter and interpolation to obtain the lower resolutions.
3 GPU Architecture Overview
A GPU can be modeled as a set of SIMD multiprocessors (SMs), each consisting of a set of scalar processor cores (SPs). The SPs of an SM execute the same instruction simultaneously but on different data points. The GPU has a large device global memory with high bandwidth and high latency. In addition, each SM also contains a very fast, low-latency on-chip shared memory. In the CUDA programming model [9], a host program runs on the CPU and launches a kernel program to be executed on the GPU device in parallel. The kernel executes as a grid of one or more thread blocks. Each thread block is dynamically scheduled to be executed on a single SM. The threads of a thread block cooperate with each other by synchronizing their execution and efficiently sharing resources on the SM such as shared memory and registers. Threads within a thread block are executed on an SM in scheduling units of 32 threads called warps. Global memory is used most efficiently when multiple threads simultaneously access words from a contiguous aligned segment of memory, enabling the GPU hardware to coalesce these memory accesses into a single memory transaction. The Nvidia Tesla C1060 GPU used in the present work contains 4 GB of off-chip device memory and 16 KB of on-chip shared memory. The GPU supports a maximum of 512 threads per thread block.
4 Parallelization and Optimization
The main steps of the registration algorithm are outlined in Algorithm 1. The iterative gradient descent optimization part (within the first for-block) accounts for almost all the computation of the registration algorithm. It has been shown that the two (for) loops that iterate over all the fixed image voxels take more than 99% of the optimization time [12]. We therefore focus our parallelization and optimization effort on this part, following the CUDA programming model.

4.1 Parallel Execution on the GPU
We offload the two (for) loops to the GPU. Whenever the control on the host reaches one of these loops, the host calls the corresponding GPU routine. Once the GPU routine finishes, control is returned to the host for further execution. Data Partitioning and Distribution: To enable the parallel computation of the joint histogram (loop 1) and the parallel computation of the gradient (loop 2) on the GPU, we divide the images into contiguous cubic blocks before the start of the iterative gradient descent optimization, and process the fixed image blocks independently. The fixed image blocks are distributed to the CUDA thread blocks, with each thread block processing one or more image blocks. A thread block requires N/T iterations to process a fixed image block, where N is the number of voxels in the image block and T is the number of threads in a thread block. In each iteration, a thread in the thread block processes one voxel of the fixed image block, performing the operations in loop 1 or loop 2.
Algorithm 1. Image Registration Algorithm
  Construct the image pyramid
  For each level of the multi-resolution hierarchy
    Initialize the transformation coefficients
    For each iteration of the gradient descent optimizer
      Iterate over all fixed image voxels (loop 1)
        Map coordinates to moving image space
        Interpolate moving image intensity
        Update joint histogram
      Calculate joint and marginal probability density functions
      Calculate the mutual information
      Iterate over all fixed image voxels (loop 2)
        Map coordinates to moving image space
        Interpolate moving image intensity and its gradient
        Update the gradients for the coefficients in the neighborhood
      Calculate the new set of coefficients (gradient descent)
For example, for a fixed image block size of 8×8×8, we use a 3-dimensional thread block of size (8, 8, Dz). A thread block makes 8/Dz iterations to process an image block, where Dz is the number of threads in the third dimension. In the m-th (0 ≤ m < 8/Dz) iteration, a thread with index (i, j, k) processes voxel (i, j, k + m×Dz) of the image block. As described in section 3, a maximum of 512 threads is supported per thread block, and a single image block of size 8×8×8 also contains the same number (512) of voxels. This allows a thread block with 512 threads to process one fixed image block in a single iteration. Moreover, as the fixed image blocks are stored contiguously in the global memory, the threads can read consecutive fixed image values in a coalesced fashion. The reference and moving images don't change during the iterative optimization steps. Therefore we transfer these images to the GPU global memory in the beginning and never modify them throughout the optimization process. Joint Histogram Computation: The joint histogram computation in the first loop requires several threads to update (read and write) a common GPU global memory region allocated for the histogram. This would require synchronization among the threads. As described in section 2, we use a Parzen estimation of the joint histogram with a cubic B-spline window for the moving image. This requires four bins to be updated with cubic B-spline weights for each pair of fixed image and warped moving image intensity values. Two threads with the same intensity value pair will collide on all four of these bins. Moreover, there will be collisions even if the threads have different intensity value pairs but share some common bins to be updated. An atomic-update based approach would be costly in this case, and therefore we allocate a separate local buffer per CUDA thread in the GPU global memory to store its partial histogram results.
A joint histogram with 32×32 (or 64×64) bins requires 4 KB (or 16 KB) of memory for its storage; it is therefore not possible to store these per-thread partial histogram buffers in the on-chip
GPU shared memory. In the end, we reduce these buffers on the GPU to compute the final joint histogram values. Gradient Computation: Similar to the joint histogram computation, the gradient computation in the second loop also requires several threads to update (read and write) a common GPU global memory region allocated for storing gradient values. Each thread processing a voxel updates gradient values for the 192 (3×4×4×4) coefficients in the neighborhood affected by the voxel. Single-precision gradient values for 192 coefficients require 768 bytes of memory, and hence it is not possible to allocate a per-thread partial gradient buffer in the on-chip 16 KB shared memory for more than 21 threads per thread block. Therefore, we allocate a separate local buffer per CUDA thread in the GPU global memory to store its partial gradient results. In the end, we reduce these buffers on the GPU to compute the final gradient values. For both the joint histogram and the gradient computation, the final reduction of the partial buffers is performed using a binary-tree based parallel reduction approach that uses shared memory [4]. Marginal pdfs and Mutual Information Computation: The computation of the marginal pdfs for the fixed and moving images, together with the mutual information computation, is still done on the host. This computation on the host takes less than 0.02% of the total gradient descent optimization time and hence does not become a bottleneck. Performing this computation on the host requires the transformation coefficients to be sent to the GPU before performing the joint histogram computation. Once the computation is done, the computed joint histogram is transferred back to the host. Similarly, for the gradient computation, we send the modified histogram values to the GPU before the computation and transfer the computed gradient values back to the host in the end. However, this transfer of data does not become a bottleneck, as the amount of data transferred is small.
The experimental results also show that these transfers require less than 0.1% of the total time for the joint histogram and gradient computation.

4.2 Use of Look Up Table
When computing transformed fixed image voxels and interpolating the moving image at transformed points, we need to compute a weighted sum of B-spline coefficients with B-spline weights. As the cubic B-spline basis functions have only limited support, this weighted sum requires only the four B-spline coefficients located at the four neighboring control points per dimension. For the 3D case, we need to consider only the 4×4×4 points in the neighborhood of x to compute the interpolated value:

f(x) = \sum_{i,j,k=0}^{3} c_{i,j,k}\, \beta_{x,i}\, \beta_{y,j}\, \beta_{z,k}

where f is one component of the transformation function or the moving image intensity, c_{i,j,k} are B-spline coefficients and the β's are cubic B-spline weights. For different components of the transformation function, the weights remain the same but the coefficients differ. Instead of computing these weights repeatedly at runtime,
we pre-compute these weights at sub-grid points with a spacing of 1/32 of the voxel size and store the computed values in a lookup table. To enable the use of the lookup table, we constrain the control point spacing to be an integral multiple of the voxel spacing. Similarly, for image interpolation, we round down the point coordinates to the nearest sub-grid point. We compute the lookup table on the host and transfer it once to the GPU device memory, along with the reference and moving images, before the start of the optimization process. The lookup table is not modified afterward and remains constant. For 32 sub-grid points in one voxel width, we only require 512 bytes of memory (four single-precision B-spline weights per sub-grid point). On the GPU, we store the lookup table in the on-chip shared memory to avoid accessing the high-latency global memory each time.
4.3 Optimizations for Transformation Coefficients
As explained previously, each fixed image voxel requires 3×4×4×4 (4×4×4 per dimension) transformation coefficients in the neighborhood for transforming its coordinates to moving image space. However, if the spacing between coefficient grid points is an integral multiple of the fixed image block size, then all voxels within a fixed image block require the same set of coefficients for the transformation. Each fixed image block is transformed by a single thread block, and storing 3×4×4×4 coefficients only requires 768 bytes in single precision. Therefore, before processing an image block, all 192 coefficients required by this block are loaded into the on-chip shared memory for faster access. There is no need for any synchronization in this case, as the threads only read the coefficients.
4.4 Memory Coalescing for the Gradient Computation
As described in section 4.1, for the joint histogram and gradient computation we allocate separate per-thread buffers in global memory to store partial results, and these buffers are reduced in the end to obtain the final values. For the gradient computation, each voxel updates its local gradient buffer for its neighboring 4×4×4 coefficients per dimension. At any given time, each thread of a warp updates the gradient buffer entry corresponding to the same coefficient in its own local gradient buffer. The gradient buffer entries for the threads can be organized in either Array of Structures (AOS) or Structure of Arrays (SOA) form. In the AOS form, the gradient entries for a thread are stored contiguously in its own distinct buffer. In the SOA form, the local buffers of different threads are interleaved so that the gradient entries of different threads corresponding to the same coefficient are stored contiguously in global memory. In the AOS form, updates by the threads result in non-coalesced global memory accesses for reads and writes. To avoid this, we use the SOA form for the gradient computation. In this form, threads update consecutive memory locations in global memory, resulting in coalesced memory access. In the end, we reduce the per-coefficient values from all the threads to compute the final gradient values. We evaluate the performance of these two forms in section 5.
5 Experimental Results
The parallel version of the code was run on an Nvidia Tesla C1060 GPU. The Tesla C1060 is organized as a set of 30 SMs, each containing 8 SPs, for a total of 240 scalar cores. The scalar cores run at a frequency of 1.3 GHz. The GPU has 4 GB of off-chip device memory with a peak memory bandwidth of 102 GB/s. It has 16 KB of on-chip shared memory and 16K registers available per SM. The CUDA SDK 2.3 and the NVCC 0.2.1221 compiler were used to compile the code. The GPU host system has an Intel Xeon 3.0 GHz processor with 3 GB of main memory. The GPU has a PCIe x16 link to the host, providing a peak bandwidth of 4 GB/s in a single direction. The serial version of the code ran on one core of an Intel Xeon processor running at 2.33 GHz with 2 GB of memory. The code was compiled with the GCC 4.1.1 compiler (with -O2). For both systems, we measured the runtime of the most computationally expensive part, the multi-resolution iterative optimization process, for single-precision data. On the GPU system, it is assumed that the reference and moving images have already been transferred to the GPU global memory.

5.1 Performance Results and Discussion
Runtime. We performed registrations of 22 different sets (pairs) of CT abdominal images of different sizes, and measured the average registration time per voxel. The images were partitioned into cubic blocks of size 8×8×8. A three-level multi-resolution pyramid was used, with a B-spline grid spacing of 16×16×16 voxels at the finest level. The gradient descent optimizer was set to perform a fixed number of 30 iterations at each pyramid level. For the purpose of comparison, we measured the per-voxel time for the finest pyramid level only. The sequential version required 69.02 (±16) μs/voxel. In contrast, the GPU version required 2.46 (±0.33) μs/voxel, a factor of 28 speedup over the serial version. For example, the sequential version required 1736.65 seconds for an image of size 512×512×98 for 30 iterations at the finest level, whereas the GPU version required only 67.20 seconds. Note that the above time does not include the time for operations that are performed for each pyramid level before starting the iterations of the gradient descent optimizer, e.g. allocating and transferring the fixed and moving images on the GPU. To include these operations in the performance measurement as well, we compared the total registration application execution times of the serial and parallel versions for all 22 datasets. The GPU-based parallel version showed a speedup between 18x and 26x, with an average speedup of 22.3 (±2.7) over the serial version for total execution time. The good speedup suggests that the time-consuming part of the application was successfully offloaded to the GPU. Scalability. We measured the scalability of the GPU implementation with different numbers of threads per block and numbers of thread blocks. Figure 1(a) shows the scalability on the GPU for an image of size 512×512×98 with different numbers of thread blocks, up to 180 thread blocks. The performance with a single thread block is taken as a speedup of factor one. The number of
Fig. 1. Scaling on GPU with different numbers of threads and thread blocks: (a) scaling with thread blocks; (b) scaling with threads per thread block.
threads per thread block has been fixed at 128 for this experiment. As shown in the figure, we see good scalability with an increasing number of thread blocks, with a factor 34 speedup at 60 thread blocks over the single thread block performance. There is no performance improvement beyond 60 thread blocks. As described previously, we use per-thread local buffers for computing the joint histogram and gradient values. These buffers need to be initialized to zero before performing the computation and need to be reduced in the end. This initialization and reduction time increases with an increasing number of threads in the kernel grid and offsets any slight improvement in the actual computation time. For example, with the 512×512×98 image dataset, the reduction time for the joint histogram computation (and gradient computation) increases from about 1.26% to 3.42% (and about 1.1% to 2.66%) of the total histogram (and gradient) computation time when increasing the number of thread blocks from 60 to 180. Figure 1(b) shows the scaling for an image of size 512×512×98 with different numbers of threads per thread block for 30, 60 and 90 thread blocks. We could not use 512 threads per thread block due to register overflow. Also, the implementation requires a minimum of 64 threads for the 8×8×8 cubic image block size. We observe the best performance with 60 thread blocks and 128 threads. Although we show the scaling for only three sets of thread blocks, more threads per thread block seem to provide better performance in general, as expected. Also, the performance difference between different numbers of threads decreases with an increasing number of thread blocks, as seen by comparing the performance variation for different thread counts at 30 thread blocks with that at 90. Optimizations. We compared the performance of the AOS and SOA forms for the gradient computation discussed in section 4.4. Table 1 shows the speedup of the SOA form over the AOS form.
The SOA form, enabling coalesced memory accesses, provides a significant performance improvement over AOS.

5.2 Validation of Registration Results
In this experiment, we validate the results of image registration for the mono- and multi-modal cases. We used the BrainWeb [1] simulated MRI volumes I_T1 and
Table 1. Speedup with SOA over AOS for gradient computation

  Algorithm Component                                        Speedup
  Per iteration of gradient descent optimizer                7-8x
  Total gradient computation (loop 2) time
  (including data transfer and reduction)                    10-12x
  Gradient computation only (excluding reduction)            10-11x

Table 2. Similarity metric values for mono and multi-modal case

  Modality            Mono                     Multi
  Registered Images   T1 to T1    T1D to T1    T1 to T2    T1D to T2
  Serial              -1.9443     -1.8914      -1.3975     -1.3743
  GPU                 -1.8246     -1.8426      -1.3452     -1.3590
I_T2 (181×217×181 voxels, isotropic voxel spacing of 1 mm, sample slices shown in figures 3(a) and 3(b)); the T1 and T2 volumes are aligned. We deformed I_T1 with an artificially generated transformation function T_D based on randomly placed Gaussian blobs to obtain I_T1D. We then registered I_T1D to I_T1 (mono-modal case) and I_T1D to I_T2 (multi-modal case) using both the serial and GPU implementations, and measured the mutual information (MI) values before and after the registration. To verify the final MI values after the registration, we also compared them with the MI values of the perfectly aligned I_T1 & I_T1 (mono-modal) and I_T1 & I_T2 (multi-modal). We used three levels of the multi-resolution pyramid, and 30 iterations were carried out per pyramid level. Table 2 shows the final values of the similarity metric (negative of MI) after the registration, along with the metric values for perfectly aligned images. Mono-Modal Case: The similarity metric for the perfectly aligned I_T1 and I_T1 is -1.9443 (computed using the serial version). The final metric values after registering I_T1D to I_T1 are -1.8914 and -1.8426 using the serial and GPU implementations, respectively. The difference in the metric values on the two platforms can be attributed to differences in floating point arithmetic and operation ordering. The difference in the metric values for perfectly aligned images on the two platforms is similar. Visual inspection of the images before and after the registration confirmed that the registered image aligns well with the original image, and there is no difference between the images registered using the serial and GPU implementations. Figure 2 shows an example visual comparison of the GPU registration accuracy. Multi-Modal Case: In this case, we registered I_T1D to I_T2 to obtain the registered image I_T1R. We then compared I_T1R to the already known solution volume I_T1 to verify the registration result. The similarity metric for the perfectly aligned I_T1 and I_T2 is -1.3975.
The final metric values after registering I_T1D to I_T2 are -1.3743 and -1.3590 using the serial and GPU implementations, respectively. For visual comparison, we cannot directly merge the registered image I_T1R with I_T2 as done in the mono-modal case. Therefore, we color-merged I_T1R and I_T1 for the purpose of validation. In case of correct registration, the color-merged image should not have any colored regions. Figure 3 shows the visual comparison of the GPU registration accuracy in the case of multi-modal registration.
Fig. 2. Visual comparison of the GPU registration accuracy in mono-modal case. Pairs of grayscale images are color-merged (one image in the green channel, the other in the red and the blue channel); areas with alignment errors appear colored. (a) shows the misalignment of the original (green) and the artificially deformed image. (b) shows that after registration on the GPU only minor registration errors are visible (original image in green). In (c), no difference between the image registered on the GPU and the Xeon (green) is visible.
Fig. 3. Multi-modal case. (a) T1 image slice. (b) T2 image slice. (c) color-merged image of T1 (in green channel) and the registered image T1R on the GPU (in red and blue channels). Only minor registration errors are visible. (d) overlay of registered image T1R (red colored) on top of T2 (green colored) using 50% transparency.
6 Summary and Conclusions
We discussed in this paper a GPU-based implementation of mutual information based nonrigid registration. Our preliminary experimental results with the GPU implementation showed an average performance of 2.46 microseconds per voxel, a 28x speedup over a serial version. For a pair of images of 512×512×24 pixels, the registration takes about 17.1 seconds to complete. Our GPU performance also compares well with other high-performance platforms, although it is difficult to make a perfectly fair comparison due to differences in implemented algorithms and experimental setups. A parallel implementation of a mutual information based
nonrigid registration algorithm presented in [11] used 64 CPUs of a supercomputer, reporting a speedup factor of up to 40 compared to a single CPU and a mean execution time of 18.05 microseconds per voxel. The proposed GPU-based nonrigid registration provides a cost-effective alternative to existing implementations based on other, more expensive parallel platforms. Future work will include a more systematic comparative evaluation of our GPU-based implementation versus others based on different multicore platforms.
References

1. Collins, D.L., Zijdenbos, A.P., Kollokian, V., Sled, J.G., Kabani, N.J., Holmes, C.J., Evans, A.C.: Design and construction of a realistic digital brain phantom. IEEE Trans. Med. Imaging 17(3), 463–468 (1998)
2. Courty, N., Hellier, P.: Accelerating 3D non-rigid registration using graphics hardware. International Journal of Image and Graphics 8(1), 81–98 (2008)
3. Crum, W.R., Hartkens, T., Hill, D.L.G.: Non-rigid image registration: theory and practice. Br. J. Radiol. 77(2), 140–153 (2004)
4. Harris, M.: Optimizing parallel reduction in CUDA (2007), http://www.nvidia.com/object/cuda_sample_advanced_topics.html
5. Ino, F., Tanaka, Y., Hagihara, K., Kitaoka, H.: Performance study of nonrigid registration algorithm for investigating lung disease on clusters. In: Proc. PDCAT, pp. 820–825 (2005)
6. Kybic, J., Unser, M.: Fast parametric elastic image registration. IEEE Transactions on Image Processing 12(11), 1427–1442 (2003)
7. Mattes, D., Haynor, D., Vesselle, H., Lewellen, T., Eubank, W.: PET-CT image registration in the chest using free-form deformations. IEEE Trans. Med. Imag. 22(1), 120–128 (2003)
8. Muyan-Ozcelik, P., Owens, J.D., Xia, J., Samant, S.S.: Fast deformable registration on the GPU: A CUDA implementation of demons. In: ICCSA, pp. 223–233 (2008)
9. Nvidia CUDA Programming Guide 2.3, http://www.nvidia.com/object/cuda_get.html
10. Pluim, J., Maintz, J., Viergever, M.: Mutual information based registration of medical images: a survey. IEEE Trans. Med. Imaging 22(8), 986–1004 (2003)
11. Rohlfing, T., Maurer Jr., C.R.: Nonrigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees. IEEE Trans. Inf. Technol. Biomed. 7(1), 16–25 (2003)
12. Rohrer, J., Gong, L., Székely, G.: Parallel mutual information based 3D non-rigid registration on a multi-core platform. In: HP-MICCAI workshop in conjunction with MICCAI (2008)
13. Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transactions on Medical Imaging 18(8), 712–721 (1999)
14. Sharp, G., Kandasamy, N., Singh, H., Folkert, M.: GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration. Phys. Med. Biol. 52(19), 5771–5783 (2007)
15. Szeliski, R., Coughlan, J.: Spline-based image registration. Int. J. Comput. Vision 22(3), 199–218 (1997)
16. Thevenaz, P., Unser, M.: Spline pyramids for inter-modal image registration using mutual information. In: Proc. SPIE, vol. 3169, pp. 236–247 (1997)
17. Vetter, C., Guetter, C., Xu, C., Westermann, R.: Non-rigid multi-modal registration on the GPU. In: Proc. SPIE, vol. 6512 (2007)
Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations

Everton Hermann1, Bruno Raffin1, François Faure2, Thierry Gautier1, and Jérémie Allard1

1 INRIA
2 Grenoble University
Abstract. Today, it is possible to combine multiple CPUs and multiple GPUs in a single shared memory architecture. Using these resources efficiently and seamlessly is a challenging issue. In this paper, we propose a parallelization scheme for dynamically balancing the work load between multiple CPUs and GPUs. Most tasks have both a CPU and a GPU implementation, so they can be executed on any processing unit. We rely on a two-level scheduling scheme associating a traditional task graph partitioning with work stealing guided by processor affinity and heterogeneity. These criteria are intended to limit inefficient task migrations between GPUs, the cost of memory transfers being high, and to favor mapping small tasks on CPUs and large ones on GPUs to take advantage of heterogeneity. This scheme has been implemented to support the SOFA physics simulation engine. Experiments show that we can reach speedups of 22 with 4 GPUs and 29 with 4 CPU cores and 4 GPUs. The CPUs unload the GPUs from small tasks, making the GPUs more efficient and leading to a "cooperative speedup" greater than the sum of the speedups separately obtained on 4 GPUs and 4 CPUs.
1 Introduction
Interactive physics simulations are a key component of realistic virtual environments. However, the amount of computation as well as the code complexity grows quickly with the variety, number and size of the simulated objects. The emergence of machines with many tightly coupled computing units raises expectations for interactive physics simulations of a complexity that has never been achieved so far. These architectures usually mix standard generic processor cores (CPUs) with specialized ones (GPUs). The difficulty is then to take advantage of these architectures efficiently. Several parallelization approaches have been proposed, but they usually focus on one aspect of the physics pipeline or target only homogeneous platforms (GPU or multiple CPUs). Object-level parallelizations usually try to identify non-colliding groups of objects to be mapped on different processors. Fine-grain parallelizations on a GPU achieve high speedups but require deeply revisiting the computation kernels. In this article we propose a parallelization approach that takes advantage of the multiple CPU cores and GPUs available on an SMP machine.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 235–246, 2010. © Springer-Verlag Berlin Heidelberg 2010

We rely on
the open source SOFA physics simulation library, designed to offer a high degree of flexibility and high performance execution. SOFA [1] supports various types of differential equation solvers for single objects as well as complex scenes of different kinds of interacting physical objects (rigid objects, deformable solids, fluids). The physics pipeline is classically split into two main steps: collision detection and time integration. Collision detection being performed efficiently on a single GPU [2], we focus here on the time integration step. The work load of time integration varies according to collisions: new collisions require the time integration step to compute and apply the associated new repulsion forces. We developed a multi-CPU and multi-GPU parallelization of the time integration step. A first traversal extracts a data dependency graph between tasks. It defines the control flow graph of the application, which identifies the first level of parallelism. Several tasks have a CPU implementation as well as a GPU one using CUDA [3]. This GPU code provides a second, fine-grain parallelization level. At runtime, the tasks are scheduled according to a two-level scheduling strategy. At initialisation, and every time the task graph changes (addition or removal of collisions), the task graph is partitioned with a traditional graph partitioner and the partitions are distributed to PUs (Processing Units, i.e. GPUs or CPUs). Then, work stealing is used to move partitions between PUs to correct the work imbalance that may appear as the simulation progresses. Our approach differs from the classical work stealing algorithm [4] in that our stealing strategy takes temporal and spatial locality into account. Spatial locality relies on the classical Owner Compute Rule, where tasks using the same data tend to be scheduled on the same PU.
This locality criterion is enforced during the task graph partitioning, where tasks accessing the same data are gathered in the same affinity group. Temporal locality occurs by reusing the task mapping between consecutive iterations. Thus, when starting a new time integration step, tasks are first assigned to the PU they ran on at the previous iteration. CPUs tend to be more efficient than GPUs for small tasks and vice versa. We thus associate weights to tasks, based on their execution time, that PUs use to steal tasks better suited to their capacities. Thanks to this criterion, PU heterogeneity becomes a performance improvement factor rather than a limiting one. Experiments show that a complex simulation composed of 64 colliding objects, totaling more than 400k FEM elements, is simulated on 8 GPUs in 0.082s per iteration instead of 3.82s on one CPU. A heterogeneous scene with both complex and simple objects can efficiently exploit all resources of a machine with 4 GPUs and 8 CPU cores to compute the time integration step 29 times faster than with a single CPU. The addition of the 4 CPU cores not dedicated to the GPUs actually increases the simulation performance by 30%, significantly more than the 5% improvement expected, because the CPUs unload the GPUs from small tasks, making these GPUs more efficient. After discussing related work (Sec. 2) and giving a quick overview of the physics simulator (Sec. 3), we focus on multi-GPU support (Sec. 4) and scheduling (Sec. 5). Experimental results (Sec. 6) are detailed before we conclude.
2 Related Works
We first review work related to the parallelization of physics engines before focusing on approaches for scheduling tasks on multiple GPUs or between the CPU and the GPU. Some approaches propose cluster-based parallelizations for interactive physics simulations, but the scalability is usually limited due to the high overhead associated with communications [5]. This overhead is more limited with shared memory multi-processor approaches [6,7,8]. GPUs drew a lot of attention for physics, first because we can expect these co-processors to be easily available on users' machines, but also because they can lead to impressive speedups [9,10,11]. The Bullet¹ and PhysX² physics engines defer solid and articulated object simulation to the GPU or the Cell, for instance. All these GPU approaches are however limited to one CPU and one GPU or co-processor. Task distribution between the processor and the co-processor is statically defined by the developer. Recent works propose a more transparent support of heterogeneous architectures mixing CPUs and GPUs. GPU codelets are either automatically extracted from an existing code or manually inserted by the programmer for more complex tasks [12,13]. StarPU supports heterogeneous scheduling on multiple CPUs and GPUs with a software cache to improve CPU/GPU memory transfers [14]. The authors experiment with various scheduling algorithms, some of which obtain "cooperative speedups" where the GPU gets support from the CPU, yielding a resulting speedup higher than the sum of the individual speedups. We also get such speedups in our experiments. A regular work stealing strategy is also tested, but the performance gain is more limited: the stealing scheme is not adapted to cope with the heterogeneity. Published experiments include tests with one GPU only. We know of two different approaches for multi-GPU dynamic load balancing.
The extension of StarSs for GPUs [15] proposes a master/helper/worker scheme, where the master inserts tasks in a task dependency graph, helpers grab a ready task when their associated GPU becomes idle, and workers are in charge of memory transfers. The master leads to a centralized list scheduling, which work stealing avoids. RenderAnts is a Reyes renderer using work stealing on multiple GPUs [16]. The authors underline the difficulty of applying work stealing to all tasks due to the overhead of data transfers. They get good performance by duplicating some computations to avoid transfers, and they keep work stealing only in one part of the Reyes pipeline. Stealing follows a recursive data splitting scheme leading to tasks of adaptive granularity. Both RenderAnts and StarSs address multi-GPU scheduling, but neither includes multiple CPUs. In this paper we address scheduling on GPUs and CPUs. We extend the Kaapi runtime [17] to better schedule tasks with data flow precedences on multiple CPUs and GPUs. Contrary to previous works, the initial work load is balanced by computing at runtime a partition of the tasks with respect to their affinity to accessed objects. Then, during the execution, the work imbalance is corrected by
¹ http://www.bulletphysics.com
² http://www.nvidia.com/physx
Fig. 1. Simulation of 64 objects, falling and colliding under gravity (panels: mechanical mesh, initial state, intermediary state, final state). Each object is a deformable body simulated using the Finite Element Method (FEM) with 3k particles.
a work stealing scheduling algorithm [4,18]. In [19], the performance of the Kaapi work stealing algorithm was proved for tasks with data flow precedences, but not in the heterogeneous context mixing CPUs and GPUs. A GPU is significantly different from a "fast" CPU, due to the limited bandwidth between the CPU and GPU memories, to the overhead of kernel launching, and to the SIMD nature of GPUs, which does not fit all algorithms. The work stealing policy needs to be adapted to take advantage of the different PU capabilities to get "cooperative speedups".
3 Physics Simulation
Physics simulations, particularly in interactive scenarios, are very challenging high performance applications. They require many different computations whose cost can vary unpredictably, depending on sudden contacts or user interactions. The amount of data involved can be large, depending on the number and complexity of the simulated objects. Data dependencies evolve during the simulation due to collisions. To remain interactive, the application should execute each iteration within a few tens of milliseconds. The physics simulation pipeline is an iterative process where a sequence of steps is executed to advance the scene forward in time. The pipeline includes a collision detection step based on geometry intersections to dynamically create or delete interactions between objects. Time integration consists in computing a new state (i.e. positions and velocity vectors), starting from the current state and integrating the forces in time. Finally, the new scene state is rendered and displayed or sent to other devices. In this paper, we focus on time integration. Interactive mechanical simulators can involve objects of different kinds (solid, articulated, soft, fluid), submitted to interaction forces. The objects are simulated independently, using their own encapsulated simulation methods (FEM, SPH, mass-springs, etc.). Interaction forces are updated at each iteration based on the current states of the objects. Our approach combines flexibility and performance, using a new efficient approach for the parallelization of strong coupling between independently
implemented objects [8]. We extend the SOFA framework [1], which we briefly summarize here. The simulated scene is split into independent sets of interacting objects. Each set is composed of objects along with their interaction forces, and monitored by an implicit differential equation solver. The objects are made of components, each of them implementing specific operations related to forces, masses, constraints, geometries and other parameters of the simulation. A collision detection pipeline creates and removes contacts based on geometry intersections. It updates the objects accordingly, so that each one can be processed independently from the others. By traversing the object sets, the simulation process generates elementary tasks to evaluate the physical model.
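The per-iteration pipeline described in this section can be outlined as follows. This is an illustrative sketch, not SOFA code; every name in it (`simulation_step`, `apply_interactions`, `integrate`) is hypothetical:

```python
# Illustrative outline of one iteration of the physics pipeline:
# collision detection, then independent time integration of each
# object set, then rendering. The paper parallelizes step two.

def simulation_step(object_sets, detect_collisions, render, dt):
    # Create or delete interactions based on geometry intersections.
    interactions = detect_collisions(object_sets)
    for s in object_sets:
        s.apply_interactions(interactions)   # e.g. new repulsion forces
        s.integrate(dt)                      # new positions and velocities
    render(object_sets)
```

Each object set would encapsulate its own solver (FEM, SPH, mass-springs), as described above.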
4 Multi-GPU Abstraction Layer
We first introduce the abstraction layer we developed to ease deploying code on multiple GPUs.

4.1 Multi-architecture Data Types
The multi-GPU implementation of standard data types intends to hide the complexity of data transfers and coherency management among multiple GPUs and CPUs. On shared memory multiprocessors, all CPUs share the same address space and data coherency is managed in hardware. In contrast, even when embedded on a single board, GPUs have their own local address spaces. We developed a DSM-like (Distributed Shared Memory) mechanism to release the programmer from the burden of moving data between a CPU and a GPU or between two GPUs. When accessing a variable, our data structure first queries the runtime environment to identify the processing unit trying to access the data. It then checks a bitmap to test whether the accessing processing unit has a valid data version. If so, it returns a memory reference that is valid in the address space of the processing unit requesting data access. If the local version is not valid, a copy from a valid version is required. This happens, for instance, when a processing unit accesses a variable for the first time, or when another processing unit has changed the data. This detection is based on dirty bits that flag the valid versions of the data on each PU. These bits are easily maintained by setting the valid flag of a PU each time the data is copied to it, and resetting all the flags but that of the current PU when the data is modified. Since direct copies between GPU memories are not supported at the CUDA level, data must first transit through CPU memory. Our layer transparently takes care of such transfers, but these transfers are clearly expensive and must be avoided as much as possible.
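The valid-flag bookkeeping described above can be modeled as follows. This is an illustrative Python sketch, not the actual implementation; class and method names are hypothetical, and real inter-GPU copies would transit through CPU memory as explained above:

```python
# Sketch of the DSM-like valid-flag scheme: each processing unit (PU)
# holds its own copy of the data; a map of valid flags tracks which
# copies are up to date.

class MultiPUBuffer:
    def __init__(self, pus, initial, home_pu):
        self.copies = {pu: None for pu in pus}
        self.valid = {pu: False for pu in pus}
        self.copies[home_pu] = list(initial)
        self.valid[home_pu] = True

    def _any_valid_pu(self):
        return next(pu for pu, ok in self.valid.items() if ok)

    def read(self, pu):
        if not self.valid[pu]:
            # Local copy absent or stale: copy from any valid version
            # and set this PU's valid flag.
            src = self._any_valid_pu()
            self.copies[pu] = list(self.copies[src])
            self.valid[pu] = True
        return self.copies[pu]

    def write(self, pu, data):
        self.copies[pu] = list(data)
        # Writer becomes the only holder of a valid version.
        for other in self.valid:
            self.valid[other] = (other == pu)
```

A first read on a PU triggers a copy; a write invalidates every other PU's version, exactly as the dirty-bit rules above prescribe.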
4.2 Transparent GPU Kernel Launching
The target GPU a kernel is started on is explicit in the code launching that kernel. This is constraining in our context, as our scheduler needs to reallocate a kernel to a GPU different from the one it was supposed to run on, without having to modify the code. We reimplemented part of the CUDA Runtime API. The code is compiled and linked as usually done in a single-GPU scenario. At execution time, our implementation of the CUDA API is loaded and intercepts the calls to the standard CUDA API. When a CUDA kernel is launched, our library queries the runtime environment to determine the target GPU. The execution context is then retargeted to a different GPU if necessary and the kernel is launched. Once the kernel finishes, the execution context is released, so that other threads can access it.

Fig. 2. Multi-implementation task definition. Top: task signature. Left: CPU implementation. Right: GPU implementation.

4.3 Architecture Specific Task Implementations
One of the objectives of our framework is to seamlessly execute a task on a CPU or a GPU. This requires an interface hiding the task implementation, which differs greatly depending on whether it targets a CPU or a GPU. We provide a high-level interface for architecture-specific task implementations (Fig. 2). First, a task is associated with a signature that must be respected by all implementations. This signature includes the task parameters and their access mode (read or write). This information is later used to compute the data dependencies between tasks. Each CPU or GPU implementation of a given task is encapsulated in a functor object. There is thus a clear separation between a task definition and its various architecture-specific implementations. At least one implementation is expected to be provided. If an implementation is missing, the task scheduler simply reduces the range of possible target architectures to the supported subset.
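The separation between a task signature and its architecture-specific functors can be sketched as follows. This is an illustrative Python model; the actual interface is C++ (Fig. 2) and differs in detail, and all names here are assumptions:

```python
# Sketch of the task interface: one signature shared by all
# implementations, plus per-architecture functors registered under it.

class TaskSignature:
    def __init__(self, name, params):
        # params: list of (name, access_mode), mode "r" or "w";
        # later used to derive data dependencies between tasks.
        self.name = name
        self.params = params

class Task:
    def __init__(self, signature):
        self.signature = signature
        self.impls = {}              # "cpu" / "gpu" -> functor

    def register(self, arch, functor):
        self.impls[arch] = functor

    def supported_archs(self):
        # The scheduler restricts candidate PUs to this subset.
        return set(self.impls)

    def run(self, arch, *args):
        return self.impls[arch](*args)

accumulate = Task(TaskSignature("accumulate_forces", [("f", "r"), ("out", "w")]))
accumulate.register("cpu", lambda f, out: out.extend(f))   # plain CPU loop
# A real GPU functor would launch a CUDA kernel; same behavior here:
accumulate.register("gpu", lambda f, out: out.extend(f))
```

If only the CPU functor were registered, `supported_archs()` would shrink to `{"cpu"}` and the scheduler would never map the task to a GPU.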
5 Scheduling on Multi-GPUs
We mix two approaches for task scheduling. We first rely on a task partitioning that is executed every time the task graph changes, i.e. when new collisions or user interactions appear or disappear. Between two partitionings, work stealing is used to reduce the load imbalance that may result from work load variations due to the dynamic behavior of the simulation.
5.1 Partitioning and Task Mapping
As partitioning is executed at runtime, it is important to keep its cost as low as possible. The task graph is simply partitioned by creating one partition per physical object. Interaction tasks, i.e. tasks that access two objects, are mapped to one of these objects' partitions. Then, using METIS or SCOTCH, we compute a mapping of the partitions that tries to minimize communications between PUs. Each time the task graph changes due to the addition or removal of interactions between objects (new collisions or new user interactions), the partitioning is recomputed. Gathering all tasks that share the same physical object into the same partition increases the affinity between these tasks. This significantly reduces memory transfers and improves performance, especially on GPUs where these transfers are costly. A physics simulation also shows a high level of temporal locality, i.e. the changes from one iteration to the next are usually limited. Work stealing can move partitions to reduce load imbalance, and these movements have a good chance of remaining relevant for the next iteration. Thus, if no new partitioning is required, each PU simply starts with the partitions it executed during the previous iteration.
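The partitioning rule just described can be sketched as follows. This is an illustrative Python fragment; the names and the tie-breaking choice for interaction tasks are assumptions, and the METIS/SCOTCH mapping of partitions to PUs is omitted:

```python
# Sketch of the partitioning rule: one partition per physical object;
# an interaction task (which accesses two objects) is folded into the
# partition of one of them.

def build_partitions(tasks):
    """tasks: list of (task_name, objects_accessed) tuples."""
    partitions = {}
    for name, objs in tasks:
        # Arbitrarily assign an interaction task to the first
        # object's partition; a single-object task lands on its owner.
        owner = objs[0]
        partitions.setdefault(owner, []).append(name)
    return partitions
```

Each resulting partition (one per object, including that object's interaction tasks) is then the unit that the graph partitioner maps to PUs and that work stealing later moves around.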
5.2 Dynamic Load Balancing
At the beginning of a new iteration, each processing unit has a queue of partitions (ordered lists of tasks) to execute. The execution is then scheduled by the Kaapi [19] work stealing algorithm. During the execution, a PU first searches its local queue for partitions ready to execute. A partition is ready if and only if all its read-mode arguments have already been produced. If there is no ready partition in the local queue, the PU is considered idle and tries to steal work from another PU selected at random. To improve performance, we need to guide steals so as to favor gathering interacting objects on the same processing unit. We use an affinity list of PUs attached to each partition: a partition owned by a given PU has another, distant PU in its affinity list if and only if this PU holds at least one task that interacts with the target partition. A PU steals a partition only if this PU is in the affinity list of the partition. We update the affinity list with respect to the PU that executes the tasks of the partition. Unlike the locality-guided work stealing in [20], this affinity control is only employed once the first task of the partition has been executed. Before that, any processor can steal the partition. As we will see in the experiments, this combination of initial partitioning and locality-guided work stealing significantly improves data locality and thus performance.
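A minimal sketch of the affinity-guided steal, assuming a simple dictionary representation of partitions (all names here are illustrative, not Kaapi's API):

```python
# Sketch of the affinity rule: a thief may steal a ready partition
# only if it appears in the partition's affinity list, i.e. it holds
# at least one task interacting with that partition. Until the
# partition's first task has run, any PU may steal it.

def can_steal(thief, partition):
    if not partition["started"]:
        return True                      # no execution history yet
    return thief in partition["affinity"]

def try_steal(thief, victim_queue):
    for p in victim_queue:
        if p["ready"] and can_steal(thief, p):
            victim_queue.remove(p)
            return p
    return None                          # thief picks another victim at random
```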
5.3 Harnessing Multiple GPUs and CPUs
Fig. 3. Speedup per iteration when simulating 64 deformable objects falling under gravity (Fig. 1), using up to 8 GPUs

Our target platforms have multiple CPUs and GPUs: the time to perform a task depends on the PU, but also on the kind of task itself. Some tasks may perform better on a CPU, while others have shorter execution times on a GPU. Usually, GPUs are more efficient than CPUs on time-consuming tasks with a high degree of data parallelism, while CPUs generally outperform GPUs on small problems due to the cost of data transfers and the overhead of kernel launching. Following the idea of [18], we extended the work stealing policy to schedule time-consuming tasks on the fastest PUs, i.e. GPUs. Because tasks are grouped into partitions (Sec. 5.1), we apply this idea to partitions. During execution, we collect the execution time of each partition on CPUs and GPUs. The first iterations are used as a warm-up phase to obtain these execution times. Not having the best possible performance for these first iterations is acceptable, as interactive simulations usually run for several minutes. Instead of keeping a queue of ready partitions sorted by their execution times, we implement a dynamic threshold algorithm that allows better concurrent execution. Partitions with a CPUTime/GPUTime ratio below the threshold are executed on a CPU, otherwise on a GPU. When a thief PU randomly selects a victim, it checks whether this victim has a ready partition that satisfies the threshold criterion and steals it. Otherwise, the PU chooses a new victim. To avoid starving a PU for too long, the threshold is increased each time a CPU fails to steal, and decreased each time a GPU fails.
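The adaptive threshold policy can be sketched as follows. This is an illustrative Python model; the field names, the initial threshold and the adjustment step are assumptions:

```python
# Sketch of the dynamic threshold: a partition whose cpu_time/gpu_time
# ratio is below the threshold is deemed CPU-friendly. Failed steals
# drift the threshold so that no PU starves for long.

class ThresholdPolicy:
    def __init__(self, threshold=1.0, step=0.1):
        self.threshold = threshold
        self.step = step

    def suits(self, pu_kind, partition):
        ratio = partition["cpu_time"] / partition["gpu_time"]
        return ratio < self.threshold if pu_kind == "cpu" else ratio >= self.threshold

    def try_steal(self, pu_kind, victim_queue):
        for p in victim_queue:
            if p["ready"] and self.suits(pu_kind, p):
                victim_queue.remove(p)
                return p
        # Failed steal: widen the CPU share after a CPU failure,
        # shrink it after a GPU failure.
        self.threshold += self.step if pu_kind == "cpu" else -self.step
        return None
```

With this policy, small partitions drift toward CPUs and large, data-parallel ones toward GPUs, which is exactly the heterogeneity-aware behavior described above.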
6 Results
To validate our approach we used different simulation scenes, including independent objects as well as colliding and attached objects. We tested it on a quad-core Intel Nehalem 3 GHz with 4 Nvidia GeForce GTX 295 dual GPUs. Tests using 4 GPUs were performed on a dual quad-core Intel Nehalem 2.4 GHz with 2 Nvidia GeForce GTX 295 dual GPUs. The results presented are mean values over 100 executions.

6.1 Colliding Objects on Multiple GPUs
The first scene consists of 64 deformable objects falling under gravity (Fig. 3). This scene is homogeneous as all objects are composed of 3k particles and simulated using a Finite Element Method with a conjugate gradient equation solver.
Fig. 4. (a) A set of flexible bars attached to a wall (a color is associated with each block composing a bar). (b) Performance with blocks of different sizes, using different scheduling strategies.
At the beginning of the simulation, all objects are separated; then the number of collisions increases, reaching 60 pairs of colliding objects, before the objects start to separate from each other under the action of repulsion forces. The reference average sequential CPU time is 3.8s per iteration. Recall that we focus on the time integration step, and we time only this phase. Collision detection is executed in sequence with time integration on one GPU (0.04s per iteration on average for this scene). We can observe that when objects are not colliding (beginning and end of the simulation), the speedup (relative to one GPU) is close to 7 with 8 GPUs. As expected, the speedup decreases as the number of collisions increases, but we still get at least 50% efficiency (at iteration 260). During our experiments, we observed a high variance of the execution time at the iteration following the appearance of new collisions. This is due to the increased number of steals needed to adapt the load from the partitioning. The steal overhead is important as it triggers GPU-CPU-GPU memory transfers. The second scene tested is very similar; we just changed the mechanical models of the objects to get a scene composed of heterogeneous objects. Half of the 64 objects were simulated using a Finite Element Method, while the other half relied on a mass-springs model. The object sizes were also heterogeneous, ranging from 100 to 3k particles. We obtained an average speedup of 4.4, to be compared with the 5.3 obtained for the homogeneous scene (Fig. 3). This lower speedup is due to the greater difficulty of finding a well-balanced distribution given the scene heterogeneity.
6.2 Affinity Guided Work Stealing
We investigated the efficiency of our affinity-guided work stealing. We simulated 30 soft blocks grouped in 12 bars (Fig. 4(a)). These bars are set horizontally and are attached to a wall. They flex under the action of gravity. The blocks attached in a single bar interact similarly to colliding objects. We then compare the performance of this simulation while activating different scheduling strategies (Fig. 4(b)). The first scheduling strategy assigns blocks to 4 GPUs in a round-robin way. The result is a distribution with good load balance but poor data locality, since blocks of the same bar are on different GPUs. The second strategy uses a static partitioning that groups the blocks of a same bar on the same GPU. This solution has good data locality, since no data is transferred between different GPUs, but the work load is not well balanced, as the bars have different numbers of blocks. The third strategy relies on standard work stealing. It slightly outperforms the static partitioning for small blocks, as it ensures better load balancing. It also outperforms the round-robin scheduling because one GPU is slightly more loaded: it executes the OpenGL code for rendering the scene on a display. For larger objects, the cost of memory transfers during steals becomes more important, making work stealing less efficient than the two other strategies. When relying on affinity, work stealing gives the best results for both block sizes: it achieves a good load distribution while preserving data locality.

Fig. 5. Simulation performance with various combinations of CPUs and GPUs

6.3 Involving CPUs
We tested a simulation combining multiple GPUs and CPUs on a machine with 4 GPUs and 8 cores. Because GPUs are passive devices, a core is associated with each GPU to manage it. We thus have 4 cores left to compute for the simulation. The scene consists of independent objects with 512 to 3,000 particles. We then compare standard work stealing with the priority-guided work stealing (Sec. 5.3). Results (Fig. 5) show that our priority-guided work stealing always outperforms standard work stealing as soon as at least one CPU and one GPU are involved. We also get "cooperative speedups". For instance, the speedup with 4 GPUs and 4 CPUs (29) is larger than the sum of the 4-CPU (3.5) and 4-GPU (22) speedups. Processing a small object on a GPU sometimes takes as long as processing a large one. With the priority-guided work stealing, the CPUs execute the tasks that are not well suited to GPUs. The GPUs then only process the larger tasks, resulting in larger performance gains than if they also had to take care of all the smaller tasks. In contrast, standard work stealing leads to "competitive speed-downs": the simulation is slower with 4 GPUs and 4 CPUs than with only 4 GPUs. This can be explained by the fact that when a CPU takes a task that is not well adapted to its architecture, it can become the critical path of the iteration, since tasks are not preemptive.
7 Conclusion
In this paper we proposed to combine partitioning and work stealing to parallelize physics simulations on multiple GPUs and CPUs. We take advantage of spatial and temporal locality for scheduling. Temporal locality relies mainly on reusing the partition distribution between consecutive iterations. Spatial locality is enforced by guiding steals toward partitions that need to access a physical object the thief already owns. Moreover, in the heterogeneous context where both CPUs and GPUs are involved, we use a priority-guided work stealing to favor the execution of low-weight partitions on CPUs and large-weight ones on GPUs. The goal is to give each PU the partitions it executes most efficiently. Experiments confirm the benefits of these strategies. In particular, we get "cooperative speedups" when mixing CPUs and GPUs. Though we focused on physics simulations, our approach can probably be extended straightforwardly to other iterative simulations. Future work focuses on task preemption, so that CPUs and GPUs can collaborate even when only large objects are available. We also intend to spawn tasks directly to the target processor instead of using an intermediary task graph, which should reduce the runtime environment overhead. Integrating the parallelization of the collision detection and time integration steps would also remove the current synchronization point and enable global task scheduling, further improving data locality.
Acknowledgments. We would like to thank Théo Trouillon, Marie Durand and Hadrien Courtecuisse for their valuable contributions to code development.
References

1. Allard, J., Cotin, S., Faure, F., Bensoussan, P.J., Poyer, F., Duriez, C., Delingette, H., Grisoni, L.: SOFA - an open source framework for medical simulation. In: Medicine Meets Virtual Reality (MMVR'15), Long Beach, USA (2007)
2. Faure, F., Barbier, S., Allard, J., Falipou, F.: Image-based collision detection and response between arbitrary volume objects. In: Symposium on Computer Animation (SCA 2008), pp. 155–162. Eurographics, Switzerland (2008)
3. NVIDIA Corporation: NVIDIA CUDA compute unified device architecture programming guide (2007)
4. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. SIGPLAN Not. 33(5), 212–223 (1998)
5. Allard, J., Raffin, B.: Distributed physical based simulations for large VR applications. In: Virtual Reality Conference 2006, pp. 89–96 (2006)
6. Gutiérrez, E., Romero, S., Romero, L.F., Plata, O., Zapata, E.L.: Parallel techniques in irregular codes: cloth simulation as case of study. J. Parallel Distrib. Comput. 65(4), 424–436 (2005)
7. Thomaszewski, B., Pabst, S., Blochinger, W.: Parallel techniques for physically based simulation on multi-core processor architectures. Computers & Graphics 32(1), 25–40 (2008)
8. Hermann, E., Raffin, B., Faure, F.: Interactive physical simulation on multicore architectures. In: EGPGV, Munich (2009)
9. Georgii, J., Echtler, F., Westermann, R.: Interactive simulation of deformable bodies on GPUs. In: Proceedings of Simulation and Visualisation, pp. 247–258 (2005)
10. Comas, O., Taylor, Z.A., Allard, J., Ourselin, S., Cotin, S., Passenger, J.: Efficient nonlinear FEM for soft tissue modelling and its GPU implementation within the open source framework SOFA. In: Bello, F., Edwards, E. (eds.) ISBMS 2008. LNCS, vol. 5104, pp. 28–39. Springer, Heidelberg (2008)
11. Harris, M.J., Coombe, G., Scheuermann, T., Lastra, A.: Physically-based visual simulation on graphics hardware. In: HWWS 2002: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 109–118. Eurographics Association, Aire-la-Ville (2002)
12. Leung, A., Lhoták, O., Lashari, G.: Automatic parallelization for graphics processing units. In: PPPJ 2009, pp. 91–100. ACM, New York (2009)
13. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: A hybrid multi-core parallel programming environment. In: First Workshop on General Purpose Processing on Graphics Processing Units (2007)
14. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 863–874. Springer, Heidelberg (2009)
15. Ayguadé, E., Badia, R.M., Igual, F.D., Labarta, J., Mayo, R., Quintana-Ortí, E.S.: An extension of the StarSs programming model for platforms with multiple GPUs. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 851–862. Springer, Heidelberg (2009)
16. Zhou, K., Hou, Q., Ren, Z., Gong, M., Sun, X., Guo, B.: RenderAnts: interactive Reyes rendering on GPUs. ACM Trans. Graph. 28(5), 1–11 (2009)
17. Gautier, T., Besseron, X., Pigeon, L.: KAAPI: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: Parallel Symbolic Computation 2007 (PASCO 2007), London, Ontario, Canada, pp. 15–23. ACM, New York (2007)
18. Bender, M.A., Rabin, M.O.: Online scheduling of parallel programs on heterogeneous systems with applications to Cilk. Theory of Computing Systems 35(3), 289–304 (2000)
19. Gautier, T., Roch, J.L., Wagner, F.: Fine grain distributed implementation of a dataflow language with provable performances. In: PAPP Workshop, Beijing, China. IEEE, Los Alamitos (2007)
20. Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. In: SPAA, pp. 1–12. ACM, New York (2000)
Long DNA Sequence Comparison on Multicore Architectures

Friman Sánchez, Felipe Cabarcas, Alex Ramirez, and Mateo Valero
Technical University of Catalonia, Barcelona, Spain
Barcelona Supercomputing Center, BSC, Spain
Universidad de Antioquia, Colombia
fsanchez@ac.upc.es, {felipe.cabarcas,alex.ramirez,mateo.valero}@bsc.es
Abstract. Biological sequence comparison is one of the most important tasks in bioinformatics. Due to the growth of biological databases, sequence comparison is becoming an important challenge for high performance computing, especially when very long sequences are compared. The Smith-Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). In this work, we show how long sequence comparison can take advantage of current and future multicore architectures. We analyze two different SW implementations on the CellBE and use simulation tools to study the performance scalability in a multicore architecture. We study the memory organization that delivers the maximum bandwidth at the minimum cost. Our results show that a heterogeneous architecture is a valid alternative for executing challenging bioinformatic workloads.
1 Introduction
Bioinformatics is an emerging technology that is attracting the attention of computer architects, due to the important challenges it presents from the performance point of view. Sequence comparison is one of the fundamental tasks of bioinformatics and the starting point of almost all analyses that involve more complex tasks. It is basically an inference algorithm aimed at identifying similarities between sequences. The need to speed up this process is a consequence of the continuous growth of sequence length. Usually, biologists compare long DNA sequences of entire genomes (coding and non-coding regions), looking for matched regions, which indicate similar functionality or regions conserved through evolution, or for unmatched regions, showing functional differences, foreign fragments, etc. Algorithms based on dynamic programming (DP) are recognized as optimal methods for sequence comparison. The Smith-Waterman algorithm [16] (SW) is a well-known exact method to find the best local alignment between sequences. However, because the complexity of DP-based algorithms is O(nm) (n and m being the sequence lengths), comparing very long sequences becomes a challenging scenario. In such a case, it is common to obtain many optimal solutions, which P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 247–259, 2010. c Springer-Verlag Berlin Heidelberg 2010
248
F. Sánchez et al.
can be relevant from the biological point of view. The time and space requirements of the SW algorithm limit its use. As an alternative, heuristic solutions have been proposed. FASTA [13] and BLAST [4] are widely used heuristics which allow fast comparisons, but at the expense of sensitivity. For these reasons, the use of parallel architectures that can exploit the several levels of parallelism in this workload is mandatory to obtain high quality results in a reduced time. At the same time, computer architects have been moving towards the paradigm of multicore architectures, which rely on the existence of sufficient thread-level parallelism (TLP) to exploit the large number of cores. In this context, we consider the use of multicore architectures in bioinformatics to provide the computing performance this workload requires. In this paper we analyze how large-scale sequence comparisons can be performed efficiently using modern parallel multicore architectures. As a baseline we take the IBM CellBE architecture, which has proved to be an efficient alternative for highly parallel applications [11][3][14]. We study the performance scalability of this workload in terms of speedup when many processing units are used concurrently in a multicore environment. Additionally, we study the memory organization that the algorithm requires and how to overcome the memory space limitation of the architecture. Furthermore, we analyze two different synchronization strategies. This paper is organized as follows: Section 2 discusses related work on parallel alternatives for sequence comparison. Section 3 describes the SW algorithm and the parallelization strategy. Section 4 describes the baseline architecture and presents two parallel implementations of the SW algorithm on CellBE. Section 5 describes the experimental methodology. Section 6 discusses the results of our experiments. Finally, section 7 concludes the paper with a general outlook.
2 Related Work
Researchers have developed many parallel versions of the SW algorithm [7][8], each designed for a specific machine. These works are able to find a short set of optimal solutions when comparing two very long sequences. The number of optimal solutions grows exponentially with the sequence length, making this a more complex problem. The SW implementation of Azzedine et al. [6] avoids the excessive memory requirements and obtains all the best local alignments between long sequences in a reduced time. In that work, the process is divided into two stages: first, the score matrix is computed and the maximum scores and their coordinates are stored; second, with this information, part of the matrix is recomputed with the reversed sequences (smaller than the original sequences) and the best local alignments are retrieved. The important point is that the compute time of the first stage is much higher than that of the second. Despite these efforts to reduce time and space, the common feature is that the score matrix computation is required, and it remains the most time-consuming part. There is previous work on SW implementations on modern multicore architectures. Svetlin [12] describes an implementation on Nvidia's Graphics Processing Units (GPU). Sachdeva et al. [15] present results on the use of the
CellBE to compare a few short pairs of sequences that fit entirely in the Local Storage (LS) of each processor. Sánchez [10] compares SW implementations on several modern multicore architectures, like the SGI Altix, IBM Power6 and CellBE, which support multiple dimensions of parallelism (ILP, DLP and TLP). Furthermore, several FPGA and custom VLSI hardware solutions have been designed for sequence comparison [1][5]. They are able to process millions of matrix cells per second. Among those alternatives, it is worth highlighting the Kestrel processor [5], a single instruction multiple data (SIMD) parallel processor with 512 processing elements organized as a systolic array. The system originally focused on efficient high-throughput DNA and protein sequence comparison. Its designers argue that although it is a specific processor, it can be considered to be at the midpoint between dedicated hardware and general purpose hardware due to its programmability and reconfigurable architecture. Multicore architectures can deliver high performance in a wide range of applications like games, multimedia, scientific algorithms, etc. However, achieving high performance with these systems is a complex task: as the number of cores per chip and/or the number of threads per core increases, new challenges emerge in terms of power, scalability, design complexity, memory organization, bandwidth, programmability, etc. In this work we make the following contributions: - We implement SW on the CellBE. However, unlike previous works, we focus on the comparison of long sequences. We present two implementations that exploit TLP and DLP. In the first one, memory is used as a centralized data storage, because the SPE LS is too small to hold the sequences and temporary data. In the second one, each SPE stores parts of the matrix in its own LS and other SPEs synchronously read the data via DMA operations.
This requires handling data dependencies in a multicore environment, synchronization mechanisms between cores, on-chip and off-chip traffic management, the use of double buffering to hide data communication latency, SIMD programming to extract fine-grain data parallelism, etc. - As a major contribution, we use simulation techniques to explore the SW performance scalability across different numbers of cores working in parallel. We investigate the memory organization that delivers the maximum bandwidth with the minimum hardware cost, and analyze the impact of including a shared cache that can be accessed by all the cores. We also study the impact of memory latency and synchronization overhead on performance.
3 Algorithm Description and Parallelism
The SW algorithm determines the optimal local alignment between two sequences of length lx and ly by assigning scores to each character-to-character comparison: positive for exact matches/substitutions, negative for insertions/deletions. The process is done recursively, and the data dependencies are shown in figure 1a. The computation of matrix cell (i, j) depends on the results of (i − 1, j), (i, j − 1) and (i − 1, j − 1). However, cells along an antidiagonal are independent. The final score is reached when all the symbols have been compared. After
computing the similarity matrix, the best local alignment is obtained by starting from the cell with the highest score and following the arrows until a value of zero is reached.

3.1 Available Parallelism
Because most of the time is spent computing the score matrix, this is the part that is usually parallelized. The commonly used strategy is the wavefront method, in which computation advances parallel to the antidiagonals. As figure 1a shows, the maximum available parallelism is obtained when the main antidiagonal is reached. Before developing a specific implementation, it is necessary to understand the parameters that influence performance. Figure 1c and table 1 illustrate these parameters and their description. The computation is done in blocks of a given size. We can identify three types of relevant parameters: first, those which depend on the input data set (the sequence lengths lx and ly); second, those which depend on the algorithm implementation, like b and k (the horizontal and vertical block lengths); and third, those which depend on the architecture (the number of workers p and the time Tblock(b,k) required to compute a block of size b ∗ k). The parallelism of this kind of problem has already been studied [2][9]; instead of developing a new model, we summarize it by noting that the total time to compute the comparison can be expressed as follows:

    Total_time_parallel = Tseq_part + Tcomp(b,k) + Ttransf(b,k) + Tsync(b,k)    (1)
where Tseq_part is the intrinsic sequential part of the execution; Tcomp(b,k) is the time to process the matrix in parallel with p processors and a specific block size b ∗ k; Ttransf(b,k) is the time spent transferring all blocks of size b ∗ k used in the computation; and Tsync(b,k) is the synchronization overhead. A synchronization takes place after each block of size b ∗ k is computed. On the one hand, Tcomp(b,k) depends mainly on p, b and k: as the number of processors increases, this time decreases. The limit is set by the processor speed and the main antidiagonal; that is, if lx and ly differ, the maximum parallelism persists for |lx − ly| stages and then decreases again. Small values of b and k increase the number of parallel blocks, making the use of a larger number of processors effective. Conversely, larger values of b and k reduce parallelism, so Tcomp(b,k) increases. On the other hand, Tsync(b,k) also depends on b and k: small values increase the number of synchronization events, which can seriously degrade performance. Finally, Ttransf(b,k) increases for large values of b and k, but also for very small values, because small blocks increase the number of inefficient data transfers.
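The recurrence and wavefront order described in this section can be sketched in C, the paper's implementation language. This is a minimal scalar illustration only: the scoring values (match +2, mismatch −1, gap −1) are placeholder assumptions rather than the paper's substitution matrix, and the real code vectorizes 8 cells per SIMD register and distributes rows across workers.

```c
#include <string.h>

#define MAX_N 128

static int max4(int a, int b, int c, int d) {
    int m = a > b ? a : b;
    if (c > m) m = c;
    return d > m ? d : m;
}

/* Smith-Waterman score matrix computed in antidiagonal (wavefront) order.
   Cell (i,j) depends on (i-1,j), (i,j-1) and (i-1,j-1), so all cells on
   one antidiagonal are independent and could be computed in parallel. */
int sw_best_score(const char *x, const char *y) {
    int lx = (int)strlen(x), ly = (int)strlen(y);
    static int H[MAX_N + 1][MAX_N + 1];
    memset(H, 0, sizeof H);
    int best = 0;
    /* d walks the antidiagonals from the top-left corner */
    for (int d = 2; d <= lx + ly; d++) {
        for (int i = 1; i <= ly; i++) {
            int j = d - i;
            if (j < 1 || j > lx) continue;
            int s = (x[j - 1] == y[i - 1]) ? 2 : -1;  /* placeholder scores */
            H[i][j] = max4(0,
                           H[i - 1][j - 1] + s,
                           H[i - 1][j] - 1,   /* gap penalty: assumption */
                           H[i][j - 1] - 1);
            if (H[i][j] > best) best = H[i][j];
        }
    }
    return best;
}
```

For instance, sw_best_score("GAT", "CGATC") finds the exact local match GAT. In a parallel version, the inner loop over i is the work that gets split into blocks of b ∗ k cells among the p workers.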
4 Parallel Implementations on a Multicore Architecture
Comparing long sequences presents many challenges to any parallel architecture. Many relevant issues, like synchronization, data partitioning, bandwidth use, memory space and data organization, must be studied carefully to use the available features of a machine efficiently and minimize equation (1).
Table 1. Parameters involved in the execution of the SW implementation on CellBE

Name         Description
b            Horizontal block size (in number of symbols (bytes))
k            Vertical block size (in number of symbols (bytes))
lx           Length of the sequence in the horizontal direction
ly           Length of the sequence in the vertical direction
p            Number of processors (workers); SPEs in the case of CellBE
Tblock(b,k)  Time required to process a block of size b ∗ k
Fig. 1. (a) Data dependency (b) Different optimal regions (c) Computation distribution
To exploit TLP in SW, a master thread takes the sequences and preprocesses them: it performs the profile computation according to a substitution score matrix, prepares the worker execution contexts, and receives the results. These steps correspond to the sequential part of the execution, the first term of equation (1). Workers compute the similarity matrix as figure 1b shows; that is, each worker computes different rows of the matrix. For example, if p = 8, p0 computes rows 0, 8, 16, etc.; p1 computes rows 1, 9, 17, and so on. Since the SIMD registers of the workers are 16 bytes long, it is possible to compute 8 symbols in parallel (using 2 bytes per temporary score), that is, k = 8 symbols. Each worker has to store temporary matrix values which will be used by the next worker; for example, in figure 1b, computing block 2 on p0 generates temporary data used in the computation of block 1 on p1. This feature allows several possible implementations. In this work, we show two, each with advantages and disadvantages, and each affected differently by synchronization and communication.

4.1 Centralized Data Storage Approach
Due to the small size of the scratch-pad memory (an LS of 256KB in the CellBE, shared between instructions and data), a buffer per worker is defined in main memory to store data, as figure 2a shows. Each worker reads from its own buffer and writes to the next worker's buffer via DMA GET and PUT operations. The shared data correspond to the borders of consecutive rows, as shown in figure 1b. This implies that all workers are continuously reading/writing from/to memory, which is a potential problem from the bandwidth point of view, but it is easy to program. The Ttransf(b,k)
Fig. 2. (a) SPEs store data in memory. (b) SPEs store data in an internal buffer.
term of equation (1) is minimized using double buffering, which reduces the impact of DMA latency by overlapping computation with data transfer. Atomic operations are used to synchronize workers and guarantee the data dependencies.

4.2 Distributed Data Storage Approach
Here, each worker defines a small local buffer in its own LS to store temporary results (figure 2b). When pi computes a block, it signals pi+1, indicating that a block is ready. When pi+1 receives this signal, it starts a DMA GET operation to bring the data from the LS of pi to its own LS. When the data arrive, pi+1 completes the handshake by sending an ack signal to pi. Once pi receives this signal, it knows that the buffer is available for storing new data. The process continues until all blocks are computed. However, due to the limited size of the LS, the first and last workers in the chain read from and write to memory. This approach reduces the data traffic generated by the previous approach, but it is more complex to program. There are three types of DMA: between workers, to transfer data from LS to LS (on-chip traffic); from memory to LS; and from LS to memory (off-chip traffic). Synchronization overhead is reduced by noting that an SPE does not have to wait immediately for the ack signal from another SPE: it can start the computation of the next block and wait for the ack later. Synchronization is done using the signal operations available in the CellBE.
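The block-ready/ack handshake between consecutive workers can be illustrated with portable POSIX threads in place of the CellBE signal registers. This is a hedged sketch of the protocol only: one shared slot stands in for a worker's LS buffer, names such as run_pipeline are ours, and the real implementation transfers data with DMA GET between local stores rather than a memory copy.

```c
#include <pthread.h>

#define NBLOCKS 8

/* One shared slot modeling a worker's LS buffer.  A mutex and condition
   variable stand in for the SPE signal-notification facility. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int slot = -1;          /* block id currently in the buffer      */
static int ready = 0;          /* "block ready" signal from worker i    */
static long consumed_sum = 0;  /* checksum accumulated by worker i+1    */

static void *producer(void *arg) {     /* plays the role of worker pi   */
    (void)arg;
    for (int b = 0; b < NBLOCKS; b++) {
        pthread_mutex_lock(&lock);
        while (ready)                  /* wait for ack: buffer is busy   */
            pthread_cond_wait(&cond, &lock);
        slot = b;                      /* "compute" block b              */
        ready = 1;                     /* signal worker pi+1             */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg) {     /* plays the role of worker pi+1 */
    (void)arg;
    for (int b = 0; b < NBLOCKS; b++) {
        pthread_mutex_lock(&lock);
        while (!ready)                 /* wait for "block ready" signal  */
            pthread_cond_wait(&cond, &lock);
        consumed_sum += slot;          /* stands in for the DMA GET      */
        ready = 0;                     /* ack: buffer may be reused      */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_pipeline(void) {
    pthread_t p, c;
    consumed_sum = 0;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumed_sum;
}
```

The deferred-ack optimization described above corresponds to the producer computing its next block before blocking on the ack, which this single-slot sketch does not show; a two-slot (double-buffered) version would decouple the two sides by one block.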
5 Experimental Methodology
As a starting point, we execute both SW implementations on the CellBE with 1 to 16 SPEs. We evaluate the performance impact of the block size and the bandwidth requirements. Then, using simulation, we study the performance scalability and the impact of different memory organizations. We obtain traces from the execution to feed the architecture simulator (TaskSim) and carry out an analysis of more complex multicore scenarios. TaskSim is based on the idea that, in distributed memory architectures such as the CellBE, the computation time on a processor does not depend on what is happening in the rest of the system. Execution time depends on inter-thread synchronization and
Fig. 3. Modeled system

Table 2. Evaluated configurations ranked by L2 bandwidth and memory bandwidth

L2 cache organization
  Number of banks:      1     2     4      8      16
  Bandwidth [GB/s]:     25.6  51.2  102.4  204.8  409.6

Memory organization
  MICs/DRAMs per MIC:   1/1   1/2 or 2/1  1/4 or 4/1  2/4 or 4/2  4/4
  Bandwidth [GB/s]:     6.4   12.8        25.6        51.2        102.4

Each combination of L2 BW and memory BW is a possible configuration; e.g., 2 L2 banks and 2 MICs with 2 DRAMs/MIC deliver 51.2 GB/s of L2 BW and 25.6 GB/s of memory BW.
the memory system for DMA transfers. TaskSim models the memory system in cycle-accurate mode: the DMA controllers, the interconnection buses, the Memory Interface Controller, the DRAM channels, and the DIMMs. TaskSim does not model the processors themselves; it relies on the computation times recorded in the trace to measure the delay between memory operations (DMAs) or inter-processor synchronizations (modeled as blocking semaphores). As inputs for the real executions, we use sequences of 3.4M and 1.8M symbols. However, to obtain traces of manageable size for simulation, we take shorter sequences, ensuring available parallelism for up to 112 workers, using block transfers of 16KB. The code running on the PPU and SPU sides is written in C and was compiled with ppu-gcc and spu-gcc 4.1.1, respectively, with the -O3 option. The executions run on an IBM BladeCenter QS20 system composed of two CellBE processors at 3.2 GHz, so it is possible to have up to 16 SPEs running concurrently.

5.1 Modeled Systems
Figure 3 shows the general organization of the multicore architectures evaluated through simulation. The system comprises several processing units integrated into clusters, working at a frequency of 3.2 GHz. The main elements are: - A control processor P: a PowerPC core with SIMD capabilities. - Accelerators: 128 SIMD cores, connected into clusters of eight cores each. Each core is connected to a private LS and a DMA controller.
- A Global Data Bus (GDB) connects the LDBs, the L2 cache and the memory controllers. It allows four 8-byte requests per cycle, i.e. 102.4 GB/s of bandwidth. - Local Data Buses (LDB) connect each cluster to the GDB. Each GDB-to-LDB connection is 8 bytes/cycle (25.6 GB/s). - A shared L2 cache distributed into 1 to 16 banks, as table 2 describes. Each bank is 8-way set associative, with sizes ranging from 4KB to 4MB. - On-chip memory interface controllers (MIC) connect the GDB and memory, providing up to 25.6 GB/s each (four multi-channel DDR-2 modules of 6.4 GB/s). Our baseline CellBE-like blade consists of 2 clusters with 8 workers each, without L2 cache, one GDB providing a peak BW of 102.4 GB/s for on-chip data transfer, one MIC providing a peak BW of 25.6 GB/s to memory, and four DRAM modules connected to the MIC at 6.4 GB/s each.
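The peak-bandwidth figures above follow directly from the bus widths and the 3.2 GHz clock; a quick check of the arithmetic:

```c
/* Peak-bandwidth arithmetic behind the figures quoted in the text:
   all buses in the modeled system are clocked at 3.2 GHz, so
   GB/s = bytes-per-cycle * GHz. */
static const double CLOCK_GHZ = 3.2;

/* GDB: four 8-byte requests per cycle -> 102.4 GB/s */
double gdb_peak_gbs(void) { return 4 * 8 * CLOCK_GHZ; }

/* Each GDB-to-LDB link: 8 bytes per cycle -> 25.6 GB/s */
double ldb_peak_gbs(void) { return 8 * CLOCK_GHZ; }

/* One MIC: four DDR-2 channels of 6.4 GB/s each -> 25.6 GB/s */
double mic_peak_gbs(void) { return 4 * 6.4; }
```

Table 2's L2 and memory rows are simply these per-bank and per-MIC figures multiplied by the number of banks or MIC/DRAM combinations.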
6 Experimental Results

6.1 Speedup in the Real Machine
Figures 4 and 5 show the performance results for the centralized and distributed approaches on the CellBE. The baseline is the execution with one worker. The figures show the performance impact of the block size (parameter b of table 1). With small block sizes (128B or 256B), the application does not exploit the available parallelism, for two reasons: first, although more parallel blocks are available, the number of inefficient DMAs increases; second, because each block transfer is synchronized, the number of synchronization operations also increases, and the introduced overhead degrades performance. With larger blocks, like 16KB, the parallelism decreases, but so does the synchronization overhead. The impact is less severe in the distributed SW, because synchronization is direct between workers and data transfers are done directly between LSs. With 16KB blocks, the performance of both approaches is similar (14x for the centralized and 15x for the distributed version with 16 workers); that is, synchronization and data transfer are hidden by computation in both approaches.

6.2 Bandwidth Requirements
Data traffic is measured in both the centralized and distributed cases. With these results, we estimate the on-chip and off-chip bandwidth requirements by dividing the total traffic by the time to compute the matrix. Figure 6 depicts the results of these measurements. Up to 16 workers, the curves reflect the real execution on the CellBE; for the remaining points, we made a mathematical extrapolation that gives an idea of the BW required when using more workers. The figure shows that the centralized SW doubles the BW requirements of the distributed case for an equal number of workers. The reason is that in the centralized case a worker first sends data from its LS to memory, and then another worker brings the data from memory to its own LS; moreover, all of this traffic is off-chip. In the distributed case, data travels only once, from one LS to another LS, and most of the traffic
Fig. 4. Centralized SW implementation (speedup vs. number of SPEs, block sizes 128B–16KB)

Fig. 5. Distributed SW implementation (speedup vs. number of SPEs)

Fig. 6. Bandwidth requirements (GB/s vs. number of workers)

Fig. 7. Memory latency impact, centralized
is on-chip. For example, with 16 workers, the off-chip BW reaches 9.1 GB/s in the first case, whereas the on-chip BW is around 4.7 GB/s in the second. Although the current CellBE architecture can deliver this BW for 16 cores, the figure makes clear that this demand is unsustainable when more than 16 workers are used.

6.3 Simulation Results
This section presents simulation results for the configurations of section 5. We show results for both SW implementations using a 16KB block size. We study the impact of memory latency with a perfect memory, of the real memory system, of including an L2 cache, and of the synchronization overhead. Memory Latency Impact. We perform experiments using up to 128 cores, without L2 cache, with sufficient memory bandwidth, with different latencies in a perfect memory, and without synchronization overhead. Figure 7 shows results for the centralized SW. The first observation is that even in the ideal case (0 cycles) the execution does not reach linear speedup. This is a consequence of Amdahl's law: with 1 worker, the sequential part of the execution takes 0.41% of the time, but with 128 workers it takes around 30.2%. The figure also shows that this implementation hides latency well, thanks to double buffering; degradation starts at latencies near 8K cycles, when more than 64 workers are used. Finally, performance does not increase with more than 112 cores
Fig. 8. Memory BW, centralized (speedup vs. number of workers)

Fig. 9. Memory BW, distributed (speedup vs. number of workers)

Fig. 10. Cache sizes, 6.4 GB/s memory BW, centralized

Fig. 11. Cache sizes, 6.4 GB/s memory BW, distributed
because the simulated trace only has parallelism for this number of cores, as explained in section 5. Results for the distributed case exhibit similar behavior. Real Memory System Without L2 Cache. Figures 8 and 9 show the performance results for several combinations of MICs and DRAMs per MIC, without L2 caches. As shown, with 128 workers in the centralized case, the BW required to obtain the maximum performance is between 51.2 GB/s (2 MICs and 4 DRAMs/MIC) and 102.4 GB/s (4 MICs and 4 DRAMs/MIC). Comparing these results with the extrapolation of figure 6, we conclude that the BW required with 128 workers is near 65 GB/s. However, some of these configurations are unrealistic, because physical connections do not scale in this way. For the distributed case, 12.8 GB/s (1 MIC and 2 DRAMs/MIC) is sufficient to obtain the maximum performance. This is because the off-chip traffic in this case is very small (figure 6); essentially all the traffic stays inside the chip. Impact of the L2 Cache and Local Storage. There are several ways to include cache or local memory in the system. We evaluate two options: first, adding a bank-partitioned L2 cache connected to the GDB; second, adding a small LS to each worker. These two models differ in how data locality and inter-processor communication are managed. Figure 10 shows results for the centralized case, in which only one MIC and one DRAM module is used (6.4
Fig. 12. Synchronization Overhead
GB/s of memory BW) and a maximum of 204.8 GB/s of L2 BW is available (the L2 is distributed in 8 banks). As shown, the cache requirement is very high, because the matrix to compute is bigger than the L2 and data reuse is very small: when a block is computed, its data is used once by another worker and then replaced by a new block. Additionally, with many workers the conflict misses increase significantly for small L2 caches, so the miss rate increases and performance degrades. Figure 11 shows results for the distributed case, where each worker has a 256KB LS (as in the CellBE) and an L2 cache distributed in 8 banks is used to access data that are not part of the matrix computation. The results show that a shared L2 cache of 2MB is enough to capture the on-chip traffic. Synchronization Overhead. Obtaining low synchronization overhead requires a synchronization technique that matches the target machine well. So far, our experiments have ignored synchronization overhead; now it is included in the performance analysis. Each time a worker computes a block, it communicates to another worker that the data is available, as explained in section 4.1. We include this overhead by assuming that the time to perform a synchronization event (signal or wait) after a block is computed is a fraction of the time required to compute it, that is, Tsync_block(b,k) = α ∗ Tblock(b,k). We feed this information to our simulator and measure the performance for different values of α. Figure 12 shows the results for the centralized SW implementation. As observed, the system absorbs up to 1000 nanoseconds of latency per synchronization event. The results of the distributed approach exhibit similar behavior.
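The per-block model Tsync_block(b,k) = α ∗ Tblock(b,k), combined with the deferred-ack strategy of section 4.2, suggests a simple cost estimate. The overlap assumption below is ours, not the simulator's: a synchronization latency shorter than one block's compute time is completely hidden, and only the excess shows up as slowdown.

```c
/* Illustrative cost model (assumption, not the paper's simulator): a worker
   defers waiting for the ack until it has computed the next block, so a
   synchronization latency t_sync_ns is hidden as long as it is shorter than
   the block compute time t_block_ns; only the excess adds to the per-block
   cost.  Returns the relative slowdown (>= 1.0). */
double sync_slowdown(double t_block_ns, double t_sync_ns) {
    double exposed = (t_sync_ns > t_block_ns) ? (t_sync_ns - t_block_ns) : 0.0;
    return (t_block_ns + exposed) / t_block_ns;
}
```

Under this model, a 1000 ns synchronization event is fully absorbed whenever computing a 16KB block takes longer than 1000 ns, which is consistent with the tolerance observed in figure 12.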
7 Conclusions
This paper describes the implementation of DP algorithms for the comparison of long sequences on modern multicore architectures that exploit several levels of parallelism. We have studied different SW implementations that use the CellBE hardware efficiently and achieve speedups close to linear in the number of workers. Furthermore, the major contribution of our work is the use
258
F. S´ anchez et al.
of simulation tools to study more complex multicore configurations. We have studied key aspects such as the impact of memory latency, memory organizations capable of delivering maximum BW, and synchronization overhead. We observed that it is possible to minimize the impact of memory latency by using techniques like double buffering while large data blocks are computed. Moreover, we have shown that, due to the sequential part of the algorithm, performance does not scale linearly for large numbers of workers; it becomes necessary to optimize the sequential part of the SW implementations. We investigated the memory configuration that delivers maximum BW to tens and even hundreds of cores on a single chip. As a result, we determined that for the SW algorithm it is more efficient to distribute small LSs across the workers than to have a shared on-chip L2 data cache connected to the GDB. This is a consequence of the streaming nature of the application, in which data reuse is low. However, the use of LSs makes programming more challenging, because communication must always be managed at the user level. Finally, we observed that our synchronization strategy minimizes this overhead, because it prevents a worker from waiting immediately for the response to a previous signal. In this way, the application can tolerate an overhead of up to a thousand nanoseconds with a maximum performance degradation of around 3%.
Acknowledgements. This work was sponsored by the European Commission (ENCORE Project, contract 248647), the HiPEAC Network of Excellence, the Spanish Ministry of Science (contract TIN2007-60625), and the AlBan Program (Scholarship E05D058240CO).
References

1. Fast data finder (FDF) and GeneMatcher (2000), http://www.paracel.com
2. Aji, A.M., Feng, W.-c., Blagojevic, F., Nikolopoulos, D.S.: Cell-SWat: modeling and scheduling wavefront computations on the Cell Broadband Engine. In: CF 2008: Proceedings of the 5th Conference on Computing Frontiers, pp. 13–22. ACM, New York (2008)
3. Alam, S.R., Meredith, J.S., Vetter, J.S.: Balancing productivity and performance on the Cell Broadband Engine. In: IEEE International Conference on Cluster Computing, pp. 149–158 (2007)
4. Altschul, S.F., Madden, T.L., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
5. Blas, A.D., Karplus, K., Keller, H., Kendrick, M., Mesa-Martinez, F.J., Hughey, R.: The UCSC Kestrel parallel processor. IEEE Transactions on Parallel and Distributed Systems (January 2005)
6. Boukerche, A., Magalhaes, A.C., Ayala, M., Santana, T.M.: Parallel strategies for local biological sequence alignment in a cluster of workstations. In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, Los Alamitos (2005)
7. Boukerche, A., Melo, A.C., Sandes, E.F., Ayala-Rincon, M.: An exact parallel algorithm to compare very long biological sequences in clusters of workstations. Cluster Computing 10(2), 187–202 (2007)
8. Chen, C., Schmidt, B.: Computing large-scale alignments on a multi-cluster. In: IEEE International Conference on Cluster Computing, vol. 38 (2003)
9. Edmiston, E.E., Core, N.G., Saltz, J.H., Smith, R.M.: Parallel processing of biological sequence comparison algorithms. Int. J. Parallel Program. 17(3) (1988)
10. Friman, S., Ramirez, A., Valero, M.: Quantitative analysis of sequence alignment applications on multiprocessor architectures. In: CF 2009: Proceedings of the 6th ACM Conference on Computing Frontiers, pp. 61–70. ACM, New York (2009)
11. Gedik, B., Bordawekar, R.R., Yu, P.S.: CellSort: high performance sorting on the Cell processor. In: VLDB 2007: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 1286–1297. VLDB Endowment (2007)
12. Manavski, S.A., Valle, G.: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 9 (2008)
13. Pearson, W.R.: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11 (1991)
14. Petrini, F., Fossum, G., Fernández, J., Varbanescu, A.L., Kistler, M., Perrone, M.: Multicore surprises: lessons learned from optimizing Sweep3D on the Cell Broadband Engine. In: IPDPS, pp. 1–10 (2007)
15. Sachdeva, V., Kistler, M., Speight, E., Tzeng, T.H.K.: Exploring the viability of the Cell Broadband Engine for bioinformatics applications. In: Proceedings of the 6th Workshop on High Performance Computational Biology, pp. 1–8 (2007)
16. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing

Mark James1, Paul Springer1, and Hans Zima1,2

1 Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA
2 University of Vienna, Austria
{mjames,pls,zima}@jpl.nasa.gov
Abstract. This paper describes an approach to providing software fault tolerance for future deep-space robotic NASA missions, which will require a high degree of autonomy supported by an enhanced on-board computational capability. Such systems have become possible as a result of the emerging many-core technology, which is expected to offer 1024-core chips by 2015. We discuss the challenges and opportunities of this new technology, focusing on introspection-based adaptive fault tolerance that takes into account the specific requirements of applications, guided by a fault model. Introspection supports runtime monitoring of the program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, drawn from domain-specific knowledge, or derived from the results of static or dynamic program analysis. This work is part of an on-going project at the Jet Propulsion Laboratory in Pasadena, California.
1 Introduction
On-board computing systems for space missions are subject to stringent dependability requirements, with enforcement strategies focusing on strict and widely formalized design, development, verification, validation, and testing procedures. Nevertheless, history has shown that despite these precautions errors occur, sometimes resulting in the catastrophic loss of an entire mission. There are theoretical as well as practical reasons for this situation:
1. No matter how much effort is spent on verification and testing, well-known undecidability and NP-completeness results show that many relevant problems are either undecidable or computationally intractable.
2. As a result, large systems typically do contain design faults.
3. Even a perfectly designed system may be subject to external faults, such as radiation effects and operator errors.
As a consequence, it is essential to provide methods that avoid system failure and maintain the functionality of a system, possibly with degraded performance, even in the case of faults. This is called fault tolerance. Fault-tolerant systems were built long before the advent of the digital computer, based on the use of replication, diversified design, and federation of equipment. In an article on Babbage's difference engine published in 1834, Dionysius
Lardner wrote [1]: "The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computation by different methods." An example of an early fault-tolerant computer is NASA's Self-Testing-and-Repairing (STAR) system developed for a 10-year mission to the outer planets in the 1960s. Today, highly sophisticated fault-tolerant computing systems control the new generation of fly-by-wire aircraft, such as the Airbus and Boeing airliners. Perhaps the most widespread use of fault-tolerant computing has been in the area of commercial transaction systems, such as automatic teller machines and airline reservation systems.
Most space missions of the past were largely controlled from Earth, so that a significant number of failures could be handled by putting the spacecraft in a "safe" mode, with Earth-bound controllers attempting to return it to operational mode. This approach will no longer work for future deep-space missions, which will require enhanced autonomy and a powerful on-board computational capability. Such missions are becoming possible as a result of recent advances in microprocessor technology, which are leading to low-power many-core chips that today already have on the order of 100 cores, with 2015 technology expected to offer 1024-core systems. These developments have many consequences for fault tolerance, some of them posing challenges and others providing new opportunities. In this paper we focus on an approach to software-implemented application-adaptive fault tolerance. The paper is structured as follows: In Section 2, we establish a conceptual basis, providing more precise definitions for the notions of dependability and fault tolerance.
Section 3 gives an overview of future missions and their requirements, and outlines an on-board architecture that complements a radiation-hardened spacecraft control and communication component with a COTS-based high-performance processing system. After introducing introspection in Section 4, we discuss introspection-based adaptive fault tolerance in Section 5. The paper ends with an overview of related work and concluding remarks in Sections 6 and 7.
2 Fault Tolerance in the Context of Dependability
Dependability has been defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance as the “trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers”. Dependability is characterized by its attributes, the threats to it, and the means by which it can be achieved [2,3]. The attributes of dependability specify a set of properties that can be used to assess how a system satisfies its overall requirements. Key attributes are reliability, availability, mean-time-to-failure, and safety. A threat is any fact or event that negatively affects the dependability of a system. Threats can be classified as faults, errors, or failures. Their relationship can be illustrated by the fault-error-failure chain shown in Figure 1.
Fig. 1. Threats: the fault-error-failure chain (a defect, i.e., a fault, is activated into an error, an invalid state; errors propagate until one reaches the service interface, where the resulting failure violates the system specification; an external fault is caused by an external failure)
A fault is a defect in a system. Faults can be dormant (e.g., incorrect program code that is not executed) and have no effect. When activated during system operation, a fault leads to an error, which is an illegal system state. Errors may be propagated through a system, generating other errors. For example, a faulty assignment to a variable may result in an error characterized by an illegal value for that variable; the use of the variable for the control of a for-loop can lead to ill-defined iterations and other errors, such as illegal accesses to data sets and buffer overflows. A failure occurs if an error reaches the service interface of a system, resulting in system behavior that is inconsistent with its specification. With the above terminology in place, we can now precisely characterize a system as fault tolerant if it never enters a failure state. Errors may occur in such a system, but they never reach its service boundary and always allow recovery to take place. The implementation of fault tolerance in general implies three steps: error detection, error analysis, and recovery.
The means for achieving dependability include fault prevention, fault removal, and fault tolerance. Fault prevention addresses methods that prevent faults from being incorporated into a system. In the software domain, such methods include restrictive coding structures that avoid common programming faults, the use of object-oriented techniques, and the provision of high-level APIs. An example of hardware fault prevention is shielding against radiation-caused faults. Fault removal refers to a set of techniques that eliminate faults during the design and development process. Verification and Validation (V&V) are important in
this context: Verification provides methods for the review, inspection, and test of systems, with the goal of establishing that they conform to their specification. Validation checks the specification in order to determine whether it correctly expresses the needs of the system's users. For theoretical as well as practical reasons, neither fault prevention nor fault removal provides a complete solution; in general, for non-trivial programs there is no guarantee that they contain no design faults. However, even in a program completely free of design faults, hardware malfunction can cause software errors at execution time. In the domain underlying this paper, the problem is actually more severe: a spacecraft can be hit by radiation, which can cause arbitrary errors in data and control structures. This is discussed in more detail in the next section.
3 Future Space Missions and Their Requirements
Future deep-space missions face the challenge of designing, building, and operating progressively more capable autonomous spacecraft and planetary rovers. Given the latency and bandwidth of spacecraft-Earth communication for such missions, the need for enhanced autonomy becomes obvious: Earth-based mission controllers will be unable to directly control distant spacecraft and robots to ensure timely precision and safety, and to support "opportunistic science" by capturing rapidly changing events, such as dust devils on Mars or volcanic eruptions on a remote moon in the solar system [4]. Furthermore, the high data volume yielded by smart instruments on board the spacecraft can overwhelm the limited bandwidth of spacecraft-Earth communication, enforcing on-board data analysis, filtering, and compression. Science processing will require a high-performance capability that may range up to hundreds of Teraops for on-board synthetic aperture radar (SAR), hyperspectral assessment of scenes, or stereo vision. Currently, the performance of traditional mission architectures lags that of commercial products by at least two orders of magnitude; furthermore, this gap is expected to widen in the future. As a consequence, the traditional approach to on-board computing is not expected to scale with the requirements of future missions. A radical departure is necessary. Emerging technology offers a way out of this dilemma.
Recent developments in the area of commercial multi-core architectures have resulted in simpler processor cores, enhanced efficiency in terms of performance per Watt, and a dramatic increase in the number of cores on a chip, as illustrated by Tilera Corporation's Tile64 [5], a homogeneous parallel chip architecture with 64 identical cores arranged in an 8x8 grid performing at 192 Gops with a power consumption of 170-300mW per core, or Intel's terachip announced for 2011, an 80-core chip providing 1.01 Teraflops based on a frequency of 3.16 GHz, with a power consumption of 62W. These trends suggest a new paradigm for spacecraft architectures, in which the ultra-reliable radiation-hardened core component responsible for control, navigation, data handling, and communication is extended with a scalable commodity-based multi-core system for autonomy and science processing. This approach
will provide the basis for a powerful parallel on-board supercomputing capability. However, bringing COTS components into space leads to a new problem: the need to address their vulnerability to hardware as well as software faults.
Space missions are subject to faults caused by equipment failure or environmental impacts, such as radiation, temperature extremes, or vibration. Missions operating close to the Earth/Moon system can be controlled from the ground. Such missions may allow controlled failure, in the sense that they fail only in specific, pre-defined modes, and only to a manageable extent, avoiding complete disruption. Rather than providing the capability of resuming normal operation, a failure in such a system puts it into a safe mode, from which recovery is possible after the failure has been detected and identified. As an example, the on-board software controlling robotic planetary exploration spacecraft for those portions of a mission during which there is no critical activity (such as detumbling the spacecraft after launch or descent to a planetary surface) can be organized as a system allowing controlled failure. When a fault is detected during operation, all active command sequences are terminated, components inessential for spacecraft survival are powered off, and the spacecraft is positioned into a stable sun-pointed attitude. Critical information regarding the state of the spacecraft and the fault is transmitted to ground controllers via an emergency link. Restoring the spacecraft's health is then delegated to controllers on Earth. However, such an approach is not adequate for deep-space missions beyond immediate and continuous control from the Earth.
For such missions, fault tolerance is a key prerequisite, i.e., a fail-operational response to faults must be provided, implying that the spacecraft must be able to deal autonomously with faults and continue to provide the full range of critical functionality, possibly at the cost of degraded performance. Systems which preserve continuity of service can be significantly more difficult to design and implement than fail-controlled systems. Not only is it necessary to determine that a fault has occurred; the software must also be able to determine the effects of the fault on the system's state, remove those effects, and then place the system into a state from which processing can proceed. This is the situation on which the rest of this paper is based. We focus on strategies and techniques for providing adaptive, introspection-based fault tolerance for space-borne systems.
Deep-space missions are subject to radiation in the form of cosmic rays and the solar wind, exposing them to protons, alpha particles, heavy ions, ultraviolet radiation, and X-rays. Radiation can interact with matter through atomic displacement (a rearrangement of atoms in a crystal lattice) or ionization, with the potential of causing permanent or transient damage [6]. Modern COTS circuits are protected against long-term cumulative degradation as well as catastrophic effects caused by radiation. However, they are exposed to transient faults in the form of Single Event Upsets or Multiple Bit Upsets, which do not cause lasting damage to the device. A Single Event Upset (SEU) changes the state of a single bit in a register or memory, whereas a Multiple Bit Upset (MBU) results in a
change of state of multiple adjacent bits. The probability of SEUs and MBUs depends on the environment in which the spacecraft is operating, and on the detailed characterization of the hardware components in use. COTS semiconductor fabrication processes vary: with the 65nm process now in commercial production, some of the semiconductor foundries are using Silicon on Insulator (SOI) construction, which makes the chips less susceptible to these radiation effects.
Depending on the efficacy of fault tolerance mechanisms, SEUs and MBUs can manifest themselves at different levels. For example, faults may affect processor cores and caches, DRAM memory units, memory controllers, on-chip communication networks, I/O processor nodes, and interconnection networks. This can result in the corruption of instruction fetch/decode, address selection, memory units, synchronization, communication, and signal/interrupt processing. In a sequential thread this may lead to the (unrecognized) use of corrupted data and the execution of wrong or illegal instructions, branches, and data accesses in the program. Hangs or crashes of the program, as well as unwarranted exceptions, are other possible consequences. In a distributed system, transient faults can cause communication errors, livelock, deadlock, data races, or arbitrary Byzantine failures [7]. Some of these effects may be caught and corrected in the hardware (e.g., via the use of an error-correcting code (ECC)) with no disruption of the program. A combination of hardware and software mechanisms may provide an effective approach, as in the case of the fault isolation of cores in a multi-core chip [8]. Other faults, such as those causing illegal instruction codes, illegal addresses, or the violation of access protections, may trigger a synchronous interrupt, which can lead to an application-specific response. In a distributed system, watchdogs may detect a message failure. However, in general, an error may remain undetected.
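The single-bit correction that an ECC provides can be illustrated with a toy Hamming(7,4) code. This is a generic textbook sketch, not the ECC scheme of any particular memory device, and all function names are our own:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits [d1,d2,d3,d4] into 7 bits with 3 parity bits."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Standard layout: parity bits at positions 1, 2, 4 (1-indexed).
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(word):
    """Locate and correct a single flipped bit; return the 4 data bits."""
    w = list(word)
    # Recompute each parity check over the positions it covers.
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # position of the flipped bit, 0 = no error
    if syndrome:
        w[syndrome - 1] ^= 1          # flip the affected bit back
    return [w[2], w[4], w[5], w[6]]

# Simulate an SEU: flip one bit of the code word and recover the data.
data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1                          # single-event upset on bit 5
assert hamming74_correct(word) == data
```

Real memory subsystems use wider SECDED codes over whole words, but the principle is the same: the syndrome pinpoints a single flipped bit, so an SEU is masked with no disruption of the program, whereas an MBU may exceed the code's correction capability.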
Figure 2 outlines key building blocks of an architecture for space-borne computing in which the radiation-hardened core is augmented with a COTS-based scalable high-performance computing system (HPCS). The Spacecraft Control & Communication System is the core component of the on-board system, controlling the overall operation, navigation, and communication of the spacecraft. Due to its critical role in the operation and survival of the spacecraft, this system is typically implemented using radiation-hardened components that are largely immune to the harsh radiation environments encountered in space. The Fault-Tolerant High-Capability Computational Subsystem (FTCS) is designed to provide an additional layer of fault tolerance around the HPCS via a Reliable Controller that shields the Spacecraft Control & Communication System from faults that evaded detection or masking in the High Performance Computing System. The Reliable Controller is the only component of the FTCS that communicates directly with the spacecraft control and communication system. As a consequence, it must satisfy stringent reliability requirements. Important approaches for implementing the Reliable Controller, either on a pure software basis or by a combination of hardware and software, have been developed in the Ghidrah [9] and ST8 systems [10].
Fig. 2. An architecture for scalable space-borne computing (the Spacecraft Control & Communication System, comprising the Spacecraft Control Computer (SCC) and the Communication Subsystem (COMM), is linked to Earth and, via a spacecraft interface, to the Fault-Tolerant High-Capability Computational Subsystem (FTCS); the FTCS contains the Reliable Controller (CTRL), the High Performance Computing System (HPCS), Intelligent Mass Data Storage (IMDS), interconnection networks, and an instrument interface to the instruments)
4 Introspection
The rest of this paper deals with an introspection-based approach to providing fault tolerance for the High Performance Computing System in the architecture depicted in Figure 2. A generic framework for introspection has been described in [11]; here we outline its major components. Introspection provides a generic software infrastructure for the monitoring, analysis, and feedback-oriented management of applications at execution time. Consider a parallel application executing in the High Performance Computing System. Code and data belonging to the object representation of the application will be distributed across its components, creating a partitioning of the application into application segments. An instance of the introspection system consists of a set of interacting introspection modules, each of which can be linked to application segments. The structure of an individual introspection module is outlined in Figure 3. Its components include:
– Application/System Links are either sensors or actuators. Sensors represent hardware or software events that occur during the execution of the application; they provide input from the application to the introspection module. Actuators represent feedback from the introspection module to the application. They are triggered as a result of module-internal processing and may result in changes of the application state, its components, or its instrumentation.
Fig. 3. Introspection module (sensors feed monitoring; monitoring, analysis, feedback/recovery, and prognostics are driven by the SHINE inference engine and the knowledge base; actuators and control links connect the module to its application segments and to other introspection modules)
– Inference Engine. The nature of the problems to which introspection is applied demands efficient and flexible control of the associated application segments. These requirements are met in our system by the Spacecraft Health Inference Engine (SHINE) as the core of each introspection module. SHINE is a real-time inference engine that provides an expert-systems capability and functionality for building, accessing, and updating a structured knowledge base.
– The Knowledge Base consists of declarative facts and rules that specify how knowledge can be processed. It may contain knowledge about the underlying system, the programming languages supported, the application domain, and properties of application programs and their execution that are either derived by static or dynamic analysis or supplied by the user.
We have implemented a prototype introspection system for a cluster of Cell Broadband Engines. Figure 4 illustrates the resulting hierarchy of introspection modules, where the levels of the hierarchy, from bottom to top, are respectively associated with the SPEs, the PPE, and the overall cluster.
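The interplay of sensors, rules, knowledge base, and actuators can be sketched in a few lines of Python. This is a minimal illustration only; all class, rule, and actuator names are invented here, and the actual SHINE engine and its knowledge base are far more elaborate:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class IntrospectionModule:
    knowledge: Dict[str, object] = field(default_factory=dict)    # declarative facts
    rules: List[Callable] = field(default_factory=list)           # "if facts then action" rules
    actuators: Dict[str, Callable] = field(default_factory=dict)  # feedback into the application

    def sensor(self, name, value):
        """Entry point for a sensor event raised by the monitored application."""
        self.knowledge[name] = value
        self.infer()

    def infer(self):
        """Minimal forward-chaining step: fire every rule against the facts."""
        for rule in self.rules:
            rule(self.knowledge, self.actuators)

# Example rule: trigger an actuator when a task's heartbeat is stale.
def watchdog_rule(facts, actuators):
    if facts.get("heartbeat_age_s", 0) > 5:
        actuators["restart_task"]()

mod = IntrospectionModule(rules=[watchdog_rule],
                          actuators={"restart_task": lambda: print("restarting task")})
mod.sensor("heartbeat_age_s", 12)   # stale heartbeat fires the actuator
```

In the hierarchy of Figure 4, each SPE-, PPE-, and cluster-level module would be one such instance, with actuators at one level wired to sensors at the level above.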
5 Adaptive Fault Tolerance for High-Performance On-Board Computing
The prototype system mentioned above relied on user-specified assertions guiding the introspection system. In the following we outline the ideas underlying
Fig. 4. Introspection hierarchy for a cluster of Cell Broadband Engines (a top-level module with external links controls PPE-level modules, which in turn control SPE-level modules; each module comprises sensors, an inference engine, monitoring, analysis, feedback/recovery, prognostics, a knowledge base, and actuators)
our current work, which generalizes this system in a number of ways, with a focus on providing support for the automatic generation of assertions. Our approach to providing fault tolerance for applications executing in the high-performance computing system is adaptive in the sense that faults can be handled in a way that depends on the potential damage caused by them. This enables a flexible policy resulting in a reduced performance penalty for the fault tolerance strategy when compared to fixed-redundancy schemes. For example, an SEU causing a single bitflip in the initial phase of an image processing algorithm may not affect the outcome of the computation at all. However, SEU-triggered faults such as the corruption of a key data structure caused by an illegal assignment to one of its components, the change of an instruction code, or the corruption of an address computation may have detrimental effects on the outcome of the computation. Such faults need to be handled through the use of redundancy, with an approach that reflects their severity and takes into account known properties of the application and the underlying system.

5.1 Assertions
An assertion describes a propositional logic predicate that must be satisfied at certain locations of the program, during specific phases of execution, or in
program regions such as loops and methods. Its specification consists of four components: an assertion expression, the region in which this expression applies, the characterization of the fault if the assertion is violated, and an optional recovery specification. We illustrate this by a set of examples.

assert (A(i) ≤ B(i)) in (L1) fault (F1, i, . . .) recovery (. . .)

The assertion expression A(i) ≤ B(i) must be satisfied immediately after the execution of the statement labeled by L1. If it fails, a fault type, F1, is specified and a set of relevant arguments is relayed to the introspection system. Furthermore, a hint for the support of a recovery method is provided.

assert (x = 0) pre in (L2) fault (FT2, x, . . .)
assert (z = f2(x)) in (L2) fault (FT3, x, y, z, . . .)

The two assertion expressions x = 0 and z = f2(x) respectively serve as precondition and postcondition for a statement at label L2, with respective fault types FT2 and FT3 for assertion violations.

assert (diff ≥ ε) invariant in (r_loop) fault (. . .)

The assertion expression diff ≥ ε specifies an invariant that is associated with the region defined by r_loop. It must be satisfied at any point of execution within this loop.
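A four-part assertion of this kind could be represented at runtime roughly as follows. This is a hypothetical sketch, not the paper's implementation; the predicate, state, and report interfaces are our own inventions:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Assertion:
    expr: Callable[[dict], bool]        # predicate over the program state
    region: str                         # label, region, or phase it guards
    fault_type: str                     # fault characterization, e.g. "FT2"
    recovery: Optional[Callable] = None # optional recovery hint

def check(assertion, state, report):
    """Acceptance test: evaluate the predicate, report a fault on violation."""
    if not assertion.expr(state):
        report(assertion.fault_type, assertion.region, state)
        if assertion.recovery:
            assertion.recovery(state)

# Encoding of "assert (x = 0) pre in (L2) fault (FT2, x, ...)" from the text:
a = Assertion(expr=lambda s: s["x"] == 0, region="L2", fault_type="FT2")
check(a, {"x": 0}, report=print)   # precondition satisfied: nothing reported
check(a, {"x": 7}, report=print)   # violated: FT2 is relayed with the state
```

In the real system the `report` callback would hand the fault type and arguments to the introspection module, whose rules decide on analysis and recovery.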
5.2 Fault Detection and Recovery
Introspection-based fault tolerance provides a flexible approach that, in addition to applying innovative methods, can leverage existing technology. Methods that are useful in this context include assertion-based acceptance tests, which check the value of an assertion and transfer control to the introspection system in case of violation, and fault detectors that can effectively mask a fault by using redundant code based on analysis information (see Section 5.3). Furthermore, faults in critical sections of the code can be masked by leveraging fixed-redundancy techniques such as TMR or NMR. Another technique is the replacement of a function with an equivalent version that implements Algorithm-Based Fault Tolerance (ABFT). Information supporting the generation of assertion-based acceptance tests as well as fault detectors can be derived from static or dynamic automatic program analysis, retrieved from domain- or system-specific information contained in the knowledge base, or be directly specified by an expert user. Figure 5 provides an informal illustration of the different methods used to gather such information.

5.3 Analysis-Based Assertion Generation
Automatic analysis of program properties relevant for fault tolerance can draw on a rich spectrum of existing tools and methods. This includes the static analysis of the control and data structures of a program, its intra- and
Fig. 5. Assertion generation (the source program P is analyzed, and the results, together with application and system knowledge and user input, are stored in the knowledge base KB; the derived assertions drive the instrumentation that turns P into the instrumented program P')
inter-procedural control flow, data flow and data dependences, data access patterns, and patterns of synchronization and communication in multi-threaded programs [12,13]. Other static tools can check for the absence of deadlocks or race conditions. Profiling from simulation runs or test executions can contribute information on variable ranges, loop counts, or potential bottlenecks [14]. Furthermore, dynamic analysis provides knowledge that cannot be derived at compile time, such as the actual paths taken during a program execution and dynamic dependence relationships.
Consider a simple example. In data flow analysis, a use-definition chain is defined as the link between a statement that uses (i.e., reads) a variable and the set of all definitions (i.e., assignments) of that variable that can reach this statement along an execution path. Similarly, a definition-use chain links a definition to all its uses. An SEU can break such a chain, for example by redirecting an assignment. This can result in a number of different faults, including the following:
– attempt to use an undefined variable or dereference an undefined pointer
– rendering a definition of a variable useless
– leading to an undefined expression evaluation
– destroying a loop bound
The results of static analysis (as well as results obtained from program profiling) can be exploited for fault detection and recovery in a number of ways, including the generation of assertions in connection with specific program locations or program regions. Examples include asserting:
– the value of a variable that has been determined to be a constant
– the value range of a variable or a pointer
– the preservation of use-definition and definition-use chains
– the preservation of dependence relationships
– a limit for the number of iterations in a loop
– an upper limit for the size of a data structure
– correctness of access sequences to files
The generation of such assertions must be based on the statically derived information in combination with the generation of code that records the corresponding relationships at runtime. A more elaborate technique that exploits static analysis for the generation of a fault detector using redundant code generation can be based on program slicing [15]. This is an analysis technique that extracts from a program the set of statements that affect the values required at a certain point of interest. For example, it answers the question which statements of the program contribute to the value of a critical variable at a given location. These statements form a slice. The occurrence of SEUs can disrupt the connection between a variable occurrence and its slice. A fault detector for a specific variable assignment generates redundant code based only on that slice, and compares its outcome with that of the original code. Some of the techniques applied to sequential programs can be generalized to deal with multi-threaded programs. Of specific importance in this context are programs whose execution is organized as a data-parallel set of threads according to the Single-Program-Multiple-Data (SPMD) paradigm, since the vast majority of parallel scientific applications belong to this category [16].
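A slice-based detector of this kind might look as follows. The computation and its duplicated slice are invented for illustration; a real generator would emit the redundant code automatically from the analysis results:

```python
def guarded_result(xs):
    """Compute `result` and cross-check it against a recomputation of its slice."""
    # Original computation.
    scale = 2
    acc = 0
    for x in xs:
        acc += scale * x
    result = acc + 1

    # Redundant code generated from the slice of `result`: only the
    # statements that contribute to `result` are duplicated.
    acc2 = 0
    for x in xs:
        acc2 += 2 * x
    check = acc2 + 1

    if result != check:   # a transient fault disturbed one of the two copies
        raise RuntimeError("slice check failed for `result`")
    return result

print(guarded_result([1, 2, 3]))   # 13
```

A single SEU that corrupts a value in either copy makes the two outcomes diverge, so the comparison detects the fault, at the cost of roughly doubling the work on the slice.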
6 Related Work
The Remote Exploration and Experimentation (REE) [17] project conducted at NASA was among the first to consider putting a COTS-based parallel machine into space and address the resulting problems related to application-adaptive fault tolerance [18]. More recently, NASA's Millennium ST-8 project [10] developed a "Dependable Multiprocessor" around a COTS-based cluster using the IBM PowerPC 750FX as a data processor, with a Xilinx Virtex-II 6000 FPGA co-processor for the support of application-specific modules for digital signal processing, data compression, and vector processing. A centralized system controller for the cluster is implemented using a redundant configuration of radiation-hardened Motorola processors.
Some significant work has been done in the area of assertions. The EAGLE system [19] provides an assertion language with temporal constraints. The Design for Verification (D4V) system [20] uses dynamic assertions, which are objects with state that are constructed at design time and tied to program objects and locations. Language support for assertions and invariants has been provided in Java 1.4, in Eiffel for pre- and postconditions in the style of Hoare logic, and in the Java Modeling Language (JML). Intelligent resource management in an introspection-based approach has been proposed in [21]. Finally, the concept of introspection, as used in our work, has been outlined in [22]. A similar idea has been used by Iyer and co-workers for application-specific security [23], based on hardware modules embedded in a reliability and security engine.
7 Conclusion
This paper focused on software-provided fault tolerance for future deep-space missions providing an on-board COTS-based computing capability for the support of autonomy. We described the key features of an introspection framework for runtime monitoring, analysis, and feedback-oriented recovery, and outlined methods for the automatic generation of assertions that trigger key actions of the framework. A prototype version of the system was originally implemented on a cluster of Cell Broadband Engines; currently, an implementation effort is underway for the Tile64 system. Future work will address an extension of the introspection technology to performance tuning and power management. Furthermore, we will study the integration of introspection with traditional V&V.
Acknowledgment
This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration and funded through the internal Research and Technology Development program.
References
1. Lardner, D.: Babbage's Calculating Engine. Edinburgh Review (July 1834); Reprinted in Morrison, P., Morrison, E. (eds.): Charles Babbage and His Calculating Engines. Dover, New York (1961)
2. Avizienis, A., Laprie, J.C., Randell, B.: Fundamental Concepts of Dependability. Technical report, UCLA (2000) (CSD Report No. 010028)
Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing
3. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing 1(1) (January-March 2004)
4. Castano, R., Estlin, T., Anderson, R.C., Gaines, D.M., Castano, A., Bornstein, B., Chouinard, C., Judd, M.: OASIS: Onboard Autonomous Science Investigation System for Opportunistic Rover Science. Journal of Field Robotics 24(5), 379–397 (2007)
5. Tile64 Processor Family (2007), http://www.tilera.com
6. Shirvani, P.P.: Fault-Tolerant Computing for Radiation Environments. Technical Report 01-6, Center for Reliable Computing, Stanford University, Stanford, California 94305 (June 2001) (Ph.D. Thesis)
7. Lamport, L., Shostak, R., Pease, M.: The Byzantine Generals Problem. ACM Trans. Programming Languages and Systems 4(3), 382–401 (1982)
8. Aggarwal, N., Ranganathan, P., Jouppi, N.P., Smith, J.E.: Isolation in Commodity Multicore Processors. IEEE Computer 40(6), 49–59 (2007)
9. Li, M., Tao, W., Goldberg, D., Hsu, I., Tamir, Y.: Design and Validation of Portable Communication Infrastructure for Fault-Tolerant Cluster Middleware. In: Cluster 2002: Proceedings of the IEEE International Conference on Cluster Computing, p. 266. IEEE Computer Society, Washington (September 2002)
10. Samson, J., Gardner, G., Lupia, D., Patel, M., Davis, P., Aggarwal, V., George, A., Kalbarczyk, Z., Some, R.: High Performance Dependable Multiprocessor II. In: Proceedings 2007 IEEE Aerospace Conference, pp. 1–22 (March 2007)
11. James, M., Shapiro, A., Springer, P., Zima, H.: Adaptive Fault Tolerance for Scalable Cluster Computing in Space. International Journal of High Performance Computing Applications (IJHPCA) 23(3) (2009)
12. Zima, H.P., Chapman, B.M.: Supercompilers for Parallel and Vector Computers. ACM Press Frontier Series (1991)
13. Nielson, F., Nielson, H.R., Hankin, C.: Principles of Program Analysis. Springer, New York (1999)
14. Havelund, K., Goldberg, A.: Verify Your Runs.
In: Meyer, B., Woodcock, J. (eds.) VSTTE 2005. LNCS, vol. 4171, pp. 374–383. Springer, Heidelberg (2008)
15. Weiser, M.: Program Slicing. IEEE Transactions on Software Engineering 10, 352–357 (1984)
16. Strout, M.M., Kreaseck, B., Hovland, P.: Data Flow Analysis for MPI Programs. In: Proceedings of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2006) (June 2006)
17. Some, R., Ngo, D.: REE: A COTS-Based Fault Tolerant Parallel Processing Supercomputer for Spacecraft Onboard Scientific Data Analysis. In: Proceedings of the Digital Avionics Systems Conference, pp. 7.B.3-1–7.B.3-12 (1999)
18. Kalbarczyk, Z.T., Iyer, R.K., Bagchi, S., Whisnant, K.: Chameleon: A Software Infrastructure for Adaptive Fault Tolerance. IEEE Trans. Parallel Distrib. Syst. 10(6), 560–579 (1999)
19. Goldberg, A., Havelund, K., McGann, C.: Runtime Verification for Autonomous Spacecraft Software. In: Proceedings 2005 IEEE Aerospace Conference, pp. 507–516 (March 2005)
20. Mehlitz, P.C., Penix, J.: Design for Verification with Dynamic Assertions. In: Proceedings of the 2005 29th Annual IEEE/NASA Software Engineering Workshop, SEW 2005 (2005)
21. Kang, D.I., Suh, J., McMahon, J.O., Crago, S.P.: Preliminary Study toward Intelligent Run-time Resource Management Techniques for Large Multi-Core Architectures. In: Proceedings of the 2007 Workshop on High Performance Embedded Computing, HPEC 2007 (September 2007)
22. Zima, H.P.: Introspection in a Massively Parallel PIM-Based Architecture. In: Joubert, G.R. (ed.) Advances in Parallel Computing, vol. 13, pp. 441–448. Elsevier B.V., Amsterdam (2004)
23. Iyer, R.K., Kalbarczyk, Z., Pattabiraman, K., Healey, W., Hwu, W.M.W., Klemperer, P., Farivar, R.: Toward Application-Aware Security and Reliability. IEEE Security and Privacy 5(1), 57–62 (2007)
Maestro: Data Orchestration and Tuning for OpenCL Devices
Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter
Oak Ridge National Laboratory
{spaffordkl,jsmeredith,vetter}@ornl.gov
Abstract. As heterogeneous computing platforms become more prevalent, the programmer must account for complex memory hierarchies in addition to the difficulties of parallel programming. OpenCL is an open standard for parallel computing that helps alleviate this difficulty by providing a portable set of abstractions for device memory hierarchies. However, OpenCL requires that the programmer explicitly control data transfer and device synchronization, two tedious and error-prone tasks. This paper introduces Maestro, an open source library for data orchestration on OpenCL devices. Maestro provides automatic data transfer, task decomposition across multiple devices, and autotuning of dynamic execution parameters for some types of problems.
1 Introduction
In our previous work with general purpose computation on graphics processors (GPGPU) [1, 2], as well as a survey of similar research in the literature [3, 4, 5, 6, 7], we have encountered several recurring problems. First and foremost is code portability – most GPU programming environments have been proprietary, requiring code to be completely rewritten in order to run on a different vendor's GPU. With the introduction of OpenCL, the same kernel code (code which executes on the device) can generally be used on any platform, but must be "hand tuned" for each new device in order to achieve high performance. This manual optimization of code requires significant time, effort, and expert knowledge of the target accelerator's architecture.

Furthermore, the vast majority of results in GPGPU report performance for only single-GPU implementations, presumably due to the difficulty of task decomposition and load balancing, the process of breaking a problem into subtasks and dividing the work among multiple devices. These tasks require the programmer to know the relative processing capability of each device in order to appropriately partition the problem. If the load is poorly balanced, devices with insufficient work will be idle while waiting on those with larger portions of work to finish. Task decomposition also requires that the programmer carefully aggregate output data and perform device synchronization. This prevents OpenCL code from being portable – when moving to a platform with a different number of devices, or devices which differ in relative speed, work allocations must be adjusted.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 275–286, 2010. © Springer-Verlag Berlin Heidelberg 2010
K. Spafford, J. Meredith, and J. Vetter
Typically, GPUs or other compute accelerators are connected to the host processor via a bus. Many results reported in the literature focus solely on kernel execution and neglect to include the performance impact of data transfer across the bus. Poor management of this interconnection bus is the third common problem we have identified. The bandwidth of the bus is almost always lower than the memory bandwidth of the device, and suboptimal use of the bus can have drastic consequences for performance. In some cases, the difference in bandwidth can be more than an order of magnitude. Consider the popular NVIDIA Tesla C1060, which has a peak memory bandwidth of 102 gigabytes per second. It is usually connected to a host processor via a sixteen-lane PCIe 2.0 bus, with peak bandwidth of only eight gigabytes per second.

One common approach to using the bus is the function offload model. In this model, sequential portions of an application execute on the host processor. When a parallel section is reached, input data is transferred to the accelerator. When the accelerator is finished computing, outputs are transferred back to the host processor. This approach is the simplest to program, but the worst case for performance. It poorly utilizes system resources – the bus is never active at the same time as the accelerator.

In order to help solve these problems, we have developed an open source library called Maestro. Maestro leverages a combination of autotuning, multibuffering, and OpenCL's device interrogation capabilities in an attempt to provide a portable solution to these problems. Further, we argue that Maestro's automated approach is the only practical solution, since the parameter space for hand tuning OpenCL applications is enormous.
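The gap between bus and device bandwidth can be made concrete with a back-of-the-envelope model. The 8 GB/s and 102 GB/s figures come from the discussion above; the 256 MB working set is an assumed example size, not a number from the paper:

```c
#include <assert.h>

/* Estimated time in milliseconds to move `bytes` at a sustained rate
   of `gb_per_s` gigabytes per second. */
static double transfer_ms(double bytes, double gb_per_s) {
    return bytes / (gb_per_s * 1e9) * 1e3;
}

/* For an assumed 256 MB buffer:
     over 16-lane PCIe 2.0 (~8 GB/s):    transfer_ms(256e6,   8.0) ~ 32 ms
     from Tesla C1060 memory (102 GB/s): transfer_ms(256e6, 102.0) ~ 2.5 ms
   One pass over the bus costs more than twelve passes over device
   memory, which is why hiding bus traffic behind computation matters. */
```
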
1.1 OpenCL
In December 2008, the Khronos Group introduced OpenCL [8], an open standard for parallel computing on heterogeneous platforms. OpenCL specifies a language, based on C99, that allows a programmer to write parallel functions called kernels which can execute on any OpenCL device, including CPUs, GPUs, or any other device with a supporting implementation. OpenCL provides support for data parallelism as well as task parallelism. OpenCL also provides a set of abstractions for device memory hierarchies and an API for controlling memory allocation and data transfer. In OpenCL, parallel kernels are divided into tens of thousands of work items, which are organized into local work groups. For example, in matrix multiplication, a single work item might calculate one entry in the solution matrix, and a local work group might calculate a submatrix of the solution matrix.
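The decomposition into work groups and work items can be illustrated with a plain-C emulation of matrix multiplication. The loop structure mirrors how OpenCL derives a global id from a group id and a local id, but this is an emulation, not actual OpenCL API code, and the sizes N and LOCAL are arbitrary examples:

```c
#include <assert.h>

/* Plain-C stand-in for an OpenCL NDRange: one "work item" computes
   one entry of C = A * B, and work items are grouped into
   LOCAL x LOCAL "work groups", each covering a submatrix of C. */
#define N 8
#define LOCAL 4

static void matmul(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int gy = 0; gy < N / LOCAL; gy++)        /* work-group ids       */
        for (int gx = 0; gx < N / LOCAL; gx++)
            for (int ly = 0; ly < LOCAL; ly++)    /* local ids in group   */
                for (int lx = 0; lx < LOCAL; lx++) {
                    int row = gy * LOCAL + ly;    /* global id, dim 1     */
                    int col = gx * LOCAL + lx;    /* global id, dim 0     */
                    float acc = 0.0f;             /* one work item's job  */
                    for (int k = 0; k < N; k++)
                        acc += A[row][k] * B[k][col];
                    C[row][col] = acc;
                }
}
```

In a real OpenCL kernel, the four outer loops disappear: the runtime launches one work item per (row, col) pair and the kernel body reads its indices from the built-in id functions.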
2 Overview
Before proceeding to Maestro’s proposed solutions to the observed problems, it is important to introduce one of the key ideas in Maestro’s design philosophy, a single, high level task queue.
Maestro: Data Orchestration and Tuning for OpenCL Devices
2.1 High Level Queue
In the OpenCL task queue model, the programmer must manage a separate task queue for each GPU, CPU, or other accelerator in a heterogeneous platform. This model requires that the programmer has detailed knowledge about which OpenCL devices are available. Modifications to the code are required to obtain high performance on a system with a different device configuration. The Maestro model (contrasted in Figure 1) unifies the disparate, device-specific queues into a single, high-level task queue. At runtime, Maestro queries OpenCL to obtain information about the available GPUs or other accelerators in a given system. Based on this information, Maestro can transfer data and divide work among the available devices automatically. This frees the programmer from having to synchronize multiple devices and keep track of device-specific information.
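A minimal sketch of this queue unification, with hypothetical type and function names (not Maestro's actual API). Maestro's real dispatcher divides work in proportion to measured device rates, as described later; this sketch simply splits work evenly across however many devices were discovered at runtime:

```c
#include <assert.h>

#define MAX_DEVICES 8

typedef struct { int pending; } DeviceQueue;

typedef struct {
    int         ndev;               /* discovered at runtime, not coded in */
    DeviceQueue dev[MAX_DEVICES];   /* one low-level queue per device      */
} HighLevelQueue;

/* The caller enqueues into ONE logical queue; the dispatcher spreads
   `items` work items across the device queues (remainder to device 0). */
static void enqueue(HighLevelQueue *q, int items) {
    int share = items / q->ndev;
    for (int d = 0; d < q->ndev; d++)
        q->dev[d].pending += share;
    q->dev[0].pending += items % q->ndev;
}
```
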
Fig. 1. Task Queue Hierarchies–In the OpenCL task queue hierarchy, the programmer must manage a separate task queue for each device. Maestro unifies these into a single, high-level queue which is independent of the underlying hardware.
3 Problem: Code Portability
OpenCL’s claim to portability relies on its ability to execute kernel code on any device with a supporting implementation. While this represents a substantial improvement over proprietary programming environments, some obstacles to portability remain. One such obstacle is the organization of local work items. What is the appropriate local work group size for a given kernel? With current hardware, local work items roughly correspond to device threads. Hence, for GPUs, the rule of thumb is to start with a sufficiently high multiple of sixteen, e.g. 128 or 256. However, this heuristic does not guarantee a kernel will execute successfully, much less exhibit high performance. For example, the OpenCL implementation in Mac OS X imposes an upper limit on local group size of one for code to execute on a CPU. Also, while larger group sizes often lead to better performance, if the kernel is strongly constrained by either register or local memory usage, it may simply fail to execute on a GPU due to lack of resources.
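One way to make these constraints concrete is a small helper that falls back from a preferred group size until the device's reported limit is satisfied. This is an illustrative sketch, not Maestro's API; the name pick_group_size and the halving policy are assumptions:

```c
#include <assert.h>

/* Start from a preferred local work-group size (a multiple of 16, per
   the GPU rule of thumb above) and halve it until it does not exceed
   the device's maximum, e.g. 1 for the Mac OS X CPU case. */
static int pick_group_size(int preferred, int device_max) {
    int size = preferred;
    while (size > device_max && size > 1)
        size /= 2;
    return size;
}
```

Note that this only guarantees the kernel can launch; as the text explains, a launchable size may still be far from the best-performing one, which is what the autotuning below addresses.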
3.1 Proposed Solution: Autotuning
Since OpenCL can execute on devices which differ so radically in architecture and computational capabilities, it is difficult to develop simple heuristics with strong performance guarantees. Hence, Maestro's optimizations rely solely on empirical data, instead of any performance model or a priori knowledge. Maestro's general strategy for all optimizations can be summarized by the following steps:

1. Estimate based on benchmarks
2. Collect empirical data from execution
3. Optimize based on results
4. While performance continues to improve, repeat steps 2-3
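The steps above can be sketched as a measure-and-adapt loop. Here run() is a mock stand-in for a timed kernel execution (Maestro measures real executions), and the hill-climbing policy over a single integer parameter is an illustrative assumption, not the library's documented algorithm:

```c
#include <assert.h>

/* Mock timed execution: pretend the runtime is quadratic in the
   distance from an unknown best parameter value of 6. */
static double run(int param) {
    double d = param - 6;
    return 100.0 + d * d;
}

/* Step 1 supplies `estimate`; steps 2-4 try neighboring values and
   adopt them while the measured runtime keeps improving. */
static int tune(int estimate) {
    int best = estimate;
    double best_t = run(best);
    for (;;) {
        int trial = (run(best - 1) < run(best + 1)) ? best - 1 : best + 1;
        double t = run(trial);
        if (t >= best_t)
            break;                 /* no further improvement: stop */
        best = trial;
        best_t = t;
    }
    return best;
}
```
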
This strategy is used to optimize a variety of parameters including local work group size, data transfer size, and the division of work among multiple devices. However, these dynamic execution parameters are only one of the obstacles to true portability. Another obstacle is the choice of hardware specific kernel optimizations. For instance, some kernel optimizations may result in excellent performance on a GPU, but reduce performance on a CPU. This remains an open problem. Since the solution will no doubt involve editing kernel source code, it is beyond the scope of Maestro.
4 Problem: Load Balancing
In order to effectively distribute a kernel among multiple OpenCL devices, a programmer must keep in mind, at a minimum, each device's relative performance on that kernel, the speed of the interconnection bus between host processor and each device (which can be asymmetric), a strategy for input data distribution to devices, and a scheme on how to synchronize devices and aggregate output data. Given that an application can have many kernels, which can vary significantly in performance characteristics (bandwidth bound, compute bound, etc.), it quickly becomes impractical to tune optimal load balancing for every task by hand.

4.1 Proposed Solution: Benchmarks and Device Interrogation
At install time, Maestro uses benchmarks and the OpenCL device interrogation API to characterize a system. Peak FLOPS, device memory bandwidth, and bus bandwidth are measured using benchmarks based on the Scalable Heterogeneous Computing (SHOC) Benchmark Suite [9]. The results of these benchmarks serve as the basis, or initial estimation, for the optimization of the distribution of work among multiple devices. As a kernel is repeatedly executed, either in an application or Maestro’s offline tuning methods, Maestro continues to optimize the distribution of work. After each iteration, Maestro computes the average rate at which each device completes work items, and updates a running, weighted average. This rate is specific to each device and kernel combination, and is a practical way to measure many interacting factors for performance. We examine the convergence to an optimal distribution of work in Section 6.2.
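A sketch of this per-device bookkeeping follows. The weighting factor ALPHA is a hypothetical choice, since the text does not specify the weights of the running average:

```c
#include <assert.h>

#define ALPHA 0.5   /* weight given to the newest measurement (assumption) */

/* After each execution, fold the measured completion rate (work items
   per second) into the running weighted average for this device+kernel. */
static double update_rate(double avg, int items, double seconds) {
    double measured = items / seconds;
    return ALPHA * measured + (1.0 - ALPHA) * avg;
}

/* Device d's share of `total` work items for the next iteration,
   proportional to its rate relative to the sum over all devices. */
static int share(double rate_d, double rate_sum, int total) {
    return (int)(total * rate_d / rate_sum);
}
```

For example, a device averaging 100 items/s that just completed 300 items in 2 s moves to an average of 125 items/s, and with a rate sum of 500 across all devices it would receive a quarter of the next iteration's work.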
5 Problem: Suboptimal Use of Interconnection Bus
OpenCL devices are typically connected to the host processor via a relatively slow interconnection bus. With current hardware, this is normally the PCIe bus. Since the bandwidth of this bus is dramatically lower than a GPU's memory bandwidth, it introduces a nontrivial amount of overhead.

5.1 Proposed Solution: Multibuffering
In order to minimize this overhead, Maestro attempts to overlap computation and communication as much as possible. Maestro leverages and extends the traditional technique of double buffering (also known as ping-pong buffering).
Fig. 2. Double Buffering–This figure contrasts the difference between (a) the function offload model and (b) a very simple case of double buffering. Devices which can concurrently execute kernels and transfer data are able to hide some communication time with computation.
Figure 2 illustrates the difference between the function offload model and double buffered execution. Maestro implements concurrent double buffering to multiple devices, including optimization of the data chunk size, which we term multibuffering. In Maestro’s implementation of multibuffering, the initial data chunk size is set to the size that resulted in the maximum bus bandwidth measured by benchmarks at install time. Maestro then varies the chunk size and optimizes based on observed performance. However, double buffering cannot be used in all cases. Some OpenCL platforms simply lack the support for concurrent data copy and execution. Furthermore, some algorithms are not practical for use with double buffering. Consider an algorithm which accesses input data randomly. A work item might require data at the end of an input buffer which has not yet been transferred to the accelerator, resulting in an error. In order to accommodate this class of algorithms, Maestro allows the programmer to place certain inputs in a universal buffer, which is copied to all devices before execution begins. While this does
limit the availability of some performance optimizations, it greatly expands the number of algorithms which can be supported by Maestro.
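The benefit of overlapping can be captured in a simple timing model. This is arithmetic only, not real OpenCL calls: it treats transfer and compute as two pipeline stages over equal-sized chunks, applicable only on platforms that, as noted above, support concurrent copy and execution:

```c
#include <assert.h>

/* Function offload model: transfer everything, then run the kernel. */
static double offload_time(double xfer, double comp) {
    return xfer + comp;
}

/* Double/multibuffering over `chunks` equal pieces: after the first
   chunk arrives, each later transfer hides behind the previous chunk's
   kernel, so every overlapped step costs only the slower stage. */
static double buffered_time(double xfer, double comp, int chunks) {
    double tx = xfer / chunks, tc = comp / chunks;
    double step = (tx > tc) ? tx : tc;
    return tx + tc + (chunks - 1) * step;  /* fill, drain, overlap */
}
```

With assumed figures of 4 ms total transfer and 8 ms total kernel time split into four chunks, the model gives 9 ms instead of the offload model's 12 ms; the compute-bound stage dominates, and all but the first chunk's transfer is hidden.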
6 Results

6.1 Experimental Testbeds
OpenCL Limitations. Since OpenCL is still a nascent technology, early software implementations impose several restrictions on the composition of test platforms. First, it is not possible to test a system with GPUs from different vendors due to driver and operating system compatibility issues. Second, CPU support is not widely available. As such, we attempt to provide results from a comprehensive selection of devices, including platforms with homogeneous GPUs, heterogeneous GPUs, and with an OpenCL-supported CPU and GPU.

Host Configurations

– Krakow. Krakow is a dual-socket Nehalem-based system, with a total of eight cores running at 2.8 GHz with 24GB of RAM. Krakow also features an NVIDIA Tesla S1070, configured to use two Tesla T10 processors connected via a sixteen-lane PCIe v2.0 bus. Results are measured using NVIDIA's GPU Computing SDK version 3.0.
– Lens. Lens is a medium-sized cluster primarily used for data visualization and analysis. Its thirty-two nodes are connected via Infiniband, with each node containing four AMD quad-core Barcelona processors with 64GB of RAM. Each node also has two GPUs – one NVIDIA Tesla C1060 and one NVIDIA GeForce 8800GTX, connected to the host processor over a PCIe 1.0 bus with sixteen active lanes. Lens runs Scientific Linux 5.0, and results were measured using NVIDIA's GPU Computing SDK, version 2.3.
– Lyon. Lyon is a dual-socket, single-core 2.0 GHz AMD Opteron 246 system with a 16-lane PCIe 1.0 bus and 4GB of RAM, housing an ATI Radeon HD 5870 GPU. It runs Ubuntu 9.04 and uses the ATI Stream SDK 2.0 with the Catalyst 9.12 Hotfix 8.682.2RC1 driver.

Graphics Processors

– NVIDIA G80 Series. The NVIDIA G80 architecture combined the vertex and pixel hardware pipelines of traditional graphics processors into a single category of cores, all of which could be tasked for general-purpose computation if desired. The NVIDIA 8800GTX has 128 processor cores split among sixteen multiprocessors.
These cores run at 1.35 GHz, and are fed from 768MB of GDDR3 RAM through a 384-bit bus.
– NVIDIA GT200 Series. The NVIDIA Tesla C1060 graphics processor comprises thirty streaming multiprocessors, each of which contains eight stream processors for a total of 240 processor cores clocked at 1.3 GHz. Each multiprocessor has 16KB of shared memory, which can be accessed as quickly as a register under certain access patterns. The Tesla C1060 has 4GB of global memory and supplementary cached constant and texture memory.
Table 1. Comparison of Graphics Processors

GPU              Peak FLOPS (GF)  Mem. Bandwidth (GB/s)  Processors (#)  Clock (MHz)  Memory (MB)
Tesla C1060/T10              933                    102             240         1300         4096
GeForce 8800GTX              518                     86             128         1350          768
Radeon HD5870               2720                    153            1600          850         1024
– ATI Evergreen Series. In ATI’s “Terascale Graphics Engine” architecture, Stream processors are divided into groups of eighty, which are collectively known as SIMD cores. Each SIMD core contains four texture units, an L1 cache, and has its own control logic. SIMD cores can communicate with each other via an on-chip global data share. We present results from the Radeon HD5870 (Cypress XT) which has 1600 cores.
6.2 Test Kernels
We have selected the following five test kernels to evaluate Maestro. These kernels range in both complexity and performance characteristics. In all results, the same kernel code is used on each platform, although the problem size is varied. As such, cross-machine results are not directly comparable, and are instead presented in normalized form.

– Vector Addition. The first test kernel is the simple addition of two one-dimensional vectors, C ← A + B. This kernel is very simple and strongly bandwidth bound. Both input and output vectors can be multibuffered.
– Synthetic FLOPS. The synthetic FLOPS kernel maintains the simplicity of vector addition, but adds in an extra constant, K: C ← A + B + K. K is computed using a sufficiently high number of floating point operations to make the kernel compute bound.
– Vector Outer Product. The vector outer product kernel, u ⊗ v, takes two input vectors of length n and m, and creates an output matrix of size n × m. The outer product reads little input data compared to the generated output, and does not support multibuffering on any input.
– Molecular Dynamics. The MD test kernel is a computation of the Lennard-Jones potential from molecular dynamics. It is a strongly compute bound, O(n²) algorithm, which must compare each pair of atoms to compute all contributions to the overall potential energy. It does not support multibuffering on all inputs.
– S3D. We also present results from the key portion of S3D's Getrates kernel. S3D is a computational chemistry application optimized for GPUs in our previous work [2]. This kernel is technically compute bound, but also consumes seven inputs, making it the most balanced of the test kernels.
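For reference, the outer product's access pattern in plain C (the actual kernel is OpenCL; this serial version just shows why n + m input values expand into n × m outputs, making the kernel output-dominated):

```c
#include <assert.h>

/* u (x) v: every pair (u[i], v[j]) produces one output entry, so the
   output, stored row-major in `out`, has n * m entries from only
   n + m inputs. */
static void outer(const float *u, int n, const float *v, int m,
                  float *out /* n * m entries */) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            out[i * m + j] = u[i] * v[j];
}
```
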
Fig. 3. Autotuning the local work group size – This figure shows the performance of the MD kernel on various platforms at different local work group sizes, normalized to the performance at a group size of 16. Lower runtimes are better.
Local Tuning Results. Maestro’s capability for autotuning of local work group size is shown using the MD kernel in Figure 3. All runtimes are shown in normalized fashion, in this case as a percentage of the runtime on each platform with a local work group size of 16 (the smallest allowable on several devices). The optimal local work group size is highlighted for each platform. Note the variability and unpredictability of performance due to the sometimes competing demands of register pressure, memory access patterns, and thread grouping. These results indicate that a programmer will be unlikely to consistently determine an optimal work group size at development time. By using Maestro’s autotuning capability, the developer can focus on writing the kernel code, not on the implications of local work group size on correctness and performance portability. Multibuffering Results. Maestro’s effectiveness when overlapping computation with communication can be improved by using an optimal buffer chunk size. Figure 4 shows Maestro’s ability to auto-select the best buffer size on each platform. We observe in the vector outer product kernel one common situation, where the largest buffer size performs the best. Of course, the S3D kernel results show that this is not always the case; here, a smaller buffer size is generally better. However, note that on Krakow with two Tesla S1070 GPUs, there is an asymmetry between the two GPUs, with one preferring larger and one preferring smaller buffer sizes. This result was unusual enough to merit several repeated experiments for verification. Again, this shows the unpredictability of performance, even with what appears to be consistent hardware, and highlights the need for autotuning.
Fig. 4. Autotuning the buffer chunk size – This figure shows the performance on the (a) vector outer product and (b) S3D kernels when splitting the problem into various size chunks and using multibuffering. Lower runtimes are better. Values are normalized to the runtime at the 256kB chunk size on each platform.
Multi-GPU Results. One of Maestro's strengths is its ability to automatically partition computation between multiple devices. To determine the proportion of work for each device, it initially uses an estimate based on benchmarks run at install time, but will quickly iterate to an improved load distribution based on the imbalance for a specific kernel. Figure 5 shows for the S3D and MD kernels the proportion of total time spent on each device. Note that there is generally an initial load imbalance which can be significant, and that even well balanced hardware is not immune. Maestro's ability to automatically detect and account for load imbalance makes efficient use of the resources available on any platform.

Combined Results. Maestro's autotuning has an offline and an online component. At install time, Maestro makes an initial guess for local work group size, buffering chunk size, and workload partitioning for all kernels based on values which are measured using benchmarks. However, Maestro can do much better, running an autotuning process to optimize all of these factors, often resulting in significant improvements. Figure 6 shows the results of Maestro's autotuning for specific kernels relative to its initial estimate for these parameters. In (a) we see the single-GPU results, showing the combined speedup both from tuning the local work group size and applying double buffering with a tuned chunk size, showing improvement up to 1.60×. In (b) we see the multi-GPU results, showing the combined speedup both from tuning the local work group size and applying a tuned workload partitioning, showing speedups of up to 1.8×.

This autotuning can occur outside full application runs. Kernels of particular interest can be placed in a unit test and executed several times to provide Maestro with performance data (measured internally via OpenCL's event API) for coarse-grained adjustments. This step is not required, since the same optimizations can
Fig. 5. Autotuning the load balance – This figure shows the load imbalance on the (a) S3D and (b) MD kernels, both before and after tuning the work distribution for the specific kernel. Longer striped bars show a larger load imbalance.
Fig. 6. Combined autotuning results – (a) Shows the combined benefit of autotuning both the local work group size and the double buffering chunk size for a single GPU of the test platforms. (b) Shows the combined benefit of autotuning both the local work group size and the multi-GPU load imbalance using both devices (GPU+GPU or GPU+CPU) of the test platforms. Longer bars are better.
be performed online, but reduces the number of online kernel executions with suboptimal performance.
7 Related Work
An excellent overview of the history of GPGPU is given in [10]. Typically, work in this area has been primarily focused on case studies, which describe the process of accelerating applications or algorithms which require extremely
high performance [3, 4, 5, 6, 7, 1, 2]. These applications are typically modified to use graphics processors, STI Cell, or field programmable gate arrays (FPGAs). These studies serve as motivation for Maestro, as many of them help illustrate the aforementioned common problems. Autotuning on GPU-based systems is beginning to gain some popularity. For example, Venkatasubramanian et al. have explored autotuning stencil kernels for multi-CPU and multi-GPU environments [11]. Maestro is distinguished from this work because it uses autotuning for the optimization of data transfers and execution parameters, rather than the kernel code itself.
8 Conclusions
In this paper, we have presented Maestro, a library for data orchestration and tuning on OpenCL devices. We have shown a number of ways in which achieving the best performance, and sometimes even correctness, is a daunting task for programmers. For example, we showed that the choice of a viable, let alone optimal, local work group size for OpenCL kernels cannot be accomplished with simple rules of thumb. We showed that multibuffering, a technique nontrivial to incorporate in OpenCL code, is further complicated by the problem- and device-specific nature of choosing an optimal buffer chunk size. And we showed that even in what appear to be well-balanced hardware configurations, load balancing between multiple GPUs can require careful division of the workload. Combined, this leads to a space of performance and correctness parameters which is immense. By not only supporting double buffering and problem partitioning for existing OpenCL kernels, but also applying autotuning techniques to find the high performance areas of this parameter space with little developer effort, Maestro leads to improved performance, improved program portability, and improved programmer productivity.
Acknowledgements
This manuscript has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
References
[1] Meredith, J.S., Alvarez, G., Maier, T.A., Schulthess, T.C., Vetter, J.S.: Accuracy and Performance of Graphics Processors: A Quantum Monte Carlo Application Case Study. Parallel Computing 35(3), 151–163 (2009)
[2] Spafford, K.L., Meredith, J.S., Vetter, J.S., Chen, J., Grout, R., Sankaran, R.: Accelerating S3D: A GPGPU Case Study. In: HeteroPar 2009: Proceedings of the Seventh International Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Platforms (2009)
[3] Rodrigues, C.I., Hardy, D.J., Stone, J.E., Schulten, K., Hwu, W.M.W.: GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications. In: CF 2008: Proceedings of the 2008 Conference on Computing Frontiers, pp. 273–282. ACM, New York (2008)
[4] He, B., Govindaraju, N.K., Luo, Q., Smith, B.: Efficient Gather and Scatter Operations on Graphics Processors. In: SC 2007: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12. ACM, New York (2007)
[5] Fujimoto, N.: Faster Matrix-Vector Multiplication on GeForce 8800GTX. In: IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, pp. 1–8 (April 2008)
[6] Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. In: ACM SIGGRAPH 2003, pp. 917–924. ACM, New York (2003)
[7] Stone, J.E., Phillips, J.C., Freddolino, P.L., Hardy, D.J., Trabuco, L.G., Schulten, K.: Accelerating Molecular Modeling Applications With Graphics Processors. Journal of Computational Chemistry 28, 2618–2640 (2005)
[8] The Khronos Group (2009), http://www.khronos.org/opencl/
[9] Danalis, A., Marin, G., McCurdy, C., Meredith, J., Roth, P., Spafford, K., Tipparaju, V., Vetter, J.: The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. In: Proceedings of the Third Annual Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU 2010). ACM, New York (2010)
[10] Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., Phillips, J.: GPU Computing. Proceedings of the IEEE 96(5), 879–899 (2008)
[11] Venkatasubramanian, S., Vuduc, R.W.: Tuned and Wildly Asynchronous Stencil Kernels for Hybrid CPU/GPU Systems. In: ICS 2009: Proceedings of the 23rd International Conference on Supercomputing, pp. 244–255. ACM, New York (2009)
Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software

Xin Dong¹, Gene Cooperman¹, and John Apostolakis²
¹ College of Computer Science, Northeastern University, Boston, MA 02115, USA
{xindong,gene}@ccs.neu.edu
² PH/SFT, CERN, CH-1211, Geneva 23, Switzerland
[email protected]

Abstract. This work presents an application case study. Geant4 is a 750,000-line toolkit first designed in the mid-1990s and originally intended only for sequential computation. Intel's promise of an 80-core CPU meant that Geant4 users would have to struggle in the future with 80 processes on one CPU chip, each one having a gigabyte memory footprint. Thread parallelism would be desirable. A semi-automatic methodology to parallelize the Geant4 code is presented in this work. Our experimental tests demonstrate linear speedup in a range from one thread to 24 on a 24-core computer. To achieve this performance, we needed to write a custom, thread-private memory allocator, and to detect and eliminate excessive cache misses. Without these improvements, there was almost no performance improvement when going beyond eight cores. Finally, in order to guarantee the run-time correctness of the transformed code, a dynamic method was developed to capture possible bugs and either immediately generate a fault, or optionally recover from the fault.
1 Introduction
The number of cores on a CPU chip is currently doubling every two years, in a manner consistent with Moore's Law. If sequential software has a working set that is larger than the CPU cache, then running a separate copy of the software for each core has the potential to present immense memory pressure on the bus to memory. It is doubtful that the memory pressure will remain manageable as the number of cores on a CPU chip continues to double. This work presents an application case study concerned with just this issue, with respect to Geant4 (GEometry ANd Tracking, http://geant4.web.cern.ch/). Geant4 was developed over about 15 years by physicists around the world, using the Booch software engineering methodology. The widest use of Geant4 is for Monte Carlo simulation and analysis of experiments at the LHC collider in Geneva. In some of the larger experiments, such as CMS [1], software applications can grow to a two-gigabyte footprint that includes hundreds of dynamic libraries (.so files). In addition to collider experiments, Geant4 is used for radiation-based medical applications [2], for cosmic ray simulations [3], and for space and radiation simulations [4].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 287–303, 2010. © Springer-Verlag Berlin Heidelberg 2010
Geant4 is a package with 750,000 lines of C++ code spread over 6,000 files. It is a toolkit with deep knowledge of the physics of particle tracks. Given the geometry, the corresponding materials and the fundamental particles, a Geant4 simulation is driven by randomly generated independent events. Within a loop, each event is simulated in sequence. The corresponding computation for each event is organized into three levels: event generation and result aggregation; tracking in each event; and stepping on each track. Geant4 stepping is governed by physics processes, which specify particle and material interaction. The Geant4 team of tens of physicists issues a new release every six months. Few of those physicists have experience writing thread-parallel code. Hence, a manual rewriting of the entire Geant4 code base for thread parallelism was not possible. Geant4 uses an event-loop programming style that lends itself to a straightforward thread parallelization. The parallelization also required adding the __thread storage-class keyword (a common C/C++ compiler extension for thread-local storage) to most of the files in Geant4. As described in Section 3.1, an automatic way to add this thread parallelism was developed by modifying the GNU C++ parser. Section 3.2 then describes a further step to reduce the memory footprint. This thread-parallel implementation of Geant4 is known as Geant4MT. However, this intermediate Geant4MT was found not to be scalable. When scaling to 24 cores, two important performance drains were found: memory allocation and writes to shared variables.

Custom memory allocator. First, none of the standard memory allocators scale properly when used in Geant4. Some of the allocators tried include the glibc default malloc (ptmalloc2) [5], tcmalloc [6], ptmalloc3 [5] and hoard [7]. This is because the malloc standard requires the use of a shared memory data structure so that any thread can free the memory allocated by any other thread. Yet most of the Geant4 allocations are thread-private.
The number of futex calls in Geant4MT provided the final evidence of the importance of a thread-private custom memory allocator. We observed the excessive number of futex calls (the Linux kernel primitive underlying mutexes) to disappear completely after introducing our thread-private allocator.

Writes to shared variables. The second important drain occurs due to excessive writes to shared variables. This drain occurs even when the working set is small. Note that the drain on performance makes itself known as excessive cache misses when measuring performance using performance counters. However, this is misleading. The real issue is the particular cache misses caused by a write to a shared variable. Even if the shared variable write is a cache hit, all threads that include this shared variable in their active working set will eventually experience a read/write cache miss. This is because, on the high-performance machines on which we tested, there are four CPU chips on the motherboard and no off-chip cache. So, a write by one of the threads forces the chip-set logic to invalidate the corresponding cache lines of the other three CPU chips. Thus, a single write eventually forces three subsequent L3 cache misses, one in each of the other three chips. The need to understand this unexpected behavior was a major source of the delay in making Geant4 fully scalable. The interaction with the malloc issue
above initially masked this second performance drain. It was only after solving the malloc issue, and then building a mechanism to track down the shared variables most responsible for the cache misses, that we were able to confirm the above working hypothesis. The solution was then quite simple: eliminate unnecessary sharing of writable variables.

Interaction of memory allocator and shared writable variables. As a result of this work, we were able to conclude that the primary reason the standard memory allocators suffered degraded performance was likely not the issue of excessive futex calls. Instead, we now argue that it was due to writes to shared variables within the allocator implementations: our back-of-the-envelope calculations indicated that there were not enough futex calls to account for the excessive performance drain! We considered four widely used memory allocators, along with our own customized allocator, on a malloc/free-intensive toy program. Surprisingly, we observed a parallel slowdown for each of the four memory allocators: when the number of threads was increased from 8 to 16 on a 16-core computer, execution became slower!

Reasoning for correctness of Geant4MT. The domain experts are deeply concerned about the correctness of Geant4MT. Yet it challenges existing formal methods. First, the ubiquitous callback mechanism and the C++ virtual member functions defined in Geant4 resist static methods: which part of the code will be touched is determined dynamically at run time. Second, the memory footprint of large Geant4 applications is huge, rendering dynamic methods interminable. For example, Helgrind [8] makes the data initialization too slow to finish for a representative large Geant4 application. Because Geant4MT, like Geant4, is a toolkit with frequent callbacks to end-user code, we relax the correctness requirements.
It is not possible with today's technology to fully verify Geant4MT in the context of arbitrary user callbacks. Hence, we content ourselves with enhancements to verify the correctness of production runs. In particular, we enforce the design assumption that "shared application data is never changed while parallel computation happens". A run-time tool was developed to verify this condition. This tool also allows the application to coordinate the threads so as to avoid data races when updates to shared variables occur unexpectedly.

Experience with Geant4MT. Geant4MT represents a development effort of two and a half years. This effort has now yielded experimental results showing linear speedup both on a 24-core Intel computer (four 6-core Nehalem-class CPU chips) and on a 16-core AMD computer (four 4-core Barcelona-class CPU chips). The methodology presented here is recommended because it compresses these two and a half years of work into a matter of three days for a new version of Geant4. By using the tools developed as part of this work, along with the deeper understanding of the systems issues, we estimate the time for a new project to be the same three days, plus the time to understand the structure of the new software and create an appropriate policy about what objects should be shared,
while respecting the original software design. The contributions of this work are four-fold. It provides:

1. a semi-automatic way to transform C++ code into working thread-parallel code;
2. a thread-private malloc library scalable for intensive concurrent heap accesses to transient objects;
3. the ability to automatically attribute frequent sources of cache misses to particular variables; and
4. a dynamic method to guarantee the run-time correctness of the thread-parallel program.

One additional novelty in Section 3.4 is an analytical formula that predicts the number of updates to shared variables by all threads, based on measurements of the number of cache misses. The number of shared-variable updates is important because it has been identified as one of the two major sources of performance degradation. The experimental section (Section 4) confirms the high accuracy of this formula. The rest of this paper is organized as follows. Section 2 introduces Geant4 along with some earlier work on parallelization for clusters. Section 3 explains our multithreading methodology and describes the implementation of the multithreading tools. Section 4 evaluates the experimental results. We review related work in Section 5 and conclude in Section 6.
2 Geant4 and Parallelization

2.1 Geant4: Background

Detectors, as used to detect, track, and/or identify high-energy particles, are ubiquitous in experimental and applied particle physics, nuclear physics, and nuclear engineering. Modern detectors are also used as calorimeters to measure the energy of the detected radiation. As detectors become larger, more complex and more sensitive, commensurate computing capacity is increasingly demanded for large-scale, accurate and comprehensive simulations of detectors. This is because simulation results dominate the design of modern detectors. A similar requirement also exists in space science and nuclear medicine, where particle-matter interaction plays a key role. Geant4 [9,10] responds to this challenge by implementing and providing a "diverse, wide-ranging, yet cohesive set of software components for a variety of settings" [9]. Beginning with the first production release in 1998, the Geant4 collaboration has continued to refine, improve and enhance the toolkit towards more sophisticated simulations. With abundant physics knowledge, the Geant4 toolkit has modelling support for such concepts as secondary particles and secondary tracks (for example, through radioactive decay), the effect of electromagnetic fields, unusual geometries and layouts of particle detectors, and aggregation of particle hits in detectors.
2.2 Prior Distributed Memory Parallelizations of Geant4
As a Monte Carlo simulation toolkit, Geant4 profits from improved throughput via parallelism derived from independence among modelled events and their computation. Therefore, researchers have adopted two methods for parallel simulation in the era of computer clusters. The first method is a standard parameter sweep: Each node of the cluster runs a separate instance of Geant4 that is given a separate set of input events to compute. The second method is that of ParGeant4 [11]. ParGeant4 uses a master-worker style of parallel computing on distributed-memory multiprocessors, implemented on top of the open source TOP-C package (Task Oriented Parallel C/C++) [12]. Following master-worker parallelism, each event is dispatched to the next available worker, which leads to dynamic load-balancing on workers. Prior to dispatching the events, each worker does its own initialization of global data structures. Since then, many-core computing has gained an increasing presence in the landscape. For example, Intel presented a roadmap including an 80-core CPU chip for the future. It was immediately clear that 80 Geant4-based processes, each with a footprint of more than a gigabyte, would never work, due to the large memory pressure on a single system bus to memory. A potentially simple solution takes advantage of UNIX copy-on-write semantics to enhance the sharing of data further by forking the child processes after Geant4 initializes its data. However, the natural object-oriented style of programming in Geant4 encourages a memory layout in which all fields of an object are placed on the same memory page. If just one field of the object is written to, then the entire memory page containing that object will no longer be shared. Hence the copy-on-write approach with forked processes was rejected as insufficiently scalable.
3 Geant4MT Methodology and Tools
Geant4MT follows the same event-level parallelism as the prior distributed-memory parallelizations. The goal of Geant4MT is: given a computer with k cores, we wish to replace k independent copies of the Geant4 process with an equivalent single process with k threads, which together use the many-core machine in a memory-efficient, scalable manner. The corresponding methodology includes the code transformations for thread safety (T1) and for memory footprint reduction (T2), the thread-private malloc library, and the shared-update checker. The transformation work generates two techniques further used by other work: one is to move the data segment into thread-local storage (TLS [13]) for thread safety, which later produces the thread-private allocator; the other is to protect memory pages for write-access reflection, which evolves into the run-time tool for shared-update elimination, correctness verification and data-race arbitration.

3.1 T1: Transformation for Thread Safety
The Transformation for Thread Safety transforms k independent copies of the Geant4 process into an equivalent single process with k threads. The goal is
to create correct thread-safe code, without yet worrying about the memory footprint. This transformation includes two parts: global variable detection and global variable privatization. For C++ programs, we collect the information from four kinds of declarations of global variables, which are the possible sources of data races: "static" declarations, global declarations, "extern" declarations and "static const" declarations of pointers. The last case is very special and rare: pointers are no longer constants if each thread holds its own copy of the same object instance. A rigorous way to collect the information for all global variables is to patch some code in the C++ parser to recognize them. In our case, we change the GNU (g++ version 4.2.2) C++ parser source file parser.c to patch some output statements there. After the parser has been changed, we re-compile gcc-4.2.2. We then use the patched compiler to build Geant4 once more. In this pass, all relevant declarations and their locations are collected as part of the build process. For each global variable declaration, we add the __thread keyword, as described by T1.1 in Table 1. After this step, the data segment is almost empty, with merely some const values left. As a result, at the binary level, each thread acts almost the same as the original process, since any variable that could have been shared has been declared thread-local. Naturally, the transformed code is thread-safe. The handling of thread-local pointers is somewhat more subtle. In addition to the term thread-local, we will often refer to thread-private data. A thread may create a new object stored in the heap. If the pointer to the new object is stored in TLS, then no other thread can access this new object. In this situation, we refer to both the new object and its thread-local pointer as being thread-private. However, only the pointer is stored in TLS as a thread-local variable.
The thread-local pointer also serves as a remedy allowing TLS to support variables that are not plain old data (non-POD) and to implement dynamic initialization of TLS variables. To make non-POD or dynamically initialized variables thread-safe, we introduce new variables and transform the original type into a pointer type, which is plain old data (POD). The variable is then initialized dynamically before it is first used. The two examples T1.2 and T1.3 in Table 1 demonstrate the implementation. A tool based on an open-source C++ parser, Elsa [14], has been developed to transform the global declarations collected by the patched parser. This tool works not only for Geant4 but also for CLHEP, a common library widely used by Geant4. The transformation T1 generates an unconditionally thread-safe Geant4. This version is called unconditional because any thread can call any function with no data race.

3.2 T2: Transformation for Memory Footprint Reduction
The Transformation for Memory Footprint Reduction allows threads to share data without violating thread safety. The goal is to determine the read-only variables and member fields and to remove the __thread keyword, so that they become shared. A member field may have been written to during its initialization,
Table 1. Representative Code Transformations for T1 and T2

T1.1:
    int global = 0;
  becomes
    __thread int global = 0;

T1.2:
    static nonPOD field;
  becomes
    static __thread nonPOD *newfield;
    if (!newfield)              // patch before each reference, as indicated
        newfield = new nonPOD;  // by compilation errors
    #define field (*newfield)
    #define CLASS::field (*CLASS::newfield)

T1.3:
    static int var1 = var2;
  becomes
    static __thread int *var1_NEW_PTR = 0;
    if (!var1_NEW_PTR) {
        var1_NEW_PTR = new int;
        *var1_NEW_PTR = var2;
    }
    int &var1 = *var1_NEW_PTR;

T2.1:
    class volume {
        RD_t RD;      // large, relatively read-only member field
        RDWR_t RDWR;  // small transitory member field
    };
    __thread vector<volume*> store;  // dynamically extended by the main thread via the
                                     // constructor and replicated by worker threads
  becomes
    1  __thread RDWR_t *RDWR_array;
    2  class volume
    3  { int instanceID;
    4    RD_t RD; };
    5  #define RDWR (RDWR_array[instanceID])
    6  vector<volume*> store;
but may be read-only thereafter. The difficulty is to figure out for each sharable class (as defined in the next paragraph), which member fields become read-only after the worker threads are spawned. Below, such a field member is referred to as relatively read-only. A sharable class is defined as a class that has many instances, most of whose member fields are relatively read-only. A sharable instance is defined as an instance of a sharable class. We take the left part of T 2.1 in Table 1 as an example. The class “volume” is a sharable class that contains some relatively read-only fields, whose cumulative size is large. The class “volume” also contains some read-write fields, whose cumulative size is small. However, it is not clear which field is relatively read-only and which field is read-write. To recognize relatively read-only data, we put all sharable instances into a pre-allocated region in the heap by overloading the “new” and “delete” methods for sharable classes and replacing the allocator for their containers. The allocator substitute has overloaded “new” and “delete” methods using the pre-allocated region of sharable instances. Another auxiliary program, the tracer, is introduced to direct the execution of the Geant4 application. Our tracer tool controls the execution of the application using the ptrace system call similarly to how “gdb” debugs a target process. In addition, the tracer tool catches segmentation fault signals in order to recognize member fields that are not relatively read-only. (Such non-relatively read-only data will be called transitory data below.) Figure 1 briefly illustrates the protocol between the tracer and the target application. First, the Geant4 application (the “inferior” process in the terminology of ptrace) sets up signal handlers and spawns the tracer tool. 
Second, the tracer tool (the “superior” process in the terminology of ptrace) attaches and notifies the “inferior” to remove the “write” permission from the pre-allocated memory region. Third, the tracer tool intercepts and relays each segmentation fault signal to the “inferior” to re-enable the “write” permission for the pre-allocated
Fig. 1. Interaction between Inferior and Tracer. (State diagram omitted. In summary: the inferior sets up signal handlers, spawns the tracer and sleeps; the tracer attaches via ptrace and signals the inferior (SIGUSR1) to remove write permission from the protected memory; on a violation, the segmentation fault is relayed, the inferior's handler re-enables write permission, and the offending instruction is retried; finally the inferior signals the tracer to detach.)
memory region, retry the instruction that caused the segmentation fault, and return to the second step. As the last step, the "inferior" will actively re-enable the "write" permission for the pre-allocated memory region and tell the tracer tool to terminate, whereupon the tracer tool detaches. If a sharable class has any transitory member field, it is unsafe for threads to share whole instances of this class. Instead, threads share instances whose transitory member fields have been moved to the thread-private region. The implementation is described by the right part of T2.1 in Table 1, and its objective is illustrated by Figure 2. In this figure, two threads share three instances of the sharable class "volume" in the heap. First, we remove the transitory member field set RW-Field from the class "volume". Then, we add a new field as an instance ID, which is declared by line 3 of T2.1. In Figure 2, the ID for each shared instance is 0, 1 or 2. Each thread uses a TLS pointer to an array of fields of type RW-Field, which is indexed by the instance ID. The RW-Field array is declared on line 1, while the RW-Field reference is redefined by the macro on line 5. As we can see from Figure 2, when worker thread 1 accesses the RW-Field of instance 0, it follows the TLS pointer and assigns the RW-Field to 1. Similarly, worker thread 2 follows the TLS pointer and assigns the RW-Field to 8. Therefore, the two threads access the RW-Field of the same instance, but access different memory locations. Following this implementation pattern, T2 transforms the unconditionally thread-safe version of Geant4 further to share the detector data. For the T2 transformation, it is important to recognize in advance all transitory member fields (fields that continue to be written to). This leads to larger read-only and read-write memory chunks in the logical address space as a side-effect.
Furthermore, this helps to reduce total physical memory consumption even for process parallelism, by taking advantage of copy-on-write. Our internal experiments show that threads always perform better once this performance bottleneck is eliminated.
Fig. 2. Geant4MT Data Model. (Diagram omitted. It shows the central heap holding shared, write-protected, relatively read-only instances with IDs 0–2, alongside the text segment and static/global variables; each worker thread has its own stack, thread-private heap for transient data, and thread-local storage holding an RW-Field-Pointer to its own RW-Field array indexed by instance ID, e.g. worker 1 stores the values 1, 5, 3 and worker 2 the values 8, 2, 6 for instances 0–2.)
3.3 Custom Scalable Malloc Library
The parallel slowdown of the glibc default malloc library is reproducible with a toy program in which multiple threads work cooperatively on a fixed pool of tasks. The task for the toy program is to allocate 4,000 chunks of size 4 KB and then free them. As the number of threads increases, the wall-clock time increases, even though the load per thread decreases. An obvious solution to this demonstrated problem would be to directly modify the Geant4MT source code. One can pre-allocate storage for "hot", or frequently used, classes. The pre-allocated memory is then used instead of dynamic allocation from the central heap. This methodology works only if there is a small upper bound on the number of object instances of each "hot" class. Further, the developer time needed to overload the "new" and "delete" methods, and possibly to define specialized allocators for the various C++ STL containers, is unacceptable. Hence, this method is impractical except for software of moderate size. The preferred solution is a thread-private malloc library. We call our implementation tpmalloc. This thread-private malloc uses a separate malloc arena for each thread. This makes the non-portable assumption that if a thread allocates memory, then the same thread will free it. Therefore, a thread-local global variable is also provided, so that the modified behavior can be turned on or off on a per-thread basis. Such a custom allocator can be obtained by applying T1 to any existing malloc library. In our case, we modify the original malloc library from glibc to create the thread-private malloc arenas. In addition, we pre-initialize a memory region for each worker thread and force it to use a thread-private top chunk in the heap. In portions of the code where we know that a thread executing that code will not need to share a central heap region, we turn on a thread-local global variable to use a thread-private malloc arena.
As Figure 2 shows, this allows Geant4MT to keep both transient objects and transitory data in a thread-private heap region. Therefore, the lock originally associated with each arena in the glibc malloc library is no longer used by the custom allocator.
3.4 Detection for Shared-Update and Run-Time Correctness
The Geant4MT methodology makes worker threads share read-only data, while other data is stored as thread-local or thread-private. This avoids the need for thread synchronization. If the multithreaded application still slows down with additional cache misses, the most likely reason is a violation of the assumption that all shared variables are read-only. Using a machine with 4 Intel Xeon 7400 Dunnington CPUs (3 cache levels and 24 cores in total), we estimate the number of additional write cache misses generated by updates to shared variables via a simple model. This number would otherwise be difficult to measure. Suppose there are w ≥ 2 worker threads that update the same variable, and let n be the total number of write accesses. In the model, write accesses are assumed to be evenly distributed in time, and optimal load balancing is assumed, so that each thread performs n/w updates. Consider any update access u1 and its predecessor u0 in the same thread. The access u1 incurs a cache miss whenever the variable is changed by another thread between the times of u0 and u1. Any single write access from another thread falls between u0 and u1 with probability w/n. Therefore, the probability that no update happens during this period is (1 − w/n)^(n − n/w) ≈ e^(1−w), and a cache miss happens at u1 with probability 1 − e^(1−w). For the L1 cache, each core has its own cache, and it suffices to set w to the number of cores. For the L3 cache, there is a single cache per chip. Hence, for the L3 cache on a machine with multiple chips, one can consider the threads of one chip as a single large super-thread and apply the formula with w set to the number of super-threads, i.e. the number of chips. Similar ideas can be used for analyzing the L2 cache.
The performance bottleneck from updates to shared variables is subtle enough to make testing based on the actual source code impractical for a software package as large as Geant4. Luckily, we had already developed a tracer tool for Transformation T2 (see Section 3.2, along with Figure 1). An enhancement of the tracer tool is applicable to track down this bottleneck. The method is to remove write permission from all shared memory regions and to detect write accesses to them. As seen in Figure 2, all read-only regions are write-protected. Under these circumstances, the "tracer" will catch a segmentation fault, allowing it to recognize when shared data has been changed. This method exposed some frequently used C++ expressions in the code that are unscalable. An expression is scalable if, given a load of n executions of the expression, a parallel execution with k threads on k cores takes 1/k of the time needed by one thread for the same load. Some of the unscalable C++ expressions found in the code by this methodology are:

1. cout.precision(*); — shared updates to precision, even in the absence of output from cout.
2. str = ""; — all empty strings refer to the same static value through a reference count; this assignment changes the reference count.
3. std::ostringstream os; — the default constructor for std::ostringstream takes a static instance of the locale class, which changes the reference count of this instance.

Whenever updates to shared variables occur intensively, the tracer tool can be employed to determine all instructions that modify shared variables. Note that since all workers execute the same code, if one worker thread modifies a shared variable, then all worker threads modify that shared variable. This creates a classic "ping-pong" situation for the cache line containing that variable, and hence unscalable code. The most frequent such occurrences are the obvious suspects for performance bottlenecks. The relevant code is then analyzed and replaced with a thread-private expression where possible. The same tracer tool is sensitive to violations of the T2 read-only assumption, so it also works in the production phase to guarantee the run-time correctness of Geant4MT applications. For this purpose, the tracer tool is enhanced further with some policies and corresponding mechanisms for the coordination of shared-variable updates. The memory-protection-based tracer tool decides whether Geant4MT has ever changed the shared data. It serves as a dynamic verifier to guarantee that a production run is correct. If no segmentation fault happens in the production phase, the computation is correct and the results are valid. When a segmentation fault is caught, one simply aborts that event; the results from all previous events are still valid. The tracer tool has zero run-time overhead for those previous events, provided the T2 read-only assumption is not violated. A more sophisticated policy is to use a recovery strategy: the tracer tool suspends each thread that tries to write to the shared memory region. In that case, all remaining threads finish their current event and arrive at a quiescent state. Then, all quiescent threads wait upon the suspended threads.
The tracer tool first picks a suspended thread to resume and finish its current event. All suspended threads then redo their current events in sequence. This portion of the computation experiences a slowdown due to serialization, but the computation can continue without aborting. To accomplish this recovery strategy, a small modification of the Geant4MT source code is needed: the Geant4MT application must send a signal to the tracer tool before and after each Geant4 event. When violations of the shared read-only data are rare, this policy has a minimal effect on performance.
4 Experimental Results
Geant4MT was tested using an example based on FullCMS, a simplified version of the actual code used by the CMS experiment at CERN [1]. This example was run on a machine with 4 AMD Opteron 8346 HE processors, for a total of 16 cores running at 1.8 GHz. The cache hierarchy of this CPU consists of a 128 KB L1 cache and a 512 KB non-inclusive L2 cache per core, plus a 2 MB L3 cache shared by the 4 cores on the same chip. The cache line size is 64 bytes. The kernel version of this machine is Linux 2.6.31 and the compiler is gcc 4.3.4 with the "-O2" option, following the Geant4 default.
X. Dong, G. Cooperman, and J. Apostolakis
Removal of futex delays from Geant4MT. The first experiment, whose results are reported in the left part of Table 2, reveals the bottleneck caused by the ptmalloc2 library (the glibc default). The total number of Geant4 events is always 4800. Each worker holds around 20 MB of thread-private data and 200 MB of shared data; the shared data is initialized by the master thread prior to spawning the worker threads. Table 2 reports the wall-clock time for the simulation. From this table, we see the performance degradation of Geant4MT, accompanied by a tremendous increase in the number of futex system calls.
Table 2. Malloc issue for FullCMS/Geant4MT: 4 AMD Opteron 8346 HE (4×4 cores) vs. 4 Intel Xeon 7400 Dunnington (4×6 cores). Time is in seconds.
4 AMD Opteron 8346 HE CPUs
Workers   ptmalloc2 Time  Speedup   Futex calls, futex time   tpmalloc Time  Speedup
   1          10349         1             0,       0              10285         1
   4           2650         3.91          2.4K,    0.04            2654         3.87
   8           1406         7.36          38K,     0.4             1355         7.59
  16            804        12.87          24M,   244                736        13.98

4 Intel Xeon 7400 Dunnington CPUs
Workers   ptmalloc2 Time  Speedup   Futex calls, futex time   tpmalloc Time  Speedup
   1           6843         1             0,       0               6571         1
   6           1498         4.57          13K,     0.3             1223         5.37
  12           1050         6.51          24M,   266                824         7.97
  24            654        10.3           66M,  1281                496        13.25
After replacing the glibc default ptmalloc2 with our tpmalloc library in Geant4MT, the calls to futex completely disappeared. (This is because the tpmalloc implementation with thread-private malloc arenas is lock-free.) As expected, the speedup increased, as seen in Table 2. However, that speedup was less than linear. The reason was that another performance bottleneck still existed in this intermediate version: the issue of writes to shared variables. Similar results were observed on a second computer populated with 4 Intel Xeon 7400 Dunnington CPUs (24 cores in total), running at 2.66 GHz. This CPU has three levels of cache. Each core has 64 KB of L1 cache. There are three L2 caches of 3 MB each, with each cache shared between two cores. Finally, there is a single 16 MB L3 cache, shared among all cores on the same chip. The cache line size is 64 bytes for all three levels. The kernel version on this machine is Linux 2.6.29 and the compiler is gcc 4.3.3 with the "-O2" option specified, just as in the experiment on the AMD computer. The load remained at 4800 Geant4 events in total. The wall-clock times are presented in the right part of Table 2. This table shows that the malloc libraries do not scale as well on this Intel machine as on the AMD machine. This may be because the Intel machine runs at a higher clock rate. On the Intel machine, just 12 threads produce a number of futexes similar to that produced by 16 threads on the AMD computer. Analysis of competing effects of futexes versus cache misses. The performance degradation due to writes to shared variables tends to mask the distinctions among different malloc libraries. Without that degradation, one expects still greater improvements from tpmalloc, on both platforms. It was the experimental results from Table 2 that allowed us to realize the existence of this
one additional performance degradation. Even with the use of tpmalloc, we still experienced only a 13 times speedup with 24 threads. Yet tpmalloc had eliminated the extraordinary number of futexes, and we no longer observed increasing system time (time in kernel) with more threads. Hence, we eliminated operating system overhead as a cause of the slowdown. Having removed futexes as a potential bottleneck, we turned to cache misses. We measured the number of cache misses at the three cache levels; the results are listed in Table 3 under the heading "before removal". The number of L3 cache misses was observed to increase excessively. As described in the introduction, this was later ascribed to writes to shared variables. Upon discovering this, some writes to shared variables were removed. The heading "after removal" refers to L3 cache misses after making thread-private some frequently used variables with excessive writes.
Table 3. Shared-update Elimination on 4 Intel Xeon 7400 Dunnington (4×6 cores)

        |           Non-dominating statistics            |   Before removal   |      After removal
Workers | # Instructions  L1-D Misses  L2 Misses  L3 Refs| L3 Misses CPU Cycles| L3 Misses  Time   Speedup
   1    |    1,598G        24493M       402M      87415M |   293M     1945G    |   308M     6547s    1
   6    |    1,598G        24739M       630M      87878M |   326M     2100G    |   302M     1087s    6.02
  12    |    1,598G        24742M       634M      88713M |   456M     3007G    |   302M      543s   12.06
  24    |    1,599G        24827M       612M      88852M |   517M     3706G    |   294M      271s   24.16
Estimating the number of writes to shared variables. It remains to determine experimentally whether the remaining performance degradation is due to updates to shared variables, or to other causes. For example, if the working set were larger than the available cache, this would create a different mechanism for performance degradation. If updates to shared variables are the primary cause of cache misses, then the analytical formula of Section 3.4 can be relied upon to correctly predict the number of updates to shared variables. The formula will be used in two ways. 1. The formula will be used to confirm that most of the remaining cache misses are due only to updates to shared variables. This is done by first considering five different cases from the measured data: L2/6 (L2 cache misses with 6 threads), L2/12, L2/24, L3/12, and L3/24, and showing that all of the legal combinations predict approximately the same number of shared variable updates. 2. Given the predicted number of shared variable updates, this is used to determine whether the number of shared variable updates is excessive. When can we stop looking for shared variables that are frequently updated? Answer: when the number of shared variable updates is sufficiently small. Some data from Table 3 can be used to predict the number of shared variable updates following the first usage of the formula. The results are listed in Table 4. This table provides experimental validation that the predicted number of shared variable updates can be trusted.
Table 4. Prediction for Shared Variable Updates on 4 Intel Xeon 7400 Dunnington (4×6 cores)

Level of | Workers | Cache Misses,  | Cache Misses,    | Additional  | Threads or        | Prediction for Shared
Cache    |         | Single Thread  | Multiple Threads | Misses      | Super-threads (w) | Variable Updates (n)
L2 cache |    6    |  402 × 10^6    |   630 × 10^6     | 228 × 10^6  |        3          |   ≈ 263 × 10^6
L2 cache |   12    |  402 × 10^6    |   634 × 10^6     | 232 × 10^6  |        6          |   ≈ 230 × 10^6
L2 cache |   24    |  402 × 10^6    |   612 × 10^6     | 210 × 10^6  |       12          |   ≈ 210 × 10^6
L3 cache |   12    |  293 × 10^6    |   456 × 10^6     | 163 × 10^6  |        2          |   ≈ 258 × 10^6
L3 cache |   24    |  293 × 10^6    |   517 × 10^6     | 224 × 10^6  |        4          |   ≈ 236 × 10^6
Effect of removing writes to shared memory. Table 3 shows that Geant4MT scales linearly to 24 threads after most updates to shared variables are removed. According to our estimate in Table 4, about 260 × 10^6 updates to shared variables were removed. As seen earlier, with no optimizations, only a 10.3 times speedup using 24 cores was obtained. By using tpmalloc, a 13.25 times speedup was obtained, as seen in Table 2. With tpmalloc and the removal of excessive writes to shared variables, a 24.16 times speedup using 24 cores was obtained, as seen in Table 3. In fact, this represents a slight superlinear speedup (a 1% improvement). We believe this is due to reduced L3 cache misses from the sharing of cached data among distinct threads. Results for different allocators with few writes to shared memory. The last experiment compared different malloc implementations using Geant4MT, after the previous performance bottleneck (shared memory updates) had been eliminated. Some interesting results from this experiment are listed in Table 5. First, ptmalloc3 is worse than ptmalloc2 for Geant4MT. This may explain why glibc has not adopted ptmalloc3 as its default malloc library. Second, tcmalloc is excellent for 8 threads and is generally better than hoard, although hoard achieves a better speedup in some cases. Our custom thread-private tpmalloc does not show any degradation, so Geant4MT with tpmalloc leads to a better speedup. This demonstrates the strong potential of Geant4MT for still larger scalability on future many-core machines.
Table 5. Malloc Library Comparison Using Geant4 on 4 AMD Opteron 8346 HE (4×4 cores)

Workers | ptmalloc2        | ptmalloc3        | hoard            | tcmalloc         | tpmalloc
        | Time    Speedup  | Time    Speedup  | Time    Speedup  | Time    Speedup  | Time    Speedup
   1    |  9923s    1      | 10601s    1      | 10503s    1      |  9918s    1      | 10090s    1
   2    |  4886s    2.03   |  6397s    1.66   |  6316s    1.66   |  4980s    1.99   |  5024s    2.01
   4    |  2377s    4.17   |  4108s    2.58   |  2685s    3.91   |  2564s    3.87   |  2504s    4.03
   8    |  1264s    7.85   |  2345s    4.52   |  1321s    7.95   |  1184s    8.37   |  1248s    8.08
  16    |   797s   12.46   |  1377s    7.70   |   691s   15.20   |   660s   15.02   |   623s   16.20
5 Related Work
Examples of multithreading work for linear algebra algorithms are PLASMA [15], which addresses the scheduling framework, and PLUTO [16] and its variant [17], which address compiler-assisted parallelization, optimization and scheduling. While compiler-based methodologies are suited to the data parallelism found in tractable loop nests, new approaches are necessary for other applications, e.g., commutativity analysis [18] to automatically extract parallelism from utilities such as gzip, bzip2, cjpeg, etc. The Geant4MT T1 transformation is similar to well-known approaches such as the private data clause in OpenMP [19] and Cilk [20], and the SUIF [21] privatizable directive supplied by either the programmer or the compiler. Nevertheless, T1 pursues thread safety in Geant4 with its large C/C++ code base containing many virtual and callback functions, a context that would overwhelm both manual programming and compile-time analysis. The Geant4MT T2 transformation applies task-oriented parallelism (one event is a task) and gains some data parallelism by sharing relatively read-only data and replicating transitory data. While this transformation benefits from existing parallel programming tools, it raises thread safety issues for the generated code. Besides our own tool for run-time correctness, other approaches are also available. Some tools for static data race detection are: SharC [22], which checks declared data sharing strategies for multi-threaded C; RELAY [23], which has been used to detect data races in the Linux kernel; RacerX [24], which finds serious errors in Linux and FreeBSD; and KISS [25], which obtained promising initial results on Windows device drivers. All these methods are unsound. Two sound static data race detection tools are LOCKSMITH [26], which finds several races in Linux device drivers, and the method of Henzinger et al. [27], which is based on model checking.
Sound methods must check a large state space and may fail to complete due to resource exhaustion. Even with their limitations, all these methods are crucial for system software (e.g., an O/S) that requires strict correctness and is intended to run forever. In contrast, this work addresses run-time correctness for application software that already runs correctly in the sequential case.
6 Conclusion
Multithreaded software will benefit even more from future many-core computers. However, efficient thread parallelization is difficult when confronted by two real-world facts. First, software undergoes continual active development, with new releases appearing periodically to eliminate bugs, enhance functionality, or port to additional platforms. Second, the parallelization expert and the domain expert often have only a limited understanding of each other's job. The methodology presented here encourages black-box transformations, so that the jobs of the parallelization expert and the domain expert can proceed in a largely independent manner. The tools in this work not only enable highly scalable thread parallelism, but also provide a widely applicable solution for efficient thread parallelization.
Acknowledgement
We gratefully acknowledge the use of the 24-core Intel computer at CERN for testing. We also gratefully acknowledge the helpful discussions with Vincenzo Innocente, Sverre Jarp and Andrzej Nowak. The many performance tests run by Andrzej Nowak on our modified software were especially helpful in gaining insight into the sources of the performance degradation.
References
1. CMS, http://cms.web.cern.ch/cms/
2. Arce, P., Lagares, J.I., Perez-Astudillo, D., Apostolakis, J., Cosmo, G.: Optimization of an External Beam Radiotherapy Treatment Using GAMOS/Geant4. In: World Congress on Medical Physics and Biomedical Engineering, vol. 25(1), pp. 794–797. Springer, Heidelberg (2009)
3. Hohlmann, M., Ford, P., Gnanvo, K., Helsby, J., Pena, D., Hoch, R., Mitra, D.: GEANT4 Simulation of a Cosmic Ray Muon Tomography System With Micro-Pattern Gas Detectors for the Detection of High-Z Materials. IEEE Transactions on Nuclear Science 56(3-2), 1356–1363 (2009)
4. Godet, O., Sizun, P., Barret, D., Mandrou, P., Cordier, B., Schanne, S., Remoué, N.: Monte-Carlo simulations of the background of the coded-mask camera for X- and Gamma-rays on-board the Chinese-French GRB mission SVOM. Nuclear Instruments and Methods in Physics Research Section A 603(3), 365–371 (2009)
5. malloc, http://www.malloc.de/en/
6. TCMalloc, http://goog-perftools.sourceforge.net/doc/tcmalloc.html
7. The Hoard Memory Allocator, http://www.hoard.org/
8. Valgrind: Instrumentation Framework for Building Dynamic Analysis Tools, http://valgrind.org/
9. Agostinelli, S., et al.: GEANT4 – a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A 506(3), 250–303 (2003) (over 100 authors, including J. Apostolakis and G. Cooperman)
10. Allison, J., et al.: Geant4 Developments and Applications. IEEE Transactions on Nuclear Science 53(1), 270–278 (2006) (73 authors, including J. Apostolakis and G. Cooperman)
11. Cooperman, G., Nguyen, V., Malioutov, I.: Parallelization of Geant4 Using TOP-C and Marshalgen. In: IEEE NCA 2006, pp. 48–55 (2006)
12. TOP-C, http://www.ccs.neu.edu/home/gene/topc.html
13. Thread-Local Storage, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1966.html
14. Elsa: An Elkhound-based C++ Parser, http://www.cs.berkeley.edu/~smcpeak/elkhound/
15. PLASMA: Parallel Linear Algebra for Scalable Multi-core Architecture, http://icl.cs.utk.edu/plasma/
16. Bondhugula, U., Hartono, A., Ramanujam, J.: A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In: PLDI 2008, vol. 43(6), pp. 101–113 (2008)
17. Baskaran, M.M., Vydyanathan, N., Bondhugula, U.K.R., Ramanujam, J., Rountev, A., Sadayappan, P.: Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multicore Processors. In: PPoPP 2009, pp. 219–228 (2009)
18. Aleen, F., Clark, N.: Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture. In: ASPLOS 2009, vol. 44(3), pp. 241–252 (2009)
19. OpenMP, http://openmp.org/wp/
20. Cilk, http://www.cilk.com/
21. SUIF, http://suif.stanford.edu/
22. Anderson, Z., Gay, D., Ennals, R., Brewer, E.: SharC: Checking Data Sharing Strategies for Multithreaded C. In: PLDI 2008, vol. 43(6), pp. 149–158 (2008)
23. Voung, J.W., Jhala, R., Lerner, S.: RELAY: Static Race Detection on Millions of Lines of Code. In: ESEC-FSE 2007, pp. 205–214 (2007)
24. Engler, D., Ashcraft, K.: RacerX: Effective, Static Detection of Race Conditions and Deadlocks. In: SOSP 2003, vol. 37(5), pp. 237–252 (2003)
25. Qadeer, S., Wu, D.: KISS: Keep It Simple and Sequential. In: PLDI 2004, pp. 149–158 (2004)
26. Pratikakis, P., Foster, J.S., Hicks, M.: LOCKSMITH: Context-Sensitive Correlation Analysis for Race Detection. In: PLDI 2006, vol. 41(6), pp. 320–331 (2006)
27. Henzinger, T.A., Jhala, R., Majumdar, R.: Race Checking by Context Inference. In: PLDI 2004, pp. 1–13 (2004)
Parallel Exact Time Series Motif Discovery

Ankur Narang and Souvik Bhattacherjee

IBM India Research Laboratory, New Delhi
{annarang,souvbhat}@in.ibm.com
Abstract. Time series motifs are an integral part of diverse data mining applications, including classification, summarization and near-duplicate detection. They are used across a wide variety of domains such as image processing, bioinformatics, medicine, extreme weather prediction, the analysis of web logs and customer shopping sequences, the study of XML query access patterns, electroencephalograph interpretation and entomological telemetry data mining. Exact motif discovery in soft real-time over 100K time series is a challenging problem. We present novel parallel algorithms for soft real-time exact motif discovery on multi-core architectures. Experimental results on a large scale P6 SMP system, using real-life and synthetic time series data, demonstrate the scalability of our algorithms and their ability to discover motifs in soft real-time. To the best of our knowledge, this is the first such work on parallel scalable soft real-time exact motif discovery.
Keywords: exact motif discovery, parallel algorithm, multi-core multithreaded architectures.
1 Introduction
Time series motifs are pairs of individual time series, or subsequences of a longer time series, which are very similar to each other. Since the formalization of time series motifs in 2002, dozens of researchers have used them in domains as diverse as medicine, biology, telemedicine, entertainment and severe weather prediction. Further, domains such as financial market analysis [5], sensor networks and disaster management require real-time motif discovery over massive amounts of data, possibly in an online fashion. The intuitive algorithm for computing motifs is quadratic in the number of individual time series (or in the length of the single time series from which subsequences are extracted). Thus, for massive time series it is hard to obtain the exact time series motif in a realistic timeframe. More than a dozen approximate algorithms to discover motifs have been proposed [1] [2] [6] [7] [10]. The sequential time complexity of most of these algorithms is O(m) or O(m log m), where m is the number of time series, but the associated constant factors are high. [8] presents a tractable exact algorithm to find time series motifs. This exact algorithm is worst-case quadratic, but it can reduce the time required by three orders of magnitude. This algorithm enables tackling problems which have previously been thought intractable, for example automatically constructing dictionaries of recurring patterns from electroencephalographs. Further, this algorithm is fast enough to be used as a subroutine
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 304–315, 2010. © Springer-Verlag Berlin Heidelberg 2010
in higher-level data mining algorithms for summarization, near-duplicate detection and anytime classification [12]. However, the sequential MK algorithm [8] cannot deliver soft real-time exact motif discovery for a large number (100K or more) of time series. Hence, there is an imperative need to explore parallel scalable algorithms for exact motif discovery. Emerging and next-generation multi-core architectures have a large number of hardware threads and support high-rate throughput computing. A scalable parallel algorithm for exact motif discovery on multi-core architectures needs a fine-grained, cache-aware multithreaded design along with optimizations for load balancing. Soft real-time scalable performance on large multi-cores is a challenging problem due to the simultaneous consideration of cache locality and dynamic load balancing. In this paper, we present the design of parallel multi-threaded algorithms for exact motif discovery along with novel optimizations for cache performance and dynamic load balancing. Our algorithm is essentially in-memory. Experimental results on real EEG data and random walk data show that our algorithms scale well and discover motifs in soft real-time. We make the following contributions:
– We present the design of parallel algorithms for exact discovery of time series motifs. To achieve high scalability and performance on multi-core architectures, we use novel cache locality and dynamic load balancing optimizations.
– We demonstrate soft real-time parallel motif discovery for realistic (EEG data) and randomly generated time series on large scale Power6 based SMP systems.
2 Background and Notation
Definition 1. Time Series: A Time Series is a sequence T = (t1, t2, ..., tn), an ordered set of n real-valued numbers. The ordering is typically temporal; however, other kinds of data such as color distributions, shapes and spectrographs also have a well-defined ordering and can fruitfully be considered time series for the purposes of indexing and mining. For simplicity and without loss of generality we consider only equispaced data. In general, we may have many time series to consider and thus need to define a time series database.
Definition 2. Time Series Database: A Time Series Database (D) is an unordered set of m time series, possibly of different lengths. For simplicity, we assume that all the time series are of the same length and that D fits in main memory. Thus, D is a matrix of real numbers, where Di is the ith row in D as well as the ith time series Ti in the database, and Di,j is the value at time j of Ti.
Definition 3. Time Series Motif: The Time Series Motif of a time series database D is the unordered pair of time series {Ti, Tj} in D which is the most similar among all possible pairs. More formally, ∀a, b, i, j the pair {Ti, Tj} is the motif iff dist(Ti, Tj) ≤ dist(Ta, Tb), where i ≠ j and a ≠ b.
We can generalize the notion of motifs by introducing a motif ranking. More formally:
Definition 4. kth Time Series Motif: The kth Time Series Motif is the kth most similar pair in the database D. The pair {Ti, Tj} is the kth motif iff there exists a set S of pairs of time series of size exactly k−1 such that ∀Td ∈ D, {Ti, Td} ∉ S and {Tj, Td} ∉ S, and ∀{Tx, Ty} ∈ S, {Ta, Tb} ∉ S: dist(Tx, Ty) ≤ dist(Ti, Tj) ≤ dist(Ta, Tb).
These ideas can be extended to subsequences of a very long time series by treating every subsequence of length n (n ≪ m) as an object in the time series database. Motifs in such a database are subsequences that are conserved locally in the long time series. More formally:
Definition 5. Subsequence: A subsequence of length n of a time series T = (t1, t2, ..., tm) is a time series Ti,n = (ti, ti+1, ..., ti+n−1) for 1 ≤ i ≤ m − n + 1.
Definition 6. Subsequence Motif: The Subsequence Motif is the pair of subsequences {Ti,n, Tj,n} of a long time series T that are most similar. More formally, ∀a, b, i, j the pair {Ti,n, Tj,n} is the subsequence motif iff dist(Ti,n, Tj,n) ≤ dist(Ta,n, Tb,n), with |i − j| ≥ w and |a − b| ≥ w for w > 0. To avoid trivial motifs, w is typically chosen to be ≥ n/2.
[8] describes a fast sequential algorithm for exact motif discovery. This algorithm relies on the distances of all time series from a randomly chosen reference time series as the guiding heuristic to reduce the search space. A randomly chosen time series is taken as a reference; the distances of all other time series with respect to this reference are calculated and sorted. The key insight of this algorithm is that this linear ordering of the data provides useful heuristic information to guide the motif search. The observation is that if two objects are close in the original space, they must also be close in the linear ordering, but the reverse need not be true.
Two objects can be arbitrarily close in the linear ordering but very far apart in the original space. This observation yields a speedup over the usual brute-force algorithm. A further speedup is obtained when the single reference point is extended to multiple reference points; this extended version is known as the MK algorithm [8]. In the MK algorithm (Fig. 1), the distances from each reference time series to all other time series are calculated. The reference time series with the maximum standard deviation of distances is used to order the time series by increasing distance. The difference between the reference distances of two time series, as defined by this ordering, is a good lower bound on the actual distance between those time series. At the top level, the algorithm iterates with increasing offset values, starting with offset = 1. In each iteration, it updates the best-so-far variable whenever a closer pair of time series is discovered.
3 Related Work
Most of the literature has focused on computationally fast approximate algorithms for motif discovery [1] [2] [4] [6] [7] [10]; the time series motif problem has been solved exactly only by the FLAME algorithm [11]. The FLAME algorithm works on
Algorithm: MK Motif Discovery
Procedure [L1, L2] = MK_Motif(D, R)
 1  best-so-far = INF
 2  for i = 1 to R
 3    ref_i = a randomly chosen time series D_r from D
 4    for j = 1 to m
 5      Dist_{i,j} = d(ref_i, D_j)
 6      if Dist_{i,j} < best-so-far
 7        best-so-far = Dist_{i,j}, L1 = r, L2 = j
 8    S_i = standard deviation of Dist_i
 9  find an ordering Z of the indices of the reference time series such that S_{Z(i)} >= S_{Z(i+1)}
10  find an ordering I of the indices to the time series in D such that Dist_{Z(1),I(j)} <= Dist_{Z(1),I(j+1)}
11  offset = 0, abandon = false
12  while abandon = false
13    offset = offset + 1, abandon = true
14    for j = 1 to m
15      reject = false
16      for i = 1 to R
17        lower_bound = |Dist_{Z(i),I(j)} - Dist_{Z(i),I(j+offset)}|
18        if lower_bound > best-so-far
19          reject = true, break
20        else if i = 1
21          abandon = false
22      if reject = false
23        if d(D_{I(j)}, D_{I(j+offset)}) < best-so-far
24          best-so-far = d(D_{I(j)}, D_{I(j+offset)}), L1 = I(j), L2 = I(j+offset)

Fig. 1. The MK Motif Discovery algorithm

[...] (best-so-far > current difference in reference distance) during the traversal of the sorted reference distance array. At the end of each iteration, the threads perform a reduction operation to update the overall best-so-far value and the corresponding pair of time series representing the motif pair. Further, at this point the threads check whether the termination criterion for the algorithm has been reached by all threads. If so, the program exits and the motif pair is reported.
Fig. 2. Parallelism in Exact Motif Discovery. Left: single-dimension parallelism (array parallelism), in which the sorted reference distance array is partitioned (A[1..4], A[5..8], A[9..12]) among threads for each offset in turn. Right: two-dimension parallelism (array and offset parallelism), in which each array partition is additionally split across groups of offset values.
When the second dimension of parallelism, i.e. offset parallelism, is also used, a group of offset values is considered in each iteration, as opposed to a single offset per iteration. Each thread traverses its partition for all the offset values assigned to the current offset group. At the end of each iteration, a reduction operation is performed to update the overall best-so-far value and the corresponding pair of time series representing the motif pair, across all the offset values considered in this iteration. The number of offset values considered in an iteration is determined empirically for best performance. The parallel algorithm including offset parallelism is given in Fig. 2; we refer to it as the Par-MK algorithm. When both dimensions of parallelism are used, the threads that work on different offsets but the same array partition are co-scheduled on the same core, or on multiple cores that share the same L2 cache. This improves cache performance and reduces pressure on external memory bandwidth. The overall performance of this parallel algorithm depends on architectural and algorithmic parameters. The architectural parameters we consider for performance evaluation include the number of cores, the number of hardware threads per core, the size of the L2 cache and the number of cores sharing an L2 cache. For a given time series database, the algorithm parameters that affect performance are the number of reference points and the number of parallel offsets
when offset parallelism is used. Section 5 details the interesting interplay among these parameters for obtaining the best performance. The compute workload on each thread is non-uniform, due to the variable number of distance computations between time series. This causes load imbalance across the threads and results in lower scalability and efficiency with increasing numbers of threads and cores on the target architecture. The next section presents the design optimizations for load balancing.
4.1 Load Balancing Optimizations
Each thread in Algorithm 2 performs a different number of distance computations between time series, depending on its local best_so_far[i] value and the distribution of distances with respect to the reference points. This leads to load imbalance across the threads. In this section, we present static and dynamic load balancing techniques. We first present the static load balancing technique (referred to as Par-MK-SLB). The work that threads do is divided into two categories: (a) sorted distance array traversal and best_so_far[i] update, and (b) distance computation between pairs of time series (Fig. 3). Each thread is assigned a fixed category of work. Each thread Ti that traverses the sorted distance array (referred to as a search thread) has a synchronized queue Qi associated with it, into which it puts distance computation requests. A distance computation request is the tuple [Sa, Sb, Rab], where Sa and Sb are the time series and Rab is the difference in the reference distances of these two time series. Each thread DTj that computes distances between time series (referred to as a compute thread) picks up distance computation requests from a queue Qi. It first checks whether the difference in the reference distances is less than best_so_far[i]; if so, it accepts the request for computation, otherwise the request is discarded.
It then computes the distance between the two given time series (Sa, Sb). If the actual distance between these two time series is less than the best_so_far[i] value, then best_so_far[i] is updated in a synchronized fashion. For static load balancing, each thread in the system is assigned to one of multiple work-share groups. Each work-share group consists of a certain set of search threads (and their respective queues) and a set of compute threads. The ratio of compute threads to search threads is referred to as the cs-ratio. For an optimized mapping onto the target architecture, the threads that belong to the same work-share group are co-scheduled on cores that share the same L2 cache (Fig. 3). The cs-ratio in each work-share group is determined by the relative load on these threads and by the underlying architecture, for best cache performance. Static load balancing has its limitations: compute threads cannot be shared across work-share groups, and the ratio of the overall number of compute threads to the overall number of search threads in the system cannot be adjusted. Further, there can be load imbalance between the search threads due to the variable number of synchronized queue operations. Thus, to provide further improvements in load balancing, we extend this approach to dynamic load balancing.
Dynamic Load Balancing. In the dynamic load balancing scheme, the execution is divided into multiple phases. Each thread does one category of work in a phase but can switch to the other category in the next phase if needed for load balancing. Thus, the cs-ratio can vary across phases (Fig. 4). In each phase, the average queue occupancies are
Fig. 3. Static Load Balancing: search (traversal) threads T1–T5, each with a synchronized queue Queue(1)–Queue(5), feed distance compute threads T6–T10; the threads are organized into Work Share Groups (1)–(3).
monitored to determine whether the search load or the compute load is higher. If in the current phase the compute load is high, as indicated by relatively high queue occupancy, then some search threads switch their work category to compute work in the next phase; thus the cs-ratio increases in the next phase. Symmetrically, the cs-ratio is decreased in the next phase if in the current phase the queues have relatively low occupancy. The length of each phase, in terms of the number of iterations, is determined empirically. The threads chosen to switch from one category to another between phases, and the new value of the cs-ratio, depend on the architectural configuration, to ensure maximal cache locality while taking care of load balancing.

Fig. 4. Dynamic Load Balancing: the sorted reference distance array is partitioned into small chunks schunk(0)–schunk(3); a synchronized next_available chunk variable hands chunks to search threads T1–T4, whose queues Queue(1)–Queue(4) are served by compute threads T6–T9 via random distance computation request fetch.

Further, we employ
fine-grained parallelism to get dynamic load balance between the search threads. Here, instead of partitioning the sorted distance array into a number of chunks equal to the number of search threads, we partition it into much smaller chunks. A synchronized variable points to the next available small chunk for the fixed set of offsets. Each search thread picks the next available small chunk and performs the search over it while enqueuing distance computation requests for the compute threads (Fig. 4). When no small chunks are available for this iteration, the search threads perform a barrier, and the global best_so_far value is updated and made visible to all threads. Thus, a search thread now accesses non-contiguous small chunks, as opposed to the static load balancing algorithm (Par-MK-SLB) or the Par-MK algorithm. The dynamic load balancing algorithm is referred to as Par-MK-DLB.
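The fine-grained chunk claiming at the heart of Par-MK-DLB can be sketched as below. This is an assumed, simplified model: the chunk size, data, and helper names are illustrative, and the per-chunk "work" stands in for traversal plus request enqueuing. The point is that a single synchronized counter lets faster threads simply claim more small chunks, so no thread is pinned to a fixed contiguous share.

```python
import threading

sorted_array = list(range(100))    # stands in for the sorted distance array
CHUNK = 8                          # much smaller than len(array) / num_threads
next_chunk = 0                     # synchronized "next available chunk" pointer
counter_lock = threading.Lock()
processed = []                     # record of all claimed elements
processed_lock = threading.Lock()

def claim_chunk():
    """Atomically fetch the next available small chunk (or None when done)."""
    global next_chunk
    with counter_lock:
        start = next_chunk
        if start >= len(sorted_array):
            return None
        next_chunk += CHUNK
    return sorted_array[start:start + CHUNK]

def search_thread():
    # Each search thread keeps claiming chunks until none remain; in the
    # real algorithm it would traverse the chunk and enqueue requests.
    while (chunk := claim_chunk()) is not None:
        with processed_lock:
            processed.extend(chunk)

threads = [threading.Thread(target=search_thread) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Every element is processed exactly once, in possibly non-contiguous order
print(len(processed), sorted(processed) == sorted_array)
```

After the chunks are exhausted, the real algorithm performs the barrier and publishes the global best_so_far before the next iteration; that step is omitted here.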
5 Results and Analysis

We implemented all three algorithms, Par-MK, Par-MK-SLB and Par-MK-DLB, using the Posix Threads (NPTL) API. The test data used was the same as in [8]. We evaluated the performance and scalability of our algorithms on the EEG data and the Random Walk data. The EEG data [8] contains voltage differences across the scalp and reflects the activity of the large populations of neurons underlying the recording electrode. The EEG data has a single sequence of length L = 180K. We extracted subsequences of length N = 200 from this long sequence, as mentioned in Section 2, and used them to find the subsequence motif. For the synthetic data, we generated a random walk of TS = 100K time series, each of length N = 1024. We performed the experiments on a large scale SMP machine with 32 Power6 (4GHz) cores. Each core has 32KB of L1 cache, and two cores share 32MB of L2 cache. Further, each thread was assigned to a single core.

5.1 Strong Scalability

In the strong scalability experiment we kept the number of time series and the length of each time series constant, while increasing the number of threads from 2 to 32. Fig. 5(a) shows the variation in exact motif discovery time with an increasing number of cores. The input consists of subsequences derived from the EEG time series of length L = 180K. The motif discovery time decreases from 207s for 2 cores/threads to 22.05s for 32 cores/threads. We also measured the time of the sequential MK algorithm [8] on this Power machine. Compared to the sequential time of 734s, we obtained a speedup of 33.4× on 32 cores. This superlinear speedup is due to the dynamic load balancing in our algorithm together with superior L2 cache performance. The parallel efficiency in this case is 104.4%. Using random walk data over TS = 50K time series, each of length 1024, we obtained a time of 29.5s on 32 threads and 176.6s on 2 threads (Fig. 5(b)), while the sequential MK obtained the same motif in 253s. This gives a speedup of roughly 8.57×.
The parallel efficiency turns out to be 27%. This fall in speedup and efficiency is due to reduced cache performance owing to the larger size of the time series (N = 1024). Using dimension-based partitioning we could obtain better cache performance and hence better speedup.
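The speedup and parallel-efficiency figures quoted above follow from the standard definitions (speedup = sequential time / parallel time; efficiency = speedup / number of cores); the numbers below are the random walk measurements from the text.

```python
def speedup(t_seq, t_par):
    # Ratio of sequential to parallel execution time
    return t_seq / t_par

def efficiency(t_seq, t_par, num_cores):
    # Speedup normalized by the number of cores used
    return speedup(t_seq, t_par) / num_cores

s = speedup(253.0, 29.5)         # sequential MK vs. 32-thread parallel run
e = efficiency(253.0, 29.5, 32)
print(round(s, 2), round(e, 2))  # -> 8.58 0.27
```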
Fig. 5. Strong scalability of exact motif discovery: (a) EEG data (L = 180K, N = 200); (b) Random Walk (TS = 50K, N = 1024). Motif discovery time (s) vs. number of threads:

Threads    1      2       4       8      16     32
EEG      734   207.11  107.87  65.03  31.37  22.05
RWalk    253   176.6   115.4   70.7   40.83  29.5
5.2 Load Balancing Optimizations Analysis

We studied and compared the load balance (across threads) of the Par-MK and Par-MK-DLB algorithms. Fig. 6(a) plots the wait time of the 32 threads for the EEG data using the Par-MK algorithm. The wait time for a thread is the sum, over all iterations, of the time the thread has to wait for the all-thread barrier to complete; it represents the load imbalance across the threads. The minimum wait time is 2.9s while the maximum is 10.47s. In each iteration the load balance across threads can be different, depending on the nature of the data. This irregular nature of the computation causes all threads to have a non-zero total wait time over all iterations. Since in the Par-MK algorithm the search and compute loads are variable across threads, this load imbalance is high, resulting in poor scalability and performance. Fig. 6(b) shows the wait time of the 32 threads for the Par-MK-DLB algorithm. Here the minimum wait time is 2.2s while the maximum is 4s. Each thread processes as many of the available small chunks in each iteration as it can, based on the load from each chunk it picks. This dynamic load balancing strategy leads to a much better load balance compared to the Par-MK algorithm. The EEG data has more search load than distance computation load, hence the cs-ratio of compute threads to search threads is kept low (1:7). The random walk data has a higher distance computation load compared to the search load. To obtain the best overall time, we tried multiple values of the ratio of the number of compute threads to the number of search threads. Fig. 7(a) displays the wait time (using 32 threads, random walk data, TS = 100K, N = 1024) for the search threads when the cs-ratio is 1:1. The minimum wait time is 20s while the maximum is 53.5s, with mean 33.7s and standard deviation 9.7s. The overall motif discovery time in this case is 164s.
The wait time reflects the variation of the distance computation workload and the search traversal workload across the search threads. Since for the random walk data the length of the time series is set to 1024, there is more distance computation work than traversal work across the reference distance array. Due to this, the queues of the search threads have higher occupancy in this experiment. Thus, even when the cs-ratio is 1:1,
Fig. 6. (a) Load Balance Analysis Par-MK: wait time per thread (EEG data, 32 threads). (b) Load Balance Analysis Par-MK-DLB: wait time per thread (EEG data, 32 threads).
the number of compute threads is smaller than needed; hence the load imbalance across the search threads is large and the overall completion time is 172s. When the cs-ratio is chosen as 7:1, the load imbalance across the threads becomes much better, as indicated by Fig. 7(b). Here, the minimum wait time is around 4s while the maximum is around 8.1s, with a mean of 5.84s and a standard deviation of 2.26s. The total motif discovery time here is 146.5s. The larger number of compute threads is able to quickly serve the queues with distance computation requests, resulting in better load balance and a lower overall exact motif discovery time.
Fig. 7. Wait time per search thread (Random Walk, TS = 100K): (a) Load Balance Analysis with CS-Ratio 1:1 (|C| = 16, |S| = 16). (b) Load Balance Analysis with CS-Ratio 7:1 (|C| = 28, |S| = 4).
6 Conclusions and Future Work

We presented the design of parallel algorithms for exact motif discovery over large time series. Novel load balancing and cache optimizations provide high scalability and performance on multi-core architectures. We demonstrated the performance of our optimized parallel algorithms on large Power SMP machines using both real and random data sets. We achieved around 33× performance speedup over the best sequential
performance reported so far for discovery of exact motifs on real data. Our optimized parallel algorithm delivers exact motifs in soft real-time. To the best of our knowledge, this is the first such work on parallel, scalable, soft real-time exact motif discovery on multi-core architectures. In the future, we plan to investigate scalability on many-core architectures with thousands of threads.
References

1. Beaudoin, P., van de Panne, M., Poulin, P., Coros, S.: Motion-motif graphs. In: Symposium on Computer Animation (2008)
2. Chiu, B., Keogh, E., Lonardi, S.: Probabilistic discovery of time series motifs. In: 9th International Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 493–498 (2003)
3. Guralnik, V., Karypis, G.: Parallel tree-projection-based sequence mining algorithms. Parallel Computing 30(4), 443–472 (2001)
4. Guyet, T., Garbay, C., Dojat, M.: Knowledge construction from time series data using a collaborative exploration system. Journal of Biomedical Informatics 40(6), 672–687 (2007)
5. Jiang, T., Feng, Y., Zhang, B., Shi, J., Wang, Y.: Finding motifs of financial data streams in real time. In: Kang, L., Cai, Z., Yan, X., Liu, Y. (eds.) ISICA 2008. LNCS, vol. 5370, pp. 546–555. Springer, Heidelberg (2008)
6. Meng, J., Yuan, J., Hans, M., Wu, Y.: Mining motifs from human motion. In: Proc. of EUROGRAPHICS (2008)
7. Minnen, D., Isbell, C., Essa, I., Starner, T.: Discovering multivariate motifs using subsequence density estimation and greedy mixture learning. In: Conf. on Artificial Intelligence, AAAI 2007 (2007)
8. Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM, pp. 473–484 (2009)
9. Cong, S., Han, J., Padua, D.: Parallel mining of closed sequential patterns. In: Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, USA, pp. 562–567 (2005)
10. Tanaka, Y., Iwamoto, K., Uehara, K.: Discovery of time-series motif from multi-dimensional data based on MDL principle. Machine Learning 58(2-3), 269–300 (2005)
11. Tata, S.: Declarative Querying For Biological Sequences. Ph.D. thesis, The University of Michigan (2007)
12. Ueno, K., Xi, X., Keogh, E., Lee, D.: Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: Proc. of IEEE International Conference on Data Mining (2006)
13. Zaki, M.: Parallel sequence mining on shared-memory machines. Journal of Parallel and Distributed Computing 61(3), 401–426 (2001)
Optimized Dense Matrix Multiplication on a Many-Core Architecture

Elkin Garcia¹, Ioannis E. Venetis², Rishi Khan³, and Guang R. Gao¹

¹ Computer Architecture and Parallel Systems Laboratory, Department of Electrical and Computer Engineering, University of Delaware, Newark 19716, U.S.A. {egarcia,ggao}@capsl.udel.edu
² Department of Computer Engineering and Informatics, University of Patras, Rion 26500, Greece [email protected]
³ ET International, Newark 19711, U.S.A. [email protected]
Abstract. Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), belong to a new set of many-core-on-a-chip systems with a software managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this new class of architectures. In this paper, we use dense matrix multiplication as a case study to present a general methodology to map applications to these kinds of architectures. Our methodology exposes the following characteristics: (1) Balanced distribution of work among threads to fully exploit available resources. (2) Optimal register tiling and sequence of traversing tiles, calculated analytically and parametrized according to the register file size of the processor used. This results in minimal memory transfers and optimal register usage. (3) Implementation of architecture specific optimizations to further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, measurements of power consumption prove that the C64 is very power efficient, providing 530 MFLOPS/W for the problem under consideration.
1 Introduction
Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. They exploit temporal locality, making use of cache tiling techniques with tile size selection and padding [8,18]. However, the data location and replacement in the cache is controlled by hardware, making fine control of these parameters difficult. In addition, power consumption and chip die area constraints make increasing the on-chip cache an untenable solution to the memory wall problem [5,17].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 316–327, 2010. © Springer-Verlag Berlin Heidelberg 2010
As a result, new architectures like the IBM Cyclops-64 (C64) belong to a new set of many-core-on-a-chip systems with a software managed memory hierarchy. These new kinds of architectures hand the management of the memory hierarchy to the programmer and save the die area of hardware cache controllers and over-sized caches. Although this might complicate programming at their current stage, these systems provide more flexibility and opportunities to improve performance. Following this path, new alternatives for classical algorithmic problems, such as Dense Matrix Multiplication (MM), LU decomposition (LU) and Fast Fourier Transform (FFT), have been studied on these new many-core architectures [7,15,21]. The investigation of these new opportunities leads to two main conclusions: (1) The optimizations for improving performance on cache-based parallel systems are not necessarily feasible or convenient on software managed memory hierarchy systems. (2) Memory access patterns reached by appropriate tiling substantially increase the performance of applications. Based on these observations we can conclude that new programming and compiling methodologies are required to fully exploit the potential of these new classes of architectures. We believe that a good starting point for developing such methodologies are classical algorithms with known memory access and computation patterns. These applications provide realistic scenarios and have been studied thoroughly on cache-based parallel systems. Following this idea, we present a general methodology that provides a mapping of applications to software managed memory hierarchies, using MM on C64 as a case study. MM was chosen because it is simple to understand and analyze, yet computationally and memory intensive. For the basic algorithm, the arithmetic complexity and the number of memory operations for multiplying two m × m matrices are O(m³).
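As a baseline for the optimizations discussed later, the basic O(m³) algorithm can be sketched as below (our own illustration, not the paper's implementation); the triple loop performs exactly m³ multiply-add pairs.

```python
def matmul(A, B):
    """Naive dense multiply of two m x m matrices, counting the
    multiply-add operations (exactly m^3 of them)."""
    m = len(A)
    C = [[0.0] * m for _ in range(m)]
    ops = 0
    for i in range(m):
        for j in range(m):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
                ops += 1
    return C, ops

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, ops = matmul(A, B)
print(C, ops)  # -> [[19.0, 22.0], [43.0, 50.0]] 8
```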
The methodology presented in this paper is composed of three strategies that result in a substantial increase in performance, by optimizing different aspects of the algorithm. The first one is a balanced distribution of work among threads. Providing the same amount of work to each thread guarantees minimization of the idle time of processing units waiting for others to finish. If a perfect distribution is not feasible, a mechanism to minimize the differences is proposed. The second strategy is an optimal register tiling and sequence of traversing tiles. Our register tiling and implementation of the sequence of traversing tiles are designed to maximize the reuse of data in registers and minimize the number of memory accesses to slower levels, avoiding unnecessary stalls in the processing units while waiting for data. The last strategy involves more specific characteristics of C64. The use of special instructions, optimized instruction scheduling and other techniques further boost the performance reached by the previous two strategies. The impact on performance can change according to the particular characteristics of the many-core processor used. The experimental evaluation was performed using a real C64 chip. After the implementation of the three strategies proposed, the performance reached by the C64 chip is 44.12 GFLOPS, which corresponds to 55.2% of the peak performance. Additionally, measurements of power consumption prove that C64 is very power
Fig. 1. C64 Architecture details: (a) C64 chip architecture — 80 processors per chip, each with two thread units (TUs), a floating point unit (FP), and SP/SRAM banks, all connected through a crossbar network (640 GB/s overall) to the host interface, DDR2 SDRAM controller, and the A-switch of the 3D mesh control network; (b) memory hierarchy — 64 registers (read/write: 1 cycle), 16kB scratchpad (load: 2 cycles, store: 1 cycle), ~2.5MB on-chip global SRAM (load: 31 cycles, store: 15 cycles; 320 GB/s), and 1GB off-chip DRAM (load: 57 cycles, store: 28 cycles; 16 GB/s with multiple load/store instructions, 2 GB/s otherwise).
efficient, providing 530 MFLOPS/W for the problem under consideration. This value is comparable to the top of the Green500 list [13], which provides a ranking of the most energy-efficient supercomputers in the world. The rest of this paper is organized as follows. In Section 2, we describe the C64 architecture. In Section 3, we give a short overview on the current status of MM Algorithms. In Section 4, we introduce our proposed MM Algorithm and optimizations. In Section 5, we present the experimental evaluation of our implementation. Finally, we conclude and present future work in Section 6.
2 The IBM Cyclops-64 Architecture
Cyclops-64 (C64) is an innovative architecture developed by IBM, designed to serve as a dedicated petaflop computing engine for running high performance applications. A C64 chip is an 80-processor many-core-on-a-chip design, as can be seen in Fig. 1a. Each processor is equipped with two thread units (TUs), one 64-bit floating point unit (FP) and two SRAM memory banks of 30kB each. It can issue one double precision floating point "Multiply and Add" instruction per cycle, for a total performance of 80 GFLOPS per chip when running at 500MHz. A 96-port crossbar network with a bandwidth of 4GB/s per port connects all TUs and SRAM banks [11]. The complete C64 system is built out of tens of thousands of C64 processing nodes arranged in a 3-D mesh topology. Each processing node consists of a C64 chip, external DRAM, and a small amount of external interface logic. A C64 chip has an explicit three-level memory hierarchy (scratchpad memory, on-chip SRAM, off-chip DRAM), 16 instruction caches of 32kB each (not shown in the figure) and no data cache. The scratchpad memory (SP) is a configured portion of each on-chip SRAM bank which can be accessed with very low latency by the TU it belongs to. The remaining sections of all on-chip SRAM banks constitute the on-chip global memory (GM), which is uniformly addressable from all TUs. Fig. 1b summarizes the current size, latency (when there is no contention) and bandwidth of each level of the memory hierarchy.
Execution on a C64 chip is non-preemptive and there is no hardware virtual memory manager. The former means that the C64 micro-kernel will not interrupt the execution of a user application unless an exception occurs. The latter means the three-level memory hierarchy of the C64 chip is visible to the programmer.
3 Classic Matrix Multiplication Algorithms
MM algorithms have been studied extensively. These studies focus mainly on: (1) Algorithms that decrease the naïve complexity of O(m³). (2) Implementations that take advantage of advanced features of computer architectures to achieve higher performance. This paper is oriented towards the second area. In the first area, more efficient algorithms are developed. Strassen's algorithm [20] is based on the multiplication of two 2 × 2 matrices with 7 multiplications, instead of the 8 required by the straightforward algorithm. The recursive application of this fact leads to a complexity of O(m^(log₂ 7)) [10]. Disadvantages, such as numerical instability and the memory space required for submatrices in the recursion, have been discussed extensively [14]. The current best bound is O(m^2.376), given by the Coppersmith–Winograd algorithm [9]. However, this algorithm is not used in practice, due to its large constant term. The second area focuses on efficient implementations. Although initially more emphasis was given to implementations for single processors, parallel approaches quickly emerged. A common factor among most implementations is the decomposition of the computation into blocks. Blocking algorithms not only give opportunities for better use of specific architectural features (e.g., the memory hierarchy) but also are a natural way of expressing parallelism. Parallel implementations have exploited the interconnection pattern of processors, like Cannon's matrix multiply algorithm [6,16], or the reduced number of operations, like Strassen's algorithm [4,12]. These implementations have explored the design space along different directions, according to the targeted parallel architecture. Other studies have focused on models that capture performance-relevant aspects of the hierarchical nature of computer memory, like the Uniform Memory Hierarchy (UMH) model or the Parallel Memory Hierarchy (PMH) model [1,2].
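The 7-multiplication scheme underlying Strassen's algorithm, mentioned above, is easy to verify at the 2 × 2 level. The sketch below uses scalars for clarity; applying it recursively on submatrices is what yields the O(m^(log₂ 7)) bound.

```python
def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """Strassen's seven products for a 2x2 block multiply
    (instead of the eight products of the straightforward method)."""
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Recombine the seven products into the four result entries
    c11 = m1 + m4 - m5 + m7
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return c11, c12, c21, c22

print(strassen_2x2(1, 2, 3, 4, 5, 6, 7, 8))  # -> (19, 22, 43, 50)
```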
The many-core architecture design space has not yet been explored in detail, but existing studies already show its potential. A performance prediction model for Cannon's algorithm has shown a huge performance potential for an architecture similar to C64 [3]. Previous research on MM on C64 showed that it is possible to increase performance substantially by applying well known optimization methods and adapting them to specific features of the chip [15]. More recent results on LU decomposition conclude that some optimizations that perform well for classical cache-based parallel systems are not the best alternative for improving performance on software managed memory hierarchy systems [21].
4 Proposed Matrix Multiplication Algorithm
In this section we analyze the proposed MM algorithm and highlight our design choices. The methodology used is oriented towards exploiting the maximum
benefit of features that are common across many-core architectures. Our target operation is the multiplication of dense square matrices A × B = C, each of size m × m, using algorithms with running time O(m³). Throughout the design process, we use some specific features of C64 to illustrate the advantages of the proposed algorithm over choices used in other MM algorithms. Our methodology alleviates three related sources identified as causing poor performance on many-core architectures: (1) inefficient or unnecessary synchronization; (2) unbalanced work between threads; (3) latency due to memory operations. The relation and impact on performance of these sources are architecture dependent, and modeling their interactions has been an active research topic. In our particular case of interest, the analysis of MM is easier than that of other algorithms, not only because of the simple way it can be described, but also because parallel algorithms exist that do not require synchronization. This simplifies our design process: we only need to carefully analyze two instead of three causes of poor performance, as long as the proposed algorithm does not require synchronization. These challenges are analyzed in the following subsections.
4.1 Work Distribution
The first challenge in our MM algorithm is to distribute work among P processors while avoiding synchronization. It is well known that each element c_{i,j} of C can be calculated independently. Therefore, serial algorithms can be parallelized without requiring any synchronization for the computation of each element c_{i,j}, which immediately satisfies this requirement. The second step is to break the m × m matrix C into blocks such that we minimize the maximum block size, pursuing optimal resource utilization and trying to avoid overloading a processor. This is optimally done by breaking the problem into blocks of m²/P elements, but the blocks must be rectangular and fit into C. One way to break C into P rectangular blocks is to divide the rows and columns of C into q1 and q2 sets respectively, with q1 · q2 = P. The optimal way to minimize the maximum block size is to divide the m rows into q1 sets of ⌊m/q1⌋ rows (with some sets having an extra row), and the same for columns. The maximum tile size is then ⌈m/q1⌉ · ⌈m/q2⌉, which is bounded by (m/q1 + 1) · (m/q2 + 1). The difference between this upper bound and the optimal tile size m²/P is m/q1 + m/q2 + 1, and this difference is minimized when q1 = q2 = √P. If P is not a square number, we find the q1 that is a factor of P and closest to, but not larger than, √P. To further optimize, we can turn off some processors if the maximum tile size can thereby be decreased. In practice, this reduces to turning off processors if q2 − q1 becomes smaller; in general, this occurs if P is prime or one larger than a square number.
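The partitioning just described can be sketched as follows. The helper names (grid, split) are our own; the logic follows the text: pick q1 as the factor of P closest to, but not larger than, √P, then spread rows and columns so that set sizes differ by at most one.

```python
import math

def grid(P):
    """Choose q1 * q2 = P with q1 the largest factor of P not
    exceeding sqrt(P)."""
    q1 = max(q for q in range(1, math.isqrt(P) + 1) if P % q == 0)
    return q1, P // q1

def split(m, q):
    """Sizes of the q row (or column) sets, differing by at most one."""
    base, extra = divmod(m, q)
    return [base + 1] * extra + [base] * (q - extra)

q1, q2 = grid(144)
print(q1, q2)          # 144 TUs -> a 12 x 12 grid
print(split(100, q1))  # 100 rows: four sets of 9 rows, eight of 8
# Maximum tile size = ceil(m/q1) * ceil(m/q2)
print(max(split(100, q1)) * max(split(100, q2)))
```

For P that is prime or one past a square (e.g. P = 32 gives a lopsided 4 × 8 grid), the text's suggestion is to consider turning processors off when that shrinks the maximum tile.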
4.2 Minimization of High Cost Memory Operations
After addressing the synchronization and load-balancing problems for MM, the next major bottleneck is the impact of memory operations. Despite the high
bandwidth of on-chip memory in many-core architectures (e.g. C64), memory bandwidth and size are still bottlenecks for algorithms, producing stalls while processors wait for new data. As a result, implementations of MM, LU and FFT are still memory bound [7,15,21]. However, the flexibility of software-managed memory hierarchies provides new opportunities to the programmer for developing better techniques for tiling and data locality, without the constraints imposed by cache parameters such as line size or associativity [19,21]. This implies an analysis of the tile shapes, the tile size, and the sequences in which tiles are traversed, taking advantage of this new dimension in the design space. In pursuing a better use of the memory hierarchy, our approach considers two levels of this hierarchy: one faster but smaller, the other slower but bigger. Our objective is to minimize the number of slow memory operations, loads (LD) and stores (ST), which are a function of the problem (Λ), the number of processors (P), the tile parameters (L) and the sequence of traversing tiles (S), subject to the constraint that the data used in the current computation (R) cannot exceed the size of the small memory (Rmax). This can be expressed as the optimization problem:

    min_{L,S}  LD(Λ, P, L, S) + ST(Λ, P, L, S)
    s.t.  R(Λ, P, L, S) ≤ Rmax                                  (1)
In our case, registers are the fast memory and Λ is the MM with the partitioning described in subsection 4.1. Our analysis assumes perfect load balancing, where each block C ∈ C of size n × n (n = m/√P) computed by one processor is subdivided into tiles Ci,j ∈ C of size L2 × L2. Due to data dependencies, the required blocks A ∈ A and B ∈ B, of sizes n × m and m × n, are subdivided into tiles Ai,j ∈ A and Bi,j ∈ B of sizes L2 × L1 and L1 × L2 respectively. Each processor requires 3 nested loops to compute all the tiles of its block. Using loop interchange analysis, an exhaustive study of the 6 possible schemes to traverse tiles was conducted, and two prototype sequences, S1 and S2, were found. The algorithms that describe these sequences are shown in Fig. 2.

(a) Algorithm using sequence S1:
    S1: for i = 1 to n/L2
    S2:   for j = 1 to n/L2
    S3:     Initialize Ci,j
    S4:     for k = 1 to m/L1
    S5:       Load Ai,k, Bk,j
    S6:       Ci,j += Ai,k · Bk,j
            end for
    S7:     Store Ci,j
          end for
        end for

(b) Algorithm using sequence S2:
    S1: for i = 1 to n/L2
    S2:   for k = 1 to m/L1
    S3:     Load Ai,k
    S4:     for j = 1 to n/L2
    S5:       if k = 1 then Initialize Ci,j
    S6:       else Load Ci,j
    S7:       Load Bk,j
    S8:       Ci,j += Ai,k · Bk,j
    S9:       Store Ci,j
            end for
          end for
        end for

Fig. 2. Implementation of sequences for traversing tiles in one block of C
Based on the data dependencies of these implementations, the general optimization problem described in (1) can be expressed for our case by Eq. (2):

    min_{L ∈ {L1,L2}, S ∈ {S1,S2}}  f(m, P, L, S) =
        { 2m³/L2 + m²                         if S = S1
        { m³/L2 + 2m³/L1 + (√P − 1)m²         if S = S2          (2)
    s.t.  2·L1·L2 + L2² ≤ Rmax

Analyzing the piecewise function f, we notice that if P ≥ 4 the objective function for S = S1 is always smaller than the objective function for S = S2. Since f then depends only on L2, we minimize f by maximizing L2. Given the constraint, L2 is maximized by minimizing L1; thus L1 = 1, and we solve for the optimal L2 on the boundary of the constraint. The solution of Eq. (2) if P ≥ 4 is:

    L1 = 1,    L2 = √(Rmax + 1) − 1                              (3)

This result is not completely accurate, since we assumed that there are no remainders when we divide the matrices into blocks and subdivide the blocks into tiles. Despite this, it can be used as a good estimate. For comparison purposes, C64 has 63 registers; we need to keep one register each for the stack pointer, the pointers to the A, B and C matrices, m, and the stride parameter, so Rmax = 63 − 6 = 57 and the solution of Eq. (3) is L1 = 1 and L2 = 6. Table 1 summarizes the results in terms of the number of LD and ST for the proposed tiling and two other options that fully utilize the registers and have been used in practical algorithms: inner product of vectors (L1 = 28, L2 = 1) and square tiles (L1 = L2 = 4). As a consequence of using sequence S1, the number of ST is equal in all tiling strategies. As expected, the proposed tiling has the minimum number of LD: 6 times fewer than the inner product tiling and 1.5 times fewer than the square tiling.
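The tile-size calculation above is a one-liner worth checking numerically: with L1 = 1 the constraint becomes L2² + 2·L2 ≤ Rmax, whose integer solution is ⌊√(Rmax + 1) − 1⌋, and the S1 load count 2m³/L2 then yields the 6× and 1.5× gaps of Table 1. The helper name is our own.

```python
import math

def optimal_l2(rmax):
    """Largest integer L2 with L2^2 + 2*L2 <= rmax (L1 = 1), i.e.
    floor(sqrt(rmax + 1) - 1), as in Eq. (3)."""
    return int(math.sqrt(rmax + 1) - 1)

Rmax = 57            # 63 usable registers minus 6 reserved, as in the text
L2 = optimal_l2(Rmax)
print(L2)            # -> 6

m = 480              # any matrix size divisible by the tile widths
loads = {
    'inner product': 2 * m**3,        # L2 = 1
    'square':        2 * m**3 // 4,   # L2 = 4
    'optimal':       2 * m**3 // L2,  # L2 = 6
}
print(loads['inner product'] // loads['optimal'])  # -> 6 (times fewer loads)
```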
4.3 Architecture Specific Optimizations
Although the general results of subsection 4.2 are of major importance, an implementation that properly exploits specific features of the architecture is also important for maximizing the performance. We will use our knowledge and experience for taking advantage of the specific features of C64 but the guidelines proposed here could be extended to similar architectures.
Table 1. Number of memory operations for different tiling strategies

Memory Operations   Inner Product   Square   Optimal
Loads               2m³             m³/2     m³/3
Stores              m²              m²       m²
The first optimization is the use of special assembly instructions for Load and Store. C64 provides the instructions multiple load (ldm RT, RA, RB) and multiple store (stm RT, RA, RB), which combine several memory operations into a single instruction. For the ldm instruction, starting from the address in memory contained in RA, consecutive 64-bit values in memory are loaded into consecutive registers, from RT through and including RB. Similarly, the stm instruction stores 64-bit values to memory consecutively from registers RT through and including RB, starting at the memory address contained in RA. The advantage of these instructions is that the normal load instruction issues one data transfer request per element, while the special one issues one request per 64-byte boundary. Because our tiling is 6 × 1 in A and 1 × 6 in B, we need A in column-major order and B in row-major order as a requirement for exploiting this feature. If they are not in the required layout, we transpose one matrix, without affecting the complexity of the proposed algorithms because the running time of transposition is O(m²). The second optimization applied is instruction scheduling: the correct interleaving of independent instructions to alleviate stalls. Data dependencies can stall the execution of the current instruction while it waits for the result of one issued previously. We want to hide or amortize the cost of critical instructions that increase the total computation time by executing other instructions that do not share variables or resources. The most common example involves interleaving memory instructions with data instructions, but there are other cases: multiple integer operations can be executed while a floating point operation, such as a multiply and add, is computed.
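A toy model (our own, based on the request behavior the text attributes to ldm/stm: one request per 64-byte boundary instead of one per element) shows why the 6 × 1 tile layout matters: six consecutive doubles span 48 bytes, so a multiple load touches at most two 64-byte boundaries.

```python
def requests_normal(n_elems):
    # Normal loads: one data transfer request per element
    return n_elems

def requests_ldm(start_addr, n_elems, elem_size=8, boundary=64):
    """Requests issued by a multiple load under the assumed model:
    one per 64-byte boundary spanned by the run of elements."""
    end = start_addr + n_elems * elem_size - 1
    return end // boundary - start_addr // boundary + 1

print(requests_normal(6))    # a 6x1 tile of doubles -> 6 requests
print(requests_ldm(0, 6))    # aligned 48-byte run -> 1 request
print(requests_ldm(32, 6))   # run crossing one boundary -> 2 requests
```

This is why the algorithm wants A column-major and B row-major: it makes each 6-element tile a contiguous run that a single ldm can fetch.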
5 Experimental Evaluation
This section describes the experimental evaluation, based on the analysis of section 4, on the C64 architecture described in section 2. Our baseline parallel MM implementation works with square matrices m × m and was written in C. The experiments were run up to m = 488, placing matrices A and B in on-chip SRAM and matrix C in off-chip DRAM; the maximum number of TUs used is 144. To analyze the impact of the partitioning scheme described in subsection 4.1, we compare it with two other partitioning schemes. Fig. 3 shows the performance reached for two different matrix sizes. In Partitioning 1, the m rows are divided into q1 sets, the first q1 − 1 containing ⌈m/q1⌉ rows each and the last set containing the remaining rows. The same partitioning is applied to the columns. It has the worst performance of the three partitionings because it does not minimize the maximum tile size. Partitioning 2 has the optimum maximum tile size of ⌈m/q1⌉ · ⌈m/q2⌉ but does not distribute the rows and columns uniformly among the q1 and q2 sets, respectively. Its performance is very close to that of our algorithm, Partitioning 3, which has the optimum maximum tile size and a better distribution of rows and columns among the q1 and q2 sets. A disadvantage of Partitioning 2 compared to Partitioning 3 is that for small matrices (m ≤ 100) and a large number of TUs
324
E. Garcia et al.
Fig. 3. Different partition schemes vs. number of thread units. Performance (GFLOPS) of Partitioning 1, 2 and 3 for 1 to 144 thread units; (a) matrix size 100 × 100, (b) matrix size 488 × 488.
Partitioning 2 may produce significantly lower performance, as can be observed in Fig. 3a. Our partitioning algorithm, Partitioning 3, always performs best; the maximum performance reached is 3.16 GFLOPS. The other scheme with optimum maximum tile size also performs well for large matrices, indicating that minimizing the maximum tile size is an appropriate target for optimizing the workload. In addition, our partitioning algorithm scales well with respect to the number of threads, which is essential for many-core architectures. The results of the progressive improvements made to our MM algorithm are shown in Fig. 4 for the maximum matrix size that fits in SRAM. The tiling strategy proposed in subsection 4.2 for minimizing the number of memory operations was implemented in assembly code using tiles of 6 × 1, 1 × 6 and 6 × 6 for blocks in A, B and C, respectively. Because the sizes of the blocks in C are not necessarily multiples of 6, all possible combinations of tiles with size less than 6 × 6 were implemented as well. The maximum performance reached was 30.42 GFLOPS, almost 10 times the maximum performance of the version that uses only the optimum partition. This large improvement shows the advantage of the correct tiling and sequence of traversing tiles, which directly minimizes the time spent waiting for operands by substituting costly memory operations in SRAM with operations between registers. Put differently, our tiling increases the reuse of data in registers, minimizing the number of memory accesses for a fixed number of computations. The following optimizations, which are more closely related to specific features of C64, increased the performance further. The use of multiple load and multiple store instructions (ldm/stm) diminishes the time spent transferring data addressed consecutively in memory. The new maximum performance is 32.22 GFLOPS: 6% better than the version without architecture-specific optimizations.
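The balanced split behind Partitioning 3, discussed above, can be sketched as follows. This is our own illustration, with a hypothetical method name: the m rows (or columns) are divided into q sets whose sizes differ by at most one, so the maximum set size is the optimal ⌈m/q⌉.

```java
// Hedged sketch (ours) of the balanced row/column split of Partitioning 3.
public class BalancedSplit {
    static int[] split(int m, int q) {
        int[] sizes = new int[q];
        int base = m / q, extra = m % q;  // the first 'extra' sets get one more row
        for (int s = 0; s < q; s++) {
            sizes[s] = base + (s < extra ? 1 : 0);
        }
        return sizes;
    }
}
```

For m = 488 and q = 12, this yields eight sets of 41 rows and four of 40, so the maximum is ⌈488/12⌉ = 41 and the work is distributed as uniformly as possible.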
The potential of these features has not been completely exploited, because transactions that cross a 64-byte boundary are divided, and transactions in groups of 6 do not provide an optimum pattern for minimizing this division. Finally, the instruction scheduling applied to hide the cost of some instructions by doing other computations in between increases performance by 38%. The maximum performance of our
Fig. 4. Impact of each optimization on the performance of MM using m = 488. Performance (GFLOPS) for 1 to 144 thread units of the successive versions: Partitioning; Tiling; Optimization 1 - ldm/stm; Optimization 2 - Inst. Scheduling.
MM algorithm is 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of a C64 chip. We also measured power consumption using the current drawn by the two voltage sources of the C64 chip (1.2 V and 1.8 V), yielding a total of 83.22 W, or 530 MFLOPS/W. This demonstrates the power efficiency of C64 for the problem under consideration. This value is similar to the top of the Green500 list, which ranks the most energy-efficient supercomputers in the world.
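The efficiency figure quoted above follows directly from the measured performance and power. A small check (ours, with a hypothetical helper name):

```java
// Sanity check of the efficiency figure: MFLOPS/W = GFLOPS * 1000 / watts.
public class PowerEfficiency {
    static double mflopsPerWatt(double gflops, double watts) {
        return gflops * 1000.0 / watts;
    }
}
```

44.12 GFLOPS at 83.22 W gives roughly 530 MFLOPS/W, matching the stated value.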
6 Conclusions and Future Work
In this paper we present a methodology for designing algorithms for many-core architectures with a software-managed memory hierarchy, taking advantage of the flexibility these systems provide. We apply it to design a dense matrix multiplication (MM) mapping, and we implement MM for C64. We propose three strategies for increasing performance and show their advantages on this kind of architecture. The first strategy is a balanced distribution of work among threads: our partitioning strategy not only distributes the amount of computation as uniformly as possible but also minimizes the maximum block size assigned to each thread. Experimental results show that the proposed partitioning scales well with respect to the number of threads for different sizes of square matrices and performs better than other similar schemes. The second strategy alleviates the total cost of memory accesses. We propose an optimal register tiling with an optimal sequence of traversing tiles that minimizes the
number of memory operations and maximizes the reuse of data in registers. The implementation of the proposed tiling reached a maximum performance of 30.42 GFLOPS, almost 10 times the maximum performance reached by the optimum partition alone. Finally, architecture-specific optimizations were implemented. The use of multiple load and multiple store instructions (ldm/stm) diminishes the time spent transferring data that are stored consecutively in memory. It was combined with instruction scheduling, hiding or amortizing the cost of some memory operations and high-cost floating-point instructions by doing other computations in between. After these optimizations, the maximum performance of our MM algorithm is 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of a C64 chip. We also provide evidence of the power efficiency of C64: power consumption measurements show a maximum efficiency of 530 MFLOPS/W for the problem under consideration. This value is comparable to the top of the Green500 list, which ranks the most energy-efficient supercomputers in the world. Future work includes the study of other techniques, such as software pipelining and work stealing, that can further increase the performance of this algorithm. We also want to explore how to increase the size of the tiles beyond the maximum number of registers, using the stack and the SPM. In addition, we would like to apply this methodology to other linear algebra problems such as matrix inversion.
Acknowledgments. This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244), and other government sponsors. We thank all the members of the CAPSL group at the University of Delaware and ET International who have given us valuable comments and feedback.
A Language-Based Tuning Mechanism for Task and Pipeline Parallelism

Frank Otto, Christoph A. Schaefer, Matthias Dempe, and Walter F. Tichy
Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
{otto,cschaefer,dempe,tichy}@ipd.uka.de
Abstract. Current multicore computers differ in many hardware aspects. Tuning parallel applications is indispensable to achieve best performance on a particular hardware platform. Auto-tuners represent a promising approach to systematically optimize a program’s tuning parameters, such as the number of threads, the size of data partitions, or the number of pipeline stages. However, auto-tuners require several tuning runs to find optimal values for all parameters. In addition, a program optimized for execution on one machine usually has to be re-tuned on other machines. Our approach tackles this problem by introducing a language-based tuning mechanism. The key idea is the inference of essential tuning parameters from high-level parallel language constructs. Instead of identifying and adjusting tuning parameters manually, we exploit the compiler’s context knowledge about the program’s parallel structure to configure the tuning parameters at runtime. Consequently, our approach significantly reduces the need for platform-specific tuning runs. We implemented the approach as an integral part of XJava, a Java language extension to express task and pipeline parallelism. Several benchmark programs executed on different hardware platforms demonstrate the effectiveness of our approach. On average, our mechanism sets over 90% of the relevant tuning parameters automatically and achieves 93% of the optimal performance.
1 Introduction
In the multicore era, performance gains for applications of all kinds will come from parallelism. The prevalent thread model forces programmers to think on low abstraction levels. As a consequence, writing multithreaded code that offers satisfying performance is not straightforward. New programming models have been proposed for simplifying parallel programming and improving portability. Interestingly, their high-level constructs can be used for automatic performance tuning. Libraries, in contrast, do not normally provide semantic information about parallel programming patterns. Case studies have shown that parallel applications typically employ different types of parallelism on different levels of granularity [13]. Performance depends on various parameters such as the number of threads, the number of pipeline

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 328–340, 2010. © Springer-Verlag Berlin Heidelberg 2010
stages, or load balancing strategies. Usually, these parameters have to be defined and set explicitly by the programmer. Finding a good parameter configuration is far from easy due to the large parameter search spaces. Auto-tuners provide a systematic way to find an optimal parameter configuration. However, as the best configuration strongly depends on the target platform, a program normally has to be re-tuned after porting it to another machine. In this paper, we introduce a mechanism that automatically infers and configures five essential tuning parameters from high-level parallel language constructs. Our approach exploits explicit information about task and pipeline parallelism and uses tuning heuristics to set appropriate parameter values at runtime. From the programmer's perspective, a considerable number of tuning parameters becomes invisible. That is, the need for feedback-directed auto-tuning processes on different target platforms is drastically reduced. We implemented our approach as part of the previously introduced language XJava [11,12]. XJava extends Java with language constructs for high-level parallel programming and allows the direct expression of task and pipeline parallelism. An XJava program compiles to Java code instrumented with tuning parameters and context information about its parallel structure. The XJava runtime system exploits the context information and platform properties to set the tuning parameters. We evaluated our approach on a set of seven benchmark programs. Our approach sets over 90% of the relevant tuning parameters automatically, achieving 93% of the optimum performance on three different platforms.
2 The XJava Language
XJava extends Java by adding tasks and parallel statements. For a quick overview, the simplified grammar extension in BNF style is shown in Figure 1. We basically extend the existing production rules for method declarations (rule 1) and statements (rule 7). New keywords are work and push; new operators are => and |||. The semantics are described next.

2.1 Language
Tasks. Tasks are conceptually related to filters in stream languages. Basically, a task is an extension of a method. Unlike methods, a task defines a concurrently executable activity that expects a stream of input data and produces a stream of output data. The types of data elements within the input and output stream are defined by the task’s input and output type. These types can also be void in order to specify that there is no input or output. For example, the code public String => String encode(Key key) { work (String s) { push encrypt(s, key); } } declares a public task encode with input and output type String. The work block defines what to do for each incoming element and can be thought of as a
Fig. 1. The grammar extension of XJava
loop. A task body contains either exactly one or no work block (rule 6). A push statement inside a task body puts an element into the output stream. In the example, these elements are String objects encrypted by the method encrypt and the parameter key. Parallel statements. Tasks are called like methods; parallelism is generated by combining task calls with operators to compose parallel statements (rule 9). Basically, these statements can be used both outside and inside a task body; the latter case introduces nested parallelism. Parallel statements allow for easily expressing many different types of parallelism, such as linear and non-linear pipelines, master/worker configurations, data parallelism, and recursive parallelism. (1) Combining tasks with the “=>” operator introduces pipeline parallelism. In addition to the task encrypt above, we assume two more tasks read and write for reading and writing to a file. Then, the pipeline statement read(fin) => encode(key) => write(fout); creates a pipeline that encodes the content of the file fin and writes results to the file fout. (2) Combining tasks with the “|||” operator introduces task parallelism. Assuming a task compress, the concurrent statement compress(f1) ||| compress(f2); compresses two files f1 and f2 concurrently. By default, a task is executed by one thread. Optionally, a task call can be marked with a “+” operator to make it replicable. A replicable task can be executed by more than one thread, which is useful to reduce bottleneck effects in pipelines.
For example, the task encode in the pipeline example above might be the slowest stage. Using the expression encode(key)+ instead of encode(key) can increase throughput, since we allow more threads to execute that critical stage. The number of replicates is determined at runtime and thus does not need to be specified by the programmer. If the programmer wants to create a concrete number of task instances at once, say 4, he can use the expression encode(key):[4].

2.2 Compiler and Runtime System
The XJava compiler transforms XJava to optimized and instrumented Java code, which is then translated into bytecode. The translated program consists of logical code units that are passed to the XJava runtime system XJavaRT. XJavaRT is the place where parallelism happens. It is designed as a library employing executor threads and built-in scheduling mechanisms.
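As a rough intuition for what such a runtime does with a pipeline statement like read(fin) => encode(key) => write(fout), consider the following plain-Java sketch. It is our own simplification, not XJavaRT's API: two stages connected by a bounded queue, each run by its own thread, with a sentinel marking the end of the stream.

```java
import java.util.concurrent.*;

// Hedged illustration (ours) of a two-stage pipeline: a producer stage
// feeds a consumer stage through a blocking queue, the way executor
// threads conceptually run XJava tasks. Class and stage names are ours.
public class PipelineSketch {
    static final String EOS = "\u0000EOS";  // end-of-stream sentinel

    public static java.util.List<String> run(java.util.List<String> input) {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(16);
        java.util.List<String> out = new java.util.ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String s : input) q.put(s);   // the 'read' stage
                q.put(EOS);
            } catch (InterruptedException ignored) {}
        });
        Thread consumer = new Thread(() -> {
            try {
                for (String s; !(s = q.take()).equals(EOS); )
                    out.add(s.toUpperCase());      // stands in for 'encode'
            } catch (InterruptedException ignored) {}
        });
        producer.start(); consumer.start();
        try { producer.join(); consumer.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return out;
    }
}
```

A single consumer preserves stream order; replicating a stage (the `+` operator) would add more consumer threads on the same queue, which is where the tuning parameters of Section 4 come in.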
3 Tuning Challenges
A common reason for poor performance of parallel applications is poor adaptation of the parallel code to the underlying hardware platform. With the parallelization of an application, a large number of performance-relevant tuning parameters arise, e.g. how many threads are used for a particular calculation, how to set the size of data partitions, how many stages a pipeline requires, or how to accomplish load balancing for worker threads. Manual tuning is tedious, costly, and, due to the large number of possible parameter configurations, often hopeless. To automate the optimization process, search-based automatic performance tuning (auto-tuning) [23,1,20,22] is a promising approach. Auto-tuning is a feedback-directed process consisting of several steps: choice of a parameter configuration, program execution, performance monitoring, and generation of a new configuration based on search algorithms such as hill climbing or simulated annealing. Experiments with real-world parallel applications have shown that with appropriate tuning techniques, a significant performance gain can be achieved on top of "plausible" configurations chosen by the programmer [13,18]. However, as the diversity of application areas for parallelism has grown and the available parallel platforms differ in many respects (e.g. in the number or type of cores, cache architecture, available memory, or operating system), the number of targets to optimize for is large. Optimizations made for a certain machine may cause a slowdown on another machine. Thus, a program optimized for a particular hardware platform usually has to be re-tuned on other platforms. For illustration, consider a parallel program with only one tuning parameter t that adjusts the number of concurrent threads. While the best configuration for t on a 4-core machine is probably a value close to 4, this configuration might be suboptimal for a machine with 16 cores.
From the auto-tuner’s perspective, t represents a set of values to choose from. If the tuner knew the purpose of t, it would be able to configure t directly in relation to the number of cores providing significantly improved performance.
To tackle the problem of optimization portability, recent approaches propose the use of tuning heuristics to exploit information about purpose and impact of tuning parameters [17]. This context information helps configuring parameters implicitly without enumerating and testing their entire value range.
4 Language-Based Tuning Mechanism
We propose an approach that exploits tuning-relevant context information from XJava's high-level parallel language constructs (cf. Section 2). Relevant tuning parameters are automatically inferred and implicitly set by the runtime system (XJavaRT). Therefore, porting an XJava application to another machine requires less re-tuning, and in several cases no re-tuning at all. Figure 2 illustrates the concept of our approach (b) in contrast to feedback-directed auto-tuning (a). Our work focuses on task and pipeline parallelism; both forms of parallelism are widely used. Task parallelism refers to tasks whose computations are independent of each other. Pipeline parallelism refers to tasks with input-output dependencies, i.e. the output of one task serves as the input of the next.
Fig. 2. Adapting a parallel program P to different target platforms M 1, M 2, M 3. (a) A search-based auto-tuner requires the explicit declaration of tuning parameters a, b, c. The auto-tuner needs to perform several feedback-directed tuning runs on each platform to find the best configuration. (b) In our approach, we use compiler knowledge to automatically infer relevant tuning parameters and context information about the program’s parallel structure. The parameters are set by the runtime system XJavaRT, which uses tuning heuristics that depend on the characteristics of the target platform.
First, we describe essential types of tuning parameters for these forms of parallelism (Section 4.1). Then, we show how the XJava compiler infers tuning parameters and context information from code (Sections 4.2 and 4.3). Finally, we describe heuristics to set the tuning parameters (Section 4.4).
4.1 Tuning Parameters
Tuning parameters represent program variables that may influence performance. In our work, we distinguish between explicit and implicit tuning parameters. The former have to be specified and configured by the programmer; the latter are invisible to the programmer and set automatically. In the following we describe essential types of tuning parameters for task and pipeline parallelism [13,17].

Thread count (TC). The total number of threads executing an application strongly influences its performance. Underestimating this number will limit speedup; overestimating it might slow down the program due to synchronization overhead and memory consumption.

Load balancing strategy (LB). The load balancing strategy determines how to distribute workload to execution threads or CPU cores. Load balancing can be done statically, e.g. in a round-robin style, or dynamically, e.g. in a first-come-first-serve fashion or combined with work stealing.

Cut-off depth (CO). Parallel applications typically employ parallelism on different levels. Low-level parallelism can have a negative impact on performance if the synchronization and memory costs are higher than the additional speedup of concurrent execution. In other words, there is a level CO below which parallelism is not worthwhile and a serial execution of the code is preferable.

Stage replicates (SR). The throughput and speedup achieved by a pipeline are limited by its slowest stage. If this stage is stateless, it can be replicated in order to be executed by more than one thread. The parameter SR denotes the number of replicates.

Stage fusion (SF). From the programmer's perspective, the conceptual layout of a pipeline usually consists of n stages s1, ..., sn. However, mapping each stage si to one thread may not be the best configuration. Instead, fusing some stages can reduce bottleneck effects. Stage fusion represents functional composition of stages and is similar to the concept of filter fusion [14].

Data size (DS). Parallel programs often process a large amount of data that needs to be decomposed into smaller partitions. The data partition size typically affects the program's performance.

The applications considered here expose up to 14 parameters that need to be tuned (cf. Section 5). Note that one application can contain several parameters of the same type. The following sections show how our approach automatically infers and sets these parameters, except DS. As the most appropriate size of data partitions depends on the type of application, we leave this issue to the programmer or further tuning. The XJava programmer must define separate tasks for decomposing and merging data.

4.2 Inferring Tuning Parameters from XJava Code
The XJava compiler generates Java code and adds tuning parameters. Task parallel statements are instrumented with the parameter cut-off depth (CO).
Fig. 3. Inferring tuning parameters and context information from XJava code
When compiling a pipeline statement consisting of n stages s1, ..., sn, the stages s2, ..., sn are instrumented with the boolean tuning parameter stage fusion (SF), indicating whether that stage should be fused with the previous one. In addition, the parameter stage replicates (SR) is added to each stage declared as replicable. Figure 3 illustrates the parameter inference for a task parallel statement and a pipeline. A task parallel statement p() ||| q() is instrumented with the parameter CO. Depending on its value, the statement executes either concurrently or, if the cut-off depth is reached, sequentially. A pipeline a() => b()+ => c()+ => d() compiles to a set of four task instances a, b, c and d. Since b and c are replicable, a tuning parameter SR is added to each of them. In addition, b, c and d get a boolean parameter SF defining whether to fuse that stage with the previous one. The parameters TC and LB for the overall number of threads and the load balancing strategy affect both task parallel statements and pipelines. In Section 4.4, we describe the heuristics used to set the parameters.

4.3 Inferring Context Information
Besides inferring tuning parameters, the XJava compiler exploits context information about the program's parallel structure. The compiler makes this knowledge available at runtime so that tuning parameters can be set appropriately. The context information of a task call includes several aspects: (1) the purpose of the task (pipeline stage or part of a task-parallel section), (2) input and output dependences, (3) periodic or non-periodic task, (4) level of parallelism, and (5) current workload of the task. Aspects 1-3 can be inferred at compile time, aspects 4 and 5 at runtime. However, XJavaRT has access to all of this information. Figure 3 sketches potential context information for tasks.

4.4 Tuning Heuristics
Thread count (TC). XJavaRT provides a global thread pool to control the total number of threads and to monitor the numbers of running and idle threads at any time. XJavaRT knows the number n of a machine's CPU cores and
therefore uses the heuristic TC = n · α for some α ≥ 1. We use α = 1.5 as the predefined value.

Load balancing (LB). XJavaRT employs different load balancing strategies depending on the corresponding context information. For recursive task parallelism, such as divide-and-conquer algorithms, XJavaRT applies a work stealing mechanism based on the Java fork/join framework [8]. For pipelines, XJavaRT prefers to execute stages with higher workloads, thus implementing a dynamic load balancing strategy.

Cut-off depth (CO). XJavaRT dynamically determines the cut-off depth for task parallel expressions to decide whether to execute a task parallel statement concurrently or sequentially. Since XJavaRT keeps track of the number of idle executor threads, it applies the heuristic CO = ∞ if idle threads exist, and CO = l otherwise (where l is the nesting level of the task parallel expression). In other words, tasks are executed sequentially if there are no executor threads left.

Stage replicates (SR). When a replicable task is called, XJavaRT creates SR = i replicates of the task, where i denotes the number of idle executor threads. If there are no idle threads, i.e. all CPU cores are busy, no replicates will be created. XJavaRT uses a priority queue that puts tasks with lower workload (i.e. few data items waiting at their input port) at the end. This mechanism does not always achieve optimal results, but seems effective in practice, as our results show.
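The three heuristics above can be sketched as simple functions. This is our own condensation, with method names of our choosing, not XJavaRT's API; CO = ∞ is approximated by Integer.MAX_VALUE.

```java
// Hedged sketch (ours) of the TC, CO and SR heuristics described above.
public class TuningHeuristics {
    static final double ALPHA = 1.5;  // the paper's predefined alpha

    // TC = n * alpha, rounded up to a whole number of threads
    static int threadCount(int cores) {
        return (int) Math.ceil(cores * ALPHA);
    }

    // CO: no cut-off while idle threads exist, otherwise cut off
    // at the current nesting level l of the task parallel expression
    static int cutOffDepth(int idleThreads, int level) {
        return idleThreads > 0 ? Integer.MAX_VALUE : level;
    }

    // SR: one replicate per idle executor thread
    static int stageReplicates(int idleThreads) {
        return idleThreads;
    }
}
```

For a 4-core machine this yields TC = 6 threads, and a nested task parallel statement falls back to sequential execution exactly when no executor thread is idle.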
Fig. 4. Stage fusion for a pipeline a() => b()+ => c()+ => d(). (a) Stage replication without fusion introduces overhead for splitting and joining data items. (b) Stage fusion prior to replication removes some of this overhead.
Stage fusion (SF). In a pipeline consisting of several stages, combining two or more stages into a single stage can increase performance, as the overhead for split-join operations is reduced. Therefore, XJava fuses consecutive replicable tasks within a pipeline expression into a single replicable task. Figure 4 illustrates this mechanism for a pipeline a() => b()+ => c()+ => d().
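At its core, fusing two stages is function composition: the fused stage applies both stage bodies to each item and is then replicated as a whole, so one queue hand-off replaces two. A minimal sketch of this idea, with a helper name of our choosing:

```java
import java.util.function.Function;

// Hedged sketch (ours) of stage fusion as function composition:
// consecutive replicable stages b and c become one stage b∘c,
// which can then be replicated as a whole (cf. Fig. 4b).
public class StageFusion {
    static <A, B, C> Function<A, C> fuse(Function<A, B> first,
                                         Function<B, C> second) {
        return first.andThen(second);  // one hand-off instead of two
    }
}
```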
5 Experimental Results
We evaluate our approach using a set of seven benchmarks that cover a wide range of parallel applications, including algorithmic problems such as sorting
or matrix multiplication, as well as real-world applications for raytracing, video processing and cryptography. The applications use task, data or pipeline parallelism. We measure two metrics:

Implicit tuning parameters. We count the number of automatically handled tuning parameters as a metric for the simplification of the optimization process. If more tuning parameters are automated, fewer optimizations have to be performed manually.

Performance. For each application, we compared a sequential version to an XJava version and measured the speedups heur and best:
– heur: Speedups of the XJava programs using our heuristic-based approach. These programs did not require any manual adjustments.
– best: Speedups achieved for the best parameter configuration found by an auto-tuner performing an exhaustive search.

The speedups over the sequential versions were measured on three different parallel platforms: (1) an Intel Quadcore Q6600 with 2.40 GHz, 4 GB RAM and Windows 7 Professional 64 Bit; (2) a Dual Intel Xeon Quadcore E5320 with 1.86 GHz, 8 GB RAM and Ubuntu Linux 7.10; (3) a Sun Niagara T2 with 8 cores (each capable of 8 threads), 1.2 GHz, 16 GB RAM and Solaris 10.

5.1 Benchmarked Applications
MSort and QSort implement the recursive mergesort and quicksort algorithms to sort a randomly generated array with approximately 33.5 million integer values. Matrix multiplies two matrices based on a master-worker configuration, where the master divides the final matrix into areas and assigns them to workers. MBrot computes the Mandelbrot set for a given resolution and a maximum of 1000 iterations. LRay is a lightweight raytracer entirely written in Java. MBrot and LRay both use the master-worker pattern by letting the master divide the image into multiple blocks, which are then computed concurrently by workers.

The applications Video and Crypto use pipeline parallelism. Video combines multiple frames into a slideshow, applying several filters such as scaling and sharpening to each of the video frames. The resulting pipeline contains eight stages, five of which are data parallel and can be replicated. Crypto applies multiple encryption algorithms from the javax.crypto package to a 60 MB text file that is split into 5 KB blocks. The pipeline has seven stages; all stages except those for input and output are replicable.

5.2 Results
Implicit tuning parameters. Depending on the parallelization strategy, the programs expose different tuning parameters. Figure 5 shows the numbers of explicit and implicit parameters for each application. Explicit parameters are declared and set in the program code. Implicit parameters do not appear in the code; they are automatically inferred by the compiler and set using our approach.
A Language-Based Tuning Mechanism for Task and Pipeline Parallelism
Fig. 5. Explicit and implicit tuning parameters for the benchmarked applications. On average, our approach infers and sets 91% of the parameters automatically.
On average, the number of explicit tuning parameters is reduced by 91%, ranging from 67% to 100%. Our mechanism automatically infers and sets all parameters except the data size (DS). For Matrix, these are the sizes of the parts of the matrix to be computed by a worker; for MBrot and LRay, these are the sizes of the image blocks calculated concurrently. In Crypto, the granularity is determined by the size of the data blocks that are sent through the pipeline. As Video decomposes the video data frame by frame, there is no need for an explicit tuning parameter to control the data size.
Fig. 6. Execution times (milliseconds) of the sequential benchmark programs
Performance. Figure 6 lists the execution times of the sequential benchmark programs. Figure 7 shows the speedups for the corresponding XJava versions on the three parallel platforms. Using our approach, the XJava programs achieve an average speedup of about 3.5 on the Q6600 quadcore, 5.0 on the E5320 dual-quadcore, and 17.5 on the Niagara T2.

The automatic replication of XJava tasks achieves good utilization of the available cores in the master-worker and pipeline applications, although the round-robin distribution of items leads to suboptimal load balancing in the replicated stages. The blocks in Crypto are of equal size, leading to an even workload. The frames in Video have different dimensions, resulting in slightly lower speedups.

To examine the quality of our heuristics, we used a script-based auto-tuner performing an exhaustive search to find the best parameter configuration. We
Fig. 7. Performance of our heuristic-based approach (heur ) in comparison to the best configuration found by a search-based auto-tuner
observed the largest performance difference for QSort and for LRay on the E5320 machine. In general, QSort benefits from further increasing the cut-off threshold, as more tasks allow better load balancing with work stealing. For LRay, reducing the number of workers by one increases the speedup from 3.4 to 5; we attribute this behavior to poor cache usage or other memory bottlenecks when using too many threads. In all other cases, the search-based auto-tuner achieved only minor additional speedups compared to our language-based tuning mechanism. In total, the mean error rate of the heuristic-based configurations relative to the best configurations is 9% on the E5320 dual-quadcore, 7% on the Niagara T2, and 4% on the Q6600 quadcore. That is, our approach achieves 93% of the optimal performance.
6 Related Work
Auto-tuning has been investigated mainly in the area of numerical software and high-performance computing. Therefore, many approaches (such as ATLAS [23], FFTW [5], or FIBER [7]) focus on tuning particular types of algorithms rather than entire parallel applications. Datta et al. [4] address auto-tuning and optimization strategies for stencil computations on multicore architectures. MATE [10] uses a model-based approach to dynamically optimize distributed master/worker applications. MATE predicts the performance of these programs. However, optimizing other types of parallel patterns requires the creation of new analytic models. MATE does not target multicore systems.
Atune [17,19] introduces tuning heuristics to improve search-based auto-tuning of parallel architectures. However, Atune needs a separate configuration language and an offline auto-tuner. Stream languages such as StreamIt [21,6] provide explicit syntax for data, task and pipeline parallelism. Optimizations are done at compile time for a given machine; dynamic adjustments are typically not addressed. Libraries such as java.util.concurrent [9] or TBB [16] provide constructs for high-level parallelism, but do not exploit context information and still require explicit tuning. Languages such as Chapel [2], Cilk [15] and X10 [3] focus on task and data parallelism but not on explicit pipelining and do not support tuning parameter inference.
7 Conclusion
Tuning parallel applications is essential to achieve the best performance on a particular platform. In this paper, we presented a language-based tuning mechanism for basically any kind of application employing task and pipeline parallelism. Our approach automatically infers tuning parameters and corresponding context information from high-level parallel language constructs. Using appropriate heuristics, tuning parameters are set at runtime. We implemented our technique as part of the XJava compiler and runtime system.

We evaluated our approach for seven benchmark programs covering different types of parallelism. Our tuning mechanism infers and sets over 90% of the relevant tuning parameters automatically. The average performance achieves 93% of the actual optimum, drastically reducing the need for further tuning. If further search-based tuning is still required, our approach provides a good starting point. Future work will address the support of further tuning parameters (such as data size), the refinement of tuning heuristics, and the integration of a feedback-driven online auto-tuner.
References

1. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report, University of California, Berkeley (2006)
2. Chamberlain, B.L., Callahan, D., Zima, H.P.: Parallel Programmability and the Chapel Language. Int. J. High Perform. Comput. Appl. 21(3) (August 2007)
3. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: An Object-Oriented Approach to Non-Uniform Cluster Computing. In: Proc. OOPSLA 2005. ACM, New York (2005)
4. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures. In: Proc. Supercomputing Conference (2008)
5. Frigo, M., Johnson, S.G.: FFTW: An Adaptive Software Architecture for the FFT. In: Proc. ICASSP, vol. 3 (May 1998)
6. Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting Coarse-grained Task, Data, and Pipeline Parallelism in Stream Programs. In: Proc. ASPLOS-XII. ACM, New York (2006)
7. Katagiri, T., Kise, K., Honda, H., Yuba, T.: FIBER: A Generalized Framework for Auto-tuning Software. In: Proc. International Symposium on HPC (2003)
8. Lea, D.: A Java fork/join Framework. In: Proc. Java Grande 2000. ACM, New York (2000)
9. Lea, D.: The java.util.concurrent Synchronizer Framework. Sci. Comput. Program. 58(3) (2005)
10. Morajko, A., Margalef, T., Luque, E.: Design and Implementation of a Dynamic Tuning Environment. Parallel and Distributed Computing 67(4) (2007)
11. Otto, F., Pankratius, V., Tichy, W.F.: High-level Multicore Programming with XJava. In: Comp. ICSE 2009, New Ideas and Emerging Results. ACM, New York (2009)
12. Otto, F., Pankratius, V., Tichy, W.F.: XJava: Exploiting Parallelism with Object-Oriented Stream Programming. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 875–886. Springer, Heidelberg (2009)
13. Pankratius, V., Schaefer, C.A., Jannesari, A., Tichy, W.F.: Software Engineering for Multicore Systems: An Experience Report. In: Proc. IWMSE 2008. ACM, New York (2008)
14. Proebsting, T.A., Watterson, S.A.: Filter Fusion. In: Proc. Symposium on Principles of Programming Languages (1996)
15. Randall, K.: Cilk: Efficient Multithreaded Computing. PhD Thesis, Dep. EECS, MIT (1998)
16. Reinders, J.: Intel Threading Building Blocks. O'Reilly Media, Inc., Sebastopol (2007)
17. Schaefer, C.A.: Reducing Search Space of Auto-Tuners Using Parallel Patterns. In: Proc. IWMSE 2009. ACM, New York (2009)
18. Schaefer, C.A., Pankratius, V., Tichy, W.F.: Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 9–20. Springer, Heidelberg (2009)
19. Schaefer, C.A., Pankratius, V., Tichy, W.F.: Engineering Parallel Applications with Tunable Architectures. In: Proc. ICSE. ACM, New York (2010)
20. Tapus, C., Chung, I., Hollingsworth, J.K.: Active Harmony: Towards Automated Performance Tuning. In: Proc. Supercomputing Conference (2002)
21. Thies, W., Karczmarek, M., Amarasinghe, S.: StreamIt: A Language for Streaming Applications. In: Horspool, R.N. (ed.) CC 2002. LNCS, vol. 2304, p. 179. Springer, Heidelberg (2002)
22. Werner-Kytola, O., Tichy, W.F.: Self-tuning Parallelism. In: Williams, R., Afsarmanesh, H., Bubak, M., Hertzberger, B. (eds.) HPCN-Europe 2000. LNCS, vol. 1823, p. 300. Springer, Heidelberg (2000)
23. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated Empirical Optimizations of Software and the ATLAS Project. Journal of Parallel Computing 27 (2001)
A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures

Chen Chen (1), Joseph B. Manzano (2), Ge Gan (2), Guang R. Gao (2), and Vivek Sarkar (3)
(1) Tsinghua University, Beijing 100084, P.R. China
(2) University of Delaware, Newark DE 19716, USA
(3) Rice University, Houston TX 77251, USA
Abstract. This paper is motivated by the desire to provide an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM CELL architecture in particular. In this paper, we propose an instantiation of the OpenMP memory model with the following advantages: (1) The proposed instantiation prohibits undefined values that may cause problems of safety, security, programming and debugging. (2) The proposed instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable. (3) The proposed instantiation avoids the ambiguity of the original memory model definition proposed on the OpenMP Specification 3.0. We also introduce a new cache protocol for this instantiation, which can be implemented as a software-controlled cache. Experimental results on the Cell Broadband Engine show that our instantiation results in nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 341–352, 2010. (c) Springer-Verlag Berlin Heidelberg 2010

1 Introduction

An important open problem for future multicore and manycore chip architectures is the development of shared-memory organizations and memory consistency models (or memory models for short) that are effective for small local memory sizes in each core, scalable to a large number of cores, and still productive for software to use. Despite the fact that strong memory models such as Sequential Consistency (SC) [1] are supported on mainstream small-scale SMPs, it seems likely that weaker memory models will be explored in current and future multicore and manycore architectures such as the Cell Broadband Engine [2], Tilera [3], and Cyclops64 [4]. OpenMP [5] is a natural candidate as a programming model for multicore and manycore processors with software-managed local memories, thanks to its weak memory model. In the OpenMP memory model, each thread may maintain a temporary view of the shared memory which "allows the thread to cache variables and thereby to avoid going to memory for every reference to a variable" [5]. It includes a flush operation on
a specified flush-set that can be used to synchronize the temporary view with the shared memory for the variables in the flush-set. It is a weak consistency model "because a thread's temporary view of memory is not required to be consistent with memory at all times" [5]. This relaxation of the memory consistency constraints provides room for computer system designers to experiment with a wide range of caching schemes, each of which has a different performance and cost tradeoff. Therefore, the OpenMP memory model can exhibit very different instantiations, each of which is a memory model that is stronger than the OpenMP memory model, i.e., any legal value under an instantiation is also a legal value under the OpenMP memory model, but not vice versa. Among the various instantiations of the OpenMP memory model, an important problem is to find one that can be efficiently implemented on multicore and manycore architectures and easily understood by programmers.

1.1 A Key Observation for Implementing the Flush Operation Efficiently

The flush operation synchronizes temporary views with the shared memory, so it is more expensive than read and write operations. In order to implement the OpenMP memory model efficiently, an instantiation should be able to implement the flush operation efficiently. Unfortunately, the OpenMP memory model has a serialization requirement for flush operations, i.e., "if the intersection of the flush-sets of two flushes performed by two different threads is non-empty, then the two flushes must be completed as if in some sequential order, seen by all threads" [5]. Because of this serialization requirement, it seems very hard to implement the flush operation efficiently. However, the requirement has a hidden meaning that is not clearly explained in [5], and this hidden meaning is the key to implementing the flush operation efficiently. We use an example to explain the real meaning of the serialization requirement.
For the program in Fig. 1, it seems that the final status of the shared memory must be either x = y = 1 or x = y = 2 according to the serialization requirement. However, after discussion with the OpenMP community, x = 1, y = 2 and x = 2, y = 1 are also legal results under the OpenMP memory model. The reason is that the OpenMP memory model allows flush operations to be completed earlier (but not later) than the flush points (statements 3 and 6 in this program). Therefore, one possible way to get the result x = 1, y = 2 is that first thread 2 assigns 2 to x and immediately flushes x into the shared memory, then thread 1 assigns 1 to x and 1 to y and then flushes x and y, and finally thread 2 assigns 2 to y and flushes y. Therefore, we get a key observation for implementing the flush operation efficiently as follows.

    Thread 1          Thread 2
    1: x = 1;         4: x = 2;
    2: y = 1;         5: y = 2;
    3: flush(x,y);    6: flush(x,y);

    Is x = 1, y = 2 (or x = 2, y = 1) legal under the OpenMP memory model?

Fig. 1. A motivating example for understanding the serialization requirement under the OpenMP memory model
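The interleaving described above can be replayed with a toy simulation of temporary views (the names and data structures below are ours, for illustration only): each thread reads and writes its own view, and a flush on a single location writes the cached value back to shared memory and then discards the cached copy.

```python
shared = {"x": 0, "y": 0}   # shared memory
views = {1: {}, 2: {}}      # per-thread temporary views (toy model:
                            # every cached value is treated as dirty)

def write(tid, loc, val):
    views[tid][loc] = val   # update only the thread's temporary view

def flush(tid, loc):
    if loc in views[tid]:   # write the value back, then discard the copy
        shared[loc] = views[tid].pop(loc)

# One legal interleaving once flush(x,y) is decomposed per location:
write(2, "x", 2); flush(2, "x")     # thread 2 flushes x early
write(1, "x", 1); write(1, "y", 1)
flush(1, "x"); flush(1, "y")        # thread 1 flushes x and y
write(2, "y", 2); flush(2, "y")     # thread 2 flushes y at its flush point

assert shared == {"x": 1, "y": 2}   # legal under the OpenMP memory model
```

Because each per-location flush is atomic on its own, the serialization requirement is met even though the two flush(x,y) operations overlap in time.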
The Key Observation: A flush operation on a flush-set of shared locations can be decomposed into unordered flush operations on each individual location. Each flush operation after decomposition must be completed no later than the flush point of the original flush operation. Assuming that a memory location is the minimal unit for atomic memory accesses, the serialization requirement is naturally satisfied.

1.2 Main Contributions

In this paper, we propose an instantiation of the OpenMP memory model based on the key observation in Section 1.1. It has the following advantages.

– Our instantiation prohibits undefined values that may cause problems for safety, security, programming and debugging. The OpenMP memory model may allow programs with data races to generate undefined values. In our instantiation, however, every return value must be either the initial value or a value that was previously written by some thread. Since the OpenMP memory model allows programs with data races (Section 2.8.6 of the OpenMP specification 3.0 [5] shows a program with data races that implements critical sections), our instantiation would be helpful when programming in such cases.
– Our instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable.
– Our instantiation avoids the ambiguity of the original memory model definition proposed in the OpenMP Specification 3.0, such as the unclear serialization requirement, the problem of handling temporary view overflow and the unclear semantics for programs with data races. Therefore, our instantiation is easy to understand from the angle of efficient implementations.

We also propose the cache protocol of the instantiation and implement the software-controlled cache on the Cell Broadband Engine. The experimental results show that our instantiation has nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks.
The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations. The rest of the paper is organized as follows. Section 2 introduces our instantiation of the OpenMP memory model. Section 3 introduces the cache protocol of the instantiation. Section 4 presents the experimental results. Section 5 discusses the related work. The conclusion is presented in Section 6.
2 Formalization of Our OpenMP Memory Model Instantiation

A necessary prerequisite to build OpenMP's software cache implementations is the availability of formal memory models that establish the legality conditions for determining if an implementation is correct. As observed in [6], "it is impossible to verify OpenMP applications formally since the prose does not provide a formal consistency
model that precisely describes how reads and writes on different threads interact". While there is general agreement that the OpenMP memory model is based on temporary views and flush operations, discussion with OpenMP experts led us to conclude that the OpenMP specification provides a lot of leeway on when flush operations can be performed and on the inclusion of additional flush operations (not specified by the programmer) to deal with local memory size constraints.

In this section, we formalize an instantiation of the OpenMP memory model — ModelLF, based on the key observation in Section 1.1. ModelLF builds on OpenMP's relaxed-consistency memory model in which each worker thread maintains a temporary view of shared data which may not always be consistent with the actual data stored in the shared memory. The OpenMP flush operation is used to establish consistency between these temporary views and the shared memory at specific program points. In ModelLF, each flush operation only forces the local temporary view to be consistent with the shared memory. That is why we call it ModelLF, where "LF" means local flush. A flush operation is only applied on a single location. We assume that a memory location is the minimal unit for atomic memory accesses. Therefore, the serialization requirement of flush operations is naturally satisfied. A flush operation on a set of shared locations is decomposed into unordered flush operations on each individual location, where those flush operations after decomposition must be completed no later than the flush point of the original flush operation. This avoids the known problem of decomposition explained in Section 2.8.6 of the OpenMP specification 3.0 [5], where the compiler may reorder the flush operations after decomposition to a position later than the flush point and cause incorrect semantics.

2.1 Operational Semantics of ModelLF

In this section, we define the operational semantics of ModelLF.
First, we introduce a little background for the definition. A store, σ, is a mathematical representation of the machine's shared memory, which maps memory location addresses to values (σ : addr → val). We model temporary views by introducing a distinct store, σi, for each worker thread Ti in an OpenMP parallel region. Following OpenMP's convention, thread T0 is assumed to be the master thread. σi[l] represents the value stored in location l in thread Ti's temporary view. The flush operation flush(Ti, l) makes temporary view σi consistent with the shared memory σ on location l.

Under ModelLF, program flush operations are performed at the program points specified by the programmer. Moreover, additional flush operations may be inserted nondeterministically by the implementation at any program point, which makes it possible to implement the memory model with bounded space for temporary views, such as caches. The operational semantics of the memory operations of ModelLF comprise the read, write, program flush and nondeterministic flush operations, defined as follows:

– Memory read: If thread Ti needs to read the value of location l, it performs a read(Ti, l) operation on store σi. If σi does not contain any value for l, the value in the shared memory is loaded into σi and returned to the read operation.
– Memory write: If thread Ti needs to write value v to location l, it performs a write(Ti, v, l) operation on store σi.
– Program / Nondeterministic flush: If thread Ti needs to synchronize σi with the shared memory on a shared location l, it performs a flush(Ti, l) operation. If σi contains a "dirty value" (2) of l, it writes the value back into the shared memory. After the flush operation, σi discards the value of l. A thread performs program flush operations at program points specified by the programmer, and can nondeterministically perform flush operations at any program point. All the program and nondeterministic flush operations on the same shared location must be observed by all threads to be completed in the same sequential order.
3 Cache Protocol of ModelLF

In this section, we introduce the cache protocol that implements ModelLF. We assume that each thread contains a cache which corresponds to its temporary view. Therefore, performing operations on temporary views is equivalent to performing such operations on the caches. Without loss of generality, in this section we assume that each operation is performed on one cache line. The reason is that an operation on one cache line can be decomposed into sub-operations, each of which is performed on a single location. We use per-location dirty bits in a cache line to take care of the decomposition problem.

3.1 Cache Line States

We assume that each cache line contains multiple locations. Each location contains a value that can be a "clean value" (3), a "dirty value", or an "invalid value". Each cache line can be in one of five states, as follows.

Invalid: All the locations contain "invalid values".
Clean: All the locations contain "clean values".
Dirty: All the locations contain "dirty values".
Clean-Dirty: Each location contains either a "clean value" or a "dirty value".
Invalid-Dirty: Each location contains either an "invalid value" or a "dirty value".

For simplicity, the cache line cannot be in other states such as Invalid-Clean. Additional nondeterministic flush operations may be performed when necessary to force the cache line into one of the five states above. We use a per-line flag bit together with the dirty bits to represent the state of the cache line. The flag bit indicates whether the non-dirty values in the cache line are clean or invalid.

3.2 Cache Operations and State Transitions

The state transition diagram of the ModelLF cache protocol is shown in Fig. 2. Now we explain how each cache operation affects the state transition diagram.

Memory read: If the original state of the cache line is invalid or invalid-dirty, the invalid locations load "clean values" from memory.
Therefore, the state changes to clean or clean-dirty, respectively. In other cases, the state does not change. After that, the values in the cache line are returned.

(2) The term "dirty value" means that the value of location l was modified by thread Ti. (3) The term "clean value" means that the value was read but not modified by the thread.
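The five states need not be stored explicitly: as described above, a per-line flag bit plus per-location dirty bits suffice. A small sketch of this encoding (our own illustration, not OPELL code):

```python
def line_state(nondirty_clean, dirty_bits):
    """Derive the cache line state from the per-line flag bit and the
    per-location dirty bits. nondirty_clean=True means the non-dirty
    locations hold clean values; False means they are invalid."""
    if all(dirty_bits):
        return "Dirty"                  # flag bit is irrelevant here
    if any(dirty_bits):
        return "Clean-Dirty" if nondirty_clean else "Invalid-Dirty"
    return "Clean" if nondirty_clean else "Invalid"

assert line_state(False, [0, 0, 0, 0]) == "Invalid"
assert line_state(True,  [0, 0, 0, 0]) == "Clean"
assert line_state(True,  [1, 1, 1, 1]) == "Dirty"
assert line_state(True,  [1, 0, 1, 0]) == "Clean-Dirty"
assert line_state(False, [1, 0, 1, 0]) == "Invalid-Dirty"
```

Note that the disallowed Invalid-Clean state simply cannot be expressed: a single flag bit forces all non-dirty locations of a line to be uniformly clean or uniformly invalid.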
(Diagram legend: r = memory read, w = memory write, f = flush.)
Fig. 2. State transition diagram for the cache protocol of ModelLF
Memory write: A write operation writes the specified "dirty values" to the cache line. Therefore, if the original state is invalid or invalid-dirty, it becomes either invalid-dirty or dirty after the write operation, depending on whether all the locations contain "dirty values". In other cases, the state becomes either clean-dirty or dirty, again depending on whether all the locations contain "dirty values".

Program / Nondeterministic flush: A flush operation forces all the "dirty values" of the cache line to be written back into memory. Then the state becomes invalid. There may be various ways to implement the flush operation. For example, many architectures support writing back a block of data at a time. So a possible way of implementing the flush operation is to write back the entire cache line that is being flushed together with the dirty bits, and then merge the "dirty values" into the corresponding memory line in the shared memory. If merging in memory is not supported, a thread has to load the memory line, merge it with the cache line, and finally write back the merged line, where the process must be atomic to handle the false sharing problem. For example, on the Cell processor, atomic DMA operations can be used to guarantee atomicity of the process.
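The load-merge-write-back variant of the flush can be sketched as follows. This is our own illustration, not OPELL code; in the real implementation the whole load/merge/write-back sequence must execute atomically (e.g. via atomic DMA on the Cell) so that concurrent updates to falsely shared locations are not lost.

```python
def flush_line(memory, base, cache_line, dirty_bits):
    """Merge only the dirty locations of a cache line back into the
    memory line starting at 'base'. Non-dirty locations of the memory
    line are left untouched, which is what handles false sharing.
    Returns the discarded (now Invalid) line and cleared dirty bits."""
    for i, dirty in enumerate(dirty_bits):
        if dirty:
            memory[base + i] = cache_line[i]
    # after the write-back the whole line is discarded (state Invalid)
    return [None] * len(cache_line), [0] * len(dirty_bits)

mem = [10, 11, 12, 13]
line, bits = flush_line(mem, 0, [99, 11, 77, 13], [1, 0, 1, 0])
assert mem == [99, 11, 77, 13]   # only the dirty locations were merged
assert bits == [0, 0, 0, 0]
```

If another thread had concurrently flushed a dirty value into one of the non-dirty slots of the same memory line, the merge would preserve it, whereas writing back the whole line unconditionally would overwrite it.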
4 Experimental Results and Analyses

In this section, we present our experimental results under the ModelLF cache protocol. In Section 4.1, we introduce the experimental testbed. In Section 4.2, we summarize the major observations of our experiments. The details and analyses of these observations follow in the last two sections.

4.1 Experimental Testbed

The experimental results presented in this paper were obtained on the CBEA (Cell Broadband Engine Architecture) [2] under the OPELL (OPenmp for cELL) framework [7].

CBEA: The CBEA has a main processor called the Power Processing Element (PPE) and a number of co-processors called the Synergistic Processing Elements (SPEs). The PPE handles most of the computational workload and has control over the SPEs, i.e., it can start, stop, interrupt, and schedule processes onto the SPEs. Each SPE has a 256KB local
storage which is used to store both instructions and data. An SPE can only access its own local storage directly. Both the PPE and the SPEs share main memory. SPEs access main memory via DMA (direct memory access) transfers, which are much slower than accesses to each SPE's own local storage. We executed the programs on a PlayStation 3 [8], which has one 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs) and 256MB global shared memory. Our experiments used all 6 SPEs, with the exception of the evaluation of speedup, which used various numbers of SPEs from 1 to 6.

OPELL Framework: OPELL is an open source toolchain / runtime effort to implement OpenMP for the CBEA. OPELL has a single source compiler which compiles an OpenMP program to a single source file that is executable on the CBEA. During runtime, the executable file starts by running the sequential region of the program on the PPE. Once the program enters a parallel region, the PPE assigns tasks of computing parallel code to the SPEs. After the SPEs finish the tasks, the parallel region ends and the PPE goes ahead to execute the following sequential region. Since each SPE only has 256KB of local storage for both instructions and data, OPELL has a partition / overlay manager runtime library that partitions the parallel code into small pieces to fit the local storage size, and loads and replaces those pieces on demand. Since a DMA transfer is much slower than an access to the local storage, OPELL has a software cache runtime library to take advantage of locality. The runtime library manages a part of the local storage as a cache and has a user interface for accessing it.

We implement our cache protocol in OPELL's software cache runtime library. The cache protocol uses 4-way set associative caches. The size of each cache line is 128 bytes. We ran the experiments on various cache sizes ranging from 4KB to 64KB. We did not try bigger cache sizes because the size of the local storage is very limited (256KB) and a part of it is used to store instructions and maintain the stack.

Benchmarks: We used three benchmark programs in our experiments — Integer Sort (IS), Embarrassingly Parallel (EP) and Multigrid (MG) from the NAS Parallel Benchmarks [9].

4.2 Summary of Main Results

The main results of our experiments are as follows:

Result I: Scalability (Section 4.3): The ModelLF cache protocol has nearly linear speedup with respect to the number of threads for the tested benchmarks.

Result II: Impact of Cache Size (Section 4.4): We use another instantiation of the OpenMP memory model, ModelGF (4), to compare with ModelLF. ModelGF maintains a global total ordering among flush operations. The difference between ModelGF and ModelLF is that when ModelGF performs a flush operation on a location l, it enforces the temporary views of all threads to see the same value of l by discarding the values of l in the temporary views. To implement ModelGF, we simulate a centralized directory
We did not try bigger cache size because the size of local storage is very limited (256KB) and a part of it is used to store instructions and maintain stack. Benchmarks: We used three benchmark programs in our experiments — Integer Sort (IS), Embarrassingly Parallel (EP) and Multigrid (MG) from the NAS Parallel Benchmarks [9]. 4.2 Summary of Main Results The main results of our experiments are as follows: Result I: Scalability (Section 4.3): ModelLF cache protocol has nearly linear speedup with respect to the number of threads for the tested benchmarks. Result II: Impact of Cache Size (Section 4.4): We use another instantiation of the OpenMP memory model — ModelGF 4 , to compare with ModelLF . ModelGF maintains a global total ordering among flush operations. The difference between ModelGF and ModelLF is that when ModelGF performs a flush operation on a location l, it enforces the temporary views of all threads to see the same value of l by discarding the values of l in the temporary views. To implement ModelGF , we simulate a centralized directory 4
Operational semantics of ModelGF is defined in [10].
348
C. Chen et al.
that maintains the information for all the caches. When a flush operation on a location l is performed, the directory informs all the threads that contain the value of l to discard the value. We assume that the centralized directory is “ideal”, i.e., the cost of maintenance and lookup is trivial. However, the cost of informing a thread is as expensive as a DMA transfer because the directory is placed in main memory. ModelLF outperforms ModelGF due to its cheaper flush operations. Our results show that the performance gap between ModelLF and ModelGF cache protocols increases as the cache size becomes smaller. This observation is significant because the current trend in multicore and manycore processors is that the local memory size per core decreases as the number of cores increases. 4.3 Scalability Fig. 3 shows the speedup as a function of the number of SPEs (Each SPE runs one thread.) under ModelLF cache protocol. The tested applications are MG with a 32KB cache size, and IS and EP with a 64KB cache size. All the three applications have input size W. We can see that for IS and EP benchmarks, ModelLF cache protocol nearly achieves linear speedup. For MG benchmark, the speedup is not as good as the other two when the number of threads is 3, 5 and 6. The reason is that the workloads among threads are not balanced when the number of threads is not a power of 2. ISͲWandEPͲW achievealmost hi l t linearspeedup.
Fig. 3. Speedup as a function of the number of SPEs under the ModelLF cache protocol. IS-W and EP-W achieve almost linear speedup; MG-W performs worse because of unbalanced workloads.
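The contrast between the two flush policies can be sketched in plain C. The following toy simulation is our own illustration, not the authors' Cell implementation: each thread has a "temporary view" per location, a ModelLF-style flush reconciles only the flushing thread's view with memory, while a ModelGF-style flush additionally discards the location from every other thread's view, as the centralized directory would (each such invalidation costing a DMA transfer on the real machine).

```c
#include <assert.h>
#include <string.h>

#define NTHREADS 4
#define NLOCS    8
#define EMPTY   (-1)   /* toy sentinel: "not cached" (values must be >= 0) */

/* Toy model: main memory plus one "temporary view" (cache) per thread. */
static int memory[NLOCS];
static int view[NTHREADS][NLOCS];

static void reset(void) {
    memset(memory, 0, sizeof memory);
    for (int t = 0; t < NTHREADS; t++)
        for (int l = 0; l < NLOCS; l++)
            view[t][l] = EMPTY;
}

static void thread_write(int t, int l, int v) { view[t][l] = v; }

static int thread_read(int t, int l) {
    if (view[t][l] == EMPTY) view[t][l] = memory[l]; /* fill from memory */
    return view[t][l];
}

/* ModelLF-style flush: reconcile only the flushing thread's own view. */
static void flush_LF(int t, int l) {
    if (view[t][l] != EMPTY) memory[l] = view[t][l];
    view[t][l] = EMPTY;
}

/* ModelGF-style flush: additionally discard l from every view, as if a
   centralized directory informed all sharers. */
static void flush_GF(int t, int l) {
    flush_LF(t, l);
    for (int u = 0; u < NTHREADS; u++) view[u][l] = EMPTY;
}
```

After a ModelLF flush, a thread that already cached the location keeps reading its stale copy until it flushes as well; after a ModelGF flush, every thread sees the flushed value, at the price of one invalidation message per sharer.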
4.4 Impact of Cache Size

Figs. 4 and 5 show the execution time and cache eviction ratio curves for IS and MG with input size W on various cache sizes per thread (4KB, 8KB, 16KB, 32KB and 64KB; the 64KB size is used only for IS). The two figures show that the cache eviction ratio curves under the two cache protocols are equal, but the execution time curves are not. Moreover, the difference in execution time becomes larger as the cache size becomes smaller. This is because the cost of a cache eviction in the ModelGF cache protocol is much higher, and the smaller the cache size is, the higher the cache eviction ratio is. To show the change of the performance gap clearly, we normalize the execution times into the interval [0, 1] by dividing every execution time by the maximal execution time among all tested configurations; the configurations with the maximal execution time are the 4KB cache sizes under ModelGF for both MG and IS. The performance gap between ModelGF and ModelLF remains constant for EP when we change the cache sizes. The reason is that EP has very poor temporal locality, so it is insensitive to changes in cache size.
A Study of a Software Cache Implementation of the OpenMP Memory Model
Fig. 4. Trends of execution time and cache eviction ratio for IS-W on various cache sizes. The difference in normalized execution time increased from 0.15 to 0.25 as the cache size per SPE was decreased from 64KB to 4KB.
Fig. 5. Trends of execution time and cache eviction ratio for MG-W on various cache sizes. The difference in normalized execution time increased from 0.04 to 0.16 as the cache size per SPE was decreased from 32KB to 4KB.
5 Related Work

Despite over two decades of research on memory consistency models, there does not appear to be a consensus on how memory models should be formalized [11,12,13,14]. The efforts to formalize memory models for mainstream parallel languages, such as the Java memory model [15], the C++ memory model [16], and the OpenMP memory model [6], all take different approaches. The authoritative source for the OpenMP memory model can be found in the specification for OpenMP 3.0 [5], but the memory model definition therein is provided in terms of informal prose. To address this limitation, a formalization of the OpenMP memory model was presented in [6]. In that work, the authors developed a formal, mathematical language to model the relevant features of OpenMP, along with an operational model to verify its conformance to the OpenMP standard. Through these tools, the authors found that the OpenMP memory model is weaker than the weak consistency model [17]. The authors also claimed to have found some ambiguities in the informal definition of the OpenMP memory model presented in the OpenMP specification version 2.5 [18]. Since the OpenMP memory model did not change significantly from version 2.5 to version 3.0, their work demonstrates the need for the
OpenMP community to work towards a formal and complete definition of the OpenMP memory model.

Some early research on software-controlled caches can be found in the NYU Ultracomputer [19], Cedar [20], and IBM RP3 [21] projects. All three machines have local memories that can be used as programmable caches, with software taking responsibility for maintaining consistency by inserting explicit synchronization and cache consistency operations. By default, this responsibility falls on the programmer, but compiler techniques have also been developed in which these operations are inserted by the compiler instead, e.g., [22]. Interest in software caching has been renewed with the advent of multicore processors with local memories such as the Cell Broadband Engine. There have been a number of reports on more recent software cache optimization from the compiler angle, as described in [23,24,25]. Examples of recent work on software cache protocol implementations on Cell processors can be found in [26,27,28]. The cache protocol used in [26] relies on a centralized directory to keep track of cache line state information, which is reminiscent of the ModelGF cache protocol in this paper. The cache protocols reported in [27,28] do not appear to use a centralized directory and hence appear to be closer to the ModelLF cache protocol. However, we do not have access to detailed information on the implementations of these models, and cannot make a more definitive comparison at the time of writing. OPELL [7] is an open source toolchain / runtime effort to implement OpenMP for the Cell Broadband Engine. The cache protocol framework reported here was developed earlier, in the 2006-2007 time frame, and embedded in OPELL (see [7]), but the protocols themselves have not been published externally.
6 Conclusion and Future Work

In this paper, we investigate the problem of software cache implementations of the OpenMP memory model on multicore and manycore processors. We propose an instantiation of the OpenMP memory model — ModelLF — which prohibits undefined values and avoids the ambiguity of the original memory model definition in the OpenMP Specification 3.0. ModelLF is scalable with respect to the number of threads because it relies neither on communication among threads nor on a centralized directory that maintains consistency of multiple copies of each shared variable. We propose the corresponding cache protocol and implement it as a software cache on the Cell processor. The experimental results show that the ModelLF cache protocol achieves nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage over the ModelGF cache protocol, which is derived from a stronger memory model that maintains a global total ordering among flush operations. This provides a useful way to formalize the (architecture-unspecified) OpenMP memory model in different ways and to evaluate instantiations that produce different performance profiles. Our conclusion is that OpenMP's relaxed memory model with temporary views is a good match for software cache implementations, and that the refinements in ModelLF lead to good opportunities for scalable implementations of OpenMP on future multicore and manycore processors.
In the future, we will investigate the possibility of implementing our instantiation on different architectures and study its scalability when the architecture contains a large number of cores (e.g., over 100).
Acknowledgment. This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244) and other government sponsors. We thank all the members of the CAPSL group at the University of Delaware. We thank Ziang Hu for his suggestions on the experimental design. We thank Bronis R. de Supinski and Greg Bronevetsky for answering questions regarding the OpenMP memory model. We thank our reviewers for their helpful suggestions, which have led to several important improvements of our work.
References

1. Lamport, L.: How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. on Computers C-28(9), 690–691 (1979)
2. IBM Microelectronics: Cell Broadband Engine, http://www-01.ibm.com/chips/techlib/techlib.nsf/products/Cell Broadband Engine
3. Tilera Corporation: Tilera, http://www.tilera.com/
4. Cuvillo, J., Zhu, W., Hu, Z., Gao, G.R.: FAST: A functionally accurate simulation toolset for the Cyclops-64 cellular architecture. In: Proceedings of the Workshop on Modeling, Benchmarking and Simulation, held in conjunction with the 32nd Annual International Symposium on Computer Architecture, Madison, Wisconsin, pp. 11–20 (2005)
5. OpenMP Architecture Review Board: OpenMP Application Program Interface, Version 3.0 (May 2008), http://www.openmp.org/mp-documents/spec30.pdf
6. Bronevetsky, G., de Supinski, B.R.: Complete formal specification of the OpenMP memory model. Int. J. Parallel Program. 35(4), 335–392 (2007)
7. Manzano, J., Hu, Z., Jiang, Y., Gan, G.: Towards an automatic code layout framework. In: Chapman, B., Zheng, W., Gao, G.R., Sato, M., Ayguadé, E., Wang, D. (eds.) IWOMP 2007. LNCS, vol. 4935, pp. 157–160. Springer, Heidelberg (2008)
8. Sony Computer Entertainment: PlayStation 3, http://www.us.playstation.com/ps3/features
9. NASA Ames Research Center: NAS Parallel Benchmarks, http://www.nas.nasa.gov/Resources/Software/npb.html
10. Chen, C., Manzano, J.B., Gan, G., Gao, G.R., Sarkar, V.: A study of a software cache implementation of the OpenMP memory model for multicore and manycore architectures. Technical Memo CAPSL/TM-93 (February 2010)
11. Adve, S., Hill, M.D.: A unified formalization of four shared-memory models. IEEE Transactions on Parallel and Distributed Systems 4, 613–624 (1993)
12. Shen, X., Arvind, Rudolph, L.: Commit-Reconcile & Fences (CRF): a new memory model for architects and compiler writers. In: ISCA 1999: Proceedings of the 26th Annual International Symposium on Computer Architecture, pp. 150–161. IEEE Computer Society, Washington (1999)
13. Saraswat, V.A., Jagadeesan, R., Michael, M., von Praun, C.: A theory of memory models. In: PPoPP 2007: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 161–172. ACM, New York (2007)
14. Arvind, A., Maessen, J.W.: Memory model = instruction reordering + store atomicity. SIGARCH Comput. Archit. News 34(2), 29–40 (2006)
15. Manson, J., Pugh, W., Adve, S.V.: The Java memory model. In: POPL 2005: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 378–391. ACM, New York (2005)
16. Boehm, H.J., Adve, S.V.: Foundations of the C++ concurrency memory model. In: PLDI 2008: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 68–78. ACM, New York (2008)
17. Dubois, M., Scheurich, C., Briggs, F.: Memory access buffering in multiprocessors. In: ISCA 1998: 25 Years of the International Symposia on Computer Architecture (Selected Papers), pp. 320–328. ACM, New York (1998)
18. OpenMP Architecture Review Board: OpenMP Application Program Interface, Version 2.5 (2005), http://www.openmp.org/mp-documents/spec25.pdf
19. Gottlieb, A., Grishman, R., Kruskal, C.P., McAuliffe, K.P., Rudolph, L., Snir, M.: The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine. In: ISCA 1998: 25 Years of the International Symposia on Computer Architecture (Selected Papers), pp. 239–254. ACM, New York (1998)
20. Gajski, D., Kuck, D., Lawrie, D., Sameh, A.: CEDAR—a large scale multiprocessor, pp. 69–74. IEEE Computer Society Press, Los Alamitos (1986)
21. Pfister, G., Brantley, W., George, D., Harvey, S., Kleinfelder, W., McAuliffe, K., Melton, E., Norton, V., Weiss, J.: The research parallel processor prototype (RP3): Introduction and architecture. In: ICPP 1985: Proceedings of the 1985 International Conference on Parallel Processing, pp. 764–771 (1985)
22. Cytron, R., Karlovsky, S., McAuliffe, K.P.: Automatic management of programmable caches. In: ICPP 1988: Proceedings of the 1988 International Conference on Parallel Processing, pp. 229–238 (August 1988)
23. Eichenberger, A.E., O'Brien, K., O'Brien, K., Wu, P., Chen, T., Oden, P.H., Prener, D.A., Shepherd, J.C., So, B., Sura, Z., Wang, A., Zhang, T., Zhao, P., Gschwind, M.: Optimizing compiler for the CELL processor. In: PACT 2005: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pp. 161–172. IEEE Computer Society, Los Alamitos (2005)
24. Eichenberger, A.E., O'Brien, J.K., O'Brien, K.M., Wu, P., Chen, T., Oden, P.H., Prener, D.A., Shepherd, J.C., So, B., Sura, Z., Wang, A., Zhang, T., Zhao, P., Gschwind, M.K., Archambault, R., Gao, Y., Koo, R.: Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Syst. J. 45(1), 59–84 (2006)
25. Chen, T., Zhang, T., Sura, Z., Tallada, M.G.: Prefetching irregular references for software cache on CELL. In: CGO 2008: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 155–164. ACM, New York (2008)
26. Lee, J., Seo, S., Kim, C., Kim, J., Chun, P., Sura, Z., Kim, J., Han, S.: COMIC: a coherent shared memory interface for Cell BE. In: PACT 2008: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 303–314. ACM, New York (2008)
27. Chen, T., Lin, H., Zhang, T.: Orchestrating data transfer for the Cell/B.E. processor. In: ICS 2008: Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 289–298. ACM, New York (2008)
28. Gonzàlez, M., Vujic, N., Martorell, X., Ayguadé, E., Eichenberger, A.E., Chen, T., Sura, Z., Zhang, T., O'Brien, K., O'Brien, K.: Hybrid access-specific software cache techniques for the Cell BE architecture. In: PACT 2008: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 292–302. ACM, New York (2008)
Programming CUDA-Based GPUs to Simulate Two-Layer Shallow Water Flows

Marc de la Asunción¹, José M. Mantas¹, and Manuel J. Castro²

¹ Dpto. Lenguajes y Sistemas Informáticos, Universidad de Granada
² Dpto. Análisis Matemático, Universidad de Málaga
Abstract. The two-layer shallow water system is used as the numerical model to simulate several phenomena related to geophysical flows, such as the steady exchange of two different water flows, as occurs in the Strait of Gibraltar, or the tsunamis generated by underwater landslides. The numerical solution of this model for realistic domains imposes great demands on computing power, and modern Graphics Processing Units (GPUs) have proven to be powerful accelerators for this kind of computationally intensive simulation. This work describes an accelerated implementation of a first order well-balanced finite volume scheme for 2D two-layer shallow water systems using GPUs supporting the CUDA (Compute Unified Device Architecture) programming model and double precision arithmetic. This implementation uses the CUDA framework to efficiently exploit the potential fine-grain data parallelism of the numerical algorithm. Two versions of the GPU solver are implemented and studied: one using both single and double precision, and another using only double precision. Numerical experiments show the efficiency of this CUDA solver on several GPUs, and a comparison with an efficient multicore CPU implementation of the solver is also reported.
1 Introduction
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 353–364, 2010. © Springer-Verlag Berlin Heidelberg 2010

The two-layer shallow water system of partial differential equations governs the flow of two superposed shallow layers of immiscible fluids with different constant densities. This mathematical model is used as the numerical model to simulate several phenomena related to stratified geophysical flows, such as the steady exchange of two different water flows, as occurs in the Strait of Gibraltar [6], or the tsunamis generated by underwater landslides [13]. The numerical resolution of two-layer or multilayer shallow water systems has been the object of intense research during the last years: see for instance [1,3,4,6,13]. The numerical solution of these equations in realistic applications, where large domains are simulated in space and time, is computationally very expensive. This fact, and the degree of parallelism which these numerical schemes exhibit, suggests the design of parallel versions of the schemes for parallel machines in order to solve and analyze these problems in reasonable execution times.

In this paper, we tackle the acceleration of a finite volume numerical scheme to solve two-layer shallow water systems. This scheme has been parallelized and
optimized by combining a distributed implementation which runs on a PC cluster [4] with the use of SSE-optimized routines [5]. However, despite the important performance improvements, a greater reduction of the runtimes is necessary. A cost effective way of obtaining substantially higher performance in these applications consists in using modern Graphics Processing Units (GPUs). The use of these devices to accelerate computationally intensive tasks is growing in popularity among the scientific and engineering community [15,14]. Modern GPUs present a massively parallel architecture which includes hundreds of processing units optimized for performing floating point operations and multithreaded execution. These architectures make it possible to obtain performance that is orders of magnitude faster than a standard CPU at a very affordable price.

There are previous proposals to port finite volume one-layer shallow water solvers to GPUs by using a graphics-specific programming language [9,10]. These solvers obtain considerable speedups when simulating one-layer shallow water systems, but their graphics-based design is not easy to understand and maintain. Recently, NVIDIA has developed the CUDA programming toolkit [11], which includes an extension of the C language and facilitates the programming of GPUs for general purpose applications by freeing the programmer from dealing with the graphics details of the GPU. A CUDA solver for one-layer systems based on the finite volume scheme presented in [4] is described in [2]. This one-layer shallow water CUDA solver obtains a good exploitation of the massively parallel architecture of several NVIDIA GPUs. In this work, we extend the proposal presented in [2] to the case of two-layer shallow water systems and study its performance.
From the computational point of view, the numerical solution of the two-layer system presents two main problems with respect to the one-layer case: the need to use double precision arithmetic for some calculations of the scheme, and the need to manage a higher volume of data to perform the basic calculations. Our goal is to efficiently exploit GPUs supporting CUDA and double precision arithmetic in order to notably accelerate the numerical solution of two-layer shallow water systems.

This paper is organized as follows: the next section describes the underlying mathematical model, the two-layer shallow water system, and the finite-volume numerical scheme which has been ported to the GPU. A description of the data parallelism of the numerical scheme and its CUDA implementation is presented in Sections 3 and 4. Section 5 shows and analyzes the performance results obtained when the CUDA solver is applied to several test problems using two different NVIDIA GPUs supporting double precision. Finally, Section 6 summarizes the main conclusions and presents the lines for further work.
2 Mathematical Model and Numerical Scheme

The two-layer shallow water system is a system of conservation laws and nonconservative products with source terms which models the flow of two homogeneous fluid shallow layers with different densities that occupy a bounded domain D ⊂ R² under the influence of a gravitational acceleration g. The system has the following form:

\[
\frac{\partial W}{\partial t} + \frac{\partial F_1}{\partial x}(W) + \frac{\partial F_2}{\partial y}(W)
 = B_1(W)\,\frac{\partial W}{\partial x} + B_2(W)\,\frac{\partial W}{\partial y}
 + S_1(W)\,\frac{\partial H}{\partial x} + S_2(W)\,\frac{\partial H}{\partial y}
\tag{1}
\]

being

\[
W = \begin{pmatrix} h_1 \\ q_{1,x} \\ q_{1,y} \\ h_2 \\ q_{2,x} \\ q_{2,y} \end{pmatrix},\quad
F_1(W) = \begin{pmatrix}
 q_{1,x} \\[2pt] \dfrac{q_{1,x}^2}{h_1} + \dfrac{1}{2} g h_1^2 \\[6pt] \dfrac{q_{1,x}\, q_{1,y}}{h_1} \\[6pt]
 q_{2,x} \\[2pt] \dfrac{q_{2,x}^2}{h_2} + \dfrac{1}{2} g h_2^2 \\[6pt] \dfrac{q_{2,x}\, q_{2,y}}{h_2}
\end{pmatrix},\quad
F_2(W) = \begin{pmatrix}
 q_{1,y} \\[2pt] \dfrac{q_{1,x}\, q_{1,y}}{h_1} \\[6pt] \dfrac{q_{1,y}^2}{h_1} + \dfrac{1}{2} g h_1^2 \\[6pt]
 q_{2,y} \\[2pt] \dfrac{q_{2,x}\, q_{2,y}}{h_2} \\[6pt] \dfrac{q_{2,y}^2}{h_2} + \dfrac{1}{2} g h_2^2
\end{pmatrix},
\]

\[
B_1(W) = \begin{pmatrix}
 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & -g h_1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 \\
 -r g h_2 & 0 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},\quad
B_2(W) = \begin{pmatrix}
 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & -g h_1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 \\
 -r g h_2 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\]

\[
S_1(W) = \begin{pmatrix} 0 \\ g h_1 \\ 0 \\ 0 \\ g h_2 \\ 0 \end{pmatrix},\quad
S_2(W) = \begin{pmatrix} 0 \\ 0 \\ g h_1 \\ 0 \\ 0 \\ g h_2 \end{pmatrix},
\]
where h_i(x, y, t) ∈ R denotes the thickness of water layer i at point (x, y) at time t, H(x, y) ∈ R is the depth function measured from a fixed level of reference, and r = ρ₁/ρ₂ is the ratio of the constant densities of the layers (ρ₁ < ρ₂), which in realistic oceanographical applications is close to 1 (see Fig. 1). Finally, q_i(x, y, t) = (q_{i,x}(x, y, t), q_{i,y}(x, y, t)) ∈ R² is the mass-flow of water layer i at point (x, y) at time t.

To discretize system (1), the computational domain D is divided into L cells or finite volumes V_i ⊂ R², which are assumed to be quadrangles. Given a finite
Fig. 1. Two-layer sketch
volume Vi , Ni ∈ R2 is the centre of Vi , ℵi is the set of indexes j such that Vj is a neighbour of Vi ; Γij is the common edge of two neighbouring cells Vi and Vj , and |Γij | is its length; ηij = (ηij,x , ηij,y ) is the unit vector which is normal to the edge Γij and points towards the cell Vj [4] (see Fig. 2).
Fig. 2. Finite volumes
Assume that the approximations at time t^n, W_i^n, have already been calculated. To advance in time, with Δt^n being the time step, the following numerical scheme is applied (see [4] for more details):

\[
W_i^{n+1} = W_i^n - \frac{\Delta t^n}{|V_i|} \sum_{j \in \aleph_i} |\Gamma_{ij}|\, F_{ij}^-
\tag{2}
\]

with

\[
F_{ij}^- = \frac{1}{2}\, K_{ij} \cdot \bigl(I - \operatorname{sgn}(D_{ij})\bigr) \cdot K_{ij}^{-1} \cdot \bigl( A_{ij}\,(W_j^n - W_i^n) - S_{ij}\,(H_j - H_i) \bigr),
\]
where |V_i| is the area of V_i, H_l = H(N_l) with l = 1, . . . , L, A_ij ∈ R^{6×6} and S_ij ∈ R^6 depend on W_i^n and W_j^n, D_ij is a diagonal matrix whose coefficients are the eigenvalues of A_ij, and the columns of K_ij ∈ R^{6×6} are the associated eigenvectors.
To compute the n-th time step, the following condition can be used:

\[
\Delta t^n = \min_{i=1,\dots,L} \left\{ 2\gamma\, \frac{|V_i|}{\sum_{j \in \aleph_i} |\Gamma_{ij}|\, \|D_{ij}\|_\infty} \right\}
\tag{3}
\]
where γ, 0 < γ ≤ 1, is the CFL (Courant-Friedrichs-Lewy) parameter.
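In code, condition (3) amounts to a per-volume division followed by a global minimum. The following plain C sketch is our own illustration (all names are hypothetical, not the authors' code); it assumes each partial sum Z_i already holds Σ_{j∈ℵ_i} |Γ_ij| ‖D_ij‖_∞, accumulated during edge processing:

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Sketch of Eq. (3): local time step per volume, then the global minimum. */
static double cfl_timestep(int L,
                           const double *area,  /* |V_i|                    */
                           const double *Z,     /* per-edge sums for V_i    */
                           double gamma)        /* CFL parameter, 0 < g <= 1 */
{
    double dt = DBL_MAX;
    for (int i = 0; i < L; i++) {
        double dti = 2.0 * gamma * area[i] / Z[i];  /* local Delta t_i */
        if (dti < dt) dt = dti;
    }
    return dt;
}
```

In the GPU version of the paper, the per-volume division and the minimum are separate kernels; this sequential sketch only shows the arithmetic.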
3 CUDA Implementation

In this section we describe the potential data parallelism of the numerical scheme and its implementation in CUDA.

3.1 Parallelism Sources
Figure 3a shows a graphical description of the main sources of parallelism obtained from the numerical scheme. The main calculation phases, identified with circled numbers, present a high degree of parallelism because the computation performed at each edge or volume is independent of that performed at other edges or volumes.
Fig. 3. Parallel algorithm: (a) parallelism sources of the numerical scheme; (b) general steps of the parallel algorithm implemented in CUDA
When the finite volume mesh has been constructed, the time stepping process is repeated until the final simulation time is reached: 1. Edge-based calculations: Two calculations must be performed for each edge Γij communicating two cells Vi and Vj (i, j ∈ {1, . . . , L}):
a) The vector M_ij = |Γ_ij| F_ij^- ∈ R^6 must be computed as the contribution of each edge to the calculation of the new states of its adjacent cells V_i and V_j (see (2)). This contribution can be computed independently for each edge and must be added to the partial sums M_i and M_j associated to V_i and V_j, respectively.
b) The value Z_ij = |Γ_ij| ‖D_ij‖_∞ must be computed as the contribution of each edge to the calculation of the local Δt values of its adjacent cells V_i and V_j (see (3)). This contribution can be computed independently for each edge and must be added to the partial sums Z_i and Z_j associated to V_i and V_j, respectively.
2. Computation of the local Δt_i for each volume: For each volume V_i, the local Δt_i is obtained as follows (see (3)): Δt_i = 2γ |V_i| Z_i⁻¹. In the same way, the computation for each volume can be performed in parallel.
3. Computation of Δt: The minimum of all the local Δt_i values previously computed for each volume is obtained. This minimum Δt represents the next time step which will be applied in the simulation.
4. Computation of W_i^{n+1}: The (n + 1)-th state of each volume (W_i^{n+1}) is calculated from the n-th state and the data computed in the previous phases in the following way (see (2)): W_i^{n+1} = W_i^n − (Δt/|V_i|) M_i. This phase can also be performed in parallel (see Fig. 3a).

As can be seen, the numerical scheme exhibits a high degree of potential data parallelism and is a good candidate to be implemented on CUDA architectures.
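As a minimal illustration of phases 1 and 4, consider the following 1D toy in plain C (our own sketch, not the authors' code): a scalar state per volume stands in for the six-component W_i, and a simple difference stands in for the edge term |Γ_ij| F_ij^-. What matters is the structure, which carries over to the GPU version: independent per-edge contributions accumulated into per-volume partial sums M_i, followed by independent per-volume updates.

```c
#include <assert.h>

/* Toy 1D version of phases 1 (edge-based calculations) and 4 (state update).
   Supports up to 64 volumes; the "flux" is a placeholder for |G_ij| F_ij^-. */
static void step(int L, double *W, double dt,
                 const double *areaInv /* 1 / |V_i| */)
{
    double M[64] = {0};                    /* per-volume partial sums M_i */
    for (int e = 0; e < L - 1; e++) {      /* phase 1: per edge, independent */
        double flux = W[e] - W[e + 1];     /* placeholder edge contribution */
        M[e]     += flux;                  /* add to left neighbour's sum  */
        M[e + 1] -= flux;                  /* and, with opposite sign, right */
    }
    for (int i = 0; i < L; i++)            /* phase 4: per volume, independent */
        W[i] -= dt * areaInv[i] * M[i];    /* W_i -= (dt / |V_i|) * M_i */
}
```

Note that the edge loop writes to two volumes per edge; on the GPU this is exactly the write conflict that the accumulator arrays of Section 4 resolve.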
4 Algorithmic Details of the CUDA Version

In this section we describe the parallel algorithm we have developed and its implementation in CUDA. It is an extension of the algorithm described in [2] to simulate two-layer shallow water systems. We consider problems consisting of a bidimensional regular finite volume mesh. The general steps of the parallel algorithm are depicted in Fig. 3b. Each processing step executed on the GPU is assigned to a CUDA kernel. A kernel is a function executed on the GPU by many threads, which are organized forming a grid of thread blocks that run logically in parallel (see [12] for more details). Next, we describe each step in detail:

– Build data structure: In this step, the data structure that will be used on the GPU is built. For each volume, we store its initial state (h₁, q_{1,x}, q_{1,y}, h₂, q_{2,x} and q_{2,y}) and its depth H. We define two arrays of float4 elements, where each element represents a volume. The first array contains h₁, q_{1,x}, q_{1,y} and H, while the second array contains h₂, q_{2,x} and q_{2,y}. Both arrays are stored as 2D textures. The area of the volumes and the lengths of the vertical and horizontal edges are precalculated and passed to the CUDA kernels that need them. We can know at runtime whether an edge or volume lies on the frontier, and the value of the normal η_ij of an edge, by checking the position of the thread in the grid.
– Process vertical edges and process horizontal edges: As in [2], we divide the edge processing into vertical and horizontal edge processing. For vertical edges, η_{ij,y} = 0, and therefore all the operations where this term takes part can be discarded. Similarly, for horizontal edges, η_{ij,x} = 0, and all the operations where this term takes part can be avoided. In vertical and horizontal edge processing, each thread represents a vertical or horizontal edge, respectively, and computes the contribution of the edge to its adjacent volumes as described in Section 3.1. The edges (i.e. threads) synchronize with each other when contributing to a particular volume by means of four accumulators (in [2] we used two accumulators for one-layer systems), each one being an array of float4 elements. The size of each accumulator is the number of volumes. Let us call the accumulators 1-1, 1-2, 2-1 and 2-2. Each element of accumulators 1-1 and 2-1 stores the contributions of the edges to layer 1 of W_i (the first 3 elements of M_i) and to the local Δt of the volume (a float value Z_i), while each element of accumulators 1-2 and 2-2 stores the contributions of the edges to layer 2 of W_i (the last 3 elements of M_i). Then, in the processing of vertical edges:
◦ Each vertical edge writes in accumulator 1-1 the contribution to layer 1 and to the local Δt of its right volume, and writes in accumulator 1-2 the contribution to layer 2 of its right volume.
◦ Each vertical edge writes in accumulator 2-1 the contribution to layer 1 and to the local Δt of its left volume, and writes in accumulator 2-2 the contribution to layer 2 of its left volume.
Next, the processing of horizontal edges is performed in an analogous way, with the difference that the contributions are added to the accumulators instead of merely written. Figure 4 shows this process graphically.
Fig. 4. Computing the sum of the contributions of the edges of each volume: (a) vertical edge processing; (b) horizontal edge processing
Fig. 5. Computation of the final contribution of the edges for each volume: (a) contribution to Δt_i; (b) contribution to W_i
– Compute Δt_i for each volume: In this step, each thread represents a volume and computes the local Δt_i of the volume V_i as described in Section 3.1. The final Z_i value is obtained by summing the two float values stored in the positions corresponding to the volume V_i in accumulators 1-1 and 2-1 (see Fig. 5a).
– Get minimum Δt: This step finds the minimum of the local Δt_i of the volumes by applying a reduction algorithm on the GPU. The reduction algorithm applied is kernel 7 (the most optimized one) of the reduction sample included in the CUDA Software Development Kit [11].
– Compute W_i for each volume: In this step, each thread represents a volume and updates the state W_i of the volume V_i as described in Section 3.1. The final M_i value is obtained as follows: the first 3 elements of M_i (the contribution to layer 1) are obtained by summing the two 3 × 1 vectors stored in the positions corresponding to the volume V_i in accumulators 1-1 and 2-1, while the last 3 elements of M_i (the contribution to layer 2) are obtained by summing the two 3 × 1 vectors stored in the positions corresponding to the volume V_i in accumulators 1-2 and 2-2 (see Fig. 5b). Since a CUDA kernel cannot write directly into textures, the textures are updated by first writing the results into temporary arrays, which are then copied to the CUDA arrays bound to the textures.

A version of this CUDA algorithm which uses double precision to perform all the computing phases has also been implemented. The volume data is stored in three arrays of double2 elements (which contain the states of the volumes) and one array of double elements (the depth H). We use six accumulators of double2 elements (for storing the contributions to W_i) and two accumulators of double elements (for storing the contributions to the local Δt of each volume).
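The minimum reduction has the following shape; this sequential C sketch of a pairwise tree reduction is ours, for illustration only, since the actual code uses the optimized reduction kernel shipped with the CUDA SDK, which performs the same log₂(n) passes with thread blocks and shared memory:

```c
#include <assert.h>

/* Sequential sketch of a pairwise (tree) minimum reduction: after pass k,
   v[i] holds the minimum over a window of up to 2^k original elements.
   Destroys the input array; works for any n >= 1, not only powers of two. */
static double min_reduce(double *v, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            if (v[i + stride] < v[i]) v[i] = v[i + stride];
    return v[0];
}
```

On the GPU each pass is data-parallel: all comparisons of a given stride are independent, which is what makes the reduction efficient there.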
5 Experimental Results
We consider an internal circular dambreak problem in the [−5, 5] × [−5, 5] rectangular domain in order to compare the performance of our implementations. The depth function is given by H(x, y) = 2 and the initial condition is:

\[
W_i^0(x, y) = \bigl( h_1(x, y),\, 0,\, 0,\, h_2(x, y),\, 0,\, 0 \bigr)^T
\]

where

\[
h_1(x, y) = \begin{cases} 1.8 & \text{if } x^2 + y^2 > 4, \\ 0.2 & \text{otherwise}, \end{cases}
\qquad
h_2(x, y) = 2 - h_1(x, y).
\]
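This initial condition translates directly into code; the following C helper is our own illustrative sketch (not the authors' implementation), filling the six-component state of a volume centred at (x, y):

```c
#include <assert.h>
#include <math.h>

/* Circular dambreak on [-5,5] x [-5,5]: layer-1 thickness 1.8 outside the
   circle of radius 2 and 0.2 inside it, h1 + h2 = H = 2, mass-flows zero. */
static void init_state(double x, double y, double W[6])
{
    double h1 = (x * x + y * y > 4.0) ? 1.8 : 0.2;
    W[0] = h1;           /* h1           */
    W[1] = W[2] = 0.0;   /* q1,x , q1,y  */
    W[3] = 2.0 - h1;     /* h2 = H - h1  */
    W[4] = W[5] = 0.0;   /* q2,x , q2,y  */
}
```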
The numerical scheme is run for several regular bidimensional finite volume meshes with different numbers of volumes (see Table 1). The simulation is carried out in the time interval [0, 1]. The CFL parameter is γ = 0.9, r = 0.998, and wall boundary conditions (q₁ · η = 0, q₂ · η = 0) are considered.

To perform the experiments, several programs have been implemented:
– A serial CPU version of the CUDA algorithm. This version has been implemented in C++ and uses the Eigen library [8] for operating with matrices. We have used the double data type in this implementation.
– A quadcore CPU version of the CUDA algorithm. This is a parallelization of the aforementioned serial CPU version which uses OpenMP [7].
– A mixed precision CUDA implementation (CUSDP). In this GPU version, the eigenvalues and eigenvectors of the A_ij matrix (see Sect. 2) are computed using double precision to avoid numerical instability problems, but the rest of the operations are performed in single precision.
– A full double precision CUDA implementation (CUDP).

All the programs were executed on a Core i7 920 with 4 GB RAM. The graphics cards used were a GeForce GTX 280 and a GeForce GTX 480. Figure 7 shows the evolution of the fluid. Table 1 shows the execution times in seconds for all the meshes and programs. As can be seen, the number of volumes and the execution times scale with different factors because the number of time steps required for the same time interval also increases when the number of cells is increased (see (3)). Using a GTX 480, for big meshes, CUSDP achieves a speedup of 62 with respect to the monocore CPU version, while CUDP reaches a speedup of 38. As expected, the OpenMP version only reaches a speedup of less than four on all meshes. CUDP has been about 38 % slower than CUSDP for big meshes on the GTX 480 card, and 24 % slower on the GTX 280 card.
On the GTX 480 card, we obtain better execution times by setting the sizes of the L1 cache and shared memory to 48 KB and 16 KB per multiprocessor, respectively, for the two edge-processing CUDA kernels. Table 2 shows the mean values of the percentages of the execution time and GPU FLOPS for all the computing steps and implementations. Clearly, almost all the execution time is spent in the edge processing steps.
362
M. de la Asunci´ on, J.M. Mantas, and M.J. Castro
Table 1. Execution times in seconds for all the meshes and programs

Mesh size            CPU        CPU       GTX 280            GTX 480
L = Lx × Ly          1 core     4 cores   CUSDP    CUDP      CUSDP    CUDP
100 × 100               7.54      2.10      0.48     0.80      0.37     0.53
200 × 200              59.07     15.84      3.15     4.38      1.42     2.17
400 × 400             454.7     121.0      21.92    29.12      8.04    13.01
800 × 800            3501.9     918.7     163.0    216.1      57.78    94.57
1600 × 1600         28176.7    7439.4    1262.7   1678.0     453.5    735.6
2000 × 2000         54927.8   14516.6    2499.2   3281.0     879.7   1433.6
Table 2. Mean values of the percentages of the execution time and GPU FLOPS for all the computing steps

                               % Execution time                % GPU FLOPS
Computing step              1 core  4 cores  CUSDP   CUDP
Process vertical edges        49.6    48.2    49.5   50.0          49.5
Process horizontal edges      49.8    48.6    49.4   48.5          49.9
Compute Δti                    0.2     1.1     0.3    0.3           0.1
Get minimum Δt                  –      0.4     0.1    0.2           0.0
Compute Wi^(n+1)               0.4     1.7     0.7    1.0           0.5
Figure 6 graphically shows the GB/s and GFLOPS obtained with the CUDA implementations on both graphics cards. On the GTX 480 card, CUSDP achieves 4.2 GB/s and 34 GFLOPS for big meshes. The theoretical maximums are: for the GTX 480, 177.4 GB/s, and 1.35 TFLOPS in single precision or 168 GFLOPS in double precision; for the GTX 280, 141.7 GB/s, and 933 GFLOPS in single precision or 78 GFLOPS in double precision. As can be seen, the speedup, GB/s and GFLOPS reached with the CUSDP program are notably worse than those obtained in [2] with the single precision CUDA implementation for one-layer systems. This is mainly due to two reasons. Firstly, since double precision has been used to compute the eigenvalues and eigenvectors, the efficiency is reduced because the double precision speed is 1/8 of the single precision speed on GeForce cards with GT200 and GF100 architectures. Secondly, since the register usage and the complexity of the code executed by each thread are higher in this implementation, the CUDA compiler has to store some data in local memory, which also increases the execution time. We have also compared the numerical solutions obtained with the monocore and the CUDA programs. The L1 norm of the difference between the solutions obtained on CPU and GPU at time t = 1.0 was calculated for all meshes. The order of magnitude of the L1 norm using CUSDP varies between 10−4 and 10−6, while that obtained using CUDP varies between 10−12 and 10−14, which reflects the different accuracy of the numerical solutions computed on the GPU using both single and double precision, and using only double precision.
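The CPU/GPU comparison above reduces to a discrete L1 norm of the difference of the two state vectors. A minimal sketch of ours (not the paper's code), assuming equal-shaped arrays and a uniform cell area:

```python
import numpy as np

def l1_difference(w_a, w_b, cell_area):
    """Discrete L1 norm of the difference between two finite volume solutions:
    sum over all cells of |w_a - w_b| times the (uniform) cell area."""
    return float(np.sum(np.abs(np.asarray(w_a) - np.asarray(w_b))) * cell_area)
```

For instance, on a 400 × 400 mesh over [−5, 5] × [−5, 5] the cell area would be (10/400)²; a perturbation of the order of single-precision roundoff in every cell then yields an L1 norm of the magnitudes reported above.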
Programming CUDA-Based GPUs    363

Fig. 6. GB/s (a) and GFLOPS (b) obtained with the CUDA implementations in all meshes with both graphics cards
Fig. 7. Graphical representation of the fluid evolution at different time instants: (a) t = 0.0, (b) t = 2.5, (c) t = 5.0
6 Conclusions and Further Work
In this paper we have presented an efficient first order well-balanced finite volume solver for two-layer shallow water systems. The numerical scheme has been parallelized, adapted to the GPU and implemented using the CUDA framework in order to exploit the parallel processing power of GPUs. On the GTX 480 graphics card, the CUDA implementation using both single and double precision has reached 4.2 GB/s and 34 GFLOPS, and has been one order of magnitude faster than a monocore CPU version of the solver for big uniform meshes. These results are expected to improve significantly on an NVIDIA Tesla GPU architecture based on Fermi, since this architecture provides better double precision support than the GTX 480 graphics card. The simulations carried out also reveal the different accuracy obtained with the two implementations of the solver: better accuracy is obtained using double precision than using both single and double precision. As further work, we propose to extend the strategy to enable efficient simulations on irregular and non-structured meshes.
Acknowledgements. J. M. Mantas acknowledges partial support from the DGI-MEC project MTM2008-06349-C03-03. M. de la Asunción and M. J. Castro acknowledge partial support from DGI-MEC project MTM2009-11923.
References

1. Abgrall, R., Karni, S.: Two-layer shallow water system: A relaxation approach. SIAM J. Sci. Comput. 31(3), 1603–1627 (2009)
2. de la Asunción, M., Mantas, J.M., Castro, M.: Simulation of one-layer shallow water systems on multicore and CUDA architectures. The Journal of Supercomputing (2010), http://dx.doi.org/10.1007/s11227-010-0406-2
3. Audusse, E., Bristeau, M.O.: Finite-volume solvers for a multilayer Saint-Venant system. Int. J. Appl. Math. Comput. Sci. 17(3), 311–320 (2007)
4. Castro, M.J., García-Rodríguez, J.A., González-Vida, J.M., Parés, C.: A parallel 2D finite volume scheme for solving systems of balance laws with nonconservative products: Application to shallow flows. Comput. Meth. Appl. Mech. Eng. 195, 2788–2815 (2006)
5. Castro, M.J., García-Rodríguez, J.A., González-Vida, J.M., Parés, C.: Solving shallow-water systems in 2D domains using finite volume methods and multimedia SSE instructions. J. Comput. Appl. Math. 221(1), 16–32 (2008)
6. Castro, M.J., García-Rodríguez, J.A., González-Vida, J.M., Macías, J., Parés, C.: Improved FVM for two-layer shallow-water models: Application to the Strait of Gibraltar. Adv. Eng. Softw. 38(6), 386–398 (2007)
7. Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Cambridge (2007)
8. Eigen 2.0.12, http://eigen.tuxfamily.org
9. Hagen, T.R., Hjelmervik, J.M., Lie, K.A., Natvig, J.R., Henriksen, M.O.: Visual simulation of shallow-water waves. Simulation Modelling Practice and Theory 13(8), 716–726 (2005)
10. Lastra, M., Mantas, J.M., Ureña, C., Castro, M.J., García-Rodríguez, J.A.: Simulation of shallow-water systems using graphics processing units. Math. Comput. Simul. 80(3), 598–618 (2009)
11. NVIDIA: CUDA home page, http://www.nvidia.com/object/cuda_home_new.html
12. NVIDIA: NVIDIA CUDA Programming Guide 3.0 (2010), http://developer.nvidia.com/object/cuda_3_0_downloads.html
13. Ostapenko, V.V.: Numerical simulation of wave flows caused by a shoreside landslide. J. Appl. Mech. Tech. Phys. 40, 647–654 (1999)
14. Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: GPU computing. Proceedings of the IEEE 96(5), 879–899 (2008)
15. Rumpf, M., Strzodka, R.: Graphics Processor Units: New prospects for parallel computing. Lecture Notes in Comput. Science and Engineering 51, 89–134 (2006)
Theory and Algorithms for Parallel Computation

Christoph Kessler¹, Thomas Rauber², Yves Robert¹, and Vittorio Scarano²

¹ Topic Chairs
² Members
Parallelism concerns all levels of current computing systems, from single-CPU machines to large server farms. Effective use of parallelism relies crucially on the availability of suitable models of computation for algorithm design and analysis, of efficient strategies for the solution of key computational problems on prominent classes of platforms, and of good models of the way the different components are interconnected. With the advent of multicore parallel machines, new models and paradigms are needed to allow parallel programming to advance into mainstream computing. This includes the following topics:

– foundations of parallel, distributed, multiprocessor and network computation;
– models of parallel, distributed, multiprocessor and network computation;
– emerging paradigms of parallel, distributed, multiprocessor and network computation;
– models and algorithms for parallelism in memory hierarchies;
– models and algorithms for real networks (scale-free, small world, wireless networks);
– theoretical aspects of routing;
– deterministic and randomized parallel algorithms;
– lower bounds for key computational problems.

This year, 8 papers discussing some of these issues were submitted to this topic. Each paper was reviewed by four reviewers and, finally, we were able to select 4 regular papers. The accepted papers discuss very interesting issues in the theory and models of parallel computing, as well as the mapping of parallel computations to the execution resources of parallel platforms. The paper "Analysis of Multi-Organization Scheduling Algorithms" by J. Cohen, D. Cordeiro, D. Trystram and F. Wagner considers the problem of scheduling single-processor tasks on computing platforms composed of several independent organizations, where each organization only cooperates if its local makespan is not increased by jobs of other organizations. This is called the 'local constraint'.
Moreover, a 'selfishness constraint' is considered which does not allow schedules in which foreign jobs are finished before all local jobs are started. The article shows some lower bounds and proves that the scheduling problem is NP-complete. Three approximation algorithms are discussed and compared in an experimental evaluation with randomly generated workloads. The paper "Area-Maximizing Schedules for Series-Parallel DAGs" by G. Cordasco and A. Rosenberg explores the computation of schedules for series-parallel DAGs. In particular, AREA-maximizing schedules are computed, which produce execution-eligible tasks as fast as possible. In previous work, the authors introduced this problem as IC-scheduling and showed how AREA-maximizing schedules can be derived for specific families of DAGs. In this article, this work is extended to arbitrary series-parallel (SP) DAGs that are obtained by series and parallel composition. The paper "Parallel selection by regular sampling" by A. Tiskin examines the selection problem and uses the BSP model to express a new deterministic algorithm for solving it. The new algorithm needs O(n/p) local computation and communication (optimal) and O(log log p) synchronizations for arrays of length n using p processors. The paper "Ants in Parking Lots" by A. Rosenberg investigates the movement of autonomous units (ants) in a 2D mesh. The ants are controlled by a specialized finite-state machine (FSM); all ants have the same FSM. Ants can communicate by direct contact or by leaving pheromone values on mesh points previously visited. The paper discusses some movement operations of ants in 1D or 2D meshes. In particular, the parking problem is considered, defined as the movement to the nearest corner of the mesh. We would like to take this opportunity to thank the authors who submitted contributions, the Euro-Par Organizing Committee, and the referees, whose highly useful comments and efforts have made this conference and this topic possible.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 365–366, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Analysis of Multi-Organization Scheduling Algorithms

Johanne Cohen¹, Daniel Cordeiro², Denis Trystram², and Frédéric Wagner²

¹ Laboratoire d'Informatique PRiSM, Université de Versailles St-Quentin-en-Yvelines, 45 avenue des États-Unis, 78035 Versailles Cedex, France
² LIG, Grenoble University, 51 avenue Jean Kuntzmann, 38330 Montbonnot Saint-Martin, France
Abstract. In this paper we consider the problem of scheduling on computing platforms composed of several independent organizations, known as the Multi-Organization Scheduling Problem (MOSP). Each organization provides both resources and tasks and follows its own objectives. We are interested in the best way to minimize the makespan on the entire platform when the organizations behave in a selfish way. We study the complexity of the MOSP problem with two different local objectives – makespan and average completion time – and show that MOSP is NP-hard in both cases. We formally define a notion of selfishness by means of restrictions on the schedules. We prove that selfish behavior imposes a lower bound of 2 on the approximation ratio for the global makespan. We present various approximation algorithms of ratio 2 which comply with these selfishness restrictions. These algorithms are experimentally evaluated through simulation, exhibiting good average performance.
1 Introduction

1.1 Motivation and Presentation of the Problem
The new generation of many-core machines and the now mature grid computing systems allow the creation of unprecedented massively distributed systems. In order to fully exploit such a large number of available processors and cores and reach the best performance, we need sophisticated scheduling algorithms that encourage users to share their resources and, at the same time, respect each user's own interests. Many of these new computing systems are composed of organizations that own and manage clusters of computers. A user of such a system submits his/her jobs to a scheduler that can choose any available machine in any of these clusters. However, each organization that shares its resources aims to take maximum advantage of its own hardware. In order to improve cooperation between the organizations, local jobs should be prioritized. Finding an efficient schedule for the jobs using the available machines is a crucial problem. Although each user submits jobs locally in his/her own organization, it is necessary to optimize the allocation of the jobs for the whole
platform in order to achieve good performance. The global performance and the performance perceived by the users will depend on how the scheduler allocates resources among all available processors to execute each job.

1.2 Related Work
In classical scheduling theory, the problem of scheduling parallel jobs is related to strip packing [1], which consists in packing a set of rectangles (without rotations or overlaps) into a strip of machines in order to minimize the height used. This problem was later extended to the case where the rectangles are packed into a finite number of strips [16, 15]. More recently, an asymptotic (1 + ε)-approximation AFPTAS with additive constant O(1) and with running time polynomial in n and in 1/ε was presented in [8]. Schwiegelshohn, Tchernykh, and Yahyapour [14] studied a very similar problem, where the jobs can be scheduled on non-contiguous processors. Their algorithm is a 3-approximation for the maximum completion time (makespan) if all jobs are known in advance, and a 5-approximation for the makespan in the on-line, non-clairvoyant case. The Multi-Organization Scheduling Problem (MOSP) was introduced by Pascual et al. [12, 13] and studies how to efficiently schedule parallel jobs on these new computing platforms while respecting users' own selfish objectives. A preliminary analysis of the scheduling problem on homogeneous clusters was presented with the target of minimizing the makespan, resulting in a centralized 3-approximation algorithm. This problem was then extended to relaxed local objectives in [11]. The notion of cooperation between different organizations and the study of the impact of users' selfish objectives are directly related to game theory. The study of the Price of Anarchy [9] in non-cooperative games allows one to analyze how far the social costs – results obtained by selfish decisions – are from the social optimum in different problems. In selfish load-balancing games (see [10] for more details), selfish agents aim to allocate their jobs on the machine with the smallest load. In these games, the social cost is usually defined as the completion time of the last job to finish (makespan).
Several works have studied this problem focusing on various aspects, such as convergence time to a Nash equilibrium [4], characterization of the worst-case equilibria [3], etc. We are not aiming here at such game-theoretical approaches.

1.3 Contributions and Road Map
As suggested in the previous section, the problem of scheduling in multi-organization clusters has been studied from several different points of view. In this paper, we propose a theoretical analysis of the problem using classical combinatorial optimization approaches. Our main contribution is the extension and analysis of the problem for the case in which sequential jobs are submitted by selfish organizations that can handle different local objectives (namely, makespan and average completion times).
We introduce new restrictions on the schedule that take into account the notion of selfish organizations, i.e., organizations that refuse to cooperate if their objectives could be improved just by executing one of their jobs earlier on one of their own machines. The formal description of the problem and the notations used in this paper are given in Section 2. Section 3 shows that any algorithm respecting our new selfishness restrictions cannot achieve an approximation ratio better than 2, and that both problems are intractable. New heuristics for solving the problem are presented in Section 4. Simulation experiments, discussed in Section 5, show the good results obtained by our algorithms on average.
2 Problem Description and Notations
In this paper, we are interested in the scheduling problem in which different organizations, each owning a physical cluster of identical machines, are interconnected. They share resources and exchange jobs with each other in order to simultaneously maximize the profits of the collectivity and their own interests. All organizations intend to minimize the total completion time of all jobs (i.e., the global makespan) while they individually try to minimize their own objectives – either the makespan or the average completion time of their own jobs – in a selfish way. Although each organization accepts to cooperate with others in order to minimize the global makespan, individually it behaves in a selfish way: an organization can refuse to cooperate if, in the final schedule, one of its migrated jobs could be executed earlier on one of the machines it owns. Formally, we define our target platform as a grid computing system with N different organizations interconnected by a middleware. Each organization O^(k) (1 ≤ k ≤ N) has m^(k) identical machines available that can be used to run jobs submitted by users from any organization. Each organization O^(k) has n^(k) jobs to execute. Each job J_i^(k) (1 ≤ i ≤ n^(k)) will use one processor for exactly p_i^(k) units of time¹. No preemption is allowed, i.e., after its activation, a job runs until its completion at time C_i^(k). We denote the makespan of a particular organization k by C_max^(k) = max_{1≤i≤n^(k)} C_i^(k) and its sum of completion times by Σ_i C_i^(k). The global makespan for the entire grid computing system is defined as C_max = max_{1≤k≤N} C_max^(k).
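To fix the notation, a small sketch of ours (not from the paper): each organization is a list of processing times p_i^(k) plus a machine count m^(k), and a greedy list schedule — an assumption, since the paper does not prescribe how local schedules are built — yields the completion times from which C_max^(k) and Σ C_i^(k) are computed:

```python
import heapq

def list_schedule(processing_times, m):
    """Greedy list scheduling on m identical machines: each job starts on the
    machine that becomes free first; returns the completion time C_i of each job."""
    loads = [0.0] * m
    heapq.heapify(loads)
    completions = []
    for p in processing_times:
        start = heapq.heappop(loads)       # earliest available machine
        completions.append(start + p)
        heapq.heappush(loads, start + p)
    return completions

def makespan(completions):
    """C_max^(k): latest completion time among organization k's jobs."""
    return max(completions)

def sum_completion_times(completions):
    """Sum of completion times of organization k's jobs."""
    return sum(completions)
```

The global makespan is then simply the maximum of the per-organization makespans.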
2.1 Local Constraint
The Multi-Organization Scheduling Problem, as first described in [12], consists in minimizing the global makespan (C_max) with an additional local constraint: at the end, no organization can have its makespan increased compared with the makespan it could have obtained by scheduling its jobs on its own machines (C_max^(k) local). More formally, we call MOSP(C_max) the following optimization problem:

minimize C_max such that, for all k (1 ≤ k ≤ N), C_max^(k) ≤ C_max^(k) local

In this work, we also study the case where all organizations are locally interested in minimizing their average completion time while minimizing the global makespan. As in MOSP(C_max), each organization imposes that the sum of completion times of its jobs cannot be increased compared with what the organization could have obtained using only its own machines (Σ C_i^(k) local). We denote this problem MOSP(Σ C_i), and the goal of this optimization problem is to:

minimize C_max such that, for all k (1 ≤ k ≤ N), Σ C_i^(k) ≤ Σ C_i^(k) local

¹ All machines are identical, i.e., every job will be executed at the same speed independently of the chosen machine.
2.2 Selfishness
In both MOSP(C_max) and MOSP(Σ C_i), while the global schedule might be computed by a central entity, the organizations keep control over the way they execute the jobs in the end. This means that, in theory, an organization could cheat the devised global schedule by re-inserting its jobs earlier in its local schedule. In order to prevent such behavior, we define a new restriction on the schedule, called the selfishness restriction. The idea is that, in any schedule respecting this restriction, no single organization can improve its local schedule by cheating. Given a fixed schedule, let J_f^(l) be the first foreign job scheduled to be executed in O^(k) (or the first idle time if O^(k) has no foreign job) and J_i^(k) any job belonging to O^(k). Then, the selfishness restriction forbids any schedule where C_f^(l) < C_i^(k) − p_i^(k). In other words, O^(k) refuses to cooperate if one of its jobs could be executed earlier on one of O^(k)'s machines, even if this leads to a larger global makespan.
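The restriction can be checked per organization. A minimal sketch of ours, where an organization's own jobs are given as (C_i, p_i) pairs and c_f is the completion time of its first foreign job (or first idle slot):

```python
def selfishness_violated(own_jobs, c_f):
    """True if the selfishness restriction is violated for this organization:
    some own job (C_i, p_i) starts at C_i - p_i strictly after the first foreign
    job completes at c_f (c_f < C_i - p_i), so the organization could cheat by
    re-inserting that job earlier on its own machine."""
    return any(c_f < c - p for c, p in own_jobs)
```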
3 Complexity Analysis

3.1 Lower Bounds
Pascual et al. [12] showed, with an instance having two organizations and two machines per organization, that every algorithm that solves MOSP (for rigid, parallel jobs and C_max as local objective) has at least a 3/2 approximation ratio when compared to the optimal makespan that could be obtained without the local constraints. We show that the same bound applies asymptotically even with a larger number of organizations. Take the instance depicted in Figure 1a. O^(1) initially has two jobs of size N and all the others initially have N jobs of size 1. All organizations contribute
Fig. 1. Ratio between the global optimum makespan and the optimum makespan that can be obtained for both MOSP(C_max) and MOSP(Σ C_i): (a) initial instance, C_max = 2N; (b) global optimum without constraints, C_max = N + 1; (c) optimum with MOSP constraints, C_max = 3N/2. Jobs owned by organization O^(2) are highlighted.
only with 1 machine each. The optimal makespan for this instance is N + 1 (Figure 1b); nevertheless, it delays jobs from O^(2) and, as a consequence, does not respect MOSP's local constraints. The best possible makespan that respects the local constraints (whether the local objective is the makespan or the average completion time) is 3N/2, as shown in Figure 1c.

3.2 Selfishness and Lower Bounds
Although all organizations will likely cooperate with each other to achieve the best global makespan possible, their selfish behavior will certainly impact the quality of the best attainable global makespan. We study here the impact of the new selfishness restrictions on the quality of the achievable schedules. We show that these restrictions impact MOSP(C_max) and MOSP(Σ C_i) as compared with unrestricted schedules and, moreover, that MOSP(C_max) with selfishness restrictions suffers from limited performance as compared to MOSP(C_max) with local constraints.

Proposition 1. Any approximation algorithm for both MOSP(C_max) and MOSP(Σ C_i) has a ratio greater than or equal to 2 with respect to the optimal makespan without constraints if all organizations behave selfishly.

Proof. We prove this result using the example described in Figure 1. It is clear from Figure 1b that an optimal solution for a schedule without local constraints achieves a makespan of N + 1. However, with the added selfishness restrictions, Figure 1a (with a makespan of 2N) represents the only valid schedule possible. We can, therefore, conclude that local constraints combined with selfishness restrictions imply that no algorithm can provide an approximation ratio better than 2 when compared with the problem without constraints.

Proposition 1 gives a ratio with respect to the optimal makespan without the local constraints imposed by MOSP. We can show that the same approximation ratio of 2 also applies for MOSP(C_max) with respect to the optimal makespan even if the MOSP constraints are respected.
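The gap behind Proposition 1 is elementary arithmetic on the Figure 1 instance; a tiny sketch of ours makes the asymptotics explicit:

```python
def prop1_ratio(n):
    """Figure 1 instance with N organizations, one machine each: O^(1) owns two
    jobs of size N, every other organization owns N unit jobs. Under the
    selfishness restrictions the initial schedule (C_max = 2N) is the only valid
    one, while the unconstrained optimum is N + 1, so the ratio 2N/(N+1)
    tends to 2 as N grows."""
    return (2 * n) / (n + 1)
```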
Fig. 2. Ratio between the global optimum makespan with MOSP constraints and the makespan that can be obtained by MOSP(C_max) with selfish organizations: (a) initial instance, C_max = 2N − 2; (b) global optimum with MOSP constraints, C_max = N.
Proposition 2. Any approximation algorithm for MOSP(C_max) has a ratio greater than or equal to 2 − 2/N with respect to the optimal makespan with local constraints if all organizations behave selfishly.

Proof. Take the instance depicted in Figure 2a. O^(1) initially has N jobs of size 1 and O^(N) has two jobs of size N. The optimal solution that respects the MOSP local constraints is given in Figure 2b and has C_max equal to N. Nevertheless, the best solution that respects the selfishness restrictions is the initial instance, with C_max equal to 2N − 2. So, the ratio of the optimal solution with the selfishness restrictions to the optimal solution with MOSP constraints is 2 − 2/N.

3.3 Computational Complexity
This section studies how hard it is to find optimal solutions for MOSP, even for the simpler case in which each organization contributes only one machine and two jobs. We consider the decision version of MOSP, defined as follows:

Instance: a set of N organizations (for 1 ≤ k ≤ N, organization O^(k) has n^(k) jobs, m^(k) identical machines, and makespan as the local objective) and an integer ℓ.
Question: does there exist a schedule with a makespan less than ℓ?

Theorem 1. MOSP(C_max) is strongly NP-complete.

Proof. It is straightforward to see that MOSP(C_max) ∈ NP. Our proof is based on a reduction from the well-known 3-Partition problem [5]:

Instance: a bound B ∈ Z+ and a finite set A of 3m integers {a_1, ..., a_3m}, such that every element of A is strictly between B/4 and B/2 and such that Σ_{i=1}^{3m} a_i = mB.
Question: can A be partitioned into m disjoint sets A_1, A_2, ..., A_m such that, for all 1 ≤ i ≤ m, Σ_{a∈A_i} a = B and A_i is composed of exactly three elements?

Given an instance of 3-Partition, we construct an instance of MOSP where, for 1 ≤ k ≤ 3m, organization O^(k) initially has two jobs J_1^(k) and J_2^(k) with p_1^(k) = (m + 1)B + 7 and p_2^(k) = (m + 1)a_k + 1, and all other organizations have two jobs with processing time equal to 2. We then set ℓ to be equal to (m + 1)B + 7. Figure 3 depicts the described instance. This construction is performed in polynomial time. Now, we prove that A can be split into m disjoint subsets A_1, ..., A_m, each one summing up to B, if and only if this instance of MOSP has a solution with C_max ≤ (m + 1)B + 7.

Assume that A = {a_1, ..., a_3m} can be partitioned into m disjoint subsets A_1, ..., A_m, each one summing up to B. In this case, we can build an optimal schedule for the instance as follows:

– for 1 ≤ k ≤ 3m, J_1^(k) is scheduled on machine k;
– for 3m + 1 ≤ k ≤ 4m, J_1^(k) and J_2^(k) are scheduled on machine k;
– for 1 ≤ i ≤ m, let A_i = {a_i1, a_i2, a_i3} ⊆ A. The jobs J_2^(i1), J_2^(i2) and J_2^(i3) are scheduled on machine 3m + i.
So, the global C_max is (m + 1)B + 7 and the local constraints are respected. Conversely, assume that MOSP has a solution with C_max ≤ (m + 1)B + 7. The total work W of the jobs that must be executed is W = 3m((m + 1)B + 7) + 2 · 2m + (m + 1) Σ_{i=1}^{3m} a_i + 3m = 4m((m + 1)B + 7). Since we have exactly 4m organizations, the solution must be optimal and there are no idle times in the schedule. Moreover, 3m machines must execute only one job, of size (m + 1)B + 7. W.l.o.g., we can consider that for 3m + 1 ≤ k ≤ 4m, machine k executes jobs of size less than (m + 1)B + 7. To prove our proposition, we first show two lemmas:

Lemma 1. For all 3m + 1 ≤ k ≤ 4m, at most four jobs of size not equal to 2 can be scheduled on machine k if C_max^(k) ≤ (m + 1)B + 7.

Proof. It is enough to notice that all jobs of size not equal to 2 are greater than (m + 1)B/4 + 1, that C_max must be equal to (m + 1)B + 7, and that m + 1 > 3.

Lemma 2. For all 3m + 1 ≤ k ≤ 4m, exactly two jobs of size 2 are scheduled on each machine k if C_max^(k) ≤ (m + 1)B + 7.

Proof. We prove this lemma by contradiction. Assume that there exists a machine k such that at most one job of size 2 is scheduled on it. Then, by definition of the sizes of the jobs, all other jobs scheduled on machine k have a size greater than (m + 1)B/4 + 1. As a consequence of Lemma 1, since at most four jobs can be scheduled on machine k, the total work on this machine is (m + 1)B + y + 2 where y ≤ 4. This is in contradiction with the facts that there is no idle processing time and that ℓ = (m + 1)B + 7.

Now, we construct m disjoint subsets A_1, A_2, ..., A_m of A as follows: for all 1 ≤ i ≤ m, a_j is in A_i if the job of size (m + 1)a_j + 1 is scheduled on machine 3m + i. Note that all elements of A belong to one and only one set in {A_1, ..., A_m}. We prove that A is a partition with the desired properties. We focus
Fig. 3. Reduction of MOSP(Cmax ) from 3-Partition
on a fixed element A_i. By definition of A_i we have that

4 + Σ_{a_j∈A_i} ((m + 1)a_j + 1) = (m + 1)B + 7 ⇒ Σ_{a_j∈A_i} ((m + 1)a_j + 1) = (m + 1)B + 3
Since m + 1 > 3, we have Σ_{a_j∈A_i} (m + 1)a_j = (m + 1)B. Thus, we can deduce that A_i is composed of exactly three elements and Σ_{a∈A_i} a = B.

We continue by showing that even if all organizations are locally interested in the average completion time, the problem is still NP-complete. We prove the NP-completeness of the MOSP(Σ C_i) problem (with a decision formulation similar to that of MOSP(C_max)) using a reduction from the Partition problem. The idea here is similar to the one used in the previous reduction, but the Σ C_i constraints heavily restrict the allowed movements of jobs when compared to the C_max constraints.

Theorem 2. MOSP(Σ C_i) is NP-complete.

Proof. First, note that it is straightforward to see that MOSP(Σ C_i) ∈ NP. We use the Partition [5] problem to prove this theorem.

Instance: a set of n integers s_1, s_2, ..., s_n.
Question: does there exist a subset J ⊆ I = {1, ..., n} such that Σ_{i∈J} s_i = Σ_{i∈I\J} s_i?
Consider an integer M > Σ_i s_i. Given an instance of the Partition problem, we construct an instance of the MOSP(Σ C_i) problem as depicted in Figure 4a. There are N = 2n + 2 organizations, each having two jobs. The organizations O^(2n+1) and O^(2n+2) have two jobs with processing time 1. Each integer s_i from the Partition problem corresponds to a pair of jobs t_i and t'_i, with processing times equal to 2^i M and 2^i M + s_i respectively. We set J_1^(k) = t_k for all 1 ≤ k ≤ n, and J_1^(k) = t'_{k−n} for all n + 1 ≤ k ≤ 2n. We set K = (Σ_i t_i + Σ_i t'_i + 4)/2. To complete the construction, for any k, 1 ≤ k ≤ 2n, the organization O^(k) also has a job J_2^(k) with processing time equal to K.
Fig. 4. Reduction of MOSP(Σ C_i) from Partition: (a) initial instance; (b) optimum
This construction is performed in polynomial time, and we prove that it is a reduction.

First, assume that {s1, s2, . . . , sn} can be partitioned into two disjoint sets with equal sums, i.e., there is a set J ⊆ I with ∑_{i∈J} si = ∑_{i∈I\J} si. We construct a valid schedule with optimal global makespan for MOSP(∑Ci). For each si, if i ∈ J, we schedule job ti in organization O^(N) and job ti′ in organization O^(N−1); otherwise, we schedule ti in O^(N−1) and ti′ in O^(N). The problem constraints impose that organizations O^(N−1) and O^(N) first schedule their own jobs (two jobs of size 1). The remaining jobs are then scheduled in non-decreasing order of processing time, using the Shortest Processing Time first (SPT) rule. This schedule respects MOSP's constraint of not increasing an organization's average completion time, because each job is delayed by at most its own size (by construction, the sum of all jobs scheduled before a given job is smaller than that job's size). Cmax^(N−1) equals 2 + ∑_i 2^i M + ∑_{i∈J} si. Since J is a partition, Cmax^(N−1) is exactly equal to Cmax^(N) = 2 + ∑_i 2^i M + ∑_{i∈I\J} si. Also, Cmax^(N) = Cmax^(N−1) = K, which matches the theoretical lower bound for Cmax.

Second, assume MOSP(∑Ci) has a solution with Cmax ≤ K. We prove that {s1, s2, . . . , sn} can be partitioned into two disjoint sets with equal sums. This solution of MOSP(∑Ci) has the structure drawn in Figure 4b. To achieve a Cmax equal to K, the scheduler must keep all jobs of size exactly K in their initial organizations. Moreover, all jobs of size 1 must also remain in their initial organizations; otherwise these jobs would be delayed. The remaining jobs (all ti and ti′ jobs) must be scheduled either in organization O^(N−1) or in O^(N). Each of these two processors must execute a total work of (2K − 4)/2 = ∑_i 2^i M + (∑_i si)/2 to achieve a makespan equal to K. Let J ⊆ I = {1, . . . , n} be such that i ∈ J iff job ti′ was scheduled on organization O^(N−1). Then O^(N−1) executes a total work of W^(N−1) = ∑_i 2^i M + ∑_{i∈J} si, which must be equal to the total work of O^(N), namely W^(N) = ∑_i 2^i M + ∑_{i∈I\J} si. Since ∑_i si < M, we have W^(N−1) ≡ ∑_{i∈J} si (mod M) and W^(N) ≡ ∑_{i∈I\J} si (mod M). This means that W^(N−1) = W^(N) ⟹ (W^(N−1) mod M) = (W^(N) mod M) ⟹ ∑_{i∈J} si = ∑_{i∈I\J} si. Hence, if MOSP(∑Ci) has a solution with Cmax ≤ K, then the set J is a solution for Partition.
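The arithmetic of this construction can be checked on a small instance. The following sketch (function names and data layout are ours, purely for illustration) builds the ti, ti′ job sizes for a Partition instance and verifies that a balanced split makes both O^(N−1) and O^(N) finish exactly at K:

```python
# Sketch: verify the arithmetic of the Partition -> MOSP(sum Ci) reduction
# on a small instance. Helper names are illustrative, not from the paper.

def build_instance(s):
    """Given Partition integers s_1..s_n, return job sizes t_i, t'_i and K."""
    n = len(s)
    M = sum(s) + 1                                       # any M > sum(s) works
    t = [2**i * M for i in range(1, n + 1)]              # jobs t_i
    t_prime = [2**i * M + s[i - 1] for i in range(1, n + 1)]  # jobs t'_i
    W = sum(t) + sum(t_prime) + 4                        # total work, incl. four unit jobs
    K = W // 2                                           # W is even whenever sum(s) is
    return t, t_prime, K

def makespans(s, J):
    """Makespans of O(N-1) and O(N) when, for i in J, t_i goes to O(N) and
    t'_i to O(N-1), and vice versa otherwise. Each of the two organizations
    first runs its own two unit jobs (total 2)."""
    t, t_prime, _ = build_instance(s)
    n = len(s)
    load_a = 2 + sum(t_prime[i] if (i + 1) in J else t[i] for i in range(n))
    load_b = 2 + sum(t[i] if (i + 1) in J else t_prime[i] for i in range(n))
    return load_a, load_b

s = [3, 1, 2, 2]                 # partitionable: {3, 1} vs {2, 2}
t, t_prime, K = build_instance(s)
a, b = makespans(s, J={1, 2})    # J = {1, 2}: s_1 + s_2 = 4 = s_3 + s_4
print(a == b == K)               # a balanced split reaches the lower bound K
```

An unbalanced choice of J makes the two loads differ, so the makespan K is missed, mirroring the "only if" direction of the proof.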
J. Cohen et al.

4 Algorithms
In this section, we present three different heuristics to solve MOSP(Cmax) and MOSP(∑Ci). All of these algorithms have the additional property of respecting the selfishness restrictions.

4.1 Iterative Load Balancing Algorithm
The Iterative Load Balancing Algorithm (ILBA) [13] is a heuristic that redistributes the load of the most loaded organizations. The idea is to rebalance the load incrementally, without delaying any job: the least loaded organizations are rebalanced first, and then, one by one, each remaining organization has its load rebalanced. The heuristic works as follows. First, each organization schedules its own jobs locally, and the organizations are enumerated by non-decreasing makespan, i.e., Cmax^(1) ≤ Cmax^(2) ≤ . . . ≤ Cmax^(N). Then, for k = 2 to N, the jobs of O^(k) are rescheduled sequentially, each being assigned to the least loaded of the organizations O^(1), . . . , O^(k). Each job is rescheduled by ILBA either earlier than, or at the same time as, it was scheduled before the migration. In other words, no job is delayed by ILBA, which guarantees that the local constraint is respected both for MOSP(Cmax) and for MOSP(∑Ci).
4.2 LPT-LPT and SPT-LPT Heuristics
We developed and evaluated (see Section 5) two new heuristics based on the classical LPT (Longest Processing Time First [6]) and SPT (Shortest Processing Time First [2]) algorithms, for solving MOSP(Cmax) and MOSP(∑Ci) respectively. Both heuristics work in two phases. During the first phase, all organizations minimize their own local objectives: each organization applies LPT to its own jobs if it is interested in minimizing its own makespan, or SPT if it is interested in its own average completion time. In the second phase, all organizations cooperatively minimize the makespan of the entire grid computing system without worsening any local objective. This phase works as follows: each time an organization becomes idle, i.e., it finishes the execution of all jobs assigned to it, the longest job that has not yet started is migrated to and executed by the idle organization. This greedy algorithm works like a global LPT, always choosing the longest job yet to be executed among the jobs of all organizations.
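The two phases can be sketched as a small event-driven simulation, again assuming one machine per organization (all names are illustrative):

```python
import heapq

# Sketch of the two-phase LPT-LPT heuristic on single-machine organizations.

def lpt_lpt(orgs):
    """orgs: one list of job sizes per (single-machine) organization.
    Returns the global makespan Cmax."""
    queues = [sorted(jobs, reverse=True) for jobs in orgs]  # phase 1: local LPT
    heads = [0] * len(orgs)                      # next unstarted job per queue
    events = [(0, k) for k in range(len(orgs))]  # (time machine k is free, k)
    heapq.heapify(events)
    cmax = 0
    while events:
        t, k = heapq.heappop(events)
        if heads[k] < len(queues[k]):
            src = k                              # run own next job
        else:
            # Phase 2: the idle organization takes the longest unstarted job.
            cands = [i for i in range(len(orgs)) if heads[i] < len(queues[i])]
            if not cands:
                cmax = max(cmax, t)              # machine retires
                continue
            src = max(cands, key=lambda i: queues[i][heads[i]])
        p = queues[src][heads[src]]
        heads[src] += 1
        heapq.heappush(events, (t + p, k))
    return cmax

print(lpt_lpt([[9], [2, 2, 2]]))  # 9
```

The second organization finishes its three unit-2 jobs at time 6 and finds nothing left to steal, so the long local job determines Cmax = 9.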
4.3 Analysis
ILBA, LPT-LPT and SPT-LPT do not delay any of the jobs when compared to the initial local schedule. During the rebalancing phase, all jobs either remain in their original organization or are migrated to an organization that became idle at a preceding time. The implications are:
– The selfishness restriction is respected: if a job is migrated, it starts before the completion time of the last job of its initial organization.
– If the organizations' local objective is to minimize the makespan, migrating a job to an earlier moment in time decreases the job's completion time and, as a consequence, does not increase the initial makespan of the organization.
– If the organizations' local objective is to minimize the average completion time, migrating a job from its initial organization to an organization that became idle at an earlier moment in time decreases the completion time of all jobs of the initial organization as well as of the migrated job. This means that the ∑Ci of the jobs of the initial organization always decreases.
– The rebalancing phase of all three algorithms works like a list scheduling algorithm; Graham's classical approximation ratio 2 − 1/N for list scheduling [6] therefore holds for all of them.

We recall from Section 3.2 that no algorithm respecting the selfishness restrictions can achieve an approximation ratio better than 2 for MOSP(Cmax). Since all our algorithms achieve an approximation ratio of 2, no further improvement is possible without removing the selfishness restrictions.
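Graham's bound can be illustrated with the classical near-tight instance for list scheduling on identical machines (a self-contained sketch; the multi-organization setting adds the selfishness constraints, which this toy example omits):

```python
# Sketch: the classical list-scheduling instance that approaches the
# 2 - 1/m bound on m identical machines.

def list_schedule(jobs, m):
    """Greedy list scheduling: each job goes to the least loaded machine."""
    loads = [0] * m
    for p in jobs:
        i = min(range(m), key=loads.__getitem__)
        loads[i] += p
    return max(loads)

m = 3
jobs = [1] * (m * (m - 1)) + [m]   # six unit jobs, then one long job
cmax = list_schedule(jobs, m)      # greedy runs the long job last
opt = m                            # optimum: long job alone, units on the others
print(cmax / opt)                  # 5/3 = 2 - 1/m
```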
5 Experiments
We conducted a series of simulations comparing ILBA, LPT-LPT, and SPT-LPT under various experimental settings. The workload was randomly generated with parameters matching the typical environment found in academic grid computing systems [13]. We evaluated the algorithms on instances containing a random number of machines, organizations, and jobs of different sizes. In our tests, the number of initial jobs in each organization follows a Zipf distribution with exponent 1.4267, which best models virtual organizations in real-world grid computing systems [7]. We are interested in the improvement of the global Cmax provided by the different algorithms. The results are evaluated by comparing the Cmax obtained by the algorithms with the well-known theoretical lower bound for the scheduling problem without constraints, LB = max( (∑_{i,k} p_i^(k)) / (∑_k m^(k)) , max_{i,k} p_i^(k) ), where p_i^(k) is the processing time of the i-th job initially in O^(k) and m^(k) is the number of machines of O^(k).
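For reference, this lower bound is straightforward to compute (the function name is ours; for simplicity all machines are treated as one pool):

```python
# Sketch: LB = max(total work / number of machines, largest job),
# the classical lower bound used to normalize the experimental results.

def lower_bound(jobs, total_machines):
    """jobs: processing times of all jobs across all organizations."""
    return max(sum(jobs) / total_machines, max(jobs))

print(lower_bound([5, 4, 4, 10, 2], total_machines=3))  # 10
```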
Our main conclusion is that, although the selfishness restrictions are respected by all heuristics, ILBA and LPT-LPT obtain near-optimal results in most cases. This is not unusual, since it follows the experimental behavior of standard list scheduling algorithms, for which it is easy to obtain a near-optimal schedule when the number of tasks grows large. SPT-LPT produces worse results due to the effect of applying SPT locally. However, in some particular cases, in which the number of jobs is not much larger than the number of available machines, the experiments yield more interesting results. Figure 5 shows the histograms for a representative instance of such a particular case. The histograms show the frequency of the ratio of the obtained Cmax to the lower bound, over 5000 different instances with 20 organizations and 100 jobs, for ILBA, LPT-LPT, and SPT-LPT. Similar results have been obtained for
Fig. 5. Frequency of results obtained by (a) ILBA, (b) LPT-LPT, and (c) SPT-LPT when the results are not always near optimal
many different sets of parameters. LPT-LPT outperforms ILBA (and SPT-LPT) for most instances and its average ratio to the lower bound is less than 1.3.
6 Concluding Remarks
In this paper, we have investigated scheduling on multi-organization platforms. We presented the MOSP(Cmax) problem from the literature and extended it to a new related problem, MOSP(∑Ci), with another local objective. In each case we studied how to improve the global makespan while guaranteeing that no organization worsens its own results. We first showed that both versions of the problem, MOSP(Cmax) and MOSP(∑Ci), are NP-hard. Furthermore, we introduced the concept of selfishness in these problems, which corresponds to additional scheduling restrictions designed to reduce the incentive for the organizations to cheat locally and disrupt the global schedule. We proved that no algorithm respecting the selfishness restrictions can achieve an approximation ratio better than 2 for MOSP(Cmax).

Two new scheduling algorithms were proposed, namely LPT-LPT and SPT-LPT, in addition to ILBA from the literature. All these algorithms are list scheduling algorithms, and thus achieve a 2-approximation. We provided an in-depth analysis of these algorithms, showing that all of them respect the selfishness restrictions. Finally, all these algorithms were implemented and analysed through experimental simulations. The results show that our new LPT-LPT outperforms ILBA, and that all algorithms exhibit near-optimal performance when the number of jobs becomes large.

Future research directions will focus more on game theory. We intend to study schedules in the case where several organizations secretly cooperate to cheat the central authority.
References

1. Baker, B.S., Coffman Jr., E.G., Rivest, R.L.: Orthogonal packings in two dimensions. SIAM Journal on Computing 9(4), 846–855 (1980)
2. Bruno, J.L., Coffman Jr., E.G., Sethi, R.: Scheduling independent tasks to reduce mean finishing time. Communications of the ACM 17(7), 382–387 (1974)
3. Caragiannis, I., Flammini, M., Kaklamanis, C., Kanellopoulos, P., Moscardelli, L.: Tight bounds for selfish and greedy load balancing. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 311–322. Springer, Heidelberg (2006)
4. Even-Dar, E., Kesselman, A., Mansour, Y.: Convergence time to Nash equilibria. ACM Transactions on Algorithms 3(3), 32 (2007)
5. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (January 1979)
6. Graham, R.L.: Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics 17(2), 416–429 (1969)
7. Iosup, A., Dumitrescu, C., Epema, D., Li, H., Wolters, L.: How are real grids used? The analysis of four grid traces and its implications. In: 7th IEEE/ACM International Conference on Grid Computing, pp. 262–269 (September 2006)
8. Jansen, K., Otte, C.: Approximation algorithms for multiple strip packing. In: Bampis, E., Jansen, K. (eds.) WAOA 2009. LNCS, vol. 5893, pp. 37–48. Springer, Heidelberg (2010)
9. Koutsoupias, E., Papadimitriou, C.: Worst-case equilibria. In: Meinel, C., Tison, S. (eds.) STACS 1999. LNCS, vol. 1563, pp. 404–413. Springer, Heidelberg (1999)
10. Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V.: Algorithmic Game Theory. Cambridge University Press, Cambridge (September 2007)
11. Ooshita, F., Izumi, T., Izumi, T.: A generalized multi-organization scheduling on unrelated parallel machines. In: International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 26–33. IEEE Computer Society, Los Alamitos (December 2009)
12. Pascual, F., Rzadca, K., Trystram, D.: Cooperation in multi-organization scheduling. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 224–233. Springer, Heidelberg (August 2007)
13. Pascual, F., Rzadca, K., Trystram, D.: Cooperation in multi-organization scheduling. Concurrency and Computation: Practice & Experience 21(7), 905–921 (2009)
14. Schwiegelshohn, U., Tchernykh, A., Yahyapour, R.: Online scheduling in grids. In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–10 (April 2008)
15. Ye, D., Han, X., Zhang, G.: On-line multiple-strip packing. In: Proceedings of the 3rd International Conference on Combinatorial Optimization and Applications (June 2009). LNCS, vol. 5573, pp. 155–165. Springer, Heidelberg (2009)
16. Zhuk, S.N.: Approximate algorithms to pack rectangles into several strips. Discrete Mathematics and Applications 16(1), 73–85 (2006)
Area-Maximizing Schedules for Series-Parallel DAGs

Gennaro Cordasco¹ and Arnold L. Rosenberg²,*

¹ University of Salerno, ISISLab, Dipartimento di Informatica ed Applicazioni "R.M. Capocelli", Fisciano 84084, Italy
[email protected]
² Colorado State University, Electrical and Computer Engineering, Fort Collins, CO 81523, USA
[email protected]
Abstract. Earlier work introduced a new optimization goal for DAG schedules: the "AREA" of the schedule. AREA-maximizing schedules are intended for computational environments, such as Internet-based computing and massively multicore computers, that benefit from DAG-schedules that produce execution-eligible tasks as fast as possible. The earlier study of AREA-maximizing schedules showed how to craft such schedules efficiently for DAGs that have the structure of trees and other, less well-known, families of DAGs. The current paper extends the earlier work by showing how to efficiently craft AREA-maximizing schedules for series-parallel DAGs, a family that arises, e.g., in multi-threaded computations. The tools that produce the schedules for series-parallel DAGs promise to apply also to other large families of computationally significant DAGs.
1 Introduction

Many modern computing platforms, such as the Internet and massively multicore architectures, have characteristics that are not addressed by traditional strategies¹ for scheduling DAGs, i.e., computations having inter-task dependencies that constrain the order of executing tasks. This issue is discussed at length in, e.g., [24], where the seeds of the Internet-based computing scheduling (IC-scheduling) paradigm are planted. IC-scheduling strives to meet the needs of the new platforms by crafting schedules that execute DAGs in a manner that renders new tasks eligible for execution at the maximal possible rate. The paradigm thereby aims to: (a) enhance the utilization of computational resources, by always having work to allocate to an available client/processor; (b) lessen the likelihood of a computation's stalling pending completion of already-allocated tasks. Significant progress in [5,6,8,21,24,25] has extended the capabilities of IC-scheduling so that it can now optimally schedule a wide range of computationally significant DAGs (cf. [4]). Moreover, simulations using DAGs that arise in real scientific computations [20], as well as structurally similar artificial DAGs [13], suggest that IC-schedules can have substantial computational benefits over schedules produced by a

* Research supported in part by US NSF Grant CNS-0905399.
¹ Many traditional DAG-scheduling strategies are discussed and compared in [12,18,22].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 380–392, 2010.
© Springer-Verlag Berlin Heidelberg 2010
range of common heuristics. However, it has been known since [21] that many significant classes of DAGs do not admit schedules that are optimal within the framework of IC-scheduling. The current authors have responded to this fact with a relaxed version of IC-scheduling under which every DAG admits an optimal schedule [7]. This relaxed strategy strives to maximize the average rate at which new tasks are rendered eligible for execution. For reasons that become clear in Section 2, we call the new optimization metric for schedules the AREA of the schedule: the goal is an AREA-maximizing schedule for a DAG (an AM-schedule, for short). The study in [7] derived many basic properties of AM-schedules. Notable among these are: (1) Every DAG admits an AM-schedule. (2) AM-scheduling subsumes the goal of IC-scheduling, in the following sense. If a DAG G admits an optimal IC-schedule, then: (a) that schedule is an AM-schedule; (b) every AM-schedule is optimal under IC-scheduling. Thus, we never lose scheduling quality by focusing on achieving an AM-schedule rather than an IC-optimal schedule.

Our Contribution. The major algorithmic results of [7] show how to craft AM-schedules efficiently for DAGs that have the structure of a monotonic tree, as well as for other, less common, families of DAGs. The current paper extends this contribution by showing how to efficiently craft an AM-schedule for any series-parallel DAG (SP-DAG, for short). SP-DAGs have a regularity of structure that makes them algorithmically advantageous in a broad range of applications; cf. [15,23,27]. Most relevant to our study: (1) SP-DAGs are the natural abstraction of many significant classes of computations, including divide-and-conquer algorithms. (2) SP-DAGs admit efficient schedules in parallel computing systems such as CILK [1] that employ a multi-threaded computational paradigm.
(3) Arbitrary DAGs can be efficiently (in linear time) reformulated as SP-DAGs with little loss in degree of parallelism [15,23]. Our contribution addresses two facets of the existing work on DAG-scheduling: (1) the work on CILK and kindred multi-threaded systems [1,2,3] employs performance metrics that are not relevant in the computational environments that we target (as discussed earlier); (2) many SP-DAGs do not admit optimal schedules under IC-scheduling [6,21].

Related Work. Problems related to scheduling DAGs on parallel/distributed computing systems have been studied for decades. Most versions of these problems are known to be NP-hard [10], except when scheduling special classes of DAGs (cf. [11]). This has led both researchers and practitioners to seek efficient scheduling heuristics that seem to perform well in practice (cf. [14,26]). Among the interesting attempts to understand such heuristics are the comparisons in [12,18,22] and the taxonomy in [19]. Despite the disparate approaches to scheduling in sources such as those cited, virtually every algorithm/heuristic that predates IC-scheduling shares one central characteristic: they all rely on knowing (almost) exact times for each computer to execute each task and to communicate with a collaborating computer. The central premise underlying IC- and AM-scheduling is that, within many modern computing environments, one cannot even approximate such knowledge reliably. This premise is shared by sources such as [16,17], whose approaches to scheduling for IC platforms admit margins of error in time estimates of 50% or more. Indeed, IC- and AM-scheduling are analytical proposals for what to do when accurate estimates are out of the question.
2 Background

Computation-DAGs and schedules. We study computations that are described by DAGs. Each DAG G has a set VG of nodes, each representing a task, and a set AG of (directed) arcs, each representing an intertask dependency. For an arc (u → v) ∈ AG:
• task v cannot be executed until task u is;
• u is a parent of v, and v is a child of u in G.
A parentless node is a source; a childless node is a target. G is connected if it is so when one ignores arc orientations. When VG1 ∩ VG2 = ∅, the sum G1 + G2 of DAGs G1 and G2 is the DAG with node-set VG1 ∪ VG2 and arc-set AG1 ∪ AG2. When one executes a DAG G, a node v ∈ VG becomes ELIGIBLE (for execution) only after all of its parents have been executed. Note that all of G's sources are ELIGIBLE at the beginning of an execution; the goal is to render all of G's targets ELIGIBLE. Informally, a schedule Σ for G is a rule for selecting which ELIGIBLE node to execute at each step of an execution of G; formally, Σ is a topological sort of G, i.e., a linearization of VG under which all arcs point from left to right (cf. [9]). We do not allow recomputation of nodes/tasks, so a node loses its ELIGIBLE status once it is executed. In compensation, after v ∈ VG has been executed, there may be new nodes that are rendered ELIGIBLE; this occurs when v is their last parent to be executed. We measure the quality of a schedule Σ by the rate at which Σ renders nodes of G ELIGIBLE. Toward this end, we define², for k ∈ [1, |VG|], the quantities EΣ(k) and eΣ(k): EΣ(k) is the number of nodes of G that are ELIGIBLE after Σ has executed k nodes, and eΣ(k) is the number of nodes (perforce, nonsources) of G that are rendered ELIGIBLE by Σ's k-th node-execution. (We measure time in an event-driven manner, as the number of nodes that have been executed thus far, so we often refer to "step k" rather than "Σ's k-th node-execution.")

The AREA metric for schedules.
Focus on a DAG G = (VG, AG) with n nontargets, N nonsources, s sources, and S targets (note: NG =def |VG| = s + N = S + n). The quality of a schedule Σ for G at step t is given by the size of EΣ(t): the larger, the better. The goal of IC-scheduling is to execute G's nodes in an order that maximizes quality at every step t ∈ [1, NG] of the execution. A schedule Σ* that achieves this demanding goal is IC-optimal; formally,

(∀t ∈ [1, NG])  EΣ*(t) = max_{Σ a schedule for G} {EΣ(t)}.

The AREA of a schedule Σ for G, AREA(Σ), is the sum

AREA(Σ) =def EΣ(0) + EΣ(1) + · · · + EΣ(NG).    (2.1)

The normalized AREA, E(Σ) =def AREA(Σ) ÷ NG, is the average number of nodes that are ELIGIBLE while Σ executes G. The term "area" arises by formal analogy with Riemann sums as approximations to integrals. The goal of the scheduling paradigm we develop here is to find an AM schedule for G, i.e., a schedule Σ′ such that

AREA(Σ′) = max_{Σ a schedule for G} AREA(Σ).

² [a, b] denotes the set of integers {a, a + 1, . . . , b}.
We have discovered in [7] a number of ways to simplify the quest for AM schedules. First, we can restrict the form of such schedules.

Lemma 1. [7] Altering a schedule Σ for a DAG G so that it executes all of G's nontargets before any of its targets cannot decrease Σ's AREA.

Hence, we can streamline the analysis by ignoring targets. Second, we can alter the AREA metric in certain ways: the only portion of AREA(Σ) that actually depends on choices made by Σ is

area(Σ) =def ∑_{t=1}^{n} ∑_{j=1}^{t} eΣ(j) = n·eΣ(1) + (n−1)·eΣ(2) + · · · + eΣ(n).    (2.2)
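The quantities EΣ(k), eΣ(k), AREA (2.1) and area (2.2) can be computed directly for a small DAG; the sketch below uses our own parent-set representation, not notation from the paper:

```python
# Sketch: E, e, AREA (2.1) and area (2.2) for a schedule of a small DAG,
# with each node mapped to its set of parents.

def profiles(parents, schedule):
    """parents: dict node -> set of parents; schedule: a topological order.
    Returns (E, e): E[k] = #ELIGIBLE nodes after k executions,
    e[k] = #nodes rendered ELIGIBLE by the k-th execution."""
    done = set()
    def eligible():
        return sum(1 for v in parents
                   if v not in done and parents[v] <= done)
    E = [eligible()]                  # E[0]: exactly the sources
    e = []
    for v in schedule:
        done.add(v)
        E.append(eligible())
        e.append(E[-1] - (E[-2] - 1))  # executed node leaves, e new ones arrive
    return E, e

# Diamond DAG: s -> a, s -> b, a -> t, b -> t.
parents = {'s': set(), 'a': {'s'}, 'b': {'s'}, 't': {'a', 'b'}}
E, e = profiles(parents, ['s', 'a', 'b', 't'])
AREA = sum(E)                                   # Eq. (2.1)
area = sum((3 - k) * e[k] for k in range(3))    # Eq. (2.2), n = 3 nontargets
print(E, e, AREA, area)
```

For the diamond, executing s renders both a and b ELIGIBLE at once, which is what the profile e records.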
The eligibility profile associated with schedule Σ for G is the n-entry vector Π(Σ) = ⟨eΣ(1), eΣ(2), . . . , eΣ(n)⟩. Our view of schedules as sequences of nontarget nodes allows us to talk about subschedules of Σ, which are contiguous subsequences of Σ. Each subschedule Φ delimits a (not necessarily connected) subDAG GΦ of G: VGΦ is the subset of VG comprising the nodes that appear in Φ plus all the nodes of VG that become ELIGIBLE due to the execution of Φ, and AGΦ contains precisely those arcs of AG that have both ends in VGΦ.

Let G be a DAG that admits a schedule Σ, and let Φ be a subschedule of Σ. The following abbreviations hopefully enhance legibility.
– For any sequence Φ, including (sub)schedules, ℓΦ = |Φ| denotes the length of Φ. Note that EΦ(k) and eΦ(k) are defined for k ∈ [1, ℓΦ].
– Π(Φ) denotes GΦ's eligibility profile: Π(Φ) = ⟨eΦ(1), . . . , eΦ(ℓΦ)⟩ = ⟨eΣ(i), . . . , eΣ(j)⟩ for some i, j ∈ [1, n] with ℓΦ = j − i + 1.
– SUM(Φ, a, b) = ∑_{i=a}^{b} eΦ(i), and SUM(Φ) = SUM(Φ, 1, ℓΦ).
For disjoint subschedules Φ and Ψ of Σ, denote by (Φ · Ψ) the subschedule of Σ that concatenates Φ and Ψ. Thus, (Φ · Ψ) has ℓΦ + ℓΨ elements; it first executes all nodes of Φ, in the same order as Φ, and then executes all nodes of Ψ, in the same order as Ψ.

Series-parallel DAGs (SP-DAGs). A (2-terminal) series-parallel DAG G (SP-DAG, for short) is produced by a sequence of the following operations (cf. Fig. 1):
1. Create. Form a DAG G that has: (a) two nodes, a source s and a target t, which are jointly G's terminals; (b) one arc, (s → t), directed from s to t.
2. Compose SP-DAGs G′, with terminals s′, t′, and G′′, with terminals s′′, t′′.
(a) Parallel composition. Form G = G′ ⇑ G′′ from G′ and G′′ by identifying/merging s′ with s′′, to form a new source s, and t′ with t′′, to form a new target t.
(b) Series composition. Form G = (G′ → G′′) from G′ and G′′ by identifying/merging t′ with s′′. G has the single source s′ and the single target t′′.
One can use examples from [6] to craft SP-DAGs that do not admit optimal IC-schedules.
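The three SP-DAG-forming operations can be sketched directly on node/arc sets (node naming and data layout are ours, purely for illustration):

```python
import itertools

# Sketch of the SP-DAG operations from Section 2: create, series
# composition, and parallel composition, on explicit arc sets.

_ids = itertools.count()

def create():
    """A single-arc DAG (s -> t)."""
    s, t = next(_ids), next(_ids)
    return {'s': s, 't': t, 'arcs': {(s, t)}}

def series(g1, g2):
    """(G' -> G''): identify/merge t' with s''."""
    ren = {g2['s']: g1['t']}
    arcs = g1['arcs'] | {(ren.get(u, u), ren.get(v, v)) for u, v in g2['arcs']}
    return {'s': g1['s'], 't': g2['t'], 'arcs': arcs}

def parallel(g1, g2):
    """(G' ⇑ G''): merge the two sources and the two targets."""
    ren = {g2['s']: g1['s'], g2['t']: g1['t']}
    arcs = g1['arcs'] | {(ren.get(u, u), ren.get(v, v)) for u, v in g2['arcs']}
    return {'s': g1['s'], 't': g1['t'], 'arcs': arcs}

g = parallel(series(create(), create()), create())  # a path of two arcs, in parallel with one arc
nodes = {u for a in g['arcs'] for u in a}
print(len(nodes), len(g['arcs']))  # 3 3
```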
Fig. 1. Compositions of SP-DAGs
3 Maximizing Area for Series-Parallel DAGs

Theorem 1. There exists an algorithm ASP-DAG that finds, in time O(n²), an AM-schedule for any n-node SP-DAG.

The remainder of the section develops Algorithm ASP-DAG, to prove Theorem 1.

3.1 The Idea Behind Algorithm ASP-DAG

The problem of recognizing when a given DAG is an SP-DAG is classic within the area of algorithm design. A decomposition-based algorithm in [27], which solves the problem in linear time, supplies the basic idea underlying Algorithm ASP-DAG. We illustrate how with the sample SP-DAG G in Fig. 2(left). The fact that G is an SP-DAG can be verified with the help of a binary decomposition tree TG whose structure illustrates the sequence of series and parallel compositions that form G from a set of nodes; hence, TG ultimately takes one back to the definition of SP-DAGs in Section 2. Fig. 2(right) depicts TG for the G of Fig. 2(left). The leaves of TG are single-arc DAGs; each of its internal nodes represents the SP-DAG obtained by composing (in series or in parallel) its two children. Importantly for the design of Algorithm ASP-DAG, one can use the algorithm of [27] to construct TG from a given DAG G: the construction succeeds just when G is an SP-DAG. Algorithm ASP-DAG uses TG to design an AM-schedule for G inductively.

3.2 Algorithm ASP-DAG's Inductive Approach

Given an SP-DAG G, Algorithm ASP-DAG first constructs TG. It then designs an AM-schedule for G by exploiting the structure of TG.

A. Leaf-DAGs of TG. Each leaf-DAG is a single-arc DAG, i.e., a series composition of degenerate one-node DAGs. Each such DAG has the form (s → t), hence admits a unique schedule (execute s; then execute t) which, perforce, is an AM-schedule.
Fig. 2. An example of the decomposition of SP-DAGs
B. Series-Composing Internal Nodes of TG. Focus on a node of TG that represents the series composition (G′ → G′′) of disjoint SP-DAGs G′ and G′′.

Lemma 2. If the disjoint SP-DAGs G′ and G′′ admit, respectively, AM-schedules Σ′ and Σ′′, then the schedule (Σ′ · Σ′′) is AM for the series composition (G′ → G′′).

Proof. Let G′ have N′ nonsources, and let G′′ have n′′ nontargets. By definition of ELIGIBILITY, every node of G′ must be executed before any node of G′′, so we need focus only on schedules for (G′ → G′′) of the form (Σ1 · Σ2), where Σ1 is a schedule for G′ and Σ2 is a schedule for G′′. We then have

area(Σ1 · Σ2) = area(Σ1) + area(Σ2) + ℓΣ2 · SUM(Σ1) = area(Σ1) + area(Σ2) + n′′ N′.

Thus, choosing schedules Σ1 and Σ2 that maximize both area(Σ1) and area(Σ2) maximizes area(Σ1 · Σ2). The lemma follows.

C. Parallel-Composing Internal Nodes of TG. Focus on a node of TG that represents the parallel composition (G′ ⇑ G′′) of disjoint SP-DAGs G′ and G′′. We present an algorithm that crafts an AM-schedule Σ for (G′ ⇑ G′′) from AM-schedules Σ′ and Σ′′ for G′ and G′′.

Algorithm SP-AREA
Input: SP-DAGs G′ and G′′ and respective AM-schedules Σ′ and Σ′′.
Output: AM-schedule Σ for G = (G′ ⇑ G′′).

G′ and G′′ are disjoint within (G′ ⇑ G′′), except for their shared source s and target t. Because every schedule for (G′ ⇑ G′′) begins by executing s and finishes by executing t, we can focus only on how to schedule the sum G1 + G2 obtained by removing both s and t from (G′ ⇑ G′′); here G1 and G2 are not-necessarily-connected subDAGs of, respectively, G′ and G′′. (The parallel composition in Fig. 1 illustrates that G1 and/or G2 can be disconnected: removing s and t in the figure disconnects the lefthand DAG.)
(1) Construct the average-eligibility profile AVG(Σ′) from Π(Σ′) as follows. If Π(Σ′) = ⟨eΣ′(1), . . . , eΣ′(n′)⟩, then AVG(Σ′) = ⟨aΣ′(1), . . . , aΣ′(n′)⟩, where, for k ∈ [1, n′],

aΣ′(k) = (1/k) ∑_{i=1}^{k} eΣ′(i).

Similarly, construct the average-eligibility profile AVG(Σ′′) from Π(Σ′′).

(2) (a) Let j′ be the smallest index of AVG(Σ′) whose value is maximum within the profile AVG(Σ′). Segregate the subsequence of Σ′ comprising elements 1, . . . , j′, to form an (indivisible) block of nodes of Σ′ with average eligibility value (AEV) aΣ′(j′). Perform a similar analysis for Σ′′ to determine the value j′′ and the associated block of nodes of Σ′′ with AEV aΣ′′(j′′).
(b) Repeat procedure (a) for Σ′, using indices from j′ + 1 to n′, collecting blocks, until we find a block that ends with the last-executed node of Σ′. Do the analogous repetition for Σ′′.

After procedure (a)-then-(b), each of Σ′ and Σ′′ is decomposed into a sequence of blocks, plus the associated sequence of AEVs. We claim that the following schedule Σ for (G′ ⇑ G′′) is AM.

Σ: 1. Execute s.
2. Merge the blocks of Σ′ and Σ′′ in nonincreasing order of AEV. (Blocks are kept intact; ties in AEV are broken arbitrarily.)
3. Execute t.

Claim. Σ is a valid schedule for (G′ ⇑ G′′). This is obvious because: (1) Σ keeps the blocks of both Σ′ and Σ′′ intact; (2) Σ incorporates the blocks of Σ′ (resp., Σ′′) in their order within Σ′ (resp., Σ′′).

Claim. Σ is an AM schedule for (G′ ⇑ G′′). Before verifying Σ's optimality, we digress to establish two technical lemmas.

Lemma 3. Let Σ be a schedule for a DAG G, and let Φ and Ψ be disjoint subschedules of Σ.³ Then area(Φ · Ψ) > area(Ψ · Φ) iff SUM(Φ)/ℓΦ > SUM(Ψ)/ℓΨ.

Proof. Invoking (2.2), we find that

area(Φ · Ψ) = (ℓΦ + ℓΨ)eΦ(1) + (ℓΦ + ℓΨ − 1)eΦ(2) + · · · + (ℓΨ + 1)eΦ(ℓΦ)
+ ℓΨ eΨ(1) + (ℓΨ − 1)eΨ(2) + · · · + eΨ(ℓΨ);
area(Ψ · Φ) = (ℓΦ + ℓΨ)eΨ(1) + (ℓΦ + ℓΨ − 1)eΨ(2) + · · · + (ℓΦ + 1)eΨ(ℓΨ)
+ ℓΦ eΦ(1) + (ℓΦ − 1)eΦ(2) + · · · + eΦ(ℓΦ).

Therefore, area(Φ · Ψ) − area(Ψ · Φ) = ℓΨ · SUM(Φ) − ℓΦ · SUM(Ψ). The result now follows by elementary calculation.

Lemma 4. Let Σ be a schedule for a DAG G, and let Φ, Ψ, Γ, and Δ be four mutually disjoint subschedules of Σ. Then area(Φ · Ψ) > area(Ψ · Φ) iff area(Γ · Φ · Ψ · Δ) > area(Γ · Ψ · Φ · Δ).

³ The disjointness of Φ and Ψ ensures that both (Φ · Ψ) and (Ψ · Φ) are subschedules of Σ.
The import of Lemma 4 is: one cannot change the area-ordering of the two concatenations of Φ and Ψ by appending the same fixed subschedule (Γ) before the concatenations and/or appending the same fixed subschedule (Δ) after them.

Proof. The following two differences have the same sign: area(Φ · Ψ) − area(Ψ · Φ) and area(Γ · Φ · Ψ · Δ) − area(Γ · Ψ · Φ · Δ). To wit:

area(Γ · Φ · Ψ · Δ) = (ℓΓ + ℓΦ + ℓΨ + ℓΔ)eΓ(1) + · · · + (1 + ℓΦ + ℓΨ + ℓΔ)eΓ(ℓΓ)
+ (ℓΦ + ℓΨ + ℓΔ)eΦ(1) + · · · + (1 + ℓΨ + ℓΔ)eΦ(ℓΦ)
+ (ℓΨ + ℓΔ)eΨ(1) + · · · + (1 + ℓΔ)eΨ(ℓΨ)
+ ℓΔ eΔ(1) + · · · + eΔ(ℓΔ);
area(Γ · Ψ · Φ · Δ) = (ℓΓ + ℓΨ + ℓΦ + ℓΔ)eΓ(1) + · · · + (1 + ℓΨ + ℓΦ + ℓΔ)eΓ(ℓΓ)
+ (ℓΨ + ℓΦ + ℓΔ)eΨ(1) + · · · + (1 + ℓΦ + ℓΔ)eΨ(ℓΨ)
+ (ℓΦ + ℓΔ)eΦ(1) + · · · + (1 + ℓΔ)eΦ(ℓΦ)
+ ℓΔ eΔ(1) + · · · + eΔ(ℓΔ).

Elementary calculation now shows that area(Γ · Φ · Ψ · Δ) − area(Γ · Ψ · Φ · Δ) = area(Φ · Ψ) − area(Ψ · Φ).
The Optimality of Schedule Σ. We focus only on step 2 of Σ because, as noted earlier, every schedule for (G′ ⇑ G′′) begins by executing s and ends by executing t. We validate each of the salient properties of Algorithm SP-AREA.

Lemma 5. The nodes inside each block determined by Algorithm SP-AREA cannot be rearranged in any AM-schedule for (G′ ⇑ G′′).

Proof. We proceed by induction on the structure of the decomposition tree TG of G = (G′ ⇑ G′′). The base of the induction is when TG consists of a single leaf-node. The lemma is trivially true in this case, for the structure of the resulting block is mandated by the direction of the arc in the node. (In any schedule for the DAG (s → t), s must be computed before t.) The lemma is trivially true also when G is formed solely via parallel compositions. To wit, for any such DAG, removing the source and target leaves a set of independent nodes. The resulting eligibility profile is, therefore, a sequence of 0s: each block has size 1 and AEV 0. This means that every valid schedule is AM.

In general, as we decompose G: (a) The subDAG schedule needed to process a parallel composition (G1 ⇑ G2) does not generate any blocks other than those generated by the schedules for G1 and G2. (b) The subDAG schedule needed to process a series composition can generate new blocks. To wit, consider SP-DAGs G1 and G2, with respective schedules Σ1 and Σ2, so that (Σ1 · Σ2) is a schedule for the series composition (G1 → G2). Say that (Σ1 · Σ2) generates consecutive blocks, B1 from Σ1 and B2 from Σ2, where B1's AEV is smaller than B2's. Then blocks B1 and B2 are merged, via concatenation, by (Σ1 · Σ2) into the block B1 · B2. For instance, the indicated scenario occurs when B1 is the last block of Σ1 and B2 is the first block of Σ2. Assuming, for induction, that the nodes in each of B1 and B2 cannot be rearranged, we see that the nodes inside B1 · B2 cannot be reordered.
That is, every node of G₁ (including those in B₁) must be executed before any node of G₂ (including those in B₂).
388
G. Cordasco and A.L. Rosenberg
Lemma 5 tells us that the nodes inside each block generated by Algorithm SP-AREA cannot be reordered. We show now that blocks also cannot be subdivided. Note that this will also preclude merging blocks in ways that violate blocks' indivisibility.

Lemma 6. If a schedule Σ for a DAG (G′ ⇑ G″) subdivides a block (as generated by Algorithm SP-AREA), then Σ is not AM.

Proof. Assume, for contradiction, that there exists an AM schedule Σ for G = (G′ ⇑ G″) that subdivides some block A of Σ into two blocks, B and C. (Our choice of Σ here clearly loses no generality.) This subdivision means that, within Σ, B and C are separated by some sequence D. We claim that Σ's AREA-maximality implies that AEV(C) > AEV(B). Indeed, by construction, we know that

(∀j ∈ [1, A′−1])   (1/A′)·SUM(A, 1, A′) > (1/j)·SUM(A, 1, j).

Therefore, we have

(1/A′)·SUM(A, 1, A′) > (1/B′)·SUM(A, 1, B′),

which means that

B′ · Σ_{i=1}^{A′} e_A(i) > A′ · Σ_{i=1}^{B′} e_A(i).

It follows, noting that C′ = A′ − B′, that

B′ · Σ_{i=B′+1}^{A′} e_A(i) > (A′ − B′) · Σ_{i=1}^{B′} e_A(i).

This, finally, implies that AEV(C) > AEV(B). Now we are in trouble: either of the following inequalities,

(a) [AEV(D) < AEV(C)]   or   (b) [AEV(D) ≥ AEV(C)]      (3.3)
would allow us to increase Σ's AREA, thereby contradicting Σ's alleged AREA-maximality! If inequality (3.3(a)) held, then Lemmas 3 and 4 would allow us to increase Σ's AREA by interchanging blocks D and C. If inequality (3.3(b)) held, then because [AEV(C) > AEV(B)], we would have [AEV(D) > AEV(B)]. But then Lemmas 3 and 4 would allow us to increase Σ's AREA by interchanging blocks B and D. We conclude that schedule Σ cannot exist.

We summarize the results of this section in the following result.

Theorem 2. Let G′ and G″ be disjoint SP-DAGs that, respectively, admit AM-schedules Σ′ and Σ″, and have n′ and n″ nontargets. If we know the block decompositions of Σ′ and Σ″, ordered by AEV, then Algorithm SP-AREA determines, within time O(n′ + n″), an AM-schedule for G = (G′ ⇑ G″).

Proof. By Lemma 6, any AM schedule for G must be a permutation of the blocks obtained by decomposing Σ′ and Σ″. Assume, for contradiction, that some AM schedule Σ* is not obtained by selecting blocks in decreasing order of AEV. There would then exist (at least) two blocks, A and B, that appear consecutively in Σ*, such that
Area-Maximizing Schedules for Series-Parallel DAGs
389
AEV(A) < AEV(B). We would then have SUM(A)/A′ < SUM(B)/B′, so, by Lemma 3, area(A · B) > area(B · A). Let C (resp., D) be the concatenation of all blocks that come before A (resp., after B) in Σ*. An invocation of Lemma 4 shows that Σ* cannot be AM, contrary to assumption.

Timing: Because the blocks obtained by the decomposition are already ordered, completing G's schedule requires only merging two ordered sequences of blocks, which can be accomplished in linear time.

Observation 1. It is worth noting that Algorithm ASP-DAG's decomposition phase, as described in Section 3.2, may require time quadratic in the size of G. However, we use Algorithm SP-AREA in a recursive manner, so the decomposition for an SP-DAG G is simplified by the algorithm's (recursive) access to the decompositions of G's subDAGs.
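The inequality chain in the proof of Lemma 6 can be sanity-checked by brute force. The sketch below (illustrative, not from the paper) enumerates small eligibility profiles A whose full prefix has the strictly maximal average, which is the defining property of a block, and confirms that every split of A into a prefix B and a suffix C satisfies AEV(C) > AEV(B):

```python
from fractions import Fraction
from itertools import product

def aev(e):
    # average eligibility value of a profile segment
    return Fraction(sum(e), len(e))

for e in product(range(4), repeat=5):
    # block property: the whole profile's average strictly exceeds
    # the average of every proper prefix
    if all(aev(e[:j]) < aev(e) for j in range(1, len(e))):
        for b in range(1, len(e)):
            # every split prefix/suffix obeys AEV(C) > AEV(B)
            assert aev(e[b:]) > aev(e[:b])
```

Exact rational arithmetic (`fractions.Fraction`) avoids any floating-point ambiguity in the comparisons.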
Fig. 3. An example of composition
An Example. The DAG G of Fig. 3 (center) is the parallel composition of the DAGs of Fig. 3 (left). (G appears at the root of T_G in Fig. 2.) By removing nodes s and t from G, we obtain the disjoint DAGs G′ and G″ of Fig. 3 (right). We note the AM-schedules: Σ′ = ⟨a, b, c, d, e, f, g, h, i⟩ for G′; Σ″ = ⟨k, l, m, n, o, p, q, r⟩ for G″. We use G′ and G″ to illustrate Algorithm SP-AREA.

For schedule Σ′: Π(Σ′) = ⟨2, 2, 2, 0, 1, 0, 1, 0, 1⟩, and the average eligibility profile is AVG(Σ′) = ⟨2, 2, 2, 3/2, 7/5, 7/6, 8/7, 1, 1⟩. The position of the maximum element of AVG(Σ′) is 3 (choosing the rightmost element in case of ties), so the first block is ⟨a, b, c⟩, with AEV 2. Continuing: the new eligibility profile is ⟨0, 1, 0, 1, 0, 1⟩, and the average eligibility profile is ⟨0, 1/2, 1/3, 1/2, 2/5, 1/2⟩. The maximum is the last element, so the new block is ⟨d, e, f, g, h, i⟩, with AEV 1/2.

For schedule Σ″: Π(Σ″) = ⟨2, 0, 1, 4, 0, 0, 0, 1⟩, and AVG(Σ″) = ⟨2, 1, 1, 7/4, 7/5, 7/6, 1, 1⟩. The maximum is at the first element, so the first block is ⟨k⟩, with AEV 2. The next eligibility profile is ⟨0, 1, 4, 0, 0, 0, 1⟩, and the average eligibility profile is ⟨0, 1/2, 5/3, 5/4, 1, 5/6, 1⟩, which has its maximum at the 3rd element. The new block is ⟨l, m, n⟩, with AEV 5/3. The new average eligibility profile is ⟨0, 0, 0, 1/4⟩, so the last block is ⟨o, p, q, r⟩, with AEV 1/4.

Thus, the two schedules are split into five blocks:
Σ′ Blocks              AEV        Σ″ Blocks        AEV
⟨a, b, c⟩              2          ⟨k⟩              2
⟨d, e, f, g, h, i⟩     1/2        ⟨l, m, n⟩        5/3
                                  ⟨o, p, q, r⟩     1/4

The AM-schedule obtained by ordering the blocks in order of decreasing AEV is ⟨a, b, c, k, l, m, n, d, e, f, g, h, i, o, p, q, r⟩.
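The per-schedule block decomposition and the final AEV-ordered merge can be rendered directly in code. The sketch below uses hypothetical helper names (`decompose`, `merge`) and exact rationals so that the rightmost-tie rule is unambiguous; it reproduces the example just worked:

```python
from fractions import Fraction

def decompose(nodes, profile):
    # split a schedule's eligibility profile into blocks: repeatedly
    # strip the prefix with maximal average eligibility, choosing the
    # rightmost position in case of ties (>= keeps updating on equality)
    blocks = []
    while profile:
        total, best_len, best_avg = 0, 0, None
        for j, x in enumerate(profile, 1):
            total += x
            avg = Fraction(total, j)
            if best_avg is None or avg >= best_avg:
                best_avg, best_len = avg, j
        blocks.append((nodes[:best_len], best_avg))
        nodes, profile = nodes[best_len:], profile[best_len:]
    return blocks

def merge(b1, b2):
    # merge two AEV-ordered block lists by decreasing AEV
    # (the first list wins ties, as in the worked example)
    out, i, j = [], 0, 0
    while i < len(b1) and j < len(b2):
        if b1[i][1] >= b2[j][1]:
            out.append(b1[i]); i += 1
        else:
            out.append(b2[j]); j += 1
    return out + b1[i:] + b2[j:]

b1 = decompose(list("abcdefghi"), [2, 2, 2, 0, 1, 0, 1, 0, 1])   # Sigma'
b2 = decompose(list("klmnopqr"), [2, 0, 1, 4, 0, 0, 0, 1])       # Sigma''
schedule = [v for block, _ in merge(b1, b2) for v in block]
```

With these inputs, `b1` yields the blocks ⟨a, b, c⟩ (AEV 2) and ⟨d, e, f, g, h, i⟩ (AEV 1/2), `b2` yields ⟨k⟩ (AEV 2), ⟨l, m, n⟩ (AEV 5/3), and ⟨o, p, q, r⟩ (AEV 1/4), and `schedule` spells out a, b, c, k, l, m, n, d, e, f, g, h, i, o, p, q, r, matching the table above.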
3.3 The Timing of Algorithm ASP-DAG

Let G be an N-node SP-DAG, and let T(N) be the time that Algorithm ASP-DAG takes to find an AM-schedule Σ for G, plus the time it takes to decompose Σ into (indivisible) ordered blocks. Let us see what goes into T(N).

(1) Algorithm ASP-DAG invokes [27] to decompose G in time O(N).

(2) The algorithm finds an AM-schedule Σ for G by recursively unrolling G's decomposition tree T_G.

(2.a) If G is a series composition (G′ → G″) of an n-node SP-DAG G′ and an (N−n+1)-node SP-DAG G″, then, by induction, T(N) = T(N−n+1) + T(n) + O(N) for some n ∈ [2, N−1]. (The "extra" node is the merged target of G′ and source of G″.) The algorithm: (i) recursively generates AM-schedules, Σ′ for G′ and Σ″ for G″, in time T(n) + T(N−n+1); (ii) generates Σ by concatenating Σ′ and Σ″ in time O(N); (iii) decomposes Σ into blocks in time O(N), using the O(N) blocks obtained while generating Σ′ and Σ″.

(2.b) If G is a parallel composition (G′ ⇑ G″) of an (n+1)-node SP-DAG G′ and an (N−n+1)-node SP-DAG G″, then, by induction, T(N) = T(N−n+1) + T(n+1) + O(N), for some n ∈ [2, N−2]. (The two "extra" nodes are the merged sources and targets of G′ and G″.) The algorithm: (i) recursively generates AM-schedules, Σ′ for G′ and Σ″ for G″, in time T(n+1) + T(N−n+1); (ii) generates Σ in time O(N) using Algorithm SP-AREA; (iii) decomposes Σ into blocks in time O(N), by taking the union of the blocks obtained while generating Σ′ and Σ″. (Recall that the parallel composition does not generate additional blocks.)

Overall, then, T(N) = O(N²), as claimed.
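The claim T(N) = O(N²) can be checked against the worst-case form of the recurrence. The memoized sketch below is an illustration only: the additive constant is set to 1 and the base cases are chosen arbitrarily, and it maximises over all split points n as in case (2.a):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(N):
    # worst case of T(N) = T(N - n + 1) + T(n) + c*N over n in [2, N-1], c = 1
    if N <= 2:
        return N
    return max(T(N - n + 1) + T(n) + N for n in range(2, N))

# quadratic upper bound holds, and growth is clearly super-linear
assert all(T(N) <= N * N for N in range(2, 60))
assert T(50) > 50 * 25
```

The maximum is attained at the extreme split n = 2, so the recurrence behaves like T(N) = T(N−1) + Θ(N), which is Θ(N²), consistent with the text.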
4 Conclusion

In furtherance of our extension of IC-Scheduling [4,5,6,8,21] to a scheduling paradigm that applies to all DAGs, we have expanded our initial study [7] of AREA-maximizing (AM) DAG-scheduling so that we can now find optimal schedules for all series-parallel DAGs efficiently. We are in the process of studying whether AM-scheduling shares the computational benefits of IC-Scheduling [13,20]. We are also seeking (possibly heuristic) algorithms that will allow us to find, provably efficiently, schedules that are (approximately) AREA-maximizing for arbitrary DAGs.
References

1. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. In: 5th ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, PPoPP 1995 (1995)
2. Blumofe, R.D., Leiserson, C.E.: Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 202–229 (1998)
3. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46, 720–748 (1999)
4. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Applying IC-scheduling theory to some familiar computations. In: Wkshp. on Large-Scale, Volatile Desktop Grids, PCGrid 2007 (2007)
5. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Advances in IC-scheduling theory: scheduling expansive and reductive DAGs and scheduling DAGs via duality. IEEE Trans. Parallel and Distributed Systems 18, 1607–1617 (2007)
6. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Extending IC-scheduling via the Sweep algorithm. J. Parallel and Distributed Computing 70, 201–211 (2010)
7. Cordasco, G., Rosenberg, A.L.: On scheduling DAGs to maximize area. In: 23rd IEEE Int. Symp. on Parallel and Distributed Processing, IPDPS 2009 (2009)
8. Cordasco, G., Rosenberg, A.L., Sims, M.: Accommodating heterogeneity in IC-scheduling via task fattening. In: On clustering tasks in IC-optimal DAGs, 37th Intl. Conf. on Parallel Processing, ICPP 2008 (2008) (submitted for publication)
9. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press, Cambridge (1999)
10. Garey, M.R., Johnson, D.S.: Computers and Intractability. W.H. Freeman and Co., San Francisco (1979)
11. Gao, L.-X., Rosenberg, A.L., Sitaraman, R.K.: Optimal clustering of tree-sweep computations for high-latency parallel environments. IEEE Trans. Parallel and Distributed Systems 10, 813–824 (1999)
12. Gerasoulis, A., Yang, T.: A comparison of clustering heuristics for scheduling DAGs on multiprocessors. J. Parallel and Distributed Computing 16, 276–291 (1992)
13. Hall, R., Rosenberg, A.L., Venkataramani, A.: A comparison of DAG-scheduling strategies for Internet-based computing. In: Intl. Parallel and Distr. Processing Symp. (2007)
14. Hwang, K., Xu, Z.: Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, New York (1998)
15. Jayasena, S., Ganesh, S.: Conversion of NSP DAGs to SP DAGs. MIT Course Notes 6.895 (2003)
16. Kondo, D., Casanova, H., Wing, E., Berman, F.: Models and scheduling mechanisms for global computing applications. In: Intl. Parallel and Distr. Processing Symp. (2002)
17. Korpela, E., Werthimer, D., Anderson, D., Cobb, J., Lebofsky, M.: SETI@home: massively distributed computing for SETI. In: Dubois, P.F. (ed.) Computing in Science and Engineering. IEEE Computer Soc. Press, Los Alamitos (2000)
18. Kwok, Y.-K., Ahmad, I.: Benchmarking and comparison of the task graph scheduling algorithms. J. Parallel and Distributed Computing 59, 381–422 (1999)
19. Kwok, Y.-K., Ahmad, I.: Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys 31, 406–471 (1999)
20. Malewicz, G., Foster, I., Rosenberg, A.L., Wilde, M.: A tool for prioritizing DAGMan jobs and its evaluation. J. Grid Computing 5, 197–212 (2007)
21. Malewicz, G., Rosenberg, A.L., Yurkewych, M.: Toward a theory for scheduling dags in Internet-based computing. IEEE Trans. Comput. 55, 757–768 (2006)
22. McCreary, C.L., Khan, A.A., Thompson, J., Mcardle, M.E.: A comparison of heuristics for scheduling DAGs on multiprocessors. In: 8th Intl. Parallel Processing Symp., pp. 446–451 (1994)
23. Mitchell, M.: Creating minimal vertex series parallel graphs from directed acyclic graphs. In: 2004 Australasian Symp. on Information Visualisation, vol. 35, pp. 133–139 (2004)
24. Rosenberg, A.L.: On scheduling mesh-structured computations for Internet-based computing. IEEE Trans. Comput. 53, 1176–1186 (2004)
25. Rosenberg, A.L., Yurkewych, M.: Guidelines for scheduling some common computation-DAGs for Internet-based computing. IEEE Trans. Comput. 54, 428–438 (2005)
26. Sarkar, V.: Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge (1989)
27. Valdes, J., Tarjan, R.E., Lawler, E.L.: The recognition of series-parallel digraphs. SIAM J. Comput. 11, 289–313 (1982)
Parallel Selection by Regular Sampling

Alexander Tiskin

Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK
Abstract. Bulk-synchronous parallelism (BSP) is a simple and efficient paradigm for parallel algorithm design and analysis. In this paper, we present a new simple deterministic BSP algorithm for the classical problem of selecting the k-th smallest element from an array of size n, for a given k, on a parallel computer with p processors. Our algorithm is based on the technique of regular sampling. It runs in optimal O(n/p) local computation and communication, and near-optimal O(log log p) synchronisation. The algorithm is of theoretical interest, as it gives an improvement in the asymptotic synchronisation cost over its predecessors. It is also simple enough to be implementable.
1 Introduction
The selection problem is a classical problem in computer science.

Definition 1. Given an array of size n, and an integer k, 0 ≤ k < n, the selection problem asks for the array's element with rank k (i.e. the k-th smallest element, counting from 0).

Without loss of generality, we assume that all array elements are distinct. In this paper, we restrict ourselves to comparison-based selection, where the only primitive operations allowed on the elements are pairwise comparisons, with possible outcomes "greater than" or "less than". The selection problem is closely related to sorting. Indeed, the naive solution to the selection problem consists in sorting the array, and then indexing the required element. Using an efficient sorting algorithm, such as mergesort, the selection problem can therefore be solved in time O(n log n). However, it is well-known that, in contrast to sorting, this time bound for the selection problem is not optimal, and linear time is achievable. The first selection algorithm running in time O(n) was given by Blum et al. [4]. Further constant-factor improvements were obtained by Schönhage et al. [17], and by Dor and Zwick [8]; see also a survey by Paterson [15].

In this paper, we consider the selection problem in a coarse-grained parallel computation model, such as BSP or its variant CGM. Once again, the naive solution to the selection problem on a coarse-grained parallel computer consists in sorting the array, and then indexing the required element. Using an efficient sorting algorithm, such as parallel sorting by regular sampling (PSRS) by Shi and Schaeffer [18] (see also [20]), the selection problem can be solved deterministically in local computation O((n log n)/p), communication O(n/p), and synchronisation O(1), on a parallel computer with p processors. Taking into account the

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 393–399, 2010. © Springer-Verlag Berlin Heidelberg 2010
cost of sharing the input of size O(n) among p processors, the communication cost O(n/p) is clearly optimal. The synchronisation cost O(1) is also trivially optimal. However, the local computation cost is not optimal, due to the existence of linear-time sequential selection algorithms. A natural question arises: can the local computation cost of coarse-grained parallel selection be reduced to the optimal O(n/p), while keeping the asymptotic optimality in communication and synchronisation?

A step towards this goal was made by Ishimizu et al. [12], who gave a coarse-grained parallel selection algorithm running in optimal local computation and communication O(n/p), and in synchronisation that is essentially O(log p). Fujiwara et al. [9] improved the synchronisation cost to O(min(log p, log log n)). They also proposed another algorithm, with suboptimal local computation cost O((n log p)/p) (which is still better than the one achievable by naive sorting), and the optimal synchronisation cost O(1). Gerbessiotis and Siniolakis [10] gave a randomised BSP algorithm, running in optimal local computation and communication O(n/p), and optimal synchronisation O(1), with high probability. An optimal selection algorithm in the PRAM model was given by Han [11]. There have also been significant developments in the area of experimental evaluation of selection algorithms. Various practical parallel selection algorithms have been proposed by Al-Furiah et al. [1], Saukas and Song [16], Bader [2], and Cafaro et al. [5].

In this paper, we present a new simple deterministic BSP algorithm for selection. Our algorithm is based on the technique of regular sampling. It runs in optimal O(n/p) local computation and communication, and in near-optimal O(log log p) synchronisation. Our algorithm is of theoretical interest, as it gives an improvement in the asymptotic synchronisation cost over its predecessors. It is also simple enough to be implementable.
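The linear-time sequential selection of Blum et al. [4], which later serves as a black-box building block, can be sketched as the classical median-of-medians procedure. The version below is illustrative only (it is not the implementation used in the paper, and it assumes distinct elements):

```python
def select(a, k):
    # deterministic selection: return the element of rank k (0-indexed)
    # of the list a, in worst-case linear time via median-of-medians
    if len(a) <= 5:
        return sorted(a)[k]
    # medians of groups of 5, then the median of those medians as pivot
    groups = [sorted(a[i:i + 5]) for i in range(0, len(a), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = select(medians, len(medians) // 2)
    lo = [x for x in a if x < pivot]
    hi = [x for x in a if x > pivot]
    if k < len(lo):
        return select(lo, k)
    if k == len(lo):
        return pivot
    return select(hi, k - len(lo) - 1)
```

For distinct elements, `select(a, k)` agrees with `sorted(a)[k]` for every valid k; the groups-of-5 pivot guarantees that each recursive call eliminates a constant fraction of the data.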
Throughout the paper, we ignore small irregularities arising from imperfect matching between integer parameters. This allows us to avoid overloading our notation with floor and ceiling functions, which have to be assumed implicitly wherever necessary.
2 The BSP Model
The model of bulk-synchronous parallel (BSP) computation [22,13,3] provides a simple and practical framework for general-purpose parallel computing. Its main goal is to support the creation of architecture-independent and scalable parallel software. Key features of BSP are its treatment of the communication medium as an abstract fully connected network, and strict separation of all interaction between processors into point-to-point asynchronous data communication and barrier synchronisation. This separation allows an explicit and independent cost analysis of local computation, communication and synchronisation. A BSP computer contains
• p processors; each processor has a local memory and is capable of performing an elementary operation or a local memory access every time unit;
• a communication network, capable of accepting a word of data from every processor, and delivering a word of data to every processor, every g time units;
• a barrier synchronisation mechanism, capable of synchronising all processors every l time units.

The processors may follow different threads of computation, and have no means of synchronising with one another between the global barriers. It will be convenient to consider a version of the BSP model equipped with an external memory, which serves as the source of the input and the destination for the output, and can also be used for intermediate data. Algorithms designed for this model can easily be translated to the traditional distributed-memory setting; see [20] for details.

A BSP computation is a sequence of supersteps. The processors are synchronised between supersteps; the computation within a superstep is completely asynchronous. Consider a superstep in which every processor performs up to w local operations, sends up to h_out words of data, and receives up to h_in words of data. We call w the local computation cost, and h = h_out + h_in the communication cost of the superstep. The total superstep cost is defined as w + h·g + l, where the communication gap g and the latency l are parameters of the network defined above. For a computation comprising S supersteps with local computation costs w_s and communication costs h_s, 1 ≤ s ≤ S, the total cost is W + H·g + S·l, where

• W = Σ_{s=1}^{S} w_s is the total local computation cost;
• H = Σ_{s=1}^{S} h_s is the total communication cost;
• S is the synchronisation cost.

The values of W, H and S typically depend on the number of processors p and on the problem size. The original definition of BSP does not account for memory as a limited resource. However, the model can be easily extended by an extra parameter m, representing the maximum capacity of each processor's local memory. Note that this approach also limits the amount of communication allowed within a superstep: h ≤ m. One of the early examples of memory-sensitive BSP algorithm design is given in [14]. An alternative approach to reflecting memory cost is given by the model CGM, proposed in [7]. A CGM is essentially a memory-restricted BSP computer, where memory capacity and maximum superstep communication are determined by the size of the input/output: h ≤ m = O((input + output)/p). A large number of algorithms have been developed for the CGM; see e.g. [19]. In order to utilise the computer resources efficiently, a typical BSP program regards the values p, g and l as configuration parameters. Algorithm design should aim to minimise local computation, communication and synchronisation costs for any realistic values of these parameters.
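The cost accounting above is a straightforward sum; a minimal sketch (the function name and the example superstep figures are illustrative, not from the paper):

```python
def bsp_cost(supersteps, g, l):
    # supersteps: list of (w, h_out, h_in) triples, one per superstep;
    # total cost = W + H*g + S*l per the BSP cost model above
    W = sum(w for w, _, _ in supersteps)
    H = sum(h_out + h_in for _, h_out, h_in in supersteps)
    S = len(supersteps)
    return W + H * g + S * l

# e.g. two supersteps on a machine with gap g = 2 and latency l = 30:
# W = 150, H = 35, S = 2, so the total is 150 + 35*2 + 2*30 = 280
cost = bsp_cost([(100, 10, 5), (50, 0, 20)], g=2, l=30)
```

Note how a high latency l penalises algorithms with many supersteps, which is exactly why the paper optimises the synchronisation cost S.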
3 The Algorithm
A typical approach to selection involves partitioning the input array into subarrays, sampling these subarrays, and then using (a subset of) the obtained samples as splitters for the elimination of array elements. For sequential selection, it is sufficient to choose suitable constants for the subarray size and the sampling frequency: for example, one can use subarrays of size 5, take subarray medians as samples, and take the median of samples as the single splitter. Each elimination stage reduces the amount of data by a constant factor, and therefore the overall data reduction rate is exponential.

In a coarse-grained parallel setting, a natural subarray size is n/p; we assume that this value is sufficiently high. As before, we could reduce the amount of data by a constant factor in every stage by taking a suitable constant for the sampling frequency. However, we can do better by varying the sampling frequency. Generally, as the remaining data become more and more sparse, we can afford more and more frequent sampling of the data. This accelerates the data reduction rate to super-exponential.

Algorithm 1 (Selection)
Parameter: integer k, 0 ≤ k < n.
Input: array a of size n in external memory; we assume n ≥ p^{3/2}.
Output: the value of the element of a with rank k.
Description.
First phase. The array is gradually reduced in size, by eliminating elements in repeated rounds of regular sampling. Let N = 2n. Consider a round in which m = N/r elements remain in the global array; for instance, in the first round we have m = N/2, and therefore r = 2. Each processor reads from the external memory a local subarray of size m/p. Then, by repeated application of a sequential selection algorithm, each processor selects from its local subarray a set of 2r^{1/2} + 1 samples spaced at regular intervals, inclusive of the two boundary elements of the subarray. Then, all the (2r^{1/2} + 1)p samples are collected in a designated processor.
The designated processor sorts the array of samples, and then selects from it a subset of 2r^{1/2} + 1 splitters spaced at regular intervals, inclusive of the two boundary elements (i.e. the minimum and the maximum sample). The whole set of splitters is then broadcast across the processors, using an efficient broadcasting algorithm, such as e.g. two-phase broadcast [3]. For each splitter, a processor finds its local rank within that processor's subarray. These local ranks are then collected in a designated processor, which adds them up to obtain the global rank of each splitter within the global array.

Consider two adjacent splitters a−, a+, such that the global rank of a− (respectively, a+) is below (respectively, above) k. We call the subset of all array elements that fall (inclusively) between splitters a− and a+ the bucket. Note that the bucket contains (2r^{1/2} + 1)p / (2r^{1/2} + 1) = p samples. Crucially, the bucket also contains the element of global rank k, which is required as the algorithm's output.
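One round of the first phase can be simulated sequentially, modelling the p local subarrays as lists. The sketch below is illustrative only: it sorts each subarray outright where the algorithm uses repeated sequential selection, computes global ranks centrally rather than by summing local ranks, and rounds the sample count, in the spirit of the paper's remark on integer irregularities. It checks the crucial invariant that the bucket contains the element of global rank k:

```python
import math, random

def regular(sorted_vals, s):
    # s regularly spaced picks from a sorted list, including both ends
    n = len(sorted_vals)
    return [sorted_vals[round(i * (n - 1) / (s - 1))] for i in range(s)]

def one_round(subarrays, k, r):
    s = math.ceil(2 * math.sqrt(r)) + 1          # ~ 2*r^(1/2) + 1
    samples = sorted(x for sub in subarrays for x in regular(sorted(sub), s))
    splitters = regular(samples, s)              # includes min and max sample
    everything = [x for sub in subarrays for x in sub]
    rank = {sp: sum(x < sp for x in everything) for sp in splitters}
    lo = max(sp for sp in splitters if rank[sp] <= k)   # splitter a-
    hi = min(sp for sp in splitters if rank[sp] >= k)   # splitter a+
    bucket = [x for x in everything if lo <= x <= hi]
    return bucket, k - rank[lo]                  # new array, new target rank

random.seed(1)
a = random.sample(range(10**6), 240)             # distinct elements
subs = [a[i::4] for i in range(4)]               # p = 4 local subarrays
bucket, k2 = one_round(subs, k=100, r=2)
assert sorted(bucket)[k2] == sorted(a)[100]      # rank-k element survives
```

The final assertion is guaranteed: the minimum and maximum splitters have global ranks 0 and n−1, so splitters bracketing rank k always exist, and exactly rank(a−) elements lie below the bucket, which is why the target rank is rebased to k − rank(a−).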
The bucket boundary splitters a−, a+ are then broadcast across the processors. Each processor eliminates all the elements of its local subarray that fall outside the bucket, i.e. are either below a− or above a+. Then, a designated processor collects the sizes of the remaining local subarrays, allocates for each processor a range of locations of appropriate size in the external memory, and communicates to each processor the address of its allocated areas. The processors write their remaining local subarrays into the allocated areas. The bucket, which has now been collected in the external memory, replaces the global array in the subsequent round. To reflect that, we update the value of m by setting it to the size of the bucket, and we let r = N/m. We also update the value of k by subtracting from it the global rank of a−. The round is completed. Rounds are performed repeatedly, until the size of the array is reduced to n/p.

Second phase. A designated processor reads the remaining array of size n/p from the external memory, and finds the element of rank k by a sequential selection algorithm.

Cost analysis
First phase. Consider a particular round where, as before, there are m = N/r remaining elements. The communication cost of each processor reading its local subarray from the external memory is O(m/p) = O(n/(rp)). The local computation cost of each processor selecting the regular samples is (2r^{1/2} + 1) · O(m/p) = O(n/(r^{1/2} p)). For each processor, let us call the subset of its local elements that fall between two (locally) adjacent samples a block (note that since the local subarray is not sorted, blocks are in general not physically contiguous). We have 2r^{1/2} blocks per processor, and therefore 2r^{1/2} p blocks overall, each containing m/(2r^{1/2} p) elements (including or excluding the boundaries as appropriate). Consider a block defined by a pair of samples b−, b+.
We call this block low (respectively, high) if the block's lower boundary b− is non-strictly below (respectively, strictly above) the bucket's lower boundary a−. A low block has a non-empty intersection with the bucket if and only if it contains the bucket's lower boundary a−: b− < a− ≤ b+. Since, for each processor, all its blocks are disjoint, at most one of them can contain the value a−. Therefore, across all processors, there can be at most p low blocks intersecting the bucket. A high block has a non-empty intersection with the bucket if and only if its lower boundary b− is contained within the bucket: a− ≤ b− < a+. Each of the p samples contained in the bucket can be a lower boundary sample for at most one high block. Therefore, across all processors, there can be at most p high blocks intersecting the bucket. In total, there can be at most p + p = 2p blocks having a non-empty intersection with the bucket. Hence, the number of elements in the bucket is at most 2p · m/(2r^{1/2} p) = m/r^{1/2} = N/r^{3/2}. Since all the elements outside the bucket are eliminated, the maximum possible fraction (relative to N) of remaining elements gets thus raised
to the power of 3/2 in every round. After log_{3/2} log_2 p = O(log log p) rounds, the number of remaining elements is at most 2^{−(3/2)^{log_{3/2} log_2 p}} · N = N / 2^{log_2 p} = N/p = O(n/p).

The local computation and communication costs decrease super-exponentially in every round, and are therefore dominated by the respective costs in the first round, equal to O(n/p). The total number of rounds, and therefore the synchronisation cost, is O(log log p).

Second phase. The local computation and communication costs are O(n/p), and the synchronisation cost is O(1).

Total. The overall resource costs are dominated by the first phase. Therefore, the local computation and communication costs are O(n/p), and the synchronisation cost is O(log log p).
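The round count can be illustrated numerically. With the remaining fraction 1/r updated by r → r^{3/2} each round (starting from r = 2), the number of rounds until at most N/p elements remain grows doubly logarithmically in p. An illustrative sketch (the function name is hypothetical):

```python
def rounds_needed(p):
    # rounds until the remaining fraction 1/r of the N elements
    # drops to at most 1/p, under the per-round update r -> r^(3/2)
    r, rounds = 2.0, 0
    while r < p:
        r, rounds = r ** 1.5, rounds + 1
    return rounds
```

For example, `rounds_needed(2**16)` returns 7, which matches ⌈log_{3/2} log_2 2^16⌉ = ⌈log_{3/2} 16⌉ = 7, and quadrupling the exponent of p (p = 2^64) only raises the count to 11.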
It should be noted that the resource costs of Algorithm 1 are controlled mainly by the frequency of initial sampling in every round. In contrast, the frequency of choosing splitters among the samples is relatively less important, and allows for substantial freedom. In particular, we could even choose all the samples as splitters. Instead, we chose the minimum possible number of splitters, thus stressing the similarity with the PSRS algorithm of [18] (see also [20]). Also note that the assumption of external memory allows us to perform implicit load balancing between the rounds. If the algorithm were expressed in the standard distributed-memory version of BSP, then explicit load balancing within every round would be required.
4 Conclusions
We have given a deterministic BSP algorithm for selection, running in the optimal local computation and communication O(n/p), and synchronisation O(log log p). It remains an open problem whether the synchronisation cost can be reduced to the optimal O(1) without increasing the asymptotic local computation and communication costs, and without randomisation.

In the context of coarse-grained parallel computation, regular sampling has been used previously for the problems of sorting [18,20] and computing 2D and 3D convex hulls [21]. In the current paper, we have applied this technique to the problem of selection. We expect that further applications of regular sampling are possible, and conclude that this powerful technique should be an essential part of the parallel algorithm design toolkit.
References

1. Al-Furiah, I., Aluru, S., Goil, S., Ranka, S.: Practical algorithms for selection on coarse-grained parallel computers. IEEE Transactions on Parallel and Distributed Systems 8(8), 813–824 (1997)
2. Bader, D.A.: An improved, randomized algorithm for parallel selection with an experimental study. Journal of Parallel and Distributed Computing 64(9), 1051–1059 (2004)
3. Bisseling, R.H.: Parallel Scientific Computation: A structured approach using BSP and MPI. Oxford University Press, Oxford (2004)
4. Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. Journal of Computer and System Sciences 7(4), 448–461 (1973)
5. Cafaro, M., De Bene, V., Aloisio, G.: Deterministic parallel selection algorithms on coarse-grained multicomputers. Concurrency and Computation: Practice and Experience 21(18), 2336–2354 (2009)
6. Corrêa, R., et al. (eds.): Models for Parallel and Distributed Computation: Theory, Algorithmic Techniques and Applications, Applied Optimization, vol. 67. Kluwer Academic Publishers, Dordrecht (2002)
7. Dehne, F., Fabri, A., Rau-Chaplin, A.: Scalable parallel computational geometry for coarse grained multicomputers. International Journal on Computational Geometry 6, 379–400 (1996)
8. Dor, D., Zwick, U.: Selecting the median. SIAM Journal on Computing 28(5), 1722–1958 (1999)
9. Fujiwara, A., Inoue, M., Masuzawa, T.: Parallel selection algorithms for CGM and BSP models with application to sorting. Transactions of Information Processing Society of Japan 41(5), 1500–1508 (2000)
10. Gerbessiotis, A.V., Siniolakis, C.J.: Architecture independent parallel selection with applications to parallel priority queues. Theoretical Computer Science 301(1-3), 119–142 (2003)
11. Han, Y.: Optimal parallel selection. In: Proceedings of ACM–SIAM SODA, pp. 1–9 (2003)
12. Ishimizu, T., Fujiwara, A., Inoue, M., Masuzawa, T., Fujiwara, H.: Parallel algorithms for selection on the BSP and BSP* models. Systems and Computers in Japan 33(12), 97–107 (2002)
13. McColl, W.F.: Scalable computing. In: van Leeuwen, J. (ed.) Computer Science Today: Recent Trends and Developments. LNCS, vol. 1000, pp. 46–61. Springer, Heidelberg (1995)
14. McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3/4), 287–297 (1999)
15. Paterson, M.: Progress in selection.
In: Karlsson, R., Lingas, A. (eds.) SWAT 1996. LNCS, vol. 1097, pp. 368–379. Springer, Heidelberg (1996)
16. Saukas, E.L.G., Song, S.W.: A note on parallel selection on coarse-grained multicomputers. Algorithmica 24(3-4), 371–380 (1999)
17. Schoenhage, A., Paterson, M., Pippenger, N.: Finding the median. Journal of Computer and System Sciences 13(2), 184–199 (1976)
18. Shi, H., Schaeffer, J.: Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing 14(4), 361–372 (1992)
19. Song, S.W.: Parallel graph algorithms for coarse-grained multicomputers. In: Corrêa, et al. (eds.) [6], pp. 147–178
20. Tiskin, A.: The bulk-synchronous parallel random access machine. Theoretical Computer Science 196(1-2), 109–130 (1998)
21. Tiskin, A.: Parallel convex hull computation by generalised regular sampling. In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 392–399. Springer, Heidelberg (2002)
22. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)
Ants in Parking Lots

Arnold L. Rosenberg

Electrical & Computer Engineering, Colorado State University, Fort Collins, CO 80523, USA
[email protected]
Abstract. Ants provide an attractive metaphor for robots that “cooperate” to perform complex tasks. This paper is a step toward understanding the algorithmic concomitants of this metaphor, the strengths and weaknesses of ant-based computation models. We study the ability of finite-state ant-robots to scalably perform a simple path-planning task called parking, within fixed, geographically constrained environments (“factory floors”). This task: (1) has each ant head for its nearest corner of the floor and (2) has all ants within a corner organize into a maximally compact formation. Even without (digital analogues of) pheromones, many initial configurations of ants can park, including: (a) a single ant situated along an edge of the floor; (b) any assemblage of ants that begins with two designated adjacent ants. In contrast, a single ant in the middle of (even a one-dimensional) floor cannot park, even with the help of (volatile digital) pheromones. Keywords: Ant-inspired robots, Finite-state machines, Path planning.
1 Introduction As we encounter novel computing environments that offer unprecedented computing power, while posing unprecedented challenges, it is compelling to seek inspiration from natural analogues of these environments. Thus, empowered with technology that enables mobile intercommunicating robotic computers, it is compelling to seek inspiration from social insects, mainly ants (because robots typically operate within a two-dimensional world), when contemplating how to employ the computers effectively and efficiently in a variety of geographical environments; indeed, many sources—see, e.g., [1,4,6,7,8,9]— have done precisely that. This paper is a step toward understanding the algorithmic concomitants of the robot-as-ant metaphor within the context of a simple, yet nontrivial, path-planning problem. Ant-robots in a “factory.” We focus on mobile robotic computers (henceforth, ants, to
stress the natural inspiration) that function within a fixed geographically constrained environment (henceforth, a factory floor [a possible application domain]) that is tessellated with identical (say, square) tiles. We expect ants to be able to:
– navigate the floor, while avoiding collisions (with obstacles and one another);
– communicate with and sense one another, by "direct contact" (as when real ants meet) and "timestamped message passing" (as when real ants deposit pheromones);
– assemble in desired locations, in desired configurations.
Although not relevant to the current study, we would also want ants to be able to discover goal objects ("food") and to convey "food" from one location to another; cf. [4,6,8,9].
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 400–411, 2010. © Springer-Verlag Berlin Heidelberg 2010
Indeed, the “parking” problem that we study might be the aftermath of the preceding activities, e.g., if one has ants convey collected “food” to designated stations. In the “standard realization” of ants studied here, ants are endowed with “intelligence” via embedded computers. These “intelligent” ants are responsible for planning and orchestrating assigned activities. In a quest to understand how a large assemblage of ants of very limited ability can “cooperate” to accomplish complex tasks, we focus on ants that are mobile finite-state machines. In particular, we want to understand what paths ants can plan in a scalable manner, i.e., without counting (beyond a fixed limit). Having ants park. We study the preceding issue by focusing on a simple, yet algorith-
mically nontrivial, path-planning task that we studied in [8] under a rather different ant-based model. This task, parking: (1) has each ant head for the nearest corner of the floor and (2) has all ants within a corner organize into a maximally compact formation (see Section 2.2 for details). While we have not yet characterized which initial configurations of ants can park successfully, we report here on progress toward this goal: – Even without using (digital analogues of) pheromones, many initial configurations of ants can park. These configurations include: • a single ant that starts anywhere along an edge of the floor (Theorem 2); • any assemblage of ants that begins with two distinguished ants that are adjacent—i.e., on tiles that share an edge or a corner (Theorem 3). – In contrast: A single ant can generally not park, even on a one-dimensional floor and even with the help of (volatile digital) pheromones (Theorem 1). The algorithmic setting. We require algorithms to be simple, scalable and decentralized.
(1) Algorithms must work on floors of arbitrary sizes. An algorithm cannot exploit information about the size of the floor; it must treat the side-length n of the floor as an unknown, never exploiting its specific value. (2) Algorithms must work with arbitrarily large collections of ants. This means, in particular, that all ants are identical; no ant has a “name” that renders it unique. (3) Algorithms must employ noncentralized coordination. All coordination among ants is achieved in a distributed, noncentralized manner, via messages that pass between ants on neighboring tiles. (4) Algorithms must be “finite-state.” All ants must execute the same program (in SPMD mode), and this program must have the very restricted form described in Section 2.1.B. These guidelines may be too costly to observe in practical systems; cf. [2,4,6].
2 Technical Background

2.1 Ant-Robots Formalized

A. Pheromone-bearing floors. Floors and connectivity (cf. [5]). The n × n floor is a square¹ mesh of tiles, n along each side (Fig. 1(a)); tiles are indexed by the set [0, n − 1] × [0, n − 1].² We represent the floor algorithmically by the n × n grid(-graph)
¹ Our study adapts to rectangular floors with little difficulty.
² N denotes the nonnegative integers. For i ∈ N and j ≥ i, [i, j] =def {i, i + 1, . . . , j}.
Fig. 1. (a) A 3 × 3 floor M3; (b) the associated grid (edges represent mated opposing arcs); (c) the 2 × 2 corner of a grid with all incident arcs
(Fig. 1(b)). Mn ambiguously denotes the mesh or the grid, according to context. Ants move along Mn's King's-move arcs, which are labeled by the compass directions. Each tile v = ⟨i, j⟩ of Mn is connected by mated in- and out-arcs to its neighbors (Fig. 1(c)).
– If i ∈ {0, n − 1} and j ∈ {0, n − 1}, then v is a corner tile and has 3 neighbors.
– If i = 0 (resp., i = n − 1) and j ∈ [1, n − 2], then v is a bottom (resp., top) tile. If j = 0 (resp., j = n − 1) and i ∈ [1, n − 2], then v is a left (resp., right) tile. These four types are collectively edge tiles; each has 5 neighbors.
– If i, j ∈ [1, n − 2], then v is an internal tile and has 8 neighbors.
Mn's regions. Mn's quadrants are its induced subgraphs³ on the following sets of tiles:

Quadrant    Name   Tile-set
SOUTHWEST   QSW    [0, n/2 − 1] × [0, n/2 − 1]
NORTHWEST   QNW    [n/2, n − 1] × [0, n/2 − 1]
SOUTHEAST   QSE    [0, n/2 − 1] × [n/2, n − 1]
NORTHEAST   QNE    [n/2, n − 1] × [n/2, n − 1]
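The tile taxonomy and the quadrant split above are easy to make concrete. The following sketch (function names and the string encodings are ours, not the paper's) classifies a tile by its King's-move neighbor count and by its home quadrant, reading n/2 as integer division:

```python
def neighbor_count(i, j, n):
    """King's-move neighbors of tile (i, j) on the n-by-n floor M_n."""
    on_row_edge = i in (0, n - 1)
    on_col_edge = j in (0, n - 1)
    if on_row_edge and on_col_edge:
        return 3   # corner tile
    if on_row_edge or on_col_edge:
        return 5   # edge tile
    return 8       # internal tile

def quadrant(i, j, n):
    """Home quadrant of tile (i, j); i is the row (south = 0), j the column."""
    half = n // 2  # the n/2 of the tile-set table, taken as integer division
    ns = "N" if i >= half else "S"
    ew = "E" if j >= half else "W"
    return "Q" + ns + ew
```

For a 5 × 5 floor, for instance, `neighbor_count(0, 0, 5)` is 3 (a corner) and `quadrant(3, 3, 5)` is `"QNE"`.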
In analogy with quadrants, which are "fenced off" by imaginary vertical and horizontal side-bisecting lines, Mn has four wedges, WN, WE, WS, WW, which are "fenced off" by imaginary diagonals that connect its corners:

Wedge   Name   Tile-set
NORTH   WN     {⟨x, y⟩ | [x ≥ y] and [x + y ≥ n − 1]}
SOUTH   WS     {⟨x, y⟩ | [x < y] and [x + y < n − 1]}
EAST    WE     {⟨x, y⟩ | [x < y] and [x + y ≥ n − 1]}
WEST    WW     {⟨x, y⟩ | [x ≥ y] and [x + y < n − 1]}
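A minimal sketch of the wedge test (our naming; ⟨x, y⟩ is row and column, as above), with a check that the strict/non-strict boundary choices give every tile exactly one wedge:

```python
from collections import Counter

def wedge(x, y, n):
    """Home wedge of tile (x, y): the two diagonal tests of the table above."""
    if x >= y:
        return "WN" if x + y >= n - 1 else "WW"
    return "WE" if x + y >= n - 1 else "WS"

# The asymmetric (>= vs <) boundaries partition the floor: every tile
# falls in exactly one wedge, so the four counts sum to n * n.
counts = Counter(wedge(x, y, 7) for x in range(7) for y in range(7))
```

For n = 7 the counts sum to 49, and, for example, the top-left corner ⟨6, 0⟩ lands in WN, matching the table.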
The asymmetries in boundaries give each tile a unique quadrant and wedge. For k ∈ [0, 2n − 2], Mn's kth diagonal is the set of tiles Δk = {⟨i, j⟩ ∈ [0, n − 1] × [0, n − 1] | i + j = k}. For d ∈ [1, 2n − 1], the radius-d quarter-sphere of QSW is the union ∪_{k=0}^{d−1} Δk.
"Virtual" pheromones. Each tile of Mn contains a fixed number c of counters; each counter i can hold an integer in the range [0, Ii]. Each counter represents a "virtual"
³ Let the directed graph G have tile-set N and arc-set A. The induced subgraph of G on a set N′ ⊆ N is the subgraph of G with tile-set N′ and all arcs from A that have both endpoints in N′.
Fig. 2. Snapshots of a pheromone of intensity I = 3 changing as FSM-ant F (the dot) moves. All snapshots have F on the center tile; unlabeled tiles have level 0. (a) F has deposited a maximum dose of pheromone on each tile that it has reached via a 2-step SE-SW path; note that the pheromone has begun to “evaporate” on the tiles that F has left. (b) F stands still for one time-step and deposits no pheromone. (c) F moves W and deposits a maximum dose of pheromone. (d) F moves S and deposits a maximum dose of pheromone. (e) F moves E and does not deposit any pheromone.
pheromone (cf. [4])—a digital realization of real ants' volatile organic compounds; each value within [0, Ii] is an intensity level of pheromone i. The number c and the ranges [0, I1], . . . , [0, Ic] are characteristics of a specific realization of the model. The volatility of real pheromones is modeled by a schedule of decrements of every pheromone counter; see Fig. 2. Every computation begins with all tiles having level 0 of every pheromone.
B. Computers and programs. Each ant contains a computer that endows it with "intelligence." Each computer possesses I/O ports that allow it to communicate with the outside world and with computers on adjacent tiles. In a single "step," a computer can:
1. detect an ant on an adjacent tile;
2. recognize Mn's four edges/sides and its four corners;
3. communicate with each neighboring computer—one on a tile that shares an edge or corner—by receiving one message and transmitting one message per time-step;
4. receive a command from the outside world.
As discussed earlier, we have embedded computers function as identical copies of a fixed, specialized finite-state machine (FSM) F; cf. [10]. We specify FSMs' state-transition functions in an algorithmically friendly fashion, as programs that are finite sets of case statements;⁴ an FSM F is, thus, specified as follows:
– F has s states, named LABEL1, . . . , LABELs.
– F responds to a fixed repertoire of inputs, each INPUTi being a combination of:
  • the messages that F receives from neighboring FSMs and/or the outside world;
  • the presence/absence of an edge/corner of Mn, a "food" item to be manipulated, an "obstacle" to be avoided;
  • the levels of intensity of the pheromones that are present on the current tile.
– F responds to the current input by
  • emitting an output from a repertoire, each OUTPUTk being a combination of:
    ∗ the messages that F sends to neighboring FSMs;
    ∗ pheromone-related actions: deposit a type-h pheromone at intensity I ≤ Ih, or enhance a type-j pheromone, increasing its intensity to I ≤ Ij;
    ∗ "food"-related actions: pick up and carry the item on the current tile, or deposit the item that F is currently carrying;
⁴ The CARPET programming environment [3,11] employs a similar programming style.
    ∗ stand still or move to an appropriate neighboring tile (one that will not cause F to "fall off" Mn);
  • changing state (by specifying the next case statement to execute).
An FSM is thus specified in a format similar to the following:

LABEL1:  if INPUT1 then OUTPUT1,1 and goto LABEL1,1
         ...
         if INPUTm then OUTPUT1,m and goto LABEL1,m
...
LABELs:  if INPUT1 then OUTPUTs,1 and goto LABELs,1
         ...
         if INPUTm then OUTPUTs,m and goto LABELs,m
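As an illustration only (the states, inputs, and outputs below are invented, not taken from the paper), such a case-statement program can be rendered as a dispatch table: each state maps each input to an output and a next state.

```python
# Hypothetical two-state ant program in the case-statement style above:
# head west until an edge is sensed, then follow it south to a corner.
PROGRAM = {
    "SEEK_EDGE": {
        "at_left_edge": ("stand_still", "FOLLOW_EDGE"),
        "no_edge":      ("move_W",      "SEEK_EDGE"),
    },
    "FOLLOW_EDGE": {
        "at_corner": ("stand_still", "PARKED"),
        "no_corner": ("move_S",      "FOLLOW_EDGE"),
    },
    "PARKED": {},  # absorbing state: no further case statements
}

def step(state, observed_input):
    """One 'if INPUT then OUTPUT and goto LABEL' step of the FSM."""
    output, next_state = PROGRAM[state][observed_input]
    return output, next_state
```

The table is finite and independent of n, which is exactly the "finite-state" restriction of Section 1.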
We specify algorithms in English—aiming to supply enough detail to make it clear how to craft finite-state programs that implement the algorithms.

2.2 The Parking Problem for Ants

The parking problem has each ant F move as close as possible to the corner tile of Mn that is closest to F when it receives the command PARK (from the outside world). Additionally, the ants in each quadrant must cluster within the most compact quarter-sphere "centered" at the quadrant's corner tile. Focus on QSW (easy clerical changes work for the other quadrants). Formally: A configuration of ants solves the parking problem for QSW precisely if it minimizes the parking potential function

Π(t) =def Σ_{k=0}^{2n−2} (k + 1) × (the number of ants residing on Δk at step t).    (1)
This simply specified, yet algorithmically nontrivial, path-planning problem is a good vehicle for studying what ants can determine about the “floor” without “counting.”
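Since an ant on diagonal Δk contributes k + 1 to Π(t), the potential of a configuration in QSW is just the sum of i + j + 1 over its occupied tiles. A small sketch (our code):

```python
def parking_potential(ants):
    """Pi(t) of Eq. (1) for a set of occupied tiles (i, j) in Q_SW:
    a tile on diagonal Delta_k (k = i + j) contributes k + 1."""
    return sum(i + j + 1 for (i, j) in ants)

# A compact radius-2 quarter-sphere beats a spread-out placement
# of the same three ants:
compact = {(0, 0), (0, 1), (1, 0)}
spread = {(0, 0), (2, 0), (0, 2)}
```

Here `parking_potential(compact)` is 1 + 2 + 2 = 5 while `parking_potential(spread)` is 1 + 3 + 3 = 7, so only the compact placement can be optimal.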
3 Single Ants and Parking

3.1 The Simplified Framework for Single Ants

The input and output repertoires of this section's FSMs need not contain pheromone levels, because pheromones do not enhance the path-planning power of a single ant.
Proposition 1. Given any FSM F that employs pheromones while navigating Mn, there exists an FSM F′ that follows the same trajectory as F while not using pheromones.
Proof (Sketch). We eliminate pheromones one at a time, so we may assume that F uses a single pheromone that it deposits with intensity levels from the set [0, I]. The pheromone-less FSM F′ "carries around" (in finite-state memory) a map that specifies all necessary information about the positions and intensities of deposits of F's pheromone. For F′ to exist, the map must be: (a) "small"—with size independent of n—and (b) easily updated as F's steps are simulated.
Map size. The portion of Mn that could contain nonzero levels of the pheromone is no larger than the "radius"-I subgrid of Mn that F has been in during the most recent I steps. No trace of pheromone can persist outside this region, because of volatility. Thus, the map need only be a (2I − 1) × (2I − 1) mesh centered at F's current tile. Because F is the only FSM, at most one tile of the map contains the integer I (a maximum level of the pheromone at this step), at most one contains the integer I − 1 (a maximum level one step ago), . . . , at most one contains the integer 1 (a maximum level I − 1 steps ago). Fig. 2 displays a sample map, with four sample one-step updates.
Updating the map. Because of a map's restricted size and contents, there are fewer than 1 + Σ_{j=0}^{I−1} ((2I − 1)² − j) distinct maps (even ignoring the necessary adjacency of tiles that contain the integers k and k − 1). F′ can, therefore, carry the set of all possible maps in its finite-state memory, with the then-current map clearly "tagged." Thus, F′ has finitely many states as long as F does. The state-transition function of F′ augments F's by updating each state's map-component while emulating F's state change.

3.2 A Single Ant Cannot Park, Even on a One-Dimensional Mesh

Theorem 1. One cannot design an FSM-ant that successfully parks when started on an arbitrary tile of (even the one-dimensional version of) Mn.
Proof (Sketch).
Toward deriving a contradiction, say that we have an FSM F with state-set Q that can park successfully no matter where we place it in an arbitrarily large mesh Mn. By Proposition 1, we may assume that F does not use pheromones. Let us place F far enough within Mn that its path to its (allegedly correct) parking corner is longer than q = |Q|. Let us label each tile along this path with the state that F is in the last time it leaves the tile. Because F starts out far from its parking corner, there must be tiles along the path that are labeled with the same state s ∈ Q, despite the fact that one tile is farther from F's parking corner than the other. (In Fig. 3(left), Mn's NE corner is depicted as F's parking corner.) Let us now lengthen Mn's side-length—i.e., increase n—by "cutting and splicing" portions of Mn that begin and end with state-label s, as in Fig. 3(right). Because F is deterministic, it cannot "distinguish" between two tiles that have the same state-label. This means that F's ultimate behavior will be the same in the original and stretched versions of Mn: it will end its journey in both meshes at the NE corner, despite the fact that the "cut and splice" operation has lengthened F's parking path. However, if we perform the "cut and splice" operation enough times, we can make F's original tile as far as we want from its parking corner. In particular, we can make this distance greater than the distance between F's original tile and some other corner of (the stretched) Mn. Once this happens, F is no longer parking successfully. The theorem follows.
Fig. 3. Illustrating the “cut and splice” operation of Theorem 1
Fig. 4. The two possible edge-terminating trajectories for a single ant
3.3 Single-Ant Configurations That Can Park

Theorem 2. A single FSM-ant F can park in time O(n) when started on an arbitrary edge-tile of Mn.
Proof (Sketch). With no loss of generality, let F begin on Mn's bottom edge, at tile ⟨0, k⟩. Clearly, F's target parking tile is either ⟨0, 0⟩ or ⟨0, n − 1⟩. To decide between these alternatives, F begins a 60° northeasterly walk from ⟨0, k⟩, i.e., a walk consisting of "Knight's-move supersteps" of the form (0, +1), (+1, 0), (+1, 0) (see Fig. 4), continuing until it encounters an edge or a corner of Mn. Note that after s supersteps, F has moved from ⟨0, k⟩ to ⟨2s, k + s⟩. Consequently, if F's walk terminates in:
– Mn's right edge, then F's parking tile is ⟨0, n − 1⟩, because 2s < n − 1 ≤ k + s;
– Mn's top edge or NE corner, then F's parking tile is ⟨0, 0⟩, because k + s ≤ n − 1 ≤ 2s.
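The decision procedure in the proof can be traced directly; the sketch below (our code) walks the supersteps one move at a time and reports the corner the ant should park in:

```python
def parking_corner_from_bottom_edge(k, n):
    """Theorem 2's walk for an ant starting at tile (0, k) on the bottom
    edge: Knight's-move supersteps (0,+1),(+1,0),(+1,0) until the walk
    reaches the top edge / NE corner (park at (0, 0)) or the right edge
    (park at (0, n-1))."""
    if k == n - 1:                  # already on the SE corner
        return (0, n - 1)
    i, j = 0, k
    while True:
        for di, dj in ((0, 1), (1, 0), (1, 0)):
            i, j = i + di, j + dj
            if i == n - 1:          # top edge or NE corner: k + s <= n-1 <= 2s
                return (0, 0)
            if j == n - 1:          # right edge: 2s < n-1 <= k + s
                return (0, n - 1)
```

On a 5 × 5 floor, a start at ⟨0, 1⟩ yields corner ⟨0, 0⟩ and a start at ⟨0, 3⟩ yields ⟨0, 4⟩, as the distance comparison predicts.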
4 Multi-Ant Configurations That Can Park

We do not yet know whether a collection of ants can park successfully when started in an arbitrary initial configuration in Mn—but one simple restriction enables successful parking. Two ants are adjacent if they reside on tiles that share an edge or a corner.
Theorem 3. Any collection of ants that contains two designated adjacent ants can park in O(n²) synchronous steps.
Proof (Sketch). We focus, in turn, on the three components of the activity of parking:
1. having each ant determine its home quadrant;
2. having each ant use this information to plan a route to its target corner of Mn;
3. having ants that share the same home quadrant—hence, the same target corner—organize into a configuration that minimizes the parking potential function (1).

4.1 Quadrant Determination with the Help of Adjacent Ants

Lemma 1. Any collection of ants that contains two designated adjacent ants can determine their home quadrants in O(n²) synchronous steps.
We address the problem of quadrant determination in three parts. Section 4.1.A presents an algorithm that allows two adjacent ants to park. Because this algorithm is a bit complicated, we present in Section 4.1.B a much simpler algorithm that allows three adjacent ants to park. We then show in Section 4.1.C how two adjacent ants can act as "shepherds" to help any collection of ants determine their home quadrants.
A. Two adjacent ants can park. The following algorithm was developed in collaboration with Olivier Beaumont. The essence of the algorithm is depicted in Fig. 5. Say, for definiteness, that Mn contains two horizontally adjacent ants: AL on tile ⟨x, y⟩ and AR on tile ⟨x, y + 1⟩. (This assumption loses no generality, because the ants can remember their true initial configuration in finite-state memory, then move into the left-right configuration, and finally compute the adjustments necessary to make the correct determination about their home quadrants.) Our algorithm has two phases.
Determining home wedges. AL performs two roundtrip walks within Mn from its initial
tile, one at 45° (a sequence of (+1, +1) moves, then a sequence of (−1, −1) moves) and one at −45° (a sequence of (+1, −1) moves, then a sequence of (−1, +1) moves); see Fig. 5(a). The termini of the outward walks determine the ants' home wedges:

45° walk terminus   −45° walk terminus   AL's home wedge   AR's home wedge
corner              corner/top           WN                WE
corner              left edge            WW                WE if the walk ends within one tile of Mn's corner, else WS
top edge            corner/top           WN                WN
top edge            left edge            WW                WN if the walk ends within one tile of Mn's corner, else WW
right edge          corner/top           WE                WE
right edge          left edge            WS                WE if the walk ends within one tile of Mn's corner, else WS
Determining home quadrants. AL and AR use the knowledge of their home wedge(s) to
determine their home quadrant(s), via one of the following subprocedures.
• The ants did not both begin in either WE or WW. In this case, AL and AR move in lockstep to the bottom edge of Mn (see Fig. 5(b)). They set out thence in lockstep on independent diagonal roundtrip walks, AL at an angle of −45° (via (+1, −1) moves)
Fig. 5. Quadrant determination for two adjacent ants: (a) wedge determination; (b) quadrant determination, horizontal version; (c) quadrant determination, vertical version
and AR at an angle of 45° (via (+1, +1) moves). When an ant returns to its original tile on the bottom edge, it checks for the other ant's presence in the appropriate adjacent tile, to determine the outcome of their roundtrip "race." A discrete set of "race" outcomes provides the ants bounds on the columns where each began: Did the "race" end in a tie? Did one ant win by exactly one step? By more than one step? These bounds allow AL and AR to determine their home quadrants.
• The ants both began in either WE or WW. In this case, a "vertical" analogue of the preceding "race" (see Fig. 5(c)) allows AL and AR to determine their home quadrants.
Once AL and AR know their respective home quadrants, they can park greedily.
B. Three adjacent ants can park. Let the ants A1, A2, and A3 be adjacent on Mn, in the sense that their adjacency graph is connected. Via local communication, each ant can recognize and remember its initial location relative to the others'. The ants can thereby adjust the result of the following algorithm and determine their respective home quadrants.
Determining home quadrants. Horizontal placement (Fig. 6(Left)). The ants align vertically in the top-to-bottom order A1, A2, A3, with A2 retaining its original tile. (A simple clerical adjustment is needed if A2 begins on either the top or bottom edge of Mn.) A1 marches leftward to Mn's left edge and returns to its pre-march tile; in lockstep, A3 marches rightward to Mn's right edge and returns to its pre-march tile.
– If A1 and A3 return to A2 at the same step, or if A1 returns one step before A3, then A2's home is either QNW or QSW. A1 and A3 can now determine their home quadrants by noting how they moved in order to achieve the vertical alignment.
Fig. 6. The essence of the parking algorithm for three adjacent ants: (Left) Determining each ant’s horizontal placement. (Right) Determining each ant’s vertical placement.
– If A1 returns to A2 two or more steps before A3, then all three ants have QNW or QSW as home.
The situation when A3 returns to A2 before A1 is almost the mirror image—except for the asymmetry in the eastern and western quadrants' boundaries.
Vertical placement (Fig. 6(Right)). This determination is achieved by "rotating the horizontal-placement algorithm by 90°." Details are left to the reader.
After horizontal and vertical placement, each ant knows its home quadrant, hence can park as specified in Section 4.2.
C. Two adjacent ants acting as shepherds. All ants in any collection that contains two designated adjacent ants are able to determine their respective home quadrants—with the help of the two adjacent ants. As we verify this, we encounter instances of ants blocking the intended paths of other ants. We resolve this problem by having conflicting ants switch roles—which is possible because all ants are identical. If ant A is blocking ant B, then A "becomes" B and continues B's blocked trajectory; and B "becomes" A and moves onto the tile that A has relinquished (by "becoming" B). We have m ≥ 2 ants, A1, . . . , Am, with two designated adjacent ants—without loss of generality, A1 and A2. We present a three-phase algorithm that takes O(n²) synchronous steps and that has A1 and A2 act as shepherds to help all other ants determine their home quadrants.
tion 4.1.A. They remember this information, which they will use to park in phase 3. Phase 2: A1 and A2 help other ants determine their home quadrants, via four subphases.
Subphase a: A1 and A2 distinguish east from west. A1 and A2 head to corner SW. A1 determines the parity of n via a roundtrip from tile 0, 0 to tile 0, n − 1 and back. If n is even, then: – A2 moves one tile eastward at every time-step until it reaches Mn ’s right edge. It then reverses direction and begins to move one tile westward at every time-step. – Starting one step later, A1 moves one tile eastward at every third time-step. – A1 and A2 stop when they are adjacent: A1 on tile 0, 12 n − 1, A2 on tile 0, 12 n. A1 and A2 thus determine the midpoint of Mn ’s bottom row (hence, of every row): – A2 ’s trajectory, 0, 0 0, n − 1 0, 12 n takes 32 n − 2 time-steps. – A1 ’s trajectory 0, 0 0, 12 n − 1 takes 12 n − 3 steps. Because it starts one step later than A2 does, A1 arrives at 0, 12 n − 1 after 32 n − 2 time-steps. If n is odd, then: – A2 moves one tile eastward at time-step until it reaches Mn ’s right edge. At that point, it reverses direction and begins to move one tile westward at every time-step. – Starting one time-step later, A1 moves one tile eastward at every third time-step. – A1 and A2 stop when they are adjacent: A1 on 0, 12 n − 1, A2 on 0, 12 n. A1 and A2 thus determine the midpoint of Mn ’s bottom row (hence, of every row): – A2 ’s trajectory, 0, 0 0, n − 1 0, 12 n, takes 3 12 n − 4 time-steps. – A1 ’s trajectory, 0, 0 0, 12 n − 1, takes 3 12 n − 3 steps. Because it starts one step later than A2 does, A1 arrive at 0, 21 n − 1 after 3 21 n − 4 time-steps.
Subphase b: A1 and A2 identify easterners and westerners. A1 walks through the western half of Mn, column by column, informing each encountered ant that it resides in either QNW or QSW. Simultaneously, A2 does the symmetric task in the eastern half of Mn, informing each encountered ant that it resides in either QNE or QSE. A1 and A2 meet at corner NW of Mn after their walks.
Subphase c: A1 and A2 distinguish north from south, via a process analogous to that of Subphase a.
Subphase d: A1 and A2 identify northerners and southerners, via a process analogous to that of Subphase b.
By the end of Phase 2, every ant knows its home quadrant.
Phase 3: Ants park. Every ant except for A1 and A2 begins to park as soon as it learns
its home quadrant—which occurs no later than the end of Phase 2; A1 and A2 await the end of Phase 2, having learned their home quadrants in Phase 1.
4.2 Completing the Parking Process

Finally, we describe how ants that know their home quadrants travel to their parking corner and configure themselves into an optimally compact formation within that corner.
Traveling to the corner. Each ant follows a two-stage trajectory. It proceeds horizontally
to the left edge of Mn . Having achieved that edge, it proceeds vertically toward its parking corner. An ant that is proceeding horizontally: (a) moves only to an empty tile; if none exists, then it waits; (b) yields to an ant that is already proceeding vertically. Organizing within the corner. We describe this process in terms of QSW ; other quadrants
are treated analogously; see Fig. 7. The first ant that reaches its parking corner (it may have started there) becomes an usher. It directs vertically arriving ants into the next adjacent downward diagonal; thus-directed ants proceed down this diagonal (k) and up the next higher one (k + 1)—moving only when there is an ant behind them that wants them to move! When an ant reaches the top of the upward diagonal, it "defrocks" the current usher and becomes an usher itself (via a message relayed by its lower neighbor). The corner of Mn thus gets filled in compactly, two diagonals at a time. One can show that this manner of filling the corner organizes the ants into a configuration that minimizes the parking potential function (1). This completes the parking algorithm and the proof of Theorem 3.
Fig. 7. Three stages in the snaked parking trajectory within QSW ; X-ed cells contain “ushers”
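The usher discipline fills QSW diagonal by diagonal. Ignoring the mechanics of the snaked motion, the resulting configuration can be sketched as a greedy fill of the lowest diagonals (our code; it assumes far fewer ants than tiles, so floor boundaries are ignored):

```python
def compact_configuration(m):
    """Tiles occupied by m parked ants in Q_SW: fill diagonal Delta_0,
    then Delta_1, ..., taking every tile of a diagonal before moving on.
    This greedily minimizes the parking potential of Eq. (1)."""
    ants, k = [], 0
    while len(ants) < m:
        for i in range(k + 1):          # the tiles (i, k - i) of Delta_k
            if len(ants) < m:
                ants.append((i, k - i))
        k += 1
    return ants
```

Four ants occupy Δ0, Δ1, and one tile of Δ2, for a potential of 1 + 2 + 2 + 3 = 8; no placement of four ants does better.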
5 Conclusions

We have reported on progress in understanding the algorithmic strengths and weaknesses of ant-inspired robots within geographically constrained rectangular "floors." We have obtained this understanding via the simple path-planning problem we call parking: have ants configure themselves in a maximally compact manner within their nearest corner of the floor. We have illustrated a variety of initial configurations of a collection of ants that enable successful, efficient parking, the strongest being just that the collection contains two ants that are initially adjacent. We have also shown that—even in a one-dimensional world—a single ant cannot park.
We mention "for the record" that if efficiency is unimportant, then any collection of ants that contains at least four adjacent ones can perform a vast array of path-planning computations (and others as well), by simulating an autonomous (i.e., input-less) 2-counter Register Machine whose registers have capacity O(n²); cf. [10].
Where do we go from here? Most obviously, we want to solve the parking problem definitively, by characterizing which initial configurations enable parking and which do not. It would also be valuable to understand the capabilities of ant-inspired robots within the context of other significant tasks that involve path planning [1,6,7,8,9], including tasks that involve finding and transporting "food" and avoiding obstacles (as in [8,9]), as well as those that involve interactions among multiple genres of ants.
Acknowledgments. This research was supported in part by US NSF Grant CNS-0905399. Thanks to O. Beaumont, O. Brock, and the anonymous reviewers for insights and suggestions.
References 1. Chen, L., Xu, X., Chen, Y., He, P.: A novel ant clustering algorithm based on Cellular automata. In: IEEE/WIC/ACM Int’l. Conf. Intelligent Agent Technology (2004) 2. Chowdhury, D., Guttal, V., Nishinari, K., Schadschneider, A.: A cellular-automata model of flow in ant trails: non-monotonic variation of speed with density. J. Phys. A: Math. Gen. 35, L573–L577 (2002) 3. Folino, G., Mendicino, G., Senatore, A., Spezzano, G., Straface, S.: A model based on Cellular automata for the parallel simulation of 3D unsaturated flow. Parallel Computing 32, 357–376 (2006) 4. Geer, D.: Small robots team up to tackle large tasks. IEEE Distributed Systems Online 6(12) (2005) 5. Goles, E., Martinez, S. (eds.): Cellular Automata and Complex Systems. Kluwer, Amsterdam (1999) 6. http://www.kivasystems.com/ 7. Marchese, F.: Cellular automata in robot path planning. In: EUROBOT 1996, pp. 116–125 (1996) 8. Rosenberg, A.L.: Cellular ANTomata. In: Stojmenovic, I., Thulasiram, R.K., Yang, L.T., Jia, W., Guo, M., de Mello, R.F. (eds.) ISPA 2007. LNCS, vol. 4742, pp. 78–90. Springer, Heidelberg (2007) 9. Rosenberg, A.L.: Cellular ANTomata: food-finding and maze-threading. In: 37th Int’l. Conf. on Parallel Processing (2008) 10. Rosenberg, A.L.: The Pillars of Computation Theory: State, Encoding, Nondeterminism. Universitext Series. Springer, Heidelberg (2009) 11. Spezzano, G., Talia, D.: The CARPET programming environment for solving scientific problems on parallel computers. Parallel and Distributed Computing Practices 1, 49–61 (1998)
High Performance Networks

José Flich¹, Alfonso Urso¹, Ulrich Bruening², and Giuseppe Di Fatta²
¹ Topic Chairs
² Members
Interconnection networks are key elements in current scalable compute and storage systems, such as parallel computers, networks of workstations, clusters, and even on-chip interconnects. In all these systems, common aspects of communication are of high interest, including advances in the design, implementation, and evaluation of interconnection networks, network interfaces, topologies, routing algorithms, communication protocols, etc. This year, five papers discussing some of those issues were submitted to this topic. Each paper was reviewed by four reviewers and, finally, we were able to select three regular papers. Despite the low number of submissions this year, the three selected papers exhibit the standard of quality of past years in the track. The accepted papers discuss interesting issues such as the removal of Head-Of-Line blocking in fat-tree topologies, the introduction of new topologies for on-chip networks, and the optimization of algorithms on specific topologies. In detail, the paper titled An Efficient Strategy for Reducing Head-Of-Line Blocking in Fat-Trees, authored by J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, and J. Duato, discusses and proposes a new queue management strategy for fat-trees, where the number of queues is minimized while guaranteeing good performance levels. In the paper titled A First Approach to King Topologies for On-Chip Networks, authored by E. Stafford, J. L. Bosque, C. Martinez, F. Vallejo, R. Beivide, and C. Camarero, the use of new topologies for on-chip networks is tackled. Finally, the paper titled Optimizing Matrix Transpose on Torus Interconnects, authored by V. T. Chakaravarthy, N. Jain, and Y. Sabharwal, proposes application-level routing techniques for matrix transpose on torus interconnects that improve load balancing, resulting in better performance.
We would like to take this opportunity to thank the authors who submitted their contributions, as well as the Euro-Par Organizing Committee and the referees, whose highly useful comments and efforts have made this conference and this topic possible.
P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, p. 412, 2010. c Springer-Verlag Berlin Heidelberg 2010
An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees
Jesus Escudero-Sahuquillo¹, Pedro Javier Garcia¹, Francisco J. Quiles¹, and Jose Duato²
¹ Dept. of Computing Systems, University of Castilla-La Mancha, Spain
{jescudero,pgarcia,paco}@dsi.uclm.es
² Dept. of Computer Engineering, Technical University of Valencia, Spain
[email protected]
Abstract. The fat-tree is one of the most common topologies for the interconnection networks of the PC clusters currently used for high-performance parallel computing. Among other advantages, fat-trees allow the use of simple but very efficient routing schemes. One of them is a recently proposed deterministic routing algorithm that offers performance similar to (or better than) adaptive routing while reducing complexity and guaranteeing in-order packet delivery. However, like other deterministic routing proposals, this algorithm cannot react when high traffic loads or hot-spot traffic scenarios produce severe contention for network resources, leading to Head-Of-Line (HOL) blocking, which spoils network performance. In that sense, we present in this paper a simple, efficient strategy for dealing with the HOL blocking that may appear in fat-trees using the aforementioned deterministic routing algorithm. From the results presented in the paper, we can conclude that, in the mentioned environment, our proposal considerably reduces HOL blocking without significantly increasing switch complexity or required silicon area. Keywords: Interconnection Networks, Fat-trees, Deterministic Routing, Head-of-Line Blocking.
1 Motivation High-performance interconnection networks are currently key elements for many types of computing and communication systems: massive parallel processors (MPPs), local and system area networks, IP routers, networks on chip, and clusters of PCs and workstations. In such environments, the performance achieved by the whole system greatly depends on the performance the interconnection network offers, especially when the number of processing and/or storage nodes is high. In turn, network performance depends on several issues (topology, routing, switching, etc.) which should be carefully considered by interconnect designers in order to obtain the desired low packet latencies and high network bandwidth. However, network design decisions are not actually taken based only on the achieved performance, but also on other factors, like network cost and power consumption.
In fact, a clear trend in interconnect research is to propose cost-effective techniques that obtain good performance while minimizing network resource requirements. Obviously, cost-effective solutions are especially relevant for the commercial high-speed interconnects (Myrinet [1], InfiniBand [2], etc.) used for building high-performance clusters, since they must satisfy, at affordable cost, the growing performance demands of cluster designers and users (many of the most powerful parallel computers are currently cluster-based machines [3]). In that sense, the fat-tree [4] has become one of the most popular network topologies, since it offers (among other properties) high communication bandwidth while minimizing the required hardware. Consequently, many interconnection networks in current clusters and MPPs (for instance, the Earth Simulator [5]) are fat-trees. Additionally, the fat-tree pattern eases the implementation of different efficient routing algorithms, either deterministic (packets follow a fixed path between source and destination) or adaptive (packets may follow several alternative paths). In general, adaptive routing algorithms are more difficult to implement than deterministic ones and present problems regarding in-order packet delivery and deadlock-freedom, but it has traditionally been assumed that they better balance traffic, thus achieving higher throughput. However, a recently proposed deterministic routing algorithm for fat-trees [6] achieves similar or better throughput than adaptive routing, thanks to a smart balance of link utilization based on the properties of the fat-tree pattern. This algorithm can be implemented in a cost-effective way by using Flexible Interval Routing [7] (FIR), a memory-efficient generic routing strategy.
Summing up, this algorithm offers the advantages of deterministic routing (simple implementation, in-order packet delivery, deadlock-freedom) while reaching the performance of (or even outperforming) adaptive routing. Hereafter, this algorithm will be referred to as DET. However, the DET routing algorithm, like other deterministic proposals, is unable to keep up its good performance level by itself when packet contention appears. Contention happens in a switch when several packets, from different input ports, concurrently require access to the same output port. In these cases, only one packet can cross while the others have to wait until the required output port becomes free, so their latency increases. Moreover, when contention is persistent (in this case, it is usually referred to as congestion), a packet waiting to cross will prevent other packets stored behind it in the same input buffer from advancing, even if these packets request a free output port; thus their latency increases and switch throughput drops. This phenomenon is known as Head-of-Line (HOL) blocking, and may limit the throughput of the switch to about 58% of its peak value [8]. Of course, high traffic loads and hot-spot situations favor the appearance of contention and HOL blocking, and consequently in these scenarios network performance degrades. In order to solve that problem, many mechanisms have been proposed. In that sense, as modern high-speed interconnects are lossless (i.e. discarding blocked packets is not allowed), the most common approach to deal with HOL blocking is to have different queues at each switch port, in order to separately store packets belonging to different flows. This is the basic approach followed by several techniques that, on the other hand, differ in many aspects, basically in the required number of queues and in the policy for mapping packets to queues. For instance, Virtual Output Queues at network
level (VOQnet) [9] requires as many queues per port as end-nodes, mapping each packet to the queue assigned to its destination. This solution guarantees that packets addressed to different destinations never share queues, thus completely eliminating HOL blocking, but, on the other hand, it is too costly in terms of the silicon area required to implement all these queues per port in medium or large networks. On the contrary, other techniques like Virtual Output Queues at switch level (VOQsw) [10], Dynamically Allocated Multi-Queues (DAMQs) [11] or Destination-Based Buffer Management (DBBM) [12] use far fewer queues per port than VOQnet. Although these techniques map packets to queues according to different “static” (i.e. independent of traffic conditions) criteria, all of them allow packets belonging to different flows to share queues, thus only partially reducing HOL blocking. In contrast with the “static” queue assignment of the previously mentioned techniques, both Regional Explicit Congestion Notification (RECN) [13], [14] and Flow-Based Implicit Congestion Management (FBICM) [15] detect which flows cause HOL blocking and dynamically assign queues to separate them from the others. Although this approach quite efficiently eliminates HOL blocking without using many queues, it requires specific mechanisms to detect congested flows and additional resources (basically, Content-Addressable Memories, CAMs) to keep track of them. In conclusion, it seems that effective HOL blocking elimination implies considerably increasing switch cost and/or introducing significant complexity. On the contrary, we think that, if the queue assignment criterion exploits the properties of both the network topology and the routing scheme, it is possible to quite effectively eliminate HOL blocking without using too many queues per port and without introducing significant complexity.
In that sense, we propose in this paper an efficient HOL-blocking elimination technique for fat-trees which use the DET routing algorithm. Our proposal uses a reduced set of queues per port, mapping packets to these queues according to the traffic balance the routing algorithm performs. As shown in the paper, by linking queue assignment to the routing algorithm, our proposal reduces HOL-blocking probability with a minimum number of resources, thus lowering the resource requirements of generic HOL-blocking elimination techniques while achieving similar (or even better) performance. Moreover, as queue assignment is static in our proposal, it does not introduce additional complexity, as dynamic approaches do. Summing up, we propose a simple, cost-effective technique for eliminating HOL blocking in a cost-effective interconnect configuration (fat-trees using the DET routing algorithm). As the proposed queue assignment is based on traffic balance, and this balance is based on a smart distribution of destinations among output ports, we call our proposal Output-Based Queue Assignment (OBQA). The rest of the paper is organized as follows. In Section 2 we summarize the deterministic routing algorithm for fat-trees which is the base of our proposal. Next, in Section 3, the basics of our OBQA proposal are presented. In Section 4, OBQA is compared, in terms of performance and resource needs, to previously proposed techniques which also reduce HOL blocking. Finally, in Section 5, some conclusions are drawn.
2 Efficient Deterministic Routing in Fat-Trees Theoretically, a fat-tree topology [4] is an indirect bidirectional interconnection network consisting of several stages of switches connected by links, forming a complete tree
Fig. 1. Eight node 2-ary 3-tree with DET routing algorithm
which gets “thicker” near the “root” switch (transmission bandwidth between switches is increased by adding more links in parallel as switches become closer to the root), and whose “leaves” (switches at the first stage) are the places where processors are located. However, as this theoretical scheme requires the number of ports per switch to increase as we get closer to the root, its implementation is actually unfeasible. For this reason, some equivalent, alternative implementations have been proposed that use switches with constant radix [16]. In [17], a fat-tree definition is given that embraces all the topologies referred to as fat-trees. Among these, we focus on k-ary n-trees, a parametric family of regular multistage interconnection networks (MINs) [18], where bandwidth between switches is increased towards the root by replicating switches. A k-ary n-tree topology is defined by two parameters: k, which represents the arity or number of ports connecting a switch to the next or previous stage, and n, which represents the number of stages of the network. A k-ary n-tree is able to interconnect k^n nodes and includes n·k^(n−1) switches. Figure 1 shows a 2-ary 3-tree connecting 8 processing nodes with 12 switches. In such topologies, a minimal source-destination path can easily be found by going from the source up to one of the nearest common ancestors of both source and destination (“upwards” direction) and then turning towards the destination (“downwards” direction). Note that when advancing upwards, several paths are possible (thus making adaptive routing possible), but when advancing downwards, only a single path is available. Taking that into account, adaptivity is limited to the upwards path, where any switch can select any of its upwards output ports for forwarding packets towards the next stage, but this is enough to allow adaptive routing algorithms based on different criteria.
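The size formulas above (k^n nodes, n·k^(n−1) switches) are easy to check programmatically; a minimal sketch (the function name is ours):

```python
def kary_ntree_size(k: int, n: int) -> tuple[int, int]:
    """Number of end-nodes and switches in a k-ary n-tree:
    k**n nodes interconnected by n * k**(n-1) switches,
    each with k ports towards the previous stage and k towards the next."""
    return k ** n, n * k ** (n - 1)

# The 2-ary 3-tree of Figure 1: 8 processing nodes, 12 switches.
print(kary_ntree_size(2, 3))
```

The same function reproduces the sizes of all the network configurations evaluated later in Section 4 (e.g. a 4-ary 3-tree gives 64 nodes and 48 switches).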
For instance, it is possible to use a round-robin policy for selecting output ports, in order to balance traffic inside the network and thus improve performance. However, adaptive routing algorithms introduce out-of-order packet delivery, as packets may cross different paths between the same source-destination pair. By contrast, a deterministic routing algorithm able to balance traffic as well as adaptive routing schemes do would solve the out-of-order delivery problem without losing performance. Such a deterministic routing algorithm (referred to above as
Fig. 2. OBQA logical input port organization
DET) is proposed in [6]; it reduces the multiple upwards paths to a single one for each source-destination pair without unbalancing network link utilization (so all the links of a given stage are used by a similar number of paths). This is accomplished by shuffling consecutive destinations at each switch in the upwards direction. That means that consecutive destinations are distributed among the upwards output links, reaching different switches in the next stage. Figure 1 also shows the destination distribution proposed by the deterministic routing algorithm in a 2-ary 3-tree network. Note that each link is labeled¹ with its assigned destinations (that is, with the destinations of packets that will be forwarded through it). It can be seen that, by shuffling destinations as proposed, packets crossing a switch are addressed to a number of destinations that decreases as the stage of the switch increases (that is, the higher the stage, the lower the number of destinations the switch deals with). In fact, each switch of the last stage receives packets addressed to only two destinations, and packets destined to each one are forwarded through a different downwards link (e.g. switch 8 receives packets only for destinations 0 and 4, and they are forwarded downwards through different output ports). Note also that packets addressed to the same destination reach the same switch at the last stage, then follow a unique downwards path. That means that, when going downwards, a packet shares links only with packets addressed to the same destination. In this way, the mechanism distributes the traffic destined to different nodes and thus traffic is completely balanced (that is, both upwards and downwards links at a given stage are used by the same number of paths). In [6], more details about this deterministic routing algorithm for fat-trees can be found, including a possible implementation using Flexible Interval Routing [7] and an evaluation of its performance.
This evaluation shows that this deterministic routing algorithm equals adaptive routing for synthetic traffic patterns (while being simpler to implement), and it outperforms adaptive routing by a factor of 3 when real applications are considered. Summing up, this deterministic algorithm is one of the best options regarding routing in fat-trees, since it reaches high performance while being cost-effective. However, as this deterministic routing proposal considers only switches with single-queue ports, its performance drops when contention situations cause Head-Of-Line blocking in the single queues², so some mechanism should be used to help this routing algorithm guarantee a certain performance level even in those situations.
¹ Upwards links in red, downwards links in green.
² All in all, it is not in the scope of the DET routing algorithm to solve the HOL blocking problem.
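The destination shuffle described in this section admits a compact formulation: pick the upward link at stage s from the s-th base-k digit of the destination. This is an illustrative sketch consistent with the Figure 1 example, not necessarily the exact implementation in [6]:

```python
def det_up_port(dest: int, k: int, stage: int) -> int:
    """Upward output-link index (0..k-1) chosen at a switch of the given
    stage for a packet addressed to `dest`. Taking the stage-th base-k
    digit of the destination shuffles consecutive destinations across
    the k upward links, which is the balance property DET relies on."""
    return (dest // k ** stage) % k

# Figure 1 (2-ary 3-tree): at switch 0 (stage 0), destinations 2, 4, 6
# take up-link 0 (port P2) and destinations 3, 5, 7 take up-link 1 (P3).
assert [det_up_port(d, 2, 0) for d in (2, 4, 6)] == [0, 0, 0]
assert [det_up_port(d, 2, 0) for d in (3, 5, 7)] == [1, 1, 1]
# At stage 1 (switch 4), destination 4 takes up-link 0 (P2), 6 takes P3.
assert det_up_port(4, 2, 1) == 0 and det_up_port(6, 2, 1) == 1
```

Because each stage consumes one more digit of the destination, packets sharing a switch at higher stages agree on ever more digits, which is exactly why the number of destinations per switch shrinks towards the root.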
3 OBQA Description In this section we describe in depth our new proposal for reducing HOL blocking in fat-trees. As mentioned above, we call our proposal Output-Based Queue Assignment (OBQA), since it assigns packets to queues depending on their requested output port, thus taking advantage of the traffic balance the DET routing algorithm performs. In the following paragraphs, we first detail the assumed memory organization at each switch port. Next, we explain the OBQA basics and how OBQA “fits” the DET routing algorithm in order to efficiently eliminate HOL blocking. Figure 2 depicts a diagram of the assumed queue scheme at input ports³. As can be seen, a RAM is used to store incoming packets, this memory being statically divided into a reduced set of fixed-size queues. The exact number (n) of queues may be tuned, but, as one of our objectives is to reduce the queue requirements of other techniques, we assume that n is always smaller than the switch radix (as we explain in Section 4, a value of n equal to or lower than half the switch radix is enough to quite efficiently eliminate HOL blocking). In order to calculate the queue in which each incoming packet must be stored, OBQA performs a simple modulo-mapping operation between the output port requested by the packet and the number of queues per port, thus: assigned_queue_number = requested_output_port MOD number_of_queues. This queue assignment is the key to OBQA functionality. As a first benefit, packets requesting the same output port (which in fact may be addressed to different destinations) are stored in the same input queue, so if that output port becomes congested, packets in other queues do not suffer HOL blocking. Moreover, taking into account that we assume the use of the DET routing algorithm, the number of destinations a switch deals with decreases as the stage increases (as can be seen in Figure 1), and thus so does the number of destinations assigned to each OBQA queue.
Therefore, the higher the stage, the fewer destinations share queues, which decreases HOL-blocking probability. In fact, both HOL blocking originated by contention inside the switch and HOL blocking created by contention in other switches are reduced. Figure 3 depicts an example of OBQA operation in a portion of a 2-ary 3-tree (see Figure 1) consisting of several switches in different stages. Note that the switch radix is 4, switch ports being numbered from P0 to P3. We assume the number of queues per port is 2, although only the queues at port 0 of every switch are shown. Inside the queues, packets are labeled with their destination. As in Figure 1, routing information is depicted beside the output ports, indicating which destinations are assigned to each output port (upwards destinations are colored in red, downwards ones in green). In the example, packets addressed to all destinations are injected by the end-node connected to port 0 of switch 0. When received at switch 0, these packets are stored in their corresponding input queue, which is calculated by using the aforementioned modulo-mapping function. For instance, packets addressed to destinations 1, 3, 5 or 7 are stored in queue 1 because, for these destinations, the requested output ports are either P1 or P3, so in all these cases requested_output_port MOD 2 = 1. Analogously, packets addressed to destinations 2, 4 or 6 are stored in queue 0.
³ Hereafter, for the sake of simplicity, we assume an Input-Queued (IQ) switch architecture, but note that OBQA could also be valid for Combined Input and Output Queued (CIOQ) switches.
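The modulo mapping is a one-liner. The routing table below for switch 0 of Figure 1 is reconstructed from the worked example in the text (an illustrative assumption, not copied from the paper's figures):

```python
def obqa_queue(requested_output_port: int, num_queues: int) -> int:
    """OBQA: map a packet to an input queue from its requested output port."""
    return requested_output_port % num_queues

# Switch 0 of the 2-ary 3-tree in Fig. 1 (ports P0..P3): destinations 0 and 1
# leave through the down ports P0/P1; the rest are shuffled over the up
# ports P2/P3 by the DET routing algorithm (table reconstructed from the text).
requested_port = {0: 0, 1: 1, 2: 2, 3: 3, 4: 2, 5: 3, 6: 2, 7: 3}
queue_of = {d: obqa_queue(p, 2) for d, p in requested_port.items()}
# Destinations 1, 3, 5, 7 share queue 1; destinations 0, 2, 4, 6 share queue 0.
```

Note that the mapping keys on the requested output port, not the destination, which is what lets OBQA inherit the routing algorithm's traffic balance.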
Fig. 3. OBQA operation example (4 × 4 Switches)
At the next stage, the number of destinations present at switch 4 is lower, because the DET routing algorithm smartly balances traffic among network links. For instance, port 0 of switch 4 only receives packets addressed to destinations 2, 4 and 6, and the routing algorithm again balances traffic among all possible output ports. At this point, packets addressed to destination 4 request P2, thus being stored in queue 0, while packets addressed to destinations 2 or 6 respectively request P1 and P3, thus being stored in queue 1. Note that at this stage, the number of destinations sharing each queue is lower than in the former stage, so HOL-blocking probability is reduced. At switch 8, port 0 only receives packets addressed to destination 4. This switch is at the highest stage and, for this reason, the number of destinations the switch deals with is minimal, so HOL blocking is completely eliminated at this stage. Finally, as the DET routing algorithm implies that downwards paths are exclusively used by packets addressed to the same destination, HOL-blocking situations in downwards paths are not possible. Summing up, OBQA smartly exploits both the benefits of k-ary n-trees and the properties of the DET routing algorithm to progressively reduce HOL-blocking probability along any path inside the fat-tree. As this is achieved with a reduced set of queues per port, OBQA can be considered a cost-efficient technique for HOL-blocking elimination in fat-trees. In the next section, we evaluate OBQA in comparison with other schemes proposed for reducing HOL blocking, in terms of performance and memory requirements.
4 Performance Evaluation In this section we present the evaluation of the OBQA mechanism, based on simulation results showing network performance when OBQA or another HOL-blocking elimination technique is used. The simulation tool used in our experiments is an ad-hoc, event-driven simulator modeling interconnection networks at cycle level. In the next sections we first describe the simulated scenarios and the modeling considerations used to configure all the experiments. Then, we show and analyze the simulation results. 4.1 Simulation Model The simulator models different types of network topologies by defining the number of switches, end-nodes and links. In our experiments, we model fat-trees (k-ary n-trees) with different network sizes and different switch radix values. In particular, we use the network configurations shown in Table 1.
Table 1. Evaluated network configurations

   #    Fat-Tree Size   Interconnection Pattern   Switch radix   Switches (total)   Stages
   #1   64 × 64         4-ary 3-tree              8              48                 3
   #2   256 × 256       4-ary 4-tree              8              256                4
   #3   256 × 256       16-ary 2-tree             32             32                 2
For all network configurations, we use the same link model. In particular, we assume serial full-duplex pipelined links with 1 GByte/s of bandwidth and 4 nanoseconds of link delay, both for switch-to-switch and node-to-switch links. The DET routing algorithm described in Section 2 has been used for all the network configurations in Table 1. The modeled switch architecture follows the IQ scheme, so memories are implemented only at switch input ports. However, memory size and organization at each input port depend on the mechanism used for reducing HOL blocking. For the purpose of comparison, the simulator models the following memory queue schemes:
– OBQA. A memory of 4 KB per input port is used, statically and equally divided among all queues in the input port. As described in Section 3, queues are assigned to a set of output ports (taking the routing algorithm into account, that means that a queue is assigned to a set of destinations) according to the aforementioned modulo-mapping function. We consider three values for the number of queues: 2, 4 or 8.
– DBBM. As in the previous scheme, a memory of 4 KB per input port is assumed, statically and equally divided among all configured DBBM queues in the port. Each queue is assigned to a set of destinations according to the modulo-mapping function assigned_queue_number = destination MOD number_of_queues. We consider 4 or 8 queues per port for DBBM.
– Single Queue. This is the simplest case, with only one queue at each input port for storing all the incoming packets. Hence, there is no HOL-blocking reduction policy at all, so this scheme allows us to evaluate the performance achieved by the DET routing algorithm alone. 4 KB memories are used in this case.
– VOQ at switch level (VOQsw). 4 KB memories per input port are used, statically and equally divided into as many queues as switch output ports, in order to store each incoming packet in the queue corresponding to its requested output port. As the number of queues equals the switch radix, 8 queues are used for network configurations #1 and #2, and 32 queues for network configuration #3.
– VOQ at network level (VOQnet). This scheme, although the most effective one, requires a greater memory size per port, because the memory must be split into as many queues as end-nodes in the network, and each queue requires a minimum size. Considering flow-control restrictions, packet size, link bandwidth and link delay, we fix the minimum queue size at 512 bytes, which implies input-port memories of 32 KB for the 64 × 64 fat-tree and 128 KB for the other (larger) networks. Note that this scheme is actually almost unfeasible, thus it is considered only for showing the theoretical maximum efficacy in HOL-blocking elimination⁴.
⁴ As both RECN and FBICM have been reported to obtain a performance level similar to VOQnet, we consider it redundant to include these techniques in the comparison.
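The buffer organizations just listed can be summarized in a small helper (a sketch with our own naming; per-queue sizes follow directly from the stated 4 KB per-port budget and, for VOQnet, the 512-byte minimum queue size):

```python
def port_layout(scheme: str, radix: int, end_nodes: int, n_queues: int = 4):
    """Return (number of queues, bytes per queue) at one switch input port,
    following the buffer budgets described in the evaluation setup."""
    if scheme in ("OBQA", "DBBM"):          # 4 KB split over n configured queues
        return n_queues, 4096 // n_queues
    if scheme == "1Q":                      # single 4 KB queue, no HOL reduction
        return 1, 4096
    if scheme == "VOQsw":                   # one queue per output port
        return radix, 4096 // radix
    if scheme == "VOQnet":                  # one queue per end-node, 512 B each
        return end_nodes, 512
    raise ValueError(f"unknown scheme: {scheme}")

# Configuration #2 (256 end-nodes, radix-8 switches):
print(port_layout("VOQsw", 8, 256))     # 8 queues of 512 B (4 KB total)
print(port_layout("VOQnet", 8, 256))    # 256 queues of 512 B (128 KB per port)
```

The VOQnet totals reproduce the figures quoted in the text: 64 × 512 B = 32 KB for the 64-node fat-tree and 256 × 512 B = 128 KB for the 256-node networks.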
Table 2. Synthetic traffic patterns used in the evaluation

                       Random Traffic                        Hot-Spot Traffic
   Traffic case   Sources   Destination   Gen. rate     Sources   Destination   Gen. rate   Start time   End time
   #1             100%      random        incremental   0%        –             –           –            –
   #2             75%       random        100%          25%       123           100%        250 µs       300 µs
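Traffic pattern #2 can be sketched as a per-source destination generator (the helper name is ours, and the fallback of hot-spot sources to uniform traffic outside the active window is an assumption, since the text only states when they are active):

```python
import random

HOTSPOT_DEST, T_START_US, T_END_US = 123, 250.0, 300.0

def pick_destination(is_hotspot_source: bool, now_us: float, n_nodes: int) -> int:
    """Destination of the next packet under traffic case #2: the hot-spot
    sources (25% of end-nodes) all target node 123 between 250 and 300 us
    of simulated time; all other packets go to uniformly random nodes."""
    if is_hotspot_source and T_START_US <= now_us < T_END_US:
        return HOTSPOT_DEST
    return random.randrange(n_nodes)
```

Concentrating a quarter of the injected load on a single node for 50 µs is what creates the sudden congestion tree whose effect on efficiency is analyzed in Section 4.3.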
Notice that although memory requirements are different for each policy, we use the same amount of memory per port (4 KB), except in the VOQnet scheme. Later, we analyze the minimum memory requirements for each mechanism, thus putting the results in this section in the right context. Regarding the message switching policy, we assume Virtual Cut-Through. Moreover, in all the switches, the flow control policy is a credit-based mechanism at the queue level. Packets are forwarded from input queues to output queues through a multiplexed crossbar, modeled with a speedup of 1 (link bandwidth is equal to crossbar bandwidth). The end-nodes are connected to switches through Input Adapters (IAs). Every IA is modeled with a fixed number of admittance queues (as many as destinations in the network, in order to avoid HOL blocking before packet injection), and a variable number of injection queues, which follow the same scheme as the queues at input port memories. A generated message is completely stored in the admittance queue assigned to its destination. Then, the stored message is packetized before being transferred to an injection queue. We use 64-byte packets. Regarding traffic loads, we use both synthetic traffic patterns (in order to simulate ideal traffic scenarios) and storage area network (SAN) traces. The synthetic traffic patterns are described in Table 2. For uniform traffic (traffic case #1) each source injects packets addressed to random destinations. We range the injection rate from 0% up to 100% of the link bandwidth. In addition, a simple, intensive hot-spot scenario (traffic case #2) is defined, in order to create heavy congestion situations within the network. In this case congestion is generated by a percentage of sources (25%) injecting packets always addressed to the same destination, while the rest of the sources (75%) inject packets to random destinations.
Note that random packet generation rate is incremental in case #1, thus increasing the traffic rate from 0% up to 100% of link bandwidth. By contrast, traffic pattern #2 has been used to obtain performance results as a function of time, in order to show the impact of a sudden congestion situation. These synthetic traffic patterns have been applied to network configurations #2 and #3. On the other hand, we use real traffic traces provided by Hewlett-Packard Labs. They include all the I/O activity generated from 1/14/1999 to 2/28/1999 at the disk interface of the cello system. As these traces are eleven years old, we apply several time compression factors to the traces. Of course, only results as a function of time are shown in this case. Traces are used for network configuration #1. Finally, although the simulator offers many metrics, we base our evaluation on the ones usually considered for measuring network performance: network throughput (network efficiency when normalized) and packet latency. Therefore, in the following subsections we analyze, by means of these metrics, the obtained network performance.
4.2 Results for Uniform Traffic Figure 4 depicts the simulated latency results as a function of traffic load, obtained for fat-tree configurations #2 (Figure 4(a)) and #3 (Figure 4(b)) when traffic case #1 (completely uniform traffic) is used. As can be seen in Figure 4(a), when the network is made of switches of radix 8 (configuration #2), OBQA configured with 4 queues per port reaches the saturation point (the point where average message latency dramatically increases) at the same traffic load as VOQnet and VOQsw. Note that in this case VOQsw requires 8 queues per input port and VOQnet requires 256 queues. Thus, OBQA equals the maximum possible performance while significantly reducing the number of required queues. In turn, the DBBM scheme, which is also configured with 4 queues per input port, experiences a high packet latency, near the poorest result (that of the Single Queue scheme). It can also be seen that, even when configured with just 2 queues, OBQA achieves better results than DBBM with 4 queues, although worse than VOQsw (however, note that the difference is around 12% while the number of queues is reduced by 75%). Note also that using OBQA dramatically increases (by around 30%) the performance achieved by the Single Queue scheme (that is, the performance achieved by the DET routing algorithm without any HOL-blocking elimination mechanism). For a network with switches of radix 32 (Figure 4(b)), OBQA configured with 8 queues per port reaches the saturation point at the same traffic load as VOQsw, which in this case requires 32 queues per port. Moreover, this traffic load is just 2% lower than the VOQnet saturation point, so the number of queues per port can be reduced from 256 to 8 at the cost of a minimal performance decrease. Even if configured with 4 queues, OBQA reaches the saturation point at a load only 5% lower than VOQsw. Again, the Single Queue scheme and DBBM with 4 or 8 queues achieve very poor results.
As already explained, that is what we should expect for the Single Queue scheme, since it does not implement any HOL-blocking elimination mechanism. However, the reason for the poor behavior of DBBM is that its queue assignment policy does not “fit” the routing algorithm (contrary to OBQA), so some DBBM queues are not efficiently used for eliminating HOL blocking. In fact, considering the routing algorithm, in some switches beyond the first stage some DBBM queues are not used at all.
Fig. 4. Network latency (cycles) versus normalized accepted traffic, under a uniform distribution of packet destinations: (a) Network Configuration #2; (b) Network Configuration #3.
Therefore, we can conclude that, in a uniform traffic scenario, OBQA achieves network performance similar to that of VOQnet and VOQsw, while requiring far fewer queues per port (specifically, a maximum number of queues per port equal to half the switch radix). Moreover, OBQA clearly outperforms DBBM. 4.3 Results for Hot-Spot Traffic In this subsection, we present in Figure 5 network efficiency results as a function of time, when synthetic traffic pattern #2 is used in fat-tree configurations #2 (Figure 5(a)) and #3 (Figure 5(b)). As previously described, in this case 25% of end-nodes generate hot-spot traffic addressed to a single end-node (specifically, destination 123), whereas the rest of the traffic is randomly distributed. Furthermore, the hot-spot generation sources are active only during 50 microseconds, starting 250 microseconds after simulation start time, thus creating a sudden, eventual congestion situation. As can be seen in Figure 5(a), for a network with 8-radix switches, the Single Queue scheme barely achieves 5% of network efficiency when congestion appears. Likewise, DBBM (with 4 queues) performance decreases by around 25% when congestion arises. Obviously, both queue schemes are dramatically affected by the HOL blocking created by congestion, and do not recover their maximum theoretical performance during the simulation time. OBQA with 4 queues per port and VOQsw (8 queues per port) achieve similar performance, better than that of DBBM, decreasing by around 20% when the hot-spot appears; thus they slightly improve on DBBM behavior and completely outperform the Single Queue scheme. However, again OBQA requires half the queues of VOQsw while achieving similar performance. On its side, VOQnet achieves the maximum efficiency, but requires 256 queues. Similar results are achieved for a fat-tree made of 32-radix switches, as can be seen in Figure 5(b), with VOQsw (32 queues per port in this case) achieving the same results as OBQA with 8 queues per port.
Note that in this case the performance level of OBQA with 8 queues is quite close to the maximum (VOQnet), and also that OBQA with 4 queues outperforms DBBM with 8 queues.
(a) Network Configuration #2.
(b) Network Configuration #3.
Fig. 5. Network efficiency (normalized throughput) versus Time. Hot-Spot traffic (25% of packets are addressed to the same destination).
J. Escudero-Sahuquillo et al.
(a) Network Configuration #1. FC=20.
(b) Network Configuration #1. FC=40.
Fig. 6. Network efficiency (normalized throughput) versus Time. Storage area network (SAN) traces with different compression factors.
Summing up, the analysis of the hot-spot results leads to conclusions similar to those of the former (uniform traffic) analysis: OBQA reaches great performance with a number of queues per port equal to half or a quarter of the switch radix.

4.4 Results for Traces

Finally, we evaluate network performance when real traffic (the I/O traces described in Section 4.1) is used as the traffic load. In this case we use fat-tree configuration #1 (64 nodes, switch radix 8). Figure 6(a) shows results for a trace compression factor of 20. As can be seen, OBQA with 4 queues again achieves excellent efficiency, at the same level as VOQsw (8 queues), and both slightly outperform DBBM (4 queues) and the Single Queue scheme. When a higher compression factor (40) is applied to the trace (Figure 6(b)), it can be seen that OBQA (4 queues), like VOQsw, outperforms DBBM more clearly, achieving up to 30% improvement in some cases. Again, the Single Queue scheme achieves the poorest results (down to 50% of the OBQA efficiency).

4.5 Data Memory Requirements

In this section, we compare the minimum data memory requirements of different switch organizations, as an estimation of switch complexity in each case. In particular, we consider the same HOL-blocking elimination queue schemes modeled in the simulation tests. Table 3 shows memory size and area requirements for each queue scheme at each switch input port. To aid the comparison, different numbers of queues per port have been evaluated for the different schemes. For each case the minimum memory requirements have been computed assuming two packets per queue (64-byte packet size) and Virtual Cut-Through switching. All queue schemes can be interchangeably used by 8 × 8 and 32 × 32 switches, except the VOQsw ones. In particular, line #3 corresponds to 32 × 32 switches, and line #4 represents data for 8 × 8 switches. On the other hand, VOQnet requirements are shown for 256-node networks (line #1) and for
Table 3. Memory size and area requirements per input port for different queue schemes

#  Queue scheme  Queues per port  Data memory size  Data memory area per port (mm²)
1  VOQnet        256              32768 bytes       0.33291
2  VOQnet        64               8192 bytes        0.07877
3  VOQsw         32               4096 bytes        0.03569
4  VOQsw         8                1024 bytes        0.01412
5  DBBM          8                1024 bytes        0.01412
6  DBBM          4                512 bytes         0.00647
7  OBQA          8                1024 bytes        0.01412
8  OBQA          4                512 bytes         0.00647
9  OBQA          2                256 bytes         0.00359
64-node networks (line #2). Memory area has been estimated by means of the CACTI tool v5.3 [19] using its SRAM modeling. We assume SRAM memories with one read and one write port, a 45 nm technology node, and a 1-byte readout value. As can be seen, in general OBQA and DBBM present similar memory size and area requirements per port although, as we have shown in previous sections, OBQA always outperforms DBBM when configured with the same number of queues (or even with fewer queues than DBBM in some scenarios). For its part, VOQsw requirements for 32 × 32 switches are much greater than those of any OBQA or DBBM scheme. For 8 × 8 switches, VOQsw requirements (8 queues) equal those of OBQA and DBBM if they are configured with 8 queues. Note, however, that for this switch radix OBQA with 4 queues always equals VOQsw performance, so in this case OBQA requirements are actually half the VOQsw ones. Finally, VOQnet needs a vast amount of memory storage, as it demands as many queues as there are destinations in the network, and its required area is impractical in real implementations.
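The memory sizes in Table 3 follow directly from the stated assumptions (two 64-byte packets per queue, Virtual Cut-Through switching). A quick illustrative sketch (the function name is ours):

```python
# Minimum data memory per input port: queues x 2 packets x 64 bytes/packet,
# following the assumptions stated for Table 3.
def port_memory_bytes(queues_per_port, packets_per_queue=2, packet_bytes=64):
    return queues_per_port * packets_per_queue * packet_bytes

# Reproduce the "Data memory size" column of Table 3.
assert port_memory_bytes(256) == 32768  # VOQnet, 256-node network
assert port_memory_bytes(32) == 4096    # VOQsw, 32-radix switches
assert port_memory_bytes(8) == 1024     # VOQsw/DBBM/OBQA with 8 queues
assert port_memory_bytes(2) == 256      # OBQA with 2 queues
```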
5 Conclusions

Currently, one of the most popular high-performance interconnection network topologies is the fat-tree, whose nice properties have favored its use in many clusters and massively parallel processors. One of these properties is the high communication bandwidth offered with minimum hardware, making it a cost-effective topology. The special connection pattern of the fat-tree has been exploited by a deterministic routing algorithm which achieves the same performance as adaptive routing in fat-trees while being simpler and thus more cost-effective. However, the performance of that deterministic routing algorithm is spoiled by Head-of-Line (HOL) blocking when this phenomenon appears due to high traffic loads and/or hot-spot scenarios. The HOL-blocking elimination technique proposed, described and evaluated in this paper solves this problem, keeping the good performance of the aforementioned deterministic routing algorithm even in adverse scenarios. Our proposal, called Output-Based Queue Assignment (OBQA), is based on using a reduced set of queues at each switch port and on mapping incoming packets to queues in line with the routing algorithm. Specifically, OBQA uses a modulo-mapping function which selects
the queue to store a packet depending on its requested output port, thus suiting the routing algorithm. From the results shown in this paper we can conclude that a number of OBQA queues per port equal to half (or even a quarter of) the switch radix is enough to deal efficiently with HOL blocking, even in scenarios with high traffic loads and/or hot-spots, especially in networks where the switch radix is not high. In fact, OBQA outperforms previous techniques with similar queue requirements (like DBBM) and achieves similar (or only slightly worse) performance than other techniques with much higher queue requirements. Furthermore, OBQA significantly improves the performance achieved by the deterministic routing algorithm when no HOL-blocking elimination technique is used, especially in hot-spot scenarios. As this is accomplished without requiring many resources, OBQA can be considered a cost-effective solution for significantly reducing HOL blocking in fat-trees.
Acknowledgements

This work is jointly supported by the MEC, MICINN and European Commission under projects Consolider Ingenio-2010-CSD2006-00046 and TIN2009-14475-C04, and by the JCCM under projects PCC08-0078 (PhD grant A08/048) and POII10-0289-3724. We are grateful to J. Flich, C. Gomez, F. Gilabert, M. E. Gomez and P. Lopez, from the Department of Computer Engineering, Technical University of Valencia, Spain, for their generous support of this work.
References

1. Myrinet 2000, Series Networking (2000), http://www.cspi.com/multicomputer/products/2000 series networking/2000 networking.htm
2. InfiniBand Trade Association: InfiniBand Architecture Specification, Volume 1, Release 1.0, http://www.infinibandta.com/
3. Top 500 List, http://www.top500.org
4. Leiserson, C.E.: Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers 34(10), 892–901 (1985)
5. Earth Simulator, http://www.jamstec.go.jp/es/en/index.html
6. Gomez, C., Gilabert, F., Gomez, M., Lopez, P., Duato, J.: Deterministic versus adaptive routing in fat-trees. In: Workshop CAC (IPDPS 2007), p. 235 (March 2007)
7. Gomez, M.E., Lopez, P., Duato, J.: A memory-effective routing strategy for regular interconnection networks. In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), p. 41.2 (April 2005)
8. Karol, M.J., Hluchyj, M.G., Morgan, S.P.: Input versus output queueing on a space-division packet switch. IEEE Trans. on Commun. COM-35, 1347–1356 (1987)
9. Dally, W., Carvey, P., Dennison, L.: Architecture of the Avici terabit switch/router. In: Proc. of 6th Hot Interconnects, pp. 41–50 (1998)
10. Anderson, T., Owicki, S., Saxe, J., Thacker, C.: High-speed switch scheduling for local-area networks. ACM Transactions on Computer Systems 11(4), 319–352 (1993)
11. Tamir, Y., Frazier, G.: Dynamically-allocated multi-queue buffers for VLSI communication switches. IEEE Transactions on Computers 41(6) (June 1992)
12. Nachiondo, T., Flich, J., Duato, J.: Destination-based HoL blocking elimination. In: Proc. 12th ICPADS, pp. 213–222 (July 2006)
13. García, P.J., Flich, J., Duato, J., Johnson, I., Quiles, F.J., Naven, F.: Efficient, scalable congestion management for interconnection networks. IEEE Micro 26(5), 52–66 (2006)
14. Mora, G., García, P.J., Flich, J., Duato, J.: RECN-IQ: A cost-effective input-queued switch architecture with congestion management. In: Proc. ICPP (2007)
15. Escudero-Sahuquillo, J., García, P.J., Quiles, F.J., Flich, J., Duato, J.: FBICM: Efficient Congestion Management for High-Performance Networks Using Distributed Deterministic Routing. In: Sadayappan, P., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) HiPC 2008. LNCS, vol. 5374, pp. 503–517. Springer, Heidelberg (2008)
16. Leiserson, C.E., Maggs, B.M.: Communication-efficient parallel algorithms for distributed random-access machines. Algorithmica 3, 53–77 (1988)
17. k-ary n-trees: High Performance Networks for Massively Parallel Architectures. In: Proceedings of the International Parallel Processing Symposium (1997)
18. Duato, J., Yalamanchili, S., Ni, L.: Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, San Francisco (2004)
19. Thoziyoor, S., Muralimanohar, N., Ahn, J.H., Jouppi, N.P.: CACTI 5.1, Technical Report HPL-2008-20, Hewlett-Packard Development Company (April 2008)
A First Approach to King Topologies for On-Chip Networks

Esteban Stafford, Jose L. Bosque, Carmen Martínez, Fernando Vallejo, Ramon Beivide, and Cristobal Camarero

Electronics and Computers Department, University of Cantabria, Faculty of Sciences, Avda. Los Castros s/n, 39006 Santander, Spain
[email protected], {joseluis.bosque,carmen.martinez,fernando.vallejo, ramon.beivide}@unican.es, [email protected]
Abstract. In this paper we propose two new topologies for on-chip networks, which we have denoted king mesh and king torus. These are a higher-degree evolution of the classical mesh and torus topologies. In a king network, packets can traverse the network using orthogonal and diagonal movements, like the king on a chessboard. First, we present a topological study addressing distance properties, bisection bandwidth and path diversity, as well as a folding scheme. Second, we analyze different routing mechanisms, ranging from minimal-distance routings to misrouting techniques which exploit the topological richness of these networks. Finally, we make an exhaustive performance evaluation comparing the new king topologies with their classical counterparts. The experimental results show a performance improvement that allows us to present these new topologies as a better alternative to the classical ones.
1 Introduction
Although a lot of research on interconnection networks has been conducted in the last decades, constant technological changes demand new insights into this key component of modern computers. Nowadays, networks are critical for managing both off-chip and on-chip communications. Some recent and interesting papers advocate networks with high-radix routers for large-scale supercomputers [1][2]. The advent of economical optical signalling enables this kind of topology, which uses long global wires. Although the design scenario is very different, on-chip networks with a higher degree than traditional 2D meshes or tori have also been explored recently [3]. Such networks entail the use of long wires in which repeaters and channel pipelining are needed. Nevertheless, with current VLSI technology, the planar substrate on which the network is going to be deployed suggests the use of 2D mesh-like topologies. This has been the case for Tilera [4] and Intel's Teraflop research chip [5], with 64 and 80 cores arranged in a 2D mesh, respectively. Forthcoming technologies such as on-chip high-speed signalling and optical communications could favor the use of higher-degree on-chip networks.
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 428–439, 2010. © Springer-Verlag Berlin Heidelberg 2010
In this paper, we explore an intermediate solution. We analyze networks whose degree doubles the radix of a traditional 2D mesh while still preserving an attractive layout for planar VLSI design. We study meshes and tori of degree eight in which a packet located at any node can travel in one hop to any of its eight neighbours, just like the king on a chessboard. For this reason, we denote these networks king meshes and king tori. In this way, we adopt a more conservative evolution towards higher-radix networks, trying to exploit their advantages while avoiding the use of long wires. The simplicity and topological properties of these networks offer tantalising features for future on-chip architectures: higher throughput, smaller latency, trivial partitioning into smaller networks, good scalability and high fault-tolerance. The use of diagonal topologies has been considered in the past in the fields of VLSI [6], FPGA [7] and interconnection networks [8]. Mesh and toroidal topologies with added diagonals have also been considered, both with degree six [9] and eight [10]. The king lattice has been previously studied in several papers on Information Theory [11]. The goal of this paper is to explore the suitability of king topologies to constitute the communication substrate of forthcoming on-chip parallel systems. With this idea in mind, we present the foundations of king networks and a first attempt to unleash their potential. The main contributions of our research are the following: i) An in-depth analysis of the topological characteristics of king tori and king meshes. ii) The introduction and evaluation of king tori, not previously considered in the technical literature. iii) A folding scheme that ensures king tori scalability. iv) An adaptive and deadlock-free routing algorithm for king topologies. v) A first performance evaluation of king networks based on synthetic traffic. The remainder of this paper is organized as follows.
Section 2 is devoted to defining the network topologies considered in this paper. The most relevant distance parameters and the bisection bandwidth are computed for each network, and a folding method is considered for networks with wrap-around links. Section 3 tackles the task of finding routing algorithms to unlock the networks' potentially high performance, starting with simple minimum-distance algorithms and evolving to more elaborate misrouting and load-balancing techniques. Section 4 presents a first performance evaluation of these networks. Finally, Section 5 concludes the paper, highlighting its most important findings.
2 Description of the Topologies
In this section we define and analyze distance properties of the network topologies considered in this paper: square meshes, square king meshes, square tori and square king tori. Then, we obtain expressions for significant distance parameters as well as the bisection bandwidth. Finally, we consider layout possibilities minimizing wire length for those topologies with wrap-around edges.
E. Stafford et al.
As usual, networks are modeled by graphs, where graph vertices represent processors and edges represent the communication links among them. In this paper we will only consider square networks, as networks with sides of different length sometimes result in an unbalanced use of the links in each dimension [12]. Therefore, in the following we will omit the adjective "square". Hence, for any of the networks considered here the number of nodes will be n = s², for any integer s > 1. By Ms we will denote the usual mesh of side s. This is a very well-known topology which has been deeply studied. A mesh-based network of degree eight can be obtained by adding new links such that any packet can travel not only in the orthogonal directions, but also along diagonal movements. We will denote by KMs the king mesh network, which is obtained by adding diagonal links (just for non-peripheral nodes) to Ms. Note that both networks are neither regular nor vertex-symmetric. The way to make this kind of network regular and vertex-symmetric is to add wrap-around links so that every node has the same number of neighbors. We will denote by Ts the usual torus network of side s. The torus is obviously the degree-four regular counterpart of the mesh. Then, KTs will denote the king torus network, that is, a king mesh with additional wrap-around links which make it an eight-degree regular network. Another way to see this network is as a torus with extra diagonal links that turn the degree-four torus into a degree-eight network. In Figure 1 an example of each network is shown.
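These adjacency rules can be sketched as follows. This is an illustrative helper of ours, and the exact treatment of diagonal links at peripheral king-mesh nodes follows our reading of the definition above:

```python
def neighbors(x, y, s, king=False, torus=False):
    """Neighbors of node (x, y) in an s x s network: mesh/torus of degree
    four, or their king variants with the eight chess-king directions."""
    if king:
        deltas = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]
    else:
        deltas = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    result = set()
    for dx, dy in deltas:
        nx, ny = x + dx, y + dy
        if torus:
            result.add((nx % s, ny % s))      # wrap-around links
        elif 0 <= nx < s and 0 <= ny < s:     # peripheral nodes lose links
            result.add((nx, ny))
    return result

# Every node of a king torus has degree eight (regular, vertex-symmetric);
# inner nodes of a king mesh also have eight neighbors.
assert len(neighbors(0, 0, 8, king=True, torus=True)) == 8
assert len(neighbors(3, 3, 8, king=True)) == 8
assert len(neighbors(3, 3, 8)) == 4
```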
Fig. 1. Examples of Mesh, King Mesh, Torus and King Torus Networks
In an ideal system, transmission delays in the network can be inferred from its topological properties. The maximum packet delay is given by the diameter of the graph: the maximum length over all minimum paths between any pair of nodes. The average delay is proportional to the average distance, which is computed as the average length of all minimum paths connecting every pair of nodes of the network. In Table 1 we record these parameters for the four networks considered. The diameter and average distance of mesh and torus are well-known values [13]. The distance properties of the king torus were presented in [14].

Table 1. Topological Parameters

Network              Ms       KMs       Ts      KTs
Diameter             2s       s         s       s/2
Average Distance     ≈ 2s/3   ≈ 7s/15   ≈ s/2   ≈ s/3
Bisection Bandwidth  2s       6s        4s      12s
An especially important metric of interconnection networks is the throughput, the maximum data rate the network can deliver. In the case of uniform traffic, that is, when nodes send packets to random nodes with uniform probability, the throughput is bounded by the bisection. According to the study in [13], in networks with homogeneous channel bandwidth, such as the ones considered here, the bisection bandwidth is proportional to the channel count across the smallest cut that divides the network into two equal halves. This value represents an upper bound on the throughput under uniform traffic. Values of the bisection for mesh and torus are shown in Table 1; see [13]. Obtaining the bisection bandwidth of the king mesh and king torus is straightforward. Note that a king network doubles the number of links of its orthogonal counterpart but has three times the bisection bandwidth. On a more technological level, the physical implementation of computer networks usually requires that the length of the links be similar, if not constant. In the context of networks-on-chip, mesh implementation is fairly straightforward: a regular mesh can be laid out with a single metal layer. Due to the crossing diagonal links, the king mesh requires two metal layers. Tori, however, have wrap-around links whose length depends on the size of the network. To overcome this problem, a well-known technique is graph folding. A standard torus can be implemented with two metal layers. Our approach to folding king tori is based on the former, but because of the diagonal links four metal layers are required. As a consequence of the folding, the length of the links in king tori is between 2 and √8. This seems to be the optimal solution for this kind of network. Figure 2 shows an 8 × 8 folded king torus. For the sake of clarity, the folded graph is shown with the orthogonal and diagonal links separated.
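The Table 1 values for the king torus can be checked by brute force. The sketch below rests on the observation that one king hop changes each coordinate by at most one, so distances in KTs are Chebyshev distances with per-axis wrap-around:

```python
def kt_distance(a, b, s):
    """Minimal number of king hops between nodes a and b of a king torus
    of side s: Chebyshev distance with wrap-around in each axis."""
    wrap = lambda d: min(d % s, (-d) % s)
    return max(wrap(a[0] - b[0]), wrap(a[1] - b[1]))

s = 8
nodes = [(x, y) for x in range(s) for y in range(s)]
dists = [kt_distance(a, b, s) for a in nodes for b in nodes if a != b]

assert max(dists) == s // 2                          # diameter s/2 (Table 1)
assert abs(sum(dists) / len(dists) - s / 3) < 0.1    # average distance ~ s/3
```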
Now, if we compare king meshes with tori, we observe that the cost of doubling the number of links gives great returns. Bisection bandwidth is 50% larger,
Fig. 2. Folding of King Torus Network. For the sake of clarity, the orthogonal and diagonal links are shown in separates graphs.
average distance is almost 5% less and diameter remains the same. In addition, implementation of a king mesh on a network-on-chip is simpler, as it does not need to be folded and fits in two metal layers just like a folded torus.
3 Routing
This section explores different routing techniques, trying to take full advantage of the king networks. For simplicity, it focuses on toroidal networks, assuming that meshes will behave similarly. Our development starts with the simplest minimum-distance routing, continuing through to more elaborate load-balancing schemes capable of giving high performance in both benign and adverse traffic situations. Enabling packets to reach their destination in direct networks is traditionally done with source routing. This means that when the packet is injected at the source node, a routing record is calculated from source and destination using a routing function. This routing record is a vector whose integer components are the number of jumps the packet must make in each dimension in order to reach its destination. In 2D networks, routing records have two components, ΔX and ΔY. These components could be used to route packets in king networks, but the diagonal links, which can be thought of as shortcuts, would never be used. It is therefore necessary to increase the number of components in the routing record to account for the greater degree of these new networks. Thus, we broaden the definition of a routing record to a vector whose components are the number of jumps a packet must make in each direction, not dimension. King networks thus have four directions, namely X and Y as the horizontal and vertical, Z for the diagonal y = x and T for the diagonal y = −x.
3.1 Minimal Routing
To efficiently route packets in a king network, we need a routing function that takes the source and destination nodes and gives a routing record that makes the packet reach its destination in the minimum number of jumps. Starting from the 2D routing record, it is easy to derive a naive king routing record that is minimal (Knaive). This routing function uses at most two of the four components of the routing record: one orthogonal and one diagonal. The algorithm is simple: consider (ΔX, ΔY) with ΔX > ΔY > 0. The corresponding king routing record is (δX, δY, δZ, δT) = (ΔX − ΔY, 0, ΔY, 0). The remaining cases are handled in a similar fashion. In addition to being minimal, this algorithm balances the use of all directions under uniform traffic, a key aspect in achieving maximum throughput. The drawback, however, is that it does not exploit all the path diversity available in the network. Path diversity is defined as the number of minimal paths between a pair of nodes a, b of a network. For meshes and tori we denote it |Rab|, and it is given by the binomial coefficient

|Rab| = C(|Δx| + |Δy|, |Δx|).

Similarly, in king meshes and tori the path diversity is given by the trinomial coefficient

|RKab| = T(|Δx|, |Δy|), where T(n, k) = Σ_{j=0..n} (−1)^j C(n, j) C(2n − 2j, n − k − j).

Thus, the path diversity of king networks is overwhelmingly higher than that of meshes and tori. Take for example Δx = 7, Δy = 1, the routing record to go from the white box to the gray box in Figure 1. In a mesh the path diversity is |Rab| = 8, while in a king mesh it is |RKab| = 357. Now, the corresponding Knaive routing record is (δX, δY, δZ, δT) = (6, 0, 1, 0). This yields only 7 alternative paths, so 350 paths are ignored; this is even fewer than in the 2D torus. This is not a problem under uniform and other benign traffic patterns, but in adverse situations diminished performance is observed. For instance, see the performance of the 16 × 16 torus with 1-phit packets in Figure 3.
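The Knaive rule stated for the first-octant case extends to the remaining sign cases by symmetry. A minimal sketch (the function name and sign convention are ours; a positive Z count means hops along (+1, +1) and a positive T count hops along (+1, −1)):

```python
def knaive(dx, dy):
    """Minimal king routing record (dX, dY, dZ, dT) for the 2D record
    (dx, dy); at most two components are non-zero."""
    ax, ay = abs(dx), abs(dy)
    sx = 1 if dx >= 0 else -1
    sy = 1 if dy >= 0 else -1
    diag = min(ax, ay)                       # hops taken along a diagonal
    rec_x = sx * (ax - diag) if ax >= ay else 0
    rec_y = sy * (ay - diag) if ay > ax else 0
    if sx == sy:                             # use the Z diagonal (y = x)
        return (rec_x, rec_y, sx * diag, 0)
    return (rec_x, rec_y, 0, sx * diag)      # use the T diagonal (y = -x)

# The paper's example: (7, 1) -> (6, 0, 1, 0), a path of length 7.
assert knaive(7, 1) == (6, 0, 1, 0)
# Path length equals the Chebyshev distance max(|dx|, |dy|).
assert sum(abs(c) for c in knaive(7, 1)) == max(7, 1)
```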
The throughput of the Knaive algorithm under uniform traffic is 2.4 times higher than that of a standard torus, which is a good gain for the cost of doubling network resources. However, under shuffle traffic the throughput is only double, and under other traffic patterns even less. A way of improving this is to increase the path diversity by using routing records with three non-zero components. This can be done by applying the observation that two jumps in one orthogonal direction can be replaced by a jump in Z plus a jump in T without altering the path's length. Based on our experiments, we have found that the best performance is obtained when using transformations similar to the following:

(δX, 0, δZ, 0) → (δX/3, 0, δZ + δX/3, δX/3)
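One can check that such a transformation preserves both the packet's displacement and the path length. The sketch below assumes δX is divisible by 3 and uses our own direction-vector convention (X = (1,0), Y = (0,1), Z = (1,1), T = (1,−1)):

```python
def eknaive(dX, dZ):
    """Rewrite (dX, 0, dZ, 0) into (dX/3, 0, dZ + dX/3, dX/3):
    every pair of X hops can be traded for one Z hop plus one T hop."""
    k = dX // 3                              # assumes dX divisible by 3
    return (k, 0, dZ + k, k)

def displacement(rec):
    dX, dY, dZ, dT = rec
    return (dX + dZ + dT, dY + dZ - dT)      # X=(1,0) Y=(0,1) Z=(1,1) T=(1,-1)

path_len = lambda rec: sum(abs(c) for c in rec)

before, after = (6, 0, 1, 0), eknaive(6, 1)  # the paper's running example
assert after == (2, 0, 3, 2)                 # the record listed in Table 2
assert displacement(after) == displacement(before) == (7, 1)
assert path_len(after) == path_len(before) == 7
```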
As this is an enhancement of the Knaive algorithm, we denote it EKnaive. It is important to note that it is still minimum-distance and gives more path diversity, though not all that is available. Continuing with our example, this algorithm gives us 210 of the total 357 paths (see Table 2). As can be seen in Figure 3, the EKnaive routing record improves the throughput under some adverse traffic patterns due to its larger path diversity. However, this comes at a cost: the inherent balance in the link utilization of the Knaive algorithm is lost, giving worse performance under uniform traffic.

Table 2. Alternative routing records for (6,0,1,0) with corresponding path diversity

Routing Record (δX, δY, δZ, δT)   Path Diversity
(6,0,1,0)                         7
(4,0,2,1)                         105
(2,0,3,2)                         210
(0,0,4,3)                         35
theoretical                       357
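The diversity of each record in Table 2 is the number of distinct orderings of its hops (a multinomial coefficient), and their sum matches the trinomial-coefficient total given above. A small check (helper names are ours):

```python
from math import comb, factorial

def record_diversity(rec):
    """Distinct orderings of the hops of a routing record (multinomial)."""
    n = sum(abs(c) for c in rec)
    out = factorial(n)
    for c in rec:
        out //= factorial(abs(c))
    return out

def king_diversity(dx, dy):
    """Total minimal king paths for dx >= dy >= 0 (trinomial coefficient)."""
    n, k = dx, dy
    return sum((-1) ** j * comb(n, j) * comb(2 * n - 2 * j, n - k - j)
               for j in range(n - k + 1))

records = [(6, 0, 1, 0), (4, 0, 2, 1), (2, 0, 3, 2), (0, 0, 4, 3)]
assert [record_diversity(r) for r in records] == [7, 105, 210, 35]
assert sum(record_diversity(r) for r in records) == king_diversity(7, 1) == 357
```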
(Panels: throughput versus offered load for the 16 × 16 king torus under uniform and shuffle traffic; routings: 2d torus, Knaive, EKnaive, Kmiss, KBugal.)
Fig. 3. Throughput comparison of the various routing algorithms in 16 × 16 toroidal networks
3.2 Misrouting
In the light of the previous experiences, we find that direction balancing is key. But is it important enough to justify relaxing the minimum-distance requirement? In response to this question, we have developed a new routing function whose routing records may have four non-zero components. Forcing packets to use all directions causes misrouting, as the minimum paths are no longer used. We therefore name this approach Kmiss. Ideally, to achieve direction balance, the four components would be as close as possible. However, this would cause path lengths to be unreasonably long, so a compromise must be reached between path length and component similarity. With Kmiss, the routing record is extracted from a table indexed by the 2D
routing record. The table is constructed so that the components of the routing records do not differ by more than 3. The new function improves the load balance regardless of the traffic pattern and provides packets with more means to avoid local congestion. In addition, it increases the path diversity. Experimental results such as those shown in Section 4 indicate that this algorithm gives improved throughput under adverse traffic patterns, but the misrouting diminishes its performance in benign situations. Figure 3 shows that Kmiss is still poor under uniform traffic, but gives the highest throughput under shuffle.

3.3 Routing Composition
In essence, we have a collection of routing algorithms. Some are very good under benign traffic but perform badly under adverse traffic, while others are reasonably good in the latter but disappointing in the former. Ideally, we would like to choose which algorithm to use depending on the situation; better yet, the network would switch from one to another by itself. This is achieved to a certain extent in Universal Globally Adaptive Load-balancing (UGAL) [15]. In a nutshell, this algorithm performs routing-algorithm composition: based on local traffic information, each node decides whether a packet is sent using a minimal routing or the non-minimal Valiant's routing [16], composing a better algorithm that should have the benefits of both simple ones. As we show next, KBugal is an adaptation of UGAL to king networks and bubble routing, with two major improvements. On the one hand, for the non-minimal routing we use Kmiss routing instead of Valiant's algorithm. This approach takes advantage of the topology's path diversity without significantly increasing latency, and it has a simpler implementation. On the other hand, the philosophy behind UGAL resides in estimating the transmission time of a packet at the source node based on local information, selecting the shortest output queue length among all profitable channels for both the minimal and the non-minimal routings. In the best scenario, the performance of KBugal is the best of the two individual algorithms, as can be seen in Figure 3. The use of bubble routing allows deadlock-free operation with only two virtual channels per physical channel, in contrast to the three used by the original UGAL. In order to get a better estimate, KBugal takes into account the combined occupation of both virtual channels of each profitable physical channel. The reason behind this is fairly simple.
Considering that all virtual channels share the same physical channel, the latency is determined by the occupation of all the virtual channels, not only the one the packet is injected into.
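The selection step can be sketched as follows. This is an illustrative model of UGAL-style composition, not the paper's exact cost function; all names are ours, and `queue_len(direction)` stands for the combined occupancy of both virtual channels of a profitable physical channel:

```python
def kbugal_choice(minimal_rec, kmiss_rec, queue_len):
    """Pick the minimal or the Kmiss routing record at injection time,
    estimating delivery cost from path length and local queue occupancy."""
    def cost(rec):
        hops = sum(abs(c) for c in rec)
        # shortest queue among profitable directions (non-zero components)
        q = min(queue_len(i) for i, c in enumerate(rec) if c != 0)
        return hops * (1 + q)
    return minimal_rec if cost(minimal_rec) <= cost(kmiss_rec) else kmiss_rec

minimal, kmiss = (6, 0, 1, 0), (1, -1, 4, 2)   # both reach offset (7, 1)
# Empty queues: the shorter minimal route wins.
assert kbugal_choice(minimal, kmiss, lambda i: 0) == minimal
# X and Z channels heavily occupied: the misrouted record is preferred.
assert kbugal_choice(minimal, kmiss, lambda i: 10 if i in (0, 2) else 0) == kmiss
```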
4 Evaluation
In this section we present the experimental evaluation carried out to verify the improved performance and scalability of the proposed networks. This is done by comparing them with other networks usually considered for future network-on-chip
(Panels: throughput and latency versus offered load for 8 × 8 networks; topologies: king_mesh, king_torus, mesh, torus.)
Fig. 4. Throughput and latency of king topologies with Knaive compared to mesh and tori under uniform traffic
architectures, namely the mesh and torus of size 8 × 8. The same study was made with 16 × 16 networks, but due to their similarity to the 8 × 8 results and lack of space, those results are not shown. All the experiments have been run on a functional simulator called fsin [17]. The router model is based on the bubble adaptive router presented in [18], with two virtual channels. As we will be comparing networks of different degree, a constant buffer space is assigned to each router and divided among all individual buffers. Another important factor in the evaluation of networks is the traffic pattern. The evaluation has been performed with synthetic workloads using typical traffic patterns. According to their effect on load balance, traffic patterns can be classified into benign and adverse. The former naturally balance the use of network resources (e.g. uniform or local), while the latter introduce contention and hotspots that reduce performance (e.g. complement or butterfly). Due to space limitations, only the results for three traffic patterns are shown, as they represent the behaviour observed in the rest: uniform, bit-complement and butterfly. Figure 4 shows the throughput and latency of king networks using Knaive compared to those of 2D tori and meshes. It shows that, thanks to their increased degree, the king networks outperform their baseline counterparts by more than a factor of two. The average latency at zero load is reduced in accordance with the theoretical average-distance values. Packets are 16 phits long, which makes the latency improvement less obvious in the graphs. Observe that king meshes have significantly better performance than 2D tori, in both throughput and latency. Figure 5 presents an analysis of the different routing techniques under the three traffic patterns for 8 × 8 king tori and meshes. Comparing the results of networks of different sizes highlights that the throughput per node is halved.
This is due to the well-known fact that the number of nodes in square networks grows quadratically with the side while the bisection bandwidth grows only linearly. For benign traffic patterns, the best results are given by Knaive routing. Under adverse traffic, however, a noticeable decrease in performance is observed, caused by the reduced path diversity. As mentioned in Section 3, this limitation is overcome
[Figure 5 panels: latency (cycles) and throughput (phits/cycle/node) versus offered load (phits/cycle/node) for the routings Knaive, Kmiss, ugal and KBugal, under uniform, complement and butterfly traffic, on the 8×8 king mesh and the 8×8 king torus.]
Fig. 5. Throughput and latency of routings on 8×8 king meshes and tori under different traffic patterns
by the Kmiss routing. Indeed, this routing yields poor performance under benign traffic patterns but very good performance under the adverse ones. Our composite routing algorithm KBugal gives the best average performance over all traffic patterns. In benign situations its throughput is slightly lower than that of Knaive, and under adverse traffic its performance is similar to that of Kmiss, being even better in some situations. The results show that KBugal performs better than its more generic predecessor UGAL: under benign traffic an improvement of 15% is obtained, and under adverse traffic the improvement ranges between 10% (complement) and 90% (butterfly).
5 Conclusion
In this paper we have presented the foundations of king networks. Their topological properties offer tantalising possibilities, positioning them as clear candidates for future network-on-chip systems. Noteworthy are king meshes, which have the implementation simplicity and wire length of a mesh yet better performance than 2d tori. In addition, we have presented a series of routing techniques specific to king networks, both adaptive and deadlock-free, which make it possible to exploit their topological richness. A first performance evaluation of these algorithms based on synthetic traffic has been presented, highlighting their properties. Further study will be required to take full advantage of these novel topologies, which promise higher throughput, lower latency, trivial partitioning and high fault-tolerance.
Acknowledgment. This work has been partially funded by the Spanish Ministry of Education and Science (grants TIN2007-68023-C02-01 and Consolider CSD2007-00050), as well as by the HiPEAC European Network of Excellence.
References

1. Kim, J., Dally, W., Scott, S., Abts, D.: Technology-driven, highly-scalable dragonfly topology. SIGARCH Comput. Archit. News 36(3), 77–88 (2008)
2. Scott, S., Abts, D., Kim, J., Dally, W.: The BlackWidow high-radix Clos network. SIGARCH Comput. Archit. News 34(2), 16–28 (2006)
3. Kim, J., Balfour, J., Dally, W.: Flattened butterfly topology for on-chip networks. In: MICRO 2007: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 172–182. IEEE Computer Society, Washington (2007)
4. Wentzlaff, D., Griffin, P., Hoffmann, H., Bao, L., Edwards, B., Ramey, C., Mattina, M., Miao, C.C., Brown III, J.F., Agarwal, A.: On-chip interconnection architecture of the Tile processor. IEEE Micro 27, 15–31 (2007)
5. Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Singh, A., Jacob, T., Jain, S., Erraguntla, V., Roberts, C., Hoskote, Y., Borkar, N., Borkar, S.: An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits 43(1), 29–41 (2008)
6. Igarashi, M., Mitsuhashi, T., Le, A., Kazi, S., Lin, Y., Fujimura, A., Teig, S.: A diagonal interconnect architecture and its application to RISC core design. IEIC Technical Report (Institute of Electronics, Information and Communication Engineers) 102(72), 19–23 (2002)
7. Marshall, A., Stansfield, T., Kostarnov, I., Vuillemin, J., Hutchings, B.: A reconfigurable arithmetic array for multimedia applications. In: FPGA 1999: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, pp. 135–143. ACM, New York (1999)
8. Tang, K., Padubidri, S.: Diagonal and toroidal mesh networks. IEEE Transactions on Computers 43(7), 815–826 (1994)
9. Shin, K., Dykema, G.: A distributed I/O architecture for HARTS. In: Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 332–342 (1990)
10. Hu, W., Lee, S., Bagherzadeh, N.: DMesh: a diagonally-linked mesh network-on-chip architecture. In: NoCArc (2008)
11. Honkala, I., Laihonen, T.: Codes for identification in the king lattice. Graphs and Combinatorics 19(4), 505–516 (2003)
12. Camara, J., Moreto, M., Vallejo, E., Beivide, R., Miguel-Alonso, J., Martinez, C., Navaridas, J.: Twisted torus topologies for enhanced interconnection networks. IEEE Transactions on Parallel and Distributed Systems 99 (2010) (PrePrints)
13. Dally, W., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco (2003)
14. Martinez, C., Stafford, E., Beivide, R., Camarero, C., Vallejo, F., Gabidulin, E.: Graph-based metrics over QAM constellations. In: IEEE International Symposium on Information Theory, ISIT 2008, pp. 2494–2498 (2008)
15. Singh, A.: Load-Balanced Routing in Interconnection Networks. PhD thesis (2005)
16. Valiant, L.: A scheme for fast parallel communication. SIAM Journal on Computing 11(2), 350–361 (1982)
17. Ridruejo Perez, F., Miguel-Alonso, J.: INSEE: An interconnection network simulation and evaluation environment (2005)
18. Puente, V., Izu, C., Beivide, R., Gregorio, J., Vallejo, F., Prellezo, J.: The adaptive bubble router. J. Parallel Distrib. Comput. 61(9), 1180–1208 (2001)
Optimizing Matrix Transpose on Torus Interconnects

Venkatesan T. Chakaravarthy, Nikhil Jain, and Yogish Sabharwal
IBM Research - India, New Delhi
{vechakra,nikhil.jain,ysabharwal}@in.ibm.com
Abstract. Matrix transpose is a fundamental matrix operation that arises in many scientific and engineering applications. Communication is the main bottleneck in performing matrix transpose on most multiprocessor systems. In this paper, we focus on torus interconnection networks and propose application-level routing techniques that improve load balancing, resulting in better performance. Our basic idea is to route the data via carefully selected intermediate nodes. However, directly employing this technique may worsen the congestion. We overcome this issue by employing the routing only for a selected set of communicating pairs. We implement our optimizations on the Blue Gene/P supercomputer and demonstrate an up to 35% improvement in performance.
1 Introduction
Matrix transpose is a fundamental matrix operation that arises in many scientific and engineering applications. On a distributed multi-processor system, the matrix is distributed over the processors in the system, and performing the transpose involves communication amongst these processors. On most interconnects with sub-linear bisection bandwidth, communication is the primary bottleneck for transpose. Matrix transpose is also included in the HPC Challenge benchmark suite [8], a suite for evaluating the performance of high performance computers. The HPCC matrix transpose benchmark mainly aims at evaluating the interconnection network of the distributed system.

We shall be interested in optimizing matrix transpose on torus interconnects. Torus interconnects are attractive interconnection architectures for distributed-memory supercomputers. They are more scalable than competing architectures such as the hypercube, and therefore many modern supercomputers such as the IBM Blue Gene and Cray XT are based on these interconnects. We will be mainly interested in the case of asymmetric torus networks. Recall that a torus network is said to be symmetric if all the dimensions are of the same length; it is said to be asymmetric otherwise. Notice that increasing the size of a torus from one symmetric configuration to a larger one requires adding nodes along all the dimensions, i.e., a large number of nodes. It is therefore of considerable interest to study asymmetric configurations, as they are often realized in practice.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 440–451, 2010. © Springer-Verlag Berlin Heidelberg 2010
Fig. 1. Illustration of SDR technique and routing algorithms
Adaptive routing is the most common routing scheme supported by torus networks at the hardware level. In this scheme, the data may take any of the shortest paths from the source to the destination; the exact path is determined dynamically based on various network parameters such as congestion on the links. Dynamically choosing paths leads to better load balancing. The goal of this paper is to propose optimizations for matrix transpose on asymmetric torus networks when the underlying hardware-level routing scheme is adaptive routing. A standard approach for performing matrix transpose (also used in the widely used linear algebra library ScaLAPACK [2]) is to split the processing into multiple phases. Each phase involves a permutation communication, wherein every node sends data to exactly one node and receives data from exactly one node. In other words, the communication happens according to a permutation π : P → P and each node u sends data to the node π(u), where P is the set of all nodes in the system. (If π(u) = u, then the node does not participate in the communication.) Our optimization is based on an application-level routing scheme that we call Short Dimension Routing (SDR), which reduces congestion for permutation communications on asymmetric torus interconnects. Apart from matrix transpose, permutation communication patterns also arise frequently in other HPC applications, such as binary-exchange-based fast Fourier transforms [6] and recursive doubling for MPI collectives in MPICH [11]. Our techniques are therefore also useful in other applications that involve permutation communication. The SDR technique is based on the simple, but useful, observation that on asymmetric torus networks the links lying along the shorter dimensions are typically less loaded than those lying along the longer dimensions. To see this in a quantitative manner, consider random permutation communication patterns on a 2-dimensional torus of size NX × NY .
We show that the expected load on any X-link (namely, a link lying along the X-dimension) is NX /8 and the expected load on any Y -link is NY /8. Thus, in the case of an asymmetric torus, say a torus with NX ≥ 2NY , the expected load on the X-links is at least twice that on the Y -links. The SDR technique exploits this fact and works as follows. Instead of sending data directly from sources to destinations, we route the data through intermediate nodes. Namely, for each source-destination pair, the data is first sent from the source to the intermediate node and then from the intermediate node to the destination. In both steps, communication takes place using the adaptive routing supported at the hardware level. The crucial aspect of the SDR technique is its choice of the intermediate node. We restrict the selection of the intermediate node to be from one amongst those that can
be reached by traversing only along the shorter dimensions from the source. To better illustrate the idea, consider the case of a 2-dimensional asymmetric torus of size NX × NY with NX ≥ 2NY . In this case, we will choose one of the nodes lying on the same column as the source node as the intermediate node. Intuitively, the above process improves the load balance of the X-links, while increasing the load on the Y -links. The increased load on the Y -links is not an issue, since to begin with they are less loaded than the X-links. Overall, the routing leads to load balancing among the X-links, without making the Y -links a potential bottleneck. This reduces the congestion, resulting in better performance. The idea is illustrated in Figure 1. The figure shows a 2-dimensional torus of size NX = 8 and NY = 4 (the wrap-around links are omitted for the sake of clarity). For a communication with S as the source node, one of the black nodes will be used as the intermediate node. The next important aspect of the SDR technique is the exact choice of the intermediate node. Consider a 2-dimensional torus of size NX × NY , with NX ≥ NY . Consider a communication from node u1 = ⟨x1, y1⟩ to node u2 = ⟨x2, y2⟩. Our strategy is to route this data through an intermediate node u = ⟨x1, y⟩ where y = (y2 + NY /2) mod NY . Intuitively, this choice of u offers the adaptive routing scheme the maximum number of X-links to choose from when sending data from u to u2. Therefore, our algorithm leads to better load-balancing on the X-links. Figure 1 illustrates the idea: for a communication with S as the source node, the node labeled X will serve as the intermediate node. However, when the torus is nearly symmetric (say NX ≤ 2NY ), our particular choice of intermediate node may significantly increase the load on the Y -links, possibly making the Y -links the bottleneck. To overcome this hurdle, we employ the idea of selective routing.
In this scheme, the intermediate-node routing is performed only for a carefully selected subset of the source-destination pairs. To summarize, we carefully choose a subset of source-destination pairs and perform application-level routing for these pairs via carefully chosen intermediate nodes. We then generalize the above ideas to higher dimensions. We implement our optimizations on the Blue Gene/P supercomputer and demonstrate up to 35% speedup in performance for matrix transpose. Our experiments show that both the choice of the intermediate nodes and the idea of selective routing are critical in achieving the performance gain. Related Work: Our work focuses on matrix transpose on asymmetric torus networks. The transpose problem has been well studied on other interconnection architectures [7,4,1,5]. The basic idea of intermediate-node routing goes back to the work of Valiant [12]. Valiant considers permutation communications on hypercubes and shows that routing data via randomly chosen intermediate nodes helps in load balancing. Subsequently, oblivious routing algorithms have been studied extensively. In these works, the entire routing path for each source-destination pair is fully determined by the algorithm. They also consider arbitrary network topologies given as input in the form of graphs. We refer to the survey by Räcke [9] for a discussion of this topic. In our case, we
focus on asymmetric torus networks, where the underlying hardware supports adaptive routing. Our goal is to use minimal application-level routing on top of the hardware-level adaptive routing and obtain load balancing. We achieve this by choosing a subset of source-destination pairs and routing the data for these pairs via carefully chosen intermediate nodes. We note that the routing from the source to the intermediate node and from the intermediate node to the destination is performed using the underlying hardware-level adaptive routing scheme. Our work shows that for permutation communication on asymmetric torus networks, the adaptive routing scheme can be improved.
2 Matrix Transpose Overview
In this section, we give a brief overview of performing matrix transpose in parallel. We shall describe the underlying communication pattern, which specifies the interactions amongst the processors in the system. Then, we discuss a popular multi-phase algorithm [3] that has been incorporated in the widely used ScaLAPACK [2] library. This algorithm is also used in the HPC Challenge Transpose benchmark [8]. In parallel numerical linear algebra software, matrices are typically distributed in a block-cyclic manner [2]. Such a layout allows for operations to be performed on submatrices instead of individual matrix elements in order to take advantage of the hierarchical memory architecture. The processors in a parallel system are logically arranged in a P × Q grid. A matrix A of size N × M is divided into smaller blocks of size b × b which are distributed over the processor grid. For simplicity, we assume that b divides M and N; so the matrix can be viewed as an array of M/b × N/b blocks. Let Rp,q denote the processor occupying position ⟨p, q⟩ on the processor grid, where 0 ≤ p < P and 0 ≤ q < Q. Further, let Br,c denote the block in position ⟨r, c⟩ of the matrix, where 0 ≤ r < M/b and 0 ≤ c < N/b. In the block-cyclic layout, the block Br,c resides on the processor Rr%P,c%Q¹. In parallel matrix transpose, each block Br,c must be sent to the processor containing the block Bc,r. This requires communication between the processors Rr%P,c%Q and Rc%P,r%Q. We now discuss the communication pattern for parallel matrix transpose. Let L denote the LCM of P and Q, and G denote the GCD of P and Q. Using the generalized Chinese remainder theorem, it can be shown that a processor Rp,q communicates with a processor Rp′,q′ iff p ≡ q′ (mod G) and q ≡ p′ (mod G). Based on this fact, it can be shown that any processor communicates with exactly (L/P) × (L/Q) processors. The multi-phase algorithm [3] works in (L/P) × (L/Q) phases.
In each phase, every processor sends data to exactly one processor and similarly receives data from exactly one processor. The communication for processor Rp,q is described in Figure 2. The important aspect to note is that the communication in each phase is a permutation communication.
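As a small illustration of the layout just described, the owner of a block and its transpose partner can be computed directly from the block indices (a sketch; the helper names are ours, not from the paper):

```python
# Block-cyclic layout sketch: block B[r][c] lives on processor R[r % P][c % Q],
# and for the transpose it must reach the owner of the mirror block B[c][r].
def owner(r, c, P, Q):
    return (r % P, c % Q)

def transpose_partner(r, c, P, Q):
    # Processor that holds the mirror block B[c][r]
    return owner(c, r, P, Q)

# On a 4 x 6 processor grid, block B[7][5] lives on R[3][5] and must be sent
# to R[1][1], the owner of B[5][7].
assert owner(7, 5, 4, 6) == (3, 5)
assert transpose_partner(7, 5, 4, 6) == (1, 1)
```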
¹ % denotes the mod operator.
g = (q − p) % G
p̃ = (p + g) % P,  q̃ = (q − g) % Q
For j = 0 to L/P − 1
  For i = 0 to L/Q − 1
    p1 = (p̃ + i · G) % P
    q1 = (q̃ − j · G) % Q
    Send data to Rp1,q1
    p2 = (p̃ − i · G) % P
    q2 = (q̃ + j · G) % Q
    Receive data from Rp2,q2
Fig. 2. Communication for processor Rp,q in the multi-phase transpose algorithm
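The schedule of Figure 2 can be checked mechanically. The following sketch (our Python rendering, with illustrative names) builds all (L/P) × (L/Q) phases and verifies that each one is a permutation:

```python
from math import gcd

def build_phases(P, Q):
    """All phases of the multi-phase schedule; one dict {(p, q): (p1, q1)} each."""
    G = gcd(P, Q)
    L = P * Q // G  # lcm(P, Q)
    schedule = []
    for j in range(L // P):
        for i in range(L // Q):
            phase = {}
            for p in range(P):
                for q in range(Q):
                    g = (q - p) % G
                    pt, qt = (p + g) % P, (q - g) % Q
                    phase[(p, q)] = ((pt + i * G) % P, (qt - j * G) % Q)
            schedule.append(phase)
    return schedule

phases = build_phases(4, 6)            # L = 12, so (12/4) * (12/6) = 6 phases
assert len(phases) == 6
for phase in phases:                   # every phase is a permutation
    assert len(set(phase.values())) == len(phase)
# each processor meets (L/P) * (L/Q) distinct partners across the phases
assert len({phase[(0, 0)] for phase in phases}) == 6
```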
3 Load Imbalance on Asymmetric Torus Networks
As mentioned in the introduction, the shorter-dimension routing technique is based on the observation that typical permutation communications on an asymmetric torus put a higher load on the links of the longer dimensions than on those of the shorter dimensions. To demonstrate this observation in a quantitative manner, we here consider random permutation communications on a 2-dimensional torus and show that the expected load on longer-dimension links is higher than that on shorter-dimension links. Consider a torus of size NX × NY and a random permutation π. We assume that the underlying routing scheme is adaptive routing. We will show that the expected load on any X-link is NX /8 and similarly that the expected load on any Y -link is NY /8. For a link e, let Le be the random variable denoting the load on the link e.

Theorem 1. For any X-link e, E[Le] = NX /8. Similarly, for any Y -link e, E[Le] = NY /8.

Proof. Let LX be the random variable denoting the sum of the load over all the X-links. Consider any node u = ⟨x1, y1⟩. Every node is equiprobable to be the destination π(u). Since we are dealing with a torus network, π(u) is equally likely to be located to the left of u or to the right of u. In either case, the packet traverses 0 to NX /2 X-links with equal probability, so the expected number of X-links traversed by the packet is NX /4. Therefore,

E[LX] = Σ_{u∈P} E[number of X-links traversed by the packet from u to π(u)] = Σ_{u∈P} (NX /4) = |P| · (NX /4),

where |P| = NX · NY is the number of nodes. The total number of X-links is 2|P| (considering bidirectionality of links) and no link is special; thus, the expected load is the same on every X-link. It follows that
E[Le] = (|P| · (NX /4)) / (2|P|) = NX /8.
The case of Y -links is proved in a similar manner.
The above theorem shows that in the case of an asymmetric torus, the links of different dimensions have different expected loads. For instance, if NX ≥ 2NY , the expected load on the X-links is at least twice that on the Y -links. Based on this observation, our application-level routing algorithms try to balance the load on the X-links by increasing the load on the Y -links, while ensuring that the Y -links do not become the bottleneck.
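Theorem 1 can be sanity-checked without simulation, since the expected number of X-links a packet crosses equals the average ring distance along X (a sketch; the helper name is ours):

```python
def expected_x_link_load(NX, NY):
    """Expected per-link load on the X-links for a random permutation."""
    # Average ring distance to a uniformly random destination column; this
    # exhaustive average equals NX/4 for even NX.
    avg_x_hops = sum(min(d, NX - d) for d in range(NX)) / NX
    total_x_hops = (NX * NY) * avg_x_hops    # one packet per source node
    num_x_links = 2 * NX * NY                # bidirectional X-links
    return total_x_hops / num_x_links        # equals NX/8

print(expected_x_link_load(8, 4))  # 1.0, i.e. NX/8
print(expected_x_link_load(4, 8))  # 0.5: on an 8x4 torus the long dimension carries twice the load
```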
4 Optimizing Permutation Communications
In this section, we discuss our heuristic application-level routing algorithm for optimizing permutation communications. As discussed earlier, matrix transpose is typically implemented in multiple phases, where each phase involves a permutation communication; as a consequence, we obtain improved performance for matrix transpose. Our application-level algorithm is based on two key ideas: the basic SDR technique and selective routing.

4.1 Basic Short Dimension Routing (SDR)
Consider a two-dimensional asymmetric torus of size NX × NY . Without loss of generality, assume that NX > NY . As discussed in Section 3, for typical permutation communication patterns, the X-links are expected to be more heavily loaded than the Y -links. We exploit the above fact and design a heuristic that achieves better load balancing. Namely, for sending a packet from a source u1 to a destination u2, we will choose a suitable intermediate node u lying on the same column as u1. This ensures that the data from u1 to u traverses only Y -links. The choice of u is discussed next. Consider a communication from a node u1 = ⟨x1, y1⟩ to a node u2 = ⟨x2, y2⟩. Recall that in adaptive routing, a packet from u1 to u2 may take any of the shortest paths from u1 to u2. In other words, the packet may traverse any of the links inside the smallest rectangle defined by taking u1 and u2 as the diagonally opposite corners. Let DX(u1, u2) denote the number of X-links crossed by any packet from u1 to u2. It is given by DX(u1, u2) = min{(x1 − x2) mod NX , (x2 − x1) mod NX }. Similarly, let DY (u1, u2) = min{(y1 − y2) mod NY , (y2 − y1) mod NY }. In adaptive routing, the load balancing on the X-links is proportional to DY (u1, u2), since a packet has as many as DY (u1, u2) choices of X-links to choose from. Similarly, the load balancing on the Y -links is proportional to DX(u1, u2). Therefore, maximum load balancing on the X-links is achieved when we route the packets from u1 to u2 through the intermediate node u = ⟨x1, (y2 + NY /2) mod NY ⟩. Note that the packets from u1 to u traverse only Y -links and do not put extra load on the X-links. Among the nodes that have this property, u is the node having the maximum DY value with respect
to u2. Thus, the choice of u offers maximum load balance in the second phase, when packets are sent from the intermediate node to the destination. An important issue with the above routing algorithm is that it may overload the Y -links, making the Y -links the bottleneck. This happens when the torus is nearly symmetric (for instance, NX ≤ 2NY ). To demonstrate this issue, we present an analysis of the basic SDR-based algorithm.

Analysis. Consider a random permutation π. The following lemma derives the expected load on the X-links and the Y -links for the basic SDR algorithm. For a link e, let Le be the random variable denoting the load on the link e.

Lemma 1. For an X-link e, the expected load is E[Le] = NX /8, and for a Y -link e, the expected load is E[Le] = 3NY /8.

Proof. Let us split the overall communication into two phases: sending packets from every node to the intermediate nodes, and sending the packets from the intermediate nodes to the final destinations. Let us first derive the expected load for the case of X-links. In the first phase of communication, packets only cross Y -links and do not cross any X-link, so we need to consider only the second phase. As in the case of Theorem 1, we have that the expected load is E[Le] = NX /8 for any X-link e. Now consider the case of Y -links. For a Y -link e, let Le,1 be the random variable denoting the number of packets crossing the link during the first phase. For two numbers 0 ≤ y, j < NY , let y ⊕ j denote (y + j) mod NY ; similarly, let y ⊖ j denote (y − j) mod NY . The Y -links come in two types: (i) links from a node ⟨x, a⟩ to the node ⟨x, a ⊕ 1⟩; (ii) links from a node ⟨x, a⟩ to the node ⟨x, a ⊖ 1⟩. Consider a link e of the first type. (The analysis for the second type of Y -links is similar.) The set of nodes that can potentially use the link e is given by {⟨x, a ⊖ i⟩ : 0 ≤ i ≤ NY /2}. Let ui denote the node ⟨x, a ⊖ i⟩ and let the intermediate node used by ui be ρ(ui) = ⟨x, yi⟩.
The packet from ui to the node ⟨x, yi⟩ crosses the link e if yi ∈ {a ⊕ 1, a ⊕ 2, . . . , a ⊕ (NY /2 − i)}. Since π(ui) is random, yi is also random. Therefore we have that

Pr[the packet from ui crosses e in the first phase] = ((NY /2) − i) / NY .

The expected number of packets that cross the link e in the first phase is

E[Le,1] = Σ_{i=0}^{NY /2} ((NY /2) − i) / NY = NY /8.
Hence, the expected load is E[Le,1] = NY /8. In the second phase of the communication, every packet traverses NY /2 Y -links. As there are NX · NY communicating pairs, the total load on the Y -links is NX · NY² /2. The number of Y -links is 2 · NX · NY , so the expected load put on each Y -link in the second phase is NY /4. Thus, for a Y -link e, the expected combined load on the link is E[Le] = NY /8 + NY /4 = 3NY /8.
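A minimal sketch of the basic SDR choice analyzed above (our helper names; assumes the column dimension Y is the shorter one):

```python
def sdr_intermediate(src, dst, NY):
    """Intermediate node of the basic SDR scheme: same column as the source,
    row farthest from the destination row, giving the second hop a DY of NY/2."""
    (x1, _y1), (_x2, y2) = src, dst
    return (x1, (y2 + NY // 2) % NY)

def ring_dist(a, b, n):
    # Shortest distance between a and b on a cycle of length n
    return min((a - b) % n, (b - a) % n)

u = sdr_intermediate((3, 1), (7, 2), NY=4)
assert u == (3, 0)                    # stays in the source column
assert ring_dist(u[1], 2, 4) == 2     # DY to the destination row is NY/2
```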
This shows that our routing algorithm will perform well on a highly asymmetric torus (i.e., when NX ≥ 3NY ). However, when the torus is not highly asymmetric (for instance, when NX ≤ 2NY ), the Y -links become the bottleneck.

4.2 Selective Routing
We saw that for torus interconnects that are not highly asymmetric, the Y -links become the bottleneck in the above routing scheme. To overcome this issue for such networks, we modify our routing algorithm to achieve load balancing on the X-links without overloading the Y -links. The idea is to use application-level routing only for a chosen fraction of the communicating pairs. The chosen pairs communicate via intermediate nodes, which are selected as before; the remaining pairs communicate directly, without going via intermediate nodes. Recall that for a communicating pair u1 and u2, the load imbalance is inversely proportional to DX(u1, u2). Therefore, we select the pairs having a small value of DX to communicate via intermediate nodes, while the remaining pairs are made to communicate directly. We order the communicating pairs in increasing order of their DX and choose the first α fraction to communicate via intermediate nodes; the rest communicate directly. The fraction α is a tunable parameter, which is determined based on the values NX and NY . When NX ≥ 3NY , we choose α = 1; otherwise, α is chosen based on the ratio NX /NY .

4.3 Extending to Higher Dimensions
Our routing algorithm can be extended to higher-dimensional tori. We now briefly sketch the case of a 3-dimensional asymmetric torus of size NX × NY × NZ . Without loss of generality, assume that NX ≥ NY ≥ NZ . Consider communicating packets from a source node u1 = ⟨x1, y1, z1⟩ to the destination node u2 = ⟨x2, y2, z2⟩. Being a three-dimensional torus, two natural choices exist for selecting the intermediate node: (i) u′ = ⟨x1, y1, (z2 + NZ /2) mod NZ ⟩; (ii) u″ = ⟨x1, (y2 + NY /2) mod NY , (z2 + NZ /2) mod NZ ⟩. We need to decide which type of intermediate node to use. If we choose u′ as the intermediate node, then the packets from the source to the intermediate node traverse only Z-links. On the other hand, if we choose u″, then those packets traverse both Y -links and Z-links. In case NY and NZ are both close to NX , we send all the packets directly, without using intermediate nodes (example: NX = NY = NZ ). In case only NY is close to NX , we use the first type of intermediate node (example: NX = NY = 2NZ ). Finally, if both NY and NZ are considerably smaller, we use the second type of intermediate node (example: NX = 2NY = 2NZ ).

4.4 Implementation Level Optimizations
We fine-tune our algorithm further by employing the following strategies, which were devised based on experimental studies. Chunk-based exchange (CBE): The source node divides the data to be sent to the destination into β smaller chunks. This allows the intermediate routing
node to forward the chunks received so far while other chunks are still being received, in a pipelined manner. Here β is a tunable parameter; our experimental evaluation suggests that β = 1024 gives the best results. Sending to nearby destinations: For pairs that are separated by a very small distance along the longer dimension, we choose not to perform intermediate-node routing: for all pairs with distance less than γ along the longer dimension, we send the data directly. Here γ is a tunable parameter; our experimental evaluation suggests that γ = 2 gives the best results. Handling routing-node conflicts: A node x may be selected as the intermediate routing node for more than one communicating pair, resulting in extra processing load on this node. To avoid this scenario, only one communicating pair is allowed to use x as an intermediate routing node. For the other communicating pairs, we look for intermediate nodes that are at distance less than δ from x. In case all such nodes are also allocated for routing, the pair is made to communicate directly. Here δ is a tunable parameter; our experimental evaluation suggests that δ = 2 gives the best results. Computing intermediate nodes: Note that our algorithm needs to analyze the permutation communication pattern and choose an α fraction of source-destination pairs for which intermediate-node routing has to be performed. This process is carried out on one of the nodes: the rest of the nodes send information regarding their destinations to this node, which sends back information about the intermediate node to be used, if any. In case the multi-phase algorithm involves more than one phase, the above process is carried out on different nodes in parallel for the different phases.
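Putting the pieces of Sections 4.1 and 4.2 together, the selection of the α fraction can be sketched as follows (an illustration under our naming, not the authors' code):

```python
def ring_dist(a, b, n):
    # Shortest distance between a and b on a cycle of length n
    return min((a - b) % n, (b - a) % n)

def select_for_indirect_routing(pairs, NX, alpha):
    """Pick the alpha fraction of source-destination pairs with the smallest
    X-distance DX; these are routed via intermediates, the rest go direct."""
    ranked = sorted(pairs, key=lambda p: ring_dist(p[0][0], p[1][0], NX))
    return ranked[:int(alpha * len(ranked))]

pairs = [((0, 0), (4, 1)), ((1, 0), (2, 3)), ((5, 1), (5, 2)), ((2, 2), (6, 0))]
chosen = select_for_indirect_routing(pairs, NX=8, alpha=0.5)
assert len(chosen) == 2
assert ((5, 1), (5, 2)) in chosen   # DX = 0: first candidate for indirect routing
```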
5 Experimental Evaluation
In this section, we present an experimental evaluation of our application-level routing algorithm on the Blue Gene/P supercomputer. We first present an overview of the Blue Gene supercomputer and then discuss our results.

5.1 Blue Gene Overview
The Blue Gene/P [10] is a massively parallel supercomputer comprising quad-core nodes. The nodes themselves are physically small, allowing for very high packaging density in order to realize an optimal cost-performance ratio. The Blue Gene/P uses five interconnect networks for I/O, debug, and various types of inter-processor communication. The most significant of these is the three-dimensional torus, which has the highest aggregate bandwidth and handles the bulk of all communication. Each node supports six independent 850 MBps bidirectional nearest-neighbor links, for an aggregate bandwidth of 5.1 GBps. The torus network uses both dynamic (adaptive) and deterministic routing with virtual buffering and cut-through capability.
Optimizing Matrix Transpose on Torus Interconnects
Fig. 3. Performance comparison of Base and Opt: (a) random mapping, (b) default mapping. Each panel plots performance (GB/s) against the number of nodes (32–2048).
5.2 Experimental Setup

All our experiments involve performing transpose operations on a matrix distributed over a Blue Gene system. We consider systems of sizes ranging from 32 to 2048 nodes. The torus dimensions of these systems are of the form d × d × 2d or d × 2d × 2d. The distributed matrix occupies 256–512 MB of data on each node. We consider two methods of mapping the MPI ranks to the physical nodes:

Default mapping: This is the traditional way of mapping MPI ranks to physical nodes on a torus interconnect. The ranks are allocated in a dimension-ordered manner, i.e., the coordinates of the physical node may be determined by examining contiguous bits in the binary encoding of the MPI rank. For example, the physical node for an MPI rank r may be determined using an XYZ ordering of dimensions as follows. The Z coordinate of the physical node would be r mod NZ, the Y coordinate would be (r/NZ) mod NY, and the X coordinate would be r/(NZ NY); here, divisions are integer divisions. The dimensions may be considered in other orders as well, for instance YXZ or ZXY.

Random mapping: In this case the MPI ranks are randomly mapped to the physical nodes. This results in a random permutation communication pattern. This is characteristic of the HPC Challenge Transpose benchmark [8], which is used to analyze the network performance of HPC systems.

5.3 Results
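Before turning to the results, the dimension-ordered (default) mapping described in the setup can be sketched with a short helper; `rank_to_xyz` is an illustrative name, not code from the paper.

```python
# Sketch of the dimension-ordered (XYZ) default mapping:
# MPI rank -> (X, Y, Z) torus coordinates, with NX, NY, NZ the torus
# dimensions. Consecutive ranks land on neighboring nodes along Z.
def rank_to_xyz(r, NX, NY, NZ):
    z = r % NZ
    y = (r // NZ) % NY
    x = r // (NZ * NY)
    assert x < NX, "rank out of range for this torus"
    return (x, y, z)
```

On an 8 × 8 × 16 torus (d × d × 2d with d = 8, i.e., 1024 nodes), rank 0 maps to (0, 0, 0) and rank 1023 to (7, 7, 15).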
We begin the results section by comparing the performance of the multiphase transpose algorithm (Base) with our application-level heuristic routing algorithm (Opt). This comparison is followed by an experimental study of the effects of varying the heuristic parameters α, β, γ, and δ. We also attempt to delineate the individual contribution of each optimization to the overall performance gain obtained by Opt. Comparison of Opt with Base: In our heuristic algorithm, for optimal performance, the parameter α was set to 0.5, β to 1024, γ to 2, and δ to 2. These parameters
Fig. 4. Contributions to Performance: (a) effect of α, (b) effect of β, (c) effect of γ, (d) effect of other factors, tabulated below.

Version                  Performance (GB/s)
Base                     59.64
Base with CBE            60.86
Opt with δ = 1, γ = 0    71.35
Opt with δ = 1, γ = 2    73.74
Opt                      75.98
have been obtained experimentally. The results are shown in Figure 3. The X-axis is the system size (number of nodes) and the Y-axis is the performance of the matrix transpose communication in GB/s. In Figure 3(a), the performance results are obtained by mapping the MPI tasks onto the nodes using the random mapping. We see that Opt provides significant improvements over Base; the gain varies from 12% for 32 nodes to 35% for 2048 nodes. In Figure 3(b), the performance results are obtained using the default mapping. Again, Opt provides significant improvements over Base; the gain ranges from 11% to 29%, with the best performance observed on 2048 nodes. In both graphs, we observe that the performance is significantly better for systems of size 1024 and 2048 than for the smaller system sizes. This is due to the fact that the underlying interconnection network for these system sizes is a torus, whereas for the smaller system sizes it is a mesh. Note that the number of paths between intermediate routing nodes and destination nodes is double in a torus compared to a mesh; an intermediate routing node in a mesh therefore offers only limited gain in the number of paths to the destination. Our results demonstrate the benefit of our application-level heuristic routing algorithm. The overheads involved in initially determining the intermediate nodes were found to be negligible.
Effect of α: The effect of varying α for 1024 nodes is shown in Figure 4(a). The performance improves as α is increased to about 1/2. This is expected, as the load gets balanced on the longer-dimension links. As we increase α further, the performance drops as the shorter-dimension links become more congested. Effect of β: Figure 4(b) shows the effect of varying β for 1024 nodes. As expected, the performance improves as β is increased, and saturates for large values of β. Effect of γ: The result of varying γ is shown in Figure 4(c) for 1024 nodes. In the extreme case γ = 0, i.e., when all communicating pairs try to use intermediate routing nodes, the performance is as low as 71 GB/s. The performance increases as γ is increased to 2. As γ is increased further, the performance begins to drop, as a large proportion of pairs communicate directly, thereby losing the advantage of SDR. We also conducted experiments to separately study the contributions of the different optimizations; these results are presented in Figure 4(d).
References
1. Azari, N., Bojanczyk, A., Lee, S.: Synchronous and asynchronous algorithms for matrix transposition on MCAP. In: Advanced Algorithms and Architectures for Signal Processing III. SPIE, vol. 975, pp. 277–288 (1988)
2. Blackford, L.S., Choi, J., Cleary, A., Petitet, A., Whaley, R.C., Demmel, J., Dhillon, I., Stanley, K., Dongarra, J., Hammarling, S., Henry, G., Walker, D.: ScaLAPACK: a portable linear algebra library for distributed memory computers – design issues and performance. In: Supercomputing 1996: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (1996)
3. Choi, J., Dongarra, J., Walker, D.: Parallel matrix transpose algorithms on distributed memory concurrent computers. Parallel Comp. 21(9), 1387–1405 (1995)
4. Eklundh, J.: A fast computer method for matrix transposing. IEEE Trans. Comput. 21(7), 801–803 (1972)
5. Johnsson, S., Ho, C.: Algorithms for matrix transposition on Boolean n-cube configured ensemble architectures. SIAM J. Matrix Anal. Appl. 9(3) (1988)
6. Kumar, V.: Introduction to Parallel Computing (2002)
7. Leary, D.: Systolic arrays for matrix transpose and other reorderings. IEEE Transactions on Computers 36, 117–122 (1987)
8. Luszczek, P., Dongarra, J., Koester, D., Rabenseifner, R., Lucas, B., Kepner, J., McCalpin, J., Bailey, D., Takahashi, D.: Introduction to the HPC Challenge benchmark suite. Tech. rep. (2005)
9. Räcke, H.: Survey on oblivious routing strategies. In: CiE 2009: Proceedings of the 5th Conference on Computability in Europe, pp. 419–429 (2009)
10. IBM Journal of Research and Development staff: Overview of the IBM Blue Gene/P project. IBM J. Res. Dev. 52(1/2), 199–220 (2008)
11. Thakur, R., Rabenseifner, R.: Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications 19, 49–66 (2005)
12. Valiant, L.: A scheme for fast parallel communication. SIAM J. Comput. 11, 350–361 (1982)
Mobile and Ubiquitous Computing

Gregor Schiele¹, Giuseppe De Pietro¹, Jalal Al-Muhtadi², and Zhiwen Yu²
¹ Topic Chairs
² Members
The tremendous advances in wireless networks, mobile computing, and sensor networks, along with the rapid growth of small, portable, and powerful computing devices, offer opportunities for pervasive computing and communications. Topic 14 deals with cutting-edge research in various aspects of the theory and practice of mobile computing and wireless and mobile networking, including architectures, algorithms, networks, protocols, modeling and performance, applications, services, and data management.

This year, we received 11 submissions for Topic 14. Each paper was peer reviewed by at least three reviewers, and we selected 7 regular papers. The accepted papers discuss very interesting issues in wireless ad hoc networks, mobile telecommunication systems, and sensor networks.

In their paper “cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks”, Huanyu Zhao, Xin Yang, and Xiaolin (Andy) Li describe a novel trust aggregation scheme for cyclic MANETs. The second paper, “On Deploying Tree Structured Agent Applications in Embedded Systems” by Nikos Tziritas, Thanasis Loukopoulos, Spyros Lalis, and Petros Lampsas, presents a distributed algorithm for arranging communicating agents over a set of wireless nodes in order to optimize the deployment of embedded applications. The third paper, by Nicholas Loulloudes, George Pallis, and Marios Dikaiakos, is entitled “Caching Dynamic Information in Vehicular Ad-Hoc Networks”. It proposes an approach based on caching techniques for minimizing the network overhead imposed by Vehicular Ad Hoc Networks and for assessing the performance of Vehicular Information Systems. The fourth paper, “Meaningful Metrics for Evaluating Eventual Consistency” by Joao Pedro Barreto and Paulo Ferreira, analyses different metrics for evaluating the effectiveness of eventually consistent systems.
In the fifth paper, “Collaborative GSM-based Location”, David Navalho and Nuno Preguica examine how information sharing among nearby mobile devices can be used to improve the accuracy of GSM- or UMTS-based location estimation. The sixth paper, “@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks” by Jose Mocito, Luis Rodrigues, and Hugo Miranda, proposes an adaptive routing system in which each node is able to adapt the routing process dynamically with respect to the current system context; the approach integrates multiple routing protocols in a single system. Finally, the paper “Maximizing Growth Codes Utility in Large-scale Wireless Sensor Networks” by Zhao Yao and Xin Wang extends existing work on robust information distribution in wireless sensor networks using Growth Codes by loosening assumptions made in the original approach, which allows Growth Codes to be applied to a wider range of applications.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 452–453, 2010. © Springer-Verlag Berlin Heidelberg 2010
We would like to take the opportunity to thank all authors who submitted a contribution, the Euro-Par Organizing Committee, and all reviewers for their hard and valuable work. Their efforts made this conference and this topic possible.
cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks

Huanyu Zhao, Xin Yang, and Xiaolin Li

Scalable Software Systems Laboratory, Department of Computer Science, Oklahoma State University, Stillwater, OK 74078, USA
{huanyu,xiny,xiaolin}@cs.okstate.edu
Abstract. In a Cyclic Mobile Ad Hoc Network (CMANET), where nodes move cyclically, we formulate trust management problems and propose the cTrust scheme to handle trust establishment and aggregation. Unlike trust management in conventional peer-to-peer (P2P) systems, trust management in MANETs is based on simple neighbor trust relationships and is location and time dependent. In this paper, we focus on the trust management problem in highly mobile environments. We model trust relations as a trust graph in a CMANET to enhance the accuracy and efficiency of trust establishment among peers. Leveraging a stochastic distributed Bellman-Ford-based algorithm for fast and lightweight aggregation of trust scores, the cTrust scheme is a decentralized and self-configurable trust aggregation scheme. We use the NUS student contact patterns derived from campus schedules as our CMANET communication model. The analysis and simulation results demonstrate the efficiency, accuracy, and scalability of the cTrust scheme. With increasing scales of ad hoc networks and complexities of trust topologies, cTrust scales well with marginal overheads.
1 Introduction
Research in Mobile Ad Hoc Networks (MANETs) has made tremendous progress in fundamental protocols, routing, packet forwarding and data gathering algorithms, and systems. Different from conventional networks, in MANETs nodes carry out routing and packet forwarding functions themselves, so that nodes act as both terminals and routers. The increasing popularity of these infrastructure-free systems with autonomous peers and communication paradigms has made MANETs prone to selfish behaviors and malicious attacks. MANETs are inherently insecure and untrustworthy. In MANETs, each peer is free to move independently, and will therefore change its connections to other peers frequently, which results in a very high rate of network topology changes. Communication is usually multi-hop, and each node may forward traffic unrelated to its own use. The transmission power, computational ability, and available bandwidth of each node in a MANET are limited. In reality, we notice that a large part of MANET
The research presented in this paper is supported in part by National Science Foundation (grants CNS-0709329, CCF-0953371, OCI-0904938, CNS-0923238).
P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 454–465, 2010. c Springer-Verlag Berlin Heidelberg 2010
peers have cyclic movement traces and can be modeled as Cyclic Mobile Ad Hoc Networks (CMANETs), defined as MANETs in which nodes' mobility is cyclic [5,7]. In this paper, we focus on the trust management problem in CMANETs. Conventional centralized trust establishment approaches are not well suited to CMANET scenarios, and to the best of our knowledge, little research has investigated trust issues in CMANETs. Trust establishment in CMANETs is still an open and challenging topic. Unlike P2P trust, trust in CMANETs is based on simple neighbor trust relationships, and it is also location and time dependent. We propose a trust aggregation scheme called cTrust for the aggregation of distributed trust information in completely decentralized and highly dynamic CMANET environments. Our contributions in this work are multifold. (1) We model the movement patterns and trust relationships in a CMANET as a trust graph, and we model the most trustable path finding process as a Markov Decision Process (MDP). (2) We propose a trust transfer function, a value iteration function, and a distributed trust aggregation algorithm to solve the most trustable path finding problem. This algorithm leverages a stochastic Markov-chain-based distributed Bellman-Ford algorithm, which greatly reduces the message overhead. It requires only localized communication between neighbor peers, and it captures a concise snapshot of the whole network from each peer's perspective. (3) We design evaluation metrics for cTrust. Using random and scale-free trust topologies, we conduct extensive experimental evaluations based on real NUS student campus movement trace data. The rest of the paper is structured as follows. Section 2 presents related work. Section 3 proposes the trust model and the stochastic distributed cTrust aggregation algorithm. We present simulation results exploring the performance of cTrust in Section 4. Section 5 concludes the paper.
2 Related Work
The EigenTrust scheme proposed by Kamvar et al. obtains a global trust value for each peer by calculating the eigenvalue of a trust ratings matrix [4]. Xiong and Liu developed a reputation-based trust framework, PeerTrust [11]. Zhou and Hwang proposed the PowerTrust system for DHT-based P2P networks [14]. H. Zhao and X. Li proposed the concept of a trust vector and a trust management scheme, VectorTrust, for the aggregation of distributed trust scores [13]; they also proposed a group trust rating aggregation scheme, HTrust, using the H-Index technique [12]. In the field of MANETs, Sonja Buchegger and Jean-Yves Le Boudec proposed a reputation scheme to detect misbehavior in MANETs [1]. Their scheme is based on a modified Bayesian estimation method. Buchegger and colleagues also proposed a self-policing reputation mechanism [2]. That scheme is based on nodes' local observations, and it leverages second-hand trust information to rate and detect misbehaving nodes. The CORE system adopts a reputation mechanism to achieve node cooperation in MANETs [6]. The goal of the CORE
system is to prevent nodes' selfish behavior. [1], [2] and [6] mainly deal with the identification and isolation of misbehaving nodes in MANETs, but the mobility of MANETs is not fully addressed in this previous work. Yan Sun et al. considered trust as a measure of uncertainty and presented a formal model to represent, model, and evaluate trust in MANETs [9]. Ganeriwal et al. extended the trust scheme application scenario to sensor networks and built a trust framework for sensor networks [3]. Another ad hoc trust scheme is [10], where a trust confidence factor was proposed.
3 cTrust Scheme

3.1 Trust Graph in CMANETs
In CMANETs, nodes have short radio range, high mobility, and uncertain connectivity. Two nodes are able to communicate only when they reach each other's transmission range. When two nodes meet at a particular time, they have a contact probability P (P ∈ [0, 1]) that they make contact or start some transactions. The cyclic movement trace graph of a CMANET consisting of three nodes is shown in Figure 1. The unit time is set to 10. Each peer i moving cyclically has motion cycle time Ci. We can tell from the trace that CA = 30, CB = 30 and CC = 20. The system motion cycle time CS is the Least Common Multiple (LCM) of all the peers' motion cycle times in the network: CS = lcm(CA, CB, CC) = 60. Note that CMANET movement traces are not required to follow particular shapes. "Cyclic" means that if two nodes meet at time T0, they have a high probability of meeting again after every time period TP. We represent the movement traces in this paper as shapes to ease presentation and understanding.
Fig. 1. CMANET Movement Trace Snapshots for One System Cycle (CS = 60); panels (a)–(f) show the positions of nodes A, B, and C at T = 0, 10, 20, 30, 40, 50.
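The cycle arithmetic behind this example (per-node cycle times, the LCM-based system cycle, and the appearance times of Equation (1)) can be sketched as follows; the helper names are illustrative, not from the paper.

```python
# Sketch of the cycle arithmetic: CS = lcm of all motion cycle times,
# and appearance times T_i{Loc} = T0_i{Loc} + C_i * n (Equation (1)).
from math import gcd
from functools import reduce

def lcm(a, b):
    return a * b // gcd(a, b)

def system_cycle(cycles):
    """System motion cycle time CS = lcm of all peers' cycle times."""
    return reduce(lcm, cycles)

def appearance_times(t0, cycle, count):
    """First `count` appearance times of a node at one location."""
    return [t0 + cycle * n for n in range(count)]
```

For the example above, `system_cycle([30, 30, 20])` gives 60, matching CS in the text, and `appearance_times(10, 30, 3)` gives the first appearances [10, 40, 70] of state A/(10+30n).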
Fig. 2. Trust Graph. The vertices in the graph correspond to peer states (time and location) in the system. A directed solid edge shows an initial trust relation (trust rating and trust direction) and contact time. A dashed line shows a node's movement trace. Each peer maintains a local trust table to store trust information.
To describe the CMANET system features and trust relationships, we combine the snapshot graph and the trust relationships into a directed trust graph, as shown in Figure 2. Each node is represented by several states, based on its periodically appearing locations, which form the vertices of the graph. We represent a state Xi as i/Ti{Loc}, where i is the node ID and Ti{Loc} is the appearance time for a particular location. The appearance time is given by

Ti{Loc} = T0i{Loc} + Ci × n  (n = 0, 1, 2, . . .)   (1)
where T0i{Loc} is the first time node i appears at this location and Ci is node i's motion cycle time. For example, node A appears at three locations, so in the trust graph node A is represented by three states: A/TA{Loc0}, A/TA{Loc1} and A/TA{Loc2}. Following Equation (1), we have TA{Loc0} = 0 + CA × n = 30n, TA{Loc1} = 10 + CA × n = 10 + 30n, TA{Loc2} = 20 + CA × n = 20 + 30n (n = 0, 1, 2, . . .). The three states generated by node A are thus A/30n, A/(10+30n), and A/(20+30n). State Xi's one-hop direct trust neighbors are represented by the set H(Xi), e.g., H(B/30n) = {A/30n, C/(10 + 20n)}. The directed dashed lines between states show nodes' movement traces as state transfer edges in the trust graph. The initial trust relationships are shown by the solid directed edges in the graph. There is an edge directed from peer i to peer j if and only if i has a trust rating on j. The value Ri,j (Ri,j ∈ [0, 1]) reflects how much i trusts j, where Ri,j = 0 indicates that i distrusts (or has never met) j and Ri,j = 1 indicates that i fully trusts j. The trust between different states of the same node i is Ri,i = 1. The trust rating is personalized, which means each peer keeps its own self-policing trust view of other peers rather than relying on a global reputation value per peer. We adopt personalized trust ratings because in an open and decentralized CMANET
environment, peers do not have any centralized infrastructure to maintain a global reputation. The trust rating value on a solid edge could be obtained by applying different functions that weigh the importance, date, and service quality of all historical transactions between two peers. How to rate a service and how to generate and normalize accurate direct trust ratings are not within the scope of this paper; we assume the normalized trust ratings have already been generated. What we study in this paper is the trust aggregation/propagation process in an ad hoc network with high mobility. Besides trust ratings, each edge is also labeled with a time function showing when two nodes can communicate over this trust link. The appearance time for each link is given by Equation (2):

TRi,j = T0Ri,j + lcm(Ci, Cj) × n  (n = 0, 1, 2, . . .)   (2)
where Ci and Cj are the relevant nodes' motion cycle times and T0Ri,j is the first time they meet over this link. The solid edges are labeled Ri,j/TRi,j. The system trust graph shows all the trust relationships, the movement traces, and the possible contacts of the network. For example, setting T = 0 in Figure 2, we obtain the snapshot trace of Figure 1(a) and the appearing trust links.

3.2 Trust Path Finding Problems in CMANETs
In the cTrust system, each peer maintains a local trust table. The trust table consists of the remote peer ID as the entry key, the trust rating for each possible remote peer, and the next hop to reach the remote peer. Each entry stores only the next hop instead of the whole trust path. Initially, peers' trust tables contain only the trust information from their one-hop direct experience. Due to communication range and power constraints, peers are not able to communicate with remote peers directly. Suppose peer i wishes to start a transaction with remote peer k; i wishes to infer an indirect trust rating for peer k in order to check k's reputation. In cTrust, trust transfer is defined as follows. Definition 3.1 (Trust Transfer): If peer i has a trust rating Ri,j towards peer j, and peer j has trust rating Rj,k towards peer k, then peer i has indirect trust Ri,k = Ri,j ⊗ Rj,k towards peer k. In Definition 3.1, Ri,j and Rj,k can each be either direct or indirect trust. Besides the trust rating, peer i also wishes to find a trustable path and rely on multi-hop communication to finish the transaction. Among a set of paths between i and k, i tends to choose the Most Trustable Path (MTP). Definition 3.2 (Most Trustable Path): The most trustable path from peer i to peer k is the trust path yielding the highest trust rating Ri,k. The MTP rating is computed as the maximal ⊗-product of all directed edges along a path, and this product is taken as i's trust rating towards peer k. The MTP provides a trustable communication path and is used to launch multi-hop transactions with an unfamiliar target peer. The cTrust scheme solves the trust rating transfer and MTP finding problems in CMANETs.
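The path-rating idea behind Definitions 3.1 and 3.2 can be illustrated with a small sketch that enumerates simple paths and picks the one with the maximal ⊗-product. Plain multiplication stands in for the generic operator ⊗ here purely for illustration; the concrete transfer function used by cTrust is given by Equation (7).

```python
# Illustrative MTP search over a tiny trust graph, using brute-force
# enumeration of simple paths (fine for small examples only).
def all_paths(graph, src, dst, path=None):
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, {}):
        if nxt not in path:  # simple paths only
            yield from all_paths(graph, nxt, dst, path)

def mtp(graph, src, dst):
    """Return (best_rating, best_path) maximizing the product of edge ratings."""
    def rating(path):
        r = 1.0
        for a, b in zip(path, path[1:]):
            r *= graph[a][b]
        return r
    paths = list(all_paths(graph, src, dst))
    if not paths:
        return 0.0, None
    best = max(paths, key=rating)
    return rating(best), best
```

For `graph = {'i': {'j': 0.9, 'k': 0.5}, 'j': {'k': 0.8}}`, the two-hop path i → j → k (rating 0.72) beats the direct edge i → k (rating 0.5), so it is the MTP.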
3.3 Markov Decision Process Model
A Markov Decision Process (MDP) is a discrete-time stochastic control process consisting of a set of states. In each state there are several actions to choose from. For a state x and an action a, the state transition function Px,x′ determines the transition probabilities to the next state. A reward is also earned for each state transition. We model the MTP finding process as an MDP and propose value iteration to solve the MTP finding problem. Theorem 3.1: The MTP finding process is a Markov Decision Process. Proof: For a sequence of random node states in the trust graph X1, X2, X3, . . . , Xt, the trust path has the following relation:

Pr(Xt+1 = x | X1 = x1, X2 = x2, . . . , Xt = xt) = Pr(Xt+1 = x | Xt = xt)   (3)
Equation (3) indicates that the state transitions of a trust path possess the Markov property: the future states depend only on the present state and are independent of past states. The components required in an MDP are defined by the following notation:
– S: the state space of the MDP, i.e., the node state set in the trust graph.
– A: the action set of the MDP, i.e., the state transition decisions.
– Px,x′ = Pr(Xt+1 = x′ | Xt = x, at = a): the probability that action a in node state x at time t leads to node state x′ at time t + 1.
– Rx,x′: the reward received after the transition from state x to state x′ with transition probability Px,x′. In our scheme, Rx,x′ is expressed in terms of trust ratings.
The state transition probability from state x to state x′ is computed by normalizing all of x's outgoing trust links (trust ratings):

Px,x′ = Pr(Xt+1 = x′ | Xt = x) = Rx,x′ / Σy∈H(x) Rx,y   (4)
In each node state, the next-state probabilities sum to one. The trust path finding process is a stochastic process in which all state transitions are probabilistic. The goal is to maximize the cumulative trust rating of the whole path, i.e., the expected ⊗-product from the source peer to the destination peer:

γRs1,s2 ⊗ γ^2 Rs2,s3 ⊗ γ^3 Rs3,s4 ⊗ . . . ⊗ γ^t Rst,st+1   (5)
where γ is the discount rate, satisfying 0 ≤ γ ≤ 1; it is typically close to 1. Therefore, the MTP finding process is an MDP (S, A, P·,·, R·,·). The solution to this MDP can be expressed as a trust path π (the MTP). The standard algorithm to calculate the policy π is the value iteration process.
3.4 Value Iteration
Section 3.2 presented the trust transfer function Ri,k = Ri,j ⊗ Rj,k. The upper bound for Ri,j ⊗ Rj,k is min(Ri,j, Rj,k), because the combination of trusts cannot exceed either original trust. Ri,j ⊗ Rj,k should also be larger than Ri,j × Rj,k, which avoids trust ratings dropping too quickly during trust transfer. The discount rate γ (γ ∈ [0, 1]) determines the importance of remote trust information. The trust transfer function Ri,j ⊗ Rj,k thus needs to meet the following condition:

Ri,j × γRj,k ≤ Ri,j ⊗ Rj,k ≤ min(Ri,j, γRj,k)   (6)
In the cTrust scheme, we set the trust transfer function as

Ri,j ⊗ Rj,k = min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na)   (7)

where the exponent 1/na denotes the na-th root.
We prove that the given function meets condition (6). Proof: Since max(Ri,j, γRj,k) ≤ 1,

min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na) ≤ min(Ri,j, γRj,k),

which establishes the upper bound. For the lower bound,

min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na)
= [min(Ri,j, γRj,k) × max(Ri,j, γRj,k)] / (max(Ri,j, γRj,k))^((na−1)/na)
= (Ri,j × γRj,k) / (max(Ri,j, γRj,k))^((na−1)/na)
≥ Ri,j × γRj,k.

Therefore, the trust transfer function (7) meets condition (6).
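A minimal sketch of the transfer function (7), with a numerical check of the bounds from condition (6); `trust_transfer` is an illustrative name, and the default parameter values follow the simulation setup described in Section 4.

```python
# Sketch of the trust transfer function of Equation (7):
#   R_ij (x) R_jk = min(R_ij, g*R_jk) * max(R_ij, g*R_jk)**(1/na)
# gamma is the discount rate, na the adjusting factor.
def trust_transfer(r_ij, r_jk, gamma=1.0, na=9):
    lo, hi = sorted((r_ij, gamma * r_jk))
    return lo * hi ** (1.0 / na)

# Condition (6): the result must lie between the discounted product
# (lower bound) and the discounted minimum (upper bound).
r = trust_transfer(0.8, 0.6, gamma=0.9, na=9)
assert 0.8 * 0.9 * 0.6 <= r <= min(0.8, 0.9 * 0.6)
```

Larger `na` pushes the result toward the upper bound min(Ri,j, γRj,k); `na = 1` recovers the plain discounted product.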
By tuning the adjusting factor na (na = 1, 2, 3, . . .), Ri,j ⊗ Rj,k can slide between the upper and lower bounds. In each round of the iteration, the trust table of each node is updated by choosing an action (the next-hop state in the trust graph). The value iteration is executed concurrently for all nodes. It compares new information with the old trust information and corrects the trust tables based on the new information. The trust tables associated with the nodes are updated iteratively until they converge. Based on the trust transfer function, the value iteration function is:

Ri,k = max( Ri,k , α · min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na) )   (8)
where Ri,k is the trust rating towards peer k in peer i's local trust table, Ri,j is the direct link trust, and Rj,k is the received trust information towards peer k. α (α ∈ [0, 1]) is the learning rate, which determines to what extent newly acquired trust information replaces the old trust rating. A learning rate α = 0 indicates that the node does not learn anything, and α = 1 indicates that the node fully trusts and learns the new information. At the convergence status of the value iteration, each peer's trust table contains the trust rating of the MTP.

3.5 cTrust Distributed Trust Aggregation Algorithm
In the initial stage of an evolving CMANET, pre-set direct trust ratings are stored in local trust tables. However, the direct trust information is limited and
does not cover all potential interactions. The distributed trust aggregation algorithm gathers trust ratings for any peer in a network (Algorithm 1). In this algorithm, each trust path is aggregated into the MTP with the highest trust rating towards the target peer. Indirect trust information is added to trust tables and updated as the aggregation process evolves. The algorithm is implemented based on the distributed Bellman-Ford algorithm. Updates are performed periodically: peers retrieve one of their direct trust neighbors' trust tables, replace existing trust ratings with higher ones in their local trust tables, and record the relevant neighbors as the next hops.

Algorithm 1. Distributed Trust Aggregation Algorithm in CMANETs
1: Initialize local trust tables.
2: for each time slot do
3:   for each peer i ∈ CMANET do
4:     Find i's direct trust neighbor set H(i).
5:     if H(i) ≠ ∅ then
6:       Normalize transition probabilities by Equation (4).
7:       Decide on one target node j in H(i) according to the transition probabilities.
8:       Send a trust table request to j (the contact between i and j happens with contact probability P).
9:       Receive incoming trust tables.
10:      Relax each trust table entry by the value iteration function (8); update next-hop peers.
11:      If a trust table request arrives from another peer, send the trust table back.
12:    end if
13:  end for
14: end for
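A simplified, synchronous sketch of Algorithm 1: each round, every peer pulls one neighbor's trust table and relaxes its own entries with the value iteration function (8). Movement, contact probabilities, and the probabilistic neighbor choice of Equation (4) are simplified away (a uniformly random neighbor is used); `aggregate` and `relax` are illustrative names, and the parameter defaults follow the simulation setup (α = 1, γ = 1, na = 9).

```python
# Hedged sketch of the distributed trust aggregation loop.
import random

def relax(r_ik, r_ij, r_jk, alpha=1.0, gamma=1.0, na=9):
    """Value iteration function (8) for one trust table entry."""
    lo, hi = sorted((r_ij, gamma * r_jk))
    return max(r_ik, alpha * lo * hi ** (1.0 / na))

def aggregate(direct, rounds=10, seed=0):
    """direct: {(i, j): rating} of direct trust links. Returns per-peer tables."""
    rng = random.Random(seed)
    peers = sorted({p for edge in direct for p in edge})
    table = {i: {j: r for (a, j), r in direct.items() if a == i} for i in peers}
    for _ in range(rounds):
        for i in peers:
            neigh = [j for (a, j) in direct if a == i]
            if not neigh:
                continue
            j = rng.choice(neigh)             # one neighbor per time slot
            for k, r_jk in table[j].items():  # relax each received entry
                if k == i:
                    continue
                table[i][k] = relax(table[i].get(k, 0.0), direct[(i, j)], r_jk)
    return table
```

On the two-link chain A → B → C with ratings 0.9 and 0.8, A converges to the indirect rating 0.8 × 0.9^(1/9) towards C, exactly the transfer function (7) applied once.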
4 Experimental Evaluation

4.1 Experiment Setup
CMANET Contact Pattern Model. We construct an unstructured network based on the NUS student trace. The contact pattern data comes from the class schedules for the Spring semester of 2006 at the National University of Singapore (NUS), covering 22341 students and 4875 sessions [7,8]. For each enrolled student, we have her/his class schedule, which gives us extremely accurate information about the contact patterns among students over large time scales. The contact patterns among students inside classrooms were presented in [7]. Following class schedules, students move around campus and meet each other when they are in the same session. The trace data set considers only student movements during business hours, and ignores contacts made while students hang around campus for various activities outside of class. The time is compressed by removing any idle time slots without active sessions, so contacts take place only in classrooms. Two students are within communication range of each other if and
H. Zhao, X. Yang, and X. Li
only if they are in the same classroom at the same time. The sessions can be regarded as classes; the unit time is one hour, and a session may last multiple hours. The NUS contact patterns can thus be modeled as a CMANET. In our experiment, 100 to 1000 students are randomly chosen to simulate 100 to 1000 moving peers in the CMANET. Following her/his class schedule, each student moves cyclically among classrooms. The contact probability P is set to 0.9, which indicates that when two nodes meet, they communicate with probability 0.9. We considered all 4,875 sessions in the data set. The whole system cycle (CS) is 77 hours (time units). Trust Topology Model. The random trust topology and the scale-free trust topology are used to establish trust relationships in this simulation. In the random trust topology, the trust outdegree of a peer follows a normal distribution with mean μd = 20, 25, 30 and variance σd² = 5; all peers have similar numbers of initial trust links. Under the scale-free trust topology, highly active peers possess large numbers of trust links, while most other peers have only a few. The number of trust links follows a power-law distribution with scaling exponent k = 2. Parameter Setting. The network is configured with 100 to 1000 nodes. The network complexity is represented by the nodes' average outdegree d; a complexity of d = 20, 25, 30 indicates that the initial outdegree of a node is, on average, 20, 25, 30. A peer's real behavior is represented by a rating score r ∈ [0, 1] drawn from a pre-set normal distribution (μr = 0.25, 0.75; σr² = 0.2). As mentioned in Section 3, we assume the accurate direct trust ratings have already been generated. This is reasonable because any trust inference scheme must rely on an accurate trust rating scheme; it is meaningless to study inferred trust if the underlying direct trust ratings are not reliable.
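For illustration only, the two initial trust topologies described above can be sampled as follows; the function names, the clipping to valid outdegrees, and the sampling details are assumptions, not from the paper:

```python
import random

def random_topology_outdegrees(n, mu=25, sigma=5):
    """Random trust topology: outdegree ~ Normal(mu, sigma^2), clipped to [0, n-1]."""
    return [max(0, min(n - 1, round(random.gauss(mu, sigma)))) for _ in range(n)]

def scale_free_outdegrees(n, k=2.0):
    """Scale-free trust topology: outdegree d drawn with P(d) proportional to d^(-k), k = 2."""
    support = range(1, n)
    weights = [d ** -k for d in support]
    return random.choices(support, weights=weights, k=n)
```

The power-law draw concentrates most peers at small outdegrees while allowing a few highly connected "power" peers, matching the description above.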
So in our simulation, the direct trust rating R ∈ [0, 1] is generated with a normal distribution (μR = r, σR² = 0.1) based on the peer's real behavior score r. The parameters of the iteration function are set to learning rate α = 1, discount factor γ = 1, and adjusting factor na = 9. To measure the performance under dynamic models, new transactions are continuously generated according to a Poisson distribution with an arrival rate of λ = 10 to 50 transactions per service cycle, between a random source node and a random destination node. New nodes randomly join the network, and peers also randomly leave or die. In such a dynamic model, as in a real mobile ad hoc network, it is hard to achieve a strict convergence status. So the convergence in our simulation is ε-convergence, defined as the condition that the variance between any peer's two consecutive trust tables is smaller than a pre-set threshold ε = 0.02.
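The ε-convergence test can be sketched as follows; reading the paper's "variance" as the maximum entry-wise change between two consecutive trust tables is an assumption:

```python
def eps_converged(prev_table, cur_table, eps=0.02):
    """ε-convergence check: no trust rating moved by more than eps between
    two consecutive trust tables (treating the paper's 'variance' as the
    maximum entry-wise change, an assumption). Tables map target -> rating;
    missing entries are treated as 0."""
    targets = set(prev_table) | set(cur_table)
    return all(abs(cur_table.get(t, 0.0) - prev_table.get(t, 0.0)) <= eps
               for t in targets)
```

The whole network is declared converged once every peer passes this check between consecutive aggregation rounds.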
4.2 Results and Analysis
Convergence Time. The convergence time is measured in terms of the number of time units needed to achieve -convergence status. Figure 3 shows that cTrust only needs a small number of aggregation cycles before convergence. We also observe that convergence time increases as network complexity increases. As
cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks
Fig. 3. Convergence Time: (a) Random Trust Topology, (b) Scale-Free Trust Topology (convergence time vs. network size, d = 20, 25, 30)
Fig. 4. Message Overhead: (a) Random Trust Topology, (b) Scale-Free Trust Topology (average messages vs. network size, d = 20, 25, 30)
network size N increases, convergence time increases relatively slowly (O(n)), showing that cTrust scales satisfactorily. We also observe that the trust topologies do not affect the convergence time as much as the network size does. Communication Message Overhead. Figure 4 shows the average communication message overhead per peer to achieve convergence. cTrust greatly reduces the message overhead by using the MDP model: in each iteration, each node only receives the trust table of one of its most trusted neighbors (Equation (4)). The message overhead grows slowly as the network size grows, showing that cTrust is a lightweight scheme. In a network with high complexity, the cTrust system incurs more message overhead. In a typical cTrust network, the average message overhead is affected only by the network size N and complexity d, not by the trust topology; as a result, the overhead curves for both topologies appear similar. Average Trust Path Length. Figure 5 shows the average length of a trust path from a source peer to a destination peer in the convergence status. Generally, the trust path length increases with the network size and complexity, which indicates that peers gain more remote trust information. In the scale-free trust topology, the trust path length is greatly reduced: most peers have only a few connections while some power peers control many links, making trust information harder to spread. In a complex network where trust information can spread farther, there are more long trust paths involving more trust transfers.
Fig. 5. Average Trust Path Length: (a) Random Trust Topology, (b) Scale-Free Trust Topology (average path length vs. network size, d = 20, 25, 30)
Fig. 6. Aggregation Accuracy: (a) Random Trust Topology, (b) Scale-Free Trust Topology (accuracy vs. network size, d = 20, 25, 30)
Accuracy. cTrust aggregation accuracy is measured by comparing all the inferred trust ratings with peers' real behavior scores; the similarity is taken as the aggregation accuracy. As shown in Figure 6, on average, cTrust aggregation accuracy stays above 90%. This result is very encouraging because cTrust is a personalized trust system using inferred (not direct) trust, and the information each node can access in a CMANET is limited. As the network complexity increases, the accuracy decreases: in complex networks there are more long trust paths involving more trust transfers, and these multi-hop relationships lower the accuracy of the inferred trust ratings. The accuracy in the scale-free trust topology is slightly higher than in the random trust topology. One reason is that in the scale-free trust topology the average trust path is shorter, which leads to higher accuracy in trust transfer.
5 Conclusion
We have presented the cTrust scheme for CMANETs. The cTrust scheme aims to provide a common framework for trust inference in a CMANET trust landscape. We presented the trust transfer function, the trust value iteration function, and the cTrust distributed trust aggregation algorithm. To validate the proposed algorithms and protocols, we conducted an extensive evaluation based on the NUS student trace data. The experimental results demonstrate that cTrust
aggregates trust efficiently. Its convergence time increases slowly with network size, and its message overhead is modest. Trust information spreads quickly and extensively in CMANETs, and the trust rating inference accuracy of the cTrust scheme is over 90%. We believe that cTrust establishes a solid foundation for designing trust-enabled applications and middleware in CMANETs.
References
1. Buchegger, S., Le Boudec, J.-Y.: A robust reputation system for mobile ad-hoc networks. Technical report IC/2003/50, EPFL-IC-LCA (2003)
2. Buchegger, S., Le Boudec, J.-Y.: Self-policing mobile ad hoc networks by reputation systems. IEEE Communications Magazine 43(7), 101–107 (2005)
3. Ganeriwal, S., Balzano, L.K., Srivastava, M.B.: Reputation-based framework for high integrity sensor networks. ACM Trans. Sen. Netw. 4(3), 1–37 (2008)
4. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The EigenTrust algorithm for reputation management in P2P networks. In: Proceedings of the 12th International Conference on World Wide Web (WWW), Budapest, Hungary, May 20-24, pp. 640–651 (2003)
5. Liu, C., Wu, J.: Routing in a cyclic mobispace. In: MobiHoc 2008: Proceedings of the 9th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 351–360 (2008)
6. Michiardi, P., Molva, R.: CORE: A collaborative reputation mechanism to enforce node cooperation in mobile ad hoc networks. In: Sixth IFIP Conference on Security, Communications, and Multimedia (CMS 2002), Portoroz, Slovenia (2002)
7. Srinivasan, V., Motani, M., Ooi, W.T.: Analysis and implications of student contact patterns derived from campus schedules. In: MobiCom 2006: Proceedings of the 12th Annual International Conference on Mobile Computing and Networking, New York, NY, USA, pp. 86–97 (2006)
8. Srinivasan, V., Motani, M., Ooi, W.T.: CRAWDAD data set nus/contact, v. 2006-08-01 (August 2006), http://crawdad.cs.dartmouth.edu/nus/contact
9. Sun, Y.L., Yu, W., Han, Z., Liu, K.J.R.: Information theoretic framework of trust modeling and evaluation for ad hoc networks. IEEE Journal on Selected Areas in Communications 24(2) (2006)
10. Theodorakopoulos, G., Baras, J.: On trust models and trust evaluation metrics for ad hoc networks. IEEE Journal on Selected Areas in Communications 24(2) (February 2006)
11. Xiong, L., Liu, L.: PeerTrust: Supporting reputation-based trust for peer-to-peer electronic communities. IEEE Transactions on Knowledge and Data Engineering 16(7), 843–857 (2004)
12. Zhao, H., Li, X.: H-Trust: A robust and lightweight group reputation system for peer-to-peer desktop grid. Journal of Computer Science and Technology (JCST) 24(5), 833–843 (2009)
13. Zhao, H., Li, X.: VectorTrust: The trust vector aggregation scheme for trust management in peer-to-peer networks. In: 18th International Conference on Computer Communications and Networks (ICCCN 2009), San Francisco, CA, USA, August 2-6 (2009)
14. Zhou, R., Hwang, K.: PowerTrust: A robust and scalable reputation system for trusted peer-to-peer computing. IEEE Transactions on Parallel and Distributed Systems 18(4), 460–473 (2007)
Maximizing Growth Codes Utility in Large-Scale Wireless Sensor Networks
Yao Zhao, Xin Wang, Jin Zhao, and Xiangyang Xue
School of Computer Science, Fudan University, China
Shanghai Key Lab of Intelligent Information Processing, Shanghai, China
Abstract. The goal of Growth Codes, proposed by Kamra et al., is to increase the "persistence" of sensed data, i.e., to make data more likely to reach a data sink. Growth Codes are especially useful in "zero-configuration" sensor networks whose topology changes very rapidly. However, the design of Growth Codes rests on two assumptions: (1) each sensor node contains only a single snapshot of the monitored environment, and each packet contains only one sensed symbol; (2) all codewords have the same probability of being received by the sink. These assumptions do not hold in many practical scenarios of large-scale sensor networks, so the performance of Growth Codes would be sub-optimal. In this paper, we generalize the scenarios to include multi-snapshot operation and less random encounters. By associating a decimal degree with the codewords, and by using priority broadcast to exchange codewords, we aim to achieve better Growth Codes performance over a wider range of sensor network applications. The proposed approaches are described in detail by means of both analysis and simulations.
1 Introduction
Wireless sensor networks have been widely used for data perception and collection in different scenarios, such as floods, fires, earthquakes, and military areas. In such networks, the sensor nodes used to collect and deliver data are themselves prone to sudden and unpredictable failure. Thus, protocols designed for these sensor networks should focus on the reliability of data collection and temporarily store part of the data so that the information survives. Many coding approaches have been proposed to achieve robustness in such networks, since coding over the sensed data may increase the likelihood that the data survives as nodes fail. Among them, Growth Codes [7] are specially designed to increase the "persistence" of the sensed data. Assuming that the network topology changes very rapidly, e.g., due to link instability or node failures, Growth Codes aim to increase data persistence, defined as the fraction of data generated within the network that eventually reaches the sink(s). The codewords employ a dynamically changing degree distribution to deliver data to the sink at a much faster rate and to allow a substantial number of the codewords to be decoded at any given time. P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 466–477, 2010. © Springer-Verlag Berlin Heidelberg 2010
However, the design of Growth Codes is based on two assumptions: (1) each sensor node contains only a single snapshot of the monitored environment, and each packet contains only one sensed symbol; (2) all codewords have the same probability of being received by the sink. The latter holds for information exchange at random encounters among nodes or under very high mobility, but it does not capture protocol behavior well in realistic large-scale sensor networks, where the performance of Growth Codes would be sub-optimal. In this paper, we relax the above assumptions and generalize the scenarios of the original Growth Codes to a wider range of applications. Firstly, the design of Growth Codes is generalized to multi-snapshot scenarios, where the buffer of each sensor node can store multiple symbols sensed from the monitored area and the size of a transmission packet is larger than the size of a sensed symbol. Note that this assumption is quite reasonable in practical applications of large-scale sensor networks: a transmission packet may contain hundreds of bytes, whereas a sensed symbol may contain only several bytes. For clarity, we use "packet" for the transmission data unit and "symbol" for the sensed data unit. Thus, when a node receives an input packet, it can disassemble the packet into several symbols and then generate a new packet over other symbols in its local buffer. In this case, Growth Codes can be modified to encode on the symbol level, instead of the packet level, to achieve better utility in multi-snapshot scenarios. Secondly, the design of Growth Codes is generalized to scenarios of less random encounters, where the network scale is too large to guarantee that all codewords have the same probability of reaching the sink. In such scenarios, the nodes located far from the sink have less chance of delivering their symbols to the sink than the closer ones.
In this case, we need to investigate techniques that increase the chance that symbols sensed in areas far from the sink reach it. Motivated by the natural broadcast property of wireless sensor networks, we introduce priority broadcast to disseminate sensed data, giving higher priority to the farther data. We show that with priority broadcast, even in scenarios of less random encounters, the performance of Growth Codes can approach the theoretical value. The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 briefly reviews data persistence in large-scale wireless sensor networks and introduces the original Growth Codes. In Section 4, the design of Growth Codes is modified to encode on the symbol level to maximize utility in the multi-snapshot case. In Section 5, the design of Growth Codes is modified to use priority broadcast to maximize utility in the case of large network scale. Section 6 evaluates the performance of our proposed modifications by simulations. Section 7 concludes this paper.
2 Related Work
Storing and disseminating network coded [2] information instead of the original data can bring significant performance improvements to wireless sensor network
protocols [1]. However, though network coding is generally beneficial, coding over all available packets might leave symbols undecodable until the sink receives enough codewords. A decentralized implementation of erasure codes is proposed in [5]: assuming that there are n storage nodes with limited memory and k < n sources generating the data, the authors show that the sink can retrieve all the data by querying any k nodes. Similarly, decentralized fountain codes [8] [3] are proposed to persist the cached data in wireless sensor networks. Growth Codes [7] are specifically designed to increase the "persistence" of the sensed data in dynamic environments. In Growth Codes, nodes initially disseminate the original data and then gradually code over other data as well to increase the probability that information survives. To achieve more data persistence under different node mobility scenarios, resilient coding algorithms [10] are proposed; the authors keep a balance between coding over all incoming packets and Growth Codes. Furthermore, the approach of Growth Codes is generalized to multi-snapshot scenarios in [9], where the authors aim to maximize the expected utility gain through joint coding and scheduling and propose two algorithms, with and without mixing different snapshots. When dealing with multi-snapshot, the authors prove the existence of an optimal codeword degree but do not propose any specific method. Border-node-retransmission-based probabilistic broadcast is proposed in [4] to reduce the number of rebroadcast messages; the authors suggest privileging retransmission by nodes located at the radio border of the sender. In some sense, their insight of border node retransmission motivates our proposed approach of priority broadcast.
3 Problem Description
3.1 Network Model
We start with the description of the network model of large-scale sensor networks. The network consists of a large number of sensors distributed randomly in a monitored disaster region (earthquakes, fires, floods, etc.). Because of sudden configuration changes due to failures, the network operates in a "zero-configuration" manner, and data collection must be initiated immediately, before the nodes have a chance to assess the current network topology. In addition, the location of the sink is unknown and all data is of equal importance. Each sensor node has a unique identifier (ID) and is capable of sensing an area around itself. Each sensor node also has a radio interface and can communicate directly with some of the sensors around it. We denote the number of nodes by N and the transmission radius by R. In particular, it is difficult for nodes located far from the sink to deliver their symbols to the sink, since delivery takes too many hops. Moreover, we assume that each packet can contain at most P symbols, and we denote the buffer size by B, as depicted in Fig. 1.
Fig. 1. Storing symbols of multi-snapshot in memory (P = 3, B = 5)
Since the location of the sink is unknown, an intermediate node randomly chooses a neighbor and a random packet in its memory and then disseminates this packet to the selected neighbor. In Section 5, we also investigate the technique of priority broadcast, where the intermediate node chooses the packet with the highest priority and then disseminates it to all neighbors.
3.2 Overview of Growth Codes
Growth Codes define the degree of a codeword as the number of packets XOR'd together to form it. Each node initializes its buffer with its own packet (i.e., its sensed information); when a new packet is received, the node stores it at a random position in its memory, overwriting a random unit of previously stored information (but not its own sensed information). Though a high codeword degree increases the probability that a codeword provides innovative information, more codewords and more operations are needed to decode a received codeword. As described in [7], codewords in the network should start with degree one and then increase in degree over time. Specifically, for the first R_1 = (N − 1)/2 packets, codewords of degree 1 are disseminated; after R_k = (kN − 1)/(k + 1) packets have been recovered, codewords of degree k + 1 are disseminated. Let d be the degree of a codeword to be transmitted after r packets have been recovered at the sink, and assume that all codewords have the same probability of being received by the sink; Growth Codes then give us:

d = (N + 1)/(N − r).    (1)
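For illustration, Eqn. 1 and the transition points R_k can be computed directly; the helper names below are hypothetical:

```python
def codeword_degree(n, r):
    """Eqn. 1: degree of the next codeword after r of n packets are recovered."""
    return (n + 1) / (n - r)

def degree_transitions(n):
    """Transition points R_k = (k*n - 1)/(k + 1): after R_k recovered packets,
    codewords of degree k + 1 are disseminated."""
    return [(k * n - 1) / (k + 1) for k in range(1, n)]
```

Consistency check of the two formulas: at r = R_k, Eqn. 1 evaluates to exactly k + 1, since N − R_k = (N + 1)/(k + 1).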
4 Maximizing Growth Codes Utility by Coding on the Symbol Level
4.1 Decimal Codeword Degree
Growth Codes assume that each transmission packet contains only one sensed symbol, so the codeword degree in Growth Codes is always an integer. However, in many multi-snapshot scenarios, a sensor node can store multiple symbols in its buffer, so a transmission packet can contain several sensed symbols. Therefore, when Growth Codes are used on the symbol level, it is indeed possible to generate codewords of decimal degree.
Fig. 2. An example of a codeword with the decimal degree 1.5: packet A = (A1, A2) and packet B = (B1, B2), each of degree 1, yield the new codeword (A1 + B1, A2)
Fig. 3. An example that the decimal degree outperforms the integer degree: (a) a 4-node network, when 2 packets have been recovered; (b) codewords of degree 1; (c) codewords of degree 2; (d) codewords of degree 1.5
An example of the decimal degree is depicted in Fig. 2. An intermediate node stores 2 packets (packet A and packet B) in its buffer, each containing 2 symbols (a total of 4 symbols: A1, A2; B1, B2). To transmit, this node generates a new packet, encoded over one whole packet (A1, A2) and another half packet (B1), obtaining a new codeword of degree 1.5. Here, the value 1.5 is calculated as (1/2)·2 + (1/2)·1, since A1 + B1 has degree 2 while A2 has degree 1. In Growth Codes, after R_k = (kN − 1)/(k + 1) packets have been recovered at the sink, a codeword of degree k + 1 is more likely to be successfully decoded than a codeword of degree k. However, when the decimal degree is used, a decimal degree such as k + 0.5 may outperform both degree k and degree k + 1. For clarity, an example where the decimal degree outperforms the integer degree is given in Fig. 3. It is a 4-node sensor network; each node has sensed 2 symbols and has the same transmission capacity of 2 symbols. At some time, the sink has already recovered 2 packets (see Fig. 3a), and we compare three codeword degrees: degree 1, degree 2, and degree 1.5. In the first case (see Fig. 3b), the sink randomly receives a codeword of degree 1; in expectation, 0.5 new packets (1 new symbol) can be recovered. In the second case (see Fig. 3c), the sink randomly receives a codeword of degree 2; in expectation, 0.5 new packets (1 new symbol) can also be recovered. In the third case (see Fig. 3d), the sink may get 2 symbols (e.g., from a codeword of A1 + C1 and C2) with probability 1/3, get 1 symbol (e.g., from a codeword of C1 + D1 and C2) with probability 1/2, and get no symbol (e.g., from a codeword of
A1 + B1 and A2) with probability 1/6. In expectation, the number of newly recovered symbols is (1/3)·2 + (1/2)·1 + (1/6)·0 = 7/6, which is larger than 1.
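This expectation can be checked with exact rational arithmetic:

```python
from fractions import Fraction

# Expected number of newly recovered symbols for the degree-1.5 codewords of
# Fig. 3d: 2 symbols w.p. 1/3, 1 symbol w.p. 1/2, 0 symbols w.p. 1/6.
expected = Fraction(1, 3) * 2 + Fraction(1, 2) * 1 + Fraction(1, 6) * 0
# expected == 7/6, exceeding the expectation of 1 for pure degree 1 or degree 2
```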
4.2 Search for Appropriate Codeword Degree
Growth Codes select the codeword degree as d = (N + 1)/(N − r), but the previous subsection suggests that codewords of decimal degree may perform better. To get an appropriate decimal codeword degree, we use interpolation. Note that the degree transition points suggested in Growth Codes, which can be regarded as the sample set used for interpolation, are {⟨d, f(d)⟩ | 1 ≤ d ≤ N, f(d) = N − (N + 1)/d}. The problem of searching for an appropriate codeword degree then turns into the following interpolation problem. Given a partition {x = {x_i}, x_i = i + 1, y = {y_i}, y_i = N − (N + 1)/x_i, i = 0, 1, . . . , N − 1}, how do we get the interpolation function s(x, y)? Let s_i = (y_i − y_{i−1})/(x_i − x_{i−1}) = y_i − y_{i−1} (i = 1, . . . , N − 1) and define m_i = min(s_i, s_{i+1}); the method of Hermite spline interpolation [6] tells us:
s(x, y) = m_{i−1}(x_i − x)²(x − x_{i−1}) − m_i(x − x_{i−1})²(x_i − x) + y_{i−1}(x_i − x)²[2(x − x_{i−1}) + 1] + y_i(x − x_{i−1})²[2(x_i − x) + 1]    (2)

The next theorem illustrates the efficiency of Hermite spline interpolation, and an example of this kind of Hermite interpolation is shown in Fig. 4 (for clarity, we only show the part with degree less than 10).

Theorem 1. The codeword degree distribution calculated by Eqn. 2 is near-optimal.

Proof. (1) It is obvious that s(x_i) = y_i for i = 0, 1, . . . , N − 1. (2) Since s′(x_{i−1}) = m_{i−1}, s′(x_i) = m_i and s′(x) > 0 for x_{i−1} < x < x_i, s(x, y) is monotonically increasing in each interval [x_{i−1}, x_i]. Thus, Eqn. 2 conforms to the "growth" property of Growth Codes, where the codeword degree gradually increases over time, so the codeword degree distribution calculated by Eqn. 2 is near-optimal.
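A sketch of one way to realize the interpolation of Eqn. 2, with x_i = i + 1, y_i = N − (N + 1)/x_i and m_i = min(s_i, s_{i+1}) as above; the boundary tangents, which the text does not specify, fall back to the adjacent chord slope (an assumption):

```python
def degree_spline(n):
    """Cubic Hermite spline (Eqn. 2) through the Growth Codes transition
    points x_i = i + 1 (degree), y_i = n - (n + 1)/x_i (packets recovered),
    with unit-spaced knots and tangents m_i = min(s_i, s_{i+1})."""
    xs = [i + 1.0 for i in range(n)]
    ys = [n - (n + 1) / x for x in xs]
    s = [ys[i] - ys[i - 1] for i in range(1, n)]   # chord slopes (unit spacing)
    # boundary tangents fall back to the adjacent chord slope (an assumption)
    ms = [s[0]] + [min(s[i], s[i + 1]) for i in range(len(s) - 1)] + [s[-1]]

    def eval_at(x):
        # locate the interval [x_{i-1}, x_i] containing x
        i = min(max(int(x - xs[0]) + 1, 1), n - 1)
        u, v = xs[i] - x, x - xs[i - 1]            # (x_i - x), (x - x_{i-1})
        return (ms[i - 1] * u * u * v - ms[i] * v * v * u
                + ys[i - 1] * u * u * (2 * v + 1)
                + ys[i] * v * v * (2 * u + 1))

    return xs, ys, eval_at
```

The spline reproduces each sample point exactly and increases monotonically between knots, matching the proof of Theorem 1.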
Having found the appropriate degree, we modify the coding strategy as follows. When an intermediate node has a chance to disseminate a packet, it first determines an appropriate codeword degree d and then generates P coded symbols, among which (d − ⌊d⌋) · P coded symbols are encoded over ⌈d⌉ symbols and (1 − d + ⌊d⌋) · P coded symbols are encoded over ⌊d⌋ symbols.
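This split can be sketched as follows; the helper name is hypothetical, integer XOR stands in for coding over real symbol payloads, and the uniformly random choice of source symbols is an assumption:

```python
import math
import random

def encode_packet(symbols, d, p):
    """Generate p coded symbols realizing average degree d:
    round((d - floor(d)) * p) codewords XOR ceil(d) randomly chosen source
    symbols; the rest XOR floor(d) symbols. Integer symbols stand in for
    the real payloads."""
    lo, hi = math.floor(d), math.ceil(d)
    n_hi = round((d - lo) * p)
    coded = []
    for j in range(p):
        k = hi if j < n_hi else lo
        chosen = random.sample(symbols, k)
        word = 0
        for sym in chosen:
            word ^= sym                    # XOR the chosen source symbols
        coded.append((chosen, word))
    return coded
```

For d = 1.5 and P = 10, half of the coded symbols have degree 2 and half have degree 1, so the average degree is exactly 1.5.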
5 Maximizing Growth Codes Utility by Priority Broadcast
5.1 Unicast
Since the location of the sink is unknown, and since all data is of equal importance, Growth Codes use a simple algorithm to exchange information: a node
Fig. 4. Searching the decimal degree by interpolation (N = 500): codeword degree vs. the number of packets recovered at the sink, for the decimal and the integer degree
randomly chooses a neighbor and a random codeword in its memory and exchanges information. However, this kind of unicast may be less efficient than broadcast, given the natural broadcast property of wireless sensor nodes. Furthermore, unicast is locally uniform and thus penalizes nodes far from the sink when delivering their packets to it. Clearly, this violates the assumption in the design of Growth Codes that all codewords in the network have the same probability of being received by the sink. That assumption only holds if the sink encounters other nodes with uniform probability; in less random scenarios, coding performance decreases, so exchanging information by unicast is not an efficient way to maximize the utility of Growth Codes. (Simulation results are shown in Section 6.)
5.2 Priority Broadcast
Since broadcast is natural in wireless sensor networks and can quickly diffuse a message from a source node to the other nodes, it is an intuitive method for more efficient information exchange. To maximize Growth Codes utility, we use priority broadcast to disseminate packets: delivery of the packet with the higher priority is privileged. Firstly, the priority relates to the intersection of the radio areas of two nodes. A packet received from a neighbor located at the radio border of an intermediate node should have high priority, since this kind of border data transmission helps disseminate sensed data quickly. The distance between two nodes with full-duplex communication can be evaluated by comparing their neighbor lists. When two nodes A and B can contact each other, the union of their communication areas ZA ∪ ZB can be partitioned into three zones (Fig. 5): Z1, Z2, Z3, where Z1 denotes ZA − ZB, Z3 denotes ZB − ZA, and Z2 denotes ZA ∩ ZB. We define the ratio μ1 by μ1 = Z2/(Z1 + Z3).
Fig. 5. Intersection of the radio areas of node A and node B
Fig. 6. Example of the priority function convexity according to the δ parameter: (a) δ = 1/2, (b) δ = 1, (c) δ = 2
The parameter μ1 approximately denotes the distance between nodes A and B, where a larger value of μ1 denotes a longer distance and a smaller value denotes a shorter distance. When node A wants to know μ1, it has to identify its own neighbors and the neighbors of B. For that purpose, each node that forwards a broadcast adds the identities of all its neighbors to the message. When node B receives the broadcast message, it compares the list from the incoming message to its own neighbor list; it can then determine the approximate value of μ1. Therefore, when node B receives a packet of sensed data from node A, node B calculates the priority of this packet as follows:

p1 = (μ1/M1)^δ1.    (3)
Here, M1 denotes a constant representing the maximal value of μ1. This value can be evaluated as the maximal value of the ratio Z2/(Z1 + Z3), which corresponds to the case when the distance between node A and node B equals the transmission radius (M1 ≈ 0.6). δ1 denotes a coefficient that controls the impact of μ1 (see Fig. 6). Secondly, the priority relates to the walk length. Here, the walk length is the number of hops a source packet takes to arrive at a given intermediate node. We set a counter μ2 for each source packet and increase it by one after each transmission until the packet reaches the sink. Therefore, when a packet is received by an intermediate node, its priority is calculated as follows:

p2 = (μ2/M2)^δ2.    (4)
Here, M2 denotes a constant representing the maximal value of μ2. Since each node can store at most B packets in its buffer, a data-disseminating walk is interrupted at a random node with probability 1/B; thus, the value of M2 can be approximated by the buffer size B. δ2 denotes a coefficient that controls the impact of μ2 (also see Fig. 6). Combining the previous equations, each packet in the buffer has the following broadcast priority:

p = (μ1/M1)^δ1 + (μ2/M2)^δ2.    (5)

Theorem 2 shows the efficiency of priority broadcast.
Theorem 2. There exist appropriate δ1 and δ2 such that the performance of priority broadcast in scenarios of less random encounters approaches that of random encounters.

Proof. Fig. 7 illustrates a scenario of less random encounters, where node S is an intermediate node in the network and node B is farther from node S than node A. When node S stores both packet A and packet B in its buffer, it calculates their priorities p_A and p_B by Eqn. 5: p_A = 1 + (1/B)^δ2 and p_B = (μ1,B/M1)^δ1 + (2/B)^δ2. Obviously, there exist values of δ1 and δ2 that ensure p_A = p_B. Thus, though node B is farther from node S than node A, the priorities of their packets can be made equal by appropriate δ1 and δ2.
Fig. 7. Priority broadcast in the scenario of less random encounters
Knowing the priority of each packet in the buffer, we obtain the following modified coding strategy. When an intermediate node has a chance to disseminate a packet, packet k is chosen with probability pk / Σ_{i=1}^{B} pi.
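This priority-proportional selection can be sketched as a standard weighted draw over the buffer (our naming; a minimal illustration, not the authors' implementation):

```python
import random

# Sketch: an intermediate node picks packet k with probability
# p_k / sum_i p_i over the (up to B) packets in its buffer.
def choose_packet(priorities):
    """Return the index of the packet selected for dissemination."""
    total = sum(priorities)
    r = random.random() * total
    acc = 0.0
    for k, p in enumerate(priorities):
        acc += p
        if r < acc:
            return k
    return len(priorities) - 1  # guard against floating-point round-off
```

High-priority packets (far from their sender, or long into their walk) are therefore disseminated more often, which is the mechanism Theorem 2 relies on.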
6 Simulation Results
In this section, we look at how efficiently our proposed methods can help the sink recover more data in wireless sensor networks. Our simulation is implemented in about 1000 lines of C++ code. We consider a random wireless sensor network where 500 nodes are randomly placed in a 1×1 square; each sensor node has the same transmission radius R, and a pair of sensor nodes is connected by a link if they are within distance R.
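The simulated topology described above is a random geometric graph. A minimal sketch of its construction, under the paper's setup (500 nodes in a unit square, link iff within distance R; the function name and seed handling are ours):

```python
import math
import random

# Sketch of the paper's simulation topology: n nodes uniformly at random in a
# 1x1 square, with a link between any pair within distance `radius`.
def random_sensor_network(n=500, radius=0.2, seed=1):
    rng = random.Random(seed)
    nodes = [(rng.random(), rng.random()) for _ in range(n)]
    links = set()
    for i in range(n):
        for j in range(i + 1, n):
            dx = nodes[i][0] - nodes[j][0]
            dy = nodes[i][1] - nodes[j][1]
            if math.hypot(dx, dy) <= radius:  # within transmission radius
                links.add((i, j))
    return nodes, links
```

The node density (average number of neighbors) then follows from n and R, which is the quantity varied in the experiments below.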
Maximizing Growth Codes Utility in Large-Scale Wireless Sensor Networks
6.1 Impact of Coding on the Symbol Level
First, we look at how codewords with decimal degree (coding on the symbol level) impact data persistence. In our simulation, we assume P = 10 and B = 100, i.e., each node can store at most 100 symbols (10 packets) in the buffer. The transmission radius R is 0.2. In multi-snapshot scenarios, a transmitted packet may contain several sensed symbols. In this case, Growth Codes on the symbol level allow the codeword degree to increase at a finer granularity than coding on the packet level. As depicted in Fig. 8, the sink needs 365 codewords to recover 50 percent of the symbols by coding on the symbol level, but 400 codewords by coding on the packet level. When the sink has received 500 codewords, it can recover 59% of the original symbols by coding on the symbol level, but only 52.6% of the symbols by coding on the packet level. Thus, our proposed method persists much more data than the original Growth Codes, demonstrating the efficiency of coding on the symbol level in multi-snapshot scenarios.
Fig. 8. Data persistence in a 500 node random sensor network (R = 0.2, B = 100)
6.2 Impact of Priority Broadcast
Second, we look at how the information exchange by priority broadcast impacts the performance of Growth Codes in different scenarios, whether with less random encounters or with random encounters. We simulate several networks where the node density varies from 10 to 30. Moreover, two network scales, 500 and 1000 nodes, are considered, and we assume δ1 = 2 and δ2 = 0.5. In the scenarios of random encounters (e.g., a much denser network where the node density is 30), the original Growth Codes perform well with both unicast and priority broadcast. However, in the scenarios of less random encounters, the performance of the original Growth Codes decreases, since it is difficult for a part of the symbols to reach the sink. When associating Growth Codes with priority broadcast, even the symbols that are far from the sink can reach it. Fig. 9 depicts the time taken to recover a certain fraction of the sensed data when the nodes use Growth Codes, with both unicast and priority broadcast. The bottommost curve in both Fig. 9a and Fig. 9b plots the theoretical performance of Growth Codes. The results suggest that the performance of the original Growth Codes differs across networks. If the network scale is large (e.g., N = 500 or 1000) but the node density is not large enough (e.g., less than 30), the performance of the original Growth Codes decreases. However, associating Growth Codes with priority broadcast approaches the theoretical performance in all kinds of scenarios. Thus, the efficiency of priority broadcast is shown.

Fig. 9. The performance of Growth Codes differs in different networks: (a) time taken to recover 50% of the data; (b) time taken to recover 80% of the data
7 Conclusion
In this paper, we generalize the application scenarios of the original Growth Codes to include multi-snapshot and less random encounters, so as to achieve better performance over a wider range of network applications, especially for large-scale wireless sensor networks. The core ideas of this paper are: (1) to use a decimal codeword degree instead of an integer codeword degree to deal with multi-snapshot scenarios; and (2) to use priority broadcast, which is correlated with border-node transmission and the walk length, to approach the theoretical performance of Growth Codes in scenarios of less random encounters. In addition, the proposed methods are compared with the original Growth Codes by simulation, and the results validate their efficiency. We believe that our work will greatly help the practical application of Growth Codes in various kinds of large-scale sensor networks.
Acknowledgement. This work was supported in part by the 863 Program of China under Grant No. 2009AA01A348, the National Key S&T Project under Grant No. 2010ZX03003-00303, the Shanghai Municipal R&D Foundation under Grant No. 09511501200, and the Shanghai Rising-Star Program under Grant No. 08QA14009. Xin Wang is the corresponding author.
@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks

José Mocito 1, Luís Rodrigues 2, and Hugo Miranda 3

1 INESC-ID / FCUL
2 INESC-ID / IST
3 University of Lisbon
Abstract. Flooding is a fundamental building block in multi-hop networks (both mobile and static); for instance, many routing protocols for wireless ad hoc networks use flooding as part of their route discovery/maintenance procedures. Unfortunately, most flooding algorithms have configuration parameters that must be tuned according to the execution environment, in order to provide the best possible performance. Given that ad hoc environments are inherently unpredictable, dynamic, and often heterogeneous, anticipating the most adequate configuration of these algorithms is a challenging task. This paper presents @Flood, an adaptive protocol for flooding in wireless ad hoc networks that allows each node to auto-tune the configuration parameters, or even change the forwarding algorithm, according to the properties of the execution environment. Using @Flood, nodes autoconfigure themselves, circumventing the need for pre-configuring all the devices in the network for the expected operational conditions, which is impractical or even impossible in very dynamic environments.
1 Introduction
Flooding is a communication primitive that provides best-effort delivery of messages to every node in the network. It is a fundamental building block of many communication protocols for multi-hop networks. In routing, flooding is typically used during the discovery phase, to find routes among nodes. It has many other uses, like building a distributed cache that keeps data records in the close vicinity of clients [11], performing robust and decentralized code propagation [8] or executing distributed software updates [1], performing query processing [15], or enhancing the privacy of communicating nodes [5]. Given its paramount importance, it is no surprise that the subject of implementing flooding in an efficient manner has been extensively studied in the literature. Its simplest implementation requires every node in the network to retransmit every message once, which results in an overwhelming use of communication resources that can negatively affect the capacity of nodes to communicate with each other [12]. As a result, many different approaches to implement
This work was partially supported by FCT project REDICO (PTDC/EIA/71752/2006) through POSI and FEDER.
P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 478–489, 2010. c Springer-Verlag Berlin Heidelberg 2010
(optimized) flooding have been proposed [12,4,2,10,3]. Most require fine-tuning a number of operational parameters in order to provide the best performance. Such tuning can depend on several factors, like the node density, geographical distribution, device characteristics, or traffic load. An ill-configured algorithm can contribute to increased battery depletion of the devices and to an excessive occupation of the available bandwidth. Usually, flooding protocols provide few or no mechanisms to self-adapt their parameters, and must be totally or partially configured off-line, before the actual deployment of nodes in the network. This task is obviously challenging, since multi-hop networks are typically highly dynamic: the topology is formed spontaneously, and the node distribution, movement, or sleeping schedule can create heterogeneous topologies that do not favor any static or homogeneous configuration. In other words, uniformly using the same, common configuration in every node is unlikely to provide optimal results. Moreover, even in the case of homogeneous and stable operating conditions, the task of preconfiguring a large number of devices may be impractical or even impossible (recall that flooding may be required to be already operational in order to perform configuration/software updates on the fly). This paper proposes Auto-Tunable Flooding, AT-Flooding for short, or simply @Flood, a generic auto-tuning flooding protocol where each node, individually, selects the parameter configuration that is most advantageous. Furthermore, flooding initiators can select the most adequate flooding algorithm given the perceived network conditions. This makes @Flood particularly useful in unpredictable, dynamic, and/or heterogeneous environments. The fact that each node can operate using a different configuration paves the way for efficient operation in scenarios where different portions of the network are subject to different operating conditions.
Given that each node configures itself autonomously, the configuration of the network is scalable, as no global synchronization is required. The mechanism is based on a feedback loop that monitors the performance of the flooding algorithms. We illustrate the feasibility of the approach using different algorithms. Experimental results show that, in all scenarios, our technique is able to match or surpass the performance of the best off-line configuration. The remainder of the paper is organized as follows. Related work is presented in Section 2. In Section 3 we present detailed descriptions of several flooding algorithms, which are used to motivate our work and illustrate the application of our scheme. Section 4 provides experimental arguments advocating the need for a generic adaptive flooding solution. Section 5 provides a detailed description of the @Flood protocol. In Section 6 we evaluate our proposal. Section 7 concludes the paper and highlights future directions of this work.
2 Related Work
Devising adaptive solutions for flooding is not a new problem. However, the majority of the approaches focus on defining retransmission probabilities as a function of the perceived neighborhood, which may provide sub-optimal results
in the presence of uneven node distribution. We will briefly describe the approaches that, to the best of our knowledge, are most closely related to our solution. Hypergossiping [6] is a dual-strategy flooding mechanism to overcome partitioning and to efficiently disseminate messages within a partition. It uses an adaptive approach where the retransmission probability at some node is a function of its neighbor count. Rather than defining a mechanism to locally tune this configuration value, the authors derive optimal assignments off-line on the basis of experimental results obtained a priori. Unfortunately, these results are intrinsically tied to the studied scenarios. Tseng et al. [13] proposed adaptive versions of the flooding protocols published in [12], which adapt to the density of the network where they are executing by observing the neighbor set at each node, either by overhearing traffic or through explicit periodic announcements. A more recent work by Liarikapis and Shahrabi [9] proposes a similar approach, specifically focused on a probability-based scheme, where the neighbor set is obtained by counting the number of retransmissions of a single message. In both solutions the configuration of the flooding strategy is obtained by a function, defined a priori, of the perceived neighbor count. This function is established by finding the best values through extensive simulations and refinements for a given scenario. Once again, this suggests that the resulting function may not perform well if the conditions change at run-time. Smart Gossip [7] is an adaptive gossip-based flooding service where retransmission probabilities are applied to reduce the communication overhead. As in our approach, the performance metric is defined for each source.
By knowing the estimated diameter of the network, along with some node dependency sets that include, for instance, the set of nodes that are forwarders for messages of a given source, every node can determine its reception probability for that source and influence the retransmission probability of the forwarders. By requiring an estimate of the network diameter and accurate knowledge of a partial topology, this approach is not suitable for dynamic networks.
3 Flooding Algorithms
A plethora of flooding algorithms have been proposed in the literature [12,4,2,10,3]. Here we do not aim at providing a comprehensive survey of related work (an interesting one can be found in [14]). Instead, in this section we focus on a small set of improved flooding algorithms that we consider representative, and that we have used to illustrate the operation of our @Flood protocol, namely GOSSIP1 [4], PAMPA [10], and 6SB [3]. GOSSIP1 employs a probabilistic approach where the forwarding of data packets is driven by a pre-configured probability p, i.e., upon receiving a message for the first time, every node decides with probability p to retransmit the message. PAMPA is a power-aware flooding algorithm where the received signal strength indication (RSSI) is used to estimate the distance of nodes to the source and to select the farthest ones as the first to retransmit. A pre-configured threshold c determines the number of retransmissions detected after which a node will no longer forward a message.
6SB employs the same rationale as PAMPA but uses a location-based approach where the geographic position of nodes determines the best candidates for retransmitting first, i.e., the farthest from the source. Again, a retransmission counter threshold c is employed to limit the number of forwards.
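The two forwarding decisions described above can be sketched as follows. This is our illustration, not the papers' code: GOSSIP1's test is a single Bernoulli trial with parameter p, while PAMPA and 6SB both apply a counter-threshold rule after a distance-derived back-off (from RSSI or position, respectively).

```python
import random

# Illustrative sketches of the two forwarding decisions (names are ours).
def gossip1_forwards(p, rng=random):
    """GOSSIP1: on first reception, retransmit with probability p."""
    return rng.random() < p

def counter_forwards(copies_overheard, c):
    """PAMPA/6SB: after the distance-based back-off, retransmit only if
    fewer than c retransmissions of the message were detected."""
    return copies_overheard < c
```

Note that p and c are exactly the pre-configured parameters that @Flood will later tune at run-time.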
4 The Case for Adaptive Behavior
The challenge in developing a flooding protocol consists in finding the right tradeoff between robustness and efficiency. The aforementioned flooding protocols aim at minimizing the number of packets required to achieve some target level of reliability using different strategies. To motivate the need for supporting adaptive behavior, we performed a set of experimental tests that illustrate the impact of different factors that influence flooding performance. For that purpose we tested several algorithms using different configuration parameters and measured their performance using the delivery ratio and the total amount of data transmitted. To run the experiments, we implemented the four protocols in the NS-2 network simulation environment, and performed the tests in a MANET with signal propagation following the Two Ray Ground model, composed of 50 nodes, 5 of which are senders, producing 256-byte messages at a rate of one message every second. Nodes were distributed uniformly across the space. Unless otherwise stated, all results are averages of 25 runs in the same conditions using different scenarios.

Network Density. The first, most obvious, factor that we consider is the network density. For self-containment, we illustrate the relevance of this aspect by depicting in Figure 1 the performance of GOSSIP1 in sparse (average neighbor count of 4) and dense networks (average neighbor count of 17). As can be observed, different values of the retransmission probability p provide different delivery results: the value 0.4 is enough to provide high delivery in dense scenarios, while in sparse scenarios a higher value is required to obtain high delivery ratios. We can also confirm from the results that an increase of p will produce more messages, which reflects the increasingly conservative behavior of the protocol.
Note that in the dense scenario, after some point, this increase in the number of messages is no longer productive, as there is no noticeable benefit in the delivery ratio.

Geographical Positioning. Some forwarding algorithms rely on geographical positioning to perform optimized flooding. This can be determined, for instance, using the Global Positioning System (GPS). Such positioning systems are, however, prone to errors, and their accuracy can range from tens to hundreds of meters. In position-based algorithms, inaccurate positioning can cause sub-optimal performance by generating more traffic than required, or have an impact on the delivery ratio. This case is depicted in Fig. 2, where the 6SB protocol is used in a network where the positioning error varies between 0 and 150 meters. As we can observe, the impact on the delivery ratio is quite significant, ranging from 2% when there is no positioning error to 15% when the error reaches 150 meters.
Fig. 1. GOSSIP1 algorithm with different parameters: (a) dense scenario; (b) sparse scenario

Fig. 2. Positioning inaccuracy in 6SB
Discussion. There are many factors that may affect the performance of a flooding protocol. The most widely studied is network density. However, as we have discussed in the previous paragraphs, different forwarding algorithms may be equally sensitive to other factors, like inaccuracies in the readings of GPS data or the signal strength. Operational factors may even invalidate the assumptions of a given forwarding strategy (for instance, the availability of positioning information in 6SB) and may require the use of a different forwarding strategy. Therefore, it is relevant to devise a scheme that automatically adapts the flooding protocol in the face of several types of changes in the operational envelope. The selected adaptation may either change the forwarding algorithm or just tune its configuration parameters.
5 @Flood
Auto-Tunable Flooding, AT-Flooding for short, or simply @Flood, is an adaptive flooding protocol that aims at finding, in an automated manner, the most suitable forwarding strategy and, for the selected strategy, the most suitable parameter configuration. The protocol continuously monitors the performance of
Algorithm 1. @Flood: forwarding procedure
1: R(alg, pid) ← ∅
2: primary                                  ▷ primary algorithm
3: procedure flood(m)
4:   m.alg ← primary
5:   m.alg.send(<p, m>)
6: procedure receive(src, <orig, m>)
7:   if m ∉ R(m.alg, orig) then
8:     R(m.alg, orig) ← R(m.alg, orig) ∪ {m}
9:     m.alg.forward(<orig, m>)
the executing configuration, and performs adaptations as needed. The adaptation procedure is strictly local, thus involving no coordination between nodes. @Flood is composed of four complementary components, namely: 1) a forwarding procedure, in charge of implementing the message dissemination; 2) a probing module, that injects a negligible amount of data into the network using different flooding algorithms; 3) a feedback loop, that monitors the performance of the underlying flooding algorithms; and 4) an adaptation engine, in charge of selecting the flooding algorithm and tuning its configuration parameters at run-time. Each of these components is described in detail in the next subsections. Note that @Flood can use multiple forwarding algorithms. Therefore, some parts of the protocol are generic, and some parts are specific to the forwarding algorithms in use. In the following subsections we clarify this distinction.

5.1 Forwarding Procedure
The forwarding procedure consists in invoking the primary flooding algorithm, i.e., the one that is selected for carrying application data. Our protocol implements two methods: flood, called to send a message, and receive, called when a message is received from the network. We also assume that each message is tagged with the forwarding algorithm that has been selected by the adaptive mechanism. Each algorithm has its own specific send method, called to send a message, and forward, called to (possibly) retransmit the message. Alg. 1 describes the behavior of this component. When a node wishes to flood a message, it issues a call to the primary flooding algorithm (lines 3–5). Likewise, when a node receives a message for the first time (lines 6–9), it stores its identifier and feeds it to the flooding method of the algorithm selected for that message. The reason for storing the message identifiers will become clear in the next sections.
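The logic of Algorithm 1 can be sketched as follows; this is a minimal Python illustration under our own naming (the Message class and the per-algorithm send/forward objects are stand-ins, not the paper's API):

```python
# Sketch of Algorithm 1: flood via the primary algorithm, forward each
# received message at most once, keyed by (algorithm, origin).
class Message:
    def __init__(self, mid):
        self.id = mid
        self.alg = None   # tag: which algorithm disseminates this message

class ATFlood:
    def __init__(self, primary, algorithms):
        self.primary = primary        # name of the primary algorithm
        self.algorithms = algorithms  # name -> algorithm object
        self.received = {}            # R(alg, origin): seen message ids

    def flood(self, msg):
        """Send a new message using the primary algorithm (lines 3-5)."""
        msg.alg = self.primary
        self.algorithms[msg.alg].send(msg)

    def receive(self, src, origin, msg):
        """Handle a message from the network (lines 6-9)."""
        seen = self.received.setdefault((msg.alg, origin), set())
        if msg.id not in seen:        # forward on first reception only
            seen.add(msg.id)          # kept for the monitor module
            self.algorithms[msg.alg].forward(origin, msg)
```

Storing the identifiers in `received` is what later allows the monitor module to compare what was sent against what each neighbor actually received.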
5.2 Probing Module
To support switching between flooding algorithms, one needs to assess the performance of each alternative in order to produce an adequate decision. In @Flood this is accomplished through a mechanism (Alg. 2) where, periodically, a negligible amount of probing messages is injected into the network using the different forwarding strategies. Therefore, throughout the execution of @Flood, all the alternative strategies are running, producing traffic that is used by the
Algorithm 2. @Flood: probe module
1: alternatives                             ▷ list of alternative algorithms
2: upon periodicProbeTimer
3:   probe ← select from alternatives \ primary
4:   probe.send(<p, PROBE>)
Algorithm 3. @Flood: monitor module
1: Nn(alg, pid) ← ∅     ▷ list of sets of messages received by n from pid since last adaptation
2: A(alg, pid) ← ∅      ▷ set of all collected messages from pid since last adaptation
3: Pn(alg, pid) ← 0     ▷ list of performance metrics for every node n
4: upon periodicHelloTimer
5:   tp(alg, pid) ← c(alg, pid)
6:   datalink.send()
7: procedure receive(src, )
8:   Nsrc ← Nsrc ∪ r
9:   A ← A ∪ a
10:  for n = 0 to p.length do
11:    if tn.isNewer(Tn) then
12:      Pn ← pn
monitor module to extract the information required by the performance metric. These messages are treated as regular messages and therefore use the forwarding mechanism described in the previous section. To optimize the usage of communication resources, the probing module is only activated when a node starts producing regular traffic.

5.3 Monitor Module
In order to adapt the underlying algorithm, @Flood needs to gather information about the performance of each flooding algorithm, including the primary and the alternatives. To that purpose, @Flood uses a monitor module that determines the performance of each forwarding strategy. Several implementations of the monitor are possible. In this paper we use a simple approach based on the exchange of periodic HELLO messages, which can be either explicit or piggy-backed on regular data or control traffic. A description of the monitor module is provided in Alg. 3. For each algorithm alg and traffic source s, three data structures, A(alg, s), Nn(alg, s), and Pn(alg, s), hold, respectively, the identifiers of messages known to have been sent in the system, those received by neighbor node n, and the performance metrics computed at every node n, since the last adaptation (lines 1–3). Periodically, a node a broadcasts to its neighbors a HELLO message containing R, A and P (lines 4–6). Upon reception of the HELLO message, node b updates its own data structures by integrating the received digest messages (lines 7–12). Each node p can compute the performance of the current configuration of any algorithm for any given data source from the number of messages detected to have been missed by the neighbors, compared to the number of messages produced since the last adaptation. The performance metric can be computed using the following function:
localPerf(alg, src) = average over i ∈ Neighborhood(p) of |Ni(alg, src)| / |A(alg, src)|
Since the data initiators will typically perceive complete delivery of the produced messages to the neighboring nodes, the previous metric does not capture problems that occur farther along in the dissemination process. Therefore, these nodes use a different metric that captures the average delivery ratio perceived by every node in the network:

globalPerf(alg, p) = average over i ∈ Nodes of Pi(alg, p)

In both metrics, the sets of messages (N and A) and the list of performance measurements (P) are reset right after a reconfiguration action by the adaptation engine, in order to reflect the performance of the current configuration.
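The two metrics can be sketched directly from their definitions; this is our illustration (data-structure shapes are assumptions: a mapping from neighbor id to the set of received message ids, the set of all known sent messages, and a mapping from node id to its reported localPerf value):

```python
# Sketch of the localPerf and globalPerf metrics of the monitor module.
def local_perf(neighbor_received, all_sent):
    """Average fraction of the known messages received by each neighbor.

    neighbor_received: neighbor id -> set of message ids (the N_i sets)
    all_sent: set of all message ids known to have been sent (the A set)
    """
    if not neighbor_received or not all_sent:
        return 0.0
    ratios = [len(r) / len(all_sent) for r in neighbor_received.values()]
    return sum(ratios) / len(ratios)

def global_perf(node_perfs):
    """Average delivery ratio perceived across all reporting nodes (P_i)."""
    if not node_perfs:
        return 0.0
    return sum(node_perfs.values()) / len(node_perfs)
```

An initiator would feed `global_perf` with the P values collected through HELLO messages, since its own one-hop view (`local_perf`) is trivially close to complete delivery.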
5.4 Adaptation Engine
The adaptation engine of @Flood works in a loop where, periodically: i) the monitor module is called to assess the performance of the different forwarding algorithms (both the primary and the alternatives being probed); ii) each algorithm (including the alternatives) adapts its configuration parameters in order to improve its performance; iii) the primary algorithm is changed if another alternative outperforms the current selection. The part of the protocol that depends on the forwarding algorithms is encapsulated in the adapt method that is exported by each implementation. This method adapts the configuration of the algorithm and accepts as parameters the current measured performance of the algorithm, the target performance, and a margin δ that introduces some hysteresis in the system by defining a stable region around the target performance given by the localPerf metric; δ is related to how accurately one can locally estimate this value, which depends on the sampling period. Naturally, the concrete action depends on the protocol in use, and might involve increasing or decreasing the value of configuration parameters. In our implementations, a delivery ratio below the target performance originates, in GOSSIP1, an increment in the retransmission probability p, and in 6SB and PAMPA an increment in the retransmission counter c. If the performance is above target, the parameters of the three algorithms are decremented.
If theres is an alternative that meets the target performance and the primary algorithm does not, the later is substituted by the former.
Algorithm 4. @Flood: adaptive engine
1: target                                   ▷ performance target
2: upon periodicAdaptTimer
3:   Pp(primary, p) ← globalPerf(primary, p)
4:   for all alg ∈ alternatives do
5:     for all src ∈ sources \ p do
6:       Pp(alg, src) ← localPerf(alg, src)
7:       alg.adapt(Pp(alg, src), target, δ)
8:     Pp(alg, p) ← globalPerf(alg, p)
9:     if target − δ < Pp(alg, p) < target + δ then
10:      if Pp(primary, p) < target − δ ∨ Pp(primary, p) > target + δ then
11:        primary ← alg

6 Evaluation
To validate and evaluate the performance of @Flood, we performed a series of experiments using the forwarding algorithms listed in Sect. 3. The NS-2 simulator was used to test the protocol in a wireless ad hoc network of 50 nodes, uniformly distributed, using the Two Ray Ground signal propagation model. The area and transmission range were manipulated to obtain sparse network scenarios, where each node has, on average, four neighbors. Five randomly chosen nodes produce 256-byte messages at a rate of one message every second. The target delivery ratio was set to 0.90 and δ to 0.05. The time period between adaptations was set to 30 seconds. Unless otherwise stated, all results are averages of 25 runs in the same conditions using different scenarios, and the 95% confidence intervals are presented in the figures.

6.1 Convergence
We first show how @Flood is able to make the configuration of a given forwarding algorithm converge to values that offer good performance (i.e., that match the target delivery ratio and reduce the number of redundant messages). For this purpose, we simulated the execution of @Flood using GOSSIP1 as the forwarding algorithm, starting from two different initial configurations (p = 0 and p = 1, respectively), and measured the evolution of the delivery ratio and amount of traffic in sampling periods of 10 seconds. The results are depicted in Fig. 3. For an initial p = 0 the delivery is roughly 17% due to the neighbors of the initiators receiving data. For an initial p = 1 the delivery is roughly 97% since every node restransmits (the 3% loss is due to collisions). We can observe that @Flood drives GOSSIP1 to a configuration that offers approximately the same performance, in terms of delivery ratio and produced traffic, regardless of its initial configuration. Moreover, the delivery ratio stabilizes close to the pre-defined goal of 0.90 (±0.05). 6.2
6.2 @Flood vs. Standard Pre-configuration
One interesting question is to assess how well @Flood performs with regard to non-adaptive versions of the protocol with the standard (homogeneous) pre-configuration of the nodes. Can @Flood achieve the same performance?
@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks

Fig. 3. @Flood convergence with GOSSIP1 (delivery ratio and traffic in KB over simulation time, for initial configurations p = 0.0 and p = 1.0)

Fig. 4. @Flood vs. best pre-config (generated traffic in MB, without @Flood at 91% delivery and with @Flood at 90% delivery)
In this test we used the GOSSIP1 algorithm and performed several runs with different values of the parameter p to determine which one results in a delivery performance close to 0.90. Coincidentally, that value was also 0.90, and it resulted in a delivery ratio of 0.91. The same algorithm was also executed using @Flood but without competing alternatives, i.e., GOSSIP1 was set as the single alternative. The configurable parameters were set to a low value and an initial warm-up period was exercised in order for these values to converge and bring the delivery performance close to a target of 0.90 (±0.05). The results for the amount of traffic generated are depicted in Fig. 4. We can observe that GOSSIP1 without @Flood sends slightly more traffic than with @Flood. We can therefore conclude that the reduction of retransmissions operated by @Flood compensates for the overhead introduced by the performance monitor.

6.3 Adapting the Forwarding Algorithm
We now illustrate the use of @Flood to change the forwarding algorithm, by commuting between a position-based and a power-aware forwarding strategy, more precisely, between 6SB and PAMPA. This allows @Flood to use 6SB when position information is accurate and PAMPA otherwise.
J. Mocito, L. Rodrigues, and H. Miranda

Fig. 5. Positioning inaccuracy in 6SB and @Flood (delivery ratio over simulation time; left: static 6SB and PAMPA, right: @Flood with the primary algorithm commuting between 6SB and PAMPA)
To illustrate the operation of @Flood in such scenarios, we performed an experimental test, with a single data producer, where the error in the positioning estimation varies throughout the simulation time. More specifically, the test consisted of inducing a position estimation error (of up to 150 meters in a random direction) in every node at specific moments in time. @Flood was configured to start with 6SB as the primary forwarding algorithm and PAMPA as an alternative. The target was set to 0.98 and δ to 0.02. In Fig. 5 we show two runs of the same experiment, one with static versions of 6SB and PAMPA, and the other with @Flood commuting between the two. The first plot depicts the equivalent delivery performance of both algorithms and the significant performance loss of 6SB whenever an error is induced in the positioning information. The second plot exhibits a smaller penalty, since @Flood automatically switches to PAMPA as soon as the perceived performance of the 6SB algorithm drops, and returns to 6SB when the performance metric measured on the probing traffic returns to values above the target threshold.
7 Conclusions and Future Work
In this paper we presented @Flood, a new adaptive flooding protocol for mobile ad hoc networks. Experimental results with different forwarding algorithms validate the effectiveness of the solution. The performance of the resulting system matches that of a system pre-configured with the best configuration values for the execution scenario, but with a smaller number of retransmissions, thus contributing to an extended network lifetime. This makes the approach very interesting for dynamic and/or large-scale scenarios and eliminates the need to pre-configure the flooding protocol. In the future, we plan to experiment with other forwarding algorithms, study alternative schemes to monitor the performance of the network and drive the adaptation procedure, and investigate a mechanism to self-adapt the hysteresis parameter to the currently executing algorithms and network conditions.
On Deploying Tree Structured Agent Applications in Networked Embedded Systems

Nikos Tziritas¹,³, Thanasis Loukopoulos²,³, Spyros Lalis¹,³, and Petros Lampsas²,³

¹ Dept. of Computer and Communication Engineering, Univ. of Thessaly, Glavani 37, 38221 Volos, Greece {nitzirit,lalis}@inf.uth.gr
² Dept. of Informatics and Computer Technology, Technological Educational Institute (TEI) of Lamia, 3rd km. Old Ntl. Road Athens, 35100 Lamia, Greece {luke,plam}@teilam.gr
³ Center for Research and Technology Thessaly (CERETETH), Volos, Greece
Abstract. Given an application structured as a set of communicating mobile agents and a set of wireless nodes with sensing/actuating capabilities and agent hosting capacity constraints, the problem of deploying the application consists of placing all the agents on appropriate nodes without violating the constraints. This paper describes distributed algorithms that perform agent migrations until a “good” mapping is reached, the optimization target being the communication cost due to agent-level message exchanges. All algorithms are evaluated using simulation experiments and the most promising approaches are identified.
1 Introduction

Mobile code technologies for networked embedded systems, like Agilla [1], SmartMessages [2], Rovers [3] and POBICOS [4], allow the programmer to structure an application as a set of mobile components that can be placed on different nodes based on their computing resources and sensing/actuating capabilities. From a system perspective, the challenge is to optimize such a placement taking into account the message traffic between application components. This paper presents distributed algorithms for the dynamic migration of mobile components, referred to as agents, in a system of networked nodes, with the objective of reducing the network load due to agent-level communication. The proposed algorithms are simple, so they can be implemented on nodes with limited memory and computing capacity. Also, modest assumptions are made regarding the knowledge of routing paths used for message transport. The algorithms rely on information that can be provided by even simple networking or middleware logic without incurring (significant) additional communication overhead. The contributions of the paper are the following: (i) we identify and formulate the agent placement problem (APP) in a way that is of practical use to the POBICOS middleware but can also prove useful to other work on mobile agent systems with placement constraints, (ii) we present a distributed algorithm that relies on minimal network knowledge and extend it so that it can exploit additional information about the underlying network topology (if available), (iii) we evaluate both algorithm variants via simulations and discuss their performance.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 490–502, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Application and System Model, Problem Formulation

This section introduces the type of applications targeted in this work and the underlying system and network model. It then formulates the agent placement problem (APP) and the respective optimization objectives.

2.1 Application Model

We focus on applications that are structured as a set of cooperating agents organized in a hierarchy. For instance, consider a demand-response client which tries to reduce power consumption upon request of the energy utility. A simplified possible structure is shown in Fig. 1. The lowest level of the tree comprises agents that periodically report individual device status and power consumption to a room agent, which reports (aggregated) data for the entire room to the root agent. When the root decides to lower power consumption (responding to a request issued by the electric utility), it requests some or all room agents to curb power consumption as needed. In turn, room agents trigger the respective actions (turn off devices, lower consumption level) in the end devices by sending requests to the corresponding device agents.

Leaf (sensing and actuating) agents interact with the physical environment and must be placed on nodes that provide the respective sensors or actuators and are located at the required areas; hence they are called node-specific. On the other hand, intermediate agents perform their tasks using just general-purpose computing resources, which can be provided by any node; thus we refer to these agents as node-neutral. In Fig. 1, device agents are node-specific while all other agents are node-neutral.

Agents can migrate between nodes to offload their current hosts or to get closer to the agents they communicate with. In our work we consider migration only for node-neutral agents, because their operation is location- and node-independent by design, while node-specific agents remain fixed on the nodes where they were created. Still, the ability to migrate node-neutral agents creates a significant optimization potential in terms of reducing the overall communication cost.

2.2 System Model

We assume a network of capacitated (resource-constrained) nodes with sensing and/or actuating capabilities. Let ni denote the ith node, 1 ≤ i ≤ N, and r(ni) its resource capacity (processing power or memory size). The capacity of a node imposes a generic constraint on the number of agents it can host. Nodes communicate with each other on top of a (wireless) network that is treated as a black box. The underlying routing topology is abstracted as a graph, its vertices representing nodes and each edge representing a bidirectional routing-level link between a node pair. In this work we consider tree-based routing, i.e., there is exactly one path connecting any two nodes. Let D be an N×N×N boolean matrix encoding the routing topology as follows: Dijx = 1 iff the path from ni to nj includes nx, else 0. Since we assume that the network is a tree, Dijx = Djix. Also, Diii = 1, Dijj = 1 and Diij = 0. Let hij be the path length between ni and nj, equal to 0 for i = j. Obviously, hij = hji.
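The routing-topology matrix D can be built directly from the tree's adjacency structure by enumerating, for every node pair, the unique connecting path. A minimal sketch (the 3-node example tree in the test is illustrative, not from the paper):

```python
# Sketch of the routing-topology encoding from Sect. 2.2: for a tree,
# D[i][j][x] = 1 iff the unique path from node i to node j includes x.
# adj maps each node id to the list of its routing-level neighbors.

def build_D(adj, N):
    def path(i, j):
        # Iterative DFS; in a tree the first path found is the only one.
        stack = [(i, [i])]
        while stack:
            v, p = stack.pop()
            if v == j:
                return p
            for w in adj[v]:
                if w not in p:
                    stack.append((w, p + [w]))

    D = [[[0] * N for _ in range(N)] for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for x in path(i, j):
                D[i][j][x] = 1
    return D

adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}  # illustrative 4-node tree
D = build_D(adj, 4)
assert D[0][2][1] == 1       # the path 0 -> 2 passes through node 1
assert D[0][2] == D[2][0]    # symmetry: D_ijx = D_jix
assert D[0][0][0] == 1 and D[0][2][2] == 1 and D[0][0][2] == 0
```

The assertions exercise the stated properties Dijx = Djix, Diii = 1, Dijj = 1 and Diij = 0; the path length hij is simply `len(path(i, j)) - 1`.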
Fig. 1. Agent tree structure of an indicative sensing/control application (a root D/R agent, room D/R agents room 1 … room n, and device agents device 1 … device m with sense & control capabilities)
Each application is structured as a set of cooperating agents organized in a tree, the leaf agents being node-specific and all other agents being node-neutral. Assuming an enumeration of agents whereby node-neutral agents come first, let ak be the kth agent, 1 ≤ k ≤ A+S, with A and S being equal to the total number of node-neutral and node-specific agents, respectively. Let r(ak) be the capacity required to host ak. Agent-level traffic is captured via an (A+S)×(A+S) matrix C, where Ckm denotes the load from ak to am (measured in data units over a time period). Note that Ckm need not be equal to Cmk. Also, Ckk = 0, since an agent does not send messages to itself.

2.3 Problem Formulation

For the sake of generality we target the case where all agents are already hosted on some nodes, but the current placement is non-optimal. Let P be an N×(A+S) matrix used to encode the placement of agents on nodes as follows: Pik = 1 iff ni hosts ak, 0 otherwise. The total network load L incurred by the application for a placement P can then be expressed as:

L = \sum_{k=1}^{A+S} \sum_{m=1}^{A+S} C_{km} \sum_{i=1}^{N} \sum_{j=1}^{N} h_{ij} P_{ik} P_{jm}    (1)
A placement P is valid iff each agent is hosted on exactly one node and the node capacity constraints are not violated:

\sum_{i=1}^{N} P_{ik} = 1, \quad \forall 1 \le k \le A+S    (2)

\sum_{k=1}^{A+S} P_{ik} \, r(a_k) \le r(n_i), \quad \forall 1 \le i \le N    (3)
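The objective (1) and the validity constraints (2)-(3) translate directly into code; a minimal sketch (the small instance in the test is hand-made for illustration):

```python
# Direct evaluation of the network load (1) and the validity checks
# (2)-(3). C is agent-to-agent traffic, h the hop-distance matrix,
# P the node-by-agent placement matrix, r_a/r_n the capacity vectors.

def load(C, h, P, N, A_S):
    # Equation (1): sum C_km * h_ij over all hosted pairs (ak on ni, am on nj).
    return sum(C[k][m] * h[i][j] * P[i][k] * P[j][m]
               for k in range(A_S) for m in range(A_S)
               for i in range(N) for j in range(N))

def valid(P, r_a, r_n, N, A_S):
    # Equation (2): every agent hosted on exactly one node.
    each_once = all(sum(P[i][k] for i in range(N)) == 1 for k in range(A_S))
    # Equation (3): per-node capacity not exceeded.
    capacity = all(sum(P[i][k] * r_a[k] for k in range(A_S)) <= r_n[i]
                   for i in range(N))
    return each_once and capacity
```

This brute-force evaluation is only meant to make the formulas concrete; the distributed algorithms of Sect. 3 never materialize P or D globally.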
Also, a migration is valid only if, starting from a valid placement P, it leads to another valid agent placement P' without moving any node-specific agents:

P_{ik} = P'_{ik}, \quad \forall A < k \le A+S    (4)
The agent placement problem (APP) can then be stated as: starting from an initial valid agent placement Pold, perform a series of valid agent migrations, eventually leading to a new valid placement Pnew that minimizes (1). Note that the solution to APP may be a placement that is actually suboptimal in terms of (1). This is the case when the optimal placement is infeasible due to (3), more specifically, when it can be reached only by performing a “swap” that cannot be accomplished because there is not enough free capacity on any node. A similar feasibility issue is discussed in [5] but in a slightly different context. Also, (1) does not take into account the cost for performing a migration. This is because we target scenarios where the application structure, agent-level traffic pattern and underlying routing topology are expected to be sufficiently stable to amortize the migration costs.
3 Uncapacitated 1-Hop Agent Migration Algorithm

This section presents an agent migration algorithm for the case where nodes can host any number of agents, i.e., without taking into account capacity limitations. In terms of routing knowledge, each node knows only its immediate (1-hop) neighbors involved in transporting inbound and outbound agent messages; we refer to this as 1-hop network awareness. This information can be provided by even a very simple networking layer. A node does not attempt to discover additional nodes but simply considers migrating agents to one of its neighbors. An agent may nevertheless move to distant nodes via consecutive 1-hop migrations.

Description. The 1-hop agent migration algorithm (AMA-1) works as follows. A node records, for each locally hosted agent, the traffic associated with each neighboring node as well as the local traffic, due to the message exchange with remote and local agents, respectively. Periodically, this information is used to decide if it is beneficial for the agent to migrate to a neighbor. More formally, let lijk denote the load associated with agent ak hosted at node ni for a neighbor node nj (notice that ni does not have to know D to compute lijk):

l_{ijk} = \sum_{m=1}^{A+S} (C_{km} + C_{mk}) D_{ixj}, \quad P_{xm} = 1    (5)
The decision to migrate ak from ni to nj is taken iff lijk is greater than the total load with all other neighbors of ni plus the local load associated with ak:

l_{ijk} > l_{iik} + \sum_{x \ne i,j} l_{ixk}, \quad h_{ij} = h_{ix} = 1    (6)
The intuition behind (6) is that by moving ak from its current host ni to a neighbor nj, the distance for the load with nj decreases by one hop while the distance for all other loads, including the load that used to take place locally, increases by one hop. If (6) holds, the cost-benefit of the migration is positive, hence the migration reduces the total network load as per (1).
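Given the per-agent load table, the migration test (6) is a one-line comparison per neighbor. A minimal sketch, assuming the loads have already been measured:

```python
# Migration test (6) of AMA-1: agent ak should move from its host to
# neighbor j iff the load toward j exceeds the local load plus the load
# toward all other neighbors. `loads` maps neighbor id -> l_ijk for the
# agent under consideration; `local` is the local load l_iik.

def best_migration(loads, local):
    total = local + sum(loads.values())
    for j, l_j in loads.items():
        if l_j > total - l_j:   # equivalent to inequality (6)
            return j            # beneficial destination found
    return None                 # no beneficial migration: stay put
```

Replaying the paper's example for a3 reproduces its outcome: at n1 the loads {n2: 2, n3: 3} with local load 0 make n3 beneficial, while at n3 the loads {n1: 2, n5: 2, n6: 0} with local load 1 admit no further move.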
Fig. 2. (a) Application agent structure (root a1, intermediate agents a2 and a3, node-specific agents a4–a7); (b) Agent placement on the network (nodes n1–n7)
Consider the application depicted in Fig. 2a, which comprises four node-specific agents (a4, a5, a6, a7), two intermediate node-neutral agents (a2 and a3) and a node-neutral root agent (a1), and the agent placement on nodes shown in Fig. 2b. Let each node-specific agent generate 2 data units per time unit towards its parent, which in turn generates 1 data unit per time unit towards the root (edge values in Fig. 2a). Assume that n1 runs the algorithm for a3 (striped). The loads associated with a3 for the neighbor nodes n2 and n3 are l123 = 2 and l133 = 3, respectively, while the local load is l113 = 0. According to (6), the only beneficial migration for a3 is to move to n3. Continuing the example, assume that a3 indeed migrates to n3 and is (again) checked for migration. This time the relevant loads are l313 = 2, l353 = 2, l363 = 0 and l333 = 1, thus a3 will remain at n3. Similarly, a1 will remain at n3, while a2 will eventually migrate from n4 to n2, then to n1 and last to n3, resulting in a placement where all node-neutral agents are hosted at n3. This placement is stable, since there is no beneficial migration as per (6). Implementation and complexity. For each local agent it is required to record the load with each neighboring node and the load with other locally hosted agents. This can be done using an A’×(g+1) load table, where A’ is the number of local node-neutral agents and g is the node degree (number of neighbors). The destination for each agent can then be determined as per (6) in a single pass across the respective row of the load table, in O(g) operations, or a total of O(gA’) for all agents. Note that the results of this calculation remain valid as long as the underlying network topology, application structure and agent message traffic statistics do not change. Convergence. The algorithm does not guarantee convergence because it is susceptible to livelocks. Revisiting the previous example, assume that the application consists only of the right-hand subtree of Fig.
2a, placed as in Fig. 2b. Node n1 may decide to move a3 to n3 while n3 may decide to move a1 to n1. Later on, the same migrations may be performed in the reverse direction, restoring the old placement, and so on. We expect such livelocks to be rare in practice, especially if neighboring nodes invoke the algorithm at different intervals. Nevertheless, to guarantee convergence we introduce a coordination scheme in the spirit of a mutual exclusion protocol. When ni decides to migrate ak to nj, it asks for permission. To avoid “swaps”, nj denies this request if: (i) it hosts an agent ak’ that is the child or the parent of ak, (ii) it has decided to migrate ak’ to ni, and (iii) the identifier of nj is smaller than that of ni (j < i).
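The denial rule can be sketched as a permission check run by the target node. The `Agent` and `Node` classes below are illustrative stand-ins (the paper does not prescribe data structures); conditions (i)-(iii) map onto the three conjuncts:

```python
# Sketch of the coordination scheme's permission check: the target node
# nj denies ni's request to migrate ak when a symmetric migration of a
# related agent is pending and nj holds the smaller identifier, so only
# one side of a potential "swap" backs off. Classes are illustrative.

class Agent:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent:
            parent.children.append(self)

class Node:
    def __init__(self, nid):
        self.nid = nid
        self.hosted = []    # agents currently hosted on this node
        self.pending = {}   # agent -> destination node of a planned migration

    def grants(self, requester, ak):
        for ak2 in self.hosted:
            related = ak2 is ak.parent or ak2 in ak.children   # condition (i)
            swap = self.pending.get(ak2) is requester          # condition (ii)
            if related and swap and self.nid < requester.nid:  # condition (iii)
                return False   # deny: break the livelock deterministically
        return True
```

Because only the node with the smaller identifier denies, exactly one of the two symmetric migrations proceeds and the livelock of the example above cannot recur.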