PROCEEDINGS OF THE SIXTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FIRST WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS

Editors:
Lars Arge, Giuseppe F. Italiano, and Robert Sedgewick


PROCEEDINGS OF THE SIXTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FIRST WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS

SIAM PROCEEDINGS SERIES LIST

Glowinski, R., Golub, G. H., Meurant, G. A., and Periaux, J., First International Conference on Domain Decomposition Methods for Partial Differential Equations (1988)
Salam, Fathi M. A. and Levi, Mark L., Dynamical Systems Approaches to Nonlinear Problems in Systems and Circuits (1988)
Datta, B., Johnson, C., Kaashoek, M., Plemmons, R., and Sontag, E., Linear Algebra in Signals, Systems and Control (1988)
Ringeisen, Richard D. and Roberts, Fred S., Applications of Discrete Mathematics (1988)
McKenna, James and Temam, Roger, ICIAM '87: Proceedings of the First International Conference on Industrial and Applied Mathematics (1988)
Rodrigue, Garry, Parallel Processing for Scientific Computing (1989)
Caflisch, Russel E., Mathematical Aspects of Vortex Dynamics (1989)
Wouk, Arthur, Parallel Processing and Medium-Scale Multiprocessors (1989)
Flaherty, Joseph E., Paslow, Pamela J., Shephard, Mark S., and Vasilakis, John D., Adaptive Methods for Partial Differential Equations (1989)
Kohn, Robert V. and Milton, Graeme W., Random Media and Composites (1989)
Mandel, Jan, McCormick, S. F., Dendy, J. E., Jr., Farhat, Charbel, Lonsdale, Guy, Porter, Seymour V., Ruge, John W., and Stuben, Klaus, Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods (1989)
Colton, David, Ewing, Richard, and Rundell, William, Inverse Problems in Partial Differential Equations (1990)
Chan, Tony F., Glowinski, Roland, Periaux, Jacques, and Widlund, Olof B., Third International Symposium on Domain Decomposition Methods for Partial Differential Equations (1990)
Dongarra, Jack, Messina, Paul, Sorensen, Danny C., and Voigt, Robert G., Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing (1990)
Glowinski, Roland and Lichnewsky, Alain, Computing Methods in Applied Sciences and Engineering (1990)
Coleman, Thomas F. and Li, Yuying, Large-Scale Numerical Optimization (1990)
Aggarwal, Alok, Borodin, Allan, Gabow, Harold N., Galil, Zvi, Karp, Richard M., Kleitman, Daniel J., Odlyzko, Andrew M., Pulleyblank, William R., Tardos, Eva, and Vishkin, Uzi, Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms (1990)
Cohen, Gary, Halpern, Laurence, and Joly, Patrick, Mathematical and Numerical Aspects of Wave Propagation Phenomena (1991)
Gomez, S., Hennart, J. P., and Tapia, R. A., Advances in Numerical Partial Differential Equations and Optimization: Proceedings of the Fifth Mexico-United States Workshop (1991)
Glowinski, Roland, Kuznetsov, Yuri A., Meurant, Gerard, Periaux, Jacques, and Widlund, Olof B., Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations (1991)
Alavi, Y., Chung, F. R. K., Graham, R. L., and Hsu, D. F., Graph Theory, Combinatorics, Algorithms, and Applications (1991)
Wu, Julian J., Ting, T. C. T., and Barnett, David M., Modern Theory of Anisotropic Elasticity and Applications (1991)
Shearer, Michael, Viscous Profiles and Numerical Methods for Shock Waves (1991)
Griewank, Andreas and Corliss, George F., Automatic Differentiation of Algorithms: Theory, Implementation, and Application (1991)
Frederickson, Greg, Graham, Ron, Hochbaum, Dorit S., Johnson, Ellis, Kosaraju, S. Rao, Luby, Michael, Megiddo, Nimrod, Schieber, Baruch, Vaidya, Pravin, and Yao, Frances, Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms (1992)
Field, David A. and Komkov, Vadim, Theoretical Aspects of Industrial Design (1992)
Field, David A. and Komkov, Vadim, Geometric Aspects of Industrial Design (1992)
Bednar, J. Bee, Lines, L. R., Stolt, R. H., and Weglein, A. B., Geophysical Inversion (1992)
O'Malley, Robert E. Jr., ICIAM 91: Proceedings of the Second International Conference on Industrial and Applied Mathematics (1992)
Keyes, David E., Chan, Tony F., Meurant, Gerard, Scroggs, Jeffrey S., and Voigt, Robert G., Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations (1992)
Dongarra, Jack, Messina, Paul, Kennedy, Ken, Sorensen, Danny C., and Voigt, Robert G., Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing (1992)
Corones, James P., Kristensson, Gerhard, Nelson, Paul, and Seth, Daniel L., Invariant Imbedding and Inverse Problems (1992)
Ramachandran, Vijaya, Bentley, Jon, Cole, Richard, Cunningham, William H., Guibas, Leo, King, Valerie, Lawler, Eugene, Lenstra, Arjen, Mulmuley, Ketan, Sleator, Daniel D., and Yannakakis, Mihalis, Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (1993)
Kleinman, Ralph, Angell, Thomas, Colton, David, Santosa, Fadil, and Stakgold, Ivar, Second International Conference on Mathematical and Numerical Aspects of Wave Propagation (1993)
Banks, H. T., Fabiano, R. H., and Ito, K., Identification and Control in Systems Governed by Partial Differential Equations (1993)
Sleator, Daniel D., Bern, Marshall W., Clarkson, Kenneth L., Cook, William J., Karlin, Anna, Klein, Philip N., Lagarias, Jeffrey C., Lawler, Eugene L., Maggs, Bruce, Milenkovic, Victor J., and Winkler, Peter, Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (1994)
Lewis, John G., Proceedings of the Fifth SIAM Conference on Applied Linear Algebra (1994)
Brown, J. David, Chu, Moody T., Ellison, Donald C., and Plemmons, Robert J., Proceedings of the Cornelius Lanczos International Centenary Conference (1994)
Dongarra, Jack J. and Tourancheau, B., Proceedings of the Second Workshop on Environments and Tools for Parallel Scientific Computing (1994)
Bailey, David H., Bjørstad, Petter E., Gilbert, John R., Mascagni, Michael V., Schreiber, Robert S., Simon, Horst D., Torczon, Virginia J., and Watson, Layne T., Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing (1995)
Clarkson, Kenneth, Agarwal, Pankaj K., Atallah, Mikhail, Frieze, Alan, Goldberg, Andrew, Karloff, Howard, Manber, Udi, Munro, Ian, Raghavan, Prabhakar, Schmidt, Jeanette, and Yung, Moti, Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (1995)
Becache, Elaine, Cohen, Gary, Joly, Patrick, and Roberts, Jean E., Third International Conference on Mathematical and Numerical Aspects of Wave Propagation (1995)
Engl, Heinz W. and Rundell, W., GAMM-SIAM Proceedings on Inverse Problems in Diffusion Processes (1995)
Angell, T. S., Cook, Pamela L., Kleinman, R. E., and Olmstead, W. E., Nonlinear Problems in Applied Mathematics (1995)
Tardos, Eva, Applegate, David, Canny, John, Eppstein, David, Galil, Zvi, Karger, David R., Karlin, Anna R., Linial, Nati, Rao, Satish B., Vitter, Jeffrey S., and Winkler, Peter M., Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (1996)
Cook, Pamela L., Roytburd, Victor, and Tulin, Marshal, Mathematics Is for Solving Problems (1996)
Adams, Loyce and Nazareth, J. L., Linear and Nonlinear Conjugate Gradient-Related Methods (1996)
Renardy, Yuriko Y., Coward, Adrian V., Papageorgiou, Demetrios T., and Sun, Shu-Ming, Advances in Multi-Fluid Flows (1996)
Berz, Martin, Bischof, Christian, Corliss, George, and Griewank, Andreas, Computational Differentiation: Techniques, Applications, and Tools (1996)
Delic, George and Wheeler, Mary F., Next Generation Environmental Models and Computational Methods (1997)
Engl, Heinz W., Louis, Alfred, and Rundell, William, Inverse Problems in Geophysical Applications (1997)
Saks, Michael, Anderson, Richard, Bach, Eric, Berger, Bonnie, Blum, Avrim, Chazelle, Bernard, Edelsbrunner, Herbert, Henzinger, Monika, Johnson, David, Kannan, Sampath, Khuller, Samir, Maggs, Bruce, Muthukrishnan, S., Ruskey, Frank, Seymour, Paul, Spencer, Joel, Williamson, David P., and Williamson, Gill, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (1997)
Alexandrov, Natalia M. and Hussaini, M. Y., Multidisciplinary Design Optimization: State of the Art (1997)
Van Huffel, Sabine, Recent Advances in Total Least Squares Techniques and Errors-in-Variables Modeling (1997)
Ferris, Michael C. and Pang, Jong-Shi, Complementarity and Variational Problems: State of the Art (1997)
Bern, Marshall, Fiat, Amos, Goldberg, Andrew, Kannan, Sampath, Karloff, Howard, Kenyon, Claire, Kierstead, Hal, Kosaraju, Rao, Linial, Nati, Rabani, Yuval, Rodl, Vojta, Sharir, Micha, Shmoys, David, Spielman, Dan, Spinrad, Jerry, Srinivasan, Aravind, and Sudan, Madhu, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (1998)
DeSanto, John A., Mathematical and Numerical Aspects of Wave Propagation (1998)
Tarjan, Robert E., Warnow, Tandy, Amenta, Nina, Benham, Craig, Corneil, Derek G., Edelsbrunner, Herbert, Feigenbaum, Joan, Gusfield, Dan, Habib, Michel, Hall, Leslie, Karp, Richard, King, Valerie, Koller, Daphne, McKay, Brendan, Moret, Bernard, Muthukrishnan, S., Phillips, Cindy, Raghavan, Prabhakar, Randall, Dana, and Scheinerman, Edward, Proceedings of the Tenth ACM-SIAM Symposium on Discrete Algorithms (1999)
Hendrickson, Bruce, Yelick, Katherine A., Bischof, Christian H., Duff, Iain S., Edelman, Alan S., Geist, George A., Heath, Michael T., Heroux, Michael H., Koelbel, Chuck, Schreiber, Robert S., Sincovec, Richard F., and Wheeler, Mary F., Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing (1999)
Henderson, Michael E., Anderson, Christopher R., and Lyons, Stephen L., Object Oriented Methods for Interoperable Scientific and Engineering Computing (1999)
Shmoys, David, Brightwell, Graham, Cohen, Edith, Cook, Bill, Eppstein, David, Gerards, Bert, Irani, Sandy, Kenyon, Claire, Ostrovsky, Rafail, Peleg, David, Pevzner, Pavel, Reed, Bruce, Stein, Cliff, Tetali, Prasad, and Welsh, Dominic, Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms (2000)
Bermudez, Alfredo, Gomez, Dolores, Hazard, Christophe, Joly, Patrick, and Roberts, Jean E., Fifth International Conference on Mathematical and Numerical Aspects of Wave Propagation (2000)
Kosaraju, S. Rao, Bellare, Mihir, Buchsbaum, Adam, Chazelle, Bernard, Graham, Fan Chung, Karp, Richard, Lovasz, Laszlo, Motwani, Rajeev, Myrvold, Wendy, Pruhs, Kirk, Sinclair, Alistair, Spencer, Joel, Stein, Cliff, Tardos, Eva, and Vempala, Santosh, Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (2001)
Koelbel, Charles and Meza, Juan, Proceedings of the Tenth SIAM Conference on Parallel Processing for Scientific Computing (2001)
Grossman, Robert, Kumar, Vipin, and Han, Jiawei, Proceedings of the First SIAM International Conference on Data Mining (2001)
Berry, Michael, Computational Information Retrieval (2001)
Eppstein, David, Demaine, Erik, Doerr, Benjamin, Fleischer, Lisa, Goel, Ashish, Goodrich, Mike, Khanna, Sanjeev, King, Valerie, Munro, Ian, Randall, Dana, Shepherd, Bruce, Spielman, Dan, Sudakov, Benjamin, Suri, Subhash, and Warnow, Tandy, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2002)
Grossman, Robert, Han, Jiawei, Kumar, Vipin, Mannila, Heikki, and Motwani, Rajeev, Proceedings of the Second SIAM International Conference on Data Mining (2002)
Estep, Donald and Tavener, Simon, Collected Lectures on the Preservation of Stability under Discretization (2002)
Ladner, Richard E., Proceedings of the Fifth Workshop on Algorithm Engineering and Experiments (2003)
Arge, Lars, Italiano, Giuseppe F., and Sedgewick, Robert, Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments and the First Workshop on Analytic Algorithmics and Combinatorics (2004)

PROCEEDINGS OF THE SIXTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FIRST WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS

Edited by Lars Arge, Giuseppe F. Italiano, and Robert Sedgewick

Society for Industrial and Applied Mathematics Philadelphia

PROCEEDINGS OF THE SIXTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FIRST WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS

Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments, New Orleans, LA, January 10, 2004

Proceedings of the First Workshop on Analytic Algorithmics and Combinatorics, New Orleans, LA, January 10, 2004

The workshops were supported by the ACM Special Interest Group on Algorithms and Computation Theory and the Society for Industrial and Applied Mathematics.

Copyright © 2004 by the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Catalog Card Number: 2004104581

ISBN 0-89871-564-4

SIAM is a registered trademark.

CONTENTS

Preface to the Workshop on Algorithm Engineering and Experiments  ix
Preface to the Workshop on Analytic Algorithmics and Combinatorics  xi

Workshop on Algorithm Engineering and Experiments

Engineering Geometric Algorithms: Persistent Problems and Some Solutions (Abstract of Invited Talk), Dan Halperin  3
Engineering a Cache-Oblivious Sorting Algorithm, Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther  4
The Robustness of the Sum-of-Squares Algorithm for Bin Packing, Michael A. Bender, Bryan Bradley, Geetha Jagannathan, and Krishnan Pillaipakkamnatt  18
Practical Aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences, Wing-Kai Hon, Tak-Wah Lam, Wing-Kin Sung, Wai-Leuk Tse, Chi-Kwong Wong, and Siu-Ming Yiu  31
Faster Placement of Hydrogens in Protein Structures by Dynamic Programming, Andrew Leaver-Fay, Yuanxin Liu, and Jack Snoeyink  39
An Experimental Analysis of a Compact Graph Representation, Daniel K. Blandford, Guy E. Blelloch, and Ian A. Kash  49
Kernelization Algorithms for the Vertex Cover Problem: Theory and Experiments, Faisal N. Abu-Khzam, Rebecca L. Collins, Michael R. Fellows, Michael A. Langston, W. Henry Suters, and Christopher T. Symons  62
Safe Separators for Treewidth, Hans L. Bodlaender and Arie M.C.A. Koster  70
Efficient Implementation of a Hotlink Assignment Algorithm for Web Sites, Artur Alves Pessoa, Eduardo Sany Laber, and Criston de Souza  79
Experimental Comparison of Shortest Path Approaches for Timetable Information, Evangelia Pyrga, Frank Schulz, Dorothea Wagner, and Christos Zaroliagis  88
Reach-Based Routing: A New Approach to Shortest Path Algorithms Optimized for Road Networks, Ron Gutman  100
Lazy Algorithms for Dynamic Closest Pair with Arbitrary Distance Measures, Jean Cardinal and David Eppstein  112
Approximating the Visible Region of a Point on a Terrain, Boaz Ben-Moshe, Paz Carmi, and Matthew J. Katz  120
A Computational Framework for Handling Motion, Leonidas Guibas, Menelaos I. Karavelas, and Daniel Russel  129
Engineering a Sorted List Data Structure for 32 Bit Keys, Roman Dementiev, Lutz Kettner, Jens Mehnert, and Peter Sanders  142

Workshop on Analytic Algorithmics and Combinatorics

Theory and Practice of Probabilistic Counting Algorithms (Abstract of Invited Talk), Philippe Flajolet  152
Analysis of a Randomized Selection Algorithm Motivated by the LZ'77 Scheme, Mark Daniel Ward and Wojciech Szpankowski  153
The Complexity of Jensen's Algorithm for Counting Polyominoes, Gill Barequet and Micha Moffie  161
Distributional Analyses of Euclidean Algorithms, Viviane Baladi and Brigitte Vallee  170
A Simple Primality Test and the rth Smallest Prime Factor, Daniel Panario, Bruce Richmond, and Martha Yip  185
Gap-Free Samples of Geometric Random Variables, Pawel Hitczenko and Arnold Knopfmacher  194
Computation of a Class of Continued Fraction Constants, Loïck Lhote  199
Compositions and Patricia Tries: No Fluctuations in the Variance!, Helmut Prodinger  211
Quadratic Convergence for Scaling of Matrices, Martin Fürer  216
Partial Quicksort, Conrado Martinez  224

Author Index  229

ALENEX WORKSHOP PREFACE

The annual workshop on Algorithm Engineering and Experiments (ALENEX) provides a forum for the presentation of original research in the implementation and experimental evaluation of algorithms and data structures. ALENEX 2004 was the sixth workshop in this series. It was held in New Orleans, Louisiana, on January 10, 2004. These proceedings contain the 14 papers that were selected for presentation from a total of 56 submissions.

Considerable effort was devoted to the evaluation of the submissions. However, submissions were not refereed in the thorough and detailed way that is customary for journal papers. It is expected that most of the papers in these proceedings will eventually appear in finished form in scientific journals.

We would like to thank all the people who contributed to a successful workshop. In particular, we thank the Program Committee and all of our many colleagues who helped the Program Committee evaluate the submissions. We also thank Adam Buchsbaum for answering our many questions along the way and Bryan Holland-Minkley for help with the submission and program committee software. We gratefully acknowledge the generous support of Microsoft, which helped reduce the registration fees for students, and thank SIAM for providing in-kind support and facilitating the workshop; in particular, the help of Kirsten Wilden from SIAM has been invaluable. Finally, we would like to thank the invited speaker, Dan Halperin of Tel Aviv University.

Lars Arge and Giuseppe F. Italiano

ALENEX 2004 Program Committee

Lars Arge, Duke University
Jon Bentley, Avaya Labs Research
Mark de Berg, Technische Universiteit Eindhoven
Monika Henzinger, Google
Giuseppe F. Italiano, University of Rome
David Karger, Massachusetts Institute of Technology
Ulrich Meyer, Max-Planck-Institut für Informatik
Jan Vahrenhold, University of Münster

ALENEX 2004 Steering Committee

Adam Buchsbaum, AT&T Research
Roberto Battiti, University of Trento
Andrew V. Goldberg, Microsoft Research
Michael Goodrich, University of California, Irvine
David S. Johnson, AT&T Research
Richard E. Ladner, University of Washington, Seattle
Catherine C. McGeoch, Amherst College
David Mount, University of Maryland, College Park
Bernard M.E. Moret, University of New Mexico
Jack Snoeyink, University of North Carolina, Chapel Hill
Clifford Stein, Columbia University



ALENEX 2004 Subreviewers

Pankaj K. Agarwal
Susanne Albers
Ernst Althaus
Rene Beier
Michael Bender
Henrik Blunck
Christian Breimann
Herve Bronnimann
David Bryant
Adam Buchsbaum
Stefan Burkhardt
Marco Cesati
Sunil Chandran
Erik Demaine
Roman Dementiev
Camil Demetrescu
Jeff Erickson
Irene Finocchi
Luisa Gargano
Leszek Gasieniec
Raffaele Giancarlo
Roberto Grossi
Concettina Guerra
Bryan Holland-Minkley
Michael Jacob
Klaus Jansen
Juha Kaerkkainen
Spyros Kontogiannis
Luigi Laura
Giovanni Manzini
Eli Packer
Ron Parr
Marco Pellegrini
Seth Pettie
Jordi Planes
Naila Rahman
Rajeev Raman
Joachim Reichel
Timo Ropinski
Kay Salzwedel
Peter Sanders
Guido Schaefer
Christian Scheideler
Naveen Sivadasan
Martin Skutella
Bettina Speckmann
Venkatesh Srinivasan
Firas Swidan
Kavitha Telikepalli
Norbert Zeh


ANALCO WORKSHOP PREFACE

The papers in these proceedings, along with the invited talk by Philippe Flajolet, "Theory and Practice of Probabilistic Counting Algorithms," were presented at the First Workshop on Analytic Algorithmics and Combinatorics (ANALCO04), which was held in New Orleans, Louisiana, on January 10, 2004.

The aim of ANALCO is to provide a forum for the presentation of original research in the analysis of algorithms and associated combinatorial structures. The papers study properties of fundamental combinatorial structures that arise in practical computational applications (such as permutations, trees, strings, tries, and graphs) and address the precise analysis of algorithms for processing such structures, including average-case analysis; analysis of moments, extrema, and distributions; and probabilistic analysis of randomized algorithms. Some of the papers present significant new information about classic algorithms; others present analyses of new algorithms that pose unique analytic challenges, or address tools and techniques for the analysis of algorithms and combinatorial structures, both mathematical and computational.

The workshop took place on the same day as the Sixth Workshop on Algorithm Engineering and Experiments (ALENEX04); the papers from that workshop are also published in this volume. Since researchers in both fields are approaching the problem of learning detailed information about the performance of particular algorithms, we expect that interesting synergies will develop. People in the ANALCO community are encouraged to look over the ALENEX papers for problems where the analysis of algorithms might play a role; people in the ALENEX community are encouraged to look over these ANALCO papers for problems where experimentation might play a role.

Program Committee

Kevin Compton, University of Michigan
Luc Devroye, McGill University, Canada
Mordecai Golin, The Hong Kong University of Science and Technology, Hong Kong
Hsien-Kuei Hwang, Academia Sinica, Taiwan
Robert Sedgewick (Chair), Princeton University
Wojciech Szpankowski, Purdue University
Brigitte Vallee, Universite de Caen, France
Jeffrey S. Vitter, Purdue University



Workshop on Algorithm Engineering and Experiments


Invited Plenary Speaker Abstract

Engineering Geometric Algorithms: Persistent Problems and Some Solutions

Dan Halperin
School of Computer Science
Tel Aviv University

The last decade has seen growing awareness of engineering geometric algorithms, in particular around the development of large scale software for computational geometry (such as CGAL and LEDA). Besides standard issues such as efficiency, the developer of geometric software has to tackle the hardship of robustness problems, namely problems related to arithmetic precision and degenerate input, which are typically ignored in the theory of geometric algorithms and which, in spite of considerable efforts, are still unresolved in full (practical) generality.

We start with an overview of these persistent robustness problems, together with a brief review of prevailing solutions to them. We also briefly describe the CGAL project and library. We then focus on fixed precision approximation methods to deal with robustness issues, and in particular on so-called controlled perturbation, which leads to robust implementation of geometric algorithms while using the standard machine floating-point arithmetic.

We conclude with algorithm-engineering matters that are still geometric but have the more general flavor of addressing efficiency: (i) We discuss the fundamental issue of geometric decomposition (that is, decomposing geometric structures into simpler substructures), exemplify the large gap between the theory and practice of such decompositions, and present practical solutions in two and three dimensions. (ii) We suggest a hybrid approach to motion planning that significantly improves simple heuristic methods by integrating exact geometric algorithms to solve subtasks.

The new results that we describe are documented in: http://www.cs.tau.ac.il/CGAL/Projects/


Engineering a Cache-Oblivious Sorting Algorithm*

Gerth Stølting Brodal†,‡    Rolf Fagerberg†    Kristoffer Vinther§

*This work is based on the M.Sc. thesis of the third author [29].
†BRICS (Basic Research in Computer Science, www.brics.dk, funded by the Danish National Research Foundation), Department of Computer Science, University of Aarhus, DK-8000 Århus C, Denmark. E-mail: {gerth,rolf}@brics.dk. Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
‡Supported by the Carlsberg Foundation (contract number ANS-0257/20).
§Systematic Software Engineering A/S, Søren Frichs Vej 39, DK-8000 Århus C, Denmark. E-mail: kv@brics.dk.

Abstract

This paper is an algorithmic engineering study of cache-oblivious sorting. We investigate a number of implementation issues and parameter choices for the cache-oblivious sorting algorithm Lazy Funnelsort by empirical methods, and compare the final algorithm with Quicksort, the established standard for comparison based sorting, as well as with recent cache-aware proposals.

The main result is a carefully implemented cache-oblivious sorting algorithm, which our experiments show can be faster than the best Quicksort implementation we can find, already for input sizes well within the limits of RAM. It is also at least as fast as the recent cache-aware implementations included in the test. On disk the difference is even more pronounced regarding Quicksort and the cache-aware algorithms, whereas the algorithm is slower than a careful implementation of multiway Mergesort such as TPIE.

1 Introduction

Modern computers contain a hierarchy of memory levels, with each level acting as a cache for the next. Typical components of the memory hierarchy are: registers, level 1 cache, level 2 cache, level 3 cache, main memory, and disk. The time for accessing a level increases for each new level (most dramatically when going from main memory to disk), making the cost of a memory access depend highly on what is the current lowest memory level containing the element accessed. As a consequence, the memory access pattern of an algorithm has a major influence on its running time in practice. Since classic asymptotic analysis of algorithms in the RAM model is unable to capture this,

a number of more elaborate models for analysis have been proposed. The most widely used of these is the I/O model introduced by Aggarwal and Vitter [2] in 1988, which assumes a memory hierarchy containing two levels, the lower level having size M and the transfer between the two levels taking place in blocks of B consecutive elements. The cost of the computation is the number of blocks transferred. The strength of the I/O model is that it captures part of the memory hierarchy, while being sufficiently simple to make analysis of algorithms feasible. In particular, it adequately models the situation where the memory transfer between two levels of the memory hierarchy dominates the running time, which is often the case when the size of the data significantly exceeds the size of main memory. By now, a large number of results for the I/O model exists; see the surveys by Arge [3] and Vitter [30]. Among the fundamental facts are that in the I/O model, comparison based sorting takes Θ(Sort_{M,B}(N)) I/Os in the worst case, where Sort_{M,B}(N) = (N/B) log_{M/B} (N/B).

More elaborate models for multi-level memory hierarchies have been proposed ([30, Section 2.3] gives an overview), but fewer analyses of algorithms have been done. For these models, as for the I/O model of Aggarwal and Vitter, algorithms are assumed to know the characteristics of the memory hierarchy.

Recently, the concept of cache-oblivious algorithms was introduced by Frigo et al. [19]. In essence, this designates algorithms formulated in the RAM model, but analyzed in the I/O model for arbitrary block size B and memory size M. I/Os are assumed to be performed automatically by an offline optimal cache replacement strategy. This seemingly simple change has significant consequences: since the analysis holds for any block and memory size, it holds for all levels of a multilevel memory hierarchy (see [19] for details).
In other words, by optimizing an algorithm to one unknown level of the memory hierarchy, it is optimized to each level automatically. Thus, the cache-oblivious model in an elegant way combines the simplicity of the I/O model with a coverage of the entire memory hierarchy. An additional benefit is that the characteristics of the memory hierarchy do not need to be known, and do not need to be hardwired into the algorithm for

the analysis to hold. This increases the algorithm's portability (a benefit for e.g. software libraries), and its robustness against changing memory resources on machines running multiple processes.

In 1999, Frigo et al. introduced the concept of cache-obliviousness, and presented optimal cache-oblivious algorithms for matrix transposition, FFT, and sorting [19], and also gave a proposal for static search trees [25] with search cost matching that of standard (cache-aware) B-trees [6]. Since then, quite a number of results for the model have appeared, including the following: Bender et al. [11] gave a proposal for cache-oblivious dynamic search trees with search cost matching B-trees. Simpler cache-oblivious search trees with complexities matching that of [11] were presented in [12, 17, 26], and a variant with worst case bounds for updates appears in [8]. Cache-oblivious algorithms have been given for problems in computational geometry [1, 8, 14], for scanning dynamic sets [7], for layout of static trees [9], and for partial persistence [8]. Cache-oblivious priority queues have been developed in [4, 15], which in turn gives rise to several cache-oblivious graph algorithms [4].

Some of these results, in particular those involving sorting and algorithms to which sorting reduces, such as priority queues, are proved under the assumption M >= B^2, which is also known as the tall cache assumption. In particular, this applies to the Funnelsort algorithm of Frigo et al. [19]. A variant termed Lazy Funnelsort [14] works under the weaker tall cache assumption M >= B^{1+e} for any fixed e > 0, at the cost of a 1/e factor compared to the optimal sorting bound Θ(Sort_{M,B}(N)) for the case M >> B^{1+e}.

Recently, it was shown [16] that a tall cache assumption is necessary for cache-oblivious comparison based sorting algorithms, in the sense that the trade-off attained by Lazy Funnelsort between strength of assumption and cost for the case M >> B^{1+e} is best possible. This demonstrates a separation in power between the I/O model and the cache-oblivious model for the problems of comparison based sorting. Separations have also been shown for the problems of permuting [16] and of comparison based searching [10].

In contrast to the abundance of theoretical results described above, empirical evaluations of the merits of cache-obliviousness are more scarce. Existing results have focused on basic matrix algorithms [19] and search trees [17, 23, 26]. Although a bit tentative, they conclude that in these areas, the efficiency of cache-oblivious algorithms lies between that of classic RAM algorithms and that of algorithms exploiting knowledge about the specific memory hierarchy present (often termed cache-aware algorithms).

In this paper, we investigate the practical value of cache-oblivious methods in the area of sorting. We focus on the Lazy Funnelsort algorithm, since we believe it to have the biggest potential for an efficient implementation among the current proposals for I/O-optimal cache-oblivious sorting algorithms. We explore a number of implementation issues and parameter choices for the cache-oblivious sorting algorithm Lazy Funnelsort, and settle the best choices through experiments. We then compare the final algorithm with tuned versions of Quicksort, which is generally acknowledged to be the fastest all-round comparison based sorting algorithm, as well as with recent cache-aware proposals. Note that the I/O cost of Quicksort is Θ((N/B) log_2 (N/M)), which only differs from the optimal bound Sort_{M,B}(N) by the base of the logarithm.

The main result is a carefully implemented cache-oblivious sorting algorithm, which our experiments show can be faster than the best Quicksort implementation we can find, already for input sizes well within the limits of RAM. It is also at least as fast as the recent cache-aware implementations included in the test. On disk the difference is even more pronounced regarding Quicksort and the cache-aware algorithms, whereas the algorithm is slower than a careful implementation of multiway Mergesort such as TPIE [18].

These findings support, and extend to the area of sorting, the conclusion of the previous empirical results on cache-obliviousness. This conclusion is that cache-oblivious methods can lead to actual performance gains over classic algorithms developed in the RAM model. The gains may not always match those of the best algorithm tuned to a specific memory hierarchy level, but on the other hand appear to be more robust, applying to several memory hierarchy levels simultaneously.

One observation of independent interest made in this paper is that for the main building block of Funnelsort, namely the k-merger, there is no need for a specific memory layout (contrary to its previous descriptions [14, 19]) for its analysis to hold. Thus, the central feature of the k-merger definition is the sizes of its buffers, and does not include its layout in memory.

The rest of this paper is organized as follows: In Section 2, we describe Lazy Funnelsort. In Section 3, we describe our experimental setup. In Section 4, we develop our optimized implementation of Funnelsort, and in Section 5, we compare it experimentally to a collection of existing efficient sorting algorithms. In Section 6, we sum up our findings.

2 Funnelsort

Three algorithms for cache-oblivious sorting have been proposed so far: Funnelsort [19], its variant Lazy


Funnelsort [14], and a distribution based algorithm [19]. These all have the same optimal bound Sort_{M,B}(N) on the number of I/Os performed, but have rather different structural complexity, with Lazy Funnelsort being the simplest. As simplicity of description often translates into smaller and more efficient code (for algorithms of the same asymptotic complexity), we find the Lazy Funnelsort algorithm the most promising with respect to practical efficiency. In this paper, we choose it as the basis for our study of the practical feasibility of cache-oblivious sorting. We now review the algorithm briefly, and give an observation which further simplifies it. For the full details, see [14].

The algorithm is based on binary mergers. A binary merger takes as input two sorted streams of elements and delivers as output the sorted stream formed by merging these. One merge step moves an element from the head of one of the input streams to the tail of the output stream. The heads of the input streams and the tail of the output stream reside in buffers holding a limited number of elements. A buffer is simply an array of elements, plus fields storing the capacity of the buffer and pointers to the first and last elements in the buffer. Binary mergers can be combined to binary merge trees by letting the output buffer of one merger be an input buffer of another—in other words, binary merge trees are binary trees with mergers at the nodes and buffers at the edges. The leaves of the tree contain the streams to be merged.

An invocation of a merger is a recursive procedure which performs merge steps until its output buffer is full or both input streams are exhausted. If during the invocation an input buffer gets empty, but the corresponding stream is not exhausted, the input buffer is recursively filled by an invocation of the merger having this buffer as its output buffer. If both input streams of a merger get exhausted, the corresponding output stream is marked as exhausted. A single invocation of the root of the merge tree will merge the streams at the leaves of the tree.

One particular merge tree is the k-merger. For k a power of two, a k-merger is a perfect binary tree of k - 1 binary mergers with appropriately sized buffers on the edges, k input streams, and an output buffer at the root of size k^d, for a parameter d > 1. A 16-merger is illustrated in Figure 1.

Figure 1: A 16-merger consisting of 15 binary mergers. Shaded regions are the occupied parts of the buffers.

The sizes of the buffers are defined recursively: Let i = log k. Let the top tree be the subtree consisting of all nodes of depth at most ⌈i/2⌉, and let the subtrees rooted by nodes at depth ⌈i/2⌉ + 1 be the bottom trees. The edges between nodes at depth ⌈i/2⌉ and depth ⌈i/2⌉ + 1 have associated buffers of size ⌈α·k^(d/2)⌉, where α is a positive parameter (introduced in this paper for tuning purposes), and the sizes of the remaining buffers are defined by recursion on the top tree and the bottom trees. In the descriptions in [14, 19], a k-merger is also laid out recursively in memory (according to the so-called van Emde Boas layout [25]), in order to achieve I/O efficiency. We observe in this paper that this is not necessary: In the proof of Lemma 1 in [14], the central idea is to follow the recursive definition down to a specific size k̄ of trees, and then consider the number of I/Os for loading this k̄-merger and one block for each of its output streams into memory. However, this price is not (except for constant factors) changed if we for each of the k̄ - 1 nodes have to load one entire block holding the node, and one block for each of the input and output buffers of the node. From this it follows that the proof holds true, no matter how the k-merger is laid out (however, the entire k-merger should occupy a contiguous segment of memory in order for the complexity proof of Funnelsort itself, Theorem 2 in [14], to be valid). Hence, the crux of the definition of the k-merger lies entirely in the definition of the sizes of the buffers, and does not include the van Emde Boas layout.

To actually sort N elements, the algorithm recursively sorts N^(1/d) segments of size N^(1-1/d) of the input and then merges these using an N^(1/d)-merger. For a proof that this is an I/O optimal algorithm, see [14, 19].
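The invocation procedure described above can be sketched as follows. This is our own minimal illustration, not the paper's code: the names are invented, and the capacity-bounded FIFO built on std::vector is a simplification of the paper's buffers with explicit head/tail pointers.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a binary merger node (illustration only; names are ours).
struct Buffer {
    std::vector<int> data;
    std::size_t cap;            // fixed capacity of the buffer
    std::size_t head = 0;       // index of next element to read
    explicit Buffer(std::size_t capacity) : cap(capacity) { data.reserve(cap); }
    bool empty() const { return head == data.size(); }
    bool full()  const { return data.size() >= cap; }
    int  pop()         { return data[head++]; }
    void push(int x)   { data.push_back(x); }
};

struct Merger {
    Buffer *in[2]    = {nullptr, nullptr};  // input buffers (edges to children, or leaf streams)
    Merger *child[2] = {nullptr, nullptr};  // mergers filling the inputs; null at the leaves
    bool exhausted[2] = {false, false};

    // Perform merge steps until the output buffer is full or both inputs are exhausted.
    // An empty but non-exhausted input buffer is refilled by a recursive invocation.
    void invoke(Buffer &out) {
        while (!out.full()) {
            for (int s = 0; s < 2; ++s)
                if (in[s]->empty() && !exhausted[s]) {
                    in[s]->data.clear(); in[s]->head = 0;      // reset buffer for refill
                    if (child[s]) child[s]->invoke(*in[s]);
                    if (in[s]->empty()) exhausted[s] = true;   // stream ran dry
                }
            if (exhausted[0] && exhausted[1]) return;          // output stream exhausted
            int s;                                             // pick the smaller head element
            if (exhausted[0])      s = 1;
            else if (exhausted[1]) s = 0;
            else s = (in[1]->data[in[1]->head] < in[0]->data[in[0]->head]) ? 1 : 0;
            out.push(in[s]->pop());
        }
    }
};
```

Chaining such nodes by making one merger's output buffer another's input buffer yields exactly the binary merge trees of the text.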


3 Methodology

As said, our goal is first to develop a good implementation of Funnelsort by finding good choices for design options and parameter values through empirical investigation, and then to compare its efficiency to that of Quicksort—the established standard for comparison based sorting algorithms—as well as that of recent cache-aware proposals.

To ensure robustness of the conclusions, we perform all experiments on three rather different architectures, namely Pentium 4, Pentium III, and MIPS 10000. These are representatives of the modern CISC, the classic CISC, and the RISC type of computer architecture, respectively. In the final comparison of algorithms, we add the AMD Athlon (a modern CISC architecture) and the Intel Itanium 2 (denoted an EPIC architecture by Intel, for Explicit Parallel Instruction-set Computing) for even larger coverage. The specifications of all five machines used can be seen in Table 1. (Additionally, the Itanium 2 machine has 3072 KB of L3 cache, which is 12-way associative and has a cache line size of 128 B.)

                    Pentium 4        Pentium III      MIPS 10000   AMD Athlon       Itanium 2
Architecture type   Modern CISC      Classic CISC     RISC         Modern CISC      EPIC
Operating system    Linux v. 2.4.18  Linux v. 2.4.18  IRIX v. 6.5  Linux v. 2.4.18  Linux v. 2.4.18
Clock rate          2400 MHz         800 MHz          175 MHz      1333 MHz         1137 MHz
Address space       32 bit           32 bit           64 bit       32 bit           64 bit
Pipeline stages     20               12               6            10               8
L1 data cache size  8 KB             16 KB            32 KB        128 KB           32 KB
L1 line size        128 B            32 B             32 B         64 B             64 B
L1 associativity    4-way            4-way            2-way        2-way            4-way
L2 cache size       512 KB           256 KB           1024 KB      256 KB           256 KB
L2 line size        128 B            32 B             32 B         64 B             128 B
L2 associativity    8-way            4-way            2-way        8-way            8-way
TLB entries         128              64               64           40               128
TLB associativity   full             4-way            64-way       4-way            full
TLB miss handling   hardware         hardware         software     hardware         ?
RAM size            512 MB           256 MB           128 MB       512 MB           3072 MB

Table 1: The specifications of the machines used in this paper.

Our programs are written in C++ and compiled by GCC v. 3.3.2 (Pentiums 4 and III, AMD Athlon), GCC v. 3.1.1 (MIPS 10000), or the Intel C++ compiler v. 7.0 (Itanium 2). We compile using maximal optimization.

We use three element types: integers, records containing one integer and one pointer, and records of 100 bytes. The first type is commonly used in experimental papers, but we do not find it particularly realistic, as keys normally have associated information. The second type models sorting small records directly, as well as key-sorting of large records. The third type models sorting medium sized records directly, and is the data type used in the Datamation Benchmark [20] originating from the database community.

We mainly consider uniformly distributed keys, but also try skewed inputs such as almost sorted data, and data with few distinct key values, to ensure robustness of the conclusions. To keep the experiments during the engineering phase (Section 4) tolerable in number, we only use the second data type and the uniform distribution, believing that tuning based on these will transfer to other situations. We use the drand48 family of C library functions for generation of random values.

Our performance metric is wall clock time, as measured by the gettimeofday C library function. We keep the code for the different implementation options tested in the engineering phase as similar as possible, even though this generality entails some overhead. After judging which are the best choices of these options, we implement a clean version of the resulting algorithm, and use this in the final comparison against existing sorting algorithms.

Due to space limitations, we in this paper mainly sum up our findings, and show only a few plots of experimental data. A full set of plots (close to a hundred) can be found in [29]. Our code is available from http://www.daimi.au.dk/~kv/ALENEX04/.

4 Engineering Lazy Funnelsort

We consider a number of design and parameter choices for our implementation of Lazy Funnelsort. We group them as indicated by the following subsections. To keep the number of experiments within realistic limits, we settle the choices one by one, in the order presented here. We test each particular question by experiments exercising only parts of the implementation, and/or by fixing the remaining choices at hopefully reasonable values while varying the parameter under investigation. In this section, we for reasons of space merely summarize


the results of each set of experiments—the actual plots can be found in [29]. Regarding notation: α and d are the parameters from the definition of the k-merger (see Section 2), and z denotes the degree of the basic mergers (see Section 4.3).

4.1 k-Merger Structure  As noted in Section 2, no particular layout is needed for the analysis of Lazy Funnelsort to hold. However, some layout has to be chosen, and the choice could affect the running time. We consider BFS, DFS, and vEB layout. We also consider having a merger node stored along with its output buffer, or storing nodes and buffers separately (each part having the same layout).

The usual tree navigation method is by pointers. However, for the three layouts above, implicit navigation using arithmetic on node indices is possible—this is well-known for BFS [31], and arithmetic expressions for DFS and vEB layouts can be found in [17]. Implicit navigation saves space at the cost of more CPU cycles per navigation step. We consider both pointer based and implicit navigation.

We try two coding styles for the invocation of a merger, namely the straight-forward recursive implementation, and an iterative version. To control the forming of the layouts, we make our own allocation function, which starts by acquiring enough memory to hold the entire merger. We test the efficiency of our allocation function by also trying out the default allocator in C++. Using this, we have no guarantee that the proper memory layouts are formed, so we only try pointer based navigation in these cases.

Experiments: We test all combinations of the choices described above, except for a few infeasible ones (e.g. implicit navigation with the default allocator), giving a total of 28 experiments on each of the three machines. One experiment consists of merging k streams of k² elements in a k-merger with z = 2, α = 1, and d = 2. For each choice, we for values of k in [15; 270] measure the time for ⌈20,000,000/k³⌉ such mergings.

Results: The best combination on all architectures is recursive invocation of a pointer based vEB layout with nodes and buffers separate, allocated by the standard allocator. The time used for the slowest combination is up to 65% larger, and the difference is biggest on the Pentium 4 architecture. The largest gain occurs by choosing the recursive invocation over the iterative, and this gain is most pronounced on the Pentium 4 architecture, which also is the most sophisticated (it e.g. has a special return address stack holding the address of the next instruction to be fetched after returning from a function call, for its immediate execution). The vEB layout ensures around a 10% reduction in time, which shows that the spatial locality of the layout is not entirely without influence in practice, despite its lack of influence on the asymptotic analysis. The implicit vEB layout is slower than its pointer based version, but less so on the Pentium 4 architecture, which also is the fastest of the processors and most likely the one least strained by complex arithmetic expressions.

4.2 Tuning the Basic Mergers  The "inner loop" in the Lazy Funnelsort algorithm is the code performing the merge step in the nodes of the k-merger. We explore several ideas for efficient implementation of this code. One idea tested is to compute the minimum of the number of elements left in either input buffer and the space left in the output buffer. Merging can proceed for at least that many steps without checking the state of the buffers, thereby eliminating one branch from the core merging loop. We also try several hybrids of this idea and the basic merger. This idea will not be a gain (rather, the minimum computation will constitute an overhead) in situations where one input buffer stays small for many merge steps. For this reason, we also implement the optimal merging algorithm of Hwang and Lin [21, 22], which has higher overhead, but is an asymptotic improvement when merging sorted lists of very different sizes. To counteract its overhead, we also try a hybrid solution which invokes it only when the contents of the input buffers are skewed in size.

Experiments: We run the same experiment as in Section 4.1. The values of α and d influence the sizes of the smallest buffers in the merger. These smallest buffers occur on every second level of the merger, so any node has one of these as either input or output buffer, making this size affect the heuristics above. For this reason, we repeat the experiment for (α, d) equal to (1, 3), (4, 2.5), and (16, 1.5). These have smallest buffer sizes of 8, 23, and 45, respectively.

Results: The Hwang-Lin algorithm has, as expected, a large overhead (a factor of three for the non-hybrid version). Somewhat to our surprise, the heuristic calculating minimum sizes is not competitive, being between 15% and 45% slower than the fastest (except on the MIPS 10000 architecture, where the differences between heuristics are less pronounced). Several hybrids fare better, but the straight-forward solution is consistently the winner in all experiments. We interpret this as the branch prediction of the CPUs being as efficient as explicit hand-coding for exploiting predictability in the branches of this code (all branches, except the result of the comparison of the heads of the input buffers, are rather predictable). Thus, hand-coding just constitutes overhead.
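The branch-eliminating heuristic discussed above can be sketched as follows. This is our own illustration, not the tested code: the number of guaranteed-safe steps is the minimum of the two input fills and the output space, so the inner loop runs that many iterations with only the key comparison left as a data-dependent branch.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the "compute minimum" merge-step heuristic (names are ours).
// in0/in1 point at the current heads of the two input buffers, out at the
// output tail; n0/n1 are the elements left in the inputs, space the room
// left in the output. Returns the number of merge steps performed.
std::size_t merge_min_steps(const int *&in0, std::size_t &n0,
                            const int *&in1, std::size_t &n1,
                            int *&out, std::size_t &space) {
    const int *s0 = in0, *s1 = in1;
    // Steps that can run without re-checking any buffer state:
    std::size_t steps = std::min({n0, n1, space});
    for (std::size_t i = 0; i < steps; ++i)
        *out++ = (*in1 < *in0) ? *in1++ : *in0++;  // only the key comparison branches
    n0 -= static_cast<std::size_t>(in0 - s0);      // account for consumed elements
    n1 -= static_cast<std::size_t>(in1 - s1);
    space -= steps;
    return steps;
}
```

A full merger would call this repeatedly, refilling or marking buffers between calls; as the experiments above found, the saved branch did not pay for the extra minimum computation.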

8

4.3 Degree of Basic Mergers  There is no need for the k-merger to be a binary tree. If we for instance base it on four-way basic mergers, we effectively remove every other level of the tree. This means less element movement and less tree navigation. In particular, a reduction in data movement seems promising—part of Quicksort's speed can be attributed to the fact that for random input, only about every other element is moved on each level in the recursion, whereas e.g. binary Mergesort moves all elements at each level. The price to pay is more CPU steps per merge step, and code complication due to the increase in the number of input buffers that can be exhausted.

Based on considerations of expected register use, element movements, and number of branches, we try several different ways of implementing multi-way mergers using sequential comparison of the front elements in the input buffers. We also try a heap-like approach using loser trees [22], which proved efficient in a previous study by Sanders [27] of priority queues in RAM. In total, seven proposals for multi-way mergers are implemented.

Experiments: We test the seven implementations in a 120-merger with (α, d) = (16, 2), and measure the time for eight mergings of 1,728,000 elements each. The test is run for degrees z = 2, 3, 4, ..., 9. For comparison, we also include the binary mergers from the last set of experiments.

Results: All implementations except the loser tree show the same behavior: As z goes from 2 to 9, the time first decreases, and then increases again, with the minimum attained around 4 or 5. The maximum is 40-65% slower than the fastest. Since the number of levels for elements to move through evolves as 1/log(z), while the number of comparisons for each level evolves as z, a likely explanation is that there is an initial positive effect due to the decrease in element movements, which soon is overtaken by the increase in instruction count per level.
The loser trees show a decreasing running time for increasing z, consistent with the fact that the number of comparisons per element for a traversal of the merger is the same for all values of z, but the number of levels, and hence data movements, evolves as 1/log(z). Unfortunately, the running time starts out twice as large as for the remaining implementations for z = 2, and barely reaches them at z = 9. Apparently, the overhead is too large to make loser trees competitive in this setting. The plain binary mergers compete well, but are beaten by around 10% by the fastest four- or five-way mergers. All these findings are rather consistent across the three architectures.
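A multi-way basic merger step using sequential comparison of the front elements, as tried above, might look like the following sketch (our own illustration with invented names, not one of the seven tested implementations):

```cpp
#include <cstddef>

// Sketch: one merge step of a z-way basic merger that scans the front
// elements of all non-exhausted inputs sequentially.
struct Stream {
    const int *cur;   // next element of this input
    const int *end;   // one past the last element
};

// Returns true if a step was performed, i.e. some input was non-empty.
bool multiway_step(Stream *in, int z, int *&out) {
    int best = -1;
    for (int s = 0; s < z; ++s)          // sequential scan: up to z-1 comparisons
        if (in[s].cur != in[s].end &&
            (best < 0 || *in[s].cur < *in[best].cur))
            best = s;
    if (best < 0) return false;          // all inputs exhausted
    *out++ = *in[best].cur++;            // move one element to the output
    return true;
}
```

Each output element costs z - 1 comparisons but traverses only log(k)/log(z) levels of the merger, which is the trade-off discussed in the results above.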

4.4 Merger Caching  In the outer recursion of Funnelsort, the same size of k-merger is used for all invocations on the same level of the recursion. A natural optimization is to precompute these sizes and construct the needed k-mergers once for each size. These mergers are then reset each time they are used.

Experiments: We use the Lazy Funnelsort algorithm with (α, d, z) = (4, 2.5, 2), the straight-forward implementation of binary basic mergers, and a switch to std::sort, the STL implementation of Quicksort, for sizes below αz^d = 23. We sort instances ranging in size from 5,000,000 to 200,000,000 elements.

Results: On all architectures, merger caching gave a 3-5% speed-up.

4.5 Base Sorting Algorithm  As in any recursive algorithm, the base case in Lazy Funnelsort is handled specially. As a natural limit, we require all k-mergers to have height at least two—this removes a number of special cases in the code constructing the mergers. Therefore, for input sizes below αz^d we switch to another sorting algorithm. Experiments with the sorting algorithms Insertionsort, Selectionsort, Heapsort, Shellsort, and Quicksort (in the form of std::sort from the STL library) on input sizes from 10 to 100 revealed the expected result, namely that std::sort, which (in the GCC implementation) itself switches to Insertionsort below size 16, is the fastest for all sizes. We therefore choose this as the sorting algorithm for the base case.

4.6 Parameters α and d  The final choices concern the parameters α (factor in the buffer size expression) and d (main parameter defining the progression of the recursion, in the outer recursion of Funnelsort, as well as in the buffer sizes in the k-merger). These control the buffer sizes, and we investigate their impact on the running time.

Experiments: For values of d between 1.5 and 3 and values of α between 1 and 40, we measure the running time for sorting inputs of various sizes in RAM.
Results: There is a marked rise in running time when α drops below 10, increasing to a factor of four for α = 1. This effect is particularly strong for d = 1.5. Smaller α and d give smaller buffer sizes, and the most likely explanation seems to be that the cost of navigating to and invoking a basic merger is amortized over fewer merge steps when the buffers are smaller. Other than that, the different values of d appear to behave quite similarly. A sensible choice appears to be α around 16 and d around 2.5.
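The outer recursion that these parameters feed into (Section 2) can be sketched as follows. This is an illustration under stated simplifications, not the paper's implementation: the k-merger machinery is replaced by pairwise std::inplace_merge, so it shows the recursion pattern and the αz^d cutoff only, not the cache-oblivious I/O behavior.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of Funnelsort's outer recursion (names and structure are ours).
const double ALPHA = 16.0, D = 2.5;  // parameter choices from Section 4.6
const int Z = 2;                     // degree of the basic mergers

void funnelsort(std::vector<int>::iterator first,
                std::vector<int>::iterator last) {
    const double cutoff = ALPHA * std::pow(double(Z), D);  // α·z^d, Section 4.5
    std::size_t n = static_cast<std::size_t>(last - first);
    if (double(n) < cutoff) {
        std::sort(first, last);                            // base case
        return;
    }
    // Split into k = ceil(N^(1/d)) segments of size about N^(1-1/d),
    // and sort each segment recursively.
    std::size_t k = static_cast<std::size_t>(std::ceil(std::pow(double(n), 1.0 / D)));
    std::size_t seg = (n + k - 1) / k;
    for (std::size_t i = 0; i < n; i += seg)
        funnelsort(first + i, first + std::min(i + seg, n));
    // The real algorithm merges the k segments with an N^(1/d)-merger; here
    // we merge them pairwise bottom-up for brevity.
    for (std::size_t w = seg; w < n; w *= 2)
        for (std::size_t i = 0; i + w < n; i += 2 * w)
            std::inplace_merge(first + i, first + (i + w),
                               first + std::min(i + 2 * w, n));
}
```

With the chosen (α, d) = (16, 2.5) and z = 2, the base-case cutoff evaluates to about 90 elements in this sketch.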

9

5 Evaluating Lazy Funnelsort

In Section 4, we settled the best choices for a number of implementation issues for Lazy Funnelsort. In this section, we investigate the practical merits of the resulting algorithm. We implement two versions: Funnelsort2, which uses binary basic mergers as described in Section 2, and Funnelsort4, which uses the four-way basic mergers found in Section 4.3 to give slightly better results. The remaining implementation details follow what was declared the best choices in Section 4. Both implementations use parameters (α, d) = (16, 2), and use std::sort for input sizes below 400 (as this makes all k-mergers have height at least two in both).

5.1 Competitors  Comparing algorithms with the same asymptotic running time is a delicate matter. Tuning of code can often change the constants involved significantly, which leaves open the question of how to ensure equal levels of engineering in implementations of different algorithms. Our choice in this paper is to use Quicksort as the main yardstick. Quicksort is known as a very fast general-purpose comparison based algorithm [28], and has long been the standard choice of sorting algorithm in software libraries. Over the last 30 years, many improvements have been suggested and tried, and the amount of practical experience with Quicksort is probably unique among sorting algorithms. It seems reasonable to expect implementations in current libraries to be highly tuned. To further boost confidence in the efficiency of the chosen implementation of Quicksort, we start by comparing several widely used library implementations, and choose the best performer as our main competitor. We believe such a comparison will give a good picture of the practical feasibility of cache-oblivious ideas in the area of comparison based sorting.

The implementations we consider are std::sort from the STL library included in the GCC v. 3.2 distribution, std::sort from the STL library from Dinkumware (www.dinkumware.com) included with Intel's C++ compiler v. 7.0, the implementation from [28, Chap. 7], and an implementation of our own, based on the proposal of Bentley and McIlroy [13], but tuned slightly further by making it simpler for calls on small instances and adding an even more elaborate choice of pivot element for large instances. These algorithms mainly differ in their partitioning strategies—how meticulously they choose the pivot element and whether they use two- or three-way partitioning. Two-way partitioning allows tighter code, but is less robust when repeated keys are present.
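The two- vs. three-way distinction above: three-way partitioning groups all keys equal to the pivot in the middle, so repeated keys do not degenerate the recursion. A minimal sketch of three-way (Dutch national flag) partitioning follows; this is our own illustration, not any of the library implementations compared here.

```cpp
#include <utility>

// Sketch of three-way partitioning around a pivot value.
// On return, a[0, lt) < pivot, a[lt, gt) == pivot, a[gt, n) > pivot.
void partition3(int *a, int n, int pivot, int &lt, int &gt) {
    lt = 0;
    gt = n;
    int i = 0;
    while (i < gt) {
        if (a[i] < pivot)      std::swap(a[i++], a[lt++]);  // grow the < region
        else if (a[i] > pivot) std::swap(a[i], a[--gt]);    // grow the > region
        else                   ++i;                         // equal keys stay put
    }
}
```

A three-way Quicksort then recurses only on [0, lt) and [gt, n), so inputs with few distinct key values finish quickly; the cost is the extra comparison per element that makes two-way partitioning's inner loop tighter.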


To gain further insight, we also compare with recent implementations of cache-aware sorting algorithms aiming for efficiency in either internal or external memory through tunings based on knowledge of the memory hierarchy.

TPIE [18] is a library for external memory computations, and includes highly optimized routines for e.g. scanning and sorting. We choose TPIE's sorting routine AMI_sort as a representative of sorting algorithms efficient in external memory. The algorithm needs to know the amount of available internal memory, and following suggestions in the TPIE manual we set it to 192 MB, which is 50-75% of the physical memory on all machines where it is tested. The TPIE version used is the newest at the time of writing (release date August 29, 2002). TPIE does not support the MIPS and Itanium architectures, and requires an older version (2.96) of the GCC compiler on the remaining architectures.

Several recent proposals for cache-aware sorting algorithms in internal memory exist, including [5, 24, 32]. LaMarca and Ladner [24] give proposals for better exploiting the L1 and L2 caches. Improving on their effort, Arge et al. [5] give proposals using registers better, and Kubricht et al. [32] give variants of the algorithms from [24] taking the effects of TLB (Translation Lookaside Buffer) misses and the low associativity of caches into account. In this test, we compare against the two Mergesort based proposals from [24] as implemented by [32] (we encountered problems with the remaining implementations from [32]), and the R-merge algorithm of [5]. We use the publicly available code from [32] and code from [5] sent to us by the authors.

5.2 Experiments  We test the algorithms described above on inputs of sizes in the entire RAM range, as well as on inputs residing on disk. All experiments are performed on machines with no other users. The influence of background processes is minimized by running each experiment in internal memory 21 times, and reporting the median.
Experiments in external memory are rather time consuming, and we run each of these only once, believing that background processes will have less impact on them. Besides the three machines used in Section 3, we in these final experiments also include the AMD Athlon and the Intel Itanium 2 processors. Their specifications can be seen in Table 1. The methodology is as described in Section 3.

(Due to our limited access period for the Itanium machine, we do not have results for all algorithms on this architecture.)

5.3 Results  The plots described in this section are shown in Appendix A. In all graphs, the y-axis shows wall time in seconds divided by n log n, and the x-axis shows log n, where n is the number of input elements.

The comparison of Quicksort implementations showed that three contestants ran pretty close, with the GCC implementation as the overall fastest. It uses a compact two-way partitioning scheme, and simplicity of code here seems to pay off. It is closely followed by our own implementation (denoted Mix), based on the tuned three-way partitioning of Bentley and McIlroy. The implementation from Sedgewick's book (denoted Sedge) is not far behind, whereas the implementation from the Dinkumware STL library (denoted Dink) lags rather behind, probably due to a rather involved three-way partitioning routine. We use the GCC and the Mix implementations as the Quicksort contestants in the remaining experiments—the first we choose for pure speed, the latter for having better robustness with almost no sacrifice in efficiency.

In the main experiments in RAM, we see that the Funnelsort algorithm with four-way basic mergers is consistently better than the one with binary basic mergers, except on the MIPS architecture, which has a very slow CPU. This indicates that the reduced number of element movements really does outweigh the increased merger complexity, except when CPU cycles are costly compared to memory accesses. For the smallest input sizes, the best Funnelsort loses to GCC Quicksort (by 10-40%), but on three architectures gains as n grows, ending up winning (by approximately the same ratio) for the largest instances in RAM. The two architectures where GCC keeps its lead are the MIPS 10000 with its slow CPU, and the Pentium 4, which features the PC800 bus (decreasing the access time to RAM), and which has a large cache line size (reducing effects of cache latency when scanning data in cache).
This can be interpreted as follows: on these two architectures, CPU cycles, not cache effects, dominate the running time for sorting; on architectures where this is not the case, the theoretically better cache performance of Funnelsort actually shows through in practice, at least for a tuned implementation of the algorithm.

The two cache-aware implementations msort-c and msort-m from [32] are not competitive on any of the architectures. The R-merge algorithm is competing well, and like Funnelsort shows its cache-efficiency by having a basically horizontal graph throughout the entire RAM range on the architectures dominated by cache effects. However, four-way Funnelsort is consistently better than R-merge, except on the MIPS 10000 machine. The latter is a RISC-type architecture and has a large

number of registers, something which the R-merge algorithm is designed to exploit. TPIE's algorithm is not competitive in RAM.

For the experiments on disk, TPIE is the clear winner. It is optimized for external memory, and we suspect in particular that its use of double-buffering (something which seems hard to transfer to a cache-oblivious setting) gives it an unbeatable advantage: the TPIE sorting routine sorts one run while loading the next from disk, thus parallelizing CPU work and I/Os. However, Funnelsort comes in as a second, and outperforms GCC quite clearly. The gain over GCC seems to grow as n grows larger, which is in good correspondence with the difference in the base of the logarithms in the I/O complexity of these algorithms. The algorithms tuned to cache perform notably badly on disk.

Due to lack of space, we have only shown plots for uniformly distributed data of the second data type (records of integer and pointer pairs). The results for the other data types and distributions discussed in Section 3 are quite similar, and can be found in [29].

6 Conclusion

Through a careful engineering effort, we have developed a tuned implementation of Lazy Funnelsort, which we have compared empirically with efficient implementations of other comparison based sorting algorithms. The results show that our implementation is competitive in RAM as well as on disk, in particular in situations where sorting is not CPU bound. Across the many input sizes tried, Funnelsort was almost always among the two fastest algorithms, and clearly the one adapting most gracefully to changes of level in the memory hierarchy. In short, these results show that for sorting, the overhead involved in being cache-oblivious can be small enough for the nice theoretical properties to actually transfer into practical advantages. 7 Acknowledgments We thank Brian Vinter of the Department of Mathematics and Computer Science, University of Southern Denmark, for access to an Itanium 2 processor.

References

[1] P. Agarwal, L. Arge, A. Danner, and B. Holland-Minkley. On cache-oblivious multidimensional range searching. In Proc. 19th ACM Symposium on Computational Geometry, 2003.
[2] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, Sept. 1988.


[3] L. Arge. External memory data structures. In Proc. 9th Annual European Symposium on Algorithms (ESA), volume 2161 of LNCS, pages 1-29. Springer, 2001.
[4] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In Proc. 34th Ann. ACM Symp. on Theory of Computing, pages 268-276. ACM Press, 2002.
[5] L. Arge, J. Chase, J. Vitter, and R. Wickremesinghe. Efficient sorting using registers and caches. ACM Journal of Experimental Algorithmics, 7(9), 2002.
[6] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173-189, 1972.
[7] M. Bender, R. Cole, E. Demaine, and M. Farach-Colton. Scanning and traversing: Maintaining data for traversals in a memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 139-151. Springer, 2002.
[8] M. Bender, R. Cole, and R. Raman. Exponential structures for cache-oblivious algorithms. In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of LNCS, pages 195-207. Springer, 2002.
[9] M. Bender, E. Demaine, and M. Farach-Colton. Efficient tree layout in a multilevel memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 165-173. Springer, 2002.
[10] M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. Iacono, and A. Lopez-Ortiz. The cost of cache-oblivious searching. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 271-282, 2003.
[11] M. A. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proc. 41st Ann. Symp. on Foundations of Computer Science, pages 399-409. IEEE Computer Society Press, 2000.
[12] M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 29-39, 2002.
[13] J. L. Bentley and M.
D. Mcllroy. Engineering a sort function. Software-Practice and Experience, 23(1):1249-1265, 1993. [14] G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of LNCS, pages 426-438. Springer, 2002. [15] G. S. Brodal and R. Fagerberg. Funnel heap - a cache oblivious priority queue. In Proc. 13th Annual International Symposium on Algorithms and Computation, volume 2518 of LNCS, pages 219-228. Springer, 2002. [16] G. S. Brodal and R. Fagerberg. On the limits of cacheobliviousness. In Proc. 35th Annual ACM Symposium on Theory of Computing (STOC), pages 307-315, 2003.

12

[17] G. S. Brodal, R. Fagerberg, and R. Jacob. Cache oblivious search trees via binary trees of small height. In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 39-48, 2002. [18] Department of Computer Science, Duke University. TPIE: a transparent parallel I/O environment. WWW page, http://ww.cs.duke.edu/TPIE/, 2002. [19] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, pages 285-297. IEEE Computer Society Press, 1999. [20] J. Gray. Sort benchmark home page. WWW page, http://research.microsoft.com/bare/ SortBenchmark/, 2003. [21] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SI AM Journal on Computing, l(l):31-39, 1972. [22] D. E. Knuth. The Art of Computer Programming, Vol 3, Sorting and Searching. Addison-Wesley, Reading, USA, 2 edition, 1998. [23] R. E. Ladner, R. Fortna, and B.-H. Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, volume 2547 of LNCS, pages 78-92. Springer, 2002. [24] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. Journal of Algorithms, 31:66-104, 1999. [25] H. Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, June 1999. [26] N. Rahman, R. Cole, and R. Raman. Optimised predecessor data structures for internal memory. In Proc. 5th Int. Workshop on Algorithm Engineering (WAE), volume 2141, pages 67-78. Springer, 2001. [27] P. Sanders. Fast priority queues for cached memory. ACM Journal of Experimental Algorithmics, 5(7), 2000. [28] R. Sedgewick. Algorithms in C++: Parts 1-4: Fundamentals, Data Structures, Sorting, Searching. Addison-Wesley, Reading, MA, USA, third edition, 1998. Code available at http://www.cs.princeton. edu/"rs/Algs3.cxxl-4/code.txt. [29] K. Vinther. Engineering cache-oblivious sorting algorithms. 
Master's thesis, Department of Computer Science, University of Aarhus, Denmark, May 2003. Available online at http://www.daimi.au.dk/ "kv/thesis/. [30] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. A CM Computing Surveys, 33(2):209-271, June 2001. [31] J. W. J. Williams. Algorithm 232: Heapsort. Commun. ACM, 7:347-348, 1964. [32] L. Xiao, X. Zhang, and S. A. Kubricht. Improving memory performance of sorting algorithms. ACM Journal of Experimental Algorithmics, 5(3), 2000.

A Plots

Comparison of Quicksort Implementations

Results for Inputs on Disk

The Robustness of the Sum-of-Squares Algorithm for Bin Packing

Michael A. Bender*    Bryan Bradley†    Geetha Jagannathan‡    Krishnan Pillaipakkamnatt§

Abstract

Csirik et al. [CJK+99, CJK+00] introduced the sum-of-squares algorithm (SS) for online bin packing of integral-sized items into integral-sized bins. They showed that for discrete distributions, the expected waste of SS is sublinear whenever the expected waste of the offline optimal is sublinear. The algorithm SS has a time complexity of O(nB) to pack n items into bins of size B. In [CJK+02] the authors present variants of this algorithm that enjoy the same asymptotic expected waste as SS (with larger multiplicative constants), but with time complexities of O(n log B) and O(n).

In this paper we present three sets of results that demonstrate the robustness of the sum-of-squares approach. First, we show the results of experiments on two new variants of the SS algorithm. The first variant, which runs in time O(n√B log B), appears to have almost identical expected waste to the sum-of-squares algorithm on all the distributions mentioned in [CJK+99, CJK+00, CJK+02]. The other variant, which runs in O(n log B) time, performs well on most, but not all, of those distributions. Both algorithms have simple implementations.

Second, we present results from experiments that extend the sum-of-squares algorithm to the bin-packing problem with two bin sizes (the variable-sized bin-packing problem). From our experiments comparing SS and Best Fit over uniform distributions, we observed that there are scenarios where, when one bin size is 2/3 the size of the other, SS has Θ(√n) waste while Best Fit has linear waste. We also present situations where SS has Θ(1) waste while Best Fit has Θ(n) waste. We observe an anomalous behavior in Best Fit that does not seem to affect SS.

Finally, we apply SS to the related problem of online memory allocation. Our experimental comparisons between SS and Best Fit indicate that neither algorithm is consistently better than the other. If the amount of randomness is low, SS appears to have lower waste than Best Fit, while larger amounts of randomness appear to favor Best Fit. An interesting phenomenon shows that for a given range of allocation sizes we can find ranges of allocation durations where SS has lower waste than Best Fit. In the online memory-allocation problem for the uniform and interval distributions, SS does not seem to have an asymptotic advantage over Best Fit, in contrast with the bin-packing problem.

* Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA, email: bender@cs.sunysb.edu. Supported in part by Sandia National Laboratories and NSF grants EIA-0112849 and CCR-0208670.
† Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA, email: pot8os@acm.org.
‡ Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA, email: geethajagan@acm.org.
§ Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA, email: csckzp@hofstra.edu.

1 Introduction

In classical bin packing the input is a list L = (a_1, a_2, ..., a_n) of n items and an infinite supply of bins of unit capacity. Each item a_i has size s(a_i), where 0 < s(a_i) <= 1. The objective is to pack the items into a minimum number of bins, subject to the constraint that the sum of the sizes of the items in each bin is no greater than 1. Bin packing has a wide range of applications, including stock cutting, truck packing, the assignment of commercials to station breaks in television programming, and memory allocation. Because the problem is NP-hard [GJ79], most bin-packing research concentrates on finding polynomial-time approximation algorithms. Bin packing is among the earliest problems for which the performance of approximation algorithms was analyzed [CGJ96].

In this paper we focus on the average-case performance of online algorithms for bin packing, where item sizes are drawn according to some discrete distribution. An algorithm is said to be online if it packs each item as soon as it "arrives," without any knowledge of the items not yet encountered. That is, the decision to pack item a_i into some particular bin can be based only on knowledge of items a_1, ..., a_{i-1}. Moreover, once an item has been assigned to a bin, it cannot be reassigned. Discrete distributions are those in which each item size is an element of some set {s_1, s_2, ..., s_J} of integers, and each size has an associated rational probability. The capacity of a bin is a fixed integer B >= s_J.

We overload notation and write s(L) for the sum of the sizes of the items in the list L. For an algorithm A, we use A(L) to denote the number of bins used by A to pack the items in L. We write OPT to denote an optimal packing algorithm. Let F be a probability distribution over item sizes. Then L_n(F) denotes a list of n items drawn according to distribution F. A packing is a specific assignment of items to bins. The size of a packing P, written ||P||, is the number of bins used by P.
For an algorithm A, we use P_n^A(F) to denote a packing resulting from the application of algorithm A to the list L_n(F). Given a packing P of a list L, the waste of P, the sum of the unused bin capacities, is defined as W(P) = B * ||P|| - s(L). The expected waste of an algorithm A on a distribution F is EW_n^A(F) = E[W(P_n^A(F))], where the expectation is taken over all lists of length n.
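The definitions above can be made concrete with a short sketch. The following Python snippet (illustrative only, not the authors' experimental code) packs an integral-sized list with the classical online Best Fit heuristic and computes the waste W(P) = B * ||P|| - s(L) of the resulting packing:

```python
def best_fit(sizes, B):
    """Online Best Fit: put each item into the fullest legal bin.

    `sizes` is the list L of integral item sizes (each <= B); the
    return value is the list of bin levels of the packing P.
    """
    levels = []
    for s in sizes:
        # pick the legal bin (level + s <= B) with the highest level
        best = None
        for i, lvl in enumerate(levels):
            if lvl + s <= B and (best is None or lvl > levels[best]):
                best = i
        if best is None:
            levels.append(s)          # no legal bin: open a new one
        else:
            levels[best] += s
    return levels

def waste(levels, B):
    # W(P) = B * ||P|| - s(L): total unused capacity of the used bins
    return B * len(levels) - sum(levels)

levels = best_fit([6, 3, 5, 4, 2], B=10)
print(len(levels), waste(levels, B=10))   # 3 bins, waste 10
```

Here ||P|| is `len(levels)` and s(L) is `sum(levels)`; averaging `waste` over many sampled lists L_n(F) estimates the expected waste EW_n^A(F).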

1.1 Related Work.

Variable-sized Bin Packing. In the variable-sized bin-packing problem, bins do not all have the same capacity. Assume that there is an infinite supply of bins of each available size, the largest bin size being B_1, and let L denote a list of items whose sizes are in (0, B_1]. Lately this problem has received more attention because of its applications in stock cutting and in the assignment of commercials to variable-sized breaks in television programming.

Much of the literature on variable-sized bin packing measures effectiveness in terms of the performance ratio of the algorithm (instead of expected waste). The asymptotic expected performance ratio of algorithm A on distribution F is the limiting average ratio of the number of bins used by A to the number of bins used by an optimal algorithm. That is,

    R_A^∞(F) = limsup_{n→∞} E[ A(L_n(F)) / OPT(L_n(F)) ].

Friesen and Langston [FL86] first investigated the variable-sized bin-packing problem. Kinnersley and Langston [KL88] gave an online algorithm with performance ratio 7/4. The variable-harmonic algorithm [Csi89], which is based on the harmonic-fit algorithm, has a performance ratio of at most 1.69103. Epstein, Seiden, and van Stee [ESS01] presented two unbounded-space online algorithms for variable-sized bin packing. They focused on the case in which there are two bin sizes. These algorithms are a combination of the variable-harmonic and refined-harmonic algorithms [LL81]. Epstein et al. also proved a lower bound for any online algorithm with two bin sizes: the asymptotic performance ratio is greater than 1.33561.

Bin Packing. Recently, progress has been made in the average-case analysis of the standard heuristics for the discrete uniform distributions U{j, k}, where the bin capacity is B = k and item sizes are drawn uniformly from 1, 2, ..., j <= k. When j = k - 1, the online Best-Fit and First-Fit algorithms have Θ(√(nk)) expected waste. Remarkably, when 1 <= j <= k - 2, the expected waste of the optimal is O(1) [ECG+91]. An algorithm is said to be stable under a distribution if the expected waste remains bounded even as the number of items goes to infinity. Coffman et al. [ECG+91] proved that Best Fit is stable when k >= j(j + 3)/2. Kenyon et al. [KRS98] showed that Best Fit is stable under U{k - 2, k} and is also stable for some specific values of (j, k) with k <= 14. It has been shown experimentally that for most pairs (j, k) the expected waste of Best Fit is Θ(n).

The Sum-of-Squares Algorithm. We assume that all items have integral sizes. The gap of a bin is the amount of unassigned capacity in the bin. Let N(g) denote the number of bins in the current packing with gap g, for 1 <= g < B. Initially, N(g) = 0 for all g. The sum-of-squares algorithm puts an item a of size s(a) into a bin such that, after placing the item, the value of

    ss(P) = sum_{g=1}^{B-1} N(g)^2

is minimized.

Csirik et al. [CJK+99] gave experimental evidence that for discrete uniform distributions U{j, k} (with k = 100 and 1 <= j <= 98), EW_n^SS is O(1). They also showed that for j = 99, EW_n^SS = Θ(√n). Their results indicated that for j = 97, 98, 99, the expected waste of SS, EW_n^SS, goes from O(1) to Θ(√n), whereas the expected waste of BF, EW_n^BF, transitions from Θ(n) to O(1) to Θ(√n).

In a theoretical analysis of the sum-of-squares algorithm [CJK+00], Csirik et al. proved that for any perfectly-packable distribution F, EW_n^SS(F) = O(√n). They also proved that if F is a bounded-waste distribution, then EW_n^SS(F) is either O(1) or Θ(log n). In particular, if F is a discrete uniform distribution U{j, k} with j < k - 1, then EW_n^SS(F) = O(1). They also proved that for all lists L, SS(L) <= 3 OPT(L).

Memory Allocation. Memory is modeled as an infinitely long array of storage locations. An allocator receives requests for blocks of memory of various sizes, and requests for their deallocation arrive in some unknown order. Although the size of a request is known to the allocator, the deallocation time is unknown at the time of allocation. The deallocation of a block may leave a "hole" in the memory. The objective of a memory-allocation algorithm is to minimize the total amount of space wasted in these holes.

Although memory allocation has been studied since the early days of computing, only a handful of results concern the competitive ratios of the standard heuristics for the problem. For the memory-allocation problem, the competitive ratio of an online algorithm is the ratio of the total amount of memory required by the algorithm to satisfy all requests to W, the largest amount of concurrently allocated memory. Luby, Naor, and Orda [LNO96] showed that First Fit (the heuristic that assigns a request to the lowest-indexed hole that can accommodate it) has a competitive ratio of O(min(log W, log C)), where C is the largest number of concurrent allocations. By Robson's result [Rob74], this bound is the best possible for any online algorithm.
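To make the memory-allocation model concrete, here is a toy first-fit allocator over an idealized infinite address line. This is a hedged sketch: the class and method names are ours, and it is not the implementation analyzed in [LNO96].

```python
class FirstFitAllocator:
    """Toy first-fit allocator on an infinite array of storage locations."""

    def __init__(self):
        self.blocks = {}       # request id -> (start address, size)
        self.next_id = 0
        self.high_water = 0    # total memory the algorithm has required

    def allocate(self, size):
        # Scan live blocks in address order; take the lowest-indexed
        # hole large enough for the request (the first-fit rule).
        addr = 0
        for start, sz in sorted(self.blocks.values()):
            if start - addr >= size:
                break              # the hole [addr, start) fits the request
            addr = max(addr, start + sz)
        rid = self.next_id
        self.next_id += 1
        self.blocks[rid] = (addr, size)
        self.high_water = max(self.high_water, addr + size)
        return rid

    def free(self, rid):
        # Deallocation may leave a hole that later requests can reuse.
        del self.blocks[rid]

alloc = FirstFitAllocator()
a = alloc.allocate(5)              # placed at address 0
b = alloc.allocate(3)              # placed at address 5
alloc.free(a)                      # leaves a hole [0, 5)
c = alloc.allocate(4)              # first fit reuses the hole at address 0
print(alloc.blocks[c][0], alloc.high_water)
```

In this model, the competitive ratio of a run is `high_water` divided by the peak amount of concurrently allocated memory (the quantity W above).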

1.2 Results.

In this paper we study the performance of the sum-of-squares algorithm and its variants. We present faster variants of the sum-of-squares algorithm for the online bin-packing problem. Next, we compare the sum-of-squares algorithm with the Best-Fit algorithm for variable-sized bin packing. Finally, we show the results of applying the sum-of-squares algorithm to memory allocation. We performed our experiments with the uniform distribution U{j, k}. We have also run our algorithms on interval distributions, especially those that are claimed to be difficult in [CJK+02]. For the memory-allocation problem we used various interval distributions for both block sizes and block durations.

• In Section 2 we present our variants of the sum-of-squares algorithm. The first is the SSmax variant, which runs in O(n√B log B) time. Our experiments suggest that the performance of this variant is close to the SS algorithm in absolute terms, for all the distributions mentioned in [CJK+99, CJK+00, CJK+02]. The remaining algorithms form a family called the segregated sum-of-squares algorithms (SSS). These algorithms perform well on most of the distributions mentioned in the above papers, but on some distributions they do not have the same expected waste as SS. The best runtime in this family is O(n log log B).

• Section 3 provides experimental results for the sum-of-squares algorithm applied to a generalized version of the bin-packing problem in which bins come in two sizes. For fixed bin sizes, the performance of the sum-of-squares and Best-Fit algorithms can be categorized into three ranges based on item sizes. In the first range, both algorithms appear to have constant waste. In the second range, Best Fit has Θ(n) waste and SS has O(1) waste. In the third range, when one bin size is around 2/3 the size of the other, SS has Θ(√n) waste while Best Fit has linear waste; when the bin sizes are not in this ratio, both algorithms have Θ(√n) waste. We observe an anomalous behavior in Best Fit that does not seem to affect SS.


• In Section 4 we apply SS to the related problem of online memory allocation. Our experimental comparisons between SS and Best Fit indicate that neither algorithm is consistently better than the other. Smaller allocation durations appear to favor Best Fit, while larger allocation durations favor SS. Also, if the amount of randomness is low, SS appears to have lower waste than Best Fit, while larger amounts of randomness appear to favor Best Fit. An interesting phenomenon shows that for a given range of allocation sizes we can find ranges of allocation durations where SS has lower waste than Best Fit. In the online memory-allocation problem, SS does not seem to have an asymptotic advantage over Best Fit, in contrast to the bin-packing problem.

2 Faster Variants of the Sum-of-Squares Algorithm

In this section we present variants of the sum-of-squares algorithm. The SSmax variant of Section 2.2.1 runs in O(n√B log B) time and appears to have an expected waste remarkably close to that of SS. Experimental results indicate that the segregated sum-of-squares family of algorithms (Section 2.2.2) runs faster but, in some cases, has Θ(n) expected waste where SS has Θ(√n) waste.

2.1 Sum-of-Squares Algorithm.

The sum of the sizes of the items in a bin is the level of the bin. The gap of a bin is the amount of its unused capacity: if the level of a bin is t, then its gap is B - t. Let P be a packing of a list L of items. Let the gap count N(g) denote the number of bins in the packing that have gap g, for 1 <= g < B. We call N the profile vector of the packing. We ignore perfectly packed bins (whose gap is 0) and completely empty bins. The sum-of-squares of a packing P is

    ss(P) = sum_{g=1}^{B-1} N(g)^2.

The sum-of-squares algorithm is an online algorithm that works as follows. Let a be the next item to pack. It is packed into a legal bin (one whose gap is at least s(a)) such that, for the resulting packing P', the value of ss(P') is minimum over all possible placements of a. When an item of size s arrives, there are three possible ways of packing it:

1. Open a new bin: here the sum of squares increases by 1 + 2N(B - s).

2. Perfectly fill an old bin: here the sum of squares decreases by 2N(s) - 1.

3. The item goes into a bin of gap g, where g > s: here the sum of squares increases by 2(N(g - s) - N(g) + 1). This step requires finding a g that maximizes the value of N(g) - N(g - s).

Each time an item arrives, the algorithm performs an exhaustive search to evaluate the change in ss(P) and finds an appropriate bin in O(B) time. In [CJK+02] the authors discuss variants of the original SS algorithm. They present O(n log B) and O(n) variants that approximate the calculation of the sum of squares. The authors prove that these variants have the same asymptotic growth rate for expected waste as SS, but with larger multiplicative constants. For example, they considered the distribution U{400, 1000}. For n = 100,000 items, they state that the number of bins used by the variant is 400% more than optimal, whereas Best Fit uses only 0.3% more bins, and SS uses 0.25% more bins than necessary. The situation does improve in favor of their variant for larger values of n: for n = 10^7, the variant uses roughly 9.8% more bins than optimal, while SS uses 0.0025% more bins than necessary. The authors claim their fast variants of SS are "primarily of theoretical interest," and that they are unlikely to be competitive with Best Fit except for large values of n.

They also looked at other generalizations of the sum-of-squares algorithm. Instead of considering the squares of the elements of the profile vector, they examined the r-th power, for r >= 1. These generalizations yield the various SrS algorithms. For all finite r > 1, SrS performed qualitatively the same as SS. For the limit case of r = 1 (S1S), the resulting algorithm is one that satisfies the any-fit constraint; that is, it does not open a new bin unless none of the already open bins can accommodate the new item to be packed. Best Fit and First Fit are examples of algorithms that satisfy the any-fit constraint. The S∞S algorithm chooses to minimize the largest value in the profile vector. This variant has optimal waste for the uniform distributions U{j, k}. However, the authors provide examples of distributions for which S∞S has linear waste while the optimal and SS have sublinear waste.

2.2 Our Variants. Our variants try to approximate the behavior of the sum-of-squares algorithm. By attempting to minimize the value of ss(P'), SS effectively tries to ensure that it has an approximately equal number of bins of each gap. Instead of examining all possible legal gaps for the best possible gap (as SS does), we examine only the k largest gap counts, for some value of k.

2.2.1 Parameterized SSmax Algorithm. The parameterized algorithm SSmax(k) is based on the S∞S variant mentioned above; the right choice of k is discussed later in this section. To pack a given item into a bin, SSmax(k) computes the value of ss(P) for the k largest gap counts (for some k >= 1). The algorithm also computes ss(P) for opening a new bin and for perfectly packing the new item into an already open bin. The given item is packed into a bin such that the resulting packing minimizes ss(P). (Note that when we set k = B, we get the original SS algorithm.) Using a priority queue, we can find the k largest values in O(k log B) time; this can be done in O(k log log B) time using faster priority queues such as van Emde Boas trees.

Experimental Results. For small values of k (such as constant k, or k = log B), the experiments suggest that the results for SSmax(k) are qualitatively similar to those of S∞S. However, for k near 2√B the average waste of SSmax(k) undergoes a substantial change. For each of the distributions mentioned in [CJK+99, CJK+00, CJK+02], we observed that the performance of SSmax(2√B) matched that of SS to within a factor of 1.4; in many cases the factor is as low as 1.02. (See Table 1 and Figure 1 for results from some of the distributions.) Note that in this and other tables, when it became clear that the growth rate of an algorithm is linear in n, we did not run experiments for n = 10^8 and n = 10^9. In all other cases, we used 100 samples for n ∈ {10^5, 10^6}, 25 samples for n ∈ {10^7, 10^8}, and 3 samples for n = 10^9.

We also compared the waste of this variant with that of SS for the distributions U{j, 100}, U{j, 200}, and U{j, 1000}. As our results in Table 2 and Figure 2 show, the performance of the variant SSmax(2√B) is remarkably similar to that of SS.
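The placement rule of SS, and the SSmax(k) restriction to the k largest gap counts, can be sketched as follows. This is our illustrative Python, assuming integral item sizes 1 <= s < B; the scan below finds the k largest counts naively rather than with the priority queues mentioned above, so it shows the selection rule but not the stated running time.

```python
import heapq

def ss_pack(sizes, B, k=None):
    """Pack items online by the sum-of-squares rule.

    N[g] counts bins with gap g (1 <= g < B); N[0] counts perfectly
    packed bins, which do not contribute to ss(P).  With k=None every
    legal gap is examined (plain SS); otherwise only the k gaps with
    the largest counts are examined (the SSmax(k) rule).
    """
    N = [0] * B
    for s in sizes:
        # 1. Open a new bin: ss(P) changes by 1 + 2*N[B-s].
        best_delta, best_gap = 1 + 2 * N[B - s], None
        # 2. Perfectly fill a bin of gap s: ss(P) changes by -(2*N[s] - 1).
        if N[s] > 0 and 1 - 2 * N[s] < best_delta:
            best_delta, best_gap = 1 - 2 * N[s], s
        # 3. Put the item into a bin of gap g > s:
        #    ss(P) changes by 2*(N[g-s] - N[g] + 1).
        legal = [g for g in range(s + 1, B) if N[g] > 0]
        if k is not None:
            legal = heapq.nlargest(k, legal, key=lambda g: N[g])
        for g in legal:
            d = 2 * (N[g - s] - N[g] + 1)
            if d < best_delta:
                best_delta, best_gap = d, g
        if best_gap is None:
            N[B - s] += 1                 # open a new bin of gap B - s
        else:
            N[best_gap] -= 1
            N[best_gap - s] += 1          # the bin's gap shrinks by s
    return N

N = ss_pack([3, 7, 5, 5], B=10)
print(N[0], sum(N[1:]))   # 2 perfectly packed bins, no partially filled bins
```

The three deltas are exactly the case analysis above, so each item costs O(B) time in plain SS, while SSmax(k) evaluates only k + 2 candidate placements per item.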

SIAM PROCEEDINGS SERIES LIST Glowinski, R., Golub, G. H., Meurant, G. A., and Periaux, J., First International Conference on Domain Decomposition Methods for Partial Differential Equations (1988) Salam, Fathi M. A. and Levi, Mark L, Dynamical Systems Approaches to Nonlinear Problems in Systems and Circuits (1988) Datta, B., Johnson, C, Kaashoek, M., Plemmons, R., and Sontag, E., Linear Algebra in Signals, Systems and Control (1988) Ringeisen, Richard D. and Roberts, Fred S., Applications of Discrete Mathematics (1988) McKenna, James and Temam, Roger, ICIAM '87: Proceedings of the First International Conference on Industrial and Applied Mathematics (1988) Rodrigue, Garry, Parallel Processing for Scientific Computing (1989) Caflish, Russel E., Mathematical Aspects of Vortex Dynamics (1989) Wouk, Arthur, Parallel Processing and Medium-Scale Multiprocessors (1989) Flaherty, Joseph E., Paslow, Pamela J., Shephard, Mark S., and Vasilakis, John D., Adaptive Methods for Partial Differential Equations (1989) Kohn, Robert V. and Milton, Graeme W., Random Media and Composites (1989) Mandel, Jan, McCormick, S. F., Dendy, J. E., Jr., Farhat, Charbel, Lonsdale, Guy, Porter, Seymour V., Ruge, John W., and Stuben, Klaus, Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods 0939^ Colton, David, Ewing, Richard, and Rundell, William, Inverse Problems in Partial Differential Equations (1990) Chan, Tony F., Glowinski, Roland, Periaux, Jacques, and Widlund, Olof B., Third International Symposium on Domain Decomposition Methods for Partial Differential Equations (1990) Dongarra, Jack, Messina, Paul, Sorensen, Danny C., and Voigt, Robert G., Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing (1990) Glowinski, Roland and Lichnewsky, Alain, Computing Methods in Applied Sciences and Engineering (1990) Coleman, Thomas F. 
and Li, Yuying, Large-Scale Numerical Optimization (1990) Aggarwal, Alok, Borodin, Allan, Gabow, Harold, N., Galil, Zvi, Karp, Richard M., Kleitman, Daniel J., Odlyzko, Andrew M., Pulleyblank, William R., Tardos, Eva, and Vishkin, Uzi, Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms (1990) Cohen, Gary, Halpern, Laurence, and Joly, Patrick, Mathematical and Numerical Aspects of Wave Propagation Phenomena (1991) Gomez, S., Hennart, J. P., and Tapia, R. A., Advances in Numerical Partial Differential Equations and Optimization: Proceedings of the Fifth Mexico-United States Workshop (1991) Glowinski, Roland, Kuznetsov, Yuri A., Meurant, Gerard, Periaux, Jacques, and Widlund, Olof B., Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations (1991) Alavi, Y., Chung, F. R. K., Graham, R. L., and Hsu, D. R, Graph Theory, Combinatorics, Algorithms, and Applications (1991) Wu, Julian J., Ting, T. C. I, and Barnert, David M., Modern Theory of Anisotropic Elasticity and Applications (1991) Shearer, Michael, Viscous Profiles and Numerical Methods for Shock Waves (1991) Griewank, Andreas and Corliss, George F., Automatic Differentiation of Algorithms: Theory, Implementation, and Application (1991) Frederickson, Greg, Graham, Ron, Hochbaum, Dorit S., Johnson, Ellis, Kosaraju, S. Rao, Luby, Michael, Megiddo, Nimrod, Schieber, Baruch, Vaidya, Pravin, and Yao, Frances, Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms (1992) Field, David A. and Komkov, Vadim, Theoretical Aspects of Industrial Design (1992) Field, David A. and Komkov, Vadim, Geometric Aspects of Industrial Design (1992) Bednar, J. Bee, Lines, L R., Stolt, R. H., and Weglein, A. B., Geophysical Inversion (1992)

O'Malley, Robert E. Jr., ICIAM 91: Proceedings of the Second International Conference on Industrial and Applied Mathematics (1992) Keyes, David E., Chan, Tony F., Meurant, Gerard, Scroggs, Jeffrey S., and Voigt, Robert G., Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations (1992) Dongarra, Jack, Messina, Paul, Kennedy, Ken, Sorensen, Danny C., and Voigt, Robert G., Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing (1992) Corones, James P., Kristensson, Gerhard, Nelson, Paul, and Seth, Daniel L, Invariant Imbedding and Inverse Problems (1992) Ramachandran, Vijaya, Bentley, Jon, Cole, Richard, Cunningham, William H., Guibas, Leo, King, Valerie, Lawler, Eugene, Lenstra, Arjen, Mulmuley, Ketan, Sleator, Daniel D., and Yannakakis, Mihalis, Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (1993) Kleinman, Ralph, Angell, Thomas, Colton, David, Santosa, Fadil, and Stakgold, Ivar, Second International Conference on Mathematical and Numerical Aspects of Wave Propagation (1993) Banks, H. I, Fabiano, R. H., and Ito, K., Identification and Control in Systems Governed by Partial Differential Equations (1993) Sleator, Daniel D., Bern, Marshall W., Clarkson, Kenneth L, Cook, William J., Karlin, Anna, Klein, Philip N., Lagarias, Jeffrey C., Lawler, Eugene L., Maggs, Bruce, Milenkovic, Victor J., and Winkler, Peter, Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (1994) Lewis, John G., Proceedings of the Fifth SIAM Conference on Applied Linear Algebra (1994) Brown, J. David, Chu, Moody I, Ellison, Donald C., and Plemmons, Robert J., Proceedings of the Cornelius Lanczos International Centenary Conference (1994) Dongarra, Jack J. 
and Tourancheau, B., Proceedings of the Second Workshop on Environments and Tools for Parallel Scientific Computing (1994) Bailey, David H., Bj0rstad, Petter E., Gilbert, John R., Mascagni, Michael V, Schreiber, Robert S., Simon, Horst D., Torczon, Virginia J., and Watson, Layne I, Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing (1995) Clarkson, Kenneth, Agarwal, Pankaj K., Atallah, Mikhail, Frieze, Alan, Goldberg, Andrew, Karloff, Howard, Manber. Udi, Munro, Ian, Raghavan, Prabhakar, Schmidt, Jeanette, and Young, Moti, Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (1995) Becache, Elaine, Cohen, Gary, Joly, Patrick, and Roberts, Jean E., Third International Conference on Mathematical and Numerical Aspects of Wave Propagation (1995) Engl, Heinz W., and Rundell, W., GAMM-SIAM Proceedings on Inverse Problems in Diffusion Processes (1995) Angell, T. S., Cook, Pamela L., Kleinman, R. E., and Olmstead, W. E., Nonlinear Problems in Applied Mathematics (1995) Tardos, Eva, Applegate, David, Canny, John, Eppstein, David, Galil, Zvi, Kdrger, David R., Karlin, Anna R., Linial, Nati, Rao, Satish B., Vitter, Jeffrey S., and Winkler, Peter M., Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (1996) Cook, Pamela L., Roytburd, Victor, and Tulin, Marshal, Mathematics Is for Solving Problems (1996) Adams, Loyce and Nazareth, J. 
L., Linear and Nonlinear Conjugate Gradient-Related Methods (1996) Renardy, Yuriko Y., Coward, Adrian V., Papageorgiou, Demetrios T., and Sun, Shu-Ming, Advances in Multi-Fluid Flows (1996) Berz, Martin, Bischof, Christian, Corliss, George, and Griewank, Andreas, Computational Differentiation: Techniques, Applications, and Tools (1996) Delic, George and Wheeler, Mary F., Next Generation Environmental Models and Computational Methods (1997) Engl, Heinz W., Louis, Alfred, and Rundell, William, Inverse Problems in Geophysical Applications (1997) Saks, Michael, Anderson, Richard, Bach, Eric, Berger, Bonnie, Blum, Avrim, Chazelle, Bernard, Edelsbrunner,Herbert, Henzinger, Monika, Johnson, David, Kannan, Sampath, Khuller, Samir, Maggs, Bruce, Muthukrishnan, S., Ruskey, Frank, Seymour, Paul, Spencer, Joel, Williamson, David P., and Williamson, Gill, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (1997) Alexandrov, Natalia M. and Hussaini, M. Y., Multidisciplinary Design Optimization: State of the Art (1997) Van Huffel, Sabine, Recent Advances in Total Least Squares Techniques and Errors-in-Variables Modeling (1997)

Ferris, Michael C. and Pang, Jong-Shi, Complementarity and Variational Problems: State of the Art (1997) Bern, Marshall, Fiat, Amos, Goldberg, Andrew, Kannan, Sampath, Karloff, Howard, Kenyan, Claire, Kierstead, Hal, Kosaraju, Rao, Linial, Nati, Rabani, Yuval, Rodl, Vojta, Sharir, Micha, Shmoys, David, Spielman, Dan, Spinrad, Jerry, Srinivasan, Aravind, and Sudan, Madhu, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (1998) DeSanto, John A., Mathematical and Numerical Aspects of Wave Propagation (1998) Tarjan, Robert E., Warnow, Tandy, Amenta, Nina, Benham, Craig, Cornell, Derek G., Edelsbrunner, Herbert, Feigenbaum, Joan, Gusfield, Dan, Habib, Michel, Hall, Leslie, Karp, Richard, King, Valerie, Koller, Daphne, McKay, Brendan, Moret, Bernard, Muthukrishnan, S., Phillips, Cindy, Raghavan, Prabhakar, Randall, Dana, and Scheinerman, Edward, Proceedings of the Tenth ACM-SIAM Symposium on Discrete Algorithms (1999) Hendrickson, Bruce, Yelick, Katherine A., Bischof, Christian H., Duff, lain S., Edelman, Alan S., Geist, George A., Heath, Michael I, Heroux, Michael H., Koelbel, Chuck, Schrieber, Robert S., Sincovec, Richard F., and Wheeler, Mary F., Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing (1999) Henderson, Michael E., Anderson, Christopher R., and Lyons, Stephen L, Object Oriented Methods for Interoperable Scientific and Engineering Computing (1999) Shmoys, David, Brightwell, Graham, Cohen, Edith, Cook, Bill, Eppstein, David, Gerards, Bert, Irani, Sandy, Kenyan, Claire, Ostrovsky, Rafail, Peleg, David, Pevzner, Pavel, Reed, Bruce, Stein, Cliff, Tetali, Prasad, and Welsh, Dominic, Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms (2000) Bermudez, Alfredo, Gomez, Dolores, Hazard, Christophe, Joly, Patrick, and Roberts, Jean E., Fifth International Conference on Mathematical and Numerical Aspects of Wave Propagation (2000) Kosaraju, S. 
Rao, Bellare, Mihir, Buchsbaum, Adam, Chazelle, Bernard, Graham, Fan Chung, Karp, Richard, Lovasz, Laszlo, Motwani, Rajeev, Myrvold, Wendy, Pruhs, Kirk, Sinclair, Alistair, Spencer, Joel, Stein, Cliff, Tardos, Eva, Vempala, Santosh, Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (2001) Koelbel, Charles and Meza, Juan, Proceedings of the Tenth SIAM Conference on Parallel Processing for Scientific Computing (2001) Grossman, Robert, Kumar, Vipin, and Han, Jiawei, Proceedings of the First SIAM International Conference on Data Mining (2001) Berry, Michael, Computational Information Retrieval (2001) Eppstein, David, Demaine, Erik, Doerr, Benjamin, Fleischer, Lisa, Goel, Ashish, Goodrich, Mike, Khanna, Sanjeev, King, Valerie, Munro, Ian, Randall, Dana, Shepherd, Bruce, Spielman, Dan, Sudakov, Benjamin, Suri, Subhash, and Warnow, Tandy, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2002) Grossman, Robert, Han, Jiawei, Kumar, Vipin, Mannila, Heikki, and Motwani, Rajeev, Proceedings of the Second SIAM International Conference on Data Mining (2002) Estep, Donald and Tavener, Simon, Collected Lectures on the Preservation of Stability under Discretization (2002) Ladner, Richard E., Proceedings of the Fifth Workshop on Algorithm Engineering and Experiments (2003) Arge, Lars, Italiano, Giuseppe F., and Sedgewick, Robert, Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments and the First Workshop on Analytic Algorithmics and Combinatorics (2004)

PROCEEDINGS OF THE SIXTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FIRST WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS

Edited by Lars Arge, Giuseppe F. Italiano, and Robert Sedgewick

Society for Industrial and Applied Mathematics Philadelphia


Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments, New Orleans, LA, January 10, 2004. Proceedings of the First Workshop on Analytic Algorithmics and Combinatorics, New Orleans, LA, January 10, 2004. The workshops were supported by the ACM Special Interest Group on Algorithms and Computation Theory and the Society for Industrial and Applied Mathematics. Copyright © 2004 by the Society for Industrial and Applied Mathematics. 10 9 8 7 6 5 4 3 2 1. All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688. Library of Congress Catalog Card Number: 2004104581 ISBN 0-89871-564-4

SIAM is a registered trademark.

CONTENTS

ix   Preface to the Workshop on Algorithm Engineering and Experiments
xi   Preface to the Workshop on Analytic Algorithmics and Combinatorics

Workshop on Algorithm Engineering and Experiments

3    Engineering Geometric Algorithms: Persistent Problems and Some Solutions (Abstract of Invited Talk)
     Dan Halperin
4    Engineering a Cache-Oblivious Sorting Algorithm
     Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther
18   The Robustness of the Sum-of-Squares Algorithm for Bin Packing
     Michael A. Bender, Bryan Bradley, Geetha Jagannathan, and Krishnan Pillaipakkamnatt
31   Practical Aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences
     Wing-Kai Hon, Tak-Wah Lam, Wing-Kin Sung, Wai-Leuk Tse, Chi-Kwong Wong, and Siu-Ming Yiu
39   Faster Placement of Hydrogens in Protein Structures by Dynamic Programming
     Andrew Leaver-Fay, Yuanxin Liu, and Jack Snoeyink
49   An Experimental Analysis of a Compact Graph Representation
     Daniel K. Blandford, Guy E. Blelloch, and Ian A. Kash
62   Kernelization Algorithms for the Vertex Cover Problem: Theory and Experiments
     Faisal N. Abu-Khzam, Rebecca L. Collins, Michael R. Fellows, Michael A. Langston, W. Henry Suters, and Christopher T. Symons
70   Safe Separators for Treewidth
     Hans L. Bodlaender and Arie M.C.A. Koster
79   Efficient Implementation of a Hotlink Assignment Algorithm for Web Sites
     Artur Alves Pessoa, Eduardo Sany Laber, and Criston de Souza
88   Experimental Comparison of Shortest Path Approaches for Timetable Information
     Evangelia Pyrga, Frank Schulz, Dorothea Wagner, and Christos Zaroliagis
100  Reach-Based Routing: A New Approach to Shortest Path Algorithms Optimized for Road Networks
     Ron Gutman
112  Lazy Algorithms for Dynamic Closest Pair with Arbitrary Distance Measures
     Jean Cardinal and David Eppstein
120  Approximating the Visible Region of a Point on a Terrain
     Boaz Ben-Moshe, Paz Carmi, and Matthew J. Katz
129  A Computational Framework for Handling Motion
     Leonidas Guibas, Menelaos I. Karavelas, and Daniel Russel
142  Engineering a Sorted List Data Structure for 32 Bit Keys
     Roman Dementiev, Lutz Kettner, Jens Mehnert, and Peter Sanders

Workshop on Analytic Algorithmics and Combinatorics

152  Theory and Practice of Probabilistic Counting Algorithms (Abstract of Invited Talk)
     Philippe Flajolet
153  Analysis of a Randomized Selection Algorithm Motivated by the LZ'77 Scheme
     Mark Daniel Ward and Wojciech Szpankowski
161  The Complexity of Jensen's Algorithm for Counting Polyominoes
     Gill Barequet and Micha Moffie
170  Distributional Analyses of Euclidean Algorithms
     Viviane Baladi and Brigitte Vallée
185  A Simple Primality Test and the rth Smallest Prime Factor
     Daniel Panario, Bruce Richmond, and Martha Yip
194  Gap-Free Samples of Geometric Random Variables
     Pawel Hitczenko and Arnold Knopfmacher
199  Computation of a Class of Continued Fraction Constants
     Loïck Lhote
211  Compositions and Patricia Tries: No Fluctuations in the Variance!
     Helmut Prodinger
216  Quadratic Convergence for Scaling of Matrices
     Martin Fürer
224  Partial Quicksort
     Conrado Martínez
229  Author Index

ALENEX WORKSHOP PREFACE The annual workshop on Algorithm Engineering and Experiments (ALENEX) provides a forum for the presentation of original research in the implementation and experimental evaluation of algorithms and data structures. ALENEX 2004 was the sixth workshop in this series. It was held in New Orleans, Louisiana, on January 10, 2004. These proceedings contain the 14 papers that were selected for presentation from a total of 56 submissions. Considerable effort was devoted to the evaluation of the submissions. However, submissions were not refereed in the thorough and detailed way that is customary for journal papers. It is expected that most of the papers in these proceedings will eventually appear in finished form in scientific journals. We would like to thank all the people who contributed to a successful workshop. In particular, we thank the Program Committee and all of our many colleagues who helped the Program Committee evaluate the submissions. We also thank Adam Buchsbaum for answering our many questions along the way and Bryan Holland-Minkley for help with the submission and program committee software. We gratefully acknowledge the generous support of Microsoft, which helped reduce the registration fees for students, and thank SIAM for providing in-kind support and facilitating the workshop; in particular, the help of Kirsten Wilden from SIAM has been invaluable. Finally, we would like to thank the invited speaker, Dan Halperin of Tel Aviv University. Lars Arge and Giuseppe F. Italiano

ALENEX 2004 Program Committee

Lars Arge, Duke University
Jon Bentley, Avaya Labs Research
Mark de Berg, Technische Universiteit Eindhoven
Monika Henzinger, Google
Giuseppe F. Italiano, University of Rome
David Karger, Massachusetts Institute of Technology
Ulrich Meyer, Max-Planck-Institut für Informatik
Jan Vahrenhold, University of Münster

ALENEX 2004 Steering Committee

Adam Buchsbaum, AT&T Research
Roberto Battiti, University of Trento
Andrew V. Goldberg, Microsoft Research
Michael Goodrich, University of California, Irvine
David S. Johnson, AT&T Research
Richard E. Ladner, University of Washington, Seattle
Catherine C. McGeoch, Amherst College
David Mount, University of Maryland, College Park
Bernard M.E. Moret, University of New Mexico
Jack Snoeyink, University of North Carolina, Chapel Hill
Clifford Stein, Columbia University


ALENEX 2004 Subreviewers

Pankaj K. Agarwal, Susanne Albers, Ernst Althaus, Rene Beier, Michael Bender, Henrik Blunck, Christian Breimann, Hervé Brönnimann, David Bryant, Adam Buchsbaum, Stefan Burkhardt, Marco Cesati, Sunil Chandran, Erik Demaine, Roman Dementiev, Camil Demetrescu, Jeff Erickson, Irene Finocchi, Luisa Gargano, Leszek Gasieniec, Raffaele Giancarlo, Roberto Grossi, Concettina Guerra, Bryan Holland-Minkley, Michael Jacob, Klaus Jansen, Juha Kärkkäinen, Spyros Kontogiannis, Luigi Laura, Giovanni Manzini, Eli Packer, Ron Parr, Marco Pellegrini, Seth Pettie, Jordi Planes, Naila Rahman, Rajeev Raman, Joachim Reichel, Timo Ropinski, Kay Salzwedel, Peter Sanders, Guido Schäfer, Christian Scheideler, Naveen Sivadasan, Martin Skutella, Bettina Speckmann, Venkatesh Srinivasan, Firas Swidan, Kavitha Telikepalli, Norbert Zeh


ANALCO WORKSHOP PREFACE The papers in these proceedings, along with the invited talk by Philippe Flajolet, "Theory and Practice of Probabilistic Counting Algorithms," were presented at the First Workshop on Analytic Algorithmics and Combinatorics (ANALCO04), which was held in New Orleans, Louisiana, on January 10, 2004. The aim of ANALCO is to provide a forum for the presentation of original research in the analysis of algorithms and associated combinatorial structures. The papers study properties of fundamental combinatorial structures that arise in practical computational applications (such as permutations, trees, strings, tries, and graphs) and address the precise analysis of algorithms for processing such structures, including average-case analysis; analysis of moments, extrema, and distributions; and probabilistic analysis of randomized algorithms. Some of the papers present significant new information about classic algorithms; others present analyses of new algorithms that pose unique analytic challenges, or address tools and techniques for the analysis of algorithms and combinatorial structures, both mathematical and computational. The workshop took place on the same day as the Sixth Workshop on Algorithm Engineering and Experiments (ALENEX04); the papers from that workshop are also published in this volume. Since researchers in both fields are approaching the problem of learning detailed information about the performance of particular algorithms, we expect that interesting synergies will develop. People in the ANALCO community are encouraged to look over the ALENEX papers for problems where the analysis of algorithms might play a role; people in the ALENEX community are encouraged to look over these ANALCO papers for problems where experimentation might play a role.

Program Committee

Kevin Compton, University of Michigan
Luc Devroye, McGill University, Canada
Mordecai Golin, The Hong Kong University of Science and Technology, Hong Kong
Hsien-Kuei Hwang, Academia Sinica, Taiwan
Robert Sedgewick (Chair), Princeton University
Wojciech Szpankowski, Purdue University
Brigitte Vallée, Université de Caen, France
Jeffrey S. Vitter, Purdue University



Workshop on Algorithm Engineering and Experiments


Invited Plenary Speaker Abstract

Engineering Geometric Algorithms: Persistent Problems and Some Solutions

Dan Halperin
School of Computer Science
Tel Aviv University

The last decade has seen growing awareness of the engineering of geometric algorithms, in particular around the development of large scale software for computational geometry (like CGAL and LEDA). Besides standard issues such as efficiency, the developer of geometric software has to tackle the hardship of robustness problems, namely problems related to arithmetic precision and degenerate input, which are typically ignored in the theory of geometric algorithms, and which in spite of considerable efforts are still unresolved in full (practical) generality.

We start with an overview of these persistent robustness problems, together with a brief review of prevailing solutions to them. We also briefly describe the CGAL project and library. We then focus on fixed precision approximation methods to deal with robustness issues, and in particular on so-called controlled perturbation, which leads to robust implementation of geometric algorithms while using the standard machine floating-point arithmetic.

We conclude with algorithm-engineering matters that are still geometric but have the more general flavor of addressing efficiency: (i) We discuss the fundamental issue of geometric decomposition (that is, decomposing geometric structures into simpler substructures), exemplify the large gap between the theory and practice of such decompositions, and present practical solutions in two and three dimensions. (ii) We suggest a hybrid approach to motion planning that significantly improves simple heuristic methods by integrating exact geometric algorithms to solve subtasks.

The new results that we describe are documented at: http://www.cs.tau.ac.il/CGAL/Projects/


Engineering a Cache-Oblivious Sorting Algorithm*

Gerth Stølting Brodal†

Rolf Fagerberg†

Abstract

This paper is an algorithmic engineering study of cache-oblivious sorting. We investigate a number of implementation issues and parameter choices for the cache-oblivious sorting algorithm Lazy Funnelsort by empirical methods, and compare the final algorithm with Quicksort, the established standard for comparison based sorting, as well as with recent cache-aware proposals. The main result is a carefully implemented cache-oblivious sorting algorithm, which our experiments show can be faster than the best Quicksort implementation we can find, already for input sizes well within the limits of RAM. It is also at least as fast as the recent cache-aware implementations included in the test. On disk the difference is even more pronounced regarding Quicksort and the cache-aware algorithms, whereas the algorithm is slower than a careful implementation of multiway Mergesort such as TPIE.

1 Introduction

Modern computers contain a hierarchy of memory levels, with each level acting as a cache for the next. Typical components of the memory hierarchy are: registers, level 1 cache, level 2 cache, level 3 cache, main memory, and disk. The time for accessing a level increases for each new level (most dramatically when going from main memory to disk), making the cost of a memory access depend heavily on the lowest memory level currently containing the element accessed. As a consequence, the memory access pattern of an algorithm has a major influence on its running time in practice. Classic asymptotic analysis of algorithms in the RAM model is unable to capture this.

*This work is based on the M.Sc. thesis of the third author [29].
†BRICS (Basic Research in Computer Science, www.brics.dk, funded by the Danish National Research Foundation), Department of Computer Science, University of Aarhus, DK-8000 Århus C, Denmark. E-mail: {gerth,rolf}@brics.dk.
Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
‡Supported by the Carlsberg Foundation (contract number ANS-0257/20).
§Systematic Software Engineering A/S, Søren Frichs Vej 39, DK-8000 Århus C, Denmark. E-mail: kv@brics.dk.


Kristoffer Vinther§

A number of more elaborate models for analysis have been proposed. The most widely used of these is the I/O model introduced by Aggarwal and Vitter [2] in 1988, which assumes a memory hierarchy containing two levels, the lower level having size M and the transfer between the two levels taking place in blocks of B consecutive elements. The cost of the computation is the number of blocks transferred. The strength of the I/O model is that it captures part of the memory hierarchy, while being sufficiently simple to make analysis of algorithms feasible. In particular, it adequately models the situation where the memory transfer between two levels of the memory hierarchy dominates the running time, which is often the case when the size of the data significantly exceeds the size of main memory. By now, a large number of results for the I/O model exist—see the surveys by Arge [3] and Vitter [30]. Among the fundamental facts is that in the I/O model, comparison based sorting takes Θ(Sort_{M,B}(N)) I/Os in the worst case, where Sort_{M,B}(N) = (N/B) log_{M/B}(N/B). More elaborate models for multi-level memory hierarchies have been proposed ([30, Section 2.3] gives an overview), but fewer analyses of algorithms have been done. For these models, as for the I/O model of Aggarwal and Vitter, algorithms are assumed to know the characteristics of the memory hierarchy. Recently, the concept of cache-oblivious algorithms was introduced by Frigo et al. [19]. In essence, this designates algorithms formulated in the RAM model, but analyzed in the I/O model for arbitrary block size B and memory size M. I/Os are assumed to be performed automatically by an offline optimal cache replacement strategy. This seemingly simple change has significant consequences: since the analysis holds for any block and memory size, it holds for all levels of a multilevel memory hierarchy (see [19] for details).
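As a numerical aside on the sorting bound just stated (a hypothetical back-of-the-envelope sketch with made-up machine parameters, not a computation from the paper), the bound and a plain binary-logarithm merge cost such as Quicksort's Θ((N/B) log₂(N/M)) can be evaluated side by side:

```cpp
#include <cmath>

// I/O-optimal comparison-based sorting bound:
// Sort_{M,B}(N) = (N/B) * log_{M/B}(N/B).
double sort_bound(double N, double M, double B) {
    return (N / B) * (std::log(N / B) / std::log(M / B));
}

// Quicksort transfers roughly (N/B) * log2(N/M) blocks: the same
// N/B factor, but with logarithm base 2 instead of M/B.
double quicksort_ios(double N, double M, double B) {
    return (N / B) * (std::log(N / M) / std::log(2.0));
}
```

For instance, with the illustrative values N = 2^28 elements, M = 2^24, and B = 2^11, the optimal bound is (2^17)·(17/13) ≈ 1.3·(N/B) block transfers while the base-2 cost is 4·(N/B), roughly a factor of three more.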
In other words, by optimizing an algorithm to one unknown level of the memory hierarchy, it is optimized to each level automatically. Thus, the cache-oblivious model in an elegant way combines the simplicity of the I/O model with a coverage of the entire memory hierarchy. An additional benefit is that the characteristics of the memory hierarchy do not need to be known, and do not need to be hardwired into the algorithm for

the analysis to hold. This increases the algorithm's portability (a benefit for e.g. software libraries), and its robustness against changing memory resources on machines running multiple processes.

In 1999, Frigo et al. introduced the concept of cache-obliviousness, and presented optimal cache-oblivious algorithms for matrix transposition, FFT, and sorting [19], and also gave a proposal for static search trees [25] with search cost matching that of standard (cache-aware) B-trees [6]. Since then, quite a number of results for the model have appeared, including the following: Bender et al. [11] gave a proposal for cache-oblivious dynamic search trees with search cost matching B-trees. Simpler cache-oblivious search trees with complexities matching that of [11] were presented in [12, 17, 26], and a variant with worst case bounds for updates appears in [8]. Cache-oblivious algorithms have been given for problems in computational geometry [1, 8, 14], for scanning dynamic sets [7], for layout of static trees [9], and for partial persistence [8]. Cache-oblivious priority queues have been developed in [4, 15], which in turn gives rise to several cache-oblivious graph algorithms [4].

Some of these results, in particular those involving sorting and algorithms to which sorting reduces, such as priority queues, are proved under the assumption M ≥ B², which is also known as the tall cache assumption. In particular, this applies to the Funnelsort algorithm of Frigo et al. [19]. A variant termed Lazy Funnelsort [14] works under the weaker tall cache assumption M ≥ B^(1+ε) for any fixed ε > 0, at the cost of a 1/ε factor compared to the optimal sorting bound Θ(Sort_{M,B}(N)) for the case M ≫ B^(1+ε). Recently, it was shown [16] that a tall cache assumption is necessary for cache-oblivious comparison based sorting algorithms, in the sense that the trade-off attained by Lazy Funnelsort between strength of assumption and cost for the case M ≫ B^(1+ε) is best possible. This demonstrates a separation in power between the I/O model and the cache-oblivious model for the problem of comparison based sorting. Separations have also been shown for the problems of permuting [16] and of comparison based searching [10].

In contrast to the abundance of theoretical results described above, empirical evaluations of the merits of cache-obliviousness are more scarce. Existing results have focused on basic matrix algorithms [19] and search trees [17, 23, 26]. Although a bit tentative, they conclude that in these areas, the efficiency of cache-oblivious algorithms lies between that of classic RAM-algorithms and that of algorithms exploiting knowledge about the specific memory hierarchy present (often termed cache-aware algorithms).

In this paper, we investigate the practical value of cache-oblivious methods in the area of sorting. We focus on the Lazy Funnelsort algorithm, since we believe it to have the biggest potential for an efficient implementation among the current proposals for I/O-optimal cache-oblivious sorting algorithms. We explore a number of implementation issues and parameter choices for the algorithm, and settle the best choices through experiments. We then compare the final algorithm with tuned versions of Quicksort, which is generally acknowledged to be the fastest all-round comparison based sorting algorithm, as well as with recent cache-aware proposals. Note that the I/O cost of Quicksort is Θ((N/B) log₂(N/M)), which differs from the optimal bound Sort_{M,B}(N) only in the base of the logarithm.

The main result is a carefully implemented cache-oblivious sorting algorithm, which our experiments show can be faster than the best Quicksort implementation we can find, already for input sizes well within the limits of RAM. It is also at least as fast as the recent cache-aware implementations included in the test. On disk the difference is even more pronounced regarding Quicksort and the cache-aware algorithms, whereas the algorithm is slower than a careful implementation of multiway Mergesort such as TPIE [18].

These findings support—and extend to the area of sorting—the conclusion of the previous empirical results on cache-obliviousness: cache-oblivious methods can lead to actual performance gains over classic algorithms developed in the RAM model. The gains may not always match those of the best algorithm tuned to a specific memory hierarchy level, but on the other hand they appear to be more robust, applying to several memory hierarchy levels simultaneously.

One observation of independent interest made in this paper is that for the main building block of Funnelsort, namely the k-merger, there is no need for a specific memory layout (contrary to its previous descriptions [14, 19]) for its analysis to hold. Thus, the central feature of the k-merger definition is the sizes of its buffers; the definition does not include a layout in memory.

The rest of this paper is organized as follows: In Section 2, we describe Lazy Funnelsort. In Section 3, we describe our experimental setup. In Section 4, we develop our optimized implementation of Funnelsort, and in Section 5, we compare it experimentally to a collection of existing efficient sorting algorithms. In Section 6, we sum up our findings.

2 Funnelsort

Three algorithms for cache-oblivious sorting have been proposed so far: Funnelsort [19], its variant Lazy


Funnelsort [14], and a distribution based algorithm [19]. These all have the same optimal bound Sort_{M,B}(N) on the number of I/Os performed, but have rather different structural complexity, with Lazy Funnelsort being the simplest. As simplicity of description often translates into smaller and more efficient code (for algorithms of the same asymptotic complexity), we find the Lazy Funnelsort algorithm the most promising with respect to practical efficiency. In this paper, we choose it as the basis for our study of the practical feasibility of cache-oblivious sorting. We now review the algorithm briefly, and give an observation which further simplifies it. For the full details, see [14].

The algorithm is based on binary mergers. A binary merger takes as input two sorted streams of elements and delivers as output the sorted stream formed by merging of these. One merge step moves an element from the head of one of the input streams to the tail of the output stream. The heads of the input streams and the tail of the output stream reside in buffers holding a limited number of elements. A buffer is simply an array of elements, plus fields storing the capacity of the buffer and pointers to the first and last elements in the buffer. Binary mergers can be combined into binary merge trees by letting the output buffer of one merger be an input buffer of another—in other words, binary merge trees are binary trees with mergers at the nodes and buffers at the edges. The leaves of the tree contain the streams to be merged.

An invocation of a merger is a recursive procedure which performs merge steps until its output buffer is full or both input streams are exhausted. If during the invocation an input buffer gets empty, but the corresponding stream is not exhausted, the input buffer is recursively filled by an invocation of the merger having this buffer as its output buffer. If both input streams of a merger get exhausted, the corresponding output stream is marked as exhausted. A single invocation of the root of the merge tree will merge the streams at the leaves of the tree.

One particular merge tree is the k-merger. For k a power of two, a k-merger is a perfect binary tree of k − 1 binary mergers with appropriately sized buffers on the edges, k input streams, and an output buffer at the root of size k^d, for a parameter d > 1.¹ A 16-merger is illustrated in Figure 1.

Figure 1: A 16-merger consisting of 15 binary mergers. Shaded regions are the occupied parts of the buffers.

The sizes of the buffers are defined recursively: Let the top tree be the subtree consisting of all nodes of depth at most ⌈i/2⌉, and let the subtrees rooted by nodes at depth ⌈i/2⌉ + 1 be the bottom trees. The edges between nodes at depth ⌈i/2⌉ and depth ⌈i/2⌉ + 1 have associated buffers of size ⌈α k^{d/2}⌉, where α is a positive parameter,¹ and the sizes of the remaining buffers are defined by recursion on the top tree and the bottom trees. In the descriptions in [14, 19], a k-merger is also laid out recursively in memory (according to the so-called van Emde Boas layout [25]), in order to achieve I/O efficiency. We observe in this paper that this is not necessary: In the proof of Lemma 1 in [14], the central idea is to follow the recursive definition down to a specific size k̄ of trees, and then consider the number of I/Os for loading this k̄-merger and one block for each of its output streams into memory. However, this price is not (except for constant factors) changed if we for each of the k̄ − 1 nodes have to load one entire block holding the node, and one block for each of the input and output buffers of the node. From this follows that the proof holds true, no matter how the k̄-merger is laid out.² Hence, the crux of the definition of the k-merger lies entirely in the definition of the sizes of the buffers, and does not include the van Emde Boas layout.

To actually sort N elements, the algorithm recursively sorts N^{1/d} segments of size N^{1−1/d} of the input and then merges these using an N^{1/d}-merger. For a proof that this is an I/O optimal algorithm, see [14, 19].

¹The parameter α is introduced in this paper for tuning purposes.
²However, the (entire) k-merger should occupy a contiguous segment of memory in order for the complexity proof (Theorem 2 in [14]) of Funnelsort itself to be valid.

3 Methodology

As said, our goal is first to develop a good implementation of Funnelsort by finding good choices for design options and parameter values through empirical investigation, and then to compare its efficiency to


                     Pentium 4      Pentium III    MIPS 10000   AMD Athlon     Itanium 2
Architecture type    Modern CISC    Classic CISC   RISC         Modern CISC    EPIC
Operating system     Linux 2.4.18   Linux 2.4.18   IRIX 6.5     Linux 2.4.18   Linux 2.4.18
Clock rate           2400 MHz       800 MHz        175 MHz      1333 MHz       1137 MHz
Address space        32 bit         32 bit         64 bit       32 bit         64 bit
Pipeline stages      20             12             6            10             8
L1 data cache size   8 KB           16 KB          32 KB        128 KB         32 KB
L1 line size         128 B          32 B           32 B         64 B           64 B
L1 associativity     4-way          4-way          2-way        2-way          4-way
L2 cache size        512 KB         256 KB         1024 KB      256 KB         256 KB
L2 line size         128 B          32 B           32 B         64 B           128 B
L2 associativity     8-way          4-way          2-way        8-way          8-way
TLB entries          128            64             64           40             128
TLB associativity    full           4-way          64-way       4-way          full
TLB miss handling    hardware       hardware       software     hardware       ?
RAM size             512 MB         256 MB         128 MB       512 MB         3072 MB

Table 1: The specifications of the machines used in this paper.

that of Quicksort—the established standard for comparison based sorting algorithms—as well as that of recent cache-aware proposals. To ensure robustness of the conclusions, we perform all experiments on three rather different architectures, namely Pentium 4, Pentium III, and MIPS 10000. These are representatives of the modern CISC, the classic CISC, and the RISC type of computer architecture, respectively. In the final comparison of algorithms, we add the AMD Athlon (a modern CISC architecture) and the Intel Itanium 2 (denoted an EPIC architecture by Intel, for Explicit Parallel Instruction-set Computing) for even larger coverage. The specifications of all five machines used can be seen in Table 1.³

Our programs are written in C++ and compiled by GCC v. 3.3.2 (Pentiums 4 and III, AMD Athlon), GCC v. 3.1.1 (MIPS 10000), or the Intel C++ compiler v. 7.0 (Itanium 2). We compile using maximal optimization.

We use three element types: integers, records containing one integer and one pointer, and records of 100 bytes. The first type is commonly used in experimental papers, but we do not find it particularly realistic, as keys normally have associated information. The second type models sorting small records directly, as well as key-sorting of large records. The third type models sorting medium sized records directly, and is the data type used in the Datamation Benchmark [20] originating from the database community.

We mainly consider uniformly distributed keys, but also try skewed inputs such as almost sorted data, and a

³Additionally, the Itanium 2 machine has 3072 KB of L3 cache, which is 12-way associative and has a cache line size of 128 B.

data with few distinct key values, to ensure robustness of the conclusions. To keep the experiments during the engineering phase (Section 4) tolerable in number, we only use the second data type and the uniform distribution, believing that tuning based on these will transfer to other situations. We use the drand48 family of C library functions for generation of random values. Our performance metric is wall clock time, as measured by the gettimeofday C library function.

We keep the code for the different implementation options tested in the engineering phase as similar as possible, even though this generality entails some overhead. After judging which choices of these options are best, we implement a clean version of the resulting algorithm, and use this in the final comparison against existing sorting algorithms.

Due to space limitations, we mainly sum up our findings in this paper, and show only a few plots of experimental data. A full set of plots (close to a hundred) can be found in [29]. Our code is available from http://www.daimi.au.dk/~kv/ALENEX04/.
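The measurement setup described above can be sketched as follows (an illustrative harness of our own, not the paper's code; it uses the gettimeofday timer named in the text and lrand48 from the drand48 family of generators):

```cpp
#include <sys/time.h>
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Wall-clock seconds via gettimeofday, the timer named in the text.
double now_seconds() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Generate n uniformly distributed keys with lrand48 (a member of the
// drand48 family) and report the wall-clock time spent sorting them.
double time_sort(std::size_t n) {
    std::vector<long> v(n);
    for (long &x : v) x = lrand48();
    double t0 = now_seconds();
    std::sort(v.begin(), v.end());
    return now_seconds() - t0;
}
```

In practice such a measurement would be repeated several times and averaged; a single run of this sketch only illustrates the mechanics.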

4 Engineering Lazy Funnelsort

We consider a number of design and parameter choices for our implementation of Lazy Funnelsort. We group them as indicated by the following subsections. To keep the number of experiments within realistic limits, we settle the choices one by one, in the order presented here. We test each particular question by experiments exercising only parts of the implementation, and/or by fixing the remaining choices at hopefully reasonable values while varying the parameter under investigation. In this section, we for reasons of space merely summarize


the results of each set of experiments—the actual plots can be found in [29]. Regarding notation: a and d are the parameters from the definition of the k-merger (see Section 2), and z denotes the degree of the basic mergers (see Section 4.3).

4.1 k-Merger Structure As noted in Section 2, no particular layout is needed for the analysis of Lazy Funnelsort to hold. However, some layout has to be chosen, and the choice could affect the running time. We consider BFS, DFS, and vEB layout. We also consider having a merger node stored along with its output buffer, or storing nodes and buffers separately (each part having the same layout). The usual tree navigation method is by pointers. However, for the three layouts above, implicit navigation using arithmetic on node indices is possible—this is well-known for BFS [31], and arithmetic expressions for DFS and vEB layouts can be found in [17]. Implicit navigation saves space at the cost of more CPU cycles per navigation step. We consider both pointer based and implicit navigation. We try two coding styles for the invocation of a merger, namely the straight-forward recursive implementation, and an iterative version. To control the forming of the layouts, we make our own allocation function, which starts by acquiring enough memory to hold the entire merger. We test the efficiency of our allocation function by also trying out the default allocator in C++. Using this, we have no guarantee that the proper memory layouts are formed, so we only try pointer based navigation in these cases.

Experiments: We test all combinations of the choices described above, except for a few infeasible ones (e.g. implicit navigation with the default allocator), giving a total of 28 experiments on each of the three machines. One experiment consists of merging k streams of k^2 elements in a k-merger with z = 2, a = 1, and d = 2. For each choice, we measure, for values of k in [15; 270], the time for ⌊20,000,000/k^3⌋ such mergings.

Results: The best combination on all architectures is recursive invocation of a pointer based vEB layout with nodes and buffers separate, allocated by the standard allocator. The time used for the slowest combination is up to 65% larger, and the difference is biggest on the Pentium 4 architecture. The largest gain occurs by choosing the recursive invocation over the iterative, and this gain is most pronounced on the Pentium 4 architecture, which also is the most sophisticated (it e.g. has a special return address stack holding the address of the next instruction to be fetched after returning from a function call, for its immediate execution). The vEB layout ensures around 10% reduction in time, which shows that the spatial locality of the layout is not entirely without influence in practice, despite its lack of influence on the asymptotic analysis. The implicit vEB layout is slower than its pointer based version, but less so on the Pentium 4 architecture, which also is the fastest of the processors and most likely the one least strained by complex arithmetic expressions.

4.2 Tuning the Basic Mergers The "inner loop" in the Lazy Funnelsort algorithm is the code performing the merge step in the nodes of the k-merger. We explore several ideas for efficient implementation of this code. One idea tested is to compute the minimum of the number of elements left in either input buffer and the space left in the output buffer. Merging can proceed for at least that many steps without checking the state of the buffers, thereby eliminating one branch from the core merging loop. We also try several hybrids of this idea and the basic merger. This idea will not be a gain (rather, the minimum computation will constitute an overhead) in situations where one input buffer stays small for many merge steps. For this reason, we also implement the optimal merging algorithm of Hwang and Lin [21, 22], which has higher overhead, but is an asymptotical improvement when merging sorted lists of very different sizes. To counteract its overhead, we also try a hybrid solution which invokes it only when the contents of the input buffers are skewed in size.

Experiments: We run the same experiment as in Section 4.1. The values of a and d influence the sizes of the smallest buffers in the merger. These smallest buffers occur on every second level of the merger, so any node has one of these as either input or output buffer, making this size affect the heuristics above. For this reason, we repeat the experiment for (a, d) equal to (1, 3), (4, 2.5), and (16, 1.5). These have smallest buffer sizes of 8, 23, and 45, respectively.

Results: The Hwang-Lin algorithm has, as expected, a large overhead (a factor of three for the non-hybrid version). Somewhat to our surprise, the heuristic calculating minimum sizes is not competitive, being between 15% and 45% slower than the fastest (except on the MIPS 10000 architecture, where the differences between heuristics are less pronounced). Several hybrids fare better, but the straight-forward solution is consistently the winner in all experiments. We interpret this as the branch prediction of the CPUs being as efficient as explicit hand-coding for exploiting predictability in the branches in this code (all branches, except the result of the comparison of the heads of the input buffers, are rather predictable). Thus, hand-coding just constitutes overhead.


4.3 Degree of Basic Mergers There is no need for the k-merger to be a binary tree. If we for instance base it on four-way basic mergers, we effectively remove every other level of the tree. This means less element movement and less tree navigation. In particular, a reduction in data movement seems promising—part of Quicksort's speed can be attributed to the fact that for random input, only about every other element is moved on each level in the recursion, whereas e.g. binary Mergesort moves all elements at each level. The price to pay is more CPU steps per merge step, and code complication due to the increase in the number of input buffers that can be exhausted. Based on considerations of expected register use, element movements, and number of branches, we try several different ways of implementing multi-way mergers using sequential comparison of the front elements in the input buffers. We also try a heap-like approach using loser trees [22], which proved efficient in a previous study by Sanders [27] of priority queues in RAM. In total, seven proposals for multi-way mergers are implemented.

Experiments: We test the seven implementations in a 120-merger with (a, d) = (16, 2), and measure the time for eight mergings of 1,728,000 elements each. The test is run for degrees z = 2, 3, 4, ..., 9. For comparison, we also include the binary mergers from the last set of experiments.

Results: All implementations except the loser tree show the same behavior: As z goes from 2 to 9, the time first decreases, and then increases again, with the minimum attained around 4 or 5. The slowest is 40-65% slower than the fastest. Since the number of levels for elements to move through evolves as 1/log(z), while the number of comparisons per level evolves as z, a likely explanation is that there is an initial positive effect due to the decrease in element movements, which soon is overtaken by the increase in instruction count per level.
The loser trees show a monotone decrease in running time as z increases, consistent with the fact that the number of comparisons per element for a traversal of the merger is the same for all values of z, while the number of levels, and hence the number of data movements, evolves as 1/log(z). Unfortunately, the running time starts out twice as large as for the remaining implementations at z = 2, and barely reaches them at z = 9. Apparently, the overhead is too large to make loser trees competitive in this setting. The plain binary mergers compete well, but are beaten by around 10% by the fastest four- or five-way mergers. All these findings are rather consistent across the three architectures.

4.4 Merger Caching In the outer recursion of Funnelsort, the same size of k-merger is used for all invocations on the same level of the recursion. A natural optimization would be to precompute these sizes and construct the needed k-mergers once for each size. These mergers are then reset each time they are used.

Experiments: We use the Lazy Funnelsort algorithm with (a, d, z) = (4, 2.5, 2), a straight-forward implementation of binary basic mergers, and a switch to std::sort, the STL implementation of Quicksort, for sizes below az^d = 23. We sort instances ranging in size from 5,000,000 to 200,000,000 elements.

Results: On all architectures, merger caching gave a 3-5% speed-up.

4.5 Base Sorting Algorithm Like any recursive algorithm, Lazy Funnelsort handles its base case specially. As a natural limit, we require all k-mergers to have height at least two—this removes a number of special cases in the code constructing the mergers. Therefore, for input sizes below az^d we switch to another sorting algorithm. Experiments with the sorting algorithms Insertionsort, Selectionsort, Heapsort, Shellsort, and Quicksort (in the form of std::sort from the STL library) on input sizes from 10 to 100 revealed the expected result, namely that std::sort, which (in the GCC implementation) itself switches to Insertionsort below size 16, is the fastest for all sizes. We therefore choose it as the sorting algorithm for the base case.

4.6 Parameters a and d The final choices concern the parameters a (the factor in the buffer size expression) and d (the main parameter defining the progression of the recursion, in the outer recursion of Funnelsort as well as in the buffer sizes in the k-merger). These control the buffer sizes, and we investigate their impact on the running time.

Experiments: For values of d between 1.5 and 3 and values of a between 1 and 40, we measure the running time for sorting inputs of various sizes in RAM.
Results: There is a marked rise in running time when a drops below 10, increasing to a factor of four for a = 1. This effect is particularly strong for d = 1.5. Smaller a and d give smaller buffers, and the most likely explanation seems to be that the cost of navigating to and invoking a basic merger is amortized over fewer merge steps when the buffers are smaller. Other than that, the different values of d appear to behave quite similarly. A sensible choice appears to be a around 16 and d around 2.5.


5 Evaluating Lazy Funnelsort

In Section 4, we settled the best choices for a number of implementation issues for Lazy Funnelsort. In this section, we investigate the practical merits of the resulting algorithm. We implement two versions: Funnelsort2, which uses binary basic mergers as described in Section 2, and Funnelsort4, which uses the four-way basic mergers found in Section 4.3 to give slightly better results. The remaining implementation details follow what was declared the best choices in Section 4. Both implementations use parameters (a, d) = (16, 2), and use std::sort for input sizes below 400 (as this makes all k-mergers have height at least two in both).

5.1 Competitors Comparing algorithms with the same asymptotic running time is a delicate matter. Tuning of code can often change the constants involved significantly, which leaves open the question of how to ensure equal levels of engineering in implementations of different algorithms. Our choice in this paper is to use Quicksort as the main yardstick. Quicksort is known as a very fast general-purpose comparison based algorithm [28], and has long been the standard choice of sorting algorithm in software libraries. Over the last 30 years, many improvements have been suggested and tried, and the amount of practical experience with Quicksort is probably unique among sorting algorithms. It seems reasonable to expect implementations in current libraries to be highly tuned. To further boost confidence in the efficiency of the chosen implementation of Quicksort, we start by comparing several widely used library implementations, and choose the best performer as our main competitor. We believe such a comparison will give a good picture of the practical feasibility of cache-oblivious ideas in the area of comparison based sorting. The implementations we consider are std::sort from the STL library included in the GCC v. 3.2 distribution, std::sort from the STL library from Dinkumware (www.dinkumware.com) included with Intel's C++ compiler v. 7.0, the implementation from [28, Chap. 7], and an implementation of our own, based on the proposal of Bentley and McIlroy [13], but tuned slightly further by making it simpler for calls on small instances and adding an even more elaborate choice of pivot element for large instances. These algorithms mainly differ in their partitioning strategies—how meticulously they choose the pivot element and whether they use two- or three-way partitioning. Two-way partitioning allows tighter code, but is less robust when repeated keys are present.



To gain further insight, we also compare with recent implementations of cache-aware sorting algorithms aiming for efficiency in either internal or external memory through tunings based on knowledge of the memory hierarchy. TPIE [18] is a library for external memory computations, and includes highly optimized routines for e.g. scanning and sorting. We choose TPIE's sorting routine AMI_sort as representative of sorting algorithms efficient in external memory. The algorithm needs to know the amount of available internal memory, and following suggestions in the TPIE manual we set it to 192 MB, which is 50-75% of the physical memory on all machines where it is tested. The TPIE version used is the newest at the time of writing (release date August 29, 2002). TPIE does not support the MIPS and Itanium architectures, and requires an older version (2.96) of the GCC compiler on the remaining architectures. Several recent proposals for cache-aware sorting algorithms in internal memory exist, including [5, 24, 32]. LaMarca and Ladner [24] give proposals for better exploiting the L1 and L2 caches. Improving on their effort, Arge et al. [5] give proposals using registers better, and Xiao et al. [32] give variants of the algorithms from [24] taking the effects of TLB (Translation Lookaside Buffer) misses and the low associativity of caches into account. In this test, we compare against the two Mergesort based proposals from [24] as implemented by [32] (we encountered problems with the remaining implementations from [32]), and the R-merge algorithm of [5]. We use the publicly available code from [32] and code from [5] sent to us by the authors.

5.2 Experiments We test the algorithms described above on inputs of sizes spanning the entire RAM range, as well as on inputs residing on disk. All experiments are performed on machines with no other users. The influence of background processes is minimized by running each experiment in internal memory 21 times, and reporting the median.
External-memory experiments are rather time consuming, so we run each experiment only once, believing that background processes have less impact on these long runs. Besides the three machines used in Section 3, in these final experiments we also include the AMD Athlon and the Intel Itanium 2 processor.5 Their specifications can be seen in Table 1. The methodology is as described in Section 3.

5 Due to our limited access period for the Itanium machine, we do not have results for all algorithms on this architecture.

5.3 Results The plots described in this section are shown in Appendix A. In all graphs, the y-axis shows wall time in seconds divided by n log n, and the x-axis shows log n, where n is the number of input elements. The comparison of Quicksort implementations showed that three contestants ran pretty close, with the GCC implementation as the overall fastest. It uses a compact two-way partitioning scheme, and simplicity of code here seems to pay off. It is closely followed by our own implementation (denoted Mix), based on the tuned three-way partitioning of Bentley and McIlroy. The implementation from Sedgewick's book (denoted Sedge) is not far behind, whereas the implementation from the Dinkumware STL library (denoted Dink) lags rather behind, probably due to a rather involved three-way partitioning routine. We use the GCC and the Mix implementations as the Quicksort contestants in the remaining experiments—the first we choose for pure speed, the latter for better robustness with almost no sacrifice in efficiency. In the main experiments in RAM, we see that the Funnelsort algorithm with four-way basic mergers is consistently better than the one with binary basic mergers, except on the MIPS architecture, which has a very slow CPU. This indicates that the reduced number of element movements really does outweigh the increased merger complexity, except when CPU cycles are costly compared to memory accesses. For the smallest input sizes, the best Funnelsort loses to GCC Quicksort (by 10-40%), but on three architectures gains as n grows, ending up winning (by approximately the same ratio) for the largest instances in RAM. The two architectures where GCC keeps its lead are the MIPS 10000 with its slow CPU, and the Pentium 4, which features the PC800 bus (decreasing the access time to RAM), and which has a large cache line size (reducing effects of cache latency when scanning data in cache).
This can be interpreted as follows: on these two architectures, CPU cycles, not cache effects, dominate the running time for sorting; on architectures where this is not the case, the theoretically better cache performance of Funnelsort actually shows through in practice, at least for a tuned implementation of the algorithm. The two cache-aware implementations msort-c and msort-m from [32] are not competitive on any of the architectures. The R-merge algorithm competes well, and like Funnelsort shows its cache-efficiency by having a basically horizontal graph throughout the entire RAM range on the architectures dominated by cache effects. However, four-way Funnelsort is consistently better than R-merge, except on the MIPS 10000 machine. The latter is a RISC-type architecture and has a large

number of registers, something which the R-merge algorithm is designed to exploit. TPIE's algorithm is not competitive in RAM. For the experiments on disk, TPIE is the clear winner. It is optimized for external memory, and we suspect in particular that its use of double-buffering (something which seems hard to transfer to a cache-oblivious setting) gives it an unbeatable advantage.6 However, Funnelsort comes in second, and outperforms GCC quite clearly. The gain over GCC seems to grow as n grows larger, which is in good correspondence with the difference in the base of logarithms in the I/O complexity of these algorithms. The algorithms tuned to cache perform notably badly on disk. Due to lack of space, we have only shown plots for uniformly distributed data of the second data type (records of integer and pointer pairs). The results for the other types and distributions discussed in Section 3 are quite similar, and can be found in [29].

6 Conclusion

Through a careful engineering effort, we have developed a tuned implementation of Lazy Funnelsort, which we have compared empirically with efficient implementations of other comparison based sorting algorithms. The results show that our implementation is competitive in RAM as well as on disk, in particular in situations where sorting is not CPU bound. Across the many input sizes tried, Funnelsort was almost always among the two fastest algorithms, and clearly the one adapting most gracefully to changes of level in the memory hierarchy. In short, these results show that for sorting, the overhead involved in being cache-oblivious can be small enough for the nice theoretical properties to actually transfer into practical advantages. 7 Acknowledgments We thank Brian Vinter of the Department of Mathematics and Computer Science, University of Southern Denmark, for access to an Itanium 2 processor.

References

[1] P. Agarwal, L. Arge, A. Danner, and B. Holland-Minkley. On cache-oblivious multidimensional range searching. In Proc. 19th ACM Symposium on Computational Geometry, 2003.
[2] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, Sept. 1988.

6 The TPIE sorting routine sorts one run while loading the next from disk, thus parallelizing CPU work and I/Os.


[3] L. Arge. External memory data structures. In Proc. 9th Annual European Symposium on Algorithms (ESA), volume 2161 of LNCS, pages 1-29. Springer, 2001.
[4] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In Proc. 34th Ann. ACM Symp. on Theory of Computing, pages 268-276. ACM Press, 2002.
[5] L. Arge, J. Chase, J. Vitter, and R. Wickremesinghe. Efficient sorting using registers and caches. ACM Journal of Experimental Algorithmics, 7(9), 2002.
[6] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173-189, 1972.
[7] M. Bender, R. Cole, E. Demaine, and M. Farach-Colton. Scanning and traversing: Maintaining data for traversals in a memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 139-151. Springer, 2002.
[8] M. Bender, R. Cole, and R. Raman. Exponential structures for cache-oblivious algorithms. In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of LNCS, pages 195-207. Springer, 2002.
[9] M. Bender, E. Demaine, and M. Farach-Colton. Efficient tree layout in a multilevel memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 165-173. Springer, 2002.
[10] M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. Iacono, and A. Lopez-Ortiz. The cost of cache-oblivious searching. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 271-282, 2003.
[11] M. A. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proc. 41st Ann. Symp. on Foundations of Computer Science, pages 399-409. IEEE Computer Society Press, 2000.
[12] M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 29-39, 2002.
[13] J. L. Bentley and M. D. McIlroy. Engineering a sort function. Software—Practice and Experience, 23(11):1249-1265, 1993.
[14] G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of LNCS, pages 426-438. Springer, 2002.
[15] G. S. Brodal and R. Fagerberg. Funnel heap - a cache oblivious priority queue. In Proc. 13th Annual International Symposium on Algorithms and Computation, volume 2518 of LNCS, pages 219-228. Springer, 2002.
[16] G. S. Brodal and R. Fagerberg. On the limits of cache-obliviousness. In Proc. 35th Annual ACM Symposium on Theory of Computing (STOC), pages 307-315, 2003.


[17] G. S. Brodal, R. Fagerberg, and R. Jacob. Cache oblivious search trees via binary trees of small height. In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 39-48, 2002.
[18] Department of Computer Science, Duke University. TPIE: a transparent parallel I/O environment. WWW page, http://www.cs.duke.edu/TPIE/, 2002.
[19] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, pages 285-297. IEEE Computer Society Press, 1999.
[20] J. Gray. Sort benchmark home page. WWW page, http://research.microsoft.com/barc/SortBenchmark/, 2003.
[21] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM Journal on Computing, 1(1):31-39, 1972.
[22] D. E. Knuth. The Art of Computer Programming, Vol. 3, Sorting and Searching. Addison-Wesley, Reading, USA, 2nd edition, 1998.
[23] R. E. Ladner, R. Fortna, and B.-H. Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, volume 2547 of LNCS, pages 78-92. Springer, 2002.
[24] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. Journal of Algorithms, 31:66-104, 1999.
[25] H. Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, June 1999.
[26] N. Rahman, R. Cole, and R. Raman. Optimised predecessor data structures for internal memory. In Proc. 5th Int. Workshop on Algorithm Engineering (WAE), volume 2141 of LNCS, pages 67-78. Springer, 2001.
[27] P. Sanders. Fast priority queues for cached memory. ACM Journal of Experimental Algorithmics, 5(7), 2000.
[28] R. Sedgewick. Algorithms in C++: Parts 1-4: Fundamentals, Data Structures, Sorting, Searching. Addison-Wesley, Reading, MA, USA, third edition, 1998. Code available at http://www.cs.princeton.edu/~rs/Algs3.cxx1-4/code.txt.
[29] K. Vinther. Engineering cache-oblivious sorting algorithms. Master's thesis, Department of Computer Science, University of Aarhus, Denmark, May 2003. Available online at http://www.daimi.au.dk/~kv/thesis/.
[30] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, 33(2):209-271, June 2001.
[31] J. W. J. Williams. Algorithm 232: Heapsort. Commun. ACM, 7:347-348, 1964.
[32] L. Xiao, X. Zhang, and S. A. Kubricht. Improving memory performance of sorting algorithms. ACM Journal of Experimental Algorithmics, 5(3), 2000.

A Plots

Comparison of Quicksort Implementations


Results for Inputs on disk


The Robustness of the Sum-of-Squares Algorithm for Bin Packing

Michael A. Bender*

Bryan Bradley†

Geetha Jagannathan‡

Krishnan Pillaipakkamnatt§

Abstract

Csirik et al. [CJK+99, CJK+00] introduced the sum-of-squares algorithm (SS) for online bin packing of integral-sized items into integral-sized bins. They showed that for discrete distributions, the expected waste of SS is sublinear if the expected waste of the offline optimal is sublinear. The algorithm SS has time complexity O(nB) for packing n items into bins of size B. In [CJK+02] the authors present variants of this algorithm that enjoy the same asymptotic expected waste as SS (with larger multiplicative constants), but with time complexities of O(n log B) and O(n). In this paper we present three sets of results that demonstrate the robustness of the sum-of-squares approach. First, we show the results of experiments from two new variants of the SS algorithm. The first variant, which runs in time O(n√B log B), appears to have almost identical expected waste as the sum-of-squares algorithm on all the distributions mentioned in [CJK+99, CJK+00, CJK+02]. The other variant, which runs in O(n log B) time, performs well on most, but not all, of those distributions. Both algorithms have simple implementations. We present results from experiments that extend the sum-of-squares algorithm to the bin-packing problem with two bin sizes (the variable-sized bin-packing problem). From our experiments comparing SS and Best Fit over uniform distributions, we observed that there are scenarios where, when one bin size is 2/3 the size of the other, SS has Θ(√n) waste while Best Fit has linear waste. We also present situations where SS has Θ(1) waste while Best Fit has Θ(n) waste. We observe an anomalous behavior in Best Fit that does not seem to affect SS. Finally, we apply SS to the related problem of online memory allocation. Our experimental comparisons between SS and Best Fit indicate that neither algorithm is consistently better than the other. If the amount of randomness is low, SS appears to have lower waste than Best Fit, while larger amounts of randomness appear to favor Best Fit. An interesting phenomenon shows that for a given range of allocation sizes we can find ranges of allocation durations where SS has lower waste than Best Fit. In the online memory-allocation problem for the uniform and interval distributions, SS does not seem to have an asymptotic advantage over Best Fit, in contrast with the bin-packing problem.

* Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA, email: bender@cs.sunysb.edu. Supported in part by Sandia National Laboratories and NSF grants EIA-0112849 and CCR-0208670.
† Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA, email: pot8os@acm.org.
‡ Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA, email: geethajagan@acm.org.
§ Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA, email: csckzp@hofstra.edu.


1 Introduction

In classical bin packing the input is a list L = (a_1, a_2, ..., a_n) of n items and an infinite supply of bins of unit capacity. Each item a_i has size s(a_i), where 0 < s(a_i) <= 1. The objective is to pack the items into a minimum number of bins, subject to the constraint that the sum of the sizes of the items in each bin is no greater than 1. Bin packing has a wide range of applications, including stock cutting, truck packing, the assignment of commercials to station breaks in television programming, and memory allocation. Because this problem is NP-hard [GJ79], most bin-packing research concentrates on finding polynomial-time approximation algorithms for bin packing. Bin packing is among the earliest problems for which the performance of approximation algorithms was analyzed [CGJ96]. In this paper we focus on the average-case performance of online algorithms for bin packing, where item sizes are drawn according to some discrete distribution. An algorithm is said to be online if it packs each item as soon as it "arrives", without any knowledge of the items not yet encountered. That is, the decision to pack item a_i into some particular bin can be based only on the knowledge of items a_1, ..., a_{i-1}. Moreover, once an item has been assigned to a bin, it cannot be reassigned. Discrete distributions are those in which each item size is an element of some set {s_1, s_2, ..., s_J} of integers, and each size has an associated rational probability. The capacity of a bin is a fixed integer B >= s_J. We overload notation and write s(L) for the sum of the sizes of the items in the list L. For an algorithm A, we use A(L) to denote the number of bins used by A to pack the items in L. We write OPT to denote an optimal packing algorithm. Let F be a probability distribution over item sizes. Then L_n(F) denotes a list of n items drawn according to distribution F. A packing is a specific assignment of items to bins. The size of a packing P, written ||P||, is the number of bins used by P.
For an algorithm A, we use P_n^A(F) to denote a packing resulting from the application of algorithm A to the list L_n(F). Given a packing P of a list L, the waste for P, the sum of the unused bin capacities, is defined

as W(P) = B·||P|| − s(L). The expected waste of an algorithm A on a distribution F is EW_n^A(F) = E[W(P_n^A(F))], where the expectation is taken over all lists of length n.

1.1 Related Work

Variable-sized Bin Packing. In the variable-sized bin-packing problem, not all bins have the same capacity. Assume that there is an infinite supply of bins of each of the given sizes, the largest being B_1. Let L denote a list of items whose sizes are in (0, B_1]. Lately this problem has received more attention because of its applications in stock cutting and in the assignment of commercials to variable-sized breaks in television programming. Much of the literature on variable-sized bin-packing algorithms measures effectiveness in terms of the performance ratio of the algorithm (instead of expected waste). The asymptotic expected performance ratio for algorithm A on distribution F is the average ratio of the number of bins taken by A to the number of bins taken by an optimal algorithm. That is, ER_A^∞(F) = lim sup_{n→∞} E[A(L_n(F)) / OPT(L_n(F))].

Friesen and Langston [FL86] first investigated the variable-sized bin-packing problem. Kinnersley and Langston [KL88] gave an online algorithm with performance ratio 7/4. The variable-harmonic algorithm [Csi89], which is based on the harmonic-fit algorithm, has a performance ratio of at most 1.69103. Epstein, Seiden, and van Stee [ESS01] presented two unbounded-space online algorithms for variable-sized bin packing, focusing on the case of two bin sizes. These algorithms are combinations of the variable-harmonic and refined-harmonic [LL81] algorithms. Epstein et al. also proved a lower bound for any online algorithm with two bin sizes: the asymptotic performance ratio is greater than 1.33561.

Bin Packing. Recently progress has been made in the average-case analysis of the standard heuristics under the discrete uniform distributions U{j, k}, where the bin capacity is taken as B = k and item sizes are drawn uniformly from {1, 2, ..., j}, j ≤ k. When j = k − 1, the online Best-Fit and First-Fit algorithms have Θ(√(nk)) expected waste. Remarkably, when 1 ≤ j ≤ k − 2, the expected waste of the optimal algorithm is O(1) [ECG+91]. An algorithm is said to be stable under a distribution if its expected waste remains bounded even as the number of items goes to infinity. Coffman et al. [ECG+91] proved that Best Fit is stable when k > j(j + 3)/2. Kenyon et al. [KRS98] showed that Best Fit is stable under U{k − 2, k} and is also stable for some specific values of (j, k) with k ≤ 14. It has been shown experimentally that for most pairs (j, k) the expected waste of Best Fit is Θ(n).

The Sum-of-Squares Algorithm. We assume that all items have integral sizes. The gap of a bin is the amount of unassigned capacity in the bin. Let N(g) denote the number of bins in the current packing with gap g, 1 ≤ g < B. Initially, N(g) = 0 for all g. The sum-of-squares algorithm puts an item a of size s(a) into a bin such that after placing a the value of Σ_{g=1}^{B−1} N(g)^2 is minimized. Csirik et al.
[CJK+99] gave experimental evidence that for the discrete uniform distributions U{j, k} (with k = 100 and 1 ≤ j ≤ 98), EW_n^SS is O(1). They also showed that for j = 99, EW_n^SS = O(√n). Their results indicated that as j goes from 97 to 99, the expected waste of SS, EW_n^SS, goes from O(1) to Θ(√n), whereas the expected waste of BF, EW_n^BF, transitions from Θ(n) to O(1) to Θ(√n).

In a theoretical analysis of the sum-of-squares algorithm [CJK+00], Csirik et al. proved that for any perfectly-packable distribution F, EW_n^SS(F) = O(√n). They also proved that if F is a bounded-waste distribution then EW_n^SS(F) is either O(1) or O(log n). In particular, if F is a discrete uniform distribution U{j, k} with j < k − 1, then EW_n^SS(F) = O(1). They also proved that for all lists L, SS(L) ≤ 3·OPT(L).

Memory Allocation. Memory is modeled as an infinitely long array of storage locations. An allocator receives requests for blocks of memory of various sizes, and requests for their deallocation arrive in some unknown order. Although the size of a request is known to the allocator, its deallocation time is unknown at the time of allocation. The deallocation of a block may leave a "hole" in the memory. The objective of a memory-allocation algorithm is to minimize the total amount of space wasted in these holes.

Although memory allocation has been studied since the early days of computing, only a handful of results concern the competitive ratios of the standard heuristics for the problem. For the memory-allocation problem, the competitive ratio of an online algorithm is the ratio of the total amount of memory required by the algorithm to satisfy all requests to W, the largest


amount of concurrently allocated memory. Luby, Naor, and Orda [LNO96] showed that First Fit (the heuristic that assigns each request to the lowest-addressed hole that can accommodate it) has a competitive ratio of Θ(min(log W, log C)), where C is the largest number of concurrent allocations. By Robson's result [Rob74], this bound is the best possible for any online algorithm.
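The First Fit heuristic just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names are ours, and the block list is kept as a sorted Python list rather than an efficient data structure.

```python
# Minimal sketch of the First Fit allocation heuristic: memory is an
# unbounded address line, and each request is placed at the start of the
# lowest-addressed hole large enough to hold it.
class FirstFitAllocator:
    def __init__(self):
        self.blocks = []  # allocated blocks as (start, size), sorted by start

    def allocate(self, size):
        addr = 0  # candidate start address, advanced past each occupied block
        for start, sz in self.blocks:
            if start - addr >= size:  # the hole [addr, start) fits the request
                break
            addr = start + sz
        self.blocks.append((addr, size))
        self.blocks.sort()            # keep blocks ordered by start address
        return addr

    def free(self, addr):
        # deallocation may leave a hole that later requests can reuse
        self.blocks = [b for b in self.blocks if b[0] != addr]
```

For instance, after allocate(5), allocate(3), and free(0), a subsequent allocate(4) reuses the hole left at address 0.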

1.2 Results.

In this paper we study the performance of the sum-of-squares algorithm and its variants. We present faster variants of the sum-of-squares algorithm for the online bin-packing problem. Next, we compare the sum-of-squares algorithm with the Best-Fit algorithm for variable-sized bin packing. Finally, we show the results of applying the sum-of-squares algorithm to memory allocation. We performed our experiments with the uniform distributions U{j, k}. We have also run our algorithms on interval distributions, especially those that are claimed to be difficult in [CJK+02]. For the memory-allocation problem we used various interval distributions for both block sizes and block durations.

• In Section 2 we present our variants of the sum-of-squares algorithm. The first is the SSmax variant, which runs in O(n√B log B) time. Our experiments suggest that the performance of this variant is close to that of the SS algorithm in absolute terms, for all the distributions mentioned in [CJK+99, CJK+00, CJK+02]. The remaining algorithms form a family called the segregated sum-of-squares algorithms (SSS). These algorithms perform well on most of the distributions mentioned in the above papers, but on some distributions they do not have the same expected waste as SS. The best running time in this family is O(n log log B).

• Section 3 provides experimental results for the sum-of-squares algorithm applied to a generalized version of the bin-packing problem in which bins come in two sizes. For fixed bin sizes, the performance of the sum-of-squares and Best-Fit algorithms can be categorized into three ranges based on item sizes. In the first range both algorithms appear to have constant waste. In the second range, Best Fit has Θ(n) waste and SS has O(1) waste. In the third range, when one bin size is around 2/3 the size of the other, SS has Θ(√n) waste while Best Fit has linear waste; when the bin sizes are not in this ratio, both algorithms have Θ(√n) waste.
We observe an anomalous behavior in Best Fit that does not seem to affect SS.


• In Section 4 we apply SS to the related problem of online memory allocation. Our experimental comparisons between SS and Best Fit indicate that neither algorithm is consistently better than the other. Smaller allocation durations appear to favor Best Fit, while larger allocation durations favor SS. Also, if the amount of randomness is low, SS appears to have lower waste than Best Fit, while larger amounts of randomness appear to favor Best Fit. Interestingly, for a given range of allocation sizes we can find ranges of allocation durations where SS has lower waste than Best Fit. In the online memory-allocation problem SS does not seem to have an asymptotic advantage over Best Fit, in contrast to the bin-packing problem.

2 Faster Variants of the Sum-of-Squares Algorithm

In this section we present variants of the sum-of-squares algorithm. The SSmax variant of Section 2.2.1 runs in O(n√B log B) time and appears to have expected waste remarkably close to that of SS. Experimental results indicate that the segregated sum-of-squares family of algorithms (Section 2.2.2) runs faster but, in some cases, has Θ(n) expected waste where SS has Θ(√n) waste.

2.1 Sum-of-Squares Algorithm.

The sum of the sizes of the items in a bin is the level of the bin. The gap of a bin is the amount of its unused capacity: if the level of a bin is ℓ, then its gap is B − ℓ. Let P be a packing of a list L of items. Let the gap count of g, N(g), denote the number of bins in the packing that have gap g, for 1 ≤ g ≤ B − 1. We call N the profile vector of the packing. We ignore perfectly packed bins (whose gap is 0) and completely empty bins. The sum-of-squares of a packing P is ss(P) = Σ_{g=1}^{B−1} N(g)^2.

The sum-of-squares algorithm is an online algorithm that works as follows. Let a be the next item to pack. It is packed into a legal bin (one whose gap is at least s(a)) such that for the resulting packing P′ the value of ss(P′) is minimum over all possible placements of a. When an item of size s arrives, there are three possible ways to pack it:

1. Open a new bin: here the sum of squares increases by 1 + 2N(B − s).

2. Perfectly fill an old bin: here the sum of squares decreases by 2N(s) − 1.

3. The item goes into a bin of gap g, where g > s: here the sum of squares increases by 2·(N(g − s) − N(g) + 1). This step requires finding a g that maximizes the value of N(g) − N(g − s).

Each time an item arrives, the algorithm performs an exhaustive search to evaluate the change in ss(P) and finds an appropriate bin in O(B) time. In [CJK+02] the authors discuss variants of the original SS algorithm. They present O(n log B) and O(n) variants that approximate the calculation of the sum of squares. The authors prove that these variants have the same asymptotic growth rate for expected waste as SS, but with larger multiplicative constants. For example, they considered the distribution U{400, 1000}. For n = 100,000 items, they state that the number of bins used by the variant is 400% more than optimal, whereas Best Fit uses only 0.3% more bins, and SS uses 0.25% more bins than necessary. The situation does improve in favor of their variant for larger values of n: for n = 10^7, the variant uses roughly 9.8% more bins than optimal, but SS uses 0.0025% more bins than necessary. The authors claim their fast variants of SS are "primarily of theoretical interest", and that they are unlikely to be competitive with Best Fit except for large values of n.

They also looked at other generalizations of the sum-of-squares algorithm. Instead of considering the squares of the elements of the profile vector, they examined the r-th power, for r ≥ 1. These generalizations yield the SrS algorithms. For all finite r > 1, SrS performed qualitatively the same as SS. For the limit case of r = 1 (S1S), the resulting algorithm is one that satisfies the any-fit constraint: the algorithm does not open a new bin unless none of the already open bins can accommodate the new item to be packed. Best Fit and First Fit are examples of algorithms that satisfy the any-fit constraint. The S∞S algorithm chooses to minimize the largest value in the profile vector. This variant has optimal waste for the uniform distributions U{j, k}. However, the authors provide examples of distributions for which S∞S has linear waste while the optimal algorithm and SS have sublinear waste.

2.2 Our Variants. Our variants try to approximate the behavior of the sum-of-squares algorithm. By attempting to minimize the value of ss(P′), SS effectively tries to ensure that it has an approximately equal number of bins of each gap. Instead of examining all possible legal gaps for the best possible gap (as SS does), we examine only the k largest gap counts, for some value of k.

2.2.1 Parameterized SSmax Algorithm. The parameterized algorithm SSmax(k) is based on the S∞S variant mentioned above. The right choice of k is discussed later in this section. To pack a given item into a bin, SSmax(k) computes the value of ss(P) for the k largest gap counts (for some k ≥ 1). The algorithm also computes ss(P) for opening a new bin and for perfectly packing the new item into an already open bin. The given item is packed into a bin such that the resulting packing minimizes ss(P). (Note that when we set k = B, we get the original SS algorithm.) Using a priority queue, we can find the k largest values in O(k log B) time; this improves to O(k log log B) time using faster priority queues such as van Emde Boas trees.

Experimental Results. For small values of k (such as constant k, or k = log B), the experiments suggest that the results for SSmax(k) are qualitatively similar to those of S∞S. However, for k near 2√B the average waste of SSmax(k) undergoes a substantial change. For each of the distributions mentioned in [CJK+99, CJK+00, CJK+02], we observed that the performance of SSmax(2√B) matched that of SS to within a factor of 1.4. In many cases the factor is as low as 1.02. (See Table 1 and Figure 1 for results from some of the distributions.) Note that in this and other tables, when it became clear that the growth rate of an algorithm is linear in n, we did not run experiments for n = 10^8 and n = 10^9. In all other cases, we used 100 samples for n ∈ {10^5, 10^6}, 25 samples for n ∈ {10^7, 10^8}, and 3 samples for n = 10^9.

We also compared the waste of this variant with SS for the distributions U{j, 100}, U{j, 200}, and U{j, 1000}. As our results in Table 2 and Figure 2 show, the performance of the variant SSmax(2√B) is remarkably similar to that of SS.
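The SS placement rule and the restricted search of SSmax(k) can be sketched as follows. This is our own minimal sketch, not the paper's code: it assumes integral item sizes with 1 ≤ s < B, uses the three ss(P) update formulas from Section 2.1, and substitutes heapq.nlargest for the priority queue mentioned above. With k = None it performs the full O(B) scan of SS.

```python
import heapq

# Sketch of SS / SSmax(k) using the three update formulas for ss(P).
# Assumes integral item sizes with 1 <= s < B. N[g] is the gap count for
# 1 <= g < B; perfectly filled bins (gap 0) drop out of the profile.
def ss_pack(items, B, k=None):
    N = [0] * B
    bins_used = 0
    for s in items:
        # Option 1: open a new bin; ss(P) increases by 1 + 2*N(B - s).
        best_delta, best_gap = 1 + 2 * N[B - s], None
        # Option 2: perfectly fill a bin of gap s; ss(P) decreases by 2*N(s) - 1.
        if N[s] > 0:
            best_delta, best_gap = -(2 * N[s] - 1), s
        # Option 3: place the item in a bin of gap g > s; ss(P) increases by
        # 2*(N(g - s) - N(g) + 1). SS scans every legal gap; SSmax(k)
        # examines only the k largest gap counts.
        gaps = [g for g in range(s + 1, B) if N[g] > 0]
        if k is not None:
            gaps = heapq.nlargest(k, gaps, key=lambda g: N[g])
        for g in gaps:
            delta = 2 * (N[g - s] - N[g] + 1)
            if delta < best_delta:
                best_delta, best_gap = delta, g
        if best_gap is None:          # open a new bin
            bins_used += 1
            N[B - s] += 1
        else:                         # reuse a bin; update its gap count
            N[best_gap] -= 1
            if best_gap - s > 0:
                N[best_gap - s] += 1
    return bins_used, N
```

For example, ss_pack([6, 4, 6, 4], 10) packs the four items into two perfectly filled bins, since each arriving 4 exactly fills the gap left by a 6.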
