SHORTEST CONNECTIVITY
COMBINATORIAL OPTIMIZATION VOLUME 17
Through monographs and contributed works the objective of the series is to publish state of the art expository research covering all topics in the field of combinatorial optimization. In addition, the series will include books which are suitable for graduate level courses in computer science, engineering, business, applied mathematics, and operations research. Combinatorial (or discrete) optimization problems arise in various applications, including communications network design, VLSI design, machine vision, airline crew scheduling, corporate planning, computer-aided design and manufacturing, database query design, cellular telephone frequency assignment, constraint directed reasoning, and computational biology. The topics of the books will cover complexity analysis and algorithm design (parallel and serial), computational experiments and application in science and engineering.
Series Editors
Ding-Zhu Du, University of Minnesota
Panos M. Pardalos, University of Florida
Advisory Editorial Board
Alfonso Ferreira, CNRS-LIP ENS London
Jun Gu, University of Calgary
David S. Johnson, AT&T Research
James B. Orlin, M.I.T.
Christos H. Papadimitriou, University of California at Berkeley
Fred S. Roberts, Rutgers University
Paul Spirakis, Computer Tech Institute (CTI)
SHORTEST CONNECTIVITY An Introduction with Applications in Phylogeny
DIETMAR CIESLIK Ernst-Moritz-Arndt University, Greifswald, Germany Massey University, Palmerston North, New Zealand
Springer
Library of Congress Cataloging-in-Publication Data A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 0-387-23538-8
e-ISBN 0-387-23539-6
Printed on acid-free paper.
© 2005 Springer Science+Business Media, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed in the United States of America. 9 8 7 6 5 4 3 2 1 springeronline.com
SPIN 11336228
CONTENTS
PREFACE
1 TWO CLASSICAL OPTIMIZATION PROBLEMS
  1.1 The Fermat-Torricelli point
  1.2 Minimum Spanning Trees
2 GAUSS' QUESTION
  2.1 Gauss' question and their conversion to Steiner's Problem
  2.2 Examples and Exercises
  2.3 References
  2.4 A first analysis of Steiner's Problem
  2.5 Steiner's Problem in graphs
3 WHAT DOES SOLUTION MEAN?
  3.1 A metaphysical approach
  3.2 Does a solution exist?
  3.3 Does an algorithm exist?
  3.4 Does an efficient algorithm exist?
  3.5 Does an approximation exist?
4 NETWORK DESIGN PROBLEMS
  4.1 An overview of applications
  4.2 Several variants
5 A NEW CHALLENGE: THE PHYLOGENY
  5.1 Phylogenetic Trees
  5.2 Phylogenetic Spaces
  5.3 Applications and related questions
6 AN ANALYSIS OF STEINER'S PROBLEM IN PHYLOGENETIC SPACES
  6.1 Difficulties
  6.2 More about trees
  6.3 Cluster Analysis
  6.4 Spanning Trees
  6.5 Counting the elements in discrete metric spaces
  6.6 Fermat's Problem in several discrete metric spaces
7 TREE BUILDING ALGORITHMS
  7.1 Tree building methods - an overview
  7.2 Maximum Parsimony Method
  7.3 The perfect phylogeny problem
  7.4 Pair Group Methods
  7.5 Steinerization
  7.6 Handling more than one tree
REFERENCES
INDEX
PREFACE
The problem of "Shortest Connectivity" has a long and convoluted history. Usually, the problem is known as Steiner's Problem and it can be described more precisely in the following way: Given a finite set of points in a metric space, search for a network that connects these points with the shortest possible length. This shortest network must be a tree and is called a Steiner Minimal Tree (SMT). It may contain vertices different from the points which are to be connected. Such points are called Steiner points. Steiner's Problem seems disarmingly simple, but it is rich with possibilities and difficulties, even in the simplest case, the Euclidean plane. This is one of the reasons that an enormous volume of literature has been published, starting in the seventeenth century and continuing today. Over the years Steiner's Problem has taken on an increasingly important role. More and more real-life problems are given which use Steiner's Problem or one of its relatives as an application, as a subproblem or as a model. We will discuss the problem of "Shortest Connectivity" as a general approach to investigate real structures in nature. We will see that this involves the identification of a combinatorial structure that requires the smallest number of changes. It is often said that this principle abides by Ockham's razor, according to which the best hypothesis is the one requiring the smallest number of assumptions.¹ At first we will give an overview of Steiner's Problem and its relatives as one of the most interesting optimization problems in the intersection of combinatorics and geometry. In this sense, the present book is an introduction to the theory of "Shortest Connectivity".
¹ Roughly speaking: Do not increase the number of entities without necessity.
We will see that Steiner's Problem is the core of the so-called "Geometric Network Design Problems", where the general problem can be stated as follows: given a configuration of vertices and/or edges, find a network which contains these objects, satisfies some predetermined requirements, and which minimizes a given objective function that depends on several distance measures. Secondly, we will discuss a new challenge, namely to create trees which reflect the phylogeny, which is the evolutionary history of "living entities". For 3.5 billion years, since life on earth began, evolution has created a remarkable variety of organisms. Millions of different species are alive today, while countless have become extinct. To describe the evolution of these species is a fundamental problem that has been of interest at least since Charles Darwin first proposed the theory of evolution more exactly. Trees are widely used to represent evolutionary relationships. In biology, for example, the dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by a splitting of one population into two or more populations that do not crossbreed, rather than by a mixing of two populations into one. The principle of Maximum Parsimony involves the identification of a combinatorial structure that requires the smallest number of evolutionary changes. Note that here, minimizing the number of assumptions does not mean minimizing the steps of an evolution²; it means that among all possible structures we seek one which satisfies only one, and moreover a natural, condition. We will consider the problem of reconstruction of phylogenetic trees in our sense of shortest connectivity. To do this we introduce the so-called phylogenetic spaces. These are metric spaces whose points are arbitrary words generated by characters from some alphabet, and the metric measuring "similarity" of the words is generated by a cost measure on the characters. The "central dogma" will be: A phylogenetic tree is an SMT in a desired chosen phylogenetic space. In any case this topic contains many problems for further research.
² Whatever that means!
The aim in this graduate-level text is to outline the key mathematical concepts that underpin the important questions in applied mathematics. These concepts involve discrete mathematics (particularly graph theory), optimization, computer science, and several ideas in biology.
Acknowledgements. I thank all people who supported my research and gave me helpful advice on how to write this book: A. Dress (Bielefeld), W.M. Fitch (Irvine), P. Gardner (Palmerston North), R. Graham (La Jolla), M.D. Hendy (Palmerston North), K. Huber (Uppsala), A. v. Haeseler (Jülich/Düsseldorf), A.O. Ivanov (Moscow), A. Kemnitz (Braunschweig), V. Moulton (Uppsala), P. Pardalos (Gainesville), D. Penny (Palmerston North), H.J. Prömel (Berlin), J. MacGregor Smith (Amherst), M. Steel (Christchurch), A.A. Tuzhilin (Moscow), D.M. Warme (Alexandria) and J. Wills (Siegen). I thank Tim White (Palmerston North) and my student K. Kruse for proof reading of the manuscript. Heidrun G. Kohler (Greifswald) gave a lot of remarks regarding writing this book in a suitable style. Moreover, I thank my colleagues H.-R. Frieling, W. Girbardt and Passauer for helpful technical support.
I thank the Institute of Fundamental Sciences, Massey University, New Zealand; the von Neumann Institut for Computing, Forschungszentrum Jülich, Germany; and the Allan Wilson Centre for Molecular Evolution and Ecology, Massey University, New Zealand for hosting me during the winter 2001/02, the spring of 2002, and the spring of 2003, respectively.
TWO CLASSICAL OPTIMIZATION PROBLEMS
Scientific or engineering applications usually require the solution of mathematical optimization problems. Such applications span a wide range, from modelling the evolution of species in biology to modelling soap films for grids of wires; from the design of collections of data to the design of heating or air-conditioning systems in buildings; and from the creation of oil and gas pipelines to the creation of communication networks, road and railway lines. These are all network design problems of significant importance and nontrivial complexity. The network topology and design characteristics of these systems are classical examples of optimization problems. The general network design problem is this: for a given configuration of vertices and/or edges, find a network which contains these objects, fulfills some predetermined requirements and minimizes a given objective function. This is quite general and models a wide variety of problems. Two classical optimization problems represent the parsimonious view of the world: The Fermat-Steiner-Weber-Problem and the problem of minimum spanning trees.
1.1 THE FERMAT-TORRICELLI POINT
The problem discussed here has a long and strange history; moreover, it has gone by many names. Players from a lot of fields of study have stepped on its stage, and some of them have stumbled. It is usual to credit the Italian mathematicians with proposing and solving the problem: The problem was posed by Fermat early in the 17th century at the end of his book Treatise on Minima and Maxima [159], and was stated as follows: Given three points in the plane, find a fourth point such that the sum of its distances to the three given points is minimal. The problem seems disarmingly simple, but is so rich in possibilities and traps that it has generated an enormous literature dating back to the seventeenth century, and continues to do so. We will come across these more than once in our considerations. Around 1640 Torricelli solved this problem: He asserted that, assuming that the given points form a triangle in which all angles are less than 120°, the circles which circumscribe the equilateral triangles constructed on the sides of and outside of the given triangle intersect in the desired point, called the Torricelli point. Note that, in general, the Torricelli point is not one of the well-known points for triangles; it has its own character. Shortly afterwards, in 1647, Cavalieri's Exercitationes Geometricae showed that the three lines joining the Torricelli point to the given points form angles of 120° with each other. Over the centuries Fermat's Problem was rediscovered and generalized by other mathematicians. In the following centuries this problem was well established in the mathematical folklore. A history of Fermat's Problem is given by Boltyanski et al. [48], Scriba and Schreiber [391], and Wesolowsky [455]. In the nineteenth century Steiner studied this problem and generalized it to include an arbitrarily large set of points in the plane. About one hundred years later Courant and Robbins [116] wrote:
"A very simple but instructive problem was treated by Jacob Steiner, the famous representative of geometry at the University of Berlin in the early nineteenth century. Three villages A, B, C are to be joined by a system of roads of minimum total length."¹
¹ But it should be noted what Kuhn, compare [455], said: "Although this very gifted geometer (Steiner) of the 19th century can be counted among the dozens of mathematicians who have written on the subject, he does not seem to have contributed anything new, either to its formulation or its solution."
In other terms, we are interested in
Fermat's Problem
Given: A finite set of points in the Euclidean plane (or in a Euclidean space).
Find: A point such that the sum of the distances to all points of the set is as small as possible.
This point will be called a Torricelli point. Here, the Euclidean plane is the affine plane equipped with the norm defined by
$$\|(x, y)\| = \sqrt{x^2 + y^2}. \qquad (1.1)$$
Let N be the set of given points. To establish the existence of a Torricelli point we note firstly that the so-called Fermat function $F_N$,
$$F_N(w) = \sum_{v \in N} \|v - w\|,$$
which is to be minimized, is continuous, and secondly the fact that we have to search for a Torricelli point only in the convex hull $\operatorname{conv} N$, which is a compact set. This implies that $F_N$ attains a minimum value.²
² Note that the Torricelli point can be one of the given points.
In the twentieth century, the problem passed to those who claimed there was a use for it. Weber uses in his book Über den Standort der Industrien [449] a weighted three point version of the problem to depict industrial location minimizing transport cost. A mathematical appendix to his book, written by Pick, gives a geometrical construction procedure to find the optimum location, and discusses the conditions under which one of these points is the optimum. We will follow these considerations. Let
$$N = \{(x_i, y_i) : i = 1, \ldots, n\}$$
be a set of n points of the Euclidean plane. Then the Fermat function is given by
$$F_N(w) = \sum_{i=1}^{n} \sqrt{(x_i - x)^2 + (y_i - y)^2}, \qquad (1.5)$$
where $w = (x, y)$. If we differentiate $f(x, y) = F_N(w)$ and set the partial derivatives equal to zero to obtain the first order conditions for optimality, we have the following observations:
Lemma 1.1.1 Let $f(x, y) = F_N(w)$ be the Fermat function for the set $N = \{(x_i, y_i) : i = 1, \ldots, n\}$ of points in the Euclidean plane. Then the following conditions are necessary and sufficient³ for the minimality of f outside the set N itself:
$$\sum_{i=1}^{n} \frac{x_i - x}{\sqrt{(x_i - x)^2 + (y_i - y)^2}} = 0 \qquad (1.6)$$
and
$$\sum_{i=1}^{n} \frac{y_i - y}{\sqrt{(x_i - x)^2 + (y_i - y)^2}} = 0, \qquad (1.7)$$
where $(x, y) \neq (x_i, y_i)$ for any $i = 1, \ldots, n$.
³ Note that $F_N$ is a convex function!
Let q be the Torricelli point for $v_1, \ldots, v_n$, and assume that q is not one of the given points. Defining the vectors $u_1, \ldots, u_n$ by
$$u_i = v_i - q, \qquad (1.8)$$
$i = 1, \ldots, n$, then the equations (1.6) and (1.7) can be written as the vector equality
$$\sum_{i=1}^{n} \frac{u_i}{\|u_i\|} = o, \qquad (1.9)$$
whereby o is the zero vector. The inner product square of this equation is
$$\sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\langle u_i, u_j \rangle}{\|u_i\| \cdot \|u_j\|} = 0. \qquad (1.10)$$
Here $\langle u_i, u_j \rangle / (\|u_i\| \cdot \|u_j\|)$ is the cosine of the angle between the segments from the points $v_i$ and $v_j$ to the Torricelli point q, respectively. Hence, since the n terms with $i = j$ each equal 1,
$$\sum_{1 \le i < j \le n} \frac{\langle u_i, u_j \rangle}{\|u_i\| \cdot \|u_j\|} = -\frac{n}{2}. \qquad (1.11)$$
On the other hand, from the vector equation (1.9) we find by inner product multiplication with $u_j / \|u_j\|$ the equation
$$1 + \sum_{i \neq j} \frac{\langle u_i, u_j \rangle}{\|u_i\| \cdot \|u_j\|} = 0 \qquad (1.12)$$
for any $j = 1, \ldots, n$. Then (1.11) and (1.12) form a system of $n + 1$ equations for $(n^2 - n)/2$ unknown variables, which can be solved uniquely for $n = 3$ and with one free parameter for $n = 4$. In geometric terms this says the following: The segments from three given points to the Torricelli point make angles of 120° with each other, provided that the given points form a triangle in which each angle is less than 120°. For four given points the sum of neighboring angles for the segments from the Torricelli point equals 180°, provided that the given points form a convex quadrilateral. For $n > 4$ the equations (1.6) and (1.7) cannot, in general, be solved explicitly for $(x, y)$, see Bajaj [26]. These facts are helpful in deciding whether or not a Torricelli point can be constructed with compass and ruler. (i) The compass construction: Given two points we can use the compass to draw a circle, centered at one of them and passing through the other. (ii) For any two different points, the ruler can be used to join them with a line segment, which can be extended as far as we like. Then we have
Theorem 1.1.2 Let N be a finite set of n points in the Euclidean plane.
(a) If $n = 2$ then any point in the segment created by the two given points is a Torricelli point for N.
(b) (Torricelli 1646, Cavalieri 1647) Let $n = 3$. If the convex hull of N forms a triangle in which each angle is less than 120°, then the Torricelli point for $N = \{v_1, v_2, v_3\}$ can be found with the following construction:
1. Find an equilateral triangle drawn along one side, for instance $v_1 v_2$, with the third node $v'$;
2. Construct the circle C circumscribing the equilateral triangle;
3. The Torricelli point is the point where the segment $\overline{v' v_3}$ intersects the circle C.
Otherwise, if one of the angles is at least 120°, one of the given points is the Torricelli point, namely the point in which this angle is present.
(c) (Fagnano 1775) Let $n = 4$. [...]
(d) Let $n > 4$. Then no general construction of the Torricelli point with compass and ruler exists.
Figure 1.1 Cavalieri's construction
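The three-point construction of 1.1.2(b) can be traced numerically. The following is a minimal sketch, not the book's own procedure: the function names `angle_cos` and `torricelli_point` are hypothetical, the three points are assumed to be distinct, and the equilateral triangle is erected on the side $v_1 v_2$ on the side of that line that does not contain $v_3$.

```python
import math

def angle_cos(p, a, b):
    """Cosine of the angle at vertex p between the rays towards a and b."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    return (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))

def torricelli_point(v1, v2, v3):
    """Torricelli point of three distinct points in the plane (illustrative sketch)."""
    # If some angle is at least 120 degrees (cosine <= -1/2), that vertex is optimal.
    for p, a, b in ((v1, v2, v3), (v2, v1, v3), (v3, v1, v2)):
        if angle_cos(p, a, b) <= -0.5:
            return p
    # Step 1: apex v' of the equilateral triangle on side v1 v2,
    # placed on the side of the line v1 v2 opposite to v3.
    dx, dy = v2[0] - v1[0], v2[1] - v1[1]
    cross = dx * (v3[1] - v1[1]) - dy * (v3[0] - v1[0])
    sign = 1.0 if cross > 0 else -1.0
    h = math.sqrt(3.0) / 2.0
    apex = ((v1[0] + v2[0]) / 2 + sign * h * dy,
            (v1[1] + v2[1]) / 2 - sign * h * dx)
    # Step 2: the circumcircle of an equilateral triangle is centred at its centroid.
    cx = (v1[0] + v2[0] + apex[0]) / 3
    cy = (v1[1] + v2[1] + apex[1]) / 3
    # Step 3: second intersection of the line from v' to v3 with the circle.
    # v' lies on the circle, so the quadratic in t has the roots 0 and t below.
    ex, ey = v3[0] - apex[0], v3[1] - apex[1]
    t = -2.0 * (ex * (apex[0] - cx) + ey * (apex[1] - cy)) / (ex * ex + ey * ey)
    return (apex[0] + t * ex, apex[1] + t * ey)

if __name__ == "__main__":
    print(torricelli_point((0.0, 0.0), (4.0, 0.0), (1.0, 3.0)))
```

For an equilateral triangle the result should coincide with the centroid, and in general the unit vectors from the returned point to the three given points should sum to the zero vector, in line with (1.9).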
1.1.2(d) says that Fermat's Problem in the Euclidean plane turns out to be highly intractable: It cannot be solved exactly under models of computation with the four basic arithmetic operations and taking square roots. This leaves us only numerical or symbolic approximation methods. It was Weiszfeld in 1937 who provided a practical method for finding the Torricelli point for a large number of given points. This method is an iteration procedure. In view of 1.1.1 it is clear that the following is true for a finite set N which contains at least three points and is not collinear⁴:
(a) Palermo [332]: The Torricelli point is uniquely determined.
(b) Kupitz et al. [271], [320]: If the point q is outside of N then the condition
$$\sum_{v \in N} \frac{v - q}{\|v - q\|} = o$$
is sufficient and necessary for q to be the Torricelli point.
(c) Kupitz et al. [271], [320]: If the Torricelli point q is in N then the condition
$$\Big\| \sum_{v \in N \setminus \{q\}} \frac{v - q}{\|v - q\|} \Big\| \le 1$$
holds true.
⁴ Let $N = \{v_1, \ldots, v_n\}$ be a collinear set of points appearing in this order on the line. If n is an odd number then $v_{(n+1)/2}$ is the Torricelli point. If n is an even number then any point on the segment $\overline{v_{n/2} v_{n/2+1}}$ is a Torricelli point.
In effect, the following algorithm attempts to solve the first order condition written in 1.1.1 iteratively. Weiszfeld asserted that such a sequence converges to the Torricelli point. This assertion has been discussed in Illgen [233], Krarup, Pruzan [268] and Kuhn [270].
Algorithm 1.1.3 (Weiszfeld) Let N be a finite set in the Euclidean plane. Then the following procedure finds a Torricelli point for N iteratively:
(a) If for a point $q \in N$ it holds that
$$\Big\| \sum_{v \in N \setminus \{q\}} \frac{v - q}{\|v - q\|} \Big\| \le 1$$
then q is an exact solution for Fermat's Problem;
(b) Otherwise
1. Choose an error estimate $\varepsilon$;
2. Choose $q^{(0)}$ in $\operatorname{conv} N$;
3. For $k = 0, 1, \ldots$ do
$$q^{(k+1)} = \frac{\sum_{v \in N} v / \|v - q^{(k)}\|}{\sum_{v \in N} 1 / \|v - q^{(k)}\|}$$
until $\|q^{(k+1)} - q^{(k)}\| < \varepsilon$.
Weiszfeld's algorithm is simple. However, its rates of convergence are not very attractive, since the convergence is slow in the vicinity of the given points.⁵ Xue, Wang [467] discuss this observation. A further disadvantage of the Weiszfeld procedure is that it fails if one of the iterated points $q^{(k)}$ falls on a given point; the reason for this is that the Fermat function $F_N$ is non-differentiable there. This problem can be avoided by replacing $F_N$ with a hyperbolic approximation. An example is the following: Define the distance function in the Fermat function by [...] where $\eta$ is a very small real number.
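The procedure of Algorithm 1.1.3 can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `unit_sum_norm` and `weiszfeld` are hypothetical, the given points are assumed to be distinct, the centroid is used as the starting point in conv N, and the stopping rule compares successive iterates with `eps`. The optional parameter `eta` replaces each distance by $\sqrt{\|v - q\|^2 + \eta^2}$, which is one common hyperbolic smoothing and is an assumption here; the formula intended in the text may differ.

```python
import math

def unit_sum_norm(points, q):
    """Norm of the sum of unit vectors pointing from q to every point other than q."""
    sx = sy = 0.0
    for v in points:
        d = math.dist(v, q)
        if d > 0.0:
            sx += (v[0] - q[0]) / d
            sy += (v[1] - q[1]) / d
    return math.hypot(sx, sy)

def weiszfeld(points, eps=1e-10, eta=0.0, max_iter=10000):
    """Approximate Torricelli point (median) of distinct points in the plane."""
    # Step (a): a given point q is optimal if the unit vectors towards
    # the remaining points sum to a vector of norm at most 1.
    for q in points:
        if unit_sum_norm(points, q) <= 1.0:
            return q
    # Step (b) 2: start at the centroid, which lies in the convex hull of N.
    n = len(points)
    q = (sum(v[0] for v in points) / n, sum(v[1] for v in points) / n)
    # Step (b) 3: fixed-point iteration with inverse-distance weights.
    for _ in range(max_iter):
        num_x = num_y = den = 0.0
        for v in points:
            d = math.sqrt((v[0] - q[0]) ** 2 + (v[1] - q[1]) ** 2 + eta * eta)
            if d == 0.0:
                # The unsmoothed iteration is undefined on a given point.
                raise ZeroDivisionError("iterate hit a given point; retry with eta > 0")
            num_x += v[0] / d
            num_y += v[1] / d
            den += 1.0 / d
        nxt = (num_x / den, num_y / den)
        if math.dist(nxt, q) < eps:
            return nxt
        q = nxt
    return q

if __name__ == "__main__":
    print(weiszfeld([(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]))
```

Checking the given points first mirrors step (a); without that check the iteration could approach a given point only slowly, which is the behaviour described above.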
⁵ Drezner et al. [135] give an example which gives the algorithm a very hard time.
In view of many contributions to the Fermat problem, its popularity through the ages, and its natural applications to various practical questions, it is hopeless to expect a complete list of the many facets of the problem. Moreover, location analysis, as the theory of the "generalized" Fermat problem, has attracted the attention of researchers from many academic disciplines including many applied fields. This tremendous interest in location modelling is the result of several factors. In the introduction to the first issue of the journal Location Science the editors wrote: First, location decisions are frequently made at all levels of human organization from individuals and households to firms, governments, and international agencies. Second, such decisions are often strategic in nature; that is, they involve significant capital resources and their economic effects are long term in nature. Third, they frequently impose economic externalities. Such externalities include economic development, as well as pollution and congestion. Fourth, location models are often extremely difficult to solve, at least optimally. Even some of the most basic models are computationally intractable for all but the smallest problem instances. In fact, the computational complexity of location models is a major reason that the widespread interest in formulating and implementing such models did not occur until the advent of high speed digital computers. Finally, location models are application specific. Their structural form, "the objectives, constraints and variables", is determined by the particular location problem under study. Consequently, there does not exist a general location model that is appropriate for all, or even most, applications.
It is well-known that solutions of Fermat's problems depend essentially on the way in which the distances in space are determined. Surveys in the form of monographs are given by
1. W. Domschke, A. Drexl: "Logistik: Standorte", 1982, [128].
2. R.F. Love, J.G. Morris, G.O. Wesolowsky: "Facilities Location", 1989, [292].
3. H.W. Hamacher: "Mathematische Lösungsverfahren für planare Standortprobleme", 1995, [206].
4. D. Cieslik: "Steiner Minimal Trees", 1998, [92].
5. V. Boltjanski, H. Martini, V. Soltan: "Geometric Methods and Optimization Problems", 1999, [48].
6. A. Schöbel: "Locating Lines and Hyperplanes", 1999, [384].
There are several collections of works on Fermat's Problem and its relatives: [33], [74], [78], [...], [134], [149], [151], [224], [234], [250], [272], [285], [345], [422], [455] and [467]. Let N be the set of given points. In applied mathematics the Fermat function
$$F_N(w) = \sum_{v \in N} \|v - w\|$$
is usually called the median function, and a Torricelli point is called a median of N. Also of practical interest is the so-called center function
$$G_N(w) = \max_{v \in N} \|v - w\|, \qquad (1.17)$$
which is to be minimized. A solution point is called a center of N. Of course, this is a completely different question, and has other solution strategies. For us it will be only necessary to collect several observations.
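That the median and the center really are different questions can be seen on a tiny example. The sketch below is illustrative only; the function names `median_function` and `center_function` and the sample set are assumptions chosen for this note. It evaluates both functions for three collinear points, where the median is the middle point while the center is the midpoint of the two extreme points.

```python
import math

def median_function(points, w):
    """F_N(w): sum of the distances from w to all given points."""
    return sum(math.dist(v, w) for v in points)

def center_function(points, w):
    """G_N(w): largest distance from w to a given point."""
    return max(math.dist(v, w) for v in points)

# Hypothetical sample: three collinear points on the x-axis.
N = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
middle_point = (1.0, 0.0)      # median of N (odd number of collinear points)
midpoint_of_ends = (2.5, 0.0)  # center of N

if __name__ == "__main__":
    for w in (middle_point, midpoint_of_ends):
        print(w, median_function(N, w), center_function(N, w))
```

Moving from the middle point to the midpoint of the extremes lowers the center function but raises the median function, so neither point solves both problems.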
Observation 1.1.4 Let N be a finite set of given points. Let $F_N$ and $G_N$ be the median and center function for N, respectively. Then
$$G_N(w) \le F_N(w) \le |N| \cdot G_N(w)$$
holds for each point w. The search of a center can be described in the sense of covering: We consider balls in the plane defined by $B_r(w) = \{x : \|x - w\| \le r\}$, where $r \ge 0$ is a [...]
[...]ely solvable. There are infinitely many points in the plane, and even though most of them are probably irrelevant, it is not obvious that any algorithm exists. Then Melzak [305] established many basic properties of an SMT: Without loss of generality, the following is true for any SMT T for a finite set N of points in the Euclidean plane:
(i) The degree of each vertex is at most three.
(ii) The degree of each Steiner point equals three.
(iii) Any Steiner point is the Torricelli point of its neighbors; and two edges incident to a Steiner point meet at an angle of 120°. Consequently, a Steiner point is uniquely located in relation to its neighbors.
(iv) There are at most $|N| - 2$ Steiner points; equality holds if and only if the vertices from N are the leaves of T and the Steiner points are of degree three.
(v) An SMT has at most $2|N| - 3$ edges; equality holds if and only if the vertices from N are the leaves of T and the Steiner points are of degree three.
(vi) When there is a Steiner point in the tree T, then at least one of these points has two given points as neighbours.
(vii) The SMT is an MST for the set $N \cup Q$, where Q is the set of Steiner points of T.
As a consequence of all these statements it is sufficient to develop solution methods only for specific kinds of trees: Let $T = (V, E)$ be a tree for $N = \{v_1, \ldots, v_n\}$, $n \ge 2$, in which every vertex of N is a leaf and every vertex of $V \setminus N$ has degree three. Such a tree will be called a full tree.
Second, Melzak gave a finite solution method to Steiner's Problem, using a set of Euclidean (that is ruler and compass) constructions. The central idea is given in the Torricelli construction given in the chapter before: In the three-point problem, a replacement point can be substituted for two of the given points without changing the length of the tree. In the general version of the problem the algorithm must guess which pair is to be replaced, which could potentially involve trying all possible guesses. After one pair of points in the subset has been replaced by a single point, each subsequent step of the algorithm replaces either two given points, a given point and a replacement point or two replacement points with another replacement point until the subset is reduced to three points.⁷ Once the Steiner point for those three points has been found, the algorithm works backwards, attempting to determine the Steiner point corresponding to each replacement point. An attempt can fail because of contradictory constraints on the placement of Steiner points. Now we give a complete list of the instructions of this method:
⁷ Surprisingly, the Melzak algorithm cannot be extended to higher-dimensional Euclidean spaces, not even to spaces of dimension three. The reason is that for two given points there are an infinite number of replacement points.
Algorithm 2.1.1 (Melzak [305]) Let $T = (V, E)$ be a full tree for the finite set N of points. Then do
1. $V' := V$; $N' := N$;
2. (Reduction stage) $Q := V' \setminus N'$; if Q is empty then goto 4.;
3. Let q be in Q such that q is adjacent to $v_1$ and $v_2$ in $N'$; Delete $v_1$, $v_2$ and q; Add a substitution point $v_{12}$ that forms an equilateral triangle with $v_1$ and $v_2$; $V' := V' \cup \{v_{12}\} \setminus \{v_1, v_2, q\}$; $N' := N' \cup \{v_{12}\} \setminus \{v_1, v_2\}$; goto 2.;
4. (Recovery stage) Connect the last two points of $N'$ by an edge;
5. Reverse the order of the reduction steps and bring back each pair $v_1$ and $v_2$ at each recovery step;
6. Let C be the circle circumscribing $v_1$, $v_2$ and $v_{12}$; If the arc $v_1 v_2$ of C intersects the edge incident to $v_{12}$ at the point $v'$, then $v'$ is the Steiner point joining $v_1$ and $v_2$; in this case connect these points and discard $v_{12}$; goto 5.
The proof of correctness is to apply the construction of 1.1.2(b): Let q be a Steiner point adjacent to the given points $v_1$ and $v_2$. $v_1$, $v_2$ and $v_{12}$ form an equilateral triangle. Since the Steiner point q is the Torricelli point for $v_1$, $v_2$ and $v_3$ it makes angles of 120° with the edges to each of them. If a quadrilateral is inscribed in a circle, the sum of opposite angles equals 180°. Thus the Steiner point q is necessarily located on the circle circumscribing $v_1$, $v_2$ and $v_{12}$. The theorem of Ptolemaeus says
$$\|v_1 - v_2\| \cdot \|v_{12} - q\| = \|v_1 - v_{12}\| \cdot \|v_2 - q\| + \|v_2 - v_{12}\| \cdot \|v_1 - q\|,$$
and consequently, since the three sides of the equilateral triangle have equal length,
$$\|v_2 - q\| + \|v_1 - q\| = \|v_{12} - q\|.$$
This means that
$$\|v_1 - q\| + \|v_2 - q\| + \|v_3 - q\| = \|v_{12} - q\| + \|v_3 - q\|$$
achieves a minimal value if and only if $q \in \overline{v_{12} v_3}$.
It is obvious that using Melzak's algorithm to find an SMT, although effective, is extremely redundant and inefficient; more exactly, it takes exponential time. There are two causes of the exponential running time: The main reason is the large number of trees which are to be considered. Secondly, each step chooses one of two possible substitution points because there are two equilateral triangles for a given side. Since the correctness of the choice cannot be seen until the tree has been constructed or demonstrated to be impossible, backtracking is necessary. Hence, we require $O(2^k)$ time, where k is the number of Steiner points in the given tree. Hwang [229] has described an implementation of Melzak's construction which eliminates the second cause of exponential behavior. In general, to determine an SMT for a given finite set of points we have to consider many different trees, and compare their lengths in order to single out the shortest ones. Unfortunately, this needs an astronomical number of computational steps. Although exponential-time algorithms have been found for Steiner's Problem, no polynomial-time algorithms have yet been found and the prospects for such an algorithm are not good.
2.2 EXAMPLES AND EXERCISES
For an introduction to Steiner's Problem it is helpful to investigate several specific cases to explore the difficulties and surprising twists of the problem.
I. Show that the degree of each vertex is at most three; and hence, that the degree of each Steiner point equals three. It is helpful to observe that any Steiner point is the Torricelli point of its neighbors. Moreover, we then have that two edges incident to a Steiner point meet at an angle of 120°.
II. Not every locally minimal tree, however, is a solution of minimal length overall, that is, an SMT. Large-scale rearrangements of the Steiner points may be necessary to transform a network into a shortest possible tree, which is a globally minimal tree. To see this we investigate the following example: Consider the four corners of a rectangle in the Euclidean plane measuring three units by four units. An MST for these points has length 10. There are two locally minimal trees with two Steiner points. Each arrangement forms a tree that has three edges connected to each Steiner point at 120°. If the Steiner points are arranged parallel to the width, the locally minimal tree that results is 9.928... units long. If the Steiner points are arranged parallel to the length, a locally minimal tree results with a length of 9.196.... Consequently, only in the last case do we have an SMT. Ollerenshaw (compare [147]) proved that if two full trees exist for the four points, the one with the longer edge between the two Steiner points is the shorter tree, i.e. the SMT. Moreover, this consideration shows that a solution of Steiner's Problem is not always uniquely determined: For four points forming a square, we have two equivalent (equal length) solutions.
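The two lengths quoted for the rectangle can be checked by placing the Steiner points explicitly. The sketch below is an illustration under stated assumptions: the function name `full_tree_length` is hypothetical, the rectangle is axis-parallel with one corner at the origin, and the positions of the two Steiner points follow from the 120° condition (each lies at distance width/(2√3) from the nearer side of length `width`).

```python
import math

def full_tree_length(length, width):
    """Length of the full Steiner tree on a length x width rectangle whose
    edge between the two Steiner points is parallel to the side `length`."""
    offset = width / (2.0 * math.sqrt(3.0))   # forces 120 degree angles
    s_left = (offset, width / 2.0)
    s_right = (length - offset, width / 2.0)
    if s_left[0] > s_right[0]:
        raise ValueError("no full tree of this shape exists for these side lengths")
    left_corners = [(0.0, 0.0), (0.0, width)]
    right_corners = [(length, 0.0), (length, width)]
    total = math.dist(s_left, s_right)
    total += sum(math.dist(c, s_left) for c in left_corners)
    total += sum(math.dist(c, s_right) for c in right_corners)
    return total

if __name__ == "__main__":
    # Steiner points arranged parallel to the length (4) and to the width (3).
    print(round(full_tree_length(4.0, 3.0), 3))
    print(round(full_tree_length(3.0, 4.0), 3))
    # MST of the four corners: two short sides and one long side.
    print(2 * 3.0 + 4.0)
```

In closed form the two lengths are $4 + 3\sqrt{3}$ and $3 + 4\sqrt{3}$, so the tree with the longer edge between the Steiner points is the shorter one, in agreement with the remark attributed to Ollerenshaw.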
III. Let N be the set of nodes of a regular n-gon in the plane, $n = |N| \ge 3$. Find an SMT for N. For $n = 3$ we seek a Torricelli point. For $n = 4$ the example above will be helpful, where, roughly spoken, the "Double Y" is shorter than the "X". It is not simple to see (compare [141]) that for $n \ge 6$ there is no Steiner point in the tree, meaning the SMT is an MST with length equal to $(n - 1) \cdot l$, where l is the length of a side. Jarník and Kössler proved this result in 1934 for $n \ge 13$. It was another fifty years until the proof by Du et al., compare [314].
IV. A set $N = \{(i, 0), (i, 1) : i = 0, \ldots, n - 1\}$ is called a ladder. Chung and Graham [84] examined ladders and determined the length of SMT's for these sets. In particular, they demonstrated that there are arbitrarily large sets of points for which the SMT cannot be separated, that is, cannot be divided into full trees. Burkard et al. [60] describe a method that always finds a solution for Steiner's Problem for ladders of the kind $N = \{(i \cdot b, 0), (i \cdot b, 1) : i = 0, \ldots, n - 1\}$, where $b \le 1$. The subject becomes more difficult if we consider grids of arbitrary dimension. A nice representation of this question has been given in [175] and [176].

V. Suppose we wish to find a network that will connect a set of given points. One way to do this is to use an MST, which uses only edges joining pairs of the given points. We saw that such a network is easy to find. Another is to use an SMT. Obviously, the length of the SMT is less than or equal to the length of the MST. How much shorter can it get? Consider three points which form the corners of an equilateral triangle of unit side length. An MST for these points has length 2. An SMT uses one Steiner point, which is uniquely determined by the condition that the three angles at this point are equal, and consequently equal 120°. With the help of a simple calculation, using the cosine law, we find the length of the SMT to be $3 \cdot \sqrt{1/3} = \sqrt{3}$. So the ratio between the lengths of the two networks is $\sqrt{3}/2 = 0.866025\ldots$. Is there a finite set of points for which the ratio is smaller?

Figure 2.2: Two locally minimal trees
VI. Related to Steiner's Problem, we will require that the minimal network has at most $k$ Steiner points, where $k \ge 0$ is a predetermined integer independent of the number of given points. Such a network must be a tree also, and is called a $k$-SMT. This problem was introduced independently by C. [87] in 1982 and Georgakopoulos and Papadimitriou [183] in 1987. The combinatorial structures of $k$-SMT's and SMT's are quite different. In particular, in contrast to I., we find Steiner points of degree 4 in $k$-SMT's.⁸
It is a difficult task to discuss all these examples in spaces other than the Euclidean plane.
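The numerical values quoted in examples II and V can be checked directly. The following is a minimal Python sketch, not taken from the text: it places the two Steiner points of the rectangle on the mid-line at the offset that produces 120° angles and sums the Euclidean edge lengths. The function names `dist` and `double_y_length` are illustrative assumptions.

```python
import math

def dist(p, q):
    """Euclidean distance between two points of the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def double_y_length(a, b):
    """Length of the full tree on the corners of an a-by-b rectangle whose
    two Steiner points lie on the mid-line parallel to the side of length a."""
    x = b / (2 * math.sqrt(3))            # offset giving 120-degree angles
    s1, s2 = (x, b / 2), (a - x, b / 2)   # the two Steiner points
    left = [(0, 0), (0, b)]
    right = [(a, 0), (a, b)]
    return (sum(dist(s1, c) for c in left)
            + sum(dist(s2, c) for c in right)
            + dist(s1, s2))

# Example II: rectangle measuring three units by four units.
mst_rectangle = 3 + 3 + 4
parallel_to_width = double_y_length(3, 4)    # about 9.928
parallel_to_length = double_y_length(4, 3)   # about 9.196

# Example V: equilateral triangle of unit side length.
triangle = [(0, 0), (1, 0), (0.5, math.sqrt(3) / 2)]
torricelli = (0.5, math.sqrt(3) / 6)         # centroid of the triangle
smt_triangle = sum(dist(torricelli, c) for c in triangle)
ratio = smt_triangle / 2                     # SMT length over MST length

print(mst_rectangle, round(parallel_to_width, 3), round(parallel_to_length, 3))
print(round(ratio, 6))
```

The sketch only evaluates the two given configurations; it does not search for Steiner points.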
2.3
REFERENCES
Steiner's Problem is one of the most famous combinatorial-geometrical problems. It is the core of the so-called Geometric Network Design, but has itself two origins: Fermat's Problem and the Minimum Spanning Tree problem. Consequently, in the last three decades the investigations into and, naturally, the publications about Steiner's Problem have increased rapidly. The articles that have been written on Steiner's Problem and its relatives are nearly countless. The first survey of Steiner's Problem in the Euclidean plane was presented by Gilbert and Pollak in 1968 [186]; they christened the terms "Steiner Minimal Tree" for the shortest interconnecting network and "Steiner points" for the additional vertices.

It is well-known that solutions of network design problems depend essentially on the way in which the distances in space are determined. Clearly, this is true for Steiner's Problem. Consequently, there are many metric spaces⁹ to be considered. Surveys in form of monographs are given by

1. S. Voß: "Steiner-Probleme in Graphen", 1990, [439].
2. F.K. Hwang, D.S. Richards, P. Winter: "The Steiner Tree Problem", 1992, [231].
3. A.O. Ivanov, A.A. Tuzhilin: "Minimal Networks - The Steiner Problem and Its Generalizations", 1994, [238].
4. D. Cieslik: "Steiner Minimal Trees", 1998, [92].
5. A.O. Ivanov, A.A. Tuzhilin: "Branching Solutions of One-Dimensional Variational Problems", 2000, [235].
6. D. Cieslik: "The Steiner Ratio", 2001, [99].
7. H.J. Prömel, A. Steger: "The Steiner Tree Problem", 2002, [355].

Surveys in journals are given by Harris [212], Hwang and Richards [230], and Winter [464]. There are several collections about Steiner's Problem and its relatives: [79], [143], [239], [333], [441] and [435]. A nice representation of the complete subject has been given in [44], [43], [108], [175], [176], [219], [234], [389] and [422]. In this sense it is strange that people "discover" Steiner's Problem again and again, and prove "facts" which have already been proven a dozen times.¹⁰

⁸ But not Steiner points of higher degree, see [89] and [369].
⁹ See the next section.
2.4
A FIRST ANALYSIS OF STEINER'S PROBLEM
We start with a general analysis of Steiner's Problem in arbitrary metric spaces. We describe several basic facts about the combinatorial and geometrical structure of SMT's. Later we will discuss more detailed facts that arise if we restrict ourselves to specific cases.
2.4.1
Metric spaces
Distance is the mathematical description of the idea of proximity, and consequently, we may assume (and it is not hard to see) that a solution of Steiner's Problem depends essentially on the way in which a distance in the space is determined. The following term was introduced by Fréchet in 1906: A pair $(X, p)$ is called a metric space if $X$ is a nonempty set of elements called the points, and $p : X \times X \to \mathbb{R}$ is a real-valued function satisfying:

(i) $p(x, y) \ge 0$ for all $x, y$ in $X$;
(ii) $p(x, y) = 0$ if and only if $x = y$;
(iii) $p(x, y) = p(y, x)$ for all $x, y$ in $X$; and
(iv) $p(x, y) \le p(x, z) + p(z, y)$ for all $x, y, z$ in $X$ (triangle inequality).

¹⁰ One of these discoveries is the fact that the degree of a Steiner point in an SMT in Euclidean spaces of arbitrary dimension equals 3.
Usually, such a function $p$ is called a metric.¹¹ We will say that the quantity $p(x, y)$ is the distance between the points $x$ and $y$. If $p$ satisfies (ii) only in the weaker form (ii') $p(x, x) = 0$ for all $x$ in $X$, we say that $p$ is a pseudometric. If the function $p$ satisfies the conditions (i), (ii') and (iii) it is called a dissimilarity.¹² A metric, pseudometric or dissimilarity $p$ on a finite set $X$ of $n$ points can be specified by an $n \times n$ matrix of (nonnegative) real numbers. (Actually $\binom{n}{2}$ numbers suffice because of (ii') and (iii).)
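Since a metric, pseudometric or dissimilarity on a finite set is just a matrix, the axioms can be tested mechanically. The following minimal Python sketch (the function name `classify` and the tolerance `tol` are assumptions made for illustration) returns the most specific of the three notions that a given matrix satisfies.

```python
from itertools import product

def classify(D, tol=1e-12):
    """Classify a square matrix D of pairwise values as 'metric',
    'pseudometric', 'dissimilarity' or 'none', using axioms (i)-(iv)."""
    idx = range(len(D))
    nonneg = all(D[i][j] >= 0 for i, j in product(idx, idx))          # (i)
    zero_diag = all(abs(D[i][i]) <= tol for i in idx)                 # (ii')
    symmetric = all(abs(D[i][j] - D[j][i]) <= tol
                    for i, j in product(idx, idx))                    # (iii)
    if not (nonneg and zero_diag and symmetric):
        return "none"
    triangle = all(D[i][j] <= D[i][k] + D[k][j] + tol
                   for i, j, k in product(idx, idx, idx))             # (iv)
    if not triangle:
        return "dissimilarity"
    # (ii): distinct points must have positive distance.
    separated = all(D[i][j] > tol
                    for i, j in product(idx, idx) if i != j)
    return "metric" if separated else "pseudometric"

print(classify([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))   # metric
print(classify([[0, 0, 1], [0, 0, 1], [1, 1, 0]]))   # pseudometric
print(classify([[0, 1, 5], [1, 0, 1], [5, 1, 0]]))   # dissimilarity
```

The check of (iv) is cubic in the number of points, which is adequate for small examples.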
Let $(X, p)$ be a metric space. If $X' \subseteq X$, then the restriction $p'$ of the metric $p$ on $X' \times X'$ is a metric on $X'$. In what follows we regard $(X', p')$ as a metric space and call it a subspace of $(X, p)$.
A graph $G = (V, E)$ is embedded in $(X, p)$ such that

(i) $V$ is a (finite) subset of $X$;
(ii) $E$ is a set of unordered pairs $\overline{vv'}$ of points $v$ and $v'$ in $V$;
(iii) the metric $p$ induces a length function for the graph, so that for each edge $\overline{vv'}$ a length is given by $p(v, v')$;
(iv) we define the length of the graph $G$ in $(X, p)$ as the total length of $G$:

$$L(X, p)(G) = \sum_{\overline{vv'} \in E} p(v, v'). \qquad (2.2)$$

¹¹ The axioms are not independent: (i) is a consequence of (iv). On the other hand,

Observation 2.4.1 A metric $p$ can be defined equivalently by
(ii) $p(x, y) = 0$ if and only if $x = y$; and
(iv') $p(x, y) \le p(x, z) + p(y, z)$ for all $x, y, z$ in $X$.

¹² We will give the reason for this name later. There are various measures of dissimilarity, and not all of them yield a metric, but many do.
In general we will consider graphs and their embedding in a metric space at the same time. In each case it will be easy to see whether we use combinatorial or metric/geometric facts. Steiner's Problem is the "Problem of Shortest Connectivity". Since the demand of shortness forces the network to be cycle free it is only necessary to consider trees:

Observation 2.4.2 Steiner's Problem is only interested in trees.

Let $N$ be a finite set of points in the metric space $(X, p)$. For a given natural number $k$ and for $k$ points $v_1, \ldots, v_k \in X \setminus N$, let $T(k, v_1, \ldots, v_k)$ be a spanning tree of minimal length in the complete graph with the set $N \cup \{v_1, \ldots, v_k\}$ of vertices, where the length of the graph is induced by the metric $p$ as defined in (2.2).¹³ If there is both a number $k'$ and points $w_1, \ldots, w_{k'}$ such that the value

$$L(X, p)(T(k', w_1, \ldots, w_{k'}))$$

is minimal among all candidates $T(k, v_1, \ldots, v_k)$, then we call $T(k', w_1, \ldots, w_{k'})$ a Steiner Minimal Tree (SMT) for $N$, and the points $w_1, \ldots, w_{k'}$ are called Steiner points. That means, an SMT for $N$ is a minimum spanning tree on $N \cup Q$, where $Q$ is a set of additional vertices inserted into the metric space in order to achieve a minimal solution.

It is not true that there is an SMT for any given finite set in each metric space, but for all spaces considered in this book any given finite set has an SMT; this implies that the set $Q$ of additional vertices is a finite set as well.¹⁴ In the remainder of this section, we will discuss which properties an SMT possesses, under the assumption that an SMT exists.

¹³ Recall that a minimum spanning tree can be found easily.
¹⁴ Examples for spaces in which there does not always exist an SMT are given in the next chapter.
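The definition can be turned into an exhaustive search when the space offers only finitely many places for Steiner points. The following minimal Python sketch assumes such a finite candidate list is given (in general the text does not prespecify one) and minimises the MST length of $N \cup Q$ over all subsets $Q$ of the candidates. The names `mst_length`, `brute_force_smt` and `manhattan` are illustrative, and the running time is exponential in the number of candidates.

```python
from itertools import combinations

def mst_length(points, dist):
    """Length of a minimum spanning tree on `points` (Prim's algorithm)."""
    points = list(points)
    if len(points) <= 1:
        return 0
    # best[q] = cheapest connection of q to the tree built so far
    best = {q: dist(points[0], q) for q in points[1:]}
    total = 0
    while best:
        q = min(best, key=best.get)
        total += best.pop(q)
        for r in best:
            best[r] = min(best[r], dist(q, r))
    return total

def brute_force_smt(terminals, candidates, dist):
    """Minimise the MST length of N united with Q over all subsets Q of
    `candidates`; returns (length, chosen Steiner points)."""
    best_len, best_q = mst_length(terminals, dist), ()
    for k in range(1, len(candidates) + 1):
        for q in combinations(candidates, k):
            length = mst_length(list(terminals) + list(q), dist)
            if length < best_len:
                best_len, best_q = length, q
    return best_len, best_q

def manhattan(a, b):
    """Rectilinear distance, used here only as an example metric."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

terminals = [(0, 1), (2, 1), (1, 0)]
candidates = [(0, 0), (1, 1), (2, 0)]
print(mst_length(terminals, manhattan))                  # 4
print(brute_force_smt(terminals, candidates, manhattan)) # (3, ((1, 1),))
```

In this example one additional point shortens the network from 4 to 3, while the other candidates do not help.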
Observation 2.4.3 Let $(X, p)$ be a metric space and let $N$ be a finite set of points in $X$. Without loss of generality, the following is true for any SMT $T = (V, E)$ for $N$:

(a) $g_T(v) \ge 1$ for each vertex $v$ in $V$;
(b) $g_T(v) \ge 3$ for each Steiner point $v$ in $V$.
Proof. (a) is an obvious fact, since $T$ is a tree which connects all vertices. It is impossible for a Steiner point $v$ to have degree one, since the edge $\overline{vv'}$ which joins $v$ with the remaining tree has a positive length, which contradicts the minimality requirement. The triangle inequality of the metric $p$ implies (b) in the following way: Let $v$ be a Steiner point of degree two, with neighbors $w$ and $w'$. Then we may replace the two edges $\overline{wv}$ and $\overline{vw'}$ by the edge $\overline{ww'}$. Because

$$p(w, w') \le p(w, v) + p(v, w'), \qquad (2.4)$$

the new tree is not longer than the old.

Moreover, a Steiner point $v$ in an SMT $T$ can be of degree two. Then

$$p(w, v) + p(v, w') = p(w, w') \qquad (2.5)$$

holds for $\{w, w'\} = N_T(v)$.¹⁵ Now, we will prove that the number of Steiner points cannot increase arbitrarily:
Observation 2.4.4 It is sufficient to consider only finite trees as candidates for an SMT.
Proof. Let $T = (V, E)$ be a tree interconnecting a finite set $N = \{v_1, \ldots, v_n\}$ of points. The number of vertices in $T$ is bounded, more precisely: If $|V| \ge 2n^2$ then there exists a tree interconnecting $N$ which is a proper subtree of $T$ and consequently has a shorter length. To show this we distinguish between two cases:

Case 1: For any two points $v$ and $v'$ in $N$ the path $T(v, \ldots, v')$ contains at most $2n$ vertices. Then we define the graph $G$ by

$$G = \bigcup_{i=1}^{n-1} T(v_i, \ldots, v_{i+1}).$$

The graph $G$ interconnects all points of $N$ by edges of $T$ and contains at most $2n(n - 1) = 2n^2 - 2n < 2n^2 \le |V|$ vertices. Hence, a spanning tree of $G$ is a proper subtree of $T$ and must be shorter.

Case 2: There are two points $v$ and $v'$ in $N$ such that the path $T(v, \ldots, v')$ has more than $2n$ vertices. Then $T(v, \ldots, v')$ contains at least $n + 1$ Steiner points, each of which is of degree at least three, see 2.4.3. If $T(v, \ldots, v')$ is removed from $T$ we get the graph $G$. We observe that $G$ is a forest with at least $n + 1$ connected components. Hence, at least one component does not contain a point of $N$. If we remove this component in the tree $T$ we get a shorter tree.

¹⁵ This observation will be helpful in several investigations. In some proofs we will use Steiner points of degree two.
We can determine a sharp upper bound for the number of Steiner points explicitly:

Observation 2.4.5 Let $(X, p)$ be a metric space and let $N$ be a finite set of points in $X$. Without loss of generality,

$$|V \setminus N| \le |N| - 2,$$

hence

$$|V| \le 2 \cdot |N| - 2 \qquad (2.7)$$

and

$$|E| \le 2 \cdot |N| - 3 \qquad (2.8)$$

is true for any SMT $T = (V, E)$ for $N$. Equality holds if and only if the vertices from $N$ are the leaves of $T$ and the Steiner points are of degree three.

Proof. In 2.4.4 we found that it is sufficient to consider finite trees. Hence, the first assertion is a consequence of

$$2 \cdot (|V| - 1) = \sum_{v \in V} g_T(v) \ge |N| + 3 \cdot |V \setminus N|,$$

where the inequality uses 2.4.3. The number of edges in a tree is one less than the number of vertices. Consequently, the third inequality must hold. The discussion of equality follows immediately from 1.2.5.
Another observation for trees with Steiner points will frequently be helpful:

Observation 2.4.6 Let $T = (V, E)$ be an SMT for $N$. If $V \setminus N$ is nonempty then it contains at least one Steiner point adjacent to two given points.

Proof. Assume that each Steiner point is adjacent to at most one vertex in $N$. The set $V' = V \setminus N$ induces in $T$ a subgraph $G' = (V', E')$. Since each Steiner point has degree at least three in $T$, it has degree at least two in $G'$. It follows from 1.2.1 that

$$2 \cdot |E'| = \sum_{v \in V'} g_{G'}(v) \ge 2 \cdot |V'|.$$

This contradicts the fact that the forest $G'$ has at most $|V'| - 1$ edges, compare 1.2.6.
An SMT is a finite tree. The number of such trees for a finite set of given points (vertices) must also be finite. In other words,

Observation 2.4.7 It is sufficient to consider only a finite number of trees as candidates for an SMT.

It will be helpful to associate a matrix to a graph: Let $G = (V, E)$ be a graph and assume that the vertices are labelled, i.e. $V = \{v_1, \ldots, v_n\}$. Then we define the adjacency matrix $A(G) = (a_{ij})_{i,j = 1, \ldots, n}$ with

$$a_{ij} = \begin{cases} 1 & \text{if the vertices } v_i \text{ and } v_j \text{ are adjacent,} \\ 0 & \text{otherwise.} \end{cases}$$

These matrices contain the complete information about the structure of graphs. Consequently, many matrix calculations have a meaning in the sense of graph theory.¹⁶ The adjacency matrix of the graph $G$ does depend on the labelling of the vertices of $G$; that is, a different labelling of the vertices may result in a different matrix, but they are closely related in that one can be obtained from the other simply by interchanging rows and columns. A matrix whose entries are only 0 or 1 is called a binary or Boolean matrix. Using adjacency matrices we can describe the length of a graph $G$ by

$$L(X, p)(G) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \cdot p(v_i, v_j). \qquad (2.9)$$

In other words,
Observation 2.4.8 For a given topology of a tree its length in a metric space is a linear function of the metric.

Steiner point locations in the space are not prespecified from a candidate list of point locations, but we may assume that the set of Steiner points is contained in a suitably bounded subset of the space. Here, a set $W$ of points in a metric space $(X, p)$ is called bounded if

$$\sup\{p(x, y) : x, y \in W\} < \infty.$$

Equivalently, we consider balls in the space defined by

$$B_r(z) = \{x \in X : p(x, z) \le r\},$$

where $r \ge 0$ is a real and $z$ is a point of the space. Then it is easy to see that the set $W$ is bounded if and only if there exists a nonnegative real $r$ and a point $z$ such that

$$W \subseteq B_r(z). \qquad (2.12)$$

Observation 2.4.9 Let $N$ be a finite set of points in a metric space $(X, p)$. Then we may assume that the set $V \setminus N$ of Steiner points of an SMT $T = (V, E)$ for $N$ is contained in a bounded subset of $X$:

$$V \setminus N \subseteq B_r(u),$$

where $u$ is an arbitrary point in $N$ and $r = L(X, p)(\text{MST for } N)$.

¹⁶ We will discuss this further in the next section.
Note that it is not simple to describe a small set containing all Steiner points. Such a set is usually called a Steiner hull of $N$. A known Steiner hull allows confinement of the construction of the tree within a given set. Hence, the smaller a Steiner hull is, the better it is. On the other hand, if the Steiner points in $Q$ have been localized, an SMT for $N$ is simple to find, since

Observation 2.4.10 Let $N$ be a finite set of points in a metric space. Then an SMT $T = (V, E)$ for $N$ is an MST for $V$.

Comparing all these facts, the search for an SMT for a finite set of points in a metric space forces investigations of two specific questions: How many Steiner points are used in an SMT? Where are these Steiner points located in the space? Unfortunately, these questions cannot be solved independently from the construction of the shortest tree itself. For a complete discussion of these difficulties see [92], [230], [231] and [464], or the next chapter.

What are the spaces for which an SMT always exists? Such a tree necessarily exists if the bounded subset which contains the Steiner points is compact.¹⁷ In this case we must consider, for each tree of a finite number of trees, the value of the function (2.9). More precisely:

¹⁷ Actually in several interesting cases it will be finite.
I. Considering "continuous" spaces, it is sufficient for an SMT to exist if the metric space has the following four properties:

(i) $(X, p)$ is complete;
(ii) $(X, p)$ is finitely compact, i.e. each bounded and closed subset is compact;
(iii) Each pair of points in $(X, p)$ can be connected with a geodesic curve, i.e. a curve of shortest length;¹⁸
(iv) For all points $x, x'$ in $(X, p)$, the distance $p(x, x')$ is equal to the length of a geodesic curve joining $x$ and $x'$.

The following classes of metric spaces satisfy the four properties and thus in each case we establish the existence of an SMT for a finite set $N$ of points with the help of a compactness argument:

(a) Euclidean spaces are classical examples for Steiner's Problem.

(b) Finite-dimensional Banach spaces. Since these spaces play an important role in both theoretical questions and in applications we will describe them more extensively. In his book Geometrie der Zahlen [310], published in 1896, Minkowski proved a number of results by geometrical arguments, using the idea of normed spaces which is based on the assumption that to each vector can be assigned its "length" or norm satisfying some "natural" conditions. A convex and compact body $B$ of the $d$-dimensional affine space $A_d$ centered in the origin $o$ is called a unit ball, and induces a norm $\|\cdot\| = \|\cdot\|_B$ in the corresponding $d$-dimensional linear space $A_d$ according to the so-called Minkowski functional:

$$\|v\|_B = \inf\{t > 0 : v \in tB\} \text{ for any } v \text{ in } A_d \setminus \{o\}, \text{ and } \|o\|_B = 0.$$

On the other hand, let $\|\cdot\|$ be a norm in $A_d$, which means that $\|\cdot\| : A_d \to \mathbb{R}$ is a real-valued function satisfying

(i) positivity: $\|v\| \ge 0$ for any $v$ in $A_d$;
(ii) identity: $\|v\| = 0$ if and only if $v = o$;
(iii) homogeneity: $\|tv\| = |t| \cdot \|v\|$ for any $v$ in $A_d$ and any real $t$; and
(iv) triangle inequality: $\|v + v'\| \le \|v\| + \|v'\|$ for any $v, v'$ in $A_d$.

Then $B = \{v \in A_d : \|v\| \le 1\}$ is a unit ball in the above sense. It is not hard to see that the correspondence between unit balls $B$ and norms $\|\cdot\|$ is unique, that is, a norm is completely determined by its unit ball and vice versa. Consequently, such a space is uniquely defined by an affine space $A_d$ and a unit ball $B$. It is called a Banach-Minkowski space, and is abbreviated as $M_d(B)$. A Banach-Minkowski space $M_d(B)$ is a complete metric linear space if we define the metric by

$$p(v, v') = \|v - v'\|_B. \qquad (2.14)$$

Usually, a (finite- or infinite-dimensional) linear space which is complete with regard to its given norm is called a Banach space. Essentially, every Banach-Minkowski space is a finite-dimensional Banach space and vice versa.¹⁹

¹⁸ This is the specific form of Steiner's Problem for two given points.
Observation 2.4.11 Segments in a Banach-Minkowski space are shortest curves (in the sense of inner geometry). They are the unique shortest curves if and only if the unit ball is strictly convex.²⁰

Roughly speaking, the observation that a straight line is the shortest distance between two points is Steiner's Problem for a set of two points. In particular, we consider finite-dimensional spaces with $p$-norm, defined in the following way: For $v = (x_1, \ldots, x_d)$ we define the norm by

$$\|v\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p},$$

where $1 \le p < \infty$ is a real number. If $p$ runs to infinity then we get the so-called Maximum norm

$$\|v\|_\infty = \max\{|x_i| : 1 \le i \le d\}.$$

In each case we obtain a Banach-Minkowski space written by $\mathcal{L}_p^d$.

(c) Compact manifolds.

For more facts about metric/geometric properties of several continuous spaces compare [262], [281], [297], [364], [381], [411] and [426].

¹⁹ An infinite-dimensional Banach space is often called a Banach-Wiener space, compare [460]. The structure of such spaces is intrinsically more complicated than that of the finite-dimensional ones.
²⁰ The fourth problem of Hilbert is to characterize all geometries in which segments (convex hulls of two different points) are shortest curves (in the sense of inner geometry). In particular, Hilbert asks for the construction of all these metrics and the study of the individual geometries. Hilbert's problem is a program of research about the foundations of geometry. The major contributions were the books The Geometry of Geodesics [61] by Busemann in 1955 and Hilbert's Fourth Problem [347] by Pogorelov in 1979. For a historical discussion compare [11] and [468].
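The $p$-norms of item (b) are easy to evaluate numerically. The following minimal Python sketch (the function names `p_norm` and `norm_distance` are illustrative assumptions) computes the norm for finite $p$ and for the Maximum norm, and shows that large finite $p$ approaches the latter.

```python
import math

def p_norm(v, p):
    """p-norm of the vector v for 1 <= p < infinity; the Maximum norm
    is obtained for p = math.inf."""
    if p == math.inf:
        return max(abs(x) for x in v)
    if p < 1:
        raise ValueError("p must satisfy 1 <= p <= infinity")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def norm_distance(v, w, p):
    """Distance induced by the p-norm: the norm of the difference v - w."""
    return p_norm([a - b for a, b in zip(v, w)], p)

print(p_norm((3, -4), 1))          # rectilinear norm: 7.0
print(p_norm((3, -4), 2))          # Euclidean norm: 5.0
print(p_norm((3, -4), math.inf))   # Maximum norm: 4
print(p_norm((3, -4), 200))        # close to 4 for large p
```

For $p < 1$ the triangle inequality fails, which is why the sketch rejects such values.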
II. Concerning "discrete" spaces we make the following definition: A metric space $(X, p)$ is called a discrete metric space if any bounded set is finite. In other words, if for a subset $W$ it holds that

$$\sup\{p(x, y) : x, y \in W\} < \infty,$$

then also

$$|W| < \infty. \qquad (2.18)$$

Consequently, an SMT exists for any finite set of points in such spaces. Examples are:

(a) Finite metric spaces.
(b) Graphs (a specific case of finite metric spaces²¹).
(c) Let $\mathbb{Z}$ be the set of all integers, then $\mathbb{Z}^d$ equipped with a rectilinear, Euclidean or any other "desired chosen" distance is a discrete metric space.
(d) Spaces of words with phylogenetic (= space measured evolutionary) distances.

Note that an infinite set with the so-called discrete metric, which defines the distance between two different points to be 1, is not a discrete metric space.²² For more facts about metric/geometric properties of several discrete spaces compare [246] and [476].

²¹ An introduction to the theory of graphs was given in the previous chapter; the representation as metric spaces will be described at the end of the present chapter.
²² But, of course, for any given set of points in such a space there exists an SMT, namely the MST.
2.4.2
More facts in the Euclidean plane
Of course, if we investigate a more specific metric space, we find further facts about Steiner Minimal Trees. The Euclidean plane is defined in the affine plane with the Euclidean metric $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ between the points $(x_1, y_1)$ and $(x_2, y_2)$, derived from a norm $\|\cdot\|$:

$$\|(x, y)\| = \sqrt{x^2 + y^2}. \qquad (2.19)$$
Steiner's Problem looks for a shortest network and in particular for a curve $C$ of shortest length joining two points. For our purposes, we regard a geodesic curve as any curve of shortest length. If we parametrize the curve $C$ by a differentiable map $\gamma : [0, 1] \to \mathbb{R}^d$ we define

$$\text{length of } C = \int_0^1 \|\dot{\gamma}(t)\| \, dt.$$

It is not hard to see that among all differentiable curves $C$ from the point $v$ to the point $v'$ the segment

$$\overline{vv'} = \{tv + (1 - t)v' : 0 \le t \le 1\} \qquad (2.21)$$

minimizes the length of $C$. And, moreover, as a consequence of 2.4.11,

Observation 2.4.12 All segments and no other sets of points are geodesic curves in the Euclidean plane.

Consequently, in the Euclidean plane SMTs always exist, and we may represent a graph $G = (V, E)$ embedded in the plane so that

(i) $V$ is a finite set of points;
(ii) Each edge $\overline{vv'}$ is a geodesic curve, which means a shortest curve in the sense of inner geometry. We may assume that $\overline{vv'}$ is a segment;²³
(iii) Each edge $\overline{vv'}$ has length $\|v - v'\|$;
(iv) The length of $G$ is defined by

$$L(G) = \sum_{\overline{vv'} \in E} \|v - v'\|.$$

²³ This justifies the double meaning of $\overline{vv'}$ as an edge of a graph and as a segment.
Using our first example in section 2.2, we have

Observation 2.4.13 Let $N$ be a finite set of points in the Euclidean plane. Without loss of generality, in an SMT $T = (V, E)$ for $N$ a given point can have degree 1, 2 or 3; a Steiner point always has degree 3.

Moreover, paying attention to 2.4.5, we find

Observation 2.4.14 An SMT for $n$ given points has exactly $n - 2$ Steiner points if and only if each given point is of degree one.
A tree with the property described in the last observation is called a full tree. It is a binary tree, i.e. it contains only leaves and internal vertices of degree three. The following property of full trees can be empirically observed: "Typical" sets of given points in the Euclidean plane usually do not have SMTs which are full trees. That is, their SMTs tend to be unions of small full trees.²⁴ We decompose a given tree for $N$ into full trees by the following procedure:

Procedure 2.4.15 Let $T = (V, E)$ be a tree for $N$, that means $N \subseteq V$, and let $v$ be a point in $N$ with $g(v) > 1$.

1. Define $G = (V \setminus \{v\}, E \setminus \{\overline{vv'} : v' \text{ is a neighbor of } v\})$. ($G$ is a forest with $g(v)$ components $G_i = (V_i, E_i)$, $i = 1, \ldots, g(v)$.)
2. Define for $i = 1, \ldots, g(v)$ the graph $G(i) = (V_i \cup \{v_i\}, E_i \cup \{\overline{v_i v'} : v' \text{ is a neighbor of } v \text{ in } T \text{ and } v' \text{ is in } V_i\})$, where $v_i$ is not in $V$.

If we repeat this procedure we obtain a family of trees in which, for each tree, the degree of any vertex which is a given point equals one.
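Procedure 2.4.15 can be sketched in a few lines. The following minimal Python sketch performs all splits at once rather than one given point at a time: every given point of degree greater than one is replaced by one fresh copy per incident edge, and the connected components of the result are collected. The function name `full_components` and the copy labels `(v, i)` are illustrative assumptions, and the vertices are assumed to be plain hashable labels such as strings.

```python
from collections import defaultdict

def full_components(edges, terminals):
    """Split every given point of degree > 1 into one copy per incident
    edge and return the edge lists of the resulting components."""
    degree = defaultdict(int)
    for u, w in edges:
        degree[u] += 1
        degree[w] += 1

    counter = defaultdict(int)

    def copy_of(x):
        # A given point of degree > 1 gets a fresh copy (x, i) per edge.
        if x in terminals and degree[x] > 1:
            counter[x] += 1
            return (x, counter[x])
        return x

    split_edges = [(copy_of(u), copy_of(w)) for u, w in edges]

    # Union-find over the vertices of the split graph.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, w in split_edges:
        parent[find(u)] = find(w)

    groups = defaultdict(list)
    for u, w in split_edges:
        groups[find(u)].append((u, w))
    return list(groups.values())

# Given points a, b, c, d and one Steiner point s; c has degree two.
tree = [("a", "s"), ("b", "s"), ("s", "c"), ("c", "d")]
components = full_components(tree, {"a", "b", "c", "d"})
print(len(components))   # 2
```

Each split of a given point $v$ raises the number of trees by $g(v) - 1$, so the example with one given point of degree two yields two components.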
Observation 2.4.16 Let $T = (V, E)$ be a tree for $N$. Then the number of full trees of $T$ is

$$1 + \sum_{v \in N} (g_T(v) - 1).$$

²⁴ The fastest exact algorithms (in practice) for Steiner's Problem use two phases: first a small but sufficient collection of full SMTs is generated and then an SMT is constructed from this collection. See [444].
To estimate the total number of full trees more exactly, denote by $f(n)$ the number of such trees with $n$ given and $n - 2$ Steiner points. Then $f(2) = 1$. If one removes a given point and also its adjacent Steiner point, one obtains a full tree. This shows that every full tree with $n + 1$ given points can be obtained from a full tree with $n$ given points by adding a Steiner point in one of the $2n - 3$ edges and adding a new edge. Hence,

$$f(n + 1) = (2n - 3) \cdot f(n). \qquad (2.24)$$
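The recursion (2.24) can be unrolled numerically to see how quickly the number of full trees grows. In the following minimal Python sketch the function names are illustrative, and the closed form used for comparison, $(2n - 4)! / (2^{n-2} (n - 2)!)$, is the standard product of odd numbers $1 \cdot 3 \cdots (2n - 5)$. It is included here as a check on the recursion and is an assumption of this sketch, not a quotation of the text.

```python
from math import factorial

def f_recursive(n):
    """Unroll f(2) = 1, f(m + 1) = (2m - 3) * f(m) up to f(n)."""
    value = 1
    for m in range(2, n):
        value *= 2 * m - 3
    return value

def f_closed(n):
    """Closed form (2n - 4)! / (2^(n - 2) * (n - 2)!), assumed for comparison."""
    return factorial(2 * n - 4) // (2 ** (n - 2) * factorial(n - 2))

for n in range(2, 9):
    print(n, f_recursive(n), f_closed(n))
```

Already for ten given points the recursion yields more than two million full trees, which illustrates why exhaustive enumeration is limited to small instances.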
A solution of this recursive equation is given by

Observation 2.4.17 There are

pairwise distinct full trees with $n$ leaves. Consequently, if we ignore the numbering of the internal vertices, we have to check

distinct full trees.

Remember that it is not simple to describe a Steiner hull of $N$.
Observation 2.4.18 In the Euclidean plane the convex hull of the set of given points is a Steiner hull.

In other words, there is a Steiner hull which is a polygon. Cockayne [110] was the first to find that an improved polygonal hull can be obtained by repeatedly deleting triangles from the boundary of the convex hull of the given set. In the following description, let $N$ be a finite set of points in the Euclidean plane.

1. Start with the convex hull $\operatorname{conv} N$;
2. Let $v$ and $v'$ be two points of $N$ such that $\overline{vv'} \subseteq$ boundary of $\operatorname{conv} N$. If there is a third point $w$ in $N$ such that the triangle $\operatorname{conv}\{v, v', w\}$ contains no other point of $N$ and the angle at $w$ is not less than 120°, then no edge of the SMT is within $\operatorname{conv}\{v, v', w\}$. The new boundary of the Steiner hull is obtained by replacing the segment $\overline{vv'}$ by the segments $\overline{vw}$ and $\overline{wv'}$.

If the hull then becomes self-intersecting in some of the given points, the original problem can be decomposed into two or more smaller problems. Weng [452] has generalized this concept and gives a method to construct Steiner polygons by repeatedly deleting $m$-gons, where $m$ is at most the number of given points. He has also shown the uniqueness of the Steiner polygons obtained by this method.
It is an interesting question to decide which of all these facts are true in higher-dimensional Euclidean spaces, or more generally, in metric spaces.²⁵ A helpful discovery in the investigations of Steiner's Problem is the observation that the degrees of vertices of SMTs in finite-dimensional Banach spaces are bounded by a quantity which only depends on the space:

Observation 2.4.19 Consider $d$-dimensional Banach spaces with a smooth norm. Then it holds that
(a) (C. [92]) The degree of each vertex in an SMT is at most $2d$.
(b) (Lawlor, Morgan [275], Swanepoel [415]) $d + 1$ edges can meet at a Steiner point of an SMT, but never $d + 2$.

In particular, the degree of Steiner points in Euclidean spaces is independent of the dimension:

Theorem 2.4.20 In Euclidean spaces of any dimension the degree of a Steiner point in an SMT equals 3.

Proof. The equation (1.11) also holds true in $d$ dimensions. Hence we have

that is, an inequality which is satisfied only for $n \le 3$.

For the planar case we know more about the vertex degrees.

Observation 2.4.21 Consider SMTs in a Banach plane equipped with a unit ball $B$. Then
(a) (C. [91], Swanepoel [416]) For the degrees of the vertices the following holds true: If $B$ is an affinely regular hexagon, then the degree is at most 6, otherwise at most 4.
(b) (Morgan et al. [315]) At most four edges come together in a Steiner point.
2.5
STEINER'S PROBLEM IN GRAPHS
Connectivity is also a very important concept in combinatorial optimization. We will discuss this concept in the sense of Shortest Connectivity in metric spaces.
2.5.1
The metric closure of a network
Here we consider networks. These are (connected) graphs $G = (V, E)$ equipped with a length function $f : E \to \mathbb{R}$. This function on the edges of $G$ is constrained to take only strictly positive values.²⁶ The simplest question, which will be of great importance in further considerations, is to look for the "geodesic curves", which are the interconnecting chains of shortest length between vertices in the network:

The Shortest Path Problem
Given: A network $G = (V, E, f)$ and two vertices $v$ and $v'$ of $G$.
Find: A path connecting $v$ and $v'$ with minimal length.

A solution is called a shortest path (between the vertices $v$ and $v'$ in $G$). With this in mind each network is a metric space, more precisely

Observation 2.5.1 Let $G = (V, E)$ be a connected graph equipped with a length function $f : E \to \mathbb{R}$. Define the distance function $p$ on $V$ so that

$$p(v, v') = \text{the length of a shortest path between the vertices } v \text{ and } v' \text{ in } G$$

for two different vertices $v$ and $v'$, and $p(v, v) = 0$. Then $(V, p)$ is a metric space.

The space $(V, p)$ is called the metric closure $G_f$ of a graph $G = (V, E)$ with length function $f : E \to \mathbb{R}$. We can also define $G_f$ as the complete graph on $V$ such that the length of an edge $\overline{vv'}$ in $G_f$ is the length of a shortest path between $v$ and $v'$ in $G$. Then we call $G_f$ the distance graph of the network $G = (V, E, f)$. Note that $G$ is a subgraph of $G_f$, but the restriction of $p$ on $G$ need not be $f$.

²⁶ Nevertheless, without saying it explicitly, sometimes we will use a length function which has the value 0 for several edges.
The problem of finding shortest paths in a graph with a length function is easy to solve by the so-called dynamic programming technique, which is a rather general method for solving combinatorial problems having the property that their optimal solution can be computed recursively from solutions to subproblems. More precisely, we use the following observation, called Bellman's principle of optimality, which is indeed the core of dynamic programming:

Observation 2.5.2 (Bellman [37]) Let $G = (V, E, f)$ be a network, and let $v$ and $v'$ be two vertices of $G$. If $e = \overline{wv'}$ is the final edge of some shortest path $v, \ldots, w, v'$ from $v$ to $v'$, then $v, \ldots, w$ (that is, the path without the edge $e$) is a shortest path from $v$ to $w$.
Roughly speaking: An optimal strategy contains only optimal substrategies. The observation gives immediately

Algorithm 2.5.3 (Dijkstra [125]) Let $G = (V, E, f)$ be a network. A shortest path between the vertices $v$ and $v'$ can be found by the following procedure:

1. Start with the vertex $v$; label $v$ with 0: $L(v) := 0$; all other vertices are unlabelled;
2. Determine $\min\{L(v_1) + f(\overline{v_1 v_2})\}$, where $v_1$ and $v_2$ are adjacent vertices, $v_1$ labelled and $v_2$ not; choose $\hat{v}_1$ and $\hat{v}_2$ which attain the minimum; label $\hat{v}_2$ by $L(\hat{v}_2) := L(\hat{v}_1) + f(\overline{\hat{v}_1 \hat{v}_2})$;
3. Repeat the second step until $v'$ is labelled.

For all labelled vertices $w$ the quantity $L(w)$ is the length of a shortest path connecting $v$ and $w$: $p(v, w) = L(w)$.

Now it is easy to construct the metric closure $G_f$: it is sufficient to apply 2.5.3 $|V|$ times.²⁷ When we are only interested in the metric $p$ we can find the metric closure in a simpler way:

²⁷ When the algorithm in 2.5.3 runs until all vertices are labelled, it creates a tree $T = (V, F)$ in which the unique path from $v$ to any other vertex $v'$ is a shortest path interconnecting these points in $G$. $T$ is called the distance tree related to $v$.
Algorithm 2.5.4 (Floyd [166]) Let $G = (V = \{v_1, \ldots, v_n\}, E, f)$ be a network. The metric closure $G^f = (V, \rho)$ can be found by the following procedure:
1. for $vv' \notin E$ define $f(vv') = \infty$;
2. for i := 1 to n do for j := 1 to n do $\rho(v_i, v_j) := f(v_i v_j)$;
3. for i := 1 to n do for j := 1 to n do for k := 1 to n do if $\rho(v_j, v_i) + \rho(v_i, v_k) < \rho(v_j, v_k)$ then $\rho(v_j, v_k) := \rho(v_j, v_i) + \rho(v_i, v_k)$.
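The three steps of 2.5.4 translate almost literally into code. The following Python sketch makes two assumptions that are ours and not stated in the algorithm: vertices are numbered 0, ..., n-1, and the distance of a vertex to itself is set to 0. The function name `metric_closure` is hypothetical.

```python
import math

def metric_closure(n, edges):
    """edges: dict {(i, j): length} of an undirected network on vertices 0..n-1."""
    # Steps 1 and 2: missing edges get length infinity.
    rho = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        rho[i][i] = 0                    # assumption: rho(v, v) = 0
    for (i, j), length in edges.items():
        rho[i][j] = min(rho[i][j], length)
        rho[j][i] = min(rho[j][i], length)
    # Step 3: allow v_i as an intermediate vertex on the way from v_j to v_k.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if rho[j][i] + rho[i][k] < rho[j][k]:
                    rho[j][k] = rho[j][i] + rho[i][k]
    return rho

example_edges = {(0, 1): 2, (1, 2): 1, (0, 2): 5, (2, 3): 1, (1, 3): 4}
rho = metric_closure(4, example_edges)
```

The triple loop makes the cubic running time visible; the outermost index must be the intermediate vertex, as in step 3.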
In particular, the function f = 1 is a length function. It measures the distance by counting the number of edges in the path.
A first example: Let $A = A(G) = (a_{ij})_{i,j=1,\ldots,n}$ be the adjacency matrix for the graph $G = (V = \{v_1, \ldots, v_n\}, E)$. Then, obviously, the equation $a_{ij} = 1$ means that there is a chain of length 1 from $v_i$ to $v_j$. Now consider the k-th power of $A$. Using induction it is not hard to see that the equation $a_{ij}^{(k)} = m$ means that there are $m$ different chains of length exactly $k$ from $v_i$ to $v_j$. Hence, the graph $G$ is connected if and only if for any pair of distinct vertices $v_i$ and $v_j$ there is a number $k = k(i, j)$ between 1 and $n - 1$ such that $a_{ij}^{(k)} > 0$: Remark 2.5.5 Let $G = (V = \{v_1, \ldots, v_n\}, E)$ be a connected graph, let $A = A(G)$ be its adjacency matrix and let $A^k = (a_{ij}^{(k)})_{i,j=1,\ldots,n}$, $k = 1, 2, \ldots$. Then
$\rho(v_i, v_j) = \min\{k : a_{ij}^{(k)} > 0\}$
holds true for any two distinct vertices $v_i$ and $v_j$. The quantity $\operatorname{diam} G = \max\{\rho(v, v') : v, v' \in V\}$
Gauss' question
is called the diameter of the graph $G = (V, E)$. Of course, for any connected graph $G$ it holds that $\operatorname{diam} G \le |V| - 1$. This implies that, using the adjacency matrix, we have to check only the powers up to $k = |V| - 1$ to decide if a graph is connected or not.
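The connectivity test by matrix powers can be tried out directly. The sketch below is our own illustration, with the hypothetical names `hop_distances` and `is_connected`: it records, for each pair of distinct vertices, the first power of the adjacency matrix with a positive entry, checking only the powers up to $n - 1$.

```python
def mat_mult(a, b):
    """Product of two square integer matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hop_distances(adj):
    """d[i][j] = smallest k <= n-1 with (A^k)_ij > 0, or None if there is none."""
    n = len(adj)
    dist = [[0 if i == j else None for j in range(n)] for i in range(n)]
    power = [row[:] for row in adj]          # power holds A^k
    for k in range(1, n):
        for i in range(n):
            for j in range(n):
                if dist[i][j] is None and power[i][j] > 0:
                    dist[i][j] = k
        power = mat_mult(power, adj)
    return dist

def is_connected(adj):
    return all(d is not None for row in hop_distances(adj) for d in row)

# A path with four vertices and a graph with two components.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
split = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
```

For the path the largest recorded value is 3, which equals $|V| - 1$, so the bound on the diameter is attained.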
trigonometric functions, in general analytic functions. The so-called real-RAM is described by Preparata and Shamos [351]. It closely reflects the kinds of programs that are typically written in high-level algorithmic languages, in which it is common to treat variables of the type 'real' as having unlimited precision, and we ignore such questions as how a real number can be read or written in unit time. The relationship between TM/RAM and real-RAM is still an open question. More specific forms and descriptions of algorithms are closely connected with concrete problems and will be discussed in their own environments.3
3 For a readable description of theoretical aspects of computer science see Harel [210], [211].
What does solution mean?
3.4
DOES AN EFFICIENT ALGORITHM EXIST?
We are not interested only in the creation of some algorithm, but also in the amount of time the algorithm takes to run. We wish to distinguish fast solution methods from slower ones: clearly this requires us to formulate some objective notions on how to measure algorithm efficiency. It should be emphasized that although faster computers can produce solutions more rapidly than slower computers, the main advances resulted from the improvements in the understanding of the mathematical structure of the underlying problems. A problem is usually expressed in terms of several input parameters which are described but whose values are left unspecified. In most cases, there exist two or more algorithms for solving a given problem. If we have in mind the implementation of the algorithm on a machine there is a feature that must be compared in deciding on one algorithm rather than another, namely the time taken (which depends on the number of times each step is executed), the so-called time complexity. This quantity depends on the size of the input parameters.4 We may assume that for a size $n$ the time complexity $t(n)$ is a function, where in general, but not exclusively, $t(n) \ge n$. In the following discussion we will use the phrase "on the order of" to express lower and upper bounds. More precisely: Let $f$ and $g$ be functions from the positive integers to the real numbers. Then:
(i) The function $g(n)$ is said to be of order at least $f(n)$, denoted $\Omega(f(n))$, if there are positive constants $c$ and $n_0$ such that $g(n) \ge c \cdot f(n)$ for all $n \ge n_0$.
(ii) The function $g(n)$ is said to be of order at most $f(n)$, denoted $O(f(n))$, if there are positive constants $c$ and $n_0$ such that $g(n) \le c \cdot f(n)$ for all $n \ge n_0$.
(iii) The function $g(n)$ is said to be of order $f(n)$, denoted $\Theta(f(n))$, if $g(n) = \Omega(f(n))$ and $g(n) = O(f(n))$. That is, $f(n)$ and $g(n)$ both grow at the same rate; only the multiplicative constants may be different.
This notation allows us to concentrate on the dominating term in an expression describing a lower or upper bound and to ignore any multiplicative constants. The time complexity of an algorithm expressed in terms of any of these notations is, in general, referred to as asymptotic time complexity because it reflects the behavior of the algorithm for sufficiently large values of the problem size.
4 All questions, definitions and investigations about algorithms will be done in view of our original problem, namely the search for shortest trees. Hence, for our considerations we will use the number of given points as the size of the input.
It is not hard to see that these "Order"-notations have the following properties:
(a) $g(n) = O(f(n))$ if and only if $f(n) = \Omega(g(n))$.
(b) The order of the sum of two functions is given by the order of the faster growing function: $f(n) + g(n) = O(\max\{f(n), g(n)\})$.
(c) If $f(n)$ is a polynomial of degree $k$ then $f(n) = O(n^k)$.
(d) The relation represented by "O" is transitive.
(e) For the logarithmic order $O(\log n)$ the base is irrelevant since $\log_b n = \log_a n \cdot \log_b a$.
(f) Exponential functions grow faster than polynomial functions: $n^k = O(b^n)$ for all $k > 0$ and $b > 1$. Conversely, logarithmic functions grow more slowly than polynomial functions.
A broader and more detailed discussion of the growth of functions is given by Aigner [3]. For our purpose we will use the following "classes of complexity", which are defined in terms of the input size n:
Order          Name of the "class"   Remark
O(1)           constant              execution time is independent of the input size
O(log n)       logarithmic           the base is irrelevant
O(n)           linear
O(n log n)     log-linear            the base is irrelevant
O(n^2)         quadratic
O(n^3)         cubic
O(n^k)         polynomial            k is a fixed positive integer
Note that the previous table shows the "fast" algorithms, this table the "slow" ones:
Order          Name of the "class"   Remark
O(c^n)         exponential           c > 1 is a fixed positive real number
O(n!)          factorial             Stirling's formula: $n! \approx \sqrt{2\pi n}\,(n/e)^n$; Stirling's inequalities: $e\,(n/e)^n \le n! \le e\,n\,(n/e)^n$
$m(X, \rho) \ge \frac{1}{2} = 0.5$ (3.23)
This is the best possible bound. Proof. Let $T$ be an SMT for a finite set $N$ in $X$. Consider the graph $G$ obtained by replacing each edge of $T$ by two parallel edges. Since an even number of edges is incident with each vertex of $G$, the graph $G$ has an Eulerian cycle17 which has the length $2 \cdot L(T)$ and is a tour through $N$. This tour is not shorter than a minimal tour in which no Steiner point exists. If we delete any edge of the minimal tour we obtain a tree interconnecting $N$ without Steiner points. Hence,
$L(\text{MST for } N) \le 2 \cdot L(T) = 2 \cdot L(\text{SMT for } N)$ (3.24)
which implies the first assertion. Next we show that the lower bound 0.5 is the best possible over the class of all metric spaces. Let $G = (V, E)$ be a star with $n$ leaves. All edges have unit length. The leaves form the set $N$ of given points. Then an MST for $N$ has length $2(n - 1)$ and an SMT with the internal vertex of the star as Steiner point has the length $n$. Hence, the ratio between the two lengths is $n/(2n - 2)$, which tends to 0.5 as $n$ tends to infinity.
17 This is defined as a cycle that uses each edge exactly once, compare 4.2.19.
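The star example from the proof is easy to check numerically. The following lines are only a small illustration with a hypothetical function name `star_ratio`; they evaluate the ratio $n/(2n - 2)$ of the two lengths for growing $n$.

```python
def star_ratio(n):
    """Ratio L(SMT) / L(MST) for the n leaves of a star with unit edge lengths."""
    smt_length = n                  # the star itself, centre used as Steiner point
    mst_length = 2 * (n - 1)        # n - 1 leaf-to-leaf connections of length 2
    return smt_length / mst_length

ratios = [star_ratio(n) for n in (2, 3, 10, 100, 10**6)]
```

The values decrease monotonically towards 0.5 but never reach it, in agreement with the statement that the bound is best possible.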
With 3.4.8 and 3.5.2 in mind we have that the Steiner ratio of metric spaces lies precisely in the range between 0.5 and 1. This is even true for spaces of finite cardinality. Ivanov and Tuzhilin show in [240] that for any real number between 0.5 and 1 there is a metric space with this Steiner ratio. In view of (3.24) we can find a tree interconnecting a set of $n$ points in a metric space in $O(n^2 \cdot \log n)$ time18 with length at most twice that of an SMT. The performance ratio of an MST as an approximation of Steiner's Problem in a metric space $(X, \rho)$ is
With these facts in mind, we are only interested in approximations and heuristics satisfying one or both of the following properties: The running time of the algorithm is at most the time to compute an MST in this space. The error is at most $1/m$, where $m$ is the Steiner ratio of the space.
The proof of Theorem 3.5.2 immediately suggests an approximation algorithm for Steiner's Problem in graphs: Algorithm 3.5.3 (Kou, Markowsky, Berman [267]) A finite set $N$ of $n$ vertices in a network $G = (V, E, f)$ is given. Then,
1. Describe the metric closure $G^f$: For all $v, v' \in N$, $v \ne v'$, determine the distances $\rho(v, v')$ and the shortest paths $G(v, \ldots, v')$;
2. Find an MST $T = (N, F)$ for $N$ in the metric space $(V, \rho)$; set $F' := \bigcup_{vv' \in F} G(v, \ldots, v')$; and set $V' := \bigcup_{vv' \in F'} \{v, v'\}$;
3. While there is a cycle $G_c$ in $(V', F')$ delete any edge from $G_c$; delete leaves which are not members of $N$.
18 Or faster using more specific techniques in several metric spaces; but not faster than $\Omega(n \cdot \log n)$, see the previous section.
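The three steps of 3.5.3 can be sketched as follows. This is a hedged illustration and not the authors' implementation: the names `shortest_paths` and `kmb_steiner_tree` are ours, the network is assumed to be connected with comparable vertex names, and the cycles in step 3 are broken by keeping a spanning tree of $(V', F')$ through a union-find structure, which is one admissible way of "deleting any edge" from each cycle.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra from source; returns (distances, predecessors)."""
    dist, pred, heap = {}, {}, [(0, source, None)]
    while heap:
        d, w, p = heapq.heappop(heap)
        if w in dist:
            continue
        dist[w], pred[w] = d, p
        for u, length in graph[w].items():
            if u not in dist:
                heapq.heappush(heap, (d + length, u, w))
    return dist, pred

def kmb_steiner_tree(graph, terminals):
    terminals = list(terminals)
    info = {v: shortest_paths(graph, v) for v in terminals}   # step 1
    # Step 2: Prim's algorithm on the terminals in the metric closure.
    in_tree, closure_edges = {terminals[0]}, []
    while len(in_tree) < len(terminals):
        _, v, w = min((info[v][0][w], v, w)
                      for v in in_tree for w in terminals if w not in in_tree)
        in_tree.add(w)
        closure_edges.append((v, w))
    # Replace every edge of the closure MST by a shortest path in G.
    edges = set()
    for v, w in closure_edges:
        pred = info[v][1]
        while pred[w] is not None:
            edges.add(frozenset((w, pred[w])))
            w = pred[w]
    # Step 3a: break cycles by keeping a spanning tree of (V', F').
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def length(e):
        a, b = tuple(e)
        return graph[a][b]
    tree = set()
    for e in sorted(edges, key=length):
        a, b = tuple(e)
        if find(a) != find(b):
            parent[find(a)] = find(b)
            tree.add(e)
    # Step 3b: repeatedly delete leaves which are not given points.
    changed = True
    while changed:
        changed = False
        degree = {}
        for e in tree:
            for x in e:
                degree[x] = degree.get(x, 0) + 1
        for e in list(tree):
            if any(degree[x] == 1 and x not in terminals for x in e):
                tree.discard(e)
                changed = True
    return tree

# Hypothetical example: centre 0 joined to 1, 2, 3 by unit edges, plus two long edges.
example = {0: {1: 1, 2: 1, 3: 1}, 1: {0: 1, 2: 3}, 2: {0: 1, 1: 3, 3: 3}, 3: {0: 1, 2: 3}}
steiner_tree = kmb_steiner_tree(example, [1, 2, 3])
```

In this example all shortest paths between given points run through the centre, so the resulting tree is the star with total length 3.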
Lower bound for the Steiner ratio $m$                                Source
$m \ge 1/\sqrt{3} = 0.57735\ldots$                                   Graham, Hwang, 1976, [190]
$m \ge \frac{1}{3}(2\sqrt{3} + 2 - \sqrt{7 + 2\sqrt{3}}) = 0.74309\ldots$   Chung, Hwang, 1978, [86]
$m \ge 4/5 = 0.8$                                                    Du, Hwang, 1983, [138]
$m > 0.82416\ldots$                                                  Chung, Graham, 1983, [85]
Finally, in 1990, Du and Hwang [139], [140] created many new methods and succeeded in proving the Gilbert-Pollak conjecture completely.21
21 This mathematical fact appeared in The New York Times, October 30, 1990 under the title "Solution to Old Puzzle: How Short a Shortcut?"
For most metric spaces the exact value of the Steiner ratio is still unknown. For a broader discussion of the concept of the Steiner ratio and more knowledge of its values for specific spaces compare C. [92]. In particular, for the following specific (Banach-Minkowski) planes we do know the exact value for the Steiner ratio:
Unit ball                     The norm is essentially   Steiner ratio
parallelogram                 rectilinear               $2/3 = 0.6666\ldots$
ellipse                       Euclidean                 $\sqrt{3}/2 = 0.86602\ldots$
affinely regular hexagon                                $3/4 = 0.75$
An interesting problem, but one which seems very difficult, is to determine the range of the Steiner ratio $m_d(B)$ for d-dimensional Banach spaces equipped with a unit ball $B$, depending on the value $d$. More precisely: Determine the best possible constants $c_d$ and $C_d$ such that
$c_d \le m_d(B) \le C_d$
for all unit balls $B$ of $A_d$. Both the values $C_d$ and $c_d$ are attained by certain d-dimensional Banach spaces. This follows from the continuity of the Steiner ratio as a function of the space and the Blaschke selection theorem.22 The quantity $C_d$ is defined as the upper bound of all numbers $m_d(B)$ ranging over all unit balls $B$ of $A_d$. Of course, $C_1 = 1$. Conjecture 3.5.6 For d = 2, 3,
where $m(d, 2)$ denotes the Steiner ratio of the d-dimensional Euclidean space.23
22 Affine and convex geometry are parts of the geometry for Banach-Minkowski spaces. Consequently, in our investigations the idea of convexity plays a central role and we will often use arguments from these geometries. For textbooks see Leichtweiß [281] or Valentine [431].
23 If we have an analytic formula describing the norm, we also have the possibility of estimating the Steiner ratio with direct calculations. In particular, for d-dimensional $L_p$-spaces; abbreviated by $m(d, p)$, $d = 1, 2, \ldots$, $1 \le p \le \infty$. For these quantities compare [7], [8], [10], [9] and [286].
E) such that $N \subseteq V$ and the modified length
$C(G) = C(\alpha, \rho)(G) = \alpha \cdot |V \setminus N| + L(X, \rho)(G)$ (4.17)
is minimal. Such a graph must be a tree and is called a Steiner Minimal Tree weighted by the real $\alpha$, or briefly an SMT($\alpha$), for $N$ in $(X, \rho)$. For $\alpha = 0$ we obtain a usual SMT. For $\alpha > 0$ an SMT($\alpha$) can assume different structures than those available to SMTs. More precisely, an SMT(0) is an ordinary SMT, while on the other hand, if $\alpha$ is the length of an MST for a finite set $N$ of points, then an SMT($\alpha$) must be an MST. Consequently, the number of Steiner points produced decreases as the weight $\alpha$ runs from zero to infinity. Consider the following introductory example: Interconnect the four points (1, 0), (-1, 0), (0, 1) and (0, -1), which are the corners of a square in the Euclidean plane:
(Table with columns: Shortest tree, Length $L(\cdot)$, Number of Steiner points.)
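The introductory example can be checked with a few lines of Python. The lengths used below are not taken from the table of the text; they are our own elementary computations for the square with side $\sqrt{2}$: the MST has length $3\sqrt{2}$, the best tree with one Steiner point (the centre) has length 4, and the SMT with two Steiner points has length $\sqrt{2} + \sqrt{6}$. The names `CANDIDATES`, `modified_length` and `best_tree` are hypothetical.

```python
import math

# (name, length, number of Steiner points) for the three candidate trees
CANDIDATES = [
    ("T0", 3 * math.sqrt(2), 0),                 # MST = 0-SMT
    ("T1", 4.0, 1),                              # 1-SMT: star with the centre (0, 0)
    ("T2", math.sqrt(2) + math.sqrt(6), 2),      # SMT = 2-SMT
]

def modified_length(alpha, length, steiner_points):
    """C(G) = alpha * (number of Steiner points) + L(G), compare (4.17)."""
    return alpha * steiner_points + length

def best_tree(alpha):
    """Name of the candidate with the smallest modified length."""
    return min(CANDIDATES, key=lambda t: modified_length(alpha, t[1], t[2]))[0]

winners = {alpha: best_tree(alpha) for alpha in (0.0, 0.2, 0.4)}
```

Under these assumptions the winner changes from the tree with two Steiner points at $\alpha = 0$ to the tree with one Steiner point at $\alpha = 0.2$ and to the MST at $\alpha = 0.4$.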
Then it is easy to calculate that $T_1$ is an SMT(0.2) and $T_0$ is an SMT(0.4). Underwood [430] presents many properties of SMT($\alpha$)s in the Euclidean plane and a modified Melzak procedure which computes an SMT($\alpha$) for a given set of points.
Figure 4.1 Minimal Networks (panels: MST = 0-SMT; 1-SMT; SMT = 2-SMT)
In (4.15) we saw that the best addition of $k$ Steiner points to the initial set of given points cannot improve drastically the approximation in comparison to the best addition of $k - 1$ Steiner points, if $k$ is a large number. More precisely: Let $N$ be a finite set of points in a metric space. Then the relative defect when going from a $(k - 1)$-SMT to a $k$-SMT for $N$ is monotonically decreasing in $k$ and tends to zero as $k$ runs to infinity. This fact is useful to estimate the number $k$ for $k$-SMTs depending on the number $\alpha$ for SMT($\alpha$)s:
Theorem 4.2.8 Let ( X ,p) be a m e t r i c space which fulfills both the a s s ~ ~ m p t i o n s A and B ( k ) . Let N be a finite set i n X . If we seek a n SMT(a) for a set N of given points i n a m e t r i c space ( X ,p ) we are only interested i n the k-SMTs for N with k 5 2 . L ( M S T for AT), (4.18) a!
where y ( X , p ) is defined in (4.16).1° Using the theorem we have an a priori bound for an ShlT(ol), namely
and a better bound, found during the computation from the (k - l ) t h step to the kth step, namely
l0Since we assume that c ( S , p )
> 3 we have -,2
Network Design Problems
Paying attention to 4.2.7 we find
Corollary 4.2.9 T h e search for a n SMT(a) for a set N of n given points consumes O ( n '-. ( a c - j + : ) . L + a , og n ) (4.21)
time, where $L = L(\text{MST for } N)$ and $c = c(X, \rho)$.
4.2.2
A monotonic iterative procedure to approximate trees of minimal length
Using 1-SMTs we can find a fast running and, in general, good iterative procedure to produce shortest trees. We apply a procedure for creating a 1-SMT repeatedly, meaning that we start with the given finite set, and successively add Steiner points, one Steiner point at a time. Note that once added, a Steiner point cannot be removed.11 We call such a method a monotonic iterative algorithm. During the course of such an algorithm, a sequence
$N = N_0 \subset N_1 \subset N_2 \subset \cdots$
of sets of points is constructed such that
$L(X, \rho)(\text{MST for } N_{k+1}) \le L(X, \rho)(\text{MST for } N_k)$, (4.23)
for $k \ge 0$. It is, however, possible that such constructions do not produce an SMT; though it appears to produce shorter trees on average than other known heuristics in many metric spaces.12 The iterated 1-Steiner heuristic of Kahng and Robins [247] is an example of a monotonic iterative algorithm. We generalize this greedy strategy in the following way:
11 In this sense, this method is greedy.
12 Salowe and Warme [377] show that for a specific set of given points in the plane with rectilinear distance, such a monotonic iterative algorithm does not construct an SMT. But empirical evidence suggests that in general this procedure creates a tree whose length is not far from the length of an SMT.
Procedure 4.2.10 Let $N$ be a finite set of $n$ points in a metric space $(X, \rho)$ which satisfy the assumptions A and B(1).
1. Determine an MST $T^{(0)} = (V_0, E_0)$ for $N$;
2. For $k \ge 1$ find a 1-SMT $T^{(k)} = (V_k, E_k)$ for $V_{k-1}$;
3. Terminate as soon as one of the following things is true: $n - 2$ iterations have been executed; $L(X, \rho)(T^{(k)}) = m(X, \rho) \cdot L(X, \rho)(T^{(0)})$; $L(X, \rho)(T^{(k)}) = L(X, \rho)(T^{(k-1)})$.13
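A monotonic iterative procedure of this kind can be sketched in Python. The sketch rests on assumptions that are ours: Steiner points are chosen from a finite list of candidate points supplied by the caller, each step adds the candidate that shortens the MST most, and only the first and the third stopping rule are used, since the Steiner ratio of the space may be unknown. The names `mst_length` and `iterated_one_steiner` are hypothetical.

```python
import math

def mst_length(points, dist):
    """Length of an MST for points, computed with Prim's algorithm."""
    points = list(points)
    best = {p: dist(points[0], p) for p in points[1:]}
    total = 0.0
    while best:
        p = min(best, key=best.get)          # closest point not yet in the tree
        total += best.pop(p)
        for q in best:
            best[q] = min(best[q], dist(p, q))
    return total

def iterated_one_steiner(given, candidates, dist):
    """Greedy sketch: add one candidate Steiner point at a time."""
    current = list(given)
    length = mst_length(current, dist)
    for _ in range(len(given) - 2):          # at most n - 2 iterations
        trials = [(mst_length(current + [c], dist), c)
                  for c in candidates if c not in current]
        if not trials:
            break
        new_length, c = min(trials)
        if new_length >= length:             # no improvement: stop
            break
        current.append(c)                    # once added, never removed
        length = new_length
    return current, length

corners = [(1, 0), (-1, 0), (0, 1), (0, -1)]
points, length = iterated_one_steiner(corners, [(0, 0), (2, 2)], math.dist)
```

For the corners of the square the centre is added in the first step, which shortens the tree from $3\sqrt{2}$ to 4, and the second step finds no further improvement among the remaining candidates.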
Clearly, this method only consumes polynomially bounded time, namely
$O(n^2) + (n - 2) \cdot O(n^{c(X,\rho)} \log n) = O(n^{c(X,\rho)+1} \cdot \log n)$. (4.24)
Moreover, let $t_{1\text{-SMT}}(n)$ be the time required to find a 1-SMT for a set of $n$ given points. We may assume that
1. $t_{1\text{-SMT}}$ is polynomially bounded;
2. $t_{1\text{-SMT}}(n)$ is at least the time to find an MST for $n$ given points, and consequently $t_{1\text{-SMT}}(n) \ge \Omega(n \log n)$; (4.25)
3. $t_{1\text{-SMT}}$ is an increasing function in the size of the input.
All these facts imply Remark 4.2.11 The procedure 4.2.10 runs in polynomially bounded time. If $t_{1\text{-SMT}}(n)$ is the time needed to find a 1-SMT for $n$ given points then 4.2.10 needs $O(n \cdot t_{1\text{-SMT}}(n)) = O(n^{c(X,\rho)+1} \log n)$ (4.26)
For applications of this strategy in the rectilinear plane and in networks see [471] and [193], respectively. The length of the tree produced by the algorithm 4.2.10 is at most the length of an MST, and on the other hand we have
13 Note that these conditions are not independently valid. In particular, if the first or the second holds, then the third also holds.
Observation 4.2.12 Let $T^{(k)}$ be the tree for a given finite set constructed by 4.2.10 in the kth step. Then
We define the relative performance ratio of the metric space $(X, \rho)$ by
$\dfrac{1}{\operatorname{error}(X, \rho)(k : k')} = \inf \left\{ \dfrac{L(X, \rho)(T^{(k)} \text{ for } N)}{L(X, \rho)(T^{(k')} \text{ for } N)} : N \text{ is a finite set in } (X, \rho) \right\}$ (4.28)
$\dfrac{L(T^{(k)})}{L(T^{(k')})} \ge \dfrac{L(T^{(k)})}{L(T^{(0)})} \ge \dfrac{L(\text{SMT for } N)}{L(\text{MST for } N)}$
This implies
Observation 4.2.13 For the relative performance ratio of the metric space $(X, \rho)$ the inequality
$1 \le \operatorname{error}(X, \rho)(k : k') \le \dfrac{1}{m(X, \rho)} \le 2$ (4.31)
holds. And, moreover
Theorem 4.2.14 (C. [97]) For the relative performance ratio of the metric space $(X, \rho)$ satisfying the assumptions A and B(1) it holds that
$1 \le \operatorname{error}(X, \rho)(k : k - 1) \le 1 + \dfrac{\gamma(X, \rho)}{k}$ (4.32)
for all $k > 0$, where $\gamma(X, \rho)$ is defined in (4.16).
Now, we have two performance error bounds: the absolute, a priori, bound given in 4.2.13 and the relative, a posteriori, one given in 4.2.14. If k runs to infinity then the relative performance ratio tends to zero. Of course, we can also apply algorithm 4.2.10 in metric spaces which do not satisfy the assumptions A and/or B(1), but then we do not obtain the nice performance ratio of 4.2.14.
4.2.3
Component-size bounded Steiner Trees
There is an approximation method for Steiner's Problem which uses trees that can contain Steiner points, but not in an arbitrary sense: Let $N$ be a finite set of points in a metric space $(X, \rho)$. Let $T = (V, E)$ be a tree interconnecting $N$. For such trees we assume that the degree of each given point is at least one and the degree of each Steiner point in $V \setminus N$ is at least three. However, a given point in such a tree need not be a leaf. When a given point $v$ is not a leaf, $T$ can be decomposed (by splitting at the given point) into several smaller trees, so that given points only occur as leaves. More precisely:
1. Define $G = (V \setminus \{v\}, E \setminus \{vv' : v' \text{ is a neighbor of } v\})$. ($G$ is a forest with $g(v)$ components $G_i = (V_i, E_i)$, $i = 1, \ldots, g(v)$.)
2. Define for $i = 1, \ldots, g(v)$ the graph $G_{(i)} = (V_i \cup \{v_i\}, E_i \cup \{v_i v' : v' \text{ is a neighbor of } v \text{ in } G \text{ and } v' \text{ is in } V_i\})$, where $v_i$ is not in $V$.
In this way, every tree interconnecting $N$ is decomposed into so-called full components. The size of a full component is the number of given points in the full component.
A k-size tree for $N$ is a tree interconnecting all points of $N$ with all full components of size at most $k$. A k-size SMT is the shortest one among all k-size trees. The k-size Steiner's Problem Given: A finite set $N$ of points in a metric space $(X, \rho)$ and an integer $k \ge 2$. Find: A network $G = (V, E)$ such that
(i) $N \subseteq V$,
(ii) Every full component contains at most $k$ given points, and
(iii) $L(X, \rho)(G)$ is minimal.
For $k = 2$ we look for an MST. For every $k > 4$ this problem is NP-hard, [355].
Clearly, we are interested in the greatest lower bound for the ratio between the lengths of an SMT and a k-size SMT for the same set of points in a metric space:
$m^{(k)} = m^{(k)}(X, \rho) = \inf \left\{ \dfrac{L(\text{SMT for } N)}{L(k\text{-size SMT for } N)} : N \subseteq (X, \rho) \text{ is a finite set} \right\}$ (4.33)
This quantity is called the k-size-Steiner ratio of the metric space $(X, \rho)$. In any metric space $(X, \rho)$ a 2-size SMT is an MST. Hence, the 2-size-Steiner ratio is the Steiner ratio:
$m^{(2)}(X, \rho) = m(X, \rho)$. (4.34)
Furthermore, Observation 4.2.15 For the k-size-Steiner ratio $m^{(k)}$, $k > 2$, the following is known: (a) (Zelikovsky [473]) For any metric space $(X, \rho)$ it holds that
(Du [136]) This lower bound is the best possible one over the class of all metric spaces. (b) (Du [136]) For any metric space $(X, \rho)$ it holds that
where $r = \lfloor \log_2 k \rfloor$. Now we can describe the performance ratio of approximations for Steiner's Problem more exactly. Zelikovsky [473] showed that there exists a polynomial-time approximation $A$ for Steiner's Problem in a metric space $(X, \rho)$ with performance ratio
$\operatorname{error}(A) = \dfrac{1}{2} \cdot \left( \dfrac{1}{m^{(2)}(X, \rho)} + \dfrac{1}{m^{(3)}(X, \rho)} \right)$
provided that an SMT for three given points can be computed in polynomial time. Using a similar idea, Berman and Ramaiyer [40] showed that there is a polynomial-time approximation $A_k$ with performance ratio
$\operatorname{error}(A_k) \le \dfrac{1}{1 \cdot 2} \cdot \dfrac{1}{m^{(2)}(X, \rho)} + \dfrac{1}{2 \cdot 3} \cdot \dfrac{1}{m^{(3)}(X, \rho)} + \dfrac{1}{3 \cdot 4} \cdot \dfrac{1}{m^{(4)}(X, \rho)} + \ldots + \dfrac{1}{k - 1} \cdot \dfrac{1}{m^{(k)}(X, \rho)}$, (4.38)
provided that for any $k$ an SMT for $k$ points can be computed in polynomial time. Clearly, we are interested in the k-size-Steiner ratio for specific spaces. For the plane with rectilinear distance we have
k           $m^{(k)}$                  Source
$k = 2$     $2/3$                      Hwang, [228]
$k = 3$     $4/5$                      Berman and Ramaiyer, [40]
$k \ge 4$   $(2k - 2)/(2k - 1)$        Borchers et al., [52]
Such nice results for the Euclidean plane are not yet known. Borchers and Du [51] determine the k-size-Steiner ratio for graphs exactly: For $k = 2^r + s$, where $0 \le s < 2^r$, this quantity is
$\dfrac{r \cdot 2^r + s}{(r + 1) \cdot 2^r + s}$.
4.2.4
The relative neighborhood problem
The MST problem has numerous applications in geometric network design. We saw, and will see again, that it will be useful in approximation algorithms for some NP-hard problems. Consequently, it will be of interest to investigate the geometric structure of MSTs more thoroughly. Let $N$ be a finite set of points in a metric space $(X, \rho)$. Two points $v$ and $v'$ of $N$ are said to be relative neighbors if and only if $\rho(v, v') \le \rho(v, w)$ or $\rho(v, v') \le \rho(v', w)$ for all $w \in N$. The geometric interpretation for this is that the so-called lune of $v$ and $v'$,
contains no points of $N$. Now:
The Relative Neighborhood Problem Given: A finite set $N$ of points in a metric space $(X, \rho)$. Find: A graph $G = (N, E)$ in which all relative neighbors are connected by an edge.
A solution to this problem is called a relative neighborhood graph (RNG) for $N$. For a finite set of points the RNG and MST are relatives:
MST $\subseteq$ RNG.
We now prove this fact. Theorem 4.2.16 (Katajainen [253]) An MST for $N$ in a metric space is a subgraph of the RNG for $N$.
Proof. Let $vv'$ be an edge in an MST $T = (N, E)$ and assume that it is not an edge in an RNG. This would imply that there exists a point $w$ of $N$ which is inside $C(v, v')$. Without loss of generality, we can assume that there is a path in $T$ which connects $w$ to $v$ and which does not contain the edge $vv'$. Let $T' = (N, E \setminus \{vv'\} \cup \{v'w\})$. $T'$ is another spanning tree for $N$ whose total length is less than $L(T)$ since $\rho(v, v') > \rho(v', w)$. This contradiction proves the assertion.
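The statement of 4.2.16 can be observed on small examples. The following Python sketch is only an illustration: it tests each pair of points against all others, which takes cubic time and is therefore slower than the quadratic method cited below, and it uses the Euclidean distance `math.dist` as a default. The names `relative_neighborhood_graph` and `mst_edges` are hypothetical.

```python
import math
from itertools import combinations

def relative_neighborhood_graph(points, dist=math.dist):
    """Edges between all pairs of relative neighbors (brute force)."""
    edges = set()
    for v, w in combinations(points, 2):
        d = dist(v, w)
        # v and w are relative neighbors unless a third point lies in their lune
        if not any(dist(v, u) < d and dist(w, u) < d
                   for u in points if u != v and u != w):
            edges.add(frozenset((v, w)))
    return edges

def mst_edges(points, dist=math.dist):
    """Edges of an MST, computed with Kruskal's algorithm and union-find."""
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
    tree = set()
    for v, w in sorted(combinations(points, 2), key=lambda e: dist(*e)):
        if find(v) != find(w):
            parent[find(v)] = find(w)
            tree.add(frozenset((v, w)))
    return tree

sample = [(0, 0), (4, 0), (2, 1), (2, 5), (7, 3)]
rng = relative_neighborhood_graph(sample)
mst = mst_edges(sample)
```

In the sample, the points (0, 0) and (4, 0) are not relative neighbors because (2, 1) lies in their lune, and every edge of the MST appears in the RNG.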
We have established that in an RNG for a given finite set $N$ two points $v$ and $v'$ are adjacent if and only if
$\rho(v, v') \le \max\{\rho(v, w), \rho(v', w)\}$ for all $w \in N \setminus \{v, v'\}$.
Katajainen [254] presented a method for computing all relative neighbors, constructing the RNG for a set of given points in quadratic time.14
14 Moreover, for a finite set of points the RNG and DT are relatives of the MST. More precisely, in the plane with a norm derived from a smooth unit ball $B$ we have MST $\subseteq$ RNG $\subseteq$ DT. Here, a graph $G = (N, E)$ is called the Delaunay triangulation (DT) for $N$ if $G$ has the following property: An edge $vv'$ is in $E$ if and only if there is a homothetic copy $rB + u$ (with a real number $r > 0$ and a vector $u$ of the space) such that $v, v' \in \operatorname{bd}(rB + u)$ and $w \notin \operatorname{int}(rB + u)$ (4.43) for all $w \in N \setminus \{v, v'\}$. This is the so-called empty circle condition, which means that a triangle appears in the DT if and only if its circumcircle encloses none of the other given points.
4.2.5
Steiner's Problem in spaces with a weaker triangle inequality
Up to now, we have used the triangle inequality as a property of the metric. It is conceivable that slight violations of the triangle inequality should not be too deleterious with respect to performance guarantees of an approximation. Andreae and Bandelt [15] consider the deviation from the triangle inequality captured by a parameter $\tau$ in the following relaxation:
$\rho(v, v') \le \tau \cdot (\rho(v, w) + \rho(w, v'))$ (4.44)
for all $v, v', w \in X$. Such a parametrized triangle inequality is given in the situation that the input data are from a fixed range of values. Assume that all distances under consideration are bounded by real numbers $L$ and $U$ in the following way:
$L \le \rho(v, v') \le U$ (4.45)
for different points $v$ and $v'$. For instance, for a network $G$ we have $L = 1$ and $U = \operatorname{diam} G$. If $L > 0$ then $\rho(v, w) + \rho(w, v') \ge 2L$, so that $U(\rho(v, w) + \rho(w, v')) \ge 2L\rho(v, v')$. Hence,
Observation 4.2.17 The metric $\rho$ satisfies the inequality (4.44) with the parameter
$\tau = \dfrac{U}{2L} \ge \dfrac{1}{2}$. (4.46)
This scenario applies to the minimum spanning tree approximation for Steiner's Problem: When the parameter $\tau$ approaches 1/2, the performance guarantee factor 2 decreases and eventually reaches 1; recall 3.5.2. We can see that the factor decreases when we make the additional assumption that, for some $\tau$ with $0 < \tau \le 1$, the set $N$ of given points satisfies the following inequality:
4
Euclidean plane:     NP-hard, NP-hard, Open, Polynomial
Rectilinear plane:   NP-complete, NP-complete, Polynomial, Polynomial
For a complete proof and additional comments compare [96]. In view of the hardness of finding a $\beta$-MST for $\beta = 2, 3, 4$, approximation techniques are of interest:
(Table with columns: $\beta$, Performance ratio, Source.)
The performance ratio is given relative to the length of an ordinary MST.
A more general version of degree-constrained trees is: The generalized BDMST Problem Given: A set $N = \{v_1, \ldots, v_n\}$ of (labelled) points in a metric space and a sequence
$\{\beta_1, \ldots, \beta_n\} \subseteq \{1, 2, \ldots\} \cup \{\infty\}$ (4.63)
of positive integers. Find: A tree $T = (N, E)$ interconnecting the points of $N$ with minimal length such that no vertex $v_i$ has a degree greater than $\beta_i$, $i = 1, \ldots, n$.
This problem is a generalization of the BDMST problem in which we assume that $\beta_1 = \ldots = \beta_n = \beta$, and of the MST problem in which we assume that $\beta_1 = \ldots = \beta_n = \infty$. It is easy to see that
Observation 4.2.37 A solution of the generalized BDMST problem exists if and only if
$\sum_{i=1}^{n} \beta_i \ge 2(n - 1)$.
Clearly, we look for an ordinary MST if all $\beta_i = \infty$. More generally, if just one degree constraint for just one point is given, meaning only for one value $j$ is $\beta_j \ne \infty$, the problem is solvable in polynomial time using a "quasi-greedy" algorithm, because this problem is linear-time equivalent to the unconstrained minimum spanning tree problem, see [173], [307]. The general case in which more than one degree is constrained has been shown to be NP-hard [179]. Our well-known methods for computing an MST create a heuristic to find degree-constrained trees with minimal length: In each step where a new edge is added check whether the constraints are satisfied. In general the problem of finding a spanning tree with a bounded number of leaves is NP-complete [179]. We will now see that the number of leaves in a $\beta$-MST for a set of $n$ points cannot be too large. Consider such a tree; then, in view of 1.2.5, we have
where $n_i$ denotes the number of vertices of degree $i$ and $\Delta$ is the maximum degree in the tree. Hence,
Theorem 4.2.38 (C. [96]) In each metric space where a number $c'$ exists, a $\beta$-MST has at most
leaves, where $\Delta = \min\{\beta, c'\}$.
4.2.10
Small diameters and spanners
The previous problems have all been based on the length of the tree constructed. We now consider another criterion for the quality of a tree:
The Minimum Diameter Spanning Tree Problem Given: A finite set $N$ of points in a metric space. Find: A tree $T = (N, E)$ interconnecting the points of $N$ such that
$\operatorname{diam} T = \max_{v, v' \in N} L(T(v, \ldots, v')) = \text{Min!}$ (4.66)
A solution is called a Minimum Diameter Spanning Tree (MDST) for $N$. It is easy to see that a typical MST can have large diameter. But surprisingly, it can be shown that there exists an MDST such that the longest path in the tree consists of no more than three edges: Lemma 4.2.39 (Ho et al. [220]) For any set of given points in the Euclidean plane there is an MDST in which there are at most two internal vertices.
Proof. Let $T = (N, E)$ be an MDST. We perform a sequence of diameter-preserving transformations until it is in the above form. Let
$T' = v_1, v_2, \ldots$ with edge set $E'$
be a longest path in $T$. Consider the forest $G = (N, E \setminus E')$. For each vertex $v$ of $T'$, let $T_v$ denote the tree in $G$ containing $v$. For any other vertex $u$, let $P_u$ denote the vertex $v$ such that $u$ is in $T_v$. Then we construct a new tree
$T_1 = T' \cup (N, \{uP_u\})$. (4.68)
$T_1$ has the same diameter as $T$, since the distance between any two vertices cannot increase except when they are in the same tree $T_v$, and in that case, by the assumption for $T'$, the distance of each point to $v$ is less than the distance from $v$ to the endvertices of $T'$. Now suppose that $T'$ has four or more edges and the length of the path $v_1, v_2, v_3$ is at most half the length of $T'$. Form a tree $T_2$ by removing every edge $uv_2$ and reconnecting each such vertex $u$ to $v_3$. This can only decrease the lengths of paths already going through $v_3$ in $T_1$. So the only pairs of vertices with increased path lengths are those newly connected to $v_3$. But the length of any such path is at most twice the length of the path $v_1, v_2, v_3$, so the diameter of $T_2$ is no more than that of $T$. Each repetition of this transformation decreases the number of edges in $T'$ until it is at most three, and preserves the property that each vertex is within one edge of $T'$.
Consequently, Theorem 4.2.40 (Ho et al. [220]) In the Euclidean plane, we can find an MDST in cubic time. The problem of finding a spanning tree whose length and diameter are both minimal is NP-hard, see Ho et al. [220].
A related problem: Let $t > 1$ be a real number. Consider a finite set $N$ of points in a metric space $(X, \rho)$. We intend to design a graph $G = (N, E)$ that approximates the complete graph $G = (N, \binom{N}{2})$ with length-function $f : E \to \mathbb{R}$, $f(vv') = \rho(v, v')$, in the following sense:
1. $|E| = O(|N|)$. In particular, this is satisfied if $G$ is a planar graph.
2. For each pair $v, v' \in N$ there is a shortest path $T(v, \ldots, v')$ in $G$ that connects the vertices $v$ and $v'$ in the graph $G$, and it holds that
$L(T(v, \ldots, v')) \le t \cdot \rho(v, v')$. (4.69)
Such a network is called a t-spanner for $N$. The existence of t-spanners in each Banach-Minkowski ... 1, and let $t > 1$ be a given real number. Then there exists a number $c = c(M_d(B), t)$ such that each finite set $N$ of points in $M_d(B)$ has a t-spanner with at most $c \cdot |N|$ edges. An improved version of this theorem gives a procedure to construct t-spanners. The procedure is the following greedy algorithm: Procedure 4.2.42 A network $G = (V, E, f)$ and a real number $t > 1$ are given.
1. Sort the edges in $E$ in non-decreasing order of the lengths;
2. Let $E' := \emptyset$; $G' := (V, E')$;
3. For each edge $vv'$ from the sorted list of $E$ do: if $\rho_{G'}(v, v') > t \cdot f(vv')$ then $E' := E' \cup \{vv'\}$ and $G' := (V, E')$;
4. Stop when all edges are checked.
Then $G' = (V, E')$ is a t-spanner of $G$. This procedure needs $O(n^3 \log n)$ time. For a faster method to construct spanners in the $L_p$-spaces see Chandra et al. [76], [77]. Spanners in Euclidean spaces are discussed by Salowe [376]. An application of our considerations in networks is given by Prisner [333]: of course, if $T$ is a spanning tree for the graph $G$, we have $\rho_G(v, v') \le \rho_T(v, v')$ for all vertices $v$ and $v'$. Prisner constructs spanning trees with $\rho_T(v, v') \le t \cdot \rho_G(v, v')$ for specific classes of networks and numbers $t$.
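The greedy procedure 4.2.42 can be sketched in Python as follows. The sketch is ours and makes some assumptions: edges are given as triples (length, v, w) with comparable vertex names, and the distance in the current graph $G'$ is computed by a Dijkstra search that stops expanding once the bound $t \cdot f(vv')$ is exceeded. The names `distance` and `greedy_spanner` are hypothetical.

```python
import heapq

def distance(adj, source, target, cutoff):
    """Distance from source to target in the current spanner.

    Vertices farther away than cutoff are not expanded, so the result is
    either the true distance or some value larger than cutoff.
    """
    done, heap = set(), [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == target:
            return d
        if v in done or d > cutoff:
            continue
        done.add(v)
        for u, length in adj[v].items():
            if u not in done:
                heapq.heappush(heap, (d + length, u))
    return float("inf")

def greedy_spanner(vertices, edges, t):
    """edges: list of (length, v, w); returns the edge list E' of the spanner."""
    adj = {v: {} for v in vertices}           # step 2: E' is empty
    spanner = []
    for length, v, w in sorted(edges):        # step 1: non-decreasing lengths
        if distance(adj, v, w, t * length) > t * length:   # step 3
            adj[v][w] = adj[w][v] = length
            spanner.append((length, v, w))
    return spanner

triangle = [(1.0, "a", "b"), (1.0, "b", "c"), (1.5, "a", "c")]
loose = greedy_spanner("abc", triangle, t=2.0)
tight = greedy_spanner("abc", triangle, t=1.2)
```

With $t = 2$ the long edge of the triangle is dropped, since the detour of length 2 is within the bound 3; with $t = 1.2$ the bound is 1.8 and the edge is kept.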
A NEW CHALLENGE: THE PHYLOGENY
As it became accepted that evolution was to be understood in terms of Mendelian genetics and Darwinian natural selection, so too it became clear that this understanding could not be sought only at a qualitative level. A fundamental problem is the reconstruction of species' evolutionary past, which is called the phylogeny of those species. Trees are widely used to represent evolutionary relationships. In biology, for example, the dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by the splitting of one population into two or more populations that do not crossbreed, rather than from the mixing of two populations into one. Here, the high-level history of life is ideally organized and displayed as a tree. A phylogenetic tree is an evolutionary tree for a given set of taxa.¹ Trees may also be used to classify individuals of the same species. In historical linguistics, trees have been used to represent the evolution of languages, while in the branch of philology known as stemmatology, trees may represent the way in which different versions of a manuscript arose through successive copying. Often trees are used to describe the relatedness of objects which have developed from a common ancestor. In [222] we find an evolutionary tree showing the architectural connections and influences during the development of parallel computers from the early 1950s; in [360] we see a tree showing the history of the common computer languages.
We will discuss the problem of reconstruction of phylogenetic trees in our sense of shortest connectivity. To do this we introduce so-called phylogenetic spaces. These are metric spaces whose points are arbitrary words generated by letters (or symbols) from some (finite) alphabet, and whose metric measures "sameness" of words according to some cost measure on the letters, or a similarity of the words generated by a scoring system.
¹ Such a tree may be called a "phylogeny", a "dendrogram", or a "cladogram". We will define phylogenetic trees more precisely in the following sections.
5.1
PHYLOGENETIC TREES
Nothing in biology makes sense except in the light of evolution.
Theodosius Dobzhansky
The most surprising application of Steiner's Problem is in the area of phylogenetics. Trees are widely used to represent evolutionary, historical, or hierarchical relationships in various fields of classification. The underlying principle of phylogeny is to try to group "living entities" according to their level of similarity. In biology for example, such trees ("phylogenies") typically represent the evolutionary history of a collection of extant species or the line of descent of some gene. No two members of a species are exactly the same - each has slight modifications from their parents. As environmental conditions change, nature will favour that branch of a species with some particular modification; as time goes on another mutation of the basic stock will become dominant. In this way, all species are continually evolving. This evolution occurs in a number of ways at the same time: some species die out and some become new species in their own right. This was already seen by Darwin [120]. He recognised that the characteristics which identified the species could indicate a history of descent, that is, a tree of evolution. Darwin wrote:
The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green and budding twigs may represent existing species; and those produced during each former year may represent the long succession of extinct species ... The limbs divided into great branches, and these into lesser and lesser branches, were themselves once, when the tree was small, budding twigs; and this connexion of the former and present buds by ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups ...
From the first growth of the tree, many a limb and branch has decayed and dropped off, and these lost branches of various sizes
may represent those whole orders, families, and genera which have now no living representatives, and which are known to us only from having been found in a fossil state ... As buds give rise by growth to fresh buds, and these, if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications.
Historically, this was a new idea: The concept of species having a continuity through time was only developed in the late 17th century; higher life forms were no longer thought to transmute into different kinds during the lifetime of an individual. It took over 150 years from the development of this concept before a rooted tree was proposed by Darwin. Note that in Darwin's fundamental book The origin of species [120] there is exactly one figure, and this shows the description of the evolutionary history by a tree. In other words, Darwin means that his theory of evolution, today called Darwinism, implies the existence of an evolutionary tree. The phylogenetic tree can therefore be thought of as a central metaphor for evolution, providing a natural and meaningful way to order data, and with an enormous amount of evolutionary information contained within its branches.
Clearly, this idea is attractive, but how are we to find the tree? Note that there are several difficulties, even in the definition of the problem: What is the tree of life? A tree which is given by a classification or the evolutionary tree? What is the mechanism of evolution? Darwin provided mutation and natural selection, which suggested a scientific model for the relation of species. Darwin's evolutionary tree is neither obvious, nor easy to find.
There must be some criterion for deciding which of the many phylogenies that may be drawn most closely resembles the actual evolutionary changes. Darwin saw another difficulty in the underlying problems. In a letter to Huxley he wrote: "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature."
Considering the origin of life: Was there just one, or more than one "starting point"? What do we know about the last universal common ancestor, if it exists? It has been argued that the "Tree of Life" is perhaps really a "Web of Life", as mechanisms such as hybridization, recombination and swapping of genes probably play a role in evolution.
A nice representation of this subject has been given by Davies [122], Pennisi [336], and Ward and Brownlee [443]. A survey about What Evolution Is was given by Mayr [301]. For the history of Darwin's theory compare Bowler [54] and Weber [450].
Each species can be described in terms of a sequence of specific values, called characters. These characters were originally morphological, that is, derived from an analysis of an organism's form and structure, but how are these values measurable? In biology, "characters" describe attributes of the species under consideration and are the data that biologists typically use to reconstruct phylogenetic trees. We wish to consider characters for species in a morphological sense. To do this we assume that there is given a (finite or infinite) state space $C$ of characters. We also assume that there is a metric in $C$. Discrete character data are those for which a function $f$ assigns a character state $f_{ij}$ to each taxon $i$ for each character $j$. The most important problem in morphological phylogenetics is selecting the characters. Here opposing side picking out is the favourite method. On the other hand, characters must be coded if there are more than two distinct possibilities. We think of characters as independent variables. This assumption is common to virtually all character-based methods. If we could not assume independence, we would be forced to take covariance among characters into account, and the computational methods would by necessity become more complicated. Another assumption required of character data is that the characters be homologous; that means that a character must be defined in such a way that all of the states observed over taxa for that particular character must have been derived from a corresponding state observed in the common ancestor of those taxa. As sequence data became readily available, an end to this conflict was predicted.
Now, the biological units are written in words constructed from the letters corresponding either to amino acids, which generate proteins, or to nucleotides forming DNA or RNA molecules. By comparing such words one can construct
evolutionary (phylogenetic) trees showing how closeness of the words in the tree corresponds to the closeness of the unit. In other words,
The Phylogenetic Tree Problem
Given: A set of sequences, each representing a taxon.
Find: Their phylogenetic tree, representing its evolutionary history.
The set of leaves represents the given taxa, the internal vertices are the ancestors, and the root of the tree represents the common ancestor of all. The phylogenetic tree of life shows when groups of organisms arose and gives the basic relationships between them.
Molecular sequence data was first used by Fitch and Margoliash in their landmark paper [161] from 1967 dealing with cytochrome c sequences. The basic idea in that field is that species (given by their sequences) which appear to be closely related should have diverged more recently than species which appear to be less closely related. To find such a phylogenetic tree we construct a metric space which forms a model for the phylogeny. More precisely, Bern and Graham [45]:
David Sankoff of the University of Montreal and other investigators defined a version of the Steiner problem in order to compute plausible phylogenetic trees. The workers first isolate a particular protein that is common to the organisms they want to classify. For each organism they then determine the sequence of the amino acids that make up the protein and define a point at a position determined by the number of differences between the corresponding organism's protein and the protein of other organisms. Organisms with similar sequences are thus defined as being close together and organisms with dissimilar sequences are defined as being far apart. In a shortest network for this abstract arrangement of given points, the Steiner points correspond to the most plausible ancestors, and edges correspond to relations between organisms and ancestors that assume the fewest mutations.
The latter remark explains the importance of trees having the least possible length in phylogenetic spaces for the investigation of evolutionary relations.
This approach to Evolution Theory was suggested first by Fitch [162] in 1971, and also explicitly written by Foulds et al. [170], [395] in 1979. Unfortunately, this idea does not give a simple method.² Again, Bern and Graham [45]:
Since the phylogenetic Steiner problem is no easier than other Steiner problems, however, the problem - except as it is applied to small numbers of organisms - has served more as a thought experiment than as a practical research tool.
In other words, reliable tree building algorithms do not (yet) exist. On the other hand, for specific questions, examples, and investigations this approach will be helpful. Hence, it seems impossible to describe the "Great Darwin Tree" since the diversity of the living world is staggering: more than two million existing species of plants and animals have been named and described; many more - both existing and past - remain to be discovered. On the other hand, it will be useful to describe the phylogeny between several organisms by their DNA sequences taken from their genomes. On this topic Vingron et al. [436] wrote:
Many similar DNA sequences from different species have common ancestors in evolution. The relationships among sequences are described by a phylogenetic tree. Phylogenetic trees do not merely allow for an exact classification of life forms, but also give hints to yet unknown properties of organisms, as well as insight into mechanisms of evolution. This holds true even for comparatively short periods of time, for example the evolution of the HIV virus ... The notion of a Steiner Tree subsumes both tree topology and multiple alignment. In a graph that has biological sequences as nodes, edges represent evolutionary operations that modify sequences. This view of the problem ... unifies two optimization steps that are commonly treated separately - the Multiple Alignment and the Parsimony problem. By treating the two problems at the same time one can hope for better results in terms of the simplicity of the resulting tree.
² And seems to have been rather forgotten in the field of biology after tree-building program packages became widely available.
The principle of Maximum Parsimony involves the identification of a combinatorial structure that requires the smallest number of evolutionary changes. It is often said that this principle abides by Ockham's razor,³ according to which the best hypothesis is the one requiring the smallest number of assumptions. Or in other words:
(a) It is futile to do with more what can be done with fewer.
(b) More precisely in Latin: Entia non sunt multiplicanda praeter necessitatem.
(c) More roughly spoken: Keep it simple.
This is true, but not in a simple sense. Cavalli-Sforza [72]:
... it does not necessarily follow that a method of tree reconstruction minimizing the number of mutations is the best or uses all the information contained in the sequences. The minimization of the number of mutations is intuitively attractive because we know that mutations are rare. There may be some confusion, however, between the advantage of minimizing the number of mutations and the sometimes invoked parallel of Ockham's razor ..., which was developed in the context of medieval theology. The extrapolation of Ockham's razor to the number of mutations in an evolutionary tree is hardly convincing.
Note that in this case minimizing the number of assumptions does not mean minimizing the number of mutations, or the steps of an evolution; it means that among all possible network structures we seek one which satisfies only few conditions. With the "razor", Ockham cuts out all superfluous, redundant explanations. As a conclusion, we find that Steiner Minimal Trees in sequence spaces are Maximum Parsimony Trees. And in this sense, we will investigate Steiner's Problem in spaces of sequences equipped with any desired metric.⁴ It means that among all possible structures we seek one which satisfies only one condition, namely that of minimal length. What other condition can be more natural in a metric space? For the biological background and a more detailed discussion of these problems see Graur and Li [191], v. Haeseler and Liebers [202], and Page and Holmes [331]. In particular, a broader discussion of the application of the principle of Maximum Parsimony can be found in Farris and Kluge [158].
³ For a broader philosophical discussion of Ockham's razor see Brown [57] and Russel [371], [373].
⁴ Note that parsimony does not point to the root of the tree. To find the root, we need additional information.
Note that this approach to describing the evolutionary history has a deep consequence for the following question: Is evolution a scientific theory? On this topic, Hendy [214] recalls:
I began a mathematical study into evolution, after attending a debate, at Massey University in 1973, between a creationist and a local scientist, on the Theory of Evolution. The creationist made reference to the work of the philosopher of scientific process, Karl Popper. Popper [350] had stated that "Darwinism is not a testable scientific theory, but a metaphysical research program - a possible framework for testable scientific theory". I discussed this issue with a colleague at Massey University, David Penny, who had a research interest in molecular evolution. David suggested a mechanism that might provide a testable hypothesis that we could apply to the theory of evolution to meet Popper's criterion for a scientific theory. We succeeded in this quest [339], using the tree building method of "Maximum Parsimony" to derive evolutionary trees from a number of independent protein sequences, for a common set of mammalian species. We then compared the resultant trees.
Compare also Penny, Hendy and Poole [342]. Moreover, in this sense, each organism is an experiment for the hypothesis of biology, in particular, of evolution. The principle of Ockham's razor suggests that one should choose the simplest possible hypothesis. For more facts about the denial of the theory of evolution compare Pigliucci [344].
Note an essential difference in the application of Steiner's Problem in engineering and in biology. In the first case we search for a tree which is as short as possible: an approximation may be acceptable. In the second case we look for the shortest tree (or all shortest trees); i.e., we are interested in an exact solution. Here, an approximation gives only an upper bound for the length of an SMT. Moreover, the idea grew out of an investigation into the accuracy of an SMT. It is not possible to directly test the "accuracy" of such a tree-building method, as the "true evolutionary tree" is not, and in general cannot be, known with certainty.⁵
⁵ For example, consider the phylogenetic tree for Darwin's finches in [188].
5.2
PHYLOGENETIC SPACES
Einstein said: "God does not play dice." He was right. God plays scrabble.
Philip Gold
We will introduce metric spaces which are of interest to describe the genetic data in evolutionary processes. Here the input data is a set of sequence information. The sequence information is usually DNA, RNA or protein sequences. In more detail:
DNA sequences are the information-containing molecules and are composed of nucleotides from an alphabet of four letters.⁶ The DNA of an organism plays a central role in its existence. Its sequential arrangement forms chromosomes. These strings may be millions of nucleotides long, measured in base pairs (bp). The entire set of genetic information of an organism is called its genome. Fitch [163] gives the following exemplary genome sizes:
Domain       Organism   Size (bp)
Viruses      HIV        9 · 10^3
Bacteria     E. coli    4 · 10^6
Eukaryotes   mammals    3 · 10^9
Roughly speaking, the order of genome size is kbp, Mbp and Gbp for Viruses, Prokarya and Eukarya, respectively.
Proteins, which are the operational molecules, are composed of amino acids from an alphabet of 20 letters. Typical proteins contain about 300 amino acids (aa), but there are proteins with fewer than 100 or as many as 5000 aa. Structural proteins act as tissue building blocks, whereas other proteins known as enzymes act as catalysts of chemical reactions.
RNA sequences, which stand between DNA and protein, are composed of nucleotides from an alphabet of four letters.
⁶ The informational aspect combined with the massive parallelism and the complementarity in the double strand present the possibility of a computing paradigm which is rather different from those customary in present-day computer science. For a survey about this "DNA Computing" see Păun et al. [358].
(It is remarkable that the molecules which are the carriers of information and the operational units which make life work are all linear polymers.) The Central Dogma of Molecular Biology⁷ describes the roles of these polymers: DNA acts as a template to replicate itself, DNA is also transcribed into RNA, and RNA is translated into protein. So we start our investigations with spaces of these sequences (strings) reflecting the "written nature of life".
5.2.1
Alphabets and words
An alphabet $A$ is a nonempty and finite set of distinguished letters (or symbols). If $A$ contains exactly one letter, all further discussed concepts and problems are senseless or trivial, respectively. Hence, we assume that $A$ contains at least two elements. If $A$ contains exactly two letters it is called a binary alphabet. Important examples of alphabets are:
$A = \{0, 1\}$ is an alphabet which plays a central role in coding theory. Moreover, we consider a word of 0's and 1's as a description of some individual, perhaps a genetic sequence in which each entry may take on one of two possible values.
$A = \{a, c, g, t\}$ is the alphabet which codes the nucleotides of a DNA molecule, where $a$ stands for adenine, $c$ for cytosine, $g$ for guanine and $t$ for thymine. A similar alphabet, namely $A = \{a, c, g, u\}$, is used for the nucleotides of RNA, where $u$ codes for uracil. Derived from this alphabet there is a binary alphabet $A' = \{r, y\}$ in which $r$ codes for a purine ($a$ or $g$), and $y$ codes for a pyrimidine ($c$ or $t$).
The amino acids commonly found in proteins are coded by the alphabet $A = \{ala, arg, \ldots, val\}$, where the letters abbreviate the amino acids alanine, arginine, ..., valine. In the usual genetic code $|A| = 20$ amino acids are coded.
The English language needs 26 letters: A, B, ..., Y, Z, and a letter for the empty space. German needs several letters more: Ä, Ö, Ü, ß.
⁷ Sometimes also called "The Holy Trinity of Molecular Biology".
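The passage from the DNA alphabet to the derived binary alphabet {r, y} can be made concrete with a short sketch. The function name `to_ry` is hypothetical and chosen only for this illustration; the mapping itself is the one stated above (a, g are purines; c, t are pyrimidines).

```python
DNA = frozenset("acgt")
PURINES = frozenset("ag")


def to_ry(word):
    """Recode a word over {a, c, g, t} into the binary alphabet {r, y}."""
    if not set(word) <= DNA:
        raise ValueError("word contains letters outside {a, c, g, t}")
    # purines (a, g) become r; pyrimidines (c, t) become y
    return "".join("r" if letter in PURINES else "y" for letter in word)


print(to_ry("gattaca"))
```

Two DNA words that differ only by exchanging a with g, or c with t, receive the same recoded word, so this coding keeps only the purine/pyrimidine pattern.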
A word over an alphabet $A$ is a finite sequence of letters from $A$. The length $|w|$ of the word $w$ is the number of letters composing it. We additionally define an empty word $\lambda$ of length 0. Note that the description of a word contains a left-to-right order of the letters. We will write $w = a_1 a_2 \ldots a_d$ for a word $w$ consisting of the letters $a_1, a_2, \ldots, a_d$ in this order; or, using the notation for algorithms, $w = a[1]a[2] \ldots a[d]$ for a one-dimensional array; then we will also speak about sequences or strings. The letter $a_i = a[i]$ in the word, or sequence respectively, is called the $i$-th position. We say that two words $w = a_1 a_2 \ldots a_d$ and $w' = b_1 b_2 \ldots b_{d'}$ over the same alphabet are equal, and we write $w = w'$, if $d = d'$ and $a_i = b_i$ for all $i = 1, \ldots, d$.
Let $w = a_1 a_2 \ldots a_d$ and $w' = b_1 b_2 \ldots b_{d'}$ be two words over the same alphabet $A$. The concatenation of $w$ and $w'$, written $ww'$, is the word $a_1 a_2 \ldots a_d b_1 b_2 \ldots b_{d'}$ over $A$. Hence, $|ww'| = |w| + |w'|$. Moreover, we will write $w^k = w w \ldots w$ ($k$ times) and $w^0 = \lambda$ for each word $w$.
The set $A^d$ contains all words over $A$ with length exactly $d$. Clearly, $A^0 = \{\lambda\}$, $A^1 = A$, and
$$|A^d| = |A|^d. \tag{5.1}$$
$A^{\le d}$ denotes the set of all words of length at most $d$; and we have
$$|A^{\le d}| = \sum_{i=0}^{d} |A|^i = \frac{|A|^{d+1} - 1}{|A| - 1}.$$
In particular, when a set of words contains only words of a predetermined bounded length, then this set is finite. More about the combinatorics of words can be found in [288] and [296]. If there is an order $\le$ of the letters in $A$, then the set $A^d$ is endowed with the following partial order [...] theory, namely in the sense of error-correcting codes. More precisely: If (code-)words are transmitted then it is possible that errors will arise and so the received words may differ from those that were sent. The basic idea behind an error-correcting code is to choose the words to be sufficiently different from each other so that even if some error in transmission occurs, each received word is closer to the transmitted word than to any other. This is the concept of distance between words.¹¹ Compare Hankerson et al. [208] or Schulz [388] for a common description of information and coding theory, and Casti [66] or Yockey [470] for its application in molecular biology.
$(A^d, \rho_H)$ is also a graph, realized by defining $A^d$ as the set of vertices and making two vertices $w, w' \in A^d$ adjacent if and only if $\rho_H(w, w') = 1$. A specific space is the so-called $d$-dimensional hypercube
$$Q_d = (\mathbb{B}^d, \rho_H),$$
where $\mathbb{B} = \{0, 1\}$. That is the graph whose set of vertices consists of all binary vectors of size $d$, with an edge joining two vectors if and only if they differ in exactly one coordinate.¹² The hypercube has the following properties:
(a) $Q_d$ has $2^d$ vertices and $d \cdot 2^{d-1}$ edges;
(b) it is a bipartite graph;
(c) each vertex in $Q_d$ has degree $d$;
(d) the diameter of $Q_d$ equals $d$, and for a given vertex $w$ there is a unique vertex $w'$ with $\rho_H(w, w') = d$; and
(e) we may also define $Q_d$ inductively by letting $Q_0$ be a single vertex and then obtaining $Q_d$ by taking two copies of $Q_{d-1}$ and joining corresponding vertices.¹³
¹¹ For example, when for any two different code-words $w$ and $w'$ of a binary code the inequality $\rho_H(w, w') \ge 2t + 1$ holds, with a desired chosen integer $t$, then the code can correct errors affecting up to $t$ binary digits.
¹² This is the specific form of the fact that each finite metric space is essentially a graph.
The metric space $(A^d, \rho_H)$ has a strange property: on one hand, it is a "big" space, since it contains $|A|^d$ many points; on the other hand, it is a "small" space, since its diameter equals $d$.
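Properties (a) to (d) can be checked by brute force for small $d$. The following sketch is an illustration only; the names `hamming`, `hypercube` and `check_hypercube` are hypothetical, and the diameter is computed from the Hamming distance, using the fact that the graph distance in $Q_d$ coincides with it.

```python
from itertools import product


def hamming(w, w2):
    """Number of positions in which two words of equal length differ."""
    if len(w) != len(w2):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(a != b for a, b in zip(w, w2))


def hypercube(d):
    """Vertices and edges of Q_d, with binary words as vertices."""
    vertices = ["".join(bits) for bits in product("01", repeat=d)]
    edges = [(v, u) for i, v in enumerate(vertices)
             for u in vertices[i + 1:] if hamming(v, u) == 1]
    return vertices, edges


def check_hypercube(d):
    vertices, edges = hypercube(d)
    degree = {v: 0 for v in vertices}
    for v, u in edges:
        degree[v] += 1
        degree[u] += 1
    return {
        "vertices": len(vertices),                        # (a) 2^d
        "edges": len(edges),                              # (a) d * 2^(d-1)
        # (b) every edge joins words with different parity of 1's
        "bipartite": all(v.count("1") % 2 != u.count("1") % 2
                         for v, u in edges),
        "regular": all(k == d for k in degree.values()),  # (c)
        "diameter": max(hamming(v, u)
                        for v in vertices for u in vertices),  # (d)
        "unique_antipode": all(
            sum(hamming(v, u) == d for u in vertices) == 1
            for v in vertices),                           # (d)
    }
```

Calling `check_hypercube(d)` for growing $d$ makes the contrast between the two quantities visible: the number of vertices doubles with each step, while the diameter grows only by one.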
For some deep consequences of this observation for molecular evolution see Eigen [153].
Now consider the set $A^*$ of all words over the alphabet $A$. The edit distance $\rho_L$ between two words of not necessarily equal length is the minimal number of "edit operations" required to change one word into the other, where an edit operation is a deletion, insertion, or substitution of a single letter in either word. This distance is also called the Levenshtein distance, since it was introduced by Levenshtein [284] in connection with error-correcting codes. As an example consider the two German words $w$ = APFEL and $w'$ = PFERD, where we have $\rho_L(w, w') = 3$. It is not difficult to see that the problem of computing the Levenshtein distance between two words $w$ and $w'$ is solved by a serial algorithm in $O(|w| \cdot |w'|)$ time, through dynamic programming. We will discuss this method more precisely and generally later. The set $A^*$ equipped with the Levenshtein distance is called the phylogenetic space (over $A$). $(A^*, \rho_L)$ is an infinite discrete metric space. More precisely: let $w$ and $w'$ be two words in $(A^*, \rho_L)$, $|A| \ge 2$. Then
$$\max\{|w|, |w'|\} \ge \rho_L(w, w') \ge \bigl| |w| - |w'| \bigr|.$$
In particular, the second inequality implies that any bounded set of words is a finite set. At first glance, it seems that the sequence spaces are subspaces of the phylogenetic space, but this is not true: Consider the two words $v = (ab)^d$ and $w = (ba)^d$; then $\rho_L(v, w) = 2$ but $\rho_H(v, w) = 2d$.
To extend the Hamming distance to a metric for all words we may proceed as follows: Let $A$ be a set of letters. Add a "dummy" letter "-" to $A$. We define a map
$$cl : (A \cup \{-\})^* \to A^* \tag{5.8}$$
deleting all dummies in a word from $(A \cup \{-\})^*$. Then for two words $w$ and $w'$ in $A^*$ we define the extended Hamming-distance as
$$\min\{\rho_H(v, v') : v, v' \in (A \cup \{-\})^*,\ |v| = |v'|,\ cl(v) = w,\ cl(v') = w'\}.$$
¹³ In view of this fact we have that $Q_d$ must be Hamiltonian; compare [185].
Observation 5.2.1 The extended Hamming-distance coincides with the Levenshtein metric.
In one sense, the phylogenetic space is of interest in "pure" network design. Remember that we showed that the Steiner ratio of any metric space is at least 0.5, but we did not describe a space with this value as the Steiner ratio. To determine the Steiner ratio of the phylogenetic space, consider the words $w_i$ which consist of the letter $a$ repeated $d$ times, except the $i$-th position where another letter $b$ is located, $i = 1, \ldots, d$. Then define the set
$$N(d) = \{w_i : i = 1, \ldots, d\}$$
of $d$ points. For $i \ne j$ it holds that $\rho_L(w_i, w_j) = 2$. Hence,
$$L(\text{MST for } N(d)) = 2(d - 1). \tag{5.11}$$
The word $w = a \ldots a$ has distance 1 to any $w_i$. Consequently, the star with the center $w$ and the leaves $w_i$, $i = 1, \ldots, d$, is an SMT for $N(d)$ for which
$$L(\text{SMT for } N(d)) = d. \tag{5.12}$$
Both equations (5.11) and (5.12) give
$$m(A^*, \rho_L) \le \frac{L(\text{SMT for } N(d))}{L(\text{MST for } N(d))} = \frac{d}{2(d - 1)}$$
for $d \ge 2$. Now, we have found a metric space which, over all positive integers $d$, achieves the lower bound 0.5 for the Steiner ratio:
Theorem 5.2.2 For the Steiner ratio of the phylogenetic space $(A^*, \rho_L)$, $|A| \ge 2$, it holds that
$$m(A^*, \rho_L) = \frac{1}{2}. \tag{5.14}$$
Note that we don't have a finite set $N_0$ of points such that
$$\frac{L(\text{SMT for } N_0)}{L(\text{MST for } N_0)} = \frac{1}{2},$$
and, moreover, in view of 3.5.5, we cannot find such a set.¹⁴
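The dynamic programming computation of the Levenshtein distance mentioned in this section, and the point set $N(d)$ used for the Steiner ratio, can be sketched as follows. This is a minimal illustration, not the general method announced for later sections; the names `levenshtein`, `mst_length` and `star_and_mst` are hypothetical, and the minimum spanning tree is found by a simple version of Prim's algorithm. The sketch only compares the length of the star with the length of an MST; it does not verify that the star is a shortest tree.

```python
def levenshtein(w, w2):
    """Levenshtein distance in O(|w| * |w2|) time, row by row."""
    prev = list(range(len(w2) + 1))     # distances from the empty prefix
    for i, a in enumerate(w, 1):
        cur = [i]
        for j, b in enumerate(w2, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def mst_length(points, distance):
    """Length of a minimum spanning tree (Prim, quadratic version)."""
    in_tree = {0}
    best = [distance(points[0], p) for p in points]
    total = 0
    while len(in_tree) < len(points):
        j = min((k for k in range(len(points)) if k not in in_tree),
                key=best.__getitem__)
        total += best[j]
        in_tree.add(j)
        for k in range(len(points)):
            best[k] = min(best[k], distance(points[j], points[k]))
    return total


def star_and_mst(d):
    """Star length around a...a and MST length for the set N(d)."""
    center = "a" * d
    points = ["a" * i + "b" + "a" * (d - i - 1) for i in range(d)]
    star = sum(levenshtein(center, p) for p in points)
    return star, mst_length(points, levenshtein)
```

For growing $d$ the quotient of the two values returned by `star_and_mst(d)` follows $d / (2(d-1))$ and so approaches 1/2 without reaching it.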
5.2.3
Distance and similarity
In the biological context the equality of words makes no sense, since mutations do not allow identical sequences in reality. On the other hand, in biomolecular sequences, high sequence similarity usually implies significant functional and structural similarity.¹⁵ Let $A$ be an alphabet. We consider the set $A^*$ of all words over $A$. Our interest is to define measures on $A^*$ which reflect the "proximity" of two words. Here, two different approaches are to be distinguished: distance and similarity. Historically, the origin of the first was the result of investigations for a rigorous mathematical solution to an important biological problem; the second was the result of a heuristic approach. We will introduce both measures in the greatest possible generality. This is necessary, since evolution, as reflected at the molecular level, proceeds by a series of insertions, deletions and substitutions of letters, as well as other far rarer mechanisms which we ignore here, since we observe not complete genomes, only genes or other "smaller" words.¹⁶
A cost measure $(c, h)$ is given by
• A function $c : A \times A \to \mathbb{R}_{\ge 0}$, which satisfies the following conditions:
(i) $c$ is non-negative: $c(a, b) \ge 0$;
(ii) $c(a, a) = 0$; and
(iii) $c$ is symmetric: $c(a, b) = c(b, a)$ for any $a, b \in A$.
• A positive real number $h$.
The substitution of a letter $b$ for a letter $a$ costs $c(b, a) = c(a, b)$. The insertion or deletion of a letter effectively transforms a non-gap letter in one word to a gap in the other. Since we do not know the direction of the change through time, it is useful to group both operations under the term indel. Each indel costs $h$. The distance $\rho(w, w')$ between two sequences $w, w' \in A^*$ according to a cost measure is the minimum of the costs running over all series of operations transforming $w$ into $w'$.
¹⁴ Similar considerations about the Steiner ratio of sequence spaces give [...] Consequently,
$$m(A^d, \rho_H) \approx \frac{1}{2} \tag{5.17}$$
if $d \gg 1$, see Foulds [167].
¹⁵ But note that the converse is, in general, not true. And in reality, for applications in biology it is sometimes necessary to take into account several other properties of the macro-molecules to measure their similarity, for instance structure, expression and pathway similarity, compare [248].
¹⁶ Note that gene trees and species trees may not match due to lineage sorting, hybridization, recombination and other events. We will discuss this question more extensively later.
Observation 5.2.3 The function $\rho$ is a pseudo-metric. If, moreover, the function $c$ satisfies the non-degeneracy property, i.e. that $c(a, b) = 0$ holds if and only if $a = b$, then $\rho$ is a metric.
Consequently a given cost measure for an alphabet $A$ generates a metric (or pseudo-metric) space $(A^*, \rho)$. Note that we do not assume that $c$ satisfies the triangle inequality, but we can assume this. The reason for this assumption is that even if we start with a cost measure $(c, h)$ that does not satisfy it, we can always define a new pair $(c', h)$ that does satisfy it and produces the same metric. Namely, if three letters $a_1$, $a_2$ and $a_3$ are such that $c(a_1, a_2) > c(a_1, a_3) + c(a_3, a_2)$, then every time we need to replace $a_1$ by $a_2$ we will not do it directly but rather replace $a_1$ by $a_3$ and later $a_3$ by $a_2$, producing the same effect at a lower cost. Moreover, by the same reasoning, the restriction of the metric $\rho$ to the alphabet itself need not be $c$. This is only true if the function $c$ satisfies the triangle inequality.¹⁷
¹⁷ Compare our investigations about the metric closure of networks, see 2.5.1. This also gives hints for our later work.
An example of a cost measure is given by $c(a, b) = 1$ for any pair a and b of different letters and $h = 1$. This creates the Levenshtein distance discussed in the section before. Another example: For the cost measure (c, h) defined by
and $h = 4$, we find $\rho(agc, a^3c) = 5$, $\rho(acg, a^3c) = 7$ and
The (pseudo-)distance $\rho(w, w')$ between two words w and w' is attained with some (finite) operation sequence transforming w into w'. Moreover,
Observation 5.2.4 This metric space $(A^*, \rho)$ is a discrete one, that means, if for a subset W of words over A it holds that
$\sup\{\rho(w, w') : w, w' \in W\} < \infty$   (5.18)
then also
$|W| < \infty$.   (5.19)
To see this we recall that:
1. If we consider the substitutions, there are at most $|A|$ possible different letters at each position;
2. The "gap penalty" is chosen as a positive real. Hence, the distance between two words bounds the difference of their lengths:
$\rho(w, w') \ge h \cdot \bigl|\, |w| - |w'| \,\bigr|$,
which is in any case a positive real if the words are of different lengths. Consequently, in a bounded set of words there are at most finitely many different ones.
Another approach uses similarity. The procedure used to find such a quantity is called sequence alignment and depends on a scoring system.
Given two sequences w and w' over the same alphabet, an alignment of w and w' is a partial mapping from letters in w to w', or vice versa, which preserves the left-to-right ordering. Such an alignment can be represented by a diagram with aligned letters above each other, and unaligned letters placed opposite gaps. An alignment can be viewed as a way to extend the sequences to be of the same length using gaps or "dummy symbols". For instance consider the two words $w = ac^2g^2t^2$ and $w' = agct$. The following arrays are both alignments for w and w':

a c c g g t t
a g c t - - -

and

a c c g g t t
a - - g c t -

where "-" denotes a "dummy" symbol. In other words, we are searching for a diagram such that
(i) The elongated sequences are of the same length;
(ii) There is no position for which the elongated sequences both have a dummy (i.e. we do not use pairs of dummies).
That means, a pairwise alignment for two words w and w' over an alphabet A is a $2 \times l$ array with values from $A \cup \{-\}$ and
$\max\{|w|, |w'|\} \le l \le |w| + |w'|$.
Consequently, there are only finitely many alignments for a given pair of sequences. Consider two words $w = a_1 a_2 \ldots a_n$ and $w' = b_1 b_2 \ldots b_m$. To count alignments is to identify aligned pairs $(a_i, b_j)$ and simply to choose subwords of w and w' to align. This gives
$\sum_{k \ge 0} \binom{n}{k} \binom{m}{k}$
alignments.¹⁸ Hence,
$\sum_{k \ge 0} \binom{n}{k} \binom{m}{k} = \binom{n+m}{n}$.   (5.22)
Observation 5.2.5 There are
$\binom{n+m}{n}$
alignments of two words with n and m letters, respectively. In particular, if both words have the same length n there are
$\binom{2n}{n}$ alignments.
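The count can be checked numerically. The following is a minimal sketch, not taken from the book: the function names are hypothetical, and the three-term recurrence used for the second count (where the orderings (a-, -b) and (-a, b-) are treated as distinct) is the standard one and is an assumption here.

```python
from functools import lru_cache
from math import comb


def alignments_by_matched_pairs(n: int, m: int) -> int:
    """Sum over k of the ways to pick k letters of w and k letters of w' to pair up."""
    return sum(comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


@lru_cache(maxsize=None)
def alignments_with_ordered_gaps(n: int, m: int) -> int:
    """Count alignments when (a-, -b) and (-a, b-) are distinct.

    Assumed recurrence: the last column is a gap in w', an aligned pair,
    or a gap in w.
    """
    if n == 0 or m == 0:
        return 1
    return (alignments_with_ordered_gaps(n - 1, m)
            + alignments_with_ordered_gaps(n - 1, m - 1)
            + alignments_with_ordered_gaps(n, m - 1))


if __name__ == "__main__":
    for n, m in [(1, 1), (3, 2), (7, 4)]:
        print(n, m,
              alignments_by_matched_pairs(n, m),
              comb(n + m, n),
              alignments_with_ordered_gaps(n, m))
```

For every pair (n, m) the first two printed counts agree, while the third is larger as soon as both words are non-empty.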
More about the combinatorics of alignments can be found in Waterman [447]. Furthermore, the elongated sequences in an alignment should be as similar as possible according to some predefined scoring system. Given an alignment between two words, we assign a score to it as follows: Each column of the alignment will receive a certain value depending on its contents and the total score for the alignment will be the sum of the values assigned to its columns. Let an alignment between two words be given. If a column has two identical symbols we will call it a match, two different symbols is called a mismatch, and finally, a space, that is a dummy in one row, is called a gap. More generally:
A scoring system (p, g) is given by
a symmetric function $p : A \times A \to \mathbb{R}$, and
a non-positive real number g.
The array of p is called the (substitution) score matrix. The value $p(a, b)$ scores pairs of aligned letters a and b. The penalty g is used to penalize gaps. In general, we assume that $p(a, a) > 0$, for $a \in A$, and $g < 0$.¹⁹
¹⁸ Here we do not count the pairs (a-, -b) and (-a, b-) as distinct. Otherwise, the number $f(n, m)$ of such alignments for two sequences of n and m letters fulfils the equality
$f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1)$,
which does not have a nice explicit description. But it can be shown that
$f(n, n) \sim (1 + \sqrt{2})^{2n+1} \, n^{-1/2}$,   (5.26)
see [446].
¹⁹ And unlikely substitutions are penalized with a negative score.
Clearly,
the selection of an appropriate score matrix is crucial for achieving "good" alignments. A scoring system assigns a value, called the score, to each possible alignment. The similarity $\operatorname{sim}(w, w')$ between two sequences $w, w' \in A^*$ according to a scoring system is the maximum of the scores running over all alignments of w and w'.²⁰ The concepts of distance and of similarity are essentially dual. More precisely:
Algorithm 5.2.6 Given a cost measure (c, h) and a constant K, we can define a scoring system (p, g) as follows:
$p(a, b) = K - c(a, b)$, $\quad g = \frac{K}{2} - h$,
under the constraint
$K \le 2h$.   (5.27)
And conversely, given a scoring system (p, g) with the property that $p(a, a) = K$ for all $a \in A$, we can define a cost measure (c, h) as follows:
$c(a, b) = K - p(a, b)$, $\quad h = \frac{K}{2} - g$,
under the constraints
$K \ge \max\{p(a, b) : a, b \in A\}$, and $K > 2g$.
²⁰ In a biological context a scoring matrix p is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment. Substitution matrices for amino acids are complicated because they reflect the chemical nature and the frequency of occurrence of the amino acids, see [20]. Such matrices for bases in DNA or RNA sequences are very simple: in most cases, it is reasonable to assume that a:t and g:c occur in roughly equal proportions. But sometimes the following score matrix is used:
In other words, we have the following interrelation between a cost measure (c, h) and a scoring system (p, g):
$p(a, b) + c(a, b) = K$ and $g + h = \frac{K}{2}$
for all $a, b \in A$, which obviously reflects the duality. Roughly speaking, "large distance" is "small similarity" and vice versa. Moreover, distance computation can be reduced to similarity computation:
Theorem 5.2.7 (Smith, Waterman, Fitch [402], Setubal, Meidanis [394], Waterman [446]) A cost measure and the corresponding scoring system as in 5.2.6 are given for a certain value K. Let w and w' be words over A. Then
$\operatorname{sim}(w, w') + \rho(w, w') = \frac{K}{2} \cdot (|w| + |w'|)$.
Both the cost measure and the corresponding scoring system yield the same optimal alignments.²¹
Sketch of the proof. Let w and w' be words of length m and n respectively, and let $\alpha$ be an alignment between w and w'. We define a series $\sigma$ of operations transforming w into w' by dividing $\alpha$ into columns corresponding to the operations in a natural way: matches and mismatches of letters correspond to substitutions; gaps correspond to indels. We shall now compute the score of $\alpha$ and the cost of $\sigma$. Suppose there are exactly l letters which are matched or mismatched in $\alpha$, occupying positions $w_i$ in w and $w'_i$ in w', $1 \le i \le l$. Suppose further that there are exactly r gaps in $\alpha$. Then
$\operatorname{score}(\alpha) = \sum_{i=1}^{l} p(w_i, w'_i) + rg$.   (5.30)
On the other hand, the cost of $\sigma$ is
$\operatorname{cost}(\sigma) = \sum_{i=1}^{l} c(w_i, w'_i) + rh$.   (5.31)
Memberwise addition of (5.30) and (5.31) in conjunction with 5.2.6 gives
$\operatorname{score}(\alpha) + \operatorname{cost}(\sigma) = lK + r \cdot \frac{K}{2}$.   (5.32)
²¹ Although with different scores. But using the formula given in 5.2.6 the distance is the same.
Moreover the values of l and r are not independent: each match or mismatch uses two letters and each gap uses one. Therefore, the total number of letters must be
$m + n = 2l + r$.
Then (5.32) can be written as
$\operatorname{score}(\alpha) + \operatorname{cost}(\sigma) = \frac{K}{2} \cdot (m + n)$.   (5.34)
Since this is true for any alignment. we have one half of the assertion. The other half follows similarly.
All these considerations imply that, from the mathematical standpoint, an alignment and an edit transformation are equivalent ways to describe a relationship between two words. An alignment can be easily converted to its dual edit transformation and vice versa: two opposing letters that mismatch in an alignment correspond to a substitution; a gap in the first word of an alignment corresponds to an insertion of the opposing letter into the first word; and a gap in the second word corresponds to a deletion of the opposing letter from the first word. Thus the edit distance of two words is given by the alignment minimizing the number of opposing letters that mismatch plus the number of letters opposite gaps. But we should note what Gusfield [198] wrote:
Although an alignment and an edit transcript are mathematically equivalent, from a modeling standpoint, an edit transcript is quite different from an alignment. An edit transcript emphasizes the putative mutational events (point mutations in the model so far) that transform one string to another, whereas an alignment only displays a relationship between two strings. The distinction is one of process versus product. Different evolutionary models are formalized via different permitted string operations, and yet these can result in the same alignment. So an alignment alone blurs the mathematical model. This is often a pedantic point but proves helpful in some discussions of evolutionary modeling.
We will switch between the concepts of edit transformations and alignments whenever it is convenient to do so.
A simplified scoring system, called a match-mismatch-gap system, is given if all matches have the same value $M = p(a, a)$ and likewise all mismatches have the same value $m = p(a, b)$, $a \ne b$. Of course, we assume that $M > 0$ and $g < 0$. Additionally, a substitution (a, b) must be "cheaper" than two indels (a-, -b). Hence, we have
Corollary 5.2.8 Let (M, m, g) be a scoring system with only values for matches, mismatches and gaps. Then a cost measure (c, h) having $c(a, a) = 0$ and $c(a, b) = c > 0$ is given by
$c = M - m$, $\quad h = \frac{M}{2} - g$,
provided that $M \ge m \ge 2g$, in which at least one inequality is strict, $M > 0$, and $g < 0$.
As examples we consider several standard systems:
I. The Levenshtein distance, that is $c = 1$ and $h = 1$. We may choose match score $M = 2$, mismatch score $m = 1$ and gap score $g = 0$. More generally, if we wish to measure the distance by
$\rho(w, w') = \#\,\text{substitutions} + h \cdot \#\,\text{indels}$,   (5.36)
for $h \ge 1$ (i.e. that gaps are h times as costly as substitutions), we may choose $M = 2$, $m = 1$ and $g = 1 - h$.
II. The standard match-mismatch-gap system $(1, -1, -2)$ implies the cost measure $c = 2$ and $h = 5/2$.
III. A "normed" match-mismatch-gap system with one free parameter is given by $(1, m, 0)$ where $1 \ge m \ge 0$. Equivalently, we have a cost measure with $c = 1 - m$ and $h = 1/2$. In particular, the search for a longest common subsequence for a pair of words uses the match-mismatch-gap system $(1, 0, 0)$ which implies $c = 1$ and $h = 1/2$.²²
²² The converse of the longest common subsequence problem is
The problem of shortest supersequence
Given: A set of sequences over the same alphabet.
Find: A shortest sequence that contains each of the given sequences as a subsequence.
This problem is NP-complete [435].
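The three standard systems above can be reproduced mechanically. The sketch below is only an illustration: it assumes the formulas $c = M - m$ and $h = M/2 - g$ obtained by specialising 5.2.6 to $K = M$, and the function name `cost_measure_from_scores` is hypothetical. The sign conditions $M > 0$ and $g < 0$ are deliberately not enforced, since examples I and III use $g = 0$.

```python
from fractions import Fraction


def cost_measure_from_scores(M, m, g):
    """Return (c, h) for a match-mismatch-gap system (M, m, g).

    Assumes c = M - m and h = M/2 - g (5.2.6 with K = M) and checks
    M >= m >= 2g with at least one strict inequality.
    """
    M, m, g = Fraction(M), Fraction(m), Fraction(g)
    if not (M >= m >= 2 * g) or (M == m == 2 * g):
        raise ValueError("need M >= m >= 2g with at least one strict inequality")
    return M - m, M / 2 - g


if __name__ == "__main__":
    print(cost_measure_from_scores(2, 1, 0))     # example I
    print(cost_measure_from_scores(1, -1, -2))   # example II
    print(cost_measure_from_scores(1, 0, 0))     # longest common subsequence
```

Exact fractions are used so that values such as 5/2 are not blurred by floating-point rounding.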
How can we find the similarity of or the distance between two words? Clearly, the consideration of all possible alignments does not make sense, since there are too many; see 5.2.5. Observe that we cannot change the order of the letters in the words. This fact suggests that a dynamic programming approach will be useful. A dynamic programming algorithm finds the solution by first breaking the original problem into smaller subproblems and then solving all these subproblems, storing each intermediate solution in a table along with a score, and finally choosing the sequence of solutions that yields the highest score. The goal is to maximize the total score for the alignment. In order to do this, the number of high-scoring residue pairs must be maximized and the number of gaps and low-scoring pairs must be minimized.²³ Due to the widespread applications of the problem, a solution and several basic variants were discovered and published in literature catering to diverse disciplines. It is usual to credit Needleman and Wunsch [319] for creating in 1970 the algorithm for finding the similarity, and Sellers [392] for describing in 1974 the method to compute the distance. Both are designed to produce an optimal measure of the minimum number of changes required to convert one given word into another given word, and may be viewed as an extension of the original Hamming sequence metric. In 1981 Smith, Waterman and Fitch [402] proved the equivalence of both techniques. Two years later they discussed optimal sequence alignments on an important example; see [164]. Let w and w' be two words over A with length m and n, respectively. The algorithms use an $(m + 1) \times (n + 1)$ matrix, and determine the values of this matrix in the following way:
Algorithm 5.2.9 Let w = a[1]a[2]...a[m] and w' = b[1]b[2]...b[n] be two sequences in A*, equipped with a scoring system (p, g). Then we find the similarity sim(w, w') = sim[m, n] by the following procedure, where p[i, j] abbreviates p(a[i], b[j]):
1. for i := 0 to m do sim[i, 0] := i · g;
2. for j := 0 to n do sim[0, j] := j · g;
3. for i := 1 to m do for j := 1 to n do sim[i, j] := max{sim[i - 1, j] + g, sim[i - 1, j - 1] + p[i, j], sim[i, j - 1] + g}
²³ Recall that we used a dynamic programming technique to find a shortest path in a network. And indeed, we can frame the task of finding an optimal alignment as such a problem, compare [447]. But it turns out to be easy to reduce the running time by choosing a better algorithm.
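The table-filling procedure for the similarity can be sketched in a few lines. This is an illustrative transcription, not the book's code: the function name `similarity` and the representation of the score function as a Python callable are assumptions, and the example uses the match-mismatch-gap system (1, -1, -2).

```python
def similarity(w, v, p, g):
    """Global similarity sim(w, v) for a scoring system (p, g).

    p is a symmetric function on letters, g the (non-positive) gap score.
    Runs in O(len(w) * len(v)) time and space.
    """
    m, n = len(w), len(v)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # first column: only gaps in v
        sim[i][0] = i * g
    for j in range(n + 1):          # first row: only gaps in w
        sim[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim[i][j] = max(
                sim[i - 1][j] + g,                          # w[i-1] opposite a gap
                sim[i - 1][j - 1] + p(w[i - 1], v[j - 1]),  # match or mismatch
                sim[i][j - 1] + g,                          # v[j-1] opposite a gap
            )
    return sim[m][n]


def match_mismatch(a, b):
    """Score function of the system (1, -1, -2): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


if __name__ == "__main__":
    print(similarity("accggtt", "agct", match_mismatch, -2))
```

Python strings are 0-indexed, so the letter a[i] of the text is `w[i - 1]` in the code.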
An alignment of two words w and w' is called an optimal alignment if its score equals sim(w, w'). The algorithm, as stated above, only computes the similarity of the words. For the explicit construction of an optimal alignment, the algorithm has to be supplemented by a backtracking procedure. The alignment corresponding to the similarity may well not be unique; but all such alignments can be found by "backtracking" from the cell sim[m, n] to the cell sim[0, 0] in all possible ways. Dually, we have an algorithm to compute the distance between two words:
Algorithm 5.2.10 Let w = a[1]a[2]...a[m] and w' = b[1]b[2]...b[n] be two sequences in A*, equipped with a cost measure (c, h). Then we find the distance ρ(w, w') = ρ[m, n] by the following procedure, where c[i, j] abbreviates c(a[i], b[j]):
1. for i := 0 to m do ρ[i, 0] := i · h;
2. for j := 0 to n do ρ[0, j] := j · h;
3. for i := 1 to m do for j := 1 to n do ρ[i, j] := min{ρ[i - 1, j] + h, ρ[i - 1, j - 1] + c[i, j], ρ[i, j - 1] + h}
Obviously, in both cases, the algorithms run in quadratic time:
Observation 5.2.11 Let w and w' be two words over the same alphabet A. Let a scoring system or a cost measure be given for A. Then the quantities sim(w, w') and ρ(w, w') can be determined in $O(|w| \cdot |w'|)$ time.
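Since the distance and the similarity recursions differ only in the pair function, the gap value and the choice of min or max, both can share one routine. The sketch below is an assumption-laden illustration (the helper `align_value` is hypothetical): it uses the system (1, -1, -2) with its dual cost measure c = 2, h = 5/2 for K = 1, and checks the relation $\operatorname{sim}(w, w') + \rho(w, w') = \frac{K}{2}(|w| + |w'|)$ derived in (5.34).

```python
from fractions import Fraction


def align_value(w, v, pair, gap, best):
    """Fill the (m+1) x (n+1) table; best is max (similarity) or min (distance)."""
    m, n = len(w), len(v)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        t[i][0] = i * gap
    for j in range(n + 1):
        t[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t[i][j] = best(t[i - 1][j] + gap,
                           t[i - 1][j - 1] + pair(w[i - 1], v[j - 1]),
                           t[i][j - 1] + gap)
    return t[m][n]


K = 1                       # common match score p(a, a)
g = -2                      # gap score
h = Fraction(K, 2) - g      # dual gap cost, here 5/2


def p(a, b):
    return 1 if a == b else -1


def c(a, b):
    return K - p(a, b)      # 0 for equal letters, 2 otherwise


def sim(w, v):
    return align_value(w, v, p, g, max)


def dist(w, v):
    return align_value(w, v, c, h, min)


if __name__ == "__main__":
    w, v = "accggtt", "agct"
    print(sim(w, v), dist(w, v), sim(w, v) + dist(w, v),
          Fraction(K, 2) * (len(w) + len(v)))
```

With c(a, b) = 1 for different letters and h = 1 the same helper computes the Levenshtein distance.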
Note that this method to determine the similarity of two sequences is relatively fast but still too slow for most practical work, where the length of the sequences and the number of sequences to be compared are very large. Consequently,
there are heuristic methods which are more efficient for "similarity-searching" an entry in a collection of sequences.²⁴ The similarity-based approach is more general than that of distance, since the distance-based approach is restricted to global comparisons only; it is not suitable for local ones. Here, a local alignment between two sequences w and w' is an alignment between a contiguous subsequence of w and a contiguous subsequence of w'. Our algorithm 5.2.9 can be adapted to find the highest scoring local alignment between two sequences:
Algorithm 5.2.12 Let w = a[1]a[2]...a[m] and w' = b[1]b[2]...b[n] be two sequences in A*, equipped with a scoring system (p, g). Then, compute the local alignment scores as follows, where p[i, j] abbreviates p(a[i], b[j]):
1. for i := 0 to m do sim[i, 0] := 0;
2. for j := 0 to n do sim[0, j] := 0;
3. for i := 1 to m do for j := 1 to n do sim[i, j] := max{sim[i - 1, j] + g, sim[i - 1, j - 1] + p[i, j], sim[i, j - 1] + g, 0}
In the end, it suffices to find the maximum entry in the whole array sim: this will be the score of an optimal local alignment. For this algorithm and derivations of our basic technique compare [394]. With similarities we can penalize gaps depending on their lengths. This cannot be done with metrics. This is an important observation, since if two aligned sequences are for functional protein coding genes, then any gaps would be expected to have lengths that were multiples of three, to preserve the reading frame of the gene; and for ribosomal genes there may be aspects of the secondary structure that can be used to evaluate the plausibility of the various gaps introduced in an alignment. In any case we assume that for a cost measure (c, h) the equality $c(a, a) = 0$ holds for all letters a. On the other hand, there are scoring systems (p, g) conceivable in which for different letters a and b we have $p(a, a) \ne p(b, b)$. The PAM (Point Accepted Mutation) series of score matrices are frequently used for protein alignments [13] and [124]. Each entry in a PAM matrix gives the logarithm of the ratio of the frequency at which a pair of residues is observed in pairwise comparisons of homologous proteins to the frequency expected due to chance alone.²⁵ For a generalized scoring system, the derived dissimilarity need not satisfy the triangle inequality.
²⁴ In particular, the well-known BLAST method runs in linear, that is $O(|w| + |w'|)$, time; compare [394].
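The local variant of Algorithm 5.2.12 differs from the global one only in the zero boundary, the extra option 0 in the maximum, and the final search for the largest table entry. A minimal sketch, assuming the score function is passed as a callable and using the hypothetical name `local_similarity`:

```python
def local_similarity(w, v, p, g):
    """Score of an optimal local alignment of w and v for a scoring system (p, g)."""
    m, n = len(w), len(v)
    sim = [[0] * (n + 1) for _ in range(m + 1)]   # boundary row and column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim[i][j] = max(
                sim[i - 1][j] + g,
                sim[i - 1][j - 1] + p(w[i - 1], v[j - 1]),
                sim[i][j - 1] + g,
                0,                                 # start a fresh local alignment here
            )
            best = max(best, sim[i][j])            # maximum entry of the whole array
    return best


def match_mismatch(a, b):
    """Score function of the system (1, -1, -2)."""
    return 1 if a == b else -1


if __name__ == "__main__":
    # The common block "agc" is found although the flanks do not match.
    print(local_similarity("ttagct", "ccagca", match_mismatch, -2))
```

Recording where the maximum occurs and backtracking until a zero entry is reached would recover the aligned blocks themselves; that step is omitted here.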
5.2.4 Multiple Alignments
In the context of molecular biology, multiple sequence comparison is the most critical cutting-edge tool for extracting and representing biologically important commonalities from a set of sequences. It plays an essential role in two related areas:
Finding highly conserved subregions among a collection of sequences; and
Inferring the evolutionary history of some species from their associated sequences.
One central technique for multiple sequence comparison involves multiple alignment. Here, a (global) multiple alignment of $n > 2$ sequences $w_1, \ldots, w_n$ is a natural generalization of the alignment of two sequences. That means that we insert gap characters (called dummies) into, or at either end of, each of the sequences to produce a new collection of elongated sequences that obeys these rules:
(i) All elongated sequences have the same length, l;
(ii) There is no position at which all the elongated sequences have a dummy.
Then the sequences are arrayed in a matrix of n rows and l columns, where
$\max_{i = 1, \ldots, n} |w_i| \le l \le \sum_{i=1}^{n} |w_i|$.
²⁵ Amino acids that regularly replace each other have a positive score, while amino acids that rarely replace each other have a negative score.
Consequently, there are only finitely many multiple alignments for a collection of sequences. Furthermore, the elongated sequences in a multiple alignment are as similar as possible according to some predefined scoring system, cost measure or length of a network. Although the notion of a multiple alignment is easily extended from two to many sequences, the score or the cost of a multiple alignment is not easily generalized. There is no function that has been universally accepted for multiple alignment as distance or similarity has been for pairwise alignment. The essence of the first idea is to extend the dynamic programming technique 5.2.10 from pairwise alignment to the alignment of $n > 2$ sequences. A cost measure (c, h) for an alphabet A to compare two sequences can also be written as a function $f : (A \cup \{-\})^2 \to \mathbb{R}$, where - is the "dummy" symbol, $- \notin A$, and
$f(a, b) = c(a, b)$, $\quad f(a, -) = f(-, a) = h$ for $a, b \in A$.
($f(-, -)$ is not defined.) $A \cup \{-\}$ is called the extended alphabet, and such a function f, extended to $n \ge 2$ values, is called a generalized cost measure. More precisely: A generalized cost measure is a function $f : (A \cup \{-\})^n \to \mathbb{R}_{\ge 0}$, which satisfies the following conditions:
(i) f is non-negative: $f(a_1, \ldots, a_n) \ge 0$;
(ii) $f(a, \ldots, a) = 0$, for each $a \in A$; $f(-, \ldots, -)$ is not defined;
(iii) $f(a_1, \ldots, a_n) > 0$ if $a_i = -$ holds for at least one index i;
(iv) f is symmetric: $f(a_1, \ldots, a_n) = f(a_{\pi(1)}, \ldots, a_{\pi(n)})$ holds true for any permutation $\pi$.
With this in mind, we have
Algorithm 5.2.13 (Clote, Backofen [109], Waterman [447]) Let A be an alphabet. Let w = a[1]a[2]...a[k], w' = b[1]b[2]...b[m] and w'' = c[1]c[2]...c[l] be three sequences in $(A \cup \{-\})^*$. Let $f : (A \cup \{-\})^3 \to \mathbb{R}$ be a generalized cost measure. We find the "generalized" distance R(w, w', w'') = R[k, m, l] by the following procedure:
Applied to the case with n sequences, we have the following strict generalisation of 5.2.11:
Observation 5.2.14 Let $N = \{w_i : i = 1, \ldots, n\}$ be a set of words over the same alphabet A. Let a generalized cost measure be given for A. Then the quantity $R(w_1, \ldots, w_n)$ can be determined in $O(\prod_{i=1}^{n} |w_i|)$ time.
Another approach is to use single pairwise alignments. Given a multiple alignment M for the sequences $w_1, \ldots, w_n$, the induced pairwise alignment of two sequences $w_i$ and $w_j$ is obtained from M by
1. removing all rows except the two rows for $w_i$ and $w_j$;
2. removing columns consisting of a dummy opposing another dummy.
To find the cost, or the score, we use the cost measures, or the scoring systems, respectively, in the standard manner. Then, we define the cost (score) of M by summing up the distances (similarities) of several pairs of induced alignments. This can be described in graph-theoretical terms: Let $N = \{w_1, \ldots, w_n\}$ be a set of sequences from the same phylogenetic space. Then define the generalized distance of a graph alignment $G = (V, E)$, $N \subseteq V$, by
A multiple alignment of a collection of words is called an optimal multiple alignment if its generalized distance is minimal among all multiple alignments
of these words. Such an alignment may well not be unique. Some specific examples of generalized scoring systems are of interest:
The sum-of-pairs or complete alignment, which is the sum of all pairs' costs; that means we consider the complete graph $G = (N, \binom{N}{2})$. This definition is mathematically natural but not biologically intuitive; in particular, evolutionary relationships are ignored. This formulation of an optimal multiple alignment for a set of sequences has been shown to be NP-hard; see Wang and Jiang [442].
The tree alignment.²⁶ Here we are near evolutionary trees: Given a set N of n sequences and a partially labelled tree $T = (V, E)$ with n leaves, where each leaf is associated with a given sequence,²⁷ we want to reconstruct a sequence for each internal node to minimize the length of T. A complication, however, is that the alignment may change depending upon the tree on which the sequences are aligned. This is not a simple issue, since most of the phylogenetic studies align the sequences first, then compute a phylogeny based on that alignment. One solution to this dilemma is to infer both the alignment and the tree at the same time, so that the "optimal" alignment and the phylogenetic tree are obtained together. We will discuss this approach below; an overview is given by Jiang and Wang [243]. This formulation of an optimal multiple alignment for a set of sequences has also been shown to be NP-hard; see Wang and Jiang [442].²⁸ A heuristic approach has been created by Schwikowski and Vingron [390].
The more specific star alignment, in which it is assumed that the underlying tree is a star. This implies that all sequences share a common ancestor. Restricting the "topology" makes this approach much more tractable, but nevertheless it too is not solvable in polynomial time. If we pick one of the given sequences as the internal vertex of the star, we can find an optimal alignment in
time [394]. And to find the center sequence we can compute all $O(n^2)$ optimal pairwise alignments and select as the center the sequence $w_c$ that minimizes
²⁶ Note that this term is used to mean several different things in the literature.
²⁷ Later we will call such a tree an N-tree.
²⁸ Moreover, they show that there is no polynomial time approximation scheme for the problem, unless P = NP.
For a broader discussion of the relationship between multiple alignment and phylogeny construction, compare Vingron [437]. The generalized distance for these applications can be very different: Let $A = \{r, y\}$. Consider the costs for one column consisting of $m_1$ instances of the letter r and $m_2$ instances of the letter y, where $m_1 + m_2 = n$, and using the length function
$f(a, b) = 0$ if $a = b$, and $f(a, b) = 1$ otherwise,
for $a, b \in \{r, y\}$. It is easy to check that the complete alignment has length $m_1 \cdot m_2$, that there is a star alignment of length $\min\{m_1, m_2\}$, and that there is a tree alignment of length 0 if $m_1 = 0$ or $m_2 = 0$, and of length 1 otherwise.
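The three values for a single column are easy to confirm by brute force. The sketch below assumes the unit length function (0 for equal letters, 1 otherwise); the function names are hypothetical, and the tree value relies on the fact that, with unit costs and a freely chosen tree, one edge per additional distinct letter suffices.

```python
from itertools import combinations


def mismatch(a, b):
    """Assumed unit length function: 0 for equal letters, 1 otherwise."""
    return 0 if a == b else 1


def sum_of_pairs_cost(column):
    """Complete alignment: add the cost of every unordered pair of rows."""
    return sum(mismatch(a, b) for a, b in combinations(column, 2))


def star_cost(column, alphabet):
    """Star alignment: choose the center letter that is cheapest to reach."""
    return min(sum(mismatch(x, a) for a in column) for x in alphabet)


def tree_cost(column):
    """Tree alignment with a free topology: one edge per extra distinct letter."""
    return len(set(column)) - 1


if __name__ == "__main__":
    column = "rrryy"                       # m1 = 3 letters r, m2 = 2 letters y
    print(sum_of_pairs_cost(column),       # m1 * m2
          star_cost(column, "ry"),         # min(m1, m2)
          tree_cost(column))               # 1, since both letters occur
```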
Surveys of multiple sequence comparison methods are given in [75], [148] and [438]. In any case, the alignment array can be summarized in a single sequence called a consensus sequence, which is frequently added at the end of the alignment. It is common in computational molecular biology to compute a multiple alignment for a set of sequences, and then represent those sequences by the consensus sequence derived from the alignment. The consensus sequence consists of letters that summarize the letters of the alignment in each column. A simple way to calculate a consensus sequence is to use the so-called majority rule (MR), which chooses the most frequently occurring letter in each column. We distinguish between two rules: The normal rule uses the alphabet $A \cup \{-\}$. The restricted rule uses only the alphabet A. An example compares the word for SCHOOL in different languages:
Language
German                     -   S   C   H        U        -        L   E
English                    -   S   C   H        O        O        L   -
French                     E   -   C   -        O        -        L   E
Italian                    -   S   C   -        U        O        L   A
Consensus, MR              -   S   C   H or -   O or U   O or -   L   E
Consensus, restricted MR   E   S   C   H        O or U   O        L   E
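The majority rule is simple to apply column by column. In the sketch below the four rows are the aligned words SCHULE, SCHOOL, ECOLE and SCUOLA; the exact gap placement is an assumption made for illustration, and the function name `consensus` is hypothetical. Ties are reported as a list of all most frequent letters.

```python
from collections import Counter


def consensus(rows, restricted=False, gap="-"):
    """Majority-rule consensus of a multiple alignment given as equal-length rows.

    With restricted=True the dummy symbol is ignored when counting. This relies
    on rule (ii) for multiple alignments: no column consists of dummies only.
    """
    result = []
    for column in zip(*rows):
        counts = Counter(ch for ch in column if not (restricted and ch == gap))
        top = max(counts.values())
        result.append(sorted(ch for ch, k in counts.items() if k == top))
    return result


ALIGNMENT = [
    "-SCHU-LE",   # German
    "-SCHOOL-",   # English
    "E-C-O-LE",   # French
    "-SC-UOLA",   # Italian
]

if __name__ == "__main__":
    print(consensus(ALIGNMENT))
    print(consensus(ALIGNMENT, restricted=True))
```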
More generally, assuming that there is a cost measure (c, h), written as a generalized cost function $f : (A \cup \{-\})^2 \to \mathbb{R}$, we define the consensus sequence as follows: Given a multiple alignment $M = (a_{ji})$ of a set N of n sequences, the consensus letter of column i of M is the letter a that minimizes
$\sum_{j=1}^{n} f(a, a_{ji})$.
(If we allow $a = -$, then we have to define also $f(-, -)$, in general by setting $f(-, -) = h$.) The consensus sequence derived from M is the concatenation of the consensus letters for each column of M. Using the generalized cost measure defined by
$f(a, b) = 0$ if $a = b$, and $f(a, b) = 1$ otherwise,
gives the majority rule. It is easy to find the consensus sequence for a given multiple alignment. The following problem in a phylogenetic space, given by an alphabet A and a generalized cost measure $f : (A \cup \{-\})^2 \to \mathbb{R}$, is not so simple:
The Consensus Sequence Problem
Given: A set $N = \{w_1, \ldots, w_n\}$ of sequences.
Find: A multiple alignment $M = (a_{ji})_{j = 1, \ldots, n;\ i = 1, \ldots, l}$ for N, and a consensus sequence $w = a_1 \ldots a_l$ such that
$\sum_{i=1}^{l} \sum_{j=1}^{n} f(a_i, a_{ji})$
is minimal.
For a broader discussion of this problem, compare Gusfield [198]; for the consequences of this observation for molecular evolution see Eigen [153].
5.2.5 Steiner's Problem in phylogenetic spaces: The question
Until now we have vaguely defined the maximum parsimony problem as the problem of reconstructing the evolutionary history with the fewest number of mutations. Phylogeny construction is a prominent application of the notion of a Steiner Minimal Tree: one of the first formal versions of phylogeny construction interpreted the ancestral sequences as Steiner points in a hypercube, namely in $\{a, c, g, t\}^d$. We are given a set of aligned sequences and a tree topology, where the leaves are labelled with the given sequences. For any assignment of sequences to the internal vertices of the tree, the length of the tree is defined as the number of mismatches between the pairs of sequences incident to each edge. A most parsimonious assignment of sequences is one that minimizes the total length. An algorithm for its solution that is linear in the number and length of the sequences was given by Fitch [162] in 1971. In 1975 Sankoff [378] generalized this approach to handle assumed tree topologies and unaligned sequences. We will discuss this technique below, and in more depth, in the last chapter of the book. Equipped with the proper terminology, we can now give a precise definition of the maximum parsimony problem: Consider a phylogenetic space A over the alphabet A with the scoring system (p, g) generating the similarity function sim, and the equivalent cost measure (c, h) generating the metric $\rho$. In A, the length of a tree $T = (V, E)$ is given by
$L(T) = \sum_{vv' \in E} \rho(v, v')$.
The metric $\rho$ may be a pseudometric, in which case we call L a pseudo-length. The most important principle in molecular evolution, namely that the degree of similarity between genes reflects the strength of the evolutionary relationship between them, gives rise to the following observation: Let N be a finite set of points (sequences, words) in a phylogenetic space A. A most parsimonious tree is an SMT for N in A. An SMT for N must exist and it is only necessary to search the Steiner points in the set
$\{ w \in A : \rho(v, w) \le L(\text{MST for } N) \}$,   (5.47)
Corollary 6.2.2 The number of non-isomorphic graphs with n vertices is at least $2^{n(n-1)/2}/n!$.
Often we have no exact formula for counting the number of combinatorial objects of some kind, but we can describe its asymptotic behavior. Then we use the following notation: Let f and g be functions from the positive integers to the real numbers, then
(i) The function $g(n)$ is said to be growing faster than $f(n)$, denoted $f(n) = o(g(n))$, if
$\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0$.
(iv) for each integer k, the same number of cycles of length k. However, these properties are necessary but not sufficient criteria for isomorphism. It is strange, but the computational complexity of verifying whether two graphs are isomorphic is still unknown: No polynomially bounded algorithm is known; on the other hand, it has not been proved that this problem is in NPC. Maybe this problem is a member of NPI. A monograph on isomorphism detection is given in [223]. There is a quadratic time algorithm which decides whether two trees are isomorphic; see [433].
³ Or onto another set of n distinguished names.
(ii) The function $g(n)$ is said to approximate $f(n)$, denoted $f(n) \approx g(n)$, if
$\lim_{n \to \infty} \frac{f(n)}{g(n)} = 1$.
This notation allows us to concentrate on the dominating term in an expression describing a lower or upper bound and to ignore any multiplicative constants.
Theorem 6.2.3 Denote by conn(n) the number of connected graphs with n labelled vertices. It holds that
Proof. We show that the sum
is the number of disconnected graphs: A proper component of a graph has at least one and at most $n - 1$ vertices. Let i be the number of vertices outside of such a component and let $n - i$ be the number of vertices inside, $1 \le i \le n - 1$. For a fixed number i there are
2.
Then the following procedure
1. Generate, by simple counting, all Prüfer codes in $\{1, \ldots, n\}^{n-2}$;
2. For each code apply 6.2.20.
This procedure consumes $n^{n-2} \cdot O(n) = O(n^{n-1})$ time, since 6.2.20 runs in linear time. Hence, it is an effective technique.
Remember that counting only partially labelled trees is fundamentally harder, and so it is with generating.
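The two-step generation of all labelled trees can be sketched as follows. This is an illustration under assumptions: the decoding routine stands in for 6.2.20 and is the usual Prüfer decoding, and it uses a heap for brevity, so it runs in O(n log n) rather than linear time. The function names are hypothetical.

```python
import heapq
from itertools import product


def tree_from_pruefer(code, n):
    """Decode a Pruefer code over {1, ..., n} of length n - 2 into n - 1 edges."""
    degree = [1] * (n + 1)
    for v in code:
        degree[v] += 1
    leaves = [v for v in range(1, n + 1) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in code:
        leaf = heapq.heappop(leaves)      # smallest current leaf
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:                # v has become a leaf itself
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


def all_labelled_trees(n):
    """Step 1: count through all codes; step 2: decode each one (n >= 2)."""
    for code in product(range(1, n + 1), repeat=n - 2):
        yield tree_from_pruefer(code, n)


if __name__ == "__main__":
    print(tree_from_pruefer((4, 4, 4, 5), 6))
    print(sum(1 for _ in all_labelled_trees(5)))   # n ** (n - 2) trees
```

Every code yields a different tree, so each labelled tree is produced exactly once.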
An analysis of Steiner's Problem in phylogenetic spaces
The simple process which we use to prove observation 2.4.17 is also useful to generate all binary N-trees: Let $N = \{v_1, \ldots, v_n\}$. There is a single N-tree with $|N| = 3$. The fourth leaf $v_4$ can be connected to any of the three edges. This leads to three N-trees with $|N| = 4$, each with five edges. Then, for each tree add the fifth leaf to any of these edges, and so on. Note that to use this procedure to generate all N-trees, we have to generate the set
We will describe a nonoptimal technique, involving drawing a tree in the plane: Let $n > 1$ be an integer. A planar code w (with respect to n) is a sequence in $\{0, 1\}^{2(n-1)}$ with the following properties:
(i) In each prefix of w the number of 1s is at least the number of 0s; in particular, the first letter in w must be 1;
(ii) The number of 1s in w equals the number of 0s; in particular, the last letter in w must be 0.
Algorithm 6.2.22 Let w be a planar code with respect to n. Then draw a tree by the following procedure:
1. Put a vertex as the origin;
2. Read w letter by letter and if you see a 1 then draw a new edge to a new vertex; if you see a 0 then move back by one edge toward the origin.

Thus the tree is described by its planar code. Hence, after generating all planar codes, we can generate all unlabelled trees with n vertices. The number of planar codes is the Catalan number (compare [14] or [296]), which gives an upper bound for the number of non-isomorphic trees. Note that the planar code is far from optimal; every unlabelled tree has many different codes. For instance all the codes 11010010, 10110100, 11101000, 10101100, 11011000 and 11100100 generate the same tree. The table below summarizes our methods.
Generating all      Optimal    Running time w.r.t. number of trees
Labelled trees      Yes        Linear
Binary N-trees      Yes        Exponential
Unlabelled trees    No         Exponential
In the above "optimal" means that the algorithm generates each tree exactly once. Fliege [165], Lee, Lee, Wong [279] and Winter [462] describe several other methods to generate trees and full trees.
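The planar code and Algorithm 6.2.22 can be tried out with a short Python sketch. The function names are hypothetical, the codes are enumerated by brute force over all 0/1 sequences, and the sorted degree sequence is used only as a cheap check that two small trees have the same shape (for five vertices the three possible trees have pairwise different degree sequences).

```python
from itertools import product

def is_planar_code(w):
    """Conditions (i) and (ii): no prefix has more 0s than 1s, and the totals agree."""
    balance = 0
    for letter in w:
        balance += 1 if letter == 1 else -1
        if balance < 0:
            return False
    return balance == 0

def planar_codes(n):
    """All planar codes with respect to n, by brute-force filtering."""
    for w in product((0, 1), repeat=2 * (n - 1)):
        if is_planar_code(w):
            yield w

def tree_from_code(w):
    """Algorithm 6.2.22: the origin is vertex 0, new vertices are numbered 1, 2, ..."""
    parent = {0: None}
    edges = []
    current, next_vertex = 0, 1
    for letter in w:
        if letter == 1:                   # draw a new edge to a new vertex
            parent[next_vertex] = current
            edges.append((current, next_vertex))
            current = next_vertex
            next_vertex += 1
        else:                             # move back one edge toward the origin
            current = parent[current]
    return edges

def degree_sequence(edges):
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree.values())
```

For $n = 5$ there are 14 planar codes, and the six codes listed above all produce a tree with degree sequence 1, 1, 1, 2, 3.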
6.3 CLUSTER ANALYSIS
Evolution implies that many different species have a common ancestor and that all forms of life probably stem from the same remote beginnings. Once these relationships are understood, they are summarized by grouping species into collections of related organisms, called taxa. We will describe the structures underlying these relationships.
6.3.1 Classifications
A classification is the formal naming of a group of individuals. In the sense of set theory a classification C of a (finite) set N of individuals is given by a collection of subsets of N satisfying
(i) $\emptyset \notin C$;
(ii) $N \in C$;
(iii) $\{v\} \in C$ for any $v \in N$; and
(iv) For any two members $N'$ and $N''$ of C it holds that $N' \cap N'' \in \{N', N'', \emptyset\}$.
In other words, any two sets in C are disjoint or one is contained in the other (see 5.3.1).
A member of a classification is called a class or a cluster of N. Let T be an N-tree rooted by the vertex w. Then we create a collection C of classes for the set N in the following way:
1. For each leaf v of T put $\{v\}$ in C; mark the vertex v;
2. Let $v \neq w$ be an unmarked vertex adjacent to exactly one other unmarked vertex. All other neighbors $v_1, \ldots, v_k$ of v are marked and belong to classes $N_1, \ldots, N_k$ in C, respectively. Then
   - Put $\bigcup_{i=1}^{k} N_i$ in C, and
   - Mark v;
   Repeat this step until w is the only unmarked vertex;
3. Mark w; put N in C.
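The marking procedure above amounts to collecting, for every vertex, the set of leaves below it. A minimal Python sketch of this reading follows; the function name `classification`, the adjacency-dictionary representation and the internal vertex names `x`, `y`, `z` are assumptions made for the illustration, and the recursion replaces the explicit marking.

```python
def classification(adj, root):
    """Classes induced by rooting the tree `adj` (vertex -> neighbours) at `root`."""
    classes = set()

    def leaves_below(v, parent):
        children = [u for u in adj[v] if u != parent]
        if not children:                  # step 1: a leaf contributes its singleton
            cls = frozenset([v])
        else:                             # steps 2 and 3: union of the children's classes
            cls = frozenset().union(*(leaves_below(u, v) for u in children))
        classes.add(cls)
        return cls

    leaves_below(root, None)
    return classes

# Hypothetical encoding of the binary tree (((ab)c)(de)) with internal vertices x, y, z.
adj = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["z"], "e": ["z"],
    "x": ["a", "b", "y"], "y": ["x", "c", "z"], "z": ["y", "d", "e"],
}
rooted_at_x = classification(adj, "x")    # contains {d, e} and {c, d, e}
```

Rooting the same tree at `y` or `z` instead changes only the non-trivial classes.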
Conversely, if we have a collection C of classes of the set N with the properties that $\{v\} \in C$ for each element $v \in N$ and $N \in C$, we can form a tree T by:
1. Each class of C is a vertex of T;
2. Two vertices $N_1$ and $N_2$ are adjacent if and only if
   - $N_1 \cap N_2 \in \{N_1, N_2\}$, and
   - there is no class $N'$ such that $N_j \cap N' \in \{N_j, N'\}$ for $j = 1, 2$.
   (That means, $N_1$ must be the maximal proper subset of $N_2$ or vice versa.)

Summing up all these observations, we have the following fundamental equivalence between classifications and rooted trees.

Observation 6.3.1 There is a one-to-one correspondence between the collection of classifications for a set N and the collection of rooted N-trees.

In other words, classifications for a set N and rooted N-trees contain essentially the same information. The classification $C = C(T)$ which is induced by the tree T is called the content of T. In view of this observation, each evolutionary tree implies a classification of the given names. But we saw that such a classification is not applicable in practice, since the depth of the tree lies between $\Omega(\log n)$ and $O(n)$ for $n = |N|$ and is obviously too big. Taxonomists are interested in trees with a constant depth. In particular Linnaeus' system has depth 8. Hence, in such systems the trees are not binary.

6.3.1 can be viewed as the rooted analogue of 6.2.10. We need to describe equivalences between the families of rooted N-trees and N-trees, and corresponding equivalences between classifications on N and collections of pairwise compatible N-splits. The following proposition describes the desired equivalences. The proof is an application of 6.2.10 and 6.3.1.
Observation 6.3.2 (Semple and Steel [393]) Let N be a finite set. C is a classification for N if and only if the collection
is a set of pairwise compatible splits on N; and vice versa.

For instance, consider the set $N = \{a, b, c, d, e\}$. Coming from the (binary) N-tree (((ab)c)(de)) we have the split system
$S = \{\{a, bcde\}, \{b, acde\}, \{c, abde\}, \{d, abce\}, \{e, abcd\}, \{ab, cde\}, \{abc, de\}\}$. (6.44)
Using each of the three internal vertices as a root gives the following classifications:
$C_1 = \{a, b, c, d, e, ed, ced\} \cup \{N\}$ (6.45)
$C_2 = \{a, b, c, d, e, ab, ed\} \cup \{N\}$ (6.46)
$C_3 = \{a, b, c, d, e, ab, abc\} \cup \{N\}$ (6.47)
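Both properties used in this example can be checked mechanically. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes the usual definition of compatibility, namely that two splits $\{A, B\}$ and $\{C, D\}$ are compatible if at least one of the four intersections $A \cap C$, $A \cap D$, $B \cap C$, $B \cap D$ is empty.

```python
from itertools import combinations

def compatible(s1, s2):
    """Splits are pairs of frozensets; compatible if some cross intersection is empty."""
    return any(not (p & q) for p in s1 for q in s2)

def pairwise_compatible(splits):
    return all(compatible(s, t) for s, t in combinations(splits, 2))

def is_classification(classes, ground):
    """Conditions (i) to (iv) of the definition of a classification."""
    if frozenset() in classes or ground not in classes:
        return False
    if any(frozenset([v]) not in classes for v in ground):
        return False
    return all((a & b) in (a, b, frozenset()) for a, b in combinations(classes, 2))

def split(left, right):
    """Hypothetical helper: write the split {ab, cde} as split("ab", "cde")."""
    return (frozenset(left), frozenset(right))

# The split system (6.44) of the tree (((ab)c)(de)).
S = [split("a", "bcde"), split("b", "acde"), split("c", "abde"), split("d", "abce"),
     split("e", "abcd"), split("ab", "cde"), split("abc", "de")]
```

Adding a split such as $\{ac, bde\}$ to S destroys compatibility, because it crosses $\{ab, cde\}$.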
With 6.3.1 in mind, we have several considerations. Firstly we determine the maximal number of sets in a classification. Let $T = (V, E)$ be a rooted N-tree with $|N| = n$, that has k internal vertices each of degree greater than 2, and a root w. Then 1.2.5 says that $k \leq n - 2$. Consequently,
for all words w in TIC,'
>
where w, is a fixed word in T V of length z,. Equivalently,
Hence, T,i/
in the Euclidean plane and the plane with rectilinear distance. Is this also true in phylogenetic spaces? See our discussion in the beginning of the fifth chapter.
Algorithm 7.2.1 (Fitch [162]) Let N be a set of n sequences in a sequence space $(A^d, \rho_H)$: $N = \{v_k = v_{k,1} \ldots v_{k,d} : k = 1, \ldots, n\}$, and let a binary N-tree $T = (V, E)$ be given. Then do:
1. For each position $i = 1, \ldots, d$ do
   1. Mark each leaf $v_k$ with $\{v_{k,i}\}$; $L_i := 0$;
   2. Until all vertices are marked do: Find an unmarked vertex which is adjacent to two marked vertices with the marks $N_1$ and $N_2$; mark the unmarked vertex with
      (a) $N_1 \cap N_2$ if $N_1 \cap N_2 \neq \emptyset$; otherwise
      (b) $N_1 \cup N_2$, and $L_i := L_i + 1$;
2. $L(T) := \sum_{i=1}^{d} L_i$.
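A compact Python sketch of Algorithm 7.2.1 is given below. It is an illustration under assumptions: the tree is taken to be rooted (for instance by subdividing one edge of the binary N-tree) and is written as nested pairs of leaf names, and the name `fitch_length` and the example sequences are hypothetical.

```python
def fitch_length(tree, sequences):
    """Length of `tree` (nested pairs of leaf names) for equally long sequences."""
    d = len(next(iter(sequences.values())))
    total = 0
    for i in range(d):                      # step 1: treat each position separately
        changes = 0                         # this is L_i

        def mark(node):
            nonlocal changes
            if isinstance(node, str):       # a leaf is marked with its own letter
                return {sequences[node][i]}
            left, right = (mark(child) for child in node)
            common = left & right
            if common:                      # case (a): nonempty intersection
                return common
            changes += 1                    # case (b): union, one more change
            return left | right

        mark(tree)
        total += changes                    # step 2: sum over all positions
    return total

# Hypothetical data: four sequences of length two.
seqs = {"a": "AC", "b": "AG", "c": "TC", "d": "TG"}
length_ab_cd = fitch_length((("a", "b"), ("c", "d")), seqs)
length_ad_bc = fitch_length((("a", "d"), ("b", "c")), seqs)
```

With these data the tree grouping a with b has length 3, while the tree grouping a with d has length 4.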
The correctness of Fitch's algorithm is proven by Hartigan [213]. In particular, it is shown that the final answer is independent of the vertices chosen when moving through the tree. The algorithm computes the length of the tree. Since a binary N-tree has $2n - 2$ vertices, it uses $O(n)$ time for each position and hence $O(d \cdot n)$ time to find the length. On the other hand, there are exponentially many binary trees. Hence,
Observation 7.2.2 The Fitch algorithm 7.2.1 uses linear time to find the length of a given binary N-tree in a sequence space. Applying 7.2.1 for all binary N-trees we find an SMT for a finite set of given points in a sequence space in exponential time.

After applying 7.2.1 we have marks for all the internal vertices in the tree. However, some marks have more than one letter and hence are ambiguous. There are several methods for choosing which one of the possible states yields the most parsimonious reconstruction; the simplest one is Farris' method: go back up the tree assigning to any internal vertex that is ambiguous the intersection of its mark with that of its immediate ancestor. However, as the number of possible trees increases rapidly with the number of given sequences, it is virtually impossible to employ an exhaustive search when the number of given sequences is not small. Fortunately, there exist shortcut algorithms for identifying all shortest trees that do not require exhaustive enumeration, and work for larger sets of sequences. One such algorithm is the branch-and-bound method by Hendy and Penny [216], described briefly below:
1. Guess a "good tree" $T_0$ using a heuristic; $L_0 := L(T_0)$; let S be the set of all binary N-trees;
2. (Iteration:)
   1. Partition S into a small number of subsets $S_1, S_2, \ldots, S_k$;
   2. For $i := 1, \ldots, k$ do
      - Find a length $L(S_i)$ such that $L(T) \geq L(S_i)$ for all $T \in S_i$;
      - If $L(S_i) \leq L_0$ then iterate (return to 1. with $S = S_i$).
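One common way to realize this scheme is to build the trees by adding one leaf at a time: the subset $S_i$ consists of all trees that extend a given partial tree, and the Fitch length of the partial tree serves as the lower bound $L(S_i)$, since adding a leaf never shortens a tree. The Python sketch below follows this idea under several assumptions: all names are hypothetical, the search starts with $L_0 = \infty$ instead of a heuristic tree, it needs at least three sequences, and it prunes with "$\geq$", so it returns one shortest tree rather than all of them.

```python
def fitch_cost(adj, seqs):
    """Fitch length of an unrooted binary tree given as {vertex: set of neighbours}."""
    root = next(iter(seqs))                    # the first leaf is always in the tree
    total = 0
    for i in range(len(seqs[root])):
        changes = 0

        def mark(v, parent):
            nonlocal changes
            if v in seqs:                      # leaf: marked with its own letter
                return {seqs[v][i]}
            a, b = (mark(u, v) for u in adj[v] if u != parent)
            if a & b:
                return a & b
            changes += 1
            return a | b

        (start,) = adj[root]                   # the single neighbour of the root leaf
        if seqs[root][i] not in mark(start, root):
            changes += 1
        total += changes
    return total

def insert_leaf(adj, u, v, leaf, new_internal):
    """Subdivide the edge uv by `new_internal` and attach `leaf` to it."""
    new = {x: set(nb) for x, nb in adj.items()}
    new[u].discard(v)
    new[v].discard(u)
    for x in (u, v):
        new[x].add(new_internal)
    new[new_internal] = {u, v, leaf}
    new[leaf] = {new_internal}
    return new

def branch_and_bound(seqs):
    names = list(seqs)
    start = {0: set(names[:3]), **{x: {0} for x in names[:3]}}
    best = {"length": float("inf"), "tree": None}

    def extend(adj, k):
        bound = fitch_cost(adj, seqs)          # lower bound for every completion
        if bound >= best["length"]:
            return                             # this subset cannot contain a better tree
        if k == len(names):
            best["length"], best["tree"] = bound, adj
            return
        for edge in {frozenset((x, y)) for x in adj for y in adj[x]}:
            u, v = edge
            extend(insert_leaf(adj, u, v, names[k], -k), k + 1)

    extend(start, 3)
    return best["length"], best["tree"]
```

Internal vertices are numbered 0, -3, -4, and so on, purely to keep them distinct from the leaf names.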
The observation 5.2.1 suggests that the method given by 7.2.1 can be extended to find the location of Steiner points in phylogenetic spaces $(A^*, \rho_L)$. And indeed, Sankoff gives a dynamic programming algorithm for tree alignment. He merges the high-dimensional version of the dynamic programming algorithm for pairwise alignment with the Fitch algorithm:
Observation 7.2.3 (Sankoff [378]) Let N be a set of n words in the phylogenetic space $(A^*, \rho_L)$. Let a binary N-tree T be given. Then the location of the Steiner points in T can be reduced to $(2d)^n$ applications of 7.2.1, where $d = \max\{|v| : v \in N\}$.
7.3 THE PERFECT PHYLOGENY PROBLEM
Now, we consider character state data. Recall the perfect phylogeny problem.
Given: A set N of n taxa on a set C of characters, represented by an $n \times m$ character-state matrix M.
Determine: Whether a perfect phylogeny exists. And, if so, construct one.
The following observation is not a surprise:
Theorem 7.3.1 (Steel [406]) The perfect phylogeny problem is NP-complete.
We will now restrict ourselves to the binary case, that is we allow a character to take exactly two states: M is a 0-1-matrix. Here, we will see that the problem can be solved efficiently.

For the following algorithm it will be convenient to first reorder the columns of M. Consider each column as a binary number; sort these m numbers into decreasing order, placing the largest number in column 1. Let $\overline{M}$ denote the reordered matrix M. From this point on, each character will be named by the column it occupies in $\overline{M}$. Hence, a character j will be to the right of character i in $\overline{M}$ if and only if $i < j$. For any column k of $\overline{M}$, let $O_k$ be the set of taxa with a 1 in column k, that is the taxa that have character k. Clearly, if $O_k$ strictly contains $O_j$ then column (character) k must be to the left of column j in the matrix $\overline{M}$. The major fact and the basis for an efficient solution of the perfect phylogeny problem is
Theorem 7.3.2 The matrix M has a phylogenetic tree if and only if for every pair of columns i and j, either $O_i$ and $O_j$ are disjoint or one contains the other.

This theorem is intuitively clear, and a complete proof is given in [198] and [391]. To make this technique clearer, Gusfield [198] furnishes the following small example: Let $M_1$ be the matrix
Tree building algorithms
In view of 7.3.2 we create the following algorithm:

Algorithm 7.3.3 Given an $n \times m$ character-state matrix M for n taxa and m binary characters, we find a perfect phylogeny, if it exists, by using the following procedure:
2. For each row i of $\overline{M}$ do: Construct the string consisting of the characters, in sorted (increasing) order, that row i possesses;
answer is given
Remark 7.4.4 Algorithm 7.4.1 gives the correct tree if the distances form an ultrametric space.

Moreover, similarity and evolutionary relationships will only coincide exactly if the distances are ultrametric; compare [278]. That means that ultrametric distances will precisely fit a tree so that the distance between any two taxa is equal to the sum of the lengths of the edges joining them, and the tree can be rooted so that all of the taxa are equidistant from the root.
7.4.1 Linkage Clustering
One of the simplest agglomerative methods is linkage clustering. We distinguish between two kinds of such techniques: the single and the complete linkage clustering. The main feature of single linkage clustering is that the distance between classes is defined as that between the closest pair of individuals, where only pairs consisting of one individual from each class are considered. Suppose we choose sets in $\mathcal{N}$, say $N_i$ and $N_j$, to amalgamate to form the new set $N' = N_i \cup N_j$. A new distance function (and matrix) is found by recalculating with $N'$ replacing $N_i$ and $N_j$, defined as follows: For all sets $K \in \mathcal{N} \setminus \{N'\}$ define
The single linkage method is closely related to minimum spanning trees. This can be seen if we compare this technique with algorithm 1.2.8. The complete linkage clustering method is the opposite of single linkage in the sense that the distance between classes is now defined as that between the most distant pair of individuals, one from each class. In other words, for all sets $K \in \mathcal{N} \setminus \{N'\}$ define
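The two linkage rules of this subsection can be sketched together in Python. This is an illustration under assumptions: the name `linkage_clustering` and the example matrix are hypothetical, and the update rule is read off the verbal description, that is, the distance from a class K to the merged class is the minimum (single linkage) or the maximum (complete linkage) of the distances from K to the two merged classes.

```python
from itertools import combinations

def linkage_clustering(names, matrix, linkage=min):
    """Agglomerative clustering; linkage=min is single, linkage=max is complete."""
    clusters = [frozenset([x]) for x in names]
    d = {frozenset((clusters[i], clusters[j])): matrix[i][j]
         for i, j in combinations(range(len(names)), 2)}
    merges = []
    while len(clusters) > 1:
        # amalgamate the closest pair of classes
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = a | b
        merges.append((merged, d[frozenset((a, b))]))
        clusters = [c for c in clusters if c not in (a, b)]
        for k in clusters:                  # recalculate distances to the new class
            d[frozenset((k, merged))] = linkage(d[frozenset((k, a))],
                                                d[frozenset((k, b))])
        clusters.append(merged)
    return merges

# Hypothetical distances between four individuals a, b, c, d.
names = ["a", "b", "c", "d"]
matrix = [[0, 1, 3, 7],
          [1, 0, 2, 6],
          [3, 2, 0, 4],
          [7, 6, 4, 0]]
single = linkage_clustering(names, matrix, min)
complete = linkage_clustering(names, matrix, max)
```

On this example both rules merge the same classes in the same order, but at heights 1, 2, 4 for single linkage and 1, 3, 7 for complete linkage.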
7.4.2 Simple joining
One of the most popular methods is the nearest neighbor technique.
For all sets $K \in \mathcal{N} \setminus \{N'\}$ define

Theorem 7.4.5 If the distance function in the procedure 7.4.1 comes from a metric then d is a dissimilarity.
7.4.3 UPGMA and WPGMA
Another specific variant of our PGM algorithm is the unweighted pair group method with arithmetic mean (UPGMA). It is the most commonly used clustering method. Here, the last step of 7.4.1 becomes: For all sets $K \in$

[69] L.L. Cavalli-Sforza. Gene, Völker und Sprachen. Carl ... Verlag, 1990.
[70] L.L. Cavalli-Sforza and F. Cavalli-Sforza. Verschieden und doch gleich. ...
[74] G.D. Chakerian and M.A. Ghandehari. ... Fermat-Problem in Minkowski ...
... M.A. Nowak, D.C. Krakauer, and A. Dress. An error limit for the evolution of language. Proc. R. Soc. Lond., 266:2131-2136, 1999.
[324] A. Odlyzko and N.J.A. Sloane. New bounds on the number of unit spheres that can touch a unit sphere in n dimensions. J. Comb. Theory, A, 26:210-214, 1979.
[325] A. Okabe, B. Boots, and ...
... editor, Mathematical Methods for DNA-Sequencing, pages 53-92. CRC Press, 1989.
[447] M.S. Waterman. Applications of Combinatorics to Molecular Biology. In R.L. Graham, M. Grötschel, and L. Lovász, editors, Handbook of Combinatorics, pages 1983-2001. Elsevier Science B.V., 1993.
REFERENCES
[448] M.S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995.
[449] A. Weber. Ueber den Standort der Industrien. Tübingen, 1909.
[450] T.P. Weber. Darwin und die Anstifter. DuMont, 2000.
[451] D. Welsh. Approximate Counting. Lecture Note Series of the London Math. Society, 241:287-324, 1997.
[452] J.F. Weng. Steiner Polygons in the Steiner Problem. Geometriae Dedicata, 52:119-127, 1994.
[453] J.F. Weng. A new model of generalized Steiner Trees and 3-coordinate Systems. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 40:413-424, 1998.
[455] G.O. Wesolowsky. The Weber Problem: History and Perspectives. Location Science, 1:5-23, 1993.
[456] F.J. Wetuchnowski. Graphen und Netze. In S.W. Jablonski and O.B. Lupanow, editors, Diskrete Mathematik und mathematische Fragen der Kybernetik, pages 145-197. Akademie-Verlag, Berlin, 1980.
[457] K. White, M. Farber, and W. Pulleyblank. Steiner Trees, Connected Domination and Strongly Chordal Graphs. Networks, 15:109-124, 1985.
[458] J. Whitfield. Born in a watery commune. Nature, 427:674-676, 2004.
[459] P. Widmayer. Fast Approximation Algorithms for Steiner's Problem in Graphs. PhD thesis, Universität Karlsruhe, 1987. Habilitationsschrift.
[461] A.C. Wilson and R.L. Cann. Afrikanischer Ursprung des modernen Menschen. In B. Streit, editor, Evolution des Menschen, pages 86-93. Spektrum Akademischer Verlag, 1995.
[462] P. Winter. An Algorithm for the Steiner Problem in the Euclidean Plane. Networks, 15:323-345, 1985.
[463] P. Winter. Generalized Steiner Problem in Series-Parallel Networks. J. of Algorithms, 7:549-566, 1986.
[464] P. Winter. Steiner Problems in Networks: A Survey. Networks, 17:129-167, 1987.
[465] B.Y. Wu and K.-M. Chao. Spanning Trees and Optimization Problems. Chapman and Hall, 2004.
[466] X. Du, X. Hu, and X. Jia. On Shortest k-edge-connected Steiner Networks in Metric Spaces. Journal of Combinatorial Optimization, 4:99-107, 2000.
[467] G. Xue and C. Wang. The Euclidean facilities location problem. In J. Sun and D.Z. Du, editors, Advances in Optimization and Approximation, pages 313-331. Kluwer Academic Publishers, 1994.
[468] B.H. Yandell. The Honors Class - Hilbert's Problems and Their Solvers. A K Peters, Natick, Massachusetts, 2002.
[469] A.C. Yao. An $O(|E| \log \log |V|)$ algorithm for finding minimum spanning trees. Inform. Process. Lett., 4:21-23, 1975.
[470] H. Yockey. Information Theory and Molecular Biology. Cambridge University Press, 1992.
[471] M. Zachariasen. The Rectilinear Steiner Tree Problem: A Tutorial. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 467-507. Kluwer Academic Publishers, 2001.
[472] N. Zadeh. Construction of Efficient Tree Networks: The Pipeline Problem. Networks, 3:1-31, 1973.
[473] A.Z. Zelikovsky. An 11/6-Approximation Algorithm for the Steiner problem on graphs. Ann. of Discrete Mathematics, 41:351-354, 1992.
[474] B. Zelinka. Medians and peripherians of trees. Arch. Math. (Brno), 4:87-95, 1968.
[475] C. Zong. Sphere Packings. Springer, 1999.
[476] A.A. Zykov. Theory of Finite Graphs (Russian). Novosibirsk, 1969.